i.AI Assessment: Theme Mapping Analysis

This repository contains the solution for the AI Evaluation Engineer Technical Task for i.AI. The task involves generating synthetic consultation responses, extracting themes using Themefinder, creating a second set of theme mappings with controlled randomness, and comparing the two sets of theme mappings.

Project Structure (files in /data are generated when run the commands to fetch from API).

project/
├── data/                          # Data directory
│   ├── synthetic_responses.json   # Generated consultation responses
│   ├── theme_mapping_1.json       # First set of theme mappings
│   ├── theme_mapping_2.json       # Second set of theme mappings
│   ├── comparison_results.json    # Detailed comparison results
│   ├── visualizations/            # Visualizations of the comparison
│   ├── sample_responses.json      # Sample consultation responses
│   ├── sample_theme_mapping_1.json # Sample first theme mapping
│   ├── sample_theme_mapping_2.json # Sample second theme mapping
│   ├── sample_comparison_results.json # Sample comparison results
│   └── theme_descriptions.md      # Descriptions of theme meanings
├── scripts/                       # Python scripts
│   ├── data_generation.py         # Script to generate synthetic responses
│   ├── theme_extraction.py        # Script to extract themes using Themefinder
│   ├── theme_variation.py         # Script to create a second set of theme mappings
│   ├── theme_comparison.py        # Script to compare the two sets of theme mappings
│   ├── utils.py                   # Utility functions
│   └── test_utils.py              # Tests for utility functions
├── docs/                          # Documentation
│   ├── themefinder_usage.md       # Guide for using Themefinder
│   └── metrics_explanation.md     # Explanation of comparison metrics
├── summary.md                     # Summary paragraph of the comparison
├── plan.md                        # Working plan for the assessment
├── run_pipeline.sh                # Shell script to run the complete pipeline
├── run_pipeline.py                # Python script to run the complete pipeline
└── README.md                      # This file

Requirements

Python 3.9+
Required packages:
- openai
- numpy
- matplotlib
- pandas
- scikit-learn
- themefinder (if available)

Installation

Option 1: Direct Installation

Clone the repository
Install the required packages:

pip3 install openai numpy matplotlib pandas scikit-learn
pip3 install themefinder  # Optional, fallback implementation provided

Option 2: Virtual Environment

Clone the repository
Create and activate a virtual environment:

# Create a virtual environment
python3 -m venv venv

# Activate the virtual environment
source venv/bin/activate

# Upgrade pip
pip install --upgrade pip

# Install the required packages
pip install -r requirements.txt

To deactivate the virtual environment when done:

deactivate

To activate it again later:

source venv/bin/activate

Option 3: Conda Environment (Recommended)

Clone the repository
Create and activate a conda environment:

Method A: Using environment.yml (Simplest)

# Create conda environment from the environment.yml file
conda env create -f environment.yml

# Activate the environment
conda activate iai-assessment

Method B: Manual Setup

# Create a new conda environment with Python 3.9
conda create -n iai-assessment python=3.9

# Activate the environment
conda activate iai-assessment

# Install core dependencies using conda
conda install -c conda-forge numpy pandas matplotlib scikit-learn

# Install remaining dependencies using pip
pip install openai pytest black flake8 mypy

# Try to install themefinder (may not be available)
pip install themefinder

To deactivate the conda environment when done:

conda deactivate

To activate it again later:

conda activate iai-assessment

Usage

1. Generate Synthetic Data (Handling API Rate Limits)

Generate 300 synthetic consultation responses:

python3 generate_in_batches.py

This script generates responses in small batches with delays between batches to avoid rate limits (takes around 15-20 minutes to gather and combine the 300 resoponses). See BATCH_GENERATION_README.md for details.

2. Generate Themes

Extract themes from the synthetic responses using Themefinder (or fallback implementation):

python3 scripts/theme_extraction.py --input data/synthetic_responses.json --output data/theme_mapping_1.json

To force the fallback implementation even if Themefinder is available:

python3 scripts/theme_extraction.py --force-fallback

3. Create a Second Theme Mapping

Create a second set of theme mappings with controlled randomness:

python3 scripts/theme_variation.py --input data/theme_mapping_1.json --output data/theme_mapping_2.json --variation 0.3

The --variation parameter controls the degree of randomness (0.0 to 1.0).

4. Compare Theme Mappings

Compare the two sets of theme mappings and generate a summary:

python3 scripts/theme_comparison.py --mapping1 data/theme_mapping_1.json --mapping2 data/theme_mapping_2.json --output data/comparison_results.json --summary summary.md

Running the Complete Pipeline

You can run the complete pipeline using either the shell script or Python script:

Using the Shell Script (skips the data generation from Task 1 and runs the rest of the script)

./run_pipeline.sh --skip-generation

Using the Python Script

python3 run_pipeline.py --count 300 --variation 0.3

To skip the data generation step (if you've already generated the data):

python3 run_pipeline.py --skip-generation

Both scripts will:

Generate synthetic consultation responses (using sample data by default to avoid OpenAI API issues)
Extract themes using Themefinder (or fallback)
Create a second theme mapping with controlled randomness
Compare the two theme mappings and generate a summary

Note: The pipeline is configured to use sample data by default to avoid OpenAI API rate limits and dependency issues. If you want to generate new data, you'll need to modify the scripts to remove the --use-sample flag and ensure the openai package is installed.

Documentation

Additional documentation is available in the docs directory:

docs/themefinder_usage.md: Guide for using the Themefinder API
docs/metrics_explanation.md: Explanation of the metrics used in theme comparison

Sample Data

Sample data files are provided in the data directory:

data/sample_responses.json: Sample consultation responses
data/sample_theme_mapping_1.json: Sample first theme mapping
data/sample_theme_mapping_2.json: Sample second theme mapping
data/sample_comparison_results.json: Sample comparison results
data/theme_descriptions.md: Descriptions of what each theme represents

AI Assistance

This project was developed with AI assistance. The following AI tools were used:

Azure OpenAI API (GPT-4o) for generating synthetic consultation responses.
AI assistance for code development and documentation, ChatGPT 4.0 (custom prompting GPT), Perplexity MCP, Claude 3.7, Cline for Agentic AI capabilities inside VS code.

Author

Thomas James Butler Date: 31/03/2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

i.AI Assessment: Theme Mapping Analysis

Project Structure (files in /data are generated when run the commands to fetch from API).

Requirements

Installation

Option 1: Direct Installation

Option 2: Virtual Environment

Option 3: Conda Environment (Recommended)

Method A: Using environment.yml (Simplest)

Method B: Manual Setup

Usage

1. Generate Synthetic Data (Handling API Rate Limits)

2. Generate Themes

3. Create a Second Theme Mapping

4. Compare Theme Mappings

Running the Complete Pipeline

Using the Shell Script (skips the data generation from Task 1 and runs the rest of the script)

Using the Python Script

To skip the data generation step (if you've already generated the data):

Documentation

Sample Data

AI Assistance

Author

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 60 Commits
.vscode		.vscode
Prompts		Prompts
data		data
docs		docs
scripts		scripts
.DS_Store		.DS_Store
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
BATCH_GENERATION_README.md		BATCH_GENERATION_README.md
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml
generate_in_batches.py		generate_in_batches.py
plan.md		plan.md
run_pipeline.py		run_pipeline.py
run_pipeline.sh		run_pipeline.sh
summary.md		summary.md

License

ThomasJButler/Python-Coding-Assessment

Folders and files

Latest commit

History

Repository files navigation

i.AI Assessment: Theme Mapping Analysis

Project Structure (files in /data are generated when run the commands to fetch from API).

Requirements

Installation

Option 1: Direct Installation

Option 2: Virtual Environment

Option 3: Conda Environment (Recommended)

Method A: Using environment.yml (Simplest)

Method B: Manual Setup

Usage

1. Generate Synthetic Data (Handling API Rate Limits)

2. Generate Themes

3. Create a Second Theme Mapping

4. Compare Theme Mappings

Running the Complete Pipeline

Using the Shell Script (skips the data generation from Task 1 and runs the rest of the script)

Using the Python Script

To skip the data generation step (if you've already generated the data):

Documentation

Sample Data

AI Assistance

Author

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages