# Notebook Description

- **Purpose**: This notebook augments the math problem dataset from [this Kaggle competition](https://www.kaggle.com/competitions/classification-of-math-problems-by-kasut-academy/overview) with synthetic problems, so that classes with fewer samples move toward a balanced distribution (target: 3000 samples per class).
- **Methodology**: It uses AWS Bedrock for generation, specifically the Claude 3.5 Sonnet model (model ID `anthropic.claude-3-5-sonnet-20240620-v1:0`).
- **Generation Strategy**: For each underrepresented topic, the notebook takes example problems from the original dataset within that topic. For each example, it prompts the Bedrock model to generate a batch of new math problems that match the topic and complexity but use varied contexts and phrasing. The prompt instructs the model to separate the problems with a fixed delimiter (`\n---\n`) so that the response is easy to parse.
- **Implementation**: The generation process is parallelized using Python's ThreadPoolExecutor for efficiency. It includes error handling for potential Bedrock API issues like throttling, incorporating retries with exponential backoff.
- **Output**: The newly generated questions and their corresponding labels are combined into a pandas DataFrame and saved as a CSV file.
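
The retry behaviour mentioned under Implementation can be sketched in isolation as follows. This is a minimal illustration rather than the notebook's own code: `call_with_backoff`, `flaky_call`, and the `base_delay` and `sleep` parameters are hypothetical names, and a plain `RuntimeError` stands in for a Bedrock throttling error.

```python
import time


def call_with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on RuntimeError wait base_delay * 2**attempt, then retry.

    Returns fn()'s result, or None once all attempts are exhausted.
    """
    for attempt in range(retries):
        try:
            return fn()
        except RuntimeError:
            # No wait after the final attempt: nothing is left to retry.
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    return None


# Hypothetical stand-in for a throttled API: fails twice, then succeeds.
calls = {"n": 0}


def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_backoff(flaky_call, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter is only a convenience for trying the logic out without waiting; with many parallel workers it may also be worth adding random jitter to the delay so that retries do not all fire at the same moment.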

# Import Libraries

In [None]:
import re
import os
import boto3
import json
import time
import pandas as pd
from botocore.exceptions import ClientError
from concurrent.futures import ThreadPoolExecutor

In [None]:
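# Optional: change `False` to `True` and fill in the values below
# if AWS credentials are not already configured in the environment.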
if False:
    os.environ['AWS_ACCESS_KEY_ID'] = ''
    os.environ['AWS_SECRET_ACCESS_KEY'] = ''
    os.environ['AWS_SESSION_TOKEN'] = ''
    os.environ['AWS_DEFAULT_REGION'] = ''

    print(boto3.client('sts').get_caller_identity())

# Initialize Bedrock and Load Data

In [None]:
try:
    bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')
    print("Bedrock client initialized successfully.")
except Exception as e:
    print(f"Failed to initialize Bedrock client: {e}")
    print("Ensure AWS credentials (including session token) are set in ~/.aws/credentials or environment variables.")
    bedrock = None


TOPICS = {
    0: "Algebra",
    1: "Geometry and Trigonometry",
    2: "Calculus and Analysis",
    3: "Probability and Statistics",
    4: "Number Theory",
    5: "Combinatorics and Discrete Math",
    6: "Linear Algebra",
    7: "Abstract Algebra and Topology"
}

# Load dataset
train_data_path = "https://raw.githubusercontent.com/PrudhvirajuChekuri/Final-Project-Group8/refs/heads/master/code/data/train.csv"
train_df = pd.read_csv(train_data_path)
label_counts = train_df['label'].value_counts()
target_samples = 3000

Bedrock client initialized successfully.
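
As a rough illustration of what `target_samples` implies in practice, the sketch below computes how many questions each underrepresented class still needs and how many API calls that takes at 5 questions per call. The class sizes are the ones printed in this notebook's run log for labels 0, 1, 5, and 4; `counts`, `needed`, and `batches` are illustrative names and not part of the notebook's code.

```python
target_samples = 3000
batch_size = 5

# Class sizes as reported in the run log (label -> number of samples).
counts = {0: 2618, 1: 2439, 5: 1827, 4: 1712}

# Questions still needed per class; classes at or above the target need none.
needed = {label: max(target_samples - n, 0) for label, n in counts.items()}

# Ceiling division: number of calls when each call yields batch_size questions.
batches = {label: (n + batch_size - 1) // batch_size for label, n in needed.items()}

for label in counts:
    print(f"label {label}: have {counts[label]}, need {needed[label]}, batches {batches[label]}")
```

Because the batch count is rounded up, a class can receive a few more questions than it needs, which is why the generated list is truncated to the requested count afterwards.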


# Create and Test Function to Generate Questions

In [None]:
# Function to generate multiple questions in one API call
def generate_questions_batch(original_question, topic, num_questions=5, retries=3):
    if not bedrock:
        return [f"Mock question {i+1} for {topic}: Similar to '{original_question}'" for i in range(num_questions)]
    
    prompt = f"""
You are a math expert tasked with generating {num_questions} new math problems for a specific topic, inspired by an example problem but with distinct structure and wording.

### Topic:
{topic}

### Example Problem:
{original_question}

### Instructions:
1. Generate exactly {num_questions} new math problems in the topic of {topic}.
2. Each problem must have a different context (e.g., use dice, cards, experiments, or surveys instead of balls/urns if the example uses those).
3. Use varied phrasing and question styles (e.g., ask for probability as a decimal, percentage, simplified fraction, conditional probability, or expected value).
4. Match the mathematical complexity and key concepts of the example problem.
5. Ensure all problems are valid, solvable, and suitable for a Kaggle competition.
6. Avoid repetitive phrases like "expressed as a fraction in lowest terms" and diversify wording.
7. Return the problems as a list of strings, with each problem separated by exactly "\\n---\\n" (newline, three dashes, newline).
8. Each problem should be a single string. Do not include explanations, solutions, or numbering.
"""
    for attempt in range(retries):
        try:
            response = bedrock.invoke_model(
                modelId='anthropic.claude-3-5-sonnet-20240620-v1:0',
                body=json.dumps({
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 1000,  # Increased for multiple questions
                    "messages": [
                        {
                            "role": "user",
                            "content": prompt
                        }
                    ]
                })
            )
            result = json.loads(response['body'].read().decode('utf-8'))
            text = result['content'][0]['text'].strip()
            # Parse delimited list
            questions = re.split(r'\n---\n', text)
            questions = [q.strip() for q in questions if q.strip()]
            if len(questions) < num_questions:
                print(f"Warning: Got {len(questions)} questions instead of {num_questions}. Retrying...")
                continue
            return questions[:num_questions]
        except ClientError as e:
            if "Throttling" in str(e):
                print(f"Throttling detected. Retrying {attempt+1}/{retries} after delay...")
                time.sleep(2 ** attempt)  # Exponential backoff
            elif "AccessDenied" in str(e):
                print("Credentials lack Bedrock permissions. Contact your instructor.")
                return None
            elif "InvalidClientTokenId" in str(e) or "SignatureDoesNotMatch" in str(e):
                print("Invalid or expired credentials. Refresh from your portal.")
                return None
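            elif "ValidationException" in str(e):
                # Bedrock reports request validation failures as ValidationException.
                print("Request failed validation (often an invalid model ID). Verify with 'aws bedrock list-foundation-models'.")
                return None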
            else:
                print(f"Bedrock error: {e}")
                return None
        except Exception as e:
            print(f"Unexpected error: {e}")
            return None
    print(f"Failed to generate questions after {retries} attempts.")
    return None

# Parallel generation function
def generate_questions_parallel(original_questions, topic, target_count, batch_size=5, max_workers=8):
    questions_needed = target_count
    generated_questions = []
    batches = (questions_needed + batch_size - 1) // batch_size  # Ceiling division
    
    print(f"Generating {questions_needed} questions in {batches} batches...")
    start_time = time.time()
    
    def process_batch(index):
        # Cycle through original questions
        original = original_questions[index % len(original_questions)]
        result = generate_questions_batch(original, topic, batch_size)
        return result if result else []

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Generate batches in parallel
        results = list(executor.map(process_batch, range(batches)))
    
    # Flatten results
    for batch in results:
        generated_questions.extend(batch)
    
    elapsed = time.time() - start_time
    print(f"Generated {len(generated_questions)} questions in {elapsed:.2f} seconds.")
    return generated_questions[:questions_needed]

# Test the function
if __name__ == "__main__":
    try:
        train_df = pd.read_csv('train.')
        # Filter for Probability and Statistics (label=3)
        topic = TOPICS[3]
        sample_questions = train_df[train_df['label'] == 3]['Question'].head(5).tolist()
        if not sample_questions:
            print("Error: No questions found for Probability and Statistics.")
            exit(1)
        
        print(f"Testing generation for {topic}...")
        target_count = 400
        generated = generate_questions_parallel(sample_questions, topic, target_count)
        
        if generated:
            print("Sample generated questions:")
            for i, q in enumerate(generated[:5], 1):
                print(f"{i}. {q}")
        else:
            print("Failed to generate questions.")
    except FileNotFoundError:
        print("Error: train.csv not found.")
    except KeyError:
        print("Error: Required columns ('Question', 'label') not found. Available columns:", train_df.columns.tolist())

Bedrock client initialized successfully.
Testing generation for Probability and Statistics...
Generating 400 questions in 80 batches...
Generated 400 questions in 86.91 seconds.
Sample generated questions:
1. Here are 5 new math problems on Probability and Statistics based on your requirements:

A card game involves drawing from a standard 52-card deck without replacement. Players continue drawing until they have three cards of the same suit or five cards total. What is the probability, as a percentage rounded to two decimal places, that a player will draw exactly five cards?
2. In a psychology experiment, subjects are shown a series of 10 images and must remember them in order. The probability of remembering each image correctly is 0.8, independent of other images. Calculate the expected number of images remembered correctly in their proper sequence from the beginning before the first mistake.
3. A factory produces widgets with a 5% defect rate. Quality control randomly selects 20 wid

In [None]:
generated[:10]

['Here are 5 new math problems on Probability and Statistics based on your requirements:\n\nA card game involves drawing from a standard 52-card deck without replacement. Players continue drawing until they have three cards of the same suit or five cards total. What is the probability, as a percentage rounded to two decimal places, that a player will draw exactly five cards?',
 'In a psychology experiment, subjects are shown a series of 10 images and must remember them in order. The probability of remembering each image correctly is 0.8, independent of other images. Calculate the expected number of images remembered correctly in their proper sequence from the beginning before the first mistake.',
 'A factory produces widgets with a 5% defect rate. Quality control randomly selects 20 widgets for inspection. What is the probability that they find at least one defective widget? Express your answer as a decimal rounded to four places.',
 'Six fair dice are rolled simultaneously. Let X be t
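
The first generated item above still carries the model's introductory sentence ("Here are 5 new math problems..."), because splitting on the delimiter alone leaves that text attached to the first problem. One possible way to drop it is sketched below. This is a heuristic under stated assumptions, not the notebook's method: `split_problems` is a hypothetical helper, and it assumes an introduction is a first paragraph that ends with a colon and is followed by a blank line.

```python
import re


def split_problems(text):
    """Split a model reply on '---' delimiter lines and drop a leading intro."""
    parts = [p.strip() for p in re.split(r'\n\s*---\s*\n', text) if p.strip()]
    if parts:
        # Assumption: an intro is a first paragraph ending with ':' that is
        # separated from the actual problem by a blank line.
        head, sep, rest = parts[0].partition("\n\n")
        if sep and head.rstrip().endswith(":") and rest.strip():
            parts[0] = rest.strip()
    return parts


# Made-up reply in the same shape as the output shown above.
reply = (
    "Here are 2 new math problems on Probability and Statistics:\n\n"
    "A fair die is rolled twice. What is the probability that the sum is 7?\n"
    "---\n"
    "A coin is flipped 3 times. Find the expected number of heads."
)
problems = split_problems(reply)
for p in problems:
    print(repr(p))
```

The heuristic would also strip a genuine problem whose first paragraph happens to end with a colon, so it is worth spot-checking a sample of the cleaned output before relying on it.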

# Run Full Generation

In [None]:
# Full augmentation function
def augment_data(df, target_samples=3000, batch_size=5, max_workers=8):
    label_counts = df['label'].value_counts()
    augmented_data = []
    total_start_time = time.time()
    
    for label, count in label_counts.items():
        print(f"Processing label {label} ({TOPICS[label]}): {count} samples in data.")
        if count < target_samples:
            topic = TOPICS[label]
            samples_needed = target_samples - count
            original_questions = df[df['label'] == label]['Question'].tolist()
            if not original_questions:
                print(f"Error: No questions found for {topic}. Skipping...")
                continue
            new_questions = generate_questions_parallel(original_questions, topic, samples_needed, batch_size, max_workers)
            for q in new_questions:
                augmented_data.append({'Question': q, 'label': label})
    
    augmented_df = pd.DataFrame(augmented_data)
    total_elapsed = time.time() - total_start_time
    print(f"Total augmentation completed in {total_elapsed:.2f} seconds. Generated {len(augmented_df)} new questions.")
    return augmented_df

# Run full augmentation
if __name__ == "__main__":
    try:
        train_df = pd.read_csv(train_data_path)
        print("Loaded train.csv. Starting augmentation...")
        augmented_df = augment_data(train_df)
        output_file = 'augmented_train_second_attempt.csv'
        augmented_df.to_csv(output_file, index=False)
        print(f"Saved augmented data to {output_file}.")
    except FileNotFoundError:
        print("Error: train.csv not found.")
    except KeyError:
        print("Error: Required columns ('Question', 'label') not found. Available columns:", train_df.columns.tolist())

Loaded train.csv. Starting augmentation...
Processing label 0 (Algebra): 2618 samples in data.
Generating 382 questions in 77 batches...
Generated 385 questions in 97.54 seconds.
Processing label 1 (Geometry and Trigonometry): 2439 samples in data.
Generating 561 questions in 113 batches...
Generated 565 questions in 123.05 seconds.
Processing label 5 (Combinatorics and Discrete Math): 1827 samples in data.
Generating 1173 questions in 235 batches...
Generated 1175 questions in 250.67 seconds.
Processing label 4 (Number Theory): 1712 samples in data.
Generating 1288 questions in 258 batches...
Throttling detected. Retrying 1/3 after delay...
Throttling detected. Retrying 1/3 after delay...
Throttling detected. Retrying 1/3 after delay...
Throttling detected. Retrying 1/3 after delay...
Throttling detected. Retrying 2/3 after delay...
Throttling detected. Retrying 1/3 after delay...
Throttling detected. Retrying 1/3 after delay...
Throttling detected. Retrying 2/3 after delay...
Throttl