# Notebook Description

- **Purpose**: Like the generic notebook, this notebook augments the math problem dataset from [this Kaggle competition](https://www.kaggle.com/competitions/classification-of-math-problems-by-kasut-academy/overview) to address class imbalance, aiming for a target number of samples per class (e.g., 3000). It uses only the first 9189 rows of the original training data as the basis for augmentation so that the remaining samples can be held out for validation.
- **Methodology**: It uses AWS Bedrock with Claude 3.5 Sonnet for generation.
- **Generation Strategy**: The core strategy involves providing the model with an example problem from the target topic and prompting it to generate a batch of new, similar problems. The prompt specifically instructs the model to match the mathematical complexity and style, use varied contexts/phrasing, and importantly, to include LaTeX or math symbols if they were present in the original example problem. Generated problems are requested in a delimited list format.
- **Key Difference from the Generic Notebook**: This code adds a specific instruction about LaTeX/math symbols so that the generated synthetic data more closely mirrors the style and formatting of the original math problems. Since mathematical notation is crucial for problem understanding and structure, asking the model to replicate it aims to create higher-quality, more realistic augmented data. In addition, the previous notebook exposes all of the original training data in the prompts for some classes, whereas this version uses only the first 9189 samples so that the remaining 1000 samples are untouched.
- **Implementation**: Leverages parallel processing via ThreadPoolExecutor to speed up the generation for different topics and batches. Includes error handling for Bedrock API calls, such as throttling, with retry logic.
- **Output**: The augmented data (generated questions and labels) is compiled into a DataFrame and saved as a CSV file (named `augmented_train_second_attempt.csv` in the notebook's example run).
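
The retry behaviour described under Implementation can be illustrated in isolation. The following is a minimal sketch, not the notebook's actual function: `call_with_backoff` and `flaky_call` are hypothetical names, and `RuntimeError` stands in for the throttling error that the real code detects on a botocore `ClientError`. The `sleep` parameter is injected only so the example runs instantly.

```python
import time


def call_with_backoff(call, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `call()` up to `retries` times, waiting base_delay * 2**attempt
    after each failed attempt. Returns None if every attempt fails."""
    for attempt in range(retries):
        try:
            return call()
        except RuntimeError:  # hypothetical stand-in for a throttling error
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    return None


# Simulated API call (assumption): fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated throttling")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_backoff(flaky_call, retries=3, sleep=delays.append)
print(result, delays)
```

Doubling the delay after each failure gives a throttled service progressively more time to recover, at the cost of a longer worst-case wait per batch.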

# Import Libraries

In [None]:
import re
import os
import boto3
import json
import time
import pandas as pd
from botocore.exceptions import ClientError
from concurrent.futures import ThreadPoolExecutor

# Setup AWS Credentials

In [None]:
# Set if needed.
if False:
    os.environ['AWS_ACCESS_KEY_ID'] = ''
    os.environ['AWS_SECRET_ACCESS_KEY'] = ''
    os.environ['AWS_SESSION_TOKEN'] = ''
    os.environ['AWS_DEFAULT_REGION'] = ''

    print(boto3.client('sts').get_caller_identity())

# Initialize Bedrock and Load Data

In [None]:
# Initialize Bedrock client
try:
    bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')
    print("Bedrock client initialized successfully.")
except Exception as e:
    print(f"Failed to initialize Bedrock client: {e}")
    print("Ensure AWS credentials (including session token) are set in ~/.aws/credentials or environment variables.")
    bedrock = None

# Define topic mapping
TOPICS = {
    0: "Algebra",
    1: "Geometry and Trigonometry",
    2: "Calculus and Analysis",
    3: "Probability and Statistics",
    4: "Number Theory",
    5: "Combinatorics and Discrete Math",
    6: "Linear Algebra",
    7: "Abstract Algebra and Topology"
}

# Load dataset
train_df = pd.read_csv('train.csv')
label_counts = train_df['label'].value_counts()
target_samples = 3000

Bedrock client initialized successfully.


# Create and Test Function to Generate Questions

Ask the model to generate questions similar to the original training data.
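
Because the model is asked to return problems separated by a line containing only three dashes, the parsing step can be tried on its own before any API call is made. This is a small sketch under that assumption; `parse_delimited_problems` and `sample_response` are hypothetical names used only for illustration.

```python
import re


def parse_delimited_problems(text, expected):
    """Split a response on the newline/three-dashes/newline delimiter.

    Returns the first `expected` non-empty problems, or None when the
    response contains fewer problems than requested.
    """
    parts = [p.strip() for p in re.split(r"\n---\n", text.strip())]
    parts = [p for p in parts if p]  # drop empty fragments
    if len(parts) < expected:
        return None
    return parts[:expected]


# Hand-written stand-in for a model response (not real model output).
sample_response = (
    "Find $x$ if $2x + 3 = 11$."
    "\n---\n"
    "A fair coin is flipped 4 times. What is the probability of exactly 2 heads?"
    "\n---\n"
    "How many positive divisors does 36 have?"
)

problems = parse_delimited_problems(sample_response, expected=3)
for i, problem in enumerate(problems, 1):
    print(f"{i}. {problem}")
```

Returning `None` for a short response lets the caller decide whether to retry, which is how the generation function in the next cell treats an incomplete batch.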

In [None]:
# Function to generate multiple questions in one API call
def generate_questions_batch(original_question, topic, num_questions=5, retries=3):
    if not bedrock:
        return [f"Mock question {i+1} for {topic}: Similar to '{original_question}'" for i in range(num_questions)]
    
    prompt = f"""
You are a math expert tasked with generating {num_questions} new math problems for a specific topic, inspired by an example problem but with distinct structure and wording.

### Topic:
{topic}

### Example Problem:
{original_question}

### Instructions:
1. Generate exactly {num_questions} new math problems in the topic of {topic}.
2. If the example problem contains latex or math symbols, use them in the generated problems as well.
3. Each problem must have a different context (e.g., use dice, cards, experiments, or surveys instead of balls/urns if the example uses those).
4. Use varied phrasing and question styles (e.g., ask for probability as a decimal, percentage, simplified fraction, conditional probability, or expected value).
5. Match the mathematical complexity and style of the example problem.
6. Avoid repetitive phrases like "expressed as a fraction in lowest terms" and diversify wording.
7. Return the problems as a list of strings, with each problem separated by exactly "\n---\n" (newline, three dashes, newline).
8. Each problem should be a single string. **Do not include explanations, solutions, or numbering**.
9. Do not include any additional text or formatting in the beginning or ending of your response.
"""
    for attempt in range(retries):
        try:
            response = bedrock.invoke_model(
                modelId='anthropic.claude-3-5-sonnet-20240620-v1:0',
                body=json.dumps({
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 1000,
                    "messages": [
                        {
                            "role": "user",
                            "content": prompt
                        }
                    ]
                })
            )
            result = json.loads(response['body'].read().decode('utf-8'))
            text = result['content'][0]['text'].strip()
            # Parse delimited list
            questions = re.split(r'\n---\n', text)
            questions = [q.strip() for q in questions if q.strip()]
            if len(questions) < num_questions:
                print(f"Warning: Got {len(questions)} questions instead of {num_questions}. Retrying...")
                continue
            return questions[:num_questions]
        except ClientError as e:
            if "Throttling" in str(e):
                print(f"Throttling detected. Retrying {attempt+1}/{retries} after delay...")
                time.sleep(2 ** attempt)  # Exponential backoff
            elif "AccessDenied" in str(e):
                print("Credentials lack Bedrock permissions. Contact your instructor.")
                return None
            elif "InvalidClientTokenId" in str(e) or "SignatureDoesNotMatch" in str(e):
                print("Invalid or expired credentials. Refresh from your portal.")
                return None
            elif "ValidationException" in str(e):  # Bedrock reports invalid requests as ValidationException
                print("Request rejected as invalid (often a wrong model ID). Verify with 'aws bedrock list-foundation-models'.")
                return None
            else:
                print(f"Bedrock error: {e}")
                return None
        except Exception as e:
            print(f"Unexpected error: {e}")
            return None
    print(f"Failed to generate questions after {retries} attempts.")
    return None

# Parallel generation function
def generate_questions_parallel(original_questions, topic, target_count, batch_size=5, max_workers=8):
    questions_needed = target_count
    generated_questions = []
    batches = (questions_needed + batch_size - 1) // batch_size  # Ceiling division
    
    print(f"Generating {questions_needed} questions in {batches} batches...")
    start_time = time.time()
    
    def process_batch(index):
        # Cycle through original questions
        original = original_questions[index % len(original_questions)]
        result = generate_questions_batch(original, topic, batch_size)
        return result if result else []

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Generate batches in parallel
        results = list(executor.map(process_batch, range(batches)))
    
    # Flatten results
    for batch in results:
        generated_questions.extend(batch)
    
    elapsed = time.time() - start_time
    print(f"Generated {len(generated_questions)} questions in {elapsed:.2f} seconds.")
    return generated_questions[:questions_needed]

# Test the function
if __name__ == "__main__":
    try:
        train_df = pd.read_csv('train.csv')
        # Filter for Probability and Statistics (label=3)
        topic = TOPICS[3]
        sample_questions = train_df[train_df['label'] == 3]['Question'].head(5).tolist()
        if not sample_questions:
            print("Error: No questions found for Probability and Statistics.")
            exit(1)
        
        print(f"Testing generation for {topic}...")
        # Generate 400 questions to test speed
        target_count = 400
        generated = generate_questions_parallel(sample_questions, topic, target_count)
        
        if generated:
            print("Sample generated questions:")
            for i, q in enumerate(generated[:5], 1):
                print(f"{i}. {q}")
        else:
            print("Failed to generate questions.")
    except FileNotFoundError:
        print("Error: train.csv not found.")
    except KeyError:
        print("Error: Required columns ('Question', 'label') not found. Available columns:", train_df.columns.tolist())

Bedrock client initialized successfully.
Testing generation for Probability and Statistics...
Generating 400 questions in 80 batches...
Generated 400 questions in 123.72 seconds.
Sample generated questions:
1. A deck of 52 cards contains 13 cards of each suit (hearts, diamonds, clubs, spades). Cards are drawn without replacement until either all four aces or all four kings are drawn. Let $p/q$ be the probability of drawing all four aces before all four kings, where $p$ and $q$ are coprime integers. Determine $p+q$.
2. In a laboratory, genetic mutations occur independently in bacteria with probability 0.01 per generation. If a colony starts with 100 bacteria, what is the probability that exactly 3 bacteria will have mutated after one generation? Give your answer as a percentage rounded to two decimal places.
3. A fair six-sided die is rolled repeatedly until either a 6 appears or the sum of the rolls exceeds 10. Let $E$ be the expected number of rolls. Find $\lfloor 100E \rfloor$ (the f

In [5]:
generated[:20]

['A deck of 52 cards contains 13 cards of each suit (hearts, diamonds, clubs, spades). Cards are drawn without replacement until either all four aces or all four kings are drawn. Let $p/q$ be the probability of drawing all four aces before all four kings, where $p$ and $q$ are coprime integers. Determine $p+q$.',
 'In a laboratory, genetic mutations occur independently in bacteria with probability 0.01 per generation. If a colony starts with 100 bacteria, what is the probability that exactly 3 bacteria will have mutated after one generation? Give your answer as a percentage rounded to two decimal places.',
 'A fair six-sided die is rolled repeatedly until either a 6 appears or the sum of the rolls exceeds 10. Let $E$ be the expected number of rolls. Find $\\lfloor 100E \\rfloor$ (the floor of 100E).',
 'A bag contains 5 red marbles, 3 blue marbles, and 2 green marbles. Marbles are drawn one at a time without replacement until all marbles of one color have been drawn. What is the probab

# Run Full Generation
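
For each class below the target, the full run needs `target_samples - count` new questions, requested in batches of `batch_size` with the batch count rounded up. The arithmetic can be sketched as follows, using the per-class counts reported in this notebook's run log; `generation_plan` is a hypothetical helper, not part of the notebook.

```python
import math

TARGET_SAMPLES = 3000
BATCH_SIZE = 5

# Per-class counts as printed in this notebook's run log (first 9189 rows).
class_counts = {
    "Algebra": 2361,
    "Geometry and Trigonometry": 2205,
    "Combinatorics and Discrete Math": 1654,
    "Number Theory": 1535,
    "Calculus and Analysis": 936,
    "Probability and Statistics": 334,
}


def generation_plan(count, target=TARGET_SAMPLES, batch_size=BATCH_SIZE):
    """Return (samples_needed, batches) for a class with `count` samples."""
    needed = max(target - count, 0)           # no generation if already at target
    batches = math.ceil(needed / batch_size)  # ceiling division
    return needed, batches


for topic, count in class_counts.items():
    needed, batches = generation_plan(count)
    print(f"{topic}: need {needed} questions in {batches} batches")
```

Since the last batch may overshoot (for example, 128 batches of 5 yield 640 questions when 639 are needed), the parallel generation function truncates its result to the requested count before returning.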

In [7]:
# Full augmentation function
def augment_data(df, target_samples=3000, batch_size=5, max_workers=8):
    label_counts = df['label'].value_counts()
    augmented_data = []
    total_start_time = time.time()
    
    for label, count in label_counts.items():
        print(f"Processing label {label} ({TOPICS[label]}): {count} samples in data.")
        if count < target_samples:
            topic = TOPICS[label]
            samples_needed = target_samples - count
            original_questions = df[df['label'] == label]['Question'].tolist()
            if not original_questions:
                print(f"Error: No questions found for {topic}. Skipping...")
                continue
            new_questions = generate_questions_parallel(original_questions, topic, samples_needed, batch_size, max_workers)
            for q in new_questions:
                augmented_data.append({'Question': q, 'label': label})
    
    augmented_df = pd.DataFrame(augmented_data)
    total_elapsed = time.time() - total_start_time
    print(f"Total augmentation completed in {total_elapsed:.2f} seconds. Generated {len(augmented_df)} new questions.")
    return augmented_df

# Run full augmentation
if __name__ == "__main__":
    try:
        train_df = pd.read_csv('train.csv')
        train_df = train_df[:9189].copy()
        print("Loaded train.csv. Starting augmentation...")
        augmented_df = augment_data(train_df)
        # Save to CSV
        output_file = 'augmented_train_second_attempt.csv'
        augmented_df.to_csv(output_file, index=False)
        print(f"Saved augmented data to {output_file}.")
    except FileNotFoundError:
        print("Error: train.csv not found.")
    except KeyError:
        print("Error: Required columns ('Question', 'label') not found. Available columns:", train_df.columns.tolist())

Loaded train.csv. Starting augmentation...
Processing label 0 (Algebra): 2361 samples in data.
Generating 639 questions in 128 batches...
Generated 640 questions in 173.89 seconds.
Processing label 1 (Geometry and Trigonometry): 2205 samples in data.
Generating 795 questions in 159 batches...
Generated 795 questions in 234.13 seconds.
Processing label 5 (Combinatorics and Discrete Math): 1654 samples in data.
Generating 1346 questions in 270 batches...
Generated 1350 questions in 378.01 seconds.
Processing label 4 (Number Theory): 1535 samples in data.
Generating 1465 questions in 293 batches...
Generated 1465 questions in 339.72 seconds.
Processing label 2 (Calculus and Analysis): 936 samples in data.
Generating 2064 questions in 413 batches...
Generated 2065 questions in 513.66 seconds.
Processing label 3 (Probability and Statistics): 334 samples in data.
Generating 2666 questions in 534 batches...
Generated 2670 questions in 764.33 seconds.
Processing label 6 (Linear Algebra): 88 sa