# Arithmetic Word Problem Compendium - Dataset Exploration

This notebook demonstrates how to work with the Arithmetic Word Problem Compendium dataset, exploring its structure and analyzing the problems it contains.

## Setup

First, let's install the required dependencies:

In [None]:
!pip install pandas numpy matplotlib seaborn

## Import Dependencies

In [None]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

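# Set style for better visualizations.
# Newer matplotlib releases no longer accept the bare 'seaborn' style name,
# so apply the seaborn theme through seaborn itself.
sns.set_theme()
sns.set_palette("husl")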

## Load and Prepare Data

Let's load both the training and evaluation datasets:

In [None]:
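def load_jsonl(file_path):
    """Load a JSONL file into a list of dictionaries, skipping blank lines."""
    with open(file_path, 'r', encoding='utf-8') as f:
        # A trailing empty line would otherwise make json.loads raise an error
        return [json.loads(line) for line in f if line.strip()]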

# Load both training and evaluation datasets
train_data = load_jsonl('/kaggle/input/arithmetic-word-problem-compendium/sample_train.jsonl')
eval_data = load_jsonl('/kaggle/input/arithmetic-word-problem-compendium/sample_eval.jsonl')

# Convert to pandas DataFrames
train_df = pd.DataFrame(train_data)
eval_df = pd.DataFrame(eval_data)

print(f"Training set size: {len(train_df)}")
print(f"Evaluation set size: {len(eval_df)}")

## Dataset Overview

Let's examine the structure and contents of our dataset:

In [None]:
# Display basic information about the training dataset
print("Training Dataset Info:")
train_df.info()

# Display first few examples
print("\nFirst few examples:")
pd.set_option('display.max_colwidth', None)
display(train_df.head(2))
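
Because the fields of interest sit inside a nested `metadata` dictionary, it can be convenient to flatten each record into ordinary columns before analysis. The following is a minimal sketch, assuming records shaped like the ones accessed elsewhere in this notebook (a `question` string plus a `metadata` dict with `domain`, `operators`, `decimals`, and `solution`). The `sample_records` values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical records that mimic the nested layout used in this notebook;
# the questions and values are made up for illustration.
sample_records = [
    {"question": "A shop sells 3 pens at 1.50 each. What is the total?",
     "metadata": {"domain": "retail", "operators": ["multiply"],
                  "decimals": 2, "solution": 4.5}},
    {"question": "A tank holds 10 litres, loses 4, then gains 1. How much remains?",
     "metadata": {"domain": "science", "operators": ["subtract", "add"],
                  "decimals": 0, "solution": 7}},
]

# json_normalize expands nested dicts into dotted column names,
# e.g. 'metadata.domain'; list values such as 'operators' stay as lists.
flat_df = pd.json_normalize(sample_records)

print(sorted(flat_df.columns))
print(flat_df["metadata.domain"].value_counts())
```

With the real data, `pd.json_normalize(train_data)` should give the same kind of flat table, so that `flat_df["metadata.domain"]` can stand in for the repeated `apply(lambda x: x['domain'])` calls.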

## Analyzing Problem Domains

Let's visualize the distribution of problems across different domains:

In [None]:
plt.figure(figsize=(12, 6))
domain_counts = train_df['metadata'].apply(lambda x: x['domain']).value_counts()
sns.barplot(x=domain_counts.values, y=domain_counts.index)
plt.title('Distribution of Problem Domains')
plt.xlabel('Number of Problems')
plt.tight_layout()
plt.show()

# Print exact counts
print("\nDomain Distribution:")
for domain, count in domain_counts.items():
    print(f"{domain}: {count} problems")

## Analyzing Mathematical Operations

Let's examine the types of mathematical operations used in the problems:

In [None]:
def get_operators(metadata):
    return metadata['operators']

# Collect all operators
all_operators = [op for meta in train_df['metadata'] for op in get_operators(meta)]
operator_counts = Counter(all_operators)

plt.figure(figsize=(10, 6))
sns.barplot(x=list(operator_counts.values()), y=list(operator_counts.keys()))
plt.title('Distribution of Mathematical Operations')
plt.xlabel('Number of Occurrences')
plt.tight_layout()
plt.show()

# Print exact counts
print("\nOperation Distribution:")
for op, count in operator_counts.most_common():
    print(f"{op}: {count} occurrences")
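
The same tally can be produced with pandas alone, which keeps the result as a Series that is easy to plot or join with other statistics. This is a sketch of an alternative to the `Counter` approach, not a replacement for it; the `operators` Series below is hypothetical and only stands in for the per-problem operator lists.

```python
import pandas as pd

# Hypothetical operator lists, one list per problem
operators = pd.Series([["add", "multiply"], ["add"], ["subtract", "add"]])

# explode() produces one row per operator; value_counts() tallies them
op_counts = operators.explode().value_counts()

print(op_counts)
```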

## Example Problems

Let's look at some example problems from different domains:

In [None]:
def display_problem(problem):
    print(f"Domain: {problem['metadata']['domain']}")
    print(f"Question: {problem['question']}")
    print(f"Operations: {', '.join(problem['metadata']['operators'])}")
    print(f"Solution: {problem['metadata']['solution']}")
    print("-" * 80)

# Display one example from each domain
domains = set(train_df['metadata'].apply(lambda x: x['domain']))
for domain in sorted(domains):
    example = train_df[train_df['metadata'].apply(lambda x: x['domain'] == domain)].iloc[0]
    display_problem(example)

## Analyzing Problem Complexity

Let's analyze the complexity of problems based on the number of operations required:

In [None]:
operation_counts = train_df['metadata'].apply(lambda x: len(x['operators']))

plt.figure(figsize=(10, 6))
sns.histplot(operation_counts, bins=range(1, max(operation_counts) + 2))
plt.title('Distribution of Number of Operations per Problem')
plt.xlabel('Number of Operations')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

# Print summary statistics
print("\nOperation Count Statistics:")
print(operation_counts.describe())

## Analyzing Decimal Precision

Let's examine the distribution of decimal places in the problems:

In [None]:
decimal_places = train_df['metadata'].apply(lambda x: x['decimals'])

plt.figure(figsize=(10, 6))
sns.histplot(decimal_places, bins=range(0, max(decimal_places) + 2))
plt.title('Distribution of Decimal Places in Solutions')
plt.xlabel('Number of Decimal Places')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

# Print summary statistics
print("\nDecimal Places Statistics:")
print(decimal_places.describe())

## Cross-Domain Analysis

Let's analyze how problem complexity varies across different domains:

In [None]:
# Create a DataFrame with domain and operation count
domain_complexity = pd.DataFrame({
    'domain': train_df['metadata'].apply(lambda x: x['domain']),
    'num_operations': train_df['metadata'].apply(lambda x: len(x['operators'])),
    'decimal_places': train_df['metadata'].apply(lambda x: x['decimals'])
})

plt.figure(figsize=(12, 6))
sns.boxplot(data=domain_complexity, x='domain', y='num_operations')
plt.title('Number of Operations by Domain')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Print summary statistics by domain
print("\nComplexity by Domain:")
print(domain_complexity.groupby('domain')['num_operations'].describe())
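
The `domain_complexity` frame also carries a `decimal_places` column that the box plot does not use. As a rough sketch, both measures can be summarized per domain in one pass with named aggregation. The `demo` frame below is a hypothetical stand-in with the same column names; its values are invented.

```python
import pandas as pd

# Hypothetical stand-in for domain_complexity (same columns, made-up values)
demo = pd.DataFrame({
    "domain": ["retail", "retail", "science", "science"],
    "num_operations": [1, 3, 2, 4],
    "decimal_places": [2, 2, 0, 4],
})

# Named aggregation: one output column per (source column, function) pair
summary = demo.groupby("domain").agg(
    problems=("num_operations", "size"),
    mean_operations=("num_operations", "mean"),
    mean_decimals=("decimal_places", "mean"),
)

print(summary)
```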

## Working with the Dataset

Here's an example of how to extract summary features from individual problems, applied to the first five training records:

In [None]:
def process_problem(problem):
    """Example function to process a single problem."""
    return {
        'domain': problem['metadata']['domain'],
        'num_operations': len(problem['metadata']['operators']),
        'has_decimals': problem['metadata']['decimals'] > 0,
        'question_length': len(problem['question'].split()),
        'solution': problem['metadata']['solution']
    }

# Process the first five problems as a demonstration
# (drop the [:5] slice to process the full training set)
processed_problems = [process_problem(problem) for problem in train_data[:5]]
processed_df = pd.DataFrame(processed_problems)

print("Example of processed problems:")
display(processed_df)
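
The evaluation split is loaded earlier but not analyzed further. One simple check is whether the two splits have a similar domain mix. The following is a minimal sketch under that assumption; `train_domains` and `eval_domains` are hypothetical Series standing in for the domain labels extracted from `train_df` and `eval_df`.

```python
import pandas as pd

# Hypothetical domain labels for the two splits
train_domains = pd.Series(["retail", "retail", "science", "finance"])
eval_domains = pd.Series(["retail", "science", "science", "finance"])

# Proportions per domain, aligned on the union of domain names;
# a domain missing from one split gets a share of 0.0
comparison = pd.DataFrame({
    "train": train_domains.value_counts(normalize=True),
    "eval": eval_domains.value_counts(normalize=True),
}).fillna(0.0)
comparison["difference"] = comparison["eval"] - comparison["train"]

print(comparison.sort_index())
```

Large values in the `difference` column would suggest that results on the evaluation split may not be directly comparable across domains.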