# Dyscalculia Assessment Data Analysis

This notebook analyzes the assessment data from the MathemAI project to identify patterns in mathematical learning difficulties, with a particular focus on dyscalculia.

## Table of Contents
1. [Introduction](#introduction)
2. [Data Loading and Exploration](#data-loading)
3. [Demographic Analysis](#demographic-analysis)
4. [Skill Proficiency Analysis](#skill-analysis)
5. [Correlation Analysis](#correlation-analysis)
6. [Error Pattern Analysis](#error-analysis)
7. [Cognitive Factor Analysis](#cognitive-analysis)
8. [Predictive Modeling](#predictive-modeling)
9. [Conclusions and Recommendations](#conclusions)

## Introduction <a id="introduction"></a>

Dyscalculia is a specific learning disability that impairs number sense and arithmetic learning, affecting approximately 5-8% of school-age children. Early identification and targeted intervention can significantly improve outcomes for affected students.

In this notebook, we analyze assessment data collected through the MathemAI platform to:
1. Identify patterns characteristic of dyscalculia and other math learning difficulties
2. Develop insights to inform screening methods
3. Generate recommendations for personalized interventions

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

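# Set visualization style
# (the 'seaborn-whitegrid' style name was deprecated in matplotlib 3.6 and later removed;
# sns.set_theme applies the whitegrid look and the palette in one call)
sns.set_theme(style='whitegrid', palette='viridis')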

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)

## Data Loading and Exploration <a id="data-loading"></a>

First, we load the assessment data and examine its structure and basic statistics.

In [2]:
# Load the assessment data
# Path to the dataset
assessment_data_path = '../datasets/dyscalculia_assessment_data.csv'

try:
    # Try to load the actual data file
    assessment_df = pd.read_csv(assessment_data_path)
except FileNotFoundError:
    # If file doesn't exist, use sample data
    print(f"Warning: Could not find {assessment_data_path}. Using sample data instead.")
    
    # Sample data (this matches the schema defined in data_schema.md)
    assessment_data = [
        {"student_id": 1, "age": 7, "grade": 2, "number_recognition": 3, "number_comparison": 2, "counting_skills": 4, "place_value": 2, "calculation_accuracy": 2, "calculation_fluency": 1, "arithmetic_facts_recall": 2, "word_problem_solving": 1, "math_anxiety_level": "high", "attention_score": "normal", "working_memory_score": "low", "visual_spatial_score": "normal", "error_patterns": "transposition", "response_time": "slow", "diagnosis": "dyscalculia"},
        {"student_id": 2, "age": 8, "grade": 3, "number_recognition": 4, "number_comparison": 3, "counting_skills": 4, "place_value": 3, "calculation_accuracy": 2, "calculation_fluency": 2, "arithmetic_facts_recall": 3, "word_problem_solving": 2, "math_anxiety_level": "high", "attention_score": "low", "working_memory_score": "normal", "visual_spatial_score": "low", "error_patterns": "reversal", "response_time": "slow", "diagnosis": "dyscalculia"},
        {"student_id": 3, "age": 9, "grade": 4, "number_recognition": 5, "number_comparison": 4, "counting_skills": 4, "place_value": 4, "calculation_accuracy": 3, "calculation_fluency": 2, "arithmetic_facts_recall": 2, "word_problem_solving": 3, "math_anxiety_level": "medium", "attention_score": "very_low", "working_memory_score": "low", "visual_spatial_score": "normal", "error_patterns": "miscounting", "response_time": "average", "diagnosis": "dyscalculia"},
        {"student_id": 4, "age": 7, "grade": 2, "number_recognition": 3, "number_comparison": 2, "counting_skills": 3, "place_value": 2, "calculation_accuracy": 2, "calculation_fluency": 1, "arithmetic_facts_recall": 1, "word_problem_solving": 1, "math_anxiety_level": "high", "attention_score": "normal", "working_memory_score": "normal", "visual_spatial_score": "low", "error_patterns": "sequence_error", "response_time": "very_slow", "diagnosis": "dyscalculia"},
        {"student_id": 5, "age": 10, "grade": 5, "number_recognition": 4, "number_comparison": 4, "counting_skills": 5, "place_value": 3, "calculation_accuracy": 3, "calculation_fluency": 2, "arithmetic_facts_recall": 3, "word_problem_solving": 2, "math_anxiety_level": "high", "attention_score": "low", "working_memory_score": "low", "visual_spatial_score": "normal", "error_patterns": "operation_confusion", "response_time": "slow", "diagnosis": "dyscalculia"},
        {"student_id": 6, "age": 8, "grade": 3, "number_recognition": 5, "number_comparison": 4, "counting_skills": 5, "place_value": 4, "calculation_accuracy": 3, "calculation_fluency": 3, "arithmetic_facts_recall": 4, "word_problem_solving": 3, "math_anxiety_level": "low", "attention_score": "normal", "working_memory_score": "normal", "visual_spatial_score": "normal", "error_patterns": "occasional_error", "response_time": "average", "diagnosis": "math_difficulty"},
        {"student_id": 7, "age": 9, "grade": 4, "number_recognition": 5, "number_comparison": 5, "counting_skills": 5, "place_value": 4, "calculation_accuracy": 4, "calculation_fluency": 3, "arithmetic_facts_recall": 3, "word_problem_solving": 3, "math_anxiety_level": "medium", "attention_score": "normal", "working_memory_score": "low", "visual_spatial_score": "normal", "error_patterns": "calculation_error", "response_time": "average", "diagnosis": "math_difficulty"},
        {"student_id": 8, "age": 11, "grade": 6, "number_recognition": 4, "number_comparison": 4, "counting_skills": 5, "place_value": 3, "calculation_accuracy": 3, "calculation_fluency": 2, "arithmetic_facts_recall": 2, "word_problem_solving": 3, "math_anxiety_level": "high", "attention_score": "low", "working_memory_score": "normal", "visual_spatial_score": "low", "error_patterns": "consistent_error", "response_time": "slow", "diagnosis": "dyscalculia"},
        {"student_id": 9, "age": 6, "grade": 1, "number_recognition": 2, "number_comparison": 2, "counting_skills": 3, "place_value": 1, "calculation_accuracy": 1, "calculation_fluency": 1, "arithmetic_facts_recall": 1, "word_problem_solving": 1, "math_anxiety_level": "medium", "attention_score": "very_low", "working_memory_score": "low", "visual_spatial_score": "low", "error_patterns": "reversal", "response_time": "very_slow", "diagnosis": "dyscalculia"},
        {"student_id": 10, "age": 10, "grade": 5, "number_recognition": 5, "number_comparison": 5, "counting_skills": 5, "place_value": 4, "calculation_accuracy": 4, "calculation_fluency": 4, "arithmetic_facts_recall": 4, "word_problem_solving": 4, "math_anxiety_level": "low", "attention_score": "normal", "working_memory_score": "normal", "visual_spatial_score": "normal", "error_patterns": "rare_error", "response_time": "fast", "diagnosis": "typical"}
    ]
    
    assessment_df = pd.DataFrame(assessment_data)
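
Because this cell can silently fall back to sample data, a quick column check before the analysis may help catch a CSV whose layout differs from what the later cells expect. The following is a minimal sketch under stated assumptions: `EXPECTED_COLUMNS` and `find_missing_columns` are hypothetical names introduced here, and the column list is copied from the sample records above rather than from `data_schema.md` itself.

```python
import pandas as pd

# Hypothetical list of required columns, copied from the sample records above
EXPECTED_COLUMNS = [
    "student_id", "age", "grade", "number_recognition", "number_comparison",
    "counting_skills", "place_value", "calculation_accuracy", "calculation_fluency",
    "arithmetic_facts_recall", "word_problem_solving", "math_anxiety_level",
    "attention_score", "working_memory_score", "visual_spatial_score",
    "error_patterns", "response_time", "diagnosis",
]


def find_missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df, in order."""
    return [name for name in expected if name not in df.columns]


# Toy frame that lacks most columns, to show what the check reports
toy_df = pd.DataFrame({"student_id": [1], "age": [7], "grade": [2]})
missing = find_missing_columns(toy_df)
print(f"{len(missing)} expected columns are missing: {missing}")
```

In the notebook itself, the same helper could be called as `find_missing_columns(assessment_df)` right after loading, raising an error if the returned list is not empty.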

In [3]:
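# A notebook cell only renders its last expression, so wrap each table in
# display() (provided by IPython/Jupyter) to show all of them.

# Display first few rows of the data
display(assessment_df.head())

# Get summary statistics for the numeric columns
display(assessment_df.describe())

# Check column dtypes and non-null counts (info() prints its own output)
assessment_df.info()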

In [4]:
# Examine the distribution of categorical variables
print("\nDistribution of diagnosis:")
print(assessment_df['diagnosis'].value_counts())

print("\nDistribution of error patterns:")
print(assessment_df['error_patterns'].value_counts())

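print("\nDistribution of cognitive factors:")
# Report every cognitive factor, not only math anxiety
for column in ['math_anxiety_level', 'attention_score',
               'working_memory_score', 'visual_spatial_score']:
    print(f"\n{column}:")
    print(assessment_df[column].value_counts())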

## Demographic Analysis <a id="demographic-analysis"></a>

Let's examine the distribution of mathematics learning difficulties across different age groups and grades.

In [5]:
# Analyze diagnosis by age group
# (display() is needed because only a cell's last expression is rendered)
age_diagnosis = pd.crosstab(assessment_df['age'], assessment_df['diagnosis'])
display(age_diagnosis)

# Analyze diagnosis by grade
grade_diagnosis = pd.crosstab(assessment_df['grade'], assessment_df['diagnosis'])
display(grade_diagnosis)

# Calculate the percentage of each diagnosis within each grade (rows sum to 100)
grade_diagnosis_percent = grade_diagnosis.div(grade_diagnosis.sum(axis=1), axis=0) * 100
display(grade_diagnosis_percent)

## Skill Proficiency Analysis <a id="skill-analysis"></a>

Now we'll analyze the relationship between math skills and diagnosis.

In [6]:
# Calculate mean scores by diagnosis
print("Mean scores by diagnosis:")
diagnosis_means = assessment_df.groupby('diagnosis')[
    ['number_recognition', 'number_comparison', 'counting_skills', 'place_value',
     'calculation_accuracy', 'calculation_fluency', 'arithmetic_facts_recall',
     'word_problem_solving']
].mean()

diagnosis_means

In [7]:
# Create a radar chart visualization placeholder for skill profiles
print("Skill profiles by diagnosis - A radar chart would show the profile differences")
# In an actual notebook, we would create a radar chart here
# The chart would show that students with dyscalculia have lower scores in specific areas
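
The placeholder above could be filled with a radar (polar) chart built on matplotlib's polar axes. The sketch below is one possible version, not the project's actual plotting code: `plot_skill_radar` is a hypothetical helper, and the `demo_means` values are invented purely to make the example self-contained.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt


def plot_skill_radar(mean_scores):
    """Draw one closed polygon per row of mean_scores on a polar axis."""
    skills = list(mean_scores.columns)
    # One angle per skill, evenly spaced around the circle
    angles = np.linspace(0, 2 * np.pi, len(skills), endpoint=False).tolist()
    closed_angles = angles + angles[:1]  # repeat the first angle to close the shape

    fig, ax = plt.subplots(figsize=(7, 7), subplot_kw={"polar": True})
    for group, row in mean_scores.iterrows():
        values = row.tolist()
        closed_values = values + values[:1]
        ax.plot(closed_angles, closed_values, label=str(group))
        ax.fill(closed_angles, closed_values, alpha=0.15)
    ax.set_xticks(angles)
    ax.set_xticklabels(skills)
    ax.set_title("Mean skill scores by diagnosis")
    ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
    return fig, ax


# Illustrative values only, not results from the assessment data
demo_means = pd.DataFrame(
    {
        "number_recognition": [3.0, 5.0],
        "place_value": [2.0, 4.0],
        "calculation_fluency": [1.5, 4.0],
        "word_problem_solving": [1.5, 4.0],
    },
    index=["group_a", "group_b"],
)
fig, ax = plot_skill_radar(demo_means)
```

In the notebook, passing the `diagnosis_means` table computed in the previous cell (`plot_skill_radar(diagnosis_means)`) would draw one polygon per diagnosis group.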

## Correlation Analysis <a id="correlation-analysis"></a>

Let's examine correlations between different mathematical skills and cognitive factors.

In [8]:
# Convert categorical variables to numeric
assessment_numeric = assessment_df.copy()

# Map anxiety levels
anxiety_map = {'low': 0, 'medium': 1, 'high': 2}
assessment_numeric['math_anxiety_level'] = assessment_numeric['math_anxiety_level'].map(anxiety_map)

# Map cognitive scores
cognitive_map = {'normal': 2, 'low': 1, 'very_low': 0}
assessment_numeric['attention_score'] = assessment_numeric['attention_score'].map(cognitive_map)
assessment_numeric['working_memory_score'] = assessment_numeric['working_memory_score'].map(cognitive_map)
assessment_numeric['visual_spatial_score'] = assessment_numeric['visual_spatial_score'].map(cognitive_map)

# Map response time
response_map = {'fast': 0, 'average': 1, 'slow': 2, 'very_slow': 3}
assessment_numeric['response_time'] = assessment_numeric['response_time'].map(response_map)

# Calculate correlation matrix
corr_columns = ['number_recognition', 'number_comparison', 'counting_skills', 'place_value',
                'calculation_accuracy', 'calculation_fluency', 'arithmetic_facts_recall',
                'word_problem_solving', 'math_anxiety_level', 'attention_score',
                'working_memory_score', 'visual_spatial_score', 'response_time']

correlation_matrix = assessment_numeric[corr_columns].corr()
correlation_matrix

In [9]:
# Identify the strongest correlations
print("Strongest correlations:")
correlation_pairs = []

for i in range(len(corr_columns)):
    for j in range(i+1, len(corr_columns)):
        correlation_pairs.append((corr_columns[i], corr_columns[j], correlation_matrix.iloc[i, j]))

# Sort by absolute correlation value
correlation_pairs.sort(key=lambda x: abs(x[2]), reverse=True)

# Display top 10 correlations
for pair in correlation_pairs[:10]:
    print(f"{pair[0]} and {pair[1]}: {pair[2]:.2f}")
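
The cognitive factors, anxiety levels, and response times above are ordinal categories mapped to integers, so the default Pearson coefficient treats the gaps between levels as equal. A rank-based alternative is Spearman correlation, available through the `method` argument of `DataFrame.corr`. The sketch below uses a toy frame with invented values to show the call; whether Spearman is the better choice for the real data is a judgment call, not something the notebook establishes.

```python
import pandas as pd

# Toy ordinal codes, invented for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "calculation_fluency": [1, 2, 3, 4, 5],
    "response_time": [3, 2, 2, 1, 0],  # contains a tie, as ordinal codes often do
})

# Pearson uses the raw codes; Spearman uses only their rank order
pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

print("Pearson:")
print(pearson.round(2))
print("\nSpearman:")
print(spearman.round(2))
```

In the notebook, `assessment_numeric[corr_columns].corr(method="spearman")` would be the corresponding call.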

## Error Pattern Analysis <a id="error-analysis"></a>

Now we'll analyze common error patterns associated with dyscalculia.

In [10]:
# Analyze error patterns by diagnosis
# (display() is needed because only a cell's last expression is rendered)
error_diagnosis = pd.crosstab(assessment_df['error_patterns'], assessment_df['diagnosis'])
display(error_diagnosis)

# Calculate percentage of each error pattern within diagnosis groups (columns sum to 100)
error_diagnosis_percent = error_diagnosis.div(error_diagnosis.sum(axis=0), axis=1) * 100
display(error_diagnosis_percent)
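
The within-group percentages can also be obtained directly from `pd.crosstab` through its `normalize` argument, which avoids the manual division. The sketch below uses a small toy frame with invented rows; only the column names are taken from the dataset.

```python
import pandas as pd

# Toy records, invented for illustration
toy = pd.DataFrame({
    "error_patterns": ["reversal", "reversal", "miscounting", "rare_error"],
    "diagnosis": ["dyscalculia", "dyscalculia", "dyscalculia", "typical"],
})

# normalize="columns" divides each column by its total,
# so the shares within every diagnosis group sum to 1
share = pd.crosstab(toy["error_patterns"], toy["diagnosis"], normalize="columns")
percent = (share * 100).round(1)
print(percent)
```

Using `normalize="index"` instead would give the share of each diagnosis within every error pattern, which answers a different question.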

## Cognitive Factor Analysis <a id="cognitive-analysis"></a>

Let's examine the relationship between cognitive factors and dyscalculia.

In [11]:
# display() is needed because only a cell's last expression is rendered

# Analyze math anxiety level by diagnosis
anxiety_diagnosis = pd.crosstab(assessment_df['math_anxiety_level'], assessment_df['diagnosis'])
display(anxiety_diagnosis)

# Analyze working memory by diagnosis
memory_diagnosis = pd.crosstab(assessment_df['working_memory_score'], assessment_df['diagnosis'])
display(memory_diagnosis)

# Analyze visual-spatial skills by diagnosis
spatial_diagnosis = pd.crosstab(assessment_df['visual_spatial_score'], assessment_df['diagnosis'])
display(spatial_diagnosis)

# Analyze attention by diagnosis
attention_diagnosis = pd.crosstab(assessment_df['attention_score'], assessment_df['diagnosis'])
display(attention_diagnosis)

## Predictive Modeling <a id="predictive-modeling"></a>

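Let's build a simple model to predict the diagnosis category (dyscalculia, math difficulty, or typical) from the assessment data. When the notebook falls back to the ten-row sample, the test set holds only three students, so the metrics are illustrative only.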

In [12]:
# Prepare the data for modeling
X = assessment_numeric[[
    'number_recognition', 'number_comparison', 'counting_skills', 'place_value',
    'calculation_accuracy', 'calculation_fluency', 'arithmetic_facts_recall',
    'word_problem_solving', 'math_anxiety_level', 'attention_score',
    'working_memory_score', 'visual_spatial_score', 'response_time'
]]

y = assessment_df['diagnosis']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train a random forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

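# Evaluate the model
# zero_division=0 avoids undefined-metric warnings when a class has no predictions
print("Classification Report:")
print(classification_report(y_test, y_pred, zero_division=0))

# Pass labels explicitly so rows and columns follow the order of clf.classes_
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred, labels=clf.classes_))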
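
With very few rows, a single 70/30 split leaves only a handful of test cases. One alternative worth considering is leave-one-out cross-validation, where every sample is predicted by a model trained on all the others. The sketch below uses scikit-learn's `LeaveOneOut` and `cross_val_predict` on synthetic data; `X_demo`, `y_demo`, and `clf_demo` are hypothetical stand-ins, and the values are randomly generated rather than drawn from the assessment data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for the feature matrix and labels (invented for illustration)
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(2.0, 0.5, size=(6, 4)),  # lower scores
    rng.normal(4.0, 0.5, size=(6, 4)),  # higher scores
])
y_demo = np.array(["dyscalculia"] * 6 + ["typical"] * 6)

clf_demo = RandomForestClassifier(n_estimators=50, random_state=42)

# Each sample is predicted by a model fitted on the remaining samples
loo_pred = cross_val_predict(clf_demo, X_demo, y_demo, cv=LeaveOneOut())
print(f"Leave-one-out accuracy: {accuracy_score(y_demo, loo_pred):.2f}")
```

In the notebook, the equivalent call would be `cross_val_predict(clf, X, y, cv=LeaveOneOut())`, followed by `classification_report(y, loo_pred)`.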

In [13]:
# Feature importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': clf.feature_importances_
}).sort_values('Importance', ascending=False)

print("Feature Importance:")
feature_importance
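
The `feature_importances_` attribute of a random forest is impurity-based and is computed from the training data. A complementary view is permutation importance, which measures how much a score drops when one feature's values are shuffled. The sketch below applies scikit-learn's `permutation_importance` to synthetic data; the feature names and values are invented, and for brevity the importance is computed on the same data the model was fitted on, whereas a held-out set would normally be preferable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: only the first feature carries signal (invented for illustration)
rng = np.random.default_rng(0)
labels = np.array([0] * 20 + [1] * 20)
X_demo = pd.DataFrame({
    "informative_feature": labels * 2.0 + rng.normal(0, 0.3, size=40),
    "noise_a": rng.normal(0, 1, size=40),
    "noise_b": rng.normal(0, 1, size=40),
})

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, labels)

# Shuffle each feature n_repeats times and record the mean drop in accuracy
result = permutation_importance(model, X_demo, labels, n_repeats=10, random_state=0)

importance_table = pd.DataFrame({
    "Feature": X_demo.columns,
    "Permutation importance": result.importances_mean,
}).sort_values("Permutation importance", ascending=False)
print(importance_table)
```

In the notebook, the corresponding call would be `permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)`.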

## Conclusions and Recommendations <a id="conclusions"></a>

Based on our analysis of the assessment data, we can draw the following conclusions:

1. **Key Indicators of Dyscalculia**:
   - Calculation fluency appears to be the strongest predictor
   - Number comparison and arithmetic facts recall are also important indicators
   - Math anxiety shows a strong association with dyscalculia diagnosis

2. **Cognitive Factors**:
   - Working memory deficits are common in students with dyscalculia
   - Visual-spatial skills may contribute to certain types of mathematical difficulties
   - Attention issues are present but not as strongly correlated as working memory

3. **Error Patterns**:
   - Reversal errors are particularly common in dyscalculia
   - Operation confusion and sequence errors are also prevalent
   - Response time is generally slower for students with dyscalculia

### Recommendations for Intervention

1. **Targeted Skill Development**:
   - Focus on building number sense and comparison skills
   - Use multisensory approaches to improve calculation fluency
   - Implement structured, sequential instruction for arithmetic facts

2. **Cognitive Support Strategies**:
   - Provide working memory supports (e.g., visual aids, chunking information)
   - Include visual-spatial scaffolding (e.g., number lines, manipulatives)
   - Implement anxiety reduction techniques

3. **Error-Specific Interventions**:
   - For reversal errors: Explicit instruction and visual cues
   - For operation confusion: Clear visual representations of operations
   - For sequence errors: Structured practice with immediate feedback

### Future Research Directions

1. Collect larger datasets to improve model accuracy
2. Investigate longitudinal effectiveness of interventions
3. Explore the relationship between specific error patterns and cognitive factors
4. Develop more specialized screening tools for different age groups