# EDA & Preprocessing for Student Readiness Classification

In this notebook, we will:
1. **Explore** the ASSISTments dataset  
2. **Clean** and handle missing values  
3. **Engineer** features for an early‑interaction “readiness” label  
4. **Transform** data via pipelines  
5. **Split** into train/test sets  

---

## 1. Import Libraries

We import **pandas** and **numpy** for data handling, and **scikit‑learn** modules for preprocessing and splitting.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

---

## 2. Load the Dataset

Load the full ASSISTments CSV into a DataFrame.  

In [2]:
df = pd.read_csv('assistments_2009_2010.csv')

---

## 3. Initial Exploration

- **Shape** tells us total rows & columns.  
- **head()** previews the first 5 rows.  
- **info()** shows dtypes & non‑null counts.  
- **isnull().sum()** reveals missing values per column.

In [3]:
print("Shape:", df.shape)
display(df.head())
df.info()
display(df.isnull().sum())

Shape: (552977, 20)


Unnamed: 0,order_id,assignment_id,user_id,assistment_id,problem_id,original,correct,attempt_count,ms_first_response_time,tutor_mode,answer_type,sequence_id,student_class_id,position,problem_set_type,base_sequence_id,list_skill_ids,list_skills,teacher_id,school_id
0,20224085,232368,73963.0,42904.0,76429.0,0.0,0.0,3.0,106016.0,tutor,choose_1,6272.0,11816.0,93.0,MasterySection,6272.0,,,22763.0,73.0
1,20224095,232368,73963.0,42904.0,76430.0,0.0,1.0,1.0,194187.0,tutor,choose_1,6272.0,11816.0,93.0,MasterySection,6272.0,,,22763.0,73.0
2,20224113,232368,73963.0,42904.0,76431.0,0.0,1.0,1.0,12734.0,tutor,algebra,6272.0,11816.0,93.0,MasterySection,6272.0,,,22763.0,73.0
3,20224123,232368,73963.0,42904.0,76432.0,0.0,1.0,1.0,333484.0,tutor,choose_1,6272.0,11816.0,93.0,MasterySection,6272.0,,,22763.0,73.0
4,20224142,232368,73963.0,42904.0,76433.0,0.0,0.0,2.0,52828.0,tutor,algebra,6272.0,11816.0,93.0,MasterySection,6272.0,,,22763.0,73.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552977 entries, 0 to 552976
Data columns (total 20 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   order_id                552977 non-null  int64  
 1   assignment_id           552977 non-null  int64  
 2   user_id                 552976 non-null  float64
 3   assistment_id           552976 non-null  float64
 4   problem_id              552976 non-null  float64
 5   original                552976 non-null  float64
 6   correct                 552976 non-null  float64
 7   attempt_count           552976 non-null  float64
 8   ms_first_response_time  552853 non-null  float64
 9   tutor_mode              552976 non-null  object 
 10  answer_type             552976 non-null  object 
 11  sequence_id             552976 non-null  float64
 12  student_class_id        552976 non-null  float64
 13  position                552976 non-null  float64
 14  problem_set_type    

Unnamed: 0,0
order_id,0
assignment_id,0
user_id,1
assistment_id,1
problem_id,1
original,1
correct,1
attempt_count,1
ms_first_response_time,124
tutor_mode,1
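
Raw null counts are easier to judge next to the share of rows they affect. The following is a minimal sketch on a small made-up frame; the `missing_summary` helper and the toy values are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null counts and percentages, most-missing first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": 100 * counts / len(frame),
    })
    return summary.sort_values("n_missing", ascending=False)


# Toy data standing in for the real DataFrame (values are made up)
toy = pd.DataFrame({
    "user_id": [1.0, 2.0, np.nan, 4.0],
    "ms_first_response_time": [1200.0, np.nan, np.nan, 800.0],
    "tutor_mode": ["tutor", "test", "tutor", "tutor"],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, the same helper could be called as `missing_summary(df)`.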


In [4]:
# Map answer types to broader categories
answer_type_mapping = {
    'algebra': 'Problem-Solving',
    'choose_1': 'Knowledge-Based',
    'fill_in_1': 'Knowledge-Based',
    'choose_n': 'Strategic Reasoning',
    'open_response': 'Problem-Solving',
    'rank': 'Strategic Reasoning',
    'external': 'Knowledge-Based'  # assuming "external" is simple lookup questions
}

# Apply the mapping
df['answer_type_grouped'] = df['answer_type'].map(answer_type_mapping)
df['answer_type_grouped'].value_counts()

answer_type_grouped,count
Problem-Solving,305845
Knowledge-Based,241049
Strategic Reasoning,6082
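
`Series.map` with a dictionary returns `NaN` for any value that has no key in the dictionary, so it can be worth checking for answer types the mapping does not cover. A small sketch, assuming a shortened, hypothetical mapping and made-up values:

```python
import pandas as pd

# Hypothetical subset of the mapping, for illustration only
mapping = {"algebra": "Problem-Solving", "choose_1": "Knowledge-Based"}

answer_type = pd.Series(["algebra", "choose_1", "new_type", None])
grouped = answer_type.map(mapping)

# Values that were present in the input but have no entry in the mapping
unmapped = answer_type[grouped.isna() & answer_type.notna()].unique()
print(unmapped)
```

An empty `unmapped` array would indicate that every non-null answer type was assigned a group.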


---

## 4. Handle Missing Values

- We fill **ms_first_response_time** with its median (only 124 values are missing).  
- We drop columns that have many nulls or that are not needed for our early‑interaction model.  

In [5]:
# 4.1 Impute ms_first_response_time
median_time = df['ms_first_response_time'].median()
df['ms_first_response_time'] = df['ms_first_response_time'].fillna(median_time)

# 4.2 Drop columns with excessive or non‑essential nulls
cols_to_drop = [
    'sequence_id', 'student_class_id', 'position', 'problem_set_type',
    'base_sequence_id', 'teacher_id', 'school_id',
    'list_skill_ids', 'list_skills'
]
df = df.drop(columns=cols_to_drop)
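
The median is a common choice for response times because a few very long responses would pull the mean upward. The sketch below illustrates the two steps on made-up data; the `errors="ignore"` argument is an optional addition (not used in the notebook) that lets the drop be rerun without raising when a column is already gone.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ms_first_response_time": [1000.0, np.nan, 3000.0, 50000.0],
    "school_id": [np.nan, np.nan, 73.0, np.nan],
})

# Series.median() skips NaN by default, so it is computed over 3 values here
median_time = toy["ms_first_response_time"].median()
toy["ms_first_response_time"] = toy["ms_first_response_time"].fillna(median_time)

# errors="ignore" skips names that are absent ("teacher_id" is not in the toy frame)
toy = toy.drop(columns=["school_id", "teacher_id"], errors="ignore")
print(median_time, list(toy.columns))
```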

---

## 5. Feature Engineering: First Interaction

We extract **each student’s very first problem interaction**. This lets us classify readiness after just one event.  

In [6]:
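# Sort by student and order_id (used as the time order), then keep each student's first row.
# groupby('user_id').first() would take the first non-null value per column instead.
df_sorted = df.dropna(subset=['user_id']).sort_values(['user_id', 'order_id'])
first_interactions = df_sorted.drop_duplicates(subset='user_id', keep='first').reset_index(drop=True)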
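
A subtlety worth knowing when picking one row per group in pandas: `GroupBy.first()` returns the first non-null value of each column, so the result can combine values from different rows, while `drop_duplicates(..., keep='first')` returns the actual first row. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 1, 2],
    "order_id": [10, 11, 12],
    "correct": [np.nan, 1.0, 0.0],
})
toy_sorted = toy.sort_values(["user_id", "order_id"])

# first(): first non-null value per column, possibly mixing rows
per_column = toy_sorted.groupby("user_id").first().reset_index()

# drop_duplicates(): the actual first row per user, NaN included
first_row = toy_sorted.drop_duplicates(subset="user_id", keep="first")

print(per_column)
print(first_row)
```

For user 1, `per_column` pairs `order_id` 10 with the `correct` value from the second row, whereas `first_row` keeps the missing value. The two approaches agree whenever the first row of each group has no nulls.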

---

## 6. Define Readiness Categories

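We assign each student one of three labels, based on the correctness, attempt count, and response time of their first interaction:
1. **Fundamentals** (Category 1): the answer was incorrect, took more than two attempts, or was slower than the median response time.  
2. **More practice** (Category 2): every remaining case, for example a fast, correct answer that took two attempts.  
3. **Advanced** (Category 3): the answer was correct on the first attempt and at or below the median response time.  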

In [7]:
# Determine a “fast/slow” threshold (median response time)
threshold_time = first_interactions['ms_first_response_time'].median()

def assign_category(row):
    if (row['correct'] == 0) or (row['attempt_count'] > 2) or (row['ms_first_response_time'] > threshold_time):
        return 1  # Needs reinforcement of fundamentals
    elif (row['correct'] == 1) and (row['attempt_count'] == 1) and (row['ms_first_response_time'] <= threshold_time):
        return 3  # Advanced level
    else:
        return 2  # Needs more practice

first_interactions['category'] = first_interactions.apply(assign_category, axis=1)
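
Row-wise `apply` is easy to read but slow on large frames. The same rules can be sketched in vectorized form with `np.select`, which checks its conditions in order much like the `if`/`elif`/`else` chain in `assign_category`. The toy rows and the threshold of 1000 ms below are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "correct": [0.0, 1.0, 1.0, 1.0],
    "attempt_count": [1.0, 1.0, 2.0, 1.0],
    "ms_first_response_time": [500.0, 500.0, 500.0, 9000.0],
})
threshold = 1000.0  # made-up threshold for the toy data

slow = toy["ms_first_response_time"] > threshold
fast = toy["ms_first_response_time"] <= threshold

needs_fundamentals = (toy["correct"] == 0) | (toy["attempt_count"] > 2) | slow
advanced = (toy["correct"] == 1) & (toy["attempt_count"] == 1) & fast

# The first matching condition wins; rows matching neither fall back to 2
toy["category"] = np.select([needs_fundamentals, advanced], [1, 3], default=2)
print(toy["category"].tolist())
```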

---

## 7. Select Features & Labels

We’ll use **attempt_count**, **ms_first_response_time**, and the categorical columns **tutor_mode** and **answer_type_grouped** to predict the category.  

In [8]:
feature_cols = ['attempt_count', 'ms_first_response_time', 'tutor_mode', 'answer_type_grouped']
X = first_interactions[feature_cols]
y = first_interactions['category']

---

## 8. Build Preprocessing Pipelines

We create separate transformers for numeric vs categorical data, then merge via `ColumnTransformer`.

In [9]:
# Numeric pipeline: median impute + standard scaling
numeric_features = ['attempt_count', 'ms_first_response_time']
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline: constant impute + one‑hot encode
categorical_features = ['tutor_mode', 'answer_type_grouped']
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine them
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
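
To see what such a preprocessor produces, it can help to run it on a handful of rows. The sketch below rebuilds the same kind of transformer on made-up data with deliberate gaps; the toy values are assumptions, and the output is converted to a dense array only for inspection.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "attempt_count": [1.0, 2.0, np.nan, 1.0],
    "ms_first_response_time": [1000.0, 3000.0, 2000.0, np.nan],
    "tutor_mode": ["tutor", "test", "tutor", np.nan],
    "answer_type_grouped": ["Problem-Solving", "Knowledge-Based", np.nan, "Problem-Solving"],
}).astype({"tutor_mode": object, "answer_type_grouped": object})

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
prep = ColumnTransformer([
    ("num", numeric, ["attempt_count", "ms_first_response_time"]),
    ("cat", categorical, ["tutor_mode", "answer_type_grouped"]),
])

out = prep.fit_transform(toy)
if sparse.issparse(out):  # ColumnTransformer may return a sparse matrix
    out = out.toarray()

# 2 scaled numeric columns, then one indicator column per category
# (the imputed "missing" level counts as a category)
print(out.shape)
```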

---

## 9. Apply Transformations & Split

- **fit_transform** fits the preprocessor on all rows and applies the imputations, scalings, and encodings.  
- We then split the data for modeling (90% train, 10% test, as set by `test_size=0.1`), stratified by category.

In [10]:
# Transform features
X_processed = preprocessor.fit_transform(X)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y, test_size=0.1, random_state=42, stratify=y
)

# Saving the split
np.savez('data_split.npz', X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test)
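
In the cell above, `fit_transform` runs on all rows before the split, so the imputation medians and scaling statistics also reflect the test rows. A common alternative is to split first and fit the preprocessor on the training rows only. The following is a sketch of that ordering on synthetic data, not the notebook's method; `X_toy`, `y_toy`, and the simplified two-column transformer are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 100
X_toy = pd.DataFrame({
    "attempt_count": rng.integers(1, 4, size=n).astype(float),
    "tutor_mode": rng.choice(["tutor", "test"], size=n),
})
y_toy = pd.Series(np.tile([1, 2, 3, 1], n // 4))  # made-up labels

# Split the raw features first
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=42, stratify=y_toy
)

prep = ColumnTransformer([
    ("num", StandardScaler(), ["attempt_count"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["tutor_mode"]),
])
X_tr_p = prep.fit_transform(X_tr)  # statistics learned from training rows only
X_te_p = prep.transform(X_te)      # test rows reuse those statistics
print(X_tr_p.shape, X_te_p.shape)
```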

---

# Next Steps

1. **Train** a classifier (e.g., RandomForest, XGBoost).  
2. **Evaluate** with accuracy, confusion matrix, etc.  
3. **Iterate**: add more features, tune thresholds, or try cross‑validation.
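
As a rough sketch of the first two steps, the snippet below trains a random forest and reports accuracy and a confusion matrix. It uses synthetic three-class data from `make_classification` rather than the saved split, and the hyperparameters are placeholders rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed features and labels
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

acc = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred, labels=[0, 1, 2])  # rows: true, columns: predicted
print(f"accuracy={acc:.3f}")
print(cm)
```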

---

# Saving a Copy of the Processed Data

In [11]:
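# Note: X and y hold the selected features and labels before imputation, scaling,
# and encoding; the transformed arrays are the ones stored in data_split.npz.
X.to_csv('X_processed.csv', index=False)
y.to_csv('y_processed.csv', index=False)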

---

# Compressing Original Data

Compress the original CSV with gzip to prepare it for upload to GitHub.

In [12]:
import gzip
import shutil

# Path to the original CSV file
input_file = 'assistments_2009_2010.csv'
# Path to the compressed output file
output_file = 'assistments_2009_2010.csv.gz'

# Compress the CSV file
with open(input_file, 'rb') as f_in:
    with gzip.open(output_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

print(f"File '{input_file}' has been compressed to '{output_file}'")

File 'assistments_2009_2010.csv' has been compressed to 'assistments_2009_2010.csv.gz'
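
A gzip-compressed CSV does not need to be decompressed by hand before use: pandas infers the compression from the `.gz` suffix when reading and writing. A small round-trip sketch, using a temporary directory and a made-up frame rather than the real file:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"order_id": [1, 2, 3], "correct": [1.0, 0.0, 1.0]})

with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "toy.csv.gz")
    # Compression is inferred from the ".gz" suffix on both write and read
    toy.to_csv(gz_path, index=False)
    restored = pd.read_csv(gz_path)

print(restored.equals(toy))
```

By the same mechanism, the compressed dataset could be loaded with `pd.read_csv('assistments_2009_2010.csv.gz')`.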
