# 05. Modelling and evaluation

## Objectives

The purpose of this notebook is to train, evaluate, and interpret machine learning models that predict **CEFR levels** based on learners’ language proficiency scores and engineered features. Specifically, we aim to:

- Train classification models using the processed dataset (numeric + categorical engineered features).  
- Compare model performance across multiple algorithms (e.g., Logistic Regression, Random Forest, Gradient Boosting).  
- Evaluate models using accuracy, precision, recall, F1-score, and confusion matrices to assess classification reliability.  
- Select the best-performing model for deployment and future integration into a personalized recommendation system.  
- Save the trained model and evaluation results for reproducibility and downstream use.  
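
As a rough illustration of the comparison described above, the sketch below fits the three named algorithms and collects the listed metrics. It runs on synthetic data from `make_classification` rather than the project dataset, and the six classes (standing in for A1 to C2), the hyperparameters, and the names `candidates` and `results` are assumptions made for the example, not the notebook's final configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 6 classes, 8 numeric features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=6, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Candidate models with illustrative (untuned) settings.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, y_pred),
        # Macro averaging weights every class equally, whatever its size.
        "precision": precision_score(y_te, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_te, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_te, y_pred, average="macro"),
        "confusion_matrix": confusion_matrix(y_te, y_pred, labels=list(range(6))),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, "
          f"macro F1={results[name]['f1']:.3f}")
```

Macro averaging is used here so that small CEFR classes count as much as large ones; weighted averaging would be a reasonable alternative if overall placement accuracy matters more.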


## Inputs

- **Processed dataset**: `data/processed/features.csv` and `data/processed/target.csv`  
  - `features.csv` holds the original skill scores and engineered features (e.g., strongest/weakest skill, profile type); `target.csv` holds the encoded CEFR target variable.  
- **Feature matrix (X)**: Scaled numeric features + encoded categorical engineered features.  
- **Target vector (y)**: Encoded CEFR levels (A1–C2).  


## Outputs

- Trained machine learning models (baseline + advanced).  
- Evaluation metrics (accuracy, precision, recall, F1-score, confusion matrix).  
- Visualizations of model performance.  
- Final selected model, serialized (e.g., `model.joblib`) for reuse.  
- Documentation of why the chosen model best supports the **business goal** of automatic learner placement.  
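
Serializing the selected model could look roughly like the sketch below. It uses a toy `LogisticRegression` and a temporary directory so that it runs on its own; the toy data and the variable names are assumptions, and only the `model.joblib` file name comes from the list above.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny toy model so the example is self-contained (not the project's model).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.joblib"
    joblib.dump(model, model_path)      # write the fitted estimator to disk
    restored = joblib.load(model_path)  # read it back as a new object

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print("Round-trip predictions match:", same_predictions)
```

A file saved this way is generally only safe to load with the same scikit-learn version that produced it, so the library versions are worth storing next to the model.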


## Additional Information

This stage directly addresses the **business requirement**: predicting learners’ CEFR levels to enable **automatic placement** and **personalized learning recommendations**.  
While the model outputs only the predicted CEFR level, the engineered features (skill strengths, weaknesses, balance profiles) provide the contextual insights needed for tailored feedback.  
By evaluating several models on the same held-out data and selecting the best one, we aim for predictions that are accurate, interpretable, and reliable enough for integration into an adaptive learning platform.  

---

# Project Directory Structure

## Change working directory

We need to change the working directory from the notebook's folder to the project root, where the project's code and data are located.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\husse\\OneDrive\\Projects\\lang-level-pred\\jupyter_notebooks'

In [2]:
from pathlib import Path

# Switch to the project root directory
project_root = Path.cwd().parent
os.chdir(project_root)
print(f"Working directory: {os.getcwd()}")

Working directory: c:\Users\husse\OneDrive\Projects\lang-level-pred
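
One caveat with `Path.cwd().parent` is that re-running the cell moves up one more level each time. A possible alternative, sketched below, searches upward for a marker folder, so it gives the same result from the notebook folder and from the project root. The helper `find_project_root` and the choice of `data` as the marker are hypothetical, and the demonstration uses a throwaway directory tree rather than the real project.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Return `start` or its nearest ancestor that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No ancestor of {start} contains '{marker}'")


# Demonstration on a temporary tree that mimics the project layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "data").mkdir()
    notebooks = root / "jupyter_notebooks"
    notebooks.mkdir()

    found_from_notebooks = find_project_root(notebooks)
    found_from_root = find_project_root(root)
    # Same answer regardless of where the search starts.
    is_idempotent = found_from_notebooks == found_from_root == root

print("Idempotent:", is_idempotent)
```

In the notebook this would be used as `os.chdir(find_project_root(Path.cwd()))`, which is safe to run more than once.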


---

# Data loading
This code block imports the core Python libraries for data analysis and visualization and prints their versions.

- pandas: For data manipulation and analysis
- numpy: For numerical computations
- matplotlib: For creating visualizations and plots
- seaborn: For creating attractive and informative statistical graphics

The version checks help ensure:
- Code compatibility across different environments
- Reproducibility of analysis
- Easy debugging of version-specific issues

In [3]:
# Import data analysis tools
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns


print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")
print(f"seaborn version: {sns.__version__}")

pandas version: 2.3.1
NumPy version: 2.3.1
matplotlib version: 3.10.5
seaborn version: 0.13.2
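
To keep these versions with the saved results instead of only printing them, one option is to collect them into a dictionary. The sketch below uses the standard-library `importlib.metadata`; the helper name `collect_versions` is hypothetical, and a package that is not installed is recorded as `None`.

```python
from importlib import metadata


def collect_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = collect_versions(["pandas", "numpy", "matplotlib", "seaborn"])
for package, version in versions.items():
    print(f"{package}: {version}")
```

The resulting dictionary can be written to JSON alongside the evaluation metrics so that a later run can be checked against the same environment.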


### List Files and Folders
- This code lists the files and folders in our `data/processed` folder.

In [5]:
import os
from pathlib import Path

dataset_dir = Path("data/processed")
print(f"[INFO] Files/folders available in {dataset_dir}:")
os.listdir(dataset_dir)

[INFO] Files/folders available in data\processed:


['features.csv', 'target.csv']

## Load Processed Data

In this step, we will load the processed dataset that was prepared in the Feature Engineering Notebook.  
The data has been saved in two separate files:

- `features.csv` → contains the engineered and scaled features.  
- `target.csv` → contains the encoded CEFR levels.  

We will:
- Load both files.  
- Inspect their structure (rows, columns, datatypes).  
- Confirm they align correctly (same number of rows).  
- Prepare them as `X` (features) and `y` (target) for model training.    

In [9]:
import pandas as pd

# Load processed features and target
X = pd.read_csv("data/processed/features.csv")
y = pd.read_csv("data/processed/target.csv").squeeze()  # convert to Series

# Inspect shapes
print("Features shape:", X.shape)
print("Target shape:", y.shape)

# Preview
print("\nFeature columns:\n", X.columns.tolist())
print("\nTarget preview:\n", y.head())

# Validate alignment
assert X.shape[0] == y.shape[0], "❌ Row mismatch between features and target!"
print("✅ Features and target aligned correctly.")

Features shape: (1004, 12)
Target shape: (1004,)

Feature columns:
 ['speaking_score', 'reading_score', 'listening_score', 'writing_score', 'strongest_skill', 'weakest_skill', 'second_weakest_skill', 'skill_std', 'strength_weakness_gap', 'productive_receptive_ratio', 'avg_score', 'learning_profile']

Target preview:
 0    0
1    4
2    2
3    2
4    3
Name: cefr_encoded, dtype: int64
✅ Features and target aligned correctly.
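
The datatype inspection mentioned in the plan could be done along the following lines. This sketch uses a three-row stand-in with a few values copied from the previews in this notebook; the names `X_demo` and `y_demo` are assumptions, and on the real data the same calls would be applied to `X` and `y`.

```python
import pandas as pd

# Small stand-in frames built from values shown in the notebook previews.
X_demo = pd.DataFrame({
    "speaking_score": [24, 93, 62],
    "strongest_skill": ["reading", "speaking", "reading"],
    "skill_std": [5.972158, 1.707825, 4.272002],
})
y_demo = pd.Series([0, 4, 2], name="cefr_encoded")

# Column datatypes: numeric scores versus text-valued categorical features.
print(X_demo.dtypes)

# Missing values per column; any non-zero count needs handling before training.
missing_per_column = X_demo.isna().sum()
print(missing_per_column)

# Class distribution of the target, ordered by encoded level.
class_counts = y_demo.value_counts().sort_index()
print(class_counts)
```

Checking the class counts at this point also shows whether every CEFR level has enough rows for a stratified split.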


---

## 1. Define Features and Target

The processed dataset was loaded above as two separate objects, so no further column separation is needed here:

- **X (features):** all engineered and scaled variables used by the model to make predictions.  
- **y (target):** the encoded CEFR levels that the model will learn to predict.  

We will confirm that both `X` and `y` are correctly structured and aligned before proceeding to train/test splitting.

In [14]:
# Confirm feature matrix (X) and target vector (y)

print("Feature matrix (X):")
print(X.head())

print("\nTarget vector (y):")
print(y.head())

print("\nShapes:")
print("X:", X.shape)
print("y:", y.shape)

Feature matrix (X):
   speaking_score  reading_score  listening_score  writing_score  \
0              24             38               30             34   
1              93             91               90             89   
2              62             64               64             55   
3              63             59               54             54   
4              79             74               85             79   

  strongest_skill weakest_skill second_weakest_skill  skill_std  \
0         reading      speaking            listening   5.972158   
1        speaking       writing            listening   1.707825   
2         reading       writing             speaking   4.272002   
3        speaking     listening            listening   4.358899   
4       listening       reading             speaking   4.500000   

   strength_weakness_gap  productive_receptive_ratio  avg_score  \
0                     14                    0.852941      31.50   
1                      4          

## 2. Train/Test Split with One-Hot Encoding

In this step, we will:

- Split the dataset into **training** and **testing** sets using stratified sampling to preserve CEFR distribution.  
- Identify categorical engineered features:  
  - `strongest_skill`  
  - `weakest_skill`  
  - `second_weakest_skill`  
  - `learning_profile`  
- Apply **One-Hot Encoding (OHE)** to convert these categorical variables into numeric dummy variables.  
- Keep the numeric features as they are.  
- Verify the final transformed feature sets and check their dimensions.

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Identify categorical and numeric features
categorical_features = ["strongest_skill", "weakest_skill", "second_weakest_skill", "learning_profile"]
numeric_features = [col for col in X.columns if col not in categorical_features]

print("Categorical features:", categorical_features)
print("Numeric features:", numeric_features)

# Define preprocessing: One-Hot Encode categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(drop="first"), categorical_features),
        ("num", "passthrough", numeric_features)
    ]
)

# Split the dataset (stratify keeps the CEFR class proportions the same in both sets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)

# Fit and transform training set, transform test set
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print("Processed training set shape:", X_train_processed.shape)
print("Processed test set shape:", X_test_processed.shape)

Categorical features: ['strongest_skill', 'weakest_skill', 'second_weakest_skill', 'learning_profile']
Numeric features: ['speaking_score', 'reading_score', 'listening_score', 'writing_score', 'skill_std', 'strength_weakness_gap', 'productive_receptive_ratio', 'avg_score']
Training set size: (803, 12)
Test set size: (201, 12)
Processed training set shape: (803, 18)
Processed test set shape: (201, 18)
