# 04. Feature engineering

## Objectives

The goal of this notebook is to prepare the dataset for machine learning by creating, transforming, and encoding features that improve the model’s ability to predict CEFR levels. Specifically, we aim to:

- Encode the categorical target variable (CEFR levels) into numerical labels suitable for classification.  
- Ensure feature scaling/normalization so that all skill scores contribute fairly to the model.  
- Explore potential derived features (e.g., average score, skill differences) that may enhance predictive performance.  
- Split the dataset into training and test sets with stratification to preserve class distribution.  
- Generate a final feature matrix (`X`) and target vector (`y`) ready for model training.  


## Inputs

- Cleaned dataset: `data/clean/cleaned_lang_proficiency_results.csv`  
- Columns: `speaking_score`, `reading_score`, `listening_score`, `writing_score`, `overall_cefr`  


## Outputs

- Encoded target labels for CEFR levels  
- Scaled/normalized feature set  
- Optional engineered features (e.g., mean score, modality balance)  
- Train/test splits saved for modeling  
- Final processed dataset in a format ready for the ML notebook  


## Additional Information

Feature engineering bridges the gap between raw data and machine learning readiness. Since the business requirement is **automatic learner placement and personalized recommendations**, ensuring that CEFR levels can be predicted accurately depends on well-prepared features. In this step, we transform the raw language skill scores into an optimized input space for classification models, laying the foundation for robust and interpretable predictions.  

---

# Project Directory Structure

## Change working directory

We need to change the working directory from the notebook's folder to the project root, so that relative paths such as `data/clean` resolve correctly.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\husse\\OneDrive\\Projects\\lang-level-pred\\jupyter_notebooks'

In [2]:
from pathlib import Path

# Switch to the project root directory
project_root = Path.cwd().parent
os.chdir(project_root)
print(f"Working directory: {os.getcwd()}")

Working directory: c:\Users\husse\OneDrive\Projects\lang-level-pred
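Note that `Path.cwd().parent` moves one level up every time the cell is re-run. A minimal sketch of a re-runnable alternative is shown below. It assumes the project root can be recognised by a `data` directory; the helper name `find_project_root` and the `marker` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    `marker` is an assumption: any folder that only exists at the project root works.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Usage (left commented out so the sketch has no side effects):
# import os
# os.chdir(find_project_root(Path.cwd()))
```

Because the search starts at the current directory, running it from the project root or from `jupyter_notebooks` should give the same result.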


---

# Data loading
This code block imports the core Python libraries for data analysis and visualization and prints their versions.

- pandas: For data manipulation and analysis
- numpy: For numerical computations
- matplotlib: For creating visualizations and plots
- seaborn: For statistical visualizations built on matplotlib

The version checks help ensure:
- Code compatibility across different environments
- Reproducibility of analysis
- Easy debugging of version-specific issues

In [None]:
# Import data analysis tools
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns


print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")
print(f"seaborn version: {sns.__version__}")

### List Files and Folders
- This code lists the files and folders in our `data/clean` folder.

In [3]:
import os
from pathlib import Path

dataset_dir = Path("data/clean")
print(f"[INFO] Files/folders available in {dataset_dir}:")
os.listdir(dataset_dir)

[INFO] Files/folders available in data\clean:


['cleaned_lang_proficiency_results.csv']

## Load dataset
This code loads the cleaned dataset from the `data/clean` folder into a pandas DataFrame (`df`).

In [4]:
import pandas as pd
from pathlib import Path

# Define the path to the CSV file
file_path = Path("data/clean/cleaned_lang_proficiency_results.csv")

# Read the CSV file
df = pd.read_csv(file_path)
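
After loading, it can help to confirm that the columns listed under Inputs are present. The sketch below is one way to do that. The helper `missing_columns` is hypothetical, and the one-row `sample` frame stands in for the real `df`.

```python
import pandas as pd

# Columns listed in the Inputs section of this notebook
EXPECTED_COLUMNS = [
    "speaking_score", "reading_score", "listening_score",
    "writing_score", "overall_cefr",
]


def missing_columns(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from `frame`."""
    return [col for col in expected if col not in frame.columns]


# Stand-in for the loaded dataset (a single illustrative row)
sample = pd.DataFrame([{
    "speaking_score": 24, "reading_score": 38, "listening_score": 30,
    "writing_score": 34, "overall_cefr": "A1",
}])

print(missing_columns(sample))  # []
```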

---

## 1. Target Variable Encoding

The target variable `overall_cefr` is categorical (A1–C2).  
To use it in machine learning classification models, we need to encode it into numeric labels.  

We will apply the following mapping:

- A1 → 0  
- A2 → 1  
- B1 → 2  
- B2 → 3  
- C1 → 4  
- C2 → 5  

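This encoding preserves the natural order of proficiency levels. Note that scikit-learn classifiers treat these integers as unordered class labels, so the ordering matters mainly for interpretation and for order-aware evaluation.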

In [5]:
# Define mapping for CEFR levels
cefr_mapping = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1": 4, "C2": 5}

# Apply mapping to target column
df["cefr_encoded"] = df["overall_cefr"].map(cefr_mapping)

# Verify encoding
print("✅ Target variable encoded successfully!")
print(df[["overall_cefr", "cefr_encoded"]].head(10))

# Check value counts to confirm distribution
print("\nEncoded CEFR distribution:")
print(df["cefr_encoded"].value_counts().sort_index())

✅ Target variable encoded successfully!
  overall_cefr  cefr_encoded
0           A1             0
1           C1             4
2           B1             2
3           B1             2
4           B2             3
5           B1             2
6           A2             1
7           B2             3
8           A2             1
9           B1             2

Encoded CEFR distribution:
cefr_encoded
0    208
1    219
2    208
3    192
4     94
5     83
Name: count, dtype: int64
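
`Series.map` silently returns `NaN` for any label missing from the mapping, so the success message above would print even if some rows failed to encode. A minimal sketch of a stricter variant follows. The function name `encode_cefr` is hypothetical, and the toy series stands in for `df["overall_cefr"]`.

```python
import pandas as pd

cefr_mapping = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1": 4, "C2": 5}


def encode_cefr(levels: pd.Series) -> pd.Series:
    """Map CEFR labels to integers and fail loudly on anything unmapped."""
    encoded = levels.map(cefr_mapping)
    unmapped = levels[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped CEFR labels: {sorted(map(str, unmapped))}")
    return encoded.astype("int64")


toy = pd.Series(["A1", "C1", "B1"])
print(encode_cefr(toy).tolist())  # [0, 4, 2]
```

A label such as `"b1"` or a missing value would raise a `ValueError` here instead of becoming `NaN`.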


## 2. Feature Engineering

In this step, we create additional features from the four base skill scores (*speaking, reading, listening, writing*).  
These engineered features provide richer insights into learner profiles and enhance the model’s ability to predict CEFR levels.  

The following **8 engineered features** are created:

1. **strongest_skill** → identifies the learner’s best-performing skill.  
2. **weakest_skill** → identifies the learner’s lowest-performing skill (main bottleneck).  
3. **second_weakest_skill** → captures the learner’s secondary weakness for targeted recommendations.  
4. **skill_std** → sample standard deviation (pandas default, `ddof=1`) across the four skills, measuring balance vs imbalance.  
5. **strength_weakness_gap** → difference between strongest and weakest skill scores.  
6. **productive_receptive_ratio** → ratio of productive (speaking + writing) to receptive (reading + listening) skills.  
7. **avg_score** → mean score across all four skills, proxy for overall proficiency.  
8. **learning_profile** → categorical label derived from `strength_weakness_gap`:  
   - *Balanced* → gap at or below the 75th percentile of all learners' gaps.  
   - *Uneven Development* → gap above the 75th percentile, i.e. large discrepancies between strengths and weaknesses.  

These features not only improve predictive modeling but also support the business requirement of **automatic learner placement with personalized recommendations**.
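
As a worked example of the numeric features, take the first learner in the dataset preview, with scores 24 (speaking), 38 (reading), 30 (listening), and 34 (writing). The symbols $x_i$, $\bar{x}$, $s$, and $n$ are introduced here for illustration only, with $n = 4$ skills.

$$
\bar{x} = \frac{24 + 38 + 30 + 34}{4} = 31.5
$$

$$
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
  = \sqrt{\frac{56.25 + 42.25 + 2.25 + 6.25}{3}}
  = \sqrt{\frac{107}{3}} \approx 5.972
$$

$$
\text{gap} = 38 - 24 = 14,
\qquad
\text{ratio} = \frac{24 + 34}{38 + 30} = \frac{58}{68} \approx 0.853
$$

The divisor $n-1$ reflects the pandas default for `std`. These values correspond to `avg_score`, `skill_std`, `strength_weakness_gap`, and `productive_receptive_ratio` for that learner.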

In [13]:
import numpy as np

# List of core skill columns
skill_cols = ["speaking_score", "reading_score", "listening_score", "writing_score"]

# Strongest and weakest skills
df["strongest_skill"] = df[skill_cols].idxmax(axis=1).str.replace("_score", "")
df["weakest_skill"] = df[skill_cols].idxmin(axis=1).str.replace("_score", "")

# Second weakest skill
df["second_weakest_skill"] = df[skill_cols].apply(
    lambda row: row.sort_values().index[1].replace("_score", ""), axis=1
)

# Skill standard deviation (imbalance indicator)
df["skill_std"] = df[skill_cols].std(axis=1)

# Gap between strongest and weakest
df["strength_weakness_gap"] = df[skill_cols].max(axis=1) - df[skill_cols].min(axis=1)

# Productive (speaking+writing) vs receptive (reading+listening) ratio
df["productive_receptive_ratio"] = (
    (df["speaking_score"] + df["writing_score"]) /
    (df["reading_score"] + df["listening_score"] + 1e-6)  # avoid division by zero
)

# Average score
df["avg_score"] = df[skill_cols].mean(axis=1)

# Learning profile classification

# Determine threshold (75th percentile of gaps)
gap_threshold = df["strength_weakness_gap"].quantile(0.75)
print(f"Gap threshold for Uneven Development: {gap_threshold:.2f}")

# Create learning profile classification
df["learning_profile"] = np.where(
    df["strength_weakness_gap"] > gap_threshold,
    "Uneven Development",
    "Balanced"
)

# Preview engineered features
df.head()

Gap threshold for Uneven Development: 11.00


| | speaking_score | reading_score | listening_score | writing_score | overall_cefr | cefr_encoded | strongest_skill | weakest_skill | second_weakest_skill | skill_std | strength_weakness_gap | productive_receptive_ratio | avg_score | learning_profile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 38 | 30 | 34 | A1 | 0 | reading | speaking | listening | 5.972158 | 14 | 0.852941 | 31.5 | Uneven Development |
| 1 | 93 | 91 | 90 | 89 | C1 | 4 | speaking | writing | listening | 1.707825 | 4 | 1.005525 | 90.75 | Balanced |
| 2 | 62 | 64 | 64 | 55 | B1 | 2 | reading | writing | speaking | 4.272002 | 9 | 0.914062 | 61.25 | Balanced |
| 3 | 63 | 59 | 54 | 54 | B1 | 2 | speaking | listening | listening | 4.358899 | 9 | 1.035398 | 57.5 | Balanced |
| 4 | 79 | 74 | 85 | 79 | B2 | 3 | listening | reading | speaking | 4.5 | 11 | 0.993711 | 79.25 | Balanced |
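
In the preview above, row 3 has `listening` as both the weakest and the second weakest skill. Listening and writing are tied at 54, `idxmin` returns the first minimum, and the default sort used by `sort_values()` is not guaranteed to be stable, so the two calls can disagree on tie order. One possible fix, sketched below, derives all three labels from a single stable sort. The helper `rank_skills` is hypothetical, and ties are broken by column order, which is an arbitrary choice.

```python
import pandas as pd

skill_cols = ["speaking_score", "reading_score", "listening_score", "writing_score"]


def rank_skills(row: pd.Series) -> pd.Series:
    """Rank one learner's skills with a stable sort so ties resolve consistently."""
    # mergesort is stable: tied skills keep their original column order
    ordered = row.sort_values(kind="mergesort").index.str.replace(
        "_score", "", regex=False
    )
    return pd.Series({
        "weakest_skill": ordered[0],
        "second_weakest_skill": ordered[1],
        # With tied maxima this picks the last tied column, unlike idxmax
        "strongest_skill": ordered[-1],
    })


# Scores of row 3 from the preview (listening and writing tied at 54)
toy = pd.DataFrame([{
    "speaking_score": 63, "reading_score": 59,
    "listening_score": 54, "writing_score": 54,
}])

ranked = toy[skill_cols].apply(rank_skills, axis=1)
print(ranked.iloc[0].to_dict())
# {'weakest_skill': 'listening', 'second_weakest_skill': 'writing', 'strongest_skill': 'speaking'}
```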


## 2. Scaling / Normalization

Many machine learning models are sensitive to the scale of input features, especially distance-based methods (e.g., KNN, SVM) and models trained with gradient-based optimizers (e.g., Logistic Regression, Neural Networks); tree-based models are largely unaffected.  
Since the skill scores (0–100) and engineered features (e.g., ratios, gaps) exist on different scales, we standardize all numeric features to have:

- Mean = 0  
- Standard Deviation = 1  

This ensures that each feature contributes fairly to the model and avoids bias toward high-magnitude features.  
We use **StandardScaler** from scikit-learn for this standardization.
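
As a quick illustration of what standardization does, the sketch below checks that `StandardScaler` matches the manual formula `(x - mean) / std` with the population standard deviation (`ddof=0`). The five values are the speaking scores from the preview rows and are used only as a small example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Speaking scores of the five preview rows, as a single-column feature matrix
scores = np.array([[24.0], [93.0], [62.0], [63.0], [79.0]])

scaled = StandardScaler().fit_transform(scores)

# Manual standardization: np.std defaults to the population std (ddof=0)
manual = (scores - scores.mean(axis=0)) / scores.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

Note the difference from `skill_std`, which uses the pandas default `ddof=1`.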

In [14]:
# --- Scaling / Normalization ---
from sklearn.preprocessing import StandardScaler

# Select feature columns (exclude target)
feature_cols = [
    "speaking_score", "reading_score", "listening_score", "writing_score",
    "avg_score", "strength_weakness_gap", "skill_std",
    "productive_receptive_ratio"
]

# Initialize scaler
scaler = StandardScaler()

# Fit and transform
X_scaled = scaler.fit_transform(df[feature_cols])

# Convert back to DataFrame for readability
X_scaled_df = pd.DataFrame(X_scaled, columns=feature_cols)

# Preview scaled features
X_scaled_df.head()

| | speaking_score | reading_score | listening_score | writing_score | avg_score | strength_weakness_gap | skill_std | productive_receptive_ratio |
|---|---|---|---|---|---|---|---|---|
| 0 | -1.748439 | -1.07899 | -1.456782 | -1.270976 | -1.409432 | 1.42769 | 1.209576 | -1.584816 |
| 1 | 1.50088 | 1.393353 | 1.346852 | 1.322354 | 1.412041 | -1.225282 | -1.273163 | 0.01607 |
| 2 | 0.041041 | 0.133857 | 0.131944 | -0.280795 | 0.007257 | 0.101204 | 0.219728 | -0.94354 |
| 3 | 0.088133 | -0.099382 | -0.335328 | -0.327947 | -0.171317 | 0.101204 | 0.27032 | 0.329497 |
| 4 | 0.841598 | 0.600337 | 1.113216 | 0.85084 | 0.864413 | 0.631798 | 0.35247 | -0.107883 |


## 3. Train/Test Split (with Categorical Encoding)

In this step, we prepare the dataset for model training by ensuring the data is correctly split and all features are numerical.  

**Process:**  
1. **Split the dataset** into training and testing sets using stratified sampling to preserve the distribution of CEFR levels.  
2. **Identify categorical engineered features**:  
   - `strongest_skill`  
   - `weakest_skill`  
   - `second_weakest_skill`  
   - `learning_profile`  
3. **Encode categorical features** using **One-Hot Encoding (OHE)** to convert them into numerical dummy variables.  
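4. **Retain numeric features** without further modification. Note that the cell below builds `X` from `df`, so these columns keep their original, unscaled values; `X_scaled_df` from the scaling step is not used here.  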
5. Combine the numeric and encoded categorical features to form the final training and test sets (`X_train_final`, `X_test_final`).  

This ensures that the input dataset is fully numeric, consistent, and ready for use in the **Modeling & Evaluation notebook**.  

In [17]:
# Train/Test Split with Categorical Encoding
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Define target
y = df["cefr_encoded"]

# Features (drop raw CEFR columns)
X = df.drop(columns=["overall_cefr", "cefr_encoded"])

# Train/Test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set shape (before encoding):", X_train.shape)
print("Test set shape (before encoding):", X_test.shape)

# ---- Handle Categorical Features ---- #
categorical_features = ["strongest_skill", "weakest_skill", "second_weakest_skill", "learning_profile"]

# Initialize OneHotEncoder
ohe = OneHotEncoder(drop="first", sparse_output=False)  # drop="first" avoids dummy trap

# Fit on training set only (to avoid data leakage)
X_train_cat = ohe.fit_transform(X_train[categorical_features])
X_test_cat = ohe.transform(X_test[categorical_features])

# Get feature names after encoding
ohe_feature_names = ohe.get_feature_names_out(categorical_features)

# Convert to DataFrame
import pandas as pd
X_train_cat = pd.DataFrame(X_train_cat, columns=ohe_feature_names, index=X_train.index)
X_test_cat = pd.DataFrame(X_test_cat, columns=ohe_feature_names, index=X_test.index)

# Drop original categorical columns and join encoded ones
X_train_final = pd.concat([X_train.drop(columns=categorical_features), X_train_cat], axis=1)
X_test_final = pd.concat([X_test.drop(columns=categorical_features), X_test_cat], axis=1)

print("Training set shape (after encoding):", X_train_final.shape)
print("Test set shape (after encoding):", X_test_final.shape)

# Quick check on final dataset
X_train_final.head()

Training set shape (before encoding): (803, 12)
Test set shape (before encoding): (201, 12)
Training set shape (after encoding): (803, 18)
Test set shape (after encoding): (201, 18)


| | speaking_score | reading_score | listening_score | writing_score | skill_std | strength_weakness_gap | productive_receptive_ratio | avg_score | strongest_skill_reading | strongest_skill_speaking | strongest_skill_writing | weakest_skill_reading | weakest_skill_speaking | weakest_skill_writing | second_weakest_skill_reading | second_weakest_skill_speaking | second_weakest_skill_writing | learning_profile_Uneven Development |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 575 | 32 | 40 | 27 | 25 | 6.683313 | 15 | 0.850746 | 31.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 576 | 26 | 34 | 31 | 39 | 5.446712 | 13 | 1.0 | 32.5 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 914 | 80 | 74 | 78 | 73 | 3.304038 | 7 | 1.006579 | 76.25 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 665 | 81 | 79 | 83 | 72 | 4.787136 | 11 | 0.944444 | 78.75 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 632 | 65 | 72 | 64 | 65 | 3.696846 | 8 | 0.955882 | 66.5 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |


## 4. Final Feature Set Overview  

In this step, we will validate the final dataset before moving to the **Modeling & Evaluation** stage. Specifically, we will:  

- Inspect the **feature matrix (`X`)** and **target (`y`)**.  
- Confirm the **dimensions** of the train/test splits.  
- Preview the **features**. Note that the cell below inspects `X_train` and `X_test`, the splits before one-hot encoding, whose numeric columns are unscaled; use `X_train_final` and `X_test_final` to inspect the encoded matrices.  
- Verify that **stratified sampling** preserved the distribution of CEFR levels in both training and testing sets.

This step ensures that the dataset is properly structured and that both splits share the same class proportions, which is essential for reliable model training and evaluation.  
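
Raw counts are hard to compare between splits of different sizes, so the stratification check can also be done on proportions. The sketch below rebuilds a label vector from the class counts reported in the encoding step and compares train and test shares side by side. The `comparison` frame and its column names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Class counts taken from the encoded CEFR distribution shown earlier
counts = {0: 208, 1: 219, 2: 208, 3: 192, 4: 94, 5: 83}
y = pd.Series(
    [label for label, n in counts.items() for _ in range(n)],
    name="cefr_encoded",
)

# Same split settings as the notebook: 20% test, stratified on the target
y_train, y_test = train_test_split(y, test_size=0.2, random_state=42, stratify=y)

comparison = pd.DataFrame({
    "train": y_train.value_counts(normalize=True).sort_index(),
    "test": y_test.value_counts(normalize=True).sort_index(),
})
comparison["abs_diff"] = (comparison["train"] - comparison["test"]).abs()

print(comparison.round(3))
```

Small values in `abs_diff` indicate that each CEFR level has nearly the same share in both splits, even though the levels themselves are not equally frequent.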

In [19]:
# Check shapes of train/test splits
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

# Preview feature matrix (first 5 rows)
print("\nSample of transformed features (X_train):")
print(X_train[:5])


unique, counts = np.unique(y_train, return_counts=True)
train_dist = dict(zip(unique, counts))
unique, counts = np.unique(y_test, return_counts=True)
test_dist = dict(zip(unique, counts))

print("\nTarget distribution (train):", train_dist)
print("Target distribution (test):", test_dist)


X_train shape: (803, 12)
X_test shape: (201, 12)
y_train shape: (803,)
y_test shape: (201,)

Sample of transformed features (X_train):
     speaking_score  reading_score  listening_score  writing_score  \
575              32             40               27             25   
576              26             34               31             39   
914              80             74               78             73   
665              81             79               83             72   
632              65             72               64             65   

    strongest_skill weakest_skill second_weakest_skill  skill_std  \
575         reading       writing            listening   6.683313   
576         writing      speaking            listening   5.446712   
914        speaking       writing              reading   3.304038   
665       listening       writing              reading   4.787136   
632         reading     listening             speaking   3.696846   

     strength_weakness_gap  p

---