# 04. Feature engineering

## Objectives

The goal of this notebook is to prepare the dataset for machine learning by creating, transforming, and encoding features that improve the model’s ability to predict CEFR levels. Specifically, we aim to:

- Encode the categorical target variable (CEFR levels) into numerical labels suitable for classification.  
- Ensure feature scaling/normalization so that all skill scores contribute fairly to the model.  
- Explore potential derived features (e.g., average score, skill differences) that may enhance predictive performance.  
- Split the dataset into training and test sets with stratification to preserve class distribution.  
- Generate a final feature matrix (`X`) and target vector (`y`) ready for model training.  


## Inputs

- Cleaned dataset: `data/clean/cleaned_lang_proficiency_results.csv`  
- Columns: `speaking_score`, `reading_score`, `listening_score`, `writing_score`, `overall_cefr`  


## Outputs

- Encoded target labels for CEFR levels  
- Scaled/normalized feature set  
- Optional engineered features (e.g., mean score, modality balance)  
- Train/test splits saved for modeling  
- Final processed dataset in a format ready for the ML notebook  


## Additional Information

Feature engineering bridges the gap between raw data and machine learning readiness. Since the business requirement is **automatic learner placement and personalized recommendations**, ensuring that CEFR levels can be predicted accurately depends on well-prepared features. In this step, we transform the raw language skill scores into an optimized input space for classification models, laying the foundation for robust and interpretable predictions.  

---

# Project Directory Structure

## Change working directory

We need to change the working directory from the notebook's folder to the project root, where the project's code and data are located.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\husse\\OneDrive\\Projects\\lang-level-pred\\jupyter_notebooks'

In [2]:
from pathlib import Path

# switch to project root directory
project_root = Path.cwd().parent
os.chdir(project_root)
print(f"Working directory: {os.getcwd()}")

Working directory: c:\Users\husse\OneDrive\Projects\lang-level-pred
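
`Path.cwd().parent` only points at the project root when the cell runs from inside `jupyter_notebooks`; re-running the cell moves one level higher each time. A minimal sketch of a more defensive lookup follows, assuming the project root can be recognised by its `data` folder. The helper name `find_project_root` and the choice of marker are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found.

    Assumption: the project root is the closest ancestor with a `data` folder.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")
```

Calling `os.chdir(find_project_root(Path.cwd()))` would then give the same result no matter how often the cell is executed.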


---

# Data loading
This code block imports the core Python libraries for data analysis and visualization and prints their versions.

- pandas: For data manipulation and analysis
- numpy: For numerical computations
- matplotlib: For creating visualizations and plots
- seaborn: For creating statistical graphics from datasets

The version checks help ensure:
- Code compatibility across different environments
- Reproducibility of analysis
- Easy debugging of version-specific issues

In [None]:
# Import data analysis tools
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns


print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")
print(f"seaborn version: {sns.__version__}")

### List Files and Folders
- This code lists the files and folders inside `data/clean`, relative to the current working directory.

In [3]:
import os
from pathlib import Path

dataset_dir = Path("data/clean")
print(f"[INFO] Files/folders available in {dataset_dir}:")
os.listdir(dataset_dir)

[INFO] Files/folders available in data\clean:


['cleaned_lang_proficiency_results.csv']

## Load dataset
This code loads the cleaned dataset from the `data/clean` folder into a DataFrame and displays its first five rows.

In [41]:
import pandas as pd
from pathlib import Path

# Define the path to the CSV file
file_path = Path("data/clean/cleaned_lang_proficiency_results.csv")

# Read the CSV file
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,speaking_score,reading_score,listening_score,writing_score,overall_cefr
0,24,38,30,34,A1
1,93,91,90,89,C1
2,62,64,64,55,B1
3,63,59,54,54,B1
4,79,74,85,79,B2
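
Before engineering features, it can help to confirm that the loaded file matches the schema listed under Inputs. The sketch below is one possible check; the helper name `validate_input` is hypothetical, and the 0 to 100 score range is taken from the scale this notebook describes for the skill scores.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "speaking_score", "reading_score", "listening_score", "writing_score", "overall_cefr",
]
SCORE_COLUMNS = EXPECTED_COLUMNS[:4]


def validate_input(frame: pd.DataFrame) -> None:
    """Raise ValueError if columns are missing, values are null, or scores leave 0-100."""
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    if frame[EXPECTED_COLUMNS].isna().any().any():
        raise ValueError("Dataset contains missing values")
    # Assumption: every skill score lies on a 0-100 scale.
    in_range = frame[SCORE_COLUMNS].apply(lambda s: s.between(0, 100)).all(axis=1)
    if not in_range.all():
        raise ValueError("Some scores fall outside the 0-100 range")


# First two rows of the preview above, used as a small sample.
sample = pd.DataFrame({
    "speaking_score": [24, 93],
    "reading_score": [38, 91],
    "listening_score": [30, 90],
    "writing_score": [34, 89],
    "overall_cefr": ["A1", "C1"],
})
validate_input(sample)  # passes silently
```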


---

## 1. Target Variable Encoding

The target variable `overall_cefr` is categorical (A1–C2).  
To use it in machine learning classification models, we need to encode it into numeric labels.  

We will apply the following mapping:

- A1 → 0  
- A2 → 1  
- B1 → 2  
- B2 → 3  
- C1 → 4  
- C2 → 5  

This preserves the natural order of proficiency levels while making the target compatible with scikit-learn classifiers.

In [42]:
# Define mapping for CEFR levels
cefr_mapping = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1": 4, "C2": 5}

# Apply mapping to target column
df["cefr_encoded"] = df["overall_cefr"].map(cefr_mapping)

df.info()

# Verify encoding
print("✅ Target variable encoded successfully!")
print(df[["overall_cefr", "cefr_encoded"]].head(10))

# Check value counts to confirm distribution
print("\nEncoded CEFR distribution:")
print(df["cefr_encoded"].value_counts().sort_index())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004 entries, 0 to 1003
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   speaking_score   1004 non-null   int64 
 1   reading_score    1004 non-null   int64 
 2   listening_score  1004 non-null   int64 
 3   writing_score    1004 non-null   int64 
 4   overall_cefr     1004 non-null   object
 5   cefr_encoded     1004 non-null   int64 
dtypes: int64(5), object(1)
memory usage: 47.2+ KB
✅ Target variable encoded successfully!
  overall_cefr  cefr_encoded
0           A1             0
1           C1             4
2           B1             2
3           B1             2
4           B2             3
5           B1             2
6           A2             1
7           B2             3
8           A2             1
9           B1             2

Encoded CEFR distribution:
cefr_encoded
0    208
1    219
2    208
3    192
4     94
5     83
Name: count, dtype: int64
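
`Series.map` silently returns `NaN` for any label that is not in the mapping (for example a stray `"b1"` or a trailing space). In this dataset every row was mapped, as the non-null count of 1004 shows, but a guard makes that assumption explicit. A minimal sketch, with `encode_cefr` as a hypothetical helper name:

```python
import pandas as pd

cefr_mapping = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1": 4, "C2": 5}


def encode_cefr(labels: pd.Series) -> pd.Series:
    """Map CEFR labels to ordinal integers and fail loudly on unknown labels."""
    encoded = labels.map(cefr_mapping)
    unmapped = labels[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped CEFR labels: {sorted(map(str, unmapped))}")
    return encoded.astype("int64")


print(encode_cefr(pd.Series(["A1", "C1", "B1"])).tolist())  # [0, 4, 2]
```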


## 2. Feature Engineering

In this step, we create additional features from the four base skill scores (*speaking, reading, listening, writing*).  
These engineered features provide richer insights into learner profiles and enhance the model’s ability to predict CEFR levels.

The following **13 engineered features** are created:

1. **strongest_skill** → identifies the learner’s best-performing skill.  
2. **weakest_skill** → identifies the learner’s lowest-performing skill (main bottleneck).  
3. **second_weakest_skill** → captures the learner’s secondary weakness for targeted recommendations.  
4. **skill_std** → standard deviation across the four skills, measuring balance vs imbalance.  
5. **strength_weakness_gap** → difference between strongest and weakest skill scores.  
6. **productive_receptive_ratio** → ratio of productive (speaking + writing) to receptive (reading + listening) skills.  
7. **learning_profile** → categorical label based on `strength_weakness_gap`:  
   - *Balanced* → gap at or below the 75th percentile of all learners' gaps.  
   - *Uneven Development* → gap above that threshold, i.e. a large discrepancy between the strongest and weakest skill.  
8. **speaking_minus_avg** → difference between speaking score and learner’s average.  
9. **reading_minus_avg** → difference between reading score and learner’s average.  
10. **listening_minus_avg** → difference between listening score and learner’s average.  
11. **writing_minus_avg** → difference between writing score and learner’s average.  
12. **speaking_to_reading** → ratio of speaking score to reading score.  
13. **writing_to_listening** → ratio of writing score to listening score.
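
As a worked example of the numeric definitions above, the sketch below computes several of them by hand for the first learner in the cleaned dataset (speaking 24, reading 38, listening 30, writing 34). It is an illustration only and omits the small `1e-6` guard against division by zero.

```python
import pandas as pd

learner = pd.Series({
    "speaking_score": 24,
    "reading_score": 38,
    "listening_score": 30,
    "writing_score": 34,
})

mean_score = learner.mean()            # (24 + 38 + 30 + 34) / 4 = 31.5
skill_std = learner.std()              # sample standard deviation (ddof=1), the pandas default
gap = learner.max() - learner.min()    # 38 - 24 = 14
ratio = (learner["speaking_score"] + learner["writing_score"]) / (
    learner["reading_score"] + learner["listening_score"]
)                                      # 58 / 68
speaking_minus_avg = learner["speaking_score"] - mean_score  # 24 - 31.5 = -7.5

print(round(skill_std, 3), gap, round(ratio, 3), speaking_minus_avg)
```

Note that `skill_std` uses the sample standard deviation (dividing by n - 1 = 3), which is what `DataFrame.std(axis=1)` computes by default.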

**Important: Preventing Data Leakage**  
The raw scores (*speaking_score, reading_score, listening_score, writing_score*) are highly correlated with CEFR and would cause the model to “cheat” by memorizing direct score-to-level mappings.  
To ensure fairness and generalization, we **drop the raw scores after feature engineering**, keeping only the derived features listed above.  

This ensures the model learns from **relative skill patterns and learner profiles**, not from absolute exam results.


In [43]:
import numpy as np
import pandas as pd

# List of core skill columns
skill_cols = ["speaking_score", "reading_score", "listening_score", "writing_score"]

# Strongest and weakest skills
df["strongest_skill"] = df[skill_cols].idxmax(axis=1).str.replace("_score", "")
df["weakest_skill"] = df[skill_cols].idxmin(axis=1).str.replace("_score", "")

# Second weakest skill
df["second_weakest_skill"] = df[skill_cols].apply(
    lambda row: row.sort_values().index[1].replace("_score", ""), axis=1
)

# Skill standard deviation (imbalance indicator)
df["skill_std"] = df[skill_cols].std(axis=1)

# Gap between strongest and weakest
df["strength_weakness_gap"] = df[skill_cols].max(axis=1) - df[skill_cols].min(axis=1)

# Productive (speaking+writing) vs receptive (reading+listening) ratio
df["productive_receptive_ratio"] = (
    (df["speaking_score"] + df["writing_score"]) /
    (df["reading_score"] + df["listening_score"] + 1e-6)  # avoid division by zero
)

# Learning profile classification
gap_threshold = df["strength_weakness_gap"].quantile(0.75)
print(f"Gap threshold for Uneven Development: {gap_threshold:.2f}")

df["learning_profile"] = np.where(
    df["strength_weakness_gap"] > gap_threshold,
    "Uneven Development",
    "Balanced"
)

# 🔧 New engineered features

# Relative strength of each skill compared to learner’s average
df["speaking_minus_avg"] = df["speaking_score"] - df[skill_cols].mean(axis=1)
df["reading_minus_avg"] = df["reading_score"] - df[skill_cols].mean(axis=1)
df["listening_minus_avg"] = df["listening_score"] - df[skill_cols].mean(axis=1)
df["writing_minus_avg"] = df["writing_score"] - df[skill_cols].mean(axis=1)

# Extra ratios (pairwise imbalances)
df["speaking_to_reading"] = df["speaking_score"] / (df["reading_score"] + 1e-6)
df["writing_to_listening"] = df["writing_score"] / (df["listening_score"] + 1e-6)


# ✅ Drop raw scores AFTER all feature engineering
df = df.drop(columns=skill_cols)

# Preview final engineered dataset
df.head()

Gap threshold for Uneven Development: 11.00


Unnamed: 0,overall_cefr,cefr_encoded,strongest_skill,weakest_skill,second_weakest_skill,skill_std,strength_weakness_gap,productive_receptive_ratio,learning_profile,speaking_minus_avg,reading_minus_avg,listening_minus_avg,writing_minus_avg,speaking_to_reading,writing_to_listening
0,A1,0,reading,speaking,listening,5.972158,14,0.852941,Uneven Development,-7.5,6.5,-1.5,2.5,0.631579,1.133333
1,C1,4,speaking,writing,listening,1.707825,4,1.005525,Balanced,2.25,0.25,-0.75,-1.75,1.021978,0.988889
2,B1,2,reading,writing,speaking,4.272002,9,0.914062,Balanced,0.75,2.75,2.75,-6.25,0.96875,0.859375
3,B1,2,speaking,listening,listening,4.358899,9,1.035398,Balanced,5.5,1.5,-3.5,-3.5,1.067797,1.0
4,B2,3,listening,reading,speaking,4.5,11,0.993711,Balanced,-0.25,-5.25,5.75,-0.25,1.067568,0.929412
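
In the preview above, row 3 lists `listening` as both the weakest and the second-weakest skill. That learner's listening and writing scores are tied, and `idxmin` and `sort_values` resolve ties independently; the default sort is not guaranteed to be stable, which is the likely cause. A hedged sketch of one way to rank skills consistently uses a single stable sort, so ties are always broken by column order. The helper name `rank_weakest_skills` is hypothetical.

```python
import numpy as np
import pandas as pd

skill_cols = ["speaking_score", "reading_score", "listening_score", "writing_score"]


def rank_weakest_skills(scores: pd.DataFrame) -> pd.DataFrame:
    """Return weakest and second-weakest skill per row, breaking ties by column order."""
    names = np.array([col.replace("_score", "") for col in scores.columns])
    # A stable sort keeps tied scores in their original column order.
    order = np.argsort(scores.to_numpy(), axis=1, kind="stable")
    return pd.DataFrame(
        {
            "weakest_skill": names[order[:, 0]],
            "second_weakest_skill": names[order[:, 1]],
        },
        index=scores.index,
    )


# Scores consistent with row 3 of the preview: listening and writing tied.
tied = pd.DataFrame([[63, 59, 54, 54]], columns=skill_cols)
print(rank_weakest_skills(tied))
```

With this approach the tied learner gets `listening` as weakest and `writing` as second weakest, so the two columns never name the same skill.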


## 3. Scaling / Normalization

Many machine learning models are sensitive to the scale of input features, especially distance-based methods (e.g., KNN, SVM) and models trained with gradient-based optimizers (e.g., Logistic Regression, Neural Networks).  
Since the raw skill scores have been dropped and the engineered features (e.g., gaps, standard deviations, ratios) exist on different scales, we standardize the selected numeric features to have:

- Mean = 0  
- Standard Deviation = 1  

This ensures that each feature contributes fairly to the model and avoids bias toward high-magnitude features.  
We use **StandardScaler** from scikit-learn for normalization.

In [50]:
# --- Scaling / Normalization ---
from sklearn.preprocessing import StandardScaler

# Select the numeric feature columns to scale (target excluded).
# Note: speaking_to_reading and writing_to_listening are not listed here,
# so they keep their original (unscaled) values.
feature_cols = [
    "strength_weakness_gap", "skill_std",
    "productive_receptive_ratio",
    "speaking_minus_avg", "reading_minus_avg", "listening_minus_avg", "writing_minus_avg",
]

# Initialize scaler
scaler = StandardScaler()

# Fit and transform
X_scaled = scaler.fit_transform(df[feature_cols])

# Convert back to DataFrame for readability
X_scaled_df = pd.DataFrame(X_scaled, columns=feature_cols)

# ✅ Replace original columns with scaled ones
df[feature_cols] = X_scaled_df

# Preview scaled features
X_scaled_df.head()

Unnamed: 0,strength_weakness_gap,skill_std,productive_receptive_ratio,speaking_minus_avg,reading_minus_avg,listening_minus_avg,writing_minus_avg
0,1.42769,1.209576,-1.584816,-2.0695,1.751171,-0.421848,0.720782
1,-1.225282,-1.273163,0.01607,0.60982,0.058795,-0.221437,-0.438501
2,0.101204,0.219728,-0.94354,0.197617,0.735745,0.713814,-1.665976
3,0.101204,0.27032,0.329497,1.502926,0.39727,-0.956277,-0.915852
4,0.631798,0.35247,-0.107883,-0.077185,-1.430496,1.515458,-0.029342
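
The cell above fits the scaler on the full dataset, before the train/test split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to estimate those statistics on the training rows only and reuse them for the test rows, i.e. z = (x - mean_train) / std_train. The sketch below spells this out with plain pandas; `standardize_with_train_stats` is a hypothetical helper, and it uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import pandas as pd


def standardize_with_train_stats(train: pd.DataFrame, test: pd.DataFrame):
    """Standardize both frames using the mean and std estimated on `train` only."""
    mean = train.mean()
    std = train.std(ddof=0)        # population std, matching StandardScaler
    std = std.replace(0, 1.0)      # leave constant columns unscaled instead of dividing by zero
    return (train - mean) / std, (test - mean) / std


# Toy values, for illustration only.
train = pd.DataFrame({"strength_weakness_gap": [4.0, 9.0, 14.0]})
test = pd.DataFrame({"strength_weakness_gap": [11.0]})
train_z, test_z = standardize_with_train_stats(train, test)
print(train_z["strength_weakness_gap"].round(3).tolist())
print(test_z["strength_weakness_gap"].round(3).tolist())
```

With scikit-learn the same idea is `scaler.fit(X_train[feature_cols])` followed by `scaler.transform(...)` on both splits, which requires splitting before scaling.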


## 4. Train/Test Split (with Categorical Encoding)

In this step, we prepare the dataset for model training by ensuring the data is correctly split and all features are numerical.  

**Process:**  
1. **Split the dataset** into training and testing sets using stratified sampling to preserve the distribution of CEFR levels.  
2. **Identify categorical engineered features**:  
   - `strongest_skill`  
   - `weakest_skill`  
   - `second_weakest_skill`  
   - `learning_profile`  
3. **Encode categorical features** using **One-Hot Encoding (OHE)** to convert them into numerical dummy variables.  
4. **Retain numeric features** (scaled in the Scaling / Normalization step) without further modification.  
5. Combine the numeric and encoded categorical features to form the final training and test sets (`X_train_final`, `X_test_final`).  

This ensures that the input dataset is fully numeric, consistent, and ready for use in the **Modeling & Evaluation notebook**.  

In [None]:
# Train/Test Split with Categorical Encoding
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Define target
y = df["cefr_encoded"]

# Features (drop raw CEFR columns)
X = df.drop(columns=["overall_cefr", "cefr_encoded"])

# Train/Test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set shape (before encoding):", X_train.shape)
print("Test set shape (before encoding):", X_test.shape)

# ---- Handle Categorical Features ---- #
categorical_features = ["strongest_skill", "weakest_skill", "second_weakest_skill", "learning_profile"]

# Initialize OneHotEncoder
ohe = OneHotEncoder(drop="first", sparse_output=False)  # drop="first" avoids dummy trap

# Fit on training set only (to avoid data leakage)
X_train_cat = ohe.fit_transform(X_train[categorical_features])
X_test_cat = ohe.transform(X_test[categorical_features])

# Get feature names after encoding
ohe_feature_names = ohe.get_feature_names_out(categorical_features)

# Convert to DataFrame
import pandas as pd
X_train_cat = pd.DataFrame(X_train_cat, columns=ohe_feature_names, index=X_train.index)
X_test_cat = pd.DataFrame(X_test_cat, columns=ohe_feature_names, index=X_test.index)

# Drop original categorical columns and join encoded ones
X_train_final = pd.concat([X_train.drop(columns=categorical_features), X_train_cat], axis=1)
X_test_final = pd.concat([X_test.drop(columns=categorical_features), X_test_cat], axis=1)

print("Training set shape (after encoding):", X_train_final.shape)
print("Test set shape (after encoding):", X_test_final.shape)

# Quick check on final dataset
X_train_final.head()

Training set shape (before encoding): (803, 13)
Test set shape (before encoding): (201, 13)
Training set shape (after encoding): (803, 19)
Test set shape (after encoding): (201, 19)


Unnamed: 0,skill_std,strength_weakness_gap,productive_receptive_ratio,speaking_minus_avg,reading_minus_avg,listening_minus_avg,writing_minus_avg,speaking_to_reading,writing_to_listening,strongest_skill_reading,strongest_skill_speaking,strongest_skill_writing,weakest_skill_reading,weakest_skill_speaking,weakest_skill_writing,second_weakest_skill_reading,second_weakest_skill_speaking,second_weakest_skill_writing,learning_profile_Uneven Development
575,1.623618,1.692987,-1.607845,0.266317,2.428122,-1.089884,-1.597783,0.8,0.925926,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
576,0.903656,1.162393,-0.041896,-1.794698,0.39727,-0.421848,1.811872,0.764706,1.258064,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
914,-0.343831,-0.42939,0.027129,1.022023,-0.618156,0.4466,-0.847659,1.081081,0.935897,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
665,0.519644,0.631798,-0.624777,0.60982,0.058795,1.114636,-1.802363,1.025316,0.86747,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
632,-0.115134,-0.164093,-0.504772,-0.420688,1.480391,-0.689062,-0.370308,0.902778,1.015625,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


## 5. Final Feature Set Overview  

In this step, we will validate the final dataset before moving to the **Modelling & Evaluation** stage. Specifically, we will:  

- Inspect the final **feature matrix (`X`)** and **target (`y`)**.  
- Confirm the **dimensions** of the train/test splits after preprocessing.  
- Preview the **transformed features** to check that both numerical (scaled) and categorical (one-hot encoded) variables are included.  
- Verify that **stratified sampling** preserved the distribution of CEFR levels in both training and testing sets.

This step ensures that the dataset is properly structured and balanced, which is essential for reliable model training and evaluation.  

In [52]:
# Check shapes of train/test splits
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

# Preview feature matrix (first 5 rows)
print("\nSample of transformed features (X_train):")
print(X_train[:5])


unique, counts = np.unique(y_train, return_counts=True)
train_dist = dict(zip(unique, counts))
unique, counts = np.unique(y_test, return_counts=True)
test_dist = dict(zip(unique, counts))

print("\nTarget distribution (train):", train_dist)
print("Target distribution (test):", test_dist)


X_train shape: (803, 13)
X_test shape: (201, 13)
y_train shape: (803,)
y_test shape: (201,)

Sample of transformed features (X_train):
    strongest_skill weakest_skill second_weakest_skill  skill_std  \
575         reading       writing            listening   1.623618   
576         writing      speaking            listening   0.903656   
914        speaking       writing              reading  -0.343831   
665       listening       writing              reading   0.519644   
632         reading     listening             speaking  -0.115134   

     strength_weakness_gap  productive_receptive_ratio    learning_profile  \
575               1.692987                   -1.607845  Uneven Development   
576               1.162393                   -0.041896  Uneven Development   
914              -0.429390                    0.027129            Balanced   
665               0.631798                   -0.624777            Balanced   
632              -0.164093                   -0.504772      
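
The preview above prints `X_train`, which still holds the text-valued categorical columns; the one-hot encoded matrix is `X_train_final`. For the stratification check, comparing class shares rather than raw counts makes the train and test distributions directly comparable. A minimal sketch, with `class_shares` as a hypothetical helper and toy labels in place of the real split:

```python
import pandas as pd


def class_shares(y_train: pd.Series, y_test: pd.Series) -> pd.DataFrame:
    """Return per-class proportions in each split and their absolute difference."""
    shares = pd.DataFrame({
        "train": y_train.value_counts(normalize=True),
        "test": y_test.value_counts(normalize=True),
    }).fillna(0.0).sort_index()
    shares["abs_diff"] = (shares["train"] - shares["test"]).abs()
    return shares


# Toy labels for illustration; in the notebook these would be y_train and y_test.
toy_train = pd.Series([0, 0, 1, 1, 2, 2, 2, 2])
toy_test = pd.Series([0, 1, 2, 2])
print(class_shares(toy_train, toy_test))
```

With a stratified split, `abs_diff` should stay small for every class; a class that appears in only one split shows up with a share of 0 in the other.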

## 6. Save Processed Data  

In this step, we will save the **final preprocessed dataset** so that it can be reused in the **Modelling & Evaluation notebook** without repeating all preprocessing steps.  

We will:  
- Save the full **feature matrix (`X`)** and **target (`y`)** into CSV files. `X` is the matrix from before the split, so its categorical columns are still stored as text rather than one-hot encoded.  
- Store them inside the `data/processed/` folder for good project structure.  
- Confirm the save by printing the shapes of the saved objects.  

This ensures reproducibility and keeps a clear separation between **raw**, **intermediate**, and **processed** data.  

In [53]:
import os

# Create processed data folder if it doesn't exist
os.makedirs("data/processed", exist_ok=True)

# Save processed features and target
X.to_csv("data/processed/features.csv", index=False)
y.to_csv("data/processed/target.csv", index=False)

print("✅ Processed data saved successfully!")

# Quick validation
print("Features shape:", X.shape)
print("Target shape:", y.shape)

✅ Processed data saved successfully!
Features shape: (1004, 13)
Target shape: (1004,)
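
The cell above writes `X` and `y` but does not read them back, and the train/test splits listed under Outputs are not written to disk. A hedged sketch of a save-and-reload check follows; `save_and_verify` is a hypothetical helper, and the demo writes to a temporary folder rather than `data/processed/`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_and_verify(frame: pd.DataFrame, path: Path) -> pd.DataFrame:
    """Write `frame` to CSV, reload it, and check that the shape survived the round trip."""
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    if reloaded.shape != frame.shape:
        raise ValueError(f"Shape mismatch for {path}: {reloaded.shape} vs {frame.shape}")
    return reloaded


with tempfile.TemporaryDirectory() as tmp:
    # Toy frame for illustration only.
    features = pd.DataFrame({
        "skill_std": [1.62, 0.90],
        "learning_profile": ["Uneven Development", "Balanced"],
    })
    reloaded = save_and_verify(features, Path(tmp) / "processed" / "features.csv")
    print(reloaded.shape)  # (2, 2)
```

In the notebook the same helper could be applied to `X_train_final`, `X_test_final`, `y_train.to_frame()`, and `y_test.to_frame()`, with file names of your choosing, so the modeling notebook loads exactly the split created here.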


---

# Conclusion

## 7. Summary & Next Steps  

In this notebook, we:  

- Imported and verified the **cleaned dataset**.  
- Encoded the **target variable (CEFR levels)** into numerical labels.  
- Engineered **13 new features** to capture learner skill profiles (e.g., strongest/weakest skill, learning profile, skill gaps, and ratios).  
- Scaled numeric features to ensure consistent ranges.  
- Encoded categorical engineered features using **One-Hot Encoding (OHE)**.  
- Performed a **stratified train/test split** to preserve CEFR class balance.  
- Saved the **processed dataset** for future use.  

✅ At this point, we have a **ready-to-use dataset** with features (`X`) and target (`y`) prepared for Machine Learning.  

### 🚀 Next Steps (Modelling & Evaluation Notebook)
In the next notebook, we will:  
- Load the processed features and target.  
- Build and train ML models (classification).  
- Evaluate model performance (classification report, confusion matrix, etc.).  
- Interpret results in terms of the **business requirements**.  
- Prepare the foundation for generating **personalised learning recommendations**.  