# 04. Feature engineering

## Objectives

The goal of this notebook is to prepare the dataset for machine learning by creating, transforming, and encoding features that improve the model’s ability to predict CEFR levels. Specifically, we aim to:

- Encode the categorical target variable (CEFR levels) into numerical labels suitable for classification.  
- Ensure feature scaling/normalization so that all skill scores contribute fairly to the model.  
- Explore potential derived features (e.g., average score, skill differences) that may enhance predictive performance.  
- Split the dataset into training and test sets with stratification to preserve class distribution.  
- Generate a final feature matrix (`X`) and target vector (`y`) ready for model training.  
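
The encoding and derived-feature objectives above can be sketched as follows. This is a minimal illustration, not the notebook's final implementation: the toy rows are hypothetical, the six-level CEFR ordering (`A1` to `C2`) is an assumption about which levels appear in the data, and the column names `cefr_label`, `mean_score`, and `skill_range` are made up for the example.

```python
import pandas as pd

# Hypothetical toy rows; the real data comes from the cleaned CSV.
df = pd.DataFrame({
    "speaking_score": [35, 62, 88],
    "reading_score": [40, 58, 91],
    "listening_score": [38, 65, 85],
    "writing_score": [30, 60, 90],
    "overall_cefr": ["A2", "B1", "C1"],
})

SKILLS = ["speaking_score", "reading_score", "listening_score", "writing_score"]

# CEFR levels are ordered, so an explicit mapping preserves that order.
# Assumption: the standard six levels; adjust if the dataset uses fewer.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]
cefr_to_int = {level: i for i, level in enumerate(CEFR_ORDER)}

# Encoded target (illustrative column name).
df["cefr_label"] = df["overall_cefr"].map(cefr_to_int)

# Derived features (illustrative names): average score and spread across skills.
df["mean_score"] = df[SKILLS].mean(axis=1)
df["skill_range"] = df[SKILLS].max(axis=1) - df[SKILLS].min(axis=1)

print(df[["overall_cefr", "cefr_label", "mean_score", "skill_range"]])
```

An explicit mapping is used here instead of an automatic label encoder so that the integer labels follow the CEFR progression rather than alphabetical order, which happens to coincide for these level names but is safer to state directly.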


## Inputs

- Cleaned dataset: `data/clean/cleaned_lang_proficiency_results.csv`  
- Columns: `speaking_score`, `reading_score`, `listening_score`, `writing_score`, `overall_cefr`  


## Outputs

- Encoded target labels for CEFR levels  
- Scaled/normalized feature set  
- Optional engineered features (e.g., mean score, modality balance)  
- Train/test splits saved for modeling  
- Final processed dataset in a format ready for the ML notebook  
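
One way the split and scaling outputs might be produced is sketched below. It uses only pandas on hypothetical toy data; scikit-learn's `train_test_split(..., stratify=y)` and `StandardScaler` are the more common tools for this and may well be what the modeling notebook uses. The 80/20 ratio and the random seed are assumptions for the example.

```python
import pandas as pd

SKILLS = ["speaking_score", "reading_score", "listening_score", "writing_score"]

# Hypothetical toy data: 10 learners per CEFR level.
df = pd.DataFrame(
    [
        {
            "speaking_score": base + i,
            "reading_score": base + (3 * i) % 10,
            "listening_score": base + (7 * i) % 10,
            "writing_score": base + 9 - i,
            "overall_cefr": level,
        }
        for level, base in zip(["A2", "B1", "C1"], [30, 55, 85])
        for i in range(10)
    ]
)

# Stratified split: draw 20% of the rows within each CEFR level for testing.
test_df = df.groupby("overall_cefr").sample(frac=0.2, random_state=42)
train_df = df.drop(test_df.index)

# Standardize with training-set statistics only, so no test information leaks.
mean = train_df[SKILLS].mean()
std = train_df[SKILLS].std(ddof=0)  # population std, as StandardScaler uses
X_train = (train_df[SKILLS] - mean) / std
X_test = (test_df[SKILLS] - mean) / std
y_train = train_df["overall_cefr"]
y_test = test_df["overall_cefr"]

print(y_train.value_counts().sort_index())
print(y_test.value_counts().sort_index())
```

Fitting the scaling statistics on the training rows and reusing them for the test rows keeps the evaluation honest; computing them on the full dataset would let test-set information influence the features.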


## Additional Information

Feature engineering bridges the gap between raw data and machine learning readiness. Since the business requirement is **automatic learner placement and personalized recommendations**, ensuring that CEFR levels can be predicted accurately depends on well-prepared features. In this step, we transform the raw language skill scores into an optimized input space for classification models, laying the foundation for robust and interpretable predictions.  

---

# Project Directory Structure

## Change working directory

We need to change the working directory from the notebook's folder to the project root, so that relative paths such as `data/clean` resolve correctly.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\husse\\OneDrive\\Projects\\lang-level-pred\\jupyter_notebooks'

In [2]:
from pathlib import Path

# Switch to the project root directory
project_root = Path.cwd().parent
os.chdir(project_root)
print(f"Working directory: {os.getcwd()}")

Working directory: c:\Users\husse\OneDrive\Projects\lang-level-pred
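
Because the cell above moves one level up each time it runs, re-running it would leave the project root. A possible safeguard is sketched here: a hypothetical helper, `find_project_root`, walks upward until it finds a folder containing a marker directory. Using `data` as the marker is an assumption based on the `data/clean` path used in this notebook.

```python
import os
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Return the nearest folder at or above `start` that contains `marker`."""
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(
        f"No folder at or above {start} contains a '{marker}' directory"
    )
```

Calling `os.chdir(find_project_root(Path.cwd()))` would give the same result whether the cell is run from `jupyter_notebooks` or from the project root itself, so repeated execution is harmless.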


---

# Data loading
This code block imports the core Python libraries for data analysis and visualization and prints their versions:

- pandas: For data manipulation and analysis
- numpy: For numerical computations
- matplotlib: For creating visualizations and plots
- seaborn: For statistical data visualization built on matplotlib

The version checks help ensure:
- Code compatibility across different environments
- Reproducibility of analysis
- Easy debugging of version-specific issues

In [None]:
# Import data analysis tools
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns


print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")
print(f"seaborn version: {sns.__version__}")
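
As a complement to the printed versions, the sketch below collects them through the standard library's `importlib.metadata`, which also reports a missing package instead of failing on import. The helper name `installed_versions` is hypothetical and not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(["pandas", "numpy", "matplotlib", "seaborn"])
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

A dictionary like `report` could be stored alongside the results to make it easier to reproduce the analysis in another environment.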

### List Files and Folders
- This code lists the files and folders in the `data/clean` folder, resolved relative to the current working directory.

In [3]:
import os
from pathlib import Path

dataset_dir = Path("data/clean")
print(f"[INFO] Files/folders available in {dataset_dir}:")
os.listdir(dataset_dir)

[INFO] Files/folders available in data\clean:


['cleaned_lang_proficiency_results.csv']
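
With the file confirmed to exist, a loading step might look like the sketch below. The helper `load_clean_data` is hypothetical; it reads the CSV with `pd.read_csv` and checks that the five columns listed under Inputs are present before any feature engineering starts.

```python
from pathlib import Path

import pandas as pd

# Columns listed in the Inputs section of this notebook.
EXPECTED_COLUMNS = [
    "speaking_score",
    "reading_score",
    "listening_score",
    "writing_score",
    "overall_cefr",
]


def load_clean_data(csv_path):
    """Read the cleaned CSV and fail early if an expected column is missing."""
    df = pd.read_csv(csv_path)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df
```

In this project the call would presumably be `load_clean_data(Path("data/clean") / "cleaned_lang_proficiency_results.csv")`, run after the working directory has been set to the project root.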