# 02 - Feature Engineering and Selection

## Introduction

This notebook focuses on feature engineering and selection for the VocabuloQuizz recommendation system. Based on the insights from my exploratory data analysis, I will create new features, transform existing ones, and select the most relevant features for my model.
This notebook presents the evolution of my approach to data preparation and modelling, showing how I have iteratively improved my system to meet the specific challenges of this project.

Note that the data used here is synthetically generated. This allows me to experiment freely, but it may not fully represent real-world user behavior.

## 1. Data Loading and Preparation

### Initial data processing
Script: `src/feature_engineering/data_prep_initial.py`

In [3]:
from Vocabulo_quiz.src.feature_engineering.data_prep_initial import prepare_data_for_model

In [10]:
X, y, user_features, word_features = prepare_data_for_model()
print("User features:")
print(user_features.head())

print("Word features:")
print(word_features.head())

postgresql://lsf_user:lsf_password@localhost:5432/lsf_app
Successful connection to the database.
Data loaded successfully.
Quiz data preview:
                                quiz_id                      date mot_id  \
0  555e9ba5-3764-4728-acf6-d7f0e32d8aa8 2024-06-07 00:21:36+00:00   None   
1  be0972b2-c5f8-4af1-907a-f1cd6968f798 2024-06-08 00:21:36+00:00   None   
2  d6f5761d-1357-4e15-8d14-826935baaa46 2024-06-09 00:21:36+00:00   None   
3  33ade695-7276-4676-918f-799f5179b37c 2024-06-10 00:21:36+00:00   None   
4  a7201a26-a7b9-477f-90e2-0d52875e4e6a 2024-06-11 00:21:36+00:00   None   

                                user_id  
0  1e34bff7-8f44-496b-aed5-ecb19ea96eb0  
1  1e34bff7-8f44-496b-aed5-ecb19ea96eb0  
2  1e34bff7-8f44-496b-aed5-ecb19ea96eb0  
3  1e34bff7-8f44-496b-aed5-ecb19ea96eb0  
4  1e34bff7-8f44-496b-aed5-ecb19ea96eb0  

Score data preview:
                               score_id                               quiz_id  \
0  7700fd23-d106-4082-9d70-20b8b73937f2  555e9b

In this stage:
* **User features** were calculated based on quiz participation and average scores
* **Word features** were based on frequency and difficulty, with some basic normalization
* **Normalization** was done using `StandardScaler` for numeric features, a simple but effective approach at this point.
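
As a rough illustration of the normalization step, the sketch below standardizes two numeric user features with `StandardScaler`. The column names `quiz_count` and `avg_score` and the toy values are assumptions made for this example; the real columns are whatever `prepare_data_for_model()` returns.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical user features; the real ones come from prepare_data_for_model().
user_features_demo = pd.DataFrame(
    {"quiz_count": [3, 10, 25, 7], "avg_score": [0.40, 0.75, 0.90, 0.55]}
)

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(user_features_demo),
    columns=user_features_demo.columns,
    index=user_features_demo.index,
)

# Each column now has mean 0 and unit (population) standard deviation.
print(scaled.round(3))
```

In a real workflow the scaler should be fitted on the training split only and then reused to transform the test split, so that test statistics do not leak into training.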

**Observations**:
- The initial data preparation provides basic user and word features.
- User features include quiz count, average score, and evaluation distributions.
- Word features contain basic information such as difficulty, `score_diff` (frequency), and grammatical categories.

**Limitations**:
- Lack of temporal features: No consideration of time-based patterns in user behavior.
- Limited word representation: Simple categorical encoding for words, missing semantic information.
- Absence of interaction features: No features capturing the relationship between users and words.
- Basic normalization: Simple standardization of numeric features may not capture complex patterns.


### Improved data processing (v2)
Script: `../src/feature_engineering/data_prep_ameliored.py`

As the project evolved, I expanded the feature set and improved the data preprocessing techniques. New pipelines were introduced to handle categorical and missing data more efficiently.

This script introduces:
- **Missing value handling** with `SimpleImputer`
- **Categorical data encoding** using `OneHotEncoder`
- **Pipeline** to manage the preprocessing of both numerical and categorical features efficiently
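
The three ingredients listed above can be combined roughly as follows. This is a minimal sketch, not the code in `data_prep_ameliored.py`: the column names, the imputation strategies, and the use of `ColumnTransformer` to join the two branches are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical columns, for illustration only.
numeric_cols = ["avg_score", "quiz_count"]
categorical_cols = ["gramm_category"]

demo = pd.DataFrame(
    {
        "avg_score": [0.4, np.nan, 0.9, 0.6],
        "quiz_count": [3, 10, np.nan, 7],
        "gramm_category": ["noun", "verb", np.nan, "noun"],
    }
)

# Numeric branch: fill gaps with the median, then standardize.
numeric_pipeline = Pipeline(
    [
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]
)

# Categorical branch: treat missing values as their own category, then one-hot encode.
categorical_pipeline = Pipeline(
    [
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor_demo = ColumnTransformer(
    [
        ("num", numeric_pipeline, numeric_cols),
        ("cat", categorical_pipeline, categorical_cols),
    ]
)

X_demo = preprocessor_demo.fit_transform(demo)
print(X_demo.shape)
```

Keeping imputation inside the pipeline means the fill values are learned during `fit` and reapplied unchanged at prediction time.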

In [1]:
from Vocabulo_quiz.src.feature_engineering.data_prep_ameliored import prepare_data_for_model

postgresql://lsf_user:lsf_password@localhost:5432/lsf_app
Successful connection to the database.


In [2]:
import numpy as np

X, y, preprocessor = prepare_data_for_model()
print("Data preparation complete.")
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"Features: {X.columns.tolist()}")
print(f"Number of features after preprocessing: {X.shape[1]}")
print(f"Example rows from X:")
print(X.head())
print(f"Distribution of y: {np.bincount(y)}")

postgresql://lsf_user:lsf_password@localhost:5432/lsf_app
Successful connection to the database.
Data loaded successfully.
Quiz data preview:
                                quiz_id                      date mot_id  \
0  555e9ba5-3764-4728-acf6-d7f0e32d8aa8 2024-06-07 00:21:36+00:00   None   
1  be0972b2-c5f8-4af1-907a-f1cd6968f798 2024-06-08 00:21:36+00:00   None   
2  d6f5761d-1357-4e15-8d14-826935baaa46 2024-06-09 00:21:36+00:00   None   
3  33ade695-7276-4676-918f-799f5179b37c 2024-06-10 00:21:36+00:00   None   
4  a7201a26-a7b9-477f-90e2-0d52875e4e6a 2024-06-11 00:21:36+00:00   None   

                                user_id  
0  1e34bff7-8f44-496b-aed5-ecb19ea96eb0  
1  1e34bff7-8f44-496b-aed5-ecb19ea96eb0  
2  1e34bff7-8f44-496b-aed5-ecb19ea96eb0  
3  1e34bff7-8f44-496b-aed5-ecb19ea96eb0  
4  1e34bff7-8f44-496b-aed5-ecb19ea96eb0  

Score data preview:
                               score_id                               quiz_id  \
0  7700fd23-d106-4082-9d70-20b8b73937f2  555e9b

**Improvements in v2**:
- Enhanced user profiling: More detailed user-level features capturing learning patterns.
- Improved word representation: Additional word-level features providing more context.
- Basic temporal features: Introduction of time-based features to capture user progress over time.

**Remaining challenges**:
- Still lacking advanced NLP techniques for word representation.
- Limited consideration of user-word interactions.
- Room for improvement in handling categorical variables.

### Final data processing
Script: `../src/feature_engineering/data_prep_final.py`

In the final phase, the preprocessing pipeline was expanded further to include more complex features and time-based transformations.


In [20]:
from Vocabulo_quiz.src.feature_engineering.data_prep_final import prepare_data_for_model

X, y, preprocessor = prepare_data_for_model()
print("Data preparation complete.")
print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

postgresql://lsf_user:lsf_password@localhost:5432/lsf_app
Successful connection to the database.
Data loaded. DataFrame shape: (576, 18)


AttributeError: 'Engine' object has no attribute 'close'
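
The traceback above is consistent with calling `.close()` on a SQLAlchemy `Engine`. Connections expose `close()`, whereas an engine releases its connection pool with `dispose()`. The sketch below shows the usual pattern, assuming SQLAlchemy is the database layer (the error message suggests it, but the script itself is not shown here) and using an in-memory SQLite database as a stand-in for the project's PostgreSQL instance.

```python
from sqlalchemy import create_engine, text

# In-memory SQLite stands in for the real PostgreSQL URL (assumption for the demo).
engine = create_engine("sqlite:///:memory:")

# The connection is closed automatically when the `with` block exits.
with engine.connect() as conn:
    value = conn.execute(text("SELECT 1")).scalar()

print(value)

# An Engine has no close(); dispose() releases its pooled connections instead.
engine.dispose()
```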

Key advancements in data preparation:

1. Comprehensive data retrieval:
   - Efficient SQL query to gather relevant data from multiple tables in a single operation.
   - Inclusion of user feedback data (eval_mot) for a more holistic view of user interactions.

2. Sophisticated temporal features:
   - Conversion of timestamps to datetime objects for easier manipulation.
   - Extraction of hour, day of week, and month from quiz dates.
   - Cyclical encoding of time features (hour, day of week, month) to capture periodic patterns.
   - Calculation of 'days_since_last_seen' to capture recency effects.

3. Advanced feature preprocessing:
   - Standardization of numeric features using StandardScaler for consistent scale across variables.
   - One-hot encoding of categorical variables, including handling of unknown categories.
   - Retention of cyclical features without further transformation to preserve their inherent structure.

4. Flexible preprocessing pipeline:
   - Use of ColumnTransformer to apply different preprocessing steps to different types of features.
   - Easy addition or modification of feature transformations for future improvements.

5. Integrated word difficulty metrics:
   - Incorporation of various word difficulty measures (freqfilms, freqlivres, nbr_syll, cp_cm2_sfi) for a multi-faceted representation of word complexity.

6. User interaction history:
   - Inclusion of user-specific interaction data (times_correct, times_seen) to capture individual learning patterns.

7. Contextual features:
   - Integration of category and subcategory information to provide context for each word.
   - Inclusion of grammatical information (gramm_id) for linguistic context.
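
The temporal features in point 2 can be sketched as follows. The toy log, the column names other than `days_since_last_seen`, and the choice to compute recency per user and per word are assumptions made for this example; the actual definitions live in `data_prep_final.py`.

```python
import numpy as np
import pandas as pd

# Toy interaction log; column names are assumptions for this sketch.
log = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u1", "u2"],
        "mot_id": [10, 10, 11, 10],
        "date": pd.to_datetime(
            [
                "2024-06-07 00:21:36+00:00",
                "2024-06-10 18:00:00+00:00",
                "2024-06-11 06:30:00+00:00",
                "2024-06-09 12:00:00+00:00",
            ],
            utc=True,
        ),
    }
)


def add_cyclical(df, column, period):
    """Map a periodic integer feature onto the unit circle (sin/cos pair)."""
    angle = 2 * np.pi * df[column] / period
    df[f"{column}_sin"] = np.sin(angle)
    df[f"{column}_cos"] = np.cos(angle)
    return df


log["hour"] = log["date"].dt.hour
log["day_of_week"] = log["date"].dt.dayofweek  # Monday = 0
log["month0"] = log["date"].dt.month - 1  # shift 1..12 to 0..11

log = add_cyclical(log, "hour", 24)
log = add_cyclical(log, "day_of_week", 7)
log = add_cyclical(log, "month0", 12)

# Days since the same user last saw the same word (NaN on first exposure).
log = log.sort_values(["user_id", "mot_id", "date"])
previous_seen = log.groupby(["user_id", "mot_id"])["date"].shift(1)
log["days_since_last_seen"] = (log["date"] - previous_seen).dt.total_seconds() / 86400

print(log[["user_id", "mot_id", "hour_sin", "hour_cos", "days_since_last_seen"]])
```

The sin/cos pair places 23:00 next to 00:00, which a raw integer hour cannot express. The NaN values on first exposure still need a strategy (for example a sentinel value or an indicator column) before modelling.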

These improvements provide a robust foundation for my model, capturing a wide range of factors that may influence a user's performance on a given word. The preprocessing pipeline ensures that all features are appropriately scaled and encoded for machine learning algorithms.
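
A preprocessing pipeline of the kind described above might look like the sketch below. The feature names `freqfilms`, `nbr_syll`, `times_seen`, and `gramm_id` are taken from the list above; the cyclical column names, the toy values, and the exact transformer layout are assumptions, not the contents of `data_prep_final.py`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["freqfilms", "nbr_syll", "times_seen"]
categorical_features = ["gramm_id"]
cyclical_features = ["hour_sin", "hour_cos"]  # hypothetical names

# Made-up rows, for illustration only.
train = pd.DataFrame(
    {
        "freqfilms": [12.5, 3.1, 40.0],
        "nbr_syll": [2, 3, 1],
        "times_seen": [5, 1, 9],
        "gramm_id": [1, 2, 1],
        "hour_sin": [0.0, 0.5, -0.5],
        "hour_cos": [1.0, 0.866, 0.866],
    }
)
# A new row whose gramm_id (3) was never seen during fitting.
new = pd.DataFrame(
    {
        "freqfilms": [8.0],
        "nbr_syll": [2],
        "times_seen": [0],
        "gramm_id": [3],
        "hour_sin": [1.0],
        "hour_cos": [0.0],
    }
)

preprocessor_sketch = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        # Unknown categories are encoded as all zeros instead of raising.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
        # Cyclical features already lie in [-1, 1], so they pass through untouched.
        ("cyc", "passthrough", cyclical_features),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_demo = preprocessor_sketch.fit_transform(train)
X_new_demo = preprocessor_sketch.transform(new)
print(X_train_demo.shape, X_new_demo.shape)
```

Output columns follow the order of the transformers: scaled numeric features first, then the one-hot columns, then the untouched cyclical pair.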

Future directions for enhancement could include:
- Implementation of word embeddings (e.g., Word2Vec, GloVe) for richer semantic representation of words.
- Exploration of more advanced feature interactions, particularly between user history and word characteristics, once real user data is available.
- Investigation of time series features to capture trends in user performance over time.
- Application of dimensionality reduction techniques (e.g., PCA) if the feature space becomes too large.
- Integration of external language resources for additional linguistic features.
- Experimentation with transformer-based models such as BERT for word representation, if computational resources allow.
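
For the dimensionality reduction idea, one possible starting point is a `PCA` that keeps just enough components to explain a chosen share of the variance. The sketch below runs on random data with three underlying factors; the 95% threshold and the data are assumptions for illustration, not results from this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic wide matrix: 30 columns driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X_wide = latent @ rng.normal(size=(3, 30)) + 0.01 * rng.normal(size=(200, 30))

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X_wide)

# A float n_components keeps the smallest number of components
# whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```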

## 2. Preparation for Modeling

In [21]:
# Split into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Data split into training and testing sets.")
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

Data split into training and testing sets.
Training set shape: (16671, 30)
Testing set shape: (4168, 30)
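
The split above is purely random. If the target classes turn out to be imbalanced, passing `stratify=y` keeps the class proportions the same in both subsets. The sketch below demonstrates this on made-up labels with an 80/20 class ratio; whether the real `y` needs stratification is not established here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data with an 80/20 class imbalance.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 5))
y_toy = np.array([0] * 800 + [1] * 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both subsets keep roughly the original share of the minority class.
print(np.bincount(y_tr) / len(y_tr))
print(np.bincount(y_te) / len(y_te))
```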


## 3. Conclusion and Next Steps

Conclusion:

In this notebook, I have performed extensive feature engineering and selection for the VocabuloQuizz recommendation system. Key steps included:

1. Creating time-based features to capture temporal patterns in user behavior.
2. Developing user proficiency features to track learning progress.
3. Engineering word complexity features to better represent difficulty levels.
4. Generating interaction features to capture user-word relationships.
5. Transforming features using standardization and one-hot encoding.
6. Performing correlation analysis to remove redundant features.
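
The correlation-based pruning in step 6 can be sketched as below. This is one common approach, not necessarily the one used in the project: the helper `drop_correlated` and the 0.9 threshold are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.9):
    """Drop the later column of every pair whose |Pearson r| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


rng = np.random.default_rng(0)
a = rng.normal(size=200)
features = pd.DataFrame(
    {
        "a": a,
        "a_copy": 2 * a + rng.normal(scale=0.01, size=200),  # near-duplicate of "a"
        "b": rng.normal(size=200),  # independent feature
    }
)

reduced, dropped = drop_correlated(features)
print(dropped)
print(reduced.columns.tolist())
```

Which member of a correlated pair is removed depends on column order here; a more careful version would keep the feature that is more strongly related to the target.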

The final feature set includes a combination of engineered features and original features selected based on their importance. This refined feature set should provide a strong foundation for building my recommendation model.

Next Steps:
1. Develop and train multiple models (e.g., Random Forest, XGBoost, Neural Networks) using the selected feature set.
2. Perform hyperparameter tuning for each model type.
3. Evaluate and compare model performances using appropriate metrics (e.g., accuracy, F1-score, NDCG).
4. Analyze model predictions to gain insights into the factors influencing word recommendations.
5. Prepare for model deployment, including setting up a pipeline for real-time feature engineering.
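
As a pointer for step 3, the metrics mentioned above are all available in `sklearn.metrics`. The sketch below computes them on made-up predictions; the arrays are illustrative assumptions and say nothing about the performance of any model in this project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, ndcg_score

# Classification view: was the user's answer predicted correctly?
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])
acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Ranking view: one user, four candidate words.
# `relevance` holds graded ground-truth usefulness, `scores` the model's ranking scores.
relevance = np.array([[3, 2, 0, 1]])
scores = np.array([[0.9, 0.7, 0.4, 0.1]])
ndcg_at_3 = ndcg_score(relevance, scores, k=3)

print(acc, f1, ndcg_at_3)
```

Accuracy and F1 judge individual predictions, while NDCG judges the order of a recommended list, so the ranking metric is likely the closer match for a recommendation setting.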

These features and insights are based on synthetic data, so they are a starting point rather than a final result. As I transition to real user data, I should be prepared to refine my feature engineering process and model selection based on actual user behavior and performance patterns.
