## **Title**

**Personalized Education Analytics for Online Learning Platforms**

## **Introduction**

In this notebook, we build on the Exploratory Data Analysis from Notebook 1 by preparing the dataset for modeling and recommendation. We focus on encoding the categorical variables and engineering meaningful features for model development.

## **Objectives**

1.   Prepare the dataset for machine learning modeling and clustering

2.   Encode categorical variables appropriately for modeling

3.   Perform feature engineering to extract behavioral metrics:
  
  a. Engagement Score: Measures overall activity level

  b. Learning Outcome Score: Measures academic performance

  c. Age Groups: Helps cluster users by demographic bands

4. Create a clean and transformed dataset ready for clustering, classification and recommendation logic.


## **Import Libraries**

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')

import warnings
warnings.filterwarnings('ignore')

## **Load Datasets**

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/results/cleaned_personalized_learning_dataset.csv')

## **Copy Dataset for Modeling**

In [None]:
df_model = df.copy()

## **Encoding Categorical Variables**

### **Label Encoding**

In [None]:
# Map each gender label to an integer code; values not listed in the mapping become NaN
df_model['gender'] = df_model['gender'].map({'Male': 0, 'Female': 1, 'Other': 2})

### **Binary Encoding**

In [None]:
df_model['dropout_likelihood'] = df_model['dropout_likelihood'].map({'No': 0, 'Yes': 1})

### **Ordinal Encoding**

In [None]:
df_model['engagement_level'] = df_model['engagement_level'].map({'Low': 0, 'Medium': 1, 'High': 2})
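
One property of `Series.map` with a dictionary is that any value without a matching key silently becomes `NaN`. The following is a minimal sketch of how that could be checked after encoding. The toy frame and the extra category `'Non-binary'` are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical); 'Non-binary' is deliberately absent from the mapping
toy = pd.DataFrame({'gender': ['Male', 'Female', 'Other', 'Non-binary']})
gender_map = {'Male': 0, 'Female': 1, 'Other': 2}

encoded = toy['gender'].map(gender_map)

# Original values that were present but had no key in the mapping
unmapped = toy.loc[encoded.isna() & toy['gender'].notna(), 'gender'].unique().tolist()
print(unmapped)  # ['Non-binary']
```

The same check could be applied to `dropout_likelihood` and `engagement_level` if there is any doubt that the cleaned data contains only the expected labels.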

### **One-Hot Encoding for Multiclass Categoricals**

In [None]:
df_model = pd.get_dummies(df_model, columns=['education_level', 'course_name', 'learning_style'], drop_first=True)
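
To illustrate what `drop_first=True` does, here is a small sketch on toy data. The category values `'Visual'`, `'Auditory'` and `'Kinesthetic'` are assumptions chosen for illustration; the actual categories in the dataset may differ.

```python
import pandas as pd

# Toy data with hypothetical category values
toy = pd.DataFrame({'learning_style': ['Visual', 'Auditory', 'Kinesthetic', 'Visual']})

dummies = pd.get_dummies(toy, columns=['learning_style'], drop_first=True)

# Categories are sorted, so 'Auditory' is dropped and becomes the implicit baseline
print(list(dummies.columns))
# ['learning_style_Kinesthetic', 'learning_style_Visual']

# The 'Auditory' row is represented by zeros in every remaining dummy column
print(dummies.iloc[1].astype(int).tolist())  # [0, 0]
```

Dropping the first level avoids perfectly collinear dummy columns, at the cost of making the baseline category implicit.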

## **Feature Engineering**

**Engagement Score**

In [None]:
# Weighted sum of the raw activity columns (weights sum to 1.0)
df_model['engagement_score'] = (
    0.3 * df_model['time_spent_on_videos'] +
    0.2 * df_model['forum_participation'] +
    0.2 * df_model['quiz_attempts'] +
    0.3 * df_model['assignment_completion_rate']
).round(2)
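
The weighted sum above operates on raw column values. If those columns are on different scales (the notebook does not state their ranges), the column with the largest values can dominate the score regardless of its weight. The sketch below uses made-up numbers to show the effect and compares the raw score with a min-max scaled variant. The scaling step is an assumption added for illustration, not part of the original pipeline.

```python
import pandas as pd

# Same weights as the engagement score cell
weights = {
    'time_spent_on_videos': 0.3,
    'forum_participation': 0.2,
    'quiz_attempts': 0.2,
    'assignment_completion_rate': 0.3,
}

# Two hypothetical learners with made-up values
toy = pd.DataFrame({
    'time_spent_on_videos': [400, 100],
    'forum_participation': [10, 40],
    'quiz_attempts': [2, 4],
    'assignment_completion_rate': [50, 100],
})

# Raw weighted sum, as in the notebook
raw_score = sum(w * toy[col] for col, w in weights.items()).round(2)
print(raw_score.tolist())  # [137.4, 68.8]

# Min-max scale each column to [0, 1] first (assumes no column is constant)
scaled = (toy - toy.min()) / (toy.max() - toy.min())
scaled_score = sum(w * scaled[col] for col, w in weights.items()).round(2)
print(scaled_score.tolist())  # [0.3, 0.7]
```

In this toy case the ranking of the two learners flips once the columns share a common scale, so it may be worth checking the column ranges before interpreting the score.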

**Learning Outcome Score**

In [None]:
# Weighted sum of the performance columns (weights sum to 1.0)
df_model['learning_outcome_score'] = (
    0.5 * df_model['final_exam_score'] +
    0.3 * df_model['quiz_scores'] +
    0.2 * df_model['feedback_score']
).round(2)

**Age Group Binning**

In [None]:
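# pd.cut uses right-closed bins by default: (10, 18], (18, 25], ..., (50, 70].
# Ages outside (10, 70] become NaN and end up with zeros in every dummy column,
# which makes them indistinguishable from the dropped first group.
df_model['age_group'] = pd.cut(df_model['age'], bins=[10, 18, 25, 35, 50, 70],
                               labels=['10–18', '19–25', '26–35', '36–50', '51+'])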
df_model = pd.get_dummies(df_model, columns=['age_group'], drop_first=True)
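
The boundary behavior of `pd.cut` with these bins can be checked on a few hand-picked ages. This is a minimal sketch; the ages below are made up to probe the bin edges.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or just outside the bin edges
ages = pd.Series([10, 11, 18, 19, 51, 70, 71])

groups = pd.cut(ages, bins=[10, 18, 25, 35, 50, 70],
                labels=['10–18', '19–25', '26–35', '36–50', '51+'])

# With the default right=True and include_lowest=False, age 10 falls outside
# the first interval (10, 18], and age 71 falls outside the last interval (50, 70]
print(groups.isna().tolist())
# [True, False, False, False, False, False, True]
```

If the dataset can contain learners aged exactly 10 or older than 70, passing `include_lowest=True` or widening the last bin edge would keep them in a labeled group.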


**Engagement Category**

In [None]:
def classify_engagement(score):
    if score >= 300:
        return 'High'
    elif score >= 200:
        return 'Medium'
    else:
        return 'Low'

df_model['engagement_category'] = df_model['engagement_score'].apply(classify_engagement)
df_model = pd.get_dummies(df_model, columns=['engagement_category'], drop_first=True)
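
The same thresholds can also be expressed with `pd.cut`, which avoids a Python-level function call per row. The sketch below is an alternative under stated assumptions, not the notebook's method; it uses made-up scores to confirm that both approaches agree at the boundaries.

```python
import numpy as np
import pandas as pd

def classify_engagement(score):
    """Threshold rule from the notebook: >= 300 High, >= 200 Medium, else Low."""
    if score >= 300:
        return 'High'
    elif score >= 200:
        return 'Medium'
    else:
        return 'Low'

# Hypothetical scores placed on and around the two thresholds
scores = pd.Series([150.0, 199.99, 200.0, 299.99, 300.0, 450.0])

# right=False makes the bins left-closed: [-inf, 200), [200, 300), [300, inf)
binned = pd.cut(scores, bins=[-np.inf, 200, 300, np.inf],
                labels=['Low', 'Medium', 'High'], right=False)

applied = scores.apply(classify_engagement)
print((binned.astype(str) == applied).all())  # True
```

One difference to keep in mind: a missing score is labeled `'Low'` by `classify_engagement` (both comparisons are false for `NaN`), whereas `pd.cut` leaves it as `NaN`.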


## **Save the dataset**

In [None]:
df_model.to_csv('/content/drive/MyDrive/results/final_personalized_learning_dataset.csv', index=False)

## **Summary**

We encoded all relevant categorical features using appropriate techniques:

a. Label encoding for gender

b. Binary encoding for dropout_likelihood

c. Ordinal encoding for engagement_level

d. One-hot encoding for the multiclass categoricals: course_name, education_level and learning_style

We created composite features: engagement score, learning outcome score, age groups, and an engagement category derived from the engagement score. Finally, we saved the processed dataset for use in Notebook 3 to build the recommendation model.

**Next Steps**

We will cluster learners based on engagement and performance. We will then train a model to predict dropout risk and generate personalized course recommendations.