# Predicting Player Engagement Using Supervised Learning

This notebook presents a complete supervised learning pipeline for predicting player engagement levels in an online gaming dataset. The target variable is **`EngagementLevel`**, which has three categories: `High`, `Medium`, and `Low`.

We explore:
- Preprocessing of mixed-type data
- Training multiple machine learning models
- Evaluating models using classification metrics
- Selecting the best performing model
- Discussing overfitting, hyperparameters, and model quality

---

In [2]:
import kagglehub
#!pip install --upgrade scikit-learn

# Download latest version
#path = kagglehub.dataset_download("rabieelkharoua/predict-online-gaming-behavior-dataset")

#print("Path to dataset files:", path)

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score, precision_score, recall_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier


## 📥 Load the Dataset

In [5]:
df = pd.read_csv('online_gaming_behavior_dataset.csv')
df.head()


|   | PlayerID | Age | Gender | Location | GameGenre | PlayTimeHours | InGamePurchases | GameDifficulty | SessionsPerWeek | AvgSessionDurationMinutes | PlayerLevel | AchievementsUnlocked | EngagementLevel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9000 | 43 | Male | Other | Strategy | 16.271119 | 0 | Medium | 6 | 108 | 79 | 25 | Medium |
| 1 | 9001 | 29 | Female | USA | Strategy | 5.525961 | 0 | Medium | 5 | 144 | 11 | 10 | Medium |
| 2 | 9002 | 22 | Female | USA | Sports | 8.223755 | 0 | Easy | 16 | 142 | 35 | 41 | High |
| 3 | 9003 | 35 | Male | USA | Action | 5.265351 | 1 | Easy | 9 | 85 | 57 | 47 | Medium |
| 4 | 9004 | 33 | Male | Europe | Action | 15.531945 | 0 | Medium | 2 | 131 | 95 | 37 | Medium |


## 🛠️ Data Preprocessing

In [7]:
# Encode target
le = LabelEncoder()
df['EngagementLevel_encoded'] = le.fit_transform(df['EngagementLevel'])

# One-hot encode categorical variables
df = pd.get_dummies(df, columns=['Location', 'Gender', 'GameGenre', 'GameDifficulty'], drop_first=True)

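# Build the feature matrix: drop the ID, both target columns, and two
# features left out of this model (PlayerLevel, AchievementsUnlocked)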
X = df.drop(['PlayerID', 'EngagementLevel', 'EngagementLevel_encoded', 'PlayerLevel','AchievementsUnlocked'], axis=1)
y = df['EngagementLevel_encoded']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

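The original `EngagementLevel` labels are the strings `High`, `Medium`, and `Low`. We encode them as integers in a new column, `EngagementLevel_encoded`, using `LabelEncoder`. The encoder assigns codes in alphabetical order of the labels, so `High` becomes 0, `Low` becomes 1, and `Medium` becomes 2; the codes do not follow the Low < Medium < High ordering. These are the class indices 0, 1, and 2 shown in the classification reports below.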
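To check which integer stands for which label, the fitted encoder can be inspected directly. The sketch below uses a handful of made-up labels in place of `df['EngagementLevel']`; only the `LabelEncoder` behaviour carries over to the real data.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for df['EngagementLevel'] (illustrative values only)
labels = ["Medium", "Medium", "High", "Low", "Medium"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ holds the sorted unique labels; each label's position is its code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns integer predictions back into readable labels
decoded = le.inverse_transform([0, 1, 2]).tolist()
print(decoded)
```

Calling `le.inverse_transform` on model predictions is a convenient way to report results as `High`, `Low`, and `Medium` rather than as raw codes.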

## 🔍 Model 1: Logistic Regression

In [10]:
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_lr))

Logistic Regression Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.83      0.85      2035
           1       0.80      0.70      0.75      2093
           2       0.80      0.88      0.84      3879

    accuracy                           0.82      8007
   macro avg       0.83      0.80      0.81      8007
weighted avg       0.82      0.82      0.82      8007



## 🌲 Model 2: Random Forest

In [12]:
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))


Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.79      0.84      2035
           1       0.87      0.78      0.82      2093
           2       0.82      0.92      0.87      3879

    accuracy                           0.85      8007
   macro avg       0.87      0.83      0.85      8007
weighted avg       0.86      0.85      0.85      8007



## 👥 Model 3: K-Nearest Neighbors

In [14]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)

print("KNN Classification Report:")
print(classification_report(y_test, y_pred_knn))


KNN Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.76      0.78      2035
           1       0.73      0.59      0.65      2093
           2       0.72      0.81      0.77      3879

    accuracy                           0.74      8007
   macro avg       0.75      0.72      0.73      8007
weighted avg       0.74      0.74      0.74      8007



## 📊 Model Comparison & Evaluation Metrics

In [16]:
print("F1 Scores:")
print("Logistic Regression:", f1_score(y_test, y_pred_lr, average='weighted'))
print("Random Forest:", f1_score(y_test, y_pred_rf, average='weighted'))
print("KNN:", f1_score(y_test, y_pred_knn, average='weighted'))


F1 Scores:
Logistic Regression: 0.8171027850695111
Random Forest: 0.8512166900964486
KNN: 0.7382354337544175


### ✅ Model Selection

Based on the weighted **F1 score**, which balances precision and recall, **Random Forest** performs best on the test data (0.851), ahead of Logistic Regression (0.817) and KNN (0.738). We therefore select Random Forest.
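The selection can also be made programmatically. This small sketch hard-codes the weighted F1 scores printed above; the dictionary name `f1_scores` is only illustrative.

```python
# Weighted F1 scores copied from the comparison output above
f1_scores = {
    "Logistic Regression": 0.8171027850695111,
    "Random Forest": 0.8512166900964486,
    "KNN": 0.7382354337544175,
}

# Rank the models from best to worst and pick the top one
ranking = sorted(f1_scores, key=f1_scores.get, reverse=True)
best_model = ranking[0]

for name in ranking:
    print(f"{name}: {f1_scores[name]:.3f}")
print("Selected:", best_model)
```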

### ❗ Overfitting

**Overfitting** occurs when a model performs very well on training data but poorly on test data. It usually happens when the model is too complex or memorizes training patterns.

**How to spot overfitting:**
- Large gap between training and validation scores
- Extremely high accuracy on training data but low F1 on test

Cross-validation and regularization can help mitigate overfitting.
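One way to apply these checks is to compare the training and test F1 scores of a fitted model and to look at the spread of cross-validated scores. The sketch below runs on synthetic data from `make_classification` (an assumption made so the example is self-contained), not on the gaming dataset, so its numbers say nothing about the models above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic 3-class data standing in for the real features (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=15, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# No max_depth limit, so the trees are free to fit the training data closely
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_tr, y_tr)

train_f1 = f1_score(y_tr, model.predict(X_tr), average="weighted")
test_f1 = f1_score(y_te, model.predict(X_te), average="weighted")
gap = train_f1 - test_f1  # a large positive gap suggests overfitting

# 5-fold cross-validation on the training split gives a second estimate
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring="f1_weighted")

print(f"train F1: {train_f1:.3f}  test F1: {test_f1:.3f}  gap: {gap:.3f}")
print(f"CV F1: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

The same pattern would apply to `lr`, `rf`, and `knn` by scoring each on `X_train` as well as `X_test`.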

### 🛠️ Hyperparameters Used

- **Random Forest**: `n_estimators=100`, `max_depth=10` (limits complexity)
- **KNN**: `n_neighbors=5` (common default, controls model flexibility)
- **Logistic Regression**: Used `max_iter=1000` to ensure convergence
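The values above were set by hand. A cross-validated grid search is one common way to choose them instead; the sketch below shows the idea for the two Random Forest parameters, using synthetic data and a deliberately small, hypothetical grid rather than anything tuned for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the real features (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=15, n_informative=6, n_classes=3, random_state=0
)

# Hypothetical candidate values; None means no depth limit
param_grid = {"n_estimators": [25, 50], "max_depth": [5, 10, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="f1_weighted",  # same metric as the model comparison
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV F1:", round(search.best_score_, 3))
```

Because the search scores each combination by cross-validation on the data it is given, it should be fitted on the training split only, leaving the test split for the final evaluation.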

### 📏 Evaluation Metrics Explained

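- **Accuracy**: Proportion of correct predictions across all classes; it can be misleading when classes are imbalanced
- **Precision**: Of the samples predicted as a given class, the fraction that truly belong to it
- **Recall (Sensitivity)**: Of the samples that truly belong to a given class, the fraction that were predicted as that class
- **F1 Score**: Harmonic mean of precision and recall, so it is high only when both are high

In a multi-class problem, precision, recall, and F1 are computed per class and then averaged. The comparison above uses `average='weighted'`, which weights each class's F1 by its support. This matters here because the classes are imbalanced: class 2 accounts for 3,879 of the 8,007 test samples.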
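To make these definitions concrete, the sketch below derives per-class precision, recall, and F1 from a confusion matrix and compares macro and weighted averaging. The ten labels are invented for illustration and are unrelated to the dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Invented labels for a 3-class problem (illustrative only)
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted
true_pos = np.diag(cm)

precision = true_pos / cm.sum(axis=0)  # correct / everything predicted as the class
recall = true_pos / cm.sum(axis=1)     # correct / everything truly in the class
f1_per_class = 2 * precision * recall / (precision + recall)

support = cm.sum(axis=1)               # number of true samples per class
macro_f1 = f1_per_class.mean()         # every class counts equally
weighted_f1 = np.average(f1_per_class, weights=support)  # larger classes count more
accuracy = true_pos.sum() / cm.sum()

print("per-class F1:", np.round(f1_per_class, 3))
print("macro F1:", round(float(macro_f1), 3))
print("weighted F1:", round(float(weighted_f1), 3))
print("sklearn weighted F1:", round(f1_score(y_true, y_pred, average="weighted"), 3))
```

The hand-computed averages agree with `f1_score(..., average='macro')` and `f1_score(..., average='weighted')`, which is how the weighted scores in the comparison above were obtained.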

---


In [18]:
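# Refit a Random Forest with default hyperparameters for export.
# Note: this is not the max_depth=10 model (`rf`) evaluated above.
rf_model = RandomForestClassifier(random_state=42)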
rf_model.fit(X_train, y_train)

In [19]:
joblib.dump(rf_model, 'engagement_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

['scaler.pkl']

In [20]:
print(X_train.shape[1])

15


In [21]:
print(X.columns.tolist())

['Age', 'PlayTimeHours', 'InGamePurchases', 'SessionsPerWeek', 'AvgSessionDurationMinutes', 'Location_Europe', 'Location_Other', 'Location_USA', 'Gender_Male', 'GameGenre_RPG', 'GameGenre_Simulation', 'GameGenre_Sports', 'GameGenre_Strategy', 'GameDifficulty_Hard', 'GameDifficulty_Medium']
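The saved model expects exactly these 15 columns in this order. A minimal sketch of how a new record might be prepared for prediction is shown below; the player values are invented, and the approach (one-hot encoding followed by `reindex`) is one possible way to align columns, not necessarily how the exported model is used in practice.

```python
import pandas as pd

# Feature order printed above
feature_columns = [
    "Age", "PlayTimeHours", "InGamePurchases", "SessionsPerWeek",
    "AvgSessionDurationMinutes", "Location_Europe", "Location_Other",
    "Location_USA", "Gender_Male", "GameGenre_RPG", "GameGenre_Simulation",
    "GameGenre_Sports", "GameGenre_Strategy", "GameDifficulty_Hard",
    "GameDifficulty_Medium",
]

# Invented raw record for a single player (illustrative values only)
new_player = pd.DataFrame([{
    "Age": 30, "PlayTimeHours": 10.5, "InGamePurchases": 0,
    "SessionsPerWeek": 7, "AvgSessionDurationMinutes": 95,
    "Location": "USA", "Gender": "Female",
    "GameGenre": "Sports", "GameDifficulty": "Hard",
}])

# One-hot encode, then force the training column set and order:
# columns missing from this record become 0, extra ones are dropped
encoded = pd.get_dummies(
    new_player, columns=["Location", "Gender", "GameGenre", "GameDifficulty"]
)
aligned = encoded.reindex(columns=feature_columns, fill_value=0).astype(float)

print(aligned.iloc[0].to_dict())
```

With the two saved files available, `joblib.load('scaler.pkl')` and `joblib.load('engagement_model.pkl')` would restore the scaler and model, and `aligned` would go through `scaler.transform` before `predict`. A category that was dropped as the baseline by `drop_first=True`, or one never seen in training, is represented as all zeros in its group.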
