# Game Behavior Prediction

This notebook explores a dataset on online gaming behavior and walks through the typical data science lifecycle. We cover only the basics here so that we can focus on deployment to AWS later.

In [276]:
import pandas as pd

In [277]:
df = pd.read_csv("online_gaming_behavior_dataset.csv")
df.shape

(40034, 13)

In [278]:
df.head()

,PlayerID,Age,Gender,Location,GameGenre,PlayTimeHours,InGamePurchases,GameDifficulty,SessionsPerWeek,AvgSessionDurationMinutes,PlayerLevel,AchievementsUnlocked,EngagementLevel
0,9000,43,Male,Other,Strategy,16.271119,0,Medium,6,108,79,25,Medium
1,9001,29,Female,USA,Strategy,5.525961,0,Medium,5,144,11,10,Medium
2,9002,22,Female,USA,Sports,8.223755,0,Easy,16,142,35,41,High
3,9003,35,Male,USA,Action,5.265351,1,Easy,9,85,57,47,Medium
4,9004,33,Male,Europe,Action,15.531945,0,Medium,2,131,95,37,Medium


In [279]:
df.isna().sum()

PlayerID                     0
Age                          0
Gender                       0
Location                     0
GameGenre                    0
PlayTimeHours                0
InGamePurchases              0
GameDifficulty               0
SessionsPerWeek              0
AvgSessionDurationMinutes    0
PlayerLevel                  0
AchievementsUnlocked         0
EngagementLevel              0
dtype: int64

In [280]:
df.describe()

,PlayerID,Age,PlayTimeHours,InGamePurchases,SessionsPerWeek,AvgSessionDurationMinutes,PlayerLevel,AchievementsUnlocked
count,40034.0,40034.0,40034.0,40034.0,40034.0,40034.0,40034.0,40034.0
mean,29016.5,31.992531,12.024365,0.200854,9.471774,94.792252,49.655568,24.526477
std,11556.964675,10.043227,6.914638,0.400644,5.763667,49.011375,28.588379,14.430726
min,9000.0,15.0,0.000115,0.0,0.0,10.0,1.0,0.0
25%,19008.25,23.0,6.067501,0.0,4.0,52.0,25.0,12.0
50%,29016.5,32.0,12.008002,0.0,9.0,95.0,49.0,25.0
75%,39024.75,41.0,17.963831,0.0,14.0,137.0,74.0,37.0
max,49033.0,49.0,23.999592,1.0,19.0,179.0,99.0,49.0


## EDA

The data looks relatively clean, so there shouldn't be much to do. Let's explore our features.

In [281]:
df.columns

Index(['PlayerID', 'Age', 'Gender', 'Location', 'GameGenre', 'PlayTimeHours',
       'InGamePurchases', 'GameDifficulty', 'SessionsPerWeek',
       'AvgSessionDurationMinutes', 'PlayerLevel', 'AchievementsUnlocked',
       'EngagementLevel'],
      dtype='object')

Let's split the columns into features (X) and target (y). The target is the last column.

In [282]:
X_labels = list(df.columns)
y_label = X_labels.pop(-1)

In [283]:
X = df[X_labels]
y = df[y_label]

In [284]:
print(X_labels)
print(y_label)

['PlayerID', 'Age', 'Gender', 'Location', 'GameGenre', 'PlayTimeHours', 'InGamePurchases', 'GameDifficulty', 'SessionsPerWeek', 'AvgSessionDurationMinutes', 'PlayerLevel', 'AchievementsUnlocked']
EngagementLevel


### Target

`EngagementLevel` is the target variable.

In [285]:
y.unique()

array(['Medium', 'High', 'Low'], dtype=object)

In [286]:
y.value_counts()

EngagementLevel
Medium    19374
High      10336
Low       10324
Name: count, dtype: int64

The three levels have a natural order (Low < Medium < High), so a simple integer mapping will suffice here.

In [264]:
y = y.map({'Low': 0, 'Medium': 1, 'High': 2})
y.head()

0    1
1    1
2    2
3    1
4    1
Name: EngagementLevel, dtype: int64
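
Keeping the mapping in one place makes it easy to decode predictions back to labels later. A minimal sketch on a toy series; `ENGAGEMENT_ORDER`, `label_to_code`, and `code_to_label` are hypothetical names, not part of the notebook above.

```python
import pandas as pd

# Hypothetical names; the class order mirrors the mapping used above.
ENGAGEMENT_ORDER = ['Low', 'Medium', 'High']
label_to_code = {label: code for code, label in enumerate(ENGAGEMENT_ORDER)}
code_to_label = {code: label for label, code in label_to_code.items()}

# Toy data standing in for the real target column.
sample = pd.Series(['Medium', 'High', 'Low', 'Medium'], name='EngagementLevel')

encoded = sample.map(label_to_code)   # labels -> integers
decoded = encoded.map(code_to_label)  # integers -> labels

# A label missing from the mapping would become NaN, so check before training.
assert encoded.notna().all()
print(encoded.tolist())
print(decoded.tolist())
```

The same `code_to_label` dictionary could be applied to model predictions to report them as `Low`, `Medium`, or `High`.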

### Independent Features

Let's look at the categorical ones first. Since we will likely use XGBoost, a tree-based model whose splits depend only on the ordering of values within a feature, we don't need to worry about scaling the numeric columns. Their ranges are not wildly different anyway.

In [265]:
categorical_features = ['Gender', 'Location', 'GameGenre', 'GameDifficulty']

for feature in categorical_features:
    print(X[feature].unique())

['Male' 'Female']
['Other' 'USA' 'Europe' 'Asia']
['Strategy' 'Sports' 'Action' 'RPG' 'Simulation']
['Medium' 'Easy' 'Hard']


So we can convert `Gender` to a binary column and label encode `GameDifficulty`, since it has a natural order. The other two, `Location` and `GameGenre`, need one-hot encoding. We also need to make sure every column ends up as a float or int.

In [266]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# that writing into the slice taken from df would trigger
X = X.assign(Gender=X['Gender'].map({'Male': 1, 'Female': 0}).astype(int))


In [267]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# that writing into the slice taken from df would trigger
X = X.assign(GameDifficulty=X['GameDifficulty'].map({'Easy': 0, 'Medium': 1, 'Hard': 2}).astype(int))
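
An alternative way to express the ordinal encoding is an ordered `pd.Categorical`, which states the order explicitly and marks any unexpected value with the code `-1` instead of silently producing NaN. A minimal sketch on a toy series; the variable names are illustrative only.

```python
import pandas as pd

# Toy data standing in for the real GameDifficulty column.
difficulty = pd.Series(['Medium', 'Easy', 'Hard', 'Easy'], name='GameDifficulty')

# The category list fixes the order: Easy < Medium < Hard.
ordered = pd.Categorical(difficulty, categories=['Easy', 'Medium', 'Hard'], ordered=True)
codes = pd.Series(ordered.codes, index=difficulty.index, name='GameDifficulty').astype(int)

# Values outside the declared categories get code -1, so this check catches typos.
assert (codes >= 0).all()
print(codes.tolist())
```

This gives the same integers as the dictionary mapping, so which form to use is mostly a matter of taste.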


We can one-hot encode the remaining two columns and drop the first category of each (keeping n-1 columns per feature).

In [268]:
X_encoded = pd.get_dummies(X, columns=['Location', 'GameGenre'], drop_first=True)

In [269]:
# get_dummies outputs True/False columns, so cast the new columns to int
encoded_cols = list(set(X_encoded.columns) - set(X.columns))

X_encoded[encoded_cols] = X_encoded[encoded_cols].astype(int)

In [270]:
X_encoded.head()

,PlayerID,Age,Gender,PlayTimeHours,InGamePurchases,GameDifficulty,SessionsPerWeek,AvgSessionDurationMinutes,PlayerLevel,AchievementsUnlocked,Location_Europe,Location_Other,Location_USA,GameGenre_RPG,GameGenre_Simulation,GameGenre_Sports,GameGenre_Strategy
0,9000,43,1,16.271119,0,1,6,108,79,25,0,1,0,0,0,0,1
1,9001,29,0,5.525961,0,1,5,144,11,10,0,0,1,0,0,0,1
2,9002,22,0,8.223755,0,0,16,142,35,41,0,0,1,0,0,1,0
3,9003,35,1,5.265351,1,0,9,85,57,47,0,0,1,0,0,0,0
4,9004,33,1,15.531945,0,1,2,131,95,37,1,0,0,0,0,0,0
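
As a possible shortcut, `pd.get_dummies` accepts a `dtype` argument, which produces integer columns directly and would make a separate cast unnecessary. A small sketch on a toy frame; the `toy` data is made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the real features.
toy = pd.DataFrame({
    'Age': [43, 29, 22, 35],
    'Location': ['Other', 'USA', 'Europe', 'Asia'],
    'GameGenre': ['Strategy', 'Sports', 'Action', 'RPG'],
})

# drop_first removes the alphabetically first category of each column
# (Asia and Action here); dtype=int yields 0/1 instead of True/False.
toy_encoded = pd.get_dummies(
    toy, columns=['Location', 'GameGenre'], drop_first=True, dtype=int
)

print(toy_encoded.columns.tolist())
print(toy_encoded.dtypes.unique())
```

The dropped categories act as the baseline: a row with zeros in all `Location_*` columns is an `Asia` row.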


In [271]:
# PlayerID is just an identifier, so drop it before modeling
X_encoded = X_encoded.drop(columns=['PlayerID'])

## Modeling

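We run XGBoost with a grid search. The main evaluation metric will be Cohen's kappa, which measures agreement between predicted and true labels while correcting for the agreement expected by chance. That makes it more informative than raw accuracy when the classes are imbalanced, as they are here. Note that the grid search itself selects hyperparameters by cross-validated accuracy.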
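
For reference, Cohen's kappa compares the observed agreement `p_o` with the agreement `p_e` expected by chance: `kappa = (p_o - p_e) / (1 - p_e)`. A small sketch that computes it by hand from a confusion matrix and checks the result against scikit-learn; the label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Made-up labels for three classes (0 = Low, 1 = Medium, 2 = High).
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 1, 0, 2])
y_hat = np.array([0, 1, 1, 1, 1, 2, 0, 1, 0, 2])

cm = confusion_matrix(y_true, y_hat)
n = cm.sum()

# Observed agreement: share of samples on the diagonal (this is plain accuracy).
p_observed = np.trace(cm) / n
# Chance agreement: for each class, product of its true and predicted shares.
p_expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2

kappa_manual = (p_observed - p_expected) / (1 - p_expected)

assert np.isclose(kappa_manual, cohen_kappa_score(y_true, y_hat))
print(f'accuracy={p_observed:.3f}, kappa={kappa_manual:.3f}')
```

Because chance agreement is subtracted, kappa sits below accuracy whenever the classifier is imperfect but better than chance.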

In [289]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, cohen_kappa_score, classification_report
from xgboost import XGBClassifier

In [273]:
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
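
The split above is purely random. Since `Medium` is roughly twice as frequent as the other two classes, one option would be to pass `stratify` so that both sets keep the same class proportions. A sketch on synthetic data; `X_demo` and `y_demo` are stand-ins, not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the class balance seen in the target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = rng.choice([0, 1, 2], size=1000, p=[0.26, 0.48, 0.26])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Class shares in each split should be nearly identical.
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share.round(3), test_share.round(3))
```

With about 40,000 rows a random split is unlikely to be far off, so this is a safeguard rather than a necessity.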

Let's test out some hyperparameters with Grid Search.

In [290]:
model = XGBClassifier(objective='multi:softmax', num_class=3, eval_metric='mlogloss')

# Define the parameter grid
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    # None leaves n_estimators unset, so XGBoost falls back to its default
    'n_estimators': [None, 50, 100, 200]
}

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)

# Fit the model using grid search
grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Score: {grid_search.best_score_}")

# Use the best model from grid search to make predictions
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Evaluate the best model
accuracy = accuracy_score(y_test, y_pred)
kappa = cohen_kappa_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(f'Cohen Kappa Score: {kappa}')
print('Classification Report:')
print(classification_report(y_test, y_pred))

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best Parameters: {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': None}
Best Score: 0.9179440722410173
Accuracy: 0.9153240914200075
Cohen Kappa Score: 0.865304102683122
Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.88      0.90      2093
           1       0.92      0.95      0.93      3879
           2       0.92      0.89      0.90      2035

    accuracy                           0.92      8007
   macro avg       0.92      0.91      0.91      8007
weighted avg       0.92      0.92      0.92      8007



Cohen's kappa is a stricter metric than accuracy because it discounts agreement expected by chance, so a score of about 0.87 is very strong. The dataset clearly made things easy in this case. Now let's train the model again with the best hyperparameters.

In [291]:
model = XGBClassifier(objective='multi:softmax', num_class=3, eval_metric='mlogloss', learning_rate=0.1, max_depth=7)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
kappa = cohen_kappa_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(f'Cohen Kappa Score: {kappa}')
print('Classification Report:')
print(classification_report(y_test, y_pred))

Accuracy: 0.9153240914200075
Cohen Kappa Score: 0.865304102683122
Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.88      0.90      2093
           1       0.92      0.95      0.93      3879
           2       0.92      0.89      0.90      2035

    accuracy                           0.92      8007
   macro avg       0.92      0.91      0.91      8007
weighted avg       0.92      0.92      0.92      8007

