# DSCI100 Final Group Project
Hasan Al Tawfiq, Baptistin Bourban, Rhys Babkin, Shaheed Alelg

## 1: Introduction

### Background

A research group at UBC Computer Science is gathering data on how people engage with video games. Specifically, player actions are recorded as they navigate through the virtual world of Minecraft in order to train AI models (with the end goal of creating smarter NPCs). However, managing this project is complex – as each participant is a volunteer, the team is seeking additional insights on how recruitment efforts could be improved.

Our overarching goal is to analyze player behaviours and habits in search of information that could help increase participation. As highlighted by the project lead, Frank Wood, our mission is:
> "*To know more about the user populations: in particular, to have a good model of whether or not a player will continue contributing given* [their] *past participation*".

Understanding these predictors can help in crafting strategies that not only improve participation rates but also ensure more representative, consistent data sampling throughout the course of the research.

### Research Question

This project addresses a pivotal question in the context of ongoing research participation:

> "Can commitment to a project, as an abstract concept, be operationalized, modelled and thus predicted by training a model on past player data?"

Our method uses K-nearest neighbours (KNN) classification to create a model that predicts a player's commitment to long-term participation. A "committed player" is defined in this report as one who has continued participating for more than 2 weeks. The model takes two predictor variables, number of sessions and total playtime, and classifies a player as "committed" or "not interested" based on the labels of the most similar players in its training data.
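
To make the classification rule concrete, here is a minimal from-scratch sketch of a K-nearest-neighbours vote. The function name `knn_predict` and the toy coordinates are hypothetical and chosen for illustration only; the analysis below relies on scikit-learn's `KNeighborsClassifier` rather than this code.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical, already-standardized (sessions, playtime) pairs, not real data
train_points = [(-0.3, -0.3), (-0.2, -0.25), (-0.25, -0.1),
                (2.0, 1.8), (1.5, 2.2), (1.8, 1.2)]
train_labels = ["not interested"] * 3 + ["committed"] * 3

heavy_player = knn_predict(train_points, train_labels, (1.7, 1.6))
light_player = knn_predict(train_points, train_labels, (-0.1, -0.2))
print(heavy_player, "|", light_player)
```

Because the vote depends on distances, both predictors need to be on comparable scales, which is why the pipeline later standardizes them.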

### Dataset Description

Our analysis utilizes two interconnected datasets hosted on Google Drive which together form the basis of this study.

The first dataset contains detailed profiles of the participants, including metrics that may influence their likelihood of continuing in the study, such as their demographic information, previous contributions to research, and initial level of Minecraft experience. Understanding the demographic and psychographic makeup of participants can help identify characteristics that correlate with sustained engagement.

+ *<font color='violet'>'players.csv'</font>*: A list of 196 unique players, including the following 9 data fields for each player.

    + **experience** <font color='green'>(**type**:text)</font> ->
    This field defines whether a player is experienced in the game or just starting out. </br>&#10140; It is user input (perceived) data and may not correspond to their actual level or how much playtime they contribute.
    + **subscribe** <font color='green'>(**type**:boolean)</font> ->
    This field states if a player has subscribed to the mailing list or not. </br>&#10140; A subscriber might be a more persistent player.
    + **hashedEmail** <font color='green'>(**type**:text)</font> ->
    Here we are given the hashed email for each player, which serves as a unique user ID. </br>&#10140; While the original email is user input, this field is computer-generated. The hashed emails will help us link the data of each player between both datasets.
    + **played_hours** <font color='green'>(**type**:float)</font> ->
    In this field, we are given the total amount of playtime each player has contributed, as accumulated over the play sessions. </br>&#10140; This field will likely be more useful for preliminary plotting rather than creating the model, as we would like to get more specific with our data. Note also that some users in the players' list have never played (min=0.0).
    + **name** <font color='green'>(**type**:text)</font> ->
    These are usernames for each player. Randomized, anonymous names allowed the players to communicate between themselves. </br>&#10140; This field is a user input and will probably not be useful for the analysis.
    + **gender** <font color='green'>(**type**:text)</font> -> This field contains user-input gender.
    </br>&#10140; This question was optional and might therefore not be reliable to the analysis.
    + **age** <font color='green'>(**type**:integer)</font> -> This field states the age of each player. </br>&#10140; It would be interesting to see if age influences the analysis, although the interquartile range (25th to 75th percentile) spans only 17 to 22, which is rather narrow.
    + **individualId** <font color='green'>(**type**:float)</font> ->
    Unknown: this field could be an ID-encoded number as a float.</br>&#10140; This field is empty.
    + **organizationName** <font color='green'>(**type**:float)</font> ->
    Unknown. This field could be meant to define whether a player is part of an organization. </br>&#10140; This field is empty.

The second dataset records the frequency, duration, and nature of each participant's interactions with the research activities via Minecraft session logs, providing temporal data that can be crucial for understanding patterns of engagement. By examining these logs, it is possible to track how participants' engagement changes over time and identify periods or conditions under which participation tends to wane or increase.

+ *<font color='violet'>'sessions.csv'</font>*: A list of 1535 individual play sessions from many different players, including the following 5 columns:

    + **hashedEmail** <font color='green'>(**type**:text)</font> -> Same as in players.csv. </br>&#10140; This will help us link session data to player information in the other dataset.
    + **start_time** <font color='green'>(**type**:text)</font> ->
    This tells us the time each player logged into the server.</br>&#10140; Along with the end time, this will be useful in our analysis of users' persistence.
    + **end_time** <font color='green'>(**type**:text)</font> -> This tells us at what time each player disconnected from/logged out of the game.</br>&#10140; Same as above.
    + **original_start_time** <font color='green'>(**type**:text)</font> -> A large number that does not seem to be precise enough to be meaningful.</br>&#10140; Will be disregarded. 
    + **original_end_time** <font color='green'>(**type**:text)</font> -> Another large number with insufficient precision.</br>&#10140; Will be disregarded.

Together, these datasets provide a robust foundation for analyzing participant behaviour and developing insights into the factors that encourage continued involvement in research studies. By integrating and analyzing these data sources, the project aims to uncover patterns and predictors that could inform more effective participant retention strategies in future research endeavours.

For our project's objective, not all of the variables in each dataset are needed.

It is anticipated that the most useful dataset will be <font color='violet'>'sessions.csv'</font>, as it contains the start time/date and end time/date. These variables will not only show us the duration and number of sessions, but will provide information on daily preferences and how those may change over time. These two variables alone will play a big role in determining play frequency in our model. The "original_start_" and "_end_times" are irrelevant for our purposes and can be discarded. Lastly, the emails will allow us to connect information to specific users from the other dataset.

In the <font color='violet'>'players.csv'</font> dataset, we are going to focus on the perceived "experience", "age", and "subscribe" columns in case a relationship appears between these metrics and participation data. These could help with improving our model if an association with players' commitment is discovered. "Name", "gender", "individualId" and "organizationName" are either empty or irrelevant (see above), and so will be disregarded. Finally, "played_hours" will be cast aside as well, as it is not representative of the entire time played.

We are now ready to explore these datasets and begin constructing our model.

## 2: Building the Classification Model

### Data Preprocessing

In [1]:
# Import packages
import pandas as pd
import altair as alt
import numpy as np
from datetime import timedelta
from sklearn import set_config
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline                           
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score
from sklearn.model_selection import GridSearchCV

np.random.seed(1)

# Output dataframes instead of arrays
set_config(transform_output="pandas")

* Extracting IDs from URLs: The URLs provided are Google Drive share links. The code extracts file IDs from these URLs to form direct download links, enabling programmatic access to the files.

* Reading CSV Files: Pandas' *read_csv* function is used to load specific columns from the datasets. For sessions, parse_dates converts the session timestamps into Pandas datetime objects, facilitating time-based operations.
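
The ID-extraction step is not shown in the cells below, which read local CSV files. As a hedged sketch of what it could look like: the helper name `drive_direct_url`, the placeholder link, and the `uc?export=download&id=` URL pattern are assumptions for illustration, not the notebook's confirmed code.

```python
import re

def drive_direct_url(share_url):
    """Turn a Google Drive share link into a direct-download URL (assumed pattern)."""
    # Share links typically carry the file ID between "/d/" and the next slash
    match = re.search(r"/d/([A-Za-z0-9_-]+)", share_url)
    if match is None:
        raise ValueError(f"No file ID found in {share_url!r}")
    return f"https://drive.google.com/uc?export=download&id={match.group(1)}"

# Placeholder link: FILE_ID_PLACEHOLDER is not a real file ID
share_url = "https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"
direct_url = drive_direct_url(share_url)
print(direct_url)
```

The resulting string could then be passed to `pd.read_csv` in place of a local file name.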

In [2]:
# Load the data
players = pd.read_csv('players.csv', usecols=["experience", "subscribe", "hashedEmail", "played_hours", "age"])
sessions = pd.read_csv('sessions.csv', parse_dates=['start_time','end_time'], dayfirst=True, usecols=["hashedEmail", "start_time", "end_time"])

* Data Normalization: This step categorizes and orders the 'experience' column to standardize its values, making it more structured, thus allowing for easier grouping and analysis.

* Merging DataFrames: Combines session and player data on the 'hashedEmail' column. This centralized dataset facilitates comprehensive analyses that incorporate both session details and player profiles.

* Calculating Session Duration: Computes the total playtime for each session in minutes.

* Time Adjustment: Shifts 'start_time' and 'end_time' by -8 hours, presumably to adjust for timezone differences, standardizing time data to a specific timezone.
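
A fixed -8 hour shift corresponds to Pacific Standard Time, whereas summer sessions in Vancouver would fall under daylight time (UTC-7). The sketch below shows a timezone-aware alternative; it assumes the raw timestamps are in UTC and that Vancouver local time is the target, neither of which is confirmed by the dataset. The two timestamps are hypothetical.

```python
import pandas as pd

# Hypothetical timestamps, assumed to be recorded in UTC
times = pd.Series(pd.to_datetime(["2024-06-30 18:12:00", "2024-01-15 18:12:00"]))

# Attach the UTC zone, then convert to Vancouver local time.
# The offset applied depends on the date (daylight vs. standard time).
local = times.dt.tz_localize("UTC").dt.tz_convert("America/Vancouver")
local_hours = local.dt.hour.tolist()
print(local_hours)
```

Session durations are unaffected by either approach, since start and end times are shifted together; only time-of-day analyses would differ.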

In [3]:
# Order the experience
players["experience"] = players["experience"].replace({
    "Beginner" : "1-Beginner",
    "Amateur" : "2-Amateur",
    "Regular" : "3-Regular",
    "Pro" : "4-Pro",
    "Veteran" : "5-Veteran"
})

# Merge dataframes, convert times to PST (UTC seems to have been assumed), calculate playtime for each session
tidywhole = pd.merge(sessions, players, how="inner", on=["hashedEmail"])
tidywhole["start_time"] = tidywhole["start_time"] + pd.Timedelta(hours=-8)
tidywhole["end_time"]   = tidywhole["end_time"]   + pd.Timedelta(hours=-8)
tidywhole["played_minutes"] = (tidywhole["end_time"]-tidywhole["start_time"]) / timedelta(minutes=1)

tidywhole


| | hashedEmail | start_time | end_time | experience | subscribe | played_hours | age | played_minutes |
|---|---|---|---|---|---|---|---|---|
| 0 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431... | 2024-06-30 10:12:00 | 2024-06-30 10:24:00 | 3-Regular | True | 223.1 | 17 | 12.0 |
| 1 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 2024-06-17 15:33:00 | 2024-06-17 15:46:00 | 2-Amateur | True | 53.9 | 17 | 13.0 |
| 2 | f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3... | 2024-07-25 09:34:00 | 2024-07-25 09:57:00 | 2-Amateur | True | 150.0 | 16 | 23.0 |
| 3 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431... | 2024-07-24 19:22:00 | 2024-07-24 19:58:00 | 3-Regular | True | 223.1 | 17 | 36.0 |
| 4 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 2024-05-25 08:01:00 | 2024-05-25 08:12:00 | 2-Amateur | True | 53.9 | 17 | 11.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1530 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5... | 2024-05-10 15:01:00 | 2024-05-10 15:07:00 | 2-Amateur | True | 53.9 | 17 | 6.0 |
| 1531 | 7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e... | 2024-06-30 20:08:00 | 2024-06-30 20:19:00 | 5-Veteran | True | 1.6 | 23 | 11.0 |
| 1532 | fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33... | 2024-07-28 07:36:00 | 2024-07-28 07:57:00 | 2-Amateur | True | 56.1 | 23 | 21.0 |
| 1533 | fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33... | 2024-07-24 22:15:00 | 2024-07-24 22:22:00 | 2-Amateur | True | 56.1 | 23 | 7.0 |


In [4]:
len(tidywhole["hashedEmail"].unique())

125

Out of 196 users, 71 have never played. That gives us information from 125 participants to work with.
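
The drop from 196 to 125 players follows from the inner join: a player with no rows in the sessions table has nothing to match and disappears from the merged data. A small sketch with hypothetical miniature tables (`players_toy`, `sessions_toy`) illustrates the effect.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
players_toy = pd.DataFrame({"hashedEmail": ["a", "b", "c"], "age": [17, 21, 19]})
sessions_toy = pd.DataFrame({"hashedEmail": ["a", "a", "c"],
                             "played_minutes": [12, 30, 7]})

# Inner join keeps only players who appear in both tables
merged = pd.merge(sessions_toy, players_toy, how="inner", on="hashedEmail")

# Players listed in players_toy but absent after the merge never played
never_played = set(players_toy["hashedEmail"]) - set(merged["hashedEmail"])

n_active = merged["hashedEmail"].nunique()
print(n_active, never_played)
```

Here player "b" plays the role of the 71 registered users without any recorded session.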


### Defining the Commitment Category

Our mission is to build an effective model of whether or not a player is committed (and will remain committed) based on their past participation. This is a categorical prediction, hence the use of classification rather than regression. 

However, a commitment category is not given to us in the original datasets. Therefore, we will need to create this variable ourselves - to simplify this task, we will only be classifying players as "committed" or "not interested", perhaps sacrificing some nuance for a more straightforward model.

Our first step is to determine a metric that will separate the two groups of players.

* Grouping Data: Organizes data by participant using 'hashedEmail'.

* Calculating Engagement Period: Determines the total duration from the first to the last session for each participant in days, providing a measure of long-term engagement under the assumption that the longer the period, the higher a player's level of commitment.

* Plotting Engagement Period: Visualizes each player against their engagement, providing us with an effective overview of player trends.

In [5]:
# Calculate the period over which each person has played

grouped = tidywhole.groupby(['hashedEmail'])
period = grouped["end_time"].max() - grouped["start_time"].min()
period = period.dt.total_seconds() / (24 * 3600)   #converted to number of days (decimal)
period_df= period.reset_index()
period_df.columns = ["hashedEmail","period (days)"]

commitment = alt.Chart(period_df.assign(playerID=period_df.index+1)).mark_circle().encode(
    x=alt.X("playerID").title("Participant"),
    y=alt.Y("period (days)").title("Period over which participant has played (in days)"),
)

commitment_zoom = alt.Chart(period_df.assign(playerID=period_df.index+1)).mark_circle(clip=True).encode(
    x=alt.X("playerID").title("Participant"),
    y=alt.Y("period (days)").scale(domain=[0, 28]).title("Period over which participant has played (in days)"),
)

hline = alt.Chart(period_df).mark_rule(strokeDash=[10], size=1, color='darkorange').encode( y=alt.datum(14))

commitment+hline | commitment_zoom+hline


In [6]:
more_than_2wks = period[period>14]
more_than_1wk = period[period>7]
len(more_than_2wks)

25

We can see that few players (only 25) have played for longer than 2 weeks.

Given the above analysis, *<font color='red'>let us consider committed players as those who have played on the server for over 2 weeks</font>*.

In [7]:
commit_list = list(more_than_2wks.index)  # this is their list (emails)

### Choosing Variables for the Predictive Model

Let us explore several variables that could be useful. We believe commitment could be related to the following variables:

- *total playtime for each player (in minutes)*
- *average duration of the play sessions for each player (in minutes)*
- *number of play sessions for each player*
- *return rate (several definitions investigated)*

In [8]:
# Calculate total playtime for each player
playtime = grouped['played_minutes'].sum()
playtime_df = playtime.reset_index()
playtime_df.columns = ['hashedEmail', 'total playtime (min)']


In [9]:
# Calculate average session duration for each player
avg_duration = grouped['played_minutes'].mean()
avg_duration_df = avg_duration.reset_index()
avg_duration_df.columns = ['hashedEmail', 'average session duration (min)']


In [10]:
# Calculate the number of sessions played by players
nb_sessions = grouped.size()
nb_sessions_df = nb_sessions.reset_index()
nb_sessions_df.columns = ['hashedEmail', 'nb of sessions']


Return rate is defined here as either:
- the number of days played within a certain period (100% for someone who plays every day during that period), or
- the number of play sessions within a certain period (1 session/day for someone who plays every day during that period, or for example twice a day every other day)

We will use a 30 day period.
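
As a worked example of the two definitions, the sketch below computes both return rates for one hypothetical player. The function name `return_rates` and the session timestamps are made up for illustration; the cutoff mirrors the "within 30 days of first connection" rule used in the cells that follow.

```python
from datetime import datetime, timedelta

def return_rates(session_starts, window_days=30):
    """Return (fraction of days played, sessions per day) within the window."""
    first = min(session_starts)
    cutoff = first + timedelta(days=window_days)
    # Keep only sessions that start within the window after the first connection
    in_window = [start for start in session_starts if start <= cutoff]
    days_played = {start.date() for start in in_window}
    return len(days_played) / window_days, len(in_window) / window_days

# Hypothetical player: two sessions on day 1, one on day 3, one six weeks later
starts = [
    datetime(2024, 6, 1, 10, 0),
    datetime(2024, 6, 1, 20, 0),
    datetime(2024, 6, 3, 9, 0),
    datetime(2024, 7, 15, 9, 0),   # outside the 30-day window, ignored
]
by_days, by_plays = return_rates(starts)
print(round(by_days, 4), round(by_plays, 4))
```

This player has 2 distinct play days and 3 sessions in the window, so the two definitions give 2/30 and 3/30 respectively.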

In [11]:
# Keep, for each player, only the sessions played within 30 days of 1st connection
tidywhole["first_connection"] = grouped["start_time"].transform('min')
tidywhole["end_30_days"] = tidywhole['first_connection'] + timedelta(days=30)
tidywhole["start_date"] = tidywhole["start_time"].dt.date
tidywhole_30days = tidywhole[tidywhole["start_time"] <= tidywhole["end_30_days"]]

grouped_30_days = tidywhole_30days.groupby(['hashedEmail'])


In [12]:
# Count number of played days within the 30 day period
nb_days_30_days = grouped_30_days["start_date"].nunique()
return_rate_bydays = nb_days_30_days / 30     # fraction of days played within the 30-day window
return_rate_bydays_df = return_rate_bydays.reset_index()
return_rate_bydays_df.columns = ['hashedEmail', 'return rate by days']


In [13]:
# Count number of play sessions within the 30 day period
nb_sessions_30_days = grouped_30_days.size()
return_rate_byplays = nb_sessions_30_days / 30     # nb of sessions per day 
return_rate_byplays_df = return_rate_byplays.reset_index()
return_rate_byplays_df.columns = ['hashedEmail', 'return rate by plays']


* Data Consolidation: Combines all previously calculated metrics into a single dataframe.

* Commitment Classification: Assigns a label ('committed' or 'not interested') to each player based on their engagement duration, operationalizing the definition of commitment for predictive modeling.
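
The labelling rule can also be written directly against the engagement period. The following is a sketch of a vectorized alternative using `np.where`, on a hypothetical three-row frame (`toy`); it uses the same strict "more than 14 days" threshold as the notebook, so a period of exactly 14 days is labelled "not interested".

```python
import numpy as np
import pandas as pd

# Hypothetical engagement periods for three players
toy = pd.DataFrame({
    "hashedEmail": ["a", "b", "c"],
    "period (days)": [0.08, 14.0, 160.95],
})

# Strictly more than 14 days counts as committed
toy["identifier"] = np.where(toy["period (days)"] > 14,
                             "committed", "not interested")
labels = toy["identifier"].tolist()
print(labels)
```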

In [14]:
df = (
    playtime_df
    .merge(avg_duration_df, on='hashedEmail')
    .merge(return_rate_bydays_df, on='hashedEmail')
    .merge(return_rate_byplays_df, on='hashedEmail')
    .merge(nb_sessions_df, on='hashedEmail')
    .merge(period_df, on='hashedEmail')
)

# Add a column for the commitment identifier
df['identifier'] = df['hashedEmail'].apply(lambda x: 'committed' if x in commit_list else 'not interested')
df


| | hashedEmail | total playtime (min) | average session duration (min) | return rate by days | return rate by plays | nb of sessions | period (days) | identifier |
|---|---|---|---|---|---|---|---|---|
| 0 | 0088b5e134c3f0498a18c7ea6b8d77b4b0ff1636fc9335... | 106.0 | 53.000000 | 0.033333 | 0.066667 | 2 | 0.079861 | not interested |
| 1 | 060aca80f8cfbf1c91553a72f4d5ec8034764b05ab59fe... | 30.0 | 30.000000 | 0.033333 | 0.033333 | 1 | 0.020833 | not interested |
| 2 | 0ce7bfa910d47fc91f21a7b3acd8f33bde6db57912ce02... | 11.0 | 11.000000 | 0.033333 | 0.033333 | 1 | 0.007639 | not interested |
| 3 | 0d4d71be33e2bc7266ee4983002bd930f69d304288a866... | 418.0 | 32.153846 | 0.233333 | 0.433333 | 13 | 8.829861 | not interested |
| 4 | 0d70dd9cac34d646c810b1846fe6a85b9e288a76f5dcab... | 70.0 | 35.000000 | 0.066667 | 0.066667 | 2 | 1.150694 | not interested |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 120 | fc0224c81384770e93ca717f32713960144bf0b52ff676... | 16.0 | 16.000000 | 0.033333 | 0.033333 | 1 | 0.011111 | not interested |
| 121 | fcab03c6d3079521e7f9665caed0f31fe3dae6b5ccb86e... | 80.0 | 80.000000 | 0.033333 | 0.033333 | 1 | 0.055556 | not interested |
| 122 | fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33... | 4795.0 | 15.467742 | 0.566667 | 1.366667 | 310 | 160.952083 | committed |
| 123 | fe218a05c6c3fc6326f4f151e8cb75a2a9fa29e22b110d... | 9.0 | 9.000000 | 0.033333 | 0.033333 | 1 | 0.006250 | not interested |


### Data Visualization

The next step is to draw scatter plots that will help us find the relevant variables for the predictive model. Available variables are total playtime (min), average session duration (min), 2 types of return rates and the number of sessions.

For this, let us first standardize the numerical variables of the dataframe to make them graphically comparable.
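
Standardization replaces each value with its z-score, (value - mean) / standard deviation, so that every variable ends up with mean 0 and standard deviation 1. The sketch below works this out by hand for a hypothetical list of session counts; like scikit-learn's `StandardScaler`, it uses the population standard deviation.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores using the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(value - mean) / std for value in values]

# Hypothetical session counts, including one very active player
session_counts = [1, 1, 2, 13, 310]
z_scores = standardize(session_counts)
print([round(z, 3) for z in z_scores])
```

A single heavy player pulls the mean up, which is why most players in the standardized table have small negative values.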

In [15]:
#Create a preprocessor and standardize the variables

preprocessor = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include="number")),
    remainder="passthrough",
    verbose_feature_names_out = False
)
preprocessor.fit(df)
scaled_df = preprocessor.transform(df)

scaled_df

| | total playtime (min) | average session duration (min) | return rate by days | return rate by plays | nb of sessions | period (days) | hashedEmail | identifier |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.223530 | 0.633796 | -0.322532 | -0.215290 | -0.249749 | -0.429325 | 0088b5e134c3f0498a18c7ea6b8d77b4b0ff1636fc9335... | not interested |
| 1 | -0.256343 | -0.077687 | -0.322532 | -0.304995 | -0.274044 | -0.430732 | 060aca80f8cfbf1c91553a72f4d5ec8034764b05ab59fe... | not interested |
| 2 | -0.264546 | -0.665434 | -0.322532 | -0.304995 | -0.274044 | -0.431047 | 0ce7bfa910d47fc91f21a7b3acd8f33bde6db57912ce02... | not interested |
| 3 | -0.088823 | -0.011060 | 0.905383 | 0.771457 | 0.017492 | -0.220679 | 0d4d71be33e2bc7266ee4983002bd930f69d304288a866... | not interested |
| 4 | -0.239073 | 0.076983 | -0.117880 | -0.215290 | -0.249749 | -0.403790 | 0d70dd9cac34d646c810b1846fe6a85b9e288a76f5dcab... | not interested |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 120 | -0.262388 | -0.510764 | -0.322532 | -0.304995 | -0.274044 | -0.430964 | fc0224c81384770e93ca717f32713960144bf0b52ff676... | not interested |
| 121 | -0.234755 | 1.469016 | -0.322532 | -0.304995 | -0.274044 | -0.429904 | fcab03c6d3079521e7f9665caed0f31fe3dae6b5ccb86e... | not interested |
| 122 | 1.800954 | -0.527229 | 2.951908 | 3.283178 | 7.233014 | 3.406712 | fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33... | committed |
| 123 | -0.265410 | -0.727302 | -0.322532 | -0.304995 | -0.274044 | -0.431080 | fe218a05c6c3fc6326f4f151e8cb75a2a9fa29e22b110d... | not interested |


Then, let us draw coloured scatter plots to visualize the relationships between the variables.

In [16]:
#Draw coloured scatter plots to visualize the relationships between the variables

domain = ['committed', 'not interested']
range_ = ['lightblue', 'orange']

pltime_duration = alt.Chart(scaled_df).mark_circle(opacity=1).encode(
    x=alt.X('total playtime (min)').title("total playtime (standardized)"),
    y=alt.Y('average session duration (min)').title("average session duration (standardized)"),
    color=alt.Color('identifier').title("Commitment").scale(domain=domain, range=range_)
)


Plotting the average session duration against the total playtime below, we can see that session durations vary extensively for both groups (non-committed players especially). This makes it a poor choice of variable for our predictive model.

In [17]:
pltime_duration

Plotting the other three principal variables below shows better separation between both groups, allowing us to conclude that the total playtime and number of sessions (or return rate) are more useful variables than average session duration for classifying committed players.

In [18]:
pltime_number = alt.Chart(scaled_df).mark_circle(opacity=1).encode(
    x=alt.X('total playtime (min)').title("total playtime by sessions (standardized)"),
    y=alt.Y('nb of sessions').title("number of sessions (standardized)"),
    color=alt.Color('identifier').title("Commitment").scale(domain=domain, range=range_)
)
bydays_rrate_duration = alt.Chart(scaled_df).mark_circle(opacity=1).encode(
    x=alt.X('total playtime (min)').title("total playtime (standardized)"),
    y=alt.Y('return rate by days').title("return rate by days over a month (standardized)"),
    color=alt.Color('identifier').title("Commitment").scale(domain=domain, range=range_)
)
byplays_rrate_duration = alt.Chart(scaled_df).mark_circle(opacity=1).encode(
    x=alt.X('total playtime (min)').title("total playtime (standardized)"),
    y=alt.Y('return rate by plays').title("return rate by plays over a month (standardized)"),
    color=alt.Color('identifier').title("Commitment").scale(domain=domain, range=range_)
)

bydays_rrate_duration | byplays_rrate_duration | pltime_number

Additional plot with non-standardized variables on a log scale (to zoom in on smaller clusters):

In [19]:
pltime_number_log = alt.Chart(df).mark_circle(opacity=1).encode(
    x=alt.X('total playtime (min)').scale(type='log').title("total playtime (min)"),
    y=alt.Y('nb of sessions').scale(type='log').title("number of sessions"),
    color=alt.Color('identifier').title("Commitment").scale(domain=domain, range=range_)
)

pltime_number_log

In [20]:
scaled_df['identifier'].value_counts(normalize=True)


identifier
not interested    0.8
committed         0.2
Name: proportion, dtype: float64

Committed players make up 20% of the whole dataset (25 players); the remaining 80% played over a period of 2 weeks or less.

### K-Nearest Neighbors Classification

Here we follow the natural workflow for performing K-NN classification in accordance with the following 3 steps:

1. Split the dataset into training and test sets
2. Choose the best K parameter for the classifier using *cross-validation*
3. Predict and evaluate model performance on the test set

#### 1. Split the Dataset

We use *train_test_split* to divide the data into a training and a testing set, using *<font color='red'> 65% of the data for training, and 35% for testing. </font>*

It is necessary to only use the training set to train the model in order to avoid biased, erroneously positive results.

Additionally, we verify that the proportion of committed vs. not interested players is well preserved in the training set.
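
To illustrate what stratification does, the sketch below splits a synthetic label vector with the same class balance as our data (25 committed out of 125) and checks the resulting class counts. The `random_state=1` argument and the synthetic labels are assumptions made for this illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with the same 20/80 balance as the real dataset
labels = pd.Series(["committed"] * 25 + ["not interested"] * 100,
                   name="identifier")

# Stratifying on the labels keeps the class proportions in both subsets
train, test = train_test_split(labels, train_size=0.65,
                               stratify=labels, random_state=1)

train_committed = int((train == "committed").sum())
test_committed = int((test == "committed").sum())
print(len(train), train_committed, len(test), test_committed)
```

With 16 committed players among 81 training rows, the committed share is about 19.8%, in line with the proportions reported for the real split.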

In [21]:
commit_train, commit_test = train_test_split(df, train_size=0.65, stratify=df['identifier'])
commit_train['identifier'].value_counts(normalize=True) 


identifier
not interested    0.802469
committed         0.197531
Name: proportion, dtype: float64

#### 2. Choosing K with Cross-Validation

In [22]:
# Create the centering / scaling preprocessor

commit_preprocessor = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include="number"))
)

# Create the K-NN model with K left unspecified

knn_tune = KNeighborsClassifier()

# Create the pipeline

tune_pipeline = make_pipeline(commit_preprocessor, knn_tune)

# Specify the range of K values to try: no need to try large K values given the small dataset

parameter_grid = {"kneighborsclassifier__n_neighbors": [1,3,5,7,9,11,15,24,31]}


We want to maximize our model's accuracy but don't currently know which of the three variables (plotted against time played) will produce the best results. Therefore, it makes sense to create all three models and compare their performance.

First, let's take a look at **Model 1 - Number of sessions**. Two-fold cross-validation (cv=2) was chosen here, as initial tests with cv=10, 5 and 3 gave lower accuracy estimates.

Note that with 31 neighbours the accuracy falls to roughly that of a majority classifier (about 80%), so there is no point in trying larger numbers of neighbours.

In [23]:
# Create the GridSearchCV object with 2-fold cross-validation

commit_tune_grid_sessions = GridSearchCV(
    estimator=tune_pipeline,
    param_grid=parameter_grid,
    cv=2
)

In [24]:
X_sessions = commit_train[['total playtime (min)', 'nb of sessions']]
Y_sessions = commit_train['identifier']
commit_tune_grid_sessions.fit(X_sessions,Y_sessions)


In [25]:
# Compute the accuracy

subset_cv_results_sessions = {key: commit_tune_grid_sessions.cv_results_[key] for key in ["param_kneighborsclassifier__n_neighbors","mean_test_score","std_test_score"]}
accuracies_grid_sessions = pd.DataFrame(subset_cv_results_sessions)
accuracies_grid_sessions["Standard error"] = accuracies_grid_sessions["std_test_score"] / 5**(1/2)
accuracies_grid_sessions.columns=["Neighbors","Accuracy","Standard deviation","Standard error"]

accuracies_grid_sessions

| | Neighbors | Accuracy | Standard deviation | Standard error |
|---|---|---|---|---|
| 0 | 1 | 0.938415 | 0.011585 | 0.005181 |
| 1 | 3 | 0.96311 | 0.01189 | 0.005317 |
| 2 | 5 | 0.96311 | 0.01189 | 0.005317 |
| 3 | 7 | 0.93811 | 0.01311 | 0.005863 |
| 4 | 9 | 0.91372 | 0.01128 | 0.005045 |
| 5 | 11 | 0.864024 | 0.014024 | 0.006272 |
| 6 | 15 | 0.839024 | 0.039024 | 0.017452 |
| 7 | 24 | 0.802439 | 0.002439 | 0.001091 |
| 8 | 31 | 0.802439 | 0.002439 | 0.001091 |


In [26]:
accuracy_vs_k = alt.Chart(accuracies_grid_sessions).mark_line(point=True).encode(
    x=alt.X("Neighbors").title("Neighbors"),
    y=alt.Y("Accuracy").scale(zero=False).title("Accuracy estimate")
)
accuracy_vs_k

The best accuracy for the number of sessions model is achieved with K = 3 (K = 5 gives the same estimate).
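
One caveat on the "Standard error" column: the cells in this section divide the standard deviation by the square root of 5, while the grid searches use other fold counts (cv=2 here). The standard error of a cross-validated mean is usually taken as the standard deviation across folds divided by the square root of the number of folds. A minimal sketch, with `standard_error` as a hypothetical helper name:

```python
import math

def standard_error(std_across_folds, n_folds):
    """Standard error of the mean accuracy over n_folds validation folds."""
    return std_across_folds / math.sqrt(n_folds)

# Standard deviation reported for K = 3 in the number of sessions table
std_k3 = 0.01189

se_as_computed = standard_error(std_k3, n_folds=5)   # divisor used in the notebook
se_two_folds = standard_error(std_k3, n_folds=2)     # divisor matching cv=2
print(round(se_as_computed, 6), round(se_two_folds, 6))
```

In the notebook itself, the fold count could be read from the fitted search object's `n_splits_` attribute rather than hard-coded; the accuracy estimates and the choice of K are unaffected either way.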

In [27]:
commit_tune_grid_sessions.best_params_


{'kneighborsclassifier__n_neighbors': 3}

**Model 2 - Return rate by days**: Cross-validation with cv=12 was chosen here, as initial tests with other values between 2 and 14 gave lower accuracy estimates.

In [28]:
# Create the GridSearchCV object with 12-fold cross-validation
# To keep it clean, define a separate process

commit_tune_grid_bydays = GridSearchCV(
    estimator=tune_pipeline,
    param_grid=parameter_grid,
    cv=12
)
commit_tune_grid_bydays.fit(
    commit_train[['total playtime (min)', 'return rate by days']],
    commit_train['identifier']
)

# Compute the accuracy

subset_cv_results_bydays = {key: commit_tune_grid_bydays.cv_results_[key] for key in ["param_kneighborsclassifier__n_neighbors","mean_test_score","std_test_score"]}
accuracies_grid_bydays = pd.DataFrame(subset_cv_results_bydays)
accuracies_grid_bydays["Standard error"] = accuracies_grid_bydays["std_test_score"] / 5**(1/2)
accuracies_grid_bydays.columns=["Neighbors","Accuracy","Standard deviation","Standard error"]

accuracies_grid_bydays

| | Neighbors | Accuracy | Standard deviation | Standard error |
|---|---|---|---|---|
| 0 | 1 | 0.875 | 0.154647 | 0.06916 |
| 1 | 3 | 0.902778 | 0.107565 | 0.048104 |
| 2 | 5 | 0.938492 | 0.073036 | 0.032663 |
| 3 | 7 | 0.914683 | 0.092958 | 0.041572 |
| 4 | 9 | 0.902778 | 0.107565 | 0.048104 |
| 5 | 11 | 0.890873 | 0.119229 | 0.053321 |
| 6 | 15 | 0.890873 | 0.119229 | 0.053321 |
| 7 | 24 | 0.853175 | 0.101404 | 0.045349 |
| 8 | 31 | 0.803571 | 0.063832 | 0.028547 |


In [29]:
accuracy_vs_k = alt.Chart(accuracies_grid_bydays).mark_line(point=True).encode(
    x=alt.X("Neighbors").title("Neighbors"),
    y=alt.Y("Accuracy").scale(zero=False).title("Accuracy estimate")
)
accuracy_vs_k

K=5 yields the best accuracy for the return rate by days model.

In [30]:
commit_tune_grid_bydays.best_params_

{'kneighborsclassifier__n_neighbors': 5}

**Model 3 - Return rate by plays**: Cross-validation with cv=11 gave the highest accuracy estimates among the values tried.

In [31]:
# Create the GridSearchCV object with 11-fold cross-validation
# Once more, to keep it clean, define a separate process

commit_tune_grid_byplays = GridSearchCV(
    estimator=tune_pipeline,
    param_grid=parameter_grid,
    cv=11
)

commit_tune_grid_byplays.fit(
    commit_train[['total playtime (min)', 'return rate by plays']],
    commit_train['identifier']
)

# Compute the accuracy

subset_cv_results_byplays = {key: commit_tune_grid_byplays.cv_results_[key] for key in ["param_kneighborsclassifier__n_neighbors","mean_test_score","std_test_score"]}
accuracies_grid_byplays = pd.DataFrame(subset_cv_results_byplays)
accuracies_grid_byplays["Standard error"] = accuracies_grid_byplays["std_test_score"] / 5**(1/2)
accuracies_grid_byplays.columns=["Neighbors","Accuracy","Standard deviation","Standard error"]

accuracies_grid_byplays

| | Neighbors | Accuracy | Standard deviation | Standard error |
|---|---|---|---|---|
| 0 | 1 | 0.913961 | 0.084384 | 0.037738 |
| 1 | 3 | 0.86526 | 0.101691 | 0.045478 |
| 2 | 5 | 0.915584 | 0.083442 | 0.037316 |
| 3 | 7 | 0.926948 | 0.067012 | 0.029969 |
| 4 | 9 | 0.926948 | 0.067012 | 0.029969 |
| 5 | 11 | 0.902597 | 0.080353 | 0.035935 |
| 6 | 15 | 0.878247 | 0.085069 | 0.038044 |
| 7 | 24 | 0.839286 | 0.074215 | 0.03319 |
| 8 | 31 | 0.805195 | 0.057716 | 0.025811 |


In [32]:
accuracy_vs_k = alt.Chart(accuracies_grid_byplays).mark_line(point=True).encode(
    x=alt.X("Neighbors").title("Neighbors"),
    y=alt.Y("Accuracy").scale(zero=False).title("Accuracy estimate")
)
accuracy_vs_k

In [33]:
commit_tune_grid_byplays.best_params_

{'kneighborsclassifier__n_neighbors': 7}

K=7 yields the best accuracy for the return rate by plays model.

#### 3. Predict and Evaluate Model Performance on the Test Set

**Model 1 - Number of Sessions**

In [34]:
commit_test["predicted"] = commit_tune_grid_sessions.predict(commit_test[['total playtime (min)', 'nb of sessions']])

# Evaluate model performance

commit_tune_grid_sessions.score(
              commit_test[['total playtime (min)', 'nb of sessions']],
              commit_test['identifier']
           )


0.9545454545454546

The majority class makes up approximately 80% of the training set, so a majority classifier would achieve about 80% accuracy.

With an accuracy of <font color='red'>95.5%</font>, this predictive model does a lot better, but we should still test the other two models and see if their results improve on this percentage.
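
The majority-classifier baseline can be worked out explicitly. In the sketch below, the class counts are those implied by the stratified split (65 and 16 of 81 in training, 35 and 9 of 44 in testing); the helper name `majority_baseline_accuracy` is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Always predict the most common training label; return it with its test accuracy."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_label for label in test_labels)
    return majority_label, hits / len(test_labels)

# Class counts implied by the stratified 65/35 split of 125 players
train_labels = ["not interested"] * 65 + ["committed"] * 16
test_labels = ["not interested"] * 35 + ["committed"] * 9

baseline_label, baseline_accuracy = majority_baseline_accuracy(train_labels, test_labels)
print(baseline_label, round(baseline_accuracy, 3))
```

Any useful model has to beat this baseline of roughly 80%, which is why accuracy alone is complemented by precision and recall for the "committed" class.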


In [35]:
precision_score(
    y_true=commit_test['identifier'],
    y_pred=commit_test["predicted"],
    pos_label='committed'
)


np.float64(0.8888888888888888)

In [36]:
recall_score(
    y_true=commit_test['identifier'],
    y_pred=commit_test["predicted"],
    pos_label='committed'
)


np.float64(0.8888888888888888)

In [37]:
# Print the confusion matrix to show false positive and negative predictions

conf_matrix_sessions = pd.crosstab(
    commit_test['identifier'],
    commit_test["predicted"]
)

conf_matrix_sessions.index = ['Actually committed', 'Actually not interested'] 
conf_matrix_sessions.columns = ['Predicted committed', 'Predicted not interested']

conf_matrix_sessions

| | Predicted committed | Predicted not interested |
|---|---|---|
| Actually committed | 8 | 1 |
| Actually not interested | 1 | 34 |
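The precision, recall, and accuracy reported for Model 1 can be recovered directly from this confusion matrix. The following is a worked check with the counts typed in by hand, treating "committed" as the positive class (as in the `pos_label` argument above); the variable names are illustrative.

```python
# Counts copied from the Model 1 confusion matrix above,
# with "committed" as the positive class.
tp, fn = 8, 1    # actually committed: predicted committed / not interested
fp, tn = 1, 34   # actually not interested: predicted committed / not interested

# Precision: of the players predicted committed, how many really are.
precision = tp / (tp + fp)

# Recall: of the players who really are committed, how many were found.
recall = tp / (tp + fn)

# Accuracy: share of all test players classified correctly.
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(round(precision, 4), round(recall, 4), round(accuracy, 4))
```

The same arithmetic applies to the confusion matrices of the other two models.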


**Model 2 - Return Rate by Days**

In [38]:
# Make a copy of commit_test to keep things separate

commit_test_bydays = commit_test.copy()
commit_test_bydays["predicted"] = commit_tune_grid_bydays.predict(commit_test_bydays[['total playtime (min)', 'return rate by days']])

# Evaluate model performance

commit_tune_grid_bydays.score(
              commit_test_bydays[['total playtime (min)', 'return rate by days']],
              commit_test_bydays['identifier']
           )

0.8863636363636364

In [39]:
precision_score(
    y_true=commit_test_bydays['identifier'],
    y_pred=commit_test_bydays["predicted"],
    pos_label='committed'
)

np.float64(0.7)

In [40]:
recall_score(
    y_true=commit_test_bydays['identifier'],
    y_pred=commit_test_bydays["predicted"],
    pos_label='committed'
)

np.float64(0.7777777777777778)

In [41]:
# Print the confusion matrix to show false positive and negative predictions

conf_matrix_bydays = pd.crosstab(
    commit_test_bydays['identifier'],
    commit_test_bydays["predicted"]
)

conf_matrix_bydays.index = ['Actually committed', 'Actually not interested'] 
conf_matrix_bydays.columns = ['Predicted committed', 'Predicted not interested']

conf_matrix_bydays


| | Predicted committed | Predicted not interested |
|---|---|---|
| Actually committed | 7 | 2 |
| Actually not interested | 3 | 32 |


**Model 3 - Return Rate by Plays**

In [42]:
# Make a copy of commit_test to keep things separate

commit_test_byplays = commit_test.copy()
commit_test_byplays["predicted"] = commit_tune_grid_byplays.predict(commit_test_byplays[['total playtime (min)', 'return rate by plays']])

# Evaluate model performance

commit_tune_grid_byplays.score(
              commit_test_byplays[['total playtime (min)', 'return rate by plays']],
              commit_test_byplays['identifier']
           )


0.9090909090909091

In [43]:
precision_score(
    y_true=commit_test_byplays['identifier'],
    y_pred=commit_test_byplays["predicted"],
    pos_label='committed'
)


np.float64(0.7777777777777778)

In [44]:
recall_score(
    y_true=commit_test_byplays['identifier'],
    y_pred=commit_test_byplays["predicted"],
    pos_label='committed'
)

np.float64(0.7777777777777778)

In [45]:
# Print the confusion matrix to show false positive and negative predictions

conf_matrix_byplays = pd.crosstab(
    commit_test_byplays['identifier'],
    commit_test_byplays["predicted"]
)
conf_matrix_byplays.index = ['Actually committed', 'Actually not interested'] 
conf_matrix_byplays.columns = ['Predicted committed', 'Predicted not interested']

conf_matrix_byplays

| | Predicted committed | Predicted not interested |
|---|---|---|
| Actually committed | 7 | 2 |
| Actually not interested | 2 | 33 |


## 2.5: Extra Exploratory Plots

In [46]:
# Exploratory plots to check whether committed players have a routine

tidywhole['identifier'] = np.where(tidywhole['hashedEmail'].isin(commit_list), 'committed', 'not interested')

explore = tidywhole[['start_time','identifier']]

# Plot all sessions and color according to commitment

domain = ['committed', 'not interested']
range_ = ['lightblue', 'orange']

scat = alt.Chart(explore.reset_index()).mark_circle(opacity=1).encode(
    x=alt.X("index").title('Play session'),
    y=alt.Y("start_time:T", timeUnit='hoursminutes').title('Time when played'),
    color=alt.Color('identifier').title("Commitment").scale(domain=domain, range=range_)
)

scat2 = alt.Chart(explore.reset_index()).mark_circle(opacity=1).encode(
    x=alt.X("index").title('Play session'),
    y=alt.Y("start_time:N", timeUnit='day').title('Day when played'),
    color=alt.Color('identifier').title("Commitment").scale(domain=domain, range=range_)
)


In [47]:
# Exploratory plots to check whether committed players fall in certain age range

# Plot all players and color according to commitment

domain = ['committed', 'not interested']
range_ = ['lightblue', 'orange']

age_df = players[["hashedEmail", "age"]]
df = pd.merge(df, age_df, how="inner", on=["hashedEmail"])
scat3 = alt.Chart(df.assign(playerID=df.index+1)).mark_circle(opacity=1).encode(
    x=alt.X("playerID").title("Player"),
    y=alt.Y("age").title("Age of the player (in yrs)"),
    color=alt.Color('identifier').title("Commitment").scale(domain=domain, range=range_)
)

In [48]:
scat | scat2 | scat3

From these plots, we observe that:
- there is no apparent relationship between commitment and the time or day of play.
- there is no apparent relationship between commitment and the age of the player.

## 3: Conclusion

### Results

The performance of each model is as follows:
- Model 1 (number of sessions) accuracy is 95.5% with 2 false flags (1 false pos, 1 false neg)
- Model 2 (return rate by days) accuracy is 88.6% with 5 false flags (3 false pos, 2 false neg)
- Model 3 (return rate by plays) accuracy is 90.9% with 4 false flags (2 false pos, 2 false neg)

Recall that false negatives are users who are engaged in reality but not identified as such by the predictive model, and false positives are users who are not interested in the game but are identified as committed by the predictive model. While we want both of these numbers to be low, minimizing the number of false positives is slightly more important.
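One way to express the preference for fewer false positives in a single number is an F-beta score with beta below 1, which weights precision more heavily than recall. This metric is not used in the analysis above; the sketch below is only an illustration, with beta = 0.5 chosen arbitrarily and the counts copied from the three confusion matrices.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta score from raw counts; beta < 1 favours precision over recall."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# (true positives, false positives, false negatives) for the "committed" class,
# copied from the confusion matrices in this report.
models = {
    "Model 1 (number of sessions)": (8, 1, 1),
    "Model 2 (return rate by days)": (7, 3, 2),
    "Model 3 (return rate by plays)": (7, 2, 2),
}

scores = {name: f_beta(*counts) for name, counts in models.items()}
for name, score in scores.items():
    print(f"{name}: F0.5 = {score:.3f}")

best_model = max(scores, key=scores.get)
```

With these counts the ranking matches the accuracy ranking, with Model 1 first, so the preference for precision does not change which model comes out ahead.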

Model 1 fared the best out of the three, showing that using playtime and the number of sessions was the best way to predict commitment based on the available information.

No correlation was found between commitment and age of player, time of day during sessions, or day of the week during their sessions.

### Interpretation

We began this project by asking whether commitment could be operationalized. The results - while promising - show that measuring the interest of participants is a complex task: a truly accurate model would require a large volume of training data, as well as information about a player's initial commitment. No single piece of identity can be algorithmically converted into a prediction; individuals are too complex for their behaviours to be predicted from demographic data such as age, gender, or experience. A classification model that draws on more predictor variables might improve prediction accuracy for a complicated, multifaceted task such as this.

Model 1 likely performed better than the models reliant on return rate because its variables are closely connected to our definition of a "committed player": a player who has contributed more sessions and more hours to the research has probably played over a longer period of time (rather than cramming). It is unfortunate that the models based on return rate didn't fare as well; had they done so, we would need less data and could potentially make accurate predictions about a player from the first week alone. However, this would require two assumptions, the second of which is a glaring issue underlying our entire report. First, we would need to assume that the player would keep up a consistent, routine rate going forward. Second, we would need to assume that they would *remain* committed once the excitement of the first few weeks wore off.
 
Nevertheless, we achieved our goal of creating a sleek classification model that can gauge commitment based on past participation, which would be helpful for quickly identifying determined participants who might be interested in helping with future research, among other uses.