PLAYTIME - an analysis of who the target player is when marketing a game
-

Introduction:
-

&nbsp;&nbsp;&nbsp;&nbsp; In this report, the prompt being answered is **"We would like to know which 'kinds' of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts."** 

&nbsp;&nbsp;&nbsp;&nbsp; To do this, we must analyse user behavior, which is the study of how people interact with a service, product, or interface. In this specific report, the aim is to determine how individuals interact with a particular game and how this data can be used to visualize and conceptualize a strategy for targeting these players for recruitment, whether that's for a different game, a research study, or another motivated campaign. To determine this, the dataset that will be used is players.csv. We will further specify the prompt to the question: **"Can we predict the hours a player played using age, gender, and experience?"**

&nbsp;&nbsp;&nbsp;&nbsp; Firstly, players.csv is a flat file containing 196 observations (rows) and 9 variables, with one record per individual player. Each record holds the player's name, gender, age, played_hours, experience, subscription status, and a hashed email that serves as an identifier. We will only use the relevant variables 'played_hours', 'age', 'gender', and 'experience' to answer our question; they are described as follows:

| Variable Name | Data type | Variable type | Description |
| --- | --- | --- | --- |
| experience | object | ordinal | player's experience with Minecraft with 5 categories : Pro, Veteran, Regular, Amateur, Beginner|
| played_hours | float | continuous | player's total playtime in hours |
| gender | object | nominal | player's gender with 7 categories: 'Female', 'Male', 'Non-binary', 'Two-Spirited', 'Prefer not to say', 'Agender', 'Other'|
| age | integer | discrete | player's age |

&nbsp;&nbsp;&nbsp;&nbsp; Furthermore, the analysis of players.csv must account for several potential data issues:
* Every value is missing in the columns 'individualId' and 'organizationName', which makes them unusable for segmentation, so they will be dropped.
* The 'gender' categorical variable has seven possible categories. There are very few observations for genders other than 'Male', 'Female' and 'Non-binary', so our model will not have enough data to accurately predict 'played_hours' for these genders. To avoid this, we will perform data analysis on only those three genders.
* The exact data collection methodology is unknown, leading to possible inaccuracies.
* The 'experience' variable is self-reported, so there could be bias and inconsistency in how players understand each category.
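
&nbsp;&nbsp;&nbsp;&nbsp; The sparsity of some gender categories can be checked with a simple frequency count. The sketch below uses a small made-up data frame (its values are illustrative and not taken from players.csv) and a hypothetical cut-off called `min_count`; on the real data, the same calls would be applied to the 'gender' column of the loaded data frame.

```python
import pandas as pd

# Toy stand-in for the 'gender' column; values are illustrative only.
toy = pd.DataFrame({
    "gender": ["Male"] * 6 + ["Female"] * 4 + ["Non-binary"] * 3 + ["Agender", "Other"],
})

min_count = 3  # hypothetical threshold for "enough observations"

# Count how many rows fall into each category.
counts = toy["gender"].value_counts()

# Keep only the categories that meet the threshold.
keep = counts[counts >= min_count].index.tolist()
filtered = toy[toy["gender"].isin(keep)]

print(counts)
print(keep)
```

&nbsp;&nbsp;&nbsp;&nbsp; Inspecting the counts first makes the choice of which categories to keep explicit rather than hard-coded.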

Methods and Results
-


&nbsp;&nbsp;&nbsp;&nbsp; First, we must **tidy** our data before any analysis. We load all the necessary packages for our analysis, then set the random seed to ensure the replicability of our results. Next, we load our 'players' dataset by saving the url as a string and reading it with pd.read_csv. The file has no metadata rows, so the function needs no additional arguments.

In [1]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import set_config
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import warnings

np.random.seed(10)

url_players = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"

players = pd.read_csv(url_players)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


&nbsp;&nbsp;&nbsp;&nbsp; Once we can see the loaded dataset, we use the info function to see a summary of the dataset prior to any cleaning or wrangling.

In [2]:
players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   experience        196 non-null    object 
 1   subscribe         196 non-null    bool   
 2   hashedEmail       196 non-null    object 
 3   played_hours      196 non-null    float64
 4   name              196 non-null    object 
 5   gender            196 non-null    object 
 6   age               196 non-null    int64  
 7   individualId      0 non-null      float64
 8   organizationName  0 non-null      float64
dtypes: bool(1), float64(3), int64(1), object(4)
memory usage: 12.6+ KB


&nbsp;&nbsp;&nbsp;&nbsp; To begin, we first clean the data by dropping unnecessary columns with the drop function, leaving the columns 'experience', 'played_hours', 'gender', and 'age'.

&nbsp;&nbsp;&nbsp;&nbsp; We then wrangle the data. First, we drop the gender categories 'Agender', 'Other', 'Prefer not to say', and 'Two-Spirited' by putting them in a list and filtering out the matching rows with the isin function. Then, we assign numerical values, stored as strings, to the categories of the categorical variables. We encode 'experience' (ordinal) with the numbers 1-5 and 'gender' (nominal) with the numbers 1-3, using the replace function and a dictionary. The exact assignment can be seen in the code.

&nbsp;&nbsp;&nbsp;&nbsp; In the cell below, we clean and wrangle the data as stated. We also split the dataset 'new_players' into a 0.8/0.2 train/test set to use the training set for initial visualisations.

In [3]:
new_players = players.drop(columns=["name", "hashedEmail",'individualId','organizationName','subscribe' ])
to_drop = ['Agender', 'Other', 'Prefer not to say', 'Two-Spirited']
new_players = new_players[new_players['gender'].isin(to_drop) == False]
new_players

new_players = new_players.replace({'Beginner': '1', 'Amateur': '2', 'Regular': '3', 'Veteran': '4', 'Pro': '5', 'Male': '1', 'Female': '2', 'Non-binary': '3'})

players_train, players_test = train_test_split(
    new_players, train_size = 0.8
)
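
&nbsp;&nbsp;&nbsp;&nbsp; Because replace with a single flat dictionary acts on every column of the data frame, one alternative is to map each column with its own dictionary and store integer codes. This is only a sketch on toy rows, not the report's data; `experience_codes` and `gender_codes` are names introduced here for illustration, and the report's own code keeps the codes as strings.

```python
import pandas as pd

# Toy rows with the same four columns as the working dataset (illustrative values).
toy = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur", "Regular", "Beginner"],
    "gender": ["Male", "Male", "Female", "Non-binary", "Female"],
    "played_hours": [30.3, 3.8, 0.7, 0.1, 0.0],
    "age": [9, 17, 21, 21, 17],
})

# One dictionary per column, so a value can never be recoded in the wrong column.
experience_codes = {"Beginner": 1, "Amateur": 2, "Regular": 3, "Veteran": 4, "Pro": 5}
gender_codes = {"Male": 1, "Female": 2, "Non-binary": 3}

encoded = toy.assign(
    experience=toy["experience"].map(experience_codes),
    gender=toy["gender"].map(gender_codes),
)

print(encoded.dtypes)
```

&nbsp;&nbsp;&nbsp;&nbsp; With Series.map, any category missing from the dictionary becomes NaN, which makes unexpected labels easy to spot.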

&nbsp;&nbsp;&nbsp;&nbsp; Now we use the info function again to see a summary of our training set. There are now only the 4 relevant variables and 140 observations in our working dataset.

In [4]:
players_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 140 entries, 102 to 9
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   experience    140 non-null    object 
 1   played_hours  140 non-null    float64
 2   gender        140 non-null    object 
 3   age           140 non-null    int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 5.5+ KB


&nbsp;&nbsp;&nbsp;&nbsp; Now we will move on to **visualise** the data.

&nbsp;&nbsp;&nbsp;&nbsp; We will use bar charts to visualise the distribution of two numerical variables and one categorical variable at a time. A bar chart was chosen to maximise the number of variables we could visualise in a single figure. We plot the sum of hours played by each subgroup (bin) of 5 years of players' ages.

&nbsp;&nbsp;&nbsp;&nbsp; Then, we colour-code the bars to represent a third variable. The colours were put into a colour-blindness simulator to find that the categories are distinguishable from each other. In Figure 1, we encode 'gender' to colour and in Figure 2, we encode 'experience' with Minecraft to colour. A legend next to each figure denotes the colour's meaning.

&nbsp;&nbsp;&nbsp;&nbsp; The code to create the visualisations of the training set is written in the cell below using functions from the altair library. Inside the encode function, we set the x and y axes, the colour, and the legend. The variables 'gender' and 'experience' are specified to be categorical using ':N', and their legend labels are mapped from the numerical codes back to the original category names.

In [5]:
age_chart = alt.Chart(players_train).mark_bar().encode(
    x = alt.X('age', bin=True, title="Player age"),
    y = alt.Y('played_hours', title="Total number of hours played"),
    color = alt.Color(
        "gender:N",
        legend=alt.Legend(
            title="Gender",
            labelExpr="{'1':'Male','2':'Female','3':'Non-Binary'}[datum.label]"
        )
    )
).properties(
    title="Figure 1 : Total hours played by player age and gender"
)

age_chart_exp = alt.Chart(players_train).mark_bar().encode(
    x = alt.X('age', bin=True, title="Player age"),
    y = alt.Y('played_hours', title="Total number of hours played"),
    color = alt.Color(
        "experience:N",
        legend=alt.Legend(
            title="Experience level",
            labelExpr="{'1':'Beginner','2':'Amateur','3':'Regular','4':'Veteran','5':'Pro'}[datum.label]"
        )
    )
).properties(
    title="Figure 2 : Total hours played by player age and experience level"
)
(age_chart | age_chart_exp).resolve_scale(
    color='independent'
)

&nbsp;&nbsp;&nbsp;&nbsp; In both figures, the largest bar occurs at age 15-20, with ~400 hours played. The second largest bar is at age 20-25, with ~350 hours played. The other age ranges contribute significantly fewer hours. This implies that players aged 15-25 are more likely to contribute the largest amounts of data. One possible reason is that more of the recorded players are in that age range.

&nbsp;&nbsp;&nbsp;&nbsp; In Figure 1, the largest bar has a significant majority of female players contributing to the number of hours. The second largest bar has a majority of hours played by non-binary players. Comparing the sizes of the largest two bars' colours, female players can be observed to contribute the most hours played.

&nbsp;&nbsp;&nbsp;&nbsp; In Figure 2, the largest bar is composed of a majority of hours from amateur level players, followed very closely by regular level players. The second largest bar has a majority of hours from regulars and a significant portion of hours from amateurs. This figure does not allow an accurate comparison of which experience level contributes the most hours, but since amateur and regular players contribute similar numbers of hours, both should be considered.

&nbsp;&nbsp;&nbsp;&nbsp; So from our visualisations, we can tentatively expect that the players contributing the largest amounts of data are more likely to be 15-25 years old, female, and have some prior experience with Minecraft (amateur or regular).
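
&nbsp;&nbsp;&nbsp;&nbsp; Since a tall bar can come either from many players or from a few very active ones, it can help to tabulate the total, the mean, and the number of players in each age bin. The sketch below does this with pandas on made-up values (not the report's data); the 5-year bin edges are an assumption chosen to mirror the bins described above.

```python
import pandas as pd

# Illustrative ages and playtimes; not taken from players.csv.
toy = pd.DataFrame({
    "age": [16, 17, 17, 19, 21, 22, 24, 31, 42],
    "played_hours": [0.0, 3.8, 150.0, 0.5, 0.7, 0.1, 48.0, 2.0, 1.0],
})

# Left-closed 5-year bins: [15, 20), [20, 25), ..., [40, 45).
bin_edges = list(range(15, 50, 5))
toy["age_bin"] = pd.cut(toy["age"], bins=bin_edges, right=False)

# Total hours (the bar height), mean hours per player, and player count per bin.
summary = toy.groupby("age_bin", observed=True)["played_hours"].agg(
    total="sum", mean="mean", players="count"
)

print(summary)
```

&nbsp;&nbsp;&nbsp;&nbsp; If a bin has a large total but an unremarkable mean, its height mostly reflects how many players fall in that age range.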



&nbsp;&nbsp;&nbsp;&nbsp; Now, we will begin our **data analysis**.

&nbsp;&nbsp;&nbsp;&nbsp; We first create the preprocessor with the scikit-learn library, applying StandardScaler to the predictor variables 'age', 'gender', and 'experience'. This standardises the predictors so that each has a mean of 0 and a standard deviation of 1, which puts all of them on the same scale for the distance calculations in K-NN.

&nbsp;&nbsp;&nbsp;&nbsp; Then we create a pipeline that chains the preprocessor and KNeighborsRegressor steps together. After, we further split our training and testing sets into X (predictor variables) and y (predicted variable) components, using double square brackets for data frames and single square brackets for series.

In [6]:
preprocessor = make_column_transformer((StandardScaler(), ["age", 'gender', 'experience']))
pipeline = make_pipeline(preprocessor, KNeighborsRegressor())

X_train = players_train[["age",'gender','experience']]
y_train = players_train["played_hours"]

X_test = players_test[["age",'gender','experience']]
y_test = players_test["played_hours"]
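
&nbsp;&nbsp;&nbsp;&nbsp; To make the standardisation step concrete, the sketch below compares StandardScaler with the same calculation done by hand. The small predictor matrix is made up for illustration, with columns standing in for age, gender code, and experience code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy predictor matrix (illustrative values): age, gender code, experience code.
X = np.array(
    [[9, 1, 5], [17, 1, 4], [21, 2, 2], [21, 1, 3], [17, 3, 1]],
    dtype=float,
)

scaled = StandardScaler().fit_transform(X)

# Same result by hand: subtract each column's mean and divide by its
# population standard deviation (ddof=0), which is what StandardScaler uses.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

&nbsp;&nbsp;&nbsp;&nbsp; Inside the pipeline, the means and standard deviations are learned from the training data only and then reused when predicting on new data.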

&nbsp;&nbsp;&nbsp;&nbsp; To create the GridSearchCV object, we first create a grid of candidate numbers of nearest neighbors (1-20) for our pipeline. Then, we create the object, specifying the pipeline, the parameter grid, the number of folds, and the scoring method. We chose 5-fold cross-validation to guard against over-fitting to a single validation split while maintaining efficiency. We set the scoring argument to "neg_root_mean_squared_error": GridSearchCV always maximises its score, so maximising the negative RMSE (Root Mean Squared Error) is equivalent to minimising the RMSE. Finally, we fit the GridSearchCV object to our training set.

In [7]:
param_grid = {
    "kneighborsregressor__n_neighbors": range(1, 21, 1),
}
gridsearch = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

gridsearch.fit(X_train, y_train)
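
&nbsp;&nbsp;&nbsp;&nbsp; The sign convention of this scorer can be seen on a small example. The sketch below runs 5-fold cross-validation on synthetic data (generated here purely for illustration) and flips the sign of the returned scores to recover one RMSE per fold.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 1, size=60)

# The scorer returns the NEGATIVE RMSE of each fold, so higher is better.
scores = cross_val_score(
    KNeighborsRegressor(n_neighbors=5),
    X,
    y,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# Flip the sign to read the values as ordinary RMSEs.
rmse_per_fold = -scores

print(rmse_per_fold)
print(rmse_per_fold.mean())
```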



&nbsp;&nbsp;&nbsp;&nbsp; Then, we look at the results. First, we turn the results into a data frame, calculate the standard error, keep only relevant columns, rename the "param_kneighborsregressor__n_neighbors" column to be easily readable, then convert our negative RMSE values to positive values. Our results are a dataframe with columns: number of neighbors, RMSE value, and standard error.

In [8]:
results = pd.DataFrame(gridsearch.cv_results_)
results["sem_test_score"] = results["std_test_score"] / 5**(1/2)
results = (
    results[[
        "param_kneighborsregressor__n_neighbors",
        "mean_test_score",
        "sem_test_score"
    ]]
    .rename(columns={"param_kneighborsregressor__n_neighbors": "n_neighbors"})
)
results["mean_test_score"] = -results["mean_test_score"]
results

Unnamed: 0,n_neighbors,mean_test_score,sem_test_score
0,1,39.723825,2.053479
1,2,29.984438,4.181399
2,3,28.714744,4.453355
3,4,27.767422,5.054012
4,5,26.953113,5.600589
5,6,26.807141,5.515104
6,7,26.395416,5.616042
7,8,25.954935,5.865495
8,9,25.406784,5.891382
9,10,24.803296,5.484399


&nbsp;&nbsp;&nbsp;&nbsp; Now we evaluate how our model performs on the test set by using the predict function; after fitting, GridSearchCV by default refits the pipeline on the whole training set with the best k value and uses that model for prediction. Then, we calculate the RMSPE (Root Mean Squared Prediction Error), which measures the model's performance on unseen test data, as opposed to the RMSE above, which was estimated by cross-validation within the training data.

In [9]:
# Predict from the predictor columns only, so the target column is never passed to the model
players_test["predicted"] = gridsearch.predict(X_test)

RMSPE = mean_squared_error(
    y_true= y_test,
    y_pred= players_test["predicted"]
)**(1/2)
RMSPE

np.float64(38.936351343222505)

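&nbsp;&nbsp;&nbsp;&nbsp; The calculated RMSPE is 38.94 (to 2 decimal places). In context, this means that our model's predictions of new players' 'played_hours' are typically off by roughly 38.94 hours. Because the errors are squared before they are averaged, a few very large misses weigh heavily on this figure.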
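
&nbsp;&nbsp;&nbsp;&nbsp; As a small worked example of how this metric is built, the sketch below computes the root mean squared error by hand and with scikit-learn, and contrasts it with the mean absolute error. The true and predicted values are made up for illustration and are not the report's test set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Illustrative values only; not taken from the report's test set.
y_true = np.array([0.0, 0.3, 2.3, 30.3, 150.0])
y_pred = np.array([5.0, 5.0, 5.0, 10.0, 20.0])

# RMSPE: square the errors, average them, then take the square root.
rmspe_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmspe_sklearn = mean_squared_error(y_true, y_pred) ** 0.5

# Mean absolute error, for contrast: it weights every miss equally.
mae = mean_absolute_error(y_true, y_pred)

print(rmspe_manual, rmspe_sklearn, mae)
```

&nbsp;&nbsp;&nbsp;&nbsp; In this toy case the single large miss on the 150-hour player pushes the root mean squared value well above the mean absolute error, which is the kind of effect a skewed playtime distribution can produce.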

&nbsp;&nbsp;&nbsp;&nbsp; Now, we will look at how the RMSE varies with different numbers of neighbors. We create a visualisation of each number of neighbors and its corresponding RMSE value. The code is similar to our previous visualisations in setting the x and y axes and titles, but here, we create a line chart and do not need a legend or colour.

In [10]:
analysis_vis = alt.Chart(results).mark_line(point=True).encode(
    x=alt.X("n_neighbors").title("Number of neighbors"),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Cross-validation RMSE")
).properties(
    title="Figure 3: RMSE for different KNN"
)

analysis_vis

&nbsp;&nbsp;&nbsp;&nbsp; From Figure 3, we can see that the best-performing model uses 11 neighbors, because it has the lowest RMSE value. We can also confirm this with the code:

In [11]:
gridsearch.best_params_

{'kneighborsregressor__n_neighbors': 11}
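
&nbsp;&nbsp;&nbsp;&nbsp; The cross-validation RMSE that belongs to the best k can be read from the same fitted object. The sketch below shows this on synthetic data with a smaller grid than the one used in this report; the data and the variable names `search`, `best_k`, and `best_rmse` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=80)

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": range(1, 11)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

# best_score_ is the mean negative RMSE of the best candidate, so flip its sign.
best_k = search.best_params_["n_neighbors"]
best_rmse = -search.best_score_

# The same value is the minimum of the sign-flipped mean test scores.
cv_rmse = -search.cv_results_["mean_test_score"]

print(best_k, best_rmse)
```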

Discussion
-

Summary of what was found:
>The core objective of the project was to determine whether a player's playtime could be predicted from their age, gender, and experience. The initial dataset comprised 196 player observations. Preprocessing involved dropping irrelevant identifying columns, standardizing the predictors, and mapping the categorical variables for experience and gender to numerical codes so that they would be in a more usable format. For the prediction task, we used a K-Nearest Neighbors (K-NN) regressor. GridSearchCV was used to find the optimal number of neighbors (k) for the K-NN model across the range 1-20 by minimizing the cross-validation RMSE.
>
>Furthermore, the analysis revealed that the optimal number of neighbors (k) is 11, with an RMSE of 24.15 hours and an RMSPE of 38.94 hours. These are large errors, indicating that the model has very little predictive value, and the errors will be proportionally largest for players with low actual played hours. Therefore, we conclude that the model is not suitable for accurately predicting individual playing hours.
 
Discussion of the expectedness/unexpectedness:
> In the visualizations (graphs), high played hours appeared to be associated with age, experience, and gender, with the highest number of played hours coming from female players of amateur to regular level aged around 15-25. Something to keep in mind is the bias in self-reported experience levels, which might negatively impact the model's accuracy, particularly for K-NN regression models.
>
> Something unexpected was the predictive power of the model. Looking at the RMSPE and the RMSE, the numbers are extremely high, which points to the low predictive power of the model. This suggests that age, gender, and experience might not be primary drivers of sustained high play time. The played hours might have more correlation to unmeasured behavioral factors, such as in-game events, social network use, etc. Although it was expected that the variables provided would not provide a perfect predictive capability, it was not predicted that it would be as low as it was.
 
Discussion of impacts and improvements that could be made:
> By looking at the visualizations, target demographics for high-volume data contribution can be identified. As mentioned above, there is a clear spike in the contribution of data from players aged 15-25 with amateur to regular experience. A recruitment strategy should be tailored to these individuals, as they contributed the most hours in our data.
>
> With regard to the model, its predictive power is limited by its sole reliance on age, gender, and experience. Our analysis shows that while the visualizations can point to the general groups that log the most game time, the model cannot accurately predict how much each individual plays. To improve this, we suggest collecting more data with more variables so that a model with better predictive ability can be created.

Discussion of possible future questions:
> Future work could explore which additional variables could better predict playtime, or whether a different or more complex model could improve prediction. Overall, while the model was not able to predict playtime accurately, the analysis revealed insights into the limitations of the data and highlighted a potential direction for deeper, more informative future studies.