# Player Characteristics and Subscribing to a Game-Related Newsletter

## Introduction

Video games attract players of all ages, and understanding player behavior can help researchers and developers engage users more effectively. This project investigates how a player’s age and total time spent playing Minecraft relate to the likelihood of subscribing to a game-related newsletter, which we treat as an indicator of engagement. We use the players.csv dataset from a UBC Minecraft research server, which contains 196 players and nine variables. The analysis focuses on age, played hours, and subscription status; other variables, such as experience and gender, provide additional context but are not central to the study.

First, we import the libraries needed for our analysis and load the dataset from its CSV file.

In [1]:
import altair as alt
import pandas as pd
import numpy as np
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import (
    GridSearchCV,
    cross_validate,
    train_test_split,
)

alt.data_transformers.enable('vegafusion')

set_config(transform_output="pandas")

In [2]:
players = pd.read_csv("https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz")
players

| | experience | subscribe | hashedEmail | played_hours | name | gender | age | individualId | organizationName |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | | |
| 1 | Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397... | 3.8 | Christian | Male | 17 | | |
| 2 | Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3... | 0.0 | Blake | Male | 17 | | |
| 3 | Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f... | 0.7 | Flora | Female | 21 | | |
| 4 | Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb... | 0.1 | Kylie | Male | 21 | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 191 | Amateur | True | b6e9e593b9ec51c5e335457341c324c34a2239531e1890... | 0.0 | Bailey | Female | 17 | | |
| 192 | Veteran | False | 71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778... | 0.3 | Pascal | Male | 22 | | |
| 193 | Amateur | False | d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29... | 0.0 | Dylan | Prefer not to say | 17 | | |
| 194 | Amateur | False | f19e136ddde68f365afc860c725ccff54307dedd13968e... | 2.3 | Harlow | Male | 17 | | |


## Method and Results

Before building the model, we explore the dataset to choose suitable predictor variables. We are interested in whether certain types of players are more likely to subscribe. In the dataset loaded above, players are described by their age, the number of hours they played, and their experience level, which makes these variables a natural starting point for choosing predictors. Visualizing the relationship between these variables and whether a player is subscribed will show whether any of them is meaningfully related to our target. Figure 1 plots hours played against age, with each player's experience level shown by colour and subscription status shown by shape.

In [3]:
players_plot_1 = alt.Chart(players).mark_point(opacity = 0.6).encode(
    x = alt.X("age").title("Age"),
    y = alt.Y("played_hours").title("Time Played (Hours)").scale(type = "sqrt"),
    shape = alt.Shape("subscribe").title("Subscribed"),
    color = alt.Color("experience").title("Experience")
).properties(
    title = "Figure 1: Age vs Time Played for Various Player Types",
    width = 700,
    height = 500
)

players_plot_1

From this plot, there is no apparent trend between player experience level and subscription status, shown by each point's colour and shape respectively. Hence, we will not use player experience level as a predictor of our target variable, subscription status. To assess whether time played and age are related to subscription, we simplify the graph and reduce noise by dropping the experience-level colouring and instead colouring each point by whether the player has subscribed.
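
As a complement to the visual check, the subscription rate within each experience level could also be tabulated. The sketch below uses a small made-up data frame (`toy_players`, a hypothetical stand-in with invented values, not the real data) that has the same `experience` and `subscribe` columns; on the real data, `players` would take its place.

```python
import pandas as pd

# Hypothetical stand-in for `players`; values are invented for illustration.
toy_players = pd.DataFrame({
    "experience": ["Pro", "Pro", "Veteran", "Veteran", "Amateur", "Amateur"],
    "subscribe": [True, False, True, False, True, True],
})

# The mean of a boolean column is the proportion of True values,
# so this gives the share of subscribers per experience level,
# alongside the number of players in each level.
subscribe_rate = (
    toy_players.groupby("experience")["subscribe"]
    .agg(rate="mean", n="size")
    .sort_values("rate", ascending=False)
)
print(subscribe_rate)
```

Reporting the group size `n` next to the rate helps avoid over-reading differences between levels that contain only a few players.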

In [4]:
players_plot_2 = alt.Chart(players).mark_point(opacity = 0.6).encode(
    x = alt.X("age").title("Age"),
    y = alt.Y("played_hours").title("Time Played (Hours)").scale(type = "sqrt"),
    color = alt.Color("subscribe").title("Subscribed")
).properties(
    title = "Figure 2: Age vs Time Played of All Players",
    width = 700,
    height = 500
)

players_plot_2

Looking at this new plot, we can see a few trends. Many of the points coloured as subscribed tend to fall past the age of 25 or above 15 hours played. This suggests that both a player's age and the amount of time they have played are related to whether they choose to subscribe to a game-related newsletter. We will therefore use both as predictor variables in our classification model to predict our target variable, 'subscribe'.

Moving ahead, we will not need the other columns of our dataset, so we drop them. Apart from this small modification, no further wrangling is needed since the data is already tidy. This will be the dataset we use for our model.

In [5]:
players_reduced = players.drop(columns = ["experience", "individualId", "organizationName", "hashedEmail", "name", "gender"])
players_reduced

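| | subscribe | played_hours | age |
| --- | --- | --- | --- |
| 0 | True | 30.3 | 9 |
| 1 | True | 3.8 | 17 |
| 2 | False | 0.0 | 17 |
| 3 | True | 0.7 | 21 |
| 4 | True | 0.1 | 21 |
| ... | ... | ... | ... |
| 191 | True | 0.0 | 17 |
| 192 | False | 0.3 | 22 |
| 193 | False | 0.0 | 17 |
| 194 | False | 2.3 | 17 |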


To solve this classification problem, we will use a k-nearest-neighbours model. We first need to find the number of neighbours that maximizes the model's cross-validation accuracy (equivalently, minimizes its error). To start, we set a random seed of 2025 so that the results are exactly reproducible, and split the data into a training set (75%) and a test set (25%), stratified by 'subscribe'. We then create a preprocessor to standardize 'played_hours' and 'age', since they are on different scales. Finally, we combine this preprocessor in a pipeline with a k-nearest-neighbours classifier that has no specific number of neighbours set, so that we can grid search for the best 'n_neighbors' value ourselves.
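
To illustrate why standardization matters for k-nearest neighbours, the sketch below compares Euclidean distances before and after scaling. The four points are invented for illustration (they are not rows of the dataset), and the standardization is done by hand with NumPy rather than with the `StandardScaler` used in the notebook.

```python
import numpy as np

# Invented (played_hours, age) rows; not taken from the dataset.
X = np.array([
    [0.0, 17.0],    # row 0: the query player
    [5.0, 25.0],    # row 1: similar play time, 8 years older
    [40.0, 17.0],   # row 2: same age, 40 more hours played
    [200.0, 21.0],  # row 3: a heavy player, which widens the hours scale
])

def distances_from_first(points):
    """Euclidean distance from row 0 to each of the other rows."""
    return np.linalg.norm(points[1:] - points[0], axis=1)

raw = distances_from_first(X)

# Standardize each column to mean 0 and standard deviation 1
# (population standard deviation, as StandardScaler does).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = distances_from_first(X_scaled)

print("raw distances:   ", raw)
print("scaled distances:", scaled)
```

In raw units, row 1 is the query's nearest neighbour because a 40-hour gap outweighs an 8-year gap numerically. After scaling, each gap is measured relative to its column's spread, and row 2 becomes the nearest neighbour instead.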

In [6]:
np.random.seed(2025)

players_train, players_test = train_test_split(players_reduced, train_size = 0.75, stratify = players_reduced["subscribe"])

X_train = players_train[["played_hours", "age"]]
y_train = players_train["subscribe"]

X_test = players_test[["played_hours", "age"]]
y_test = players_test["subscribe"]

players_preprocessor = make_column_transformer(
    (StandardScaler(), ["played_hours", "age"]),
    remainder = "passthrough",
)

knn_general = KNeighborsClassifier()

players_pipeline = make_pipeline(players_preprocessor, knn_general)

# Fit once with the default n_neighbors (5); the value is tuned in the next step
players_pipeline.fit(X_train, y_train)

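Next, we tune the number of neighbours with a grid search over values from 1 to 35, using 10-fold cross-validation on the training set. The output below shows the cross-validation results for the first ten values of 'n_neighbors'.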

In [12]:
param_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 36)
}

players_tune_grid = GridSearchCV(
    estimator = players_pipeline,
    param_grid = param_grid,
    cv = 10  # since we don't have much data, we can afford to do more CV folds
)

players_model_grid = players_tune_grid.fit(X_train, y_train)
accuracies_grid = pd.DataFrame(players_model_grid.cv_results_)
accuracies_grid


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_kneighborsclassifier__n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.004492 | 0.000335 | 0.004203 | 0.00024 | 1 | {'kneighborsclassifier__n_neighbors': 1} | 0.866667 | 0.6 | 0.6 | 0.533333 | 0.6 | 0.666667 | 0.666667 | 0.285714 | 0.642857 | 0.714286 | 0.617619 | 0.139647 | 34 |
| 1 | 0.004443 | 0.000473 | 0.004081 | 3.5e-05 | 2 | {'kneighborsclassifier__n_neighbors': 2} | 0.533333 | 0.333333 | 0.666667 | 0.533333 | 0.533333 | 0.6 | 0.466667 | 0.285714 | 0.571429 | 0.714286 | 0.52381 | 0.126992 | 35 |
| 2 | 0.004299 | 0.000311 | 0.004081 | 7.4e-05 | 3 | {'kneighborsclassifier__n_neighbors': 3} | 0.733333 | 0.6 | 0.866667 | 0.533333 | 0.666667 | 0.733333 | 0.666667 | 0.785714 | 0.785714 | 0.785714 | 0.715714 | 0.094642 | 17 |
| 3 | 0.004191 | 4.3e-05 | 0.004054 | 2.5e-05 | 4 | {'kneighborsclassifier__n_neighbors': 4} | 0.666667 | 0.6 | 0.933333 | 0.533333 | 0.533333 | 0.6 | 0.466667 | 0.857143 | 0.785714 | 0.785714 | 0.67619 | 0.148079 | 30 |
| 4 | 0.004167 | 3.7e-05 | 0.004038 | 2.2e-05 | 5 | {'kneighborsclassifier__n_neighbors': 5} | 0.733333 | 0.666667 | 0.8 | 0.6 | 0.533333 | 0.666667 | 0.666667 | 0.785714 | 0.714286 | 0.714286 | 0.688095 | 0.076525 | 29 |
| 5 | 0.004213 | 6.7e-05 | 0.004049 | 3e-05 | 6 | {'kneighborsclassifier__n_neighbors': 6} | 0.666667 | 0.6 | 0.866667 | 0.6 | 0.533333 | 0.6 | 0.666667 | 0.857143 | 0.785714 | 0.785714 | 0.69619 | 0.112703 | 27 |
| 6 | 0.006169 | 0.005926 | 0.004068 | 4.9e-05 | 7 | {'kneighborsclassifier__n_neighbors': 7} | 0.733333 | 0.666667 | 0.8 | 0.6 | 0.6 | 0.666667 | 0.666667 | 0.857143 | 0.785714 | 0.714286 | 0.709048 | 0.081161 | 23 |
| 7 | 0.004179 | 5.2e-05 | 0.005743 | 0.004962 | 8 | {'kneighborsclassifier__n_neighbors': 8} | 0.666667 | 0.666667 | 0.733333 | 0.6 | 0.6 | 0.6 | 0.666667 | 0.714286 | 0.785714 | 0.642857 | 0.667619 | 0.058971 | 32 |
| 8 | 0.004241 | 0.000113 | 0.007056 | 0.008587 | 9 | {'kneighborsclassifier__n_neighbors': 9} | 0.6 | 0.733333 | 0.666667 | 0.466667 | 0.6 | 0.6 | 0.666667 | 0.714286 | 0.785714 | 0.714286 | 0.654762 | 0.086642 | 33 |
| 9 | 0.004202 | 0.000153 | 0.004405 | 0.001051 | 10 | {'kneighborsclassifier__n_neighbors': 10} | 0.6 | 0.733333 | 0.733333 | 0.533333 | 0.6 | 0.6 | 0.666667 | 0.714286 | 0.785714 | 0.714286 | 0.668095 | 0.076607 | 31 |
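
One way to read this table is to sort by `mean_test_score` and consider the spread across folds. The sketch below re-enters only the ten rows displayed above (k = 1 to 10, values copied from the output), so it cannot reproduce the full 35-row search. The `std_error` column is an added illustration, computed as `std_test_score` divided by the square root of the 10 folds.

```python
import pandas as pd

# Only the ten displayed rows (k = 1..10); the full grid has 35 rows.
shown = pd.DataFrame({
    "n_neighbors": range(1, 11),
    "mean_test_score": [0.617619, 0.52381, 0.715714, 0.67619, 0.688095,
                        0.69619, 0.709048, 0.667619, 0.654762, 0.668095],
    "std_test_score": [0.139647, 0.126992, 0.094642, 0.148079, 0.076525,
                       0.112703, 0.081161, 0.058971, 0.086642, 0.076607],
})

n_folds = 10  # cv = 10 in the grid search
shown["std_error"] = shown["std_test_score"] / n_folds ** 0.5

# Best of the displayed rows by mean cross-validation accuracy.
best_shown = shown.sort_values("mean_test_score", ascending=False).iloc[0]
print(best_shown)
```

Among these ten rows, k = 3 has the highest mean accuracy, but its `rank_test_score` of 17 in the table shows that larger values of k outside the displayed rows score higher.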


Figure 3 plots the mean cross-validation accuracy against the number of neighbours.

In [8]:
players_grid_accuracy = alt.Chart(accuracies_grid).mark_line(point = True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors")
    .title("Number of K Neighbors")
    .scale(zero = False),
    y=alt.Y("mean_test_score")
    .title("Model Accuracy")
    .scale(zero = False)
).properties(
    title = "Figure 3: Classifier Model Accuracy vs Number of Neighbors",
    width = 700,
    height = 500
)   

players_grid_accuracy

In [13]:
players_tune_grid.best_params_

{'kneighborsclassifier__n_neighbors': 26}

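The grid search selects 26 neighbours as the value with the highest mean cross-validation accuracy. We therefore refit the pipeline on the full training set with 'n_neighbors' set to 26 and use it to predict subscription status for the test set.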

In [10]:
knn = KNeighborsClassifier(n_neighbors = 26)
players_pipeline = make_pipeline(players_preprocessor, knn)
players_fit = players_pipeline.fit(X_train, y_train)
predicted = players_fit.predict(X_test)
players_df = X_test.assign(predicted = predicted, actual = y_test)

players_df

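| | played_hours | age | predicted | actual |
| --- | --- | --- | --- | --- |
| 31 | 0.1 | 21 | True | True |
| 62 | 1.0 | 17 | True | True |
| 189 | 0.0 | 17 | True | False |
| 187 | 0.0 | 17 | True | True |
| 167 | 0.3 | 17 | True | False |
| 153 | 0.1 | 17 | True | True |
| 111 | 4.0 | 21 | True | True |
| 149 | 0.0 | 16 | True | True |
| 163 | 0.5 | 20 | True | True |
| 121 | 0.1 | 24 | True | True |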
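
As an aside, `GridSearchCV` refits the best pipeline on the whole training set by default (`refit=True`), so the tuned model is also available as `best_estimator_` without rebuilding the pipeline by hand. The sketch below shows this on synthetic data from `make_classification`; the data, the grid of 1 to 10 neighbours, the 5 folds, and the variable names are illustrative assumptions, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data standing in for (played_hours, age).
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0,
    random_state=2025,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=2025
)

pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(
    pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 11)},
    cv=5,
)
grid.fit(X_train, y_train)

# With refit=True (the default), best_estimator_ is the pipeline refit
# on all of X_train using the best parameters found.
best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
tuned_model = grid.best_estimator_
test_accuracy = tuned_model.score(X_test, y_test)
print(best_k, test_accuracy)
```

Passing `random_state` to `train_test_split`, as done here, is an alternative to setting the global NumPy seed and keeps the split reproducible even if other random calls are added earlier in the notebook.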


In [11]:
players_pipeline.score(
    X_test,
    y_test,
)

0.7551020408163265

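The model's accuracy on the test set is approximately 0.755 (75.5%). Note that 'score' returns accuracy rather than error; the corresponding test error is approximately 0.245.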
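
Accuracy alone can hide what kind of mistakes a classifier makes. As a hedged illustration, the sketch below cross-tabulates the ten test rows displayed above (copied from the output; the remaining test rows are not shown, so these counts do not describe the full test set) and compares the result with a baseline that always predicts the most common class.

```python
import pandas as pd

# Only the ten displayed rows of players_df, not the full test set.
shown = pd.DataFrame({
    "predicted": [True] * 10,
    "actual": [True, True, False, True, False, True, True, True, True, True],
})

# Rows are actual labels, columns are predicted labels.
confusion = pd.crosstab(shown["actual"], shown["predicted"])
print(confusion)

accuracy = (shown["predicted"] == shown["actual"]).mean()

# Majority-class baseline: always predict the most frequent actual label.
majority_label = shown["actual"].mode()[0]
baseline_accuracy = (shown["actual"] == majority_label).mean()
print(accuracy, baseline_accuracy)
```

On these ten rows every prediction is True, so the classifier and the majority-class baseline score the same. Running the same comparison on the full `players_df` would show whether that also holds for the whole test set.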

## Discussion

!!!

To do: finish writing the methods section, discussion, and plot; add references?