# Predicting Newsletter Subscription from Player Behaviour: A Classification Approach

In [1]:
### Run this cell before continuing.

import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics import recall_score, precision_score
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')

# Output dataframes instead of arrays
set_config(transform_output="pandas")

A Computer Science research group at UBC has gathered data on how people play video games, using Minecraft in this study. Our team decided to respond to the following question posed by the research group: "What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?" More specifically, we ask: "Can we predict whether players subscribe to the newsletter based on their age and hours played in the game?"

For this purpose, we will work with the "players" dataset. Let's begin by importing the data.

In [2]:
url = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players_df = pd.read_csv(url)
players_df

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [3]:
players_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   experience        196 non-null    object 
 1   subscribe         196 non-null    bool   
 2   hashedEmail       196 non-null    object 
 3   played_hours      196 non-null    float64
 4   name              196 non-null    object 
 5   gender            196 non-null    object 
 6   age               196 non-null    int64  
 7   individualId      0 non-null      float64
 8   organizationName  0 non-null      float64
dtypes: bool(1), float64(3), int64(1), object(4)
memory usage: 12.6+ KB


The dataframe contains 196 observations and 9 columns. Below is a summary of the variables, their data types, and descriptions:
    
1. experience (object): how experienced the players are.
2. subscribe (boolean): whether the player has subscribed to the newsletter.
3. hashedEmail (object): players’ hashed email addresses.
4. played_hours (float): total hours playing the game.
5. name (object): players’ names.
6. gender (object): players’ gender.
7. age (integer): players’ ages.
8. individualId (float): individual ID.
9. organizationName (float): organization name.

Overall, the dataframe needs some cleaning, as the values in both "individualId" and "organizationName" are missing for all observations.
However, since the project’s guiding question focuses on "subscribe," "age," and "played_hours", we simply select these three columns for data wrangling and future analysis.

In [4]:
players_tidy = players_df[['age','played_hours','subscribe']]
players_tidy

Unnamed: 0,age,played_hours,subscribe
0,9,30.3,True
1,17,3.8,True
2,17,0.0,False
3,21,0.7,True
4,21,0.1,True
...,...,...,...
191,17,0.0,True
192,22,0.3,False
193,17,0.0,False
194,17,2.3,False


The "players_tidy" dataframe is now tidy! Let's visualize the data to understand it better.

In [5]:
players_plot = alt.Chart(players_tidy, title="Fig. 1. Plot of Playing Time vs. Age of the Players").mark_point(size=20, opacity=0.5).encode(
    x=alt.X('age').title('Age of Players (in years)'),
    y=alt.Y('played_hours').title('Playing Time (in hours)'),
    # Use a constant colour: players_tidy has no 'label' column to encode
    color=alt.value('steelblue')
).configure_axis(titleFontSize=12)
players_plot


alt.Chart(...)

The visualization shows the relationship between players’ age (in years) and their playing time (in hours). The majority of players have fewer than 20 hours of playtime and are between approximately 15 and 30 years old. There are also a few outliers, with some players having logged roughly 150 to 220 hours of playtime.

Next, let's colour the data points based on newsletter subscription to see whether any noticeable patterns emerge between age, playing time, and subscription status.

In [6]:
players_plot_classified = alt.Chart(players_tidy, title="Fig. 2. Relationship Between Players’ Age and Playing Time, by Subscription Status").mark_point(size=20, opacity = 0.5).encode(
    x = alt.X('age')
    .title("Players' Age (in years)"),
    y = alt.Y('played_hours')
    .title('Playing Time (in hours)'),
    color=alt.Color("subscribe")
    .legend(orient="right")
    .title("Subscription Status")
    .scale(scheme="dark2"),
    shape="subscribe"
).configure_axis(titleFontSize=12)
players_plot_classified

From the scatterplot, there is no strong visual association between players’ subscription status and either their age or total hours played. However, we do observe that all non-subscribers have fewer than 10 hours of playtime and are older than 15 years.

Our team aims to address the guiding question of this project using a classification approach. Specifically, we will use “age” and “played_hours” (both continuous numerical variables) as the predictors for the “subscribe” (boolean) response variable.
To model the relationship between these variables, we use the K-Nearest Neighbours (KNN) algorithm, which classifies a new observation according to the majority class among its K closest neighbours in the training data, where closeness is measured as the distance between predictor values.
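
The voting rule described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the scikit-learn implementation used later in this notebook; the function `knn_predict` and the data points are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, new_point, k):
    """Return the majority label among the k training points closest to new_point."""
    # Sort training indices by Euclidean distance to the new observation
    order = sorted(
        range(len(train_points)),
        key=lambda i: dist(train_points[i], new_point),
    )
    nearest_labels = [train_labels[i] for i in order[:k]]
    # Majority vote among the k nearest neighbours
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical (age, played_hours) pairs, for illustration only
train_points = [(17, 0.0), (18, 0.2), (21, 0.5), (22, 30.0), (19, 28.0), (20, 32.0)]
train_labels = [False, False, False, True, True, True]

prediction = knn_predict(train_points, train_labels, new_point=(20, 29.0), k=3)
print(prediction)
```

With two classes, an odd K avoids tied votes, which is one reason odd values are often preferred.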

However, there are several challenges associated with this approach:

1. The number of observations in the players dataframe is relatively small, and having a larger sample size would improve the model’s accuracy and reliability.
2. Several players have 0.0 hours of playing time, which can lead to overplotting and make patterns in the data less distinguishable.
3. The presence of outliers (players with unusually high playtime) can influence the performance of the KNN model.
4. The predictor variables (age and played_hours) are measured in different units, which may bias the distance calculations in KNN.

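To address the last of these challenges, we can apply StandardScaler() to standardize the predictor variables before training the model. This puts age and played_hours on a comparable scale, so that differences in units do not distort the model’s distance-based calculations. Standardization does not remove outliers, however, so players with extreme playtime can still influence the results, and the small sample size remains a limitation.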
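
To see why scaling matters for a distance-based method, the sketch below compares the distance between two players before and after standardization. The values are hypothetical, and the helper `standardize` is written by hand (mean 0, population standard deviation 1) to mirror what standardization does rather than calling StandardScaler itself.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical (age, played_hours) values, for illustration only
ages = [17, 21, 25, 19, 30]
hours = [0.0, 0.7, 150.0, 3.8, 30.3]

def standardize(values):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

raw = list(zip(ages, hours))
scaled = list(zip(standardize(ages), standardize(hours)))

# Distance between the first and third player, before and after scaling
raw_distance = dist(raw[0], raw[2])
scaled_distance = dist(scaled[0], scaled[2])

# Share of the squared distance that comes from played_hours alone
hours_share_raw = (raw[2][1] - raw[0][1]) ** 2 / raw_distance ** 2
hours_share_scaled = (scaled[2][1] - scaled[0][1]) ** 2 / scaled_distance ** 2
print(round(hours_share_raw, 3), round(hours_share_scaled, 3))
```

In this toy example the raw distance is driven almost entirely by played_hours, while after scaling age contributes a meaningful share. The player with 150 hours is still far from everyone else, which illustrates that scaling alone does not neutralize outliers.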

We now proceed by splitting the data into training and testing sets, using 75% of the observations for training and the remaining 25% for testing. This allows us to fit the model on one subset of the data and then evaluate its performance on previously unseen observations. After the split, we define a preprocessing step using StandardScaler, specify the KNN classifier along with a range of K values, and combine both components into a single pipeline for cleaner, more reproducible model fitting.

In [7]:
players_train, players_test = train_test_split(
    players_tidy, 
    test_size = 0.25,
    random_state = 123
)
players_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 147 entries, 100 to 109
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   age           147 non-null    int64  
 1   played_hours  147 non-null    float64
 2   subscribe     147 non-null    bool   
dtypes: bool(1), float64(1), int64(1)
memory usage: 3.6 KB


In [8]:
players_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 49 entries, 136 to 82
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   age           49 non-null     int64  
 1   played_hours  49 non-null     float64
 2   subscribe     49 non-null     bool   
dtypes: bool(1), float64(1), int64(1)
memory usage: 1.2 KB


In [9]:
players_processor = make_column_transformer(
    (StandardScaler(),['age','played_hours']),
    remainder="passthrough",
    verbose_feature_names_out=False
)

X_train = players_train[['age','played_hours']]
y_train = players_train['subscribe']

X_test = players_test[['age','played_hours']]
y_test = players_test['subscribe']

knn = KNeighborsClassifier()

players_pipe = make_pipeline(players_processor, knn)

param_grid = {
    "kneighborsclassifier__n_neighbors": range(2, 25, 1),
}
players_pipe

To estimate how well the KNN models generalize to unseen data, we use 10-fold cross-validation, which evaluates each model across ten different training–validation splits rather than relying on a single train–test division. Below, we compute the cross-validated mean accuracy and the corresponding standard error for each K value in our specified range and collect the results in a dataframe.
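
As a small illustration of how the mean accuracy and its standard error are obtained from the fold scores, here is a sketch in plain Python. The ten fold accuracies are made up for the example, and the spread is assumed to be measured with the population standard deviation (ddof = 0).

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical validation accuracies from 10 folds, for illustration only
fold_accuracies = [0.80, 0.73, 0.67, 0.80, 0.73, 0.87, 0.71, 0.79, 0.64, 0.79]

n_folds = len(fold_accuracies)
mean_accuracy = mean(fold_accuracies)
# Population standard deviation (ddof = 0) across the fold scores
std_accuracy = pstdev(fold_accuracies)
# Standard error of the mean: standard deviation divided by sqrt(number of folds)
sem_accuracy = std_accuracy / sqrt(n_folds)

print(f"mean = {mean_accuracy:.3f}, sem = {sem_accuracy:.3f}")
```

The division by the square root of 10 corresponds to the `10**(1/2)` term used when building the `sem_test_score` column.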

In [10]:
knn_tune_grid = GridSearchCV(
    players_pipe, param_grid, cv = 10,
)

knn_model_grid = knn_tune_grid.fit(X_train, y_train)

accuracies_grid = pd.DataFrame(knn_model_grid.cv_results_)
accuracies_grid["sem_test_score"] = accuracies_grid["std_test_score"] / 10**(1/2)
accuracies_grid = (
    accuracies_grid[[
        "param_kneighborsclassifier__n_neighbors",
        "mean_test_score",
        "sem_test_score"
    ]]
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
)
accuracies_grid



Unnamed: 0,n_neighbors,mean_test_score,sem_test_score
0,2,0.570476,0.040946
1,3,0.659048,0.04308
2,4,0.611905,0.039471
3,5,0.749048,0.027726
4,6,0.674762,0.026963
5,7,0.755238,0.025672
6,8,0.708095,0.026368
7,9,0.735238,0.013837
8,10,0.715238,0.021684
9,11,0.727619,0.020066


To better understand how the model performs as k increases, we visualize the accuracy for each value of k in our range in the line plot below.

In [11]:
accuracies_grid['label'] = 'Mean Accuracy'
accuracy_versus_k_grid = alt.Chart(accuracies_grid, title="Fig. 3. Plot of Accuracy Scores vs. K Values").mark_line(point = True).encode(
    x=alt.X('n_neighbors')
        .title('Number of Neighbours (K)')
        .scale(zero=False),
    y=alt.Y('mean_test_score')
        .title('Mean Accuracy')
        .scale(zero=False),
    color = alt.Color('label:N', legend=alt.Legend(title="Legend")),
)
accuracy_versus_k_grid

The graph above indicates that the KNN classifier achieves its highest cross-validated accuracy on the training set when K = 7. At this value, the mean accuracy is approximately 76%, with an associated standard error of about 3% (from the accuracies_grid dataframe). One standard error on either side of the mean gives an interval of roughly 73% to 78%; because of sampling variability, the classifier's true average accuracy could still fall outside this range.
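
The interval quoted above can be reproduced from the unrounded values in the K = 7 row of the accuracies_grid output. This is a plain one-standard-error band around the mean, not a formal confidence interval.

```python
# Values for K = 7 copied from the accuracies_grid output above
mean_accuracy = 0.755238
standard_error = 0.025672

lower = mean_accuracy - standard_error
upper = mean_accuracy + standard_error
print(f"{lower:.3f} to {upper:.3f}")  # prints 0.730 to 0.781
```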

Having identified the optimal K using cross-validation on the training data, the next step is to fit the model on the training set and use it to predict the unseen data (players_test), which lets us evaluate its performance on new observations.
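
One way to keep the final evaluation consistent with the tuning step is to fit the scaler and the classifier together as a single pipeline, so that the test data is standardized with the statistics learned from the training data. The sketch below shows this on synthetic stand-in data (not the players dataset); the variable names and the data-generating choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two predictors on very different scales
rng = np.random.default_rng(123)
X = np.column_stack([
    rng.integers(9, 50, size=200),      # age-like values
    rng.exponential(10.0, size=200),    # played-hours-like values
])
y = rng.random(200) < 0.7  # roughly 70% positives, unrelated to X

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123
)

# The scaler learns its mean and standard deviation from the training data only;
# the same transformation is applied to the test data inside predict and score
final_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
final_pipe.fit(X_train, y_train)

test_predictions = final_pipe.predict(X_test)
test_accuracy = final_pipe.score(X_test, y_test)
print(test_accuracy)
```

In this notebook's own terms, `knn_model_grid.best_estimator_` would be such a pipeline, since GridSearchCV refits the best configuration on the full training data by default.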

In [12]:
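# Note: this classifier is fit on the raw predictors, so the StandardScaler step
# from players_pipe is not applied here; the outputs below come from this unscaled fit.
best_knn = KNeighborsClassifier(n_neighbors = 7)

best_fit = best_knn.fit(X_train, y_train)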

best_fit_df = players_test.assign(
    predicted = best_fit.predict(X_test)
)
best_fit_df

Unnamed: 0,age,played_hours,subscribe,predicted
136,20,0.0,True,True
4,21,0.1,True,True
81,17,1.0,True,True
181,22,0.8,True,True
161,17,0.0,False,True
154,19,0.0,True,True
62,17,1.0,True,True
187,17,0.0,True,True
122,32,0.1,False,True
185,18,0.1,False,True


In [13]:
best_fit_acc = best_fit.score(X_test,y_test)
best_fit_acc

0.7346938775510204

The model’s accuracy in predicting players’ newsletter subscription status is about 73% on the test data. In other words, it correctly classifies whether players subscribe to the newsletter in 73% of cases. However, a closer look at the results shows that the model performs much better for players who do subscribe to the newsletter, and makes noticeably more errors when predicting the class of players who do not subscribe. 

Let's further analyze this by calculating the precision and recall of the model:

In [14]:
precision_score(
    y_true=best_fit_df["subscribe"],
    y_pred=best_fit_df["predicted"],
    pos_label= True
)

np.float64(0.723404255319149)

In [15]:
recall_score(
    y_true=best_fit_df["subscribe"],
    y_pred=best_fit_df["predicted"],
    pos_label= True
)

np.float64(1.0)

## Discussion

#### Summarize What We Found
The model has a recall of 1.0 for the positive class, meaning it successfully identifies every player who truly subscribes to the newsletter. In other words, the classifier does not miss any actual subscribers. However, the precision for this class is lower, around 72%, which indicates that some players who do not subscribe are incorrectly predicted as subscribers. These false positives reduce precision, even though recall remains perfect.
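
The reported metrics pin down the underlying counts. With 49 test observations, a precision of 0.7234 can only be 34/47, and a recall of 1.0 means no subscriber was missed. The counts below are inferred from those figures rather than read from a confusion matrix in this notebook, so they should be treated as a reconstruction.

```python
# Counts inferred from the reported metrics and the 49-row test set
# (a reconstruction, not an output of the notebook)
true_positives = 34   # subscribers predicted as subscribers
false_positives = 13  # non-subscribers predicted as subscribers
false_negatives = 0   # subscribers predicted as non-subscribers
true_negatives = 2    # non-subscribers predicted as non-subscribers

total = true_positives + false_positives + false_negatives + true_negatives
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
accuracy = (true_positives + true_negatives) / total
# Share of actual non-subscribers that the model recognizes (specificity)
specificity = true_negatives / (true_negatives + false_positives)

print(total, round(precision, 4), recall, round(accuracy, 4), round(specificity, 4))
```

Under this reconstruction, the model labels 47 of the 49 test players as subscribers and recognizes only 2 of the 15 non-subscribers.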

Overall, returning to the guiding question of this project, whether players’ age and hours played can be used to predict newsletter subscription, we find that these variables do have some predictive value, but they are not strong predictors on their own. The model identifies subscribers reliably, but it is much less reliable at recognizing non-subscribers, many of whom are predicted to subscribe, as reflected by the lower precision for the positive class. This outcome aligns with our initial observation from Figure 2, where we saw no strong visual association between age, hours played, and subscription status.

#### Discuss Whether This is What We Expected to Find
Based on our initial exploration of the dataset, especially the scatterplot in Figure 2, we did not observe a clear visual separation between subscribers and non-subscribers. This early observation suggested that age and hours played were unlikely to be strong predictors of whether a player chooses to subscribe to the newsletter. Because our research question specifically asked whether these two characteristics could meaningfully predict subscription behaviour, we expected that the model would show only limited predictive ability. This interpretation is further reinforced by the KNN accuracy curve: cross-validated accuracy is lowest for the smallest values of K (about 57% at K = 2) and improves as K grows towards 7, implying that some smoothing is required to overcome noise, which is consistent with the weak predictive structure of age and hours played.

The model’s results confirm this expectation. Although it achieves a perfect recall of 1.0 for the subscriber class, its precision is noticeably lower, indicating that many players predicted to subscribe actually do not. This imbalance highlights that the model struggles to reliably distinguish true subscribers from non-subscribers when relying solely on age and gameplay time. The overall accuracy of about 73% further supports the idea that these variables have only weak predictive value. In the context of our guiding question, the findings reinforce our initial hypothesis that age and hours played do not strongly differentiate player types in terms of newsletter subscription.
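
A useful reference point for the 73% figure is a baseline that always predicts "subscribe". Because recall is 1.0 and precision equals 34/47 on a 49-row test set, the test set must contain 34 subscribers; this count is inferred from the reported metrics rather than printed anywhere in the notebook.

```python
# Class counts in the 49-row test set, inferred from the reported precision and recall
# (a reconstruction, not an output of the notebook)
n_test = 49
n_subscribers = 34

# Accuracy of a baseline that always predicts "subscribe"
baseline_accuracy = n_subscribers / n_test
model_accuracy = 0.7346938775510204  # test accuracy reported above

improvement = model_accuracy - baseline_accuracy
print(f"baseline = {baseline_accuracy:.3f}, model = {model_accuracy:.3f}, gain = {improvement:.3f}")
```

The model improves on this baseline by only about four percentage points (2 of 49 test players), which supports the reading that age and hours played carry weak predictive signal.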

#### Discuss What Impact Could Such Findings Have
These findings suggest that relying only on age and hours played to predict newsletter subscription would have limited practical value. Since the model identifies all true subscribers but also produces many false positives, using these predictions for targeted marketing could result in reaching out to a large number of players who are not actually interested in subscribing. This would reduce the efficiency of any outreach strategy and could lead to unnecessary use of resources.

At the same time, the results indicate that age and gameplay time alone are not enough to accurately characterize players’ interest in the newsletter. If a company wants to build a more reliable prediction system, it would need to incorporate additional behavioral or engagement features. In this sense, our findings highlight that meaningful personalization or targeted communication requires richer data, and using weak predictors may create misleading expectations about player behavior.


#### Discuss What Future Questions Could This Lead To

Building on these analyses and findings, several future questions naturally arise. Because our model shows limited predictive value when relying only on age and gameplay hours, with a precision of approximately 72%, one key direction for future work is to investigate whether different predictors, algorithms, or a broader set of variables would improve performance. Comparing our KNN classifier with other classification models, such as logistic regression, could show whether the weak predictive performance is inherent to the data or specific to the method we used. Additionally, exploring alternative feature combinations and incorporating additional variables through data wrangling may reveal stronger predictors that are not immediately visible in the raw players dataset. As we progressed further in the course, we also recognized that clustering techniques might help reveal natural groupings of players that are not captured by the simple relationship between age and played hours that we examined.

In addition, these findings raise practical questions about how such predictions could support real-world decision-making. For example, we want to investigate whether identifying players with lower gameplay hours could help target engagement efforts more effectively, and whether such strategies would produce meaningful changes in player behaviour over time. Finally, as our analysis was built on static characteristics of players, future work should assess how player behaviour evolves over time and whether a player's likelihood of subscribing to the newsletter changes as they continue to interact with the game. Together, these questions show the importance of understanding the models in depth and of having rich datasets before predictive tools can be reliably applied in practical settings.