**Predicting Subscription Status from Player Demographics and Behaviour**

**Introduction:**  

Vast amounts of behavioral data can be generated from video games, revealing how players interact, make decisions, and engage with online communities. Researchers can design more effective engagement strategies and optimize digital experiences by understanding this data. The UBC Computer Science research group is conducting a study to collect player data and analyze engagement patterns using a Minecraft server in an open-world gaming environment. As players explore the server, their demographic details and gameplay activity are recorded to help answer key questions about online participation and interest in community content.

This project investigates the question:
What player characteristics and behaviors are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

To answer this question, using the "players.csv" dataset, we analyzed data from 196 players who were on the Minecraft server. Each observation represents one player and includes variables such as:

- `experience`: self-reported skill level (Beginner, Amateur, Regular, Veteran, Pro)
- `played_hours`: total gameplay time in hours
- `age`: player’s age in years
- `gender`: demographic category
- `subscribe`: indicates whether the player subscribed to the newsletter

We removed four identifier variables (hashedEmail, name, individualId, and organizationName) because they provided no analytical value and raised privacy concerns. We cleaned and reshaped the dataset into a tidy format, with each variable forming a column and each observation (one player) forming a row. This structure allows for a clear analysis of how demographic and behavioral factors influence newsletter subscriptions.
By combining gameplay data and demographic insights, the study aims to identify which player groups are most likely to engage with community communications, which provides valuable information for future recruitment and outreach strategies within the gaming research community.


**Data Wrangling: Methods & Results**

**1) Load the Dataset:**
   
The dataset was loaded from a remote URL containing information about 196 Minecraft server players. Initial inspection revealed 9 columns, including demographic information (gender, age) and behavioural metrics (experience level, played hours).

In [20]:
import pandas as pd
#load the dataset from the internet
url="https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players=pd.read_csv(url)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


**2) Clean the data:**
   
Data wrangling began with type conversion: played_hours and age were converted to numeric using pd.to_numeric() with errors='coerce', which turns non-numeric entries into missing values. Missing-value analysis identified gaps in several columns. We dropped rows with missing values in the key predictors (played_hours, age, experience, gender), accepting the loss of any incomplete rows to maintain data quality. The subscribe variable was converted to boolean type for classification purposes. Feature selection removed columns that did not contribute to prediction: hashedEmail, name, individualId, and organizationName. This reduced the dataset to 5 essential columns. The final cleaned dataset contained 196 complete observations.

In [21]:
import pandas as pd
import numpy as np

# Convert played_hours and age to numeric; unparseable entries become NaN
players['played_hours'] = pd.to_numeric(players['played_hours'], errors='coerce')
players['age'] = pd.to_numeric(players['age'], errors='coerce')

# Drop rows with missing values in critical columns
players.dropna(subset=['played_hours', 'age', 'experience', 'gender'], inplace=True)

# Ensure subscribe is boolean
if players['subscribe'].dtype != 'bool':
    players['subscribe'] = players['subscribe'].astype(bool)
    
# Drop irrelevant columns
irrelevant_columns = ['hashedEmail', 'name', 'individualId','organizationName'] 
players.drop(columns=[col for col in irrelevant_columns if col in players.columns], inplace=True)

players

Unnamed: 0,experience,subscribe,played_hours,gender,age
0,Pro,True,30.3,Male,9
1,Veteran,True,3.8,Male,17
2,Veteran,False,0.0,Male,17
3,Amateur,True,0.7,Female,21
4,Regular,True,0.1,Male,21
...,...,...,...,...,...
191,Amateur,True,0.0,Female,17
192,Veteran,False,0.3,Male,22
193,Amateur,False,0.0,Prefer not to say,17
194,Amateur,False,2.3,Male,17
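
The effect of `errors='coerce'` followed by `dropna` can be seen on a tiny example. This is a minimal sketch on a hypothetical toy frame (the values are made up and are not taken from `players.csv`); it only illustrates how an unparseable entry becomes `NaN` and is then dropped.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "played_hours": ["30.3", "n/a", "0.7"],
    "age": [9, 17, None],
    "subscribe": [True, False, True],
})

# errors="coerce" turns entries that cannot be parsed (e.g. "n/a") into NaN
toy["played_hours"] = pd.to_numeric(toy["played_hours"], errors="coerce")
toy["age"] = pd.to_numeric(toy["age"], errors="coerce")

# Drop any row that has NaN in one of the listed columns
toy_clean = toy.dropna(subset=["played_hours", "age"])

print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Only the first toy row survives, because the other two each have a missing value in one of the listed columns.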


**3) Summary of the dataset:**

Descriptive analysis revealed key dataset characteristics. The target variable showed 75% of players subscribed to the newsletter, indicating moderate class imbalance that could influence model performance. The played_hours variable exhibited a right-skewed distribution with a mean of about 5.8 hours and a median of 0.1 hours, with some high-engagement outliers exceeding 200 hours. Age ranged from 8 to 99 years, with potential data entry errors at the upper extreme (91 to 99). Gender distribution was heavily imbalanced with 79% male players, reflecting common gaming demographic patterns. Experience levels were relatively well-distributed across Beginner, Amateur, Regular, Veteran, and Pro categories, with Amateur being the most common.

In [22]:
players.describe()

Unnamed: 0,played_hours,age
count,196.0,196.0
mean,5.845918,21.280612
std,28.357343,9.706346
min,0.0,8.0
25%,0.0,17.0
50%,0.1,19.0
75%,0.6,22.0
max,223.1,99.0


In [23]:
players['subscribe'].value_counts()  # How many True vs False
players['experience'].value_counts()  # How many in each category

experience
Amateur     63
Veteran     48
Regular     36
Beginner    35
Pro         14
Name: count, dtype: int64
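
Class proportions such as the subscription rate can be read directly from a boolean column. The sketch below uses hypothetical labels rather than the study data, and shows two equivalent ways to get the share of `True` values.

```python
import pandas as pd

# Hypothetical labels, not the study data
labels = pd.Series([True, True, True, False], name="subscribe")

# normalize=True returns proportions instead of raw counts
proportions = labels.value_counts(normalize=True)

# The mean of a boolean column is the share of True values
share_true = labels.mean()

print(proportions)
print(share_true)
```

Note that in a notebook only the last expression of a cell is displayed, so each result needs its own cell or an explicit `print`.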

**4) Visualizations:**

**Figure 1:** Subscription Rate by Experience Level:
A bar chart displaying newsletter subscription rates across player experience levels. This visualization reveals whether more experienced players show different engagement patterns with newsletter content.

In [24]:
import altair as alt 
exp_sub_rates = players.groupby('experience')['subscribe'].agg(['sum', 'count', 'mean']).reset_index()
exp_sub_rates.columns = ['experience', 'subscribed', 'total', 'rate']

fig1 = alt.Chart(players).mark_bar().encode(
    x=alt.X('experience:N', 
            title='Experience Level', 
            sort=['Beginner', 'Amateur', 'Regular', 'Veteran', 'Pro']),
    y=alt.Y('mean(subscribe):Q', 
            title='Subscription Rate',
            axis=alt.Axis(format='%'),
            scale=alt.Scale(domain=[0, 1])),
    color=alt.Color('experience:N', 
                    legend=None, 
                    scale=alt.Scale(scheme='tableau10')),
    tooltip=[
        alt.Tooltip('experience:N', title='Experience'),
        alt.Tooltip('mean(subscribe):Q', title='Subscription Rate', format='.1%'),
        alt.Tooltip('count()', title='Number of Players')
    ]
).properties(
    title='Figure 1: Newsletter Subscription Rate by Experience Level',
    width=450,
    height=300
)
fig1
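
The `mean(subscribe)` aggregate in the chart is a subscription rate because the mean of a boolean column equals the share of `True` values. A minimal sketch with a hypothetical toy frame (not the study data) shows the same calculation in pandas:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "experience": ["Pro", "Pro", "Amateur", "Amateur"],
    "subscribe": [True, False, True, True],
})

# Mean of a boolean column per group = subscription rate per group
rates = toy.groupby("experience")["subscribe"].mean()
print(rates)
```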


**Figure 2:** Age distribution by subscription status
An overlapping histogram showing age distributions for subscribers versus non-subscribers. This figure explores whether certain age groups show a higher propensity to engage with the newsletter, which could inform age-targeted outreach strategies.

In [25]:
fig2 = alt.Chart(players).mark_bar(opacity=0.7).encode(
    x=alt.X('age:Q', 
            bin=alt.Bin(maxbins=20), 
            title='Age (years)'),
    y=alt.Y('count()', 
            title='Number of Players',
            stack=None),
    color=alt.Color('subscribe:N', 
                    title='Subscribed',
                    scale=alt.Scale(scheme='set2')),
    tooltip=[
        alt.Tooltip('age:Q', bin=True, title='Age Range'),
        alt.Tooltip('subscribe:N', title='Subscribed'),
        alt.Tooltip('count()', title='Count')
    ]
).properties(
    title='Figure 2: Age Distribution by Subscription Status',
    width=500,
    height=300
)
fig2

**Figure 3:** Played Hours vs Age By Subscription Status: A scatter plot illustrating the relationship between player age and engagement level (hours played), with points colored by subscription status. This visualization helps identify whether age and engagement jointly relate to subscription, which would indicate a potential interaction effect.

In [26]:
fig3 = alt.Chart(players).mark_circle(size=60, opacity=0.6).encode(
    x=alt.X('age:Q', 
            title='Age (years)',
            scale=alt.Scale(domain=[5, 100])),  # widened so the maximum age (99) stays inside the axis
    y=alt.Y('played_hours:Q', 
            title='Played Hours',
            scale=alt.Scale(domain=[-5, 250])),
    color=alt.Color('subscribe:N', 
                    title='Subscribed',
                    scale=alt.Scale(scheme='set1')),
    tooltip=[
        alt.Tooltip('age:Q', title='Age'),
        alt.Tooltip('played_hours:Q', title='Hours Played', format='.1f'),
        alt.Tooltip('subscribe:N', title='Subscribed'),
        alt.Tooltip('experience:N', title='Experience'),
        alt.Tooltip('gender:N', title='Gender')
    ]
).properties(
    title='Figure 3: Played Hours vs Age by Subscription Status',
    width=500,
    height=350
)
fig3

**Figure 4:** Average Played Hours by Subscription Status: A bar chart comparing mean played hours between subscribers and non-subscribers. This figure examines whether highly engaged players are more likely to subscribe to the newsletter, testing the assumption that engagement correlates with newsletter interest.

In [27]:
fig4 = alt.Chart(players).mark_bar().encode(
    x=alt.X('subscribe:N', 
            title='Subscription Status',
            axis=alt.Axis(labelAngle=0)),
    y=alt.Y('mean(played_hours):Q', 
            title='Average Played Hours'),
    color=alt.Color('subscribe:N', 
                    legend=None,
                    scale=alt.Scale(scheme='set2')),
    tooltip=[
        alt.Tooltip('subscribe:N', title='Subscribed'),
        alt.Tooltip('mean(played_hours):Q', title='Avg Hours', format='.2f'),
        alt.Tooltip('count()', title='Number of Players')
    ]
).properties(
    title='Figure 4: Average Played Hours by Subscription Status',
    width=300,
    height=300
)
fig4

**Figure 5:** Gender Distribution By Subscription Status: Faceted bar charts showing gender composition within the subscriber and non-subscriber groups. This visualization assesses whether the gender composition differs between the two groups and helps identify gender-based subscription patterns.

In [28]:
fig5 = alt.Chart(players).mark_bar().encode(
    x=alt.X('gender:N', title='Gender'),
    y=alt.Y('count()', title='Number of Players'),
    color=alt.Color('subscribe:N', title='Subscribed'),
    column=alt.Column('subscribe:N', title='Subscription Status')
).properties(
    title='Figure 5: Gender Distribution by Subscription Status',
    width=200,
    height=300
)
fig5

**Summary Insights From Visualizations:**
The exploratory analysis revealed several important patterns. Subscription rates appear relatively consistent across experience levels, with only modest variation. Age distributions overlap heavily between subscribers and non-subscribers, suggesting age alone may not be a strong discriminator.

The scatter plot reveals no strong linear relationship between age and played hours, indicating these features may provide independent information for prediction. Most players in both groups logged very few hours, so the played-hours distributions overlap heavily. Subscribers do average more hours (Figure 4), but this difference appears to come from a small number of high-engagement players, so played hours alone separates the groups only weakly.


The dataset exhibits quality issues including class imbalance, right-skewed continuous variables, and demographic imbalances that must be considered during modeling. These characteristics suggest the need for appropriate preprocessing including feature scaling and potentially class-balancing techniques during model development.

**KNN Classification**

We build a K-Nearest Neighbours (KNN) classifier to predict whether a player subscribes to the newsletter (`subscribe`) using their age, total hours played, experience level, and gender.

In [29]:
import pandas as pd
import numpy as np
import altair as alt

# tools from scikit-learn
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

In [30]:
# 1. Keep only the relevant columns from the cleaned `players` data
players_model = players[['age', 'played_hours', 'experience', 'gender', 'subscribe']]

# 2. One-hot encode 'experience' and 'gender'
#    - This creates separate 0/1 columns for each category.
#    - drop_first=True avoids multicollinearity by removing one reference level.
players_model = pd.get_dummies(players_model,
                               columns=['experience', 'gender'],
                               drop_first=True)

# Look at the first few rows to check the result
players_model.head()

Unnamed: 0,age,played_hours,subscribe,experience_Beginner,experience_Pro,experience_Regular,experience_Veteran,gender_Female,gender_Male,gender_Non-binary,gender_Other,gender_Prefer not to say,gender_Two-Spirited
0,9,30.3,True,False,True,False,False,False,True,False,False,False,False
1,17,3.8,True,False,False,False,True,False,True,False,False,False,False
2,17,0.0,False,False,False,False,True,False,True,False,False,False,False
3,21,0.7,True,False,False,False,False,True,False,False,False,False,False
4,21,0.1,True,False,False,True,False,False,True,False,False,False,False
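
To see what `drop_first=True` does, consider a hypothetical toy column with three categories (the data below is invented for illustration). The alphabetically first category becomes the reference level and is represented by all dummy columns being zero.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({"experience": ["Amateur", "Pro", "Veteran", "Amateur"]})

# drop_first=True removes the first category ("Amateur") as the reference level
encoded = pd.get_dummies(toy, columns=["experience"], drop_first=True)

print(list(encoded.columns))
print(encoded)
```

This matches the notebook output above, where there is no `experience_Amateur` column and an Amateur player has every `experience_*` column set to False.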


Next, we separate our data into:

- **X**: predictor variables
- **y**: target variable (`subscribe`)

Then we split into training and testing sets:

- 75% training data (used to fit the model)
- 25% test data (used only to evaluate the final model)

We also use `stratify=y` so the proportion of subscribers vs non-subscribers is similar in both sets.

In [31]:
# Separate predictors (X) and target (y)
X = players_model.drop('subscribe', axis=1)
y = players_model['subscribe']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,      # 25% of data goes to test set
    random_state=2025,   # set seed for reproducibility
    stratify=y           # keep class balance similar in train and test
)

X_train.shape, X_test.shape

((147, 12), (49, 12))
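
The effect of `stratify` can be checked by comparing class proportions in the two splits. This is a minimal sketch on hypothetical labels (36 subscribers and 12 non-subscribers, chosen so that a 25% split divides evenly); it is not the study data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 36 subscribers and 12 non-subscribers (75% True)
y_toy = np.array([True] * 36 + [False] * 12)
X_toy = np.arange(48).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# With stratification, both splits keep the 75% subscriber share
print(y_tr.mean(), y_te.mean())
```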

In [32]:
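# Standardize all features (including the 0/1 dummy columns).
# The scaler is fit on the training data only, then applied to the test data.
scaler = StandardScaler()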
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

We try different values of *k* (number of neighbours) from 1 to 20.
For each *k*:

- Fit a KNN model on the **training data only**.
- Use 5-fold cross-validation on the training data.
- Compute the mean accuracy across the 5 folds.

Then we plot mean accuracy vs *k* to see which *k* works best.

In [33]:
k_values = range(1, 21)
cv_scores = []            # store mean CV accuracy for each k
for k in k_values:
    # Create a KNN model with this k
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation on the training data
    scores = cross_val_score(
        knn,
        X_train_scaled,
        y_train,
        cv=5,            # 5 folds
        scoring="accuracy"
    )
    # Store the mean accuracy (average over 5 folds)
    cv_scores.append(scores.mean())

cv_scores

[np.float64(0.6066666666666667),
 np.float64(0.5245977011494253),
 np.float64(0.6671264367816091),
 np.float64(0.5519540229885058),
 np.float64(0.6937931034482758),
 np.float64(0.6128735632183908),
 np.float64(0.7075862068965517),
 np.float64(0.7144827586206897),
 np.float64(0.7144827586206896),
 np.float64(0.6871264367816091),
 np.float64(0.7347126436781608),
 np.float64(0.7347126436781608),
 np.float64(0.727816091954023),
 np.float64(0.7211494252873563),
 np.float64(0.727816091954023),
 np.float64(0.727816091954023),
 np.float64(0.727816091954023),
 np.float64(0.7211494252873563),
 np.float64(0.727816091954023),
 np.float64(0.7211494252873563)]
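
One caveat with the loop above is that the scaler was fit on the full training set before cross-validation, so each validation fold has already contributed to the scaling statistics. An alternative, sketched here on synthetic stand-in data (not the study data), is to wrap the scaler and classifier in a scikit-learn pipeline so the scaler is refit inside every fold. This swaps in a different setup from the one used in this notebook and is shown only as an option.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, invented for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.random(100) < 0.7  # roughly 70% True labels

# The pipeline refits StandardScaler on the training part of each fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")

print(scores.mean())
```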

In [23]:
# Put k and mean accuracy together in one table
cv_results = pd.DataFrame({
    "k": list(k_values),
    "mean_accuracy": cv_scores
})
# Line + points plot of mean accuracy for each k
knn_cv_plot = (
    alt.Chart(cv_results)
    .mark_line(point=True)                     
    .encode(
        x=alt.X("k:Q",
                title="Number of Neighbours (k)"),
        y=alt.Y("mean_accuracy:Q",
                title="Mean 5-fold CV Accuracy"),
        tooltip=[
            alt.Tooltip("k:Q", title="k"),
            alt.Tooltip("mean_accuracy:Q",
                        title="Mean Accuracy",
                        format=".3f")
        ]
    )
    .properties(
        title="Figure 6: Mean Cross-Validation Accuracy vs k",
        width=450,
        height=300
    )
)

knn_cv_plot

In [35]:
# Find the index of the largest CV score
best_index = int(np.argmax(cv_scores))
# Use that index to get the best k
best_k = list(k_values)[best_index]
best_k

11

We used 5-fold cross-validation to compare k-NN models with k from 1 to 20. The plot shows that mean accuracy is lower and unstable for small k (between roughly 0.52 and 0.69 for k ≤ 6), then rises and stabilizes at about 0.71 to 0.73 from k = 7 onward, apart from a dip to 0.69 at k = 10. The highest mean CV accuracy, roughly 0.73, is shared by k = 11 and k = 12. Because np.argmax returns the first maximum, it selected k = 11, which we used as the final number of neighbours for the k-NN classification model.
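
Because `np.argmax` returns the index of the first maximum, ties between values of k are resolved in favour of the smaller k. The sketch below uses hypothetical scores (not the values from this notebook) to show that behaviour and one way to list all tied candidates.

```python
import numpy as np

# Hypothetical k values and scores, invented for illustration only
toy_k = [10, 11, 12, 13]
toy_scores = [0.71, 0.73, 0.73, 0.72]

# argmax picks the first occurrence of the maximum
first_best_k = toy_k[int(np.argmax(toy_scores))]

# List every k whose score ties with the maximum
tied_k = [k for k, s in zip(toy_k, toy_scores) if np.isclose(s, max(toy_scores))]

print(first_best_k, tied_k)
```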

In [36]:
# Build final KNN model with the chosen k
knn_final = KNeighborsClassifier(n_neighbors=best_k)

# Fit on all training data
knn_final.fit(X_train_scaled, y_train)

# Predict on the test data (unseen during training)
y_pred = knn_final.predict(X_test_scaled)
# Compute test accuracy
test_accuracy = accuracy_score(y_test, y_pred)
test_accuracy

0.7551020408163265

In [37]:
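# scikit-learn's confusion_matrix puts actual classes on rows
# and predicted classes on columns
cm = confusion_matrix(y_test, y_pred, labels=[False, True])
cm_table = pd.DataFrame(
    cm,
    index=["Actual False", "Actual True"],
    columns=["Predicted False", "Predicted True"]
)
TN = cm[0, 0]  # true negatives: actual False, predicted False
FP = cm[0, 1]  # false positives: actual False, predicted True
FN = cm[1, 0]  # false negatives: actual True, predicted False
TP = cm[1, 1]  # true positives: actual True, predicted True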

precision_true = TP / (TP + FP)
recall_true    = TP / (TP + FN)

precision_true, recall_true


(np.float64(0.7608695652173914), np.float64(0.9722222222222222))

On the test set, the k-NN model with k = 11 achieved an accuracy of about 75.5%, a precision of 76.1% for predicting subscribers, and a recall of 97.2%. Precision tells us, “when the model predicts someone will subscribe, how often is that correct?”, while recall tells us, “among all actual subscribers, how many did the model successfully detect?”
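
These three metrics can be reproduced by hand from confusion-matrix counts. The counts below are an assumption: they are reconstructed from the reported precision (35/46), recall (35/36), and the test-set size of 49, since the confusion matrix itself is not displayed above.

```python
# Counts reconstructed from the reported metrics (an assumption, see above)
TN, FP, FN, TP = 2, 11, 1, 35

accuracy = (TP + TN) / (TP + TN + FP + FN)   # share of all predictions that are correct
precision = TP / (TP + FP)                   # correct among predicted subscribers
recall = TP / (TP + FN)                      # detected among actual subscribers
specificity = TN / (TN + FP)                 # detected among actual non-subscribers

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(specificity, 3))
```

Under these counts, specificity is low: most actual non-subscribers are predicted as subscribers, which the high recall alone does not reveal.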

In [22]:
# Combine actual and predicted values into one DataFrame
players_results = pd.DataFrame({
    "predicted": y_pred.astype(str),
    "actual":    y_test.astype(str)
})

#Bar chart
pred_vs_actual_plot = (
    alt.Chart(players_results)
       .mark_bar()
       .encode(
           x     = alt.X("predicted:N", title="Predicted Subscription"),
           y     = alt.Y("count()", title="Count"),
           color = alt.Color("actual:N", title="Actual subscription")
       )
       .properties(
           title="Figure 7: Predicted vs Actual Subscription (k-NN, k = 11)",
           width=500,
           height=300
       )
)
pred_vs_actual_plot

This plot shows how often the KNN model correctly or incorrectly predicted subscription status. When the model predicted a player as a subscriber (the “True” bar), 35 of those players were actually subscribers (orange section), but 11 were non-subscribers (blue section). This means the model frequently predicts “True,” leading to many correct predictions but also a noticeable number of false positives.

When the model predicted “False,” there were far fewer cases overall: only 3 of the 49 test players. Two were correct non-subscribers, and one was an actual subscriber that the model missed.

**Discussion:**

This project explored whether player characteristics such as age, gender, experience, and played hours can be used to predict newsletter subscription. As expected, subscribers tended to have higher played hours on average (Figure 4), which suggests that engagement with the game is related to a player’s likelihood of subscribing. However, experience level showed only slight variation across groups, with all experience categories displaying similarly high subscription rates (Figure 1). This makes experience level a relatively weak predictor of subscription status.

Continuing with the demographic patterns, the age distribution of subscribers and non-subscribers appeared very similar, indicating that age alone is a weak predictor (Figure 2).  Most players in the dataset fell between ages 15-25, meaning that the dataset is skewed toward younger players, but this does not translate into meaningful predictions (Figure 2). Gender also showed large differences in group sizes, with most players identifying as male, making it difficult to interpret gender-related patterns reliably. Figure 3 shows extensive overlap between subscribers and non-subscribers across both age and played hours for players with fewer than 10 hours of gameplay, revealing no strong visual separation between the two groups. However, the figure also shows that all players with more than 10 hours of gameplay were subscribed, suggesting that higher engagement may be associated with subscription, though this group is very small.

It is also important to note that all exploratory visualizations were created using the unscaled dataset, while the KNN classifier was trained on standardized features. This means that although the raw plots help us understand overall patterns and natural distributions, the KNN model learns decision boundaries in a transformed feature space, where distances are normalized. As a result, the model may detect relationships that are not visually obvious from the unscaled plots.
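
The effect of standardization on KNN distances can be illustrated with three hypothetical players (the values are invented, with columns for age in years and played hours). In raw units the 200-hour gap dominates the distance; after scaling, a large age gap and a large hours gap count about equally.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical players: columns are age (years) and played_hours
X_toy = np.array([
    [17.0, 0.0],
    [18.0, 200.0],   # similar age, very different hours
    [60.0, 1.0],     # very different age, similar hours
])

def euclid(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Distances from player 0 in raw units
raw_01 = euclid(X_toy[0], X_toy[1])
raw_02 = euclid(X_toy[0], X_toy[2])

# Distances from player 0 after standardizing each column
X_scaled = StandardScaler().fit_transform(X_toy)
scaled_01 = euclid(X_scaled[0], X_scaled[1])
scaled_02 = euclid(X_scaled[0], X_scaled[2])

print(f"raw:    {raw_01:.1f} vs {raw_02:.1f}")
print(f"scaled: {scaled_01:.2f} vs {scaled_02:.2f}")
```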

After 5-fold cross-validation, the best-performing value was k = 11 (Figure 6). The final KNN model was used to classify test-set players as either “subscriber” or “non-subscriber”. It achieved a test accuracy of about 75.5%, with a precision of 76.1% for predicting subscribers and a recall of 97.2%. This accuracy is only slightly above the roughly 73.5% that would result from labelling every test player a subscriber (36 of the 49 test players subscribed), so the classifier adds little beyond the majority class. The metrics show that while the model identifies nearly all actual subscribers, it also produces many false positives.

The model’s performance is influenced heavily by class imbalance, since subscribers make up the majority of the dataset. Because of this, the classifier tends to predict “True” (subscriber) very frequently (Figure 7). In contrast, the model predicts “False” (non-subscriber) only a few times. When it does, these predictions are usually correct, but the sample size is too small to draw firm conclusions (Figure 7). Overall, the model struggles to reliably distinguish the two groups using only the given features.
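
A majority-class baseline gives a reference point for judging accuracy under class imbalance. The sketch below uses scikit-learn's `DummyClassifier` on hypothetical labels with 36 subscribers and 13 non-subscribers, the test-set split implied by the reported recall (an assumption). Fitting and scoring on the same labels is done here only to keep the illustration short.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 36 subscribers, 13 non-subscribers (assumed split)
y_demo = np.array([True] * 36 + [False] * 13)
X_demo = np.zeros((49, 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))

print(round(baseline_acc, 3))
```

A classifier is only informative to the extent that it beats this kind of baseline.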

The findings of this project can still provide useful insight for the research group in Computer Science at UBC. The results give a clearer picture of how people engage with the video game and which characteristics correlate with newsletter subscription. However, as seen above, using just basic demographics and general engagement metrics for targeting newsletter outreach will limit predictive accuracy. With the current model, it is difficult to confidently predict whether a new player will subscribe or not.

Looking forward, future work should explore which in-game behaviours might better predict subscription when combined with played hours, such as session frequency, game achievements, and social networks. Exploring personal factors beyond behaviour may also reveal important influences on subscription decisions, for example a player’s interest in contributing to research, their overall level of involvement with the community, and their willingness to receive promotional emails. Additional variables like these could increase the predictive power of the model and help the research group better target its recruitment efforts and plan its resources.

Finally, questions such as “Why do some highly engaged players still choose not to subscribe?” could guide further investigation. Understanding these behaviours may allow the team at UBC to adjust outreach strategies and modify the content or delivery of the newsletter.
