### DSCI 100 Winter Term 1 2025/2026 
# Predicting Player Contribution Levels on a Minecraft Game Research Server
## GROUP 9 - PROJECT FINAL REPORT 
Group Members: Chenxu Zhao (76439926), Ellenna Edij (62956032), Harpuneet Sran (20655627), Sean Jin (59517383) 

---

### Libraries


In [1]:
import altair as alt
import pandas as pd
import numpy as np
from sklearn import set_config
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import recall_score, precision_score

---
### Introduction

#### (1) Background
A UBC Computer Science research group is collecting gameplay data from a custom Minecraft server to study how players behave in-game. Player actions and sessions are recorded, and the research team needs this information to make decisions about:
- recruiting the right types of players,
- ensuring enough server resources and software licenses,
- understanding which players contribute the most data,
- and identifying behavioural patterns linked to newsletter subscription or long-term engagement.

The project lead, **Frank Wood**, has three broad research questions for students to explore:
- Which player characteristics and behaviours predict newsletter subscription?
- Which types of players contribute the most gameplay data?
- What time windows are likely to experience high numbers of simultaneous players?

#### (2) Question

For this project, our group chose to focus on **Question 1**:

“What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?”

From this, we constructed a specific predictive question:

***"Can reported playing time (hours) and age (years) be used to predict whether a player subscribes to the newsletter?"***

In [2]:
# This is the Uniform Resource Locator string for our data file
url = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"

# Loading in dataset
players = pd.read_csv(url)

# Raw dataset (untidy)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


#### (3) Data Description

Our group will be using the **players.csv** dataset, as it is suitable for building our predictive model. The dataset contains the players' characteristics (experience, age, gender) and behavioural measures (hours played).

The players.csv file comprises **196 observations** and **9 columns**. Two of these columns (*"individualId"* and *"organizationName"*) are empty; the remaining **7 columns** hold the following variables:

| Variable Name | Variable Type | Variable Description |
| :------- | :------: | :-------: |
|experience|String|Categorical variable describing the user's experience in the game (Veteran, Pro, Regular, Amateur, Beginner)|
|subscribe|Boolean|Categorical variable showing whether the user is subscribed to the newsletter|
|hashedEmail|String|Unique identifier: a hashed version of each player's email address|
|played_hours|Float|Quantitative variable representing the total reported hours of playtime|
|name|String|Categorical variable representing the name of each player|
|gender|String|Categorical variable showing the player's reported gender (e.g., Male, Female, Prefer not to say)|
|age|Integer|Quantitative variable representing the age of the player (years)|


##### **Issues/Potential Issues**

The main issue with the dataset is that it is not tidy: the *“individualId”* and *“organizationName”* columns contain no values, so they do not satisfy the tidy-data principle that *"each column is a single variable, and each value is a single cell."* A potential issue is that the two numeric variables are on very different scales, which can affect how our model operates: a variable with a larger range may carry more weight in distance calculations than one with a smaller range.

##### **Follow Up to Issues: Values included/excluded**

The columns *"name"*, *"hashedEmail"*, *"gender"*, and *"experience"* are also excluded, as they are not part of our predictive question. In contrast, *"age"* and *"played_hours"* are the numeric predictors named in our question and are kept.

##### **Data Collection**

The data was collected by the Computer Science Department at UBC from player activity on a custom Minecraft server.

---

### Methods & Results

To answer our predictive question, our group decided to use the **classification approach**. The sections below show the steps taken along with the output and some explanation. Overall, here are the steps:
1. Wrangle and clean the dataset
2. Visualize and describe the training dataset
3. Perform K-Nearest Neighbors (KNN) Classification
4. Visualize the outputs
5. Interpret data results and relationships

#### (1) Wrangling & Cleaning the Dataset

First, we tidy the dataset by **dropping the empty columns** (*"individualId"* and *"organizationName"*), then **removing the columns irrelevant** to the prediction (*"name"*, *"hashedEmail"*, *"gender"*, and *"experience"*). There are no missing values in the remaining columns, so no rows were removed.

In [3]:
# Making data tidy. Dropping "individualId" and "organizationName"
columns_to_drop = ["individualId", "organizationName"]
players = players.drop(columns=columns_to_drop)

# Removing any unrelated columns to our data analysis ("name", "hashedEmail", "gender", and "experience")
columns_to_drop = ["name", "hashedEmail", "gender", "experience"]
players = players.drop(columns=columns_to_drop)

players

Unnamed: 0,subscribe,played_hours,age
0,True,30.3,9
1,True,3.8,17
2,False,0.0,17
3,True,0.7,21
4,True,0.1,21
...,...,...,...
191,True,0.0,17
192,False,0.3,22
193,False,0.0,17
194,False,2.3,17
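The statement that no values are missing can be checked directly. The sketch below is a minimal illustration on a hypothetical `toy_players` frame (made-up values, with one missing entry added on purpose so the check has something to report); the same two calls could be applied to the tidied `players` frame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tidied `players` frame (values are made up).
toy_players = pd.DataFrame({
    "subscribe": [True, False, True, True],
    "played_hours": [30.3, 0.0, np.nan, 0.7],
    "age": [9, 17, 21, 21],
})

# Count missing values per column; all zeros would mean nothing needs dropping.
missing_per_column = toy_players.isna().sum()
print(missing_per_column)

# Rows with at least one missing value, for inspection before deciding what to do.
rows_with_missing = toy_players[toy_players.isna().any(axis=1)]
print(rows_with_missing)
```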


#### (2) Visualizing the Training Data

Next, we converted the variable *"subscribe"* into "yes" and "no" labels so it can be used for the classification.

In [4]:
# Output dataframes instead of arrays
set_config(transform_output="pandas")

# set the seed
np.random.seed(1)

# re-label Class "True" as "yes", and Class "False" as "no"
players["subscribe"] = players["subscribe"].map({True: "yes", False: "no"})

The dataset was split into **training (75%)** and **testing (25%)** sets using stratified sampling to maintain the proportion of subscribers.

In [5]:
# Splitting the data into training set and testing set. Split by training -> 75% / testing -> 25%
players_train, players_test = train_test_split(
    players, train_size=0.75, stratify=players["subscribe"]
)
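As a small illustration of what stratification does, the sketch below splits a hypothetical `toy` frame (made-up values, 75% "yes" and 25% "no") and compares the class mix in each subset. Passing `random_state` is an assumption added here as an alternative way to make the split reproducible; the notebook itself relies on `np.random.seed(1)`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60 "yes" and 20 "no" labels (75% / 25%).
toy = pd.DataFrame({
    "age": list(range(18, 98)),
    "played_hours": [h * 0.5 for h in range(80)],
    "subscribe": ["yes"] * 60 + ["no"] * 20,
})

# random_state fixes the shuffle, so the split is the same on every run.
toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["subscribe"], random_state=1
)

# With stratification, both subsets keep roughly the same class mix as the full data.
train_share = toy_train["subscribe"].value_counts(normalize=True)
test_share = toy_test["subscribe"].value_counts(normalize=True)
print(train_share)
print(test_share)
```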

In [6]:
# create a scatter plot of hours played versus age, colouring the points by subscription class
players_visualization_training = (
    alt.Chart(players_train).mark_circle(opacity=0.6, size=49)
    .encode(
        x=alt.X("age:Q").title("Age of Player"),
        y=alt.Y("played_hours").title("Hours Played").scale(zero=False, type="sqrt"),
        color=alt.Color("subscribe").title("Player Subscription Status"),
        tooltip=[
            alt.Tooltip("age:Q", title="Age"),
            alt.Tooltip("played_hours:Q", title="Hours Played"),
            alt.Tooltip("subscribe:N", title="Subscribed?")
        ]
    ).properties(title="Subscription Status Visualizations Relating to Player Age and Hours Played")
)

players_visualization_training

**Figure 1. Relationship Between Player Age, Hours Played, and Subscription Status**

The scatter plot displays how players' ages and total playtime relate to whether they subscribed to the game’s newsletter. Each point represents a player, color-coded by subscription status (“yes” or “no”). The vertical spread shows total playtime ranging from very low play hours to extreme outliers above 200 hours.

#### Insights
- Subscribed players tend to cluster between ages 18–30 and show higher variance in playtime.
- Most non-subscribers report playtime close to zero.
- A few extreme outliers appear (e.g., playtime above 200 hours), but they do not distort the general trend: higher playtime is associated with subscribing.
- Overlap between the two classes suggests that subscription cannot be predicted using simple linear boundaries, supporting the use of a flexible classifier such as KNN.

In [7]:
# create a zoomed-in scatter plot of hours played (0-2 hours) versus age, colouring the points by subscription class
players_zoomed = players_train[players_train["played_hours"] <= 2]
players_visualization_training_zoomed = (
    alt.Chart(players_zoomed).mark_circle(opacity=0.6, size=49)
    .encode(
        x=alt.X("age:Q").title("Age of Player"),
        y=alt.Y("played_hours").title("Hours Played").scale(zero=False, type="sqrt"),
        color=alt.Color("subscribe").title("Player Subscription Status"),
        tooltip=[
            alt.Tooltip("age:Q", title="Age"),
            alt.Tooltip("played_hours:Q", title="Hours Played"),
            alt.Tooltip("subscribe:N", title="Subscribed?")
        ]
    ).properties(title="Subscription Status Visualizations Relating to Player Age and Hours Played (Zoomed In)")
)

players_visualization_training_zoomed

**Figure 2. Zoomed-In View of Hours Played (0–2 Hours Range)**

To better understand the densely packed portion of the dataset, this visualization zooms into the 0–2 hours range. This allows us to see subtle patterns that were overshadowed in the full-scale view.

#### Insights
- Most players who played less than 1 hour tend to be non-subscribers.
- Subscribers also appear in this range, but with slightly higher play hours on average.
- Ages remain widely distributed, from teens to players over 60, but age alone does not clearly separate subscribers from non-subscribers.
- This reinforces the idea that playtime has stronger predictive power than age, especially at lower activity levels.

#### Conclusions from the Figures Above
- Subscribed players tend to be concentrated between **ages 20–30**, although subscribers appear across a wider age range overall.
- Individuals with **higher playtime** are more **likely to subscribe**, while most non-subscribers cluster near very low playtime values.
- The zoomed-in view shows that even within the 0–2 hour range, subscribers are more widely distributed than non-subscribers, who remain mostly near zero hours.
- Several extreme values are visible in the full-scale figure (e.g., ages near 90–100 or playtime exceeding 200 hours), though these points do not dominate the general trend.
- There is **no clear linear pattern** between age, playtime, and subscription status, making linear regression inappropriate for this task.
- A **K-Nearest Neighbors (KNN) classifier** is more suitable because it can capture the non-linear structure of the data and relies on the proximity of data points rather than assuming a specific functional form.

#### (3) Summary of the Training Dataset 

| Variable Name | Variable Type | Variable Description |
| :------- | :------: | :-------: |
|subscribe|string|Categorical variable showing if the user was subscribed to the newsletter or not|
|played_hours|float|Quantitative variable representing the total reported hours of playtime|
|age|integer|Quantitative variable representing the current age of the player|

Looking at the dataset, it is now clean, tidy, and ready for K Neighbors Classification. The untidy columns and unused columns are dropped, and the columns of age, played_hours, and subscription status are kept.

**Additional Statistics**

In [8]:
players_train["subscribe"].value_counts(normalize=True)

subscribe
yes    0.734694
no     0.265306
Name: proportion, dtype: float64

→ About **73.5%** of players in the training set are subscribers ("yes") and **26.5%** are not ("no"), so the classes are **imbalanced**.
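With imbalanced classes, a useful reference point is the accuracy of a classifier that always predicts the majority class. The sketch below is a minimal illustration using scikit-learn's `DummyClassifier`; the label counts (108 "yes", 39 "no") are an assumption inferred from the proportion above (108 / 147 ≈ 0.7347), and the `placeholder` feature column is hypothetical because this baseline ignores the features.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Assumed label counts consistent with the reported training proportions.
labels = pd.Series(["yes"] * 108 + ["no"] * 39, name="subscribe")
# The baseline ignores features, so a placeholder column is enough.
features = pd.DataFrame({"placeholder": [0] * len(labels)})

# A classifier that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(features, labels)

# Accuracy of always answering "yes"; any useful model should beat this.
baseline_accuracy = baseline.score(features, labels)
print(round(baseline_accuracy, 3))
```

Any accuracy reported for the KNN model can then be read against this baseline rather than against 50%.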

In [9]:
players_train["age"].agg(["mean", "std"])

mean    21.517007
std     10.902654
Name: age, dtype: float64

→ Players' age **mean ≈ 21.5**, with **standard deviation ≈ 10.90**.

In [10]:
players_train["played_hours"].agg(["mean", "std"])

mean     6.055782
std     27.488436
Name: played_hours, dtype: float64

→ Players' played hours **mean ≈ 6.1**, with **standard deviation ≈ 27.5** 

**Issues/Potential Issues**

However, the data is not standardized yet. The class imbalance (most players are subscribers) and the outliers in age and played hours may affect the prediction. Methods such as re-sampling, or classifiers designed for uneven class distributions, would be appropriate here, but they are beyond what we have covered in class so far.
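To illustrate why standardization matters for a distance-based classifier, the sketch below scales a hypothetical `toy` frame (made-up values) with `StandardScaler` and compares the spread of each column before and after. It is only a sketch of the idea; in the analysis itself the scaling happens inside the pipeline.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical players; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [17.0, 21.0, 25.0, 40.0],
    "played_hours": [0.0, 0.5, 30.0, 200.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# Before scaling, played_hours has a much larger spread than age,
# so it would dominate Euclidean distances between players.
print(toy.std(ddof=0))

# After scaling, both columns have mean 0 and standard deviation 1.
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```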

#### (4) Data Analysis and Visualization
##### **A. Building the Classification Model and Selecting the K Value**

- Standardized the numerical variables (*"age"*, *"played_hours"*).
- Used **GridSearchCV** with 5-fold cross-validation to tune the number of neighbors from k = 1 to 30.
- Collected, for each k, the cross-validation accuracy estimate and its standard error in **accuracies_grid**.

In [None]:
# create the preprocessor, pipeline, and CV grid search objects
players_preprocessor = make_column_transformer(
    (StandardScaler(), ["age", "played_hours"]),
)
players_tune_pipe = make_pipeline(players_preprocessor, KNeighborsClassifier())

parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 31, 1),
}

players_tune_grid = GridSearchCV(
    estimator=players_tune_pipe,
    param_grid=parameter_grid,
    cv=5
)

# run the cross-validated grid search on the training data
players_tune_grid.fit(
    players_train[["age", "played_hours"]],
    players_train["subscribe"]
)

# wrap it in a pd.DataFrame to make it easier to understand
accuracies_grid = pd.DataFrame(players_tune_grid.cv_results_)

# compute the standard error from the standard deviation across the 5 folds
accuracies_grid["sem_test_score"] = accuracies_grid["std_test_score"] / 5**(1/2)

# keep the number of neighbors (param_kneighborsclassifier__n_neighbors), the cross-validation accuracy estimate (mean_test_score), and the standard error of the accuracy estimate
accuracies_grid = (
    accuracies_grid[[
        "param_kneighborsclassifier__n_neighbors",
        "mean_test_score",
        "sem_test_score"
    ]]
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
)
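To make the mean and standard error concrete, the sketch below computes them by hand for a single value of k. It uses synthetic data from `make_classification` as a stand-in for the player data (an assumption for illustration only) and `cross_val_score` to obtain one accuracy per fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data standing in for age and played_hours.
X, y = make_classification(
    n_samples=150, n_features=2, n_informative=2, n_redundant=0, random_state=1
)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=11))

# One accuracy value per fold.
fold_scores = cross_val_score(pipe, X, y, cv=5)

# Mean accuracy across folds, and its standard error:
# the standard deviation of the fold scores divided by sqrt(number of folds).
mean_score = fold_scores.mean()
sem_score = fold_scores.std() / np.sqrt(len(fold_scores))
print(fold_scores, mean_score, sem_score)
```

A small standard error relative to the differences between neighbouring k values suggests the choice of k is reasonably stable; a large one suggests several k values perform about equally well.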

In [None]:
# decide which number of neighbors is best by plotting the accuracy versus k
accuracy_vs_k = alt.Chart(accuracies_grid, title = "Accuracy vs. k (KNN)").mark_line(point=True).encode(
    x=alt.X("n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Accuracy estimate")
)

accuracy_vs_k

**Figure 3. KNN Accuracy Across Different K Values**

The line chart shows the cross-validated accuracy of the K-Nearest Neighbors classifier for values of K from 1 to 30. The optimal K value was found to be 11, corresponding to the highest mean accuracy.

**Result of Tuning**

In [None]:
# obtain the number of neighbours with the highest accuracy
players_tune_grid.best_params_

→ Best K = 11

##### **B. Evaluating Performance on the Train Set**

In [None]:
# Predictions
players_train["predicted"] = players_tune_grid.predict(
    players_train[["age", "played_hours"]]
)

# Training accuracy
players_tune_grid.score(
    players_train[["age", "played_hours"]],
    players_train["subscribe"]
)

**Training Accuracy: 76.2%** → a reasonable result for a model with only two predictors, though only slightly above the 73.5% share of subscribers in the training set.

##### **C. Evaluating Performance on the Test Set**

In [None]:
# Predictions
players_test["predicted"] = players_tune_grid.predict(
    players_test[["age", "played_hours"]]
)

# Test Accuracy
players_tune_grid.score(
    players_test[["age", "played_hours"]],
    players_test["subscribe"]
)

**Test Accuracy: 77.6%** → reasonable performance given that there were only two predictors.

In [None]:
# Precision
precision_score(
    y_true=players_test["subscribe"],
    y_pred=players_test["predicted"],
    pos_label="yes"
)

**Precision (yes): 76.6%** → of the players predicted "yes", about a quarter were false positives.

In [None]:
# Recall
recall_score(
    y_true=players_test["subscribe"],
    y_pred=players_test["predicted"],
    pos_label="yes"
)

**Recall (yes): 100%** → the model correctly identifies all subscribers.

In [None]:
# Confusion Matrix
pd.crosstab(
    players_test["subscribe"],
    players_test["predicted"]
)

**Figure 4. Confusion Matrix for KNN Classification Results**

The confusion matrix compares true subscription values against the model’s predictions. The classifier achieved perfect recall for identifying subscribers but performed less accurately in identifying non-subscribers.
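The link between a confusion matrix and the reported metrics can be sketched with plain arithmetic. The four counts below are an assumption: they were chosen to be consistent with the reported accuracy, precision, and recall, and are not copied from the notebook's output.

```python
# Illustrative counts chosen to be consistent with the reported test metrics;
# they are an assumption, not the notebook's actual output.
true_positives = 36   # subscribers predicted "yes"
false_negatives = 0   # subscribers predicted "no"
false_positives = 11  # non-subscribers predicted "yes"
true_negatives = 2    # non-subscribers predicted "no"

total = true_positives + false_negatives + false_positives + true_negatives

accuracy = (true_positives + true_negatives) / total
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
# Share of non-subscribers correctly identified (recall for the "no" class).
specificity = true_negatives / (true_negatives + false_positives)

print(
    f"accuracy={accuracy:.3f} precision={precision:.3f} "
    f"recall={recall:.3f} specificity={specificity:.3f}"
)
```

Under these assumed counts, perfect recall for "yes" goes together with a low share of correctly identified non-subscribers, which is the pattern described above.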

##### **D. Trial Observation**

We would like to know if: 
- **Case 1**: a 13-year-old with a play time of 5 hours and
- **Case 2**: a 50-year-old with a play time of 2 hours

would subscribe to the newsletter.


In [None]:
new_obs = pd.DataFrame([[13, 5]], columns=["age", "played_hours"])

subscription_prediction = players_tune_grid.predict(new_obs)

subscription_prediction 

→ Predicted: **yes**

In [None]:
new_obs_2 = pd.DataFrame([[50, 2]], columns=["age", "played_hours"])

subscription_prediction_2 = players_tune_grid.predict(new_obs_2)

subscription_prediction_2

→ Predicted: **no**
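A single predicted label hides how many neighbours agreed with it. The sketch below shows one way to inspect the vote share with `predict_proba`, using a hypothetical `toy_train` frame (made-up values) and a small k of 3 so the arithmetic is easy to follow; it is not the fitted model from this report.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data; values are made up for illustration.
toy_train = pd.DataFrame({
    "age": [13, 15, 17, 19, 21, 23, 45, 50, 55, 60],
    "played_hours": [4.0, 6.0, 3.0, 8.0, 5.0, 7.0, 0.1, 0.0, 0.2, 0.3],
    "subscribe": ["yes"] * 6 + ["no"] * 4,
})

toy_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
toy_pipe.fit(toy_train[["age", "played_hours"]], toy_train["subscribe"])

toy_obs = pd.DataFrame([[13, 5]], columns=["age", "played_hours"])

# classes_ gives the column order of predict_proba; each value is the
# fraction of the 3 nearest neighbours voting for that class.
vote_share = pd.DataFrame(toy_pipe.predict_proba(toy_obs), columns=toy_pipe.classes_)
toy_prediction = toy_pipe.predict(toy_obs)
print(toy_prediction)
print(vote_share)
```

A vote share close to 0.5 would flag a prediction that sits near the decision boundary.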

##### **E. Visualization of the analysis**

In [None]:
# fit the model on the training data
knn = KNeighborsClassifier(n_neighbors=11)

knn_pipeline = make_pipeline(players_preprocessor, knn)

knn_fit = knn_pipeline.fit(
    X=players_train[["age", "played_hours"]],
    y=players_train["subscribe"]
)

# create the grid of age/playtime vals, and arrange in a data frame
age_grid = np.linspace(
    players["age"].min() * 0.95, players["age"].max() * 1.05, 50
)
hours_grid = np.linspace(
    players["played_hours"].min() * 0.95, players["played_hours"].max() * 1.05, 50
)
asgrid = np.array(np.meshgrid(age_grid, hours_grid)).reshape(2, -1).T
asgrid = pd.DataFrame(asgrid, columns=["age", "played_hours"])

# use the fit workflow to make predictions at the grid points
knnPredGrid = knn_fit.predict(asgrid)

# bind the predictions as a new column with the grid points
prediction_table = asgrid.copy()
prediction_table["subscribe"] = knnPredGrid

# plot:
# 1. the colored scatter of the original data
unscaled_plot = alt.Chart(players).mark_point(
    opacity=0.6,
    filled=True,
    size=40
).encode(
    x=alt.X("age")
        .scale(
            nice=False,
            domain=(
                players["age"].min() * 0.95,
                players["age"].max() * 1.05
            )
        ),
    y=alt.Y("played_hours")
        .scale(
            nice=False,
            domain=(
                players["played_hours"].min() * 0.95,
                players["played_hours"].max() * 1.05
            )
        ),
    color=alt.Color("subscribe").title("Subscribed")
)

# 2. the faded colored scatter for the grid points
prediction_plot = alt.Chart(prediction_table, title="Subscription among Players using Age and Play time").mark_point(
    opacity=0.05,
    filled=True,
    size=300
).encode(
    x=alt.X("age").title("Age"),
    y=alt.Y("played_hours").title("Played time (in hours)"),
    color=alt.Color("subscribe").title("Subscribed")
)
unscaled_plot + prediction_plot


**Figure 5. Subscription Prediction Regions Based on Age and Playtime (K = 11)**

The visualization overlays:
- The **actual player data** (small opaque points), and
- The **KNN prediction grid** (large faint colored background points),

to show how the classifier predicts subscription status across the full age-playtime space.

##### Insights
- The model draws a non-linear boundary, adapting closely to the pattern of the actual data.
- Most of the region with high playtime is predicted as “yes,” regardless of age.
- Low playtime across any age tends to fall into the “no” region.
- The decision boundary is not vertical or horizontal, which means:
  1. Age alone cannot predict subscription.
  2. Playtime is the dominant factor, although the interaction between age and playtime also matters.

---

### Discussion
#### (1) Summary of Findings
Our analysis indicates that newsletter subscription behavior is primarily influenced by a player's total gameplay hours, with age showing a weaker and less consistent relationship. The exploratory visualization revealed that subscribed players tend to cluster between ages 20 and 30 and report higher playtime overall, while non-subscribers tend to have very low playtime. The zoomed-in visualization (0–2 hours) emphasized this pattern even more clearly, showing that most non-subscribers fall near zero hours played, whereas subscribers remain more spread out even within smaller playtime values.

Using 5-fold cross-validation, we selected **K = 11** as the optimal number of neighbors for the K-Nearest Neighbors classifier. The model achieved an accuracy of **76.2%** on the training set and **77.6%** on the test set, with a test-set precision of **76.6%** and a recall of **100%**. The perfect recall shows that the model identified every subscribing player in the test set. However, the confusion matrix indicates that the model has difficulty classifying non-subscribers.

From our evaluation, we can predict newsletter subscription with about 78% accuracy on the test set using only age and played hours.

#### (2) Expectations
These results align broadly with our initial expectations. We expected that players with greater engagement, represented by longer gameplay duration, would be more likely to subscribe to the newsletter. 

The model’s high recall supports this expectation, showing that the characteristics of subscribing players form a relatively consistent pattern. However, we had anticipated a clearer separation between the two groups. Instead, the significant overlap between subscribers and non-subscribers, along with the class imbalance, made it challenging for the model to distinguish the classes.

#### (3) Potential Impacts of the Findings
These findings may help researchers better understand engagement patterns on the server. Since players with higher playtime are more likely to subscribe, targeted communication or engagement strategies could be directed at lower-activity players. Additionally, the fact that the model identified all subscribing users in the test set may be useful in applications where missing engaged players is particularly costly, such as player recruitment, newsletter marketing, or resource planning.

#### (4) Future Questions
The analysis raises several new questions that could be explored further. For example:
- Other than age and total gameplay hours, what other player characteristics might help differentiate subscribers from non-subscribers effectively?
- Do patterns of in-game behavior, like login frequency or the types of activities players engage in, affect subscription likelihood?
- Are interactions with other players or participation in collaborative events linked to subscription decisions?

It is also unclear whether players with similar engagement levels subscribe at similar rates across different cultural backgrounds.