# DSCI 100 Winter Term 1 2025/2026 
## GROUP 9 - PROJECT FINAL REPORT

## Predicting Player Contribution Levels on a Minecraft Game Research Server

### Group Members: Chenxu Zhao (), Ellenna Edij (62956032), Harpuneet Sran (20655627), Sean Jin (59517383) 

### Libraries


In [1]:
import altair as alt
import pandas as pd
import numpy as np
from sklearn import set_config
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import recall_score, precision_score

### Introduction

#### (1) Relevant Background Information
A UBC Computer Science research group is collecting gameplay data from a custom Minecraft server to study how players behave in-game. Player actions and sessions are recorded, and the research team needs this information to make decisions about:
- recruiting the right types of players,
- ensuring enough server resources and software licenses,
- understanding which players contribute the most data,
- and identifying behavioural patterns linked to newsletter subscription or long-term engagement.


The project lead, Frank Wood, has three broad research questions for students to explore:

- Which player characteristics and behaviours predict newsletter subscription?
- Which types of players contribute the most gameplay data?
- What time windows are likely to experience high numbers of simultaneous players?


#### (2) Question

For this project, our group chose to focus on Question 1:

“What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?”

From this, we constructed a specific predictive question:

"Can we use a player's reported playing time (hours) and age (years) to predict whether they subscribe to the newsletter?"

In [2]:
# This is the Uniform Resource Locator string for our data file
url = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"

# Loading in dataset
players = pd.read_csv(url)

# Raw dataset (untidy)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


#### (3)  Data Description:

Our group uses the players.csv dataset, as it is suitable for building our predictive model. The dataset contains the players' characteristics (experience, age, gender) and behavioural measures (hours played).

The players.csv file comprises 196 observations. The 7 variables below carry the player data; the raw file also includes two further columns, individualId and organizationName, which are addressed under Issues below.

| Variable Name | Variable Type | Variable Description |
| :------- | :------: | :-------: |
|experience|String|Categorical variable describing the user's experience in the game (Veteran, Pro, Regular, Amateur, Beginner)|
|subscribe|Boolean|Categorical variable showing whether the user is subscribed to the newsletter|
|hashedEmail|String|Unique categorical variable holding a hashed version of each player's email address|
|played_hours|Float|Quantitative variable representing the total reported hours of playtime|
|name|String|Categorical variable representing the name of each player|
|gender|String|Categorical variable for the player's reported gender (for example Male, Female, Prefer not to say)|
|age|Integer|Quantitative variable representing the current age of the player in years|


##### Issues/Potential Issues: 

The main issue with the dataset is that it is not tidy: the "individualId" and "organizationName" columns do not respect the tidy-data principle that "each column is a single variable, and each value is a single cell" (both appear empty in the rows displayed above). A potential issue is that the numeric variables span very different ranges, which can affect how our model operates: in a distance-based method, variables on larger scales can be weighted more heavily than others.
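
To make the scale concern concrete, here is a minimal sketch with three hypothetical players (the points are invented for illustration). The two spreads are the training-set standard deviations reported later in this report. On the raw scale the 25-hour gap outweighs the 20-year gap, but once each variable is divided by its spread the ordering of the two distances flips.

```python
import numpy as np

# Hypothetical players, written as (age in years, played_hours).
a = np.array([17.0, 0.0])
b = np.array([37.0, 0.0])    # 20 years older than `a`, same hours
c = np.array([17.0, 25.0])   # same age as `a`, 25 more hours

def euclidean(p, q):
    """Straight-line distance between two points."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Raw distances: one hour counts exactly as much as one year.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Rescale each variable by its training-set standard deviation
# (age, played_hours), taken from the summary outputs in this report.
spread = np.array([10.902654, 27.488436])
scaled_ab = euclidean(a / spread, b / spread)
scaled_ac = euclidean(a / spread, c / spread)

print(f"raw:    a-b = {raw_ab:.2f}, a-c = {raw_ac:.2f}")
print(f"scaled: a-b = {scaled_ab:.2f}, a-c = {scaled_ac:.2f}")
```

Which neighbour counts as "nearest" therefore depends on whether the variables were rescaled first, which is why the pipeline later in this report standardizes both predictors.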

##### Follow Up to Issues: Values included/excluded

Identifier and non-numeric columns such as "name", "hashedEmail", "gender", and "experience" are also excluded, as we do not use them in this analysis. In contrast, "age" and "played_hours" are the numeric predictors we use for subscription status and are included.

##### Data Collection:

The data were collected from player activity on a custom Minecraft server run by a research group in the Computer Science Department at UBC.

### Methods & Results
This section describes the methods we used to perform our analysis from beginning to end, alongside the analysis code.


To answer the question, we will:
- Wrangle and clean the dataset.
- Visualize the dataset prior to the data analysis.
- Perform classification.
- Visualize the outputs.
- Interpret the results and relationships.


Wrangle and Clean Data:
- Drop the hashedEmail, gender, experience, name, individualId, and organizationName columns and remove any rows with missing data.

Visualize the Data:
- Plot the relationship between age, played hours, and subscription status.

Describe the Data:
- Summarize the wrangled data.

Perform Classification:
- Tune and fit a K-nearest neighbors classifier.

Output Visualization:
- Present the results in figures and tables.

Results (Interpret the Data):
- Explain the relationship between the predictors and subscription status.


#### (1) Wrangling & Cleaning the Dataset

In [3]:
# Making data tidy. Dropping "individualId" and "organizationName"
columns_to_drop = ["individualId", "organizationName"]
players = players.drop(columns=columns_to_drop)

# Removing any unrelated columns to our data analysis ("name", "hashedEmail", "gender", and "experience")
columns_to_drop = ["name", "hashedEmail", "gender", "experience"]
players = players.drop(columns=columns_to_drop)

players

Unnamed: 0,subscribe,played_hours,age
0,True,30.3,9
1,True,3.8,17
2,False,0.0,17
3,True,0.7,21
4,True,0.1,21
...,...,...,...
191,True,0.0,17
192,False,0.3,22
193,False,0.0,17
194,False,2.3,17
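
The plan above also mentions removing rows with missing data, which the wrangling cell does not show explicitly. A minimal sketch of how such a check could look is given here on a small hypothetical table (`toy_players` and its missing value are invented for illustration, not taken from players.csv).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wrangled table; the NaN is invented.
toy_players = pd.DataFrame({
    "subscribe": [True, False, True, True],
    "played_hours": [30.3, 0.0, np.nan, 0.1],
    "age": [9, 17, 21, 21],
})

# Count missing values per column before deciding what to drop.
missing_per_column = toy_players.isna().sum()

# Drop any row that has at least one missing value, then renumber the rows.
toy_clean = toy_players.dropna().reset_index(drop=True)

print(missing_per_column)
print(toy_clean)
```

If every count in `missing_per_column` is zero for the real table, `dropna()` leaves it unchanged.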


#### (2) Visualizing the Training Data

In [4]:
# Output dataframes instead of arrays
set_config(transform_output="pandas")

# set the seed
np.random.seed(1)

# re-label Class "True" as "yes", and Class "False" as "no"
players["subscribe"] = players["subscribe"].map({True: "yes", False: "no"})

# Splitting the data into training set and testing set. Split by training -> 75% / testing -> 25%
players_train, players_test = train_test_split(
    players, train_size=0.75, stratify=players["subscribe"]
)
# create a scatter plot of hours played versus age, colouring the points by subscription class
players_visualization_training = (
    alt.Chart(players_train).mark_circle(opacity=0.6, size=49)
    .encode(
        x=alt.X("age:Q").title("Age of Player"),
        y=alt.Y("played_hours").title("Hours Played").scale(zero=False, type="sqrt"),
        color=alt.Color("subscribe").title("Player Subscription Status")
    ).properties(title="Subscription Status Visualizations Relating to Player Age and Hours Played")
)

players_visualization_training

##### Insights

In the visualization above, most of the players who are subscribed to the game's newsletter appear to be around 20-30 years of age. Furthermore, it seems that the longer an individual plays the game, the more likely they are to be subscribed to the newsletter. Lastly, there are 2 outlying points, at ages of around 91 and 100, both of which appear to be subscribed. Overall, there does not seem to be a linear relationship or shape, suggesting that a linear method such as linear regression is not a good fit. A K-nearest neighbors approach is more suitable because it does not assume a shape for the data, but instead relies on the distances between the observation to be classified and the nearby data points.
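
As a rough illustration of the distance-based idea, the sketch below classifies a new point by majority vote among its K closest training points. The training points and labels are hypothetical and assumed to be already standardized. The helper `knn_predict` is written here for illustration only; the analysis itself uses scikit-learn's `KNeighborsClassifier`.

```python
from collections import Counter

import numpy as np

# Hypothetical, already standardized training points: (age, played_hours).
train_X = np.array([
    [-0.5, -0.2],
    [-0.4, 0.1],
    [0.0, 1.0],
    [1.2, -0.3],
    [2.5, -0.2],
])
train_y = np.array(["yes", "yes", "yes", "no", "no"])

def knn_predict(x_new, X, y, k):
    """Return the most common label among the k training points nearest to x_new."""
    distances = np.sqrt(((X - x_new) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y[nearest].tolist())
    return votes.most_common(1)[0][0]

prediction_near_cluster = knn_predict(np.array([-0.3, 0.0]), train_X, train_y, k=3)
prediction_far_right = knn_predict(np.array([2.0, -0.25]), train_X, train_y, k=3)
print(prediction_near_cluster, prediction_far_right)
```

No functional form is fitted at any point: the prediction depends only on which labelled points lie closest to the new observation.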

#### (3) Summary of the data set 

| Variable Name | Variable Type | Variable Description |
| :------- | :------: | :-------: |
|subscribe|string|Categorical variable showing if the user was subscribed to the newsletter or not|
|played_hours|float|Quantitative variable representing the total reported hours of playtime|
|age|integer|Quantitative variable representing the current age of the player|

Looking at this dataset, it is now clean, tidy, and ready for K-nearest neighbors classification. The untidy and unused columns have been dropped, and the age, played_hours, and subscribe columns are kept.

##### Issues/Potential Issues:
- A potential issue is that the data are not standardized (we standardize both predictors as part of the KNN analysis).

In [5]:
players_train["subscribe"].value_counts(normalize=True)

subscribe
yes    0.734694
no     0.265306
Name: proportion, dtype: float64

In [6]:
players_train["age"].agg(["mean", "std"])

mean    21.517007
std     10.902654
Name: age, dtype: float64

In [7]:
players_train["played_hours"].agg(["mean", "std"])

mean     6.055782
std     27.488436
Name: played_hours, dtype: float64
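
The summary statistics above show what standardization does to a single observation. As a worked sketch, the code below converts a 13-year-old with 5 hours played into z-scores using the reported training means and standard deviations. This is an approximation of what the pipeline does, not its exact output: pandas reports the sample standard deviation, whereas `StandardScaler` divides by the population standard deviation, so the scaler's values differ slightly.

```python
# Training-set summary statistics copied from the outputs above.
age_mean, age_std = 21.517007, 10.902654
hours_mean, hours_std = 6.055782, 27.488436

def z_score(value, mean, std):
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / std

# Example observation: a 13-year-old with 5 hours played.
z_age = z_score(13, age_mean, age_std)
z_hours = z_score(5, hours_mean, hours_std)

print(f"age z-score: {z_age:.2f}, played_hours z-score: {z_hours:.2f}")
```

On this scale the player is roughly 0.78 standard deviations younger than average but almost exactly average in hours played, because the spread of played_hours is so wide.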

#### (4) Data Analysis: 
##### A. Building the classification model 
##### Selecting the K value

In [8]:
# create the preprocessor, pipeline, and CV grid search objects
players_preprocessor = make_column_transformer(
    (StandardScaler(), ["age", "played_hours"]),
)
parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 30, 1),
}
players_tune_pipe = make_pipeline(players_preprocessor, KNeighborsClassifier())


players_tune_grid = GridSearchCV(
    estimator=players_tune_pipe,
    param_grid=parameter_grid,
    cv=5
)
# fit the grid search on the training data (each of the 5 folds is held out in turn for validation)
players_tune_grid.fit(
    players_train[["age", "played_hours"]],
    players_train["subscribe"]
)
# wrap it in a pd.DataFrame to make it easier to understand
accuracies_grid = pd.DataFrame(players_tune_grid.cv_results_)


# compute the standard error from the standard deviation: divide by the square root of the number of folds (cv=5)
# keep the number of neighbors (param_kneighborsclassifier__n_neighbors), the cross-validation accuracy estimate (mean_test_score), and the standard error of the accuracy estimate
accuracies_grid["sem_test_score"] = accuracies_grid["std_test_score"] / 5**(1/2)
accuracies_grid = (
    accuracies_grid[[
        "param_kneighborsclassifier__n_neighbors",
        "mean_test_score",
        "sem_test_score"
    ]]
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
)
# decide which number of neighbors is best by plotting the accuracy versus k
accuracy_vs_k = alt.Chart(accuracies_grid, title = "Accuracy vs. K value").mark_line(point=True).encode(
    x=alt.X("n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Accuracy estimate")
)
accuracy_vs_k

In [9]:
# obtain the number of neighbours with the highest accuracy
players_tune_grid.best_params_

{'kneighborsclassifier__n_neighbors': 11}
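
For reference, the standard error of a cross-validated accuracy is the standard deviation of the fold scores divided by the square root of the number of folds, which is 5 here since `cv=5`. The sketch below works through that arithmetic on hypothetical fold accuracies (the five scores are invented, not taken from the grid search above).

```python
import numpy as np

# Hypothetical accuracies from 5 cross-validation folds for one value of K.
fold_scores = np.array([0.70, 0.73, 0.77, 0.73, 0.72])
n_folds = len(fold_scores)

mean_score = fold_scores.mean()
# Population standard deviation (ddof=0), the convention NumPy uses by default.
std_score = fold_scores.std()
# Standard error of the mean: std divided by sqrt(number of folds).
sem_score = std_score / np.sqrt(n_folds)

print(f"mean = {mean_score:.3f}, std = {std_score:.4f}, sem = {sem_score:.4f}")
```

A standard error of this kind helps judge whether small accuracy differences between neighbouring values of K are meaningful.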

##### B. The Classification Model

In [16]:
# fit the final model (K = 11) on the full dataset;
# only the two predictors are passed as features
knn = KNeighborsClassifier(n_neighbors=11)

knn_pipeline = make_pipeline(players_preprocessor, knn)
knn_fit = knn_pipeline.fit(
    X=players[["age", "played_hours"]],
    y=players["subscribe"]
)
knn_fit 


##### C. Evaluating performance using the test set

In [17]:
players_test["predicted"] = players_tune_grid.predict(
    players_test[["age", "played_hours"]]
)

players_tune_grid.score(
    players_test[["age", "played_hours"]],
    players_test["subscribe"]
)

0.7755102040816326

In [18]:
precision_score(
    y_true=players_test["subscribe"],
    y_pred=players_test["predicted"],
    pos_label="yes"
)

np.float64(0.7659574468085106)

In [19]:
recall_score(
    y_true=players_test["subscribe"],
    y_pred=players_test["predicted"],
    pos_label="yes"
)

np.float64(1.0)

In [20]:
pd.crosstab(
    players_test["subscribe"],
    players_test["predicted"]
)

predicted,no,yes
subscribe,Unnamed: 1_level_1,Unnamed: 2_level_1
no,2,11
yes,0,36
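
The accuracy, precision, and recall reported above can be recomputed by hand from this confusion matrix. The sketch below does so and also compares the accuracy with a simple baseline that always predicts "yes". The baseline is our own addition for context; it was not part of the original cells.

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
actual_no_pred_no, actual_no_pred_yes = 2, 11
actual_yes_pred_no, actual_yes_pred_yes = 0, 36

total = (actual_no_pred_no + actual_no_pred_yes
         + actual_yes_pred_no + actual_yes_pred_yes)

# "yes" is the positive class, matching pos_label="yes" above.
accuracy = (actual_no_pred_no + actual_yes_pred_yes) / total
precision = actual_yes_pred_yes / (actual_yes_pred_yes + actual_no_pred_yes)
recall = actual_yes_pred_yes / (actual_yes_pred_yes + actual_yes_pred_no)

# Baseline: always predict the majority class ("yes").
majority_baseline = (actual_yes_pred_no + actual_yes_pred_yes) / total

print(f"accuracy = {accuracy:.4f}, precision = {precision:.4f}, recall = {recall:.4f}")
print(f"always-'yes' baseline accuracy = {majority_baseline:.4f}")
```

The classifier predicts "yes" for 47 of the 49 test players, so its accuracy sits only slightly above the always-"yes" baseline. This is worth keeping in mind when interpreting the perfect recall.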


##### D. Trial Observation 

We want to know whether a 13-year-old with a play time of 5 hours and a 50-year-old with a play time of 2 hours would be predicted to subscribe to the newsletter.


In [29]:
new_obs = pd.DataFrame([[13,5]], columns=["age", "played_hours"])
subscription_prediction = knn_fit.predict(new_obs)
subscription_prediction 

array(['no'], dtype=object)

In [30]:
new_obs_2= pd.DataFrame([[50, 2]], columns=["age", "played_hours"])
subscription_prediction_2= knn_fit.predict(new_obs_2)
subscription_prediction_2 

array(['no'], dtype=object)