# DSCI 100 Winter Term 1 2025/2026 
## GROUP 9 - PROJECT FINAL REPORT

## Predicting Player Contribution Levels on a Minecraft Game Research Server

### Group Members: Chenxu Zhao (76439926), Ellenna Edij (62956032), Harpuneet Sran (20655627), Sean Jin (59517383) 

#### Libraries


In [1]:
import altair as alt
import pandas as pd
import numpy as np
from sklearn import set_config
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV

#### (1) Introduction

##### A. Relevant Background Information
A UBC Computer Science research group is collecting gameplay data from a custom Minecraft server to study how players behave in-game. Player actions and sessions are recorded, and the research team needs this information to make decisions about:
- recruiting the right types of players,
- ensuring enough server resources and software licenses,
- understanding which players contribute the most data,
- and identifying behavioural patterns linked to newsletter subscription or long-term engagement.


The project lead, Frank Wood, has three broad research questions for students to explore:

- Which player characteristics and behaviours predict newsletter subscription?
- Which types of players contribute the most gameplay data?
- What time windows are likely to experience high numbers of simultaneous players?



#### (2) Question

For this project, our group chose to focus on Question 1:

“What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?”

From this, we constructed a specific predictive question:

"Can we use a player's reported playing time and age to predict whether they subscribe to the newsletter?"

#### (3) Data Description:

Our group will use the players.csv dataset because it contains the variables needed to build our predictive model: player characteristics (experience, age, gender) and a behavioural measure (hours played).


In [2]:
url = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players = pd.read_csv(url)
players

| Unnamed: 0 | experience | subscribe | hashedEmail | played_hours | name | gender | age | individualId | organizationName |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | | |
| 1 | Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397... | 3.8 | Christian | Male | 17 | | |
| 2 | Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3... | 0.0 | Blake | Male | 17 | | |
| 3 | Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f... | 0.7 | Flora | Female | 21 | | |
| 4 | Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb... | 0.1 | Kylie | Male | 21 | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 191 | Amateur | True | b6e9e593b9ec51c5e335457341c324c34a2239531e1890... | 0.0 | Bailey | Female | 17 | | |
| 192 | Veteran | False | 71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778... | 0.3 | Pascal | Male | 22 | | |
| 193 | Amateur | False | d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29... | 0.0 | Dylan | Prefer not to say | 17 | | |
| 194 | Amateur | False | f19e136ddde68f365afc860c725ccff54307dedd13968e... | 2.3 | Harlow | Male | 17 | | |


The players.csv file contains 196 observations. Seven of its columns hold data:

- experience: the player's experience level; values in the preview above include Pro, Veteran, Amateur, and Regular (type = string)
- subscribe: whether the player has subscribed to the newsletter (type = boolean)
- hashedEmail: hashed (concealed) email address of the player (type = string)
- played_hours: reported hours of play time (type = floating-point number)
- name: name of the player (type = string)
- gender: gender of the player (type = string)
- age: age of the player (type = integer)
- Note: the remaining columns, individualId and organizationName, contain no data.
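The empty columns noted above can be confirmed programmatically. The following is a minimal sketch, assuming a small hand-typed table (`sample`, a hypothetical name, with values copied from the first preview rows) rather than the full dataset. It shows one way to list columns that are entirely missing and to drop them.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the players table; values copied from the preview rows.
sample = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular"],
    "subscribe": [True, True, False, True, True],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1],
    "age": [9, 17, 17, 21, 21],
    "individualId": [np.nan] * 5,
    "organizationName": [np.nan] * 5,
})

# Columns in which every value is missing carry no information.
empty_columns = sample.columns[sample.isna().all()].tolist()

# how="all" drops only the columns that are completely empty.
sample_clean = sample.dropna(axis="columns", how="all")

print(empty_columns)
print(sample_clean.dtypes)
```

The same two calls could be applied to the full `players` data frame; whether other columns contain partial missing values would need a separate check.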


##### Issues/Potential Issues: 

The main issue is that the dataset is not tidy: as noted above, "individualId" and "organizationName" contain no values, so they do not satisfy the tidy-data criteria that "each column is a single variable, and each value is a single cell." A potential issue is that the numeric variables are on very different scales. K-nearest neighbours relies on distances between observations, so a variable with a larger range would carry more weight than the others unless the variables are standardized.
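To illustrate the scale concern, here is a minimal sketch, assuming a hand-typed sample of five rows from the preview (`sample` is a hypothetical name, and the ranges in the full dataset will differ). It compares the raw ranges of the two numeric variables and then standardizes them with `StandardScaler` so that each has mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample; values copied from the first preview rows.
sample = pd.DataFrame({
    "age": [9, 17, 17, 21, 21],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1],
})

# Raw ranges differ, so unscaled distances would be dominated by one variable.
raw_ranges = sample.max() - sample.min()

# Standardize each column to mean 0 and (population) standard deviation 1.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(sample),
    columns=sample.columns,
)

print(raw_ranges)
print(scaled.round(2))
```

After scaling, both variables contribute on a comparable footing to the distance calculation.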

##### Follow Up to Issues: Values included/excluded

The untidy columns and non-numeric identifier variables such as "name" and "hashedEmail" should be excluded because they do not contribute to the analysis. In contrast, "age" and "played_hours" are the numeric predictors we will use to predict subscription, so they should be included.

##### Data Collection:

The data was collected from player activity on a custom Minecraft server run by the Computer Science Department at UBC.

In [9]:
set_config(transform_output="pandas")
np.random.seed(1)

players_train, players_test = train_test_split(
    players, train_size=0.75, stratify=players["subscribe"]
)

players_preprocessor = make_column_transformer(
    (StandardScaler(), ["age", "played_hours"]),
)
knn = KNeighborsClassifier()
players_tune_pipe = make_pipeline(players_preprocessor, knn)
parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 30, 1),
}

players_tune_grid = GridSearchCV(
    estimator=players_tune_pipe,
    param_grid=parameter_grid,
    cv=5
)
players_tune_grid.fit(
    players_train[["age", "played_hours"]],
    players_train["subscribe"]
)
accuracies_grid = pd.DataFrame(players_tune_grid.cv_results_)

# Standard error of the mean accuracy: divide by the square root of the number of folds (cv=5).
accuracies_grid["sem_test_score"] = accuracies_grid["std_test_score"] / 5**(1/2)
accuracies_grid = (
    accuracies_grid[[
        "param_kneighborsclassifier__n_neighbors",
        "mean_test_score",
        "sem_test_score"
    ]]
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
)
accuracy_vs_k = alt.Chart(accuracies_grid, title = "Accuracy vs. K value").mark_line(point=True).encode(
    x=alt.X("n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Accuracy estimate")
)
accuracy_vs_k
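The standard error of a cross-validated accuracy estimate is the standard deviation of the fold scores divided by the square root of the number of folds, so with `cv=5` the divisor is the square root of 5. The sketch below works through that arithmetic for a single value of K, assuming made-up fold accuracies (`fold_scores` is hypothetical and is not output from this analysis).

```python
import numpy as np

# Hypothetical accuracies from 5 cross-validation folds (illustrative values only).
fold_scores = np.array([0.70, 0.73, 0.67, 0.77, 0.73])
n_folds = len(fold_scores)

# Mean accuracy across folds, analogous to mean_test_score.
mean_score = fold_scores.mean()

# Population standard deviation (ddof=0) across folds, analogous to std_test_score.
std_score = fold_scores.std()

# Standard error of the mean: standard deviation over sqrt(number of folds).
sem_score = std_score / np.sqrt(n_folds)

print(mean_score, sem_score)
```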

In [10]:
players_tune_grid.best_params_

{'kneighborsclassifier__n_neighbors': 11}

In [12]:
knn = KNeighborsClassifier(n_neighbors=11)

preprocessor = make_column_transformer(
    (StandardScaler(), ["age", "played_hours"]),
)

knn_pipeline = make_pipeline(preprocessor, knn)
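# Fit the final model on the training set only, so the test set stays unseen for evaluation.
knn_pipeline.fit(
    X=players_train[["age", "played_hours"]],
    y=players_train["subscribe"]
)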
knn_pipeline
