# DSCI 100 Winter Term 1 2025/2026 
## GROUP 9 - PROJECT FINAL REPORT

## Predicting Player Contribution Levels on a Minecraft Game Research Server

### Group Members: Chenxu Zhao (76439926), Ellenna Edij (62956032), Harpuneet Sran (20655627), Sean Jin (59517383) 

#### Libraries


In [3]:
import altair as alt
import pandas as pd

#### (1) Introduction

##### A. Relevant Background Information
A UBC Computer Science research group is collecting gameplay data from a custom Minecraft server to study how players behave in-game. Player actions and sessions are recorded, and the research team needs this information to make decisions about:
- recruiting the right types of players,
- ensuring enough server resources and software licenses,
- understanding which players contribute the most data,
- and identifying behavioural patterns linked to newsletter subscription or long-term engagement.


The project lead, Frank Wood, has three broad research questions for students to explore:

- Which player characteristics and behaviours predict newsletter subscription?
- Which types of players contribute the most gameplay data?
- What time windows are likely to experience high numbers of simultaneous players?



#### (2) Question

For this project, our group chose to focus on Question 1:

“What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?”

From this, we constructed a specific predictive question:

"Can a player's reported playing time and age predict newsletter subscription among players aged 15–28?"

#### (3) Data Description:

Our group will be using the players.csv dataset, as it is suitable for building our predictive models. The dataset contains the players' characteristics (age) and behavioural measures (hours played).


In [6]:
# URL of the data file (hosted on Google Drive)
url = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"

# Loading in dataset
players = pd.read_csv(url)

# Raw dataset (untidy)
players

| |experience|subscribe|hashedEmail|played_hours|name|gender|age|individualId|organizationName|
| ---: | :--- | :--- | :--- | ---: | :--- | :--- | ---: | :--- | :--- |
|0|Pro|True|f6daba428a5e19a3d47574858c13550499be23603422e6...|30.3|Morgan|Male|9| | |
|1|Veteran|True|f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...|3.8|Christian|Male|17| | |
|2|Veteran|False|b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...|0.0|Blake|Male|17| | |
|3|Amateur|True|23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...|0.7|Flora|Female|21| | |
|4|Regular|True|7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...|0.1|Kylie|Male|21| | |
|...|...|...|...|...|...|...|...|...|...|
|191|Amateur|True|b6e9e593b9ec51c5e335457341c324c34a2239531e1890...|0.0|Bailey|Female|17| | |
|192|Veteran|False|71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...|0.3|Pascal|Male|22| | |
|193|Amateur|False|d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...|0.0|Dylan|Prefer not to say|17| | |
|194|Amateur|False|f19e136ddde68f365afc860c725ccff54307dedd13968e...|2.3|Harlow|Male|17| | |


The players.csv file comprises 196 observations. Excluding the "individualId" and "organizationName" columns, it contains the following 7 variables:

| Variable Name | Variable Type | Variable Description |
| :------- | :------: | :-------: |
|experience|String|Categorical variable describing the user's experience in the game (Veteran, Pro, Regular, Amateur, Beginner)|
|subscribe|Boolean|Categorical variable showing whether the user is subscribed to the newsletter|
|hashedEmail|String|Unique identifier holding the hashed form of each player's email address|
|played_hours|Float|Quantitative variable representing the total reported hours of playtime|
|name|String|Categorical variable representing the name of each player|
|gender|String|Categorical variable showing the player's reported gender (e.g., Male, Female, Prefer not to say)|
|age|Integer|Quantitative variable representing the current age of the player|
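
The row count, column types, and missing values described here can be checked directly with pandas. The following is a minimal sketch, not the group's actual code: `sample` is a hypothetical stand-in built from a few columns of the first five rows displayed above, and the two empty columns are assumed to load as missing values.

```python
import pandas as pd

# Hypothetical stand-in for the full `players` frame, built from the
# first five rows shown in the output above (a subset of the columns).
sample = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular"],
    "subscribe": [True, True, False, True, True],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1],
    "age": [9, 17, 17, 21, 21],
    # Assumption: the blank cells in the display are missing values.
    "individualId": [None] * 5,
    "organizationName": [None] * 5,
})

# Number of observations and columns
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# Data type of each column
print(sample.dtypes)

# Count of missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Running the same three calls on the full `players` frame would show whether the counts and types in the table match the loaded data.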

##### Issues/Potential Issues: 

The dataset contains missing values in the “individualId” and “organizationName” columns, which makes it untidy; these variables can be safely removed. A potential issue is that the numeric variables have very different ranges, which can affect how our model operates: variables on larger scales may be weighted more heavily than others.
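
One common way to address the difference in scales is to standardize each numeric variable to mean 0 and standard deviation 1. The sketch below illustrates the idea under stated assumptions and is not the group's final preprocessing: `sample` is a hypothetical frame holding the first five displayed rows, and the population standard deviation (`ddof=0`) is an illustrative choice.

```python
import pandas as pd

# Hypothetical stand-in holding the first five displayed rows
sample = pd.DataFrame({
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1],
    "age": [9, 17, 17, 21, 21],
})

# z-score each column: subtract its mean, then divide by its
# population standard deviation (ddof=0 is an assumption here)
standardized = (sample - sample.mean()) / sample.std(ddof=0)

print(standardized)
```

After this step both columns are on a comparable scale, so neither dominates purely because of its units.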

##### Follow Up to Issues: Values included/excluded

Non-numeric identifier columns such as "name" and "hashedEmail" should also be excluded, as they do not contribute to the analysis of the data. In contrast, "age" and "played_hours" are promising predictors of subscription likelihood and should be included.

##### Data Collection:

The data was collected from player activity on a custom Minecraft server run by the Computer Science Department at UBC.

#### Data Wrangling and Cleaning

In [7]:
# Making data tidy. Dropping "individualId" and "organizationName"
columns_to_drop = ["individualId", "organizationName"]
players = players.drop(columns=columns_to_drop)

# Dropping columns/variables that do not contribute to the analysis of the data 
columns_to_drop = ["experience", "hashedEmail", "name", "gender"]
players = players.drop(columns=columns_to_drop)

players

| |subscribe|played_hours|age|
| ---: | :--- | ---: | ---: |
|0|True|30.3|9|
|1|True|3.8|17|
|2|False|0.0|17|
|3|True|0.7|21|
|4|True|0.1|21|
|...|...|...|...|
|191|True|0.0|17|
|192|False|0.3|22|
|193|False|0.0|17|
|194|False|2.3|17|
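
The predictive question is restricted to players aged 15–28, while the tidy data still includes players outside that range (for example, the first row has age 9). One way to apply the restriction is sketched below. It assumes both bounds are inclusive; `sample`, `MIN_AGE`, `MAX_AGE`, and `players_15_28` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical stand-in holding the first five displayed rows
sample = pd.DataFrame({
    "subscribe": [True, True, False, True, True],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1],
    "age": [9, 17, 17, 21, 21],
})

# Age bounds taken from the predictive question (assumed inclusive)
MIN_AGE, MAX_AGE = 15, 28

# Series.between is inclusive on both ends by default
in_range = sample["age"].between(MIN_AGE, MAX_AGE)
players_15_28 = sample[in_range].reset_index(drop=True)

print(players_15_28)
```

On these five rows, the filter keeps the four players aged 17 and 21 and drops the player aged 9.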
