### Introduction

Video games are a popular form of entertainment, but not everyone plays in the same way. Some players spend many hours exploring and interacting with the game, while others play only briefly. Understanding what influences how long people play can help researchers learn about player behavior and engagement.

At the University of British Columbia, a research team is studying this using a Minecraft server. On the server, players can explore freely and interact with the environment, and the researchers record their actions. Along with playtime, the team collects information about each player, such as their experience level, subscription status, age, and gender.

The goal of this project is to see whether these characteristics can help **predict how long a player will spend in the game**. Specifically, we ask: **Can we estimate the total number of hours a user will play (`played_hours`) based on their experience, subscription status, gender, and age?** By examining these relationships, we hope to identify factors that are linked to higher or lower engagement.


### Dataset 1 (`players.csv`)

This dataset contains information about **196 individuals**, each representing a player profile. It includes **9 variables** capturing demographic information, subscription status, and engagement metrics. Below is a detailed description of each variable:

| Variable Name     | Dtype    | Description              | Summary Statistic |
|-------------------|----------|---------------------------|-------------------|
| experience        | object   | Player expertise level    | – |
| subscribe         | bool     | Subscription status       | – |
| hashedEmail       | object   | Hashed email ID           | – |
| played_hours      | float64  | Total playtime (hours)    | refer to summary_stats |
| name              | object   | Participant name          | – |
| gender            | object   | Gender                    | – |
| age               | int64    | Player age (years)        | refer to summary_stats |
| individualId      | float64  | All values missing        | – |
| organizationName  | float64  | All values missing        | – |

### Dataset Issues and Considerations

While reviewing the dataset, several important points were noted:

- **Columns with all missing values:**  
  `individualId` and `organizationName` contain 100% missing data. They were likely placeholders or never collected, and they are removed in preprocessing.

- **Fields not useful for prediction:**  
  Columns like `name` or `hashedEmail` are identifiers and don’t provide meaningful information for modeling.

- **Categorical variables need encoding:**  
  - `experience` may follow an **ordinal scale** (Beginner → Pro).  
  - `gender` and `subscribe` are **non-ordinal**, so they should be one-hot encoded for models like KNN.

- **Self-reported data:**  
  Fields like experience level and age may contain bias or inaccuracies, which should be considered when interpreting results.
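
The check for fully missing columns can be sketched as follows. This is a minimal illustration, assuming a small stand-in DataFrame (`toy_players`, a hypothetical name) with a few rows mirroring the displayed data rather than the full `players.csv`.

```python
import numpy as np
import pandas as pd

# Small stand-in for players.csv (assumption: three rows, subset of columns).
toy_players = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur"],
    "played_hours": [30.3, 3.8, 0.7],
    "age": [9, 17, 21],
    "individualId": [np.nan, np.nan, np.nan],
    "organizationName": [np.nan, np.nan, np.nan],
})

# Fraction of missing values in each column.
missing_fraction = toy_players.isna().mean()

# Columns in which every single value is missing.
all_missing_cols = missing_fraction[missing_fraction == 1.0].index.tolist()

# Drop those columns; dropna(axis="columns", how="all") is an equivalent one-liner.
toy_clean = toy_players.drop(columns=all_missing_cols)
print(all_missing_cols)
```

Computing the missing fraction first, instead of dropping blindly, also reveals partially missing columns that might need imputation rather than removal.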

### Relevance for KNN Regression

Some features of this dataset are especially important when using K-Nearest Neighbors (KNN) regression:  

- **Numeric features only:** KNN measures distance between points, so any categorical variables (like subscription status or gender) need to be converted to numbers.  
- **Scaling matters:** Numeric predictors such as `age` are on a different scale from the encoded categorical variables. Without scaling, variables with larger ranges could dominate the distance calculations.  
- **Watch out for outliers:** Extremely high or low values, especially in age, can have a big effect on which points are considered neighbors.  
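
To make the scaling point concrete, the sketch below compares Euclidean distances before and after standardization. The four players and the helper `distances_from_first` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical players: columns are [age in years, subscribed as 0/1].
X = np.array([
    [17.0, 1.0],   # query player
    [18.0, 0.0],   # similar age, different subscription status
    [27.0, 1.0],   # same subscription status, ten years older
    [40.0, 0.0],
])

def distances_from_first(points):
    """Euclidean distance from the first row to every other row."""
    return np.linalg.norm(points[1:] - points[0], axis=1)

# Raw distances: age differences (in years) swamp the 0/1 column.
raw = distances_from_first(X)

# After standardization each column has unit variance, so both contribute.
scaled = distances_from_first(StandardScaler().fit_transform(X))

print(raw)
print(scaled)
```

On the raw values the 18-year-old is the nearest neighbour because only age matters; after standardization the 27-year-old with the same subscription status becomes nearer, showing that the choice of scale changes which points count as neighbours.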

# Methods

1) Loading the data

We started by importing the necessary Python libraries (pandas, altair, scikit-learn, and numpy) for data manipulation, modelling, and visualisation. The dataset was loaded into a DataFrame by assigning the URL of our data to `url` and calling `pd.read_csv`.

2) Cleaning and preprocessing data

Irrelevant columns (`hashedEmail`, `individualId`, `organizationName`, and `name`) were removed, as none of them were useful for our modelling. The `subscribe` column was remapped into two categories: "Subscribed" and "Not Subscribed". This step made the categorical variable easier to interpret and improved the visualizations. The `gender` column was grouped into three categories, Male, Female, and Other (consisting of Non-binary, Agender, Two-Spirited, etc.), to avoid bias from the low occurrence of values like Agender during regression. Before performing the exploratory data analysis, the dataset was split into a training set (75%) and a testing set (25%) using `train_test_split`, with a fixed random seed (113) to ensure reproducibility.
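
One detail of remapping with a dictionary is worth checking: pandas `Series.map` turns any label missing from the dictionary into `NaN`. The sketch below, which uses a shortened hypothetical map and an invented label ("Genderfluid") that is not claimed to occur in the dataset, shows one way to detect and handle such labels.

```python
import pandas as pd

# Hypothetical gender column; the last label is invented for the illustration.
gender = pd.Series(["Male", "Agender", "Female", "Genderfluid"])

# Shortened stand-in for the grouping dictionary.
gender_map = {"Male": "Male", "Female": "Female", "Agender": "Other"}

# Labels absent from the dictionary become NaN after mapping.
mapped = gender.map(gender_map)

# List the original labels that were not covered by the dictionary.
unmapped = gender[mapped.isna()].unique()

# Send any uncovered label to the catch-all category.
grouped = mapped.fillna("Other")

print(list(unmapped))
print(grouped.tolist())
```

Inspecting `unmapped` before filling keeps the grouping decision explicit instead of silently passing missing values to the encoder.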

3) EDA

The relationships between played hours and the potential categorical predictors ("Experience Level", "Subscription Status", and "Gender") were presented using histograms faceted by category, as these most effectively display the distribution of played hours. We also investigated the relationship between played hours and "Age" using a scatterplot. All plots used the training set only. Stacking the plots revealed a highly skewed distribution across all variables, with most players logging very few hours and a very small group accumulating very high hours. Because so many players logged low hours, the same plots were redrawn for only the lowest 90% of `played_hours` values (0 to 2.8 hours) to show the distribution of the low values, revealing a large number of players with 0-0.5 hours logged.

4) Preparing data for our modelling

The features for our model were experience, subscription status, gender, and age, with played hours as the target variable. A preprocessor was created in which the categorical variables `subscribe` and `gender` were one-hot encoded, `experience` was passed through an ordinal encoder (Beginner to Pro), and `age` was standardized, converting all data into a numeric format suitable for KNN regression, where distances are computed across all variables. We created a pipeline from this preprocessor and `KNeighborsRegressor`. To find the optimal K between 1 and 74, we ran a grid search over `param_grid` for the pipeline, using 3-fold cross-validation with RMSE as the scoring metric so that the estimate is averaged across multiple validation sets.

5) Modelling with KNN Regression

The grid search selected K = 38 with a cross-validated RMSE of 25.90. However, the K vs RMSE graph shows that RMSE improves only slightly beyond K = 12 (RMSE = 27.06), so we preferred K = 12. A smaller K averages over fewer neighbours, which matters here because the skew found in the EDA makes our model prone to underfitting: with a large K, predictions are pulled toward the mean. The final model was trained on our training set and used to predict played hours on our test set. The resulting test RMSE was 23.96.
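
The choice of a smaller K whose error is close to the best one can be written as an explicit rule. The sketch below is one possible formalization, not the procedure used in the analysis, which relied on reading the graph. Only the K = 12 and K = 38 rows use the RMSE values reported above; the other rows, the helper `smallest_k_within`, and the tolerance of 1.2 are assumptions made up for the illustration.

```python
import pandas as pd

# Illustrative results: the K=12 and K=38 rows match the reported RMSE values,
# the remaining rows are invented placeholders.
cv_rmse = pd.DataFrame({
    "k": [1, 5, 12, 20, 38, 50],
    "rmse": [34.0, 29.5, 27.06, 26.6, 25.90, 26.1],
})

def smallest_k_within(results, tolerance):
    """Return the smallest k whose RMSE is within `tolerance` of the best RMSE."""
    best = results["rmse"].min()
    eligible = results.loc[results["rmse"] <= best + tolerance, "k"]
    return int(eligible.min())

# Hypothetical tolerance of 1.2 hours of RMSE.
chosen_k = smallest_k_within(cv_rmse, tolerance=1.2)

# With zero tolerance the rule falls back to the grid-search optimum.
strict_k = smallest_k_within(cv_rmse, tolerance=0.0)

print(chosen_k, strict_k)
```

Stating the tolerance explicitly makes the trade-off reproducible, whereas a visual reading of the curve can differ between readers.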

6) Visualising analysis results

To assess model performance visually, we created a scatterplot of predicted vs actual playtime. This plot highlights where predictions match actual values and where our model underestimates or overestimates playtime, especially for high-hour individuals. For individuals logging many hours (20+), our model greatly underestimated the hours played, with no predictions above 40 hours, while several low values (0-5 hours) were overestimated, with several predictions near 20 hours. When we cut out the outliers and examined only hours 0 to 5, our model overestimated (predicted > actual) the majority of points in that range, though some were underestimated. Points on both graphs are scattered far from the predicted = actual diagonal with no discernible pattern, indicating little correlation between predicted and actual values and low accuracy.
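
The over- and underestimation read off the plots can also be summarized numerically from the residuals. The sketch below uses hypothetical `actual` and `predicted` arrays, not the model's real output, and the 20-hour cutoff mirrors the "20+" threshold mentioned above.

```python
import numpy as np

# Hypothetical actual and predicted hours, for illustration only.
actual = np.array([0.0, 0.1, 0.5, 2.0, 30.0, 150.0])
predicted = np.array([1.5, 0.05, 18.0, 3.0, 12.0, 38.0])

# Positive residual = overestimate, negative residual = underestimate.
residual = predicted - actual

# Share of players whose playtime was over- or underestimated.
share_over = float(np.mean(residual > 0))
share_under = float(np.mean(residual < 0))

# Average residual among high-hour players (20 hours or more).
high = actual >= 20
mean_residual_high = float(residual[high].mean())

print(share_over, share_under, mean_residual_high)
```

A strongly negative mean residual in the high-hour group would confirm in one number the underestimation seen in the scatterplot.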




In [2]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.metrics import mean_squared_error
from sklearn.compose import make_column_transformer

In [3]:
url='https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz'
player_data=pd.read_csv(url)
player_data

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [4]:
summary_stats = player_data[["played_hours","age"]].describe()
summary_stats

Unnamed: 0,played_hours,age
count,196.0,196.0
mean,5.845918,21.280612
std,28.357343,9.706346
min,0.0,8.0
25%,0.0,17.0
50%,0.1,19.0
75%,0.6,22.0
max,223.1,99.0


In [5]:
player_relevant_data=player_data.drop(columns=['hashedEmail', 'individualId', 'organizationName', 'name'])
player_relevant_data

Unnamed: 0,experience,subscribe,played_hours,gender,age
0,Pro,True,30.3,Male,9
1,Veteran,True,3.8,Male,17
2,Veteran,False,0.0,Male,17
3,Amateur,True,0.7,Female,21
4,Regular,True,0.1,Male,21
...,...,...,...,...,...
191,Amateur,True,0.0,Female,17
192,Veteran,False,0.3,Male,22
193,Amateur,False,0.0,Prefer not to say,17
194,Amateur,False,2.3,Male,17


In [6]:
# Map subscription status to consistent labels
player_relevant_data['subscribe'] = (
    player_relevant_data['subscribe']
    .map({True: 'Subscribed',
          False: 'Not Subscribed',
          'Subscribed': 'Subscribed',
          'Not Subscribed': 'Not Subscribed'})
)
    
# Map genders to Male / Female / Other
gender_map = {
    'Male': 'Male',
    'Female': 'Female',
    'Prefer not to say': 'Other',
    'Non-binary': 'Other',
    'Agender': 'Other',
    'Two-Spirited': 'Other',
    'Other': 'Other'
}
player_relevant_data['gender'] = player_relevant_data['gender'].map(gender_map)

features = ['experience', 'subscribe', 'gender', 'age']
target = 'played_hours'

player_training, player_testing = train_test_split(
    player_relevant_data,
    test_size=0.25,
    random_state=113
)

X_train = player_training[features]
y_train = player_training[target]

X_test = player_testing[features]
y_test = player_testing[target]

In [7]:
bounds = player_data["played_hours"].quantile([0.9])
bounds

0.9    2.8
Name: played_hours, dtype: float64

In [8]:
# Experience facet
chart_exp = (
    alt.Chart(player_training)
    .mark_bar()
    .encode(
        x=alt.X('played_hours:Q', bin=alt.Bin(maxbins=20), title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('experience:N', title='Experience')
    )
    .properties(title='Played Hours Distribution by Experience')
    .facet(column=alt.Column('experience:N',title='Experience'))
)

# Subscribe facet
chart_sub = (
    alt.Chart(player_training)
    .mark_bar()
    .encode(
        x=alt.X('played_hours:Q', bin=alt.Bin(maxbins=20), title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('subscribe:N').title("Subscribed")
    )
    .properties(title='Played Hours Distribution by Subscription')
    .facet(column=alt.Column('subscribe:N',title='Subscribe'))
)

# Gender facet
chart_gen = (
    alt.Chart(player_training)
    .mark_bar()
    .encode(
        x=alt.X('played_hours:Q', bin=alt.Bin(maxbins=20), title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('gender:N').title("Gender")
    )
    .properties(title='Played Hours Distribution by Gender')
    .facet(column=alt.Column('gender:N',title='Gender'))
)

#Age vs Played-Hours Scatter Plot
chart_age = (
    alt.Chart(player_training)
    .mark_point(opacity=0.5)
    .encode(
        x=alt.X('played_hours:Q', title='Played Hours (hrs)'),
        y=alt.Y('age:Q', title='Age of Players (yrs)')
    )
    .properties(
        title='Age vs Played Hours'
    )
)


(chart_exp & chart_sub & chart_gen & chart_age).configure_header(
    labelFontSize=20,
    titleFontSize=24
)

In [9]:
# Zoomed plots: trends were hard to see in the plots above, so the plots below
# are restricted to the 90% quantile of played_hours (2.8 hours) found above.

# Experience facet (zoomed)
chart_exp_zoom = (
    alt.Chart(player_training)
    .mark_bar(clip=True)
    .encode(
        x=alt.X('played_hours:Q',
                bin=alt.Bin(maxbins=500),
                scale=alt.Scale(domain=[0, 2.8]),   
                title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color='experience:N'
    )
    .properties(title='Played Hours Distribution by Experience (0 to 2.8 hours)')
    .facet(column=alt.Column('experience:N',title='Experience'))
)

# Subscribe facet (zoomed)
chart_sub_zoom = (
    alt.Chart(player_training)
    .mark_bar(clip=True)
    .encode(
        x=alt.X('played_hours:Q',
                bin=alt.Bin(maxbins=500),
                scale=alt.Scale(domain=[0, 2.8]), 
                title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('subscribe:N').title("Subscribed")
    )
    .properties(title='Played Hours Distribution by Subscription (0 to 2.8 hours)')
    .facet(column=alt.Column('subscribe:N',title='Subscribe'))
)

# Gender facet (zoomed)
chart_gen_zoom = (
    alt.Chart(player_training)
    .mark_bar(clip=True)
    .encode(
        x=alt.X('played_hours:Q',
                bin=alt.Bin(maxbins=500),
                scale=alt.Scale(domain=[0, 2.8]), 
                title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('gender:N').title("Gender")
    )
    .properties(title='Played Hours Distribution by Gender (0 to 2.8 hours)')
    .facet(column=alt.Column('gender:N',title='Gender'))
)

#Age vs Played-Hours Scatter Plot
chart_age_zoom = (
    alt.Chart(player_training)
    .mark_point(clip=True, opacity=0.5)
    .encode(
        x=alt.X(
            'played_hours:Q',
            title='Played Hours (hrs)',
            scale=alt.Scale(domain=[0, 2.8])
        ),
        y=alt.Y('age:Q', title='Age of Players (yrs)')
    )
    .properties(
        title='Age vs Played Hours'
    )
)

(chart_exp_zoom & chart_sub_zoom & chart_gen_zoom & chart_age_zoom).configure_header(
    labelFontSize=20,
    titleFontSize=24
)

In [18]:
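# Preprocessing: scale age, one-hot encode the nominal columns, ordinally encode experience.
# Note: "experience" is listed in both the one-hot and the ordinal transformer below,
# so it is encoded twice; the results reported here were produced with this configuration.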
player_preprocessor = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(sparse_output=False),["gender", "subscribe", "experience"]),
    (OrdinalEncoder(categories=[["Beginner", "Amateur", "Regular", "Veteran", "Pro"]]), ["experience"]),
    verbose_feature_names_out=False,
    remainder="passthrough"
)

#create pipeline
player_pipe = make_pipeline(
    player_preprocessor,
    KNeighborsRegressor()
)

#finding optimal K
param_grid = {
    "kneighborsregressor__n_neighbors": range(1,75)
}

player_gridsearch = GridSearchCV(
    estimator=player_pipe,
    param_grid=param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1
)

# Fit and extract results
player_results = (
    pd.DataFrame(player_gridsearch.fit(X_train, y_train).cv_results_)
)

# Best K and its RMSE
player_best_K = player_gridsearch.best_params_
player_best_RMSE = -player_gridsearch.best_score_

player_best_K, player_best_RMSE

({'kneighborsregressor__n_neighbors': 38}, np.float64(25.898794713363447))

In [19]:
rmse_k12 = (player_results
            .loc[player_results['param_kneighborsregressor__n_neighbors'] == 12,
            'mean_test_score']
            .iloc[0])

rmse_k12 = -rmse_k12   # negate because GridSearchCV uses NEGATIVE RMSE
rmse_k12

np.float64(27.058669299928493)

In [20]:
player_results=player_results.assign(RMSE= -player_results["mean_test_score"])

In [21]:
#Best K graph
Optimal_K_Chart=alt.Chart(player_results).mark_line().encode(
    x=alt.X('param_kneighborsregressor__n_neighbors', title='K Value'),
    y=alt.Y('RMSE', title='Root Mean Squared Error'))
Optimal_K_Chart

In [22]:
player_gridsearch.best_params_

{'kneighborsregressor__n_neighbors': 38}

In [23]:
# FINAL MODEL TESTING USING K = 12

final_knn_model = make_pipeline(
    player_preprocessor,
    KNeighborsRegressor(n_neighbors=12)
)

final_knn_model.fit(X_train, y_train)

y_pred = final_knn_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

rmse

np.float64(23.96461449754352)

In [24]:
final_knn_model

In [17]:
#Predicted vs Actual Plot
predicted_vs_actual = pd.DataFrame({
    "actual": y_test,
    "predicted": y_pred
})

predicted_vs_actual_chart=alt.Chart(predicted_vs_actual).mark_circle(opacity=0.5).encode(
    x=alt.X("actual",title='Actual Time Played (hours)'),
    y=alt.Y("predicted", title="Predicted Time Played (hours)")
).properties(
    title="Predicted vs Actual",
    width=300,
    height=300
)
predicted_actual = alt.Chart(pd.DataFrame({"x": [predicted_vs_actual["actual"].min(), predicted_vs_actual["actual"].max()]})).mark_line(color="black", opacity=0.2).encode(
    x="x",
    y="x")

# Layer the scatter and line
predicted_vs_actual_chart + predicted_actual


In [147]:
restricted_predicted_vs_actual_chart = (
    alt.Chart(predicted_vs_actual)
    .transform_filter("(datum.actual <= 5) && (datum.predicted <= 5)")
    .mark_circle(opacity=0.5)
    .encode(
        x=alt.X("actual:Q", title='Actual Time Played (hours)',
                scale=alt.Scale(domain=[0,5])),
        y=alt.Y("predicted:Q", title="Predicted Time Played (hours)",
                scale=alt.Scale(domain=[0,5]))
    )
    .properties(
        title="Predicted vs Actual",
        width=300,
        height=300
    )
)
line_data = pd.DataFrame({"x": [0, 5], "y": [0, 5]})
line = alt.Chart(line_data).mark_line(color="black", opacity=0.2).encode(
    x="x:Q",
    y="y:Q"
)
restricted_predicted_vs_actual_chart+line

In [148]:
new_player = pd.DataFrame([{
    "experience": "Regular",
    "subscribe": "Subscribed",
    "gender": "Female",
    "age": 20
}])

predicted_hours = final_knn_model.predict(new_player)
predicted_hours

array([37.])

In [149]:
new_player_2 = pd.DataFrame([{
    "experience": "Veteran",
    "subscribe": "Not Subscribed",
    "gender": "Male",
    "age": 80
}])

predicted_hours_2 = final_knn_model.predict(new_player_2)
predicted_hours_2

array([1.60833333])

# Discussion

We found that most players in the dataset logged very few hours, most frequently 0. A few individuals, however, logged extremely high hours in comparison, from about 30 hours up to the maximum of 223.1 hours. To predict the number of hours played, we used four variables, but our model still struggled to make accurate predictions. While we expected some inaccuracy due to the small dataset, the degree of inaccuracy was surprising. We therefore conclude that, in this dataset, age, gender, subscription status, and experience are not effective predictors of the hours a player will play.

With K = 12, our model achieved an RMSE of 23.96 on the test data. In other words, its predictions of played hours are typically off by roughly 24 hours. Although that is better than making a random guess, the improvement is minimal, especially since our predicted vs actual plot shows that high values are significantly underestimated and low values tend to be overestimated. This pattern can indicate underfitting. The most likely explanation is that, because of the small dataset, the model regresses toward the mean instead of representing the extremes found in the data. To improve on these results for future or similar questions, more data must be collected so that the model can detect trends, if any exist, that distinguish players who will play many hours from those who will play few. Alternatively, more data could support our results by continuing to show no relationship between played hours and age, gender, experience, and subscription status, but with more conviction and without the risk of small-sample bias that may be affecting this model.
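
A comparison against a naive guess can be made explicit with a baseline that always predicts the training mean. The sketch below uses scikit-learn's `DummyRegressor` on synthetic, skewed data generated for the illustration; it does not reproduce the project's data or results, and no such baseline was computed in the analysis above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Synthetic, right-skewed targets (assumption: exponential draws, fixed seed).
rng = np.random.default_rng(0)
y_train = rng.exponential(scale=5.0, size=150)
y_test = rng.exponential(scale=5.0, size=50)

# The baseline ignores the features, so placeholder columns are enough.
X_train = np.zeros((150, 1))
X_test = np.zeros((50, 1))

# Always predict the mean of the training targets.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

# RMSE of the mean-only baseline on the held-out targets.
baseline_rmse = float(np.sqrt(mean_squared_error(y_test, baseline_pred)))
print(baseline_rmse)
```

If a fitted model's test RMSE is not clearly below this baseline RMSE on the same split, the predictors add little beyond the overall mean.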

Although these findings do not show any clear trends, they offer insight into the next steps for this project. First, more data should be collected, as the imbalance between high and low values made the dataset difficult to work with and to derive results from. Second, the findings suggest that factors other than age, gender, subscription status, and experience should be considered, such as the amount of free time a player has per week.

Finally, future questions for this project may involve changing how the data is wrangled (likely after more data is collected), such as "for players who have played on the server (over 0 hours played), can we predict playing time based on gender, age, subscription status, and experience?". A second question is which of these factors can individually predict the number of hours a user will play. This would reveal any individual trends that were hidden when all four predictors were combined.