### Introduction

Video games are a popular form of entertainment, but not everyone plays in the same way. Some players spend many hours exploring and interacting with the game, while others play only briefly. Understanding what influences how long people play can help researchers learn about player behavior and engagement.

At the University of British Columbia, a research team is studying this using a Minecraft server. On the server, players can explore freely and interact with the environment, and the researchers record their actions. Along with playtime, the team collects information about each player, such as their experience level, subscription status, age, and gender.

The goal of this project is to see whether these characteristics can help **predict how long a player will spend in the game**. Specifically, we ask: **Can we estimate the total number of hours a user will play (played_hours) based on their experience, subscription status, gender, and age?** By examining these relationships, we hope to identify factors that are linked to higher or lower engagement.


### Dataset 1 (`players.csv`)

This dataset contains information about **196 individuals**, each representing a player profile. It includes **9 variables** capturing demographic information, subscription status, and engagement metrics. Below is a detailed description of each variable:

| Variable Name     | Dtype    | Description                                                        | Summary Statistic |
|-------------------|----------|--------------------------------------------------------------------|-------------------|
| experience        | object   | Player expertise level (Beginner, Amateur, Regular, Veteran, Pro)  | – |
| subscribe         | bool     | Subscription status                                                | – |
| hashedEmail       | object   | Hashed email identifier                                            | – |
| played_hours      | float64  | Total playtime (hours)                                             | mean 5.85, median 0.1, max 223.1 (see `summary_stats`) |
| name              | object   | Participant name                                                   | – |
| gender            | object   | Gender                                                             | – |
| age               | int64    | Player age (years)                                                 | mean 21.28, median 19, range 8–99 (see `summary_stats`) |
| individualId      | float64  | Empty column (all values missing)                                  | – |
| organizationName  | float64  | Empty column (all values missing)                                  | – |


**Note:** The `individualId` and `organizationName` columns are entirely empty and provide no usable information, so they should be removed during preprocessing. These fields may never have been collected, may not have applied to these players, or may have been lost during data processing.
 
---

### Dataset Issues and Considerations

While reviewing the dataset, several important points were noted:

- **Columns with all missing values:**  
  `individualId` and `organizationName` contain 100% missing data. They were likely placeholders or were never collected.

- **Fields not useful for prediction:**  
  Columns like `name` or `hashedEmail` are identifiers and don’t provide meaningful information for modeling.

- **Categorical variables need encoding:**  
  - `experience` may follow an **ordinal scale** (Beginner → Pro).  
  - `gender` is **non-ordinal**, so it should be one-hot encoded for models like KNN.

- **Self-reported data:**  
  Fields like experience level and age may contain bias or inaccuracies, which should be considered when interpreting results.
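
As a rough illustration of the two encodings mentioned above, the sketch below uses a small made-up table. The rows and the `experience_rank` column name are hypothetical and not taken from `players.csv`. It assumes the ordering Beginner < Amateur < Regular < Veteran < Pro for `experience`, and it uses plain pandas rather than the scikit-learn encoders.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "experience": ["Pro", "Beginner", "Veteran"],
    "gender": ["Male", "Female", "Other"],
})

# Assumed ordering of experience levels, lowest to highest
levels = ["Beginner", "Amateur", "Regular", "Veteran", "Pro"]

# Ordinal encoding: each level becomes its position in the assumed order
toy["experience_rank"] = pd.Categorical(
    toy["experience"], categories=levels, ordered=True
).codes

# One-hot encoding: one indicator column per gender category, no implied order
gender_dummies = pd.get_dummies(toy["gender"], prefix="gender")

print(toy[["experience", "experience_rank"]])
print(gender_dummies)
```

The ordinal column treats Pro as "four steps" from Beginner, which only makes sense if the levels really are ordered. The one-hot columns make no such claim about gender.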

### Relevance for KNN Regression

Some features of this dataset are especially important when using K-Nearest Neighbors (KNN) regression:  

- **Numeric features only:** KNN measures distance between points, so any categorical variables (like subscription status or gender) need to be converted to numbers.  
- **Scaling matters:** Numeric predictors such as `age` (range 8–99) sit on a very different scale from encoded 0/1 indicators. Without scaling, variables with larger ranges could dominate the distance calculations.  
- **Remove unnecessary fields:** Columns like `name` or `hashedEmail` don’t provide useful information for predicting playtime and can distort distances between players.  
- **Watch out for outliers:** Extremely high or low values, especially in age, can have a big effect on which points are considered neighbors.  
- **Uneven experience levels:** If some experience categories have very few players, KNN might struggle to find representative neighbors for them.  
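
To make the scaling point concrete, here is a minimal sketch with four made-up players and one query player. The numbers and the helper `nearest_index` are hypothetical. It only shows how the nearest neighbor can change once both columns are standardized.

```python
import numpy as np

# Hypothetical players: columns are [age in years, subscribed (1) or not (0)]
players = np.array([
    [20.0, 1.0],
    [26.0, 0.0],
    [60.0, 1.0],
    [10.0, 0.0],
])
query = np.array([25.0, 1.0])  # a 25-year-old subscriber

def nearest_index(points, target):
    """Return the row index of the point closest to target (Euclidean distance)."""
    distances = np.sqrt(((points - target) ** 2).sum(axis=1))
    return int(np.argmin(distances))

# Raw units: the age column spans tens of years, so it dominates the distance
nearest_raw = nearest_index(players, query)

# Standardize each column with the mean and standard deviation of the players
mean = players.mean(axis=0)
std = players.std(axis=0)
nearest_scaled = nearest_index((players - mean) / std, (query - mean) / std)

print(nearest_raw, nearest_scaled)
```

In raw units the closest player is the 26-year-old non-subscriber (row 1), because a one-year age gap outweighs the subscription mismatch. After standardizing, the 20-year-old subscriber (row 0) becomes the nearest neighbor.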

In [1]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.metrics import mean_squared_error
from sklearn.compose import make_column_transformer

In [19]:
url='https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz'
player_data=pd.read_csv(url)
player_data

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [20]:
summary_stats = player_data[["played_hours","age"]].describe()
summary_stats

Unnamed: 0,played_hours,age
count,196.0,196.0
mean,5.845918,21.280612
std,28.357343,9.706346
min,0.0,8.0
25%,0.0,17.0
50%,0.1,19.0
75%,0.6,22.0
max,223.1,99.0


In [21]:
player_relevant_data=player_data.drop(columns=['hashedEmail', 'individualId', 'organizationName', 'name'])
player_relevant_data

Unnamed: 0,experience,subscribe,played_hours,gender,age
0,Pro,True,30.3,Male,9
1,Veteran,True,3.8,Male,17
2,Veteran,False,0.0,Male,17
3,Amateur,True,0.7,Female,21
4,Regular,True,0.1,Male,21
...,...,...,...,...,...
191,Amateur,True,0.0,Female,17
192,Veteran,False,0.3,Male,22
193,Amateur,False,0.0,Prefer not to say,17
194,Amateur,False,2.3,Male,17


In [22]:
# Map subscription status to consistent labels
player_relevant_data['subscribe'] = (
    player_relevant_data['subscribe']
    .map({True: 'Subscribed',
          False: 'Not Subscribed',
          'Subscribed': 'Subscribed',
          'Not Subscribed': 'Not Subscribed'})
)
    
# Map genders to Male / Female / Other
gender_map = {
    'Male': 'Male',
    'Female': 'Female',
    'Prefer not to say': 'Other',
    'Non-binary': 'Other',
    'Agender': 'Other',
    'Two-Spirited': 'Other',
    'Other': 'Other'
}
player_relevant_data['gender'] = player_relevant_data['gender'].map(gender_map)

# Choose predictors and target, then split 75% / 25% (train_test_split is imported above)

features = ['experience', 'subscribe', 'gender', 'age']
target = 'played_hours'

player_training, player_testing = train_test_split(
    player_relevant_data,
    test_size=0.25,
    random_state=113
)

X_train = player_training[features]
y_train = player_training[target]

X_test = player_testing[features]
y_test = player_testing[target]

In [23]:
bounds = player_data["played_hours"].quantile([0.9])
bounds

0.9    2.8
Name: played_hours, dtype: float64

In [24]:
# Experience facet
chart_exp = (
    alt.Chart(player_training)
    .mark_bar()
    .encode(
        x=alt.X('played_hours:Q', bin=alt.Bin(maxbins=20), title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('experience:N', title='Experience')
    )
    .properties(title='Played Hours Distribution by Experience')
    .facet(column=alt.Column('experience:N',title='Experience'))
)

# Subscribe facet
chart_sub = (
    alt.Chart(player_training)
    .mark_bar()
    .encode(
        x=alt.X('played_hours:Q', bin=alt.Bin(maxbins=20), title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('subscribe:N').title("Subscribed")
    )
    .properties(title='Played Hours Distribution by Subscription')
    .facet(column=alt.Column('subscribe:N',title='Subscribe'))
)

# Gender facet
chart_gen = (
    alt.Chart(player_training)
    .mark_bar()
    .encode(
        x=alt.X('played_hours:Q', bin=alt.Bin(maxbins=20), title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('gender:N').title("Gender")
    )
    .properties(title='Played Hours Distribution by Gender')
    .facet(column=alt.Column('gender:N',title='Gender'))
)

#Age vs Played-Hours Scatter Plot
chart_age = (
    alt.Chart(player_training)
    .mark_point(opacity=0.5)
    .encode(
        x=alt.X('played_hours:Q', title='Played Hours (hrs)'),
        y=alt.Y('age:Q', title='Age of Players (yrs)')
    )
    .properties(
        title='Age vs Played Hours'
    )
)


(chart_exp & chart_sub & chart_gen & chart_age).configure_header(
    labelFontSize=20,
    titleFontSize=24
)

## Experience
Across all experience levels (Amateur, Beginner, Pro, Regular, and Veteran), the histograms show a similar pattern: **most players record very few hours**, while only a handful log **extremely high playtime**.  
Even though we might expect more experienced players to play longer, the distributions are all dominated by a **large spike near 0–5 hours**, followed by a **long tail extending out toward 200+ hours**.  
Because these long-tail values stretch the x-axis, **differences between experience groups become hard to distinguish**, as most of the meaningful variation happens in the first few hours of play.

## Subscription Status
Both Subscribed and Not Subscribed players show the same issue: a **very high concentration of people with low playtime**. Subscribed players appear to have slightly more individuals with **high playtime (e.g., >100 hours)**, but the overall shape still shows that **most people cluster close to zero**.  
Again, the stretched x-axis caused by a small number of extreme players makes it difficult to visually compare the two groups within the range where **most of the data actually lies**.

## Gender
When splitting by gender (Female, Male, and Other), the same pattern appears:

- **Most players in each gender category log only a few hours.**
- A **very small number of players produce long right tails.**

This makes it tricky to identify whether any gender meaningfully differs in average playtime because the **outliers dominate the scale**, compressing the main distribution toward the left of each plot.

## Age
The scatterplot of Age vs. Played Hours also shows **heavy right-skew**. Across all ages, the **majority of players sit at the left edge of the x-axis (close to zero hours)**, with just a few individuals playing **extremely large amounts**.  
Because these outliers stretch the axis, it becomes difficult to see whether **age genuinely relates to playtime in the low-hour range**, where almost all players fall.

## Why We Needed to Zoom In
Across all variables, the range of played hours is dominated by a **small number of extreme high-hour players**. This causes the x-axes to stretch **past 200 hours** (the maximum is 223.1), despite nearly all observations sitting **between 0 and 5 hours**. As a result, the plots show a **large spike at the left** and almost no visible detail within the range where **most players actually fall**.  
To address this, in the next step of our analysis we used the **90th percentile of played hours** (2.8 hours, computed above) and narrowed the plots to **0 up to that cutoff**, allowing us to more clearly examine how **experience, subscription, gender, and age relate to playtime** among the majority of the dataset.
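
The cutoff idea can be sketched on a small, made-up set of playtimes. The values below are hypothetical and are not the real data. The quantile level (0.90 here, as in the `quantile([0.9])` cell above) is a parameter that can be adjusted.

```python
import numpy as np

# Hypothetical, heavily right-skewed playtimes (hours); not the real data
hours = np.array([0.0] * 10 + [0.1, 0.1, 0.2, 0.3, 0.5, 0.6, 1.0, 2.0, 40.0, 200.0])

# Value below which roughly 90% of the players fall
cutoff = np.quantile(hours, 0.90)

# Fraction of players that remain visible when the axis stops at the cutoff
share_shown = np.mean(hours <= cutoff)

print(f"cutoff = {cutoff:.2f} h, max = {hours.max():.1f} h, share shown = {share_shown:.0%}")
```

Even though the largest value is 200 hours, the cutoff lands at a few hours, so limiting the axis to it keeps most players in view while hiding only the extreme tail.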

In [8]:
# Experience facet (Zoomed - It was very hard to see trends from the above plots, so the plots below were created which are 'zoomed in' to the 90% quantile of data, which was found above)
chart_exp_zoom = (
    alt.Chart(player_training)
    .mark_bar(clip=True)
    .encode(
        x=alt.X('played_hours:Q',
                bin=alt.Bin(maxbins=500),
                scale=alt.Scale(domain=[0, 2.8]),   
                title='Played Hours (hrs)',),
        y=alt.Y('count()', title='Number of Players'),
        color='experience:N'
    )
    .properties(title='Played Hours Distribution by Experience (0 to 2.8 hours)')
    .facet(column=alt.Column('experience:N',title='Experience'))
)

# Subscribe facet (zoomed)
chart_sub_zoom = (
    alt.Chart(player_training)
    .mark_bar(clip=True)
    .encode(
        x=alt.X('played_hours:Q',
                bin=alt.Bin(maxbins=500),
                scale=alt.Scale(domain=[0, 2.8]), 
                title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('subscribe:N').title("Subscribed")
    )
    .properties(title='Played Hours Distribution by Subscription (0 to 2.8 hours)')
    .facet(column=alt.Column('subscribe:N',title='Subscribe'))
)

# Gender facet (zoomed)
chart_gen_zoom = (
    alt.Chart(player_training)
    .mark_bar(clip=True)
    .encode(
        x=alt.X('played_hours:Q',
                bin=alt.Bin(maxbins=500),
                scale=alt.Scale(domain=[0, 2.8]), 
                title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('gender:N').title("Gender")
    )
    .properties(title='Played Hours Distribution by Gender (0 to 2.8 hours)')
    .facet(column=alt.Column('gender:N',title='Gender'))
)

#Age vs Played-Hours Scatter Plot
chart_age_zoom = (
    alt.Chart(player_training)
    .mark_point(clip=True, opacity=0.5)
    .encode(
        x=alt.X(
            'played_hours:Q',
            title='Played Hours (hrs)',
            scale=alt.Scale(domain=[0, 2.8])
        ),
        y=alt.Y('age:Q', title='Age of Players (yrs)')
    )
    .properties(
        title='Age vs Played Hours'
    )
)

(chart_exp_zoom & chart_sub_zoom & chart_gen_zoom & chart_age_zoom).configure_header(
    labelFontSize=20,
    titleFontSize=24
)

## Zoomed-In Analysis (0 to 2.8 Hours)
After restricting the x-axis to the 90th percentile of played hours (approximately **2.8 hours**), the distributions become much clearer. Taking the extreme outliers out of view allows us to better compare groups and see how experience, subscription status, gender, and age relate to **low–moderate playtime**, where about 90% of players fall.

### Experience
- When zooming in, all experience groups show a strong clustering below **1 hour**. 
- Amateur and Beginner players have the highest counts in the **0–0.5 hour** range.  
- Regular and Veteran players still mostly log under 1 hour, but their distributions have a slightly wider spread.  
- Overall, even with the zoomed scale, experience does not show a dramatic separation, suggesting that **low playtime is common across all skill levels**.

### Subscription Status
- With the extreme values out of view, both groups show a sharper concentration near **0 hours**.  
- Subscribed users display a somewhat broader distribution, with more individuals appearing between **0.5–1.5 hours**.  
- Not subscribed users are even more tightly concentrated at very low playtime.  
- This suggests that subscription status may have a modest influence, but still not a strong predictor within this limited range.

### Gender
The zoomed-in distributions reveal greater clarity:  
- Male players dominate the dataset numerically and cluster heavily under **0.5 hours**.  
- Female players show a similar pattern but with fewer individuals across all bins.  
- Other gender identities appear in very small numbers but follow the same shape—mostly under 0.5 hours.  
- The zoomed view reinforces that **gender differences are minimal**, with all groups showing extremely low playtime for most users.

### Age
In the zoomed scatterplot of Age vs. Played Hours, the structure becomes more interpretable:  
- Almost all points fall under **2.8 hours** regardless of age.  
- There is no clear upward or downward trend, suggesting that **age is not strongly correlated with short-term playtime**.  
- The clustering near zero across all ages is now much easier to see without the distortion caused by outliers.

### Summary of Zoomed-In Patterns
The zoomed-in plots reveal that across experience, subscription status, gender, and age, the majority of players record **extremely low playtime, typically below 1 hour**.  While some small differences are visible—such as subscribers having slightly more players with moderate hours—the overall takeaway is that **low-hour behavior is consistent across all groups**.  This clearer visualization helps confirm that **outliers were masking the true structure of the data** in the original full-range plots.


In [26]:
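# Preprocessing: standardize age and encode the categorical predictors.
# Note: "experience" is encoded twice here (one-hot and ordinal), so it carries
# extra weight in the KNN distance; the ordinal column is not standardized.
player_preprocessor = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(sparse_output=False),["gender", "subscribe", "experience"]),
    (OrdinalEncoder(categories=[["Beginner", "Amateur", "Regular", "Veteran", "Pro"]]), ["experience"]),
    verbose_feature_names_out=False,
    remainder="passthrough"
)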

#create pipeline
player_pipe = make_pipeline(
    player_preprocessor,
    KNeighborsRegressor()
)

# Find the optimal K; range(1, 50) tries K = 1 through 49
param_grid = {
    "kneighborsregressor__n_neighbors": range(1, 50)
}

player_gridsearch = GridSearchCV(
    estimator=player_pipe,
    param_grid=param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1
)

# Fit and extract results
player_results = (
    pd.DataFrame(player_gridsearch.fit(X_train, y_train).cv_results_)
)

# Best K and its RMSE
player_best_K = player_gridsearch.best_params_
player_best_RMSE = -player_gridsearch.best_score_

player_best_K, player_best_RMSE

({'kneighborsregressor__n_neighbors': 38}, np.float64(25.898794713363447))

In [27]:
player_results=player_results.assign(RMSE= -player_results["mean_test_score"])

In [28]:
#Best K graph
Optimal_K_Chart=alt.Chart(player_results).mark_line().encode(
    x=alt.X('param_kneighborsregressor__n_neighbors', title='K Value'),
    y=alt.Y('RMSE', title='Root Mean Squared Error'))
Optimal_K_Chart

In [29]:
player_gridsearch.best_params_

{'kneighborsregressor__n_neighbors': 38}

In [30]:
# FINAL MODEL TESTING USING K = 38

final_knn_model = make_pipeline(
    player_preprocessor,
    KNeighborsRegressor(n_neighbors=38)
)

final_knn_model.fit(X_train, y_train)

y_pred = final_knn_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

rmse

np.float64(25.940692317683844)

In [31]:
final_knn_model

In [32]:
#Predicted vs Actual Plot
predicted_vs_actual = pd.DataFrame({
    "actual": y_test,
    "predicted": y_pred
})

predicted_vs_actual_chart=alt.Chart(predicted_vs_actual).mark_circle(opacity=0.5).encode(
    x=alt.X("actual",title='Actual Time Played (hours)'),
    y=alt.Y("predicted", title="Predicted Time Played (hours)")
).properties(
    title="Predicted vs Actual",
    width=300,
    height=300
)
predicted_vs_actual_chart

In [33]:
restricted_predicted_vs_actual_chart = (
    alt.Chart(predicted_vs_actual)
    .transform_filter("(datum.actual <= 5) && (datum.predicted <= 5)")
    .mark_circle(opacity=0.5)
    .encode(
        x=alt.X("actual:Q", title='Actual Time Played (hours)',
                scale=alt.Scale(domain=[0,5])),
        y=alt.Y("predicted:Q", title="Predicted Time Played (hours)",
                scale=alt.Scale(domain=[0,5]))
    )
    .properties(
        title="Predicted vs Actual",
        width=300,
        height=300
    )
)

restricted_predicted_vs_actual_chart

# Interpretation of the Predicted vs Actual Plots

These two visualizations (the zoomed and unzoomed plots of actual vs. predicted time played) show a clear **bias pattern**: the model tends to **overpredict** playtime for low-hour users and can **underpredict** for high-hour users.

## What the plot shows

- **Each point represents a user:**
  - **x-axis:** actual hours played
  - **y-axis:** predicted hours played by the model
- **For users with 0 actual hours**, predicted values reach as high as about **16 hours**, indicating a **positive bias** (overestimation) for non-players. The largest of these appear only in the unzoomed graph, because the zoomed graph is filtered to predictions of 5 hours or less.
- **For users with higher actual hours** (e.g., ~27 hours), predictions can be much lower than the true value, showing a **negative bias** (underestimation) for high-play users, which can be seen in the unzoomed graph.
- Overall, the dominant pattern is **overestimation** in the common **0–5 hours range**, where most users are concentrated.

## Possible reason for this bias

- The KNN model predicts based on the **average behavior of nearest neighbors**.
- Since most users have low playtime, neighbors of low-play users may include some moderate players, causing **overprediction** for low-hour users.
- Conversely, extreme high-hour users may have few similar neighbors, causing the model to **underestimate** their playtime.
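
The averaging effect described above can be illustrated with a deliberately extreme, hypothetical neighborhood: 37 neighbors with 0 hours and a single neighbor at the dataset maximum of 223.1 hours. With the default uniform weights, KNN regression predicts the mean of the neighbors' targets.

```python
import numpy as np

k = 38
# Hypothetical neighborhood: 37 players with 0 hours and one extreme player
neighbor_hours = np.array([0.0] * (k - 1) + [223.1])

# KNN regression with uniform weights predicts the mean of the neighbors
knn_prediction = neighbor_hours.mean()

# The median shows what a "typical" neighbor actually played
typical_neighbor = np.median(neighbor_hours)

print(f"mean of neighbors   = {knn_prediction:.2f} hours")
print(f"median of neighbors = {typical_neighbor:.2f} hours")
```

A single extreme neighbor is enough to lift the prediction to almost 6 hours even though every other neighbor played nothing, while that same extreme player would be predicted at a small fraction of their true playtime.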

## Implications for the research question

**Research question:** *Can we estimate total hours played (`played_hours`) based on experience, subscription status, gender, and age?*

- The positive bias for low-hour users and negative bias for high-hour users indicates the model is **systematically inaccurate across the range of playtime**.
- Predictions are **widely scattered** around the diagonal line, showing **substantial error** even in the common playtime range.

## Conclusion

With the current KNN model and selected features, **predictions are not reliably accurate**. There is a tendency to **overpredict low-hour users** and **underpredict high-hour users**, likely due to the distribution of playtime in the dataset and the averaging nature of KNN.

In [35]:
import pandas as pd

new_player = pd.DataFrame([{
    "experience": "Regular",
    "subscribe": "Subscribed",
    "gender": "Female",
    "age": 20
}])

predicted_hours = final_knn_model.predict(new_player)
predicted_hours

array([15.92105263])

In [36]:
new_player_2 = pd.DataFrame([{
    "experience": "Veteran",
    "subscribe": "Not Subscribed",
    "gender": "Male",
    "age": 80
}])

predicted_hours_2 = final_knn_model.predict(new_player_2)
predicted_hours_2

array([1.21842105])

# Methods


# 1) Loading the data

We began by importing the necessary Python libraries including **pandas**, **altair**, **sklearn**, and **numpy** to handle data manipulation, modeling, and visualization. The dataset was loaded into a DataFrame using `pd.read_csv` after assigning its URL to the variable `url`.


# 2) Cleaning and preprocessing

We removed columns that were irrelevant to our analysis or modeling, including **hashedEmail**, **individualId**, **organizationName**, and **name**.

The **subscribe** variable was recoded into two categories: **Subscribed** and **Not Subscribed**, ensuring it could be used meaningfully in visualizations and models.

The **gender** variable was grouped into three categories—**Male**, **Female**, and **Other** (which included non-binary, agender, Two-Spirit, etc.)—to simplify analysis and reduce sparsity.


# 3) Exploratory data analysis (EDA)

We used histograms to examine the relationship between **played hours** and the potential categorical predictors:

- **Experience level**  
- **Subscription status**  
- **Gender**

We also created a scatterplot to inspect how **age**, a numerical predictor, relates to played hours.

Across all variables, the distribution of played hours was **highly skewed**: most users logged very few hours, while a small subset recorded extremely high playtime. Faceting allowed direct comparison of patterns across predictors.

# 4) Preparing data for modeling

We split the dataset into **training (75%)** and **testing (25%)** sets using `train_test_split`, with a fixed random seed (**113**) for reproducibility.

Our feature set included:

- **experience**
- **subscription status**
- **gender**
- **age**

The target variable was **played hours**.
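
The 75/25 split described above can be reproduced in miniature. The table below is a placeholder with 196 rows of random values, matching only the size of `players.csv`. The sketch checks the resulting set sizes and that a fixed `random_state` yields the same split on every run.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder table with 196 rows, the same size as players.csv
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(8, 60, size=196),
    "played_hours": rng.exponential(5.0, size=196),
})

# Two splits with the same seed
train_a, test_a = train_test_split(toy, test_size=0.25, random_state=113)
train_b, test_b = train_test_split(toy, test_size=0.25, random_state=113)

print(len(train_a), len(test_a))            # 147 49
print(train_a.index.equals(train_b.index))  # True
```

With 196 rows, the model is therefore tuned on 147 players and evaluated on 49, so a handful of extreme players in either set can noticeably move the error.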

Categorical features (experience, subscription status, gender) were transformed using **one-hot encoding**, with experience additionally ordinal-encoded (Beginner through Pro), and **age** was standardized using **StandardScaler** so that its larger range does not dominate the distance calculations for KNN.

We constructed a pipeline containing both the preprocessor and a **KNeighborsRegressor**. A grid search was then performed across **k values from 1 to 49** using 3-fold cross-validation, with **RMSE** (scikit-learn's `neg_root_mean_squared_error`) as the scoring metric.
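
For readers unfamiliar with the scoring convention, the sketch below runs a small grid search on synthetic data. The data and the K range are made up for illustration. scikit-learn maximizes scores, so RMSE is reported as a negative number and has to be negated to read it as an error.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic one-feature regression problem (illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = 2.0 * X.ravel() + rng.normal(0, 1.0, size=60)

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": range(1, 11)},  # K = 1 through 10
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

best_k = search.best_params_["n_neighbors"]
best_rmse = -search.best_score_  # flip the sign to get RMSE back

print(best_k, round(best_rmse, 3))
```

The same sign flip appears in the notebook, where `-player_gridsearch.best_score_` gives the cross-validated RMSE.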


# 5) KNN regression modeling

The grid search identified **k = 38** as the optimal parameter.

We trained the final model on the training set and used it to predict played hours on the test set.  
With **k = 38**, the model achieved a test-set **RMSPE of 25.94 hours**, close to the cross-validated RMSE of 25.90 hours.
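
For reference, the error reported here has the usual root-mean-squared form. The notation is ours rather than the notebook's: $y_i$ is the actual playtime of test player $i$, $\hat{y}_i$ is the model's prediction for that player, and $n$ is the number of players in the test set.

$$
\text{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Because the errors are squared before averaging, a few players with very large playtime can account for most of this value.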


# 6) Visualizing results

We visualized model performance using a scatterplot comparing **predicted vs. actual** hours played.

These plots revealed a systematic pattern:

- **Low-hour users (especially those with 0 hours):**  
  Predictions were often **much higher** than the true values, sometimes as high as 16 hours.  
  This shows a **positive bias** (overestimation) for non-players and low-hour individuals.

- **High-hour users (20+ hours):**  
  Predictions were **consistently lower** than the true values, indicating a **negative bias** (underestimation).

When focusing on the 0–5 hour range, the model **underestimated nearly all data points** except those with exactly 0 hours, which were frequently **overestimated**. This mirrors the strong skew in the dataset and the averaging behavior of KNN.

# Discussion

Our analysis shows that predicting total hours played using experience, subscription status, gender, and age is extremely challenging. The dataset is dominated by players who recorded very low playtime, most often **0 hours**, while a small number of players logged extremely high hours, from about **30 hours up to a maximum of 223.1 hours**. This created a highly skewed distribution, and this unevenness is reflected throughout our visualizations: across all variables, most players cluster near zero, and the few extreme cases stretch the axes, making patterns among the majority difficult to see.

When zooming into the range below the 90th percentile of playtime (2.8 hours), it became clear that low-hour behavior is extremely consistent across experience levels, subscription status, gender, and age. Even with this closer focus, there were only small differences between groups, suggesting that these features provide limited information about why some players engage much more than others.

The KNN model we applied reflects these patterns. With the selected parameter **k = 38**, the test-set RMSPE was **25.94 hours**, meaning the model’s typical error is about 26 hours. For comparison, the standard deviation of the testing data is approximately 27 hours, which is roughly the error we would get by predicting the average playtime for every player, so the model is only slightly better than that naive baseline. In other words, the model still struggles substantially to approximate true playtime.
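
The link between the standard deviation and a naive baseline can be checked directly. In this sketch the playtimes are hypothetical. It shows that predicting the mean for every player gives an RMSE equal to the (population) standard deviation of the targets.

```python
import numpy as np

# Hypothetical test-set playtimes (hours); not the real data
y_test = np.array([0.0, 0.0, 0.1, 0.3, 2.3, 30.3])

# Baseline: predict the same value (the mean) for every player
baseline_pred = np.full_like(y_test, y_test.mean())
baseline_rmse = np.sqrt(np.mean((y_test - baseline_pred) ** 2))

# np.std uses the population formula (ddof=0) by default
print(baseline_rmse, y_test.std())
```

In practice a baseline would use the training-set mean rather than the test-set mean, so the two numbers would differ slightly, but the standard deviation remains a useful yardstick for judging an RMSE.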

This result is expected with such a highly skewed dataset. Since most players have close to 0 hours (the median is 0.1), the nearest neighbors for the majority of users are also near zero. As a result:

- **KNN overpredicts playtime for low-hour users** because even a few moderate-hour neighbors pull the average upward.  
- **KNN underpredicts playtime for high-hour users** because they have few similar neighbors, and the averaging process pulls their predictions down.  

This bias pattern is visible in the predicted–actual scatterplots: low-hour users are frequently predicted to have several hours when they played none, while high-hour users are consistently underestimated. The model’s predictions cluster poorly around the diagonal, reinforcing the overall inaccuracy.

Although we also considered linear regression, we judged KNN to be the better fit given the extreme skew. A linear model would be heavily influenced by the distant outliers, pulling the regression line upward and causing **major overprediction for 0-hour players** and **major underprediction for high-hour players**. KNN, by adapting locally, avoids some of this distortion, even though the overall accuracy remains low.
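
The pull that a single extreme value exerts on a least-squares line can be seen in a small made-up example. The five points below are hypothetical, and `np.polyfit` stands in for a linear regression. No linear model was fitted to the real data in this notebook.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Case 1: everyone plays 0 hours. Case 2: the last player logs 100 hours.
hours_flat = np.zeros(5)
hours_outlier = np.array([0.0, 0.0, 0.0, 0.0, 100.0])

# Fit a straight line (degree-1 polynomial) by least squares
slope_flat, intercept_flat = np.polyfit(x, hours_flat, 1)
slope_out, intercept_out = np.polyfit(x, hours_outlier, 1)

# Predictions of the outlier-affected line at each x
pred = slope_out * x + intercept_out
print(np.round(pred, 2))
```

With the outlier included, the fitted line predicts about 40 hours for a player who actually logged 0 and about 60 hours for the player who logged 100, which is the over- and underprediction pattern described above.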

These findings indicate that the demographic and account-based features used—age, gender, subscription status, and experience level—are insufficient for predicting individual playtime. While the model cannot accurately identify who will be highly engaged, the analysis does reveal broader behavioral patterns: most users engage very little, and a few extreme outliers contribute disproportionately to total playtime. Recognizing this skew is important for understanding overall engagement trends.

### Future directions

Our results point to several directions for future research:

- **Remove extreme outliers** or analyze the model separately for low-, moderate-, and high-hour groups.  
- **Explore models more robust to skew**, such as random forests or gradient boosting.  
- **Incorporate behavioral features**, such as session frequency, in-game actions, or social interactions, which may provide stronger predictive power.  
- **Test how performance changes when zero-hour entries are excluded**, or when high-hour individuals are treated differently in the modeling process.  

Addressing these possibilities could help identify the factors that truly drive engagement and improve our ability to predict playtime with greater accuracy.
