# Introduction

Video games are a popular form of entertainment, but not everyone plays in the same way. Some players spend many hours exploring and interacting with the game, while others play only briefly. Understanding what influences how long people play can help researchers learn about player behavior and engagement.

At the University of British Columbia, a research team is studying this question using a Minecraft server. On the server, players can explore freely and interact with the environment, and the researchers record their actions. Along with playtime, the team collects information about each player, such as their experience level, subscription status, age, and gender.

The goal of this project is to see whether these characteristics can help **predict how long a player will spend in the game**. Specifically, we ask: **Can we estimate the total number of hours a user will play (played_hours) based on their experience, subscription status, gender, and age?** By examining these relationships, we hope to identify factors that are linked to higher or lower engagement.


### Dataset 1 (`players.csv`)

This dataset contains information about **196 individuals**, each representing a player profile. It includes **9 variables** capturing demographic information, subscription status, and engagement metrics. Below is a detailed description of each variable:

| Variable Name     | Dtype    | Description                | Summary Statistic   |
|-------------------|----------|----------------------------|---------------------|
| experience        | object   | Player expertise level     | –                   |
| subscribe         | bool     | Subscription status        | –                   |
| hashedEmail       | object   | Hashed email identifier    | –                   |
| played_hours      | float64  | Total playtime (hours)     | see `summary_stats` |
| name              | object   | Participant name           | –                   |
| gender            | object   | Gender                     | –                   |
| age               | int64    | Player age (years)         | see `summary_stats` |
| individualId      | float64  | Empty (all values missing) | –                   |
| organizationName  | float64  | Empty (all values missing) | –                   |


**Note:** The `individualId` and `organizationName` columns are entirely empty and provide no usable information, so they should be removed during preprocessing. These fields may never have been collected, may not have applied to these players, or may have been lost during data processing.
 
---

### Dataset Issues and Considerations

A review of the dataset revealed several issues that affect analysis and modeling. Two columns, `individualId` and `organizationName`, contain no usable data and appear to be placeholders. Fields like `name` and `hashedEmail` are identifiers and provide no predictive value, so they are excluded to avoid distorting distance calculations.

Some variables require preprocessing. `experience` is ordinal (Beginner to Pro), while non-ordinal fields like `gender` need one-hot encoding for distance-based models. Self-reported fields, such as age and experience, may contain inaccuracies that should be considered when interpreting results.
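To make the difference between the two encodings concrete, here is a minimal pure-Python sketch. The three sample players are hypothetical, and the helper names `ordinal_encode` and `one_hot_encode` are illustrative only; the analysis itself relies on scikit-learn's encoders. The category order follows the one used later in the preprocessing step.

```python
# Category order for experience, lowest to highest.
EXPERIENCE_ORDER = ["Beginner", "Amateur", "Regular", "Veteran", "Pro"]
# Gender has no natural order, so each level gets its own 0/1 column.
GENDER_LEVELS = ["Female", "Male", "Other"]


def ordinal_encode(level, order=EXPERIENCE_ORDER):
    """Map an ordered category to its rank (0 = lowest)."""
    return order.index(level)


def one_hot_encode(level, levels=GENDER_LEVELS):
    """Map an unordered category to a 0/1 indicator vector."""
    return [1 if level == candidate else 0 for candidate in levels]


# Hypothetical (experience, gender) pairs, not rows from players.csv.
players = [("Pro", "Male"), ("Amateur", "Female"), ("Veteran", "Other")]

# One ordinal column for experience, then three indicator columns for gender.
encoded = [[ordinal_encode(exp)] + one_hot_encode(gen) for exp, gen in players]
for row in encoded:
    print(row)
```

The ordinal column keeps "Pro" numerically farther from "Beginner" than from "Veteran", whereas the indicator columns treat every pair of gender levels as equally distant.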

### Dataset Preparation and Considerations

The dataset contains several issues that affect KNN modeling. Two columns, `individualId` and `organizationName`, are empty placeholders, while `name` and `hashedEmail` are identifiers with no predictive value, so they are excluded. Categorical variables need proper encoding: `experience` is ordinal (Beginner to Pro), while fields like `gender` require one-hot encoding. Numerical features such as `age` should be standardized to prevent any single variable from dominating distance calculations; `played_hours` is the target, so it does not enter the distance calculation. Outliers in age or playtime can strongly influence neighbor selection, and uneven distributions of experience levels may reduce the model’s ability to find representative neighbors for some groups. Self-reported fields may also introduce inaccuracies. Addressing these considerations helps KNN operate on a clean, meaningful distance space, improving the reliability of predictions.
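The effect of standardization on distances can be sketched with a few lines of standard-library Python. The four players below are hypothetical, and standardizing the 0/1 subscription column is done here only for illustration (the pipeline in this report scales `age` alone).

```python
import math
import statistics

# Hypothetical players: (age in years, subscribed as 0/1).
players = [(17, 1), (22, 0), (99, 1), (19, 0)]


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(rows):
    """Rescale each column to mean 0 and population standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


scaled = standardize(players)

# Raw scale: the 5-year age gap contributes 25 to the squared distance,
# against 1 for the subscription difference.
raw_gap = euclidean(players[0], players[1])
# Standardized scale: each column is measured in units of its own spread.
scaled_gap = euclidean(scaled[0], scaled[1])
print(round(raw_gap, 3), round(scaled_gap, 3))
```

Note that the single age of 99 inflates the standard deviation of the age column, which shrinks every other age difference after scaling. This is one way an outlier can distort neighbor selection.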

# Methods

1) Loading the data

We imported Python libraries: pandas, altair, sklearn, and numpy for data manipulation, modeling, and visualization. The dataset was loaded into a DataFrame using `pd.read_csv` with the data URL.

2) Cleaning and preprocessing data

Irrelevant columns (`hashedEmail`, `individualId`, `organizationName`, `name`) were removed. The `subscribe` column was recoded into "Subscribed" and "Not Subscribed," and `gender` was grouped into Male, Female, and Other (Non-binary, Agender, Two-Spirited, Prefer not to say, etc.). The dataset was split into training (75%) and testing (25%) sets using `train_test_split` with a fixed seed (113) before exploratory analysis.

3) EDA

We examined relationships between `played_hours` and the predictors using histograms for `experience`, `subscribe`, and `gender`, and a scatterplot for `age`. Distributions were highly skewed: most players logged very few hours, while a small group had very high hours. To better visualize low-hour players, the plots were redrawn for the bottom 90% of `played_hours`, revealing many players with 0–0.5 hours.

4) Preparing data for modeling

Features included `experience`, `subscribe`, `gender`, and `age`, with `played_hours` as the target. `gender` and `subscribe` were one-hot encoded, `experience` was ordinally encoded (it is also included in the one-hot step in the code), and `age` was standardized for KNN regression. A pipeline combined preprocessing and `KNeighborsRegressor`. Values of `k` from 1 to 74 were searched via 3-fold cross-validation, using RMSE to select the best model.

5) Modeling with KNN Regression

The optimal `k` was 38 (RMSE = 25.90), but RMSE improved only slightly after `k = 12` (RMSE = 27.06). To simplify the model and reduce underfitting, `k = 12` was chosen. The final model trained on the training set yielded a test RMSE of 23.96.
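The choice of `k = 12` was made by inspecting the RMSE curve. One way such a choice could be made mechanical is a "smallest `k` within a tolerance of the best RMSE" rule, sketched below. The function name `smallest_k_within`, the tolerance of 1.5, and every RMSE value other than those for `k = 12` and `k = 38` are assumptions made for illustration.

```python
def smallest_k_within(rmse_by_k, tolerance):
    """Return the smallest k whose RMSE is within `tolerance` of the minimum."""
    best_rmse = min(rmse_by_k.values())
    return min(k for k, rmse in rmse_by_k.items() if rmse <= best_rmse + tolerance)


# Only the k = 12 and k = 38 values come from the results reported here;
# the other entries are made-up placeholders.
rmse_by_k = {1: 35.0, 5: 30.0, 12: 27.06, 25: 26.5, 38: 25.90, 50: 26.2}

chosen_k = smallest_k_within(rmse_by_k, tolerance=1.5)
print(chosen_k)  # 12
```

With a tolerance of zero the rule falls back to the grid-search optimum (`k = 38` for these values), so the tolerance expresses how much cross-validated error one is willing to trade for a more local model.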

6) Visualizing results

Scatterplots of predicted vs. actual hours highlighted model performance. High-hour players (20+ hours) were underestimated, with no predictions above 40 hours. Low-hour players (0–5 hours) were often overestimated. Overall, points were scattered around the predicted = actual diagonal, showing limited accuracy and correlation.

In [1]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.metrics import mean_squared_error
from sklearn.compose import make_column_transformer

In [2]:
url='https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz'
player_data=pd.read_csv(url)
player_data

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [3]:
summary_stats = player_data[["played_hours","age"]].describe()
summary_stats

Unnamed: 0,played_hours,age
count,196.0,196.0
mean,5.845918,21.280612
std,28.357343,9.706346
min,0.0,8.0
25%,0.0,17.0
50%,0.1,19.0
75%,0.6,22.0
max,223.1,99.0


The summary statistics show that most players in this dataset have extremely low playtime: the median is only 0.1 hours, and even the 75th percentile is just 0.6 hours. This means the majority of observations cluster very close to zero, while a few players have much higher values, creating a long right tail (the maximum is 223.1 hours). Age is far less skewed: the middle half of players are between 17 and 22 years old, although the maximum of 99 may reflect an inaccurate self-report.

For KNN regression, this imbalance is important: since the algorithm predicts based on nearby points, the dominance of near-zero playtime values means many players will have neighbors who also played almost no hours. As a result, KNN may struggle to learn meaningful differences unless we handle outliers, scale features, and consider whether the target variable is too skewed for distance-based prediction.
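One common way to check whether a skewed target is workable is to look at it on a log scale. The sketch below applies `log1p` (that is, log(1 + hours), which is defined at zero) to a hypothetical sample shaped like the summary above. This transform is not part of the analysis in this report; it is shown only as an assumption-laden illustration.

```python
import math
import statistics

# Hypothetical playtimes (hours): mostly near zero, with a long right tail.
hours = [0.0, 0.0, 0.1, 0.1, 0.6, 2.3, 30.3, 223.1]

log_hours = [math.log1p(h) for h in hours]     # log(1 + h), safe at h = 0
restored = [math.expm1(v) for v in log_hours]  # inverse transform back to hours

raw_gap = statistics.fmean(hours) - statistics.median(hours)
log_gap = statistics.fmean(log_hours) - statistics.median(log_hours)

print("raw   mean - median:", round(raw_gap, 3))
print("log1p mean - median:", round(log_gap, 3))
```

On the transformed scale the mean sits much closer to the median, so a neighbor average would be less dominated by a single extreme player. Predictions made on that scale would need `expm1` to be converted back to hours.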

In [4]:
player_relevant_data=player_data.drop(columns=['hashedEmail', 'individualId', 'organizationName', 'name'])
player_relevant_data

Unnamed: 0,experience,subscribe,played_hours,gender,age
0,Pro,True,30.3,Male,9
1,Veteran,True,3.8,Male,17
2,Veteran,False,0.0,Male,17
3,Amateur,True,0.7,Female,21
4,Regular,True,0.1,Male,21
...,...,...,...,...,...
191,Amateur,True,0.0,Female,17
192,Veteran,False,0.3,Male,22
193,Amateur,False,0.0,Prefer not to say,17
194,Amateur,False,2.3,Male,17


# Exploratory Data Analysis

In [5]:
# Map subscription status to consistent labels
player_relevant_data['subscribe'] = (
    player_relevant_data['subscribe']
    .map({True: 'Subscribed',
          False: 'Not Subscribed',
          'Subscribed': 'Subscribed',
          'Not Subscribed': 'Not Subscribed'})
)
    
# Map genders to Male / Female / Other
gender_map = {
    'Male': 'Male',
    'Female': 'Female',
    'Prefer not to say': 'Other',
    'Non-binary': 'Other',
    'Agender': 'Other',
    'Two-Spirited': 'Other',
    'Other': 'Other'
}
player_relevant_data['gender'] = player_relevant_data['gender'].map(gender_map)

from sklearn.model_selection import train_test_split

features = ['experience', 'subscribe', 'gender', 'age']
target = 'played_hours'

player_training, player_testing = train_test_split(
    player_relevant_data,
    test_size=0.25,
    random_state=113
)

X_train = player_training[features]
y_train = player_training[target]

X_test = player_testing[features]
y_test = player_testing[target]

In [6]:
# Experience Chart 
chart_exp = (
    alt.Chart(player_training)
    .mark_bar()
    .encode(
        x=alt.X('played_hours:Q', bin=alt.Bin(maxbins=20), title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('experience:N', title='Experience')
    )
    .facet(
        column=alt.Column('experience:N', title='Experience')
    )
    .resolve_scale(color='shared')
    .properties(
        title=alt.TitleParams(
            text='Figure 1: Played Hours Distribution by Experience',
            fontSize=20,
            fontWeight='bold',
            anchor='middle'   # centers the title
        )
    )
)

chart_exp

In [7]:
#Subscription Chart
chart_sub = (
    alt.Chart(player_training)
    .mark_bar()
    .encode(
        x=alt.X('played_hours:Q', bin=alt.Bin(maxbins=20), title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('subscribe:N', title='Subscription Status')
    )
    .facet(
        column=alt.Column('subscribe:N', title='Subscription Status')
    )
    .resolve_scale(color='shared')
    .properties(
        title=alt.TitleParams(
            text='Figure 2: Played Hours Distribution by Subscription Status',
            fontSize=20,
            fontWeight='bold',
            anchor='middle'
        )
    )
)

chart_sub

In [8]:
#Gender Chart
chart_gen = (
    alt.Chart(player_training)
    .mark_bar()
    .encode(
        x=alt.X('played_hours:Q', bin=alt.Bin(maxbins=20), title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('gender:N', title='Gender')
    )
    .facet(
        column=alt.Column('gender:N', title='Gender')
    )
    .resolve_scale(color='shared')
    .properties(
        title=alt.TitleParams(
            text='Figure 3: Played Hours Distribution by Gender',
            fontSize=20,
            fontWeight='bold',
            anchor='middle'
        )
    )
)

chart_gen

In [9]:
# Age Chart
chart_age = (
    alt.Chart(player_training)
    .mark_circle(size=60, opacity=0.7)
    .encode(
        x=alt.X('age:Q', title='Age'),
        y=alt.Y('played_hours:Q', title='Played Hours'),
    )
    .properties(
        width=600,
        height=400,
        title=alt.TitleParams(
            text='Figure 4: Scatter Plot of Age vs Played Hours',
            fontSize=20,
            fontWeight='bold',
            anchor='middle'
        )
    )
)

chart_age

## Experience
Across all experience levels (Amateur, Beginner, Pro, Regular, and Veteran), most players log very few hours, while only a handful record extremely high playtime. Although we might expect more experienced players to play longer, the distributions are dominated by a **spike at 0–5 hours** and a **long tail extending past 200 hours**, making differences between groups hard to see.

## Subscription Status
Both Subscribed and Not Subscribed players show the same pattern: most cluster near zero, with only a few extreme high-hour players stretching the x-axis. Subscribed players have slightly more high-hour individuals, but the overall distribution is similar, making it difficult to compare within the range where most data lies.

## Gender
Female, Male, and Other players all show very low playtime for most individuals, with a few long right-tail outliers. These outliers compress the main distribution, masking any potential differences between genders.

## Age
In the scatterplot of Age vs. Played Hours, played hours are heavily right-skewed. Most players, regardless of age, are near zero hours, with a few extreme cases stretching the axis. This makes it hard to see any relationship between age and low-hour playtime.

## Why We Needed to Zoom In
Extreme high-hour players stretch the x-axes to **0–240+ hours**, while nearly all observations are **0–5 hours**. This causes the main spike to dominate each plot. Zooming in to the **90th percentile of played hours** allows us to clearly examine how **experience, subscription, gender, and age relate to the majority of playtime**.
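The zoom threshold is simply a quantile of `played_hours`. As a rough sketch of what that computation involves, the function below implements a quantile with linear interpolation between order statistics (the same interpolation pandas uses by default). The sample values and the name `quantile_linear` are hypothetical.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)          # fractional index into the sorted data
    lower = int(position)                      # index just below the position
    upper = min(lower + 1, len(ordered) - 1)   # index just above, capped at the end
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Hypothetical playtimes (hours), not the actual dataset.
hours = [0.0, 0.0, 0.1, 0.1, 0.3, 0.6, 0.7, 2.3, 3.8, 30.3, 223.1]

median_hours = quantile_linear(hours, 0.5)
cutoff = quantile_linear(hours, 0.9)
print(median_hours, cutoff)
```

Restricting an axis to `[0, cutoff]` hides the points above the cutoff from view without deleting them from the data.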

In [10]:
bounds = player_data["played_hours"].quantile([0.9])
bounds

0.9    2.8
Name: played_hours, dtype: float64

In [11]:
# Zoomed Experience Chart (Figure 5)
chart_exp_zoom = (
    alt.Chart(player_training)
    .mark_bar(clip=True)
    .encode(
        x=alt.X('played_hours:Q',
                bin=alt.Bin(maxbins=500),
                scale=alt.Scale(domain=[0, 2.8]),
                title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('experience:N', title='Experience')
    )
    .facet(column=alt.Column('experience:N', title='Experience'))
    .properties(
        title=alt.TitleParams(
            text='Figure 5: Played Hours Distribution by Experience (0 to 2.8 hours)',
            fontSize=20,
            fontWeight='bold',
            anchor='middle'
        )
    )
)
chart_exp_zoom

In [12]:
# Zoomed Subscription Chart (Figure 6)
chart_subscribe_zoom = (
    alt.Chart(player_training)
    .mark_bar(clip=True)
    .encode(
        x=alt.X('played_hours:Q',
                bin=alt.Bin(maxbins=500),
                scale=alt.Scale(domain=[0, 2.8]),
                title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('subscribe:N', title='Subscription Status')
    )
    .facet(column=alt.Column('subscribe:N', title='Subscription Status'))
    .properties(
        title=alt.TitleParams(
            text='Figure 6: Played Hours Distribution by Subscription Status (0 to 2.8 hours)',
            fontSize=20,
            fontWeight='bold',
            anchor='middle'
        )
    )
)
chart_subscribe_zoom

In [13]:
# Zoomed Gender Chart (Figure 7)
chart_gender_zoom = (
    alt.Chart(player_training)
    .mark_bar(clip=True)
    .encode(
        x=alt.X('played_hours:Q',
                bin=alt.Bin(maxbins=500),
                scale=alt.Scale(domain=[0, 2.8]),
                title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('gender:N', title='Gender')
    )
    .facet(column=alt.Column('gender:N', title='Gender'))
    .properties(
        title=alt.TitleParams(
            text='Figure 7: Played Hours Distribution by Gender (0 to 2.8 hours)',
            fontSize=20,
            fontWeight='bold',
            anchor='middle'
        )
    )
)
chart_gender_zoom

In [14]:
# Age vs Played-Hours Scatter Plot with centered title
chart_age_zoom = (
    alt.Chart(player_training)
    .mark_point(clip=True, opacity=0.5)
    .encode(
        x=alt.X(
            'played_hours:Q',
            title='Played Hours (hrs)',
            scale=alt.Scale(domain=[0, 2.8])
        ),
        y=alt.Y('age:Q', title='Age of Players (yrs)')
    )
    .properties(
        title=alt.TitleParams(
            text='Figure 8: Age vs Played Hours (0 to 2.8 hours)',
            fontSize=20,
            fontWeight='bold',
            anchor='middle'  # centers the title
        )
    )
)

chart_age_zoom

## Zoomed-In Analysis (0 to 2.8 Hours)

After restricting the x-axis to the 90th percentile of played hours (about **2.8 hours**), the distributions become much easier to interpret. Clipping the extreme outliers from view allows the patterns in low and moderate playtime, the range where almost all players fall, to become visible across experience level, subscription status, gender, and age.

### Experience
Once the scale is narrowed, every experience group clusters heavily below **1 hour**, with Beginners and Amateurs showing the strongest concentration near 0–0.5 hours. Regular and Veteran players spread out slightly more, yet still remain mostly under 1 hour. Even with this clearer view, the groups do not separate meaningfully, suggesting that low playtime is common regardless of skill level.

### Subscription Status
Both subscribed and non-subscribed users are tightly concentrated around **0 hours**, but the zoomed plots reveal subtle differences. Subscribed users show a slightly broader spread, with more players appearing between roughly 0.5 and 1.5 hours. Non-subscribed users remain even more compressed near zero. This pattern hints at a modest subscription effect, though it still does not strongly differentiate playtime within this limited range. It is also worth noting that more subscribed than non-subscribed players contribute data overall.

### Gender
With the outliers clipped from view, the gender distributions become more comparable. Male players, who make up most of the dataset, form a dense cluster under 0.5 hours. Female players follow an almost identical pattern but with fewer individuals overall, and other gender identities show the same shape on an even smaller scale. The zoomed view suggests that gender differences are minimal and that extremely low playtime is typical for all groups.

### Age
The zoomed Age vs. Played Hours scatterplot shows only observations under **2.8 hours**, which covers nearly every player regardless of age. No clear relationship or trend is visible; younger and older players alike cluster near zero. Clipping the outliers makes the uniformity across ages much easier to see.

In [25]:
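# Preprocessing: standardize age and encode the categorical columns.
# Note: "experience" is passed to both the one-hot and the ordinal encoder,
# so it contributes two sets of columns to the distance calculation.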
player_preprocessor = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(sparse_output=False),["gender", "subscribe", "experience"]),
    (OrdinalEncoder(categories=[["Beginner", "Amateur", "Regular", "Veteran", "Pro"]]), ["experience"]),
    verbose_feature_names_out=False,
    remainder="passthrough"
)

#create pipeline
player_pipe = make_pipeline(
    player_preprocessor,
    KNeighborsRegressor()
)

#finding optimal K
param_grid = {
    "kneighborsregressor__n_neighbors": range(1,75)
}

player_gridsearch = GridSearchCV(
    estimator=player_pipe,
    param_grid=param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1
)

# Fit and extract results
player_results = (
    pd.DataFrame(player_gridsearch.fit(X_train, y_train).cv_results_)
)

# Best K and its RMSE
player_best_K = player_gridsearch.best_params_
player_best_RMSE = -player_gridsearch.best_score_

player_best_K, player_best_RMSE

({'kneighborsregressor__n_neighbors': 38}, np.float64(25.898794713363447))

In [26]:
rmse_k12 = (player_results
            .loc[player_results['param_kneighborsregressor__n_neighbors'] == 12,
            'mean_test_score']
            .iloc[0])

rmse_k12 = -rmse_k12   # negate because GridSearchCV uses NEGATIVE RMSE
rmse_k12

np.float64(27.058669299928493)

In [27]:
player_results=player_results.assign(RMSE= -player_results["mean_test_score"])

In [28]:
#Best K graph
Optimal_K_Chart = (
    alt.Chart(player_results)
    .mark_line()
    .encode(
        x=alt.X('param_kneighborsregressor__n_neighbors', title='K Value'),
        y=alt.Y('RMSE', title='Root Mean Squared Error')
    )
    .properties(
    title="Figure 9: RMSE vs K Value",
    width=300,
    height=300
)
)

Optimal_K_Chart

In [18]:
player_gridsearch.best_params_

{'kneighborsregressor__n_neighbors': 38}

In [32]:
# FINAL MODEL TESTING USING K = 12 (although 38 was the 'best_params', the RMSE decreased only slightly beyond k = 12 and we wanted to avoid underfitting)

final_knn_model = make_pipeline(
    player_preprocessor,
    KNeighborsRegressor(n_neighbors=12)
)

final_knn_model.fit(X_train, y_train)

y_pred = final_knn_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

rmse

np.float64(23.96461449754352)

In [33]:
final_knn_model

In [34]:
#Predicted vs Actual Plot
predicted_vs_actual = pd.DataFrame({
    "actual": y_test,
    "predicted": y_pred
})

predicted_vs_actual_chart=alt.Chart(predicted_vs_actual).mark_circle(opacity=0.5).encode(
    x=alt.X("actual",title='Actual Time Played (hours)'),
    y=alt.Y("predicted", title="Predicted Time Played (hours)")
).properties(
    title="Figure 10: Predicted vs Actual Played Hours",
    width=300,
    height=300
)
predicted_actual = alt.Chart(pd.DataFrame({"x": [predicted_vs_actual["actual"].min(), predicted_vs_actual["actual"].max()]})).mark_line(color="black", opacity=0.2).encode(
    x="x",
    y="x")

# Layer the scatter and line
predicted_vs_actual_chart + predicted_actual

In [35]:
restricted_predicted_vs_actual_chart = (
    alt.Chart(predicted_vs_actual)
    .transform_filter("(datum.actual <= 5) && (datum.predicted <= 5)")
    .mark_circle(opacity=0.5)
    .encode(
        x=alt.X("actual:Q", title='Actual Time Played (hours)',
                scale=alt.Scale(domain=[0,5])),
        y=alt.Y("predicted:Q", title="Predicted Time Played (hours)",
                scale=alt.Scale(domain=[0,5]))
    )
    .properties(
        title="Figure 11: Predicted vs Actual Played Hours (Restricted Domain)",
        width=300,
        height=300
    )
)
line_data = pd.DataFrame({"x": [0, 5], "y": [0, 5]})
line = alt.Chart(line_data).mark_line(color="black", opacity=0.2).encode(
    x="x:Q",
    y="y:Q"
)
restricted_predicted_vs_actual_chart+line

# Interpretation of the Predicted vs Actual Plots

These visualizations (zoomed and unzoomed actual vs predicted hours) reveal a clear **bias pattern**: the model **overpredicts** low-hour users and **underpredicts** high-hour users.

Each point represents a user (**x-axis**: actual hours, **y-axis**: predicted hours). Users with **0 hours** are sometimes predicted as high as **16 hours**, showing **positive bias**, while high-hour users (e.g., ~27 hours) are predicted much lower, showing **negative bias**. The dominant pattern is **overestimation** in the common **0–5 hour range**, where most users are.

KNN predicts based on the **average of nearest neighbors**. Low-hour users often have moderate-hour neighbors, causing **overprediction**, while extreme high-hour users have few similar neighbors, leading to **underprediction**.
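The averaging mechanism described above can be reproduced with a tiny one-feature KNN regressor. The training points and the helper `knn_predict` are hypothetical and stand in for the scikit-learn model used in this report.

```python
def knn_predict(train, query_age, k):
    """Predict hours as the mean of the k training points nearest in age."""
    nearest = sorted(train, key=lambda row: abs(row[0] - query_age))[:k]
    return sum(hours for _, hours in nearest) / k


# Hypothetical (age, played_hours) training points: five near-zero players
# and one heavy player who happens to be the same age as the query.
train = [(17, 0.0), (17, 0.1), (18, 0.0), (18, 48.0), (19, 0.3), (30, 0.2)]

prediction = knn_predict(train, query_age=18, k=4)
print(round(prediction, 3))  # 12.025
```

Any 18-year-old receives the same prediction of about 12 hours here: far too high for the typical player with zero hours, and far too low for the one player with 48 hours. This is the same over- and underprediction pattern seen in the plots.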

**Research question:** *Can we estimate total hours played (`played_hours`) from experience, subscription status, gender, and age?* The observed bias indicates the model is **systematically inaccurate** across playtime. Predictions scatter widely around the diagonal, showing **substantial error** even in common ranges. With the current KNN model and features, **predictions are not reliably accurate**, primarily due to the skewed distribution of playtime and the averaging nature of KNN. Thus, the answer to our question with the methods we have tested is **No**; using experience, subscription status, gender, and age alone, we cannot reliably estimate total hours played.

# Discussion

Predicting total hours played using experience, subscription status, gender, and age is challenging. Most players recorded **very low playtime**, often **0 hours**, while a few logged extremely high hours (**from about 30 to over 200 hours**), creating a **highly skewed distribution**. Visualizations show most players clustered near zero, with a few extreme cases stretching the axes, obscuring patterns.

Focusing on playtime up to the **90th percentile** (2.8 hours), low-hour behavior is consistent across experience, subscription, gender, and age. Differences between groups are small, suggesting these features provide **limited information** about why some players play much more than others.

The KNN model reflects this pattern. Cross-validation selected **k = 38** (RMSE = **25.90**), but the RMSE at **k = 12** was only slightly higher (**27.06**), so we chose **k = 12** to reduce underfitting and better capture extreme high-hour users. On the test set, the final model's RMSPE was **23.96**, close to the standard deviation of the test data (~27 hours). Typical prediction errors therefore remain large, showing the model struggles to predict playtime accurately.
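Comparing the error with the standard deviation amounts to comparing the model with a baseline that always predicts the mean: the RMSE of that baseline equals the (population) standard deviation of the values it is scored on. A minimal check, using hypothetical playtimes:

```python
import math
import statistics


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


# Hypothetical test-set playtimes (hours).
actual = [0.0, 0.1, 0.0, 2.3, 30.3, 0.6]

# Baseline: predict the mean of the scored values for every player.
mean_hours = statistics.fmean(actual)
baseline_rmse = rmse(actual, [mean_hours] * len(actual))

print(round(baseline_rmse, 3), round(statistics.pstdev(actual), 3))
```

A model whose error is close to this baseline adds little beyond knowing the average playtime.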

Because most players have close to 0 hours:  

- **Low-hour users are overpredicted**, as moderate-hour neighbors raise the average.  
- **High-hour users are underpredicted**, due to few similar neighbors.

Predicted–actual scatterplots highlight this bias: low-hour users are predicted a few hours above zero, while high-hour users are underestimated. Points scatter widely around the diagonal, showing overall inaccuracy.

Linear regression performed worse due to extreme outliers. KNN adapts locally, reducing some distortion, though overall accuracy remains low.

These results suggest **age, gender, subscription status, and experience** are insufficient for predicting individual playtime. Most users engage very little, and a small number of extreme users dominate total hours, reflecting the skew in engagement.

### Future directions

To improve predictions:  

- **Handle extreme outliers** or model low-, moderate-, and high-hour groups separately.  
- **Include behavioral features**, such as session frequency, in-game actions, or social interactions.  
- **Test effects** of excluding zero-hour players or modeling high-hour users differently.  

These steps could better reveal what drives engagement and improve playtime predictions.