### Dataset 1 (`players.csv`)

This dataset contains **196 observations** (rows), each a participant profile, and **9 variables** describing demographics and engagement metrics.

| Variable Name     | Dtype    | Description              | Summary Statistic |
|-------------------|----------|---------------------------|-------------------|
| experience        | object   | Player expertise level    | – |
| subscribe         | bool     | Subscription status       | – |
| hashedEmail       | object   | Encrypted email ID        | – |
| played_hours      | float64  | Total playtime (hours)    | mean 5.85, median 0.10, max 223.10 (see `summary_stats`) |
| name              | object   | Participant name          | – |
| gender            | object   | Gender                    | – |
| age               | int64    | Player age (years)        | mean 21.28, median 19, range 8 to 99 (see `summary_stats`) |
| individualId      | float64  | Empty column (no values)  | – |
| organizationName  | float64  | Empty column (no values)  | – |

**Note:** The `individualId` and `organizationName` columns are completely empty.  
These fields may not have been collected, may not apply to these participants, or may have been lost during data processing.

---

#### Issues & Considerations
- Two columns contain no data.  
- Possible duplicates by name/email hash.  
- Potential age or entry errors (e.g., `age = 99` seems unrealistic).  
- Self-reported data may include inaccuracies or bias.
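
The first three issues above can be checked directly with pandas. The sketch below uses a small made-up DataFrame rather than `players.csv`, and the plausible age range of 5 to 90 is an assumption chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for players.csv (values are illustrative, not real data)
toy = pd.DataFrame({
    "hashedEmail": ["a1", "b2", "b2", "c3"],
    "name": ["Ann", "Bo", "Bo", "Cy"],
    "age": [17, 21, 21, 99],
    "individualId": [None, None, None, None],
})

# Columns in which every value is missing
empty_cols = toy.columns[toy.isna().all()].tolist()

# Rows that share an email hash with an earlier row
dup_rows = toy[toy.duplicated(subset="hashedEmail", keep="first")]

# Ages outside a plausible range (the 5-90 bounds are an assumption)
odd_ages = toy[~toy["age"].between(5, 90)]

print(empty_cols)
print(len(dup_rows), len(odd_ages))
```

Flagged rows would still need a manual look before being dropped, since an unusual age is not necessarily an error.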


In [53]:
# Required libraries
# Install these if not already available

import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split

In [54]:
url="https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"

In [55]:
players=pd.read_csv(url)

In [56]:
summary_stats = players[["played_hours","age"]].describe()
summary_stats

|       | played_hours | age       |
|-------|--------------|-----------|
| count | 196.0        | 196.0     |
| mean  | 5.845918     | 21.280612 |
| std   | 28.357343    | 9.706346  |
| min   | 0.0          | 8.0       |
| 25%   | 0.0          | 17.0      |
| 50%   | 0.1          | 19.0      |
| 75%   | 0.6          | 22.0      |
| max   | 223.1        | 99.0      |


### **Question 2**: We would like to know which **"kinds" of players** are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

**Subquestion:**  
Can we predict whether an **active** player will spend **above-average time playing** based on their **experience level** and **subscription status**?

To wrangle the data, we will remove rows where `played_hours = 0` to focus on **active players**, then label each remaining player as **above** or **below average** relative to the mean playtime of active players; this label is the **response variable**. As explanatory variables we will use **experience** and **subscription status**, encoded numerically for predictive modeling:

- **`subscribe`**: `1` = subscribed, `2` = not subscribed  
- **`experience`**: `1` = amateur, `2` = beginner, `3` = regular, `4` = veteran, `5` = pro  

Using a **K-Nearest Neighbours (KNN) model**, we aim to determine which player characteristics are most associated with higher engagement.
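
One way the encoding described above could be written is sketched below. The toy rows are made up, and the new column names `experience_code` and `subscribe_code` are hypothetical; only the code values follow the scheme listed above.

```python
import pandas as pd

# Toy rows covering each category (illustrative, not taken from players.csv)
toy = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur", "Regular", "Beginner"],
    "subscribe": [True, True, False, True, False],
})

# Integer codes following the scheme listed above
experience_codes = {"Amateur": 1, "Beginner": 2, "Regular": 3, "Veteran": 4, "Pro": 5}
subscribe_codes = {True: 1, False: 2}

# Hypothetical column names for the encoded features
toy["experience_code"] = toy["experience"].map(experience_codes)
toy["subscribe_code"] = toy["subscribe"].map(subscribe_codes)

print(toy)
```

Any category missing from a mapping would become `NaN`, so it is worth checking for missing codes after the conversion.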

In [57]:
players.head(20)

|   | experience | subscribe | hashedEmail | played_hours | name | gender | age | individualId | organizationName |
|---|------------|-----------|-------------|--------------|------|--------|-----|--------------|------------------|
| 0 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | NaN | NaN |
| 1 | Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397... | 3.8 | Christian | Male | 17 | NaN | NaN |
| 2 | Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3... | 0.0 | Blake | Male | 17 | NaN | NaN |
| 3 | Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f... | 0.7 | Flora | Female | 21 | NaN | NaN |
| 4 | Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb... | 0.1 | Kylie | Male | 21 | NaN | NaN |
| 5 | Amateur | True | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee... | 0.0 | Adrian | Female | 17 | NaN | NaN |
| 6 | Regular | True | 8e594b8953193b26f498db95a508b03c6fe1c24bb5251d... | 0.0 | Luna | Female | 19 | NaN | NaN |
| 7 | Amateur | False | 1d2371d8a35c8831034b25bda8764539ab7db0f6393869... | 0.0 | Emerson | Male | 21 | NaN | NaN |
| 8 | Amateur | True | 8b71f4d66a38389b7528bb38ba6eb71157733df7d17403... | 0.1 | Natalie | Male | 17 | NaN | NaN |
| 9 | Veteran | True | bbe2d83de678f519c4b3daa7265e683b4fe2d814077f90... | 0.0 | Nyla | Female | 22 | NaN | NaN |


In [58]:
# Check data info
players.info()

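# Drop the completely empty columns and any duplicate rows
# (both methods return a new DataFrame, so the result must be assigned)
players = players.dropna(axis=1, how="all").drop_duplicates()

# Keep active players (played_hours > 0) and the relevant columns;
# .copy() avoids a SettingWithCopyWarning when a column is added later
players_tidy = players.loc[players["played_hours"] > 0, ['experience', 'subscribe', 'played_hours']].copy()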

# Display
players_tidy.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   experience        196 non-null    object 
 1   subscribe         196 non-null    bool   
 2   hashedEmail       196 non-null    object 
 3   played_hours      196 non-null    float64
 4   name              196 non-null    object 
 5   gender            196 non-null    object 
 6   age               196 non-null    int64  
 7   individualId      0 non-null      float64
 8   organizationName  0 non-null      float64
dtypes: bool(1), float64(3), int64(1), object(4)
memory usage: 12.6+ KB


|   | experience | subscribe | played_hours |
|---|------------|-----------|--------------|
| 0 | Pro        | True      | 30.3         |
| 1 | Veteran    | True      | 3.8          |
| 3 | Amateur    | True      | 0.7          |
| 4 | Regular    | True      | 0.1          |
| 8 | Amateur    | True      | 0.1          |


In [59]:
# Calculate the mean playtime
mean_playtime = players_tidy["played_hours"].mean()

# Create a new column for engagement level
players_tidy["engagement_level"] = "Below Average"
players_tidy.loc[players_tidy["played_hours"] > mean_playtime, "engagement_level"] = "Above Average"

# Check how many players fall into each category
players_tidy["engagement_level"].value_counts()

engagement_level
Below Average    98
Above Average    13
Name: count, dtype: int64

In [60]:
# Split the data (80% training, 20% testing) before any EDA: the test set must
# not inform model building, and EDA is part of model building.
# Stratify on engagement level so the training and test sets keep the same ratio
# of below-average to above-average players as the full dataset.
players_train, players_test = train_test_split(
    players_tidy,
    test_size=0.2,
    stratify=players_tidy["engagement_level"],
    random_state=42
)

# --- Experience facet ---
chart_exp = (
    alt.Chart(players_train)
    .mark_bar()
    .encode(
        x=alt.X('played_hours:Q', bin=alt.Bin(maxbins=20), title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color='experience:N'
    )
    .properties(title='Played Hours Distribution by Experience')
    .facet(column='experience:N')
)

# --- Subscribe facet ---
chart_sub = (
    alt.Chart(players_train)
    .mark_bar()
    .encode(
        x=alt.X('played_hours:Q', bin=alt.Bin(maxbins=20), title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('subscribe:N').title("Subscribed")
    )
    .properties(title='Played Hours Distribution by Subscription')
    .facet(column='subscribe:N')
)

chart_exp

In [61]:
chart_sub

In [64]:
# --- Experience facet (zoomed) ---
# Trends were hard to see in the full-range plots above, so these versions zoom in on 0-60 hours.
chart_exp_zoom = (
    alt.Chart(players_train)
    .mark_bar(clip=True)
    .encode(
        x=alt.X('played_hours:Q',
                bin=alt.Bin(maxbins=45),
                scale=alt.Scale(domain=[0, 60]),   
                title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color='experience:N'
    )
    .properties(title='Played Hours Distribution by Experience (0–60)')
    .facet(column='experience:N')
)

# --- Subscribe facet (zoomed) ---
chart_sub_zoom = (
    alt.Chart(players_train)
    .mark_bar(clip=True)
    .encode(
        x=alt.X('played_hours:Q',
                bin=alt.Bin(maxbins=45),
                scale=alt.Scale(domain=[0, 60]), 
                title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Number of Players'),
        color=alt.Color('subscribe:N').title("Subscribed")
    )
    .properties(title='Played Hours Distribution by Subscription (0–60)')
    .facet(column='subscribe:N')
)

chart_exp_zoom

In [65]:
chart_sub_zoom

# Insights from Visualizations

*The histograms of the **training data** show a **strong right skew**: most players logged very little playtime, while a few reached roughly **223 hours**. Because these outliers stretch the axis, zooming in to **0–60 hours** allowed clearer group comparisons.*

*Across experience levels, **Amateur players** are the largest group but usually have low playtime, while **Pro players** show a greater share of high-playtime individuals. **Subscribed players** also display a wider spread of hours, suggesting higher engagement.*

*Overall, the classes are **highly imbalanced**: before the split, 98 of the 111 active players fall below the mean playtime and only 13 fall above it. This may cause the **classification model to favor the majority class**, but the plots still suggest that both **experience** and **subscription** are linked to higher engagement.*
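
To make the imbalance concrete, the short calculation below uses the class counts reported earlier (98 below average, 13 above average, before the train/test split) to work out the accuracy of a trivial classifier that always predicts "Below Average". This is only a reference point for judging later results, not part of the model itself.

```python
# Class counts from the value_counts() output (full filtered data, before the split)
below, above = 98, 13
total = below + above

# Accuracy of always predicting the majority class
baseline_accuracy = below / total

# Share of players in the minority (above-average) class
above_share = above / total

print(f"{baseline_accuracy:.3f}")  # 0.883
print(f"{above_share:.3f}")        # 0.117
```

A KNN model therefore needs to beat roughly 88% accuracy, or show clearly better recall on the above-average class, to be more useful than this baseline.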

# Proposed Method: K-Nearest Neighbours (KNN) Classification

We will use a KNN classifier to predict whether a player’s playtime is above or below average based on their experience and subscription status. 

### **Why this method is appropriate**
- It can show which combinations of experience and subscription are linked to higher playtime.  
- It works well with small datasets and does not require the data to follow a specific distribution.  
- It handles non-linear patterns, which is useful since playtime is skewed.

### **Assumptions**
- Players with similar experience and subscription behave similarly.  
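- The encoded variables are standardized so that distances between players are meaningful; the integer codes for `experience` are also treated as evenly spaced.  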

### **Limitations**
- It only tells us above/below average, not the exact hours.  
- Because most players have low playtime, the model may overpredict below-average playtime.

### **Model Comparison and Selection**
- Evaluate KNN using accuracy, precision, and recall.
- Choose the K that best balances correctly identifying above-average players while avoiding overfitting or underfitting.
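
As a rough illustration of the three metrics, the sketch below scores a set of made-up predictions with scikit-learn. The labels are invented for the example, and treating "Above Average" as the positive class is our assumption, chosen because those are the players the recruiting question cares about.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up true labels and predictions (not model output)
y_true = ["Above Average"] * 3 + ["Below Average"] * 7
y_pred = (
    ["Above Average", "Above Average", "Below Average"]  # 2 hits, 1 miss
    + ["Above Average"]                                   # 1 false alarm
    + ["Below Average"] * 6                               # 6 correct rejections
)

accuracy = accuracy_score(y_true, y_pred)
# Precision: of the players predicted above average, how many really are
precision = precision_score(y_true, y_pred, pos_label="Above Average")
# Recall: of the truly above-average players, how many were found
recall = recall_score(y_true, y_pred, pos_label="Above Average")

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

With imbalanced classes, accuracy can stay high even when recall on the above-average class is poor, which is why all three are reported together.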

### **Data Processing Plan**

| Step | Description |
|------|-------------|
| Clean data | Remove rows with `played_hours = 0`. |
| Response variable | Label players as above/below average using the mean. |
| Encode features | Convert `experience` and `subscribe` to numeric values. |
| Scale features | Standardize features for distance calculations to ensure both variables contribute equally. |
| Split data | 80% training, 20% testing before EDA (done). |
| Validation | 5-fold cross-validation on training to pick the best K. |
| Testing | Use the test set for final evaluation. |
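
The scaling, validation, and testing steps in the plan could be wired together roughly as follows. This is a minimal sketch on synthetic data, not the final analysis: the toy rows, the labelling rule, the column names `experience_code` and `subscribe_code`, and the candidate K values are all assumptions made for illustration.

```python
import itertools

import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded data (NOT the real players data):
# every (experience, subscribe) code pair, repeated 12 times -> 120 rows
pairs = list(itertools.product(range(1, 6), [1, 2])) * 12
toy = pd.DataFrame(pairs, columns=["experience_code", "subscribe_code"])

# Hypothetical labelling rule, used only so the example has two classes
is_above = (toy["experience_code"] >= 4) & (toy["subscribe_code"] == 1)
toy["engagement_level"] = is_above.map({True: "Above Average", False: "Below Average"})

X = toy[["experience_code", "subscribe_code"]]
y = toy["engagement_level"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standardize, then classify; keeping the scaler inside the pipeline means each
# cross-validation fold is scaled using its own training portion only
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Candidate K values are an assumption; 5-fold cross-validation picks among them
k_grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9]}
search = GridSearchCV(knn_pipeline, k_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# The test set is touched once, after K has been chosen
best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = search.score(X_test, y_test)
print(best_k, round(test_accuracy, 3))
```

Given the class imbalance, swapping `scoring="accuracy"` for a recall- or precision-based scorer may be worth considering when choosing K on the real data.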
