### Dataset 1 (`players.csv`)

This dataset contains information about **196 individuals** (observations) representing unique participant profiles.  
It includes **9 variables** describing personal demographics, gaming experience, and engagement metrics.

| Variable Name     | Dtype    | Description |
|-------------------|----------|-------------|
| experience        | object   | Indicates player expertise level |
| subscribe         | bool     | Whether the individual is subscribed (TRUE/FALSE) |
| hashedEmail       | object   | Hashed email used for anonymized identification |
| played_hours      | float64  | Total number of hours each participant played |
| name              | object   | Participant’s first name (non-unique identifier) |
| gender            | object   | Gender identity |
| age               | int64    | Participant’s age in years |
| individualId      | float64  | Individual Numeric ID |
| organizationName  | float64  | Organization/group name |

**Note:** The `individualId` and `organizationName` columns are completely empty.  
These fields may not have been collected, may not have been applicable, or may have been lost during data processing.

---

#### Issues & Considerations
- Entire columns (`individualId` and `organizationName`) contain no data.  
- Possible duplicates by name or email hash.  
- Potential age or entry errors (e.g., `age = 99`).  
- Self-reported data may include bias or inaccuracies.
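
The checks listed above can be sketched in a few lines of pandas. This is a minimal illustration rather than part of the original analysis: `quality_report` is a hypothetical helper, and `demo` is a tiny made-up table that only borrows the column names from `players.csv`.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Summarize the data-quality issues listed above (hypothetical helper)."""
    return {
        # Columns in which every value is missing
        "empty_columns": [c for c in df.columns if df[c].isna().all()],
        # Rows whose email hash already appeared earlier in the table
        "duplicate_hashes": int(df["hashedEmail"].duplicated().sum()),
        # Rows whose name already appeared earlier (names are not unique IDs)
        "repeated_names": int(df["name"].duplicated().sum()),
        # Observed age range, to spot implausible entries by eye
        "age_range": (int(df["age"].min()), int(df["age"].max())),
    }

# Made-up rows for illustration only; NOT the real data
demo = pd.DataFrame({
    "hashedEmail": ["a1", "b2", "a1"],
    "name": ["Morgan", "Blake", "Morgan"],
    "age": [17, 99, 21],
    "individualId": [None, None, None],
})

report = quality_report(demo)
print(report)
```

On the real table, the same call would be `quality_report(players)` once the file has been loaded.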


### **Question 2**

We would like to know which **"kinds" of players** are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

**Subquestion:**  
Can we predict whether a player will spend **above-average time playing** based on their **experience level** and **subscription status**?

We aim to identify the types of players who are most likely to play the most, using **`played_hours`** as our response variable. To do this, we will first clean the dataset by **removing rows where `played_hours` is zero**, since these represent players who did not contribute meaningful data. Afterward, we will calculate the **average played hours** and classify each player as either **above-average** or **below-average** based on this value. These classifications will form the categories for our **response variable**.

For explanatory variables, we will focus specifically on **experience** and **subscription status**. We will convert these variables into numeric form to use in predictive modeling — for example:  

- **`subscribe`**: `1` = subscribed, `2` = not subscribed  
- **`experience`**: `1` = amateur, `2` = beginner, `3` = regular, `4` = veteran, `5` = pro  

By filtering and transforming the data this way, we can apply **classification methods** to predict which types of players are likely to play **above-average amounts**.
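
The filtering, labelling, and encoding steps described above could look roughly like the sketch below. It is an assumption-laden illustration: `prepare`, `EXPERIENCE_CODES`, and the new column names (`above_average`, `experience_code`, `subscribe_code`) are hypothetical, the capitalized category labels are assumed to match the values stored in the file, and `demo` is made-up data.

```python
import pandas as pd

# Numeric codes proposed above; capitalized labels are an assumption
EXPERIENCE_CODES = {"Amateur": 1, "Beginner": 2, "Regular": 3, "Veteran": 4, "Pro": 5}
SUBSCRIBE_CODES = {True: 1, False: 2}

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Filter out zero-hour players, label above-average play, encode predictors."""
    # Keep only players who actually played
    out = df.loc[df["played_hours"] > 0,
                 ["experience", "subscribe", "played_hours"]].copy()
    # The mean is computed AFTER removing zero-hour rows
    mean_hours = out["played_hours"].mean()
    out["above_average"] = out["played_hours"] > mean_hours
    # Map the categorical predictors to the numeric codes
    out["experience_code"] = out["experience"].map(EXPERIENCE_CODES)
    out["subscribe_code"] = out["subscribe"].map(SUBSCRIBE_CODES)
    return out

# Made-up rows for illustration only; NOT the real data
demo = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular"],
    "subscribe": [True, False, False, True, True],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1],
})

prepared = prepare(demo)
print(prepared)
```

Because the mean is sensitive to a few very large values, the threshold can sit well above what a typical player logs. Using the median instead would be a different design choice, not the one described here.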

In [2]:
import pandas as pd
players=pd.read_csv("players.csv")

In [3]:
players.head(20)

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
5,Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee...,0.0,Adrian,Female,17,,
6,Regular,True,8e594b8953193b26f498db95a508b03c6fe1c24bb5251d...,0.0,Luna,Female,19,,
7,Amateur,False,1d2371d8a35c8831034b25bda8764539ab7db0f6393869...,0.0,Emerson,Male,21,,
8,Amateur,True,8b71f4d66a38389b7528bb38ba6eb71157733df7d17403...,0.1,Natalie,Male,17,,
9,Veteran,True,bbe2d83de678f519c4b3daa7265e683b4fe2d814077f90...,0.0,Nyla,Female,22,,


In [4]:
# Check data info
players.info()

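# Drop completely empty columns and any duplicate rows
# (assign the result back; neither method modifies the frame in place)
players = players.dropna(axis=1, how="all").drop_duplicates()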

# Filter and select relevant columns
players_tidy = players.loc[players["played_hours"] > 0, ['experience', 'subscribe', 'played_hours']]

# Display
players_tidy.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   experience        196 non-null    object 
 1   subscribe         196 non-null    bool   
 2   hashedEmail       196 non-null    object 
 3   played_hours      196 non-null    float64
 4   name              196 non-null    object 
 5   gender            196 non-null    object 
 6   age               196 non-null    int64  
 7   individualId      0 non-null      float64
 8   organizationName  0 non-null      float64
dtypes: bool(1), float64(3), int64(1), object(4)
memory usage: 12.6+ KB


Unnamed: 0,experience,subscribe,played_hours
0,Pro,True,30.3
1,Veteran,True,3.8
3,Amateur,True,0.7
4,Regular,True,0.1
8,Amateur,True,0.1


In [54]:
import altair as alt

# --- Experience facet ---
chart_exp = (
    alt.Chart(players_tidy)
    .mark_bar()
    .encode(
        x=alt.X('played_hours:Q', bin=alt.Bin(maxbins=20), title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Count'),
        color='experience:N'
    )
    .properties(title='Played Hours by Experience')
    .facet(column='experience:N')
)

# --- Subscribe facet ---
chart_sub = (
    alt.Chart(players_tidy)
    .mark_bar()
    .encode(
        x=alt.X('played_hours:Q', bin=alt.Bin(maxbins=20), title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Count'),
        color=alt.Color('subscribe:N').title("subscribed")
    )
    .properties(title='Played Hours by Subscription')
    .facet(column='subscribe:N')
)

chart_exp 

In [55]:
chart_sub

In [57]:
# --- Experience facet (zoomed) ---
chart_exp_zoom = (
    alt.Chart(players_tidy)
    .mark_bar(clip=True)
    .encode(
        x=alt.X('played_hours:Q',
                bin=alt.Bin(maxbins=45),
                scale=alt.Scale(domain=[0, 60]),   
                title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Count'),
        color='experience:N'
    )
    .properties(title='Played Hours by Experience (0–60)')
    .facet(column='experience:N')
)

# --- Subscribe facet (zoomed) ---
chart_sub_zoom = (
    alt.Chart(players_tidy)
    .mark_bar(clip=True)
    .encode(
        x=alt.X('played_hours:Q',
                bin=alt.Bin(maxbins=45),
                scale=alt.Scale(domain=[0, 60]), 
                title='Played Hours (hrs)'),
        y=alt.Y('count()', title='Count'),
        color=alt.Color('subscribe:N').title("subscribed")
    )
    .properties(title='Played Hours by Subscription (0–60)')
    .facet(column='subscribe:N')
)

chart_exp_zoom 

In [58]:
chart_sub_zoom

### Insights from Visualizations

From the histograms of played hours by experience level and subscription status, we can see that most players spend very little time playing, while a few spend an extremely large amount, creating a heavily right-skewed distribution. Most values fall between 0 and 20 hours, so narrowing the range to 0–60 hours helped reveal smaller differences between groups that were previously hidden by outliers.

Across experience levels, there isn’t a clear pattern: all categories are dominated by players with low play times, though a few experienced players (especially Pro and Veteran) reach higher hours. For subscription status, subscribed players appear to have a slightly wider range of play times, suggesting they may be more engaged overall.
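
A numeric summary per group is one way to check these visual impressions. The sketch below is illustrative only: `summarize_by` is a hypothetical helper and `demo` is made-up data that reuses the column names of `players_tidy`.

```python
import pandas as pd

def summarize_by(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Count, median, mean, and max of played_hours within each group."""
    return df.groupby(group_col)["played_hours"].agg(
        n="count", median="median", mean="mean", max="max"
    )

# Made-up rows for illustration only; NOT the real data
demo = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Amateur", "Amateur", "Veteran"],
    "subscribe": [True, True, False, True, False],
    "played_hours": [30.3, 3.8, 0.5, 0.7, 0.1],
})

by_exp = summarize_by(demo, "experience")
by_sub = summarize_by(demo, "subscribe")
print(by_exp)
print(by_sub)
```

With a right-skewed variable, comparing the median and the maximum of each group separates typical behavior from the influence of a few heavy players.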

These visualizations don’t directly answer our question yet, since we still need to assign numeric values to the categorical variables (experience and subscription) before applying classification. However, they reveal that the data is highly imbalanced, with most players spending below-average time. This means our classification model may lean toward predicting the majority class, so we’ll need to handle that imbalance carefully. Additionally, they do reveal a potential positive relationship between subscribed players and higher data contribution, as well as a positive relationship between a higher experience level and higher data c