*Data Wrangling: Methods & Results*

**1) Load the Dataset:**
   
The dataset was loaded from a remote URL and contains information about 196 Minecraft server players. Initial inspection revealed 9 columns, including demographic information (gender, age) and behavioural metrics (experience level, played hours).

In [1]:
import pandas as pd
#load the dataset from the internet
url="https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players=pd.read_csv(url)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


**2) Clean the data:**
   
Data wrangling began with type conversion: played_hours and age were made numeric using pd.to_numeric() with errors='coerce', which turns non-numeric entries into missing values. Missing-value analysis identified gaps in several columns. We removed rows with missing values in the critical variables (played_hours, age, experience, gender, subscribe). The subscribe variable was converted to boolean type for classification purposes. Feature selection involved removing columns that did not contribute to prediction: hashedEmail, name, individualId, and organizationName. This reduced the dataset to 5 essential columns. The final cleaned dataset contained 196 complete observations, so none of the 196 loaded rows were lost at this step.

In [8]:
import pandas as pd
import numpy as np

# Convert played_hours and age to numeric; unparseable entries become NaN
players['played_hours'] = pd.to_numeric(players['played_hours'], errors='coerce')
players['age'] = pd.to_numeric(players['age'], errors='coerce')

# Drop rows with missing values in critical columns (including the target, subscribe)
players.dropna(subset=['played_hours', 'age', 'experience', 'gender', 'subscribe'], inplace=True)

# Ensure subscribe is boolean
if players['subscribe'].dtype != 'bool':
    players['subscribe'] = players['subscribe'].astype(bool)
    
# Drop irrelevant columns
irrelevant_columns = ['hashedEmail', 'name', 'individualId','organizationName'] 
players.drop(columns=[col for col in irrelevant_columns if col in players.columns], inplace=True)

players

Unnamed: 0,experience,subscribe,played_hours,gender,age
0,Pro,True,30.3,Male,9
1,Veteran,True,3.8,Male,17
2,Veteran,False,0.0,Male,17
3,Amateur,True,0.7,Female,21
4,Regular,True,0.1,Male,21
...,...,...,...,...,...
191,Amateur,True,0.0,Female,17
192,Veteran,False,0.3,Male,22
193,Amateur,False,0.0,Prefer not to say,17
194,Amateur,False,2.3,Male,17
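The missing-value analysis described in this step is not shown in the cell above. A minimal sketch of how it could be done follows, using a small made-up frame (`toy`, with invented values) rather than the real `players` table, so the numbers here are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the raw table; the values are invented for illustration.
toy = pd.DataFrame({
    "played_hours": ["30.3", "3.8", "n/a", "0.7"],
    "age": [9, 17, None, 21],
    "experience": ["Pro", "Veteran", "Veteran", None],
})

# errors="coerce" turns unparseable strings such as "n/a" into NaN.
toy["played_hours"] = pd.to_numeric(toy["played_hours"], errors="coerce")

# Count missing values per column before deciding which rows to drop.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only rows that are complete in the critical columns.
complete = toy.dropna(subset=["played_hours", "age", "experience"])
print(len(toy), "rows before,", len(complete), "rows after")
```

Running the same `isna().sum()` call on `players` before the `dropna` step would show which columns actually contain gaps.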


**3) Summary of the dataset:**

Descriptive analysis revealed key dataset characteristics. The target variable showed 75% of players subscribed to the newsletter, indicating moderate class imbalance that could influence model performance. The played_hours variable exhibited a right-skewed distribution with a mean of about 5.8 hours and a median of 0.1 hours, with some high-engagement outliers exceeding 200 hours (maximum 223.1). Age ranged from 8 to 99 years, with potential data entry errors at the upper extreme (91 to 99). Gender distribution was heavily imbalanced with 79% male players, reflecting common gaming demographic patterns. All five experience levels (Beginner, Amateur, Regular, Veteran, and Pro) were represented, with Amateur the most common (63 players) and Pro the least common (14 players).

In [9]:
players.describe()

Unnamed: 0,played_hours,age
count,196.0,196.0
mean,5.845918,21.280612
std,28.357343,9.706346
min,0.0,8.0
25%,0.0,17.0
50%,0.1,19.0
75%,0.6,22.0
max,223.1,99.0
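The gap between the mean and the median of played_hours is what signals right skew. A small sketch of how that skew could be quantified, and how a log transform reduces it, is given below. The values in `hours` are illustrative picks, not the full column, and `np.log1p` is one possible transform rather than a step taken in this analysis.

```python
import numpy as np
import pandas as pd

# Illustrative values only; not the real played_hours column.
hours = pd.Series([0.0, 0.0, 0.1, 0.1, 0.3, 0.6, 2.3, 30.3, 223.1])

mean_hours = hours.mean()
median_hours = hours.median()

# Positive sample skewness indicates a long right tail.
skewness = hours.skew()

# log1p(x) = log(1 + x) handles zeros and compresses large values.
log_hours = np.log1p(hours)
log_skewness = log_hours.skew()

print(f"mean={mean_hours:.2f}, median={median_hours:.2f}")
print(f"skew before={skewness:.2f}, after log1p={log_skewness:.2f}")
```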


In [10]:
# Only the last expression in a cell is displayed; wrap each call in print() to see both.
players['subscribe'].value_counts()  # How many True vs False
players['experience'].value_counts()  # How many in each category

experience
Amateur     63
Veteran     48
Regular     36
Beginner    35
Pro         14
Name: count, dtype: int64
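Raw counts are easier to compare as proportions. The sketch below rebuilds the experience counts from the output above and converts them to shares. The commented lines show how the same could be done directly on `players`, which is an assumption about how the table would be queried, not output from this notebook.

```python
import pandas as pd

# Experience counts copied from the value_counts() output above.
counts = pd.Series(
    {"Amateur": 63, "Veteran": 48, "Regular": 36, "Beginner": 35, "Pro": 14},
    name="count",
)

# Share of players in each experience level.
shares = (counts / counts.sum()).round(3)
print(shares)

# On the full table, normalize=True gives proportions directly;
# print() ensures neither result is silently discarded:
# print(players["subscribe"].value_counts(normalize=True))
# print(players["experience"].value_counts(normalize=True))
```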

**4) Visualizations:**

**Figure 1:** Subscription Rate by Experience Level:
A bar chart displaying newsletter subscription rates across player experience levels. This visualization reveals whether more experienced players show different engagement patterns with newsletter content.

In [12]:
import altair as alt 
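# Summary table of subscribers, totals, and rates per experience level.
# Kept for reference; the chart below computes mean(subscribe) itself.
exp_sub_rates = players.groupby('experience')['subscribe'].agg(['sum', 'count', 'mean']).reset_index()
exp_sub_rates.columns = ['experience', 'subscribed', 'total', 'rate']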

fig1 = alt.Chart(players).mark_bar().encode(
    x=alt.X('experience:N', 
            title='Experience Level', 
            sort=['Beginner', 'Amateur', 'Regular', 'Veteran', 'Pro']),
    y=alt.Y('mean(subscribe):Q', 
            title='Subscription Rate',
            axis=alt.Axis(format='%'),
            scale=alt.Scale(domain=[0, 1])),
    color=alt.Color('experience:N', 
                    legend=None, 
                    scale=alt.Scale(scheme='tableau10')),
    tooltip=[
        alt.Tooltip('experience:N', title='Experience'),
        alt.Tooltip('mean(subscribe):Q', title='Subscription Rate', format='.1%'),
        alt.Tooltip('count()', title='Number of Players')
    ]
).properties(
    title='Figure 1: Newsletter Subscription Rate by Experience Level',
    width=450,
    height=300
)
fig1
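The rate plotted in Figure 1 is the mean of a boolean column within each group, since True counts as 1 and False as 0. A minimal sketch on a made-up frame (`toy`, with invented values) shows the same aggregation as a table, using pandas named aggregation as an alternative to renaming the columns afterwards.

```python
import pandas as pd

# Hypothetical rows for illustration; not taken from the players table.
toy = pd.DataFrame({
    "experience": ["Pro", "Pro", "Amateur", "Amateur", "Amateur", "Veteran"],
    "subscribe": [True, False, True, True, False, False],
})

# sum = number of subscribers, count = group size, mean = subscription rate.
rates = (
    toy.groupby("experience")["subscribe"]
    .agg(subscribed="sum", total="count", rate="mean")
    .reset_index()
)
print(rates)
```

Showing the group size next to each rate helps when reading the chart, because a rate based on a handful of players is much less stable than one based on dozens.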


**Figure 2:** Age distribution by subscription status
An overlapping histogram showing age distributions for subscribers versus non-subscribers. This figure explores whether certain age groups demonstrate a higher propensity to engage with the newsletter, informing age-targeted marketing strategies.

In [13]:
fig2 = alt.Chart(players).mark_bar(opacity=0.7).encode(
    x=alt.X('age:Q', 
            bin=alt.Bin(maxbins=20), 
            title='Age (years)'),
    y=alt.Y('count()', 
            title='Number of Players',
            stack=None),
    color=alt.Color('subscribe:N', 
                    title='Subscribed',
                    scale=alt.Scale(scheme='set2')),
    tooltip=[
        alt.Tooltip('age:Q', bin=True, title='Age Range'),
        alt.Tooltip('subscribe:N', title='Subscribed'),
        alt.Tooltip('count()', title='Count')
    ]
).properties(
    title='Figure 2: Age Distribution by Subscription Status',
    width=500,
    height=300
)
fig2

**Figure 3:** Played Hours vs Age By Subscription Status: A scatter plot illustrating the relationship between player age and engagement level (hours played), with points colored by subscription status. This visualization examines whether age and engagement jointly relate to subscription, which would point to potential interaction effects.

In [14]:
fig3 = alt.Chart(players).mark_circle(size=60, opacity=0.6).encode(
    x=alt.X('age:Q', 
            title='Age (years)',
            scale=alt.Scale(domain=[5, 100])),  # maximum recorded age is 99
    y=alt.Y('played_hours:Q', 
            title='Played Hours',
            scale=alt.Scale(domain=[-5, 250])),
    color=alt.Color('subscribe:N', 
                    title='Subscribed',
                    scale=alt.Scale(scheme='set1')),
    tooltip=[
        alt.Tooltip('age:Q', title='Age'),
        alt.Tooltip('played_hours:Q', title='Hours Played', format='.1f'),
        alt.Tooltip('subscribe:N', title='Subscribed'),
        alt.Tooltip('experience:N', title='Experience'),
        alt.Tooltip('gender:N', title='Gender')
    ]
).properties(
    title='Figure 3: Played Hours vs Age by Subscription Status',
    width=500,
    height=350
)
fig3

**Figure 4:** Average Played Hours by Subscription Status: A bar chart comparing mean played hours between subscribers and non-subscribers. This figure examines whether highly engaged players are more likely to subscribe to the newsletter, testing the assumption that engagement correlates with newsletter interest. Because played_hours is strongly right-skewed, the group means are sensitive to a few high-engagement outliers.

In [17]:
fig4 = alt.Chart(players).mark_bar().encode(
    x=alt.X('subscribe:N', 
            title='Subscription Status',
            axis=alt.Axis(labelAngle=0)),
    y=alt.Y('mean(played_hours):Q', 
            title='Average Played Hours'),
    color=alt.Color('subscribe:N', 
                    legend=None,
                    scale=alt.Scale(scheme='set2')),
    tooltip=[
        alt.Tooltip('subscribe:N', title='Subscribed'),
        alt.Tooltip('mean(played_hours):Q', title='Avg Hours', format='.2f'),
        alt.Tooltip('count()', title='Number of Players')
    ]
).properties(
    title='Figure 4: Average Played Hours by Subscription Status',
    width=300,
    height=300
)
fig4

**Figure 5:** Gender Distribution By Subscription Status: Faceted bar charts showing gender composition within the subscriber and non-subscriber groups, one panel per group. This visualization assesses whether gender composition differs between the groups and helps identify gender-based subscription patterns.

In [19]:
fig5 = alt.Chart(players).mark_bar().encode(
    x=alt.X('gender:N', title='Gender'),
    y=alt.Y('count()', title='Number of Players'),
    color=alt.Color('subscribe:N', title='Subscribed'),
    column=alt.Column('subscribe:N', title='Subscription Status')
).properties(
    title='Figure 5: Gender Distribution by Subscription Status',
    width=200,
    height=300
)
fig5

**Summary Insights From Visualizations:**
The exploratory analysis revealed several important patterns. Subscription rates appear relatively consistent across most experience levels, though some variation exists. Age distributions show overlap between subscribers and non-subscribers, suggesting age alone may not be a strong discriminator. 

The scatter plot reveals no strong linear relationship between age and played hours, indicating these features may provide independent information for prediction. Engagement levels (played hours) show similar distributions for both groups, challenging the assumption that highly engaged players are more likely to subscribe.


The dataset exhibits quality issues including class imbalance, right-skewed continuous variables, and demographic imbalances that must be considered during modeling. These characteristics suggest the need for appropriate preprocessing including feature scaling and potentially class-balancing techniques during model development.
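As a rough illustration of the preprocessing mentioned above, the sketch below standardizes two features and derives class weights on a made-up frame (`toy`, with invented values). The log transform and the weighting formula, n_samples / (n_classes * class_count), are assumptions chosen for illustration; the formula is the "balanced" heuristic used by scikit-learn, and nothing here is the modeling pipeline of this report.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration; not taken from the players table.
toy = pd.DataFrame({
    "played_hours": [0.0, 0.1, 0.3, 2.3, 30.3, 223.1],
    "age": [17, 17, 22, 21, 9, 19],
    "subscribe": [True, True, False, True, True, False],
})

features = toy[["played_hours", "age"]].copy()

# Compress the right-skewed hours before scaling (log1p handles zeros).
features["played_hours"] = np.log1p(features["played_hours"])

# Standardize each column to mean 0 and standard deviation 1.
scaled = (features - features.mean()) / features.std(ddof=0)

# "Balanced" class weights: rarer classes receive larger weights.
class_counts = toy["subscribe"].value_counts()
class_weights = (len(toy) / (len(class_counts) * class_counts)).to_dict()

print(scaled.round(2))
print(class_weights)
```

In a real workflow the scaling statistics would be computed on the training split only and then applied to the test split, to avoid leaking information.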