## DSCI 100 Group 31 Final Analysis

In [6]:
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn import set_config
import warnings

URL = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
alt.data_transformers.disable_max_rows()
set_config(transform_output="pandas")
# Suppress warnings about unknown categories being encoded as zeros (expected behavior with handle_unknown='ignore')
warnings.filterwarnings("ignore", message="Found unknown categories")


## Introduction

* Background information: Game studios often use email newsletters to keep players informed when new events happen. Understanding who subscribes to this service, and why, can help a company improve its marketing and game features.
* We are trying to answer the question: can we predict whether a player subscribes to the newsletter using experience, gender, age, and played_hours?
To answer this, we focus on the dataset players.csv, which contains data on 196 players, including the following useful information:
+ `experience`: how experienced the player is, one of "Pro", "Veteran", "Amateur", "Regular", or "Beginner".
+ `subscribe`: whether the player subscribes to the newsletter, either "True" or "False".
+ `played_hours`: how many hours the player has played.
+ `gender`: the player's gender.
+ `age`: the player's age.

The dataset also contains the following columns, which we do not use:
+ `individualId`: the player's ID (missing for all 196 players).
+ `organizationName`: the player's organization (missing for all 196 players).
+ `hashedEmail`: the player's hashed email address.
+ `name`: the player's name.

## Question we are answering

 Can we predict whether a player subscribes to the newsletter (subscribe) using experience, gender, age, and played_hours?

## Data Loading and Wrangling

In [7]:
df = pd.read_csv(URL)
df["subscribe"] = df["subscribe"].astype(str).str.upper().map({"TRUE": True, "FALSE": False})
df["played_hours"] = pd.to_numeric(df["played_hours"], errors="coerce")
df["age"] = pd.to_numeric(df["age"], errors="coerce").astype("Int64")
df["experience"] = df["experience"].astype("category")
df["gender"] = df["gender"].astype("category")


summary = pd.DataFrame({
    "variable": df.columns,
    "dtype": [str(t) for t in df.dtypes],
})

print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
display(summary)

missing = df.isna().sum().sort_values(ascending=False).reset_index()
missing.columns = ["variable", "num_missing"]
display(missing)

print("hashedEmail is unique?:", df["hashedEmail"].nunique() == len(df))
print("subscription rate (overall):", f"{float(pd.Series(df['subscribe']).mean()):.1%}")

display(df["experience"].value_counts().to_frame("count"))
display(df["gender"].value_counts().to_frame("count"))

Rows: 196, Columns: 9


Unnamed: 0,variable,dtype
0,experience,category
1,subscribe,bool
2,hashedEmail,object
3,played_hours,float64
4,name,object
5,gender,category
6,age,Int64
7,individualId,float64
8,organizationName,float64


Unnamed: 0,variable,num_missing
0,individualId,196
1,organizationName,196
2,experience,0
3,subscribe,0
4,hashedEmail,0
5,name,0
6,played_hours,0
7,age,0
8,gender,0


hashedEmail is unique?: True
subscription rate (overall): 73.5%


Unnamed: 0_level_0,count
experience,Unnamed: 1_level_1
Amateur,63
Veteran,48
Regular,36
Beginner,35
Pro,14


Unnamed: 0_level_0,count
gender,Unnamed: 1_level_1
Male,124
Female,37
Non-binary,15
Prefer not to say,11
Two-Spirited,6
Agender,2
Other,1


In [8]:
overall_stats = df.agg({
    "subscribe": ["mean", "sum", "count"],
    "played_hours": ["mean", "median", "min", "max"],
    "age": ["mean", "median", "min", "max"]
})
overall_stats

Unnamed: 0,subscribe,played_hours,age
mean,0.734694,5.845918,21.280612
sum,144.0,,
count,196.0,,
median,,0.1,19.0
min,,0.0,8.0
max,,223.1,99.0


In [12]:
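# observed=True keeps only the gender categories present in the data
# and avoids the pandas FutureWarning about the default changing
gender_stats = df.groupby("gender", observed=True).agg({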
    "subscribe": "mean",
    "played_hours": "mean",
    "age": "mean"
})
gender_stats


Unnamed: 0_level_0,subscribe,played_hours,age
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Agender,1.0,6.25,23.5
Female,0.783784,10.635135,21.810811
Male,0.75,4.127419,20.209677
Non-binary,0.733333,14.88,19.066667
Other,1.0,0.2,91.0
Prefer not to say,0.363636,0.372727,21.363636
Two-Spirited,0.666667,0.083333,33.166667


The overall summary shows that about 73.5% of players (144 of 196) subscribe to the newsletter. Because `subscribe` is either True or False, treating it as a number (True = 1, False = 0) lets us read its mean, 0.735, as the subscription rate. A rate this high means that most players choose to subscribe, which might indicate that players are generally interested in receiving game updates or engaging with the game community.
Next, let's look at `played_hours`. The mean play time is approximately **5.85 hours**, but the median is only **0.1 hours**. Most players therefore play very little, while a few play a great deal, up to 223.1 hours. The distribution is strongly right-skewed: a small number of heavy players pull the mean far above the median. This also tells us that the player group is diverse in terms of play time.
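
To make the two observations above concrete, here is a minimal sketch using only the Python standard library. The `toy_subscribe` and `toy_hours` lists are made-up illustrative values, not rows from players.csv. They show why the mean of a True/False column can be read as a rate, and how a single large value pulls the mean far above the median.

```python
from statistics import mean, median

# Hypothetical toy data (not taken from players.csv)
toy_subscribe = [True, True, True, False]
toy_hours = [0.0, 0.1, 0.1, 0.2, 50.0]

# True counts as 1 and False as 0, so the mean is the share of True values
toy_rate = mean([int(s) for s in toy_subscribe])

# One heavy player is enough to separate the mean from the median
toy_mean_hours = mean(toy_hours)
toy_median_hours = median(toy_hours)

print(f"subscription rate: {toy_rate:.0%}")
print(f"mean hours: {toy_mean_hours:.2f}, median hours: {toy_median_hours:.2f}")
```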

## Exploratory Data Analysis and Visualization

In [11]:
age_bins = pd.cut(
    df["age"].astype("float"),
    bins=[0, 12, 17, 24, 34, 44, 100],
    right=True,
    labels=["0-12", "13-17", "18-24", "25-34", "35-44", "45+"],
)
df = df.assign(age_bin=age_bins)
df[["age", "age_bin"]].head(5)

Unnamed: 0,age,age_bin
0,9,0-12
1,17,13-17
2,17,13-17
3,21,18-24
4,21,18-24


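The charts below show the subscription rate by experience, gender, and age group, along with the distribution of played hours by subscription status.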

In [12]:
rate_by_experience = (
    df.groupby('experience', dropna=False, observed=True)['subscribe']
      .mean()
      .mul(100)
      .reset_index(name='subscription_rate_pct')
)
chart_exp = (
    alt.Chart(rate_by_experience, title='Subscription rate by experience')
    .mark_bar()
    .encode(
        x=alt.X('experience:N', title='Experience', sort='-y'),
        y=alt.Y('subscription_rate_pct:Q', title='Subscription rate (%)', scale=alt.Scale(domain=[0, 100])),
        tooltip=['experience', 'subscription_rate_pct']
    )
    .properties(width=420, height=300)
)
chart_exp

In [13]:
rate_by_gender = (
    df.groupby('gender', dropna=False, observed=True)['subscribe']
      .mean()
      .mul(100)
      .reset_index(name='subscription_rate_pct')
)
chart_gender = (
    alt.Chart(rate_by_gender, title='Subscription rate by gender')
    .mark_bar()
    .encode(
        x=alt.X('gender:N', title='Gender', sort='-y'),
        y=alt.Y('subscription_rate_pct:Q', title='Subscription rate (%)', scale=alt.Scale(domain=[0, 100])),
        tooltip=['gender', 'subscription_rate_pct']
    )
    .properties(width=420, height=300)
)
chart_gender

In [14]:
chart_hours = (
    alt.Chart(df, title='Played hours by subscription')
    .mark_boxplot(extent=1.5, size=60)
    .encode(
        alt.X('played_hours:Q', title='Played hours').scale(zero=False),
        alt.Y('subscribe:N', title='Subscribed'),
        color=alt.Color('subscribe:N', legend=None)
    )
)
chart_hours

In [15]:
rate_by_age = (
    df.groupby('age_bin', dropna=False, observed=True)['subscribe']
      .mean()
      .mul(100)
      .reset_index(name='subscription_rate_pct')
)
chart_age = (
    alt.Chart(rate_by_age, title='Subscription rate by age group')
    .mark_bar()
    .encode(
        x=alt.X('age_bin:N', title='Age group', sort=['0-12','13-17','18-24','25-34','35-44','45+']),
        y=alt.Y('subscription_rate_pct:Q', title='Subscription rate (%)', scale=alt.Scale(domain=[0, 100])),
        tooltip=['age_bin', 'subscription_rate_pct']
    )
    .properties(width=420, height=300)
)
chart_age

## Feature Selection and Splitting

We select the features experience, gender, age, and played_hours as our predictor variables (X) in order to predict subscribe. We use a 75/25 train/test split.

In [16]:
X = df[["experience", "gender", "age", "played_hours"]]
y = df["subscribe"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)

## Model Implementation with Grid Search

There is a noticeable class imbalance: about 73% of the players are subscribers and 27% are not. A model that predicted "subscribed" for every player would therefore be correct about 73% of the time. In other words, a model can reach a seemingly high accuracy without learning anything, simply by always guessing the majority class. This is a property of the dataset itself, so we have to keep it in mind when interpreting our results.
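
The arithmetic behind this argument can be sketched from the counts reported in the summary table (144 subscribers out of 196 players). The variable names are ours, and the figure applies to the full dataset; the share in any particular train or test split can differ.

```python
# Counts from the overall summary: 144 of 196 players subscribed
n_players = 196
n_subscribed = 144
n_not_subscribed = n_players - n_subscribed

# A "model" that always predicts the majority class is correct
# exactly once for every majority-class player
majority_count = max(n_subscribed, n_not_subscribed)
always_majority_accuracy = majority_count / n_players

print(f"Always-predict-majority accuracy: {always_majority_accuracy:.1%}")
```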

In the code below we build the model and tune it using grid search. First, we separate the numerical variables from the categorical variables. We then build our preprocessor. We standardize all numeric values so that they are on the same scale; otherwise a variable with a larger range would have more influence than the others on the distances that KNN computes. We use OneHotEncoder to convert our categorical predictors into numerical values that our KNN model can use. After making the preprocessor, we build our pipeline from the preprocessor and KNeighborsClassifier().
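
As a rough illustration of what the two preprocessing steps do, the sketch below reproduces them by hand with the standard library. It uses the first five ages shown in the age-binning output and the five experience levels from the dataset. It assumes the encoder orders categories alphabetically and that `drop="first"` removes the first of them; `one_hot` is a hypothetical helper written for this example, not part of scikit-learn.

```python
from statistics import mean, pstdev

# First five ages shown in the age-binning output
ages = [9, 17, 17, 21, 21]

# StandardScaler-style scaling: subtract the mean, then divide by the
# population standard deviation
age_mean = mean(ages)
age_std = pstdev(ages)
scaled_ages = [(a - age_mean) / age_std for a in ages]

# One-hot encoding with the first (alphabetical) level dropped
levels = sorted(["Pro", "Veteran", "Amateur", "Regular", "Beginner"])
kept_levels = levels[1:]  # "Amateur" becomes the all-zeros reference level


def one_hot(value):
    """Return a 0/1 indicator list over kept_levels for one category value."""
    return [1 if value == level else 0 for level in kept_levels]


print([round(z, 3) for z in scaled_ages])
print(kept_levels)
print("Pro ->", one_hot("Pro"))
print("Amateur ->", one_hot("Amateur"))
```

After scaling, the ages have mean 0 and standard deviation 1, so a one-unit difference in age and a one-unit difference in played hours contribute comparably to a distance.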

We check all odd values of K from 1 to 49 to see which performs best, and we use 5-fold cross-validation to estimate each model's accuracy on data it was not trained on.
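
For a sense of the size of this search, the sketch below counts the candidates and the cross-validation fits. The fold sizes assume a plain, unstratified 5-fold split of the training rows; GridSearchCV stratifies folds for classifiers, so the actual sizes may differ slightly.

```python
import math

# Candidate K values: odd numbers from 1 to 49
k_values = list(range(1, 50, 2))

# Row counts implied by the 75/25 split of 196 players
n_players = 196
n_test = math.ceil(0.25 * n_players)
n_train = n_players - n_test

# Plain 5-fold split: the first (n_train % n_folds) folds get one extra row
n_folds = 5
base, extra = divmod(n_train, n_folds)
fold_sizes = [base + 1 if i < extra else base for i in range(n_folds)]

# One fit per (candidate, fold) pair; the final refit on all training rows
# is not counted here
total_cv_fits = len(k_values) * n_folds

print(len(k_values), "candidates,", total_cv_fits, "cross-validation fits")
print("training rows:", n_train, "fold sizes:", fold_sizes)
```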

In [18]:
numeric_features = ["age", "played_hours"]
categorical_features = ["experience", "gender"]

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(drop="first", handle_unknown="ignore", sparse_output=False), categorical_features)
)

knn = KNeighborsClassifier()

pipe = make_pipeline(preprocessor, knn)

# We check odd numbers up to 49
param_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 50, 2)
}

grid_search = GridSearchCV(pipe, param_grid, cv=5, return_train_score=True, n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best K:", grid_search.best_params_["kneighborsclassifier__n_neighbors"])
print("Best CV Score:", grid_search.best_score_)

Best K: 7
Best CV Score: 0.7554022988505747


## What This Tells Us

After running our code we can see that the best K value is 7, with a cross-validation score of about 75.54%. This means that for K = 7, the model correctly predicted the subscription status of 75.54% of the held-out players, averaged across the five validation folds. Below we visualize how cross-validation accuracy and training accuracy change with K.

In [19]:
results = pd.DataFrame(grid_search.cv_results_)

chart = alt.Chart(results).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors", title="Number of Neighbors (K)"),
    y=alt.Y("mean_test_score", title="Cross-Validation Accuracy", scale=alt.Scale(zero=False)),
    tooltip=["param_kneighborsclassifier__n_neighbors", "mean_test_score"]
).properties(
    title="KNN K vs Accuracy"
)

chart

The graph above shows how cross-validation accuracy changes across the odd values of K from 1 to 49. K = 7 does indeed have the highest cross-validation accuracy.

The elbow plot below shows where the training accuracy stops dropping sharply (that is, where overfitting eases off), which is around K = 5. Although that is not 7, K = 7 comes right after the elbow and has the best cross-validation accuracy, so K = 7 is the better option.

In [25]:
results = pd.DataFrame(grid_search.cv_results_) 

train_elbow = alt.Chart(results).mark_line(point=True, color='blue').encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors:Q", title="Number of Neighbors (K)"),
    y=alt.Y("mean_train_score:Q", title="Mean Training Accuracy"),
    tooltip=["param_kneighborsclassifier__n_neighbors", "mean_train_score"]
)

train_elbow

## Baseline Comparison

To judge whether a cross-validation accuracy of about 75.54% at K = 7 is good, we need to compare it to a baseline. Our baseline is a model that always predicts the most frequent class in the training data (subscribed), regardless of the player's features. We evaluate the baseline on the test set, so its accuracy tells us how often always guessing "subscribed" would be correct. To do this we use a DummyClassifier instead of a KNeighborsClassifier, with `strategy="most_frequent"` so that it always predicts the most frequent class.

In [26]:
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
baseline_score = dummy.score(X_test, y_test)

print(f"Baseline (Most Frequent) Accuracy: {baseline_score:.4f}")

Baseline (Most Frequent) Accuracy: 0.6939


The baseline's accuracy on the test set is 0.6939, or 69.39%. Our model's accuracy should be higher than the baseline's, since that would mean the model has picked up on patterns in the data rather than just guessing the majority class. Our KNN model's cross-validation accuracy (75.54%) is higher than the baseline's accuracy, which suggests that it has learned some patterns from the dataset. Because the two numbers are measured on different data, the fairer comparison is on the test set, which we make in the next section.

## Test Set Accuracy and Discussion

In [30]:
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print(f"Test Set Accuracy: {test_score:.4f}")

Test Set Accuracy: 0.7143


Finally, we evaluate our model on the test set. This estimates how accurate the model would be on data it has not seen before. We get an accuracy of about 71.43%, which is slightly above our baseline of 69.39%. Although the margin is small, the model does outperform the baseline, which suggests that it can use the explanatory variables experience, gender, age, and played_hours to predict whether a user will subscribe.
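
To put that margin in perspective, the two printed accuracies can be converted back into counts of correct predictions. This is a small sketch that assumes the test set size is 25% of 196 rounded up (it is exactly 49 here); the variable names are ours.

```python
import math

# Test set size implied by test_size=0.25 on 196 players
n_players = 196
n_test = math.ceil(0.25 * n_players)

# Accuracies as printed in the notebook output
knn_test_accuracy = 0.7143
baseline_test_accuracy = 0.6939

# Convert each accuracy back into a number of correctly predicted players
knn_correct = round(knn_test_accuracy * n_test)
baseline_correct = round(baseline_test_accuracy * n_test)

print(f"Test players: {n_test}")
print(f"KNN correct: {knn_correct}, baseline correct: {baseline_correct}")
print(f"Difference: {knn_correct - baseline_correct} player(s)")
```

With so few test players, each individual prediction moves the accuracy by about two percentage points, so small differences between models should be read with caution.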

However, this number is noticeably lower than our cross-validation accuracy of about 75.5%. This could be related to the class imbalance in the data, since our model might slightly favor the majority class; with a test set this small, some of the gap may also be chance variation. These results do not really surprise us, since the class imbalance was clear from the start. Nonetheless, it makes sense that KNN is able to learn some patterns in the data from the explanatory variables we chose and use them to predict the subscription class. Companies could use models like this to identify which users are most likely to subscribe, so that they can focus their marketing efforts on them.

## Questions

Several questions arise after building and evaluating the model.
- The biggest one is: how can we address the class imbalance in the data to create a model that performs with higher accuracy and doesn't favor the majority class?
- Would a different model work better than KNN?
- How would the model have performed with a larger dataset?
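
One possible starting point for the first question, which we did not use in this analysis, is a metric that is less flattered by the majority class. The sketch below computes balanced accuracy (the mean of per-class recall) by hand on hypothetical labels with the same 144/52 class mix as the full dataset. The `balanced_accuracy` helper is written for this example; scikit-learn offers `balanced_accuracy_score` in `sklearn.metrics` for the same purpose.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, computed by hand for illustration."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Positions of the players who truly belong to this class
        idx = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical labels with the dataset's class mix: 144 True, 52 False
toy_true = [True] * 144 + [False] * 52
# A model that always predicts the majority class
toy_pred = [True] * len(toy_true)

plain_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_balanced = balanced_accuracy(toy_true, toy_pred)

print(f"Plain accuracy: {plain_accuracy:.3f}")
print(f"Balanced accuracy: {toy_balanced:.3f}")
```

Always guessing "subscribed" looks reasonable under plain accuracy but scores only 0.5 under balanced accuracy, because it never identifies a non-subscriber.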