# **Introduction**

**Dataset Overview**  
Our project will use two datasets collected from UBC's Computer Science department's research Minecraft server:  

**`players.csv`** consists of information about players in the server.  
**`sessions.csv`** consists of details from each gameplay session in the server.  

The data was gathered automatically through server logs, which recorded player activities such as joining, playing, and exiting the world. A separate dataset containing information about each player was also provided. Both datasets share a `hashedEmail` field, which links the two together.


**Observations of Datasets:**

**`players.csv`**
- 196 Observations
- 9 Variables

| Variable | Type | Description |
| --- | --- | --- |
| `experience` | object | Self-reported experience level |
| `subscribe` | bool | Newsletter subscription status |
| `hashedEmail` | object | Player identifier |
| `played_hours` | float | Total hours played |
| `name` | object | Player name |
| `gender` | object | Gender |
| `age` | int | Age in years |
| `individualId` | float | Empty |
| `organizationName` | float | Empty |       

**Key takeaways (players.csv):**  
- `individualId` and `organizationName` hold no data, so they are best removed.
- `played_hours` is a promising behavioral variable.
- `hashedEmail` is the key identifier.


**`sessions.csv`**
- 1,535 Observations
- 5 Variables

| Variable | Type | Description |
| --- | --- | --- |
| `hashedEmail` | object | Player identifier |
| `start_time` | object | Session start timestamp |
| `end_time` | object | Session end timestamp |
| `original_start_time` | float | Float form start time |
| `original_end_time` | float | Float form end time |

**Key takeaways (sessions.csv):**  
- Either the original or the finalized start/end times can be used; whichever pair is not used can be dropped.
- A few users are missing `end_time` and `original_end_time`; it is not yet clear whether this will affect the analysis.
- `hashedEmail` is also the key identifier, but it appears multiple times per player here.

**Potential Issues in Data:**
- Missing or inconsistent timestamps.
- Players with extremely long sessions (potential outliers).
- Overlap or duplication in session data.
- Possible selection bias, since the data only covers players who joined the server.
- Empty or redundant columns, such as `individualId` (empty) and `original_start_time`/`original_end_time` (duplicates of the timestamps).
- Not every `hashedEmail` in players appears in sessions, and vice versa.
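
The last point in the list above can be checked directly. The following is a minimal sketch on made-up toy tables (the values and the variable names `player_keys`, `session_keys`, and `key_overlap` are illustrative, not taken from the real files); it uses an outer merge with `indicator=True` to label where each `hashedEmail` was found.

```python
import pandas as pd

# Toy stand-ins for players.csv and sessions.csv (values are made up).
players = pd.DataFrame({
    "hashedEmail": ["a1", "b2", "c3"],
    "subscribe": [True, False, True],
})
sessions = pd.DataFrame({
    "hashedEmail": ["a1", "a1", "b2", "d4"],
    "start_time": ["t0", "t1", "t2", "t3"],
})

# Reduce each table to its distinct keys, then outer-merge them.
# indicator=True adds a "_merge" column: left_only, right_only, or both.
player_keys = players[["hashedEmail"]].drop_duplicates()
session_keys = sessions[["hashedEmail"]].drop_duplicates()
key_overlap = player_keys.merge(
    session_keys, on="hashedEmail", how="outer", indicator=True
)

# Count how many keys fall into each category.
overlap_counts = key_overlap["_merge"].value_counts()
print(overlap_counts)
```

Keys marked `left_only` would be players with no recorded sessions, and keys marked `right_only` would be sessions with no matching player record.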

**Objective:**

We will focus on Question 1 from the project description:  
> What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Specific Research Question:
> Can we predict whether a player subscribes to the newsletter using their in-game and demographic features such as age, gender, and experience level?

This question is important because knowing which player attributes relate to newsletter subscription lets the research team direct its efforts toward the players who are most likely to be actively involved.

**Variables of Interest**
- Response Variable: `subscribe`
- Explanatory Variables: `age` and `experience`.

We chose to explore `age` and `experience` to see whether these two characteristics can predict whether a player will subscribe. We chose `age` because effective advertising can come from targeting specific age groups. We chose `experience` because it reflects interest in and commitment to the game.

We will create and train a **k-Nearest Neighbors (kNN)** classification model to test how well these player characteristics predict newsletter subscription. The model will be trained on the selected features (`age` and `experience`), with **cross-validation** used to choose the number of neighbors. The `subscribe` variable will serve as the class label.

# **Methods**

**Overview**

We will start by importing the datasets using pandas, verifying their structure, and then tidying the data.

**Steps:**
- Load the data with `read_csv` from the CSV URL.
- Check the structure of the data using `head()`.
- Check the data for any missing or inconsistent values.
- Split the dataset into training and testing sets before EDA.
- Perform EDA, creating simple visualizations to explore how each variable relates to subscription status.
- Train the KNN classifier.
- Perform cross-validation.
- Evaluate using the test set.

**Observations:**
- The age histogram and the experience bar chart will show which age groups and experience levels are more or less likely to subscribe.
- These patterns are descriptive and will help guide the later analysis.

**Visualization:** 
- A histogram showing the distribution of player `age` for subscribers and non-subscribers.   
- A bar chart comparing the number of subscribers and non-subscribers at each `experience` level.

**KNN Prediction Model**
 - Preprocess the training dataset
 - Train the KNN model
 - Create and fit the pipeline
 - Perform 5-fold cross validation on the players training dataset
 - Plot cross validation results
 - Pick the best K value based on the cross validation plot
 - Evaluate on the test set (determine the accuracy, precision and recall scores)

# **Data Wrangling**

We start by importing the necessary Python libraries.
- **pandas** and **numpy** for data handling.  
- **altair** for visualizations and exploratory data analysis.  
- **scikit-learn** for preprocessing, splitting the dataset, building the KNN model, and evaluating its performance.

In [36]:
# Visualization and data handling
import altair as alt
import numpy as np
import pandas as pd

# Preprocessing, model selection, the KNN classifier, and evaluation metrics
from sklearn.compose import make_column_transformer
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

Next, we load the `players.csv` dataset from a URL using `pandas.read_csv()`.  
We remove the columns that are either empty or unnecessary for our analysis. We drop:

- `played_hours`
- `hashedEmail`
- `individualId` and `organizationName` (empty fields)
- `name` and `gender`

This leaves us with `players_tidy`, a dataset containing only the variables relevant to our question.


In [37]:
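# Download players.csv from the shared link
url = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players = pd.read_csv(url)

# Keep only the response (subscribe) and the two predictors (experience, age)
players_tidy = players.drop(
    columns=["played_hours", "hashedEmail", "individualId", "organizationName", "name", "gender"]
)
players_tidy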

Unnamed: 0,experience,subscribe,age
0,Pro,True,9
1,Veteran,True,17
2,Veteran,False,17
3,Amateur,True,21
4,Regular,True,21
...,...,...,...
191,Amateur,True,17
192,Veteran,False,22
193,Amateur,False,17
194,Amateur,False,17


In [27]:
# Add a numerical column for experience.
# LabelEncoder assigns integer codes to the categories in alphabetical order.
players_tidy["experience_numerical"] = LabelEncoder().fit_transform(
    players_tidy["experience"])

players_tidy

Unnamed: 0,experience,subscribe,age,experience_numerical
0,Pro,True,9,2
1,Veteran,True,17,4
2,Veteran,False,17,4
3,Amateur,True,21,0
4,Regular,True,21,3
...,...,...,...,...
191,Amateur,True,17,0
192,Veteran,False,22,4
193,Amateur,False,17,0
194,Amateur,False,17,0
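
The output above shows that `LabelEncoder` numbers the categories alphabetically (Amateur 0, Pro 2, Regular 3, Veteran 4), so the codes do not follow skill level. Because KNN measures distance on these codes, an explicit ordinal mapping may be worth considering. The sketch below is one possible alternative, not the method used in this notebook; the ordering in `experience_order` is an assumption about how the levels rank and is not confirmed by the dataset, and the column name `experience_ordinal` is illustrative.

```python
import pandas as pd

# Assumed skill ordering, lowest to highest (not confirmed by the dataset).
experience_order = ["Beginner", "Amateur", "Regular", "Pro", "Veteran"]

# Toy data with one row per experience level.
toy = pd.DataFrame(
    {"experience": ["Pro", "Veteran", "Amateur", "Regular", "Beginner"]}
)

# An ordered categorical assigns codes by position in experience_order.
ordered = pd.Categorical(
    toy["experience"], categories=experience_order, ordered=True
)
toy["experience_ordinal"] = ordered.codes
print(toy)
```

With this mapping, neighbouring codes correspond to neighbouring skill levels under the assumed ordering.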


In [28]:
# Split the data into training and testing sets.
# The testing data is locked away; visualization uses only the training data.
players_train, players_test = train_test_split(
    players_tidy,
    test_size=0.25,
    random_state=2000
)

players_train

Unnamed: 0,experience,subscribe,age,experience_numerical
165,Regular,True,21,3
49,Beginner,True,22,1
6,Regular,True,19,3
77,Regular,True,17,3
88,Beginner,True,17,1
...,...,...,...,...
28,Amateur,True,23,0
123,Beginner,False,17,1
54,Beginner,False,42,1
72,Veteran,True,17,4


**Training/Test Split**

After wrangling and splitting the dataset, we end up with `players_train` and `players_test`. `players_train` will be used for visualization and for training the model. We lock away the testing dataset so that the model never sees any test observations, which gives the most honest estimate of how well our model performs. Each set contains the columns `subscribe` (our class variable), `age` and `experience_numerical` (our predictor variables), and the original `experience` labels.

Now we will use **altair** to visualize whether player age or experience is related to whether a player subscribes to the newsletter.

In [40]:
players_plot_age = alt.Chart(players_train, title="Newsletter Subscription by Age Distribution").mark_bar().encode(
    x = alt.X("age").bin(maxbins=20).title("Player's Age"),
    y = alt.Y("count()").title("Number of Players"),
    color = alt.Color("subscribe").title("subscribe")
)

players_plot_experience = alt.Chart(players_train, title= "Newsletter Subscription by Player Experience Distribution").mark_bar().encode(
    x = alt.X("experience:N").title("Player's Experience"),
    y = alt.Y("count()").title("Number of Players"),
    color = alt.Color("subscribe").title("subscribe")
)

players_plot_age | players_plot_experience

This visualization **does not** show a clear relationship between player age or experience and subscription status. Furthermore, the distribution is imbalanced, which may introduce issues into our model, as KNN relies on Euclidean distance and assigns each point a class by majority vote of its nearest neighbors.
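
If the imbalance in question is between subscribers and non-subscribers, one common mitigation is to stratify the train/test split so both sets keep the same class proportions. The sketch below uses a made-up toy frame (`toy`, 12 subscribers and 4 non-subscribers); it is a possible variation, not what this notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 3:1 class imbalance (values are made up).
toy = pd.DataFrame({
    "age": list(range(16, 32)),
    "subscribe": [True] * 12 + [False] * 4,
})

# Share of each class before splitting.
class_share = toy["subscribe"].value_counts(normalize=True)

# stratify= asks the splitter to preserve class proportions in both sets.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=2000, stratify=toy["subscribe"]
)

# Fraction of subscribers in each split.
train_share = toy_train["subscribe"].mean()
test_share = toy_test["subscribe"].mean()
print(class_share, train_share, test_share)
```

Stratifying does not remove the imbalance itself; it only keeps the test set representative of it.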

# **Analysis - KNN Classification**

We begin by setting up the preprocessing step, which standardizes both predictors.

In [48]:
players_preprocessor = make_column_transformer(
    (StandardScaler(), ["age", "experience_numerical"])
)
players_preprocessor

Next, we create a kNN classifier, combine it with the preprocessor in a pipeline, and define the predictors `X` and the class label `y` from the training set.

In [49]:
players_knn = KNeighborsClassifier()
players_pipe = make_pipeline(players_preprocessor, players_knn)

X = players_train[["age", "experience_numerical"]]
y = players_train["subscribe"]
players_pipe

We will choose the best K value with cross validation.

In [52]:
#Specify the grid of parameter values to test
parameter_grid = {
    "kneighborsclassifier__n_neighbors" : range (1, 31),
}

#Create GridSearchCV object
players_grid = GridSearchCV(
    estimator = players_pipe,
    param_grid = parameter_grid,
    cv = 5
)

#Fit to GridSearchCV
players_grid.fit(
    players_train[["age", "experience_numerical"]],
    players_train["subscribe"]
)
accuracies_grid = pd.DataFrame(players_grid.cv_results_)

accuracies_grid

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsclassifier__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.003619,0.000183,0.004258,0.000297,1,{'kneighborsclassifier__n_neighbors': 1},0.7,0.5,0.689655,0.655172,0.551724,0.61931,0.079433,28
1,0.003327,3.5e-05,0.003706,4.6e-05,2,{'kneighborsclassifier__n_neighbors': 2},0.666667,0.366667,0.551724,0.586207,0.482759,0.530805,0.101208,30
2,0.003557,0.00041,0.005427,0.003106,3,{'kneighborsclassifier__n_neighbors': 3},0.733333,0.633333,0.689655,0.689655,0.724138,0.694023,0.035139,25
3,0.003354,8.5e-05,0.00369,2.5e-05,4,{'kneighborsclassifier__n_neighbors': 4},0.666667,0.666667,0.413793,0.62069,0.62069,0.597701,0.094225,29
4,0.003606,0.000546,0.003783,0.000168,5,{'kneighborsclassifier__n_neighbors': 5},0.7,0.733333,0.724138,0.689655,0.724138,0.714253,0.016539,22
5,0.005277,0.00385,0.00365,2.5e-05,6,{'kneighborsclassifier__n_neighbors': 6},0.7,0.7,0.448276,0.655172,0.689655,0.638621,0.096586,27
6,0.003278,6.5e-05,0.00362,1.5e-05,7,{'kneighborsclassifier__n_neighbors': 7},0.7,0.766667,0.655172,0.758621,0.724138,0.72092,0.040706,21
7,0.003557,0.000645,0.003661,0.000106,8,{'kneighborsclassifier__n_neighbors': 8},0.7,0.7,0.655172,0.758621,0.689655,0.70069,0.033318,24
8,0.003317,0.000185,0.003589,3.4e-05,9,{'kneighborsclassifier__n_neighbors': 9},0.7,0.733333,0.758621,0.758621,0.758621,0.741839,0.023099,1
9,0.00322,8e-06,0.003579,2.5e-05,10,{'kneighborsclassifier__n_neighbors': 10},0.7,0.733333,0.551724,0.758621,0.724138,0.693563,0.07336,26


In [54]:
# Plot the accuracy estimate (y-axis) against the K value (x-axis)
cross_val_plot = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x = alt.X("param_kneighborsclassifier__n_neighbors").title("Neighbours"),
    y = alt.Y("mean_test_score")
    .scale(zero=False)
    .title("Accuracy estimate")
)

cross_val_plot

We can see from our plot that **k = 9** yields the highest accuracy, so we select 9 as the optimal number of neighbors.
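
The best value can also be read off the fitted search object rather than the plot. The sketch below is self-contained and runs on synthetic data (`X_toy`, `y_toy`, and `toy_grid` are illustrative names, not objects from this notebook); it shows the `best_params_` and `best_score_` attributes of `GridSearchCV`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data standing in for the real predictors.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 2))
y_toy = X_toy[:, 0] + 0.5 * rng.normal(size=60) > 0

# Same structure as the search above: scale, then KNN, tuned over K.
toy_grid = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 16)},
    cv=5,
)
toy_grid.fit(X_toy, y_toy)

# best_params_ holds the setting with the highest mean cross-validated score.
best_k = toy_grid.best_params_["kneighborsclassifier__n_neighbors"]
best_score = toy_grid.best_score_
print(best_k, round(best_score, 3))
```

Applied to this notebook, the same attributes on `players_grid` should agree with the row ranked 1 in the results table above.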

With this value, we proceed to create our final kNN model.

In [55]:
players_knn_true = KNeighborsClassifier(n_neighbors = 9)
players_pipeline_true = make_pipeline(players_preprocessor, players_knn_true)
players_pipeline_true.fit(X, y)

Finally, we predict labels for the test set and evaluate the model's accuracy, recall, and precision.

In [56]:
players_test_predictions = players_test.assign(
    predicted=players_pipeline_true.predict(players_test[["age", "experience_numerical"]])
)

players_test_predictions

Unnamed: 0,experience,subscribe,age,experience_numerical,predicted
111,Regular,True,21,3,True
73,Veteran,True,22,4,True
3,Amateur,True,21,0,True
149,Amateur,True,16,0,True
80,Veteran,True,17,4,True
172,Veteran,True,20,4,True
61,Regular,True,20,3,True
157,Regular,True,99,3,False
65,Veteran,True,21,4,True
50,Veteran,True,21,4,True


In [57]:
players_pipeline_true.score(
    players_test[["age", "experience_numerical"]],
    players_test["subscribe"]
)

0.7346938775510204

**Accuracy score:** 0.7346938775510204

In [62]:
recall_score(
    y_true=players_test_predictions["subscribe"],
    y_pred=players_test_predictions["predicted"],
    pos_label=True
)

np.float64(0.9722222222222222)

**Recall score:** 0.9722222222222222

In [63]:
precision_score(
    y_true=players_test_predictions["subscribe"],
    y_pred=players_test_predictions["predicted"],
    pos_label=True
)

np.float64(0.7446808510638298)

**Precision score:** 0.7446808510638298

# **Discussion**

**Overview**

Our project investigated how well simple demographic data, namely **age** and **self-reported experience level**, predict whether a player subscribes to the newsletter. We trained and tested a k-Nearest Neighbors (kNN) classifier with **k = 9** on the filtered dataset taken from `players.csv`.

Overall, the model achieved **an accuracy of 0.7347**, meaning that it correctly classified about 73% of players in the test set. Given that this model only used two features, this looks reasonably accurate. However, accuracy alone is not enough to judge performance, so we also look at the precision and recall scores for the positive class (subscribed). The model had a **recall of 0.9722**, so it identified almost all true subscribers. In other words, the model very rarely misses someone who actually subscribed.

Conversely, the **precision score was 0.7447**, meaning that when the model predicts that a player is a subscriber, it is right only about 74% of the time. The gap between recall and precision is fairly large, which indicates that the classifier **overpredicts** the positive class, labeling some non-subscribers as subscribers. This tendency is consistent with the exploratory data analysis we did earlier: neither age nor experience showed a strong distinction between the two groups. The age distribution was skewed, and the experience categories overlapped.
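
The three reported scores are enough to reconstruct the test-set confusion counts, which makes the overprediction concrete. The sketch below is arithmetic only and rests on one assumption: that the test set holds 49 players (25% of the 196 observations in `players.csv`). The variable names are illustrative.

```python
from fractions import Fraction

# Assumption: 49 test players (25% of the 196 rows in players.csv).
n_test = 49

# Recover exact fractions from the reported scores.
accuracy = Fraction(0.7346938775510204).limit_denominator(n_test)
recall = Fraction(0.9722222222222222).limit_denominator(n_test)
precision = Fraction(0.7446808510638298).limit_denominator(n_test)

tp = recall.numerator                   # true positives
actual_pos = recall.denominator         # actual subscribers in the test set
predicted_pos = precision.denominator   # players predicted as subscribers
fn = actual_pos - tp                    # subscribers the model missed
fp = predicted_pos - tp                 # non-subscribers labeled as subscribers
tn = n_test - tp - fn - fp              # non-subscribers labeled correctly

# Accuracy of always predicting "subscribed" (majority-class baseline).
baseline_accuracy = Fraction(actual_pos, n_test)
print(tp, fp, fn, tn, float(baseline_accuracy))
```

Under that assumption, the counts come out to 35 true positives, 12 false positives, 1 false negative, and 1 true negative. A classifier that always predicts "subscribed" would then score 36/49 ≈ 0.7347, the same as the model's accuracy, which supports reading the high recall as overprediction rather than real discriminatory power.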

**Conclusions**

From our findings, **age and experience are not good predictors** of whether someone will subscribe to the newsletter. Though the model is very sensitive in detecting actual subscribers, it isn't specific enough to be a reliable classifier of subscription status. This suggests that demographic variables alone do not determine subscription behavior.

To improve the model's predictions in the future, more features should be added, especially behavioral ones such as **`played_hours`** (already available in `players.csv`) and **session-level patterns** such as session counts and durations, which can be derived from `sessions.csv`. These characteristics are more directly connected to engagement, so they are more likely to reveal differences between subscribers and non-subscribers.
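
As a rough illustration of deriving session-level features, the sketch below aggregates a toy sessions table into one row per player. The timestamp format, the values, and the feature names `n_sessions` and `total_minutes` are assumptions made for the example; the real `start_time` and `end_time` strings may need an explicit `format=` argument.

```python
import pandas as pd

# Toy sessions table (values and timestamp format are assumptions).
sessions = pd.DataFrame({
    "hashedEmail": ["a1", "a1", "b2"],
    "start_time": ["2024-06-30 18:00", "2024-07-01 09:00", "2024-07-02 20:00"],
    "end_time": ["2024-06-30 18:30", "2024-07-01 10:00", None],
})

# Parse timestamps; a missing end time becomes NaT.
start = pd.to_datetime(sessions["start_time"])
end = pd.to_datetime(sessions["end_time"])

# Session length in minutes (NaN where the end time is missing).
sessions["minutes"] = (end - start).dt.total_seconds() / 60

# One row per player: number of sessions and total minutes played.
# Note that sum() skips NaN, so sessions without an end time add 0 minutes.
session_features = (
    sessions.groupby("hashedEmail")
    .agg(n_sessions=("start_time", "size"), total_minutes=("minutes", "sum"))
    .reset_index()
)
print(session_features)
```

The resulting table could then be joined to the player data on `hashedEmail` before modelling.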

In summary, the findings indicate that **age and experience are not strong standalone predictors** of newsletter sign-up. They help the model find almost all the subscribers, but they do not provide enough discriminatory power for accurate classification. Integrating behavioral and session data would likely yield better results in future models and a more effective player-engagement prediction system.
