# **Introduction**

- `experience`: a categorical variable that describes each player's level of experience. The possible values are "Beginner", "Amateur", "Regular", "Pro", and "Veteran".
- `subscribe`: a categorical variable that indicates whether each player has subscribed to the newsletter. Its possible values are "TRUE" and "FALSE".
- `hashedEmail`: a categorical identifier that is never repeated because it represents each user's email. It will not contribute to our analysis because its sole purpose is to distinguish players from one another, which the `name` column also does.
- `played_hours`: a continuous variable that records how long each player has played, in hours. There are no missing values, but the distribution is heavily skewed: many players have close to 0 hours, and only a few exceed 50 hours. This column will be interesting to analyze because these extreme values may act as outliers or noise points that affect our prediction model.
- `name`: a categorical variable that gives the username for each observation, used to differentiate and track entries. It serves the same purpose as `hashedEmail` in this dataset and is easier to work with because its values are simpler.
- `gender`: a categorical variable that describes the player's gender. The categories are "Male", "Female", "Non-binary", "Prefer not to say", "Agender", "Two-spirited", and "Other". Most observations fall under "Male" and "Female", with few in the other categories. These small categories may act as outliers or noise points that affect our prediction model.
- `age`: a numeric variable that describes the player's age in years.
- `individualId`: this column has no observations in any row. We may remove it during wrangling because it provides no information for our research question.
- `organizationName`: this column has no observations in any row. We may remove it during wrangling because it provides no information for our research question.


## **Data Description**

**Dataset Overview**  
Our project will use two datasets collected from UBC's Computer Science department's research Minecraft server:  

**`players.csv`** consists of information about players in the server.  
**`sessions.csv`** consists of details from each gameplay session in the server.  

The data were gathered automatically through server logs, which recorded player activities such as joining, playing, and exiting the world. A separate dataset containing information about each player was also provided. Both datasets share a `hashedEmail` field, which links the two together.
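As a rough illustration of how the shared `hashedEmail` field could link the two tables, the sketch below merges two tiny made-up data frames. The values, the `_demo` names, and the `n_sessions` column are hypothetical and are not taken from the real files.

```python
import pandas as pd

# Hypothetical miniature versions of players.csv and sessions.csv
players_demo = pd.DataFrame({
    "hashedEmail": ["a1", "b2", "c3"],
    "subscribe": [True, False, True],
})
sessions_demo = pd.DataFrame({
    "hashedEmail": ["a1", "a1", "b2"],
    "start_time": ["01/05/2024 10:00", "02/05/2024 11:00", "03/05/2024 12:00"],
})

# Count sessions per player, then attach the counts to the player table
session_counts = (
    sessions_demo.groupby("hashedEmail")
    .size()
    .reset_index(name="n_sessions")
)
linked = players_demo.merge(session_counts, on="hashedEmail", how="left")

# Players with no recorded sessions get a count of 0
linked["n_sessions"] = linked["n_sessions"].fillna(0).astype(int)
print(linked)
```

A left merge keeps every player, including those who never appear in the sessions table.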

## **Observations of Datasets:**

**`players.csv`**
- 196 Observations
- 9 Variables

| Variable | Type | Description |
| --- | --- | --- |
| `experience` | object | Self-reported experience level |
| `subscribe` | bool | Newsletter subscription status |
| `hashedEmail` | object | Player identifier |
| `played_hours` | float | Total hours played |
| `name` | object | Player name |
| `gender` | object | Gender |
| `age` | int | Age in years |
| `individualId` | float | Empty |
| `organizationName` | float | Empty |       

**Key takeaways (players.csv):**  
- `individualId` and `organizationName` contain no data and are best removed.
- `played_hours` is a strong candidate behavioural feature.
- `hashedEmail` is the key identifier.


**`sessions.csv`**
- 1,535 Observations
- 5 Variables

| Variable | Type | Description |
| --- | --- | --- |
| `hashedEmail` | object | Player identifier |
| `start_time` | object | Session start timestamp |
| `end_time` | object | Session end timestamp |
| `original_start_time` | float | Float form start time |
| `original_end_time` | float | Float form end time |

**Key takeaways (sessions.csv):**  
- Either the original or the finalized start/end times can be used; whichever is not used can be dropped.
- A few users are missing `end_time` and `original_end_time`; it is not yet clear whether this will affect the analysis.
- `hashedEmail` is also the key identifier, but appears multiple times per player here.


### Potential Issues
- Missing or inconsistent timestamps.
- Players with extremely long sessions (potential outliers).
- Overlap or duplication in session data.
- Possible selection bias, since the data only include players who joined the server.
- Empty/duplicate columns like `original_start_time` and `original_end_time` or `individualId`.
- Not every `hashedEmail` in players appears in sessions, and vice versa.
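Several of these issues can be checked with a few lines of pandas. The sketch below uses made-up rows; the day/month/year timestamp format and the `_demo` names are assumptions for illustration and should be confirmed against the real files.

```python
import pandas as pd

# Hypothetical sessions; the timestamp format is an assumption
sessions_demo = pd.DataFrame({
    "hashedEmail": ["a1", "a1", "b2", "d4"],
    "start_time": ["01/05/2024 10:00", "01/05/2024 10:00",
                   "03/05/2024 12:00", "04/05/2024 09:00"],
    "end_time": ["01/05/2024 10:30", "01/05/2024 10:30",
                 None, "04/05/2024 21:00"],
})
player_ids = pd.Series(["a1", "b2", "c3"], name="hashedEmail")

# Missing timestamps and exact duplicate rows
n_missing_end = int(sessions_demo["end_time"].isna().sum())
n_duplicates = int(sessions_demo.duplicated().sum())

# Session length in hours, useful for spotting extremely long sessions
fmt = "%d/%m/%Y %H:%M"
start = pd.to_datetime(sessions_demo["start_time"], format=fmt)
end = pd.to_datetime(sessions_demo["end_time"], format=fmt)
duration_hours = (end - start).dt.total_seconds() / 3600

# Identifiers that appear in only one of the two tables
unmatched_sessions = set(sessions_demo["hashedEmail"]) - set(player_ids)
players_without_sessions = set(player_ids) - set(sessions_demo["hashedEmail"])

print(n_missing_end, n_duplicates, duration_hours.max())
print(unmatched_sessions, players_without_sessions)
```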

We will focus on Question 1 from the project description:  
> What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Specific Research Question:
> Can we predict whether a player subscribes to the newsletter using their in-game and demographic features such as age, gender, and experience level?

This question is important because identifying the player attributes associated with newsletter subscription allows the research team to direct its efforts toward the players who are most likely to be actively involved.

**Variables of Interest**
- Response Variable: `subscribe`
- Explanatory Variables: `age` and `experience`.

We chose to explore `age` and `experience` to see whether these two characteristics can predict whether a player will subscribe. We chose `age` because advertising can be more effective when it targets specific age groups. We chose `experience` because it represents interest in and commitment to the game.

We will create and train a **k-Nearest Neighbors (kNN)** classification model to test how well these player characteristics predict newsletter subscription. The model will be trained on the selected features (`age` and `experience`), and **cross-validation** will be used to choose the number of neighbours and estimate accuracy. The `subscribe` variable will serve as the class label.

## Methods

We will start by importing the datasets using pandas, verifying their structure, and then tidying the data.

**Steps:**
- Load the data with `read_csv` from the CSV URL.
- Check the structure of the data using `head()`.
- Check the data for any missing or inconsistent values.
- Create simple visualizations to explore how each variable relates to subscription status.
- Split the players dataset into training and test sets.
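The first three steps can be sketched as follows. The CSV text here is a hypothetical stand-in so the example runs without network access; `pd.read_csv` accepts a URL in the same position.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV (the last row has a missing age)
csv_text = """experience,subscribe,age
Pro,True,9
Veteran,False,17
Amateur,True,
"""
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.head())   # first rows of the table
print(demo.dtypes)   # column types as inferred by pandas

# Number of missing values in each column
missing_per_column = demo.isna().sum()
print(missing_per_column)
```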

**Observations:**
- The age histogram and the experience bar chart will show which age groups and experience levels are more or less likely to subscribe.
- These patterns are descriptive and will help guide the later analysis.

**Visualization:** 
- A histogram showing the distribution of player `age` for subscribers and non-subscribers.   
- A bar chart comparing average subscription rates by `experience` level.
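The numbers behind the second chart can be computed directly with pandas. This is a minimal sketch on made-up rows; `demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical players
demo = pd.DataFrame({
    "experience": ["Pro", "Pro", "Amateur", "Amateur", "Amateur", "Veteran"],
    "subscribe": [True, False, True, True, False, False],
})

# The mean of a Boolean column is the proportion of True values,
# so this gives the subscription rate within each experience level
rate_by_experience = demo.groupby("experience")["subscribe"].mean()
print(rate_by_experience)
```

The resulting series could then be passed to a bar chart with one bar per experience level.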

**KNN Prediction Model**
- Preprocess the training dataset
- Train the KNN model
- Create and fit the pipeline
- Perform 5-fold cross-validation on the players training dataset
- Plot the cross-validation results
- Pick the best K value based on the cross-validation plot
- Evaluate on the test set (determine the accuracy, precision, and recall scores)

## **Data Wrangling**

In [4]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier

In [5]:
url = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players = pd.read_csv(url)

# Keep only the response (subscribe) and the two explanatory variables (experience, age)
players_tidy = players.drop(
    columns=["played_hours", "hashedEmail", "individualId", "organizationName", "name", "gender"]
)
players_tidy

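| | experience | subscribe | age |
| --- | --- | --- | --- |
| 0 | Pro | True | 9 |
| 1 | Veteran | True | 17 |
| 2 | Veteran | False | 17 |
| 3 | Amateur | True | 21 |
| 4 | Regular | True | 21 |
| ... | ... | ... | ... |
| 191 | Amateur | True | 17 |
| 192 | Veteran | False | 22 |
| 193 | Amateur | False | 17 |
| 194 | Amateur | False | 17 |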


In [6]:
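# Add a numerical column for experience.
# LabelEncoder assigns codes in alphabetical order
# (Amateur=0, Beginner=1, Pro=2, Regular=3, Veteran=4), not in order of skill.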

players_tidy["experience_numerical"]= LabelEncoder().fit_transform(
    players_tidy["experience"])

players_tidy

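| | experience | subscribe | age | experience_numerical |
| --- | --- | --- | --- | --- |
| 0 | Pro | True | 9 | 2 |
| 1 | Veteran | True | 17 | 4 |
| 2 | Veteran | False | 17 | 4 |
| 3 | Amateur | True | 21 | 0 |
| 4 | Regular | True | 21 | 3 |
| ... | ... | ... | ... | ... |
| 191 | Amateur | True | 17 | 0 |
| 192 | Veteran | False | 22 | 4 |
| 193 | Amateur | False | 17 | 0 |
| 194 | Amateur | False | 17 | 0 |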
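In the output above, `LabelEncoder` numbers the levels alphabetically (Amateur = 0, Pro = 2, Regular = 3), so the codes do not follow skill order. Because KNN relies on distances between points, an explicit ordinal mapping may be worth considering as an alternative. The sketch below is not what the analysis in this report uses; the skill order is an assumption taken from the order in which the levels are listed in the introduction, and `experience_ordinal` is a hypothetical column name.

```python
import pandas as pd

# Assumed skill order, lowest to highest
experience_order = ["Beginner", "Amateur", "Regular", "Pro", "Veteran"]
experience_codes = {level: code for code, level in enumerate(experience_order)}

# Hypothetical players
demo = pd.DataFrame(
    {"experience": ["Pro", "Veteran", "Amateur", "Regular", "Beginner"]}
)

# Map each level to its position in the assumed order
demo["experience_ordinal"] = demo["experience"].map(experience_codes)
print(demo)
```

With this mapping, neighbouring codes correspond to neighbouring skill levels, which is what a distance-based model implicitly assumes.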


In [8]:
# Split the data into training and testing sets.
# Lock away the testing data and perform visualization on the training data only.
players_train, players_test = train_test_split(
    players_tidy,
    test_size=0.25,
    random_state=2000
)
players_train

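| | experience | subscribe | age | experience_numerical |
| --- | --- | --- | --- | --- |
| 165 | Regular | True | 21 | 3 |
| 49 | Beginner | True | 22 | 1 |
| 6 | Regular | True | 19 | 3 |
| 77 | Regular | True | 17 | 3 |
| 88 | Beginner | True | 17 | 1 |
| ... | ... | ... | ... | ... |
| 28 | Amateur | True | 23 | 0 |
| 123 | Beginner | False | 17 | 1 |
| 54 | Beginner | False | 42 | 1 |
| 72 | Veteran | True | 17 | 4 |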


After wrangling and splitting the dataset, we have `players_train` and `players_test`. We will use `players_train` for visualization and for training the model. We will lock away `players_test` so that the model has never seen it, which gives the most honest estimate of how well our model performs. Both contain the columns `subscribe` (our class variable), `age` and `experience` (our predictive variables), and `experience_numerical` (the encoded form of `experience`).

Now we will create some visualizations to see whether there is a relationship between a player's age or experience and whether they subscribe to the newsletter.

In [9]:
players_plot_age = alt.Chart(players_train).mark_bar().encode(
    x = alt.X("age").bin().title("Player's Age"),
    y = alt.Y("count()").title("Number of Players"),
    color = alt.Color("subscribe").title("subscribe")
)
players_plot_age

players_plot_experience = alt.Chart(players_train).mark_bar().encode(
    x = alt.X("experience:N").title("Player's Experience"),
    y = alt.Y("count()").title("Number of Players"),
    color = alt.Color("subscribe").title("subscribe")
)
players_plot_age | players_plot_experience

This visualization does not show a clear relationship between a player's age and subscription status. Furthermore, the observations are unevenly distributed, which may introduce error in our model: KNN classifies each point by a majority vote among its nearest neighbours (measured by Euclidean distance), so densely populated groups can dominate that vote.
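One quick way to quantify an imbalance in the class label is to look at the class proportions. The largest proportion is the accuracy a classifier would reach by always predicting the most common class, which gives a reference point for the KNN accuracy estimates. The labels below are made up for illustration.

```python
import pandas as pd

# Hypothetical class labels
demo_labels = pd.Series([True, True, True, False], name="subscribe")

# Share of each class in the data
class_share = demo_labels.value_counts(normalize=True)

# Accuracy of always predicting the most common class
majority_baseline = class_share.max()
print(class_share)
print(majority_baseline)
```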

## **Analysis - KNN Classification**

Preprocess the training set

In [10]:
players_preprocessor = make_column_transformer(
    (StandardScaler(), ["age", "experience_numerical"])
)
players_preprocessor
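Standardizing matters for KNN because `age` and `experience_numerical` are on different scales, and the variable with the larger spread would otherwise dominate the Euclidean distance. The sketch below, on made-up numbers, checks that `StandardScaler` matches the manual calculation z = (x - mean) / std.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: column 0 plays the role of age, column 1 of experience_numerical
demo = np.array([[9.0, 2.0], [17.0, 4.0], [21.0, 0.0], [42.0, 1.0]])

scaled = StandardScaler().fit_transform(demo)

# Manual equivalent, using the population standard deviation (ddof=0)
manual = (demo - demo.mean(axis=0)) / demo.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, so both variables contribute comparably to the distance.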

Create the KNN model and build the pipeline

In [11]:
players_knn = KNeighborsClassifier()
players_pipe = make_pipeline(players_preprocessor, players_knn)

X = players_train[["age", "experience_numerical"]]
y = players_train["subscribe"]
players_pipe

Choose the best K value with cross-validation

In [15]:
#Specify the grid of parameter values to test
parameter_grid = {
    "kneighborsclassifier__n_neighbors" : range (1, 31),
}

#Create GridSearchCV object
players_grid = GridSearchCV(
    estimator = players_pipe,
    param_grid = parameter_grid,
    cv = 5
)

#Fit to GridSearchCV
players_grid.fit(
    players_train[["age", "experience_numerical"]],
    players_train["subscribe"]
)
accuracies_grid = pd.DataFrame(players_grid.cv_results_)
accuracies_grid

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_kneighborsclassifier__n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.003609 | 0.000292 | 0.003894 | 0.000337 | 1 | `{'kneighborsclassifier__n_neighbors': 1}` | 0.7 | 0.5 | 0.689655 | 0.655172 | 0.551724 | 0.61931 | 0.079433 | 28 |
| 1 | 0.003355 | 4.4e-05 | 0.003668 | 4.1e-05 | 2 | `{'kneighborsclassifier__n_neighbors': 2}` | 0.666667 | 0.366667 | 0.551724 | 0.586207 | 0.482759 | 0.530805 | 0.101208 | 30 |
| 2 | 0.003303 | 1.4e-05 | 0.00363 | 2.1e-05 | 3 | `{'kneighborsclassifier__n_neighbors': 3}` | 0.733333 | 0.633333 | 0.689655 | 0.689655 | 0.724138 | 0.694023 | 0.035139 | 25 |
| 3 | 0.003341 | 8e-05 | 0.003852 | 0.000394 | 4 | `{'kneighborsclassifier__n_neighbors': 4}` | 0.666667 | 0.666667 | 0.413793 | 0.62069 | 0.62069 | 0.597701 | 0.094225 | 29 |
| 4 | 0.003297 | 8.1e-05 | 0.003646 | 5.6e-05 | 5 | `{'kneighborsclassifier__n_neighbors': 5}` | 0.7 | 0.733333 | 0.724138 | 0.689655 | 0.724138 | 0.714253 | 0.016539 | 22 |
| 5 | 0.003351 | 8.9e-05 | 0.003641 | 3e-05 | 6 | `{'kneighborsclassifier__n_neighbors': 6}` | 0.7 | 0.7 | 0.448276 | 0.655172 | 0.689655 | 0.638621 | 0.096586 | 27 |
| 6 | 0.003332 | 0.000113 | 0.003596 | 2.2e-05 | 7 | `{'kneighborsclassifier__n_neighbors': 7}` | 0.7 | 0.766667 | 0.655172 | 0.758621 | 0.724138 | 0.72092 | 0.040706 | 21 |
| 7 | 0.003249 | 1.6e-05 | 0.003617 | 2.5e-05 | 8 | `{'kneighborsclassifier__n_neighbors': 8}` | 0.7 | 0.7 | 0.655172 | 0.758621 | 0.689655 | 0.70069 | 0.033318 | 24 |
| 8 | 0.003337 | 0.000198 | 0.003597 | 2.7e-05 | 9 | `{'kneighborsclassifier__n_neighbors': 9}` | 0.7 | 0.733333 | 0.758621 | 0.758621 | 0.758621 | 0.741839 | 0.023099 | 1 |
| 9 | 0.003263 | 6e-05 | 0.003585 | 2.4e-05 | 10 | `{'kneighborsclassifier__n_neighbors': 10}` | 0.7 | 0.733333 | 0.551724 | 0.758621 | 0.724138 | 0.693563 | 0.07336 | 26 |


In [16]:
# Plot the accuracy estimate (y-axis) against the number of neighbours K (x-axis)
cross_val_plot = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x = alt.X("param_kneighborsclassifier__n_neighbors").title("Neighbours"),
    y = alt.Y("mean_test_score")
    .scale(zero=False)
    .title("Accuracy estimate")
)
cross_val_plot

We will choose K = 9, which has the highest mean cross-validation accuracy (about 0.74, with `rank_test_score` of 1).
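The same choice can be made programmatically rather than by reading the plot. The sketch below copies the mean accuracies for K = 1 to 10 from the results table above into a small data frame; `cv_demo` is a hypothetical name used only for this illustration.

```python
import pandas as pd

# Mean cross-validation accuracies for K = 1..10, copied from the results table
cv_demo = pd.DataFrame({
    "n_neighbors": list(range(1, 11)),
    "mean_test_score": [0.61931, 0.530805, 0.694023, 0.597701, 0.714253,
                        0.638621, 0.72092, 0.70069, 0.741839, 0.693563],
})

# Row with the highest mean accuracy
best_row = cv_demo.loc[cv_demo["mean_test_score"].idxmax()]
best_k = int(best_row["n_neighbors"])
print(best_k)
```

On a fitted `GridSearchCV` object, the `best_params_` attribute reports the selected parameter values directly.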

Create and fit the final model

In [18]:
players_knn_true = KNeighborsClassifier(n_neighbors = 9)
players_pipeline_true = make_pipeline(players_preprocessor, players_knn_true)
players_pipeline_true.fit(X, y)
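The remaining step from the plan is to evaluate the fitted pipeline on the test set using accuracy, precision, and recall. The sketch below shows one way this could look. It runs on randomly generated stand-in data (the `demo` names are hypothetical), so its scores say nothing about the real players dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for players_tidy (random values, for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(9, 50, size=120),
    "experience_numerical": rng.integers(0, 5, size=120),
    "subscribe": rng.random(120) < 0.7,
})
features = ["age", "experience_numerical"]
demo_train, demo_test = train_test_split(demo, test_size=0.25, random_state=2000)

# Same structure as the pipeline above: scale the features, then KNN with K = 9
pipe = make_pipeline(
    make_column_transformer((StandardScaler(), features)),
    KNeighborsClassifier(n_neighbors=9),
)
pipe.fit(demo_train[features], demo_train["subscribe"])

# Predict on the held-out rows and compare with the true labels
predictions = pipe.predict(demo_test[features])
scores = {
    "accuracy": accuracy_score(demo_test["subscribe"], predictions),
    "precision": precision_score(demo_test["subscribe"], predictions, zero_division=0),
    "recall": recall_score(demo_test["subscribe"], predictions, zero_division=0),
}
print(scores)
```

Here precision and recall treat `True` (subscribed) as the positive class. With the real data, `players_pipeline_true` and `players_test` would take the place of `pipe` and `demo_test`.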