# **Introduction**

- `experience`: a categorical variable describing each player's level of experience. The possible values are "Beginner", "Amateur", "Regular", "Pro", and "Veteran".
- `subscribe`: a categorical variable indicating whether each player has subscribed to the newsletter. Its possible values are "TRUE" and "FALSE".
- `hashedEmail`: a categorical variable holding each player's hashed email, so no value is repeated. It will not contribute to our analysis because its sole purpose is to tell players apart, which the `name` column below also does.
- `played_hours`: a continuous variable giving the total time each player has played, in hours. There are no missing values, but the distribution is heavily skewed: many players have close to 0 hours, and only a few have more than 50. This column will be interesting to analyze, since these extreme values may act as outliers or noise that affect our prediction model.
- `name`: a categorical variable holding each player's username, used to differentiate and track each entry. It serves the same purpose as `hashedEmail` in this dataset and is easier to work with because its values are simpler.
- `gender`: a categorical variable describing the player's gender. Its values are "Male", "Female", "Non-binary", "Prefer not to say", "Agender", "Two-spirited", and "Other". Most observations fall under "Male" and "Female", with few in the other categories. These rare categories may act as outliers or noise that affect our prediction model.
- `age`: a continuous variable describing the player's age.
- `individualId`: this column has no observations in any row. We may remove it from the dataset during wrangling, since it provides no information for our analysis.
- `organizationName`: this column has no observations in any row. We may remove it from the dataset during wrangling, since it provides no information for our analysis.


## **Data Description**

**Dataset Overview**  
Our project will use two datasets collected from a research Minecraft server run by UBC's Computer Science department:  

**`players.csv`** consists of information about players in the server.  
**`sessions.csv`** consists of details from each gameplay session in the server.  

The session data was gathered automatically through server logs, which recorded player activities such as joining, playing, and exiting the world. A separate dataset containing information about each player was also provided. Both datasets share a `hashedEmail` field, which links the two together.

## **Observations of Datasets:**

**`players.csv`**
- 196 Observations
- 9 Variables

| Variable | Type | Description |
| --- | --- | --- |
| `experience` | object | Self-reported experience level |
| `subscribe` | bool | Newsletter subscription status |
| `hashedEmail` | object | Player identifier |
| `played_hours` | float | Total hours played |
| `name` | object | Player name |
| `gender` | object | Gender |
| `age` | int | Age in years |
| `individualId` | float | Empty |
| `organizationName` | float | Empty |       

**Key takeaways (players.csv):**  
- `individualId` and `organizationName` hold no data, so it would be best to remove them.
- `played_hours` is a strong candidate behavioral variable to build on.
- `hashedEmail` is the key identifier.
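
As a minimal sketch of the first takeaway, fully empty columns can be found and dropped with `DataFrame.dropna(axis=1, how="all")`. The small `players_demo` frame below is made-up illustrative data, not rows from `players.csv`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for players.csv (illustrative values only).
players_demo = pd.DataFrame({
    "hashedEmail": ["a1", "b2", "c3"],
    "played_hours": [0.0, 1.5, 30.3],
    "individualId": [np.nan, np.nan, np.nan],
    "organizationName": [np.nan, np.nan, np.nan],
})

# Count missing values per column to confirm which columns are empty.
missing_per_column = players_demo.isna().sum()

# Drop only the columns in which every value is missing.
players_clean = players_demo.dropna(axis=1, how="all")

print(missing_per_column)
print(list(players_clean.columns))
```

Using `how="all"` leaves partially filled columns untouched, so a column with only a few missing values is not removed by accident.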


**`sessions.csv`**
- 1,535 Observations
- 5 Variables

| Variable | Type | Description |
| --- | --- | --- |
| `hashedEmail` | object | Player identifier |
| `start_time` | object | Session start timestamp |
| `end_time` | object | Session end timestamp |
| `original_start_time` | float | Session start time in numeric (float) form |
| `original_end_time` | float | Session end time in numeric (float) form |

**Key takeaways (sessions.csv):**  
- We can use either the original or the finalized start/end times; whichever pair is not used can be dropped.
- A few sessions are missing `end_time` and `original_end_time`; it is not yet clear whether this will affect the analysis.  
- `hashedEmail` is also the key identifier, but it appears multiple times per player here.
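
One way to see how much the missing end times matter is to compute session durations and count the rows where no duration can be derived. The sketch below uses made-up rows, and the day/month/year timestamp format (`TIME_FORMAT`) is an assumption that should be checked against the real `sessions.csv`.

```python
import pandas as pd

# Made-up stand-in for sessions.csv (illustrative values only).
sessions_demo = pd.DataFrame({
    "hashedEmail": ["a1", "a1", "b2"],
    "start_time": ["30/06/2024 18:12", "01/07/2024 09:00", "02/07/2024 20:30"],
    "end_time": ["30/06/2024 18:24", "01/07/2024 10:30", None],
})

TIME_FORMAT = "%d/%m/%Y %H:%M"  # assumed format, verify against the file
for column in ["start_time", "end_time"]:
    sessions_demo[column] = pd.to_datetime(sessions_demo[column], format=TIME_FORMAT)

# Duration in minutes; rows with a missing end_time become NaN.
sessions_demo["duration_min"] = (
    (sessions_demo["end_time"] - sessions_demo["start_time"]).dt.total_seconds() / 60
)

n_missing_end = sessions_demo["end_time"].isna().sum()
print(sessions_demo[["hashedEmail", "duration_min"]])
print("sessions without an end time:", n_missing_end)
```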


### Potential Issues
- Missing or inconsistent timestamps.
- Players with extremely long sessions (potential outliers).
- Overlap or duplication in session data.
- Possible selection bias, since the data only include players who joined the server.  
- Empty or redundant columns, such as `individualId` (empty) and `original_start_time`/`original_end_time` (duplicates of the session timestamps).  
- Not every `hashedEmail` in players appears in sessions, and vice versa.
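
The last issue can be quantified with an outer merge on `hashedEmail`. This is a sketch on made-up identifiers; `players_demo` and `sessions_demo` are hypothetical stand-ins for the two files.

```python
import pandas as pd

# Made-up identifiers, for illustration only.
players_demo = pd.DataFrame({"hashedEmail": ["a1", "b2", "c3"]})
sessions_demo = pd.DataFrame({"hashedEmail": ["a1", "a1", "b2", "d4"]})

# Keep one row per distinct identifier on each side, then outer-merge;
# the `_merge` column records where each identifier was found.
overlap = players_demo.drop_duplicates().merge(
    sessions_demo.drop_duplicates(),
    on="hashedEmail",
    how="outer",
    indicator=True,
)

counts = overlap["_merge"].value_counts()
print(counts)
```

Here `left_only` counts players with no recorded session, and `right_only` counts session identifiers with no matching player.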

## **Data Wrangling**

In [1]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
# KNeighborsClassifier is included because subscribe is a class label.
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

In [2]:
url = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players = pd.read_csv(url)

# Keep only the columns used in this analysis: experience, subscribe, and age.
players_tidy = players.drop(
    columns=["played_hours", "hashedEmail", "individualId", "organizationName", "name", "gender"]
)
players_tidy

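| | experience | subscribe | age |
| --- | --- | --- | --- |
| 0 | Pro | True | 9 |
| 1 | Veteran | True | 17 |
| 2 | Veteran | False | 17 |
| 3 | Amateur | True | 21 |
| 4 | Regular | True | 21 |
| ... | ... | ... | ... |
| 191 | Amateur | True | 17 |
| 192 | Veteran | False | 22 |
| 193 | Amateur | False | 17 |
| 194 | Amateur | False | 17 |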


In [3]:
# Add a numerical encoding of experience.
# LabelEncoder assigns codes in alphabetical order of the labels.
players_tidy["experience_numerical"] = LabelEncoder().fit_transform(
    players_tidy["experience"]
)

players_tidy


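| | experience | subscribe | age | experience_numerical |
| --- | --- | --- | --- | --- |
| 0 | Pro | True | 9 | 2 |
| 1 | Veteran | True | 17 | 4 |
| 2 | Veteran | False | 17 | 4 |
| 3 | Amateur | True | 21 | 0 |
| 4 | Regular | True | 21 | 3 |
| ... | ... | ... | ... | ... |
| 191 | Amateur | True | 17 | 0 |
| 192 | Veteran | False | 22 | 4 |
| 193 | Amateur | False | 17 | 0 |
| 194 | Amateur | False | 17 | 0 |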
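
`LabelEncoder` numbers categories alphabetically, which is why `Pro` is encoded as 2 and `Regular` as 3 in the output above. Because KNN relies on distances between numeric values, an encoding that follows the natural progression of experience may be preferable. The sketch below is an alternative, not the encoding used in this notebook: the ordering Beginner < Amateur < Regular < Pro < Veteran is our assumption, and `experience_order` is a hypothetical mapping.

```python
import pandas as pd

# Assumed ordering of experience levels (hypothetical; adjust as needed).
experience_order = {"Beginner": 0, "Amateur": 1, "Regular": 2, "Pro": 3, "Veteran": 4}

# Small made-up example frame.
demo = pd.DataFrame({"experience": ["Pro", "Veteran", "Amateur", "Regular", "Beginner"]})

# Map each label to its rank; labels missing from the mapping become NaN.
demo["experience_ordinal"] = demo["experience"].map(experience_order)
print(demo)
```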


In [4]:
# Drop the original text column, keeping only the numerical encoding of experience.

players_numerical = players_tidy.drop(columns=["experience"])
players_numerical

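| | subscribe | age | experience_numerical |
| --- | --- | --- | --- |
| 0 | True | 9 | 2 |
| 1 | True | 17 | 4 |
| 2 | False | 17 | 4 |
| 3 | True | 21 | 0 |
| 4 | True | 21 | 3 |
| ... | ... | ... | ... |
| 191 | True | 17 | 0 |
| 192 | False | 22 | 4 |
| 193 | False | 17 | 0 |
| 194 | False | 17 | 0 |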


In [8]:
# Split the data into training (75%) and testing (25%) sets.
# The testing set is locked away; visualization uses only the training set.
players_train, players_test = train_test_split(
    players_numerical,
    test_size=0.25,
    random_state=2000,
)
players_train

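| | subscribe | age | experience_numerical |
| --- | --- | --- | --- |
| 165 | True | 21 | 3 |
| 49 | True | 22 | 1 |
| 6 | True | 19 | 3 |
| 77 | True | 17 | 3 |
| 88 | True | 17 | 1 |
| ... | ... | ... | ... |
| 28 | True | 23 | 0 |
| 123 | False | 17 | 1 |
| 54 | False | 42 | 1 |
| 72 | True | 17 | 4 |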


After wrangling and splitting the dataset, we end up with `players_train` and `players_test`. We will use `players_train` for visualization and for training the model. We lock away the testing set so that the model never sees it during training, which gives the most honest estimate of how well our model performs. Both sets contain the columns `subscribe` (our class variable), `age` (our predictor), and `experience_numerical`.
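
Because the split is random, the share of subscribers can differ between the training and testing sets. One option, shown here as a sketch on made-up data rather than as the split used above, is to pass `stratify` to `train_test_split` so that both sets keep the same class proportions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with an imbalanced class column (75% True, 25% False).
demo = pd.DataFrame({
    "subscribe": [True] * 60 + [False] * 20,
    "age": [17, 21, 22, 19] * 20,
})

demo_train, demo_test = train_test_split(
    demo,
    test_size=0.25,
    random_state=2000,
    stratify=demo["subscribe"],  # keep class proportions in both sets
)

# Share of subscribers in each set.
train_share = demo_train["subscribe"].mean()
test_share = demo_test["subscribe"].mean()
print(train_share, test_share)
```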

Next, we visualize the training data to see whether there is a relationship between a player's age and whether they subscribe to the newsletter.

In [9]:
# Histogram of player age in the training set, colored by subscription status.
players_plot_age = alt.Chart(players_train).mark_bar().encode(
    x=alt.X("age").bin().title("Player's Age"),
    y=alt.Y("count()").title("Number of Players"),
    color=alt.Color("subscribe").title("Subscribed"),
)
players_plot_age

This visualization does not show a clear relationship between a player's age and `subscribe`. Furthermore, the ages are unevenly distributed, which may introduce error in our model: KNN classifies each point by a majority vote among its nearest neighbors in Euclidean distance, so sparsely populated age ranges have few nearby points to vote.
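
The KNN behavior described above can be sketched as follows: features are standardized so that no single variable dominates the Euclidean distance, and the number of neighbors that vote is tuned by cross-validation. The data here are randomly generated, and `KNeighborsClassifier` and the parameter grid are illustrative choices on our part, not the final model.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in for players_train (not real data).
rng = np.random.default_rng(2000)
demo_train = pd.DataFrame({
    "age": rng.integers(9, 50, size=120),
    "experience_numerical": rng.integers(0, 5, size=120),
    "subscribe": rng.random(120) < 0.7,  # roughly 70% subscribers
})

X = demo_train[["age", "experience_numerical"]]
y = demo_train["subscribe"]

# Standardize features, then classify by majority vote of the k nearest points.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Try several odd values of k with 5-fold cross-validation.
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 16, 2))}
knn_grid = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="accuracy")
knn_grid.fit(X, y)

best_k = knn_grid.best_params_["kneighborsclassifier__n_neighbors"]
print(best_k, round(knn_grid.best_score_, 3))
```

With imbalanced classes, accuracy should be compared against the share of the majority class, since a model that always predicts the majority can score well without learning anything.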