**<u>Data Description**</u>

Data about each player who participated in this Minecraft study.
Number of observations: 196
Number of Variables: 9

**Variables:** 

1. experience - experience level
2. subscribe - whether the player subscribed to the newsletter (true or false)
3. hashed_email - players email rewritten, as a way to identify each player
4. played_hours - number of hours played on the MineCraft server
5. name - player's name
6. gender 
7. age 
8. individualID 
9. organizationName 

Most of these variables are nominal variables. We may need to manage the data and assign the values to numbers for analysis. 


**Question**

"What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?"

Our specific question is: 
"Can we predict whether or not a player will subscribe to the newsletter based on age, gender, and experience, and what is the best combination of these predictors to use?" 

Columns we will use: 

Predictor variables: experience, gender, age

Response variable: subscribe


<u>**Exploratory data analysis and visualization**</u>

In [1]:
import pandas as pd
url1 = "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
 

#load dataframe into python
players = pd.read_csv(url1)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


I will create a new column in this dataset to combine some of the gender data for more effective visualization.

These observations are renamed "Other*" to represent Agender, Non-binary, Two-Spirited, Prefer not to say, and Other values. I recognize that "other" is not an appropriate term to represent these individual groups, but this is just to better visualize any general trends.


In [2]:
gender_category_renames = {
    'Male': 'Male',
    'Female': 'Female',
    'Agender': 'Other*',
    'Non-binary': 'Other*',
    'Two-Spirited': 'Other*',
    'Prefer not to say': 'Other*',
    'Other' : 'Other*'
}

players["gender_binned"] = players['gender'].map(gender_category_renames)
                                
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName,gender_binned
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,,Male
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,,Male
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,,Male
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,,Female
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,,Male
...,...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,,Female
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,,Male
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,,Other*
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,,Male


In [3]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
players['experience_encoded'] = label_encoder.fit_transform(players['experience'])
players['gender_encoded'] = label_encoder.fit_transform(players['gender'])
players['gender_binned_encoded'] = label_encoder.fit_transform(players['gender_binned'])
##this code was determined through an online search for how to transform qualitative data into numerical ordinal data

##This technique technically makes "gender" appear as an ordered dataset when it should not be, however it still is a way to transform these 
#categorical variables into numbers for analysis, and since we are not dealing with orders this should not be an issue


In [4]:
##Data split into training and testing datasets
import numpy as np
from sklearn.model_selection import train_test_split

players_training, players_testing = train_test_split(
   players, train_size = 75, random_state=2000           # same random state as group members
)
players_training

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName,gender_binned,experience_encoded,gender_encoded,gender_binned_encoded
46,Veteran,True,577aa5f15468252b1c6f32dcd515012923476292e30f95...,0.1,Winston,Male,17,,,Male,4,2,1
32,Amateur,True,1683a3e0aed65119f83540274ff6f965fdf66890613a80...,0.0,Farid,Male,17,,,Male,0,2,1
48,Veteran,True,b3510c708bd50bf9f75e6e02bb6fe14edb705e0ea671ee...,12.5,Isidore,Agender,27,,,Other*,4,0,2
187,Amateur,True,e3f0ad9aadd27f3d1d9197e58546d045018daa76767503...,0.0,Jasper,Male,17,,,Male,0,2,1
126,Beginner,True,d51a8c57269fe347b4b7760ec29c420832cb36a11c4756...,0.7,Amelie,Female,24,,,Female,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
28,Amateur,True,4b01bce3f141289709e8278b02ba5d2aaa7105d7ccb9c7...,1.8,Luca,Male,23,,,Male,0,2,1
123,Beginner,False,e74c60a92c0100e7240be56d66969db85856152b048c63...,7.1,Arash,Male,17,,,Male,1,2,1
54,Beginner,False,5e5c25a773be7a62638a163d773534e575a5ad57821047...,0.0,Jude,Female,42,,,Female,1,1,0
72,Veteran,True,1edebbf13898dbe99da5ee743b1fd9cbfdc6d69f1180dd...,0.0,Will,Male,17,,,Male,4,2,1


**Visualization**

In [5]:
import altair as alt

chart_1 = alt.Chart(players_training).mark_point().encode(
    x = alt.X("experience_encoded").title("Experience"),
    y = alt.Y("age").title("Age"),
    color = alt.Color("subscribe:N")
 ).properties(
    width=200,
    height=200
)


chart_2 = alt.Chart(players_training).mark_point().encode(
    x = alt.X("experience_encoded").title("Experience"),
    y = alt.Y("gender").title("Gender"),
    color = alt.Color("subscribe:N"),
).properties(
    width=200,
    height=200
)

chart_3 = alt.Chart(players_training).mark_point().encode(
    x = alt.X("age").title("Age"),
    y = alt.Y("gender").title("Gender"),
    color = alt.Color("subscribe:N"),
).properties(
    width=200,
    height=200
)


side_by_side_charts = chart_1 | chart_2 | chart_3
side_by_side_charts

These charts show that there are overlapping observations at points, appearing as darker points. Brown coloured points show that at these points some subscribed and some did not. 

In [6]:
bar_plot_gender_percent = alt.Chart(players_training, title = "Percentage of Subscribed").mark_bar().encode(
    x=alt.X("gender")
        .title("Gender"),
    y=alt.Y("count()")
        .title("Percentage")
        .stack('normalize'),
    color=alt.Color("subscribe").title("Subscribed")
)
bar_plot_gender = alt.Chart(players_training, title = "Number of Subscribed").mark_bar().encode(
    x=alt.X("gender")
        .title("Gender"),
    y=alt.Y("count()")
        .title("Count"),
    color=alt.Color("subscribe").title("Subscribed")
)

bar_plot_gender_binned_percent = alt.Chart(
    players_training, title = "Percentage of Subscribed"
).mark_bar().encode(
    x=alt.X("gender_binned")
        .title("Gender"),
    y=alt.Y("count()")
        .title("Percentage")
        .stack('normalize'),
    color=alt.Color("subscribe").title("Subscribed")
)
bar_plot_gender_binned = alt.Chart(
    players_training, title = "Number of Subscribed"
).mark_bar().encode(
    x=alt.X("gender_binned")
        .title("Gender"),
    y=alt.Y("count()")
        .title("Count"),
    color=alt.Color("subscribe").title("Subscribed")
)

side_by_side_plots_gender = bar_plot_gender | bar_plot_gender_percent | bar_plot_gender_binned | bar_plot_gender_binned_percent
side_by_side_plots_gender

In [7]:
bar_plot_experience_percent= alt.Chart(
    players_training,
    title="Percentage of Subscribed"
).mark_bar().encode(
    x=alt.X("experience")
        .title("Experience"),
    y=alt.Y("count()")
        .title("Percentage")
        .stack('normalize'),
    color=alt.Color("subscribe").title("Subscribed")
)
bar_plot_experience = alt.Chart(
    players_training, title = "Number of Subscribed"
).mark_bar().encode(
    x=alt.X("experience")
        .title("Experience"),
    y=alt.Y("count()")
        .title("Count"),
    color=alt.Color("subscribe").title("Subscribed")
)
side_by_side_plots_experience = bar_plot_experience | bar_plot_experience_percent
side_by_side_plots_experience

In [8]:
bar_plot_age_percent = alt.Chart(
    players_training,
    title= "Percentage of Subscribed"
).mark_bar().encode(
    x=alt.X("age").bin()
        .title("Age"),
    y=alt.Y("count()")
        .title("Percentage")
        .stack('normalize'),
    color=alt.Color("subscribe").title("Subscribed")
)
bar_plot_age = alt.Chart(
    players_training,
    title= "Number of Subscribed"
).mark_bar().encode(
    x=alt.X("age").bin()
        .title("Age"),
    y=alt.Y("count()")
        .title("Count"),
    color=alt.Color("subscribe").title("Subscribed")
)
side_by_side_plots_age = bar_plot_age | bar_plot_age_percent
side_by_side_plots_age

<u>**Methods and plan:**</u>

We will preform a K-NN classification analysis to predict whether or not a player will subscribe to a newsletter. This method is appropriate for answering a predictive classification question such as this one.

**Assumptions:**
We assume that we can use different K neighbours values for each model based on our cross validation tests. We also assume that we have relevant, unskewed data where all player types are represented relatively equally.

**Potential Limitations:**
K-NN classification can be skewed by outliers or overrepresentation of certain player types. Also, significantly more players subscribed to the newsletter than not, so we may need to address this balancing issue. 

**Process for model creation and comparison:**
We will create K-NN classification models for different combinations of predictor vairables, following the same steps of preprocessing and cross-validation each time. We will then compare each model to the testing set to calculate accuracy, precision, and recall to determine the best combination of predictors. Listed below are the four combinations to be compared:

  1. age and gender
  2. age and experience
  3. gender and experience
  4. age, gender, and experience

**Preprocessing Steps:**
Experience and gender are nominal variables, so will be assigned numerical values, ordered for experience and random for gender. The data will be split into a training set and testing set (75/25). When creating each of our four models we will preprocess the data, preform a 5 fold cross validation test to determine the best K value.