# Project Planning

In [29]:
# Run this cell before continuing
import pandas as pd
import altair as alt

# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')

DataTransformerRegistry.enable('vegafusion')

## Data Description
There are two datasets: **players.csv** and **sessions.csv**.
### players.csv
* 196 Observations / Rows
* 9 variables: 7 contain data and 2 are unused
| Variable Name | Type of Variable | Meaning | How it was collected | 
| --- | --- | --- | --- |
| experience | Categorical (Ordinal)| User's self-reported experience with Minecraft. | During sign up |
| subscribe | Categorical (Nominal) | Whether the user subscribed to the mailing list. | During sign up |
| hashedEmail | Categorical (Nominal) | Email the user signed up with (hashed). | During sign up (then hashed)|
| played_hours | Numeric | Hours of game played | Monitoring game time of user |
| name | Categorical (Nominal) | Name the user chose from a list of names (not their actual name) | During sign up |
| gender | Categorical (Nominal) | Gender of user | During sign up |
| age | Numeric | Age of user |  During sign up |
| individualId | unknown | unknown | N/A |
| organizationName | unknown | unknown | N/A |
##### Issues
* There is no data for the variables *individualId* and *organizationName*
* The options for *experience* (Beginner, Amateur, Regular, Pro, Veteran) are confusing.
  * To me, Pro and Veteran sound like almost the same level of experience.
 
### sessions.csv
* 1535 Observations / Rows
* 5 variables
| Variable Name | Type of Variable | Meaning | How it was collected | 
| --- | --- | --- | --- |
| hashedEmail | Categorical (Nominal)| Email the user signed up with (hashed). | During sign up |
| start_time | Categorical & Numeric | Game session start time, given as a date and time of day. | Derived from original_start_time |
| end_time | Categorical & Numeric | Game session end time, given as a date and time of day. | Derived from original_end_time |
| original_start_time | Numeric | Game session start time in UNIX time. | Recorded when the user's session starts |
| original_end_time | Numeric | Game session end time in UNIX time. | Recorded when the user's session ends |
##### Issues
* *start_time* and *end_time* each hold both a date and a time of day, which makes them both categorical and numeric. This is not tidy and may make the data harder to analyse.
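
One way to address this issue is to parse the combined values into proper datetimes and then derive separate columns. The sketch below uses two made-up rows, and the `day/month/year hour:minute` format string is an assumption; it would need to be checked against the actual contents of **sessions.csv**. The derived column names (`start_hour`, `session_minutes`) are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; the real sessions.csv values and format may differ.
sessions = pd.DataFrame({
    "start_time": ["30/06/2024 18:12", "17/06/2024 23:33"],
    "end_time": ["30/06/2024 18:24", "17/06/2024 23:46"],
})

# Assumed format: day/month/year hour:minute
assumed_format = "%d/%m/%Y %H:%M"
for col in ["start_time", "end_time"]:
    sessions[col] = pd.to_datetime(sessions[col], format=assumed_format)

# Split the combined value into separate, tidy columns
sessions["start_hour"] = sessions["start_time"].dt.hour
sessions["session_minutes"] = (
    (sessions["end_time"] - sessions["start_time"]).dt.total_seconds() / 60
)
```

Once the columns are datetimes, the hour of day and the session length can each be treated as an ordinary numeric variable.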
 

## Question
### Selected Question:
* **Question 1**: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

Players who generate lots of data are those with a lot of play time.
We will focus on the **players.csv** dataset and the variables *experience*, *subscribe*, *played_hours*, *gender*, and *age*.
The response variable will be *played_hours*, and the explanatory variables will be *experience*, *subscribe*, *gender*, and *age*.
* experience: Do experienced Minecraft players play more PLAICraft?
* subscribe: Does being subscribed to the mailing list contribute to more play time?
* gender: Does gender affect play time?
* age: Is there a relationship between age and play time?

##### Wrangling
**players.csv** is already tidy, so no wrangling is needed to tidy it. `groupby` will be used to group the observations by each categorical explanatory variable and find the average *played_hours* per group.
* Example: group all the Veteran players and find their mean play time
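
As a minimal sketch of this grouping step, the example below uses a small toy table rather than the full dataset. It also counts the players in each group, which is an addition not described above. A mean based on very few players can be misleading. The names `toy_players`, `mean_hours`, and `n_players` are hypothetical.

```python
import pandas as pd

# Toy rows for illustration only, not the full players.csv
toy_players = pd.DataFrame({
    "experience": ["Veteran", "Veteran", "Pro", "Amateur", "Amateur"],
    "played_hours": [3.8, 0.0, 30.3, 0.7, 0.0],
})

# Mean play time and group size for each experience level
experience_summary = (
    toy_players
    .groupby("experience", as_index=False)
    .agg(mean_hours=("played_hours", "mean"),
         n_players=("played_hours", "size"))
)
```

Here `as_index=False` keeps `experience` as a regular column, which is convenient when the result is passed to a plotting library.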



## Exploratory Data Analysis and Visualization

In [26]:
player_data = pd.read_csv("data/players.csv").drop(columns = ['individualId','organizationName','hashedEmail','name'])
player_data

Unnamed: 0,experience,subscribe,played_hours,gender,age
0,Pro,True,30.3,Male,9
1,Veteran,True,3.8,Male,17
2,Veteran,False,0.0,Male,17
3,Amateur,True,0.7,Female,21
4,Regular,True,0.1,Male,21
...,...,...,...,...,...
191,Amateur,True,0.0,Female,17
192,Veteran,False,0.3,Male,22
193,Amateur,False,0.0,Prefer not to say,17
194,Amateur,False,2.3,Male,17


In [65]:
player_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   experience    196 non-null    object 
 1   subscribe     196 non-null    bool   
 2   played_hours  196 non-null    float64
 3   gender        196 non-null    object 
 4   age           196 non-null    int64  
dtypes: bool(1), float64(1), int64(1), object(2)
memory usage: 6.4+ KB


In [30]:
Age = alt.Chart(player_data).mark_point(opacity=0.4).encode(
    x=alt.X('age:Q')
        .title('Age')
        .scale(zero = False),
    y=alt.Y('played_hours:Q')
        .title('Hours Played')
)
Age

In [42]:
# reset_index() keeps 'experience' as a column; Altair reads DataFrame columns, not the index
experience_avg = player_data[['experience','played_hours']].groupby('experience').mean(numeric_only=True).reset_index()
experience_avg

Experience = alt.Chart(experience_avg).mark_bar().encode(
    x=alt.X('experience:O')
        .title('Experience'),
    y=alt.Y('played_hours:Q')
        .title('Mean Hours Played')
)
Experience

In [47]:
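# reset_index() keeps 'subscribe' as a column; Altair reads DataFrame columns, not the index
subscribe_avg = player_data[['subscribe','played_hours']].groupby('subscribe').mean(numeric_only=True).reset_index()
subscribe_avg

# No zero-based scale option on x: it has no effect on a nominal axis
Subscribed = alt.Chart(subscribe_avg).mark_bar().encode(
    x=alt.X('subscribe:N')
        .title('Subscribed to mailing list (T/F)'),
    y=alt.Y('played_hours:Q')
        .title('Mean Hours Played')
)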
Subscribed

In [60]:
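# reset_index() keeps 'gender' as a column; Altair reads DataFrame columns, not the index
gender_avg = player_data[['gender','played_hours']].groupby('gender').mean(numeric_only=True).reset_index()
gender_avg

# Bars are sorted from highest to lowest mean play time
Gender = alt.Chart(gender_avg).mark_bar().encode(
    x=alt.X('gender:N')
        .title('Gender').sort('-y'),
    y=alt.Y('played_hours:Q')
        .title('Mean Hours Played')
)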
Gender

### Data Analysis 
##### Experience
* Regular players have the highest average play time
##### Subscribe
* Players who subscribed have a longer average play time
* The subscribe option is on by default, so it is hard to tell whether players play more because of the subscription
##### Gender
* Non-binary players have the highest average play time, followed by female, agender, male, prefer not to say, other, and two-spirited players
##### Age
* Most play times are near 0 hours, with a few outliers that have very high play time
##### Overall
* As the **Age** plot shows, a few outlier players play far more than the rest. These data points need to be dealt with before modelling
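
One possible way to flag such outliers is the 1.5 × IQR rule, sketched below on a handful of toy values. This rule is an assumption for illustration, not a method chosen in this plan. Whether to remove or keep the flagged players is a separate decision, since heavy players may be exactly the ones of interest.

```python
import pandas as pd

# Toy play times for illustration only, not the full dataset
toy_hours = pd.Series([0.0, 0.0, 0.1, 0.3, 0.7, 2.3, 3.8, 30.3], name="played_hours")

# Upper fence of the 1.5 * IQR rule (assumed rule, not prescribed by the project)
q1 = toy_hours.quantile(0.25)
q3 = toy_hours.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Boolean mask marking play times above the fence
is_outlier = toy_hours > upper_fence
flagged = toy_hours[is_outlier]
```

Only an upper fence is used here because play time cannot be negative and most values sit near 0.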

## Methods and Plan

To determine whether there is a relationship between age and play time, KNN regression will be used. It is not yet clear how to include the other variables, since they are categorical.
* I will ask the teaching team how to incorporate the other variables.
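
One common option, shown here only as a hedged sketch and not as the approach settled on for this project, is to one-hot encode the categorical variables so that KNN can compute distances on them. The example uses scikit-learn's `make_column_transformer` and `OneHotEncoder` on a few toy rows. The choice of `n_neighbors=3` is arbitrary, and the names `toy_players` and `preprocessor` are hypothetical.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows for illustration only
toy_players = pd.DataFrame({
    "experience": ["Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur"],
    "gender": ["Male", "Male", "Male", "Female", "Male", "Female"],
    "age": [9, 17, 17, 21, 21, 17],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.1, 0.0],
})
X = toy_players[["experience", "gender", "age"]]
y = toy_players["played_hours"]

# Scale the numeric column; turn each category into its own 0/1 column
preprocessor = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(handle_unknown="ignore"), ["experience", "gender"]),
)

# n_neighbors=3 is an arbitrary value chosen only so the toy example runs
knn_pipeline = make_pipeline(preprocessor, KNeighborsRegressor(n_neighbors=3))
knn_pipeline.fit(X, y)
predictions = knn_pipeline.predict(X)
```

One-hot encoding treats *experience* as unordered, so it discards the ordinal ranking. Whether that is acceptable is a question for the teaching team.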

The Age plot shows no apparent linear relationship between age and play time, so KNN regression is suitable: it makes minimal assumptions about the shape of the relationship. In addition, the dataset is relatively small (~200 observations), which KNN can handle. However, it does have limitations; specifically, KNN struggles to predict values outside the range of the training data. Since most data points fall within the 15-25 age range, the model is likely to perform poorly for ages outside this range.

1. Create exploratory visualisations to find outliers and remove them.
2. Split the data into training and testing sets, with 70% of the data used for training.
3. Build a pipeline that scales the values.
4. Use cross-validation on the training set to choose the number of neighbours that gives the best-fitting model.
5. Evaluate how well the model predicts on the test set using RMSPE.
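
Steps 2 to 5 could be sketched with scikit-learn roughly as follows. The data here is synthetic, generated only so the example runs on its own. The 5-fold cross-validation, the candidate range of 1 to 30 neighbours, and the random seed are all assumptions, not decisions made in this plan.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for players.csv (assumed shape only, not real data)
rng = np.random.default_rng(2024)
toy = pd.DataFrame({
    "age": rng.integers(9, 50, size=200),
    "played_hours": rng.exponential(scale=5.0, size=200),
})

# Step 2: 70% training / 30% testing split
X_train, X_test, y_train, y_test = train_test_split(
    toy[["age"]], toy["played_hours"], train_size=0.7, random_state=2024
)

# Step 3: pipeline that scales the predictor before KNN regression
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Step 4: cross-validation on the training set to pick the number of neighbours
grid = GridSearchCV(
    pipeline,
    param_grid={"kneighborsregressor__n_neighbors": range(1, 31)},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_train, y_train)
best_k = grid.best_params_["kneighborsregressor__n_neighbors"]

# Step 5: RMSPE, i.e. the root mean squared error measured on the held-out test set
rmspe = np.sqrt(mean_squared_error(y_test, grid.predict(X_test)))
```

Because `GridSearchCV` refits the best pipeline on the whole training set by default, `grid.predict` uses the selected number of neighbours without a separate refit step.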
