# Project Planning Stage (Individual)

In [1]:
# load players.csv dataset

import pandas as pd
import altair as alt

players=pd.read_csv('data/players.csv')
players

| | experience | subscribe | hashedEmail | played_hours | name | gender | age | individualId | organizationName |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | | |
| 1 | Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397... | 3.8 | Christian | Male | 17 | | |
| 2 | Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3... | 0.0 | Blake | Male | 17 | | |
| 3 | Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f... | 0.7 | Flora | Female | 21 | | |
| 4 | Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb... | 0.1 | Kylie | Male | 21 | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 191 | Amateur | True | b6e9e593b9ec51c5e335457341c324c34a2239531e1890... | 0.0 | Bailey | Female | 17 | | |
| 192 | Veteran | False | 71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778... | 0.3 | Pascal | Male | 22 | | |
| 193 | Amateur | False | d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29... | 0.0 | Dylan | Prefer not to say | 17 | | |
| 194 | Amateur | False | f19e136ddde68f365afc860c725ccff54307dedd13968e... | 2.3 | Harlow | Male | 17 | | |


In [2]:
# display dataset information

players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   experience        196 non-null    object 
 1   subscribe         196 non-null    bool   
 2   hashedEmail       196 non-null    object 
 3   played_hours      196 non-null    float64
 4   name              196 non-null    object 
 5   gender            196 non-null    object 
 6   age               196 non-null    int64  
 7   individualId      0 non-null      float64
 8   organizationName  0 non-null      float64
dtypes: bool(1), float64(3), int64(1), object(4)
memory usage: 12.6+ KB


In [3]:
# range of played hours and age

print(players['played_hours'].min())
print(players['played_hours'].max())

print(players['age'].min())
print(players['age'].max())

0.0
223.1
8
99


### Descriptive Summary of Dataset
Number of rows: 196 \
Number of variables: 9

### Information about Variables
**played_hours (response variable):**
- Type: float64 (quantitative decimal observations: range 0.0 to 223.1)
- How many cumulative hours did each player play?

**experience (explanatory variable):**
- Type: object (qualitative observations: Amateur, Beginner, Pro, Regular, or Veteran)
- What is the experience level of each player?

**subscribe (explanatory variable):**
- Type: bool (True or False)
- Is the player subscribed to email notifications?

**gender (explanatory variable):**
- Type: object (qualitative observations: Agender, Female, Male, Non-binary, Prefer not to say, or Two-spirited)
- What gender does each player associate themselves with?

**age (explanatory variable):**
- Type: int64 (quantitative integer observations: range 8 to 99)
- How old is each player?

**hashedEmail and name:**
- No effect on played hours; both columns will be dropped.

**individualId and organizationName:**
- N/A (no non-null values); both columns will be dropped.

### Potential Issues with Dataset
- `subscribe` is a boolean variable. Will this affect wrangling and visualizations? Should it be converted to an object?
- Many players have 0.0 played hours, which pulls the average down.
- There are outliers: players who are more than 50 years old and players who have more than 140 played hours.
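
The first two concerns above can be checked directly. The sketch below is a minimal illustration on a small made-up table (the `toy` DataFrame and its values are hypothetical, not taken from `players.csv`); it shows the share of zero-hour players, how the mean and median diverge when one heavy player is present, and that pandas can group on a boolean column without conversion.

```python
import pandas as pd

# Hypothetical toy data standing in for players.csv (values are made up)
toy = pd.DataFrame({
    'subscribe': [True, True, False, True, False, True],
    'played_hours': [0.0, 0.0, 0.0, 0.7, 0.1, 223.1],
})

# share of players with zero recorded hours
zero_share = (toy['played_hours'] == 0).mean()

# the mean is pulled up by the single heavy player; the median is not
mean_hours = toy['played_hours'].mean()
median_hours = toy['played_hours'].median()

# a boolean column can be used as a grouping key as-is
hours_by_subscribe = toy.groupby('subscribe')['played_hours'].mean()

print(zero_share, mean_hours, median_hours)
print(hours_by_subscribe)
```

If string labels turn out to be needed for a plot, `toy['subscribe'].astype(str)` is one possible conversion.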

In [4]:
# filter out unwanted columns

players_filtered=players[['experience', 'subscribe', 'played_hours', 'gender', 'age']]

# top 13 players with highest play times

top_hours=players_filtered.sort_values(by=['played_hours'], ascending=False).head(13)
top_hours

| | experience | subscribe | played_hours | gender | age |
|---|---|---|---|---|---|
| 74 | Regular | True | 223.1 | Male | 17 |
| 51 | Regular | True | 218.1 | Non-binary | 20 |
| 158 | Regular | True | 178.2 | Female | 19 |
| 90 | Amateur | True | 150.0 | Female | 16 |
| 130 | Amateur | True | 56.1 | Male | 23 |
| 71 | Amateur | True | 53.9 | Male | 17 |
| 17 | Amateur | True | 48.4 | Female | 17 |
| 183 | Amateur | True | 32.0 | Male | 22 |
| 0 | Pro | True | 30.3 | Male | 9 |
| 144 | Beginner | True | 23.7 | Male | 24 |


### Research Question
How do experience, email subscriptions, gender, and age affect the total played hours of Minecraft players?

We want to explore which kinds of players, categorized by experience, subscription status, gender, and age, spend the most time playing. This will help the company target its recruitment efforts.

The data may help us answer this question by revealing patterns that relate each variable to played hours. For instance, in the table above, the players with the most played hours have the following characteristics:
- Experience: mostly Amateur and Regular
- Subscribe: True
- Gender: mostly Male
- Age: teens and young adults

General wrangling plan: explore each explanatory variable to identify which one(s) noticeably affect played hours. This will be done using visualizations.

### Exploratory Data Analysis and Visualizations

**Average Played Hours vs Experience:**\
Based on the plot below, Regular players average far more played hours than players at the other experience levels. This suggests a strong relationship between experience and played hours.

In [5]:
# grouping experience and computing the average played hours for each experience level

experience_grouped=players_filtered.groupby('experience')['played_hours'].mean().reset_index()

# plotting played hours vs experience

experience_plot=alt.Chart(experience_grouped, title='Average Played Hours vs Experience').mark_bar().encode(
    x=alt.X('experience').sort('y').title('Experience'),
    y=alt.Y('played_hours').title('Average Played Hours')
)
experience_plot
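
Because a group average can be driven by a handful of heavy players, it may help to look at the median and the group size next to the mean. The sketch below uses a small made-up table (the `toy` DataFrame and its values are hypothetical); only the column names follow `players_filtered`.

```python
import pandas as pd

# Hypothetical toy data (made-up values); column names follow players_filtered
toy = pd.DataFrame({
    'experience': ['Regular', 'Regular', 'Regular', 'Amateur', 'Amateur', 'Pro'],
    'played_hours': [223.1, 0.1, 0.0, 2.3, 0.7, 30.3],
})

# mean, median, and number of players for each experience level
experience_summary = (
    toy.groupby('experience')['played_hours']
    .agg(['mean', 'median', 'count'])
    .reset_index()
)
print(experience_summary)
```

A large gap between the mean and the median of a group is a hint that a few players account for most of its hours.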

**Average Played Hours vs Subscribe:**\
Based on the plot below, players who are subscribed to email notifications spend significantly more time playing.

In [6]:
# grouping subscribe and computing the average played hours for true or false

subscribe_grouped=players_filtered.groupby('subscribe')['played_hours'].mean().reset_index()

# plotting played hours vs subscribe

subscribe_plot=alt.Chart(subscribe_grouped, title='Average Played Hours vs Subscribe').mark_bar().encode(
    x=alt.X('subscribe').sort('y').title('Subscribe'),
    y=alt.Y('played_hours').title('Average Played Hours')
)
subscribe_plot

**Average Played Hours vs Gender:**\
Based on the plot below, non-binary and female players have the highest average played hours.

In [7]:
# grouping gender and computing the average played hours for each gender category

gender_grouped=players_filtered.groupby('gender')['played_hours'].mean().reset_index()

# plotting played hours vs gender

gender_plot=alt.Chart(gender_grouped, title='Average Played Hours vs Gender').mark_bar().encode(
    x=alt.X('gender').sort('y').title('Gender'),
    y=alt.Y('played_hours').title('Average Played Hours')
)
gender_plot

**Played Hours vs Age:**\
In the first plot below, there are outliers at ages between 90 and 100 and at played hours above 140. These outliers are omitted from the second plot, where it is clear that most players are between 17 and 26 years old.

In [8]:
# plotting played hours vs age

age_plot=alt.Chart(players_filtered, title='Played Hours vs Age').mark_point().encode(
    x=alt.X('age').title('Age'),
    y=alt.Y('played_hours').title('Played Hours')
).properties(width=700)
age_plot

In [64]:
# filtering outliers

age_filtered=players_filtered.loc[(players_filtered['played_hours'] < 140) & (players_filtered['age'] < 55)]

# plotting filtered played hours vs filtered age

age_filtered_plot=alt.Chart(age_filtered, title='Played Hours vs Age (no outliers)').mark_point().encode(
    x=alt.X('age').title('Age'),
    y=alt.Y('played_hours').title('Played Hours')
).properties(height=500, width=700)
age_filtered_plot

### Methods and Plan
K-NN regression can be used to model played hours. Since none of the variables seem to form a linear relationship with played hours, linear regression would not be appropriate. A K-NN classifier would not be appropriate either because we are predicting played hours, which is a quantitative value, not a category.
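
A minimal sketch of what such a model could look like with scikit-learn is shown below. It is an illustration under assumptions, not the final analysis: the `toy` DataFrame is made up, the choice of `experience` and `age` as predictors is arbitrary, and `n_neighbors=2` is a placeholder. Categorical predictors are one-hot encoded and numeric predictors are standardized so that distances are comparable.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (made-up values); column names follow players_filtered
toy = pd.DataFrame({
    'experience': ['Regular', 'Amateur', 'Pro', 'Veteran', 'Amateur', 'Regular'],
    'age': [17, 21, 9, 17, 23, 20],
    'played_hours': [223.1, 0.7, 30.3, 3.8, 56.1, 218.1],
})

# scale the numeric predictor and one-hot encode the categorical one
preprocessor = make_column_transformer(
    (StandardScaler(), ['age']),
    (OneHotEncoder(handle_unknown='ignore'), ['experience']),
)

# n_neighbors=2 is a placeholder; in practice it would be tuned
knn_pipeline = make_pipeline(preprocessor, KNeighborsRegressor(n_neighbors=2))

X = toy[['experience', 'age']]
y = toy['played_hours']
knn_pipeline.fit(X, y)

# predict played hours for one hypothetical new player
new_player = pd.DataFrame({'experience': ['Regular'], 'age': [18]})
prediction = knn_pipeline.predict(new_player)
print(prediction)
```

Each prediction is the average played hours of the nearest neighbours, so it always falls within the range of the training responses.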

Assumption: the data point density is high enough. If the data points are too sparse, it's difficult for K-NN to identify representative neighbours.

Limitations: we need to decide how many variables to use in the model, because too many variables can skew the predictions and/or overfit the data.

Models built from different combinations of predictors will be compared, and the combination that gives the lowest RMSE and RMSPE will be selected.
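
For reference, RMSPE is commonly defined as follows. The notation is introduced here for illustration: $n$ is the number of observations being predicted, $y_i$ is the observed played hours for player $i$, and $\hat{y}_i$ is the model's prediction for that player.

$$
\text{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Lower values mean that predictions are, on average, closer to the observed played hours, in the same units (hours) as the response.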

The data will be split (train size of 0.75) at the beginning of the K-NN regression process, and validation sets will be created using 5-fold cross-validation (cv = 5).
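
The split and cross-validation steps could be sketched as below. This is a hedged illustration only: the data are randomly generated rather than taken from `players.csv`, a single `age` predictor is used for brevity, and the candidate range of 1 to 15 neighbours and the `random_state` values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data (randomly generated, not the real players data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'age': rng.integers(8, 50, size=80)})
toy['played_hours'] = rng.exponential(scale=5.0, size=80)

# 75% training, 25% test; the test set is set aside until the very end
X_train, X_test, y_train, y_test = train_test_split(
    toy[['age']], toy['played_hours'], train_size=0.75, random_state=1
)

# tune the number of neighbours with 5-fold cross-validation on the training set
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())
param_grid = {'kneighborsregressor__n_neighbors': list(range(1, 16))}
grid = GridSearchCV(
    pipeline, param_grid, cv=5, scoring='neg_root_mean_squared_error'
)
grid.fit(X_train, y_train)

best_k = grid.best_params_['kneighborsregressor__n_neighbors']
cv_rmse = -grid.best_score_  # the scorer is negated, so flip the sign

# RMSPE of the tuned model on the held-out test set
test_pred = grid.predict(X_test)
rmspe = float(np.sqrt(np.mean((y_test - test_pred) ** 2)))
print(best_k, cv_rmse, rmspe)
```

Repeating this for each candidate combination of predictors would give one cross-validation RMSE and one test RMSPE per combination to compare.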