# Data Description

players.csv dataset = information about the characteristics of each player, including:
- “experience”: the experience level of each player, with “veteran” being the most experienced/familiar and “beginner” being the least (values: “veteran”, “pro”, “regular”, “amateur”, and “beginner”)
- “subscribe”: whether the player is subscribed (TRUE) or not (FALSE)
- “hashedEmail”: string holding a hashed version of each player’s email
- “played_hours”: numerical value representing the total number of hours each player has played
- “name”: string of each player’s unique “nickname”
- “gender”: the gender given by each player
- “age”: numerical value representing each player’s age
- “individualId”: indicates if the player is playing alone
- “organizationName”: name of the organization if the player isn't playing alone

sessions.csv dataset = information about each player’s playing sessions, including:
- the same hashed email column (“hashedEmail”) as in the players dataset
- “start_time” and “end_time” columns = the start and end time of each gaming session of a given player, in a readable date and time format
- “original_start_time” and “original_end_time” columns = same as “start_time” and “end_time” columns but in UNIX time


# Question: How many hours would a player contribute given their age?

- Response Variable = “played_hours” (in players dataset)
- Explanatory Variable = “age” (in players dataset)
- The players dataset answers the question since it records each player's age and cumulative hours played, so we can look for a relationship/pattern between the two
- Could also see if age has a relationship with the length of individual sessions (computed from the start and end times in the sessions dataset)
- With age as the explanatory variable, we can wrangle the data to see whether age is related to total play time, and whether that total comes from a small number of long sessions or a large number of short sessions
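
The session-length idea above can be sketched as follows. This is a minimal illustration, not the notebook's actual wrangling: it uses only the standard library, a handful of rows taken from sessions.csv, and assumes the timestamps follow a day/month/year hour:minute layout. The helper names `session_minutes` and `summarize_sessions` are hypothetical. The same steps would map onto `pd.to_datetime` and a `groupby` in pandas.

```python
from collections import defaultdict
from datetime import datetime

# Assumed timestamp layout: day/month/year hour:minute
TIME_FORMAT = "%d/%m/%Y %H:%M"

# A few (name, start_time, end_time) rows taken from the sessions data
sessions = [
    ("Hiroshi", "30/06/2024 18:12", "30/06/2024 18:24"),
    ("Alex", "17/06/2024 23:33", "17/06/2024 23:46"),
    ("Hiroshi", "25/07/2024 03:22", "25/07/2024 03:58"),
    ("Alex", "25/05/2024 16:01", "25/05/2024 16:12"),
]


def session_minutes(start, end):
    """Length of one session in minutes."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds() / 60


def summarize_sessions(rows):
    """Map each player to (number of sessions, mean session length in minutes)."""
    lengths = defaultdict(list)
    for name, start, end in rows:
        lengths[name].append(session_minutes(start, end))
    return {name: (len(mins), sum(mins) / len(mins)) for name, mins in lengths.items()}


summary = summarize_sessions(sessions)
for name, (count, mean_length) in summary.items():
    print(f"{name}: {count} sessions, mean {mean_length:.1f} min")
```

Joining a summary like this back to each player's age would separate players with a few long sessions from players with many short ones.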


# Exploratory Data Analysis and Visualization

In [79]:
import pandas as pd
import altair as alt

In [80]:
players_data = pd.read_csv("data/players.csv")

# check if all the names are unique, ensuring no repeated players 
players_data["name"].is_unique
players_data["hashedEmail"].is_unique

# see if any observations in the 'individualId' and 'organizationName' columns are NOT null
players_data[players_data["individualId"].notnull()]
players_data[players_data["organizationName"].notnull()]

# these columns hold null values, so drop them by keeping only the columns up to "age"
players_data = players_data.loc[:, :"age"]

# add a played minutes variable to see the player's play time in minutes, avoiding decimal/partial hours
players_data = players_data.assign(played_mins = players_data["played_hours"]*60)

# rename "hashedEmail" column to match other column names
players_data = players_data.rename(columns = {"hashedEmail":"hashed_email"})

# reorganizing order of columns
players_data = players_data[["experience", "subscribe", "hashed_email", "name", "played_hours", "played_mins", "gender", "age"]]

players_data

Unnamed: 0,experience,subscribe,hashed_email,name,played_hours,played_mins,gender,age
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,Morgan,30.3,1818.0,Male,9
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,Christian,3.8,228.0,Male,17
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,Blake,0.0,0.0,Male,17
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,Flora,0.7,42.0,Female,21
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,Kylie,0.1,6.0,Male,21
...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,Bailey,0.0,0.0,Female,17
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,Pascal,0.3,18.0,Male,22
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,Dylan,0.0,0.0,Prefer not to say,17
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,Harlow,2.3,138.0,Male,17


In [81]:
sessions_data = pd.read_csv("data/sessions.csv")

# 'start_time' and 'end_time' columns are 'original_start_time' and 'original_end_time' columns 
# but in a more readable format, can filter out 'original_start_time' and 'original_end_time' columns
sessions_data = sessions_data.loc[:, :"end_time"] 

# rename "hashedEmail" column to match players dataset 
sessions_data = sessions_data.rename(columns = {"hashedEmail":"hashed_email"})

# merge the players dataset with the sessions dataset to get the name variable
sessions_data = sessions_data.merge(players_data, on = "hashed_email")

# filter out the columns so we only keep the playing times, and the name variable (for readability)
sessions_data = sessions_data[["name", "start_time", "end_time"]]
sessions_data

Unnamed: 0,name,start_time,end_time
0,Hiroshi,30/06/2024 18:12,30/06/2024 18:24
1,Alex,17/06/2024 23:33,17/06/2024 23:46
2,Delara,25/07/2024 17:34,25/07/2024 17:57
3,Hiroshi,25/07/2024 03:22,25/07/2024 03:58
4,Alex,25/05/2024 16:01,25/05/2024 16:12
...,...,...,...
1530,Alex,10/05/2024 23:01,10/05/2024 23:07
1531,Lane,01/07/2024 04:08,01/07/2024 04:19
1532,Dana,28/07/2024 15:36,28/07/2024 15:57
1533,Dana,25/07/2024 06:15,25/07/2024 06:22


In [82]:
# chart the data (age and total hours played)

# we see there are only 2 players above age 50, chart data excluding these observations
players_data[players_data["age"] > 50]

# most played hours are under 7 hours, see how many have over 7 hours play time
players_data[players_data["played_hours"] > 7].count()

# data is hard to read due to discrepancy between high played hours and low played hours
# chart the high played hours in a different chart

# keep players aged 50 or under, so only the players above age 50 are excluded
under_50_rows = players_data["age"] <= 50
less_7hrs_chart_data = players_data[(under_50_rows) & (players_data["played_hours"] < 7)]

# use line graph to better visualize trend and since we're exploring time
less_7hrs_chart = alt.Chart(less_7hrs_chart_data).mark_line().encode(
    x = alt.X("played_hours").title("Total Hours Played"),
    y = alt.Y("age").title("Player's Age").scale(zero=False)
)
less_7hrs_chart

In [83]:
# chart data to see if relationship for players with over 7 hours

more_7hrs_chart_data = players_data[(under_50_rows) & (players_data["played_hours"] >= 7)]

more_7hrs_chart = alt.Chart(more_7hrs_chart_data).mark_line().encode(
    x = alt.X("played_hours").title("Total Hours Played"),
    y = alt.Y("age").title("Player's Age").scale(zero=False)
)
more_7hrs_chart

# Methods and Plan: Simple Linear Regression or K-NN Regression

Why This Method? 
- We want to predict a numerical response variable from an explanatory variable, so we should use a regression model
- Linear → the data has simple numerical values, would provide easy interpretability & a useful equation to map out the relationship; however, a least-squares fit can be pulled noticeably by outliers (e.g. the few players with very high played hours)
- K-NN → the relationship may not be linear, in which case linear regression would fit poorly, so predicting the hours played from the nearest neighbors may be more feasible

Assumptions:
- Linear → relationship between age and play time is linear (for model to work best)

Potential Limitations/Weaknesses:
- Linear → if the relationship is not linear, the model underfits and predictions suffer
- K-NN → doesn't predict accurately beyond the range of the training data, and gets slower as the training data grows
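
The K-NN range limitation can be seen in a small sketch. The training data here is hypothetical (ages 10 to 20 lying on an exact line, chosen only for illustration), and `knn_predict` is a hand-written helper rather than a library function.

```python
def knn_predict(train_x, train_y, x_new, k):
    """Average the responses of the k training points closest to x_new."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x_new))[:k]
    return sum(train_y[i] for i in nearest) / k


# Hypothetical training data on the exact line y = 2x (illustration only)
train_x = list(range(10, 21))
train_y = [2 * x for x in train_x]

inside = knn_predict(train_x, train_y, 15, k=3)   # within the training range
beyond = knn_predict(train_x, train_y, 30, k=3)   # past the largest training x
far_beyond = knn_predict(train_x, train_y, 60, k=3)

print(inside, beyond, far_beyond)
```

Past the edge of the training range the same three neighbors are always the closest, so the prediction stops changing, whereas a fitted line would keep extrapolating.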

Comparing and Selecting the Model:
- Make K-NN regression model, cross-validate on the training data to choose the number of neighbors, then calculate RMSPE on the test data
- Make simple linear regression model, calculate RMSPE on the test data
- Select the model with the lower RMSPE
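
A rough sketch of the comparison above, assuming RMSPE means the root mean squared prediction error on held-out data. The (age, played_hours) pairs are hypothetical, K is fixed at 2 rather than tuned by cross-validation, and the helpers `fit_line`, `knn_predict`, and `rmspe` are written by hand for illustration. The real analysis would more likely use a modelling library.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def knn_predict(train_x, train_y, x_new, k):
    """Average the responses of the k training points closest to x_new."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x_new))[:k]
    return sum(train_y[i] for i in nearest) / k


def rmspe(actual, predicted):
    """Root mean squared prediction error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical (age, played_hours) pairs, for illustration only
train_age = [10, 12, 14, 16, 18, 20, 22, 24]
train_hours = [1.0, 1.5, 4.0, 6.0, 5.5, 3.0, 1.0, 0.5]
test_age = [13, 17, 21]
test_hours = [2.5, 6.0, 2.0]

slope, intercept = fit_line(train_age, train_hours)
linear_pred = [intercept + slope * age for age in test_age]
knn_pred = [knn_predict(train_age, train_hours, age, k=2) for age in test_age]

scores = {
    "linear": rmspe(test_hours, linear_pred),
    "knn": rmspe(test_hours, knn_pred),
}
best_model = min(scores, key=scores.get)
print(scores, best_model)
```

Both models are scored on the same test rows, so the two RMSPE values are directly comparable.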

Processing Data To Apply To Model: 
- Before fitting either model → split data into training and test sets (test set = 25% of data)
- Split data right after loading it in and setting the seed, since we don't want either model to see any test data
- For K-NN → will cross-validate to choose the number of neighbors, so part of the training data is held out as a validation set (validation set will be 25% of training data)
- For Linear → no tuning parameter, so don't need to cross-validate; only split into training and testing data
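
The splitting plan above might look like the sketch below. The row count of 100, the seed value, and the helper names `train_test_split_indices` and `k_fold_indices` are all assumptions made for illustration. Using 4 folds makes each validation fold roughly 25% of the training rows.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.25, seed=2024):
    """Shuffle row indices with a fixed seed and hold out a test fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(train_indices, n_folds=4):
    """Yield (fit, validation) index lists; each training row is validated once."""
    folds = [train_indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


# Hypothetical row count, for illustration only
train_idx, test_idx = train_test_split_indices(100)
print(len(train_idx), len(test_idx))

for fit_idx, val_idx in k_fold_indices(train_idx, n_folds=4):
    # the test rows never appear in either list
    print(len(fit_idx), len(val_idx))
```

Only the training indices are passed to the fold generator, which keeps the test set untouched until the final RMSPE calculation.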

