In [1]:
import pandas as pd
import altair as alt

## Data Description

Provide a full descriptive summary of the dataset, including information such as the number of observations, 
number of variables, name and type of variables, what the variables mean, any issues you see in the data,
any other potential issues related to things you cannot directly see, how the data were collected, etc. 
Make sure to use bullet point lists or tables to summarize the variables in an easy-to-understand format.

Note that the selected dataset will probably contain more variables than you need. In fact, exploring how the different variables in the dataset affect your model may be a crucial part of the project.
You need to summarize the full data regardless of which variables you may choose to use later on.

The `players.csv` data has 196 observations and 9 variables. The `individualId` and `organizationName` variables are not used, leaving 7 variables:
- `experience`: a string giving the experience class of the given player
- `subscribe`: a TRUE or FALSE variable indicating whether or not the player is subscribed to the email notification system
- `hashedEmail`: a string storing the hash of the player's email
- `played_hours`: a floating point number giving the number of hours a given player has played
- `name`: the fake chosen name of the given player
- `gender`: a string representing the gender the player has chosen to identify as
- `age`: the age of the player
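
The variable summary above can be checked programmatically once the file loads. The following is a minimal sketch, assuming the seven column names listed above; the rows in `players_demo` are made-up placeholder values (including the `experience` and `gender` labels), not records from `players.csv`.

```python
import pandas as pd

# Made-up placeholder rows; only the column names come from the description above.
players_demo = pd.DataFrame(
    {
        "experience": ["Beginner", "Regular", "Pro"],  # hypothetical labels
        "subscribe": [True, False, True],
        "hashedEmail": ["a1", "b2", "c3"],
        "played_hours": [0.0, 1.5, 30.3],
        "name": ["Alex", "Sam", "Kai"],
        "gender": ["Male", "Female", "Non-binary"],  # hypothetical labels
        "age": [17, 21, 9],
    }
)

# One row per variable: its type, number of missing values, and number of distinct values.
summary = pd.DataFrame(
    {
        "dtype": players_demo.dtypes.astype(str),
        "missing": players_demo.isna().sum(),
        "unique": players_demo.nunique(),
    }
)

print(players_demo.shape)  # (number of observations, number of variables)
print(summary)
```

Running the same three lines of `summary` on the real data frame would confirm the observation count and flag any missing values that are not visible by eye.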

The `sessions.csv` data has 1535 observations and 5 variables:
- `hashedEmail`: a string storing the hash of the player's email
- `start_time`: the time the player started playing, stored as a string
- `end_time`: the time the player stopped playing, stored as a string
- `original_start_time` and `original_end_time`: the same start and end times, stored as floating point numbers instead of strings
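
Because `start_time` and `end_time` are stored as strings, they would need to be parsed before any session length can be computed. Here is a hedged sketch: the `"%d/%m/%Y %H:%M"` format and the rows in `sessions_demo` are assumptions made for illustration, and the real format should be checked against the file.

```python
import pandas as pd

# Made-up sessions; the timestamp format below is an assumption, not taken from sessions.csv.
sessions_demo = pd.DataFrame(
    {
        "hashedEmail": ["a1", "a1", "b2"],
        "start_time": ["30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"],
        "end_time": ["30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57"],
    }
)

assumed_format = "%d/%m/%Y %H:%M"
for col in ["start_time", "end_time"]:
    # Convert the string column to proper datetime values.
    sessions_demo[col] = pd.to_datetime(sessions_demo[col], format=assumed_format)

# Session length in minutes.
sessions_demo["duration_min"] = (
    sessions_demo["end_time"] - sessions_demo["start_time"]
).dt.total_seconds() / 60

# Total minutes per player, keyed by the hashed email shared with players.csv.
total_by_player = sessions_demo.groupby("hashedEmail")["duration_min"].sum()
print(total_by_player)
```

The `hashedEmail` column appears in both files, so per-player totals like `total_by_player` could be joined back onto the player data if needed.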

One potential issue with the data is that the age distribution is fairly narrow, with most participants aged around 17-25. Another is that `played_hours` is a cumulative total rather than a rate (for example, hours per week), so it does not show how long a player took to accumulate those hours.

## Question

Clearly state one question your group will try to answer using the selected dataset. Your question should involve one response variable of interest and one or more explanatory variables. Describe clearly how the data will help you address the question of interest. You may need to describe how you plan to wrangle your data to get it into a form where you can apply one of the predictive methods from this class.

"How many hours would a player contribute given their age?"

The data provided in `players.csv` contains the `age` and `played_hours` information needed to answer this predictive question: `age` is the explanatory variable and `played_hours` is the response, so we can analyze how play time varies with age. Very minimal wrangling is needed in this case since the data is already tidy and in a format that is easy to analyze.

## Exploratory Data Analysis and Visualization

In this assignment, you will:

- Demonstrate that the dataset can be loaded into R.
- Do the minimum necessary wrangling to turn your data into a tidy format. Do not do any additional wrangling here; that will happen later during the group project phase.
- Make a few exploratory visualizations of the data to help you understand it.
  - Use our visualization best practices to make high-quality plots (make sure to include labels, titles, units of measurement, etc)
  - Explain any insights you gain from these plots that are relevant to address your question

Note: do not perform any predictive analysis here. We are asking for an exploration of the relevant variables to demonstrate that you understand them well before performing any additional modelling, and to identify potential problems you anticipate encountering.

In [2]:
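# Load the player data (path is relative to the notebook's working directory)
players = pd.read_csv("data/players.csv")
# Keep only the response (played_hours) and the explanatory variable (age)
players = players[["played_hours", "age"]]
players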

FileNotFoundError: [Errno 2] No such file or directory: 'data/players.csv'

In [None]:
# Scatter plot of play time against age for all players
alt.Chart(players).mark_point(opacity=0.5).encode(
    x=alt.X("age").title("Age"),
    y=alt.Y("played_hours").title("Play Time (Hours)")
)

In [26]:
# Same plot restricted to players with under 10 hours, to zoom in on the dense low-hours cluster
alt.Chart(players[players["played_hours"] < 10]).mark_point(opacity=0.5).encode(
    x=alt.X("age").title("Age"),
    y=alt.Y("played_hours").title("Play Time (Hours)")
)

The distribution of play time is highly skewed: most players have relatively low hours (<20 hours), but there are several notable outliers with very high play times (>150 hours). In addition, the players with the highest engagement appear to be in the 15-25 age range, and there is a dense cluster of data points with a play time below 1 hour.
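
The skew described above can also be expressed in numbers rather than read off the plot. This is a small sketch using made-up hours chosen to mimic the pattern (many near-zero values and one very large one); it is not the project data.

```python
import pandas as pd

# Made-up play times: a cluster near zero plus one extreme value.
hours = pd.Series(
    [0.0, 0.1, 0.2, 0.5, 0.7, 1.0, 2.5, 4.0, 18.0, 160.0], name="played_hours"
)

mean_hours = hours.mean()         # pulled upward by the extreme value
median_hours = hours.median()     # robust to the extreme value
share_under_1h = (hours < 1).mean()  # fraction of players below 1 hour
skewness = hours.skew()           # positive for a long right tail

print(mean_hours, median_hours, share_under_1h, skewness)
```

A mean far above the median, together with a positive skewness, is the numeric signature of the long right tail seen in the scatter plot.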

## Methods and Plan

Propose one method to address your question of interest using the selected dataset and explain why it was chosen. Do not perform any modelling or present results at this stage. We are looking for high-level planning regarding model choice and justifying that choice.

In your explanation, respond to the following questions:

- Why is this method appropriate?
- Which assumptions are required, if any, to apply the method selected?
- What are the potential limitations or weaknesses of the method selected?
- How are you going to compare and select the model?
- How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?

Both linear regression and KNN regression will likely be used to answer our question, and the results of the two models will be compared, since both are common choices for regression problems.

To use the linear regression model, we assume there is a roughly linear relationship between age and play time, which does not appear to be the case in the scatter plots. Given the extreme outliers in this data set, the linear model may not perform very well. The existence of these outliers may also hurt the performance of the KNN model.
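
To illustrate why outliers are a concern for the linear model, here is a toy sketch with made-up numbers (not the project data). It fits a least-squares line with `numpy.polyfit` before and after replacing one value with an extreme play time.

```python
import numpy as np

# Made-up ages and play times with no strong trend.
age = np.array([15, 17, 19, 21, 23, 25, 30, 40], dtype=float)
hours = np.array([1.0, 0.5, 2.0, 1.5, 0.8, 1.2, 0.6, 0.4])

# Least-squares line on the clean toy data.
slope_clean, intercept_clean = np.polyfit(age, hours, deg=1)

# Replace one 17-year-old's play time with an extreme value.
hours_with_outlier = hours.copy()
hours_with_outlier[1] = 200.0

# Refit: a single point changes the fitted slope substantially.
slope_outlier, intercept_outlier = np.polyfit(age, hours_with_outlier, deg=1)

print(f"slope without outlier: {slope_clean:.3f}")
print(f"slope with outlier:    {slope_outlier:.3f}")
```

Because least squares penalizes squared errors, one extreme point can dominate the fit. KNN is affected more locally: only predictions whose neighbourhoods include the outlier are pulled toward it.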

The RMSPE on the test set will be used to compare the two models and select the better one. The data will be split into 70% training data and 30% testing data, since the data set is not that big (~200 observations). The split will be done at the very start, before training either model. For the KNN model, cross-validation on the training data will be used to select the K value.
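
The plan above could be laid out roughly as follows. This is a sketch under stated assumptions, not the final analysis: it uses scikit-learn, a synthetic stand-in data set (`players_demo`, randomly generated), 5 cross-validation folds, and a K grid of 1 to 30, none of which are fixed by the plan itself.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in with the same size as players.csv; values are random, not real.
rng = np.random.default_rng(1)
n = 196
players_demo = pd.DataFrame(
    {
        "age": rng.integers(9, 50, size=n),
        "played_hours": rng.exponential(scale=5.0, size=n),
    }
)

X = players_demo[["age"]]
y = players_demo["played_hours"]

# 70/30 split, done once before any model is trained.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=1
)

# Choose K by 5-fold cross-validation on the training data only (grid is an assumption).
knn_search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 31))},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
knn_search.fit(X_train, y_train)

linear = LinearRegression().fit(X_train, y_train)


def rmspe(model):
    """Root mean squared prediction error on the held-out test set."""
    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))


results = {"linear": rmspe(linear), "knn": rmspe(knn_search)}
print(knn_search.best_params_, results)
```

With a single predictor, standardizing `age` would not change which neighbours KNN selects, so the sketch omits a scaling step; it would be needed if more explanatory variables were added.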