In [1]:
import altair as alt
import numpy as np
import pandas as pd

# Individual Planning Report

Each student is expected to prepare a 1 page (max 500 words) written proposal that describes the data they are working on, demonstrates an understanding of all variables and potential issues in the data, and identifies the question they would like to answer using that dataset for their project. The proposal should be done in a Jupyter notebook, and then submitted in **two formats**:

- as an .html file (File -> Download As -> HTML)
- as an .ipynb file. **This file must be fully reproducible. It must run completely from top to bottom without any additional files.**

It's important to note that this first step in the project will be completed individually. Every student needs to write and submit their own assignment. We aim to ensure that all students in the group are well-prepared and able to contribute effectively to the final report.

In your notebook you need to cover the following: 

## 1. Data Description

Provide a full descriptive summary of the dataset, including information such as the number of observations, number of variables, name and type of variables, what the variables mean, any issues you see in the data, any other potential issues related to things you cannot directly see, how the data were collected, etc. Make sure to use bullet point lists or tables to summarize the variables in an easy-to-understand format.

Note that the selected dataset will probably contain more variables than you need. In fact, exploring how the different variables in the dataset affect your model may be a crucial part of the project. You need to summarize the full data regardless of which variables you may choose to use later on.

The project involves two datasets, `players.csv` and `sessions.csv`, which are summarized below.

#### Players Dataset (`players.csv`)
- Observations: There are 196 records.
- Variables:
  There are 9 variables as follows:
  - `experience` (object): Ordinal variable representing the levels of experience that have a natural order (e.g., "Amateur," "Regular," "Pro," "Veteran").
  - `subscribe` (bool): Indicates whether the player has subscribed (True/False).
  - `hashedEmail` (object): Anonymized identifier for each player.
  - `played_hours` (float): Represents the total number of hours the player has played.
  - `name` (object): Represents the player's name.
  - `gender` (object): Represents the gender of the player ("Male" or "Female").
  - `age` (int): Represents the age of the player.
  - `individualId` (float): Contains only null values (hence the float type) and appears to be unused.
  - `organizationName` (float): Contains only null values (hence the float type) and appears to be unused.

Potential issues:
- The fields `individualId` and `organizationName` contain only null values and may be unnecessary for analysis.
- The dataset does not provide information about how experience level or other demographics affect gameplay.
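
The all-null columns noted above can be found programmatically rather than by eye. This is a minimal sketch on a tiny made-up frame that reuses the column names from `players.csv`; the values and the `toy_players` name are illustrative assumptions, not the real data.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for players.csv (illustrative values only)
toy_players = pd.DataFrame({
    "age": [9, 17, 21],
    "played_hours": [30.3, 3.8, 0.7],
    "individualId": [np.nan, np.nan, np.nan],
    "organizationName": [np.nan, np.nan, np.nan],
})

# isna().all() is True for a column only when every value in it is missing
all_null_cols = toy_players.columns[toy_players.isna().all()].tolist()
print(all_null_cols)

# Dropping those columns leaves only the populated fields
populated = toy_players.drop(columns=all_null_cols)
print(populated.dtypes)
```

Applied to the real file, the same check would confirm whether these two fields are the only empty ones.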

#### Sessions Dataset (`sessions.csv`)
- Observations: There are 1535 records.
- Variables:
  There are 5 variables as follows:
  - `hashedEmail` (object): Anonymized identifier for each player. It is also present in the `players.csv` dataset.
  - `start_time` (object): Represents the start time of the playing session.
  - `end_time` (object): Represents the end time of the playing session (two values are missing).
  - `original_start_time` (float): Numeric timestamp corresponding to `start_time`.
  - `original_end_time` (float): Numeric timestamp corresponding to `end_time`.

Potential issues:
- The `end_time` field has missing values.
- The presence of both formatted datetime strings and timestamps for session times may require cleaning.
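
One possible way to clean the session times is to parse the day-first strings into proper datetimes and derive a session length. The sketch below uses a small made-up frame with the same column names as `sessions.csv`. The format string `%d/%m/%Y %H:%M` is inferred from how the values are displayed in this notebook, and `session_minutes` is a hypothetical helper column.

```python
import pandas as pd

# Tiny stand-in for sessions.csv (identifiers are made up; one end_time is missing)
toy_sessions = pd.DataFrame({
    "hashedEmail": ["a1", "b2", "c3"],
    "start_time": ["30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"],
    "end_time": ["30/06/2024 18:24", "17/06/2024 23:46", None],
})

# An explicit day-first format avoids 10/05/2024 being read as October 5th
fmt = "%d/%m/%Y %H:%M"
for col in ["start_time", "end_time"]:
    toy_sessions[col] = pd.to_datetime(toy_sessions[col], format=fmt)

# Missing end times become NaT, so the derived duration is NaN for those rows
toy_sessions["session_minutes"] = (
    (toy_sessions["end_time"] - toy_sessions["start_time"]).dt.total_seconds() / 60
)
print(toy_sessions)
```

Rows with a missing `end_time` stay visible in this form, so the decision to drop or keep them can be made explicitly.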

Data Collection Method:

This dataset originates from a real-world data science project led by a research group in Computer Science at UBC. The group set up a dedicated Minecraft server where players' in-game actions are meticulously tracked as they explore and interact within the game world. This setup allows researchers to gather extensive data on player behavior, engagement, and demographics in a controlled environment.

## 2. Question

Clearly state **one question** your group will try to answer using the selected dataset. Your question should involve one response variable of interest and one or more explanatory variables. Describe clearly how the data will help you address the question of interest. You may need to describe how you plan to wrangle your data to get it into a form where you can apply one of the predictive methods from this class.

**Research Question**: *How many hours would a player contribute given their age?*

#### Description:

To address this question, the response variable of interest is `played_hours`, which measures the total time a player has spent in the game, while the primary explanatory variable is `age`, representing the player's age in years. The objective is to understand if there's a predictive relationship between a player's age and their total playtime, which could reveal trends in engagement across different age groups.

#### Data Wrangling and Predictive Approach

To analyze this relationship, the dataset will undergo the following steps:

- Data Cleaning: Check the `played_hours` and `age` fields for missing values and for outliers that could skew results.
- Data Transformation: If necessary, transform `age` into age groups (e.g., teens, young adults, adults) to see if broader age bands provide more robust insights into playtime trends.
- Predictive Modeling: Using linear regression, k-NN regression, or another appropriate predictive method, we will model the relationship between `age` and `played_hours`. This will allow us to predict playtime for players of different ages and analyze any potential patterns.
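
The age-group transformation mentioned above could be sketched with `pd.cut`. The band edges and labels below are hypothetical choices made only for illustration, and the rows are a small made-up sample rather than the full dataset.

```python
import pandas as pd

# Small illustrative sample (not the real dataset)
toy_players = pd.DataFrame({
    "age": [9, 17, 17, 21, 22, 35],
    "played_hours": [30.3, 3.8, 0.0, 0.7, 0.3, 1.2],
})

# Hypothetical band edges; right=False makes each bin [lower, upper)
bins = [0, 13, 20, 30, 120]
labels = ["child", "teen", "young adult", "adult"]
toy_players["age_group"] = pd.cut(
    toy_players["age"], bins=bins, labels=labels, right=False
)

# Count and median playtime per band; the median is less sensitive to extreme values
summary = (
    toy_players.groupby("age_group", observed=True)["played_hours"]
    .agg(["count", "median"])
)
print(summary)
```

Reporting the count next to the median makes it obvious when a band is too sparsely populated to support a conclusion.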

By analyzing this data, we hope to provide the stakeholders with insights on which age groups tend to be the most engaged in terms of hours played, aiding in recruitment and resource allocation for the server.

## 3. Exploratory Data Analysis and Visualization

In this assignment, you will:

- Demonstrate that the dataset can be loaded into R.
- Do the **minimum necessary** wrangling to turn your data into a tidy format. Do not do any additional wrangling here; that will happen later during the group project phase.
- Make a few exploratory visualizations of the data to help you understand it.
  - Use our visualization best practices to make high-quality plots (make sure to include labels, titles, units of measurement, etc)
  - Explain any insights you gain from these plots that are relevant to address your question

**Note:** do not perform any predictive analysis here. We are asking for an exploration of the relevant variables to demonstrate that you understand them well before performing any additional modelling, and to identify potential problems you anticipate encountering.

In [2]:
# Load both datasets (paths are relative to the notebook location)
players_df = pd.read_csv('data/players.csv')
sessions_df = pd.read_csv('data/sessions.csv')

# Keep only the columns needed for the question, then drop incomplete and duplicate rows
players_df = players_df[['hashedEmail', 'age', 'played_hours', 'experience']].dropna().drop_duplicates()
sessions_df = sessions_df[['hashedEmail', 'start_time', 'end_time']].dropna().drop_duplicates()

display(players_df)

Unnamed: 0,hashedEmail,age,played_hours,experience
0,f6daba428a5e19a3d47574858c13550499be23603422e6...,9,30.3,Pro
1,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,17,3.8,Veteran
2,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,17,0.0,Veteran
3,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,21,0.7,Amateur
4,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,21,0.1,Regular
...,...,...,...,...
191,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,17,0.0,Amateur
192,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,22,0.3,Veteran
193,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,17,0.0,Amateur
194,f19e136ddde68f365afc860c725ccff54307dedd13968e...,17,2.3,Amateur


In [3]:
display(sessions_df)

Unnamed: 0,hashedEmail,start_time,end_time
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,30/06/2024 18:12,30/06/2024 18:24
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,17/06/2024 23:33,17/06/2024 23:46
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,25/07/2024 17:34,25/07/2024 17:57
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,25/07/2024 03:22,25/07/2024 03:58
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,25/05/2024 16:01,25/05/2024 16:12
...,...,...,...
1530,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,10/05/2024 23:01,10/05/2024 23:07
1531,7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e...,01/07/2024 04:08,01/07/2024 04:19
1532,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,28/07/2024 15:36,28/07/2024 15:57
1533,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,25/07/2024 06:15,25/07/2024 06:22


In [4]:
# Sanity check: after the dropna() call above, the key columns should contain no missing values
missing_players = players_df[['age', 'played_hours']].isnull().sum()
print("Missing values in key columns:")
print(missing_players)

Missing values in key columns:
age             0
played_hours    0
dtype: int64


### Visualizations

In [5]:
# altair is already imported as alt in the first cell

played_hours_chart = alt.Chart(players_df).mark_bar().encode(
    alt.X('played_hours:Q', bin=alt.Bin(maxbins=20), title='Played Hours (hours)'),
    alt.Y('count()', title='Frequency')
).properties(
    title='Distribution of Played Hours',
    width=400,
    height=300
)

age_distribution_chart = alt.Chart(players_df).mark_bar().encode(
    alt.X('age:Q', bin=alt.Bin(maxbins=10), title='Age (years)'),
    alt.Y('count()', title='Frequency')
).properties(
    title='Age Distribution of Players',
    width=400,
    height=300
)

scatter_chart = alt.Chart(players_df).mark_circle(size=60).encode(
    alt.X('age:Q', title='Age (years)'),
    alt.Y('played_hours:Q', title='Played Hours (hours)'),
    tooltip=['age', 'played_hours']
).properties(
    title='Scatter Plot of Age vs. Played Hours',
    width=400,
    height=300
)

played_hours_chart & age_distribution_chart & scatter_chart

### Insights from Exploratory Visualizations

#### Distribution of Played Hours (played_hours_chart):
- The histogram for `played_hours` reveals that most players have relatively low playtimes, with a sharp drop-off as play hours increase. This indicates that a large portion of players may engage with the game casually, while only a small subset contributes a significant number of hours.
- This insight is relevant to the question because it suggests that age might influence a tendency toward casual or intensive play, which could impact predictions about playtime based on age.

#### Age Distribution of Players (age_distribution_chart):
- The age distribution shows a mix of player ages, though some age groups may be more represented than others. This diversity across age ranges is beneficial for our analysis, as it provides a broad basis to examine how age might relate to playtime.
- A well-distributed age range supports the exploration of whether specific age groups, such as younger or older players, are more likely to log higher hours, which is directly relevant to predicting playtime from age.

#### Scatter Plot of Age vs. Played Hours (scatter_chart):
- The scatter plot suggests no strong, immediate correlation between `age` and `played_hours`, with points appearing relatively scattered. This lack of clear linear correlation implies that while age might influence playtime, other factors could also play a significant role.
- This observation is essential as it indicates that a simple linear relationship might not fully capture the complexity of playtime behavior. Age could interact with other variables (like experience level or subscription status) to affect playtime, hinting at a more nuanced relationship that may need further analysis.

## 4. Methods and Plan

Propose one method to address your question of interest using the selected dataset and explain why it was chosen. **Do not** perform any modelling or present results at this stage. We are looking for high-level planning regarding model choice and justifying that choice.

In your explanation, respond to the following questions:

- Why is this method appropriate?
- Which assumptions are required, if any, to apply the method selected?
- What are the potential limitations or weaknesses of the method selected?
- How are you going to compare and select the model?
- How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?

#### Method
To address the question, "How many hours would a player contribute given their age?", linear regression is an appropriate starting point. Linear regression allows us to model the relationship between a continuous response variable (`played_hours`) and a single explanatory variable (`age`). If exploratory analysis or further tests show a nonlinear relationship, we could consider polynomial regression, where higher-order terms of `age` help capture the curvature.

#### Why This Method Is Appropriate:
Linear regression (or polynomial regression) is suitable because:
- Interpretability: The coefficients are interpretable, allowing us to understand how age contributes to playtime.
- Simplicity: This approach provides a straightforward relationship between age and hours, avoiding overfitting given the limited number of features.
- Flexibility: If linear regression shows inadequacies, polynomial regression allows flexibility by adding non-linear terms while maintaining interpretability.

#### Assumptions Required:
- The relationship between age and playtime should ideally be linear. Polynomial regression relaxes this by allowing curvilinear fits.
- Each observation (player) should be independent, which should generally hold true here.
- Errors should ideally follow a normal distribution. Departures from normality mainly affect confidence intervals and hypothesis tests rather than the point predictions themselves.
- The variance in playtime should be constant across age levels. If this assumption is violated, additional transformations or model adjustments might be necessary.
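
The constant-variance assumption can be checked informally by comparing residual spread across the age range. The sketch below uses synthetic data that is deliberately generated with noise growing with age, so it shows what a violation looks like. It says nothing about the real dataset, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: noise standard deviation grows with age (heteroscedastic by design)
age = rng.integers(8, 50, size=200).astype(float)
hours = 1.0 + 0.1 * age + rng.normal(0, 0.05 * age)

# Ordinary least-squares straight line: hours ~ intercept + slope * age
slope, intercept = np.polyfit(age, hours, deg=1)
residuals = hours - (intercept + slope * age)

# Compare residual spread in the younger and older halves of the sample
cutoff = np.median(age)
young_sd = residuals[age < cutoff].std()
old_sd = residuals[age >= cutoff].std()
print(f"residual std (younger half): {young_sd:.2f}")
print(f"residual std (older half):   {old_sd:.2f}")
```

A clearly larger spread in one half, as this synthetic example is built to produce, would suggest transforming the response or using a method that does not assume constant variance.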

#### Potential Limitations:
- Overfitting in Polynomial Regression: Adding higher-order terms could lead to overfitting, especially if playtime is affected by more than age alone.
- Limited Feature Scope: Since playtime may be influenced by factors beyond age, the model might miss out on other significant predictors (e.g., experience level).
- Sensitivity to Outliers: Extreme values in `played_hours` could disproportionately influence the fitted line, because least squares penalizes large errors quadratically. This will require careful outlier analysis and possibly transformations.
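
As one way to start the outlier analysis, the common 1.5 × IQR rule can flag unusually large playtimes. This is only a sketch on a few illustrative values. The 1.5 multiplier is a convention rather than something prescribed by the dataset, and flagged points should be inspected, not deleted automatically.

```python
import numpy as np

# A handful of illustrative playtimes in hours (not the full dataset)
hours = np.array([0.0, 0.1, 0.3, 0.7, 2.3, 3.8, 30.3])

# Interquartile range and the conventional upper fence
q1, q3 = np.percentile(hours, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the fence are candidates for closer inspection
flagged = hours[hours > upper_fence]
print(f"upper fence: {upper_fence:.3f}")
print("flagged values:", flagged)
```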

#### Model Selection and Comparison:
To evaluate the model, I plan to:
- Train/Test Split: Split the data into training and test sets (e.g., 80% for training and 20% for testing). The split will occur after initial data wrangling but before model fitting.
- Cross-Validation: Use k-fold cross-validation (e.g., 5-fold) to validate the model’s performance across different subsets of data, ensuring robustness.
- Metrics: Evaluate model performance using RMSE (Root Mean Squared Error) as the primary metric, assessing how well the model predicts playtime based on age.
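
The evaluation workflow above could look roughly like the following. The sketch assumes scikit-learn, which this report does not name, and runs on synthetic data only, so it produces no results for the actual project. The seeds and variable names are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-ins for age (X) and played_hours (y)
X = rng.integers(8, 50, size=(200, 1)).astype(float)
y = 2.0 + 0.05 * X[:, 0] + rng.normal(0, 1.0, size=200)

# 80/20 split made before any model fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# 5-fold cross-validation on the training set only; scikit-learn reports negated RMSE
model = LinearRegression()
cv_scores = cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
)
cv_rmse = -cv_scores.mean()

# Final fit on the full training set, then a single evaluation on the held-out test set
model.fit(X_train, y_train)
test_rmse = float(np.sqrt(np.mean((model.predict(X_test) - y_test) ** 2)))
print(f"CV RMSE: {cv_rmse:.2f}, test RMSE: {test_rmse:.2f}")
```

Keeping the test set untouched until the last step means the reported test RMSE is not influenced by any model-selection decisions.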

#### Data Processing Plan:
- Splitting Data: Data will be split into training and test sets with an 80/20 proportion. The split will occur before applying any transformations to avoid data leakage.
- Normalization/Transformation: We will evaluate whether transformations (e.g., a log transformation of `played_hours`) are necessary to meet regression assumptions, especially if heteroscedasticity or non-normal error patterns are observed.
- Cross-Validation: Implement 5-fold cross-validation on the training set to ensure stable performance before testing on the holdout set.
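
Because some players have zero recorded hours, a plain logarithm is undefined for part of the data. One option, sketched here as an assumption rather than a settled choice, is `log1p`, which computes log(1 + x) and can be inverted with `expm1`. The values below are illustrative.

```python
import numpy as np

# Illustrative playtimes, including a zero
played_hours = np.array([0.0, 0.1, 0.7, 3.8, 30.3])

# log1p(x) = log(1 + x) is defined at 0, unlike log(x)
log_hours = np.log1p(played_hours)

# expm1 inverts log1p, so predictions made on the log scale can be mapped back to hours
recovered = np.expm1(log_hours)
print(np.round(log_hours, 3))
print(np.round(recovered, 3))
```

If a model is fitted on the transformed scale, RMSE should be computed after mapping predictions back to hours so that it stays comparable with untransformed models.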