In [1]:
import altair as alt
import pandas as pd

In [2]:
players = pd.read_csv("data/players.csv")
sessions = pd.read_csv("data/sessions.csv")

**(1) Data Set Description:**
* Players:
    * **Observations**: 196 -> All the players that signed up
    * **Variables**: 9
        * **Experience** (categorical): The player's experience level (Beginner, Amateur, Regular, Veteran, Pro)
        * **Subscribe** (boolean): A boolean indicating whether the user has subscribed (TRUE/FALSE)
        * **Hashed Email** (string): A unique identifier for each user
        * **Played hours** (numerical): The number of hours a user has spent engaging with the application
        * **Name** (string): the user's name
        * **Gender** (categorical): The user's gender
        * **Age** (numerical): The user's age
        * **Individual Id** (string): (NULL)
        * **Organization Name** (string): (NULL)
    * Method of data collection: This data was collected from the students of DSCI 100, and most of the values are self-reported.
    * Potential limitations:
        * Missing data: Every value in the "Individual Id" and "Organization Name" columns is NULL, so these variables cannot be used in any analysis.
        * False or inaccurate values: In the "Age" column, some users have recorded ages as low as 8, which may not be reliable or consistent with the ages of the people in this class.
* Sessions:
    * **Observations**: 1535 -> recorded sessions
    * **Variables**: 5
        * **Hashed Email** (string): A unique identifier for each user
        * **Start Time** (string): Session's start time (DD/MM/YYYY Time)
        * **End Time** (string): Session's end time (DD/MM/YYYY Time)
        * **Original Start Time** (numerical): The session's start time represented as a floating-point number
        * **Original End Time** (numerical): The session's end time represented as a floating-point number
    * Method of data collection: Collected through a logging system that records the start and end time of each participant's session.
    * Potential limitations:
        * No qualitative data: This dataset has only session times, which limits the context available for user behavior analysis.
        * Skewed data distribution: Because one player can have many session rows, a few heavy users (or one experience level) could dominate usage statistics, depending on what we are analyzing.
* Overall data set observation:
    * Interlink: Having a common variable between the two datasets, "Hashed Email", is very useful because it lets us join the datasets and extend our analysis across sessions and player characteristics.
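
The join described above could be sketched as follows. This is a minimal illustration on made-up rows, not the real data: the names `players_demo`, `sessions_demo`, and `duration_min` are hypothetical, and the session column names (`start_time`, `end_time`) and the exact timestamp format (`%d/%m/%Y %H:%M`) are assumptions based on the description above.

```python
import pandas as pd

# Toy stand-ins for the two tables; all values are made up for illustration.
players_demo = pd.DataFrame({
    "hashedEmail": ["a1", "b2", "c3"],
    "age": [17, 21, 23],
    "played_hours": [3.8, 0.7, 1.6],
})
sessions_demo = pd.DataFrame({
    "hashedEmail": ["a1", "a1", "b2"],
    "start_time": ["30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"],
    "end_time": ["30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57"],
})

# Parse the day-first timestamp strings (format is an assumption).
for col in ["start_time", "end_time"]:
    sessions_demo[col] = pd.to_datetime(sessions_demo[col], format="%d/%m/%Y %H:%M")

# Session length in minutes.
sessions_demo["duration_min"] = (
    sessions_demo["end_time"] - sessions_demo["start_time"]
).dt.total_seconds() / 60

# Left join keeps every session and attaches that player's characteristics.
# validate="many_to_one" raises if a hashedEmail is duplicated in players_demo.
joined = sessions_demo.merge(
    players_demo, on="hashedEmail", how="left", validate="many_to_one"
)

# Number of recorded sessions per player.
sessions_per_player = joined.groupby("hashedEmail").size()
```

A left join from sessions to players keeps one row per session, so players with many sessions appear many times; that is the source of the skew noted under the Sessions limitations.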

**(2) Question:**
* Our question (corresponding to general statement 3): "How many hours would a player contribute given their “age”?"
* Response variable: **Played Hours**
    * This variable represents the total hours a player has spent engaging with the game. It will be the response variable because it quantifies the level of contribution made by each player.
* Explanatory Variable: **Age**
    * Age will serve as the primary explanatory variable to examine how it influences the hours played by a user.
* Answering the question:
    * I will use past data on the users' played hours to explore patterns that may indicate a relationship or correlation between the variables, and to assess whether age is a strong predictor of hours played.
* Possible methods applied to our datasets:
    * Filtering the datasets to focus only on users' age and played hours. Within this step, I will also remove any null values so our data stays consistent.
    * Grouping data by age ranges to capture general patterns across similar age groups instead of focusing on all 193 individual observations.
    * Transforming "played hours" if its distribution is skewed. I may have to use a log transformation for better model performance.
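
The three steps above could look roughly like this. It is a sketch on made-up rows: the frame `demo`, the age bin edges, and the column names `age_group` and `log_hours` are illustrative assumptions, and `np.log1p` is one possible choice of log transformation (it keeps zero-hour players defined).

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; one age is missing on purpose.
demo = pd.DataFrame({
    "age": [17, 21, 21, 17, 23, None, 35],
    "played_hours": [3.8, 0.7, 0.1, 0.1, 1.6, 2.0, 0.0],
})

# Step 1: keep only the two variables of interest and drop null values.
subset = demo[["age", "played_hours"]].dropna()

# Step 2: group ages into ranges (bin edges are an arbitrary example).
bins = [0, 18, 25, 40]
labels = ["<=18", "19-25", "26-40"]

# Step 3: log-transform the skewed response; log1p(x) = log(1 + x).
subset = subset.assign(
    age_group=pd.cut(subset["age"], bins=bins, labels=labels),
    log_hours=np.log1p(subset["played_hours"]),
)

# Count and mean hours per age range.
group_summary = subset.groupby("age_group", observed=True)["played_hours"].agg(
    ["count", "mean"]
)
```

Binning trades detail for stability: each age range pools several players, so a single unusual player has less influence on the group mean.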

**(3) Exploratory Data Analysis and Visualization**
* After filtering and wrangling the data, the visualizations show little to no linear relationship between the variables, so linear regression would not be appropriate for this data. Moreover, the range of "played hours" is so large that we may have to remove outliers and apply a slight log transformation to see the differences between points more clearly.


In [45]:
# Keep the relevant columns, then drop zero-hour players and outliers (10+ hours, age 40+).
players_filtered = players[["hashedEmail", "experience", "gender", "name", "age", "played_hours"]]
players_filtered = players_filtered[(players_filtered["played_hours"] > 0) & (players_filtered["played_hours"] < 10)]
players_filtered = players_filtered.loc[players_filtered["age"] < 40]
players_filtered

Unnamed: 0,hashedEmail,experience,gender,name,age,played_hours
1,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,Veteran,Male,Christian,17,3.8
3,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,Amateur,Female,Flora,21,0.7
4,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,Regular,Male,Kylie,21,0.1
8,8b71f4d66a38389b7528bb38ba6eb71157733df7d17403...,Amateur,Male,Natalie,17,0.1
10,7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e...,Veteran,Female,Lane,23,1.6
...,...,...,...,...,...,...
182,f7875ae87a61632030d5c4029ee8cf081be7047b2b4a9c...,Pro,Male,Liam,17,0.2
184,d46bd29a2ed08e3500bd8729085ef4b6f0ca65baf4c756...,Pro,Male,Asher,17,1.7
185,8e98b6db2053af0bc0e62cd55bcea5a08f23986dec3d02...,Regular,Male,Sam,18,0.1
192,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,Veteran,Male,Pascal,22,0.3


In [46]:
players_scatter_plot = alt.Chart(players_filtered).mark_point(opacity = 0.4).encode(
    x = alt.X("age").scale(zero = False).title("Player's Age"),
    y = alt.Y("played_hours").title("Hours played (hrs)"),
    tooltip=alt.Tooltip(["experience", "gender"])
)
players_scatter_plot

In [48]:
players_line_plot = alt.Chart(players_filtered).mark_line(opacity = 0.4).encode(
    x = alt.X("age").scale(zero = False).title("Player's Age"),
    y = alt.Y("played_hours").title("Hours played (hrs)"),
)

players_line_plot

**Method - k-Nearest Neighbors (kNN):**

We propose using a kNN regression model to predict player hours based on age. Since our analysis indicates no clear linear relationship, kNN is a reasonable choice because it can model complex, non-linear patterns. This non-parametric model predicts the continuous variable of hours contributed without assuming linearity, addressing our question of "How many hours will a player contribute given their age?"
* **Assumptions:** kNN does not require the data to follow a specific distribution, which suits our dataset. It does assume that players of similar age play similar hours, so that nearby points are meaningful neighbors, and the data must be sufficiently dense across the age range for accurate predictions.
* **Limitations:** kNN is sensitive to noise and outliers, which can reduce accuracy. Choosing k is crucial: too small a k risks overfitting, while too large a k risks underfitting.
* **Evaluation:** We will determine the optimal k through cross-validation and measure prediction error using Mean Squared Error (MSE), comparing it with a linear regression baseline despite initial evidence that linearity is weak.
* **Data Processing:** A 70-30 train-test split will be used, fitting the model on the training portion and holding out the test portion for an unbiased evaluation. Five-fold cross-validation on the training set will be used to select k reliably.
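
The plan above could be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the final analysis: scikit-learn is assumed as the modeling library, the data are randomly generated stand-ins (the real notebook would use `players_filtered`), and the candidate range of k (1 to 30) and the random seeds are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption): ages 9-39 and skewed played hours.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"age": rng.integers(9, 40, size=120)})
demo["played_hours"] = rng.exponential(scale=1.5, size=120)

X = demo[["age"]]          # single predictor, so no scaling is needed
y = demo["played_hours"]

# 70-30 train-test split; the test set is held out until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Five-fold cross-validation over candidate k values, on the training set only.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 31))},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)
best_k = search.best_params_["n_neighbors"]

# Test-set MSE for the tuned kNN model (refit on the full training set).
knn_mse = mean_squared_error(y_test, search.predict(X_test))

# Linear regression baseline evaluated on the same test set.
baseline = LinearRegression().fit(X_train, y_train)
lin_mse = mean_squared_error(y_test, baseline.predict(X_test))
```

Comparing `knn_mse` with `lin_mse` on the same held-out set shows whether the extra flexibility of kNN actually improves predictions over the linear baseline.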
