# Part 1: Data Description

Here I will provide a summary of both datasets, including:
- The number of observations.
- A description of each variable and its data type.
- Potential data issues that may need to be addressed.





In [11]:
import pandas as pd


sessions_df = pd.read_csv("data/sessions.csv")
players_df = pd.read_csv("data/players.csv")


print("Sessions Data Information:")
sessions_df.info()
print("Players Data Information")
players_df.info()


Sessions Data Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1535 entries, 0 to 1534
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   hashedEmail          1535 non-null   object 
 1   start_time           1535 non-null   object 
 2   end_time             1533 non-null   object 
 3   original_start_time  1535 non-null   float64
 4   original_end_time    1533 non-null   float64
dtypes: float64(2), object(3)
memory usage: 60.1+ KB
Players Data Information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   experience        196 non-null    object 
 1   subscribe         196 non-null    bool   
 2   hashedEmail       196 non-null    object 
 3   played_hours      196 non-null    float64
 4   name              196 non-null    object 
 5   g

## Step 2: Summary of data

### Dataset 1: Sessions data
hashedEmail:
- A hashed identifier representing the player.

start_time:
- The start time of the session, stored as a string. It represents when a player started their session on the server.

end_time:
- The end time of the session, stored as a string. It represents when a player ended their session on the server.

original_start_time:
- A numeric (float64) timestamp for the session's start time. The magnitude of the values suggests Unix epoch milliseconds.

original_end_time:
- A numeric (float64) timestamp for the session's end time, in the same format as original_start_time.
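
The numeric timestamps can be turned into readable datetimes. The sketch below assumes the values are Unix epoch milliseconds (an assumption based on their magnitude, not confirmed by the data source) and uses two made-up values in the same range as the column.

```python
import pandas as pd

# Hypothetical values in the same range as original_start_time
# (assumed to be milliseconds since 1970-01-01 UTC).
sample = pd.DataFrame({"original_start_time": [1.723080e12, 1.725920e12]})

# unit="ms" tells pandas to interpret the numbers as epoch milliseconds.
sample["start_utc"] = pd.to_datetime(
    sample["original_start_time"], unit="ms", utc=True
)
print(sample)
```

If the converted values line up with the start_time strings (allowing for a time zone offset), that supports the milliseconds interpretation.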



In [12]:
total_sessions_observations = len(sessions_df)
print(f"Total number of observations in sessions_df: {total_sessions_observations}")

sess_dict = {
    "Column Name": sessions_df.columns,
    "Data Type": sessions_df.dtypes.values,
    "Count": sessions_df.count().values,
}

sessions_summary_table = pd.DataFrame(sess_dict)
sessions_summary_table


Total number of observations in sessions_df: 1535


|   | Column Name | Data Type | Count |
|---|---|---|---|
| 0 | hashedEmail | object | 1535 |
| 1 | start_time | object | 1535 |
| 2 | end_time | object | 1533 |
| 3 | original_start_time | float64 | 1535 |
| 4 | original_end_time | float64 | 1533 |


#### Potential and Identified issues in Sessions data

Timezone Mismatches:
- If players in different time zones use the server, start_time and end_time might not be standardized, leading to errors in duration calculations.

Player Behavior Anomalies:
- If some sessions are extremely short or long, it could indicate anomalies like accidental log-ins or players forgetting to log out.

Missing Data:
- The end_time and original_end_time columns each have 2 missing values.
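
These issues can be checked once the time strings are parsed. The following is a minimal sketch on made-up rows that mimic the sessions.csv layout. The day/month/year format is inferred from values such as 23/08/2024, and the 1-minute and 12-hour cut-offs are arbitrary illustrations.

```python
import pandas as pd

# Hypothetical rows mimicking the sessions.csv layout (day/month/year strings).
sessions = pd.DataFrame({
    "start_time": ["08/08/2024 00:21", "23/08/2024 21:59", "09/09/2024 02:17"],
    "end_time": ["08/08/2024 01:35", "23/08/2024 22:06", None],
})

# An explicit format avoids day/month ambiguity (08/08 vs 23/08).
fmt = "%d/%m/%Y %H:%M"
for col in ["start_time", "end_time"]:
    sessions[col] = pd.to_datetime(sessions[col], format=fmt)

# Session length in minutes; a missing end_time yields NaN.
sessions["duration_min"] = (
    sessions["end_time"] - sessions["start_time"]
).dt.total_seconds() / 60

n_missing_end = int(sessions["end_time"].isna().sum())

# Flag very short or very long sessions (thresholds are arbitrary).
suspicious = sessions[
    (sessions["duration_min"] < 1) | (sessions["duration_min"] > 12 * 60)
]
print(sessions)
print("Missing end_time:", n_missing_end)
```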

### Dataset 2: Players data

experience:
- The experience level of the player, which indicates the player's skill.

subscribe:
- A Boolean value indicating whether the player is a subscriber (True/False).

hashedEmail:
- A hashed identifier for the player. It is the key that allows the players data to be merged with the sessions data for analysis.

played_hours:
- The total number of hours the player has spent playing the game.

name:
- The player's name.

gender:
- The gender of the player.

age:
- The age of the player.

individualId:
- This column has no data (all values are missing).

organizationName:
- This column also has no data (all values are missing). It may have been intended for players associated with certain organizations.

In [13]:
total_players_observations = len(players_df)
print(f"Total number of observations in players_df: {total_players_observations}")

play_dict = {
    "Column Name": players_df.columns,
    "Data Type": players_df.dtypes.values,
    "Count": players_df.count().values
}

players_summary_table = pd.DataFrame(play_dict)

players_summary_table


Total number of observations in players_df: 196


|   | Column Name | Data Type | Count |
|---|---|---|---|
| 0 | experience | object | 196 |
| 1 | subscribe | bool | 196 |
| 2 | hashedEmail | object | 196 |
| 3 | played_hours | float64 | 196 |
| 4 | name | object | 196 |
| 5 | gender | object | 196 |
| 6 | age | int64 | 196 |
| 7 | individualId | float64 | 0 |
| 8 | organizationName | float64 | 0 |


#### Potential and Identified issues in Players data

Empty Columns:
- individualId and organizationName are entirely empty (0 non-null values), so they carry no information and can be dropped.

Data Consistency:
- Some columns like experience and gender are self-reported, which may lead to inconsistencies.

Zero Values in played_hours:
- Some players have zero hours played. This could indicate new players, incomplete data, or errors in logging.
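
Each of these checks is a one-liner in pandas. The sketch below runs them on a small made-up table with the same column names as players.csv. The values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for players.csv.
players = pd.DataFrame({
    "experience": ["Pro", "Amateur", "Regular", "Amateur"],
    "gender": ["Male", "Female", "Male", "Non-binary"],
    "played_hours": [30.3, 0.0, 2.3, 0.0],
    "individualId": [np.nan] * 4,
    "organizationName": [np.nan] * 4,
})

# Columns in which every value is missing.
empty_cols = players.columns[players.isna().all()].tolist()

# Number of players with exactly zero recorded hours.
n_zero_hours = int((players["played_hours"] == 0).sum())

# Category counts help spot inconsistent labels (e.g. spelling variants).
experience_counts = players["experience"].value_counts()

print(empty_cols, n_zero_hours)
print(experience_counts)
```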

# Part 2: Question

### The primary question I will answer is Question 1

### Response Variable
- The response variable will be "played_hours".

### Explanatory Variables
The following variables from the players_df dataset will be used to predict played_hours:

- experience: This indicates the player's skill level. 
  - Experienced players may be more likely to play longer and generate more data.

- subscribe:
  - Subscription status (True/False) could be a strong predictor, as subscribers are likely to play more.
    
- age:
  - Age of the player could affect playtime, as younger or older age groups might have different habits.

- gender:
  - Gender may have an influence on playtime trends.

### Data Wrangling:

- Remove the columns individualId and organizationName, which contain only missing values.
- Convert the categorical variables (experience, subscribe, gender) into numeric format so they can be used in prediction models.

- Merging Datasets:
  - Merge the players_df and sessions_df tables using the hashedEmail column to combine player info with session data.
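
The first two wrangling steps could look like the sketch below. It uses a made-up table with the players.csv column names, and one-hot encoding via `pd.get_dummies` is only one possible way to make the categories numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for players.csv.
players = pd.DataFrame({
    "experience": ["Pro", "Amateur", "Regular"],
    "subscribe": [True, False, True],
    "gender": ["Male", "Female", "Male"],
    "age": [9, 17, 21],
    "played_hours": [30.3, 2.3, 0.0],
    "individualId": [np.nan] * 3,
})

# Drop every column that contains only missing values.
clean = players.dropna(axis="columns", how="all")

# One-hot encode the text categories; store the Boolean as 0/1.
encoded = pd.get_dummies(clean, columns=["experience", "gender"], dtype=int)
encoded["subscribe"] = encoded["subscribe"].astype(int)

print(encoded)
```

One-hot encoding avoids imposing an artificial order on categories such as gender. If experience is treated as ordered, an ordinal mapping would be an alternative.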

### Analysis

- Because played_hours is continuous, this is a regression problem. I will use a regression model, either Linear Regression or K-Nearest Neighbors regression, to predict played_hours from the selected explanatory variables.
  
- Assess which features are most influential in predicting high playtime, helping us identify the kinds of players who contribute the most data.

### How the Data Will Help Address the Question
The players_df dataset provides detailed information about player demographics, experience, and play habits. By analyzing this data, we can identify trends and patterns among players. This will help identify the specific player groups that recruitment efforts should target, giving a higher likelihood of attracting players who contribute a large amount of data.



# Part 3: Exploratory Data Analysis and Visualization

## Importing Libraries and Loading Data

In [14]:
import altair as alt

sessions_df = pd.read_csv("data/sessions.csv")
players_df = pd.read_csv("data/players.csv")


## Merge the datasets

In [15]:
merged_df = pd.merge(players_df, sessions_df, on='hashedEmail', how='inner')
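
An inner merge silently drops players who have no sessions and repeats each player's row once per session. The sketch below, on made-up data, shows one way to make both effects visible. It uses a left merge with `indicator=True` purely for the check; the notebook itself keeps the inner merge.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
players = pd.DataFrame({
    "hashedEmail": ["a", "b", "c"],
    "played_hours": [30.3, 2.3, 0.0],
})
sessions = pd.DataFrame({
    "hashedEmail": ["a", "a", "b"],
    "start_time": ["t1", "t2", "t3"],
})

# validate raises an error if hashedEmail is not unique in players;
# indicator adds a _merge column showing where each row came from.
merged = pd.merge(
    players, sessions, on="hashedEmail", how="left",
    validate="one_to_many", indicator=True,
)

# Players that an inner merge would have dropped.
players_without_sessions = merged.loc[
    merged["_merge"] == "left_only", "hashedEmail"
].tolist()

# Number of rows each player contributes after merging.
rows_per_player = merged.groupby("hashedEmail").size()
print(players_without_sessions)
print(rows_per_player)
```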


## Data Cleaning

In [16]:
merged_df.drop(columns=['individualId', 'organizationName'], errors = "ignore", inplace=True)
merged_df.dropna(subset=['played_hours', 'experience', 'subscribe', 'age'], inplace=True)
merged_df

|   | experience | subscribe | hashedEmail | played_hours | name | gender | age | start_time | end_time | original_start_time | original_end_time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | 08/08/2024 00:21 | 08/08/2024 01:35 | 1.723080e+12 | 1.723080e+12 |
| 1 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | 09/09/2024 22:30 | 09/09/2024 22:37 | 1.725920e+12 | 1.725920e+12 |
| 2 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | 08/08/2024 02:41 | 08/08/2024 03:25 | 1.723080e+12 | 1.723090e+12 |
| 3 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | 10/09/2024 15:07 | 10/09/2024 15:29 | 1.725980e+12 | 1.725980e+12 |
| 4 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6... | 30.3 | Morgan | Male | 9 | 05/05/2024 22:21 | 05/05/2024 23:17 | 1.714950e+12 | 1.714950e+12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1530 | Amateur | False | f19e136ddde68f365afc860c725ccff54307dedd13968e... | 2.3 | Harlow | Male | 17 | 23/08/2024 21:59 | 23/08/2024 22:06 | 1.724450e+12 | 1.724450e+12 |
| 1531 | Amateur | False | f19e136ddde68f365afc860c725ccff54307dedd13968e... | 2.3 | Harlow | Male | 17 | 09/09/2024 02:17 | 09/09/2024 02:45 | 1.725850e+12 | 1.725850e+12 |
| 1532 | Amateur | False | f19e136ddde68f365afc860c725ccff54307dedd13968e... | 2.3 | Harlow | Male | 17 | 23/08/2024 21:39 | 23/08/2024 21:53 | 1.724450e+12 | 1.724450e+12 |
| 1533 | Amateur | False | f19e136ddde68f365afc860c725ccff54307dedd13968e... | 2.3 | Harlow | Male | 17 | 08/09/2024 19:40 | 08/09/2024 19:45 | 1.725820e+12 | 1.725820e+12 |


## Plot 1: Distribution of Played Hours

In [17]:
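# merged_df has one row per session, so keep one row per player
# before counting players; otherwise count() would count sessions.
players_in_merged = merged_df.drop_duplicates(subset="hashedEmail")

played_hours_chart = alt.Chart(players_in_merged).mark_bar().encode(
    x=alt.X("played_hours").bin(alt.Bin(maxbins=20)).title("Played Hours"),
    y=alt.Y("count()").title("Number of Players")
).properties(
    title="Distribution of Played Hours"
)

played_hours_chart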




### Insights:
- This plot shows how playtime (played_hours) is distributed among all players.

- The majority of players have low playtime, clustered between 0 and 50 hours, with a large gap between 50 and 140 hours. The number of players drops noticeably as played hours increase.

- Most active players log relatively few hours, while a smaller but still notable group has played a great deal, with few players in between.

## Plot 2: Average Played Hours by Experience Level

In [18]:

experience_chart = alt.Chart(merged_df).mark_bar().encode(
    x=alt.X("experience").title("Experience Level"),
    y=alt.Y("mean(played_hours)").title("Average Played Hours"),
    color="experience"
).properties(
    title="Average Played Hours by Experience Level"
).facet(
    column="experience"
)

experience_chart





### Insights

  - This plot compares the average playtime across the different experience levels (e.g., Pro, Veteran, Regular, Amateur, Beginner).
  - Because merged_df contains one row per session, these averages weight each player by their number of sessions.
  - "Amateur" and "Regular" players have the highest average playtime, with "Regular" dominating.
  - "Veteran" and "Beginner" players have the lowest average playtime. This suggests that experience level is not necessarily strongly correlated with playtime.
  - Focusing recruitment efforts on "Regular" players could yield more data. Understanding what keeps "Regular" players engaged could also help increase playtime for other experience levels and support recruiting efforts.
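
The chart above averages played_hours over the rows of merged_df, which holds one row per session. A minimal sketch, using made-up numbers, of how such a session-weighted mean can differ from a per-player mean:

```python
import pandas as pd

# Hypothetical merged table: player "a" has three sessions,
# players "b" and "c" have one session each.
merged = pd.DataFrame({
    "hashedEmail": ["a", "a", "a", "b", "c"],
    "experience": ["Regular", "Regular", "Regular", "Regular", "Pro"],
    "played_hours": [30.0, 30.0, 30.0, 2.0, 10.0],
})

# Mean over session rows: heavy players are counted once per session.
session_weighted = merged.groupby("experience")["played_hours"].mean()

# Mean over players: each player is counted exactly once.
per_player = (
    merged.drop_duplicates(subset="hashedEmail")
    .groupby("experience")["played_hours"]
    .mean()
)

print(session_weighted)
print(per_player)
```

In this toy example the "Regular" average is 23.0 when weighted by sessions but 16.0 per player, so which version is appropriate depends on whether the question is about sessions or about players.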

## Plot 3: Average Played Hours by Subscription Status



In [27]:
subscribe_chart = alt.Chart(merged_df).mark_bar().encode(
    x=alt.X("subscribe").title("Subscription Status (True/False)"),
    y=alt.Y("mean(played_hours)").title("Average Played Hours"),
    color="subscribe"
).properties(
    title="Average Played Hours by Subscription Status"
).facet(
    column=alt.Column("subscribe", title="Subscription Group").header(title=None)
)

subscribe_chart


### Insight
- Subscribers have a substantially higher average playtime, suggesting that subscription status is a useful predictor of player engagement. As with the previous plot, the averages are taken over session rows in merged_df, so players with many sessions carry more weight.

- Non-subscribers have low playtime, suggesting they are less invested in the game.

- Recruiting subscribers, or encouraging players to subscribe, could lead to more data being collected, as subscribers demonstrate much higher engagement.


## Plot 4: Average Hours Played by Age Group (Faceted by Experience Level)

In [28]:

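# right=False makes each bin include its lower edge, so the labels
# match the intervals: [0, 18), [18, 26), [26, 36), [36, 101).
age_bins = [0, 18, 26, 36, 101]
age_labels = ["<18", "18-25", "26-35", "36+"]

players_df["age_group"] = pd.cut(players_df["age"], bins=age_bins, labels=age_labels, right=False)

# observed=False keeps empty experience/age combinations and makes
# the groupby behaviour for categorical columns explicit.
experience_avg_hours = players_df.groupby(["experience", "age_group"], observed=False).agg(
    average_hours_played=("played_hours", "mean")
).reset_index()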

facet_plot = alt.Chart(experience_avg_hours).mark_bar().encode(
    x=alt.X("age_group").title("Age Group"),
    y=alt.Y("average_hours_played").title("Average Hours Played"),
    color=alt.Color("age_group").title("Age Group")
).properties(
    width=200,
    height=300,
    title="Average Hours Played by Age Group (Faceted by Experience Level)"
).facet(
    column=alt.Column("experience", title="Experience Level")
)

facet_plot






### Insight
- The <18 and 18-25 age groups consistently show the highest playtime across experience levels, especially among "Regular" players, indicating strong engagement from younger demographics.
- Unlike other experience levels, "Veteran" players in the 26-35 age group show high playtime. Otherwise, the 26-35 and 36+ age groups generally have lower playtime, which highlights a potential area for improvement in recruiting older players.
- Focusing recruitment on younger and "Regular" players, along with experienced "Veteran" players in the 26-35 age group, could help improve data contribution and player engagement.


# Part 4: Methods and Plan

## Proposed Method:
- I propose using Linear Regression to address our question of interest. This will help identify which player characteristics (e.g., experience level, subscription status, age) are most predictive of high playtime, which in turn means more data.

## Why This Method?
- Simplicity: Linear Regression is easy to interpret, and provides clear insights into the relationship between predictors (e.g., experience, age, subscription) and the response variable (played_hours).

- Interpretability: The model's results show how much each player characteristic (like experience or subscription status) affects playtime, making it easier to see which factors are most important.

- Target: Since the target variable (played_hours) is continuous, Linear Regression is well-suited for this analysis.

## Assumptions:
- Assumes a linear relationship between predictors and the target variable.

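- Observations are assumed to be independent. This requires one row per player; repeated session rows for the same player, as in merged_df, would violate it.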

## Potential Limitations:
- The model is sensitive to outliers, which may skew predictions.

- If the relationships are non-linear, the model may underperform.

## Compare and Selection
- I will evaluate the Linear Regression model using Root Mean Squared Error (RMSE). RMSE measures the typical size of the prediction error in the same units as played_hours, which makes the results easy to interpret.
- The fitted coefficients will show which player characteristics (e.g., experience level, subscription status, age) have the biggest impact on playtime.
- To obtain a reliable performance estimate and detect overfitting, I will use 5-fold cross-validation on the training set.
- If more than one candidate model is compared (for example, the K-Nearest Neighbors model mentioned in Part 2), the model with the lowest cross-validated RMSE will be selected.

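
The selection rule can be sketched with scikit-learn as follows. The data here is synthetic and the two candidate models and their settings (such as `n_neighbors=5`) are illustrative assumptions, not results from this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the training set (not the real player data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 5 + 2 * X[:, 0] + rng.normal(scale=0.5, size=100)

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
}

cv_rmse = {}
for name, model in models.items():
    # scikit-learn reports negated RMSE so that larger is better;
    # flip the sign to recover the usual RMSE.
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()

# Pick the model with the lowest average RMSE across the 5 folds.
best = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse, best)
```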
## Data Processing Plan
- Data Splitting:
  - Split the data into a training set (70%) and a test set (30%) after data cleaning.
- Cross-Validation:
  - I will use 5-fold cross-validation on the training set, as encouraged by the textbook, to check the model's performance.
- Standardizing data:
  - I will standardize the numeric predictors so that variables on different scales are comparable. This matters when comparing coefficient magnitudes and for distance-based models such as K-Nearest Neighbors. Additionally, I will convert categorical variables into numeric form before fitting the model.
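
The processing plan above can be sketched as a single scikit-learn pipeline. The table below is made up, with the players.csv column names, and the `random_state` value is arbitrary. Fitting the scaler and encoder inside the pipeline means they are learned from the training data only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical per-player table (one row per player).
players = pd.DataFrame({
    "experience": ["Pro", "Amateur", "Regular", "Veteran", "Beginner"] * 4,
    "subscribe": [True, False] * 10,
    "age": [9, 17, 21, 30, 45, 19, 22, 16, 28, 35] * 2,
    "played_hours": [30.3, 2.3, 0.0, 1.5, 0.2, 12.0, 0.7, 3.1, 0.0, 5.6] * 2,
})
X = players[["experience", "subscribe", "age"]]
y = players["played_hours"]

# 70% training / 30% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Standardize the numeric column and one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["experience", "subscribe"]),
])

model = make_pipeline(preprocess, LinearRegression())
model.fit(X_train, y_train)
test_predictions = model.predict(X_test)
print(test_predictions)
```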