# Data Science 100 101 Project Final Report

## Introduction

A computer science group at UBC has set up a **Minecraft** server and is recording play sessions to understand how people engage with video games. In doing so, they have created two datasets: one containing player information and another containing past play sessions. The goal of this project is to use the data provided by the Minecraft server to address one of the questions posed by the project lead, **Frank Wood**. The question we have chosen to focus on is question 1:

> **"Which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts."**

We chose this question because we believe it to be the most valuable for the research team. Understanding which players contribute the most data is critical for the research team, as they need players who will contribute more than just demographic data in order to answer the other two questions posed by the group. By identifying the most valuable player group(s) in terms of data contributed, we can help the research team find players who will be useful in answering the other questions.

To address which kinds of players are most likely to contribute the most data we used both datasets. The first dataset, players.csv, has 196 observations and is a list of all unique players, including data about each player. The bulk of the players.csv data was collected in the intro survey and then updated with the hours that the participant played. The second, sessions.csv, has 1535 observations and is a list of individual play sessions by each player, including data about the session. This data was collected by recording the activity of players on the server and then attaching it to the hashed email of the player. The variables in both datasets are presented below:

### Datasets

#### players.csv
This dataset contains 196 observations and provides information about each player, including data collected from the introductory survey and updated with hours played. The key variables are:

| Variable Name   | Type                | Explanation                                                                  |
|-----------------|---------------------|------------------------------------------------------------------------------|
| experience      | Categorical         | Player's proficiency for the game (Beginner, Amateur, Regular, Veteran, Pro) |
| subscribe       | Categorical         | Boolean value indicating subscription status                                 |
| hashedEmail     | Categorical         | Hashed email for player identification                                       |
| played_hours    | Continuous (Float)  | Total hours played                                                           |
| name            | Categorical         | Player's name                                                                |
| gender          | Categorical         | Player's gender                                                              |
| age             | Discrete (Integer)  | Player's age                                                                 |
| individualId    | Categorical         | Player's Individual ID                                                       |
| organizationName| Categorical         | Name of player's organization                                                |

The major issue with this dataset is that many players have contributed zero hours, which can create skewed plots and hinder analysis. Additionally, there are several variables that are not useful for this project, but this can be addressed through data wrangling.

In this dataset, **played_hours** is the response variable of interest, while **experience**, **subscribe**, **gender**, and **age** are the explanatory variables. These explanatory variables will be used to understand which factors correlate with the amount of data contributed.

#### sessions.csv
This dataset contains 1,535 observations, detailing individual play sessions by each player. Key variables include:

| Variable Name        | Type                   | Explanation                                        |
|----------------------|------------------------|----------------------------------------------------|
| hashedEmail          | Categorical            | Hashed email for player identification             |
| start_time           | Categorical (Text)     | Session start time (text format)                   |
| end_time             | Categorical (Text)     | Session end time (text format)                     |
| original_start_time  | Continuous (Timestamp) | Numeric timestamp of session start                 |
| original_end_time    | Continuous (Timestamp) | Numeric timestamp of session end                   |

The major issue with this dataset is that the **original_start_time** and **original_end_time** variables are unnecessary for our analysis, so they will be removed during the wrangling process. Note also that times in the database are recorded in GMT.

### Project Goal

In this project and through the rest of this notebook we will attempt to answer the question we selected using the data available and present a full analysis, from reading the data to communicating results.

## Methods & Results

In [1]:
# Load the packages
import pandas as pd
import altair as alt
import numpy as np
from sklearn import set_config

In [2]:
# Output dataframes instead of arrays
set_config(transform_output="pandas")

# Set the random seed for reproducibility
np.random.seed(2)

### 1. Load the Data

In this step, we load the datasets into Python using the `pandas` library.
We read the `players.csv` and `sessions.csv` files, which contain essential details about player characteristics and session activities, respectively.

The code used for this step accomplishes the following:
- It reads the data from the project's GitHub repository (raw file URLs) into pandas DataFrames.
- It stores the results as `players` and `sessions`, which can then be inspected (for example, by displaying the first three rows of each) to verify that the datasets have been loaded correctly.

In [3]:
# Load the datasets
url_players = "https://raw.githubusercontent.com/DH-Alex/dsci-100-2024w1-group-python23/96404c10ae3b3e68dd68d0cc1197ad0aa2aca598/data/players.csv"
players = pd.read_csv(url_players)
url_sessions = "https://raw.githubusercontent.com/DH-Alex/dsci-100-2024w1-group-python23/refs/heads/main/data/sessions.csv"
sessions = pd.read_csv(url_sessions)
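
A quick way to confirm that a CSV has loaded as expected is to inspect its shape and first rows. The sketch below uses a small in-memory stand-in for `players.csv` so that it runs without network access; the three rows and the names `sample_csv` and `sample_players` are hypothetical and are not taken from the project data.

```python
import io

import pandas as pd

# Hypothetical stand-in for players.csv (made-up rows, same column layout
# as the cleaned players data)
sample_csv = io.StringIO(
    "experience,subscribe,hashedEmail,played_hours,name,gender,age\n"
    "Pro,True,hash_a,12.5,PlayerA,Male,19\n"
    "Amateur,False,hash_b,0.0,PlayerB,Female,17\n"
    "Veteran,True,hash_c,1.2,PlayerC,Non-binary,23\n"
)

sample_players = pd.read_csv(sample_csv)

# Shape is (number of rows, number of columns)
print(sample_players.shape)

# The first three rows give a quick look at structure and contents
print(sample_players.head(3))
```

With the real data, the same two calls would be applied to `players` and `sessions` directly after `pd.read_csv`.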

### 2. Wrangle and Clean the Data to the Format Necessary for the Planned Analysis

1. Convert All Data into Proper Data Type

2. Verify the Integrity of the Data

3. Handling Missing Data and Dropping Unnecessary Columns 

4. Combine the Two DataFrames

#### 2-1. Convert All Data into Proper Data Type: 

Proper data types are essential for efficient data processing and accurate analysis. This step ensures that each column in our datasets is stored in the most appropriate format, reflecting the nature of the data and optimizing for both memory usage and processing speed. We will convert date columns to datetime objects and other relevant columns to categorical or numerical types based on their content and role in our analysis.


In [4]:
# Check the original data types
print(players.dtypes)
print()
print(sessions.dtypes)

experience           object
subscribe              bool
hashedEmail          object
played_hours        float64
name                 object
gender               object
age                   int64
individualId        float64
organizationName    float64
dtype: object

hashedEmail             object
start_time              object
end_time                object
original_start_time    float64
original_end_time      float64
dtype: object


In [5]:
# Converting data types in 'players'
players['experience'] = players['experience'].astype('category')
players['gender'] = players['gender'].astype('category')

# Converting data types in 'sessions'
sessions['start_time'] = pd.to_datetime(sessions['start_time'],dayfirst=True)
sessions['end_time'] = pd.to_datetime(sessions['end_time'],dayfirst=True)

# Check the data types again after conversion
print(players.dtypes)
print()
print(sessions.dtypes)

experience          category
subscribe               bool
hashedEmail           object
played_hours         float64
name                  object
gender              category
age                    int64
individualId         float64
organizationName     float64
dtype: object

hashedEmail                    object
start_time             datetime64[ns]
end_time               datetime64[ns]
original_start_time           float64
original_end_time             float64
dtype: object


By converting `experience` and `gender` in the `players` dataset to categorical types, we enhance the efficiency of our data storage and simplify the analysis involving these variables. 

By converting `start_time` and `end_time` in the `sessions` dataset to datetime, we enable accurate and efficient time-based calculations, such as computing session durations. This adjustment ensures that our later analyses are based on correctly formatted data.
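
To illustrate what these conversions do, here is a minimal sketch on a toy frame. The values and the name `toy` are hypothetical; the day-first date format is an assumption based on the `dayfirst=True` argument used above.

```python
import pandas as pd

# Hypothetical toy values, not taken from the project data
toy = pd.DataFrame({
    "experience": ["Pro", "Amateur", "Pro"],
    "start_time": ["01/02/2024 10:00", "13/02/2024 18:30", "28/02/2024 07:15"],
})

# Store repeated labels as a categorical type
toy["experience"] = toy["experience"].astype("category")

# Parse day-first text timestamps into datetime64 values
toy["start_time"] = pd.to_datetime(toy["start_time"], dayfirst=True)

print(toy.dtypes)
print(toy["experience"].cat.categories.tolist())
print(toy["start_time"].dt.month.tolist())
```

With `dayfirst=True`, the string `01/02/2024` is read as 1 February rather than 2 January, which matters for any duration or ordering computed from these columns.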


#### 2-2. Verifying the Integrity of the Data

Ensuring data integrity is a critical step before any further analysis. We need to check for missing values across the datasets. Missing data may significantly affect the analysis, leading to biased or incorrect conclusions if not properly addressed. By identifying missing values first, we can decide on an appropriate strategy for handling them, such as imputation or removal, ensuring a robust dataset for subsequent analyses.

In [6]:
# Check for missing values in each column
missing_data_counts_players = players.isnull().sum()
missing_data_counts_sessions = sessions.isnull().sum()
print(missing_data_counts_players)
print()
print(missing_data_counts_sessions)

experience            0
subscribe             0
hashedEmail           0
played_hours          0
name                  0
gender                0
age                   0
individualId        196
organizationName    196
dtype: int64

hashedEmail            0
start_time             0
end_time               2
original_start_time    0
original_end_time      2
dtype: int64


By checking for missing values in each column of the `players` dataset, we can see that all columns are complete except for `individualId` and `organizationName`, which are entirely missing (196 of 196 values). These columns provide no useful information for our analysis, so we will drop them during wrangling.

By checking for missing values in each column of the `sessions` dataset, we can see that the `end_time` and `original_end_time` columns each have 2 missing entries, suggesting minor issues with data recording for these specific sessions. Because so few observations are affected, we will drop the sessions with a missing `end_time`. (An alternative would be to impute the missing `end_time` as the `start_time` plus the average session duration.) The `original_end_time` column is not needed for our analysis and will be dropped in the following steps, so its missing values can be ignored.


#### 2-3. Handling Missing Data and Dropping Unnecessary Columns

In the `players` DataFrame, we can identify two completely empty columns: `individualId` and `organizationName`. Since these columns contain no useful data, we will drop them from the DataFrame.

In the `sessions` DataFrame, there are missing values in the `end_time` column. Since only 2 observations are affected, we can simply drop them.
We will also drop `original_start_time` and `original_end_time` from the `sessions` DataFrame, since these columns are not necessary for our analysis.

In [7]:
# Drop the empty columns of players: ['individualId', 'organizationName']
players.drop(columns=['individualId', 'organizationName'], inplace=True)

# Drop the unneeded columns of sessions: ['original_start_time', 'original_end_time']
sessions.drop(columns=['original_start_time', 'original_end_time'], inplace=True)

# Drop the observations with a missing end_time
sessions = sessions.dropna(subset=['end_time'])

# Check for missing values in each column again
missing_data_counts_players = players.isnull().sum()
missing_data_counts_sessions = sessions.isnull().sum()
print(missing_data_counts_players)
print()
print(missing_data_counts_sessions)

experience      0
subscribe       0
hashedEmail     0
played_hours    0
name            0
gender          0
age             0
dtype: int64

hashedEmail    0
start_time     0
end_time       0
dtype: int64
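
The count-then-drop pattern used in this step can be sketched on a toy frame. The rows and the names `toy`, `missing`, and `cleaned` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy sessions; the second session has no recorded end time
toy = pd.DataFrame({
    "hashedEmail": ["a", "a", "b"],
    "start_time": pd.to_datetime(
        ["2024-01-01 10:00", "2024-01-02 11:00", "2024-01-03 12:00"]
    ),
    "end_time": pd.to_datetime(
        ["2024-01-01 10:30", None, "2024-01-03 12:45"]
    ),
})

# Count missing values per column
missing = toy.isnull().sum()
print(missing)

# Drop only the rows where end_time is missing
cleaned = toy.dropna(subset=["end_time"])
print(len(cleaned))
```

Restricting `dropna` to `subset=["end_time"]` keeps rows that are missing values in other columns, so only sessions whose duration cannot be computed are removed.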


#### 2-4. Combining DataFrames and Analyzing Session Durations

In [8]:
# Calculate session durations in hours
sessions['session_duration'] = (sessions['end_time'] - sessions['start_time']).dt.total_seconds() / 3600

# Aggregate total session time per player
total_session_time = sessions.groupby('hashedEmail')['session_duration'].sum()

# Merge this with player data
players = players.merge(total_session_time, how='left', left_on='hashedEmail', right_index=True)

# Compare the calculated total session time with 'played_hours'
players['discrepancy'] = players['played_hours'] - players['session_duration']

# Fill NaN only in specific columns to avoid issues with categorical data
players[['session_duration', 'discrepancy']] = players[['session_duration', 'discrepancy']].fillna(0)

# Check discrepancies
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,session_duration,discrepancy
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,33.650000,-3.350000
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,4.250000,-0.450000
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,0.083333,-0.083333
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,0.833333,-0.133333
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,0.150000,-0.050000
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,0.000000,0.000000
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,0.350000,-0.050000
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,0.083333,-0.083333
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,2.983333,-0.683333
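
The aggregation and left merge above can be traced on a small example. This is a sketch with made-up values; `toy_players`, `toy_sessions`, `totals`, and `merged` are hypothetical names. Player `c` has no sessions, which is why the merged total is missing until it is filled with 0.

```python
import pandas as pd

# Hypothetical players and sessions (player "c" never started a session)
toy_players = pd.DataFrame({
    "hashedEmail": ["a", "b", "c"],
    "played_hours": [1.5, 0.5, 0.0],
})
toy_sessions = pd.DataFrame({
    "hashedEmail": ["a", "a", "b"],
    "start_time": pd.to_datetime(
        ["2024-01-01 10:00", "2024-01-02 10:00", "2024-01-03 10:00"]
    ),
    "end_time": pd.to_datetime(
        ["2024-01-01 11:00", "2024-01-02 10:30", "2024-01-03 10:30"]
    ),
})

# Session length in hours
toy_sessions["session_duration"] = (
    toy_sessions["end_time"] - toy_sessions["start_time"]
).dt.total_seconds() / 3600

# Total session time per player, indexed by hashedEmail
totals = toy_sessions.groupby("hashedEmail")["session_duration"].sum()

# A left merge keeps every player, including those without sessions
merged = toy_players.merge(
    totals, how="left", left_on="hashedEmail", right_index=True
)
merged["session_duration"] = merged["session_duration"].fillna(0)
merged["discrepancy"] = merged["played_hours"] - merged["session_duration"]
print(merged)
```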


### 3. Perform a Summary of the Data Set that is Relevant for Exploratory Data Analysis Related to the Planned Analysis

In [9]:
#  Summarise Statistics for numerical variables
summary_stats = players.describe()
summary_stats

Unnamed: 0,played_hours,age,session_duration,discrepancy
count,196.0,196.0,196.0,196.0
mean,5.845918,21.280612,6.629762,-0.783844
std,28.357343,9.706346,31.310015,3.250779
min,0.0,8.0,0.0,-23.816667
25%,0.0,17.0,0.0,-0.133333
50%,0.1,19.0,0.166667,-0.075
75%,0.6,22.0,0.820833,0.0
max,223.1,99.0,244.516667,0.033333


**Figure 3.1**

The summary statistics provide valuable insights into the distribution of `played_hours`, `age`, `session_duration`, and `discrepancy` within the `players` dataset:

##### **Played Hours:**
- **Mean**: 5.85 hours, indicating that players log more than 5 hours of playtime on average, but with substantial variability.
- **Median**: 0.1 hours, indicating that half of the players logged 0.1 hours or less.
- **Standard Deviation**: 28.36 hours, highlighting extremely high variability, as some players have far longer total play times than others. This large standard deviation suggests that the data include outliers or extreme values.
- **Range**: From 0 to 223.1 hours, with most players engaging for relatively short periods (as indicated by the median and lower percentiles), but a small number of players contributing heavily to the upper tail of the distribution.

##### **Age:**
- **Mean**: The average age of players is 21.28 years, indicating a young player base.
- **Standard Deviation**: The standard deviation of players' age is 9.71 years, suggesting a wide spread around the mean.
- **Range**: From 8 to 99 years, with 75% of players aged 22 or younger, suggesting that the majority of players are young, though there is still considerable age variation. The oldest player in the dataset is 99 years old, which may be an outlier or a data entry error.
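
The gap between the mean and the median of `played_hours` is typical of right-skewed data. The sketch below uses seven made-up values (the array `hours` is hypothetical) to show how a single heavy player pulls the mean far above the median.

```python
import numpy as np

# Hypothetical play times in hours: six light players and one heavy player
hours = np.array([0.0, 0.0, 0.1, 0.1, 0.2, 0.6, 200.0])

mean_hours = hours.mean()
median_hours = np.median(hours)

print(f"mean   = {mean_hours:.2f}")
print(f"median = {median_hours:.2f}")

# Removing the heavy player changes the mean drastically, the median barely
print(f"mean without heavy player   = {hours[:-1].mean():.2f}")
print(f"median without heavy player = {np.median(hours[:-1]):.2f}")
```

For this reason the median and percentiles are the more informative summaries of a "typical" player here, while the mean reflects the contribution of a few heavy players.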

In [10]:
def mean_confidence_interval(data, confidence=0.95):
    '''Function to calculate the mean confidence interval'''
    # Convert data to a numpy array if it's not already
    a = np.array(data)
    # Number of bootstrap samples
    n_bootstraps = 10000
    # Array to store the means from each bootstrap sample
    boot_means = np.empty(n_bootstraps)
    
    # Generate bootstrap samples and compute the means
    for i in range(n_bootstraps):
        sample = np.random.choice(a, size=len(a), replace=True)
        boot_means[i] = np.mean(sample)
    
    # Compute the confidence interval bounds
    alpha = 1 - confidence
    lower_bound = np.percentile(boot_means, 100 * alpha/2)
    upper_bound = np.percentile(boot_means, 100 * (1 - alpha/2))
    mean_estimate = np.mean(boot_means)
    
    return mean_estimate, lower_bound, upper_bound


played_hours_mean, lower_bound, upper_bound = mean_confidence_interval(players['played_hours'])
age_mean, age_lower, age_upper = mean_confidence_interval(players['age'])
session_duration_mean, duration_lower, duration_upper = mean_confidence_interval(players['session_duration'])

summary_players = pd.DataFrame({
    'Variable': ['Played Hours', 'Age', 'Session Duration'],
    'Mean': [played_hours_mean, age_mean, session_duration_mean],
    'CI Lower': [lower_bound, age_lower, duration_lower],
    'CI Upper': [upper_bound, age_upper, duration_upper],
    'Margin of Error': [(upper_bound - lower_bound)/2, (age_upper - age_lower)/2, (duration_upper - duration_lower)/2]
})
print(summary_players)

           Variable       Mean   CI Lower   CI Upper  Margin of Error
0      Played Hours   5.826082   2.349490  10.229247         3.939879
1               Age  21.301578  20.066327  22.755102         1.344388
2  Session Duration   6.613508   2.713416  11.399320         4.342952


We are 95% confident that the true mean of `Played Hours` is between 2.35 hours and 10.23 hours.

We are 95% confident that the true mean of `Age` is between 20.07 years and 22.76 years.

We are 95% confident that the true mean of `Session Duration` (total recorded session time per player) is between 2.71 hours and 11.40 hours.
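
The bounds in the table come from the percentile bootstrap implemented in `mean_confidence_interval`, which uses its default `confidence=0.95`. As a sketch of the calculation (the symbols $B$, $\bar{x}^{*}_{b}$, and $Q_p$ are notation introduced here for the number of resamples, the mean of the $b$-th resample, and the $p$-quantile; they do not appear in the code):

```latex
\begin{align*}
\alpha &= 1 - 0.95 = 0.05 \\
\text{CI}_{95\%} &= \left[\, Q_{\alpha/2}\!\left(\bar{x}^{*}_{1},\dots,\bar{x}^{*}_{B}\right),\;
                     Q_{1-\alpha/2}\!\left(\bar{x}^{*}_{1},\dots,\bar{x}^{*}_{B}\right) \right],
                     \qquad B = 10000 \\
\text{Margin of error (Played Hours)} &= \frac{10.229247 - 2.349490}{2} \approx 3.939879
\end{align*}
```

The interval therefore runs from the 2.5th to the 97.5th percentile of the bootstrap means, and the reported margin of error is half the width of that interval.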

In [11]:
# x and x2 supply the pre-computed bounds, so no aggregate `extent` is needed
error_bars = alt.Chart(summary_players).mark_errorbar().encode(
    y=alt.Y('Variable:N', title='Variable'),
    x=alt.X('CI Lower:Q', title='Lower Bound of CI'),
    x2=alt.X2('CI Upper:Q', title='Upper Bound of CI')
)

mean_points = alt.Chart(summary_players).mark_point(filled=True, color='black').encode(
    x=alt.X('Mean:Q', title='Mean'),
    y=alt.Y('Variable:N', title='Variable')
)

confidence_interval_chart = (error_bars + mean_points).properties(
    width=600,
    height=120,
    title="Confidence Intervals"
)

confidence_interval_chart

**Figure 3.2**

In [12]:
# Count frequency for Explanatory Variables

# Categorical variables
experience_counts = players.groupby('experience',observed=False).size().reset_index(name='count')
gender_counts = players.groupby('gender',observed=False).size().reset_index(name='count')
subscribe_counts = players.groupby('subscribe',observed=False).size().reset_index(name='count')

bins = [0, 12, 18, 35, 65, 120]
labels = ['Child(0-12)', 'Teen(12-18)', 'Young Adult(18-35)', 'Adult(35-65)', 'Senior(65-120)']
# Create a new categorical variable 'age_group'
players['age_group'] = pd.cut(players['age'], bins = bins, labels = labels, right = False)
age_counts = players.groupby('age_group',observed=False).size().reset_index(name = 'count')

print(experience_counts,"\n")
print(gender_counts,"\n")
print(subscribe_counts,"\n")
print(age_counts)

  experience  count
0    Amateur     63
1   Beginner     35
2        Pro     14
3    Regular     36
4    Veteran     48 

              gender  count
0            Agender      2
1             Female     37
2               Male    124
3         Non-binary     15
4              Other      1
5  Prefer not to say     11
6       Two-Spirited      6 

   subscribe  count
0      False     52
1       True    144 

            age_group  count
0         Child(0-12)      4
1         Teen(12-18)     83
2  Young Adult(18-35)     99
3        Adult(35-65)      8
4      Senior(65-120)      2
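
Because `pd.cut` is called with `right=False`, each bin includes its left edge and excludes its right edge. A short sketch with made-up ages (the Series `ages` is hypothetical) shows where the boundary values 12 and 18 land:

```python
import pandas as pd

bins = [0, 12, 18, 35, 65, 120]
labels = ['Child(0-12)', 'Teen(12-18)', 'Young Adult(18-35)',
          'Adult(35-65)', 'Senior(65-120)']

# Hypothetical ages chosen to sit on or next to the bin edges
ages = pd.Series([11, 12, 17, 18, 99])

# right=False gives the intervals [0, 12), [12, 18), [18, 35), ...
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

for age, group in zip(ages, groups):
    print(age, "->", group)
```

A 12-year-old is therefore counted as a teen and an 18-year-old as a young adult, which is worth keeping in mind when reading the `age_group` counts.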


### 4. Create a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis

#### Histograms for continuous variables - Age and Played Hours (Figure 4.1-2)

In [13]:
hist_age = alt.Chart(players).mark_bar().encode(
    alt.X('age', bin = alt.Bin(maxbins = 30), title = 'Age(Years)'),
    alt.Y('count()', title = 'Frequency')
).properties(
    title = 'Histogram of Age',
    width = 360,
    height = 200
)

hist_age.display()

**Figure 4.1** <br>
As is depicted in the histogram, player ages are concentrated primarily between 15 and 30, with the highest frequency in the 15 to 25 year range. This suggests that the player base consists mostly of teenagers and young adults.

In [14]:
hist_played_hours = alt.Chart(players).mark_bar().encode(
    alt.X('played_hours', bin = alt.Bin(maxbins=30), title = 'Played Hours'),
    alt.Y('count()', title = 'Frequency', scale=alt.Scale(type='log'))
).properties(
    title = 'Histogram of Played Time',
    width = 360,
    height = 200
)

hist_played_hours.display()

**Figure 4.2** <br>
As is depicted in the plot (note the logarithmic y-axis), the vast majority of players have logged very few hours. Specifically, the highest frequency occurs in the lowest time bracket, indicating that most players spend less than 10 hours in total. This distribution points to a potential issue in player engagement, as it suggests that many players do not return or do not play for long.

#### Bar Charts for categorical variables - Experience (Figure 4.3-4), Gender (Figure 4.5), Subscribe (Figure 4.6-7)

In [15]:
bar_experience = alt.Chart(players).mark_bar().encode(
    x = alt.X('experience:N', sort=['Beginner', 'Amateur', 'Regular', 'Veteran', 'Pro'],
     axis = alt.Axis(labelAngle=0)).title("Experience"),  # Set labelAngle to 0 for horizontal labels
    y = 'count()'
).properties(
    title = 'Player Experience Levels Counts',
    width = 240,
    height = 200
)

bar_experience.display()

**Figure 4.3**

In [16]:
# grouping experience and computing the average played hours for each experience level

experience_grouped=players.groupby('experience', observed=True)['played_hours'].mean().reset_index()

# plotting played hours vs experience
experience_plot=alt.Chart(experience_grouped, title='Average Played Hours vs Experience').mark_bar().encode(
    y=alt.Y('played_hours').title('Average Played Hours'),
        x=alt.X('experience').sort('y').title('Experience')
)

experience_plot.display()

**Figure 4.4**

As depicted in Figure 4.3, the largest group of players is categorized as 'Amateur' (63), followed by 'Veteran' (48) and 'Regular' (36), indicating a player base with a range of experience levels.

However, when comparing the average played hours (Figure 4.4), regulars are observed to spend considerably more time gaming than the other experience groups, whereas veterans contribute the fewest playing hours. This suggests a relationship between experience group and played hours, although group averages can be driven by a few heavy players.

In [17]:
bar_gender = alt.Chart(players).mark_bar().encode(
    x = alt.X('gender:N',sort='-y', axis=alt.Axis(labelAngle=0)).title("Gender"),
    y = 'count()'
).properties(
    title = 'Gender Counts',
    width = 480,
    height = 200
)

bar_gender.display()

**Figure 4.5**

As is depicted in the plot, male players predominate: the number of male players (124) is more than three times that of the next largest group, female players (37).

In [18]:
bar_subscribe = alt.Chart(players).mark_bar().encode(
    x = alt.X('subscribe:N', axis=alt.Axis(labelAngle=0)).title("Subscribe Status"),
    y = 'count()'
).properties(
    title = 'Subscription Status Counts',
    width = 100,
    height = 200
)

bar_subscribe.display()

**Figure 4.6**

In [19]:
# Grouping subscribe and computing the average played hours for true or false
subscribe_grouped=players.groupby('subscribe')['played_hours'].mean().reset_index()

# plotting played hours vs subscribe

subscribe_plot=alt.Chart(subscribe_grouped, title='Average Played Hours vs Subscribe').mark_bar().encode(
    y=alt.Y('played_hours').title('Average Played Hours'),
        x=alt.X('subscribe').sort('y').title('Subscribe')
)

subscribe_plot.display()

**Figure 4.7**

As is depicted in the plots, there are considerably more subscribed players (144) than non-subscribers (52). On average, subscribed players also logged more than 7 more hours of play than non-subscribed players.

### 5. Perform the data analysis

#### Use one-hot encoding to turn categorical variables into vectors

In [20]:
players.drop(columns=['hashedEmail', 'name',"age_group","session_duration","discrepancy"],inplace = True)

# Applying one-hot encoding to 'experience' and 'gender'
experience_one_hot = pd.get_dummies(players['experience'], prefix='experience').astype(int)
gender_one_hot = pd.get_dummies(players['gender'], prefix='gender').astype(int)

print(experience_one_hot)
print(gender_one_hot)

     experience_Amateur  experience_Beginner  experience_Pro  \
0                     0                    0               1   
1                     0                    0               0   
2                     0                    0               0   
3                     1                    0               0   
4                     0                    0               0   
..                  ...                  ...             ...   
191                   1                    0               0   
192                   0                    0               0   
193                   1                    0               0   
194                   1                    0               0   
195                   0                    0               1   

     experience_Regular  experience_Veteran  
0                     0                   0  
1                     0                   1  
2                     0                   1  
3                     0                   0  
4

In [21]:
players['subscribe'] = players['subscribe'].astype(int)  # Replaces True with 1 and False with 0

# Concatenate the new columns back to the original dataframe
players = pd.concat([players, experience_one_hot, gender_one_hot], axis=1)

# Drop the original 'experience' and 'gender' columns
players.drop(['experience', 'gender'], axis=1, inplace=True)

players

Unnamed: 0,subscribe,played_hours,age,experience_Amateur,experience_Beginner,experience_Pro,experience_Regular,experience_Veteran,gender_Agender,gender_Female,gender_Male,gender_Non-binary,gender_Other,gender_Prefer not to say,gender_Two-Spirited
0,1,30.3,9,0,0,1,0,0,0,0,1,0,0,0,0
1,1,3.8,17,0,0,0,0,1,0,0,1,0,0,0,0
2,0,0.0,17,0,0,0,0,1,0,0,1,0,0,0,0
3,1,0.7,21,1,0,0,0,0,0,1,0,0,0,0,0
4,1,0.1,21,0,0,0,1,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
191,1,0.0,17,1,0,0,0,0,0,1,0,0,0,0,0
192,0,0.3,22,0,0,0,0,1,0,0,1,0,0,0,0
193,0,0.0,17,1,0,0,0,0,0,0,0,0,0,1,0
194,0,2.3,17,1,0,0,0,0,0,0,1,0,0,0,0


In [22]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

In [23]:
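# Splitting players into a training set and a test set
players_train, players_test = train_test_split(
    players, train_size=0.7
)
# Drop one dummy column per categorical variable so that 'Pro' and 'Other'
# act as the baseline categories (avoids perfectly collinear dummy columns)
players_train = players_train.drop(['experience_Pro', 'gender_Other'], axis=1)
players_test = players_test.drop(['experience_Pro', 'gender_Other'], axis=1)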

y_train = players_train['played_hours']
X_train = players_train.drop(['played_hours'], axis=1)

y_test = players_test['played_hours']
X_test = players_test.drop(['played_hours'], axis=1)

In [24]:
# Initialize the Linear Regression model
lm = LinearRegression()

lm.fit(
   X_train,  # A data frame with all explanatory variables
   y_train  # A series with the response (played_hours)
)
# Get coefficients and intercept
coef = lm.coef_
intercept = lm.intercept_

feature_names = [col for col in X_train.columns]

# Constructing the formula
terms = [f"{coef[i]:.3f} * {feature_names[i]}" for i in range(len(feature_names))]
formula = " + ".join(terms)
linear_model_formula = f"y = {intercept:.3f} + {formula}"

print("Linear Model Formula:")
print(linear_model_formula)

Linear Model Formula:
y = 20.678 + 5.164 * subscribe + -0.282 * age + 7.252 * experience_Amateur + -0.656 * experience_Beginner + 23.504 * experience_Regular + -0.372 * experience_Veteran + -12.598 * gender_Agender + -10.601 * gender_Female + -20.920 * gender_Male + -0.353 * gender_Non-binary + -21.290 * gender_Prefer not to say + -22.774 * gender_Two-Spirited


The linear regression model represents the relationship between the explanatory variables (features) and the response variable "played hours," which indicates the amount of data people contributed. Each coefficient shows how the predicted "played hours" changes with that variable while the others are held fixed. Because `experience_Pro` and `gender_Other` were dropped before fitting, the experience and gender coefficients are differences relative to those baseline categories.

- **Experience-Related Variables**: The largest positive coefficient belongs to "experience_Regular" (23.504), suggesting that Regular players are predicted to contribute about 23.5 more hours than otherwise similar Pro players. "experience_Amateur" (7.252) is also positive, while "experience_Beginner" (-0.656) and "experience_Veteran" (-0.372) are close to the Pro baseline.
  
- **Gender-Related Variables**: All gender coefficients are negative relative to the baseline category "Other." "gender_Non-binary" (-0.353) is the least negative, followed by "gender_Female" (-10.601), while "gender_Male" (-20.920) is considerably more negative, indicating that female players are predicted to contribute more data than male players. Since the "Other" baseline contains only one player in the full dataset, these coefficients should be interpreted with caution.

- **Subscription**: The "subscribe" feature has a positive coefficient (5.164), indicating that users who subscribe are predicted to contribute about 5 more hours.

- **Age**: The age variable has a slight negative effect on data contribution (-0.282 hours per additional year), which is small compared to the experience and gender-related coefficients.
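
Coefficients of one-hot encoded variables are measured against whichever category was dropped. The sketch below shows this on a toy data set with made-up values (`toy`, `X`, `model`, and `group_means` are hypothetical names): with `Pro` dropped, the intercept equals the mean of the `Pro` group and each coefficient equals the difference between that group's mean and the `Pro` mean.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: two players per experience level
toy = pd.DataFrame({
    "experience": ["Pro", "Pro", "Regular", "Regular", "Amateur", "Amateur"],
    "played_hours": [1.0, 3.0, 10.0, 14.0, 4.0, 6.0],
})

# One-hot encode, then drop 'Pro' so that it becomes the baseline category
dummies = pd.get_dummies(toy["experience"], prefix="experience").astype(int)
X = dummies.drop(columns=["experience_Pro"])

model = LinearRegression().fit(X, toy["played_hours"])
group_means = toy.groupby("experience")["played_hours"].mean()

print("group means:", group_means.to_dict())
print("intercept:", round(float(model.intercept_), 3))
print("coefficients:", dict(zip(X.columns, np.round(model.coef_, 3))))
```

In the full model there are additional variables, so the coefficients are no longer simple differences of group means, but the reading "relative to the dropped category" still applies.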

In [25]:
y_pred = lm.predict(X_test)

lm_test_RMSPE = mean_squared_error(
    y_true=y_test,
    y_pred=y_pred
)**(1/2)

lm_test_RMSPE

12.619411564216625

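The **Root Mean Squared Prediction Error (RMSPE)** of the model on the test set is about **12.62**. The RMSPE is expressed in the units of the response variable, so the model's predictions of `played_hours` deviate from the actual values by roughly **12.62 hours** in the root-mean-square sense. This is large relative to the median of 0.1 hours.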
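
The RMSPE is the square root of the mean squared difference between observed and predicted values on the test set, so it carries the units of the response (hours here). A minimal sketch with made-up numbers (the arrays `y_true` and `y_pred` are hypothetical) compares a manual calculation with the `mean_squared_error` approach used above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted played hours
y_true = np.array([0.0, 1.0, 5.0, 30.0])
y_pred = np.array([2.0, 1.0, 3.0, 20.0])

# Manual calculation: square root of the mean squared error
rmspe_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's mean squared error
rmspe_sklearn = mean_squared_error(y_true, y_pred) ** 0.5

print(rmspe_manual, rmspe_sklearn)
```

The squared errors here are 4, 0, 4, and 100, so the single large miss dominates the result, which is the same effect heavy players have on the RMSPE of the real model.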

In [26]:
# Define Bootstrap Function
def bootstrap_linear_regression(X, y, n_bootstraps=1000):
    bootstrap_coefs = []
    bootstrap_intercepts = []
    
    for _ in range(n_bootstraps):
        # Sample the data with replacement
        sample_indices = X.sample(frac=1, replace=True).index
        X_train_sample = X.loc[sample_indices]
        y_train_sample = y.loc[sample_indices]
        
        # Fit the model to the bootstrap sample
        bootstrap_sample_model = LinearRegression().fit(X_train_sample, y_train_sample)
        
        # Store the coefficients and intercept
        bootstrap_intercepts.append(bootstrap_sample_model.intercept_)
        bootstrap_coefs.append(bootstrap_sample_model.coef_)
    
    return np.array(bootstrap_intercepts), np.array(bootstrap_coefs)

bootstrap_intercepts, bootstrap_coefs = bootstrap_linear_regression(X_train, y_train, n_bootstraps=1000)

In [27]:
def bootstrap_summary_players(confidence_level):
    global bootstrap_intercepts, bootstrap_coefs, feature_names

    lower_tem = (100-confidence_level)/2
    upper_tem = (100+confidence_level)/2
    # Calculate Confidence Intervals
    intercept_mean = np.mean(bootstrap_intercepts)
    intercept_conf = np.percentile(bootstrap_intercepts, [lower_tem, upper_tem])

    coef_means = np.mean(bootstrap_coefs, axis=0)
    coef_confs = np.percentile(bootstrap_coefs, [lower_tem, upper_tem], axis=0)

    # Append intercept data for completeness
    feature_names_tem = ['Intercept'] + feature_names
    coef_means = np.insert(coef_means, 0, intercept_mean)
    coef_confs = np.insert(coef_confs, 0, intercept_conf, axis=1)

    # Create the DataFrame
    bootstrap_summary_players = pd.DataFrame({
        'Variable': feature_names_tem,
        'Mean': coef_means,
        'CI Lower': coef_confs[0],
        'CI Upper': coef_confs[1]
    })

    return bootstrap_summary_players


In [28]:
def confidence_interval_plot(df,title_confidence_interval="",graph_width=200,graph_height=400):
    # Error bars showing the confidence intervals
    error_bars = alt.Chart(df).mark_errorbar(extent='ci').encode(
        y=alt.Y('Variable:N', title='Variable', sort=None),  # sort=None to maintain the order of DataFrame
        x=alt.X('CI Lower:Q', title='Lower Bound of CI'),
        x2=alt.X2('CI Upper:Q', title='Upper Bound of CI'),
        tooltip=['Variable', 'Mean', 'CI Lower', 'CI Upper']
    )

    # Points showing the mean values
    mean_points = alt.Chart(df).mark_point(filled=True, color='black').encode(
        x=alt.X('Mean:Q', title='Mean'),
        y=alt.Y('Variable:N', title='Variable'),
        tooltip=['Variable', 'Mean', 'CI Lower', 'CI Upper']
    )

    # Vertical line at x = 0
    v_line = alt.Chart(df).mark_rule(strokeDash=[6], size=1.5).encode(
        x=alt.datum(0),
        color=alt.value("red")
    )

    # Combine the charts
    confidence_interval_chart = (error_bars + mean_points + v_line).properties(
        width=graph_width,
        height=graph_height,  # Adjust height based on the number of variables
        title=f"{title_confidence_interval} Confidence Intervals of Linear Regression Parameters"
    )

    return confidence_interval_chart

In [29]:
def pipeline_confidence_interval(confidence_level,graph_width=800,graph_height=400):
    global bootstrap_intercepts, bootstrap_coefs, feature_names
    bootstrap_summary_players_c = bootstrap_summary_players(confidence_level)
    confidence_interval_plot(bootstrap_summary_players_c,f"{confidence_level}%", graph_width, graph_height).display()

# Plot the intervals for confidence levels 60%, 65%, ..., 95%
for level in range(60, 100, 5):
    pipeline_confidence_interval(level, graph_height=200)
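
As a hedged illustration of how these interval plots can be read, the sketch below bootstraps a least squares fit on synthetic data (not the players dataset) and builds percentile intervals with the same lower and upper percentile formula used in `bootstrap_summary_players`. The synthetic data, the `percentile_ci` helper, the seed, and the number of resamples are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): y depends on x1 but not on x2
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])  # columns: intercept, x1, x2

def percentile_ci(samples, confidence_level):
    """Percentile interval per column, e.g. 95 -> 2.5th and 97.5th percentiles."""
    lower = (100 - confidence_level) / 2
    upper = (100 + confidence_level) / 2
    return np.percentile(samples, [lower, upper], axis=0)

# Refit the model on rows resampled with replacement
boot_coefs = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)
    beta = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    boot_coefs.append(beta)
boot_coefs = np.array(boot_coefs)

ci_95 = percentile_ci(boot_coefs, 95)  # shape (2, 3): rows are lower, upper
ci_60 = percentile_ci(boot_coefs, 60)

# An interval that lies entirely on one side of zero suggests a nonzero effect
excludes_zero_95 = (ci_95[0] > 0) | (ci_95[1] < 0)
width_95 = ci_95[1] - ci_95[0]
width_60 = ci_60[1] - ci_60[0]

print("95% intervals:\n", ci_95)
print("excludes zero:", excludes_zero_95)
```

The interval for `x1` sits well away from zero, while lower confidence levels always give intervals that are no wider than higher ones. In the plots above, the red dashed line at zero plays the same role: a variable whose interval crosses it cannot be confidently said to raise or lower played hours at that confidence level.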

### 6. Create a visualization of the analysis 

In [30]:
# Extracting coefficients and feature names
coef_players = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': coef
})

# Adding the intercept as a new row to the DataFrame
intercept_players = pd.DataFrame({
    'Feature': ['Intercept'],
    'Coefficient': [intercept]
})

coef_players = pd.concat([coef_players, intercept_players], ignore_index=True)

In [31]:
categories = {
    'experience_Amateur': 'experience',
    'experience_Beginner': 'experience',
    'experience_Regular': 'experience',
    'experience_Veteran': 'experience',

    'gender_Agender': 'gender',
    'gender_Female': 'gender',
    'gender_Male': 'gender',
    'gender_Non-binary': 'gender',
    'gender_Prefer not to say': 'gender',
    'gender_Two-Spirited': 'gender',
}

default_category = 'other'

# Assign categories based on feature prefixes or exact matches
coef_players['Category'] = coef_players['Feature'].apply(lambda x: categories[x] if x in categories else
                                               categories.get(x.split('_')[0], default_category))

coef_players                                          

| | Feature | Coefficient | Category |
|---|---|---|---|
| 0 | subscribe | 5.163914 | other |
| 1 | age | -0.281778 | other |
| 2 | experience_Amateur | 7.252083 | experience |
| 3 | experience_Beginner | -0.656442 | experience |
| 4 | experience_Regular | 23.504035 | experience |
| 5 | experience_Veteran | -0.371653 | experience |
| 6 | gender_Agender | -12.598382 | gender |
| 7 | gender_Female | -10.60104 | gender |
| 8 | gender_Male | -20.919996 | gender |
| 9 | gender_Non-binary | -0.352722 | gender |


**Figure 6.1**

In [32]:
color_scale = alt.Scale(domain=['gender', 'experience', 'other'],
                        range=['#e41a1c', '#377eb8', '#984ea3'])  # Adjust the colors as needed

# Creating the bar chart
bar_chart_impact = alt.Chart(coef_players).mark_bar().encode(
    x=alt.X('Coefficient', title='Coefficient Value'),
    y=alt.Y('Feature', sort='-x', title='Feature'),

    color=alt.Color('Category:N', scale = color_scale, legend=alt.Legend(title="Variable Types")),  # Use the Category field for color
    # color=alt.condition(
    #     alt.datum.Coefficient > 0,
    #     alt.value("green"),  # The positive bars will be green
    #     alt.value("red")     # The negative bars will be red
    # ),
    
    tooltip=[alt.Tooltip('Feature'), alt.Tooltip('Coefficient')]
).properties(
    title="Impact of Variables on Played Hours",
    width=700,
    height=280
)

bar_chart_impact.display()

**Figure 6.2**

In [33]:

ages = [X_test['age'].min(), X_test['age'].max()]
df = pd.DataFrame({'age': ages})
for col in X_test.columns:
    if col != 'age':
        df[col] = X_test[col].mean()  # Set other columns to their mean

df = df[X_train.columns]

# Predict y using the varying ages and fixed means of other variables
tem = lm.predict(df)
df['predicted_y'] = tem

df

| | subscribe | age | experience_Amateur | experience_Beginner | experience_Regular | experience_Veteran | gender_Agender | gender_Female | gender_Male | gender_Non-binary | gender_Prefer not to say | gender_Two-Spirited | predicted_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.677966 | 14 | 0.372881 | 0.20339 | 0.152542 | 0.20339 | 0.0 | 0.135593 | 0.694915 | 0.101695 | 0.033898 | 0.033898 | 8.809759 |
| 1 | 0.677966 | 45 | 0.372881 | 0.20339 | 0.152542 | 0.20339 | 0.0 | 0.135593 | 0.694915 | 0.101695 | 0.033898 | 0.033898 | 0.074631 |


**Figure 6.3**
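
Because the model is linear and every predictor other than age is held at the same value in both rows, the gap between the two predictions in Figure 6.3 should equal the age coefficient from Figure 6.1 multiplied by the age gap. The short check below confirms this using only numbers copied from those two tables; the variable names are chosen for this example.

```python
# Values copied from Figure 6.1 (age coefficient) and Figure 6.3 (predictions)
age_coefficient = -0.281778
age_min, age_max = 14, 45
predicted_at_min = 8.809759
predicted_at_max = 0.074631

# Change implied by the coefficient over the 31-year age range
expected_change = age_coefficient * (age_max - age_min)

# Change actually observed between the two predicted values
observed_change = predicted_at_max - predicted_at_min

print(f"expected change: {expected_change:.4f} hours")
print(f"observed change: {observed_change:.4f} hours")
```

The two values agree up to the rounding of the displayed coefficient: roughly 8.7 fewer predicted hours across the full age range, or about 0.28 hours per year of age.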

In [34]:
play_time_vs_age_chart = alt.Chart(df).mark_line().encode(
    x=alt.X('age').title("Age (years)"),
    y=alt.Y('predicted_y').title("Predicted play time (hours)"),
    tooltip=['age', 'predicted_y']
).properties(
    title='Predicted Play Time (hours) as Age Varies'
)

play_time_vs_age_chart

**Figure 6.4**

## Discussion
#### Variables Affecting Played Hours
This analysis explores which players are most likely to contribute play time. The findings come largely from histograms that show how players are distributed across the values of each variable. The results are summarized below:

- **Age (Figure 4.1-2)**: A large majority (93%) of players are teens (12-18) and young adults (18-35). The child (2%), adult (4%), and senior (1%) groups are small minorities and usually contribute negligible played hours. The histogram in Figure 4.1 likewise shows that most players are between 15 and 25 years old. However, even though most players are between 12 and 35, most of them spend less than one hour playing (Figure 4.2), suggesting that they contribute very little to the total play time and do not return. Hence, recruiting efforts should focus on teens and young adults and on increasing return rates.

- **Experience (Figure 4.3-4)**: 32% of players are amateurs, 24% are veterans, 18% are regulars, 18% are beginners, and 7% are pros. Skill levels are therefore fairly evenly spread, with approximately half of the players being beginners or amateurs and the other half having more experience (Figure 4.3). Figure 4.4 shows that although amateurs are the most common type of player, regulars average more than twice the played hours of amateurs, the group with the second-highest average. Moreover, veterans contribute very little to played hours, averaging less than one hour, even though they are the second-largest group in the player base.

- **Gender (Figure 4.5)**: Males represent 63% of all players, which suggests they are the primary contributors to play time. Females are the next largest group at 19% of all players (Figure 4.5). Recruiting efforts could either focus on male players or appeal to more non-male players to broaden the player base.

- **Subscription (Figure 4.6-7)**: Figure 4.6 shows that 73% of all players are subscribed, suggesting that players who receive email notifications are more likely to play. Subscribed players also contribute the most played hours: according to Figure 4.7, subscribed players average over 7.5 hours of play time, whereas non-subscribed players average about 0.5 hours. Hence, subscriptions appear to be associated with higher player engagement. Increasing the subscription rate, or reminding non-subscribed players to play through other channels such as social media, could help increase play time contributions.

These results are largely as expected, since the majority of players are subscribed male teens and young adults. However, comparing average played hours shows that amateurs and regulars have the highest averages. This contradicts the belief that more experienced players, such as veterans and pros, spend the most time playing. Rather, the data suggests that people newer to the game spend more time building their skills and exploring the game. To increase the engagement of more experienced players, more effort could be put into adding innovative updates that keep them from getting bored of the game.

#### Linear Regression Model

Using age, experience, gender, and subscribe as predictors, a linear regression model was created to predict played hours. This model emphasized the significance of players with regular experience, as they have a high coefficient of 23.5. This means that regular experience is associated with a larger increase in predicted played hours than any other factor. On the other hand, gender seems to contribute the least to played hours: all of the gender variables had a negative coefficient, with two-spirited having the most negative value. This suggests that gender is not the most significant predictor of played hours, and that two-spirited players are predicted to contribute the least. The RMSPE of the linear regression model is 12.6 hours. Considering the small size of the dataset and the variability in played hours (a standard deviation of 28.4 hours), this error is not unreasonable, but it is large relative to the mean of 5.8 hours and could be improved.
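
One caveat when reading these coefficients: if one level of each categorical variable was dropped during encoding (an assumption here, based on `experience_Pro` not appearing among the model's columns), each dummy coefficient is a difference from that reference level and not an absolute contribution. The sketch below illustrates this on hypothetical data with a single categorical predictor, fitted with NumPy least squares in place of the `LinearRegression` model used in the analysis.

```python
import numpy as np

# Hypothetical played hours for three experience levels (two players each)
levels = np.array(["Pro", "Pro", "Regular", "Regular", "Amateur", "Amateur"])
hours = np.array([1.0, 3.0, 20.0, 30.0, 5.0, 7.0])

def fit_with_reference(reference):
    """Fit hours ~ dummies, dropping `reference` as the baseline level."""
    others = [lvl for lvl in ["Pro", "Regular", "Amateur"] if lvl != reference]
    X = np.column_stack(
        [np.ones(len(levels))] + [(levels == lvl).astype(float) for lvl in others]
    )
    beta = np.linalg.lstsq(X, hours, rcond=None)[0]
    # First entry is the intercept; the rest follow the order of `others`
    return {"Intercept": beta[0], **dict(zip(others, beta[1:]))}

coefs_ref_pro = fit_with_reference("Pro")
coefs_ref_regular = fit_with_reference("Regular")

print(coefs_ref_pro)      # intercept is the mean hours of the Pro group
print(coefs_ref_regular)  # intercept is the mean hours of the Regular group
```

With "Pro" as the reference, both remaining coefficients are positive; with "Regular" as the reference, both are negative, even though the data are identical. A set of uniformly negative dummy coefficients can therefore reflect which level was used as the baseline, so comparisons between levels (for example, Female versus Male) are more informative than the signs alone.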

#### Issue with Low Engagement

In terms of overall contribution, significant efforts need to be directed towards increasing returning rates:

- The mean played hours is 5.8 hours. This value is not representative of the amount of time most players contribute, because some players contribute up to 223.1 hours whereas others do not contribute at all (0 hours), yet are still included in the data analysis. As seen in Figure 4.2, there is a wide distribution of played time from 0 to over 220 hours, but the majority contribute less than 10 hours.

- The standard deviation of played hours is 28.4 (Figure 3.1), further showing the high variability in played hours.
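
The two points above can be illustrated with a small sketch showing how a single heavy player pulls the mean far above what a typical player contributes. The list below is hypothetical: only the maximum of 223.1 hours is taken from the report, and the remaining values are assumed for illustration.

```python
import statistics

# Hypothetical played hours: most players near zero, one very heavy player
played_hours = [0.0, 0.0, 0.1, 0.1, 0.2, 0.5, 1.0, 1.5, 3.0, 223.1]

mean_hours = statistics.mean(played_hours)
median_hours = statistics.median(played_hours)
std_hours = statistics.stdev(played_hours)

# Fraction of players who played less than the mean
share_below_mean = sum(h < mean_hours for h in played_hours) / len(played_hours)

print(f"mean:   {mean_hours:.2f} hours")
print(f"median: {median_hours:.2f} hours")
print(f"std:    {std_hours:.2f} hours")
print(f"share of players below the mean: {share_below_mean:.0%}")
```

For right-skewed data like this, the median is usually a better summary of a typical player than the mean, and a standard deviation larger than the mean is a sign that a few players account for most of the total.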

#### Conclusion
With these results, the recruitment team can identify which groups are more likely to play the game, and then focus either on increasing their participation or on targeting the groups that currently contribute less to widen the range of players. This data analysis also identified a significant problem in the player base: the low number of returning players. As seen in Figure 4.2, most players play 10 hours or less, and very few exceed this. In addition, many players have a total play time of 0.1 or even 0 hours. Hence, return rates need to improve significantly in order to bring up the total play time.