In [2]:
# Load the packages
import pandas as pd
import altair as alt
import numpy as np
from sklearn import set_config

In [3]:
# Output dataframes instead of arrays
set_config(transform_output="pandas")

# Set the random seed for reproducibility
np.random.seed(42)  # Any fixed value works; 42 is a nod to "the answer to life, the universe, and everything"

### 1. Load the Data

In this step, we load the datasets into Python using the `pandas` library.
We read the `players.csv` and `sessions.csv` files, which contain player characteristics and session activity, respectively.

The code used for this step accomplishes the following:
- It reads each file from its raw GitHub URL into a pandas DataFrame.
- It pins `players.csv` to a specific commit, while `sessions.csv` is read from the `main` branch, so the session data may change if the repository is updated.
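
A quick way to confirm that a load worked is to inspect the shape, the column names, and the first few rows. The following is a minimal sketch that uses an in-memory CSV with invented values in place of the real files; the name `toy_players` and its contents are assumptions for illustration only.

```python
import io
import pandas as pd

# Toy CSV text standing in for players.csv (values are invented for illustration)
csv_text = """experience,subscribe,played_hours,age
Pro,True,30.3,9
Veteran,True,3.8,17
Amateur,False,0.0,21
Regular,True,0.1,21
"""

# pd.read_csv accepts a URL, a file path, or any file-like object
toy_players = pd.read_csv(io.StringIO(csv_text))

# Quick checks that the load worked: dimensions, column names, first rows
print(toy_players.shape)
print(toy_players.columns.tolist())
print(toy_players.head(3))
```

The same three checks can be run on `players` and `sessions` right after `pd.read_csv`.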

In [7]:
# 1. Load the datasets
url_players = "https://raw.githubusercontent.com/DH-Alex/dsci-100-2024w1-group-python23/96404c10ae3b3e68dd68d0cc1197ad0aa2aca598/data/players.csv"
players = pd.read_csv(url_players)
url_sessions = "https://raw.githubusercontent.com/DH-Alex/dsci-100-2024w1-group-python23/refs/heads/main/data/sessions.csv"
sessions = pd.read_csv(url_sessions)

### 2. Wrangle and Clean the Data to the Format Necessary for the Planned Analysis

1. Convert All Data into Proper Data Type

2. Verify the Integrity of the Data

3. Handling Missing Data and Dropping Unnecessary Columns 

4. Combine the Two DataFrames

#### 2-1. Convert All Data into Proper Data Type: 

Proper data types are essential for efficient data processing and accurate analysis. This step ensures that each column in our datasets is stored in the most appropriate format, reflecting the nature of the data and optimizing for both memory usage and processing speed. We will convert date columns to datetime objects and other relevant columns to categorical or numerical types based on their content and role in our analysis.
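
To make the memory argument concrete, the sketch below compares the same repeated labels stored as plain strings and as a categorical column. The Series here is made up for illustration (1,000 repeated experience labels), so the exact byte counts will differ from the real data.

```python
import pandas as pd

# Hypothetical column: 1,000 repeated experience labels stored as plain strings
levels = ['Beginner', 'Amateur', 'Regular', 'Veteran', 'Pro']
as_object = pd.Series(levels * 200, dtype='object')
as_category = as_object.astype('category')

# deep=True counts the bytes of the string objects themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

A categorical column stores each distinct label once plus a small integer code per row, which is why it is smaller when labels repeat often.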


In [9]:
## Check the original data types
print(players.dtypes)
print()
print(sessions.dtypes)

experience           object
subscribe              bool
hashedEmail          object
played_hours        float64
name                 object
gender               object
age                   int64
individualId        float64
organizationName    float64
dtype: object

hashedEmail             object
start_time              object
end_time                object
original_start_time    float64
original_end_time      float64
dtype: object


In [10]:
# Converting data types in 'players'
players['experience'] = players['experience'].astype('category')
players['gender'] = players['gender'].astype('category')

# Converting data types in 'sessions' (timestamps are written day first)
sessions['start_time'] = pd.to_datetime(sessions['start_time'], dayfirst=True)
sessions['end_time'] = pd.to_datetime(sessions['end_time'], dayfirst=True)

## Check the data types again after conversion
print(players.dtypes)
print()
print(sessions.dtypes)

experience          category
subscribe               bool
hashedEmail           object
played_hours         float64
name                  object
gender              category
age                    int64
individualId         float64
organizationName     float64
dtype: object

hashedEmail                    object
start_time             datetime64[ns]
end_time               datetime64[ns]
original_start_time           float64
original_end_time             float64
dtype: object


By converting `experience` and `gender` in the `players` dataset to categorical types, we store these variables more efficiently and simplify analyses that group by them.

By converting `start_time` and `end_time` in the `sessions` dataset to datetime, we can compute time differences such as session durations directly and correctly. Note that `dayfirst=True` tells pandas to read the timestamps as day/month/year.
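
The effect of `dayfirst=True` is easiest to see on an ambiguous date. The sketch below uses two invented timestamps (not taken from `sessions.csv`) and shows how the same text parses differently under each setting, and how subtraction works once the values are datetimes.

```python
import pandas as pd

# Invented timestamps in day/month/year order
raw = pd.Series(['03/04/2024 18:12', '03/04/2024 18:42'])

day_first = pd.to_datetime(raw, dayfirst=True)     # read as 3 April 2024
month_first = pd.to_datetime(raw, dayfirst=False)  # read as 4 March 2024

print(day_first.dt.month.tolist())
print(month_first.dt.month.tolist())

# Once parsed, subtraction yields a Timedelta that converts to hours
duration_hours = (day_first[1] - day_first[0]).total_seconds() / 3600
print(duration_hours)
```

If the source timestamps were month first, `dayfirst=True` would silently swap day and month for any date whose day is 12 or less, so it is worth checking a few parsed values against the raw text.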


#### 2-2. Verifying the Integrity of the Data

Ensuring data integrity is a critical step before any further analysis. We need to check for missing values across the datasets. Missing data can bias the analysis or lead to incorrect conclusions if it is not properly addressed. By identifying missing values early, we can choose an appropriate strategy for handling them, such as imputation or removal, and obtain a robust dataset for the subsequent analyses.

In [11]:
# 2. verifying the integrity of the data

## Check for missing values in each column
missing_data_counts_players = players.isnull().sum()
missing_data_counts_sessions = sessions.isnull().sum()
print(missing_data_counts_players)
print()
print(missing_data_counts_sessions)

experience            0
subscribe             0
hashedEmail           0
played_hours          0
name                  0
gender                0
age                   0
individualId        196
organizationName    196
dtype: int64

hashedEmail            0
start_time             0
end_time               2
original_start_time    0
original_end_time      2
dtype: int64


By checking for missing values in each column of the `players` dataset, we can see that every column is complete except `individualId` and `organizationName`, which are missing in all 196 rows. These columns carry no information for our analysis, so we will drop them in the wrangling step below.

In the `sessions` dataset, the `end_time` and `original_end_time` columns each have 2 missing entries, suggesting minor recording issues for these specific sessions. Because so few rows are affected, we could either fill each missing `end_time` with its `start_time` plus the average session duration or simply drop those rows; we drop them in the next step. The `original_end_time` column is not needed for our analysis and will also be dropped, so its missing values can be ignored.


#### 2-3. Handling Missing Data and Dropping Unnecessary Columns

In the `players` DataFrame, two columns are completely empty: `individualId` and `organizationName`. Since they contain no data, we drop them from the DataFrame.

In the `sessions` DataFrame, there are missing values in the `end_time` column. Since only two observations are affected, we simply drop them.
We also drop `original_start_time` and `original_end_time` from the `sessions` DataFrame, since these columns are not needed for our analysis.
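
Dropping is one of two options for the missing `end_time` values; the other is to impute each one as the start time plus the mean session duration. The sketch below contrasts the two on a toy table with invented timestamps, using a vectorized `fillna` rather than a row-by-row loop. The names `toy_sessions`, `dropped`, and `imputed` are illustrative assumptions.

```python
import pandas as pd

# Toy sessions table; the last row has no recorded end time
toy_sessions = pd.DataFrame({
    'start_time': pd.to_datetime(['2024-05-01 10:00', '2024-05-01 12:00', '2024-05-02 09:00']),
    'end_time': pd.to_datetime(['2024-05-01 10:30', '2024-05-01 13:30', None]),
})

# Mean duration over the complete rows (missing differences are skipped by mean())
mean_duration = (toy_sessions['end_time'] - toy_sessions['start_time']).mean()

# Option A: drop incomplete rows (the approach taken in this notebook)
dropped = toy_sessions.dropna(subset=['end_time'])

# Option B: impute the missing end time as start time plus the mean duration
imputed = toy_sessions.copy()
imputed['end_time'] = imputed['end_time'].fillna(imputed['start_time'] + mean_duration)

print(len(dropped), len(imputed))
print(imputed)
```

With only two affected rows out of the whole `sessions` table, either choice should have little effect on the totals; imputation mainly matters when many rows are incomplete.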

In [12]:
# Handle missing data and drop unneeded columns

## Drop the empty columns of `players`: ['individualId', 'organizationName']
players.drop(columns=['individualId', 'organizationName'], inplace=True)

## Alternative (not used): impute each missing 'end_time' as
## 'start_time' plus the average session duration
# average_duration = (sessions['end_time'] - sessions['start_time']).mean()
# for index, row in sessions.iterrows():
#     if pd.isnull(row['end_time']):
#         sessions.at[index, 'end_time'] = row['start_time'] + average_duration

## Drop the columns of `sessions` that the analysis does not need
sessions.drop(columns=['original_start_time', 'original_end_time'], inplace=True)

## Drop the observations with a missing 'end_time'
sessions = sessions.dropna(subset=['end_time'])

## Check for missing values in each column again
missing_data_counts_players = players.isnull().sum()
missing_data_counts_sessions = sessions.isnull().sum()
print(missing_data_counts_players)
print()
print(missing_data_counts_sessions)

experience      0
subscribe       0
hashedEmail     0
played_hours    0
name            0
gender          0
age             0
dtype: int64

hashedEmail    0
start_time     0
end_time       0
dtype: int64


#### 2-4. Combining DataFrames and Analyzing Session Durations

In [13]:
# Calculate session durations in hours
sessions['session_duration'] = (sessions['end_time'] - sessions['start_time']).dt.total_seconds() / 3600

# Aggregate total session time per player
total_session_time = sessions.groupby('hashedEmail')['session_duration'].sum()

# Merge this with player data
players = players.merge(total_session_time, how='left', left_on='hashedEmail', right_index=True)

# Compare the calculated total session time with 'played_hours'
players['discrepancy'] = players['played_hours'] - players['session_duration']

# Fill NaN only in specific columns to avoid issues with categorical data
# (players without any recorded session get a session total of 0)
players[['session_duration', 'discrepancy']] = players[['session_duration', 'discrepancy']].fillna(0)

# Check discrepancies
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,session_duration,discrepancy
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,33.650000,-3.350000
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,4.250000,-0.450000
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,0.083333,-0.083333
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,0.833333,-0.133333
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,0.150000,-0.050000
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,0.000000,0.000000
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,0.350000,-0.050000
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,0.083333,-0.083333
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,2.983333,-0.683333
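
The aggregate-then-merge pattern used in this cell can be traced on a tiny example. The sketch below uses invented IDs and durations; player `b` has no sessions, which shows why the left join produces `NaN` that then needs filling.

```python
import pandas as pd

toy_players = pd.DataFrame({
    'hashedEmail': ['a', 'b', 'c'],   # hypothetical IDs
    'played_hours': [1.5, 0.0, 2.0],
})
toy_sessions = pd.DataFrame({
    'hashedEmail': ['a', 'a', 'c'],
    'session_duration': [1.0, 0.5, 2.5],
})

# Total session time per player (a Series indexed by hashedEmail)
totals = toy_sessions.groupby('hashedEmail')['session_duration'].sum()

# A left join keeps every player; players without sessions get NaN
merged = toy_players.merge(totals, how='left', left_on='hashedEmail', right_index=True)
merged['session_duration'] = merged['session_duration'].fillna(0)
merged['discrepancy'] = merged['played_hours'] - merged['session_duration']
print(merged)
```

One design difference from the cell above is worth noting: this sketch fills the missing totals before computing the discrepancy, so a player with reported hours but no sessions would show a nonzero discrepancy. Filling the discrepancy itself with 0, as the notebook does, would hide such a case.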


### 3. Perform a Summary of the Data Set that is Relevant for Exploratory Data Analysis Related to the Planned Analysis

In [24]:
#  Summarise Statistics for numerical variables
summary_stats = players.describe()
summary_stats

Unnamed: 0,played_hours,age,session_duration,discrepancy
count,196.0,196.0,196.0,196.0
mean,5.845918,21.280612,6.629762,-0.783844
std,28.357343,9.706346,31.310015,3.250779
min,0.0,8.0,0.0,-23.816667
25%,0.0,17.0,0.0,-0.133333
50%,0.1,19.0,0.166667,-0.075
75%,0.6,22.0,0.820833,0.0
max,223.1,99.0,244.516667,0.033333


The summary statistics provide valuable insights into the distribution of `played_hours` and `age` within the `players` dataset:

**Played Hours:**
- **Mean**: 5.85 hours, far above the median of 0.1 hours, indicating that most players log very little time while a few log a great deal.
- **Standard Deviation**: 28.36, highlighting high variability driven by a small number of players with very long total play time.
- **Range**: 0 to 223.1 hours, with at least 25% of players logging no time at all and 75% playing 0.6 hours or less.

**Age:**
- **Mean**: The average age of players is approximately 21.28 years.
- **Standard Deviation**: The standard deviation (9.71) is a little under half of the mean. The quartiles (17, 19, and 22) show that most ages fall in a narrow band, so much of this spread comes from a few much older players.
- **Range**: 8 to 99 years, with 75% of players aged 22 or younger, suggesting a predominantly young player base. The oldest player is listed as 99 years old, which could be a data-entry error.
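
One way to screen for suspicious values such as the age of 99 is a rule of thumb like Tukey's 1.5 x IQR fences. This rule is an assumption introduced here, not something the notebook applies, and the ages below are invented. The sketch also shows how a single extreme value pulls the mean away from the median.

```python
import pandas as pd

ages = pd.Series([8, 17, 17, 19, 21, 22, 25, 99])  # invented ages

# Quartiles and the interquartile range
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidates for a closer look, not automatic errors
flagged = ages[(ages < lower_fence) | (ages > upper_fence)]
print(f"mean={ages.mean():.2f}, median={ages.median():.2f}")
print(flagged.tolist())
```

A flagged value is only a prompt to check the record; whether to keep, correct, or remove it remains a judgment call.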

In [None]:
import scipy.stats as stats

In [43]:
# Function to calculate the mean confidence interval
def mean_confidence_interval(data, confidence=0.95):
    a = np.array(data)
    n = len(a)
    m, se = np.mean(a), stats.sem(a)
    me = stats.t.ppf((1 + confidence) / 2., n-1) * se
    return m, m - me, m + me


played_hours_mean, lower_bound, upper_bound = mean_confidence_interval(players['played_hours'])
age_mean, age_lower, age_upper = mean_confidence_interval(players['age'])
session_duration_mean, duration_lower, duration_upper = mean_confidence_interval(players['session_duration'])

summary_players = pd.DataFrame({
    'Variable': ['Played Hours', 'Age', 'Session Duration'],
    'Mean': [played_hours_mean, age_mean, session_duration_mean],
    'CI Lower': [lower_bound, age_lower, duration_lower],
    'CI Upper': [upper_bound, age_upper, duration_upper],
    'Margin of Error': [(upper_bound - lower_bound)/2, (age_upper - age_lower)/2, (duration_upper - duration_lower)/2]
})
print(summary_players)

           Variable       Mean   CI Lower   CI Upper  Margin of Error
0      Played Hours   5.845918   1.851171   9.840666         3.994748
1               Age  21.280612  19.913263  22.647962         1.367350
2  Session Duration   6.629762   2.219066  11.040458         4.410696


We are 95% confident that the true mean of `Played Hours` is between 1.85 hours and 9.84 hours.

We are 95% confident that the true mean of `Age` is between 19.91 years and 22.65 years.

We are 95% confident that the true mean of `Session Duration` (total session time per player) is between 2.22 hours and 11.04 hours.
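
The interval for `Played Hours` can be reproduced by hand from the `describe()` output, which is a useful check on the function's default `confidence=0.95`. The sketch below plugs the reported mean, standard deviation, and count into the t-interval formula (mean plus or minus the critical t value times the standard error).

```python
import math
from scipy import stats

# Values copied from the describe() output for played_hours
n = 196
mean = 5.845918
std = 28.357343   # sample standard deviation (ddof=1), as reported by describe()

confidence = 0.95
standard_error = std / math.sqrt(n)
t_critical = stats.t.ppf((1 + confidence) / 2, df=n - 1)
margin = t_critical * standard_error

print(f"margin of error: {margin:.4f}")
print(f"interval: ({mean - margin:.4f}, {mean + margin:.4f})")
```

The result should agree with the `Margin of Error`, `CI Lower`, and `CI Upper` columns for `Played Hours` up to rounding.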

In [48]:
# Point estimates with horizontal error bars drawn from the precomputed CI bounds
error_bars = alt.Chart(summary_players).mark_errorbar().encode(
    y=alt.Y('Variable:N', title='Variable'),
    x=alt.X('CI Lower:Q', title='Lower Bound of CI'),
    x2=alt.X2('CI Upper:Q', title='Upper Bound of CI')
)

mean_points = alt.Chart(summary_players).mark_point(filled=True, color='black').encode(
    x=alt.X('Mean:Q', title='Mean'),
    y=alt.Y('Variable:N', title='Variable')
)

chart = (error_bars + mean_points).properties(
    width=600,
    height=150,
    title="Confidence Intervals of Key Metrics"
)

chart

In [10]:
# Count frequency for Explanatory Variables

# Categorical variables
experience_counts = players.groupby('experience',observed=False).size().reset_index(name='count')
gender_counts = players.groupby('gender',observed=False).size().reset_index(name='count')
subscribe_counts = players.groupby('subscribe',observed=False).size().reset_index(name='count')

# Numerical variable `age`
# Define bins and labels
bins = [0, 12, 18, 35, 65, 120]
labels = ['Child(0-12)', 'Teen(12-18)', 'Young Adult(18-35)', 'Adult(35-65)', 'Senior(65-120)']
# Create a new categorical variable 'age_group'
players['age_group'] = pd.cut(players['age'], bins = bins, labels = labels, right = False)
age_counts = players.groupby('age_group',observed=False).size().reset_index(name = 'count')

print(experience_counts,"\n")
print(gender_counts,"\n")
print(subscribe_counts,"\n")
print(age_counts)

  experience  count
0    Amateur     63
1   Beginner     35
2        Pro     14
3    Regular     36
4    Veteran     48 

              gender  count
0            Agender      2
1             Female     37
2               Male    124
3         Non-binary     15
4              Other      1
5  Prefer not to say     11
6       Two-Spirited      6 

   subscribe  count
0      False     52
1       True    144 

            age_group  count
0         Child(0-12)      4
1         Teen(12-18)     83
2  Young Adult(18-35)     99
3        Adult(35-65)      8
4      Senior(65-120)      2
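
Because the bin edges are shared between neighbouring labels (for example, 12 appears in both `Child(0-12)` and `Teen(12-18)`), it matters which side of each bin is closed. The sketch below reuses the bins and labels from the cell above on a few made-up edge ages and compares `right=False` with `right=True`.

```python
import pandas as pd

bins = [0, 12, 18, 35, 65, 120]
labels = ['Child(0-12)', 'Teen(12-18)', 'Young Adult(18-35)', 'Adult(35-65)', 'Senior(65-120)']

# Ages chosen to sit on or next to the bin edges
edge_ages = pd.Series([11, 12, 17, 18, 35])

left_closed = pd.cut(edge_ages, bins=bins, labels=labels, right=False)  # [0, 12), [12, 18), ...
right_closed = pd.cut(edge_ages, bins=bins, labels=labels, right=True)  # (0, 12], (12, 18], ...

print(pd.DataFrame({'age': edge_ages, 'right=False': left_closed, 'right=True': right_closed}))
```

With `right=False`, as used in this notebook, a 12-year-old counts as a teen and an 18-year-old as a young adult.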


### 4. Create a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis

##### Histograms for continuous variables - Age and Played Hours

In [11]:
hist_age = alt.Chart(players).mark_bar().encode(
    alt.X('age', bin = alt.Bin(maxbins = 30), title = 'Age(Years)'),
    alt.Y('count()', title = 'Frequency')
).properties(
    title = 'Histogram of Age',
    width = 360,
    height = 200
)

hist_age

As depicted in the histogram, ages are concentrated between 15 and 30, with the highest frequencies in the 15 to 25 range. This suggests that the player base consists mainly of teenagers and young adults.

In [39]:
hist_played_hours = alt.Chart(players).mark_bar().encode(
    alt.X('played_hours', bin = alt.Bin(maxbins=30), title = 'Played Hours'),
    alt.Y('count()', title = 'Frequency', scale=alt.Scale(type='log'))
).properties(
    title = 'Histogram of Played Time',
    width = 360,
    height = 200
)

hist_played_hours

As depicted in the plot (note the logarithmic frequency axis), the vast majority of players have logged very few hours. The highest frequency occurs in the lowest time bracket, indicating that most players spend less than 10 hours in total. This distribution points to a potential engagement problem: many players appear not to return or not to play for long.

##### Bar Charts for categorical variables - Experience, Gender, Subscribe

In [13]:
bar_experience = alt.Chart(players).mark_bar().encode(
    x = alt.X('experience:N', sort=['Beginner', 'Amateur', 'Regular', 'Veteran', 'Pro'],
     axis = alt.Axis(labelAngle=0)).title("Experience"),  # Set labelAngle to 0 for horizontal labels
    y = 'count()'
).properties(
    title = 'Player Experience Levels Counts',
    width = 240,
    height = 200
)

bar_experience

As depicted in the plot, 'Amateur' is the largest group (63 players), followed by 'Veteran' (48), 'Regular' (36), and 'Beginner' (35), with 'Pro' the smallest (14). The player base therefore spans a range of experience levels but leans towards newer or moderately experienced players.

In [14]:
bar_gender = alt.Chart(players).mark_bar().encode(
    x = alt.X('gender:N',sort='-y', axis=alt.Axis(labelAngle=0)).title("Gender"),
    y = 'count()'
).properties(
    title = 'Gender Counts',
    width = 480,
    height = 200
)
bar_gender

As depicted in the plot, male players predominate: with 124 players, they outnumber the next largest group, female players (37), by more than three to one.

In [15]:
bar_subscribe = alt.Chart(players).mark_bar().encode(
    x = alt.X('subscribe:N', axis=alt.Axis(labelAngle=0)).title("Subscribe Status"),
    y = 'count()'
).properties(
    title = 'Subscription Status Counts',
    width = 100,
    height = 200
)
bar_subscribe

As depicted in the plot, subscribed players (144) outnumber non-subscribers (52) by almost three to one.

### 5. Perform the data analysis

#### Use one-hot encoding to turn categorical variables into vectors

In [16]:
players.drop(columns=['hashedEmail', 'name',"age_group","session_duration","discrepancy"],inplace = True)

# Applying one-hot encoding to 'experience' and 'gender'
experience_dummies = pd.get_dummies(players['experience'], prefix='experience').astype(int)
gender_dummies = pd.get_dummies(players['gender'], prefix='gender').astype(int)

print(experience_dummies)
print(gender_dummies)

     experience_Amateur  experience_Beginner  experience_Pro  \
0                     0                    0               1   
1                     0                    0               0   
2                     0                    0               0   
3                     1                    0               0   
4                     0                    0               0   
..                  ...                  ...             ...   
191                   1                    0               0   
192                   0                    0               0   
193                   1                    0               0   
194                   1                    0               0   
195                   0                    0               1   

     experience_Regular  experience_Veteran  
0                     0                   0  
1                     0                   1  
2                     0                   1  
3                     0                   0  
4

In [17]:
# Since 'subscribe' only contains two types (True and False), we can convert it directly to an integer.
players['subscribe'] = players['subscribe'].astype(int)  # Replaces True with 1 and False with 0

# Concatenate the new columns back to the original dataframe
players = pd.concat([players, experience_dummies, gender_dummies], axis=1)

# Drop the original 'experience' and 'gender' columns as they are now encoded
players.drop(['experience', 'gender'], axis=1, inplace=True)

players

Unnamed: 0,subscribe,played_hours,age,experience_Amateur,experience_Beginner,experience_Pro,experience_Regular,experience_Veteran,gender_Agender,gender_Female,gender_Male,gender_Non-binary,gender_Other,gender_Prefer not to say,gender_Two-Spirited
0,1,30.3,9,0,0,1,0,0,0,0,1,0,0,0,0
1,1,3.8,17,0,0,0,0,1,0,0,1,0,0,0,0
2,0,0.0,17,0,0,0,0,1,0,0,1,0,0,0,0
3,1,0.7,21,1,0,0,0,0,0,1,0,0,0,0,0
4,1,0.1,21,0,0,0,1,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
191,1,0.0,17,1,0,0,0,0,0,1,0,0,0,0,0
192,0,0.3,22,0,0,0,0,1,0,0,1,0,0,0,0
193,0,0.0,17,1,0,0,0,0,0,0,0,0,0,1,0
194,0,2.3,17,1,0,0,0,0,0,0,1,0,0,0,0


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, root_mean_squared_error


In [21]:
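# Split players into a training set and a test set
players_train, players_test = train_test_split(
    players, train_size=0.7
)
# Drop one dummy column per categorical variable ('experience_Pro' and 'gender_Other')
# so that those levels act as the reference categories
players_train = players_train.drop(['experience_Pro', 'gender_Other'], axis=1)
players_test = players_test.drop(['experience_Pro', 'gender_Other'], axis=1)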

y_train = players_train['played_hours']
X_train = players_train.drop(['played_hours'], axis=1)

y_test = players_test['played_hours']
X_test = players_test.drop(['played_hours'], axis=1)
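
Dropping one dummy column per categorical variable avoids perfect collinearity with the intercept: if every level is kept, the dummy columns always sum to 1, which duplicates the intercept column. The sketch below checks this with invented labels by comparing the number of columns with the matrix rank; `design_matrix` is a hypothetical helper written for this illustration.

```python
import numpy as np
import pandas as pd

# Invented experience labels for six players
experience = pd.Series(['Pro', 'Amateur', 'Veteran', 'Amateur', 'Pro', 'Veteran'])

all_dummies = pd.get_dummies(experience, prefix='experience').astype(int)
reduced = all_dummies.drop(columns=['experience_Pro'])  # 'Pro' becomes the reference level

def design_matrix(dummies):
    """Prepend the intercept column of ones that LinearRegression fits by default."""
    return np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])

full_X = design_matrix(all_dummies)
reduced_X = design_matrix(reduced)

# With every level kept, the dummy columns sum to the intercept column
print(full_X.shape[1], np.linalg.matrix_rank(full_X))        # more columns than rank
print(reduced_X.shape[1], np.linalg.matrix_rank(reduced_X))  # columns equal rank
```

With a reference level dropped, each remaining coefficient reads as the difference from that level (here `Pro`, and `Other` for gender in the notebook).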

In [22]:
# Initialize the Linear Regression model
lm = LinearRegression()

lm.fit(
   X_train,  # A data frame with all predictor columns
   y_train  # A series holding the response, played_hours
)

# Get coefficients and intercept
coef = lm.coef_
intercept = lm.intercept_

# Collect the feature names in the same order as the coefficients
feature_names = list(X_train.columns)

# Constructing the formula
terms = [f"{coef[i]:.3f} * {feature_names[i]}" for i in range(len(feature_names))]
formula = " + ".join(terms)
linear_model_formula = f"y = {intercept:.3f} + {formula}"

print("Linear Model Formula:")
print(linear_model_formula)

Linear Model Formula:
y = 16.010 + 5.934 * subscribe + -0.239 * age + 6.286 * experience_Amateur + -2.434 * experience_Beginner + 25.714 * experience_Regular + -4.318 * experience_Veteran + -5.761 * gender_Agender + -5.380 * gender_Female + -16.659 * gender_Male + 2.279 * gender_Non-binary + -16.267 * gender_Prefer not to say + -20.028 * gender_Two-Spirited
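
The notebook imports `root_mean_squared_error`, but the cells shown here stop at the fitted formula. A natural next step is to measure prediction error on the held-out test set. The sketch below shows the idea on synthetic data (not the players data) and computes RMSE directly with NumPy; all names in it are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (not the players data): y = 2 + 3x plus small noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 + 3 * X[:, 0] + rng.normal(0, 0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# RMSE on rows the model never saw, in the same units as y
test_rmse = np.sqrt(np.mean((y_te - model.predict(X_te)) ** 2))
print(f"slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}, test RMSE={test_rmse:.3f}")
```

Applied to this notebook, the same calculation would use `lm.predict(X_test)` and `y_test`, giving an error in hours that can be compared with the spread of `played_hours`.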


### 6. Create a visualization of the analysis 

In [23]:
# Extracting coefficients and feature names
coef_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': coef
})

# Adding the intercept as a new row to the DataFrame
intercept_df = pd.DataFrame({
    'Feature': ['Intercept'],
    'Coefficient': [intercept]
})

# Append the intercept row to the coefficient table
coef_df = pd.concat([coef_df, intercept_df], ignore_index=True)

coef_df

Unnamed: 0,Feature,Coefficient
0,subscribe,5.934306
1,age,-0.238951
2,experience_Amateur,6.285545
3,experience_Beginner,-2.433621
4,experience_Regular,25.713945
5,experience_Veteran,-4.318021
6,gender_Agender,-5.761179
7,gender_Female,-5.379591
8,gender_Male,-16.65942
9,gender_Non-binary,2.27942


In [24]:
categories = {
    'experience_Amateur': 'experience',
    'experience_Beginner': 'experience',
    'experience_Regular': 'experience',
    'experience_Veteran': 'experience',

    'gender_Agender': 'gender',
    'gender_Female': 'gender',
    'gender_Male': 'gender',
    'gender_Non-binary': 'gender',
    'gender_Prefer not to say': 'gender',
    'gender_Two-Spirited': 'gender',
}

# Default category for any features not explicitly listed
default_category = 'other'

# Assign each feature its category by exact name; unlisted features get the default
coef_df['Category'] = coef_df['Feature'].apply(lambda x: categories.get(x, default_category))

coef_df                                               

Unnamed: 0,Feature,Coefficient,Category
0,subscribe,5.934306,other
1,age,-0.238951,other
2,experience_Amateur,6.285545,experience
3,experience_Beginner,-2.433621,experience
4,experience_Regular,25.713945,experience
5,experience_Veteran,-4.318021,experience
6,gender_Agender,-5.761179,gender
7,gender_Female,-5.379591,gender
8,gender_Male,-16.65942,gender
9,gender_Non-binary,2.27942,gender


In [25]:
color_scale = alt.Scale(domain=['gender', 'experience', 'other'],
                        range=['#e41a1c', '#377eb8', '#984ea3'])  # Adjust the colors as needed

# Creating the bar chart
bar_chart_impact = alt.Chart(coef_df).mark_bar().encode(
    x=alt.X('Coefficient', title='Coefficient Value'),
    y=alt.Y('Feature', sort='-x', title='Feature'),

    color=alt.Color('Category:N', scale = color_scale, legend=alt.Legend(title="Variable Types")),  # Use the Category field for color
    # color=alt.condition(
    #     alt.datum.Coefficient > 0,
    #     alt.value("green"),  # The positive bars will be green
    #     alt.value("red")     # The negative bars will be red
    # ),
    
    tooltip=[alt.Tooltip('Feature'), alt.Tooltip('Coefficient')]
).properties(
    title="Impact of Variables on 'y'",
    width=700,
    height=280
)

bar_chart_impact

In [26]:
X_test

Unnamed: 0,subscribe,age,experience_Amateur,experience_Beginner,experience_Regular,experience_Veteran,gender_Agender,gender_Female,gender_Male,gender_Non-binary,gender_Prefer not to say,gender_Two-Spirited
139,1,20,0,0,0,1,0,0,1,0,0,0
113,0,17,0,1,0,0,0,1,0,0,0,0
16,1,17,0,1,0,0,0,1,0,0,0,0
75,1,21,1,0,0,0,0,0,1,0,0,0
154,1,19,1,0,0,0,0,0,1,0,0,0
185,0,18,0,0,1,0,0,0,1,0,0,0
69,1,21,0,0,1,0,0,0,1,0,0,0
55,1,20,0,0,1,0,0,0,1,0,0,0
18,1,17,1,0,0,0,0,0,1,0,0,0
169,0,17,0,0,0,1,0,0,1,0,0,0


In [27]:

ages = [X_test['age'].min(), X_test['age'].max()]
df = pd.DataFrame({'age': ages})
for col in X_test.columns:
    if col != 'age':
        df[col] = X_test[col].mean()  # Set other columns to their mean

df = df[X_train.columns]

df

Unnamed: 0,subscribe,age,experience_Amateur,experience_Beginner,experience_Regular,experience_Veteran,gender_Agender,gender_Female,gender_Male,gender_Non-binary,gender_Prefer not to say,gender_Two-Spirited
0,0.677966,11,0.305085,0.152542,0.20339,0.305085,0.0,0.186441,0.745763,0.050847,0.016949,0.0
1,0.677966,50,0.305085,0.152542,0.20339,0.305085,0.0,0.186441,0.745763,0.050847,0.016949,0.0


In [28]:
# Predict y for the two ages while holding the other variables at their test-set means
df['predicted_y'] = lm.predict(df)

df

Unnamed: 0,subscribe,age,experience_Amateur,experience_Beginner,experience_Regular,experience_Veteran,gender_Agender,gender_Female,gender_Male,gender_Non-binary,gender_Prefer not to say,gender_Two-Spirited,predicted_y
0,0.677966,11,0.305085,0.152542,0.20339,0.305085,0.0,0.186441,0.745763,0.050847,0.016949,0.0,9.277273
1,0.677966,50,0.305085,0.152542,0.20339,0.305085,0.0,0.186441,0.745763,0.050847,0.016949,0.0,-0.04182
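
Because the model is linear and every other predictor is held fixed, the two predictions should differ by exactly the age coefficient times the age gap. The arithmetic below checks this using the values printed in the coefficient table and the prediction table.

```python
# Values copied from the coefficient table and the prediction table in this notebook
coef_age = -0.238951
pred_at_11 = 9.277273
pred_at_50 = -0.04182

# With all other predictors fixed, the prediction changes by coef_age per year
expected_change = coef_age * (50 - 11)
observed_change = pred_at_50 - pred_at_11

print(round(expected_change, 3), round(observed_change, 3))
```

Both values come out at about -9.32 hours. The slightly negative prediction at age 50 also illustrates a limitation: a plain linear model does not know that played hours cannot fall below zero.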


In [30]:
chart = alt.Chart(df).mark_line().encode(
    x=alt.X('age').title("Age(year)"),
    y=alt.Y('predicted_y').title("Play time(hours)"),
    tooltip=['age', 'predicted_y']
).properties(
    title='Play time(hours) as Age Varies'
)

chart

# Extra code that may be useful in future work

In [31]:
# # Example DataFrame setup
# # players = pd.read_csv("your_data.csv")

# # Define categories and their dummies
# categorical_groups = {
#     'gender': [col for col in players.columns if col.startswith('gender')],
#     'experience': [col for col in players.columns if col.startswith('experience')],
#     'subscribe': [col for col in players.columns if col.startswith('subscribe')],
# }

# # Start with initial numerical columns that are not part of any categorical dummy groups
# selected = [col for col in players.columns if col not in sum(categorical_groups.values(), []) and col != 'played_hours']

# # Define a function to calculate RMSE
# # def rmse_scorer(model, X, y):
# #     y_pred = model.predict(X)
# #     return np.sqrt(mean_squared_error(y, y_pred))
# def rmse(y_true, y_pred):
#     return root_mean_squared_error(y_true, y_pred)

# # Feature selection process
# for category, features in categorical_groups.items():
#     # Temporarily add this category's features to the selected list
#     trial_features = selected + features
#     X = players[trial_features]
#     y = players['played_hours']
    
#     # Model and scoring
#     lm_preprocessor = make_pipeline(StandardScaler(), LinearRegression())
#     score = np.mean(cross_val_score(lm_preprocessor, X, y, cv=10, scoring=make_scorer(rmse, greater_is_better=False)))
    
#     # Decide whether to permanently add this group of features
#     print(f"Testing inclusion of {category}: RMSE = {score}")
#     include_decision = input(f"Include {category} features? (yes/no): ")
#     if include_decision.lower() == 'yes':
#         selected.extend(features)  # Only add permanently if confirmed

# print("Selected features:", selected)


In [33]:
# names = [col for col in players.columns if col != 'played_hours']

# accuracy_dict = {"size": [], "selected_predictors": [], "accuracy": []}

# # Store the total number of predictors
# n_total = len(names)

# # Start with an empty list of selected predictors
# selected = []

# # Define a custom scorer function
# def rmse(y_true, y_pred):
#     return root_mean_squared_error(y_true, y_pred)

# # Create the scorer object
# rmse_scorer = make_scorer(rmse, greater_is_better=False)

# # Example using LinearRegression
# lm = LinearRegression()

# # Create the pipeline
# lm_preprocessor = make_pipeline(StandardScaler(), lm)

# # For every possible number of predictors
# for i in range(1, n_total + 1):
#     accs = np.zeros(len(names))
#     # For every possible predictor to add
#     for j in range(len(names)):
#         # Add remaining predictor j to the model
#         X = players[selected + [names[j]]]
#         y = players['played_hours']

#         # Calculate the cross-validated negative RMSE for this set of predictors
#         scores = cross_val_score(lm_preprocessor, X, y, cv=10, scoring=rmse_scorer)
#         accs[j] = np.mean(scores)

#     # Get the best new set of predictors: the highest negative RMSE is the lowest RMSE
#     best_set = selected + [names[accs.argmax()]]

#     # Store the results for this round of forward selection
#     accuracy_dict["size"].append(i)
#     accuracy_dict["selected_predictors"].append(", ".join(best_set))
#     accuracy_dict["accuracy"].append(accs.max())

#     # Update the selected & remove the chosen predictor from the list
#     selected = best_set
#     names.remove(names[accs.argmax()])

# # Create a DataFrame to show the results
# accuracies = pd.DataFrame(accuracy_dict)
# print(accuracies)
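
The commented-out cell above outlines forward selection scored by cross-validated RMSE. Below is a minimal, runnable sketch of the same idea on synthetic data, not the real `players` table. The helper name `forward_select`, the toy column names, and the use of scikit-learn's built-in `"neg_root_mean_squared_error"` scorer in place of the custom `rmse_scorer` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def forward_select(data, target, cv=5):
    """Greedy forward selection scored by cross-validated RMSE (hypothetical helper)."""
    remaining = [col for col in data.columns if col != target]
    selected = []
    rounds = []
    y = data[target]
    while remaining:
        rmse_by_candidate = {}
        for name in remaining:
            # Fit on the predictors chosen so far plus one candidate
            model = make_pipeline(StandardScaler(), LinearRegression())
            neg_rmse = cross_val_score(
                model,
                data[selected + [name]],
                y,
                cv=cv,
                scoring="neg_root_mean_squared_error",
            )
            # The scorer returns negative RMSE, so flip the sign back
            rmse_by_candidate[name] = -neg_rmse.mean()
        # Keep the candidate with the lowest cross-validated RMSE
        best = min(rmse_by_candidate, key=rmse_by_candidate.get)
        selected.append(best)
        remaining.remove(best)
        rounds.append(
            {
                "size": len(selected),
                "selected_predictors": ", ".join(selected),
                "rmse": rmse_by_candidate[best],
            }
        )
    return pd.DataFrame(rounds)


# Synthetic data (assumption): one informative predictor and two noise columns
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "x_signal": rng.normal(size=200),
        "x_noise_a": rng.normal(size=200),
        "x_noise_b": rng.normal(size=200),
    }
)
toy["played_hours"] = 3.0 * toy["x_signal"] + rng.normal(scale=0.5, size=200)

results = forward_select(toy, "played_hours")
print(results)
```

Storing the positive RMSE in a column named `rmse` avoids the confusion of a column called `accuracy` that holds negative error values.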

## Discussion
In this data analysis, we explore which players are most likely to contribute to total play time. We do this largely with histograms, which show how often each value, or range of values, of a variable occurs. The results are summarized below:

- **Age**: A significant majority (93%) of players are teens (12-18) or young adults (18-35). Children (2%), adults (4%), and seniors (1%) are a small minority and usually contribute a negligible number of played hours. However, although most players are between 12 and 35, most of them play for less than an hour, which suggests that they also contribute very little to the total play time and do not return. Hence, targeting efforts should focus on teens and young adults, and on increasing return rates.

- **Experience**: 32% are amateurs, 24% are veterans, 18% are regulars, 7% are pros, and 18% are beginners. This shows a relatively even distribution of skills, with approximately half of the players being beginners or amateurs, and the other half with more experience.

- **Gender**: Males represent 63% of all players, which suggests they are the primary contributors to play time. Female players follow at 19% of all players. Targeting efforts could focus on male players, or on appealing to more non-male players to diversify the player base.

- **Subscription**: 73% of all players are subscribed, which suggests that players who receive email notifications are more likely to play. Hence, increasing the subscription rate, or finding other ways to remind non-subscribed players to play (such as through social media), could help increase players' contribution to play time.
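
The percentages in the list above are category shares. A minimal sketch of how such shares can be computed with `value_counts(normalize=True)` follows. It uses a small made-up table rather than the real `players` data; the helper `share_table` is hypothetical, and the column names `gender` and `subscribe` are assumed from the predictor names used in this report.

```python
import pandas as pd

# Made-up rows for illustration only (not the real players data)
toy_players = pd.DataFrame(
    {
        "gender": ["Male", "Male", "Female", "Male", "Non-binary"],
        "subscribe": [True, True, False, True, False],
    }
)


def share_table(frame, column):
    """Return the percentage of rows in each category of `column`."""
    return frame[column].value_counts(normalize=True).mul(100).round(1)


gender_share = share_table(toy_players, "gender")
subscribe_share = share_table(toy_players, "subscribe")

print(gender_share)
print(subscribe_share)
```

Note that these shares count players, not hours, so they describe who signs up rather than who plays the most.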

The results are largely as expected: the majority of players are subscribed male teens and young adults. However, when comparing played hours, amateurs and regulars have the highest averages. This contradicts the belief that more experienced players, such as veterans and pros, spend the most time playing. Rather, the data suggests that players newer to the game spend more time building their skills and exploring the game.
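
The comparison of average played hours by experience level can be sketched as a grouped summary. The example below uses made-up values, not the real data, and assumes columns named `experience` and `played_hours`. The median is reported next to the mean because a few heavy players can pull a group's mean upward.

```python
import pandas as pd

# Made-up rows for illustration only (not the real players data)
toy_players = pd.DataFrame(
    {
        "experience": ["Amateur", "Amateur", "Veteran", "Pro", "Regular", "Regular"],
        "played_hours": [4.0, 2.0, 0.5, 1.0, 6.0, 1.0],
    }
)

# Mean, median, and group size of played hours per experience level
summary = (
    toy_players.groupby("experience")["played_hours"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)

print(summary)
```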

With these results, the recruitment team can identify which groups are more likely to play the game, and then focus either on increasing those groups' participation or on targeting the groups that currently contribute less, to widen the range of players. This analysis also identified a significant problem in the player base: the low number of returning players. As seen in figure **add figure numbers**, most players play 10 hours or less, and very few exceed this. In addition, many players have a total play time of 0.1 or even 0 hours. Hence, return rates need to improve significantly in order to bring up the total play time.
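
One way to put numbers on the low-retention observation is to compute the share of players at or below a play-time threshold. This is a sketch on made-up hours, not the real data; the 10-hour cutoff comes from the paragraph above, and the variable names are illustrative.

```python
import pandas as pd

# Made-up play times in hours, for illustration only
hours = pd.Series([0.0, 0.1, 0.0, 2.5, 48.0, 0.3, 12.0, 0.1], name="played_hours")

# The mean of a boolean Series is the fraction of True values
share_at_most_10 = (hours <= 10).mean()
share_zero = (hours == 0).mean()

print(f"Players with 10 hours or less: {share_at_most_10:.0%}")
print(f"Players with 0 hours: {share_zero:.0%}")
```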

Using age, experience, gender, and subscription status as predictors, we fit a linear regression model to predict played hours. **talk about evaluated model? (e.g. accuracy & RMSPE)**
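
As a starting point for the evaluation, the sketch below shows one way to estimate RMSPE on a held-out test set and compare it with a baseline that always predicts the training mean. Everything here is an assumption for illustration: the data are synthetic, the column names follow the predictors named above, and the encoding with `pd.get_dummies` and the 75/25 split are not necessarily what this report uses. No result from the real model is implied.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the players table (assumption, not real data)
rng = np.random.default_rng(1)
n = 300
toy = pd.DataFrame(
    {
        "age": rng.integers(10, 60, size=n),
        "experience": rng.choice(
            ["Beginner", "Amateur", "Regular", "Veteran", "Pro"], size=n
        ),
        "gender": rng.choice(["Male", "Female", "Other"], size=n),
        "subscribe": rng.choice([True, False], size=n),
    }
)
toy["played_hours"] = (
    0.05 * toy["age"] + 2.0 * toy["subscribe"] + rng.normal(scale=1.0, size=n)
)

# Dummy-encode the categorical predictors so the linear model can use them
X = pd.get_dummies(
    toy[["age", "experience", "gender", "subscribe"]],
    columns=["experience", "gender"],
    drop_first=True,
).astype(float)
y = toy["played_hours"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# RMSPE: root mean squared error of predictions on unseen test data
predictions = model.predict(X_test)
rmspe = float(np.sqrt(np.mean((y_test - predictions) ** 2)))

# Baseline: always predict the mean played hours of the training set
baseline_rmspe = float(np.sqrt(np.mean((y_test - y_train.mean()) ** 2)))

print(f"RMSPE: {rmspe:.2f} hours, baseline: {baseline_rmspe:.2f} hours")
```

Comparing against the mean-only baseline gives the RMSPE a reference point: a model whose error is close to the baseline adds little predictive value.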