## Plaicraft: Predicting Played Hours from Player Type

In [1]:
import numpy as np
import pandas as pd
import scipy as sci
import matplotlib.pyplot as plt
import os
import altair as alt
from sklearn import set_config
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

alt.data_transformers.enable('vegafusion')
set_config(transform_output="pandas")

print("packages imported")

packages imported


# (1) Introduction:

Aamna's part from google colab goes here...

# (2) Methods & Results:

## Method:

> describe the methods you used to perform your analysis from beginning to end that narrates the analysis code.

K-NN regression is one way to predict `played_hours` from `age`: `played_hours` is a quantitative variable, so regression rather than classification is appropriate, and K-NN makes no assumption that the relationship is linear.

However, the model may predict poorly where there are few or no data points, especially for ages outside 15-25 years. K-NN regression can also be computationally expensive with more data, and it is less intuitive to interpret than linear regression.
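The limitation outside the observed age range can be illustrated with a minimal sketch. The data below are synthetic and made up for illustration (they are not the Plaicraft data), and `n_neighbors=5` is an arbitrary choice. Far from the training ages, the nearest neighbours are always the same oldest players, so the prediction stops changing with age.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic, made-up data: 50 players with ages spread evenly over 15-25 only
rng = np.random.default_rng(0)
ages = np.linspace(15, 25, 50).reshape(-1, 1)
hours = rng.exponential(scale=5.0, size=50)

knn = KNeighborsRegressor(n_neighbors=5).fit(ages, hours)

# Ages far outside the training range: the 5 nearest neighbours are always
# the 5 oldest training players, so every prediction is their average hours
far_ages = np.array([[60], [75], [90]])
far_predictions = knn.predict(far_ages)
print(far_predictions)
```

All three predictions equal the mean hours of the five oldest synthetic players, regardless of how far the query age is from the data.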

Model selection:
* Split the data 70%-30% into training and testing sets with a set random seed.
* Use 5-fold cross-validation on the training set to select the K with the lowest cross-validation RMSPE.
* Fit the K-NN model with that K on the training data.
* Predict `played_hours` for the testing data.
* Calculate the model's test RMSPE.
* Fit a linear regression on the same training data and calculate its test RMSPE.
* Compare the test RMSPE of the two models and select the model with the lower value.
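
For reference, the RMSPE used in these steps is the square root of the mean squared difference between the observed values $y_i$ and the predicted values $\hat{y}_i$, taken over the $n$ observations being evaluated (a validation fold or the test set):

$$
\text{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

It is in the same units as `played_hours` (hours), so a lower value means predictions are closer to the observed hours on average.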


## Results:

your report should include code which:

> loads data

> wrangles and cleans the data to the format necessary for the planned analysis

(Steph: "Two datasets are provided, but both variables are present in players.csv, so no data from sessions.csv or additional wrangling will be used to find variables." But if we want to do something more complex that's cool too)

> performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis

(Steph's findings:)
- Most participants are 15-25 years old (around 70+95=165 participants). K-NN may not predict well in other ages with fewer or no data points.
- A few younger players (ages 10-30) contributed large hours (20+), but most individuals contributed <10 hours.
- Many players play 0-1 hours. Researchers should define a "large" amount of data (e.g. >1 hour).
- As age increases, the average `played_hours` fluctuates, suggesting a non-linear pattern with `age`.

> creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis

> performs the data analysis

> creates a visualization of the analysis

> note: all figures should have a figure number and a legend

### Load and Wrangle data

In [15]:
#load players.csv into a dataframe from its URL
url_players = "https://drive.google.com/uc?id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players = pd.read_csv(url_players)

players.head()

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,


In [None]:
#load sessions.csv into a dataframe from its URL
url_sessions = "https://drive.google.com/uc?id=14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB"
sessions = pd.read_csv(url_sessions)

sessions.head()

In [20]:
#mapping experiences
experience_mapping = {
    'Beginner': 1,
    'Amateur': 2,
    'Regular': 3,
    'Veteran': 4,
    'Pro': 5
}

#converting the session start and end times to datetime objects
sessions['start_time'] = pd.to_datetime(sessions['start_time'], format="%d/%m/%Y %H:%M")
sessions['end_time'] = pd.to_datetime(sessions['end_time'], format="%d/%m/%Y %H:%M")

#session length in hours
sessions['session_length'] = (sessions['end_time'] - sessions['start_time']).dt.total_seconds() / 3600
#original_start_time and original_end_time are epoch timestamps in milliseconds, so this difference is in milliseconds
sessions['original_session_length'] = (sessions['original_end_time'] - sessions['original_start_time'])

#grouping the sessions dataframe by email and computing per-player summaries
player_sessions = sessions.groupby('hashedEmail').agg(
    number_sessions=('session_length', 'size'),
    mean_session_length=('session_length', 'mean'),
    sd_session_length=('session_length', 'std')
).reset_index()

#merging the data we need and tidying it
players_combined = pd.merge(players, player_sessions, on='hashedEmail', how='left')
players_combined['experience_val'] = players_combined['experience'].map(experience_mapping)
players_combined['subscribe_binary'] = players_combined['subscribe'].astype(int)
players_combined = players_combined.dropna(subset=['subscribe_binary', 'experience_val', 'age', 'number_sessions', 'played_hours'])
players_combined.head(5)

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName,number_sessions,mean_session_length,sd_session_length,experience_val,subscribe_binary
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,,27.0,1.246296,0.902162,5,1
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,,3.0,1.416667,1.233671,4,1
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,,1.0,0.083333,,4,0
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,,1.0,0.833333,,2,1
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,,1.0,0.15,,3,1
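
Two side effects of this wrangling are worth keeping in mind: a left merge leaves missing values for any player without recorded sessions, so `dropna` on `number_sessions` removes those players, and the standard deviation is missing for players with a single session (as in rows 2-4 above). A minimal sketch on toy data (the frames and values below are hypothetical, not the Plaicraft files):

```python
import pandas as pd

# Toy data (hypothetical): player "b" has no sessions, player "c" has one
toy_players = pd.DataFrame(
    {"hashedEmail": ["a", "b", "c"], "played_hours": [2.0, 0.0, 1.5]}
)
toy_sessions = pd.DataFrame(
    {"hashedEmail": ["a", "a", "c"], "session_length": [1.0, 1.0, 1.5]}
)

# Per-player summaries, mirroring the aggregation used above
toy_summary = toy_sessions.groupby("hashedEmail").agg(
    number_sessions=("session_length", "size"),
    mean_session_length=("session_length", "mean"),
    sd_session_length=("session_length", "std"),
).reset_index()

# Left merge keeps every player; players without sessions get NaN summaries
toy_merged = pd.merge(toy_players, toy_summary, on="hashedEmail", how="left")

# Dropping rows with a missing session count removes player "b"
toy_clean = toy_merged.dropna(subset=["number_sessions"])
n_dropped = len(toy_merged) - len(toy_clean)
print(n_dropped)
print(toy_clean)
```

Comparing `len(players)` with `len(players_combined)` in the real data would show how many players are excluded from the models for this reason.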


In [17]:
players_age = (
    alt.Chart(players, title="Number of Players per Age in Plaicraft").mark_bar().encode(
        x = alt.X("age").title("Age (years)").bin(maxbins=20),
        y = alt.Y("count()").title("Number of Players")
    )
    .configure_axis(titleFontSize=12)
)

players_age

Most participants are 15-25 years old (around 70+95=165 participants). K-NN may not predict well in other ages with fewer or no data points.

In [16]:
#overplotted but general overview of data
players_scatterplot = (
    alt.Chart(players, title="Hours Contributed by Each Player with Age in Plaicraft")
    .mark_circle(size=60, opacity=0.40)
    .encode(
        x = alt.X("age").title("Age (years)"),
        y = alt.Y("played_hours").title("Total Time Played (hours)"),
        color= alt.Color("experience")
    )
    .configure_axis(titleFontSize=12)
)

players_scatterplot

A few younger players (ages 10-30) contributed large hours (20+), but most individuals contributed <10 hours (in the overplotted area).

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0



In [6]:
#picturing relationship between 'number of sessions' and 'subscribe'
sessions_vs_subscribe = (
    alt.Chart(players_combined, title="Number of sessions vs subscription").mark_bar().encode(
        x = alt.X("subscribe:N").title("Subscription (True/False)"),
        y = alt.Y("sum(number_sessions)").title("Total Number of Sessions"),
        #color = alt.Color('experience')
    )
    .configure_axis(titleFontSize=12)
)

sessions_vs_subscribe

In [7]:
#picturing relationship between 'number of played hours' and 'number of sessions'
played_hours_vs_sessions = (
    alt.Chart(players_combined, title="Played hours vs number of sessions").mark_point().encode(
        x = alt.X("number_sessions").title("Number of Sessions"),
        y = alt.Y("played_hours").title("Total time played (hours)"),
        color = alt.Color('subscribe')
    )
    .configure_axis(titleFontSize=12)
)

played_hours_vs_sessions

In [8]:
#splitting the testing and training data with ratio 7:3
players_training, players_testing = train_test_split(
   players_combined, test_size=0.3, random_state=2000 
)

X_train = players_training[["age"]]
y_train = players_training['played_hours']

X_test = players_testing[["age"]] 
y_test = players_testing['played_hours'] 

#make a pipeline for our model
players_pipe = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor()
)

#making grid parameters
param_grid = {'kneighborsregressor__n_neighbors': range(1, 30)}

#Doing cross validation
player_tuned = GridSearchCV(
    players_pipe,
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
    n_jobs=-1
)

##make a new dataframe from the results of our cross validation
player_results = pd.DataFrame(
    player_tuned.fit(X_train, y_train).cv_results_
)

#finding best number of K
player_min = player_tuned.best_params_
player_best_RMSPE = -player_tuned.best_score_

player_min


{'kneighborsregressor__n_neighbors': 28}
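
The table of cross-validation results can also be inspected directly, which helps judge how much better the chosen K is than its neighbours. A minimal sketch, assuming synthetic made-up data and a smaller grid (K from 1 to 10) than the one used above; the names `demo_search` and `cv_table` are illustrative only. Because the scorer is `neg_root_mean_squared_error`, the scores are negated to recover the RMSPE.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, made-up data standing in for age and played_hours
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({"age": rng.uniform(10, 50, size=80)})
y_demo = pd.Series(rng.exponential(scale=5.0, size=80), name="played_hours")

demo_search = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsRegressor()),
    {"kneighborsregressor__n_neighbors": range(1, 11)},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# One row per K; negate the score to get the cross-validation RMSPE
cv_table = pd.DataFrame(demo_search.cv_results_)[
    ["param_kneighborsregressor__n_neighbors", "mean_test_score"]
].copy()
cv_table["cv_rmspe"] = -cv_table["mean_test_score"]

# The row with the lowest RMSPE corresponds to best_params_
best_row = cv_table.loc[cv_table["cv_rmspe"].idxmin()]
best_k = int(best_row["param_kneighborsregressor__n_neighbors"])
print(best_k, best_row["cv_rmspe"])
```

Plotting `cv_rmspe` against K for the real `player_results` would show whether the RMSPE curve is flat near the selected value, in which case the exact choice of K matters little.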

In [9]:
#using our best K in the model (taken from the cross-validation result above)
k = player_min['kneighborsregressor__n_neighbors']
knn_model = KNeighborsRegressor(n_neighbors=k)
knn_model.fit(X_train, y_train)

#predicting with our testing dataframe
y_pred = knn_model.predict(X_test)

#finding RMSPE
mse = mean_squared_error(y_test, y_pred)
rmspe = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Prediction Error (RMSPE):", rmspe)
print("R-squared:", r2)

Mean Squared Error (MSE): 851.3491934076262
Root Mean Squared Prediction Error (RMSPE): 29.177888775708674
R-squared: 0.041327806384315546


In [10]:
#plotting the predictions and the actual data 
plot_data = pd.DataFrame({
    'Age': X_test.squeeze(),
    'Actual Played Hours': y_test,
    'Predicted Played Hours': y_pred
})

scatter = alt.Chart(plot_data).mark_circle(size=60).encode(
    x=alt.X('Age', title='Age'),
    y=alt.Y('Actual Played Hours', title='Played Hours'),
    tooltip=['Age', 'Actual Played Hours', 'Predicted Played Hours']
).properties(
    title=f'k-NN Regression: Age vs Played Hours (K={k})'
)

line = alt.Chart(plot_data).mark_line(color='red').encode(
    x='Age',
    y='Predicted Played Hours'
)

final_plot = scatter + line

final_plot.show()

From the plot above, we can see that most of our data comes from ages 15-28, with played hours mostly below 10. However, there are outliers: a player nearly 100 years old and a data point with close to 180 played hours.

Our model predicted that the most played hours would come from players aged ~18-27, which is consistent with the data described above. However, there are no data in the age ranges 33-49 and 51-99, yet the model still predicts played hours there. Those predictions are based only on the nearest available players, so we cannot expect the model to predict well for age groups with no data points.

## Attempt at a simple linear regression using `age` as the covariate and `played_hours` as the response variable:

In [11]:
#making a new model for linear regression and fit the training data
model = LinearRegression()
model.fit(X_train, y_train)

#predicting y_pred with our testing data and the new linear regression model
y_pred = model.predict(X_test)

#finding RMSPE
mse = mean_squared_error(y_test, y_pred)
rmspe = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
n = X_test.shape[0]
p = X_test.shape[1]
adjusted_r2 = 1 - ((1 - r2) * (n - 1) / (n - p - 1))

intercept = model.intercept_
slope = model.coef_[0]


print("Mean Squared Error:", mse)
print("Root Mean Squared Prediction Error (RMSPE):", rmspe)
print("R-squared:", r2)
print("Adjusted R-squared:", adjusted_r2)
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)

print("Intercept (b0):", intercept)
print("Slope (b1):", slope)

Mean Squared Error: 899.8746412307555
Root Mean Squared Prediction Error (RMSPE): 29.99791061442039
R-squared: -0.013314868878676167
Adjusted R-squared: -0.041462504125306054
Intercept: 16.443168705485157
Coefficient: [-0.31916523]
Intercept (b0): 16.443168705485157
Slope (b1): -0.3191652310147073


In [12]:
#plotting results of linear regression

X_all = pd.concat([X_train, X_test], axis=0)
y_all = pd.concat([y_train, y_test], axis=0)

X_all['Actual'] = y_all
X_all['Predicted'] = model.predict(X_all[['age']])

scatter = alt.Chart(X_all.reset_index()).mark_circle(size=60).encode(
    x=alt.X('age', title='Age'),
    y=alt.Y('Actual', title='Played Hours'),
    tooltip=['age', 'Actual', 'Predicted']
).properties(
    title="Linear Regression: Played Hours vs Age (All Data)"
)

line = alt.Chart(X_all.reset_index()).mark_line(color='red', size=2).encode(
    x=alt.X('age', title='Age'),
    y=alt.Y('Predicted', title='Played Hours')
)

plot = scatter + line
plot.display()

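The plot above shows a decreasing trend in the model's predictions: with a fitted slope of about -0.32, predicted played hours fall by roughly 0.32 for each additional year of age.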
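
To make the fitted line concrete, the intercept and slope printed above can be plugged in directly. The sketch below copies those two numbers from the output; the helper `predicted_hours` is a hypothetical name used only for illustration.

```python
# Intercept and slope copied from the linear regression output above
intercept = 16.443168705485157
slope = -0.3191652310147073

def predicted_hours(age):
    """Predicted played hours from the fitted line b0 + b1 * age."""
    return intercept + slope * age

# Predictions at two example ages
print(predicted_hours(20))
print(predicted_hours(50))

# Age at which the line crosses zero; beyond it the line predicts negative hours
zero_crossing_age = -intercept / slope
print(zero_crossing_age)
```

The line predicts about 10 hours at age 20 and about half an hour at age 50, and it goes negative shortly after age 51, which is impossible for played hours and is another sign that a straight line in `age` fits these data poorly.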

In [13]:
X = players_combined[['subscribe_binary', 'experience_val', 'age', 'number_sessions']]
y = players_combined['played_hours']

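#split X and y together so the rows stay aligned
#note: random_state=42 here, so this test set differs from the one used for the models above (random_state=2000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)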


model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
n = X_test.shape[0]
p = X_test.shape[1]
adjusted_r2 = 1 - ((1 - r2) * (n - 1) / (n - p - 1))

print("Mean Squared Error:", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared:", r2)
print("Adjusted R-squared:", adjusted_r2)
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)

coefficients = pd.DataFrame({
    'Feature': ['subscribe_binary', 'experience_val', 'age', 'number_sessions'],
    'Coefficient': model.coef_
})

print(coefficients)

Mean Squared Error: 43.453683282655774
Root Mean Squared Error (RMSE): 6.591940782702449
R-squared: 0.5304112973175941
Adjusted R-squared: 0.4734914545682115
Intercept: -5.2211748359803565
Coefficients: [ 4.03748816  2.17606734 -0.10342626  0.64320256]
            Feature  Coefficient
0  subscribe_binary     4.037488
1    experience_val     2.176067
2               age    -0.103426
3   number_sessions     0.643203


In [14]:
#okay trying something because something seems sus