# Predicting whether a player is likely to contribute a large amount of data #

In [1]:
import pandas as pd
import altair as alt
import numpy as np

from sklearn import set_config
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import mean_squared_error

# Output dataframes instead of arrays
set_config(transform_output="pandas")

## Introduction ##

**Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report**\
PLAICraft is a data collection project that gathers gameplay data from Minecraft players. It is run by UBC’s Pacific Laboratory for Artificial Intelligence (PLAI), whose goal is to advance artificial intelligence. The project focuses on creating an embodied AI, that is, an AI that can understand and learn from its environment (here, Minecraft). It relies on recordings of players' speech and key presses. To support data collection, the team wants to know which demographics contribute the most data, and PLAI is collaborating with the students of DSCI 100 to help answer these predictive questions.\
**Clearly state the question you tried to answer with your project**\
We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.\
**Identify and fully describe the dataset that was used to answer the question**\
The dataset has 9 variables. To answer the question, we choose `played_hours` as the target variable and split it into 2 categories (the `play_time` column created below): players with at least 4 played hours are labelled high, and players with fewer than 4 are labelled low. Players labelled high are considered the "kinds" of players who are likely to contribute a large amount of data. A trained model can then classify new players and indicate whether they are likely to contribute a large amount of data.

## Methods & Results ##

In [2]:
player_data = pd.read_csv("data/players.csv").drop(columns = ['individualId','organizationName','hashedEmail','name'])
#Dropped columns don't contribute to our question 
player_data

Unnamed: 0,experience,subscribe,played_hours,gender,age
0,Pro,True,30.3,Male,9
1,Veteran,True,3.8,Male,17
2,Veteran,False,0.0,Male,17
3,Amateur,True,0.7,Female,21
4,Regular,True,0.1,Male,21
...,...,...,...,...,...
191,Amateur,True,0.0,Female,17
192,Veteran,False,0.3,Male,22
193,Amateur,False,0.0,Prefer not to say,17
194,Amateur,False,2.3,Male,17


In [3]:
# create a new column to show the classification of playtime
player_data["play_time"] = ['high' if played_hours >= 4 else 'low' for played_hours in player_data['played_hours']]
player_data

Unnamed: 0,experience,subscribe,played_hours,gender,age,play_time
0,Pro,True,30.3,Male,9,high
1,Veteran,True,3.8,Male,17,low
2,Veteran,False,0.0,Male,17,low
3,Amateur,True,0.7,Female,21,low
4,Regular,True,0.1,Male,21,low
...,...,...,...,...,...,...
191,Amateur,True,0.0,Female,17,low
192,Veteran,False,0.3,Male,22,low
193,Amateur,False,0.0,Prefer not to say,17,low
194,Amateur,False,2.3,Male,17,low


In [4]:
player_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   experience    196 non-null    object 
 1   subscribe     196 non-null    bool   
 2   played_hours  196 non-null    float64
 3   gender        196 non-null    object 
 4   age           196 non-null    int64  
 5   play_time     196 non-null    object 
dtypes: bool(1), float64(1), int64(1), object(3)
memory usage: 8.0+ KB


In [5]:
plot = alt.Chart(player_data).mark_circle().encode(
    x=alt.X("age")
        .title(["age"]),
    y=alt.Y("gender")
        .title(["gender"]),
    color=alt.Color("play_time")
        .title("play time")
).configure_axis(titleFontSize=19).properties(height=500,width=400)
plot

### Age ###

A preliminary graph was plotted to identify outliers and help guide the analysis. Six data points were very far from the rest: 4 with over 100 hours of playtime and 2 with an age over 60. Overall there were no observable trends, so K-nearest neighbors (KNN) regression was used for a more informative analysis.

The filtered dataset (`filtered_player_data`) was split into training (`player_training`) and testing (`player_testing`) sets, and the model was trained on the training data. A pipeline was created that standardizes the predictor with `StandardScaler` and then fits a KNN regression model. Using 5-fold cross-validation scored by root mean squared prediction error (RMSPE), the optimal K was selected from a parameter grid ranging from 1 to 110. The best K was 70, with a cross-validated RMSPE of 6.95 hours. To evaluate the model on unseen data, the RMSPE was calculated on the test set, giving 6.96 hours.

The results were visualized by plotting all player data points together with the prediction line from the KNN regression model. The prediction line appeared roughly linear, with a slight dip around the 15-20 age range.
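The RMSPE used above can be computed by hand as the square root of the mean squared difference between observed and predicted values. The sketch below is a minimal illustration, not the notebook's own code: the helper name `rmspe` and the toy values are hypothetical.

```python
import math


def rmspe(y_true, y_pred):
    """Root mean squared prediction error, in the units of the target."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical played_hours values and a constant prediction of 1 hour
observed = [0.0, 0.5, 2.0, 10.0]
predicted = [1.0, 1.0, 1.0, 1.0]

score = rmspe(observed, predicted)
print(f"RMSPE: {score:.3f} hours")
```

Because the errors are squared before averaging, a single long-playing player (the 10.0 here) dominates the score, which is worth keeping in mind when most players have close to zero hours.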



#### Preliminary Graph ####

In [6]:
#Plot preliminary graph
age_plot = alt.Chart(player_data).mark_point(opacity=0.4).encode(
    x=alt.X('age:Q').title("Age"),
    y=alt.Y("played_hours:Q").title("Played Hours"),
)
age_plot

**Figure 2:** **Preliminary graph depicting outlier data from players.csv**. Scatter plot of quantitative variables from the DataFrame `player_data`. `age` (x-axis) is plotted against `played_hours` (y-axis). **n=196** players are represented in the graph.

In [7]:
#Remove outliers: 4 data points with over 100 hr playtime and 2 data points with age over 60
filtered_player_data = player_data[(player_data['age'] <= 60) & (player_data['played_hours'] <= 100)]

#Plot the filtered preliminary graph
age_plot_filtered = alt.Chart(filtered_player_data).mark_point(opacity=0.4).encode(
    x=alt.X('age:Q').title("Age").scale(zero=False),
    y=alt.Y("played_hours:Q").title("Played Hours"),
)
age_plot_filtered

**Figure 3:** **Low play time across all age ranges**. Scatter plot of quantitative variables from `filtered_player_data`. `age` (x-axis) is plotted against `played_hours` (y-axis). **n=190** players are represented in the graph.

In [8]:
#Mean, median, max, min and standard deviation of played_hours in filtered_player_data
filtered_player_data_info= filtered_player_data['played_hours'].agg(['mean', 'median', 'max', 'min', 'std'])
print("Info:",filtered_player_data_info)

#Mode of filtered_player_data
filtered_player_data_mode=filtered_player_data['played_hours'].mode()
print("Mode",filtered_player_data_mode)

Info: mean       1.979474
median     0.100000
max       56.100000
min        0.000000
std        7.685742
Name: played_hours, dtype: float64
Mode 0    0.0
Name: played_hours, dtype: float64


#### KNN Regression ####

In [9]:
#Split data into training and testing dataframes
player_training, player_testing = train_test_split(
    filtered_player_data,
    test_size=0.25,
    random_state=33,  
)
X_train = player_training[['age']] 
y_train = player_training['played_hours'] 

X_test = player_testing[['age']]  
y_test = player_testing['played_hours']

In [10]:
# Preprocess the data, make the pipeline
age_pipe = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(),
)

In [11]:
# create the 5-fold GridSearchCV object
param_grid = {
    'kneighborsregressor__n_neighbors': range(1, 111, 1) #neighbors ranging from 1 to 110
}
age_tuned = GridSearchCV(
    age_pipe, 
    param_grid,
    cv=5, 
    n_jobs=-1, 
    scoring='neg_root_mean_squared_error'
)

# Fit the GridSearchCV object and retrieve the CV scores
age_results = pd.DataFrame(age_tuned.fit(X_train, y_train).cv_results_) 
age_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsregressor__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.009683,0.012524,0.002605,0.000918,1,{'kneighborsregressor__n_neighbors': 1},-0.722066,-12.073896,-11.456517,-6.681023,-9.002519,-7.987204,4.105031,107
1,0.002890,0.000026,0.001997,0.000048,2,{'kneighborsregressor__n_neighbors': 2},-7.639823,-15.316804,-11.227851,-6.806057,-6.596516,-9.517410,3.345653,110
2,0.002880,0.000111,0.001961,0.000053,3,{'kneighborsregressor__n_neighbors': 3},-5.155806,-13.604928,-11.103494,-6.162283,-6.629261,-8.531154,3.255206,109
3,0.002829,0.000054,0.001932,0.000006,4,{'kneighborsregressor__n_neighbors': 4},-4.238870,-13.077508,-11.234713,-5.857979,-5.912204,-8.064255,3.444331,108
4,0.004437,0.003235,0.001946,0.000036,5,{'kneighborsregressor__n_neighbors': 5},-3.410022,-12.747023,-11.284856,-6.075152,-5.287915,-7.760994,3.610187,106
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,0.002709,0.000011,0.002033,0.000013,106,{'kneighborsregressor__n_neighbors': 106},-2.242995,-12.079505,-10.663243,-5.657835,-4.618947,-7.052505,3.723086,68
106,0.002717,0.000012,0.002028,0.000009,107,{'kneighborsregressor__n_neighbors': 107},-2.300962,-12.056163,-10.660851,-5.665565,-4.619813,-7.060671,3.700680,73
107,0.002714,0.000032,0.002050,0.000033,108,{'kneighborsregressor__n_neighbors': 108},-2.299375,-12.061981,-10.657457,-5.650980,-4.617129,-7.057384,3.703457,72
108,0.002716,0.000036,0.002027,0.000007,109,{'kneighborsregressor__n_neighbors': 109},-2.284726,-12.049145,-10.665866,-5.653414,-4.616823,-7.053995,3.705253,70


In [12]:
#Find the best K and its RMSPE value
age_min = age_tuned.best_params_
age_best_RMSPE = -age_tuned.best_score_
print("Best Parameters (age_min):", age_min)
print("Best RMSPE (age_best_RMSPE):", age_best_RMSPE)

Best Parameters (age_min): {'kneighborsregressor__n_neighbors': 70}
Best RMSPE (age_best_RMSPE): 6.946063563917676


In [13]:
#Evaluating RMSPE on the test set
age_prediction = age_tuned.predict(X_test)
age_summary = mean_squared_error(
    y_true=y_test, 
    y_pred=age_prediction
)**(1/2)
print("RMSPE of test set:", age_summary)

RMSPE of test set: 6.9597827719687615


In [14]:
np.random.seed(33)
#Predict the hours played for age
age_preds = filtered_player_data.assign(
    predictions= age_tuned.predict(filtered_player_data[['age']])
)
#Plot all players
age_plot = alt.Chart(age_preds).mark_circle(opacity=0.4).encode(
    x=alt.X('age').title('Age').scale(zero=False),
    y=alt.Y('played_hours').title('Hours Played')
)
#Add prediction line
age_plot = age_plot + alt.Chart(age_preds, title= "K=70").mark_line(color="Black").encode(
    x="age",
    y="predictions",
)
age_plot

**Figure 4:** **No relationship between age and hours played.** Scatter plot of quantitative variables from `filtered_player_data` with the predicted hours played from the K-NN regression model (K=70) shown as a black line. `age` (x-axis) is plotted against `played_hours` (y-axis). **n=190** players are represented in the graph.
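A near-flat prediction line is what KNN regression tends to produce when K is large relative to the training set: the training set holds about three quarters of the 190 filtered rows, so K=70 averages roughly half of it for every prediction. The sketch below illustrates this with a hand-written single-predictor KNN. The function `knn_predict` and the toy data are hypothetical, and scaling is omitted because standardizing a single predictor does not change which neighbors are nearest.

```python
def knn_predict(train_x, train_y, query, k):
    """Predict by averaging the targets of the k training points closest to query."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical toy data: ages and played hours
ages = [15, 17, 17, 19, 21, 22, 25, 30]
hours = [0.0, 3.0, 0.5, 12.0, 0.1, 0.0, 1.0, 0.4]

queries = (17, 19, 30)
small_k = [knn_predict(ages, hours, age, k=1) for age in queries]
large_k = [knn_predict(ages, hours, age, k=len(ages)) for age in queries]

print("K=1:", small_k)
print("K=n:", large_k)
```

With K=1 the predictions follow individual points, while with K equal to the number of training points every query receives the overall mean, which plots as a horizontal line.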

### Experience ###

#### Preliminary Graph ####

In [15]:
#Plot preliminary graph
experience_plot = alt.Chart(player_data).mark_point(opacity=0.4).encode(
    x=alt.X('experience:N').title("Experience"),
    y=alt.Y("played_hours:Q").title("Played Hours"),
)
experience_plot

**Figure 5: Preliminary graph depicting outlier data from players.csv**. Plot of the categorical variable `experience` (x-axis) against the quantitative variable `played_hours` (y-axis) from the DataFrame `player_data`.

#### Remove outlier ####
To assess meaningful patterns, the playtime range was adjusted to 0-5 hours.

In [16]:
# Remove outliers for played_hours
filtered_player_data = player_data[player_data['played_hours'] <= 5]

#Plot filter preliminary graph
experience_plot_filtered = alt.Chart(filtered_player_data).mark_point(opacity=0.4).encode(
    x=alt.X('experience:N').title("Experience").scale(zero=False),
    y=alt.Y("played_hours:Q").title("Played Hours"),
)
experience_plot_filtered

**Figure 6: Most players played less than 30 minutes.** Plot of `experience` (x-axis) against `played_hours` (y-axis) from `filtered_player_data`.

#### KNN Regression ####

In [17]:
#Change experience values to numeric codes
#Note: the codes follow the alphabetical order of the labels, not skill level
#Work on an explicit copy so the new column is not assigned to a slice of player_data
filtered_player_data = filtered_player_data.copy()
filtered_player_data['experience_numeric'] = filtered_player_data['experience'].replace({
    'Amateur': 1, 
    'Beginner': 2, 
    'Pro': 3, 
    'Regular': 4, 
    'Veteran': 5
})

X = filtered_player_data[['experience_numeric']]
y = filtered_player_data['played_hours']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
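The numeric codes above follow the alphabetical order of the labels, so distances between codes do not reflect skill. If the levels are meant to be ordinal, one option is an explicit mapping in skill order. The sketch below assumes the ordering Beginner < Amateur < Regular < Pro < Veteran, which is an assumption and not something the dataset confirms. The names `EXPERIENCE_ORDER`, `EXPERIENCE_RANK` and `encode_experience` are hypothetical.

```python
# Assumed skill ordering, lowest to highest (not confirmed by the dataset)
EXPERIENCE_ORDER = ["Beginner", "Amateur", "Regular", "Pro", "Veteran"]
EXPERIENCE_RANK = {level: rank for rank, level in enumerate(EXPERIENCE_ORDER, start=1)}


def encode_experience(levels):
    """Map experience labels to ranks, failing loudly on unexpected labels."""
    unknown = sorted(set(levels) - set(EXPERIENCE_RANK))
    if unknown:
        raise ValueError(f"Unexpected experience levels: {unknown}")
    return [EXPERIENCE_RANK[level] for level in levels]


# Experience values from the first rows displayed earlier in the notebook
sample = ["Pro", "Veteran", "Veteran", "Amateur", "Regular"]
encoded = encode_experience(sample)
print(encoded)
```

In the notebook the same dictionary could be applied with `filtered_player_data['experience'].map(EXPERIENCE_RANK)`. Changing the encoding changes which players count as neighbors, so the tuned K and the errors would need to be recomputed.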


In [30]:
experience_param_grid = {'n_neighbors': range(1, 21)} 

grid_search = GridSearchCV(
    KNeighborsRegressor(), 
    experience_param_grid, 
    cv=5, 
    scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

best_k = grid_search.best_params_['n_neighbors']
#With this scoring, best_score_ is the negated mean squared error (hours squared), not the RMSPE
best_cv_mse = -grid_search.best_score_

print(best_k)
print(best_cv_mse)

15
0.5971404115226336
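Because the grid search was scored with `neg_mean_squared_error`, the second number printed above is a mean squared error in hours squared rather than an RMSPE. Taking its square root gives a value on the same scale as the test-set RMSPE. This is a rough conversion only: the square root of the averaged fold MSE is not identical to the average of the per-fold RMSE values that `neg_root_mean_squared_error` would report.

```python
import math

# Value printed by the grid search above (mean squared error, hours squared)
best_cv_mse = 0.5971404115226336

# Square root puts the error back in hours
approx_cv_rmse = math.sqrt(best_cv_mse)
print(f"Approximate cross-validated RMSE: {approx_cv_rmse:.2f} hours")
```

This comes to about 0.77 hours, which is the figure to compare against the test-set RMSPE.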


In [31]:
knn_regressor = KNeighborsRegressor(n_neighbors=best_k)
knn_regressor.fit(X_train, y_train)

In [32]:
#Evaluating RMSPE on the test set
experience_prediction = grid_search.predict(X_test)
experience_summary = mean_squared_error(
    y_true=y_test, 
    y_pred=experience_prediction
)**(1/2)
print("RMSPE of test set:", experience_summary)

RMSPE of test set: 0.7924596333766161


In [33]:
np.random.seed(33)
#Predict the hours played for each experience level
experience_preds = filtered_player_data.assign(
    predictions= grid_search.predict(filtered_player_data[['experience_numeric']])
)
#Plot all players
experience_plot = alt.Chart(experience_preds).mark_circle(opacity=0.4).encode(
    x=alt.X('experience_numeric').title('Experience').scale(zero=False),
    y=alt.Y('played_hours').title('Hours Played')
)
#Add prediction line
experience_plot = experience_plot + alt.Chart(experience_preds, title= "K=15").mark_line(color="Black").encode(
    x="experience_numeric",
    y="predictions",
)
experience_plot

**Figure 7: No relationship between experience and hours played.** Experience level does not appear to have a significant impact on playtime. Most of the data is concentrated at lower values (below 0.5 hours), with the exception of Pro (3), where the predicted playtime is slightly higher. The KNN model, set with K=15, reflects an overall average trend in playtime, suggesting that higher experience levels do not lead to increased playtime.

### Gender ###

## Discussion ##

### Age ###

Using KNN regression, the analysis indicated no relationship between age and PLAICraft playtime, as shown by the nearly horizontal prediction line. The cross-validated RMSPE for the model was 6.95 hours, meaning that predictions are typically off by about 6.95 hours. This closely matches the test set’s RMSPE of 6.96 hours, suggesting the model performs consistently on new data. However, the RMSPE is not within an acceptable range: it is large compared to the observed playtime values, which have a mean of 1.98 hours and a mode of 0.0 hours in the filtered dataset. This is likely a result of high variability in the data. Although the model performs consistently, its predictions will not be useful because of their large error margins.
Although the model is unreliable, the result was still surprising. We expected a parabolic relationship between playtime and age, with playtime peaking among young adults (18-29 years old) and then decreasing with increasing age. This expectation comes from the literature, which indicates that 67% of young adults are gamers, compared to 40% of older adults (50-64 years old) and 25% of seniors (65+) (Bunz et al., 2020).
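One way to put the 6.95-hour RMSPE in context is to compare it with a naive baseline that always predicts the mean playtime. On the data it is computed from, such a baseline has an RMSE equal to the population standard deviation. The filtered data has a reported standard deviation of about 7.69 hours, which suggests the tuned KNN model may do only slightly better than predicting the mean. The sketch below checks the identity on a few `played_hours` values taken from the rows displayed earlier; the helper name `rmse` is hypothetical and the sample is for illustration only.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# A few played_hours values from the displayed rows (illustration only)
hours = [30.3, 3.8, 0.0, 0.7, 0.1, 0.0, 0.3, 2.3]

# Baseline: predict the mean for every player
mean_hours = statistics.fmean(hours)
baseline_rmse = rmse(hours, [mean_hours] * len(hours))
population_std = statistics.pstdev(hours)

print(f"Baseline RMSE: {baseline_rmse:.3f}")
print(f"Population standard deviation: {population_std:.3f}")
```

The two printed values agree, so the standard deviation of `played_hours` is a convenient yardstick for judging whether a regression model adds anything beyond the mean.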

## References ##

* Bunz, U., Cortese, J., & Sellers, N. (2020). Examining younger and older adults' digital gaming habits and health measures. Gerontechnology, 19(4). https://doi.org/10.4017/gt.2020.19.04.381