# Predicting whether a player is likely to contribute a large amount of data #

In [2]:
import pandas as pd
import altair as alt
import numpy as np

from sklearn import set_config
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import mean_squared_error

# Output dataframes instead of arrays
set_config(transform_output="pandas")

## Introduction ##

**Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report**\
PLAICraft is a data collection project that gathers gameplay data from Minecraft players. It is run by UBC's Pacific Laboratory for Artificial Intelligence (PLAI), whose goal is to advance artificial intelligence. The project focuses on building an embodied AI, that is, an AI that can understand and learn from its environment (here, Minecraft). It relies on recordings of players' speech and key presses. To help with data collection, the team wants to know which demographics contribute the most data, and PLAI is collaborating with the students of DSCI 100 to help answer its predictive questions.\
**Clearly state the question you tried to answer with your project**\
We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.\
**Identify and fully describe the dataset that was used to answer the question**\
The dataset has 9 variables. To answer the question, we use `played_hours` as the target variable and split it into two categories: players with 4 or more played hours are labelled `high`, and players with fewer than 4 played hours are labelled `low`. Players labelled `high` are treated as the "kinds" of players who are likely to contribute a large amount of data. The trained model can then classify new players and indicate whether they are likely to contribute a large amount of data.

## Methods & Results ##

In [3]:
# Drop identifier columns that do not help answer our question
player_data = pd.read_csv("data/players.csv").drop(
    columns=["individualId", "organizationName", "hashedEmail", "name"]
)
player_data

Unnamed: 0,experience,subscribe,played_hours,gender,age
0,Pro,True,30.3,Male,9
1,Veteran,True,3.8,Male,17
2,Veteran,False,0.0,Male,17
3,Amateur,True,0.7,Female,21
4,Regular,True,0.1,Male,21
...,...,...,...,...,...
191,Amateur,True,0.0,Female,17
192,Veteran,False,0.3,Male,22
193,Amateur,False,0.0,Prefer not to say,17
194,Amateur,False,2.3,Male,17


In [5]:
# create a new column to show the classification of playtime
player_data["play_time"] = ['high' if played_hours >= 4 else 'low' for played_hours in player_data['played_hours']]
player_data

Unnamed: 0,experience,subscribe,played_hours,gender,age,play_time
0,Pro,True,30.3,Male,9,high
1,Veteran,True,3.8,Male,17,low
2,Veteran,False,0.0,Male,17,low
3,Amateur,True,0.7,Female,21,low
4,Regular,True,0.1,Male,21,low
...,...,...,...,...,...,...
191,Amateur,True,0.0,Female,17,low
192,Veteran,False,0.3,Male,22,low
193,Amateur,False,0.0,Prefer not to say,17,low
194,Amateur,False,2.3,Male,17,low
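
Because only a few players log many hours, it can help to check how balanced the two `play_time` classes are before modelling. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and the `HIGH_THRESHOLD` name are assumptions for illustration, not values from `players.csv`); it applies the same `>= 4` rule as the cell above and counts each class.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only; not taken from players.csv
toy = pd.DataFrame({"played_hours": [30.3, 3.8, 0.0, 4.0, 0.7]})

HIGH_THRESHOLD = 4  # same cut-off as the notebook cell above

# Label each player, then count how many fall in each class
toy["play_time"] = np.where(toy["played_hours"] >= HIGH_THRESHOLD, "high", "low")
class_counts = toy["play_time"].value_counts()
print(class_counts)
```

Running the same `value_counts()` call on `player_data["play_time"]` would show the class balance in the real data.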


In [7]:
player_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   experience    196 non-null    object 
 1   subscribe     196 non-null    bool   
 2   played_hours  196 non-null    float64
 3   gender        196 non-null    object 
 4   age           196 non-null    int64  
 5   play_time     196 non-null    object 
dtypes: bool(1), float64(1), int64(1), object(3)
memory usage: 8.0+ KB


In [10]:
plot = alt.Chart(player_data).mark_circle().encode(
    x=alt.X("age")
        .title(["age"]),
    y=alt.Y("gender")
        .title(["gender"]),
    color=alt.Color("play_time")
        .title("play time")
).configure_axis(titleFontSize=19).properties(height=500,width=400)
plot

### Age ###

A preliminary graph of played hours against age was plotted to spot outliers and guide the analysis. Six data points lay very far from the rest: 4 with over 100 hours of playtime and 2 with an age over 60. These were removed before modelling. No clear trend was visible, so K-nearest neighbors (KNN) regression was used for a more informative analysis.\
The filtered dataset (`filtered_player_data`) was split into training (`player_training`) and testing (`player_testing`) sets, and the model was trained on the training data only. A pipeline was built that standardizes `age` with `StandardScaler` and then fits a KNN regression model. Using 5-fold cross-validation scored by root mean squared prediction error (RMSPE), the optimal K was selected from a parameter grid ranging from 1 to 110. The best K was 70, with a cross-validation RMSPE of 6.95 hours. To evaluate the model on unseen data, the RMSPE was also calculated on the test set, giving 6.96 hours.\
The results were visualized by plotting all player data points together with the prediction line from the KNN regression model. The prediction line appeared roughly linear, with a slight dip around ages 15 to 20.
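
For reference, the RMSPE quoted in this section is usually defined as below. The symbols are notation introduced here for illustration: $n$ is the number of held-out players, $y_i$ is the observed played hours for player $i$, and $\hat{y}_i$ is the model's prediction for that player.

$$
\text{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Because the error is expressed in the units of the target, an RMSPE of about 7 means predictions are typically off by roughly 7 played hours.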



#### Preliminary Graph ####

In [46]:
# Plot preliminary graph of played hours against age
age_plot = alt.Chart(player_data).mark_point(opacity=0.4).encode(
    x=alt.X('age:Q').title("Age"),
    y=alt.Y("played_hours:Q").title("Played Hours"),
)
age_plot

In [60]:
# Remove outliers: 4 data points with over 100 hours of playtime and 2 data points with age over 60
filtered_player_data = player_data[(player_data['age'] <= 60) & (player_data['played_hours'] <= 100)]

# Plot the filtered preliminary graph
age_plot_filtered = alt.Chart(filtered_player_data).mark_point(opacity=0.4).encode(
    x=alt.X('age:Q').title("Age").scale(zero=False),
    y=alt.Y("played_hours:Q").title("Played Hours"),
)
age_plot_filtered
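
One way to confirm how many rows each outlier rule removes is to count the two conditions separately before filtering. This is a minimal sketch on hypothetical toy data (the `toy` frame and the `removed_summary` name are assumptions for illustration); the thresholds match the cell above.

```python
import pandas as pd

# Hypothetical toy data for illustration only; not taken from players.csv
toy = pd.DataFrame({
    "age": [17, 21, 65, 19, 30],
    "played_hours": [0.5, 150.0, 1.0, 3.0, 101.0],
})

# Flag each kind of outlier separately so the counts can be reported
too_old = toy["age"] > 60
too_many_hours = toy["played_hours"] > 100

# Keep only rows that trigger neither rule
kept = toy[~too_old & ~too_many_hours]

removed_summary = {
    "age_over_60": int(too_old.sum()),
    "hours_over_100": int(too_many_hours.sum()),
    "rows_kept": len(kept),
}
print(removed_summary)
```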

#### KNN Regression ####

In [49]:
#Split data into training and testing dataframes
player_training, player_testing = train_test_split(
    filtered_player_data,
    test_size=0.25,
    random_state=33,  
)
X_train = player_training[['age']] 
y_train = player_training['played_hours'] 

X_test = player_testing[['age']]  
y_test = player_testing['played_hours']

In [50]:
# Preprocess the data, make the pipeline
age_pipe = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(),
)
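
To make the idea behind KNN regression concrete: a prediction for a new age is the average of the played hours of the K training players closest in age. The sketch below is a from-scratch illustration in NumPy, not scikit-learn's implementation (which may break ties differently); the function name `knn_regress` and the toy ages and hours are hypothetical.

```python
import numpy as np

def knn_regress(train_x, train_y, query_x, k):
    """Predict by averaging the targets of the k training points nearest to query_x."""
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y, dtype=float)
    # With a single feature, distance is just the absolute difference
    distances = np.abs(train_x - query_x)
    # A stable sort keeps the original order among equally distant points
    nearest = np.argsort(distances, kind="stable")[:k]
    return float(train_y[nearest].mean())

# Hypothetical toy data for illustration only
ages = [17, 18, 21, 22, 30]
hours = [1.0, 3.0, 0.5, 2.5, 10.0]

prediction_k3 = knn_regress(ages, hours, query_x=20, k=3)
prediction_k5 = knn_regress(ages, hours, query_x=20, k=5)
print(prediction_k3, prediction_k5)
```

With `k` equal to the number of training points, every query receives the overall mean, which is why very large K values tend to produce a nearly flat prediction line.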

In [53]:
# create the 5-fold GridSearchCV object
param_grid = {
    'kneighborsregressor__n_neighbors': range(1, 111, 1) #neighbors ranging from 1 to 110
}
age_tuned = GridSearchCV(
    age_pipe, 
    param_grid,
    cv=5, 
    n_jobs=-1, 
    scoring='neg_root_mean_squared_error'
)

# Fit the GridSearchCV object and retrieve the CV scores
age_results = pd.DataFrame(age_tuned.fit(X_train, y_train).cv_results_) 
age_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsregressor__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.003047,0.000250,0.002122,0.000116,1,{'kneighborsregressor__n_neighbors': 1},-0.722066,-12.073896,-11.456517,-6.681023,-9.002519,-7.987204,4.105031,107
1,0.002824,0.000077,0.001974,0.000083,2,{'kneighborsregressor__n_neighbors': 2},-7.639823,-15.316804,-11.227851,-6.806057,-6.596516,-9.517410,3.345653,110
2,0.002779,0.000024,0.001900,0.000004,3,{'kneighborsregressor__n_neighbors': 3},-5.155806,-13.604928,-11.103494,-6.162283,-6.629261,-8.531154,3.255206,109
3,0.002763,0.000016,0.001895,0.000014,4,{'kneighborsregressor__n_neighbors': 4},-4.238870,-13.077508,-11.234713,-5.857979,-5.912204,-8.064255,3.444331,108
4,0.002755,0.000012,0.001908,0.000026,5,{'kneighborsregressor__n_neighbors': 5},-3.410022,-12.747023,-11.284856,-6.075152,-5.287915,-7.760994,3.610187,106
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,0.002635,0.000011,0.002011,0.000046,106,{'kneighborsregressor__n_neighbors': 106},-2.242995,-12.079505,-10.663243,-5.657835,-4.618947,-7.052505,3.723086,68
106,0.002677,0.000053,0.001975,0.000008,107,{'kneighborsregressor__n_neighbors': 107},-2.300962,-12.056163,-10.660851,-5.665565,-4.619813,-7.060671,3.700680,73
107,0.002633,0.000069,0.002005,0.000066,108,{'kneighborsregressor__n_neighbors': 108},-2.299375,-12.061981,-10.657457,-5.650980,-4.617129,-7.057384,3.703457,72
108,0.002649,0.000074,0.001965,0.000013,109,{'kneighborsregressor__n_neighbors': 109},-2.284726,-12.049145,-10.665866,-5.653414,-4.616823,-7.053995,3.705253,70


In [55]:
#Find the best K and its RMSPE value
age_min = age_tuned.best_params_
age_best_RMSPE = -age_tuned.best_score_
print("Best Parameters (age_min):", age_min)
print("Best RMSPE (age_best_RMSPE):", age_best_RMSPE)

Best Parameters (age_min): {'kneighborsregressor__n_neighbors': 70}
Best RMSPE (age_best_RMSPE): 6.946063563917676
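
The scoring string `neg_root_mean_squared_error` reports the negated error so that a larger score is always better, which is why the cell above flips the sign of `best_score_`. The sketch below illustrates that convention using three `mean_test_score` values copied from the results table above (K = 1, 5, 106) together with the best score printed for K = 70; the array names are illustrative.

```python
import numpy as np

# K values and their mean CV scores, copied from the notebook output above
k_values = np.array([1, 5, 70, 106])
mean_test_score = np.array([-7.987204, -7.760994, -6.946064, -7.052505])

# The "best" score is the largest (least negative) one
best_index = int(np.argmax(mean_test_score))
best_k = int(k_values[best_index])

# Negate to recover the RMSPE in hours
best_rmspe = float(-mean_test_score[best_index])
print(best_k, best_rmspe)
```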


In [56]:
#Evaluating RMSPE on the test set
age_prediction = age_tuned.predict(X_test)
age_summary = mean_squared_error(
    y_true=y_test, 
    y_pred=age_prediction
)**(1/2)

age_summary

np.float64(6.9597827719687615)
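
The test-set calculation above takes the square root of the mean squared error. As a minimal sketch of the same arithmetic done by hand, assuming hypothetical toy values for `y_true` and `y_pred`:

```python
import numpy as np

# Hypothetical observed and predicted played hours, for illustration only
y_true = np.array([0.0, 2.0, 4.0, 10.0])
y_pred = np.array([1.0, 2.0, 2.0, 6.0])

# Squared errors are 1, 0, 4 and 16, so the mean squared error is 5.25
squared_errors = (y_true - y_pred) ** 2
mse = float(np.mean(squared_errors))

# RMSPE is the square root of the mean squared error
rmspe = float(np.sqrt(mse))
print(mse, rmspe)
```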

In [62]:
np.random.seed(33)
#Predict the hours played for age
age_preds = filtered_player_data.assign(
    predictions= age_tuned.predict(filtered_player_data[['age']])
)
#Plot all players
age_plot = alt.Chart(age_preds).mark_circle(opacity=0.4).encode(
    x=alt.X('age').title('Age').scale(zero=False),
    y=alt.Y('played_hours').title('Hours Played')
)
#Add prediction line
age_plot = age_plot + alt.Chart(age_preds, title= "K=70").mark_line(color="Black").encode(
    x="age",
    y="predictions",
)
age_plot

## Discussion ##

## References ##