# Predicting whether a player is likely to contribute a large amount of data #

In [1]:
import pandas as pd
import altair as alt
import numpy as np

from sklearn import set_config
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics import recall_score, precision_score
from sklearn.compose import make_column_selector

# Output dataframes instead of arrays
set_config(transform_output="pandas")

## Introduction ##

PLAICraft is a data collection project that gathers gameplay data from Minecraft players. It is run by the UBC’s Pacific Laboratory for Artificial Intelligence (PLAI), whose goal is to advance artificial intelligence. This project specifically is focused on the creation of an AI that can understand and learn from its environment (Minecraft) called an embodied AI. The Project relies on data of players' speech and key presses. In order to help with data collection the team is interested in targeted recruiting for demographics that contribute the most data. To achieve this, PLAI has collaborated with students from DSCI 100 to identify the "kinds" of players most likely to provide a significant amount of data.

The provided dataset from `players.csv` has 9 variables: `experience`, `subscribe`, `played_hours`, `name` ,`gender`, `age`, `individualId`, and  `organizationName`. Among these, the variables `name` ,`gender`, `individualId`, and  `organizationName` are not related to any measured property relevant to this analysis and will not be considered. To answer what demographic contributes the most data, a variant of `played_ hours` will be our target variable. To help with downstream classification analysis players with more than 2 hours of playtime are categorized as "high", while those with less than 2 hours are categorized as "low” in the new column `play_time`. Although 2 hours is not a significant playtime, the overall play hours in the dataset are low; therefore, this threshold was set to create proportionate categories. Players who have high `play_hours` are the "kinds" of players who are likely to contribute a large amount of data. The data is tidy because every row represents an observation, every column represents a variable, and every entry is a value. `play_time` will be used as our target variable, which is calculated from `played_ hours`. A trained K-Nearest Neighbor(KNN) will be used to classify whether or not someone will have high play time based on their `experience`, `subscribe`, `name` , `age`. In addition a KNN regression model will be used to see if theres any trends with age and play hours.

## Methods & Results ##

In [2]:
player_data = pd.read_csv("data/players.csv")
player_data

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


In [3]:
# create a new column to show the classification of playtime
player_data["play_time"] = ['high' if played_hours >= 2 else 'low' for played_hours in player_data['played_hours']]
player_data

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName,play_time
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,,high
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,,high
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,,low
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,,low
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,,low
...,...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,,low
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,,low
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,,low
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,,high


In [4]:
player_data.info()
player_data["play_time"].value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   experience        196 non-null    object 
 1   subscribe         196 non-null    bool   
 2   hashedEmail       196 non-null    object 
 3   played_hours      196 non-null    float64
 4   name              196 non-null    object 
 5   gender            196 non-null    object 
 6   age               196 non-null    int64  
 7   individualId      0 non-null      float64
 8   organizationName  0 non-null      float64
 9   play_time         196 non-null    object 
dtypes: bool(1), float64(3), int64(1), object(5)
memory usage: 14.1+ KB


play_time
low     170
high     26
Name: count, dtype: int64

In [5]:
plot_1 = alt.Chart(player_data, title = "played time (hours) vs. age").mark_bar().encode(
    x=alt.X("age")
        .title("age"),
    y=alt.Y("played_hours")
        .title("played time (hours)"),
).configure_axis(titleFontSize=11).properties(height=500,width=500)

plot_1

In [6]:
plot_2 = alt.Chart(player_data, title = "played time (hours) vs. experience").mark_bar().encode(
    x=alt.X("experience")
        .title("experience"),
    y=alt.Y("played_hours")
        .title("played time (hours)"),
).configure_axis(titleFontSize=11).properties(height=500,width=500)

plot_2

In [7]:
plot_3 = alt.Chart(player_data, title = "played time (hours) vs. subscribe").mark_bar().encode(
    x=alt.X("subscribe")
        .title(["subscribe"]),
    y=alt.Y("played_hours")
        .title(["played time (hours)"]),
).configure_axis(titleFontSize=11).properties(height=500,width=500)
plot_3

In [8]:
plot_4 = alt.Chart(player_data, title = "played time (hours) vs. gender").mark_bar().encode(
    x=alt.X("gender")
        .title(["gender"]),
    y=alt.Y("played_hours")
        .title(["played time (hours)"]),
).configure_axis(titleFontSize=11).properties(height=500,width=500)
plot_4

In the summary statistics, the variables chosen to be predictors have 196 entries and do not have empty entries. We do not use `SimpleImputer` because there are not empty entries. In our list of predictors, we have one numeric variable `age`, and `experience`, `subscribe`, `name` ,`gender`, `play_time` are categprical. We will distinguish the categorical variables and ordinal variables later. We used `value_counts()` to see the number of observations in the classes. There is a imbalance in the data where low play_time contribute a larger proportion to th dataset than high play_time because most participants do not spend time playing and there are a lot of zero's in the `played_hours`. We will use various metrics to score our models to avoid bias from the imbalance in the dataset. \

In the visuaization, we plot play_time against other variables to see the correlations. The values in the plots are scaled to give better visualizations and the values are not accurate numbers from the dataset. However, we see that players around 20 have significantly higher `played_hours`. People with medium experience have higher  `played_hours` from the "played time (hours) vs. experience" plot. Players who subscribed have higher `played_hours`. There is not a obvious correlation in the gender and the  `played_hours` of male players are just slightly higher than female and non-binary players. The other genders do not have large representations in the dataset because the observations are not sufficient. Overall, male and female players have the highest `played_hours`. From the plots, we are able observe the correlations between target and predictor variables and we conclude that the predictor variables do have some correlation with target variable. We will include all the predictor variables mentioned previously to build the classifier.

### Correlation between Age and Played hours ###

A preliminary graph initially plotted to observe any outliers and help guide analysis. There were 6 data points that were very far from the rest: 4 data points with over 100 hr playtime and 2 data points with age over 60. Overall there did not seem to be any observable trends however for a more informative analysis K-Nearest Neighbors (KNN) regression was employed.The dataset (`filtered_player_data`) was split into training (`player_training`) and testing (`player_testing`) sets. The model was trained using the training data. First a column transformer was applied for preprocessing, and a model pipeline was created for KNNregression. Using 5-fold cross-validation measuring root mean square prediction error (RMSPE), the optimal K was selected from a parameter grid ranging from 1 to 110. K was determined to be 70, yielding a RMSPE value of 6.95 hours. To evaluate the model's accuracy on unseen data, the RMSPE was calculated on the test set, resulting in an RMSPE of 6.96 hours.Visualization was performed by plotting all player data points along with a prediction line generated by the KNNregression model. The prediction line appeared relatively linear, with a slight indentation observed around the 15-20 range.



#### Preliminary Graph ####

In [9]:
#Plot preliminary graph
age_plot = alt.Chart(player_data).mark_point(opacity=0.4).encode(
    x=alt.X('age:Q').title("Age"),
    y=alt.Y("played_hours:Q").title("Played Hours"),
)
age_plot

**Figure 2:** **Preliminary graph depicting outlier data from players.csv**. Scatter plot of quantitive variables from the Dataframe,`player_data`. `age`(x-axis) is plotted against `played_hours`(y-axis). **n=196** player data represented in graph.

In [10]:
#Remove outlier. 4 datapoints with over 100hr playtime and 2 datapoints with age over 60
filtered_player_data = player_data[(player_data['age'] <= 60) & (player_data['played_hours'] <= 100)]

#Plot filter preliminary graph
age_plot_filtered = alt.Chart(filtered_player_data).mark_point(opacity=0.4).encode(
    x=alt.X('age:Q').title("Age").scale(zero=False),
    y=alt.Y("played_hours:Q").title("Played Hours"),
)
age_plot_filtered

**Figure 3:** **Low play time across all age range**. Scatter plot of quantitive variables `filtered_player_data`. `age`(x-axis) is plotted against `played_hours`(y-axis). **n=190** player data represented in graph.

In [11]:
#Mean, Median, Max and Min of filtered_player_data
filtered_player_data_info= filtered_player_data['played_hours'].agg(['mean', 'median', 'max', 'min', 'std'])
print("Info:",filtered_player_data_info)

#Mode of filtered_player_data
filtered_player_data_mode=filtered_player_data['played_hours'].mode()
print("Mode",filtered_player_data_mode)

Info: mean       1.979474
median     0.100000
max       56.100000
min        0.000000
std        7.685742
Name: played_hours, dtype: float64
Mode 0    0.0
Name: played_hours, dtype: float64


#### KNN Regression ####

In [12]:
#Split data into training and testing dataframes
player_training, player_testing = train_test_split(
    filtered_player_data,
    test_size=0.25,
    random_state=33,  
)
X_train = player_training[['age']] 
y_train = player_training['played_hours'] 

X_test = player_testing[['age']]  
y_test = player_testing['played_hours']

In [13]:
# Preprocess the data, make the pipeline
age_pipe = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(),
)

In [14]:
# create the 5-fold GridSearchCV object
param_grid = {
    'kneighborsregressor__n_neighbors': range(1, 111, 1) #neighbors ranging from 1 to 110
}
age_tuned = GridSearchCV(
    age_pipe, 
    param_grid,
    cv=5, 
    n_jobs=-1, 
    scoring='neg_root_mean_squared_error'
)

# Fit the GridSearchCV object and retrieve the CV scores
age_results = pd.DataFrame(age_tuned.fit(X_train, y_train).cv_results_) 
age_results

  _data = np.array(data, dtype=dtype, copy=copy,


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsregressor__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.003299,0.000626,0.002109,0.000156,1,{'kneighborsregressor__n_neighbors': 1},-0.722066,-12.073896,-11.456517,-6.681023,-9.002519,-7.987204,4.105031,107
1,0.002884,0.000031,0.001992,0.000024,2,{'kneighborsregressor__n_neighbors': 2},-7.639823,-15.316804,-11.227851,-6.806057,-6.596516,-9.517410,3.345653,110
2,0.002935,0.000124,0.002389,0.000834,3,{'kneighborsregressor__n_neighbors': 3},-5.155806,-13.604928,-11.103494,-6.162283,-6.629261,-8.531154,3.255206,109
3,0.004009,0.002007,0.018782,0.033479,4,{'kneighborsregressor__n_neighbors': 4},-4.238870,-13.077508,-11.234713,-5.857979,-5.912204,-8.064255,3.444331,108
4,0.002882,0.000072,0.001969,0.000032,5,{'kneighborsregressor__n_neighbors': 5},-3.410022,-12.747023,-11.284856,-6.075152,-5.287915,-7.760994,3.610187,106
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105,0.002696,0.000005,0.002011,0.000006,106,{'kneighborsregressor__n_neighbors': 106},-2.242995,-12.079505,-10.663243,-5.657835,-4.618947,-7.052505,3.723086,68
106,0.002705,0.000025,0.002018,0.000010,107,{'kneighborsregressor__n_neighbors': 107},-2.300962,-12.056163,-10.660851,-5.665565,-4.619813,-7.060671,3.700680,73
107,0.002703,0.000027,0.002025,0.000009,108,{'kneighborsregressor__n_neighbors': 108},-2.299375,-12.061981,-10.657457,-5.650980,-4.617129,-7.057384,3.703457,72
108,0.002722,0.000066,0.002013,0.000007,109,{'kneighborsregressor__n_neighbors': 109},-2.284726,-12.049145,-10.665866,-5.653414,-4.616823,-7.053995,3.705253,70


In [15]:
#Find the best K and its RMSPE value
age_min = age_tuned.best_params_
age_best_RMSPE = -age_tuned.best_score_
print("Best Parameters (age_min):", age_min)
print("Best RMSPE (age_best_RMSPE):", age_best_RMSPE)

Best Parameters (age_min): {'kneighborsregressor__n_neighbors': 70}
Best RMSPE (age_best_RMSPE): 6.946063563917676


In [16]:
#Evaluating RMSPE on the test set
age_prediction = age_tuned.predict(X_test)
age_summary = mean_squared_error(
    y_true=y_test, 
    y_pred=age_prediction
)**(1/2)
print("RMPSE of test set:", age_summary)

RMPSE of test set: 6.9597827719687615


In [17]:
np.random.seed(33)
#Predict the hours played for age
age_preds = filtered_player_data.assign(
    predictions= age_tuned.predict(filtered_player_data[['age']])
)
#Plot all players
age_plot = alt.Chart(age_preds).mark_circle(opacity=0.4).encode(
    x=alt.X('age').title('Age').scale(zero=False),
    y=alt.Y('played_hours').title('Hours Played')
)
#Add prediction line
age_plot = age_plot + alt.Chart(age_preds, title= "K=70").mark_line(color="Black").encode(
    x="age",
    y="predictions",
)
age_plot

**Figure 3:** **No strong relationship between age and hours played.** Scatter plot of quantitive variables `filtered_player_data` with a prediction line(Black). `age`(x-axis) is plotted against `played_hours`(y-axis). **n=190** player data represented in graph. Predicted values of hours played (black line) for K-NN regression model (K=70).

### Correlation between Experience and Played hours ###

In [27]:
#Plot preliminary graph
experience_plot = alt.Chart(player_data).mark_point(opacity=0.4).encode(
    x=alt.X('experience:N').title("Experience"),
    y=alt.Y("played_hours:Q").title("Played Hours"),
).configure_axis(titleFontSize=11)
experience_plot

**Figure 1: Preliminary graph depicting outlier data from players.csv**. Scatter plot of quantitive variables from the Dataframe,`player_data`. `experience`(x-axis) is plotted against `played_hours`(y-axis).

In [28]:
# Remove outliers for played_hours
filtered_player_data = player_data[player_data['played_hours'] <= 5]

#Plot filter preliminary graph
experience_plot_filtered = alt.Chart(filtered_player_data).mark_point(opacity=0.4).encode(
    x=alt.X('experience:N').title("Experience").scale(zero=False),
    y=alt.Y("played_hours:Q").title("Played Hours"),
).configure_axis(titleFontSize=11)
experience_plot_filtered

**Figure 2: Most of players played less than 30 minuates** Scatter plot of quantitive variables filtered_player_data. `experience`(x-axis) is plotted against `played_hours`(y-axis).

### KNN Regression ###

In [30]:
#change experience value to numeric value
filtered_player_data['experience_numeric'] = filtered_player_data['experience'].replace({
    'Amateur': 1, 
    'Beginner': 2, 
    'Pro': 3, 
    'Regular': 4, 
    'Veteran': 5
})

X = filtered_player_data[['experience_numeric']]
y = filtered_player_data['played_hours']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

  filtered_player_data['experience_numeric'] = filtered_player_data['experience'].replace({
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_player_data['experience_numeric'] = filtered_player_data['experience'].replace({


In [32]:
experience_param_grid = {'n_neighbors': range(1, 21)} 

grid_search = GridSearchCV(
    KNeighborsRegressor(), 
    experience_param_grid, 
    cv=5, 
    scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

best_k = grid_search.best_params_['n_neighbors']
best_rmspe = -grid_search.best_score_

print(best_k)
print(best_rmspe)

15
0.5971404115226336


In [33]:
knn_regressor = KNeighborsRegressor(n_neighbors=best_k)
knn_regressor.fit(X_train, y_train)

In [34]:
#Evaluating RMSPE on the test set
experience_prediction = grid_search.predict(X_test)
experience_summary = mean_squared_error(
    y_true=y_test, 
    y_pred=experience_prediction
)**(1/2)
print("RMPSE of test set:", experience_summary)

RMPSE of test set: 0.7924596333766161


In [35]:
np.random.seed(33)
#Predict the hours played for age
experience_preds = filtered_player_data.assign(
    predictions= grid_search.predict(filtered_player_data[['experience_numeric']])
)
#Plot all players
experience_plot = alt.Chart(experience_preds).mark_circle(opacity=0.4).encode(
    x=alt.X('experience_numeric').title('Experience').scale(zero=False),
    y=alt.Y('played_hours').title('Hours Played')
)
#Add prediction line
experience_plot = experience_plot + alt.Chart(experience_preds, title= "K=15").mark_line(color="Black").encode(
    x="experience_numeric",
    y="predictions",
)
experience_plot

**Figure 3: No relationship between age and hours played.** Experience level does not appear to have a significant impact on playtime. Most of the data is concentrated at lower values (below 0.5 hours), with the exception of Pro (3), where the predicted playtime is slightly higher. The KNN model, set with K=15, reflects an overall average trend in playtime, suggesting that higher experience levels do not lead to increased playtime.

### Correlation between Gender and Played hours ###

In [38]:
#Plot preliminary graph
gender_plot = alt.Chart(player_data).mark_point(opacity=0.4).encode(
    x=alt.X('gender:N').title("Gender"),
    y=alt.Y("played_hours:Q").title("Played Hours"),
).configure_axis(titleFontSize=11)
gender_plot

**Figure 1: Preliminary graph depicting outlier data from players.csv**. Scatter plot of quantitive variables from the Dataframe,`player_data`. `gender`(x-axis) is plotted against `played_hours`(y-axis).

In [40]:
# Remove outliers for played_hours
filtered_player_data = player_data[player_data['played_hours'] <= 5]

#Plot filter preliminary graph
gender_plot_filtered = alt.Chart(filtered_player_data).mark_point(opacity=0.4).encode(
    x=alt.X('gender:N').title("Gender").scale(zero=False),
    y=alt.Y("played_hours:Q").title("Played Hours"),
).configure_axis(titleFontSize=11)
gender_plot_filtered

**Figure 2: Most of players played less than 1 hour** Scatter plot of quantitive variables filtered_player_data. `gender`(x-axis) is plotted against `played_hours`(y-axis).

### KNN Regression ###

In [None]:
X = filtered_player_data[['experience_numeric']]
y = filtered_player_data['played_hours']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

### Classification using KNN classfier 

First, we choose variables that will be the predictors. hashedEmail, name, individualID, and organizationName are unique identifiers for the observation. The variables are not related to any measured property of the cells, and should not be used as predictors. We choose the other variables to be predictors because from EDA, we know they have some weak or strong correlation with the time they spend on playing. We know that there are no missing values in our predictors from previous EDA. In the task, played_hours is categorized and we use play_time to replace played_hours, which helps us build a classifier model that predicts whether a player contributes high amount of data by playing the game or low amount. We use encodings to transform categorical variables into numeric variables in order to be compatible with KNN classifier and use ordinal encoder for variables that have an assumed ordering. For example, we assume players with more experience will play more frequently. We then use the grid search that ranges from k=1 to k=14 to get the best hyperparameter and optimize the model. 

In [20]:
# choose 20% of data to be test data
train_df, test_df = train_test_split(player_data, test_size=0.20, random_state=123)

X_train, y_train = (
    train_df.drop(columns=["hashedEmail", "name", "individualId", "organizationName", "play_time"]),
    train_df["play_time"],
)
X_test, y_test = (
    test_df.drop(columns=["hashedEmail", "name", "individualId", "organizationName", "play_time"]),
    test_df["play_time"],
)


# create the pipeline and CV grid search objects
param_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 15, 1),
}

numeric_features = ["age"]
# "experience" is ordinal because we assume there is a ordering in which more experienced players spend more time to play
ordinal_features = ["experience", "subscribe"]
ordering= [["Beginner", "Amateur", "Regular", "Pro", "Veteran"], [False, True]]
categorical_features = ["gender"]

play_preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OrdinalEncoder(categories=ordering), ordinal_features),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_features),
    remainder="passthrough",
    verbose_feature_names_out=False
)

player_tune_pipe = make_pipeline(play_preprocessor, KNeighborsClassifier())
player_tune_grid = GridSearchCV(
    estimator=player_tune_pipe,
    param_grid=param_grid,
    cv=10,
    n_jobs=-1
)
player_tune_grid.fit(X_train, y_train)
accuracies_grid = pd.DataFrame(player_tune_grid.cv_results_)
accuracies_grid["sem_test_score"] = accuracies_grid["std_test_score"] / 10**(1/2)
accuracies_grid = (
    accuracies_grid[[
        "param_kneighborsclassifier__n_neighbors",
        "mean_test_score",
        "sem_test_score"
    ]]
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
)
print(player_tune_grid.best_params_)
accuracies_grid

{'kneighborsclassifier__n_neighbors': 4}


Unnamed: 0,n_neighbors,mean_test_score,sem_test_score
0,1,0.974167,0.010014
1,2,0.974583,0.00985
2,3,0.980417,0.009466
3,4,0.986667,0.008433
4,5,0.980417,0.009466
5,6,0.986667,0.008433
6,7,0.967917,0.010156
7,8,0.967917,0.010156
8,9,0.961667,0.009909
9,10,0.961667,0.009909


In [21]:
accuracy_vs_k = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x=alt.X("n_neighbors").title("Neighbors"),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Accuracy estimate")
)

accuracy_vs_k

The highest accuracy is achieved at k=4, which is consistent with the plot where the score is highest at 4. This point is reasonable because it's not too high which causes underfitting, nor it's too low which causes overfitting. The score is high. We use the chosen hyperparameter to predict on test set and see if the test score is low and it's overfitting.

In [22]:
test_df["predicted"] = player_tune_grid.predict(X_test)
player_tune_grid.score(X_test, y_test)

0.975

In [23]:
precision_score(test_df["play_time"], test_df["predicted"], pos_label="high")

np.float64(1.0)

In [24]:
recall_score(test_df["play_time"], test_df["predicted"], pos_label="high")

np.float64(0.8)

In [25]:
pd.crosstab(test_df["play_time"], test_df["predicted"])

predicted,high,low
play_time,Unnamed: 1_level_1,Unnamed: 2_level_1
high,4,1
low,0,35


We want to identify the high play_time in the task so we are setting high to be the positive class in calculating the scores.The score overall is high. The accuracy is high so the correctly identified test observations were a large portion of the total observations. Also, precision is high. Precision quantifies how many of the positive predictions the classifier made were actually positive and our precision is 1 so the false positives were 0. The recall is lower and we have 1 false negative. The scores do not give a totally accurate representation because the dataset size is small and we do not have a lot of test observations to test our model on.

## Discussion ##

### Correlation between Age and Play_hours ###

Using KNNregression, analysis indicated that there is no relationship between age and PLAICraft playtime, as shown by the prediction line being horizontal. The RMSPE value for the model was 6.95 hours, suggesting that any given prediction could vary by 6.95 hours. This closely matches the test set’s RMPSE of 6.96 hours, suggesting the model generalizes well to new data. However, the RMSPE value is not within an acceptable range. It is too large compared to the observed playtime values, which have an average of 1.98 hours and mode of 0.0 hours (in the filtered dataset). This is likely a result of high variability in the data. Although the model seems to perform consistently the predictions will not be useful due to its large error margins. 
Although the model is unreliable, the data was still surprising. It was expected the data would show a parabolic relationship between playtime and age, with playtime peaking among young adults (18-29 years old) and then decreasing with increasing age. This expectation comes from the literature, which indicates that 67% of young adults are gamers, compared to 40% of older adults (50-64 years old) and 25% of seniors (65+) (Bunz et al., 2020).

### Classifier 

We have built a classifier that when given the information `experience`, `subscribe`, `age`, and `gender`, can predict whether a player will contribute data to the research by playing the game for time equal to or more than two hours. The model has high scores and it identifies what "kinds" of players will contribute to research.

## References ##

* Bunz U.,  Cortese J., Sellers N. (2020). Examining younger and older adults' digital gaming habits and health measures. Gerontechnology, 19:4. https://doi.org/10.4017/gt.2020.19.04.381