In [4]:
#importing modules
import pandas as pd
import sklearn as skl
import altair as alt
import numpy as np
from sklearn.preprocessing import StandardScaler

alt.data_transformers.enable('vegafusion')
seed = 21123
np.random.seed(seed)

# (1) Introduction:

This project focuses on identifying the characteristics of players most likely to contribute significant amounts of data in the context of Minecraft gameplay. 

The key research question is: 

*What player characteristics, such as age and experience, correlate with higher levels of playtime?*

The goal is to understand how these factors relate to the total hours players spend in the game (played_hours).

Two datasets are used in the analysis: sessions.csv and players.csv. The sessions.csv file, which includes session-level data (e.g., start and end times), was excluded from the analysis, as it does not directly contribute to understanding the cumulative playtime. Instead, the focus is on the players.csv dataset, which contains demographic and engagement data for 196 players, including their experience level, age, and total hours played.

The analysis involves cleaning and transforming the dataset, performing exploratory data analysis (EDA), binning the played_hours into categories, and training a K-Nearest Neighbors (KNN) classifier to classify players into different playtime levels (low, medium, high). This approach aims to identify which player characteristics are most strongly associated with higher playtime.

**Data Description**

- "sessions.csv" includes:
    - hashedEmail: (object, identifier) - Encoded email address of the player.
    - start_time: (object, quantitative) - Date and time the player started the session.
    - end_time: (object, quantitative) - Date and time the player ended the session.
    - original_start_time: (float64, quantitative) - (UNIX time) Player started the session.
    - original_end_time: (float64, quantitative) - (UNIX time) Player ended the session.
    - 1535 observations.

- "players.csv" includes:
    - experience: (object, categorical) - Experience level of the player.
    - subscribe: (Boolean) - Whether the player has subscribed to the PlaiCraft mailing list.
    - hashedEmail: (object, identifier) - Encoded email address of the player.
    - played_hours: (float64, quantitative) - Total hours played by the player.
    - name: (object, categorical) - Chosen name of the player.
    - gender: (object, categorical) - Chosen gender of the player.
    - age: (int64, quantitative) - Age of the player.
    - individualId: (float64, quantitative) - Not used (contains only NaN values).
    - organizationName: (float64, quantitative) - Not used (contains only NaN values).
    - 196 observations.

The dataset was collected from PlaiCraft's Minecraft servers. Some potential issues with the data include the hashedEmail column being an identifier that is not useful for analysis, the Unix time columns duplicating the information in start_time and end_time, and the individualId and organizationName columns containing only NaN values. For this analysis, the relevant variables are age, experience, and played_hours, while other columns were removed during preprocessing.

# (2) Methods:

The objective is to determine which types of players are most likely to contribute a large amount of data. In this project, it is assumed that a player with a greater playtime contributes more data. Therefore, the response variable will be played_hours, and the predictors will be age and experience.

This project only requires data from the players dataset. The sessions.csv dataset is useful for evaluating whether a player conducts their playtime in a few sessions or if they play consistently. It is assumed that all playtime is equal, and that it is necessary to maximize played_hours. With these assumptions in place, the sessions.csv dataset is not required. It would, however, require more insight into the knowledge domain to make an educated choice.

Firstly, the players.csv dataset will be imported. Then, initial cleaning will be performed on the dataset to remove irrelevant and/or empty columns, which includes all columns besides played_hours, age and experience. 

The played_hours column will then be categorized into three bins - low, medium, and high playtime - based on the median and 80th percentile values. To ensure balanced representation across categories, upsampling will be applied to the categories with less data. The experience predictor will then be encoded as integers to convert it into a numerical format for the KNN classifier.

The dataset will then be split into training and testing sets, with age and encoded experience as predictors and playtime category as the target variable. A K-Nearest Neighbors (KNN) classifier will be trained using a pipeline that includes standard scaling to normalize the features. A grid search with cross-validation will then be used to optimize the number of nearest neighbors for the KNN model. The optimized model will then be used to make predictions, and visualizations will be created to compare the predicted results with the actual data, providing insights into the model’s performance and classification accuracy.

It should be noted that it is not necessary to predict the exact playtime in this project. It is more important to find out which kinds of players have a generally higher number of played_hours.


# (3) Cleaning, Wrangling and Preliminary Visualisations

The dataset was loaded from a CSV file, and columns with all missing values (NaN) or irrelevant information (e.g., hashedEmail, name) were removed. The purpose was to retain only variables necessary for analysis.

In [19]:
# save the URL
players_url = "https://raw.githubusercontent.com/Lionung/dsci_100_group_project/refs/heads/main/players.csv"

# importing the files as CSV
players = pd.read_csv(players_url)
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


The organizationName and individualId columns consist only of missing data.

In [20]:
#organizationName and individualID only have NAN
display(players["organizationName"].values)
display(players["individualId"].values)

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, na

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, na
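Printing every value works for a table this small, but the same check can be written more compactly. The sketch below uses a small hypothetical DataFrame (`toy`, with made-up values, not the real players data) and `isna().all()` to list the columns that contain only missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the players table (values are made up)
toy = pd.DataFrame({
    "age": [9, 17, 21],
    "played_hours": [30.3, 3.8, 0.7],
    "individualId": [np.nan, np.nan, np.nan],
    "organizationName": [np.nan, np.nan, np.nan],
})

# True for a column only if every entry in it is missing
all_missing = toy.isna().all()
empty_columns = all_missing[all_missing].index.tolist()
print(empty_columns)

# Drop only the fully empty columns, leaving partially filled ones intact
toy_clean = toy.dropna(axis=1, how="all")
```

Note that `dropna(axis=1)` with its default `how="any"` removes a column if any single value is missing, whereas `how="all"` removes only columns that are entirely empty.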

Columns with missing data are dropped, as well as all other columns besides played_hours, age and experience.

In [21]:
#minimum necessary cleaning

# removing NAN values, dropping columns with NAN vals
players.dropna(axis=1, inplace=True)

# removing other unnecessary variables to our investigation
players_clean = players.drop(columns=["hashedEmail", "name", "subscribe", "gender"])
players_clean.head()

Unnamed: 0,experience,played_hours,age
0,Pro,30.3,9
1,Veteran,3.8,17
2,Veteran,0.0,17
3,Amateur,0.7,21
4,Regular,0.1,21


A scatter plot was created using altair to visualize how playtime (played_hours) correlates with player age. The mark_circle() function was used to create data points, and encode() mapped played_hours to the y-axis and age to the x-axis.

In [22]:
# used scatter plot due to having 2 numerical quantities

hours_vs_age = alt.Chart(players_clean).mark_circle().encode(
    x=alt.X("age").title("Age of Player (years)"), 
    y=alt.Y("played_hours").title("Number of hours played (hours)"), 
).properties(title="Number of hours played vs Age of Player (years)")

hours_vs_age

Looking at the above scatter plot, it can be seen that the majority of players are aged between 10 and 30, and that they play very little. However, there are extreme values and outliers for both age and played_hours, including elderly players and players who have played more than 100 hours.

To examine how playtime is distributed across players, a histogram was constructed. Binning with maxbins=40 was used to group played_hours into bins that gave enough resolution, with the count of players displayed on the y-axis.

In [23]:
# using altair to create distribution of played_hours
# plot looks busy, but specified number of bins as 40 to get more resolution on the distribution

played_hours_distribution = alt.Chart(players_clean).mark_bar().encode(
    x=alt.X("played_hours").title("Number of hours played (hours)").bin(maxbins=40), 
    y=alt.Y("count()").title("Number of Players")
).properties(title="Distribution of Played Hours")

played_hours_distribution

It can be seen from the histogram above that the majority of players have a played_hours value below 10.

The distribution also has two extremes: most players play very little, while a few play a great deal.

Similarly, the distribution of player ages was examined using a bar chart.

In [24]:
# using altair to create distribution of player ages
# used maxbins=40 intentionally to get higher resolution

age_distribution = alt.Chart(players_clean).mark_bar().encode(
    x=alt.X("age").title("Age of player (years)").bin(maxbins=40), 
    y=alt.Y("count()").title("Number of players")
).properties(title="Distribution of Player Age")

age_distribution

It can be seen that the majority of the players have an age between 15 - 25.

The median and the 80th percentile of playtime were calculated.

In [25]:
#finding the bin values 

median_playtime = players_clean["played_hours"].median()

#can use the percentile function in numpy to find the values for the bins
top_20_playtime = np.percentile(players_clean["played_hours"], 80)

print("median playtime is", median_playtime, "hours" "\ntop 20% playtime is", top_20_playtime, "hours")

median playtime is 0.1 hours
top 20% playtime is 1.0 hours


The median playtime is 0.1 hours, and the 80th percentile (the cutoff for the top 20% of players) is 1 hour.
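To make the two cutoffs concrete, here is a minimal sketch on a small hypothetical array of playtimes (the values are made up and are not taken from players.csv). It shows how the median and the 80th percentile are computed, and how the share of players at or above the upper cutoff can be checked.

```python
import numpy as np

# Hypothetical playtimes in hours (made-up values, sorted for readability)
hours = np.array([0.0, 0.0, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 20.0, 100.0])

# Middle of the sorted values
median_hours = np.median(hours)

# Value below which 80% of the data falls (linear interpolation by default)
p80_hours = np.percentile(hours, 80)

# Share of players at or above the 80th-percentile cutoff
share_above = np.mean(hours >= p80_hours)
print(median_hours, p80_hours, share_above)
```

Because the distribution is heavily right-skewed, both cutoffs sit far below the largest values, which is the same pattern seen in the real data.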

With these values in place, a visualisation was created to show where the median and 80th percentile playtimes are located across all players.

In [26]:
# histogram of played_hours, with rules marking the median and 80th percentile
playtime_distribution = alt.Chart(players_clean).mark_bar().encode(
    x=alt.X("played_hours").title("Played hours").bin(maxbins=90),
    y=alt.Y("count()").title("Number of players")
).properties(title="Distribution of Played hours")

median = alt.Chart().mark_rule().encode(x=alt.datum(median_playtime))
top_20 = alt.Chart().mark_rule().encode(x=alt.datum(top_20_playtime))

hist_with_ranges = playtime_distribution + median + top_20
hist_with_ranges

# the vast majority of players barely reach an hour of playtime. There is a very small number of players that have
# a significant amount of playtime

# this imbalance is a significant issue for KNN, as the smaller categories will need to be upsampled

It can be seen that the vast majority of players play for less than an hour. Only a very small number of players have a significant amount of playtime.

To see this, the dataset can be isolated to only include players who have played for over 20 hours.

In [27]:
# for instance, there are only 10 players with played_hours >= 20
players_clean[players_clean["played_hours"] >= 20]

Unnamed: 0,experience,played_hours,age
0,Pro,30.3,9
17,Amateur,48.4,17
51,Regular,218.1,20
71,Amateur,53.9,17
74,Regular,223.1,17
90,Amateur,150.0,16
130,Amateur,56.1,23
144,Beginner,23.7,24
158,Regular,178.2,19
183,Amateur,32.0,22


Looking at the above table, it can be seen that there are only 10 players who have contributed over 20 hours of playtime.

The played_hours column was divided into three categories: low, medium, and high. The bin boundaries were determined using the median and 80th percentile values of played_hours.

Here, the pandas cut function was utilised, which is useful for segmenting and sorting data into determined bins.

In [28]:
#creating bin categories

#did research on pandas cut function, works exactly for categories like this
#used when have to segment and sort data into bins (pandas documentation)

players_clean["category"] = pd.cut(
    players_clean["played_hours"],
    bins=[0, median_playtime, top_20_playtime, float("inf")],
    labels=["low", "medium", "high"],
    right=False, #means that the bins don't include rightmost edge 
    include_lowest=True #include players with 0
) 

players_clean

Unnamed: 0,experience,played_hours,age,category
0,Pro,30.3,9,high
1,Veteran,3.8,17,high
2,Veteran,0.0,17,low
3,Amateur,0.7,21,medium
4,Regular,0.1,21,medium
...,...,...,...,...
191,Amateur,0.0,17,low
192,Veteran,0.3,22,medium
193,Amateur,0.0,17,low
194,Amateur,2.3,17,high
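The effect of `right=False` is easiest to see on values that sit exactly on a bin edge. The sketch below reuses the cutoffs reported above (0.1 and 1.0) on a few hypothetical playtimes, chosen only to illustrate the boundary behaviour.

```python
import pandas as pd

# Cutoffs taken from the values reported above (median 0.1, 80th percentile 1.0)
edges = [0, 0.1, 1.0, float("inf")]

# Hypothetical playtimes chosen to sit on and around the bin edges
sample_hours = pd.Series([0.0, 0.1, 0.5, 1.0, 3.0])

# With right=False each bin is closed on the left and open on the right:
# [0, 0.1), [0.1, 1.0), [1.0, inf)
binned = pd.cut(
    sample_hours,
    bins=edges,
    labels=["low", "medium", "high"],
    right=False,
)
print(list(binned))
```

A player with exactly 0.1 hours therefore lands in "medium" and a player with exactly 1.0 hours lands in "high", which matches rows 4 and 0 to 1 of the table above.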


The number of players within each of the three categories can be determined.

In [29]:
# how many exist per category

low = players_clean[players_clean["category"] == "low"]
medium = players_clean[players_clean["category"] == "medium"]
high = players_clean[players_clean["category"] == "high"]

print("There are \n", len(low), ": low\n",
     len(medium), ": medium\n",
     len(high), ": high")

#therefore we need to equalize the categories so that KNN is not biased towards the largest one
#going to make 85 each to make it easiest for ourselves

There are 
 85 : low
 69 : medium
 42 : high


It was determined that there are 85 players who fall under the low category, 69 within the medium category, and 42 within the high category.

The dataset was balanced by upsampling the smaller categories to match the size of the largest category. The sample() function was used to randomly replicate observations.

In [31]:
np.random.seed(seed)

low = players_clean[players_clean["category"] == "low"]
medium = players_clean[players_clean["category"] == "medium"]
high = players_clean[players_clean["category"] == "high"]

high_upsample = high.sample(n=low.shape[0], replace=True)
medium_upsample = medium.sample(n=low.shape[0], replace=True)

upsampled_playtime = pd.concat((low, medium_upsample, high_upsample))
upsampled_playtime["category"].value_counts()

category
low       85
medium    85
high      85
Name: count, dtype: int64
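One caveat with upsampling the whole dataset is that duplicated rows can later land in both the training and the testing set, which may make the test accuracy look better than it would be on unseen players. A common alternative, sketched below on hypothetical data, is to split first and upsample only the training portion. The DataFrame `toy`, its values, and the use of `sklearn.utils.resample` are assumptions made for illustration and are not part of the original analysis.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data (values are made up)
toy = pd.DataFrame({
    "age": list(range(10, 40)),
    "category": ["low"] * 15 + ["medium"] * 9 + ["high"] * 6,
})

# Split first so that no test row can also appear in the training set
train, test = train_test_split(
    toy, test_size=0.3, stratify=toy["category"], random_state=0
)

# Upsample each training class to the size of the largest training class
target = int(train["category"].value_counts().max())
train_balanced = pd.concat([
    resample(group, replace=True, n_samples=target, random_state=0)
    for _, group in train.groupby("category")
])

print(train_balanced["category"].value_counts())
```

The test set keeps its original class proportions in this variant, so accuracy on it reflects performance on the real, imbalanced population.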

The experience column was converted into numerical values using a mapping dictionary. This step ensures compatibility with the KNN classifier. 

In [32]:
#now need to convert the experience column to numerical values
#doing simple encoding, from 1-5

exp_map = {"Amateur":1,
          "Beginner":2,
          "Regular":3,
          "Pro":4,
          "Veteran":5}

upsampled_playtime["experience_numeric"] = upsampled_playtime["experience"].map(exp_map)
upsampled_playtime

Unnamed: 0,experience,played_hours,age,category,experience_numeric
2,Veteran,0.0,17,low,5
5,Amateur,0.0,17,low,1
6,Regular,0.0,19,low,3
7,Amateur,0.0,21,low,1
9,Veteran,0.0,22,low,5
...,...,...,...,...,...
1,Veteran,3.8,17,high,5
134,Beginner,1.1,20,high,2
114,Beginner,1.0,17,high,2
171,Beginner,1.8,32,high,2
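The integer mapping treats experience as an ordered scale with evenly spaced levels, which matters for KNN because distances are computed on these numbers. If that ordering is not assumed, one-hot encoding is a possible alternative. The sketch below shows it with `pd.get_dummies` on a few hypothetical rows; it is an illustration only and not the encoding used in this analysis.

```python
import pandas as pd

# Hypothetical rows using experience labels from the dataset (values are made up)
toy = pd.DataFrame({
    "experience": ["Amateur", "Pro", "Veteran", "Amateur"],
    "age": [17, 9, 22, 21],
})

# One indicator column per experience level; no ordering is implied
one_hot = pd.get_dummies(toy, columns=["experience"], dtype=int)
print(one_hot.columns.tolist())
print(one_hot)
```

With one-hot columns, any two different experience levels are equally far apart, whereas the 1 to 5 mapping places Amateur and Veteran four units apart before scaling.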


Now that the categories are equalized and converted to numerical data, analysis can officially begin.

# (4) Analysis:

To prepare for model training, the data was split into training and testing sets using the train_test_split() function from sklearn.model_selection. The features included were age and experience_numeric, while the target variable was the category of playtime. A stratified split was used to ensure that each playtime category was proportionally represented in both the training and testing sets.

In [33]:
#train test split

from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
np.random.seed(seed)


train_data, test_data = train_test_split(upsampled_playtime, test_size=0.3, stratify=upsampled_playtime["category"])

x_train = train_data[["age", "experience_numeric"]]
y_train = train_data["category"]

x_test = test_data[["age", "experience_numeric"]]
y_test = test_data["category"]

A preprocessing pipeline was constructed using make_column_transformer and StandardScaler to normalize the features. The classifier chosen was KNeighborsClassifier, which relies on the nearest-neighbor algorithm for classification. A pipeline was created using make_pipeline, combining preprocessing and classification steps.

In [34]:
np.random.seed(seed)
preprocessor = make_column_transformer((StandardScaler(),
                                      ["age", "experience_numeric"]))

pipeline = make_pipeline(preprocessor, KNeighborsClassifier())

param_grid = {"kneighborsclassifier__n_neighbors":range(1, 30, 1)}

tune_grid = GridSearchCV(estimator=pipeline, 
                        param_grid=param_grid,
                        cv=10, #used 10 as a good trade-off between accuracy and computation
                        return_train_score=True,
                        n_jobs=-1)

tune_grid.fit(x_train, y_train)

accuracies_grid = pd.DataFrame(tune_grid.cv_results_)
accuracies_grid.head()



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsclassifier__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,...,split2_train_score,split3_train_score,split4_train_score,split5_train_score,split6_train_score,split7_train_score,split8_train_score,split9_train_score,mean_train_score,std_train_score
0,0.003814,0.00068,0.00363,0.000616,1,{'kneighborsclassifier__n_neighbors': 1},0.444444,0.444444,0.444444,0.444444,...,0.625,0.6625,0.625,0.63125,0.61875,0.6125,0.68323,0.590062,0.623579,0.02915
1,0.003393,0.000269,0.00322,5.8e-05,2,{'kneighborsclassifier__n_neighbors': 2},0.444444,0.555556,0.333333,0.5,...,0.5625,0.59375,0.59375,0.59375,0.5875,0.60625,0.602484,0.57764,0.585512,0.015465
2,0.003381,0.000274,0.003204,5.8e-05,3,{'kneighborsclassifier__n_neighbors': 3},0.5,0.611111,0.5,0.388889,...,0.5625,0.6,0.5625,0.58125,0.6,0.58125,0.602484,0.571429,0.574266,0.022924
3,0.003317,6.5e-05,0.003171,2.6e-05,4,{'kneighborsclassifier__n_neighbors': 4},0.555556,0.444444,0.555556,0.444444,...,0.5375,0.55625,0.60625,0.56875,0.575,0.575,0.583851,0.540373,0.560547,0.027827
4,0.003327,8.2e-05,0.003176,2e-05,5,{'kneighborsclassifier__n_neighbors': 5},0.5,0.555556,0.444444,0.388889,...,0.53125,0.53125,0.575,0.525,0.55625,0.5375,0.559006,0.552795,0.540555,0.018375


In [35]:
cross_val_plot = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Number of Neighbors").scale(zero=False),
    y=alt.Y("mean_test_score").title("Mean Test Score").scale(zero=False))

cross_val_plot

To optimize the number of neighbors (n_neighbors) for the KNN classifier, a grid search was performed using GridSearchCV. Values from 1 to 29 were tested to identify the optimal parameter. Ten-fold cross-validation was used as a good trade-off between accuracy and computation. The grid search also calculated the mean test score for each value of n_neighbors to evaluate the model's performance.
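Besides reading the best value off the plot, the fitted search object exposes it directly. The following is a minimal, self-contained sketch on synthetic data (the arrays `X` and `y` are randomly generated for illustration and are unrelated to the players data); it shows the `best_params_`, `best_score_`, and `best_estimator_` attributes of GridSearchCV.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature, three-class data (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2))
y = np.repeat(["low", "medium", "high"], 30)

search = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 11)},
    cv=5,
)
search.fit(X, y)

# Best k and its mean cross-validated accuracy, read directly from the search
best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print(best_k, round(search.best_score_, 3))

# With the default refit=True, the best pipeline is already retrained on all of X
fitted_best = search.best_estimator_
```

Reading the value from `best_params_` avoids hard-coding a number taken from the plot, which helps if the seed or the data changes.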

Based on the grid search results, the optimal number of neighbors was determined to be 4. Precision and recall are not considered here, as it is not particularly necessary to minimize false negatives or false positives. No serious ethical consequences are attached to a misclassification, so the goal is simply the most accurate model. The model was re-trained using this parameter, and predictions were made on the testing set.

In [36]:
np.random.seed(seed)
optimized = make_pipeline(preprocessor, KNeighborsClassifier(n_neighbors=4))
optimized.fit(x_train, y_train)


predictions = pd.DataFrame(test_data).assign(predicted=optimized.predict(x_test))
predictions

Unnamed: 0,experience,played_hours,age,category,experience_numeric,predicted
142,Beginner,1.0,17,high,2,high
160,Beginner,0.0,24,low,2,high
76,Amateur,3.5,21,high,1,low
181,Amateur,0.8,22,medium,1,low
97,Veteran,0.1,18,medium,5,medium
...,...,...,...,...,...,...
67,Amateur,17.2,14,high,1,low
140,Regular,0.0,20,low,3,medium
155,Amateur,1.2,17,high,1,low
154,Amateur,0.0,19,low,1,low


The accuracy of the model was computed using accuracy_score from sklearn.metrics. This metric compares the predicted categories with the actual categories in the test set.

In [37]:
#this function is simpler as it just takes the real ones, and the predicted ones

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(predictions["category"], predictions["predicted"])
accuracy

0.38961038961038963
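A single accuracy figure does not show which categories are being confused with each other. The sketch below builds a confusion matrix with `sklearn.metrics.confusion_matrix` on a handful of hypothetical labels (made up for illustration, not the notebook's predictions) and shows how accuracy relates to its diagonal.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["low", "medium", "high"]

# Hypothetical true and predicted categories (made up, not the notebook's results)
actual = ["low", "low", "medium", "medium", "high", "high"]
predicted = ["low", "high", "medium", "low", "low", "high"]

# Rows are actual categories, columns are predicted categories
matrix = pd.DataFrame(
    confusion_matrix(actual, predicted, labels=labels),
    index=labels,
    columns=labels,
)
print(matrix)

# Accuracy is the diagonal total divided by the number of observations
toy_accuracy = accuracy_score(actual, predicted)
print(toy_accuracy)
```

For reference, with three equally sized categories a classifier that guesses at random would be correct about one third of the time, which is a useful baseline when reading the accuracy above.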

A grid of values for age and experience_numeric was created to visualize the model's predictions across the feature space. Predictions were made for all combinations of these grid values, and the results were visualized using altair. A scatter plot of the test data, coloured by predicted category, was overlaid on the background prediction map to provide further context. This map was used to answer the research question.

In [38]:
#colored prediction map visualization from textbook

#ranges
experience_range = np.linspace(predictions["experience_numeric"].min() - 1, 
                                predictions["experience_numeric"].max() + 1, 
                                100)

age_range = np.linspace(predictions["age"].min() - 10, 
                        predictions["age"].max() + 1, 
                        100)

#creating grid of values
exp_age_grid = np.array(np.meshgrid(experience_range, age_range)).T.reshape(-1, 2)
exp_age_grid = pd.DataFrame(exp_age_grid, 
                            columns=["experience_numeric", "age"])

#Predict categories for the grid points
exp_age_grid["predicted"] = optimized.predict(exp_age_grid)

#Plot the background 
background = alt.Chart(exp_age_grid).mark_point(opacity=0.1).encode(
    x=alt.X("experience_numeric").title("Experience"),
    y=alt.Y("age").title("Age"),
    color=alt.Color("predicted:N")
)

# Overlay the scatter plot for predictions
groups_scatter = alt.Chart(predictions).mark_circle(opacity=1).encode(
    x=alt.X("experience_numeric").title("Experience"),
    y=alt.Y("age").title("Age"),
    color=alt.Color("predicted:N").title("Predicted Category")
)

# Combine both plots
prediction_visualization = background + groups_scatter
prediction_visualization

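The model performs poorly: its accuracy on the test set is about 38.96% (assuming the random seed behaved as intended), which is only slightly better than the roughly 33% expected from guessing at random among three equally sized categories. Whether this is due to overfitting or underfitting still needs to be discussed.

There does not seem to be much of a useful relationship between the predictors (age and experience) and the playtime category.

However, the model can still be used to address the initial question. The visualisation above shows that the players predicted to have higher playtime are: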
- experience 1 (beginner) under 20
- experience 4 (pro) under 23
- experience 3 (regular) under 7
- experience 4 (pro) over 25

Groups not to target:
- Amateurs are likely to give the lowest playtime, so they should not be targeted (except for those between 22 and 29).
- Regular players between 22 and 30 should not be targeted.

So, for marketing in general, the recommendation is to target younger players (under 20) who have played Minecraft before, or older people ________

To be expanded: show how the resulting graph is used to make a direct analysis that answers the question.