In [1]:
#importing modules
import pandas as pd
import sklearn as skl
import altair as alt
import numpy as np
from sklearn.preprocessing import StandardScaler

alt.data_transformers.enable('vegafusion')
seed = 21123
np.random.seed(seed)

# (1) Data Description:

**Imported 2 datasets.**

**"sessions.csv"** includes
 - **hashedEmail**             - (object, identifier)
   - encoded email address of player
 - **start_time**              - (object, quantitative)
   - date and time the player started session
 - **end_time**                - (object, quantitative)
   - date and time the player ended session
 - **original_start_time**     - (float64, quantitative)
   - (UNIX time) player started session
 - **original_end_time**       - (float64, quantitative)
   - (Unix time) player ended session

**1535 observations**

**"players.csv"** includes
 - **experience**              (object, categorical)
   - experience level
 - **subscribe**               (Boolean)
   - whether the user has subscribed for the PlaitCraft mailing list
 - **hashedEmail**             (object, identifier)
   - encoded email address of player
 - **played_hours**            (float64, quantitative)
   - number of hours played 
 - **name**                    (object, categorical)
   - chosen name of player
 - **gender**                  (object, categorical)
   - chosen gender of player
 - **age**                     (int64, quantitative)
   - age of player
 - **individualId**            (float64, quantitative)
   - NAN
 - **organizationName**        (float64, quantitative)
   - NAN

**196 observations**

This is raw data from PlaitCraft, and was collected from their Minecraft servers 

**Potential Issues**
- the hashedEmail is not useful, as it is an identifier
- Unix time format is not useful
- the individualId and organizationName columns are all NAN 
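
The all-NaN claim can be checked without printing a whole column. A minimal sketch on a made-up frame (the values below are illustrative, not rows from players.csv):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for players.csv; values are made up for illustration.
toy = pd.DataFrame({
    "age": [9, 17, 21],
    "individualId": [np.nan, np.nan, np.nan],
    "organizationName": [np.nan, np.nan, np.nan],
})

# isna().all() is True for a column only when every entry is missing.
all_nan = toy.isna().all()
empty_columns = all_nan[all_nan].index.tolist()
print(empty_columns)
```

The same two lines applied to the real frame would list every column that `dropna(axis=1)` is going to remove.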

# Question

We would like to know which type of players are most likely to contribute a large amount of data, to target them in recruitment efforts

Assume a player with a greater playtime gives more data. Therefore, the response variable will be the played_hours, and the predictors will be age and experience.

The question only requires data from the players dataset. The sessions dataset would be useful for knowing whether a player gets their playtime in a few sessions or plays consistently. We assume that all playtime is equally valuable and that the goal is to maximize played_hours, so we do not need the sessions dataset. Making a more educated choice would require more insight into the knowledge domain.

First, we will clean the dataset by removing all columns besides played_hours, age and experience.

We then wrangle the data in 2 steps. First, we bin the played_hours column into 3 categories: low, medium and high playtime. This allows us to apply a KNN classifier to the age data. Second, we encode the experience predictor as integers (1-5) for use in the KNN classifier.

This is done because it is not necessary to predict the exact playtime. It is more important to find out which kinds of players have a generally higher number of played_hours.
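
A minimal sketch of the binning step with `pd.cut`, on made-up hours and with cut points of 0.1 and 1.0 assumed purely for illustration. With `right=False` each bin includes its left edge and excludes its right edge, so a value sitting exactly on a cut point falls into the higher bin:

```python
import pandas as pd

hours = pd.Series([0.0, 0.05, 0.1, 0.7, 1.0, 30.3])  # made-up values

labels = pd.cut(
    hours,
    bins=[0, 0.1, 1.0, float("inf")],  # hypothetical cut points
    labels=["low", "medium", "high"],
    right=False,  # bins are [0, 0.1), [0.1, 1.0), [1.0, inf)
)
print(labels.tolist())
```

Here 0.1 lands in "medium" and 1.0 lands in "high", which is worth keeping in mind when many players share the exact cut-point value.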

In [6]:
# save the URL
players_url = "https://raw.githubusercontent.com/Lionung/dsci_100_group_project/refs/heads/main/players.csv"

# importing the files as CSV
players = pd.read_csv(players_url)

In [7]:
#organizationName and individualID only have NAN
players["organizationName"].values

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, na

In [8]:
#minimum necessary cleaning

# removing NAN values, dropping columns with NAN vals
players.dropna(axis=1, inplace=True)

# removing other unnecessary variables to our investigation
players_clean = players.drop(columns=["hashedEmail", "name", "subscribe", "gender"])
players_clean.head()

Unnamed: 0,experience,played_hours,age
0,Pro,30.3,9
1,Veteran,3.8,17
2,Veteran,0.0,17
3,Amateur,0.7,21
4,Regular,0.1,21


In [9]:
#proof for above

#finding rows, columns and dtypes in the players dataset
players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   experience    196 non-null    object 
 1   subscribe     196 non-null    bool   
 2   hashedEmail   196 non-null    object 
 3   played_hours  196 non-null    float64
 4   name          196 non-null    object 
 5   gender        196 non-null    object 
 6   age           196 non-null    int64  
dtypes: bool(1), float64(1), int64(1), object(4)
memory usage: 9.5+ KB


Relationship between played_hours and age

There are very few observations with a high number of played_hours.
Most players with high played_hours are aged 15-25.

In [10]:
# used scatter plot due to having 2 numerical quantities

hours_vs_age = alt.Chart(players_clean).mark_circle().encode(
    x=alt.X("age").title("Age of Player (years)"), 
    y=alt.Y("played_hours").title("Number of hours played (hours)"), 
).properties(title="Number of hours played vs Age of Player (years)")

hours_vs_age

Distribution of played_hours

The majority of observations have a played_hours value below 10.
There are 2 extremes: players either play a little or a lot.

In [11]:
# using altair to create distribution of played_hours
# plot looks busy, but specified number of bins as 40 to get more resolution on the distribution

played_hours_distribution = alt.Chart(players_clean).mark_bar().encode(
    x=alt.X("played_hours").title("Number of hours played (hours)").bin(maxbins=40), 
    y=alt.Y("count()").title("Number of Players")
).properties(title="Distribution of Played Hours")

played_hours_distribution

Distribution of age

The majority of players are aged 15-25.

In [12]:
# using altair to create distribution of player ages
# used maxbins=40 intentionally to get higher resolution

age_distribution = alt.Chart(players_clean).mark_bar().encode(
    x=alt.X("age").title("Age of player (years)").bin(maxbins=40), 
    y=alt.Y("count()").title("Number of players")
).properties(title="Distribution of Player Age")

age_distribution

In [13]:
#finding the bin values 

median_playtime = players_clean["played_hours"].median()

#can use the percentile function in numpy to find the values for the bins
top_20_playtime = np.percentile(players_clean["played_hours"], 80)

print("median playtime is", median_playtime, "hours" "\ntop 20% playtime is", top_20_playtime, "hours")

median playtime is 0.1 hours
top 20% playtime is 1.0 hours


In [14]:
# histogram of played_hours with the median and the 80th percentile marked as vertical rules
playtime_distribution = alt.Chart(players_clean).mark_bar().encode(
    x=alt.X("played_hours").title("Played hours").bin(maxbins=90), 
    y=alt.Y("count()").title("Number of players")
).properties(title="Distribution of Played hours")

median = alt.Chart().mark_rule().encode(x=alt.datum(median_playtime))
top_20 = alt.Chart().mark_rule().encode(x=alt.datum(top_20_playtime))

hist_with_ranges = playtime_distribution + median + top_20
hist_with_ranges

# the vast majority of players barely reach an hour of playtime; only a very small number of players have
# a significant amount of playtime

# this imbalance is a significant issue for KNN, as we will need to create a large amount of upsampled data

In [15]:
# for instance, there are only 10 players with played_hours >= 20
players_clean[players_clean["played_hours"] >= 20]

Unnamed: 0,experience,played_hours,age
0,Pro,30.3,9
17,Amateur,48.4,17
51,Regular,218.1,20
71,Amateur,53.9,17
74,Regular,223.1,17
90,Amateur,150.0,16
130,Amateur,56.1,23
144,Beginner,23.7,24
158,Regular,178.2,19
183,Amateur,32.0,22


In [16]:
#creating bin categories

#did research on pandas cut function, works exactly for categories like this
#used when have to segment and sort data into bins (pandas documentation)

players_clean["category"] = pd.cut(
    players_clean["played_hours"],
    bins=[0, median_playtime, top_20_playtime, float("inf")],
    labels=["low", "medium", "high"],
    right=False, #means that the bins don't include rightmost edge 
    include_lowest=True #include players with 0
) 

players_clean

Unnamed: 0,experience,played_hours,age,category
0,Pro,30.3,9,high
1,Veteran,3.8,17,high
2,Veteran,0.0,17,low
3,Amateur,0.7,21,medium
4,Regular,0.1,21,medium
...,...,...,...,...
191,Amateur,0.0,17,low
192,Veteran,0.3,22,medium
193,Amateur,0.0,17,low
194,Amateur,2.3,17,high


In [17]:
#cleaning data
#no further columns need dropping; this step only keeps a copy under the name used in later cells
players_data = players_clean.copy()
players_data

Unnamed: 0,experience,played_hours,age,category
0,Pro,30.3,9,high
1,Veteran,3.8,17,high
2,Veteran,0.0,17,low
3,Amateur,0.7,21,medium
4,Regular,0.1,21,medium
...,...,...,...,...
191,Amateur,0.0,17,low
192,Veteran,0.3,22,medium
193,Amateur,0.0,17,low
194,Amateur,2.3,17,high


In [18]:
# how many exist per category

low = players_data[players_data["category"] == "low"]
medium = players_data[players_data["category"] == "medium"]
high = players_data[players_data["category"] == "high"]

print("There are \n", len(low), ": low\n",
     len(medium), ": medium\n",
     len(high), ": high")

#therefore we need to equalize the categories for KNN to work properly
#going to upsample each category to 85 (the size of the largest) to keep it simple

There are 
 85 : low
 69 : medium
 42 : high


In [19]:
np.random.seed(seed)

low = players_data[players_data["category"] == "low"]
medium = players_data[players_data["category"] == "medium"]
high = players_data[players_data["category"] == "high"]

high_upsample = high.sample(n=low.shape[0], replace=True)
medium_upsample = medium.sample(n=low.shape[0], replace=True)

upsampled_playtime = pd.concat((low, medium_upsample, high_upsample))
upsampled_playtime["category"].value_counts()

category
low       85
medium    85
high      85
Name: count, dtype: int64
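
One caveat with upsampling before the train/test split: because rows are drawn with replacement, copies of the same player can land in both the training and the test set, which tends to make test accuracy look better than it would on unseen players. A minimal sketch, on made-up data, of the alternative order (split first, then upsample only the training rows). The toy frame and the helper name `upsample_to_largest` are hypothetical, not part of this notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced data standing in for players_data.
toy = pd.DataFrame({
    "age": list(range(10, 40)),
    "category": ["low"] * 15 + ["medium"] * 9 + ["high"] * 6,
})

# Split first so the test set only contains original, unduplicated rows.
train, test = train_test_split(
    toy, test_size=0.3, stratify=toy["category"], random_state=0
)

def upsample_to_largest(df, label_col, random_state=0):
    """Resample every class with replacement up to the largest class size."""
    target = df[label_col].value_counts().max()
    parts = [
        group.sample(n=target, replace=True, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)

train_balanced = upsample_to_largest(train, "category")

# No original row index appears in both sets.
overlap = set(train_balanced.index) & set(test.index)
print(train_balanced["category"].value_counts().to_dict(), len(overlap))
```

The test set stays imbalanced under this approach, so accuracy on it would need to be read alongside per-category results.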

In [20]:
#now need to convert the experience column to numerical values
#doing simple encoding, from 1-5

exp_map = {"Amateur":1,
          "Beginner":2,
          "Regular":3,
          "Pro":4,
          "Veteran":5}

upsampled_playtime["experience_numeric"] = upsampled_playtime["experience"].map(exp_map)
upsampled_playtime

Unnamed: 0,experience,played_hours,age,category,experience_numeric
2,Veteran,0.0,17,low,5
5,Amateur,0.0,17,low,1
6,Regular,0.0,19,low,3
7,Amateur,0.0,21,low,1
9,Veteran,0.0,22,low,5
...,...,...,...,...,...
1,Veteran,3.8,17,high,5
134,Beginner,1.1,20,high,2
114,Beginner,1.0,17,high,2
171,Beginner,1.8,32,high,2
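
The integer codes imply an ordering of the experience levels, so the order chosen in the mapping affects KNN distances. A minimal sketch of making that order explicit with an ordered categorical. The order used below (Beginner < Amateur < Regular < Pro < Veteran) is an assumption for illustration only; it is not confirmed by the dataset and differs from `exp_map` above:

```python
import pandas as pd

experience = pd.Series(["Pro", "Veteran", "Amateur", "Beginner", "Regular"])

# Assumed ordering, lowest to highest; adjust to whatever the data owners intend.
assumed_order = ["Beginner", "Amateur", "Regular", "Pro", "Veteran"]

ordered = pd.Categorical(experience, categories=assumed_order, ordered=True)

# .codes are 0-based positions in assumed_order; add 1 to get a 1-5 scale.
experience_numeric = pd.Series(ordered.codes + 1, index=experience.index)
print(experience_numeric.tolist())
```

If the levels have no natural order at all, one-hot encoding would avoid implying one, at the cost of more predictor columns.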


Now that the categories are equalized, we can start the analysis.

In [21]:
#train test split

from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
np.random.seed(seed)


train_data, test_data = train_test_split(upsampled_playtime, test_size=0.3, stratify=upsampled_playtime["category"])

x_train = train_data[["age", "experience_numeric"]]
y_train = train_data["category"]

x_test = test_data[["age", "experience_numeric"]]
y_test = test_data["category"]

In [22]:
np.random.seed(seed)
preprocessor = make_column_transformer((StandardScaler(),
                                      ["age", "experience_numeric"]))

pipeline = make_pipeline(preprocessor, KNeighborsClassifier())

param_grid = {"kneighborsclassifier__n_neighbors":range(1, 30, 1)}

tune_grid = GridSearchCV(estimator=pipeline, 
                        param_grid=param_grid,
                        cv=10, #used 10 as a good trade-off between accuracy and computation
                        return_train_score=True,
                        n_jobs=-1)

tune_grid.fit(x_train, y_train)

accuracies_grid = pd.DataFrame(tune_grid.cv_results_)
accuracies_grid.head()


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsclassifier__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,...,split2_train_score,split3_train_score,split4_train_score,split5_train_score,split6_train_score,split7_train_score,split8_train_score,split9_train_score,mean_train_score,std_train_score
0,0.003798,0.000581,0.003405,0.000316,1,{'kneighborsclassifier__n_neighbors': 1},0.444444,0.444444,0.444444,0.444444,...,0.625,0.6625,0.625,0.63125,0.61875,0.6125,0.68323,0.590062,0.623579,0.02915
1,0.003384,4.2e-05,0.003265,5.9e-05,2,{'kneighborsclassifier__n_neighbors': 2},0.444444,0.555556,0.333333,0.5,...,0.5625,0.59375,0.59375,0.59375,0.5875,0.60625,0.602484,0.57764,0.585512,0.015465
2,0.003352,1.7e-05,0.003293,6.5e-05,3,{'kneighborsclassifier__n_neighbors': 3},0.5,0.611111,0.5,0.388889,...,0.5625,0.6,0.5625,0.58125,0.6,0.58125,0.602484,0.571429,0.574266,0.022924
3,0.003497,0.000204,0.003484,0.000301,4,{'kneighborsclassifier__n_neighbors': 4},0.555556,0.444444,0.555556,0.444444,...,0.5375,0.55625,0.60625,0.56875,0.575,0.575,0.583851,0.540373,0.560547,0.027827
4,0.005088,0.005143,0.003328,0.000117,5,{'kneighborsclassifier__n_neighbors': 5},0.5,0.555556,0.444444,0.388889,...,0.53125,0.53125,0.575,0.525,0.55625,0.5375,0.559006,0.552795,0.540555,0.018375


In [23]:
cross_val_plot = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Number of Neighbors").scale(zero=False),
    y=alt.Y("mean_test_score").title("Mean Test Score").scale(zero=False))

cross_val_plot

From the grid search, the optimal number of neighbors is 4, based on accuracy.
We are not concerned with precision or recall, as it is not particularly necessary to minimize false negatives or false positives. No serious ethical issue is at stake, so we just need the most accurate model. However, it is important to note that the accuracy is poor even in the best case. We tried the analysis beforehand using just age, and the result was similar. If the word count allows, redo the initial age-only analysis to show its poor result, then present this one to show that we tried multiple methods (we will probably run out of words, though).
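
The best k can also be read from the fitted search object instead of the plot. A minimal sketch on synthetic data (generated with `make_classification`, not the player data), using the same pipeline step name as the grid above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data; stands in for x_train / y_train.
X, y = make_classification(
    n_samples=150, n_features=2, n_informative=2, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 30)},
    cv=10,
)
grid.fit(X, y)

# best_params_ holds the setting with the highest mean cross-validated score.
best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
best_cv_accuracy = grid.best_score_
print(best_k, round(best_cv_accuracy, 3))
```

Applied to `tune_grid`, the same two attributes would give the chosen k and its mean cross-validation accuracy directly.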

In [24]:
np.random.seed(seed)
optimized = make_pipeline(preprocessor, KNeighborsClassifier(n_neighbors=4))
optimized.fit(x_train, y_train)


predictions = pd.DataFrame(test_data).assign(predicted=optimized.predict(x_test))
predictions

Unnamed: 0,experience,played_hours,age,category,experience_numeric,predicted
142,Beginner,1.0,17,high,2,high
160,Beginner,0.0,24,low,2,high
76,Amateur,3.5,21,high,1,low
181,Amateur,0.8,22,medium,1,low
97,Veteran,0.1,18,medium,5,medium
...,...,...,...,...,...,...
67,Amateur,17.2,14,high,1,low
140,Regular,0.0,20,low,3,medium
155,Amateur,1.2,17,high,1,low
154,Amateur,0.0,19,low,1,low


In [25]:
#this function is simpler as it just takes the real ones, and the predicted ones

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(predictions["category"], predictions["predicted"])
accuracy

0.38961038961038963
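
For context, with three roughly balanced categories a classifier that guesses at random would be right about one time in three, so 0.39 is only slightly above chance. A confusion table shows which categories are being mixed up. A minimal sketch with made-up labels (not this notebook's predictions):

```python
import pandas as pd

# Made-up true and predicted labels, for illustration only.
toy = pd.DataFrame({
    "category":  ["low", "low", "medium", "medium", "high", "high"],
    "predicted": ["low", "high", "medium", "low", "high", "low"],
})

# Rows are the true category, columns the predicted one.
confusion = pd.crosstab(toy["category"], toy["predicted"])

# Accuracy is the share of rows where the prediction matches the truth.
toy_accuracy = (toy["category"] == toy["predicted"]).mean()
print(confusion)
print(toy_accuracy)
```

Running `pd.crosstab` on the `predictions` frame above would show the same breakdown for the real test set.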

In [26]:
#colored prediction map visualization from textbook

#ranges
experience_range = np.linspace(predictions["experience_numeric"].min() - 1, 
                                predictions["experience_numeric"].max() + 1, 
                                100)

age_range = np.linspace(predictions["age"].min() - 10, 
                        predictions["age"].max() + 1, 
                        100)

#creating grid of values
exp_age_grid = np.array(np.meshgrid(experience_range, age_range)).T.reshape(-1, 2)
exp_age_grid = pd.DataFrame(exp_age_grid, 
                            columns=["experience_numeric", "age"])

#Predict categories for the grid points
exp_age_grid["predicted"] = optimized.predict(exp_age_grid)

#Plot the background 
background = alt.Chart(exp_age_grid).mark_point(opacity=0.1).encode(
    x=alt.X("experience_numeric").title("Experience"),
    y=alt.Y("age").title("Age"),
    color=alt.Color("predicted:N")
)

# Overlay the scatter plot for predictions
groups_scatter = alt.Chart(predictions).mark_circle(opacity=1).encode(
    x=alt.X("experience_numeric").title("Experience"),
    y=alt.Y("age").title("Age"),
    color=alt.Color("predicted:N").title("Predicted Category")
)

# Combine both plots
prediction_visualization = background + groups_scatter
prediction_visualization

TODO: discuss why the model performs poorly (overfitting or underfitting?)

Accuracy should be 38.96% unless a random state went wrong

There doesn't seem to be much of a useful relationship

However, using this model to answer the initial question,
the visualization above shows that the players with higher predicted playtime are
- experience 1 (beginner) under 20
- experience 4 (pro) under 23
- experience 3 (regular) under 7
- experience 4 (pro) over 25

groups not to target
- amateurs are likely to give the lowest playtime, don't target them (except for between 22-29)
- don't target regular between 22-30

So in marketing in general, let's target younger players (under 20) who have played Minecraft before or older people ________

TODO: expand on this and show how we use the resulting graph directly to answer the question