![PUBG_Logo](assets/PUBG_logo.png)

## Objective
* Employ K-means clustering algorithm on dataset.
* Discuss pertinent results.

## Background Information
* PlayerUnknown's Battlegrounds (PUBG) is a video game that set the standard for subsequent games in the battle royale genre. The main goal is to SURVIVE at all costs.

## Process:
* Exploratory data analysis conducted with several Python packages (NumPy, Matplotlib, pandas, and Plotly).
* K-means clustering algorithm (scikit-learn)


## Table of Contents:
* Part I: Exploratory Data Analysis
    * EDA
* Part II: K-means clustering
    * 3D
        * Two Clusters
        * Four Clusters
    * 2D
        * Two Clusters
        * Four Clusters


In [None]:
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns

# PART I - Exploratory Data Analysis

### Data Preprocessing / Feature Engineering

Let us begin by reading in the CSV file containing the data, and examining the data contents such as the number of features and rows. It seems there are 152 column entries (features) and 87898 row entries (number of samples).

In [None]:
#--------- Pandas Dataframe
## Read in the CSV
orig = pd.read_csv('data/PUBG_Player_Statistics.csv')

Now, let us remove features that do not pertain to our goal of clustering solo player behavior, and combine a few others.

Remove:
* player_name
* tracker_id
* duo
* squad

Add:
* Total Distance

This can be achieved by removing all columns after the 52nd and then dropping the identifier columns. We also create a new feature that combines the walking and riding distances.

Finally, we reduce the variance in the data by removing players with fewer than the mean number of rounds played.

In [None]:
#---------Preprocessing
## Create a copy of the dataframe
df = orig.copy()
cols = np.arange(52, 152, 1)

# Drop entries if they have null values
df.dropna(inplace = True)

## Drop columns after the 52nd index
df.drop(df.columns[cols], axis = 1, inplace = True)

## Drop player_name and tracker id
df.drop(df.columns[[0, 1]], axis = 1, inplace = True)

## Drop Knockout and Revives
df.drop(df.columns[[49]], axis = 1, inplace = True)
df.drop(columns = ['solo_Revives'], inplace = True)

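## Remove the 'solo_' prefix from all column names
## (str.lstrip strips a set of characters, not a prefix, so slice instead)
df.rename(columns = lambda x: x[len('solo_'):] if x.startswith('solo_') else x, inplace = True)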

## Combine a few columns 
df['TotalDistance'] = df['WalkDistance'] + df['RideDistance']
df['AvgTotalDistance'] = df['AvgWalkDistance'] + df['AvgRideDistance']

# Remove players with fewer rounds played than the mean
df = df.drop(df[df['RoundsPlayed'] < df['RoundsPlayed'].mean()].index)

Split the data into three sets: train (70%), dev (24%), and test (6%).

In [None]:
# Create train, dev, and test sets using scikit-learn
train, test = train_test_split(df, test_size=0.3, random_state = 10)
dev, test = train_test_split(test, test_size = 0.2, random_state = 10)
data = train

print("The number of training samples is", len(train))
print("The number of development samples is", len(dev))
print("The number of testing samples is", len(test))

It is important to inspect the final output to make sure that our data preprocessing is complete. And it looks great!

In [None]:
with pd.option_context('display.max_columns', 52):
    print(data.describe(include = 'all'))



The only factors above which have a positive correlation to average survival time are Average Total Distance, Win Ratio, and Top 10 Ratio.

# PART II - Clustering

Procedure: 
* 3D
    * Two Clusters
    * Four Clusters
* 2D
    * Two Clusters
    * Four Clusters

### Clustering in 3D (Selected Few Features)

We selected the following features based on expert opinion and my own domain experience playing PUBG.

In [None]:
# Select four features
# (the order matches the display names assigned to the scaled arrays in the plotting cells)
features = ['WinRatio', 'KillDeathRatio', 'HeadshotKillRatio', 'Top10Ratio']
train_data = train.loc[:, features]
dev_data = dev.loc[:, features]
test_data = test.loc[:, features]

Feature scaling ensures that all features have similar orders of magnitude. This matters because K-means relies on distances between points, so a feature with a larger range would otherwise dominate the result. In our case, we standardize each feature to zero mean and unit variance.
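
As a minimal sketch of zero-mean, unit-variance scaling, the toy example here standardizes a small array with NumPy alone. The array `X_toy` is hypothetical and exists only for illustration; `StandardScaler` applies the same per-column formula, using the population standard deviation.

```python
import numpy as np

# Toy data (hypothetical): two features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
# (np.std defaults to the population standard deviation, ddof=0)
X_toy_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_toy_std.mean(axis=0))  # approximately 0 for every column
print(X_toy_std.std(axis=0))   # approximately 1 for every column
```

After scaling, both columns contain the same values even though the raw ranges differed by two orders of magnitude, so neither feature dominates a distance calculation.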

In [None]:
# Scale the data (standardize to zero mean and unit variance)
scaler = StandardScaler()
X_train_std = scaler.fit_transform(train_data)
X_dev_std = scaler.transform(dev_data)
X_test_std = scaler.transform(test_data)

#### K-means clustering


K-means clustering is an algorithm that groups objects, based on their attributes/features, into K groups [1]. The grouping is done by minimizing the sum of squared distances between each data point and its corresponding cluster centroid.

##### Algorithm
1. Select K points randomly from the dataset as the centroids of the clusters.
2. Assign each data point to its closest centroid.
3. Recompute each centroid as the mean of all the data points allocated to that cluster.
4. Repeat steps 2 and 3 until the algorithm converges.
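
The four steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the scikit-learn implementation used in this notebook; the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal k-means (Lloyd's algorithm) following the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    return labels, centroids

# Toy data (hypothetical): two well-separated blobs
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                   rng.normal(5, 0.5, size=(50, 2))])
toy_labels, toy_centroids = kmeans_sketch(X_toy, k=2)
print(toy_centroids)
```

On data like this, the two centroids should typically land near the blob centers, although the result of plain k-means depends on the random starting points.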


##### Parameters

In K-means clustering, we'll be examining two parameters:
* How clusters are initialized
* The number of clusters

###### Initialization

Standard k-means clustering is sensitive to how the cluster centroids are initialized: a poor initialization can lead the algorithm to converge to a poor clustering. We'll use k-means++ initialization to mitigate this issue; it chooses the initial cluster centroids as follows before running the standard k-means clustering algorithm.


1. The first cluster center is chosen uniformly at random from the data points that we want to cluster. This is similar to what we do in K-means, but instead of randomly picking all the centroids, we pick just one centroid here.
2. Compute the distance D(x) of each data point x from the nearest cluster center that has already been chosen.
3. Choose the new cluster center from the data points, with the probability of choosing x proportional to D(x)².
4. Repeat steps 2 and 3 until k cluster centers have been chosen.
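
The seeding steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `kmeans_pp_init` and the toy data are hypothetical, and scikit-learn's own implementation may differ in its details.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Sketch of k-means++ seeding: returns k initial centroids chosen from X."""
    rng = np.random.default_rng(seed)
    # Step 1: first centroid chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # Step 2: squared distance from each point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Step 3: sample the next centroid with probability proportional to D(x)^2
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    # Step 4: the loop repeats until k centroids have been chosen
    return np.array(centroids)

# Toy data (hypothetical): three well-separated blobs
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.5, size=(30, 2)),
                   rng.normal(5, 0.5, size=(30, 2)),
                   rng.normal(10, 0.5, size=(30, 2))])
init_centroids = kmeans_pp_init(X_toy, k=3)
print(init_centroids)
```

Because an already chosen point has D(x) = 0, it cannot be picked twice, and points far from every existing centroid are the most likely candidates for the next one.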


###### Optimal Number of Clusters
Now, we'll want to identify the optimal number of clusters [2]. We'll use inertia and silhouette analysis as our internal scoring metrics because we do not have access to correctly labeled data.


Inertia is the within-cluster sum of squares: the sum of squared distances from each point to its cluster centroid. Lower values indicate tighter clusters.

Silhouette analysis (SA) measures how similar each point is to the other points in its own cluster compared with the points in the nearest neighboring cluster. Scores range from -1 to +1.

For our problem, SA is the better guide because we care less about how similar the points within each cluster are than about how well the clusters are separated.
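
To make the two metrics concrete, here is a small sketch that computes both by hand on a toy dataset. The helper names `inertia` and `silhouette` and the four toy points are assumptions made for illustration; the notebook itself relies on scikit-learn's `inertia_` attribute and `metrics.silhouette_score`.

```python
import numpy as np

def inertia(X, labels, centroids):
    """Within-cluster sum of squared distances to the assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

def silhouette(X, labels):
    """Mean silhouette coefficient; assumes every cluster has >= 2 points."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        # a: mean distance to the other members of the point's own cluster
        a = dists[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy data (hypothetical): two tight, well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
toy_labels = np.array([0, 0, 1, 1])
toy_centroids = np.array([[0.0, 0.5], [10.0, 0.5]])

print(inertia(X_toy, toy_labels, toy_centroids))  # 1.0
print(silhouette(X_toy, toy_labels))              # roughly 0.90
```

The silhouette score is close to +1 here because the distance between clusters is much larger than the distance between points inside each cluster.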

------------------------------------------------------------------------------------------------------------------
We plot the inertia against the number of clusters and identify the optimal number as the elbow point of the graph, that is, the point after which the slope is no longer steep.

The elbow point is 4, but 5 or 6 are adequate choices too.


In [None]:
# The number of clusters from 1 to 9
ks = range(1, 10)

inertias = []
for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k, init='k-means++', random_state = 10)
    
    # Fit model to samples
    model.fit(X_train_std)
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    print('Inertia for %i Clusters: %0.4f' % (k, model.inertia_))

# Plot parameters
plt.figure(figsize = (20, 10))
plt.plot(ks, inertias, '-o', color = 'black')
plt.plot(4, inertias[3], '-o', color = 'red', markersize = 12)
plt.xlabel('Number of clusters', fontsize = 24)
plt.ylabel('Inertia', fontsize = 24)
plt.title('Optimal number of clusters (Inertia)', fontsize = 24)
plt.xticks(ks, fontsize = 18)
plt.yticks(fontsize = 18)



plt.show()

Silhouette analysis measures how well each point fits its own cluster relative to the nearest neighboring cluster. We prefer the number of clusters whose score is closest to +1. In our case, two clusters have the greatest value, followed by four clusters.

Being able to distinguish between a player and a hacker with two clusters would be the ideal solution. From my experience, however, I expect finer segmentation than cheater versus player, such as:
* Beginner
* Experienced
* Professional
* Hacker

Nevertheless, we will explore all possible outcomes.

In [None]:
# Number of clusters
ks = range(2, 10)
score = []

# Silhouette Method
for k in ks:
    kmeans = KMeans(n_clusters = k, init='k-means++', random_state = 10).fit(X_train_std)
    ss = metrics.silhouette_score(X_train_std, kmeans.labels_, sample_size = 10000, random_state = 10)
    score.append(ss)
    print('Silhouette Score for %i Clusters: %0.4f' % (k, ss))

# Plot Parameters
plt.figure(figsize = (20, 10))
plt.plot(ks, score, '-o', color = 'blue')
s = ['D', 'D', 'D' ]
col = ['red','green','orange' ]
x = np.array([2, 3, 4])
y = score[0:3]
plt.xticks(ks, fontsize = 18)
plt.yticks(fontsize = 18)

## Different Markers for first three points
for _s, c, _x, _y in zip(s, col, x, y):
    plt.scatter(_x, _y, marker=_s, c=c, s = 100)
plt.xlabel("Number of clusters", fontsize = 24)
plt.ylabel("Silhouette score", fontsize = 24)
plt.title('Optimal number of clusters (Silhouette)', fontsize = 24)

plt.text(1.90, score[0] + 0.005, str(round(score[0], 3)), size = 14, color = 'red', weight = 'semibold')

plt.text(2.97, score[1] + 0.005, str(round(score[1], 3)), size = 14, color = 'green', weight = 'semibold')

plt.text(3.90, score[2] + 0.005, str(round(score[2], 3)), size = 14, color = 'orange', weight = 'semibold')



plt.show()

#### Two Clusters

Begin with configuring all parameters for k-means clustering.

In [None]:
# K means clustering of Training Data
number_cluster = 2
kmeans = KMeans(n_clusters = number_cluster, init = 'k-means++', n_init = 10, random_state = 10).fit(X_train_std)
labels = kmeans.labels_

Next, load in our function to plot our 3D scatter plots.

In [None]:
def scatter3d_cluster(df, x, y, z, code, title):
    scatter  =  px.scatter_3d(df, x = x, y = y, z = z, color  =  code,  
                            color_discrete_sequence = px.colors.qualitative.Light24)
    
    scatter.update_layout(title  =  title, title_font  =  dict(size  =  30),
                          scene  =  dict(
                              xaxis  =  dict(
                                  backgroundcolor = "rgb(200, 200, 230)",
                                  gridcolor = "white",
                                  showbackground = True,
                                  zerolinecolor = "white",
                                  nticks = 10, ticks = 'outside',
                                  tick0 = 0, tickwidth  =  4,
                                  title_font  =  dict(size  =  16)),
                              yaxis  =  dict(
                                  backgroundcolor = "rgb(230, 200,230)",
                                  gridcolor = "white",
                                  showbackground = True,
                                  zerolinecolor = "white",
                                  nticks = 10, ticks = 'outside',
                                  tick0 = 0, tickwidth  =  4,
                                  title_font  =  dict(size  =  16)),
                              zaxis  =  dict(
                                  backgroundcolor = "rgb(230, 230,200)",
                                  gridcolor = "white",
                                  showbackground = True,
                                  zerolinecolor = "white",
                                  nticks = 10, ticks = 'outside',
                                  tick0 = 0, tickwidth  =  4,
                                  title_font  =  dict(size  =  16),
                              ),
                          ),
                          width  =  700
                         )
    return scatter.show()

Lastly, plot the data and label the clusters according to our assumptions about how hackers play.

Hackers tend to have high Kill-Death Ratios, Headshot-Kill Ratios, Top 10 Ratios, and Win Ratios.
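
Note that the integer labels returned by k-means are arbitrary, so a fixed mapping such as `{0: "Human", 1: "Hacker"}` holds only for one particular fit. A hedged sketch of one way to derive the mapping from the centroids instead is shown here; the helper `name_clusters_by_strength` and the toy centroid values are hypothetical, and the sketch assumes that higher values of every selected feature indicate a stronger (or more suspicious) player.

```python
import numpy as np

def name_clusters_by_strength(centroids, names):
    """Map cluster index -> name, ordered from weakest to strongest centroid.

    Assumption: every feature is "higher is stronger", so the mean of a
    centroid's standardized coordinates is used as a rough strength score.
    """
    order = np.argsort(centroids.mean(axis=1))   # weakest cluster first
    return {int(cluster): name for cluster, name in zip(order, names)}

# Toy centroids (hypothetical values, in standardized units)
toy_centroids = np.array([[ 2.1,  1.8,  2.5,  1.9],   # cluster 0
                          [-0.3, -0.2, -0.4, -0.3]])  # cluster 1
mapping = name_clusters_by_strength(toy_centroids, ["Human", "Hacker"])
print(mapping)   # {1: 'Human', 0: 'Hacker'}
```

With a fitted model, the same helper could be called as `name_clusters_by_strength(kmeans.cluster_centers_, ["Human", "Hacker"])`, and the list of names extends naturally to more than two clusters.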

In [None]:
## 3D Plot of Training Data
# Create and modify dataframe for the cluster column
df_X_train_std = pd.DataFrame(X_train_std)
df_X_train_std['Cluster'] = pd.Series(labels, index=df_X_train_std.index)

#Rename Cluster label names from k-means
cluster_label_names = {0: "Human", 1: "Hacker"}
df_X_train_std['Cluster_Labels'] = df_X_train_std['Cluster'].map(cluster_label_names)  

df_X_train_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio",
                          "Top 10 Ratio", 'Cluster', 'Cluster_Labels']

# Plots of Win Ratio, Kill Death Ratio, Headshot Kill Ratio, and Top 10 Ratio
scatter3d_cluster(df = df_X_train_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Headshot-Kill Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_train_std , x = 'Kill Death Ratio', 
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Top 10 Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_train_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                 title = 'Clustering of Headshot-Kill Ratio, Top 10 Ratio, and Win Ratio') 


From the plots, we observed roughly 17225 humans and 3546 hackers. Flagging about 17% of players as hackers seems far too high, so I do not trust this result; four clusters may provide a better one. Nevertheless, let's continue with predicting on the dev and test sets.

In [None]:
df_X_train_std.groupby('Cluster_Labels').count()

##### Predicting on the dev set

Begin with predicting on the dev set.

In [None]:
# Predict cluster labels for the dev data
predict_labels = kmeans.predict(X_dev_std)

Next, create the 3D scatter plots.

In [None]:
## 3D Plot of Dev Data
# Create and modify dataframe for the cluster column
df_X_dev_std = pd.DataFrame(X_dev_std)
df_X_dev_std['Cluster'] = pd.Series(predict_labels, index=df_X_dev_std.index)

#Rename Cluster label names from k-means
cluster_label_names = {0: "Human", 1: "Hacker"}
df_X_dev_std['Cluster_Labels'] = df_X_dev_std['Cluster'].map(cluster_label_names)  

df_X_dev_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio",
                          "Top 10 Ratio", 'Cluster', 'Cluster_Labels']

# Plots of Win Ratio, Kill Death Ratio, Headshot Kill Ratio, and Top 10 Ratio
scatter3d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Headshot-Kill Ratio, and Win Ratio' )

scatter3d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio', 
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Top 10 Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_dev_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                 title = 'Clustering of Headshot-Kill Ratio, Top 10 Ratio, and Win Ratio') 

From the plots, we observed roughly 5898 humans and 1223 hackers.

In [None]:
df_X_dev_std.groupby('Cluster_Labels').count()

##### Predicting on the test set

Begin with predicting on the test set.

In [None]:
# Predict cluster labels for the test data
predict_labels = kmeans.predict(X_test_std)

Next, create the 3D scatter plots.

In [None]:
## 3D Plot of testing Data
# Create and modify dataframe for the cluster column
df_X_test_std = pd.DataFrame(X_test_std)
df_X_test_std['Cluster'] = pd.Series(predict_labels, index=df_X_test_std.index)

#Rename Cluster label names from k-means
cluster_label_names = {0: "Human", 1: "Hacker"}
df_X_test_std['Cluster_Labels'] = df_X_test_std['Cluster'].map(cluster_label_names)  

df_X_test_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio",
                          "Top 10 Ratio", 'Cluster', 'Cluster_Labels']

# Plots of Win Ratio, Kill Death Ratio, Headshot Kill Ratio, and Top 10 Ratio
scatter3d_cluster(df = df_X_test_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Headshot-Kill Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_test_std , x = 'Kill Death Ratio', 
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Top 10 Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_test_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                 title = 'Clustering of Headshot-Kill Ratio, Top 10 Ratio, and Win Ratio') 

From the plots, we observed roughly 1464 humans and 317 hackers.

In [None]:
df_X_test_std.groupby('Cluster_Labels').count()

#### Four Clusters

Begin with configuring all parameters for k-means clustering.

In [None]:
# K means clustering of Training Data
number_cluster = 4
kmeans = KMeans(n_clusters = number_cluster, init = 'k-means++', random_state = 10).fit(X_train_std)
labels = kmeans.labels_

Lastly, plot the data and label the clusters according to our assumptions about each player type.

In terms of Kill-Death Ratio, Headshot-Kill Ratio, Top 10 Ratio, and Win Ratio, I'd expect the following trend: Hackers > Professional > Experienced > Beginner.

In [None]:
## 3D Plot of Training Data
# Create and modify dataframe for the cluster column
df_X_train_std = pd.DataFrame(X_train_std)
df_X_train_std['Cluster'] = pd.Series(labels, index=df_X_train_std.index)
#Rename Cluster label names from k-means
cluster_label_names = {0: "Beginner", 1: "Hacker", 2: "Experienced", 3: "Professional"}

df_X_train_std['Cluster_Labels'] = df_X_train_std['Cluster'].map(cluster_label_names)  

df_X_train_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio",
                          "Top 10 Ratio", 'Cluster', 'Cluster_Labels']

# Plots of Win Ratio, Kill Death Ratio, Headshot Kill Ratio, and Top 10 Ratio
scatter3d_cluster(df = df_X_train_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Headshot-Kill Ratio, and Win Ratio' )

scatter3d_cluster(df = df_X_train_std , x = 'Kill Death Ratio', 
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Top 10 Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_train_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                 title = 'Clustering of Headshot-Kill Ratio, Top 10 Ratio, and Win Ratio') 


From the plots, we observed roughly 9059 beginners, 6561 experienced players, 4332 professionals, and 819 hackers. The number of hackers now seems more reasonable (about 4% of players), as I'd expect hackers to make up a small share of the population, roughly 0 to 10%.
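
As a quick check of that expectation, the cluster sizes reported above can be converted into shares of the training set. The snippet is a small worked example using only those four counts; the variable names are chosen for illustration.

```python
# Cluster sizes reported above for the training set
train_counts = {"Beginner": 9059, "Experienced": 6561,
                "Professional": 4332, "Hacker": 819}

total = sum(train_counts.values())
shares = {name: count / total for name, count in train_counts.items()}

# Print each cluster's share of the training set as a percentage
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With the fitted dataframe, the same shares could be obtained from `df_X_train_std['Cluster_Labels'].value_counts(normalize=True)`.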

In [None]:
df_X_train_std.groupby('Cluster_Labels').count()

##### Predicting on the dev set

Begin with predicting on the dev set.

In [None]:
# Predict cluster labels for the dev data
predict_labels = kmeans.predict(X_dev_std)

Next, create the 3D scatter plots.

In [None]:
## 3D Plot of Dev Data
# Create and modify dataframe for the cluster column
df_X_dev_std = pd.DataFrame(X_dev_std)
df_X_dev_std['Cluster'] = pd.Series(predict_labels, index=df_X_dev_std.index)

#Rename Cluster label names from k-means
cluster_label_names = {0: "Beginner", 1: "Hacker", 2: "Experienced", 3: "Professional"}
df_X_dev_std['Cluster_Labels'] = df_X_dev_std['Cluster'].map(cluster_label_names)  

df_X_dev_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio",
                          "Top 10 Ratio", 'Cluster', 'Cluster_Labels']

# Plots of Win Ratio, Kill Death Ratio, Headshot Kill Ratio, and Top 10 Ratio
scatter3d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Headshot-Kill Ratio, and Win Ratio' )

scatter3d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio', 
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Top 10 Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_dev_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                 title = 'Clustering of Headshot-Kill Ratio, Top 10 Ratio, and Win Ratio') 

From the plots, we observed roughly 3139 beginners, 2221 experienced, 1448 professionals, and 313 hackers.

In [None]:
df_X_dev_std.groupby('Cluster_Labels').count()

##### Predicting on the test set

Begin with predicting on the test set.

In [None]:
# Predict cluster labels for the test data
predict_labels = kmeans.predict(X_test_std)

Next, create the 3D scatter plots.

In [None]:
## 3D Plot of testing Data
# Create and modify dataframe for the cluster column
df_X_test_std = pd.DataFrame(X_test_std)
df_X_test_std['Cluster'] = pd.Series(predict_labels, index=df_X_test_std.index)

#Rename Cluster label names from k-means
cluster_label_names = {0: "Beginner", 1: "Hacker", 2: "Experienced", 3: "Professional"}
df_X_test_std['Cluster_Labels'] = df_X_test_std['Cluster'].map(cluster_label_names)  

df_X_test_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio",
                          "Top 10 Ratio", 'Cluster', 'Cluster_Labels']


# Plots of Win Ratio, Kill Death Ratio, Headshot Kill Ratio, and Top 10 Ratio
scatter3d_cluster(df = df_X_test_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Headshot-Kill Ratio, and Win Ratio' )

scatter3d_cluster(df = df_X_test_std , x = 'Kill Death Ratio', 
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Top 10 Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_test_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                 title = 'Clustering of Headshot-Kill Ratio, Top 10 Ratio, and Win Ratio') 

From the plots, we observed roughly 766 beginners, 564 experienced, 380 professionals, and 71 hackers.

In [None]:
df_X_test_std.groupby('Cluster_Labels').count()

### Clustering in 2D (Selected Few Features)

Let's examine all of these clusters in two dimensions to see any patterns.

#### Two Clusters

Begin with configuring all parameters for k-means clustering.

In [None]:
# K means clustering of Training Data
number_cluster = 2
kmeans = KMeans(n_clusters = number_cluster, init = 'k-means++', random_state = 10).fit(X_train_std)
labels = kmeans.labels_

Next, we'll use the function below to create our 2D scatter plots.

In [None]:
def scatter2d_cluster(df, x, y,  code, title):
    scatter = px.scatter(df, x = x, y = y, color = code,
                         color_discrete_sequence = px.colors.qualitative.Light24)
    
    scatter.update_xaxes(showline = True, linewidth = 1, linecolor = 'black', 
                          mirror = True, gridcolor = 'LightPink', automargin = True, 
                          zeroline = True, zerolinewidth = 2, zerolinecolor = 'LightPink', 
                          ticks = "outside", tickwidth = 2, tickcolor = 'black', ticklen = 10,
                          title_font = dict(size = 18))
    scatter.update_yaxes(showline = True, linewidth = 2, linecolor = 'black', 
                          mirror = True, gridcolor = 'LightPink',
                          zeroline = True, zerolinewidth = 1, zerolinecolor = 'LightPink', 
                          ticks = "outside", tickwidth = 2, tickcolor = 'black', ticklen = 10,
                          title_font = dict(size = 18))
    
    
    scatter.update_layout(title = title, title_font = dict(size = 24), 
                          legend = dict(
                              x = 1,
                              y = 1,
                              traceorder = "normal",
                              font = dict(
                                  family = "sans-serif",
                                  size = 14,
                                  color = "black"
                              ),
                              bgcolor = "#e5ecf6",
                              bordercolor = "Black",
                              borderwidth = 2
                          )
                         )
    return scatter.show()


Finally, we'll populate our scatter plots.

In [None]:
## 2D Plot of Training Data
# Create and modify dataframe for the cluster column
df_X_train_std = pd.DataFrame(X_train_std)
df_X_train_std['cluster'] = pd.Series(labels, index=df_X_train_std.index)
df_X_train_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio", "Top 10 Ratio", 'cluster']

#Rename Cluster label names from k-means
cluster_label_names = {0: "Human", 1: "Hacker"}
df_X_train_std['Cluster_Labels'] = df_X_train_std['cluster'].map(cluster_label_names)  


# Plots of Win Ratio, KDR, Headshot Kill Ratio, and Top 10 Ratio
scatter2d_cluster(df = df_X_train_std , x = 'Kill Death Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Win Ratio')

scatter2d_cluster(df = df_X_train_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Headshot Kill Ratio')

scatter2d_cluster(df = df_X_train_std , x = 'Headshot Kill Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Win Ratio')

scatter2d_cluster(df = df_X_train_std , x = 'Kill Death Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Top 10 Ratio')

scatter2d_cluster(df = df_X_train_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Top 10 Ratio')


From the plots, we observed roughly 17225 humans and 3546 hackers.

In [None]:
df_X_train_std.groupby('Cluster_Labels').count()

##### Predicting on the dev set

Begin with predicting on the dev set.

In [None]:
# Predict cluster labels for the dev set
predict_labels = kmeans.predict(X_dev_std)

Next, populate our 2D scatter plots.

In [None]:
## 2D Plot of Dev Data
# Create and modify dataframe for the cluster column
df_X_dev_std = pd.DataFrame(X_dev_std)
df_X_dev_std['cluster'] = pd.Series(predict_labels, index=df_X_dev_std.index)
df_X_dev_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio", "Top 10 Ratio", 'cluster']

#Rename Cluster label names from k-means
cluster_label_names = {0: "Human", 1: "Hacker"}
df_X_dev_std['Cluster_Labels'] = df_X_dev_std['cluster'].map(cluster_label_names)  


# Plots of Win Ratio, KDR, Headshot Kill Ratio, and Top 10 Ratio
scatter2d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Win Ratio')

scatter2d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Headshot Kill Ratio')

scatter2d_cluster(df = df_X_dev_std , x = 'Headshot Kill Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Win Ratio')

scatter2d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Top 10 Ratio')

scatter2d_cluster(df = df_X_dev_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Top 10 Ratio')

From the plots, we observed roughly 5898 humans and 1223 hackers.

In [None]:
df_X_dev_std.groupby('Cluster_Labels').count()

##### Predicting on the test set

Begin with predicting on the test set.

In [None]:
# Predict cluster labels for the test set
predict_labels = kmeans.predict(X_test_std)

Next, populate our 2D scatter plots.

In [None]:
## 2D Plot of Testing Data
# Create and modify dataframe for the cluster column
df_X_test_std = pd.DataFrame(X_test_std)
df_X_test_std['cluster'] = pd.Series(predict_labels, index=df_X_test_std.index)
df_X_test_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio", "Top 10 Ratio", 'cluster']

#Rename Cluster label names from k-means
cluster_label_names = {0: "Human", 1: "Hacker"}
df_X_test_std['Cluster_Labels'] = df_X_test_std['cluster'].map(cluster_label_names)  


# Plots of Win Ratio, KDR, Headshot Kill Ratio, and Top 10 Ratio
scatter2d_cluster(df = df_X_test_std , x = 'Kill Death Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Win Ratio')

scatter2d_cluster(df = df_X_test_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Headshot Kill Ratio')

scatter2d_cluster(df = df_X_test_std , x = 'Headshot Kill Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Win Ratio')

scatter2d_cluster(df = df_X_test_std , x = 'Kill Death Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Top 10 Ratio')

scatter2d_cluster(df = df_X_test_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Top 10 Ratio')

From the plots, we observed roughly 1464 humans and 317 hackers.

In [None]:
df_X_test_std.groupby('Cluster_Labels').count()

#### Four Clusters

Begin with configuring all parameters for k-means clustering.

In [None]:
# K means clustering of Training Data
number_cluster = 4
kmeans = KMeans(n_clusters = number_cluster, init = 'k-means++', random_state = 10).fit(X_train_std)
labels = kmeans.labels_

Next, we'll populate our scatter plots.

In [None]:
## 2D Plot of Training Data
# Create and modify dataframe for the cluster column
df_X_train_std = pd.DataFrame(X_train_std)
df_X_train_std['cluster'] = pd.Series(labels, index=df_X_train_std.index)
df_X_train_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio", "Top 10 Ratio", 'cluster']

#Rename Cluster label names from k-means
cluster_label_names = {0: "Beginner", 1: "Hacker", 2: "Experienced", 3: "Professional"}
df_X_train_std['Cluster_Labels'] = df_X_train_std['cluster'].map(cluster_label_names)  


# Plots of Win Ratio, KDR, Headshot Kill Ratio, and Top 10 Ratio
scatter2d_cluster(df = df_X_train_std , x = 'Kill Death Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Win Ratio')

scatter2d_cluster(df = df_X_train_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Headshot Kill Ratio')

scatter2d_cluster(df = df_X_train_std , x = 'Headshot Kill Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Win Ratio')

scatter2d_cluster(df = df_X_train_std , x = 'Kill Death Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Top 10 Ratio')

scatter2d_cluster(df = df_X_train_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Top 10 Ratio')


From the plots and the counts below, we observe 9059 beginners, 6561 experienced players, 4332 professionals, and 819 hackers in the training set.

In [None]:
df_X_train_std.groupby('Cluster_Labels').count()
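
One way to ground names such as "Beginner" or "Hacker" is to look at the cluster centroids in the original feature units. The sketch below is a minimal illustration on synthetic data: `X_raw`, `demo_scaler`, and `demo_kmeans` are hypothetical names, and the two made-up player groups are an assumption, not the PUBG statistics.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

feature_names = ['Win Ratio', 'Kill Death Ratio', 'Headshot Kill Ratio', 'Top 10 Ratio']

# Synthetic stand-in for the unscaled features (assumption): 200 typical and 20 extreme players.
rng = np.random.default_rng(10)
X_raw = np.vstack([
    rng.normal(loc=[2, 1, 0.1, 10], scale=[0.5, 0.2, 0.02, 2], size=(200, 4)),
    rng.normal(loc=[40, 8, 0.6, 70], scale=[5, 1, 0.05, 5], size=(20, 4)),
])

demo_scaler = StandardScaler().fit(X_raw)
X_std = demo_scaler.transform(X_raw)
demo_kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=10).fit(X_std)

# Centroids live in standardized space; map them back to the original units.
centroids = pd.DataFrame(
    demo_scaler.inverse_transform(demo_kmeans.cluster_centers_),
    columns=feature_names,
)
# Number of points assigned to each cluster id.
centroids['n_players'] = np.bincount(demo_kmeans.labels_, minlength=2)
print(centroids.round(2))
```

Reading each row as a "typical player" of that cluster makes it easier to decide which cluster id deserves which name.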

##### Predicting on the dev set

Begin by predicting cluster labels for the dev set, using the model fitted on the training data.

In [None]:
# Predict cluster labels for the dev set with the fitted k-means model
predict_labels = kmeans.predict(X_dev_std)
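
For k-means, `predict` does not refit anything: it assigns each new point to the nearest centroid learned during `fit`. The sketch below checks this by hand on random data, where `X_fit` and `X_new` are hypothetical stand-ins for `X_train_std` and `X_dev_std`.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(10)
X_fit = rng.normal(size=(300, 4))   # stand-in for X_train_std (assumption)
X_new = rng.normal(size=(50, 4))    # stand-in for X_dev_std (assumption)

demo_kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=10).fit(X_fit)

# Squared Euclidean distance from every new point to every centroid, shape (50, 4).
diffs = X_new[:, None, :] - demo_kmeans.cluster_centers_[None, :, :]
sq_dist = (diffs ** 2).sum(axis=2)

# The nearest centroid's index is the predicted cluster id.
manual_labels = sq_dist.argmin(axis=1)
same_as_predict = np.array_equal(manual_labels, demo_kmeans.predict(X_new))
print("manual assignment matches predict():", same_as_predict)
```

Because the centroids are fixed after fitting, the cluster id to name mapping used for the training data can be reused unchanged on the dev and test sets.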

Next, populate our 2D scatter plots.

In [None]:
## 2D Plots of Dev Data
# Create and modify dataframe for the cluster column
df_X_dev_std = pd.DataFrame(X_dev_std)
df_X_dev_std['cluster'] = pd.Series(predict_labels, index=df_X_dev_std.index)
df_X_dev_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio", "Top 10 Ratio", 'cluster']

#Rename Cluster label names from k-means
cluster_label_names = {0: "Beginner", 1: "Hacker", 2: "Experienced", 3: "Professional"}
df_X_dev_std['Cluster_Labels'] = df_X_dev_std['cluster'].map(cluster_label_names)  


# Plots of Win Ratio, KDR, Headshot Kill Ratio, and Top 10 Ratio
scatter2d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Win Ratio')

scatter2d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Headshot Kill Ratio')

scatter2d_cluster(df = df_X_dev_std , x = 'Headshot Kill Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Win Ratio')

scatter2d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Top 10 Ratio')

scatter2d_cluster(df = df_X_dev_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Top 10 Ratio')

From the plots and the counts below, we observe 3139 beginners, 2221 experienced players, 1448 professionals, and 313 hackers in the dev set.

In [None]:
df_X_dev_std.groupby('Cluster_Labels').count()

##### Predicting on the test set

Begin by predicting cluster labels for the test set, using the model fitted on the training data.

In [None]:
# Predict cluster labels for the test set with the fitted k-means model
predict_labels = kmeans.predict(X_test_std)

Next, populate our 2D scatter plots.

In [None]:
## 2D Plots of Testing Data
# Create and modify dataframe for the cluster column
df_X_test_std = pd.DataFrame(X_test_std)
df_X_test_std['cluster'] = pd.Series(predict_labels, index=df_X_test_std.index)
df_X_test_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio", "Top 10 Ratio", 'cluster']

#Rename Cluster label names from k-means
cluster_label_names = {0: "Beginner", 1: "Hacker", 2: "Experienced", 3: "Professional"}
df_X_test_std['Cluster_Labels'] = df_X_test_std['cluster'].map(cluster_label_names)  


# Plots of Win Ratio, KDR, Headshot Kill Ratio, and Top 10 Ratio
scatter2d_cluster(df = df_X_test_std , x = 'Kill Death Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Win Ratio')

scatter2d_cluster(df = df_X_test_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Headshot Kill Ratio')

scatter2d_cluster(df = df_X_test_std , x = 'Headshot Kill Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Win Ratio')

scatter2d_cluster(df = df_X_test_std , x = 'Kill Death Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Top 10 Ratio')

scatter2d_cluster(df = df_X_test_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Top 10 Ratio')

From the plots and the counts below, we observe 766 beginners, 564 experienced players, 380 professionals, and 71 hackers in the test set.


In [None]:
df_X_test_std.groupby('Cluster_Labels').count()
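
A quick consistency check is to compare the share of each cluster across the three splits. The sketch below simply redoes the arithmetic on the four-cluster counts reported above; the dictionary layout and variable names are illustrative.

```python
# Four-cluster counts reported above for the train, dev, and test sets.
counts = {
    'train': {'Beginner': 9059, 'Experienced': 6561, 'Professional': 4332, 'Hacker': 819},
    'dev':   {'Beginner': 3139, 'Experienced': 2221, 'Professional': 1448, 'Hacker': 313},
    'test':  {'Beginner': 766,  'Experienced': 564,  'Professional': 380,  'Hacker': 71},
}

shares = {}
for split, by_label in counts.items():
    total = sum(by_label.values())
    # Fraction of the split that falls into each cluster.
    shares[split] = {label: n / total for label, n in by_label.items()}
    summary = ', '.join(f"{label} {share:.1%}" for label, share in shares[split].items())
    print(f"{split} (n={total}): {summary}")
```

The shares differ by roughly one percentage point at most between splits, which suggests the fitted centroids partition unseen players in much the same way as the training data.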

### Remarks
* Without external labels, we cannot verify the accuracy of these clusters. We can, however, make an educated guess about what each cluster represents by drawing on domain experience and advice from expert players.
* It is also not well defined how these clusters are formed. Dimensionality reduction techniques such as PCA may help clarify the cluster structure.
* K-means does well at segmenting common patterns in the data, but what if hackers comprise only a small portion of the population? In that case, we should consider a different approach and reframe this as an anomaly detection problem.
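
The last two remarks can be sketched as follows. This is a rough illustration under stated assumptions, not part of the analysis above: `X_demo` is synthetic, Isolation Forest is just one possible anomaly detector (it is not used elsewhere in this notebook), and the `contamination` value of 0.04 is a guess at the share of anomalous players.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(10)
# Synthetic stand-in for standardized features (assumption): 480 typical and 20 extreme players.
typical = rng.normal(loc=0.0, scale=1.0, size=(480, 4))
extreme = rng.normal(loc=6.0, scale=0.5, size=(20, 4))
X_demo = np.vstack([typical, extreme])

# PCA: project the four features onto two components, e.g. for a single 2D plot.
pca = PCA(n_components=2).fit(X_demo)
X_2d = pca.transform(X_demo)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))

# Isolation Forest: flag rare points instead of forcing them into a cluster.
# contamination is the assumed share of anomalies in the data.
iso = IsolationForest(contamination=0.04, random_state=10).fit(X_demo)
flags = iso.predict(X_demo)  # -1 marks an anomaly, 1 marks a normal point
n_flagged = int((flags == -1).sum())
print("flagged:", n_flagged, "of", len(X_demo))
```

Unlike k-means, this approach does not require the rare group to be large enough to pull a centroid toward it, though the result depends heavily on the chosen `contamination`.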

### References

[1] "Survey Report on K-Means Clustering Algorithm", International Journal of Modern Trends in Engineering & Research, vol. 4, no. 4, pp. 218-221, 2017. Available: 10.21884/ijmter.2017.4143.lgjzd.

[2] https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/