![PUBG_Logo](assets/PUBG_logo.png)

## Objective
* Employ anomaly detection algorithms on a dataset of PUBG player statistics.
* Discuss pertinent results.

## Background Information
* PlayerUnknown's Battlegrounds (PUBG) is a video game that set the standard for subsequent games in the Battle Royale genre. The main goal is to SURVIVE at all costs.

## Process:
* Exploratory Data Analysis conducted using various Python packages (NumPy, Matplotlib, pandas, and Plotly).
* Anomaly Detection Algorithms (scikit-learn)
    * Local Outlier Factor
    * Elliptic Envelope
    * Isolation Forest


## Table of Contents:
* Part I: Exploratory Data Analysis
    * EDA
* Part II: Anomaly Detection Algorithm
    * Local Outlier Factor
    * Elliptic Envelope
    * Isolation Forest
* Part III: Isolation Forest in 2D

In [3]:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns

# PART I - Exploratory Data Analysis

### Data Preprocessing / Feature Engineering

Let us begin by reading in the CSV file containing the data, and examining the data contents such as the number of features and rows. It seems there are 152 column entries (features) and 87898 row entries (number of samples).

In [4]:
#--------- Pandas Dataframe
## Read in CSV
orig = pd.read_csv('data/PUBG_Player_Statistics.csv')

Now, let us remove the features that do not pertain to our goal of clustering solo player behavior, and combine a few of the remaining ones.

Remove:
* player_name
* tracker_id
* duo
* squad

Add:
* Total Distance

This can be achieved by removing all columns after the 52nd. Also, create a new feature that combines the walking and riding distance.

Also, we will reduce the variance in the data by removing players with fewer rounds played than the mean number of rounds in our data.

In [5]:
#---------Preprocessing
## Create a copy of the dataframe
df = orig.copy()
cols = np.arange(52, 152, 1)

# Drop entries if they have null values
df.dropna(inplace = True)

## Drop columns after the 52nd index
df.drop(df.columns[cols], axis = 1, inplace = True)

## Drop player_name and tracker id
df.drop(df.columns[[0, 1]], axis = 1, inplace = True)

## Drop Knockout and Revives
df.drop(df.columns[[49]], axis = 1, inplace = True)
df.drop(columns = ['solo_Revives'], inplace = True)

## Remove the 'solo_' prefix from all column names
## (str.lstrip would strip a set of characters, not the prefix)
df.rename(columns = lambda x: x[len('solo_'):] if x.startswith('solo_') else x, inplace = True)

## Combine a few columns 
df['TotalDistance'] = df['WalkDistance'] + df['RideDistance']
df['AvgTotalDistance'] = df['AvgWalkDistance'] + df['AvgRideDistance']

# Remove players with fewer rounds played than the mean
df = df.drop(df[df['RoundsPlayed'] < df['RoundsPlayed'].mean()].index)
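
When stripping the `solo_` prefix from the column names, keep in mind that `str.lstrip` treats its argument as a set of characters, not as a prefix. The short sketch below contrasts the two behaviors. The helper `strip_prefix` and the column name `solo_longestKill` are hypothetical and only serve as an illustration; they are not taken from the dataset.

```python
def strip_prefix(name, prefix="solo_"):
    """Remove `prefix` from the start of `name` if it is present."""
    return name[len(prefix):] if name.startswith(prefix) else name

# The third name is hypothetical: it starts with lowercase letters after the prefix.
columns = ["solo_Losses", "solo_KillDeathRatio", "solo_longestKill"]

for name in columns:
    # lstrip keeps removing any of the characters s, o, l, _ from the left
    print(name, "->", name.lstrip("solo_"), "|", strip_prefix(name))
```

Names that begin with a capital letter after the prefix come out the same either way, which is why the difference is easy to miss.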

Split the data into three sets: train (70%), dev (24%), and test (6%). The 30% held out from training is split 80/20 into the dev and test sets.

In [6]:
# Create train, dev, and test sets using scikit-learn
train, test = train_test_split(df, test_size=0.3, random_state = 10)
dev, test = train_test_split(test, test_size = 0.2, random_state = 10)
data = train

print("The number of training samples is", len(train))
print("The number of development samples is", len(dev))
print("The number of testing samples is", len(test))

The number of training samples is 20771
The number of development samples is 7121
The number of testing samples is 1781
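
The printed sizes can be reproduced with a little arithmetic. The sketch below assumes that `train_test_split` rounds the test share up and gives the remainder to the other set; the helper `split_sizes` is hypothetical and only mirrors that assumption.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_rest, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# Total number of rows after preprocessing, taken from the printed sizes
n_total = 20771 + 7121 + 1781

n_train, n_holdout = split_sizes(n_total, 0.3)   # first split: 70% / 30%
n_dev, n_test = split_sizes(n_holdout, 0.2)      # second split: 80% / 20% of the holdout
sizes = (n_train, n_dev, n_test)
print(sizes)
```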


It is important to go through the final output to make sure that our data preprocessing is complete. And it looks great!

In [7]:
with pd.option_context('display.max_columns', 52):
    print(data.describe(include = 'all'))

       KillDeathRatio      WinRatio  TimeSurvived  RoundsPlayed          Wins  \
count    20771.000000  20771.000000  2.077100e+04  20771.000000  20771.000000   
mean         1.289158      2.204012  1.484172e+05    174.985894      3.554475   
std          0.602602      2.510500  9.339460e+04    113.147056      4.939222   
min          0.100000      0.000000  3.813548e+04     80.000000      0.000000   
25%          0.900000      0.680000  9.091498e+04    104.000000      1.000000   
50%          1.160000      1.460000  1.195404e+05    139.000000      2.000000   
75%          1.520000      2.910000  1.733681e+05    205.000000      4.000000   
max         17.410000     40.210000  1.219536e+06   1552.000000    102.000000   

       WinTop10Ratio        Top10s    Top10Ratio        Losses        Rating  \
count   20771.000000  20771.000000  20771.000000  20771.000000  20771.000000   
mean        0.138708     23.884743     14.369067    171.431419   2059.159131   
std         0.137145     19.21

The only factors above which have a positive correlation to average survival time are Average Total Distance, Win Ratio, and Top 10 Ratio.

# PART II - Anomaly Detection Algorithms

Procedure: 
* Local Outlier Factor
    * 3D
* Elliptic Envelope
    * 3D
* Isolation Forest
    * 3D


### Clustering in 3D (Selected Few Features)

We selected the following features based on expert opinion and my own domain experience playing PUBG.

In [8]:
# Select four features
train_data = train.loc[:,['WinRatio', 'KillDeathRatio', "HeadshotKillRatio", "Top10Ratio"]]
dev_data = dev.loc[:, ['WinRatio', 'KillDeathRatio', "HeadshotKillRatio", "Top10Ratio"]]
test_data = test.loc[:, ['WinRatio', 'KillDeathRatio', "HeadshotKillRatio", "Top10Ratio"]]

Feature scaling ensures that all features have similar orders of magnitude. This matters because several of the algorithms below rely on distances between points. Here we standardize each feature to zero mean and unit variance.

In [9]:
# Scale the data (standardize to zero mean and unit variance)
scaler = StandardScaler()
X_train_std = scaler.fit_transform(train_data)
X_dev_std = scaler.transform(dev_data)
X_test_std = scaler.transform(test_data)
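
To make the scaling concrete, here is a minimal sketch of what zero-mean, unit-variance scaling does to a single feature. It uses only the standard library and the population standard deviation; the toy values are the WinRatio quantiles printed by `describe()` above, and the helper name `standardize` is an assumption made for this example.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Scale a list of numbers to zero mean and unit (population) variance."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Toy feature: min, 25%, 50%, 75%, and max of WinRatio from the summary table
win_ratio = [0.0, 0.68, 1.46, 2.91, 40.21]
scaled = standardize(win_ratio)
print([round(v, 3) for v in scaled])
```

Note that the dev and test sets are transformed with the mean and standard deviation learned on the training set, which is why only `transform` is called on them.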

#### Local Outlier Factor

LOF compares each point with its local neighborhood rather than with the global data distribution. The density around an outlier differs from the density around its neighbors, so the method uses the density of a point relative to that of its neighbors as an indicator of how much of an outlier the point is.

##### Algorithm
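1. Arbitrarily select a point P.
2. Calculate the distances between P and every other point.
3. Find the K-th closest point; its distance to P is the K-distance of P.
4. Collect the K nearest neighbors of P, i.e. the points whose distance to P is no greater than the K-distance.
5. Find the Local Reachability Density of P, the inverse of the average reachability distance from P to its neighbors. The lower the density, the farther P is from its neighbors.
6. Find the local outlier factor of P, the average ratio of the neighbors' local reachability densities to that of P. Values near 1 indicate an inlier; values well above 1 indicate an outlier.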
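
The steps above can be sketched in plain Python. This is an illustration on a toy 2D dataset, not the scikit-learn implementation: it uses brute-force Euclidean distances and ignores ties at the K-distance. The function name `lof_scores` and the sample points are assumptions made for this example.

```python
import math

def lof_scores(points, k):
    """Local outlier factor of every point, using Euclidean distance."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # K nearest neighbors and K-distance of each point (the point itself is excluded)
    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbors' densities to the point's own density
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

cluster = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
points = cluster + [(6.0, 6.0)]       # the last point sits far from the cluster
scores = lof_scores(points, k=3)
print([round(s, 2) for s in scores])
```

Points inside the cluster score close to 1, while the isolated point scores several times higher.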


##### Parameters

In LOF, we'll be examining two parameters:
* contamination: the proportion of outliers in the dataset.
* n_neighbors (K): the number of neighbors used to estimate the local density of each point.

For contamination: We'll make an educated guess based on an article reporting that about 1,500,000 of 26,000,000 accounts were caught cheating [1]. That ratio is roughly 5.8%; the value used in this notebook is contamination = 0.0058 (0.58%), a tenth of that ratio. Neither figure accounts for cheaters who were never caught.

For n_neighbors: We follow a research paper on LOF to select the number of nearest neighbors k. If a point close to a group of N points should be treated as an outlier rather than as part of that group, k should be at least N; here we take N ≈ 2000 [2].
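
As a quick arithmetic check, the sketch below shows how a contamination value translates into a number of flagged players. The variable names are illustrative; the account figures come from the article cited above and the training-set size from the split printed earlier.

```python
caught = 1_500_000
accounts = 26_000_000
ratio = caught / accounts                 # share of accounts caught cheating
print(f"caught / accounts = {ratio:.4f} ({ratio:.2%})")

contamination = 0.0058                    # value passed to the detectors in this notebook
n_train = 20771                           # training rows reported above
expected_flags = contamination * n_train  # rough number of rows the detector will flag
print(f"expected flagged rows = {expected_flags:.1f}")
```

With contamination = 0.0058, roughly 120 of the 20771 training rows fall below the outlier threshold, which is consistent with the 121 players flagged later in this section.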

Begin with configuring all parameters for LOF.

In [13]:
LOF = LocalOutlierFactor(n_neighbors = 20, contamination = 0.0058)

In [14]:
labels = LOF.fit_predict(X_train_std)

The silhouette score measures how similar each point is to its own group compared with the other group. It ranges from -1 to +1, and values closer to +1 indicate better separation. The run shown below, with n_neighbors = 20 as configured above, prints a silhouette score of about 0.5318.

In [15]:
# Print Silhouette Score
ss = metrics.silhouette_score(X_train_std, labels)
print('The Silhouette Score for the training set is ' + str(ss) + ".")

The Silhouette Score for the training set is 0.5318327170584771.
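
For intuition, here is a minimal sketch of how a mean silhouette score is computed for inlier/outlier labels. It is a brute-force illustration on toy data, not the scikit-learn implementation; the function name `silhouette` and the points are assumptions, and a group with a single member is given a score of 0 by convention.

```python
import math
from statistics import fmean

def silhouette(points, labels):
    """Mean silhouette coefficient for labelled points (Euclidean distance)."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = groups[lab]
        if len(own) == 1:
            values.append(0.0)            # convention for single-member groups
            continue
        # a: mean distance to the other members of the same group
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other group
        b = min(fmean([math.dist(p, q) for q in other])
                for other_lab, other in groups.items() if other_lab != lab)
        values.append((b - a) / max(a, b))
    return fmean(values)

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
labels = [1, 1, 1, 1, -1]                 # 1 = inlier, -1 = outlier
score = silhouette(points, labels)
print(round(score, 3))
```

Because the inliers vastly outnumber the outliers, the mean is dominated by how compact the inlier group is relative to its distance from the flagged points.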


Next, load in our function to plot our 3D scatter plots.

In [None]:
def scatter3d_cluster(df, x, y, z, code, title):
    scatter = px.scatter_3d(df, x=x, y=y, z=z, color = code,  
                            color_discrete_sequence=px.colors.qualitative.Light24)
    
    scatter.update_layout(title = title, title_font = dict(size = 30),
                          scene = dict(
                              xaxis = dict(
                                  backgroundcolor="rgb(200, 200, 230)",
                                  gridcolor="white",
                                  showbackground=True,
                                  zerolinecolor="white",
                                  nticks=10, ticks='outside',
                                  tick0=0, tickwidth = 4,
                                  title_font = dict(size = 16)),
                              yaxis = dict(
                                  backgroundcolor="rgb(230, 200,230)",
                                  gridcolor="white",
                                  showbackground=True,
                                  zerolinecolor="white",
                                  nticks=10, ticks='outside',
                                  tick0=0, tickwidth = 4,
                                  title_font = dict(size = 16)),
                              zaxis = dict(
                                  backgroundcolor="rgb(230, 230,200)",
                                  gridcolor="white",
                                  showbackground=True,
                                  zerolinecolor="white",
                                  nticks=10, ticks='outside',
                                  tick0=0, tickwidth = 4,
                                  title_font = dict(size = 16),
                              ),
                          ),
                          width = 700
                         )
    return scatter.show()

Lastly, plot the data and label it according to our assumptions about how hackers appear in the statistics.

Hackers tend to have high Kill-Death Ratios, Headshot-Kill Ratios, Top 10 Ratios, and Win Ratios.

In [None]:
## 3D Plot of Training Data
# Create and modify dataframe for the cluster column
df_X_train_std = pd.DataFrame(X_train_std)
df_X_train_std['Cluster'] = pd.Series(labels, index = df_X_train_std.index)

# Map the predicted labels (1 = inlier, -1 = outlier) to readable names
cluster_label_names = {1: "Human", -1: "Hacker"}
df_X_train_std['Cluster_Labels'] = df_X_train_std['Cluster'].map(cluster_label_names)  

df_X_train_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio",
                          "Top 10 Ratio", 'Cluster', 'Cluster_Labels']

# 3D plots of Kill Death Ratio, Headshot Kill Ratio, Top 10 Ratio, and Win Ratio
scatter3d_cluster(df = df_X_train_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Headshot-Kill Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_train_std , x = 'Kill Death Ratio', 
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Top 10 Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_train_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                 title = 'Clustering of Headshot-Kill Ratio, Top 10 Ratio, and Win Ratio') 


In [None]:
df_X_train_std['Cluster'].value_counts()

From the plots, we observed roughly 20650 humans and 121 hackers. 

In [None]:
df_X_train_std.groupby('Cluster_Labels').count()

##### Predicting on the dev set

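Begin with predicting on the dev set. fit_predict only labels the data LOF was fitted on, so to score unseen data we refit LOF with novelty = True and call predict.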

In [None]:
LOF = LocalOutlierFactor(n_neighbors = 2000, contamination = 0.0058, novelty = True).fit(X_train_std)

In [None]:
predict_labels = LOF.predict(X_dev_std)

Next, create the 3D scatter plots.

In [None]:
## 3D Plot of Dev Data
# Create and modify dataframe for the cluster column
df_X_dev_std = pd.DataFrame(X_dev_std)
df_X_dev_std['Cluster'] = pd.Series(predict_labels, index=df_X_dev_std.index)

# Map the predicted labels (1 = inlier, -1 = outlier) to readable names
cluster_label_names = {1: "Human", -1: "Hacker"}
df_X_dev_std['Cluster_Labels'] = df_X_dev_std['Cluster'].map(cluster_label_names)  

df_X_dev_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio",
                          "Top 10 Ratio", 'Cluster', 'Cluster_Labels']

# 3D plots of Kill Death Ratio, Headshot Kill Ratio, Top 10 Ratio, and Win Ratio
scatter3d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Headshot-Kill Ratio, and Win Ratio' )

scatter3d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio', 
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Top 10 Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_dev_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                 title = 'Clustering of Headshot-Kill Ratio, Top 10 Ratio, and Win Ratio') 

From the plots, we observed roughly 7072 humans and 49 hackers.

In [None]:
df_X_dev_std.groupby('Cluster_Labels').count()

##### Predicting on the test set

Begin with predicting on the test set.

In [None]:
LOF = LocalOutlierFactor(n_neighbors = 2000, contamination = 0.0058, novelty = True).fit(X_train_std)

In [None]:
predict_labels = LOF.predict(X_test_std)

Next, create the 3D scatter plots.

In [None]:
## 3D Plot of testing Data
# Create and modify dataframe for the cluster column
df_X_test_std = pd.DataFrame(X_test_std)
df_X_test_std['Cluster'] = pd.Series(predict_labels, index = df_X_test_std.index)

# Map the predicted labels (1 = inlier, -1 = outlier) to readable names
cluster_label_names = {1: "Human", -1: "Hacker"}
df_X_test_std['Cluster_Labels'] = df_X_test_std['Cluster'].map(cluster_label_names)  

df_X_test_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio",
                          "Top 10 Ratio", 'Cluster', 'Cluster_Labels']

# 3D plots of Kill Death Ratio, Headshot Kill Ratio, Top 10 Ratio, and Win Ratio
scatter3d_cluster(df = df_X_test_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Headshot-Kill Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_test_std , x = 'Kill Death Ratio', 
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Top 10 Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_test_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                 title = 'Clustering of Headshot-Kill Ratio, Top 10 Ratio, and Win Ratio') 

From the plots, we observed roughly 1768 humans and 13 hackers.

In [None]:
df_X_test_std.groupby('Cluster_Labels').count()

#### Elliptic Envelope

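Elliptic Envelope assumes the inliers follow a multivariate Gaussian distribution. It fits a robust estimate of the mean and covariance to the data (the Minimum Covariance Determinant), which defines an ellipse, and flags points whose Mahalanobis distance from that fit is large.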

##### Algorithm [3]
1. Draw a random h-subset
2. Draw a random (p +1) subset J and then compute the covariance.
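
Once a center and covariance are available, each point is scored by its Mahalanobis distance from the center. The sketch below illustrates that distance in 2D with the plain sample mean and covariance, which is a simplification: it does not perform the robust subset search outlined above. The function name `mahalanobis_2d` and the toy points are assumptions made for this example.

```python
import math
from statistics import fmean

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2D point from the sample mean."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = fmean(xs), fmean(ys)
    n = len(points)

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)

    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, written out for a 2x2 matrix
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances

# Five points along a diagonal trend and one point that breaks the trend
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1), (3.0, 0.0)]
distances = mahalanobis_2d(points)
print([round(d, 2) for d in distances])
```

The last point is no farther from the center in Euclidean terms than the ends of the diagonal, yet it gets the largest Mahalanobis distance because it lies against the direction in which the data varies.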

##### Parameters

In EE, we'll be examining one parameter:
* contamination: proportion of outliers in a dataset. 

For contamination: We'll make an educated guess based on an article reporting that about 1,500,000 of 26,000,000 accounts were caught cheating [1]. That ratio is roughly 5.8%; the value used in this notebook is contamination = 0.0058 (0.58%), a tenth of that ratio. Neither figure accounts for cheaters who were never caught.


Begin with configuring all parameters for EE.

In [None]:
EE = EllipticEnvelope(random_state = 10, contamination = 0.0058).fit(X_train_std)

In [None]:
labels = EE.fit_predict(X_train_std)

The silhouette score measures how similar each point is to its own group compared with the other group. It ranges from -1 to +1, and values closer to +1 indicate better separation. In our case, the parameters have a silhouette score of 0.7427.

In [None]:
ss = metrics.silhouette_score(X_train_std, labels)
print('The Silhouette Score for the training set is ' + str(ss) + ".")

Next, plot the data.

In [None]:
## 3D Plot of Training Data
# Create and modify dataframe for the cluster column
df_X_train_std = pd.DataFrame(X_train_std)
df_X_train_std['Cluster'] = pd.Series(labels, index = df_X_train_std.index)

# Map the predicted labels (1 = inlier, -1 = outlier) to readable names
cluster_label_names = {1: "Human", -1: "Hacker"}
df_X_train_std['Cluster_Labels'] = df_X_train_std['Cluster'].map(cluster_label_names)  

df_X_train_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio",
                          "Top 10 Ratio", 'Cluster', 'Cluster_Labels']

# 3D plots of Kill Death Ratio, Headshot Kill Ratio, Top 10 Ratio, and Win Ratio
scatter3d_cluster(df = df_X_train_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Headshot-Kill Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_train_std , x = 'Kill Death Ratio', 
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Top 10 Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_train_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                 title = 'Clustering of Headshot-Kill Ratio, Top 10 Ratio, and Win Ratio') 


In [None]:
df_X_train_std['Cluster'].value_counts()

From the plots, we observed roughly 20650 humans and 121 hackers.

In [None]:
df_X_train_std.groupby('Cluster_Labels').count()

##### Predicting on the dev set

Begin with predicting on the dev set.

In [None]:
EE = EllipticEnvelope(random_state = 10, contamination = 0.0058).fit(X_train_std)

In [None]:
predict_labels = EE.predict(X_dev_std)

Next, create the 3D scatter plots.

In [None]:
## 3D Plot of Dev Data
# Create and modify dataframe for the cluster column
df_X_dev_std = pd.DataFrame(X_dev_std)
df_X_dev_std['Cluster'] = pd.Series(predict_labels, index=df_X_dev_std.index)

# Map the predicted labels (1 = inlier, -1 = outlier) to readable names
cluster_label_names = {1: "Human", -1: "Hacker"}
df_X_dev_std['Cluster_Labels'] = df_X_dev_std['Cluster'].map(cluster_label_names)  

df_X_dev_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio",
                          "Top 10 Ratio", 'Cluster', 'Cluster_Labels']

# 3D plots of Kill Death Ratio, Headshot Kill Ratio, Top 10 Ratio, and Win Ratio
scatter3d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Headshot-Kill Ratio, and Win Ratio' )

scatter3d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio', 
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Top 10 Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_dev_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                 title = 'Clustering of Headshot-Kill Ratio, Top 10 Ratio, and Win Ratio') 

From the plots, we observed roughly 7073 humans and 48 hackers.

In [None]:
df_X_dev_std.groupby('Cluster_Labels').count()

##### Predicting on the test set

Begin with predicting on the test set.

In [None]:
EE = EllipticEnvelope(random_state = 10, contamination = 0.0058).fit(X_train_std)

In [None]:
predict_labels = EE.predict(X_test_std)

Next, create the 3D scatter plots.

In [None]:
## 3D Plot of testing Data
# Create and modify dataframe for the cluster column
df_X_test_std = pd.DataFrame(X_test_std)
df_X_test_std['Cluster'] = pd.Series(predict_labels, index = df_X_test_std.index)

# Map the predicted labels (1 = inlier, -1 = outlier) to readable names
cluster_label_names = {1: "Human", -1: "Hacker"}
df_X_test_std['Cluster_Labels'] = df_X_test_std['Cluster'].map(cluster_label_names)  

df_X_test_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio",
                          "Top 10 Ratio", 'Cluster', 'Cluster_Labels']

# 3D plots of Kill Death Ratio, Headshot Kill Ratio, Top 10 Ratio, and Win Ratio
scatter3d_cluster(df = df_X_test_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Headshot-Kill Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_test_std , x = 'Kill Death Ratio', 
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Top 10 Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_test_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                 title = 'Clustering of Headshot-Kill Ratio, Top 10 Ratio, and Win Ratio') 

From the plots, we observed roughly 1770 humans and 11 hackers.

In [None]:
df_X_test_std.groupby('Cluster_Labels').count()

#### Isolation Forest

Isolation Forest isolates each point by repeatedly splitting the data at random, and labels points as outliers or inliers according to how many splits the isolation takes. An outlier sits apart from the rest of the data and is isolated after only a few splits, whereas an inlier sits in a dense region and needs many.

##### Algorithm [4]
1. Select the point to isolate.
2. For each feature, set the range to isolate between the minimum and the maximum.
3. Choose a feature randomly.
4. Pick a value that’s in the range, again randomly:
    * If the chosen value keeps the point above, switch the minimum of the range of the feature to the value.
    * If the chosen value keeps the point below, switch the maximum of the range of the feature to the value.
5. Repeat steps 3 & 4 until the point is isolated. That is, until the point is the only one which is inside the range for all features.
6. Count how many times you’ve had to repeat steps 3 & 4. We call this quantity the isolation number.
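
The steps above can be sketched directly in plain Python. This follows the single-point description given here rather than the tree ensemble that scikit-learn builds, and it assumes that all points are distinct. The function name `isolation_number`, the `max_steps` guard, and the toy data are assumptions made for this example.

```python
import random
from statistics import fmean

def isolation_number(points, target, rng, max_steps=10_000):
    """Count the random splits needed until `target` is alone in its box."""
    n_features = len(target)
    lows = [min(p[f] for p in points) for f in range(n_features)]
    highs = [max(p[f] for p in points) for f in range(n_features)]
    inside = list(points)
    steps = 0
    while len(inside) > 1 and steps < max_steps:
        f = rng.randrange(n_features)            # choose a feature at random
        cut = rng.uniform(lows[f], highs[f])     # choose a value within its range
        target_above = target[f] >= cut
        if target_above:
            lows[f] = cut                        # target is above: raise the minimum
        else:
            highs[f] = cut                       # target is below: lower the maximum
        # keep only the points on the same side of the cut as the target
        inside = [p for p in inside if (p[f] >= cut) == target_above]
        steps += 1
    return steps

grid = [(x / 4, y / 4) for x in range(5) for y in range(5)]   # dense 5 x 5 cluster
outlier = (5.0, 5.0)
points = grid + [outlier]

rng = random.Random(0)
mean_outlier = fmean([isolation_number(points, outlier, rng) for _ in range(200)])
mean_inlier = fmean([isolation_number(points, (0.5, 0.5), rng) for _ in range(200)])
print(mean_outlier, mean_inlier)
```

Averaged over many repetitions, the isolated point needs far fewer splits than the point in the middle of the grid, which is the signal the forest turns into an anomaly score.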


##### Parameters

In IF, we'll be examining three parameters:
* contamination: proportion of outliers in a dataset.
* max_samples: The number of samples to draw from X to train each base estimator.
* n_estimators: The number of base estimators in the ensemble.

For contamination: We'll make an educated guess based on an article reporting that about 1,500,000 of 26,000,000 accounts were caught cheating [1]. That ratio is roughly 5.8%; the value used in this notebook is contamination = 0.0058 (0.58%), a tenth of that ratio. Neither figure accounts for cheaters who were never caught.

For max_samples: We'll set max_samples to 'auto', which draws min(256, n_samples) samples per tree, i.e. 256 samples here.

For n_estimators: We'll be setting the n_estimators to 500.


Begin with configuring all parameters for IF.

In [None]:
IF = IsolationForest(max_samples = 'auto' ,random_state = 10, contamination = .0058, n_estimators = 500) 
IF.fit(X_train_std)
IF_scores = IF.decision_function(X_train_std)
IF_anomalies = IF.predict(X_train_std)
IF_anomalies = pd.Series(IF_anomalies).replace([-1,1], [1,0])

In [None]:
labels = IF_anomalies

In [None]:
labels.value_counts()

The silhouette score measures how similar each point is to its own group compared with the other group. It ranges from -1 to +1, and values closer to +1 indicate better separation. In our case, the parameters have a silhouette score of 0.7440.

In [None]:
ss = metrics.silhouette_score(X_train_std, labels)
print('The Silhouette Score for the training set is ' + str(ss) + ".")

Next, plot the data.

In [None]:
## 3D Plot of Training Data
# Create and modify dataframe for the cluster column
df_X_train_std = pd.DataFrame(X_train_std)
df_X_train_std['Cluster'] = pd.Series(labels, index = df_X_train_std.index)

# Map the recoded labels (0 = inlier, 1 = outlier) to readable names
cluster_label_names = {0: "Human", 1: "Hacker"}
df_X_train_std['Cluster_Labels'] = df_X_train_std['Cluster'].map(cluster_label_names)  

df_X_train_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio",
                          "Top 10 Ratio", 'Cluster', 'Cluster_Labels']

# 3D plots of Kill Death Ratio, Headshot Kill Ratio, Top 10 Ratio, and Win Ratio
scatter3d_cluster(df = df_X_train_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Headshot-Kill Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_train_std , x = 'Kill Death Ratio', 
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Top 10 Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_train_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                 title = 'Clustering of Headshot-Kill Ratio, Top 10 Ratio, and Win Ratio') 


In [None]:
df_X_train_std['Cluster'].value_counts()

From the plots, we observed roughly 20650 humans and 121 hackers.

In [None]:
df_X_train_std.groupby('Cluster_Labels').count()

##### Predicting on the dev set

Begin with predicting on the dev set.

In [None]:
IF = IsolationForest(max_samples = 'auto' ,random_state = 10, contamination = .0058, n_estimators = 500) 
IF.fit(X_train_std)

In [None]:
IF_anomalies = IF.predict(X_dev_std)
IF_anomalies = pd.Series(IF_anomalies).replace([-1,1], [1,0])

In [None]:
predict_labels = IF_anomalies

Next, create the 3D scatter plots.

In [None]:
## 3D Plot of Dev Data
# Create and modify dataframe for the cluster column
df_X_dev_std = pd.DataFrame(X_dev_std)
df_X_dev_std['Cluster'] = pd.Series(predict_labels, index=df_X_dev_std.index)

# Map the recoded labels (0 = inlier, 1 = outlier) to readable names
cluster_label_names = {0: "Human", 1: "Hacker"}
df_X_dev_std['Cluster_Labels'] = df_X_dev_std['Cluster'].map(cluster_label_names)  

df_X_dev_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio",
                          "Top 10 Ratio", 'Cluster', 'Cluster_Labels']

# 3D plots of Kill Death Ratio, Headshot Kill Ratio, Top 10 Ratio, and Win Ratio
scatter3d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Headshot-Kill Ratio, and Win Ratio' )

scatter3d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio', 
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Top 10 Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_dev_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                 title = 'Clustering of Headshot-Kill Ratio, Top 10 Ratio, and Win Ratio') 

From the plots, we observed roughly 7071 humans and 50 hackers.

In [None]:
df_X_dev_std.groupby('Cluster_Labels').count()

##### Predicting on the test set

Begin with predicting on the test set.

In [None]:
IF = IsolationForest(max_samples = 'auto' ,random_state = 10, contamination = .0058, n_estimators = 500) 
IF.fit(X_train_std)

In [None]:
IF_anomalies = IF.predict(X_test_std)
IF_anomalies = pd.Series(IF_anomalies).replace([-1,1], [1,0])

In [None]:
predict_labels = IF_anomalies

Next, create the 3D scatter plots.

In [None]:
## 3D Plot of testing Data
# Create and modify dataframe for the cluster column
df_X_test_std = pd.DataFrame(X_test_std)
df_X_test_std['Cluster'] = pd.Series(predict_labels, index = df_X_test_std.index)

# Map the recoded labels (0 = inlier, 1 = outlier) to readable names
cluster_label_names = {0: "Human", 1: "Hacker"}
df_X_test_std['Cluster_Labels'] = df_X_test_std['Cluster'].map(cluster_label_names)  

df_X_test_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio",
                          "Top 10 Ratio", 'Cluster', 'Cluster_Labels']

# 3D plots of Kill Death Ratio, Headshot Kill Ratio, Top 10 Ratio, and Win Ratio
scatter3d_cluster(df = df_X_test_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Headshot-Kill Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_test_std , x = 'Kill Death Ratio', 
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                  title = 'Clustering of Kill-Death Ratio, Top 10 Ratio, and Win Ratio')

scatter3d_cluster(df = df_X_test_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', z = 'Win Ratio', code = 'Cluster_Labels', 
                 title = 'Clustering of Headshot-Kill Ratio, Top 10 Ratio, and Win Ratio') 

From the plots, we observed roughly 1769 humans and 12 hackers.

In [None]:
df_X_test_std.groupby('Cluster_Labels').count()

# PART III - Isolation Forest in 2D

Let's examine our best-performing anomaly detection algorithm, Isolation Forest.

Begin with configuring all parameters for Isolation Forest.

In [None]:
# Isolation Forest on Training Data
IF = IsolationForest(max_samples = 'auto' ,random_state = 10, contamination = .0058, n_estimators = 500) 
IF.fit(X_train_std)
IF_scores = IF.decision_function(X_train_std)
IF_anomalies = IF.predict(X_train_std)
IF_anomalies = pd.Series(IF_anomalies).replace([-1,1], [1,0])

In [None]:
labels = IF_anomalies

Next, we'll use the function below to create our 2D scatter plots.

In [None]:
def scatter2d_cluster(df, x, y,  code, title):
    scatter = px.scatter(df, x = x, y = y, color = code,
                         color_discrete_sequence = px.colors.qualitative.Light24)
    
    scatter.update_xaxes(showline = True, linewidth = 1, linecolor = 'black', 
                          mirror = True, gridcolor = 'LightPink', automargin = True, 
                          zeroline = True, zerolinewidth = 2, zerolinecolor = 'LightPink', 
                          ticks = "outside", tickwidth = 2, tickcolor = 'black', ticklen = 10,
                          title_font = dict(size = 18))
    scatter.update_yaxes(showline = True, linewidth = 2, linecolor = 'black', 
                          mirror = True, gridcolor = 'LightPink',
                          zeroline = True, zerolinewidth = 1, zerolinecolor = 'LightPink', 
                          ticks = "outside", tickwidth = 2, tickcolor = 'black', ticklen = 10,
                          title_font = dict(size = 18))
    
    
    scatter.update_layout(title = title, title_font = dict(size = 24), 
                          legend = dict(
                              x = 1,
                              y = 1,
                              traceorder = "normal",
                              font = dict(
                                  family = "sans-serif",
                                  size = 14,
                                  color = "black"
                              ),
                              bgcolor = "#e5ecf6",
                              bordercolor = "Black",
                              borderwidth = 2
                          )
                         )
    return scatter.show()


Finally, we'll populate our scatter plots.

In [None]:
## 2D Plot of Training Data
# Create and modify dataframe for the cluster column
df_X_train_std = pd.DataFrame(X_train_std)
df_X_train_std['cluster'] = pd.Series(labels, index = df_X_train_std.index)
df_X_train_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio", "Top 10 Ratio", 'cluster']

# Map the recoded labels (0 = inlier, 1 = outlier) to readable names
cluster_label_names = {0: "Human", 1: "Hacker"}
df_X_train_std['Cluster_Labels'] = df_X_train_std['cluster'].map(cluster_label_names)  


# 2D plots of Kill Death Ratio, Headshot Kill Ratio, Top 10 Ratio, and Win Ratio
scatter2d_cluster(df = df_X_train_std , x = 'Kill Death Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Win Ratio')

scatter2d_cluster(df = df_X_train_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Headshot Kill Ratio')

scatter2d_cluster(df = df_X_train_std , x = 'Headshot Kill Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Win Ratio')

scatter2d_cluster(df = df_X_train_std , x = 'Kill Death Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Top 10 Ratio')

scatter2d_cluster(df = df_X_train_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Top 10 Ratio')


On the training set, the model labels roughly 20650 players as humans and 121 as hackers. The per-label counts are computed in the next cell.

In [None]:
df_X_train_std.groupby('Cluster_Labels').count()
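`groupby(...).count()` repeats the same number for every feature column. A more compact tally can be sketched with `value_counts`. The small frame `toy` below is a hypothetical stand-in for `df_X_train_std`, and its values are made up for illustration only.

```python
import pandas as pd

# Hypothetical miniature stand-in for df_X_train_std; values are illustrative only.
toy = pd.DataFrame({
    "Win Ratio": [0.10, 0.20, 0.90, 0.15],
    "Cluster_Labels": ["Human", "Human", "Hacker", "Human"],
})

# One count per label instead of one count per feature column.
label_counts = toy["Cluster_Labels"].value_counts()

# Share of each label in the sample.
label_share = toy["Cluster_Labels"].value_counts(normalize=True)

print(label_counts)
print(label_share)
```

The same two calls should work on the `Cluster_Labels` column of any of the frames built in this section.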

##### Predicting on the dev set

Fit the Isolation Forest on the standardized training data, then predict on the dev set.

In [None]:
IF = IsolationForest(max_samples = 'auto', random_state = 10, contamination = .0058, n_estimators = 500)
IF.fit(X_train_std)

In [None]:
IF_anomalies = IF.predict(X_dev_std)
IF_anomalies = pd.Series(IF_anomalies).replace([-1,1], [1,0])

In [None]:
predict_labels = IF_anomalies
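`IsolationForest.predict` returns `+1` for inliers and `-1` for outliers, which is why the values are remapped to `0` and `1` above. The sketch below reproduces that mapping on synthetic data. The arrays, the `contamination` value, and the variable names are illustrative assumptions, not the PUBG data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 ordinary points plus 2 deliberately extreme points.
inliers = rng.normal(0.0, 1.0, size=(200, 4))
extremes = np.array([[8.0, 8.0, 8.0, 8.0], [-8.0, -8.0, -8.0, -8.0]])
X = np.vstack([inliers, extremes])

model = IsolationForest(n_estimators=100, contamination=0.01, random_state=10)
raw = model.fit(X).predict(X)  # +1 = inlier, -1 = outlier

# Same remapping as replace([-1, 1], [1, 0]): 1 = flagged, 0 = not flagged.
flags = np.where(raw == -1, 1, 0)

print("flagged rows:", np.flatnonzero(flags))
```

Using `np.where` keeps the result as a NumPy array, while the `pd.Series.replace` form used in this notebook yields a Series. Either one feeds the label mapping that follows.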

Next, populate our 2D scatter plots.

In [None]:
## 2D Plot of Dev Data
# Create and modify dataframe for the cluster column
df_X_dev_std = pd.DataFrame(X_dev_std)
df_X_dev_std['cluster'] = pd.Series(predict_labels, index = df_X_dev_std.index)
df_X_dev_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio", "Top 10 Ratio", 'cluster']

# Map the numeric anomaly flags to readable label names
cluster_label_names = {0: "Human", 1: "Hacker"}
df_X_dev_std['Cluster_Labels'] = df_X_dev_std['cluster'].map(cluster_label_names)  


# Plots of Win Ratio, KDR, Headshot Kill Ratio, and Top 10 Ratio
scatter2d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Win Ratio')

scatter2d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Headshot Kill Ratio')

scatter2d_cluster(df = df_X_dev_std , x = 'Headshot Kill Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Win Ratio')

scatter2d_cluster(df = df_X_dev_std , x = 'Kill Death Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Top 10 Ratio')

scatter2d_cluster(df = df_X_dev_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Top 10 Ratio')

On the dev set, the model labels roughly 7071 players as humans and 50 as hackers. The per-label counts are computed in the next cell.

In [None]:
df_X_dev_std.groupby('Cluster_Labels').count()

##### Predicting on the test set

Fit the same Isolation Forest on the standardized training data, then predict on the test set.

In [None]:
IF = IsolationForest(max_samples = 'auto', random_state = 10, contamination = .0058, n_estimators = 500)
IF.fit(X_train_std)

In [None]:
IF_anomalies = IF.predict(X_test_std)
IF_anomalies = pd.Series(IF_anomalies).replace([-1,1], [1,0])

In [None]:
predict_labels = IF_anomalies
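The hard labels depend entirely on the chosen `contamination`, so it can help to inspect the continuous anomaly scores as well. The sketch below uses `decision_function` on synthetic data. The arrays and variable names are hypothetical, and the extreme last row is planted on purpose.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Synthetic stand-ins for a training split and a held-out split.
X_train_demo = rng.normal(size=(300, 4))
X_new_demo = np.vstack([
    rng.normal(size=(20, 4)),
    [[10.0, 10.0, 10.0, 10.0]],  # planted extreme row
])

model = IsolationForest(n_estimators=100, contamination=0.0058, random_state=10)
model.fit(X_train_demo)

# Lower scores are more anomalous; negative scores correspond to predict() == -1.
scores = model.decision_function(X_new_demo)
labels_demo = model.predict(X_new_demo)

# Indices of the three most anomalous held-out rows, most anomalous first.
most_anomalous = np.argsort(scores)[:3]
print(most_anomalous, scores[most_anomalous])
```

Ranking by score gives a shortlist to review by hand, which may be more useful than a fixed cut-off when no ground-truth labels exist.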

Next, populate our 2D scatter plots.

In [None]:
## 2D Plot of Test Data
# Create and modify dataframe for the cluster column
df_X_test_std = pd.DataFrame(X_test_std)
df_X_test_std['cluster'] = pd.Series(predict_labels, index=df_X_test_std.index)
df_X_test_std.columns = ['Win Ratio', 'Kill Death Ratio', "Headshot Kill Ratio", "Top 10 Ratio", 'cluster']

# Map the numeric anomaly flags to readable label names
cluster_label_names = {0: "Human", 1: "Hacker"}
df_X_test_std['Cluster_Labels'] = df_X_test_std['cluster'].map(cluster_label_names)  


# Plots of Win Ratio, KDR, Headshot Kill Ratio, and Top 10 Ratio
scatter2d_cluster(df = df_X_test_std , x = 'Kill Death Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Win Ratio')

scatter2d_cluster(df = df_X_test_std , x = 'Kill Death Ratio',
                  y = 'Headshot Kill Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Headshot Kill Ratio')

scatter2d_cluster(df = df_X_test_std , x = 'Headshot Kill Ratio',
                  y = 'Win Ratio',  code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Win Ratio')

scatter2d_cluster(df = df_X_test_std , x = 'Kill Death Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Kill Death Ratio and Top 10 Ratio')

scatter2d_cluster(df = df_X_test_std , x = 'Headshot Kill Ratio',
                  y = 'Top 10 Ratio', code = 'Cluster_Labels',
                  title = 'Clustering of Headshot Kill Ratio and Top 10 Ratio')

On the test set, the model labels roughly 1769 players as humans and 12 as hackers. The per-label counts are computed in the next cell.

In [None]:
df_X_test_std.groupby('Cluster_Labels').count()
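The share of players flagged in each split can be compared with the `contamination` setting of 0.0058. The sketch below uses only the counts quoted in this section. The dictionary layout and variable names are illustrative.

```python
# (humans, hackers) counts quoted in this section for each split.
counts = {"train": (20650, 121), "dev": (7071, 50), "test": (1769, 12)}
contamination = 0.0058

flagged_share = {}
for split, (humans, hackers) in counts.items():
    share = hackers / (humans + hackers)
    flagged_share[split] = share
    print(f"{split}: {share:.4%} flagged (contamination = {contamination:.2%})")
```

The training share matches `contamination` almost exactly, as expected, because the decision threshold is set on the training data. The dev and test shares come out slightly higher, at roughly 0.7%.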

### Remarks
* Without external labels, we cannot verify the accuracy of these clusters. We can, however, make an educated guess about what they represent by drawing on domain experience and advice from experienced players.
* Treating this as an outlier (anomaly) detection problem has yielded promising results.
    * Assuming that hackers make up only a small share of the population, we can treat them as outliers.
* However, most of the flagged outliers may be legitimate human players, so further tuning is needed to reduce that misclassification rate.
* The algorithms were ranked by Silhouette Score and by the number of possible misclassifications:
    * Isolation Forest > Elliptic Envelope > Local Outlier Factor (LOF)
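One plausible way to compute a Silhouette Score for anomaly labels is sketched below with `sklearn.metrics.silhouette_score`. The data are synthetic, the `contamination` value is arbitrary, and this is an assumption about the procedure rather than the exact evaluation behind the ranking above.

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic stand-in: a large central group and a small far-away group.
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(300, 4)),
    rng.normal(9.0, 0.5, size=(6, 4)),
])

model = IsolationForest(contamination=0.02, random_state=10)

# 1 = flagged as anomaly, 0 = not flagged.
demo_labels = np.where(model.fit_predict(X_demo) == -1, 1, 0)

# Silhouette lies in [-1, 1]; higher means the two label groups are better separated.
score = metrics.silhouette_score(X_demo, demo_labels)
print(f"silhouette score: {score:.3f}")
```

Note that the Silhouette Score only measures how well separated the two label groups are in feature space. It does not say whether a flagged player is actually a hacker.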

### References

[1] https://gamerant.com/playerunknowns-battlegrounds-cheater-ban-count/

[2] M. Breunig, H. Kriegel, R. Ng and J. Sander, "LOF: Identifying Density-Based Local Outliers," ACM SIGMOD Record, vol. 29, no. 2, pp. 93-104, 2000. doi: 10.1145/335191.335388.

[3] P. J. Rousseeuw and K. Van Driessen, "A Fast Algorithm for the Minimum Covariance Determinant Estimator," Technometrics, vol. 41, no. 3, pp. 212-223, 1999.

[4] K. M. Ting, "Adaptive Anomaly Detection using Isolation Forest," 2009.