## Introduction

This article is a walkthrough of a **consulting project** in which we work with a large technology firm to determine **whether a certain type of hacker was involved in hacking their servers.** To solve this real-world problem we will use **PySpark's KMeans algorithm** and, based on the features that the **forensic engineers** extracted, work out whether a **third type of hacker was involved in this malicious act or not.**


## About the dataset

The dataset was put together by the **forensic engineers** after the servers had been hacked, with the aim of protecting the company's data from such activity in the future. **They extracted a set of features that give us relevant metadata about the type of hackers involved.**

**Here is a brief description of each feature:**

1. **Session connection time:** The total time the **session lasted, in minutes**.
2. **Bytes transferred:** How many **megabytes were transferred during the session**.
3. **Kali trace used:** A flag variable that indicates whether the hacker used the **Kali Linux operating system**.
4. **Servers corrupted:** How many **servers got corrupted** during the attack.
5. **Pages corrupted:** How many pages were accessed illegally.
6. **Location:** This metadata is available to us, but it is of no use because **hackers use VPNs**.
7. **WPM typing speed:** The **typing speed** (words per minute) of the attackers, estimated from the available logs.


## What approach do we have to follow?

First, let's understand what the company already knows. They are aware that **there are 3 types of hackers** who might have perpetrated the attack. They are quite sure about **2 of them**, but they want us to find out whether the third type of attacker was also involved in this **criminal act or not**.

One key thing to know before moving forward: the forensic engineers know that the **hackers trade off attacks**, which means each hacker should have carried out roughly the same number of attacks. So if there were **3 types of hackers**, the attacks should be split about evenly among the three; if the split is only even across two groups, the third suspect was not involved this time.

**Example:** If all three types of attackers were involved, then out of **100 attacks** each one would be responsible for roughly **33**.
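
To make the trade-off assumption concrete, here is a small sketch of the cluster sizes we would expect under each hypothesis. The helper name `expected_counts` is hypothetical and introduced only for illustration; it simply splits the attacks as evenly as possible.

```python
def expected_counts(n_attacks, n_hackers):
    """Split n_attacks as evenly as possible among n_hackers (illustrative helper)."""
    base, extra = divmod(n_attacks, n_hackers)
    # The first `extra` hackers get one additional attack each.
    return [base + 1 if i < extra else base for i in range(n_hackers)]

print(expected_counts(100, 3))  # [34, 33, 33] -> three hacker types
print(expected_counts(100, 2))  # [50, 50]     -> two hacker types
```

If the clusters we find later are far from one of these even splits, that hypothesis becomes hard to defend.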

In [1]:
!pip install pyspark



## Mandatory steps to follow

Before analysing the dataset at a high level and implementing the KMeans clustering algorithm on top of it, we have to follow the steps mentioned below:

1. **Initializing the Spark object:** In this step we set up the environment for Apache Spark by creating a Spark session, which gives us access to all the libraries supported by Spark.

2. **Reading the dataset:** Just as cooking needs fire, data analysis and model preparation need the dataset to be read first.

**Starting the PySpark session**

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('find_hackers').getOrCreate()
spark

**Inference:** We created a **SparkSession** named "**find_hackers**". The **builder** attribute starts the configuration, **appName()** sets the name of the **session/app**, and **getOrCreate()** returns the existing session if there is one or creates a new one otherwise.

In [3]:
dataset = spark.read.csv("hack_data.csv",header=True,inferSchema=True)
dataset.show()

+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
|Session_Connection_Time|Bytes Transferred|Kali_Trace_Used|Servers_Corrupted|Pages_Corrupted|            Location|WPM_Typing_Speed|
+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
|                    8.0|           391.09|              1|             2.96|            7.0|            Slovenia|           72.37|
|                   20.0|           720.99|              0|             3.04|            9.0|British Virgin Is...|           69.08|
|                   31.0|           356.32|              1|             3.71|            8.0|             Tokelau|           70.58|
|                    2.0|           228.08|              1|             2.48|            8.0|             Bolivia|            70.8|
|                   20.0|            408.5|              0|             3.57

**Inference:** Setting **header** to **True** treats the first line of the file as the column names, and setting **inferSchema** to **True** makes Spark infer the data type of each column instead of reading everything as strings. **In the output, show() displays the top 20 rows of the DataFrame.**

In [4]:
dataset.head()

Row(Session_Connection_Time=8.0, Bytes Transferred=391.09, Kali_Trace_Used=1, Servers_Corrupted=2.96, Pages_Corrupted=7.0, Location='Slovenia', WPM_Typing_Speed=72.37)

**Inference:** **head()** is a DataFrame method that returns the first record as a **Row** object, so we see not only the column names but also the **values associated with them.**

In [5]:
dataset.describe().show()

+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|summary|Session_Connection_Time| Bytes Transferred|   Kali_Trace_Used|Servers_Corrupted|   Pages_Corrupted|   Location|  WPM_Typing_Speed|
+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|  count|                    334|               334|               334|              334|               334|        334|               334|
|   mean|     30.008982035928145| 607.2452694610777|0.5119760479041916|5.258502994011977|10.838323353293413|       null|57.342395209580864|
| stddev|     14.088200614636158|286.33593163576757|0.5006065264451406| 2.30190693339697|  3.06352633036022|       null| 13.41106336843464|
|    min|                    1.0|              10.0|                 0|              1.0|               6.0|Afghanistan|              40.0|
|    max|           

**Inference:** Getting summary statistics is one of the key steps when analysing a dataset, as it tells us the **minimum** and **maximum** values, the **mean**, the **standard deviation** and the count of each column.

One inference is clearly visible here: **there are no null values in the dataset, since the count row shows 334 for every column**.
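
The reasoning behind the null check can be sketched in plain Python: the count in a summary only includes non-null values, so a column whose count is lower than the number of rows must contain nulls. The toy rows and the helper name `non_null_counts` below are assumptions made for illustration, not part of the dataset or of PySpark.

```python
# Toy rows standing in for a DataFrame; None plays the role of a null value.
rows = [
    {"Session_Connection_Time": 8.0, "Location": "Slovenia"},
    {"Session_Connection_Time": None, "Location": "Tokelau"},
    {"Session_Connection_Time": 2.0, "Location": "Bolivia"},
]

def non_null_counts(rows):
    """Count the non-null values in every column (illustrative helper)."""
    columns = rows[0].keys()
    return {c: sum(1 for r in rows if r[c] is not None) for c in columns}

counts = non_null_counts(rows)
# Any column whose count is below the row count contains at least one null.
columns_with_nulls = [c for c, n in counts.items() if n < len(rows)]
print(counts)
print(columns_with_nulls)
```

In our dataset every column reports the same count, so this list would be empty.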

## Vector Assembler

Machine learning algorithms expect **correctly formatted data** no matter which library we use, whether it is **scikit-learn** or, as in our case, **PySpark**. Spark ML models expect all the input features packed into a single vector column, so we need to format the data before feeding it to our **KMeans model**.

In [7]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

feat_cols = ['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used',
             'Servers_Corrupted', 'Pages_Corrupted','WPM_Typing_Speed']

vec_assembler = VectorAssembler(inputCols = feat_cols, outputCol='features')

final_data = vec_assembler.transform(dataset)

**Code breakdown:**

1. Importing the **VectorAssembler** object from the **feature** module of the **ml** library.
2. Then creating a **new variable** that stores the names of all the **feature** columns as a **list**. **Location** is left out because it carries no useful signal.
3. Finally creating the **assembler object** with the **feature names as input columns** and calling **transform**, which returns a **new DataFrame** with an extra **features** column; the **original data** itself is not modified.
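
As a rough mental model of what the assembler does to a single record, here is a plain-Python sketch. It is not how Spark implements it; the function name `assemble` is made up for this illustration, and the row values are the ones returned by `dataset.head()` earlier.

```python
feat_cols = ["Session_Connection_Time", "Bytes Transferred", "Kali_Trace_Used",
             "Servers_Corrupted", "Pages_Corrupted", "WPM_Typing_Speed"]

# First row of the dataset, as shown by dataset.head() above.
row = {
    "Session_Connection_Time": 8.0, "Bytes Transferred": 391.09,
    "Kali_Trace_Used": 1, "Servers_Corrupted": 2.96, "Pages_Corrupted": 7.0,
    "Location": "Slovenia", "WPM_Typing_Speed": 72.37,
}

def assemble(row, input_cols, output_col="features"):
    """Return a copy of `row` with the selected columns packed into one list."""
    assembled = dict(row)  # the input row is left untouched
    assembled[output_col] = [float(row[c]) for c in input_cols]
    return assembled

new_row = assemble(row, feat_cols)
print(new_row["features"])  # [8.0, 391.09, 1.0, 2.96, 7.0, 72.37]
```

Note that the original columns are still present in `new_row`; the vector is added alongside them, which mirrors the extra `features` column in `final_data`.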

## Scaling the features

Scaling the features is an important step when there is **diversity in the range of values** in our dataset. KMeans groups points by distance, so a column with a **much larger range** (here **Bytes Transferred**, which runs into the hundreds) would dominate the distance calculation and distort the clusters. Hence we will now scale our **feature** columns.

In [11]:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)

scalerModel = scaler.fit(final_data)

cluster_final_data = scalerModel.transform(final_data)

**Code breakdown:**

1. Importing the **StandardScaler** object and instantiating it with our features as the input column. Keeping the **withStd parameter as True** scales each feature to unit **standard deviation**, while **withMean=False** means the data is not centred around the **mean**.

2. Calling the **fit** method computes the **summary statistics** (the standard deviation of each feature) and returns a **fitted scaler model**.

3. After **fitting the object**, **transforming** is the next step, where each feature is **normalized** to unit **standard deviation** and stored in the **scaledFeatures** column.
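
The effect of `withStd=True, withMean=False` on one column can be sketched in plain Python: every value is divided by the column's standard deviation and nothing is subtracted. This is only an illustration under the assumption, taken from the Spark documentation, that the scaler uses the corrected sample standard deviation; `scale_to_unit_std` is a made-up helper name.

```python
from statistics import stdev

def scale_to_unit_std(column):
    """Divide each value by the column's sample standard deviation (no centring)."""
    s = stdev(column)  # sample standard deviation (n - 1 in the denominator)
    return [x / s for x in column]

# A few 'Bytes Transferred' values from the rows shown earlier.
bytes_transferred = [391.09, 720.99, 356.32, 228.08]
scaled = scale_to_unit_std(bytes_transferred)

print(scaled)
print(round(stdev(scaled), 6))  # 1.0
```

The scaled values stay positive because the mean is not removed, but their spread is now one unit, so this column can no longer dominate the distance calculation.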

**Let's find out whether k=2 is required or 3!**

## Model training 

Here comes the model training phase, where we create **KMeans models** to **group the attacks into clusters**, one cluster for each type of attacker involved in hacking the servers.

Note that we will build two models here: one with **2 clusters and one with 3.**

In [15]:
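from pyspark.ml.clustering import KMeans

# One estimator per hypothesis: three hacker types vs. two hacker types
kmeans3 = KMeans(featuresCol='scaledFeatures',k=3)
kmeans2 = KMeans(featuresCol='scaledFeatures',k=2)

# fit() learns the cluster centres from the scaled features
model_k3 = kmeans3.fit(cluster_final_data)
model_k2 = kmeans2.fit(cluster_final_data)

# transform() adds a 'prediction' column; from here on model_k3 and model_k2
# refer to these prediction DataFrames, not to the fitted models
model_k3 = model_k3.transform(cluster_final_data)
model_k2 = model_k2.transform(cluster_final_data)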

**Inference:** In the above code we simply create two KMeans models, one with **k=3** and the other with **k=2**, so that we can compare both scenarios and find out whether the **third hacker was involved in this malicious act or not**.

**Note:** **fit** and **transform** can be chained one after the other, as they both belong to the model building phase: **fit** learns the cluster centres and **transform** assigns each row to a cluster in a new **prediction** column.
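
To show how the two stages divide the work, here is a toy k-means in plain Python. It is a minimal sketch, not Spark's implementation: the seeding (first k points) is an assumption made to keep the example deterministic, whereas Spark uses its own initialisation, and the sample points are invented.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(points, k, n_iter=10):
    """Learn k cluster centres with plain Lloyd iterations."""
    centers = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old centre if a cluster ends up empty
                centers[j] = [sum(dim) / len(g) for dim in zip(*g)]
    return centers

def transform(points, centers):
    """Assign each point the index of its nearest learned centre."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centers = fit(points, k=2)
predictions = transform(points, centers)
print(predictions)  # [0, 1, 0, 1, 0, 1]
```

Keeping the stages separate means the learned centres could also be applied to new sessions without retraining.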

## Model Evaluation

After building the model, evaluating it is equally important because "**there could be millions of models but only one will actually be useful**", and to find the best one we need to evaluate it against some metric. For a **KMeans model** we use the **ClusteringEvaluator** object, which by default computes the **silhouette score**: a value between -1 and 1 where higher means better-separated clusters.
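
For intuition about the silhouette score, the following plain-Python sketch computes it on a handful of invented points. It uses squared Euclidean distance, which is the evaluator's default distance measure, but it is an illustrative re-implementation rather than Spark's code, and it assumes at least two clusters are present.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def silhouette(points, labels):
    """Mean silhouette coefficient; needs at least two distinct labels."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(sq_dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(sq_dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != label)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
good_labels = [0, 1, 0, 1, 0, 1]   # matches the two visible groups
mixed_labels = [0, 1, 1, 0, 0, 1]  # deliberately scrambles the groups

print(silhouette(points, good_labels))
print(silhouette(points, mixed_labels))
```

The well-separated labelling scores close to 1, while the scrambled one scores much lower, which is the behaviour we rely on when comparing different values of k.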

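In [33]:
from pyspark.ml.evaluation import ClusteringEvaluator
# Default metric is the silhouette score (higher is better). The evaluator also
# defaults to featuresCol='features'; pass featuresCol='scaledFeatures' to score
# the scaled vectors instead (the numbers below come from the default settings).
evaluator = ClusteringEvaluator()

k3_score = evaluator.evaluate(model_k3)
k2_score = evaluator.evaluate(model_k2)

print("When K=3")
print("Silhouette score = " + str(k3_score))
print('-'*53)
print("When K=2")
print("Silhouette score = " + str(k2_score))

When K=3
Silhouette score = 0.3068084951287429
-----------------------------------------------------
When K=2
Silhouette score = 0.6683623593283755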


**Inference:** After evaluation we printed the results for both cases. The value returned by the evaluator is a silhouette score, not an error, so higher is better: **k=2 scores about 0.67 while k=3 scores only about 0.31**, which means the data separates more cleanly into two clusters than into three.

We should not be satisfied with checking just two values of k, though. To try several values, I have set up a **for loop that evaluates a range of cluster counts in one go.**

In [34]:
for k in range(2,9):
    kmeans = KMeans(featuresCol='scaledFeatures',k=k)
    model = kmeans.fit(cluster_final_data)
    # DataFrame with a 'prediction' column for this value of k
    predictions = model.transform(cluster_final_data)
    score = evaluator.evaluate(predictions)
    print("With K={}".format(k))
    print("Silhouette score  = " + str(score))
    print('-'*53)

With K=2
Silhouette score  = 0.6683623593283755
-----------------------------------------------------
With K=3
Silhouette score  = 0.3068084951287429
-----------------------------------------------------
With K=4
Silhouette score  = -0.04792891045570489
-----------------------------------------------------
With K=5
Silhouette score  = -0.1047113268903205
-----------------------------------------------------
With K=6
Silhouette score  = -0.10603693913180695
-----------------------------------------------------
With K=7
Silhouette score  = -0.13283304792499523
-----------------------------------------------------
With K=8
Silhouette score  = -0.1645464293373172
-----------------------------------------------------


**Inference:** From the above output we can see a pattern: as **k increases the score keeps getting worse**, dropping below zero from K=4 onwards. K=2 gives the highest score, so it is the best value of K according to this metric.
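
Picking the best k from these numbers can be written down explicitly. The sketch below copies the values printed by the loop above and treats them as silhouette scores (what `ClusteringEvaluator` returns by default), so the best k is simply the one with the highest value; the variable names are illustrative.

```python
# Scores copied from the loop output above, keyed by the number of clusters.
silhouette_by_k = {
    2: 0.6683623593283755,
    3: 0.3068084951287429,
    4: -0.04792891045570489,
    5: -0.1047113268903205,
    6: -0.10603693913180695,
    7: -0.13283304792499523,
    8: -0.1645464293373172,
}

# Higher silhouette means better-separated clusters, so take the maximum.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k)  # 2
```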

The final check is based on the point we discussed earlier, i.e. the **hackers' trade-off**: each attacker carried out an equal number of attacks. To confirm this we will **group by the prediction column and count the rows in each cluster**.

In [25]:
model_k3.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         2|   84|
|         0|   83|
+----------+-----+



In [26]:
model_k2.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         0|  167|
+----------+-----+



**Inference:** Here we can see that the predictions are split equally (167 and 167) only when the number of **clusters is 2**; with 3 clusters the split is 167, 84 and 83. Hence we can **conclude that the third type of attacker was not involved in hacking the company's servers.**
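
The balance check can also be expressed as a small rule rather than judged by eye. In this sketch the helper name `is_balanced` and the 5% tolerance are assumptions chosen for illustration; the counts are the ones shown in the two tables above.

```python
def is_balanced(counts, tolerance=0.05):
    """True if every cluster size is within `tolerance` of an even split."""
    expected = sum(counts) / len(counts)
    return all(abs(c - expected) <= tolerance * expected for c in counts)

k3_counts = [167, 84, 83]  # cluster sizes with k=3
k2_counts = [167, 167]     # cluster sizes with k=2

print(is_balanced(k3_counts))  # False
print(is_balanced(k2_counts))  # True
```

Only the two-cluster split satisfies the trade-off assumption, which agrees with the conclusion drawn from the counts.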

## Conclusion

The final part of this article, and of this **end-to-end consulting project, is the conclusion**, where we briefly recap each step so that the **pipeline of the project** is crystal clear and can be used as a template for other such problem statements.

1. First we thoroughly investigated the problem statement and clarified the approach, then completed some mandatory steps **such as setting up the Spark session and reading the dataset.**

2. After reading the dataset we did some analysis on it and **formatted** it to make it ready for the **model development phase (VectorAssembler and StandardScaler).**

3. Finally we **built the models and evaluated them**, and came to the conclusion that only two types of attacker were behind this round of hacking; the third type had nothing to do with it.