# Clustering Consulting Project 

A large technology firm needs your help: they've been hacked! Luckily, their forensic engineers have grabbed valuable data about the hacks, including session time, locations, WPM typing speed, etc. The forensic engineer relates to you what she has been able to figure out so far: she has been able to grab metadata for each session that the hackers used to connect to their servers. These are the features of the data:

* 'Session_Connection_Time': How long the session lasted in minutes
* 'Bytes Transferred': Number of MB transferred during session
* 'Kali_Trace_Used': Indicates if the hacker was using Kali Linux
* 'Servers_Corrupted': Number of servers corrupted during the attack
* 'Pages_Corrupted': Number of pages illegally accessed
* 'Location': Location attack came from (Probably useless because the hackers used VPNs)
* 'WPM_Typing_Speed': Their estimated typing speed based on session logs.


The technology firm has 3 potential hackers that perpetrated the attack. They're certain of the first two hackers, but they aren't sure whether the third hacker was involved. They have requested your help! Can you help figure out whether the third suspect had anything to do with the attacks, or was it just two hackers? It's probably not possible to know for sure, but maybe what you've just learned about clustering can help!

**One last key fact: the forensic engineer knows that the hackers trade off attacks, meaning they should each have roughly the same number of attacks. For example, if there were 100 total attacks, then in a two-hacker situation each should have about 50 hacks, and in a three-hacker situation each would have about 33 hacks. The engineer believes this is the key element to solving this, but doesn't know how to separate this unlabeled data into groups of hackers.**

In [22]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('clustering').getOrCreate()

In [23]:
df = spark.read.csv('hack_data.csv', header=True, inferSchema=True)
df.show(2)

+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
|Session_Connection_Time|Bytes Transferred|Kali_Trace_Used|Servers_Corrupted|Pages_Corrupted|            Location|WPM_Typing_Speed|
+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
|                    8.0|           391.09|              1|             2.96|            7.0|            Slovenia|           72.37|
|                   20.0|           720.99|              0|             3.04|            9.0|British Virgin Is...|           69.08|
+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
only showing top 2 rows



In [24]:
df.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)



In [25]:
from pyspark.sql.functions import countDistinct

df.agg(countDistinct(df['Location'])).show()

+------------------------+
|count(DISTINCT Location)|
+------------------------+
|                     181|
+------------------------+



With 181 distinct locations, categorical encoding would add little, so `Location` is left out of the features.

In [26]:
df.count()

334

## Dataset transformation

In [27]:
df.columns

['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'Location',
 'WPM_Typing_Speed']

In [28]:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'WPM_Typing_Speed'], outputCol='features')

In [29]:
df2 = assembler.transform(df)
df2.head(1)

[Row(Session_Connection_Time=8.0, Bytes Transferred=391.09, Kali_Trace_Used=1, Servers_Corrupted=2.96, Pages_Corrupted=7.0, Location='Slovenia', WPM_Typing_Speed=72.37, features=DenseVector([8.0, 391.09, 1.0, 2.96, 7.0, 72.37]))]

In [30]:
df2.show(1)

+-----------------------+-----------------+---------------+-----------------+---------------+--------+----------------+--------------------+
|Session_Connection_Time|Bytes Transferred|Kali_Trace_Used|Servers_Corrupted|Pages_Corrupted|Location|WPM_Typing_Speed|            features|
+-----------------------+-----------------+---------------+-----------------+---------------+--------+----------------+--------------------+
|                    8.0|           391.09|              1|             2.96|            7.0|Slovenia|           72.37|[8.0,391.09,1.0,2...|
+-----------------------+-----------------+---------------+-----------------+---------------+--------+----------------+--------------------+
only showing top 1 row



## Feature scaling

In [31]:
# K-means is distance-based, so features on different scales must be rescaled first
from pyspark.ml.feature import StandardScaler

In [32]:
scale = StandardScaler(inputCol='features', 
                       outputCol='scaled_features', 
                       withMean=False, withStd=True)
df_scaled = scale.fit(df2).transform(df2)
df_scaled.show(1)

+-----------------------+-----------------+---------------+-----------------+---------------+--------+----------------+--------------------+--------------------+
|Session_Connection_Time|Bytes Transferred|Kali_Trace_Used|Servers_Corrupted|Pages_Corrupted|Location|WPM_Typing_Speed|            features|     scaled_features|
+-----------------------+-----------------+---------------+-----------------+---------------+--------+----------------+--------------------+--------------------+
|                    8.0|           391.09|              1|             2.96|            7.0|Slovenia|           72.37|[8.0,391.09,1.0,2...|[0.56785108466505...|
+-----------------------+-----------------+---------------+-----------------+---------------+--------+----------------+--------------------+--------------------+
only showing top 1 row
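To make the effect of `withMean=False, withStd=True` concrete, here is a minimal pure-Python sketch of the same idea: each column is divided by its sample standard deviation and is not centered. The helper `scale_by_std` and the toy values are hypothetical illustrations, not taken from `hack_data.csv`.

```python
from statistics import stdev

def scale_by_std(column):
    """Divide each value by the column's sample standard deviation (no centering)."""
    s = stdev(column)  # sample standard deviation (n - 1 denominator)
    return [x / s for x in column]

# Hypothetical toy columns on very different scales
session_minutes = [5.0, 10.0, 20.0, 40.0]
bytes_mb = [200.0, 400.0, 650.0, 900.0]

scaled_minutes = scale_by_std(session_minutes)
scaled_bytes = scale_by_std(bytes_mb)

# After scaling, both columns have unit standard deviation,
# so neither dominates a Euclidean distance.
print(stdev(scaled_minutes), stdev(scaled_bytes))
```

Because the columns are only divided and not shifted, zero stays zero, which is why a binary column such as `Kali_Trace_Used` still has one of its values at exactly 0 after scaling.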



## K-means clustering model and prediction

### 2 hackers

In [33]:
# The data is unlabeled, so there is no point in a train/test split
from pyspark.ml.clustering import KMeans
kmeans2 = KMeans(featuresCol='scaled_features',
                 predictionCol='prediction2',k=2)
model2 = kmeans2.fit(df_scaled)

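# Within Set Sum of Squared Errors (WSSE).
# computeCost was deprecated in Spark 2.4 and removed in 3.0;
# on Spark 3.x, model2.summary.trainingCost reports the same training cost.
wsse2 = model2.computeCost(df_scaled)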
print(f'model_2 error {wsse2}')

model_2 error 601.7707512676716


### 3 hackers

In [34]:
kmeans3 = KMeans(featuresCol='scaled_features',
                 predictionCol='prediction3',k=3)
model3 = kmeans3.fit(df_scaled)

wsse3 = model3.computeCost(df_scaled)
print(f'model_3 error {wsse3}')

model_3 error 434.75507308487647


In [35]:
model2.clusterCenters()

[array([1.26023837, 1.31829808, 0.99280765, 1.36491885, 2.5625043 ,
        5.26676612]),
 array([2.99991988, 2.92319035, 1.05261534, 3.20390443, 4.51321315,
        3.28474   ])]

In [36]:
model3.clusterCenters()

[array([1.26023837, 1.31829808, 0.99280765, 1.36491885, 2.5625043 ,
        5.26676612]),
 array([3.05623261, 2.95754486, 1.99757683, 3.2079628 , 4.49941976,
        3.26738378]),
 array([2.93719177, 2.88492202, 0.        , 3.19938371, 4.52857793,
        3.30407351])]
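One way to read the k=3 centers is to compare the second and third centers feature by feature. The sketch below is an illustration added here, not part of the original analysis; it copies the two centers printed above and reports which feature separates them most. The variable names are hypothetical.

```python
feature_names = ['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used',
                 'Servers_Corrupted', 'Pages_Corrupted', 'WPM_Typing_Speed']

# Second and third centers from model3.clusterCenters(), copied from the output above
center_1 = [3.05623261, 2.95754486, 1.99757683, 3.2079628, 4.49941976, 3.26738378]
center_2 = [2.93719177, 2.88492202, 0.0, 3.19938371, 4.52857793, 3.30407351]

# Absolute gap between the two centers for each (scaled) feature
gaps = {name: abs(a - b) for name, a, b in zip(feature_names, center_1, center_2)}
largest = max(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f'{name}: {gap:.4f}')
print(f'Largest gap: {largest}')
```

The two centers differ almost entirely in `Kali_Trace_Used`, while the first center is unchanged from the k=2 model. This suggests the third cluster may simply be one of the two original groups split on the Kali flag, though that is an interpretation rather than a proven fact.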

#### Since each hacker should account for roughly the same number of attacks, we need to count how many sessions fall into each cluster.

## 2 HACKERS

In [37]:
df_pred = model2.transform(df_scaled)
df_pred.select(df_pred['scaled_features'], df_pred['prediction2']).show()

+--------------------+-----------+
|     scaled_features|prediction2|
+--------------------+-----------+
|[0.56785108466505...|          0|
|[1.41962771166263...|          0|
|[2.20042295307707...|          0|
|[0.14196277116626...|          0|
|[1.41962771166263...|          0|
|[0.07098138558313...|          0|
|[1.27766494049636...|          0|
|[1.56159048282889...|          0|
|[1.06472078374697...|          0|
|[0.85177662699757...|          0|
|[1.06472078374697...|          0|
|[2.27140433866020...|          0|
|[1.63257186841202...|          0|
|[0.63883247024818...|          0|
|[1.91649741074455...|          0|
|[0.85177662699757...|          0|
|[1.49060909724576...|          0|
|[0.70981385583131...|          0|
|[1.41962771166263...|          0|
|[1.56159048282889...|          0|
+--------------------+-----------+
only showing top 20 rows



In [39]:
df_pred.groupBy('prediction2').count().show()

+-----------+-----+
|prediction2|count|
+-----------+-----+
|          1|  167|
|          0|  167|
+-----------+-----+



## 3 HACKERS

In [40]:
df_pred = model3.transform(df_scaled)
df_pred.select(df_pred['scaled_features'], df_pred['prediction3']).show()

+--------------------+-----------+
|     scaled_features|prediction3|
+--------------------+-----------+
|[0.56785108466505...|          0|
|[1.41962771166263...|          0|
|[2.20042295307707...|          0|
|[0.14196277116626...|          0|
|[1.41962771166263...|          0|
|[0.07098138558313...|          0|
|[1.27766494049636...|          0|
|[1.56159048282889...|          0|
|[1.06472078374697...|          0|
|[0.85177662699757...|          0|
|[1.06472078374697...|          0|
|[2.27140433866020...|          0|
|[1.63257186841202...|          0|
|[0.63883247024818...|          0|
|[1.91649741074455...|          0|
|[0.85177662699757...|          0|
|[1.49060909724576...|          0|
|[0.70981385583131...|          0|
|[1.41962771166263...|          0|
|[1.56159048282889...|          0|
+--------------------+-----------+
only showing top 20 rows



In [41]:
df_pred.groupBy(df_pred['prediction3']).count().show()

+-----------+-----+
|prediction3|count|
+-----------+-----+
|          1|   88|
|          2|   79|
|          0|  167|
+-----------+-----+



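With three clusters the sessions are not split evenly (167, 88 and 79), so k=3 does not match the equal-share criterion; with two clusters the split is exactly 167/167.

The error is lower for k=3 (434.76 vs. 601.77), but WSSE tends to fall whenever clusters are added, so that alone does not point to a third hacker.

So the number of clusters/hackers is most likely 2.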
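The equal-share check can also be written down explicitly. This is a minimal sketch using the cluster counts printed above; the helper `is_balanced` and its 10% tolerance are assumptions chosen for illustration, since the source only says the counts should be "roughly the same".

```python
def is_balanced(counts, tolerance=0.10):
    """Return True if every cluster size is within `tolerance` of an equal share."""
    expected = sum(counts) / len(counts)  # equal share per hacker
    return all(abs(c - expected) <= tolerance * expected for c in counts)

# Cluster sizes taken from the groupBy(...).count() outputs above
two_hackers = [167, 167]
three_hackers = [167, 88, 79]

print(is_balanced(two_hackers))    # True
print(is_balanced(three_hackers))  # False
```

With 334 sessions, three equally active hackers would each account for about 111 sessions, which is far from the observed 167/88/79 split.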

## Checking errors for different numbers of clusters

In [42]:
for k in range(2,11):
    print(f'No. of clusters {k}')
    kmeans = KMeans(featuresCol='scaled_features', 
                    predictionCol='prediction', k=k)
    model = kmeans.fit(df_scaled)
    wsse = model.computeCost(df_scaled)
    print(f'wsse with {k} no. of clusters: {wsse}', end='\n\n')

No. of clusters 2
wsse with 2 no. of clusters: 601.7707512676716

No. of clusters 3
wsse with 3 no. of clusters: 434.75507308487647

No. of clusters 4
wsse with 4 no. of clusters: 267.1336116887891

No. of clusters 5
wsse with 5 no. of clusters: 248.97305882286832

No. of clusters 6
wsse with 6 no. of clusters: 227.2036624315653

No. of clusters 7
wsse with 7 no. of clusters: 214.4706560026703

No. of clusters 8
wsse with 8 no. of clusters: 196.7633646545008

No. of clusters 9
wsse with 9 no. of clusters: 184.7017624465714

No. of clusters 10
wsse with 10 no. of clusters: 179.90466857065763



#### Judging by the error alone, 4 clusters would look best: the WSSE drops sharply up to k=4 and only slowly after that. This is where domain knowledge beats the elbow method; had we followed the elbow method, we would have ended up with 4 clusters.
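To show where the elbow falls, here is a rough sketch that uses the WSSE values printed by the loop above. It picks the k with the largest change in slope (a discrete second difference), which is one simple heuristic for locating an elbow, not a standard Spark API. The function name `elbow` is hypothetical.

```python
# WSSE values copied from the loop output above
wsse = {
    2: 601.7707512676716,
    3: 434.75507308487647,
    4: 267.1336116887891,
    5: 248.97305882286832,
    6: 227.2036624315653,
    7: 214.4706560026703,
    8: 196.7633646545008,
    9: 184.7017624465714,
    10: 179.90466857065763,
}

def elbow(costs):
    """Return the k where the drop in cost slows down the most."""
    ks = sorted(costs)
    best_k, best_bend = None, float('-inf')
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = costs[prev_k] - costs[k]  # improvement gained by reaching k
        drop_after = costs[k] - costs[next_k]   # improvement gained by going past k
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

print(elbow(wsse))  # 4
```

The drop from k=3 to k=4 is about 168, while the drop from k=4 to k=5 is only about 18, which is why the heuristic lands on 4 even though the equal-share criterion points to 2.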