# Clustering Project

A tech company has been hacked. Fortunately, the forensic engineers were able to pull some metadata from the attacks, such as session time, location, WPM typing speed, etc. The company has 3 suspects for the attack and is certain that at least two of them were involved. It needs help determining whether a third hacker took part. The forensic engineer mentions that in the hackers' recent attacks, each hacker carried out roughly the same number of attacks, so if there were 1000 attacks from 2 hackers, each would have performed around 500.

### The data

**For each attack we have the following data:**

* 'Session_Connection_Time': How long the session lasted in minutes
* 'Bytes Transferred': Number of MB transferred during session
* 'Kali_Trace_Used': Indicates if the hacker was using Kali Linux
* 'Servers_Corrupted': Number of servers corrupted during the attack
* 'Pages_Corrupted': Number of pages illegally accessed
* 'Location': Location attack came from (Probably useless because the hackers used VPNs)
* 'WPM_Typing_Speed': Their estimated typing speed based on session logs.

In [3]:
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
import matplotlib.pyplot as plt

In [4]:

spark = SparkSession.builder.appName('hack_find').getOrCreate()

In [5]:
df = spark.read.csv("/FileStore/tables/hack_data.csv",inferSchema=True,header=True)

In [6]:
df.head()

In [7]:
df.describe().show()

In [8]:
df.columns

In [9]:
feat_cols = ['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used',
             'Servers_Corrupted', 'Pages_Corrupted','WPM_Typing_Speed']

In [10]:
vec_assembler = VectorAssembler(inputCols = feat_cols, outputCol='features')

In [11]:
final_df = vec_assembler.transform(df)

In [12]:
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)

In [13]:
scalerModel = scaler.fit(final_df)
cluster_final_data = scalerModel.transform(final_df)

Analysis of KMeans for both K=2 and K=3

In [15]:
# Instantiate 2 KMeans models (one for 3 hackers, one for 2)
kmeans3 = KMeans(featuresCol='scaledFeatures',k=3)
kmeans2 = KMeans(featuresCol='scaledFeatures',k=2)

In [16]:
# Fit the two models
model_k3 = kmeans3.fit(cluster_final_data)
model_k2 = kmeans2.fit(cluster_final_data)

In [17]:
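# Within Set Sum of Squared Errors for each model.
# Note: computeCost is not available on KMeansModel in Spark 3.x;
# there, model_k3.summary.trainingCost reports the cost on the training data.
sse_k3 = model_k3.computeCost(cluster_final_data)
sse_k2 = model_k2.computeCost(cluster_final_data)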

In [18]:
print("With K=3")
print("Within Set SSE for k=3 = " + str(round(sse_k3,2)))
print(' ')
print("With K=2")
print("Within Set SSE for k=2 = " + str(round(sse_k2,2)))

The SSE decreases as K increases, so although these two results are helpful, they cannot stand alone. Comparing the SSE across several values of K can also help indicate the correct value of K.
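To make concrete what this quantity measures, here is a minimal pure-Python sketch of the within-set SSE: the sum of squared distances from each point to its nearest center. The function name `within_set_sse` and the toy points are assumptions made for illustration; they are not part of Spark or of `hack_data.csv`.

```python
def within_set_sse(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        # Squared distance from p to every center; keep the smallest one
        total += min(
            sum((pi - ci) ** 2 for pi, ci in zip(p, c))
            for c in centers
        )
    return total


# Made-up 1-D toy data (NOT taken from hack_data.csv)
toy_points = [(0.0,), (1.0,), (9.0,), (10.0,)]

sse_one_center = within_set_sse(toy_points, [(5.0,)])
sse_two_centers = within_set_sse(toy_points, [(0.5,), (9.5,)])
sse_four_centers = within_set_sse(toy_points, [(0.0,), (1.0,), (9.0,), (10.0,)])

print(sse_one_center, sse_two_centers, sse_four_centers)  # 82.0 1.0 0.0
```

With one center per point the SSE reaches zero, which is why a lower SSE alone does not show that a larger K is the better choice.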

In [20]:
x = []
y = []
for k in range(2,9):    
    kmeans = KMeans(featuresCol='scaledFeatures',k=k)
    model = kmeans.fit(cluster_final_data)
    sse = model.computeCost(cluster_final_data)
    
    x.append(k)
    y.append(sse)

    print("Within Set SSE for K="+str(k) + " is " + str(round(sse,2)))
    print('')

Viewing the decrease in SSE as K increases, to try to visually determine the correct value of K using the elbow method:

In [22]:
%matplotlib inline
plt.plot(x,y)
plt.ylabel('SSE')
plt.xlabel('K');
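Reading an elbow off a plot is subjective. One rough way to put a number on it is to look for the K where the curve bends most sharply, measured by the second difference of the SSE values. The sketch below is a heuristic under stated assumptions, not a standard Spark feature: the helper name `elbow_by_second_difference` and the toy SSE values are made up for illustration.

```python
def elbow_by_second_difference(ks, sses):
    """Return (K, bend) for the interior K where the SSE curve bends most sharply.

    The bend at position i is sses[i-1] - 2*sses[i] + sses[i+1].
    Assumes the K values are evenly spaced (for example 2, 3, 4, ...).
    """
    if len(ks) != len(sses) or len(ks) < 3:
        raise ValueError("need at least three (K, SSE) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    # The first and last K have no neighbor on one side, so they are skipped
    for i in range(1, len(ks) - 1):
        bend = sses[i - 1] - 2 * sses[i] + sses[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k, best_bend


# Made-up SSE values with an obvious bend at K=3 (NOT results from this notebook)
toy_ks = [2, 3, 4, 5, 6]
toy_sses = [1000.0, 400.0, 350.0, 320.0, 300.0]

toy_elbow, toy_bend = elbow_by_second_difference(toy_ks, toy_sses)
print(toy_elbow, toy_bend)  # 3 550.0
```

It could be called with the `x` and `y` lists built in the loop above. Two caveats apply: the smallest and largest K in the sweep can never be selected, and when the curve bends gradually the largest bend is only slightly bigger than the rest, so the result should be treated as a hint at most.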



Unfortunately, there is no clear answer from this graph. However, since the hackers split the attacks evenly, each hacker should account for roughly the same number of attacks.

In [24]:
model_k3.transform(cluster_final_data).groupBy('prediction').count().show()

In [25]:
model_k2.transform(cluster_final_data).groupBy('prediction').count().show()

## By comparing the number of attacks per hacker for each value of K, it is clear that there were only 2 hackers!
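The "evenly split" comparison can also be expressed as a single number per model. The following is a minimal sketch, assuming a hypothetical helper `balance_ratio` that divides the smallest cluster size by the largest; the counts shown are invented for illustration and are not the output of the cells above.

```python
def balance_ratio(cluster_counts):
    """Smallest cluster size divided by the largest; 1.0 means a perfectly even split."""
    if not cluster_counts:
        raise ValueError("cluster_counts must not be empty")
    if min(cluster_counts) < 0 or max(cluster_counts) == 0:
        raise ValueError("counts must be non-negative with at least one non-zero")
    return min(cluster_counts) / max(cluster_counts)


# Invented counts (NOT results from this notebook)
even_split = [500, 500]
uneven_split = [500, 250, 250]

even_ratio = balance_ratio(even_split)
uneven_ratio = balance_ratio(uneven_split)
print(even_ratio, uneven_ratio)  # 1.0 0.5
```

In the notebook, the counts could be collected with something like `[row['count'] for row in model_k2.transform(cluster_final_data).groupBy('prediction').count().collect()]`. The value of K whose ratio is closest to 1.0 is the one most consistent with the forensic engineer's observation that the hackers share the work evenly.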