# Clustering Consulting Project

A large technology firm needs help: they've been hacked! Luckily, their forensic engineers have grabbed valuable data about the hacks, including information like session time, locations, WPM typing speed, etc. The forensic engineer relates what she has been able to figure out so far: she has been able to grab metadata for each session that the hackers used to connect to the firm's servers. These are the features of the data:

* 'Session_Connection_Time': How long the session lasted in minutes
* 'Bytes Transferred': Number of MB transferred during session
* 'Kali_Trace_Used': Indicates if the hacker was using Kali Linux
* 'Servers_Corrupted': Number of servers corrupted during the attack
* 'Pages_Corrupted': Number of pages illegally accessed
* 'Location': Location attack came from (Probably useless because the hackers used VPNs)
* 'WPM_Typing_Speed': Their estimated typing speed based on session logs.


The technology firm has 3 potential hackers who may have perpetrated the attack. They're certain of the first two hackers, but they aren't sure whether the third hacker was involved.

**One last key fact: the forensic engineer knows that the hackers trade off attacks, meaning they should each have roughly the same number of attacks. For example, if there were 100 total attacks, then in a 2-hacker situation each should have about 50 hacks; in a 3-hacker situation each would have about 33 hacks. The engineer believes this is the key element to solving this, but doesn't know how to separate this unlabeled data into groups of hackers.**

In [None]:
! pip install pyspark

Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [None]:
# import the library
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

In [None]:
#spark and load dataset
spark = SparkSession.builder.appName('hackFind').getOrCreate()
data = spark.read.csv('hack_data.csv', inferSchema=True, header=True)
data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)



In [None]:
data.show()

+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
|Session_Connection_Time|Bytes Transferred|Kali_Trace_Used|Servers_Corrupted|Pages_Corrupted|            Location|WPM_Typing_Speed|
+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
|                    8.0|           391.09|              1|             2.96|            7.0|            Slovenia|           72.37|
|                   20.0|           720.99|              0|             3.04|            9.0|British Virgin Is...|           69.08|
|                   31.0|           356.32|              1|             3.71|            8.0|             Tokelau|           70.58|
|                    2.0|           228.08|              1|             2.48|            8.0|             Bolivia|            70.8|
|                   20.0|            408.5|              0|             3.57

In [None]:
data.describe().show()

+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|summary|Session_Connection_Time| Bytes Transferred|   Kali_Trace_Used|Servers_Corrupted|   Pages_Corrupted|   Location|  WPM_Typing_Speed|
+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|  count|                    334|               334|               334|              334|               334|        334|               334|
|   mean|     30.008982035928145| 607.2452694610777|0.5119760479041916|5.258502994011977|10.838323353293413|       null|57.342395209580864|
| stddev|     14.088200614636158|286.33593163576757|0.5006065264451406| 2.30190693339697|  3.06352633036022|       null| 13.41106336843464|
|    min|                    1.0|              10.0|                 0|              1.0|               6.0|Afghanistan|              40.0|
|    max|           

In [None]:
data.columns

['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'Location',
 'WPM_Typing_Speed']

In [None]:
# Assemble the numeric columns into a single feature vector ('Location' is left out)
assembler = VectorAssembler(inputCols=[ 'Session_Connection_Time',
                                        'Bytes Transferred',
                                        'Kali_Trace_Used',
                                        'Servers_Corrupted',
                                        'Pages_Corrupted',
                                        'WPM_Typing_Speed'],
                            outputCol='features')
dataFinal = assembler.transform(data)

In [None]:
#data scaling
scaler = StandardScaler(inputCol='features', outputCol='featuresScaled', withStd=True, withMean=False)

#Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(dataFinal)

# Normalize each feature to have unit standard deviation.
dataCluster = scalerModel.transform(dataFinal)
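To make the scaling step concrete, here is a minimal pure-Python sketch of what `withStd=True, withMean=False` does to one column: each value is divided by the column's sample standard deviation, and nothing is subtracted. The helper `scale_to_unit_std` is a hypothetical name used only for illustration, and the inputs are just the first four rows printed by `data.show()`, so the resulting numbers will differ from what Spark computes over all 334 rows.

```python
from statistics import mean, stdev

def scale_to_unit_std(column):
    """Divide every value by the column's sample standard deviation.

    Mirrors the idea of withStd=True, withMean=False: the spread becomes 1,
    but the column is not centered. Assumes the column is not constant.
    """
    s = stdev(column)  # sample standard deviation (n - 1 in the denominator)
    return [x / s for x in column]

# First four rows shown by data.show() (illustration only, not the full data)
bytes_transferred = [391.09, 720.99, 356.32, 228.08]
wpm_typing_speed = [72.37, 69.08, 70.58, 70.8]

scaled_bytes = scale_to_unit_std(bytes_transferred)
scaled_wpm = scale_to_unit_std(wpm_typing_speed)

# Both columns now have unit spread, so neither dominates the distance
print(stdev(scaled_bytes), stdev(scaled_wpm))
# The means are not zero, because withMean=False skips centering
print(mean(scaled_bytes), mean(scaled_wpm))
```

Without this step, a column measured in hundreds (such as 'Bytes Transferred') would outweigh a column measured in single digits when KMeans computes distances.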

**Time to find out whether it's 2 or 3!**

In [None]:
#KMeans model with 2 and 3 cluster
kmeans2 = KMeans(featuresCol ='featuresScaled', k=2)
kmeans3 = KMeans(featuresCol ='featuresScaled', k=3)

#KMeans fit 
modelK2 = kmeans2.fit(dataCluster)
modelK3 = kmeans3.fit(dataCluster)

#model predictions
predictionsK2 = modelK2.transform(dataCluster)
predictionsK3 = modelK3.transform(dataCluster)

#KMeans evaluation: ClusteringEvaluator defaults to the silhouette score
#(computed with squared Euclidean distance); it ranges from -1 to 1, higher is better
evaluator = ClusteringEvaluator()

modelK2Eval = evaluator.evaluate(predictionsK2)
modelK3Eval = evaluator.evaluate(predictionsK3)

print("Silhouette score for k=2  = " + str(modelK2Eval))
print('-'*80)
print("Silhouette score for k=3  = " + str(modelK3Eval))

Silhouette score for k=2  = 0.6683623593283755
--------------------------------------------------------------------------------
Silhouette score for k=3  = 0.30412315937808737


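The evaluator alone does not settle the question. Note that `ClusteringEvaluator` reports the silhouette score by default (computed with squared Euclidean distance), which ranges from -1 to 1 with higher values meaning better-separated clusters; it is not a within-cluster distance that must shrink as K grows. K=2 scores higher than K=3 here, but we can continue the analysis by running K from 2 through 8 to check whether the clustering favors even or odd numbers. This won't be conclusive, but it's worth a look: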
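To make the evaluator's number concrete, here is a minimal pure-Python sketch of a silhouette score using squared Euclidean distance. It is an illustration on hypothetical toy points, not Spark's implementation; the names `sq_dist`, `silhouette`, and the sample data are assumptions made for this example.

```python
from statistics import mean

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(sq_dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(mean(sq_dist(p, q) for q in members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return mean(scores)

# Hypothetical toy data: two tight groups far apart
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette(points, [0, 0, 1, 1])  # labels match the two groups
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across the groups

print("good labelling:", good)
print("bad labelling:", bad)
```

A labelling that matches the real groups scores close to 1, while one that cuts across them scores below 0, which is how negative values for a given K should be read.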

In [None]:
from pyspark.pandas import DataFrame

In [None]:
distance = []  # silhouette score for each K
K = []
evaluator = ClusteringEvaluator()  # default metric: silhouette
for k in range(2,9):
    kmeans = KMeans(featuresCol='featuresScaled',k=k)
    model = kmeans.fit(dataCluster)
    predictions = model.transform(dataCluster)
    silhouette = evaluator.evaluate(predictions)
    print("With K={}".format(k))
    print("Silhouette score = " + str(silhouette))
    print('--'*30)
    distance.append(silhouette)
    K.append(k)

With K=2
Silhouette score = 0.6683623593283755
------------------------------------------------------------
With K=3
Silhouette score = 0.30412315937808737
------------------------------------------------------------
With K=4
Silhouette score = -0.04792891045570489
------------------------------------------------------------
With K=5
Silhouette score = -0.09700416254857948
------------------------------------------------------------
With K=6
Silhouette score = -0.19010616305778094
------------------------------------------------------------
With K=7
Silhouette score = -0.15812655957480537
------------------------------------------------------------
With K=8
Silhouette score = -0.28710986231284036
------------------------------------------------------------


In [None]:
kmeansK4 = KMeans(featuresCol='featuresScaled',k=4)
modelK4 = kmeansK4.fit(dataCluster)
predictionsK4 = modelK4.transform(dataCluster)
predictionsK4.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   79|
|         3|   83|
|         2|   88|
|         0|   84|
+----------+-----+



In [None]:
kmeansK2 = KMeans(featuresCol='featuresScaled',k=2)
modelK2 = kmeansK2.fit(dataCluster)
predictionsK2 = modelK2.transform(dataCluster)
predictionsK2.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         0|  167|
+----------+-----+



**Nothing definitive can be said from the scores above, but the last key fact the engineer mentioned was that the attacks should be evenly split between the hackers. K=2 gives a perfectly even split (167 and 167), and K=4 gives a roughly even one (79 to 88 per cluster).**
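One way to put a number on "evenly split" is to measure how far the largest or smallest cluster strays from a perfectly even share. The sketch below does this for the counts printed above; `max_relative_deviation` is a hypothetical helper introduced for illustration, and the K=3 counts are left out because they are not shown in this notebook (they could be obtained with `predictionsK3.groupBy('prediction').count().show()`).

```python
def max_relative_deviation(counts):
    """Largest gap between a cluster's size and an even share, as a fraction of that share."""
    even_share = sum(counts) / len(counts)
    return max(abs(c - even_share) for c in counts) / even_share

# Cluster sizes copied from the groupBy('prediction').count() outputs above
counts_k2 = [167, 167]
counts_k4 = [79, 83, 88, 84]

dev_k2 = max_relative_deviation(counts_k2)
dev_k4 = max_relative_deviation(counts_k4)

print("K=2 deviation from an even split:", dev_k2)
print("K=4 deviation from an even split:", dev_k4)
```

A value of 0 means every cluster has exactly the same number of sessions; the closer to 0, the better the clustering agrees with the engineer's "hackers trade off attacks" observation.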