# K-means Clustering - Technology Firm's Hacked Data

Aim - Use data on the hacking sessions, assembled by the forensic engineers of a technology firm, to identify how many hackers perpetrated the attack.

The features of the data are as follows:

* 'Session_Connection_Time': How long the session lasted in minutes
* 'Bytes Transferred': Number of MB transferred during session
* 'Kali_Trace_Used': Indicates if the hacker was using Kali Linux
* 'Servers_Corrupted': Number of servers corrupted during the attack
* 'Pages_Corrupted': Number of pages illegally accessed
* 'Location': Location attack came from (Probably useless because the hackers used VPNs)
* 'WPM_Typing_Speed': Their estimated typing speed based on session logs.

Key evidence - The forensic engineer knows that the hackers trade off attacks, meaning each of them should have carried out roughly the same number of attacks. For example, with 100 total attacks, each hacker would have about 50 attacks in a 2-hacker situation and about 33 in a 3-hacker situation.
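
The arithmetic behind this evidence can be sketched in a few lines of plain Python. The helper name `expected_split` is hypothetical and only illustrates what an even split of the attacks would look like for a given number of hackers.

```python
def expected_split(total_attacks, n_hackers):
    """Split total_attacks as evenly as possible among n_hackers."""
    base, extra = divmod(total_attacks, n_hackers)
    # The first `extra` hackers get one additional attack each.
    return [base + 1 if i < extra else base for i in range(n_hackers)]


print(expected_split(100, 2))  # [50, 50]
print(expected_split(100, 3))  # [34, 33, 33]
```

If the cluster sizes found later are close to one of these splits, that cluster count is consistent with the forensic engineer's statement.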

Steps to follow: 

1. Create a Spark Session and load the data
2. Check for missing values (if there are any, drop or fill them)
3. Check whether the data is already in the single-vector 'features' format that Spark ML expects (since this is unsupervised learning, no labels are needed)
4. If it is not in 'features' format, assemble the features using an assembler.
5. Scale the features so that features with large ranges do not dominate the distance computation (even when there aren't many features, scaling is good practice in distance-based unsupervised learning)
6. Import KMeans and create an instance of it
7. Fit multiple models with different numbers of clusters and use the elbow method to look for a sharp drop in WSSSE at a particular number of clusters
8. If this method fails to provide any evidence on the number of hackers, check the number of attacks assigned to each hacker for several cluster counts. If, for a particular number of clusters, the attacks are split evenly among the hackers, we have the answer.

In [1]:
# Create a spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cluster').getOrCreate()

In [2]:
# Loads data
data = spark.read.csv("hack_data.csv",header=True,inferSchema=True)

In [3]:
data.head().asDict()

{'Bytes Transferred': 391.09,
 'Kali_Trace_Used': 1,
 'Location': 'Slovenia',
 'Pages_Corrupted': 7.0,
 'Servers_Corrupted': 2.96,
 'Session_Connection_Time': 8.0,
 'WPM_Typing_Speed': 72.37}

All features are numerical except for Location. Because the hackers were probably using VPNs, Location is unlikely to be informative, so we can leave it out of the feature set.

In [4]:
# Check for any missing values
from pyspark.sql.functions import isnan, isnull, when, count, col

data.select([count(when(isnan(c)| isnull(c), c)).alias(c) for c in data.columns]).show()

+-----------------------+-----------------+---------------+-----------------+---------------+--------+----------------+
|Session_Connection_Time|Bytes Transferred|Kali_Trace_Used|Servers_Corrupted|Pages_Corrupted|Location|WPM_Typing_Speed|
+-----------------------+-----------------+---------------+-----------------+---------------+--------+----------------+
|                      0|                0|              0|                0|              0|       0|               0|
+-----------------------+-----------------+---------------+-----------------+---------------+--------+----------------+



In [5]:
# Get a summary of the data
data.describe().show()

+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|summary|Session_Connection_Time| Bytes Transferred|   Kali_Trace_Used|Servers_Corrupted|   Pages_Corrupted|   Location|  WPM_Typing_Speed|
+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|  count|                    334|               334|               334|              334|               334|        334|               334|
|   mean|     30.008982035928145| 607.2452694610777|0.5119760479041916|5.258502994011977|10.838323353293413|       null|57.342395209580864|
| stddev|     14.088200614636158|286.33593163576757|0.5006065264451406| 2.30190693339697|  3.06352633036022|       null| 13.41106336843464|
|    min|                    1.0|              10.0|                 0|              1.0|               6.0|Afghanistan|              40.0|
|    max|           

In [6]:
# Check columns
data.columns

['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'Location',
 'WPM_Typing_Speed']

In [7]:
from pyspark.ml.feature import VectorAssembler

In [8]:
cols = ['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used',
             'Servers_Corrupted', 'Pages_Corrupted','WPM_Typing_Speed']

In [9]:
assembler = VectorAssembler(inputCols = cols, outputCol='features')

In [10]:
final_data = assembler.transform(data)

In [11]:
final_data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)
 |-- features: vector (nullable = true)



In [12]:
# Scale the features so that large-valued ones (e.g. Bytes Transferred) do not dominate the distance computation

from pyspark.ml.feature import StandardScaler

In [13]:
# Scale with respect to standard deviation

scaler = StandardScaler(inputCol="features", 
                        outputCol="scaledFeatures", 
                        withStd=True, 
                        withMean=False)

In [14]:
# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(final_data)

In [15]:
# Normalize each feature to have unit standard deviation.
cluster_final_data = scalerModel.transform(final_data)
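
With `withStd=True` and `withMean=False`, the scaler divides each feature by its standard deviation without centering it. Below is a minimal pure-Python sketch of that arithmetic, assuming the corrected sample standard deviation is used. The function `scale_to_unit_std` and the toy rows are hypothetical illustrations, not part of the notebook or of Spark's API.

```python
from statistics import stdev


def scale_to_unit_std(rows):
    """Divide each column by its sample standard deviation (no centering)."""
    columns = list(zip(*rows))
    stds = [stdev(col) for col in columns]  # assumes no column is constant
    return [[value / s for value, s in zip(row, stds)] for row in rows]


# Hypothetical toy rows: (session minutes, MB transferred)
rows = [(8.0, 391.09), (20.0, 720.99), (31.0, 356.32), (2.0, 228.08)]
scaled = scale_to_unit_std(rows)

# After scaling, every column has a standard deviation of 1 (up to rounding).
for col in zip(*scaled):
    print(round(stdev(col), 6))
```

Without this step, the megabyte column would contribute far more to Euclidean distances than the minutes column simply because its values are larger.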

In [16]:
# Make sure it works
cluster_final_data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- scaledFeatures: vector (nullable = true)



In [17]:
from pyspark.ml.clustering import KMeans

In [18]:
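# Check whether it was 2 or 3 hackers
# Note: computeCost works on Spark 2.x; it was deprecated in 2.4 and removed in 3.0,
# where model.summary.trainingCost reports the cost on the training data instead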

# For 3 hackers
kmeans3 = KMeans(featuresCol='scaledFeatures',k=3)
model_k3 = kmeans3.fit(cluster_final_data)
wssse_k3 = model_k3.computeCost(cluster_final_data)

# For 2 hackers
kmeans2 = KMeans(featuresCol='scaledFeatures',k=2)
model_k2 = kmeans2.fit(cluster_final_data)
wssse_k2 = model_k2.computeCost(cluster_final_data)

In [19]:
print("For 3 hackers: ")
print("Within Set Sum of Squared Errors = " + str(wssse_k3))
print('--'*30)
print("For 2 hackers: ")
print("Within Set Sum of Squared Errors = " + str(wssse_k2))

For 3 hackers: 
Within Set Sum of Squared Errors = 434.1492898715845
------------------------------------------------------------
For 2 hackers: 
Within Set Sum of Squared Errors = 601.7707512676716


These two values alone are not enough to tell whether there were 2 or 3 hackers: in K-means the WSSSE generally decreases as the number of clusters increases, so a lower value for 3 clusters is expected. We can compute the WSSSE for more cluster counts and check whether it drops sharply at a particular number of clusters (the elbow method).
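
To make the quantity concrete, here is a small pure-Python sketch of a within-set sum of squared errors: each point contributes its squared Euclidean distance to the nearest centroid. The function name `wssse` and the toy points are hypothetical, and the centroids are chosen by hand rather than fitted.

```python
def wssse(points, centroids):
    """Sum over points of the squared distance to the nearest centroid."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
        )
    return total


# Two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

print(wssse(points, [(5.0, 1.0)]))               # 104.0
print(wssse(points, [(0.0, 1.0), (10.0, 1.0)]))  # 4.0
```

Going from one centroid to two removes most of the error here because the data really has two groups. The elbow method looks for that kind of large drop followed by much smaller ones.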

In [20]:
for k in range(2,9):
    kmeans = KMeans(featuresCol='scaledFeatures',k=k)
    model = kmeans.fit(cluster_final_data)
    wssse = model.computeCost(cluster_final_data)
    print("For {} hackers: ".format(k))
    print("Within Set Sum of Squared Errors = " + str(wssse))
    print('--'*30)

For 2 hackers: 
Within Set Sum of Squared Errors = 601.7707512676716
------------------------------------------------------------
For 3 hackers: 
Within Set Sum of Squared Errors = 434.1492898715845
------------------------------------------------------------
For 4 hackers: 
Within Set Sum of Squared Errors = 267.1336116887891
------------------------------------------------------------
For 5 hackers: 
Within Set Sum of Squared Errors = 401.0534138416211
------------------------------------------------------------
For 6 hackers: 
Within Set Sum of Squared Errors = 240.58180197600024
------------------------------------------------------------
For 7 hackers: 
Within Set Sum of Squared Errors = 224.75676930192807
------------------------------------------------------------
For 8 hackers: 
Within Set Sum of Squared Errors = 228.32460058156488
------------------------------------------------------------


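There is no clear elbow in the WSSSE for any of the cluster counts considered above. The value even rises at 5 and 8 clusters, which can happen because K-means depends on its random initialization and may converge to a local optimum.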
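
One way to read these numbers is to look at the relative change between consecutive cluster counts. The sketch below uses the WSSSE values copied from the output above. The helper `relative_drops` is a hypothetical name, and this is only one rough way to look for an elbow.

```python
# WSSSE values copied from the output above, keyed by number of clusters.
wssse_by_k = {
    2: 601.7707512676716,
    3: 434.1492898715845,
    4: 267.1336116887891,
    5: 401.0534138416211,
    6: 240.58180197600024,
    7: 224.75676930192807,
    8: 228.32460058156488,
}


def relative_drops(costs):
    """Fractional decrease in cost when moving from the previous k to k."""
    ks = sorted(costs)
    return {
        k: (costs[prev] - costs[k]) / costs[prev]
        for prev, k in zip(ks, ks[1:])
    }


for k, drop in relative_drops(wssse_by_k).items():
    # A negative value means the cost went up rather than down.
    print("k={}: {:+.1%}".format(k, drop))
```

A clean elbow would show one large positive drop followed by consistently small ones. Here the sign flips more than once, so the curve does not single out a cluster count.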

Lastly, we can check how many attacks fall into each cluster, since the forensic engineer said that the hackers carried out roughly the same number of attacks.

In [21]:
# Check number of attacks for 3 hackers

model_k3.transform(cluster_final_data).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         2|   83|
|         0|   84|
+----------+-----+



In [22]:
# Check number of attacks for 2 hackers

model_k2.transform(cluster_final_data).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         0|  167|
+----------+-----+



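As seen above, the attacks split evenly (167 each) across 2 clusters, whereas with 3 clusters the split is uneven (167, 83 and 84). Given that the hackers trade off attacks, this strongly suggests that 2 hackers perpetrated the attack on the technology firm.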
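
The comparison can also be expressed as a single number per cluster count. The sketch below measures how far the largest deviation is from a perfectly even share, using the counts shown above. The function name `max_relative_deviation` is hypothetical, and other balance measures would work as well.

```python
def max_relative_deviation(counts):
    """Largest deviation from the mean cluster size, as a fraction of the mean."""
    mean = sum(counts) / len(counts)
    return max(abs(c - mean) for c in counts) / mean


print(max_relative_deviation([167, 167]))     # 0.0 for the 2-cluster split
print(max_relative_deviation([167, 83, 84]))  # about 0.5 for the 3-cluster split
```

A value near 0 means the clusters are almost the same size, which is what the trade-off evidence predicts for the true number of hackers.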

________