# Determine hacking situation using K-Means Clustering & PySpark

The dataset records hacking sessions against the company that was attacked. Its features are:

* 'Session_Connection_Time': How long the session lasted in minutes
* 'Bytes Transferred': Number of MB transferred during session
* 'Kali_Trace_Used': Indicates if the hacker was using Kali Linux
* 'Servers_Corrupted': Number of servers corrupted during the attack
* 'Pages_Corrupted': Number of pages illegally accessed
* 'Location': Location the attack came from (probably useless because the hackers used VPNs)
* 'WPM_Typing_Speed': The hacker's estimated typing speed in words per minute, based on session logs


The technology firm suspects three hackers of perpetrating the attack. They are certain about the first two, but they are not sure whether the third was involved.

**The hackers carry out roughly the same number of attacks. For example, if there were 100 attacks in total, then in a two-hacker situation each hacker should account for about 50 attacks, and in a three-hacker situation each would account for about 33.**
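
The arithmetic behind this assumption can be written out in a few lines of plain Python. This is only a sketch: the function name `expected_attacks_per_hacker` is hypothetical and is not part of the original notebook.

```python
def expected_attacks_per_hacker(total_attacks, n_hackers):
    """Approximate number of attacks per hacker under an even split."""
    return total_attacks // n_hackers


# The example from the text: 100 attacks split between 2 or 3 hackers
two_hackers = expected_attacks_per_hacker(100, 2)
three_hackers = expected_attacks_per_hacker(100, 3)
print(two_hackers, three_hackers)  # 50 33
```

If the clusters found later have sizes close to one of these even splits, that split is the more plausible hacker count.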

In [0]:
# Start a spark session
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('hack').getOrCreate()

In [0]:
# Import data
data=spark.read.csv('dbfs:/FileStore/shared_uploads/hrishagni95@gmail.com/hack_data-1.csv',inferSchema=True,header=True)

In [0]:
data.show()

In [0]:
# Import VectorAssembler to assemble the necessary features for clustering
from pyspark.ml.feature import VectorAssembler

In [0]:
# Create the assembler
assembler=VectorAssembler(inputCols=['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'WPM_Typing_Speed'],outputCol='features')

In [0]:
# Transform the original data with the assembler to combine the features into a single vector column
final_data=assembler.transform(data)

In [0]:
# Sneak peek of the transformed data
final_data.show()

In [0]:
# Keep only the necessary column
final_data=final_data.select('features')

In [0]:
# Import StandardScaler to scale 'features', so that features with large ranges do not dominate the distance computation
from pyspark.ml.feature import StandardScaler

In [0]:
# Create an instance of StandardScaler
scale_model=StandardScaler(inputCol='features',outputCol='ScaledFeatures')

In [0]:
# Fit the scaler on the filtered data
scaled_fit_model=scale_model.fit(final_data)

In [0]:
# Use the fitted scaler to transform (scale) the data
scaled_data=scaled_fit_model.transform(final_data)

In [0]:
# Sneak peek at scaled data
scaled_data.head(3)
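
With its default settings (`withStd=True`, `withMean=False`), Spark's `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean. A minimal pure-Python sketch of that idea for a single column follows. The helper name `scale_column` and the numbers are made up for illustration and are not taken from the dataset.

```python
from statistics import stdev


def scale_column(values):
    """Divide each value by the column's sample standard deviation (no centering)."""
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [v / sd for v in values]


# Made-up numbers purely for illustration, not taken from hack_data
session_minutes = [2.0, 4.0, 6.0]
scaled = scale_column(session_minutes)
print(scaled)
```

Because the mean is not removed, scaled values keep their sign and relative order; only their spread changes.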

In [0]:
# Import KMeans
from pyspark.ml.clustering import KMeans

In [0]:
# Create an instance of KMeans with k=2, that is with 2 clusters
kmeans=KMeans(featuresCol='ScaledFeatures',k=2).setSeed(1)

In [0]:
# Fit the scaled data
kmeans_model=kmeans.fit(scaled_data)

In [0]:
# Transform the scaled data with the fitted model to get the predictions
pred=kmeans_model.transform(scaled_data)

In [0]:
# Group the predictions to get their count
pred.groupBy('prediction').count().show()

In [0]:
# The two clusters contain matching numbers of sessions. Since the hackers are assumed to carry out roughly equal numbers of attacks,
# this supports the forensic team's analysis and suggests that there were 2 hackers, not 3
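
The notebook only fits k=2. One way to test the three-hacker alternative would be to fit both k=2 and k=3 and compare how evenly the sessions split. The sketch below assumes the `scaled_data` DataFrame and its `ScaledFeatures` column from the cells above. The helper names `cluster_sizes` and `balance_ratio` are hypothetical and not part of the original notebook.

```python
def cluster_sizes(scaled_df, k, seed=1):
    """Fit KMeans with k clusters on scaled_df and return {cluster_id: row_count}."""
    # Imported here so that the pure-Python helper below works without Spark
    from pyspark.ml.clustering import KMeans

    model = KMeans(featuresCol='ScaledFeatures', k=k, seed=seed).fit(scaled_df)
    rows = model.transform(scaled_df).groupBy('prediction').count().collect()
    return {row['prediction']: row['count'] for row in rows}


def balance_ratio(sizes):
    """Smallest cluster size divided by the largest; 1.0 means perfectly even."""
    counts = list(sizes.values())
    return min(counts) / max(counts)


# Intended use in the notebook (requires the scaled_data DataFrame from above):
# for k in (2, 3):
#     sizes = cluster_sizes(scaled_data, k)
#     print(k, sizes, balance_ratio(sizes))
```

Under the equal-attacks assumption, the value of k whose ratio is closest to 1.0 is the more plausible number of hackers.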

In [0]:
# Import ClusteringEvaluator to evaluate the model
from pyspark.ml.evaluation import ClusteringEvaluator

In [0]:
# Create the evaluator; its default metric is the silhouette score
silhouette=ClusteringEvaluator(featuresCol='ScaledFeatures')

In [0]:
# Compute the silhouette score, which measures how well each point fits its own cluster compared with the nearest other cluster (range -1 to 1, higher is better)
silhouette.evaluate(pred)
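
To make the score concrete, here is a simplified pure-Python sketch of the standard silhouette definition. It uses squared Euclidean distance, which is also the default distance measure of `ClusteringEvaluator`. It is not Spark's implementation: the function names and the four toy points are made up for illustration, and the code assumes at least two clusters with at least two points each.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette_score(points, labels):
    """Average silhouette over all points (each cluster needs >= 2 points)."""
    n = len(points)
    total = 0.0
    for i in range(n):
        # Mean squared distance from point i to every cluster, excluding itself
        sums, counts = {}, {}
        for j in range(n):
            if i == j:
                continue
            lab = labels[j]
            sums[lab] = sums.get(lab, 0.0) + sq_dist(points[i], points[j])
            counts[lab] = counts.get(lab, 0) + 1
        means = {lab: sums[lab] / counts[lab] for lab in sums}
        a = means[labels[i]]  # cohesion: distance to own cluster
        b = min(m for lab, m in means.items() if lab != labels[i])  # separation
        total += (b - a) / max(a, b)
    return total / n


# Toy data: two tight, well-separated clusters on a line
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
print(score)
```

Tight, well-separated clusters such as these give a score close to 1, while overlapping clusters push it toward 0 or below.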

In [0]:
# Display the cluster centers (in the scaled feature space)
kmeans_model.clusterCenters()
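
Because the model was fitted on `ScaledFeatures`, these centers are in scaled units. Since the scaler divided by the standard deviation without centering (the default), a center can be mapped back to the original units by multiplying each coordinate by that feature's standard deviation. A minimal sketch follows. The helper name `unscale_center` and the numbers are made up for illustration.

```python
def unscale_center(center, std):
    """Multiply each scaled coordinate by that feature's standard deviation."""
    return [c * s for c, s in zip(center, std)]


# Made-up numbers for illustration only; in the notebook the inputs would come
# from kmeans_model.clusterCenters() and scaled_fit_model.std
example_center = [1.5, 2.0]
example_std = [10.0, 4.0]
original_units = unscale_center(example_center, example_std)
print(original_units)  # [15.0, 8.0]
```

Centers in original units are easier to read, for example as a typical session length in minutes or a typical typing speed for each cluster.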