## Importing a Spark session into Colab

In [37]:
!pip install -q findspark
!pip install -q pyspark

import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cluster').getOrCreate()

## Importing the dataset

In [50]:
from pyspark.ml.clustering import KMeans

df = spark.read.csv("hack_data.csv",header=True,inferSchema=True)
df.show()

+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
|Session_Connection_Time|Bytes Transferred|Kali_Trace_Used|Servers_Corrupted|Pages_Corrupted|            Location|WPM_Typing_Speed|
+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
|                    8.0|           391.09|              1|             2.96|            7.0|            Slovenia|           72.37|
|                   20.0|           720.99|              0|             3.04|            9.0|British Virgin Is...|           69.08|
|                   31.0|           356.32|              1|             3.71|            8.0|             Tokelau|           70.58|
|                    2.0|           228.08|              1|             2.48|            8.0|             Bolivia|            70.8|
|                   20.0|            408.5|              0|             3.57

## Creating the features dataset

In [39]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [40]:
# Drop the irrelevant Location column; the remaining columns are the features the model uses to estimate the number of hackers

feat_cols = ['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used',
             'Servers_Corrupted', 'Pages_Corrupted','WPM_Typing_Speed']

In [41]:
#A feature transformer that merges multiple columns into a vector column

vec_assembler = VectorAssembler(inputCols = feat_cols, outputCol='features')

In [42]:
# Store the vectorized DataFrame in a variable

final_df = vec_assembler.transform(df)
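
To make the effect of `VectorAssembler` concrete, here is a plain-Python sketch of what happens to a single row. The helper `assemble_row` is hypothetical and not part of PySpark; it only mirrors the idea that the selected columns are packed, in order, into one numeric vector while `Location` is left out.

```python
# Plain-Python illustration of what VectorAssembler does to one row.
# `assemble_row` is a hypothetical helper, not a PySpark function.
feat_cols = ['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used',
             'Servers_Corrupted', 'Pages_Corrupted', 'WPM_Typing_Speed']


def assemble_row(row, input_cols):
    """Return the values of `input_cols` as one list of floats, in column order."""
    return [float(row[col]) for col in input_cols]


# First row of the df.show() output above.
first_row = {
    'Session_Connection_Time': 8.0,
    'Bytes Transferred': 391.09,
    'Kali_Trace_Used': 1,
    'Servers_Corrupted': 2.96,
    'Pages_Corrupted': 7.0,
    'Location': 'Slovenia',
    'WPM_Typing_Speed': 72.37,
}

features = assemble_row(first_row, feat_cols)
print(features)  # [8.0, 391.09, 1.0, 2.96, 7.0, 72.37]
```

In the notebook itself, the same vector ends up in the `features` column of `final_df`.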

## Applying K-means
To find the number of hackers, we use a clustering model that groups the attacks by how statistically similar they are. We fit it with k=2 and k=3, that is, we group the attacks into 2 and into 3 clusters, and then compare the cluster sizes. Since each hacker should account for the same number of attacks, the value of k that produces equally sized clusters tells us whether or not there was a 3rd hacker.

In [43]:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)
scalerModel = scaler.fit(final_df)
cluster_final = scalerModel.transform(final_df)
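
As a rough illustration of what `withStd=True, withMean=False` means, the sketch below divides a column by its sample standard deviation without subtracting the mean. It assumes, as I understand the Spark documentation, that `StandardScaler` uses the corrected sample standard deviation (n - 1 denominator). The helper `scale_to_unit_std` is hypothetical.

```python
from statistics import stdev  # sample standard deviation (n - 1 denominator)


def scale_to_unit_std(column):
    """Divide every value by the column's sample std; values are not centered."""
    s = stdev(column)
    return [x / s for x in column]


# First four Session_Connection_Time values from the df.show() output above.
session_time = [8.0, 20.0, 31.0, 2.0]
scaled = scale_to_unit_std(session_time)

print(round(stdev(scaled), 6))  # 1.0
```

Scaling matters here because K-means relies on distances: without it, a column with large values such as `Bytes Transferred` would likely dominate the others.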

In [51]:
kmeans3 = KMeans(featuresCol='scaledFeatures',k=3)
kmeans2 = KMeans(featuresCol='scaledFeatures',k=2)

model_k3 = kmeans3.fit(cluster_final)
model_k2 = kmeans2.fit(cluster_final)
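
For intuition about what `fit` does, here is a minimal sketch of plain Lloyd's algorithm on one-dimensional toy data. It is not Spark's implementation (Spark works on the full feature vectors and, by default, uses k-means|| initialization). The helpers `nearest` and `kmeans_1d`, the typing speeds, and the starting centers are all made up for illustration.

```python
def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda j: abs(point - centers[j]))


def kmeans_1d(points, centers, iterations=10):
    """Plain Lloyd's algorithm on 1-D points with explicit starting centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return [nearest(p, centers) for p in points], centers


# Toy typing speeds (hypothetical values, not taken from hack_data.csv).
toy_wpm = [40.0, 42.0, 44.0, 70.0, 72.0, 74.0]
labels, centers = kmeans_1d(toy_wpm, centers=[40.0, 74.0])

print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)  # [42.0, 72.0]
```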


## Final Results
The predictions for k=3 and k=2, with the number of attacks per cluster under the hypothetical scenarios of 3 or 2 hackers.

In [52]:
model_k3.transform(cluster_final).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   79|
|         2|   88|
|         0|  167|
+----------+-----+



In [53]:
model_k2.transform(cluster_final).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         0|  167|
+----------+-----+



## Conclusion

With k=2, both clusters contain the same number of attacks (167 each), whereas with k=3 the clusters are uneven (167, 88 and 79). According to the forensic engineer's statement that "the numbers of each hacker should be equal", this means there were 2 hackers.
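
The comparison behind this conclusion can be written down as a small check. This is only a sketch: `is_balanced` is a hypothetical helper, and the counts are copied by hand from the two tables above.

```python
def is_balanced(counts):
    """True when every cluster holds the same number of attacks."""
    return len(set(counts.values())) == 1


# Cluster sizes copied from the groupBy('prediction').count() outputs above.
counts_k3 = {0: 167, 1: 79, 2: 88}
counts_k2 = {0: 167, 1: 167}

print(is_balanced(counts_k3))  # False
print(is_balanced(counts_k2))  # True
```

In a live session the dictionaries could be built from the model output instead, for example with `{row['prediction']: row['count'] for row in model_k2.transform(cluster_final).groupBy('prediction').count().collect()}`.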



Question 2: Which features do you think really distinguish the 2 or 3 hackers?

In [54]:
df.describe().show()

+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|summary|Session_Connection_Time| Bytes Transferred|   Kali_Trace_Used|Servers_Corrupted|   Pages_Corrupted|   Location|  WPM_Typing_Speed|
+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|  count|                    334|               334|               334|              334|               334|        334|               334|
|   mean|     30.008982035928145| 607.2452694610777|0.5119760479041916|5.258502994011977|10.838323353293413|       null|57.342395209580864|
| stddev|     14.088200614636158|286.33593163576757|0.5006065264451406| 2.30190693339697|  3.06352633036022|       null| 13.41106336843464|
|    min|                    1.0|              10.0|                 0|              1.0|               6.0|Afghanistan|              40.0|
|    max|           

The model used all the features we fit it on. It is difficult to know whether one of them matters more than the others, but together they all help distinguish the hackers. The features are: 'Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used', 'Servers_Corrupted', 'Pages_Corrupted', 'WPM_Typing_Speed'.
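
One possible way to get a first hint about which features separate the clusters is to compare the cluster centers feature by feature. The sketch below ranks features by the absolute gap between two centers. `rank_features_by_gap` is a hypothetical helper, and the feature names and center values are placeholders, not results from this notebook.

```python
def rank_features_by_gap(names, center_a, center_b):
    """Sort features by the absolute difference between two cluster centers."""
    gaps = [(name, abs(a - b)) for name, a, b in zip(names, center_a, center_b)]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


# Placeholder names and centers for illustration only (NOT output of model_k2).
names = ['feature_a', 'feature_b', 'feature_c']
center_0 = [1.0, 2.0, 0.0]
center_1 = [3.5, 2.5, 2.0]

ranking = rank_features_by_gap(names, center_0, center_1)
print(ranking)  # [('feature_a', 2.5), ('feature_c', 2.0), ('feature_b', 0.5)]
```

With the fitted model, `names` would be `feat_cols` and the two centers would come from `model_k2.clusterCenters()`. Because the model was trained on `scaledFeatures`, the gaps are expressed in standard deviations, so they should be roughly comparable across features.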

In my opinion, one of the most important features could be "Kali_Trace_Used", which specifies whether a hacker used Linux or not. Its mean is very close to 0.5, suggesting that there are about as many Linux attacks as non-Linux ones, which could mean that the 2 hackers use different operating systems. Without knowledge of cybersecurity and a deeper understanding of the dataset, I can't know for sure whether the hackers' session time, bytes transferred, servers corrupted and pages corrupted are random or not, nor how much the model relies on them for the clustering.
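
Since `Kali_Trace_Used` only takes the values 0 and 1, its mean is the fraction of attacks with a Kali trace, so the `describe()` output can be turned back into counts. A small sketch of that arithmetic, using only the numbers printed above:

```python
n_rows = 334                    # 'count' row of df.describe() above
kali_mean = 0.5119760479041916  # 'mean' of Kali_Trace_Used above

# For a 0/1 column, mean * count gives the number of rows equal to 1.
with_kali = round(kali_mean * n_rows)
without_kali = n_rows - with_kali

print(with_kali, without_kali)  # 171 163
```

The split is close to, but not exactly, the 167/167 found with k=2, so this flag alone probably does not reproduce the two clusters one-to-one.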