A technology start-up in California needs your help! They were recently hacked and want to find out more about the hackers!

Luckily, their forensic engineers have grabbed valuable data about the hacks, including information like session time, locations, wpm typing speed, etc. Use your machine-learning skills to predict how many hackers (two or three) took part in the attacks, and bust them!

First, we need to create the Spark Session

In [1]:
#In Colab, we need to install everything:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://mirrors.sonic.net/apache/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
!tar xzf spark-3.1.2-bin-hadoop3.2.tgz
!pip install -q findspark


import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"


import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

#In a native Jupyter notebook, we would simply do:
#from pyspark.sql import SparkSession
#spark = SparkSession.builder.appName('seedfinder').getOrCreate()

Afterwards, we can read the file and inspect it

In [2]:
#Please drop the file in the environment's 'Files' panel
df = spark.read.options(header="true", inferSchema="true").csv("/content/hack_data.csv")
df.describe().toPandas()

Unnamed: 0,summary,Session_Connection_Time,Bytes Transferred,Kali_Trace_Used,Servers_Corrupted,Pages_Corrupted,Location,WPM_Typing_Speed
0,count,334.0,334.0,334.0,334.0,334.0,334,334.0
1,mean,30.008982035928145,607.2452694610777,0.5119760479041916,5.258502994011977,10.838323353293411,,57.342395209580864
2,stddev,14.088200614636158,286.3359316357676,0.5006065264451406,2.30190693339697,3.06352633036022,,13.41106336843464
3,min,1.0,10.0,0.0,1.0,6.0,Afghanistan,40.0
4,max,60.0,1330.5,1.0,10.0,15.0,Zimbabwe,75.0


The idea for this assignment is to use clustering methods to see if we can find which attacks belong to which hacker: essentially, we want to create n groups of attacks, where n is the number of hackers involved. We are also told that hackers like to divide work equally; so, for example, since we have 334 attacks (the count reported above), 3 hackers would do about 111 attacks each, while 2 hackers would do 167 attacks each, and so on. I will use K-means clustering, which is not surprising, given it's the only one we have been told how to use 😋.

We were told the "Location" feature is not really important due to VPN use, but I think including it might be interesting nonetheless, if not for the final result, just to learn a bit! So, here I use the StringIndexer to convert it from strings to numeric indices:
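The equal-split hint can be turned into a concrete check once cluster assignments exist. The sketch below is plain Python rather than Spark; `expected_share` and `is_balanced` are hypothetical helper names, the labels are made up, and the 5% tolerance is an arbitrary assumption, not something given in the assignment.

```python
from collections import Counter

def expected_share(n_attacks, n_hackers):
    """Attacks per hacker if the work is split evenly."""
    return n_attacks / n_hackers

def is_balanced(predictions, tolerance=0.05):
    """True if every cluster size is within `tolerance` (relative) of an even split."""
    counts = Counter(predictions)
    share = expected_share(len(predictions), len(counts))
    return all(abs(size - share) <= tolerance * share for size in counts.values())

# 334 is the row count reported by describe()
print(expected_share(334, 2))            # 167.0
print(round(expected_share(334, 3), 1))  # 111.3

# Made-up cluster labels, only to show the check
even_labels = [0] * 50 + [1] * 50
skewed_labels = [0] * 80 + [1] * 20
print(is_balanced(even_labels))    # True
print(is_balanced(skewed_labels))  # False
```

In the notebook, once the models further down have produced `results2` and `results3`, the cluster sizes could be read with something like `results2.groupBy('prediction').count().show()`.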

In [3]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

#Bonus! Change this code to index multiple columns at once!
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index") for column in ["Location"]]


pipeline = Pipeline(stages=indexers)
df_indexed = pipeline.fit(df).transform(df)
df_indexed.describe().toPandas()

Unnamed: 0,summary,Session_Connection_Time,Bytes Transferred,Kali_Trace_Used,Servers_Corrupted,Pages_Corrupted,Location,WPM_Typing_Speed,Location_index
0,count,334.0,334.0,334.0,334.0,334.0,334,334.0,334.0
1,mean,30.008982035928145,607.2452694610777,0.5119760479041916,5.258502994011977,10.838323353293411,,57.342395209580864,64.99700598802396
2,stddev,14.088200614636158,286.3359316357676,0.5006065264451406,2.30190693339697,3.06352633036022,,13.41106336843464,50.98975334284259
3,min,1.0,10.0,0.0,1.0,6.0,Afghanistan,40.0,0.0
4,max,60.0,1330.5,1.0,10.0,15.0,Zimbabwe,75.0,180.0
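To make the indexing step less of a black box, here is a plain-Python sketch of the kind of mapping StringIndexer builds under its default `frequencyDesc` ordering: the most frequent label gets index 0.0, and (in Spark 3.x) ties are broken alphabetically. `build_string_index` is a hypothetical name and the country list is made up.

```python
from collections import Counter

def build_string_index(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}

locations = ["Spain", "Chile", "Spain", "Japan", "Chile", "Spain"]  # invented data
mapping = build_string_index(locations)
print(mapping)  # {'Spain': 0.0, 'Chile': 1.0, 'Japan': 2.0}
```

Note that these indices carry no real ordering: K-means will treat index 180 as "far" from index 0 even though the countries have no such relationship, which is one more reason the hint to ignore Location is sensible.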


Now, we can use the VectorAssembler to define our "features" column:

In [4]:
from pyspark.ml.feature import VectorAssembler
#By using a list comprehension we can define inputCols as every column of df_indexed except the excluded ones
#Note the trailing comma: ('Location',) is a tuple, whereas ('Location') is just a string
assembler = VectorAssembler(inputCols=[e for e in df_indexed.columns if e not in ('Location',)], outputCol='features',
                            handleInvalid='skip')
output = assembler.transform(df_indexed)
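One detail about excluding columns with `not in` is worth spelling out: parentheses alone do not make a tuple, so `('Location')` is the string `'Location'` and `in` performs a substring test against it. The small sketch below shows the difference; the column name `"ion"` is hypothetical and only there to trigger the pitfall.

```python
columns = ["Session_Connection_Time", "Location", "Location_index", "ion"]  # "ion" is hypothetical

# ('Location') is a plain string, so `in` checks whether the name is a substring of it
substring_filter = [c for c in columns if c not in ('Location')]

# ('Location',) is a one-element tuple, so `in` checks for exact membership
tuple_filter = [c for c in columns if c not in ('Location',)]

print(substring_filter)  # ['Session_Connection_Time', 'Location_index']
print(tuple_filter)      # ['Session_Connection_Time', 'Location_index', 'ion']
```

With the column names in this dataset both forms happen to select the same columns, but the tuple form is the one that says what is meant.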

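Another interesting thing to do is "standardization". In essence, this puts all features on a "common scale" (by default, StandardScaler divides each feature by its standard deviation), so that features with large raw values, such as Bytes Transferred, do not dominate the distance calculations K-means relies on. Thus: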

In [5]:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol='features', outputCol='scaled_features')
scaler_model = scaler.fit(output)
scaled_data = scaler_model.transform(output)
scaled_data.select('scaled_features').head()

Row(scaled_features=DenseVector([0.5679, 1.3658, 1.9976, 1.2859, 2.2849, 5.3963, 1.7258]))
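As a rough illustration of what the scaler does with its default settings (`withStd=True`, `withMean=False`), each value is divided by the column's sample standard deviation and is not centred. The sketch uses only the standard library; `scale_column` is a hypothetical helper and the sample values are invented.

```python
import statistics

def scale_column(values):
    """Divide each value by the column's sample standard deviation (no centring)."""
    std = statistics.stdev(values)
    return [v / std for v in values]

sample = [2.0, 4.0, 4.0, 6.0]  # invented values
scaled = scale_column(sample)
print(statistics.stdev(scaled))  # very close to 1.0

# Cross-check against the notebook output: Kali_Trace_Used only takes the values 0 and 1,
# its stddev is 0.5006..., and 1 divided by that stddev matches the third scaled entry above
kali_scaled = 1.0 / 0.5006065264451406
print(round(kali_scaled, 4))  # 1.9976
```

The same division explains the other entries of the vector: each is the raw value of the first row divided by the stddev shown in the describe() output.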

Now starts the difficult, more think-about-it part! We already know that we might have 2 OR 3 [harkers](https://www.youtube.com/watch?v=H3edGTP7GVY), so, we are going to try to do it first with 3, then with 2, and compare which gets the best clustering score!

In [6]:
from pyspark.ml.clustering import KMeans #Import the module

First, we define and apply the models:

In [7]:
kmeans3 = KMeans(featuresCol='scaled_features', k=3)
model3 = kmeans3.fit(scaled_data)
kmeans2 = KMeans(featuresCol='scaled_features', k=2)
model2 = kmeans2.fit(scaled_data)

And then, we get the results:

In [8]:
results3 = model3.transform(scaled_data)
results2 = model2.transform(scaled_data)

We could also print the cluster assignments; the calls are commented out here for brevity

In [9]:
#results3.select('prediction').show()
#results2.select('prediction').show()

That's it! We have the results! Now, let's see how good the clustering is: if all the attacks fall neatly into two groups, the two-hacker theory would be supported; if we need one more group to explain the attacks better, the three-hacker theory would be king! Let's see how this works:

In [10]:
from pyspark.ml.evaluation import ClusteringEvaluator

Let's generate the evaluations:

In [11]:
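#Note: ClusteringEvaluator reads the 'features' column by default, so both scores below are computed on the unscaled features;
#pass featuresCol='scaled_features' to score in the same space K-means used
ClusteringEvaluator().evaluate(results2)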

0.6555369436993117

In [12]:
ClusteringEvaluator().evaluate(results3)

0.3008773897853434

As we can see, the two-hacker theory gets a far better evaluation score (about 0.66 versus 0.30)! This means that the underlying patterns in the data fit a two-cluster grouping much better than a three-cluster one. (ClusteringEvaluator uses the [silhouette method](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.ClusteringEvaluator.html): scores range from -1 to 1, and values close to 1 mean that points sit much closer to the other members of their own cluster than to the nearest neighbouring cluster. The cluster centers themselves can be shown using model2.clusterCenters() or model3.clusterCenters().)

To sum up: there are, definitely, **two and only two hackers here**

An interesting to-do would be to show how each characteristic is distributed within each cluster, to see if we can unmask the criminals.
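For readers who want to see what a silhouette score measures, here is a minimal pure-Python sketch using squared Euclidean distance. It is a textbook-style illustration on invented points, not Spark's actual implementation; `squared_distance` and `mean_silhouette` are hypothetical names, and the function assumes at least two clusters.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_silhouette(points, labels):
    """Average silhouette over all points (requires at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(squared_distance(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the closest other cluster
        b = min(
            sum(squared_distance(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]  # invented data
good = mean_silhouette(points, [0, 0, 1, 1])  # nearby points share a cluster
bad = mean_silhouette(points, [0, 1, 0, 1])   # nearby points are split apart
print(good)  # close to 1
print(bad)   # negative
```

A grouping that matches the natural structure scores near 1, while a grouping that cuts across it scores near or below 0, which is the logic behind comparing the k=2 and k=3 scores.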
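One possible way to approach that to-do is to average each feature per cluster and compare the profiles. The sketch below does this in plain Python on a few invented rows; `cluster_means` is a hypothetical helper, and the numbers are made up rather than taken from the dataset.

```python
from collections import defaultdict

def cluster_means(rows, label_key="prediction"):
    """Average every numeric field per cluster label."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[label_key]].append(row)

    profile = {}
    for label, members in grouped.items():
        fields = [key for key in members[0] if key != label_key]
        profile[label] = {
            field: sum(member[field] for member in members) / len(members)
            for field in fields
        }
    return profile

# Invented rows, shaped like the notebook's columns plus the cluster label
rows = [
    {"WPM_Typing_Speed": 70.0, "Kali_Trace_Used": 1, "prediction": 0},
    {"WPM_Typing_Speed": 72.0, "Kali_Trace_Used": 1, "prediction": 0},
    {"WPM_Typing_Speed": 42.0, "Kali_Trace_Used": 0, "prediction": 1},
    {"WPM_Typing_Speed": 44.0, "Kali_Trace_Used": 1, "prediction": 1},
]
profile = cluster_means(rows)
print(profile[0]["WPM_Typing_Speed"])  # 71.0
print(profile[1]["Kali_Trace_Used"])   # 0.5
```

In Spark, a comparable summary could probably be obtained with something like `results2.groupBy('prediction').mean('WPM_Typing_Speed', 'Kali_Trace_Used').show()`.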