# Consulting project
You’re becoming world famous due to your machine learning skills!
A technology start-up in California needs your help!

They were recently hacked and need your help finding out about the hackers! Luckily, their forensic engineers have grabbed valuable data about the hacks, including session time, locations, WPM typing speed, etc.

The forensic engineer tells you what she has figured out so far: she was able to grab metadata for each session that the hackers used to connect to their servers.

These are the features of the data...

* 'Session_Connection_Time': How long the session lasted in minutes
* 'Bytes Transferred': Number of MB transferred during session
* 'Kali_Trace_Used': Indicates if the hacker was using Kali Linux
* 'Servers_Corrupted': Number of servers corrupted during the attack
* 'Pages_Corrupted': Number of pages illegally accessed
* 'Location': Location attack came from (Probably useless because the hackers used VPNs)
* 'WPM_Typing_Speed': Their estimated typing speed based on session logs.


The technology firm has three suspects for the attack. They are certain about the first two hackers, but they are not sure whether the third was involved.
They have requested your help! 

**Can you help figure out whether or not the third suspect had anything to do with the attacks, or was it just two hackers?**


*One last key fact: the forensic engineer knows that the hackers trade off attacks, meaning they should each have roughly the same number of attacks.*


In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://mirrors.sonic.net/apache/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
!tar xzf spark-3.1.2-bin-hadoop3.2.tgz
!pip install -q findspark


import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"


import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

In [13]:
spark

In [15]:
from pyspark.sql import SparkSession
# getOrCreate() returns the session started above if one is already running
spark = SparkSession.builder.appName('hackfinder').getOrCreate()

In [16]:
# Load the CSV file, inferring column types from the data
data = spark.read.csv('hack_data.csv', header=True, inferSchema=True)

In [18]:
# Print the schema and column types of the CSV
data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)



In [20]:
# Import libraries and packages
from pyspark.ml.clustering import KMeans

In [21]:
data.columns

['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'Location',
 'WPM_Typing_Speed']

In [22]:
data.head(1)

[Row(Session_Connection_Time=8.0, Bytes Transferred=391.09, Kali_Trace_Used=1, Servers_Corrupted=2.96, Pages_Corrupted=7.0, Location='Slovenia', WPM_Typing_Speed=72.37)]

In [23]:
data.count()

334

In [24]:
# Keep only the numeric columns ('Location' is a string and is left out)
input=['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'WPM_Typing_Speed']

In [26]:
# Build the feature set: combine the numeric columns into a single vector column
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=input, outputCol='features')
# Transform the DataFrame loaded earlier (`data`)
with_features = assembler.transform(data)
with_features.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)
 |-- features: vector (nullable = true)



In [28]:
# Scale the features
from pyspark.ml.feature import StandardScaler
scaler=StandardScaler(inputCol='features',outputCol='scaled_features')
scaler_model=scaler.fit(with_features)
scaled_data=scaler_model.transform(with_features)
scaled_data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- scaled_features: vector (nullable = true)
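
To make the scaling step concrete, here is a minimal pure-Python sketch of what `StandardScaler` does with its default settings (`withStd=True`, `withMean=False`): each column is divided by its sample standard deviation, without centering. The helper `scale_to_unit_std` and the three sample rows are hypothetical, chosen only for illustration; they are not taken from `hack_data.csv`.

```python
import statistics

def scale_to_unit_std(rows):
    """Divide each column by its sample standard deviation (no centering)."""
    columns = list(zip(*rows))
    stds = [statistics.stdev(col) for col in columns]
    # Guard against a zero-variance column by emitting 0.0
    # (an assumption of this sketch).
    return [
        [value / std if std > 0 else 0.0 for value, std in zip(row, stds)]
        for row in rows
    ]

# Hypothetical rows with two features on very different scales.
sample_rows = [
    [10.0, 100.0],
    [20.0, 300.0],
    [30.0, 500.0],
]

scaled = scale_to_unit_std(sample_rows)
scaled_columns = list(zip(*scaled))

for row in scaled:
    print(row)
```

After scaling, every column has unit standard deviation, so no single feature dominates the distance computations that K-means relies on.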



In [29]:
# Train the models: one with two clusters, one with three
kmeans2= KMeans(featuresCol='scaled_features',k=2)
kmeans3= KMeans(featuresCol='scaled_features',k=3)

model_k2=kmeans2.fit(scaled_data)
model_k3=kmeans3.fit(scaled_data)
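
For intuition about what `KMeans.fit` is optimizing, below is a minimal pure-Python sketch of Lloyd's algorithm, the classic k-means iteration: assign each point to its nearest center, then move each center to the mean of its assigned points. This is not Spark's implementation (Spark defaults to `k-means||` initialization and runs distributed); the function `lloyd_kmeans`, the naive first-`k`-points initialization, and the toy points are assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd_kmeans(points, k, max_iter=20):
    """Return (labels, centers) after running Lloyd's iterations."""
    # Naive initialization (assumption): use the first k points as centers.
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: pick the nearest center for every point.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

# Hypothetical toy data: two well-separated groups of two points each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
toy_labels, toy_centers = lloyd_kmeans(toy_points, k=2)
print(toy_labels)
print(toy_centers)
```

The number of clusters `k` is an input, not something the algorithm discovers, which is why the notebook fits one model per candidate value and compares them afterwards.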

In [34]:
# Get the results
model_k3.transform(scaled_data).select('prediction').show()
results_model_3 = model_k3.transform(scaled_data)

+----------+
|prediction|
+----------+
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
+----------+
only showing top 20 rows



In [35]:
model_k3.transform(scaled_data).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   88|
|         2|   79|
|         0|  167|
+----------+-----+



In [38]:
model_k2.transform(scaled_data).groupBy('prediction').count().show()
results_model_2 = model_k2.transform(scaled_data)

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         0|  167|
+----------+-----+
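
Because the hackers are known to trade off attacks, the cluster sizes themselves are evidence. The sketch below compares how evenly each clustering splits the 334 sessions, using the counts printed in the two `groupBy` outputs. The helper `balance_ratio` (smallest cluster size divided by largest) is a hypothetical measure introduced here for illustration, not part of the original analysis.

```python
def balance_ratio(counts):
    """Smallest cluster size divided by the largest; 1.0 is a perfectly even split."""
    return min(counts) / max(counts)

cluster_counts = {
    2: [167, 167],     # counts from the k=2 groupBy output
    3: [88, 79, 167],  # counts from the k=3 groupBy output
}

ratios = {k: balance_ratio(counts) for k, counts in cluster_counts.items()}
for k, ratio in ratios.items():
    print(f"k={k}: balance ratio = {ratio:.2f}")
```

A ratio of 1.00 for k=2 against roughly 0.47 for k=3 shows that only the two-cluster model produces the even split the forensic engineer expects.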



In [37]:
# Interpret the cluster results
# ClusteringEvaluator defaults: silhouette metric, read from the 'features' and 'prediction' columns
from pyspark.ml.evaluation import ClusteringEvaluator
evaluator = ClusteringEvaluator()

In [39]:
evaluator.evaluate(results_model_2)

0.6683623593283755

In [40]:
evaluator.evaluate(results_model_3)

0.30412315937808737
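
These values are silhouette scores, the default metric of `ClusteringEvaluator`. They range from -1 to 1, and higher values indicate tighter, better-separated clusters. As a rough illustration of the definition, here is a pure-Python sketch using squared Euclidean distance (the evaluator's default distance measure). The function `silhouette_score` and the toy points are assumptions for illustration; Spark computes this metric with a more efficient, distributed formulation.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def silhouette_score(points, labels):
    """Mean silhouette coefficient; expects at least two distinct labels."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the distance from the point to itself is 0, so it adds nothing).
        a = sum(squared_distance(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(squared_distance(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: two well-separated groups of two points each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # labels follow the groups
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

A labeling that follows the natural groups scores close to 1, while one that cuts across them scores below 0, which is the sense in which the k=2 score is better than the k=3 score.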

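**Solution**
We get a better result with k = 2: the silhouette score is higher (0.668 vs. 0.304) and the sessions split evenly (167/167), which matches the forensic engineer's note that the hackers trade off attacks. With k = 3 the split is uneven (167/88/79). Therefore, there were most likely two hackers.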

