This model aims to estimate the number of **hackers** involved, based on the recorded hacks. The hacks are clustered with **KMeans**.

[Dataset](https://www.dropbox.com/s/g5r2dh46abx1vdr/hack_data.csv?dl=0)

In [19]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('find_hacker').getOrCreate()

In [20]:
from pyspark.ml.clustering import KMeans
 
dataset = spark.read.csv("./hack_data.csv",header=True,inferSchema=True)
dataset.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)



The feature columns (*features*) are combined into a single column with **VectorAssembler**, which takes the *input* columns (**feat_cols**) and returns an *output* column holding one vector per row with the values of all input columns.

In [21]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
 
feat_cols = ['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used',
'Servers_Corrupted', 'Pages_Corrupted','WPM_Typing_Speed']
 
vec_assembler = VectorAssembler(inputCols = feat_cols, outputCol='features')
 
final_data = vec_assembler.transform(dataset)
final_data.show(3)

+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+--------------------+
|Session_Connection_Time|Bytes Transferred|Kali_Trace_Used|Servers_Corrupted|Pages_Corrupted|            Location|WPM_Typing_Speed|            features|
+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+--------------------+
|                    8.0|           391.09|              1|             2.96|            7.0|            Slovenia|           72.37|[8.0,391.09,1.0,2...|
|                   20.0|           720.99|              0|             3.04|            9.0|British Virgin Is...|           69.08|[20.0,720.99,0.0,...|
|                   31.0|           356.32|              1|             3.71|            8.0|             Tokelau|           70.58|[31.0,356.32,1.0,...|
+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+--------------------+

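The *features* column is standardized with **StandardScaler**. With **withStd=True** and **withMean=False**, as used in this notebook, each feature is scaled to unit standard deviation but the mean is not removed. This puts the features on comparable scales, which matters because **KMeans** relies on Euclidean distances; it does not make the data normally distributed.

The scaler is first **fit** to the data and then used to **transform** it, producing the *scaledFeatures* column.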
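As a rough illustration of this scaling step, the pure-Python sketch below divides each column by its corrected sample standard deviation without subtracting the mean, mirroring **withStd=True** and **withMean=False**. The helper **scale_to_unit_std** and its handling of zero-spread columns are assumptions made for this example, not Spark code; the toy rows reuse two columns of the first three records shown earlier.

```python
from statistics import stdev

def scale_to_unit_std(rows):
    """Divide each column by its sample standard deviation, without centering."""
    columns = list(zip(*rows))
    stds = [stdev(col) for col in columns]  # corrected (n - 1) sample std
    # Assumption for this sketch: a column with zero spread is mapped to 0.0
    # to avoid dividing by zero.
    return [[v / s if s > 0 else 0.0 for v, s in zip(row, stds)] for row in rows]

# Toy rows: (Session_Connection_Time, Bytes Transferred)
rows = [[8.0, 391.09], [20.0, 720.99], [31.0, 356.32]]
scaled = scale_to_unit_std(rows)

# After scaling, every column has unit standard deviation,
# but the values are not centered around zero.
scaled_stds = [stdev(col) for col in zip(*scaled)]
print(scaled_stds)
```

Because the mean is not removed, the scaled values keep their sign and relative position; only their spread changes.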

The **KMeans** clustering model is given the *features* column and the number of clusters (**k**):

**KMeans**(<span style="color:green;">featuresCol='features'</span>, predictionCol='prediction', <span style="color:green;">k=2</span>, initMode='k-means||', initSteps=2, tol=0.0001, maxIter=20, <span style="color:darkolivegreen;">seed=None</span>, distanceMeasure='euclidean')

In [22]:
from pyspark.ml.feature import StandardScaler
 
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)

scalerModel = scaler.fit(final_data)
 
cluster_final_data = scalerModel.transform(final_data)
 
kmeans3 = KMeans(featuresCol='scaledFeatures',k=3)
kmeans2 = KMeans(featuresCol='scaledFeatures',k=2)
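For intuition about what **KMeans** does with a feature column and a value of **k**, here is a minimal pure-Python sketch of the classic Lloyd iteration: assign each point to its nearest centroid, then move each centroid to the mean of its points. It is not Spark's implementation. In particular, the initialization with the first **k** points is an assumption made to keep the example deterministic (Spark defaults to **k-means||**), and the function names and toy points are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=20):
    # Assumption: naive initialization with the first k points.
    centroids = [list(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments

# Hypothetical toy data: two well-separated groups, interleaved.
points = [[1.0, 1.0], [5.0, 5.0], [1.2, 0.8], [5.2, 4.9], [0.8, 1.1], [4.8, 5.1]]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```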

The within-set sum of squared errors (WSSSE), that is, the sum of squared distances between each observation and the center of its cluster, is computed for **k = 2** and **k = 3**.

In [23]:
model_k3 = kmeans3.fit(cluster_final_data)
model_k2 = kmeans2.fit(cluster_final_data)
 
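# computeCost returns the WSSSE. It was deprecated in Spark 2.4 and removed in
# Spark 3.0; newer versions expose the training cost as model.summary.trainingCost.
wssse_k3 = model_k3.computeCost(cluster_final_data)
wssse_k2 = model_k2.computeCost(cluster_final_data)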

print("With K=3")
print("Within Set Sum of Squared Errors = " + str(wssse_k3))
print('--'*30)
print("With K=2")
print("Within Set Sum of Squared Errors = " + str(wssse_k2))

With K=3
Within Set Sum of Squared Errors = 434.1492898715845
------------------------------------------------------------
With K=2
Within Set Sum of Squared Errors = 601.7707512676716
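To make the quantity reported above concrete, the following pure-Python sketch computes a WSSSE by hand: for each point, it takes the squared Euclidean distance to the nearest center and sums the results. The function **wssse**, the toy points, and the centers are hypothetical and chosen so the arithmetic is easy to follow.

```python
def wssse(points, centroids):
    """Sum over all points of the squared distance to the nearest centroid."""
    total = 0.0
    for p in points:
        total += min(
            sum((x - c) ** 2 for x, c in zip(p, centroid))
            for centroid in centroids
        )
    return total

# Hypothetical toy data: two pairs of points on a line.
points = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]

# One center in the middle: 36 + 16 + 16 + 36 = 104
one_cluster = wssse(points, [[6.0, 0.0]])

# One center per pair: 1 + 1 + 1 + 1 = 4
two_clusters = wssse(points, [[1.0, 0.0], [11.0, 0.0]])

print(one_cluster, two_clusters)
```

Adding centers can only lower this value, which is why WSSSE alone cannot pick **k** and the elbow of the curve is inspected instead.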


Checking the elbow point (WSSSE) for **k** from 2 to 8

In [24]:
for k in range(2,9):
    kmeans = KMeans(featuresCol='scaledFeatures',k=k)
    model = kmeans.fit(cluster_final_data)
    wssse = model.computeCost(cluster_final_data)
    print("With K={}".format(k))
    print("Within Set Sum of Squared Errors = " + str(wssse))
    print('--'*30)

With K=2
Within Set Sum of Squared Errors = 601.7707512676716
------------------------------------------------------------
With K=3
Within Set Sum of Squared Errors = 434.1492898715845
------------------------------------------------------------
With K=4
Within Set Sum of Squared Errors = 413.50933955528234
------------------------------------------------------------
With K=5
Within Set Sum of Squared Errors = 246.44966476509273
------------------------------------------------------------
With K=6
Within Set Sum of Squared Errors = 232.6312088685911
------------------------------------------------------------
With K=7
Within Set Sum of Squared Errors = 225.49066138623704
------------------------------------------------------------
With K=8
Within Set Sum of Squared Errors = 214.28028203478016
------------------------------------------------------------


The WSSSE values decrease steadily as K grows. After a sharp drop from about 413.5 at K = 4 to about 246.4 at K = 5, the curve flattens, so the elbow is taken at K = 5.
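The elbow can also be read off numerically. The sketch below takes the WSSSE values printed above and computes the absolute and relative drop at each step. Picking the K with the largest relative drop is a simple heuristic assumed here for illustration, not a rule from the notebook.

```python
# WSSSE values copied from the output above.
wssse_by_k = {
    2: 601.7707512676716,
    3: 434.1492898715845,
    4: 413.50933955528234,
    5: 246.44966476509273,
    6: 232.6312088685911,
    7: 225.49066138623704,
    8: 214.28028203478016,
}

ks = sorted(wssse_by_k)
drops = {}
for prev_k, k in zip(ks, ks[1:]):
    absolute = wssse_by_k[prev_k] - wssse_by_k[k]
    relative = absolute / wssse_by_k[prev_k]
    drops[k] = (absolute, relative)
    print("K={}: drop = {:.2f} ({:.1%} of the previous WSSSE)".format(k, absolute, relative))

# Heuristic (assumption): the elbow is the K reached by the largest relative drop.
best_k = max(drops, key=lambda k: drops[k][1])
print("Largest relative drop at K =", best_k)
```

In absolute terms the drops at K = 3 and K = 5 are nearly equal (about 167.6 and 167.1), but relative to the previous value the drop at K = 5 is clearly the largest, and the curve stays flat afterwards.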

Distribution of the recorded hacks across the hackers (clusters) for K = 5:

In [25]:
# K = 5
kmeans5 = KMeans(featuresCol='scaledFeatures',k=5)
model_k5 = kmeans5.fit(cluster_final_data)
model_k5.transform(cluster_final_data).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   79|
|         3|   63|
|         4|   88|
|         2|   21|
|         0|   83|
+----------+-----+

