# TLD Cluster Analysis

This analysis builds a risk profile for each TLD from DomainTools Threat Profile scores, then uses clustering analysis to group TLDs with similar risk profiles together.

In [1]:
# import the required libraries 
import pyspark
from pyspark.ml.clustering import KMeans, BisectingKMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    avg,
    col,
    collect_list,
    count,
    lit,
    max,
    min,
    stddev,
    size,
    sum,
    udf,
)
from pyspark.sql.types import StringType, StructType, StructField, FloatType, LongType
from pyspark.ml.linalg import Vectors, VectorUDT

#### Setup Plotly for our graphs

In [23]:
# Note: in plotly >= 4 this module moved to the chart-studio package (chart_studio.plotly)
import plotly.plotly as py
import plotly.graph_objs as go
import random
import string
title_font=dict(family='Arial, sans-serif',size=22)
axis_font=dict(family='Arial, sans-serif',size=18)
x_axis=dict(
        title='Threat Profile Score',        
        titlefont=axis_font
    )
y_axis=dict(
        title='% of Total Domains',
        titlefont=axis_font
    )

def plot_name():
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=8))

#### Setup Spark artifacts and read in the data

In [3]:
app_name = "tld_cluster_analysis"
spark = SparkSession.builder.appName(app_name).config("spark.speculation", "false").master("local[4]").getOrCreate()

TLD_FEATURES = StructType(
    [
        StructField("tld", StringType(), True),
        StructField("phish_features", VectorUDT(), True),
        StructField("malware_features", VectorUDT(), True),
        StructField("spam_features", VectorUDT(), True),
        StructField("domain_count", LongType(), True),
    ]
)

DF_DATA_PATH = "/Users/turbo/projects/tld_risk_score_analysis_blog/data/"

df = spark.read.load(
    DF_DATA_PATH,
    format="orc",
    schema=TLD_FEATURES)

### What does the TLD data look like?

Each threat type (phish, malware, spam) has a features column. These represent the probability distributions described in the analysis article. The `domain_count` column is the number of domains in the TLD that were young enough to be scored by Threat Profile.
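
As a rough illustration of what one of these feature vectors holds, here is a minimal sketch that turns a list of per-domain scores into a normalized 100-bucket distribution. It assumes integer scores from 0 to 99 (the plots below use `range(100)` as the buckets); `score_distribution` and the sample scores are hypothetical, and the real features were built upstream of this notebook.

```python
from collections import Counter

def score_distribution(scores, n_buckets=100):
    """Return the fraction of domains that fall in each integer score bucket."""
    counts = Counter(scores)
    total = len(scores)
    return [counts.get(bucket, 0) / total for bucket in range(n_buckets)]

# Hypothetical scores for eight domains in one TLD
scores = [0, 0, 1, 5, 42, 99, 99, 99]
dist = score_distribution(scores)
print(len(dist), dist[0], dist[99])
```

Because every vector sums to 1, a TLD with 20,000 domains and one with 3 million can be compared on shape alone.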

In [27]:
df.printSchema()
df.show()

root
 |-- tld: string (nullable = true)
 |-- phish_features: vector (nullable = true)
 |-- malware_features: vector (nullable = true)
 |-- spam_features: vector (nullable = true)
 |-- domain_count: long (nullable = true)

+------------+--------------------+--------------------+--------------------+------------+
|         tld|      phish_features|    malware_features|       spam_features|domain_count|
+------------+--------------------+--------------------+--------------------+------------+
|         .io|[0.34609439752693...|[0.39338556483949...|[0.87051981362424...|      357128|
|       .loan|[0.17286702735385...|[0.01401396357812...|[0.22308407061770...|     2203017|
|       .love|[0.20245398773006...|[0.14982043992219...|[0.55001496333981...|       26732|
|.photography|[0.28393539637529...|[0.25619529034644...|[0.77117494760202...|       24333|
|      .world|[0.15915677960051...|[0.16037749318256...|[0.62821061949429...|      140082|
|       .date|[0.02797370711648...|[0.048001020500

### Plot the phish distribution for .COM

What does the phishiness of .COM look like all by itself? Most of the domains had a fairly low Threat Profile phish score, but there is an uptick of phishy domains at the far right.

In [26]:
com_features = df.where(col("TLD") == ".com").select("phish_features").collect()[0][0].array

buckets = [x for x in range(100)]

dot_com = go.Scatter(
    x = buckets, 
    y = com_features,
    name = ".com"
)
data = [dot_com]

layout = go.Layout(
    title="Phishiness of .COM ",
    titlefont=title_font,
    xaxis=x_axis,
    yaxis=y_axis
)
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename=plot_name())
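
To put a number on that uptick at the far right, one option is to sum the share of domains at or above some score. This is only a sketch: `tail_mass` is a hypothetical helper, the cutoff of 90 is an arbitrary choice, and the toy vector stands in for the real `com_features`.

```python
def tail_mass(features, threshold=90):
    """Share of domains whose score bucket is at or above `threshold`."""
    return sum(features[threshold:])

# Toy 100-bucket distribution: 70% of domains at score 0, 30% at score 95
toy_features = [0.0] * 100
toy_features[0] = 0.7
toy_features[95] = 0.3

print(tail_mass(toy_features))
```

With the notebook's data you could call `tail_mass(com_features)` on the array collected above.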

### Look at .COM vs .TK
.TK and .COM are almost mirrors of each other.

In [28]:
tk_features = df.where(col("TLD") == ".tk").select("phish_features").collect()[0][0].array
com_features = df.where(col("TLD") == ".com").select("phish_features").collect()[0][0].array

buckets = [x for x in range(100)]

dot_tk = go.Scatter(
    x = buckets, 
    y = tk_features,
    name = ".tk"
)
dot_com = go.Scatter(
    x = buckets, 
    y = com_features,
    name = ".com"
)
data = [dot_com, dot_tk]

layout = go.Layout(
    title="Phishiness of .COM vs .TK",
    titlefont=title_font,
    xaxis=x_axis,
    yaxis=y_axis
)
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename=plot_name())
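
One way to check the "mirror" impression without eyeballing the chart is to compare one distribution against the other one reversed. The sketch below uses the L1 distance and toy 3-bucket vectors; `l1_distance` and `mirror_distance` are hypothetical helpers, not part of the original analysis.

```python
def l1_distance(p, q):
    """Sum of absolute bucket-by-bucket differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def mirror_distance(p, q):
    """L1 distance between p and q flipped left to right."""
    return l1_distance(p, list(reversed(q)))

# Toy distributions that are exact mirrors of each other
low_risk = [0.6, 0.3, 0.1]
high_risk = [0.1, 0.3, 0.6]

print(l1_distance(low_risk, high_risk))      # distance as-is
print(mirror_distance(low_risk, high_risk))  # distance after flipping
```

A mirror distance much smaller than the plain distance supports the mirror reading. On the real data you would pass `com_features` and `tk_features`.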

## Phish Clusters
### Cluster phish's TLD Risk vectors and see what we get

I used the BisectingKMeans algorithm. It gives more evenly sized clusters than KMeans, but is more "smooth" in how it groups items together. Play around with the two algorithms (just comment one out and uncomment the other) and see how the clusters change. You can also play around with the `k` value: set it to other values and see what kind of clusters you get.
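
If you haven't seen bisecting k-means before, the idea is to start with everything in one cluster and keep splitting a cluster in two until you have `k` of them. Below is a simplified pure-Python sketch of that idea, assuming the textbook rule of always splitting the cluster with the largest sum of squared errors and a deterministic starting point for each split. Spark's `BisectingKMeans` differs in its details, and every function name here is hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    n = len(points)
    return tuple(sum(dim) / n for dim in zip(*points))

def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)

def two_means(points, iters=20):
    """Split one cluster in two with plain 2-means."""
    # Deterministic start: the point farthest from the centroid,
    # and the point farthest from that one.
    center = centroid(points)
    c0 = max(points, key=lambda p: sq_dist(p, center))
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    left, right = [], []
    for _ in range(iters):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not left or not right:
            break
        new_c0, new_c1 = centroid(left), centroid(right)
        if (new_c0, new_c1) == (c0, c1):
            break  # assignments have stopped changing
        c0, c1 = new_c0, new_c1
    return left, right

def bisecting_kmeans(points, k):
    clusters = [list(points)]
    while len(clusters) < k:
        target = max(clusters, key=sse)  # split the "loosest" cluster
        left, right = two_means(target)
        if not left or not right:
            break  # nothing left to split
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
clusters = bisecting_kmeans(points, k=3)
print(clusters)
```

Plain k-means, by contrast, places all `k` centers at once and refines them together, which is one reason the two algorithms can carve up the same TLDs differently.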

In [29]:
# Pick the clustering algorithm to use
clustering = BisectingKMeans(k=20, seed=42)
# clustering = KMeans(k=20, seed=42)

phish_df = df.select(col("tld"), col("phish_features").alias("features"), col("domain_count"))

phish_model = clustering.fit(phish_df)

# Evaluate clustering.
# cost = phish_model.computeCost(phish_df)
# print("Within Set Sum of Squared Errors = " + str(cost))

# Shows the cluster centroid vectors.
# print("Cluster Centers: ")
# centers = phish_model.clusterCenters()
# for center in centers:
#     print(center)

# assign TLDs to clusters
phish_predictions = phish_model.transform(phish_df)\
    .orderBy(col("prediction"), col("domain_count").desc()).cache()

# Evaluate clustering by computing Silhouette score
# silhouette = ClusteringEvaluator().evaluate(phish_predictions)
# print("Silhouette with squared euclidean distance = " + str(silhouette))
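
For a feel of what the silhouette score in the commented-out lines measures, here is a small pure-Python sketch using squared Euclidean distance. It follows the standard definition (for each point, compare the mean distance to its own cluster with the mean distance to the nearest other cluster); `silhouette` is a hypothetical helper and is not how Spark computes it internally.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def silhouette(points, labels):
    """Mean silhouette coefficient with squared Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster
        a = sum(sq_dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster
        b = min(
            sum(sq_dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

points = [(0,), (1,), (10,), (11,)]
good = silhouette(points, [0, 0, 1, 1])  # tight, well separated clusters
bad = silhouette(points, [0, 1, 0, 1])   # clusters that straddle each other
print(good, bad)
```

Scores near 1 mean tight, well separated clusters; scores near or below 0 mean the clusters overlap, which is a useful signal when trying different `k` values.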

#### Play around with the graph

Cleaning up the chart: There are a lot of lines in this chart. If you move your mouse around, it shows the raw value at that point for each cluster. If you click the third button from the right at the top of the chart (it looks like a single arrow or something), the chart will only show the raw value for the closest line.

You can also click and drag across a section of the chart to zoom into that region. This allows you to see crowded sections of the chart more clearly. To go back to the full view, click the button at the top of the chart that looks like a house.

In [30]:
centroids = phish_model.clusterCenters()
buckets = [x for x in range(100)]
data = [go.Scatter(x = buckets, y = centroids[c], name = f"cluster_{c}") for c in range(20)]

layout = go.Layout(
    title="TLDs Clustered By Phish Score",
    titlefont=title_font,
    xaxis=x_axis,
    yaxis=y_axis
)
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename=plot_name())

### Clusters 7, 8 and 9 look pretty phishy. Let's pull them out and look at just them

From the plot above it looks like traces 7, 8, and 9 are the most sketchy in terms of phish.

Note: if you change the clustering algorithm or the number of clusters, you will get a completely different set of results. You'll have to explore the clusters to figure out which ones look phishy.

In [31]:
centroids = phish_model.clusterCenters()
buckets = [x for x in range(100)]
data = [go.Scatter(x = buckets, y = centroids[c], name = f"cluster_{c}") for c in [7, 8, 9]]

layout = go.Layout(
    title="Phish Clusters 7, 8, 9",
    titlefont=title_font,
    xaxis=x_axis,
    yaxis=y_axis
)
fig = go.Figure(data=data, layout=layout)


py.iplot(fig, filename=plot_name())
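
Since the cluster numbers change whenever the algorithm or `k` changes, it can help to rank the centroids instead of reading numbers off the chart. The sketch below orders clusters by how much of each centroid sits in the high-score buckets. It is an assumption-laden shortcut: `rank_by_tail_mass` is a hypothetical helper, the cutoff of 90 is arbitrary, and the toy centroids stand in for `phish_model.clusterCenters()`.

```python
def rank_by_tail_mass(centroids, threshold=90):
    """Return (cluster_index, tail_mass) pairs, riskiest cluster first."""
    masses = [(i, sum(c[threshold:])) for i, c in enumerate(centroids)]
    return sorted(masses, key=lambda pair: pair[1], reverse=True)

def toy_centroid(low, high):
    """Toy 100-bucket centroid with weight at score 0 and score 99."""
    c = [0.0] * 100
    c[0], c[99] = low, high
    return c

toy_centroids = [
    toy_centroid(0.9, 0.1),  # cluster 0: mostly low scores
    toy_centroid(0.2, 0.8),  # cluster 1: mostly high scores
    toy_centroid(0.5, 0.5),  # cluster 2: even on both ends
]
ranking = rank_by_tail_mass(toy_centroids)
print(ranking)
```

The top few indices are candidates to pass to `isin([...])` in the next cell, though it is still worth looking at the plot before trusting a single number.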

#### Cluster 8 is definitely phishy. Cluster 9 has ~2x more sketchy domains than it does non-sketchy. And cluster 7 is fairly even on both ends

### List of Phishy TLDs from clusters 7, 8, and 9

Notice the top 5 TLDs are all Freenom TLDs.

In [37]:
phish_predictions\
    .where(col("prediction").isin([7,8,9]))\
    .select("tld", "prediction", "domain_count")\
    .orderBy("domain_count", ascending=False)\
    .show(100)

+---------------+----------+------------+
|            tld|prediction|domain_count|
+---------------+----------+------------+
|            .tk|         7|     3158181|
|            .ga|         7|     1950194|
|            .ml|         9|     1682226|
|            .cf|         7|     1667317|
|            .gq|         9|     1534950|
|          .work|         9|      597322|
|           .icu|         9|      539214|
|           .gdn|         8|      354361|
|           .men|         9|      338756|
|           .win|         9|      314803|
|        .stream|         8|      275279|
|           .bid|         8|      228408|
|        .review|         9|      208278|
|         .trade|         9|      161101|
|          .host|         7|      155228|
|         .cloud|         9|      154666|
|          .date|         8|      133268|
|           .dev|         7|      121098|
|      .download|         8|      100084|
|         .party|         7|       96533|
|           .ink|         7|      

## Malware Clusters
### Cluster malware's TLD Risk vectors and see what we get

In [32]:
# Pick the clustering algorithm to use
clustering = BisectingKMeans(k=20, seed=42)
# clustering = KMeans(k=20, seed=42)

malware_df = df.select(col("tld"), col("malware_features").alias("features"), col("domain_count"))

malware_model = clustering.fit(malware_df)

# Shows the result.
# print("Cluster Centers: ")
# centers = malware_model.clusterCenters()
# for center in centers:
#     print(center)

# assign TLDs to clusters
malware_predictions = malware_model.transform(malware_df)\
    .orderBy(col("prediction"), col("domain_count").desc()).cache()

In [33]:
centroids = malware_model.clusterCenters()
buckets = [x for x in range(100)]
data = [go.Scatter(x = buckets, y = centroids[c], name = f"cluster_{c}") for c in range(20)]

layout = go.Layout(
    title="TLDs Clustered By Malware Score",
    titlefont=title_font,
    xaxis=x_axis,
    yaxis=y_axis
)
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename=plot_name())

### Clusters 6-9 look pretty bad from a malware perspective. Let's pull them out and see what they look like

In [34]:
centroids = malware_model.clusterCenters()
buckets = [x for x in range(100)]
data = [go.Scatter(x = buckets, y = centroids[c], name = f"cluster_{c}") for c in [6,7,8,9]]

layout = go.Layout(
    title="Malware Clusters 6 - 9",
    titlefont=title_font,
    xaxis=x_axis,
    yaxis=y_axis
)
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename=plot_name())

### List of Malwareish TLDs from clusters 6 - 9

In [35]:
malware_predictions\
    .where(col("prediction").isin([6,7,8,9]))\
    .select("tld", "prediction", "domain_count")\
    .orderBy("domain_count", ascending=False)\
    .show(100)

+---------------+----------+------------+
|            tld|prediction|domain_count|
+---------------+----------+------------+
|           .top|         9|     4106700|
|            .tk|         8|     3158181|
|           .xyz|         7|     2526182|
|            .tw|         6|     2476020|
|          .loan|         8|     2203017|
|            .ga|         7|     1950194|
|            .ml|         7|     1682226|
|            .cf|         7|     1667317|
|            .gq|         7|     1534950|
|          .club|         7|     1407012|
|        .online|         6|     1210913|
|          .site|         9|     1182432|
|           .ltd|         8|      652897|
|          .work|         9|      597322|
|           .vip|         6|      566351|
|           .icu|         8|      539214|
|            .pw|         6|      517159|
|            .cc|         8|      427158|
|           .gdn|         7|      354361|
|           .men|         9|      338756|
|           .win|         7|      

## Spam Clusters
### Cluster spam's TLD Risk vectors and see what we get

In [36]:
# Pick the clustering algorithm to use
clustering = BisectingKMeans(k=20, seed=42)
# clustering = KMeans(k=20, seed=42)

spam_df = df.select(col("tld"), col("spam_features").alias("features"), col("domain_count"))

spam_model = clustering.fit(spam_df)

# Shows the result.
# print("Cluster Centers: ")
# centers = spam_model.clusterCenters()
# for center in centers:
#     print(center)

# assign TLDs to clusters
spam_predictions = spam_model.transform(spam_df)\
    .orderBy(col("prediction"), col("domain_count").desc()).cache()

In [37]:
centroids = spam_model.clusterCenters()
buckets = [x for x in range(100)]
data = [go.Scatter(x = buckets, y = centroids[c], name = f"cluster_{c}") for c in range(20)]

layout = go.Layout(
    title="TLDs Clustered By Spam Score",
    titlefont=title_font,
    xaxis=x_axis,
    yaxis=y_axis
)
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename=plot_name())

### Clusters 6 and 7 look pretty spammy. Let's see what they look like

Note: clusters 4 and 5 look kind of questionable, but I think I'll leave them out for now.

In [19]:
centroids = spam_model.clusterCenters()
buckets = [x for x in range(100)]
data = [go.Scatter(x = buckets, y = centroids[c], name = f"cluster_{c}") for c in [6, 7]]

layout = go.Layout(
    title="Spam Clusters 6, 7",
    titlefont=title_font,
    xaxis=x_axis,
    yaxis=y_axis
)
fig = go.Figure(data=data, layout=layout)

py.iplot(fig, filename=plot_name())

### List of Spammy TLDs from clusters 6, 7

In [38]:
spam_predictions\
    .where(col("prediction").isin([6, 7]))\
    .select("tld", "prediction", "domain_count")\
    .orderBy("domain_count", ascending=False)\
    .show(100)

+-----------+----------+------------+
|        tld|prediction|domain_count|
+-----------+----------+------------+
|      .loan|         7|     2203017|
|      .work|         7|      597322|
|       .gdn|         6|      354361|
|       .men|         6|      338756|
|    .stream|         6|      275279|
|       .bid|         6|      228408|
|    .review|         7|      208278|
|     .trade|         7|      161101|
|      .host|         6|      155228|
|      .date|         7|      133268|
|  .download|         7|      100084|
|     .party|         6|       96533|
|   .science|         7|       83805|
|    .racing|         6|       71005|
|.accountant|         6|       57219|
|     .faith|         6|       55141|
|    .webcam|         7|       51932|
| .xn--p1acf|         6|       34300|
|   .cricket|         6|       22460|
+-----------+----------+------------+



### Hmmm...where are the Freenom TLDs in the spam results?

I noticed that .TK was at or near the top of the list for both phish and malware, but not in the spam list. In fact, none of the Freenom TLDs made it into the spammy clusters.

Let's go find .TK and see what cluster it's in.

In [40]:
spam_predictions.where(col("tld") == ".tk").show()

+---+--------------------+------------+----------+
|tld|            features|domain_count|prediction|
+---+--------------------+------------+----------+
|.tk|[0.08287903701529...|     3158181|         5|
+---+--------------------+------------+----------+



#### Cluster 5 again
So .TK is in cluster 5.  What else is in cluster 5?

In [41]:
spam_predictions.where(col("prediction") == 5).show()

+-------+--------------------+------------+----------+
|    tld|            features|domain_count|prediction|
+-------+--------------------+------------+----------+
|    .tk|[0.08287903701529...|     3158181|         5|
|    .ml|[0.13501515254192...|     1682226|         5|
|    .cf|[0.12309656771927...|     1667317|         5|
|    .gq|[0.11551907228248...|     1534950|         5|
|   .ren|[0.03478192990086...|       89673|         5|
|.boston|[0.07702943800178...|       22420|         5|
+-------+--------------------+------------+----------+



#### That's an interesting cluster

4 out of the 5 Freenom TLDs showed up in cluster 5.

So just for fun, let's look at cluster 4, the other semi-sketchy cluster.

In [42]:
spam_predictions.where(col("prediction") == 4).show()

+----+--------------------+------------+----------+
| tld|            features|domain_count|prediction|
+----+--------------------+------------+----------+
| .ga|[0.18340739434128...|     1950194|         4|
| .us|[0.27380820779777...|     1194084|         4|
|.icu|[0.12178652631422...|      539214|         4|
|.win|[0.18524283440755...|      314803|         4|
|.ink|[0.15954833277586...|       86701|         4|
+----+--------------------+------------+----------+



#### There it is!
The missing Freenom TLD, .GA, is in cluster 4.

Cluster 4 is kind of interesting because .ICU, .WIN, and .INK all showed up in the lists for phish and malware but .US didn't. The only list .US showed up in was for spam.
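
This kind of cross-list comparison is easy to do with sets rather than by scanning the tables. The sketch below uses only a handful of rows copied from the outputs above (the full lists are longer), and `list_membership` is a hypothetical helper.

```python
def list_membership(lists):
    """Map each TLD to the sorted names of the lists it appears in."""
    membership = {}
    for name, tlds in lists.items():
        for tld in tlds:
            membership.setdefault(tld, []).append(name)
    return {tld: sorted(names) for tld, names in membership.items()}

# A few rows from the tables above, not the complete lists
lists = {
    "phish": {".tk", ".ga", ".icu", ".win"},
    "malware": {".tk", ".ga", ".icu", ".win", ".top"},
    "spam_cluster_4": {".ga", ".us", ".icu", ".win"},
}
membership = list_membership(lists)

for tld in sorted(membership):
    print(tld, membership[tld])
```

With the real data you could build each set from a predictions DataFrame, for example by collecting the `tld` column after the `isin` filter.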