# Domain Name clustering

Let's start from the top: we have a bunch of registrable domains on the disneyland team from a peer, Clay Blankenship. The question: can we cluster them together to find common targets for impersonation?

First try: let's use the same technique used in the Sean/Kai webinar: split each domain into words using Zipf's law, then cluster the domains by the words they contain.

If you haven't seen Zipf's law before, it's an empirical observation that many distributions in nature follow a power law, where an item's size is roughly inversely proportional to its rank. The populations of the largest US cities are one example (roughly 8M, 4M, and 2.7M for the top three). Word frequencies in a language follow the same pattern, so you can predict the most likely split of a run-together string by finding the split that uses the most common words. We have a dictionary from Wikipedia and a splitter from StackOverflow, so let's try it.
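
To make the idea concrete, here is a toy sketch of Zipf-based splitting. It is not the `ZipfSplitter` used in this notebook (whose internals aren't shown here); the word list, `zipf_split`, and the cost formula are illustrative assumptions modeled on the commonly cited dynamic-programming approach. Each word's cost is the log of its rank, and the cheapest total split wins.

```python
from math import log

# Toy word list, most frequent first. A stand-in (assumption) for the real
# Wikipedia-derived dictionary used by the notebook.
WORDS = ["the", "of", "and", "in", "on", "express", "bank", "key", "zions"]

# Under Zipf's law the word at rank r has probability ~ 1 / (r * ln N),
# so its cost (negative log probability) is log(r * ln N).
WORD_COST = {w: log((rank + 1) * log(len(WORDS))) for rank, w in enumerate(WORDS)}
MAX_WORD_LEN = max(len(w) for w in WORDS)
UNKNOWN_COST = float("inf")  # anything outside the dictionary is maximally expensive


def zipf_split(text):
    """Split text into the cheapest sequence of words (dynamic programming)."""
    # best[i] = (total cost, length of last word) for the best split of text[:i]
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = []
        for k in range(1, min(i, MAX_WORD_LEN) + 1):
            word_cost = WORD_COST.get(text[i - k:i], UNKNOWN_COST)
            candidates.append((best[i - k][0] + word_cost, k))
        best.append(min(candidates))

    # Walk backwards through the table to recover the chosen words.
    words = []
    i = len(text)
    while i > 0:
        k = best[i][1]
        words.append(text[i - k:i])
        i -= k
    return words[::-1]


print(zipf_split("zionsbank"))  # ['zions', 'bank']
print(zipf_split("zionshamk"))  # ['zions', 'h', 'a', 'm', 'k']
```

Note what happens to the second string: once a stretch of text matches no dictionary word, every split of it is equally (infinitely) expensive, and this sketch falls back to single characters.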

In [1]:
import json
from collections import defaultdict
from zipf_split import ZipfSplitter

from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import CountVectorizer

import numpy as np

import matplotlib.pyplot as plt

In [2]:
INFILE = "disneyland_domains.json"
SCORE_FILE = "zipf_dictionary.csv"

In [3]:
def load_zipf_dict():
    splitter = ZipfSplitter(SCORE_FILE)
    return splitter

def get_domains():
    with open(INFILE) as infile:
        data = json.load(infile)
        domains = [entry["domain"] for entry in data["response"]["results"]]
    return domains


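def split_domains(domains):
    """Split the first label of each domain into its most likely words."""
    splitter = load_zipf_dict()
    response = list()
    for name in domains:
        # Keep only the text before the first dot, i.e. drop the TLD
        # (assumes each entry is a bare registrable domain like "53vb.com").
        registrable_name = name.split(".")[0]
        components = splitter.split_dns_name(registrable_name)
        response.append(components)
    return response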


In [4]:
domains = get_domains()

In [5]:
domains_as_words = split_domains(domains)

In [6]:
domains_as_words[:5]

[['26', 'uy', '6'], ['53', 'vb'], ['53', 'vl'], ['53', 'vz'], ['53', 'x', 'a']]

In [7]:
domains[:5]

['26uy6.top', '53vb.com', '53vl.com', '53vz.com', '53xa.com']

In [8]:
domains[-10:]

['ẹxperss53.com',
 'ẹxpness53.com',
 'ẹxprass53.com',
 'ẹxpres53.com',
 'ẹxprses53.com',
 'ẹxqress53.com',
 'ẹxrpess53.com',
 'zeero0ze.com',
 'zionshamk.com',
 'zlonshank.com']

In [9]:
domains_as_words[-10:]

[['ẹ', 'x', 'p', 'e', 'r', 's', 's', '53'],
 ['ẹ', 'x', 'p', 'n', 'e', 's', 's', '53'],
 ['ẹ', 'x', 'p', 'r', 'a', 's', 's', '53'],
 ['ẹ', 'x', 'p', 'r', 'e', 's', '53'],
 ['ẹ', 'x', 'p', 'r', 's', 'e', 's', '53'],
 ['ẹ', 'x', 'q', 'r', 'e', 's', 's', '53'],
 ['ẹ', 'x', 'r', 'p', 'e', 's', 's', '53'],
 ['zeero', '0', 'ze'],
 ['zions', 'hamk'],
 ['zl', 'on', 'shank']]

Well, that didn't work. 

The problem: these aren't simple word splits. The domains here are homoglyphs and typos of words. The Zipf splitting algorithm struggles because almost none of the strings in these domain names are dictionary words, so splitting on dictionary words falls apart. We'll have to clean up the homoglyphs before we can do any clustering.

Next try:
 1. Build a map of character substitutions (this isn't a general answer, but it's probably better than nothing), build all possible new words by substituting characters in, and modify the Zipf split to look for those.
 2. Cluster the names using a custom distance metric: the edit distance between words.
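
For step 2, a minimal sketch of an edit-distance metric is shown here, assuming plain Levenshtein distance (insertions, deletions, and substitutions each cost 1). The function name `edit_distance` and the sample names are illustrative and not part of the notebook's code.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, one table row at a time."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


# A few cleaned names taken from the output above.
names = ["experss53", "expres53", "exqress53", "zionshamk"]

# Pairwise distance matrix: small values suggest typos of the same target.
matrix = [[edit_distance(x, y) for y in names] for x in names]
for name, row in zip(names, matrix):
    print(name, row)
```

A matrix like this could in principle be handed to a clustering algorithm that accepts precomputed distances (scikit-learn's `DBSCAN` does, via `metric="precomputed"`), though that is not attempted in this notebook.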

In [10]:
word_map = {
    'ạ': "a",
    'ė': "e",
    'ẹ': "e",
    'ȩ': "e",
    'ọ': "o",
    'ņ': "n",
    'ŗ': "r",
    'ș': "s",
    'ț': "t",
    'ụ': "u"}


In [11]:
def clean_domains(domains):
    response = list()
    for domain in domains:
        text = "".join([word_map.get(c, c) for c in domain])
        response.append(text)
    return response

In [12]:
cleaned_domains = clean_domains(domains)

In [13]:
cleaned_domains_as_words = split_domains(cleaned_domains)

In [14]:
cleaned_domains_as_words[-10:]

[['experss', '53'],
 ['exp', 'ness', '53'],
 ['ex', 'prass', '53'],
 ['expres', '53'],
 ['exp', 'rses', '53'],
 ['exq', 'ress', '53'],
 ['exr', 'pess', '53'],
 ['zeero', '0', 'ze'],
 ['zions', 'hamk'],
 ['zl', 'on', 'shank']]

This looks much better. Not perfect by any stretch, but better.
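
The hand-built `word_map` only covers the characters we happened to spot. As a possible generalization, and only a sketch, Unicode normalization can strip diacritics without listing each character: decompose each character into a base letter plus combining marks, then drop the marks. The helper name `strip_diacritics` is hypothetical. This only handles accent-style homoglyphs; look-alike letters from other scripts (for example Cyrillic "а") pass through unchanged.

```python
import unicodedata


def strip_diacritics(text):
    """Replace accented characters with their unaccented base letters."""
    # NFKD splits e.g. 'ẹ' into 'e' followed by a combining dot-below mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else as-is.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(strip_diacritics("ẹxpres53.com"))
print(strip_diacritics("ạėẹȩọņŗșțụ"))  # every key in word_map
```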

So, now let's try clustering on that. The first try will be very simple: treat each domain name as a small "document", count the "words" in the document, and cluster the domains together by how many words they share. We'll do that with DBSCAN rather than k-means (which needs the number of clusters chosen up front), as DBSCAN is better suited to "categorical" data like this.
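
To show what "counting the words in a document" means here, this is a small pure-Python sketch that mimics `CountVectorizer`'s default tokenization (lowercasing, and keeping only tokens of two or more word characters). The helper `bag_of_words` is hypothetical and only approximates what the vectorizer does internally.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern:
# tokens must be at least two word characters long.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def bag_of_words(doc):
    """Count the tokens in one space-joined 'document'."""
    return Counter(TOKEN_PATTERN.findall(doc.lower()))


for doc in ["key navl gators key", "sebl v", "expres 53"]:
    print(doc, "->", dict(bag_of_words(doc)))
```

One consequence worth noticing: single-character pieces such as the `v` in `sebl v` never make it into the vector, so heavily fragmented splits contribute little or nothing to the clustering.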

In [15]:
vectorizer = CountVectorizer()
words_as_docs = [" ".join(words) for words in cleaned_domains_as_words]
X = vectorizer.fit_transform(words_as_docs)

In [16]:
db = DBSCAN(eps=0.3, min_samples=3).fit(X)

In [17]:
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)

Estimated number of clusters: 17
Estimated number of noise points: 405


One thing about DBSCAN: it has a "none of the above" cluster, where the entries that didn't fit into any other cluster go. That "noise" cluster is huge in this case: 405 of the 494 domains didn't cluster. That's pretty poor. Just for argument's sake, let's look at what it did find, though.

In [18]:
translated = defaultdict(list)

In [19]:
for label, entry in zip(labels, words_as_docs):
    translated[label].append(entry)

In [20]:
translated.keys()

dict_keys([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])

In [21]:
key = 15
translated[key]

['key navl gators key',
 'key navl gators key',
 'key navl gators key',
 'key navl gators key']

In [26]:
key=4
translated[key]

['sebl v', 'sebl v', 'sebl v']

It sort of worked, but not well enough to be useful. We expected this approach to be naive, and it is: it doesn't cope well with typos and the like. Even after cleaning up the homoglyphs, few of the words in the clusters are real English words, so clustering on them isn't working. And because they aren't real English words, fancier language processing such as lemmatization won't help either, since it needs dictionary words.
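
One likely contributor to the huge noise cluster, assuming DBSCAN's default Euclidean metric on the raw count vectors: two different count vectors differ by at least 1 in some coordinate, so they are always at least 1.0 apart, which is well beyond `eps=0.3`. Under that assumption only identical documents can be neighbors, which matches the clusters shown above. The vectors below are hand-built for illustration over a made-up four-word vocabulary.

```python
from math import sqrt


def euclidean(u, v):
    """Straight-line distance between two equal-length count vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Illustrative vocabulary order: ["53", "expres", "exq", "ress"]
expres_53 = [1, 1, 0, 0]    # "expres 53"
exq_ress_53 = [1, 0, 1, 1]  # "exq ress 53"

EPS = 0.3  # the value passed to DBSCAN above

distance = euclidean(expres_53, exq_ress_53)
print(distance, distance <= EPS)

# Even the smallest possible difference (one count off by one) is too far.
print(euclidean([1, 1], [1, 2]) <= EPS)
```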

This approach isn't working. More likely we need to look into a way to cluster the homoglyph'd & typo'd domains together somehow. On to the next experiment (see `Homoglyph identification.ipynb`).