# Homoglyph identification

Round 2: Since so many of the domains in the Disneyland dataset appear to be homoglyph attacks, I think this analysis needs to start with decoding the homoglyphs.

In the earlier notebook we tried simple substitutions for the letters. I'd like to try something more general (though much more computationally intensive): one recurring idea I've seen is to render the entry to an image, then run something like OCR restricted to ASCII letters to see what it finds. Let's see if we can get that working here.

The idea here is to make a picture of the word, blur it slightly, and then OCR it back to ASCII text. The hope is that blurring the dots over/under the letters will make them look like noise on the letter, so the OCR discards them.

In [1]:
import io
import json
from PIL import Image, ImageDraw, ImageFont, ImageFilter
import pytesseract
from leven import levenshtein
from collections import defaultdict
from sklearn.cluster import DBSCAN
import numpy as np

In [2]:
INFILE = "disneyland_domains.json"

In [3]:
def get_domains():
    with open(INFILE) as infile:
        data = json.load(infile)
        domains = [entry["domain"] for entry in data["response"]["results"]]
    return domains


In [4]:
img = Image.new("RGB", (1000,50), (255,255,255))
font = ImageFont.truetype("Arial.ttf", 24)
drawer = ImageDraw.Draw(img)
drawer.text((5,5), "suncoastcrẹditunlọn.com", fill=(0,0,0), font=font)

In [5]:
pytesseract.image_to_string(img, lang="eng")

'suncoastcreditunlon.com\n'

In [6]:
im2 = img.filter(ImageFilter.GaussianBlur(1.0))

In [7]:
print(pytesseract.image_to_string(im2))

suncoastcreditunion.com



That's perfect. It is doing exactly what I want: translating the "l" to an "i" and dropping the dots over and under letters. Let's see about applying that to the whole dataset.

In [8]:
def translate_domain(domain):
    img = Image.new("RGB", (1000,50), (255,255,255))
    font = ImageFont.truetype("Arial.ttf", 24)
    drawer = ImageDraw.Draw(img)
    drawer.text((5,5), domain, fill=(0,0,0), font=font)
    im2 = img.filter(ImageFilter.GaussianBlur(1.0))
    return pytesseract.image_to_string(im2)


In [9]:
raw_domains = get_domains()
fixed_domains = list()
for entry in raw_domains:
    translated = translate_domain(entry)
    translated = translated.strip()
    if not translated:
        fixed_domains.append(entry)
    else:
        fixed_domains.append(translated)

In [10]:
fixed_domains[-10:]

['experss53.com',
 'expness53.com',
 'exprass53.com',
 'expres53.com',
 'exprses53.com',
 'exqress53.com',
 'exrpess53.com',
 'zeero0ze.com',
 'zionshamk.com',
 'zlonshank.com']

Much better. We've at least got these down to typos now, rather than Unicode homoglyphs, which is a great start. Now, let's try clustering the typo domains.

It's worth noting that this takes a few minutes to go through all the domains, so this isn't a great solution for the case where you have many thousands of these, or where you need to do them in near real time. 
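
For larger or near-real-time workloads, a much cheaper first pass might be worth trying before falling back to OCR. The following is a sketch of a different technique, not what this notebook does: it uses Unicode NFKD decomposition from the standard library to split accented characters into a base letter plus combining marks, then drops the marks. `strip_marks` is a hypothetical helper name.

```python
import unicodedata


def strip_marks(domain):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", domain)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_marks("suncoastcrẹditunlọn.com"))  # prints suncoastcreditunlon.com
```

This only removes the dots over and under letters. It does not turn the "l" back into an "i", and characters from other scripts that have no decomposition pass through unchanged, so it would complement the OCR step rather than replace it.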

Anyway, with that fixed, on to clustering. Last time, we tried this with just word counts, and that didn't work. This time, we'll use the Levenshtein distance (edit distance) as the clustering metric, so domains that need only a few single-character changes to turn one into the other are considered "close", while ones that need more changes are considered "far" apart.
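
To make the metric concrete, here is a minimal pure-Python sketch of the standard dynamic-programming recurrence for edit distance. The notebook itself uses the `leven` package; `edit_distance` is a hypothetical name used only for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes, substitutions."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(edit_distance("zionshamk", "zlonshank"))  # 2: i->l and m->n
print(edit_distance("expres53", "exprses53"))   # 1: one inserted "s"
```

The two example pairs are taken from the translated domains listed above.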

In [11]:
def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices, see comment about X below
    # pull just the registrable domain name, not the eTLD
    first = fixed_domains[i].split(".", 1)[0]
    second = fixed_domains[j].split(".", 1)[0]
    return levenshtein(first, second)

In [12]:
levenshtein(fixed_domains[-1].split(".", 1)[0], fixed_domains[-2].split(".", 1)[0])

2

In [13]:
# The clustering input is just an array of indexes into fixed_domains;
# lev_metric looks up the strings by index to do the comparisons.
X = np.arange(len(fixed_domains)).reshape(-1, 1)
db = DBSCAN(eps=2, metric=lev_metric, min_samples=2).fit(X)

It's worth mentioning that we chose `eps` intentionally to be an edit distance of 2, so a domain joins a cluster if it is within 2 edits of some domain already in it. Because DBSCAN chains neighbors together, two domains in the same cluster can end up more than 2 edits apart overall.
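
The effect of `eps` can be sketched without scikit-learn. With `min_samples=2`, every point that has at least one neighbor within `eps` counts as a core point (the count includes the point itself), so the clusters are the connected components of the "within `eps` edits" graph. The code below is an illustration under that assumption, not a replacement for `DBSCAN`; `chain_clusters` and the sample names are made up for the example.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance via the usual row-by-row recurrence."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def chain_clusters(words, eps=2):
    """Group words into connected components of the 'within eps edits' graph."""
    parent = list(range(len(words)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(words)), 2):
        if edit_distance(words[i], words[j]) <= eps:
            parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i, word in enumerate(words):
        groups.setdefault(find(i), []).append(word)
    # Singleton groups correspond to what DBSCAN would label as noise (-1).
    return sorted(groups.values())


names = ["usbank", "usbamk", "usbamh", "ushamh", "zeero0ze"]
print(edit_distance("usbank", "ushamh"))  # 3, which is more than eps
print(chain_clusters(names))
# [['usbank', 'usbamk', 'usbamh', 'ushamh'], ['zeero0ze']]
```

Here "usbank" and "ushamh" are 3 edits apart, yet they land in the same group because each step along the chain is a single edit.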

In [14]:
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)

Estimated number of clusters: 57
Estimated number of noise points: 37


This is more like it: only 37 domains in the noise category, and 57 clusters. Let's look at them.

In [15]:
translated = defaultdict(list)
for label, entry in zip(labels, raw_domains):
    translated[int(label)].append(entry)

In [19]:
key = 12
translated[key]

['cwbanb.com',
 'cwhanb.com',
 'cwhanh.com',
 'cwhank.com',
 'cwkanh.com',
 'cwkank.com',
 'eqbanb.com',
 'eqhanh.com',
 'eqhank.com',
 'tdhank.com',
 'ụșbamh.com',
 'ụșbamk.com',
 'ụșbanh.com',
 'ụșbbanh.com',
 'ụșbhanh.com',
 'ụșhaank.com',
 'ụșhamk.com',
 'ușbhạnk.com',
 'ușbạạnk.com',
 'ușhạmk.com',
 'ușhạnk.com',
 'usbhạnk.com',
 'ushaạnk.com']

In [20]:
key=55
translated[key]

['sẹotiaomline-sẹotiahank.com',
 'sẹotiaonline-sẹotiahank.com',
 'sẹotiaonline-sẹotlahank.com']

Beautiful. That works really well. The scotiaonline-scotiabank domains clustered separately from the scotiabank-only ones, but for this exercise that's going to be really hard to fix. (Handling both typos and internal word similarity at once will be hard.) I'm going to declare victory on this exercise and move on.

Let's export this back out so that we can hand it back to Clay.

In [21]:
# Use a context manager so the file is flushed and closed after writing.
with open("disney-clusters.json", "w") as outfile:
    json.dump(translated, outfile)
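
One thing to keep in mind when reading the file back: JSON object keys are always strings, so the integer cluster labels (including `-1` for noise) come back as `"12"`, `"-1"`, and so on. The following is a small sketch of the round trip using an in-memory buffer; the `clusters` dictionary is illustrative sample data, not the actual cluster assignments.

```python
import io
import json

# Illustrative data only: label -1 stands for noise, 12 for some cluster.
clusters = {-1: ["example.com"], 12: ["cwhank.com", "tdhank.com"]}

buffer = io.StringIO()
json.dump(clusters, buffer)  # integer keys are written as strings
buffer.seek(0)

loaded = json.load(buffer)   # keys are now "-1" and "12"
restored = {int(label): domains for label, domains in loaded.items()}

print(sorted(loaded))        # ['-1', '12']
print(restored == clusters)  # True
```

By default `json.dump` also escapes non-ASCII characters as `\uXXXX` sequences, which round-trip correctly; passing `ensure_ascii=False` (and opening the file with `encoding="utf-8"`) would keep the homoglyph domains readable in the file itself.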