# Detecting Phishing Domains with UniSim

This colab shows how to use UniSim near-duplicate text embeddings to find lookalike phishing domains.

This technique can be used to monitor Certificate Transparency logs and find lookalike phishing domains in near real time. If you are interested in an end-to-end implementation of this idea, you can find a complete server-side and client-side implementation in the [CThunter GitHub repository](https://github.com/ebursztein/cthunter).

For this colab, we are going to use a static list of domains and fake domains to demonstrate the underlying technical concepts.

In [1]:
import logging, os
logging.getLogger('tensorflow').disabled = True
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" # ignore TensorFlow debugging info

from time import time
import json
from unisim import TextSim
from tabulate import tabulate

## Loading Data

First, we load made-up data that simulates our use case: checking whether the domains seen in the logs look like the domains we are monitoring.

The `domains.json` file contains two lists:

- **domains**: the list of domains being monitored
- **logs**: the made-up domains seen in the logs


In [2]:
with open('./data/domains.json', encoding='utf-8') as f:  # the file contains non-ASCII lookalike characters
    data = json.load(f)
domains = data['domains']
logs = data['logs']
print('Some Domains:', domains[:5])
print("Some logs:", logs[:5])

Some Domains: ['google', 'facebook', 'twitter', 'linkedin', 'instagram']
Some logs: ['gooolgle', 'g00gle', 'g𝙤ogl', 'googglee', 'google']


## How UniSim Embeddings Work

UniSim text embeddings leverage the RETSim model (a tiny transformer model trained with metric learning) to embed texts into vector representations that can be compared using cosine similarity.
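To make the comparison step concrete, here is a minimal sketch of cosine similarity in plain Python. The vectors are made-up toy values rather than real RETSim embeddings, and `cosine_similarity` is an illustrative helper, not part of the UniSim API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (hypothetical values, for illustration only)
v_domain = [0.9, 0.1, 0.3]
v_lookalike = [0.8, 0.2, 0.3]
v_unrelated = [0.1, 0.9, 0.0]

# Vectors pointing in a similar direction score close to 1
print(cosine_similarity(v_domain, v_lookalike))
# Vectors pointing in different directions score much lower
print(cosine_similarity(v_domain, v_unrelated))
```

The score depends only on the direction of the vectors, not their length, which is why it works well for comparing embeddings.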

Let's demonstrate how this works in practice to get a sense of the values returned.

In [3]:
ts = TextSim()  # init UniSim text similarity

test_domain = "Google"
test_logs = ["g𝙤ogle", 'googlee', 'loogle', "thisisnotamatch"]
for log in test_logs:
    s = ts.similarity(test_domain, log)
    print(f"Similarity between {test_domain} and {log} is {s}")



Similarity between Google and g𝙤ogle is 0.8847256302833557
Similarity between Google and googlee is 0.894428014755249
Similarity between Google and loogle is 0.7688487768173218
Similarity between Google and thisisnotamatch is 0.4283703863620758


As the example above shows, the farther apart two domains are, the lower their similarity, so we need a threshold to decide what counts as a match. 0.85 is usually a good starting point until you calibrate the value on your own data.
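As a small illustration of the thresholding step, the sketch below applies a 0.85 cutoff to the four scores printed above (rounded to four decimals). The `is_lookalike` helper is hypothetical, and treating the boundary as inclusive (`>=`) is an assumption rather than documented UniSim behavior.

```python
# Similarity scores copied from the output above, rounded to 4 decimals
scores = {
    "g𝙤ogle": 0.8847,
    "googlee": 0.8944,
    "loogle": 0.7688,
    "thisisnotamatch": 0.4284,
}

match_threshold = 0.85  # starting point; calibrate on your own data

def is_lookalike(similarity, threshold=match_threshold):
    """Flag an entry when its similarity reaches the threshold (inclusive by assumption)."""
    return similarity >= threshold

flagged = [name for name, s in scores.items() if is_lookalike(s)]
print(flagged)
```

With this cutoff, the first two entries are flagged while `loogle` and `thisisnotamatch` are not. Lowering the threshold catches more variants at the cost of more false positives.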

## Indexing and Searching

A key issue with traditional text similarity algorithms based on edit distance is that they have $O(N^2)$ complexity, which makes them prohibitively expensive to compute at scale.
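For reference, here is a minimal sketch of the classic dynamic-programming Levenshtein distance, one common edit-distance algorithm. It is not part of UniSim and the function name is illustrative. The nested loops visit one cell per pair of characters, which is where the quadratic cost of each comparison comes from.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep when equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("google", "googlee"))
print(levenshtein("google", "g00gle"))
```

Each incoming log entry would also have to be compared against every monitored domain this way, since nothing can be precomputed per domain.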

However, like other text embeddings, UniSim transforms strings into vector representations and we can take advantage of this in multiple ways to speed up computation:

1. We can pre-compute the domain embeddings and only compute the log domains' embeddings as they arrive, saving computation time.
2. We can leverage a GPU to compute domain embeddings in batches.
3. If the list of domains becomes very large, we can use an Approximate Nearest Neighbor (ANN) algorithm to search in sub-linear time.
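The first idea, embedding the monitored domains once and only embedding log entries as they arrive, can be sketched in plain Python. Everything here is an assumption made for illustration: `toy_embed` uses normalized character-bigram counts as a stand-in for the RETSim model, and `dot`, `index`, and `nearest` are hypothetical names rather than UniSim APIs. The search is exact brute force, with no GPU batching or ANN involved.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: L2-normalized character-bigram counts.
    NOT the RETSim model; it only illustrates the indexing flow."""
    padded = f"^{text.lower()}$"
    counts = Counter(padded[i:i + 2] for i in range(len(padded) - 1))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()}

def dot(a, b):
    """Dot product of two sparse vectors; this equals cosine
    similarity because both vectors are already normalized."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(gram, 0.0) for gram, w in a.items())

# Embed the monitored domains once, ahead of time.
monitored = ["google", "facebook", "twitter"]
index = [(name, toy_embed(name)) for name in monitored]

def nearest(log_entry):
    """Embed only the incoming entry, then compare it against
    every precomputed vector and return the best (name, score)."""
    query = toy_embed(log_entry)
    return max(((name, dot(query, vec)) for name, vec in index),
               key=lambda pair: pair[1])

for entry in ["googlee", "twiter"]:
    print(entry, nearest(entry))
```

The cost per log entry is one embedding plus one cheap dot product per monitored domain, and the domain embeddings are never recomputed.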

Here our list of domains is quite small, so we only use batching and an exact SIMD-accelerated search. To do so, we perform three steps:

1. Index our domains
2. Create a batch of logs
3. Look through the results to find matches

In [4]:
match_threshold = 0.85  # 0.8-0.9 is typically a good similarity threshold for matching texts
ts.reset_index()  # reset the index in case the cell is run multiple times
idxs = ts.add(domains)  # index the domains we want to monitor

In [5]:
start = time()
results = ts.search(logs, k=1, similarity_threshold=match_threshold)  # match the logs against the dataset
print(f"Search took {time() - start:.2f}s")

# collect matching results
matches = []
not_matches = []   # just to show what it looks like
for result in results.results:
    log_match = result.query_data
    for match in result.matches:
        if match.is_match:
            matches.append((log_match, match.data, match.similarity))
        else:
            not_matches.append((log_match, match.data, match.similarity))
print(tabulate(matches, headers=['Log', 'Match', 'Similarity']))

Search took 0.12s
Log        Match       Similarity
---------  --------  ------------
gooolgle   google        0.88643
g𝙤ogl      google        0.875504
googglee   google        0.936634
google     google        1
goolgle    google        0.887561
thegoogle  google        0.901916
go0gl3     google        0.878335
bookface   facebook      0.952432
faceb0ok   facebook      0.965319
twiter     twitter       0.954167
twittter   twitter       0.989152
twiiter    twitter       0.935445
Eebay      ebay          0.917119
ebayy      ebay          0.951802
eb4y       ebay          0.899173
amazoon    amazon        0.979939
ammazonn   amazon        0.898798


In [6]:
# let's display some non-matches to show what it looks like
print(tabulate(not_matches, headers=['Log', 'Match', 'Similarity']))

Log           Match        Similarity
------------  ---------  ------------
g00gle        google         0.787034
goolge        google         0.741084
goog-le.net   google         0.84237
G00gleLogin   google         0.701801
faecbook      facebook       0.849864
faecbook      facebook       0.849864
f@ceb00k.bl   facebook       0.689276
faecbo-ok     facebook       0.798683
myfaecbook    facebook       0.739695
ficebo0k.com  facebook       0.781612
amazin        amazon         0.721841
theamazon     amazon         0.839458
myamazin      amazon         0.609729
tw1tt3r       twitter        0.74809
3b4y          ebay           0.791277
paypall       apple          0.607747
palpay        apple          0.529317
p4yoal        youtube        0.605841
p4ypal        apple          0.620593
p4ypa1        pinterest      0.572656
p4yp4l        pinterest      0.574713
Amazz0n       amazon         0.730577
amaz00n       amazon         0.791594
amz𝙤n         amazon         0.822388
