# Detecting phishing domains with UniSim

This colab shows how to use UniSim near-duplicate text embeddings to find lookalike phishing domains.

This technique can be used to monitor Certificate Transparency logs for lookalike phishing domains in near real time. If you are interested in an end-to-end implementation of this idea, a complete server-side and client-side implementation is available in the [cthunter GitHub repository](https://github.com/ebursztein/cthunter).

For this colab we are going to use a static list of domains and fake domains to demonstrate the underlying technical concepts.


In [1]:
from time import time
import json
from unisim import TextSim
from tabulate import tabulate




## Loading data

First, we load made-up data that simulates our use case: checking whether the domains present in the logs look like the domains we are monitoring.

The `domains.json` file contains two lists:

- **domains**: this is a list of domains that are monitored
- **logs**: those are the made-up domains seen in the logs


In [2]:
# open the file in a context manager so it is closed once parsed
with open('./data/domains.json', encoding='utf-8') as f:
    data = json.load(f)
domains = data['domains']
logs = data['logs']
print('Some Domains:', domains[:5])
print("Some logs:", logs[:5])

Some Domains: ['google', 'facebook', 'twitter', 'linkedin', 'instagram']
Some logs: ['gooolgle', 'g00gle', 'g𝙤ogl', 'googglee', 'google']


## How UniSim embeddings work

The UniSim near-duplicate embedding model is a very small transformer that leverages the RetVec encoder and metric learning to compute vectors that let us tell how close two strings are to each other.

Let's demonstrate how this works in practice to get a sense of the values returned.

In [3]:
ts = TextSim()  # init UniSim text similarity
test_domain = "Google"
test_logs = ["g𝙤ogle", 'googlee', 'loogle', "thisisnotamatch"]
for log in test_logs:
    s = ts.similarity(test_domain, log)
    print(f"Similarity between {test_domain} and {log} is {s}")

Similarity between Google and g𝙤ogle is 0.8848875164985657
Similarity between Google and googlee is 0.8945294618606567
Similarity between Google and loogle is 0.7687983512878418
Similarity between Google and thisisnotamatch is 0.4284525513648987


As the example above shows, the farther apart the domains are, the lower the similarity, so we need a threshold to decide what counts as a match. 0.85 is usually a good starting point until you calibrate the value on your own data.
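
To make the effect of the threshold concrete, here is a minimal sketch that reuses the four similarity scores printed above and counts how many candidates would be flagged at a few cut-offs. The helper name `flagged_at` is hypothetical, and treating scores at or above the threshold as matches is an assumption of this sketch, not necessarily how UniSim applies it internally.

```python
# Similarity scores copied from the output above (query: "Google")
scores = {
    "g𝙤ogle": 0.8848875164985657,
    "googlee": 0.8945294618606567,
    "loogle": 0.7687983512878418,
    "thisisnotamatch": 0.4284525513648987,
}


def flagged_at(threshold):
    """Return the candidates whose score is at or above the threshold (assumed rule)."""
    return [name for name, s in scores.items() if s >= threshold]


# A lower threshold flags more candidates, a higher one flags fewer
for t in (0.75, 0.85, 0.90):
    print(t, flagged_at(t))
```

On this tiny sample, 0.75 also flags `loogle`, while 0.90 flags nothing, which is the kind of trade-off to examine on your own data before settling on a value.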

## Indexing and searching

A key issue with traditional text similarity algorithms is that comparing every pair of strings has $O(N^2)$ complexity, which makes them prohibitively expensive to run at scale.

However, UniSim transforms strings into vector representations, which we can take advantage of in multiple ways to speed up computation:
1. We can pre-compute the embeddings of the monitored domains and only compute the embeddings of the log domains as they arrive, saving computation time.
2. We can batch the embedding computation on a GPU to benefit from GPU parallelization.
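3. If the list of domains becomes very large, we can use an Approximate Nearest Neighbor (ANN) algorithm to search in sub-linear time per query (roughly $O(\log N)$ for common graph-based indexes), compared with $O(N)$ for an exact scan.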
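
The first idea, embedding the monitored domains once and embedding only new entries at query time, can be sketched without UniSim at all. The code below is a minimal illustration, assuming a hypothetical `toy_embed` function (normalized character-bigram counts) as a stand-in for the real model. The names `toy_embed`, `cosine`, and `best_match` are illustrative and not part of the UniSim API.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in embedding: L2-normalized character-bigram counts."""
    text = text.lower()
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {gram: v / norm for gram, v in counts.items()}


def cosine(a, b):
    # Both vectors are normalized, so the dot product is the cosine similarity
    return sum(w * b.get(gram, 0.0) for gram, w in a.items())


monitored = ["google", "facebook", "twitter"]

# Step 1: embed the monitored domains once, ahead of time
index = {domain: toy_embed(domain) for domain in monitored}


def best_match(log_entry):
    # Step 2: only the incoming entry is embedded at query time
    query = toy_embed(log_entry)
    return max(
        ((domain, cosine(query, vec)) for domain, vec in index.items()),
        key=lambda pair: pair[1],
    )


name, score = best_match("googlee")
print(name, round(score, 3))
```

Each query here still scans every indexed vector, which is the exact-search case; an ANN index would replace that scan once the list of monitored domains grows large.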

Here our list of domains is quite small, so we are only going to use batching and an exact SIMD-accelerated search. To do so, we perform three steps:

1. Index our domains
2. Create a batch of logs
3. Look through the results to find matches

In [4]:
match_threshold = 0.85
ts.reset_index()   # reset index in case the cell is run multiple times
idxs = ts.add(domains)  # index the domains we want to monitor

In [9]:
ts.index_size

26

In [15]:
start = time()
results = ts.search(logs, k=2, similarity_threshold=match_threshold)  # match the logs against the dataset
print(f"Search took {time() - start:.2f}s")
# collect the matching results
matches = []
not_matches = []   # just to show what it looks like
for result in results.results:
    log_match = result.query_data
    for match in result.matches:
        if match.is_match:
            matches.append((log_match, match.data, match.similarity))
        else:
            not_matches.append((log_match, match.data, match.similarity))
print(tabulate(matches, headers=['Log', 'Match', 'Similarity']))

Search took 0.17s
Log        Match       Similarity
---------  --------  ------------
gooolgle   google        0.886348
g𝙤ogl      google        0.875465
googglee   google        0.936674
google     google        1
goolgle    google        0.887597
thegoogle  google        0.901965
go0gl3     google        0.878208
bookface   facebook      0.952432
faceb0ok   facebook      0.965293
twiter     twitter       0.954171
twittter   twitter       0.98916
twiiter    twitter       0.935433
Eebay      ebay          0.917151
ebayy      ebay          0.951734
eb4y       ebay          0.899255
amazoon    amazon        0.979886
ammazonn   amazon        0.898796
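
Note that the table includes `google` matching `google` with a similarity of 1. One possible post-processing step, sketched below on a few rows copied from the table, is to drop entries that are identical to the monitored name and rank the rest by similarity. This assumes an exact match corresponds to the legitimate domain itself, which may not hold in every deployment; the variable name `suspicious` is illustrative.

```python
# A few (log, match, similarity) rows copied from the table above
matches = [
    ("googglee", "google", 0.936674),
    ("google", "google", 1.0),
    ("faceb0ok", "facebook", 0.965293),
    ("eb4y", "ebay", 0.899255),
]

# Drop entries identical to the monitored name (assumed legitimate),
# then rank the remaining lookalikes from most to least similar
suspicious = sorted(
    (m for m in matches if m[0].lower() != m[1].lower()),
    key=lambda m: m[2],
    reverse=True,
)

for log, domain, sim in suspicious:
    print(f"{log:10} looks like {domain:10} ({sim:.3f})")
```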


In [7]:
# let's display some non-matches to show what it looks like for the wrong pairs
print(tabulate(not_matches, headers=['Log', 'Match', 'Similarity']))

Log           Match        Similarity
------------  ---------  ------------
gooolgle      facebook       0.533099
g00gle        google         0.787087
g00gle        apple          0.509571
g𝙤ogl         facebook       0.477082
googglee      facebook       0.538006
google        facebook       0.557381
goolgle       facebook       0.525342
goolge        google         0.741205
goolge        facebook       0.526553
goog-le.net   google         0.842378
goog-le.net   facebook       0.499355
thegoogle     facebook       0.526736
G00gleLogin   google         0.701787
G00gleLogin   youtube        0.461425
go0gl3        facebook       0.549299
bookface      google         0.58106
faecbook      facebook       0.849847
faecbook      google         0.555835
faecbook      facebook       0.849847
faecbook      google         0.555835
faceb0ok      tiktok         0.510899
f@ceb00k.bl   facebook       0.689414
f@ceb00k.bl   ebay           0.564488
faecbo-ok     facebook       0.798672
faecbo-ok    