# <font style="color:#008fff;">Adding Innovation</font>
<hr>

### This part demonstrates the innovative twist we implement. For every sample URL that is classified as malicious, we compare its domain name against the domain names in a separate dataset containing only legitimate URLs, found here: https://www.kaggle.com/datasets/peopledatalabssf/free-7-million-company-dataset?resource=download

### By doing this, we can predict whether a malicious URL is disguising itself as a legitimate entity, a common tactic cybercriminals use to trick victims into clicking a malicious link.

In [2]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import spacy

import time
import os
import warnings
import sys
import random
import pickle
import collections

from tld import get_tld

# Disabling warnings
warnings.filterwarnings('ignore')

## <font style="color:#008fff;">Reading in the dataframe of legitimate URLs and preprocessing it:</font>

### Download the companies_sorted.csv file from the link above, since we had issues uploading the file to GitHub

In [3]:
legit_companies = pd.read_csv('Dataset/Company Names/companies_sorted.csv')

In [4]:
legit_companies.head()

| | Unnamed: 0 | name | domain | year founded | industry | size range | locality | country | linkedin url | current employee estimate | total employee estimate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5872184 | ibm | ibm.com | 1911.0 | information technology and services | 10001+ | new york, new york, united states | united states | linkedin.com/company/ibm | 274047 | 716906 |
| 1 | 4425416 | tata consultancy services | tcs.com | 1968.0 | information technology and services | 10001+ | bombay, maharashtra, india | india | linkedin.com/company/tata-consultancy-services | 190771 | 341369 |
| 2 | 21074 | accenture | accenture.com | 1989.0 | information technology and services | 10001+ | dublin, dublin, ireland | ireland | linkedin.com/company/accenture | 190689 | 455768 |
| 3 | 2309813 | us army | goarmy.com | 1800.0 | military | 10001+ | alexandria, virginia, united states | united states | linkedin.com/company/us-army | 162163 | 445958 |
| 4 | 1558607 | ey | ey.com | 1989.0 | accounting | 10001+ | london, greater london, united kingdom | united kingdom | linkedin.com/company/ernstandyoung | 158363 | 428960 |


In [5]:
legit_domain_names = legit_companies[['name', 'domain']]

In [6]:
legit_domain_names

| | name | domain |
|---|---|---|
| 0 | ibm | ibm.com |
| 1 | tata consultancy services | tcs.com |
| 2 | accenture | accenture.com |
| 3 | us army | goarmy.com |
| 4 | ey | ey.com |
| ... | ... | ... |
| 7173421 | certiport vouchers | certiportvouchers.com |
| 7173422 | black tiger fight club | blacktigerclub.com |
| 7173423 | catholic bishop of chicago | NaN |
| 7173424 | medexo robotics ltd | NaN |


Dropping NA values:

In [7]:
legit_domain_names.isnull().sum()

name            3
domain    1650621
dtype: int64

In [8]:
print(f'Before {len(legit_domain_names)}')
legit_domain_names = legit_domain_names.dropna()
print(f'After {len(legit_domain_names)}')

Before 7173426
After 5522803


In [9]:
legit_domain_names

| | name | domain |
|---|---|---|
| 0 | ibm | ibm.com |
| 1 | tata consultancy services | tcs.com |
| 2 | accenture | accenture.com |
| 3 | us army | goarmy.com |
| 4 | ey | ey.com |
| ... | ... | ... |
| 7173416 | fit plus s.r.o. | fitplus.sk |
| 7173417 | coriex srl | coriex.it |
| 7173421 | certiport vouchers | certiportvouchers.com |
| 7173422 | black tiger fight club | blacktigerclub.com |


In [10]:
def extract_domain_names(url):
    fragment = url.split('.')
    return fragment[0]

In [11]:
domain_names = legit_domain_names['domain'].map(extract_domain_names)

In [12]:
legit_domain_names['domain no tld'] = list(domain_names)

In [13]:
legit_domain_names.drop('domain', axis=1, inplace=True)
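
The `split('.')[0]` approach works for bare domains such as `ibm.com`, but it returns the subdomain whenever one is present. The small, standard-library-only illustration below shows both cases. The first two inputs come from the dataset preview; the last two hostnames are made up for the example.

```python
def extract_domain_names(url):
    # Same logic as the helper above: keep everything before the first dot
    fragment = url.split('.')
    return fragment[0]

# 'ibm.com' and 'goarmy.com' appear in the dataset preview; the last two
# hostnames are hypothetical inputs used to show the edge cases.
examples = ['ibm.com', 'goarmy.com', 'www.example.com', 'shop.example.co.uk']
extracted = {domain: extract_domain_names(domain) for domain in examples}

for domain, name in extracted.items():
    print(f'{domain} -> {name}')
```

If the dataset turns out to contain entries with subdomains, the `tld` package this notebook already imports may be a more robust option: the object returned by `get_tld(..., as_object=True)` exposes a `domain` attribute, although bare domains without a scheme would likely need `fix_protocol=True`.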

## <font style="color:#008fff;">Dataset of Malicious and Benign Webpages</font>

### Reading in a sample of data points from the main Dataset of Malicious and Benign Webpages (the original dataset we preprocessed, trained on, and tested on in the previous 2 notebooks). We classify this sample with the models trained and tested in notebook 2.

In [17]:
# PREPROCESSING HELPER FUNCTIONS
from alt_profanity_check import predict_prob, predict
from urllib.parse import urlparse
from tld import get_tld

# Clamp transformation: pull outliers back to the 1.5 * IQR fences instead of dropping them
def find_outliers_IQR(df):
    q1=df.quantile(0.25)
    q3=df.quantile(0.75)
    IQR=q3-q1
    
    for index, val in df.items():
        if val < (q1 - 1.5 * IQR): # Small outliers below lower quartile
            df[index] = (q1 - 1.5 * IQR)
        elif val > (q3 + 1.5 * IQR): # Large outliers above upper quartile
            df[index] = (q3 + 1.5 * IQR)

    return df

# If tld == gov, then is_gov_tld = 1, else is_gov_tld = 0
def make_gov_column(df):
    gov_col = []
    for index, val in df.items():
        if val == 'gov':
            gov_col.append(1)
        else:
            gov_col.append(0)
    return np.array(gov_col)


def clean_url(url):
    url_text = ""
    # Characters that are replaced by spaces so the URL splits into word-like tokens
    separators = str.maketrans({'?':' ','\\':' ','.':' ',';':' ','/':' ','\'':' '})
    try:
        domain = get_tld(url, as_object=True)
        url_parsed = urlparse(url)
        url_text = url_parsed.netloc.replace(domain.tld," ").replace('www',' ') +" "+ url_parsed.path+" "+url_parsed.params+" "+url_parsed.query+" "+url_parsed.fragment
        url_text = url_text.translate(separators)
        # str methods return new strings, so the results must be assigned back
        url_text = url_text.strip(' ').lower()
    except Exception:
        # get_tld raised, so url_text is still the empty string at this point
        url_text = url_text.translate(separators)
        url_text = url_text.strip(' ')
    return url_text

def predict_profanity(url_cleaned):
    arr=predict_prob(url_cleaned.astype(str).to_numpy())
    arr= arr.round(decimals=3)
    #df['url_vect'] = pd.DataFrame(data=arr,columns=['url_vect'])
    return arr

def preprocess(df_current):
    df = df_current.copy()
    
    start_time= time.time()

    # ------------ Address outliers via clamp transformation --------------
    url_len_clamped = df['url_len'].copy()
    url_len_clamped = find_outliers_IQR(url_len_clamped)
    js_len_clamped = df['js_len'].copy()
    js_len_clamped = find_outliers_IQR(js_len_clamped)
    js_obf_len_clamped = df['js_obf_len'].copy()
    js_obf_len_clamped = find_outliers_IQR(js_obf_len_clamped)
    
    df['url_len'] = url_len_clamped
    df['js_len'] = js_len_clamped
    df['js_obf_len'] = js_obf_len_clamped
    
    # --------------- Scaling numerical features ---------------
    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    
    url_len_scaled = scaler.fit_transform(df[['url_len']])
    df['url_len_scaled'] = url_len_scaled

    js_len_scaled = scaler.fit_transform(df[['js_len']])
    df['js_len_scaled'] = js_len_scaled

    js_obf_len_scaled = scaler.fit_transform(df[['js_obf_len']])
    df['js_obf_len_scaled'] = js_obf_len_scaled
    
    
    # ---------------- Binary Encoding for Categorical Attributes ------------------
    identifyWho_Is = {'incomplete': 0, 'complete': 1}
    df['who_is'] = [identifyWho_Is[item] for item in df.who_is]
    
    identifyHTTPS = {'no': 0, 'yes': 1}
    df.https = [identifyHTTPS[item] for item in df.https]
    
    # --------------- Handling TLD Column -------------------------
    gov_binary_val = make_gov_column(df['tld'])
    df.insert(2, column = "is_gov_tld", value=gov_binary_val)
    
    
    # ---------------- Probability-based profanity score on text columns ------------------
    from profanity_check import predict_prob, predict
    profanity_score_prob = predict_prob(np.array(df['content']))
    df.insert(5, column='profanity_score_prob', value=profanity_score_prob)
    
    
    # ------------------ Cleaning URLs --------------------
    url_cleaned = df['url'].map(clean_url)
    df.insert(1, column='url_cleaned', value=url_cleaned)
    url_vect = predict_profanity(df['url_cleaned'])
    df.insert(2, column='url_vect', value=url_vect)
    
    # ---------------------- Preprocess labels into binary values ----------------------
    identifyLabels = {'bad': 1, 'good': 0}
    df['label'] = [identifyLabels[item] for item  in df.label]
    
    # ------------ Drop unnecessary columns, and original columns that remain after preprocessing -------------
    df.drop(['geo_loc', 'ip_add', 'url_len', 'js_len', 'js_obf_len', 'tld', 'content', 'url', 'url_cleaned'], axis=1, inplace=True)
    
    # ---------------------- Rearrange Columns ----------------------
    titles = ['url_vect', 'is_gov_tld', 'who_is', 'https', 'profanity_score_prob', 
              'url_len_scaled', 'js_len_scaled','js_obf_len_scaled',
              'label'] # Same order as our training data

    df = df[titles] 
    
    print("***Elapsed time preprocess --- %s seconds ---***" % (time.time() - start_time))
    return df
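
The clamp in `find_outliers_IQR` pulls any value outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] back to the nearest fence. The sketch below reproduces that idea in plain Python on a made-up list of URL lengths. It uses `statistics.quantiles` with `method='inclusive'` as a stand-in for `Series.quantile`, which should match pandas' default linear interpolation; the function name `clamp_outliers_iqr` is hypothetical.

```python
import statistics

def clamp_outliers_iqr(values):
    """Clamp values outside the 1.5 * IQR fences to the nearest fence."""
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [min(max(v, lower), upper) for v in values]

url_lengths = [20, 22, 25, 27, 30, 31, 35, 400]  # hypothetical sample
clamped = clamp_outliers_iqr(url_lengths)
print(clamped)  # only the value 400 is pulled down to the upper fence
```

On a pandas Series, the same effect can be obtained without a Python loop through `Series.clip(lower, upper)`, which may be noticeably faster on the full dataset.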

In [18]:
def loadDataset(file_name, idx_col=False):
    start_time= time.time()
    if idx_col:
        df = pd.read_csv(file_name, index_col=[0])
    else:
        df = pd.read_csv(file_name)
    print("***Elapsed time to read csv files --- %s seconds ---***" % (time.time() - start_time))
    return df

df_test = loadDataset("Dataset/Webpages_Classification_test_data.csv", idx_col=True)

***Elapsed time to read csv files --- 10.864612102508545 seconds ---***


In [19]:
df_test = df_test.sample(13627)
df_test_preprocessed = preprocess(df_test)

***Elapsed time preprocess --- 3.7690954208374023 seconds ---***


### NOTE: We keep the original test data because, for any data point in our test set that our machine learning models consider malicious, we extract its domain name from the raw URL and compare it to the domain names in our dataframe of legitimate URLs

In [20]:
# Original test data
X_test = df_test.drop('label', axis=1)
y_test = df_test['label']

# Preprocessed version of test data
X_test_preprocessed = df_test_preprocessed.drop('label', axis=1)
y_test_preprocessed = df_test_preprocessed['label']

## <font style="color:#008fff;">Making Our Predictions</font>

### Reading in our models (in this notebook, we only test the optimized models trained on reduced features, since they have better accuracy)

In [21]:
def load_model(path):
    # Open inside a context manager so the file handle is closed after loading
    with open(path, 'rb') as f:
        return pickle.load(f)

knn_reduced_filename = 'Models/Optimized/knn_reduced_features_opt.sav' # Reduced feature set
knn_reduced = load_model(knn_reduced_filename)

gnb_reduced_filename = 'Models/Optimized/gnb_reduced_features_opt.sav'
gnb_reduced = load_model(gnb_reduced_filename)

dc_reduced_filename = 'Models/Optimized/dc_reduced_features_opt.sav'
dc_reduced = load_model(dc_reduced_filename)

rfc_reduced_filename = 'Models/Optimized/rfc_reduced_features_opt.sav'
rfc_reduced = load_model(rfc_reduced_filename)

In [22]:
# Function to have all 4 models make a majority vote
def vote_predictions(row):
    reduced_features_1 = ['who_is', 'https', 'profanity_score_prob', 'js_len_scaled', 'js_obf_len_scaled']
    knn_input = [row.loc[reduced_features_1]]
    
    reduced_features_2 = ['url_vect', 'is_gov_tld', 'js_obf_len_scaled']
    gnb_dc_rfc_input = [row.loc[reduced_features_2]]
    
    # Each model predicts
    preds = []
    preds.append(knn_reduced.predict(knn_input)[0])
    preds.append(gnb_reduced.predict(gnb_dc_rfc_input)[0])
    preds.append(dc_reduced.predict(gnb_dc_rfc_input)[0])
    preds.append(rfc_reduced.predict(gnb_dc_rfc_input)[0])
    
    vote_counts = collections.Counter(preds)
    
    if vote_counts[0] > vote_counts[1]:
        return 0
    elif vote_counts[0] < vote_counts[1]:
        return 1
    else: # If tie, randomly choose either one
        return random.choice([0, 1])


start_time = time.time()

majority_preds = []
potentially_risky_urls = []
for index, row in X_test_preprocessed.iterrows():
    # Take majority vote
    vote = vote_predictions(row)
    majority_preds.append(vote)
    
    if vote == 1: # A majority vote of 1 across the 4 models means the URL is potentially risky
        potentially_risky_urls.append(X_test.loc[index, 'url']) # Getting raw URL of risky URL
    
print("***Elapsed time to make predictions --- %s seconds ---***" % (time.time() - start_time))

***Elapsed time to make predictions --- 87.13893723487854 seconds ---***
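
The voting step itself reduces to counting labels. The sketch below isolates that logic from the pickled models so it can be checked on its own. The prediction lists are made up, and the optional `rng` parameter is an assumption added here (it is not part of the notebook) so that tied votes can be made repeatable with a seeded generator.

```python
import collections
import random

def majority_vote(preds, rng=random):
    """Return the most common binary label; break 2-2 ties randomly."""
    counts = collections.Counter(preds)
    if counts[0] > counts[1]:
        return 0
    if counts[0] < counts[1]:
        return 1
    return rng.choice([0, 1])

# Hypothetical per-model outputs in the order [knn, gnb, dc, rfc]
print(majority_vote([1, 1, 0, 1]))  # three of four models say malicious
print(majority_vote([0, 0, 0, 1]))  # three of four models say benign

# A seeded generator makes tied votes repeatable across runs
tie_a = majority_vote([1, 0, 1, 0], rng=random.Random(42))
tie_b = majority_vote([1, 0, 1, 0], rng=random.Random(42))
```

Because four voters can split 2-2, the unseeded random tie-break means the number of URLs flagged as risky can vary slightly from run to run.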


In [23]:
print(f'Number of URLs classified as malicious according to our 4 models\' votes: {len(potentially_risky_urls)}')

Number of URLs classified as malicious according to our 4 models' votes: 135


## <font style="color:#008fff;">Measuring Word Similarity Using Edit Distance</font>

Edit distance (also known as Levenshtein distance) is the minimum number of single-character operations (insert, delete, and replace) required to transform one string into another. For example, transforming "kitten" into "sitting" takes 3 operations, so the edit distance is 3:
 - Substitute 'k' with 's' ("sitten")
 - Substitute 'e' with 'i' ("sittin")
 - Insert 'g' at the end ("sitting")

In [24]:
# Function taking in 2 strings to return the edit distance. (DYNAMIC PROGRAMMING)
def calculate_edit_distance(str1, str2):
    len1 = len(str1)
    len2 = len(str2)

    # Create a 2D matrix to store the edit distances
    dp = [[0] * (len2 + 1) for _ in range(len1 + 1)]

    # Initialize the first row and column of the matrix
    for i in range(len1 + 1):
        dp[i][0] = i
    for j in range(len2 + 1):
        dp[0][j] = j

    # Compute the edit distances
    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            if str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])

    # Return the edit distance between the two strings
    return dp[len1][len2]

In [25]:
calculate_edit_distance('amazon', 'amzon')

1

In [26]:
# Extract domain name from our list of potentially risky urls
risky_urls_domains = []
for url in potentially_risky_urls:
    fragments = clean_url(url).split()
    if fragments: # clean_url returns an empty string when the URL cannot be parsed
        risky_urls_domains.append(fragments[0])

**NOTE: We only compare against the first 30,000 legitimate domain names. The full set of roughly 5.5 million took too long to process, given how long the edit distance calculation takes in the next couple of cells.**

In [27]:
safe_urls_domains = list(legit_domain_names['domain no tld'])[:30000]

## <font style="color:#008fff;">Calculate Edit Distance</font>

In [28]:
results = {'Risky Domain': [],
           'Shortest Edit Distance': [],
           'Potentially Disguising As': []
          }

start_time = time.time()

for i in risky_urls_domains:
    edit_dist_recorded = {}
    for j in safe_urls_domains:
        ed = calculate_edit_distance(i, j)
        edit_dist_recorded[j] = ed # Getting all edit distances 
    
    # With the corresponding i, find which j it has the shortest edit distance with
    j_with_lowest_dist = min(edit_dist_recorded, key=lambda k: edit_dist_recorded[k])
    lowest_dist = edit_dist_recorded[j_with_lowest_dist]
    
    # Appending to results
    results['Risky Domain'].append(i)
    results['Shortest Edit Distance'].append(lowest_dist)
    results['Potentially Disguising As'].append(j_with_lowest_dist)
    
print("***Elapsed time to compute edit distances --- %s seconds ---***" % (time.time() - start_time))

***Elapsed time to compute edit distances --- 183.1327612400055 seconds ---***
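
Since runtime is what limits the number of legitimate names we can compare against, one possible speed-up is sketched below. It is not the method used in this notebook. It rests on the fact that the difference in length between two strings is a lower bound on their edit distance, so a candidate whose length difference already reaches the best distance found so far can be skipped without running the dynamic program. The helper name `closest_domain` and the misspelled input are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance using two rows instead of a full matrix."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def closest_domain(risky, candidates):
    """Return (best_candidate, distance), skipping candidates that cannot win."""
    best, best_dist = None, None
    for candidate in candidates:
        # The length difference is a lower bound on the edit distance, so a
        # candidate whose bound already reaches the best distance cannot improve it.
        if best_dist is not None and abs(len(risky) - len(candidate)) >= best_dist:
            continue
        dist = edit_distance(risky, candidate)
        if best_dist is None or dist < best_dist:
            best, best_dist = candidate, dist
    return best, best_dist

safe = ['ibm', 'tcs', 'accenture', 'goarmy', 'ey']  # names from the preview above
print(closest_domain('accentuure', safe))  # 'accentuure' is a made-up typo domain
```

Like the loop above, this keeps the first candidate that reaches the minimum distance, so the result should match the full scan while doing less work once a close match has been found.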


In [29]:
results = pd.DataFrame(results)

### We filtered for all malicious domain names whose shortest edit distance is less than 3, and here are our results. Some of these malicious domain names may not actually be disguising themselves as a legitimate domain, but the matches give insight into how cybercriminals can trick users by imitating a legitimate entity's domain.

In [30]:
results[results['Shortest Edit Distance'] < 3]

| | Risky Domain | Shortest Edit Distance | Potentially Disguising As |
|---|---|---|---|
| 11 | mat | 1 | bat |
| 12 | ass | 1 | asu |
| 13 | gning | 2 | ing |
| 21 | abanet | 1 | amanet |
| 23 | hugill | 2 | mcgill |
| 24 | adamsmith | 2 | cdmsmith |
| 26 | merlyn | 2 | mercy |
| 29 | cancer | 0 | cancer |
| 30 | ncc-tu | 2 | nccu |
| 39 | bae | 1 | bat |
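
A fixed threshold treats a 1-character change in a 3-letter name the same as in a 9-letter name, which is one reason short names such as "mat" and "bat" show up in the results. One way to reduce these coincidental matches, sketched below as an assumption rather than something evaluated in this notebook, is to divide the edit distance by the length of the longer string. The cut-off of 0.25 is an arbitrary illustrative value, and `relative_distance` is a hypothetical helper.

```python
# (risky domain, shortest edit distance, closest legitimate name) from the table above
matches = [
    ('mat', 1, 'bat'), ('ass', 1, 'asu'), ('gning', 2, 'ing'),
    ('abanet', 1, 'amanet'), ('hugill', 2, 'mcgill'),
    ('adamsmith', 2, 'cdmsmith'), ('merlyn', 2, 'mercy'),
    ('cancer', 0, 'cancer'), ('ncc-tu', 2, 'nccu'), ('bae', 1, 'bat'),
]

MAX_RELATIVE_DISTANCE = 0.25  # hypothetical cut-off, not tuned on any data

def relative_distance(risky, distance, legit):
    """Edit distance as a fraction of the longer string's length."""
    return distance / max(len(risky), len(legit))

likely_lookalikes = [
    (risky, legit)
    for risky, distance, legit in matches
    if relative_distance(risky, distance, legit) <= MAX_RELATIVE_DISTANCE
]
print(likely_lookalikes)
```

With this cut-off, only the longer names remain (abanet, adamsmith, and the exact match cancer). An exact match may simply mean the same name registered under a different TLD, so it still needs manual review.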
