# Detecting Malicious URLs using Machine Learning<br>
Malicious URLs can be detected using lexical features of the URL string together with tokenization of the string. I aim to build a basic binary classifier that classifies URLs as malicious or benign.

Steps followed in building the machine learning classifier<br>
1. Data Preprocessing / Feature Engineering
2. Data Visualization
3. Building Machine Learning Models using Lexical Features.
4. Building Machine Learning Models using Lexical Features and Tokenization. (Will Update this part)

Importing The Dependencies

In [1]:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import os


In [2]:
df = pd.read_csv("./urldata.csv")

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,url,label,result
0,0,https://www.google.com,benign,0
1,1,https://www.youtube.com,benign,0
2,2,https://www.facebook.com,benign,0
3,3,https://www.baidu.com,benign,0
4,4,https://www.wikipedia.org,benign,0


In [4]:
#Removing the unnamed column as it is not necessary.
df = df.drop('Unnamed: 0',axis=1)

In [5]:
df.head()

Unnamed: 0,url,label,result
0,https://www.google.com,benign,0
1,https://www.youtube.com,benign,0
2,https://www.facebook.com,benign,0
3,https://www.baidu.com,benign,0
4,https://www.wikipedia.org,benign,0


In [6]:
df.shape

(450176, 3)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450176 entries, 0 to 450175
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   url     450176 non-null  object
 1   label   450176 non-null  object
 2   result  450176 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 10.3+ MB


Checking Missing Values

In [8]:
df.isnull().sum()

url       0
label     0
result    0
dtype: int64

No missing values in any column.

## 1. DATA PREPROCESSING

The following features will be extracted from the URL for classification. <br>
<ol>
    <li>Length Features
    <ul>
        <li>Length Of Url</li>
        <li>Length of Hostname</li>
        <li>Length Of Path</li>
        <li>Length Of First Directory</li>
        <li>Length Of Top Level Domain</li>
    </ul>
    </li>
    <br>
   <li>Count Features
    <ul>
    <li>Count Of  '-'</li>
    <li>Count Of '@'</li>
    <li>Count Of '?'</li>
    <li>Count Of '%'</li>
    <li>Count Of '.'</li>
    <li>Count Of '='</li>
    <li>Count Of 'http'</li>
    <li>Count Of 'www'</li>
    <li>Count Of Digits</li>
    <li>Count Of Letters</li>
    <li>Count Of Number Of Directories</li>
    </ul>
    </li>
    <br>
    <li>Binary Features
    <ul>
        <li>Use of IP or not</li>
        <li>Use of Shortening URL or not</li>
    </ul>
    </li>
    
</ol>
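
As a rough illustration of the features listed above, the sketch below extracts a handful of them from a single URL using only the standard library. The helper name `lexical_features` and the example URL are hypothetical and not part of this notebook, and falling back to an empty host and path for URLs that `urlparse` rejects is an assumption. The notebook itself builds each feature column by column with `df['url'].apply(...)`.

```python
from urllib.parse import urlparse

def lexical_features(url):
    """Return a few length and count features for one URL string."""
    try:
        parsed = urlparse(url)
        host, path = parsed.netloc, parsed.path
    except ValueError:
        # Assumption: treat URLs that urlparse rejects as having no host or path
        host, path = "", ""
    segments = path.split("/")
    return {
        "url_length": len(url),
        "hostname_length": len(host),
        "path_length": len(path),
        # First directory is the text between the first and second '/'
        "fd_length": len(segments[1]) if len(segments) > 1 else 0,
        "count-": url.count("-"),
        "count.": url.count("."),
        "count-digits": sum(ch.isdigit() for ch in url),
        "count-letters": sum(ch.isalpha() for ch in url),
        "count_dir": path.count("/"),
    }

# Hypothetical example URL
features = lexical_features("https://www.example.com/a-b/page1.html?id=7")
print(features)
```

Returning a dictionary per URL keeps the features together, which makes it easy to inspect a single suspicious URL before running the extraction over the whole dataset.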

Apart from the lexical features, we will use TF-IDF (Term Frequency-Inverse Document Frequency) as well.
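
To make the idea concrete, here is a minimal sketch of TF-IDF on URL tokens, written with the standard library only. The tokenization rule (split on anything that is not a letter or digit) and the plain `tf * log(N / df)` weighting are assumptions chosen for readability. A library implementation such as scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant by default, so its numbers would differ.

```python
import math
import re
from collections import Counter

def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens (an assumed tokenization)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

def tf_idf(urls):
    """Return one {token: weight} dict per URL using tf * log(N / df)."""
    docs = [tokenize(u) for u in urls]
    n_docs = len(docs)
    # Document frequency: number of URLs that contain each token at least once
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            tok: (cnt / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return weights

# Small illustrative sample; the third URL is made up
sample_urls = [
    "https://www.google.com",
    "https://www.wikipedia.org",
    "http://secure-login.example.com/verify",
]
scores = tf_idf(sample_urls)
print(scores[0])
```

Tokens that appear in many URLs, such as `https` or `www`, receive a lower weight than tokens that are specific to one URL, which is what makes the representation useful alongside the lexical counts.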

### 1.1 Length Features

In [9]:
!pip install tld






In [10]:
#Importing dependencies
from urllib.parse import urlparse
from tld import get_tld
import os.path

In [11]:
#Length of URL
df['url_length'] = df['url'].apply(lambda i: len(str(i)))

In [12]:
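#Hostname Length
#urlparse raises ValueError ("'.' does not appear to be an IPv4 or IPv6 address") on malformed hosts; use 0 for those URLs
def hostname_length(url):
    try:
        return len(urlparse(url).netloc)
    except ValueError:
        return 0

df['hostname_length'] = df['url'].apply(hostname_length)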

In [13]:
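#Path Length
#urlparse raises ValueError ("'.' does not appear to be an IPv4 or IPv6 address") on malformed hosts; use 0 for those URLs
def path_length(url):
    try:
        return len(urlparse(url).path)
    except ValueError:
        return 0

df['path_length'] = df['url'].apply(path_length)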

In [None]:
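#First Directory Length
def fd_length(url):
    try:
        # urlparse sits inside the try because it raises ValueError on malformed URLs
        return len(urlparse(url).path.split('/')[1])
    except (IndexError, ValueError):
        # Empty path (no first directory) or unparseable URL
        return 0

df['fd_length'] = df['url'].apply(fd_length)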

In [None]:
#Length of Top Level Domain
df['tld'] = df['url'].apply(lambda i: get_tld(i,fail_silently=True))
def tld_length(tld):
    # get_tld returns None when no TLD is found; mark those rows with -1
    try:
        return len(tld)
    except TypeError:
        return -1

df['tld_length'] = df['tld'].apply(tld_length)

In [None]:
df.head()

In [None]:
df = df.drop("tld", axis=1)

Dataset after extracting length features

In [None]:
df.sample(10)

### 1.2 Count Features

In [None]:
df['count-'] = df['url'].apply(lambda i: i.count('-'))

In [None]:
df['count@'] = df['url'].apply(lambda i: i.count('@'))

In [None]:
df['count?'] = df['url'].apply(lambda i: i.count('?'))

In [None]:
df['count%'] = df['url'].apply(lambda i: i.count('%'))

In [None]:
df['count.'] = df['url'].apply(lambda i: i.count('.'))

In [None]:
df['count='] = df['url'].apply(lambda i: i.count('='))

In [None]:
df['count-http'] = df['url'].apply(lambda i : i.count('http'))

In [None]:
df['count-https'] = df['url'].apply(lambda i : i.count('https'))

In [None]:
df['count-www'] = df['url'].apply(lambda i: i.count('www'))

In [None]:
def digit_count(url):
    digits = 0
    for i in url:
        if i.isnumeric():
            digits = digits + 1
    return digits
df['count-digits']= df['url'].apply(lambda i: digit_count(i))

In [None]:
def letter_count(url):
    letters = 0
    for i in url:
        if i.isalpha():
            letters = letters + 1
    return letters
df['count-letters']= df['url'].apply(lambda i: letter_count(i))

In [None]:
def no_of_dir(url):
    try:
        urldir = urlparse(url).path
    except ValueError:
        # Malformed URL that urlparse rejects: count no directories
        return 0
    return urldir.count('/')
df['count_dir'] = df['url'].apply(no_of_dir)

Data after extracting Count Features

In [None]:
df.head()

### 1.3 Binary Features

In [None]:
import re

In [None]:
#Use of IP or not in domain
def having_ip_address(url):
    # Three alternatives joined by '|': dotted IPv4, hexadecimal IPv4, and full-form IPv6
    match = re.search(
        '(([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.'
        '([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\/)|'  # IPv4
        '((0x[0-9a-fA-F]{1,2})\\.(0x[0-9a-fA-F]{1,2})\\.(0x[0-9a-fA-F]{1,2})\\.(0x[0-9a-fA-F]{1,2})\\/)|' # IPv4 in hexadecimal
        '(?:[a-fA-F0-9]{1,4}:){7}[a-fA-F0-9]{1,4}', url)  # IPv6
    # 1 if any of the patterns is found in the URL, otherwise 0
    return 1 if match else 0
df['use_of_ip'] = df['url'].apply(having_ip_address)
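
As an alternative to the regular expression, the host part of the URL can be handed to the standard library's `ipaddress` module. This is a sketch of a different technique, not the approach used in this notebook, and the function name `host_is_ip` is hypothetical. It only looks at the parsed host, so a URL without a scheme (where `urlparse` finds no host) and hexadecimal forms such as `0x7f.0x0.0x0.0x1` are reported as 0.

```python
import ipaddress
from urllib.parse import urlparse

def host_is_ip(url):
    """Return 1 if the URL's host parses as an IPv4 or IPv6 address, else 0."""
    try:
        # .hostname drops the port and the brackets around IPv6 literals
        host = urlparse(url).hostname
    except ValueError:
        # urlparse rejects some malformed URLs
        return 0
    if not host:
        return 0
    try:
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        return 0

print(host_is_ip("http://192.168.0.1/login"))
print(host_is_ip("https://www.google.com"))
```

Checking only the host avoids flagging URLs that merely contain an IP-like string somewhere in the path or query.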

In [None]:
def shortening_service(url):
    # Raw strings keep the regex escapes (\.) intact
    match = re.search(r'bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
                      r'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
                      r'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
                      r'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
                      r'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
                      r'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
                      r'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|'
                      r'tr\.im|link\.zip\.net',
                      url)
    if match:
        return 1
    else:
        return 0
df['short_url'] = df['url'].apply(lambda i: shortening_service(i))
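
One caveat with searching the whole URL string is that short patterns can match inside longer, unrelated domains. The sketch below shows the effect and a possible stricter check that compares only the parsed hostname against a set of known shortener domains. The name `is_shortened` and the set `SHORTENER_HOSTS` (a small subset of the domains in the regex above) are hypothetical and given for illustration only.

```python
import re
from urllib.parse import urlparse

# Illustrative subset of the shortener domains used in the regex above
SHORTENER_HOSTS = {"bit.ly", "goo.gl", "t.co", "tinyurl.com", "ow.ly", "is.gd"}

def is_shortened(url):
    """Return 1 if the URL's hostname is exactly a known shortener domain, else 0."""
    try:
        host = urlparse(url).hostname or ""
    except ValueError:
        # urlparse rejects some malformed URLs
        return 0
    host = host.lower()
    if host.startswith("www."):
        host = host[4:]
    return int(host in SHORTENER_HOSTS)

url = "https://www.microsoft.com/"
# Substring search: the pattern for t.co also matches inside "microsoft.com"
print(bool(re.search(r"t\.co", url)))  # True
# Hostname comparison does not flag it
print(is_shortened(url))  # 0
```

Whether the stricter check helps the classifier would need to be verified on the dataset; it is offered here only as a way to reduce accidental matches.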

Data after extracting Binary Features

In [None]:
df.head()

In [None]:
df[df['use_of_ip'] == 1].shape

## 2. Data Visualization

In [None]:
#Heatmap (numeric columns only; url and label are strings)
corrmat = df.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(25,19))
sns.heatmap(corrmat, square=True, annot = True, annot_kws={'size':10})

In [None]:
plt.figure(figsize=(15,5))
sns.countplot(x='label',data=df)
plt.title("Count Of URLs",fontsize=20)
plt.xlabel("Type Of URLs",fontsize=18)
plt.ylabel("Number Of URLs",fontsize=18)

In [None]:
print("Percent Of Malicious URLs:{:.2f} %".format(len(df[df['label']=='malicious'])/len(df['label'])*100))
print("Percent Of Benign URLs:{:.2f} %".format(len(df[df['label']=='benign'])/len(df['label'])*100))

In [None]:
import seaborn as sns

sns.set(style="darkgrid")
ax = sns.countplot(x="label", data=df, hue="use_of_ip")

The data shows a class imbalance to some extent.
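
One common way to compensate for an imbalance like this is to weight each class inversely to its frequency. The sketch below computes such weights as `n_samples / (n_classes * class_count)`, which is the formula behind scikit-learn's `class_weight='balanced'` option. The function name and the toy label list are hypothetical; the counts are not taken from this dataset, and class weighting is not applied anywhere in this notebook.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy labels for illustration only (8 benign, 2 malicious)
toy_labels = ["benign"] * 8 + ["malicious"] * 2
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rarer class receives the larger weight, so errors on it count for more during training.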

In [None]:
# plt.figure(figsize=(20,5))
# plt.hist(df['url_length'],bins=50)
# plt.title("URL-Length",fontsize=20)
# plt.xlabel("Url-Length",fontsize=18)
# plt.ylabel("Number Of Urls",fontsize=18)
# plt.ylim(0,1000)
#kdeplot replaces the deprecated distplot(hist=False)
sns.kdeplot(df[df['label']=='benign']['url_length'], label='benign')
sns.kdeplot(df[df['label']=='malicious']['url_length'], label='malicious')
plt.legend()


In [None]:
#kdeplot replaces the deprecated distplot(hist=False)
sns.kdeplot(df[df['label']=='benign']['hostname_length'], label='benign')
sns.kdeplot(df[df['label']=='malicious']['hostname_length'], label='malicious')
plt.legend()

In [None]:
plt.figure(figsize=(20,5))
plt.hist(df['hostname_length'],bins=50,color='Lightgreen')
plt.title("Hostname-Length",fontsize=20)
plt.xlabel("Length Of Hostname",fontsize=18)
plt.ylabel("Number Of Urls",fontsize=18)
plt.ylim(0,1000)

In [None]:
plt.figure(figsize=(20,5))
plt.hist(df['tld_length'],bins=50,color='Lightgreen')
plt.title("TLD-Length",fontsize=20)
plt.xlabel("Length Of TLD",fontsize=18)
plt.ylabel("Number Of Urls",fontsize=18)
plt.ylim(0,1000)

In [None]:
plt.figure(figsize=(15,5))
plt.title("Number Of Directories In Url",fontsize=20)
sns.countplot(x='count_dir',data=df)
plt.xlabel("Number Of Directories",fontsize=18)
plt.ylabel("Number Of URLs",fontsize=18)

In [None]:
plt.figure(figsize=(15,5))
plt.title("Number Of Directories In Url",fontsize=20)
sns.countplot(x='count_dir',data=df,hue='label')
plt.xlabel("Number Of Directories",fontsize=18)
plt.ylabel("Number Of URLs",fontsize=18)

In [None]:
plt.figure(figsize=(15,5))
#Draw the plot first so the title and axis labels set below are not overwritten
sns.countplot(x='use_of_ip',hue='label',data=df)
plt.title("Use Of IP In Url",fontsize=20)
plt.xlabel("Use Of IP",fontsize=18)
plt.ylabel("Number of URLs",fontsize=18)

In [None]:
plt.figure(figsize=(15,5))
#Draw the plot first so the title and axis labels set below are not overwritten
sns.countplot(x='count-http',hue='label',data=df)
plt.title("Use Of http In Url (y-axis limited to 1000)",fontsize=20)
plt.xlabel("Count Of http",fontsize=18)
plt.ylabel("Number of URLs",fontsize=18)
plt.ylim((0,1000))

In [None]:
plt.figure(figsize=(15,5))
#Draw the plot first so the title and axis labels set below are not overwritten
sns.countplot(x='count-http',hue='label',data=df)
plt.title("Use Of http In Url",fontsize=20)
plt.xlabel("Count Of http",fontsize=18)
plt.ylabel("Number of URLs",fontsize=18)

In [None]:
# plt.figure(figsize=(15,5))
# plt.title("Use Of WWW In URL",fontsize=20)
# plt.xlabel("Count Of WWW",fontsize=18)
# sns.countplot(df['count-www'])
# plt.ylim(0,1000)
# plt.ylabel("Number Of URLs",fontsize=18)

In [None]:
plt.figure(figsize=(15,5))
#Draw the plot first so the title and axis labels set below are not overwritten
sns.countplot(x='count-www',hue='label',data=df)
plt.title("Use Of WWW In URL",fontsize=20)
plt.xlabel("Count Of WWW",fontsize=18)
plt.ylabel("Number Of URLs",fontsize=18)
plt.ylim(0,1000)

## 3. Building Models Using Lexical Features Only

I will be using three models for my classification.
<br>1. Logistic Regression
<br>2. Decision Trees
<br>3. Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from sklearn.metrics import confusion_matrix,classification_report,accuracy_score

from sklearn.tree import DecisionTreeClassifier

from sklearn.linear_model import LogisticRegression



In [None]:
#Predictor Variables
x = df[['hostname_length',
       'path_length', 'fd_length', 'tld_length', 'count-', 'count@', 'count?',
       'count%', 'count.', 'count=', 'count-http','count-https', 'count-www', 'count-digits',
       'count-letters', 'count_dir', 'use_of_ip']]

#Target Variable
y = df['result']

In [None]:
x.shape

In [None]:
y.shape

In [None]:
#Splitting the data into Training and Testing
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=42)

In [None]:
#Decision Tree
dt_model = DecisionTreeClassifier()
dt_model.fit(x_train,y_train)

dt_predictions = dt_model.predict(x_test)
accuracy_score(y_test,dt_predictions)


In [None]:
print(confusion_matrix(y_test,dt_predictions))

In [None]:
cm = confusion_matrix(y_test, dt_predictions)
#Rows are actual classes, columns are predicted classes (0 = benign, 1 = malicious)
cm_df = pd.DataFrame(
    cm,
    index=["benign", "malicious"],
    columns=["benign", "malicious"],
)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_df, annot=True, fmt="d")
plt.title("Confusion Matrix (Decision Tree)")
plt.ylabel("Actual Values")
plt.xlabel("Predicted Values")
plt.show()

In [None]:
#Random Forest
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)

rfc_predictions = rfc.predict(x_test)
accuracy_score(y_test, rfc_predictions)

In [None]:
print(confusion_matrix(y_test,rfc_predictions))

In [None]:
cm = confusion_matrix(y_test, rfc_predictions)
#Rows are actual classes, columns are predicted classes (0 = benign, 1 = malicious)
cm_df = pd.DataFrame(
    cm,
    index=["benign", "malicious"],
    columns=["benign", "malicious"],
)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_df, annot=True, fmt="d")
plt.title("Confusion Matrix (Random Forest)")
plt.ylabel("Actual Values")
plt.xlabel("Predicted Values")
plt.show()

In [None]:
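#Logistic Regression
#max_iter is raised above the default of 100 because the unscaled features may need more iterations to converge
log_model = LogisticRegression(max_iter=1000)
log_model.fit(x_train,y_train)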

log_predictions = log_model.predict(x_test)
accuracy_score(y_test,log_predictions)

In [None]:
print(confusion_matrix(y_test,log_predictions))

In [None]:
cm = confusion_matrix(y_test, log_predictions)
#Rows are actual classes, columns are predicted classes (0 = benign, 1 = malicious)
cm_df = pd.DataFrame(
    cm,
    index=["benign", "malicious"],
    columns=["benign", "malicious"],
)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_df, annot=True, fmt="d")
plt.title("Confusion Matrix (Logistic Regression)")
plt.ylabel("Actual Values")
plt.xlabel("Predicted Values")
plt.show()

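Overall, all three models showed good results, with decent accuracy and a low error rate.<br>
The high accuracy may partly be due to the class imbalance, which has not been fixed yet: when one class dominates, a model can score well on accuracy while still missing many URLs of the minority class.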
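
To see why accuracy alone can mislead on imbalanced data, the sketch below derives accuracy, precision, recall and F1 from the four cells of a binary confusion matrix. The function name is hypothetical and the counts are invented for illustration; they are not results from this notebook. For a 2x2 matrix returned by scikit-learn's `confusion_matrix` with labels 0 and 1, the cells are laid out as `[[tn, fp], [fn, tp]]`.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 for the positive (malicious) class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Invented counts: 900 benign and 100 malicious URLs,
# and a classifier that predicts "benign" for everything
baseline = binary_metrics(tn=900, fp=0, fn=100, tp=0)
print(baseline)
```

In this made-up case the accuracy is 0.9 even though no malicious URL is detected, so recall and F1 on the malicious class are worth reporting next to accuracy.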

Further Improvements<br>
1. Analyse the code and tags used in the webpages.
2. Reduce the class imbalance problem.
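
For the second point, one simple option is random undersampling of the majority class. The sketch below is a standard-library illustration of that idea, not something this notebook implements; the function name `undersample`, its parameters, and the toy rows are hypothetical. In practice it would be applied to the training split only, so that the test set keeps the original class distribution.

```python
import random

def undersample(rows, label_of, seed=42):
    """Randomly drop rows from larger classes until every class matches the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    # Size of the smallest class sets the target for all classes
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced

# Toy rows of (url, result): 8 benign (0) and 2 malicious (1)
toy = [("u%d" % i, 0) for i in range(8)] + [("m%d" % i, 1) for i in range(2)]
balanced = undersample(toy, label_of=lambda row: row[1])
print(len(balanced))
```

Undersampling discards data from the majority class, so with a large dataset it is cheap to try, but alternatives such as class weighting or oversampling may preserve more information.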

In [None]:
import pickle
#Save the trained random forest; the with-block closes the file after writing
with open('model.pkl', 'wb') as f:
    pickle.dump(rfc, f)