# Building a Model for Website Security Prediction

## Dataset:
- [Dataset of Malicious and Benign Webpages](https://data.mendeley.com/datasets/gdx3pkwp47/1)

Singh, AK (2020), “Dataset of Malicious and Benign Webpages”, Mendeley Data, V1, doi: 10.17632/gdx3pkwp47.1

## Preprocessing and feature extraction:
- Because the dataset is very large and heavily imbalanced, we keep only 500 samples from each class, for 1000 samples in total.
- We use only the label and url columns for model building.
- The dataset lists individual webpages, so we strip the path from each URL and keep only the scheme and domain name.
- We pass each URL to our scanning functions, which test for a range of vulnerabilities; the result of each check becomes one feature.

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests


In [17]:
dataset = pd.read_csv('Webpages_Classification_train_data.csv')
dataset.head()

Unnamed: 0.1,Unnamed: 0,url,url_len,ip_add,geo_loc,tld,who_is,https,js_len,js_obf_len,content,label
0,0,http://members.tripod.com/russiastation/,40,42.77.221.155,Taiwan,com,complete,yes,58.0,0.0,Named themselves charged particles in a manly ...,good
1,1,http://www.ddj.com/cpp/184403822,32,3.211.202.180,United States,com,complete,yes,52.5,0.0,And filipino field \n \n \n \n \n \n \n \n the...,good
2,2,http://www.naef-usa.com/,24,24.232.54.41,Argentina,com,complete,yes,103.5,0.0,"Took in cognitivism, whose adherents argue for...",good
3,3,http://www.ff-b2b.de/,21,147.22.38.45,United States,de,incomplete,no,720.0,532.8,fire cumshot sodomize footaction tortur failed...,bad
4,4,http://us.imdb.com/title/tt0176269/,35,205.30.239.85,United States,com,complete,yes,46.5,0.0,"Levant, also monsignor georges. In 1800, lists...",good


In [18]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200000 entries, 0 to 1199999
Data columns (total 12 columns):
 #   Column      Non-Null Count    Dtype  
---  ------      --------------    -----  
 0   Unnamed: 0  1200000 non-null  int64  
 1   url         1200000 non-null  object 
 2   url_len     1200000 non-null  int64  
 3   ip_add      1200000 non-null  object 
 4   geo_loc     1200000 non-null  object 
 5   tld         1200000 non-null  object 
 6   who_is      1200000 non-null  object 
 7   https       1200000 non-null  object 
 8   js_len      1200000 non-null  float64
 9   js_obf_len  1200000 non-null  float64
 10  content     1200000 non-null  object 
 11  label       1200000 non-null  object 
dtypes: float64(2), int64(2), object(8)
memory usage: 109.9+ MB


In [19]:
dataset["label"].value_counts()

label
good    1172747
bad       27253
Name: count, dtype: int64
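
The counts above show how skewed the classes are. A minimal sketch of quantifying that skew, using the counts printed above instead of re-reading the CSV (the variable names are illustrative, not part of the original notebook):

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"good": 1172747, "bad": 27253}, name="count")

# Share of each class and the majority-to-minority ratio.
shares = counts / counts.sum()
imbalance_ratio = counts.max() / counts.min()

print(shares.round(4))
print(f"good:bad ratio is roughly {imbalance_ratio:.1f} to 1")
```

A ratio this large is why the later cells undersample the good class before building features.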

In [20]:
# check for repeated websites; only the main link is needed, not every page of a site
dataset["url"].value_counts()

url
http://www.hiddenforest.co.nz/fungi/family/pluteaceae/amanita.htm    6
http://www.fcc.gov/                                                  5
http://cyber.law.harvard.edu/                                        5
http://www.symbols.com/                                              5
http://omacl.org/                                                    5
                                                                    ..
http://www.falconups.com/                                            1
http://www.angelfire.com/ga2/ewanlinks/main.html                     1
http://www.unidata.ucar.edu/packages/dods/                           1
http://www.geocities.com/sunsetstrip/birdland/8001/football.html     1
http://homepages.gotadsl.co.uk/~jgm/ekmm/                            1
Name: count, Length: 1171470, dtype: int64

In [21]:
# Trim each URL to its base (optional http/https scheme plus domain); everything after the domain, starting at the first "/", is dropped.
import re
def extract_base_url(url):
    # Match an optional scheme (http/https), the domain, and the top-level domain
    regex = r'^(https?://)?([\da-z\.-]+)\.([a-z\.]{2,6})'
    match = re.match(regex, url)
    if match:
        # group(0) is the whole matched prefix, i.e. the base URL
        return match.group(0)
    else:
        # URLs the pattern cannot match (e.g. upper-case hosts) yield None
        return None

dataset["base_url"] = dataset["url"].apply(extract_base_url)
dataset.head()

Unnamed: 0.1,Unnamed: 0,url,url_len,ip_add,geo_loc,tld,who_is,https,js_len,js_obf_len,content,label,base_url
0,0,http://members.tripod.com/russiastation/,40,42.77.221.155,Taiwan,com,complete,yes,58.0,0.0,Named themselves charged particles in a manly ...,good,http://members.tripod.com
1,1,http://www.ddj.com/cpp/184403822,32,3.211.202.180,United States,com,complete,yes,52.5,0.0,And filipino field \n \n \n \n \n \n \n \n the...,good,http://www.ddj.com
2,2,http://www.naef-usa.com/,24,24.232.54.41,Argentina,com,complete,yes,103.5,0.0,"Took in cognitivism, whose adherents argue for...",good,http://www.naef-usa.com
3,3,http://www.ff-b2b.de/,21,147.22.38.45,United States,de,incomplete,no,720.0,532.8,fire cumshot sodomize footaction tortur failed...,bad,http://www.ff-b2b.de
4,4,http://us.imdb.com/title/tt0176269/,35,205.30.239.85,United States,com,complete,yes,46.5,0.0,"Levant, also monsignor georges. In 1800, lists...",good,http://us.imdb.com
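
The regex above is case-sensitive and caps the final label at six characters, so some URLs either return `None` or come back truncated. The sketch below compares it with one possible alternative based on `urllib.parse.urlsplit` from the standard library. The helper names and the two hypothetical sample URLs are assumptions for illustration, and unlike the regex, the alternative returns `None` for URLs that have no scheme.

```python
import re
from urllib.parse import urlsplit

ORIGINAL_REGEX = r'^(https?://)?([\da-z\.-]+)\.([a-z\.]{2,6})'

def base_url_regex(url):
    """Same pattern as extract_base_url above (case-sensitive)."""
    match = re.match(ORIGINAL_REGEX, url)
    return match.group(0) if match else None

def base_url_urlsplit(url):
    """Alternative sketch: let the standard library split the URL."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.hostname:
        return None
    # hostname is lower-cased and excludes any port or credentials
    return f"{parts.scheme}://{parts.hostname}"

samples = [
    "http://us.imdb.com/title/tt0176269/",  # from the dataset preview
    "http://WWW.EXAMPLE.COM/index.html",    # hypothetical upper-case host
    "http://example.photography/gallery",   # hypothetical long TLD
]
for url in samples:
    print(url, "->", base_url_regex(url), "|", base_url_urlsplit(url))
```

Whether such cases matter depends on how many URLs in the dataset use upper-case hosts or long top-level domains, which the output shown here does not reveal.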


In [22]:
# get the base_url and label only
dataset = dataset[["base_url", "label"]]
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200000 entries, 0 to 1199999
Data columns (total 2 columns):
 #   Column    Non-Null Count    Dtype 
---  ------    --------------    ----- 
 0   base_url  1198752 non-null  object
 1   label     1200000 non-null  object
dtypes: object(2)
memory usage: 18.3+ MB


In [23]:
# remove duplicates
dataset = dataset.drop_duplicates()
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 760595 entries, 0 to 1199999
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   base_url  760593 non-null  object
 1   label     760595 non-null  object
dtypes: object(2)
memory usage: 17.4+ MB
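
Because `drop_duplicates()` compares both columns, a domain that hosts both good and bad pages survives as two rows with opposite labels. A minimal sketch of detecting such conflicts on a toy frame (the domains are hypothetical; on the real data the same two lines would run on `dataset`):

```python
import pandas as pd

# Toy frame with hypothetical domains; the real input would be the
# de-duplicated (base_url, label) frame from the cell above.
toy = pd.DataFrame({
    "base_url": ["http://a.example", "http://a.example", "http://b.example"],
    "label": ["good", "bad", "good"],
})
toy = toy.drop_duplicates()

# Count distinct labels per domain; more than one means a conflict.
labels_per_domain = toy.groupby("base_url")["label"].nunique()
conflicting = labels_per_domain[labels_per_domain > 1].index.tolist()
print(conflicting)
```

How to resolve a conflict (for example, treating the domain as bad, or dropping it) is a modeling decision the notebook does not state.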


In [24]:
dataset["label"].value_counts()

label
good    742332
bad      18263
Name: count, dtype: int64

In [25]:
# balance the dataset: the good class is far larger than the bad class, so undersample it
dataset_good = dataset[dataset["label"] == "good"]
dataset_bad = dataset[dataset["label"] == "bad"]
dataset_good = dataset_good.sample(dataset_bad.shape[0])
dataset = pd.concat([dataset_good, dataset_bad])
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 36526 entries, 976874 to 1199910
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   base_url  36525 non-null  object
 1   label     36526 non-null  object
dtypes: object(2)
memory usage: 856.1+ KB
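
The `sample()` call above has no seed, so each run selects different good URLs. A minimal sketch of a reproducible variant, assuming a fixed `random_state` is acceptable; the helper name `undersample` and the toy data are illustrative only:

```python
import pandas as pd

def undersample(df, label_col="label", seed=42):
    """Sample every class down to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    return df.groupby(label_col, group_keys=False).sample(n=n_min, random_state=seed)

# Hypothetical toy data: 8 good rows and 2 bad rows.
toy = pd.DataFrame({
    "base_url": [f"http://site{i}.example" for i in range(10)],
    "label": ["good"] * 8 + ["bad"] * 2,
})
balanced = undersample(toy)
print(balanced["label"].value_counts())
```

With a fixed seed, repeated runs return the same rows, which makes later feature extraction and model results easier to compare.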


In [26]:
# check for null values
dataset.isnull().sum()

# remove null values
dataset = dataset.dropna()

In [27]:
# the final dataset
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 36525 entries, 976874 to 1199910
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   base_url  36525 non-null  object
 1   label     36525 non-null  object
dtypes: object(2)
memory usage: 856.1+ KB


In [28]:
# reduce the dataset further: keep 1000 URLs in total, split evenly between good and bad
dataset_good = dataset[dataset["label"] == "good"]
dataset_bad = dataset[dataset["label"] == "bad"]

# keep only the first 500 URLs of each class
dataset = pd.concat([dataset_good[:500], dataset_bad[:500]])

# shuffle the dataset
dataset = dataset.sample(frac=1).reset_index(drop=True)

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   base_url  1000 non-null   object
 1   label     1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB


In [29]:
from vulner import *

vulnerability_checks = {
    "SQL Injection": is_sql_injection_check,
    "Cross-Site Scripting (XSS)": is_xss_vulnerable,
    "Command Injection": is_command_injection_vulnerable,
    "Remote File Inclusion Vulnerability": is_remote_file_inclusion_vulnerable,
    "Server-Side Request Forgery (SSRF)": is_ssrf_vulnerable,
    "Unvalidated Redirect Vulnerability": is_unvalidated_redirect_vulnerable,
    "Cross-Site Request Forgery (CSRF) Vulnerability": is_csrf_vulnerable,
    "Remote Code Execution (RCE) Vulnerability": is_rce_vulnerable,
    "Cross-Site Script Inclusion (XSSI) Vulnerability": is_xssi_vulnerable,
    "File Upload Vulnerability": is_file_upload_vulnerable,
    "Insecure Direct  (IDOR)": is_idor_vulnerable,
    "XML External Entity (XXE) Vulnerability": is_xxe_vulnerable,
    "Server-Side Template Injection (SSTI) Vulnerability": is_ssti_vulnerable,
    "Remote Code Inclusion (RCI) Vulnerability": is_rci_vulnerable,
    "Server-Side Template Injection (SSTI)": is_ssti_templating_vulnerable,
    "Insecure Deserialization": is_insecure_deserialization_vulnerable,
    "Server-Side Request": is_ssrf_dns_rebinding_vulnerable,
    "Clickjacking Vulnerability": is_clickjacking_vulnerable,
    "Security Misconfiguration": is_security_misconfiguration_vulnerable,
    "Cross-Site Scripting": is_dom_based_xss_vulnerable,
    "Open Redirect Vulnerability": check_open_redirect,
    "Cross-Origin Resource": check_cors_misconfiguration,
    "Cross-Site Script": is_xssi_json_vulnerable,
    "Content Security Policy": is_csp_bypass_vulnerable,
    "Insecure Cross-Origin": is_insecure_cors_configured,
    "HTTP Parameter Pollution Vulnerability": check_http_parameter_pollution,
    "Insufficient Transport": check_transport_layer_protection,
}
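
The `vulner` module is not shown here, so the exact behavior of these checks is unknown. To illustrate the general shape such a check might take, the sketch below implements a hypothetical header-based clickjacking heuristic. The function name and logic are assumptions and not the code behind `is_clickjacking_vulnerable`; it works on a plain dict of response headers so that it runs without network access.

```python
def missing_clickjacking_protection(headers):
    """Hypothetical heuristic: True when the response carries neither an
    X-Frame-Options header nor a CSP frame-ancestors directive."""
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    has_xfo = "x-frame-options" in lowered
    has_frame_ancestors = "frame-ancestors" in lowered.get("content-security-policy", "")
    return not (has_xfo or has_frame_ancestors)

print(missing_clickjacking_protection({"Content-Type": "text/html"}))
print(missing_clickjacking_protection({"X-Frame-Options": "DENY"}))
```

In a live scan the headers would come from an HTTP response for the URL, and a request that fails would raise an exception, which is the case the feature-extraction loop has to handle.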

In [30]:
# Try the checks on a few example URLs, storing each result (True, False, or 'fail') in a list
# result = []
# for i in dataset['base_url'].head():
#     url = i
#     subresult = []
#     print(i)
#     print('------------------')
#     for key, check in vulnerability_checks.items():
#         try:
#             subresult.append(check(url))
#         except:
#             subresult.append('fail')
#     print(subresult)
#     result.append(subresult)


In [31]:
dataset['label'].head()

0     bad
1     bad
2    good
3    good
4     bad
Name: label, dtype: object

In [32]:
# Apply every vulnerability check to every URL; each check becomes one feature column
for key, check in vulnerability_checks.items():
    # rows for which this check raised (e.g. the site could not be reached)
    failed = []
    for index, url in dataset['base_url'].items():
        try:
            dataset.loc[index, key] = check(url)
        except Exception:
            # mark the row for removal instead of dropping it while iterating
            failed.append(index)
    # drop the failed rows so that later checks skip them
    dataset.drop(index=failed, inplace=True)

dataset.head()

SQL Injection Vulnerability Detected!
SQL Injection Vulnerability Detected!
SQL Injection Vulnerability Detected!
SQL Injection Vulnerability Detected!
SQL Injection Vulnerability Detected!
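
The loop above can also be arranged URL by URL, so that a URL is abandoned as soon as one check fails and the feature table is built in a single step. A minimal sketch of that arrangement, using stub checks in place of the real scanners; `scan_urls` and both stubs are hypothetical names introduced for illustration:

```python
import pandas as pd

def scan_urls(urls, checks):
    """Run every check on every URL; URLs where any check raises are skipped."""
    rows = []
    for url in urls:
        features = {"base_url": url}
        try:
            for name, check in checks.items():
                features[name] = check(url)
        except Exception:
            continue  # mirror the loop above: drop URLs that fail a check
        rows.append(features)
    return pd.DataFrame(rows)

# Stub checks standing in for the real scanners (hypothetical, no network).
def stub_has_digit(url):
    return any(ch.isdigit() for ch in url)

def stub_fails_on_empty_host(url):
    if url.endswith("//"):
        raise ValueError("no host")
    return False

features = scan_urls(
    ["http://site1.example", "http://", "http://plain.example"],
    {"Has Digit": stub_has_digit, "Stub Check": stub_fails_on_empty_host},
)
print(features)
```

Collecting rows in a list and building the DataFrame once avoids repeated `.loc` writes and in-place drops, at the cost of holding all results in memory until the scan finishes.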
