# **My Implementation - Thomas Hintze**

I started off by reading the material and educating myself on the topic. I found a couple of other articles that helped me get a better picture of what distinguishes phishing URLs.

## Articles:

[Detecting Malicious URLs in E-mail – An Implementation](https://reader.elsevier.com/reader/sd/pii/S2212671613000218?token=EDF32AFC183DB659B6FBD4BDB32CD345633A44F43036B686CBF7186EEA9E1E810042C3619AF67DD148DA2CD679B22098)

[Datasets for phishing websites detection](https://www.sciencedirect.com/science/article/pii/S2352340920313202)





## Extraction functions

I decided to extract a couple of features from the URLs, focusing on symbols and digits, as many of the articles I read described these as typical features of phishing URLs.

The first function is a parsing function for the URL, so that different parts of the URL can be analyzed separately. Some parts of the URL, such as the top-level domain (TLD) and the path and query, can be modified more freely and must therefore be analyzed more carefully.
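The actual "parse_url" function lives in extractors.py and is not shown here, so the following is only a minimal sketch of how such a split could look, using the standard library's `urllib.parse.urlsplit`. The name `parse_url_sketch`, the returned keys, and the example URL are assumptions made for illustration, and the TLD is taken naively as the last dot-separated label of the host.

```python
from urllib.parse import urlsplit

def parse_url_sketch(url):
    """Split a URL into parts that can be analysed separately (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""          # lower-cased host without port, "" if missing
    labels = host.split(".") if host else []
    # Naive assumption: the TLD is the last dot-separated label of the host.
    tld = labels[-1] if len(labels) > 1 else ""
    # Path and query are kept together, since both are freely chosen by the URL's author.
    path_query = parts.path + ("?" + parts.query if parts.query else "")
    return {
        "scheme": parts.scheme,
        "host": host,
        "tld": tld,
        "path_query": path_query,
    }

print(parse_url_sketch("http://login.example.com/account/verify?id=123"))
```

This naive approach has known gaps: a URL without a scheme yields an empty host, multi-part suffixes such as "co.uk" are reduced to "uk", and an IP address as host would report its last number as the TLD.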

The other two extraction functions count the number of symbols and digits in the URL. Legitimate websites tend to have fewer digits and symbols in their URLs. These basic features could also be applied to sub-strings of the URL, in combination with the implemented "parse_url" function.
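As a rough illustration of applying a count to sub-strings rather than the whole URL, the sketch below counts digits per URL part. The function names and the hand-split example parts are hypothetical and are not taken from extractors.py.

```python
def count_digits_sketch(text):
    """Count the decimal digits 0-9 in a string."""
    return sum(ch in "0123456789" for ch in text)

def digits_per_part(parts):
    """Apply the digit count to each named sub-string of a URL."""
    return {name: count_digits_sketch(value) for name, value in parts.items()}

# Hypothetical, hand-split URL parts used only for illustration
example_parts = {
    "host": "secure-login77.example.com",
    "path_query": "/a1/b2?session=90210",
}
print(digits_per_part(example_parts))  # {'host': 2, 'path_query': 7}
```

Counting per part makes it possible to tell whether the digits sit in the host name or in the path and query, which a single total per URL cannot show.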

## Storage
I decided to implement storage in the form of nested dictionaries, as they are quick and easy to work with and can easily be converted into other data structures. The main function in this notebook tests the extraction functions (found in the extractors.py file) with a set of real URLs (found in the test_urls.py file), visualises one of the features, and finally returns the dictionary of analyzed URLs and their respective features.
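To illustrate the point about conversion, here is a small sketch that flattens a nested dictionary of the shape `{index: {feature: value}}` into rows and serialises them as CSV text with the standard library. The feature values and helper names are made up for the example, and the sketch assumes a non-empty dictionary in which every URL has the same feature keys.

```python
import csv
import io

# Hypothetical feature values; the real ones come from the extraction functions
features = {
    0: {"tld": "com", "digits": 7, "symbols": 3},
    1: {"tld": "net", "digits": 0, "symbols": 1},
}

def to_rows(nested):
    """Flatten {index: {feature: value}} into a list of row dictionaries."""
    return [{"index": idx, **feats} for idx, feats in nested.items()]

def to_csv_text(nested):
    """Serialise the nested dictionary as CSV text (assumes at least one entry)."""
    rows = to_rows(nested)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

print(to_csv_text(features))
```

Writing to an in-memory buffer keeps the example free of side effects; replacing it with an opened file would store the same rows on disk.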


## Visualisation
I decided to use the total number of digits per URL as the feature for the visualisation. I found that around a third of the test URLs, which are verified phishing URLs, contained more than five digits. Some URLs contained over 20 digits. The visualisation shows that large numbers of digits are present in phishing URLs, but further analysis and comparison with other metrics would be needed to reliably identify a URL as phishing.

## What could go wrong?
There are parts of the code that would need to be improved for it to be reliable.

Firstly, the feature extractors would need input verification to ensure that the right kind of input is given. If bad input still caused an error, I could save the already created, error-containing dictionary externally, perhaps as a .csv file. I would then assess the error, fix it, re-run the extraction, and compare the result with the original to verify that the error was fixed.
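One possible shape for such input verification is sketched below. The function name and the exact rules (a non-empty list containing only non-empty strings) are assumptions, not part of the existing extractors.

```python
def validate_url_list(url_list):
    """Return the list unchanged if it is a non-empty list of non-empty strings."""
    if not isinstance(url_list, list):
        raise TypeError("expected a list of URL strings")
    if not url_list:
        raise ValueError("the URL list is empty")
    for position, url in enumerate(url_list):
        # Reject non-strings and strings that are empty or only whitespace
        if not isinstance(url, str) or not url.strip():
            raise ValueError(f"item {position} is not a non-empty string: {url!r}")
    return url_list

validate_url_list(["http://example.com/login"])  # passes silently
```

Raising an exception instead of printing a message lets the caller decide how to react, and the position in the error message points directly at the offending entry.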

Secondly, the extractors could fail to identify crucial information and cause malfunctions in the analysis process. For example, the symbols extractor only recognizes ten symbols: "´", "%", "#", "^", "$", "&", "-", "*", ":", and "@". If a URL containing symbols other than these ten is analyzed, the extractor fails to recognize the symbol and skips it as if it were a normal letter. This feature would need improvement and development as more symbols show up. A filter could be included that adds the unrecognized symbols to a list called "others", which could then be viewed and used for further development of the extractor.
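The "others" filter could look roughly like the following sketch. It is not the existing symbols extractor: the function name is hypothetical, and treating every non-alphanumeric character outside the ten known symbols as "other" is an assumption.

```python
KNOWN_SYMBOLS = set("´%#^$&-*:@")   # the ten symbols named above

def count_symbols_with_others(url):
    """Count known symbols and collect unrecognised non-alphanumeric characters."""
    count = 0
    others = []
    for ch in url:
        if ch in KNOWN_SYMBOLS:
            count += 1
        elif not ch.isalnum():
            others.append(ch)       # kept for later review of the extractor
    return count, others

count, others = count_symbols_with_others("http://pay-pal.example.com/login?user=a@b.c")
print(count, others)
```

With this rule, ordinary URL delimiters such as "/", "." and "?" also end up in "others"; in practice they could either be excluded or promoted to the known set, depending on what the analysis needs.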

Here is the main function that is used for the testing. It returns a dictionary with the 30 analyzed URLs and visualises the total number of digits found in each URL.
The libraries used for the visualisation are "pandas" and "matplotlib".

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import extractors as e, test_urls as t

def main(URL_list):
    if len(URL_list) == 0:
        print("Please input a list of strings.")     #Simple input verification
        return False
    else:
        dictionary = {}     #Initializes a dictionary for url data to be stored in
        data = {}           #Initializes a dictionary to be transformed into a dataframe that can be visualized

        for i, url in enumerate(URL_list):
            url_features = e.parse_url(url)                 #Splits the URL into its parts
            url_features['digits'] = e.count_digits(url)
            url_features['symbols'] = e.count_symbols(url)
            dictionary[i] = url_features                    #Adds features to the already initialized dictionary for data storage
            data[i] = {'Digits': url_features['digits']}    #Adds data on digit amount to dataframe dictionary, for visualisation

        df = pd.DataFrame(data)
        df = df.transpose()
        df = df.sort_values(by='Digits', ascending=False)  #Creates and sorts a dataframe, ready for visualisation

        plot1 = df.plot.bar()                              #Decided on barplot, as it visualizes the feature well
        plot1.set_ylim(0, df['Digits'].max() + 5)          #Headroom above the highest digit count (max(data) would give the largest index instead)
        plot1.set(xlabel = "URL (Index in dictionary)", title='Number of digits found in analyzed URLs')
        plt.show()

        return dictionary

main(t.urls)        #runs main function using the URL list from test_urls.py
