# Experiment 5

## Dataset
#### What is phishing?

Phishing is a type of cybercrime in which a target or targets are contacted by email, telephone or text message by someone posing as a legitimate institution to lure individuals into providing sensitive data such as personally identifiable information, banking and credit card details, and passwords.
The information is then used to access important accounts of the victims and it mostly results in identity theft and financial loss.

Phishing is one of the most popular cybercrimes among attackers as it is easier to trick someone into clicking a malicious link which seems legitimate than trying to break through a computer’s defence systems.

In [10]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv('phishing_data.csv')
print(df.columns)

Index(['url', 'length_url', 'length_hostname', 'ip', 'nb_dots', 'nb_hyphens',
       'nb_at', 'nb_qm', 'nb_and', 'nb_or', 'nb_eq', 'nb_underscore',
       'nb_tilde', 'nb_percent', 'nb_slash', 'nb_star', 'nb_colon', 'nb_comma',
       'nb_semicolumn', 'nb_dollar', 'nb_space', 'nb_www', 'nb_com',
       'nb_dslash', 'http_in_path', 'https_token', 'ratio_digits_url',
       'ratio_digits_host', 'punycode', 'port', 'tld_in_path',
       'tld_in_subdomain', 'abnormal_subdomain', 'nb_subdomains',
       'prefix_suffix', 'random_domain', 'shortening_service',
       'path_extension', 'nb_redirection', 'nb_external_redirection',
       'length_words_raw', 'char_repeat', 'shortest_words_raw',
       'shortest_word_host', 'shortest_word_path', 'longest_words_raw',
       'longest_word_host', 'longest_word_path', 'avg_words_raw',
       'avg_word_host', 'avg_word_path', 'phish_hints', 'domain_in_brand',
       'brand_in_subdomain', 'brand_in_path', 'suspecious_tld',
       'statistical_report', 

## Problem analysis
#### What is the problem?
The target feature is status, which describes whether or not a web page was used for phishing.
#### Dataset dimensions?

In [11]:
print(f'Rows/Columns: {df.shape}')
print(f"Class distribution is: \n{df['status'].value_counts()}\n")
print(df.isnull().sum())

Rows/Columns: (11481, 89)
Class distribution is: 
phishing      5741
legitimate    5740
Name: status, dtype: int64

url                0
length_url         0
length_hostname    0
ip                 0
nb_dots            0
                  ..
web_traffic        0
dns_record         0
google_index       0
page_rank          0
status             0
Length: 89, dtype: int64


#### What kinds of data/features?
Our dataset is fairly large and contains many numeric and categorical (binary) features.

## Methods
#### Which preprocessing is needed?
We drop the first feature (url), since the information it carries is already described by other features. We also need
to encode certain categorical features, because some of them are stored as strings rather than numbers; this also
applies to the target feature. It may also be necessary to drop features that add little information.
#### Which models will you compare?
I will compare Support Vector Machine and Logistic Regression.
#### Which performance metric is appropriate?
Our labels are almost perfectly balanced, so accuracy is a fitting metric here. On top of that, I will use
cross-validation to test how well my models generalize.
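As a quick sanity check on the choice of accuracy, the class counts printed earlier give the accuracy of a trivial majority-class predictor, which is the floor any model has to beat. A minimal sketch using only those counts (the variable names are illustrative):

```python
# Class counts taken from the value_counts() output shown earlier.
class_counts = {"phishing": 5741, "legitimate": 5740}

total = sum(class_counts.values())
# A classifier that always predicts the most frequent class
# is correct exactly as often as that class occurs.
majority_baseline = max(class_counts.values()) / total

print(f"Total samples: {total}")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

With classes this close to 50/50, the baseline sits at roughly one half, so an accuracy well above that reflects real predictive signal rather than class imbalance.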

## Prediction
#### Which classifier will be best and why?
I expect SVM to be the best, because this model does well with many features.
#### Which hyperparameters are relevant and why?
SVM:
- C = Inverse regularization strength (smaller C means stronger regularization)
- Kernel = Kernel type
- Gamma = Kernel coefficient (used by the rbf kernel, ignored by the linear kernel)

LREG:
- C = Inverse regularization strength (smaller C means stronger regularization)
- Solver = Which optimizer is used
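Because gamma only affects non-linear kernels such as rbf, a grid that crosses every gamma value with the linear kernel refits models that differ in name only. One possible way to avoid that, sketched here with scikit-learn's `ParameterGrid`, is to use a list of sub-grids, which `GridSearchCV` also accepts as `param_grid`. The grid values and the names `full_grid` and `split_grid` are illustrative assumptions:

```python
from sklearn.model_selection import ParameterGrid

# Single grid: every gamma is also tried with the linear kernel,
# where it has no effect.
full_grid = {"C": [1, 10, 100, 1000],
             "gamma": [1, 0.1, 0.001],
             "kernel": ["linear", "rbf"]}

# Hypothetical alternative: one sub-grid per kernel, so gamma is
# only varied for the rbf kernel.
split_grid = [
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
    {"kernel": ["rbf"], "C": [1, 10, 100, 1000], "gamma": [1, 0.1, 0.001]},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(f"Candidates in single grid: {n_full}")
print(f"Candidates in split grid:  {n_split}")
```

The split grid covers the same distinct models with fewer fits, which matters because each SVM fit on a dataset of this size is relatively slow.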

## Process

In [12]:
df.drop('url', axis=1, inplace=True, errors='ignore')

In [13]:
for col in df.columns:
    print(col, [i for i in df[col].unique()])

# So the columns we need to fix are:
# ip, nb_hyphens, domain_with_copyright, status

# Columns we can drop due to only containing a single value:
# sfh, ratio_intErrors , nb_or, ratio_intRedirection, submit_email 

length_url [46, 128, 52, 21, 28, 50, 51, 35, 22, 57, 39, 44, 56, 110, 40, 62, 69, 24, 45, 26, 33, 36, 55, 25, 59, 31, 96, 27, 47, 32, 53, 68, 29, 61, 94, 71, 92, 78, 73, 75, 84, 127, 58, 142, 67, 42, 95, 49, 101, 81, 132, 30, 74, 108, 125, 17, 43, 18, 60, 259, 113, 63, 80, 65, 109, 404, 23, 99, 86, 72, 135, 20, 41, 158, 85, 153, 48, 37, 119, 79, 16, 157, 38, 76, 89, 54, 19, 120, 66, 88, 70, 141, 183, 131, 206, 64, 203, 152, 129, 124, 102, 34, 91, 137, 121, 140, 149, 116, 83, 77, 210, 106, 557, 243, 136, 339, 155, 97, 123, 105, 98, 90, 169, 115, 163, 299, 107, 196, 228, 185, 118, 122, 104, 82, 112, 466, 204, 1386, 275, 93, 150, 111, 263, 198, 154, 126, 87, 276, 192, 117, 114, 188, 249, 167, 100, 187, 162, 172, 168, 219, 133, 148, 156, 181, 144, 145, 103, 294, 159, 190, 248, 346, 305, 282, 164, 260, 13, 134, 166, 225, 138, 202, 303, 15, 200, 406, 180, 239, 165, 147, 160, 201, 194, 342, 218, 139, 208, 388, 226, 629, 197, 437, 176, 252, 236, 151, 174, 459, 146, 271, 430, 254, 186, 279, 461

In [14]:
df.drop(['sfh', 'ratio_intErrors' , 'nb_or', 'ratio_intRedirection', 'submit_email', 'ratio_nullHyperlinks'],
        axis=1,
        inplace=True,
        errors='ignore')
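Columns that hold a single value can also be found programmatically instead of by reading the printed unique values. A minimal sketch on a small made-up frame (the `const_a` and `const_b` columns and all values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the real data; 'const_a' and 'const_b'
# each contain one distinct value.
toy = pd.DataFrame({
    "length_url": [46, 128, 52, 21],
    "const_a": [0, 0, 0, 0],
    "nb_dots": [3, 1, 4, 2],
    "const_b": ["x", "x", "x", "x"],
})

# nunique() counts distinct values per column; a count of 1 marks
# a constant column that cannot help a classifier separate classes.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()

reduced = toy.drop(columns=constant_cols)
print(constant_cols)
print(reduced.columns.tolist())
```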

In [15]:
df.replace(to_replace = ['one', 'One'], value=1, inplace=True)
df.replace(to_replace = ['zero', 'Zero'], value=0, inplace=True)
df.replace(to_replace = ['phishing'], value=1, inplace=True)
df.replace(to_replace = ['legitimate'], value=0, inplace=True)

In [16]:
for col in df.columns:
    print(col, [i for i in df[col].unique()])


length_url [46, 128, 52, 21, 28, 50, 51, 35, 22, 57, 39, 44, 56, 110, 40, 62, 69, 24, 45, 26, 33, 36, 55, 25, 59, 31, 96, 27, 47, 32, 53, 68, 29, 61, 94, 71, 92, 78, 73, 75, 84, 127, 58, 142, 67, 42, 95, 49, 101, 81, 132, 30, 74, 108, 125, 17, 43, 18, 60, 259, 113, 63, 80, 65, 109, 404, 23, 99, 86, 72, 135, 20, 41, 158, 85, 153, 48, 37, 119, 79, 16, 157, 38, 76, 89, 54, 19, 120, 66, 88, 70, 141, 183, 131, 206, 64, 203, 152, 129, 124, 102, 34, 91, 137, 121, 140, 149, 116, 83, 77, 210, 106, 557, 243, 136, 339, 155, 97, 123, 105, 98, 90, 169, 115, 163, 299, 107, 196, 228, 185, 118, 122, 104, 82, 112, 466, 204, 1386, 275, 93, 150, 111, 263, 198, 154, 126, 87, 276, 192, 117, 114, 188, 249, 167, 100, 187, 162, 172, 168, 219, 133, 148, 156, 181, 144, 145, 103, 294, 159, 190, 248, 346, 305, 282, 164, 260, 13, 134, 166, 225, 138, 202, 303, 15, 200, 406, 180, 239, 165, 147, 160, 201, 194, 342, 218, 139, 208, 388, 226, 629, 197, 437, 176, 252, 236, 151, 174, 459, 146, 271, 430, 254, 186, 279, 461

In [17]:
scaler = StandardScaler()
labels = df.pop('status')
df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
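Fitting the scaler on the full dataset before cross-validation lets statistics from the held-out folds influence the training folds. One common alternative, sketched here on synthetic data from `make_classification` (an assumption, not the phishing dataset), is to wrap the scaler and the classifier in a pipeline so the scaler is refit inside each fold:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the sizes are arbitrary.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Inside cross_validate, the scaler is fit on the training part
# of each fold only, then applied to the held-out part.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

cv = StratifiedKFold(n_splits=5)
scores = cross_validate(model, X, y, cv=cv, scoring="accuracy")["test_score"]
print(f"Mean accuracy: {np.mean(scores):.3f} (std {np.std(scores):.3f})")
```

When such a pipeline is passed to `GridSearchCV`, parameter names take the step name as a prefix, for example `svc__C` for the pipeline built by `make_pipeline` here.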

In [18]:
print("Starting SVM grid search")
svm = SVC()
svm_params =  {'C':[1, 10, 100, 1000],
             'gamma':[1, 0.1, 0.001],
             'kernel':['linear', 'rbf']}
svm_grid = GridSearchCV(svm, svm_params, cv=2, scoring ='accuracy', verbose=0)
svm_grid.fit(df, labels)
print("Printing best results SVM...\n")
print(svm_grid.best_params_)

Starting SVM grid search
Printing best results SVM...

{'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}


In [19]:
print("Starting LREG grid search")
lreg = LogisticRegression(max_iter=500)
lreg_params = {'C':[1, 10, 100, 1000],
               'solver': ['liblinear', 'lbfgs', 'sag', 'saga'],
               }
lreg_grid = GridSearchCV(lreg, lreg_params, cv=2, scoring ='accuracy', verbose=0)
lreg_grid.fit(df, labels)
print("Printing best results LREG...\n")
print(lreg_grid.best_params_)

Starting LREG grid search
Printing best results LREG...

{'C': 1, 'solver': 'lbfgs'}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
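
The message above reports that a solver stopped at its iteration limit. One way to check for this after fitting, sketched on synthetic data (an assumption, not the phishing dataset), is to compare the fitted model's `n_iter_` attribute with `max_iter`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the sizes are arbitrary.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)

max_iter = 1000
clf = LogisticRegression(solver="lbfgs", max_iter=max_iter).fit(X, y)

# n_iter_ holds the number of iterations the solver actually used;
# reaching max_iter suggests it stopped before converging.
iterations_used = int(clf.n_iter_.max())
hit_limit = iterations_used >= max_iter
print(f"Iterations used: {iterations_used}, hit limit: {hit_limit}")
```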


In [20]:
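# Rebuild both models with the best hyperparameters from the grid searches.
svm = SVC(**svm_grid.best_params_)
# Note: max_iter is 300 here, whereas the grid search above used 500.
lreg = LogisticRegression(**lreg_grid.best_params_, max_iter=300)

# Evaluate both models on the same 10 stratified folds.
kf = StratifiedKFold(n_splits=10)
history_lreg = cross_validate(lreg, X=df, y=labels, cv=kf, scoring='accuracy')
history_svm = cross_validate(svm, X=df, y=labels, cv=kf, scoring='accuracy')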

## Results

In [21]:
print(f"Mean accuracy LREG: {np.mean(history_lreg['test_score'])}")
print(f"Mean accuracy SVM: {np.mean(history_svm['test_score'])}")

Mean accuracy LREG: 0.9479144870710178
Mean accuracy SVM: 0.9764832622216562


## Conclusion
#### Which classifier had the best result?
Support Vector Machine had the best result, with a mean accuracy of 0.976 compared with 0.948 for Logistic Regression.
#### Did this match the prediction? Why or why not?
Yes, it did. In line with my expectation, I think SVM achieves this because of the large number of different features.