<img src="https://pa-legg.github.io/images/uwe_banner.png">

# UFCFFY-15-M Cyber Security Analytics 23-24

## Portfolio Assignment: Worksheet 2

## Conduct an investigation on a URL database to develop a DGA classification system using machine learning techniques
---

For this task, the company **"UWEtech"** enlist your help once more. They have identified a number of suspicious URLs on their logging systems, suspecting that these URLs contain various malware, and so require your expertise to investigate these further. Specifically, they seek a machine learning approach to identify the malware families as observed on their network.

You will need to develop a machine learning tool using Python and scikit-learn that can identify URLs based on [Domain Generation Algorithms (DGA)](https://blog.malwarebytes.com/security-world/2016/12/explained-domain-generating-algorithm/), widely used by command and control malware to avoid static IP blocking.

You need to demonstrate experimental design of appropriate feature engineering to characterise the data, which will be used to inform your machine learning classifiers. You should show **at least two** schemes of curating appropriate features, based on the raw data as provided, and show how this impacts the performance of your classifier.

You are also expected to utilise **3 different classifiers** using the scikit-learn library, and show how the model parameters can impact the performance of the classifiers. It is suggested that you use a Logistic Regression, a Random Forest Classifier, and a Multi-Layer Perceptron Classifier.

Finally, you should investigate the **performance and explainability** of your classifiers. It is recommended that you use the confusion matrix approach along with performance metrics, to assess how your model performs as well as when and why misclassification may occur. In reporting your findings, you should explain and reflect on this to understand which malware families are more separable, and which are more challenging to classify, using this approach. It is expected that a good performing classifier will achieve over 90% accuracy - however you will be assessed on your experimental design in finding a suitable classifier to achieve this.

**Dataset**: Please see the folder ***"Portfolio Assignment"*** under the Assignment tab on Blackboard for further detail related to the access and download of the necessary dataset.

**Hint**: You should conduct research using the [scikit-learn documentation and API reference](https://scikit-learn.org/stable/user_guide.html), making full use of the sample code that has been provided for you to help guide your research. You should also research Shapley Additive Explanations, and utilise the [online documentation](https://shap.readthedocs.io/en/latest/index.html). You should also think about a suitable means of generating input features for your classifier that capture sequential properties of text data.

### Assessment and Marking
---
The completion of this worksheet is worth **35%** of your portfolio assignment for the UFCFFY-15-M Cyber Security Analytics (CSA) module.

This is an **unguided** task that will be graded against the following core criteria:

* **A clear and iterative experimental approach for developing and refining the classifier to improve performance (10 Marks)**
  * *For the higher mark band, it would be expected that you would show an initial experimental design, and then refine this through improving the feature engineering stage, subsequently improving the model performance.*
* **Suitable feature engineering stages demonstrating at least two different methods and their performance (10 Marks)**
  * *For the higher mark band, it would be expected that you would demonstrate two sensible approaches for curating features, with strong justification as to why they would characterise the data fairly.*
* **Suitable use of the scikit-learn machine learning library (5 Marks)**
  * *For the higher mark band, it would be expected that you would show a good comprehension of the library usage.*
* **Clear evaluation of ML performance and explainability (5 Marks)**
  * *For the higher mark band, it would be expected that you would use confusion matrices to explain which malware classes are more separable, and which share similarity according to a well-trained model.*
* **Clarity and presentation (5 Marks)**
  * *For the higher mark band, it would be expected that your notebook is clear and concise, with good use of Markdown to annotate your work professionally.*

### Submission Documents
---

Your submission for this task should include:

- **1 Jupyter Notebook file (*.ipynb)**

You should complete your work using the ipynb file provided (i.e., this document). Once you have completed your work, you should ensure that all code cells have been executed and then you should save your notebook. **Please note: Staff will NOT execute your notebook during marking. It is your responsibility to ensure that your saved notebook shows the code cell outputs as required.**

The deadline for your portfolio submission is **THURSDAY 2ND MAY @ 14:00**. This assignment is eligible for the [48-hour late submission window](https://www.uwe.ac.uk/study/academic-information/personal-circumstances/late-submission-window), however module staff will not be able to assist with any queries after the deadline.

Your portfolio submitted to Blackboard must contain 3 independent documents:

- ***STUDENT_ID-TASK1.ipynb*** (your iPYNB with all cells executed)
- ***STUDENT_ID-TASK2.ipynb*** (your iPYNB with all cells executed)
- ***STUDENT_ID-TASK3.pdf*** (a PDF report of your research investigation)

### Contact
---

Questions about this assignment should be directed to your module leader (Phil.Legg@uwe.ac.uk). You should use the [online Q&A form](https://forms.office.com/e/yxFJZDraRG) to ask questions related to this module and this assignment, as well as utilising the on-site teaching sessions.

---

# Student ID: 23008852

- **By submitting this assignment to Blackboard as part of your portfolio, I declare that the submission is my own work.**

***

In [4]:
# Import libraries as required
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 10)


# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics

# Parsing Urls
import re
from urllib.parse import urlparse
import os.path

# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay

# Default settings of CSA Assessment 2
from collections import Counter
from timeit import timeit
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder

%matplotlib inline

In [5]:
# Count how often each unique value occurs in a column
def factor_col(col):
    factor = pd.Categorical(col)
    return pd.Series(factor).value_counts(sort=False)

In [6]:
# Load in the data set as required
df = pd.read_csv('./dga-24000.csv')
df

Unnamed: 0,Domain,Family
0,google.com,benign
1,facebook.com,benign
2,youtube.com,benign
3,twitter.com,benign
4,instagram.com,benign
...,...,...
23995,fhyibfwhpahb.su,locky
23996,nlgusntqeqixnqyo.org,locky
23997,awwduqqrjxttmn.su,locky
23998,ccxmwif.pl,locky


In [7]:
df.shape

(24000, 2)

In [8]:
df.describe()

Unnamed: 0,Domain,Family
count,24000,24000
unique,24000,24
top,google.com,benign
freq,1,1000


In [9]:
df.columns

Index(['Domain', 'Family'], dtype='object')

In [10]:
# Cross-tabulate each Domain against its Family
df_dga = pd.crosstab(df.Domain, df.Family)
df_dga

Domain,banjori,benign,emotet,flubot,gameover,locky,murofet,mydoom,necro,necurs,...,ramnit,ranbyus,rovnix,shifu,shiotob,simda,suppobox,symmi,tinba,virut
01ejk9ev8p2f.com,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
01u3cpy749eb.org,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
01ujw92vo9if.net,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
01yzkl67sta3.net,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
01yzo9mb45e7.org,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zzij3gsfdwwjzrz.com,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
zzjaua.com,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
zzlxestnessbiophysicalohax.com,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
zzrshclp2w2xu21.net,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [11]:
df['Domain'].value_counts()

Domain
google.com                                     1
frstqegbcnmeqqaj.eu                            1
antizerolant-monogevudom.info                  1
imunolance-postodinenetn-antifipuketn.net      1
overahudulize-unazibezize-overuzozerish.org    1
                                              ..
bdkrwgvuqnosl.com                              1
gtpjkwwqxelpfc.com                             1
urqsbghiuvfeti.com                             1
kowqxgjugu.com                                 1
yhrryqjimvgfbqrv.pw                            1
Name: count, Length: 24000, dtype: int64

In [12]:
df['Family'].value_counts()

Family
benign       1000
banjori      1000
qadars       1000
suppobox     1000
shifu        1000
             ... 
simda        1000
pykspa_v1    1000
tinba        1000
rovnix       1000
locky        1000
Name: count, Length: 24, dtype: int64

In [13]:
n_samples, n_features = df.shape
print('Number of samples:', n_samples)
print('Number of features:', n_features)

Number of samples: 24000
Number of features: 2


In [14]:
df.isnull().sum()

Domain    0
Family    0
dtype: int64

In [15]:
df

Unnamed: 0,Domain,Family
0,google.com,benign
1,facebook.com,benign
2,youtube.com,benign
3,twitter.com,benign
4,instagram.com,benign
...,...,...
23995,fhyibfwhpahb.su,locky
23996,nlgusntqeqixnqyo.org,locky
23997,awwduqqrjxttmn.su,locky
23998,ccxmwif.pl,locky


In [16]:
from sklearn.preprocessing import LabelEncoder

lb_make = LabelEncoder()
df["label"] = lb_make.fit_transform(df["Family"])
df["label"].value_counts()

label
1     1000
0     1000
12    1000
20    1000
17    1000
      ... 
19    1000
11    1000
22    1000
16    1000
5     1000
Name: count, Length: 24, dtype: int64
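
`LabelEncoder` assigns integer codes in sorted order of the class names, which is why `benign` receives code 1 in this dataset. A minimal sketch of recovering family names from the codes, which may help later when labelling confusion matrix axes. The three-family sample below is hypothetical and is not the real dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical mini-sample of the Family column (not the real dataset)
families = ["benign", "locky", "banjori", "benign", "locky"]

encoder = LabelEncoder()
codes = encoder.fit_transform(families)

# classes_ holds the names in sorted order; the position is the integer code
code_to_family = dict(enumerate(encoder.classes_))
print(code_to_family)

# inverse_transform maps integer predictions back to family names
print(encoder.inverse_transform([1, 2]))
```

In the notebook itself, the fitted `lb_make` object could be used the same way.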

In [17]:
df

Unnamed: 0,Domain,Family,label
0,google.com,benign,1
1,facebook.com,benign,1
2,youtube.com,benign,1
3,twitter.com,benign,1
4,instagram.com,benign,1
...,...,...,...
23995,fhyibfwhpahb.su,locky,5
23996,nlgusntqeqixnqyo.org,locky,5
23997,awwduqqrjxttmn.su,locky,5
23998,ccxmwif.pl,locky,5


In [18]:
df.dtypes

Domain    object
Family    object
label      int32
dtype: object

# Random Forest


In [19]:
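# Split into X & y and train/test
# Note: dropping only "label" leaves "Family" in X, and Family is the text form of the label,
# so any model trained on X is given the answer (target leakage).
X = df.drop("label", axis=1)
y = df["label"]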

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, shuffle=True, random_state=5)
print(f"X_train Shape : {X_train.shape}")
print(f"Y_train Shape : {y_train.shape}")
print(f"X_test  Shape : {X_test.shape}")
print(f"Y_test  Shape : {y_test.shape}")

X_train Shape : (19200, 2)
Y_train Shape : (19200,)
X_test  Shape : (4800, 2)
Y_test  Shape : (4800,)
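
The split above keeps two columns in `X` (`Domain` and `Family`), and `Family` is the same information as `label`. A minimal sketch, under the assumption that only the raw domain string should be a model input, of a stratified split that leaves the target out of the features. The `toy` frame and the variable names are hypothetical stand-ins for the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical mini-sample with the same column layout as dga-24000.csv
toy = pd.DataFrame({
    "Domain": ["google.com", "facebook.com", "youtube.com", "twitter.com",
               "fhyibfwhpahb.su", "nlgusntqeqixnqyo.org", "awwduqqrjxttmn.su", "ccxmwif.pl"],
    "Family": ["benign"] * 4 + ["locky"] * 4,
})

# Keep only the raw domain string as input; Family is the target, not a feature
X_domain = toy["Domain"]
y_family = toy["Family"]

# stratify keeps the family proportions equal in the train and test portions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_domain, y_family, stratify=y_family, test_size=0.25, random_state=5
)
print(X_tr.shape, X_te.shape)
```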


### First Try

Convert the non-numerical features into numbers first.

In [20]:
# 1. Import OneHotEncoder and ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# 2. Define the categorical features to transform
categorical_features = ["Domain", "Family"]

In [21]:
# 3. Create an instance of OneHotEncoder
one_hot = OneHotEncoder()

# 4. Create an instance of ColumnTransformer
transformer = ColumnTransformer([("one_hot", # name
                                  one_hot, # transformer
                                  categorical_features)], # columns to transform
                                  remainder="passthrough") # what to do with the rest of the columns? ("passthrough" = leave unchanged)

# 5. Turn the categorical features into numbers (this will return an array-like sparse matrix, not a DataFrame)
transformed_X = transformer.fit_transform(X)
transformed_X

<24000x24024 sparse matrix of type '<class 'numpy.float64'>'
	with 48000 stored elements in Compressed Sparse Row format>
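
The 24024 columns correspond to the 24000 unique domains plus the 24 families. Each `Domain` column is non-zero for exactly one row, so it cannot say anything about a domain that was not seen during fitting. A small sketch of that effect, using hypothetical domain lists and `handle_unknown="ignore"` so that an unseen value is encoded rather than raising an error:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and unseen domains (every domain in the dataset is unique)
train_domains = [["google.com"], ["fhyibfwhpahb.su"], ["ccxmwif.pl"]]
unseen_domains = [["awwduqqrjxttmn.su"]]

# handle_unknown="ignore" encodes categories not seen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_domains)

encoded_unseen = encoder.transform(unseen_domains)
print(encoded_unseen.shape)            # one row, one column per training domain
print(encoded_unseen.toarray().sum())  # total of all entries in the encoded row
```

Because the unseen domain activates no column, a one-hot encoding of whole domain strings gives a classifier nothing to generalise from; features derived from the characters of the domain are a more plausible starting point.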

In [22]:
X.head()

Unnamed: 0,Domain,Family
0,google.com,benign
1,facebook.com,benign
2,youtube.com,benign
3,twitter.com,benign
4,instagram.com,benign


In [23]:
# View first transformed sample
print(transformed_X[0])

  (0, 7962)	1.0
  (0, 24001)	1.0


In [24]:
# View original first sample
X.iloc[0]

Domain    google.com
Family        benign
Name: 0, dtype: object

In [25]:
y

0        1
1        1
2        1
3        1
4        1
        ..
23995    5
23996    5
23997    5
23998    5
23999    5
Name: label, Length: 24000, dtype: int32

# Numerically encoding data with pandas

In [26]:
df.head()

Unnamed: 0,Domain,Family,label
0,google.com,benign,1
1,facebook.com,benign,1
2,youtube.com,benign,1
3,twitter.com,benign,1
4,instagram.com,benign,1


In [27]:
categorical_variables = ["Domain", "Family"]

dummies = pd.get_dummies(data=df[categorical_variables])
dummies

Unnamed: 0,Domain_01ejk9ev8p2f.com,Domain_01u3cpy749eb.org,Domain_01ujw92vo9if.net,Domain_01yzkl67sta3.net,Domain_01yzo9mb45e7.org,Domain_02cew24e0q4m.net,Domain_02s6kmk20mkq.top,Domain_05mns9evgtqj.top,Domain_09mzctevk1yz.net,Domain_09uzgtiz4xa7.top,...,Family_ramnit,Family_ranbyus,Family_rovnix,Family_shifu,Family_shiotob,Family_simda,Family_suppobox,Family_symmi,Family_tinba,Family_virut
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23995,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
23996,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
23997,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
23998,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [28]:
# Cast the integer label to object dtype.
# Note: get_dummies below only receives Domain and Family, so this cast does not affect its output.
df["label"] = df["label"].astype(object)
dummies = pd.get_dummies(data=df[["Domain", "Family"]],
                         dtype=float)
dummies

Unnamed: 0,Domain_01ejk9ev8p2f.com,Domain_01u3cpy749eb.org,Domain_01ujw92vo9if.net,Domain_01yzkl67sta3.net,Domain_01yzo9mb45e7.org,Domain_02cew24e0q4m.net,Domain_02s6kmk20mkq.top,Domain_05mns9evgtqj.top,Domain_09mzctevk1yz.net,Domain_09uzgtiz4xa7.top,...,Family_ramnit,Family_ranbyus,Family_rovnix,Family_shifu,Family_shiotob,Family_simda,Family_suppobox,Family_symmi,Family_tinba,Family_virut
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
np.random.seed(42)

# Create train and test splits with transformed_X
X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                    y,
                                                    train_size=0.8,
                                                    test_size=0.2)

# Create the model instance
clf = RandomForestClassifier(n_estimators=100,max_depth=5,n_jobs=-1) # 100 is the default, but you could try 1000 and see what happens

# Fit the model on the one-hot encoded data (scikit-learn estimators require numeric input)
clf.fit(X_train, y_train)



In [30]:
# Score the model (for a classifier, score() returns mean accuracy; higher is better)
clf.score(X_train, y_train)

0.959375

In [31]:
clf.score(X_test, y_test)

0.9579166666666666
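
A single accuracy figure does not show which families are confused with each other. A minimal sketch of the confusion matrix approach, using hypothetical true and predicted labels for three of the families rather than the output of `clf`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical true and predicted families for ten test domains
labels = ["benign", "locky", "ramnit"]
y_true = ["benign", "benign", "benign", "locky", "locky",
          "locky", "locky", "ramnit", "ramnit", "ramnit"]
y_pred = ["benign", "benign", "benign", "locky", "locky",
          "ramnit", "ramnit", "ramnit", "locky", "ramnit"]

# Rows are true families, columns are predicted families, in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Zero the diagonal so only misclassifications remain, then find the largest cell
errors = cm.copy()
np.fill_diagonal(errors, 0)
true_idx, pred_idx = np.unravel_index(errors.argmax(), errors.shape)
print(f"Most frequent confusion: {labels[true_idx]} predicted as {labels[pred_idx]}")

# Per-family precision, recall and F1
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```

With the real model, `y_true` and `y_pred` would be `y_test` and `clf.predict(X_test)`, and `ConfusionMatrixDisplay` (already imported above) could plot the matrix.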

# Random Forest Classifier

Version 2

In [None]:
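import tldextract  # third-party package (pip install tldextract)

def extract_root_domain(url):
    # Return the registered name without subdomain or suffix, e.g. "google" for "google.com"
    return tldextract.extract(url).domain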

In [None]:
df['root_domain'] = df['Domain'].apply(lambda x: extract_root_domain(str(x)))
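
Once the root domain is isolated, one possible feature scheme is a handful of lexical statistics per name. The sketch below uses only the standard library; the feature set and the function names `shannon_entropy` and `lexical_features` are assumptions for illustration, not a prescribed design.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy in bits per character of the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def lexical_features(name):
    """Hand-crafted features for one domain name (hypothetical feature set)."""
    length = len(name)
    return {
        "length": length,
        "digit_ratio": sum(c.isdigit() for c in name) / length if length else 0.0,
        "vowel_ratio": sum(c in "aeiou" for c in name.lower()) / length if length else 0.0,
        "unique_chars": len(set(name)),
        "entropy": shannon_entropy(name),
    }

# Example names taken from the dataset preview above
for name in ["google", "fhyibfwhpahb", "01ejk9ev8p2f"]:
    print(name, lexical_features(name))
```

Applied to the notebook, something like `pd.DataFrame(df['root_domain'].map(lexical_features).tolist())` would give a numeric feature table that a classifier can consume directly.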

# Logistic Regression

In [None]:
import plotly.graph_objects as go  # required for go.Figure and go.Bar below

count = df['Family'].value_counts()
colors = [
    '#FF6633', '#FFB399', '#FF33FF', '#FFFF99', '#00B3E6',
    '#E6B333', '#3366E6', '#999966', '#99FF99', '#B34D4D',
    '#FF6633', '#FFB399', '#FF33FF', '#FFFF99', '#00B3E6',
    '#E6B333', '#3366E6', '#999966', '#99FF99', '#B34D4D',
    '#FF6633', '#FFB399', '#FF33FF', '#FFFF99', '#00B3E6',
]
fig = go.Figure(data=[go.Bar(x=count.index, y=count, marker=dict(color=colors))])
fig.update_layout(
    xaxis_title='Types',
    yaxis_title='Count',
    title='Count of Different Types of URLs',
    plot_bgcolor='black',
    paper_bgcolor='black',
    font=dict(color='white')
)
fig.update_xaxes(tickfont=dict(color='white'))
fig.update_yaxes(tickfont=dict(color='white'))
fig.show()
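
The brief suggests input features that capture sequential properties of the text. One hedged option for the Logistic Regression stage is character n-gram counts from `CountVectorizer`, chained to the classifier in a pipeline. The eight domains below are a hypothetical mini-sample, and the `C`, `ngram_range` and `max_iter` values are illustrative rather than tuned.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical mini-sample; a real experiment would use the training split of the dataset
domains = ["google.com", "facebook.com", "youtube.com", "twitter.com",
           "fhyibfwhpahb.su", "nlgusntqeqixnqyo.org", "awwduqqrjxttmn.su", "ccxmwif.pl"]
families = ["benign"] * 4 + ["locky"] * 4

# analyzer="char" with ngram_range=(1, 2) counts single characters and adjacent pairs,
# which preserves some of the character order within each domain
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 2)),
    LogisticRegression(C=1.0, max_iter=1000),
)
model.fit(domains, families)

preds = model.predict(["instagram.com", "bdkrwgvuqnosl.com"])
print(preds)
```

Varying `ngram_range` and `C` and comparing test accuracy would be one way to show how features and model parameters affect performance.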

# Multi-Layer Perceptron Classifier
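
A minimal sketch of how this stage might compare `MLPClassifier` architectures, assuming character bigram counts as input. The mini-sample, the candidate `hidden_layer_sizes` and the other settings are hypothetical, and the scores are training accuracy on toy data only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical mini-sample; a real experiment would use the train/test split of the dataset
domains = ["google.com", "facebook.com", "youtube.com", "twitter.com",
           "fhyibfwhpahb.su", "nlgusntqeqixnqyo.org", "awwduqqrjxttmn.su", "ccxmwif.pl"]
families = ["benign"] * 4 + ["locky"] * 4

results = {}
for hidden in [(8,), (32,), (32, 16)]:
    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(2, 2)),
        MLPClassifier(hidden_layer_sizes=hidden, max_iter=500, random_state=0),
    )
    model.fit(domains, families)
    # Training accuracy only: with real data, score on the held-out test split instead
    results[hidden] = model.score(domains, families)

for hidden, acc in results.items():
    print(hidden, acc)
```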