# Assignment: Malicious and Benign Websites


## Kaggle Competition: https://www.kaggle.com/xwolf12/malicious-and-benign-websites

The project evaluates different classification models for predicting whether a website is malicious or benign, based on application-layer and network characteristics. The data were collected by feeding URLs from several verified sources of benign and malicious sites into a low-interaction client honeypot, which isolates the network traffic. Additional tools supplied further information, such as the server country from Whois.

This is the first version, with some initial results from applying machine learning classifiers in a bachelor thesis. Further details on the data processing and the data description can be found in the article below.

### Dataset
Building the dataset is an important topic and one of the most difficult steps. Following other articles and another open resource, we used three blacklists:

* machinelearning.inginf.units.it/data-andtools/hidden-fraudulent-urls-dataset
* malwaredomainlist.com
* zeuztacker.abuse.ch

From them we obtained around 185181 URLs. We assumed that they were malicious based on the information these sources provide; as a next research step we recommend verifying them with another security tool, such as VirusTotal.



## Task 1: Problem Statement
Discuss the problem setting and the first implications of the given data set...
* What assumptions can we make about the data?
* What problems are we expecting?

## Task 2: First Data Analysis, Cleaning and Feature Extraction
* Import the data to a Pandas DataFrame
* Run first simple statistics and visualizations
* Is there a need to clean the data? If yes, do so...
* Can you use the raw data directly, or should you extract features? Which features are suitable?


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


In [2]:
data = pd.read_csv('dataset.csv', encoding="ISO-8859-1")

In [3]:
data.head()

Unnamed: 0,URL,URL_LENGTH,NUMBER_SPECIAL_CHARACTERS,CHARSET,SERVER,CONTENT_LENGTH,WHOIS_COUNTRY,WHOIS_STATEPRO,WHOIS_REGDATE,WHOIS_UPDATED_DATE,...,DIST_REMOTE_TCP_PORT,REMOTE_IPS,APP_BYTES,SOURCE_APP_PACKETS,REMOTE_APP_PACKETS,SOURCE_APP_BYTES,REMOTE_APP_BYTES,APP_PACKETS,DNS_QUERY_TIMES,Type
0,M0_109,16,7,iso-8859-1,nginx,263.0,,,10/10/2015 18:21,,...,0,2,700,9,10,1153,832,9,2.0,1
1,B0_2314,16,6,UTF-8,Apache/2.4.10,15087.0,,,,,...,7,4,1230,17,19,1265,1230,17,0.0,0
2,B0_911,16,6,us-ascii,Microsoft-HTTPAPI/2.0,324.0,,,,,...,0,0,0,0,0,0,0,0,0.0,0
3,B0_113,17,6,ISO-8859-1,nginx,162.0,US,AK,7/10/1997 4:00,12/09/2013 0:45,...,22,3,3812,39,37,18784,4380,39,8.0,0
4,B0_403,17,6,UTF-8,,124140.0,US,TX,12/05/1996 0:00,11/04/2017 0:00,...,2,5,4278,61,62,129889,4586,61,4.0,0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1781 entries, 0 to 1780
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   URL                        1781 non-null   object 
 1   URL_LENGTH                 1781 non-null   int64  
 2   NUMBER_SPECIAL_CHARACTERS  1781 non-null   int64  
 3   CHARSET                    1781 non-null   object 
 4   SERVER                     1780 non-null   object 
 5   CONTENT_LENGTH             969 non-null    float64
 6   WHOIS_COUNTRY              1781 non-null   object 
 7   WHOIS_STATEPRO             1781 non-null   object 
 8   WHOIS_REGDATE              1781 non-null   object 
 9   WHOIS_UPDATED_DATE         1781 non-null   object 
 10  TCP_CONVERSATION_EXCHANGE  1781 non-null   int64  
 11  DIST_REMOTE_TCP_PORT       1781 non-null   int64  
 12  REMOTE_IPS                 1781 non-null   int64  
 13  APP_BYTES                  1781 non-null   int64

In [5]:
data.isna().all()

URL                          False
URL_LENGTH                   False
NUMBER_SPECIAL_CHARACTERS    False
CHARSET                      False
SERVER                       False
CONTENT_LENGTH               False
WHOIS_COUNTRY                False
WHOIS_STATEPRO               False
WHOIS_REGDATE                False
WHOIS_UPDATED_DATE           False
TCP_CONVERSATION_EXCHANGE    False
DIST_REMOTE_TCP_PORT         False
REMOTE_IPS                   False
APP_BYTES                    False
SOURCE_APP_PACKETS           False
REMOTE_APP_PACKETS           False
SOURCE_APP_BYTES             False
REMOTE_APP_BYTES             False
APP_PACKETS                  False
DNS_QUERY_TIMES              False
Type                         False
dtype: bool
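
The check above uses `isna().all()`, which is True only when every value in a column is missing, so the all-False result does not rule out partial gaps (the earlier `data.info()` output lists only 969 non-null values for CONTENT_LENGTH). A minimal sketch of the difference, using a made-up three-row frame named `toy` rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names mirror the dataset, the values are illustrative.
toy = pd.DataFrame({
    "SERVER": ["nginx", None, "Apache"],
    "CONTENT_LENGTH": [263.0, np.nan, np.nan],
    "URL_LENGTH": [16, 16, 17],
})

# True only if a column is missing in every row.
all_missing = toy.isna().all()

# Number of missing values per column, which exposes partial gaps.
missing_counts = toy.isna().sum()
print(missing_counts)
```

On the real frame, `data.isna().sum()` would give the per-column counts in the same way.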

In [6]:
data["Type"].unique()

array([1, 0], dtype=int64)
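
`unique()` confirms that the target is binary but says nothing about how often each class occurs. A small sketch of how the class shares could be inspected, shown on a made-up label series (the name `labels` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical labels: six benign (0) and one malicious (1) sample.
labels = pd.Series([0, 0, 0, 0, 0, 0, 1], name="Type")

# Absolute counts per class.
counts = labels.value_counts()

# Relative frequencies per class, useful for judging imbalance.
shares = labels.value_counts(normalize=True)
print(counts)
print(shares)
```

Applied to the notebook's data, the same calls would be `data["Type"].value_counts()` and `data["Type"].value_counts(normalize=True)`.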

In [7]:
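# Replace infinities with NaN, then fill every missing value with 0.
# Note: this also writes the integer 0 into the one missing SERVER entry.
data = data.replace([np.inf, -np.inf], np.nan)
data = data.fillna(0)
# The URL column holds identifiers such as M0_109 rather than real URLs, so it is dropped.
data.drop('URL', axis=1, inplace=True)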

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1781 entries, 0 to 1780
Data columns (total 20 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   URL_LENGTH                 1781 non-null   int64  
 1   NUMBER_SPECIAL_CHARACTERS  1781 non-null   int64  
 2   CHARSET                    1781 non-null   object 
 3   SERVER                     1781 non-null   object 
 4   CONTENT_LENGTH             1781 non-null   float64
 5   WHOIS_COUNTRY              1781 non-null   object 
 6   WHOIS_STATEPRO             1781 non-null   object 
 7   WHOIS_REGDATE              1781 non-null   object 
 8   WHOIS_UPDATED_DATE         1781 non-null   object 
 9   TCP_CONVERSATION_EXCHANGE  1781 non-null   int64  
 10  DIST_REMOTE_TCP_PORT       1781 non-null   int64  
 11  REMOTE_IPS                 1781 non-null   int64  
 12  APP_BYTES                  1781 non-null   int64  
 13  SOURCE_APP_PACKETS         1781 non-null   int64

In [9]:
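from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
# Map each string column to numeric category codes; the encoder is refit for every column.
data["CHARSET"] = enc.fit_transform(data[["CHARSET"]])
data["WHOIS_COUNTRY"] = enc.fit_transform(data[["WHOIS_COUNTRY"]])
data["WHOIS_STATEPRO"] = enc.fit_transform(data[["WHOIS_STATEPRO"]])
data["WHOIS_REGDATE"] = enc.fit_transform(data[["WHOIS_REGDATE"]])
data["WHOIS_UPDATED_DATE"] = enc.fit_transform(data[["WHOIS_UPDATED_DATE"]])

# Strip non-word characters from the server string. regex=True is needed in
# pandas >= 2.0, where str.replace treats the pattern as a literal by default.
data["SERVER"] = data["SERVER"].str.replace(r'\W', '', regex=True)
# The SERVER entry that was filled with the integer 0 becomes NaN here; drop that row.
data = data.dropna()

data["SERVER"] = enc.fit_transform(data[["SERVER"]])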


In [10]:
data.head()

Unnamed: 0,URL_LENGTH,NUMBER_SPECIAL_CHARACTERS,CHARSET,SERVER,CONTENT_LENGTH,WHOIS_COUNTRY,WHOIS_STATEPRO,WHOIS_REGDATE,WHOIS_UPDATED_DATE,TCP_CONVERSATION_EXCHANGE,DIST_REMOTE_TCP_PORT,REMOTE_IPS,APP_BYTES,SOURCE_APP_PACKETS,REMOTE_APP_PACKETS,SOURCE_APP_BYTES,REMOTE_APP_BYTES,APP_PACKETS,DNS_QUERY_TIMES,Type
0,16,7,4.0,200.0,263.0,29.0,98.0,59.0,593.0,7,0,2,700,9,10,1153,832,9,2.0,1
1,16,6,3.0,59.0,15087.0,29.0,98.0,889.0,593.0,17,7,4,1230,17,19,1265,1230,17,0.0,0
2,16,6,5.0,114.0,324.0,29.0,98.0,889.0,593.0,0,0,0,0,0,0,0,0,0,0.0,0
3,17,6,1.0,200.0,162.0,42.0,4.0,806.0,68.0,31,22,3,3812,39,37,18784,4380,39,8.0,0
4,17,6,3.0,123.0,124140.0,42.0,137.0,93.0,42.0,57,2,5,4278,61,62,129889,4586,61,4.0,0


In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1780 entries, 0 to 1780
Data columns (total 20 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   URL_LENGTH                 1780 non-null   int64  
 1   NUMBER_SPECIAL_CHARACTERS  1780 non-null   int64  
 2   CHARSET                    1780 non-null   float64
 3   SERVER                     1780 non-null   float64
 4   CONTENT_LENGTH             1780 non-null   float64
 5   WHOIS_COUNTRY              1780 non-null   float64
 6   WHOIS_STATEPRO             1780 non-null   float64
 7   WHOIS_REGDATE              1780 non-null   float64
 8   WHOIS_UPDATED_DATE         1780 non-null   float64
 9   TCP_CONVERSATION_EXCHANGE  1780 non-null   int64  
 10  DIST_REMOTE_TCP_PORT       1780 non-null   int64  
 11  REMOTE_IPS                 1780 non-null   int64  
 12  APP_BYTES                  1780 non-null   int64  
 13  SOURCE_APP_PACKETS         1780 non-null   int64
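
The ordinal encoding above treats the WHOIS_REGDATE and WHOIS_UPDATED_DATE strings as arbitrary categories, so their chronological order is lost. One possible alternative, sketched here on three values copied from the `data.head()` output plus a "None" placeholder, is to parse the strings into timestamps and derive numeric features. The day/month/year order and the feature names `reg_year` and `has_regdate` are assumptions made for illustration; the source does not state the date format.

```python
import pandas as pd

# Sample strings in the style of the WHOIS_REGDATE column; "None" stands for a missing date.
raw = pd.Series(["10/10/2015 18:21", "None", "7/10/1997 4:00"])

# Assumption: day/month/year order. Entries that do not match become NaT.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M", errors="coerce")

# Registration year as a numeric feature (NaN where the date is missing).
reg_year = parsed.dt.year

# Indicator for whether a registration date was available at all.
has_regdate = parsed.notna().astype(int)
print(reg_year.tolist(), has_regdate.tolist())
```

Whether such features help the classifiers would have to be checked on the real data.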

## Task 3: Train a Model
* Which ML model would you choose and why?
* Train and evaluate the model using the train data
* Is the data balanced? What are the implications, and how can you deal with them?
* Discuss the results: what improvements are possible?
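
One common way to address the balance question is to stratify the train/test split and to weight the classes inversely to their frequency. The sketch below illustrates both on synthetic data from `make_classification`; the imbalance ratio, the variable names, and the choice of `class_weight="balanced"` are assumptions for illustration, not the setup used in the cells that follow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (roughly 12% positives); not the website dataset.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.88, 0.12], random_state=0
)

# stratify=y_demo keeps the class ratio (almost) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# class_weight="balanced" reweights samples inversely to class frequency.
clf = RandomForestClassifier(class_weight="balanced", random_state=1)
clf.fit(X_tr, y_tr)

# F1 for the minority class is more informative than accuracy here.
minority_f1 = f1_score(y_te, clf.predict(X_te), pos_label=1)
print(y_tr.mean(), y_te.mean(), minority_f1)
```

Resampling the minority class is another option; which approach works best depends on the data.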


In [12]:
from sklearn.model_selection import train_test_split

X = data.drop(["Type"], axis = 1)
y = data["Type"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [13]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
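
The cell above fits the scaler on the training split only and then applies it to both splits, which avoids leaking test statistics into training. The same idea can be packaged in a scikit-learn pipeline so that it also holds inside cross-validation. The following is a sketch on synthetic data; the names `pipe`, `X_demo`, and `y_demo` and the choice of 5 folds are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic toy data standing in for the scaled features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler is refit on the training part of every fold, never on the held-out part.
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver="lbfgs", hidden_layer_sizes=(15,), max_iter=500, random_state=1),
)

# One F1 score per fold.
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="f1")
print(scores.mean())
```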

In [14]:
from sklearn.ensemble import RandomForestClassifier

rfmodel = RandomForestClassifier(random_state=1, n_jobs=4)
rfmodel.fit(X_train, y_train)
predictrfmodel = rfmodel.predict(X_test)


In [15]:
from sklearn.neural_network import MLPClassifier
# learning_rate_init is omitted: scikit-learn only uses it with the 'sgd' and 'adam' solvers, not with 'lbfgs'.
mlp = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(15,), random_state=1)
mlp.fit(X_train, y_train)
mlp_predict = mlp.predict(X_test)

## Task 4: Evaluate 
* Report the F1-score on the test data. Who will build the best model?

In [16]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predictrfmodel))

              precision    recall  f1-score   support

           0       0.95      1.00      0.97       457
           1       0.96      0.66      0.78        77

    accuracy                           0.95       534
   macro avg       0.95      0.83      0.88       534
weighted avg       0.95      0.95      0.94       534



In [17]:
print(classification_report(y_test, mlp_predict))

              precision    recall  f1-score   support

           0       0.96      0.96      0.96       457
           1       0.78      0.75      0.77        77

    accuracy                           0.93       534
   macro avg       0.87      0.86      0.87       534
weighted avg       0.93      0.93      0.93       534
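
The F1-score in both reports is the harmonic mean of precision and recall, $F_1 = 2PR / (P + R)$. As a sanity check, the class-1 values can be recomputed from the rounded precision and recall printed above; the helper name `f1_from` is introduced only for this illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 (malicious), using the rounded values from the two reports.
rf_f1 = f1_from(0.96, 0.66)   # random forest
mlp_f1 = f1_from(0.78, 0.75)  # MLP

print(round(rf_f1, 3), round(mlp_f1, 3))
```

This gives about 0.782 for the random forest and about 0.765 for the MLP. The small gap to the reported 0.77 for the MLP comes from using inputs that were already rounded to two decimals. On the malicious class the two models are therefore close in F1: the random forest is more precise, while the MLP has the higher recall.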

