**Malicious Websites Prediction**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

**Step 1 - Data Exploration**

1. Import the dataset and preview its first 10 rows.

In [None]:
df = pd.read_csv('../input/exercise.csv')
df.head(10)

**About the dataset**

*If the variable Type = '1', the URL is a malicious URL.*

*If the variable Type = '0', the URL is a benign URL.*

2. Explore the dataset

In [None]:
df.shape

The dataset has 1,781 records with 21 features.

In [None]:
df.dtypes

In [None]:
#Check null values
print(df.isnull().sum())
df[pd.isnull(df).any(axis=1)]

CONTENT_LENGTH has 812 missing values (NaN).

In [None]:
#Fill missing numeric values by interpolation (no rows are removed)
df = df.interpolate()
print(df.isnull().sum())
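
Note that `interpolate()` does not drop rows: by default it fills each numeric gap linearly from the neighbouring rows, so the filled values depend on row order. The sketch below illustrates this on a made-up column (the numbers are invented, not taken from the dataset) and, as one possible alternative that this notebook does not use, a median fill.

```python
import numpy as np
import pandas as pd

# Toy column with gaps; these numbers are invented for illustration.
toy = pd.DataFrame({"CONTENT_LENGTH": [100.0, np.nan, np.nan, 400.0, 250.0]})

# Linear interpolation fills each gap from the rows directly above and below,
# so the result depends on how the rows happen to be ordered.
linear_fill = toy.interpolate()

# Alternative (an assumption, not what the notebook does): fill with the median.
median_fill = toy.fillna(toy.median())

print(linear_fill["CONTENT_LENGTH"].tolist())
print(median_fill["CONTENT_LENGTH"].tolist())
```

For a column such as CONTENT_LENGTH, where neighbouring rows are unrelated websites, the interpolated values carry little real information, which is worth keeping in mind when reading the statistics that follow.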

In [None]:
#Charset
df['CHARSET'].unique()

In [None]:
#Normalize charset labels that differ only by letter case
df['CHARSET'] = df['CHARSET'].replace({'iso-8859-1': 'ISO-8859-1', 'utf-8': 'UTF-8'})

In [None]:
#WHOIS_COUNTRY 
df['WHOIS_COUNTRY'].unique()

In [None]:
#Map spelling variants of the same country to a single code
df['WHOIS_COUNTRY'] = df['WHOIS_COUNTRY'].replace({
    'United Kingdom': 'UK',
    "[u'GB'; u'UK']": 'UK',
    'us': 'US',
    'se': 'SE',
    'ru': 'RU',
})

In [None]:
df.describe(include='all')

There are 1,781 unique URLs in this dataset.

The average URL length is ~57, with values ranging from 16 to 249.

The average number of special characters per URL is ~11, with values ranging from 5 to 43.

The average DNS_QUERY_TIMES is ~2.3, with values ranging from 0 to 20.


In [None]:
#How many URLs are malicious?
df['Type'].value_counts()

216 of the URLs in this dataset are malicious.
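
To put that count in perspective, the short calculation below derives the class balance from the two figures reported so far (1,781 rows and 216 malicious URLs). It assumes no rows were dropped in between. It also shows the accuracy a trivial "always benign" model would reach, which is a useful reference point for any classifier trained on this data.

```python
# Counts taken from the text above: 1,781 URLs in total, 216 of them malicious.
total_urls = 1781
malicious_urls = 216
benign_urls = total_urls - malicious_urls

malicious_share = malicious_urls / total_urls
# Accuracy of a trivial model that labels every URL as benign.
baseline_accuracy = benign_urls / total_urls

print(f"Malicious share: {malicious_share:.1%}")
print(f"Always-benign baseline accuracy: {baseline_accuracy:.1%}")
```

The classes are clearly imbalanced, so accuracy alone can look high even for a model that never flags a malicious URL.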

In [None]:
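#numeric_only=True skips text columns, which recent pandas versions no longer drop silently
df.groupby('Type').mean(numeric_only=True)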

Observations:

(1) The average length of malicious URLs is longer than that of benign URLs.

(2) The average number of special characters of malicious URLs is greater than that of benign URLs.

(3) The average length of content of malicious URLs is much shorter than that of benign URLs.

(4) The average number of exchanged TCP packets of malicious URLs is less than that of benign URLs.

(5) The average number of TCP ports of malicious URLs is less than that of benign URLs.

(6) The average number of transferred bytes of malicious URLs is less than that of benign URLs.

(7) The average number of generated DNS packets of malicious URLs is a little greater than that of benign URLs.

In [None]:
df.groupby('Type').median(numeric_only=True)

In [None]:
df.groupby(['CHARSET','Type']).count()

Observations:

(1) The only URL whose CHARSET is 'windows-1251' is a malicious URL.

(2) 10.3% of URLs with CHARSET as 'ISO-8859-1' are malicious URLs.

(3) 14.4% of URLs with CHARSET as 'UTF-8' are malicious URLs.
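
Percentages like these can be read directly from a normalized cross-tabulation instead of being worked out by hand from the counts. The sketch below uses a small invented table in place of `df`; only the column names `CHARSET` and `Type` come from the dataset.

```python
import pandas as pd

# Toy rows, invented for illustration; the notebook would pass df instead.
toy = pd.DataFrame({
    "CHARSET": ["UTF-8", "UTF-8", "UTF-8", "UTF-8", "ISO-8859-1", "ISO-8859-1"],
    "Type":    [0, 0, 0, 1, 0, 1],
})

# normalize="index" divides each row by its total, giving the share of
# benign (0) and malicious (1) URLs within each charset.
share_by_charset = pd.crosstab(toy["CHARSET"], toy["Type"], normalize="index")
print(share_by_charset)
```

The same call with `WHOIS_COUNTRY` in place of `CHARSET` would give the per-country shares.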


In [None]:
df.groupby(['WHOIS_COUNTRY','Type']).count()

From this dataset,

(1) 98.4% of URLs from ES (Spain) are malicious URLs. (Total 63 URLs)

(2) 97.6% of URLs from CA (Canada) are benign URLs. (Total 84 URLs)

(3) 4.5% of URLs from US (United States) are malicious URLs. (Total 1,106 URLs)

(4) 21.2% of URLs whose country is unknown are malicious URLs. (Total 306 URLs)

**Step 2 - Find important factors**

1. Find the correlation with Type using a heatmap

In [None]:
#Calculate the correlation between each pair of numeric variables
correlation = df.corr(numeric_only=True)

In [None]:
plt.figure(figsize = (20, 20))
sns.set(font_scale = 2)
sns.heatmap(correlation, annot = True, annot_kws = {'size': 15}, cmap = 'Blues')

From the heatmap, URL_LENGTH, NUMBER_SPECIAL_CHARACTERS, CONTENT_LENGTH, and DNS_QUERY_TIMES appear to be the variables most strongly correlated with Type.
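
Reading a 20x20 heatmap by eye is error-prone, so it can help to list the correlations with Type as a sorted column. The sketch below shows one way to do that on invented toy data. In the notebook the same chain could be applied to the `correlation` matrix computed above.

```python
import pandas as pd

# Toy data, invented for illustration; in the notebook this would be df.
toy = pd.DataFrame({
    "URL_LENGTH":      [20, 35, 50, 80, 95, 120],
    "DNS_QUERY_TIMES": [2, 0, 4, 2, 6, 4],
    "Type":            [0, 0, 0, 1, 1, 1],
})

# Correlation of every numeric column with Type, strongest first.
# abs() is used because a strong negative correlation is as informative
# as a strong positive one.
corr_with_type = (
    toy.corr()["Type"]
    .drop("Type")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_type)
```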

2. Create plots to better understand features such as

   - Type x URL Length
   
   - Type x Number of Special Characters
   
   - Type x Content Length
   
   - Type x DNS Query Times

In [None]:
#Type x URL Length
plt.figure(figsize=(5, 5))
sns.boxenplot(data = df, x="Type", y="URL_LENGTH",
              color="b", scale="linear")

For malicious URLs, most URL_LENGTH values fall between 40 and 100 (median: 49).

For benign URLs, most URL_LENGTH values fall between 40 and 65 (median: 50).

In [None]:
#Type x Number of Special Characters
plt.figure(figsize=(5, 5))
sns.boxenplot(data = df, x="Type", y="NUMBER_SPECIAL_CHARACTERS",
              color="g", scale="linear")

For malicious URLs, most NUMBER_SPECIAL_CHARACTERS values fall between 10 and 20 (median: 12).

For benign URLs, most NUMBER_SPECIAL_CHARACTERS values fall between 8 and 12 (median: 10).

In [None]:
#Type x Content Length
plt.figure(figsize=(5, 5))
sns.boxenplot(data = df, x="Type", y="CONTENT_LENGTH",
              color="y", scale="linear")

For malicious URLs, most CONTENT_LENGTH values fall between 0 and 300000 (median: 4912.75).

For benign URLs, most CONTENT_LENGTH values fall between 0 and 160000 (median: 3074.84).

In [None]:
#Type x DNS Query Times
plt.figure(figsize=(5, 5))
sns.boxenplot(data = df, x="Type", y="DNS_QUERY_TIMES",
              color="y", scale="linear")

For malicious URLs, most DNS_QUERY_TIMES values fall between 0 and 6 (median: 2).

For benign URLs, most DNS_QUERY_TIMES values fall between 0 and 4 (median: 0).
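
The ranges and medians quoted for the four plots can also be obtained as numbers rather than read off the figures. The sketch below computes quartiles per class on invented toy values. Applied to `df`, the same pattern would work for each of the four columns.

```python
import pandas as pd

# Toy values, invented for illustration; in the notebook this would be df.
toy = pd.DataFrame({
    "Type":       [0, 0, 0, 0, 1, 1, 1, 1],
    "URL_LENGTH": [40, 45, 55, 60, 40, 60, 80, 100],
})

# One row per class, one column per quantile (25%, median, 75%).
summary = (
    toy.groupby("Type")["URL_LENGTH"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(summary)
```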

**Step 3 - Logistic Regression**

Based on the findings above, the independent variables are URL_LENGTH, NUMBER_SPECIAL_CHARACTERS, and DNS_QUERY_TIMES.

Although Type and CONTENT_LENGTH are correlated, CONTENT_LENGTH has many missing values (812 of 1,781 rows), so I leave it out of the list.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

X = df[['URL_LENGTH', 'NUMBER_SPECIAL_CHARACTERS','DNS_QUERY_TIMES']]
y = df['Type']  

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
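
Because only about 12% of the URLs are malicious, a purely random split can leave the test set with a noticeably different malicious share than the training set. One common option, shown here as a sketch on invented toy arrays and not used in this notebook, is the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real features and labels: 100 rows, 12% positives
# (roughly the malicious share reported earlier). Values are invented.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 12 + [0] * 88)

# stratify=y_toy keeps the positive share the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```

In the notebook this would correspond to adding `stratify=y` to the existing call.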

In [None]:
#Fit the model on the training set, then predict labels for the test set
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
y_pred = logistic_regression.predict(X_test)

In [None]:
print(X_test)
print(y_pred)

In [None]:
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logistic_regression.score(X_test, y_test)))
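
Accuracy should be read with care here: with 216 malicious URLs out of 1,781, a model that predicts "benign" for everything is already right most of the time. A confusion matrix together with precision and recall gives a clearer picture. The sketch below uses invented toy labels. In the notebook the inputs would be `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels, invented for illustration; in the notebook these would be
# y_test and y_pred.
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true_toy, y_pred_toy)
recall = recall_score(y_true_toy, y_pred_toy)        # share of malicious URLs caught
precision = precision_score(y_true_toy, y_pred_toy)  # share of alarms that are correct

print(cm)
print(recall, precision)
```

In this toy case accuracy is 80%, yet only half of the malicious URLs are caught.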

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

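#Note: this AUC is computed from hard 0/1 predictions, so it can differ from
#the area under the plotted curve, which is built from predicted probabilities
logit_roc_auc = roc_auc_score(y_test, logistic_regression.predict(X_test))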
fpr, tpr, thresholds = roc_curve(y_test, logistic_regression.predict_proba(X_test)[:,1])

plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

Since AUC = 0.61 is above the 0.5 of a random classifier, this model has some predictive power, although it is limited.
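
One caveat about this number: in the cell above, `roc_auc_score` receives the output of `predict` (hard 0/1 labels), while the plotted curve is built from `predict_proba`. The two generally give different areas. The sketch below shows the difference on invented toy scores, with a 0.5 cut-off assumed for the hard labels.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented for illustration.
y_true_toy = [0, 0, 0, 0, 1, 1]
scores_toy = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7]

# Hard labels obtained with an assumed 0.5 cut-off, as predict() would return.
labels_toy = [1 if s >= 0.5 else 0 for s in scores_toy]

auc_from_scores = roc_auc_score(y_true_toy, scores_toy)
auc_from_labels = roc_auc_score(y_true_toy, labels_toy)

print(auc_from_scores, auc_from_labels)
```

In the notebook, passing `logistic_regression.predict_proba(X_test)[:, 1]` to `roc_auc_score` would report the area under the curve that is actually plotted. The resulting value may differ from 0.61.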