## Description of Malicious Website
A malicious website is a site that attempts to install malware (a term for anything that will disrupt computer operation, gather your personal information or, in a worst-case scenario, gain total access to your machine) onto your device. This usually requires some action on your part. In the case of a drive-by download, however, the website will attempt to install software on your computer without asking for permission first. (source: https://us.norton.com/internetsecurity-malware-what-are-malicious-websites.html)
## Instruction
This notebook builds a model to detect malicious websites. The website URL is used as the feature, and a 1D Convolutional Neural Network (CNN) is used as the detection algorithm. The model is evaluated with holdout validation.
Consult: https://blog.csdn.net/m0_37876745/article/details/84937339
## Consult
https://blog.csdn.net/sinat_26917383/article/details/72857454  
https://blog.csdn.net/zwqjoy/article/details/86677030  
https://blog.csdn.net/vesper305/article/details/44927047  
https://blog.csdn.net/akadiao/article/details/78788864  
https://blog.csdn.net/qq_40549291/article/details/85274581  

## Setup Notebook

In [None]:
# the tldextract module is used to extract the top-level domain from a URL
# Consult: https://blog.csdn.net/weixin_44285988/article/details/89235814
!pip install tldextract

## Module Description
numpy: Array and matrix operations.  
pandas: Data analysis.  
re: Matching and processing strings with regular expressions.  
matplotlib.pyplot: Data visualization.  
matplotlib.image: Basic image loading, rescaling and display operations.  
seaborn: Data visualization.  
random: Random number generation.  
os: Miscellaneous operating system interfaces.  
pickle: Python object serialization (Consult: https://docs.python.org/3/library/pickle.html).  
urllib.parse: URL processing.  
urlparse: Splitting a URL into its components.  
tldextract: See above.



In [None]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
import gc
import random
import os
import pickle
import tensorflow as tf
from tensorflow.python.util import deprecation
from urllib.parse import urlparse
import tldextract

train_test_split: Split the data into a training set and a test set.  
Tokenizer: Vectorize text by converting it to a sequence of integers.  
pad_sequences: Pad sequences to the same length. Consult: https://blog.csdn.net/wcy23580/article/details/84957471  
backend: Backend operations.  
metrics: Measure the loss or the change in model accuracy. Consult: https://www.cnblogs.com/zdm-code/p/12244043.html  
EarlyStopping: Stop training ahead of time.  
plot_model: Draw the neural network as a flow chart.  

In [None]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import models, layers, backend, metrics
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import plot_model
from PIL import Image
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
# set random seed
# (to obtain reproducible training results on the same dataset)
# Consult: https://www.jianshu.com/p/917962bef4a2
os.environ['PYTHONHASHSEED'] = '0'
# (log output setting)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
np.random.seed(0)
random.seed(0)
tf.set_random_seed(0)

In [None]:
# other setup
# (render figures at retina resolution)
%config InlineBackend.figure_format = 'retina'
pd.set_option('max_colwidth', 500)
# Disable deprecation warnings
deprecation._PRINT_DEPRECATION_WARNINGS = False

# Load Data

In [None]:
# load data
data = pd.read_csv('../input/data.csv')
# Shuffle data
data = data.sample(frac=1, random_state=0)
print(f'Data size: {data.shape}')
data.head()

This notebook uses the holdout method, which splits the data into 80% for training and 20% for validation.

In [None]:
val_size = 0.2
train_data, val_data = train_test_split(data, test_size=val_size, stratify=data['label'], random_state=0)
print(f'Train shape: {train_data.shape}, Validation shape: {val_data.shape}')

# Data Analysis and Feature Engineering
Let's do some data analysis to expand our knowledge of this data and do some feature engineering. First, we want to find out whether the data is imbalanced.

In [None]:
data.label.value_counts().plot.barh()
plt.title('All Data')
plt.show()

In [None]:
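# look the counts up by label instead of relying on the order returned by value_counts()
counts = data.label.value_counts()
good, bad = counts['good'], counts['bad']
print(f'Ratio of data between target labels (bad : good) is {bad//bad}:{good//bad}')
# note how {} placeholders are used inside the f-string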

Next, let's find out the most common domain suffixes, domains and subdomains. To do this analysis, we first need to extract the subdomain, domain and domain suffix from each URL.

In [None]:
def parsed_url(url):
    # extract subdomain, domain, and domain suffix from url
    # (read the fields by name so extra fields in the result cannot break unpacking)
    extracted = tldextract.extract(url)
    parts = (extracted.subdomain, extracted.domain, extracted.suffix)
    # if a part == '', fill it with '<empty>'
    return ['<empty>' if part == '' else part for part in parts]

In [None]:
#extract_url_data = [parsed_url(url) for url in train_data['url']]
#extract_url_data

In [None]:
def extract_url(data):
    # parsed url
    extract_url_data = [parsed_url(url) for url in data['url']]
    extract_url_data = pd.DataFrame(extract_url_data, columns=['subdomain', 'domain', 'domain_suffix'])
    # concat extracted feature with original data
    data = data.reset_index(drop=True)
    data = pd.concat([data, extract_url_data], axis=1)
    return data

In [None]:
train_data = extract_url(train_data)
val_data = extract_url(val_data)

In [None]:
#train_data.head()
#val_data.head()

In [None]:
def plot(train_data, val_data, column):
    plt.figure(figsize=(10, 17))
    plt.subplot(411)
    plt.title(f'Train data {column}')
    plt.ylabel(column)
    train_data[column].value_counts().head(10).plot.barh()
    plt.subplot(412)
    plt.title(f'Validation data {column}')
    plt.ylabel(column)
    val_data[column].value_counts().head(10).plot.barh()
    plt.subplot(413)
    plt.title(f'Train data {column} (grouped)')
    plt.ylabel(f'(label, {column})')
    train_data.groupby('label')[column].value_counts().head(10).plot.barh()
    plt.subplot(414)
    plt.title(f'Validation data {column} (grouped)')
    plt.ylabel(f'(label, {column})')
    val_data.groupby('label')[column].value_counts().head(10).plot.barh()
    plt.show()

In [None]:
plot(train_data, val_data, 'subdomain')

In [None]:
plot(train_data, val_data, 'domain')

In [None]:
plot(train_data, val_data, 'domain_suffix')

The plots above show something interesting: some websites with google and twitter domains have bad labels. Let's filter the data to look at the rows with google and twitter domains that are labeled bad.

In [None]:
train_data[(train_data['domain'] == 'google') & (train_data['label'] == 'bad')].head()

In [None]:
train_data[(train_data['domain'] == 'twitter') & (train_data['label'] == 'bad')].head()

Next, we need to tokenize the URLs at character level so that they can be used as input to the CNN model.
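
To make the idea concrete, here is a minimal pure-Python sketch of character-level tokenization. The helpers `build_char_index` and `encode`, the `'<oov>'` token and the sample URLs are hypothetical and only illustrate the idea; this is not the Keras `Tokenizer` implementation. Id 0 is kept free for padding and id 1 is used for characters that were not seen in the training data.

```python
from collections import Counter

def build_char_index(urls, oov_token='<oov>'):
    """Map each character seen in `urls` to an integer id.

    Id 0 is left free for padding, id 1 is the out-of-vocabulary token,
    and the remaining ids are assigned by descending frequency.
    """
    counts = Counter(ch for url in urls for ch in url)
    index = {oov_token: 1}
    for ch, _ in counts.most_common():
        index[ch] = len(index) + 1
    return index

def encode(url, index, oov_token='<oov>'):
    """Convert a URL to a list of ids; unseen characters map to the OOV id."""
    return [index.get(ch, index[oov_token]) for ch in url]

train_urls = ['a.com', 'b.com']          # made-up training URLs
char_index = build_char_index(train_urls)
encoded = encode('z.com', char_index)    # 'z' never appears in train_urls
print(encoded)
```

Because the index is built only from the training URLs, any character that first appears in the validation data falls back to the out-of-vocabulary id instead of raising an error.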

In [None]:
tokenizer = Tokenizer(filters='', char_level=True, lower=False, oov_token=1)
# fit only on training data
tokenizer.fit_on_texts(train_data['url'])
tokenizer.word_index.keys()
# (number of keys in the token dictionary)
n_char = len(tokenizer.word_index.keys())
print(f'N Char: {n_char}')
# Consult: https://www.cnblogs.com/jielongAI/p/10178585.html
# Consult: https://www.runoob.com/python/att-dictionary-keys.html

In [None]:
train_seq = tokenizer.texts_to_sequences(train_data['url'])
val_seq = tokenizer.texts_to_sequences(val_data['url'])
print('Before tokenization: ')
print(train_data.iloc[0]['url'])
print('\nAfter tokenization: ')
print(train_seq[0])

In [None]:
sequence_length = np.array([len(i) for i in train_seq])
sequence_length = np.percentile(sequence_length, 99).astype(int)
print(f'Sequence length: {sequence_length}')
# Consult: https://blog.csdn.net/ximibbb/article/details/79149887 (text preprocessing methods)

The URLs have different lengths, so we use padding to bring every sequence to the same length.
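
As an illustration, the NumPy sketch below mimics what `pad_sequences` does with `padding='post'` and its default `truncating='pre'`: short sequences get zeros appended at the end, while sequences longer than `maxlen` lose ids from the front. The helper `pad_post` and the sample sequences are hypothetical, and `maxlen` is assumed to be positive.

```python
import numpy as np

def pad_post(sequences, maxlen):
    """Right-pad with zeros; keep the last `maxlen` ids of longer sequences."""
    out = np.zeros((len(sequences), maxlen), dtype=int)
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]            # truncation drops ids from the front
        out[row, :len(kept)] = kept     # the padding zeros stay at the end
    return out

# made-up sequences: one shorter and one longer than maxlen
padded = pad_post([[5, 6], [1, 2, 3, 4, 5]], maxlen=4)
print(padded)
```

Since the sequence length is taken from the 99th percentile of the training lengths, only about 1% of the training URLs should be affected by truncation.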

In [None]:
train_seq = pad_sequences(train_seq, padding='post', maxlen=sequence_length)
val_seq = pad_sequences(val_seq, padding='post', maxlen=sequence_length)
print('After padding: ')
print(train_seq[0])
# Consult: https://blog.csdn.net/wcy23580/article/details/84957471

Save the tokenizer for later use

In [None]:
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)
# Consult: https://blog.csdn.net/gdkyxy2013/article/details/80495353
# Consult: https://www.cnblogs.com/cainiaoxuexi2017-ZYA/p/11673982.html
# Consult: https://www.cnblogs.com/zhangbao003/p/8926366.html

We also encode the subdomain, domain, domain suffix and label as numerical variables.

In [None]:
def encode_label(label_index, data):
    # values not seen in the training data map to the '<unknown>' index
    return label_index.get(data, label_index['<unknown>'])

In [None]:
unique_value = {}
for feature in ['subdomain', 'domain', 'domain_suffix']:
    # get unique value
    label_index = {label: index for index, label in enumerate(train_data[feature].unique())}
    # add unknown label in last index
    label_index['<unknown>'] = list(label_index.values())[-1] + 1
    # count unique value
    unique_value[feature] = label_index['<unknown>']
    # encode
    train_data.loc[:, feature] = [encode_label(label_index, i) for i in train_data.loc[:, feature]]
    val_data.loc[:, feature] = [encode_label(label_index, i) for i in val_data.loc[:, feature]]
    # save label index
    with open(f'{feature}.pkl', 'wb') as f:
        pickle.dump(label_index, f)
# https://www.runoob.com/python/python-func-enumerate.html
# https://blog.csdn.net/weixin_39549734/article/details/81224567

In [None]:
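# (encode the labels: good -> 0, anything else -> 1)
# the loop variable is named df so that it does not overwrite the original `data`
for df in [train_data, val_data]:
    df.loc[:, 'label'] = [0 if i == 'good' else 1 for i in df.loc[:, 'label']]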
# Consult: https://www.cnblogs.com/zknublx/p/9623080.html

In [None]:
print(f"Unique subdomain in Train data: {unique_value['subdomain']}")
print(f"Unique domain in Train data: {unique_value['domain']}")
print(f"Unique domain suffix in Train data: {unique_value['domain_suffix']}")

# Create CNN Model

In [None]:
def convolution_block(x):
    # 3 sequence conv layer
    conv_3_layer = layers.Conv1D(64, 3, padding='same', activation='elu')(x)
    # 5 sequence conv layer
    conv_5_layer = layers.Conv1D(64, 5, padding='same', activation='elu')(x)
    # concat conv layer
    conv_layer = layers.concatenate([x, conv_3_layer, conv_5_layer])
    # flatten
    conv_layer = layers.Flatten()(conv_layer)
    return conv_layer
# Consult: https://blog.csdn.net/qq_42004289/article/details/105367854
# Consult: https://blog.csdn.net/kilotwo/article/details/88403079

In [None]:
def embedding_block(unique_value, size, name):
    # (build the first layer of the network: the input layer)
    input_layer = layers.Input(shape=(1,), name=name + '_input')
    # (map each integer code to an embedding vector)
    # unique_value + 1 rows are needed because '<unknown>' is encoded as index unique_value
    embedding_layer = layers.Embedding(unique_value + 1, size, input_length=1)(input_layer)
    return input_layer, embedding_layer
# Consult: https://blog.csdn.net/weixin_44441131/article/details/105901178
# Consult: https://blog.csdn.net/u013249853/article/details/89194787 (embedding layer)

In [None]:
def create_model(sequence_length, n_char, unique_value):
    input_layer = []
    # sequence input layer
    sequence_input_layer = layers.Input(shape=(sequence_length,), name='url_input')
    input_layer.append(sequence_input_layer)
    # convolution block
    char_embedding = layers.Embedding(n_char + 1, 32, input_length=sequence_length)(sequence_input_layer)
    conv_layer = convolution_block(char_embedding)
    # entity embedding
    entity_embedding = []
    for key, n in unique_value.items():
        size = 4
        input_l, embedding_l = embedding_block(n, size, key)
        embedding_l = layers.Reshape(target_shape=(size,))(embedding_l)
        input_layer.append(input_l)
        entity_embedding.append(embedding_l)
    # concat all layer
    fc_layer = layers.concatenate([conv_layer, *entity_embedding])
    fc_layer = layers.Dropout(rate=0.5)(fc_layer)
    # dense layer
    fc_layer = layers.Dense(128, activation='elu')(fc_layer)
    fc_layer = layers.Dropout(rate=0.2)(fc_layer)
    # output layer
    output_layer = layers.Dense(1, activation='sigmoid')(fc_layer)
    model = models.Model(inputs=input_layer, outputs=output_layer)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[metrics.Precision(), metrics.Recall()])
    return model

In [None]:
# reset session
backend.clear_session()
os.environ['PYTHONHASHSEED'] = '0'
np.random.seed(0)
random.seed(0)
tf.set_random_seed(0)
# create model
model = create_model(sequence_length, n_char, unique_value)
model.summary()
# Consult: https://blog.csdn.net/qq_34418352/article/details/106636200
# Consult: https://blog.csdn.net/ybdesire/article/details/85217688

In [None]:
plot_model(model, to_file='model.png')
model_image = mpimg.imread('model.png')
plt.figure(figsize=(75, 75))
plt.imshow(model_image)
plt.show()

The model receives four inputs. The first is the URL after tokenization and padding; the other three are the encoded subdomain, domain and domain suffix. The URL input passes through an embedding layer and the convolution block, while each of the other inputs passes through its own embedding layer. The results from all inputs are then concatenated.
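
As a rough check on the size of the concatenated vector, the sketch below works out its width from the layer sizes used in `create_model` (32-dimensional character embedding, two convolutions with 64 filters each, three entity embeddings of size 4). The helper `concat_width` and the sequence length of 100 are hypothetical; the real length comes from the 99th percentile computed earlier.

```python
def concat_width(sequence_length, char_dim=32, conv_filters=64,
                 n_entities=3, entity_dim=4):
    """Width of the vector fed to the first Dropout/Dense layers."""
    # padding='same' keeps the sequence length, so each position carries the
    # character embedding plus the outputs of both convolutions
    per_position = char_dim + 2 * conv_filters
    conv_features = sequence_length * per_position   # after Flatten
    entity_features = n_entities * entity_dim        # after Reshape
    return conv_features + entity_features

width = concat_width(sequence_length=100)   # 100 is a made-up length
print(width)
```

The flattened convolution block dominates this width, which is one reason a dropout layer with a fairly high rate sits directly after the concatenation.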

# Model Training

In [None]:
train_x = [train_seq, train_data['subdomain'], train_data['domain'], train_data['domain_suffix']]
train_y = train_data['label']
val_x = [val_seq, val_data['subdomain'], val_data['domain'], val_data['domain_suffix']]
val_y = val_data['label']

In [None]:
early_stopping = [EarlyStopping(monitor='val_precision', patience=5, restore_best_weights=True, mode='max')]
history = model.fit(train_x, train_y, batch_size=64, epochs=25, verbose=1, validation_data=(val_x, val_y), shuffle=True, callbacks=early_stopping)
model.save('model.h5')
# Consult: https://blog.csdn.net/DoReAGON/article/details/88552892
# Consult: https://blog.csdn.net/leviopku/article/details/86612293
# Consult: https://blog.csdn.net/tszupup/article/details/85198949 (the saved model can be loaded directly to continue training)

In [None]:
plt.figure(figsize=(20, 5))
for index, key in enumerate(['loss', 'precision', 'recall']):
    plt.subplot(1, 3, index+1)
    plt.plot(history.history[key], label=key)
    plt.plot(history.history[f'val_{key}'], label=f'val {key}')
    plt.legend()
    plt.title(f'{key} vs val {key}')
    plt.ylabel(f'{key}')
    plt.xlabel('epoch')

# Model Validation

In [None]:
val_pred = model.predict(val_x)
val_pred = np.where(val_pred[:, 0] >= 0.5, 1, 0)
print(f'Validation Data:\n{val_data.label.value_counts()}')
print(f'\n\nConfusion Matrix:\n{confusion_matrix(val_y, val_pred)}')
print(f'\n\nClassification Report:\n{classification_report(val_y, val_pred)}')

# Conclusion

In conclusion, the trained model has high precision and recall, but precision is the value that deserves the most attention. Precision must be high because, if it is low, a website that is not malicious is more likely to be classified as malicious.
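
To make the link between precision and false alarms explicit, the sketch below computes both metrics from the four cells of a binary confusion matrix. The helper `precision_recall` and the counts are made up for illustration and are not results from this notebook; for labels 0 and 1, `confusion_matrix(val_y, val_pred).ravel()` returns the cells in the order `tn, fp, fn, tp`.

```python
def precision_recall(tn, fp, fn, tp):
    """Precision and recall for the positive (malicious) class."""
    # precision: share of sites flagged as malicious that really are malicious
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # recall: share of malicious sites that the model actually flags
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# made-up counts, not results from this notebook
precision, recall = precision_recall(tn=900, fp=10, fn=30, tp=60)
print(f'precision={precision:.3f}, recall={recall:.3f}')
```

Every false positive (`fp`) is a benign website flagged as malicious, so precision drops as the number of wrongly flagged benign sites grows.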