# HTTP Detection
HTTP Detection uses machine learning to detect anomalous HTTP requests.  
Dataset: [CSIC 2010 HTTP Dataset](https://petescully.co.uk/research/csic-2010-http-dataset-in-csv-format-for-weka-analysis/)  
In this notebook, we train three models: a random forest model, an n-gram CNN model, and an n-gram LSTM model.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from utils import evaluate_model
import os
from sklearn.ensemble import RandomForestClassifier

# Number of CPUs for ensemble learning methods
N_ENSEMBLE_CPUS = max(os.cpu_count()//2, 1)
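
The `evaluate_model` helper imported from `utils` is not shown in this notebook. Judging only from the output it prints further down (a header line, a classification report, and a confusion matrix), it might look roughly like the sketch below. The body, the returned predictions, and the tiny decision-tree demo are assumptions made for illustration, not the project's actual code.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier


def evaluate_model(model, name, features, labels):
    """Print a classification report and confusion matrix for a fitted classifier.

    Hypothetical reconstruction: the real helper in utils.py may differ.
    """
    predictions = model.predict(features)
    print(f"[ Evaluation result for {name} ]")
    print("Classification report:")
    print(classification_report(labels, predictions))
    print("Confusion matrix:")
    print(confusion_matrix(labels, predictions), "\n")
    # Returning the predictions is an addition for convenience (assumption).
    return predictions


# Tiny separable toy problem, used only to exercise the helper.
X_demo = np.array([[0], [0], [1], [1]])
y_demo = np.array([0, 0, 1, 1])
demo_model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
demo_predictions = evaluate_model(demo_model, "demo decision tree", X_demo, y_demo)
```

This sketch only covers scikit-learn style classifiers whose `predict` returns class labels directly.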

## Analyzing HTTP Request Attributes
We first combine the normal and anomalous data.  
Then we analyze the values that appear in the HTTP requests.  
Attributes that hold the same value in every request carry no information, so they are not used to train our models.  

In [2]:
normal_traffic_dataset = pd.read_csv("dataset/normalTrafficTraining.csv")
anomalous_traffic_dataset = pd.read_csv("dataset/anomalousTrafficTest.csv")
# Combine normal and anomalous traffic into a single dataset
traffic_dataset = pd.concat([normal_traffic_dataset, anomalous_traffic_dataset])

In [3]:
# Attributes that have the same value in every row.
same_value_columns = []
for name, item in traffic_dataset.items():
    if len(item.unique()) == 1:
        same_value_columns.append(name)
print("same value columns: ", same_value_columns)

same value columns:  ['protocol', 'userAgent', 'pragma', 'cacheControl', 'accept', 'acceptEncoding', 'acceptCharset', 'acceptLanguage', 'connection']


In [4]:
# Attributes that take exactly two distinct values (NaN counts as a value).
binary_columns = []
print("value in binary column")
for name, item in traffic_dataset.items():
    if len(item.unique()) == 2:
        print(item.unique())
        binary_columns.append(name)
print("binary columns: ", binary_columns)

# Attributes that take more than two distinct values.
multiple_columns = []
for name, item in traffic_dataset.items():
    if len(item.unique()) > 2:
        multiple_columns.append(name)
print("multiple columns: ", multiple_columns)

value in binary column
['GET' 'POST']
['localhost:8080' 'localhost:9090']
[nan 'application/x-www-form-urlencoded']
['norm' 'anom']
binary columns:  ['method', 'host', 'contentType', 'label']
multiple columns:  ['url', 'contentLength', 'cookie', 'payload']
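
The three loops above can also be expressed as a single pass over the per-column distinct counts. The sketch below is one possible way to do that, run on a small made-up frame rather than the CSIC data. The function name `group_columns_by_cardinality` and the demo values are hypothetical. Note that `nunique` ignores NaN by default, so `dropna=False` is passed to match `Series.unique()`, which is why `contentType` shows up as a binary column with `nan` as one of its two values.

```python
import numpy as np
import pandas as pd


def group_columns_by_cardinality(df):
    """Split column names into constant, binary, and multi-valued groups."""
    # dropna=False counts NaN as a distinct value, like Series.unique() does.
    counts = df.nunique(dropna=False)
    return {
        "constant": counts[counts == 1].index.tolist(),
        "binary": counts[counts == 2].index.tolist(),
        "multiple": counts[counts > 2].index.tolist(),
    }


# Made-up miniature of the traffic dataset, for illustration only.
form_type = "application/x-www-form-urlencoded"
demo = pd.DataFrame({
    "protocol": ["HTTP/1.1"] * 4,
    "method": ["GET", "POST", "GET", "POST"],
    "contentType": [np.nan, form_type, np.nan, form_type],
    "url": ["/a", "/b", "/c", "/d"],
})
groups = group_columns_by_cardinality(demo)
print(groups)
```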


## Random Forest Model
We train the model on three datasets:
- GET & POST
- GET only
- POST only

For each dataset, we train three models.  
The three models use 5, 40, and 100 decision trees, respectively.
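
Because all nine runs share the same encoding and split, the procedure can be sketched with one helper per step. The version below is a minimal illustration, assuming the column names and encodings used in this notebook. The helper names `encode_features` and `train_forests`, the fixed `random_state` on the forests, and the synthetic demo frame are hypothetical and not part of the original code.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

FEATURE_COLUMNS = ["method", "host", "contentType", "contentLength"]


def encode_features(df):
    """Map the HTTP attributes to numbers (hypothetical helper)."""
    encoded = pd.DataFrame({
        "method": df["method"].map({"GET": 0, "POST": 1}),
        "host": df["host"].map({"localhost:8080": 0, "localhost:9090": 1}),
        # Missing Content-Type becomes 0, any present value becomes 1.
        "contentType": df["contentType"].notna().astype(int),
        "contentLength": df["contentLength"].fillna(0),
    })
    labels = df["label"].map({"norm": 0, "anom": 1})
    return encoded[FEATURE_COLUMNS].values, labels.values


def train_forests(df, tree_counts=(5, 40, 100), seed=0):
    """Fit one random forest per tree count on a 60/40 train/test split."""
    X, y = encode_features(df)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=seed
    )
    models = {}
    for n_trees in tree_counts:
        forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        models[n_trees] = forest.fit(X_train, y_train)
    return models, X_test, y_test


# Synthetic stand-in for traffic_dataset, for illustration only.
form_type = "application/x-www-form-urlencoded"
demo = pd.DataFrame({
    "method": ["GET", "POST", "GET", "POST"] * 6,
    "host": ["localhost:8080", "localhost:9090"] * 12,
    "contentType": [None, form_type, None, form_type] * 6,
    "contentLength": [None, 68.0, None, 12.0] * 6,
    "label": ["norm", "anom", "anom", "norm"] * 6,
})
subsets = {
    "GET & POST": demo,
    "GET only": demo[demo["method"] == "GET"],
    "POST only": demo[demo["method"] == "POST"],
}
results = {name: train_forests(subset) for name, subset in subsets.items()}
```

Each entry of `results` holds the fitted forests keyed by tree count, plus the held-out features and labels that can be passed to `evaluate_model`.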

In [5]:
feat_get_post = traffic_dataset[['method', 'host', 'contentType', 'contentLength', 'label']]
feat_get_post = feat_get_post.replace({"method" : {"GET" : 0, "POST" : 1}})
feat_get_post = feat_get_post.replace({"host" : {"localhost:8080" : 0, "localhost:9090" : 1}})
feat_get_post = feat_get_post.replace({"contentType" : {"application/x-www-form-urlencoded" : 1}})
feat_get_post = feat_get_post.replace({"label" : {"norm" : 0, "anom" : 1}})
feat_get_post = feat_get_post.fillna(0)

feat_all = feat_get_post.drop(["label"], axis=1).values
y_all = feat_get_post["label"]
feat_train, feat_test, y_train, y_test = train_test_split(
    feat_all, y_all, test_size=0.4, random_state=0
)

rf_5_model = RandomForestClassifier(n_estimators=5, n_jobs=N_ENSEMBLE_CPUS)
rf_5_model.fit(feat_train, y_train)

rf_40_model = RandomForestClassifier(n_estimators=40, n_jobs=N_ENSEMBLE_CPUS)
rf_40_model.fit(feat_train, y_train)

# Set n_estimators explicitly: the default is 100 only in scikit-learn >= 0.22
rf_100_model = RandomForestClassifier(n_estimators=100, n_jobs=N_ENSEMBLE_CPUS)
rf_100_model.fit(feat_train, y_train)

evaluate_model(rf_5_model, "Random forest classifier using GET & POST (5 DTs)", feat_test, y_test)
evaluate_model(rf_40_model, "Random forest classifier using GET & POST (40 DTs)", feat_test, y_test)
evaluate_model(rf_100_model, "Random forest classifier using GET & POST (100 DTs)", feat_test, y_test)

[ Evaluation result for Random forest classifier using GET & POST (5 DTs) ]
Classification report:
              precision    recall  f1-score   support

           0       0.67      0.94      0.78     14373
           1       0.80      0.33      0.46      9895

    accuracy                           0.69     24268
   macro avg       0.74      0.64      0.62     24268
weighted avg       0.72      0.69      0.65     24268

Confusion matrix:
[[13579   794]
 [ 6658  3237]] 

[ Evaluation result for Random forest classifier using GET & POST (40 DTs) ]
Classification report:
              precision    recall  f1-score   support

           0       0.67      0.94      0.78     14373
           1       0.80      0.33      0.46      9895

    accuracy                           0.69     24268
   macro avg       0.74      0.64      0.62     24268
weighted avg       0.72      0.69      0.65     24268

Confusion matrix:
[[13573   800]
 [ 6657  3238]] 

[ Evaluation result for Random forest classif
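
To see how the figures in the report relate to the confusion matrix, the scores for the anomalous class (label 1) can be recomputed by hand. The snippet below is a small worked example using the 5-tree GET & POST matrix printed above. It assumes scikit-learn's layout of rows as true labels and columns as predicted labels.

```python
import numpy as np

# Confusion matrix of the 5-tree GET & POST model, copied from the output above.
cm = np.array([[13579, 794],
               [6658, 3237]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of flagged requests that are truly anomalous
recall = tp / (tp + fn)      # share of anomalous requests that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The recall of about 0.33 means that roughly two thirds of the anomalous requests in the test split (6658 of 9895) are classified as normal by this model.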

In [6]:
feat_get = traffic_dataset[['method', 'host', 'contentType', 'contentLength', 'label']]
feat_get = feat_get[feat_get["method"] == "GET"]
feat_get = feat_get.replace({"method" : {"GET" : 0, "POST" : 1}})
feat_get = feat_get.replace({"host" : {"localhost:8080" : 0, "localhost:9090" : 1}})
feat_get = feat_get.replace({"contentType" : {"application/x-www-form-urlencoded" : 1}})
feat_get = feat_get.replace({"label" : {"norm" : 0, "anom" : 1}})
feat_get = feat_get.fillna(0)

feat_all = feat_get.drop(["label"], axis=1).values
y_all = feat_get["label"]
feat_train, feat_test, y_train, y_test = train_test_split(
    feat_all, y_all, test_size=0.4, random_state=0
)

rf_5_model = RandomForestClassifier(n_estimators=5, n_jobs=N_ENSEMBLE_CPUS)
rf_5_model.fit(feat_train, y_train)

rf_40_model = RandomForestClassifier(n_estimators=40, n_jobs=N_ENSEMBLE_CPUS)
rf_40_model.fit(feat_train, y_train)

# Set n_estimators explicitly: the default is 100 only in scikit-learn >= 0.22
rf_100_model = RandomForestClassifier(n_estimators=100, n_jobs=N_ENSEMBLE_CPUS)
rf_100_model.fit(feat_train, y_train)

evaluate_model(rf_5_model, "Random forest classifier using GET (5 DTs)", feat_test, y_test)
evaluate_model(rf_40_model, "Random forest classifier using GET (40 DTs)", feat_test, y_test)
evaluate_model(rf_100_model, "Random forest classifier using GET (100 DTs)", feat_test, y_test)

[ Evaluation result for Random forest classifier using GET (5 DTs) ]
Classification report:
              precision    recall  f1-score   support

           0       0.66      1.00      0.79     11227
           1       1.00      0.02      0.03      6009

    accuracy                           0.66     17236
   macro avg       0.83      0.51      0.41     17236
weighted avg       0.78      0.66      0.53     17236

Confusion matrix:
[[11227     0]
 [ 5907   102]] 

[ Evaluation result for Random forest classifier using GET (40 DTs) ]
Classification report:
              precision    recall  f1-score   support

           0       0.66      1.00      0.79     11227
           1       1.00      0.02      0.03      6009

    accuracy                           0.66     17236
   macro avg       0.83      0.51      0.41     17236
weighted avg       0.78      0.66      0.53     17236

Confusion matrix:
[[11227     0]
 [ 5907   102]] 

[ Evaluation result for Random forest classifier using GET 

In [7]:
feat_post = traffic_dataset[['method', 'host', 'contentType', 'contentLength', 'label']]
feat_post = feat_post[feat_post["method"] == "POST"]
feat_post = feat_post.replace({"method" : {"GET" : 0, "POST" : 1}})
feat_post = feat_post.replace({"host" : {"localhost:8080" : 0, "localhost:9090" : 1}})
feat_post = feat_post.replace({"contentType" : {"application/x-www-form-urlencoded" : 1}})
feat_post = feat_post.replace({"label" : {"norm" : 0, "anom" : 1}})
feat_post = feat_post.fillna(0)

feat_all = feat_post.drop(["label"], axis=1).values
y_all = feat_post["label"]
feat_train, feat_test, y_train, y_test = train_test_split(
    feat_all, y_all, test_size=0.4, random_state=0
)
rf_5_model = RandomForestClassifier(n_estimators=5, n_jobs=N_ENSEMBLE_CPUS)
rf_5_model.fit(feat_train, y_train)

rf_40_model = RandomForestClassifier(n_estimators=40, n_jobs=N_ENSEMBLE_CPUS)
rf_40_model.fit(feat_train, y_train)

# Set n_estimators explicitly: the default is 100 only in scikit-learn >= 0.22
rf_100_model = RandomForestClassifier(n_estimators=100, n_jobs=N_ENSEMBLE_CPUS)
rf_100_model.fit(feat_train, y_train)

evaluate_model(rf_5_model, "Random forest classifier (5 DTs)", feat_test, y_test)
evaluate_model(rf_40_model, "Random forest classifier (40 DTs)", feat_test, y_test)
evaluate_model(rf_100_model, "Random forest classifier (100 DTs)", feat_test, y_test)

[ Evaluation result for Random forest classifier (5 DTs) ]
Classification report:
              precision    recall  f1-score   support

           0       0.76      0.77      0.77      3186
           1       0.81      0.80      0.80      3846

    accuracy                           0.79      7032
   macro avg       0.79      0.79      0.79      7032
weighted avg       0.79      0.79      0.79      7032

Confusion matrix:
[[2461  725]
 [ 772 3074]] 

[ Evaluation result for Random forest classifier (40 DTs) ]
Classification report:
              precision    recall  f1-score   support

           0       0.78      0.75      0.76      3186
           1       0.80      0.83      0.81      3846

    accuracy                           0.79      7032
   macro avg       0.79      0.79      0.79      7032
weighted avg       0.79      0.79      0.79      7032

Confusion matrix:
[[2386  800]
 [ 671 3175]] 

[ Evaluation result for Random forest classifier (100 DTs) ]
Classification report:
   