# HTTP request classifier using the HTTP dataset CSIC 2010

90 % of the combined normal and anomalous data is used for training and the remaining 10 % for testing. A linear SVM is used for classification because it achieved the best result among the several classifiers tried on this dataset.
<br>
<br>
The data files from the HTTP dataset CSIC 2010 are loaded and transformed into lists of HTTP requests that keep only the relevant parts: the request line (lines starting with GET, POST or PUT) and, for POST and PUT requests, the second line after the `Content-Length` header, which may carry relevant data. These lists are then vectorized with TfidfVectorizer, and the data set is randomly split into training and test data. Finally, an example shows how a new HTTP request is classified.
## Conclusion
I believe that this classifier could be used in production: with a linear SVM it achieved an accuracy of 0.9997 on the test split, with only two detection errors among 9,707 test requests (both were anomalous requests classified as normal). Moreover, a linear SVM is significantly faster to train than some more complex classifiers (e.g., Random Forest, SVMs with other kernel settings, Complement Naive Bayes) and also faster at prediction than, for instance, a KNN classifier.


## Online resources
http://www.isi.csic.es/dataset/
<br>
https://www.tutorialspoint.com/http/http_requests.htm
<br>
https://github.com/Monkey-D-Groot/Machine-Learning-on-CSIC-2010
<br>
https://stackoverflow.com/a/29788612
<br>
<br>
scikit-learn documentation, especially:
<br>
- https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
<br>
- https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py



In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [2]:
# Data files
normal_traffic_file_1 = 'normalTrafficTraining.txt'
normal_traffic_file_2 = 'normalTrafficTest.txt'
anomalous_traffic_file = 'anomalousTrafficTest.txt'

In [3]:
# Load a data file and split it into lines (newline characters are removed)
def load_data(file_name):
    # The context manager closes the file even if reading fails
    with open(file_name, 'r') as file:
        return file.read().split('\n')

In [4]:
# Load normal and anomalous data into separate lists
normal_traffic = load_data(normal_traffic_file_1) + load_data(normal_traffic_file_2)
anomalous_traffic = load_data(anomalous_traffic_file)

In [5]:
# Create a list of HTTP requests
def create_requests_list(traffic_list):
    traffic_requests = []
    for line_num, line in enumerate(traffic_list):
        if line.startswith('GET'):
            # Lowercase the request line and remove all spaces
            traffic_requests.append(line.lower().replace(' ', ''))
        elif line.startswith(('POST', 'PUT')):
            request_str = line
            # Look ahead for the 'Content-Length' header of this request
            header_num = line_num
            while (header_num < len(traffic_list)
                   and not traffic_list[header_num].startswith('Content-Length:')):
                header_num += 1
            # The second line below 'Content-Length' may be relevant;
            # skip it if the file ends before that line
            if header_num + 2 < len(traffic_list):
                request_str += traffic_list[header_num + 2]
            traffic_requests.append(request_str.lower().replace(' ', ''))
    return traffic_requests
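
To make the extraction rule concrete, the sketch below applies the same idea to a single made-up POST request laid out as request line, headers, blank line, body. The request text and the helper name `extract_post_request` are illustrative assumptions; they are not taken from the dataset or from the notebook.

```python
# Made-up request (not from the dataset): request line, headers, blank line, body.
sample_lines = [
    "POST http://localhost:8080/shop/add.jsp HTTP/1.1",
    "User-Agent: Mozilla/5.0",
    "Content-Type: application/x-www-form-urlencoded",
    "Content-Length: 18",
    "",
    "id=3&name=Red+Wine",
]


def extract_post_request(lines):
    """Join the request line with the line two below 'Content-Length'."""
    request_line = lines[0]
    for i, line in enumerate(lines):
        if line.startswith("Content-Length:"):
            # The line directly below the header is blank; the next one is kept.
            body = lines[i + 2] if i + 2 < len(lines) else ""
            return (request_line + body).lower().replace(" ", "")
    # No 'Content-Length' header found: keep the request line alone.
    return request_line.lower().replace(" ", "")


extracted = extract_post_request(sample_lines)
print(extracted)
```

The result is one lowercase string without spaces, which is the form the vectorizer later receives for every request.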

In [6]:
normal_traffic_requests = create_requests_list(normal_traffic)
anomalous_traffic_requests = create_requests_list(anomalous_traffic)

In [7]:
# Class normal: 1
# Class anomalous: 0
labels_normal = [1] * len(normal_traffic_requests)
labels_anomalous = [0] * len(anomalous_traffic_requests)

requests_all = normal_traffic_requests + anomalous_traffic_requests
labels_all = labels_normal + labels_anomalous

In [8]:
# Vectorization of text data with character n-grams
# ngram_range can be tuned; (5, 5) worked best in this case
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(5, 5))
X = vectorizer.fit_transform(requests_all)
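
As a small illustration of what `analyzer="char"` with `ngram_range=(5, 5)` produces, the sketch below lists the character 5-grams of one short string. The request string and the name `vectorizer_demo` are made up for this example; `build_analyzer` is the scikit-learn method that returns the tokenizing function without fitting anything.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same settings as the vectorizer above; the input string is a made-up example.
vectorizer_demo = TfidfVectorizer(analyzer="char", ngram_range=(5, 5))
analyzer = vectorizer_demo.build_analyzer()

# Each feature is a window of five consecutive characters.
ngrams = list(analyzer("get/a?id=1"))
for ngram in ngrams:
    print(repr(ngram))
```

Because the windows overlap, a short injected fragment in a parameter value changes several features at once, which is presumably part of why this representation separates the two classes well here.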

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, labels_all, test_size=0.1, random_state=42)

In [10]:
# Linear SVM
clf = LinearSVC()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
score_test = accuracy_score(y_test, y_pred)
print("Score Linear SVM: ", score_test)

Score Linear SVM:  0.9997939631193984
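
Note that the score above comes from a vectorizer fitted on all requests before the split, so the TF-IDF vocabulary and weights have also seen the test requests. One possible variation, sketched below on made-up toy requests, is to split the raw strings first and wrap the vectorizer and classifier in a scikit-learn pipeline so that both are fitted on the training part only. All `toy_*` names and the request strings are assumptions for illustration; this is not the procedure that produced the score above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up stand-ins for requests_all / labels_all (1 = normal, 0 = anomalous).
toy_normal = [
    f"gethttp://localhost:8080/shop/item.jsp?id={i}http/1.1" for i in range(20)
]
toy_anomalous = [
    f"gethttp://localhost:8080/shop/item.jsp?id={i}'or'1'='1http/1.1"
    for i in range(20)
]
toy_requests = toy_normal + toy_anomalous
toy_labels = [1] * len(toy_normal) + [0] * len(toy_anomalous)

# Split the raw strings first, keeping the class ratio in both parts.
raw_train, raw_test, toy_y_train, toy_y_test = train_test_split(
    toy_requests, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits the vectorizer on the training strings only.
toy_model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(5, 5)),
    LinearSVC(),
)
toy_model.fit(raw_train, toy_y_train)
toy_score = toy_model.score(raw_test, toy_y_test)
print("Toy accuracy:", toy_score)
```

A fitted pipeline also accepts raw request strings directly in `predict`, which removes the separate `vectorizer.transform` step when classifying new traffic.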


In [11]:
# Use a separate name so the imported confusion_matrix function is not overwritten
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ")
print(cm)

Confusion Matrix: 
[[2560    2]
 [   0 7145]]
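
The matrix above can be read as follows: rows are the true classes and columns the predicted classes, in label order 0 (anomalous), 1 (normal). The short sketch below recomputes accuracy from the four printed counts and adds recall and precision for the anomalous class; the variable names are chosen for this example only.

```python
# Counts copied from the confusion matrix printed above.
# Rows: true class, columns: predicted class; order 0 = anomalous, 1 = normal.
cm_reported = [[2560, 2],
               [0, 7145]]

caught, missed = cm_reported[0]        # anomalous: flagged / passed as normal
false_alarms, passed = cm_reported[1]  # normal: flagged / passed as normal

total = caught + missed + false_alarms + passed
accuracy = (caught + passed) / total
anomalous_recall = caught / (caught + missed)
anomalous_precision = caught / (caught + false_alarms)

print(f"test requests:       {total}")
print(f"accuracy:            {accuracy:.4f}")
print(f"anomalous recall:    {anomalous_recall:.4f}")
print(f"anomalous precision: {anomalous_precision:.4f}")
```

Both errors are anomalous requests that were classified as normal; no normal request was flagged as anomalous.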


In [12]:
# Example: classify new samples from the file sample_traffic.txt
sample_traffic_file = 'sample_traffic.txt'
sample_traffic = load_data(sample_traffic_file)
traffic_request = create_requests_list(sample_traffic)
X_sample = vectorizer.transform(traffic_request)
sample_pred = clf.predict(X_sample)
for result in sample_pred:
    if result == 0:
        print('anomalous')
    else:
        print('normal')

anomalous
