# XSSClassifier: Detection of XSS Attacks Using a Machine Learning Approach

### Research Paper:

According to this paper, machine learning classifiers can be used to categorize webpages as either XSS or non-XSS. The process comprises four steps: identifying features, gathering webpages, extracting features to construct a training dataset, and applying machine learning classification.

https://pdfs.semanticscholar.org/2c74/d8e94b73c35e8651189262d0c7f32e6cfc7c.pdf?_ga=2.60428845.1871262773.1581209097-1873828646.1581209097
 
 
### Goal:

Create a training algorithm for detecting an XSS attack vector in a query parameter string.
Define a set of features that can be used to train the model and label a sample as XSS (1) or not XSS (0).

## 1. Gather Test Data 
Collect HTTP request samples generated from different webpages, drawn from the sources mentioned in the paper (XSSed, Alexa, and Elgg), and build a label array in which non-XSS samples are labeled 0 and XSS samples are labeled 1.
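
As a small illustration of the filtering step, the sketch below extracts the query component with `urllib.parse.urlsplit`, keeps only URLs whose query exceeds a minimum length, and removes duplicates while preserving order. The helper name `collect_samples` and the sample URLs are hypothetical, and unlike the notebook cell it strips the trailing newline from each line.

```python
from urllib.parse import urlsplit


def collect_samples(lines, min_query_len):
    """Keep URLs whose query string is longer than min_query_len, de-duplicated in order."""
    kept = []
    for line in lines:
        url = line.strip()  # drop the trailing newline left by readlines()
        query = urlsplit(url).query  # same value as urlsplit(url)[3]
        if len(query) > min_query_len:
            kept.append(url)
    # dict.fromkeys preserves first-seen order while removing duplicates
    return list(dict.fromkeys(kept))


# Hypothetical input lines, for illustration only
raw = [
    "http://example.com/find?text=%3Cscript%3Ealert(1)%3C/script%3E\n",
    "http://example.com/find?text=%3Cscript%3Ealert(1)%3C/script%3E\n",
    "http://example.com/?a=1\n",
]
samples = collect_samples(raw, min_query_len=8)
labels = [1] * len(samples)  # 1 = XSS, 0 = not XSS
print(len(samples), labels)
```

Here the duplicate line collapses to one sample and the short query `a=1` is dropped, so a single labeled sample remains.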

In [84]:
import sys
!{sys.executable} -m pip install -U numpy
!{sys.executable} -m pip install -U gensim
!{sys.executable} -m pip install -U python-Levenshtein
!{sys.executable} -m pip install -U nltk
!{sys.executable} -m pip install -U scikit-learn
import warnings
import nltk
import ssl

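# Workaround for SSL certificate errors during nltk.download():
# fall back to an unverified HTTPS context (this disables certificate checks).
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context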

nltk.download('punkt')

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from urllib.parse import unquote

import numpy as np
import pandas as pd
import csv
import urllib.parse as parse
import pickle


xss_count = 0
non_xss_count = 0

test_XSS_strings = []
test_normal_string = []

temp_x = []
X = []
y = []

print("Gathering Data...")
# gather the XSS string and append the label of 1 to y array
with open('lib/testXSS.txt', 'r') as f:
    test_XSS_strings = f.readlines()
print("*", sep=' ', end='', flush=True)
# parse out the query part of the URL 
for line in test_XSS_strings:
    query = parse.urlsplit(line)[3]
    #try to remove open redirect vulns
    if "?http" in str(line):
        continue
    if "?url=http" in str(line):
        continue
    if "?fwd=http" in str(line):
        continue
    if "?path=http" in str(line):
        continue
    if "=http" in str(query):
        continue
    if "page=search" in str(query):
        continue
    if len(query) > 8:
        # counted before de-duplication, so this can exceed the number of XSS samples added to X
        xss_count += 1
        temp_x.append(line)
        
# remove duplicates
dedup = list(dict.fromkeys(temp_x))
print("*", sep=' ', end='', flush=True)
# Add a feature to X and label to the y array
for line in dedup:
    X.append(line)
    y.append(1)
    
temp_x = []
dedup = []
print("*", sep=' ', end='', flush=True)

# gather the list of normal string and append the label of 0 to y array 
with open('lib/testNORM.txt', 'r') as f:
    test_normal_string = f.readlines()
    
# parse out the query part of the URL 
for line in test_normal_string:
    query = parse.urlsplit(line)[3]
    if len(query) > 3:
        non_xss_count += 1
        temp_x.append(line)
        
# remove duplicates
dedup = list(dict.fromkeys(temp_x))
print("*", sep=' ', end='', flush=True)
# Add a feature to X and a label to the y array
for line in dedup:
    X.append(line)
    y.append(0)






***

In [85]:
print("Number of XSS Samples: "+str(xss_count))
print("Number of NOT XSS Samples: "+str(non_xss_count))
print("Total Samples: "+str(xss_count+non_xss_count))

Number of XSS Samples: 38608
Number of NOT XSS Samples: 44826
Total Samples: 83434
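
To put these counts in context, the class balance and the accuracy of a trivial majority-class predictor can be worked out directly. This is a sketch using the printed counts; because the counters are incremented before de-duplication, the arrays `X` and `y` may hold fewer samples than shown here.

```python
xss_count = 38608
non_xss_count = 44826
total = xss_count + non_xss_count

xss_share = xss_count / total
# Accuracy of a trivial model that always predicts the majority class (not XSS)
majority_baseline = max(xss_count, non_xss_count) / total

print(f"XSS share: {xss_share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any classifier trained on this data should be judged against that baseline rather than against 50%.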


##### The sample set collected was larger than the one used in the research paper
```
The paper has a training dataset containing 1,000 webpages (400 malicious and 600 benign) and their extracted features.

```
We expanded the dataset used to train the model with benign and malicious samples from the following resources:

Benign:

https://raw.githubusercontent.com/Xyntax/ML/master/DL_for_xss/data/normal_examples.csv
https://raw.githubusercontent.com/Xyntax/ML/master/DL_for_xss/data/white.csv

Malicious:

https://gist.github.com/ThomasOrlita/e2e4a6d72877c8c897082eefe969578a
https://raw.githubusercontent.com/Xyntax/ML/master/DL_for_xss/data/xssed.csv
https://raw.githubusercontent.com/Xyntax/ML/master/DL_for_xss/data/black.csv
https://raw.githubusercontent.com/foospidy/payloads/master/other/xss/reddit_xss_get.txt
https://github.com/danielmiessler/SecLists/tree/master/Fuzzing/XSS


## 2. Create an array of the features for each test sample

An attacker can use certain HTML tags to inject XSS script code from outside. These tags include ```<link>, <object>, <form>, <script>, <embed>, <ilayer>, <layer>, <style>, <applet>, <meta>, <img>, <iframe>```.

An attacker can also misuse certain methods in the embedded XSS payload, such as ```exec(), fromCharCode(), eval(), alert(), getElementsByTagName(), write(), unescape(), and escape()```.
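
A minimal sketch of how such indicators can be counted in a single URL, assuming the URL is percent-decoded and lower-cased first. The helper `count_indicators` is hypothetical, and the two lists are illustrative subsets rather than the full lists used in the notebook.

```python
from urllib.parse import unquote

# Illustrative subsets of the tag and method lists described above
TAGS = ["<link", "<object", "<form", "<script", "<embed", "<style", "<meta", "<img", "<iframe"]
METHODS = ["exec", "fromcharcode", "eval", "alert", "getelementsbytagname", "write", "unescape", "escape"]


def count_indicators(url):
    """Return (tag_count, method_count) for a percent-encoded URL."""
    decoded = unquote(url).replace(" ", "").lower()
    tag_count = sum(decoded.count(tag) for tag in TAGS)
    method_count = sum(decoded.count(name) for name in METHODS)
    return tag_count, method_count


url = "http://search.rin.ru/cgi-bin/find.cgi?text=%3Cscript%3Ealert(%27HZ+iz+1337%27)%3B%3C%2Fscript%3E"
print(count_indicators(url))
```

Because this is plain substring counting, overlapping names are counted more than once: a payload containing `unescape` also matches `escape`.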



In [86]:
# Create a function to convert an array of query strings to a set of features
def getVec(text):
    tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(text)]
    max_epochs = 25
    vec_size = 20
    alpha = 0.025

    model = Doc2Vec(vector_size=vec_size,
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm=1)
    model.build_vocab(tagged_data)
    print("Building the sample vector model...")
    features = []
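    # Each train() call runs model.epochs passes over the corpus, so this loop
    # performs max_epochs * model.epochs passes in total.
    for epoch in range(max_epochs):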
        print("*", sep=' ', end='', flush=True)
        model.random.seed(42)
        model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs)
        # decrease the learning rate
        model.alpha -= 0.0002
        # fix the learning rate, no decay
        model.min_alpha = model.alpha
    model.save("lib/d2v.model")
    print()
    print("Model Saved")
    for i, line in enumerate(text):
        featureVec = [model.dv[i]]
        lineDecode = unquote(line)
        lineDecode = lineDecode.replace(" ", "")
        lowerStr = str(lineDecode).lower()
  
        # Features related to malicious HTML tags
        malicious_tags = ['<link', '<object', '<form', '<embed', '<ilayer', '<layer', '<style', '<applet', '<meta', '<img',
                           '<iframe', '<input', '<body', '<video', '<button', '<math', '<picture', '<map', '<svg', '<div',
                           '<a', '<details', '<frameset', '<table', '<comment', '<base', '<image']
        feature1 = 0
        for tag in malicious_tags:
            feature1+= int(lowerStr.count(tag))

        # Features related to malicious methods/events
        malicious_methods = ['exec', 'fromcharcode', 'eval', 'alert', 'getelementsbytagname', 'write', 'unescape',
                             'escape', 'prompt', 'onload', 'onclick', 'onerror', 'onpage', 'confirm', 'marquee']
        feature2 = 0
        for method in malicious_methods:
            feature2+= int(lowerStr.count(method))
            
        # Features related to file extensions and keywords
        feature3 = int(lowerStr.count('.js'))
        feature4 = int(lowerStr.count('javascript'))

        # Other features
        feature5 = int(len(lowerStr))
        feature6 = int(lowerStr.count('<script')) + int(lowerStr.count('&lt;script')) + int(lowerStr.count('%3cscript')) + int(lowerStr.count('%3c%73%63%72%69%70%74'))
       
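        feature7 = 0
        # Iterating over the string visits one character at a time, so the trailing
        # '%3C' is counted as '%', '3', and 'C' rather than as a single token.
        for char in '&<>"\'/%*;+=%3C':
            feature7 += int(lowerStr.count(char))
        feature8 = int(lowerStr.count('http'))  # computed but not appended below

        # append the features (feature2 is computed above but currently excluded)
        featureVec = np.append(featureVec,feature1)
        #featureVec = np.append(featureVec,feature2)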
        featureVec = np.append(featureVec,feature3)
        featureVec = np.append(featureVec,feature4)
        featureVec = np.append(featureVec,feature5)
        featureVec = np.append(featureVec,feature6)
        featureVec = np.append(featureVec,feature7)
        
#         feature_vec = [model.dv[text.index(line)], feature1, feature2, feature3, feature4, feature5, feature6, feature7, feature8]
        
        features.append(featureVec)

    return features
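
The loop that computes `feature7` iterates over a string, so it counts single characters, and the digits in a payload such as `1337` contribute to the total. If the intent is to count `%3c` as one token, a list of tokens can be used instead. The following is a sketch of that alternative, not the code that produced the results in this notebook; `count_special_tokens` and `count_special_chars` are hypothetical helpers.

```python
SPECIAL_TOKENS = ["&", "<", ">", '"', "'", "/", "%", "*", ";", "+", "=", "%3c"]


def count_special_tokens(text):
    """Count occurrences of each special token in lower-cased text."""
    lowered = text.lower()
    return sum(lowered.count(token) for token in SPECIAL_TOKENS)


def count_special_chars(text):
    """Character-wise count, as produced by iterating over the string directly."""
    lowered = text.lower()
    return sum(lowered.count(char) for char in '&<>"\'/%*;+=%3C')


decoded = "http://search.rin.ru/cgi-bin/find.cgi?text=<script>alert('HZ+iz+1337');</script>"
print(count_special_tokens(decoded), count_special_chars(decoded))
```

For this decoded sample the token-based count is 15 and the character-wise count is 17; the difference comes from the two `3` digits in `1337`.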

In [87]:
features = getVec(X)
features_dict = {'data':X,'features':features,'label':y}

Building the sample vector model...
*************************
Model Saved


### Features Data

In [93]:
print("Test Sample: "+ X[0])
print("Features: " + str(features[0]))
print("\nLabel:\033[1;31;1m XSS(1)/\033[1;32;1m NOT XSS(0)\033[0;0m: " + str(y[0]))

Test Sample: http://search.rin.ru/cgi-bin/find.cgi?text=%3Cscript%3Ealert(%27HZ+iz+1337%27)%3B%3C%2Fscript%3E

Features: [ 4.40686792e-01 -2.94887638e+00 -1.04484606e+00  1.64985323e+00
 -6.61160231e-01 -1.50925446e+00 -1.78789949e+00 -1.46867549e+00
  1.67239487e+00 -3.88574392e-01  1.80456448e+00  3.24892545e+00
  6.38268948e-01 -4.36293446e-02 -1.00890172e+00  2.35312438e+00
 -1.18228400e+00  3.47410703e+00  3.25613528e-01 -7.81922162e-01
  0.00000000e+00  0.00000000e+00  0.00000000e+00  8.10000000e+01
  1.00000000e+00  1.70000000e+01]

Label: XSS(1)/ NOT XSS(0): 1


## 3. Train the Model

Split the data into training (70%) and test (30%) sets.

In [94]:
np.random.seed(42)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, y, test_size = .3, random_state=42)


In [95]:
# Fix random_state for reproducibility.
from sklearn import tree
my_classifier1 = tree.DecisionTreeClassifier(random_state=42)
print(my_classifier1)
print()

DecisionTreeClassifier(random_state=42)



#### Additional tuning of hyperparameters can be done.

Only `random_state` is set on the decision tree above; every other hyperparameter keeps its scikit-learn default.

In [97]:
print("Training Classifier #1 DecisionTreeClassifier")
my_classifier1.fit(X_train, y_train)
predictions1 = my_classifier1.predict(X_test)

Training Classifier #1 DecisionTreeClassifier


## 4. Evaluation Metrics


### Prediction Results

Evaluate the classifier on the held-out test set and obtain its accuracy.

### Test Accuracy Score

In [98]:
from sklearn.metrics import accuracy_score
print('Accuracy Score #1: {:.1%}'.format(accuracy_score(y_test, predictions1)))

Accuracy Score #1: 98.4%
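
Accuracy alone does not show whether the errors are missed attacks or false alarms. As a sketch, assuming binary labels as used here (1 = XSS, 0 = not XSS), precision and recall for the XSS class can be derived from the confusion counts. The helper `binary_metrics` and the toy labels are hypothetical; scikit-learn's `precision_score`, `recall_score`, and `confusion_matrix` provide the same measures.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels for illustration only; in the notebook these would be y_test and predictions1
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

For an XSS detector, recall on the XSS class is usually the number to watch, since a false negative is an attack that gets through.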


##### How did the test compare with the research paper?

At 98.4% test accuracy, the decision tree falls within the range reported in the paper, which is above 0.947 for every classifier and 0.995 for the best ones.
```
Every classifier had accuracy of more than 0.947.

These results validate the effectiveness of our proposed approach. The overall result of all classifiers shows that 
RandomForest and MLP are the best classifiers in terms of accuracy 0.995

```