# Creating Real Time Defenses

## Using a supervised learning example
Supervised learning techniques rely on labeled data, which tells the model which examples represent a hacker attack, malware, or some other kind of unexpected activity, and which examples don't.
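
As a minimal sketch of what "labeled" means here, the following toy example pairs hypothetical per-caller call counts with a label of 0 (benign) or 1 (hacker) and classifies a new caller by the nearest class centroid. The numbers, the function names, and the nearest-centroid rule are illustrative assumptions; the example later in this section uses a random forest instead.

```python
# Hypothetical training data: [calls to common APIs, calls to rare APIs]
# per caller, with a label of 0 (benign) or 1 (hacker).
samples = [[48, 3], [52, 2], [45, 4], [1, 50], [0, 46], [2, 54]]
labels = [0, 0, 0, 1, 1, 1]

def centroid(rows):
    # Average each feature across the rows.
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit(samples, labels):
    # Compute one centroid per label from the labeled examples.
    model = {}
    for label in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == label]
        model[label] = centroid(rows)
    return model

def predict(model, sample):
    # Choose the label whose centroid is closest (squared distance).
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(sample, c))
    return min(model, key=lambda label: dist(model[label]))

model = fit(samples, labels)
print(predict(model, [47, 5]))   # A caller that mostly uses common APIs
print(predict(model, [3, 40]))   # A caller that mostly uses rare APIs
```

Without the labels, the model would have no way to know which group of callers counts as the attackers.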

### Building a data generator

In [1]:
from datetime import time, date, datetime, timedelta
import csv
import random
from collections import Counter
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

#### Creating the CreateAPITraffic() function

In [2]:
def CreateAPITraffic(
    values = 5000,
    benignIP = ['172:144:0:22', '172:144:0:23', 
                '172:144:0:24', '172:144:0:25',
                '172:144:0:26', '172:144:0:27'],
    hackerIP = ['175:144:22:2', '175:144:22:3',
                '175:144:22:4', '175:144:22:5',
                '175:144:22:6', '175:144:22:7'],
    apiEntries = ['Rarely', 'Sometimes', 'Regularly'],
    bias = .8, 
    outlier = 50):
    
    # Define the variables needed to perform tasks within
    # the function. You use data to hold the actual log entries
    # for return to the caller. The currTime and updateTime 
    # variables help create the log’s time entries. The selectedIP
    # variable holds one of the IP addresses provided as part of
    # benignIP or hackerIP arguments and is the IP address added to
    # the current log entry. The threshold determines the split
    # between benign and hacker log entries. The hackerCount and 
    # benignCount variables specify how many of each entry type
    # appears in the log.
    data = []
    currTime = time(0, 0, 0)
    updateTime = timedelta(seconds = 1)
    selectedIP = ""
    threshold = (len(apiEntries) * 2) - \
        (len(apiEntries) * 2 * bias)
    hackerCount = 0
    benignCount = 0

    # A loop for generating entries comes next. This code begins
    # by defining the time element of an individual log entry.
    for x in range(values):
        currTime = (datetime.combine(date.today(), 
                                     currTime)
                    + updateTime).time()
        
        # Selecting an API entry comes next.
        apiChoice = random.choice(apiEntries)
        
        # Determine which IP address to use for the data entry.
        # The CreateAPITraffic() function uses a combination of
        # approaches to make the determination based on the assumption
        # that the hacker will select less commonly used API calls to 
        # attack because these calls are more likely to contain bugs,
        # which is where threshold comes into play. However, it’s also
        # important to include a certain amount of noise in the form of
        # outliers as part of the dataset. This example uses hackerCount
        # as a means of determining when to create an outlier.
        choiceIndex = apiEntries.index(apiChoice) + 1
        randSelect = choiceIndex * \
            random.randint(1, len(apiEntries)) * bias
        if hackerCount % outlier == 0:
            selectedIP = random.choice(hackerIP)
        else:
            if randSelect >= threshold:
                selectedIP = random.choice(benignIP)
            else:
                selectedIP = random.choice(hackerIP)
        
        # Each entry is appended to data in turn. In addition, the code
        # also tracks whether the entry is a hacker or a benign entry.
        data.append([currTime.strftime("%H:%M:%S"), 
                     selectedIP, apiChoice])
        if selectedIP in hackerIP:
            hackerCount += 1
        else:
            benignCount += 1
    
    return (threshold, benignCount, hackerCount, data)
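
To see how `bias` and `threshold` shape the data, the following sketch reproduces the arithmetic from `CreateAPITraffic()` and enumerates every value that `random.randint()` can return for a given API call position. The helper name `hacker_probability` is hypothetical, and the sketch ignores the `outlier` rule, so it only approximates the proportions that the function produces.

```python
def hacker_probability(choice_index, n_calls, bias):
    # Same threshold formula that CreateAPITraffic() uses.
    threshold = (n_calls * 2) - (n_calls * 2 * bias)
    # Count the randint() outcomes that fall below the threshold,
    # which is the condition for choosing a hacker IP address.
    hits = sum(1 for r in range(1, n_calls + 1)
               if choice_index * r * bias < threshold)
    return hits / n_calls

# Positions are 1-based, matching apiEntries.index(apiChoice) + 1.
for position in (1, 2, 4, 7):
    print(position, round(hacker_probability(position, 14, 0.8), 3))
```

With 14 API calls and a bias of 0.8, the first position in the list is the most likely to receive a hacker address, and positions from seven onward never do unless the outlier rule applies.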

#### Creating the SaveDataToCSV() function

In [3]:
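def SaveDataToCSV(data = None, fields = None, 
                  filename = "test.csv"):
    # Use None instead of mutable default arguments, and treat
    # None as "nothing to write" for the header or the rows.
    with open(filename, 'w', newline='') as file:
        write = csv.writer(file, delimiter=',')
        write.writerow(fields or [])
        write.writerows(data or [])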
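
A quick way to confirm the file layout is to write a few rows and read them back. The sketch below repeats the same `csv.writer` calls and then uses `csv.DictReader`; the file name `sample_log.csv` and the two sample rows are made up for illustration.

```python
import csv
import os
import tempfile

fields = ['Time', 'IP_Address', 'API_Call']
rows = [['00:00:01', '172:144:0:22', 'Regularly'],
        ['00:00:02', '175:144:22:2', 'Rarely']]

# Write to a temporary directory so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'sample_log.csv')
with open(path, 'w', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    writer.writerow(fields)
    writer.writerows(rows)

# Read the rows back as dictionaries keyed by the header row.
with open(path, newline='') as file:
    loaded = list(csv.DictReader(file))

print(loaded[0]['IP_Address'], loaded[1]['API_Call'])
```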

#### Defining the particulars of the training dataset

In [4]:
callNames = ['Rarely', 
             'Sometimes1', 'Sometimes2',
             'Regularly1', 'Regularly2', 'Regularly3',
             'Often1', 'Often2', 'Often3', 'Often4', 
             'Often5', 'Often6', 'Often7', 'Often8']
benignIPs = ['172:144:0:22', '172:144:0:23', 
             '172:144:0:24', '172:144:0:25', 
             '172:144:0:26', '172:144:0:27',
             '172:144:0:28', '172:144:0:29', 
             '172:144:0:30', '172:144:0:31',
             '172:144:0:32', '172:144:0:33',
             '172:144:0:34', '172:144:0:35',
             '172:144:0:36', '172:144:0:37']

#### Generating the CallData.csv file

In [5]:
random.seed(52)
threshold, benignCount, hackerCount, data = \
    CreateAPITraffic(values=10000, 
                     benignIP=benignIPs, 
                     apiEntries=callNames)
print(f"There are {benignCount} benign entries " \
      f"and {hackerCount} hacker entries " \
      f"with a threshold of {threshold}.")
fields = ['Time', 'IP_Address', 'API_Call']
SaveDataToCSV(data, fields, "CallData.csv")

There are 9320 benign entries and 680 hacker entries with a threshold of 5.599999999999998.


### Converting the log into a frequency data table
This section creates an aggregation of the log entries so that the model can use the calling pattern as a means for detecting whether a caller is a hacker or a regular user.
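
The idea can be sketched with the standard library before turning to pandas: group the log rows by IP address and count how often each API call appears. The rows below are invented for illustration, and `Counter` stands in for the pandas aggregation that the following function performs.

```python
from collections import Counter, defaultdict

# Hypothetical log rows in (time, IP address, API call) form.
log = [('00:00:01', '172:144:0:22', 'Often1'),
       ('00:00:02', '175:144:22:2', 'Rarely'),
       ('00:00:03', '172:144:0:22', 'Often1'),
       ('00:00:04', '172:144:0:22', 'Regularly1'),
       ('00:00:05', '175:144:22:2', 'Rarely')]

# Build one Counter of API calls per IP address.
frequency = defaultdict(Counter)
for _, ip, call in log:
    frequency[ip][call] += 1

for ip, counts in sorted(frequency.items()):
    print(ip, dict(counts))
```

Each per-address set of counts is the calling pattern that the model later receives as one sample.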

#### Creating the ReadDataFromCSV() function

In [6]:
def ReadDataFromCSV(filename="test.csv"):
    logData = pd.read_csv(filename)
    
    # Obtain a listing of the unique API calls found in the file.
    calls = np.unique(np.array(logData['API_Call']))
    
    # Aggregate the data using the IP_Address as the means
    # for determining how to group the entries and API_Call
    # as the means to determine which column to use for aggregation.
    aggData = logData.groupby(
        'IP_Address')['API_Call'].agg(list)
    
    # Create a dictionary to hold one labeled column per IP address.
    # Note that the label row is named 'Benign', yet a value of 0 marks
    # a benign address (one that begins with 172) and 1 marks a hacker.
    analysisEntries = {}
    for ipIndex, ipEntry in zip(aggData.index, aggData):
        ipEntry.sort()
        if ipIndex[0:3] == '172':
            values = [0]
        else:
            values = [1]
        
        # Create columns for the DataFrame based on the API calls.
        keys = ['Benign']
        for callType in calls:
            keys.append(callType)
            values.append(ipEntry.count(callType))
        
        # Define each row of the DataFrame using the number of calls
        # from the IP address in question as the values for each column.
        analysisEntries[ipIndex] = pd.Series(values,
                                             index=keys)
    
    # Create the DataFrame and return it to the caller.
    analysisData = pd.DataFrame(analysisEntries)
    return (analysisData, calls)
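
pandas also offers `pd.crosstab()`, which builds the same kind of call-by-address count table in one step. The following is a sketch of that alternative, not the approach this example uses. The inline DataFrame is made up, and the label row follows the same convention as `ReadDataFromCSV()` (0 for addresses that start with `172`, 1 otherwise).

```python
import pandas as pd

# Hypothetical log entries with the same columns as CallData.csv.
logData = pd.DataFrame({
    'IP_Address': ['172:144:0:22', '175:144:22:2', '172:144:0:22',
                   '175:144:22:2', '172:144:0:22'],
    'API_Call': ['Often1', 'Rarely', 'Often1', 'Rarely', 'Regularly1']})

# Rows are API calls, columns are IP addresses, values are counts.
table = pd.crosstab(logData['API_Call'], logData['IP_Address'])

# Add the label row on top, as ReadDataFromCSV() does.
labels = pd.DataFrame(
    [[0 if ip.startswith('172') else 1 for ip in table.columns]],
    index=['Benign'], columns=table.columns)
result = pd.concat([labels, table])
print(result)
```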

#### Reading the data from disk

In [7]:
analysisData, calls = ReadDataFromCSV("CallData.csv")
print(analysisData)

            172:144:0:22  172:144:0:23  172:144:0:24  172:144:0:25  \
Benign                 0             0             0             0   
Often1                48            49            38            50   
Often2                23            31            60            48   
Often3                38            41            47            50   
Often4                43            43            38            48   
Often5                43            40            55            43   
Often6                47            41            54            31   
Often7                55            55            44            49   
Often8                57            48            55            57   
Rarely                33            22            28            24   
Regularly1            40            51            33            40   
Regularly2            46            47            35            43   
Regularly3            51            38            51            39   
Sometimes1          

In [8]:
X = np.array(analysisData[1:len(calls)+1]).T
print(X)
y = analysisData[0:1]
print(y)
y = y.values.ravel()
print(y)

[[48 23 38 43 43 47 55 57 33 40 46 51 29 42]
 [49 31 41 43 40 41 55 48 22 51 47 38 32 54]
 [38 60 47 38 55 54 44 55 28 33 35 51 30 39]
 [50 48 50 48 43 31 49 57 24 40 43 39 42 40]
 [59 39 48 40 38 40 34 43 24 46 42 45 44 33]
 [52 45 55 41 38 54 50 39 30 39 35 33 35 48]
 [45 45 47 42 47 48 49 39 31 41 38 44 41 49]
 [41 41 38 45 52 60 29 44 28 45 44 43 31 29]
 [40 36 47 41 40 48 41 52 31 28 32 55 29 37]
 [40 57 58 39 39 42 48 42 29 44 45 47 28 38]
 [36 37 49 37 56 34 52 45 25 55 50 39 44 31]
 [55 39 43 50 37 47 39 43 26 38 39 38 32 42]
 [52 36 43 38 46 57 35 37 29 27 38 27 38 40]
 [50 43 47 43 42 47 41 54 22 35 39 44 33 43]
 [40 48 47 37 49 43 37 47 25 37 40 38 30 36]
 [47 44 52 40 56 51 41 45 24 37 46 39 26 38]
 [ 0  0  0  0  0  0  0  1 54  4  6 10 23 11]
 [ 0  1  0  0  0  0  0  0 46 12  7  9 23 15]
 [ 0  1  0  0  0  0  0  0 41  9  6  5 22 18]
 [ 0  0  0  0  0  0  0  0 52 12 10  4 34 15]
 [ 0  0  0  0  0  0  0  0 54  9  4  9 25 14]
 [ 0  1  1  1  0  0  0  0 43  5  7 12 28 16]]
        1

### Creating the detection model
All of the data generation and preparation took a long time in this example, but it’s an even longer process in the real world. This example hasn’t considered issues like cleaning the data, dealing with missing data, or verifying that data is in the correct range and of the correct type. However, it's finally time to see the results of the data preparation.
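
As a rough illustration of the checks this paragraph mentions, the sketch below validates individual log rows before they reach the model. The function name `is_valid_entry`, the set of known calls, and the specific rules are assumptions chosen for this example; a real log needs checks that match its own format.

```python
from datetime import datetime

KNOWN_CALLS = {'Rarely', 'Sometimes1', 'Sometimes2'}  # Assumed subset

def is_valid_entry(row, known_calls=KNOWN_CALLS):
    # Reject rows with missing fields.
    if len(row) != 3 or any(field in (None, '') for field in row):
        return False
    time_text, ip, call = row
    # Verify that the time has the expected HH:MM:SS form.
    try:
        datetime.strptime(time_text, '%H:%M:%S')
    except (TypeError, ValueError):
        return False
    # Verify the address has four numeric parts in the 0 to 255 range,
    # using the colon separator found in this example's data.
    parts = str(ip).split(':')
    if len(parts) != 4 or not all(p.isdigit() and int(p) <= 255
                                  for p in parts):
        return False
    # Verify that the API call is one the model knows about.
    return call in known_calls

print(is_valid_entry(['00:00:01', '172:144:0:22', 'Rarely']))
print(is_valid_entry(['25:00:00', '172:144:0:22', 'Rarely']))
print(is_valid_entry(['00:00:01', '172:144:0', 'Rarely']))
```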

#### Performing the classification

In [9]:
clf=RandomForestClassifier()
clf.fit(X,y)

RandomForestClassifier()

In [10]:
random.seed(19)
threshold, benignCount, hackerCount, data = \
    CreateAPITraffic(benignIP=benignIPs, 
                     apiEntries=callNames, 
                     bias=.95, outlier=15)
print(f"There are {benignCount} benign entries " \
      f"and {hackerCount} hacker entries " \
      f"with a threshold of {threshold}.")
fields = ['Time', 'IP_Address', 'API_Call']
SaveDataToCSV(data, fields, "TestData.csv")

There are 4975 benign entries and 25 hacker entries with a threshold of 1.4000000000000021.


In [11]:
testData, testCalls = ReadDataFromCSV("TestData.csv")
X_test = np.array(testData[1:len(calls)+1]).T
y_test = testData[0:1].values.ravel()
y_pred = clf.predict(X_test)
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))

Accuracy: 1.000


## Using a subprocess in Python example
A subprocess is one that is called from the current process to perform a specific task. You use subprocesses for all sorts of things, like getting the contents of the current directory or opening a `.zip` file. The first cell below shows the unsafe way of doing things, which passes a command string to the shell using `shell=True`, while the second cell is safer because it passes the command as a list of arguments. The third cell shows an easier-to-use and safer, but less flexible, method.

In [12]:
from subprocess import check_output

MyDir = check_output("dir", shell=True)
print(MyDir.decode('ascii'))

 Volume in drive C is Data
 Volume Serial Number is 9CA2-C0E6

 Directory of C:\Users\John\Anaconda Projects\MLSec\Chapter05

08/21/2022  08:24 AM    <DIR>          .
08/21/2022  08:24 AM    <DIR>          ..
08/12/2022  10:55 AM    <DIR>          .ipynb_checkpoints
08/21/2022  08:34 AM           314,158 CallData.csv
06/08/2021  12:54 PM             8,703 MLSec; 05; Check for GPU Support.ipynb
05/17/2021  08:40 PM         3,304,477 MLSec; 05; Pix2Pix.ipynb
08/21/2022  08:24 AM            27,992 MLSec; 05; Real Time Defenses.ipynb
08/21/2022  08:35 AM           157,278 TestData.csv
               5 File(s)      3,812,608 bytes
               3 Dir(s)  1,728,845,541,376 bytes free



In [13]:
from subprocess import check_output

MyDir = check_output(['cmd','/c','dir'])
print(MyDir.decode('ascii'))

 Volume in drive C is Data
 Volume Serial Number is 9CA2-C0E6

 Directory of C:\Users\John\Anaconda Projects\MLSec\Chapter05

08/21/2022  08:24 AM    <DIR>          .
08/21/2022  08:24 AM    <DIR>          ..
08/12/2022  10:55 AM    <DIR>          .ipynb_checkpoints
08/21/2022  08:34 AM           314,158 CallData.csv
06/08/2021  12:54 PM             8,703 MLSec; 05; Check for GPU Support.ipynb
05/17/2021  08:40 PM         3,304,477 MLSec; 05; Pix2Pix.ipynb
08/21/2022  08:24 AM            27,992 MLSec; 05; Real Time Defenses.ipynb
08/21/2022  08:35 AM           157,278 TestData.csv
               5 File(s)      3,812,608 bytes
               3 Dir(s)  1,728,845,541,376 bytes free



In [14]:
from os import listdir
from os import getcwd

MyDir = listdir(getcwd())
print(MyDir)

['.ipynb_checkpoints', 'CallData.csv', 'MLSec; 05; Check for GPU Support.ipynb', 'MLSec; 05; Pix2Pix.ipynb', 'MLSec; 05; Real Time Defenses.ipynb', 'TestData.csv']
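
The second cell relies on `cmd`, which exists only on Windows. As a platform-independent sketch of the safer list form, the following code launches the current Python interpreter with `subprocess.run()` and passes a deliberately hostile-looking string as a single argument. Because no shell is involved, the string is never interpreted as a command. The `untrusted` value is made up for illustration.

```python
import subprocess
import sys

# Text that would be dangerous if a shell interpreted it.
untrusted = "report.txt & echo INJECTED"

# Each list element is passed to the child process as one argument,
# so the "&" is just a character, not a command separator.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted],
    capture_output=True, text=True, check=True)

print(result.stdout.strip())
```

The child process prints the string back unchanged, which shows that it arrived as data rather than as a second command.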


## Working with Flask
Flask is a Python framework used for web applications. You could use it to make your machine learning application available through a web API. However, whenever you work with the web, you could expose your network to problems such as Cross-Site Scripting (XSS), in which an attacker gets a page to run script code supplied as input. The following examples show how to avoid this problem.

In [15]:
from flask import Flask, request

app = Flask(__name__)

@app.route("/")
def say_hello():
    your_name = request.args.get('name')
    return "Hello %s" % your_name

**Click the stop button to stop the server from running.** Otherwise, the server will continue to run in the background and you won't be able to run the rest of the example. To test this server out with a script, type `http://127.0.0.1:5000/?name=<script>alert(1)</script>` in a new browser tab and press Enter.

In [16]:
app.run()

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [21/Aug/2022 08:36:13] "GET /?name=%3Cscript%3Ealert(1)%3C/script%3E HTTP/1.1" 200 -
127.0.0.1 - - [21/Aug/2022 08:36:16] "GET /favicon.ico HTTP/1.1" 404 -
127.0.0.1 - - [21/Aug/2022 08:36:24] "GET /?name=John HTTP/1.1" 200 -


In [17]:
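from flask import Flask, request
# Flask 2.3 deprecated flask.escape and Flask 3.0 removed it, so
# import escape from MarkupSafe, which is installed with Flask.
from markupsafe import escape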

app = Flask(__name__)

@app.route("/")
def say_hello():
    your_name = request.args.get('name')
    return "Hello %s" % escape(your_name)

In [18]:
app.run()

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
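
To see what escaping does to the test input without starting a server, the sketch below uses the standard library's `html.escape()`. It is a stand-in for the `escape()` function used in the preceding Flask example rather than the same function, and the `greet` helper is hypothetical.

```python
from html import escape

def greet(name):
    # Convert characters such as < and > into HTML entities so the
    # browser displays them as text instead of running them.
    return "Hello %s" % escape(name)

print(greet("<script>alert(1)</script>"))
print(greet("John"))
```

The script tag comes back as `&lt;script&gt;alert(1)&lt;/script&gt;`, which a browser renders as visible text, while an ordinary name passes through unchanged.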
