# RIoT: Reinforced IoT
A security solution for IoT devices.

Because IoT devices have little processing power and storage, running an anti-virus or any other form of on-device protection is very difficult. RIoT aims to solve this problem with a system that monitors all the packets in the IoT network and applies a machine learning model to the observed packets to determine whether something is off.

The [dataset](https://www.stratosphereips.org/datasets-iot23) used to train the model was created by capturing all the packets in an IoT system that was deliberately attacked and labelling them. The packet capture files (.pcap files) were then run through a program called [Zeek](https://zeek.org) to produce a transactional log file, to which two additional columns were added that denote whether each entry was part of an attack.

The dataset contains the following columns:<br>
1. ts: Timestamp of when the connection took place (Unix time).<br>
1. uid: A unique ID identifying the connection.<br>
1. id.orig_h: The IP address of the originator of the packet.<br>
1. id.orig_p: The port used by the originator of the packet.<br>
1. id.resp_h: The IP address of the recipient of the packet.<br>
1. id.resp_p: The port used by the recipient of the packet.<br>
1. proto: Underlying transport layer protocol used (usually TCP or UDP).<br>
1. service: Identified application protocol (HTTP, DNS, etc.).<br>
1. duration: The duration of the connection.<br>
1. orig_bytes: The number of payload bytes the originator sent.<br>
1. resp_bytes: The number of payload bytes the recipient sent.<br>
1. conn_state: Encoded value that denotes the state of the connection.<br>
1. local_orig: Boolean value that denotes whether the connection originated locally.<br>
1. local_resp: Boolean value that denotes whether the connection was responded to locally.<br>
1. missed_bytes: The number of bytes missed in content gaps, which is representative of packet loss.<br>
1. history: Encodes the state history of the connection as a string of letters.<br>
1. orig_pkts: Number of packets that the originator sent.<br>
1. orig_ip_bytes: Number of IP-level bytes that the originator sent.<br>
1. resp_pkts: Number of packets that the responder sent.<br>
1. resp_ip_bytes: Number of IP-level bytes that the responder sent.<br>

1. label: Labelled as *Malicious* or *benign*.<br>
1. Detailed_label: Labels the type of attack.
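
The plots later in this notebook compare `label` against 1 and 0, so the textual labels have to be converted to numbers first. A minimal sketch of that step, assuming the labels are the strings *Malicious* and *benign* as listed above and that `ts` holds Unix seconds. The three rows in `sample` are made-up illustration data, not rows from the dataset:

```python
import pandas as pd

# Hypothetical sample rows; the real values come from the labelled Zeek log.
sample = pd.DataFrame({
    "ts": [1545403816.96, 1545403817.10, 1545403820.50],
    "proto": ["tcp", "udp", "tcp"],
    "label": ["Malicious", "benign", "Malicious"],
})

# Unix timestamps (seconds) -> pandas datetimes, handy for time-based plots.
sample["ts"] = pd.to_datetime(sample["ts"], unit="s")

# Case-insensitive mapping of the text label to 1 (attack) / 0 (benign).
label_codes = {"malicious": 1, "benign": 0}
sample["label"] = sample["label"].str.strip().str.lower().map(label_codes)

print(sample["label"].tolist())
```

Lower-casing before the lookup is a defensive choice so that differences in capitalisation between files do not produce missing values.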

## Importing required libraries

The following libraries were used.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as skl

## Loading the data

In [None]:
train_data = pd.read_csv("EditedCopyLog.labeled.copy.csv")

*Process of analysing and cleaning the data:*

In [None]:
train_data.info()

In [None]:
train_data.head()

In [None]:
train_data.describe()

## Data Analysis
The following visualisations were made as part of the data cleaning process.

In [None]:
train_data[train_data['label']==1]['id.orig_p'].hist(alpha=0.5,color='blue',bins=30,label='label=1')

train_data[train_data['label']==0]['id.orig_p'].hist(alpha=0.5,color='red',bins=30,label='label=0')
plt.legend()
plt.xlabel('port of Originator')

In [None]:
train_data[train_data['label']==1]['id.resp_p'].hist(alpha=0.5,color='blue',bins=30,label='label=1')

train_data[train_data['label']==0]['id.resp_p'].hist(alpha=0.5,color='red',bins=30,label='label=0')
plt.legend()
plt.xlabel('port of Responder')

*Encoding the values of the categorical columns for the ML algorithm*
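
The `proto` step in the cell below is still a placeholder. One possible way to fill it in, sketched under the assumption that the column holds protocol names such as `tcp`, `udp` and `icmp`. The mapping `PROTO_CODES` and the helper `encode_proto` are hypothetical names, not part of the original notebook:

```python
import pandas as pd

# Hypothetical fixed mapping, so the same protocol always gets the same code
# in both the training set and the test set.
PROTO_CODES = {"tcp": 0, "udp": 1, "icmp": 2}

def encode_proto(series: pd.Series) -> pd.Series:
    """Map protocol names to integer codes; unknown or missing values become -1."""
    return series.str.lower().map(PROTO_CODES).fillna(-1).astype(int)

demo = pd.Series(["tcp", "udp", "TCP", None, "icmp"])
print(encode_proto(demo).tolist())
```

A fixed dictionary keeps the codes stable across files. If an implied ordering between protocols is undesirable, one-hot encoding with `pd.get_dummies` would be an alternative.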


In [None]:
# Encoding the 'proto' column of the dataset
# Program comes here


# Encoding the 'service' column of the dataset
# Program comes here
# The 'service' column is not encoded because many of its values are blank in the dataset.

# Encoding the 'conn_state' column of the dataset.
# Each Zeek connection state is mapped to an integer code.
CONN_STATE_CODES = {
    "S0": 0,       # Connection attempt seen, no reply.
    "S1": 1,       # Connection established, not terminated.
    "SF": 2,       # Normal establishment and termination.
    "REJ": 3,      # Connection attempt rejected.
    "S2": 4,       # Connection established and close attempt by originator seen (but no reply from responder).
    "S3": 5,       # Connection established and close attempt by responder seen (but no reply from originator).
    "RSTO": 6,     # Connection established, originator aborted (sent a RST).
    "RSTR": 7,     # Responder sent a RST.
    "RSTOS0": 8,   # Originator sent a SYN followed by a RST, never saw a SYN-ACK from responder.
    "RSTRH": 9,    # Responder sent a SYN-ACK followed by a RST, never saw a SYN from (supposed) originator.
    "SH": 10,      # Originator sent a SYN followed by a FIN, never saw a SYN-ACK from the responder (half-open connection).
    "SHR": 11,     # Responder sent a SYN-ACK followed by a FIN, never saw a SYN from originator.
    "OTH": 12,     # No SYN seen, just midstream traffic (e.g. partial traffic that was not later closed).
}

def conn_state_encoder(conn_state_value):
    """Return the integer code for a conn_state value, or None if it is missing or unknown."""
    if not isinstance(conn_state_value, str):
        return None
    return CONN_STATE_CODES.get(conn_state_value)

# Encoding the 'history' column of the dataset
# Goal: map every possible value of the 'history' column to a numerical value for the ML algorithm.
def DecimalToBase25(decimalNum):
    base25Str=""
    decodings = {1:'1',2:'2',3:'3',4:'4',5:'5',6:'6',7:'7',8:'8',9:'9',10:'a',11:'b',12:'c',13:'d',14:'e',15:'f',16:'g',17:'h',18:'i',19:'j',20:'k',21:'l',22:'m',23:'n',24:'o',0:'p'}

    while decimalNum != 0 :
        base25Str = base25Str + decodings[decimalNum%25]
        if decimalNum%25 == 0:  # Remainder 0 stands for digit 25 ('^' in the history string); subtract 1 so the next digit is computed correctly.
            decimalNum=decimalNum-1
        decimalNum = decimalNum//25

    base25Str = base25Str[::-1]
    #print("decoded Base 25 Value:",base25Str)
    return base25Str

def decodeHist(encodedDecimalVal):
    if type(encodedDecimalVal) != int:
        return None
    
    decodings = {'1':'s','2':'S','3':'d','4':'D','5':'f','6':'F','7':'h','8':'H','9':'r','a':'R','b':'c','c':'C','d':'a','e':'A','f':'g','g':'G','h':'t','i':'T','j':'w','k':'W','l':'i','m':'I','n':'q','o':'Q','p':'^'}
    base25DecodedStr = DecimalToBase25(encodedDecimalVal)
    decodedHistStr=''

    for i in base25DecodedStr:
        decodedHistStr = decodedHistStr + decodings[i]

    #print("History value decoded by function: ",decodedHistStr)
    return decodedHistStr

# Encoding functions:
def base25ToDecimal(base25Encoded):
    if type(base25Encoded) != str:
        return None
    decimalVal = 0

    encodings = {'1':1,'2':2,'3':3,'4':4,'5':5,'6':6,'7':7,'8':8,'9':9,'a':10,'b':11,'c':12,'d':13,'e':14,'f':15,'g':16,'h':17,'i':18,'j':19,'k':20,'l':21,'m':22,'n':23,'o':24,'p':25}

    for i in base25Encoded:
        decimalVal = decimalVal * 25 + encodings[i]

    #print("encoded value in decimal: ",decimalVal)
    return decimalVal

def encodeHist(history):
    #print(history)
    if type(history) != str :
        return None

    encodedStr = ""

    encoding = {'s':'1','S':'2','d':'3','D':'4','f':'5','F':'6','h':'7','H':'8','r':'9','R':'a','c':'b','C':'c','a':'d','A':'e','g':'f','G':'g','t':'h','T':'i','w':'j','W':'k','i':'l','I':'m','q':'n','Q':'o','^':'p'}

    for i in history :
        encodedStr = encodedStr + encoding[i]

    encodedDecimalVal = base25ToDecimal(encodedStr)

    #print("encoded value in base 25: ",encodedStr)
    decodeHist(encodedDecimalVal)
    #print('\n')
    # Return the numeric value so the column can be fed to the ML algorithm.
    return encodedDecimalVal
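
The history encoding above treats each of the 25 possible history letters as a digit from 1 to 25 and reads the whole string as one number. Because no digit is zero, leading letters are never lost. A compact, self-contained sketch of the same idea with a round-trip check follows; `ALPHABET`, `encode_history` and `decode_history` are illustrative names, and the letter order is copied from the `encoding` table above:

```python
# Letter order taken from the notebook's encoding table: index + 1 is the digit value.
ALPHABET = "sSdDfFhHrRcCaAgGtTwWiIqQ^"

def encode_history(history: str) -> int:
    """Read the history string as a base-25 number with digits 1..25."""
    value = 0
    for letter in history:
        value = value * 25 + ALPHABET.index(letter) + 1
    return value

def decode_history(value: int) -> str:
    """Invert encode_history."""
    letters = []
    while value > 0:
        # Subtracting 1 shifts digits 1..25 to indices 0..24.
        value, digit = divmod(value - 1, 25)
        letters.append(ALPHABET[digit])
    return "".join(reversed(letters))

# Round-trip check on a few history strings, including one containing '^'.
for sample in ["S", "ShADadfF", "^d"]:
    assert decode_history(encode_history(sample)) == sample

print(encode_history("S"), encode_history("Sd"))  # prints: 2 53
```

Python integers grow as needed, but a history longer than 13 letters can exceed the 64-bit integer range that pandas and NumPy use by default, so very long histories may need truncating or a different representation.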

## Training the decision tree model

In [None]:
from sklearn.tree import DecisionTreeClassifier

dTree = DecisionTreeClassifier()

In [None]:
dTree.fit(VARIABLE_NAME_FOR_X_OF_DATASET,VARIABLE_NAME_FOR_Y_OF_DATASET_LABEL)
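
The placeholders above stand for the feature matrix and the label vector. A minimal sketch of how they could be assembled, assuming the categorical columns are already encoded as numbers and `label` is 0/1. The toy rows, the feature list `feature_cols` and the names `X`, `y` and `tree` are hypothetical:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up, already-encoded rows standing in for the cleaned training data.
toy = pd.DataFrame({
    "id.resp_p":  [23, 80, 23, 443, 23, 53],
    "proto":      [0, 0, 0, 0, 0, 1],
    "conn_state": [0, 2, 0, 2, 0, 2],
    "orig_pkts":  [1, 10, 1, 12, 2, 1],
    "label":      [1, 0, 1, 0, 1, 0],
})

feature_cols = ["id.resp_p", "proto", "conn_state", "orig_pkts"]  # assumed feature set
X = toy[feature_cols]
y = toy["label"]

# random_state makes the fitted tree reproducible.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

print(tree.predict(X).tolist())
```

Predicting on the training rows only shows that the tree has memorised them; a held-out test set is still needed to measure how well it generalises.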

## Prediction and Evaluation of Decision Tree
*Measuring how good the model is by using a test dataset*

In [None]:
test_data = pd.read_csv("TotallyRealTestDatasetWhichIsNotPartOfTrainDataset.csv")
# more lines of code to prepare the test_data for Evaluation of the model.

In [None]:
prediction=dTree.predict(VARIABLE_NAME_FOR_X_OF_TESTSET)

from sklearn.metrics import classification_report,confusion_matrix

print(classification_report(VARIABLE_NAME_FOR_Y_OF_TESTSET, prediction))

In [None]:
print(confusion_matrix(VARIABLE_NAME_FOR_Y_OF_TESTSET,prediction))
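
For a binary label, the confusion matrix printed above has the true labels as rows and the predicted labels as columns, so its four cells can be unpacked into the usual counts. A small sketch on made-up labels (the lists `y_true` and `y_pred` are illustrative, not model output):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1, 0]  # hypothetical ground truth (1 = malicious)
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]  # hypothetical predictions

# With labels=[0, 1], ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # share of flagged connections that really were attacks
recall = tp / (tp + fn)     # share of attacks that were flagged

print(tn, fp, fn, tp, precision, recall)
```

For an intrusion detector, a false negative is a missed attack, so recall on the malicious class is usually worth watching alongside overall accuracy.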

## Advanced Decision Forest model for further categorising the type of attack

In [None]:
# RIoT: Reinforced IoT Machine learning program

#import required libraries
import tensorflow_decision_forests as tfdf
import pandas as pd

# Load the dataset in a Pandas dataframe.
train_df = pd.read_csv("TrainDataset.csv")
test_df = pd.read_csv("TestDataset.csv")

# Convert the dataset into a TensorFlow dataset.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="label")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label="label")

# Train the model
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)

# Look at the model.
model.summary()

# Evaluate the model.
model.evaluate(test_ds)

# Export to a TensorFlow SavedModel.
model.save("project/model")