In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/My Drive/Colab Notebooks

/content/drive/My Drive/Colab Notebooks


In [3]:
import pandas as pd

features = [
    "Fwd Seg Size Min",
    "Init Bwd Win Byts",
    "Init Fwd Win Byts",
    "Fwd Pkt Len Mean",
    "Fwd Seg Size Avg",
    "Label",
    "Timestamp",
]
dtypes = {
    "Fwd Pkt Len Mean": "float",
    "Fwd Seg Size Avg": "float",
    "Init Fwd Win Byts": "int",
    "Init Bwd Win Byts": "int",
    "Fwd Seg Size Min": "int",
    "Label": "str",
}
date_columns = ["Timestamp"]

1) `features` is a list of strings. It contains the names of the columns we are interested in within the dataset.

2) `dtypes` is a dictionary mapping column names (string keys) to the data types we expect those columns to have (string values). For example, the `"Fwd Pkt Len Mean"` column holds floating point numbers, as indicated by the value `"float"`.

3) `date_columns` is a list containing the names of the columns that store dates. In our case there is only one such column, `"Timestamp"`.
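Since the three objects describe the same set of columns, it can help to check that they agree before reading a large file. The following is a minimal sketch, not part of the original notebook: the helper name `check_column_spec` and the small example inputs are hypothetical and chosen only for illustration.

```python
def check_column_spec(features, dtypes, date_columns):
    """Report repeated names and mismatches between the three column specs."""
    seen, repeated = set(), []
    for name in features:
        if name in seen:
            repeated.append(name)  # same column listed more than once
        seen.add(name)
    described = set(dtypes) | set(date_columns)
    return {
        "repeated": repeated,
        # requested columns that have neither a dtype nor a date entry
        "undescribed": sorted(seen - described),
        # dtype or date entries for columns that are never requested
        "unused": sorted(described - seen),
    }


# Hypothetical inputs: "Label" is repeated, "Init Fwd Win Byts" is typed but unused.
report = check_column_spec(
    features=["Fwd Pkt Len Mean", "Label", "Timestamp", "Label"],
    dtypes={"Fwd Pkt Len Mean": "float", "Label": "str", "Init Fwd Win Byts": "int"},
    date_columns=["Timestamp"],
)
print(report)
```

An empty list under every key would indicate that the three specifications are consistent with each other.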

In [4]:
df = pd.read_csv("ddos_dataset.csv", usecols=features, dtype=dtypes, parse_dates=date_columns, index_col=None)



The code above reads a CSV file named "ddos_dataset.csv" and assigns the result to the variable `df`. The pandas `read_csv` function parses a CSV file and returns a DataFrame. The arguments are:

- `usecols=features`: This argument selects a subset of columns to read from the CSV file. The `features` variable must be a list that contains the names of the columns we want to use.
- `dtype=dtypes`: This specifies the data types of the columns. The `dtypes` variable should be a dictionary where the keys are the column names and the values are the data types we want to enforce for each column.
- `parse_dates=date_columns`: This tells pandas to interpret certain columns as dates. The `date_columns` variable should be a list of column names that contain date information.
- `index_col=None`: This controls which column, if any, is used as the row labels of the DataFrame. With `None`, which is also the default, no column is used and pandas assigns a default integer index.
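To see what these arguments do without loading the full dataset, the same call can be tried on a few in-memory rows. This is only a sketch: the rows, the label values, the timestamp format, and the extra "Dst Port" column are made up for illustration and are not taken from "ddos_dataset.csv".

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the real file. "Dst Port" is included
# only to show that usecols leaves it out of the result.
csv_text = """Timestamp,Dst Port,Fwd Pkt Len Mean,Init Fwd Win Byts,Label
2018-02-20 10:00:02,80,12.5,8192,Benign
2018-02-20 10:00:01,443,0.0,-1,ddos
2018-02-20 10:00:03,80,30.25,225,Benign
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Timestamp", "Fwd Pkt Len Mean", "Init Fwd Win Byts", "Label"],
    dtype={"Fwd Pkt Len Mean": "float", "Init Fwd Win Byts": "int", "Label": "str"},
    parse_dates=["Timestamp"],
    index_col=None,
)

print(sample.dtypes)  # one dtype per column that was kept
print(sample.index)   # default integer index, since index_col=None
```

Printing `sample.dtypes` is a quick way to confirm that the requested types were applied and that the timestamp column was parsed as a date rather than left as text.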

In [5]:
df2 = df.sort_values("Timestamp")

In [6]:
df3 = df2.drop(columns=["Timestamp"])

In [7]:
l = len(df3.index)
train_df = df3.head(int(l * 0.8))
test_df = df3.tail(int(l * 0.2))

1. `l = len(df3.index)`: This line gets the total number of rows in the DataFrame `df3`.

2. `train_df = df3.head(int(l * 0.8))`: This line creates a new DataFrame `train_df`, which consists of the first 80% of the rows of `df3`. The `head` function in pandas returns the first `n` rows of a DataFrame or Series. Here `int(l * 0.8)` computes 80% of the total number of rows, truncated to an integer.

3. `test_df = df3.tail(int(l * 0.2))`: This line creates another DataFrame `test_df`, which consists of the last 20% of the rows of `df3`. The `tail` function in pandas returns the last `n` rows. Here `int(l * 0.2)` computes 20% of the total number of rows, truncated to an integer. Because the two sizes are truncated independently, a row in the middle can end up in neither set when the row count does not divide evenly.
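The effect of truncating the two sizes separately can be seen on a small example. The sketch below uses a hypothetical 9-row frame in place of `df3` and compares the `head`/`tail` split with an alternative that uses a single cut point via `iloc`. The alternative is a suggestion, not what the notebook does.

```python
import pandas as pd

# Hypothetical 9-row frame standing in for df3 (assumed already sorted by time).
demo = pd.DataFrame({"x": range(9), "Label": ["a"] * 9})

n_rows = len(demo.index)
split_at = int(n_rows * 0.8)  # 7 for 9 rows

# Sizes truncated separately, as in the notebook: 7 + 1 rows, one row is lost.
head_part = demo.head(int(n_rows * 0.8))
tail_part = demo.tail(int(n_rows * 0.2))

# One cut point: 7 + 2 rows, every row lands in exactly one set.
train_part = demo.iloc[:split_at]
test_part = demo.iloc[split_at:]

print(len(head_part) + len(tail_part), len(train_part) + len(test_part))
```

On a dataset with many rows the difference is at most one row, so it is unlikely to change the score noticeably, but the single cut point makes the split easier to reason about.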

In [8]:
y_train = train_df.pop("Label").values
y_test = test_df.pop("Label").values

`y_train = train_df.pop("Label").values`: This line creates the training labels, that is, the true classes for the training set. `pop` removes the "Label" column from the `train_df` DataFrame and returns it, and `.values` converts it into a NumPy array. Because the column is removed in place, `train_df` contains only feature columns afterwards.

`y_test` is created the same way from `test_df`.
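The in-place behaviour of `pop` is easy to verify on a toy frame. The following sketch uses made-up rows and label values. The `.copy()` call is an addition that is not in the notebook, included so that `pop` modifies an independent frame rather than a slice of another one.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
demo = pd.DataFrame(
    {
        "Fwd Pkt Len Mean": [1.0, 2.5, 0.0],
        "Label": ["Benign", "ddos", "Benign"],
    }
)

# Copy the slice so that pop changes `part` without touching `demo`.
part = demo.head(2).copy()
labels = part.pop("Label").values

print(list(labels))        # the labels that were removed from `part`
print(list(part.columns))  # only the feature column is left
print(list(demo.columns))  # the original frame still has both columns
```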

In [9]:
X_train = train_df.values
X_test = test_df.values

In [10]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=50)

In [11]:
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.83262

Since the dataset is large, even importing all of it is computationally intensive. For this reason, we begin by specifying a subset of features from the dataset, the ones we consider most promising, and by recording their data types so that we don't have to convert them later. We then read the data into a data frame. In the next two steps, we sort the data by date, since the problem requires being able to predict events in the future, and then drop the date column, since we will not use it further. In the following steps, we perform a train-test split that respects temporal order: the earliest 80% of the rows are used for training and the latest 20% for testing. Finally, we instantiate, fit, and test a random forest classifier.

Depending on the application, the accuracy achieved is a good starting point. A promising direction for improving performance is to account for the source and destination IPs. The reasoning is that, intuitively, where a connection is coming from should have a significant bearing on whether it is part of a DDoS.
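One possible way to feed IP addresses to the classifier is to convert each address to an integer with Python's standard `ipaddress` module. The sketch below is only an illustration of that idea. The column names "Src IP" and "Dst IP" and the example addresses are assumptions and may not match the dataset, and an integer encoding is just one of several options.

```python
import ipaddress

import pandas as pd


def ip_to_int(address: str) -> int:
    """Convert a dotted IPv4 string (or an IPv6 string) to its integer value."""
    return int(ipaddress.ip_address(address))


# Hypothetical flows; the column names and addresses are assumed for illustration.
flows = pd.DataFrame(
    {
        "Src IP": ["192.168.0.1", "10.0.0.5"],
        "Dst IP": ["172.31.69.25", "172.31.69.25"],
    }
)

# Add one numeric column per address column so a tree-based model can split on it.
for column in ["Src IP", "Dst IP"]:
    flows[column + " Int"] = flows[column].map(ip_to_int)

print(flows[["Src IP Int", "Dst IP Int"]])
```

Addresses in the same subnet map to nearby integers, which a random forest can exploit through threshold splits. Whether this actually improves the score on this dataset would need to be measured.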

