# Cybersecurity Project - Group 10 - Main Notebook

*The EDA won't be displayed here but can be found in the notebook named : `cybersercurity_EDA.ipynb`*

This notebook covers the rest of the projects, including 
- Feature engineering
- Data Cleaning
- Preprocessing
- Training and testing of the models
- Comparision
- Conclusion

More details will be discussed on the project report.

In [1]:
import pandas as pd
import numpy as np

import geoip2
import geoip2.database

import joblib

import matplotlib

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

We start by defining a function that will give us information about the dataset we are working with. We see it as a more detailled version of `.info()`, by adding the missing value percentage and the number of unique values, in a more readable table.

In [2]:
df = pd.read_csv("data/cybersecurity_attacks.csv")

def big_info(df):

    info_df = pd.DataFrame({
        "Column": df.columns,
        "Non-Null Count": df.notnull().sum().values,
        "Null Count": df.isnull().sum().values,
        "Missing (%)": (df.isnull().sum() / len(df) * 100).round(2).values,
        "Dtype": df.dtypes.astype(str).values,
        "Unique Values": df.nunique().values
    })

    return info_df

big_info(df)

Unnamed: 0,Column,Non-Null Count,Null Count,Missing (%),Dtype,Unique Values
0,Timestamp,40000,0,0.0,object,39997
1,Source IP Address,40000,0,0.0,object,40000
2,Destination IP Address,40000,0,0.0,object,40000
3,Source Port,40000,0,0.0,int64,29761
4,Destination Port,40000,0,0.0,int64,29895
5,Protocol,40000,0,0.0,object,3
6,Packet Length,40000,0,0.0,int64,1437
7,Packet Type,40000,0,0.0,object,2
8,Traffic Type,40000,0,0.0,object,3
9,Payload Data,40000,0,0.0,object,40000


From there, we chose to get rid of high cardinality features. We could simply remove it, but that would be missing a lot of efficiency for the problem. So instead, we are processing to feature extraction. We do that column by column, starting the timestamp. In the next part, we will list the transformation/extraction we do to each column.

## Feature engineering

### Timestamp 

- the year
- the month
- the day
- the hour
- the minute
- the second
- the day number of the week
- is it on week-end (bool)

In [3]:
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
df["year"] = df["Timestamp"].dt.year
df["month"] = df["Timestamp"].dt.month
df["day"] = df["Timestamp"].dt.day
df["hour"] = df["Timestamp"].dt.hour
df["minute"] = df["Timestamp"].dt.minute
df["second"] = df["Timestamp"].dt.second
df["ts_dayofweek"] = df["Timestamp"].dt.dayofweek
df["ts_is_weekend"] = df["Timestamp"].dt.dayofweek.isin([5, 6]).astype(int)

### Source IP Address & Destination IP Address

- the country of the source IP
- the country of the destination IP
- is the source and destination from the same country (bool)

*We could have also used the city and the ASN (and others) but we did and we had a high cardinality result again. So we decided to only use the countries*

In [4]:
CITY_DB_PATH = "GeoLite2-City.mmdb"
reader = geoip2.database.Reader(CITY_DB_PATH)

src_country = []
dst_country = []

for ip in df["Source IP Address"]:
    try:
        r = reader.city(ip)
        src_country.append(r.country.iso_code)
    except Exception:
        src_country.append(None)

for ip in df["Destination IP Address"]:
    try:
        r = reader.city(ip)
        dst_country.append(r.country.iso_code)
    except Exception:
        dst_country.append(None)

reader.close()

df["src_country"] = src_country
df["dst_country"] = dst_country
df["same_country"] = (df["src_country"] == df["dst_country"]).astype(int)

### Source Port & Destination Port

- the class of the source port (registered or dynamic)
- the class of the destination port (registered or dynamic)

*We also thought about a boolean is_same_port but we had 100% of no, so we decided to not use it*

In [5]:
def port_class(p):
    if p <= 49151:
        return "registered"
    else:
        return "dynamic"

df["src_port_class"] = df["Source Port"].apply(port_class)
df["dst_port_class"] = df["Destination Port"].apply(port_class)

### Payload Data

This column is a bit special since it's latin random generated words, but since the goal is to find a logical rule that led to the random generation of Attack Type, we can't afford to ignore a potential pattern in the generated payload data. That's why we extract : 

- the number of characters
- the number of words
- the number of punctuations
- the ration between punctuation and characters

In [6]:
df["payload_char_count"] = df["Payload Data"].astype(str).str.len()
df["payload_word_count"] = df["Payload Data"].astype(str).str.split().str.len()
df["payload_punct_count"] = (df["Payload Data"].astype(str).str.count(r"[^\w\s]"))
df["payload_punct_ratio"] = (df["payload_punct_count"] / df["payload_char_count"])
df.loc[df["payload_char_count"] == 0, "payload_punct_ratio"] = 0


### User information

- the first name
- the last name
- the number of characters

In [7]:
df["user_first_name"] = df["User Information"].str.split().str[0]
df["user_last_name"] = df["User Information"].str.split().str[-1]
df["user_char_count"] = df["User Information"].str.len()

### Device Information

- the OS used
- the device used
- the browser used

In [8]:
device = df["Device Information"].astype(str).str.lower()


df["device_os"] = "other"
df.loc[device.str.contains("windows"), "device_os"] = "windows"
df.loc[device.str.contains("mac"), "device_os"] = "macos"
df.loc[device.str.contains("linux"), "device_os"] = "linux"
df.loc[device.str.contains("android"), "device_os"] = "android"
df.loc[device.str.contains("ios|iphone|ipad"), "device_os"] = "ios"


df["device_type"] = "other"
df.loc[device.str.contains("mobile|android|iphone"), "device_type"] = "mobile"
df.loc[device.str.contains("tablet|ipad"), "device_type"] = "tablet"
df.loc[device.str.contains("desktop|windows|mac|linux"), "device_type"] = "desktop"


df["device_browser"] = "other"
df.loc[device.str.contains("chrome"), "device_browser"] = "chrome"
df.loc[device.str.contains("firefox"), "device_browser"] = "firefox"
df.loc[device.str.contains("safari"), "device_browser"] = "safari"
df.loc[device.str.contains("edge"), "device_browser"] = "edge"
df.loc[device.str.contains("opera"), "device_browser"] = "opera"

### Geolocation Data

- the city 
- the state

*It's important to note 2 things : 1. The geolocation has nothing to with the source or destination IP address geolocation and 2. The city and state are completely random since most of the observations of cities don't belong to the associated state. We conclude that the column was generated by a rule like : `Geo-location Data = rand(city) + " , " + rand(state)`*

In [9]:
geo = df["Geo-location Data"].astype(str)
geo_split = geo.str.split(",", expand=True)
df["geo_city"] = geo_split[0].str.strip()
df["geo_state"] = geo_split[1].str.strip() if geo_split.shape[1] > 1 else None

### Proxy Information

- is there a proxy ? (bool)

*After trying few things, we couldn't extract anything relevant else than the presence of a proxy or not*

In [10]:
# PROXY INFORMATION

df["is_proxy"] = df["Proxy Information"].notna().astype(int)

## Data Cleaning

Now we can drop the high cardinality columns and transform the 1-unique value (with 50% NaNs) columns in what they really are : boolean columns

In [11]:
columns_to_drop = ["Timestamp",
                   "Source IP Address",
                   "Destination IP Address",
                   "Source Port",
                   "Destination Port",
                   "Payload Data",
                   "User Information",
                   "Device Information",
                   "Geo-location Data",
                   "Proxy Information"]

df = df.drop(columns=columns_to_drop)

In [12]:
bool_cols = [
    "Malware Indicators",
    "Alerts/Warnings",
    "Firewall Logs",
    "IDS/IPS Alerts"
]

for col in bool_cols:
    df[col] = df[col].notna().astype(int)

So in the end we have a big_info() that looks like this :

In [13]:
big_info(df)

Unnamed: 0,Column,Non-Null Count,Null Count,Missing (%),Dtype,Unique Values
0,Protocol,40000,0,0.0,object,3
1,Packet Length,40000,0,0.0,int64,1437
2,Packet Type,40000,0,0.0,object,2
3,Traffic Type,40000,0,0.0,object,3
4,Malware Indicators,40000,0,0.0,int64,2
5,Anomaly Scores,40000,0,0.0,float64,9826
6,Alerts/Warnings,40000,0,0.0,int64,2
7,Attack Type,40000,0,0.0,object,3
8,Attack Signature,40000,0,0.0,object,2
9,Action Taken,40000,0,0.0,object,3


The last data cleaning part is to remove the lines with missing values, since it only concerns few lines (less than 1% of the dataset)

In [14]:
df = df.dropna(subset=["src_country", "dst_country"])

In [15]:
big_info(df)

Unnamed: 0,Column,Non-Null Count,Null Count,Missing (%),Dtype,Unique Values
0,Protocol,39306,0,0.0,object,3
1,Packet Length,39306,0,0.0,int64,1437
2,Packet Type,39306,0,0.0,object,2
3,Traffic Type,39306,0,0.0,object,3
4,Malware Indicators,39306,0,0.0,int64,2
5,Anomaly Scores,39306,0,0.0,float64,9815
6,Alerts/Warnings,39306,0,0.0,int64,2
7,Attack Type,39306,0,0.0,object,3
8,Attack Signature,39306,0,0.0,object,2
9,Action Taken,39306,0,0.0,object,3


## Preprocessing

1. We split the dataset in a train and a test set
2. We scale with a standard scaler the numerical columns and we encode with a one hot encoder the categorical columns
3. We train 3 models not randomly chosen :
    - a logistic regression model : for basic multi-class classification problem
    - a random forest model : for more complex multi-class classification problem
    - a decision tree model : for non-linear multi-class classification problem
4. We chose to compare few metrics : 
    - accuracy being the main metric
    - the F1-score to have more details the results
    - the training and testing score to see if we are in a case of overfitting (or underfitting)

*To help with reproduction, we use a random seed of 1337*

In [16]:
# X / y + split
X = df.drop(columns=["Attack Type"])
y = df["Attack Type"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1337, stratify=y)

num_cols = X.select_dtypes(include=["int64", "int32", "float64"]).columns.tolist()
cat_cols = X.select_dtypes(include=["object"]).columns.tolist()

# preprocess
preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("scaler", StandardScaler()),]), num_cols),
        ("cat", Pipeline([("ohe", OneHotEncoder(handle_unknown="ignore")),]), cat_cols),
    ], remainder="drop"
)

# models creation
models = {
    "LogReg": LogisticRegression(max_iter=2000, solver="lbfgs", random_state=1337),
    "DecisionTree": DecisionTreeClassifier(random_state=1337),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=1337)
}

In [None]:
results = []

for name, model in models.items():
    pipe = Pipeline([("preprocess", preprocess),("model", model)])

    pipe.fit(X_train, y_train)
    #joblib.dump(pipe, f"models\model_{name}.joblib") to save the models
    y_pred = pipe.predict(X_test)
    
    acc = accuracy_score(y_test, y_pred)
    f1m = f1_score(y_test, y_pred, average="macro")
    train_score = pipe.score(X_train, y_train)
    test_score = pipe.score(X_test,y_test)
    
    results.append((name, acc, f1m, train_score, test_score))

results_df = pd.DataFrame(results, columns=["Model", "Accuracy", "F1_macro", "Train score", "Test score"])
results_df.sort_values("Accuracy", ascending=False)

  joblib.dump(pipe, f"models\model_{name}.joblib")


## Conclusion 

- We can clearly observe an overfitting from the complex models (Random Forest and Decision Tree).
- The fact that the model can learn so much from the train but remains as efficient as a randomizer with the test score takes us to the conclusion that there is no dependance between all the features and the target.
- The fact that even a non-linear model can't find such a dependance is even more showing us that the dataset is randomly generated without any logical rule.

We can try to test with different tree depth for the Decision Tree model to see how the overfitting occurs :

In [None]:
results = []

for d in [1,2, 4, 6, 8, 10, 20, 50, 100]:
    model = DecisionTreeClassifier(max_depth=d, random_state=1337)
    pipe = Pipeline([("preprocess", preprocess), ("model", model)])
    pipe.fit(X_train, y_train)

    train_score = pipe.score(X_train, y_train)
    test_score = pipe.score(X_test, y_test)

    results.append((d,train_score,test_score))

results_df = pd.DataFrame(results, columns=["max_depth","Train score","Test score"])
results_df.sort_values("max_depth",ascending=True)

Unnamed: 0,max_depth,Train score,Test score
0,1,0.336058,0.335538
1,2,0.337457,0.328415
2,4,0.339556,0.327906
3,6,0.341401,0.332358
4,8,0.344358,0.331849
5,10,0.347825,0.327525
6,20,0.383062,0.328797
7,50,0.667631,0.332358
8,80,0.958975,0.321165


We can see that the predictions are completely random (P = 1/3 = 33%) on a low max_depth, but going deeper just create an overfitting case.