PROJECT: ML-IDS

Intelligent IoT Intrusion Detection System

This project builds a machine-learning-based intrusion detection system (IDS) to protect smart homes. As the number of IoT devices grows, so does the number of vulnerabilities they expose. Using network traffic data, we apply exploratory data analysis (EDA), feature engineering, and models such as decision trees to detect threats in real time.

Import all the libraries we will need for this project

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots

Load the dataset

In [9]:
# Load dataset (raw string, so the backslashes in the Windows path are not read as escape sequences)
df = pd.read_csv(r"D:\Documents\ESILV\Matières A4\Core Computer Science\Machine Learning\Projet_ML-IDS\Dataset\dataset_invade.csv")

Analyse the data to understand it and clean it

In [3]:
print(df.dtypes)

duration                    int64
protocol_type              object
service                    object
flag                       object
src_bytes                   int64
dst_bytes                   int64
land                        int64
wrong_fragment              int64
urgent                      int64
hot                         int64
logged_in                   int64
num_compromised             int64
count                       int64
srv_count                   int64
serror_rate               float64
rerror_rate               float64
same_srv_rate             float64
diff_srv_rate             float64
srv_diff_host_rate        float64
dst_host_count              int64
dst_host_srv_count          int64
dst_host_same_srv_rate    float64
dst_host_diff_srv_rate    float64
attack                     object
dtype: object


In [12]:
# Check for missing values
print("Missing values:\n", df.isnull().sum())

# Basic statistics
print("Data statistics:\n", df.describe())

# Column data types
print(df.dtypes)

Missing values:
 duration                  0
protocol_type             0
service                   0
flag                      0
src_bytes                 0
dst_bytes                 0
land                      0
wrong_fragment            0
urgent                    0
hot                       0
logged_in                 0
num_compromised           0
count                     0
srv_count                 0
serror_rate               0
rerror_rate               0
same_srv_rate             0
diff_srv_rate             0
srv_diff_host_rate        0
dst_host_count            0
dst_host_srv_count        0
dst_host_same_srv_rate    0
dst_host_diff_srv_rate    0
attack                    0
protocol_type_encoded     0
service_encoded           0
flag_encoded              0
attack_encoded            0
dtype: int64
Data statistics:
             duration     src_bytes     dst_bytes           land  \
count  148517.000000  1.485170e+05  1.485170e+05  148517.000000   
mean      276.779305  4.022795e+04

No Missing Values: No column contains missing values, so no imputation is needed before training.

Duration, src_bytes, and dst_bytes: These continuous features vary widely, especially src_bytes and dst_bytes, whose values range from zero to over a billion. Features on such different scales can dominate scale-sensitive steps such as PCA, so scaling or normalization may be needed there. Decision trees, by contrast, split on thresholds and are largely unaffected by feature scale.
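One possible way to tame these ranges is a log transform followed by standardization. The sketch below is only an illustration: the `toy` frame and its values are made up for the example and are not rows from the dataset, and the choice of `log1p` before `StandardScaler` is an assumption, not necessarily the preprocessing used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the heavy-tailed columns (illustrative values only)
toy = pd.DataFrame({
    "duration": [0, 0, 2, 15, 4000],
    "src_bytes": [0, 181, 239, 5_000, 1_000_000_000],
    "dst_bytes": [0, 5450, 486, 1_200, 900_000_000],
})

# log1p compresses the long right tail and keeps zeros at zero
logged = np.log1p(toy)

# StandardScaler then gives each column zero mean and unit variance
scaled = pd.DataFrame(
    StandardScaler().fit_transform(logged),
    columns=toy.columns,
)
print(scaled.round(2))
```

On the real data, the scaler would normally be fitted on the training split only and then applied to the test split, so that test statistics do not leak into training.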

Binary and Discrete Features: Some features, like land, urgent, and logged_in, take only a few small values and can act as flags for certain types of behavior. Features like wrong_fragment and hot also vary little overall, but their rare non-zero values may be critical indicators for certain attacks.
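A quick way to check which columns really behave as flags is to count their distinct values. This is a minimal sketch on a hypothetical `toy` frame (the values are invented for illustration); on the real data the same calls would be run on `df`.

```python
import pandas as pd

# Hypothetical sample with three flag-like columns and one continuous column
toy = pd.DataFrame({
    "land": [0, 0, 0, 1, 0],
    "urgent": [0, 0, 1, 0, 0],
    "logged_in": [1, 0, 1, 1, 0],
    "src_bytes": [0, 181, 239, 5000, 12],
})

# Number of distinct values per column
n_unique = toy.nunique()

# Columns with exactly two distinct values are candidates for binary flags
binary_cols = n_unique[n_unique == 2].index.tolist()
print(binary_cols)

# Share of rows where each candidate flag is non-zero
print((toy[binary_cols] != 0).mean())
```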

Network Traffic Attributes: Features such as count, srv_count, serror_rate, and rerror_rate are statistical measures of network interactions, which help identify unusual patterns or outliers in traffic. High standard deviations in count and srv_count suggest occasional spikes, potentially indicating bursts of traffic or repeated connection attempts typical in certain attack types.
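To make the idea of "occasional spikes" concrete, the sketch below flags unusually large values of count with the interquartile-range (IQR) rule. The `toy` values are invented for illustration, and the 1.5 × IQR cutoff is a common convention assumed here, not a threshold taken from this project.

```python
import pandas as pd

# Hypothetical connection counts with one burst (illustrative values only)
toy = pd.DataFrame({"count": [1, 2, 3, 2, 1, 4, 2, 3, 1, 511]})

# Interquartile range of the column
q1 = toy["count"].quantile(0.25)
q3 = toy["count"].quantile(0.75)
iqr = q3 - q1

# Rows above Q3 + 1.5 * IQR are treated as spikes
upper = q3 + 1.5 * iqr
spikes = toy[toy["count"] > upper]
print(spikes)
```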

Service-Related Ratios: Metrics like same_srv_rate, diff_srv_rate, and srv_diff_host_rate provide insight into how connections are distributed across services and hosts. For example, a high same_srv_rate might indicate a consistent connection to one service, while a high srv_diff_host_rate could suggest an attack attempting connections across multiple hosts.

Host-Specific Features: Attributes like dst_host_count and dst_host_srv_count provide details on the concentration of connections to specific hosts. The high variance in these features, especially dst_host_same_srv_rate, may signal unusual access patterns or targeted attacks.
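One way to see whether the service and host ratios actually separate the two classes is to compare their means per label. The sketch below uses a hypothetical four-row `toy` frame with invented values; the column names match the dataset, but the numbers do not come from it.

```python
import pandas as pd

# Hypothetical sample: two normal connections and two attacks (illustrative values)
toy = pd.DataFrame({
    "attack": ["No", "No", "Yes", "Yes"],
    "same_srv_rate": [1.0, 0.9, 0.1, 0.2],
    "dst_host_same_srv_rate": [1.0, 0.8, 0.05, 0.15],
})

# Mean of each ratio within each class
rate_cols = ["same_srv_rate", "dst_host_same_srv_rate"]
per_class = toy.groupby("attack")[rate_cols].mean()
print(per_class)
```

A large gap between the two rows for a given column suggests that the feature carries signal for the classifier; a small gap suggests it may matter only in combination with other features.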

Attack Labels: The attack label is critical for supervised learning approaches, as it allows for training a model to differentiate between normal and potentially malicious activities.
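Most scikit-learn estimators expect a numeric target, so the Yes/No label is usually encoded first. The following sketch shows one way to do this with the `LabelEncoder` imported above; the `toy` frame is hypothetical, and the `attack_encoded` column name simply mirrors the one visible in the missing-values output.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels (illustrative values only)
toy = pd.DataFrame({"attack": ["No", "Yes", "Yes", "No"]})

# LabelEncoder sorts the classes, so "No" maps to 0 and "Yes" to 1
encoder = LabelEncoder()
toy["attack_encoded"] = encoder.fit_transform(toy["attack"])

# Show the mapping from original label to integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Keeping the fitted `encoder` around makes it possible to map predictions back to the original labels with `encoder.inverse_transform`.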

Data imbalance

In [5]:
# Data imbalance
print("Class distribution:\n", df['attack'].value_counts())

Class distribution:
 attack
No     77054
Yes    71463
Name: count, dtype: int64
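The raw counts are easier to interpret as proportions. The sketch below re-creates the two counts printed above as a small Series (so it runs without the dataset) and computes the class shares and the majority-to-minority ratio; on the real data, `df['attack'].value_counts(normalize=True)` gives the shares directly.

```python
import pandas as pd

# Class counts copied from the output above
counts = pd.Series({"No": 77054, "Yes": 71463}, name="count")

# Share of each class in the dataset
share = counts / counts.sum()
print(share.round(3))

# Ratio of the majority class to the minority class
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio, 3))
```

The split is roughly 52% to 48%, so the classes are close to balanced and resampling is probably unnecessary. Passing `stratify` to `train_test_split` would still keep the same proportions in the train and test sets.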


Visual exploration