# 🛡️ Predicting CAN Bus Intrusion Using Machine Learning

This notebook explores the use of machine learning and Python-based data science libraries to build a model capable of detecting intrusions in in-vehicle CAN (Controller Area Network) communication based on message patterns and timing.

We will take the following approach:

1. **Problem Definition**  
2. **Data**  
3. **Evaluation**  
4. **Features**  
5. **Modelling**  
6. **Experimentation**

---

## 🔍 1. Problem Definition

In a single sentence:

> Given CAN bus communication parameters such as timestamp, arbitration ID, and data values, can we predict whether or not a message is part of a cyberattack on the vehicle?


## 📊 2. Data

The dataset was collected from a **KIA SOUL** vehicle via the **OBD-II port** while performing both normal and malicious activities on the CAN bus. It includes the following four states:

1. **DoS Attack** – Repeatedly injecting messages with CAN ID `0x000`
2. **Fuzzy Attack** – Injecting messages with random CAN IDs and data
3. **Impersonation Attack** – Injecting spoofed messages using CAN ID `0x164`
4. **Attack-Free State** – Normal CAN communication

Each record in the dataset contains:

- `Timestamp`: Time in seconds  
- `CAN ID`: Identifier of the message (hexadecimal)  
- `DLC`: Number of data bytes (0 to 8)  
- `DATA[0-7]`: Payload bytes of the CAN message  

The dataset is available for research purposes and was originally used in the **2017 Information Security R&D dataset challenge** in South Korea.

🔗 [IEEE Paper Link](https://ieeexplore.ieee.org/document/8476919)

---

## 📈 3. Evaluation

Our goal is to build a machine learning model that can detect CAN bus intrusions in real time.  

If the model achieves a detection **accuracy of 90% or higher**, especially in distinguishing between attack and normal states, we will consider the solution effective for practical use.
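
Accuracy on its own can hide weak detection when one class dominates, so it may help to report precision, recall, and F1 next to it. The sketch below computes all four from confusion-matrix counts in plain Python. The counts are made up for illustration, and the helper name `summarize_detection` is hypothetical, not part of this notebook.

```python
def summarize_detection(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1, treating attack as the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts: 80 attacks caught, 20 missed, 10 false alarms, 890 normal messages passed
metrics = summarize_detection(tp=80, fp=10, fn=20, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts the accuracy is 0.97 even though one in five attack messages is missed (recall 0.80), which is why the 90% target is easier to interpret alongside the per-class metrics.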


## 🧩 4. Features

This section describes the key features (columns) in the CAN bus dataset. These attributes are derived from in-vehicle communication logs recorded during normal and attack scenarios.

### 📚 Data Dictionary

| Feature       | Description |
|---------------|-------------|
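| `Timestamp`   | Time when the CAN message was recorded (in seconds). Useful for computing message frequency, time intervals, or offset ratio. In the samples loaded in this notebook, the attack-free file starts at 0.0 while the attack files show values near 1481193000 (which look like Unix epoch seconds), so raw timestamps are not directly comparable across files. |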
| `CAN ID`      | Arbitration ID of the CAN message, represented in hexadecimal (e.g., `0x164`). Can be used to distinguish between different Electronic Control Units (ECUs). |
| `DLC`         | Data Length Code – number of bytes in the message payload (0 to 8). |
| `DATA[0-7]`   | The payload bytes (up to 8) of the CAN message. These bytes may contain status, sensor readings, or control commands. |
| `Label`       | Whether the message is part of an attack (`1`) or normal traffic (`0`). Per-message labels can be derived directly only for the DoS capture. For the fuzzy and impersonation captures, only the attack window is known (e.g., after 250 s); individual message labels are not provided. |

---

### ⚠️ Note:
- In DoS attacks, all messages with `CAN ID = 0x000` are considered **abnormal**.
- In fuzzy and impersonation attacks, there are no explicit per-row labels; however, we can infer attacks based on **time windows**.
- Timing-related features like **offset ratio** and **response delay** can be engineered for advanced detection.
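
The two labelling rules in the notes above can be sketched as follows. This is a minimal illustration, assuming a DataFrame with numeric `CAN ID` and `Timestamp` columns sorted by time. The function names and the `attack_start_s` parameter are hypothetical, and the 250 s value is only the example window mentioned in the data dictionary.

```python
import pandas as pd

def label_dos(df):
    """DoS capture: every message with CAN ID 0x000 is treated as abnormal."""
    return (df["CAN ID"] == 0x000).astype(int)

def label_by_window(df, attack_start_s):
    """Fuzzy/impersonation captures: flag messages at or after the window start.

    attack_start_s is measured in seconds from the first message of the capture.
    """
    elapsed = df["Timestamp"] - df["Timestamp"].iloc[0]
    return (elapsed >= attack_start_s).astype(int)

# Tiny made-up capture for illustration
demo = pd.DataFrame({
    "Timestamp": [100.0, 200.0, 360.0, 400.0],
    "CAN ID": [0x316, 0x000, 0x164, 0x000],
})
demo["Label_dos"] = label_dos(demo)
demo["Label_window"] = label_by_window(demo, attack_start_s=250.0)
print(demo)
```

Window-based labels are coarse: every message inside the window is marked as an attack, including legitimate traffic sent during that period.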

---

To improve detection performance, we'll also create derived features such as:
- Message frequency per ID
- Time delta between identical IDs
- Bit-level entropy or data variation
- Offset timing between request and response frames
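
Some of the derived features listed above could be computed along these lines. This is a sketch under stated assumptions: the column names follow the data dictionary, entropy is taken over byte values rather than individual bits, and a running per-ID message count stands in for a windowed frequency. The helper names `byte_entropy` and `add_derived_features` are hypothetical.

```python
import math
from collections import Counter

import pandas as pd

def byte_entropy(values):
    """Shannon entropy (in bits) of the byte values in one payload."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def add_derived_features(df):
    data_cols = [f"DATA[{i}]" for i in range(8)]
    out = df.sort_values("Timestamp").reset_index(drop=True)
    # Seconds since the previous message with the same CAN ID (NaN for the first one)
    out["delta_same_id"] = out.groupby("CAN ID")["Timestamp"].diff()
    # Running number of messages seen for this CAN ID, including the current one
    out["count_same_id"] = out.groupby("CAN ID").cumcount() + 1
    # Byte-level entropy of the 8 payload bytes
    out["payload_entropy"] = out[data_cols].apply(lambda r: byte_entropy(list(r)), axis=1)
    return out

# Made-up messages for illustration
demo = pd.DataFrame(
    [[0.00, 0x316, 8, 0, 0, 0, 0, 0, 0, 0, 0],
     [0.01, 0x164, 8, 1, 2, 3, 4, 5, 6, 7, 8],
     [0.02, 0x316, 8, 0, 0, 0, 0, 1, 1, 1, 1]],
    columns=["Timestamp", "CAN ID", "DLC"] + [f"DATA[{i}]" for i in range(8)],
)
features = add_derived_features(demo)
print(features[["CAN ID", "delta_same_id", "count_same_id", "payload_entropy"]])
```

The request/response offset feature is left out here because it depends on knowing which IDs form request and response pairs, which the dataset description does not specify.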


In [2]:
# Import all the tools we need

# Regular EDA (exploratory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# we want our plots to appear inside the notebook
%matplotlib inline 

# Models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay

In [29]:
import re
import pandas as pd

# Read at most this many lines; lines that do not match the pattern are skipped,
# so the resulting sample can contain fewer rows
sample_size = 100000
data = []

# Regex pattern for the lines
pattern = re.compile(
    r"Timestamp:\s+([0-9.]+)\s+ID:\s+([0-9A-Fa-f]+)\s+\d+\s+DLC:\s+(\d+)\s+((?:[0-9A-Fa-f]{2}\s+){1,8})"
)

with open("data/Attack_free_dataset.txt") as f:
    for i, line in enumerate(f):
        if i >= sample_size:
            break

        match = pattern.search(line)
        if match:
            timestamp = float(match.group(1))
            can_id = int(match.group(2), 16)
            dlc = int(match.group(3))
            data_bytes = [int(b, 16) for b in match.group(4).strip().split()]
            # Pad to 8 bytes if less (e.g., DLC = 4)
            while len(data_bytes) < 8:
                data_bytes.append(0)
            row = [timestamp, can_id, dlc] + data_bytes[:8] + [0]
            data.append(row)
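
To see what the pattern in the cell above captures, it can be tried on a single line. The log line below is hypothetical: it was constructed to match the pattern and is not copied from the dataset.

```python
import re

pattern = re.compile(
    r"Timestamp:\s+([0-9.]+)\s+ID:\s+([0-9A-Fa-f]+)\s+\d+\s+DLC:\s+(\d+)\s+((?:[0-9A-Fa-f]{2}\s+){1,8})"
)

# Hypothetical log line built to fit the pattern
line = "Timestamp: 0.000224        ID: 0329    000    DLC: 8    d7 a7 7f 8c 11 2f 00 10\n"

match = pattern.search(line)
timestamp = float(match.group(1))
can_id = int(match.group(2), 16)          # hex string to integer
dlc = int(match.group(3))
data_bytes = [int(b, 16) for b in match.group(4).split()]
print(timestamp, can_id, dlc, data_bytes)

# Each byte must be followed by whitespace, so a line with no trailing
# newline or space loses its final byte in the capture group.
stripped_match = pattern.search(line.rstrip())
bytes_without_newline = stripped_match.group(4).split()
print(len(bytes_without_newline))
```

Because the parsing loop pads short payloads with zeros, a final line without a trailing newline would end up with its last byte replaced by 0. Comparing the captured byte count with `dlc` is one way to catch this.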


In [30]:
columns = ['Timestamp', 'CAN ID', 'DLC'] + [f'DATA[{i}]' for i in range(8)] + ['Label']
df = pd.DataFrame(data, columns=columns)
df.to_csv("data/attack_free_sample.csv", index=False)


In [32]:
df = pd.read_csv("data/attack_free_sample.csv")
df.head()

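|   | Timestamp | CAN ID | DLC | DATA[0] | DATA[1] | DATA[2] | DATA[3] | DATA[4] | DATA[5] | DATA[6] | DATA[7] | Label |
|---|-----------|--------|-----|---------|---------|---------|---------|---------|---------|---------|---------|-------|
| 0 | 0.0 | 790 | 8 | 5 | 32 | 234 | 10 | 32 | 26 | 0 | 127 | 0 |
| 1 | 0.000224 | 809 | 8 | 215 | 167 | 127 | 140 | 17 | 47 | 0 | 16 | 0 |
| 2 | 0.000462 | 128 | 8 | 0 | 23 | 234 | 10 | 32 | 26 | 32 | 67 | 0 |
| 3 | 0.000704 | 129 | 8 | 127 | 132 | 96 | 0 | 0 | 0 | 0 | 83 | 0 |
| 4 | 0.000878 | 288 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |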


In [34]:
df.isna().sum()

Timestamp    0
CAN ID       0
DLC          0
DATA[0]      0
DATA[1]      0
DATA[2]      0
DATA[3]      0
DATA[4]      0
DATA[5]      0
DATA[6]      0
DATA[7]      0
Label        0
dtype: int64

In [37]:
df["Label"].value_counts()

Label
0    95746
Name: count, dtype: int64

In [44]:
import re
import pandas as pd

# Read the raw impersonation-attack log
with open("data/Impersonation_attack_dataset.txt", "r") as f:
    lines = f.readlines()

rows = []

for line in lines:
    match = re.match(
        r"Timestamp:\s+([\d.]+)\s+ID:\s+([0-9a-fA-F]+)\s+\d+\s+DLC:\s+(\d+)\s+((?:[0-9a-fA-F]{2}[\s]+){1,8})",
        line
    )
    if match:
        timestamp = float(match.group(1))
        can_id = match.group(2)
        dlc = int(match.group(3))
        data_bytes = match.group(4).strip().split()

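        # Pad if < 8 bytes
        while len(data_bytes) < 8:
            data_bytes.append("00")

        # NOTE: every row in this file gets Label = 1, including messages outside
        # the attack window; the ID and payload bytes are kept as hex strings
        rows.append([timestamp, can_id, dlc] + data_bytes[:8] + [1])

# Define column names
# NOTE: these differ from the attack-free CSV ('CAN ID', 'DATA[0]'..'DATA[7]')
cols = ["Timestamp", "ID", "DLC"] + [f"Byte{i}" for i in range(1, 9)] + ["Label"]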

# Create DataFrame
df = pd.DataFrame(rows, columns=cols)


In [45]:
df.head()

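|   | Timestamp | ID | DLC | Byte1 | Byte2 | Byte3 | Byte4 | Byte5 | Byte6 | Byte7 | Byte8 | Label |
|---|-----------|----|-----|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 0 | 1481193000.0 | 0587 | 8 | 0 | 00 | 00 | 00 | 00 | 0 | 00 | 01 | 1 |
| 1 | 1481193000.0 | 0316 | 8 | 5 | 1c | 6a | 0a | 1c | 13 | 00 | 7f | 1 |
| 2 | 1481193000.0 | 018f | 8 | 0 | 21 | 1c | 00 | 00 | 43 | 00 | 00 | 1 |
| 3 | 1481193000.0 | 0260 | 8 | 5 | 1c | 00 | 30 | ff | 93 | 63 | 35 | 1 |
| 4 | 1481193000.0 | 0080 | 8 | 0 | 17 | 6a | 0a | 1c | 13 | 1c | 1f | 1 |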


In [46]:
# Without index=False the row index is also written and comes back as an 'Unnamed: 0' column
df.to_csv("data/Impersonation_attack_dataset_sample.csv")


In [47]:
df = pd.read_csv("data/Impersonation_attack_dataset_sample.csv")
df.head()

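|   | Unnamed: 0 | Timestamp | ID | DLC | Byte1 | Byte2 | Byte3 | Byte4 | Byte5 | Byte6 | Byte7 | Byte8 | Label |
|---|------------|-----------|----|-----|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 0 | 0 | 1481193000.0 | 0587 | 8 | 0 | 00 | 00 | 00 | 00 | 0 | 00 | 01 | 1 |
| 1 | 1 | 1481193000.0 | 0316 | 8 | 5 | 1c | 6a | 0a | 1c | 13 | 00 | 7f | 1 |
| 2 | 2 | 1481193000.0 | 018f | 8 | 0 | 21 | 1c | 00 | 00 | 43 | 00 | 00 | 1 |
| 3 | 3 | 1481193000.0 | 0260 | 8 | 5 | 1c | 00 | 30 | ff | 93 | 63 | 35 | 1 |
| 4 | 4 | 1481193000.0 | 0080 | 8 | 0 | 17 | 6a | 0a | 1c | 13 | 1c | 1f | 1 |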


In [49]:
df.isna().sum()

Unnamed: 0    0
Timestamp     0
ID            0
DLC           0
Byte1         0
Byte2         0
Byte3         0
Byte4         0
Byte5         0
Byte6         0
Byte7         0
Byte8         0
Label         0
dtype: int64
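
The extra `Unnamed: 0` column in the output above comes from saving the DataFrame with its index. A small round trip through an in-memory buffer shows the difference. The two-row frame is made up for illustration.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({"ID": ["0587", "018f"], "Label": [1, 1]})

# Default: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(df_demo.to_csv()))

# index=False: only the real columns are written
without_index = pd.read_csv(io.StringIO(df_demo.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Passing `index=False` when saving, as the attack-free cell already does, avoids having to drop the column later.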

In [54]:
import pandas as pd

# Load both datasets
df_normal = pd.read_csv("data/attack_free_sample.csv")
df_attack = pd.read_csv("data/Impersonation_attack_dataset_sample.csv")

# Standardize column names
df_normal.columns = df_normal.columns.str.strip()
df_attack.columns = df_attack.columns.str.strip()

# Add Label: 0 for normal, 1 for attack
df_normal['Label'] = 0
df_attack['Label'] = 1

# Columns expected in the combined dataset
required_cols = ['Timestamp', 'CAN ID', 'DLC'] + [f'DATA[{i}]' for i in range(8)] + ['Label']

# Retain only required columns that exist
# NOTE: the attack CSV names its columns 'ID' and 'Byte1'..'Byte8', so they are dropped here
df_normal = df_normal[[col for col in required_cols if col in df_normal.columns]]
df_attack = df_attack[[col for col in required_cols if col in df_attack.columns]]

# Align columns before concatenation (columns missing from a frame are filled with NaN)
df_normal = df_normal.reindex(columns=required_cols)
df_attack = df_attack.reindex(columns=required_cols)

# Combine and shuffle
df_combined = pd.concat([df_normal, df_attack], ignore_index=True)
df_combined = df_combined.sample(frac=1, random_state=42).reset_index(drop=True)

# Save to CSV
df_combined.to_csv("data/combined_can_dataset.csv", index=False)


In [55]:
df = pd.read_csv("data/combined_can_dataset.csv")

In [56]:
df.isna().sum()

Timestamp         0
CAN ID       995472
DLC               0
DATA[0]      995472
DATA[1]      995472
DATA[2]      995472
DATA[3]      995472
DATA[4]      995472
DATA[5]      995472
DATA[6]      995472
DATA[7]      995472
Label             0
dtype: int64
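
The 995472 missing values in `CAN ID` and `DATA[0]` to `DATA[7]` match the attack rows: the impersonation CSV stores those fields as `ID` and `Byte1` to `Byte8`, so the column filter drops them and `reindex` recreates them as NaN. One possible fix, sketched below, is to rename the attack columns and parse the hex strings before combining. The helper name `harmonize_attack_frame` is hypothetical, and the one-row frame is made up for illustration.

```python
import pandas as pd

def harmonize_attack_frame(df_attack):
    """Rename ID/Byte1..Byte8 to CAN ID/DATA[0]..DATA[7] and parse hex text to ints."""
    rename_map = {"ID": "CAN ID"}
    rename_map.update({f"Byte{i + 1}": f"DATA[{i}]" for i in range(8)})
    out = df_attack.rename(columns=rename_map)
    hex_cols = ["CAN ID"] + [f"DATA[{i}]" for i in range(8)]
    for col in hex_cols:
        # Assumes every value is hexadecimal text (or an integer whose digits
        # are the original hex digits after a CSV round trip)
        out[col] = out[col].astype(str).map(lambda s: int(s, 16))
    return out

# Made-up row in the layout produced by the impersonation parsing cell
demo_attack = pd.DataFrame(
    [[1481193000.0, "0164", 8, "00", "1c", "6a", "0a", "1c", "13", "00", "7f", 1]],
    columns=["Timestamp", "ID", "DLC"] + [f"Byte{i}" for i in range(1, 9)] + ["Label"],
)
fixed = harmonize_attack_frame(demo_attack)
print(fixed)
```

Reading the attack CSV with `dtype=str` for these columns would keep pandas from guessing numeric types before the hex conversion.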

In [58]:
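# Drop rows where CAN ID is missing
# NOTE: CAN ID is NaN for every attack row after the earlier reindex, so this removes all of them
df_cleaned = df_combined.dropna(subset=['CAN ID']).copy()  # .copy() avoids SettingWithCopyWarning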

# Convert CAN ID and DATA columns to numeric safely
data_cols = ['CAN ID'] + [f'DATA[{i}]' for i in range(8)]
for col in data_cols:
    df_cleaned.loc[:, col] = pd.to_numeric(df_cleaned[col], errors='coerce')

# Drop remaining NaNs
df_cleaned.dropna(inplace=True)

# Convert to integer type
df_cleaned[data_cols] = df_cleaned[data_cols].astype(int)

# Save final CSV
df_cleaned.to_csv("final_can_dataset.csv", index=False)


In [59]:
df = pd.read_csv("final_can_dataset.csv")

In [62]:
df.head(50)

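|   | Timestamp | CAN ID | DLC | DATA[0] | DATA[1] | DATA[2] | DATA[3] | DATA[4] | DATA[5] | DATA[6] | DATA[7] | Label |
|---|-----------|--------|-----|---------|---------|---------|---------|---------|---------|---------|---------|-------|
| 0 | 23.361013 | 672 | 8 | 98 | 0 | 96 | 157 | 219 | 12 | 186 | 2 | 0 |
| 1 | 43.761616 | 672 | 8 | 98 | 0 | 96 | 157 | 219 | 12 | 186 | 2 | 0 |
| 2 | 6.125777 | 357 | 8 | 17 | 232 | 127 | 0 | 0 | 0 | 4 | 130 | 0 |
| 3 | 5.441761 | 898 | 8 | 64 | 254 | 15 | 0 | 0 | 0 | 0 | 8 | 0 |
| 4 | 0.71709 | 848 | 8 | 5 | 32 | 84 | 121 | 123 | 0 | 0 | 115 | 0 |
| 5 | 43.203403 | 1088 | 8 | 255 | 240 | 0 | 0 | 255 | 88 | 9 | 0 | 0 |
| 6 | 19.911001 | 129 | 8 | 127 | 132 | 96 | 0 | 0 | 0 | 0 | 234 | 0 |
| 7 | 12.418424 | 339 | 8 | 0 | 128 | 16 | 255 | 0 | 255 | 176 | 62 | 0 |
| 8 | 1.350819 | 129 | 8 | 127 | 132 | 96 | 0 | 0 | 0 | 0 | 234 | 0 |
| 9 | 38.400809 | 128 | 8 | 0 | 23 | 130 | 10 | 33 | 26 | 33 | 3 | 0 |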


In [64]:
df["Label"].value_counts()

Label
0    95746
Name: count, dtype: int64

In [65]:
df_normal['Label'] = 0
df_attack['Label'] = 1

df_combined = pd.concat([df_normal, df_attack], ignore_index=True)


In [66]:
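# Note: this still inspects `df` (loaded from final_can_dataset.csv), not the rebuilt `df_combined`
df["Label"].value_counts()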

Label
0    95746
Name: count, dtype: int64

In [69]:
# Note: smaller_attack_data.csv is not created by any cell shown above
df = pd.read_csv("smaller_attack_data.csv")

In [70]:
df.head()

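|   | Timestamp | CAN ID | DLC | DATA[0] | DATA[1] | DATA[2] | DATA[3] | DATA[4] | DATA[5] | DATA[6] | DATA[7] | Label |
|---|-----------|--------|-----|---------|---------|---------|---------|---------|---------|---------|---------|-------|
| 0 | 1481193000.0 | 0316 | 8 | 05 | 1b | 28 | 0a | 1b | 13 | 00 | 7f | 1 |
| 1 | 1481193000.0 | 018f | 8 | 00 | 23 | 18 | 00 | 00 | 3f | 00 | 00 | 1 |
| 2 | 1481193000.0 | 0329 | 8 | 0f | ac | 80 | 8c | 11 | 2c | 00 | 10 | 1 |
| 3 | 1481193000.0 | 0260 | 8 | 05 | 1b | 0 | 30 | ff | 92 | 5f | 2d | 1 |
| 4 | 1481193000.0 | 0329 | 8 | d7 | a9 | 80 | 8c | 11 | 2d | 00 | 10 | 1 |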


In [71]:
df['Label'].value_counts()

Label
1    9137
0     863
Name: count, dtype: int64
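
With 9137 attack rows and 863 normal rows, the classes are heavily imbalanced, which matters for the 90% accuracy target. The sketch below works out what a trivial detector would score on these counts, plus one common reweighting heuristic, `n_samples / (n_classes * count)`, which is the formula behind scikit-learn's `class_weight="balanced"`. Whether reweighting helps on this data is an assumption to be tested, not a result.

```python
# Label counts reported above
n_attack, n_normal = 9137, 863
total = n_attack + n_normal

# A trivial detector that flags every message as an attack
baseline_accuracy = n_attack / total
false_positives = n_normal                   # every normal message is misclassified
false_alarm_rate = false_positives / n_normal

# Inverse-frequency class weights for a two-class problem
class_weights = {
    0: total / (2 * n_normal),
    1: total / (2 * n_attack),
}

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"false alarm rate:  {false_alarm_rate:.2f}")
print(class_weights)
```

The always-attack baseline already exceeds 90% accuracy while raising a false alarm on every normal message, so any trained model should be compared against it.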