# 🔐 AI-Powered Data Breach Detection
This notebook is a personal (and public) logbook that helps me, and you, understand how this project works step by step: from data preprocessing and model training to Streamlit deployment with Docker, etc.

WHY?
So that I can better understand and keep track of what I did on this project.


## 1️⃣ Why this Project?
While scrolling through LinkedIn one day, I came across a news article reporting a major security breach involving the leakage of sensitive information of millions of users. This incident sparked a question in my mind:
"What if I could help detect such malicious activities even before the damage is done using machine learning?"

At the time, I was going through the Machine Learning Specialization by Stanford & DeepLearning.AI, where I learned several techniques for building ML models based on different types of data and use cases. That’s when I decided to work on a project that not only addresses a real-world problem but also helps me land an internship.

After brainstorming ideas, both on my own and with the help of ChatGPT, I landed on the concept of an AI/ML-Powered Data Breach Detection System:
A system that analyzes network traffic to detect suspicious patterns and flag them as potential security threats.

This project is my first major step in combining Machine Learning with Cybersecurity, and it has become one of my most impactful beginner-friendly projects so far.

## 2️⃣  Project Overview
**Dataset:** UNSW-NB15
* I used a real-world, publicly available dataset created with the IXIA PerfectStorm tool in the Cyber Range Lab of UNSW Canberra. It is a hybrid of real modern normal activity and synthetic contemporary attack behaviours, and it consists of nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms.

* This dataset simulates how network traffic appears in real life and contains features like Source IP, Destination IP, Ports, Protocol, etc.

* WHY?
It is realistic and modern, with rich features and many types of attacks.

**Goal:** Predict whether a network traffic instance is malicious or normal
* This is the first project I have ever deployed, so I wanted it to be effective yet simple. That is why it only classifies network traffic as malicious or normal.

* The basic goal was to analyze incoming traffic features for suspicious activity or irregular network data using machine learning models.

**Models Used:** Random Forest, XGBoost, MLPClassifier

* Now you might ask why I used three different models.
Why not?
When I explore LinkedIn, I often see very basic machine learning projects, and I did not want to do that.
Since I had learned about RF, XGB, and MLP in my courses, why not train all three and compare how fast and efficient they are for my use case?

* I wanted to see how these models differ in how they work, how they train, and the results they produce.

**Deployment:** Streamlit

* Currently deployed only on Streamlit, but I will soon deploy it using Docker as well.

WHY?
Streamlit is very easy to use.
I used AI to implement the frontend, as I know very little about frontend development [Soon I will master it too :)]


--------------------------------------------------------------------------------------------------

## Now step by step we will discuss my entire project in detail!

## 3️⃣ Data Preprocessing
The first step of building this project was to deeply explore and understand the dataset I had.
This was done using a Jupyter notebook called `01_data_eda.ipynb` in the Notebooks folder.

➡️ Goals using EDA:
1. Understanding the rows, columns, and data types.
2. Identifying any **missing values**, **duplicate rows**, and **non-numeric** features.
3. Preparing the dataset for preprocessing and modeling.

➡️ What did I do in the EDA?
1. **Loaded the Dataset**
   - Combined `UNSW-NB15_1.csv` and `UNSW-NB15_2.csv` using `pd.concat()`.

2. **Initial Exploration**
   - Used `df.info()` and `df.describe()` to understand structure and summary stats.
   - Checked number of `NaNs`, `nulls`, and `duplicates`.

3. **Label Analysis**
   - Counted the number of Normal (0) vs Malicious (1) rows.
   - Found the dataset to be **imbalanced**.
   - Here there were more **Normal** rows than **Malicious** rows.

4. **Attack Category Breakdown**
   - Explored the `attack_cat` column to see which attacks are most common.
   - Example: Reconnaissance, Exploits, Fuzzers were the top categories.

5. **Feature Types**
   - Identified **categorical** vs **numerical** features.
   - Noted key categorical features: `proto`, `service`, `state`.

6. **Decision**
   - Removed identifier columns like `srcip`, `sport`, `dstip`, `dsport` (not useful for our prediction).
   - Marked `proto`, `service`, and `state` for encoding.
   - Planned feature selection based on correlation and missing values.
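
The EDA steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the two small DataFrames are invented stand-ins for `UNSW-NB15_1.csv` and `UNSW-NB15_2.csv`, and only a handful of the real columns are included.

```python
import pandas as pd

# Tiny invented frames standing in for UNSW-NB15_1.csv and UNSW-NB15_2.csv.
# With the real files, each frame would come from pd.read_csv(...) instead.
part1 = pd.DataFrame({
    "srcip": ["59.166.0.0", "59.166.0.6", "59.166.0.6", "175.45.176.1"],
    "sport": [1390, 33661, 33661, 13284],
    "dstip": ["149.171.126.6", "149.171.126.9", "149.171.126.9", "149.171.126.18"],
    "dsport": [53, 1024, 1024, 80],
    "proto": ["udp", "tcp", "tcp", "tcp"],
    "service": ["dns", "-", "-", "http"],
    "state": ["CON", "FIN", "FIN", "FIN"],
    "sbytes": [132, 528, 528, 19202],
    "attack_cat": [None, None, None, "Exploits"],
    "label": [0, 0, 0, 1],
})
part2 = pd.DataFrame({
    "srcip": ["59.166.0.3", "59.166.0.8", "175.45.176.2"],
    "sport": [49664, 32119, 21223],
    "dstip": ["149.171.126.0", "149.171.126.4", "149.171.126.16"],
    "dsport": [53, 111, 32780],
    "proto": ["udp", "udp", "udp"],
    "service": ["dns", "-", "-"],
    "state": ["CON", "CON", "INT"],
    "sbytes": [146, 568, 168],
    "attack_cat": [None, None, "Fuzzers"],
    "label": [0, 0, 1],
})

# 1. Combine both parts into one frame.
df = pd.concat([part1, part2], ignore_index=True)

# 2. Initial exploration: structure, missing values, duplicates.
df.info()
missing_per_column = df.isna().sum()
n_duplicates = int(df.duplicated().sum())

# 3. Label balance and 4. attack category breakdown.
label_counts = df["label"].value_counts()
attack_counts = df["attack_cat"].value_counts()

# 6. Drop identifier columns, then 5. list the remaining non-numeric features.
ID_COLUMNS = ["srcip", "sport", "dstip", "dsport"]
features = df.drop(columns=ID_COLUMNS)
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

print(label_counts)
print(attack_counts)
print("Duplicates:", n_duplicates)
print("Categorical columns:", categorical_cols)
```

On the toy data, `attack_cat` is empty for normal rows, which is why it shows up in the missing-value count even though nothing is actually wrong with those rows.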


➡️ Why was EDA important?
- To find Dirty/messy columns
- To find Redundant or identifier columns
- To figure out Imbalanced label distribution
- To figure out the exact steps needed in the preprocessing pipeline

In the same notebook I also trained my model using RF, which I will explain in the next section.

After completing EDA, it was time to create the `data_processing.py` file, where I performed loading, cleaning, encoding, and splitting of the dataset.

It does the same work as the notebook, just in a more modular way, so we will move directly to the model file.
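
To give a feel for the cleaning and encoding step, here is a minimal sketch. It assumes label encoding for `proto`, `service`, and `state` and median filling for numeric columns. The helper name `clean_and_encode` and the toy data are hypothetical; the real `data_processing.py` functions may be organised differently.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

CATEGORICAL_COLS = ["proto", "service", "state"]


def clean_and_encode(df):
    """Hypothetical helper: encode categorical columns and fill numeric gaps."""
    df = df.copy()
    encoders = {}
    for col in df.columns:
        if col in CATEGORICAL_COLS:
            # Blank or missing categories become an explicit "unknown" token.
            values = df[col].fillna("").astype(str).str.strip().replace("", "unknown")
            encoder = LabelEncoder()
            df[col] = encoder.fit_transform(values)
            encoders[col] = encoder  # kept so new data can be encoded the same way
        else:
            # Unparseable values (such as blank strings) become NaN,
            # then NaN is filled with the column median.
            numeric = pd.to_numeric(df[col], errors="coerce")
            df[col] = numeric.fillna(numeric.median())
    return df, encoders


# Invented sample rows with a blank string and missing values.
raw = pd.DataFrame({
    "proto": ["tcp", "udp", "tcp", "tcp"],
    "service": ["http", " ", "dns", None],
    "state": ["FIN", "CON", "FIN", "INT"],
    "sbytes": [100, " ", 300, None],
    "label": [0, 0, 1, 0],
})

clean, encoders = clean_and_encode(raw)
print(clean)
print({col: list(enc.classes_) for col, enc in encoders.items()})
```

Returning the fitted encoders alongside the cleaned frame matters later: the prediction step has to map categories to the same integers that were used in training.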




## 4️⃣ Model Training and Evaluation
Now that I have completed the data cleaning, encoding, and class balancing, the next step is to train a machine learning model.

I used a **Random Forest Classifier** – an ensemble learning method that combines multiple decision trees and is known for its performance and robustness in handling structured data like this network traffic dataset.

### Goals:
1. Split the dataset into training and testing sets.
2. Balance the training data using SMOTE.
3. Train a Random Forest Classifier on the balanced data.
4. Evaluate the model using accuracy, precision, recall, and F1-score.
5. Save the trained model, encoders, and feature columns for deployment.
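
As a quick reference for goal 4, the four metrics can all be derived from the confusion-matrix counts. The sketch below uses invented counts purely for illustration (they are not results from this project), and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of everything flagged as malicious, how much really was?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all truly malicious traffic, how much did we catch?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented counts: 80 attacks caught, 20 false alarms, 10 attacks missed, 890 normal rows passed.
metrics = binary_metrics(tp=80, fp=20, fn=10, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On an imbalanced dataset, accuracy alone can look good even when many attacks are missed, which is why precision, recall, and F1 are tracked as well.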

The file `src/Model.py` contains the entire model training logic for my project using RF.

It imports these main libraries:
1. Pandas / NumPy: Data manipulation.
2. RandomForestClassifier: The machine learning algorithm.
3. classification_report / confusion_matrix: To evaluate the model's performance.
4. joblib: For saving models and encoders.
5. SMOTE: To handle class imbalance.

* First I defined the column names needed to train the model. Then I loaded the datasets UNSW-NB15_1 and UNSW-NB15_2 and processed the data using custom functions such as load_data, preprocess_dataframe, and split_and_clean, imported from our data_processing.py file to keep the pipeline modular.

* Next I converted empty strings to NaN and filled all missing values with each column's median.

* Then the data was separated into features (X) and labels (Y) and split into training and testing sets: **X_train**, **Y_train**, **X_test**, **Y_test**.

* Next I applied SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic samples of the minority class to balance the training data. This was required in my case for good model training.

* Then it was time to train the model using RandomForestClassifier with 100 decision trees (estimators).

* Then I saved the names and order of the feature columns used during training so that new data can be processed in the same way.

* The last part of the model file evaluates the model: it makes predictions on the test set and reports how well the model performs using accuracy, precision, recall, and F1 score.
This evaluation is also included in our notebook, where the confusion matrix, heatmap, and classification report are presented.
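
The training flow described above can be sketched as follows. Several things here are assumptions for the sake of a runnable example: the data is synthetic (from `make_classification`), the 80/20 stratified split and the random seeds are my own choices, and `oversample_minority` is a simplified stand-in for SMOTE that interpolates between random pairs of minority samples. Real SMOTE interpolates toward one of the k nearest minority neighbours; with imbalanced-learn installed, that step would be `SMOTE().fit_resample(X_train, y_train)`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the encoded UNSW-NB15 features.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# Split first, so the test set keeps its natural class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


def oversample_minority(X, y, rng):
    """Simplified SMOTE-style oversampling (random partner, not k-nearest neighbour)."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_new = counts.max() - counts.min()
    X_min = X[y == minority]
    base = X_min[rng.integers(0, len(X_min), n_new)]
    partner = X_min[rng.integers(0, len(X_min), n_new)]
    gap = rng.random((n_new, 1))
    # Each synthetic point lies on the line segment between two minority samples.
    synthetic = base + gap * (partner - base)
    X_bal = np.vstack([X, synthetic])
    y_bal = np.concatenate([y, np.full(n_new, minority)])
    return X_bal, y_bal


rng = np.random.default_rng(42)
X_bal, y_bal = oversample_minority(X_train, y_train, rng)

# Random Forest with 100 trees, trained on the balanced training data only.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_bal, y_bal)

# Evaluate on the untouched test set.
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["Normal", "Malicious"]))
```

Oversampling is applied only to the training split. Balancing before the split would leak synthetic copies of test-like rows into training and inflate the scores.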

## 5️⃣ Model Inference
It's time to create a new file called `predict.py`, which runs our model on new data that we provide to predict whether the traffic is malicious or normal.

* Objective
- Load the saved model, encoders, and training feature columns.
- Preprocess a new CSV file or a dummy sample input.
- Generate predictions for whether the traffic is **Normal** or **Malicious**.
- Save the results with prediction labels and confidence scores.

The input can be a CSV file. If no CSV file is uploaded, a few default values that I have set are used to run the model, just for the sake of demonstration.

* Understanding the Prediction Pipeline
The prediction process includes the following steps:
1. **Loading model assets**: RandomForest model, label encoders, and feature columns used in training.
2. **Reading new input data**: Either from a user-uploaded CSV or a default dummy traffic sample.
3. **Preprocessing data**: Ensuring input data is formatted exactly as expected (just like our training data).
4. **Making predictions**: Using the model to classify each input row as 'Normal' or 'Malicious'.
5. **Saving predictions**: Results are saved to `prediction_output.csv`.

We keep the same pipeline used in training to ensure consistency and avoid input mismatch errors.
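
A minimal sketch of this prediction flow is shown below. The toy model, the asset file names (`model.pkl`, `feature_columns.pkl`), and the helper `predict_traffic` are assumptions made for illustration; only `prediction_output.csv` comes from the description above. Categorical encoding with the saved encoders is assumed to have happened already, so the toy `proto` column is numeric.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# --- Training side (toy data): save the model and the feature column order ---
train = pd.DataFrame({
    "sbytes": [100, 150, 120, 90, 9000, 12000, 15000, 11000],
    "dbytes": [200, 180, 220, 210, 50, 30, 10, 40],
    "proto": [0, 0, 1, 0, 1, 1, 1, 0],
})
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = Normal, 1 = Malicious

asset_dir = Path(tempfile.mkdtemp())
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(train, labels)
joblib.dump(model, asset_dir / "model.pkl")
joblib.dump(list(train.columns), asset_dir / "feature_columns.pkl")


# --- Inference side ---
def predict_traffic(new_data, asset_dir):
    """Hypothetical helper: load saved assets and label each row of new_data."""
    model = joblib.load(asset_dir / "model.pkl")
    feature_columns = joblib.load(asset_dir / "feature_columns.pkl")

    # Same columns in the same order as training: extra columns are dropped,
    # columns missing from the input are filled with 0.
    X = new_data.reindex(columns=feature_columns, fill_value=0)

    predictions = model.predict(X)
    probabilities = model.predict_proba(X)

    result = new_data.copy()
    result["prediction"] = ["Malicious" if p == 1 else "Normal" for p in predictions]
    result["confidence"] = probabilities.max(axis=1)  # probability of the predicted class
    return result


# New rows arrive with a different column order and an extra identifier column.
new_rows = pd.DataFrame({
    "srcip": ["10.0.0.5", "10.0.0.9"],
    "proto": [0, 1],
    "dbytes": [190, 20],
    "sbytes": [110, 13000],
})

result = predict_traffic(new_rows, asset_dir)
result.to_csv(asset_dir / "prediction_output.csv", index=False)
print(result)
```

Saving the column list next to the model is what makes the `reindex` call possible: the model only sees positions, so a reordered CSV would otherwise be scored against the wrong features.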

* Conclusion

We successfully:
- Loaded our trained ML model and encoders
- Preprocessed our new network traffic data
- Predicted whether the traffic is Normal or Malicious
- Saved the results for further analysis

This forms our final **prediction pipeline**, which is ready for production or integration into dashboards like Streamlit.


## 6️⃣ Multiple Model Training
We will now train two more models, MLPClassifier and XGBoost, alongside Random Forest.
We will compute and compare:
- 🎯 Random Forest
- 🚀 XGBoost
- 🧠 MLPClassifier (Neural Network)

We aim to:
- Train each model on the UNSW-NB15 dataset.
- Evaluate and compare their performance using metrics like Accuracy, F1 Score, and Confusion Matrix.
- Visualize results and perform SHAP-based model explainability.
- Save all models and metrics for use in prediction/deployment.
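
A comparison loop along these lines could look like the sketch below. It uses synthetic data and only the two scikit-learn models so that it runs without extra dependencies; the hyperparameters are arbitrary choices, not the project's settings. XGBoost's `XGBClassifier` follows the same `fit`/`predict` interface, so it could be added to the `models` dictionary in the same way when the `xgboost` package is installed.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the preprocessed UNSW-NB15 features.
X, y = make_classification(n_samples=1500, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "MLPClassifier": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
}

results = {}
for name, model in models.items():
    # Time only the training step so the models can be compared on speed.
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "train_seconds": train_seconds,
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }

for name, row in results.items():
    print(
        f"{name}: accuracy={row['accuracy']:.3f} "
        f"f1={row['f1']:.3f} train_time={row['train_seconds']:.2f}s"
    )
```

Keeping every model behind the same loop means each one sees the identical split and is scored with the identical metrics, so differences in the table come from the models rather than from the setup.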


## ✅ Conclusion

We successfully trained and evaluated three ML models:
- 🌲 Random Forest
- 🔥 XGBoost
- 🧠 MLP Neural Net

**XGBoost** and **Random Forest** both achieved very high accuracy and F1-score, but XGBoost took the least time to train, making it the best fit for our use case.

We also:
- Visualized metrics
- Interpreted predictions using SHAP
- Saved all trained models and metrics for deployment




## 7️⃣ SHAP Explainability
* What is SHAP Explainability?
SHAP (SHapley Additive exPlanations) is a technique to explain how a machine learning model makes decisions.
Basically, SHAP shows which features (like Sload, dbytes, state, etc.) influenced our model to predict a sample as malicious or normal, and by how much.

* WHY SHAP?
- Helps to understand which features are most useful
- Increases the ability to debug our model
- Provides better understanding of the model

It is available on our Streamlit dashboard as well as in our second notebook, where we performed advanced EDA and SHAP explainability.
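
To make the idea behind SHAP concrete, here is a small sketch that computes exact Shapley values by brute force for a toy scoring function. This is not the `shap` library and not the project's model: `risk_score`, its weights, and the sample and baseline values are all invented, and the library uses much faster algorithms than enumerating every feature subset.

```python
from itertools import combinations
from math import factorial

FEATURES = ["Sload", "dbytes", "state"]


def risk_score(x):
    """Invented toy model: a weighted sum plus one interaction term."""
    return 0.5 * x["Sload"] + 0.2 * x["dbytes"] + 0.3 * x["state"] * x["Sload"]


def shapley_values(model, sample, baseline):
    """Exact Shapley values: average marginal contribution over all feature subsets."""
    n = len(FEATURES)

    def value(subset):
        # Features in the subset take the sample's value; the rest stay at baseline.
        mixed = {f: (sample[f] if f in subset else baseline[f]) for f in FEATURES}
        return model(mixed)

    contributions = {}
    for feature in FEATURES:
        others = [f for f in FEATURES if f != feature]
        total = 0.0
        for size in range(len(others) + 1):
            # Shapley weight for subsets of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {feature})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        contributions[feature] = total
    return contributions


sample = {"Sload": 4.0, "dbytes": 1.0, "state": 1.0}
baseline = {"Sload": 1.0, "dbytes": 0.0, "state": 0.0}

phi = shapley_values(risk_score, sample, baseline)
for feature, contribution in phi.items():
    print(f"{feature}: {contribution:+.3f}")
print("Sum of contributions:", round(sum(phi.values()), 3))
print("Score difference:    ", round(risk_score(sample) - risk_score(baseline), 3))
```

The contributions always add up to the difference between the sample's score and the baseline score. That additivity is what lets a SHAP plot be read as "this feature pushed the prediction up or down by this much".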

## 8️⃣ Streamlit & App Logic
What is Streamlit?
Streamlit is an open-source Python library that helps us create interactive web apps for any project, using just Python!

🔹 With Streamlit, we don't need to learn HTML/CSS/JavaScript.
🔹 We just write Python scripts, and Streamlit handles the UI.

Streamlit was an easy option for me, as I had no idea how I would make a dashboard.
ChatGPT helped me a lot by creating the entire frontend, making this project beautifully possible.

🔹 How to use my APP?
- Upload a CSV file
- Choose one of the models from the sidebar
- Visualize the predictions (bar, pie, histogram)
- View the SHAP plot and performance metrics
- Download the predictions


## Deployment on Streamlit Cloud
- GitHub Repo: https://github.com/IbhavMalviya/AI-Data-Breach-System
- Streamlit detects `App/dashboard.py`
- App runs on: https://ai-data-breach-system.streamlit.app


## 9️⃣ Final Thoughts
It was surely an exciting project to make, going from knowing nothing about machine learning to actually building a project from scratch.
The internet helped me a lot to learn things that I was afraid to learn before.
It surely took time for me to make this project, but it was all worth it.
Well, on to the next adventure of a new project, I guess.
See you soon :)

---
### Project by: **Ibhav Malviya**
[![LinkedIn](https://img.shields.io/badge/LinkedIn-blue?logo=linkedin)](https://linkedin.com/in/ibhavmalviya)
[![GitHub](https://img.shields.io/badge/GitHub-black?logo=github)](https://github.com/IbhavMalviya)