# Network Anomaly Detection

This project loads network traffic data, preprocesses it, reduces its dimensionality with an autoencoder, and trains multiple classifiers (KNN, Random Forest, Logistic Regression, SVM) for anomaly detection.
This project implements a machine learning pipeline to detect network anomalies using various classification algorithms. The dataset used for this project contains network traffic data with labeled anomalies.
## Table of Contents

- Installation
- Usage
- Project Structure
- Data
- Exploratory Data Analysis
- Data Preprocessing
- Dimensionality Reduction
- Model Training and Evaluation
- Results
- Contributing
- License
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/HayatiYrtgl/Data_analysis_ml_methods_autoencoder.git
   ```
2. Create and activate a virtual environment (optional but recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```
3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```
## Usage

1. Ensure your dataset is placed in the appropriate directory, as specified in the code (`../dataset/network_anomaly_detection/all_data (3).csv`).
2. Run the Jupyter notebook:

   ```bash
   jupyter notebook
   ```

3. Open the notebook and execute the cells to run the entire pipeline.
## Project Structure

```
network-anomaly-detection/
├── dataset/
│   └── network_anomaly_detection/
│       └── all_data (3).csv
├── README.md
├── requirements.txt
└── anomaly_detection.ipynb
```
- `dataset/`: Directory containing the dataset.
- `README.md`: This file.
- `requirements.txt`: List of Python packages required for the project.
- `anomaly_detection.ipynb`: Jupyter notebook containing the code.
## Data

The dataset used for this project contains network traffic data with labeled anomalies. It is loaded from a CSV file located in the `dataset/network_anomaly_detection/` directory.
## Exploratory Data Analysis

Initial data exploration includes:
- Viewing the first and last few rows of the dataset.
- Checking data types and missing values.
- Plotting the distribution of the target variable.
- Visualizing feature correlations.
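These exploration steps can be sketched with pandas. Since the real CSV's columns aren't listed here, a small synthetic frame stands in for it (the column names `duration`, `bytes`, and `label` are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the project's CSV; in the notebook the frame
# would come from pd.read_csv on the dataset file instead.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "duration": rng.random(100),          # numeric feature
    "bytes": rng.integers(0, 1000, 100),  # numeric feature
    "label": rng.integers(0, 2, 100),     # target: 0 = normal, 1 = anomaly
})

print(df.head())                    # first few rows
print(df.tail())                    # last few rows
print(df.dtypes)                    # data types
print(df.isna().sum())              # missing values per column
print(df["label"].value_counts())   # distribution of the target variable
print(df.corr(numeric_only=True))   # feature correlations
```

The correlation matrix is usually visualized as a heatmap (e.g. with `seaborn.heatmap`) rather than printed.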
## Data Preprocessing

Steps include:
- Handling missing values and duplicates.
- Encoding categorical variables.
- Scaling numerical features using MinMaxScaler.
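A minimal sketch of those three steps (the tiny frame and its column names are made up for illustration; `LabelEncoder` and `MinMaxScaler` mirror the description above):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Tiny illustrative frame; the real data has many more columns.
df = pd.DataFrame({
    "protocol": ["tcp", "udp", "tcp", "tcp"],
    "bytes": [100.0, 250.0, 100.0, 900.0],
    "label": ["normal", "anomaly", "normal", "normal"],
})

df = df.drop_duplicates().dropna()  # drop duplicate rows and missing values

# Encode categorical columns as integers
df["protocol"] = LabelEncoder().fit_transform(df["protocol"])
df["label"] = LabelEncoder().fit_transform(df["label"])

# Scale numeric features into [0, 1]
df[["bytes"]] = MinMaxScaler().fit_transform(df[["bytes"]])
print(df)
```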
## Dimensionality Reduction

An autoencoder is used for dimensionality reduction: the network is trained to reconstruct its input, and the compressed bottleneck representation retains the most informative features.
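The notebook presumably builds its autoencoder with a deep-learning library; to keep this sketch dependency-free, here is a minimal *linear* autoencoder trained by plain gradient descent in NumPy. The data, bottleneck size, learning rate, and iteration count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 8))            # 200 samples, 8 features (synthetic)

k = 3                               # bottleneck: target dimensionality
W_enc = rng.normal(0, 0.1, (8, k))  # encoder weights
W_dec = rng.normal(0, 0.1, (k, 8))  # decoder weights
lr = 0.05

for _ in range(500):                # minimize reconstruction error
    Z = X @ W_enc                   # encode to k dimensions
    X_hat = Z @ W_dec               # decode back to 8 dimensions
    err = X_hat - X
    W_dec -= lr * (Z.T @ err) / len(X)
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / len(X)

X_reduced = X @ W_enc               # compressed features for the classifiers
print(X_reduced.shape)              # (200, 3)
```

A real autoencoder adds nonlinear activations and biases and trains with an optimizer such as Adam, but the principle is the same: train to reconstruct, then keep the bottleneck output.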
## Model Training and Evaluation

Four different classifiers are trained and evaluated:
- K-Nearest Neighbors (KNN)
- Random Forest
- Logistic Regression
- Support Vector Machine (SVM)
The performance of each model is assessed using a classification report.
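The training loop might look like the following sketch, shown on synthetic data (in the project it would receive the autoencoder-reduced features; the hyperparameters here are scikit-learn defaults, not the notebook's actual settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the reduced feature matrix and labels
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}

# Fit each model and print its classification report
reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    reports[name] = classification_report(y_test, model.predict(X_test))
    print(f"=== {name} ===")
    print(reports[name])
```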
## Results

The results of the models, including precision, recall, and F1-score, are printed for each classifier.
## Contributing

Contributions are welcome! Please fork the repository and submit a pull request.
## License

This project is licensed under the MIT License. See the LICENSE file for details.