RbSQLi-Dataset is a comprehensive machine learning project focused on SQL Injection (SQLi) detection and classification. This repository contains algorithms and datasets for identifying multiple types of SQL injection attacks using various machine learning models.
The project implements both binary classification (malicious vs. non-malicious) and multiclass classification (identifying specific types of SQL injection attacks) using state-of-the-art machine learning techniques.
SQL Injection is a cyber attack where malicious SQL code is inserted into application queries, potentially allowing attackers to:
- Access unauthorized data
- Modify or delete database information
- Execute administrative operations on the database
- In some cases, access the operating system
This project helps identify and classify these attacks automatically using machine learning.
- Multiple SQL Injection Types: Detects 6 different types of SQL injection attacks
- Dual Classification Modes: Both binary and multiclass classification
- Multiple ML Algorithms: Implements 5 different machine learning models
- Cross-Validation: Robust model evaluation using cross-validation
- Comprehensive Analysis: Detailed exploratory data analysis (EDA)
- Professional Implementation: Ready-to-use Jupyter notebooks for research and production
RbSQLi-Dataset/
β
βββ π EDA/
β βββ Total_Appearance_of_SQL_Keywords.ipynb # Exploratory Data Analysis
β
βββ π§ Model Train/
β βββ Stratified_Model_Train.ipynb # Multiclass model training
β βββ model_comparison_nb_rf_knn_lr.ipynb # Model performance comparison
β
βββ β
cross_validation_binary_sql_injection/
βββ cross_validation_nv_rf_svc_dataset_01.ipynb # Binary classification CV
βββ cross_validation_nv_rf_svc_dataset_02.ipynb # Binary classification CV
βββ cross_validation_nv_rf_svc_dataset_03.ipynb # Binary classification CV
The system can identify 7 different types of SQL injection attacks:
- Exploits database error messages to extract information
- Forces the database to output error messages containing sensitive data
- Uses database delay functions (like
SLEEP()
,WAITFOR
) - Infers data based on response time delays
- Uses UNION SQL operator to combine results from multiple SELECT statements
- Extracts data from different database tables
- Uses TRUE/FALSE conditions to infer database information
- Analyzes application responses to determine data validity
- Executes multiple SQL statements in a single query
- Uses semicolons to separate multiple SQL commands
- Exploits database metadata and system information
- Targets database schema and configuration data
- Legitimate, non-malicious SQL queries
- Used as the "safe" class in classification
The project implements and compares 5 different ML algorithms:
Model | Type | Best Accuracy | Key Features |
---|---|---|---|
SVM (Support Vector Classifier) | Linear/Non-linear | 86.80% | Best overall performance, robust to overfitting |
Logistic Regression | Linear | 85.00% | Fast training, interpretable results |
K-Nearest Neighbors (KNN) | Instance-based | 84.43% | Simple, effective for similar patterns |
Naive Bayes | Probabilistic | 79.33% | Fast, works well with text features |
Random Forest | Ensemble | 77.37% | Good for feature importance analysis |
What it does: Analyzes the dataset to understand SQL injection patterns and keyword frequencies.
Key outputs:
- SQL keyword frequency analysis
- Dataset distribution statistics
- Pattern recognition in malicious queries
What it does: Trains multiple machine learning models for SQL injection detection.
Files:
Stratified_Model_Train.ipynb
: Implements stratified sampling for balanced multiclass training (70%-10%-20% split)model_comparison_nb_rf_knn_lr.ipynb
: Compares 5 different ML models with detailed performance metrics
Key features:
- Balanced dataset sampling
- TF-IDF vectorization with 50,000 features
- Comprehensive model evaluation
What it does: Performs cross-validation for binary classification (malicious vs. non-malicious).
Files:
- Dataset 01: Training on RbSQLi dataset, testing on external SQL injection dataset
- Dataset 02: Cross-dataset validation for generalization testing
- Dataset 03: Additional validation on different dataset splits
Key features:
- 5-fold cross-validation
- External dataset validation
- Balanced sampling (9,000 malicious + 9,000 normal for training)
- Data Loading: Handles CSV datasets with SQL queries and labels
- Text Preprocessing: Cleans and normalizes SQL query strings
- Feature Extraction: Uses TF-IDF vectorization (1-2 n-grams, 50k features)
- Stratified Sampling: Maintains class distribution across train/validation/test splits
- Model Training: Implements multiple ML algorithms with hyperparameter tuning
- Evaluation: Comprehensive metrics including accuracy, precision, recall, F1-score, AUC-ROC
- Training Dataset: 18,000 samples (balanced)
- Validation Dataset: ~10% of total data
- Test Dataset: ~20% of total data
- Feature Engineering: TF-IDF with 50,000 max features, 1-2 n-grams
- Cross-Validation: 5-fold stratified cross-validation
Model Performance Summary:
βββββββββββββββββββββββ¬βββββββββββββββ¬ββββββββββββββ¬βββββββββββββββ
β Model β Accuracy β AUC-ROC β mAP Score β
βββββββββββββββββββββββΌβββββββββββββββΌββββββββββββββΌβββββββββββββββ€
β SVM β 86.80% β 0.9843 β 0.9113 β
β Logistic Regression β 85.00% β 0.9756 β 0.8956 β
β KNN β 84.43% β 0.9689 β 0.8789 β
β Naive Bayes β 79.33% β 0.9234 β 0.8234 β
β Random Forest β 77.37% β 0.9123 β 0.7998 β
βββββββββββββββββββββββ΄βββββββββββββββ΄ββββββββββββββ΄βββββββββββββββ
- Training Dataset: 18,000 samples (9k malicious + 9k normal)
- Test Dataset: 2,000 samples (1k malicious + 1k normal)
- Best Model: Random Forest with optimized hyperparameters
- Cross-Validation: 5-fold CV for robust evaluation
# Required libraries
pandas
scikit-learn
numpy
matplotlib
seaborn
jupyter
-
Clone the repository:
git clone https://github.com/RbSQLi-Dataset/RbSQLi-Dataset.git cd RbSQLi-Dataset
-
Install dependencies:
pip install pandas scikit-learn numpy matplotlib seaborn jupyter
-
Run the notebooks:
- Start with
/EDA/Total_Appearance_of_SQL_Keywords.ipynb
for data exploration - Use
/Model Train/Stratified_Model_Train.ipynb
for multiclass classification - Try
/cross_validation_binary_sql_injection/
for binary classification
- Start with
The notebooks are configured to work with:
- Primary Dataset:
sql_injectiondataset_final_updated.csv
(RbSQLi dataset) - External Dataset: Various SQL injection datasets from Kaggle
- Format: CSV files with columns for SQL queries and labels
- Academic Research: Study SQL injection patterns and detection methods
- Comparative Analysis: Benchmark different ML models on SQL injection detection
- Feature Engineering: Analyze which SQL keywords are most indicative of attacks
- Web Application Security: Integrate models into web application firewalls
- Database Security: Monitor database queries for suspicious patterns
- Incident Response: Classify and prioritize SQL injection attempts
- Code Security: Validate SQL queries in applications
- Security Testing: Test applications against various SQL injection types
- Educational: Learn about SQL injection patterns and detection
This project is perfect for:
- Students learning about cybersecurity and machine learning
- Researchers studying SQL injection detection techniques
- Developers wanting to understand SQL injection vulnerabilities
- Security professionals looking for automated detection methods
- TF-IDF Vectorization: Converts SQL queries to numerical features
- N-grams: Uses 1-2 word combinations for better context understanding
- Max Features: 50,000 most important features selected
- Normalization: L2 normalization for consistent scaling
- Stratified Sampling: Maintains original class distribution
- Train/Validation/Test Split: 70%/10%/20% ratio
- Cross-Validation: 5-fold stratified CV for model selection
- Hyperparameter Tuning: Grid search for optimal parameters
This dataset and methodology can be used for:
- Cybersecurity Research: Understanding SQL injection attack patterns
- Machine Learning Studies: Comparing text classification algorithms
- Security Tool Development: Building automated detection systems
- Academic Projects: Teaching cybersecurity and ML concepts
Contributions are welcome! Areas for improvement:
- Additional SQL injection types
- More sophisticated feature engineering
- Deep learning implementations
- Real-time detection capabilities
- Web interface for easy usage
For questions, suggestions, or collaborations regarding this SQL injection detection project, please:
- Open an issue on GitHub
- Fork the repository for contributions
- Share your research findings using this dataset
π‘οΈ Protecting databases through intelligent SQL injection detection using machine learning.
This project demonstrates the power of combining cybersecurity knowledge with machine learning techniques to create effective defense mechanisms against SQL injection attacks.