This is a Python-based web application built using Streamlit for performing common data quality tasks such as handling missing values, duplicates, and outliers in datasets. The app also integrates with Ollama for a chatbot interface to interact with the dataset and answer questions using a Retrieval-Augmented Generation (RAG) model.
For those who wish to try the app, you can access it here.
Watch the demo
- Dataset Upload: Upload CSV or Excel files.
- Dataset Info: View detailed dataset information including memory usage and data types.
- Describe Dataset: Get descriptive statistics of the dataset.
- Handle Missing Values: Fill or drop missing values with multiple options.
- Handle Duplicates: Identify and remove duplicate rows.
- Outlier Detection: Identify and handle outliers using various techniques.
- Data Type Conversion: Convert data types, normalize, and transform columns.
- Interactive Plots: Bar plots, pie charts, histograms, box plots, scatter plots, line charts, area charts, and pair plots.
- Correlation Matrices: View correlation between features with heatmaps.
- Distribution Analysis: Analyze data distributions using density and box plots.
- Custom Color Palettes: Choose from various color palettes for visualizations.
- Model Comparison: Compare multiple models (Random Forest, SVM, Logistic Regression).
- Feature Importance: Analyze feature importance using RandomForestClassifier.
- Cross-Validation: Perform cross-validation to evaluate model performance.
- Model Performance Metrics: View accuracy, F1 score, precision, and recall.
- Interactive Prediction Interface: Make predictions on new data.
- Dataset Querying: Query the dataset using natural language.
- Context-Aware Responses: Get context-aware responses from the dataset.
- Code Snippet Generation: Generate code snippets for data analysis.
- Interactive Chat Interface: Chat with the dataset using Ollama's RAG model.
Before running the project, make sure you have Python 3.12 installed on your system and Ollama (for RAG features).
-
Clone the repository (optional)
git clone https://github.com/3bdalrhmanS3d/DataQualityProject.git cd DataQualityProject -
Create a virtual environment
python -m venv venv
-
Activate the virtual environment
On Windows:
venv\Scripts\activate
On macOS/Linux:
source venv/bin/activate -
Install the required dependencies
pip install -r requirements.txt
Alternatively, install the required libraries manually:
pip install streamlit pandas ollama scikit-learn matplotlib seaborn missingno imbalanced-learn
-
Verify the installed libraries
pip list
-
Run the Streamlit app
streamlit run RAG.py
The app will open in your default web browser.
DataQualityProject/
├── RAG.py # Main application
├── HandlingSection.py # Data handling components
├── PredictionManager.py # ML model management
├── requirements.txt # Dependencies
└── README.md # Documentation- Upload your dataset (CSV or Excel) via the sidebar.
- Select the task you want to perform from the navigation menu in the sidebar:
- Dataset Info: View detailed information about your dataset (columns, types, non-null counts).
- Describe Dataset: View the descriptive statistics of the dataset.
- Handle Missing Values: Choose to fill or drop missing values from columns.
- Handle Duplicates: Identify and remove duplicate rows.
- Handle Outliers: Remove outliers using the IQR method.
- Chat using RAG: Interact with your dataset via a chatbot powered by Ollama.
After performing any changes, you can download the modified dataset by clicking the download button on the sidebar.
- Python 3.12
- Streamlit: For creating the web interface.
- Pandas: For data manipulation and analysis.
- Ollama: For chatbot integration using the RAG model.
- Missing Values: Multiple imputation methods and visualizations.
- Outliers: IQR-based detection and handling with visual analysis.
- Transformations: Scaling, encoding, and normalization.
- Feature Engineering: Automated and manual feature engineering options.
- Models:
- Random Forest
- Support Vector Machines
- Logistic Regression
- Metrics:
- Accuracy
- F1 Score
- Precision
- Recall
- Visualization:
- Confusion Matrix
- ROC Curves
- Feature Importance
streamlit
pandas
numpy
scikit-learn
matplotlib
seaborn
ollama
missingno
imbalanced-learn