Data Quality App

This is a Python-based web application built using Streamlit for performing common data quality tasks such as handling missing values, duplicates, and outliers in datasets. The app also integrates with Ollama for a chatbot interface to interact with the dataset and answer questions using a Retrieval-Augmented Generation (RAG) model.

For those who wish to try the app, you can access it here.

Demo Video

Watch the demo

Features

1. Data Quality Analysis

Dataset Upload: Upload CSV or Excel files.
Dataset Info: View detailed dataset information including memory usage and data types.
Describe Dataset: Get descriptive statistics of the dataset.
Handle Missing Values: Fill or drop missing values with multiple options.
Handle Duplicates: Identify and remove duplicate rows.
Outlier Detection: Identify and handle outliers using various techniques.
Data Type Conversion: Convert data types, normalize, and transform columns.

2. Data Visualization

Interactive Plots: Bar plots, pie charts, histograms, box plots, scatter plots, line charts, area charts, and pair plots.
Correlation Matrices: View correlation between features with heatmaps.
Distribution Analysis: Analyze data distributions using density and box plots.
Custom Color Palettes: Choose from various color palettes for visualizations.

3. Machine Learning

Model Comparison: Compare multiple models (Random Forest, SVM, Logistic Regression).
Feature Importance: Analyze feature importance using RandomForestClassifier.
Cross-Validation: Perform cross-validation to evaluate model performance.
Model Performance Metrics: View accuracy, F1 score, precision, and recall.
Interactive Prediction Interface: Make predictions on new data.

4. RAG-powered Chat

Dataset Querying: Query the dataset using natural language.
Context-Aware Responses: Get context-aware responses from the dataset.
Code Snippet Generation: Generate code snippets for data analysis.
Interactive Chat Interface: Chat with the dataset using Ollama's RAG model.

Prerequisites

Before running the project, make sure you have Python 3.12 installed on your system and Ollama (for RAG features).

Installation

Clone the repository (optional)

git clone https://github.com/3bdalrhmanS3d/DataQualityProject.git
cd DataQualityProject

Create a virtual environment
```
python -m venv venv
```
Activate the virtual environment

On Windows:
```
venv\Scripts\activate
```
On macOS/Linux:
```
source venv/bin/activate
```

Install the required dependencies

pip install -r requirements.txt

Alternatively, install the required libraries manually:

pip install streamlit pandas ollama scikit-learn matplotlib seaborn missingno imbalanced-learn

Verify the installed libraries
```
pip list
```
Run the Streamlit app
```
streamlit run RAG.py
```
The app will open in your default web browser.

Project Structure

  DataQualityProject/
  ├── RAG.py                 # Main application
  ├── HandlingSection.py     # Data handling components
  ├── PredictionManager.py   # ML model management
  ├── requirements.txt       # Dependencies
  └── README.md              # Documentation

Usage

Upload your dataset (CSV or Excel) via the sidebar.
Select the task you want to perform from the navigation menu in the sidebar:
- Dataset Info: View detailed information about your dataset (columns, types, non-null counts).
- Describe Dataset: View the descriptive statistics of the dataset.
- Handle Missing Values: Choose to fill or drop missing values from columns.
- Handle Duplicates: Identify and remove duplicate rows.
- Handle Outliers: Remove outliers using the IQR method.
- Chat using RAG: Interact with your dataset via a chatbot powered by Ollama.

Download Modified Dataset

After performing any changes, you can download the modified dataset by clicking the download button on the sidebar.

Requirements

Python 3.12
Streamlit: For creating the web interface.
Pandas: For data manipulation and analysis.
Ollama: For chatbot integration using the RAG model.

Data Processing Features

Missing Values: Multiple imputation methods and visualizations.
Outliers: IQR-based detection and handling with visual analysis.
Transformations: Scaling, encoding, and normalization.
Feature Engineering: Automated and manual feature engineering options.

Machine Learning Capabilities

Models:
- Random Forest
- Support Vector Machines
- Logistic Regression
Metrics:
- Accuracy
- F1 Score
- Precision
- Recall
Visualization:
- Confusion Matrix
- ROC Curves
- Feature Importance

requirements.txt

streamlit
pandas
numpy
scikit-learn
matplotlib
seaborn
ollama
missingno
imbalanced-learn

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
Src		Src
__pycache__		__pycache__
HandlingSection.py		HandlingSection.py
PredictionManager.py		PredictionManager.py
R1.png		R1.png
R2.png		R2.png
RAG.py		RAG.py
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Quality App

Demo Video

Features

1. Data Quality Analysis

2. Data Visualization

3. Machine Learning

4. RAG-powered Chat

Prerequisites

Installation

Project Structure

Usage

Download Modified Dataset

Requirements

Data Processing Features

Machine Learning Capabilities

requirements.txt

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

3bdalrhmanS3d/DataQualityProject

Folders and files

Latest commit

History

Repository files navigation

Data Quality App

Demo Video

Features

1. Data Quality Analysis

2. Data Visualization

3. Machine Learning

4. RAG-powered Chat

Prerequisites

Installation

Project Structure

Usage

Download Modified Dataset

Requirements

Data Processing Features

Machine Learning Capabilities

requirements.txt

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages