An advanced data analysis system with powerful data cleaning, visualization, and real-time monitoring capabilities for large datasets.
This application provides a comprehensive solution for data professionals to:
- Analyze datasets with up to millions of records
- Identify data quality issues automatically
- Clean and transform data with customizable methods
- Visualize patterns and insights with interactive charts
- Monitor data processing in real-time
- Generate detailed reports on data quality
Key features include:
- Intuitive UI: User-friendly Streamlit interface for easy interaction
- Large Data Support: Uses Dask for distributed data processing when files exceed 100MB (see the sketch after this list)
- Advanced Data Cleaning: Handles missing values, outliers, duplicates, and data type inconsistencies
- Intelligent Recommendations: Suggests optimal cleaning strategies based on data characteristics
- Interactive Visualizations: Comprehensive charts for data distributions, correlations, and outliers
- Real-time Monitoring: Tracks processing time, memory usage, and operation logs
- Exportable Results: Download cleaned data and reports
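As an illustration of the large-data behaviour mentioned above, the loading step can be thought of as a file-size check that falls back to Dask past the 100MB threshold. The function name and constant below are a hedged sketch, not the application's actual code:

```python
# Illustrative sketch only: how a 100 MB threshold might select pandas vs. Dask.
# load_dataset and SIZE_THRESHOLD_BYTES are assumed names, not the app's API.
import os

import dask.dataframe as dd
import pandas as pd

SIZE_THRESHOLD_BYTES = 100 * 1024 * 1024  # 100 MB

def load_dataset(path: str):
    """Read a CSV with pandas for small files and Dask for large ones."""
    if os.path.getsize(path) > SIZE_THRESHOLD_BYTES:
        # Dask returns a lazy, partitioned DataFrame suited to out-of-core work
        return dd.read_csv(path)
    return pd.read_csv(path)
```

Because Dask mirrors much of the pandas API, downstream analysis code can often treat both return types the same way.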
The application comes with a data generator that can create realistic sample datasets with:
- User data: 100,000 records
- Transaction data: 200,000 records
- Product data: 10,000 records
- Customer support data: 50,000 records
- Website analytics data: 200,000 records
Each dataset includes intentional data quality issues (missing values, outliers, format errors) to demonstrate the application's capabilities.
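For reference, injecting such issues into a synthetic dataset can be done roughly as follows; the column names and error rates here are illustrative and may differ from what demo_data_generator.py actually produces:

```python
# Hedged sketch: seeding missing values and outliers in synthetic user data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

users = pd.DataFrame({
    "user_id": np.arange(n),
    "age": rng.integers(18, 80, n).astype(float),
    "signup_date": pd.date_range("2020-01-01", periods=n),
})

# Blank out roughly 5% of ages to simulate missing values.
users.loc[rng.random(n) < 0.05, "age"] = np.nan

# Replace a handful of ages with implausible outliers.
users.loc[rng.choice(n, size=10, replace=False), "age"] = 999.0
```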
To install and run the application you will need:
- Python 3.11 or higher
- Required Python packages (see requirements in pyproject.toml)
```bash
# Clone the repository
git clone https://github.com/yourusername/data-analysis-pipeline.git
cd data-analysis-pipeline

# Install dependencies
pip install -e .
```

You can use the included shell script to run the application:
```bash
# Make the script executable
chmod +x run.sh

# Run the application
./run.sh
```

Or run directly with Python:
```bash
# Run the main application
python main.py

# Generate demo datasets
python main.py --generate-data
```

The run.sh script accepts the following options:

```
Usage: ./run.sh [options]

Options:
  --generate-data    Generate demo datasets
  --output-dir DIR   Directory to save generated datasets (default: demo_datasets)
  --help             Display help message
```
To use the application:
- Start the application using the commands above
- Upload data via the sidebar file uploader
- Analyze data quality in the "Data Quality Analysis" tab
- Clean data using recommended strategies in the "Data Cleaning" tab
- Visualize insights in the "Visualization" tab
- Monitor processing through the "Processing Logs" tab
- Export results using the download buttons
The project is organized into the following modules:
- app.py: Main Streamlit application interface
- data_cleaning.py: Data quality assessment and cleaning functions (see the sketch after this list)
- data_visualization.py: Visualization components and chart generation
- demo_data_generator.py: Sample dataset generation
- logger.py: Logging and activity tracking system
- utils.py: Helper functions and utilities
- main.py: Application entry point
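To give a feel for the quality-assessment side of data_cleaning.py, a per-column summary can be computed along these lines; this is an illustrative sketch, and the real module's function names and output format may differ:

```python
# Hedged sketch: a per-column data quality summary, not the module's actual API.
import pandas as pd

def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Report dtype, missing-value counts and percentages, and unique counts per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
        "unique": df.nunique(),
    })
```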
You can customize the application by:
- Modifying data cleaning strategies in data_cleaning.py (see the example below)
- Adding new visualization types in data_visualization.py
- Configuring application settings in the .streamlit/config.toml file
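As an example of the first point, a custom cleaning strategy added to data_cleaning.py might cap numeric outliers at percentile bounds. The function below is a hedged sketch with an assumed name; any registration or dispatch mechanism the module uses is not shown:

```python
# Hedged sketch: a custom outlier-capping strategy; cap_outliers is an assumed name.
import pandas as pd

def cap_outliers(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Clip the given numeric columns to their 1st-99th percentile range."""
    out = df.copy()
    for col in columns:
        lower, upper = out[col].quantile([0.01, 0.99])
        out[col] = out[col].clip(lower, upper)
    return out
```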
Performance notes:
- For very large files (>500MB), consider increasing available system memory
- The application uses Dask for files >100MB to optimize performance
- Memory usage is monitored in real-time to prevent crashes
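The monitoring described above amounts to sampling elapsed time and process memory around each operation. A minimal sketch using psutil (an assumption; logger.py may use a different mechanism) looks like this:

```python
# Hedged sketch: timing an operation and reporting resident memory with psutil.
import time

import psutil

def log_operation(name, func, *args, **kwargs):
    """Run func, then print wall-clock time and resident memory of this process."""
    process = psutil.Process()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    rss_mb = process.memory_info().rss / (1024 ** 2)
    print(f"{name}: {elapsed:.2f}s elapsed, {rss_mb:.1f} MB resident")
    return result
```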
This project is licensed under the MIT License - see the LICENSE file for details.
This project is built with:
- Streamlit for the interactive web framework
- Pandas and Dask for powerful data processing
- Plotly and Matplotlib for data visualization