A comprehensive, object-oriented data analysis pipeline for professional data scientists and analysts. This single-file Python script provides end-to-end capabilities for data extraction, cleaning, analysis, visualization, and reporting.
- Multi-source Data Extraction: Import data from CSV, JSON, APIs, and databases
- Robust Data Cleaning: Comprehensive preprocessing with configurable options
- Exploratory Data Analysis: Automated statistical summaries and data profiling
- Advanced Statistical Analysis: T-tests, ANOVA, chi-square, correlation, and regression
- Visualization Generation: Multiple chart types with customization options
- Automated Reporting: Generate comprehensive HTML reports with findings
- Multiple Export Formats: Share results as CSV, Excel, JSON, or HTML
- Python 3.7+
- pandas
- numpy
- matplotlib
- seaborn
- scipy
- statsmodels
- scikit-learn
- requests (for API data extraction)
# Create a requirements.txt file with the following content:
pandas>=1.3.0
numpy>=1.20.0
matplotlib>=3.4.0
seaborn>=0.11.0
scipy>=1.7.0
statsmodels>=0.12.0
scikit-learn>=0.24.0
requests>=2.25.0
# Install dependencies
pip install -r requirements.txt
# Download the DataAnalysisPipeline.py file
# No installation needed - just import and use!
from DataAnalysisPipeline import DataAnalysisPipeline
# Initialize the pipeline
pipeline = DataAnalysisPipeline()
# Extract data from multiple sources
data_sources = {
'sales_data': {
'type': 'csv',
'path': 'data/sales_data.csv'
},
'customer_info': {
'type': 'json',
'path': 'data/customers.json'
}
}
pipeline.extract_data(data_sources)
# Define cleaning steps
cleaning_steps = {
'sales_data': [
{'type': 'drop_duplicates'},
{'type': 'drop_na', 'subset': ['date', 'product', 'revenue']},
{'type': 'fill_na', 'columns': {'quantity': 0}},
{'type': 'convert_types', 'conversions': {'revenue': 'float', 'quantity': 'int'}}
]
}
# Apply cleaning steps
pipeline.clean_data(cleaning_steps)
# Perform exploratory data analysis
pipeline.perform_eda('sales_data')
# Run statistical tests
analysis_config = {
'revenue_by_region': {
'type': 'anova',
'dataset': 'sales_data',
'group_column': 'region',
'value_column': 'revenue'
},
'sales_correlation': {
'type': 'correlation',
'dataset': 'sales_data',
'columns': ['revenue', 'quantity', 'customer_satisfaction']
}
}
pipeline.perform_statistical_analysis(analysis_config)
# Create visualizations
visualization_config = {
'monthly_sales': {
'type': 'line',
'dataset': 'sales_data',
'x_column': 'month',
'y_column': 'revenue',
'title': 'Monthly Sales Performance',
'output_file': 'visualizations/monthly_sales.png'
},
'region_comparison': {
'type': 'bar',
'dataset': 'sales_data',
'x_column': 'region',
'y_column': 'revenue',
'title': 'Revenue by Region',
'output_file': 'visualizations/region_comparison.png'
}
}
pipeline.create_visualizations(visualization_config)
# Generate a comprehensive report
report_config = {
'title': 'Sales Analysis Report - Q1 2024',
'introduction': 'This report analyzes sales performance across regions and products.',
'sections': [
{
'type': 'datasets',
'title': 'Data Overview',
'datasets': ['sales_data']
},
{
'type': 'eda',
'title': 'Exploratory Analysis',
'dataset': 'sales_data'
},
{
'type': 'statistical_analysis',
'title': 'Statistical Findings',
'analyses': ['revenue_by_region', 'sales_correlation']
},
{
'type': 'visualizations',
'title': 'Visual Insights',
'visualizations': ['monthly_sales', 'region_comparison']
},
{
'type': 'conclusion',
'title': 'Conclusion',
'content': 'Based on our analysis, the Western region shows the highest revenue growth...'
}
]
}
pipeline.generate_report(report_config, 'reports/q1_sales_analysis.html')
The pipeline uses configuration dictionaries to control its behavior. Here are some key configuration options:
Supported data source types:
- `csv`: CSV files with customizable parameters
- `json`: JSON files or arrays
- `api`: RESTful API endpoints with authentication
- `database`: SQL database connections
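For sources beyond local files, an `api` or `database` entry follows the same pattern as the CSV/JSON examples in the quick start. The option names below (`url`, `headers`, `connection_string`, `query`) are assumptions about how such entries would look, not confirmed keys:

remote_sources = {
    'web_traffic': {
        'type': 'api',
        'url': 'https://api.example.com/v1/traffic',      # hypothetical endpoint
        'headers': {'Authorization': 'Bearer <token>'}     # assumed auth option
    },
    'orders': {
        'type': 'database',
        'connection_string': 'sqlite:///data/orders.db',   # assumed key name
        'query': 'SELECT * FROM orders'                    # assumed key name
    }
}
pipeline.extract_data(remote_sources)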
Supported cleaning steps:
- `drop_duplicates`: Remove duplicate rows
- `drop_na`: Remove rows with missing values
- `fill_na`: Fill missing values (mean, median, mode, or custom)
- `rename_columns`: Rename dataframe columns
- `convert_types`: Convert column data types
- `filter_rows`: Filter rows based on conditions
- `transform_column`: Apply transformations (log, sqrt, standardize, etc.)
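A sketch of the steps not shown in the quick start, following the same list-of-dicts shape; the option names (`condition`, `column`, `transformation`, and a strategy string as a `fill_na` value) are assumptions:

extra_cleaning = {
    'sales_data': [
        {'type': 'fill_na', 'columns': {'customer_satisfaction': 'median'}},        # assumed: strategy name as fill value
        {'type': 'filter_rows', 'condition': 'revenue > 0'},                        # assumed key name
        {'type': 'transform_column', 'column': 'revenue', 'transformation': 'log'}  # assumed key names
    ]
}
pipeline.clean_data(extra_cleaning)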
Supported statistical analyses:
- `ttest`: Independent t-tests
- `anova`: One-way ANOVA with post-hoc tests
- `chi_square`: Chi-square test of independence
- `correlation`: Pearson, Spearman, or Kendall correlations
- `regression`: OLS regression analysis
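A t-test or regression entry would plausibly mirror the ANOVA and correlation configs from the quick start; the key names `dependent` and `independent` are assumptions for illustration:

more_analyses = {
    'promo_effect': {
        'type': 'ttest',
        'dataset': 'sales_data',
        'group_column': 'promo_active',   # assumed: a two-group column
        'value_column': 'revenue'
    },
    'revenue_model': {
        'type': 'regression',
        'dataset': 'sales_data',
        'dependent': 'revenue',                               # assumed key name
        'independent': ['quantity', 'customer_satisfaction']  # assumed key name
    }
}
pipeline.perform_statistical_analysis(more_analyses)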
Supported visualization types:
- `histogram`: Distribution visualization
- `scatter`: Relationship between variables
- `bar`: Category comparisons
- `line`: Time series or trend analysis
- `heatmap`: Correlation visualization
- `box`: Distribution and outlier analysis
- `pie`: Composition visualization
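For example, a heatmap entry would plausibly take a list of columns rather than an x/y pair; the `columns` option here is an assumption modeled on the correlation analysis config:

more_visualizations = {
    'correlation_heatmap': {
        'type': 'heatmap',
        'dataset': 'sales_data',
        'columns': ['revenue', 'quantity', 'customer_satisfaction'],  # assumed option
        'title': 'Correlation Matrix',
        'output_file': 'visualizations/correlation_heatmap.png'
    }
}
pipeline.create_visualizations(more_visualizations)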
import json
from DataAnalysisPipeline import DataAnalysisPipeline
# Load configuration from JSON file
with open('config/analysis_config.json', 'r') as f:
config = json.load(f)
# Initialize pipeline
pipeline = DataAnalysisPipeline()
# Run the complete pipeline using configuration
pipeline.extract_data(config['data_sources'])
pipeline.clean_data(config['cleaning_steps'])
pipeline.perform_statistical_analysis(config['analyses'])
pipeline.create_visualizations(config['visualizations'])
pipeline.generate_report(config['report'], config['output_file'])
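The JSON file is expected to mirror the configuration dictionaries used throughout the quick start; the top-level keys match what the code above reads. A minimal sketch of `config/analysis_config.json`:

{
    "data_sources": {
        "sales_data": {"type": "csv", "path": "data/sales_data.csv"}
    },
    "cleaning_steps": {
        "sales_data": [{"type": "drop_duplicates"}]
    },
    "analyses": {},
    "visualizations": {},
    "report": {"title": "Sales Analysis Report"},
    "output_file": "reports/sales_analysis.html"
}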
The pipeline provides comprehensive logging to help debug issues:
# Logs are automatically saved to data_pipeline.log
# You can check specific log levels:
import logging
logging.getLogger('data_pipeline').setLevel(logging.DEBUG)
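To also mirror those logs to the console, you can attach a standard-library handler to the same logger name (this uses only stdlib logging; the logger name `data_pipeline` is taken from the snippet above):

import logging

console = logging.StreamHandler()  # writes to stderr by default
console.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
logging.getLogger('data_pipeline').addHandler(console)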
A professional implementation would split this single file into a proper Python package structure:
data_analysis_pipeline/
├── __init__.py
├── pipeline.py       # Main pipeline class
├── extractors/       # Data extraction modules
├── transformers/     # Data cleaning operations
├── analyzers/        # Statistical analysis modules
├── visualizers/      # Visualization generators
└── reporters/        # Report generation utilities
This modular architecture would improve maintainability and extensibility, making it easier to:
- Add new data source types (see the extractor sketch after this list)
- Implement additional statistical tests
- Create new visualization types
- Extend reporting capabilities
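As an illustration of that extensibility, a hypothetical extractors module could expose a small base interface that each source type implements; the names below are illustrative, not part of the current script:

# extractors/base.py (hypothetical layout)
from abc import ABC, abstractmethod

import pandas as pd

class BaseExtractor(ABC):
    """Common interface every data source type would implement."""

    @abstractmethod
    def extract(self, source_config: dict) -> pd.DataFrame:
        """Return the configured source's contents as a DataFrame."""

class CSVExtractor(BaseExtractor):
    def extract(self, source_config: dict) -> pd.DataFrame:
        # Forward any extra pandas read_csv options from the config
        return pd.read_csv(source_config['path'], **source_config.get('options', {}))

Adding a new source type would then mean writing one small extractor class instead of editing the monolithic script.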
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
This data analysis pipeline was created as part of a professional portfolio project demonstrating advanced Python and data analysis skills.