GeeTee8/Data-Analysis-Assignment-Week-7

Data Analysis with Pandas and Visualization with Matplotlib

A comprehensive Python project demonstrating data analysis techniques using Pandas and data visualization with Matplotlib. This assignment analyzes the classic Iris dataset to showcase statistical analysis, data manipulation, and various visualization techniques.

🎯 Overview

This project fulfills an academic assignment focused on:

  • Data Loading & Exploration: Reading and understanding dataset structure
  • Statistical Analysis: Computing descriptive statistics and group comparisons
  • Data Visualization: Creating multiple chart types for data insights
  • Pattern Recognition: Identifying correlations and trends in the data

✨ Features

  • Complete Data Pipeline: From raw data loading to final insights
  • Multiple Visualization Types: 6 different chart types for comprehensive analysis
  • Statistical Analysis: Descriptive statistics, correlation analysis, and hypothesis testing
  • Clean Code Structure: Well-documented, professional Python code
  • Error Handling: Robust data validation and missing value detection
  • Publication-Ready Plots: High-quality visualizations with proper labels and styling

🛠️ Installation

Prerequisites

  • Python 3.7 or higher
  • pip package manager

Quick Install

# Clone or download the project files
# Install required packages
pip install pandas matplotlib seaborn numpy scikit-learn scipy

Using Requirements File

pip install -r requirements.txt

Verify Installation

python -c "import pandas, matplotlib, seaborn, numpy, sklearn, scipy; print('✅ All packages installed!')"

🚀 Usage

Method 1: Command Line

# Navigate to project directory
cd path/to/project

# Run the analysis
python data_analysis_assignment.py

Method 2: VS Code

  1. Open data_analysis_assignment.py in VS Code
  2. Press Ctrl + F5 or click the ▶️ Run button
  3. View output in terminal and plot windows

Method 3: Jupyter Notebook

# Start Jupyter
jupyter notebook

# Open notebook and run cells

📊 Dataset

Dataset: Iris Flower Dataset

  • Source: UCI Machine Learning Repository (via scikit-learn)
  • Samples: 150 flowers (50 per species)
  • Features: 4 numerical measurements
    • Sepal Length (cm)
    • Sepal Width (cm)
    • Petal Length (cm)
    • Petal Width (cm)
  • Target: 3 flower species (Setosa, Versicolor, Virginica)
  • Quality: No missing values, well-balanced dataset
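
The dataset facts listed above can be checked directly. The following is a minimal sketch, not the project's actual loader: the helper name `load_iris_frame` is hypothetical, and it assumes scikit-learn 0.23 or newer so that `load_iris(as_frame=True)` is available.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Return the Iris data as a DataFrame with a readable 'species' column.

    Hypothetical helper for illustration; the assignment script may load
    the data differently.
    """
    iris = load_iris(as_frame=True)
    df = iris.frame.copy()
    # Map the integer target (0, 1, 2) to the species names.
    df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
    return df.drop(columns="target")


if __name__ == "__main__":
    df = load_iris_frame()
    print("Shape:", df.shape)                      # rows, columns
    print(df["species"].value_counts())            # samples per species
    print("Missing values:", int(df.isna().sum().sum()))
```

The resulting frame holds the four measurement columns plus `species`, so the sample count, class balance, and missing-value count can each be read off in one line.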

🔍 Analysis Components

Task 1: Data Loading & Exploration

  • ✅ Dataset loading using scikit-learn
  • ✅ Data structure analysis (shape, info, head)
  • ✅ Missing value detection
  • ✅ Data type validation
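
As a rough illustration of the exploration steps in Task 1, the sketch below gathers shape, data types, and missing-value counts into one dictionary. The function name `explore` and the returned keys are assumptions made for this example and are not taken from the assignment script.

```python
import pandas as pd
from sklearn.datasets import load_iris


def explore(df: pd.DataFrame) -> dict:
    """Collect basic structural facts about a DataFrame (illustrative helper)."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
        # Data type validation: every column should hold numbers.
        "all_numeric": all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes),
    }


if __name__ == "__main__":
    features = load_iris(as_frame=True).data  # the four measurement columns
    print(features.head())
    features.info()
    for key, value in explore(features).items():
        print(f"{key}: {value}")
```

Returning a dictionary instead of only printing makes the checks easy to assert on in a notebook cell or a small test.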

Task 2: Statistical Analysis

  • ✅ Descriptive statistics (describe())
  • ✅ Group analysis by species
  • ✅ Mean comparisons across categories
  • ✅ Range and distribution analysis
  • ✅ Correlation matrix computation
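
The Task 2 steps map onto a handful of pandas calls. This is a sketch under the assumption that the data is loaded through scikit-learn's `load_iris(as_frame=True)`; the variable names are illustrative only.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
features = iris.data
# Species names aligned row by row with the feature table.
species = iris.target.map(dict(enumerate(iris.target_names))).rename("species")

overall = features.describe()                  # count, mean, std, quartiles
by_species = features.groupby(species).mean()  # mean of each feature per species
ranges = features.max() - features.min()       # spread of each feature
corr = features.corr()                         # Pearson correlation matrix

if __name__ == "__main__":
    print(overall.round(2))
    print(by_species.round(2))
    print(ranges.round(2))
    print(corr.round(3))
```

`DataFrame.corr()` uses Pearson correlation by default, which is the coefficient usually reported as r.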

Task 3: Data Visualization

  • ✅ Line Chart: Trend analysis over sample indices
  • ✅ Bar Chart: Species comparison of average measurements
  • ✅ Histogram: Distribution analysis of sepal width
  • ✅ Scatter Plot: Relationship between sepal and petal length
  • ✅ Box Plot: Petal width distribution by species
  • ✅ Heatmap: Correlation matrix visualization
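
One way the six chart types could be drawn with plain Matplotlib is sketched below. It is an assumption-laden example, not the script's own plotting code: the function name `build_figure`, the 2x3 layout, the output filename, and the `Agg` backend (chosen so the sketch runs without a display) are all illustrative. The heatmap uses `imshow` here, whereas the project may use seaborn.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: file-only backend; remove to get plot windows
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X, y, names = iris.data, iris.target, list(iris.target_names)


def build_figure():
    """Draw the six chart types on one 2x3 grid and return the Figure."""
    fig, axes = plt.subplots(2, 3, figsize=(16, 9))

    # 1. Line chart: measurements plotted against the sample index.
    ax = axes[0, 0]
    ax.plot(X.index, X["sepal length (cm)"], label="Sepal length")
    ax.plot(X.index, X["petal length (cm)"], label="Petal length")
    ax.set(title="Length by sample index", xlabel="Sample index", ylabel="cm")
    ax.legend()

    # 2. Bar chart: mean of every measurement per species.
    means = X.groupby(y).mean()
    means.index = names
    means.plot.bar(ax=axes[0, 1], rot=0)
    axes[0, 1].set(title="Average measurements by species", ylabel="cm")

    # 3. Histogram: distribution of sepal width.
    ax = axes[0, 2]
    ax.hist(X["sepal width (cm)"], bins=15, edgecolor="black")
    ax.set(title="Sepal width distribution", xlabel="cm", ylabel="Count")

    # 4. Scatter plot: sepal length vs petal length, one colour per species.
    ax = axes[1, 0]
    for code, name in enumerate(names):
        rows = X[y == code]
        ax.scatter(rows["sepal length (cm)"], rows["petal length (cm)"], label=name)
    ax.set(title="Sepal vs petal length",
           xlabel="Sepal length (cm)", ylabel="Petal length (cm)")
    ax.legend()

    # 5. Box plot: petal width per species.
    ax = axes[1, 1]
    ax.boxplot([X.loc[y == code, "petal width (cm)"].to_numpy()
                for code in range(len(names))])
    ax.set_xticklabels(names)
    ax.set(title="Petal width by species", ylabel="cm")

    # 6. Heatmap: correlation matrix with the coefficient written in each cell.
    ax = axes[1, 2]
    corr = X.corr()
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr)))
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(range(len(corr)))
    ax.set_yticklabels(corr.index)
    for i in range(len(corr)):
        for j in range(len(corr)):
            ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
    ax.set_title("Feature correlations")
    fig.colorbar(image, ax=ax)

    fig.tight_layout()
    return fig


if __name__ == "__main__":
    build_figure().savefig("analysis_plots_sketch.png", dpi=150)  # hypothetical filename
```

Putting every chart on one figure keeps the output to a single image; drawing each chart on its own figure would instead give six separate windows.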

📈 Visualizations

The project generates 6 comprehensive visualizations:

  1. Trends Across Samples: Line plot of sepal and petal length against sample index (the Iris data has no time dimension)
  2. Species Comparison: Bar chart of average measurements by species
  3. Distribution Analysis: Histogram of sepal width distribution
  4. Correlation Analysis: Scatter plot with species-specific coloring
  5. Statistical Comparison: Box plots showing distribution differences
  6. Feature Relationships: Correlation heatmap with coefficient values

🔑 Key Findings

  • Strongest Correlation: Petal length and petal width (r = 0.963)
  • Species Distinction: Virginica has the largest average petal area
  • Data Quality: Complete dataset with no missing values
  • Balance: Perfectly balanced with 50 samples per species
  • Variability: Petal measurements show highest variability across species
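
The first two findings can be reproduced with a few lines of pandas. This sketch makes one labeled assumption: "petal area" is taken to mean petal length × petal width, since the README does not define it. Variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
features = iris.data
species = iris.target.map(dict(enumerate(iris.target_names)))

# Strongest correlation: look only at the upper triangle of the matrix
# (k=1 skips the diagonal, where every feature correlates 1.0 with itself).
corr = features.corr()
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
strongest_pair = pairs.abs().idxmax()
strongest_r = pairs.loc[strongest_pair]

# Assumption: petal area approximated as length * width (not defined in the README).
petal_area = features["petal length (cm)"] * features["petal width (cm)"]
mean_area = petal_area.groupby(species).mean()

if __name__ == "__main__":
    print("Strongest pair:", strongest_pair, "r =", round(strongest_r, 3))
    print(mean_area.round(2))
```

Ranking by absolute value means a strong negative correlation would also be picked up, not only positive ones.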

📁 File Structure

Data_Analysis_Assignment/
│
├── data_analysis_assignment.py    # Main analysis script
├── requirements.txt               # Package dependencies
├── README.md                      # Project documentation
└── outputs/                       # Generated plots and results
    ├── analysis_plots.png         # Combined visualization
    └── summary_statistics.txt     # Analysis summary

📋 Requirements

pandas>=1.3.0
matplotlib>=3.5.0
seaborn>=0.11.0
numpy>=1.21.0
scikit-learn>=1.0.0
scipy>=1.7.0
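
To see whether an environment meets these minimums, a small check along the following lines may help. It is a rough sketch: the names `MINIMUMS`, `version_tuple`, and `check` are hypothetical, and the version comparison only looks at the leading numeric parts, so it is not a full PEP 440 parser.

```python
import importlib
import re

# Minimum versions copied from the requirements list; keys are import names
# (scikit-learn is imported as "sklearn").
MINIMUMS = {
    "pandas": "1.3.0",
    "matplotlib": "3.5.0",
    "seaborn": "0.11.0",
    "numpy": "1.21.0",
    "sklearn": "1.0.0",
    "scipy": "1.7.0",
}


def version_tuple(text: str) -> tuple:
    """Turn '1.21.0rc1' into (1, 21, 0), padding short versions with zeros."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def check(minimums=MINIMUMS) -> dict:
    """Return a status string per module: version and ok/too old, or 'missing'."""
    report = {}
    for module_name, minimum in minimums.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            report[module_name] = "missing"
            continue
        installed = getattr(module, "__version__", "0")
        ok = version_tuple(installed) >= version_tuple(minimum)
        status = "ok" if ok else f"too old, need >= {minimum}"
        report[module_name] = f"{installed} ({status})"
    return report


if __name__ == "__main__":
    for name, status in check().items():
        print(f"{name}: {status}")
```

Comparing integer tuples avoids the string-comparison trap where "1.10.0" sorts before "1.3.0".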

🐛 Troubleshooting

Common Issues & Solutions

Issue: "pip is not recognized"

# Solution: Use python -m pip
python -m pip install pandas matplotlib seaborn numpy scikit-learn scipy

Issue: "Permission denied"

# Solution: Use --user flag
pip install --user pandas matplotlib seaborn numpy scikit-learn scipy

Issue: "Python not found"

# Solution: Try python3
python3 -m pip install pandas matplotlib seaborn numpy scikit-learn scipy

Issue: Plots not displaying

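# Solution: Install a GUI backend
pip install PyQt5
# Or add these two lines at the top of data_analysis_assignment.py,
# before matplotlib.pyplot is imported:
#   import matplotlib
#   matplotlib.use('TkAgg')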
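
If it is unclear which backend is available, a fallback routine such as the one below can try interactive backends in turn and drop to the file-only `Agg` backend when none loads. This is a sketch: `pick_backend`, its default candidate list, and the output filename are assumptions for illustration.

```python
import matplotlib
import matplotlib.pyplot as plt


def pick_backend(candidates=("TkAgg", "QtAgg")):
    """Switch to the first backend that loads; fall back to file-only Agg.

    plt.switch_backend raises ImportError when the GUI toolkit is missing
    (or no display is available) and ValueError for an unknown backend name.
    """
    for name in candidates:
        try:
            plt.switch_backend(name)
            return name
        except (ImportError, ValueError):
            continue
    plt.switch_backend("Agg")
    return "Agg"


if __name__ == "__main__":
    chosen = pick_backend()
    print("Using backend:", chosen, "/ reported:", matplotlib.get_backend())
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    # Saving works with every backend, so the plot is never lost.
    fig.savefig("backend_check.png")  # hypothetical filename
```

With `Agg`, no window opens, so saving figures to files is the way to view them.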

Getting Help

  1. Check if all packages are installed correctly
  2. Verify Python version (3.7+)
  3. Ensure you're in the correct directory
  4. Check for any error messages in the output

🎓 Educational Value

This project demonstrates:

  • Data Science Workflow: Complete pipeline from data to insights
  • Python Libraries: Practical use of pandas, matplotlib, seaborn
  • Statistical Concepts: Correlation, distribution, hypothesis testing
  • Best Practices: Code organization, documentation, error handling
  • Visualization Design: Effective chart selection and styling

🚀 Running the Project

  1. Download all project files
  2. Install required packages: pip install pandas matplotlib seaborn numpy scikit-learn scipy
  3. Run the script: python data_analysis_assignment.py
  4. View the generated plots and terminal output
  5. Analyze the insights and statistical findings

📊 Expected Output

When you run the script, you'll see:

  • Detailed data exploration results
  • Statistical summary tables
  • 6 professional visualizations
  • Key insights and correlations
  • Performance metrics and analysis

Execution Time: ~30-60 seconds

Output: Console logs + 6 plot windows

👨‍💻 Author

Created as part of a Python data analysis course assignment focusing on pandas and matplotlib proficiency.

📝 Assignment Compliance

This project fulfills all assignment requirements:

  • ✅ Load and analyze dataset using pandas
  • ✅ Create simple plots and charts with matplotlib
  • ✅ Include data loading and exploration steps
  • ✅ Perform basic data analysis with results
  • ✅ Generate visualizations with proper labels
  • ✅ Document findings and observations

Ready to explore the data? Run the script and discover the insights hidden in the Iris dataset! 🌸📈
