# 📊 Data Analysis with Pandas and Visualization with Matplotlib

This notebook is a beginner-friendly assignment that demonstrates how to:
- Load and explore a dataset using **pandas**
- Compute **summary statistics** and per-species averages
- Create simple but useful **visualizations** using **matplotlib** and **seaborn**

We will be using the **Iris dataset**, a classic dataset in machine learning and statistics.


In [None]:
# Importing the libraries we need
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.datasets import load_iris


## Task 1: Load and Explore the Dataset

Here we load the **Iris dataset** with scikit-learn's `load_iris`, passing `as_frame=True` so the data arrives as a pandas DataFrame.  
Then we look at the first few rows, check the structure of the dataset, and count the missing values in each column.


In [None]:
# Load the Iris dataset
iris = load_iris(as_frame=True)
df = iris.frame

# Look at the first few rows
print("First 5 rows of the dataset:")
print(df.head())

# Info about dataset
print("\nDataset Info:")
df.info()  # info() prints its own report and returns None, so no print() is needed

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# The Iris dataset does not have missing values, so no cleaning is needed.
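
If a dataset did contain missing values, the same `isnull().sum()` check would show non-zero counts, and some cleaning would be needed. The sketch below shows two common options: dropping incomplete rows and filling gaps with the column mean. The `toy` frame and its values are made up purely for illustration and are not part of the Iris data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's `df`; the values are made up for illustration.
toy = pd.DataFrame({
    "sepal length (cm)": [5.1, np.nan, 6.3, 5.8],
    "petal length (cm)": [1.4, 4.5, np.nan, 5.1],
    "target": [0, 1, 2, 2],
})

# Count missing values per column, as in the cell above.
missing_counts = toy.isnull().sum()
print(missing_counts)

# Option 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: fill gaps in the measurement columns with each column's mean.
feature_cols = ["sepal length (cm)", "petal length (cm)"]
filled = toy.copy()
filled[feature_cols] = filled[feature_cols].fillna(filled[feature_cols].mean())

print("Rows left after dropna:", len(dropped))
print("Missing values left after fillna:", int(filled.isnull().sum().sum()))
```

Dropping rows is the simpler choice but discards data; filling with the mean keeps every row at the cost of slightly flattening the distribution.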


## Task 2: Basic Data Analysis

We will compute basic summary statistics, then group the data by **species** (encoded as integers in the `target` column) and calculate the mean of each measurement.


In [None]:
# Basic statistics
print("\nBasic Statistics:")
print(df.describe())

# Group by species and compute mean
grouped = df.groupby("target").mean()
print("\nGroup Means by Species (0=setosa, 1=versicolor, 2=virginica):")
print(grouped)
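
The grouped table above is indexed by the numeric codes 0, 1 and 2. One possible way to get species names in the index instead is to map the codes to names before grouping. The sketch below assumes the name order used by `iris.target_names` (setosa, versicolor, virginica); the `toy` frame and its measurements are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; the measurements are made up.
toy = pd.DataFrame({
    "petal length (cm)": [1.4, 1.6, 4.5, 4.7, 5.9, 6.1],
    "target": [0, 0, 1, 1, 2, 2],
})

# Assumed to match the order of iris.target_names in scikit-learn.
target_names = ["setosa", "versicolor", "virginica"]

# Map each integer code to its species name, then group on the names.
species = toy["target"].map(dict(enumerate(target_names))).rename("species")
means_by_species = toy.drop(columns="target").groupby(species).mean()
print(means_by_species)
```

In the notebook itself, the same idea would apply with `df` in place of `toy` and `iris.target_names` in place of the hard-coded list.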


## Task 3: Data Visualization

Now we will make **4 types of plots**:
1. Line chart (cumulative sepal length per species)  
2. Bar chart (average petal length per species)  
3. Histogram (distribution of sepal width)  
4. Scatter plot (sepal length vs petal length)


In [None]:
sns.set_theme(style="whitegrid")  # set_theme is the current name for sns.set (seaborn >= 0.11)

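# 1. Line chart (cumulative sepal length per species)
# Sort by sepal length, then plot each species against its rank so the x-axis
# increases steadily (the original row index is no longer ordered after sorting)
df_sorted = df.sort_values(by="sepal length (cm)")
plt.figure(figsize=(8, 5))
for species, data in df_sorted.groupby("target"):
    cumulative = data["sepal length (cm)"].cumsum()
    plt.plot(range(1, len(cumulative) + 1), cumulative.to_numpy(), label=iris.target_names[species])
plt.title("Cumulative Sepal Length by Species")
plt.xlabel("Sample rank within species (sorted by sepal length)")
plt.ylabel("Cumulative Sepal Length (cm)")
plt.legend()
plt.show()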

# 2. Bar chart: average petal length per species
# errorbar=None hides the error bars (seaborn >= 0.12; older versions use ci=None)
plt.figure(figsize=(6, 4))
sns.barplot(x="target", y="petal length (cm)", data=df, estimator=np.mean, errorbar=None)
plt.title("Average Petal Length per Species")
plt.xlabel("Species (0=setosa, 1=versicolor, 2=virginica)")
plt.ylabel("Average Petal Length (cm)")
plt.show()

# 3. Histogram: distribution of sepal width
plt.figure(figsize=(6, 4))
plt.hist(df["sepal width (cm)"], bins=15, color="skyblue", edgecolor="black")
plt.title("Distribution of Sepal Width")
plt.xlabel("Sepal Width (cm)")
plt.ylabel("Frequency")
plt.show()

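# 4. Scatter plot: sepal length vs petal length
# Map target codes to species names so seaborn builds a correctly labeled legend;
# overriding labels via plt.legend(labels=...) can pair names with the wrong colors
plot_df = df.assign(species=df["target"].map(dict(enumerate(iris.target_names))))
plt.figure(figsize=(6, 4))
sns.scatterplot(x="sepal length (cm)", y="petal length (cm)", hue="species", palette="Set1", data=plot_df)
plt.title("Sepal Length vs Petal Length by Species")
plt.xlabel("Sepal Length (cm)")
plt.ylabel("Petal Length (cm)")
plt.legend(title="Species")
plt.show()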
