# Assignment Week 7: Data Analysis and Visualization with the Iris Dataset
This notebook demonstrates how to load, analyze, and visualize a dataset in Python using pandas, matplotlib, and seaborn.

## 1. Import Required Libraries
We will use pandas for data analysis and matplotlib/seaborn for visualization.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

## 2. Load Dataset
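We will use the Iris dataset bundled with scikit-learn, a classic dataset for classification and data analysis. It contains 150 flower samples, each with four measurements (sepal and petal length and width, in cm) and one of three species labels.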

In [None]:
# Load the Iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()
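
Before analyzing the data, it can help to confirm that the frame has the expected shape and class balance. The following is a minimal sketch that rebuilds the same `df` so it runs on its own; the names `n_rows`, `n_cols`, and `species_counts` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame exactly as in the cell above
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Illustrative names (not part of the original notebook)
n_rows, n_cols = df.shape                      # four features plus the species column
species_counts = df['species'].value_counts()  # samples per species

print(f"Rows: {n_rows}, columns: {n_cols}")
print(species_counts)
```

If the load worked, this should report 150 rows, 5 columns, and 50 samples for each species.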

## 3. Explore Dataset
Let's inspect the first few rows, check data types, and look for missing values.

In [None]:
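# A notebook cell renders only its last expression, so print each result explicitly
# Display first five rows
print(df.head())
# Check data types
print(df.dtypes)
# Check for missing values
print(df.isnull().sum())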

In [None]:
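# Clean dataset if missing values exist
if df.isnull().values.any():
    # Fill gaps in numeric columns with that column's mean
    df = df.fillna(df.mean(numeric_only=True))
    # Drop any rows still incomplete (for example, a missing species label)
    df = df.dropna()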
df.isnull().sum()
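
Because the Iris data has no gaps, the cleaning branch above never actually runs. The sketch below applies the same two steps to a small made-up frame so their effect is visible; `toy` and `cleaned` are hypothetical names, and the values are invented for illustration rather than taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (NOT Iris data) with one numeric gap and one missing label
toy = pd.DataFrame({
    "petal length (cm)": [1.4, np.nan, 5.1, 4.7],
    "species": ["setosa", "setosa", None, "versicolor"],
})

# Step 1: numeric gaps are replaced by the column mean of the observed values
cleaned = toy.fillna(toy.mean(numeric_only=True))
# Step 2: rows that are still incomplete (the missing species label) are dropped
cleaned = cleaned.dropna()

print(cleaned)
```

The numeric gap is filled with the mean of the three observed values, while the row with no species label is removed, since a mean cannot stand in for a category.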

## 4. Basic Data Analysis
Let's compute summary statistics and analyze the data by species.

In [None]:
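# Summary statistics (printed, since only a cell's last expression is rendered)
print(df.describe())
# Group by species and compute mean; observed=True keeps only species present in the data
grouped = df.groupby('species', observed=True).mean(numeric_only=True)
print(grouped)
# Identify which species has the highest average for each feature
for col in iris.feature_names:
    max_species = grouped[col].idxmax()
    print(f"Species with highest average {col}: {max_species}")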
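
Means alone do not show how much the species overlap. As a possible extension, the sketch below reports the mean and standard deviation of petal length per species; `petal_stats` is a name introduced here for illustration, and the frame is rebuilt so the block runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame as in section 2
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Mean and standard deviation of petal length for each species
petal_stats = (
    df.groupby('species', observed=True)['petal length (cm)']
      .agg(['mean', 'std'])
)
print(petal_stats)
```

A small standard deviation relative to the gap between two means suggests that the feature separates those species well.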

## 5. Create Visualizations
We will create four types of plots: line chart, bar chart, histogram, and scatter plot.

In [None]:
sns.set(style="whitegrid")
# 1. Line chart: Sepal Length Across Samples
plt.figure(figsize=(8, 4))
plt.plot(df.index, df['sepal length (cm)'], label='Sepal Length')
plt.title('Sepal Length Across Samples')
plt.xlabel('Sample Index')
plt.ylabel('Sepal Length (cm)')
plt.legend()
plt.tight_layout()
plt.show()
# 2. Bar chart: Average Petal Length per Species
plt.figure(figsize=(6, 4))
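# Map the x variable to hue as well: recent seaborn versions warn when palette is passed without hue
ax = sns.barplot(x=grouped.index, y=grouped['petal length (cm)'], hue=grouped.index, palette='viridis', dodge=False)
# The hue legend would only repeat the x-axis labels, so remove it if seaborn drew one
if ax.get_legend() is not None:
    ax.get_legend().remove()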
plt.title('Average Petal Length per Species')
plt.xlabel('Species')
plt.ylabel('Average Petal Length (cm)')
plt.tight_layout()
plt.show()
# 3. Histogram: Distribution of Sepal Width
plt.figure(figsize=(6, 4))
sns.histplot(df['sepal width (cm)'], bins=20, kde=True, color='skyblue')
plt.title('Distribution of Sepal Width')
plt.xlabel('Sepal Width (cm)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
# 4. Scatter plot: Sepal Length vs Petal Length by Species
plt.figure(figsize=(7, 5))
sns.scatterplot(data=df, x='sepal length (cm)', y='petal length (cm)', hue='species', palette='deep')
plt.title('Sepal Length vs Petal Length by Species')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Petal Length (cm)')
plt.legend(title='Species')
plt.tight_layout()
plt.show()
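
The bar chart above shows only the average petal length. A box plot is one way to also show the spread within each species. The following is a minimal sketch using matplotlib alone; the box plot is an addition to the four required plot types, and `species_names` and `petal_by_species` are illustrative names.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame as in section 2
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# One array of petal lengths per species, in category order
species_names = list(df['species'].cat.categories)
petal_by_species = [
    df.loc[df['species'] == name, 'petal length (cm)'].to_numpy()
    for name in species_names
]

fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(petal_by_species)                      # boxes are drawn at positions 1..n
ax.set_xticks(range(1, len(species_names) + 1))
ax.set_xticklabels(species_names)
ax.set_title('Petal Length Distribution per Species')
ax.set_xlabel('Species')
ax.set_ylabel('Petal Length (cm)')
fig.tight_layout()
plt.show()
```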

## 6. Findings and Observations
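- The Iris dataset contains 150 samples from three species (setosa, versicolor, virginica), and the species averages differ for every feature.
- The bar chart and scatter plot show that petal length separates setosa cleanly from the other two species, while versicolor and virginica overlap in sepal length and are only partly separated by petal length.
- No missing values were found in the dataset, so the cleaning step made no changes.
- In the scatter plot, petal length increases with sepal length overall, and the points cluster by species.
- These differences, especially in the petal measurements, make the features useful inputs for classification and further analysis.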
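
To illustrate the point about classification, the sketch below checks how well a single cutoff on petal length identifies setosa. The 2.5 cm threshold is an assumption chosen by eye from the plots, not a fitted model, and `THRESHOLD_CM`, `predicted_setosa`, and `actual_setosa` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame as in section 2
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Assumed cutoff, picked by eye from the petal-length plots (not learned from data)
THRESHOLD_CM = 2.5

predicted_setosa = df['petal length (cm)'] < THRESHOLD_CM
actual_setosa = df['species'] == 'setosa'

# Fraction of samples where the rule agrees with the true label
accuracy = (predicted_setosa == actual_setosa).mean()
print(f"Setosa-vs-rest accuracy with a {THRESHOLD_CM} cm cutoff: {accuracy:.2%}")
```

A rule this simple only separates setosa from the rest. Distinguishing versicolor from virginica would need more features or a proper classifier.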