# Week 6: Exploratory Data Analysis with Visualisation

In this notebook, we'll learn how to create informative and beautiful plots using **seaborn** and the famous **Palmer Penguins** dataset.

Data visualisation is crucial for:
- **Understanding your data** - Spotting patterns, outliers, and relationships
- **Communicating findings** - Making your analysis clear to others
- **Guiding analysis** - Knowing what questions to ask next

We'll cover these essential plot types:
1. **Histograms** - Distribution of numerical variables
2. **Bar plots** - Frequency counts and averages by category
3. **Scatter plots** - Relationships between numerical variables
4. **Box plots** - Distribution summaries and outlier detection
5. **Violin plots** - Detailed distribution shapes
6. **Pair plots** - Multiple relationships at once
7. **Correlation heatmaps** - Strength of relationships

Most importantly, we'll learn how to **customise** these plots to make them publication-ready!

## Load and Explore the Palmer Penguins Dataset

The Palmer Penguins dataset contains measurements of penguins from three species on three islands in the Palmer Archipelago, Antarctica. It's well suited to learning data visualisation because it has:
- **Categorical variables**: species, island, sex
- **Numerical variables**: bill length, bill depth, flipper length, body mass
- **Real-world context**: Easy to understand and interpret

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('penguins_size.csv')

print(f'Dataset shape: {df.shape}')
print('\nFirst few rows:')
df.head()

## Essential Data Quality Checks

Before creating any visualisations, we must understand our data quality. Let's check for missing values, data types, and basic statistics.

### Check for Missing Values

In [None]:
print('Missing values in each column:')
missing_values = df.isnull().sum()
print(missing_values)
print(f'\nTotal missing values: {missing_values.sum()}')
print(f'Percentage of complete rows: {(len(df) - df.isnull().any(axis=1).sum()) / len(df) * 100:.1f}%')

### Examine Data Types

In [None]:
print('Data types:')
print(df.dtypes)
print('\nUnique values in categorical columns:')
categorical_cols = ['species', 'island', 'sex']
for col in categorical_cols:
    if col in df.columns:
        print(f'{col}: {df[col].unique()}')

### Summary Statistics

The `.describe()` method gives us key statistics for numerical variables. Let's interpret what these numbers tell us.

In [None]:
print('Summary statistics for numerical variables:')
df.describe()

**How to interpret these statistics:**
- **count**: Number of non-missing values
- **mean**: Average value - gives us the center of the data
- **std**: Standard deviation - tells us how spread out the data is
- **min/max**: Range of the data - helps spot potential outliers
- **25%, 50%, 75%**: Quartiles - show us the distribution shape

**Quick insights from our data:**
- Body mass ranges from 2700g to 6300g (quite a range!)
- Bill length averages around 44mm
- Some variables have missing values (count < 344 total rows)
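
To see why `count` is a quick missing-value check, here is a minimal sketch on a small made-up DataFrame (the `toy` table and its numbers are invented for illustration and are not taken from the penguin data):

```python
import numpy as np
import pandas as pd

# Made-up toy data (not the real penguin measurements); NaN marks a missing value
toy = pd.DataFrame({
    'body_mass_g': [3750.0, 3800.0, np.nan, 3450.0, 5000.0],
    'bill_length_mm': [39.1, 39.5, 40.3, np.nan, np.nan],
})

summary = toy.describe()
total_rows = len(toy)

# describe() counts only non-missing values, so count < total_rows flags gaps
missing_per_column = total_rows - summary.loc['count']
print(missing_per_column)

# The 50% row is the median of the non-missing values
print(summary.loc['50%', 'body_mass_g'])
```

The same subtraction works on the real `df`, and should agree with the `isnull().sum()` check from earlier.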

## 1. Histograms - Understanding Distributions

Histograms show us the **distribution** of a numerical variable. They answer questions like:
- What's the typical value?
- How spread out is the data?
- Are there any unusual patterns or outliers?
- Is the distribution symmetric or skewed?

### Basic Histogram

Let's start with the simplest possible histogram of penguin body mass.

In [None]:
# Create a basic histogram
plt.figure(figsize=(8, 6))
sns.histplot(data=df, x='body_mass_g')
plt.show()

**How to read this histogram:**
- **X-axis**: Body mass in grams
- **Y-axis**: Count (frequency) - how many penguins have that mass
- **Bars**: Each bar represents a range of masses
- **Pattern**: The distribution appears to have more than one peak - this suggests there may be different groups in the data!
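
If you're curious what the bars actually are, the sketch below bins a handful of made-up masses with `np.histogram`. The values and the choice of four bins are assumptions for illustration; seaborn picks its own bins automatically unless you set `bins` or `binwidth` in `sns.histplot`.

```python
import numpy as np

# Made-up body masses in grams, for illustration only
masses = np.array([2900, 3100, 3300, 3400, 3600, 3700, 4800, 5000, 5200, 5400])

# Four equal-width bins between 2500 g and 6500 g
counts, edges = np.histogram(masses, bins=4, range=(2500, 6500))

# Each bar covers one range of masses; its height is the count in that range
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:.0f}-{right:.0f} g: {count} penguins')
```

Two busy bins separated by a quieter one is exactly the kind of "multiple peaks" pattern to look out for.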

### Adding a Title

Let's make our plot more professional by adding a clear title.

In [None]:
# Add a title to make the plot clearer
plt.figure(figsize=(8, 6))
sns.histplot(data=df, x='body_mass_g')
plt.title('Distribution of Penguin Body Mass')
plt.show()

### Improving Axis Labels

The default axis labels aren't very readable. Let's make them more descriptive.

In [None]:
# Improve the axis labels
plt.figure(figsize=(8, 6))
sns.histplot(data=df, x='body_mass_g')
plt.title('Distribution of Penguin Body Mass')
plt.xlabel('Body Mass (grams)')
plt.ylabel('Number of Penguins')
plt.show()

### Adding Color by Species

Now let's see if the multiple peaks are explained by different penguin species. We'll use the `hue` parameter to color by species.

In [None]:
# Color by species to see if that explains the pattern
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='body_mass_g', hue='species')
plt.title('Distribution of Penguin Body Mass by Species')
plt.xlabel('Body Mass (grams)')
plt.ylabel('Number of Penguins')
plt.show()

**Interpretation:**
- **Adelie penguins** (blue): Light, mostly 3000-4500g
- **Chinstrap penguins** (orange): Similar to Adelie, mostly 3000-4000g
- **Gentoo penguins** (green): Heaviest, mostly 4500-6000g

This explains the multiple peaks we saw earlier: Gentoo penguins form a separate, heavier group, while Adelie and Chinstrap penguins overlap.

### Exploring Different Color Palettes

Seaborn offers many color palettes. Let's try a few to see how they affect readability.

In [None]:
# Try a different color palette
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='body_mass_g', hue='species', palette='Set2')
plt.title('Distribution of Penguin Body Mass by Species')
plt.xlabel('Body Mass (grams)')
plt.ylabel('Number of Penguins')
plt.show()

### Another Palette Option

In [None]:
# Try viridis, a perceptually uniform palette that stays readable for most colorblind viewers
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='body_mass_g', hue='species', palette='viridis')
plt.title('Distribution of Penguin Body Mass by Species')
plt.xlabel('Body Mass (grams)')
plt.ylabel('Number of Penguins')
plt.show()

**Popular seaborn palettes:**
- `'Set1'`, `'Set2'`, `'Set3'`: Distinct colors for categories
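- `'viridis'`, `'plasma'`, `'inferno'`: Colorblind-friendly sequential gradients, designed for ordered or continuous values (for categories, `'colorblind'` is the dedicated option)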
- `'husl'`, `'bright'`: Vibrant colors
- `'pastel'`, `'muted'`: Softer colors for presentations
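
One way to keep species colors consistent from plot to plot is to build the palette once and reuse it. This is a small sketch, assuming seaborn's built-in `'colorblind'` palette; the names `species_order` and `species_palette` are made up for the example.

```python
import seaborn as sns

species_order = ['Adelie', 'Chinstrap', 'Gentoo']

# 'colorblind' is one of seaborn's built-in categorical palettes
colors = sns.color_palette('colorblind', n_colors=len(species_order))

# Map each species to a fixed color; a dict like this can be passed as palette=
species_palette = dict(zip(species_order, colors))

for name, rgb in species_palette.items():
    print(name, tuple(round(channel, 2) for channel in rgb))
```

You could then write `palette=species_palette` in any plot that uses `hue='species'`, so each species keeps the same color throughout the notebook.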

## 2. Bar Plots - Counting Categories

Bar plots are perfect for showing **counts** or **averages** of categorical data. They answer questions like:
- How many of each category do we have?
- Which category is most/least common?
- How do averages compare across groups?

### Basic Bar Plot - Counting Species

Let's start with the most common use: counting how many penguins of each species we have.

In [None]:
# Count how many penguins of each species
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='species')
plt.show()

**How to read this bar plot:**
- **X-axis**: The three penguin species
- **Y-axis**: Count (how many penguins)
- **Bars**: Height shows the frequency
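- **Pattern**: The species are not equally represented - Adelie is the most common (152 penguins), followed by Gentoo (124), with Chinstrap the least common (68)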
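
Bar heights can be hard to read precisely, so it helps to print the numbers behind the plot. Here is a minimal sketch using `value_counts()` on made-up labels (the `toy` rows are invented, not the real counts):

```python
import pandas as pd

# Made-up species labels, for illustration only
toy = pd.DataFrame({'species': ['Adelie', 'Gentoo', 'Adelie', 'Chinstrap',
                                'Gentoo', 'Adelie']})

# Counts are sorted from most to least common
counts = toy['species'].value_counts()

# normalize=True returns proportions instead of counts
shares = toy['species'].value_counts(normalize=True) * 100

print(counts)
print(shares.round(1))
```

Running `df['species'].value_counts()` on the real data gives the exact heights of the bars above.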

### Adding a Title and Better Labels

In [None]:
# Make it more professional
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='species')
plt.title('Number of Penguins by Species')
plt.xlabel('Species')
plt.ylabel('Count')
plt.show()

### Adding Color

In [None]:
# Add some color to make it more appealing
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='species', palette='Set2', hue='species', legend=False)
plt.title('Number of Penguins by Species')
plt.xlabel('Species')
plt.ylabel('Count')
plt.show()

### Bar Plot by Island

Let's see how penguins are distributed across the three islands.

In [None]:
# Count penguins by island
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='island', palette='viridis', hue='island', legend=False)
plt.title('Number of Penguins by Island')
plt.xlabel('Island')
plt.ylabel('Count')
plt.show()

### Grouped Bar Plot - Species by Island

Now let's see which species live on which islands using the `hue` parameter.

In [None]:
# Show species distribution across islands
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='island', hue='species', palette='Set1')
plt.title('Penguin Species Distribution Across Islands')
plt.xlabel('Island')
plt.ylabel('Count')
plt.show()

**Interpretation:**
- **Biscoe Island**: Only Adelie and Gentoo penguins
- **Dream Island**: Only Adelie and Chinstrap penguins
- **Torgersen Island**: Only Adelie penguins

Adelie penguins appear on all three islands, while Gentoo and Chinstrap penguins were each recorded on only one.
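
The grouped bar plot is a picture of a two-way count table. As a sketch, `pd.crosstab` builds that table directly; the `toy` rows below are made up to mimic the pattern and are not the real counts.

```python
import pandas as pd

# Made-up rows that mimic the island/species pattern (not the real counts)
toy = pd.DataFrame({
    'island':  ['Biscoe', 'Biscoe', 'Biscoe', 'Dream', 'Dream', 'Torgersen'],
    'species': ['Adelie', 'Gentoo', 'Gentoo', 'Adelie', 'Chinstrap', 'Adelie'],
})

# Rows are islands, columns are species, cells are counts
table = pd.crosstab(toy['island'], toy['species'])
print(table)
```

A zero in the table corresponds to a missing bar in the plot, which makes "only on this island" claims easy to verify.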

### Bar Plot for Averages

Bar plots can also show averages instead of counts. Let's see the average body mass by species.

In [None]:
# Show average body mass by species
plt.figure(figsize=(8, 6))
sns.barplot(data=df, x='species', y='body_mass_g', palette='muted', hue='species', legend=False)
plt.title('Average Body Mass by Species')
plt.xlabel('Species')
plt.ylabel('Average Body Mass (grams)')
plt.show()

**How to read this average bar plot:**
- **Bars**: Height shows the average (mean) body mass
- **Error bars**: Show the 95% confidence interval for the mean (seaborn's default), estimated by bootstrapping
- **Pattern**: Gentoo penguins are clearly the heaviest on average

### Adding Data Labels

Let's add the exact values on top of each bar to make it even clearer.

In [None]:
# Add data labels on the bars
plt.figure(figsize=(8, 6))
ax = sns.barplot(data=df, x='species', y='body_mass_g', palette='muted', hue='species', legend=False)
plt.title('Average Body Mass by Species')
plt.xlabel('Species')
plt.ylabel('Average Body Mass (grams)')

# Add value labels above each bar (the +50 offset is in grams)
for bar in ax.patches:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width() / 2, height + 50,
            f'{height:.0f}g', ha='center', va='bottom')

plt.show()
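
As an alternative to positioning the text by hand, matplotlib (3.4 or newer) offers `Axes.bar_label`. The sketch below uses plain matplotlib bars and made-up averages to keep it short; it is one possible approach, not the only one.

```python
import matplotlib.pyplot as plt

# Made-up average masses in grams, for illustration only
species = ['Adelie', 'Chinstrap', 'Gentoo']
avg_mass = [3700.0, 3730.0, 5080.0]

fig, ax = plt.subplots(figsize=(8, 6))
bars = ax.bar(species, avg_mass)

# bar_label places one text label per bar; padding is in points
labels = ax.bar_label(bars, fmt='%.0fg', padding=3)

ax.set_ylabel('Average Body Mass (grams)')
plt.show()
```

With a seaborn bar plot, the bars are available through `ax.containers`, so the same call can be applied to each container in turn.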

## 3. Scatter Plots - Exploring Relationships

Scatter plots show the **relationship** between two numerical variables. They help us answer:
- Are two variables related?
- Is the relationship positive or negative?
- Are there any outliers?
- Do different groups show different patterns?

### Basic Scatter Plot

Let's explore the relationship between bill length and bill depth.

In [None]:
# Basic scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='bill_length_mm', y='bill_depth_mm')
plt.show()

**How to read this scatter plot:**
- **Each dot**: Represents one penguin
- **X-axis**: Bill length in millimeters
- **Y-axis**: Bill depth in millimeters
- **Pattern**: There seems to be a weak negative relationship overall - longer bills tend to be slightly less deep

### Adding Titles and Labels

In [None]:
# Add proper labels
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='bill_length_mm', y='bill_depth_mm')
plt.title('Relationship Between Bill Length and Bill Depth')
plt.xlabel('Bill Length (mm)')
plt.ylabel('Bill Depth (mm)')
plt.show()

### Coloring by Species

Let's see if the relationship differs by species.

In [None]:
# Color by species
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='bill_length_mm', y='bill_depth_mm', hue='species')
plt.title('Bill Length vs Bill Depth by Species')
plt.xlabel('Bill Length (mm)')
plt.ylabel('Bill Depth (mm)')
plt.show()

**Interpretation:**
- **Each species clusters together** - they have distinct bill shapes
- **Adelie**: Short, deep bills
- **Chinstrap**: Long, deep bills
- **Gentoo**: Long, shallow bills

Within each species, longer bills actually tend to be deeper; the overall negative trend comes from mixing the species together. This is a great example of how grouping reveals hidden patterns!
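
Grouping can do more than separate clusters: a trend in the pooled data can even point the other way inside each group. The sketch below shows this on made-up measurements for two hypothetical species `A` and `B` (the numbers are invented to make the effect obvious).

```python
import pandas as pd

# Made-up bill measurements (mm) for two species, for illustration only
toy = pd.DataFrame({
    'species': ['A'] * 4 + ['B'] * 4,
    'bill_length_mm': [36, 38, 40, 42, 44, 46, 48, 50],
    'bill_depth_mm':  [17, 18, 19, 20, 13, 14, 15, 16],
})

# Correlation when all rows are pooled together
pooled_r = toy['bill_length_mm'].corr(toy['bill_depth_mm'])

# Correlation computed separately inside each species
within_r = {
    name: group['bill_length_mm'].corr(group['bill_depth_mm'])
    for name, group in toy.groupby('species')
}

print(f'pooled: {pooled_r:.2f}')
print({name: round(r, 2) for name, r in within_r.items()})
```

The pooled correlation comes out negative while both within-species correlations are positive, which is why it's worth repeating any analysis with `hue` or `groupby` before trusting an overall trend.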

### Making Points Larger and More Transparent

When points overlap, we can make them larger and semi-transparent to see the density better.

In [None]:
# Adjust point size and transparency
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='bill_length_mm', y='bill_depth_mm', 
                hue='species', s=100, alpha=0.7)
plt.title('Bill Length vs Bill Depth by Species')
plt.xlabel('Bill Length (mm)')
plt.ylabel('Bill Depth (mm)')
plt.show()

### Different Shapes by Sex

We can use both color and shape to show two categorical variables at once.

In [None]:
# Use both color (species) and shape (sex)
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='bill_length_mm', y='bill_depth_mm', 
                hue='species', style='sex', s=100, alpha=0.8)
plt.title('Bill Length vs Bill Depth by Species and Sex')
plt.xlabel('Bill Length (mm)')
plt.ylabel('Bill Depth (mm)')
plt.show()

## 4. Box Plots - Distribution Summaries

Box plots show the **distribution** of a numerical variable, especially useful for:
- Comparing distributions across groups
- Identifying outliers
- Seeing the median, quartiles, and range at a glance
- Understanding data spread and skewness

### Understanding Box Plot Components

Before we create one, let's understand what each part means:
- **Box**: Contains the middle 50% of the data (25th to 75th percentile)
- **Line in box**: The median (50th percentile)
- **Whiskers**: Extend to the furthest data points within 1.5 × IQR of the box edges
- **Dots**: Outliers beyond the whiskers
- **IQR**: Interquartile Range (75th - 25th percentile)
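
These components can be worked out by hand. Here is a minimal sketch with made-up masses, using `np.percentile` for the quartiles; quartile conventions can differ slightly between tools, so treat the exact numbers as illustrative.

```python
import numpy as np

# Made-up body masses in grams, for illustration only
masses = np.array([3000, 3400, 3600, 3700, 3800, 4000, 4200, 5600])

q1, median, q3 = np.percentile(masses, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outlier dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme points that are still inside the fences
inside = masses[(masses >= lower_fence) & (masses <= upper_fence)]
outliers = masses[(masses < lower_fence) | (masses > upper_fence)]

print(f'box: {q1:.0f} to {q3:.0f} g, median {median:.0f} g')
print(f'whiskers: {inside.min()} to {inside.max()} g')
print(f'outliers: {outliers.tolist()}')
```

Note that the whiskers end at actual data points, not at the fences themselves.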

### Basic Box Plot

In [None]:
# Basic box plot of body mass by species
plt.figure(figsize=(8, 6))
sns.boxplot(data=df, x='species', y='body_mass_g')
plt.show()

**How to read this box plot:**
- **Adelie**: Median around 3700g, fairly compact distribution
- **Chinstrap**: Median around 3700g, similar to Adelie
- **Gentoo**: Median around 5000g, much heavier with wider spread
- **Outliers**: A few unusually light or heavy penguins (shown as dots)

### Adding Titles and Better Formatting

In [None]:
# Add proper formatting
plt.figure(figsize=(8, 6))
sns.boxplot(data=df, x='species', y='body_mass_g', palette='Set2', hue='species', legend=False)
plt.title('Distribution of Body Mass by Species')
plt.xlabel('Species')
plt.ylabel('Body Mass (grams)')
plt.show()

### Box Plot for Bill Length

In [None]:
# Compare bill lengths across species
plt.figure(figsize=(8, 6))
sns.boxplot(data=df, x='species', y='bill_length_mm', palette='viridis', hue='species', legend=False)
plt.title('Distribution of Bill Length by Species')
plt.xlabel('Species')
plt.ylabel('Bill Length (mm)')
plt.show()

### Grouped Box Plot by Sex

Let's see if there are differences between male and female penguins within each species.

In [None]:
# Show differences by sex within each species
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='species', y='body_mass_g', hue='sex', palette='Set1')
plt.title('Body Mass Distribution by Species and Sex')
plt.xlabel('Species')
plt.ylabel('Body Mass (grams)')
plt.show()

**Interpretation:**
- **Males are consistently heavier** than females across all species
- **Gentoo penguins** show the largest size difference between sexes
- **Sexual dimorphism** (size differences) is clear in all three species

## 5. Violin Plots - Detailed Distribution Shapes

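Violin plots combine the summary of a box plot with a smoothed density curve (a kernel density estimate), which shows the shape of the distribution much like a smoothed histogram. They show:
- **Distribution shape**: Is it symmetric, skewed, or multi-modal?
- **Density**: Where most of the data points are concentrated
- **Quartiles**: Like a box plot, but with more detail
- **Tails**: Long, thin tails hint at unusual values, although individual outliers are not drawn as separate points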

### Basic Violin Plot

In [None]:
# Basic violin plot
plt.figure(figsize=(8, 6))
sns.violinplot(data=df, x='species', y='body_mass_g')
plt.show()

**How to read violin plots:**
- **Width**: Shows density - wider areas have more penguins
- **Shape**: Shows the distribution shape
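- **White marker in the center**: The median
- **Thick bar**: Interquartile range (25th-75th percentile)
- **Thin line**: Whiskers, extending up to 1.5 × IQR beyond the quartiles (as in a box plot) rather than the full range of the data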
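
The outline of a violin is a kernel density estimate. The sketch below builds one by hand with numpy to show the idea. The helper `simple_kde`, the made-up masses, and the hand-picked bandwidth of 150 g are all assumptions for illustration; seaborn chooses its bandwidth automatically and its implementation differs.

```python
import numpy as np

# Made-up body masses in grams, for illustration only
masses = np.array([3400, 3500, 3600, 3650, 3700, 3750, 3800, 4100], dtype=float)

def simple_kde(points, grid, bandwidth):
    """Average one Gaussian bump per data point, evaluated on the grid."""
    z = (grid[:, None] - points[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

grid = np.linspace(2500, 5000, 501)
density = simple_kde(masses, grid, bandwidth=150.0)  # bandwidth chosen by hand

# The violin is widest where the density is highest
peak = grid[density.argmax()]
print(f'density peaks near {peak:.0f} g')

# A density curve encloses a total area of about 1
step = grid[1] - grid[0]
area = (density * step).sum()
print(f'area under curve: {area:.3f}')
```

A smaller bandwidth gives a bumpier violin and a larger one gives a smoother violin, so the apparent shape depends partly on this setting.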

### Adding Color and Formatting

In [None]:
# Add color and proper labels
plt.figure(figsize=(8, 6))
sns.violinplot(data=df, x='species', y='body_mass_g', palette='muted', hue='species', legend=False)
plt.title('Body Mass Distribution Shape by Species')
plt.xlabel('Species')
plt.ylabel('Body Mass (grams)')
plt.show()

### Violin Plot with Split by Sex

We can split each violin to compare male and female distributions side by side.

In [None]:
# Split violins by sex
plt.figure(figsize=(10, 6))
sns.violinplot(data=df, x='species', y='body_mass_g', hue='sex', 
               split=True, palette='Set2')
plt.title('Body Mass Distribution by Species and Sex')
plt.xlabel('Species')
plt.ylabel('Body Mass (grams)')
plt.show()

**Interpretation:**
- **Two halves of each violin**: One half per sex - check the legend to see which side is which (the order follows the data unless you pass `hue_order`)
- **Clear separation**: Males consistently heavier across all species
- **Distribution shapes**: Most are roughly normal, some slightly skewed

## 6. Pair Plots - Multiple Relationships at Once

Pair plots show **all possible relationships** between numerical variables in one visualisation. They're perfect for:
- **Exploratory analysis**: Quickly seeing all relationships
- **Finding patterns**: Spotting interesting correlations
- **Identifying outliers**: Unusual points across multiple dimensions
- **Understanding data structure**: How variables relate to each other

### Basic Pair Plot

In [None]:
# Create a pair plot of all numerical variables
sns.pairplot(df)
plt.show()

**How to read pair plots:**
- **Diagonal**: Histograms of each variable (these become density curves once `hue` is set)
- **Off-diagonal**: Scatter plots between pairs of variables
- **Upper triangle**: Mirrors the lower triangle with the axes swapped
- **Each subplot**: Shows one relationship
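
To get a feel for how quickly a pair plot grows, this small sketch counts the panels for the four measurement columns used in this notebook. It only does the counting; it does not draw anything.

```python
from itertools import combinations

numeric_columns = ['bill_length_mm', 'bill_depth_mm',
                   'flipper_length_mm', 'body_mass_g']

k = len(numeric_columns)

# Each unordered pair of variables appears twice in the grid (mirrored)
unique_pairs = list(combinations(numeric_columns, 2))

print(f'grid size: {k} x {k} = {k * k} panels')
print(f'diagonal panels: {k}')
print(f'distinct relationships: {len(unique_pairs)}')
for x_var, y_var in unique_pairs:
    print(f'  {x_var} vs {y_var}')
```

Because the upper triangle repeats the lower one, `sns.pairplot` also accepts `corner=True` to draw only the lower triangle.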

### Pair Plot Colored by Species

Let's add species information to see how the relationships differ by group.

In [None]:
# Color by species to see group patterns
sns.pairplot(df, hue='species', palette='Set1')
plt.show()

**Key insights from the pair plot:**
- **Species cluster clearly** in most variable combinations
- **Bill dimensions** show strong species separation
- **Body mass and flipper length** are highly correlated
- **Each species** occupies a distinct region in the measurement space

### Focused Pair Plot

We can focus on just the most interesting variables to make the plot clearer.

In [None]:
# Focus on just bill measurements and body mass
selected_vars = ['bill_length_mm', 'bill_depth_mm', 'body_mass_g']
sns.pairplot(df[selected_vars + ['species']], hue='species', palette='viridis')
plt.show()

## 7. Correlation Heatmap - Strength of Relationships

Correlation heatmaps show the **strength of linear relationships** between all numerical variables. They help us:
- **Quantify relationships**: Exact correlation coefficients
- **Spot multicollinearity**: Variables that measure similar things
- **Guide analysis**: Which variables to investigate further
- **Understand data structure**: Overall pattern of relationships

### Understanding Correlation Coefficients

Correlation coefficients range from -1 to +1:
- **+1**: Perfect positive correlation (as one increases, the other increases)
- **0**: No linear relationship
- **-1**: Perfect negative correlation (as one increases, the other decreases)

As a rough rule of thumb (the exact cut-offs vary by field):
- **±0.7 to ±1**: Strong correlation
- **±0.3 to ±0.7**: Moderate correlation
- **±0 to ±0.3**: Weak correlation
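
To see where these numbers come from, the sketch below computes Pearson's r by hand and compares it with pandas. The flipper and mass values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up flipper lengths (mm) and body masses (g), for illustration only
flipper = pd.Series([181, 186, 195, 193, 210, 220, 215, 230], dtype=float)
mass = pd.Series([3750, 3800, 3250, 3450, 4500, 5000, 4700, 5700], dtype=float)

# Pearson's r: how the deviations from the mean move together,
# scaled by the overall spread of each variable
dx = flipper - flipper.mean()
dy = mass - mass.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses the Pearson coefficient by default
r_pandas = flipper.corr(mass)

print(f'manual: {r_manual:.3f}, pandas: {r_pandas:.3f}')
```

Because of the scaling, r has no units: measuring mass in kilograms instead of grams would give the same value.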

### Calculate Correlations

In [None]:
# Calculate correlation matrix for numerical variables
numerical_vars = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
correlation_matrix = df[numerical_vars].corr()
print('Correlation matrix:')
print(correlation_matrix.round(3))

### Basic Heatmap

In [None]:
# Create a basic correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix)
plt.show()

### Adding Numbers and Better Colors

Let's make the heatmap more informative by showing the actual correlation values.

In [None]:
# Add correlation values and better formatting
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, fmt='.2f')
plt.title('Correlation Matrix of Penguin Measurements')
plt.show()

**How to read this heatmap:**
- **Colors**: Red = positive correlation, Blue = negative correlation
- **Intensity**: Darker colors = stronger correlations
- **Numbers**: Exact correlation coefficients
- **Diagonal**: Always 1.0 (each variable correlates perfectly with itself)
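
A correlation matrix is symmetric, so half of the cells repeat the other half. One common way to reduce clutter is to hide the upper triangle with a mask, sketched here on a made-up `toy` table (the values are invented for illustration).

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up measurements, for illustration only
toy = pd.DataFrame({
    'bill_length_mm': [39.1, 46.5, 50.0, 38.2, 47.6, 45.2],
    'flipper_length_mm': [181, 210, 196, 185, 215, 198],
    'body_mass_g': [3750, 4500, 3900, 3600, 5400, 3950],
})
corr = toy.corr()

# True marks the cells to hide: everything above the diagonal
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)

plt.figure(figsize=(6, 5))
sns.heatmap(corr, mask=mask, annot=True, cmap='coolwarm', center=0,
            vmin=-1, vmax=1, square=True, fmt='.2f')
plt.title('Lower Triangle of the Correlation Matrix')
plt.show()
```

Setting `vmin=-1` and `vmax=1` pins the color scale to the full correlation range, so colors stay comparable between different heatmaps.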

### Interpreting the Correlations

**Key findings from our correlation matrix:**
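- **Flipper length & body mass** (0.87): Very strong positive correlation - bigger penguins have longer flippers
- **Bill length & flipper length** (0.66): Moderate positive correlation
- **Bill length & body mass** (0.60): Moderate positive correlation
- **Bill length & bill depth** (-0.24): Weak negative correlation across all penguins combined

Most of these relationships make biological sense - larger penguins tend to have larger features overall. The negative correlations involving bill depth are the exception: they mainly reflect differences between species (Gentoo penguins are large but have shallow bills), so check them within each species before drawing conclusions.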

### Custom Color Palette

Let's try a different color scheme that might be easier to interpret.

In [None]:
# Try a different color palette
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='RdBu_r', center=0,
            square=True, fmt='.2f', cbar_kws={'label': 'Correlation Coefficient'})
plt.title('Correlation Matrix of Penguin Measurements')
plt.show()

## Summary and Best Practices

You've learned how to create and customise the essential plots for exploratory data analysis. Here's a summary of what we covered:

### Plot Types and When to Use Them:
- **Histograms**: Distribution of single numerical variables
- **Bar plots**: Counts of categories or averages by group
- **Scatter plots**: Relationships between two numerical variables
- **Box plots**: Distribution summaries and outlier detection
- **Violin plots**: Detailed distribution shapes
- **Pair plots**: Multiple relationships at once
- **Correlation heatmaps**: Strength of all relationships

### Customisation Elements:
- **Titles**: Always include descriptive titles
- **Axis labels**: Make them clear and include units
- **Colors**: Use appropriate palettes for your audience
- **Legends**: Ensure they're clear and positioned well
- **Size**: Make plots large enough to read easily

### Best Practices:
1. **Start simple** - Basic plot first, then add complexity
2. **Tell a story** - Each plot should answer a specific question
3. **Consider your audience** - Choose colors and complexity appropriately
4. **Check for outliers** - Always investigate unusual points
5. **Group by categories** - Use `hue` to reveal hidden patterns
6. **Iterate** - Try different plot types to find the best representation

### Next Steps:
- Practice with your own datasets
- Explore more seaborn plot types (strip plots, swarm plots, etc.)
- Learn about statistical plotting (regression lines, confidence intervals)
- Study color theory for better visualisations