# Advanced Data Visualization in Python

Welcome to the **Advanced Data Visualization** class! In this notebook, we will build on the basic plots we've already learned (line plots, bar plots, scatter plots, and histograms) and introduce more complex customizations and new chart types. Effective data visualization is a crucial skill in data analytics because it lets us communicate insights and patterns in data clearly and efficiently. By mastering advanced visualization techniques, you'll be able to create more informative and compelling graphs for data exploration and presentation.

We will also utilize data manipulation techniques like `groupby` and `merge` to prepare data for visualization. These techniques are essential for summarizing and combining datasets in ways that make them suitable for plotting.

## Learning Objectives

By the end of this notebook, you should be able to:

- Create and customize advanced plots using Seaborn and Matplotlib.
- Use data manipulation techniques to prepare data for visualization.
- Interpret advanced plots to extract meaningful insights.
- Utilize subplots and layouts to compare multiple visualizations.

## Table of Contents
1. Importing Libraries and Datasets
2. Advanced Customizations
3. Box Plots and Violin Plots
4. Heatmaps
5. Pair Plots and Joint Plots
6. Time Series Visualization
7. Grouping and Merging for Visualization
8. Subplots and Layouts
9. Conclusion

Throughout the notebook, **Practice Exercises** are included to reinforce your learning.

### 1. Importing Libraries and Datasets

Before we start plotting, we need to import the necessary libraries and load some sample datasets. The libraries we'll use are:

- **Pandas**: Used for data manipulation and analysis.
- **NumPy**: Used for numerical computations.
- **Matplotlib**: The foundational plotting library in Python.
- **Seaborn**: Built on top of Matplotlib, Seaborn provides a high-level interface for creating attractive and informative statistical graphics.

We'll also load some built-in datasets from Seaborn for practice:

- **Tips**: Data about tips received by waiters in a restaurant.
- **Iris**: Measurements of different species of iris flowers.
- **Flights**: Number of passengers flying each month over several years.
- **Diamonds**: Characteristics of diamonds, including price, carat, and quality.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization styles
sns.set_style('whitegrid')
%matplotlib inline

# Load sample datasets
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')
flights = sns.load_dataset('flights')
diamonds = sns.load_dataset('diamonds')  # Additional dataset for practice

Now that we've loaded the datasets, let's take a quick look at them to understand what kind of data we're working with. We'll display the first five rows of each dataset using the `head()` method.

In [None]:
# Display first five rows of the tips dataset
tips.head()

**The Tips Dataset**

The `tips` dataset contains information about the tips received by waiters in a restaurant. It includes variables like:

- `total_bill`: The total bill amount in US dollars.
- `tip`: The tip amount.
- `sex`: Gender of the person paying.
- `smoker`: Whether the party included smokers.
- `day`: Day of the week.
- `time`: Time of the day (Lunch or Dinner).
- `size`: Number of people at the table.

Understanding the variables will help us decide which plots to create.

In [None]:
# Display first five rows of the iris dataset
iris.head()

**The Iris Dataset**

The `iris` dataset is a classic dataset in machine learning and statistics. It contains measurements of three species of iris flowers:

- `sepal_length`: Length of the sepal.
- `sepal_width`: Width of the sepal.
- `petal_length`: Length of the petal.
- `petal_width`: Width of the petal.
- `species`: The species of the iris flower (`setosa`, `versicolor`, or `virginica`).

This dataset is often used for classification tasks.

In [None]:
# Display first five rows of the flights dataset
flights.head()

**The Flights Dataset**

The `flights` dataset contains the number of passengers flying each month over a period of years. It includes:

- `year`: The year of the observation.
- `month`: The month of the observation.
- `passengers`: The number of passengers that flew in that month.

This dataset is useful for time series analysis.
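To treat columns like these as a time series, the `year` and `month` values are usually combined into a single datetime column first. The sketch below assumes month abbreviations such as `Jan` and an English locale for the `%b` format code. The `toy` frame is a hypothetical stand-in with made-up numbers, not the real `flights` data.

```python
import pandas as pd

# Hypothetical stand-in with the same column layout as `flights` (made-up values).
toy = pd.DataFrame({
    "year": [1949, 1949, 1950],
    "month": ["Jan", "Feb", "Jan"],
    "passengers": [100, 110, 120],
})

# Build "1949-Jan"-style strings and parse them into timestamps.
toy["date"] = pd.to_datetime(
    toy["year"].astype(str) + "-" + toy["month"].astype(str),
    format="%Y-%b",
)

# A datetime index lets pandas and Matplotlib order the points chronologically.
series = toy.set_index("date")["passengers"].sort_index()
print(series)
```

With a datetime index in place, `series.plot()` would draw the values in chronological order.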

In [None]:
# Display first five rows of the diamonds dataset
diamonds.head()

**The Diamonds Dataset**

The `diamonds` dataset contains information about diamond characteristics and their prices. Variables include:

- `carat`: The weight of the diamond.
- `cut`: The quality of the cut (Fair, Good, Very Good, Premium, Ideal).
- `color`: Diamond color, from J (worst) to D (best).
- `clarity`: A measurement of how clear the diamond is.
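- `depth`, `table`: Proportion measurements: total depth as a percentage of the diamond's average diameter, and the width of the top facet relative to the widest point.
- `price`: Price of the diamond in US dollars.
- `x`, `y`, `z`: Length, width, and depth in millimeters.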

**Practice Exercise 1:**

- **Task:** Load the `titanic` dataset using `sns.load_dataset('titanic')` and display the first five rows.
- **Goal:** Familiarize yourself with another dataset that we'll use in practice exercises.

In [None]:
# Your code here
titanic = sns.load_dataset('titanic')
titanic.head()

**The Titanic Dataset**

The `titanic` dataset contains information about the passengers on the Titanic, including whether they survived or not. Variables include:

- `survived`: Survival status (0 = No, 1 = Yes).
- `pclass`: Passenger class as an integer (1, 2, or 3). The `class` column holds the same information as labels (`First`, `Second`, `Third`).
- `sex`: Gender of the passenger.
- `age`: Age of the passenger.
- `sibsp`: Number of siblings/spouses aboard.
- `parch`: Number of parents/children aboard.
- `fare`: Passenger fare.
- `embarked`: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

We will use this dataset in our practice exercises.

### 2. Advanced Customizations

In this section, we'll explore how to enhance basic plots by adding more customizations to make them more informative and visually appealing.

#### Enhanced Scatter Plot

Scatter plots are useful for visualizing the relationship between two continuous variables. We'll create an enhanced scatter plot using the `iris` dataset. We'll incorporate additional features like color (`hue`), marker style (`style`), and size (`s`) to convey more information in the plot.

**Explanation:**

- **Hue (`hue`):** Adds color encoding based on a categorical variable (e.g., `species`).
- **Style (`style`):** Changes the marker style based on a categorical variable.
- **Size (`s`):** Controls the size of the markers.
- **Palette (`palette`):** Determines the color palette used for the plot.
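The difference between the fixed `s` argument and a data-driven mapping can be easier to see on a tiny example. The sketch below is an illustration under stated assumptions: the `toy` frame and its column names (`x`, `y`, `group`, `weight`) are hypothetical, and it uses Seaborn's `size`/`sizes` parameters to scale markers by a numeric column instead of using one fixed size.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: two numeric axes, one category, one numeric "weight".
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.1, 2.9, 3.2, 4.8, 5.1, 6.3],
    "group": ["a", "a", "b", "b", "c", "c"],
    "weight": [10, 20, 30, 40, 50, 60],
})

fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(
    data=toy,
    x="x",
    y="y",
    hue="group",      # color encodes the category
    style="group",    # marker shape encodes the same category (redundant encoding)
    size="weight",    # marker area scales with a numeric column
    sizes=(40, 200),  # smallest and largest marker sizes
    palette="bright",
    ax=ax,
)
ax.set_title("Toy scatter with hue, style, and size mappings")
```

Encoding the same variable with both `hue` and `style` keeps groups distinguishable in grayscale printouts. In a script, add `plt.show()` to display the figure.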

In [None]:
# Enhanced scatter plot
plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=iris,
    x='sepal_length',
    y='sepal_width',
    hue='species',
    style='species',
    s=100,
    palette='bright'
)
plt.title('Sepal Length vs Sepal Width', fontsize=16)
plt.xlabel('Sepal Length (cm)', fontsize=12)
plt.ylabel('Sepal Width (cm)', fontsize=12)
plt.legend(title='Species')
plt.show()

**Interpretation:**

In the plot above, each species of iris is represented by a different color and marker style, which lets us see how the sepal measurements differ among the species. For example, `setosa` tends to have shorter but wider sepals than the other two species.

**Practice Exercise 2:**

- **Task:** Create an enhanced scatter plot using the `diamonds` dataset.
- **Instructions:**
  - Plot `carat` on the x-axis and `price` on the y-axis.
  - Use `color` as the `hue` parameter.
  - Customize the plot by adjusting the marker size, adding a title, and labeling the axes.
- **Goal:** Apply advanced customizations to a scatter plot with a different dataset.

In [None]:
# Your code here


**Practice Exercise 3:**

- **Task:** Create an enhanced scatter plot using the `tips` dataset.
- **Instructions:**
  - Plot `total_bill` on the x-axis and `tip` on the y-axis.
  - Use `day` as the `hue` parameter.
  - Use `smoker` as the `style` parameter.
  - Customize the plot by adjusting the marker size, adding a title, and labeling the axes.
- **Goal:** Practice adding multiple customizations to a scatter plot.

In [None]:
# Your code here


### 3. Box Plots and Violin Plots

Box plots and violin plots are useful for visualizing the distribution of a dataset and identifying outliers.

#### Box Plot

A box plot displays the median, quartiles, and outliers of the data. It provides a summary of the distribution of a dataset.

In [None]:
# Box plot
plt.figure(figsize=(10, 6))
sns.boxplot(
    x='day',
    y='total_bill',
    data=tips,
    hue='smoker',
    palette='Set2'
)
plt.title('Total Bill Distribution by Day and Smoking Status', fontsize=16)
plt.xlabel('Day of Week', fontsize=12)
plt.ylabel('Total Bill ($)', fontsize=12)
plt.legend(title='Smoker')
plt.show()

**Interpretation:**

The box plot shows how the total bill varies by day and smoking status. Each box summarizes the total bills for smokers or non-smokers on a given day. The line inside each box marks the median, the box edges mark the first and third quartiles, and the whiskers extend to the most extreme points that lie within 1.5 times the interquartile range of the box edges (Seaborn's default). Points beyond the whiskers are drawn individually as outliers.
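The numbers behind a single box can be computed by hand. This is a minimal sketch, assuming the default 1.5 × IQR whisker rule. The `bills` values are hypothetical and only stand in for one group of total bills.

```python
import pandas as pd

# Hypothetical values standing in for one box (e.g. total bills on one day).
bills = pd.Series([10.0, 12.5, 14.0, 15.5, 17.0, 18.5, 20.0, 22.0, 48.0])

# Box: first quartile, median, third quartile.
q1, median, q3 = bills.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences: points outside them are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences.
inside = bills[(bills >= lower_fence) & (bills <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers.tolist()}")
```

Note that the whiskers end at actual data points, not at the fences themselves.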

#### Violin Plot

A violin plot combines a box plot with a kernel density plot. It shows the distribution's probability density at different values.

In [None]:
# Violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(
    x='day',
    y='total_bill',
    data=tips,
    hue='smoker',
    split=True,
    palette='Set2'
)
plt.title('Total Bill Distribution by Day and Smoking Status', fontsize=16)
plt.xlabel('Day of Week', fontsize=12)
plt.ylabel('Total Bill ($)', fontsize=12)
plt.legend(title='Smoker')
plt.show()

**Interpretation:**

The violin plot gives a more detailed view of the distribution. The width of the violin at a given value reflects the estimated density of data points there, computed with a kernel density estimate. Splitting the violins by smoking status places the two distributions back to back for direct comparison.
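To make the idea of a kernel density estimate concrete, here is a hand-rolled Gaussian KDE in NumPy. It is a sketch, not Seaborn's implementation: the `values` array is hypothetical, and the bandwidth uses a normal-reference rule of thumb chosen for illustration, which may differ from the bandwidth Seaborn picks.

```python
import numpy as np

# Hypothetical sample: a cluster of values plus one large value.
values = np.array([10, 12, 13, 15, 16, 18, 20, 22, 25, 40], dtype=float)

# Points at which the density is evaluated (the "height" axis of the violin).
grid = np.linspace(values.min(), values.max(), 50)

# Assumed bandwidth: normal-reference rule of thumb (illustrative choice).
bandwidth = 1.06 * values.std(ddof=1) * len(values) ** (-1 / 5)

# Place one Gaussian bump on every data point and average them.
z = (grid[:, None] - values[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(values) * bandwidth * np.sqrt(2 * np.pi))

# The violin is widest where the density is highest.
print(f"density peaks near {grid[density.argmax()]:.1f}")
```

A smaller bandwidth produces a bumpier violin, and a larger one smooths detail away.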

**Practice Exercise 4:**

- **Task:** Create a violin plot using the `titanic` dataset.
- **Instructions:**
  - Plot `class` on the x-axis and `age` on the y-axis.
  - Use `sex` as the `hue` parameter and set `split=True`.
  - Customize the plot by adding a title and adjusting the palette.
- **Goal:** Understand how to create and customize violin plots with different data.

In [None]:
# Your code here
# Remove NaN values in 'age'
titanic_clean = titanic.dropna(subset=['age'])



**Practice Exercise 5:**

- **Task:** Create a box plot using the `diamonds` dataset.
- **Instructions:**
  - Plot `cut` on the x-axis and `price` on the y-axis.
  - Use `color` as the `hue` parameter.
  - Customize the plot by adding a title and adjusting the palette.
- **Goal:** Practice creating box plots with multiple categorical variables.

In [None]:
# Your code here
# Due to the large size of the dataset, we'll sample it
diamonds_sample = diamonds.sample(1000, random_state=42)



### 4. Heatmaps

Heatmaps are a great way to visualize matrix-like data, such as correlation matrices or frequency tables.

#### Correlation Heatmap

A correlation heatmap displays the correlation coefficients between variables in a dataset.

In [36]:
# Compute correlation matrix
corr = iris.drop('species', axis=1).corr()


In [None]:
corr

In [None]:

# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(
    corr,
    annot=True,
    cmap='coolwarm',
    linewidths=0.5,
    fmt='.2f'
)
plt.title('Correlation Matrix of Iris Dataset', fontsize=16)
plt.show()

**Interpretation:**

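The heatmap above shows the correlation between the measurements in the iris dataset. With the `coolwarm` colormap, higher correlations appear in red and lower ones in blue. Because the color scale spans the range of values in the matrix by default, the neutral midpoint corresponds to zero only if you pass `center=0` (or `vmin=-1, vmax=1`). For instance, `petal_length` and `petal_width` are highly positively correlated.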
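The matrix that feeds a correlation heatmap can be checked against NumPy directly. This sketch uses synthetic data with hypothetical column names (`length`, `width`, `noise`). It relies on `DataFrame.corr()` computing Pearson correlation by default.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
length = rng.normal(5.0, 1.0, size=200)

# Synthetic columns: `width` is tied to `length`, `noise` is unrelated.
toy = pd.DataFrame({
    "length": length,
    "width": 0.4 * length + rng.normal(0.0, 0.1, size=200),
    "noise": rng.normal(0.0, 1.0, size=200),
})

corr = toy.corr()  # Pearson correlation by default

# Same computation with NumPy; rowvar=False treats columns as variables.
manual = np.corrcoef(toy.to_numpy(), rowvar=False)

print(corr.round(2))
print("matches NumPy:", np.allclose(corr.to_numpy(), manual))
```

The matrix is symmetric with ones on the diagonal, so half of the heatmap repeats the other half. Non-numeric columns such as `species` need to be dropped or excluded before calling `corr()`.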

**Practice Exercise 6:**

- **Task:** Create a heatmap to visualize the correlation matrix of the `diamonds` dataset.
- **Instructions:**
  - Use only numerical columns for the correlation matrix.
  - Customize the heatmap with annotations and an appropriate color map.
- **Goal:** Practice creating heatmaps for larger datasets.

In [None]:
# Your code here
# Select numerical columns
diamonds_numeric = diamonds.select_dtypes(include=[np.number])
corr_diamonds = diamonds_numeric.corr()

# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(
    corr_diamonds,
    annot=True,
    cmap='viridis',
    fmt='.2f'
)
plt.title('Correlation Matrix of Diamonds Dataset', fontsize=16)
plt.show()

**Practice Exercise 7:**

- **Task:** Create a heatmap showing the number of passengers for each month and year in the `flights` dataset.
- **Instructions:**
  - Use the `pivot` method to reshape the data.
  - Customize the heatmap with annotations and a suitable color map.
- **Goal:** Practice creating heatmaps for time series data.

In [None]:
flights_pivot = flights.pivot(index='month', columns='year', values='passengers')


In [None]:
flights_pivot

In [None]:
# Your code here
# Pivot the data
flights_pivot = flights.pivot(index='month', columns='year', values='passengers')

# Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(
    flights_pivot,
    annot=True,
    fmt='d',
    cmap='YlGnBu'
)
plt.title('Number of Passengers (1949-1960)', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Month', fontsize=12)
plt.show()
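The long-to-wide reshape that `pivot` performs is easier to follow on a few rows. The frame below is a hypothetical toy with made-up passenger counts. It stores `month` as an ordered categorical so that rows stay in calendar order.

```python
import pandas as pd

months = ["Jan", "Feb"]

# Hypothetical long-format data: one row per (year, month) pair.
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    # An ordered categorical keeps calendar order; plain strings would sort alphabetically.
    "month": pd.Categorical(["Jan", "Feb", "Jan", "Feb"], categories=months, ordered=True),
    "passengers": [100, 110, 120, 130],
})

# Wide format: one row per month, one column per year.
wide = long_df.pivot(index="month", columns="year", values="passengers")
print(wide)
```

`pivot` raises an error if an index/column pair occurs more than once. In that case `pivot_table`, which aggregates duplicates, is the usual alternative.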

### 5. Pair Plots and Joint Plots

#### Pair Plot

A pair plot creates a matrix of scatter plots for each pair of variables. It's useful for exploring relationships between multiple variables.

In [None]:
sns.pairplot(iris, hue='species', height=2.5)
plt.show()

**Interpretation:**

The pair plot shows pairwise relationships between features in the iris dataset, colored by species. Diagonal plots show the distribution of each variable.

**Practice Exercise 8:**

- **Task:** Create a pair plot for the `titanic` dataset.
- **Instructions:**
  - Use the columns `age`, `fare`, and `pclass`.
  - Use `survived` as the `hue` parameter.
  - Customize the plot with an appropriate palette.
- **Goal:** Practice creating pair plots with custom variables.

In [None]:
# Your code here
# Remove NaN values
titanic_clean = titanic[['age', 'fare', 'pclass', 'survived']].dropna()

sns.pairplot(
    data=titanic_clean,
    vars=['age', 'fare', 'pclass'],
    hue='survived',
    palette='coolwarm',
    height=3
)
plt.show()

#### Joint Plot

A joint plot shows the relationship between two variables along with their marginal distributions.

In [None]:
sns.jointplot(
    x='total_bill',
    y='tip',
    data=tips,
    kind='reg',
    height=8
)
plt.show()

**Interpretation:**

The joint plot above shows the relationship between total bill and tip amounts, including a regression line and marginal histograms. It suggests a positive correlation between total bill and tip.
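A visual impression of a positive relationship can be backed with numbers. The sketch below fits a straight line with `np.polyfit` and computes the Pearson correlation. The `bill` and `tip` arrays are hypothetical values, not rows from the `tips` dataset.

```python
import numpy as np

# Hypothetical bill and tip amounts.
bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
tip = np.array([2.0, 3.1, 4.4, 5.9, 7.1])

# Degree-1 least-squares fit: returns [slope, intercept].
slope, intercept = np.polyfit(bill, tip, 1)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(bill, tip)[0, 1]

print(f"tip ~ {slope:.2f} * bill + {intercept:.2f}, r = {r:.3f}")
```

The slope estimates how much the tip grows per extra dollar on the bill, while `r` measures how tightly the points follow the line.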

**Practice Exercise 9:**

- **Task:** Create a joint plot using the `diamonds` dataset.
- **Instructions:**
  - Plot `carat` on the x-axis and `price` on the y-axis.
  - Use `kind='reg'` to create a regression plot.
- **Goal:** Explore different kinds of joint plots.

In [None]:
# Your code here


**Practice Exercise 10:**

- **Task:** Visualize the average fare by passenger class using the `titanic` dataset.
- **Instructions:**
  - Create a line plot of average `fare` over `pclass`.
  - Use `sex` as the `hue` parameter.
  - Customize the plot by adding markers and adjusting the palette.
- **Goal:** Practice line plots on an ordered categorical variable (the `titanic` dataset has no time column, so `pclass` serves as the x-axis).

In [None]:
# Your code here
# Calculate average fare by pclass and sex
avg_fare = titanic.groupby(['pclass', 'sex'])['fare'].mean().reset_index()

# Line plot


### 7. Grouping and Merging for Visualization

Data often needs to be grouped or merged to prepare it for visualization.

#### Grouping Data

We'll group the `tips` dataset by day and smoker status to calculate the average total bill.

In [None]:
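# Group by day and smoker status.
# `day` and `smoker` are categorical, so pass `observed` explicitly
# (its default differs across pandas versions).
grouped_tips = tips.groupby(['day', 'smoker'], observed=False)['total_bill'].mean().reset_index()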

# Pivot the data for better plotting
pivot_tips = grouped_tips.pivot(index='day', columns='smoker', values='total_bill')

# Bar plot
pivot_tips.plot(kind='bar', figsize=(10, 6))
plt.title('Average Total Bill by Day and Smoking Status', fontsize=16)
plt.xlabel('Day of Week', fontsize=12)
plt.ylabel('Average Total Bill ($)', fontsize=12)
plt.legend(title='Smoker')
plt.xticks(rotation=0)
plt.show()

**Interpretation:**

The bar plot compares the average total bill between smokers and non-smokers on different days. This can reveal patterns such as whether smokers tend to spend more on certain days.
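Merging follows the same pattern: summarize first, then join the summary to another table before plotting. The sketch below is an illustration with hypothetical data: the `bills` values are made up, and `day_info` is an invented lookup table that does not ship with Seaborn.

```python
import pandas as pd

# Hypothetical bills, several per day.
bills = pd.DataFrame({
    "day": ["Thur", "Fri", "Sat", "Sun", "Sat", "Sun"],
    "total_bill": [15.0, 18.0, 22.0, 20.0, 26.0, 24.0],
})

# Hypothetical lookup table that labels each day.
day_info = pd.DataFrame({
    "day": ["Thur", "Fri", "Sat", "Sun"],
    "day_type": ["weekday", "weekday", "weekend", "weekend"],
})

# Step 1: summarize with groupby (one row per day).
avg_by_day = bills.groupby("day", as_index=False)["total_bill"].mean()

# Step 2: attach the label with a left join on the shared `day` column.
merged = avg_by_day.merge(day_info, on="day", how="left")

# Step 3: the merged column can now drive a second grouping or a plot's hue.
avg_by_type = merged.groupby("day_type")["total_bill"].mean()
print(merged)
print(avg_by_type)
```

A left join keeps every row of the summary even when the lookup table lacks a match, which makes missing labels show up as `NaN` instead of silently dropping data.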

**Practice Exercise 11:**

- **Task:** Use the `titanic` dataset to visualize survival rates.
- **Instructions:**
  - Group the data by `sex` and `pclass` and calculate the survival rate.
  - Create a grouped bar chart to display the survival rates.
- **Goal:** Practice grouping and visualizing grouped data.
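Because `survived` is coded as 0 or 1, its mean within a group is that group's survival rate. The sketch below shows the idea on a hypothetical eight-row frame, not the real `titanic` data, and uses `unstack` as a shortcut for `reset_index` followed by `pivot`.

```python
import pandas as pd

# Hypothetical passengers (made-up rows).
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male", "male", "female"],
    "pclass": [1, 1, 3, 1, 3, 3, 3, 3],
    "survived": [1, 1, 0, 0, 0, 1, 0, 1],
})

# Mean of a 0/1 column = fraction of ones = survival rate per group.
rates = toy.groupby(["sex", "pclass"])["survived"].mean()

# Move `sex` from the row index to the columns: one row per class, one column per sex.
rates_wide = rates.unstack("sex")
print(rates_wide)
```

A wide table shaped like `rates_wide` can be passed straight to `.plot(kind='bar')` to get one bar group per class.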

In [None]:
# Your code here
# Group by sex and pclass
survival_rates = titanic.groupby(['sex', 'pclass'])['survived'].mean().reset_index()

# Pivot the data
pivot_survival = survival_rates.pivot(index='pclass', columns='sex', values='survived')

# Bar plot


**Practice Exercise 12:**

- **Task:** Calculate the average `tip` for each day and time (Lunch or Dinner) in the `tips` dataset.
- **Instructions:**
  - Use `groupby` to calculate the averages.
  - Create a heatmap to visualize the results.
- **Goal:** Practice grouping and visualizing data using a heatmap.

In [None]:
# Your code here
# Group by day and time
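# `day` and `time` are categorical; `observed=False` keeps unobserved combinations as NaN
avg_tip = tips.groupby(['day', 'time'], observed=False)['tip'].mean().reset_index()

# Pivot the data
pivot_tip = avg_tip.pivot(index='day', columns='time', values='tip')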

# Heatmap

### 8. Subplots and Layouts

Creating subplots allows you to compare multiple plots side by side.

In [None]:
# Subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Histogram
sns.histplot(tips['total_bill'], ax=axes[0, 0], color='skyblue')
axes[0, 0].set_title('Histogram of Total Bill')

# Scatter Plot
sns.scatterplot(x='total_bill', y='tip', data=tips, ax=axes[0, 1], hue='smoker', palette='Set1')
axes[0, 1].set_title('Total Bill vs Tip')

# Box Plot
sns.boxplot(x='day', y='total_bill', data=tips, ax=axes[1, 0], palette='Set3')
axes[1, 0].set_title('Total Bill by Day')

# Violin Plot
sns.violinplot(x='day', y='total_bill', data=tips, ax=axes[1, 1], palette='Set2')
axes[1, 1].set_title('Total Bill Distribution by Day')

plt.tight_layout()
plt.show()

**Interpretation:**

The subplots provide a comprehensive overview of different aspects of the `tips` dataset. By arranging multiple plots in a grid, we can compare various distributions and relationships simultaneously.
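The grid returned by `plt.subplots` is a NumPy array of `Axes` objects, which is why the cell above indexes it as `axes[row, col]`. The sketch below uses plain Matplotlib and synthetic data (the `data` array and the titles are hypothetical) to show both indexing and looping over the grid.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=20.0, scale=5.0, size=100)  # synthetic values

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
print(axes.shape)  # (2, 2)

# Address each panel by its row and column.
axes[0, 0].hist(data, bins=15)
axes[0, 1].plot(np.sort(data))
axes[1, 0].boxplot(data)
axes[1, 1].scatter(np.arange(data.size), data, s=10)

# `axes.flat` walks the grid row by row, which is handy for shared settings.
titles = ["Histogram", "Sorted values", "Box plot", "Scatter"]
for ax, title in zip(axes.flat, titles):
    ax.set_title(title)

fig.tight_layout()
```

Seaborn's axes-level functions accept the same objects through their `ax=` argument, so the two styles can be mixed in one figure.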

**Practice Exercise 13:**

- **Task:** Create a figure with multiple subplots using the `iris` dataset.
- **Instructions:**
  - Create a 2x2 grid of plots.
  - Include a scatter plot, histogram, box plot, and violin plot of different variables.
- **Goal:** Practice arranging multiple plots in a single figure.

In [None]:
# Your code here

# Scatter Plot

# Histogram

# Box Plot

# Violin Plot


### 9. Conclusion

In this notebook, we've explored advanced data visualization techniques and practiced them through examples and exercises.

**Key Takeaways:**

- Enhanced customizations of basic plots to convey more information.
- Introduction to box plots and violin plots for distribution analysis.
- Utilization of heatmaps to visualize correlations and frequencies.
- Exploration of pair plots and joint plots for multivariate data analysis.
- Visualization of time series data to understand trends and patterns.
- Preparation of data using `groupby` and `merge` for plotting.
- Creation of subplots and layouts for comparative analysis.

### Further Reading

- Wes McKinney, *Python for Data Analysis*.
- Seaborn documentation: https://seaborn.pydata.org/
- Matplotlib documentation: https://matplotlib.org/
- Pandas documentation: https://pandas.pydata.org/docs/

### Next Steps

Continue practicing these visualization techniques with different datasets. Experiment with customizing plots and combining different types of plots to uncover deeper insights into your data.