ARTI308 - Machine Learning
# Seaborn Overview

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.




## Distribution Plots

Let's discuss some plots that allow us to visualize the distribution of a data set. These plots are:

* distplot
* jointplot
* pairplot
* rugplot
* kdeplot

## Imports

In [1]:
import seaborn as sns      # Import the seaborn library for advanced statistical data visualization
# Display matplotlib plots inline within the Jupyter Notebook
# (kept on its own line: a trailing comment is passed to the magic as an argument)
%matplotlib inline

## Data
Seaborn comes with built-in data sets!

In [2]:
tips = sns.load_dataset('tips')   # Load the built-in 'tips' dataset from seaborn into a DataFrame

In [3]:
tips.head()   # Display the first five rows of the dataset

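|   | total_bill | tip  | sex    | smoker | day | time   | size |
|---|------------|------|--------|--------|-----|--------|------|
| 0 | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1 | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2 | 21.01      | 3.5  | Male   | No     | Sun | Dinner | 3    |
| 3 | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4 | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |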


## distplot

The distplot shows the distribution of a univariate set of observations.

In [4]:
sns.distplot(tips['total_bill'])
# Safe to ignore warnings


`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(tips['total_bill'])


<Axes: xlabel='total_bill', ylabel='Density'>

To remove the KDE layer and keep only the histogram, use:

In [5]:
sns.distplot(tips['total_bill'], kde=False, bins=30)  
# Plot a histogram of the 'total_bill' column with 30 bins and no KDE curve




<Axes: xlabel='total_bill', ylabel='Density'>

## jointplot

jointplot() combines the univariate distribution plots of two variables with a plot of their joint (bivariate) relationship. The **kind** parameter selects how the joint relationship is drawn:
* “scatter” 
* “reg” 
* “resid” 
* “kde” 
* “hex”

In [6]:
sns.jointplot(x='total_bill', y='tip', data=tips, kind='scatter')  
# Create a joint plot showing the relationship between total_bill and tip using a scatter plot

<seaborn.axisgrid.JointGrid at 0x211f256a3c0>

In [7]:
sns.jointplot(x='total_bill', y='tip', data=tips, kind='hex')  
# Create a joint plot with hexagonal binning to visualize the density between total_bill and tip

<seaborn.axisgrid.JointGrid at 0x211f287e490>

In [8]:
sns.jointplot(x='total_bill', y='tip', data=tips, kind='reg')  
# Create a joint plot with a scatter plot and a regression line between total_bill and tip

<seaborn.axisgrid.JointGrid at 0x211f2cdb390>

## pairplot

pairplot will plot pairwise relationships across an entire dataframe (for the numerical columns) and supports a color hue argument (for categorical columns). 

In [9]:
sns.pairplot(tips)   # Create pairwise relationships (scatterplots and histograms) for all numerical columns in the dataset

<seaborn.axisgrid.PairGrid at 0x211f256a510>

In [10]:
sns.pairplot(tips, hue='sex', palette='coolwarm')  
# Create pairwise plots colored by the 'sex' column using the 'coolwarm' color palette

<seaborn.axisgrid.PairGrid at 0x211f367f110>

## rugplot

A rugplot is a very simple concept: it draws a dash mark for every point in a univariate distribution. Rugplots are the building block of a KDE plot:

In [11]:
sns.rugplot(tips['total_bill'])   # Draw a rug plot to show the distribution of total_bill values as tick marks along the axis

<Axes: xlabel='total_bill'>

## kdeplot

kdeplots are [Kernel Density Estimation plots](http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth). These KDE plots replace every single observation with a Gaussian (Normal) distribution centered around that value. For example:

In [12]:
# Don't worry about understanding this code!
# It's just for the diagram below
import numpy as np                     # Import NumPy for numerical operations
import matplotlib.pyplot as plt        # Import Matplotlib for plotting
from scipy import stats                # Import statistical functions from SciPy

# Create dataset
dataset = np.random.randn(25)          # Generate 25 random values from a standard normal distribution

# Create another rugplot
sns.rugplot(dataset);                  # Draw a rug plot for the generated dataset

# Set up the x-axis for the plot
x_min = dataset.min() - 2              # Define minimum x value (with extra margin)
x_max = dataset.max() + 2              # Define maximum x value (with extra margin)

# 100 equally spaced points from x_min to x_max
x_axis = np.linspace(x_min, x_max, 100)   # Create 100 evenly spaced values between x_min and x_max

# Set up the bandwidth, for info on this:
url = 'http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth'  
# Reference URL explaining bandwidth estimation in KDE

bandwidth = ((4 * dataset.std()**5) / (3 * len(dataset)))**.2  
# Calculate bandwidth using Silverman's rule of thumb

# Create an empty kernel list
kernel_list = []                      # Initialize list to store kernel values

# Plot each basis function
for data_point in dataset:            # Loop through each data point in the dataset
    
    # Create a kernel for each point and append to list
    kernel = stats.norm(data_point, bandwidth).pdf(x_axis)   # Compute normal PDF centered at data_point
    kernel_list.append(kernel)                                # Store the kernel in the list
    
    # Scale for plotting
    kernel = kernel / kernel.max()                            # Normalize kernel to max value of 1
    kernel = kernel * .4                                      # Scale down for visual clarity
    plt.plot(x_axis, kernel, color='grey', alpha=0.5)         # Plot each individual kernel curve

plt.ylim(0, 1)                         # Set y-axis limits for better visualization

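Note: this cell failed with `ModuleNotFoundError: No module named 'scipy'` because SciPy was not installed in the environment. Install it (for example, `pip install scipy`) and re-run the cell to produce the diagram.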

In [None]:
# To get the kde plot we can sum these basis functions.

# Plot the sum of the basis function
sum_of_kde = np.sum(kernel_list, axis=0)   # Sum all kernel curves point-wise to get the KDE curve

# Plot figure
fig = plt.plot(x_axis, sum_of_kde, color='indianred')   # Plot the resulting KDE curve

# Add the initial rugplot
sns.rugplot(dataset, color='indianred')   # Overlay a rug plot of the dataset

# Get rid of y-tick marks
plt.yticks([])   # Remove y-axis tick marks for cleaner visualization

# Set title
plt.suptitle("Sum of the Basis Functions")   # Add a title to the entire figure
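The sum of the basis functions is not yet a probability density; dividing by the number of observations makes the area under the curve equal to 1. The sketch below checks this with NumPy only (no SciPy), writing the Gaussian formula out by hand and reusing the same bandwidth formula as the cell above. The names `sample`, `grid`, and `density` are illustrative, and `sns.kdeplot` chooses its own bandwidth, so its curve will not match this one exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.standard_normal(25)          # 25 draws from a standard normal

# Same rule-of-thumb bandwidth as in the cell above
bandwidth = ((4 * sample.std() ** 5) / (3 * len(sample))) ** 0.2

# Evaluation grid with a margin on both sides of the data
grid = np.linspace(sample.min() - 3, sample.max() + 3, 500)

# One Gaussian bump per observation: array of shape (25, 500)
z = (grid[None, :] - sample[:, None]) / bandwidth
kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))

# Averaging (sum divided by n) turns the bumps into a density estimate
density = kernels.mean(axis=0)

# Simple Riemann-sum check that the estimate integrates to about 1
dx = grid[1] - grid[0]
area = density.sum() * dx
print(f"bandwidth={bandwidth:.3f}, area under KDE={area:.4f}")
```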

So with our tips dataset:

In [None]:
sns.kdeplot(tips['total_bill'])        # Plot the Kernel Density Estimation (KDE) for total_bill
sns.rugplot(tips['total_bill'])        # Add a rug plot to show individual total_bill data points

In [None]:
sns.kdeplot(tips['tip'])        # Plot the Kernel Density Estimation (KDE) for tip amounts
sns.rugplot(tips['tip'])        # Add a rug plot to show individual tip data points

# Categorical Data Plots

Now let's discuss using seaborn to plot categorical data! There are a few main plot types for this:

* catplot (formerly factorplot)
* boxplot
* violinplot
* stripplot
* swarmplot
* barplot
* countplot

Let's go through examples of each!

In [None]:
import seaborn as sns      # Import seaborn library for statistical data visualization
# Display matplotlib plots inline in the Jupyter Notebook
%matplotlib inline

In [None]:
tips = sns.load_dataset('tips')   # Load the built-in 'tips' dataset into a DataFrame
tips.head()                       # Display the first five rows of the dataset

## barplot and countplot

These two very similar plots aggregate data over a categorical feature. **barplot** is the general form: it aggregates a numeric variable within each category using some function, by default the mean:

In [None]:
sns.barplot(x='sex', y='total_bill', data=tips)  
# Create a bar plot showing the average total_bill for each gender (sex)

In [None]:
import numpy as np   # Import NumPy library for numerical and array operations

You can change the estimator to any function that converts a vector to a scalar:

In [None]:
sns.barplot(x='sex', y='total_bill', data=tips, estimator=np.std)  
# Create a bar plot showing the standard deviation of total_bill for each gender
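To make the aggregation concrete, here is a small pandas sketch of the numbers a bar plot would show as bar heights. The `toy` DataFrame is made-up data (not the tips dataset), and the range function is just one hypothetical example of a vector-to-scalar estimator.

```python
import pandas as pd

# Made-up data standing in for the 'sex' and 'total_bill' columns
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female"],
    "total_bill": [10.0, 20.0, 30.0, 50.0],
})

grouped = toy.groupby("sex")["total_bill"]

# Default estimator: the mean of total_bill within each category
mean_heights = grouped.mean()

# Custom estimator: any function mapping a vector to a single number, here the range
range_heights = grouped.agg(lambda s: s.max() - s.min())

print(mean_heights)
print(range_heights)
```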

### countplot

This is essentially the same as barplot, except that the estimator explicitly counts the number of occurrences, which is why we only pass the x value:

In [None]:
sns.countplot(x='sex', data=tips)   # Create a count plot showing the number of observations for each gender
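The counts behind such a plot can be reproduced with pandas. A minimal sketch on made-up data (the `toy` DataFrame is an assumption, not the tips dataset):

```python
import pandas as pd

# Made-up categorical column
toy = pd.DataFrame({"sex": ["Male", "Female", "Male", "Male", "Female"]})

# Number of rows per category: the bar heights a count plot would show
counts = toy["sex"].value_counts()
print(counts)
```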

## boxplot and violinplot

boxplots and violinplots are used to show the distribution of a quantitative variable within each level of a categorical variable. A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.

In [None]:
sns.boxplot(x="day", y="total_bill", data=tips, palette='rainbow')  
# Create a box plot showing the distribution of total_bill for each day using the rainbow color palette
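The outlier rule mentioned above can be sketched with NumPy. This assumes the common convention of flagging points more than 1.5 times the inter-quartile range beyond the quartiles (the default `whis=1.5`); the `values` array is made-up data.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

# The box spans the first to the third quartile
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```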

In [None]:
# Passing the entire DataFrame with orient='h' draws one horizontal box per numeric column
sns.boxplot(data=tips, palette='rainbow', orient='h')

In [None]:
sns.boxplot(x="day", y="total_bill", hue="smoker", data=tips, palette="coolwarm")  
# Create a grouped box plot showing total_bill per day, separated by smoker status using the coolwarm palette

### violinplot
A violin plot plays a similar role as a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.

In [None]:
sns.violinplot(x="day", y="total_bill", data=tips, palette='rainbow')  
# Create a violin plot to visualize the distribution of total_bill for each day with density estimation

In [None]:
sns.violinplot(x="day", y="total_bill", data=tips, hue='sex', palette='Set1')  
# Create a violin plot showing total_bill distribution per day, split by gender using the Set1 palette

In [None]:
sns.violinplot(x="day", y="total_bill", data=tips, hue='sex', split=True, palette='Set1')  
# Create a split violin plot to compare total_bill distribution between genders within each day

## stripplot and swarmplot
The stripplot will draw a scatterplot where one variable is categorical. A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.

The swarmplot is similar to stripplot(), but the points are adjusted (only along the categorical axis) so that they don’t overlap. This gives a better representation of the distribution of values, although it does not scale as well to large numbers of observations (both in terms of the ability to show all the points and in terms of the computation needed to arrange them).

In [None]:
sns.stripplot(x="day", y="total_bill", data=tips, palette='rainbow')  
# Create a strip plot to display individual total_bill data points for each day

In [None]:
sns.stripplot(x="day", y="total_bill", data=tips, jitter=True, palette='rainbow')  
# Create a strip plot with jitter to spread out overlapping points for better visibility

In [None]:
sns.stripplot(x="day", y="total_bill", data=tips, jitter=True, hue='sex', palette='Set1')  
# Create a strip plot with jitter to reduce overlap and color points based on gender

In [None]:
sns.stripplot(x="day", y="total_bill", data=tips, jitter=True, hue='sex', palette='Set1', dodge=True)  
# Create a jittered strip plot and separate (dodge) points by gender within each day category

In [None]:
sns.swarmplot(x="day", y="total_bill", data=tips)  
# Create a swarm plot to show individual total_bill points arranged to avoid overlap

### Combining Categorical Plots

In [None]:
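sns.violinplot(x="tip", y="day", data=tips, palette='rainbow')  
# Draw a horizontal violin plot of tip for each day
sns.swarmplot(x="tip", y="day", data=tips, color='black', size=3)  
# Overlay the individual observations as small black points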

## catplot

catplot (named factorplot before seaborn 0.9) is the most general form of a categorical plot. It takes a kind parameter to select the plot type:

In [None]:
sns.catplot(x='sex', y='total_bill', data=tips, kind='bar')  
# Create a categorical bar plot showing the average total_bill for each gender

# Matrix Plots

Matrix plots allow you to plot data as color-encoded matrices and can also be used to indicate clusters within the data (later in the machine learning section we will learn how to formally cluster data).

Let's begin by exploring seaborn's heatmap and clustermap:

In [None]:
import seaborn as sns      # Import seaborn library for statistical data visualization
# Enable inline plotting for matplotlib in Jupyter Notebook
%matplotlib inline

In [None]:
flights = sns.load_dataset('flights')   # Load the built-in 'flights' dataset into a DataFrame

In [None]:
tips = sns.load_dataset('tips')   # Load the built-in 'tips' dataset into a DataFrame

In [None]:
tips.head()   # Display the first five rows of the tips dataset

In [None]:
flights.head()   # Display the first five rows of the flights dataset

## Heatmap

For a heatmap to work properly, your data should already be in matrix form; the sns.heatmap function basically just colors it in for you. For example:

In [None]:
tips.head()   # Display the first five rows of the tips dataset

In [None]:
# Matrix form for correlation data
# numeric_only=True skips the categorical columns; pandas 2.0+ raises an error without it
tips.corr(numeric_only=True)

In [None]:
sns.heatmap(tips.corr(numeric_only=True))   # Create a heatmap to visualize the correlation matrix of numerical columns in the tips dataset

In [None]:
sns.heatmap(tips.corr(numeric_only=True), cmap='coolwarm', annot=True)  
# Create a heatmap of the correlation matrix with the coolwarm color map and display correlation values

Or for the flights data:

In [None]:
flights.pivot_table(values='passengers', index='month', columns='year')  
# Create a pivot table showing passenger counts indexed by month and separated by year

In [None]:
pvflights = flights.pivot_table(values='passengers', index='month', columns='year')  
# Create a pivot table of passengers with months as rows and years as columns

sns.heatmap(pvflights)  
# Generate a heatmap to visualize passenger counts across months and years

In [None]:
sns.heatmap(pvflights, cmap='magma', linecolor='white', linewidths=1)  
# Create a heatmap using the magma color map with white grid lines between cells
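The step from long-form rows to matrix form is easier to see on a tiny table. A minimal sketch with a four-row toy DataFrame (the `long_form` name and its rows are illustrative, not the full flights dataset):

```python
import pandas as pd

# Long form: one row per (year, month) observation
long_form = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Matrix form: months become rows, years become columns, passengers fill the cells
matrix = long_form.pivot_table(values="passengers", index="month", columns="year")
print(matrix)
```

Each cell of `matrix` is what a heatmap would color; with one observation per cell, the default mean aggregation simply returns that value.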

## clustermap

The clustermap uses hierarchical clustering to produce a clustered version of the heatmap. For example:

In [None]:
sns.clustermap(pvflights)  
# Create a clustered heatmap to group similar months and years based on passenger patterns

Notice how the years and months are no longer in order; instead they are grouped by similarity in value (passenger count). That means we can begin to infer things from this plot, such as August and July being similar (which makes sense, since both are summer travel months).

In [None]:
# Normalizing the data can make the patterns clearer; standard_scale=1 rescales each column
sns.clustermap(pvflights,standard_scale=1)
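As a rough illustration of what column-wise scaling does, the sketch below applies the rule described in seaborn's documentation for `standard_scale` (subtract the minimum, then divide by the maximum) to a toy matrix. It is written by hand with pandas rather than calling seaborn, so treat it as an approximation of the idea rather than seaborn's exact code path; the numbers are made up.

```python
import pandas as pd

# Toy month-by-year matrix with very different magnitudes per column
matrix = pd.DataFrame(
    {1949: [100.0, 120.0, 150.0], 1960: [400.0, 390.0, 600.0]},
    index=["Jan", "Feb", "Aug"],
)

# Column-wise rescaling: subtract each column's minimum, then divide by the new maximum
shifted = matrix - matrix.min()
scaled = shifted / shifted.max()
print(scaled)
```

After scaling, every column runs from 0 to 1, so years with larger passenger totals no longer dominate the colors or the clustering.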

# Regression Plots

Seaborn has many built-in capabilities for regression plots. However, we won't really discuss regression until the machine learning section of the course, so we will only cover the **lmplot()** function for now.

**lmplot** displays linear models, and it also conveniently lets you split those plots by feature and color them (hue) by feature.

Let's explore how this works:

In [None]:
import seaborn as sns
%matplotlib inline

In [None]:
tips = sns.load_dataset('tips')   # Load the built-in 'tips' dataset into a DataFrame

In [None]:
tips.head()

## lmplot()

In [None]:
sns.lmplot(x='total_bill', y='tip', data=tips)  
# Create a scatter plot with a linear regression line between total_bill and tip

In [None]:
sns.lmplot(x='total_bill', y='tip', data=tips, hue='sex')  
# Create a regression plot separated by gender, with different colors for each group

In [None]:
sns.lmplot(x='total_bill', y='tip', data=tips, hue='sex', palette='coolwarm')  
# Create a regression plot with separate lines for each gender using the coolwarm color palette
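The straight line in these plots is a fitted linear model. As a hedged sketch of the idea, assuming an ordinary least-squares fit of degree 1, the slope and intercept can be estimated with `np.polyfit`. The data here is synthetic (generated with a known slope of 0.15 and intercept of 1.0), not the tips dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic bills and tips: tip is about 15% of the bill plus 1.0, with noise
total_bill = rng.uniform(5, 50, size=100)
tip = 0.15 * total_bill + 1.0 + rng.normal(0, 0.5, size=100)

# Least-squares straight line: tip ≈ slope * total_bill + intercept
slope, intercept = np.polyfit(total_bill, tip, deg=1)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```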

## Using a Grid

We can add more variable separation through columns and rows with the use of a grid. Just indicate this with the col or row arguments:

In [None]:
sns.lmplot(x='total_bill', y='tip', data=tips, col='sex')  
# Create separate regression plots for each gender in different columns

In [None]:
sns.lmplot(x="total_bill", y="tip", row="sex", col="time", data=tips)  
# Create a grid of regression plots separated by gender (rows) and time of day (columns)

In [None]:
sns.lmplot(x='total_bill', y='tip', data=tips, col='day', hue='sex', palette='coolwarm')  
# Create separate regression plots for each day, colored by gender using the coolwarm palette

## Aspect and Size

Seaborn figures can have their size and aspect ratio adjusted with the **height** and **aspect** parameters:

In [13]:
sns.lmplot(
    x='total_bill',
    y='tip',
    data=tips,
    col='day',
    hue='sex',
    palette='coolwarm',
    aspect=0.6,
    height=8
)  
# Create regression plots for each day, colored by gender, 
# with custom aspect ratio and figure height

<seaborn.axisgrid.FacetGrid at 0x211f55882f0>
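In seaborn's figure-level functions, `height` is the height of each facet in inches and `aspect * height` gives each facet's width. A small arithmetic sketch for the call above, assuming four facets (one per day in the tips data) and ignoring the extra space taken by the legend:

```python
height = 8        # inches per facet, as passed to lmplot
aspect = 0.6      # width-to-height ratio of each facet
n_facets = 4      # assumption: one column per day (Thur, Fri, Sat, Sun)

facet_width = aspect * height          # width of a single facet in inches
total_width = facet_width * n_facets   # approximate figure width, legend excluded

print(f"each facet: {facet_width:.1f} x {height} in, figure about {total_width:.1f} in wide")
```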

### Reference:

* https://seaborn.pydata.org/ - Seaborn: statistical data visualization


* https://seaborn.pydata.org/tutorial/color_palettes.html - Color palettes