<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/DVRG-Book-Cover-Small.png"><br>

This notebook contains an excerpt from the **`Data Visualization Reference Guide - For Beginners`** book written by *Balasubramanian Chandran*; the content is available [on GitHub](https://github.com/BalaChandranGH/Books/Data-Visualization-Reference-Guide).

<br>
<!--NAVIGATION-->

<[ [Data Visualization with Matplotlib](02.00-dvrg-Data-Visualization-with-Matplotlib.ipynb) | [Contents and Acronyms](00.00-dvrg-Contents-and-Acronyms.ipynb) | [References](04.00-dvrg-References.ipynb) ]>

# 3. Data Visualization with Seaborn

## 3.1. Introduction
* Seaborn is built on top of Matplotlib
* Seaborn works with **NumPy** and **Pandas** data structures and uses statistical routines from **SciPy**
* Seaborn provides a high-level interface to produce high-quality statistical graphics
* The customizability and backend options for Matplotlib make it easy to generate high-quality figures
* **Seaborn is not an alternative to Matplotlib**, but we can think of it as a complement to Matplotlib
* In summary, **“If Matplotlib tries to make easy things easy and hard things possible, Seaborn tries to make a well-defined set of hard things easy too in a well-defined way.”**
* Seaborn helps to address two common difficulties of working with Matplotlib:
  - Working with Matplotlib’s default parameters
  - Working with data frames

## 3.2. Comparison of Seaborn with Matplotlib
Seaborn uses Matplotlib to draw its plots. Many tasks can be accomplished with Seaborn functions alone, while finer customization may require calling Matplotlib directly. Because Seaborn is built on top of Matplotlib, Matplotlib functions can be applied to any Seaborn plot, and simple plots can still be created with Matplotlib alone. Seaborn is therefore a complement to Matplotlib rather than an alternative, and it eases the two difficulties noted in the introduction:
* Working with Matplotlib’s default parameters
* Working with data frames
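
As a minimal sketch of how the two libraries complement each other, the example below lets Matplotlib create the figure, Seaborn draw the plot, and Matplotlib apply the final touches. The small dataframe `df` and its values are made up for illustration and are not part of the book's datasets.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values standing in for a real dataset
df = pd.DataFrame({
    "total_bill": [10.3, 17.0, 21.0, 23.7, 24.6, 26.9],
    "tip": [1.7, 1.0, 3.5, 3.3, 3.6, 3.1],
})

fig, ax = plt.subplots(figsize=(5, 3))                      # Matplotlib creates the figure
sns.scatterplot(data=df, x="total_bill", y="tip", ax=ax)    # Seaborn draws on the given axes
ax.set_title("Tip versus total bill")                       # Matplotlib customizes the result
ax.set_xlabel("Total bill")
ax.axhline(df["tip"].mean(), linestyle="--", color="gray")  # mean tip as a reference line
```

Passing `ax=ax` is what ties the two together: axes-level Seaborn functions draw onto an existing Matplotlib axes, so every Matplotlib method remains available afterwards.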

## 3.3. Seaborn APIs
Seaborn's plotting functions can be grouped into six categories:
* Relational plots
* Categorical plots
* Distribution plots
* Regression plots
* Matrix plots
* Multi-plot grids

### 3.3.1. Relational Plots 
Relational plots visualize statistical relationships between variables, such as trends and patterns. There are three relational plotting functions:
```
* relplot()
* scatterplot()
* lineplot()
```

### 3.3.2. Categorical Plots 
Categorical plots visualize relationships that involve categorical data. Commonly used functions are:
```
* stripplot()      - categorical scatterplots
* swarmplot()      - categorical scatterplots
* boxplot()        - categorical distribution plots 
* violinplot()     - categorical distribution plots 
* boxenplot()      - categorical distribution plots
* pointplot()      - categorical estimate plots 
* barplot()        - categorical estimate plots 
* countplot()      - categorical estimate plots
```

### 3.3.3. Distribution Plots 
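Distribution plots visualize how the values in a dataset are distributed. The main functions are listed below. Note that `distplot()` has been deprecated since Seaborn 0.11 in favor of `histplot()` and `displot()`.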
```
* distplot()      - plot univariate distribution (histogram and fit a KDE)
* kdeplot()       - plot shape of a distribution 
* rugplot()       - plot data points in an array as sticks on an axis
* jointplot()     - plot of two variables with bivariate and univariate graphs
* pairplot()      - visualize pairwise relationships in a dataset
```

### 3.3.4. Regression Plots
Regression plots visualize linear relationships between variables. The plotting functions are:
```
* regplot()       - visualize linear relationship between variables
* lmplot()        - visualize linear relationship between variables
* residplot()     - plot the residuals of a linear regression model
```

### 3.3.5. Matrix Plots
Matrix plots visualize matrix-shaped (rectangular) data. The plotting functions are:
```
* heatmap()       - plot rectangular data as a colour-encoded matrix
* clustermap()    - plot a matrix dataset as a hierarchically clustered heatmap
```

### 3.3.6. Multi-Plot Grids
* While exploring medium-dimensional data, a useful technique is to draw multiple instances of the same plot on different subsets of the dataset which allows us to extract a large amount of information about the complex data
* Seaborn provides plotting functions to link the structure of the plot to the structure of the dataset and draw multi-plot grids
* To use these features, the dataset has to be in a Pandas dataframe and it must be in a **tidy data format** which means that the dataframe should be structured such that each column is a feature/variable and each row is an observation/sample
* Multi-plot grids are further divided into three categories:
  - Facet grids
  - Pair grids
  - Joint grids
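
The tidy data format mentioned above can be illustrated with a short sketch. The wide table `wide` and its column names are hypothetical; pandas `melt()` is one common way to reshape such a table so that each row is a single observation.

```python
import pandas as pd

# Hypothetical wide table: one column per year
wide = pd.DataFrame({
    "city": ["A", "B"],
    "2020": [1.2, 3.4],
    "2021": [1.5, 3.1],
})

# Tidy (long) form: one row per observation, one column per variable
tidy = wide.melt(id_vars="city", var_name="year", value_name="value")
print(tidy)
```

In the tidy form, `city`, `year`, and `value` can each be mapped to a plot role such as `x`, `y`, `hue`, `row`, or `col`.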

**1) Facet grids:**
* Facet grids are built with the `FacetGrid` class and its two mapping methods. They are used to visualize the distribution of a variable, or the relationship between multiple variables, separately within subsets of the dataset

```
* FacetGrid() 
* FacetGrid.map() 
* FacetGrid.map_dataframe()
```

**2) Pair grids:**
* Pair grids provide plotting functions to visualize pairwise data relationships. The basic plotting function is `PairGrid()`. It allows us to draw a grid of small subplots using the same plot type to visualize data. In a `PairGrid()`, each row and column is assigned to a different variable. So, the resulting plot shows each pairwise relationship in the dataset

**3) Joint grids:**
* Joint grids draw joint plots, that is, a bivariate plot with marginal univariate plots. The `JointGrid` class and its three main methods are:

```
* JointGrid()                  - draw a bivariate plot with marginal univariate plots
* JointGrid.plot()             - shortcut to draw the full plot
* JointGrid.plot_joint()       - draw a bivariate plot of x and y
* JointGrid.plot_marginals()   - draw univariate plots for x and y separately
```

## 3.4. Seaborn Functionality
Seaborn offers a wide range of functionality for data visualization tasks:
* Seaborn provides a dataset-oriented API to examine relationships between variables
* It provides functions to fit and visualize linear regression models for different types of independent and dependent variables
* Seaborn provides functions for visualizing univariate and bivariate distributions and for comparing them between subsets of data
* It provides plotting functions for using categorical variables to show observations or aggregate statistics
* It helps to visualize matrices of data and use clustering algorithms to discover structure in those matrices
* Seaborn provides a plotting function to plot statistical time series data. The function provides flexible estimation and representation of uncertainty around the estimate
* It provides tools for choosing styles, color palettes, palette widgets, and utility functions. These tools help us to make beautiful plots that reveal patterns in the data
* It provides several built-in themes for producing stylish-looking Matplotlib graphics

## 3.5. Set aesthetic parameters with the set() method
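We can set the aesthetic parameters of Seaborn plots with the `set()` function (also available as `set_theme()` in Seaborn 0.11 and later). Its parameters are `context`, `style`, `palette`, `font`, `font_scale`, `color_codes`, and `rc`, a dictionary of Matplotlib rc parameters. Example:
```
import seaborn as sns

sns.set()   # apply the default Seaborn theme to all subsequent plots
```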
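
As a hedged illustration of the parameters listed above, the sketch below passes several of them in one call and then inspects the resulting Matplotlib settings. The specific values chosen here are arbitrary examples, not recommendations from the book.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Apply several aesthetic parameters in one call
sns.set(
    context="notebook",              # overall scale of plot elements
    style="whitegrid",               # axes background and grid lines
    palette="muted",                 # default color cycle
    font_scale=1.2,                  # multiplier applied to font sizes
    rc={"figure.figsize": (6, 4)},   # extra Matplotlib rc overrides
)

print(plt.rcParams["axes.grid"])        # the whitegrid style switches the grid on
print(plt.rcParams["figure.figsize"])   # the rc override is applied last
```

Because these settings are written into Matplotlib's global rc parameters, they also affect plots drawn with plain Matplotlib afterwards.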

## 3.6. Seaborn Color Palette
Color plays a very important role in data visualization. Used effectively, color adds further dimensions to a plot. In painting, a palette is the flat surface on which a painter mixes paint; in Seaborn, it is a set of colors. The `color_palette()` function returns a list of colors that defines a palette, which can then be used to color plots. Seaborn ships with several ready-made palettes:
* `deep`
* `muted`
* `bright`
* `pastel`
* `dark`
* `colorblind`

Besides these, we can also create new palettes. Another function, `seaborn.palplot()`, plots a color palette as a horizontal array.
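
A minimal sketch of what `color_palette()` returns, and of two ways to create a new palette, is shown below. The hex colors and the choice of `"seagreen"` are arbitrary examples.

```python
import seaborn as sns

named = sns.color_palette("muted", 6)                           # built-in palette, six colors
custom = sns.color_palette(["#1b9e77", "#d95f02", "#7570b3"])   # hand-picked hex colors
light = sns.light_palette("seagreen", n_colors=5)               # sequential ramp built from one color

print(len(named), len(custom), len(light))
print(named[0])   # each entry is an (r, g, b) tuple of floats between 0 and 1
```

Any of these objects can be passed to `sns.palplot()` for a preview or to the `palette` parameter of a plotting function.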

**Qualitative or Categorical Color Palettes**: These are best suited to plotting categorical data. Calling `color_palette()` without arguments returns the current default palette:
```
current_palette1 = sns.color_palette()
sns.palplot(current_palette1)
```
**Sequential Color Palettes**: Sequential palettes are suited to data that ranges from relatively low values to high values. These palettes are named after their dominant color with a trailing ‘s’, so we pass ‘Greens’ rather than ‘Green’ as follows:
```
current_palette2 = sns.color_palette("Greens")
sns.palplot(current_palette2)
```

**Diverging Color Palette**: Diverging palettes use two different colors, each representing variation in one direction from a common midpoint. For example, for data ranging from -1 to +1, the values from -1 to 0 take one color and the values from 0 to +1 take another. When a diverging palette is used as a colormap, for instance in `heatmap()`, the `center` parameter sets the data value at which the colormap is centered. A diverging palette with seven colors can be displayed as follows:
```
current_palette3 = sns.color_palette("BrBG", 7)
sns.palplot(current_palette3)
```
**Default Color Palette**: We can set the default color palette for Seaborn plots with the `set_palette()` function. It accepts the same arguments as `color_palette()`, but it changes the default Matplotlib parameters so that the palette is used for all subsequent plots.
```
import numpy as np
import matplotlib.pyplot as plt

def sinplot(flip=1):
    """Plot four offset sine waves to show the active color cycle."""
    x = np.linspace(0, 15, 100)
    for i in range(1, 5):
        plt.plot(x, np.sin(x + i * .5) * (7 - i) * flip)

sns.set_style("white")    # plain white axes background
sns.set_palette("husl")   # make husl the default color cycle
sinplot()
```

## 3.7. Plotting Univariate Distribution with `distplot()`
One of the first things to do when analyzing data is to understand its distribution. The `distplot()` function provides a quick look at a univariate distribution: it plots a histogram and overlays a kernel density estimate (KDE) fitted to the data. Note that `distplot()` has been deprecated since Seaborn 0.11 in favor of `histplot()` and `displot()`, so recent versions emit a warning when it is called. Example:
```
tips = sns.load_dataset('tips')     # load dataset
sns.distplot(tips['total_bill'])
```

### 3.7.1. Seaborn – Histogram
Histograms represent the distribution of data by forming bins along the range of the data and then drawing bars to show the number of observations that fall in each bin. We can use the same `distplot()` function to plot a histogram:
```
sns.distplot(tips['total_bill'], kde=False)
```
Because the `kde` parameter is set to `False`, the kernel density estimate is omitted and only the histogram is plotted.

### 3.7.2. Seaborn – Kernel Density Estimation (KDE)
Kernel Density Estimation (KDE) is a non-parametric way to estimate the probability density function of a continuous random variable. Setting the `hist` parameter to `False` in the `distplot()` function yields only the kernel density estimate:
```
sns.distplot(tips['total_bill'], hist=False)
```

## 3.8. Plotting Bivariate Distribution with `jointplot()`
A bivariate distribution describes how two variables vary together, that is, how one variable behaves with respect to the other. A convenient way to analyze a bivariate distribution in Seaborn is the `jointplot()` function. It creates a multi-panel figure that shows the bivariate relationship between two variables together with the univariate distribution of each variable on separate axes.
```
sns.jointplot(x="total_bill", y="tip", data=tips)
```

## 3.9. Seaborn – `scatterplot()`
A scatter plot can be used to demonstrate the relationship between two variables x and y. A simple scatter plot can be drawn as follows:
```
sns.scatterplot(x="total_bill", y="tip", data=tips)
```
The relationship between the variables can be shown for different subsets of the data using the `hue`, `size`, and `style` parameters.
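
A minimal sketch of the three semantic parameters follows. The dataframe `df` holds made-up values with the same column names as the tips dataset, so that the example runs without downloading anything.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows using the same column names as the tips dataset
df = pd.DataFrame({
    "total_bill": [16.9, 10.3, 21.0, 23.7, 24.6, 25.3, 8.8, 26.9],
    "tip": [1.0, 1.7, 3.5, 3.3, 3.6, 4.7, 2.0, 3.1],
    "size": [2, 3, 3, 2, 4, 4, 2, 4],
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner", "Dinner", "Lunch"],
    "smoker": ["No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes"],
})

fig, ax = plt.subplots()
sns.scatterplot(
    data=df, x="total_bill", y="tip",
    hue="smoker",    # color encodes smoker status
    style="time",    # marker shape encodes time of day
    size="size",     # marker area encodes party size
    ax=ax,
)
```

Seaborn adds a legend that explains all three encodings, which is usually easier than building the same mapping by hand in Matplotlib.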

## 3.10. Visualizing Pairwise relationship with `pairplot()`
Some datasets contain many variables, and we often want to examine the relationships between them. To plot multiple pairwise bivariate distributions, we can use the `pairplot()` function. It draws a matrix of plots with one row and one column per numeric variable in the dataframe: the off-diagonal plots show the relationship for each pair of variables, and the diagonal plots show the univariate distribution of each variable. We can plot a pairplot as:
```
sns.set_style("ticks")
sns.pairplot(tips)
```

## 3.11. Plotting Categorical data
Histograms, scatter plots, and KDE plots are used to analyze continuous variables. They are not suitable when the variable under study is categorical. When one or both of the variables under study are categorical, we can use plots like
```
* stripplot()      - to plot categorical data
* swarmplot()      - to plot categorical data
```

### 3.11.1. Seaborn – `stripplot()`
A `stripplot()` draws a scatterplot in which one variable is categorical. The observations are arranged in a strip for each category along the categorical axis. We can plot a stripplot as follows:
```
sns.stripplot(x="day", y="total_bill", data=tips)
```
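In recent Seaborn versions, jitter is applied by default to spread out overlapping points and bring out the distribution of values. The `jitter` parameter can also be set explicitly: `True` for the default amount, a float for a specific amount, or `False` to disable it: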
```
sns.stripplot(x="day", y="total_bill", data=tips, jitter=True)
```

### 3.11.2. Seaborn – `swarmplot()`
An alternative to jitter is the `swarmplot()` function. It adjusts the position of each point along the categorical axis so that the points do not overlap. So `swarmplot()` can be used to draw categorical scatterplots with non-overlapping points. We can plot a swarmplot as follows:
```
sns.swarmplot(x="day", y="total_bill", data=tips)
```

## 3.12. Distribution of Observations
When dealing with a dataset, the first thing that we want to do is get a sense of how the variables are distributed. We can use the following plots to plot the data distribution.
```
* boxplot()
* violinplot()
```

### 3.12.1. Seaborn – `boxplot()`
We can draw a box plot to show distributions across categories. A box plot is a convenient way to visualize the distribution of data through its quartiles. Box plots usually have lines extending from the boxes, which are called whiskers. These whiskers indicate variability outside the upper and lower quartiles; hence box plots are also called box-and-whisker plots. Any outliers in the data are plotted as individual points.
We can draw a single horizontal box plot as follows:
```
sns.boxplot(x=tips["total_bill"])
```
We can draw a vertical boxplot grouped by a categorical variable as follows:
```
sns.boxplot(x="day", y="total_bill", data=tips)
```
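
To make the quartile, whisker, and outlier terms concrete, the sketch below computes them by hand with NumPy. It assumes the common 1.5 × IQR rule, which matches the default `whis=1.5` setting of `boxplot()`; the `values` array is made up for illustration.

```python
import numpy as np

# Made-up values with one unusually large observation
values = np.array([8, 10, 11, 12, 13, 14, 15, 16, 18, 40], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])   # edges and center line of the box
iqr = q3 - q1                                          # interquartile range (box length)
lower_limit = q1 - 1.5 * iqr                           # whiskers cannot extend past these limits
upper_limit = q3 + 1.5 * iqr

inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max() # whiskers end at the extreme inside points
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

With these values the limits work out to 4.5 and 22.5, so the whiskers end at 8 and 18 and only the observation 40 is drawn as an individual outlier point.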

### 3.12.2. Seaborn – `violinplot()`
Violin plots combine the box plot with a kernel density estimate, so they show the shape of the distribution in addition to its summary statistics. We can draw a single horizontal violin plot as follows:
```
sns.violinplot(x=tips["total_bill"])
```
We can draw a vertical violinplot grouped by a categorical variable as follows:
```
sns.violinplot(x="day", y="total_bill", data=tips)
```

## 3.13. Statistical estimation with Seaborn
Sometimes, we need to estimate the central tendency of a distribution. The mean and the median are the most commonly used measures. The following plots show an estimate of central tendency for each category:
```
* barplot()
* pointplot()
```

### 3.13.1. Seaborn – `barplot()`
A **barplot** shows the relationship between a categorical variable and a continuous variable. The data is represented as rectangular bars, where the height of each bar represents an estimate of central tendency (the mean by default) of the continuous variable in that category, and an error bar shows the confidence interval around the estimate. We can draw a set of vertical bar plots grouped by a categorical variable as follows:
```
sns.barplot(x="day", y="total_bill", data=tips)
```
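
The sketch below checks what the bar heights represent by comparing them with group means computed in pandas, and it shows the `estimator` parameter for switching to the median. The dataframe `df` contains made-up values.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up rows: two observations per day
df = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
    "total_bill": [16.9, 10.3, 21.0, 23.7, 24.6, 25.3, 8.8, 26.9],
})

fig, axes = plt.subplots(1, 2, figsize=(9, 3))
sns.barplot(data=df, x="day", y="total_bill", ax=axes[0])                       # default: mean per day
sns.barplot(data=df, x="day", y="total_bill", estimator=np.median, ax=axes[1])  # median per day

means = df.groupby("day")["total_bill"].mean()
heights = [patch.get_height() for patch in axes[0].patches]   # one bar per day
print(sorted(heights))
print(sorted(means))
```

The two printed lists agree, which confirms that a bar's height is the estimate for that category rather than a count or proportion of rows.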

### 3.13.2. Seaborn – `pointplot()`
Point plots serve the same purpose as bar plots but in a different style. Rather than a full bar, the value of the estimate is represented by a point at the corresponding height on the value axis. The point plot shows point estimates and confidence intervals using scatter plot glyphs. We can draw a set of vertical point plots grouped by a categorical variable, with and without a `hue` variable, as follows:
```
sns.pointplot(x="time", y="total_bill", data=tips)
sns.pointplot(x="time", y="total_bill", hue="smoker", data=tips)
```

## 3.14. Linear relationships with Seaborn
Many datasets contain multiple quantitative variables, and the goal of an analysis is often to relate those variables to each other. We can use statistical models, called **regression models**, to estimate a simple relationship between two sets of observations. When building regression models, we often check for multicollinearity by examining the correlation between every pair of continuous variables, and we take action to remove it if it exists. Seaborn has two main functions to visualize a linear relationship determined through regression, `regplot()` and `lmplot()`, along with `residplot()` to plot the residuals:
```
* regplot() 
* lmplot()
* residplot()
```

### 3.14.1. Seaborn – `regplot()`
The function `regplot()` plots data and fits a linear regression model. It draws a scatterplot of two variables x and y, fits the regression model y ~ x, and plots the resulting regression line with a 95% confidence interval for that regression. We can draw the scatterplot and regression line using `regplot()` as follows:
```
sns.regplot(x="total_bill", y="tip", data=tips)
```

### 3.14.2. Seaborn – `lmplot()`
The function `lmplot()` plots data and fits the regression model across a `FacetGrid`. This function combines `regplot()` and `FacetGrid`. It is intended as a convenient interface to fit regression models across conditional subsets of a dataset.
We can plot a simple linear relationship between two variables using `lmplot()` as follows:
```
sns.lmplot(x="total_bill", y="tip", data=tips)
```
The `regplot()` and `lmplot()` functions are closely related, but the former is an axes-level function while the latter is a figure-level function that combines `regplot()` and `FacetGrid`.
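
The difference between the two levels shows up once a conditioning variable is added. In the sketch below, which uses a made-up dataframe `df`, `lmplot()` fits one regression per facet and returns a `FacetGrid` rather than a single axes.

```python
import pandas as pd
import seaborn as sns

# Made-up rows using the same column names as the tips dataset
df = pd.DataFrame({
    "total_bill": [16.9, 10.3, 21.0, 23.7, 24.6, 25.3, 8.8, 26.9],
    "tip": [1.0, 1.7, 3.5, 3.3, 3.6, 4.7, 2.0, 3.1],
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner", "Dinner", "Lunch"],
})

# One regression fit per value of "time", drawn in separate columns
g = sns.lmplot(data=df, x="total_bill", y="tip", col="time")
print(type(g).__name__)   # the figure-level function returns a grid object
print(g.axes.shape)       # one row, one column per value of "time"
```

To place a regression on an axes that already exists, `regplot()` with its `ax` parameter is the appropriate choice instead.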

### 3.14.3. Seaborn – `residplot()`
The function `residplot()` plots the residuals of a linear regression. We can plot the residuals as follows:
```
sns.residplot(x="total_bill", y="tip", data=tips)
```

## 3.15. Matrix plots with Seaborn
Two functions enable us to plot data in the form of a matrix. These are:
```
* heatmap()
* clustermap()
```

### 3.15.1. Seaborn – `heatmap()`
Seaborn `heatmap()` function plots rectangular data as a color-encoded matrix. We can plot a heatmap for a numpy array as follows:
```
uniform_data = np.random.rand(10, 12)
sns.heatmap(uniform_data)
```
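
A common practical use of `heatmap()` is to display a correlation matrix. The sketch below is one way to do that, with a diverging colormap centered at zero; the dataframe `df` is made up, and the choice of `"BrBG"` and the fixed limits are illustrative assumptions.

```python
import pandas as pd
import seaborn as sns

# Made-up numeric columns
df = pd.DataFrame({
    "total_bill": [16.9, 10.3, 21.0, 23.7, 24.6, 25.3, 8.8, 26.9],
    "tip": [1.0, 1.7, 3.5, 3.3, 3.6, 4.7, 2.0, 3.1],
    "size": [2, 3, 3, 2, 4, 4, 2, 4],
})

corr = df.corr()   # 3x3 matrix of pairwise correlations
ax = sns.heatmap(
    corr,
    annot=True,           # write the value inside each cell
    cmap="BrBG",          # diverging colormap
    vmin=-1, vmax=1,      # fix the color scale to the range of a correlation
    center=0,             # the midpoint color corresponds to zero correlation
)
```

Fixing `vmin`, `vmax`, and `center` keeps the colors comparable between different correlation matrices.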

### 3.15.2. Seaborn – `clustermap()`
Seaborn `clustermap()` function plots a matrix dataset as a hierarchically clustered heatmap. We can plot a clustered heatmap of the numeric columns of the tips dataset as follows:
```
df1 = tips[['total_bill', 'tip', 'size']]
sns.clustermap(df1)
```

## 3.16. Multi-plot grids with Seaborn
When exploring medium-dimensional data, a useful approach is to draw multiple instances of the same plot on different subsets of the dataset. This technique is sometimes called “lattice” or “trellis” plotting, and it is related to the idea of “small multiples”. It allows us to quickly extract a large amount of information about complex data. Seaborn provides three types of grids:
```
* FacetGrid
* PairGrid 
* JointGrid
```

### 3.16.1. Seaborn – `FacetGrid()`
Seaborn `FacetGrid()` enables us to draw a multi-plot grid for plotting conditional relationships. Let’s initialize a 2x2 grid of facets using the tips dataset as follows:
```
sns.FacetGrid(tips, col="time", row="smoker")
```
Initializing the grid like this sets up the Matplotlib figure and axes but doesn’t draw anything on them.
The main approach for visualizing data on this grid is with the `FacetGrid.map()` method. We need to provide it with a plotting function and the name(s) of variable(s) in the dataframe to plot. We can draw a univariate plot on each facet as follows:
```
g = sns.FacetGrid(tips, col="time", row="smoker")
g = g.map(plt.hist, "total_bill")
```
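
The other mapping method, `FacetGrid.map_dataframe()`, passes each subset of the dataframe to the plotting function with keyword arguments, which suits Seaborn's own functions. The sketch below uses a made-up dataframe `df`; the axis label text is an arbitrary choice.

```python
import pandas as pd
import seaborn as sns

# Made-up rows covering every time/smoker combination
df = pd.DataFrame({
    "total_bill": [16.9, 10.3, 21.0, 23.7, 24.6, 25.3, 8.8, 26.9],
    "tip": [1.0, 1.7, 3.5, 3.3, 3.6, 4.7, 2.0, 3.1],
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner", "Dinner", "Lunch"],
    "smoker": ["No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes"],
})

g = sns.FacetGrid(df, col="time", row="smoker", hue="day")   # 2x2 grid, colored by day
g.map_dataframe(sns.scatterplot, x="total_bill", y="tip")    # bivariate plot on each facet
g.add_legend()                                               # legend for the hue variable
g.set_axis_labels("Total bill", "Tip")                       # shared axis labels
```

Methods such as `add_legend()` and `set_axis_labels()` act on the whole grid, so each facet does not need to be styled separately.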

### 3.16.2. Seaborn – `PairGrid()`
Seaborn `PairGrid()` draws a grid of subplots for plotting pairwise relationships in a dataset. We can draw a scatterplot for each pairwise relationship using `PairGrid()` as follows:
```
g = sns.PairGrid(tips)
g = g.map(plt.scatter)
```
We can show a univariate distribution on the diagonal as follows:
```
g = sns.PairGrid(tips)
g = g.map_diag(plt.hist)
g = g.map_offdiag(plt.scatter)
```

### 3.16.3. Seaborn – `JointGrid()`
Seaborn `JointGrid()` sets up a grid of subplots for a bivariate plot with marginal univariate plots. We can initialize the figure without drawing any plots onto it as follows:
```
g = sns.JointGrid(x="total_bill", y="tip", data=tips)
```
We can add plots using default parameters as follows:
```
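g = sns.JointGrid(x="total_bill", y="tip", data=tips)
# regplot is drawn on the joint axes, distplot on the two marginal axes
# (distplot is deprecated since Seaborn 0.11; sns.histplot is its replacement)
g = g.plot(sns.regplot, sns.distplot)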
```
