# Data Visualization with Seaborn

Seaborn plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots. 

Seaborn uses matplotlib to draw its plots. For interactive work, it’s recommended to use a Jupyter/IPython interface in matplotlib mode; otherwise you’ll have to call matplotlib.pyplot.show() when you want to see the plot. Matplotlib mode takes care of creating and displaying figures automatically, but leaving many figures open can sometimes lead to out-of-memory (OOM) issues.

## Different categories of plot in Seaborn

Plots are used to visualize relationships between variables. Those variables can be either entirely numerical or categorical, such as a group, class, or division. Seaborn divides its plots into the following categories:

- __Relational plots__: These plots are used to understand the relationship between two variables.
- __Categorical plots__: These plots deal with categorical variables and how they can be visualized.
- __Distribution plots__: These plots are used for examining univariate and bivariate distributions.
- __Regression plots__: The regression plots in seaborn are primarily intended to add a visual guide that helps to emphasize patterns in a dataset during exploratory data analysis.
- __Matrix plots__: These plots display matrix-shaped data as a color-encoded grid, as in heat maps and cluster maps.
- __Multi-plot grids__: These grids draw multiple instances of the same plot on different subsets of the dataset.

__Reference Links__: 
- https://seaborn.pydata.org/tutorial/introduction.html
- https://medium.com/analytics-vidhya/introduction-to-matplotlib-and-seaborn-e2dd04bfc821

In [None]:
import seaborn as sb
from matplotlib import pyplot as plt
df = sb.load_dataset('tips')
df.head(10)

## Relational Plots

Relational plots are used for visualizing the statistical relationship between the data points. They plot two-dimensional graphics that can be enhanced by mapping up to three additional variables using the semantics of hue, size, and style.

### Rel Plot
relplot() is a figure-level function that provides access to several axes-level functions, which show the relationship between two variables with semantic mappings of subsets. The plot below shows the relationship between five variables in the tips dataset using a single call to relplot(). The hue, style, and size parameters take grouping variables and draw their levels with different colors, marker styles, and marker sizes, respectively.

In [None]:
sb.relplot(data=df, x="total_bill", y="tip", col="time", hue="smoker", style="smoker", size="size")
plt.show()

Both of the plots discussed below can also be drawn with the kind parameter of relplot(). The default, kind="scatter", gives scatterplot(), and kind="line" gives lineplot(). Although both plots can be drawn using relplot(), seaborn also has separate functions for them, and these offer some functionality beyond what relplot() provides.

### Scatter Plot

The scatter plot is a mainstay of statistical visualization. It depicts the joint distribution of two variables using a cloud of points, where each point represents an observation in the dataset. This depiction allows the eye to infer a substantial amount of information about whether there is any meaningful relationship between them.

In [None]:
sb.scatterplot(x="total_bill", y="tip", style="time", data=df)
plt.show()

### Line Plot

Scatter plots are highly effective, but there is no universally optimal type of visualization. For certain datasets, you may want to understand changes in one variable as a function of time, or of a similarly continuous variable. In this case, drawing a line plot is a better option.

In [None]:
fmri = sb.load_dataset("fmri")
sb.lineplot(x="timepoint", y="signal", hue="region", style="event", data=fmri)
plt.show()

## Categorical Plots

If one of the main variables is “categorical” (divided into discrete groups) it may be helpful to use a more specialized approach to visualization. There are a number of axes-level functions for plotting categorical data in different ways and a figure-level interface, catplot(), that gives unified higher-level access to them. Some examples of categorical plots are discussed below.

### Bar Plot

A barplot aggregates the values within each category using an estimator function, which by default is the mean. The height of each bar is therefore the mean of the numeric variable within that categorical bin, but the statistic can be changed through the estimator parameter. The palette parameter sets the colors of the plot.

With the default estimator, the plot would show that the average total_bill is higher for males than for females. The example below instead switches the estimator to the standard deviation, so the bars show the spread of total_bill for each sex rather than its average.

In [None]:
# change the estimator from mean to standard deviation
import numpy as np
sb.barplot(x='sex', y='total_bill', data=df, palette='plasma', estimator=np.std)
plt.show()
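
To make the aggregation step concrete, here is a minimal plain-Python sketch of what a bar plot estimator does: the values are grouped by category, and one statistic per group becomes the bar height. The sample rows and the helper `aggregate_by_category` are hypothetical and invented for illustration; this is not seaborn's internal implementation. `statistics.pstdev` is used because, like the default `np.std`, it computes the population standard deviation.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical (category, value) observations, invented for illustration.
rows = [
    ("Male", 20.0), ("Male", 30.0), ("Male", 10.0),
    ("Female", 15.0), ("Female", 25.0),
]


def aggregate_by_category(rows, estimator=mean):
    """Group values by category and reduce each group with `estimator`."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    return {category: estimator(values) for category, values in groups.items()}


# Default behaviour: one mean per category.
bar_heights_mean = aggregate_by_category(rows)
# Swapping the estimator changes what the bar heights represent.
bar_heights_std = aggregate_by_category(rows, estimator=pstdev)

print(bar_heights_mean)
print(bar_heights_std)
```

In this toy data both groups have the same mean but different spreads, which is the kind of difference that only shows up once the estimator is changed.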

### Count Plot

A countplot counts the occurrences of each category. Because the count is based on a single categorical column, we only need to specify the x parameter. It is a special case of the bar plot, used when you want to show the number of observations in each category rather than compute a statistic for a second variable.

In [None]:
titanic = sb.load_dataset("titanic")
sb.countplot(data=titanic, x="deck", palette="ch:.25")
plt.show()
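
The counting step can be sketched without any plotting library. The `decks` list below is hypothetical sample data; missing entries are skipped here on the assumption that categories with no recorded value should not get a bar.

```python
from collections import Counter

# Hypothetical deck labels; None stands in for a missing value.
decks = ["C", "E", "C", None, "B", "C", None, "E"]

# One count per observed category, ignoring missing entries.
counts = Counter(deck for deck in decks if deck is not None)

for deck, n in sorted(counts.items()):
    print(deck, n)
```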

### Box Plot

This kind of plot shows the three quartile values of the distribution along with extreme values. The “whiskers” extend to points that lie within 1.5 IQRs of the lower and upper quartile, and observations that fall outside this range are displayed independently. This means that each value in the boxplot corresponds to an actual observation in the data.

Looking at the plot below, we can say that non-smokers had a higher bill on Friday than smokers.

In [None]:
sb.boxplot(x='day', y='total_bill', data=df, hue='smoker')
plt.show()
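
The whisker rule described above can be worked through by hand. The sketch below uses hypothetical values and the standard library's `statistics.quantiles` with linear interpolation (`method="inclusive"`); the exact quantile method a plotting library uses may differ, so treat this as an illustration of the 1.5 IQR rule rather than a reproduction of seaborn's output.

```python
from statistics import quantiles

# Hypothetical sorted observations; 30 is deliberately far from the rest.
values = [2, 4, 5, 7, 8, 9, 10, 12, 30]

# The three quartile values drawn as the box and its middle line.
q1, median, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Points within 1.5 IQRs of the quartiles are covered by the whiskers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = [v for v in values if lower_fence <= v <= upper_fence]

# Whiskers end at actual observations, not at the fences themselves.
whisker_low, whisker_high = min(inside), max(inside)

# Everything beyond the fences is drawn as an individual point.
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Note that the whiskers stop at 2 and 12, which are real data points, while the fences (-2.5 and 17.5) are only used to decide which points count as extreme.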

### Boxen Plot

It draws a plot that is similar to a box plot but optimized for showing more information about the shape of the distribution. It is best suited to larger datasets.

In [None]:
diamonds = sb.load_dataset("diamonds")
sb.boxenplot(data=diamonds.sort_values("color"),x="color", y="price")
plt.show()

### Violin Plot

It is similar to the boxplot, except that it uses kernel density estimation (KDE) to give a richer description of the shape of the data distribution. Additionally, the quartile and whisker values from the boxplot are shown inside the violin. The downside is that, because the violinplot uses a KDE, there are some other parameters that may need tweaking, adding some complexity relative to the straightforward boxplot.

In [None]:
sb.violinplot(x='day', y='total_bill', data=df, hue='sex')
plt.show()

It’s also possible to “split” the violins when the hue parameter has only two levels, which can allow for a more efficient use of space.

In [None]:
sb.violinplot(x='day', y='total_bill', data=df, hue='sex', split=True)
plt.show()
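
The outline of each violin comes from a kernel density estimate. As a rough sketch of the idea, the code below builds a Gaussian KDE by hand: every observation contributes a small bell curve, and the curves are averaged. The sample bills, the `gaussian_kde` helper, and the bandwidth of 3.0 are all hypothetical choices; the bandwidth is an example of the kind of parameter that may need tweaking.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum of Gaussian bumps, one centered on each observation.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical bill amounts, invented for illustration.
bills = [10.0, 12.0, 13.0, 20.0, 21.0, 35.0]
density = gaussian_kde(bills, bandwidth=3.0)

# Evaluate on a grid from -20 to 90 in steps of 0.5 and check the area.
step = 0.5
grid = [i * step for i in range(-40, 181)]
area = sum(density(x) for x in grid) * step

print(round(area, 3))
print(density(12.0) > density(28.0))
```

A smaller bandwidth would make the curve follow individual points more closely, while a larger one would smooth the three clusters into a single hump.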

### Strip Plot

It creates a scatter plot in which one variable is categorical. One problem with a strip plot is that you can’t tell which points are stacked on top of each other. The jitter parameter addresses this by adding a small amount of random noise (only along the categorical axis), which makes the distribution easier to see when many points overlap. hue is used to provide an additional categorical separation.

In [None]:
sb.stripplot(x='day', y='total_bill', data=df, jitter=True, hue='smoker', dodge=True)
plt.show()
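
The effect of jitter can be sketched in a few lines. In this illustration the categories are mapped to integer positions and a small uniform offset is added along that axis only; the helper `jitter_positions`, the width of 0.2, and the sample observations are hypothetical and not taken from seaborn.

```python
import random


def jitter_positions(category_indices, width=0.2, seed=0):
    """Shift each categorical position by a random offset in [-width, width]."""
    rng = random.Random(seed)
    return [idx + rng.uniform(-width, width) for idx in category_indices]


days = ["Thur", "Fri", "Sat", "Sun"]

# Hypothetical (day, bill) observations with exact duplicates.
observations = [
    ("Thur", 12.5), ("Thur", 12.5),
    ("Fri", 18.0),
    ("Sat", 18.0), ("Sat", 18.0),
]

# Without jitter, duplicates share the same x position and hide each other.
x_base = [days.index(day) for day, _ in observations]
x_jittered = jitter_positions(x_base)

# The value axis is left untouched, so no measurement is distorted.
y = [bill for _, bill in observations]

for xb, xj, yv in zip(x_base, x_jittered, y):
    print(xb, round(xj, 3), yv)
```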

### Swarm Plot

It is very similar to the strip plot, except that the points are adjusted so that they do not overlap. One drawback of swarmplot() is that it does not scale well to very large numbers of observations, because arranging the points takes a lot of computation. Some people also like to combine a swarm plot with a violin plot: drawing the swarm plot on top of a violin plot shows the individual observations together with the overall shape of the distribution.

In [None]:
sb.swarmplot(x='day', y='total_bill', data=df)
plt.show()

In [None]:
sb.violinplot(x='day', y='total_bill', data=df)
sb.swarmplot(x='day', y='total_bill', data=df, color='black')
plt.show()

### Point Plot

Like barplot(), this function encodes the value of the estimate with height on the other axis, but rather than showing a full bar, it plots the point estimate and confidence interval. Additionally, it connects points from the same hue category. This makes it easy to see how the main relationship is changing as a function of the hue semantic, because your eyes are quite good at picking up on differences of slopes.

In [None]:
titanic = sb.load_dataset("titanic")
titanic.head()

In [None]:
sb.pointplot(data=titanic, x="sex", y="survived", hue="class")
plt.show()
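
A point estimate with a confidence interval can be illustrated with a percentile bootstrap. This is only a sketch under stated assumptions: the 0/1 survival values are hypothetical, `bootstrap_ci` is a helper written for this example, and seaborn's own interval computation may differ in its details.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return boot_means[lo_idx], boot_means[hi_idx]


# Hypothetical survival flags (1 = survived) for one group.
survived = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

estimate = mean(survived)               # the plotted point
ci_low, ci_high = bootstrap_ci(survived)  # the plotted error bar

print(estimate, ci_low, ci_high)
```

With only twelve observations the interval is wide, which is exactly the uncertainty the error bars in a point plot are meant to convey.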

## Distribution Plots

Distribution plots in seaborn are used for examining univariate and bivariate distributions. There are several different approaches to visualizing a distribution, and each has its relative advantages and drawbacks. It is important to understand these factors so that you can choose the best approach for your particular aim.

### Joint Plot

It plots the bivariate and univariate graphs of two variables. By default, it represents the bivariate distribution using scatterplot() and the marginal distributions using histplot(). The kind parameter controls how the data inside the jointplot is visualized. The default is "scatter"; other options include "hex", "reg" (regression), and "kde".

In [None]:
sb.jointplot(x="total_bill", y="tip", data=df, kind="reg")
plt.show()

### Pair Plot

The pairplot() function offers a similar blend of joint and marginal distributions. Rather than focusing on a single relationship, however, pairplot() uses a “small-multiple” approach to visualize the univariate distribution of all variables in a dataset along with all of their pairwise relationships. With kind="reg", it combines regplot() and PairGrid to show the linear relationship between variables in a dataset. PairGrid is used to show multiple relationships between different pairings of the variables in a dataset.

In [None]:
sb.pairplot(df, x_vars=["total_bill", "size"], y_vars=["tip"], hue="smoker", height=5, aspect=.8, kind="reg")
plt.show()

### Dis Plot

displot() is used here for a univariate set of observations, that is, a single variable, which it visualizes as a histogram; hence we choose one particular column of the dataset. KDE stands for Kernel Density Estimation, which is another kind of distribution plot in seaborn; passing kde=True overlays a density curve on the histogram. bins sets the number of bins in the plot, and a good value depends on your dataset.

Looking at the plot below, we can say that most of the total bills lie between 10 and 20.

In [None]:
sb.displot(df['total_bill'], kde=False, color='red', bins=30)
plt.show()
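
The role of the bins setting is easiest to see by binning a few numbers by hand. The sketch below uses hypothetical bill amounts and a helper `histogram` written for this example, with equal-width bins over a fixed range; it illustrates the counting behind a histogram, not seaborn's own binning rules.

```python
def histogram(values, bins, low, high):
    """Count how many values fall into each of `bins` equal-width intervals."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # The top edge belongs to the last bin.
            idx = min(int((v - low) / width), bins - 1)
            counts[idx] += 1
    return counts


# Hypothetical bill amounts, invented for illustration.
bills = [8.5, 12.0, 14.2, 15.9, 17.3, 18.8, 19.5, 22.1, 27.4, 41.0]

# Five bins of width 10 covering 0 to 50.
counts = histogram(bills, bins=5, low=0.0, high=50.0)
print(counts)  # [1, 6, 2, 0, 1]
```

Most of these toy values land in the 10 to 20 bin; using more bins would split that peak into finer detail, while fewer bins would hide it.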

## Regression Plots

The regression plots in seaborn are primarily intended to add a visual guide that helps to emphasize patterns in a dataset during exploratory data analysis. As the name suggests, a regression plot draws a regression line between two variables and helps to visualize their linear relationship.

The functions in seaborn for visualizing a linear regression are __regplot()__ and __lmplot()__. These functions draw similar plots, but regplot() is an axes-level function and lmplot() is a figure-level function. Only lmplot() has hue as a parameter; the Lm Plot example below shows its use.

### Reg Plot

In [None]:
sb.regplot(x="total_bill", y="tip", data=df)
plt.show()
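
The line drawn on a regression plot is a fitted linear model. As a minimal sketch, the code below fits a straight line by ordinary least squares in plain Python. The data points are hypothetical and were chosen to lie exactly on a line, and `fit_line` is a helper written for this example rather than part of seaborn.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that follows tip = 0.15 * total_bill + 0.5 exactly.
total_bill = [10.0, 20.0, 30.0, 40.0]
tip = [2.0, 3.5, 5.0, 6.5]

slope, intercept = fit_line(total_bill, tip)
print(round(slope, 4), round(intercept, 4))
```

Real data will scatter around the fitted line, which is why the plot shows both the points and the line.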

### Lm Plot

lmplot() creates a linear model plot: a scatter plot with a simple linear regression fit drawn on top of it.

In [None]:
sb.lmplot(x="total_bill", y="tip", hue="smoker", col="time", row="sex", data=df, height=3)
plt.show()

## Matrix Plots

There are two types of matrix plots: heat maps and cluster maps. To plot a matrix, the data should be arranged so that the row (index) labels and the column labels together identify each cell, which is what makes the value stored in each cell meaningful.

### Heat Maps

A correlation matrix can be shown using a heat map.

In [None]:
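# correlation between the numeric columns
# (numeric_only=True skips the categorical columns, which pandas 2.0+ would otherwise reject)
tc = df.corr(numeric_only=True)

# plot a heatmap of the correlation matrix
sb.heatmap(tc)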
plt.show()
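
Each cell of the heat map above holds one correlation coefficient. The sketch below computes a small Pearson correlation matrix by hand to show where those numbers come from. The three columns are hypothetical sample data, and `pearson` is a helper written for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical numeric columns, invented for illustration.
columns = {
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [2.0, 3.5, 5.0, 6.5],
    "size": [1, 3, 2, 4],
}

# Every pair of columns gives one cell of the matrix.
corr = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}

for name, row in corr.items():
    print(name, [round(value, 2) for value in row.values()])
```

The matrix is symmetric with ones on the diagonal, which is why correlation heat maps always show a mirrored pattern around the diagonal.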

### Cluster Maps

Cluster maps use hierarchical clustering: the rows and columns of the matrix are grouped based on their similarity.

In [None]:
import pandas as pd

# load the flights dataset
fd = sb.load_dataset('flights')

# pivot into a month-by-year matrix of passenger counts
# (named flights_matrix so the tips dataframe df is not overwritten)
flights_matrix = pd.pivot_table(data=fd, values='passengers', index='month', columns='year')

# make a clustermap from the matrix
sb.clustermap(flights_matrix, cmap='plasma')
plt.show()

## Multi-Plot Grids

When exploring multi-dimensional data, a useful approach is to draw multiple instances of the same plot on different subsets of your dataset. It allows a viewer to quickly extract a large amount of information about a complex dataset. Matplotlib offers good support for making figures with multiple axes; seaborn builds on top of this to directly link the structure of the plot to the structure of your dataset.

### FacetGrid

The FacetGrid class is useful when you want to visualize the distribution of a variable or the relationship between multiple variables separately within subsets of your dataset. Each of relplot(), displot(), catplot(), and lmplot() uses this object internally, and each returns the object when it is finished so that it can be used for further tweaking.

In [None]:
exercise = sb.load_dataset("exercise")

# form a FacetGrid with one column per level of "time", colored by "kind"
sea = sb.FacetGrid(exercise, col="time", hue="kind")

# draw a scatter plot of "pulse" against "time" on every facet
sea.map(sb.scatterplot, "pulse", "time", alpha=.8)

# add a legend for the hue variable
sea.add_legend()
plt.show()
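
Conceptually, a facet grid splits the dataset into one subset per panel, and then into one subset per color within each panel. The following sketch shows that partitioning step in plain Python. The rows and the `facet` helper are hypothetical; they illustrate the idea and are not how FacetGrid is implemented.

```python
from collections import defaultdict

# Hypothetical rows shaped like the exercise dataset.
rows = [
    {"time": "1 min", "kind": "rest", "pulse": 85},
    {"time": "1 min", "kind": "walking", "pulse": 90},
    {"time": "15 min", "kind": "rest", "pulse": 88},
    {"time": "15 min", "kind": "running", "pulse": 130},
    {"time": "15 min", "kind": "running", "pulse": 126},
]


def facet(rows, col, hue):
    """Partition rows first by the column variable, then by the hue variable."""
    panels = defaultdict(lambda: defaultdict(list))
    for row in rows:
        panels[row[col]][row[hue]].append(row)
    return panels


panels = facet(rows, col="time", hue="kind")

# Each panel would receive the same plotting call on its own subset.
for panel_name in sorted(panels):
    for hue_level, subset in sorted(panels[panel_name].items()):
        print(panel_name, hue_level, [r["pulse"] for r in subset])
```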

### PairGrid

PairGrid also allows you to quickly draw a grid of small subplots using the same plot type to visualize data in each. In a PairGrid, each row and column is assigned to a different variable, so the resulting plot shows each pairwise relationship in the dataset. This style of plot is sometimes called a “scatterplot matrix”, as this is the most common way to show each relationship, but PairGrid is not limited to scatterplots.

In [None]:
iris = sb.load_dataset("iris")
g = sb.PairGrid(iris, vars=["sepal_length", "sepal_width"], hue="species")
g.map(sb.scatterplot)
plt.show()

The main difference between a FacetGrid and a PairGrid is this: in the former, each facet shows the same relationship conditioned on different levels of other variables; in the latter, each plot shows a different relationship. Using PairGrid can give you a very quick, very high-level summary of interesting relationships in your dataset.
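
The layout of a pair grid can be sketched as a table of variable pairs: each row fixes the y variable and each column fixes the x variable. The snippet below builds that table for the two iris variables used earlier. It is only an illustration of the layout, not of PairGrid's internals.

```python
variables = ["sepal_length", "sepal_width"]

# One cell per (x variable, y variable) pair: rows vary y, columns vary x.
grid = [[(x_var, y_var) for x_var in variables] for y_var in variables]

for row in grid:
    print(row)

# Cells on the diagonal pair a variable with itself.
diagonal = [grid[i][i] for i in range(len(variables))]
```

With n variables the grid has n × n cells, which is why a pair grid gets large quickly and is best restricted to a handful of variables.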