## Data Visualization with Seaborn


In this notebook, we will try out another library for data visualization called [Seaborn](https://seaborn.pydata.org/index.html).

Let's import all the libraries first. 


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


%matplotlib inline

Next, read in the table of returned orders.



In [None]:
orders = pd.read_csv('datasets/returned_orders.csv')
orders.head()

### Histogram with Seaborn

Seaborn is built on top of Matplotlib, and is integrated with Pandas. 

There are many high level functions defined in Seaborn that can help us to create complex plots quickly. 

The basic histogram is created by specifying the data and the column to be used for the x-axis.

In [None]:
# Create histogram with seaborn, return the axes object
fig = plt.figure(figsize=(8,5))
ax= sns.histplot(data = orders, x = 'sales', bins=20)

# use the axes object to customize
ax.set_title('Distribution of Sales Value for Orders')
plt.show()

## Comparing Categories

Seaborn makes it easy to add categorical comparisons by color. 

We can compare the distribution for different product categories by adding the keyword argument `hue`.
Specifying the `category` as the `hue` will separate each category by color.


In [None]:
# Create histogram with seaborn
ax = sns.histplot(data = orders, x = 'sales', bins = 20, hue='category')
plt.show()

## Seaborn boxplot

Similarly, we can create a comparative boxplot, which can make it easier to compare distributions across groups. Here we look at `quantity` by `margin_category`, split by `ship_mode`.



In [None]:
# Create boxplot of quantity by margin category, split by ship mode
ax = sns.boxplot(data=orders,
                 x='margin_category',
                 y='quantity',
                 hue='ship_mode')
ax.legend(loc='upper right')
plt.show()

In [None]:
orders.dtypes

### Scatterplot with Seaborn

Recall that we previously created a scatterplot using the `scatter` method in Matplotlib.

In [None]:
# Scatterplot with matplotlib
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(data = orders, x='discount', y='profit_margin',alpha=0.4)
plt.show()

Let's say we want to create the same scatterplot, but also compare the points by `ship_mode`.

All we have to do is pass `ship_mode` to the `hue` argument. Seaborn will *color* the points according to their ship mode and automatically create the legend.

In [None]:
# Create the scatterplot using seaborn, still using figure and ax as before 

fig, ax = plt.subplots(figsize=(10,5))

# but now we can add hue for a categorical variable and size for a numeric variable
ax = sns.scatterplot(data=orders,
                     x='discount', y='profit_margin',
                     hue='ship_mode', size='sales', alpha=0.7, ax=ax)
plt.show()

## Jointplot

The Seaborn jointplot shows the relationship between two variables as a scatterplot (bivariate) together with the distribution of each individual variable (univariate).

You can change both the bivariate and univariate plots by setting the `kind=` parameter. See the [jointplot documentation](https://seaborn.pydata.org/generated/seaborn.jointplot.html) for the available options.

In [None]:
# Create the jointplot using seaborn; it builds its own figure, so no fig/ax setup is needed
sns.jointplot(data =orders, x='discount', y='profit_margin', alpha = 0.7)

# add the hue by ship mode and see what happens!
# try kind='kde', kind='reg', kind='hex'

## Pairplot

The Seaborn pairplot draws a scatterplot for every pair of numeric columns and the distribution of each column on the diagonal. For our dataset, let's limit the plot to a few columns:


In [None]:
plot_data = orders[['discount', 'profit_margin', 'quantity','sales','order_day']]

In [None]:
sns.pairplot(plot_data)

## Displaying Categorical Data

Scatterplots are useful for visualizing the relationship between numerical variables. However, when one of the axes is categorical, the points line up on a few positions and may overlap too much to read.

### Stripplot
A stripplot adds random 'jitter' along the categorical axis so that overlapping points are spread apart.

In [None]:
sns.stripplot(x='order_day', y='sales', data=orders, hue='category', alpha=0.5)

### Violinplot

A violinplot shows the estimated distribution of a numeric variable as a smoothed density curve, with a miniature boxplot drawn inside it.

In [None]:
sns.violinplot(x='category', y='sales', data=orders, alpha=0.5, hue='returned', split=True)

## Faceting with Seaborn

We can create subplots, called **Facets**, with Seaborn to compare categories.

Each facet is used to represent one category, and can be defined by row, column, or both.


In [None]:
# For example, create a FacetGrid object where each column represents one ship mode
g = sns.FacetGrid(orders, col='ship_mode')

# Now map the plotting function onto the facets
g.map(sns.scatterplot, 'discount', 'profit_margin')
g.add_legend()
plt.show()

In [None]:
# For example, add another facet by row to show return status
g = sns.FacetGrid(orders, col='ship_mode', row='returned', hue='category')
g.map(sns.scatterplot, 'discount', 'profit_margin')
g.set_titles(col_template="{col_name}", row_template="Returned: {row_name}")
g.add_legend()
plt.show()
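A similar grid can also be produced in a single call with `relplot`, a figure-level function that returns a `FacetGrid`. The sketch below runs on a small synthetic table (`demo`, a hypothetical stand-in for `orders`; the category labels are invented for illustration).

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the orders table (synthetic values)
rng = np.random.default_rng(3)
n = 120
demo = pd.DataFrame({
    'discount': rng.uniform(0, 0.5, n),
    'profit_margin': rng.normal(0.2, 0.1, n),
    'ship_mode': rng.choice(['Standard', 'Express', 'Same Day'], n),
    'returned': rng.choice(['Yes', 'No'], n),
    'category': rng.choice(['Furniture', 'Technology'], n),
})

# One facet per (returned, ship_mode) pair, colored by category
g = sns.relplot(data=demo, x='discount', y='profit_margin',
                col='ship_mode', row='returned', hue='category',
                kind='scatter', height=3)
g.set_titles(col_template="{col_name}", row_template="Returned: {row_name}")
```

Here the legend is added automatically, so there is no separate `add_legend()` step.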

### Bar Chart 
A bar chart is another type of plot that is usually used to compare categorical data. Seaborn can automatically create a barplot with an estimated summary value by category.

For example, we can create a barplot comparing the mean `sales` based on product `category`.

In [None]:
# bar plot of mean sales by product category
ax = sns.barplot(data = orders, x = 'category', y='sales')
ax.set_title("Mean Sales Value by Product Category")
ax.set_xlabel('')
plt.show()

The little black lines at the top of each bar are error bars showing the confidence interval for the mean. We can turn them off with the argument `ci=None` (in Seaborn 0.12 and later, `ci` is deprecated in favor of `errorbar=None`).

In [None]:
# bar plot of mean sales by product category and return status
fig, ax = plt.subplots(figsize=(10,5))
# use hue to split each category into separate bars by return status
ax = sns.barplot(
    data = orders, 
    x = 'category', 
    y='sales', 
    hue='returned', 
    ci=None, 
    dodge=True)
ax.set_title("Mean Sales Value by Product Category")
ax.set_xlabel('')
ax.set_ylabel('')

plt.show()

### Specifying Colors
We can specify basic colors in matplotlib by name (or letter). ![image.png](attachment:image.png)


In Seaborn, the colors can be specified using the `palette=` kwarg. You can also use colormaps to represent quantitative data.
You can check the colormap names [here](https://matplotlib.org/stable/tutorials/colors/colormaps.html) 
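As a minimal sketch of the `palette=` argument, the example below maps each level of the `hue` variable to a named color. The `demo` table and its values are hypothetical, and the color names (`'tomato'`, `'steelblue'`) are just one possible choice.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-in for the orders table (made-up values)
demo = pd.DataFrame({
    'category': ['Furniture'] * 4 + ['Technology'] * 4,
    'returned': ['Yes', 'Yes', 'No', 'No'] * 2,
    'sales': [120.0, 150.0, 200.0, 260.0, 340.0, 300.0, 410.0, 450.0],
})

fig, ax = plt.subplots(figsize=(8, 5))
# A dict assigns a specific named color to each hue level
ax = sns.barplot(data=demo, x='category', y='sales', hue='returned',
                 palette={'Yes': 'tomato', 'No': 'steelblue'}, ax=ax)

# A named palette can also be requested and inspected as a list of RGB tuples
set2 = sns.color_palette('Set2', 3)
```

Passing a palette name as a string, such as `palette='Set2'`, works in the same place as the dict.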

**Estimator Values**

When we specify the categories that we want to plot, Seaborn calculates a single value for each group, depending on the `estimator=` argument.

The default estimator is the mean, but we can specify a different estimator such as
- `estimator=len` for the number of observations in the group
- `estimator=np.median` for the median value (using numpy)
- `estimator=sum` for the sum of all the values
- `estimator=max` for the highest value
- `estimator=min` for the lowest value



In [None]:
# bar plot of median sales value by sub category and return status
import numpy as np
fig, ax = plt.subplots(figsize=(20,8))
ax = sns.barplot(data = orders, 
                 x = 'sub_category', 
                 y='sales', 
                 hue='returned', 
                 ci=None, 
                 estimator=np.median)
ax.set_title("Median Sales Value")
plt.show()
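To check what an estimator is computing, we can calculate the same summaries with a Pandas `groupby`. The sketch below uses a tiny hypothetical table (`demo`, with invented categories `A` and `B`) so the numbers are easy to verify by hand.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Tiny hypothetical table so the summaries can be checked by hand
demo = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'B'],
    'sales': [10.0, 20.0, 90.0, 5.0, 15.0, 25.0],
})

# One row per category, one column per summary statistic
summary = demo.groupby('category')['sales'].agg(
    ['mean', 'median', 'sum', 'max', 'min', 'count'])
print(summary)

# The bar heights correspond to the 'median' column of the summary
ax = sns.barplot(data=demo, x='category', y='sales', estimator=np.median)
```

For category `A`, the single large value pulls the mean (40) well above the median (20), which is a common reason to plot the median instead.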

### Heatmaps with Seaborn

When a category has many levels, like the `sub_category` above, a heatmap is useful for comparing the values by color. 

A heatmap is a way of representing the data in a 2-dimensional form. The data values are represented as colors in the graph. 
The goal of the heatmap is to provide a colored visual summary of information. 

For example, we can calculate the mean sales for each sub category by day of the week. First we create a two-way table for the two dimensions, `sub_category` and `order_day_name`.

In [None]:
# Calculate mean sales for each group of sub category and day name
sales_by_subcategory= orders.groupby(['sub_category','order_day_name'])['sales'].mean()
data = sales_by_subcategory.unstack()
data

#### Reordering the Axis Labels

We can put the weekday names in order so that it's easier to identify any trends. Relabeling the ticks with `xticklabels` alone would not move the data, so we reorder the columns of the table itself before plotting.

In [None]:
# Reorder the columns so the weekdays run from Sunday to Saturday
day_order = ['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday']
data = data.reindex(columns=day_order)

# Create heatmap
fig, ax = plt.subplots(figsize=(12,9))
ax = sns.heatmap(data,
                 annot=True,
                 fmt=".1f",
                 cmap="Blues")
ax.set_title('Mean Sales by Sub Category and Day of Week')
plt.show()

### Showing Correlation

A correlation matrix can be used to investigate the correlation between numeric values in a DataFrame.

The Pandas `corr` method can generate the correlation matrix on the numeric variables.

Two variables can have
- Strong positive correlation: Pearson correlation coefficient close to 1
- Strong negative correlation: Pearson correlation coefficient close to -1
- Weak correlation: Pearson correlation coefficient close to 0

A value of 1 shows perfect correlation and this can be seen when a variable is compared against itself. 
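The three cases can be illustrated with a few made-up series. This is a small sketch using the Pandas `Series.corr` method, which computes the Pearson coefficient by default; the last lines recompute one coefficient by hand to show what the statistic measures.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
rising = 2 * x + 1                             # exact increasing line
falling = -3 * x + 10                          # exact decreasing line
weak = pd.Series([2.0, 5.0, 1.0, 4.0, 3.0])    # barely related to x

r_pos = x.corr(rising)     # perfect positive correlation
r_neg = x.corr(falling)    # perfect negative correlation
r_weak = x.corr(weak)      # close to 0

# Pearson r by hand: sum of co-deviations over the product of the spreads
dx, dy = x - x.mean(), weak - weak.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(r_pos, r_neg, r_weak, r_manual)
```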

In [None]:
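# Correlation matrix
# Keep only the numeric columns; recent Pandas versions raise an error on text columns
corr_mat = orders.select_dtypes(include='number').corr()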
corr_mat

### Heatmap for Correlation

A heatmap can be used to show the correlation visually where darker colors can be used to indicate stronger correlations.


In [None]:

# Create heatmap
fig, ax = plt.subplots(figsize=(12,9))
ax=sns.heatmap(corr_mat, 
               annot=True, 
               fmt="0.01f", 
               cmap="BrBG")
ax.set_title('Correlation Between Numeric Variables')
plt.show()
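With a diverging colormap such as `BrBG`, it can help to pin the color scale to the full correlation range so that 0 always falls on the neutral middle color. The sketch below assumes a small synthetic table (`demo`, a hypothetical stand-in for the numeric columns of `orders`) and uses the `vmin`, `vmax` and `center` parameters of `heatmap`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical numeric table: profit margin is built to fall as discount rises
rng = np.random.default_rng(2)
n = 200
discount = rng.uniform(0, 0.5, n)
demo = pd.DataFrame({
    'discount': discount,
    'profit_margin': 0.3 - 0.8 * discount + rng.normal(0, 0.05, n),
    'quantity': rng.integers(1, 10, n),
})
corr = demo.corr()

fig, ax = plt.subplots(figsize=(6, 5))
# Fix the scale to [-1, 1] and center it on 0 so both signs are comparable
ax = sns.heatmap(corr, annot=True, fmt='.2f', cmap='BrBG',
                 vmin=-1, vmax=1, center=0, ax=ax)
ax.set_title('Correlation Between Numeric Variables')
```

Without these arguments, the colors are scaled to the smallest and largest values in the matrix, so the same shade can mean different strengths in different plots.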


### Summary

Phew! That is quite a lot of plotting, but in summary:

[Seaborn](https://seaborn.pydata.org/tutorial.html) is integrated with matplotlib but provides an easier interface for grouping categorical variables.

There are many other types of plots that you can create in Seaborn, so do check out the documentation and explore the other types of visualizations.


### Review and Homework

For this homework, we will use the diamonds dataset to create the plots using Seaborn.


In [None]:
# Make sure you have the necessary imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read in the diamonds data set
diamonds = pd.read_csv('datasets/diamonds.csv')
diamonds.head()

Q1. Create a scatter plot using
- data: diamonds
- x-axis: depth
- y-axis: table
- color of markers: cut
- size of markers: price
- transparency: 0.5



In [None]:
#Q1 Answer Using Seaborn


Q2. Use facets to separate the plots by cut.


In [None]:
#Q2 Using Seaborn


Q3. Draw Bar Charts to compare the mean prices by Cut and Clarity


In [None]:
#Q3 Using Seaborn


Q4. Generate the correlation matrix to check the correlation between the numeric variables in the diamonds dataset.

In [None]:
# Q4 answer 


Q5. Use the correlation matrix in a heatmap to determine the variables which are most highly correlated

In [None]:
#Q5 Answer using seaborn


Go ahead and try the other plots available in Seaborn to explore the data!