# Plotting Distributions with Seaborn

Bar charts in Seaborn cannot visualize distributions; for that we use KDE, box, and violin plots.

## KDE Plots

A bar chart can plot the mean of our dataset, but it gives no indication of the data's distribution, e.g. whether the data is clustered around the mean or spread evenly across the entire range.
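
To make the limitation concrete, here is a minimal sketch with two made-up datasets (`clustered` and `spread_out` are illustrative names, not data used elsewhere in this document). Their means come out roughly equal, so a bar chart would draw two bars of about the same height, yet their spreads are very different.

```py
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Two hypothetical datasets with roughly the same mean but different spread
clustered = rng.normal(loc=50, scale=1, size=500)     # tightly clustered around the mean
spread_out = rng.uniform(low=0, high=100, size=500)   # spread evenly over the range

for name, values in [("clustered", clustered), ("spread_out", spread_out)]:
    print(f"{name}: mean={values.mean():.1f}, std={values.std():.1f}")
```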

To examine a dataset's distribution you could use a histogram, but a `KDE plot` is often a better option. With a histogram, you can draw wildly different conclusions about the shape of the data depending on how you group the data into bins and how wide the bins are. A KDE plot mitigates this issue: because it smooths the dataset, it lets us generalize over the shape of our data.
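
The effect of bin choice can be sketched with NumPy alone. The sample below is synthetic (an assumption for illustration): two clusters of equal size. With only two bins the counts should come out roughly equal, which hides the gap between the clusters; with more bins the dip between them becomes visible.

```py
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical bimodal sample: two clusters centred at 2 and 8
values = np.concatenate([rng.normal(2, 1, 250), rng.normal(8, 1, 250)])

# The same data, summarised with three different bin counts
for bins in (2, 5, 20):
    counts, edges = np.histogram(values, bins=bins)
    print(f"{bins:>2} bins -> counts: {counts.tolist()}")
```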

A KDE, or Kernel Density Estimator, plot shows the distribution of a univariate dataset as a curve. A univariate dataset has only one variable (it is also described as one-dimensional), as opposed to a bivariate or two-dimensional dataset, which has two variables.
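
As a rough illustration of where the curve comes from, the sketch below builds a Gaussian kernel density estimate by hand: it places one small bell curve on every data point and averages them. `gaussian_kde_1d` is a hypothetical helper written for this example, not a Seaborn function, and the bandwidth rule is an assumption; Seaborn's own defaults may differ.

```py
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(2)
samples = rng.normal(loc=0, scale=1, size=300)  # hypothetical univariate data

# Rule-of-thumb bandwidth (an assumption; other choices give smoother or rougher curves)
bandwidth = samples.std(ddof=1) * samples.size ** (-1 / 5)

grid = np.linspace(samples.min() - 3, samples.max() + 3, 400)
density = gaussian_kde_1d(samples, grid, bandwidth)

# The curve is a density, so the area under it should be close to 1
area = density.sum() * (grid[1] - grid[0])
print(f"bandwidth={bandwidth:.3f}, area under curve={area:.3f}")
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual points more closely, which is the KDE counterpart of choosing a histogram's bin width.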

To plot a KDE, use `sns.kdeplot()`. The examples here pass two arguments:
 - `data` the univariate dataset being visualized (a pandas DataFrame, Python list, or NumPy array)
 - `shade` a boolean that determines whether or not the space underneath the curve is shaded. Seaborn 0.11 and later call this argument `fill` and deprecate `shade`.

Plot each dataset separately, with one `sns.kdeplot()` call per dataset, rather than passing a combined DataFrame.

 ```py
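# Give each curve a label so that plt.legend() has entries to show
sns.kdeplot(dataset1, shade=True, label="dataset1")
sns.kdeplot(dataset2, shade=True, label="dataset2")
sns.kdeplot(dataset3, shade=True, label="dataset3")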
plt.legend()
plt.show()
 ```
![KDE plot](img/kde-plot.png)

The plot demonstrates:
 - Dataset 1 is skewed left
 - Dataset 2 is normally distributed
 - Dataset 3 is bimodal (it has two peaks)

#### Example

```py
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# Take in the data from the CSVs as NumPy arrays:
set_one = np.genfromtxt("dataset1.csv", delimiter=",")
set_two = np.genfromtxt("dataset2.csv", delimiter=",")
set_three = np.genfromtxt("dataset3.csv", delimiter=",")
set_four = np.genfromtxt("dataset4.csv", delimiter=",")

# Creating a Pandas DataFrame:
n=500
df = pd.DataFrame({
    "label": ["set_one"] * n + ["set_two"] * n + ["set_three"] * n + ["set_four"] * n,
    "value": np.concatenate([set_one, set_two, set_three, set_four])
})

# Setting styles:
sns.set_style("darkgrid")
sns.set_palette("pastel")

# Add your code below:
sns.kdeplot(set_one, shade=True)
sns.kdeplot(set_two, shade=True)
sns.kdeplot(set_three, shade=True)
sns.kdeplot(set_four, shade=True)

plt.show()
```

## Box Plots

A `Box plot` can tell us how our dataset is distributed, like a KDE plot. Unlike a KDE plot, it also shows the range of our dataset, gives us an idea of where a significant portion of our data lies, and reveals whether any outliers are present.

To create a box plot, use `sns.boxplot()`, which takes three main arguments:
 - `data` the dataset: a DataFrame, Python list, or NumPy array
 - `x` a one-dimensional set of values, such as a Series, list, or array
 - `y` a second set of one-dimensional data

If you pass pandas Series as the `x` and `y` values, Seaborn also uses each Series' name as the corresponding axis label.
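
A small sketch of the labelling behaviour, using two made-up Series (the names `group` and `score` are illustrative assumptions):

```py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(4)

# Hypothetical named Series; the names become the axis labels
groups = pd.Series(["a"] * 50 + ["b"] * 50, name="group")
scores = pd.Series(rng.normal(loc=0, scale=1, size=100), name="score")

fig, ax = plt.subplots()
sns.boxplot(x=groups, y=scores, ax=ax)

# Inspect the labels Seaborn picked up from the Series names
print(ax.get_xlabel(), ax.get_ylabel())
```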

#### Interpreting a Box Plot

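- The **box** represents the interquartile range: its edges are the first and third quartiles
- The **line in the middle** of the box is the median
- The **end lines** (whiskers) mark the most extreme values that lie within 1.5 times the interquartile range of the box edges (Seaborn's default)
- The **diamonds** show outliers, the points that fall beyond the end lines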

![Box Plot](img/box-plot-2.png)
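
The numbers behind each part of the drawing can be computed directly. This is a sketch on a small made-up sample, assuming the common rule that whiskers reach at most 1.5 times the interquartile range past the box:

```py
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # small made-up sample

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Whisker rule (assumed here): reach at most 1.5 * IQR past the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual outlier marker
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers.tolist()}")
```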

To create a box plot of the 3 datasets from above:

```py
sns.boxplot(data=df, x='label', y='value')
plt.show()
```
A box plot looks like so:

![Box plot](img/box-plot.png)

Like the KDE plot, it shows that dataset one is skewed to the left and dataset two is uniform. Unlike the KDE plot, it does not show that dataset three is bimodal.
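
Why a box plot cannot reveal two peaks can be sketched with a synthetic bimodal sample (an assumption for illustration, not dataset three itself). The quartiles look unremarkable, yet very few points actually sit near the median, and that is the information a box plot does not draw.

```py
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical bimodal sample with peaks near -3 and 3
bimodal = np.concatenate([rng.normal(-3, 1, 250), rng.normal(3, 1, 250)])

# What a box plot draws: the three quartiles
q1, median, q3 = np.percentile(bimodal, [25, 50, 75])
print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}")

# What it leaves out: how many points sit near the median versus near the peaks
near_median = np.sum(np.abs(bimodal - median) < 0.5)
near_peaks = np.sum(np.abs(np.abs(bimodal) - 3) < 0.5)
print(f"points within 0.5 of the median: {near_median}")
print(f"points within 0.5 of either peak: {near_peaks}")
```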

#### Example

```py
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# Take in the data from the CSVs as NumPy arrays:
set_one = np.genfromtxt("dataset1.csv", delimiter=",")
set_two = np.genfromtxt("dataset2.csv", delimiter=",")
set_three = np.genfromtxt("dataset3.csv", delimiter=",")
set_four = np.genfromtxt("dataset4.csv", delimiter=",")

# Creating a Pandas DataFrame:
n=500
df = pd.DataFrame({
    "label": ["set_one"] * n + ["set_two"] * n + ["set_three"] * n + ["set_four"] * n,
    "value": np.concatenate([set_one, set_two, set_three, set_four])
})

# Setting styles:
sns.set_style("darkgrid")
sns.set_palette("pastel")

# Add your code below:
sns.boxplot(data=df, x='label', y='value')
plt.show()
```

## Violin Plots

Violin plots show a dataset's distribution, like a KDE plot, together with the median and interquartile range, like a box plot. In effect they combine the two, and they let you compare multiple distributions at once.

To create a violin plot, use `sns.violinplot()`, which takes at least three arguments:
 - `data` the dataset: a DataFrame, Python list, or NumPy array
 - `x` a one-dimensional set of values, such as a Series, list, or array
 - `y` a second set of one-dimensional data values (Series, list, or array)

It can also take:
 - `hue` a one-dimensional set of values (Series, list, or array) used to split and color the violins by group
 - many of the parameters accepted by `sns.boxplot()`, such as `order` and `palette`
 
 ```py
sns.violinplot(data=df, x="label", y="value")
plt.show()
 ```
 
A violin plot of the 3 datasets from above shows that dataset 1 is skewed to the left, dataset 2 is uniform, and dataset 3 is bimodal:
![Violin plot](img/violin-plot.png)
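
The `hue` parameter is not used in the examples in this document, so here is a minimal sketch of it. The DataFrame is synthetic, and the `group` column with its `control` and `treated` values is a hypothetical addition for illustration.

```py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(6)
n = 100

# Synthetic long-form data (an assumption): two labels, each split into two groups
df_hue = pd.DataFrame({
    "label": np.repeat(["set_a", "set_b"], 2 * n),
    "group": np.tile(np.repeat(["control", "treated"], n), 2),
    "value": np.concatenate([
        rng.normal(0, 1, n), rng.normal(1, 1, n),  # set_a: control, treated
        rng.normal(0, 2, n), rng.normal(3, 1, n),  # set_b: control, treated
    ]),
})

fig, ax = plt.subplots()
# One violin per (label, group) pair; violins that share a label sit side by side
sns.violinplot(data=df_hue, x="label", y="value", hue="group", ax=ax)
```

Call `plt.show()` afterwards to display the figure.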


#### Interpreting a Violin Plot

 - Each violin is symmetrical about its center line; the outline on either side is the KDE of the data.
 - The white dot represents the median.
 - The thick black line in the center of each violin represents the interquartile range.
 - The thin lines that extend from the center work like the end lines of a box plot: by default they reach the most extreme values within 1.5 times the interquartile range.
 
 ![Violin plot](img/violin-plot-2.png)

#### Example

```py
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# Take in the data from the CSVs as NumPy arrays:
set_one = np.genfromtxt("dataset1.csv", delimiter=",")
set_two = np.genfromtxt("dataset2.csv", delimiter=",")
set_three = np.genfromtxt("dataset3.csv", delimiter=",")
set_four = np.genfromtxt("dataset4.csv", delimiter=",")

# Creating a Pandas DataFrame:
n=500
df = pd.DataFrame({
    "label": ["set_one"] * n + ["set_two"] * n + ["set_three"] * n + ["set_four"] * n,
    "value": np.concatenate([set_one, set_two, set_three, set_four])
})

# Setting styles:
sns.set_style("darkgrid")
sns.set_palette("pastel")

# Add your code below:
sns.violinplot(data=df, x="label", y="value")
plt.show()
```

## Summary

Seaborn's strength is in visualizing statistical calculations. Seaborn includes several plots that allow you to graph univariate distributions, including KDE plots, box plots, and violin plots.

Read the datasets, concatenate them using numpy and convert them to a dataframe.

```py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dataset1 = np.genfromtxt("dataset1.csv", delimiter=",")
dataset2 = np.genfromtxt("dataset2.csv", delimiter=",")
dataset3 = np.genfromtxt("dataset3.csv", delimiter=",")

n=500
df = pd.DataFrame({
    "label": ["set_one"] * n + ["set_two"] * n + ["set_three"] * n,
    "value": np.concatenate([dataset1, dataset2, dataset3])
})

sns.set()
```
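
The hard-coded `n=500` assumes every CSV holds exactly 500 values. A possible variation, sketched below with generated stand-in arrays (an assumption, since the CSV files are not included here), derives the label counts from the array lengths instead.

```py
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Stand-ins for the arrays loaded from the CSV files; lengths differ on purpose
datasets = {
    "set_one": rng.normal(size=500),
    "set_two": rng.normal(size=480),
    "set_three": rng.normal(size=510),
}

df_any_length = pd.DataFrame({
    # Repeat each label once per value, so no fixed n is needed
    "label": np.repeat(list(datasets), [len(v) for v in datasets.values()]),
    "value": np.concatenate(list(datasets.values())),
})

print(df_any_length.groupby("label", sort=False).size())
```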

Plot the datasets as a bar chart, which shows only the mean of each one:

```py
sns.barplot(data=df, x='label', y='value')
plt.show()
```
![bar chart](img/bar-chart.png)


To see how the data is distributed (how spread out it is), use a `KDE` plot.

```py
sns.kdeplot(dataset1, shade=True, label="dataset1")
sns.kdeplot(dataset2, shade=True, label="dataset2")
sns.kdeplot(dataset3, shade=True, label="dataset3")

plt.legend()
plt.show()
```
![KDE plot](img/kde-plot.png)

This particular plot is difficult to interpret, so we can instead generate a `Box Plot`, which makes it easier to compare distributions and to identify outliers and the interquartile range.

```py
sns.boxplot(data=df, x='label', y='value')
plt.show()
```
![Box plot](img/box-plot.png)


To see the 'shape' of the data, create a `violin plot`.

```py
sns.violinplot(data=df, x="label", y="value")
plt.show()
```
![Violin plot](img/violin-plot.png)