# Seaborn basics

Seaborn is a Python library based on `Matplotlib` that allows you to create charts for statistical analysis.

It improves upon `Matplotlib` in several ways: it natively supports `Pandas` and can summarize large `Pandas DataFrames` into aggregated charts. One of the most important tasks that Seaborn automates for us is calculating aggregate statistics over large datasets.

To import `Seaborn`, add `import seaborn as sns` at the top of your script. You'll need to import `Pandas` as well: `import pandas as pd`. To display the charts, also import `pyplot`: `from matplotlib import pyplot as plt`.

If you were to plot a satisfaction survey, 'satisfaction vs gender' using `Matplotlib`, you would do the following:

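```py
import pandas as pd
from matplotlib import pyplot as plt

df = pd.read_csv("results.csv")
ax = plt.subplot()
# Draw one bar per row, positioned at 0, 1, 2, ...
plt.bar(range(len(df)),
        df["Mean Satisfaction"])
# Replace the numeric tick positions with the gender labels
ax.set_xticks(range(len(df)))
ax.set_xticklabels(df.Gender)
plt.xlabel("Gender")
plt.ylabel("Mean Satisfaction")
plt.show()
```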

This task is much simpler with the `Seaborn` function `barplot()`. It takes at least three arguments:
 - `data` - a dataframe
 - `x` - a string indicating the column in the dataframe that contains the labels identifying the bars
 - `y` - a string indicating the column in the dataframe that has the heights of each bar
 
 
```py
sns.barplot(data=<name of DataFrame>,
  x=<column for x-data>,
  y=<column for y-data>)
```
 
By default, Seaborn will aggregate and plot the mean of each category.

#### Example

```py
# results.csv
	Gender	Mean Satisfaction
0	Male	7.2
1	Female	8.1
2	Non-binary	6.8
```

```py
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# Load results.csv here:
df = pd.read_csv('results.csv')
print(df)

sns.barplot(
    data=df,
    x='Gender',
    y='Mean Satisfaction'
)
plt.show()
```

Seaborn can also calculate aggregate statistics (mean, mode, median, standard deviation, etc.) on datasets. By default it will plot the `mean` of our data.

Using `Numpy` you could calculate the mean grade for Assignment 1 like so:

```py
import pandas as pd
import numpy as np

gradebook = pd.read_csv("gradebook.csv")
assignment1 = gradebook[gradebook.assignment_name == 'Assignment 1'] # grab all assignment 1 rows
asn1_mean = np.mean(assignment1.grade) # calculate the mean grade
print(asn1_mean) # 88.0
```

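With `Seaborn`, `barplot()` calculates and plots the mean grade of every assignment in a single call. Suppose the `gradebook` DataFrame looks like this: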
 
 ```py
 # first 5 rows of gradebook dataframe
 	student	assignment_name	grade
0	Amy	Assignment 1	75
1	Amy	Assignment 2	82
2	Bob	Assignment 1	99
3	Bob	Assignment 2	90
4	Chris	Assignment 1	72
 ```
 
```py
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

gradebook = pd.read_csv("gradebook.csv")
sns.barplot(data=gradebook, x="assignment_name", y="grade")
plt.show()
```
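To see which numbers end up as bar heights, the same aggregation can be sketched with a plain Pandas `groupby`. This sketch assumes only the five gradebook rows shown above (the full `gradebook.csv` is not reproduced here), so the means apply to those rows only.

```python
import pandas as pd

# Only the five gradebook rows shown above; the full file is not reproduced here.
gradebook = pd.DataFrame({
    "student": ["Amy", "Amy", "Bob", "Bob", "Chris"],
    "assignment_name": ["Assignment 1", "Assignment 2",
                        "Assignment 1", "Assignment 2", "Assignment 1"],
    "grade": [75, 82, 99, 90, 72],
})

# One mean per category: the values barplot() would draw as bar heights.
mean_grades = gradebook.groupby("assignment_name")["grade"].mean()
print(mean_grades)
```

For these five rows the result is 82.0 for Assignment 1 and 86.0 for Assignment 2.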

By default, Seaborn will place error bars on each bar when you use the `barplot()` function. These are a visual indication of the range of values that might be expected.

By default, Seaborn draws a bootstrapped `95% confidence interval`. Loosely, this interval means that "based on the data, you can be 95% confident that the true mean lies within the range indicated by the error bar". Seaborn estimates it by bootstrapping: it repeatedly resamples the data with replacement and recalculates the statistic for each resample.
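The idea behind a bootstrapped interval can be sketched with `Numpy` alone. This is an illustration of the technique, not Seaborn's actual implementation; the grades, the seed, and the number of resamples are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Hypothetical grades for one assignment (not taken from gradebook.csv).
grades = np.array([75, 99, 72, 88, 91, 64, 85, 79])

n_boot = 1000  # number of resamples; an illustrative choice

# Resample the data with replacement and record the mean of each resample.
boot_means = np.array([
    rng.choice(grades, size=len(grades), replace=True).mean()
    for _ in range(n_boot)
])

# The middle 95% of the resampled means gives the interval.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={grades.mean():.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

A wider interval indicates that the mean is less certain, which typically happens with few data points or widely spread values.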

This confidence interval is calculated for `median` and `mode` as well as `mean`.

If you're calculating a mean and would prefer to use standard deviation for your error bars, you can pass the keyword argument `ci="sd"` to `sns.barplot()`, so that each error bar represents one standard deviation. In Seaborn 0.12 and later, `ci` is deprecated in favor of `errorbar="sd"`.

```py
sns.barplot(data=gradebook, x="assignment_name", y="grade", ci="sd")
```
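As a rough sketch of what a one-standard-deviation error bar spans, the mean and standard deviation per category can be computed with Pandas. The data is again limited to the five gradebook rows shown earlier. Note that Pandas' `std` is the sample standard deviation (`ddof=1`); whether Seaborn uses the same convention may depend on the version, so treat the exact numbers as an approximation.

```python
import pandas as pd

# Only the five gradebook rows shown earlier in this document.
gradebook = pd.DataFrame({
    "student": ["Amy", "Amy", "Bob", "Bob", "Chris"],
    "assignment_name": ["Assignment 1", "Assignment 2",
                        "Assignment 1", "Assignment 2", "Assignment 1"],
    "grade": [75, 82, 99, 90, 72],
})

# Mean and sample standard deviation for each assignment.
summary = gradebook.groupby("assignment_name")["grade"].agg(["mean", "std"])

# An error bar of one standard deviation runs from mean - std to mean + std.
summary["lower"] = summary["mean"] - summary["std"]
summary["upper"] = summary["mean"] + summary["std"]
print(summary)
```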

We can also plot the `median`, which is a better summary when the dataset has many outliers. If our data is categorical, we might want to count how many times each category appears, e.g. in the case of survey responses.

To calculate other aggregates, use the `estimator` keyword, which accepts any function that takes a list of values and returns a single number.

```py
sns.barplot(data=df,
  x="x-values",
  y="y-values",
  estimator=np.median)
```
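The following sketch shows why the median can be the better choice when outliers are present. The `group` and `value` columns and their contents are hypothetical. The bar heights are read back from the returned Axes, and a non-interactive backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: group "B" contains one extreme outlier.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "value": [10, 12, 14, 10, 12, 500],
})

ax = sns.barplot(data=df, x="group", y="value", estimator=np.median)

# Read the drawn bar heights back from the Axes.
heights = [bar.get_height() for container in ax.containers for bar in container]
print("median bars:", heights)

# For comparison, the means that barplot() would plot by default.
print("means:", df.groupby("group")["value"].mean().tolist())
```

Both median bars have a height of 12, while the mean of group "B" is pulled up to 174 by the single outlier.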

To count values instead, we pass in `len`. The height of each bar is then the number of `Response` rows recorded for each `Patient ID`:

```py
sns.barplot(data=df,
  x="Patient ID",
  y="Response",
  estimator=len)
```
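What `estimator=len` counts can be checked with Pandas. This is a minimal sketch on hypothetical survey data; only the column names are taken from the call above.

```python
import pandas as pd

# Hypothetical survey data; the column names mirror the call above.
df = pd.DataFrame({
    "Patient ID": [1, 1, 1, 2, 2, 3],
    "Response": ["Yes", "No", "Yes", "Yes", "Yes", "No"],
})

# Rows per Patient ID: what estimator=len computes for each bar.
rows_per_patient = df.groupby("Patient ID")["Response"].agg(len)
print(rows_per_patient)

# Occurrences of each distinct Response value.
response_counts = df["Response"].value_counts()
print(response_counts)
```

If the goal is one bar per distinct response, Seaborn's `countplot()` (for example `sns.countplot(data=df, x="Response")`) may be a more direct fit.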

### Using hue keyword

Seaborn allows you to aggregate by multiple columns to visualize nested categorical variables, e.g. the responses to a customer satisfaction survey might depend on `Gender` as well as `Age Range`. We can compare both columns using the `hue` keyword.

```py
sns.barplot(data=df,
            x="Gender",
            y="Response",
            hue="Age Range")
```
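With `hue`, each bar corresponds to one combination of the two categories. The equivalent aggregation can be sketched in Pandas; the survey rows below are hypothetical and only the column names come from the call above.

```python
import pandas as pd

# Hypothetical survey responses.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Age Range": ["18-30", "31-50", "18-30", "31-50", "18-30", "18-30"],
    "Response": [6, 8, 7, 9, 8, 5],
})

# One mean per (Gender, Age Range) pair: rows are the x categories,
# columns are the hue levels.
nested_means = df.groupby(["Gender", "Age Range"])["Response"].mean().unstack()
print(nested_means)
```

Each row of `nested_means` corresponds to a cluster of bars on the x-axis, and each column to one color in the legend.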

Example:

Given a dataset on the number of licks it takes people to get to the center of a Tootsie Pop, with columns for each person's age range and gender, the following call uses `hue` to create the chart below:

```py
sns.barplot(x="Age Range",
  y="Number of Licks",
  hue="Gender",
  data=df)
```

![bar-plot](img/bar-plot.png)

### Summary

1. Import the following libraries

```py
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np # optional, required to calculate some aggregates
```

2. Ingest data from a CSV file into a Pandas DataFrame.

```py
df = pd.read_csv('file_name.csv')
```

3. Call `sns.barplot()` with the desired values for `x` and `y`, and set `data` to your DataFrame.

```py
sns.barplot(data=df, x='X-Values', y='Y-Values')
```

4. Optionally, set the `estimator` and `hue` parameters.

```py
sns.barplot(data=df, x='X-Values', y='Y-Values', estimator=len, hue='Value')
```

5. Render the plot.

```py
plt.show()
```