<a href="https://colab.research.google.com/github/NIP-Data-Computation/show-and-tell/blob/master/piercel_week3_notes1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Author**: Pierce Lopez <br>
**Date Created**: August 18, 2020 <br>
**Last Updated**: August 18, 2020 <br> 
**Description**: Contains my notes on the Data Analyst lesson: _Introduction to Data Visualization with Seaborn_.

# Introduction to Data Visualization with Seaborn
For these notes, we will use `pandas` and `matplotlib.pyplot` alongside Seaborn, so do not forget to import the necessary modules!

```
# import modules
import pandas as pd
import matplotlib.pyplot as plt
```

## Chapter 1: Introduction to Seaborn

### Section 1: Introduction to Seaborn

1. Advantages of Seaborn
* Easy to use!
* Works well with `pandas`.
* Built on top of `matplotlib`.

2. Getting Started

```
import seaborn as sns
```

3. Example: Scatter plot

```
# sample data
data1 = [1, 2, 3]
data2 = [10, 20, 30]

# sample scatter plot
sns.scatterplot(x = data1,
                y = data2)

# display plot
plt.show()
```

4. Example: Count plot

```
# sample data
data = ["a", "b", "c", "a", "a", "c"]

# sample count plot
sns.countplot(x = data)

# display plot
plt.show()
```
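
As a quick cross-check, the bar heights in the count plot above should match a plain frequency count. The sketch below uses `pandas` `value_counts()` on the same sample list; the variable name `counts` is only illustrative.

```python
import pandas as pd

# same sample data as the count plot example
data = ["a", "b", "c", "a", "a", "c"]

# count how many times each value appears
counts = pd.Series(data).value_counts()

print(counts)
```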

<br>

### Section 2: Using `pandas` with Seaborn

1. Working with DataFrames

```
# reading as csv file
df = pd.read_csv(filename)
```

2. Using DataFrames with `countplot()`

```
# reading as csv file
df = pd.read_csv(filename)

# sample count plot
sns.countplot(x = "df_colname",
              data = df)

# display plot
plt.show()
```

**Note:** Seaborn works well with `pandas` DataFrames, but only if the DataFrames are _tidy_. _Tidy_ DataFrames imply that <ins>each observation has its own row and each variable has its own column</ins>.
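
If a DataFrame is not tidy, it can often be reshaped before plotting. The following is a minimal sketch, assuming a made-up "wide" table with one column per quiz; the names `wide`, `tidy`, `student`, `quiz`, and `score` are hypothetical and only serve the illustration.

```python
import pandas as pd

# hypothetical "wide" table: one row per student, one column per quiz
wide = pd.DataFrame({"student": ["Ana", "Ben"],
                     "quiz1": [8, 6],
                     "quiz2": [9, 7]})

# reshape so that each row holds one observation (student, quiz, score)
tidy = wide.melt(id_vars = "student",
                 var_name = "quiz",
                 value_name = "score")

print(tidy)
```

After reshaping, `quiz` can be passed to arguments such as `x` or `hue`, and `score` to `y`.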

<br>

### Section 3: Adding a third variable with hue

1. Loading Datasets

```
# load dataset
ds = sns.load_dataset("dataset_name")
```

2. Setting Hue Order in a Scatter Plot

**Assume:** ds_colname3's column values are "a" and "b"
```
# sample scatter plot
sns.scatterplot(x = "ds_colname1", 
                y = "ds_colname2", 
                data = ds,
                hue = "ds_colname3",
                hue_order = ["a", "b"])

# display plot
plt.show()
```

3. Specifying Hue Colors

```
# setting hue colors
hue_colors = {"a":"red", "b":"blue"}

# sample scatter plot
sns.scatterplot(x = "ds_colname1", 
                y = "ds_colname2", 
                data = ds,
                hue = "ds_colname3",
                hue_order = ["a", "b"],
                palette = hue_colors)

# display plot
plt.show()
```

4. Using Hue with Count Plots

```
# sample count plot
sns.countplot(x = "ds_colname1",
              data = ds,
              hue = "ds_colname2")

# display plot
plt.show()
```

<br>

## Chapter 2: Visualizing Two Quantitative Variables

### Section 1: Introduction to relational plots and subplots
1. Questions about quantitative variables
  * Relational plots can be made from:
    * Height vs. weight
    * Number of absences vs. final grade
    * GDP vs. percent literate

**Recall:** In the previous chapter, we created subgroups using the `hue` argument.

2. Introducing `relplot()`
  * `relplot()` creates relational plots for different subgroups.
    * can be scatter or line plots
    * can create subplots for subgroups in a single figure

3. Using `relplot()`

```
sns.relplot(x = "ds_colname1",
            y = "ds_colname2",
            data = ds,
            kind = "scatter")

# display plot
plt.show()
```

4. Subplots in Columns

The `col` argument names the dataset column whose values define the subplots: one subplot is created per value, arranged horizontally (side by side).

The `row` argument works the same way, but the subplots are arranged vertically (stacked).

**Note:** We can combine the `col` and `row` arguments!

```
sns.relplot(x = "ds_colname1",
            y = "ds_colname2",
            data = ds,
            kind = "scatter",
            col = "ds_colname3")

# display plot
plt.show()
```
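
To illustrate combining `col` and `row`, here is a small self-contained sketch. The DataFrame `demo` and its columns (`x`, `y`, `group`, `period`) are made up for this example and are not from the lesson's datasets.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# hypothetical tidy data: two grouping columns with two values each
demo = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8],
                     "y": [2, 4, 5, 4, 6, 7, 8, 9],
                     "group": ["a", "a", "b", "b", "a", "a", "b", "b"],
                     "period": ["am", "am", "am", "am", "pm", "pm", "pm", "pm"]})

# one subplot per (period, group) combination
grid = sns.relplot(x = "x",
                   y = "y",
                   data = demo,
                   kind = "scatter",
                   col = "group",
                   row = "period")

# the subplots are stored in a 2D array: rows by columns
print(grid.axes.shape)

# display plot
plt.show()
```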

5. Wrapping Columns

`col_wrap` sets the maximum number of subplots per row; any additional subplots wrap onto a new row.

```
sns.relplot(x = "ds_colname1",
            y = "ds_colname2",
            data = ds,
            kind = "scatter",
            col = "ds_colname3",
            col_wrap = 3)

# display plot
plt.show()
```
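
As a rough check of how wrapping behaves, the sketch below uses a made-up DataFrame (`demo`, with hypothetical columns `x`, `y`, `group`) that has five subgroups. With `col_wrap = 3`, the five subplots should be laid out as three in the first row and two in the second.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# hypothetical data: five subgroups ("a" to "e") with two points each
demo = pd.DataFrame({"x": [1, 2] * 5,
                     "y": [1, 3, 2, 4, 3, 5, 4, 6, 5, 7],
                     "group": ["a", "a", "b", "b", "c",
                               "c", "d", "d", "e", "e"]})

g = sns.relplot(x = "x",
                y = "y",
                data = demo,
                kind = "scatter",
                col = "group",
                col_wrap = 3)

# one subplot per subgroup, filled row by row
print(len(g.axes.flat))

# display plot
plt.show()
```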

6. Ordering Columns

`col_order` dictates the order of the subgroups.

```
sns.relplot(x = "ds_colname1",
            y = "ds_colname2",
            data = ds,
            kind = "scatter",
            col = "ds_colname3",
            col_wrap = 3,
            # list the subgroup values in the preferred order (placeholder values)
            col_order = ["value1", "value2", "value3"])

# display plot
plt.show()
```

<br>


### Section 2: Customizing scatter plots
1. Subgroups with Point Size

```
sns.relplot(x = "ds_colname1",
            y = "ds_colname2",
            data = ds,
            kind = "scatter",
            size = "ds_colname3")

# display plot
plt.show()
```

2. Point Size and Hue

```
sns.relplot(x = "ds_colname1",
            y = "ds_colname2",
            data = ds,
            kind = "scatter",
            size = "ds_colname3",
            hue = "ds_colname3")

# display plot
plt.show()
```

3. Subgroups with Point Style

```
sns.relplot(x = "ds_colname1",
            y = "ds_colname2",
            data = ds,
            kind = "scatter",
            style = "ds_colname3",
            hue = "ds_colname3")

# display plot
plt.show()
```

4. Changing Point Transparency

```
sns.relplot(x = "ds_colname1",
            y = "ds_colname2",
            data = ds,
            kind = "scatter",
            alpha = 0.4)

# display plot
plt.show()
```

<br>


### Section 3: Introduction to line plots

0. Prelude

* In a scatter plot, each point is an independent observation of the two variables. In a line plot, each point represents the same _thing_, typically tracked through time (a development of the point).

1. Line Plot

```
sns.relplot(x = "ds_colname1",
            y = "ds_colname2",
            data = ds,
            kind = "line")

# display plot
plt.show()
```

2. Subgroups by a Column Variable

```
sns.relplot(x = "ds_colname1",
            y = "ds_colname2",
            data = ds,
            kind = "line",
            style = "ds_colname3",
            hue = "ds_colname3")

# display plot
plt.show()
```

3. Adding Markers and Removing Line Styles

```
sns.relplot(x = "ds_colname1",
            y = "ds_colname2",
            data = ds,
            kind = "line",
            style = "ds_colname3",
            hue = "ds_colname3",
            markers = True,
            dashes = False)

# display plot
plt.show()
```

4. Multiple Observations per x-Value

If a line plot has multiple observations at a certain x-value, those y-values will be aggregated into a single summary measure (mean by default). Seaborn also automatically calculates the _confidence interval_ (shaded region) for the mean.

**Note:** The confidence interval indicates how uncertain we are about the estimate: we can be 95% confident that the true mean lies within the shaded range.
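
To make the aggregation concrete, here is a rough sketch of the idea for a single x-value, using `numpy` only. It assumes made-up measurements (`y_values`) and a simple bootstrap with 1000 resamples; it illustrates the concept and is not meant to reproduce Seaborn's exact internals.

```python
import numpy as np

# hypothetical repeated measurements of y at a single x-value
y_values = np.array([4.0, 5.0, 6.0, 5.5, 4.5, 5.0])

# the line passes through the mean of the repeated observations
mean_y = y_values.mean()

# rough bootstrap: resample with replacement and recompute the mean each time
rng = np.random.default_rng(0)
boot_means = np.array([rng.choice(y_values,
                                  size = y_values.size,
                                  replace = True).mean()
                       for _ in range(1000)])

# the middle 95% of the bootstrap means approximates the shaded band
low, high = np.percentile(boot_means, [2.5, 97.5])

print(mean_y, low, high)
```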

5. Replacing Confidence Interval with Standard Deviation

```
sns.relplot(x = "ds_colname1",
            y = "ds_colname2",
            data = ds,
            kind = "line",
            ci = "sd")

# display plot
plt.show()
```

<br>

## Chapter 3: Visualizing a Categorical and a Quantitative Variable

### Section 1: Count plots and bar plots
1. Categorical Plots
* Count plots, bar plots, and box plots are considered such.
* These plots are used when:
   * we want to make comparisons between different groups, and
   * we have categorical variables (variables that consist of a fixed, typically small, number of possible values).

2. `catplot()`
* Used to make categorical plots.
* Like `relplot()`, it can create subplots using the `col` and `row` arguments.

```
# sample countplot using catplot()
sns.catplot(x = "df_colname",
            data = df,
            kind = "count")

# display plot
plt.show()
```

3. Changing the Order

```
# introduce preferred order
category_order = ["value1", 
                  "value2", 
                  "value3"]

# sample countplot using catplot()
sns.catplot(x = "df_colname",
            data = df,
            kind = "count",
            order = category_order)

# display plot
plt.show()
```

4. Bar Plots
* Displays mean of quantitative variable per category.

```
# sample bar plot using catplot()
sns.catplot(x = "df_colname1",
            y = "df_colname2",
            data = df,
            kind = "bar")

# display plot
plt.show()
```

**Note:** The barplots will also show 95% confidence intervals in each mean.
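
The bar heights can be checked by computing the per-category means directly. This is a minimal sketch with a made-up DataFrame (`meals`, with hypothetical columns `day` and `bill`), using a `pandas` group-by.

```python
import pandas as pd

# hypothetical data: a quantitative variable (bill) per category (day)
meals = pd.DataFrame({"day": ["Thu", "Thu", "Fri", "Fri"],
                      "bill": [10.0, 20.0, 30.0, 50.0]})

# mean of the quantitative variable per category: the bar heights
means = meals.groupby("day")["bill"].mean()

print(means)
```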

5. Turning Off Confidence Intervals

```
# sample bar plot using catplot()
sns.catplot(x = "df_colname1",
            y = "df_colname2",
            data = df,
            kind = "bar",
            ci = None)

# display plot
plt.show()
```

6. Changing the Orientation (**common practice**)

```
# sample horizontal bar plot using catplot(): swap x and y
sns.catplot(x = "df_colname2",
            y = "df_colname1",
            data = df,
            kind = "bar",
            ci = None)

# display plot
plt.show()
```

<br>

### Section 2: Box plots
1. What is a Box Plot?
* This plot shows the distribution of quantitative data.
  * Box region: 25th to 75th percentile
  * Line inside the box region: median
  * Whiskers: spread of the data distribution
  * Points: outliers

2. Creating a Box Plot

```
# sample boxplot using catplot()
sns.catplot(x = "df_colname1",
            y = "df_colname2",
            data = df,
            kind = "box")

# display plot
plt.show()
```

3. Change the Order of Categories

```
# sample boxplot using catplot()
sns.catplot(x = "df_colname1",
            y = "df_colname2",
            data = df,
            kind = "box",
            order = ["Dinner",
                     "Lunch"])

# display plot
plt.show()
```

4. Omit Outliers Using `sym`

```
# sample boxplot using catplot()
sns.catplot(x = "df_colname1",
            y = "df_colname2",
            data = df,
            kind = "box",
            sym = "")

# display plot
plt.show()
```

5. Change Whiskers Using `whis`
* By default, whiskers extend to 1.5 times the interquartile range (IQR: length of the box region). The `whis` parameter's default value is 1.5.

* We can change the `whis` value in the following ways:
  * `whis = numeric factor`: whiskers will extend to the numeric factor times the IQR.
  * `whis = [5, 95]`: whiskers will extend from the 5th percentile to the 95th percentile.
  * `whis = [0, 100]`: whiskers will extend from the minimum to the maximum value.

```
# sample boxplot using catplot()
sns.catplot(x = "df_colname1",
            y = "df_colname2",
            data = df,
            kind = "box",
            whis = 2.0)

# display plot
plt.show()
```
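
To see what the numeric `whis` factor means, the limits can be worked out by hand. The sketch below assumes a made-up sample (`values`) and the default factor of 1.5; points beyond the limits are the ones drawn as outliers.

```python
import numpy as np

# hypothetical sample with one unusually large value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30])

# box region: 25th to 75th percentile
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# limits at whis times the IQR beyond the box edges
whis = 1.5
lower_limit = q1 - whis * iqr
upper_limit = q3 + whis * iqr

# whiskers stop at the most extreme observations inside the limits
inside = values[(values >= lower_limit) & (values <= upper_limit)]
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(inside.min(), inside.max(), outliers)
```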
<br>

### Section 3: Point plots
1. What are Point Plots?
* Plot points in point plots show the mean with the 95% confidence interval (by default).

2. Difference from Line Plots
* Line plots have quantitative variables on both axes.
* Point plots have a categorical variable on an axis.

3. Point Plots vs. Bar Plots
* It is easier to compare heights of subgroup points when they are stacked above each other (point plots)
* It is also easy to see the slope from one categorical value to another (point plots)

4. Creating a Point Plot

```
# sample point plot using catplot()
sns.catplot(x = "df_colname1",
            y = "df_colname2",
            data = df,
            kind = "point")

# display plot
plt.show()
```

5. Disconnecting Points

```
# sample point plot using catplot()
sns.catplot(x = "df_colname1",
            y = "df_colname2",
            data = df,
            kind = "point",
            join = False)

# display plot
plt.show()
```

6. Displaying Median Instead of Mean

```
# import necessary module
from numpy import median

# sample point plot using catplot()
sns.catplot(x = "df_colname1",
            y = "df_colname2",
            data = df,
            kind = "point",
            estimator = median)

# display plot
plt.show()
```

7. Adding _caps_ to the Confidence Intervals

```
# sample point plot using catplot()
sns.catplot(x = "df_colname1",
            y = "df_colname2",
            data = df,
            kind = "point",
            capsize = 0.2)

# display plot
plt.show()
```

<br>


## Chapter 4: Customizing Seaborn Plots

### Section 1: Changing plot style and color
1. Figure Styles
* `sns.set_style()` adds different details to our plot.
  * whitegrid: gray grid on white background
  * ticks: ticks on axis values
  * darkgrid: white grid on gray background

2. Changing Palettes
* `sns.set_palette()` sets preferred color modifications to the main elements of the plot.
  * Diverging palettes:
    * "RdBu": Red-white-blue gradient
    * "PRGn": Purple-white-green gradient
    * "RdBu_r": Blue-white-red gradient ("RdBu" reversed)
  * Sequential palettes:
    * "Greys"
    * "Blues"
    * "GnBu": Green-blue gradient
  * Custom palettes
    * `custom = ["red", "blue", "black"]`

3. Changing the Scale
* `sns.set_context()` changes the scale of the plot elements and labels.
  * "paper" (default): smallest
  * "notebook"
  * "talk"
  * "poster": largest
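
The three settings above can be combined at the top of a notebook. This is a small sketch, assuming the "whitegrid" style, the custom palette from the list above, and the "talk" context; the last two lines only inspect the resulting settings.

```python
import seaborn as sns

# style: gray grid on white background
sns.set_style("whitegrid")

# palette: custom list of colors
custom = ["red", "blue", "black"]
sns.set_palette(custom)

# context: larger plot elements and labels
sns.set_context("talk")

# inspect the current settings
print(sns.axes_style()["axes.grid"])
print(sns.color_palette())
```

These calls change global settings, so they affect every plot created afterwards in the same session.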

<br>

### Section 2: Adding titles and labels: Part 1

1. `FacetGrid` vs. `AxesSubplot` Objects
* Seaborn plotting functions return one of two object types: `FacetGrid` or `AxesSubplot`.
  * A `FacetGrid` can hold several subplots (e.g. from `relplot()` and `catplot()`).
  * An `AxesSubplot` holds a single plot (e.g. from `scatterplot()`, `countplot()`, etc.).
* To identify which of the two we are dealing with, we can store the plot in a variable and check its type with `type()`.
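
A minimal sketch of that check, assuming a made-up DataFrame `demo` with hypothetical columns `x` and `y`. Depending on the installed `matplotlib` version, the second type may be printed as `Axes` rather than `AxesSubplot`, so the sketch also tests against the general `Axes` class.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.axes import Axes

# hypothetical data
demo = pd.DataFrame({"x": [1, 2, 3],
                     "y": [3, 1, 2]})

# figure-level function: returns a FacetGrid
g1 = sns.relplot(x = "x", y = "y", data = demo, kind = "scatter")
print(type(g1))

# start a new figure so the next plot does not draw on the grid above
plt.figure()

# axes-level function: returns a single matplotlib axes object
g2 = sns.scatterplot(x = "x", y = "y", data = demo)
print(type(g2))
print(isinstance(g2, Axes))
```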

2. Adding a Title to a `FacetGrid`

```
# store plot in a variable
g = sns.catplot(x = "df_colname1",
                y = "df_colname2",
                data = df,
                kind = "point")

# add title and adjust height
g.fig.suptitle("Title",
               y = 1.03)

# display plot
plt.show()
```

<br>

### Section 3: Adding titles and labels: Part 2

1. Adding a Title to an `AxesSubplot`

```
# store plot in a variable
g = sns.scatterplot(x = "df_colname1",
                    y = "df_colname2",
                    data = df)

# add title and adjust height
g.set_title("Title",
            y = 1.03)

# display plot
plt.show()
```

2. Titles for Subplots

```
# store plot in a variable
g = sns.catplot(x = "df_colname1",
                y = "df_colname2",
                data = df,
                kind = "point",
                col = "df_colname3")

# add main title and adjust height
g.fig.suptitle("Title",
               y = 1.03)

# add subplot titles ({col_name} is replaced by each subgroup value)
g.set_titles("This is {col_name}")

# display plot
plt.show()
```

3. Adding Axis Labels

```
# store plot in a variable
g = sns.catplot(x = "df_colname1",
                y = "df_colname2",
                data = df,
                kind = "point")

# add axis labels
g.set(xlabel = "X Label",
      ylabel = "Y Label")

# display plot
plt.show()
```

4. Rotating x-Axis Tick Labels

**Note:** Storing the plot objects into a variable is not required for this one.

```
# store plot in a variable
g = sns.catplot(x = "df_colname1",
                y = "df_colname2",
                data = df,
                kind = "point")

# rotate x-axis tick labels
plt.xticks(rotation = 90)

# display plot
plt.show()
```

<br>

### Section 4: Putting it all together

1. Import modules.
2. Choose what type of plot you want to create.
  * Relational plots - relationship between two quantitative variables
    * Scatter plots
    * Line plots
  * Categorical plots - distribution of a quantitative variable over categories under a categorical variable.
    * Bar plots
    * Count plots
    * Box plots
    * Point plots
3. Add a third variable (optional).
  * Setting `hue` to distinguish, by color, different values (subgroups) of the third variable.
  * Setting `row`/`col` to create subplots for the different subgroups of the third variable.
4. Customize plots (optional but good to practice).
  * `sns.set_style()` to change the background
  * `sns.set_palette()` to change main element colors
  * `sns.set_context()` to change the plot scale
5. Add plot titles and axes labels.
  * Titles:
    * `FacetGrid` - `g.fig.suptitle("Title")`
    * `AxesSubplot` - `g.set_title("Title")`
  * Axes Labels - `g.set(xlabel = "X Label", ylabel = "Y Label")`
6. Add final touches (optional, but can be needed).
  * Rotating x-tick labels - `plt.xticks(rotation = 90)`
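
The steps above can be sketched end to end as follows. The DataFrame `scores` and its columns (`day`, `score`, `section`) are made up for illustration; any tidy dataset with a categorical column, a quantitative column, and a grouping column should work the same way.

```python
# 1. import modules
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# hypothetical tidy data: two observations per (section, day) pair
scores = pd.DataFrame({"day": ["Mon", "Mon", "Tue", "Tue",
                               "Mon", "Mon", "Tue", "Tue"],
                       "score": [70, 80, 85, 95, 60, 70, 75, 85],
                       "section": ["A", "A", "A", "A",
                                   "B", "B", "B", "B"]})

# 4. customize plots
sns.set_style("whitegrid")
sns.set_context("notebook")

# 2. and 3. categorical plot (bar), with a third variable as subplots
g = sns.catplot(x = "day",
                y = "score",
                data = scores,
                kind = "bar",
                col = "section")

# 5. titles and axis labels (g is a FacetGrid)
g.fig.suptitle("Average score per day", y = 1.03)
g.set_titles("Section {col_name}")
g.set(xlabel = "Day", ylabel = "Mean score")

# 6. final touches
plt.xticks(rotation = 90)

# display plot
plt.show()
```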

