# Chapter 7. Introduction to Data Visualization with Seaborn



Seaborn is a powerful Python library for creating data visualizations. It was developed to make the most common types of plots easy to create: a typical plot takes just a few lines of Seaborn code.

- Data visualization is often a large part of both the data exploration phase and the communication of results.
- Advantages:
   1. Seaborn's main purpose is to make data visualization easy.
   2. Seaborn works extremely well with pandas data structures.
   3. It's built on top of Matplotlib, another Python visualization library.

## 7.1 Introduction to Seaborn



## Getting started

```
import matplotlib.pyplot as plt
import seaborn as sns
```

## Scatter plot

```
import matplotlib.pyplot as plt
import seaborn as sns

height = [62, 64, 69, 75, 66, 68, 65, 71, 76, 73]
weight = [120, 136, 148, 175, 137, 165, 154, 172, 200, 187]

sns.scatterplot(x=height, y=weight)
plt.show()
```

## Count plot

```
import matplotlib.pyplot as plt
import seaborn as sns

gender = ["Female", "Female", "Female", "Female", 
          "Male", "Male", "Male", "Male", "Male", "Male"]

sns.countplot(x=gender)
plt.show()
```

## Using pandas with Seaborn

### Pandas

- A Python library for data analysis.
- It can easily read datasets from many types of files, including CSV and TXT files.
- pandas supports several types of data structures, but the most common one is the DataFrame object.

```
import pandas as pd
df = pd.read_csv("file.csv")
df.head()
```

### Using dataframes with countplot()

This section shows how to make a count plot with a DataFrame instead of a list of data.

- Important observation: Seaborn works best with **tidy** DataFrames.
   - "Tidy data" means that each observation has its own row and each variable has its own column.
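To make "tidy" concrete, here is a minimal sketch that reshapes a wide table into a tidy one with pandas `melt()`. The table, its column names, and its values are made up for illustration and do not come from the course dataset.

```python
import pandas as pd

# Hypothetical "wide" table: the hour is spread across column names,
# so a single row holds several observations. This is not tidy.
wide = pd.DataFrame({
    "station": ["A", "B"],
    "hour_0": [13.4, 20.1],
    "hour_1": [12.9, 19.5],
    "hour_2": [14.2, 21.0],
})

# melt() produces one row per (station, hour) observation,
# with one column per variable: station, hour, no2.
tidy = wide.melt(id_vars="station", var_name="hour", value_name="no2")
print(tidy)
```

The tidy version can be passed straight to Seaborn, for example with `x="hour"` and `y="no2"`.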

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.countplot(x="how_masculine", data=df)
plt.show()
```

### Adding a third variable with hue

Another big advantage that Seaborn offers is the ability to quickly add a third variable to your plots by adding color.

#### 1. Basic scatter plot

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.scatterplot(x="total_bill", y="tip", data=df)
plt.show()
```

#### 2. Scatter plot with ``hue``

- You can set the ``hue`` parameter equal to a DataFrame column name, and Seaborn will automatically color each point by that column's value.
- It will add a legend to the plot automatically.
- ``hue`` is available in most of Seaborn's plot types.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.scatterplot(x="total_bill", y="tip", data=df,
                hue="smoker")
plt.show()
```

#### 3. Setting hue order with ``hue_order``
    
- Hue also allows you to exert more control over the ordering and coloring of each value.
    - The ``hue_order`` parameter takes a list of values and sets the order of the values in the plot accordingly.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.scatterplot(x="total_bill", y="tip", data=df,
                hue="smoker", hue_order=["Yes", "No"])
plt.show()
```

#### 4. Specifying hue colors with ``palette``

- You can also control the colors assigned to each value using the ``palette`` parameter.
    - This parameter takes a dictionary, which is a data structure of key-value pairs.
    - The dictionary should map each value of the variable to the color you want to represent it.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
hue_colors = {"Yes":"black",
              "No":"red"}
sns.scatterplot(x="total_bill", y="tip", data=df,
                hue="smoker", palette=hue_colors)
plt.show()
```

#### 5. Using hue with count plots

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.countplot(x="smoker", data=df, hue="sex")
plt.show()
```
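The snippets above all read from a placeholder `file.csv`. As a self-contained sketch, the same `hue`, `hue_order`, and `palette` options can be tried on a small inline DataFrame. The rows below are made up for illustration; only the column names follow the examples above.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the CSV file used above.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "smoker": ["No", "No", "Yes", "Yes", "No", "Yes"],
})

# Map each value of the hue variable to a color.
hue_colors = {"Yes": "black", "No": "red"}

# hue colors the points, hue_order fixes the legend order,
# and palette assigns the colors chosen above.
ax = sns.scatterplot(x="total_bill", y="tip", data=df,
                     hue="smoker", hue_order=["Yes", "No"],
                     palette=hue_colors)
plt.show()
```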

## 7.2 Visualizing two quantitative variables

### Introduction to relational plots and subplots

Many questions in data science are centered around describing the relationship between two quantitative variables. Seaborn calls plots that visualize this relationship "relational plots".

#### 1. Visualizing subgroups

We will create a separate plot per subgroup. To do this, we're going to introduce a new Seaborn function: ``relplot()``

- Stands for 'relational plot' and enables you to visualize the relationship between two quantitative variables using either scatter plots or line plots.
- Using ``relplot()`` gives us a big advantage: the ability to create subplots in a single figure.

```
# Using scatterplot()
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.scatterplot(x="total_bill", y="tip", data=df)
plt.show()
```

```
# Using relplot()
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.relplot(x="total_bill", y="tip", data=df, kind="scatter")
plt.show()
```

#### 2. Subplots in columns

- By setting ``col`` equal to "smoker", we get a separate scatter plot for each value of that variable (here, smokers and non-smokers), arranged side by side in columns.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.relplot(x="total_bill", y="tip", data=df, kind="scatter", 
            col="smoker")
plt.show()
```

#### 3. Subplots in rows

- If you want to arrange these vertically in rows instead, you can use the ``row`` parameter instead of ``col``.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.relplot(x="total_bill", y="tip", data=df, kind="scatter", 
            row="smoker")
plt.show()
```

#### 4. Subplots in rows and columns

- It is possible to use both "col" and "row" at the same time.
- We will have a subplot for each combination of two categorical variables.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.relplot(x="total_bill", y="tip", data=df, kind="scatter", 
            col="smoker", row="time")
plt.show()
```

#### 5. Subplots for days of the week

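- A variable with many categories, such as the day of the week, produces a long row of subplots that is hard to read. To address this, use the ``col_wrap`` parameter to specify how many subplots you want per row.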

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.relplot(x="total_bill", y="tip", data=df, kind="scatter",
            col="day", col_wrap=2)
plt.show()
```

#### 6. Ordering columns

- To control the order in which the subplots appear, pass a list of the column values in the desired order to the ``col_order`` parameter.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.relplot(x="total_bill", y="tip", data=df, kind="scatter",
            col="day", col_wrap=2,
            col_order=["Thur", "Fri", "Sat", "Sun"])
plt.show()
```
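To check how ``col_wrap`` and ``col_order`` arrange the subplots, the sketch below builds a small made-up DataFrame (values invented for illustration) and prints the subplot titles in the order Seaborn lays them out.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data; the days are deliberately listed out of weekday order.
df = pd.DataFrame({
    "total_bill": [20.5, 18.2, 25.0, 30.1, 15.3, 22.8, 27.4, 19.9],
    "tip": [3.0, 2.5, 4.1, 5.0, 2.0, 3.2, 4.4, 2.9],
    "day": ["Sun", "Thur", "Sat", "Fri", "Thur", "Sun", "Fri", "Sat"],
})

# One subplot per day, two per row, in the order given by col_order.
g = sns.relplot(x="total_bill", y="tip", data=df, kind="scatter",
                col="day", col_wrap=2,
                col_order=["Thur", "Fri", "Sat", "Sun"])

# Collect the subplot titles in layout order (left to right, top to bottom).
titles = [ax.get_title() for ax in g.axes.flat]
print(titles)
plt.show()
```

Without ``col_order``, the subplot order depends on the data rather than on the weekday order.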

### Customizing scatter plots

Scatter plots are a great tool for visualizing the relationship between two quantitative variables.

Seaborn allows you to add more information to scatter plots by varying the size, the style, and the transparency of the points.

#### 1. Subgroups with point size and hue

- We'll set the ``size`` parameter equal to the variable name "size" from our dataset.
- Larger groups have both larger and darker points, which provides better contrast and makes the plot easier to read.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.relplot(x="total_bill", y="tip", data=df, kind="scatter",
            size="size", hue="size")
plt.show()
```

#### 2. Subgroups with point style

- Setting the ``style`` parameter to a variable name will use different point styles for each value of the variable.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.relplot(x="total_bill", y="tip", data=df, kind="scatter",
            hue="smoker", style="smoker")
plt.show()
```

#### 3. Changing point transparency

- Setting the ``alpha`` parameter to a value between 0 and 1 varies the transparency of the points in the plot:
    - 0 is completely transparent.
    - 1 is completely opaque.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.relplot(x="total_bill", y="tip", data=df, kind="scatter",
            alpha=0.4)
plt.show()
```

### Intro to line plots

In Seaborn, we have two types of relational plots: scatter plots and line plots. While each point in a scatter plot is assumed to be an independent observation, line plots are the visualization of choice when we need to track the same thing over time.

- Setting ``kind="line"`` creates a line plot.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.relplot(x="hour", y="NO_2_mean", data=df, kind="line")
plt.show()
```

#### 1. Subgroups by variable

- Setting the ``style`` and ``hue`` parameters equal to the variable name "location" creates different lines for each region that vary in both line style and color. 

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.relplot(x="hour", y="NO_2_mean", data=df, kind="line",
            style="location", hue="location")
plt.show()
```

#### 2. Adding markers

- Setting the ``markers`` parameter to ``True`` displays a marker for each data point.
- The marker varies based on the subgroup you've set using the ``style`` parameter.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.relplot(x="hour", y="NO_2_mean", data=df, kind="line",
            style="location", hue="location", markers=True)
plt.show()
```

#### 3. Turning off line style

- If you don't want the line styles to vary by subgroup, set the ``dashes`` parameter to ``False``.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.relplot(x="hour", y="NO_2_mean", data=df, kind="line",
            style="location", hue="location", markers=True, dashes=False)
plt.show()
```

#### 4. Multiple observations per x-value

- If a line plot is given multiple observations per x-value, it will aggregate them into a single summary measure. By default, it will display the mean.
- Seaborn will automatically calculate a confidence interval for the mean, displayed by the shaded region.
- Confidence intervals indicate the uncertainty we have about what the true mean is for the whole city.
   - This tells us that based on our sample, we can be 95% confident that the average measure is within this range.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.relplot(x="hour", y="NO_2_mean", data=df, kind="line")
plt.show()
```
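To see what the line and the shaded region summarize, the sketch below computes the mean at each x value by hand and estimates a 95% interval with a simple percentile bootstrap. Seaborn also estimates this interval by bootstrapping, but the helper `bootstrap_ci` is a simplified stand-in written for this example, not Seaborn's own code, and the readings are made up.

```python
import numpy as np
import pandas as pd

# Made-up readings: three stations report a value at each hour.
df = pd.DataFrame({
    "hour": [0, 0, 0, 1, 1, 1],
    "NO_2_mean": [10.0, 14.0, 18.0, 20.0, 22.0, 30.0],
})

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable


def bootstrap_ci(values, n_boot=1000, level=95):
    """Percentile bootstrap interval for the mean of `values` (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    samples = rng.choice(values, size=(n_boot, len(values)), replace=True)
    boot_means = samples.mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return low, high


# One mean (the line) and one interval (the shaded band) per x value.
for hour, group in df.groupby("hour"):
    mean = group["NO_2_mean"].mean()
    low, high = bootstrap_ci(group["NO_2_mean"])
    print(f"hour={hour}: mean={mean:.1f}, interval=({low:.1f}, {high:.1f})")
```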

#### 5. Replacing confidence interval with standard deviation

- Instead of visualizing a confidence interval, we may want to see how varied the measurements are across the different collection stations at a given point in time.
- To visualize this, set the ``ci`` parameter to the string ``"sd"`` so that the shaded area represents the standard deviation, which shows the spread of the distribution of observations at each x value.
- To turn the shaded area off, set ``ci=None``.
- Note: in Seaborn 0.12 and later, ``ci`` is deprecated in favor of the ``errorbar`` parameter (for example ``errorbar="sd"`` or ``errorbar=None``).

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("file.csv")
sns.relplot(x="hour", y="NO_2_mean", data=df, kind="line", ci="sd")
plt.show()
```
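As a rough check on what the standard deviation band covers, the sketch below computes the mean and standard deviation per x value with pandas. It assumes the band spans one sample standard deviation either side of the mean, and the readings are made up.

```python
import pandas as pd

# Made-up readings: three stations report a value at each hour.
df = pd.DataFrame({
    "hour": [0, 0, 0, 1, 1, 1],
    "NO_2_mean": [10.0, 14.0, 18.0, 20.0, 22.0, 30.0],
})

# Mean and sample standard deviation of the observations at each x value.
summary = df.groupby("hour")["NO_2_mean"].agg(["mean", "std"])

# Assumed band edges: one standard deviation below and above the mean.
summary["lower"] = summary["mean"] - summary["std"]
summary["upper"] = summary["mean"] + summary["std"]
print(summary)
```

A wide band here means the stations disagree at that hour; it says nothing directly about how precisely the mean is estimated.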

## 7.3 Visualizing a categorical and a quantitative variable

Count plots and bar plots are two types of visualizations that Seaborn calls "categorical plots".

Categorical plots involve a categorical variable, which is a variable that consists of a fixed, typically small number of possible values, or categories.

These types of plots are commonly used when we want to make comparisons between different groups.

``catplot()`` is used to create different types of categorical plots.
   - Use the ``kind`` parameter to specify what kind of categorical plot to use.
   - To change the order of the categories, create a list of category values in the order that you want them to appear, and then use the ``order`` parameter. 

### Count plots vs bar plots

- A count plot displays the number of observations in each category.
- Bar plots look similar to count plots, but instead of the count of observations in each category, they show the mean of a quantitative variable among observations in each category.
    - Seaborn automatically shows 95% confidence intervals for these means.
    - You can also change the orientation of the bars in bar plots and count plots by switching the x and y parameters.
- Both kinds of plot can be created with ``catplot()``:


```
# COUNT PLOT
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("file.csv")
sns.catplot(x="how_masculine", data=df, kind="count")
plt.show()
```

```
# BAR PLOT
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("file.csv")
sns.catplot(x="day", y="total_bill", data=df, kind="bar")
plt.show()
```
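The difference between the two plots comes down to the number each bar represents. A minimal pandas sketch, using made-up rows, computes both quantities directly:

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Sat", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 15.0, 30.0, 40.0, 50.0],
})

# What a count plot draws: the number of rows in each category.
counts = df["day"].value_counts()

# What a bar plot draws: the mean of the quantitative variable per category.
means = df.groupby("day")["total_bill"].mean()

print(counts)
print(means)
```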

#### 1. Changing the order and turning off confidence interval

1. To change the order of the categories, create a list of category values in the order that you want them to appear, and then pass it to the ``order`` parameter.
    - This applies to all types of categorical plots.
2. To turn off the confidence interval in bar plots, set ``ci=None``.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("file.csv")
category_order =  ["No answer", "Not at all", "Not very", "Somewhat", "Very"]
sns.catplot(x="how_masculine", data=df, kind="count", order=category_order)
plt.show()
```

### Box plots

- A box plot shows the distribution of quantitative data. It is commonly used as a way to compare the distribution of a quantitative variable across different groups of a categorical variable.
- The colored box represents the 25th to 75th percentile, and the line in the middle of the box represents the median.
- The whiskers give a sense of the spread of the distribution, and the floating points represent outliers.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("file.csv")
sns.catplot(x="time", y="total_bill", data=df, kind="box")
plt.show()
```

#### 1. Change the order of categories

- `catplot()` allows you to change the order of the categories using the `order` parameter.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("file.csv")
# Assumes the "time" column holds the values "Dinner" and "Lunch"
sns.catplot(x="time", y="total_bill", data=df, kind="box",
            order=["Dinner", "Lunch"])
plt.show()
```

#### 2. Omitting the outliers using `sym`

- If you pass an empty string to ``sym``, the outliers are omitted from the plot altogether.
- ``sym`` can also be used to change the appearance of the outliers instead of omitting them.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("file.csv")
sns.catplot(x="time", y="total_bill", data=df, kind="box", sym="")
plt.show()
```

#### 3. Changing the whiskers using `whis`

- By default, the whiskers extend up to 1.5 times the interquartile range, or "IQR", beyond the edges of the box.
    - The IQR is the distance between the 25th and 75th percentiles of a distribution of data.
- You can change the range of the whiskers from 1.5 times the IQR (the default) to 2 times the IQR by setting ``whis=2.0``.
- You can also have the whiskers mark specific lower and upper percentiles by passing a list of the lower and upper values, such as ``whis=[5, 95]``.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("file.csv")
sns.catplot(x="time", y="total_bill", data=df, kind="box", whis=2.0)  # or whis=[5, 95] for percentiles
plt.show()
```
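To see how ``whis`` decides which points count as outliers, the sketch below computes the whisker ends by hand. It follows the usual Matplotlib convention, in which each whisker stops at the most extreme data point within ``whis`` times the IQR of the box. The helper `whisker_limits` and the data are made up for illustration.

```python
import numpy as np


def whisker_limits(values, whis=1.5):
    """Return (lower whisker, upper whisker, outliers) for a box plot (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # Points beyond these fences are drawn as outliers.
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    inside = values[(values >= low_fence) & (values <= high_fence)]
    outliers = values[(values < low_fence) | (values > high_fence)]
    # Whiskers end at the most extreme points that are still inside the fences.
    return inside.min(), inside.max(), outliers


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
print(whisker_limits(data))            # default whis=1.5: 30 is an outlier
print(whisker_limits(data, whis=5.0))  # wider fences: 30 is inside the whisker
```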

### Point plots

- Point plots show the mean of a quantitative variable for the observations in each category, plotted as a single point.
- The vertical bars extending above and below the mean represent the 95% confidence intervals for that mean.
- Point plots look similar to line plots, but they are a different kind of plot:
    - Line plots are relational plots, so both the x- and y-axis are quantitative variables.
    - In a point plot, one axis (usually the x-axis) is a categorical variable, making it a categorical plot.
- Compared with a bar plot of the same data, a point plot makes it easier to compare the heights of the subgroup points, because they are stacked above each other.
- It's also easier to look at the differences in slope between the categories than it is to compare the heights of the bars between them.

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("file.csv")
sns.catplot(x="time", y="total_bill", data=df, hue="feel_masculine", kind="point")
plt.show()
```

#### 1. Disconnecting the points

- We may want to remove the lines connecting each point, perhaps because we only wish to compare within a category group and not between them.

```
sns.catplot(x="time", y="total_bill", data=df, hue="feel_masculine", kind="point", join=False)
```

#### 2. Displaying the median

- To calculate the points and confidence intervals for the median instead of the mean, import the ``median`` function from NumPy and set ``estimator=median``.
- The median is more robust to outliers, so if your dataset has a lot of outliers, the median may be a better statistic to use.

```
from numpy import median
sns.catplot(x="time", y="total_bill", data=df, hue="feel_masculine", kind="point", estimator=median)
```
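A quick way to see why the median is more robust is to compare both statistics on a small sample with one extreme value. The numbers below are made up for illustration.

```python
import numpy as np

# Made-up values; 100.0 is an outlier.
bills = np.array([10.0, 11.0, 12.0, 13.0, 100.0])

mean_bill = np.mean(bills)      # pulled upward by the outlier
median_bill = np.median(bills)  # stays near the typical values

print(mean_bill, median_bill)
```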

#### 3. Customizing the confidence intervals

- You can also customize the way that the confidence intervals are displayed.
- To add “caps” to the end of the confidence intervals, set the ``capsize`` parameter equal to the desired width of the caps.

```
sns.catplot(x="time", y="total_bill", data=df, hue="feel_masculine", kind="point", capsize=0.2)
```

#### 4. Turning off confidence intervals

- As with line plots and bar plots, you can turn the confidence intervals off by setting ``ci=None``.

```
sns.catplot(x="time", y="total_bill", data=df, hue="feel_masculine", kind="point", ci=None)
```

## 7.4 Customizing Seaborn Plots

## Changing plot style and color

Changing the style of a plot can be motivated by personal preference, but it can also help improve its readability or help orient an audience more quickly to the key takeaway.

1. Changing the figure style: Seaborn has five preset figure styles, which change the background and axes of the plot. Set one with ``sns.set_style()``.
    - "white", "dark", "whitegrid", "darkgrid", and "ticks"

2. Changing the palette: You can change the color of the main elements of the plot with Seaborn's ``sns.set_palette()`` function.
    - Seaborn has a group of preset palettes called diverging palettes, which are great to use if your visualization deals with a scale where the two ends are opposites and there is a neutral midpoint.
    - Note that appending "_r" to a palette name reverses the palette.
    - Sequential palettes are great for emphasizing a variable on a continuous scale.

3. Changing the scale: You can change the scale of your plot with the ``sns.set_context()`` function.
    - The scale options from smallest to largest are "paper", "notebook", "talk", and "poster".
    - Default context: "notebook".
    - You'll want to choose a larger scale like "talk" for posters or presentations where the audience is further away from the plot.
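The three settings can be combined before creating a plot. The sketch below is one possible combination; the survey answers are made up, and "RdBu" is just one of the diverging palettes.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Global settings: they affect every plot created afterwards.
sns.set_style("whitegrid")  # white background with grid lines
sns.set_palette("RdBu")     # a diverging palette; "RdBu_r" would reverse it
sns.set_context("talk")     # larger text and lines, suited to presentations

# Made-up survey answers for illustration.
df = pd.DataFrame({
    "how_masculine": ["Very", "Somewhat", "Very", "Not very", "Somewhat", "Very"],
})

g = sns.catplot(x="how_masculine", data=df, kind="count")
plt.show()
```

Because these functions change global settings, call them once near the top of a script or notebook rather than before every plot.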

## Adding titles and labels

We create data visualizations to communicate information, and we can't do that effectively without a clear title and informative axis labels.

#### 1. FacetGrid vs. AxesSubplot objects

Seaborn's plot functions create two different types of objects: FacetGrids and AxesSubplots.
- To figure out which type of object you're working with, first assign the plot output to a variable.
- In the documentation, the variable is often named ``g``, so we'll do that here as well.
- Call ``type(g)`` to return the object type.

Recall that ``relplot()`` and ``catplot()`` both support making subplots. This means that they create FacetGrid objects.

In contrast, single-plot functions like ``scatterplot()`` and ``countplot()`` return a single AxesSubplot object (reported as a plain ``Axes`` in recent Matplotlib versions).
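A short sketch of the type check, using a small made-up DataFrame. The exact class name printed for the second object depends on the installed Matplotlib version, so the example also checks against the `Axes` base class.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from matplotlib.axes import Axes

# Made-up rows for illustration.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 3.31],
})

# Figure-level function: returns a FacetGrid.
g = sns.relplot(x="total_bill", y="tip", data=df, kind="scatter")

# Axes-level function: returns a single Matplotlib axes object.
plt.figure()  # start a fresh figure so the two plots stay separate
ax = sns.scatterplot(x="total_bill", y="tip", data=df)

print(type(g))
print(type(ax))
print(isinstance(g, sns.FacetGrid), isinstance(ax, Axes))
```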

#### A1. Adding a title to FacetGrid

- To add a title to a FacetGrid object, first assign the plot to the variable ``g``.
- After you assign the plot to ``g``, you can set the title using ``g.fig.suptitle()``.
- This tells Seaborn you want to set a title for the figure as a whole.

```
g = sns.catplot(x="Region", y="Birthrate", data=df, kind="box")
g.fig.suptitle("New Title")
plt.show()
```

#### A2. Adjusting height of title in FacetGrid

- Note that by default, the figure title might be a little low.
- To adjust the height of the title, use the ``y`` parameter.
- The default value is just under 1 (0.98 in Matplotlib), so setting it to 1.03 places the title a little higher than the default.

```
g = sns.catplot(x="Region", y="Birthrate", data=df, kind="box")
g.fig.suptitle("New Title", y=1.03)
plt.show()
```

#### A3. Titles for subplots

- To alter the subplot titles, use ``g.set_titles()`` to set the title of each AxesSubplot.
- To use the value of the column variable in the title, write ``{col_name}`` in the title string; it is replaced by the value that defines each subplot.

```
g = sns.catplot(x="Region", y="Birthrate", data=df, kind="box", col="Group")
g.fig.suptitle("New Title", y=1.03)
g.set_titles("This is {col_name}")
plt.show()
```

#### B1. Adding a title to AxesSubplot

- To add a title to an AxesSubplot object, assign the plot to a variable and use ``g.set_title()``.
- You can also use the ``y`` parameter here to adjust the height of the title.

```
g = sns.boxplot(x="Region", y="Birthrate", data=df)
g.set_title("New Title", y=1.03)
plt.show()
```

#### 2. Adding axis labels

- To add axis labels, assign the plot to a variable and then call its ``set()`` method.
- Set the parameters ``xlabel`` and ``ylabel`` to the desired x-axis and y-axis labels, respectively.
- This works with both **FacetGrid** and **AxesSubplot** objects.

```
g = sns.catplot(x="Region", y="Birthrate", data=df, kind="box", col="Group")
g.set(xlabel="New X Label", ylabel= "New Y Label")
plt.show()
```

#### 3. Rotating x-axis tick labels

- Your tick labels may overlap, making it hard to interpret the plot.
- One way to address this is by rotating the tick labels. To do this, we don't call a function on the plot object itself.
- Instead, after we create the plot, we call the Matplotlib function ``plt.xticks()`` and set ``rotation`` to 90 degrees.
- This works with both **FacetGrid** and **AxesSubplot** objects.

```
g = sns.catplot(x="Region", y="Birthrate", data=df, kind="box")
plt.xticks(rotation=90)
plt.show()
```
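Putting the pieces of this section together, the sketch below sets a figure title, axis labels, and rotated tick labels on one FacetGrid. The DataFrame, the title, and the label texts are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows for illustration.
df = pd.DataFrame({
    "Region": ["ASIA", "ASIA", "WESTERN EUROPE", "WESTERN EUROPE",
               "NORTHERN AFRICA", "NORTHERN AFRICA"],
    "Birthrate": [18.0, 22.5, 10.1, 11.3, 24.0, 27.8],
})

g = sns.catplot(x="Region", y="Birthrate", data=df, kind="box")

# Title for the figure as a whole, raised slightly above the plot.
g.fig.suptitle("Birthrate by region", y=1.03)

# Axis labels are set on the plot object.
g.set(xlabel="Region of the world", ylabel="Birthrate (made-up values)")

# Tick rotation goes through Matplotlib rather than the plot object.
plt.xticks(rotation=90)
plt.show()
```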