# Introduction to Seaborn/Matplotlib

## Skills

* Load data into Python using the pandas module.
* Select columns using `[]` and rows using `DataFrame.loc[]`.
* Summarize columns with basic descriptive statistics.
* Summarize by category using `DataFrame.groupby()`.
* Create new columns.
* Use built-in Pandas string manipulation functions.
* **Visualize data using Seaborn and Matplotlib.**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
netflix = pd.read_csv("netflix.csv")

netflix.head(3)

Now that we can load, manipulate, and summarize data, the Seaborn and Matplotlib packages (traditionally loaded with the names *sns* and *plt*, respectively) will let us make graphs to visualize that data.

We'll start with four basic kinds of plots:

1. Histogram
2. Bar Plot
3. Scatterplot
4. Boxplot

Seaborn does a good job of making plots look *pretty good* by default, but we'll still want to tweak them: cleaning up axis labels, adjusting colors and sizes, and so on. The defaults are very easy to learn, as are a few tweaks, but there are *a lot* of options, and going from merely proficient to skilled will take frequent trips to the documentation and Google.

## Basic Plots

Before making a graph, we need to decide what *kind* of graph we want.

##### 1. Is the data we're plotting numeric or categorical?

Recall that "categorical" isn't a Python or Pandas datatype, but a description of a variable that can take only a small number of values.

This distinction is important for making a plot because numeric variables have an inherent numbering and spacing to them: 5 is larger than 2, and the distance from 3 to 4 is the same as the distance from 100 to 101. Categorical data don't have this: is "romance" larger than "comedy"? How far should "horror" and "romance" be from each other?
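
One rough way to check which kind of column you have is to count its distinct values. The sketch below uses a small made-up DataFrame (the column names mirror the Netflix data, but the values are invented for illustration):

```python
import pandas as pd

# Toy data, invented for illustration; not the real Netflix file.
toy = pd.DataFrame({
    "age_certification": ["R", "PG-13", "R", "PG", "R", "PG-13"],
    "imdb_score": [6.5, 7.1, 5.8, 6.9, 8.2, 4.4],
})

# A categorical column repeats a small set of values...
print(toy["age_certification"].nunique())

# ...while a numeric column usually has many distinct values.
print(toy["imdb_score"].nunique())

# value_counts() lists each category with how often it appears.
print(toy["age_certification"].value_counts())
```

This is only a rule of thumb: a numeric column with few distinct values (like a number of seasons) can still be treated as categorical when that makes for a clearer plot.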

##### 2. Are we examining the counts of a single variable, or comparing two variables?

I hope this one is self-explanatory. Are you looking at the counts of how many movies are rated "R" (just examining one variable)? Or are you looking at the typical IMDB score for each age rating (comparing two variables)?

Here's a handy summary of the basic graph types based on these questions:

| Data Type        | One Variable | Two Variables|
| ---------------- | ------------------- | ----------------------- |
| **Numeric**     | Histogram           | Scatterplot             |
| **Categorical** | Bar Plot            | Boxplot |


#### Histogram

A *histogram* is a visualization showing how often a variable takes a particular value.

Let's produce one to see the IMDB user scores for each of the movies in the dataset:

Here, the function we're using is `sns.histplot()`, and it takes two keyword arguments: `data=`, which should be set to the DataFrame we want to pull data from, and `x=`, which is a string telling Seaborn which column of the DataFrame to plot. You can pass the entire column itself, i.e. `x=netflix["imdb_score"]`, but that is not necessary.

After all plotting commands, we end with `plt.show()` which tells Matplotlib to display the plot. If you don't include this line, you should still see the plot (in JupyterLab, anyway), but you may also get some additional output that you're not interested in.

In [None]:
movies = netflix.loc[ netflix["type"] == "MOVIE" ]

sns.histplot(data=movies, x="imdb_score")

plt.show()

The most common IMDB user score for a movie on Netflix is about 6.5, and most scores fall between 5 and 8. A very small number of movies have scores below 5.

#### Bar Plot

A *bar plot* is exactly the same as a histogram, but for categorical data.

Let's see how many movies have different ratings. This uses the function `sns.countplot()`, but otherwise takes the same keyword arguments as `sns.histplot()`:

In [None]:
sns.countplot(data=movies, x="age_certification")

plt.show()

Most of the movies on Netflix with an MPAA rating are R, followed by PG-13. Very few movies on Netflix have a rating of NC-17.

We'll fix the axis labels and put the x axis in a more reasonable order later.

#### Scatterplot

A *scatterplot* plots two separate numeric variables on its x and y axes. This is what I tend to think of when someone says "make a graph of...", probably influenced by my high school math and science classes.

Let's see whether the ratings from IMDB and TMDB users are similar or not.

Just like above, we use a Seaborn function: `sns.scatterplot()`, but now need both `x=` and `y=` keyword arguments.

In [None]:
sns.scatterplot(data=netflix, x="imdb_score", y="tmdb_score")

plt.show()

Typically, TMDB and IMDB scores are close (within about 1-2 points of each other), but there is quite a bit of scatter. Some TMDB scores are exactly 1 or 10 (presumably the minimum and maximum possible), even when the IMDB score is more neutral.

#### Boxplot

A boxplot is like a scatterplot, but one of your variables is categorical. You may not be familiar with boxplots, but they are fairly straightforward to interpret once you get used to them.
* In the center of the box is a line, which represents the median value.
* The box itself tells you where the middle 50% of the data lie.
* The long lines (called "whiskers" or "outer fences") tell you what "normal" data are. This is a rather arbitrary cut-off.
* The individual points are "outliers", and are expected to be unusually high or low value observations.

Pay close attention to the center line and the box for trends, and see if there are many outliers or not. If a box looks particularly strange (e.g. it's just a line or has no whiskers), then there probably isn't enough data to make a good boxplot.
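
To make the pieces of a boxplot concrete, here is a sketch that computes them by hand for a handful of invented scores. It assumes the usual default rule, where the whiskers reach the most extreme points within 1.5 times the box height (the interquartile range) of the box:

```python
import pandas as pd

# Toy scores, invented for illustration.
scores = pd.Series([4.0, 5.5, 6.0, 6.2, 6.5, 6.8, 7.0, 7.4, 9.9])

median = scores.median()        # the line inside the box
q1 = scores.quantile(0.25)      # lower edge of the box
q3 = scores.quantile(0.75)      # upper edge of the box
iqr = q3 - q1                   # height of the box

# Points farther than 1.5 * IQR from the box are drawn individually as outliers.
low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr
outliers = scores[(scores < low_limit) | (scores > high_limit)]

print(median, q1, q3)
print(outliers.tolist())
```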

**Example.** Are longer-running series generally higher-rated? Let's make a boxplot using `sns.boxplot()`, which takes the same keyword arguments as `sns.scatterplot()`:

In [None]:
eightorless = netflix.loc[netflix["seasons"] <= 8]

sns.boxplot(data=eightorless, x="seasons", y="imdb_score")

plt.show()

(I removed shows with more than 8 seasons because there are very few of them. Try putting them back and you'll see how much of a mess the visualization is.)

From looking at the center-lines, it looks like longer-running shows do have higher overall ratings, but it's a small effect (only one point higher). The effect may stop after 6 seasons, but this may be due to small numbers of long-running shows.

#### Other kinds of plots

We are just scratching the surface of Seaborn in this course. Check out the [example gallery](https://seaborn.pydata.org/examples/index.html) on the main site to see more.

## Improving Plots

Now we're going to revisit the previous plots and dig into the additional options Seaborn provides to clean them up.

#### The Histogram, Revisited

Let's take a look at the API reference for [Seaborn's histogram plot](https://seaborn.pydata.org/generated/seaborn.histplot.html). We have used only the `data=` and `x=` keyword arguments, but you can see there are *many, many more*.

* Let's start with `binwidth=`, which allows us to adjust how wide the bins are. Try making it very small (0.01) or large (2) to see how it affects the plot. I think that a value of 0.5 looks good, so we'll use that.
* Second, `stat=` lets us change which statistic is being plotted on the y axis. The default is "count", that is, how many observations are in each bin. With `stat="probability"`, we can instead plot the probability that, if you picked an observation at random, it would fall in that bin. This option for `stat=` is better if you're trying to generalize from a sample ("there are 10 people in this group") to the whole population ("about 20% of people are in this group").
* In addition, let's use a few functions from Matplotlib. `plt.xlabel()` lets us change the label on the x axis (there's a corresponding `plt.ylabel()` as well), and `plt.xlim()` lets us change the range of the x axis (likewise `plt.ylim()` for the y axis). These Matplotlib functions will be used over and over again.

In [None]:
sns.histplot(data=movies, x="imdb_score",
             binwidth=0.5, stat="probability")

plt.xlabel("IMDB Score")
plt.xlim(1,10)
plt.ylim(0,0.2)
plt.show()

We made the above plot for just movies, but what if we wanted to compare movies and TV shows?

* For that, there's the `hue=` keyword. If you pass a categorical column to `hue=`, the histogram will be split and each bar colored by category. The gray area is where the bars overlap.
* Try adding a `multiple=` keyword (it can take the values "layer", "dodge", "stack", or "fill") to see what it does. Each one shows different information about the data. When would you want to use each?

In [None]:
sns.histplot(data=netflix, x="imdb_score",
             binwidth=0.5, hue="type")

plt.xlabel("IMDB Score")
plt.show()

#### The Bar Plot, Revisited

Once again, here's a link to the [Seaborn Countplot API](https://seaborn.pydata.org/generated/seaborn.countplot.html). Let's make two changes to our bar plot.

* Change the labels on the x and y axes, again using `plt.xlabel()` and `plt.ylabel()`.
* Change the order that the bars appear in using `order=` to make the bars go from tallest to shortest. What if we wanted to make the bars go from lowest age rating to highest?

In [None]:
sns.countplot(data=movies, x="age_certification", order=["R", "PG-13", "PG", "G", "NC-17"])

plt.ylabel("Count")
plt.xlabel("MPAA Rating")
plt.show()
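
Rather than typing the tallest-to-shortest order by hand, you could let Pandas work it out. This sketch (with invented ratings) relies on `value_counts()` sorting from most to least common:

```python
import pandas as pd

# Toy ratings, invented for illustration.
toy = pd.DataFrame({
    "age_certification": ["R"] * 4 + ["PG-13"] * 3 + ["PG"] * 2 + ["G"],
})

# value_counts() sorts from most to least common, so its index is
# already in tallest-to-shortest order.
by_count = toy["age_certification"].value_counts().index.tolist()
print(by_count)

# Lowest-to-highest age rating can't be computed from the data;
# that order has to be written out by hand.
by_age = ["G", "PG", "PG-13", "R", "NC-17"]
```

Either list can then be passed to `sns.countplot()` as `order=by_count` or `order=by_age`.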

You can also use the `hue=` keyword for `sns.countplot()`. Try creating an "is_comedy" column like we did last time to see if comedies and non-comedies typically have different ratings.
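
Here is one way that exercise might look. The `genres` column name and its comma-separated format are assumptions made for this sketch, so adjust them to match the real data:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data; the "genres" column name and its format are assumptions.
toy = pd.DataFrame({
    "age_certification": ["R", "PG-13", "R", "PG", "PG-13", "R"],
    "genres": ["comedy, drama", "action", "horror",
               "comedy", "comedy, romance", "drama"],
})

# True where the genre text mentions comedy, False otherwise.
toy["is_comedy"] = toy["genres"].str.contains("comedy")

# hue= splits each rating's bar into comedy and non-comedy counts.
ax = sns.countplot(data=toy, x="age_certification", hue="is_comedy",
                   order=["PG", "PG-13", "R"])

plt.xlabel("MPAA Rating")
plt.ylabel("Count")
plt.show()
```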

#### The Scatterplot, Revisited

Let's take a look at the [Seaborn scatterplot API](https://seaborn.pydata.org/generated/seaborn.scatterplot.html). There are a *lot* of things we can fix about this plot, so we'll take it in a few pieces. The first few don't even require Seaborn; they come from Matplotlib (and I figured out how to do them by knowing what I wanted to achieve and Googling it).

* First, we'll adjust the x and y axes to both be from 0.5 to 10.5, and change the labels. That's something we've done before.
* Second, we'll set the aspect ratio of the axes to be "equal". This forces the scales to be equal on the x and y axes, so if both axes have a range of 0-10, the plot will be square. This is something you should always do when the x and y axes measure the same kind of quantity (here, the IMDB and TMDB scores should be roughly the same, right?). Otherwise, it's quite pointless.
* Third, we'll add a dashed black line showing where the two scores are equal. It's hard to tell if one is slightly larger without such a line. The arguments here are the x values of the line's two endpoints, then the y values, followed by a format string describing what the line should look like. Here, `'--k'` means dashed (`--`) and black (`k`).

With these changes, we can see that TMDB scores are actually slightly higher than IMDB scores! Their users must be kinder.

In [None]:
sns.scatterplot(data=netflix, x="imdb_score", y="tmdb_score")

plt.ylim(0.5,10.5)
plt.xlim(0.5,10.5)
plt.xlabel("IMDB Score")
plt.ylabel("TMDB Score")

plt.gca().set_aspect('equal')
plt.plot([0,11],[0,11], '--k')

plt.show()
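
The dashed line makes the comparison visual. A numeric version of the same check is to subtract one column from the other and summarize the differences. The sketch below uses invented scores, so the numbers it prints say nothing about the real Netflix data:

```python
import pandas as pd

# Toy scores, invented for illustration.
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.5, 7.0, 8.0],
    "tmdb_score": [5.5, 6.5, 7.5, 8.5],
})

# Positive values mean the TMDB score is the higher of the two.
difference = toy["tmdb_score"] - toy["imdb_score"]

print(difference.mean())         # average gap between the two scores
print((difference > 0).mean())   # fraction of titles where TMDB is higher
```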

From here, the `hue=`, `size=`, and `alpha=` keywords bear mentioning.
* Like with the previous plots, we can use `hue=` to color the data by category.
* In addition, `size=` will make the points different sizes based on a numeric variable.
* `alpha=` will make the points partially transparent. `alpha=0` makes the points invisible, while `alpha=1` means completely opaque. Fractional values are in-between. This is useful when you have many overlapping points, so you can get a better feel of how dense the points are.
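
A minimal sketch combining all three keywords is below. The data are invented, and the `runtime` column and the "SHOW" label are hypothetical stand-ins for whatever numeric and categorical columns you want to use:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data, invented for illustration; "runtime" is a hypothetical column.
toy = pd.DataFrame({
    "imdb_score": [5.1, 6.4, 6.6, 7.2, 7.9, 8.4],
    "tmdb_score": [5.6, 6.2, 7.0, 7.5, 8.1, 8.3],
    "type": ["MOVIE", "MOVIE", "SHOW", "MOVIE", "SHOW", "SHOW"],
    "runtime": [95, 110, 45, 130, 50, 60],
})

# Color by category, size by a numeric column, and make points half transparent.
ax = sns.scatterplot(data=toy, x="imdb_score", y="tmdb_score",
                     hue="type", size="runtime", alpha=0.5)

plt.xlabel("IMDB Score")
plt.ylabel("TMDB Score")
plt.show()
```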

#### The Boxplot, Revisited

This one's easy. Let's relabel the axes. We could add a category for `hue=`, as well.

In [None]:
sns.boxplot(data=eightorless, x="seasons", y="imdb_score")

plt.ylim(1,10)
plt.xlabel("Number of Seasons")
plt.ylabel("IMDB Score")
plt.show()
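
If you do want to add `hue=`, it might look like the sketch below, which draws a pair of boxes for each number of seasons. The data are invented and `is_comedy` is a hypothetical True/False column:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data, invented for illustration; "is_comedy" is a hypothetical column.
toy = pd.DataFrame({
    "seasons": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    "imdb_score": [5.5, 6.0, 6.4, 6.1, 6.8, 7.2,
                   6.2, 6.6, 7.0, 6.9, 7.4, 7.8],
    "is_comedy": [True, True, True, False, False, False] * 2,
})

# hue= draws one box per category within each value of x.
ax = sns.boxplot(data=toy, x="seasons", y="imdb_score", hue="is_comedy")

plt.ylim(1, 10)
plt.xlabel("Number of Seasons")
plt.ylabel("IMDB Score")
plt.show()
```

With so few points per box this toy plot is only good for showing the layout; as noted earlier, boxes built from very little data are not trustworthy.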