<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Visualization" data-toc-modified-id="Data-Visualization-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Visualization</a></span><ul class="toc-item"><li><span><a href="#PLOTS" data-toc-modified-id="PLOTS-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>PLOTS</a></span><ul class="toc-item"><li><span><a href="#Line-Charts" data-toc-modified-id="Line-Charts-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Line Charts</a></span></li></ul></li></ul></li><li><span><a href="#MatPlotLib" data-toc-modified-id="MatPlotLib-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>MatPlotLib</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Importing-pyplot-with-matplotlib" data-toc-modified-id="Importing-pyplot-with-matplotlib-2.0.1"><span class="toc-item-num">2.0.1&nbsp;&nbsp;</span>Importing pyplot with matplotlib</a></span></li><li><span><a href="#More-with-matplotlib" data-toc-modified-id="More-with-matplotlib-2.0.2"><span class="toc-item-num">2.0.2&nbsp;&nbsp;</span>More with matplotlib</a></span></li><li><span><a href="#multiple-plots" data-toc-modified-id="multiple-plots-2.0.3"><span class="toc-item-num">2.0.3&nbsp;&nbsp;</span>multiple plots</a></span></li></ul></li><li><span><a href="#Bar-Plots-&amp;-Scatter-Plots" data-toc-modified-id="Bar-Plots-&amp;-Scatter-Plots-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Bar Plots &amp; Scatter Plots</a></span><ul class="toc-item"><li><span><a href="#Bar-Plots" data-toc-modified-id="Bar-Plots-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Bar Plots</a></span></li><li><span><a href="#Scatter-Plots" data-toc-modified-id="Scatter-Plots-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Scatter Plots</a></span></li></ul></li></ul></li></ul></div>

# Data Visualization

Data visualization is a discipline that focuses on the visual representation of data. As humans, our brains have evolved to develop powerful visual processing capabilities. We can quickly find patterns in the visual information we encounter, which was incredibly important from a survivability standpoint. Unfortunately, when data is represented as tables of values, we can't really take advantage of our visual pattern matching capabilities. This is because our ability to quickly process symbolic values (like numbers and words) is very poor. Data visualization focuses on transforming data from table representations to visual ones.

In **Exploratory Data Visualization**, we'll focus on data visualization techniques to explore datasets and help us uncover patterns. In this mission, we'll use a specific type of data visualization to understand U.S. unemployment data.

When we read the dataset into a DataFrame, pandas will set the data type of the `DATE` column as a text column. Because of how pandas reads in strings internally, this column is given a data type of object. We need to convert this column to the datetime type using the `pandas.to_datetime()` function, which returns a Series object with the datetime data type that we can assign back to the DataFrame:
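The conversion described above can be sketched as follows. The three rows are made-up sample values, and the `unrate` DataFrame with `DATE` and `VALUE` columns is assumed to match the names used later in these notes:

```python
import pandas as pd

# Made-up sample rows standing in for the unemployment dataset.
unrate = pd.DataFrame({
    "DATE": ["1948-01-01", "1948-02-01", "1948-03-01"],
    "VALUE": [3.4, 3.8, 4.0],
})

# Before the conversion, DATE holds plain strings.
print(unrate["DATE"].dtype)

# to_datetime() returns a new Series, so we assign it back to the column.
unrate["DATE"] = pd.to_datetime(unrate["DATE"])
print(unrate["DATE"].dtype)
```

After the assignment, the column supports date-aware operations such as extracting the month.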

**Seasonality** is when a pattern is observed on a regular, predictable basis for a specific reason. A simple example of seasonality would be a large increase in textbook purchases every August. Many schools start their terms in August and this spike in textbook sales is directly linked.

## PLOTS

Instead of representing data using text like tables do, visual representations use visual objects like dots, shapes, and lines on a grid. **Plots** are a category of visual representations that allow us to easily understand the relationships between variables. There are many types of plots and selecting the right one is an important skill that you'll hone as you create data visualizations. 

### Line Charts
Because we want to **compare** the unemployment **trends across time**, we should **use line charts**. Here's an overview of line charts using 4 sample data points:

# MatPlotLib

To create the line chart, we'll use the matplotlib library, which allows us to:

- quickly create common plots using high-level functions
- extensively tweak plots
- create new kinds of plots from the ground up

**To help you become familiar with matplotlib, we'll focus on the first 2 use cases. When working with commonly used plots in matplotlib, the general workflow is**:

- create a plot using data
- customize the appearance of the plot
- display the plot
- edit and repeat until satisfied

**This interactive style aligns well with the exploratory workflow of data visualization because we're asking questions and creating data visualizations to help us get answers.** The `pyplot` module provides a high-level interface for `matplotlib` that allows us to quickly create common data plots and perform common tweaks to them.

The pyplot module is commonly imported as plt from matplotlib:

### Importing pyplot with matplotlib
`import matplotlib.pyplot as plt`

Using the different pyplot functions, we can create, customize, and display a plot. For example, we can use 2 functions to create and display an empty plot:

- `plt.plot()`
- `plt.show()`

### More with matplotlib

Because we didn't pass in any arguments, the `plot()` function would generate an empty plot with just the axes and ticks and the `show()` function would display that plot. You'll notice that we didn't assign the plot to a variable and then call a method on the variable to display it. We instead called 2 functions on the pyplot module directly.

This is because every time we call a pyplot function, the module maintains and updates the plot internally (also known as state). When we call `show()`, the plot is displayed and the internal state is destroyed. While this workflow isn't ideal when we're writing functions that create plots on a repeated basis as part of a larger application, it's useful when exploring data.

Let's run this code to see the default properties matplotlib uses. If you'd like to follow along on your own computer, we recommend installing matplotlib using Anaconda: `conda install matplotlib`. We recommend working with matplotlib using Jupyter Notebook because it can render the plots in the notebook itself. You will need to run the following Jupyter magic in a code cell each time you open your notebook: `%matplotlib inline`. Whenever you call `show()`, the plots will be displayed in the output cell.

By default, Matplotlib displayed a coordinate grid with:

- the x-axis and y-axis values ranging from -0.06 to 0.06
- no grid lines
- no data

Even though no data was plotted, the `x-axis` and `y-axis` ticks correspond to the `-0.06` to `0.06` value range. The axis ticks consist of tick marks and tick labels. Here's a focused view of the x-axis tick marks and x-axis tick labels:

Instead of manually updating the ticks, drawing each marker, and connecting the markers with lines, we can just specify the data we want plotted and let matplotlib handle the rest. To generate the line chart we're interested in, we pass in the list of x-values as the first parameter and the list of y-values as the second parameter to `plot()`:

`plt.plot(x_values, y_values)`

Matplotlib will accept any iterable object, like NumPy arrays and pandas.Series instances.
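As a rough illustration of the point above, the sketch below passes the same kind of data to `plt.plot()` as plain lists, NumPy arrays, and pandas Series. The numbers are made up for the example:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up sample values.
x_values = [1, 2, 3, 4]
y_values = [3.4, 3.8, 4.0, 3.9]

# Plain Python lists.
plt.plot(x_values, y_values)

# NumPy arrays (shifted up by 1 so the lines don't overlap).
plt.plot(np.array(x_values), np.array(y_values) + 1)

# pandas Series (shifted up by 2).
plt.plot(pd.Series(x_values), pd.Series(y_values) + 2)
```

Each call adds one line to the same plot. Calling `plt.show()` afterwards displays all three.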

While the y-axis looks fine, the **x-axis tick** labels are too close together and are unreadable. The line charts from earlier in the mission suggest a better way to display the x-axis tick labels:

We can rotate the x-axis tick labels by 90 degrees so they don't overlap. The `xticks()` function within pyplot lets you customize the behavior of the x-axis ticks. If you head over to the documentation for that function, it's not immediately obvious which arguments it takes:

`matplotlib.pyplot.xticks(*args, **kwargs)`

In the documentation for the function, you'll see a link to the matplotlib Text class, which is what pyplot uses to represent the x-axis tick labels. You'll notice that there's a `rotation` parameter that accepts degrees of rotation as a parameter. We can specify degrees of rotation using a float or integer value.
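A minimal sketch of the rotation, using made-up monthly values and dates built with the standard library. The `rotation` keyword is forwarded to the tick label Text objects:

```python
import datetime
import matplotlib.pyplot as plt

# Made-up monthly values for illustration.
dates = [datetime.date(1948, month, 1) for month in range(1, 13)]
values = [3.4, 3.8, 4.0, 3.9, 3.5, 3.6, 3.6, 3.9, 3.8, 3.7, 3.8, 4.0]

plt.plot(dates, values)

# Rotate the x-axis tick labels by 90 degrees so they don't overlap.
# xticks() returns the current tick locations and label objects.
tick_locations, tick_labels = plt.xticks(rotation=90)
```

Calling `plt.show()` afterwards displays the plot with vertical tick labels.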

Let's now finish tweaking this plot by adding axis labels and a title. Always adding axis labels and a title to your plot is a good habit to have, and is especially useful when we're trying to keep track of multiple plots down the road.

Here's an overview of the pyplot functions we need to tweak the axis labels and the plot title:

- `xlabel()`: accepts a string value, which gets set as the x-axis label.
- `ylabel()`: accepts a string value, which is set as the y-axis label.
- `title()`: accepts a string value, which is set as the plot title.
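The three functions listed above could be combined like this. The data values and the label strings are placeholders chosen for the example, not taken from the dataset:

```python
import matplotlib.pyplot as plt

# Made-up monthly values for illustration.
months = list(range(1, 13))
values = [3.4, 3.8, 4.0, 3.9, 3.5, 3.6, 3.6, 3.9, 3.8, 3.7, 3.8, 4.0]

plt.plot(months, values)

# Each function takes a string and updates the current plot.
plt.xlabel("Month")
plt.ylabel("Unemployment Rate")
plt.title("Monthly Unemployment Trends (sample data)")
```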

### multiple plots

When we were working with a single plot, pyplot was storing and updating the state of that single plot. We could tweak the plot just using the functions in the pyplot module. When we want to work with multiple plots, however, we need to be more explicit about which plot we're making changes to. This means we need to understand the matplotlib classes that pyplot uses internally to maintain state so we can interact with them directly. Let's first start by understanding what pyplot was automatically storing under the hood when we create a single plot:

- a container for all plots was created (returned as a Figure object)
- a container for the plot was positioned on a grid (the plot returned as an Axes object)
- visual symbols were added to the plot (using the Axes methods)

A figure acts as a container for all of our plots and has methods for customizing the appearance and behavior for the plots within that container. Some examples include changing the overall width and height of the plotting area and the spacing between plots.

We can manually create a figure by calling `pyplot.figure()`:

`fig = plt.figure()`

Instead of only calling the pyplot function, we assigned its return value to a variable (fig). After a figure is created, an axes for a single plot containing no data is created within the context of the figure. When rendered without data, the plot will resemble the empty plot from the previous mission. The Axes object acts as its own container for the various components of the plot, such as:

- values on the x-axis and y-axis
- ticks on the x-axis and y-axis
- all visual symbols, such as:
    - markers
    - lines
    - gridlines
    
While plots are represented using instances of the Axes class, they're also often referred to as subplots in matplotlib. To add a new subplot to an existing figure, use `Figure.add_subplot`. This will return a new Axes object, which needs to be assigned to a variable:

`axes_obj = fig.add_subplot(nrows, ncols, plot_number)`

If we want the figure to contain 2 plots, one above the other, we need to write:

`ax1 = fig.add_subplot(2,1,1)`

`ax2 = fig.add_subplot(2,1,2)`

This will create a grid, 2 rows by 1 column, of plots. Once we're done adding subplots to the figure, we display everything using `plt.show()`:

`import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)
plt.show()`

Let's create a figure, add subplots to it, and display it.

For each subplot, matplotlib generated a coordinate grid that was similar to the one we generated in the last mission using the `plot()` function:

- the x-axis and y-axis values ranging from 0.0 to 1.0
- no gridlines
- no data

The main difference is that this plot ranged from 0.0 to 1.0 instead of from -0.06 to 0.06, a quirk that comes from a difference in default properties.

Now that we have a basic understanding of the important matplotlib classes, we can create multiple plots to compare monthly unemployment trends. If you recall, we need to specify the position of each subplot on a grid. Here's a diagram that demonstrates what a 2 by 2 subplot layout would look like:

To generate a line chart within an Axes object, we need to call `Axes.plot()` and pass in the data you want plotted:

`x_values = [0.0, 0.5, 1.0]
y_values = [10, 20, 40]
ax1.plot(x_values, y_values)`

Like `pyplot.plot()`, the `Axes.plot()` will accept any iterable object for these parameters, including NumPy arrays and pandas Series objects. It will also generate a line chart by default from the values passed in. Each time we want to generate a line chart, we need to call `Axes.plot()` and pass in the data we want to use in that plot.
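Putting the pieces together, a small self-contained sketch with 2 stacked subplots might look like this. The second set of y-values is made up so the two charts differ:

```python
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(2, 1, 1)  # top plot in a 2 row by 1 column grid
ax2 = fig.add_subplot(2, 1, 2)  # bottom plot

x_values = [0.0, 0.5, 1.0]

# One Axes.plot() call per line chart, each on its own Axes object.
ax1.plot(x_values, [10, 20, 40])
ax2.plot(x_values, [40, 20, 10])
```

Because each call is made on a specific Axes object, there is no ambiguity about which subplot receives the data.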

One issue with the 2 plots is that the x-axis tick labels are unreadable. The other issue is that the plots are squeezed together vertically and hard to interpret. Even though we now generated 2 line charts, the total plotting area for the figure remained the same:

This is because matplotlib used the default dimensions for the total plotting area instead of resizing it to accommodate the plots. If we want to expand the plotting area, we have to specify this ourselves when we create the figure. To tweak the dimensions of the plotting area, we need to use the `figsize` parameter when we call `plt.figure()`:

This parameter takes in a tuple of floats:

`fig = plt.figure(figsize=(width, height))`

The unit for both width and height values is inches. The `dpi` parameter, or dots per inch, and the `figsize` parameter determine how much space on your display a plot takes up. By increasing the width and the height of the plotting area, we can address both issues.
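As a quick check of how `figsize` behaves, the sketch below creates a 12 by 5 inch figure and reads the dimensions back. The specific width and height are arbitrary values chosen for the example:

```python
import matplotlib.pyplot as plt

# Width and height are given in inches.
fig = plt.figure(figsize=(12, 5))
ax1 = fig.add_subplot(2, 1, 1)
ax2 = fig.add_subplot(2, 1, 2)

# The figure reports its size in inches and its dots per inch.
width, height = fig.get_size_inches()
print(width, height, fig.dpi)
```

Multiplying each dimension by `fig.dpi` gives the approximate size of the figure in pixels.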

By adding more line charts, we can look across more years for seasonal trends. This comes at a cost, unfortunately. We now have to visually scan over more space, which is a limitation that we experienced when scanning the table representation of the same data. If you recall, one of the limitations of the table representation we discussed in the previous mission was the amount of time we'd have to spend scanning the table as the number of rows increased significantly.


To extract the month values from the `DATE` column and assign them to a new column, we can use the `pandas.Series.dt` accessor:

`unrate['MONTH'] = unrate['DATE'].dt.month`

Calling `pandas.Series.dt.month` returns a Series containing the integer values for each month (e.g. 1 for January, 2 for February, etc.). Under the hood, pandas applies the `datetime.date.month` attribute from the datetime.date class over each datetime value in the `DATE` column, which returns the integer month value. Let's now move on to generating multiple line charts in the same subplot.
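Here is a small sketch of the accessor on made-up rows. The `unrate` frame below is a stand-in with the same column names, not the real dataset:

```python
import pandas as pd

# Made-up sample rows; DATE is already converted to datetime.
unrate = pd.DataFrame({
    "DATE": pd.to_datetime(["1948-01-01", "1948-02-01", "1948-12-01"]),
    "VALUE": [3.4, 3.8, 4.0],
})

# .dt.month extracts the integer month from each datetime value.
unrate["MONTH"] = unrate["DATE"].dt.month
print(unrate["MONTH"].tolist())  # [1, 2, 12]
```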

In the last mission, we called `pyplot.plot()` to generate a single line chart. Under the hood, matplotlib created a figure and a single subplot for this line chart. If we call `pyplot.plot()` multiple times, matplotlib will generate the line charts on the single subplot.

`plt.plot(unrate[0:12]['MONTH'], unrate[0:12]['VALUE'])
plt.plot(unrate[12:24]['MONTH'], unrate[12:24]['VALUE'])`

If we want to set the dimensions for the plotting area, we can create the figure ourselves first then plot the data. This is because matplotlib first checks if a figure already exists before plotting data. It will only create one if we didn't create a figure.

`fig = plt.figure(figsize=(6,6))
plt.plot(unrate[0:12]['MONTH'], unrate[0:12]['VALUE'])
plt.plot(unrate[12:24]['MONTH'], unrate[12:24]['VALUE'])`

By default, matplotlib will select a different color for each line. To specify the color ourselves, use the `c` parameter when calling `plot()`:

`plt.plot(unrate[0:12]['MONTH'], unrate[0:12]['VALUE'], c='red')`

Example:

`fig = plt.figure(figsize=(6,3))
plt.plot(unrate.MONTH.head(12), unrate.VALUE.head(12), c='red')
plt.plot(unrate.MONTH[12:24], unrate.VALUE[12:24], c='blue')
plt.show()`

How colorful! By plotting all of the lines in one coordinate grid, we got a different perspective on the data. The main thing that sticks out is how the blue and green lines span a larger range of y values (4% to 8% for blue and 4% to 7% for green) while the 3 plots below them mostly range only between 3% and 4%. You can tell from the last sentence that we don't know which line corresponds to which year, because the x-axis now only reflects the month values.

To help remind us which year each line corresponds to, we can add a legend that links each color to the year the line is representing. Here's what a `legend` for the lines in the last screen could look like:

When we generate each line chart, we need to specify the text label we want each color linked to. The `pyplot.plot()` function contains a `label` parameter, which we use to set the year value:

`plt.plot(unrate[0:12]['MONTH'], unrate[0:12]['VALUE'], c='red', label='1948')
plt.plot(unrate[12:24]['MONTH'], unrate[12:24]['VALUE'], c='blue', label='1949')`

We can create the legend using `pyplot.legend()` and specify its location using the `loc` parameter:

`plt.legend(loc='upper left')`

If we're instead working with multiple subplots, we can create a legend for each subplot by mirroring the steps for each subplot. When we use `plt.plot()` and `plt.legend()`, the `Axes.plot()` and `Axes.legend()` methods are called under the hood and the parameters are passed along to those calls. When we need to create a legend for each subplot, we can use `Axes.legend()` instead.
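A sketch of the per-subplot approach, assuming 2 stacked subplots with one line each. The monthly values are made up for illustration:

```python
import matplotlib.pyplot as plt

months = list(range(1, 13))
# Made-up values standing in for two years of data.
values_1948 = [3.4, 3.8, 4.0, 3.9, 3.5, 3.6, 3.6, 3.9, 3.8, 3.7, 3.8, 4.0]
values_1949 = [4.3, 4.7, 5.0, 5.3, 6.1, 6.2, 6.7, 6.8, 6.6, 7.9, 6.4, 6.6]

fig = plt.figure(figsize=(6, 8))
ax1 = fig.add_subplot(2, 1, 1)
ax2 = fig.add_subplot(2, 1, 2)

# The label text is what the legend shows next to each line's color.
ax1.plot(months, values_1948, c="red", label="1948")
ax2.plot(months, values_1949, c="blue", label="1949")

# Axes.legend() builds a legend from the labeled lines on that subplot.
legend1 = ax1.legend(loc="upper left")
legend2 = ax2.legend(loc="upper left")
```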

Let's now add a legend for the plot we generated in the last screen.

## Bar Plots & Scatter Plots
### Bar Plots

To create a useful bar plot, however, we need to specify the positions of the bars, the widths of the bars, and the positions of the axis labels. Here's a diagram that shows the various values we need to specify:

- Positions of the bars
- Positions of the axis labels
- Width of the Bars

We'll focus on positioning the bars on the x-axis in this step and on positioning the x-axis labels in the next step. We can generate a vertical bar plot using either `pyplot.bar()` or `Axes.bar()`. We'll use `Axes.bar()` so we can extensively customize the bar plot more easily. We can use `pyplot.subplots()` to first generate a single subplot and return both the Figure and Axes object. This is a shortcut from the technique we used in the previous mission:

`fig, ax = plt.subplots()` is the same as

`fig = plt.figure()
ax = fig.add_subplot(1,1,1)`

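The `Axes.bar()` method has 2 required parameters, **left and height**. We use the `left` parameter to specify the x coordinates of the left sides of the bars (marked in blue on the above image). We use the `height` parameter to specify the height of each bar. Both of these parameters accept a list-like object. Note that in matplotlib 2.0 and later, the first parameter is named `x` and, by default (`align='center'`), gives the center of each bar; passing `align='edge'` keeps the left-edge behavior described here.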

The `np.arange()` function returns evenly spaced values. We use `arange()` to generate the positions of the left side of our bars. This function requires a parameter that specifies the number of values we want to generate. We'll also want to add space between our bars for better readability:

`#Positions of the left sides of the 5 bars. [0.75, 1.75, 2.75, 3.75, 4.75]
from numpy import arange
bar_positions = arange(5) + 0.75 
num_cols = ['RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm', 'Fandango_Ratingvalue', 'Fandango_Stars']
bar_heights = norm_reviews[num_cols].iloc[0].values 
ax.bar(bar_positions, bar_heights)`

We can also use the `width` parameter to specify the width of each bar. This is an optional parameter and the width of each bar is set to `0.8` by default. The following code sets the `width` parameter to `1.5`:

`ax.bar(bar_positions, bar_heights, 1.5)`

By default, matplotlib sets the x-axis tick labels to the integer values the bars spanned on the x-axis (from `0` to `6`). We only need tick labels on the x-axis where the bars are positioned. We can use `Axes.set_xticks()` to change the positions of the ticks to `[1, 2, 3, 4, 5]`:

`tick_positions = range(1,6)
ax.set_xticks(tick_positions)`

Then, we can use `Axes.set_xticklabels()` to specify the tick labels:

`num_cols = ['RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm', 'Fandango_Ratingvalue', 'Fandango_Stars']
ax.set_xticklabels(num_cols)`

If you look at the documentation for the method, you'll notice that we can specify the orientation for the labels using the rotation parameter:

`ax.set_xticklabels(num_cols, rotation=90)`

Rotating the labels by 90 degrees keeps them readable. In addition to modifying the x-axis tick positions and labels, let's also set the x-axis label, y-axis label, and the plot title.
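The bar plot steps so far could be assembled as in the sketch below. The bar heights are made up, the axis label and title strings are placeholders, and `align="edge"` is an assumption added so that the positions are treated as left edges on current matplotlib versions:

```python
import numpy as np
import matplotlib.pyplot as plt

num_cols = ['RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm',
            'Fandango_Ratingvalue', 'Fandango_Stars']
# Made-up heights standing in for one film's normalized ratings.
bar_heights = [4.3, 3.55, 3.9, 4.5, 5.0]

# Left edges at 0.75, 1.75, ... with width 0.5 center the bars on 1, 2, ...
bar_positions = np.arange(5) + 0.75
tick_positions = np.arange(5) + 1

fig, ax = plt.subplots()
bars = ax.bar(bar_positions, bar_heights, 0.5, align="edge")

ax.set_xticks(tick_positions)
ax.set_xticklabels(num_cols, rotation=90)

# Placeholder label and title text.
ax.set_xlabel("Rating Source")
ax.set_ylabel("Average Rating")
ax.set_title("Average User Rating (sample data)")
```

With the default `align="center"`, the same result could be obtained by passing the tick positions directly as the bar positions.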

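We can create a horizontal bar plot in matplotlib in a similar fashion. Instead of using `Axes.bar()`, we use `Axes.barh()`. This method has 2 required parameters, `bottom` and `width`. We use the `bottom` parameter to specify the y coordinates of the bottom sides of the bars and the `width` parameter to specify the lengths of the bars. In matplotlib 2.0 and later, the first parameter is named `y` and gives the vertical center of each bar by default; passing `align='edge'` keeps the bottom-edge behavior described here: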

`bar_widths = norm_reviews[num_cols].iloc[0].values
bar_positions = arange(5) + 0.75
ax.barh(bar_positions, bar_widths, 0.5)`

To recreate the bar plot from the last step as a horizontal bar plot, we essentially need to map the properties we set for the x-axis onto the y-axis instead. We use `Axes.set_yticks()` to set the y-axis tick positions to `[1, 2, 3, 4, 5]` and `Axes.set_yticklabels()` to set the tick labels to the column names:

`tick_positions = range(1,6)
num_cols = ['RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm', 'Fandango_Ratingvalue', 'Fandango_Stars']
ax.set_yticks(tick_positions)
ax.set_yticklabels(num_cols)`
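A self-contained sketch of the horizontal version follows. The bar widths are made up, and `align="edge"` is again an assumption so that the positions act as bottom edges on current matplotlib versions:

```python
import numpy as np
import matplotlib.pyplot as plt

num_cols = ['RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm',
            'Fandango_Ratingvalue', 'Fandango_Stars']
# Made-up lengths standing in for one film's normalized ratings.
bar_widths = [4.3, 3.55, 3.9, 4.5, 5.0]

# Bottom edges at 0.75, 1.75, ... with height 0.5 center the bars on 1, 2, ...
bar_positions = np.arange(5) + 0.75
tick_positions = np.arange(5) + 1

fig, ax = plt.subplots()
# The third positional argument of barh() is the thickness of each bar.
bars = ax.barh(bar_positions, bar_widths, 0.5, align="edge")

ax.set_yticks(tick_positions)
ax.set_yticklabels(num_cols)
```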

### Scatter Plots
From the horizontal bar plot, we can more easily determine that the 2 average scores from Fandango users are higher than those from the other sites. While bar plots help us visualize a few data points to quickly compare them, they aren't good at helping us visualize many data points. Let's look at a plot that can help us visualize many points.

In the previous mission, the line charts we generated always connected points from left to right. This helped us show the trend, up or down, between each point as we scanned visually from left to right. Instead, we can avoid using lines to connect markers and just use the underlying markers. A plot containing just the markers is known as a scatter plot.

**A scatter plot visualizes data using markers. If the markers are close together, it implies that there could be a correlation**

A scatter plot helps us determine if 2 columns are weakly or strongly correlated. While calculating the `correlation coefficient` will give us a precise number, a scatter plot helps us find outliers, gain a more intuitive sense of how spread out the data is, and compare more easily.

To generate a scatter plot, we use `Axes.scatter()`. The `scatter()` method has 2 required parameters, `x` and `y`, which match the parameters of the `plot()` method. The values for these parameters need to be iterable objects of matching lengths (lists, NumPy arrays, or pandas Series).

Let's start by creating a scatter plot that visualizes the relationship between the Fandango_Ratingvalue and RT_user_norm columns. We're looking for at least a weak correlation between the columns.

`import matplotlib.pyplot as plt`

`fig, ax = plt.subplots()`

`ax.scatter(norm_reviews.Fandango_Ratingvalue, norm_reviews.RT_user_norm)
ax.set_xlabel('Fandango')
ax.set_ylabel('Rotten Tomatoes')`

`plt.show()`

**Multiple scatter plots**

`import pandas as pd
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,10))
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)
reviews = pd.read_csv('fandango_scores.csv')
cols = ['FILM', 'RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm', 'Fandango_Ratingvalue', 'Fandango_Stars']
norm_reviews = reviews[cols]
ax1.scatter(norm_reviews.Fandango_Ratingvalue, norm_reviews.RT_user_norm)
ax1.set_xlabel('Fandango')
ax1.set_ylabel('Rotten Tomatoes')
ax2.scatter(norm_reviews.RT_user_norm, norm_reviews.Fandango_Ratingvalue)
ax2.set_xlabel('Rotten Tomatoes')
ax2.set_ylabel('Fandango')
plt.show()`

The second scatter plot is a mirror reflection of the first scatter plot. The nature of the correlation is still reflected, however, which is the important thing. Let's now generate scatter plots to see how Fandango ratings correlate with all 3 of the other review sites.

When generating multiple scatter plots for the purpose of comparison, it's important that all plots share the same ranges in the x-axis and y-axis. In the 2 plots we generated in the last step, the ranges for both axes didn't match. We can use `Axes.set_xlim()` and `Axes.set_ylim()` to set the data limits for both axes:

`ax.set_xlim(0, 5)
ax.set_ylim(0, 5)`

By default, matplotlib uses the minimal ranges for the data limits necessary to display all of the data we specify. By manually setting the data limits to the same ranges for all plots, we ensure that we can compare them accurately. We can even use the methods we just mentioned to zoom in on a part of the plots. For example, the following code will limit the axes to the 4 to 5 range:

`ax.set_xlim(4, 5)
ax.set_ylim(4, 5)`

This makes small changes in the actual values in the data appear larger in the plot. A difference of 0.1 in a plot that ranges from `0` to `5` is hard to visually observe. A difference of `0.1` in a plot that only ranges from `4` to `5` is easily visible since the difference is 1/10th of the range.
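To see both uses side by side, the sketch below draws the same made-up points twice: once on the full 0 to 5 range and once zoomed in to the 4 to 5 range. The rating values are invented for the example:

```python
import matplotlib.pyplot as plt

# Made-up ratings on a 0 to 5 scale.
fandango = [4.5, 4.1, 4.8, 4.3, 4.6]
rotten_tomatoes = [4.3, 4.0, 4.5, 4.2, 4.7]

fig = plt.figure(figsize=(5, 10))
ax1 = fig.add_subplot(2, 1, 1)
ax2 = fig.add_subplot(2, 1, 2)

# Full range: comparable with other plots that use the same limits.
ax1.scatter(fandango, rotten_tomatoes)
ax1.set_xlim(0, 5)
ax1.set_ylim(0, 5)

# Zoomed in: small differences between points become easier to see.
ax2.scatter(fandango, rotten_tomatoes)
ax2.set_xlim(4, 5)
ax2.set_ylim(4, 5)
```

Points that fall outside the chosen limits are simply not shown, so zooming is best reserved for ranges that still contain the data of interest.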