<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Storytelling-w/-Data-Visualization" data-toc-modified-id="Storytelling-w/-Data-Visualization-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Storytelling w/ Data Visualization</a></span><ul class="toc-item"><li><span><a href="#Line-Plot" data-toc-modified-id="Line-Plot-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Line Plot</a></span></li><li><span><a href="#Clean-Visuals" data-toc-modified-id="Clean-Visuals-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Clean Visuals</a></span><ul class="toc-item"><li><span><a href="#Cleaning-TIck-Marks" data-toc-modified-id="Cleaning-TIck-Marks-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Cleaning TIck Marks</a></span></li><li><span><a href="#Cleaning-Spines" data-toc-modified-id="Cleaning-Spines-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Cleaning Spines</a></span></li><li><span><a href="#Comparing-multiple-Line-Charts,-making-them-consistent" data-toc-modified-id="Comparing-multiple-Line-Charts,-making-them-consistent-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span>Comparing multiple Line Charts, making them consistent</a></span></li><li><span><a href="#Color-Pallettes" data-toc-modified-id="Color-Pallettes-1.2.4"><span class="toc-item-num">1.2.4&nbsp;&nbsp;</span>Color Pallettes</a></span></li><li><span><a href="#Increase-line-widths" data-toc-modified-id="Increase-line-widths-1.2.5"><span class="toc-item-num">1.2.5&nbsp;&nbsp;</span>Increase line widths</a></span></li><li><span><a href="#Plot-Spacing" data-toc-modified-id="Plot-Spacing-1.2.6"><span class="toc-item-num">1.2.6&nbsp;&nbsp;</span>Plot Spacing</a></span></li><li><span><a href="#Legends" data-toc-modified-id="Legends-1.2.7"><span class="toc-item-num">1.2.7&nbsp;&nbsp;</span>Legends</a></span></li></ul></li></ul></li><li><span><a href="#SEABORN" data-toc-modified-id="SEABORN-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>SEABORN</a></span><ul 
class="toc-item"><li><span><a href="#How-it-works" data-toc-modified-id="How-it-works-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>How it works</a></span></li><li><span><a href="#Multiple-Plots-(small-multiple)" data-toc-modified-id="Multiple-Plots-(small-multiple)-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Multiple Plots (small multiple)</a></span></li><li><span><a href="#GEO-Data" data-toc-modified-id="GEO-Data-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>GEO Data</a></span><ul class="toc-item"><li><span><a href="#Basemap" data-toc-modified-id="Basemap-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>Basemap</a></span></li></ul></li></ul></li></ul></div>

# Storytelling w/ Data Visualization

In the **Exploratory Data Visualization** course, we learned how to use visualizations to explore and understand data. Because we were focused on exploring trends and getting familiar with the data, we didn't focus much on tweaking the appearance of the plots to make them more presentable to others. We instead focused on the workflow of quickly creating, tweaking, displaying, and iterating on plots.

In this course, we'll focus on how to use data visualization to communicate insights and tell stories. In this mission, we'll start with a standard matplotlib plot and improve its appearance to better communicate the patterns we want a viewer to understand. Along the way, we'll introduce the principles that informed those changes and provide a framework for you to apply them in the future. Here's a preview that demonstrates some of the improvements we make in this course:

## Line Plot
`import pandas as pd
import matplotlib.pyplot as plt
plt.plot(df.col_x, df.col_y)
plt.show()`

`plt.plot(df.Year, df.Biology, c='blue', label='Women')
plt.plot(df.Year, 100 - df.Biology, c='green', label='Men')
plt.title('Percentage of Biology Degrees Awarded By Gender')
plt.legend(loc='upper right')
plt.show()`

## Clean Visuals
Although our plot is better, it still contains some extra visual elements that aren't necessary to understand the data. We're interested in helping people understand the gender gap in different fields across time. These excess elements, sometimes known as [chartjunk](https://en.wikipedia.org/wiki/Chartjunk), increase as we add more plots for visualizing the other degrees, making it harder for anyone trying to interpret our charts. In general, we want to maximize the [data-ink ratio](https://infovis-wiki.net/wiki/Data-Ink_Ratio), which is the fractional amount of the plotting area dedicated to displaying the data.

The following is an animated GIF by [Darkhorse Analytics](https://www.darkhorseanalytics.com/blog/data-looks-better-naked) that shows a series of tweaks for boosting the data-ink ratio:

Non-data ink includes any elements in the chart that don't directly display data points. This includes tick markers, tick labels, and legends. Data ink includes any elements that display and depend on the data points underlying the chart. In a line chart, data ink would primarily be the lines and in a scatter plot, the data ink would primarily be in the markers. As we increase the data-ink ratio, we decrease non-data ink that can help a viewer understand certain aspects of the plots. We need to be mindful of this trade-off as we work on tweaking the appearance of plots to tell a story, because plots we create could end up telling the wrong story.

This principle was originally set forth by [Edward Tufte](https://en.wikipedia.org/wiki/Edward_Tufte), a pioneer of the field of data visualization. Tufte's first book, [The Visual Display of Quantitative Information](https://www.edwardtufte.com/tufte/books_vdqi), is considered a bible among information designers. We cover some of the ideas presented in the book in this course, but we recommend going through the entire book for more depth.

To improve the data-ink ratio, let's make the following changes to the plot we created in the last step:

- Remove all of the axis tick marks.
- Hide the spines, which are the lines that connect the tick marks, on each axis.

### Cleaning TIck Marks

To customize the appearance of the ticks, we use the `Axes.tick_params()` method. Using this method, we can modify which tick marks and tick labels are displayed. By default, matplotlib displays the tick marks on all four sides of the plot. Here are the four sides for a standard line chart:

- The left side is the y-axis.
- The bottom side is the x-axis.
- The top side is across from the x-axis.
- The right side is across from the y-axis.

The parameters for enabling or disabling tick marks are conveniently named after the sides. To hide all of them, we pass `False` for each parameter when we call `Axes.tick_params()`. Older matplotlib releases also accepted the string `"off"`, but current releases expect Boolean values and treat `"off"` as true:

- `bottom: False`
- `top: False`
- `left: False`
- `right: False`

`plt.tick_params(bottom=False, top=False, left=False, right=False)`
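As a minimal sketch of the step above, the following self-contained example hides the tick marks with Boolean values and then inspects one tick to confirm the result. The plotted numbers are placeholders rather than values from the dataset, and the `Agg` backend is an assumption made so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1970, 1980, 1990], [30, 40, 50])  # placeholder values, not real data

# Hide the tick marks on all four sides; the tick labels are left alone.
ax.tick_params(bottom=False, top=False, left=False, right=False)

# Each x-axis tick carries a bottom mark (tick1line) and a bottom label (label1).
first_x_tick = ax.xaxis.get_major_ticks()[0]
bottom_mark_visible = first_x_tick.tick1line.get_visible()
bottom_label_visible = first_x_tick.label1.get_visible()
```

The tick mark is hidden while the tick label remains visible, which is the balance this section aims for.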

### Cleaning Spines

With the axis tick marks gone, the data-ink ratio is improved and the chart looks much cleaner. In addition, the spines in the chart now are no longer necessary. When we're exploring data, the spines and the ticks complement each other to help us refer back to specific data points or ranges. When a viewer is viewing our chart and trying to understand the insight we're presenting, the ticks and spines can get in the way. As we mentioned earlier, chartjunk becomes much more noticeable when you have multiple plots in the same chart. By keeping the axis tick labels but not the spines or tick marks, we strike an appropriate balance between hiding chartjunk and making the data visible.

In matplotlib, the spines are represented using the `matplotlib.spines.Spine` class. When we create an Axes instance, four Spine objects are created for us. If you run `print(ax.spines)`, you'll get back a dictionary of the Spine objects:

`{'right': <matplotlib.spines.Spine object at 0x111089c18>, 'bottom': <matplotlib.spines.Spine object at 0x111060898>, 'top': <matplotlib.spines.Spine object at 0x1110606a0>, 'left': <matplotlib.spines.Spine object at 0x11107cd30>}`

To hide all of the spines, we need to:

- access each Spine object in the dictionary
- call the Spine.set_visible() method
- pass in the Boolean value False

The following line of code removes the spines for the right axis:

`ax.spines["right"].set_visible(False)`
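The three steps listed above can be sketched for all four spines at once with a short loop. This is an illustrative example with placeholder data, and the `Agg` backend is an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1970, 1980, 1990], [30, 40, 50])  # placeholder values, not real data

sides = ["left", "right", "top", "bottom"]

# Access each Spine object by its side name and hide it.
for side in sides:
    ax.spines[side].set_visible(False)

# Collect any spines that are still visible (expected to be none).
visible_spines = [side for side in sides if ax.spines[side].get_visible()]
```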

### Comparing multiple Line Charts, making them consistent

`major_cats = ['Biology', 'Computer Science', 'Engineering', 'Math and Statistics']
fig = plt.figure()
for sp in range(0,4):
    ax = fig.add_subplot(2,2,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[major_cats[sp]], c='blue', label='Women')
    ax.plot(women_degrees['Year'], 100-women_degrees[major_cats[sp]], c='green', label='Men')
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0, 100)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['top'].set_visible(False)
    # Booleans hide the tick marks; current matplotlib does not honor the string 'off'.
    ax.tick_params(bottom=False, top=False, left=False, right=False)
    ax.set_title(major_cats[sp])
plt.legend(loc='upper right')
plt.show()`

By spending just a few seconds reading the chart, we can conclude that Computer Science and Engineering have big gender gaps while the gaps in Biology and Math and Statistics are quite small. In addition, the first two degree categories are dominated by men while the latter two are much more balanced. This chart can still be improved, however, and we'll explore more techniques in the next mission.

In this mission, we explored how to enhance a chart's storytelling capabilities by minimizing chartjunk and encouraging comparison. In the next mission, we'll explore how to use color, spacing, and weights to further enhance the storytelling capability of the plots.

### Color Pallettes

If we want to publish the data visualizations we create, we need to be mindful of color blindness. Thankfully, there are color palettes we can use that are friendly for people with color blindness. One of them is called `Color Blind 10` and was released by Tableau, the company that makes the data visualization platform of the same name. Navigate to [this page](http://tableaufriction.blogspot.com/2012/11/finally-you-can-use-tableau-data-colors.html) and select just the `Color Blind 10` option from the list of palettes to see the ten colors included in the palette.

These numbers represent the **RGB values** for each color. The RGB color model describes how the three primary colors (red, green, and blue) can be combined in different proportions to form other colors. The RGB color model is very familiar to people who work in photography, filmography, graphic design, and any field that uses colors extensively. In computers, each RGB value can range between 0 and 255. This is because 256 integer values can be represented using 8 bits. You can read more about 8-bit color here.

The first color in the palette is a color that resembles dark blue and has the following RGB values:

- Red: 0
- Green: 107
- Blue: 164

To specify a line color using RGB values, we pass in a tuple of the values to the c parameter when we generate the line chart. Matplotlib expects each value to be scaled down and to range between 0 and 1 (not 0 and 255). In the following code, we scale the first color, which resembles dark blue, in the Color Blind 10 palette and set it as the line color:

`cb_dark_blue = (0/255,107/255,164/255)
ax.plot(women_degrees['Year'], women_degrees['Biology'], label='Women', c=cb_dark_blue)`
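The scaling step can be wrapped in a small helper so that every palette color is converted the same way. The sketch below is only an illustration: `scale_rgb` is a hypothetical helper name, not a matplotlib function, and the two colors are the dark blue and orange RGB values used in these notes.

```python
def scale_rgb(rgb):
    """Scale an (R, G, B) tuple from the 0-255 range to the 0-1 range matplotlib expects."""
    for value in rgb:
        if not 0 <= value <= 255:
            raise ValueError(f"RGB component out of range: {value}")
    return tuple(value / 255 for value in rgb)

# 8 bits per channel give 256 possible levels, numbered 0 through 255.
levels_per_channel = 2 ** 8

cb_dark_blue = scale_rgb((0, 107, 164))
cb_orange = scale_rgb((255, 128, 14))
```

Either tuple can then be passed to the `c` parameter of `Axes.plot()`.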

Color Composition 
`cb_dark_blue = (0/255,107/255,164/255)
cb_orange = (255/255,128/255,14/255)`

`ax.plot(women_degrees['Year'], women_degrees[major_cats[sp]], c=cb_dark_blue, label='Women')
ax.plot(women_degrees['Year'], 100-women_degrees[major_cats[sp]], c=cb_orange, label='Men')`

### Increase line widths

By default, the actual lines reflecting the underlying data in the line charts we've been generating are quite thin. The white color in the blank area in the line charts is still a dominating color. To emphasize the lines in the plots, we can increase the width of each line. Increasing the line width also improves the data-ink ratio a little bit, because more of the chart area is used to showcase the data.

When we call the `Axes.plot()` method, we can use the `linewidth` parameter to specify the line width. Matplotlib expects a float value for this parameter:

`ax.plot(women_degrees['Year'], women_degrees['Biology'], label='Women', c=cb_dark_blue, linewidth=2)`

The higher the line width, the thicker each line will be.
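To see the effect of the parameter, the following sketch draws one line with the default width and one with `linewidth=3`, then reads both widths back from the returned line objects. The data values are placeholders and the `Agg` backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Axes.plot() returns a list of Line2D objects, one per plotted line.
(default_line,) = ax.plot([1970, 1980, 1990], [30, 40, 50])
(thick_line,) = ax.plot([1970, 1980, 1990], [70, 60, 50], linewidth=3)

default_width = default_line.get_linewidth()
thick_width = thick_line.get_linewidth()
```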

### Plot Spacing

So far, we've been generating our line charts on a 2 by 2 subplot grid. If we wanted to visualize all six STEM degrees, we'd need to either add a new column or a new row. Unfortunately, neither solution orders the plots in a beneficial way to the viewer. By scanning horizontally or vertically, a viewer isn't able to learn any new information and this can cause some frustration as the viewer's gaze jumps around the image.

To make the viewing experience more coherent, we can:

- use a layout of a single row with multiple columns
- order the plots in decreasing order of initial gender gap

The leftmost plot has the largest gender gap in 1968 while the rightmost plot has the smallest gender gap in 1968. If we're instead interested in the recent gender gaps in STEM degrees, we can order the plots from largest to smallest ending gender gaps. Here's what that would look like:

In this exercise, you'll order the charts by decreasing ending gender gap. We've populated the list `stem_cats` with the six STEM degree categories, ordering them by decreasing ending gender gap. In the next step, we'll explore how we can replace the legend, which is currently overlapping with the rightmost line chart.
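One way to derive such an ordering, rather than typing it by hand, is to compute the gap for each category and sort on it. The sketch below assumes the gap is the absolute difference between the women's and men's percentages; the percentages are made-up placeholders, not values from the dataset, and `gender_gap` is a hypothetical helper.

```python
# Placeholder percentages of degrees awarded to women in the final year.
# These are illustrative numbers only, not taken from the real dataset.
women_pct_final_year = {
    "Biology": 60.0,
    "Engineering": 20.0,
    "Math and Statistics": 45.0,
}

def gender_gap(women_pct):
    """Absolute gap between the two shares, where the men's share is 100 - women_pct."""
    men_pct = 100 - women_pct
    return abs(women_pct - men_pct)

# Sort category names from the largest gap to the smallest.
ordered_cats = sorted(
    women_pct_final_year,
    key=lambda cat: gender_gap(women_pct_final_year[cat]),
    reverse=True,
)
```

With the real dataframe, the same idea would apply to the last row of each degree column.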

`stem_cats = ['Engineering', 'Computer Science', 'Psychology', 'Biology', 'Physical Sciences', 'Math and Statistics']
fig = plt.figure(figsize=(18, 3))
for sp in range(0,6):
    ax = fig.add_subplot(1,6,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[stem_cats[sp]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[stem_cats[sp]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0,100)
    ax.set_title(stem_cats[sp])
    ax.tick_params(bottom=False, top=False, left=False, right=False)
plt.legend(loc='upper right')
plt.show()`

### Legends

The purpose of a legend is to ascribe meaning to symbols or colors in a chart. We're using it to inform the viewer of what gender corresponds to each color. Tufte encourages removing legends entirely if the same information can be conveyed in a cleaner way. Legends consist of non-data ink and take up precious space that could be used for the visualizations themselves (data-ink).

Instead of trying to move the legend to a better location, we can replace it entirely by annotating the lines directly with the corresponding genders:

Notice that even the positions of the text annotations have meaning. In both plots, the annotation for `Men` is positioned above the orange line while the annotation for `Women` is positioned below the dark blue line. This positioning subtly suggests that men are a majority for the degree categories the line charts are representing (`Engineering` and `Math and Statistics`) and women are a minority for those degree categories.

Combined, these two observations suggest that we should stick with annotating just the leftmost and the rightmost line charts, prioritizing the data-ink ratio over the consistency of elements.

To add text annotations to a matplotlib plot, we use the Axes.text() method. This method has a few required parameters:

- x: x-axis coordinate (as a float)
- y: y-axis coordinate (as a float)
- s: the text we want in the annotation (as a string value)

The values in the coordinate grid match exactly with the data ranges for the x-axis and the y-axis. If we want to add text at the intersection of `1970` from the `x-axis` and `0` from the `y-axis`, we would pass in those values:

`ax.text(1970, 0, "starting point")`
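A small self-contained sketch of the call above: it sets the same axis limits used in these notes, adds the annotation, and reads back where matplotlib placed it. The `Agg` backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xlim(1968, 2011)
ax.set_ylim(0, 100)

# The x and y arguments are in data coordinates, matching the axis ranges.
annotation = ax.text(1970, 0, "starting point")

annotation_position = annotation.get_position()
annotation_text = annotation.get_text()
```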


`fig = plt.figure(figsize=(18, 3))
for sp in range(0,6):
    ax = fig.add_subplot(1,6,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[stem_cats[sp]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[stem_cats[sp]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0,100)
    ax.set_title(stem_cats[sp])
    ax.tick_params(bottom=False, top=False, left=False, right=False)
    if sp == 0:
        ax.text(2005, 87, 'Men')
        ax.text(2002, 8, 'Women')
    if sp == 5:
        ax.text(2005, 62, 'Men')
        ax.text(2001, 35, 'Women')
plt.show()`

In this mission, we learned how to improve the viewing experience by making our plots more color-blind friendly and thickening the line widths. We then explored how to use the layout and ordering of the plots, as well as annotations placed directly on the plots, to enhance the story that's being told to the viewer. Next in this course is a guided project, where we'll extend the work we did in this mission to all of the degree categories.

With seventeen line charts in one diagram, the non-data elements quickly clutter the field of view. The most immediate issue that sticks out is the titles of some line charts overlapping with the x-axis labels for the line chart above it. If we remove the titles for each line chart, the viewer won't know what degree each line chart refers to. Let's instead remove the x-axis labels for every line chart in a column except for the bottommost one. We can accomplish this by modifying the call to `Axes.tick_params()` and setting `labelbottom` to `False` (older matplotlib releases accepted the string `'off'`):

`ax.tick_params(bottom=False, top=False, left=False, right=False, labelbottom=False)`

This will disable the x-axis labels for all of the line charts. You can then enable the x-axis labels for the bottommost line charts in each column:

`ax.tick_params(labelbottom=True)`

Removing the x-axis labels for all but the bottommost plots solved the issue we noticed with the overlapping text. In addition, the plots are cleaner and more readable. The trade-off we made is that it's now more difficult for the viewer to discern approximately which years some interesting changes in trends may have happened. This is acceptable because we're primarily interested in enabling the viewer to quickly get a high level understanding of which degrees are prone to gender imbalance and how that has changed over time.

In the vein of reducing clutter, let's also simplify the y-axis labels. Currently, all seventeen plots have six y-axis labels and even though they are consistent across the plots, they still add to the visual clutter. By keeping just the starting and ending labels (`0` and `100`), we can keep some of the benefits of having the y-axis labels to begin with.

We can use the Axes.set_yticks() method to specify which labels we want displayed. The following code enables just the `0` and `100` labels to be displayed:

`ax.set_yticks([0,100])`

While removing most of the y-axis labels definitely reduced clutter, it also made it hard to understand which degrees have close to 50-50 gender breakdown. While keeping all of the y-axis labels would have made it easier, we can actually do one better and use a horizontal line across all of the line charts where the y-axis label `50` would have been.

We can generate a horizontal line across an entire subplot using the `Axes.axhline()` method. The main parameter is the y-axis location of the line:

`ax.axhline(50)`

Let's use the next color in the **Color Blind 10** palette for this horizontal line, which has an RGB value of (171, 171, 171). Because we don't want this line to clutter the viewing experience, let's increase the transparency of the line. We can set the color using the `c` parameter and the transparency using the `alpha` parameter. The value passed in to the `alpha` parameter must range between `0` and `1`:

`ax.axhline(50, c=(171/255, 171/255, 171/255), alpha=0.3)`
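Putting the last two steps together, the sketch below keeps only the `0` and `100` y-axis labels and adds the faint gray reference line at `50`, then reads the settings back. It is an illustrative example on an empty subplot, and the `Agg` backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_ylim(0, 100)

# Keep only the starting and ending y-axis labels.
ax.set_yticks([0, 100])

# Light gray from the Color Blind 10 palette, scaled to the 0-1 range.
cb_gray = (171/255, 171/255, 171/255)
mid_line = ax.axhline(50, c=cb_gray, alpha=0.3)

ytick_values = list(ax.get_yticks())
line_heights = list(mid_line.get_ydata())
line_alpha = mid_line.get_alpha()
```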

If you recall, matplotlib can be used many different ways. It can be used within a Jupyter Notebook interface (like this one), from the command line, or in an integrated development environment. Many of these ways of using matplotlib vary in workflow and handle the rendering of images differently as well. To help support these different use cases, matplotlib can target different outputs or **backends**. If you import matplotlib and run `matplotlib.get_backend()`, you'll see the specific backend you're currently using.

With the current backend we're using, we can use `Figure.savefig()` or `pyplot.savefig()` to export all of the plots contained in the figure as a single image file. Note that these have to be called before we display the figure using `pyplot.show()`:

`plt.plot(women_degrees['Year'], women_degrees['Biology'])
plt.savefig('biology_degrees.png')`

In the above code, we saved a line chart as a PNG file. You can read about the different popular file types for images [here](https://www.sitepoint.com/gif-png-jpg-which-one-to-use/). The image will be exported into the same folder that your Jupyter Notebook server is running in. You can click on the **Jupyter** logo to navigate the file system and find this image.

Exporting plots we create using matplotlib allows us to use them in Word documents, Powerpoint presentations, and even in emails.
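As a sketch of the export workflow, the following example checks the active backend, saves a figure to a temporary folder, and confirms that the file was written. The temporary folder and the placeholder data are assumptions made for illustration; in a notebook you would typically save next to the notebook file instead.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: a non-interactive backend that renders to files
import matplotlib.pyplot as plt

backend_name = matplotlib.get_backend()

fig, ax = plt.subplots()
ax.plot([1970, 1980, 1990], [30, 40, 50])  # placeholder values, not real data

# Save before showing or closing the figure; the extension selects the file type.
output_dir = tempfile.mkdtemp()
output_path = os.path.join(output_dir, "biology_degrees.png")
fig.savefig(output_path)
plt.close(fig)

file_was_written = os.path.exists(output_path)
```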

# SEABORN
So far, we've mostly worked with plots that are quick to analyze and make sense of. Line charts, scatter plots, and bar plots allow us to convey a few nuggets of insights to the reader. We've also explored how we can combine those plots in interesting ways to convey deeper insights and continue to extend the storytelling power of data visualization. In this mission, we'll explore how to quickly create multiple plots that are subsetted using one or more conditions.

We'll be working with the [seaborn](http://seaborn.pydata.org/) visualization library, which is built on top of matplotlib. Seaborn has good support for more complex plots, offers attractive default styles, and integrates well with the pandas library. Here are some examples of some complex plots that can be created using seaborn:

## How it works
Seaborn works similarly to the `pyplot` module from matplotlib. We primarily use seaborn interactively, by calling functions in its top level namespace. Like the `pyplot` module from matplotlib, seaborn creates a matplotlib figure or adds to the current, existing figure each time we generate a plot. When we're ready to display the plots, we call `pyplot.show()`.

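To get familiar with seaborn, we'll start by creating the familiar histogram. We can generate a histogram of the `Fare` column using the `seaborn.distplot()` function. Seaborn 0.11 and later deprecate `distplot()` in favor of `seaborn.histplot()`, which draws a similar plot when called with `kde=True`:

`import seaborn as sns    # seaborn is commonly imported as sns
import matplotlib.pyplot as plt
sns.distplot(titanic["Fare"])    # on seaborn 0.11+: sns.histplot(titanic["Fare"], kde=True)
plt.show()`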

Under the hood, seaborn creates a histogram using matplotlib, scales the axes values, and styles it. In addition, seaborn uses a technique called kernel density estimation, or KDE for short, to create a smoothed line chart over the histogram. If you're interested in learning about how KDE works, you can read more on [Wikipedia](https://en.wikipedia.org/wiki/Kernel_density_estimation).

What you need to know for now is that the resulting line is a smoother version of the histogram, called a kernel density plot. **Kernel density plots** are especially helpful when we're comparing distributions, which we'll explore later in this mission. When viewing a histogram, our visual processing systems influence us to smooth out the bars into a continuous line.

While having both the histogram and the kernel density plot is useful when we want to explore the data, it can be overwhelming for someone who's trying to understand the distribution. To generate just the kernel density plot, we use the `seaborn.kdeplot()` function:

`sns.kdeplot(titanic["Age"])`

While the distribution of data is displayed in a smoother fashion, it's now more difficult to visually estimate the area under the curve using just the line chart. When we also had the histogram, the bars provided a way to understand and compare proportions visually.

To bring back some of the ability to easily compare proportions, we can shade the area under the line using a single color. When calling the `seaborn.kdeplot()` function, we can shade the area under the line by setting the `shade` parameter to `True`. Newer seaborn releases call this parameter `fill` and deprecate `shade`.

`sns.kdeplot(titanic.Age, shade=True)
plt.xlabel('Age')
plt.show()`

From the plots in the previous step, you'll notice that seaborn:

- Sets the x-axis label based on the column name passed to the `plt.xlabel()` function
- Sets the background color to a light gray color
- Hides the x-axis and y-axis ticks
- Displays the coordinate grid

In the last few missions, we explored some general aesthetics guidelines for plots. The default seaborn style sheet gets some things right, like hiding axis ticks, and some things wrong, like displaying the coordinate grid and keeping all of the axis spines. We can use the `seaborn.set_style()` function to change the default seaborn style sheet. Seaborn comes with a few style sheets:

- `darkgrid`: Coordinate grid displayed, dark background color
- `whitegrid`: Coordinate grid displayed, white background color
- `dark`: Coordinate grid hidden, dark background color
- `white`: Coordinate grid hidden, white background color
- `ticks`: Coordinate grid hidden, white background color, ticks visible

Here's a diagram that compares the same plot across all styles:

By default, the seaborn style is set to "darkgrid":

`sns.set_style("darkgrid")`

If we change the style sheet using this method, all future plots will match that style in your current session. This means you need to set the style before generating the plot.
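To check which style is active before plotting, we can read the current settings back with `seaborn.axes_style()`. The sketch below switches to the `white` style and confirms that the coordinate grid is turned off; it assumes seaborn is installed and uses the `Agg` backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import seaborn as sns

# Set the style first; plots created afterwards pick it up.
sns.set_style("white")

# With no arguments, axes_style() returns the style settings currently in effect.
current_style = sns.axes_style()
grid_enabled = current_style["axes.grid"]
```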

To remove the axis spines for the top and right axes, we use the `seaborn.despine()` function:

`sns.despine()`

By default, only the top and right axes will be **despined**, or have their spines removed. To despine the other two axes, we need to set the `left` and `bottom` parameters to `True`:

`sns.set_style('white')
sns.kdeplot(titanic.Age, shade=True)
plt.xlabel('Age')
sns.despine(left=True, bottom=True)`
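The effect of `seaborn.despine()` can be verified by inspecting the spines afterwards. This sketch uses a plain matplotlib line with placeholder values in place of the kernel density plot, and it assumes seaborn is installed and that the `Agg` backend is acceptable.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()
ax.plot([0, 20, 40, 60], [0.01, 0.03, 0.02, 0.005])  # placeholder curve

# top and right are removed by default; left and bottom must be requested.
sns.despine(ax=ax, left=True, bottom=True)

spine_visibility = {
    side: ax.spines[side].get_visible()
    for side in ["top", "right", "left", "bottom"]
}
```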

## Multiple Plots (small multiple)

In the last few missions, we created a [small multiple](https://en.wikipedia.org/wiki/Small_multiple), which is a group of plots that have the same axis scales so the viewer can compare plots effectively. We accomplished this by subsetting the data manually and generating a plot using matplotlib for each one.

In seaborn, we can create a small multiple by specifying the conditioning criteria and the type of data visualization we want. For example, we can visualize the differences in age distributions between passengers who survived and those who didn't by creating a pair of kernel density plots. One kernel density plot would visualize the distribution of values in the `"Age"` column where `Survived` equalled `0` and the other would visualize the distribution of values in the `"Age"` column where `Survived` equalled `1`.

Here's what those plots look like:

`# Condition on unique values of the "Survived" column.
# seaborn 0.9+ names the per-plot height parameter height; older releases called it size.
g = sns.FacetGrid(titanic, col="Survived", height=6)
g.map(sns.kdeplot, "Age", shade=True)`

Seaborn handled:

- subsetting the data into rows where `Survived` is `0` and where `Survived` is `1`
- creating both Axes objects, ensuring the same axis scales
- plotting both kernel density plots

Instead of subsetting the data and generating each plot ourselves, seaborn allows us to express the plots we want as parameter values. The `seaborn.FacetGrid` object is used to represent the layout of the plots in the grid and the columns used for subsetting the data. The word "facet" from `FacetGrid` is another word for "subset". Setting the `col` parameter to `"Survived"` specifies a separate plot for each unique value in the `Survived` column. Setting the `height` parameter (called `size` before seaborn 0.9) to 6 specifies a height of 6 inches for each plot.

Once we've created the grid, we use the `FacetGrid.map()` method to specify the plot we want for each unique value of `Survived`. Seaborn generated one kernel density plot for the ages of passengers that survived and one kernel density plot for the ages of passengers that didn't survive.

The function that's passed into `FacetGrid.map()` has to be a valid matplotlib or seaborn function. For example, we can map matplotlib histograms to the grid:

`g = sns.FacetGrid(titanic, col="Survived", height=6)
g.map(plt.hist, "Age")`

Let's create a grid of plots that displays the age distributions for each class.

`g = sns.FacetGrid(titanic, col='Pclass', height=6)
g.map(sns.kdeplot, 'Age', shade=True)
sns.despine(left=True, bottom=True)
plt.show()`

We can use two conditions to generate a grid of plots, each containing a subset of the data with a unique combination of each condition. When creating a `FacetGrid`, we use the `row` parameter to specify the column in the dataframe we want used to subset across the rows in the grid. The best way to understand this is to see a working example.

`g = sns.FacetGrid(titanic, col="Pclass", row="Survived")
g.map(sns.kdeplot, "Age", shade=True)
sns.despine(left=True, bottom=True)
plt.show()`

When subsetting data using two conditions, the rows in the grid represent one condition while the columns represent another. We can express a third condition by generating multiple plots on the same subplot in the grid and coloring them differently. Thankfully, we can add a condition just by setting the `hue` parameter to the column name from the dataframe.

Let's add a new condition to the grid of plots we generated in the last step and see what this grid of plots would look like.

`g = sns.FacetGrid(titanic, col="Survived", row="Pclass", hue='Sex', height=3)
g.map(sns.kdeplot, "Age", shade=True)
sns.despine(left=True, bottom=True)
plt.show()`

Now that we're coloring plots, we need a legend to keep track of which value each color represents. As a challenge to you, we won't specify how exactly to generate a legend in seaborn. Instead, we encourage you to use the examples from the [page](http://seaborn.pydata.org/generated/seaborn.FacetGrid.html#seaborn.FacetGrid) on plotting using the `FacetGrid` instance.

Here's what we want the final grid to look like:

`g = sns.FacetGrid(titanic, col='Survived', row='Pclass', hue='Sex', height=3)
g.map(sns.kdeplot, 'Age', shade=True).add_legend()
sns.despine(left=True, bottom=True)
plt.show()`

## GEO Data
From scientific fields like meteorology and climatology, through to the software on our smartphones like Google Maps and Facebook check-ins, geographic data is always present in our everyday lives. Raw geographic data like latitudes and longitudes are difficult to understand using the data charts and plots we've discussed so far. To explore this kind of data, you'll need to learn how to visualize the data on maps.

In this mission, we'll explore the fundamentals of geographic coordinate systems and how to work with the basemap library to plot geographic data points on maps.

We can explore a range of interesting questions and ideas using these datasets:

- For each airport, which destination airport is the most common?
- Which cities are the most important hubs for airports and airlines?

Before diving into coordinate systems, explore the datasets in the code cell below.

In most cases, we want to visualize latitude and longitude points on two-dimensional maps. Two-dimensional maps are faster to render, easier to view on a computer and distribute, and closer to the experience of popular mapping software like Google Maps. Latitude and longitude values describe points on a sphere, which is three-dimensional. To plot the values on a two-dimensional plane, we need to convert the coordinates to the Cartesian coordinate system using a **map projection**.

A [map projection](https://en.wikipedia.org/wiki/Map_projection) transforms points on a sphere to a two-dimensional plane. When projecting down to the two-dimensional plane, some properties are distorted. Each map projection makes trade-offs in what properties to preserve and you can read about the different trade-offs [here](https://en.wikipedia.org/wiki/Map_projection#Metric_properties_of_maps). We'll use the [Mercator projection](https://en.wikipedia.org/wiki/Mercator_projection), because it is commonly used by popular mapping software.
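To make the idea of a projection concrete, here is a minimal sketch of the spherical Mercator formulas in plain Python. The helper name `mercator_xy` and the Earth radius constant are assumptions made for illustration. Basemap's own implementation (its ellipsoid and choice of origin) may differ, so these numbers will not necessarily match what basemap returns.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius in meters (illustrative)

def mercator_xy(lon_deg, lat_deg, radius=EARTH_RADIUS_M):
    """Project a longitude/latitude pair (in degrees) with the spherical Mercator formulas."""
    lam = math.radians(lon_deg)  # longitude in radians
    phi = math.radians(lat_deg)  # latitude in radians
    x = radius * lam
    y = radius * math.log(math.tan(math.pi / 4 + phi / 2))
    return x, y

# Longitude first, then latitude
x, y = mercator_xy(39.956589, 43.449928)
print(x, y)
```

Note that `y` grows without bound as the latitude approaches 90 degrees, which is one of the distortions Mercator trades for its other properties, and why Mercator maps are usually cut off before the poles.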

### Basemap

Before we convert our flight data to Cartesian coordinates and plot it, let's learn more about the [basemap toolkit](https://matplotlib.org/basemap/). Basemap is an extension to Matplotlib that makes it easier to work with geographic data. The [documentation for basemap](https://matplotlib.org/basemap/users/intro.html) provides a good high-level overview of what the library does:

> The matplotlib basemap toolkit is a library for plotting 2D data on maps in Python. Basemap does not do any plotting on its own, but provides the facilities to transform coordinates to one of 25 different map projections.

Basemap makes it easy to convert from the spherical coordinate system (latitudes & longitudes) to the Mercator projection. While basemap uses Matplotlib to actually draw and control the map, the library provides many methods that enable us to work with maps quickly. Before we dive into how basemap works, let's get familiar with how to install it.

The Basemap library has some external dependencies, which Anaconda installs for us. To test the installation, run the following import code:

`from mpl_toolkits.basemap import Basemap`

If an error is returned, we recommend searching for similar errors on StackOverflow to help debug the issue. Because basemap uses matplotlib, you'll want to import `matplotlib.pyplot` into your environment when you use Basemap.
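If you'd rather check the installation without stopping on an error, one option is to wrap the import in a small helper. This is only a sketch: `basemap_available` is a hypothetical name, and it only catches the case where the import itself fails.

```python
def basemap_available():
    """Return True if the basemap toolkit can be imported, False otherwise."""
    try:
        from mpl_toolkits.basemap import Basemap  # noqa: F401
    except ImportError:
        return False
    return True

print("basemap installed:", basemap_available())
```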

Here's what the general workflow will look like when working with two-dimensional maps:

- Create a new basemap instance with the specific map projection we want to use and how much of the map we want included.
- Convert spherical coordinates to Cartesian coordinates using the basemap instance.
- Use the matplotlib and basemap methods to customize the map.
- Display the map.

Let's focus on the first step and create a new basemap instance. To create a new instance of the `Basemap` class, we call the `Basemap` constructor and pass in values for the parameters that define the map:

- `projection`: the map projection.
- `llcrnrlat`: latitude of lower left hand corner of the desired map domain
- `urcrnrlat`: latitude of upper right hand corner of the desired map domain
- `llcrnrlon`: longitude of lower left hand corner of the desired map domain
- `urcrnrlon`: longitude of upper right hand corner of the desired map domain

As we mentioned before, we need to convert latitude and longitude values to Cartesian coordinates to display them on a two-dimensional map. We can pass lists of longitude and latitude values into the basemap instance, and it will return converted lists of longitude and latitude values using the projection we specified earlier. We'll pass the values as lists, so we use `Series.tolist()` to convert the `longitude` and `latitude` columns from the `airports` dataframe. Then, we pass them to the basemap instance, longitude values first and latitude values second:

`x, y = m(longitudes, latitudes)`

The basemap object will return 2 list objects, which we assign to `x` and `y`. Finally, we display the first 5 elements of the original longitude values, original latitude values, the converted longitude values, and the converted latitude values.
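The conversion steps can be sketched as follows. The tiny `airports` dataframe here is a stand-in built from three coordinate pairs that appear in the route examples in this section, not the real dataset, and the projection step is skipped if basemap isn't installed.

```python
import pandas as pd

# Stand-in for the `airports` dataframe (illustrative; the real one has many more rows)
airports = pd.DataFrame({
    "longitude": [39.956589, 49.278728, 48.006278],
    "latitude": [43.449928, 55.606186, 46.283333],
})

# Convert the columns to plain Python lists
longitudes = airports["longitude"].tolist()
latitudes = airports["latitude"].tolist()

try:
    from mpl_toolkits.basemap import Basemap
except ImportError:
    Basemap = None  # basemap not installed; skip the projection step

if Basemap is not None:
    m = Basemap(projection="merc", llcrnrlat=-80, urcrnrlat=80,
                llcrnrlon=-180, urcrnrlon=180)
    x, y = m(longitudes, latitudes)  # longitude first, then latitude
    print(longitudes[:5], latitudes[:5])  # original values
    print(x[:5], y[:5])                   # converted values
```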

Now that the data is in the right format, we can plot the coordinates on a map. A scatter plot is the simplest way to plot points on a map, where each point is represented as an (x, y) coordinate pair. To create a scatter plot from a list of `x` and `y` coordinates, we use the `basemap.scatter()` method.

`m.scatter(x,y)`

The `basemap.scatter()` method takes similar parameters to `pyplot.scatter()`. For example, we can customize the size of each marker using the `s` parameter:

`m.scatter(x,y,s=10)
m.scatter(x,y,s=5)`

After we've created the scatter plot, we use `plt.show()` to display it. We'll dive more into customizing the plot in the next step. For now, create a simple scatter plot.

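`import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
# longitudes and latitudes are the lists built from the airports dataframe
m = Basemap(projection='merc', llcrnrlat=-80, urcrnrlat=80, llcrnrlon=-180, urcrnrlon=180)
x, y = m(longitudes, latitudes)
m.scatter(x, y, s=1)
plt.show()`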

You'll notice that the coastlines of each continent are missing from the map above. We can display them using the `basemap.drawcoastlines()` method.

Because basemap uses matplotlib under the hood, we can interact with the matplotlib classes that basemap uses directly to customize the appearance of the map.

We can add code that:

- uses `pyplot.subplots()` to specify the `figsize` parameter
- returns the Figure and Axes objects for a single subplot and assigns them to `fig` and `ax` respectively
- uses the `Axes.set_title()` method to set the map title
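The steps above can be sketched as follows. The figure size and the title text are illustrative choices, not values prescribed here, and the basemap calls are skipped if the library isn't installed.

```python
import matplotlib.pyplot as plt

# Figure and Axes for a single subplot; the size is an illustrative choice
fig, ax = plt.subplots(figsize=(15, 20))
ax.set_title("Airports on a Mercator map")  # hypothetical title

try:
    from mpl_toolkits.basemap import Basemap
except ImportError:
    Basemap = None  # basemap not installed; only the empty figure is created

if Basemap is not None:
    # Pass `ax` so basemap draws on the Axes we just created
    m = Basemap(projection="merc", llcrnrlat=-80, urcrnrlat=80,
                llcrnrlon=-180, urcrnrlon=180, ax=ax)
    m.drawcoastlines()
    # m.scatter(x, y, s=1) would go here, using the converted coordinates

plt.show()
```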

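To better understand the flight routes, we can draw **great circles** to connect starting and ending locations on a map. A great circle is the largest circle that can be drawn on a sphere, and the shortest path between 2 points on a sphere runs along the great circle that passes through both.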
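As a rough illustration of "shortest path on a sphere", the haversine formula gives the length of the great-circle arc between two points. This is a sketch that assumes a perfectly spherical Earth with an assumed mean radius. `great_circle_km` is a hypothetical helper, not part of basemap.

```python
import math

def great_circle_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance between two lon/lat points (in degrees) via the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point rounding before asin
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))

# Longitude first, then latitude, for the start and end points
print(great_circle_km(39.956589, 43.449928, 49.278728, 55.606186))
```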

On a two-dimensional map, the great circle is drawn as a line because it is projected from three dimensions down to two using the map projection. We can use these lines to visualize the flight routes from the `routes` dataframe. To plot great circles, we need the source longitude, source latitude, destination longitude, and destination latitude for each route. While the `routes` dataframe contains the source and destination airports for each route, the latitude and longitude values for each airport are in a separate dataframe (`airports`).

To make things easier, we've created a new CSV file called `geo_routes.csv` that contains the latitude and longitude values corresponding to the source and destination airports for each route. We've also removed some columns we won't be working with.

We use the `basemap.drawgreatcircle()` method to display a great circle between 2 points. The `basemap.drawgreatcircle()` method requires four parameters in the following order:

- `lon1` - longitude of the starting point.
- `lat1` - latitude of the starting point.
- `lon2` - longitude of the ending point.
- `lat2` - latitude of the ending point.

The following code generates a great circle for the first three routes in the dataframe:

`m.drawgreatcircle(39.956589, 43.449928, 49.278728, 55.606186)
m.drawgreatcircle(48.006278, 46.283333, 49.278728, 55.606186)
m.drawgreatcircle(39.956589, 43.449928, 43.081889, 44.225072)`

Unfortunately, basemap struggles to create great circles for routes where the absolute difference between the starting and ending values is larger than 180 degrees for either latitude or longitude. This is because the `basemap.drawgreatcircle()` method isn't able to create great circles properly when they go outside of the map boundaries. This is mentioned briefly in the documentation for the method:

> **Note**: Cannot handle situations in which the great circle intersects the edge of the map projection domain, and then re-enters the domain.
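One way to work around this limitation is to skip routes that cross the 180 degree threshold before drawing. The sketch below uses a hypothetical helper, `is_drawable`, and the three routes shown earlier. The drawing loop is left as a comment because it needs an existing basemap instance `m`.

```python
def is_drawable(lon1, lat1, lon2, lat2):
    """True when neither the latitude nor the longitude differs by 180 degrees or more."""
    return abs(lat2 - lat1) < 180 and abs(lon2 - lon1) < 180

# (lon1, lat1, lon2, lat2) tuples for the three routes shown earlier
routes_to_draw = [
    (39.956589, 43.449928, 49.278728, 55.606186),
    (48.006278, 46.283333, 49.278728, 55.606186),
    (39.956589, 43.449928, 43.081889, 44.225072),
]

# Keep only the routes that basemap should be able to draw
drawable = [route for route in routes_to_draw if is_drawable(*route)]
print(len(drawable), "of", len(routes_to_draw), "routes can be drawn")

# With a basemap instance `m` available, each remaining route could be drawn with:
# for lon1, lat1, lon2, lat2 in drawable:
#     m.drawgreatcircle(lon1, lat1, lon2, lat2)
```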

In this mission, we learned how to visualize geographic data using basemap. This is the last mission in the Storytelling Through Data Visualization course. You should now have a solid foundation in data visualization for exploring data and communicating insights. We encourage you to keep exploring data visualization on your own. Here are some suggestions for what to do next:

- Plotting tools:
    - [Creating 3D plots using Plotly](https://plot.ly/python/3d-scatter-plots/)
    - [Creating interactive visualizations using bokeh](https://bokeh.pydata.org/en/latest/)
    - [Creating interactive map visualizations using folium](http://python-visualization.github.io/folium/)
    
- The art and science of data visualization:
    - [Visual Display of Quantitative Information](https://www.amazon.com/Visual-Display-Quantitative-Information/dp/0961392142)
    - [Visual Explanations: Images and Quantities, Evidence and Narrative]()