<img style="float: center;" src="images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, Jonathan Morgan, and Ridhima Sodhi. "ADA-KCMO-2018." Coleridge Initiative GitHub Repositories. 2018. https://github.com/Coleridge-Initiative/ada-kcmo-2018. [![DOI](https://zenodo.org/badge/119078858.svg)](https://zenodo.org/badge/latestdoi/119078858)

# Data Visualization in Python
---

## Table of Contents
- [Introduction](#Introduction)
    - [Learning Objectives](#Learning-Objectives)
- [Python Setup](#Python-Setup)
- [Load the Data](#Load-the-Data)
- [Our First Chart in `matplotlib`](#Our-First-Chart-in-matplotlib)
    - [A Note on Data Sourcing](#A-Note-on-Data-Sourcing)
    - [Layering in `matplotlib`](#Layering-in-matplotlib)
- [Our First Chart in `seaborn`](#Our-First-Chart-in-seaborn)
- [Choosing a Data Visualization Package](#Choosing-a-Data-Visualization-Package)
    - [Combining `seaborn` and `matplotlib`](#Combining-seaborn-and-matplotlib)
- [Visual Encodings](#Visual-Encodings)
    - [Using Hex Codes for Color](#Using-Hex-Codes-for-Color)
    - [Saving Charts as a Variable](#Saving-Charts-as-a-Variable)
    - [An Important Note on Graph Titles](#An-Important-Note-on-Graph-Titles)
- [Exporting Completed Graphs](#Exporting-Completed-Graphs)
- [Exercises & Practice](#Exercises-&-Practice)
- [Additional Resources](#Additional-Resources)

## Introduction
- Back to [Table of Contents](#Table-of-Contents)

In this module, you will learn to quickly and flexibly make a wide range of visualizations, both for exploratory data analysis and for communicating with your audience. This module contains a practical introduction to data visualization in Python and covers important rules that any data visualizer should follow.

### Learning Objectives

* Learn critical rules about data visualization (using the correct graph types, correctly labeling all visual encodings, properly sourcing data).

* Become familiar with a core base of data visualization tools in Python - specifically matplotlib and seaborn.

* Start to develop the ability to conceptualize what visualizations are going to best reveal various types of patterns in your data.

* Learn more about class data with exploratory analyses.

## Python Setup
- Back to [Table of Contents](#Table-of-Contents)

In [None]:
import pandas as pd

import matplotlib as mplib
import matplotlib.pyplot as plt # visualization package
import seaborn as sns

# database connections
from sqlalchemy import create_engine # to get data from database
from sqlalchemy import __version__ as sql_version
from sqlalchemy import inspect

# so images get plotted in the notebook
%matplotlib inline

## Load the Data
- Back to [Table of Contents](#Table-of-Contents)

In [None]:
# set up sqlalchemy engine
engine = create_engine('postgresql://10.10.2.10/appliedda')

Let's focus on the employer data. The employer EIN is a unique identification number for every organization. Grouping by EIN, let's take a look at the total number of employees and the total wages by company during three distinct quarters: the first quarter of 2006, the first quarter of 2010, and the first quarter of 2016.

In [None]:
# We can look at column names within the employers table:
query = '''
SELECT * 
FROM information_schema.columns 
WHERE table_schema = 'kcmo_lehd' AND table_name = 'mo_qcew_employers'
'''
pd.read_sql(query, engine)

In [None]:
# For 3 distinct quarters, let's look at total number of employees and total wages per EIN

select_string = '''
SELECT ein
        , year
        , qtr
        , SUM(mon1_empl) AS total_empl1
        , SUM(mon2_empl) AS total_empl2
        , SUM(mon3_empl) AS total_empl3
        , SUM(total_wage) AS total_wage
FROM kcmo_lehd.mo_qcew_employers
WHERE year in (2006, 2010, 2016)
        AND ein != '' 
        AND qtr = 1
        AND total_wage > 0 
        AND mon1_empl > 0
        AND mon2_empl > 0
        AND mon3_empl > 0
GROUP BY ein, year, qtr;
'''
ein_empl = pd.read_sql(select_string, engine)

print("Number of rows returned: " + str(len(ein_empl)))

In [None]:
pd.crosstab(index = ein_empl['year'], columns = 'count')

Let's define the average employer monthly wage as the total wages over the quarter divided by the sum of the number of employees in every month of the quarter.
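
As a quick check of this definition, here is a minimal worked example with made-up numbers (the employer, its wages, and its headcounts are hypothetical, not taken from the class data):

```python
# Hypothetical employer: total wages paid in one quarter
# and the headcount reported in each of the quarter's three months.
total_wage = 90000.0
monthly_employees = [9, 10, 11]

# Summing the monthly headcounts gives the number of employee-months.
employee_months = sum(monthly_employees)

# Wages per employee-month, i.e. the average monthly wage.
avg_monthly_wage = total_wage / employee_months
print(avg_monthly_wage)  # 3000.0
```

Dividing by the sum of the three monthly headcounts, rather than by a single month's headcount, is what makes the result a monthly figure.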

In [None]:
ein_empl['total_empl'] = ein_empl['total_empl1']+ein_empl['total_empl2']+ein_empl['total_empl3']
ein_empl['avg_wage'] = ein_empl['total_wage']/ein_empl['total_empl']

In [None]:
ein_empl.head()

## Our First Chart in `matplotlib`
- Back to [Table of Contents](#Table-of-Contents)

Below, we make our first chart in matplotlib. We'll come back to the choice of this particular library in a second, but for now just appreciate that the visualization is creating sensible scales, tick marks, and gridlines on its own.

In [None]:
# Make a simple histogram:
plt.hist(ein_empl['avg_wage'])
plt.show()

The chart only shows us one bar. What is the distribution of our data? 

In [None]:
ein_empl['avg_wage'].describe(percentiles = [.01, .1, .25, .5, .75, .9, .99])

Since the distribution of average wages is very skewed to the right, let's limit our data to average wages under $8,000 a month.
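
To see why a single extreme value squeezes everything into one bar, the sketch below sorts a small set of made-up wages into ten equal-width bins, which is what `plt.hist` does by default. The wage values and the `equal_width_counts` helper are illustrative assumptions, not part of the class data or of `matplotlib`.

```python
# Hypothetical monthly wages: most are a few thousand dollars, one is extreme.
wages = [1800, 2100, 2500, 2900, 3200, 3600, 4100, 4800, 5500, 950000]


def equal_width_counts(values, bins=10):
    """Count how many values fall in each of `bins` equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        # The largest value is placed in the last bin rather than a new one.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


all_counts = equal_width_counts(wages)
trimmed_counts = equal_width_counts([w for w in wages if w < 8000])

print(all_counts)      # [9, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(trimmed_counts)  # [2, 1, 1, 1, 1, 0, 1, 0, 1, 1]
```

With the outlier included, each bin is almost 95,000 dollars wide, so every ordinary wage lands in the first bin. Once the outlier is dropped, the same ten bins spread across the range we actually care about.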

In [None]:
## Average wages per company often have a very strong right skew:
max_wage = ein_empl['avg_wage'].max()
print("Maximum average company wage = " + str(max_wage))

## But most companies have an average wage of under $8000 per month:
(ein_empl['avg_wage'] < 8000).value_counts()

In [None]:
## So let's just look at companies with average wages under $8000 per month
ein_empl_lim = ein_empl[ein_empl['avg_wage'] < 8000]

# Make a simple histogram:
plt.hist(ein_empl_lim['avg_wage'])
plt.show()

In [None]:
## We can change options within the hist function (e.g. number of bins, color, transparency):
plt.hist(ein_empl_lim['avg_wage'], bins=20, facecolor="purple", alpha=0.5)

## And we can affect the plot options too:
plt.xlabel('Average Monthly Wage')
plt.ylabel('Number of Employers')
plt.title('Most Employers Pay Under $5,000 per Month')

## And add Data sourcing:
### xy are given as fractions of the axes' width and height, measured from the bottom left of the graph:
plt.annotate('Source: MO Department of Labor', xy=(0.7,-0.2), xycoords="axes fraction")

## We use plt.show() to display the graph once we are done setting options:
plt.show()

### A Note on Data Sourcing
- Back to [Table of Contents](#Table-of-Contents)

Data sourcing is a critical aspect of any data visualization. Although here we are simply referencing the agencies that created the data, it is ideal to give the viewer as direct a path as possible to the data the graph is based on. When this is not possible (e.g. the data is sequestered), directing the viewer to documentation or methodology for the data is a good alternative. Regardless, providing clear sourcing for the underlying data is an **absolute requirement** of any respectable visualization; it also builds trust and enables reproducibility.

### Layering in `matplotlib`
- Back to [Table of Contents](#Table-of-Contents)

This functionality - where we can make consecutive changes to the same plot - also allows us to layer on multiple plots. By default, the first graph you create will be at the bottom, with ensuing graphs on top.

Below, the 2016 histogram, in blue, sits beneath the 2006 histogram, in orange. You might also notice that since 2006, the number of employers paying high average wages has increased, but the number of companies paying an average monthly wage of around $2,000 has decreased.

In [None]:
plt.hist(ein_empl_lim[ein_empl_lim['year'] == 2016].avg_wage, facecolor="blue", alpha=0.5)
plt.hist(ein_empl_lim[ein_empl_lim['year'] == 2006].avg_wage, facecolor="orange", alpha=0.5)
plt.annotate('Source: MO Department of Labor', xy=(0.7,-0.2), xycoords="axes fraction")
plt.show()
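
The layered chart above relies on the reader knowing which color belongs to which year. One way to make that explicit is to give each layer a `label` and then call `legend()`. The sketch below uses made-up wage values and shared bin edges (both assumptions for illustration) rather than the class data, and it selects the non-interactive `Agg` backend only so that it also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # only needed outside Jupyter; skip this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical average monthly wages for a handful of employers.
wages_2016 = [2400, 2900, 3300, 3600, 4100, 4700, 5600]
wages_2006 = [1900, 2100, 2200, 2600, 3100, 3300, 4200]

# Shared bin edges so the bars of both layers line up.
shared_bins = list(range(0, 8001, 1000))

fig, ax = plt.subplots()
# The first call is drawn at the bottom, the second on top of it.
ax.hist(wages_2016, bins=shared_bins, facecolor="blue", alpha=0.5, label="2016")
ax.hist(wages_2006, bins=shared_bins, facecolor="orange", alpha=0.5, label="2006")
ax.legend(title="Year")

# Read the legend entries back to confirm both layers are labeled.
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
plt.close(fig)
```

Passing the same `bins` to both calls is a design choice: without it, each histogram picks its own bin edges and the two layers are harder to compare.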

## Our First Chart in `seaborn`
- Back to [Table of Contents](#Table-of-Contents)

Below, we quickly use pandas to create an aggregation of our employer data - the average wage by year. Then we pass the data to the barplot function in the `seaborn` package, which, recall, we imported as `sns` for short.

In [None]:
## Calculate average wages by year:
overall_avg_wage = ein_empl.groupby('year')[['total_empl', 'total_wage']].sum().reset_index()
overall_avg_wage['average_wages'] = overall_avg_wage['total_wage']/overall_avg_wage['total_empl']
overall_avg_wage.columns = ['year', 'total_empl', 'total_wage', 'average_wages']

print(type(overall_avg_wage))
print("***********")
print(overall_avg_wage)

In [None]:
## Barplot function
# Note we can reference column names (in quotes) in the specified data:
sns.barplot(x='year', y='average_wages', data=overall_avg_wage)
plt.show()

You might notice that if you don't include `plt.show()`, Jupyter will still produce a chart. However, this is not the case in other environments, so we will continue using `plt.show()` to explicitly ask Python to display the chart we have constructed, after adding all layers and setting all options.

In [None]:
## Seaborn has a great series of charts for showing distributions across a categorical variable:
sns.factorplot(x='year', y='avg_wage', hue='year', data=ein_empl_lim, kind='box')
plt.show()

## Other options for the 'kind' argument include 'bar' and 'violin'

Already you might notice some differences between matplotlib and seaborn - at the very least seaborn allows us to more easily reference column names within a pandas dataframe, whereas matplotlib clearly has a plethora of options.

## Choosing a Data Visualization Package

- Back to [Table of Contents](#Table-of-Contents)

There are many excellent data visualization modules available in Python, but for the tutorial we will stick to the tried and true combination of `matplotlib` and `seaborn`. You can read more about different options for data visualization in Python in the [Additional Resources](#Additional-Resources) section at the bottom of this notebook. 

`matplotlib` is very expressive, meaning it has functionality that can easily account for fine-tuned graph creation and adjustment. However, this also means that `matplotlib` is somewhat more complex to code.

`seaborn` is a higher-level visualization module, which means it is much less expressive and flexible than matplotlib, but far more concise and easier to code.

It may seem like we need to choose between these two approaches, but this is not the case! Since `seaborn` is itself built on top of `matplotlib` (you will sometimes see `seaborn` called a `matplotlib` 'wrapper'), we can use `seaborn` for making graphs quickly and then `matplotlib` for specific adjustments. When you see `plt` referenced in the code below, we are using `matplotlib`'s pyplot submodule.


`seaborn` also improves on `matplotlib` in important ways, such as making it easier to visualize regression model results, creating small multiples, offering better color palettes, and improving default aesthetics. From [`seaborn`'s documentation](https://seaborn.pydata.org/introduction.html):

> If matplotlib 'tries to make easy things easy and hard things possible', seaborn tries to make a well-defined set of hard things easy too. 

In [None]:
## Seaborn offers a powerful tool called FacetGrid for making small multiples of matplotlib graphs:

### Create an empty set of grids:
facet_histograms = sns.FacetGrid(ein_empl_lim, col='year', hue='year')

## 'map' a histogram to each grid:
facet_histograms = facet_histograms.map(plt.hist, 'avg_wage')

## Data Sourcing:
plt.annotate('Source: MO Department of Labor', xy=(0.6,-0.35), xycoords="axes fraction")
plt.show()

In [None]:
## Alternatively, you can create and save several charts:
for i in set(ein_empl_lim["year"]):
    tmp = ein_empl_lim[ein_empl_lim["year"] == i]
    plt.hist(tmp["avg_wage"])

    plt.xlabel('Average Monthly Wage')
    plt.ylabel('Number of Employers')
    plt.title(str(i))
    
    plt.annotate('Source: MO Department of Labor', xy=(0.7,-0.2), xycoords="axes fraction")

    filename = "output/graph_" + str(i) + ".pdf"
    plt.savefig(filename)
    plt.show()

### Combining `seaborn` and `matplotlib` 
- Back to [Table of Contents](#Table-of-Contents)

Below, we use `seaborn` for setting an overall aesthetic style and then faceting (creating small multiples). We then use `matplotlib` to make very specific adjustments - things like adding the title, adjusting the locations of the plots, and sizing the graph space. This is a pretty prototypical use of the power of these two libraries together. 

More on [`seaborn`'s set_style function](https://seaborn.pydata.org/generated/seaborn.set_style.html).
More on [`matplotlib`'s figure (fig) API](https://matplotlib.org/api/figure_api.html).

In [None]:
# Seaborn's set_style function allows us to set many aesthetic parameters.
sns.set_style("whitegrid")

facet_histograms = sns.FacetGrid(ein_empl_lim, col='year', hue='year')
facet_histograms.map(plt.hist, 'avg_wage')

## We can still change options with matplotlib, using facet_histograms.fig
facet_histograms.fig.subplots_adjust(top=0.85)
facet_histograms.fig.suptitle("Employer Average Monthly Wages Improved since 2006", fontsize=14)
facet_histograms.fig.set_size_inches(10,5)

## Add a legend for hue (color):
facet_histograms = facet_histograms.add_legend()

## Data Sourcing:
plt.annotate('Source: MO LEHD', xy=(0.6,-0.35), xycoords="axes fraction")
plt.show()

## Visual Encodings

- Back to [Table of Contents](#Table-of-Contents)

We often start with charts that use 2-dimensional position (like a scatterplot) or that use height (like histograms and bar charts). This is because these visual encodings - the visible mark that represents the data - are particularly perceptually strong. This means that when humans view these visual encodings, they are more accurate in estimating the underlying numbers than encodings like size (think circle size in a bubble chart) or angle (e.g. pie chart).

For more information on visual encodings and data visualization theory, see:

* [Designing Data Visualizations, Chapter 4](http://www.safaribooksonline.com/library/view/designing-data-visualizations/9781449314774/ch04.html) by Julie Steele and Noah Iliinsky

* Now You See It - book by Stephen Few

In [None]:
select_string = "SELECT year, SUM(mon1_empl + mon2_empl + mon3_empl) AS total_empl, SUM(total_wage) AS total_wages"
select_string += " FROM kcmo_lehd.mo_qcew_employers"
select_string += " GROUP BY year"

yearly_avg_wages = pd.read_sql(select_string, engine)

In [None]:
yearly_avg_wages['avg_wage'] = yearly_avg_wages['total_wages']/yearly_avg_wages['total_empl']
yearly_avg_wages = yearly_avg_wages.sort_values('year')
yearly_avg_wages

In [None]:
## We can pass a single column of values to the tsplot function to get a simple line chart:
sns.tsplot(data=yearly_avg_wages['avg_wage'], color="#179809")

## Data Sourcing:
plt.annotate('Source: MO Department of Labor', xy=(0.8,-0.20), xycoords="axes fraction")
plt.show()

### Using Hex Codes for Color
- Back to [Table of Contents](#Table-of-Contents)

In the graph above, you can see I set the color of the graph with a pound sign `#` followed by a series of six characters. This is a hex code, which is short for hexadecimal code. A hexadecimal code lets you specify one of over 16 million colors using combinations of red, green, and blue. It first has two digits for red, then two digits for green, and lastly two digits for blue: `#RRGGBB`

Further, each digit can take one of sixteen values (thus hexadecimal), in this order:

(0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F)

Over time, it gets easier to read these codes. For instance, above, I used the hex code "#179809". Understanding how hex codes work, I can see that there is a relatively low number for red (17) and fairly high number for green (98) and another low number for blue (09). Thus it shouldn't be too surprising that a green color resulted in the graph.
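
To make the arithmetic concrete, here is a small sketch that splits a hex code into its red, green, and blue components on the usual 0 to 255 scale. The `hex_to_rgb` helper is a hypothetical name written for this example, not a `matplotlib` or `seaborn` function.

```python
def hex_to_rgb(hex_code):
    """Convert a '#RRGGBB' string into a (red, green, blue) tuple of 0-255 integers."""
    digits = hex_code.lstrip('#')
    if len(digits) != 6:
        raise ValueError("expected six hexadecimal digits, got %r" % hex_code)
    # Read two hexadecimal digits at a time: red, then green, then blue.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


print(hex_to_rgb("#179809"))  # (23, 152, 9)
print(hex_to_rgb("#B088CD"))  # (176, 136, 205)
```

In the first color, green (152) is far larger than red (23) or blue (9), which is why the line came out green.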

Tools like [Adobe Color](https://color.adobe.com) and this [Hex Calculator](https://www.w3schools.com/colors/colors_hexadecimal.asp) can help you get used to this system.

Most modern browsers also support eight-digit hex codes, in which the last two digits set transparency, which is often called 'alpha' in data visualization: `#RRGGBBAA`

In [None]:
## We can add the time argument to set the x-axis correctly. And let's change the color, since we can:
sns.tsplot(data=yearly_avg_wages['avg_wage'], time=yearly_avg_wages['year'], color="#B088CD")

# Color Note: B088CD
## The highest values are red 'B0' and blue 'CD', so we can expect a mix of those
## Further this is high in all three colors, so it'll be light, not dark

## Data Sourcing:
plt.annotate('Source: MO Department of Labor', xy=(0.8,-0.20), xycoords="axes fraction")
plt.show()

### Saving Charts as a Variable
- Back to [Table of Contents](#Table-of-Contents)

Although, as you can see above, we can immediately print our plots on a page, it is generally better to save them as a variable. We can then alter the charts over several lines before finally displaying them with the `show()` function, which comes from the `matplotlib` `pyplot` module we loaded earlier.

In [None]:
## Save the line chart as 'graph'
graph = sns.tsplot(data=yearly_avg_wages['avg_wage'], time=yearly_avg_wages['year'])

## To add data labels, we loop over each row and use graph.text()
for i, row in yearly_avg_wages.iterrows():
    graph.text(row["year"] + 0.05, row["avg_wage"] - 50, int(row["year"]))
    
## Now change x-axis and y-axis labels:
graph.set(xlabel="Year", ylabel="Average Monthly Wage")
graph.set(title="Rising Annual Wages since the 2009 Financial Crisis")

plt.annotate('Source: MO Department of Labor', xy=(0.8,-0.20), xycoords="axes fraction")

## Then display the plot:
plt.show()

### An Important Note on Graph Titles
- Back to [Table of Contents](#Table-of-Contents)

The title of a visualization occupies the most valuable real estate on the page. If nothing else, you can be reasonably sure a viewer will at least read the title and glance at your visualization. This is why you want to put thought into making a clear and effective title that acts as a **narrative** for your chart. Many novice visualizers default to an **explanatory** title, something like: "Average Wages Over Time (2006-2016)". This title is correct - it just isn't very useful. This is particularly true since any good graph will have explained what the visualization is through the axes and legends. Instead, use the title to reinforce and explain the core point of the visualization. It should answer the question "Why is this graph important?" and focus the viewer onto the most critical take-away.

## Exporting Completed Graphs
- Back to [Table of Contents](#Table-of-Contents)

When you are satisfied with your visualization, you may want to save a copy outside of your notebook. You can do this with `matplotlib`'s savefig function. You simply need to run:

plt.savefig("fileName.fileExtension")

The file extension is surprisingly important. Image formats like png and jpeg are **not ideal**: they store your graph as a grid of pixels, which is space-efficient but can't be edited later. Saving your visualizations as a PDF instead is strongly advised. PDFs store vector graphics, which means all the components of the graph will be maintained.

With PDFs, you can later open the image in a program like Adobe Illustrator and make changes like the size or typeface of your text, move your legends, or adjust the colors of your visual encodings. All of this would be impossible with a png or jpeg.
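
As a minimal sketch of exporting one figure in both kinds of format, the example below draws a toy line chart (hypothetical values) and saves it as a PDF and as a PNG in a temporary folder. The file names, the `dpi` value, and the use of `bbox_inches="tight"` are assumptions for illustration; the last one asks `matplotlib` to expand the saved area so that text placed outside the axes, such as a source note, is less likely to be cut off.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # only needed outside Jupyter; skip this line in a notebook
import matplotlib.pyplot as plt

# Toy line chart with made-up values.
fig, ax = plt.subplots()
ax.plot([2006, 2010, 2016], [2900, 3100, 3500])
ax.set_title("Toy Chart for Export")
ax.annotate('Source: hypothetical data', xy=(0.7, -0.2), xycoords="axes fraction")

# Write both files to a temporary folder so nothing in the project is overwritten.
out_dir = tempfile.mkdtemp()
pdf_path = os.path.join(out_dir, "toy_chart.pdf")
png_path = os.path.join(out_dir, "toy_chart.png")

# Vector output: stays editable and sharp at any size.
fig.savefig(pdf_path, bbox_inches="tight")
# Raster output: a fixed grid of pixels, so the resolution is chosen up front.
fig.savefig(png_path, dpi=300, bbox_inches="tight")
plt.close(fig)

print(sorted(os.listdir(out_dir)))
```

Calling `savefig` on the figure object, rather than on `plt`, makes it explicit which figure is being written when several are open.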

In [None]:
## Save the line chart as 'graph'
graph = sns.tsplot(data=yearly_avg_wages['avg_wage'], time=yearly_avg_wages['year'])

## To add data labels, we loop over each row and use graph.text()
for i, row in yearly_avg_wages.iterrows():
    graph.text(row["year"] + 0.05, row["avg_wage"] - 50, int(row["year"]))
    
## Now change x-axis and y-axis labels:
graph.set(xlabel="Year", ylabel="Average Monthly Wage")
graph.set(title="Rising Annual Wages since the 2009 Financial Crisis")

plt.annotate('Source: MO Department of Labor', xy=(0.8,-0.20), xycoords="axes fraction")

plt.savefig('output/wageplot.png')
plt.savefig('output/wageplot.pdf')

## Exercises & Practice
- Back to [Table of Contents](#Table-of-Contents)

### Exercise 1: Directed Scatterplot
- Back to [Table of Contents](#Table-of-Contents)

A directed scatterplot still uses one point for each year, but uses the x-axis and the y-axis for two different variables. In order to maintain the ordinal relationship, a line is drawn between the years. To do this in seaborn, we actually use sns.FacetGrid, which allows us to overlay different plots together. Specifically, it lets us overlay a scatterplot (`plt.scatter`) and a line chart (`plt.plot`).

In [None]:
# We can also look at a scatterplot of the number of people and average wages in each year:
scatter = sns.lmplot(x='total_empl', y='avg_wage', data=yearly_avg_wages, fit_reg=False)
scatter.set(xlabel="Number of Employees", ylabel="Average Annual Wages", title="Number and Wages of MO Employees")
 
## Sourcing:
plt.annotate('Source: MO Department of Labor', xy=(0.8,-0.20), xycoords="axes fraction")

plt.show()

In [None]:
cncted_scatter = sns.FacetGrid(data=yearly_avg_wages, size=7)
cncted_scatter.map(plt.scatter, 'total_empl', 'avg_wage', color="#A72313")
cncted_scatter.map(plt.plot, 'total_empl', 'avg_wage', color="#A72313")
cncted_scatter.set(title="Rising Wages of MO Employees", xlabel="Number of Employees", ylabel="Average Wages")

## Adding data labels:
for i, row in yearly_avg_wages.iterrows():
    plt.text(row["total_empl"], row["avg_wage"], int(row["year"]))
    
## Sourcing:
plt.annotate('Source: MO Department of Labor', xy=(0.8,-0.10), xycoords="axes fraction")

plt.show()

### Exercise 2: Heatmap
- Back to [Table of Contents](#Table-of-Contents)

Below, we reconsider the count of jobs by industry that we calculated in the "Variables" notebook. We query the database and collect the sum of the jobs in every industry from the LODES data. We then format this data into a wide DataFrame using `pandas`. This grid is the format that `seaborn`'s heatmap function is expecting.

If you would like to reproduce this type of graph, query one of the tables again and create a dataframe in the correct format, then pass that along to seaborn's heatmap function. Use the code you learned above to add a title, better axis labels, and data sourcing.

Note that the color map used here, `viridis`, is a scientifically derived color palette designed to be perceptually uniform. The color maps `inferno`, `plasma`, and `magma` also meet this criterion.

__More information:__
* [seaborn heatmap documentation](http://seaborn.pydata.org/generated/seaborn.heatmap.html)

* [matplotlib color map documentation](http://matplotlib.org/users/colormap.html)

In [None]:
query = '''
SELECT *
FROM public.lodes_workplace_area_characteristics
WHERE segment = 'S000' AND jobtype = 'JT01' AND state = 'mo'
LIMIT 20;
'''
wac = pd.read_sql(query, engine)

In [None]:
filter_col = [col for col in wac if col.startswith('cn')]
query = '''
SELECT
    year'''

for col in filter_col:
    query += '''
    , sum({0}) as {0}'''.format(col)

query += '''
FROM public.lodes_workplace_area_characteristics
WHERE segment = 'S000' AND jobtype = 'JT01' AND state = 'mo'
GROUP BY year
ORDER BY year
'''
wac_year_stats = pd.read_sql(query, engine, index_col='year')
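
If the string-building loop above is hard to picture, the sketch below wraps the same idea in a function and prints the finished SQL text. The column names in `example_cols` and the function name `build_yearly_sum_query` are hypothetical stand-ins; in the notebook, the real names come from the columns of `wac`.

```python
# Hypothetical column names standing in for the industry columns found in `wac`.
example_cols = ['cns01', 'cns02', 'cns03']


def build_yearly_sum_query(columns):
    """Return SQL text that sums each given column by year."""
    # One ', sum(col) as col' line per column, each on its own line.
    sum_clauses = ''.join(
        '\n    , sum({0}) as {0}'.format(col) for col in columns
    )
    return (
        "SELECT\n    year" + sum_clauses + "\n"
        "FROM public.lodes_workplace_area_characteristics\n"
        "WHERE segment = 'S000' AND jobtype = 'JT01' AND state = 'mo'\n"
        "GROUP BY year\n"
        "ORDER BY year"
    )


example_query = build_yearly_sum_query(example_cols)
print(example_query)
```

Printing the query before running it is a cheap way to catch a missing comma or a misspelled column name.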

In [None]:
wac_year_stats['total_jobs'] = wac_year_stats.sum(axis=1)
for var in filter_col:
    wac_year_stats[var] = (wac_year_stats[var]/wac_year_stats['total_jobs'])*100
del wac_year_stats['total_jobs']

In [None]:
wac_year_stats = wac_year_stats.T

In [None]:
## Create a heatmap, with annotations:
pd.options.display.float_format = '{:.2f}%'.format
fig, ax = plt.subplots(figsize = (20,12))
sns.heatmap(wac_year_stats, annot=True, fmt='.2f', cmap="viridis")
plt.show()

### Exercise 3: Jointplot
- Back to [Table of Contents](#Table-of-Contents)

Below, we pull two continuous variables from the Missouri Department of Labor, summed over each employer. See if you can pass this data to the sns.jointplot() function. Some of the arguments have been filled out for you, while others need completion.

In [None]:
pd.read_sql("SELECT * FROM kcmo_lehd.mo_qcew_employers LIMIT 5;",engine)

In [None]:
## Querying Total Wages and Jobs by Employer
select_string = "SELECT ein, sum(total_wage) as agg_wages, sum(mon1_empl + mon2_empl + mon3_empl) as agg_jobs"
select_string += " FROM kcmo_lehd.mo_qcew_employers"
select_string += " WHERE year = 2016"
select_string += " GROUP BY ein"

print(select_string)

## Run SQL query:
employers = pd.read_sql(select_string, engine)
print(len(employers))

In [None]:
## Fill in the arguments (x, y, data) below to get the visualization to run.

# sns.jointplot(x=, y=, data=
#               , color="#137B80", marginal_kws={"bins":30})
# plt.show()

### Exercise 4: FacetGrid
- Back to [Table of Contents](#Table-of-Contents)

Let's see if we can use seaborn's FacetGrid to create small multiple scatterplots. First you need to query a database and get at least one categorical variable and at least two continuous variables (floats).

Then try passing this data to the FacetGrid function from `seaborn` and the scatter function from `matplotlib`. 

[FacetGrid Documentation](http://seaborn.pydata.org/examples/many_facets.html)

In [None]:
## Pseudo-code to get you started:

# grid = sns.FacetGrid(dataframe, col = "categorical_var", hue="categorical_var", col_wrap=2)
# grid.map(plt.scatter, "x_var", "y_var")



In [None]:
## Enter your code for exercise 4 here:



In [None]:
## Submit results by saving to shared folder (use code below):

# myname = !whoami
# plt.savefig('/nfshome/{0}/Projects/ada_kcmo/shared/Class_Submits/Data_Visualization/{0}_1.png'.format(myname[0]))

### Exercise 5: Geographic Visualization

Another important feature in data visualization is mapping out a metric according to geography. In the following graphic, we color blocks in Kansas City according to the number of jobs in the LEHD data. We will use the Python packages `geopandas` and `matplotlib`.

In [None]:
import geopandas as gpd

In [None]:
query = """
SELECT geoid10, geom_wgs, sum(c000) tot_jobs
FROM kcmo_blocks b
JOIN lodes_workplace_area_characteristics w
ON w.w_geocode = b.geoid10
WHERE w.segment = 'S000' AND w.jobtype = 'JT01' AND w.year = 2010
group by geoid10, geom_wgs
"""

gdf = gpd.read_postgis(query, engine, geom_col='geom_wgs', crs='+init=epsg:4326')

In [None]:
fig, ax = plt.subplots(1, figsize = (10,15))
gdf.plot('tot_jobs', cmap = 'plasma', scheme = 'quantiles', legend = True, edgecolor = 'grey', ax = ax)

### Exercise 6: One more Visualization
- Back to [Table of Contents](#Table-of-Contents)

Test your mettle. Check out the seaborn [data visualization gallery](http://seaborn.pydata.org/examples) and see if you can implement an interesting visualization. Don't forget to submit your results by saving to the shared folder.

In [None]:
## Your code here.


# myname = !whoami
# export_file.to_csv(
#     '/nfshome/{0}/Projects/ada_kcmo/shared/Class_Submits/Data_Visualization/{0}.csv'.format(myname[0])
#     , index = False)

---

## Additional Resources

* [A Thorough Comparison of Python's DataViz Modules](https://dsaber.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair)

* [Seaborn Documentation](http://seaborn.pydata.org)

* [Matplotlib Documentation](https://matplotlib.org)

* [Advanced Functionality in Seaborn](http://blog.insightdatalabs.com/advanced-functionality-in-seaborn)

* Other Python Visualization Libraries:
    * [`Bokeh`](http://bokeh.pydata.org)

    * [`Altair`](https://altair-viz.github.io)

    * [`ggplot`](http://ggplot.yhathq.com)

    * [`Plotly`](https://plot.ly)