<h1 align="center">Data Visualization with Python</h1>

## Instructors
- Scott Bailey 
- Claire Cahoon

## Learning objectives
By the end of our workshop today, we hope you'll be able to load in data and visualize it in different ways depending on your specific needs and the nature of the data. We will work with a couple of customized datasets from the [Stanford Open Policing Project](https://openpolicing.stanford.edu/).

## Topics
- Visualization
- Univariate plotting
  - Bar charts
  - Histograms
  - Line charts
  - Time series
  - Box plots
  - Area charts
- Bivariate plotting
  - Scatter plots
  - Stacked plots
  - Box plots (again)
- Styling
- Seaborn
  - Faceting
 
##  Setup

Clicking the "Open in Colab" button, which appears after you open the GitHub link above, creates a new temporary copy of the notebook in the Google Colaboratory environment. Clicking the "Copy to Drive" button in the menu bar creates a copy of the notebook attached to your own user account, so you can edit it in any way you like; you can even take notes directly in the notebook.

## Zoom etiquette

Please make sure that your mic is muted during the workshop.

## Questions during the workshop

During the workshop, we have a second instructor who will be monitoring chat on Zoom. Please feel free to ask questions by chat throughout the workshop. Our second instructor will answer as able, and will aggregate questions with answers that might help everyone. 

At the end of each section of the workshop, the primary instructor will answer aggregated and new questions as time permits. If we aren't able to get to your question during the workshop, please follow up with us afterward. 

## Jupyter Notebooks and Google Colaboratory

Jupyter notebooks are a way to write and run Python code in an interactive way. They're quickly becoming a standard way of putting together data, code, and written explanations or visualizations into a single document and sharing that. There are a lot of ways that you can run Jupyter notebooks, including just locally on your computer, but we've decided to use Google's Colaboratory notebook platform for this workshop.  Colaboratory is “a Google research project created to help disseminate machine learning education and research.”  If you would like to know more about Colaboratory in general, you can visit the [Welcome Notebook](https://colab.research.google.com/notebooks/welcome.ipynb).

Using the Google Colaboratory platform allows us to focus on learning and writing Python in the workshop rather than on setting up Python, which can sometimes take a bit of extra work depending on platforms, operating systems, and other installed applications. If you'd like to install a Python distribution locally, though, we're happy to help. Feel free to drop by our walk-in consulting or schedule an appointment with us.

https://go.ncsu.edu/dvs-request

## Environment
For those using Anaconda or their own local installation of Jupyter Notebooks, in this workshop we'll need an environment with the following packages:
- `pandas`
- `matplotlib`
- `seaborn`, available in the `conda-forge` channel


## Datasets and Pandas

From Jake VanderPlas's book [**Python Data Science Handbook**](http://shop.oreilly.com/product/0636920034919.do) (from which some code excerpts are used in this workshop):

> Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a `DataFrame`. `DataFrame`s are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

In [None]:
import pandas as pd

# Set some options
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 10)

We will be using a couple of datasets from the Stanford Open Policing project.

In [None]:
# Data source: https://openpolicing.stanford.edu/data/
raleigh_stops_url = "https://stacks.stanford.edu/file/druid:yg821jf8611/yg821jf8611_nc_raleigh_2020_04_01.csv.zip"

Pandas has very good [documentation](http://pandas.pydata.org/pandas-docs/stable/), and in Jupyter, help for any method can be shown by appending a `?` to its name and running the cell.

One of the nice things about Pandas is that it is able to load in data from the Internet without blinking an eye. The first dataset, which we will store in the `DataFrame` `stops`, contains one entry for each traffic stop in Raleigh over a period of years. Let's take a look at it.

In [None]:
stops = pd.read_csv(
    raleigh_stops_url,
    index_col='raw_row_number',
    parse_dates=['date'])
stops
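
To see what `index_col` and `parse_dates` are doing without downloading the full file, here is a minimal sketch on a made-up three-row CSV. The rows and values are hypothetical; only the column names mirror the real data.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for the real file
csv_text = """raw_row_number,date,subject_age
101,2015-04-01,34
102,2015-04-02,27
103,2015-04-02,51
"""

toy = pd.read_csv(
    io.StringIO(csv_text),
    index_col="raw_row_number",  # use this column as the row labels
    parse_dates=["date"],        # convert date strings to datetime values
)
print(toy.dtypes)
print(toy.index.name)
```

Because `date` is parsed into real datetime values, accessors such as `.dt.month` become available on that column, which we rely on later for the time series plots.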

Let's take a minute to explore the data a bit to see what we might want to visualize.

In [None]:
stops.info()

In [None]:
stops["type"]

In [None]:
stops["subject_race"].describe()

In [None]:
stops["subject_age"].describe()

In [None]:
stops["search_person"].describe()

In [None]:
stops["search_person"].value_counts()
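
The difference between `describe()` and `value_counts()` on a text column is easier to see on a tiny example. This is a sketch on a made-up `Series`, not the workshop data.

```python
import pandas as pd

# Hypothetical values, just to show the shape of each result
race = pd.Series(["white", "black", "white", "hispanic", "white"], name="subject_race")

summary = race.describe()     # count, unique, top (most common value), freq
counts = race.value_counts()  # one row per distinct value, most common first

print(summary)
print(counts)
```

`describe()` gives a one-glance summary, while `value_counts()` gives the full table of categories that we will feed to bar charts below.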

In [None]:
stops["raw_action_description"]

## Visualization

Visualization in Python is a bit... convoluted. And it changes regularly as new libraries are developed. 

<img src="https://rougier.github.io/python-visualization-landscape/landscape-colors.png" width=960>
<div align="right">Adapted by <a href="https://github.com/rougier/python-visualization-landscape">Nicolas P. Rougier</a> from <a href="https://twitter.com/jakevdp/status/852308415372115968">a presentation by Jake VanderPlas</a></div>

The library matplotlib is one of the oldest and most common options for plotting in Python, and it laid the foundation for many other libraries. Pandas also provides some useful shortcuts on top of matplotlib. These utilities are available on all `Series` and `DataFrame` objects just by calling `plot()`. But first we need to tell Jupyter that we are going to plot some charts using matplotlib, so it knows how to render them.

In [None]:
%%capture
# enables inline plotting in Jupyter using matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
plt.rcParams["figure.figsize"] = (10, 6)  # default figure size in inches

<img src="https://matplotlib.org/_images/anatomy.png" width=600>
<div align="right">From the <a href="https://matplotlib.org/examples/showcase/anatomy.html">matplotlib documentation</a></div>

Unlike libraries like plotnine, ggpy, or R's ggplot, matplotlib doesn't follow [Leland Wilkinson’s The Grammar of Graphics](https://www.springer.com/gp/book/9780387245447). Instead, it relies on the Object-Oriented-Programming model of Python and all aspects of a plot are represented by objects, methods, properties, and arguments. 

## Univariate plotting

<table>
<tbody><tr>
<td><img src="https://camo.githubusercontent.com/fd040001f0007b89887ba2b2565698dfd790b85c/68747470733a2f2f692e696d6775722e636f6d2f736b615a5068622e706e67" width="350px" data-canonical-src="https://i.imgur.com/skaZPhb.png"></td>
<td><img src="https://camo.githubusercontent.com/98528c164a36365fa89ef9eb89d8664919e6a8ed/68747470733a2f2f692e696d6775722e636f6d2f67614e747459642e706e67" width="350px" data-canonical-src="https://i.imgur.com/gaNttYd.png"></td>
<td><img src="https://camo.githubusercontent.com/ad41d9f3a80ae03f9f79ea700b3e01a5155742da/68747470733a2f2f692e696d6775722e636f6d2f70616d70696f682e706e67" data-canonical-src="https://i.imgur.com/pampioh.png"></td>
<td><img src="https://camo.githubusercontent.com/f38749834bc72f3526a737244af468b99c91666f/68747470733a2f2f692e696d6775722e636f6d2f4f536275737a642e706e67" data-canonical-src="https://i.imgur.com/OSbuszd.png"></td>
</tr>
<tr>
<td>Bar Chart</td>
<td>Line Chart</td>
<td>Area Chart</td>
<td>Histogram</td>
</tr>
<tr>
<td>`df.plot.bar()`</td>
<td>`df.plot.line()`</td>
<td>`df.plot.area()`</td>
<td>`df.plot.hist()`</td>
</tr>
<tr>
<td>Good for nominal and small ordinal categorical data.</td>
<td>Good for ordinal categorical and interval data.</td>
<td>Good for ordinal categorical and interval data.</td>
<td>Good for interval data.</td>
</tr>
</tbody></table>
<div align="right">From Kaggle <a href="https://www.kaggle.com/learn/data-visualisation">Data Visualization</a> circa 2017</div>

The simplest type of visualization is single-variable or "univariate" visualization. This includes basic tools like bar plots and line charts.

### Bar and line plots

Bar charts are arguably the simplest data visualization. They map categories to numbers. In our case, we might want to map the types of traffic violation (category) to the number of violations by type (number). We'll use the `stops` `DataFrame` now.

In [None]:
stops['reason_for_stop'].value_counts()

In [None]:
stops['reason_for_stop'].value_counts().plot.bar()

We could also use the `rot` parameter to rotate the labels on our X axis.

In [None]:
stops['reason_for_stop'].value_counts().plot.bar(rot=60)

Bar charts are very flexible: The height can represent anything, as long as it is a number. And each bar can represent anything, as long as it is a category.

Some categories are **nominal** categories, and don't have an inherent or determined order. Nominal categorical variables include things like countries, gender, types of cheese, and lunar landers. There are also **ordinal** categories: things that do make sense to order, like earthquake magnitudes, happiness scales, and grades.
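
For ordinal categories, we usually want the bars in their natural order rather than sorted by frequency. One way to get that, sketched here on made-up letter grades (not part of the workshop data), is an ordered pandas `Categorical`:

```python
import pandas as pd

# Hypothetical ordinal data: letter grades
grades = pd.Series(["B", "A", "C", "B", "B", "A"])

# value_counts() orders by frequency: B, A, C
by_count = grades.value_counts()

# Declaring the order lets sort_index() follow it: C, B, A
ordered = pd.Categorical(grades, categories=["C", "B", "A"], ordered=True)
by_rank = pd.Series(ordered).value_counts().sort_index()

print(by_count)
print(by_rank)
```

Calling `.plot.bar()` on `by_rank` would then draw the bars from lowest to highest grade.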

### Styling your plot

Let's create another bar chart, and start changing the styling a bit. We'll start with the driver's race. 

In [None]:
stops["subject_race"].value_counts().plot.bar()

We can easily change the color of the bars.

In [None]:
stops["subject_race"].value_counts().plot.bar(color='tomato')

And even the opacity by using `alpha`.

In [None]:
stops["subject_race"].value_counts().plot.bar(alpha=0.3, color='tomato')

Having vertical labels isn't super readable, and rotating the labels will only help so much. Let's use a horizontal bar chart instead, `barh()`, and also pass in `figsize` to change the size of the figure.

In [None]:
stops["subject_race"].value_counts().plot.barh(alpha=0.3, color='blue', figsize=(8, 16))

Race isn't an ordered category, but let's sort alphabetically just to see how we would. At the same time, we'll switch our syntax a bit so you can see how you can call `.plot` on a DataFrame and specify which columns should be on the x and y axes.

In [None]:
# We'll store our value counts in a DataFrame, naming the columns "race" and "count"
# so they make more sense (this form behaves the same way across pandas versions).
race_count = stops["subject_race"].value_counts().rename_axis("race").reset_index(name="count")

(race_count
     .sort_values(["race"], ascending=False)
     .plot.barh(x="race", y="count", alpha=0.3, color='blue', figsize=(8, 16)))

Each time you call `plot()`, an `Axes` object is returned, and Jupyter knows how to render it. `Axes` objects belong to the underlying `matplotlib` plotting library, and as such they accept many options for customizing a plot's appearance.

For example, we could set axis labels or a title.

In [None]:
ax = (race_count
     .sort_values(["race"], ascending=False)
     .plot.barh(x="race", y="count", alpha=0.3, color='blue', figsize=(8, 16)))
ax.set_ylabel("Driver Race")
ax.set_xlabel("Number of drivers")
ax.set_title("Race in Raleigh Traffic Stops")

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Using `stops`, plot the total number of each type of violation in a horizontal bar chart. Adjust the labels to describe your plot accurately.
</p>
</div>

In [None]:
ax = stops['reason_for_stop'].value_counts().plot.barh(alpha=0.5)
ax.set_xlabel("# Violations")
ax.set_ylabel("Reason for Stop")
ax.set_title("Raleigh Traffic Violation Types")

### Histograms

A histogram looks, trivially, like a bar plot. And it basically is. In fact, a histogram is a special kind of bar plot that splits your data into even intervals and displays how many entries are in each interval with bars. The only analytical difference is that instead of each bar representing a single value, it represents a range of values; this range is called a bin.

In [None]:
stops['subject_age'].plot.hist()

The number of bins can also be set explicitly.

In [None]:
stops['subject_age'].plot.hist(bins=20)
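
To see the numbers behind a histogram, we can compute the bins ourselves. This is a minimal sketch with NumPy's `histogram` function on ten made-up ages; the values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

# Ten hypothetical ages
ages = pd.Series([18, 19, 22, 25, 31, 34, 38, 45, 52, 67])

# Split the range 18..67 into 5 equal-width bins and count the entries in each
counts, edges = np.histogram(ages, bins=5)

print(edges)   # 6 edges delimit 5 bins
print(counts)  # number of ages falling in each bin
```

Each bar in the plotted histogram corresponds to one entry of `counts`, spanning two consecutive values of `edges`.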

### Boxplots

Another way to visualize univariate distributions is the boxplot, which displays a lot of statistically useful information about it.

In [None]:
stops.boxplot('subject_age')
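
The statistics a boxplot draws can be computed directly. This sketch uses made-up ages and assumes matplotlib's default whisker rule of 1.5 times the interquartile range:

```python
import pandas as pd

# Hypothetical ages, with one unusually high value
ages = pd.Series([18, 22, 25, 30, 35, 41, 52, 90])

# The box spans the first to third quartile; the line inside is the median
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
```

Here only the age of 90 falls outside the fences, so it is the single point a boxplot of this data would show beyond the whiskers.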

Boxplots can be very powerful when comparing data.

In [None]:
stops.boxplot('subject_age', by='reason_for_stop', rot=60)

### Line charts

A line chart can pass over any number of individual values, making it the tool of first choice for distributions with many unique values or categories.

However, line charts have an important weakness: unlike bar charts, they're not appropriate for nominal categorical data. While bar charts distinguish between every "type" of point, line charts mush them together. So a line chart asserts an order to the values on the horizontal axis, and the order won't make sense with some data. After all, a "descent" from California to Washington to Tuscany doesn't mean much!

Let's try using a line chart where we previously used a bar chart.

In [None]:
(stops["reason_for_stop"].value_counts()
     .plot.line(alpha=0.3, color='blue', legend=False, rot=60))

Line charts also make it harder to distinguish between individual values. In general, if your data can fit into a bar chart, just use a bar chart.

### Time series

There is, however, a type of data at which line charts excel: time series.

To demonstrate time series data we will count the number of violations for each day of the month.

In [None]:
# Create new columns for month, day, and year
stops["month"] = stops["date"].dt.month
stops["day"] = stops["date"].dt.day
stops["year"] = stops["date"].dt.year
stops

We'll create a subset dataframe with just the data from 2015, the most recent year in the dataset.

In [None]:
stops_2015 = stops[stops.year == 2015]

Depending on what we're trying to do, we could create a new dataframe with a groupby statement, but for now we'll just use `.value_counts`.

In [None]:
stops_2015["day"].value_counts().plot.line()

What do you think has happened here?

In [None]:
stops_2015["day"].value_counts().sort_index()

In [None]:
stops_2015["day"].value_counts().sort_index().plot.line()
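
The jumbled first plot comes from the order in which `value_counts()` returns its results. Here is a small sketch on made-up days that also shows the `groupby` alternative mentioned earlier:

```python
import pandas as pd

# Hypothetical day-of-month values for nine stops
days = pd.Series([3, 3, 3, 1, 1, 2, 2, 2, 2], name="day")

# value_counts() sorts by frequency, so the index comes out as 2, 3, 1
by_frequency = days.value_counts()

# sort_index() restores calendar order: 1, 2, 3
by_day = by_frequency.sort_index()

# groupby + size gives the same counts, already ordered by day
by_group = days.groupby(days).size()

print(by_frequency)
print(by_day)
```

A line chart connects points in index order, so plotting `by_frequency` zigzags across the days while `by_day` reads left to right.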

In [None]:
stops_2015["day"].value_counts().sort_index().plot.line(ylim=(0, 3500))

Let's also pull a single month of data, and look at it over time.

In [None]:
stops_april_2015 = stops[(stops.year == 2015) & (stops.month == 4)]
stops_april_2015

In [None]:
stops_april_2015["day"].value_counts().sort_index().plot.line()

### Area charts

Area charts are just line charts, but with the bottom shaded in.

In [None]:
stops_2015["day"].value_counts().sort_index().plot.area()

When plotting only one variable, the difference between an area chart and a line chart can be mostly visual. When plotting the distribution of data, though, area charts often convey information more effectively than the line by itself. 

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Using `stops`, plot a histogram that shows the distribution of stops over the days of the month. Experiment with the number of bins to approximate weeks.
</p>
</div>

In [None]:
stops['day'].plot.hist(bins=4)
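
Four equal-width bins only approximate weeks. Another option, sketched here under the assumption that "week" means days 1-7, 8-14, and so on, is to spell out the bin edges ourselves. The data below is hypothetical: one stop on each day of a 31-day month.

```python
import numpy as np

# Hypothetical data: exactly one stop on each day of a 31-day month
days = np.arange(1, 32)

# Explicit edges: four 7-day weeks plus the leftover days 29-31
week_edges = [1, 8, 15, 22, 29, 32]
counts, _ = np.histogram(days, bins=week_edges)

print(counts)
```

matplotlib's histogram also accepts a list of edges for `bins`, so passing `bins=week_edges` to `plot.hist()` should draw the same five bars.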

## Bivariate plotting

<table>
<tbody><tr>
<td><img src="https://i.imgur.com/bBj1G1v.png" width="350px"></td>
<td><img src="https://i.imgur.com/ChK9zR3.png" width="350px"></td>
<td><img src="https://i.imgur.com/KBloVHe.png" width="350px"></td>
<td><img src="https://i.imgur.com/C7kEWq7.png" width="350px"></td>
</tr>
<tr>
<td style="font-weight:bold; font-size:16px;">Scatter Plot</td>
<td style="font-weight:bold; font-size:16px;">Hex Plot</td>
<td style="font-weight:bold; font-size:16px;">Stacked Bar Chart</td>
<td style="font-weight:bold; font-size:16px;">Bivariate Line Chart</td>
</tr>
<tr>
<td>`df.plot.scatter()`</td>
<td>`df.plot.hexbin()`</td>
<td>`df.plot.bar(stacked=True)`</td>
<td>`df.plot.line()`</td>
</tr>
<tr>
<td>Good for interval and some nominal categorical data.</td>
<td>Good for interval and some nominal categorical data.</td>
<td>Good for nominal and ordinal categorical data.</td>
<td>Good for ordinal categorical and interval data.</td>
</tr>
</tbody></table>
<div align="right">From Kaggle <a href="https://www.kaggle.com/learn/data-visualisation">Data Visualization</a> circa 2017</div>

To understand how variables interact with one another, we need plots that show two variables at once. In some cases, bivariate plots can also be extended to more than two variables.

### Scatter plots

A scatter plot is the simplest bivariate plot: it maps each pair of values to a point in two-dimensional space. To draw a scatter plot, call `scatter()` on a `DataFrame`'s `plot` attribute and pass in the `DataFrame` column names for the `x` and `y` axes.

In [None]:
stops.plot.scatter(x='day', y='subject_age')

We used our two numerical variables here, but a scatter plot really isn't that useful for displaying this information. Maybe it's time to switch datasets to one that covers all of NC. 

Here's a dataset I created from the NC State Patrol data from the Stanford Open Policing Project. It only contains data from 2010 on, aggregates on the county, and focuses on what percent of the total traffic arrests for each county came from each identified racial group. 

In [None]:
arrests = pd.read_csv("https://raw.githubusercontent.com/NCSU-Libraries/data-viz-workshops/master/Data_Vizualization_with_Python/nc_traffic_arrests_race_since_2010.csv")
arrests = arrests[arrests.county_name.notna()]
arrests.head()

Now that we have a lot more numerical values, let's try a couple of different scatterplots.

In [None]:
arrests.plot.scatter(x="count_arrests_total", y="county_stops_total")

In [None]:
arrests.sort_values("count_arrests_total", ascending=False)

In [None]:
arrests.plot.scatter(x="white", y="black")

And the same parameters for style can also be used here.

In [None]:
fig, ax = plt.subplots(1, figsize=(10, 10))
arrests.plot.scatter(ax=ax, x='white', y='black', alpha=0.3, color='blue')
ax.set_ylabel("% Black")
ax.set_xlabel("% White")
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])

### Hexplots

When a scatter plot has too many overlapping points, it might be difficult to interpret or see what's happening. For cases like those, we can use a hexplot, which aggregates points in space into hexagons, and then colors those hexagons.

In [None]:
arrests.plot.hexbin(x='white', y='black', gridsize=25)

A hexplot does for bivariate distributions what a histogram does for univariate ones.

Hexplots and scatter plots can be applied to combinations of interval variables or ordinal categorical variables. To help alleviate overplotting, scatter plots (and, to a lesser extent, hexplots) benefit from variables which can take on a wide range of unique values.

### Stacked plots

Let's go back to our original stops data for a moment.

What if we want to know what types of traffic violations each biological sex tends to be stopped for? Note that the NC data doesn't distinguish gender from biological sex; it only has sex.

In [None]:
# We're using `county_name` here just because we need a column other than the two of interest to count observations.
stops.pivot_table(index=['reason_for_stop'], columns=['subject_sex'], values='county_name', aggfunc=len)
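
As an aside, pandas also has `crosstab()`, which counts combinations of two columns without needing a third column to count on. This is a sketch on a made-up five-row table rather than the workshop data:

```python
import pandas as pd

# Hypothetical stops: one row per stop
toy = pd.DataFrame({
    "reason_for_stop": ["Speeding", "Speeding", "Speeding", "Seat Belt", "Seat Belt"],
    "subject_sex": ["male", "male", "female", "female", "male"],
})

# Rows are reasons, columns are sexes, cells are the number of stops
counts = pd.crosstab(toy["reason_for_stop"], toy["subject_sex"])
print(counts)
```

The result has the same layout as the pivot table above, so it can be plotted the same way.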

With that new pivot table, we are now able to use `bar()` as usual.

In [None]:
stops.pivot_table(index=['reason_for_stop'], columns=['subject_sex'], values='county_name', aggfunc=len).plot.bar()

Or stack the bars for each value of `reason_for_stop`, also changing the colors.

In [None]:
(stops
    .pivot_table(index=['reason_for_stop'], columns=['subject_sex'], values='county_name', aggfunc=len)
    .plot.bar(stacked=True, color=['blue', 'red']))

But if we wanted to have a 100% stacked graph, things get a bit messy with the simple Pandas plotting capabilities.

In [None]:
pivot_table = stops.pivot_table(index=['reason_for_stop'], columns=['subject_sex'], values='county_name', aggfunc=len)
pivot_table.div(pivot_table.sum(1), axis=0).plot.bar(stacked=True, color=['blue', 'red'])
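
The `div` call is what turns counts into shares. A small sketch with made-up counts shows the arithmetic: each row is divided by its own total, so every row sums to 1.

```python
import pandas as pd

# Hypothetical counts of stops by reason (rows) and sex (columns)
counts = pd.DataFrame(
    {"female": [1, 30], "male": [3, 70]},
    index=["Seat Belt", "Speeding"],
)

# Row totals: 4 and 100. axis=0 aligns the totals with the row labels.
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

Plotting `shares` with `stacked=True` gives bars that all reach the same height, which makes proportions comparable across categories of very different sizes.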

The same things apply for area and line charts.

In [None]:
stops.pivot_table(index=['reason_for_stop'], columns=['subject_sex'], values='county_name', aggfunc=len).plot.line(rot=60)

In [None]:
stops.pivot_table(index=['reason_for_stop'], columns=['subject_sex'], values='county_name', aggfunc=len).plot.area(rot=60)

### Boxplots (again)

Boxplots are also naturally good at representing more than one distribution.

In [None]:
stops.boxplot('subject_age', by='subject_sex', grid=False)

In [None]:
stops.boxplot('subject_age', by='reason_for_stop', grid=False, rot=60)

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Using `arrests`, plot a scatter plot for two racial groups that uses a different color for points, limits the x and y axes, and has 100 as the size of the points.
<br/>
    * **Hint**: Take a look at the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html">documentation for `scatter()`</a>.*
</p>
</div>

In [None]:
ax = arrests.plot.scatter(x='black', y='hispanic', s=100, c='tomato')
ax.set_xlim([0, .9])
ax.set_ylim([0, .5])

In [None]:
ax = arrests.plot.scatter(x='black', y='hispanic', s=arrests["count_arrests_total"], c='tomato', alpha=0.2)

## Styling

matplotlib uses styles to guide the default visual aspects of a plot. While those aspects can always be changed individually, a style can be set for an entire Python session or for a set of plots. Beyond the built-in `default` and `classic` styles, there are other styles available as well.

In [None]:
plt.style.available

In [None]:
with plt.style.context('Solarize_Light2'):
    stops['reason_for_stop'].value_counts().plot.bar()
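
The `with` block matters here: `plt.style.context()` applies a style only temporarily. A minimal sketch of that behavior, using the built-in `ggplot` style and one setting (`axes.facecolor`) picked purely for illustration:

```python
import matplotlib.pyplot as plt

before = plt.rcParams["axes.facecolor"]

# Inside the block, the ggplot style's settings are active
with plt.style.context("ggplot"):
    inside = plt.rcParams["axes.facecolor"]

# After the block, the previous settings are restored
after = plt.rcParams["axes.facecolor"]

print(before, inside, after)
```

By contrast, `plt.style.use()` changes the settings for the rest of the session.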

Even a special one for XKCD!

In [None]:
with plt.xkcd():
    stops['reason_for_stop'].value_counts().plot.bar()

## Seaborn

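`seaborn` is a library built on `matplotlib`. Versions before 0.8 changed the default plot style as soon as they were imported; newer versions only do so when you call `sns.set_theme()`. Either way, you can revert by calling `plt.style.use("default")` (or `plt.style.use("classic")` for the older matplotlib look).

In [None]:
import seaborn as sns
# The "seaborn" matplotlib style was renamed "seaborn-v0_8" in matplotlib 3.6,
# so we apply seaborn's theme through seaborn itself.
sns.set_theme()
stops['reason_for_stop'].value_counts().plot.bar()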

Seaborn is designed to work with `pandas` data structures and makes data visualization easier with semantic mapping and other features. See their guiding ideas: https://seaborn.pydata.org/introduction.html.

For example, to fit and show a regression line in a scatter plot, we just need to pass in `fit_reg` to seaborn's `lmplot()` or `regplot()`.

In [None]:
sns.lmplot(data=arrests, x='white', y='black', fit_reg=True) 

In [None]:
sns.lmplot(data=arrests, x='county_stops_total', y='count_arrests_total', fit_reg=True) 
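
To make the regression line less of a black box: with its default settings, `lmplot()` fits a straight line by ordinary least squares. The same kind of fit can be sketched with NumPy's `polyfit` on a few made-up points (hypothetical values that lie exactly on a line, so the result is easy to check):

```python
import numpy as np

# Hypothetical points lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# deg=1 fits a straight line; coefficients come back highest power first
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)
```

seaborn adds the shaded confidence band around the line on top of this, which is why its plots carry more information than the two numbers alone.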

It can also group and split very easily.

In [None]:
# Let's create a new column set to whichever of black or white is the higher percentage of arrests
import numpy as np
arrests["most_freq_race"] = np.where(arrests["white"] > arrests["black"], "white", "black")
arrests.sample(10)

In [None]:
sns.lmplot(data=arrests, x='white', y='black', hue='most_freq_race', fit_reg=True) 

In [None]:
sns.lmplot(data=arrests, x='county_stops_total', y='count_arrests_total', col='most_freq_race', fit_reg=True) 

Plotting a hexplot with accompanying histograms gets simple.

In [None]:
sns.jointplot(data=arrests, x='black', y='white', kind='hex', 
              gridsize=20)

Or creating a more visually enticing version of the boxplot, the violin plot.

In [None]:
chart = sns.violinplot(x='reason_for_stop', y='subject_age', data=stops)
chart.set_xticklabels(chart.get_xticklabels(), rotation=45)

In [None]:
sns.set(palette="pastel", color_codes=True)
chart = sns.violinplot(x='reason_for_stop', y='subject_age', hue="subject_sex", split=True,
               inner="quart", palette={"male": "b", "female": "y"}, data=stops)
chart.set_xticklabels(chart.get_xticklabels(), rotation=45)

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
`stripplot()` and `swarmplot()` are categorical scatterplots: they generalize a scatterplot to the case where one of the variables is categorical. The main difference between them is that `swarmplot()` avoids overlapping points while `stripplot()` doesn't. `stripplot()` also allows you to add some random `jitter` to get a nicer visualization.

Using `stops`, plot a `stripplot()` with jitter and a `swarmplot()` showing the distribution of driver age by violation and driver sex.

    * **Hint**: Now is a good time to take a look at the seaborn <a href="https://seaborn.pydata.org/index.html">documentation</a>.*
</p>
</div>

In [None]:
chart = sns.stripplot(x="reason_for_stop", y="subject_age", hue="subject_sex", jitter=True, data=stops.sample(1000))
chart.set_xticklabels(chart.get_xticklabels(), rotation=45)

In [None]:
chart = sns.swarmplot(x="reason_for_stop", y="subject_age", hue="subject_sex", data=stops.sample(1000))
chart.set_xticklabels(chart.get_xticklabels(), rotation=45)

### Faceting

Faceting is the act of breaking data variables up across multiple subplots, and combining those subplots into a single figure. So instead of one bar chart, we might have, say, four, arranged together in a grid.



In [None]:
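# distplot was deprecated in seaborn 0.11; histplot with kde=True draws a
# comparable histogram plus density curve.
g = sns.FacetGrid(stops, col='reason_for_stop')
g.map(sns.histplot, 'subject_age', kde=True, stat='density')

In [None]:
g = sns.FacetGrid(stops, col='subject_sex')
g.map(sns.histplot, 'subject_age', kde=True, stat='density')

In [None]:
g = sns.FacetGrid(stops, row='reason_for_stop', col='subject_sex')
g.map(sns.histplot, 'subject_age', kde=True, stat='density')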
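
It can help to see roughly what `FacetGrid` automates. The sketch below builds a two-panel figure by hand with plain matplotlib and a `groupby`, on a made-up six-row table; it is an illustration of the idea, not how seaborn is implemented.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: six drivers
toy = pd.DataFrame({
    "subject_sex": ["male", "female", "male", "female", "male", "female"],
    "subject_age": [22, 31, 45, 27, 38, 52],
})

# One subplot per group, sharing the y axis so the panels are comparable
groups = toy.groupby("subject_sex")
fig, axes = plt.subplots(1, groups.ngroups, sharey=True, figsize=(8, 3))

for ax, (label, group) in zip(axes, groups):
    ax.hist(group["subject_age"], bins=3)
    ax.set_title(f"subject_sex = {label}")
    ax.set_xlabel("subject_age")
```

`FacetGrid` handles the subsetting, layout, shared axes, and titles for us, which pays off quickly once rows and columns are both in play.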

We can also use FacetGrids with bivariate graphs. 

In [None]:
g = sns.FacetGrid(arrests, col="most_freq_race")
g.map(plt.scatter, "county_stops_total", "count_arrests_total")

There are a lot of complex things that you can do with FacetGrids, and we've only scratched the surface. I encourage you to take a look at the documentation to get a sense of other possibilities: https://seaborn.pydata.org/generated/seaborn.FacetGrid.html

## Saving a figure

Sometimes you need to save a figure to include in a poster, paper, or presentation. Let's call back one of our previous figures, then save it with a high dpi as though we would put it in a paper. 

In [None]:
chart = sns.swarmplot(x="reason_for_stop", y="subject_age", hue="subject_sex", data=stops.sample(1000))
chart.set_xticklabels(chart.get_xticklabels(), rotation=45)

In [None]:
fig = chart.get_figure()
fig.savefig("figure.png", dpi=300)
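
Rotated tick labels sometimes get clipped at the edge of a saved image. One option, sketched here on a made-up bar chart written to a temporary folder, is to pass `bbox_inches="tight"` so the saved area is fitted to everything that was drawn:

```python
import os
import tempfile

import matplotlib.pyplot as plt

# Hypothetical two-bar chart with rotated labels
fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(["Speeding", "Seat Belt"], [3, 1])
ax.tick_params(axis="x", labelrotation=45)

# Save into a temporary folder so the example leaves no files behind
out_path = os.path.join(tempfile.mkdtemp(), "figure.png")
fig.savefig(out_path, dpi=300, bbox_inches="tight")

print(os.path.getsize(out_path) > 0)
```

The file extension decides the format, so changing the name to end in `.pdf` or `.svg` would produce a vector image instead.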

## A brief look at Bokeh for interactive visualization

In [None]:
!pip install bokeh
from bokeh.plotting import figure, output_notebook, show

In [None]:
from bokeh.models import ColumnDataSource

In [None]:
output_notebook()

In [None]:
source = ColumnDataSource(arrests)
tooltips = [
    ("County", "@county_name")
]
title = "State Patrol Traffic Stops & Arrests in NC Counties from 2010 on"
# Bokeh 3.0 replaced plot_height/plot_width with height/width, and legend_group replaces the older legend argument.
p = figure(x_range=[0, 6e+5], y_range=[0, 13000], height=500, width=500, title=title, tooltips=tooltips)
p.scatter("county_stops_total", "count_arrests_total", fill_color="most_freq_race", source=source, legend_group="most_freq_race")
p.legend.location = "top_left"
show(p)

## Other Python Visualization Libraries

- Altair: https://altair-viz.github.io/
- See the graph early in the workshop for a range of visualization libraries

### Interactive

- Bokeh: https://docs.bokeh.org/en/latest/index.html
- Holoviews: http://holoviews.org/index.html
- Plotly: https://plotly.com/python/

## Evaluation survey
Please take a minute to answer these questions; your feedback helps us improve future workshops.

https://go.ncsu.edu/dvs-eval

## Credits

The first version of this workshop was originally written and taught at Stanford University by Javier de la Rosa (Postdoctoral Researcher at UNED) and Scott Bailey.