<table>
    <tr>
        <td>
            <center>
                <font size="7">This notebook contains the curriculum from the <a href="https://www.ire.org/events-and-training/event/3189/3499/">"How to find stories in data through visualization"</a> workshop taught at the <a href="https://www.ire.org/events-and-training/event/3189/">2018 CAR Conference</a> by Alberto Cairo, Mark Hansen, Olga Pierce and Rachael Tatman.</font>
            </center>
        </td>
    </tr>
</table>

_____

**How to use this notebook**: This notebook contains 6 sections: one with instructions for setting up your notebook, one explaining the basics of ggplot (the graphing library we'll be using) and four which contain code and descriptions of visualizations useful for exploring different types of data.

In each section (except the first) there will be exercises for you to complete. If you get stuck you can check out [this notebook](https://www.kaggle.com/rtatman/hints-for-exploratory-data-visualization-exercises/). It has hints that you can expand to show you what your output visualization should look like and, if you’re completely lost, the code to generate those figures. You can also use the notebook to check your work once you're finished. 


## Table of contents:
___
1.  [Setting up our notebook environment](#Setting-up-our-notebook-environment)
2. [A brief introduction to ggplot](#A-brief-introduction-to-ggplot)
3.  [Univariate (one variable) visualizations](#Univariate-Data)
  *  [Histograms](#Histograms)
  *  [Box and whisker plots](#Box-and-whisker-plot)
4.  [Time series](#Time-Series)
    * [Parsing dates](#Parsing-dates)
    * [Plotting time series](#Plotting-time-series)
5.  [Multivariate (more than one variable) plots](#Multivariate-Data)
    * [Correlogram](#Corellelogram)
    * [Scatterplots (including adding a fitted line & faceting)](#Scatterplot)
6.  [Mapping spatial data](#Mapping-Spatial-Data)



# Setting up our notebook environment
___

Every time you open a notebook, it starts with a blank R session. That means that you need to load in any libraries you want to use as well as read in your data and clean it each time.

Here, I've done these steps for you. You just need to run this cell to read in all the data & libraries we'll be using in this workshop.

> **How do I run the code in this notebook?** You can run the code in this notebook in a couple of ways. One is to click inside the cell (the box with code in it) that you want to run and then hit CTRL + ENTER. You can also click in a cell and then click the right-pointing "play" arrow to the left of the code. If you want to run all the code in your notebook, you can use the double "fast forward" arrows at the bottom of the notebook editor.

If you're interested in finding out more about data cleaning in R, check out [this kernel.

In [3]:
# libraries with useful functions
library(tidyverse)
library(corrgram)
library(janitor)
library(lubridate)
library(corrplot)
library(maps)
library(ggrepel)

### Crime ###
# read in & clean up the crime in context data
crime <- read_csv("../input/crime-rates/report.csv") %>%
    clean_names()

# separate the "agency_jurisdiction" column into two 
# columns called "city" and "state", splitting based
# on where commas are
crime <- crime %>%
    filter(!is.na(agency_code)) %>%
    separate(agency_jurisdiction, into = c("city","state"), sep = ",")

### Mass Shootings ###
# read in the mass shootings data
mass_shootings <- read_csv("../input/us-mass-shootings-last-50-years/Mass Shootings Dataset Ver 5.csv") %>%
    clean_names()

### Drone Strikes ###
# read in & clean up the drone strikes data (this data is
# collected by individuals and is slightly less clean)
drone_strikes <- read_csv("../input/pakistandroneattacks/PakistanDroneAttacksWithTemp Ver 10 (October 19, 2017).csv")

# remove the last row (has totals in it)
drone_strikes <- drone_strikes[-nrow(drone_strikes),] 

# automatically tidy column names
drone_strikes <- clean_names(drone_strikes)

# correct rows where the latitude & longitude were switched
corrected_rows <- drone_strikes %>%
    filter(longitude < 60) %>%
    mutate(longitude1 = latitude,
           latitude = longitude, 
           longitude = longitude1) %>%
    select(-longitude1)

# remove incorrect rows and add the corrected rows
drone_strikes <- drone_strikes %>%
    filter(longitude > 60) %>%
    rbind(corrected_rows)

#  Read in A/B testing data
visits <- read_csv("../input/visits/visits.csv")

# "source" in a data set sampled from the CDC's BRFSS data set
source("../input/brfsscdc/cdc.R")

# Frequency displays

By way of a fast introduction, R's version of a spreadsheet is a data frame. In the code above, we created a number of them for us to work with. The first example we'll examine is called `cdc`. It is a 10% sample of the data from the CDC's Behavioral Risk Factor Surveillance System (BRFSS), a large (the largest?) telephone survey. We can see the top few rows of the data frame using the function `head()`.

In [None]:
head(cdc)

The respondents of this survey were asked a series of questions about their health. In this data frame we have roughly 10%, or 20,000 respondents, selected at random from the original data set. We have also included just 10 of the questions in this data frame, a small fraction of the original survey.

In [None]:
dim(cdc)

The data frame has a mix of categorical and quantitative variables. The first column, for example, records how each respondent evaluated their overall health -- from poor to excellent. The seventh and eighth columns record both their present and desired weights. For categorical variables with a relatively small number of "levels", we can "summarize" the data by simply tabulating the number of occurrences of each level.

In the code below, the dollar sign is used to "extract" a column of data from the data frame and use it as input to the function that tabulates or counts the number of times each level appears.

In [None]:
table(cdc$genhlth)
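
To see what `$` and `table()` are doing without the full survey in the way, here is a minimal sketch on a tiny made-up data frame. The name `survey` and its values are invented for illustration; only the column names echo the `cdc` data.

```r
# A tiny, made-up data frame (NOT the BRFSS data)
survey <- data.frame(
    genhlth = c("good", "fair", "good", "excellent", "good"),
    exerany = c(1, 0, 1, 1, 0)
)

# `$` pulls a single column out of the data frame as a vector
survey$genhlth

# table() counts how many times each distinct value occurs
health_counts <- table(survey$genhlth)
health_counts
```

The result of `table()` is a named set of counts, which is why it can be passed straight to a plotting function.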

What do we see? Making a comparison between counts for variables with just a couple of levels is easy enough from a summary like this, but a graphic can help us considerably. 

In [None]:
# This controls the size of the plot
options(repr.plot.width=8,repr.plot.height=6)

# Make a simple barplot
barplot(table(cdc$genhlth))

As an aside, if we have two categorical variables and would like to explore the relationship between them, we might look at a cross-tabulation or crosstab to see which combinations are most common.

In [None]:
table(cdc$genhlth,cdc$exerany)

And the generalization of a barplot is a mosaicplot. The area of each rectangle is proportional to the count in the crosstab above.

In [None]:
mosaicplot(table(cdc$genhlth,cdc$exerany))

The shape, or concentration of data,  tells a story. When we have quantitative variables, we can also use frequency to provide us with a visual description of the shape of a data set. In this case, we tease out shape by dividing the range of the data into intervals of equal (usually) size. A "histogram" will divide the x-axis into equal intervals, each becoming one side of a rectangle, and for the y-axis we plot the number of points of the data set falling into the interval, making the other side of each rectangle. 

A histogram with equal interval sizes has one knob to turn -- the number of intervals, or, equivalently, their size. Let's look at people's heights in the BRFSS survey.

In [None]:
hist(cdc$height)

In [None]:
hist(cdc$height,breaks=50)

How would you describe the shape here? Symmetric? Bell-shaped?  If you increase the number of intervals, you see another feature of the data. What is it?

In [None]:
hist(cdc$height,breaks=500)
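
If you want to see exactly what the `breaks` knob changes, `hist()` can return the bins without drawing anything. This is a small sketch on simulated numbers (not the BRFSS heights), so the specific values mean nothing. Note that a single number passed to `breaks` is treated as a suggestion, and R rounds the bin edges to "pretty" values.

```r
set.seed(1)
# simulated values, standing in for a quantitative column
heights <- rnorm(1000, mean = 67, sd = 4)

# plot = FALSE returns the bins and counts instead of drawing them
coarse <- hist(heights, breaks = 5, plot = FALSE)
fine <- hist(heights, breaks = 50, plot = FALSE)

# how many intervals did R actually use in each case?
length(coarse$counts)
length(fine$counts)

# every observation lands in exactly one interval either way
sum(coarse$counts)
sum(fine$counts)
```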

The situation is a little different with weights.

In [None]:
hist(cdc$weight,breaks=25)

How would you describe the distribution of weights reported among the BRFSS respondents? 

Using a larger number of bins here gives a different picture. What do you see?

In [None]:
hist(cdc$weight,breaks=500,xlim=c(200,300))

We consider frequency displays as a means to judge the adequacy of numerical summaries of a data set. We are often called on to reduce a data set to one or two numbers. In the case of a bell curve, we need just the mean and standard deviation (the "normal" distribution being a two-parameter location-scale family, with the mean and standard deviation being "sufficient"). When the data set exhibits considerable skew we can either appeal to other summaries that might be more "robust" representations of center and scale, or we can transform the data in some way, reducing the skew.

[Slides about skew and introducing Normal Q-Q Plot]

In [4]:
hist(visits$VisitLength,breaks=100)

These data are heavily skewed. What should we do to help "see" the features?
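
One common option is the transformation route mentioned earlier: plot the logarithm of the values rather than the values themselves. The sketch below uses simulated right-skewed numbers, not the `visits` data, and assumes every value is positive (the log of zero or a negative number is not usable).

```r
set.seed(42)
# simulated right-skewed values, standing in for something like visit lengths
visit_length <- rlnorm(1000, meanlog = 3, sdlog = 1)

# draw the raw and log-transformed histograms side by side
par(mfrow = c(1, 2))
hist(visit_length, breaks = 100, main = "Raw values")
hist(log10(visit_length), breaks = 100, main = "log10 scale")
par(mfrow = c(1, 1))

# with a long right tail the mean sits well above the median;
# after the transform the two are close together
mean(visit_length) - median(visit_length)
mean(log10(visit_length)) - median(log10(visit_length))
```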

As mentioned before, the normal distribution is bound to the mean and standard deviation. Often we want to assess how "normal" something looks. The histogram is a tough display for refined judgements. Q-Q plots can be easier to interpret.

In [None]:
par(pty="s")
qqnorm(cdc$weight)
qqline(cdc$weight)
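
It can help to see that a normal Q-Q plot is just the sorted data plotted against the quantiles a normal distribution would give for the same number of points. The sketch below rebuilds one by hand on simulated data (not the BRFSS weights) and checks it against `qqnorm()`.

```r
set.seed(7)
# simulated values, standing in for a quantitative column
y <- rnorm(200, mean = 170, sd = 25)
n <- length(y)

# quantiles of a standard normal at n evenly spread probabilities
theoretical <- qnorm(ppoints(n))
# the data, from smallest to largest
observed <- sort(y)

# the Q-Q plot by hand: points near a straight line suggest normality
par(pty = "s")
plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# qqnorm() computes the same pairs (plot.it = FALSE skips the drawing)
qq <- qqnorm(y, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
```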

When comparing distributions, we can overlay histograms, appeal to "smoothed" histograms or densities that represent each distribution with a curve, or we can try simple representations that allow for fast comparisons. One such representation is the "boxplot". Below we compare the heights of men and women (not the most shocking of comparisons, I admit). What do you notice?

One point. Journalists often focus on "outliers" or points that are some distance from the center of the distribution. How does one define "at some distance" in a way that meaningfully highlights outlying points that are worthy of more investigation? Tukey came up with one solution. What would you do?

In [None]:
boxplot(cdc$height~cdc$gender)
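
For reference, Tukey's rule flags any point that lies more than 1.5 times the interquartile range beyond the first or third quartile. Here is a minimal sketch of that rule. The function name `tukey_outliers` and the example values are made up for illustration, and `boxplot()` computes its quartiles slightly differently, so results can differ a little on small data sets.

```r
# flag values outside Tukey's fences (k = 1.5 is the conventional choice)
tukey_outliers <- function(x, k = 1.5) {
    q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
    spread <- q[[2]] - q[[1]]  # the interquartile range
    x < q[[1]] - k * spread | x > q[[2]] + k * spread
}

# made-up values with one obvious straggler
values <- c(2, 3, 4, 5, 6, 7, 8, 40)
values[tukey_outliers(values)]
```

Changing `k` makes the rule stricter or more lenient, which is one way to experiment with your own definition of "at some distance".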

We are now going to review some of these ideas, introducing a powerful graphical engine that has been contributed to R by Hadley Wickham, ggplot.

# A brief introduction to ggplot
____

ggplot2 works a bit differently from other R packages, and that's by design. 

>The "gg" in "ggplot" stands for "the grammar of graphics". If you're curious, you can learn more about what this means in [this paper](http://vita.had.co.nz/papers/layered-grammar.html).

Plots in ggplot2 are "built up" using multiple functions connected with the plus sign (+). The first function, ggplot(), just draws the outline of the plot, including the axes and tick marks. It takes two arguments:

1. The dataset that you want to plot. 
2. A function, aes(), short for aesthetic. This function can take multiple arguments, and each argument tells ggplot which variables in the dataset you want to be mapped to which part of the plot. By default, the first two arguments will be assigned to the x-axis and y-axis.

The ggplot() function itself doesn't actually plot anything, it just creates the outline of the plot. To actually plot something in that outline, we need to add a geom layer to our plot. There are lots of different geom layers (you can see a full list of them [here](http://ggplot2.tidyverse.org/reference/#section-layer-geoms)) and you can add multiple layers to the same plot. The general syntax of ggplot looks like this:

    # ggplot syntax looks like this
    ggplot(dataset, aes(x = xaxis, y = yaxis)) +
        geom_something() +
        geom_somethingElse()

First you call ggplot() and pass it the dataset & aesthetics you want to plot. Then you add the layer(s) you want. 
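
To make the template concrete, here is a minimal runnable sketch using `mtcars`, a small data set that ships with R (it is not one of the workshop data sets). The choice of `geom_point()` and `geom_rug()` as the two layers is just for illustration.

```r
library(ggplot2)

# dataset first, then aes() mapping columns to the x and y axes
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +  # first layer: one point per car
    geom_rug()      # second layer: tick marks along the axes

# printing the object draws the plot
p
```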

Note that you do need to have a plus sign between every layer of the same plot. If you end a line without a plus sign, ggplot2 will think that you're done adding things and ignore everything that shows up after that (you'll also get an error because R doesn't expect lines to begin with +).

    # This will just plot the empty ggplot, without the geom_something
    ggplot(dataset, aes(x = xaxis, y = yaxis))
        + geom_something()

# Univariate Data
___

You can use these visualizations when you want to look at one variable at a time. 


# Histograms
____

**Example stories:**

* [Are Pop Lyrics Getting More Repetitive?](https://pudding.cool/2017/05/song-repetition/) by Colin Morris
* [Who will win the presidency?](https://projects.fivethirtyeight.com/2016-election-forecast/#electoral-vote) by 538

Histograms let you quickly see the distribution and range of a dataset. 



In [None]:
# plot a histogram of deaths per drone strike
ggplot(drone_strikes, aes(x = total_died_min)) +
    geom_histogram() +
    ggtitle("Number killed/drone strike (minimum)")

This graph shows us that the number killed has a strong positive skew. While most drone strikes kill around five people, there is a long positive tail. This graph also shows us that the minimum number killed ranges from 0 to 40.
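
When you don't specify anything, `geom_histogram()` uses 30 bins and prints a message suggesting you pick a better value. The sketch below shows one way to take control with the `binwidth` argument, using simulated counts rather than the drone strike data.

```r
library(ggplot2)

set.seed(3)
# simulated whole-number counts, standing in for a column of counts
toy <- data.frame(killed = rpois(300, lambda = 4))

# binwidth = 1 gives each whole number its own bar
p <- ggplot(toy, aes(x = killed)) +
    geom_histogram(binwidth = 1) +
    ggtitle("Simulated counts, one bin per whole number")
p
```

You can also set the number of bins directly with `bins` instead of `binwidth`.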

**Exercise**: 

Use the `crime` dataset to create a histogram of crimes per capita (actually per 100,000 residents). Is this measure skewed? What measurement was the most common?

In [None]:
# plot a histogram of crime per 100,000 residents

# your code here :)

# Box and whisker plot
____

**Examples:**

* [Climate change will hit the South and Midwest the hardest](https://www.theverge.com/2017/6/29/15892910/climate-change-south-midwest-economic-loss) by Alessandra Potenza (Figure 2)
* [Gender Differences in Adolescent Major Depressive Disorder](https://reliawire.com/gender-differences-adolescent-depression/) by Milla Bengtsson

Box and whisker plots will give you some of the same information as histograms (where is the center of mass of the data? Is there a strong skew?). They also highlight outliers, which we'll learn how to add labels to in the next section.

> *What does factor(0) mean?* It creates a vector of length 1 whose only element is "0", and tells R to treat it as a factor with a single level. (A factor is R's name for a categorical variable, and its levels are the possible categories. So the factor "pets" might have the levels "cat", "dog" and "hamster".) Basically, we're just making a placeholder to plot on the x axis, since geom_boxplot() expects both an x and y axis. 

In [None]:
# Plot a box and whisker plot of total deaths
ggplot(drone_strikes, aes(x = factor(0), y = total_died_min)) +
    geom_boxplot()

This graph shows us that the average drone strike kills at least five people, but there are a handful of outliers that have killed fifteen or more.
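
If `factor(0)` still feels mysterious, it may help to poke at factors directly. In R's terminology a factor is a categorical variable and its levels are the categories it can take. The `pets` values below are made up for illustration.

```r
# a factor with four observations and three levels
pets <- factor(c("cat", "dog", "cat", "hamster"))
levels(pets)
nlevels(pets)

# factor(0) is the smallest useful case: one observation, one level
placeholder <- factor(0)
length(placeholder)
levels(placeholder)
```

Because the placeholder has a single level, every row of the data falls into the same group and ggplot draws a single box.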

**Exercise**: 

Use the `crime` dataset to create a box and whisker plot of crimes per capita (actually per 100,000 residents). Compare it to the histogram you made in the last exercise. Which is it easier to see outliers in? Which more clearly shows the overall shape of the distribution? 

In [None]:
# plot a box and whisker plot of crime per 100,000 residents

# your code here :)

# Labelling outliers
___

**Examples:**

* [Which College Football Teams Do The Most With The Least Talent? (And Vice Versa)](https://fivethirtyeight.com/features/which-college-football-teams-do-the-most-with-the-least-talent-and-vice-versa/) by Neil Paine
* [Finding Faulty Auto Chips](https://semiengineering.com/finding-faulty-auto-chips/) by Mark Lapedus

Since boxplots show us outliers, sometimes it's nice to see more information about those outliers. In this section, we'll learn how to add text labels to outliers. 

Before we can label our outliers, we first need to identify them. For this, we're going to use a function that will identify which rows in a column contain outliers.

In [None]:
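# function to identify outliers: flags values larger than 4 times the
# interquartile range (a simple rule of thumb, not Tukey's boxplot rule)
identify_outliers <- function(input){
    input <- as.numeric(input) # non-numeric values become NA (with a warning)
    outliers <- input > (4 * IQR(input, na.rm = TRUE))
    return(outliers)
} 

# quick test :) (the NA and the "Q" both come back as NA)
identify_outliers(c(1,2,3,4,100,NA,"Q"))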

Now let's remake the box and whisker plot above, but label the outliers with which city each drone strike took place in.

In [None]:
# Plot a box and whisker plot of total deaths, with outliers labelled
ggplot(drone_strikes, aes(x = factor(0), y = total_died_min)) +
    geom_boxplot() +
    geom_text_repel(aes(label = ifelse(identify_outliers(drone_strikes$total_died_min),
                                 city, #label if condition is TRUE
                                 ""))) + # label if condition is FALSE
    ggtitle("Drone strikes by minimum total deaths")

This graph shows us that, while the average drone strike kills at least five, there were several outliers in North Waziristan and South Waziristan, including a drone strike in North Waziristan which killed at least 40 people.

**Exercise:**

Re-create the box and whisker plot you created in the last exercise (crime per 100,000 citizens), and label the outliers with the city.

In [None]:
# Your code here :)


# Time Series
____

**Examples:**

* [How (un)popular is Donald Trump](https://projects.fivethirtyeight.com/trump-approval-ratings/?ex_cid=rrpromo) by  Aaron Bycoffe, Dhrumil Mehta and Nate Silver
* [Bitcoin Mania: Even Grandma Wants In on the Action](https://www.wsj.com/articles/bitcoin-mania-even-grandma-wants-in-on-the-action-1511996653) by Peter Rudegeair and Akane Otani

Sometimes you'll want to plot changes in a value over time. This is known as a time series. The good news is that ggplot() is very smart about plotting dates and will automatically deal with them. The bad news is that your dates have to be in the right data format first.

In this section, you'll learn how to get dates into the correct format and then plot them.

# Parsing dates 
___

Sometimes you're lucky and your dates will already be in a numeric format that you can just feed directly into ggplot(). Sometimes, however, you'll have to let R know that a certain column contains a date and what the format of that date is. This is known as *parsing* a date.

> **What does the mdy() function do?** This tells R that the date is in the format month, day, year. In order to parse a date you'll need to specify the order of the parts of the date. Some other similar functions include:

> * **ymd()**, for dates that are year, month and then day
> *  **dmy()**, for dates that are day, month and then year
> * **ymd_hms()**, for dates that are year, month, day, hour, minute and finally second

> Once a date is parsed, you can easily extract parts of it using functions like month() and year(). You can learn more about parsing dates [here](https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html).

In this example, we've done a tiny bit of extra data processing to remove the day of the week (Sunday, Monday, etc.) that's been included in the date. If you don't need to do this step, you can skip creating the remove_dow() function and the line `mutate(date_formatted = remove_dow(date)) %>%`. 

In [None]:
# function to remove the day of the week (friday, monday, etc.) & following comma
remove_dow <- function(column){
    no_dow <- str_replace_all(column, '[A-Za-z]*day, ', '')
    return(no_dow)
}

# tidy up dates & convert them to date format
dates <- drone_strikes %>%
    select(date) %>% # get the "date" column
    mutate(date_formatted = remove_dow(date)) %>% # remove the day of the week
    mutate(date_formatted = mdy(date_formatted)) # convert to date format

# compare before & after formatting
head(dates)

# add formatted dates to our dataframe
drone_strikes$date_formatted <- dates$date_formatted
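
As a quick illustration of the lubridate parsing functions, the sketch below parses the same day written in several different orders, then repeats the "remove the day of the week" step on a single made-up string. The example strings are assumptions about what such dates might look like, not rows copied from the data.

```r
library(lubridate)
library(stringr)

# the same day written three ways, each with its matching parser
mdy("10/19/2017")
dmy("19 October 2017")
ymd("2017-10-19")

# a date with a time attached
ymd_hms("2017-10-19 14:30:00")

# strip a leading day of the week, then parse what's left
raw <- "Friday, June 18, 2004"  # made-up example string
no_dow <- str_replace_all(raw, "[A-Za-z]*day, ", "")
parsed <- mdy(no_dow)

# once parsed, pieces of the date are easy to pull out
year(parsed)
month(parsed)
```

If a parser is given text in the wrong order it will usually return `NA` with a warning, which is a good hint to try a different function.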

**Exercise:**

Correctly parse the "date" column from the mass_shootings dataset. Print some of the unconverted & converted dates using the head() function to compare them.

In [None]:
# your code here :)


# Plotting time series
___

Now that you've correctly parsed your dates, it's time to plot them! By convention, time series are plotted with time on the x axis and counts or values on the y axis. If you've got very common events, or are tracking a value that changes over time, you can just plot the date as the x axis and the value of whatever you're tracking on the y axis:

In [None]:
# median crimes per 100,000 residents, by year
crime_per_year <- crime %>%
    group_by(report_year) %>%
    filter(!is.na(crimes_percapita)) %>% # remove rows with NA
    summarise(crime = median(crimes_percapita))

# plot
ggplot(crime_per_year, aes(report_year, crime)) +
    geom_line() +
    ggtitle("Median Crime/100,000 Residents")

The plot above shows us that the median crime per 100,000 residents steadily increased from 1975 to 1993 and then steadily decreased until 2015.
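
The `group_by()` and `summarise()` pattern used to build `crime_per_year` is easier to follow on a handful of rows. The sketch below uses a tiny made-up table (the numbers are invented) with the same column names.

```r
library(dplyr)

# five made-up rows: two for 2000, three for 2001 (one missing)
toy <- tibble(
    report_year = c(2000, 2000, 2001, 2001, 2001),
    crimes_percapita = c(10, 20, 5, NA, 15)
)

toy_per_year <- toy %>%
    filter(!is.na(crimes_percapita)) %>%  # drop the missing value
    group_by(report_year) %>%             # one group per year
    summarise(crime = median(crimes_percapita))  # one row per group

toy_per_year
```

The result has one row per year, which is exactly the shape `geom_line()` needs.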

Sometimes, however, you want to see how many events occurred at each time point, which is a tiny bit more involved.

Here, we're going to plot how many events occur each day. In order to make this a little bit easier, we're going to use a function `events_per_day()` that takes in a column of a dataframe with dates and returns a dataframe with the date and how many events occurred on that day (including 0 if no events occurred on that day):

In [None]:
# takes a column with dates and calculates the number of events 
# per day, adding "0" for dates where no events occurred
events_per_day <- function(column_with_date){
    # number of events on each day that has at least one event
    events_per_day <- table(column_with_date) %>%
        as.data.frame() %>%
        mutate(date = as.Date(column_with_date), events = Freq)

    # all dates in our observed date range
    all_dates <- seq(from = min(events_per_day$date),
                    to = max(events_per_day$date),
                    by = "day")

    # days in the range that have no events get a count of 0
    missing_days <- tibble(date = all_dates[!(all_dates %in% events_per_day$date)],
                           events = 0)

    # add days with 0 events to our data
    events_per_day <- events_per_day %>%
        select(date, events) %>%
        rbind(missing_days)

    # return a dataframe with the columns "date" and "events"
    return(events_per_day)
}

Now we can use this function to plot any event occurrence based on a list of dates where it occurred. For example, we can look at how many drone strikes there were per day in our drone strikes dataset.

In [None]:
# get the number of strikes per day
drone_strikes_per_day <- events_per_day(drone_strikes$date_formatted)

# plot number of strike/day
ggplot(drone_strikes_per_day, aes(date, events)) +
    geom_line() +
    ggtitle("Number of Drone Strikes per Day")

This plot shows us that the number of drone strikes was highest around 2012 and 2013, with as many as three a day. Currently, the number of drone strikes is closer to one per day.
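
The key idea in counting events per day is that days with no events have to be filled in with zeros, otherwise the line jumps straight from one event day to the next. Here is a compact base R sketch of that idea on three made-up dates. It is an alternative way to get the same kind of table, not the function defined above.

```r
# three made-up events: two on Jan 1, one on Jan 4
event_dates <- as.Date(c("2018-01-01", "2018-01-01", "2018-01-04"))

# every calendar day between the first and last event
all_days <- seq(min(event_dates), max(event_dates), by = "day")

# turning the dates into a factor with one level per calendar day
# makes table() report a count for every day, including zeros
counts <- table(factor(as.character(event_dates),
                       levels = as.character(all_days)))

per_day <- data.frame(date = all_days, events = as.vector(counts))
per_day
```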

**Exercise:** 

Use this `events_per_day()` function to plot how many mass shootings occurred in America by day based on the mass_shootings dataset. Has the number of shootings per day been increasing or decreasing over time? 

In [None]:
# your code goes here


# Multivariate Data
_____

So far, we've looked at one variable at a time. Often, however, you're going to be interested in how two or more variables are related to each other.

# Corellelogram
___

**Examples:**

* [Election Update: North Carolina Is Becoming A Problem For Trump](https://fivethirtyeight.com/features/election-update-north-carolina-is-becoming-a-backstop-for-clinton/) by Nate Silver. (Figure 2)
* [Understanding The Recent Rise In Correlations And How You Can Turn It To Your Advantage](https://www.forbes.com/sites/riabiz/2011/03/09/understanding-the-recent-rise-in-correlations-and-how-you-can-turn-it-to-your-advantage/#60b6f9e0f6c8) by Elizabeth MacBride 

Correlograms show you the strength and direction of linear relationships between pairs of variables. They are useful for very early exploration because they allow you to compare a lot of variables at once.

> **Caution:** correlation will only tell you about *linear* relationships, so you may miss things like very high and very low values behaving similarly while values in the middle of the range behave differently. Also, of course, correlation isn’t the same as causation. :)

In `corrplot()` output, strong correlation between two variables is represented by darker colored and larger circles in the intersecting column and row for those variables. Blue circles indicate a positive correlation, red ones a negative correlation. 

So in the drone_strikes dataset, there is a very strong positive correlation between the temperature in Celsius and the temperature in Fahrenheit. 

In [None]:
# plot the correlation of all numeric columns
drone_strikes %>%
    select_if(is.numeric) %>% # get numeric columns
    replace(is.na(.), 0) %>% # replace na's with 0
    cor() %>% # calculate correlations
    corrplot() # plot

The corrgram() output shows the same information, but in a slightly different way. Darker squares indicate a strong relationship, and the direction of correlation is shown by color (blue for positive, red for negative) and the direction of the white hash lines.

In [None]:
# another way to plot a correlogram
# (less code but harder to read, IMO)
corrgram(drone_strikes,
         cex.labels = .75)
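
The caution that correlation only picks up *linear* relationships is easy to check with a small made-up example. Below, `y_curved` is completely determined by `x`, yet the correlation is essentially zero because the relationship is a U shape rather than a line.

```r
x <- -10:10

# a perfectly linear relationship
y_linear <- 2 * x + 1
cor(x, y_linear)

# a perfect but U-shaped relationship
y_curved <- x^2
cor(x, y_curved)
```

This is one reason to follow up an interesting (or surprisingly empty) cell in a correlogram with a scatterplot.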

**Exercise:**

Try using both the `corrplot()` and `corrgram()` functions to plot the crime dataset. Which measures are most strongly correlated with each other? Is their correlation negative or positive? Is this surprising? 

In [None]:
# make a correlogram using corrplot() here


In [None]:
# plot a correlogram using corrgram() here


# Scatterplot
___

**Examples:**

* [What country spends the most (and least) on health care per person?](https://www.mprnews.org/story/2017/04/20/npr-what-country-spends-the-most-and-least-on-health-care-per-person) by Susan Brink
* [New Study Reveals Bartenders, Casino Workers Most Likely to Get Divorced](https://www.inverse.com/article/36156-divorce-rate-study-americans-bartenders-flight-attendants) by Emily Gaudette

If you want to compare two variables in more detail than you can using a correlogram, a scatterplot is a good choice. It represents each observation as a single point on the plot, with its position determined by the value for that point for the variables assigned to the x and y axis. 

We can use this to see the relationship between these variables. For example, we can see the relationship between the maximum and minimum number of reported injuries for each drone strike:

In [None]:
# plot distance between max & min number injured per attack
ggplot(drone_strikes, aes(injured_min, injured_max)) +
    geom_point()

This graph shows us that the maximum number of injured is always greater than the minimum number injured (as we'd expect), but that they are generally fairly close to each other.

**Exercise:**

Plot a scatter plot showing the relationship between violent_crimes and population in the crime dataset.

In [None]:
# your code goes here :)

### Scatterplots: Adding a third variable
___

We can easily add a third variable to our scatterplot by assigning it to the color (or colour) aesthetic. In this example, I’m using color to represent whether more than five Taliban members were killed by each drone strike.

In [None]:
# color each strike by whether more than five Taliban members were killed
ggplot(drone_strikes, aes(injured_min, injured_max, color = (taliban > 5))) +
    geom_point()

This graph shows us that many of the drone strikes that resulted in a lot of injuries didn't kill more than five Taliban members. It also shows us that for many strikes the number of Taliban members killed was not recorded (and may have been zero).

**Exercise:**

Re-create your scatterplot from the last exercise (violent crimes vs. population in the crime dataset), but use color to show the state variable.

In [None]:
# your code here :)


### Scatterplots: Adding a fitted line
____

Sometimes it can be helpful to show trends in your data. The geom_smooth() layer adds a fitted line with its standard error. By default (for datasets with fewer than 1,000 observations), this line is fit using LOESS, or locally weighted scatterplot smoothing.

In [None]:
ggplot(drone_strikes, aes(injured_min, injured_max)) +
    geom_point() +
    geom_smooth()

This figure shows us that there's a strong positive relationship between the maximum and minimum number of people reported injured for each drone strike, which is what we'd expect for these measurements. (We wouldn't expect to see a very low number of injured max and a very high number of injured min, for example.)
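
If you'd rather summarize the trend with a straight line than a flexible curve, `geom_smooth()` accepts a `method` argument. Here is a minimal sketch using `mtcars`, a small data set that ships with R (not one of the workshop data sets), alongside the equivalent `lm()` fit so you can read off the slope.

```r
library(ggplot2)

# method = "lm" swaps the default smoother for a straight-line fit
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x)
p

# the same line, fitted directly, to see the intercept and slope
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)
```

A negative slope here means heavier cars tend to get fewer miles per gallon.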

**Exercise:**

Recreate your plot showing violent crime vs. population in the crime dataset, but add a fitted line. What's the relationship between these two features? 

In [None]:
# your code goes here :)


### Scatterplots: Faceting
___

Sometimes you can learn more about the data if you can compare groups side-by-side. One way to do this is by using faceting, which breaks each group into a different small plot and puts them next to each other. Here's an example that puts each province in a different smaller graph: 

In [None]:
# injured_min and injured_max, with each province in its own smaller plot
ggplot(drone_strikes, aes(injured_min, injured_max)) +
    geom_point() +
    facet_wrap(~province)

This figure shows us that there was a large difference in the number of drone attacks in each of the different provinces, but that in the two provinces with more than one drone strike the relationship between the minimum & maximum number of injuries was the same.
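
When there are many groups, it can help to control how the small plots are arranged. The sketch below uses `mtcars`, a data set that ships with R (not a workshop data set), and the `ncol` argument of `facet_wrap()` to stack the panels in a single column.

```r
library(ggplot2)

# one panel per number of cylinders, stacked vertically
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    facet_wrap(~cyl, ncol = 1)
p
```

By default every panel shares the same axes, which makes the panels directly comparable.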

**Exercise:**

Plot a scatter plot for violent crime vs. population, but put each state in a separate facet. 

In [None]:
# your code here :)

# Mapping Spatial Data
____

**Examples:**

* [Here are the states most threatened by steel tariffs](https://www.marketwatch.com/story/here-are-the-states-most-threatened-by-steel-tariffs-2018-03-06) by Rex Nutting and Andrea Riquier
* [#1 SONGS IN 3,000 CITIES](https://pudding.cool/2018/01/music-map/) by The Pudding


Spatial data includes location as part of each data point. When these are locations in the world, maps are usually the best way to represent this data. 

Mapping is a very complex type of data visualization with a lot of specialized considerations. The simplest approach is to take an already existing map and place your data points on top of it using information like their latitude and longitude. 

For our example, we’re going to use maps from the [map_data() function](https://www.rdocumentation.org/packages/ggplot2/versions/2.2.1/topics/map_data), which we can use to create a ggplot. Then we can add other layers that reference other data sets in order to plot our data on the map we’ve drawn.

In [None]:
# create an object with a map of Pakistan
pakistan <- map_data("world", "Pakistan")

# plot drone strikes by location & number of dead
ggplot() + 
    geom_polygon(data = pakistan, # plot map of Pakistan
        aes(x=long, y = lat, group = group), 
        fill = NA, 
        color = "black") + 
    theme(panel.background = element_blank()) + # remove background
    geom_point(data= drone_strikes, # plot drone strikes
               aes(x = longitude, y = latitude, 
                   size = total_died_min, color = total_died_min))

This map shows us that drone strikes in Pakistan are tightly clustered along the northwest border, with one outlier further south.
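
The layering idea is the same for any set of points that has a latitude and longitude. As a small sketch, the code below places three cities on the same outline. The city coordinates are approximate values typed in by hand for illustration, and `coord_quickmap()` is an optional extra that keeps the map from looking stretched.

```r
library(ggplot2)
library(maps)  # map_data() needs the maps package

pakistan <- map_data("world", "Pakistan")

# approximate city locations, entered by hand for illustration
cities <- data.frame(
    name = c("Islamabad", "Karachi", "Lahore"),
    longitude = c(73.05, 67.00, 74.34),
    latitude = c(33.68, 24.86, 31.55)
)

p <- ggplot() +
    geom_polygon(data = pakistan,  # layer 1: the outline
                 aes(x = long, y = lat, group = group),
                 fill = NA, color = "black") +
    geom_point(data = cities,      # layer 2: the points
               aes(x = longitude, y = latitude)) +
    geom_text(data = cities,       # layer 3: labels above the points
              aes(x = longitude, y = latitude, label = name),
              vjust = -1) +
    coord_quickmap()
p
```

Each layer gets its own `data` argument, which is what lets the outline and the points come from different data frames.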

**Exercise:**

Plot the location of mass shootings in the United States. I've started you out by creating a data object with the information you need for a map of the US in it. It has the same columns as the map of Pakistan in the example above (lat, long and group). 

In [None]:
# create an object with a map of the lower 48
USA <- map_data("usa")

In [None]:
# your code goes here :)



# Additional Resources
___

* For a more in-depth exploration of ggplot, check out [this notebook](https://www.kaggle.com/rtatman/visualizing-data-with-ggplot2).
* [A free tutorial series on maps and GIS (geographic information systems) in R](http://www.nickeubank.com/gis-in-r/)
* [Facets](https://pair-code.github.io/facets/): Facets is a code-free way to quickly look at a visual overview of your data. You can upload a file to explore on the website or add it to a Jupyter notebook.