# Geopandas and Choropleth Charts -- for Google COLAB

Before importing anything, this cell runs `pip install` on the virtual Colab machine, to make sure the modules we need are installed.

In [None]:
# matplotlib/pandas/seaborn are standard enough we shouldn't need to worry
!pip install geopandas
!pip install descartes
!pip install mapclassify

Now that we're sure stuff is installed, it should be safe to import it

In [None]:
import matplotlib.pyplot as plt  # these we've seen before
import pandas as pd
import seaborn as sns            # needed for the sns.catplot() bar plots near the end

import geopandas as gpd          # this is the point: pandas with geos!

import descartes                 # these two won't be used explicitly,
import mapclassify               # but will be used in the background

## Geopandas data
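Geopandas is a layer on top of pandas that can handle geographic polygons (or points) and make maps with them. Geopandas has a couple of datasets built in. (Note: the `gpd.datasets` module was removed in Geopandas 1.0, so the next few cells need an older Geopandas release.)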

In [None]:
gpd.datasets.available

`naturalearth_lowres` contains the outlines of 177 countries around the world, as well as a few more useful columns. Once we read it in, the result is a pandas DataFrame like we've seen before, except there's a `geometry` column which adds special capabilities.

In [None]:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.head(20)

In [None]:
world['continent'].value_counts()

In [None]:
world['gdp_md_est'].describe()

## Drawing as a map

A geopandas dataframe has a .plot() function, which simply works:

In [None]:
world.plot()

## Filtering rows
This works the same way as we saw before with regular pandas DataFrames.

In [None]:
asia = world[ world['continent'] == 'Asia' ]
asia.plot()

In [None]:
noam = world[ world['continent'] == 'North America']
noam.plot()

In [None]:
sixc = world[ world['continent'] != 'Antarctica' ]    # != means 'not-equal'
sixc.plot()

In [None]:
# Exercise: create a filtered DataFrame for the "continent" of 'Seven seas (open ocean)'
# What is it?


In [None]:
# Exercise: create a filtered DataFrame containing any 1 country of your choosing 
# (filter using the 'name' or 'iso_a3' column instead of 'continent')


## Combining plots

This shows how geopandas can draw multiple DataFrames (here, filters of the same DataFrame) onto the same map.

Note the big difference: we now have an Axes, and we tell each `plot()` call that this is where it should draw.

In [None]:
plt.figure(figsize=(18,10))
axes=plt.gca()
sixc.plot(ax=axes, color='lightgrey')

# Uncomment these one at a time
#asia.plot(ax=axes, color='green')
#noam.plot(ax=axes, color='purple')
#axes.set_xticks([])
#axes.set_yticks([])
#for s in axes.spines.values(): s.set_visible(False) # one-liner for turning off all 4 spines
# don't indent after that line!

plt.show()
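The same layering idea can be tried without downloading any map data. This is only a minimal sketch: the three squares, the `group` column, and the names `toy`, `east`, and `toy_axes` are made up for illustration and stand in for countries and continents.

```python
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import box

# Hypothetical toy data: three squares standing in for countries
toy = gpd.GeoDataFrame(
    {"name": ["A", "B", "C"], "group": ["east", "east", "west"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(3, 0, 4, 1)],
)

# Same boolean filtering as with the world DataFrame
east = toy[toy["group"] == "east"]

# One shared Axes; each plot() call adds a layer on top of the previous one
fig, toy_axes = plt.subplots(figsize=(6, 2))
toy.plot(ax=toy_axes, color="lightgrey")   # base layer: every shape
east.plot(ax=toy_axes, color="green")      # second layer: only the filtered shapes
```

Because both calls receive the same `ax`, the green squares are painted over the grey ones instead of appearing in a separate figure.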

## Exercise 1: Filter and Color
* Modify the code cells above to create filtered DataFrames as instructed
* Color the 1 country in 'Seven seas (open ocean)' red -- where is it?
* Color your chosen country blue

## Mapping U.S. States
There are datasets out there suitable for Geopandas for all kinds of countries, regions, and subdivisions. A very good collection [can be found here](https://github.com/deldersveld/topojson). If you need to work with any of those be sure to click to the 'Raw' view, then Save As...

This file `us-albers.json` came from that repository. It has the U.S. states, with Alaska/Hawaii scaled/shifted as customary to make a more compact map.

Note this has 51 'states' in it -- why?

In [None]:
# Note: for use on Google colab, this reads straight off the web via the url
states = gpd.read_file('https://raw.githubusercontent.com/RubeRad/camcom/master/us-albers.json')
states.info()

In [None]:
states.head()

In [None]:
states.plot()

## Choropleth Maps
The word 'choropleth' comes from Greek χῶρος (choros, 'area/region') and πλῆθος (plethos, 'multitude'). The main purpose (Greek τέλος) of Geopandas is choropleth maps. You just tell `plot()` what column you are interested in, and Geopandas will color each shape accordingly, using a color scheme/map based on the range of values it finds.

In [None]:
plt.figure()
axes=plt.gca()
states.plot(column='census', ax=axes) # column 'census' is the population of each state
plt.show()

## Exercise 2: Choropleth Options
One at a time, add/change options to `states.plot()` above, and see what happens:
* `cmap='Blues'` (or Reds, Greens,... or OrRd, YlGnBu, etc, see [Matplotlib colormaps](https://matplotlib.org/tutorials/colors/colormaps.html))
* `edgecolor='k'`
* `legend=True`
* `legend_kwds={'orientation':'horizontal'}`
* `scheme='quantiles'`
* `legend_kwds={'loc':'lower left'}`
* `legend_kwds={'loc':'lower left', 'bbox_to_anchor':(1,0)}` 

**Note** what happened there (if you did it right): at first the choropleth used a smooth, continuous range of colors from whatever cmap was chosen. `scheme='quantiles'` switched it to 5 discrete colors from that cmap, and it totally changed the type of legend (from a colorbar to a list of bins).
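To get a feel for what a quantile scheme does, here is a rough sketch using plain pandas. The population numbers are made up, and `pd.qcut` is used only to illustrate the idea of equal-count bins; the exact break points that `mapclassify` computes may differ slightly.

```python
import pandas as pd

# Hypothetical populations (in millions) for ten made-up regions
pop = pd.Series([1, 2, 3, 5, 8, 13, 21, 34, 55, 89], name="pop_millions")

# Split into 5 bins holding the same number of regions each, labelled 0..4;
# on a map, every bin would get one discrete color from the cmap
bins = pd.qcut(pop, q=5, labels=False)

print(pd.DataFrame({"pop_millions": pop, "bin": bins}))
print(bins.value_counts().sort_index())
```

Each bin ends up with two regions, even though the values are very unevenly spread. That is why a quantile map tends to show every color, while a continuous map of skewed data can look almost uniform.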

The first use of `'lower left'` referred to where within the whole plot to put the legend. When `'bbox_to_anchor'` was added, the meaning of `'lower left'` changed to which corner of the legend to anchor. And (1,0) means 100% of the way to the right of the plot, and 0% of the way up the plot -- the coordinates are not related to the coordinates being plotted.
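The anchoring behavior comes from Matplotlib's legend, so it can be tried on any small plot. This is a sketch with two made-up lines rather than a map; the names `fig`, `ax`, and `legend` are just for illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 3))
ax.plot([0, 1], [0, 1], label="rising")
ax.plot([0, 1], [1, 0], label="falling")

# loc names the corner of the legend box itself; bbox_to_anchor says where
# that corner is pinned, as fractions of the plot area: (1, 0) is the
# bottom-right corner of the plot, so the legend sits just outside it
legend = ax.legend(loc="lower left", bbox_to_anchor=(1, 0))
```

Changing `loc` to `'upper left'` with `bbox_to_anchor=(1, 1)` would pin the legend's top-left corner to the plot's top-right corner instead.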

## Combining DataFrames

Sometimes data you need to relate might be in different files. This example shows how to add more data about all the states from a separate csv file.

In [None]:
# This one also reads the csv off the web
ev2016 = pd.read_csv('https://raw.githubusercontent.com/RubeRad/camcom/master/2016ev.csv')
ev2016.head()

**Note** in our `states` DataFrame, the column with state names is `name`. In this new DataFrame, the column is named `State`. It is important that the values in the Series are spelled and punctuated and capitalized *exactly* the same, or that part of the data won't merge.

After the merge, use `info()` and `head()` to verify that everything merged successfully -- same number of rows as before, and matched up properly.
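One way to spot rows that failed to match is to ask `pd.merge` for a left join with `indicator=True`. The sketch below uses two tiny made-up tables (`left` and `right` are hypothetical stand-ins for `states` and `ev2016`, and the numbers are invented), with one deliberately mis-capitalized name.

```python
import pandas as pd

# Hypothetical stand-ins for `states` and `ev2016`; values are made up
left = pd.DataFrame({"name": ["Ohio", "Utah", "Washington"],
                     "census": [11.5, 2.8, 6.7]})
right = pd.DataFrame({"State": ["Ohio", "Utah", "washington"],  # note the lowercase 'w'
                      "Votes": [18, 6, 12]})

# how='left' keeps every row of `left`; indicator=True adds a '_merge' column
# saying whether each row was found in both tables or only in the left one
check = pd.merge(left, right, left_on="name", right_on="State",
                 how="left", indicator=True)

unmatched = check[check["_merge"] != "both"]
print(unmatched[["name", "_merge"]])
```

With the default inner join, the 'Washington' row would silently disappear; here it shows up as `left_only`, which points straight at the spelling mismatch.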

In [None]:
all = pd.merge(states, ev2016, left_on='name', right_on='State')
all.info()

In [None]:
all.head() # scroll to the right to see the new columns

**As seen above,** choropleths color regions based on values in a numerical column. However, what if the data is categorical? 

This shows an example of creating a new column with colors for categorical data, and having Geopandas map with those colors.

Above we filtered for Asia and North America, and used the Geopandas `plot()` keyword `color` to draw each sub-DataFrame with a single color. We could do that here too: filter 'Winning Party'=='Republicans' or 'Democrats' and use two `plot()` statements. But if the number of categories gets larger, that gets awkward. So here's another way: we create a new column full of color names for Geopandas to use:

In [None]:
# 'red' and 'blue' is a little intense
all['party_color'] = all['Winning Party'].map({'Republicans':'pink', 'Democrats':'lightblue'})
all.head()
# scroll right to see new column 'party_color'

In [None]:
all['party_color'].value_counts()

In [None]:
all.plot(color=all['party_color'], edgecolor='gray')
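One thing to watch with `.map()` and a dictionary: any category missing from the dictionary becomes NaN, which is not a usable color. A small sketch, with a made-up third category and a hypothetical fallback color:

```python
import pandas as pd

# Hypothetical data: 'Independent' is not in the color dictionary
winners = pd.Series(["Republicans", "Democrats", "Independent"], name="Winning Party")
palette = {"Republicans": "pink", "Democrats": "lightblue"}

colors = winners.map(palette)              # unmapped category -> NaN
colors_safe = colors.fillna("lightgrey")   # give unmapped rows a fallback color

print(colors_safe.tolist())
```

Checking `value_counts()` on the new column, as done above, is a quick way to notice if some rows did not get a color.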

## Exercise
Using capabilities examined in the choropleth exercise above, make an excellent choropleth visualization of `ev_per_million`

In [None]:
# Here's the new column: electoral votes per million people: 
all['ev_per_million'] = all['Votes'] / all['census'] * 1000000
all['ev_per_million'].head()

# State Shootings Example

In [None]:
# All years
wapoALL = pd.read_csv('https://corgis-edu.github.io/corgis/datasets/csv/police_shootings/police_shootings.csv')
wapoALL.info()

In [None]:
# 2016 only
wapo = wapoALL[ wapoALL['Incident.Date.Year'] == 2016 ]
wapo.head()

In [None]:
# Group all the rows of wapo by common state, and fill the columns with...
# For example min() gives the smallest value within each group, 
# so AK (Alaska) had no shootings on the 1st or 2nd of any month, etc
# In the Age column, I don't know if 0 means babies, or unknown age
wapo.groupby('Incident.Location.State').min().head()

In [None]:
# AK, AL, and AR had no shootings on 30th or 31st of a month, AK had no shootings in Dec
# In CA an 86-year-old was shot
wapo.groupby('Incident.Location.State').max().head()

In [None]:
# mean() gives the average age of shooting victims per state
# Not too useful for date fields, average month is mid-year, average day is mid-month, all years are 2016
# Note numeric_only=True skips fields that don't make sense for mean() (Person.Gender, Person.Race, ...)
wapo.groupby('Incident.Location.State').mean(numeric_only=True).head()

All of the above are just examples; maybe the mean() age is useful. For what Emily is interested in, we want count():

In [None]:
# This time we'll save the grouped DataFrame in a variable, named state_counts
# Note count() counts the non-missing values in every column, so as long as
# nothing is missing, every column yields the same count
state_counts = wapo.groupby('Incident.Location.State').count()
state_counts.head()
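If some values are missing, `count()` can give different numbers in different columns, while `size()` always gives the number of rows per group. A minimal sketch on made-up data (the table `toy` and its columns are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing name and one missing age
toy = pd.DataFrame({
    "state": ["AK", "AK", "AL", "AL", "AL"],
    "name":  ["a", "b", "c", None, "e"],
    "age":   [30, np.nan, 41, 25, 52],
})

per_column = toy.groupby("state").count()   # non-missing values, per column
per_group = toy.groupby("state").size()     # number of rows, per group

print(per_column)
print(per_group)
```

Here AK has 2 rows but only 1 known age, and AL has 3 rows but only 2 known names, so the columns of `per_column` disagree with each other.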

**Protip**: This took me some more googling to figure out. Way up there, `wapo.head()` shows a first column of row numbers that isn't one of the named columns; that's the *index*. But all these `groupby()` results have `Incident.Location.State` as the index instead. Unfortunately, that means the name can't be used like a regular column name.

In [None]:
# This gives an error: "Could not interpret input 'Incident.Location.State'"
sns.catplot(x='Incident.Location.State', y='Person.Name', data=state_counts, kind='bar')

In [None]:
# This is how to fix it, by copying the index column into a column with a name
# Let's choose a name that's easier to type
state_counts['state'] = state_counts.index
state_counts.info() # See the extra column 'state' added to the end?
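Another common way to get the same result is `reset_index()`, which moves the index back into a regular column and keeps its name. A small sketch on made-up data (`toy`, `counts`, and `flat` are hypothetical names):

```python
import pandas as pd

# Hypothetical data: three rows in two states
toy = pd.DataFrame({"state": ["AK", "AL", "AL"], "name": ["a", "b", "c"]})

counts = toy.groupby("state").count()   # 'state' is now the index, not a column
flat = counts.reset_index()             # 'state' is a regular column again

print(counts.columns.tolist())
print(flat.columns.tolist())
```

The copy-the-index approach above has the advantage of letting you pick a shorter column name at the same time.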

Since all of the columns have the same counts, any is as good as any other. `Person.Name` holds the same counts as `Incident.Location.City`. But to make things less confusing, we can add a column with a better name.

In [None]:
state_counts['n_shootings'] = state_counts['Person.Name']
state_counts['n_shootings'].head()

In [None]:
# With a column named n_shootings, this looks more sensible:
sns.catplot(x='state', y='n_shootings', data=state_counts, kind='bar')

# Merging GeoDataFrames
Remember from above how we merged `states` and `ev2016` into `all`? Well `all['iso_3166_2']` has 2-letter state abbreviations (inherited from `states`) that match `state_counts['state']` (inherited from `wapo`). 

That means we can also merge these together, so that other interesting columns, such as `census` (state population), the 2016 `Winning Party`, and `party_color`, are available for sns/gpd plotting.

In [None]:
# Merge the WaPo shooting counts with the previous based on matching 2-letter code
wapo_merge = pd.merge(all, state_counts, left_on='iso_3166_2', right_on='state')
wapo_merge.info()

In [None]:
# This is raw number of shootings, no accounting for state population
wapo_merge.plot(column='n_shootings')

# From Here:
Either the bar plot above, or this map, can be tailored in all the ways demonstrated in the above examples. Some ideas:
* Create a new column like shootings_per_million (see example ev_per_million above)
  * Use that for either the map plot column, or the bar plot y

For the bar plot:
* Control the aspect and color 
* Sort the bars
* Color bars by hue='Winning Party' 
  * (try palette='coolwarm', dodge=False)

For the map plot:
* Choose a useful palette from the Matplotlib colormaps link in Exercise 2
* Control the size/aspect
* Add a legend
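As a starting point for the per-million idea and for sorting the bars, here is a sketch on made-up numbers. The table `toy`, its state codes, and its values are all hypothetical; with the real data the same steps would apply to `wapo_merge`.

```python
import pandas as pd

# Hypothetical counts and populations for three made-up states
toy = pd.DataFrame({
    "state": ["AA", "BB", "CC"],
    "n_shootings": [10, 40, 5],
    "census": [1_000_000, 20_000_000, 250_000],
})

# Same pattern as ev_per_million: divide by population, scale to one million
toy["shootings_per_million"] = toy["n_shootings"] / toy["census"] * 1_000_000

# Sort so the highest rate comes first
ranked = toy.sort_values("shootings_per_million", ascending=False)
print(ranked[["state", "shootings_per_million"]])
```

Note how the state with the fewest shootings ends up with the highest rate once population is taken into account. For a sorted bar plot, the sorted state column can be passed to `sns.catplot` through its `order` parameter.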