<a href="https://colab.research.google.com/github/RubeRad/tcscs/blob/master/notebooks/60_Geopandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Geopandas and Choropleth Charts



In [None]:
import matplotlib.pyplot as plt  # these we've seen before
import pandas as pd
import seaborn as sns

import geopandas as gpd          # this is the point: pandas with geos!

# Geopandas data
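Geopandas is a layer on top of pandas that can handle geographic polygons (or points) and make maps with them. Geopandas has a couple of datasets built in (note that the `gpd.datasets` module was removed in geopandas 1.0, so the cells below need an older release):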

In [None]:
gpd.datasets.available

`naturalearth_lowres` contains the outlines of 177 countries around the world, as well as a few more useful columns. Once we read it in, the result is a pandas DataFrame like we've seen before, except there's a `geometry` column which adds special capabilities.

In [None]:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world

In [None]:
world.continent.value_counts()

In [None]:
world.gdp_md_est.describe()

## Drawing as a map

A geopandas dataframe has a .plot() function, which simply works:

In [None]:
world.plot()

## Filtering rows
This works the same way as we saw before with regular pandas DataFrames.
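
As a quick refresher, here is the same filtering pattern on a tiny made-up table (the `toy` DataFrame and its values are invented for illustration, not taken from `naturalearth_lowres`). A comparison produces a True/False Series, and indexing with it keeps only the True rows.

```python
import pandas as pd

# Toy stand-in for the `world` table (hypothetical values)
toy = pd.DataFrame({
    'name':      ['Canada', 'Mexico', 'Japan', 'India', 'Antarctica'],
    'continent': ['North America', 'North America', 'Asia', 'Asia', 'Antarctica'],
})

mask = toy['continent'] == 'Asia'   # a Series of True/False, one per row
asia_toy = toy[mask]                # keeps only the rows where mask is True

# != means 'not-equal'
not_antarctica = toy[toy['continent'] != 'Antarctica']

# .isin() matches any value in a list
two_continents = toy[toy['continent'].isin(['Asia', 'Antarctica'])]

print(asia_toy['name'].tolist())
```

The cells below do exactly this, only with the real `world` GeoDataFrame.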

In [None]:
noam = world[ world.continent == 'North America']
noam

In [None]:
noam.plot()

In [None]:
asia = world[ world.continent == 'Asia' ]
asia

In [None]:
asia.plot()

In [None]:
sixc = world[ world.continent != 'Antarctica' ]    # != means 'not-equal'
sixc.plot()

In [None]:
# Exercise: create a filtered DataFrame for the "continent" of 'Seven seas (open ocean)'
# Plot it.
# What is it?


In [None]:
# Exercise: create a filtered DataFrame containing any 1 country of your choosing, and plot it
# (filter using the 'name' or 'iso_a3' column instead of 'continent')


## Combining plots

This shows how geopandas can draw multiple DataFrames (here, filters of the same DataFrame) onto the same map.

Note the big difference is we have an Axes, and we tell each plot() that's where they should plot themselves.
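
The layering pattern can also be tried on a tiny made-up GeoDataFrame. This is only a sketch: the three square "countries" and the `East`/`West` labels are hypothetical, and the `Agg` backend line is there only so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')   # draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import geopandas as gpd
from shapely.geometry import box

# Three made-up square "countries" (hypothetical data)
toy = gpd.GeoDataFrame({
    'name':      ['A', 'B', 'C'],
    'continent': ['East', 'East', 'West'],
    'geometry':  [box(0, 0, 1, 1), box(1, 0, 2, 1), box(3, 0, 4, 1)],
})
west = toy[toy['continent'] == 'West']

fig = plt.figure(figsize=(6, 3))
axes = fig.add_subplot()
toy.plot(ax=axes, color='lightgrey')   # base layer: every shape
west.plot(ax=axes, color='purple')     # second layer, drawn on the same Axes

n_layers = len(axes.collections)       # each plot() call added a layer to axes
```

Because both calls receive the same `ax=axes`, the purple square is painted over the grey base layer instead of appearing in a new figure.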

In [None]:
fig = plt.figure(figsize=(18,10))
axes = fig.add_subplot()
sixc.plot(ax=axes, color='lightgrey')

# Uncomment these one at a time
#asia.plot(ax=axes, color='green')
#noam.plot(ax=axes, color='purple')
#axes.set_xticks([])
#axes.set_yticks([])
#for s in axes.spines.values(): s.set_visible(False) # one-liner for turning off all 4 spines
# don't indent after that line!

## Exercise 1: Filter and Color
* Modify the code cells above to create filtered DataFrames as instructed
* Color the 1 country in 'Seven seas (open ocean)' red -- where is it?
* Color your chosen country blue

## Mapping U.S. States
There are datasets out there suitable for Geopandas for all kinds of countries, regions, and subdivisions. A very good collection [can be found here](https://github.com/deldersveld/topojson). If you need to work with any of those be sure to click to the 'Raw' view, then Save As...

This file `us-albers.json` came from that repository. It has the U.S. states, with Alaska/Hawaii scaled/shifted as customary to make a more compact map.

Note this has 51 'states' in it -- why?

In [None]:
states = gpd.read_file('https://raw.githubusercontent.com/RubeRad/camcom/master/us-albers.json')
states.info()

In [None]:
states.head()

In [None]:
states.plot()

## Choropleth Maps
The word 'choropleth' comes from Greek χῶρος (choros, 'area/region') and πλῆθος (plethos, 'multitude'). The main purpose (Greek τέλος) of Geopandas is choropleth maps. You just tell `plot()` what column you are interested in, and Geopandas will color each shape accordingly, using a color scheme/map based on the range of values it finds.

In [None]:
fig = plt.figure()
axes= fig.add_subplot()
states.plot(column='census', ax=axes) # column 'census' is the population of each state

## Exercise 2: Choropleth Options
One at a time, add/change options to `states.plot()` above, and see what happens:
* `cmap='Blues'` (or Reds, Greens,... or OrRd, YlGnBu, etc, see [Matplotlib colormaps](https://matplotlib.org/tutorials/colors/colormaps.html))
* `edgecolor='k'`
* `legend=True`
* `legend_kwds={'orientation':'horizontal'}`
* `scheme='quantiles'`
* `legend_kwds={'loc':'lower left'}`
* `legend_kwds={'loc':'lower left', 'bbox_to_anchor':(1,0)}`

**Note** what happened there (if you did it right) is first the choropleth used a smooth, continuous range of colors from whatever cmap was chosen. `scheme='quantiles'` switched it to 5 discrete colors from that cmap, and it totally changed the type of legend.

The first use of `'lower left'` referred to where within the whole plot to put the legend. When `'bbox_to_anchor'` was added, the meaning of `'lower left'` changed to which corner of the legend to anchor. And (1,0) means 100% of the way to the right of the plot, and 0% of the way up the plot -- the coordinates are not related to the coordinates being plotted.
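
To get a feel for what a quantile classification does to the numbers, here is a rough sketch using `pd.qcut` as a stand-in. Geopandas hands `scheme=` off to the `mapclassify` package rather than to `pd.qcut`, so this illustrates the idea only, and the populations are made up.

```python
import pandas as pd

# Made-up populations (in millions) for ten hypothetical states
pop = pd.Series([0.6, 1.1, 1.9, 3.0, 4.5, 6.2, 8.9, 12.8, 21.5, 39.5])

# Split into 5 quantile classes: each class holds the same number of states
classes = pd.qcut(pop, q=5, labels=False)   # class number 0..4 for each state

print(classes.tolist())                             # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
print(classes.value_counts().sort_index().tolist()) # [2, 2, 2, 2, 2]
```

Each class gets one color, which is why a few very large values no longer wash out the differences among the smaller ones.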

# Combining DataFrames

Usually the data you are analyzing is from a separate source, in its own DataFrame; the Geopandas mapping dataframe doesn't have that much information in it.

In a normal case like that, you need to *merge* the data-DataFrame with the mapping-DataFrame, so that Geopandas can map the data you care about.

## Merging Covid data to a world map
We'll load the same Covid data we were looking at in the other recent notebook, and give it the same treatment

In [None]:
# load it
url = 'https://covid.ourworldindata.org/data/owid-covid-data.csv'
dfall = pd.read_csv(url, parse_dates=['date']) # make sure it knows 'date' is dates
dfall # take a look

In [None]:
dfall.isnull().sum() # note some columns have some empty cells

In [None]:
# Fill in any missing values with 0
dfall.fillna(0, inplace=True)

In [None]:
dfall.isnull().sum() # now they are all filled in

In [None]:
# simplify to fewer columns (.copy() so columns can be added to df later without a warning)
df = dfall[ ['iso_code', 'location', 'date', 'new_cases', 'new_deaths', 'total_cases', 'total_deaths', 'population'] ].copy()

## Identify matching column
In the `world` Geopandas DataFrame, the column with the 3-letter country codes is called `iso_a3`, and in the Covid DataFrame the column `iso_code` has the same country codes.

In [None]:
world.iso_a3

In [None]:
df.iso_code

The `pd.merge()` command tells pandas to match rows up by those columns:

In [None]:
testmerge = pd.merge(world,  # this first DataFrame is on the Left
                     df,     # this one is on the Right
                     left_on='iso_a3',    # matching column in the Left
                     right_on='iso_code') # matching column in the Right

In [None]:
testmerge

## Type, shape, and size
Consider these questions:
* What *is* `testmerge`?
* What *shape* is `testmerge`?
* What *size* is `testmerge`?

`testmerge` has a `geometry` column, and a numerical column `total_cases` -- can we throw that data onto a map?

In [None]:
fig = plt.figure()
axes = fig.add_subplot()
testmerge.plot(column='total_cases', ax=axes)

That didn't work (or at least not in a reasonable time). What might have gone wrong?

## Same-sizing before merging

This is the most important lesson for successful use of geopandas:

* Your pandas DataFrame with your data,
* The geopandas DataFrame with the maps, and
* The merged DataFrame that brings them together

All three have to have (about) the same size. I say "about" because there may be a few mismatches that get dropped in the merge. But you cannot have a brazillion more rows in the data than the maps, because then the merge will have a brazillion rows, and the plot will try to draw a brazillion countries, or states. It will take forever, and it will be wrong, because it will just overdraw the same countries/states over and over, and what shows at the end is just whatever was last.

Usually, the way to force a giant dataset to be (about) the same size as the mapping DataFrame, is to use `groupby` -- with the same column that was used for merging. Tips for `groupby()`:
* After the `groupby('colname')` use a summarizing function, like max, min, mean, count, sum -- as appropriate.
* Give the summarizing function the argument `numeric_only=True` otherwise it will complain like "I don't know how to add strings!"
* End with `.reset_index()` to ensure the `groupby` column can still be used as a column, not just the index.
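
The effect is easy to see on a toy example. Everything here is hypothetical (the `shapes` and `daily` tables, the codes `AAA`/`BBB`, and the plain strings standing in for geometries); it is only meant to show how the row counts behave.

```python
import pandas as pd

# Hypothetical map table: one row per country
shapes = pd.DataFrame({'iso_a3': ['AAA', 'BBB'],
                       'geometry': ['shape-A', 'shape-B']})

# Hypothetical daily data: many rows per country
daily = pd.DataFrame({
    'iso_code':    ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'location':    ['Country A'] * 3 + ['Country B'] * 3,
    'total_cases': [1, 5, 9, 2, 4, 7],
})

# Merging the raw daily data repeats every shape once per day
too_big = pd.merge(shapes, daily, left_on='iso_a3', right_on='iso_code')

# Collapse to one row per country first, then merge
latest = daily.groupby('iso_code').max(numeric_only=True).reset_index()
same_size = pd.merge(shapes, latest, left_on='iso_a3', right_on='iso_code')

print(len(shapes), len(too_big), len(same_size))   # 2 6 2
```

With real data the "too big" merge has hundreds of rows per country instead of three, which is what makes the plot so slow.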

In [None]:
covidmax = df.groupby('iso_code').max(numeric_only=True).reset_index()
covidmax

In [None]:
covidmrg = pd.merge(world, covidmax, left_on='iso_a3', right_on='iso_code')
covidmrg

In [None]:
fig = plt.figure()
axes = fig.add_subplot()
covidmrg.plot(column='total_cases', ax=axes)

As before, this might need some per-capita treatment. Let's try this:

In [None]:
df['cases_pct'] = df.total_cases / df.population * 100

## Exercise 3:
Go back up a bunch of cells and:

* Comment out the testmerge and testmerge.plot() that didn't work
* Add a new column `total_cases_pct` to `dfall`, computed as `total_cases` / `population` * 100
* Re-column-slice `dfall`, adding `total_cases_pct` to the list of selected columns
* Re-groupby() `covidmax`
* Re-merge `covidmrg`
* Check the column `covidmrg.total_cases_pct` -- is it reasonable?
* Make a choropleth of the `total_cases_pct` column

## Mini-Exercise:
What's going on here? Look at the total_cases/deaths and new_cases/deaths columns:

In [None]:
covidmax = df.groupby('iso_code').max(numeric_only=True)
covidmax

In [None]:
covidsum = df.groupby('iso_code').sum(numeric_only=True)
covidsum

# State Shootings Example

In [None]:
# All years
wapoALL = pd.read_csv('https://corgis-edu.github.io/corgis/datasets/csv/police_shootings/police_shootings.csv')
wapoALL.info()

In [None]:
# 2016 only
wapo = wapoALL[ wapoALL['Incident.Date.Year'] == 2016 ]
wapo.head()

In [None]:
# Do these groupby().min() examples make sense?
wapo.groupby('Incident.Location.State').min().head()

In [None]:
# Do these groupby().max() examples make sense?
wapo.groupby('Incident.Location.State').max().head()

In [None]:
# Do these groupby().mean() examples make sense?
wapo.groupby('Incident.Location.State').mean(numeric_only=True).head()

For our purpose below, `count()` is most useful.

In [None]:
# This time we'll save the grouped DataFrame in a variable, named state_counts
# Note every column is countable, so every column gets counted, and yields the same count
# (as long as no cells are missing; count() skips empty cells)
# Also, even though python can't .add() non-numeric data, it can .count() it,
# so you don't need numeric_only=True
state_counts = wapo.groupby('Incident.Location.State').count()
state_counts.head()

In [None]:
sns.catplot(data=state_counts, x='Incident.Location.State', y='Person.Name', kind='bar')#, height=4, aspect=3)

Since all of the columns have the same counts, any is as good as any other. `Person.Name` holds the same counts as `Incident.Location.City`. But to make things less confusing, we can add a column with a better name.

In [None]:
state_counts['n_shootings'] = state_counts['Person.Name']
state_counts['n_shootings'].head()

In [None]:
# With a column named n_shootings, this looks more sensible:
sns.catplot(data=state_counts, x='Incident.Location.State', y='n_shootings', kind='bar')
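
One possible alternative, sketched on a made-up table: `groupby().size()` counts rows per group directly, so there is no need to pick an arbitrary column, and it is not affected by missing cells. The `toy` table and its `state`/`name` columns are hypothetical.

```python
import pandas as pd

# Hypothetical incident table; one name is missing on purpose
toy = pd.DataFrame({
    'state': ['CA', 'CA', 'TX', 'TX', 'TX'],
    'name':  ['a', None, 'c', 'd', 'e'],
})

# count() reports the non-missing cells in each column
per_column = toy.groupby('state').count()

# size() reports the number of rows in each group, as one named column
per_group = toy.groupby('state').size().reset_index(name='n_shootings')

print(per_column['name'].tolist())        # [1, 3]
print(per_group['n_shootings'].tolist())  # [2, 3]
```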

# Electoral Votes example

In [None]:
# This one also reads the csv off the web
ev2016 = pd.read_csv('https://raw.githubusercontent.com/RubeRad/camcom/master/2016ev.csv')
ev2016

**Note** in our `states` DataFrame, the column with state names is `name`. In this new DataFrame, the column is named `State`. It is important that the values in the Series are spelled and punctuated and capitalized *exactly* the same, or that part of the data won't merge.

After the merge, use `info()` and `head()` to verify that everything merged successfully -- same number of rows as before, and matched up properly.
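
One way to hunt down rows that failed to match is an outer merge with `indicator=True`, which adds a `_merge` column saying where each row came from. The two small tables below are hypothetical stand-ins for `states` and `ev2016`, with one name deliberately spelled differently.

```python
import pandas as pd

# Toy stand-ins; 'District of Columbia' vs 'D.C.' is a deliberate mismatch
states_toy = pd.DataFrame({'name': ['Ohio', 'Utah', 'District of Columbia']})
ev_toy = pd.DataFrame({'State': ['Ohio', 'Utah', 'D.C.'],
                       'Votes': [18, 6, 3]})

# The default (inner) merge silently drops rows that don't match
inner = pd.merge(states_toy, ev_toy, left_on='name', right_on='State')

# An outer merge keeps everything and labels each row's origin
check = pd.merge(states_toy, ev_toy, left_on='name', right_on='State',
                 how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']

print(len(states_toy), len(inner))   # 3 2
print(unmatched)
```

Any rows left in `unmatched` point to names that need to be fixed in one table or the other before the real merge.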

In [None]:
evmrg = pd.merge(states, ev2016, left_on='name', right_on='State')
evmrg.info()

In [None]:
evmrg.head() # scroll to the right to see the new columns

**As seen above,** choropleths color regions based on values in a numerical column. However, what if the data is categorical?

This shows an example of creating a new column with colors for categorical data, and having Geopandas map with those colors.

Let's make a new column and use `.loc` to fill it with colors to plot with:

In [None]:
evmrg['party_color'] = 'pink' # Red for Republicans, but a little less intense
evmrg.loc[ evmrg['Winning Party']=='Democrats', 'party_color' ] = 'lightblue'
evmrg.party_color.value_counts()

In [None]:
evmrg.head()
# scroll right to see new column 'party_color'

In [None]:
evmrg.plot(color=evmrg['party_color'], edgecolor='gray')

# Merging GeoDataFrames
The Washington Post data has 2-letter state abbreviations in `'Incident.Location.State'`, and the `states` mapping data has 2-letter state abbreviations in `'iso_3166_2'`, so we can merge them.

In [None]:
# Merge the WaPo shooting counts with the `states` map data, based on matching 2-letter code
wapo_merge = pd.merge(states, state_counts, left_on='iso_3166_2', right_on='Incident.Location.State')
wapo_merge.info()

In [None]:
# This is raw number of shootings, no accounting for state population
wapo_merge.plot(column='n_shootings')

But of course it would be more informative to graph per capita:

In [None]:
wapo_merge['n_per_cap'] = wapo_merge.n_shootings / wapo_merge.census

In [None]:
wapo_merge.plot(column='n_per_cap')

## Exercise
Using the capabilities examined in the choropleth exercise above, make an excellent choropleth visualization of `ev_per_million`.

In [None]:
# Here's the new column: electoral votes per million people:
evmrg['ev_per_million'] = evmrg['Votes'] / evmrg['census'] * 1000000
evmrg

In [None]:
sns.catplot(data=evmrg, x='iso_3166_2', y='ev_per_million', kind='bar')

In [None]:
evmrg.plot('ev_per_million')

# From Here:
Either the bar plot above, or this map, can be tailored in all the ways demonstrated in the above examples. Some ideas:
* Create a new column like shootings_per_million (see example ev_per_million above)
  * Use that for either the map plot column, or the bar plot y

For the bar plot:
* Control the aspect and color
* Sort the bars
* Color bars by hue='Winning Party'
  * (try palette='coolwarm', dodge=False)

For the map plot:
* Choose a useful palette from the Matplotlib colormaps link in Exercise 2
* Control the size/aspect
* Add a legend