July 21, 2020


To get started, go to:

# https://github.com/IDREsandbox/gisworkshop

Then, click on the JupyterHub or Binder link at the bottom of the page


<i>Note: This workshop will be recorded</i>


### Who are you?

<img src="images/whoareyou.png" style="width:800px">


<img src="images/haveyougis.png" style="width:800px">

<img src="images/haveyoupython.png" style="width:800px">

# Introduction to Spatial Analysis with Python

<img src="images/intro.png" style="height:500px">

Workshop survey:
https://bit.ly/2VkkyXm

#### Part 1: Mapping California Coronavirus Data
- Jupyter Notebooks
- Using Python in Jupyter Notebooks
- Using Python libraries: pandas and plotly express

#### Part 2: Interactive Web Mapping
- Interactive web-mapping within Jupyter Notebooks
- Using Python libraries: folium and altair


# Hello Jupyter Newbies

If this is your first time using Jupyter Notebooks, I highly recommend you visit the help link below. You can also take Ben Winjum's excellent "Introduction to Jupyter" course, which is available on our YouTube channel:

- https://www.youtube.com/watch?v=vqlJCBHQaHc

You can also find useful information in the official JupyterLab documentation:

- About this interface https://jupyterlab.readthedocs.io/en/stable/user/interface.html

To run the code on this page in order, select a cell (you should see a blue bar on the left side) and use the following keyboard shortcut to run it:

- `shift + enter`

This should take you to the next cell, where you can repeat `shift + enter` until you reach the end. You can modify the contents of any cell to experiment with the code, but note that doing so may impact the subsequent code.

Jupyter Notebooks have two different keyboard input modes:

<b>Command mode</b> - when a cell is selected but you are not typing in it. Indicated by a blue left margin.

<b>Edit mode</b> - when you’re typing in a cell. Indicated by a blinking cursor inside the cell.

<b>Command Mode</b>

* `shift + enter` run cell, select below
* `ctrl + enter` run cell
* `option (mac) or alt (pc) + enter` run cell, insert below
* `A` insert cell above
* `B` insert cell below
* `C` copy cell
* `V` paste cell
* `D , D` delete selected cell
* `Y` change cell to code mode
* `M` change cell to markdown mode (good for documentation)

<b>Edit Mode</b>
* `cmd (ctrl) + click` for multi-cursor editing
* `cmd (ctrl) + /` toggle comment lines
* `tab` code completion or indent
* `shift + tab` tooltip
* `ctrl + shift + -` split cell
* `shift + M` merge selected cells (this one works from command mode, so press `esc` first)

In [None]:
# practice edit mode
mynum = 3

for x in range(5):
    print(x*mynum)

## A "Spatial" Data Science Approach

<img src="images/spatialdatascience.png" style="height:700px">
Source: <a href="https://carto.com/what-is-spatial-data-science/" target="_blank">Carto</a>

## Our Workflow

1. Find and acquire data
1. Manage and clean the data
1. Explore
1. Model
1. Communicate/visualize

## Our Data

The LA Times Data Desk team has taken the lead in centralizing Los Angeles based COVID-19 datasets. Since shortly after the pandemic erupted in the US, they have maintained the following page to report real-time statistics.

<img src="images/latimes.png" style="height:300px">

https://www.latimes.com/projects/california-coronavirus-cases-tracking-outbreak/

To maintain transparency over their methods, they have made multiple datasets available on the GitHub page below, allowing academics to use them for research purposes under their terms of service (https://www.latimes.com/terms-of-service).


https://github.com/datadesk/california-coronavirus-data

## Libraries

For this session, we will be using two libraries: plotly and pandas. Make sure to install the libraries using `pip` or `conda`. If you have reached this Jupyter notebook via the IDRE GitHub page, or if you installed Anaconda and installed the packages listed in `requirements.txt`, the libraries should already be installed. If not, uncomment the install commands below and run them.

<img src="images/pandas.png" style="height:250px">
<img src="images/plotly.png" >

In [None]:
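# --yes answers conda's confirmation prompt, which a notebook cell cannot answer interactively
# !conda install plotly --yes
# !conda install pandas --yes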

Import libraries
- Plotly Express documentation https://plotly.com/python/plotly-express/
- Pandas: https://pandas.pydata.org/docs/

In [None]:
import plotly.express as px
import pandas as pd

## Using Python's pandas library to get data

Get the data from the LA Times directly from their GitHub page using `pd.read_csv(url)`. Doing so ensures that we are grabbing the latest dataset that they have uploaded. Note that it also adds a risk: if they change their data model, it can break the methods used in this session.

Data source: https://github.com/datadesk/california-coronavirus-data


In [None]:
latimes = pd.read_csv(
    "https://raw.githubusercontent.com/datadesk/california-coronavirus-data/master/latimes-place-totals.csv"
)
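Because the file is read live, a quick column check right after loading can flag a changed data model early. The sketch below is only an illustration: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, the column list is taken from the fields this notebook uses later, and the one-row `sample` table is made-up stand-in data so the example runs without a network connection.

```python
import pandas as pd

# Columns this notebook relies on in later queries and charts.
REQUIRED_COLUMNS = ["date", "county", "place", "confirmed_cases", "x", "y"]


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [name for name in required if name not in df.columns]


# Made-up stand-in row; note that the "y" column is deliberately left out.
sample = pd.DataFrame({
    "date": ["2020-07-20"],
    "county": ["Los Angeles"],
    "place": ["Westwood"],
    "confirmed_cases": [100],
    "x": [-118.44],
})

absent = missing_columns(sample)
print(absent)  # ['y']
```

With the real data you would call `missing_columns(latimes)` and expect an empty list.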

Note that you can always find help for a method using the following command:

In [None]:
# ?pd.read_csv

Preview the data by typing the variable name. If you are using the command line and not Jupyter, you have to use the `print()` function.

In [None]:
latimes
# if using command line
# print(latimes)

You can also just output the first 5 rows using `.head()`.

In [None]:
latimes.head()

How many rows and columns? Use the `.shape` attribute by typing it in below (note that `.shape` is an attribute, not a method, so you do not need the parentheses).

Output the column names using the `.columns` attribute.
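If you want to see what these two attributes return before trying them on `latimes`, here is a minimal sketch on a small made-up table. The name `toy` and its values are invented for illustration.

```python
import pandas as pd

# A tiny made-up table: 3 rows, 2 columns.
toy = pd.DataFrame({
    "place": ["A", "B", "C"],
    "confirmed_cases": [10, 20, 30],
})

# .shape is a (rows, columns) tuple; no parentheses because it is an attribute.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)  # 3 2

# .columns holds the column labels; list() turns them into a plain Python list.
column_names = list(toy.columns)
print(column_names)  # ['place', 'confirmed_cases']
```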

## Cleaning the data

Data is not perfect. In fact, data is never perfect. A close reading of this dataset shows that some problematic records need to be filtered out. For this session, let us filter out the following:

- empty confirmed_cases values (NaN's)
    - `confirmed_cases.isnull()`
- empty coordinates
    - `x.isnull()`
    - `y.isnull()`
- incorrect coordinates (ie, positive longitudes which are not possible in California)
    - `x > 0`
- null dates
    - `date.isnull()`

We will do so by using the pandas `.query()` method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html). This allows us to filter the dataset with a short expression written as a string, similar in spirit to a SQL `WHERE` clause.
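One detail worth knowing before querying for empty values: pandas stores a missing number as `NaN`, which is not the text `'NaN'` and does not even compare equal to itself. The sketch below uses a made-up three-row table (`toy` is an invented name) to show that an equality test finds nothing, while `.isnull()` finds the empty row.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing confirmed_cases value.
toy = pd.DataFrame({
    "place": ["A", "B", "C"],
    "confirmed_cases": [10, np.nan, 30],
})

# NaN is not equal to anything, including itself.
print(np.nan == np.nan)  # False

# Comparing against the text 'NaN' therefore matches no rows.
string_match = toy[toy["confirmed_cases"] == "NaN"]
print(len(string_match))  # 0

# .isnull() is the reliable test for missing values.
null_match = toy.query("confirmed_cases.isnull()", engine="python")
print(list(null_match["place"]))  # ['B']
```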

First, let's see if there are any records with `NaN` values for confirmed_cases:

In [None]:
# NaN never equals anything (not even the text 'NaN'), so test with .isnull()
latimes.query("confirmed_cases.isnull()", engine='python')

What about `NaN` values for x?

In [None]:
# same idea for the x coordinate: .isnull() finds the empty values
latimes.query("x.isnull()", engine='python')

And what about `y`? Try it on your own:

In [None]:
# What about positive x (longitude) coordinates?
latimes.query("x > 0")

In [None]:
# any null dates?
latimes.query("date.isnull()", engine='python')

Now combine all those arguments into a single `.query()` statement to update our data. Notice that we are reversing the conditions, so instead of `.isnull()` we use `.notnull()`, and instead of `x > 0` we keep only `x < 0`:

In [None]:
# keep only rows with a case count, both coordinates, a negative longitude, and a date
latimes = latimes.query("confirmed_cases.notnull() & x.notnull() & y.notnull() & x < 0 & date.notnull()", engine='python')
latimes.head()

How many records do we have now? And how does it compare with the number of records prior to the cleanup?
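One way to answer this is to store the row count before filtering and compare it with the count afterwards. The sketch below does that on a made-up four-row table in which three rows each break one rule. The names `raw` and `cleaned` and all values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: B has no case count, C has a positive longitude, D has no date.
raw = pd.DataFrame({
    "place": ["A", "B", "C", "D"],
    "confirmed_cases": [10, np.nan, 30, 40],
    "x": [-118.4, -118.2, 118.3, -118.1],
    "date": ["2020-07-20", "2020-07-20", "2020-07-20", None],
})

rows_before = len(raw)

# Keep only rows that pass every check.
cleaned = raw.query(
    "confirmed_cases.notnull() and x < 0 and date.notnull()",
    engine="python",
)

rows_after = len(cleaned)
print(rows_before, rows_after, rows_before - rows_after)  # 4 1 3
```

In the notebook itself, you would need to record `len(latimes)` before running the cleanup cell, since that cell overwrites `latimes`.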

## Find the most recent date

Our data, with over 80,000 records, is large. Let's create a sub table for the most current date, which is most likely yesterday. 

First, order the data by date using the `.sort_values()` pandas function (<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html" target="_blank">pandas sort_values</a>):


In [None]:
latimes = latimes.sort_values(by=["date"], ascending=True)

Output the last entries to see the most recent date in the table. Use `tail()` instead of `head()`:

In [None]:
latimes.tail()

What is the date of the last entry in our dataset? Let's use the pandas `.iloc` <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html" target="_blank">iloc indexer</a> (index location) to grab the last date.

* to get a value from the <b>first</b> row of a dataset: `iloc[0]['column_name']`

* to get a value from the <b>last</b> row of a dataset: `iloc[-1]['column_name']`.

In [None]:
# put it in a variable `lastdate`
lastdate = latimes.iloc[-1]['date']
lastdate
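Sorting and taking the last row works as long as the dates sort correctly as text. That holds for ISO-style strings such as `2020-07-21`, but whether the file always uses that format is an assumption here. As a sketch of an alternative, you can parse the column with `pd.to_datetime` and ask for the maximum directly. The table `toy` and its values are made up.

```python
import pandas as pd

# Made-up table with dates deliberately out of order.
toy = pd.DataFrame({
    "date": ["2020-07-19", "2020-07-21", "2020-07-20"],
    "confirmed_cases": [1, 3, 2],
})

# Approach used in this notebook: sort, then read the last row.
last_by_sort = toy.sort_values(by=["date"]).iloc[-1]["date"]

# Alternative: parse to real timestamps and take the maximum.
latest = pd.to_datetime(toy["date"]).max()
latest_text = latest.strftime("%Y-%m-%d")

print(last_by_sort, latest_text)  # 2020-07-21 2020-07-21
```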

Create a new dataset that will hold the data filtered by `lastdate` using `.query`. Notice the `@` sign in front of `lastdate` within the query argument, which indicates that it is referencing a variable.
* `.query` documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html

In [None]:
latimes_single_day = latimes.query('date==@lastdate')
latimes_single_day

Create a variable `latimes_LA` that is a filter for just Los Angeles County data.

In [None]:
latimes_LA = latimes.query("county=='Los Angeles'")

Now we have three datasets to work with:
- `latimes`: the entire database
- `latimes_single_day`: filtered for one day
- `latimes_LA`: just Los Angeles County data


# Stats
Get some stats about our data using <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html" target="_blank">.describe()</a>.

In [None]:
latimes.confirmed_cases.describe()

You can use `describe()` on grouped rows, such as by county:

In [None]:
latimes.groupby("county").confirmed_cases.describe()

Sort the table by `max` values.

In [None]:
latimes.groupby("county").confirmed_cases.describe().sort_values(by=["max"], ascending=False)

Now imagine we want to dig in further, and want to find out what is going on within Los Angeles. Let's try that by places in Los Angeles. Use our subset dataset for Los Angeles `latimes_LA` and group by `place`:

In [None]:
latimes_LA.groupby("place").confirmed_cases.describe().sort_values(by=["max"], ascending=False).head(50)

## Explore the data... bar charts to the rescue!
We will use the <a href="https://plotly.com/python/plotly-express/" target="_blank">plotly express</a> library, which claims to be a "terse, consistent, high-level API for rapid data exploration and figure generation." It is also great for producing quick and easy maps, which is one of the main goals in this session! And, unlike other libraries, plotly express allows for user interaction with the graphic elements it produces.
<img src="images/plotly.png">

We know that the original dataset titled `latimes-place-totals.csv` is about COVID-19 cases by place. Places are units derived from neighborhoods. Let's create a bar chart (<a href="https://plotly.com/python/bar-charts/" target="_blank">using plotly express</a>) of a very familiar neighborhood by UCLA:

In [None]:
WestLA = latimes.query("place == ['Westwood']")
px.bar(WestLA,
      x='date',
      y='confirmed_cases')

What about multiple places in one chart? Create a list of places, and query it by that list. Feel free to modify the list by adding places of interest:

In [None]:
WestLA = latimes.query("place == ['Westwood','Santa Monica','Culver City']")
px.bar(WestLA,
      x='date',
      y='confirmed_cases')

How about a legend, and colors to represent different neighborhoods in our stacked chart?

In [None]:
WestLA = latimes.query("place == ['Westwood','Culver City','Santa Monica']")
px.bar(WestLA,
      x='date',
      y='confirmed_cases',
      color = 'place')

You can also separate each neighborhood into its own chart using `facet_row`:

In [None]:
WestLA = latimes.query("place == ['Westwood','Culver City','Santa Monica']")
px.bar(WestLA,
      x='date',
      y='confirmed_cases',
      color = 'place',
      facet_row="place")

# Scatter Plots

Documentation: https://plotly.com/python/line-and-scatter/


To create a scatter plot, use the `px.scatter` function. The first argument must be the data frame you want to feed it; in this case, we will use our single day dataset, `latimes_single_day`. It must be followed by `x` and `y` values.

A scatter plot is defined by an x and a y axis. So too are spatial coordinates, albeit complicated by the Earth's spherical nature. Plot the `latimes_single_day` data with longitude (`x`) and latitude (`y`) on the axes. Also add `hover_name='place'` to display the place name when you hover over a point. If the points do not render in your browser, try adding `render_mode="svg"`; Chrome has had issues with `webgl`, which plotly may choose automatically for large datasets.

In [None]:
px.scatter(latimes_single_day,
           x='x',
           y='y',
           hover_name='place')

Let's add some color. Color code the dots by confirmed cases.

In [None]:
px.scatter(latimes_single_day,
           x='x',
           y='y',
           hover_name='place',
           color='confirmed_cases')

The colors are hard to see, especially when many points are clustered around the same area. Let's use `size` to scale each marker by its number of confirmed cases, with `size_max` (in pixels) setting the size of the largest marker.

In [None]:
px.scatter(latimes_single_day,
           x='x',
           y='y',
           color='confirmed_cases', 
           size='confirmed_cases',
           size_max=40, 
           hover_name='place',
           title = 'Confirmed Cases for ' + lastdate)

You can change the color scale with `color_continuous_scale`. Check out the available values here: https://plotly.com/python/builtin-colorscales/


In [None]:
px.scatter(latimes_single_day,
           x='x',
           y='y',
           color='confirmed_cases', 
           size='confirmed_cases',
           size_max=40, 
           hover_name='place',
           color_continuous_scale = 'RdYlGn_r') # added _r to reverse color scheme

You can also define a range `range_color` to control the lower and upper bounds of the color scale. First, get the mean of our single day dataset in order to define a relevant range:

In [None]:
latimes_single_day_mean = latimes_single_day.confirmed_cases.mean()
latimes_single_day_mean

Now that you know the mean, let's use that as the halfway point of our continuous scale, and therefore double the number to create our range.

In [None]:
px.scatter(latimes_single_day,
           x='x',
           y='y',
           color='confirmed_cases', 
           size='confirmed_cases',
           size_max=40, 
           hover_name='place',
           color_continuous_scale = 'RdYlGn_r', # added _r to reverse color scheme
           range_color = (0,latimes_single_day_mean * 2) # double the mean
          )

# Animated scatter

- https://plotly.com/python/animations/

Previously, we were looking at all the data on a single plot. We can instead create a frame for each date in the data, and then "play" the frames over time to animate them. Let's do so for just the LA County data, using `latimes_LA` as our data frame. Add `animation_frame='date'` to your scatter arguments. You can optionally add `animation_group='place'` as well, so that plotly treats each place as the same object from one frame to the next.


In [None]:
latimes_LA_mean = latimes_LA.confirmed_cases.mean()
latimes_LA_mean

In [None]:
px.scatter(latimes_LA,
           x='x',
           y='y',
           color='confirmed_cases', 
           size='confirmed_cases',
           size_max=40, 
           hover_name='place',
           animation_frame='date', # this creates a frame by frame animation by day
           color_continuous_scale = 'RdYlGn_r',
           range_color = (0,latimes_LA_mean*2))

# Putting it on a map

https://plotly.com/python/scatter-plots-on-maps/


The `scatter_geo` function puts your data on a map. Note that there are limitations. The geographic scope allows for global, continental, and USA maps, so it is not well suited to more localized data.

In [None]:
fig = px.scatter_geo(latimes_single_day,
           lon='x',
           lat='y',
           color='confirmed_cases', 
           size='confirmed_cases',
           size_max=40, 
           hover_name='place',
           scope='usa',
           color_continuous_scale = 'RdYlGn_r',
           range_color = (0,latimes_single_day_mean * 2) # double the mean 
            )

fig.update_geos(fitbounds="locations") 

In [None]:
fig = px.scatter_geo(latimes_LA,
           lon='x',
           lat='y',
           color='confirmed_cases', 
           size='confirmed_cases',
           size_max=40, 
           hover_name='place',
           scope='usa',                     
           animation_frame='date',
           color_continuous_scale = 'RdYlGn_r',
           range_color = (0,latimes_LA_mean*2))

fig.update_geos(fitbounds="locations") 

# Post workshop survey
Please take the following survey if you participated in any part of this workshop.

https://bit.ly/2VkkyXm