# Mapping COVID19 data in Python

## Help
- About this interface https://jupyterlab.readthedocs.io/en/stable/user/interface.html
- Jupyter keyboard shortcuts https://yoursdata.net/jupyter-lab-shortcut-and-magic-functions-tips/
- Plotly Express documentation https://plotly.com/python/plotly-express/
- Working with csv and pandas https://towardsdatascience.com/data-science-with-python-intro-to-loading-and-subsetting-data-with-pandas-9f26895ddd7f

# Hello Jupyter Newbies

If this is your first time using Jupyter Notebooks, I highly recommend visiting the help links above.

In order to run the code sequentially on this page, highlight a cell (you should see a blue bar on the left side) and use the following keyboard shortcut to run the cell:

- `shift + enter`

This should take you to the next cell, where you can repeat `shift + enter` until you reach the end. You can modify the contents of any cell to experiment with the code, but note that doing so may impact the subsequent code.

## Libraries

For this session, we will be using two libraries: plotly and pandas. Make sure to install them using `pip` or `conda`. If you reached this Jupyter notebook via the IDRE GitHub page, or if you installed Anaconda and ran `requirements.txt`, the libraries should already be installed. If not, uncomment the install commands below and run them.


In [None]:
# --yes answers the confirmation prompt, which cannot be answered from a notebook cell
# !conda install plotly --yes
# !conda install pandas --yes

Import libraries

In [None]:
import plotly.express as px
import pandas as pd

## Data

The LA Times Data Desk team has taken the lead in centralizing Los Angeles based COVID-19 datasets. Since shortly after the pandemic erupted in the US, they have maintained the following page to report real-time statistics.

https://www.latimes.com/projects/california-coronavirus-cases-tracking-outbreak/

To maintain transparency about their methods, they have made multiple datasets available on the GitHub page below, allowing academics to use them for research purposes under their terms of service (https://www.latimes.com/terms-of-service).

https://github.com/datadesk/california-coronavirus-data

## Using Python's pandas library to get data

Get the data from the LA Times. We can grab the data directly from their GitHub page. Doing so ensures that we are grabbing the latest dataset they have uploaded. Note that it also adds a risk: if they change their data model, the methods used in this session may break.

- <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html" target="_blank">reference for pandas read_csv</a>

In [None]:
latimes = pd.read_csv(
    "https://raw.githubusercontent.com/datadesk/california-coronavirus-data/master/latimes-place-totals.csv"
)
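If the URL is unreachable, or you just want to experiment offline, `read_csv` also accepts a file-like object. The sketch below parses a few hypothetical rows; the values are made up, and only the column names mirror the ones used later in this notebook.

```python
import io
import pandas as pd

# Hypothetical rows; only the column names match what this notebook uses
csv_text = """date,county,place,confirmed_cases,x,y
2020-04-01,Los Angeles,Glendale,120,-118.25,34.15
2020-04-02,Los Angeles,Glendale,135,-118.25,34.15
2020-04-02,Orange,Irvine,,-117.82,33.68
"""

# read_csv accepts a file-like object as well as a URL or a path
sample = pd.read_csv(io.StringIO(csv_text))

# The empty confirmed_cases field is read in as NaN
print(sample.dtypes)
```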

Note that you can always find help for a method using the following command:

In [None]:
?pd.read_csv

Preview the data by typing its name. If you are using the command line and not Jupyter, you have to use the `print()` function.

In [None]:
latimes
# if using command line
# print(latimes)

You can also just output the first 5 rows using `.head()`.

In [None]:
latimes.head()

How many rows and columns? Use the `.shape` attribute by typing it in below (note that `.shape` is an attribute, not a method, so you do not need the parentheses).

Output the columns using the `.columns` attribute.
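As a small sketch of both attributes, here they are on a hypothetical three-column frame. In the notebook you would type `latimes.shape` and `latimes.columns` instead.

```python
import pandas as pd

# Hypothetical stand-in for the latimes data frame
sample = pd.DataFrame({
    "date": ["2020-04-01", "2020-04-02"],
    "place": ["Glendale", "Glendale"],
    "confirmed_cases": [120, 135],
})

# .shape is a (rows, columns) tuple; no parentheses because it is an attribute
n_rows, n_cols = sample.shape

# .columns is an Index of column names; list() makes it easier to read
column_names = list(sample.columns)

print(n_rows, n_cols)
print(column_names)
```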

## Cleaning the data

Data is not perfect. In fact, data is never perfect. A close reading of the data shows that some problematic records need to be filtered out. For this session, let us filter out the following:

- empty values (NaNs)
    - keep rows where `confirmed_cases.notnull()`
- incorrect coordinates (i.e., positive longitudes, which are not possible in California)
    - keep rows where `x < 0`
- null dates
    - keep rows where `date.notnull()`

We will do so by using the pandas query method (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html). Note that a comparison such as `confirmed_cases != 'NaN'` does not remove missing values, because NaN compares unequal to everything, including the string `'NaN'`; `notnull()` is the reliable test.

In [None]:
latimes = latimes.query("confirmed_cases.notnull() & (x < 0) & date.notnull()", engine='python')
latimes.head()
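To see what a filter like this keeps, here is a minimal sketch on a hypothetical four-row frame where only the first row is clean. It tests for missing values with `notnull()`, since a missing value compares unequal to any value you test it against.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one clean, one with no date, one with no cases,
# and one whose longitude has the wrong sign
sample = pd.DataFrame({
    "date": ["2020-04-01", None, "2020-04-02", "2020-04-02"],
    "confirmed_cases": [10.0, 5.0, np.nan, 7.0],
    "x": [-118.2, -118.3, -118.4, 118.5],
})

# Comparing against the string 'NaN' is True for every row, so it drops nothing
string_compare = sample["confirmed_cases"] != "NaN"

# notnull() is what actually detects the missing values
cleaned = sample.query(
    "confirmed_cases.notnull() & (x < 0) & date.notnull()",
    engine="python",
)

print(string_compare.all())
print(cleaned)
```

Comparing `.shape` before and after the filter is a quick way to confirm how many rows were dropped.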

Output the number of rows and columns using `.shape`

## Find the most recent date

Let's create a sub table of rows for the most current date. 

Order the data by date using <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html" target="_blank">pandas sort_values</a>:


In [None]:
latimes = latimes.sort_values(by=["date"], ascending=True)
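Sorting the `date` column as text gives chronological order only if the dates are stored as `YYYY-MM-DD` strings, which is an assumption about the file. A more defensive sketch, using hypothetical rows, parses the strings into timestamps first:

```python
import pandas as pd

# Hypothetical rows, deliberately out of order
sample = pd.DataFrame({
    "date": ["2020-04-10", "2020-04-02", "2020-03-30"],
    "confirmed_cases": [30, 20, 10],
})

# Parse the strings into real timestamps, then sort oldest to newest
sample["date"] = pd.to_datetime(sample["date"])
sample = sample.sort_values(by=["date"], ascending=True)

# The last row now holds the most recent date
print(sample["date"].iloc[-1])
```

If you convert the column this way in the notebook, values taken from it become `Timestamp` objects rather than strings, so any later string concatenation would need `str(...)` around them.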

Output the last entries to see the most recent date in the table. Use `tail()` instead of `head()`:

In [None]:
latimes.tail()

What is the date of the last entry in our dataset? Let's use the pandas <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html" target="_blank">iloc indexer</a> (integer location) to grab the last date.

To get the value of a single element, combine a row position with a column name, for example `iloc[0]['column_name']` for the first row or `iloc[-1]['column_name']` for the last.

In [None]:
# iloc[-1] grabs the last row in the data
lastrow = latimes.iloc[-1]
lastrow

In [None]:
# specify the exact column name you want, in this case ['date']
lastdate = latimes.iloc[-1]['date']
lastdate

Create a new variable that will hold the data filtered by `lastdate` using `.query`. Notice the `@` sign in front of `lastdate` within the query argument, which indicates that it is referencing a variable.

In [None]:
latimes_single_day = latimes.query('date==@lastdate')
latimes_single_day
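Here is the same pattern as a self-contained sketch on hypothetical rows (`sample` and `latest` are illustrative names), alongside the equivalent boolean mask so you can see that both select the same rows.

```python
import pandas as pd

# Hypothetical rows covering two dates
sample = pd.DataFrame({
    "date": ["2020-04-01", "2020-04-02", "2020-04-02"],
    "place": ["Glendale", "Glendale", "Burbank"],
    "confirmed_cases": [120, 135, 60],
})

# Most recent date: sort, then read the date from the last row
latest = sample.sort_values(by=["date"]).iloc[-1]["date"]

# @latest refers to the Python variable, not to a column named "latest"
single_day = sample.query("date == @latest")

# The same filter written as a boolean mask
same_rows = sample[sample["date"] == latest]

print(single_day)
```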

Create another filter for just Los Angeles County data.

In [None]:
latimes_LA = latimes.query("county == 'Los Angeles'")
latimes_LA

Now we have three variables to work with:
- `latimes`: the entire database
- `latimes_single_day`: filtered for one day
- `latimes_LA`: just Los Angeles County data


# Stats
Get some stats about our data using <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html" target="_blank">.describe()</a>.

In [None]:
latimes.confirmed_cases.describe()

You can use `describe()` on grouped rows, such as by county:

In [None]:
latimes.groupby("county").confirmed_cases.describe()

Sort the table by `max` values.

In [None]:
latimes.groupby("county").confirmed_cases.describe().sort_values(by=["max"], ascending=False)

Let's try that by date. Change the `.groupby` argument to `date` instead of `county`.
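One way this could look, sketched on a hypothetical four-row frame rather than the full dataset:

```python
import pandas as pd

# Hypothetical rows: two counties reported on each of two dates
sample = pd.DataFrame({
    "date": ["2020-04-01", "2020-04-01", "2020-04-02", "2020-04-02"],
    "county": ["Los Angeles", "Orange", "Los Angeles", "Orange"],
    "confirmed_cases": [100, 40, 150, 55],
})

# Group by date instead of county, then sort by the largest value per date
by_date = (
    sample.groupby("date")
    .confirmed_cases.describe()
    .sort_values(by=["max"], ascending=False)
)

print(by_date[["count", "mean", "max"]])
```

In the notebook, replacing `sample` with `latimes` gives one row of summary statistics per reporting date.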

# Scatter Plots

Documentation: https://plotly.com/python/line-and-scatter/


Let's create a non-spatial scatter plot. We will use the <a href="https://plotly.com/python/plotly-express/" target="_blank">plotly express</a> library, which describes itself as a "terse, consistent, high-level API for rapid data exploration and figure generation." It is also great for producing quick and easy maps, which is one of the main goals in this session! And, unlike libraries that produce static images, plotly express lets users interact with the graphic elements it produces.

To create a scatter plot, use the `px.scatter` function. The first argument must be the data frame you want to feed it; in this case, we will use our full dataset, `latimes`. It must be followed by `x` and `y` values. Let's put `date` on the x axis and `confirmed_cases` on the y axis.

In [None]:
px.scatter(latimes,
           x="date",
           y="confirmed_cases")

Let's add some color to differentiate the dots by county.

In [None]:
px.scatter(latimes,
           x="date",
           y="confirmed_cases",
           color="county")

# "Scatter" maps

Let's think spatially now. A scatter plot is defined by an x and a y axis. So too are spatial coordinates, although the spherical shape of the earth complicates matters. Plot the `latimes` data with longitude (`x`) and latitude (`y`) on the axes. Also add `hover_name='place'` to display the place name when you hover over a point.

In [None]:
px.scatter(latimes,
           x='x',
           y='y',
           hover_name='place')

Let's add some color. Color code the dots by confirmed cases.

In [None]:
px.scatter(latimes,
           x='x',
           y='y',
           color='confirmed_cases')

The colors are hard to see, especially when many points are clustered around the same area. Let's use size as another visual measure of the number of confirmed cases.

In [None]:
px.scatter(latimes,
           x='x',
           y='y',
           color='confirmed_cases', 
           size='confirmed_cases',
           size_max=80, 
           hover_name='place')

You can change the color scale. Check out the available values here: https://plotly.com/python/builtin-colorscales/


In [None]:
px.scatter(latimes,
           x='x',
           y='y',
           color='confirmed_cases', 
           size='confirmed_cases',
           size_max=40, 
           hover_name='place',
           color_continuous_scale = 'OrRd')

# Animated scatter

- https://plotly.com/python/animations/

Previously, we were looking at all the data on a plot. We can create a frame for each date in the data, and then "play" it over time to animate it. Let's do so for just the LA County data, using `latimes_LA` as our data frame. Add `animation_frame` and `animation_group` to your scatter arguments.


In [None]:
px.scatter(latimes_LA,
           x='x',
           y='y',
           color='confirmed_cases', 
           size='confirmed_cases',
           size_max=40, 
           hover_name='place',
           animation_frame='date',
           animation_group='place',
           color_continuous_scale = 'OrRd')

# Putting it on a map

https://plotly.com/python/scatter-plots-on-maps/


The `scatter_geo` function puts your data on a map. Note that there are limitations: the geographic scope supports world, continental, and USA maps, so it is not well suited to more localized data.

In [None]:
px.scatter_geo(latimes_single_day,
           lon='x',
           lat='y',
           color='confirmed_cases', 
           size='confirmed_cases',
           size_max=40, 
           hover_name='place',
           scope='usa',
           color_continuous_scale = 'OrRd')

## Mapbox

Help: https://plotly.com/python/scattermapbox/

Plotly also comes with a function to add data to a <a href="https://mapbox.com" target="_blank">Mapbox</a> interface. Mapbox does require a unique access token, so you will need to create an account and acquire one.

In [None]:
?px.scatter_mapbox

In [None]:
access_token = 'pk.eyJ1IjoieW9obWFuIiwiYSI6IkxuRThfNFkifQ.u2xRJMiChx914U7mOZMiZw'
px.set_mapbox_access_token(access_token)
px.scatter_mapbox(latimes_LA, 
                  lat="y", 
                  lon="x",     
                  color="confirmed_cases", 
                  size="confirmed_cases",
                  size_max=30, 
                  opacity=0.5,
                  zoom=5,
                  mapbox_style="dark",
                  hover_name='place',
                  color_continuous_scale = 'YlOrRd',
                  height=600,
                  title = 'LA Times Covid-19 Maps for ' + lastdate)
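Once you have your own token, you may prefer not to paste it into the notebook. A minimal sketch, assuming you store it in an environment variable before starting Jupyter (`MAPBOX_TOKEN` and `load_mapbox_token` are hypothetical names chosen for this example):

```python
import os


def load_mapbox_token(var_name="MAPBOX_TOKEN"):
    """Read a Mapbox token from an environment variable (hypothetical name)."""
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token


try:
    access_token = load_mapbox_token()
except RuntimeError as err:
    # No token configured; report it instead of failing the whole cell
    access_token = None
    print(err)
```

The resulting `access_token` can then be passed to `px.set_mapbox_access_token` exactly as in the cell above.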

# Post workshop survey
Please take the following survey if you participated in any part of this workshop.

https://bit.ly/39GNKfS