In this notebook, you'll combine the data you extracted last week with data showing the count of COVID cases by zip code.

In [115]:
import pandas as pd
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns

First, import the cases data (contained in the file `COVID_CountByZipPerDate 03292021.csv` in the `data` folder). Save it in a DataFrame named `cases`.

In [120]:
# Your Code Here

Take a look at this DataFrame so that you have an understanding of how it is structured.
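As a rough sketch of loading and inspecting a file like this, the example below reads a small made-up CSV from a string instead of the real file. The column names `Date`, `Zip`, and `Cases` are assumptions for illustration; check the actual file to see what it contains.

```python
import io
import pandas as pd

# Made-up stand-in for the CSV contents; the real file's columns may differ.
csv_text = """Date,Zip,Cases
2021-03-01,37013,5
2021-03-01,37211,3
2021-03-02,37013,2
"""

# With the real file you would pass its path instead of a StringIO object,
# e.g. pd.read_csv('../data/COVID_CountByZipPerDate 03292021.csv')
# (the relative path is an assumption based on the other cells).
cases = pd.read_csv(io.StringIO(csv_text))

print(cases.head())   # first few rows
cases.info()          # column names, dtypes, and non-null counts
```

`head()` and `info()` together usually show enough to tell whether each row is one zip code on one date, and which columns might contain missing values.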

Aggregate the cases per zip code to get a total (cumulative) count of cases per zip code. Convert the result to a DataFrame named `total_cases`.

In [119]:
# Your Code Here
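One possible way to aggregate, shown on toy data with made-up values. It assumes the zip code column is named `Zip` and the count column is named `Cases`; adjust the names to match your `cases` DataFrame.

```python
import pandas as pd

# Toy data with made-up values; 'Zip' and 'Cases' are assumed column names.
cases = pd.DataFrame({
    'Zip': [37013, 37211, 37013, 37211, 37207],
    'Cases': [5, 3, 2, 4, 1],
})

# groupby(...).sum() returns a Series indexed by Zip;
# reset_index() converts it to a DataFrame with Zip as a regular column.
total_cases = cases.groupby('Zip')['Cases'].sum().reset_index()
print(total_cases)
```

Calling `reset_index()` keeps `Zip` available as a column, which is convenient for the merges later in the notebook.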

Create a plot showing the number of cases per zip code. Which zip code has the highest total number of cases?

In [121]:
# Your Code Here
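A minimal sketch of a bar chart plus a lookup of the zip code with the most cases. The numbers are made up, and the `Zip` and `Cases` column names are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy totals with made-up values.
total_cases = pd.DataFrame({
    'Zip': [37013, 37207, 37211],
    'Cases': [700, 150, 650],
})

# Sort so the tallest bar comes first, then draw one bar per zip code.
ordered = total_cases.sort_values('Cases', ascending=False)
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(ordered['Zip'].astype(str), ordered['Cases'])
ax.set_xlabel('Zip')
ax.set_ylabel('Total cases')
ax.tick_params(axis='x', rotation=90)

# idxmax() returns the row label of the largest value;
# .loc then looks up the zip code stored in that row.
top_zip = total_cases.loc[total_cases['Cases'].idxmax(), 'Zip']
print(top_zip)
```

Converting `Zip` to a string before plotting makes matplotlib treat each zip code as a category rather than as a position on a numeric axis.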

Now we're going to incorporate some data retrieved from the census.

We can reuse the code from last week to fetch and prepare the `race` and `hhinc` DataFrames.

In [68]:
engine = create_engine('sqlite:///../data/census_data.sqlite')

In [69]:
query = '''
SELECT *
FROM race;
'''

race = pd.read_sql(query, con = engine)
race['pct_white'] = race['Not Hispanic or Latino_White alone'] / race['Total']

In [70]:
query = '''
SELECT *
FROM hhinc;
'''

hhinc = pd.read_sql(query, con = engine)

hhinc['Total_less_than_60000'] = hhinc[['Less than $10,000', '$10,000 to $14,999',
       '$15,000 to $19,999', '$20,000 to $24,999', '$25,000 to $29,999',
       '$30,000 to $34,999', '$35,000 to $39,999', '$40,000 to $44,999',
       '$45,000 to $49,999', '$50,000 to $59,999']].sum(axis = 1)

hhinc['pct_less_than_60000'] = hhinc['Total_less_than_60000'] / hhinc['Total:']

hhinc['low_income'] = hhinc['pct_less_than_60000'] >= 0.5

We don't need all of the columns from these two DataFrames. Slice `race` so that it only includes the 'geoid', 'zip', 'Total', and 'pct_white' columns, and slice `hhinc` so that it only includes the 'geoid', 'zip', and 'low_income' columns.

In [71]:
# Your Code Here
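Column selection can be sketched as follows. The toy DataFrames use made-up values, and `other_col` is a hypothetical extra column standing in for the ones you want to drop.

```python
import pandas as pd

# Toy stand-ins with made-up values; 'other_col' is hypothetical.
race = pd.DataFrame({
    'geoid': ['A', 'B'], 'zip': [37013, 37211],
    'Total': [1000, 2000], 'pct_white': [0.4, 0.6], 'other_col': [1, 2],
})
hhinc = pd.DataFrame({
    'geoid': ['A', 'B'], 'zip': [37013, 37211],
    'low_income': [True, False], 'other_col': [3, 4],
})

# Passing a list of names inside [] keeps only those columns, in that order.
race = race[['geoid', 'zip', 'Total', 'pct_white']]
hhinc = hhinc[['geoid', 'zip', 'low_income']]
```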

Before merging, rename the 'zip' columns in `race` and `hhinc` to 'Zip' so that they match the column name in the `total_cases` DataFrame.

Also, rename the 'Total' column in `race` to 'population'.

In [122]:
# Your Code Here
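Renaming can be done with the `.rename()` method and a dictionary mapping old names to new names. The sketch below uses toy DataFrames with made-up values.

```python
import pandas as pd

# Toy stand-ins with made-up values.
race = pd.DataFrame({
    'geoid': ['A', 'B'], 'zip': [37013, 37211],
    'Total': [1000, 2000], 'pct_white': [0.4, 0.6],
})
hhinc = pd.DataFrame({
    'geoid': ['A', 'B'], 'zip': [37013, 37211],
    'low_income': [True, False],
})

# rename() returns a new DataFrame, so assign the result back.
race = race.rename(columns={'zip': 'Zip', 'Total': 'population'})
hhinc = hhinc.rename(columns={'zip': 'Zip'})
```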

Do a series of two merges:

First, merge the `total_cases` DataFrame and the `race` DataFrame together and save the result back to `total_cases`.

Second, merge the `total_cases` DataFrame and the `hhinc` DataFrame together and save the result back to `total_cases`.

In [123]:
# Your Code Here
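A sketch of the two merges on toy data with made-up values. It assumes an inner join (the pandas default) and assumes that `race` and `hhinc` both still carry `geoid`, so the second merge uses both shared columns as keys to avoid ending up with `geoid_x` and `geoid_y`. Other choices, such as `how='left'`, may suit your data better.

```python
import pandas as pd

# Toy stand-ins with made-up values.
total_cases = pd.DataFrame({'Zip': [37013, 37211], 'Cases': [700, 650]})
race = pd.DataFrame({'geoid': ['A', 'B'], 'Zip': [37013, 37211],
                     'population': [1000, 2000], 'pct_white': [0.4, 0.6]})
hhinc = pd.DataFrame({'geoid': ['A', 'B'], 'Zip': [37013, 37211],
                      'low_income': [True, False]})

# First merge: attach population and pct_white by zip code.
total_cases = total_cases.merge(race, on='Zip')

# Second merge: attach low_income, matching on both shared columns.
total_cases = total_cases.merge(hhinc, on=['geoid', 'Zip'])
print(total_cases)
```

If the `Zip` columns have different dtypes (for example, strings in one DataFrame and integers in another), the merge can fail or match nothing, so it is worth checking `dtypes` first.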

Using the `Cases` and `population` columns, create a new calculated column, `cases_per_100000`.

In [124]:
# Your Code Here
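The rate is the number of cases divided by the population, scaled to 100,000 residents. A minimal sketch on made-up numbers:

```python
import pandas as pd

# Toy data with made-up values.
total_cases = pd.DataFrame({
    'Zip': [37013, 37211],
    'Cases': [50, 30],
    'population': [10000, 20000],
})

# Column arithmetic is element-wise, so this computes one rate per row.
total_cases['cases_per_100000'] = (
    total_cases['Cases'] * 100000 / total_cases['population']
)
print(total_cases)
```

For example, 50 cases among 10,000 residents corresponds to 500 cases per 100,000.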

Create a bar plot to display this newly calculated column.

In [125]:
# Your Code Here

Look back at your `total_cases` DataFrame to investigate any zip codes with unusually high or low values. Why might these values be so high or low?

For the zip codes 37027, 37072, 37080, and 37138, at least 10% of the residents live outside of Davidson County. Remove the rows for these zip codes from the DataFrame.

The following cell accomplishes this. It also removes any rows whose `Zip` value is NaN.

In [100]:
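# Drop zip codes where at least 10% of residents live outside Davidson County
total_cases = total_cases[~total_cases['Zip'].isin([37027, 37072, 37080, 37138])]
# Drop rows with a missing Zip
total_cases = total_cases[~total_cases['Zip'].isna()]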

**Question:** Does there appear to be any difference in `cases_per_100000` for zip codes identified as `low_income` vs. those that are not?

In [101]:
# Your Code Here
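One way to start on this question is to compare summary statistics between the two groups. The sketch below uses made-up values and only describes the data; it does not test whether a difference is statistically meaningful.

```python
import pandas as pd

# Toy data with made-up values.
total_cases = pd.DataFrame({
    'low_income': [True, True, False, False],
    'cases_per_100000': [900.0, 1100.0, 400.0, 600.0],
})

# One row per group, with the number of zip codes and typical rates.
summary = (
    total_cases
    .groupby('low_income')['cases_per_100000']
    .agg(['count', 'mean', 'median'])
)
print(summary)
```

A box plot of `cases_per_100000` split by `low_income` would show the same comparison visually, including the spread within each group.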

**Question:** Does there appear to be any relationship between a zip code's `pct_white` value and `cases_per_100000` value?

In [112]:
# Your Code Here
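A correlation coefficient is one simple summary of how two numeric columns move together. The sketch below uses made-up values chosen to be perfectly linear, so real data will look much noisier.

```python
import pandas as pd

# Toy data with made-up, perfectly linear values.
total_cases = pd.DataFrame({
    'pct_white': [0.2, 0.4, 0.6, 0.8],
    'cases_per_100000': [900.0, 700.0, 500.0, 300.0],
})

# Series.corr() computes the Pearson correlation by default.
r = total_cases['pct_white'].corr(total_cases['cases_per_100000'])
print(round(r, 3))
```

A scatter plot, for instance `total_cases.plot(kind='scatter', x='pct_white', y='cases_per_100000')`, is worth drawing alongside the coefficient, since a single number can hide outliers or a nonlinear pattern.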

Finally, we will reuse this DataFrame in future weeks. Run the following cell, which uses the `.to_csv()` method, to save your work.

In [None]:
total_cases.to_csv('../data/total_cases.csv', index = False)