In this notebook, you'll combine the data you extracted last week with data showing the count of COVID cases by zip code.

In [None]:
import pandas as pd
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns

First, import the cases data (contained in the file `COVID_CountByZipPerDate 03292021.csv` in the `data` folder). Save it in a DataFrame named `cases`.
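
As a rough illustration of reading CSV data into a DataFrame, the sketch below parses a small made-up CSV from an in-memory string so that it runs without any file. With a real file you would pass the path as a string instead (for example `'../data/COVID_CountByZipPerDate 03292021.csv'`, where the `../data/` prefix is an assumption based on the path used at the end of this notebook). The toy rows are invented for the example.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk (assumption for illustration).
csv_text = "Zip,Cases\n37201,3\n37203,2\n"

# pd.read_csv accepts a file path or any file-like object.
toy_cases = pd.read_csv(io.StringIO(csv_text))
```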

In [None]:
# Your Code Here

Take a look at this DataFrame so that you have an understanding of how it is structured.

In [None]:
cases

Aggregate the cases per zip code to get a total (cumulative) count of cases per zip code. Convert the result to a DataFrame named `total_cases`.
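
One possible way to do this kind of aggregation is `groupby` followed by `sum`, sketched here on made-up data. The `Zip` and `Cases` column names match those used later in this notebook; the `date` column name and all values are hypothetical.

```python
import pandas as pd

# Toy per-date counts (invented values; "date" is a hypothetical column name).
toy_daily = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-02", "2021-03-01", "2021-03-02"],
    "Zip": [37201, 37201, 37203, 37203],
    "Cases": [3, 5, 2, 7],
})

# Sum the daily counts within each zip code; reset_index() turns the
# resulting Series back into a DataFrame with "Zip" as a regular column.
toy_total = toy_daily.groupby("Zip")["Cases"].sum().reset_index()
```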

In [None]:
# Your Code Here

Create a plot showing the number of cases per zip code. Which zip code has the highest total number of cases?
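
A minimal sketch of one way to approach this, using pandas' built-in bar plot on made-up totals. The values are invented, and sorting before plotting is just one design choice that makes the largest bar easy to spot.

```python
import pandas as pd

# Toy totals per zip code (invented values).
toy_total = pd.DataFrame({"Zip": [37013, 37211, 37209], "Cases": [120, 150, 60]})

# Sort so the tallest bar comes first, then draw one bar per zip code.
ordered = toy_total.sort_values("Cases", ascending=False)
ax = ordered.plot.bar(x="Zip", y="Cases", legend=False)
ax.set_ylabel("Total cases")

# The zip code with the highest total can also be looked up directly.
top_zip = toy_total.loc[toy_total["Cases"].idxmax(), "Zip"]
```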

In [None]:
# Your Code Here

Now we're going to incorporate some data retrieved from the census.

First, use the `create_engine()` function to connect to the database (`census_data.sqlite`, which is in the data folder).
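
The sketch below shows the general shape of a SQLite connection with SQLAlchemy. It uses an in-memory database (the URL `sqlite://`) only so that it runs without any file; the name `demo_engine` is hypothetical. For a file on disk, the URL takes the form `sqlite:///` followed by the path, and the path shown in the comment is an assumption about where the data folder sits relative to this notebook.

```python
from sqlalchemy import create_engine, text

# In-memory SQLite database, used here only so the sketch is self-contained.
demo_engine = create_engine("sqlite://")

# For a database file, the path follows "sqlite:///", for example:
# create_engine("sqlite:///../data/census_data.sqlite")  # path is an assumption

# Quick check that the connection works.
with demo_engine.connect() as connection:
    answer = connection.execute(text("SELECT 1")).scalar()
```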

In [None]:
# Your Code Here

To see all tables contained in this database, you can run the following cell.

In [None]:
query = '''
SELECT 
    name
FROM 
    sqlite_master 
WHERE 
    type ='table' AND 
    name NOT LIKE 'sqlite_%';
'''

# Run the query and show the table names as a DataFrame
pd.read_sql(query, con = engine)

We are going to look specifically at the race and the hhinc (household income) tables.

[Studies](https://www.ama-assn.org/delivering-care/health-equity/data-10-cities-show-covid-19-impact-based-poverty-race) have seen higher rates of infection and death from COVID-19 in more racially diverse counties compared to those which are substantially white with the same income level.

Eventually, we'll compare infection rates for majority white vs. majority non-white zip codes in Davidson County. For now, let's just identify which zip codes are majority white.

**Step 1:** Write a query to retrieve all rows from the race table. Run this query and save the results into a pandas DataFrame named `race`.
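
To illustrate the pattern of selecting every row from a table and loading it into pandas, here is a sketch against a throwaway in-memory database. The table name `demo`, its columns, and its rows are all made up for the example. It wraps the query in `text()` and passes an open connection, which is one way to call `pd.read_sql`; passing the query string and the engine directly, as this notebook does, is also common.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Throwaway in-memory database with a hypothetical table named "demo".
demo_engine = create_engine("sqlite://")
with demo_engine.begin() as connection:
    connection.execute(text("CREATE TABLE demo (zip INTEGER, total INTEGER)"))
    connection.execute(text("INSERT INTO demo VALUES (37201, 100), (37203, 250)"))

# SELECT * with no WHERE clause returns every row and every column.
demo_query = '''
SELECT *
FROM demo;
'''

with demo_engine.connect() as connection:
    demo = pd.read_sql(text(demo_query), con=connection)
```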

In [None]:
query = '''
# Fill this in
'''

In [None]:
race = pd.read_sql(query, con = engine)

**Step 2:** Create a column named `pct_white` by dividing the `Not Hispanic or Latino_White alone` column by the `Total` column.
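
Dividing one column by another works element by element in pandas. The sketch below shows this on a made-up two-row table that reuses the column names from the `race` table; the values are invented.

```python
import pandas as pd

# Toy stand-in for the race table (invented values).
toy_race = pd.DataFrame({
    "zip": [37201, 37203],
    "Total": [1000, 2000],
    "Not Hispanic or Latino_White alone": [600, 800],
})

# Row-wise share: each row's white-alone count divided by that row's total.
toy_race["pct_white"] = (
    toy_race["Not Hispanic or Latino_White alone"] / toy_race["Total"]
)
```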

In [None]:
# Your Code Here

**Question:** How many zip codes in this table are majority white (meaning `pct_white` is at least 0.5)?
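
One way to count rows that meet a condition is to build a boolean Series and sum it, since `True` counts as 1. A sketch on made-up values:

```python
import pandas as pd

# Toy table (invented values); 0.5 is included to show that ">=" keeps it.
toy_race = pd.DataFrame({
    "zip": [37201, 37203, 37205],
    "pct_white": [0.6, 0.4, 0.5],
})

# True where the share is at least one half.
is_majority_white = toy_race["pct_white"] >= 0.5

# Summing booleans counts the True values.
majority_white_count = int(is_majority_white.sum())
```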

In [None]:
# Your Code Here

[Other recent studies](https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2779417) have looked at the relationship between the level of income inequality in an area and the number of COVID-19 infections. 

According to [Census Reporter](https://censusreporter.org/profiles/05000US47037-davidson-county-tn/) the median household income in Davidson County is $63,938.

Let's now identify which zip codes in Davidson County have a median household income less than $60,000.

**Step 1:** Write and run a query to fetch all rows from the `hhinc` table. Save the results into a DataFrame named `hhinc`.

In [None]:
query = '''
# Fill This In
'''

hhinc = pd.read_sql(query, con = engine)

In [None]:
hhinc.head()

In [None]:
hhinc.columns

We will identify the zip codes that have median incomes less than $60,000 using the following procedure:

1. Get a total count for all income levels from "Less than \\$10,000" through "\\$50,000 to \\$59,999".

2. Use this count to get a percentage of households in that zip code whose income is less than \\$60,000.

3. Identify zip codes for which this percentage is at least 50%.

For the first step, we can use the `.sum()` method.

By default, `pandas` sums down each column, giving one total per column. We can change this behavior with the `axis` argument of the `.sum()` method: `axis = 1` tells pandas to add across the columns within each row, giving one total per row.

Once you are satisfied that the calculation is being done correctly, save the result back to a column named "Total_less_than_60000".
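
The difference between the default and `axis = 1` can be seen on a small made-up table. Only two of the income brackets are included here to keep the sketch short, and the counts are invented.

```python
import pandas as pd

# Toy stand-in for the hhinc table with two brackets (invented counts).
toy_income = pd.DataFrame({
    "Less than $10,000": [10, 5],
    "$10,000 to $14,999": [20, 15],
    "Total:": [100, 50],
})
bracket_columns = ["Less than $10,000", "$10,000 to $14,999"]

# Default (axis=0): one total per column.
column_sums = toy_income[bracket_columns].sum()

# axis=1: one total per row, which is what a per-zip-code count needs.
row_sums = toy_income[bracket_columns].sum(axis=1)
```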

In [None]:
hhinc[['Less than $10,000', '$10,000 to $14,999',
       '$15,000 to $19,999', '$20,000 to $24,999', '$25,000 to $29,999',
       '$30,000 to $34,999', '$35,000 to $39,999', '$40,000 to $44,999',
       '$45,000 to $49,999', '$50,000 to $59,999']].# Fill this part in

Now, create a new column named "pct_less_than_60000" by dividing the newly-created column by the "Total:" column.

In [None]:
# Your Code Here

Finally, create a column named `low_income` that indicates whether a zip code has a median income less than 60000 (that is, whether `pct_less_than_60000` is greater than 0.5).
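
A comparison on a column produces a column of `True`/`False` values, which can be stored directly. The sketch below shows the percentage and the flag on made-up counts, using the column names from this notebook.

```python
import pandas as pd

# Toy counts (invented values).
toy_income = pd.DataFrame({
    "Total_less_than_60000": [30, 40],
    "Total:": [100, 50],
})

# Share of households under the threshold in each row.
toy_income["pct_less_than_60000"] = (
    toy_income["Total_less_than_60000"] / toy_income["Total:"]
)

# Boolean flag: True where more than half of households fall under the threshold.
toy_income["low_income"] = toy_income["pct_less_than_60000"] > 0.5
```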

In [None]:
# Your Code Here

We don't need all of the columns from these two DataFrames. Slice `race` so that it only includes the 'geoid', 'zip', 'Total', and 'pct_white' columns and `hhinc` so that it only includes 'geoid', 'zip', and 'low_income'.

In [None]:
# Your Code Here

Prior to merging, it might make sense to rename the 'zip' columns in `race` and `hhinc` to `Zip` in order to match the column name in the `total_cases` DataFrame.

Also, rename the 'Total' column in `race` to 'population'.
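
Selecting a subset of columns and renaming them might look like the sketch below, shown on a made-up table. The `geoid` values and the `extra` column are placeholders invented for the example.

```python
import pandas as pd

# Toy stand-in for the race table (invented values, placeholder geoids).
toy_race = pd.DataFrame({
    "geoid": ["g1", "g2"],
    "zip": [37201, 37203],
    "Total": [1000, 2000],
    "extra": [1, 2],  # hypothetical column that is not needed
    "pct_white": [0.6, 0.4],
})

# A list of column names inside the brackets keeps only those columns.
toy_race = toy_race[["geoid", "zip", "Total", "pct_white"]]

# rename() takes a mapping from old names to new names.
toy_race = toy_race.rename(columns={"zip": "Zip", "Total": "population"})
```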

In [None]:
# Your Code Here

Do a series of two merges:

First, merge the `total_cases` DataFrame and the `race` DataFrame together and save the result back to `total_cases`.

Second, merge the `total_cases` DataFrame and the `hhinc` DataFrame together and save the result back to `total_cases`.
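
The two-step merge can be sketched on made-up tables as follows. The choice of `how="left"` and of the key columns is an assumption made for this example, not something the exercise prescribes; a left merge keeps every row of the cases table even when a zip code has no census match.

```python
import pandas as pd

# Toy tables (invented values, placeholder geoids).
toy_cases = pd.DataFrame({"Zip": [37201, 37203, 37205], "Cases": [50, 80, 20]})
toy_race = pd.DataFrame({
    "geoid": ["g1", "g2"],
    "Zip": [37201, 37203],
    "population": [1000, 2000],
    "pct_white": [0.6, 0.4],
})
toy_hhinc = pd.DataFrame({
    "geoid": ["g1", "g2"],
    "Zip": [37201, 37203],
    "low_income": [False, True],
})

# First merge: attach the race columns to the cases table by zip code.
merged = toy_cases.merge(toy_race, on="Zip", how="left")

# Second merge: both tables now share "geoid" and "Zip", so joining on both
# avoids duplicated geoid_x / geoid_y columns.
merged = merged.merge(toy_hhinc, on=["geoid", "Zip"], how="left")
```

In this toy data, zip code 37205 has no census match, so its census columns are missing after the merges.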

In [None]:
# Your Code Here

Using the `Cases` and `population` columns, create a new calculated column, `cases_per_100000`.
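
A rate per 100,000 residents is the case count divided by the population, scaled by 100,000. A sketch on made-up numbers:

```python
import pandas as pd

# Toy table (invented values).
toy = pd.DataFrame({
    "Zip": [37201, 37203],
    "Cases": [50, 80],
    "population": [1000, 2000],
})

# Cases per 100,000 residents: scale the count, then divide by population.
toy["cases_per_100000"] = toy["Cases"] * 100_000 / toy["population"]
```

For example, 50 cases among 1,000 residents corresponds to 5,000 cases per 100,000.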

In [None]:
# Your Code Here

Create a bar plot to display this newly calculated column.

In [None]:
# Your Code Here

Look back at your `total_cases` DataFrame to investigate any zip codes with unusually high or low values. Why might these values be so high or low?

In [None]:
total_cases.sort_values('cases_per_100000')

For the zip codes 37027, 37072, 37080, and 37138, at least 10% of the residents live outside of Davidson County. Remove the rows for these zip codes from the DataFrame.

You can use the code in the following cell to accomplish this.

We also want to remove any rows whose `Zip` is NaN, which the same cell also does.

In [None]:
total_cases = total_cases[~total_cases['Zip'].isin([37027, 37072, 37080, 37138])]
total_cases = total_cases[~total_cases['Zip'].isna()]

**Question:** Does there appear to be any difference in `cases_per_100000` for zip codes identified as `low_income` vs. those that are not?
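
One possible starting point is to compare summary statistics between the two groups; a box plot or similar chart would be another. The sketch below uses made-up numbers, so its output says nothing about the real data.

```python
import pandas as pd

# Toy table (invented values).
toy = pd.DataFrame({
    "low_income": [True, True, False, False],
    "cases_per_100000": [9000.0, 11000.0, 7000.0, 8000.0],
})

# Mean and median rate within each group; groups are sorted False, then True.
summary = toy.groupby("low_income")["cases_per_100000"].agg(["mean", "median"])
```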

In [None]:
# Your Code Here

**Question:** Does there appear to be any relationship between a zip code's `pct_white` value and `cases_per_100000` value?
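
A scatter plot is a natural way to look at this, and a correlation coefficient gives a single-number summary of any linear relationship. The sketch below computes the Pearson correlation on made-up values that were chosen to lie exactly on a line, so the result here is an artifact of the toy data, not a finding.

```python
import pandas as pd

# Toy table (invented, perfectly linear values).
toy = pd.DataFrame({
    "pct_white": [0.2, 0.4, 0.6, 0.8],
    "cases_per_100000": [9000.0, 8000.0, 7000.0, 6000.0],
})

# Pearson correlation between the two columns (the default for Series.corr).
correlation = toy["pct_white"].corr(toy["cases_per_100000"])
```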

In [None]:
# Your Code Here

Finally, we are going to be reusing the DataFrame that we have created in future weeks. To save your findings, you can use the `.to_csv()` method as the following cell shows. Run the following cell to save your work.

In [None]:
total_cases.to_csv('../data/total_cases.csv', index = False)