In this notebook, you'll retrieve some data from a SQLite database containing data gathered from the US Census.

In [53]:
import pandas as pd
from sqlalchemy import create_engine

First, use the `create_engine()` function to connect to the database (`census_date.sqlite`, which is in the data folder).

In [54]:
# Your code here
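
As a rough sketch of how `create_engine()` takes a database URL, the example below connects to an in-memory SQLite database rather than the census file. The in-memory URL and the trivial test query are assumptions made for illustration; for the notebook, the URL would be `sqlite:///` followed by the path to the `.sqlite` file in the data folder.

```python
from sqlalchemy import create_engine, text

# Assumption: an in-memory SQLite database stands in for the census file.
# A file-based database uses "sqlite:///" followed by the path to the file;
# the exact relative path depends on where the notebook is located.
demo_engine = create_engine("sqlite://")

# Run a trivial query to confirm that the connection works.
with demo_engine.connect() as connection:
    result = connection.execute(text("SELECT 1 AS ok")).fetchall()

rows = [tuple(row) for row in result]
print(rows)
```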

To see all tables contained in this database, you can run the following cell.

In [55]:
query = '''
SELECT 
    name
FROM 
    sqlite_master 
WHERE 
    type ='table' AND 
    name NOT LIKE 'sqlite_%';
'''

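# Engine.execute() was removed in SQLAlchemy 2.0, so run the query
# on an explicit connection and wrap the SQL string in text().
from sqlalchemy import text

with engine.connect() as connection:
    tables = connection.execute(text(query)).fetchall()

tables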

[('age',), ('race',), ('educ',), ('hhinc',)]

We are going to look specifically at the `race` and the `hhinc` (household income) tables.

[Studies](https://www.ama-assn.org/delivering-care/health-equity/data-10-cities-show-covid-19-impact-based-poverty-race) have seen higher rates of infection and death from COVID-19 in more racially diverse counties compared to those which are substantially white with the same income level.

Eventually, we'll compare infection rates for majority white vs. majority non-white zip codes in Davidson County. For now, let's just identify which zip codes are majority white.

**Step 1:** Write a query to retrieve all rows from the race table. Run this query and save the results into a pandas DataFrame named `race`.

In [None]:
# Your code here
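
One possible pattern for loading query results into a DataFrame is `pd.read_sql()`. The sketch below runs against a tiny in-memory table, not the real `race` table; the `zip` column name, the zip codes, and the counts are invented for illustration.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Assumption: a small in-memory table stands in for the real race table.
demo_engine = create_engine("sqlite://")
with demo_engine.begin() as connection:
    connection.execute(text('CREATE TABLE race (zip TEXT, "Total" INTEGER)'))
    connection.execute(
        text("INSERT INTO race VALUES ('00001', 100), ('00002', 250)")
    )

query = "SELECT * FROM race;"

# read_sql runs the query and returns the rows as a DataFrame.
with demo_engine.connect() as connection:
    race_demo = pd.read_sql(text(query), connection)

print(race_demo)
```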

**Step 2:** Create a column named `pct_white` by dividing the `Not Hispanic or Latino_White alone` column by the `Total` column.

In [None]:
# Your Code Here
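
Dividing one column by another works element by element, row by row. The following is a minimal sketch on invented data; only the two column names come from the `race` table described above.

```python
import pandas as pd

# Assumption: invented zip codes and counts, for illustration only.
race_demo = pd.DataFrame({
    "zip": ["00001", "00002", "00003"],
    "Not Hispanic or Latino_White alone": [80, 100, 30],
    "Total": [100, 250, 40],
})

# Column-by-column division produces one ratio per row.
race_demo["pct_white"] = (
    race_demo["Not Hispanic or Latino_White alone"] / race_demo["Total"]
)

print(race_demo[["zip", "pct_white"]])
```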

**Question:** How many zip codes in this table are majority white (meaning `pct_white` is at least 0.5)?

In [None]:
# Your Code Here
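
One way to count rows that meet a condition is to build a Boolean mask and sum it, since `True` counts as 1. The values below are made up to show the idea, including a zip code that sits exactly at the 0.5 threshold.

```python
import pandas as pd

# Assumption: invented pct_white values, one per zip code.
pct_white_demo = pd.Series([0.8, 0.4, 0.75, 0.5])

# The comparison yields True/False per row; summing counts the True values.
majority_white_mask = pct_white_demo >= 0.5
n_majority_white = int(majority_white_mask.sum())

print(n_majority_white)
```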

**Difficult Question:** The race table currently includes many more zip codes than just those in Davidson County. Read in the COVID cases dataset, and use it to narrow the `race` table down to just the Davidson County zip codes. Of these zip codes, how many are majority white? How many are majority non-white?

In [89]:
# Your Code Here
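
A possible approach to the filtering step is `.isin()`, sketched here on invented data. The `covid_demo` DataFrame and its `zip` column are hypothetical stand-ins, since the layout of the COVID cases dataset is not shown here. Zip codes are converted to strings on both sides because one table may store them as integers and the other as text.

```python
import pandas as pd

# Assumption: invented data; the real column names may differ.
race_demo = pd.DataFrame({
    "zip": ["00001", "00002", "00003"],
    "pct_white": [0.45, 0.90, 0.60],
})
covid_demo = pd.DataFrame({"zip": ["00001", "00002", "00001"]})

# Unique zip codes that appear in the (hypothetical) COVID dataset.
county_zips = covid_demo["zip"].astype(str).unique()

# Keep only the race rows whose zip code appears in that list.
county_race = race_demo[race_demo["zip"].astype(str).isin(county_zips)]

n_majority_white = int((county_race["pct_white"] >= 0.5).sum())
n_majority_non_white = len(county_race) - n_majority_white

print(n_majority_white, n_majority_non_white)
```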

[Other recent studies](https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2779417) have looked at the relationship between the level of income inequality in an area and the number of COVID-19 infections. 

According to [Census Reporter](https://censusreporter.org/profiles/05000US47037-davidson-county-tn/), the median household income in Davidson County is $63,938.

Let's now identify which zip codes in Davidson County have a median household income less than $60,000.

**Step 1:** Write and run a query to fetch all rows from the `hhinc` table. Save the results into a DataFrame named `hhinc`.

In [None]:
# Your Code Here

Take a look at the first few rows of `hhinc`.

In [91]:
hhinc.head()

Unnamed: 0,geoid,zip,Total:,"Less than $10,000","$10,000 to $14,999","$15,000 to $19,999","$20,000 to $24,999","$25,000 to $29,999","$30,000 to $34,999","$35,000 to $39,999","$40,000 to $44,999","$45,000 to $49,999","$50,000 to $59,999","$60,000 to $74,999","$75,000 to $99,999","$100,000 to $124,999","$125,000 to $149,999","$150,000 to $199,999","$200,000 or more"
0,86000US37013,37013,35597,1424,1023,1197,1751,1647,1685,1748,1937,1748,3956,4510,5507,3547,1703,1212,1002
1,86000US37015,37015,6673,421,256,115,415,365,452,350,277,323,719,903,903,639,294,157,84
2,86000US37027,37027,19950,460,173,246,250,254,250,316,538,472,597,1235,1861,2167,1688,2837,6606
3,86000US37062,37062,4040,119,122,138,140,123,124,121,216,140,329,569,645,468,342,286,158
4,86000US37064,37064,21640,771,302,623,512,609,630,721,495,567,1479,1784,2293,2338,2265,2394,3857


In [93]:
hhinc.columns

Index(['geoid', 'zip', 'Total:', 'Less than $10,000', '$10,000 to $14,999',
       '$15,000 to $19,999', '$20,000 to $24,999', '$25,000 to $29,999',
       '$30,000 to $34,999', '$35,000 to $39,999', '$40,000 to $44,999',
       '$45,000 to $49,999', '$50,000 to $59,999', '$60,000 to $74,999',
       '$75,000 to $99,999', '$100,000 to $124,999', '$125,000 to $149,999',
       '$150,000 to $199,999', '$200,000 or more'],
      dtype='object')

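The table gives household counts per income bracket rather than the median itself, but a zip code's median income is less than $60,000 when at least half of its households fall in the brackets below that amount. We will therefore use the following procedure: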

1. Get a total count for all income levels from "Less than \$10,000" through "\$50,000 to \$59,999".

2. Use this count to get a percentage of households in that zip code whose income is less than \$60,000.

3. Identify zip codes for which this percentage is at least 50%.

For the first step, we can use the `.sum()` method.

By default, `pandas` sums down each column, returning one total per column. We can change this behavior with the `axis` argument of the `.sum()` method: `axis = 1` tells pandas to sum across the columns instead, producing one total per row.
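
To see the difference between the two directions, here is a small sketch on a toy DataFrame. The column names `a`, `b`, and `c` and their values are invented for illustration.

```python
import pandas as pd

# Assumption: toy data with made-up column names.
demo = pd.DataFrame({"a": [1, 2], "b": [10, 20], "c": [100, 200]})

# Default (axis=0): one total per column.
column_totals = demo.sum()

# axis=1: one total per row, here restricted to a subset of columns.
row_totals = demo[["a", "b"]].sum(axis=1)

print(column_totals)
print(row_totals)
```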

Once you are satisfied that the calculation is being done correctly, save the result back to a column named "Total_less_than_60000".

In [None]:
hhinc[['Less than $10,000', '$10,000 to $14,999',
       '$15,000 to $19,999', '$20,000 to $24,999', '$25,000 to $29,999',
       '$30,000 to $34,999', '$35,000 to $39,999', '$40,000 to $44,999',
       '$45,000 to $49,999', '$50,000 to $59,999']].# Fill this part in

Now, create a new column named `pct_less_than_60000` by dividing the newly created `Total_less_than_60000` column by the `Total:` column.

In [101]:
# Your Code Here

**Question:** For how many zip codes is the median income less than \$60,000 (that is, `pct_less_than_60000` is at least 0.5)?

In [105]:
# Your Code Here
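
The three-step procedure can be sketched end to end on a simplified table. This is an illustration under stated assumptions, not the notebook's solution: the three coarse brackets, the zip codes, and the counts are invented, and only the `Total:` column name and the two new column names match the text above.

```python
import pandas as pd

# Assumption: invented counts with three coarse brackets instead of sixteen.
demo = pd.DataFrame({
    "zip": ["00001", "00002", "00003"],
    "Total:": [100, 200, 50],
    "Less than $30,000": [30, 20, 10],
    "$30,000 to $59,999": [30, 40, 15],
    "$60,000 or more": [40, 140, 25],
})

# Step 1: add up the brackets below $60,000 for each row.
low_brackets = ["Less than $30,000", "$30,000 to $59,999"]
demo["Total_less_than_60000"] = demo[low_brackets].sum(axis=1)

# Step 2: convert the count to a share of all households.
demo["pct_less_than_60000"] = demo["Total_less_than_60000"] / demo["Total:"]

# Step 3: keep the rows where at least half of households are below $60,000.
below_60k = demo[demo["pct_less_than_60000"] >= 0.5]
n_below_60k = len(below_60k)

print(below_60k[["zip", "pct_less_than_60000"]])
```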