<h1 style="padding-top: 25px;padding-bottom: 25px;text-align: left; padding-left: 10px; background-color: #DDDDDD; 
    color: black;"> <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS109A Introduction to Data Science </h1>

## Homework 1: Data Collection, Parsing, and Quick Analyses

**Harvard University**<br/>
**Fall 2021**<br/>
**Instructors**: Pavlos Protopapas and Natesh Pillai<br/>
<hr style='height:2px'>

In [1]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2021-CS109A/master/themes/static/css/cs109.css").text
HTML(styles)


## Overview 

In this homework, your goal is to learn how to acquire, parse, clean, and analyze data. Toward this goal, we will address certain questions about COVID, and you will scrape data directly from a website. For the remainder of the semester, we will provide you data files directly; however, since real-world problems often require gathering information from a variety of sources, including the Internet, web scraping is a highly useful skill to have.

### Instructions
- To submit your assignment, follow the instructions given in Canvas.

### Learning Objectives
- Get started using [Jupyter Notebooks](https://jupyter.org/), which are incredibly popular, powerful, and will be our medium of programming for the duration of CS109A and CS109B.
- Become familiar with how to access and use data from various sources (i.e., web scraping and directly from files).
- Gain experience with data exploration and simple analysis.
- Become comfortable with [pandas](https://pandas.pydata.org/) as a means of storing and working with data.
- Reflect on what further analysis you may wish to do with this data. For example, given the material we've covered so far, what *more* do you wish you had the ability to do (e.g., modelling, prediction, etc.)? That is, think about questions you may have about the data, and try to imagine what types of tools you might need to help answer your questions.

### Notes
- Exercise **responsible scraping**. Web servers can become slow or unresponsive if they receive too many requests from the same source in a short amount of time. In your code, use a delay of 2 seconds between requests. This helps to not get blocked by the target website -- imagine how frustrating it would be to have this occur. Section 1 of this homework involves saving the scraped web pages to your local machine. Thus, after completing Section 1, you do not need to re-scrape any of the pages, unless you wish to occasionally grab the latest data. 

- <span style='color:purple'>**Web scraping requests can take several minutes**</span>. This is another reason why you should not wait until the last minute to do this homework.
- As you run a Jupyter Notebook, it maintains a running state of memory. Thus, <span style='color:purple'>the order in which you run cells matters</span> and plays a crucial role; it can be easy to make mistakes based on *when* you run different cells as you develop and test your code. Before submitting every Jupyter Notebook homework assignment, be sure to restart your Jupyter Notebook and run the entire notebook from scratch, all at once (i.e., "Kernel -> Restart & Run All"). Just make sure not to re-run the time-intensive tasks unnecessarily. In this notebook, for example, you could declare a variable to act as a 'setting' and use some control logic to prevent a re-scrape from happening when not desired.

- We will be working with COVID data. COVID has impacted everyone in the world, and naturally some people have been greatly more affected than others. We, the teaching staff, are sensitive to this, empathize, and understand that working with COVID data may be unsettling to some. We apologize for any discomfort this may cause. Our intent with this assignment is purely pedagogical, and we'd like to remind students that data science and machine learning can be used to provide insights that can be used for good and invoke change. Toward this goal, parts of the homework are intended to shed light on the unfortunate, widespread inequality that exists. So, while this data may be unsettling, our aim is for the learned skills addressed here -- and in all future assignments -- to provide you with knowledge and confidence to do good work.
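
The 'setting' variable mentioned in the notes above could look something like the following. This is only a sketch: `RESCRAPE`, `scrape_pages`, and the injected `fetch` callable are hypothetical names that are not part of the starter code, and a fake fetcher stands in for a real `requests.get` call so that the example never touches the network.

```python
from time import sleep

RESCRAPE = False  # hypothetical setting: flip to True only when fresh data is wanted


def scrape_pages(urls, fetch, rescrape=RESCRAPE, delay=2.0):
    """Fetch each URL with a pause between requests, or do nothing if rescrape is False."""
    pages = {}
    if not rescrape:
        return pages  # skip the time-intensive work entirely
    for i, (name, url) in enumerate(urls.items()):
        if i > 0:
            sleep(delay)  # responsible scraping: wait between consecutive requests
        pages[name] = fetch(url)
    return pages


# Fake fetcher standing in for a real HTTP request (an assumption for this demo)
fake_fetch = lambda url: f"<html>{url}</html>"
demo_urls = {"A": "https://example.com/a", "B": "https://example.com/b"}

skipped = scrape_pages(demo_urls, fake_fetch)  # RESCRAPE is False, so nothing is fetched
fetched = scrape_pages(demo_urls, fake_fetch, rescrape=True, delay=0)
```

With a guard like this, "Restart & Run All" stays fast because the scraping cell returns immediately unless the setting is switched on.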

## 1. Obtaining Data (17 points)

For any given situation or scenario that we wish to understand, we will rely on having relevant data. Here, we are interested in the degree to which the SARS-CoV-2 virus has affected United States citizens (SARS-CoV-2 is the virus that causes the COVID-19 disease). The Centers for Disease Control and Prevention (CDC) provides relevant data from USAFacts.org that includes the number of confirmed COVID-19 cases on a per-county basis. Visit https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/. At the bottom of the web page, in a blue table, you should see a list of every state, each of which has its own web page.

In this exercise, we will focus on automating the downloading of each state's data with [Requests](https://docs.python-requests.org/en/master/) and then manipulating it with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). 

But first, as we will do for every Jupyter Notebook, let's import necessary packages that we will use throughout the notebook (i.e., run the cell below). 

In [2]:
# import the necessary libraries
import re
import requests
import pandas as pd
import numpy as np
from time import sleep
from bs4 import BeautifulSoup
import pickle # for loading a dictionary from disk
from typing import Optional # typehint that value can also be None

# NOTE: files will be saved to this directory, so you need to ensure
# that it exists on your system first (it should be visible from the
# directory of where you are running this Notebook file)
# i.e.,
# >> ls
# cs109a_hw1_student.ipynb
# data/
# state_data/
state_dir = "state_data/"

In [3]:
# we define this for convenience, as every state's url begins with this prefix
base_url = 'https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/'

<div class='exercise'><b> Exercise 1.1 [1 pt]: Fetching Website data via Requests</b>

Fetch the web page located at `base_url` and save the request's returned object (a Response object) to a variable named `home_page`.
</div>

In [4]:
# YOUR CODE HERE
home_page = requests.get(base_url)
# END OF YOUR CODE HERE

<div class='exercise'><b>Exercise 1.2 [2 pts]:</b> In the cell below:
    
- Write a line of code that prints to the screen the status of `home_page` (the web page's returned object). You should receive a code of 200 if the request was successful; then,

- **When working with Jupyter Notebooks, avoiding unnecessarily long output is essential.** Write code that prints the first 10,000 characters from the contents of `home_page` and [enable scrolling output for the cell](https://www.youtube.com/watch?v=U4usAUZCv_c&t=1s).</div>


In [5]:
# YOUR CODE HERE
print(home_page.status_code)
print(home_page.text[:10000])
# END OF YOUR CODE HERE

200
<!doctype html><html lang="en"><head><script type="text/javascript">window.NREUM||(NREUM={});NREUM.info = {"agent":"","beacon":"bam-cell.nr-data.net","errorBeacon":"bam-cell.nr-data.net","licenseKey":"NRJS-c11b817f31177e0b4d1","applicationID":"1475026924","applicationTime":2161.554995,"transactionName":"ZwZaNUEFVhZZAkNRWl5Mdg5BCVkJURtSXGBCChdL","queueTime":0,"ttGuid":"50a3b7d4afb96077","agentToken":null}; (window.NREUM||(NREUM={})).loader_config={licenseKey:"NRJS-c11b817f31177e0b4d1",applicationID:"1475026924"};window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var i=e[n]={exports:{}};t[n][0].call(i.exports,function(e){var i=t[n][1][e];return r(i||e)},i,i.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var i=0;i<n.length;i++)r(n[i]);return r}({1:[function(t,e,n){function r(){}function i(t,e,n){return function(){return o(t,[u.now()].concat(f(arguments)),e?null:this,n),e?void 0:this}}var o=t("handle"),a=t(8),f=t(9),c=

<div class='exercise'><b> Exercise 1.3 [1 pt]:</b>
    
In the cell below, create a new BeautifulSoup object that parses the `home_page` as an HTML document (can be done with 1 line of code)</div>

In [6]:
# YOUR CODE HERE
bs_page = BeautifulSoup(home_page.content, "html.parser")
# END OF YOUR CODE HERE

<div class='exercise'><b> Exercise 1.4 [8 pts]:</b>
    
In the cell below, write code that uses the BeautifulSoup object to parse through the home page in order to extract the link for every state. Feel free to use [Regular Expressions](https://docs.python.org/3/library/re.html) in conjunction with any BeautifulSoup parsing. Specifically, the goal is to populate a `state_urls` [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) by setting each key to be the state name and the value to be the full URL. When complete, there will be 51 keys (50 states + 1 for DC).

### AS A CRITICAL EXAMPLE:
Within `state_urls`, one of your <key, value> pairs should be:

``"District of Columbia" : "https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/state/district-of-columbia"``

The casing here is **incredibly** important because later, in Exercise 4, you will merge your data with another dataset that has casing of this form. Thus, our key here should be `District of Columbia` and not `District Of Columbia` or `district-of-columbia`.


**NOTES:**
- There are _many_ solutions, but you may find it easiest to use Regular Expression(s)
- Pay attention to the casing example above, so that your later exercises go smoothly.
- Some HTML tag attributes may change over time. If your code stops working, make sure you are not targeting such ephemeral elements ('jss' class attributes are a common culprit).
</div>
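
To make the regular-expression route concrete, here is a minimal sketch that pulls state links out of a small, made-up HTML snippet. The markup in `sample_html` is an assumption for illustration and is much simpler than the real page; `sample_urls` is a hypothetical name chosen so it does not clash with `state_urls`.

```python
import re

# Toy HTML standing in for the real page (hypothetical markup, not the site's actual structure)
sample_html = '''
<a href="/visualizations/coronavirus-covid-19-spread-map/state/alabama">Alabama</a>
<a href="/visualizations/coronavirus-covid-19-spread-map/state/district-of-columbia">District of Columbia</a>
<a href="/about">About</a>
'''

SITE = "https://usafacts.org"

# Capture the relative path and the link text of every state link
state_link = re.compile(
    r'<a href="(/visualizations/coronavirus-covid-19-spread-map/state/[a-z-]+)">([^<]+)</a>'
)

# Key: link text (keeps the page's own casing); value: full URL
sample_urls = {name: SITE + path for path, name in state_link.findall(sample_html)}
```

Using the link text as the key, rather than rebuilding the name from the URL slug, preserves casing such as `District of Columbia`.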

In [7]:
state_urls = {}

# YOUR CODE HERE
# find_all is the current name for the older findAll alias
for link in bs_page.find_all('a', attrs={'href': re.compile("/visualizations/coronavirus-covid-19-spread-map/state/")}):
    state_urls[link.text] = "https://usafacts.org" + link.get('href')
# END OF YOUR CODE HERE

Run the cell below to help ensure your formatting is correct and has 51 <key, value> pairs.

In [8]:
# SANITY CHECK
if len(state_urls.keys()) != 51 or \
state_urls["District of Columbia"] != "https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/state/district-of-columbia":
    print("** 1.4 is incorrect")
else:
    print("** 1.4 might be correct")

** 1.4 might be correct


We wish to use the data without having to re-download it every time. So, let's save each webpage to our local hard drive. **NOTE: It's probably okay to download all of the state web pages a few times a day, but it's safer to keep it to a minimum.**

<div class='exercise'><b> Exercise 1.5 [5 pts]:</b>
    
In the cell below, we will iterate through all <key, value> items in `state_urls`. Your job is to make a web request for each URL and save the **contents** out to a file on your hard drive (use `state_dir`, defined above, as the prefix to the path.) 

**NOTES:**
- <span style='color:purple'>Leave a 2 second pause between requests</span>
- You should be saving to a file the actual content of the webpage, not a BeautifulSoup object. That is, you should be able to open the saved files in an editor and see the HTML code, just as you could if you were to view the webpage in your browser and click 'View Page Source'.
- See [official Python documentation](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files) for details on how to read/write files to disk
- You should have saved 51 different files to your hard drive.
- **Once you have written the files you can comment out this cell. This will save time and prevent you from making unnecessary requests when you restart the kernel & re-run all cells in the notebook before submitting (as you should!)**
</div>

In [9]:
# 1.5 (5 pts) -- save each webpage to disk
for state, url in state_urls.items():
    
    # YOUR CODE HERE
    state_webpage = requests.get(url).content
    
    # writes file
    # the with-block closes the file even if the write fails
    with open(state_dir + state, 'wb') as f:
        f.write(state_webpage)
    # END OF YOUR CODE HERE
    
    sleep(2) # LEAVE THIS IN
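
A quick way to confirm that every page was written is to count the non-empty files in the target directory. The helper below is a hypothetical sketch (`count_saved_pages` is not part of the assignment) and is demonstrated on a temporary directory instead of the real `state_data/` folder.

```python
import tempfile
from pathlib import Path


def count_saved_pages(directory):
    """Return how many non-empty regular files sit directly inside `directory`."""
    return sum(1 for p in Path(directory).iterdir() if p.is_file() and p.stat().st_size > 0)


# Demo on a throwaway directory with three fake pages
with tempfile.TemporaryDirectory() as tmp:
    for name in ["Alabama", "District of Columbia", "Wyoming"]:
        (Path(tmp) / name).write_bytes(b"<!doctype html><html></html>")
    n_saved = count_saved_pages(tmp)
```

In the notebook itself, the analogous check would be that `count_saved_pages(state_dir)` equals 51.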

## 2. Loading and Exploring Data (22 pts)
Now, let's actually use the data! Fortunately, it's saved to our local machine, so we don't need to re-crawl the data every time we wish to access it. We want you to understand that [pandas](https://pandas.pydata.org/) is a library of useful data structures and operations, but we also wish to remind you that it isn't magic and it isn't the _only_ way to do Data Science; it's just a tool to help, and you could do the same operations without pandas. Thus, here we ask you to perform a few operations without using pandas, and then in Exercise 3 we will use pandas.

**Terminology Notice:** In the United States, every state is comprised of many **counties.** You can think of a **county** as being a pretty large district. 

First, run the cell below to construct `state_info`.
This is an example of a Python [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions).

In [10]:
state_info = [(state, state_dir + state) for state in state_urls.keys()]
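
For readers new to list comprehensions, the sketch below builds the same kind of (name, path) tuples twice, once with an explicit loop and once with a comprehension. The names `demo_dir` and `demo_states` are stand-ins for `state_dir` and `state_urls.keys()`.

```python
demo_dir = "state_data/"               # same prefix idea as state_dir
demo_states = ["Alabama", "Wyoming"]   # stand-in for state_urls.keys()

# Explicit loop
pairs_loop = []
for s in demo_states:
    pairs_loop.append((s, demo_dir + s))

# Equivalent list comprehension
pairs_comp = [(s, demo_dir + s) for s in demo_states]
```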

<div class='exercise'><b> Exercise 2.1 [10 pts]: Parsing and storing data</b>
    
Complete the `load_covid_data()` function, which:

- Takes as input `state_info`, which is a list of [tuples](https://docs.python.org/3.3/library/stdtypes.html?highlight=tuple#tuple): (state name, path to the corresponding file)
- Parses the contents of the file and extracts for **each county**:
    - 7 day average case
    - 7 day average deaths
    - \# of confirmed cases (total)
    - \# of deaths
    - Stores the 4 pieces of data above as well as <font color='green'>population</font> in a **non-pandas** data structure named `covid_data` **for every county across every state**
- Returns `covid_data`
    <font color='blue'>


**NOTES:**
- <font color='green'> Attention: the population variable is not in `state_info`. More information on where to get this value is found in the green block below.</font>
- To be clear, as of September 7, 2021, the webpage for Alabama lists 67 counties. District of Columbia has 1 county, and Wyoming has 23. Here we are asking you to store in `covid_data` *all counties* across every state. So, later, if we wished to access just Wyoming's information, you could easily retrieve it for each of its 23 counties, or the info for any of the 67 counties in Alabama.
- `covid_data` **must not be a PANDAS data structure;** it must use a combination of lists and/or dictionaries. It's up to you to decide how to organize this, e.g., a list of lists of lists, or a list of dictionaries, or a dictionary of dictionaries, or a dictionary of lists of lists, etc. A guiding decision should be ease of access for computing basic stats (Exercises 2.2, 2.3, and 2.4).
- For the duration of our using this data for the homework, be sure to **properly store the data with the correct data types;** that is, counts should be represented as Integers and rates should be represented as Floats. For example:
    - \# of confirmed cases (total) should be an **Integer**
    - \# of deaths should be an **Integer**
    - \# of confirmed cases (per 100k) should be a **Float** (we haven't created this feature yet!)
    - 7 day average cases should be an **Integer** (you'd think an average should be a float, but the values you scraped were rounded to the nearest int)
</div>
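
Scraped table cells arrive as strings, often with thousands separators, so they need converting before they can be stored as Integers. A minimal sketch, assuming cell text shaped like `'36,447'` (`parse_count` is a hypothetical helper name):

```python
def parse_count(text):
    """Convert a scraped count such as '36,447' into an int."""
    return int(text.strip().replace(",", ""))


baldwin_cases = parse_count("36,447")
baldwin_deaths = parse_count(" 509 ")
```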

<div style='background-color:lightgreen;padding:15px'>
    <strong>Injecting population data</strong>
    

The table on usafacts.org you've just scraped originally had additional columns related to county population. But these have recently been removed! We'd like you to be able to utilize the population data in the following section but also use up-to-date COVID data (so the [Internet Archive](https://archive.org/) was not an option). And, though this information is available elsewhere on usafacts.org, we've decided that you've already done enough web scraping for one HW. So below we've provided a [kludge](https://en.wikipedia.org/wiki/Kludge#Computer_science).
    
`population_dict` is a nested dictionary. The keys are states whose values are _themselves_ dictionaries. Those '_inner_' dictionaries' keys are counties and their values are populations. It looks like this:
```python
{'Alabama': {'Autauga County': 55869,
             'Baldwin County': 223234,
           ...
'Wyoming': {'Albany County': 38880,
            'Big Horn County': 11790,
            ...
```

To get at a population you could use double dictionary indexing like `population_dict['Alabama']['Autauga County']`

But not all of the counties you've scraped have population data in this dictionary. So we've provided a helper function, `get_pop`, that will return `None` if the county data was not found. Use `get_pop` to inject population data into your `covid_data` as you build it up in the `load_covid_data` function you'll implement below.
    
**Final Note: you should <font color='brown'>ignore counties with missing population data or populations of 0</font>. Simply do not add them to `covid_data` as it is constructed.**
</div>
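
The filtering rule above (skip counties whose population is missing or 0) can be sketched on a toy nested dictionary. Everything here is illustrative: `toy_population`, `toy_get_pop`, and `has_usable_population` are hypothetical names, and "Ghost County" is a made-up entry used only to exercise the zero-population case.

```python
toy_population = {
    "Alabama": {
        "Autauga County": 55869,
        "Baldwin County": 223234,
        "Ghost County": 0,  # made-up entry to show the zero-population case
    },
}


def toy_get_pop(state, county):
    """Return the population, or None when the state or county is missing."""
    return toy_population.get(state, {}).get(county)


def has_usable_population(state, county):
    """True only when a population exists and is greater than zero."""
    pop = toy_get_pop(state, county)
    return pop is not None and pop > 0


scraped_counties = ["Autauga County", "Ghost County", "Unlisted County"]
kept = [c for c in scraped_counties if has_usable_population("Alabama", c)]
```

Looking the population up once and reusing the result keeps the check and the stored value in sync.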

In [11]:
# load additional county population data as a nested dictionary
# you can read about this strange .pkl 'pickle' file here
# https://docs.python.org/3/library/pickle.html
with open('population.pkl', 'rb') as f:
    population_dict = pickle.load(f)

# not sure what's happening with the data types in the function header?
# check out: https://docs.python.org/3/library/typing.html#module-typing
def get_pop(state: str, county: str) -> Optional[int]:
    '''
    returns population of county, state (int)
    If county or state not found, returns None
    Example: get_pop('Alabama', 'Autauga County')
    '''
    try:
        return population_dict.get(state).get(county)
    except AttributeError:
        print('incorrect state name!')
        return None

In [29]:
def load_covid_data(state_info):
    covid_data = {}
    # YOUR CODE HERE
    for (state, state_path) in state_info:
        covid_data[state] = []
        with open(state_path, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
            counties = soup.find_all('a', href=re.compile('county/'))

            for c in counties:
                row = c.find_parent('tr')

                cols = [col.text.replace(',','') for col in row.find_all('td')]

                county_name = c.text
                pop = get_pop(state, county_name)
                # skip counties with missing or zero population
                if (pop is None) or (pop == 0):
                    continue
                covid_data[state].append({'county_name': county_name,
                                          'population': pop,
                                          '7_day_avg_cases': float(cols[0]),
                                          '7_date_ave_deaths': float(cols[1]),
                                          'cases': int(cols[2]),
                                          'deaths': int(cols[3])})
    # END OF YOUR CODE HERE
    return covid_data

Run the cell below (no changes necessary) to execute your code above

In [31]:
covid_data = load_covid_data(state_info)
covid_data

{'Alabama': [{'county_name': 'Autauga County',
   'population': 55869,
   '7_day_avg_cases': 19.0,
   '7_date_ave_deaths': 1.0,
   'cases': 9744,
   'deaths': 140},
  {'county_name': 'Baldwin County',
   'population': 223234,
   '7_day_avg_cases': 56.0,
   '7_date_ave_deaths': 3.0,
   'cases': 36447,
   'deaths': 509},
  {'county_name': 'Barbour County',
   'population': 24686,
   '7_day_avg_cases': 9.0,
   '7_date_ave_deaths': 0.0,
   'cases': 3490,
   'deaths': 71},
  {'county_name': 'Bibb County',
   'population': 22394,
   '7_day_avg_cases': 11.0,
   '7_date_ave_deaths': 0.0,
   'cases': 4131,
   'deaths': 83},
  {'county_name': 'Blount County',
   'population': 57826,
   '7_day_avg_cases': 32.0,
   '7_date_ave_deaths': 1.0,
   'cases': 9818,
   'deaths': 160},
  {'county_name': 'Bullock County',
   'population': 10101,
   '7_day_avg_cases': 2.0,
   '7_date_ave_deaths': 0.0,
   'cases': 1503,
   'deaths': 44},
  {'county_name': 'Butler County',
   'population': 19448,
   '7_day_avg

<div class='exercise'><b> Exercise 2.2 [4 pts]: Simple analytics</b>
    
Complete the `calculate_county_stats()` function, which calculates:
1. The single county (and the state to which it belongs) that has the **lowest rate** of COVID cases per 100k people
2. The single county (and the state to which it belongs) that has the **highest rate** of COVID cases per 100k people
   
**NOTES:**
- Place your resulting variables within the blanks of the `print()` statements that we provide
- These values you report should be Floating point numbers (e.g., 3.4), not Integers (e.g., 3).
- If there are ties, return any one of the tied counties (see if you can do it in an unbiased way!)
</div>
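
The rate asked for here is cases divided by population, scaled to 100,000 people. The sketch below computes it for Autauga County using the case count and population that appear elsewhere in this notebook, and shows one possible way to break ties without bias by choosing uniformly at random among the tied counties. `cases_per_100k`, `pick_extreme`, and the `toy_rates` values are hypothetical.

```python
import random


def cases_per_100k(cases, population):
    """Cases scaled to a population of 100,000."""
    return cases / (population / 100_000)


def pick_extreme(rates, find_max=False, rng=random):
    """rates maps (county, state) -> rate. Ties are broken uniformly at random."""
    target = max(rates.values()) if find_max else min(rates.values())
    tied = [key for key, rate in rates.items() if rate == target]
    return rng.choice(tied), target


autauga_rate = cases_per_100k(9744, 55869)

# Made-up rates with a deliberate tie for the minimum
toy_rates = {("A County", "X"): 0.0, ("B County", "Y"): 0.0, ("C County", "Z"): 5.0}
lowest_key, lowest_rate = pick_extreme(toy_rates)
highest_key, highest_rate = pick_extreme(toy_rates, find_max=True)
```

A plain `<` comparison in a loop always keeps the first tied county it sees; collecting all tied counties first avoids favoring whichever state happens to come earlier.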

In [15]:
def calculate_county_stats(covid_data):
    
    # YOUR CODE HERE
    min_county_count = 999999
    min_county_name = ""
    max_county_count = -1
    max_county_name = ""
    
    # looks through every county in every state, while checking
    # to see if we have a new low or high
    for state in covid_data.keys():
        for county in covid_data[state]:
            if ((pop := county['population']) is None) or (pop == 0):
                continue
            covid_rate = round(county['cases'] / (pop/100000),2)
            if covid_rate < min_county_count:
                min_county_count = covid_rate
                min_county_name = county['county_name'] + " (" + state + ")"
            if covid_rate > max_county_count:
                max_county_count = covid_rate
                max_county_name = county['county_name'] + " (" + state + ")"

    print(min_county_name + " has the lowest COVID cases per 100k: " + str(float(min_county_count)))
    print(max_county_name + " has the highest COVID cases per 100k: " + str(float(max_county_count)))                
                
    # END OF YOUR CODE HERE

Run the cell below (no changes necessary) to execute your code above

In [16]:
calculate_county_stats(covid_data)

Lake and Peninsula Borough (Alaska) has the lowest COVID cases per 100k: 0.0
Bristol Bay Borough (Alaska) has the highest COVID cases per 100k: 72727.27


<div class='exercise'><b> Exercise 2.3 [4 pts]: Simple analytics</b>
    
Complete the `calculate_state_deaths()` function, which calculates:
1. The state that has the **lowest number** of deaths
2. The state that has the **highest number** of deaths

**NOTES:**
- Place your resulting variables within the blanks of the `print()` statements that we provide (don't just manually type your textual answers in the blanks)
- These values you report should be Integers, not Floating point numbers.
- If there are ties, return any of the tied states

</div>
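
One compact way to get state totals is to sum each state's county deaths into a dictionary and then take `min` and `max` over its keys. The sketch assumes a dictionary of lists of county dictionaries, which is only one of the allowed layouts for `covid_data`; the numbers are made up.

```python
# Made-up numbers; assumes a {state: [county dicts]} layout
toy_covid_data = {
    "State A": [{"deaths": 3}, {"deaths": 4}],
    "State B": [{"deaths": 10}],
    "State C": [{"deaths": 1}, {"deaths": 0}],
}

# Total deaths per state
deaths_by_state = {
    state: sum(county["deaths"] for county in counties)
    for state, counties in toy_covid_data.items()
}

# min/max over the keys, ranked by each state's total
fewest = min(deaths_by_state, key=deaths_by_state.get)
most = max(deaths_by_state, key=deaths_by_state.get)
```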

In [17]:
def calculate_state_deaths(covid_data):
    
    # YOUR CODE HERE
    min_state_deaths = 999999
    min_state_name = ""
    max_state_deaths = -1
    max_state_name = ""
    for state in covid_data.keys():
        cur_state_count = 0
        for county in covid_data[state]:
            cur_state_count += county['deaths']
            
        if cur_state_count < min_state_deaths:
            min_state_deaths = cur_state_count
            min_state_name = state
        if cur_state_count > max_state_deaths:
            max_state_deaths = cur_state_count
            max_state_name = state

    print(min_state_name + " has the fewest COVID deaths: " + str(min_state_deaths))
    print(max_state_name + " has the most COVID deaths: " + str(max_state_deaths))            
            
    # END OF YOUR CODE HERE

Run the cell below (no changes necessary) to execute your code above

In [18]:
calculate_state_deaths(covid_data)

Hawaii has the fewest COVID deaths: 144
California has the most COVID deaths: 65284


<div class='exercise'><b> Exercise 2.4 [4 pts]: Simple analytics</b>
    
Complete the `calculate_state_deathrate()` function, which calculates:
1. The state that has the **lowest rate** of deaths based on its entire population
2. The state that has the **highest rate** of deaths based on its entire population

**NOTES:**
- To calculate a state's population, we are asserting that it is sufficient to sum the population over all counties, and that each county's population can be calculated simply from the data fields stored within `covid_data`.
- **If a county has reported 0 COVID cases,** then we should ignore this county as we estimate its state's population. Thus, that county would contribute 0 to its state population total.
- Round your results to a single person (e.g., "1 out of every 2703 people has died", not 2703.4)
- Place your resulting variables within the blanks of the `print()` statements that we provide (don't just manually type your textual answers in the blanks)
</div>
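
The "1 out of every N people" figure is the state's population divided by its deaths, counting only counties that reported at least one case. A minimal sketch on made-up county records follows (`people_per_death` is a hypothetical helper, and returning `None` for a state with no deaths is an added assumption to avoid dividing by zero):

```python
def people_per_death(counties):
    """counties: list of dicts with 'cases', 'deaths', 'population'.

    Returns the rounded number of people per death, or None if there were no deaths.
    """
    counted = [c for c in counties if c["cases"] > 0]  # ignore zero-case counties
    population = sum(c["population"] for c in counted)
    deaths = sum(c["deaths"] for c in counted)
    if deaths == 0:
        return None
    return round(population / deaths)


toy_state = [
    {"cases": 50, "deaths": 2, "population": 1000},
    {"cases": 0, "deaths": 0, "population": 500},   # ignored: zero cases
    {"cases": 10, "deaths": 1, "population": 2000},
]
ratio = people_per_death(toy_state)
no_deaths = people_per_death([{"cases": 5, "deaths": 0, "population": 10}])
```

Note that a larger N means a lower death rate, so the state with the lowest rate is the one with the largest ratio.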

In [19]:
def calculate_state_deathrate(covid_data):
    
    # YOUR CODE HERE
    min_state_death_rate = -1
    min_state_name = ""
    max_state_death_rate = 9999999
    max_state_name = ""
    
    for state in covid_data.keys():
        cur_state_deaths = 0
        cur_state_population = 0
        for county in covid_data[state]:
            if (county['cases'] > 0) and ((pop := county['population']) is not None):
                cur_state_population += pop
                cur_state_deaths += county['deaths']
                
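        # skip states with no recorded deaths to avoid dividing by zero
        if cur_state_deaths == 0:
            continue
        # people per death: a larger value means a lower death rate
        cur_state_deathrate = float(cur_state_population) / cur_state_deaths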
        
        if cur_state_deathrate > min_state_death_rate:
            min_state_death_rate = cur_state_deathrate
            min_state_name = state
        if cur_state_deathrate < max_state_death_rate:
            max_state_death_rate = cur_state_deathrate
            max_state_name = state
            
    print(min_state_name + " has the lowest COVID death rate; 1 out of every " + str(round(min_state_death_rate)) + " people has died")
    print(max_state_name + " has the highest COVID death rate; 1 out of every " + str(round(max_state_death_rate)) + " people has died")
            
    # END OF YOUR CODE HERE

Run the cell below (no changes necessary) to execute your code above

In [20]:
calculate_state_deathrate(covid_data)

Hawaii has the lowest COVID death rate; 1 out of every 3064 people has died
New Jersey has the highest COVID death rate; 1 out of every 334 people has died


## 3. PANDAS (36 pts)
What if we wanted to observe more than just the single-most extreme counties and states? What if we wanted to inspect all states, after having sorted the data by some feature? As you saw in the above exercises, doing the most basic analytics is possible, but it can quickly become cumbersome. As we learned in class, PANDAS is a great library that provides data structures that are highly useful for data analysis.

<div class='exercise'><b> Exercise 3.1 [10 pts]: Converting to PANDAS</b>

In Exercise 2, we worked with `covid_data`, which comprises some combination of lists and/or dictionaries.

Complete the `convert_to_pandas()` function, which converts `covid_data` to a PANDAS DataFrame, whereby:
- Each row corresponds to a unique county
- The 5 columns are:
    - county
    - state
    - \# total covid cases (Integer)
    - \# covid cases per 100k (Float)
    - \# covid deaths (Integer)
- The columns should be titled **exactly** as listed above

**NOTE:**
- If multiple counties share the same name but belong to different states, then there should be a distinct row for each.
- The 2 columns that correspond to COVID counts should all be Integers (e.g., 1498), not Floating point numbers (e.g., 1498.0)
</div>
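
A list of flat dictionaries, one per county, converts directly into a DataFrame with one row per dictionary. The sketch below uses the two Alabama counties whose values appear earlier in this notebook and checks that counts come out as Integers and the rate as a Float; `rows` and `toy_df` are hypothetical names.

```python
import pandas as pd

# One flat dict per county; keys become the column titles
rows = [
    {"county": "Autauga County", "state": "Alabama",
     "# total covid cases": 9744,
     "# covid cases per 100k": 9744 / (55869 / 100_000),
     "# covid deaths": 140},
    {"county": "Baldwin County", "state": "Alabama",
     "# total covid cases": 36447,
     "# covid cases per 100k": 36447 / (223234 / 100_000),
     "# covid deaths": 509},
]

toy_df = pd.DataFrame(rows)

# Confirm the dtypes match the exercise's requirements
cases_are_ints = pd.api.types.is_integer_dtype(toy_df["# total covid cases"])
deaths_are_ints = pd.api.types.is_integer_dtype(toy_df["# covid deaths"])
rate_is_float = pd.api.types.is_float_dtype(toy_df["# covid cases per 100k"])
```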

In [32]:
def convert_to_pandas(covid_data):
    
    # YOUR CODE HERE
    covid_data_flipped = []
    for state, counties in covid_data.items():
        for county in counties: 
            if ((pop:= county['population']) is None) or (pop == 0):
                continue
            cases = county['cases']
            cur_dict = {"county":county['county_name'], "state":state,
                            "# total covid cases": cases,
                            "# covid cases per 100k": cases/(pop/100000),
                           "# covid deaths": county['deaths']}
            covid_data_flipped.append(cur_dict)
    print(covid_data_flipped)
    covid_df = pd.json_normalize(covid_data_flipped)
    # END OF YOUR CODE HERE
    return covid_df

Run the cell below (no changes necessary) to execute your code above and inspect the results.

In [33]:
covid_df = convert_to_pandas(covid_data)

[{'county': 'Autauga County', 'state': 'Alabama', '# total covid cases': 9744, '# covid cases per 100k': 17440.79901197444, '# covid deaths': 140}, {'county': 'Baldwin County', 'state': 'Alabama', '# total covid cases': 36447, '# covid cases per 100k': 16326.814015786123, '# covid deaths': 509}, {'county': 'Barbour County', 'state': 'Alabama', '# total covid cases': 3490, '# covid cases per 100k': 14137.567852223932, '# covid deaths': 71}, {'county': 'Bibb County', 'state': 'Alabama', '# total covid cases': 4131, '# covid cases per 100k': 18446.905421094936, '# covid deaths': 83}, {'county': 'Blount County', 'state': 'Alabama', '# total covid cases': 9818, '# covid cases per 100k': 16978.521772213193, '# covid deaths': 160}, {'county': 'Bullock County', 'state': 'Alabama', '# total covid cases': 1503, '# covid cases per 100k': 14879.71487971488, '# covid deaths': 44}, {'county': 'Butler County', 'state': 'Alabama', '# total covid cases': 3203, '# covid cases per 100k': 16469.5598519127

In [23]:
covid_df.head()

| | county | state | # total covid cases | # covid cases per 100k | # covid deaths |
|---|---|---|---|---|---|
| 0 | Autauga County | Alabama | 9108 | 16302.421737 | 119 |
| 1 | Baldwin County | Alabama | 34393 | 15406.70328 | 400 |
| 2 | Barbour County | Alabama | 3225 | 13064.084906 | 65 |
| 3 | Bibb County | Alabama | 3694 | 16495.489863 | 74 |
| 4 | Blount County | Alabama | 8998 | 15560.474527 | 147 |


In [24]:
covid_df.shape

(3081, 5)

<div class='exercise'><b> Exercise 3.2 [5 pts]: Simple analytics</b>

Complete the `calculate_county_stats2()` function, **which should obtain identical information (other than ties) as problem 2.2, but now using the PANDAS `covid_df` DataFrame.**

That is, it should calculate:
1. the single county (and the state to which it belongs) that has the **lowest rate** of COVID cases per 100k people
2. the single county (and the state to which it belongs) that has the **highest rate** of COVID cases per 100k people

**NOTES:**
- If there are ties, return any of the tied counties
- Place your resulting variables within the `print()` statements that we provide (don't just manually type your textual answers in the blanks)
- The values you report should be Floating point numbers (e.g., 3.4), not Integers (e.g., 3).

</div>
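
As an alternative to sorting the whole frame, `idxmin` and `idxmax` return the index label of the smallest and largest value in a column, and `.loc` then retrieves the matching row. A sketch on made-up data (`toy_rates_df` and the values in it are hypothetical):

```python
import pandas as pd

# Made-up rates for three counties
toy_rates_df = pd.DataFrame({
    "county": ["A County", "B County", "C County"],
    "state": ["X", "Y", "Z"],
    "# covid cases per 100k": [120.5, 0.0, 980.25],
})

rate_col = "# covid cases per 100k"
lowest_row = toy_rates_df.loc[toy_rates_df[rate_col].idxmin()]
highest_row = toy_rates_df.loc[toy_rates_df[rate_col].idxmax()]

summary = f"{lowest_row['county']} ({lowest_row['state']}): {lowest_row[rate_col]:.2f}"
```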

In [25]:
def calculate_county_stats2(covid_df):

    # YOUR CODE HERE
    sorted_df = covid_df.sort_values(by=['# covid cases per 100k'])
    lowest = sorted_df.iloc[0]
    highest = sorted_df.iloc[-1]

    print(f"{lowest['county']} ({lowest['state']}) has the lowest rate of confirmed COVID cases per 100k: {lowest['# covid cases per 100k']:,.2f}")
    print(f"{highest['county']} ({highest['state']}) has the highest rate of confirmed COVID cases per 100k: {highest['# covid cases per 100k']:,.2f}")
    
    # END OF YOUR CODE HERE

Run the cell below (no changes necessary) to execute your code above

In [26]:
calculate_county_stats2(covid_df)

Lake and Peninsula Borough (Alaska) has the lowest rate of confirmed COVID cases per 100k: 0.00
Bristol Bay Borough (Alaska) has the highest rate of confirmed COVID cases per 100k: 72,727.27


<div class='exercise'><b> Exercise 3.3 [5 pts]: Simple analytics</b>
    
Complete the `calculate_state_deaths2()` function, **which should obtain identical information as problem 2.3 (other than ties), but now using the PANDAS `covid_df` DataFrame.**
1. the state that has the **lowest number** of deaths
2. the state that has the **highest number** of deaths

**NOTES:**
- If there are ties, return any of the tied states
- Place your resulting variables within the `print()` statements that we provide (don't just manually type your textual answers in the blanks)
- The values you report should be Integers, not Floating point numbers.
</div>

In [27]:
def calculate_state_deaths2(covid_df):
    
    # YOUR CODE HERE
    state_deaths = covid_df.groupby('state').sum().sort_values(by=['# covid deaths'])
    lowest = state_deaths.iloc[0]
    highest = state_deaths.iloc[-1]

    # Cast to int: a row taken from a mixed-dtype DataFrame stores the counts as floats
    print(lowest.name + " has the fewest COVID deaths: " + str(int(lowest['# covid deaths'])))
    print(highest.name + " has the most COVID deaths: " + str(int(highest['# covid deaths'])))
    
    # END OF YOUR CODE HERE

Run the cell below (no changes necessary) to execute your code above

In [28]:
calculate_state_deaths2(covid_df)

Hawaii has the fewest COVID deaths: 144
California has the most COVID deaths: 65284


<div class='exercise'><b> Exercise 3.4 [5 pts]: Simple analytics</b>
    
Complete the `calculate_state_deathrate2()` function, **which should obtain identical information as problem 2.4, but now using the PANDAS `covid_df` DataFrame.** That is, return:

1. The state that has the **lowest rate** of deaths based on its entire population
2. The state that has the **highest rate** of deaths based on its entire population

**NOTES:**
- Just as in 2.4, to calculate a state's population, we are asserting that it is sufficient to sum the population over all counties -- and that each county's population can be calculated simply from the data fields stored within `covid_data`.
- Just as in 2.4, counties with 0 COVID cases should contribute 0 to the total population of the state.
- Round your results to a single person (e.g., "1 out of every 2703 people has died" not 2703.4)
- Place your resulting variables within the blanks of the `print()` statements that we provide (don't just manually type your textual answers in the blanks)
</div>

In [29]:
def calculate_state_deathrate2(covid_df):
    
    # YOUR CODE HERE
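    # Work on a copy so that the caller's DataFrame is not modified
    covid_df2 = covid_df.copy()
    # Recover each county's population; a county with 0 cases yields 0/0 = NaN, which sum() skips
    covid_df2['population'] = 100000*covid_df2['# total covid cases'] / covid_df2['# covid cases per 100k']
    covid_df2 = covid_df2.groupby('state').sum()
    # 'death_rate' holds the number of people per COVID death (a smaller value is worse)
    covid_df2['death_rate'] = covid_df2['population'] / covid_df2['# covid deaths']
    covid_df2 = covid_df2.sort_values(by=['death_rate'])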

    print(covid_df2.iloc[-1].name + " has the lowest COVID death rate; 1 out of every " + str(int(covid_df2.iloc[-1]['death_rate'])) + " people has died")
    print(covid_df2.iloc[0].name + " has the highest COVID death rate; 1 out of every " + str(int(covid_df2.iloc[0]['death_rate'])) + " people has died")
    
    # END OF YOUR CODE HERE
    #print(____ + " has the lowest COVID death rate; 1 out of every " + ____ + " people has died")
    #print(____ + " has the highest COVID death rate; 1 out of every " + ____ + " people has died")

Run the cell below (no changes necessary) to execute your code above

In [30]:
calculate_state_deathrate2(covid_df)

Hawaii has the lowest COVID death rate; 1 out of every 3064 people has died
New Jersey has the highest COVID death rate; 1 out of every 334 people has died


These are highly alarming and tragic statistics, and doing calculations like this can really put the severity of the virus into a grounded perspective. To understand the virus and its spread perfectly, everyone would need to be tested and we would need complete contact tracing. Without getting into socio-political issues, our point is that (1) we wish to better understand the virus's effects; and (2) naturally, any real-world data is messy, and thus we will never have _perfect_ data.


Let's now attempt to understand _some_ of the uncertainty around our COVID data. It's reasonable to believe that the # of COVID deaths is fairly reliable. There are inevitably some false negatives -- people who died of COVID but were not counted because other conditions were listed as the cause. However, the number of false positives is probably minimal -- if someone was recorded as dying from COVID, it's probably true. It's also the case that every disease has a mortality rate. For example, if 1,000 randomly-selected people contracted COVID, $N\%$ of them would die. We'd imagine that this percentage should be fairly constant across the United States. Of course, we can think of reasons for this rate not to be perfectly consistent, as some people are at higher risk (e.g., older folks, people with pre-existing conditions, etc). Yet, we can imagine this natural *variance* in the population to be fairly uniform throughout the USA at large. To this end, if all counties were equal in their **testing**, we ought to see a consistent ratio between: (a) the # of people who died from COVID; and (b) the # of people who tested positive for COVID. Within the medical domain, this ratio is referred to as the `case_fatality_rate`. For example, if 750 people tested positive for COVID, and 75 of those people died, then our `case_fatality_rate` would be 0.1 (meaning 10%).
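
The worked example above can be reproduced in a few lines. This is a minimal sketch; the helper name is hypothetical, and returning 0.0 when nobody tested positive is an assumption made here to avoid dividing by zero.

```python
def case_fatality_rate(deaths, positive_tests):
    """Return deaths / positive_tests, or 0.0 when nobody tested positive (an assumption)."""
    if positive_tests == 0:
        return 0.0
    return deaths / positive_tests

# 75 deaths among 750 people who tested positive
cfr = case_fatality_rate(75, 750)
print(f"case_fatality_rate = {cfr:.2f} ({cfr:.0%})")
```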

<div class='exercise'><b>Exercise 3.5 [5 pts]: Further analytics</b>
    
Complete the `add_death_stats()` function below, which should add 3 new columns:
- `case_fatality_rate`
- `# covid deaths per 100k` and
- `population`

And return the updated DataFrame **sorted by `case_fatality_rate` in ascending order** 

**NOTES:**

- `add_death_stats()` should return a new DataFrame that has 8 columns:
    - county
    - state
    - population
    - \# total covid cases
    - \# covid cases per 100k
    - \# covid deaths
    - \# covid deaths per 100k
    - case_fatality_rate
- DataFrame should be sorted by `case_fatality_rate` in ascending order
- Again, the values for `case_fatality_rate` should be < 1. A value of 1 would mean that 100% of people who tested positive for COVID also died.
- `# covid deaths per 100k` is simply defined as the # of COVID deaths for every 100,000 people. We calculate this on a per-county basis.
- Make sure you inspect your results thoroughly. You may have to address the results of divisions by zero (or prevent these divisions in the first place). 
</div>

In [31]:
def add_death_stats(covid_df):
    
    # Adding a tiny constant to each denominator (or calling fillna afterwards) handles the NaNs from division by 0.
    
    # YOUR CODE HERE
    covid_df['population'] = 100000*covid_df['# total covid cases'] / (covid_df['# covid cases per 100k']+0.0001)
#     covid_df.fillna(0, inplace=True)
    covid_df["population"] = covid_df["population"].astype('int32')
    
    covid_df['# covid deaths per 100k'] = 100000*covid_df['# covid deaths'] / (covid_df['population']+0.0001)
#     covid_df.fillna(0, inplace=True)
    covid_df["# covid deaths per 100k"] = covid_df["# covid deaths per 100k"].astype('int32')
    
    covid_df['case_fatality_rate'] =  covid_df['# covid deaths'] / (covid_df['# total covid cases']+0.0001)
#     covid_df.fillna(0, inplace=True)
    covid_df = covid_df.sort_values(by=['case_fatality_rate'])
    # END OF YOUR CODE HERE
    return covid_df

Run the cell below (no changes necessary) to execute your code above

In [32]:
covid_updated = add_death_stats(covid_df)
covid_updated

Unnamed: 0,county,state,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate
1203,Dukes County,Massachusetts,1733,9998.846065,0,17331,0,0.000000
1689,Logan County,Nebraska,82,10962.566845,0,747,0,0.000000
2651,Loving County,Texas,7,4142.011834,0,168,0,0.000000
2782,Wayne County,Utah,177,6528.956105,0,2710,0,0.000000
1690,Loup County,Nebraska,42,6325.301205,0,663,0,0.000000
...,...,...,...,...,...,...,...,...
417,Dodge County,Georgia,1396,6775.054598,113,20604,548,0.080946
528,Wilcox County,Georgia,597,6913.723219,49,8634,567,0.082077
1670,Grant County,Nebraska,41,6581.059390,4,622,643,0.097561
434,Glascock County,Georgia,171,5755.637832,19,2970,639,0.111111
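
One alternative way to handle the divisions by zero is sketched below, under stated assumptions: zero denominators are replaced with `NaN` before dividing, the resulting `NaN`s are filled with 0, and the population is rounded rather than truncated. The toy rows reuse a few values from the table above plus one all-zero county. Because of the rounding, the populations here can differ by one from the truncated values shown above.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the middle row stands in for a county with no cases.
toy_df = pd.DataFrame({
    "# total covid cases": [1733, 0, 597],
    "# covid cases per 100k": [9998.846065, 0.0, 6913.723219],
    "# covid deaths": [0, 0, 49],
})

# Replace zero denominators with NaN so each division yields NaN instead of an error or inf,
# then fill with 0 (assumption: "no cases" should be reported as 0).
rate = toy_df["# covid cases per 100k"].replace(0, np.nan)
cases = toy_df["# total covid cases"].replace(0, np.nan)

toy_df["population"] = (
    (100000 * toy_df["# total covid cases"] / rate).fillna(0).round().astype(int)
)
toy_df["case_fatality_rate"] = (toy_df["# covid deaths"] / cases).fillna(0.0)

print(toy_df[["population", "case_fatality_rate"]])
```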


<div class='exercise'><b>Reflection:</b> Data Analysis allows us to better understand a system or scenario.
</div>

<div class='exercise'><b>Exercise 3.6.1 [2 pts] Trends</b>
    
Having looked at the results from Exercises 3.3, 3.4, and 3.5, what are some trends you've noticed, and what conclusions have you drawn? (2-3 sentences)</div>

<div style='background-color:#F6FEFA;padding:15px'>

**your answer here**

States vary wildly in their death rate (e.g., the number of deaths in New Jersey or California is orders of magnitude higher than in Hawaii or Alaska) and in COVID testing, as is evident from the large variance in the `case_fatality_rate`. This is exacerbated by many counties having small populations and/or few reported deaths. Counties also vary a lot within a state, as some counties with extremely bad statistics are within states that at large are faring better.
</div>

<div class='exercise'><b>Exercise 3.6.2 [2 pts]: Data Reliability</b>
    
Having looked at the results from Exercise 3.5 (i.e., `covid_updated` DataFrame), do you think the original data is reliable and accurate? Are there any potential biases that you're aware of or concerned about? Please explain (3-5 sentences).</div>

<div style='background-color:#F6FEFA;padding:15px'>

**your answer here**

It is possible that some states and counties are more proactive when it comes to testing. This could result in higher case counts. Other counties might have a similar number of cases or more, but these are not represented in the data due to lower testing. Deaths, on the other hand, are harder to overlook in this way, so states with lax testing policies may have inflated deaths-per-case metrics. Perhaps we could supplement the data with some measure of testing rates in the county or state.
    
</div>

<div class='exercise'><b>Exercise 3.6.3 [1 pt]: Relationships Between Variables</b>
    
If a county has 15 confirmed deaths, how many cases would you expect? What would you expect its population to be? Explain why (1-2 sentences in total).

**NOTE:** For this question, we aren't evaluating the accuracy of your answer but your thought-process and reasoning.
</div>

<div style='background-color:#F6FEFA;padding:15px'>

**your answer here**
    
From our limited look at the data above, we could use a rule of thumb regarding the apparent relation between deaths, cases, and population. Eyeballing the table above, we see that in many counties the number of cases is roughly an order of magnitude higher than the number of deaths, and that the population is itself roughly 200x the number of deaths (or at least of that order of magnitude). We could be more principled and instead use the mean `# covid deaths per 100k` and the mean `case_fatality_rate` to calculate the expected cases and population from deaths, but even this is a crude estimate. To continue, we'd like to see how these relationships vary with the number of deaths, as the multiples are likely not constant across all counties, and the relationships may be very different for counties with very few deaths compared to those with many.

</div>
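
The "more principled" estimate described in the answer above can be sketched as follows. The helper name and the two input averages are placeholders chosen for illustration (they are assumptions, not values fitted to the data); in practice they would be replaced by the means of `case_fatality_rate` and `# covid deaths per 100k` from `covid_updated`.

```python
def estimate_from_deaths(deaths, mean_case_fatality_rate, mean_deaths_per_100k):
    """Invert the two ratios to get the expected cases and population for a death count."""
    expected_cases = deaths / mean_case_fatality_rate
    expected_population = deaths / mean_deaths_per_100k * 100000
    return expected_cases, expected_population

# Placeholder averages (assumed for illustration only)
cases, population = estimate_from_deaths(
    15, mean_case_fatality_rate=0.02, mean_deaths_per_100k=200
)
print(f"expected cases: {cases:,.0f}; expected population: {population:,.0f}")
```

As the answer notes, this remains a crude estimate, since both ratios vary considerably across counties.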

<div class='exercise'><b>Exercise 3.6.4 [1 pt]: Further Questions</b>
    
What further questions do you wish to answer about COVID, including ones that may not be possible to answer from this data alone (e.g., Is there a correlation between the average age of people in a county and the # of COVID deaths)? Write at least 3 of your questions.</div>

<div style='background-color:#F6FEFA;padding:15px'>

**your answer here**

Many questions are possible for this one, but here are some:

- What is the COVID testing rate in these counties?
- Is there a correlation between income level and death rate?
- What is each county's access to health care (people with insurance, hospital capacity, etc.)?
    
</div>

## 4. MORE DATA (25 pts)
In order to better understand how COVID (and the testing thereof) has impacted our world, we could look at how it relates to demographics, income, education, health, and political voting. For this exercise, we will make use of `election2020_by_county.csv`.

<div class='exercise'><b>Exercise 4.1 [4 pts]: Load more data</b>

Complete the `merge_data()` function, which should:
1. First, load `election2020_by_county.csv` as a new DataFrame.
2. Then, using the state and county names (case-sensitive) in both DataFrames, merge this new DataFrame with your existing `covid_updated`.
3. Return the merged DataFrame

The returned `merged` DataFrame should contain all 8 columns from `covid_updated`:
- county
- state
- \# total covid cases
- \# covid cases per 100k
- \# covid deaths
- population
- \# covid deaths per 100k
- case_fatality_rate

along with these 15 columns from `election2020_by_county.csv`:
- hispanic
- minority
- female
- unemployed
- income
- nodegree
- bachelor
- inactivity
- obesity
- density
- cancer
- voter_turnout
- voter_gap
- trump
- biden

**NOTES:**
- We are dropping two columns from `election2020_by_county.csv`:
    - fipscode
    - population
- Do not attempt to manually fix any of the state or county names. That is, **our merging should require the state and county names to be identical (case-sensitive) between the two DataFrames.** If there is a discrepancy between the two, do not worry about adjusting these names to find a perfect match.

**HINT:** there are many ways to solve this, but you may find the [pandas.merge()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) function can be really helpful

**EXTRA INFORMATION:** In case you're wondering what the different features/columns are in `election2020_by_county.csv`:

- state: the state in which the county lies
- fipscode: an ID to identify each county
- county: the name of each county
- population: total population
- hispanic: percent of adults that are hispanic
- minority: percent of adults that are nonwhite
- female: percent of adults that are female
- unemployed: unemployment rate, as a percent
- income: median income
- nodegree: percent of adults who have not completed high school
- bachelor: percent of adults with a bachelor’s degree
- inactivity: percent of adults who do not exercise in their leisure time
- obesity: percent of adults with BMI > 30
- density: population density, persons per square mile of land
- cancer: prevalence of cancer per 100,000 individuals
- voter_turnout: percentage of voting age population that voted
- voter_gap: percentage point gap in 2020 presidential voting: trump - biden
</div>

In [33]:
def merge_data(covid_updated, filepath):
    
    # YOUR CODE HERE
    data2020 = pd.read_csv(filepath).drop(columns=['fipscode', 'population'])
    return pd.merge(covid_updated, data2020, on=['state', 'county'])
    # END OF YOUR CODE HERE
    #return ____

Run the cell below (no changes necessary) to execute your code above

In [34]:
merged = merge_data(covid_updated, 'election2020_by_county.csv')

In [35]:
merged.head()

Unnamed: 0,county,state,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,hispanic,minority,...,nodegree,bachelor,inactivity,obesity,density,cancer,voter_turnout,voter_gap,trump,biden
0,Dukes County,Massachusetts,1733,9998.846065,0,17331,0,0.0,1.6,13.0,...,6.6,41.2,17.7,22.0,991.3,209.3,,,,
1,Logan County,Nebraska,82,10962.566845,0,747,0,0.0,2.9,6.1,...,8.1,20.4,28.4,31.1,14.2,216.6,30.124224,82.0,90.4,8.4
2,Loving County,Texas,7,4142.011834,0,168,0,0.0,16.2,25.7,...,11.8,2.6,21.9,29.0,20.7,357.8,-11.864407,84.8,90.9,6.1
3,Wayne County,Utah,177,6528.956105,0,2710,0,0.0,5.4,8.2,...,5.7,26.5,21.0,22.8,56.9,153.8,15.911486,52.9,75.6,22.7
4,Loup County,Nebraska,42,6325.301205,0,663,0,0.0,0.0,0.4,...,6.2,14.4,33.0,30.7,1.3,,-4.849885,65.0,81.5,16.5


In [36]:
merged.shape

(3075, 23)

As mentioned above, the merging requires exact matching between the two DataFrames' `state` and `county` columns. Thus, some mismatches will occur, leaving our `merged` DataFrame with fewer rows than `covid_updated` and `election2020_by_county.csv`.

<div class='exercise'><b>Data Construction / Understanding</b>
</div>

<div class='exercise'><b>Exercise 4.2.1 [1 pt]: Lost Rows</b>
    
Compared to `covid_updated`, how many rows were lost during this merging process to create `merged`? Running the cell below should print to the screen your answer.
</div>

In [37]:
# YOUR CODE HERE
print(len(covid_updated) - len(merged))
# END OF YOUR CODE HERE

6


<div class='exercise'><b>Exercise 4.2.2 [2 pts]: Lost Counties</b>  

List the county and state of *at least 3* such rows that exist in `covid_updated` but didn't make it into `merged`. Running the cell below should print to the screen your answer.
</div>

In [38]:
# YOUR CODE HERE
missing_counties = set()
merged_counties = set()
for index, row in merged.iterrows():
    merged_counties.add(row['county'].lower() + "_" + row['state'].lower())

missing_idxs = []
for index, row in covid_updated.iterrows():
    cur_county = row['county'].lower() + "_" + row['state'].lower()
    if cur_county not in merged_counties:
        print("missing",cur_county)
        missing_idxs.append(index)
        missing_counties.add(cur_county)
# END OF YOUR CODE HERE

missing kalawao county_hawaii
missing kusilvak census area_alaska
missing aleutians west census area_alaska
missing southeast fairbanks census area_alaska
missing doña ana county_new mexico
missing oglala lakota county_south dakota


In [39]:
# Missing Counties
covid_updated.loc[missing_idxs]

Unnamed: 0,county,state,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate
533,Kalawao County,Hawaii,0,0.0,0,0,0,0.0
78,Kusilvak Census Area,Alaska,1508,18138.080346,4,8313,48,0.002653
68,Aleutians West Census Area,Alaska,265,4703.585375,1,5633,17,0.003774
84,Southeast Fairbanks Census Area,Alaska,799,11591.469607,7,6892,101,0.008761
1780,Doña Ana County,New Mexico,27452,12581.40654,509,218194,233,0.018541
2390,Oglala Lakota County,South Dakota,2192,15461.663257,51,14176,359,0.023266
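
The same check can also be expressed as an "anti-join". The sketch below is an illustration on two tiny hand-made frames (`left` and `right` are hypothetical stand-ins for `covid_updated` and the election data); it uses the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column recording where each row's key was found.

```python
import pandas as pd

# Hypothetical stand-ins: `left` plays the role of covid_updated, `right` the election data.
left = pd.DataFrame({
    "county": ["Kalawao County", "Dukes County"],
    "state": ["Hawaii", "Massachusetts"],
})
right = pd.DataFrame({
    "county": ["Dukes County"],
    "state": ["Massachusetts"],
})

# A left merge keeps every row of `left`; `_merge` is "left_only" where no exact match exists.
checked = left.merge(right, on=["state", "county"], how="left", indicator=True)
lost = checked.loc[checked["_merge"] == "left_only", ["county", "state"]]
print(lost)
```

Unlike the loop above, this comparison stays case-sensitive, which matches how the merge itself decides whether two names are identical.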


<div class='exercise'><b>Exercise 4.2.3 [2 pts]: Suggested Fixes</b>
   
If we needed to be highly thorough and needed comprehensive data coverage, do you have any suggestions on how we could quickly, soundly fix most or all of them? (Write 2-3 sentences.)
    
<b>NOTE: Please do not actually fix these mismatches; for this Exercise, it's okay that the `merged` DataFrame is smaller than `covid_updated`</b>
</div>

<div style='background-color:#F6FEFA;padding:15px'>

**your answer here**
    
In the original version of the population pickle file, matches fail for Louisiana. This is because these counties have "parish" appended to their names on the website, while "parish" is dropped in the naming convention used in the population data (even though both came from USAFacts.org!).
Most of the other mismatches are due to the 2020 dataset referring to some counties as "county, city of", whereas the USAFacts.org site simply uses "county". We could use a Regular Expression to replace these; the remaining mismatches are due to Unicode (e.g., $&#x27;$) and to counties being referred to as "census area".

</div>
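
To illustrate the kind of fix suggested above (without applying it to `merged`), here is a rough sketch of a name normalizer. The function name, the list of suffixes, and the example strings are all assumptions chosen for illustration; `html.unescape` and `re.sub` are standard-library calls.

```python
import html
import re

def normalize_county(name):
    """Build a comparison key: unescape HTML entities, lowercase, and drop common suffixes."""
    name = html.unescape(name)          # e.g. "&#x27;" becomes "'"
    name = name.strip().lower()
    name = re.sub(r",\s*city of$", "", name)
    name = re.sub(r"\s+(county|parish|census area)$", "", name)
    return name

# Hypothetical examples of differently formatted names
for raw in ["Prince George&#x27;s County", "Acadia Parish", "Kusilvak Census Area"]:
    print(raw, "->", normalize_county(raw))
```

A key like this would be built in both DataFrames and used only for matching. Stripping suffixes can make two distinct places collide (for example, a county and a city with the same base name), so any such rule would need to be checked against the data.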

This past example demonstrates how easy it is for data to become messy. It also shows the importance of paying close attention to your data in order to understand what you are working with.

Our `case_fatality_rate` column can be viewed as an approximation of how effective and thorough *COVID testing* is for a given county.

Our `# covid deaths` column can be viewed as an extreme indication of how severely *COVID* has impacted a given county.

Our `# covid cases per 100k` column can be viewed as a middle ground between the two aforementioned features. That is, it measures the impact of the disease and is influenced by the thoroughness of COVID testing.

Using these three informative features, we can inspect how impacted each county is, while correlating this with other features of each county, such as income-level, health metrics, demographics, etc. 

<div class='exercise'><b>Exercise 4.3 [2 pts]: Cleaning the data</b>

Before we do any further analysis, we first notice that some counties haven't encountered a single COVID death (usually ones with very small populations), thus providing us with little information. Write code in the cell below to update the `merged` DataFrame so that all rows with 0 deaths are removed.
</div>

In [40]:
# YOUR CODE HERE
merged = merged.loc[merged['# covid deaths'] != 0]
# END OF YOUR CODE HERE

Running `.describe()` allows us to quickly see summary statistics of our DataFrame

In [41]:
merged.describe()

Unnamed: 0,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,hispanic,minority,female,unemployed,...,nodegree,bachelor,inactivity,obesity,density,cancer,voter_turnout,voter_gap,trump,biden
count,3041.0,3041.0,3041.0,3041.0,3041.0,3041.0,3040.0,3040.0,3040.0,3041.0,...,3041.0,3041.0,3041.0,3041.0,3041.0,3001.0,3009.0,3008.0,3008.0,3008.0
mean,12753.53,12459.248573,205.926998,105622.7,226.143374,0.018477,9.149013,22.857928,49.919308,5.525551,...,15.099474,19.90513,26.080697,31.105886,229.098619,228.372009,35.578751,32.994149,65.63863,32.644481
std,43321.38,3630.20722,768.170288,337434.4,118.254022,0.0098,13.820543,19.824572,2.380035,1.967128,...,6.763892,8.707507,5.177858,4.484579,1695.636176,55.510932,13.694091,30.816176,15.470732,15.354665
min,14.0,2220.131963,1.0,403.0,5.0,0.000978,0.0,0.2,19.166215,1.8,...,1.9,4.4,8.1,11.8,0.1,46.2,-168.323353,-90.0,4.0,3.1
25%,1379.0,10205.971249,23.0,11391.0,142.0,0.012371,2.0,7.0,49.4666,4.2,...,10.0,13.9,22.8,28.4,17.7,193.5,27.810525,15.55,56.875,20.9
50%,3273.0,12411.755164,57.0,26426.0,211.0,0.016729,3.9,15.5,50.398389,5.3,...,13.6,17.8,25.9,31.3,45.1,230.2,35.162141,39.1,68.7,29.6
75%,8467.0,14640.671536,137.0,69431.0,290.0,0.022732,9.3,34.5,51.088533,6.5,...,19.4,23.5,29.6,33.9,110.7,264.6,42.478361,56.725,77.5,41.5
max,1360180.0,72727.272727,25483.0,10039110.0,866.0,0.142856,99.2,99.4,56.633907,24.0,...,53.3,72.0,41.4,47.6,69468.4,458.3,100.0,93.1,96.2,94.0


Using the information reported from `.describe()`, we can imagine dividing our DataFrame into 4 separate bins, based on the distribution for any given feature. Specifically, based on a particular feature:
- the $1^{st}$ bin will be the data that has values between the **min** and **25%**
- the $2^{nd}$ bin will be the data that has values between **25%** and **50%**
- the $3^{rd}$ bin will be the data that has values between **50%** and **75%**
- the $4^{th}$ bin will be the data that has values between **75%** and **max**
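
For reference, pandas can assign such quartile bins directly with `pd.qcut`. The sketch below runs on a made-up `toy_df` (all values are invented for illustration) and then averages a second column within each bin. Unlike the inclusive min/max ranges described above, `qcut` places every row in exactly one bin.

```python
import pandas as pd

# Invented values, for illustration only
toy_df = pd.DataFrame({
    "obesity": [22.0, 25.1, 28.4, 29.0, 31.3, 32.2, 33.9, 36.0],
    "case_fatality_rate": [0.010, 0.012, 0.015, 0.017, 0.016, 0.020, 0.021, 0.023],
})

# q=4 cuts at the 25%, 50%, and 75% quantiles, producing four bins
toy_df["obesity_bin"] = pd.qcut(toy_df["obesity"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Mean case_fatality_rate within each obesity quartile
summary = toy_df.groupby("obesity_bin", observed=True)["case_fatality_rate"].mean()
print(summary)
```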

<div class='exercise'><b>Exercise 4.4 [3 pts]: Partitioning our data</b>
    
Complete the `partition_df()` function, which takes as input:
- DataFrame to work with
- feature (e.g., obesity) to filter by
- minimum value
- maximum value

and outputs:
- a subset of the DataFrame that has values between the passed-in minimum and maximum values (inclusively) for the passed-in feature.

For example, if we called `partition_df(merged, 'obesity', 30, 45)`, it should return a subset of the `merged` DataFrame that has obesity values between 30 and 45 (and including the boundary values of 30 and 45).
</div>

In [42]:
def partition_df(df, column_name, minv, maxv):
    # YOUR CODE HERE
    # Filter on the DataFrame that was passed in (not the global `merged`); bounds are inclusive
    return df.loc[(df[column_name] >= minv) & (df[column_name] <= maxv)]
    # END OF YOUR CODE HERE

<div class='exercise'><b>Exercise 4.5: [4 pts] Exploratory Data Analysis</b>
    
Identify a few features that you're interested in, and inspect if there's any correlation with the COVID data. Specifically, simply run your `partition_df()` function below, many times, each with a different subset of the data -- select a range of values and a particular feature. For example, if I'm interested in __cancer__, I could look at the 4 quartiles (per `.describe()`) and use those ranges of values as I repeatedly execute `partition_df()`. For this exercise, after running the function several times, **write 3-5 sentences about any patterns or correlations you noticed or didn't notice but expected to find.**
</div>

In [43]:
# YOUR CODE HERE
partition_df(merged, 'income', 21000, 31000).describe()
# END OF YOUR CODE HERE

#partition_df(merged, 'your feature here', your_min_value, your_max_va).describe()

Unnamed: 0,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,hispanic,minority,female,unemployed,...,nodegree,bachelor,inactivity,obesity,density,cancer,voter_turnout,voter_gap,trump,biden
count,121.0,121.0,121.0,121.0,121.0,121.0,121.0,121.0,121.0,121.0,...,121.0,121.0,121.0,121.0,121.0,121.0,121.0,121.0,121.0,121.0
mean,2495.933884,14093.049503,54.528926,16677.603306,319.975207,0.023939,8.215702,44.409091,49.551705,9.118182,...,26.886777,11.815702,31.850413,36.371901,83.109917,241.257025,44.908995,13.433058,56.07686,42.643802
std,3023.099014,3809.136598,65.744718,17881.233436,140.086965,0.012134,20.992959,32.024454,3.863055,2.245291,...,5.672785,3.637857,4.331688,5.373813,158.592921,48.314351,10.795226,45.084074,22.579658,22.512089
min,112.0,4871.683341,2.0,1536.0,26.0,0.001691,0.0,0.4,35.462777,4.4,...,8.0,5.8,19.7,21.0,1.4,99.2,-10.411765,-71.6,13.5,9.3
25%,968.0,11502.408541,19.0,8063.0,216.0,0.016431,0.7,6.4,48.747893,7.7,...,23.7,9.3,29.1,33.3,18.0,212.2,39.269619,-27.0,35.6,20.4
50%,1618.0,14364.214788,36.0,11194.0,318.0,0.02234,1.6,50.5,50.654499,8.7,...,26.4,11.2,32.2,36.4,37.6,239.9,44.536665,13.5,54.9,42.7
75%,2574.0,16428.22437,55.0,18522.0,406.0,0.026954,3.1,72.8,51.918536,10.3,...,29.8,13.6,35.2,40.3,75.5,272.3,51.030085,57.8,78.0,62.6
max,22506.0,26696.123147,489.0,130624.0,839.0,0.073958,99.2,99.4,56.526573,17.6,...,53.3,27.9,41.3,47.6,996.7,380.0,70.003709,80.4,89.7,85.1


In [44]:
z = merged.cancer.quantile(0.5)

In [45]:
# YOUR CODE HERE
def view_partitions(df, feature, n_partitions=3, cols=None):
    # Display summary statistics for each of n_partitions quantile-based slices of df
    if cols is None:
        cols = df.columns
    for i in range(n_partitions):
        # Quantile bounds of the i-th slice (e.g., 0 to 1/3, 1/3 to 2/3, 2/3 to 1)
        start = i / n_partitions
        stop = (i + 1) / n_partitions
        display(partition_df(df, feature,
                             df[feature].quantile(start),
                             df[feature].quantile(stop))[cols].describe())
view_partitions(merged, 'obesity')
# END OF YOUR CODE HERE

#partition_df(merged, 'your feature here', your_min_value, your_max_va).describe()

Unnamed: 0,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,hispanic,minority,female,unemployed,...,nodegree,bachelor,inactivity,obesity,density,cancer,voter_turnout,voter_gap,trump,biden
count,1024.0,1024.0,1024.0,1024.0,1024.0,1024.0,1024.0,1024.0,1024.0,1024.0,...,1024.0,1024.0,1024.0,1024.0,1024.0,1005.0,1008.0,1008.0,1008.0,1008.0
mean,23015.58,11245.026814,359.40918,197796.9,193.865234,0.017567,14.867969,24.807617,49.77299,5.053027,...,13.071387,25.463184,22.147559,26.325781,318.36084,213.755323,32.575558,24.924702,61.539683,36.61498
std,70388.26,3679.174439,1242.825454,543593.1,122.28426,0.011406,18.063739,19.903407,2.489231,1.836033,...,6.759016,10.718269,4.219824,2.963034,1528.473747,60.398586,17.587883,35.855241,18.042064,17.822056
min,14.0,2220.131963,1.0,403.0,5.0,0.001076,0.0,0.2,31.955024,1.8,...,1.9,6.9,8.1,11.8,0.1,46.2,-13.483146,-90.0,4.0,3.1
25%,1137.5,8827.975888,19.0,11425.5,105.75,0.010494,3.1,9.5,49.349121,3.8,...,8.4,17.275,19.3,25.1,11.3,170.5,21.639556,-1.5,48.4,21.9
50%,3668.5,11140.682215,55.0,34884.0,166.0,0.015124,7.35,18.4,50.329872,4.8,...,11.3,23.3,22.5,27.25,45.65,212.6,29.940663,27.95,63.0,34.95
75%,16236.5,13547.379257,223.25,157567.5,257.0,0.02182,19.225,34.775,51.005636,5.9,...,16.225,32.1,25.0,28.5,154.85,251.8,40.189642,54.35,76.3,49.525
max,1360180.0,40267.717979,25483.0,10039110.0,866.0,0.142856,95.3,97.4,55.707947,24.0,...,46.8,72.0,38.8,29.5,32903.3,433.9,99.552895,93.1,96.2,94.0


Unnamed: 0,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,hispanic,minority,female,unemployed,...,nodegree,bachelor,inactivity,obesity,density,cancer,voter_turnout,voter_gap,trump,biden
count,1043.0,1043.0,1043.0,1043.0,1043.0,1043.0,1043.0,1043.0,1043.0,1043.0,...,1043.0,1043.0,1043.0,1043.0,1043.0,1032.0,1035.0,1034.0,1034.0,1034.0
mean,9040.499521,12536.25358,144.67977,71975.45,228.105465,0.018578,7.265292,18.539789,49.921429,5.333365,...,14.535187,18.574401,26.413615,31.290125,198.329434,234.883527,34.98529,38.68617,68.489168,29.802998
std,18409.286865,3125.094452,330.510553,146728.0,114.639627,0.009542,11.172982,16.855867,2.199235,1.838091,...,6.145999,6.14206,3.971328,1.04213,2183.30521,54.034435,12.455518,24.682625,12.384693,12.307151
min,22.0,4638.174288,1.0,462.0,14.0,0.000978,0.0,1.5,19.166215,1.8,...,3.8,4.4,16.4,29.5,0.2,82.5,-168.323353,-68.1,15.0,5.6
25%,1268.5,10607.128766,22.5,10552.0,149.0,0.012535,1.9,6.0,49.415479,4.1,...,10.0,14.2,23.5,30.4,17.8,200.05,28.427332,23.9,60.9,20.4
50%,3218.0,12489.642185,57.0,26403.0,208.0,0.016589,3.4,12.0,50.34627,5.2,...,13.1,17.5,26.3,31.3,43.3,236.35,34.629388,42.15,70.25,28.0
75%,7957.0,14424.082575,129.0,64463.5,284.0,0.022275,7.25,26.8,50.990937,6.4,...,18.15,21.6,28.9,32.2,98.6,268.925,40.833752,57.475,77.9,37.1
max,190914.0,35687.475306,6647.0,1584063.0,839.0,0.082077,99.2,99.4,56.633907,21.8,...,53.3,43.7,41.4,33.0,69468.4,425.4,100.0,86.0,92.0,83.1


Unnamed: 0,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,hispanic,minority,female,unemployed,...,nodegree,bachelor,inactivity,obesity,density,cancer,voter_turnout,voter_gap,trump,biden
count,1027.0,1027.0,1027.0,1027.0,1027.0,1027.0,1026.0,1026.0,1026.0,1027.0,...,1027.0,1027.0,1027.0,1027.0,1027.0,1017.0,1019.0,1019.0,1019.0,1019.0
mean,6134.600779,13625.198267,112.320351,46473.48,256.364167,0.019248,5.309942,25.028558,50.07766,6.183155,...,17.698734,15.62814,29.711977,35.7111,166.793768,236.452901,39.23915,35.387537,66.894504,31.506968
std,11315.1654,3665.205147,253.309613,94040.56,107.590914,0.007975,8.432645,21.666278,2.393791,2.034852,...,6.487851,4.782821,4.205198,2.414385,1157.566205,48.284292,8.658075,28.942876,14.471159,14.481409
min,112.0,4011.632242,1.0,835.0,14.0,0.000978,0.0,0.4,33.597815,1.8,...,3.1,5.8,15.6,33.0,0.1,61.8,12.181303,-80.8,8.8,5.4
25%,1678.5,11342.701941,31.0,12790.5,184.0,0.014074,1.6,6.4,49.6344,4.9,...,12.7,12.2,26.75,33.9,23.6,205.7,33.382109,21.0,59.9,21.0
50%,3202.0,13563.402889,58.0,23436.0,245.0,0.017931,2.8,17.2,50.53272,6.0,...,17.1,14.8,29.8,35.1,46.7,236.6,39.391691,42.0,70.1,28.2
75%,6172.5,15707.152959,112.0,45932.0,315.0,0.02322,5.2,39.8,51.282194,7.3,...,22.4,18.4,32.8,36.9,101.8,265.9,44.972609,56.55,77.4,38.45
max,178417.0,72727.272727,5331.0,1749342.0,713.0,0.065217,91.8,93.4,56.526573,16.9,...,40.5,42.6,41.3,47.6,35369.2,458.3,70.003709,87.9,93.3,89.6


In [46]:
view_partitions(merged, 'inactivity')

Unnamed: 0,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,hispanic,minority,female,unemployed,...,nodegree,bachelor,inactivity,obesity,density,cancer,voter_turnout,voter_gap,trump,biden
count,1032.0,1032.0,1032.0,1032.0,1032.0,1032.0,1032.0,1032.0,1032.0,1032.0,...,1032.0,1032.0,1032.0,1032.0,1032.0,1018.0,1011.0,1011.0,1011.0,1011.0
mean,23146.45,11150.432527,340.902132,202558.5,169.793605,0.015325,12.649031,22.909012,49.690513,4.983818,...,11.510853,26.247674,20.551744,27.851066,403.765407,214.263065,31.538778,18.227201,58.094461,39.86726
std,68269.95,3931.830558,1145.262913,529484.1,105.646112,0.008038,16.398927,19.055224,2.457474,1.796694,...,5.8948,10.231851,2.863869,4.197408,2668.750016,60.503159,17.02728,32.21985,16.146636,16.081957
min,56.0,2220.131963,1.0,403.0,5.0,0.001076,0.0,0.6,19.166215,1.9,...,1.9,7.5,8.1,11.8,0.3,46.2,-5.066079,-90.0,4.0,3.1
25%,1671.25,8951.968278,22.0,16075.0,100.0,0.0101,3.0,8.6,49.321641,3.8,...,7.8,18.5,18.8,25.4,12.5,168.45,21.468618,-4.55,46.5,27.9
50%,5049.0,11095.477071,71.0,48340.0,149.0,0.013807,5.8,15.85,50.206483,4.7,...,10.1,24.4,21.4,28.3,46.0,212.35,28.493636,21.1,59.3,38.4
75%,18741.5,13125.103461,246.0,168501.0,213.25,0.01878,14.425,31.75,50.831317,5.8,...,13.2,32.225,22.8,30.8,159.975,254.675,37.809084,42.2,70.0,51.05
max,1360180.0,72727.272727,25483.0,10039110.0,796.0,0.058166,95.3,97.4,54.555681,24.0,...,43.9,72.0,23.9,39.7,69468.4,397.4,99.552895,93.1,96.2,94.0


Unnamed: 0,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,hispanic,minority,female,unemployed,...,nodegree,bachelor,inactivity,obesity,density,cancer,voter_turnout,voter_gap,trump,biden
count,1038.0,1038.0,1038.0,1038.0,1038.0,1038.0,1037.0,1037.0,1037.0,1038.0,...,1038.0,1038.0,1038.0,1038.0,1038.0,1020.0,1027.0,1026.0,1026.0,1026.0
mean,9814.897881,12397.532831,179.136802,76444.92,244.025048,0.020296,9.974542,22.27406,49.927892,5.249037,...,14.871965,18.671676,26.00578,31.134875,171.071195,229.32549,34.874362,38.247953,68.31501,30.067057
std,24924.775275,3282.397185,585.472339,191438.1,122.417561,0.011143,15.056662,20.258563,2.239983,1.851616,...,6.292695,5.845579,1.244686,3.048128,1120.419615,53.42843,12.657893,26.835066,13.461418,13.383122
min,14.0,2240.0,1.0,486.0,14.0,0.001495,0.0,0.2,32.813627,1.8,...,3.1,4.4,23.9,22.4,0.1,70.0,-168.323353,-70.9,14.1,3.9
25%,1059.5,10274.425422,20.0,9125.5,158.25,0.013357,2.1,6.2,49.376147,4.1,...,10.6,14.5,25.0,28.825,14.2,195.075,28.556416,22.8,60.6,20.3
50%,2712.5,12359.778588,51.5,22232.0,225.0,0.018193,4.1,14.2,50.333745,5.05,...,13.4,17.9,25.9,31.1,41.1,230.25,34.565739,41.45,69.9,28.3
75%,6977.5,14401.442911,120.75,58166.75,302.75,0.024837,10.1,34.5,51.059021,6.2,...,17.875,21.6,27.0,33.2,106.425,263.0,40.957131,58.1,78.275,37.7
max,310776.0,40267.717979,10644.0,2559902.0,866.0,0.142856,99.2,99.4,56.633907,21.8,...,53.3,46.3,28.2,42.0,32903.3,433.9,100.0,92.0,95.9,85.0


Unnamed: 0,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,hispanic,minority,female,unemployed,...,nodegree,bachelor,inactivity,obesity,density,cancer,voter_turnout,voter_gap,trump,biden
count,1017.0,1017.0,1017.0,1017.0,1017.0,1017.0,1017.0,1017.0,1017.0,1017.0,...,1017.0,1017.0,1017.0,1017.0,1017.0,1009.0,1016.0,1016.0,1016.0,1016.0
mean,5089.915438,13821.796675,95.706981,36143.96,265.166175,0.019868,4.84297,23.32586,50.135491,6.345821,...,18.917109,14.695084,31.762045,34.400983,108.645821,241.338057,40.20557,42.465059,70.485925,28.020866
std,9346.468949,3113.089372,243.825529,66890.32,103.371915,0.009118,7.16761,20.211823,2.396202,1.977488,...,5.945731,4.278158,2.675647,3.425351,223.453651,48.184123,8.440751,27.342446,13.656017,13.695339
min,22.0,4751.61987,1.0,462.0,14.0,0.000978,0.0,0.4,33.597815,1.8,...,3.9,5.8,28.2,23.8,0.3,94.4,1.730104,-71.6,13.5,5.0
25%,1473.0,11717.253202,28.0,11136.0,195.0,0.014178,1.5,6.0,49.773707,5.1,...,14.6,11.7,29.6,32.1,25.1,211.2,34.533637,32.075,65.375,18.8
50%,2886.0,13842.096308,55.0,20836.0,257.0,0.018113,2.6,16.1,50.631215,6.2,...,18.7,14.1,31.3,34.1,46.7,240.5,40.027305,50.0,74.1,24.3
75%,5756.0,15847.694012,104.0,40366.0,328.0,0.023894,4.9,36.5,51.316339,7.4,...,22.9,17.0,33.5,36.5,97.0,270.5,45.843751,60.8,79.7,33.025
max,190914.0,27480.036855,6647.0,1418206.0,711.0,0.097561,63.5,91.2,56.526573,16.9,...,40.5,36.9,41.4,47.6,2800.0,458.3,70.003709,88.3,93.3,85.1


In [47]:
view_partitions(merged, 'income',
                cols=['income','# total covid cases', '# covid cases per 100k', '# covid deaths',
                      'population', '# covid deaths per 100k', 'case_fatality_rate',
                      'obesity', 'inactivity', 'trump', 'biden'])

Unnamed: 0,income,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,obesity,inactivity,trump,biden
count,1014.0,1014.0,1014.0,1014.0,1014.0,1014.0,1014.0,1014.0,1014.0,1010.0,1010.0
mean,35791.410256,4564.531558,13350.222242,94.304734,33482.77,280.529586,0.022066,33.135503,29.494379,67.547624,31.031683
std,3791.47399,11139.018138,3816.829098,291.11674,85384.18,123.871501,0.011772,4.464598,4.660334,15.820049,15.763743
min,21658.0,14.0,2240.0,1.0,403.0,14.0,0.001562,17.4,13.6,13.5,7.1
25%,33431.5,1201.25,10806.661455,25.0,9886.25,193.0,0.014871,30.3,26.0,58.625,18.9
50%,36443.0,2387.5,13463.912711,49.0,17902.0,272.0,0.019555,33.15,29.8,72.2,26.3
75%,38775.25,4560.0,15759.90126,89.0,32225.5,351.0,0.026502,36.2,33.0,79.6,39.9
max,41079.0,190914.0,36941.098829,6647.0,1584063.0,866.0,0.142856,47.6,41.4,92.6,85.1


Unnamed: 0,income,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,obesity,inactivity,trump,biden
count,1013.0,1013.0,1013.0,1013.0,1013.0,1013.0,1013.0,1013.0,1013.0,1004.0,1004.0
mean,45253.674235,10388.89536,12226.777226,165.553801,80516.26,215.939783,0.017879,31.074729,25.879566,67.863247,30.370319
std,2600.690989,29461.161835,3248.983662,498.445314,190950.2,104.548972,0.00839,3.733559,4.210464,12.911955,12.775929
min,41085.0,22.0,3164.384309,1.0,462.0,11.0,0.001076,14.8,11.2,14.1,5.0
25%,43040.0,1298.0,10200.220173,22.0,11130.0,140.0,0.012405,28.8,23.2,60.675,21.0
50%,45151.0,3444.0,12092.698708,58.0,29209.0,210.0,0.016533,31.3,25.9,70.3,27.9
75%,47381.0,7765.0,14285.714286,134.0,64696.0,273.0,0.022242,33.5,28.6,77.3,37.6
max,50014.0,632579.0,40267.717979,10644.0,2716939.0,795.0,0.097561,42.1,38.6,94.0,85.0


Unnamed: 0,income,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,obesity,inactivity,trump,biden
count,1014.0,1014.0,1014.0,1014.0,1014.0,1014.0,1014.0,1014.0,1014.0,994.0,994.0
mean,59768.488166,23304.83,11800.51699,357.882643,202844.3,181.95069,0.015484,29.107396,22.867949,61.451911,36.580282
std,10586.899122,66758.83,3626.506252,1183.500199,531677.3,103.354657,0.007557,4.286441,4.35784,16.607793,16.528661
min,50020.0,29.0,2220.131963,1.0,493.0,5.0,0.000978,11.8,8.1,4.0,3.1
25%,52415.25,1758.0,9734.487107,24.0,15117.75,114.25,0.010482,27.0,20.3,51.1,24.725
50%,56020.0,5624.5,11874.547426,79.5,50234.0,165.0,0.014163,29.5,23.05,63.25,34.7
75%,62435.75,19430.25,13802.075414,261.0,170300.0,230.0,0.018935,32.2,25.9,73.35,46.7
max,122641.0,1360180.0,72727.272727,25483.0,10039110.0,796.0,0.058823,40.2,37.8,96.2,94.0


In [48]:
view_partitions(merged, 'trump',
                cols=['trump','income','# total covid cases', '# covid cases per 100k', '# covid deaths',
                      'population', '# covid deaths per 100k', 'case_fatality_rate',
                      'obesity', 'inactivity',])

Unnamed: 0,trump,income,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,obesity,inactivity
count,1004.0,1004.0,1004.0,1004.0,1004.0,1004.0,1004.0,1004.0,1004.0,1004.0
mean,47.7,49957.554781,28678.87,11631.053349,455.145418,243307.5,200.549801,0.017098,29.970817,23.538347
std,11.162241,15303.487288,71862.18,3648.336352,1282.862007,554623.8,120.413928,0.008903,5.696659,5.515891
min,4.0,21658.0,102.0,2220.131963,1.0,768.0,5.0,0.001076,11.8,8.1
25%,40.9,38876.25,2427.0,9320.124427,43.75,22413.5,114.75,0.011124,26.3,19.8
50%,50.4,48229.0,7546.5,11665.01534,107.5,69777.5,174.0,0.015449,30.0,23.3
75%,56.9,57095.5,26364.75,13835.271535,367.0,234494.0,262.0,0.020823,33.4,26.825
max,61.3,122641.0,1360180.0,40267.717979,25483.0,10039110.0,839.0,0.073958,47.6,40.8


Unnamed: 0,trump,income,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,obesity,inactivity
count,1008.0,1008.0,1008.0,1008.0,1008.0,1008.0,1008.0,1008.0,1008.0,1008.0
mean,68.404167,46773.963294,6680.970238,12746.934355,108.261905,51149.239087,229.904762,0.018631,31.786706,26.44246
std,3.832817,9320.739632,9480.152069,3315.844992,150.655018,66676.2138,109.632214,0.010036,3.57921,4.228102
min,61.3,25768.0,14.0,2240.0,1.0,403.0,11.0,0.000978,19.5,14.1
25%,65.2,40400.0,1640.25,10549.995638,29.0,13686.75,149.0,0.012516,29.5,23.375
50%,68.7,45662.5,3713.0,12833.331581,63.0,28655.5,214.0,0.016767,31.9,26.2
75%,71.7,51796.0,7296.75,14862.345045,127.5,60505.5,290.0,0.022315,34.2,29.3
max,74.5,97936.0,111719.0,36941.098829,2182.0,636234.0,795.0,0.142856,42.3,39.9


Unnamed: 0,trump,income,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,obesity,inactivity
count,1012.0,1012.0,1012.0,1012.0,1012.0,1012.0,1012.0,1012.0,1012.0,1012.0
mean,80.768775,43782.178854,2811.449605,13062.377686,50.146245,20564.411067,251.620553,0.019852,31.608202,28.392095
std,4.277882,9003.383179,3442.561533,3170.98325,61.268461,23517.040977,118.730827,0.010172,3.596872,4.505372
min,74.5,23047.0,14.0,2874.743326,1.0,462.0,14.0,0.001562,19.4,13.6
25%,77.5,37405.5,740.25,10924.498794,13.0,6040.75,169.75,0.013567,29.075,25.1
50%,80.1,42132.5,1757.0,12952.91086,31.0,13604.0,238.0,0.017746,31.6,28.0
75%,83.6,49036.25,3609.5,15108.120682,66.0,26202.25,309.0,0.024481,33.9,32.0
max,96.2,86354.0,34393.0,24697.221563,862.0,223233.0,866.0,0.111111,43.2,41.4


<div style='background-color:#F6FEFA;padding:15px'>

**your answer here**

Above we split the data into three equal-sized partitions (lower, middle, and upper) based first on `obesity` and then on `inactivity`. We can see that the average death rate of the counties in these partitions is positively correlated with both of these features. This was expected, as preexisting health conditions (obesity) and health risks (inactivity) increase all-cause mortality and also have a strong effect on how serious a covid infection can be. Finally, we see that income has an even stronger relationship with the death rate, though here the correlation is negative. Obesity and inactivity are both negatively correlated with income as well. The relationship between voting for Trump and income is not a strong one, though there is a positive correlation between Trump voting and obesity, inactivity, and covid death rate. We are of course not making any causal claims here, as these are unjustified outside of a controlled experiment.
    
</div>
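
As a rough illustration of the kind of check behind statements like "positively correlated", the sketch below computes Pearson correlations and tercile means on a small toy table. The `toy` DataFrame and its values are invented for illustration and are not taken from `merged`; only the column names mirror the real data.

```python
import pandas as pd

# Toy stand-in for `merged`; the values are invented for illustration only.
toy = pd.DataFrame({
    "obesity": [25.0, 28.0, 31.0, 34.0, 37.0, 40.0],
    "income": [70000, 62000, 55000, 47000, 41000, 36000],
    "# covid deaths per 100k": [120, 150, 190, 230, 260, 300],
})

# Pearson correlation of each feature with the death rate.
target = "# covid deaths per 100k"
corr_with_deaths = toy.corr()[target].drop(target)
print(corr_with_deaths)

# Split counties into three equal-sized partitions by obesity,
# then compare the mean death rate in each partition.
tercile = pd.qcut(toy["obesity"], q=3, labels=["lower", "middle", "upper"])
by_tercile = toy.groupby(tercile, observed=True)[target].mean()
print(by_tercile)
```

The same two calls could be pointed at `merged` to put numbers on the relationships described in the answer above. A correlation coefficient still says nothing about causation.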

`.describe()` provides these nice summary statistics over any portion of the data that we give it. Instead of iteratively inspecting several subsets of the data, let's actually split our DataFrame into new categories: instead of representing every feature by floating-point numbers, let's create new _categorical_ names for a feature based on its values. The code below does just this. It creates a new column, `income group`, that has 4 possible values, each one corresponding roughly to a quartile of the original `income` values (the bin edges are fixed, round-number cut points).

Run the cell below.

In [49]:
# Bin edges that split `income` into 4 groups
bins = [0, 38000, 45000, 52000, 200000]
names = ['income-group-1', 'income-group-2', 'income-group-3', 'income-group-4']
# Map each bin index (1..4) returned by np.digitize to its group name
d = dict(enumerate(names, 1))
merged['income group'] = np.vectorize(d.get)(np.digitize(merged['income'], bins))
merged

Unnamed: 0,county,state,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,hispanic,minority,...,bachelor,inactivity,obesity,density,cancer,voter_turnout,voter_gap,trump,biden,income group
34,Saline County,Nebraska,2044,14370.078740,2,14223,14,0.000978,24.3,29.3,...,14.3,31.0,33.0,1.5,309.1,40.293874,28.5,62.9,34.4,income-group-3
35,Lake County,Colorado,929,11431.032361,1,8126,12,0.001076,33.9,35.7,...,30.3,15.1,17.5,19.4,112.4,29.853937,-20.2,37.9,58.1,income-group-3
36,Dodge County,Minnesota,2311,11039.457342,3,20933,14,0.001298,4.9,7.5,...,24.1,18.3,24.9,709.0,140.3,17.427640,30.5,64.0,33.5,income-group-4
37,Pitkin County,Colorado,2868,16142.286261,4,17766,22,0.001395,9.8,14.3,...,56.4,8.9,14.9,17.7,70.8,12.694664,-52.1,23.2,75.3,income-group-4
38,Jefferson County,Nebraska,669,9494.748794,1,7045,14,0.001495,3.6,5.8,...,13.2,28.0,36.2,11.0,230.7,33.996448,43.1,70.4,27.3,income-group-2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3070,Dodge County,Georgia,1396,6775.054598,113,20604,548,0.080946,3.4,34.7,...,14.4,28.8,28.5,44.0,212.3,51.126453,45.5,72.4,26.9,income-group-1
3071,Wilcox County,Georgia,597,6913.723219,49,8634,567,0.082077,4.3,41.5,...,8.7,27.7,31.5,24.5,248.4,53.295374,46.9,73.2,26.3,income-group-1
3072,Grant County,Nebraska,41,6581.059390,4,622,643,0.097561,1.9,3.8,...,18.6,30.6,28.3,4.5,305.1,23.574144,88.3,93.3,5.0,income-group-3
3073,Glascock County,Georgia,171,5755.637832,19,2970,639,0.111111,1.6,12.6,...,11.4,24.8,28.6,21.4,264.6,32.529082,79.7,89.6,9.9,income-group-2
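
For comparison, the same binning can be sketched with `pd.cut`, which attaches the labels directly. This is an alternative under stated assumptions, not the method the assignment requires: the `income` Series below is hypothetical, and `right=False` is chosen so that each bin is closed on the left, which matches the default behavior of `np.digitize`. If true quartiles are wanted instead of fixed edges, `pd.qcut` computes the edges from the data.

```python
import numpy as np
import pandas as pd

bins = [0, 38000, 45000, 52000, 200000]
names = ['income-group-1', 'income-group-2', 'income-group-3', 'income-group-4']

# Hypothetical incomes, including values that sit exactly on a bin edge.
income = pd.Series([21658, 38000, 44999, 45000, 51999, 122641], name='income')

# Fixed edges: right=False makes bins [0, 38000), [38000, 45000), ...
with_cut = pd.cut(income, bins=bins, labels=names, right=False)

# The np.digitize approach from the cell above, for comparison.
d = dict(enumerate(names, 1))
with_digitize = np.vectorize(d.get)(np.digitize(income, bins))

# Data-driven edges: each group holds (about) a quarter of the rows.
with_qcut = pd.qcut(income, q=4, labels=names)

print(pd.DataFrame({'income': income, 'cut': with_cut, 'qcut': with_qcut}))
```

With fixed edges the group sizes depend on the data, while `pd.qcut` equalizes the group sizes but lets the edges move.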


<div class='exercise'><b>Exercise 4.6 [5 pts]: Aggregate data</b>
    
    
Write code in the cell below to group (and display) the data according to the 4 income groups. Also, while we will still keep the same columns (i.e., features), the values of each should now represent the __average__ value of all rows that were subsumed in the making of the aggregate income-group. Your resulting DataFrame should have just 4 rows (income-group-1, income-group-2, income-group-3, income-group-4). See example in the cell below.


Since every feature (except for `# total covid cases`, `# covid deaths`, and `population`) was already an average value corresponding to a particular __county__, when we aggregate our data by income groups, we are effectively taking an average of averages. Many counties are being aggregated for each income-group row. This approach isn't as accurate as possible; it would be more accurate if we re-weighted every value so that it was truly an average based on the total __population__ of all counties that are subsumed within a given income-group row. That's okay, though. An average of averages will suffice for the purpose of this exercise. 
</div>
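
To make the difference between an average of averages and a population-weighted average concrete, here is a minimal sketch on a toy table. The counties and numbers are invented for illustration; only the column names (`income group`, `population`, `bachelor`) follow `merged`. It is not required for the exercise.

```python
import pandas as pd

# Hypothetical counties; the numbers are invented for illustration only.
toy = pd.DataFrame({
    "income group": ["income-group-1", "income-group-1", "income-group-2"],
    "population": [1_000, 9_000, 5_000],
    "bachelor": [10.0, 30.0, 20.0],
})

# Average of averages: every county counts equally.
unweighted = toy.groupby("income group")["bachelor"].mean()

# Population-weighted average: sum(value * population) / sum(population).
weighted_sum = (toy["bachelor"] * toy["population"]).groupby(toy["income group"]).sum()
total_pop = toy.groupby("income group")["population"].sum()
weighted = weighted_sum / total_pop

print(pd.DataFrame({"unweighted": unweighted, "weighted": weighted}))
```

In `income-group-1` the small county pulls the unweighted mean down to 20, while the weighted mean of 28 stays close to the large county's value.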

In [50]:
# EXAMPLE: If our `merged` DataFrame were
# COUNTY    INCOME GROUP    BACHELOR ... (other columns, too)
#   A            2             50
#   B            1             20
#   C            1             30
#   D            2             70
#   E            3             95

# it should become
# INCOME GROUP    BACHELOR ... (other columns, too)
#   1                25
#   2                60
#   3                95

# YOUR CODE HERE
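# numeric_only=True skips the text columns (`county`, `state`), which newer pandas versions refuse to average
merged.groupby('income group').mean(numeric_only=True)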
# END OF YOUR CODE HERE

income group,# total covid cases,# covid cases per 100k,# covid deaths,population,# covid deaths per 100k,case_fatality_rate,hispanic,minority,female,unemployed,...,nodegree,bachelor,inactivity,obesity,density,cancer,voter_turnout,voter_gap,trump,biden
income-group-1,3914.182085,13615.466077,85.320117,28207.182085,294.240822,0.022849,8.870044,32.753891,49.980281,7.318502,...,22.04743,13.603671,30.212041,33.792952,90.745521,237.447788,41.772286,32.861062,65.754867,32.893805
income-group-2,8520.437576,12489.246427,144.338182,64748.787879,235.003636,0.019232,8.779126,20.646966,49.821097,5.738788,...,15.688485,17.509455,27.076364,31.584606,132.777818,234.975124,37.150281,40.767722,69.566991,28.799269
income-group-3,12775.409524,12310.722884,193.202721,99245.555102,208.519728,0.017124,8.771293,18.513061,49.886102,4.891429,...,12.912925,20.37102,25.205306,30.736735,336.364762,227.790884,34.352153,35.464835,66.802198,31.337363
income-group-4,24623.295,11580.541113,383.7975,219532.9075,175.23,0.015219,10.1145,20.703125,49.99907,4.362,...,10.5865,27.31175,22.341375,28.664,347.65175,214.379646,29.685445,22.634955,60.32356,37.688604


<div class='exercise'><b>Wrapping Up</b>
</div>

<div class='exercise'><b>Exercise 4.7.1 [1 pt]: Conclusions</b>
What are your conclusions/findings from this alternative view of the data? (2-4 sentences).
</div>

<div style='background-color:#F6FEFA;padding:15px'>

**your answer here**
    
We notice that covid deaths and cases per 100k are much higher in the bottom group, income-group-1, but don't change as significantly among the other 3 groups. The lowest income group also has much higher rates of obesity, inactivity, and unemployment, which might contribute to covid mortality. The lower 2 income groups have much lower population density than the higher 2, so they are situated in more rural areas. There is also a big gap in minority population between income-group-1 and the other groups.

</div>

<div class='exercise'><b>Exercise 4.7.2 [1 pt]: Possible Weaknesses</b>
What are some weaknesses from this view of the data? (2-4 sentences).
</div>

<div style='background-color:#F6FEFA;padding:15px'>

**your answer here**
    
Comparing the populations of the groups shows that the data is not treated as granularly as would be ideal. Counties with very small populations are weighted the same as counties with very large populations when taking the mean, so rural areas are overrepresented in the nationwide averages. This also explains why the Trump vs. Biden split is skewed so far from the well-known national popular-vote result.

Also, differences in income correlate more with population density than they might with an individual's socioeconomic status. First, a higher income might not go as far toward standard of living in a city as it does in a rural area. Second, because we use the average income over the whole county, income inequality within that county is not factored in. There could be many low-income individuals living alongside many high-income individuals in the same county. 

</div>

## Moving Forward

In this homework assignment, we've focused on gathering, parsing, and exploring data. However, what if we wanted to *predict* some behavior of the data? For example, imagine one is curious how a particular county will respond to COVID. Or, if we looked at counties' COVID data on a weekly basis, one could be interested in predicting the upcoming week's behavior.

Alternatively, one could be interested in *inference*, whereby we are more concerned with trying to understand __why__ and __how__ a system behaves the way it does. We might wish to understand which factors correlate most strongly with, or even cause, a certain event. This could give us insights into where certain inequalities persist.

For both *prediction* and *inference*, our computational method of solving such a task is referred to as a *model*. For the remainder of CS109, we will devote significant attention to various models.
</div>

## Reflection

As a reminder, this is just **one** of the homework assignments in this course, the point of which is to assess your learning and to provide both you and us with an indication as to how aligned your knowledge and skills are with our learning objectives. To this end, we encourage you to reflect on your progress, strengths, and weaknesses and to make changes, if necessary, to accomplish your goals. Likewise, please reach out to the TFs and teaching staff if you need help. We want everyone to feel comfortable being honest about these elements, with both themselves and us. For these purposes, we will ask you several times throughout the semester to complete an anonymous poll.