# COGS 108 - Final Project Proposal - Group 029

# Names

- Stephen Kim
- Clara Yi
- Ethan Lee
- Ernest Lin
- Wesley Nguyen

# Research Question

Do macroscopic socioeconomic features of a state, specifically median income, percentage of the population without health insurance, and labor breakdown, correlate with its COVID-19 mortality rate in 2020-2021?

## Background and Prior Work

### Introduction
When a society faces unusual challenges, the result is often major cultural shifts and realizations. COVID-19, which has affected global society in unpredictable and significant ways, gives data scientists an opportunity to gain insight into the nuances of healthcare, labor, and economics. By analyzing the CDC's data on COVID-19 related deaths together with data from the United States Census Bureau and the US Bureau of Labor Statistics, our team hopes to shed light on whether a state's overall socioeconomic breakdown influenced its COVID-19 mortality rate in 2020. Our macroscopic approach to this data science problem is motivated by the availability of consistent, state-specific data. We additionally propose a predictive model, in the form of a function built from the analyzed trends, that predicts COVID-19 mortality rate from three hypothetical values: a specified median income, the percentage of the population without health insurance, and the percentage of the labor force in blue-collar jobs.
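
As a rough illustration of what such a predictive function could look like, the sketch below fits an ordinary least squares model with NumPy. This is only one possible modeling choice, not a settled method for the project. The function name `predict_mortality`, the units, and every number in `X` and `y` are made-up placeholders, not real state data.

```python
import numpy as np

# Hypothetical rows: [median income (thousands of USD), % uninsured, % blue-collar].
# All values are invented for illustration only.
X = np.array([
    [78.0,  7.0, 18.0],
    [52.0, 12.0, 27.0],
    [65.0,  9.0, 22.0],
    [48.0, 14.0, 30.0],
    [71.0,  6.0, 20.0],
])
# Hypothetical mortality rates (deaths per 100,000 residents), also invented.
y = np.array([110.0, 190.0, 150.0, 210.0, 120.0])

# Prepend an intercept column and fit ordinary least squares.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)


def predict_mortality(median_income, pct_uninsured, pct_blue_collar):
    """Predict a mortality rate from three input values using the fitted model."""
    features = np.array([1.0, median_income, pct_uninsured, pct_blue_collar])
    return float(features @ coef)


print(predict_mortality(60.0, 10.0, 25.0))
```

With only one observation per state, a model like this has few data points per parameter, so its predictions should be read as a summary of the observed trend rather than a reliable forecast.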

### Prior Work
A health policy article by Adhikari, S. et al. [1] discussed the early impact of COVID-19 based on a city's income level and race/ethnicity data. Their paper focused on ten major cities and found that lower-income, more diverse areas had higher COVID-19 infection and death rates. According to an American Medical Association review of Adhikari's publication, there is "no biological or genetic basis for why these inequities would exist". While this article suggests an important relationship between income, race, and COVID-19 impact, our research team wishes to better understand this relationship at the state level. This macroscopic approach has also been deemed meaningful by larger organizations such as the NIH, as expressed in a 2020 publication from the Journal of General Internal Medicine [2]. The researchers in this study used the Gini index as their measurement of income inequality. They acknowledge that income levels may be representative of a state's healthcare resources and number of essential occupations, but we believe that directly analyzing health insurance and labor statistics in our research will paint a clearer picture of what previous researchers have already suggested.

References:
- [1] https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2768723?resultClick=1
- [2] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7313247/

# Hypothesis


Based on our prior research, we hypothesize that there will be a negative correlation between median income and COVID mortality rate, a positive correlation between the percentage of the population without health insurance and COVID mortality rate, and a positive correlation between the share of "blue collar" workers in the labor force and COVID mortality rate. By combining our three socioeconomic factors into a summarizing coefficient, we hope to create a predictive model that reflects this hypothesis.
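
One way the three hypothesized directions could be checked is with pairwise Pearson correlations, sketched below. The column names and all values are placeholders invented for this example; the real analysis would use the merged state-level table.

```python
import pandas as pd

# Made-up values for five hypothetical states (not real data).
df = pd.DataFrame({
    "median_income": [78000, 52000, 65000, 48000, 71000],
    "pct_uninsured": [7.0, 12.0, 9.0, 14.0, 6.0],
    "pct_blue_collar": [18.0, 27.0, 22.0, 30.0, 20.0],
    "mortality_rate": [110.0, 190.0, 150.0, 210.0, 120.0],
})

# Pearson correlation of each factor with the mortality rate.
correlations = df.corr(method="pearson")["mortality_rate"].drop("mortality_rate")

# Direction each correlation should have if the hypothesis holds.
expected_sign = {"median_income": -1, "pct_uninsured": 1, "pct_blue_collar": 1}

for name, r in correlations.items():
    agrees = (r > 0) == (expected_sign[name] > 0)
    print(f"{name}: r = {r:.2f}, matches hypothesis: {agrees}")
```

A correlation near zero would not support either direction, so the sign check alone is not sufficient; the magnitude of each `r` should be reported as well.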

# Data

### Dataset 1 (COVID)
<ins>Variables</ins>
- US State
- Total COVID Infection Cases
- Total COVID Death Cases
- Date

**Note**: We will not use the remaining variables in this dataset, as they relate to new cases, probable new cases, and similar counts that are already accounted for in the totals.

The ideal dataset for COVID rates would contain the variables listed above. The best source for such a dataset is the official one on the [CDC website](https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36/data). This data is collected by each state government and submitted at least once a week. For our purposes, we are not looking at changes over time but at specific moments in time, so we only need a small multiple of 50 observations (50, 100, 150, ...) from this dataset. There are many options for storing this data, but we will use CSV files and convert them to pandas DataFrames for our cleaning and analysis.
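
A minimal sketch of taking one snapshot per state from such a CSV is shown below. The column names (`submission_date`, `state`, `tot_cases`, `tot_death`) and the date format are assumptions about the export and should be checked against the downloaded file. The rows are invented stand-ins, and defining the rate as deaths per 100 reported cases is also an assumption, since the proposal does not yet fix a definition.

```python
import io
import pandas as pd

# Tiny stand-in for the downloaded CSV; all numbers are invented.
sample_csv = io.StringIO(
    "submission_date,state,tot_cases,tot_death\n"
    "12/30/2020,CA,2200000,25000\n"
    "12/31/2020,CA,2250000,25500\n"
    "12/31/2020,TX,1750000,27000\n"
    "12/31/2020,PR,75000,1500\n"
)

covid = pd.read_csv(sample_csv)
covid["submission_date"] = pd.to_datetime(
    covid["submission_date"], format="%m/%d/%Y"
)

# In practice this set would hold all 50 state abbreviations, so that
# rows for other jurisdictions are dropped.
states = {"CA", "TX"}
snapshot_date = pd.Timestamp("2020-12-31")

# Keep one row per state at the chosen moment in time.
is_snapshot = covid["submission_date"] == snapshot_date
snapshot = covid[is_snapshot & covid["state"].isin(states)].copy()

# Assumed rate definition: deaths per 100 reported cases.
snapshot["deaths_per_100_cases"] = (
    100 * snapshot["tot_death"] / snapshot["tot_cases"]
)
print(snapshot[["state", "deaths_per_100_cases"]])
```

Repeating the filter for n snapshot dates would yield the 50n observations described above.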
### Dataset 2 (Division of Labor)
<ins>Variables</ins>
- Employees, seasonally adjusted
- Month + Year
- State
- Industry (including all industries combined, and/or Blue Collars/White Collar Industries)

We are also interested in determining the rates of participation in labor-intensive/blue-collar industries, which a dataset with these variables allows. One [sample (also available in XLSX format)](https://www.bls.gov/news.release/laus.t03.htm) does this well: it collects data through a broad survey sample, uses trained interviewers for data collection, relies on results collected from the sample itself, and is heavily processed by economists and statisticians (*How BLS Collects Data*). We would want to ensure that the sample chosen includes people from each state and that the participants from each state adequately represent that state's labor force. The data should show the total number of people in the labor force for each state as well as how many are in each industry. We plan to import the data and then clean the dataset so that we have 50n observations, where n is some integer >= 1. The dataset may have information for 50 states over n different time periods, and we need to decide which time periods are most useful for analysis. We will also have to select and categorize certain industries on a blue-collar/white-collar spectrum, as we intend to keep only industries whose jobs are typically blue-collar in nature.
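
The blue-collar share per state could be computed along the lines of the sketch below. The column names, the industry labels, and the choice of which industries count as blue-collar are all assumptions for illustration, and the employee counts are invented. Here the denominator is the sum over the listed industries; if the source provides an "all industries combined" row, that figure could be used instead.

```python
import pandas as pd

# Invented long-format table: one row per state, month, and industry.
labor = pd.DataFrame({
    "state": ["CA", "CA", "CA", "TX", "TX", "TX"],
    "month": ["2020-12"] * 6,
    "industry": ["Construction", "Manufacturing", "Financial Activities"] * 2,
    "employees_thousands": [850.0, 1250.0, 800.0, 750.0, 870.0, 780.0],
})

# Hypothetical classification; the real list is a project decision.
BLUE_COLLAR = {"Construction", "Manufacturing"}

# Restrict to one time period so each state contributes one observation.
one_month = labor[labor["month"] == "2020-12"]

total = one_month.groupby("state")["employees_thousands"].sum()
blue = (
    one_month[one_month["industry"].isin(BLUE_COLLAR)]
    .groupby("state")["employees_thousands"]
    .sum()
)

# Percentage of each state's listed employment in blue-collar industries.
pct_blue_collar = (100 * blue / total).rename("pct_blue_collar").reset_index()
print(pct_blue_collar)
```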
### Dataset 3
<ins>Variables</ins>
- Persons without Health Insurance by State, %
- Median Household Income by State in 2020

The ideal dataset would include the variables listed above. To find the percentage of people without health insurance in each state, we went to the [United States Census Bureau Website](https://www.census.gov/quickfacts/fact/map/CA/HEA775220). Because the data was provided in the form of a map, and not as a CSV file or spreadsheet, we had to manually gather each data point and enter it into this [Google Sheet](https://docs.google.com/spreadsheets/d/174jFoW8KsXGJmpNUx8cbh6j4l6rhQhpOUKIPnkzk3lM/edit?usp=sharing).
Additionally, to find the median household income by state in 2020, we went to [this website](https://fred.stlouisfed.org/release/tables?rid=249&eid=259515&od=2020-01-01#). Because this data was not in a CSV or Excel format, we also had to manually enter it into the same Google Sheet.
Because we are conducting our research project at the state level, we expect to have 50 observations for each of our variables, so that each state has 1 corresponding observation per variable. Thus, our Google Sheet will have 100 data points in total, since we are looking at 2 variables.
Our data was collected in 2020, so it is slightly outdated, but we will still use it for our analysis. Right now, our data is stored in a Google Sheet, but we will convert it to a CSV file and then import it into a pandas DataFrame for cleaning, wrangling, and analysis.
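
Once the sheet is exported to CSV, the cleaning and joining step might look like the sketch below. The column names are hypothetical, the values are placeholders rather than real figures, and the assumption that hand-entered cells arrive as strings such as `"8.0%"` or `"75,000"` should be checked against the actual export.

```python
import pandas as pd

# Stand-in for the exported sheet; placeholder values, not real figures.
census = pd.DataFrame({
    "state": ["California", "Texas"],
    "pct_uninsured": ["8.0%", "20.0%"],
    "median_income": ["75,000", "65,000"],
})

# Convert hand-entered strings to numeric types.
census["pct_uninsured"] = census["pct_uninsured"].str.rstrip("%").astype(float)
census["median_income"] = (
    census["median_income"].str.replace(",", "", regex=False).astype(int)
)

# Stand-in for a table derived from another dataset, keyed by state.
other = pd.DataFrame({
    "state": ["California", "Texas"],
    "pct_blue_collar": [20.0, 25.0],
})

# validate="one_to_one" raises an error if any state appears more than once.
merged = census.merge(other, on="state", how="inner", validate="one_to_one")

# With the full data this count should be 50, one row per state.
print(len(merged))
```

An inner join silently drops states whose names do not match across tables (for example, abbreviations versus full names), so comparing the row count before and after the merge is a cheap sanity check.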

# Ethics & Privacy

There are no concerns regarding personal privacy, as the personal information of individuals in the datasets will not be used. Most of the data we will be using comes from public government datasets, so we can assume that it was collected with consent and kept confidential. However, one concern we discussed was the possibility that certain populations are not as well represented as others in these datasets; for example, blue-collar workers may tend to hide their COVID cases in order to keep working. We have also considered the possibility that COVID cases and deaths were underreported for political reasons, to make political leaders seem more successful in handling the pandemic. We will consider these potential biases during our analysis and in our conclusion.

# Team Expectations 

* Will communicate about anything related to the project (when stuck or in need of help or clarification) outside of weekly meetings as well
* Will perform the work they take on at the end of our weekly meetings
* Will consult the group before deleting other people's code
* Will keep unorganized, raw data on local machines and only keep clean data in the repository

# Project Timeline Proposal

| Meeting Date  | Meeting Time | Completed Before Meeting  | Meeting Topics | 
|---|---|---|---|
| 1/26  |  4 PM | Brainstorm topics/questions; Find datasets (all); Think about proposal  | Work on proposal; finalize research question; Finalize draft of proposal and submit | 
| 2/2  | 4 PM  | Have clean and filtered data to present; have data imported into pandas   | Discuss Wrangling and possible analytical approaches |
| 2/9  | 4 PM  | EDA | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/16  | 4 PM  | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis; Complete project check-in |
| 3/2  | 4 PM  | Complete analysis; Draft results/conclusion/discussion | Discuss/edit full project |
| 3/14  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |