# COGS 108 - Data Checkpoint

# Names

- Stephen Kim
- Clara Yi
- Ethan Lee
- Ernest Lin
- Wesley Nguyen

<a id='research_question'></a>
# Research Question

Do the macroscopic socioeconomic features of a state, specifically median income, percentage of the population without health insurance, and labor breakdown, correlate with its COVID mortality rate in 2020-2021?

# Dataset(s)



### Dataset 1

- Dataset Name: United States COVID-19 Cases and Deaths by State over Time
- Link to the dataset: https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36/data
- Number of observations: 44,280 rows, 15 columns, 664,200 observations total

This dataset contains COVID statistics over time for the United States and its territories. The metrics include total cases, new cases, total deaths, new deaths, and others that give an overall view of COVID in each state. Each row has a submission date, which is how we will link the figures to specific periods of time.

### Dataset 2

- Dataset Name: Employees on nonfarm payrolls by state and selected industry sector, seasonally adjusted
- Link to the dataset: https://www.bls.gov/news.release/laus.t03.htm
- Number of observations: 50 rows, 9 columns, 450 observations total

This dataset from the US Bureau of Labor Statistics counts the total number of employees (in thousands) in the labor force in each state, as well as in each of eight industries (construction, manufacturing, trade/transportation/utilities, finance, services, education/health, leisure/hospitality, government).

### Dataset 3

- Dataset Name: Median Household Income and Percentage of Americans without Health Insurance in 2020
- Link to the dataset: https://docs.google.com/spreadsheets/d/174jFoW8KsXGJmpNUx8cbh6j4l6rhQhpOUKIPnkzk3lM/edit#gid=0
- Number of observations: 50 rows, 2 columns, 100 observations total

This dataset contains the median household income and the percentage of Americans without health insurance in 2020 for each US state. The data was taken from two different sources, the [United States Census Bureau Website](https://www.census.gov/quickfacts/fact/map/CA/HEA775220) and [Federal Reserve Economic Data](https://fred.stlouisfed.org/release/tables?rid=249&eid=259515&od=2020-01-01#), and was manually entered into a Google Sheet that was then converted to a CSV file.

### Merging Data
Since we are using three different primary datasets, we will identify each state by a unique code (CA for California, MO for Missouri, etc.). Ultimately, we will merge the datasets on that code during our analysis, with several rows of data for each state.
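As a minimal sketch of the merge described above, assuming each cleaned table carries a `state` column holding the two-letter code. The column names `total_nonfarm`, `median_income`, and `pct_uninsured`, along with every value, are hypothetical placeholders rather than our real data.

```python
import pandas as pd

# Placeholder COVID table: several dated rows per state (values are made up).
covid = pd.DataFrame({
    "submission_date": ["2021/03/15", "2021/03/16", "2021/03/15", "2021/03/16"],
    "state": ["CA", "CA", "MO", "MO"],
    "tot_cases": [100, 110, 40, 42],
    "tot_death": [5, 6, 2, 2],
})

# Placeholder per-state tables: exactly one row per state (hypothetical columns).
labor = pd.DataFrame({"state": ["CA", "MO"], "total_nonfarm": [500.0, 200.0]})
socio = pd.DataFrame({
    "state": ["CA", "MO"],
    "median_income": [1.0, 2.0],
    "pct_uninsured": [0.1, 0.2],
})

# Join on the state code. validate="many_to_one" raises an error if a
# per-state table unexpectedly contains the same state twice.
merged = (
    covid.merge(labor, on="state", how="inner", validate="many_to_one")
         .merge(socio, on="state", how="inner", validate="many_to_one")
)
print(merged)
```

An inner join keeps only states present in all three tables, so comparing the number of distinct states before and after the merge is a quick way to catch mismatched codes.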

# Setup

In [None]:
%pip install pandas

In [None]:
import pandas as pd 

# Data Cleaning (Process)

## Dataset 1
- With the imported data, we removed the rows we did not need. We only want the 50 states, not the territories, DC, or the separately reported New York City (`NYC`) entry.
- We then kept only the columns needed for analysis: `submission_date`, `state`, `tot_cases`, and `tot_death`.
- We also wanted the dates to be sortable and searchable, so we rearranged them from mm/dd/yyyy into yyyy/mm/dd format.

**Note**: Since the data is arranged by date, we created a function `read_covid_data` that returns the data for the 50 states on a specified date.
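
To illustrate why the date reordering matters, here is a small sketch with placeholder dates (not rows from the dataset). The helper name `to_sortable` is hypothetical, and the `pd.to_datetime` variant is an alternative we could use, not what our cleaning function does.

```python
import pandas as pd

# Placeholder dates in the raw mm/dd/yyyy layout.
raw = pd.Series(["12/31/2020", "01/15/2021", "03/01/2020"])

# Sorting the raw strings compares the month first, so the result
# is not in chronological order.
sorted_raw = sorted(raw)

def to_sortable(date: str) -> str:
    """Reorder mm/dd/yyyy into yyyy/mm/dd."""
    month, day, year = date.split("/")
    return f"{year}/{month}/{day}"

# With the year first, plain string order matches chronological order.
sorted_fixed = sorted(raw.apply(to_sortable))

# Alternative: parse into real datetimes, which also sort chronologically.
parsed = pd.to_datetime(raw, format="%m/%d/%Y").sort_values()

print(sorted_raw)
print(sorted_fixed)
```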

## Dataset 3
For Dataset 3, cleaning the data took two steps. The first step was manually entering the data from the original sources into Google Sheets and exporting it as a CSV file. This manual step was necessary because the original sources did not offer a way to directly extract or download the raw data. Since there were only 50 observations, we decided manual input was the best option.

Our second step for Dataset 3 was to import the data into this notebook. We uploaded the CSV file into our "Raw Data" folder and then used `read_csv` to load it into a dataframe, which is a usable format for our future analysis. After making sure there were no issues, we saved it to the "Cleaned Data" folder.
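
Because the values were typed in by hand, a few automated checks can back up the visual inspection. This is only a sketch: the function `check_state_table` and the column names `state` and `median_income` are hypothetical, and the demo table holds placeholder values.

```python
import pandas as pd

def check_state_table(df: pd.DataFrame, expected_rows: int = 50) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df["state"].duplicated().any():
        problems.append("duplicate state entries")
    if df.isna().any().any():
        problems.append("missing values present")
    return problems

# Placeholder table with a repeated state and a missing value.
demo = pd.DataFrame({
    "state": ["CA", "MO", "MO"],
    "median_income": [1.0, None, 3.0],
})
issues = check_state_table(demo, expected_rows=3)
print(issues)
```

Running the same function on the real dataframe with the default `expected_rows=50` would flag a skipped state, a state entered twice, or a blank cell.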

In [6]:
# Cleaning State Data
def clean_covid_data():
    # Helper: reorder mm/dd/yyyy into yyyy/mm/dd so the date strings sort chronologically
    def apply_date(date: str) -> str:
        split_date = date.split("/")
        return "/".join([split_date[2], split_date[0], split_date[1]])

    # Read the data (already in tabular form)
    covid_data_url = r".\Data\State Data\United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv"
    covid_data = pd.read_csv(covid_data_url)
    
    # States we will not be looking at (These aren't part of the 50 states)
    remove_states = ["RMI", "FSM", "GU", "MP", "PW", "NYC", "PR", "AS", "VI", "DC"]
    covid_data = covid_data[~covid_data["state"].isin(remove_states)]
    
    # Keep only the columns we need (copy so the later column assignment does not act on a view)
    covid_data = covid_data[["submission_date", "state", "tot_cases", "tot_death"]].copy()
    
    # Change Date format to allow for easier sorting
    covid_data["submission_date"] = covid_data["submission_date"].apply(apply_date)
    
    # Sort Date
    covid_data.sort_values("submission_date", inplace=True, ascending=False)
    covid_data.reset_index(inplace=True, drop=True)
    
    # Save Data
    clean_covid_data_url = r".\Cleaned Data\state_covid_data.csv"
    covid_data.to_csv(clean_covid_data_url, index=False)
    
clean_covid_data()

In [None]:
def read_covid_data(month: int, day: int, year: int):
    covid_data_url = r".\Cleaned Data\state_covid_data.csv"
    covid_data = pd.read_csv(covid_data_url)
    
    # Keep only the rows for the requested date, ordered by state
    date_filter = formatDate(month, day, year)
    covid_data = covid_data[covid_data["submission_date"] == date_filter]
    covid_data = covid_data.sort_values("state").reset_index(drop=True)
    
    return covid_data

def formatPreZero(num: int) -> str:
    # Zero-pad single-digit months and days so they match the stored dates (3 -> "03")
    return f"{num:02d}"


def formatDate(month: int, day: int, year: int) -> str:
    return f"{year}/{formatPreZero(month)}/{formatPreZero(day)}"

read_covid_data(3, 15, 2021)

In [None]:
# Cleaning socioeconomic data

socioeconomic_data_url = r'./Raw Data/socioeconomic_data.csv'
socioeconomic_data = pd.read_csv(socioeconomic_data_url)
print(socioeconomic_data)

# Save to the Cleaned Data folder (tracked in our GitHub repository)
clean_socioeconomic_data_url = r"./Cleaned Data/clean_socioeconomic_data.csv"
socioeconomic_data.to_csv(clean_socioeconomic_data_url, index=False)