# COGS 108 - Data Checkpoint

# Names

- Stephen Kim
- Clara Yi
- Ethan Lee
- Ernest Lin
- Wesley Nguyen

<a id='research_question'></a>
# Research Question

Do the macroscopic socioeconomic features of a state, specifically median income, percentage of population without health insurance, and labor breakdown, have a correlation to COVID mortality rate in 2020-2021?

# Dataset(s)

### Dataset 1

- Dataset Name: United States COVID-19 Cases and Deaths by State over Time
- Link to the dataset: https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36/data
- Number of observations: 44,280 rows, 15 columns

This dataset contains COVID-19 statistics over time for the United States and its territories. The metrics include total cases, new cases, total deaths, and new deaths, among others, giving an overall view of COVID-19 in each state. Each row has a submission date, which we will use to link the figures to specific periods of time.
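
As a rough illustration of linking rows to a time period through `submission_date`, the sketch below parses the dates and keeps only the rows inside a chosen window. The small inline table and the `in_window` helper are hypothetical and the numbers are invented; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical sample rows that mimic the CDC columns; the values are invented.
sample = pd.DataFrame({
    "submission_date": ["12/31/2019", "03/15/2020", "03/15/2021", "01/05/2022"],
    "state": ["CA", "CA", "CA", "CA"],
    "tot_death": [0, 10, 500, 900],
})


def in_window(df: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    """Keep rows whose submission_date falls within [start, end], inclusive."""
    dates = pd.to_datetime(df["submission_date"], format="%m/%d/%Y")
    mask = dates.between(pd.Timestamp(start), pd.Timestamp(end))
    return df[mask].reset_index(drop=True)


# Restrict to the 2020-2021 period named in the research question.
window = in_window(sample, "2020-01-01", "2021-12-31")
print(window)
```

Parsing to real timestamps avoids relying on string comparison, which only orders dates correctly when they are stored year-first.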

### Dataset 3

- Dataset Name: Median Household Income and Percentage of Americans without Health Insurance in 2020
- Link to the dataset: https://docs.google.com/spreadsheets/d/174jFoW8KsXGJmpNUx8cbh6j4l6rhQhpOUKIPnkzk3lM/edit#gid=0
- Number of observations: 50 rows, 2 columns (100 observations total)

This dataset contains the median household income and the percentage of residents without health insurance in 2020 for each US state. The data was taken from two different sources, [United States Census Bureau Website](https://www.census.gov/quickfacts/fact/map/CA/HEA775220) and [Federal Reserve Economic Data](https://fred.stlouisfed.org/release/tables?rid=249&eid=259515&od=2020-01-01#), and was manually entered into a Google Sheet that was then converted to a CSV file.
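
One way the two tables could be combined is a join on the two-letter state abbreviation, which both tables carry (`state` in the COVID data, `State` in the socioeconomic data). The sketch below is an assumption about that step, not a settled method. The miniature tables are hypothetical, `uninsured_pct` is a made-up short column name, and the death counts are placeholders.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables; tot_death values are placeholders.
covid = pd.DataFrame({"state": ["AK", "AL", "DC"], "tot_death": [1, 2, 3]})
socio = pd.DataFrame({"State": ["AL", "AK"], "uninsured_pct": [11.7, 13.9]})

# Left-join on the state abbreviation; indicator=True adds a "_merge" column
# that flags rows with no match in the socioeconomic table.
merged = covid.merge(socio, left_on="state", right_on="State",
                     how="left", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only", "state"].tolist()
print(unmatched)
```

Checking the unmatched rows matters here because the COVID data covers the 50 states plus DC, while a 50-row socioeconomic table may not include DC.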

# Setup

In [4]:
!pip install pandas



In [5]:
import pandas as pd


# Data Cleaning

For Dataset 3, we have two primary steps in cleaning the data. The first step was manually entering the data from the data sources into a CSV file via Google Sheets. This manual step was necessary because the original data sources did not offer an option to directly extract or download the raw data. Since there were only 50 rows, we decided manual input was the best option.

Our second step for Dataset 3 was to import the data into this notebook. We uploaded the CSV file to our "Raw_Data" folder and then used pandas' read_csv to load it into a dataframe, a usable format for our future analysis. After making sure there were no issues, we saved it to the "Cleaned Data" folder.
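
A minimal sketch of what "making sure there were no issues" could look like for a hand-entered table is given below. It is an illustration under stated assumptions, not the checks actually performed: the `check_table` helper is hypothetical, and the inline CSV text stands in for the Google Sheets export using three rows copied from the printed table. Because the income values contain thousands separators, the sketch also passes `thousands=","` so that pandas reads them as numbers rather than text.

```python
import io

import pandas as pd

# Stand-in for the exported CSV (assumed layout); three rows from the printed table.
csv_text = '''State,"Persons without Health Insurance, %",Median Household Income in 2020
AL,11.7,"54,393"
AK,13.9,"74,476"
AZ,13.6,"66,628"
'''

# thousands="," lets pandas parse "54,393" as the integer 54393.
df = pd.read_csv(io.StringIO(csv_text), thousands=",")


def check_table(df: pd.DataFrame, expected_rows: int) -> list:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df.isna().any().any():
        problems.append("missing values present")
    if df["State"].duplicated().any():
        problems.append("duplicate state codes")
    return problems


print(check_table(df, expected_rows=3))
print(df.dtypes)
```

For the full table, `expected_rows` would be 50.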

In [6]:
# Cleaning State Data
from pathlib import Path

def clean_covid_data():
    # Convert "MM/DD/YYYY" to "YYYY/MM/DD" so the dates sort correctly as strings
    def apply_date(date: str) -> str:
        month, day, year = date.split("/")
        return f"{year}/{month}/{day}"

    # Read the data (already in tabular form).
    # Building the path with Path avoids hard-coded backslashes, which only work on Windows.
    covid_data_path = Path("Data") / "State Data" / "United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv"
    covid_data = pd.read_csv(covid_data_path)

    # Jurisdictions we will not be looking at (these aren't part of the 50 states + DC).
    # Note: the CDC reports NYC separately from NY, so dropping it leaves NYC out of NY's totals.
    remove_states = ["RMI", "FSM", "GU", "MP", "PW", "NYC", "PR", "AS", "VI"]
    covid_data = covid_data[~covid_data["state"].isin(remove_states)]

    # Keep only the columns we need; .copy() avoids modifying a view of the original frame
    covid_data = covid_data[["submission_date", "state", "tot_cases", "tot_death"]].copy()

    # Change the date format to allow for easier sorting
    covid_data["submission_date"] = covid_data["submission_date"].apply(apply_date)

    # Sort by date, most recent first
    covid_data = covid_data.sort_values("submission_date", ascending=False).reset_index(drop=True)

    # Save the cleaned data
    clean_covid_data_path = Path("Cleaned Data") / "state_covid_data.csv"
    covid_data.to_csv(clean_covid_data_path, index=False)

clean_covid_data()

In [13]:
from pathlib import Path

def formatPreZero(num: int) -> str:
    # Zero-pad to two digits, e.g. 3 -> "03"
    return f"{num:02d}"


def formatDate(month: int, day: int, year: int) -> str:
    # Match the "YYYY/MM/DD" format used in the cleaned CSV
    return f"{year}/{formatPreZero(month)}/{formatPreZero(day)}"


def read_covid_data(month: int, day: int, year: int) -> pd.DataFrame:
    covid_data_path = Path("Cleaned Data") / "state_covid_data.csv"
    covid_data = pd.read_csv(covid_data_path)

    # Keep only the rows submitted on the requested date, ordered by state
    date_filter = formatDate(month, day, year)
    covid_data = covid_data[covid_data["submission_date"] == date_filter]
    return covid_data.sort_values("state").reset_index(drop=True)

read_covid_data(3, 15, 2021)

   submission_date  state  tot_cases  tot_death
0       2021/03/15     AK      58212        331
1       2021/03/15     AL     507479      10798
2       2021/03/15     AR     327060       5481
3       2021/03/15     AZ     834006      16553
4       2021/03/15     CA    3528795      55330
5       2021/03/15     CO     452758       6040
6       2021/03/15     CT     293102       7788
7       2021/03/15     DC      42623       1042
8       2021/03/15     DE      91768       1511
9       2021/03/15     FL    1943062      33574


In [17]:
# Cleaning socioeconomic data
socioeconomic_data_url = r'./Raw_Data/socioeconomic_data.csv'
socioeconomic_data = pd.read_csv(socioeconomic_data_url)
print(socioeconomic_data)

# Save a copy to the "Cleaned Data" folder
clean_socioeconomic_data_url = r'./Cleaned Data/clean_socioeconomic_data.csv'
socioeconomic_data.to_csv(clean_socioeconomic_data_url, index=False)

   State  Persons without Health Insurance, % Median Household Income in 2020
0     AL                                 11.7                          54,393
1     AK                                 13.9                          74,476
2     AZ                                 13.6                          66,628
3     AR                                 10.9                          50,540
4     CA                                  8.9                          77,358
5     CO                                  9.3                          82,611
6     CT                                  7.0                          79,043
7     DE                                  8.1                          69,132
8     FL                                 16.3                          57,435
9     GA                                 15.5                          58,952
10    HI                                  5.0                          80,729
11    ID                                 12.8                   