# COGS 108 - Data Checkpoint

# Names

- Nadia Corral
- Jose Deleon
- Christina Tyagi

<a id='research_question'></a>
# Research Question

*How did changes in the Air Quality Index from 2014 to 2018 affect the incidence of respiratory disease in the Central Valley?*

# Dataset(s)

- Dataset Name: Daily AQI
- Link to the dataset: https://aqs.epa.gov/aqsweb/airdata/download_files.html#AQI
- Number of observations: 34675

Each dataset provides the daily AQI measurements for every county in CA over the course of one year. We are going to combine the yearly datasets to get the daily AQI measurements from 2014 to 2018 for the following counties: Butte, Colusa, Glenn, Fresno, Kern, Kings, Madera, Merced, Placer, San Joaquin, Sacramento, Shasta, Solano, Stanislaus, Sutter, Tehama, Tulare, Yolo and Yuba.
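
Combining the yearly files could look roughly like the sketch below. It is only an illustration: the two small frames stand in for the real yearly CSVs, the Kern values and the 2015 values are made up, and the names `aqi_by_year` and `aqi_all` are hypothetical.

```python
import pandas as pd

# Stand-in frames (assumption): each real file holds one year of daily AQI rows.
aqi_by_year = {
    2014: pd.DataFrame({"county Name": ["Butte", "Kern"], "AQI": [155, 120]}),
    2015: pd.DataFrame({"county Name": ["Butte", "Kern"], "AQI": [60, 98]}),
}

# Tag each frame with its year, then stack them into a single table.
frames = [df.assign(Year=year) for year, df in aqi_by_year.items()]
aqi_all = pd.concat(frames, ignore_index=True)

print(aqi_all)
```

Tagging each frame with its year before concatenating keeps the source year available even if the date column is later dropped or reformatted.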

- Dataset Name: Incidence Rate Report for California by County; Lung and Bronchus (All Stages^), 2014-2018
- Link to the dataset: https://gis.cdc.gov/Cancer/USCS/#/StateCounty/
- Number of observations: 19

The dataset provides the rate of new lung and bronchus cancers from 2014-2018 for every county in CA. We are going to collect the rate of new lung and bronchus cancers for the following counties: Butte, Colusa, Glenn, Fresno, Kern, Kings, Madera, Merced, Placer, San Joaquin, Sacramento, Shasta, Solano, Stanislaus, Sutter, Tehama, Tulare, Yolo and Yuba.

- Dataset Name: 2014-2020 Final Deaths by Year by County
- Link to the dataset: https://data.chhs.ca.gov/dataset/death-profiles-by-county/resource/579cc04a-52d6-4c4c-b2df-ad901c9049b7
- Number of observations: 1425

The dataset provides the number of deaths per year due to respiratory disease for every county in CA. We are going to collect the number of deaths caused by respiratory disease from 2014 to 2018 for the following counties: Butte, Colusa, Glenn, Fresno, Kern, Kings, Madera, Merced, Placer, San Joaquin, Sacramento, Shasta, Solano, Stanislaus, Sutter, Tehama, Tulare, Yolo and Yuba.


# Setup

In [2]:
import pandas as pd
import numpy as np

# Data Cleaning: Incidence Rate Report for California by County; Lung and Bronchus (All Stages^), 2014-2018

In [3]:
#load lung cancer csv
cancer = pd.read_csv('https://raw.githubusercontent.com/cgtyagi/Group062data/main/LungCancer-Sheet%201-Table%201-1.csv')
cancer.head()

Unnamed: 0,Area,Age-Adjusted Rate,Case Count,Population
0,"Alameda County, California",39.2,3399,8220232
1,"Alpine County, California",Data Suppressed,Data Suppressed,Data Suppressed
2,"Amador County, California",56.6,205,189120
3,"Butte County, California",56.4,864,1133413
4,"Calaveras County, California",45.8,202,226337


In [4]:
#rename the inputs in the Area column to keep just the county name
def standardize_county(str_county):
    #non-string inputs (e.g. NaN) have no .strip() method and become NaN
    try:
        str_county = str_county.strip()
    except AttributeError:
        return np.nan

    if 'California' in str_county:
        #note: the replacement leaves a trailing space, e.g. 'Butte '
        return str_county.replace('County, California', '')
    return np.nan
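
As an aside, the same renaming can probably be done without a row-by-row function by using pandas string methods. The sketch below is an alternative under assumptions, not the approach used in this notebook: the `area` series is a toy stand-in, and unlike `standardize_county` it does not turn non-California strings into NaN.

```python
import pandas as pd

# Toy stand-in for cancer['Area'] (assumption), including one missing value.
area = pd.Series(["Butte County, California", "Yolo County, California", None])

# Remove the suffix, then strip leftover whitespace so names compare cleanly.
county = area.str.replace("County, California", "", regex=False).str.strip()

print(county.tolist())
```

Because this version strips the trailing space, any later comparison would need to use names such as `'Butte'` rather than `'Butte '`.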

In [5]:
cancer['Area'] = cancer['Area'].apply(standardize_county)

In [6]:
cancer['Area'].unique

<bound method Series.unique of 0             Alameda 
1              Alpine 
2              Amador 
3               Butte 
4           Calaveras 
5              Colusa 
6        Contra Costa 
7           Del Norte 
8           El Dorado 
9              Fresno 
10              Glenn 
11           Humboldt 
12           Imperial 
13               Inyo 
14               Kern 
15              Kings 
16               Lake 
17             Lassen 
18        Los Angeles 
19             Madera 
20              Marin 
21           Mariposa 
22          Mendocino 
23             Merced 
24              Modoc 
25               Mono 
26           Monterey 
27               Napa 
28             Nevada 
29             Orange 
30             Placer 
31             Plumas 
32          Riverside 
33         Sacramento 
34         San Benito 
35     San Bernardino 
36          San Diego 
37      San Francisco 
38        San Joaquin 
39    San Luis Obispo 
40          San Mateo 
41      Santa Barbara 
42 
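
The output above is the representation of the bound method, because `unique` was referenced without parentheses, so the whole Series is printed instead of its distinct values. A minimal sketch of the call form, using a toy series in place of `cancer['Area']`:

```python
import pandas as pd

# Toy stand-in for cancer['Area'] (assumption); note the trailing spaces.
area = pd.Series(["Alameda ", "Alpine ", "Alameda "])

# Calling the method returns the distinct values in order of appearance.
distinct = area.unique()

# nunique() gives just the number of distinct non-missing values.
n_distinct = area.nunique()

print(distinct, n_distinct)
```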

In [7]:
#filter out the counties to only get counties in the central valley
central_valley = ['Butte', 'Colusa', 'Glenn', 'Fresno', 'Kern', 'Kings', 'Madera',
                  'Merced', 'Placer', 'San Joaquin', 'Sacramento', 'Shasta', 'Solano',
                  'Stanislaus', 'Sutter', 'Tehama', 'Tulare', 'Yolo', 'Yuba']
#strip the trailing space left by standardize_county before comparing
cancer_sub = cancer.loc[cancer['Area'].str.strip().isin(central_valley)]
cancer_sub

Unnamed: 0,Area,Age-Adjusted Rate,Case Count,Population
3,Butte,56.4,864,1133413
5,Colusa,56.0,65,106900
9,Fresno,41.1,1863,4884073
10,Glenn,55.7,93,139317
14,Kern,44.4,1676,4407177
15,Kings,41.6,253,750009
19,Madera,41.3,333,773293
23,Merced,45.4,530,1343647
30,Placer,42.0,1125,1898446
33,Sacramento,50.7,4081,7545625
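
The raw table marks withheld cells with the string `Data Suppressed` (see Alpine County in the first preview), which would keep the numeric columns stored as text. One possible way to handle this, sketched on a small stand-in frame, is to coerce those columns to numbers so that suppressed cells become NaN. The frame `cancer_demo` and the list `numeric_cols` are hypothetical names.

```python
import pandas as pd

# Small stand-in for the cancer table (assumption), values read in as text.
cancer_demo = pd.DataFrame({
    "Area": ["Alameda ", "Alpine ", "Butte "],
    "Age-Adjusted Rate": ["39.2", "Data Suppressed", "56.4"],
    "Case Count": ["3399", "Data Suppressed", "864"],
})

# Convert to numbers; anything that cannot be parsed becomes NaN.
numeric_cols = ["Age-Adjusted Rate", "Case Count"]
cancer_demo[numeric_cols] = cancer_demo[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(cancer_demo.dtypes)
```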


# Data Cleaning: 2014-2020 Final Deaths by Year by County

In [8]:
#load deaths csv
deaths = pd.read_csv('https://raw.githubusercontent.com/cgtyagi/Group062data/main/2021-11-29_deaths_final_2014_2020_county_year_sup.csv')
deaths.head()

Unnamed: 0,Year,County,Geography_Type,Strata,Strata_Name,Cause,Cause_Desc,Count,Annotation_Code,Annotation_Desc
0,2014,Alameda,Occurrence,Total Population,Total Population,ALL,All causes (total),9357.0,,
1,2014,Alameda,Occurrence,Age,Under 1 year,ALL,All causes (total),105.0,,
2,2014,Alameda,Occurrence,Age,1-4 years,ALL,All causes (total),17.0,,
3,2014,Alameda,Occurrence,Age,5-14 years,ALL,All causes (total),17.0,,
4,2014,Alameda,Occurrence,Age,15-24 years,ALL,All causes (total),133.0,,


In [9]:
#get an understanding of the shape
deaths.shape

(147784, 10)

In [10]:
#keep only the years 2014-2018 (between is inclusive on both ends)
deaths_sub = deaths.loc[deaths['Year'].between(2014, 2018)]
deaths_sub.head()

Unnamed: 0,Year,County,Geography_Type,Strata,Strata_Name,Cause,Cause_Desc,Count,Annotation_Code,Annotation_Desc
0,2014,Alameda,Occurrence,Total Population,Total Population,ALL,All causes (total),9357.0,,
1,2014,Alameda,Occurrence,Age,Under 1 year,ALL,All causes (total),105.0,,
2,2014,Alameda,Occurrence,Age,1-4 years,ALL,All causes (total),17.0,,
3,2014,Alameda,Occurrence,Age,5-14 years,ALL,All causes (total),17.0,,
4,2014,Alameda,Occurrence,Age,15-24 years,ALL,All causes (total),133.0,,


In [11]:
#check the new shape
deaths_sub.shape

(105560, 10)

In [12]:
#filter by cause of death (chronic lower respiratory diseases)
deaths_by_respiratory = deaths_sub.loc[(deaths_sub['Cause'] == 'CLD')]
deaths_by_respiratory.head()

Unnamed: 0,Year,County,Geography_Type,Strata,Strata_Name,Cause,Cause_Desc,Count,Annotation_Code,Annotation_Desc
50,2014,Alameda,Occurrence,Total Population,Total Population,CLD,Chronic lower respiratory diseases,418.0,,
51,2014,Alameda,Occurrence,Gender,Female,CLD,Chronic lower respiratory diseases,219.0,,
52,2014,Alameda,Occurrence,Gender,Male,CLD,Chronic lower respiratory diseases,199.0,,
53,2014,Alameda,Occurrence,Race-Ethnicity,American Indian/Alaska Native,CLD,Chronic lower respiratory diseases,,1.0,Cell suppressed for small numbers
54,2014,Alameda,Occurrence,Race-Ethnicity,Asian,CLD,Chronic lower respiratory diseases,49.0,,


In [13]:
#filter by relevant counties
central_valley = ['Butte', 'Colusa', 'Glenn', 'Fresno', 'Kern', 'Kings', 'Madera',
                  'Merced', 'Placer', 'San Joaquin', 'Sacramento', 'Shasta', 'Solano',
                  'Stanislaus', 'Sutter', 'Tehama', 'Tulare', 'Yolo', 'Yuba']
deaths_by_respiratory_per_county = deaths_by_respiratory[deaths_by_respiratory['County'].isin(central_valley)]

deaths_by_respiratory_per_county.head()

Unnamed: 0,Year,County,Geography_Type,Strata,Strata_Name,Cause,Cause_Desc,Count,Annotation_Code,Annotation_Desc
596,2014,Butte,Occurrence,Total Population,Total Population,CLD,Chronic lower respiratory diseases,163.0,,
597,2014,Butte,Occurrence,Gender,Female,CLD,Chronic lower respiratory diseases,80.0,,
598,2014,Butte,Occurrence,Gender,Male,CLD,Chronic lower respiratory diseases,83.0,,
599,2014,Butte,Occurrence,Race-Ethnicity,American Indian/Alaska Native,CLD,Chronic lower respiratory diseases,,1.0,Cell suppressed for small numbers
600,2014,Butte,Occurrence,Race-Ethnicity,Asian,CLD,Chronic lower respiratory diseases,,1.0,Cell suppressed for small numbers


In [14]:
#drop rows where Count is NaN (e.g. cells suppressed for small numbers)
deaths_by_respiratory_per_county_clean = deaths_by_respiratory_per_county.dropna(subset=["Count"])
deaths_by_respiratory_per_county_clean.head()

Unnamed: 0,Year,County,Geography_Type,Strata,Strata_Name,Cause,Cause_Desc,Count,Annotation_Code,Annotation_Desc
596,2014,Butte,Occurrence,Total Population,Total Population,CLD,Chronic lower respiratory diseases,163.0,,
597,2014,Butte,Occurrence,Gender,Female,CLD,Chronic lower respiratory diseases,80.0,,
598,2014,Butte,Occurrence,Gender,Male,CLD,Chronic lower respiratory diseases,83.0,,
602,2014,Butte,Occurrence,Race-Ethnicity,Hawaiian/Pacific Islander,CLD,Chronic lower respiratory diseases,0.0,,
605,2014,Butte,Occurrence,Race-Ethnicity,White,CLD,Chronic lower respiratory diseases,153.0,,
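
The cleaned table still mixes several strata (total population, gender, race-ethnicity), so summing `Count` directly would count the same deaths more than once. A possible next step, sketched here on a small stand-in frame, is to keep only the `Total Population` rows and reshape them to one row per county. The 2015 value is made up for illustration, and `deaths_demo`, `totals` and `by_county_year` are hypothetical names. If the real file contains more than one `Geography_Type`, a further filter on that column would likely be needed as well.

```python
import pandas as pd

# Stand-in for the cleaned deaths table (assumption); the 2015 count is invented.
deaths_demo = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2015],
    "County": ["Butte", "Butte", "Butte", "Butte"],
    "Strata": ["Total Population", "Gender", "Gender", "Total Population"],
    "Strata_Name": ["Total Population", "Female", "Male", "Total Population"],
    "Count": [163.0, 80.0, 83.0, 150.0],
})

# Keep one total per county and year to avoid double counting across strata.
is_total = deaths_demo["Strata"] == "Total Population"
totals = deaths_demo.loc[is_total, ["Year", "County", "Count"]]

# One row per county, one column per year.
by_county_year = totals.pivot(index="County", columns="Year", values="Count")

print(by_county_year)
```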


# Data Cleaning: Daily AQI

In [17]:
"""
Here we load the AQI data for the year 2014. It is already mostly clean on load, but only because we
had to do a lot of cleaning outside of Python. The source files contain near-daily readings for
every county in the USA, so the nationwide file was quite large. This became a real problem: the file was
over the 25 MB limit for a simple upload to GitHub, and the command line upload did not solve it either.
For some reason the file would upload to GitHub just fine, but it was still too big for Python to handle.
This led us to use Excel to filter out and delete by hand every other state, and then every county outside
of the Central Valley. Only after all that manual deletion was the CSV small enough to load into Python,
and this was just for one year!
"""
aqi_2014 = pd.read_csv("https://raw.githubusercontent.com/COGS108/Group062Sp22/master/aqi_2014_final.csv?token=GHSAT0AAAAAABTTT33MSTZCYCHLVEXFLUCKYTWAPLA")
aqi_2014.head()


Unnamed: 0,State Name,county Name,State Code,County Code,Date,AQI,Category,Defining Parameter,Defining Site,Number of Sites Reporting
0,California,Butte,6,7,1/1/2014,155,Unhealthy,PM2.5,06-007-0008,4
1,California,Butte,6,7,1/2/2014,77,Moderate,PM2.5,06-007-0008,4
2,California,Butte,6,7,1/3/2014,96,Moderate,PM2.5,06-007-0008,4
3,California,Butte,6,7,1/4/2014,90,Moderate,PM2.5,06-007-0008,3
4,California,Butte,6,7,1/5/2014,78,Moderate,PM2.5,06-007-0008,3


In [18]:
#drop any rows with a missing value in the AQI column
aqi_2014 = aqi_2014.dropna(subset=["AQI"])
aqi_2014.head()


Unnamed: 0,State Name,county Name,State Code,County Code,Date,AQI,Category,Defining Parameter,Defining Site,Number of Sites Reporting
0,California,Butte,6,7,1/1/2014,155,Unhealthy,PM2.5,06-007-0008,4
1,California,Butte,6,7,1/2/2014,77,Moderate,PM2.5,06-007-0008,4
2,California,Butte,6,7,1/3/2014,96,Moderate,PM2.5,06-007-0008,4
3,California,Butte,6,7,1/4/2014,90,Moderate,PM2.5,06-007-0008,3
4,California,Butte,6,7,1/5/2014,78,Moderate,PM2.5,06-007-0008,3
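
Since the cancer and deaths tables are yearly or multi-year summaries, the daily AQI readings will probably need to be summarized per county and year before they can be compared. The sketch below shows one way this might be done on a small stand-in frame. The Butte rows mirror the preview above, the Kern row is made up, and the names `aqi_demo` and `annual` and the choice of mean and max as summaries are assumptions.

```python
import pandas as pd

# Stand-in for aqi_2014 (assumption); the Kern row is invented.
aqi_demo = pd.DataFrame({
    "county Name": ["Butte", "Butte", "Butte", "Kern"],
    "Date": ["1/1/2014", "1/2/2014", "1/3/2014", "1/1/2014"],
    "AQI": [155, 77, 96, 120],
})

# Parse month/day/year strings into real dates and pull out the year.
aqi_demo["Date"] = pd.to_datetime(aqi_demo["Date"], format="%m/%d/%Y")
aqi_demo["Year"] = aqi_demo["Date"].dt.year

# One row per county and year with the mean AQI, peak AQI and number of days.
annual = aqi_demo.groupby(["county Name", "Year"], as_index=False).agg(
    mean_aqi=("AQI", "mean"),
    max_aqi=("AQI", "max"),
    days_reported=("AQI", "size"),
)

print(annual)
```

Keeping `days_reported` alongside the averages makes it easier to spot counties whose yearly summary rests on only a few readings.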
