# COGS 108 - Data Checkpoint

## Names

- Ava Hamedi
- Marc Mendoza
- Jonathan Park
- Daniel Renteria
- Siena Rivera

<a id='research_question'></a>
# Research Question

Did California air quality improve significantly during 2011-2022 as a result of the COVID-19 pandemic and the associated change in car traffic volume?

# Dataset(s)

### Main Datasets

- **Annual Air Quality by County**
  - **Name:** annual_aqi_by_county_YEAR.csv
  - [Link to the Dataset](https://aqs.epa.gov/aqsweb/airdata/download_files.html#Annual)
  - **Description:** This dataset contains annual summaries of air quality by county, from before 2000 to the present. All data use the AQI (Air Quality Index) where applicable. The AQI is calculated from the average concentration of a particular pollutant measured over a standard time interval (24 hours for most pollutants, 8 hours for carbon monoxide and ozone). We will use this dataset to show the trend in air quality from 2007 to the present, so that we can see how AQI changed in the years before and during COVID-19.
  - **Source:** EPA: AirData<br>
<br>
- **Annual Average Daily Traffic (AADT) Volumes (1)**
  - **Name:** traffic_volumes_YEAR.csv
  - [Link to the Dataset](https://dot.ca.gov/programs/traffic-operations/census)
  - **Description:** In addition to air quality, our team wanted to examine the relationship between air quality and vehicular traffic. This dataset contains Annual Average Daily Traffic volumes for 2017 through 2022. It holds the same kind of data as the dataset listed below, which covers the earlier years. We will use it to estimate the average annual daily traffic across California. Although it spans fewer years, it gives us a narrower scope centered on the period around COVID-19. In addition to gathering and cleaning this dataset, we will compute the average (mean) of the back and ahead traffic columns.
  - **Source:** California Department of Transportation (Caltrans)<br>
  <br>
- **Annual Average Daily Traffic (AADT) Volumes (2)**
  - **Name:** traffic_volumes_YEAR.csv
  - [Link to the Dataset](https://data.ca.gov/dataset/fb62fd37-38e5-40a5-89d1-cb58ae12f244/resource/0f021a84-ded0-488c-93b5-f331759ad9fd/download/aadt_2007-2017_shapefiles.gdb.zip)
  - **Description:** This dataset contains Annual Average Daily Traffic volumes for 2007 through 2017. It holds the same kind of data as the dataset listed above, but for earlier years. We will use it to estimate the average annual daily traffic across California. With more than 10 years of data, it gives us a wider scope for establishing the pre-pandemic trend. In addition to gathering and cleaning this dataset, we will compute the average (mean) of the back and ahead traffic columns.
  - **Source:** California Department of Transportation (Caltrans), via the California Open Data Portal<br>
 

**Combining Data**: The datasets we have chosen will be combined into one larger dataset for California and then analyzed to measure the overall relationship between traffic and air quality. From there, we will look at the effect of the COVID-19 pandemic, analyzing whether changes in air quality during COVID-19 can be traced to a change in vehicular traffic. To answer our research question, our first priority is getting all of our data cleaned and ready for analysis.
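
The combining step can be sketched as a join on county and year. This is only a minimal illustration under assumptions: the rows are made-up toy values, `traffic_by_county` and `Mean_AADT` are hypothetical names, and it assumes the traffic data carries a `Year` column, which the raw traffic file does not.

```python
import pandas as pd

# Made-up toy rows; the real frames come from the EPA and traffic files.
aqi = pd.DataFrame({
    'County': ['Alameda', 'Alameda', 'Orange'],
    'Year': [2019, 2020, 2020],
    'Median AQI': [43, 40, 52],
})
# Hypothetical county-level traffic summary with an assumed Year column.
traffic_by_county = pd.DataFrame({
    'County': ['Alameda', 'Orange', 'Orange'],
    'Year': [2020, 2019, 2020],
    'Mean_AADT': [61000.0, 40500.0, 37500.0],
})

# An inner join keeps only the county-years present in both sources.
combined = aqi.merge(traffic_by_county, on=['County', 'Year'], how='inner')
print(combined)
```

An inner join silently drops county-years that appear in only one source, so comparing row counts before and after the merge is a cheap sanity check.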

# Setup

In [1]:
# Install the required packages first (e.g. pip install pandas numpy seaborn matplotlib),
# then import them
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt  # plt conventionally refers to the pyplot submodule, not the top-level package


# Suppress warnings so they do not distract from the output that matters
import warnings
warnings.filterwarnings('ignore')

# Data Cleaning

## Table of Contents

- Air Quality (AQI) Data BEFORE (<2020) and DURING (2020-2022) COVID-19 
- Traffic Volumes BEFORE (<2020) and DURING (2020-2022) COVID-19 


## Air Quality (AQI) Data

### I. Annual Air Quality (AQI) by County
First we will focus on the Air Quality Index (AQI) data published by the EPA. The source provides annual air quality summaries from before 2000 to the present, and the files loaded below span 2010-2021. We only want to focus on 2011 onward, so we will remove data that does not fit into our timeline. All data use the AQI where applicable. The AQI is calculated from the average concentration of a particular pollutant measured over a standard time interval (24 hours for most pollutants, 8 hours for carbon monoxide and ozone). We will use this dataset to show how AQI changed in the years before and during COVID-19.
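
As a rough sketch of the timeline filter described above, the rows can be restricted to California and to the study window in one step. The rows are made-up toy values, and the bounds `START_YEAR` and `END_YEAR` are assumptions taken from the research question rather than from the cleaning code.

```python
import pandas as pd

# Made-up rows shaped like the EPA annual AQI files.
aqi_all = pd.DataFrame({
    'State': ['California', 'California', 'Nevada', 'California'],
    'County': ['Alameda', 'Alameda', 'Clark', 'Butte'],
    'Year': [2010, 2011, 2015, 2021],
    'Median AQI': [43, 41, 50, 47],
})

# Assumed bounds, taken from the research question.
START_YEAR, END_YEAR = 2011, 2022

in_window = aqi_all['Year'].between(START_YEAR, END_YEAR)  # inclusive on both ends
in_state = aqi_all['State'] == 'California'

# Boolean indexing returns a new frame; reset_index gives it a clean 0..n-1 index.
aqi_ca = aqi_all[in_window & in_state].reset_index(drop=True)
print(aqi_ca)
```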


## Traffic Data

### II. Annual Daily Traffic Data 
In addition to air quality, our team wanted to examine the relationship between air quality and vehicular traffic. The chosen datasets contain Annual Average Daily Traffic volumes for 2007 through 2022. We will use them to estimate the average annual daily traffic across California. Because we only want a narrower area and scope, we will clean the datasets down to the columns we need. This also lets us see how annual daily traffic changed in the years before and during COVID-19. To present the data as intuitively as possible, we will group the rows and take the average (mean) of the back and ahead traffic columns.
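
The grouping and averaging step could look something like the sketch below. It assumes the columns being averaged are `Back_AADT` and `Ahead_AADT` and that rows are grouped by `County`; the rows are toy values and `Mean_AADT` is a hypothetical column name.

```python
import pandas as pd

# Made-up rows using column names from the traffic file.
traffic_toy = pd.DataFrame({
    'County': ['Orange', 'Orange', 'Alameda'],
    'Back_AADT': [37500.0, 39500.0, 60000.0],
    'Ahead_AADT': [39500.0, 29500.0, 62000.0],
})

# Average the two directions for each count location (row-wise mean).
traffic_toy['Mean_AADT'] = traffic_toy[['Back_AADT', 'Ahead_AADT']].mean(axis=1)

# Then average across all count locations within each county.
county_aadt = traffic_toy.groupby('County', as_index=False)['Mean_AADT'].mean()
print(county_aadt)
```

This weights every count location equally within a county, which is a simplification; counties with many low-volume locations will show a lower mean than a volume-weighted measure would.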


In [2]:
# paths to the data files
# one EPA file per year, 2010 through 2021 (range excludes its upper bound)
aqi_county_path = [f'datasets/annual_aqi_by_county_{year}.csv'
                   for year in range(2010, 2022)]

traffic_path = ['datasets/2007_Traffic.csv']

In [3]:
#parsing data
aqi_county_frames = []
for path in aqi_county_path:
    aqi_county_frames.append(pd.read_csv(path))
aqi_county = pd.concat(aqi_county_frames, ignore_index = True)

traffic_frames = []
for path in traffic_path:
    traffic_frames.append(pd.read_csv(path))
traffic = pd.concat(traffic_frames, ignore_index= True)

In [4]:
traffic.columns

Index(['District', 'Route', 'County', 'Postmile', 'Description',
       'Back_Peak_Hour', 'Back_Peak_Month', 'Back_AADT', 'Ahead_Peak_Hour',
       'Ahead_Peak_Month', 'Ahead_AADT'],
      dtype='object')
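
The traffic columns listed above do not include a year, so once more yearly files are added to `traffic_path`, the concatenated rows could no longer be told apart by year. One possible approach, sketched here under the assumption that every file name contains its four-digit year (as `2007_Traffic.csv` does), is to tag each frame before concatenating. The helpers `year_from_path` and `tag_with_year` are hypothetical names, not part of the notebook.

```python
import re
import pandas as pd

def year_from_path(path):
    """Return the first four-digit year (19xx or 20xx) found in a file path."""
    match = re.search(r'(19|20)\d{2}', path)
    if match is None:
        raise ValueError(f'no year found in {path!r}')
    return int(match.group(0))

def tag_with_year(frame, path):
    """Return a copy of frame with a Year column derived from the file name."""
    return frame.assign(Year=year_from_path(path))

# Made-up single-row frame standing in for pd.read_csv(path).
toy = pd.DataFrame({'County': ['ORA'], 'Back_AADT': [37500.0]})
tagged = tag_with_year(toy, 'datasets/2007_Traffic.csv')
print(tagged)
```

In the loading loop this would amount to appending `tag_with_year(pd.read_csv(path), path)` instead of the bare frame.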

In [5]:
#columns to remove from files
aqi_county_bad = ['Days with AQI', 'Good Days', 'Moderate Days', 'Unhealthy for Sensitive Groups Days',
                  'Unhealthy Days', 'Very Unhealthy Days', 'Hazardous Days', 'Days CO', 'Days NO2',
                  'Days Ozone', 'Days SO2', 'Days PM2.5', 'Days PM10']

traffic_bad = ['District', 'Route', 'Postmile', 'Description']

In [6]:
#dropping columns
aqi_county.drop(aqi_county_bad, axis = 1, inplace = True)
traffic.drop(traffic_bad, axis = 1, inplace = True)

In [7]:
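#dropping rows
# keep only California; .copy() makes an independent frame so later in-place edits do not act on a slice
aqi_county_CA = aqi_county[aqi_county['State']=='California'].copy()
# drop traffic rows that have any missing value
traffic.dropna(axis=0, inplace= True)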

In [8]:
#resetting index
aqi_county_CA.reset_index(drop=True, inplace = True)

In [9]:
aqi_county_CA.head()

| | State | County | Year | Max AQI | 90th Percentile AQI | Median AQI |
|---|---|---|---|---|---|---|
| 0 | California | Alameda | 2010 | 179 | 68 | 43 |
| 1 | California | Amador | 2010 | 151 | 64 | 35 |
| 2 | California | Butte | 2010 | 126 | 84 | 47 |
| 3 | California | Calaveras | 2010 | 154 | 84 | 41 |
| 4 | California | Colusa | 2010 | 119 | 49 | 38 |


In [10]:
#renaming row values: replace county abbreviations with full county names
full_name = {'ALA':'Alameda','ALP':'Alpine','AMA':'Amador','BUT':'Butte', 'CAL':'Calaveras', 'CC':'Contra Costa','COL':'Colusa','DN':'Del Norte','ED':'El Dorado','FRE':'Fresno','GLE':'Glenn','HUM':'Humboldt',
            'IMP':'Imperial','INY':'Inyo','KER':'Kern','KIN':'Kings','LA':'Los Angeles','LAK':'Lake','LAS':'Lassen','MAD':'Madera','MEN':'Mendocino','MER':'Merced','MNO':'Mono','MOD':'Modoc','MON':'Monterey',
            'MPA':'Mariposa','MRN':'Marin','NAP':'Napa','NEV':'Nevada','ORA':'Orange','PLA':'Placer','PLU':'Plumas','RIV':'Riverside','SAC':'Sacramento','SB':'Santa Barbara','SBD':'San Bernardino',
            'SBT':'San Benito','SCL':'Santa Clara','SCR':'Santa Cruz','SD':'San Diego', 'SF': 'San Francisco','SHA':'Shasta','SIE':'Sierra','SIS':'Siskiyou', 'SJ':'San Joaquin', 'SLO': 'San Luis Obispo',
            'SM':'San Mateo','SOL':'Solano','SON':'Sonoma','STA':'Stanislaus','SUT':'Sutter','TEH':'Tehama','TRI':'Trinity','TUL':'Tulare','TUO':'Tuolumne','VEN':'Ventura','YOL':'Yolo','YUB':'Yuba'}
traffic['County'] = traffic['County'].replace(full_name)
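
Because `replace` leaves any value that is not a key in the dictionary unchanged, a quick check for leftover abbreviations can catch codes the mapping misses. The sketch below uses a hypothetical three-entry subset of the mapping (`code_to_name`) and made-up rows, including a deliberately unknown code `XYZ`.

```python
import pandas as pd

# Hypothetical subset of the abbreviation map and made-up rows.
code_to_name = {'ORA': 'Orange', 'LA': 'Los Angeles', 'SD': 'San Diego'}
toy = pd.DataFrame({'County': ['ORA', 'LA', 'XYZ']})

# Exact-match replacement: values that are not dictionary keys pass through unchanged.
toy['County'] = toy['County'].replace(code_to_name)

# Anything that is not a known full name was never mapped.
known_names = set(code_to_name.values())
unmapped = sorted(set(toy['County']) - known_names)
print(unmapped)
```

An empty list here would mean every county value was translated; any remaining entries would need to be added to the mapping or investigated before merging with the AQI data on `County`.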


In [11]:
traffic.head()

| | County | Back_Peak_Hour | Back_Peak_Month | Back_AADT | Ahead_Peak_Hour | Ahead_Peak_Month | Ahead_AADT |
|---|---|---|---|---|---|---|---|
| 1 | Orange | 3800.0 | 40500.0 | 37500.0 | 4000 | 43000 | 39500 |
| 2 | Orange | 4000.0 | 43000.0 | 39500.0 | 2350 | 32000 | 29500 |
| 3 | Orange | 2350.0 | 32000.0 | 29500.0 | 2900 | 39000 | 36500 |
| 4 | Orange | 2900.0 | 39000.0 | 36500.0 | 2900 | 39000 | 36500 |
| 5 | Orange | 2900.0 | 39000.0 | 36500.0 | 3650 | 43500 | 40500 |
