# Air Quality in Houston - Step 1: Data Wrangling#

## 1. Aim of the project and origin of the data##

### 1.1. The Problem:###
What is the future of indoor and outdoor air quality in Houston and its impact on Houstonians’ health?

Air quality has been a concern for Houston’s officials and population for several years. Houston’s legendary around-the-clock traffic jams, its growing population, its humid subtropical climate, and the sprawling potpourri of pollutants released by refineries and chemical plants have made Space City’s air hard to breathe for many Houstonians. In 2018, the Mayor's Task Force on the Health Effects of Air Pollution identified 12 pollutants as definite health risks for Houstonians, the main one being ozone.

In 2020, the plastics industry is growing rapidly, freeways are being widened to carry more traffic, and more people are moving in. Houston is currently the 5th largest metropolitan area in the US by population, with 6,997,384 inhabitants, and is predicted to host 8.7 million inhabitants by 2028 according to the Texas Demographic Center.

Where is the air quality headed?

The aim of this capstone project is to predict the indoor and outdoor air quality in Houston for each upcoming decade up to 2050 using daily air data summaries, known potential drivers of air quality, and the city development forecast (i.e. population growth, change in land use...) from the Houston-Galveston Area Council. The impact of air quality on health will be presented by overlaying the calculated AQI (Air Quality Index) data with the ESLs (Effects Screening Levels). The analysis focuses on 6 pollutants of concern for Houston, namely ozone (O3), sulfur dioxide (SO2), carbon monoxide (CO), nitrogen dioxide (NO2), and particulate matter (PM2.5 and PM10).


### 1.2. The Data:###
All the datasets used in this capstone were available online between 09/01/2020 and 09/15/2020 from the following websites:

- Relationships of Indoor, Outdoor, and Personal Air (RIOPA) dataset:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/7UBE7P&version=1.0

- Daily pollutant concentration measurements for Houston and effects screening levels from the Texas Commission on Environmental Quality (TCEQ): https://www.tceq.texas.gov/

- Daily pollutant concentration measurements for Houston from the U.S. Environmental Protection Agency (EPA): https://www.epa.gov/

- Weather daily summaries: https://www.noaa.gov/

- Land use data, population data and forecast from the Houston-Galveston Area Council: http://www.h-gac.com/home/default.aspx

- Road surface data: https://www.txdot.gov/inside-txdot/division/transportation-planning/roadway-inventory.html

- Traffic count, road size data: https://cohgis-mycity.opendata.arcgis.com/datasets/trafficcounts-opendata-wm/data


Section 1: Data Collection

- Goal: Organize your data to streamline the next steps of your capstone.
- Time estimate: 1-2 hours
    - Data loading
    - Data joining
    - Hint: Data Collection will require the use of the pandas library, and functions like read_csv(), depending on the type of data you want to read in!
    - Hint: when adding one dataset to another, make sure you use the right function: you might want to

Data Organization

- Goal: Create a file structure and add your work to the GitHub repository you’ve created for this project.
- Time estimate: 1-2 hours
    - File structure
    - GitHub
    - Hint: the glob library could come in handy here…
    - Remind yourself of why GitHub is useful. What are the main motivations for making a GitHub repository?

Data Definition

- Goal: Gain an understanding of your data features to inform the next steps of your project.
- Time estimate: 1-2 hours
    - Column names
    - Data types
    - Description of the columns
    - Counts and percentages of unique values
    - Ranges of values

Hint: here are some useful questions to ask yourself during this process:

- Do your column names correspond to what those columns store?
- Check the data types of your columns. Are they sensible?
- Calculate summary statistics for each of your columns, such as mean, median, mode, standard deviation, range, and number of unique values. What does this tell you about your data? What do you now need to investigate?
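The data definition checks listed above can be gathered into one small helper. The following is a minimal sketch, not the project's actual code: the function name `describe_columns` and the sample columns `site_id` and `ozone_ppb` are made up for illustration.

```python
import pandas as pd


def describe_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, unique count/percent, min, max."""
    numeric = df.select_dtypes(include="number")
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a value
        "pct_unique": (df.nunique() / len(df) * 100).round(1),
    })
    # Ranges only make sense for numeric columns; others are left as NaN.
    summary["min"] = numeric.min()
    summary["max"] = numeric.max()
    return summary


# Made-up sample standing in for a real monitoring table.
sample = pd.DataFrame({
    "site_id": ["A", "A", "B", "B"],
    "ozone_ppb": [31.0, 45.0, None, 45.0],
})
summary = describe_columns(sample)
print(summary)
```

Running the same helper on every DataFrame produced during data collection gives a consistent table to compare against the column descriptions.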


Data Cleaning

- Goal: Clean up the data in order to prepare it for the next steps of your project.
- Time estimate: 1-2 hours
    - NA or missing values
    - Duplicates

Hint: don’t forget about the following awesome Python functions for data cleaning, which make life a whole lot easier:

- loc[] - filter your data by label
- iloc[] - filter your data by integer position
- apply() - execute a function across an axis of a DataFrame
- drop() - drop rows or columns from a DataFrame
- is_unique - check if a column is a unique identifier (a Series attribute, so it takes no parentheses)
- Series methods, such as str.contains(), which can be used to check if a certain substring occurs in a string of a Series, and str.extract(), which can be used to extract capture groups with a certain regex (or regular expression) pattern
- NumPy functions like np.where(), to clean columns. Recall that such functions have the structure: np.where(condition, then, else)
- DataFrame methods to check for null values, such as df.isnull().values.any()
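Several of the helpers listed above can be combined in one short cleaning pass. This is a minimal sketch on made-up data: the column names and the `-999` missing-value sentinel are assumptions for illustration, not properties of the project's datasets.

```python
import numpy as np
import pandas as pd

# Made-up sample with a duplicated row and a sentinel for missing readings.
raw = pd.DataFrame({
    "station": ["Site 101", "Site 102", "Site 102", "Site 103"],
    "date": ["2020-09-01", "2020-09-01", "2020-09-01", "2020-09-02"],
    "so2_ppb": [1.2, -999.0, -999.0, 3.4],
})

# str.extract pulls the regex capture group (the digits) into a new column.
raw["station_id"] = raw["station"].str.extract(r"(\d+)", expand=False).astype(int)

# np.where(condition, then, else): swap the assumed sentinel for NaN.
raw["so2_ppb"] = np.where(raw["so2_ppb"] == -999.0, np.nan, raw["so2_ppb"])

# Remove exact duplicate rows, then drop the now redundant text column.
clean = raw.drop_duplicates().drop(columns=["station"]).reset_index(drop=True)

has_nulls = clean.isnull().values.any()          # any missing value left?
ids_are_unique = clean["station_id"].is_unique   # attribute, not a method
print(clean)
```

Here the duplicate row disappears, the sentinel becomes a proper missing value, and `is_unique` confirms that `station_id` can serve as an identifier for this sample.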

## 2. Data Collection##
### 2.1. Indoor/Outdoor Air Quality:###
In this section, the relevant CSV files from the RIOPA dataset are loaded and merged.
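As a sketch of the merge step, the example below joins a measurements table to a table of home attributes on a shared key. The tables, the key `home_id`, and every other column name are hypothetical stand-ins; the actual RIOPA files will have different names, and in practice each table would come from `pd.read_csv`.

```python
import pandas as pd

# Hypothetical stand-ins for two RIOPA tables (real column names will differ).
homes = pd.DataFrame({
    "home_id": ["TX001", "TX002", "TX003"],
    "city": ["Houston", "Houston", "Houston"],
})
measurements = pd.DataFrame({
    "home_id": ["TX001", "TX001", "TX002"],
    "location": ["indoor", "outdoor", "indoor"],
    "pm25_ugm3": [14.2, 11.8, 9.5],
})

# Left join keeps every measurement; validate raises if a home_id is
# repeated in `homes`; indicator adds a `_merge` column for auditing.
merged = measurements.merge(
    homes,
    on="home_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Measurements with no matching home would show up here.
unmatched = merged[merged["_merge"] != "both"]
print(merged)
```

Checking `unmatched` right after the join is a cheap way to catch key mismatches before they propagate into later steps.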

### 2.2. Weather Daily Summaries:###
In this section, the six CSV files containing the daily weather summaries are loaded and merged.
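Loading several files with the same layout can be sketched with `glob` and `pd.concat`. The helper name `load_daily_summaries`, the file names, and the `DATE` and `TMAX` columns are assumptions for illustration; the demonstration writes two tiny made-up files to a temporary folder so that it runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd


def load_daily_summaries(folder: str, pattern: str = "*.csv") -> pd.DataFrame:
    """Read every CSV matching `pattern` in `folder` and stack them row-wise."""
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder!r}")
    frames = [
        # Assumes each file has a DATE column; source_file records provenance.
        pd.read_csv(path, parse_dates=["DATE"]).assign(
            source_file=os.path.basename(path)
        )
        for path in paths
    ]
    return pd.concat(frames, ignore_index=True)


# Demonstration with two made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"DATE": ["2020-01-01", "2020-01-02"], "TMAX": [65, 70]}).to_csv(
        os.path.join(tmp, "weather_a.csv"), index=False
    )
    pd.DataFrame({"DATE": ["2020-01-03"], "TMAX": [68]}).to_csv(
        os.path.join(tmp, "weather_b.csv"), index=False
    )
    weather = load_daily_summaries(tmp)

print(weather)
```

Keeping the source file name in a column makes it easier to trace a suspicious row back to the file it came from.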

### 2.3. Road and Population Data:###
Both the past and the forecast data for population growth and road surface are annual.

### 2.4. Effects Screening Level:###
The


### 2.5. Outdoor Air Quality:###
The EPA and TCEQ datasets are first kept separate.
They are then merged, and any duplicates are removed.
The final DataFrame needs to include the monitor position and the calculated AQI.
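The AQI for a pollutant is obtained by linear interpolation between concentration breakpoints: AQI = (I_hi - I_lo) / (C_hi - C_lo) * (C - C_lo) + I_lo. The sketch below applies this to 24-hour PM2.5 only. The breakpoint table is the one the EPA published before its 2024 revision, which is an assumption about what applies to this project and should be checked against the EPA's technical documentation; the function name `pm25_aqi` is made up for illustration.

```python
import math
from decimal import ROUND_DOWN, Decimal

# Assumed 24-hour PM2.5 breakpoints (ug/m3) from before the EPA's 2024 revision:
# (C_lo, C_hi, I_lo, I_hi)
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def pm25_aqi(concentration):
    """Return the AQI for a 24-hour PM2.5 concentration, or None if out of range."""
    if concentration is None:
        return None
    value = float(concentration)
    if math.isnan(value) or value < 0:
        return None
    # PM2.5 is truncated (not rounded) to one decimal before the lookup;
    # Decimal avoids floating-point surprises in the truncation.
    c = float(Decimal(str(value)).quantize(Decimal("0.1"), rounding=ROUND_DOWN))
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= c <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo)
    return None  # above the highest breakpoint


for reading in (0.0, 12.0, 20.0, 600.0):
    print(reading, pm25_aqi(reading))
```

With pandas, the function can be applied to a concentration column with `Series.apply`. The other pollutants follow the same interpolation with their own breakpoint tables and averaging periods.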

### 2.6. Monitoring Stations:###
This section prepares a DataFrame that stores the latitude and longitude of the monitoring stations together with a land use attribute.


## 3. Data Organization: File Structure & GitHub Location##

- folder x: the original datasets are stored in this folder
- folder x: all DataFrames created in section 2 are saved in this folder (pre-cleaning)
- folder x: all the ready-to-go DataFrames (post-cleaning) are stored in this folder
- folder x: all images and maps are stored in this folder
- folder x: all functions that will be reused (e.g. the AQI function) are stored in this folder
- folder x: the data wrangling notebook can be found in this folder

The same naming syntax will be used for EDA (02), modelling (03)...



## 4. Data Definition:##
mapping lives here

## 5. Data Cleaning:##