## Raw Ozone Data Acquisition and Processing

This notebook, part one, acquires and processes the hourly raw ozone data from the pre-generated data files published by the [US Environmental Protection Agency (EPA)](https://aqs.epa.gov/aqsweb/airdata/download_files.html#Raw). It automatically requests, downloads, and processes the zipped raw data from the EPA website, then saves the filtered result in CSV format for the statistical analysis presented in part two of this repository. As a demonstration, the script uses a monitoring site in Houston, Texas, United States, because Houston is among the US cities with the worst ozone pollution. The data are filtered to that site using its "State Code", "County Code", and "Site Num" values. After the data from the different years are concatenated, the compact, refined dataset is saved to the local machine as a CSV file. Because each yearly file is filtered to a single site before the years are combined, the approach handles a very large dataset efficiently and at lower computational cost.
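The site filter described above can be sketched on a small in-memory table. This is a minimal illustration only: the helper name `filter_site` and the toy measurements are hypothetical, and only the column names and the Houston codes (48, 201, 24) come from this notebook.

```python
import pandas as pd

def filter_site(df, state_code, county_code, site_num):
    """Return only the rows that belong to one monitoring site."""
    mask = (
        (df["State Code"] == state_code)
        & (df["County Code"] == county_code)
        & (df["Site Num"] == site_num)
    )
    return df.loc[mask].reset_index(drop=True)

# Toy rows with made-up measurements, only to illustrate the filter.
toy = pd.DataFrame({
    "State Code": [48, 48, 24],
    "County Code": [201, 201, 33],
    "Site Num": [24, 26, 30],
    "Sample Measurement": [0.031, 0.027, 0.040],
})

houston = filter_site(toy, state_code=48, county_code=201, site_num=24)
print(houston)
```

All three codes are needed together, because a site number is only meaningful within its county and a county code only within its state.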

### Required packages

In [1]:
import pandas as pd
from zipfile import ZipFile
from urllib.request import urlopen
import io
from pathlib import Path

### Acquiring, filtering, and processing the data

In [2]:
epa_base_dir = 'https://aqs.epa.gov/aqsweb/airdata/hourly_44201_'
epa_dir_list = [f'{epa_base_dir}{i}.zip' for i in range(2005,2020)]

In [3]:
%%time
state_code = 48
county_code = 201
site_num = 24

dfs = []
for myzip in epa_dir_list:
    with urlopen(myzip) as req:
        zip_file = ZipFile(io.BytesIO(req.read()))
        csv_file = f"{Path(myzip).stem}.csv"
        df = pd.read_csv(zip_file.open(csv_file))
        
        # keep only the rows for the selected monitoring site
        dff = df.loc[(df['State Code'] == state_code) & (df["County Code"] == county_code) & (df["Site Num"] == site_num)]
       
        df_filtered = pd.DataFrame(data={
                   "State Code": dff["State Code"].values,
                   "County Code": dff["County Code"].values,
                   "Site Num": dff["Site Num"].values,
                   "ozone_ppm": dff["Sample Measurement"].values,
                   "Date GMT": dff["Date GMT"], 
                   "Time GMT": dff["Time GMT"], 
        })
        
        # combine the date and time columns into one timestamp, then drop the originals
        df_filtered["Date Time GMT"] = pd.DatetimeIndex(df_filtered["Date GMT"] + " " + df_filtered["Time GMT"])
        df_filtered_mod = df_filtered.drop(['Date GMT', 'Time GMT'], axis=1)
        
        # collect the filtered Houston data for this year
        dfs.append(df_filtered_mod)
        
    
        print ('Uncompressing data file ...', csv_file)
        del df, df_filtered, dff, zip_file, df_filtered_mod
        
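# concatenate all years and renumber the rows from zero
df_final = pd.concat(dfs, ignore_index=True)
# make sure the output directory exists before writing
Path("./data").mkdir(parents=True, exist_ok=True)
df_final.to_csv("./data/Houston.csv", index=False, encoding='utf-8-sig')
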
Uncompressing data file ... hourly_44201_2005.csv
Uncompressing data file ... hourly_44201_2006.csv
Uncompressing data file ... hourly_44201_2007.csv
Uncompressing data file ... hourly_44201_2008.csv
Uncompressing data file ... hourly_44201_2009.csv
Uncompressing data file ... hourly_44201_2010.csv
Uncompressing data file ... hourly_44201_2011.csv
Uncompressing data file ... hourly_44201_2012.csv
Uncompressing data file ... hourly_44201_2013.csv
Uncompressing data file ... hourly_44201_2014.csv
Uncompressing data file ... hourly_44201_2015.csv
Uncompressing data file ... hourly_44201_2016.csv
Uncompressing data file ... hourly_44201_2017.csv
Uncompressing data file ... hourly_44201_2018.csv
Uncompressing data file ... hourly_44201_2019.csv
CPU times: user 4min 34s, sys: 29.3 s, total: 5min 4s
Wall time: 15min 7s
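
Each yearly file is parsed in full before it is filtered, so one possible way to lower the memory footprint is to read only the needed columns and to filter chunk by chunk. The sketch below is an assumption-laden variant, not the method used above: the helper `read_site_from_zip`, the `KEEP` list, and the toy archive are hypothetical, and the effect on run time has not been measured here (the download itself may well dominate).

```python
import io
from zipfile import ZipFile

import pandas as pd

# Columns the notebook actually keeps; all others are skipped at parse time.
KEEP = ["State Code", "County Code", "Site Num",
        "Date GMT", "Time GMT", "Sample Measurement"]

def read_site_from_zip(zip_bytes, csv_name, state_code, county_code, site_num,
                       chunksize=100_000):
    """Read one zipped CSV in chunks and keep only the rows for one site.

    Assumes the three code columns parse as integers, as in the loop above.
    """
    parts = []
    with ZipFile(io.BytesIO(zip_bytes)) as zf, zf.open(csv_name) as fh:
        for chunk in pd.read_csv(fh, usecols=KEEP, chunksize=chunksize):
            mask = (
                (chunk["State Code"] == state_code)
                & (chunk["County Code"] == county_code)
                & (chunk["Site Num"] == site_num)
            )
            parts.append(chunk.loc[mask])
    return pd.concat(parts, ignore_index=True)

# Build a tiny zip archive in memory so the sketch runs without network access.
toy_csv = (
    "State Code,County Code,Site Num,Date GMT,Time GMT,Sample Measurement,Extra\n"
    "48,201,24,2005-01-01,06:00,0.031,x\n"
    "48,201,26,2005-01-01,06:00,0.027,y\n"
)
buf = io.BytesIO()
with ZipFile(buf, "w") as zf:
    zf.writestr("toy.csv", toy_csv)

site = read_site_from_zip(buf.getvalue(), "toy.csv", 48, 201, 24)
print(site)
```

In the real loop, `zip_bytes` would be `req.read()` and `csv_name` the `csv_file` variable. The made-up `Extra` column stands in for the columns of the EPA files that the notebook never uses.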
