# **ETL PROCESS**

---
---

## Description

This notebook performs an ETL (Extract, Transform, Load) process on US air pollution data from 2000 to 2016. It imports the raw CSV, cleans the data by removing missing values and unnecessary columns, renames columns for clarity, and prepares a sample for analysis. The final processed data is saved for further use in analytics or machine learning.

## Objectives

* Fetch the data, save it as a raw data file, and upload it to the workspace.
* Take the data through the ETL process to clean it.

## Inputs

* Raw CSV data file.

## Outputs

* This notebook generates a cleaned CSV file of the data.

## Additional Comments

* This dataset was sourced from Kaggle and contains air pollution data for the US from 2000 to 2016.

---

## Change Working Directory

We assume the notebooks are stored in a subfolder, so when running the notebook in the editor you need to change the working directory. When you restart the kernel (and clear outputs, if necessary), make sure these 3 cells run in order.

We need to change the working directory from its current folder to its parent folder:
* We access the current directory with *os.getcwd()*.
* We store it in the variable *current_dir* and display it to confirm.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\F_bee\\Documents\\vs-code\\vs-code-projects\\github\\air-quality-dashboard\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory:
* *os.path.dirname()* gets the parent directory.
* *os.chdir()* sets the new current directory.

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory.

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\F_bee\\Documents\\vs-code\\vs-code-projects\\github\\air-quality-dashboard'
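The three cells above move up one folder each time cell 2 runs, which is why they must run exactly once and in order. As a minimal sketch, assuming the notebooks live in a subfolder named `jupyter_notebooks` (as in the path shown above), the step can be made safe to re-run. The helper name `resolve_project_root` and the example paths are hypothetical.

```python
from pathlib import Path


def resolve_project_root(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of cwd if cwd is the notebooks subfolder, else cwd unchanged."""
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Made-up example paths; in the notebook you would pass Path.cwd()
# and then call os.chdir() on the returned path.
print(resolve_project_root(Path("/home/user/air-quality-dashboard/jupyter_notebooks")))
print(resolve_project_root(Path("/home/user/air-quality-dashboard")))
```

Because the function only moves up when the current folder is the notebooks subfolder, running it a second time leaves the working directory where it is.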

---
---

## Section 1

Extract and read the data.

In [4]:
# Import all necessary packages
import numpy as np
import pandas as pd

print("All packages imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

All packages imported successfully!
NumPy version: 2.3.0
Pandas version: 2.3.0


In [5]:
# Read the zipped CSV into a DataFrame and print its shape to confirm the load worked
# Display the DataFrame as the cell output
df = pd.read_csv("inputs/pollution_us_2000_2016.zip", compression="zip")
print(f"DataFrame shape: {df.shape}")
print("Data loaded successfully!")
df

DataFrame shape: (1746661, 29)
Data loaded successfully!


Unnamed: 0.1,Unnamed: 0,State Code,County Code,Site Num,Address,State,County,City,Date Local,NO2 Units,...,SO2 Units,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Units,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI
0,0,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,3.000000,9.0,21,13.0,Parts per million,1.145833,4.200,21,
1,1,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,3.000000,9.0,21,13.0,Parts per million,0.878947,2.200,23,25.0
2,2,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,2.975000,6.6,23,,Parts per million,1.145833,4.200,21,
3,3,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,2.975000,6.6,23,,Parts per million,0.878947,2.200,23,25.0
4,4,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-02,Parts per billion,...,Parts per billion,1.958333,3.0,22,4.0,Parts per million,0.850000,1.600,23,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1746656,24599,56,21,100,NCore - North Cheyenne Soccer Complex,Wyoming,Laramie,Not in a city,2016-03-30,Parts per billion,...,Parts per billion,0.000000,0.0,2,,Parts per million,0.091667,0.100,2,1.0
1746657,24600,56,21,100,NCore - North Cheyenne Soccer Complex,Wyoming,Laramie,Not in a city,2016-03-31,Parts per billion,...,Parts per billion,-0.022727,0.0,0,0.0,Parts per million,0.067714,0.127,0,
1746658,24601,56,21,100,NCore - North Cheyenne Soccer Complex,Wyoming,Laramie,Not in a city,2016-03-31,Parts per billion,...,Parts per billion,-0.022727,0.0,0,0.0,Parts per million,0.100000,0.100,0,1.0
1746659,24602,56,21,100,NCore - North Cheyenne Soccer Complex,Wyoming,Laramie,Not in a city,2016-03-31,Parts per billion,...,Parts per billion,0.000000,0.0,5,,Parts per million,0.067714,0.127,0,
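The `Unnamed: 0` column in the output above is what pandas produces when a CSV's first column has no header, typically because the file was saved with its index. The sketch below, on a made-up two-row file (the file name `toy_pollution.zip` and its values are assumptions for illustration), shows the default read next to a read with `index_col=0`, which is one possible way to avoid the extra column at load time.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny zipped CSV that mimics a file saved with its index column
csv_text = ",State,NO2 Mean\n0,Arizona,19.04\n1,Arizona,22.96\n"
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "toy_pollution.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("toy_pollution.csv", csv_text)

# Default read: the unnamed first column becomes "Unnamed: 0"
raw = pd.read_csv(zip_path, compression="zip")
print(raw.columns.tolist())

# index_col=0 uses that column as the index instead
indexed = pd.read_csv(zip_path, compression="zip", index_col=0)
print(indexed.columns.tolist())
```

Note that `compression="zip"` expects the archive to contain a single data file.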


In [6]:
# Preview the first five rows and list the column data types
# Note: only the df.info() output appears below; df.head() is not the last expression in the cell, so its result is not displayed
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1746661 entries, 0 to 1746660
Data columns (total 29 columns):
 #   Column             Dtype  
---  ------             -----  
 0   Unnamed: 0         int64  
 1   State Code         int64  
 2   County Code        int64  
 3   Site Num           int64  
 4   Address            object 
 5   State              object 
 6   County             object 
 7   City               object 
 8   Date Local         object 
 9   NO2 Units          object 
 10  NO2 Mean           float64
 11  NO2 1st Max Value  float64
 12  NO2 1st Max Hour   int64  
 13  NO2 AQI            int64  
 14  O3 Units           object 
 15  O3 Mean            float64
 16  O3 1st Max Value   float64
 17  O3 1st Max Hour    int64  
 18  O3 AQI             int64  
 19  SO2 Units          object 
 20  SO2 Mean           float64
 21  SO2 1st Max Value  float64
 22  SO2 1st Max Hour   int64  
 23  SO2 AQI            float64
 24  CO Units           object 
 25  CO Mean           

---

## Section 2

Clean the data.

In [7]:
# Preview the first rows of the DataFrame loaded in Section 1
print("Data loaded successfully!")
df.head()

Data loaded successfully!


Unnamed: 0.1,Unnamed: 0,State Code,County Code,Site Num,Address,State,County,City,Date Local,NO2 Units,...,SO2 Units,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Units,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI
0,0,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,3.0,9.0,21,13.0,Parts per million,1.145833,4.2,21,
1,1,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,3.0,9.0,21,13.0,Parts per million,0.878947,2.2,23,25.0
2,2,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,2.975,6.6,23,,Parts per million,1.145833,4.2,21,
3,3,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,...,Parts per billion,2.975,6.6,23,,Parts per million,0.878947,2.2,23,25.0
4,4,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-02,Parts per billion,...,Parts per billion,1.958333,3.0,22,4.0,Parts per million,0.85,1.6,23,


In [8]:
# Print column names and their data types for df
print("Column names:")
print(df.columns.tolist())
print(f"DataFrame shape: {df.shape}")
print(df.dtypes)

Column names:
['Unnamed: 0', 'State Code', 'County Code', 'Site Num', 'Address', 'State', 'County', 'City', 'Date Local', 'NO2 Units', 'NO2 Mean', 'NO2 1st Max Value', 'NO2 1st Max Hour', 'NO2 AQI', 'O3 Units', 'O3 Mean', 'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'SO2 Units', 'SO2 Mean', 'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI', 'CO Units', 'CO Mean', 'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI']
DataFrame shape: (1746661, 29)
Unnamed: 0             int64
State Code             int64
County Code            int64
Site Num               int64
Address               object
State                 object
County                object
City                  object
Date Local            object
NO2 Units             object
NO2 Mean             float64
NO2 1st Max Value    float64
NO2 1st Max Hour       int64
NO2 AQI                int64
O3 Units              object
O3 Mean              float64
O3 1st Max Value     float64
O3 1st Max Hour        int64
O3 AQI                 int6

In [9]:
# Check minimum values for numerical columns only
print("Minimum values for numerical columns:")
print(df.select_dtypes(include="number").min())

Minimum values for numerical columns:
Unnamed: 0           0.0000
State Code           1.0000
County Code          1.0000
Site Num             1.0000
NO2 Mean            -2.0000
NO2 1st Max Value   -2.0000
NO2 1st Max Hour     0.0000
NO2 AQI              0.0000
O3 Mean              0.0000
O3 1st Max Value     0.0000
O3 1st Max Hour      0.0000
O3 AQI               0.0000
SO2 Mean            -2.0000
SO2 1st Max Value   -2.0000
SO2 1st Max Hour     0.0000
SO2 AQI              0.0000
CO Mean             -0.4375
CO 1st Max Value    -0.4000
CO 1st Max Hour      0.0000
CO AQI               0.0000
dtype: float64

In [10]:
# Check maximum values for numerical columns only
print("Maximum values for numerical columns:")
print(df.select_dtypes(include="number").max())

Maximum values for numerical columns:
Unnamed: 0           134575.000000
State Code               80.000000
County Code             650.000000
Site Num               9997.000000
NO2 Mean                139.541667
NO2 1st Max Value       267.000000
NO2 1st Max Hour         23.000000
NO2 AQI                 132.000000
O3 Mean                   0.095083
O3 1st Max Value          0.141000
O3 1st Max Hour          23.000000
O3 AQI                  218.000000
SO2 Mean                321.625000
SO2 1st Max Value       351.000000
SO2 1st Max Hour         23.000000
SO2 AQI                 200.000000
CO Mean                   7.508333
CO 1st Max Value         19.900000
CO 1st Max Hour          23.000000
CO AQI                  201.000000
dtype: float64
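The minimum and maximum checks can also be read side by side. The following is a small sketch on made-up values (the `toy` frame is an assumption, not the real dataset) that builds one summary table and lists the columns whose minimum is negative.

```python
import pandas as pd

toy = pd.DataFrame({
    "City": ["Phoenix", "Boston", "Goleta"],
    "NO2 Mean": [19.0, -2.0, 3.5],
    "CO Mean": [0.88, 0.03, -0.44],
})

# One row per numeric column, with its minimum and maximum side by side
summary = toy.select_dtypes(include="number").agg(["min", "max"]).T
print(summary)

# Columns whose minimum is negative
negative_cols = summary.index[summary["min"] < 0].tolist()
print(negative_cols)
```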

In [11]:
# Check for duplicated values and return their sum
df.duplicated().sum()

np.int64(0)

In [12]:
# Check for null values in each column and return their sum
df.isnull().sum()

Unnamed: 0                0
State Code                0
County Code               0
Site Num                  0
Address                   0
State                     0
County                    0
City                      0
Date Local                0
NO2 Units                 0
NO2 Mean                  0
NO2 1st Max Value         0
NO2 1st Max Hour          0
NO2 AQI                   0
O3 Units                  0
O3 Mean                   0
O3 1st Max Value          0
O3 1st Max Hour           0
O3 AQI                    0
SO2 Units                 0
SO2 Mean                  0
SO2 1st Max Value         0
SO2 1st Max Hour          0
SO2 AQI              872907
CO Units                  0
CO Mean                   0
CO 1st Max Value          0
CO 1st Max Hour           0
CO AQI               873323
dtype: int64

In [13]:
# Drop missing values from df
df = df.dropna()
print(f"DataFrame shape: {df.shape}")
print("Missing values dropped from df.")

DataFrame shape: (436876, 29)
Missing values dropped from df.
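Called without arguments, `dropna()` removes every row that has a missing value in any column, which here means any row missing either `SO2 AQI` or `CO AQI`. The sketch below uses a made-up four-row frame (the `NO2 AQI` values are invented) to show how to count the affected rows first and how `subset=` narrows the check to chosen columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "NO2 AQI": [46, 46, 46, 46],
    "SO2 AQI": [13.0, 13.0, np.nan, np.nan],
    "CO AQI": [np.nan, 25.0, np.nan, 25.0],
})

# Count how many rows would be lost before dropping anything
rows_with_nan = toy.isna().any(axis=1).sum()
print(f"Rows with at least one missing value: {rows_with_nan}")

# dropna() with no arguments keeps only fully populated rows
complete = toy.dropna()

# subset= restricts the check to the named columns
so2_only = toy.dropna(subset=["SO2 AQI"])
print(complete.shape, so2_only.shape)
```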


In [14]:
# Print column names and their data types for df
print("Column names:")
print(df.columns.tolist())
print(f"DataFrame shape: {df.shape}")
print(df.dtypes)

Column names:
['Unnamed: 0', 'State Code', 'County Code', 'Site Num', 'Address', 'State', 'County', 'City', 'Date Local', 'NO2 Units', 'NO2 Mean', 'NO2 1st Max Value', 'NO2 1st Max Hour', 'NO2 AQI', 'O3 Units', 'O3 Mean', 'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'SO2 Units', 'SO2 Mean', 'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI', 'CO Units', 'CO Mean', 'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI']
DataFrame shape: (436876, 29)
Unnamed: 0             int64
State Code             int64
County Code            int64
Site Num               int64
Address               object
State                 object
County                object
City                  object
Date Local            object
NO2 Units             object
NO2 Mean             float64
NO2 1st Max Value    float64
NO2 1st Max Hour       int64
NO2 AQI                int64
O3 Units              object
O3 Mean              float64
O3 1st Max Value     float64
O3 1st Max Hour        int64
O3 AQI                 int64

In [15]:
# Drop the "Unnamed: 0" column from df
# The pattern "^Unnamed" matches every column whose name starts with "Unnamed"
df = df.loc[:, ~df.columns.str.contains("^Unnamed")]
print("'Unnamed: 0' column dropped.")

'Unnamed: 0' column dropped.


In [16]:
# Print column names and their data types for df
print("Column names:")
print(df.columns.tolist())
print(f"DataFrame shape: {df.shape}")
print(df.dtypes)

Column names:
['State Code', 'County Code', 'Site Num', 'Address', 'State', 'County', 'City', 'Date Local', 'NO2 Units', 'NO2 Mean', 'NO2 1st Max Value', 'NO2 1st Max Hour', 'NO2 AQI', 'O3 Units', 'O3 Mean', 'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'SO2 Units', 'SO2 Mean', 'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI', 'CO Units', 'CO Mean', 'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI']
DataFrame shape: (436876, 28)
State Code             int64
County Code            int64
Site Num               int64
Address               object
State                 object
County                object
City                  object
Date Local            object
NO2 Units             object
NO2 Mean             float64
NO2 1st Max Value    float64
NO2 1st Max Hour       int64
NO2 AQI                int64
O3 Units              object
O3 Mean              float64
O3 1st Max Value     float64
O3 1st Max Hour        int64
O3 AQI                 int64
SO2 Units             object
SO2 Mean     

In [17]:
# Rename "Site Num" column to "Site Number" in df
df = df.rename(columns={"Site Num": "Site Number"})
print("Column 'Site Num' renamed to 'Site Number'.")

Column 'Site Num' renamed to 'Site Number'.


In [18]:
# Print column names and their data types for df
print("Column names:")
print(df.columns.tolist())
print(f"DataFrame shape: {df.shape}")
print(df.dtypes)

Column names:
['State Code', 'County Code', 'Site Number', 'Address', 'State', 'County', 'City', 'Date Local', 'NO2 Units', 'NO2 Mean', 'NO2 1st Max Value', 'NO2 1st Max Hour', 'NO2 AQI', 'O3 Units', 'O3 Mean', 'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'SO2 Units', 'SO2 Mean', 'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI', 'CO Units', 'CO Mean', 'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI']
DataFrame shape: (436876, 28)
State Code             int64
County Code            int64
Site Number            int64
Address               object
State                 object
County                object
City                  object
Date Local            object
NO2 Units             object
NO2 Mean             float64
NO2 1st Max Value    float64
NO2 1st Max Hour       int64
NO2 AQI                int64
O3 Units              object
O3 Mean              float64
O3 1st Max Value     float64
O3 1st Max Hour        int64
O3 AQI                 int64
SO2 Units             object
SO2 Mean  

In [19]:
# Add "Date", "Year" and "Month" columns using the "Date Local" column
# Make sure the new "Date" column is in "datetime" format
df["Date"] = pd.to_datetime(df["Date Local"])
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
print("Data loaded successfully!")
df.head()

Data loaded successfully!


Unnamed: 0,State Code,County Code,Site Number,Address,State,County,City,Date Local,NO2 Units,NO2 Mean,...,SO2 1st Max Hour,SO2 AQI,CO Units,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI,Date,Year,Month
1,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-01,Parts per billion,19.041667,...,21,13.0,Parts per million,0.878947,2.2,23,25.0,2000-01-01,2000,1
5,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-02,Parts per billion,22.958333,...,22,4.0,Parts per million,1.066667,2.3,0,26.0,2000-01-02,2000,1
9,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-03,Parts per billion,38.125,...,19,16.0,Parts per million,1.7625,2.5,8,28.0,2000-01-03,2000,1
13,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-04,Parts per billion,40.26087,...,8,23.0,Parts per million,1.829167,3.0,23,34.0,2000-01-04,2000,1
17,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2000-01-05,Parts per billion,48.45,...,7,21.0,Parts per million,2.7,3.7,2,42.0,2000-01-05,2000,1
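As a small illustration of the date step, the sketch below parses a made-up `Date Local` column with an explicit format and derives the year and month through the `.dt` accessor. Passing `format="%Y-%m-%d"` is an assumption about the layout, based on the values shown in the outputs above.

```python
import pandas as pd

toy = pd.DataFrame({"Date Local": ["2000-01-01", "2002-09-03", "2016-03-31"]})

# An explicit format makes parsing strict: a value in another layout raises an error
toy["Date"] = pd.to_datetime(toy["Date Local"], format="%Y-%m-%d")
toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month

print(toy.dtypes)
print(toy[["Year", "Month"]])
```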


In [20]:
# Drop "Date Local" column from df
df = df.drop(columns=["Date Local"])
print("'Date Local' column dropped.")

'Date Local' column dropped.


In [21]:
# Print column names and their data types for df
print("Column names:")
print(df.columns.tolist())
print(f"DataFrame shape: {df.shape}")
print(df.dtypes)

Column names:
['State Code', 'County Code', 'Site Number', 'Address', 'State', 'County', 'City', 'NO2 Units', 'NO2 Mean', 'NO2 1st Max Value', 'NO2 1st Max Hour', 'NO2 AQI', 'O3 Units', 'O3 Mean', 'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'SO2 Units', 'SO2 Mean', 'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI', 'CO Units', 'CO Mean', 'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI', 'Date', 'Year', 'Month']
DataFrame shape: (436876, 30)
State Code                    int64
County Code                   int64
Site Number                   int64
Address                      object
State                        object
County                       object
City                         object
NO2 Units                    object
NO2 Mean                    float64
NO2 1st Max Value           float64
NO2 1st Max Hour              int64
NO2 AQI                       int64
O3 Units                     object
O3 Mean                     float64
O3 1st Max Value            float64
O3 1st Max

In [22]:
# Drop all rows where the "City" column is "Not in a city"
df = df[df["City"] != "Not in a city"]
print("Rows with 'Not in a city' in the City column dropped.")

Rows with 'Not in a city' in the City column dropped.


In [23]:
# Print column names and their data types for df
print("Column names:")
print(df.columns.tolist())
print(f"DataFrame shape: {df.shape}")
print(df.dtypes)

Column names:
['State Code', 'County Code', 'Site Number', 'Address', 'State', 'County', 'City', 'NO2 Units', 'NO2 Mean', 'NO2 1st Max Value', 'NO2 1st Max Hour', 'NO2 AQI', 'O3 Units', 'O3 Mean', 'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'SO2 Units', 'SO2 Mean', 'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI', 'CO Units', 'CO Mean', 'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI', 'Date', 'Year', 'Month']
DataFrame shape: (402257, 30)
State Code                    int64
County Code                   int64
Site Number                   int64
Address                      object
State                        object
County                       object
City                         object
NO2 Units                    object
NO2 Mean                    float64
NO2 1st Max Value           float64
NO2 1st Max Hour              int64
NO2 AQI                       int64
O3 Units                     object
O3 Mean                     float64
O3 1st Max Value            float64
O3 1st Max

In [24]:
# Drop all rows with a negative value in any of the pollutant mean columns
pollutant_cols = ["NO2 Mean", "O3 Mean", "SO2 Mean", "CO Mean"]
for col in pollutant_cols:
    df = df[df[col] >= 0]
print("Rows with negative pollutant values dropped.")

Rows with negative pollutant values dropped.
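The same filter can be written as a single boolean mask instead of a loop. This is a sketch on made-up values, not the notebook's data; it keeps a row only when every listed column is non-negative and reports how many rows were removed.

```python
import pandas as pd

toy = pd.DataFrame({
    "NO2 Mean": [19.0, -2.0, 3.5, 7.7],
    "O3 Mean": [0.02, 0.03, 0.01, 0.04],
    "SO2 Mean": [3.0, 1.0, -0.02, 2.0],
    "CO Mean": [0.88, 0.03, 0.10, 0.34],
})
pollutant_cols = ["NO2 Mean", "O3 Mean", "SO2 Mean", "CO Mean"]

# True only for rows where every listed column is non-negative
non_negative = (toy[pollutant_cols] >= 0).all(axis=1)
filtered = toy[non_negative]
print(f"Dropped {len(toy) - len(filtered)} of {len(toy)} rows")
```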


In [25]:
# Print column names and their data types for df
print("Column names:")
print(df.columns.tolist())
print(f"DataFrame shape: {df.shape}")
print(df.dtypes)

Column names:
['State Code', 'County Code', 'Site Number', 'Address', 'State', 'County', 'City', 'NO2 Units', 'NO2 Mean', 'NO2 1st Max Value', 'NO2 1st Max Hour', 'NO2 AQI', 'O3 Units', 'O3 Mean', 'O3 1st Max Value', 'O3 1st Max Hour', 'O3 AQI', 'SO2 Units', 'SO2 Mean', 'SO2 1st Max Value', 'SO2 1st Max Hour', 'SO2 AQI', 'CO Units', 'CO Mean', 'CO 1st Max Value', 'CO 1st Max Hour', 'CO AQI', 'Date', 'Year', 'Month']
DataFrame shape: (397368, 30)
State Code                    int64
County Code                   int64
Site Number                   int64
Address                      object
State                        object
County                       object
City                         object
NO2 Units                    object
NO2 Mean                    float64
NO2 1st Max Value           float64
NO2 1st Max Hour              int64
NO2 AQI                       int64
O3 Units                     object
O3 Mean                     float64
O3 1st Max Value            float64
O3 1st Max

In [26]:
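# Filter outliers from the pollutant mean columns using the IQR method
# NOTE: df1 is reassigned from df on every iteration, so after the loop it only
# reflects the bounds of the last column ("CO Mean"); the filters are not cumulative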
pollutant_cols = ["NO2 Mean", "O3 Mean", "SO2 Mean", "CO Mean"]
for col in pollutant_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df1 = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
print(f"DataFrame shape after outliers removal: {df1.shape}")
df1.head()

DataFrame shape after outliers removal: (372998, 30)


Unnamed: 0,State Code,County Code,Site Number,Address,State,County,City,NO2 Units,NO2 Mean,NO2 1st Max Value,...,SO2 1st Max Hour,SO2 AQI,CO Units,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI,Date,Year,Month
1,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,Parts per billion,19.041667,49.0,...,21,13.0,Parts per million,0.878947,2.2,23,25.0,2000-01-01,2000,1
101,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,Parts per billion,27.217391,42.0,...,1,10.0,Parts per million,0.866667,1.4,0,16.0,2000-01-26,2000,1
169,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,Parts per billion,23.208333,48.0,...,0,9.0,Parts per million,0.879167,2.0,1,23.0,2000-02-12,2000,2
173,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,Parts per billion,26.708333,43.0,...,21,4.0,Parts per million,0.645833,1.0,8,11.0,2000-02-13,2000,2
189,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,Parts per billion,22.227273,51.0,...,0,11.0,Parts per million,0.829167,2.4,0,27.0,2000-02-17,2000,2
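In the loop above, `df1` is rebuilt from `df` on each pass, so only the last column's bounds end up applied. If the intent is to keep rows that are within bounds for every pollutant column, one possible approach is to accumulate a single mask. The helper name `filter_iqr` and the toy values below are hypothetical.

```python
import pandas as pd


def filter_iqr(frame: pd.DataFrame, cols: list[str], k: float = 1.5) -> pd.DataFrame:
    """Keep rows that fall within [Q1 - k*IQR, Q3 + k*IQR] for every column in cols."""
    keep = pd.Series(True, index=frame.index)
    for col in cols:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        # Combine with the mask from earlier columns instead of overwriting it
        keep &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[keep]


toy = pd.DataFrame({
    "NO2 Mean": [10.0, 11.0, 12.0, 13.0, 14.0, 100.0],
    "CO Mean": [0.1, 0.2, 0.3, 0.4, 9.0, 0.5],
})
cleaned = filter_iqr(toy, ["NO2 Mean", "CO Mean"])
print(cleaned)
```

In this sketch the quartiles for each column are computed on the unfiltered frame, so the result does not depend on the order of the columns. Recomputing them after each filter is a different design choice and would give different bounds.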


In [29]:
# Draw a reproducible random sample of about 5,000 rows for analysis and print its shape
# NOTE: the sample is drawn from df, so it replaces the IQR-filtered df1 from the previous cell
df1 = df.sample(frac=0.012582, random_state=10)
print(f"DataFrame shape: {df1.shape}")
print("Data loaded successfully!")
df1.head()

DataFrame shape: (5000, 30)
Data loaded successfully!


Unnamed: 0,State Code,County Code,Site Number,Address,State,County,City,NO2 Units,NO2 Mean,NO2 1st Max Value,...,SO2 1st Max Hour,SO2 AQI,CO Units,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI,Date,Year,Month
1359377,6,65,8001,"5888 MISSION BLVD., RUBIDOUX",California,Riverside,Rubidoux,Parts per billion,25.16087,35.8,...,7,1.0,Parts per million,0.741667,1.0,7,11.0,2013-02-13,2013,2
254396,40,21,9002,"P.O. BOX 948 TAHLEQUAH, OK 74464",Oklahoma,Cherokee,Park Hill,Parts per billion,7.666667,20.0,...,10,3.0,Parts per million,0.0625,0.1,0,1.0,2002-09-03,2002,9
1617936,6,37,4006,"2425 Webster St., Long Beach, CA",California,Los Angeles,Long Beach,Parts per billion,14.721739,24.2,...,13,4.0,Parts per million,0.341667,0.5,5,6.0,2015-09-06,2015,9
410057,25,25,42,HARRISON AVE,Massachusetts,Suffolk,Boston,Parts per billion,15.458333,26.0,...,3,14.0,Parts per million,0.025,0.1,8,1.0,2004-04-12,2004,4
666263,6,83,2011,"380 N FAIRVIEW AVENUE, GOLETA",California,Santa Barbara,Goleta,Parts per billion,3.478261,12.0,...,3,1.0,Parts per million,0.0,0.0,0,0.0,2007-02-28,2007,2
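The fraction `0.012582` works because 0.012582 × 397,368 ≈ 4,999.7, which pandas rounds to 5,000 rows. A less fragile option, sketched here on a made-up frame, is to request the row count directly with `n=`. The `row_id` column and the sample size of 50 are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"row_id": range(1000)})

# n= asks for an exact number of rows, so no fraction has to be tuned by hand
sample_a = toy.sample(n=50, random_state=10)
sample_b = toy.sample(n=50, random_state=10)

print(sample_a.shape)
print(sample_a.index.equals(sample_b.index))
```

With `n=`, the sample size stays the same even if earlier cleaning steps change the number of rows in the source DataFrame, and a fixed `random_state` keeps the draw repeatable.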


---

## Section 3

Save the cleaned data to the output file.

In [28]:
# Save the processed dataset
df1.to_csv("outputs/pollution_us_2000_2016_cleaned.csv", index=False)
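`to_csv` does not create missing folders, and a CSV file does not store the datetime type. The sketch below, using a temporary folder as a stand-in for `outputs/` and a made-up two-row frame, creates the folder first and then reloads the file with `parse_dates` to restore the `Date` column's type.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "City": ["Phoenix", "Boston"],
    "Date": pd.to_datetime(["2000-01-01", "2004-04-12"]),
})

# A temporary folder stands in for the project's outputs/ directory
out_path = Path(tempfile.mkdtemp()) / "outputs" / "toy_cleaned.csv"
out_path.parent.mkdir(parents=True, exist_ok=True)
toy.to_csv(out_path, index=False)

# CSV has no datetime type, so ask pandas to parse the column again on read
reloaded = pd.read_csv(out_path, parse_dates=["Date"])
print(reloaded.dtypes)
```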

---
---

## Insights

Here are a few insights from this notebook:

* The air pollution dataset covers US data from 2000 to 2016 and includes multiple pollutants and site information.
* Data cleaning steps included removing missing values and unnecessary columns, renaming columns, ensuring correct data types, generating necessary columns, and shortening the dataset, all of which improve data quality for analysis.
* The column "Site Num" was renamed to "Site Number" for clarity.
* The "Date Local" column was dropped and replaced with a "Date" column in datetime format.
* "Not in a city" values from the "City" column were dropped.
* A random sample of 5,000 rows was extracted for efficient analysis.
* The cleaned data is ready for further statistical analysis, visualization, or machine learning tasks.

**NOTE**

* The dataset was very large and caused commit conflicts with origin. This was resolved by compressing it into a zip file and reintroducing it to the workspace.
* Working within the time constraints was difficult, but the necessary data was produced.
* Considering the use of the sites (addresses), we must keep the ethical use of this data in mind and comply with the relevant data protection and safety legislation.

---

## Conclusion

This ETL session was quite intriguing. Time was spent working out the right dataset to use, but in the end the dataset was cleaned and saved for further analysis.