# **EDA**

## Objectives

* To perform exploratory data analysis

## Inputs

* The csv file "pollution_us_2012_2016-population-weather.csv" 

## Outputs

* Various plots (histogram, box plot, scatter plot etc.) to understand the distribution and correlation between variables, along with statistical tests

---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\sonia\\Documents\\VS Studio Projects\\US_Air_Pollution_Team_2\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [40]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [2]:
os.chdir(r"c:\Users\sonia\Documents\VS Studio Projects\US_Air_Pollution_Team_2")

os.getcwd()

'c:\\Users\\sonia\\Documents\\VS Studio Projects\\US_Air_Pollution_Team_2'

Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\sonia\\Documents\\VS Studio Projects\\US_Air_Pollution_Team_2'
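
The cells above either climb one level from wherever the notebook happens to be running or hardcode a user-specific absolute path. As a hedged alternative, the sketch below only moves to the parent when the current folder is the notebooks subfolder, so re-running the cell does not climb a second level. The helper name `project_root` and the example path are hypothetical; the folder name `jupyter_notebooks` is taken from the output above.

```python
from pathlib import Path


def project_root(start: Path, notebooks_dirname: str = "jupyter_notebooks") -> Path:
    """Return the parent of `start` if it is the notebooks subfolder, else `start` unchanged."""
    return start.parent if start.name == notebooks_dirname else start


# Made-up example path; in the notebook you would pass Path.cwd()
# and then call os.chdir(root) on the result.
example = Path("/home/user/US_Air_Pollution_Team_2/jupyter_notebooks")
root = project_root(example)
print(root)

# Calling the helper again on the result leaves it unchanged.
print(project_root(root) == root)
```

Because the helper is idempotent, the cell can be re-executed safely, whereas `os.chdir(os.path.dirname(os.getcwd()))` moves up one more level each time it runs.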

---

## Required Libraries

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import QuantileTransformer
from feature_engine.transformation import YeoJohnsonTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV

---

## Load the Dataset

In [6]:
df = pd.read_csv('Dataset/EDA/pollution_us_2012_2016-population-weather.csv') 
pd.set_option("display.max_columns", None)
df.head()


Unnamed: 0,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,O3 1st Max Value,O3 1st Max Hour,O3 AQI,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI,Population,Latitude,Longitude,tmax,prcp,wspd,pres,Month,Year,AQI
0,Arizona,Pima,Tucson,2012-01-01,17.716667,31.0,0,29,0.013667,0.03,10,25,0.254167,0.5,19,0.0,0.336842,0.6,5,7.0,542649,31.9681,-111.7806,26.7,0.0,17.6,1022.2,January,2012,29.0
1,Arizona,Pima,Tucson,2012-01-02,15.0625,30.6,18,28,0.015083,0.03,10,25,0.2,0.6,19,0.0,0.225,0.4,23,5.0,542649,31.9681,-111.7806,24.4,0.0,27.4,1023.2,January,2012,28.0
2,Arizona,Pima,Tucson,2012-01-03,21.643478,31.0,18,29,0.011417,0.026,9,22,0.295455,0.7,8,0.0,0.295833,0.4,0,5.0,542649,31.9681,-111.7806,26.1,0.0,10.8,1023.2,January,2012,29.0
3,Arizona,Pima,Tucson,2012-01-04,25.041668,37.8,10,35,0.009208,0.02,10,17,0.7375,2.1,19,3.0,0.345833,0.5,12,6.0,542649,31.9681,-111.7806,24.4,0.0,9.0,1024.2,January,2012,35.0
4,Arizona,Pima,Tucson,2012-01-05,21.981817,37.1,17,35,0.013042,0.031,9,26,0.330435,0.8,21,0.0,0.291667,0.6,23,7.0,542649,31.9681,-111.7806,23.9,0.0,9.7,1020.9,January,2012,35.0


In [7]:
df['Date Local'] = pd.to_datetime(df['Date Local'])
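
With the default settings, `pd.to_datetime` raises an error on an unparseable string. If the intent is to count bad dates rather than stop on them, one option is `errors="coerce"`, which turns unparseable values into `NaT`. The sketch below uses made-up values, not rows from the dataset.

```python
import pandas as pd

# Toy input: one valid date, one impossible date, one non-date string
raw = pd.Series(["2012-01-01", "2012-02-30", "not a date"])

# errors="coerce" converts anything unparseable to NaT instead of raising
parsed = pd.to_datetime(raw, errors="coerce")

# NaT counts as missing, so isna() can be used to count invalid dates
n_bad = int(parsed.isna().sum())
print(parsed)
print(f"Rows with invalid dates: {n_bad}")
```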

---

## Keeping Cities with Enough Data for Modelling

In [None]:
df["City"].nunique()

43

In [22]:
# Count rows whose date is missing after parsing (nothing is dropped here)
bad_dates = df['Date Local'].isna().sum()
print(f"Rows with invalid dates: {bad_dates}")

# Sort by city, then by date
df = df.sort_values(['City', 'Date Local']).reset_index(drop=True)

# Quick check: first and last date per city
print(df[['City','Date Local']].groupby('City').agg(['min','max']).head())

Rows with invalid dates: 0
            Date Local           
                   min        max
City                             
Albuquerque 2012-01-01 2015-12-31
Alexandria  2012-03-31 2012-08-21
Austin      2012-11-30 2014-07-01
Birmingham  2013-12-01 2016-05-31
Blaine      2012-03-13 2015-12-28


In [10]:
# Define full expected date range
full_range = pd.date_range(start='2012-01-01', end='2016-05-31')

# Count days per city
city_day_counts = df.groupby('City')['Date Local'].nunique()

# Expected number of days (2012 and 2016 are leap years)
expected_days = len(full_range)  # 1613 days total

# Identify which cities have full coverage
complete_cities = city_day_counts[city_day_counts == expected_days].index
incomplete_cities = city_day_counts[city_day_counts < expected_days].index

print(f"Cities with full data (2012–2016): {len(complete_cities)}")
print(f"Cities missing some days: {len(incomplete_cities)}\n")

# Optionally display which are incomplete and how many days they have
city_coverage = pd.DataFrame({
    'Days available': city_day_counts,
    'Missing days': expected_days - city_day_counts
}).sort_values('Days available', ascending=False)

city_coverage.head(20)  # shows top 20 cities by coverage

Cities with full data (2012–2016): 0
Cities missing some days: 43



City,Days available,Missing days
New York,1573,40
El Paso,1553,60
Deer Park,1547,66
Houston,1515,98
Charlotte,1514,99
Dallas,1501,112
Victorville,1480,133
Albuquerque,1432,181
Concord,1427,186
San Pablo,1426,187
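
The "Missing days" column is the expected day count minus the number of distinct dates per city. The sketch below reproduces that arithmetic on a made-up two-city frame (cities `A` and `B` are hypothetical) and checks the expected total for the date range used above.

```python
import pandas as pd

# Same date range as in the cell above
full_range = pd.date_range(start="2012-01-01", end="2016-05-31")
expected_days = len(full_range)
# 366 (2012) + 3 * 365 (2013-2015) + 152 (Jan-May 2016) = 1613
print(expected_days)

# Toy data: city A has a duplicated date, city B has a single day
toy = pd.DataFrame({
    "City": ["A", "A", "A", "B"],
    "Date Local": pd.to_datetime(
        ["2012-01-01", "2012-01-02", "2012-01-02", "2012-01-01"]
    ),
})

# nunique() counts distinct dates, so duplicated rows do not inflate coverage
days = toy.groupby("City")["Date Local"].nunique()
missing = expected_days - days
print(pd.DataFrame({"Days available": days, "Missing days": missing}))
```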


In [19]:
city_day_counts = df.groupby("City")["Date Local"].nunique().reset_index(name="Days_available")

# Keep only cities with > 1400 days
cities_to_keep = city_day_counts[city_day_counts["Days_available"] > 1400]["City"].tolist()

# Filter main dataframe
df_filtered = df[df["City"].isin(cities_to_keep)].copy()

print(f"Keeping {len(cities_to_keep)} cities:")
print(cities_to_keep)

Keeping 11 cities:
['Albuquerque', 'Charlotte', 'Concord', 'Dallas', 'Deer Park', 'El Paso', 'Houston', 'New York', 'Oakland', 'San Pablo', 'Victorville']


In [23]:
df_filtered.head(2)

Unnamed: 0,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,O3 1st Max Value,O3 1st Max Hour,O3 AQI,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI,Population,Latitude,Longitude,tmax,prcp,wspd,pres,Month,Year,AQI
0,New Mexico,Bernalillo,Albuquerque,2012-01-01,21.091667,36.1,3,34,0.016375,0.031,9,26,0.770833,1.5,21,1.0,0.246154,0.6,23,7.0,564549,35.0448,-106.677,11.1,0.0,6.1,1033.0,January,2012,34.0
1,New Mexico,Bernalillo,Albuquerque,2012-01-02,31.09375,41.4,18,39,0.012,0.025,9,21,1.091667,2.1,1,3.0,0.4125,0.7,23,8.0,564549,35.0448,-106.677,10.0,0.0,5.8,1035.9,January,2012,39.0


In [26]:
# Get the list of cities currently in df_filtered
cities = df_filtered["City"].unique()

# Compute min and max Date Local for each city
city_coverage = df_filtered.groupby("City")["Date Local"].agg(["min", "max"])

# Filter to only the cities in our current set (11 cities)
city_coverage = city_coverage.loc[cities]

print(df_filtered.shape)
print("Date coverage for each city:")
print(city_coverage)

(20146, 30)
Date coverage for each city:
                   min        max
City                             
Albuquerque 2012-01-01 2015-12-31
Charlotte   2012-01-01 2016-05-31
Concord     2012-01-01 2016-04-30
Dallas      2012-01-01 2016-03-31
Deer Park   2012-01-01 2016-04-30
El Paso     2012-01-01 2016-04-30
Houston     2012-01-01 2016-03-31
New York    2012-01-01 2016-04-30
Oakland     2012-01-01 2016-04-30
San Pablo   2012-01-01 2016-04-30
Victorville 2012-01-01 2016-03-31


---

## Addition of Lag and Rolling Features

In [27]:
# Sort by city and date
df_filtered = df_filtered.sort_values(["City", "Date Local"])

# Create lag features
for lag in [1, 2, 3]:
    df_filtered[f"AQI_lag{lag}"] = df_filtered.groupby("City")["AQI"].shift(lag)

# 7-day rolling mean of AQI (the window includes the current day)
df_filtered["AQI_rolling7"] = df_filtered.groupby("City")["AQI"].transform(lambda x: x.rolling(7).mean())

# 7-day ahead target
df_filtered["AQI_7d_ahead"] = df_filtered.groupby("City")["AQI"].shift(-7)

# Drop rows with NaN from lags, rolling mean, or 7-day ahead target
df_filtered = df_filtered.dropna(subset=["AQI_7d_ahead", "AQI_rolling7", "AQI_lag1", "AQI_lag2", "AQI_lag3"]).reset_index(drop=True)

print(df_filtered.shape)
df_filtered.head(2)

(20003, 35)


Unnamed: 0,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,O3 1st Max Value,O3 1st Max Hour,O3 AQI,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI,Population,Latitude,Longitude,tmax,prcp,wspd,pres,Month,Year,AQI,AQI_lag1,AQI_lag2,AQI_lag3,AQI_rolling7,AQI_7d_ahead
0,New Mexico,Bernalillo,Albuquerque,2012-01-07,27.116667,39.1,20,37,0.0105,0.026,10,22,1.479167,2.8,8,3.0,0.483333,0.7,1,8.0,564549,35.0448,-106.677,12.8,0.0,8.6,1014.9,January,2012,37.0,49.0,40.0,34.0,39.0,39.0
1,New Mexico,Bernalillo,Albuquerque,2012-01-08,16.020832,31.3,0,29,0.016042,0.029,12,25,0.6,1.7,0,1.0,0.308333,0.8,1,9.0,564549,35.0448,-106.677,6.1,0.0,22.3,1017.3,January,2012,29.0,37.0,49.0,40.0,38.285714,35.0
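
To make the behaviour of these features concrete, here is a small sketch on made-up data with a 3-day window and a 2-day-ahead target (shorter than the 7 used above, purely to keep the example readable). It shows that grouped shifts do not leak values across cities, and that `rolling(n).mean()` includes the current row unless the series is shifted first. The column names are illustrative.

```python
import pandas as pd

# Toy data: two cities, five consecutive days each
toy = pd.DataFrame({
    "City": ["A"] * 5 + ["B"] * 5,
    "AQI": [10, 20, 30, 40, 50, 1, 2, 3, 4, 5],
})

# Lag within each city: the first row of every city gets NaN
toy["lag1"] = toy.groupby("City")["AQI"].shift(1)

# Rolling mean whose window ends on (and includes) the current day
toy["roll3_incl_today"] = toy.groupby("City")["AQI"].transform(
    lambda x: x.rolling(3).mean()
)

# Rolling mean over strictly past days: shift by one before rolling
toy["roll3_past_only"] = toy.groupby("City")["AQI"].transform(
    lambda x: x.shift(1).rolling(3).mean()
)

# Target two days ahead: the last two rows of every city get NaN
toy["ahead2"] = toy.groupby("City")["AQI"].shift(-2)

print(toy)
```

Whether the current day should be part of the rolling window depends on what is known at prediction time; if today's AQI is available when forecasting, including it is reasonable.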


---

## Add Seasonality Features

In [28]:
# Convert month name to number if needed
month_mapping = {
    "January": 1, "February": 2, "March": 3, "April": 4,
    "May": 5, "June": 6, "July": 7, "August": 8,
    "September": 9, "October": 10, "November": 11, "December": 12
}

df_filtered['Month_num'] = df_filtered['Month'].map(month_mapping)

# Add cyclical features
df_filtered['month_sin'] = np.sin(2 * np.pi * df_filtered['Month_num'] / 12)
df_filtered['month_cos'] = np.cos(2 * np.pi * df_filtered['Month_num'] / 12)
df_filtered.head(2)

Unnamed: 0,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,O3 1st Max Value,O3 1st Max Hour,O3 AQI,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI,Population,Latitude,Longitude,tmax,prcp,wspd,pres,Month,Year,AQI,AQI_lag1,AQI_lag2,AQI_lag3,AQI_rolling7,AQI_7d_ahead,Month_num,month_sin,month_cos
0,New Mexico,Bernalillo,Albuquerque,2012-01-07,27.116667,39.1,20,37,0.0105,0.026,10,22,1.479167,2.8,8,3.0,0.483333,0.7,1,8.0,564549,35.0448,-106.677,12.8,0.0,8.6,1014.9,January,2012,37.0,49.0,40.0,34.0,39.0,39.0,1,0.5,0.866025
1,New Mexico,Bernalillo,Albuquerque,2012-01-08,16.020832,31.3,0,29,0.016042,0.029,12,25,0.6,1.7,0,1.0,0.308333,0.8,1,9.0,564549,35.0448,-106.677,6.1,0.0,22.3,1017.3,January,2012,29.0,37.0,49.0,40.0,38.285714,35.0,1,0.5,0.866025
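
The point of the sine and cosine pair is that months wrap around: December ends up as close to January as January is to February, which a plain 1 to 12 encoding cannot express. The sketch below checks this numerically; the helper `encode_month` is a hypothetical name used only for illustration.

```python
import numpy as np


def encode_month(month_num: int) -> np.ndarray:
    """Return the (sin, cos) pair for a month number in 1..12."""
    angle = 2 * np.pi * month_num / 12
    return np.array([np.sin(angle), np.cos(angle)])


# Euclidean distances between encoded months
dec_jan = np.linalg.norm(encode_month(12) - encode_month(1))
jan_feb = np.linalg.norm(encode_month(1) - encode_month(2))
jan_jul = np.linalg.norm(encode_month(1) - encode_month(7))

print(f"Dec-Jan: {dec_jan:.4f}")  # same as Jan-Feb: adjacent months
print(f"Jan-Feb: {jan_feb:.4f}")
print(f"Jan-Jul: {jan_jul:.4f}")  # opposite sides of the circle
```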


---

## Drop Unwanted Variables

In [30]:
keep_col = ["State",
            "County",
            "City",
            "Date Local",
            "NO2 Mean",
            "NO2 1st Max Hour",
            "O3 Mean",
            "O3 1st Max Hour",
            "SO2 Mean",
            "SO2 1st Max Hour",
            "CO Mean",
            "CO 1st Max Hour",
            "Population",
            "Latitude",
            "Longitude",
            "tmax",
            "prcp",
            "wspd",
            "pres",
            "Year",
            "AQI",
            "AQI_lag1",
            "AQI_lag2",
            "AQI_lag3",
            "AQI_rolling7",
            "AQI_7d_ahead",
            "month_sin",
            "month_cos" 
]

df_keep = df_filtered[keep_col]
df_keep.head()

Unnamed: 0,State,County,City,Date Local,NO2 Mean,NO2 1st Max Hour,O3 Mean,O3 1st Max Hour,SO2 Mean,SO2 1st Max Hour,CO Mean,CO 1st Max Hour,Population,Latitude,Longitude,tmax,prcp,wspd,pres,Year,AQI,AQI_lag1,AQI_lag2,AQI_lag3,AQI_rolling7,AQI_7d_ahead,month_sin,month_cos
0,New Mexico,Bernalillo,Albuquerque,2012-01-07,27.116667,20,0.0105,10,1.479167,8,0.483333,1,564549,35.0448,-106.677,12.8,0.0,8.6,1014.9,2012,37.0,49.0,40.0,34.0,39.0,39.0,0.5,0.866025
1,New Mexico,Bernalillo,Albuquerque,2012-01-08,16.020832,0,0.016042,12,0.6,0,0.308333,1,564549,35.0448,-106.677,6.1,0.0,22.3,1017.3,2012,29.0,37.0,49.0,40.0,38.285714,35.0,0.5,0.866025
2,New Mexico,Bernalillo,Albuquerque,2012-01-09,18.456522,21,0.014208,11,0.604167,16,0.2375,23,564549,35.0448,-106.677,8.3,0.0,10.8,1024.5,2012,35.0,29.0,37.0,49.0,37.714286,26.0,0.5,0.866025
3,New Mexico,Bernalillo,Albuquerque,2012-01-10,30.5625,7,0.009833,9,1.816667,8,0.495833,10,564549,35.0448,-106.677,11.1,0.0,5.0,1020.7,2012,43.0,35.0,29.0,37.0,38.142857,35.0,0.5,0.866025
4,New Mexico,Bernalillo,Albuquerque,2012-01-11,20.2375,6,0.021417,11,1.154167,7,0.395833,7,564549,35.0448,-106.677,11.1,0.0,13.7,1015.9,2012,36.0,43.0,35.0,29.0,38.428571,41.0,0.5,0.866025


---

## Split into Train and Test Sets

In [None]:
# Set train: 2012-01-01 to 2014-12-31
train_end = pd.Timestamp("2014-12-31")

# Set test: 2015-01-01 to 2016-05-31 (the last date in the dataset)
test_start = pd.Timestamp("2015-01-01")
test_end = pd.Timestamp("2016-05-31")

In [None]:
# Split the reduced frame (df_keep) so the columns dropped above stay dropped
train_df = df_keep[df_keep["Date Local"] <= train_end]
test_df  = df_keep[(df_keep["Date Local"] >= test_start) & (df_keep["Date Local"] <= test_end)]
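
One thing worth checking with a date-based split and a 7-day-ahead target is that the last training rows have targets observed inside the test period. The sketch below counts such rows on made-up data (a single series with AQI set to a simple counter); it is an illustration of the check, not part of the notebook's pipeline.

```python
import pandas as pd

# Toy daily series spanning the train/test boundary
dates = pd.date_range("2014-12-20", "2015-01-10")
toy = pd.DataFrame({"Date Local": dates, "AQI": range(len(dates))})

# 7-day-ahead target; the last 7 rows have no target and are dropped
toy["AQI_7d_ahead"] = toy["AQI"].shift(-7)
toy = toy.dropna(subset=["AQI_7d_ahead"])

train_end = pd.Timestamp("2014-12-31")
test_start = pd.Timestamp("2015-01-01")

train = toy[toy["Date Local"] <= train_end]
test = toy[toy["Date Local"] >= test_start]

# Date on which each training row's target is actually observed
target_date = train["Date Local"] + pd.Timedelta(days=7)
n_overlap = int((target_date >= test_start).sum())

print(f"Train rows: {len(train)}, test rows: {len(test)}")
print(f"Train rows whose target falls in the test period: {n_overlap}")
```

If that overlap matters for the evaluation, one option is to drop the final 7 days of each city's training data before fitting.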

---

## Pipeline

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.