# **Feature Engineering**

## Objectives

* Engineer features for modelling: add population and daily weather data to the cleaned US air pollution dataset.

## Inputs

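* Cleaned pollution data: `Dataset/Processed/pollution_us_2012_2016-cleaned.csv`
* Support files in `Dataset/Support_files/`: `City_Pop_Map.csv`, `County_Pop_Map.csv`, `County_Centroids.csv`
* Daily weather data fetched with the `meteostat` library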

## Outputs

* CSV files of the engineered DataFrames (see "Save DataFrames to CSV Files")

## Additional Comments

* If you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Change working directory

* The notebooks are stored in a subfolder, so the working directory must be changed before running them in the editor.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\sonia\\Documents\\VS Studio Projects\\US_Air_Pollution_Team_2\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\sonia\\Documents\\VS Studio Projects\\US_Air_Pollution_Team_2'
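Re-running the `os.chdir` cell moves up one more level each time. A minimal sketch of a guard, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path above); `project_root` is a hypothetical helper, not part of the original notebook:

```python
import os


def project_root(cwd, notebooks_dirname="jupyter_notebooks"):
    """Return the parent folder only when `cwd` is the notebooks folder."""
    if os.path.basename(cwd) == notebooks_dirname:
        return os.path.dirname(cwd)
    # Already at the project root (or somewhere else): leave it unchanged
    return cwd


# Illustrative paths only
root = os.path.join("home", "user", "project")
print(project_root(os.path.join(root, "jupyter_notebooks")))
print(project_root(root))
```

In the notebook this would be used as `os.chdir(project_root(os.getcwd()))`, which is safe to run more than once.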

---

# Required Libraries

Import the libraries used in this notebook.

In [None]:
import pandas as pd
import numpy as np
import re
from datetime import datetime
from meteostat import Point, Daily, Stations

---

# Load the Dataset

In [6]:
df = pd.read_csv('Dataset/Processed/pollution_us_2012_2016-cleaned.csv') # Read the cleaned CSV file
df # Display the DataFrame (pandas shows the first and last rows)

Unnamed: 0,State Code,County Code,Site Num,Address,State,County,City,Date Local,NO2 Units,NO2 Mean,...,SO2 Units,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,SO2 AQI,CO Units,CO Mean,CO 1st Max Value,CO 1st Max Hour,CO AQI
0,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-01,Parts per billion,21.208332,...,Parts per billion,1.458333,5.0,0,7.0,Parts per million,1.152632,2.7,5,31.0
1,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-02,Parts per billion,17.208332,...,Parts per billion,0.416667,2.0,7,3.0,Parts per million,0.425000,0.5,0,6.0
2,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-03,Parts per billion,30.000000,...,Parts per billion,2.250000,6.0,20,9.0,Parts per million,0.800000,1.7,23,19.0
3,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-04,Parts per billion,33.666668,...,Parts per billion,2.791667,5.0,7,7.0,Parts per million,1.275000,1.9,1,22.0
4,4,13,3002,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-05,Parts per billion,31.695652,...,Parts per billion,3.043478,7.0,22,10.0,Parts per million,1.045833,1.7,23,19.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
136329,56,21,100,NCore - North Cheyenne Soccer Complex,Wyoming,Laramie,Not in a city,2016-03-27,Parts per billion,4.277273,...,Parts per billion,-0.095238,0.0,0,0.0,Parts per million,0.100000,0.1,0,1.0
136330,56,21,100,NCore - North Cheyenne Soccer Complex,Wyoming,Laramie,Not in a city,2016-03-28,Parts per billion,8.317391,...,Parts per billion,0.117391,0.5,7,0.0,Parts per million,0.100000,0.1,0,1.0
136331,56,21,100,NCore - North Cheyenne Soccer Complex,Wyoming,Laramie,Not in a city,2016-03-29,Parts per billion,2.564706,...,Parts per billion,0.143750,0.7,8,0.0,Parts per million,0.006667,0.1,0,1.0
136332,56,21,100,NCore - North Cheyenne Soccer Complex,Wyoming,Laramie,Not in a city,2016-03-30,Parts per billion,1.083333,...,Parts per billion,0.016667,0.1,0,0.0,Parts per million,0.091667,0.1,2,1.0


---

## Keep Useful Columns

I'm going to create a second DataFrame containing only the columns I think will be useful. Selecting columns into a new DataFrame, rather than dropping them from the original, makes it easy to add or remove columns if we change our minds.

In [5]:
df.info() # Display information about the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1746661 entries, 0 to 1746660
Data columns (total 29 columns):
 #   Column             Dtype  
---  ------             -----  
 0   Unnamed: 0         int64  
 1   State Code         int64  
 2   County Code        int64  
 3   Site Num           int64  
 4   Address            object 
 5   State              object 
 6   County             object 
 7   City               object 
 8   Date Local         object 
 9   NO2 Units          object 
 10  NO2 Mean           float64
 11  NO2 1st Max Value  float64
 12  NO2 1st Max Hour   int64  
 13  NO2 AQI            int64  
 14  O3 Units           object 
 15  O3 Mean            float64
 16  O3 1st Max Value   float64
 17  O3 1st Max Hour    int64  
 18  O3 AQI             int64  
 19  SO2 Units          object 
 20  SO2 Mean           float64
 21  SO2 1st Max Value  float64
 22  SO2 1st Max Hour   int64  
 23  SO2 AQI            float64
 24  CO Units           object 
 25  CO Mean           

In [8]:
keep_col = ["Address",
            "State",
            "County",
            "City",
            "Date Local",
            "NO2 Mean",
            "NO2 1st Max Value",
            "NO2 1st Max Hour",
            "NO2 AQI",
            "O3 Mean",
            "O3 1st Max Value",
            "O3 1st Max Hour",
            "SO2 Mean",
            "SO2 1st Max Value",
            "SO2 1st Max Hour",
            "CO Mean",
            "CO 1st Max Value", 
            "CO 1st Max Hour"
]

df_keep = df[keep_col]
df_keep.head(5)

Unnamed: 0,Address,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,O3 1st Max Value,O3 1st Max Hour,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,CO Mean,CO 1st Max Value,CO 1st Max Hour
0,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-01,21.208332,33.0,0,31,0.015083,0.028,11,1.458333,5.0,0,1.152632,2.7,5
1,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-02,17.208332,38.0,22,36,0.018042,0.034,9,0.416667,2.0,7,0.425,0.5,0
2,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-03,30.0,47.0,18,44,0.008542,0.024,10,2.25,6.0,20,0.8,1.7,23
3,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-04,33.666668,47.0,19,44,0.005458,0.016,10,2.791667,5.0,7,1.275,1.9,1
4,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-05,31.695652,48.0,18,45,0.008292,0.024,9,3.043478,7.0,22,1.045833,1.7,23


In [10]:
df_keep[df_keep.duplicated()]

Unnamed: 0,Address,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,O3 1st Max Value,O3 1st Max Hour,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,CO Mean,CO 1st Max Value,CO 1st Max Hour
1070,PIKE AVE AT RIVER ROAD,Arkansas,Pulaski,North Little Rock,2012-02-02,18.183332,46.6,19,43,0.015667,0.027,10,1.320833,3.1,10,0.700000,0.8,0
1072,PIKE AVE AT RIVER ROAD,Arkansas,Pulaski,North Little Rock,2012-02-02,18.183332,46.6,19,43,0.015667,0.027,10,1.154167,2.8,10,0.700000,0.8,0
1254,PIKE AVE AT RIVER ROAD,Arkansas,Pulaski,North Little Rock,2012-03-19,5.082609,8.7,21,8,0.031208,0.039,13,1.408333,2.4,19,0.400000,0.4,0
1256,PIKE AVE AT RIVER ROAD,Arkansas,Pulaski,North Little Rock,2012-03-19,5.082609,8.7,21,8,0.031208,0.039,13,0.962500,1.9,19,0.400000,0.4,0
1258,PIKE AVE AT RIVER ROAD,Arkansas,Pulaski,North Little Rock,2012-03-20,6.483333,18.4,22,17,0.025208,0.032,10,1.220833,2.1,11,0.400000,0.4,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133142,10TH ST. & VINE ST. DAVENPORT,Iowa,Scott,Davenport,2016-05-17,6.790909,15.2,21,14,0.024625,0.033,7,0.034783,0.2,0,0.250000,0.4,10
133144,10TH ST. & VINE ST. DAVENPORT,Iowa,Scott,Davenport,2016-05-18,7.136364,22.8,23,21,0.033167,0.050,11,0.013043,0.1,7,0.195833,0.3,22
133146,10TH ST. & VINE ST. DAVENPORT,Iowa,Scott,Davenport,2016-05-19,7.177273,17.3,21,16,0.039000,0.056,11,0.165217,0.7,8,0.229167,0.3,0
133148,10TH ST. & VINE ST. DAVENPORT,Iowa,Scott,Davenport,2016-05-20,9.336364,15.9,7,14,0.033375,0.051,11,0.434783,1.0,7,0.225000,0.3,9


In [11]:
df_keep = df_keep.drop_duplicates()
df_keep.shape

(136018, 18)
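The two cells above rely on the default `keep="first"` behaviour: `duplicated()` flags only the later copies of a row, and `drop_duplicates()` keeps the first copy. A small sketch on made-up data (the `toy` frame is illustrative, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "City": ["A", "A", "B"],
    "Date Local": ["2012-02-02", "2012-02-02", "2012-02-02"],
    "NO2 Mean": [18.2, 18.2, 5.1],
})

later_copies = toy.duplicated()            # default keep="first": flags row 1 only
all_copies = toy.duplicated(keep=False)    # flags rows 0 and 1
deduped = toy.drop_duplicates()            # keeps rows 0 and 2

print(later_copies.tolist(), all_copies.tolist(), len(deduped))
```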

In [13]:
df_keep["State"].unique()

array(['Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut',
       'Delaware', 'District Of Columbia', 'Florida', 'Georgia', 'Hawaii',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Louisiana', 'Maine',
       'Maryland', 'Massachusetts', 'Minnesota', 'Nevada', 'New Jersey',
       'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio',
       'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island',
       'South Dakota', 'Texas', 'Utah', 'Virginia', 'Wyoming', 'Alabama',
       'Kentucky', 'Missouri', 'Washington', 'Alaska', 'New Hampshire',
       'Tennessee', 'South Carolina'], dtype=object)

In [14]:
df_keep["County"].nunique()

90

In [15]:
df_keep["City"].nunique()

90

In [19]:
df_keep["City"].value_counts()

City
Not in a city         14208
North Little Rock      4749
Los Angeles            3334
New York               3129
Cleveland              3128
                      ...  
Cicero                  169
Cherry Tree             158
Alexandria              135
Dentsville (Dents)       29
Roosevelt                26
Name: count, Length: 90, dtype: int64

## Add New Column "Population"

In [17]:
df_city_pop = pd.read_csv("Dataset/Support_files/City_Pop_Map.csv", encoding='latin1')
df_city_pop

Unnamed: 0,Area,Population
0,"Abbeville city, Alabama",2349
1,"Adamsville city, Alabama",4393
2,"Addison town, Alabama",661
3,"Akron town, Alabama",229
4,"Alabaster city, Alabama",33342
...,...,...
21408,,
21409,,
21410,,
21411,,


In [18]:
df_city_pop = df_city_pop.dropna(how='all')
df_city_pop

Unnamed: 0,Area,Population
0,"Abbeville city, Alabama",2349
1,"Adamsville city, Alabama",4393
2,"Addison town, Alabama",661
3,"Akron town, Alabama",229
4,"Alabaster city, Alabama",33342
...,...,...
19474,"Wamsutter town, Wyoming",203
19475,"Wheatland town, Wyoming",3586
19476,"Worland city, Wyoming",4784
19477,"Wright town, Wyoming",1645


In [20]:
df_city_pop = df_city_pop.copy()

# Remove leading/trailing spaces 
df_city_pop['Area'] = df_city_pop['Area'].str.strip()

# Split at the last comma into city and state
df_city_pop[['City', 'State']] = df_city_pop['Area'].str.rsplit(',', n=1, expand=True)

# Strip extra spaces and remove "city" in the name
df_city_pop['City'] = df_city_pop['City'].str.replace(r'\b[Cc]ity\b', '', regex=True).str.strip()
df_city_pop['State'] = df_city_pop['State'].str.strip()

# Optional: drop 'Area'
df_city_pop = df_city_pop.drop(columns=['Area'])

df_city_pop.head()

Unnamed: 0,Population,City,State
0,2349,Abbeville,Alabama
1,4393,Adamsville,Alabama
2,661,Addison town,Alabama
3,229,Akron town,Alabama
4,33342,Alabaster,Alabama
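The preview above still shows suffixes such as "Addison town", because only the word "city" is removed. A hedged sketch of a broader clean-up, assuming the place-type suffixes of interest are "city", "town", "village" and "borough" (this list is an assumption, not something the notebook defines):

```python
import pandas as pd

# Assumed list of place-type suffixes; extend or shorten as needed
SUFFIX_PATTERN = r"\s+(?:city|town|village|borough)$"

areas = pd.Series(["Abbeville city, Alabama", "Addison town, Alabama"])

# Split at the last comma, then strip a trailing place-type word
parts = areas.str.strip().str.rsplit(",", n=1, expand=True)
city = parts[0].str.strip().str.replace(SUFFIX_PATTERN, "", regex=True, case=False)
state = parts[1].str.strip()

print(city.tolist(), state.tolist())
```

Anchoring the pattern at the end of the name keeps names like "Oklahoma City" intact. Stripping more suffixes may also create new duplicate city-state pairs, so the duplicate check later in this section still matters.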


In [None]:
def normalize_city(name):
    name = name.lower().strip()
    
    # Remove parenthetical content, e.g., " (city)"
    name = re.sub(r"\s*\(.*?\)\s*", "", name)
    
    # Remove common suffixes
    remove_terms = [" city"]
    for term in remove_terms:
        name = name.replace(term, "")
    
    # Remove extra spaces
    name = re.sub(r"\s+", " ", name).strip()
    
    return name

# Apply to both datasets
df_keep['City_norm'] = df_keep['City'].apply(normalize_city)
df_keep['State_norm'] = df_keep['State'].str.lower().str.strip()

df_city_pop['City_norm'] = df_city_pop['City'].apply(normalize_city)
df_city_pop['State_norm'] = df_city_pop['State'].str.lower().str.strip()

In [24]:
# Make sets of cities (and states) in each dataset
pol_cities = set(zip(df_keep['City'], df_keep['State']))
pop_cities = set(zip(df_city_pop['City'], df_city_pop['State']))

# Count how many (City, State) pairs in df_keep also appear in df_city_pop
# (this check compares the raw names, not the normalized *_norm columns)
matched = pol_cities & pop_cities
print(f"Cities in df_keep found in df_pop: {len(matched)}")
print(f"Cities in df_keep NOT found in df_pop: {len(pol_cities - pop_cities)}")

Cities in df_keep found in df_pop: 63
Cities in df_keep NOT found in df_pop: 39
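To see which pairs fail to match, rather than only how many, one option is a left merge with `indicator=True`. A minimal sketch on made-up frames (`pollution` and `population` stand in for `df_keep` and `df_city_pop`):

```python
import pandas as pd

pollution = pd.DataFrame({
    "City_norm": ["phoenix", "phoenix", "not in a city"],
    "State_norm": ["arizona", "arizona", "wyoming"],
})
population = pd.DataFrame({
    "City_norm": ["phoenix"],
    "State_norm": ["arizona"],
    "Population": [1608415],
})

# One row per distinct key, then flag keys with no partner in `population`
keys = pollution[["City_norm", "State_norm"]].drop_duplicates()
check = keys.merge(population, on=["City_norm", "State_norm"],
                   how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", ["City_norm", "State_norm"]]

print(unmatched.values.tolist())
```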


In [25]:
df_county_pop = pd.read_csv("Dataset/Support_files/County_Pop_Map.csv", encoding='latin1')
df_county_pop.head()

Unnamed: 0,County,Population
0,".Autauga County, Alabama",58800
1,".Baldwin County, Alabama",231767
2,".Barbour County, Alabama",25226
3,".Bibb County, Alabama",22284
4,".Blount County, Alabama",59130


In [26]:
df_county_pop = df_county_pop.copy()

# Step 1: Remove leading "."
df_county_pop["County"] = df_county_pop["County"].str.lstrip(".")

# Step 2: Split into "County" and "State"
df_county_pop[["County", "State"]] = df_county_pop["County"].str.replace(" County", "", regex=False).str.rsplit(",", n=1, expand=True)

# Step 3: Clean up whitespace
df_county_pop["County"] = df_county_pop["County"].str.strip()
df_county_pop["State"] = df_county_pop["State"].str.strip()

# Check result
df_county_pop.head()

Unnamed: 0,County,Population,State
0,Autauga,58800,Alabama
1,Baldwin,231767,Alabama
2,Barbour,25226,Alabama
3,Bibb,22284,Alabama
4,Blount,59130,Alabama


In [None]:
def normalize_county(name):
    name = name.lower().strip()
    
    # Remove parenthetical content, e.g., " (city)"
    name = re.sub(r"\s*\(.*?\)\s*", "", name)
    
    # Replace common suffixes
    remove_terms = [
        " county", " parish", " borough", " census area", 
        " independent city", " municipality", " district",
        " planning region", " region"
    ]
    for term in remove_terms:
        name = name.replace(term, "")
    
    # Normalize "st." or "st" to "st"
    name = re.sub(r"\bst\.?", "st", name)
    
    # Remove extra spaces
    name = re.sub(r"\s+", " ", name).strip()
    
    return name

# Apply to dataset
df_keep['County_norm'] = df_keep['County'].apply(normalize_county)

df_county_pop['County_norm'] = df_county_pop['County'].apply(normalize_county)
df_county_pop['State_norm'] = df_county_pop['State'].str.lower().str.strip()

In [29]:
# Make sets of counties (and states) in each dataset
pol_counties = set(zip(df_keep['County_norm'], df_keep['State_norm']))
pop_counties = set(zip(df_county_pop['County_norm'], df_county_pop['State_norm']))

# Count how many (County, State) pairs in df_keep also appear in df_county_pop
matched = pol_counties & pop_counties
print(f"Counties in df found in df_county_pop: {len(matched)}")
print(f"Counties in df NOT found in df_county_pop: {len(pol_counties - pop_counties)}")

Counties in df found in df_county_pop: 88
Counties in df NOT found in df_county_pop: 5


In [30]:
list(pol_counties - pop_counties)[:5]

[('litchfield', 'connecticut'),
 ('hartford', 'connecticut'),
 ('fairfield', 'connecticut'),
 ('new haven', 'connecticut'),
 ('saint clair', 'illinois')]
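The `('saint clair', 'illinois')` miss may come from a spelling difference such as "Saint Clair" versus "St. Clair"; that is an assumption, since the population file's spelling is not shown here. A sketch of a helper (the name `normalize_saint` is hypothetical) that maps both spellings to the same key:

```python
import re


def normalize_saint(name):
    """Map 'Saint X' and 'St. X' to the same 'st x' key."""
    name = name.lower().strip()
    name = re.sub(r"\bsaint\b", "st", name)   # "saint clair" -> "st clair"
    name = re.sub(r"\bst\.", "st", name)      # "st. clair"   -> "st clair"
    return re.sub(r"\s+", " ", name).strip()


print(normalize_saint("Saint Clair"), "|", normalize_saint("St. Clair"))
```

The same two substitutions could be added to `normalize_county` so that both tables go through them.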

In [31]:
# Check df_city_pop for duplicate city-state pairs
dupes_city_pop = df_city_pop[df_city_pop.duplicated(subset=['City_norm', 'State_norm'], keep=False)]
print(f"Duplicate city-state pairs in df_pop: {dupes_city_pop.value_counts().sum()}")
print(dupes_city_pop.sort_values(['City_norm', 'State_norm']))

# Check df_county_pop for duplicate county-state pairs
dupes_county_pop = df_county_pop[df_county_pop.duplicated(subset=['County_norm', 'State_norm'], keep=False)]
print(f"Duplicate county-state pairs in df_cpop: {dupes_county_pop.value_counts().sum()}")
print(dupes_county_pop.sort_values(['County_norm', 'State_norm']))

Duplicate city-state pairs in df_pop: 49
      Population                   City         State              City_norm  \
3269       4,714        Beecher village      Illinois        beecher village   
3270         429       Beecher  village      Illinois        beecher village   
14706        177    Centerville borough  Pennsylvania    centerville borough   
14707      3,257    Centerville borough  Pennsylvania    centerville borough   
16718      2,853            Clarksville         Texas            clarksville   
16719        768            Clarksville         Texas            clarksville   
6686       1,030                   Clay      Kentucky                   clay   
6687       1,194                   Clay      Kentucky                   clay   
14740        129       Coaldale borough  Pennsylvania       coaldale borough   
14741      2,427       Coaldale borough  Pennsylvania       coaldale borough   
18976        234          Genoa village     Wisconsin          genoa village   

In [32]:
df_city_pop_clean = df_city_pop[~df_city_pop.index.isin(dupes_city_pop.index)].copy()
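The cell above drops every row whose city-state key is ambiguous. As a sketch, the same result can be obtained with `drop_duplicates(..., keep=False)`; the toy frame below is illustrative:

```python
import pandas as pd

pop = pd.DataFrame({
    "City_norm": ["clay", "clay", "phoenix"],
    "State_norm": ["kentucky", "kentucky", "arizona"],
    "Population": [1030, 1194, 1608415],
})
keys = ["City_norm", "State_norm"]

# Approach used in the notebook: find all copies, then exclude them by index
dupes = pop[pop.duplicated(subset=keys, keep=False)]
by_index = pop[~pop.index.isin(dupes.index)]

# Equivalent one-liner: keep=False removes every member of a duplicate group
by_keep_false = pop.drop_duplicates(subset=keys, keep=False)

print(by_keep_false["City_norm"].tolist())
```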

In [33]:
# Merge city population
df_merge = df_keep.merge(df_city_pop_clean[['City_norm','State_norm','Population']],
                         on=['City_norm','State_norm'],
                         how='left')
df_merge.rename(columns={'Population':'Population_city'}, inplace=True)

df_merge.shape

(136018, 22)

In [34]:
# Merge county population
df_merge = df_merge.merge(df_county_pop[['County_norm','State_norm','Population']],
                          on=['County_norm','State_norm'],
                          how='left')
df_merge.rename(columns={'Population':'Population_county'}, inplace=True)

df_merge.shape

(136018, 23)
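The unchanged row count (136018) suggests that neither merge duplicated rows. One way to enforce this rather than check it afterwards is the `validate` argument of `merge`. A sketch on toy data:

```python
import pandas as pd

left = pd.DataFrame({
    "County_norm": ["maricopa", "maricopa", "pima"],
    "State_norm": ["arizona", "arizona", "arizona"],
})
right = pd.DataFrame({
    "County_norm": ["maricopa", "pima"],
    "State_norm": ["arizona", "arizona"],
    "Population": [4425315, 1043441],
})
keys = ["County_norm", "State_norm"]

# many_to_one: every key must be unique in the right-hand table
out = left.merge(right, on=keys, how="left", validate="many_to_one")

# A duplicated key on the right now raises instead of silently adding rows
right_dup = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    left.merge(right_dup, on=keys, how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(out), raised)
```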

In [35]:
# Fill final population using city first, then County
df_merge['Population'] = df_merge['Population_city'].fillna(df_merge['Population_county'])

df_merge.shape

(136018, 24)
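The duplicate listing earlier shows population values such as "4,714", which suggests the column may be read as text. If that is the case (an assumption; it depends on the CSV), the values can be converted to numbers before the city-then-county fallback. `to_number` is a hypothetical helper:

```python
import pandas as pd


def to_number(series):
    """Strip thousands separators and convert to numeric (NaN if unparseable)."""
    return pd.to_numeric(series.str.replace(",", "", regex=False), errors="coerce")


frame = pd.DataFrame({
    "Population_city": ["4,714", None],
    "Population_county": ["58800", "231,767"],
})

city = to_number(frame["Population_city"])
county = to_number(frame["Population_county"])

# Use the city value where present, otherwise fall back to the county value
population = city.fillna(county)
print(population.tolist())
```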

In [37]:
print(df_merge["Population_city"].isna().sum())
print(df_merge["Population_county"].isna().sum())
print(df_merge["Population"].isna().sum())

37024
5111
3664


In [38]:
df_merge = df_merge.dropna(subset=['Population'])
df_merge.shape

(132354, 24)

In [39]:
df_merge.head()

Unnamed: 0,Address,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,...,SO2 1st Max Hour,CO Mean,CO 1st Max Value,CO 1st Max Hour,City_norm,State_norm,County_norm,Population_city,Population_county,Population
0,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-01,21.208332,33.0,0,31,0.015083,...,0,1.152632,2.7,5,phoenix,arizona,maricopa,1608415,4425315,1608415
1,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-02,17.208332,38.0,22,36,0.018042,...,7,0.425,0.5,0,phoenix,arizona,maricopa,1608415,4425315,1608415
2,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-03,30.0,47.0,18,44,0.008542,...,20,0.8,1.7,23,phoenix,arizona,maricopa,1608415,4425315,1608415
3,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-04,33.666668,47.0,19,44,0.005458,...,7,1.275,1.9,1,phoenix,arizona,maricopa,1608415,4425315,1608415
4,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-05,31.695652,48.0,18,45,0.008292,...,22,1.045833,1.7,23,phoenix,arizona,maricopa,1608415,4425315,1608415


---

## Add Weather Conditions

In [53]:
county_coords = pd.read_csv("Dataset/Support_files/County_Centroids.csv", encoding='utf-8-sig')
county_coords.head()

Unnamed: 0,state,county,cfips,latitude,longitude
0,Alabama,Autauga County,1001,32.5081,-86.6513
1,Alabama,Baldwin County,1003,30.7725,-87.7842
2,Alabama,Barbour County,1005,31.8832,-85.3931
3,Alabama,Bibb County,1007,33.0388,-87.0967
4,Alabama,Blount County,1009,34.0126,-86.5339


In [54]:
# Clean column names
county_coords.columns = county_coords.columns.str.strip()  # remove stray whitespace; the BOM is already handled by encoding='utf-8-sig'
county_coords.rename(
    columns={
        "state": "State",
        "county": "County",
        "latitude": "Latitude",
        "longitude": "Longitude"
    },
    inplace=True
)

# Preview
county_coords.head()

Unnamed: 0,State,County,cfips,Latitude,Longitude
0,Alabama,Autauga County,1001,32.5081,-86.6513
1,Alabama,Baldwin County,1003,30.7725,-87.7842
2,Alabama,Barbour County,1005,31.8832,-85.3931
3,Alabama,Bibb County,1007,33.0388,-87.0967
4,Alabama,Blount County,1009,34.0126,-86.5339


In [55]:
county_coords = county_coords.drop(["cfips"], axis=1)
county_coords.head()

Unnamed: 0,State,County,Latitude,Longitude
0,Alabama,Autauga County,32.5081,-86.6513
1,Alabama,Baldwin County,30.7725,-87.7842
2,Alabama,Barbour County,31.8832,-85.3931
3,Alabama,Bibb County,33.0388,-87.0967
4,Alabama,Blount County,34.0126,-86.5339


In [58]:
# normalize_county() was defined in the county population step above and is reused here.

# Apply to dataset
df_merge['County_norm'] = df_merge['County'].apply(normalize_county)
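# State names only need lowercasing and trimming; normalize_county would also
# strip terms such as " district" (e.g. from "District Of Columbia")
df_merge['State_norm'] = df_merge['State'].str.lower().str.strip()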

county_coords['County_norm'] = county_coords['County'].apply(normalize_county)
county_coords['State_norm'] = county_coords['State'].str.lower().str.strip()

In [61]:
county_coords.head()

Unnamed: 0,State,County,Latitude,Longitude,County_norm,State_norm
0,Alabama,Autauga County,32.5081,-86.6513,autauga,alabama
1,Alabama,Baldwin County,30.7725,-87.7842,baldwin,alabama
2,Alabama,Barbour County,31.8832,-85.3931,barbour,alabama
3,Alabama,Bibb County,33.0388,-87.0967,bibb,alabama
4,Alabama,Blount County,34.0126,-86.5339,blount,alabama


In [None]:
df_merge.head()

Unnamed: 0,Address,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,...,SO2 1st Max Hour,CO Mean,CO 1st Max Value,CO 1st Max Hour,City_norm,State_norm,County_norm,Population_city,Population_county,Population
0,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-01,21.208332,33.0,0,31,0.015083,...,0,1.152632,2.7,5,phoenix,arizona,maricopa,1608415,4425315,1608415
1,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-02,17.208332,38.0,22,36,0.018042,...,7,0.425,0.5,0,phoenix,arizona,maricopa,1608415,4425315,1608415
2,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-03,30.0,47.0,18,44,0.008542,...,20,0.8,1.7,23,phoenix,arizona,maricopa,1608415,4425315,1608415
3,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-04,33.666668,47.0,19,44,0.005458,...,7,1.275,1.9,1,phoenix,arizona,maricopa,1608415,4425315,1608415
4,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-05,31.695652,48.0,18,45,0.008292,...,22,1.045833,1.7,23,phoenix,arizona,maricopa,1608415,4425315,1608415


In [None]:
# Check how many normalized (State, County) pairs match before merging
matches = df_merge.merge(
    county_coords[['State_norm', 'County_norm']],
    on=['State_norm', 'County_norm'],
    how='inner'
)

print(f"Number of matching counties: {len(matches)}")
print(f"Total counties in df_merge: {len(df_merge)}")
print(f"Match rate: {len(matches) / len(df_merge) * 100:.2f}%")

Number of matching counties: 132354
Total counties in df.merge: 132354
Match rate: 100.00%
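The check above counts merged rows, so a duplicated key in the coordinates table can add rows and hide a missing county. A sketch comparing a row-based rate with a rate over distinct keys, on made-up data:

```python
import pandas as pd

rows = pd.DataFrame({
    "State_norm": ["arizona", "arizona", "illinois"],
    "County_norm": ["maricopa", "maricopa", "saint clair"],
})
# The coordinates table lists the same county twice and lacks the Illinois one
coords = pd.DataFrame({
    "State_norm": ["arizona", "arizona"],
    "County_norm": ["maricopa", "maricopa"],
})
keys = ["State_norm", "County_norm"]

inner = rows.merge(coords, on=keys, how="inner")
row_rate = len(inner) / len(rows)            # inflated by the duplicate key

row_keys = set(zip(rows["State_norm"], rows["County_norm"]))
coord_keys = set(zip(coords["State_norm"], coords["County_norm"]))
key_rate = len(row_keys & coord_keys) / len(row_keys)

print(f"row-based: {row_rate:.2f}, key-based: {key_rate:.2f}")
```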


In [None]:
merged = df_merge.merge(
    county_coords[['State_norm', 'County_norm', 'Latitude', 'Longitude']],
    on=['State_norm', 'County_norm'],
    how='left'
)

merged.head()

Unnamed: 0,Address,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,...,CO 1st Max Value,CO 1st Max Hour,City_norm,State_norm,County_norm,Population_city,Population_county,Population,Latitude,Longitude
0,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-01,21.208332,33.0,0,31,0.015083,...,2.7,5,phoenix,arizona,maricopa,1608415,4425315,1608415,33.2798,-112.7681
1,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-02,17.208332,38.0,22,36,0.018042,...,0.5,0,phoenix,arizona,maricopa,1608415,4425315,1608415,33.2798,-112.7681
2,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-03,30.0,47.0,18,44,0.008542,...,1.7,23,phoenix,arizona,maricopa,1608415,4425315,1608415,33.2798,-112.7681
3,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-04,33.666668,47.0,19,44,0.005458,...,1.9,1,phoenix,arizona,maricopa,1608415,4425315,1608415,33.2798,-112.7681
4,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-05,31.695652,48.0,18,45,0.008292,...,1.7,23,phoenix,arizona,maricopa,1608415,4425315,1608415,33.2798,-112.7681


In [108]:
df_merge["Population"].isna().sum()

0

## Addition of Available Weather Data

In [None]:
df = merged.copy()
df.head()

Unnamed: 0,Address,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,...,CO 1st Max Value,CO 1st Max Hour,City_norm,State_norm,County_norm,Population_city,Population_county,Population,Latitude,Longitude
0,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-01,21.208332,33.0,0,31,0.015083,...,2.7,5,phoenix,arizona,maricopa,1608415,4425315,1608415,33.2798,-112.7681
1,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-02,17.208332,38.0,22,36,0.018042,...,0.5,0,phoenix,arizona,maricopa,1608415,4425315,1608415,33.2798,-112.7681
2,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-03,30.0,47.0,18,44,0.008542,...,1.7,23,phoenix,arizona,maricopa,1608415,4425315,1608415,33.2798,-112.7681
3,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-04,33.666668,47.0,19,44,0.005458,...,1.9,1,phoenix,arizona,maricopa,1608415,4425315,1608415,33.2798,-112.7681
4,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-05,31.695652,48.0,18,45,0.008292,...,1.7,23,phoenix,arizona,maricopa,1608415,4425315,1608415,33.2798,-112.7681


In [72]:
print(df["Latitude"].isna().sum())
print(df["Longitude"].isna().sum())

0
0


In [74]:
df['Date Local'].dtype

dtype('O')

In [76]:
df['Date Local'] = pd.to_datetime(df['Date Local'])
df['Date Local'].dtype

dtype('<M8[ns]')

In [None]:
# Ensure 'Date Local' is datetime
merged['Date Local'] = pd.to_datetime(merged['Date Local'])

# Keep only unique locations
unique_locations = merged[['State', 'County', 'Latitude', 'Longitude']].drop_duplicates()

weather_data = []

for _, loc in unique_locations.iterrows():
    try:
        # Create a Point for the location
        point = Point(loc['Latitude'], loc['Longitude'])
        
        # Determine date range for this location
        dates = merged[(merged['Latitude'] == loc['Latitude']) &
                       (merged['Longitude'] == loc['Longitude'])]
        start = dates['Date Local'].min()
        end = dates['Date Local'].max()
        
        # Find the 10 nearest stations; the loop below picks the first one with data
        stations = Stations().nearby(loc['Latitude'], loc['Longitude'])
        stations = stations.fetch(10)
        
        # Pick the first station with data available for the date range
        station_found = False
        for station_id in stations.index:
            try:
                daily = Daily(station_id, start, end).fetch()
                if not daily.empty:
                    daily['State'] = loc['State']
                    daily['County'] = loc['County']
                    weather_data.append(daily)
                    station_found = True
                    break
            except Exception:
                continue
        
        if not station_found:
            print(f"No data for {loc['County']}, {loc['State']} in the date range.")
    
    except Exception as e:
        print(f"Error processing {loc['County']}, {loc['State']}: {e}")

# Combine all locations
weather_df = pd.concat(weather_data).reset_index()

weather_df.head()

Unnamed: 0,time,tavg,tmin,tmax,prcp,snow,wdir,wspd,wpgt,pres,tsun,State,County
0,2013-01-01,4.6,-2.0,12.0,,,,8.6,,,,Arizona,Maricopa
1,2013-01-02,7.8,1.0,16.0,,,,9.2,,,,Arizona,Maricopa
2,2013-01-03,9.1,,,,,,13.5,,,,Arizona,Maricopa
3,2013-01-04,8.8,,,,,,12.8,,,,Arizona,Maricopa
4,2013-01-05,6.0,-2.0,15.0,,,,9.3,,,,Arizona,Maricopa
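The station fallback in the loop above (try each nearby station, keep the first one that returns data) can be isolated in a small function, which makes it testable without network access. This is a sketch: `first_station_with_data` and `fake_fetch` are hypothetical names, and the fetcher is passed in as an argument.

```python
import pandas as pd


def first_station_with_data(station_ids, fetch_daily):
    """Return (station_id, data) for the first station with non-empty data."""
    for station_id in station_ids:
        try:
            daily = fetch_daily(station_id)
        except Exception:
            continue  # skip stations whose download fails
        if not daily.empty:
            return station_id, daily
    return None, pd.DataFrame()


def fake_fetch(station_id):
    """Stand-in for a real download, for illustration only."""
    if station_id == "broken":
        raise RuntimeError("download failed")
    if station_id == "empty":
        return pd.DataFrame()
    return pd.DataFrame({"tmax": [12.0]}, index=pd.to_datetime(["2013-01-01"]))


chosen, data = first_station_with_data(["broken", "empty", "S1"], fake_fetch)
print(chosen, len(data))
```

In the notebook, the fetcher would be something like `lambda sid: Daily(sid, start, end).fetch()`, matching the call already used in the loop.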


In [88]:
weather_df.head(10)

Unnamed: 0,time,tavg,tmin,tmax,prcp,snow,wdir,wspd,wpgt,pres,tsun,State,County
0,2013-01-01,4.6,-2.0,12.0,,,,8.6,,,,Arizona,Maricopa
1,2013-01-02,7.8,1.0,16.0,,,,9.2,,,,Arizona,Maricopa
2,2013-01-03,9.1,,,,,,13.5,,,,Arizona,Maricopa
3,2013-01-04,8.8,,,,,,12.8,,,,Arizona,Maricopa
4,2013-01-05,6.0,-2.0,15.0,,,,9.3,,,,Arizona,Maricopa
5,2013-01-06,7.7,0.0,17.0,,,,7.9,,,,Arizona,Maricopa
6,2013-01-07,7.8,2.0,16.0,,,,6.9,,,,Arizona,Maricopa
7,2013-01-08,10.6,2.0,19.0,,,,8.8,,,,Arizona,Maricopa
8,2013-01-09,10.9,3.0,21.0,,,,8.0,,,,Arizona,Maricopa
9,2013-01-10,9.4,,,,,,12.3,,,,Arizona,Maricopa


In [92]:
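# Use single quotes inside the f-strings: nested double quotes are a
# syntax error before Python 3.12
print(f"Missing tmax {weather_df['tmax'].isna().sum()}")
print(f"Missing wspd {weather_df['wspd'].isna().sum()}")
print(f"Missing prcp {weather_df['prcp'].isna().sum()}")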

Missing tmax 4209
Missing wspd 5972
Missing prcp 40431


In [None]:
# Merge with original data to align exact dates
final_df = df.merge(
    weather_df,
    left_on=['State', 'County', 'Date Local'],
    right_on=['State', 'County', 'time'],
    how='left'
)

# Drop the 'time' column, which duplicates 'Date Local' after the merge
final_df = final_df.drop(columns=['time'])
pd.set_option("display.max_columns", None)  # show every column in previews
final_df.head()

In [97]:
# For your main DataFrame
print("df date range:", df['Date Local'].min(), "to", df['Date Local'].max())

# For the weather DataFrame
print("weather_df date range:", weather_df['time'].min(), "to", weather_df['time'].max())

df date range: 2012-01-01 00:00:00 to 2016-05-31 00:00:00
weather_df date range: 2012-01-01 00:00:00 to 2016-05-31 00:00:00


In [100]:
pd.set_option("display.max_columns", None)
final_df.head(1)

Unnamed: 0,Address,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,O3 1st Max Value,O3 1st Max Hour,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,CO Mean,CO 1st Max Value,CO 1st Max Hour,City_norm,State_norm,County_norm,Population_city,Population_county,Population,Latitude,Longitude,tavg,tmin,tmax,prcp,snow,wdir,wspd,wpgt,pres,tsun
0,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-01,21.208332,33.0,0,31,0.015083,0.028,11,1.458333,5.0,0,1.152632,2.7,5,phoenix,arizona,maricopa,1608415,4425315,1608415,33.2798,-112.7681,,,,,,,,,,


In [101]:
final_df = final_df.drop(columns=["tavg", "tmin", "snow", "wdir", "wpgt", "pres", "tsun"])
final_df.head()

Unnamed: 0,Address,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,O3 1st Max Value,O3 1st Max Hour,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,CO Mean,CO 1st Max Value,CO 1st Max Hour,City_norm,State_norm,County_norm,Population_city,Population_county,Population,Latitude,Longitude,tmax,prcp,wspd
0,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-01,21.208332,33.0,0,31,0.015083,0.028,11,1.458333,5.0,0,1.152632,2.7,5,phoenix,arizona,maricopa,1608415,4425315,1608415,33.2798,-112.7681,,,
1,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-02,17.208332,38.0,22,36,0.018042,0.034,9,0.416667,2.0,7,0.425,0.5,0,phoenix,arizona,maricopa,1608415,4425315,1608415,33.2798,-112.7681,,,
2,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-03,30.0,47.0,18,44,0.008542,0.024,10,2.25,6.0,20,0.8,1.7,23,phoenix,arizona,maricopa,1608415,4425315,1608415,33.2798,-112.7681,,,
3,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-04,33.666668,47.0,19,44,0.005458,0.016,10,2.791667,5.0,7,1.275,1.9,1,phoenix,arizona,maricopa,1608415,4425315,1608415,33.2798,-112.7681,,,
4,1645 E ROOSEVELT ST-CENTRAL PHOENIX STN,Arizona,Maricopa,Phoenix,2012-01-05,31.695652,48.0,18,45,0.008292,0.024,9,3.043478,7.0,22,1.045833,1.7,23,phoenix,arizona,maricopa,1608415,4425315,1608415,33.2798,-112.7681,,,


In [104]:
# Count the missing values in each remaining weather column
print(f"Rows with missing tmax {final_df['tmax'].isna().sum()}")
print(f"Rows with missing wspd {final_df['wspd'].isna().sum()}")
print(f"Rows with missing prcp {final_df['prcp'].isna().sum()}")

Rows with missing tmax 15310
Rows with missing wspd 15009
Rows with missing prcp 57047
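
The three counts above can also be produced in a single call, alongside the share of rows they represent. This is a minimal sketch on a small made-up frame: `toy` and its values are illustrative stand-ins for `final_df`, not data from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for final_df, holding only the three weather columns
toy = pd.DataFrame({
    "tmax": [26.7, np.nan, 26.1, 24.4],
    "prcp": [0.0, np.nan, np.nan, 0.0],
    "wspd": [17.6, 27.4, np.nan, 9.0],
})

weather_cols = ["tmax", "prcp", "wspd"]

# isna() gives a boolean frame; sum() counts the True values per column,
# mean() gives the same thing as a fraction of all rows
missing_count = toy[weather_cols].isna().sum()
missing_share = toy[weather_cols].isna().mean()

print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

Looking at the share as well as the count helps when deciding whether dropping rows is acceptable or whether a column should be dropped instead.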


In [105]:
# Drop every row that has a missing value in any column, then check the remaining shape
final_df = final_df.dropna()
final_df.shape

(53436, 29)
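
Called without arguments, `dropna()` removes a row if any of its columns is missing. If the intent is to filter on the weather columns only, passing `subset` makes that explicit. The sketch below uses a made-up frame (`toy` is hypothetical); whether the two calls give different results on the real data depends on whether the non-weather columns contain missing values, which this notebook does not show.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 lacks all weather values, row 2 lacks a city,
# row 3 lacks only prcp
toy = pd.DataFrame({
    "City": ["Phoenix", "Tucson", None, "Tucson"],
    "tmax": [np.nan, 26.7, 24.4, 26.1],
    "prcp": [np.nan, 0.0, 0.0, np.nan],
    "wspd": [np.nan, 17.6, 27.4, 10.8],
})

# Drops a row if ANY column is missing
dropped_any = toy.dropna()

# Drops a row only if one of the listed weather columns is missing
dropped_weather = toy.dropna(subset=["tmax", "prcp", "wspd"])

print(dropped_any.shape)
print(dropped_weather.shape)
print(list(dropped_weather.index))
```

Both calls keep the original index labels of the surviving rows, which is why the next `head()` output starts at 638 rather than 0.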

In [109]:
final_df.head()

Unnamed: 0,Address,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,O3 1st Max Value,O3 1st Max Hour,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,CO Mean,CO 1st Max Value,CO 1st Max Hour,City_norm,State_norm,County_norm,Population_city,Population_county,Population,Latitude,Longitude,tmax,prcp,wspd
638,400 W RIVER ROAD,Arizona,Pima,Tucson,2012-01-01,17.716667,31.0,0,29,0.013667,0.03,10,0.254167,0.5,19,0.336842,0.6,5,tucson,arizona,pima,542649,1043441,542649,31.9681,-111.7806,26.7,0.0,17.6
639,400 W RIVER ROAD,Arizona,Pima,Tucson,2012-01-02,15.0625,30.6,18,28,0.015083,0.03,10,0.2,0.6,19,0.225,0.4,23,tucson,arizona,pima,542649,1043441,542649,31.9681,-111.7806,24.4,0.0,27.4
640,400 W RIVER ROAD,Arizona,Pima,Tucson,2012-01-03,21.643478,31.0,18,29,0.011417,0.026,9,0.295455,0.7,8,0.295833,0.4,0,tucson,arizona,pima,542649,1043441,542649,31.9681,-111.7806,26.1,0.0,10.8
641,400 W RIVER ROAD,Arizona,Pima,Tucson,2012-01-04,25.041668,37.8,10,35,0.009208,0.02,10,0.7375,2.1,19,0.345833,0.5,12,tucson,arizona,pima,542649,1043441,542649,31.9681,-111.7806,24.4,0.0,9.0
642,400 W RIVER ROAD,Arizona,Pima,Tucson,2012-01-05,21.981817,37.1,17,35,0.013042,0.031,9,0.330435,0.8,21,0.291667,0.6,23,tucson,arizona,pima,542649,1043441,542649,31.9681,-111.7806,23.9,0.0,9.7


---

## Save DataFrames to CSV Files

In [113]:
# Columns for the population + weather dataset.
# Leaves out the *_norm helper columns and the separate
# Population_city / Population_county columns.
keep_col = [
    "Address",
    "State",
    "County",
    "City",
    "Date Local",
    "NO2 Mean",
    "NO2 1st Max Value",
    "NO2 1st Max Hour",
    "NO2 AQI",
    "O3 Mean",
    "O3 1st Max Value",
    "O3 1st Max Hour",
    "SO2 Mean",
    "SO2 1st Max Value",
    "SO2 1st Max Hour",
    "CO Mean",
    "CO 1st Max Value",
    "CO 1st Max Hour",
    "Population",
    "Latitude",
    "Longitude",
    "tmax",
    "prcp",
    "wspd",
]

df_keep = final_df[keep_col]
df_keep.head()

Unnamed: 0,Address,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,O3 1st Max Value,O3 1st Max Hour,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,CO Mean,CO 1st Max Value,CO 1st Max Hour,Population,Latitude,Longitude,tmax,prcp,wspd
638,400 W RIVER ROAD,Arizona,Pima,Tucson,2012-01-01,17.716667,31.0,0,29,0.013667,0.03,10,0.254167,0.5,19,0.336842,0.6,5,542649,31.9681,-111.7806,26.7,0.0,17.6
639,400 W RIVER ROAD,Arizona,Pima,Tucson,2012-01-02,15.0625,30.6,18,28,0.015083,0.03,10,0.2,0.6,19,0.225,0.4,23,542649,31.9681,-111.7806,24.4,0.0,27.4
640,400 W RIVER ROAD,Arizona,Pima,Tucson,2012-01-03,21.643478,31.0,18,29,0.011417,0.026,9,0.295455,0.7,8,0.295833,0.4,0,542649,31.9681,-111.7806,26.1,0.0,10.8
641,400 W RIVER ROAD,Arizona,Pima,Tucson,2012-01-04,25.041668,37.8,10,35,0.009208,0.02,10,0.7375,2.1,19,0.345833,0.5,12,542649,31.9681,-111.7806,24.4,0.0,9.0
642,400 W RIVER ROAD,Arizona,Pima,Tucson,2012-01-05,21.981817,37.1,17,35,0.013042,0.031,9,0.330435,0.8,21,0.291667,0.6,23,542649,31.9681,-111.7806,23.9,0.0,9.7


In [114]:
df_keep.to_csv("Dataset/EDA/pollution_us_2012_2016-population-weather.csv", index=False)
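
With `index=False`, the index labels (638, 639, ...) are not written to the file, so the CSV holds only the selected columns. A minimal round-trip sketch, using a made-up two-row frame and a temporary file rather than the real output path:

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with non-zero-based index labels, as left behind by dropna()
toy = pd.DataFrame(
    {"City": ["Tucson", "Tucson"], "tmax": [26.7, 24.4]},
    index=[638, 639],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    toy.to_csv(path, index=False)   # index labels are not written
    reloaded = pd.read_csv(path)    # gets a fresh 0-based index

print(list(reloaded.columns))
print(list(reloaded.index))
```

Reading the file back this way avoids the extra `Unnamed: 0` column that appears when a CSV saved with its index is loaded without `index_col`.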

---

In [115]:
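# Columns for the population-only dataset: the same selection as before,
# without the weather columns (tmax, prcp, wspd).
# Note: final_df has already had rows with missing values dropped,
# so this file contains only rows that also had weather data.
keep_col = [
    "Address",
    "State",
    "County",
    "City",
    "Date Local",
    "NO2 Mean",
    "NO2 1st Max Value",
    "NO2 1st Max Hour",
    "NO2 AQI",
    "O3 Mean",
    "O3 1st Max Value",
    "O3 1st Max Hour",
    "SO2 Mean",
    "SO2 1st Max Value",
    "SO2 1st Max Hour",
    "CO Mean",
    "CO 1st Max Value",
    "CO 1st Max Hour",
    "Population",
    "Latitude",
    "Longitude",
]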

df_keep = final_df[keep_col]
df_keep.head()

Unnamed: 0,Address,State,County,City,Date Local,NO2 Mean,NO2 1st Max Value,NO2 1st Max Hour,NO2 AQI,O3 Mean,O3 1st Max Value,O3 1st Max Hour,SO2 Mean,SO2 1st Max Value,SO2 1st Max Hour,CO Mean,CO 1st Max Value,CO 1st Max Hour,Population,Latitude,Longitude
638,400 W RIVER ROAD,Arizona,Pima,Tucson,2012-01-01,17.716667,31.0,0,29,0.013667,0.03,10,0.254167,0.5,19,0.336842,0.6,5,542649,31.9681,-111.7806
639,400 W RIVER ROAD,Arizona,Pima,Tucson,2012-01-02,15.0625,30.6,18,28,0.015083,0.03,10,0.2,0.6,19,0.225,0.4,23,542649,31.9681,-111.7806
640,400 W RIVER ROAD,Arizona,Pima,Tucson,2012-01-03,21.643478,31.0,18,29,0.011417,0.026,9,0.295455,0.7,8,0.295833,0.4,0,542649,31.9681,-111.7806
641,400 W RIVER ROAD,Arizona,Pima,Tucson,2012-01-04,25.041668,37.8,10,35,0.009208,0.02,10,0.7375,2.1,19,0.345833,0.5,12,542649,31.9681,-111.7806
642,400 W RIVER ROAD,Arizona,Pima,Tucson,2012-01-05,21.981817,37.1,17,35,0.013042,0.031,9,0.330435,0.8,21,0.291667,0.6,23,542649,31.9681,-111.7806


In [116]:
df_keep.to_csv("Dataset/EDA/pollution_us_2012_2016-population.csv", index=False)
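
The two column lists used for the saved files differ only by the three weather columns, so the second list could be derived from the first instead of being typed out again. A minimal sketch with shortened, illustrative lists (`full_cols`, `weather_cols` and `population_cols` are names introduced here, not from the notebook):

```python
# Shortened, hypothetical version of the population + weather column list
full_cols = [
    "Address", "Date Local", "NO2 Mean",
    "Population", "Latitude", "Longitude",
    "tmax", "prcp", "wspd",
]

# Weather columns to leave out of the population-only file
weather_cols = {"tmax", "prcp", "wspd"}

# Keep the original order, skipping the weather columns
population_cols = [c for c in full_cols if c not in weather_cols]

print(population_cols)
```

Deriving one list from the other keeps the two files in step if a pollutant column is later added or removed.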

---