# **(ADD THE NOTEBOOK NAME HERE)**

## Objectives

* Write your notebook objective here, for example, "Fetch data from Kaggle and save as raw data", or "engineer features for modelling"

## Inputs

* Write down which data or information you need to run the notebook 

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* If you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\ngubo\\Documents\\vscode-projects\\US_Air_Pollution_Team_2\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\ngubo\\Documents\\vscode-projects\\US_Air_Pollution_Team_2'
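Re-running the `os.chdir` cell above moves one level further up each time it is executed. A minimal sketch of a guarded version follows; the helper name `move_to_project_root` and the default folder name `jupyter_notebooks` (taken from the path printed above) are assumptions for illustration, not part of the template.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Change to the parent folder only if we are still inside the notebooks folder.

    The folder name is an assumption; adjust it to match your project layout.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()


# Safe to run more than once: the second call leaves the directory unchanged
print(move_to_project_root())
```

Because the function checks the folder name first, restarting and re-running the notebook from any cell should not push the working directory above the project root.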

# Section 1

Section 1 content

Import the libraries and load the CSV file into a DataFrame

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('Dataset/Raw/pollution_us_2000_2016.csv') # Reading the CSV file
df.head() # Displaying the first 5 rows of the dataframe

|  | Unnamed: 0 | State Code | County Code | Site Num | Address | State | County | City | Date Local | NO2 Units | ... | SO2 Units | SO2 Mean | SO2 1st Max Value | SO2 1st Max Hour | SO2 AQI | CO Units | CO Mean | CO 1st Max Value | CO 1st Max Hour | CO AQI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 4 | 13 | 3002 | 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN | Arizona | Maricopa | Phoenix | 2000-01-01 | Parts per billion | ... | Parts per billion | 3.0 | 9.0 | 21 | 13.0 | Parts per million | 1.145833 | 4.2 | 21 | NaN |
| 1 | 1 | 4 | 13 | 3002 | 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN | Arizona | Maricopa | Phoenix | 2000-01-01 | Parts per billion | ... | Parts per billion | 3.0 | 9.0 | 21 | 13.0 | Parts per million | 0.878947 | 2.2 | 23 | 25.0 |
| 2 | 2 | 4 | 13 | 3002 | 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN | Arizona | Maricopa | Phoenix | 2000-01-01 | Parts per billion | ... | Parts per billion | 2.975 | 6.6 | 23 | NaN | Parts per million | 1.145833 | 4.2 | 21 | NaN |
| 3 | 3 | 4 | 13 | 3002 | 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN | Arizona | Maricopa | Phoenix | 2000-01-01 | Parts per billion | ... | Parts per billion | 2.975 | 6.6 | 23 | NaN | Parts per million | 0.878947 | 2.2 | 23 | 25.0 |
| 4 | 4 | 4 | 13 | 3002 | 1645 E ROOSEVELT ST-CENTRAL PHOENIX STN | Arizona | Maricopa | Phoenix | 2000-01-02 | Parts per billion | ... | Parts per billion | 1.958333 | 3.0 | 22 | 4.0 | Parts per million | 0.85 | 1.6 | 23 | NaN |


In [5]:
df.info() # Display information about the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1746661 entries, 0 to 1746660
Data columns (total 29 columns):
 #   Column             Dtype  
---  ------             -----  
 0   Unnamed: 0         int64  
 1   State Code         int64  
 2   County Code        int64  
 3   Site Num           int64  
 4   Address            object 
 5   State              object 
 6   County             object 
 7   City               object 
 8   Date Local         object 
 9   NO2 Units          object 
 10  NO2 Mean           float64
 11  NO2 1st Max Value  float64
 12  NO2 1st Max Hour   int64  
 13  NO2 AQI            int64  
 14  O3 Units           object 
 15  O3 Mean            float64
 16  O3 1st Max Value   float64
 17  O3 1st Max Hour    int64  
 18  O3 AQI             int64  
 19  SO2 Units          object 
 20  SO2 Mean           float64
 21  SO2 1st Max Value  float64
 22  SO2 1st Max Hour   int64  
 23  SO2 AQI            float64
 24  CO Units           object 
 25  CO Mean           

In [6]:
df.isnull().sum() # Check for missing values in each column

Unnamed: 0                0
State Code                0
County Code               0
Site Num                  0
Address                   0
State                     0
County                    0
City                      0
Date Local                0
NO2 Units                 0
NO2 Mean                  0
NO2 1st Max Value         0
NO2 1st Max Hour          0
NO2 AQI                   0
O3 Units                  0
O3 Mean                   0
O3 1st Max Value          0
O3 1st Max Hour           0
O3 AQI                    0
SO2 Units                 0
SO2 Mean                  0
SO2 1st Max Value         0
SO2 1st Max Hour          0
SO2 AQI              872907
CO Units                  0
CO Mean                   0
CO 1st Max Value          0
CO 1st Max Hour           0
CO AQI               873323
dtype: int64
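The raw counts above show that `SO2 AQI` and `CO AQI` are each missing in roughly half of the rows. A small sketch of how the same check could also report percentages is given here; the helper `missing_summary` and the toy values are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary[summary["missing"] > 0]


# Toy data standing in for the pollution DataFrame
toy = pd.DataFrame({
    "SO2 AQI": [13.0, np.nan, np.nan, 4.0],
    "CO AQI": [np.nan, 25.0, np.nan, 25.0],
    "NO2 AQI": [46, 34, 46, 34],
})
print(missing_summary(toy))
```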

In [7]:
df.duplicated().sum() # Check for duplicate rows

0

In [8]:
# drop unnecessary columns
df.drop(columns=['Unnamed: 0'], inplace=True) # Drop the 'Unnamed: 0' column

In [9]:
# Removes leading and trailing whitespace characters from string columns
df['State'] = df['State'].str.strip()
df['County'] = df['County'].str.strip()
df['City'] = df['City'].str.strip()


In [10]:
# Convert columns to appropriate data types to avoid memory issues
df['State'] = df['State'].astype('category')
df['County'] = df['County'].astype('category')
df['City'] = df['City'].astype('category')
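To see why the `category` conversion helps, the sketch below compares the memory footprint of a repetitive string column before and after conversion. The toy Series is an assumption standing in for columns such as `State`, where a small set of values repeats many times.

```python
import pandas as pd

# A column with few distinct values repeated many times, stored as Python strings
toy_states = pd.Series(
    ["Arizona", "California", "Arizona", "Texas"] * 1000, dtype="object"
)

as_object = toy_states.memory_usage(deep=True)
as_category = toy_states.astype("category").memory_usage(deep=True)

print(f"object: {as_object} bytes, category: {as_category} bytes")
```

A categorical column stores each distinct label once plus a small integer code per row, so the saving grows with the number of repeated values.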


In [11]:
df['Date Local'] = pd.to_datetime(df['Date Local'], errors='coerce') # Convert 'Date Local' to datetime format
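With `errors='coerce'`, any value that cannot be parsed becomes `NaT` without raising an error, so it can be worth counting how many values were affected. The following is a sketch on made-up values, not on the project data.

```python
import pandas as pd

# Toy input: two valid dates, one unparseable string, one genuinely missing value
raw_dates = pd.Series(["2000-01-01", "2000-01-02", "not a date", None])
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Values that were present in the input but became NaT after parsing
coerced = parsed.isna() & raw_dates.notna()
print(f"{coerced.sum()} value(s) could not be parsed")
```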

In [13]:
# Convert numerical columns to more memory-efficient types
for col in df.select_dtypes(include='float64').columns:
    df[col] = pd.to_numeric(df[col], downcast='float')
for col in df.select_dtypes(include='int64').columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')

df.info(memory_usage='deep') # Display updated information about the DataFrame (info() prints directly and returns None)



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1746661 entries, 0 to 1746660
Data columns (total 28 columns):
 #   Column             Dtype         
---  ------             -----         
 0   State Code         int8          
 1   County Code        int16         
 2   Site Num           int16         
 3   Address            object        
 4   State              category      
 5   County             category      
 6   City               category      
 7   Date Local         datetime64[ns]
 8   NO2 Units          object        
 9   NO2 Mean           float32       
 10  NO2 1st Max Value  float32       
 11  NO2 1st Max Hour   int8          
 12  NO2 AQI            int16         
 13  O3 Units           object        
 14  O3 Mean            float32       
 15  O3 1st Max Value   float32       
 16  O3 1st Max Hour    int8          
 17  O3 AQI             int16         
 18  SO2 Units          object        
 19  SO2 Mean           float32       
 20  SO2 1st Max Value  float
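The downcasting loop can be tried on a small frame to see which types pandas picks. In this sketch the column names mirror the dataset but the values are invented; note that `float32` keeps only about seven significant digits, which is an accepted trade-off here rather than a free saving.

```python
import pandas as pd

# Toy frame with the default 64-bit types
toy = pd.DataFrame({
    "State Code": pd.Series([4, 6, 48], dtype="int64"),
    "NO2 AQI": pd.Series([46, 34, 132], dtype="int64"),
    "NO2 Mean": pd.Series([19.5, 22.25, 38.125], dtype="float64"),
})
before = toy.memory_usage(deep=True).sum()

# Same pattern as the notebook cell: pick the smallest type that fits the values
for col in toy.select_dtypes(include="float64").columns:
    toy[col] = pd.to_numeric(toy[col], downcast="float")
for col in toy.select_dtypes(include="int64").columns:
    toy[col] = pd.to_numeric(toy[col], downcast="integer")

after = toy.memory_usage(deep=True).sum()
print(toy.dtypes)
print(f"memory: {before} -> {after} bytes")
```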

In [14]:
# Check for missing values in critical columns
critical_cols = ['NO2 Mean', 'O3 Mean', 'SO2 Mean', 'CO Mean', 'NO2 AQI', 'O3 AQI', 'SO2 AQI', 'CO AQI']
nan_counts = df[critical_cols].isnull().sum()
print(nan_counts)


NO2 Mean         0
O3 Mean          0
SO2 Mean         0
CO Mean          0
NO2 AQI          0
O3 AQI           0
SO2 AQI     872907
CO AQI      873323
dtype: int64


In [15]:
df = df.dropna() # Remove rows containing NaN values
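Because `dropna()` with no arguments removes every row that has a missing value in any column, it can discard a large share of the data when two columns are each about half empty. The sketch below, on invented values, compares it with the `subset` parameter, which restricts the check to chosen columns; whether that alternative suits this project is a judgment call.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "NO2 AQI": [46, 34, 46, 34],
    "SO2 AQI": [13.0, np.nan, np.nan, 4.0],
    "CO AQI": [np.nan, 25.0, np.nan, 25.0],
})

# Drop rows with a NaN in any column (what the notebook cell does)
all_complete = toy.dropna()

# Drop rows only when the listed column is missing
so2_complete = toy.dropna(subset=["SO2 AQI"])

print(f"rows: {len(toy)} -> {len(all_complete)} (any NaN dropped)")
print(f"rows: {len(toy)} -> {len(so2_complete)} (only SO2 AQI required)")
```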

In [16]:
# Keep only rows where date is between Jan 1, 2012 and Dec 31, 2016
df = df[(df['Date Local'] >= '2012-01-01') & (df['Date Local'] <= '2016-12-31')]

# confirm date range
print(df['Date Local'].min(), df['Date Local'].max())


2012-01-01 00:00:00 2016-05-31 00:00:00
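The same inclusive date filter can also be written with `Series.between`, which some readers find easier to scan than two chained comparisons. This is a sketch on a toy frame with invented rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date Local": pd.to_datetime(
        ["2011-12-31", "2012-01-01", "2016-05-31", "2017-01-01"]
    ),
    "NO2 AQI": [30, 46, 34, 20],
})

# between() includes both endpoints by default
in_range = toy["Date Local"].between("2012-01-01", "2016-12-31")
filtered = toy[in_range]

print(filtered["Date Local"].min(), filtered["Date Local"].max())
```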


In [17]:
df.info() # Final check of the DataFrame information

<class 'pandas.core.frame.DataFrame'>
Index: 136334 entries, 1201631 to 1746658
Data columns (total 28 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   State Code         136334 non-null  int8          
 1   County Code        136334 non-null  int16         
 2   Site Num           136334 non-null  int16         
 3   Address            136334 non-null  object        
 4   State              136334 non-null  category      
 5   County             136334 non-null  category      
 6   City               136334 non-null  category      
 7   Date Local         136334 non-null  datetime64[ns]
 8   NO2 Units          136334 non-null  object        
 9   NO2 Mean           136334 non-null  float32       
 10  NO2 1st Max Value  136334 non-null  float32       
 11  NO2 1st Max Hour   136334 non-null  int8          
 12  NO2 AQI            136334 non-null  int16         
 13  O3 Units           136334 non-null  object

In [19]:
df.to_csv('Dataset/Processed/pollution_us_2012_2016-cleaned.csv', index=False) # Save the cleaned DataFrame to a new CSV file
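CSV files do not store dtypes, so the `category`, `datetime64` and downcast integer types set earlier are lost when the file is read back with default settings. The sketch below shows one way to restore them on load and to make sure the output folder exists first; the temporary folder, file name and toy values are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "State": pd.Series(["Arizona", "Texas"], dtype="category"),
    "Date Local": pd.to_datetime(["2012-01-01", "2016-05-31"]),
    "NO2 AQI": pd.Series([46, 34], dtype="int16"),
})

# Create the output folder if it is missing (a temporary location in this sketch)
out_dir = os.path.join(tempfile.mkdtemp(), "Processed")
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, "toy-cleaned.csv")
toy.to_csv(out_path, index=False)

# Re-apply the intended types while reading the file back
reloaded = pd.read_csv(
    out_path,
    parse_dates=["Date Local"],
    dtype={"State": "category", "NO2 AQI": "int16"},
)
print(reloaded.dtypes)
```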

---

# Section 2

Section 2 content

---

NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook cells should run top-down: do not create a workflow in which, at a given point, you need to go back to a previous cell to execute a task (for example, to refresh the content of a variable)

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder is created
except Exception as e:
  print(e)