# Predicting Air Quality in Hetauda Using Machine Learning
This project aims to predict the Air Quality Index (AQI) in Hetauda from meteorological data. We compare two models, Linear Regression and a Neural Network, on their predictive performance.

**Author:** Anmol Lamichhane
**Co-Author:** Bibek Pokhrel

This notebook covers the data collection, cleaning, and exploratory data analysis for the project.


In [14]:
import pandas as pd

In [15]:
# Raw inputs: weather parameters (df1) and air-quality readings (df2)
data_dir = "C:/Users/anmol/OneDrive/Documents/PROJECTS/Hetauda_Air_Quality_Prediction/data/raw"
df1 = pd.read_csv(f"{data_dir}/hetauda-raw-weather-parameters.csv")
df2 = pd.read_csv(f"{data_dir}/hetauda-raw-air-quality.csv")

In [16]:
df1.head()

Unnamed: 0,time,temperature_max,temperature_min,dew_point_2m_mean (°C),wind_speed_10m_mean (km/h),surface_pressure_mean (hPa),relative_humidity_2m_mean (%)
0,1/1/2022,19.4,9.0,9.3,3.9,965.7,76
1,1/2/2022,20.4,7.8,8.9,4.1,964.2,75
2,1/3/2022,20.4,9.0,8.3,4.5,962.5,71
3,1/4/2022,19.6,8.8,9.3,4.3,963.1,73
4,1/5/2022,20.3,9.3,9.5,4.3,962.9,73


In [17]:
df2.head()

Unnamed: 0,date,pm25,pm10
0,2025/8/1,36,13.0
1,2025/8/2,44,14.0
2,2025/8/3,49,
3,2025/7/1,36,9.0
4,2025/7/2,29,10.0


In [18]:
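# Reverse the row order (the file lists the newest readings first).
# .copy() makes an independent frame, so later column assignments do not act on a view of df2.
df2_reversed = df2.iloc[::-1].copy()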

In [19]:
df2_reversed.head()

Unnamed: 0,date,pm25,pm10
906,2021/12/27,,23
905,2022/3/30,,20
904,2022/9/28,,5
903,2022/9/16,,4
902,2022/9/7,,6


In [20]:
df2_reversed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 907 entries, 906 to 0
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    907 non-null    object
 1    pm25   907 non-null    object
 2    pm10   907 non-null    object
dtypes: object(3)
memory usage: 21.4+ KB
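
The `info()` output above lists `pm25` and `pm10` with an extra leading space and reports every column as `object`. This suggests that the headers contain whitespace and that missing readings are stored as blank strings rather than NaN. The following is a minimal sketch of one way to clean this up, assuming that is indeed the cause. The `sample` frame and its values are hypothetical stand-ins for the raw file, not the project's data.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the raw file: headers carry a leading space
# and missing readings appear as blank strings (an assumption about the file).
raw = io.StringIO(
    "date, pm25, pm10\n"
    "2025/8/1, 36, 13\n"
    "2025/8/3, 49, \n"
    "2021/12/27, , 23\n"
)
sample = pd.read_csv(raw)

# Strip stray whitespace from the column names (' pm25' -> 'pm25').
sample.columns = sample.columns.str.strip()

# Strip whitespace from the values, then convert to numbers;
# anything that cannot be parsed (such as an empty string) becomes NaN.
for col in ["pm25", "pm10"]:
    cleaned = sample[col].astype(str).str.strip()
    sample[col] = pd.to_numeric(cleaned, errors="coerce")
```

After this step `sample.info()` should report the two pollutant columns as `float64`, with non-null counts that reflect the real number of readings.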


In [21]:
df1.rename(columns={'time': 'date'}, inplace=True)

In [22]:
df2_reversed['date'] = pd.to_datetime(df2_reversed['date'])
df1['date'] = pd.to_datetime(df1['date'])
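
The two files appear to use different date layouts: month/day/year in the weather data (`1/2/2022`) and year/month/day in the air-quality data (`2025/8/1`). The reversed air-quality frame also begins with 2021/12/27, 2022/3/30, 2022/9/28, so reversing alone does not seem to give chronological order. The sketch below shows one possible way to handle both points, assuming those layouts hold for every row. The `weather` and `air` frames are hypothetical examples.

```python
import pandas as pd

# Hypothetical rows in the two date layouts seen in the raw files.
weather = pd.DataFrame({"date": ["1/1/2022", "1/2/2022", "1/3/2022"]})
air = pd.DataFrame({"date": ["2022/9/28", "2021/12/27", "2022/3/30"]})

# An explicit format removes any day/month ambiguity during parsing.
weather["date"] = pd.to_datetime(weather["date"], format="%m/%d/%Y")
air["date"] = pd.to_datetime(air["date"], format="%Y/%m/%d")

# Sorting by the parsed date gives chronological order
# even when the row order in the file is irregular.
air = air.sort_values("date").reset_index(drop=True)
```

Sorting after parsing does not depend on how the source file happens to be ordered, which makes it a safer basis for later time-based steps than reversing the rows.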

In [23]:
df1.info()
df2_reversed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1096 entries, 0 to 1095
Data columns (total 7 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   date                           1096 non-null   datetime64[ns]
 1   temperature_max                1096 non-null   float64       
 2   temperature_min                1096 non-null   float64       
 3   dew_point_2m_mean (°C)         1096 non-null   float64       
 4   wind_speed_10m_mean (km/h)     1096 non-null   float64       
 5   surface_pressure_mean (hPa)    1096 non-null   float64       
 6   relative_humidity_2m_mean (%)  1096 non-null   int64         
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 60.1 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 907 entries, 906 to 0
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    907 non-null    

In [24]:
df1.head()

Unnamed: 0,date,temperature_max,temperature_min,dew_point_2m_mean (°C),wind_speed_10m_mean (km/h),surface_pressure_mean (hPa),relative_humidity_2m_mean (%)
0,2022-01-01,19.4,9.0,9.3,3.9,965.7,76
1,2022-01-02,20.4,7.8,8.9,4.1,964.2,75
2,2022-01-03,20.4,9.0,8.3,4.5,962.5,71
3,2022-01-04,19.6,8.8,9.3,4.3,963.1,73
4,2022-01-05,20.3,9.3,9.5,4.3,962.9,73


In [25]:
df2_reversed.head()

Unnamed: 0,date,pm25,pm10
906,2021-12-27,,23
905,2022-03-30,,20
904,2022-09-28,,5
903,2022-09-16,,4
902,2022-09-07,,6


In [27]:
# Outer join on date: keeps every date that appears in either table
df_raw_merged = pd.merge(df1, df2_reversed, how='outer', on='date')

In [28]:
df_raw_merged.head()

Unnamed: 0,date,temperature_max,temperature_min,dew_point_2m_mean (°C),wind_speed_10m_mean (km/h),surface_pressure_mean (hPa),relative_humidity_2m_mean (%),pm25,pm10
0,2021-12-27,,,,,,,,23
1,2021-12-28,,,,,,,73.0,11
2,2021-12-29,,,,,,,43.0,6
3,2021-12-30,,,,,,,28.0,12
4,2021-12-31,,,,,,,44.0,14
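
The first merged rows have air-quality readings but no weather values, because the outer join keeps dates found in only one of the tables. One way to see how many dates fall into each group is the `indicator` option of `pd.merge`. The sketch below uses small hypothetical frames with made-up values; it is not the project's data.

```python
import pandas as pd

# Hypothetical frames: one date is shared, one is unique to each side.
weather = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-02"]),
    "temperature_max": [19.4, 20.4],
})
air = pd.DataFrame({
    "date": pd.to_datetime(["2021-12-31", "2022-01-01"]),
    "pm25": [44.0, 50.0],
})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
merged = pd.merge(weather, air, how="outer", on="date", indicator=True)

# Count how many dates come from each source.
coverage = merged["_merge"].value_counts()

# Keep only dates that have both weather and air-quality data.
complete = merged[merged["_merge"] == "both"].drop(columns="_merge")
```

Here `coverage` shows how much of the merged table is usable for modelling, and `complete` matches what an inner join on `date` would return.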
