# Multivariate linear regression: Predicting the AQI (air quality index) using weather and pollution data for Aoti Zhongxin station - Beijing.

## Introduction
This dataset contains air pollutant data from the Aoti Zhongxin station (Beijing), collected by the Beijing Municipal Environmental Monitoring Center. Variables include time, measured concentrations of various pollutants, temperature, pressure, dew point, precipitation and wind information. The data cover March 1st, 2013 to February 28th, 2017.  
AQI is a measure of air quality that indicates how polluted the air currently is. It is calculated from the concentrations of O3, PM2.5, PM10, CO, SO2 and NO2.  
We will use a multivariate linear regression model to answer the question "How do the chosen weather variables affect AQI in Beijing?".

## Methods & Results

In [None]:
# Load libraries, run before everything else
library(tidyverse)
library(repr)
library(tidymodels)
install.packages("con2aqi")
library(con2aqi)
library(zoo) # for moving averages
install.packages("GGally")
library(GGally)
options(jupyter.plot_mimetypes = "image/png")  # Added by TA; we ran into 100mb file size limit problems

In [None]:
# Get weather + pollution data for the Aotizhongxin station in Beijing
download.file("https://raw.githubusercontent.com/DonkeyBlaster/dsci-100-2023w1-group43/main/PRSA_Data_Aotizhongxin_20130301-20170228.csv", "Aotizhongxin_data.csv")
air_quality_data <- read_csv("Aotizhongxin_data.csv") |>
    select(-station) |> # This just says "Aotizhongxin", no need to keep it around
    select(-No) # This is a continuously increasing counter, we don't need it either
head(air_quality_data, 3)
tail(air_quality_data, 3)

AQI can be calculated easily with the "con2aqi" library once the data are wrangled. First, we remove rows with N/A values, drop the wind direction column, and keep only the 2015-2016 data:

In [None]:
air_quality_data <- air_quality_data |> 
    na.omit() |> # AQI cannot be calculated with NA values
    select(-wd) |>  # We don't know how to use this properly
    filter(year >= 2015, year <= 2016)  # We only want 2015-2016, measuring changed in 2014 and 2017 data is incomplete

# TODO: Explain why we are using 2015-2016 only, why we're removing wind direction, then remove the comments from my code

We also need to convert the pollutant concentrations into the units the library expects. The existing data are in ug/m^3, and the library expects the following units:

| PM2.5  | PM10   | SO2 | NO2 | CO  | O3  |
|--------|--------|-----|-----|-----|-----|
| ug/m^3 | ug/m^3 | ppb | ppb | ppm | ppm |

In [None]:
R <- 0.082057366080960  # Ideal gas constant in L atm / (mol K)
SO2_molecular_weight <- 64.07  # g/mol
NO2_molecular_weight <- 46.01  # g/mol
CO_molecular_weight <- 28.01  # g/mol
O3_molecular_weight <- 48.00  # g/mol
air_quality_data <- air_quality_data |>
    # Ideal gas law PV = nRT rearranged to V = RT/P with n = 1, giving the molar volume in L/mol.
    mutate(volume = R * (273.2 + TEMP) / (PRES/1013)) |>   # Convert temp to Kelvin, pressure (hPa) to atmospheres
    # ppb = molar volume * concentration (ug/m^3) / molecular weight
    mutate(so2_ppb = volume * SO2 / SO2_molecular_weight) |>
    mutate(no2_ppb = volume * NO2 / NO2_molecular_weight) |>
    # Divide by 1000 to convert ppb to ppm
    mutate(co_ppm = volume * CO / CO_molecular_weight / 1000) |>
    mutate(o3_ppm = volume * O3 / O3_molecular_weight / 1000)
head(air_quality_data, 3)
tail(air_quality_data, 3)
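
As a quick sanity check on the unit conversion, the sketch below wraps the same ideal-gas calculation in a small helper. The function name `ugm3_to_ppb` and its arguments are illustrative assumptions and not part of the pipeline above. It uses 273.15 K and 1013.25 hPa, whereas the cell above uses rounded constants.

```r
# Hypothetical helper: convert a gas concentration from ug/m^3 to ppb
# using the ideal gas law (molar volume = R * T / P).
ugm3_to_ppb <- function(conc_ugm3, mol_weight, temp_c, pres_hpa) {
    R <- 0.082057366  # L atm / (mol K)
    molar_volume <- R * (temp_c + 273.15) / (pres_hpa / 1013.25)  # L/mol
    conc_ugm3 * molar_volume / mol_weight
}

# 64.07 ug/m^3 of SO2 (molecular weight 64.07 g/mol) at 25 C and 1 atm
so2_ppb <- ugm3_to_ppb(64.07, mol_weight = 64.07, temp_c = 25, pres_hpa = 1013.25)
so2_ppb  # about 24.47, i.e. the molar volume in L/mol at these conditions
```

Because the concentration equals the molecular weight here, the result is just the molar volume, which makes it easy to compare against the commonly quoted value of roughly 24.5 L/mol at 25 C.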

Next, we need to calculate moving averages for the concentrations, as per [the specification](https://www.airnow.gov/sites/default/files/2020-05/aqi-technical-assistance-document-sept2018.pdf). Each pollutant requires a different averaging period, shown below:

| PM2.5    | PM10     | SO2      | NO2    | CO      | O3           |
|----------|----------|----------|--------|---------|--------------|
| 24 hours | 24 hours | 1 hour   | 1 hour | 8 hours | 1 or 8 hours |


In [None]:
air_quality_data <- air_quality_data |>
    mutate(pm2.5_24hour = zoo::rollmean(PM2.5, k = 24, fill = NA, align = "right")) |>
    mutate(pm10_24hour = zoo::rollmean(PM10, k = 24, fill = NA, align = "right")) |>
    mutate(co_8hour = zoo::rollmean(co_ppm, k = 8, fill = NA, align = "right")) |>
    mutate(o3_8hour = zoo::rollmean(o3_ppm, k = 8, fill = NA, align = "right"))  # For o3 specifically, con2aqi allows us to choose 1 or 8 hours.
    # We're using 8 hours as the 1-hour window does not allow for reporting of AQI values less than 101.
head(air_quality_data, 26)
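
To see what a right-aligned (trailing) moving average does, here is a base-R sketch on a toy vector. It uses `stats::filter` instead of `zoo::rollmean`; for a trailing window padded with `NA` the two should give the same values. The vector `hourly` is made up for illustration.

```r
hourly <- c(10, 20, 30, 40, 50)  # made-up hourly concentrations
k <- 3                           # window length in hours

# sides = 1 averages the current value and the k - 1 values before it,
# so the first k - 1 positions have no complete window and are NA.
trailing_mean <- as.numeric(stats::filter(hourly, rep(1 / k, k), sides = 1))
trailing_mean  # NA NA 20 30 40
```

This is why the first 23 rows of the 24-hour averages are `NA` in the table above.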

Finally, we calculate AQI for each pollutant.
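
For reference, each pollutant's AQI is a piecewise-linear interpolation between concentration breakpoints. The sketch below illustrates this for 24-hour PM2.5 using the breakpoints from the 2018 EPA technical assistance document linked above. The function `pm25_to_aqi` is a hypothetical illustration, not necessarily how `con2aqi` is implemented. It assumes non-negative inputs, skips the truncation step in the specification, and extrapolates above the top breakpoint.

```r
# Hypothetical illustration of the AQI interpolation for 24-hour PM2.5 (ug/m^3).
pm25_to_aqi <- function(conc) {
    conc_lo <- c(0.0, 12.1, 35.5, 55.5, 150.5, 250.5, 350.5)
    conc_hi <- c(12.0, 35.4, 55.4, 150.4, 250.4, 350.4, 500.4)
    aqi_lo  <- c(0, 51, 101, 151, 201, 301, 401)
    aqi_hi  <- c(50, 100, 150, 200, 300, 400, 500)
    i <- findInterval(conc, conc_lo)  # index of the bracket each value falls into
    # Linear interpolation within the bracket, rounded to the nearest integer
    round((aqi_hi[i] - aqi_lo[i]) / (conc_hi[i] - conc_lo[i]) * (conc - conc_lo[i]) + aqi_lo[i])
}

pm25_aqi <- pm25_to_aqi(c(0, 12, 35.9))
pm25_aqi  # 0 50 102
```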

In [None]:
# This cell takes a while.
air_quality_data <- air_quality_data |>
    na.omit() |>  #  We will remove all rows with NA first.
    mutate(pm2.5_aqi = con2aqi(pollutant = "pm25", con = pm2.5_24hour)) |>
    mutate(pm10_aqi = con2aqi(pollutant = "pm10", con = pm10_24hour)) |>
    mutate(so2_aqi = con2aqi(pollutant = "so2", con = so2_ppb)) |>
    mutate(no2_aqi = con2aqi(pollutant = "no2", con = no2_ppb)) |>
    mutate(co_aqi = con2aqi(pollutant = "co", con = co_8hour)) |>
    mutate(o3_aqi = con2aqi(pollutant = "o3", con = o3_8hour, type = "8h"))
air_quality_data

AQI is reported as the highest of the individual pollutant AQIs, so we take the maximum across the six pollutants to obtain one overall AQI value for each row (each hourly observation).

In [None]:
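air_quality_data <- air_quality_data |>
    rowwise() |>  # This is required for the max function to read row-by-row
    mutate(aqi = max(c_across(pm2.5_aqi:o3_aqi))) |>  # Highest of the six pollutant AQIs in each row
    ungroup()  # Drop the row-wise grouping so later steps see an ordinary data frame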
head(air_quality_data, 3)
tail(air_quality_data, 3)
# Do not modify air_quality_data from this point onwards! It contains all original and calculated information. Duplicate frame if other modifications are needed.
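
It can also be useful to know which pollutant drives the overall AQI in each row. The base-R sketch below shows one way to get both the row-wise maximum and the name of the dominant pollutant. The data frame `toy_aqi` and its values are made up, and only three pollutant columns are included to keep it short.

```r
toy_aqi <- data.frame(
    pm2.5_aqi = c(120, 40, 75),
    pm10_aqi  = c(90, 55, 60),
    o3_aqi    = c(30, 20, 110)
)

aqi_matrix <- as.matrix(toy_aqi)
overall_aqi <- apply(aqi_matrix, 1, max)  # highest pollutant AQI in each row
dominant <- colnames(aqi_matrix)[apply(aqi_matrix, 1, which.max)]  # column holding that maximum
data.frame(overall_aqi, dominant)
```

If two pollutants tie, `which.max` reports the first one in column order.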

In [None]:
# Retrieve training and testing splits
set.seed(2023)  # Assumed seed (any fixed value works) so the random split is reproducible
aqd_split <- initial_split(air_quality_data, prop = 0.75, strata = aqi)
aqd_train <- training(aqd_split)
aqd_test <- testing(aqd_split)
head(aqd_train, 4)

In [None]:
colnames(aqd_train)

# TODO: We're going to do the pair plot analysis here to determine which variables are useful. Someone please fill details

In [None]:
options(repr.plot.width = 12, repr.plot.height = 12)
aqd_pairplot <- aqd_train |>
    select(TEMP:WSPM, aqi) |>
    na.omit() |>
    scale() |>
    as.data.frame()

head(aqd_pairplot, 6)

ggpairs(aqd_pairplot,
        lower = list(continuous = wrap('points', alpha = 0.1)),
        diag = list(continuous = "barDiag")
    ) +
    theme(text = element_text(size = 20))

# TODO: TEMP and DEWP are clearly very closely related, so we only pick one of them, which leaves PRES, DEWP and WSPM. RAIN looks like a poor predictor.

# TODO: None of them looks particularly correlated with aqi, though. What if we looked at one pollutant at a time instead? Let's try a couple (pm2.5 and o3).

In [None]:
aqd_pairplot <- aqd_train |>
    select(TEMP:WSPM, pm2.5_aqi, o3_aqi) |>
    na.omit() |>
    scale() |>
    as.data.frame()
ggpairs(aqd_pairplot,
        lower = list(continuous = wrap('points', alpha = 0.1)),
        diag = list(continuous = "barDiag")
    ) +
    theme(text = element_text(size = 20))

# TODO: clearly some pollutants are easier to predict (o3) and some are harder (pm2.5), what if we predicted individual pollutants to get an overall aqi value from that, etc (maybe move this to improvements?)

# TODO: let's do the linear regression with PRES and DEWP, predicting overall aqi now. We still have aqd_train and aqd_test, add details here

In [None]:
lm_spec <- linear_reg() |>
    set_engine("lm") |>
    set_mode("regression")
aqi_recipe <- recipe(aqi ~ PRES + DEWP, data = aqd_train)
aqi_fit <- workflow() |>
    add_model(lm_spec) |>
    add_recipe(aqi_recipe) |>
    fit(data = aqd_train)

# TODO: now we evaluate accuracy of model (it's pretty bad)

In [None]:
aqi_rmspe <- aqi_fit |>
    predict(aqd_test) |>
    bind_cols(aqd_test) |>
    metrics(truth = aqi, estimate = .pred) |>
    filter(.metric == "rmse") |>
    select(.estimate) |>
    pull()
aqi_rmspe
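
RMSPE is the root mean squared error computed on the held-out test set, so it is in the same units as AQI. A minimal base-R sketch of the calculation, using made-up values and a hypothetical helper `rmse`:

```r
# Root mean squared error between observed and predicted values
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

truth_aqi <- c(100, 150, 200)      # made-up observed AQI values
predicted_aqi <- c(110, 140, 230)  # made-up predictions
toy_rmspe <- rmse(truth_aqi, predicted_aqi)
toy_rmspe  # about 19.1
```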

# TODO: This model is pretty terrible, RMSPE (measuring accuracy against never-seen-before data) is 69.1. Each aqi classification bracket is only 50

# TODO: What if we did it on the one pollutant that looks decent? (o3)

In [None]:
o3_recipe <- recipe(o3_aqi ~ PRES + DEWP, data = aqd_train)
o3_fit <- workflow() |>
    add_model(lm_spec) |>
    add_recipe(o3_recipe) |>
    fit(data = aqd_train)
o3_rmspe <- o3_fit |>
    predict(aqd_test) |>
    bind_cols(aqd_test) |>
    metrics(truth = o3_aqi, estimate = .pred) |>
    filter(.metric == "rmse") |>
    select(.estimate) |>
    pull()
o3_rmspe

# TODO: This is pretty good, within the 50-aqi bracket

## Method
We will carry out a multivariate linear regression analysis on our data to predict AQI from weather conditions. We chose this method because:
* K-nearest neighbours (KNN) regression would be very slow for such a large dataset
* A linear model extrapolates, so we can have more confidence in predictions where weather conditions fall slightly outside the range of the inputs (more extreme weather)
* The regression equation gives an explicit mathematical relationship that quantifies the relative contribution of each predictor
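
To illustrate the last point, the sketch below fits a linear model to simulated data and reads off the coefficients. Everything here is an assumption made for illustration: the data frame `sim`, the slopes 2 and -3, and the noise level are arbitrary and say nothing about the real relationship between weather and AQI.

```r
set.seed(1)
n <- 500
sim <- data.frame(
    PRES = rnorm(n, mean = 1013, sd = 10),  # simulated pressure
    DEWP = rnorm(n, mean = 5, sd = 10)      # simulated dew point
)
# Simulated response with known slopes: +2 per unit PRES, -3 per unit DEWP
sim$aqi <- 100 + 2 * (sim$PRES - 1013) - 3 * sim$DEWP + rnorm(n, sd = 5)

sim_fit <- lm(aqi ~ PRES + DEWP, data = sim)
sim_coefs <- coef(sim_fit)
sim_coefs  # slopes close to the simulated values 2 and -3
```

Each slope is the expected change in the response for a one-unit change in that predictor with the other held fixed, which is the sense in which the equation quantifies each predictor's contribution.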

Before creating our model, we will consider each variable and the relationships between them, as follows:
* Wind direction is given in the dataset but is disregarded because it is categorical rather than numerical
* The dataset is large, so individual outliers are unlikely to strongly affect our results
* We will produce pairwise scatter plots of our weather variables to identify correlations between them and avoid multicollinearity
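
Alongside scatter plots, a correlation matrix gives a numerical view of the same relationships. The sketch below uses simulated data in which dew point is built from temperature, so the two are strongly correlated by construction. The data frame `sim_weather` and its distributions are assumptions for illustration, not the station data.

```r
set.seed(1)
n <- 200
TEMP <- rnorm(n, mean = 15, sd = 10)
sim_weather <- data.frame(
    TEMP = TEMP,
    DEWP = TEMP - 10 + rnorm(n, sd = 3),  # built from TEMP, so strongly correlated
    WSPM = rexp(n, rate = 0.5)            # generated independently of TEMP
)

cor_mat <- cor(sim_weather)  # pairwise Pearson correlations
round(cor_mat, 2)
```

A pair of predictors with a correlation near 1 or -1 carries largely the same information, so keeping both in a linear model makes the individual coefficients unstable.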



## Expected Outcomes and Significance

- Based on our research, we expect **AQI** to have:
    - Positive correlations with:
        - **Pressure** - higher pressures will stagnate air, causing pollutants to accumulate
    - Negative correlations with:
        - **Wind speed** - faster winds disperse pollutants, lowering concentrations
        - **Temperature** - higher ground temperature causes hot air to rise, reducing atmospheric pressure
        - **Precipitation** - falling precipitation captures pollutants as it descends
        - **Dew point** - at higher dew points more water droplets form, which trap pollutants
- Impacts:
    - Poor air quality contributes to thousands of hospital visits and premature deaths a year, with the related consequences valued at $120 billion a year. These findings could therefore help individuals take preventative measures against pollution to protect their health.
- Future research:
  - We could use location-based predictors such as wind direction and topography to ask the question “How is AQI affected by location?”.
  - Additionally, we could ask “How is AQI affected by transportation and energy?” to investigate whether lifestyle contributes to pollution.


Word count: 517

### References

“Air pollution – How to convert between mg/m3, µg/m3 and ppm, ppb.” Breeze Technologies, 20 Aug. 2021, https://www.breeze-technologies.de/blog/air-pollution-how-to-convert-between-mgm3-%C2%B5gm3-ppm-ppb/. Accessed 28 Oct. 2023.


Feng, Xinyuan, and Shigong Wang. “Influence of different weather events on concentrations of particulate matter with different sizes in Lanzhou, China.” Journal of Environmental Sciences, vol. 24, no. 4, 2012, pp. 665-674. https://doi.org/10.1016/S1001-0742(11)60807-3.


“Health impacts from air pollution.” Government of Canada, 2 June 2023, https://www.canada.ca/en/environment-climate-change/campaigns/canadian-environment-week/clean-air-day/health-impacts-air-pollution.html.  Accessed 28 Oct. 2023.


“How the weather affects air quality.” Government of Canada, 26 Jan. 2023, https://www.canada.ca/en/environment-climate-change/services/air-quality-health-index/weather.html. Accessed 28 Oct. 2023.


Kumari, Shweta, and Manish Kumar Jain. “A Critical Review on Air Quality Index.” Water Science and Technology Library, vol. 77, Springer, Singapore, 2018. https://doi.org/10.1007/978-981-10-5792-2_8.


Liu, Yansui, Yang Zhou, and Jiaxin Lu. “Exploring the relationship between air pollution and meteorological conditions in China under environmental governance.” Scientific Reports, vol. 10, no. 1, 2020, pp. 1-14. https://www.nature.com/articles/s41598-020-71338-7. doi: 10.1038/s41598-020-71338-7.


Technical Assistance Document for the Reporting of Daily Air Quality – the Air Quality Index (AQI). USEPA, 2018. https://www.airnow.gov/sites/default/files/2020-05/aqi-technical-assistance-document-sept2018.pdf.


Xu, Yingying, and Xinyue Zhu. “Recognizing Dew as an Indicator and an Improver of Near-Surface Air Quality.” Advances in Meteorology, vol. 2017, 2017. https://doi.org/10.1155/2017/3514743.

