# Title

## Introduction
some introduction here (mention aqhi here)

## Preliminary exploratory data analysis

In [None]:
# Load libraries, run before everything else
library(tidyverse)
library(repr)
library(tidymodels)
if (!requireNamespace("con2aqi", quietly = TRUE)) install.packages("con2aqi")  # install only when missing
library(con2aqi)
library(zoo) # for moving averages

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2

In [None]:
# Get weather + pollution data for the Aotizhongxin station in Beijing
download.file("https://raw.githubusercontent.com/DonkeyBlaster/dsci-100-2023w1-group43/main/PRSA_Data_Aotizhongxin_20130301-20170228.csv", "Aotizhongxin_data.csv")
air_quality_data <- read_csv("Aotizhongxin_data.csv") |>
    select(-station) |> # This just says "Aotizhongxin", no need to keep it around
    select(-No)  # This is a continuously increasing counter, we don't need it either
head(air_quality_data, 3)
tail(air_quality_data, 3)

As the preview above shows, there is one observation per hour from the start of March 2013 to the end of February 2017. The "station" column was dropped because it only contains "Aotizhongxin", and the "No" column because it is just a row counter. As previously mentioned, we will be calculating AQI for this reporting station. The "con2aqi" library makes the calculation straightforward, but the data needs some pre-processing first. To begin, we remove every row that contains an NA value:

In [None]:
air_quality_data <- air_quality_data |> na.omit()
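
Before dropping rows, it can be useful to see how many values are missing in each column. The following is a minimal sketch on a made-up data frame; `toy` and its values are hypothetical and are not taken from the dataset.

```r
# Hypothetical toy data with a few missing values.
toy <- data.frame(
    PM2.5 = c(4, NA, 7),
    TEMP  = c(-0.7, -1.1, NA),
    wd    = c("NNW", "N", "NNW")
)

# Count the missing values in each column.
colSums(is.na(toy))

# na.omit() keeps only the rows with no missing value in any column.
nrow(na.omit(toy))
```

Note that `na.omit()` discards a row when any single column is missing, so a gap in one variable also removes that hour's other readings.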

Additionally, we need to convert the pollutant concentrations into the units the library expects. All concentrations in the dataset are in ug/m^3, while the library expects the following units:

| PM2.5  | PM10   | SO2 | NO2 | CO  | O3  |
|--------|--------|-----|-----|-----|-----|
| ug/m^3 | ug/m^3 | ppb | ppb | ppm | ppm |

To do this, each gaseous pollutant must be converted separately. First, we calculate the volume of one mole of an ideal gas (in litres) at the temperature and pressure recorded for that observation. The ppb concentration is then the concentration in ug/m^3 multiplied by that volume and divided by the pollutant's molecular weight. We can do this for SO2 and NO2 now. CO and O3 are converted the same way, followed by a conversion from ppb to ppm at the end (divide by 1000).
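
Written out, the conversion described above looks as follows. This is a sketch that assumes ideal-gas behaviour, with TEMP in degrees Celsius and PRES in hPa; the symbols $V_m$, $C$ and $M$ are introduced here purely for illustration.

$$
\begin{aligned}
V_m &= \frac{R\,(\mathrm{TEMP} + 273.15)}{\mathrm{PRES} / 1013.25} \quad [\mathrm{L/mol}] \\
C_{\mathrm{ppb}} &= \frac{C_{\mathrm{ug/m^3}} \cdot V_m}{M} \\
C_{\mathrm{ppm}} &= \frac{C_{\mathrm{ppb}}}{1000}
\end{aligned}
$$

For example, at 25 degrees Celsius and 1013.25 hPa, $V_m \approx 24.47$ L/mol, so 10 ug/m^3 of SO2 ($M = 64.07$ g/mol) corresponds to roughly $10 \times 24.47 / 64.07 \approx 3.82$ ppb.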

In [None]:
R <- 0.082057366080960  # Ideal gas constant in L·atm/(mol·K)
SO2_molecular_weight <- 64.07  # g/mol
NO2_molecular_weight <- 46.01  # g/mol
CO_molecular_weight <- 28.01  # g/mol
O3_molecular_weight <- 48.00  # g/mol
air_quality_data <- air_quality_data |>
    mutate(
        # Molar volume in L/mol. TEMP is in degrees Celsius, PRES in hPa (1 atm = 1013.25 hPa).
        volume = R * (273.15 + TEMP) / (PRES / 1013.25),
        so2_ppb = volume * SO2 / SO2_molecular_weight,
        no2_ppb = volume * NO2 / NO2_molecular_weight,
        # CO and O3 are needed in ppm, hence the extra division by 1000.
        co_ppm = volume * CO / CO_molecular_weight / 1000,
        o3_ppm = volume * O3 / O3_molecular_weight / 1000
    )
head(air_quality_data, 3)
tail(air_quality_data, 3)

Next, we need to calculate certain moving averages for the concentration values, as per [the specification](https://www.airnow.gov/sites/default/files/2020-05/aqi-technical-assistance-document-sept2018.pdf). Each pollutant has a different period for the required moving averages, listed in this table:
| PM2.5    | PM10     | SO2      | NO2    | CO      | O3           |
|----------|----------|----------|--------|---------|--------------|
| 24 hours | 24 hours | 1 hour   | 1 hour | 8 hours | 1 or 8 hours |

For O3, the library lets us choose which of the two periods to use. We will use the 8-hour average, because the 1-hour breakpoints are only defined for AQI values of 101 and above.
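
To make the windowing concrete, here is a minimal base-R sketch of a trailing (right-aligned) moving average. `trailing_mean` is a hypothetical helper written for illustration and the input values are made up; it is intended to mirror what `zoo::rollmean(x, k, fill = NA, align = "right")` computes.

```r
# Hypothetical helper: mean of the current value and the k - 1 values before it.
trailing_mean <- function(x, k) {
    vapply(seq_along(x), function(i) {
        # Windows that would start before the first element are incomplete.
        if (i < k) NA_real_ else mean(x[(i - k + 1):i])
    }, numeric(1))
}

hourly <- c(10, 20, 30, 40, 50)  # made-up hourly concentrations
trailing_mean(hourly, k = 3)
# The first k - 1 entries are NA because their window is incomplete.
```

With a 24-hour window, the first 23 rows of the real data therefore have no PM2.5 or PM10 average.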

In [None]:
# align = "right" averages each row with the k - 1 rows before it; the first k - 1 rows become NA.
# Note: rows dropped by na.omit() earlier leave gaps, so a window of k rows can span more than k hours.
air_quality_data <- air_quality_data |>
    mutate(
        pm2.5_24hour = zoo::rollmean(PM2.5, k = 24, fill = NA, align = "right"),
        pm10_24hour = zoo::rollmean(PM10, k = 24, fill = NA, align = "right"),
        co_8hour = zoo::rollmean(co_ppm, k = 8, fill = NA, align = "right"),
        o3_8hour = zoo::rollmean(o3_ppm, k = 8, fill = NA, align = "right")
    )
head(air_quality_data, 26)

We can finally calculate the AQI values for each of the pollutants. First we remove the rows that still contain NA, namely the leading rows whose moving-average windows were incomplete.
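
For context, the specification linked earlier defines each pollutant's AQI as a linear interpolation between concentration breakpoints. The formula below follows that document; that `con2aqi` uses exactly these breakpoints is an assumption and has not been verified here.

$$
I_p = \frac{I_{Hi} - I_{Lo}}{BP_{Hi} - BP_{Lo}}\,(C_p - BP_{Lo}) + I_{Lo}
$$

Here $C_p$ is the averaged concentration, $BP_{Lo}$ and $BP_{Hi}$ are the breakpoints that bracket it, and $I_{Lo}$ and $I_{Hi}$ are the AQI values at those breakpoints. For example, with the PM2.5 breakpoints of 12.1 to 35.4 ug/m^3 for the AQI range 51 to 100, a 24-hour average of 35.0 ug/m^3 gives $\frac{100 - 51}{35.4 - 12.1}(35.0 - 12.1) + 51 \approx 99$.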

In [None]:
air_quality_data <- air_quality_data |>
    na.omit() |>
    mutate(pm2.5_aqi = con2aqi(pollutant = "pm25", con = pm2.5_24hour)) |>
    mutate(pm10_aqi = con2aqi(pollutant = "pm10", con = pm10_24hour)) |>
    mutate(so2_aqi = con2aqi(pollutant = "so2", con = so2_ppb)) |>
    mutate(no2_aqi = con2aqi(pollutant = "no2", con = no2_ppb)) |>
    mutate(co_aqi = con2aqi(pollutant = "co", con = co_8hour)) |>
    mutate(o3_aqi = con2aqi(pollutant = "o3", con = o3_8hour, type = "8h"))
air_quality_data
    

In [None]:
colnames(air_quality_data)

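Because AQI is reported daily as the highest of the individual pollutant AQIs, we take each pollutant's daily maximum and then the largest of those, which gives one final AQI value per day. The weather variables are averaged over each day.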

In [None]:
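air_quality_data <- air_quality_data |>
    select(year, month, day, hour, TEMP, PRES, DEWP, RAIN, WSPM, pm2.5_aqi, pm10_aqi, so2_aqi, no2_aqi, co_aqi, o3_aqi) |>
    group_by(year, month, day) |>
    # Daily mean of the weather variables, daily maximum of each pollutant AQI
    summarize(across(TEMP:WSPM, mean), across(pm2.5_aqi:o3_aqi, max), .groups = "drop") |>
    # pmax() gives the row-wise maximum over all six pollutant columns.
    # (max(pm2.5_aqi:o3_aqi) would build a numeric sequence between two values and skip the other columns.)
    mutate(aqi = round(pmax(pm2.5_aqi, pm10_aqi, so2_aqi, no2_aqi, co_aqi, o3_aqi)))
head(air_quality_data, 3)
tail(air_quality_data, 3)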
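
As a small illustration of "the highest of the individual pollutant AQIs", the sketch below takes a row-wise maximum with base R's `pmax()`. The data frame `toy` and its values are made up for this example.

```r
# Toy daily sub-index values (made up for illustration only).
toy <- data.frame(
    pm25_aqi = c(150, 40),
    pm10_aqi = c(90, 60),
    o3_aqi   = c(40, 55)
)

# pmax() compares the columns element-wise, giving one maximum per row.
toy$aqi <- pmax(toy$pm25_aqi, toy$pm10_aqi, toy$o3_aqi)
toy$aqi

# By contrast, 40:55 builds the sequence 40, 41, ..., 55, so max(40:55)
# only reflects its two endpoints and never looks at the pm10 value.
max(40:55)
```

In the second row the overall value comes from the PM10 column, which an expression built from only the first and last columns would miss.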
    

We can visualize this data to gain some insight into how the candidate predictor variables relate to overall AQI. As we don't need extremely fine detail, we can go by average monthly values. We will also normalize (standardize) all values so that the variables are easier to compare on a common scale.
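
As a quick reminder of what normalizing does here, the sketch below standardizes a made-up vector with base R's `scale()`, which subtracts the mean and divides by the standard deviation. The values are hypothetical.

```r
# Made-up values, used only to show what scale() does.
x <- c(2, 4, 6)

# scale() returns a one-column matrix, so flatten it to a plain vector.
z <- as.numeric(scale(x))
z

# After standardizing, the mean is 0 and the standard deviation is 1.
mean(z)
sd(z)
```

Because standardizing is a linear rescaling, it changes the axis units of a scatter plot but not the shape of the relationship.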

In [None]:
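options(repr.plot.width = 9, repr.plot.height = 6)
scaled_data <- air_quality_data |>
    group_by(year, month) |>  # one row per month, matching the monthly averages described above
    summarize(across(TEMP:aqi, mean), .groups = "drop") |>  # drop grouping so scaling uses all months together
    mutate(across(TEMP:aqi, \(x) as.numeric(scale(x))))  # scale() returns a matrix; keep plain numeric vectors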

ggplot(scaled_data, aes(x = TEMP, y = aqi)) +
    geom_point(alpha = 0.3) +
    labs(x = "Normalized Temperature", y = "Normalized AQI", title = "Normalized Temperature vs. Normalized AQI")
ggplot(scaled_data, aes(x = DEWP, y = aqi)) +
    geom_point(alpha = 0.3) +
    labs(x = "Normalized Dew Point", y = "Normalized AQI", title = "Normalized Dew Point vs. Normalized AQI")
ggplot(scaled_data, aes(x = PRES, y = aqi)) +
    geom_point(alpha = 0.3) +
    labs(x = "Normalized Air Pressure", y = "Normalized AQI", title = "Normalized Air Pressure vs. Normalized AQI")
ggplot(scaled_data, aes(x = RAIN, y = aqi)) +
    geom_point(alpha = 0.3) +
    labs(x = "Normalized Rain", y = "Normalized AQI", title = "Normalized Rain vs. Normalized AQI")
ggplot(scaled_data, aes(x = WSPM, y = aqi)) +
    geom_point(alpha = 0.3) +
    labs(x = "Normalized Wind Speed", y = "Normalized AQI", title = "Normalized Wind Speed vs. Normalized AQI")

As shown in the graphs above, AQI has a weak positive relationship with temperature, a stronger positive relationship with dew point, a very weak negative relationship with air pressure, and no apparent relationship with rain or wind speed. This indicates which variables may be more useful for predicting AQI in the future.