## **PROPOSAL FOR AIR POLLUTION REGRESSION ANALYSIS**

## Introduction

Weather affects air pollution: meteorological factors such as humidity, temperature, and precipitation influence the concentration of pollutants in the air, including particulate matter.

We chose the gas pollutant CO and the particulate PM 2.5 because both feature prominently in discussions of air pollution in Beijing. We focus on the Tiantan area because of its large population of tourists and locals. Modeling these pollutants lets us predict the air quality in the area, assess its impact on the community, and consider how it could be improved. We want to predict CO and PM 2.5 from the weather, which we describe with five variables: temperature (°C), pressure (hPa), dew point temperature (°C), precipitation (mm), and wind speed (m/s). This regression analysis addresses the following questions:
 
1. How does the weather affect the concentration of air particle PM 2.5?
2. How does the weather affect the concentration of gas pollutant CO?

The dataset consists of hourly concentrations of air pollutants and hourly meteorological variables from 12 air-monitoring stations in Beijing, recorded between March 1, 2013 and February 28, 2017. This analysis uses the records from the Tiantan station.

## Preliminary Exploratory Data Analysis

In [None]:
library(tidyverse)
library(repr)
options(repr.matrix.max.rows = 6)
library(testthat)
library(digest)
library(tidymodels)

### Weather Data Set in Tiantan

In [None]:
weather_data <- read.csv("PRSA_Data_Tiantan_20130301-20170228.csv")
weather_data

### Summary Statistics

In [34]:
# Mean of each weather predictor, reshaped to one row per predictor
weather_predictors <- weather_data |> select('TEMP', 'PRES', 'DEWP', 'RAIN', 'WSPM') |>
map_dfr(mean, na.rm = TRUE) |> pivot_longer(cols = TEMP:WSPM, names_to = "Predictors", values_to = "Mean")

weather_predictors

# Standard deviation of each predictor, kept as a plain vector
Standard_Deviation <- weather_data |> select('TEMP', 'PRES', 'DEWP', 'RAIN', 'WSPM') |>
map_dfr(sd, na.rm = TRUE) |> pivot_longer(cols = TEMP:WSPM, names_to = "Predictors", values_to = "Standard Deviation") |> pull(2)

# Number of missing values in each predictor, kept as a plain vector
NA_Count <- weather_data |> select('TEMP', 'PRES', 'DEWP', 'RAIN', 'WSPM') |> 
map_dfr(~sum(is.na(.))) |> pivot_longer(cols = TEMP:WSPM, names_to = "Predictors", values_to = "NA Count") |> pull(2)

# Combine the mean, standard deviation, and NA count into one summary table
summary_statistics <- data.frame(weather_predictors, Standard_Deviation, NA_Count)
summary_statistics


| Predictors | Mean |
| --- | --- |
| `<chr>` | `<dbl>` |
| TEMP | 13.67149 |
| PRES | 1012.547 |
| DEWP | 2.447535 |
| RAIN | 0.06401952 |
| WSPM | 1.860785 |


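| Predictors | Mean | Standard_Deviation | NA_Count |
| --- | --- | --- | --- |
| `<chr>` | `<dbl>` | `<dbl>` | `<int>` |
| TEMP | 13.67149 | 11.458418 | 20 |
| PRES | 1012.547 | 10.266059 | 20 |
| DEWP | 2.447535 | 13.810696 | 20 |
| RAIN | 0.06401952 | 0.786282 | 20 |
| WSPM | 1.860785 | 1.280368 | 14 |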


### KNN Regression

In [None]:
tidy_data <- weather_data |>
            # Keep CO and the five weather predictors, renaming them by name rather than by position
            select(CO, Temperature = TEMP, Pressure = PRES, Dew_Point_Temperature = DEWP,
                   Rain = RAIN, Wind_Speed = WSPM) |>
            na.omit() |>
            arrange(CO)


tidy_data

In [None]:
set.seed(2000) 

weather_split <- initial_split(tidy_data, prop = 0.75, strata = CO)
weather_training <- training(weather_split)
weather_testing <- testing(weather_split)


weather_split
weather_training
weather_testing

In [None]:
weather_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> 
                set_engine("kknn") |>
                set_mode("regression") 

weather_recipe <- recipe(CO ~. , data = weather_training) |>
                step_scale(all_predictors()) |>
                step_center(all_predictors())
weather_spec
weather_recipe

In [None]:
set.seed(1234) 

weather_vfold <- vfold_cv(weather_training, v = 5, strata = CO)

weather_workflow <- workflow() |>
                    add_recipe(weather_recipe) |>
                    add_model(weather_spec)
weather_workflow

In [None]:
set.seed(2019)

gridvals <- tibble(neighbors = seq(from = 1, to = 200))

weather_results <- weather_workflow |>
                  tune_grid(resamples = weather_vfold, grid = gridvals) |>
                  collect_metrics()

weather_results

## Methods
We will conduct our data analysis with the following variables:

1. Response variables for regression:
- PM 2.5 
- CO

2. Predictors:
- Temperature (TEMP) 
- Pressure (PRES) 
- Dew Point Temperature (DEWP) 
- Rain (RAIN) 
- Wind Speed (WSPM)

3. Variables not included in the analysis, and why:
- Wind direction (wd) is categorical, so it is not included in the regression or its plots.
- The time variables (year, month, day, hour) are excluded: seasonal patterns may influence the predictions, but they are not our topic of study.

4. Preprocessing:
- We will normalize the data so that variables with large absolute values do not receive undue weight.
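
The effect of normalization can be sketched with the means and standard deviations from the summary statistics table. The two observations below are hypothetical values chosen for illustration, not rows from the dataset. The sketch compares how much each predictor contributes to the squared Euclidean distance between them before and after standardization.

```r
# Means and standard deviations copied from the summary statistics table.
stats <- data.frame(
  predictor = c("TEMP", "PRES", "DEWP", "RAIN", "WSPM"),
  mean = c(13.67149, 1012.547, 2.447535, 0.06401952, 1.860785),
  sd   = c(11.458418, 10.266059, 13.810696, 0.786282, 1.280368)
)

# Two hypothetical hourly observations (illustrative values, not from the data).
obs_a <- c(TEMP = 10, PRES = 1005, DEWP = 0, RAIN = 0.0, WSPM = 1.0)
obs_b <- c(TEMP = 12, PRES = 1025, DEWP = 1, RAIN = 2.0, WSPM = 1.5)

# Squared difference per predictor on the raw scale.
raw_sq <- (obs_a - obs_b)^2

# Standardize each predictor: subtract its mean and divide by its standard deviation.
standardize <- function(x) (x - stats$mean) / stats$sd
scaled_sq <- (standardize(obs_a) - standardize(obs_b))^2

# Share of the squared Euclidean distance contributed by each predictor.
raw_share <- round(raw_sq / sum(raw_sq), 3)
scaled_share <- round(scaled_sq / sum(scaled_sq), 3)

print(rbind(raw_share, scaled_share))
```

On the raw scale the pressure difference accounts for nearly all of the distance, simply because pressure is recorded in larger numbers. After standardization each predictor is weighed relative to its own spread.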

We will then perform a KNN regression to analyze the effect that weather has on the concentration of pollutants. We will visualize the results with a regression plot for each pollutant (PM 2.5 and CO), and then create separate regression plots for each predictor in relation to each pollutant.
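
As a minimal sketch of how this step might look, the block below picks the number of neighbors with the lowest cross-validated RMSE, refits the model, evaluates it on a held-out test set, and builds a single-predictor regression plot. It is not the final analysis: the data are simulated so the block runs on its own, every `sim_*` name and the small grid of `k` values are assumptions made for illustration, and it requires the `tidymodels` and `kknn` packages.

```r
library(tidymodels)

set.seed(2000)

# Simulated stand-in for tidy_data: one predictor and a noisy response.
# These values are synthetic and only make the sketch runnable on its own.
sim_data <- tibble(
  Temperature = runif(300, min = -10, max = 35),
  CO = 1500 - 20 * Temperature + rnorm(300, sd = 200)
)

sim_split <- initial_split(sim_data, prop = 0.75, strata = CO)
sim_train <- training(sim_split)
sim_test <- testing(sim_split)

# Same preprocessing as the proposal: scale and center the predictors.
sim_recipe <- recipe(CO ~ ., data = sim_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# Cross-validate a small grid of k values.
sim_results <- workflow() |>
  add_recipe(sim_recipe) |>
  add_model(tune_spec) |>
  tune_grid(resamples = vfold_cv(sim_train, v = 5, strata = CO),
            grid = tibble(neighbors = seq(from = 1, to = 41, by = 10))) |>
  collect_metrics()

# Keep the k with the lowest mean cross-validated RMSE.
best_k <- sim_results |>
  filter(.metric == "rmse") |>
  slice_min(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen k.
best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("regression")

sim_fit <- workflow() |>
  add_recipe(sim_recipe) |>
  add_model(best_spec) |>
  fit(data = sim_train)

# Predict on the held-out test set and compute the test RMSE.
sim_preds <- predict(sim_fit, sim_test) |>
  bind_cols(sim_test)

sim_rmse <- sim_preds |>
  metrics(truth = CO, estimate = .pred) |>
  filter(.metric == "rmse") |>
  pull(.estimate)

# Regression plot: observed points with the KNN predictions drawn as a line.
sim_plot <- ggplot(sim_preds, aes(x = Temperature, y = CO)) +
  geom_point(alpha = 0.4) +
  geom_line(aes(y = .pred), color = "blue") +
  labs(x = "Temperature (°C)", y = "CO concentration")

sim_rmse
```

With the real data, the same pattern would presumably start from `weather_results` and `weather_workflow`, and would be repeated with PM 2.5 as the response.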

## Expected outcomes and significance
1. We expect to find a relationship between meteorological conditions and the concentrations of PM 2.5 and CO. We also expect the two types of air pollutant to respond differently to the same conditions.

2. Understanding the relationship between air pollutants and weather conditions may clarify how weather shapes air pollution. This could help advance active pollution reduction technologies or reveal better methods of reducing the penetration of PM 2.5 into households. We may also be able to combine meteorology with behavioral techniques to reduce air pollution.

3. What future questions could this lead to?
- Do weather events remove air pollution permanently, or only reduce it temporarily?
Certain pollutants such as PM 2.5 may be absorbed into the environment, for example when smoke is washed into the soil.

- How might meteorological conditions be controlled to reduce air pollution in cities?
Techniques such as cloud seeding might be used to reduce pollutant concentrations during periods of high pollution.