## **Proposal for Air Pollution Regression Analysis**

## Introduction

Weather affects air pollution because conditions such as humidity, temperature, and precipitation change the concentration of pollutants in the air. For example, when it rains, particulate matter such as PM 2.5 is carried out of the air by rain droplets and into groundwater, and gaseous pollutants may dissolve into the water.

We chose the gaseous pollutant CO and the particulate pollutant PM 2.5 because both come up frequently in discussions of air pollution in Beijing. We focus on the Tiantan region because its population includes both locals and tourists, so a large number of people are affected. Through this data analysis, we aim to predict the air condition in the region, understand its impact on the community, and consider how it could be improved.

We want to predict the concentrations of PM 2.5 and CO from the weather, which we describe with 5 variables: temperature (°C), pressure (hPa), dew point temperature (°C), precipitation (mm), and wind speed (m/s). The questions we came up with for this regression analysis are:

1. How does the weather affect the concentration of particulate pollutants such as PM 2.5?
2. How does the weather affect the concentration of gas pollutants such as CO?

The dataset consists of hourly concentrations of air pollutants and meteorological variables from 12 air-monitoring stations in Beijing between March 1, 2013 and February 28, 2017. We use the data from the Tiantan station.

## Data Set: Weather Data Set in the Region Tiantan

In [1]:
library(tidyverse)
library(repr)
options(repr.matrix.max.rows = 6)
library(testthat)
library(digest)
library(tidymodels)

In [2]:
weather_data <- read.csv("PRSA_Data_Tiantan_20130301-20170228.csv")
weather_data

No,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<chr>
1,2013,3,1,0,6,6,4,8,300,81,-0.5,1024.5,-21.4,0,NNW,5.7,Tiantan
2,2013,3,1,1,6,29,5,9,300,80,-0.7,1025.1,-22.1,0,NW,3.9,Tiantan
3,2013,3,1,2,6,6,4,12,300,75,-1.2,1025.3,-24.6,0,NNW,5.3,Tiantan
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
35062,2017,2,28,21,18,32,4,48,500,48,10.8,1014.2,-13.3,0,NW,1.1,Tiantan
35063,2017,2,28,22,15,42,5,52,600,44,10.5,1014.4,-12.9,0,NNW,1.2,Tiantan
35064,2017,2,28,23,15,50,5,68,700,21,8.6,1014.1,-15.9,0,NNE,1.3,Tiantan


## Methods
We will conduct our data analysis with the following variables from the data set:

1. Response variables:
- PM 2.5
- CO

2. Predictors: 
- Temperature (TEMP) 
- Pressure (PRES) 
- Dew Point Temperature (DEWP) 
- Rain (RAIN) 
- Wind Speed (WSPM)

3. Variables NOT included in the analysis, and why:
- Wind Direction (wd) is categorical, so it is not included in the regression.
- Seasonal weather may influence the prediction, but it is not our topic of study. All time variables (year, month, day, hour) are therefore dropped so that they do not interfere with our regression.
- The other pollutants (PM10, SO2, NO2, O3) are dropped because we only study PM 2.5 and CO.

4. Preprocessing:
- We normalize the predictors so that variables with large absolute values, such as pressure, do not receive undue weight in the KNN distance calculation.
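
The normalization step mentioned above could look roughly like the following sketch. It assumes the `recipes` functions `step_center()` and `step_scale()` from tidymodels, and it runs on a small stand-in table called `toy` rather than the real data: the first row is taken from the Tiantan output, the remaining values are made up for illustration, and the descriptive column names are an assumption about how the cleaned data is labelled.

```r
library(tidymodels)

# Small stand-in table (values are illustrative, not real measurements beyond
# the first row); column names are assumed descriptive names for the data.
toy <- tibble(
  PM_2.5 = c(6, 18, 15, 40, 75, 120),
  Temperature = c(-0.5, 10.8, 8.6, 21.3, 27.9, 3.2),
  Pressure = c(1024.5, 1014.2, 1014.1, 1003.8, 998.6, 1021.0),
  Dew_Point_Temperature = c(-21.4, -13.3, -15.9, 12.1, 20.4, -8.7),
  Rain = c(0, 0, 0, 0.4, 2.1, 0),
  Wind_Speed = c(5.7, 1.1, 1.3, 2.0, 0.9, 3.4)
)

# Centre and scale every predictor; the response keeps its original units.
scale_recipe <- recipe(PM_2.5 ~ ., data = toy) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())

scaled <- scale_recipe |>
  prep() |>
  bake(new_data = NULL)

# After scaling, each predictor has mean 0 and standard deviation 1.
scaled |>
  summarise(across(-PM_2.5, list(mean = mean, sd = sd)))
```

Without this step, pressure (around 1000 hPa) would dominate the distance between two observations, while rain (mostly 0 to a few mm) would barely contribute.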

In [3]:
# Keep the two pollutants, the five weather predictors, and the station name,
# giving each column a descriptive name as it is selected.
tidy_data <- weather_data |>
            select(PM_2.5 = PM2.5, CO, Temperature = TEMP, Pressure = PRES,
                   Dew_Point_Temperature = DEWP, Rain = RAIN, Wind_Speed = WSPM,
                   Region = station)

tidy_data

PM_2.5,CO,Temperature,Pressure,Dew_Point_Temperature,Rain,Wind_Speed,Region
<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
6,300,-0.5,1024.5,-21.4,0,5.7,Tiantan
6,300,-0.7,1025.1,-22.1,0,3.9,Tiantan
6,300,-1.2,1025.3,-24.6,0,5.3,Tiantan
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
18,500,10.8,1014.2,-13.3,0,1.1,Tiantan
15,600,10.5,1014.4,-12.9,0,1.2,Tiantan
15,700,8.6,1014.1,-15.9,0,1.3,Tiantan


We will then perform a KNN regression on the data and analyze the strength of the effect that weather has on the concentration of pollutants. We will visualize the results through regression plots for PM 2.5 and CO, with a separate plot for each predictor in relation to each pollutant.
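
One possible layout for the per-predictor plots is a faceted scatter plot with one panel per weather variable. The sketch below is only an illustration: it uses a synthetic table `toy` with made-up values and the same column names as the cleaned data, and it shows PM 2.5 only (CO would follow the same pattern). Once a model is fitted, its predictions could be overlaid on each panel.

```r
library(tidyverse)

set.seed(1)  # arbitrary seed for the synthetic data

# Synthetic stand-in for the cleaned data (made-up values, same column names).
n <- 100
toy <- tibble(
  Temperature = runif(n, -10, 35),
  Pressure = runif(n, 990, 1035),
  Dew_Point_Temperature = Temperature - runif(n, 0, 20),
  Rain = rexp(n, rate = 5),
  Wind_Speed = runif(n, 0, 8),
  PM_2.5 = pmax(0, 80 - 8 * Wind_Speed + rnorm(n, sd = 10))
)

# Reshape so that each row holds one predictor value for one observation.
long_data <- toy |>
  pivot_longer(cols = -PM_2.5, names_to = "predictor", values_to = "value")

# One panel per predictor, each with its own x-axis range.
predictor_plot <- ggplot(long_data, aes(x = value, y = PM_2.5)) +
  geom_point(alpha = 0.4) +
  facet_wrap(vars(predictor), scales = "free_x") +
  labs(x = "Predictor value", y = "PM 2.5 concentration")

predictor_plot
```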

In [4]:
#knn regression
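
A minimal sketch of how the planned KNN regression might be set up with tidymodels is shown here. Several choices are assumptions and not decisions stated in this proposal: the 75/25 train/test split, 5-fold cross-validation, the grid of candidate K values, unweighted (`"rectangular"`) neighbours, and the seed. The sketch runs on a synthetic table `toy` with made-up values so that it works without the CSV file, and it models PM 2.5 only. The `"kknn"` engine requires the `kknn` package to be installed.

```r
library(tidymodels)

set.seed(2022)  # arbitrary seed so the split and folds are reproducible

# Synthetic stand-in for the cleaned data (made-up values, same column names).
# In the real analysis this would be replaced by tidy_data.
n <- 200
toy <- tibble(
  Temperature = runif(n, -10, 35),
  Pressure = runif(n, 990, 1035),
  Dew_Point_Temperature = Temperature - runif(n, 0, 20),
  Rain = rexp(n, rate = 5),
  Wind_Speed = runif(n, 0, 8)
) |>
  mutate(PM_2.5 = pmax(0, 80 - 8 * Wind_Speed + rnorm(n, sd = 10)))

# Hold out 25% of the rows for a final check of the tuned model.
toy_split <- initial_split(toy, prop = 0.75, strata = PM_2.5)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Standardize predictors, since KNN relies on distances between observations.
knn_recipe <- recipe(PM_2.5 ~ ., data = toy_train) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())

# KNN regression with the number of neighbours left to be tuned.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

knn_workflow <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec)

# Cross-validate over a small grid of candidate K values and keep the RMSE.
folds <- vfold_cv(toy_train, v = 5)
k_grid <- tibble(neighbors = seq(1, 31, by = 5))

cv_results <- knn_workflow |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "rmse")

# Pick the K with the lowest cross-validated RMSE.
best_k <- cv_results |>
  slice_min(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen K and score on the test set.
final_fit <- knn_workflow |>
  finalize_workflow(tibble(neighbors = best_k)) |>
  fit(data = toy_train)

test_rmse <- final_fit |>
  predict(toy_test) |>
  bind_cols(toy_test) |>
  metrics(truth = PM_2.5, estimate = .pred) |>
  filter(.metric == "rmse")

test_rmse
```

The same workflow could be repeated with `CO` as the response to address the second question.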

## Expected Outcomes and Significance

1. What do you expect to find?
We expect to find a correlation between certain meteorological conditions and the quantity of certain air pollutants, as well as differences in how different types of air pollutants respond to those conditions. For example, gaseous pollutants may not be affected as much by precipitation as particulate matter is, while gaseous pollutants may be more affected by wind speed.

2. What impact could such findings have?
Discovering the relationship between air pollutants and weather conditions may provide insight into methods of mitigating the effects of air pollution. This can help to advance active pollution reduction technologies such as carbon capture, or lead to better methods of reducing the penetration of PM 2.5 into households. Since we can tentatively predict the weather, we may be able to combine meteorology with behavioral techniques to reduce air pollution, for example by using green energy during periods of low wind speed.

3. What future questions could this lead to?
- Do weather events reduce air pollution temporarily, or remove it from the air permanently?
Certain pollutants such as PM 2.5 may be absorbed into the environment, in the way that smoke is washed into the soil and turned into ash. Other pollutants such as CO may simply evaporate back into the air when the temperature changes.

- Do air pollutants enter the environment through other routes, such as water pollution?
Certain types of toxic pollutants may be washed into the water supply, for example through acid rain.

- How might meteorological conditions be stimulated or controlled to reduce air pollution in cities?
Techniques such as cloud seeding may be able to reduce the concentration of pollutants during periods of high pollution.

