/
Data_exploration_Bibata.Rmd
68 lines (48 loc) · 1.46 KB
/
Data_exploration_Bibata.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
---
title: "Data Exporation"
output: html_document
date: "2023-06-21"
---
```{r Setup}
library(tidyverse)
library(here)
library(lubridate)
here()
city_day_data <- read_csv("city_day_agg_cleaned.csv.gz")
country_day_data <- read_csv("country_day_agg_cleaned.csv.gz")
```
Looking at the data we have two tables with the following features:
#### city_day_data (Features)
- country
- countryCode
- city_id
- date
- parameter
- mean
#### country_data_data (Features)
- countryCode
- date
- parameter
- mean
```{r Data Preparation}
wider_city_data <- city_day_data %>%
pivot_wider(names_from=parameter, values_from=mean) %>%
select(-c(no2, o3)) %>%
na.omit() %>%
rename(city_average_pm25=pm25)
wider_country_data <- country_day_data %>%
pivot_wider(names_from=parameter, values_from=mean) %>%
select(-c(no2, o3)) %>%
na.omit() %>%
rename(country_average_pm25=pm25)
wider_city_data %>%
left_join(wider_country_data, by=c("countryCode", "date")) %>%
mutate(year_period=as.factor(ifelse(year(date) < 2020, "3-year", "2020"))) %>%
group_by(countryCode, city_id, year_period) %>%
summarize(mean_pm25=mean(city_average_pm25)) %>%
group_by(year_period) %>%
mutate(median_2020=median(mean_pm25)) %>%
ggplot(aes(x=mean_pm25, color=year_period)) +
geom_density(aes(fill=year_period, alpha=0.5)) +
geom_vline(aes(xintercept = median_2020, color=year_period))
```