-
Notifications
You must be signed in to change notification settings - Fork 0
/
p8105_hw2_CongyangXie.Rmd
205 lines (169 loc) · 7.96 KB
/
p8105_hw2_CongyangXie.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
---
title: "p8105_hw2_CongyangXie"
author: "Congyang Xie"
date: "10/3/2021"
output: github_document
---
# P8105 homework2
```{r, message = FALSE}
library(tidyverse)
library(readxl)
library(lubridate)
```
## Problem 1
```{r}
file_path1 = "Trash-Wheel-Collection-Totals-7-2020-2.xlsx"
```
1. Read and clean the Mr. Trash Wheel sheet
```{r, message = FALSE}
MrTrashWheel_df <-
read_excel(file_path1, sheet = "Mr. Trash Wheel") %>%
janitor::clean_names() %>%
select(-x15, -x16, -x17) %>%
drop_na() %>%
mutate(sports_balls = round(sports_balls))
```
2. Read and clean precipitation data for 2018 and 2019.
+ For each, omit rows without precipitation data and add a variable for year..
```{r}
precip18_df <-
read_excel(file_path1, sheet = "2018 Precipitation", skip = 1) %>%
janitor::clean_names() %>%
drop_na() %>%
mutate(year = 2018) %>%
relocate(year)
precip19_df <-
read_excel(file_path1, sheet = "2019 Precipitation", skip = 1) %>%
janitor::clean_names() %>%
drop_na() %>%
mutate(year = 2019) %>%
relocate(year)
```
+ Combine precipitation datasets and convert month to a character variable.
```{r}
precip_df <-
full_join(precip18_df, precip19_df) %>%
arrange(year, month) %>%
relocate(year, month) %>%
mutate(month = month.name[month])
```
3. The summary of MrTrashWheel dataset:
+ The MrTrashWheel dataset has `r ncol(MrTrashWheel_df)` variables and `r nrow(MrTrashWheel_df)` observations. The dataset stores trash collect data from each dumpster in the year of `r unique(pull(MrTrashWheel_df, year))`. The trash type includes `r names(MrTrashWheel_df)[7:13]`. The top3 trash types are stated here:
```{r echo=FALSE, message=FALSE, warning=FALSE}
MrTrashWheel_df %>% select(-year) %>% skimr::skim() %>% select(skim_variable, numeric.mean) %>% top_n(3) %>% arrange(desc(numeric.mean)) %>% knitr::kable(caption = "The top3 trash types", align = 'c')
```
+ The median number of sports balls in a dumpster in 2019 is `r MrTrashWheel_df %>% filter(year == 2019) %>% pull(sports_balls) %>% median()`.
4. The summary of precipitation data:
+ The precipitation dataset has `r ncol(precip_df)` variables and `r nrow(precip_df)` observations. The variables are: `r names(precip_df)`. The dataset stores trash precipitation data on a monthly basis collected from 2018 to 2019. The mean monthly precipitation for 2018 is stated below:
```{r echo=FALSE, message=FALSE}
precip_df %>% group_by(year) %>% summarise(mean_month_precip = mean(total)) %>% filter(year == 2018) %>% knitr::kable(caption = "mean monthly precipitation for 2018", align = "c")
```
+ The total precipitation in 2018 is `r precip_df %>% filter(year == 2018) %>% pull(total) %>% sum()`
## Problem 2 Merge FiveThirtyEight datasets into 1 dataframe.
1. clean the data in pols-month.csv.
```{r, message=FALSE}
pols_df <-
read_csv("fivethirtyeight_datasets/pols-month.csv") %>%
janitor::clean_names() %>%
# Use separate() to break up the variable mon into integer variables year, month, and day
separate(mon, c("year", "month", "day"), sep = "-") %>%
mutate(year = as.integer(year),
# replace month number with month name
month = month.name[as.integer(month)],
day = as.integer(day),
president = ifelse(prez_dem == 1, "dem", "gop")) %>%
select(-starts_with("prez"), -day)
```
2. clean the data in snp.csv
```{r, message=FALSE}
snp_df <-
read_csv("fivethirtyeight_datasets/snp.csv") %>%
janitor::clean_names() %>%
mutate(date = mdy(date)) %>%
# Use separate() to break up the variable date into integer variables year, month, and day
separate(date, c("year", "month", "day"), sep = "-") %>%
mutate(# replace month number with month name
month = month.name[as.integer(month)],
day = as.integer(day),
year = as.integer(year)) %>%
arrange(year, month) %>%
relocate(year, month) %>%
select(-day)
```
3. tidy the unemployment data so that it can be merged with the previous datasets
```{r, message=FALSE}
unemployment_df <-
read_csv("fivethirtyeight_datasets/unemployment.csv") %>%
pivot_longer(cols = 2:13,
names_to = "month",
values_to = "unemployment rate") %>%
janitor::clean_names() %>%
# convert month from abbreviation to full
mutate(month = month.name[match(month, month.abb)])
```
4. Join the datasets by merging snp into pols, and merging unemployment into the result
```{r}
final_df <-
left_join(pols_df, snp_df, by = c("year", "month")) %>%
left_join(unemployment_df, by = c("year", "month"))
```
5. Write a short paragraph about these datasets. Explain briefly what each dataset contained, and describe the resulting dataset (e.g. give the dimension, range of years, and names of key variables).
+ The pols dataset has `r ncol(pols_df)` variables and `r nrow(pols_df)` observations, containing the number of national politicians who are democratic or republican at time ranging from `r min(pull(pols_df, year))` to `r max(pull(pols_df, year))`. The variables in the dataset are: `r names(pols_df)`.
+ The snp dataset has `r ncol(snp_df)` variables and `r nrow(snp_df)` observations, containing Standard & Poor’s stock market index (S&P) at time ranging from `r min(pull(snp_df, year))` to `r max(pull(snp_df, year))`.The variables in the dataset are: `r names(snp_df)`.
+ The unemployment dataset has `r ncol(unemployment_df)` variables and `r nrow(unemployment_df)` observations, containing unemployment rate at time ranging from `r min(pull(unemployment_df, year))` to `r max(pull(unemployment_df, year))`. The variables in the dataset are: `r names(unemployment_df)`.
+ The final dataset is merged by the 3 datasets above, which has `r ncol(final_df)` variables and `r nrow(final_df)` observations, containing the number of national politicians who are democratic or republican, S&P, and unemployment rate at time ranging from `r min(pull(final_df, year))` to `r max(pull(final_df, year))`. The variables in the dataset are: `r names(final_df)`.
## Problem 3
1. Load and tidy dataset popular baby names.
```{r, message=FALSE}
popular_baby_names_df <-
read_csv("Popular_Baby_Names.csv") %>%
janitor::clean_names() %>%
mutate(
ethnicity = recode(
ethnicity,
"BLACK NON HISP" = "BLACK NON HISPANIC",
"WHITE NON HISP" = "WHITE NON HISPANIC",
"ASIAN AND PACI" = "ASIAN AND PACIFIC ISLANDER"
),
childs_first_name = toupper(childs_first_name)
) %>%
distinct()
```
2. create tabel showing the rank in popularity of the name “Olivia” as a female baby name over time.
```{r}
# the rank in popularity of the name “Olivia” as a female baby name over time
popular_baby_names_df %>%
filter(childs_first_name == "OLIVIA" & gender =="FEMALE") %>%
arrange(year_of_birth) %>%
select(-childs_first_name, -count) %>%
pivot_wider(names_from = year_of_birth, values_from = rank) %>%
knitr::kable(caption = "the rank in popularity of the name “Olivia” as a female baby name over time", align = "c")
```
3. create table showing the most popular name among male children over time.
```{r}
# the most popular name among male children over time
popular_baby_names_df %>%
filter(gender =="MALE" & rank == 1) %>%
select(-count) %>%
arrange(year_of_birth) %>%
pivot_wider(names_from = year_of_birth, values_from = childs_first_name) %>%
knitr::kable(caption = "the most popular name among male children over time", align = "c")
```
4. For male, white non-hispanic children born in 2016, produce a scatter plot showing the number of children with a name (y axis) against the rank in popularity of that name (x axis).
```{r echo=TRUE}
plt_df <- popular_baby_names_df %>%
filter(gender == "MALE" &
ethnicity == "WHITE NON HISPANIC" &
year_of_birth == 2016)
# scatter plot
ggplot(plt_df, aes(x = rank, y = count)) +
geom_point(alpha = .5)+
labs(
title = "Children name count - Popularity rank plot",
x = "Name popularity rank",
y = "Name count",
caption = "Data from the NYC Open Data"
) +
theme_classic() +
theme(legend.position = "bottom")
```