p8105_hw2_CongyangXie.Rmd

---
title: "p8105_hw2_CongyangXie"
author: "Congyang Xie"
date: "10/3/2021"
output: github_document
---
# P8105 homework2
```{r, message = FALSE}
library(tidyverse)
library(readxl)
library(lubridate)
```
## Problem 1
```{r}
file_path1 = "Trash-Wheel-Collection-Totals-7-2020-2.xlsx"
```

1. Read and clean the Mr. Trash Wheel sheet
```{r, message = FALSE}
MrTrashWheel_df <-
  read_excel(file_path1, sheet = "Mr. Trash Wheel") %>%
  janitor::clean_names() %>%
  select(-x15, -x16, -x17) %>%
  drop_na() %>%
  mutate(sports_balls = round(sports_balls))


```
2. Read and clean precipitation data for 2018 and 2019.
  + For each, omit rows without precipitation data and add a variable for year..
```{r}
precip18_df <- 
  read_excel(file_path1, sheet = "2018 Precipitation", skip = 1) %>%
  janitor::clean_names() %>%
  drop_na() %>%
  mutate(year = 2018) %>%
  relocate(year)
  

precip19_df <- 
  read_excel(file_path1, sheet = "2019 Precipitation", skip = 1) %>%
  janitor::clean_names() %>%
  drop_na() %>%
  mutate(year = 2019) %>%
  relocate(year)
```
  + Combine precipitation datasets and convert month to a character variable.
```{r}
precip_df <- 
  full_join(precip18_df, precip19_df) %>%
  arrange(year, month) %>%
  relocate(year, month) %>%
  mutate(month = month.name[month]) 
  
```
  
3. The summary of MrTrashWheel dataset:  
  + The MrTrashWheel dataset has `r ncol(MrTrashWheel_df)` variables and `r nrow(MrTrashWheel_df)` observations. The dataset stores trash collect data from each dumpster in the year of `r unique(pull(MrTrashWheel_df, year))`. The trash type includes `r names(MrTrashWheel_df)[7:13]`. The top3 trash types are stated here:
```{r echo=FALSE, message=FALSE, warning=FALSE}
MrTrashWheel_df %>% select(-year) %>% skimr::skim() %>% select(skim_variable, numeric.mean) %>% top_n(3) %>% arrange(desc(numeric.mean)) %>% knitr::kable(caption = "The top3 trash types", align = 'c')

```
  + The median number of sports balls in a dumpster in 2019 is `r MrTrashWheel_df %>% filter(year == 2019) %>% pull(sports_balls) %>% median()`.

4. The summary of precipitation data:

  + The precipitation dataset has `r ncol(precip_df)` variables and `r nrow(precip_df)` observations. The variables are: `r names(precip_df)`. The dataset stores trash precipitation data on a monthly basis collected from 2018 to 2019. The mean monthly precipitation for 2018 is stated below:

```{r echo=FALSE, message=FALSE}
precip_df %>% group_by(year) %>% summarise(mean_month_precip = mean(total)) %>% filter(year == 2018) %>% knitr::kable(caption = "mean monthly precipitation for 2018", align = "c")

``` 

  + The total precipitation in 2018 is `r precip_df %>% filter(year == 2018) %>% pull(total) %>% sum()`
  

## Problem 2 Merge FiveThirtyEight datasets into 1 dataframe.

1. clean the data in pols-month.csv.

```{r, message=FALSE}
pols_df <-
  read_csv("fivethirtyeight_datasets/pols-month.csv") %>%
  janitor::clean_names() %>%
  # Use separate() to break up the variable mon into integer variables year, month, and day
  separate(mon, c("year", "month", "day"), sep = "-") %>%
  mutate(year = as.integer(year),
         # replace month number with month name
         month = month.name[as.integer(month)],
         day = as.integer(day),
         president = ifelse(prez_dem == 1, "dem", "gop")) %>%
  select(-starts_with("prez"), -day)

```
2. clean the data in snp.csv

```{r, message=FALSE}
snp_df <-
  read_csv("fivethirtyeight_datasets/snp.csv") %>%
  janitor::clean_names() %>%
  mutate(date = mdy(date)) %>%
  # Use separate() to break up the variable date into integer variables year, month, and day
  separate(date, c("year", "month", "day"), sep = "-") %>%
  mutate(# replace month number with month name
    month = month.name[as.integer(month)],
    day = as.integer(day),
    year = as.integer(year)) %>%
  arrange(year, month) %>%
  relocate(year, month) %>%
  select(-day)
```


3. tidy the unemployment data so that it can be merged with the previous datasets
```{r, message=FALSE}
unemployment_df <-
  read_csv("fivethirtyeight_datasets/unemployment.csv") %>%
  pivot_longer(cols = 2:13,
               names_to = "month",
               values_to = "unemployment rate") %>%
  janitor::clean_names() %>%
  # convert month from abbreviation to full
  mutate(month = month.name[match(month, month.abb)])
```
4. Join the datasets by merging snp into pols, and merging unemployment into the result
```{r}
final_df <- 
  left_join(pols_df, snp_df, by = c("year", "month")) %>%
  left_join(unemployment_df, by = c("year", "month"))
```

5. Write a short paragraph about these datasets. Explain briefly what each dataset contained, and describe the resulting dataset (e.g. give the dimension, range of years, and names of key variables).

  + The pols dataset has `r ncol(pols_df)` variables and `r nrow(pols_df)` observations, containing the number of national politicians who are democratic or republican at time ranging from `r min(pull(pols_df, year))` to `r max(pull(pols_df, year))`. The variables in the dataset are:  `r names(pols_df)`.

  + The snp dataset has `r ncol(snp_df)` variables and `r nrow(snp_df)` observations, containing Standard & Poor’s stock market index (S&P) at time  ranging from `r min(pull(snp_df, year))` to `r max(pull(snp_df, year))`.The variables in the dataset are:  `r names(snp_df)`. 

  + The unemployment dataset has `r ncol(unemployment_df)` variables and `r nrow(unemployment_df)` observations, containing unemployment rate at time ranging from `r min(pull(unemployment_df, year))` to `r max(pull(unemployment_df, year))`. The variables in the dataset are:  `r names(unemployment_df)`.
  
  + The final dataset is merged by the 3 datasets above, which has `r ncol(final_df)` variables and `r nrow(final_df)` observations, containing the number of national politicians who are democratic or republican, S&P, and unemployment rate at time ranging from `r min(pull(final_df, year))` to `r max(pull(final_df, year))`. The variables in the dataset are:  `r names(final_df)`.


## Problem 3

1. Load and tidy dataset popular baby names.
```{r, message=FALSE}
popular_baby_names_df <- 
  read_csv("Popular_Baby_Names.csv") %>%
  janitor::clean_names() %>%
  mutate(
    ethnicity = recode(
      ethnicity,
      "BLACK NON HISP" = "BLACK NON HISPANIC",
      "WHITE NON HISP" = "WHITE NON HISPANIC",
      "ASIAN AND PACI" = "ASIAN AND PACIFIC ISLANDER"
    ),
    childs_first_name = toupper(childs_first_name)
  ) %>%
  distinct()
```

2. create tabel showing the rank in popularity of the name “Olivia” as a female baby name over time.
```{r}
# the rank in popularity of the name “Olivia” as a female baby name over time
popular_baby_names_df %>%
  filter(childs_first_name == "OLIVIA" & gender =="FEMALE") %>%
  arrange(year_of_birth) %>%
  select(-childs_first_name, -count) %>%
  pivot_wider(names_from = year_of_birth, values_from = rank) %>%
  knitr::kable(caption = "the rank in popularity of the name “Olivia” as a female baby name over time", align = "c")
  
```
3. create table showing the most popular name among male children over time.

```{r}
# the most popular name among male children over time
popular_baby_names_df %>%
  filter(gender =="MALE" & rank == 1) %>%
  select(-count) %>%
  arrange(year_of_birth) %>%
  pivot_wider(names_from = year_of_birth, values_from = childs_first_name) %>%
  knitr::kable(caption = "the most popular name among male children over time", align = "c")

```

4. For male, white non-hispanic children born in 2016, produce a scatter plot showing the number of children with a name (y axis) against the rank in popularity of that name (x axis).

```{r echo=TRUE}
plt_df <- popular_baby_names_df %>%
  filter(gender == "MALE" &
           ethnicity == "WHITE NON HISPANIC" &
           year_of_birth == 2016) 
# scatter plot
ggplot(plt_df, aes(x = rank, y = count)) +
  geom_point(alpha = .5)+
  labs(
    title = "Children name count - Popularity rank plot",
    x = "Name popularity rank",
    y = "Name count",
    caption = "Data from the NYC Open Data"
  ) +
  theme_classic() + 
  theme(legend.position = "bottom")
```