# Section 04: Case Study: Joins on Stack Overflow Data
### `01-Left joining questions and tags`
- Join together `questions` and `question_tags` using the `id` and `question_id` columns, respectively.
- Use another join to add in the `tags` table.
- Use `replace_na` to change the `NA`s in the `tag_name` column to `"only-r"`.

Note that we use `left_join()` in this exercise so that every question is kept, even one with no corresponding tag. Such questions end up with `NA` in `tag_name`. Because every question in the `questions` data is about R, we label these untagged questions `"only-r"` with `replace_na()`.
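The behaviour described above can be sketched on a few made-up rows. This is only an illustration: the `toy_*` tables and their values are hypothetical and merely mirror the column layout of the real data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy tables (not the real Stack Overflow data)
toy_questions <- data.frame(id = c(1, 2, 3), score = c(5, 0, 2))
toy_question_tags <- data.frame(question_id = c(1, 1, 2),
                                tag_id = c(10, 20, 10))
toy_tags <- data.frame(id = c(10, 20),
                       tag_name = c("ggplot2", "dplyr"),
                       stringsAsFactors = FALSE)

toy_joined <- toy_questions %>%
  # Question 3 has no row in toy_question_tags, so its tag_id becomes NA
  left_join(toy_question_tags, by = c("id" = "question_id")) %>%
  # An NA tag_id matches nothing in toy_tags, so tag_name is NA as well
  left_join(toy_tags, by = c("tag_id" = "id")) %>%
  # Label the untagged question instead of leaving NA
  replace_na(list(tag_name = "only-r"))

toy_joined
```

Question 1 appears twice (once per tag), while question 3 survives the joins with a missing `tag_id` and the `"only-r"` label. With an `inner_join()` in place of the first `left_join()`, question 3 would be dropped.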

In [4]:
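library(tidyverse)  # attaches dplyr and tidyr, among other packages
library(dplyr)      # already attached by tidyverse
library(tidyr)      # already attached by tidyverse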

In [5]:
# Forward slashes work on Windows as well as on macOS and Linux
tags <- read.csv("../00_Datasets/tags.csv", header = TRUE)
questions <- read.csv("../00_Datasets/questions.csv", header = TRUE)
question_tags <- read.csv("../00_Datasets/question_tags.csv", header = TRUE)


In [6]:
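# Keep every question, attach its tags, and label untagged questions "only-r"
questions_with_tags <- questions %>%
  # Add tag IDs; questions without a tag get NA
  left_join(question_tags, by = c("id" = "question_id")) %>%
  # Look up the tag name for each tag ID
  left_join(tags, by = c("tag_id" = "id")) %>%
  # Replace the NAs in the tag_name column
  replace_na(list(tag_name = "only-r"))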

head(questions_with_tags)

,id,creation_date,score,tag_id,tag_name
,<chr>,<chr>,<int>,<int>,<chr>
1,22557677,3/21/2014,1,18,regex
2,22557677,3/21/2014,1,139,string
3,22557677,3/21/2014,1,16088,time-complexity
4,22557677,3/21/2014,1,1672,backreference
5,22557677,3/21/2014,1,18,regex
6,22557677,3/21/2014,1,139,string


### `02-Comparing scores across tags`
- Aggregate by the `tag_name`.
- Summarize to get the mean score for each tag, `score`, as well as the number of questions carrying that tag, `num_questions`.
- Arrange by `num_questions` in descending order so that the most frequently used tags come first.
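The steps above can be tried on a small hand-made table first. This is a minimal sketch: `toy_joined` is a hypothetical stand-in for `questions_with_tags`, with values chosen so the result is easy to check by hand.

```r
library(dplyr)

# Hypothetical stand-in for questions_with_tags: one row per question-tag pair
toy_joined <- data.frame(
  id = c(1, 1, 2, 3),
  score = c(5, 5, 0, 2),
  tag_name = c("ggplot2", "dplyr", "ggplot2", "only-r"),
  stringsAsFactors = FALSE
)

toy_summary <- toy_joined %>%
  # One group per tag
  group_by(tag_name) %>%
  # Mean score within the tag, and the number of rows in the group
  summarize(score = mean(score),
            num_questions = n()) %>%
  # Most frequent tags first
  arrange(desc(num_questions))

toy_summary
```

Here `ggplot2` comes first with two questions and a mean score of 2.5. Note that `n()` counts rows in each group, so it equals the number of questions only when each question-tag pair appears once.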

In [7]:
questions_with_tags %>% 
  # Group by tag_name
  group_by(tag_name) %>%
  # Get mean score and num_questions
  summarize(score = mean(score),
            num_questions = n()) %>%
  # Sort num_questions in descending order
  arrange(desc(num_questions))

tag_name,score,num_questions
<chr>,<dbl>,<int>
only-r,1.1727341,64168
ggplot2,2.3311985,42627
dataframe,2.2220842,28183
dplyr,1.6645284,23227
shiny,1.2840644,22685
plot,2.1490154,15844
data.table,2.5548955,12961
matrix,1.5877412,8810
loops,0.7105331,7728
regex,1.7853900,7269


### `03-What tags never appear on R questions?`
- Use a join to determine which tags never appear on an R question.
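One join that fits this task is `anti_join()`, which keeps the rows of the left table that have no match in the right table. A minimal sketch on hypothetical toy tables (the `toy_*` names and values are made up for illustration):

```r
library(dplyr)

# Hypothetical toy tables (not the real Stack Overflow data)
toy_tags <- data.frame(id = c(10, 20, 30),
                       tag_name = c("ggplot2", "dplyr", "laravel"),
                       stringsAsFactors = FALSE)
toy_question_tags <- data.frame(question_id = c(1, 1, 2),
                                tag_id = c(10, 20, 10))

# Keep only the tags whose id never appears as a tag_id in toy_question_tags
unused_tags <- toy_tags %>%
  anti_join(toy_question_tags, by = c("id" = "tag_id"))

unused_tags
```

Only the `laravel` row remains, because tag 30 is attached to no question. An anti join is a filtering join: it never adds columns from the right table, so the result has the same columns as `toy_tags`.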

In [10]:
# Using a join, filter for tags that are never on an R question
tags %>%
  anti_join(question_tags, by = c("id" = "tag_id"))

id,tag_name
<int>,<chr>
124399,laravel-dusk
124402,spring-cloud-vault-config
124404,spring-vault
124405,apache-bahir
124407,astc
124408,simulacrum
124410,angulartics2
124411,django-rest-viewsets
124414,react-native-lightbox
124417,java-module
