# JOINs and UNIONs

Throughout this tutorial, we'll work with two imaginary tables, called `owners` and `pets`.

![https://i.imgur.com/dYVwS4T.png](https://i.imgur.com/dYVwS4T.png)

Each row of the `owners` table identifies a different pet owner, where the `ID` column is a unique identifier. The `Pet_ID` column (in the `owners` table) contains the ID for the pet that belongs to the owner (this number matches the ID for the pet from the `pets` table).

## JOINs

Recall that we can use an `INNER JOIN` to pull rows from both tables where the value in the `Pet_ID` column in the `owners` table has a match in the `ID` column of the `pets` table.

![https://i.imgur.com/C5wimKT.png](https://i.imgur.com/C5wimKT.png)

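An `INNER JOIN` drops any row that has no match in the other table. To create a table containing all rows from the `owners` table, whether or not they have a matching pet, we use a `LEFT JOIN`. Here, "left" refers to the table that appears before the `JOIN` in the query.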

![https://i.imgur.com/tnOqw2S.png](https://i.imgur.com/tnOqw2S.png)
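The difference between the two joins can be tried out locally. Below is a minimal sketch using Python's built-in `sqlite3` module rather than BigQuery. The rows are hypothetical toy data, not the values shown in the figures, and the `Animal` column is an assumption made for illustration.

```python
import sqlite3

# Hypothetical toy data (not the values from the figures).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE owners (ID INTEGER, Name TEXT, Pet_ID INTEGER);
    CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT);
    INSERT INTO owners VALUES (1, 'Ana', 10), (2, 'Ben', 20), (3, 'Cy', NULL);
    INSERT INTO pets VALUES (10, 'Rex', 'Dog'), (20, 'Tom', 'Cat'), (30, 'Moon', 'Rabbit');
""")

# INNER JOIN: only owners whose Pet_ID matches a pet ID.
inner_rows = conn.execute("""
    SELECT o.Name, p.Name
    FROM owners AS o
    INNER JOIN pets AS p ON o.Pet_ID = p.ID
    ORDER BY o.ID
""").fetchall()

# LEFT JOIN: every owner; the pet name is NULL (None in Python) without a match.
left_rows = conn.execute("""
    SELECT o.Name, p.Name
    FROM owners AS o
    LEFT JOIN pets AS p ON o.Pet_ID = p.ID
    ORDER BY o.ID
""").fetchall()

print(inner_rows)
print(left_rows)
```

In this toy data, the owner with no pet is missing from the `INNER JOIN` result but is kept by the `LEFT JOIN`, with `None` in place of the pet name.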

If we instead use a `RIGHT JOIN`, we get the matching rows, along with all rows in the right table (whether there is a match or not).

Finally, a `FULL JOIN` returns all rows from both tables. Note that in general, any row that does not have a match in both tables will have `NULL` entries for the missing values. You can see this in the image below.

![https://i.imgur.com/1Dvmg8S.png](https://i.imgur.com/1Dvmg8S.png)
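To see where the `NULL` entries come from, here is a small sketch on hypothetical toy data using Python's built-in `sqlite3` module. Native `FULL JOIN` support only arrived in relatively recent SQLite releases, so this sketch emulates it: a `LEFT JOIN` keeps every owner, and a second query appends the pets that no owner points to. In BigQuery you would simply write `FULL JOIN`.

```python
import sqlite3

# Hypothetical toy data (not the values from the figures).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE owners (ID INTEGER, Name TEXT, Pet_ID INTEGER);
    CREATE TABLE pets (ID INTEGER, Name TEXT);
    INSERT INTO owners VALUES (1, 'Ana', 10), (2, 'Ben', 20), (3, 'Cy', NULL);
    INSERT INTO pets VALUES (10, 'Rex'), (20, 'Tom'), (30, 'Moon');
""")

# Emulated FULL JOIN: all owners (with or without a pet),
# plus all pets that have no owner.
full_rows = conn.execute("""
    SELECT o.Name AS owner_name, p.Name AS pet_name
    FROM owners AS o
    LEFT JOIN pets AS p ON o.Pet_ID = p.ID
    UNION ALL
    SELECT NULL, p.Name
    FROM pets AS p
    WHERE NOT EXISTS (SELECT 1 FROM owners AS o WHERE o.Pet_ID = p.ID)
""").fetchall()

for row in full_rows:
    print(row)
```

Rows that exist on only one side carry `None` (SQL `NULL`) in the columns that would have come from the other table.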

## UNIONs

As you've seen, `JOINs` horizontally combine results from different tables. If you instead would like to vertically concatenate columns, you can do so with a `UNION`.

![https://i.imgur.com/oa6VDig.png](https://i.imgur.com/oa6VDig.png)

Note that with a `UNION`, the data types of both columns must be the same, but the column names can be different.

We use `UNION ALL` to include duplicate values - you'll notice that 9 appears in both the owners table and the pets table, and shows up twice in the concatenated results. If you'd like to drop duplicate values, you need only change `UNION ALL` in the query to `UNION DISTINCT`.
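As a rough illustration of the difference, the sketch below runs both variants on hypothetical toy tables with Python's built-in `sqlite3` module. The `Age` column and its values are assumptions chosen so that 9 appears in both tables. Note that SQLite spells the deduplicating form as plain `UNION`, whereas BigQuery expects `UNION DISTINCT`.

```python
import sqlite3

# Hypothetical toy data: the value 9 appears in both tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE owners (ID INTEGER, Age INTEGER);
    CREATE TABLE pets (ID INTEGER, Age INTEGER);
    INSERT INTO owners VALUES (1, 20), (2, 45), (3, 9);
    INSERT INTO pets VALUES (1, 9), (2, 4);
""")

# UNION ALL keeps duplicates.
all_ages = [row[0] for row in conn.execute(
    "SELECT Age FROM owners UNION ALL SELECT Age FROM pets")]

# UNION (UNION DISTINCT in BigQuery) drops duplicates.
distinct_ages = [row[0] for row in conn.execute(
    "SELECT Age FROM owners UNION SELECT Age FROM pets")]

print(sorted(all_ages))
print(sorted(distinct_ages))
```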

In [1]:
from google.cloud import bigquery
client = bigquery.Client()
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")
dataset = client.get_dataset(dataset_ref)
table_ref = dataset_ref.table("comments")
table = client.get_table(table_ref)
client.list_rows(table, max_results=5).to_dataframe()

Unnamed: 0,id,by,author,time,time_ts,text,parent,deleted,dead,ranking
0,2701393,5l,5l,1309184881,2011-06-27 14:28:01+00:00,And the glazier who fixed all the broken windo...,2701243,,,0
1,5811403,99,99,1370234048,2013-06-03 04:34:08+00:00,Does canada have the equivalent of H1B/Green c...,5804452,,,0
2,21623,AF,AF,1178992400,2007-05-12 17:53:20+00:00,"Speaking of Rails, there are other options in ...",21611,,,0
3,10159727,EA,EA,1441206574,2015-09-02 15:09:34+00:00,Humans and large livestock (and maybe even pet...,10159396,,,0
4,2988424,Iv,Iv,1315853580,2011-09-12 18:53:00+00:00,I must say I reacted in the same way when I re...,2988179,,,0


In [2]:
table_ref = dataset_ref.table("stories")
table = client.get_table(table_ref)
client.list_rows(table, max_results=5).to_dataframe()

Unnamed: 0,id,by,score,time,time_ts,title,url,text,deleted,dead,descendants,author
0,6940813,sarath237,0,1387536270,2013-12-20 10:44:30+00:00,Sheryl Brindo Hot Pics,http://www.youtube.com/watch?v=ym1cyxneB0Y,Sheryl Brindo Hot Pics,,True,,sarath237
1,6991401,123123321321,0,1388508751,2013-12-31 16:52:31+00:00,Are you people also put off by the culture of ...,,They&#x27;re pretty explicitly &#x27;startup f...,,True,,123123321321
2,1531556,ssn,0,1279617234,2010-07-20 09:13:54+00:00,New UI for Google Image Search,http://googlesystem.blogspot.com/2010/07/googl...,Again following on Bing's lead.,,,0.0,ssn
3,5012398,hoju,0,1357387877,2013-01-05 12:11:17+00:00,Historic website screenshots,http://webscraping.com/blog/Generate-website-s...,Python script to generate historic screenshots...,,,0.0,hoju
4,7214182,kogir,0,1401561740,2014-05-31 18:42:20+00:00,Placeholder,,Mind the gap.,,,0.0,kogir


The query below pulls information from the `stories` and `comments` tables to create a table showing all stories posted on January 1, 2012, along with the corresponding number of comments. We use a `LEFT JOIN` so that the results include stories that didn't receive any comments.

In [3]:
join_query = """
             WITH c AS
             (
             SELECT parent, COUNT(*) as num_comments
             FROM `bigquery-public-data.hacker_news.comments` 
             GROUP BY parent
             )
             SELECT s.id as story_id, s.by, s.title, c.num_comments
             FROM `bigquery-public-data.hacker_news.stories` AS s
             LEFT JOIN c
             ON s.id = c.parent
             WHERE EXTRACT(DATE FROM s.time_ts) = '2012-01-01'
             ORDER BY c.num_comments DESC
             """
join_result = client.query(join_query).result().to_dataframe()
join_result.head()

Unnamed: 0,story_id,by,title,num_comments
0,3412900,whoishiring,Ask HN: Who is Hiring? (January 2012),154.0
1,3412901,whoishiring,Ask HN: Freelancer? Seeking freelancer? (Janua...,97.0
2,3412643,jemeshsu,Avoid Apress,30.0
3,3412891,Brajeshwar,"There's no shame in code that is simply ""good ...",27.0
4,3414012,ramanujam,Impress.js - a Prezi like implementation using...,27.0


Since the results are ordered by the `num_comments` column, stories without comments appear at the end of the DataFrame. (Remember that `NaN` stands for "not a number".)

In [4]:
join_result.tail()

Unnamed: 0,story_id,by,title,num_comments
439,3414105,sabmayahai,Saudi Universities Offer Cash in Exchange for ...,
440,3414116,theproductguy,Happy New Year Product Management in 2012,
441,3413481,FluidDjango,A Toast To Technology,
442,3413256,microcon,Newcastle vs Man Utd Live Stream 4 January 2012,
443,3413234,aksharajanu,URDU SEX STORIES: Doodh Or Kelaa,
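The same pattern (aggregate in a `WITH` clause, then `LEFT JOIN`) can be reproduced on a small scale. The sketch below uses Python's built-in `sqlite3` module with hypothetical toy `stories` and `comments` tables. The extra `COALESCE` column is an optional variation that is not part of the query above: it shows one way to report 0 instead of a missing value. In plain Python the missing count shows up as `None`; pandas displays it as `NaN` in a numeric column.

```python
import sqlite3

# Hypothetical toy data: story 3 has no comments.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stories (id INTEGER, title TEXT);
    CREATE TABLE comments (id INTEGER, parent INTEGER);
    INSERT INTO stories VALUES (1, 'Story A'), (2, 'Story B'), (3, 'Story C');
    INSERT INTO comments VALUES (101, 1), (102, 1), (103, 2);
""")

rows = conn.execute("""
    WITH c AS (
        SELECT parent, COUNT(*) AS num_comments
        FROM comments
        GROUP BY parent
    )
    SELECT s.id, s.title, c.num_comments,
           COALESCE(c.num_comments, 0) AS num_comments_filled
    FROM stories AS s
    LEFT JOIN c ON s.id = c.parent
    ORDER BY c.num_comments DESC
""").fetchall()

for row in rows:
    print(row)
```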


Next, we write a query to select all usernames corresponding to users who wrote stories or comments on January 1, 2014. We use `UNION DISTINCT` (instead of `UNION ALL`) to ensure that each user appears in the table at most once.

In [5]:
union_query = """
              SELECT c.by
              FROM `bigquery-public-data.hacker_news.comments` AS c
              WHERE EXTRACT(DATE FROM c.time_ts) = '2014-01-01'
              UNION DISTINCT
              SELECT s.by
              FROM `bigquery-public-data.hacker_news.stories` AS s
              WHERE EXTRACT(DATE FROM s.time_ts) = '2014-01-01'
              """
union_result = client.query(union_query).result().to_dataframe()
union_result.head()

Unnamed: 0,by
0,jeassonlens
1,adamcoomes
2,Bootvis
3,adeyemiadisa
4,purzelrakete


To get the number of users who posted on January 1, 2014, we need only take the length of the DataFrame.

In [6]:
len(union_result)

2282

## Exercises

In [9]:
from google.cloud import bigquery
client = bigquery.Client()
dataset_ref = client.dataset("stackoverflow", project="bigquery-public-data")
dataset = client.get_dataset(dataset_ref)
table_ref = dataset_ref.table("posts_questions")
table = client.get_table(table_ref)
client.list_rows(table, max_results=5).to_dataframe()

Unnamed: 0,id,title,body,accepted_answer_id,answer_count,comment_count,community_owned_date,creation_date,favorite_count,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,parent_id,post_type_id,score,tags,view_count
0,31568634,keytool for Android debug key giving garbage v...,<p>I am using macbook \nI typed below code on ...,31569628.0,1,0,,2015-07-22 16:16:14.847000+00:00,,2015-07-22 17:05:30.597000+00:00,NaT,,,,2020622,,1,2,android|keytool,256
1,31600116,type error in sqlalchemy/flask query,<p>I was wondering if I could get some help on...,31600125.0,1,0,,2015-07-24 00:06:03.747000+00:00,,2015-07-27 19:32:02.880000+00:00,2015-07-27 19:32:02.880000+00:00,,5149754.0,,5149754,,1,1,python|flask|sqlalchemy,256
2,31616794,file.isDirectory() returning false for directory,<p>I am trying to display images stored in 'Pi...,,1,0,,2015-07-24 17:50:41.087000+00:00,,2015-08-02 17:41:52.860000+00:00,NaT,,,,4943245,,1,0,listview,256
3,31622328,Responsive CSS Sprite (top to bottom sprite),<p>I am looking for a responsive sprite. I was...,,1,0,,2015-07-25 02:39:44.193000+00:00,,2015-07-25 10:32:49.033000+00:00,2015-07-25 02:46:42.330000+00:00,,5154415.0,,5154415,,1,0,html|css|responsive-design|sprite,256
4,31648312,Spring Webflow AttributeMap doesn't apply defa...,<p>Regarding Spring Webflow 2.4.1.RELEASE.</p>...,,0,0,,2015-07-27 08:31:17.307000+00:00,,2015-07-30 10:27:07.910000+00:00,2015-07-30 10:27:07.910000+00:00,,2976062.0,,2976062,,1,1,java|spring-webflow,256


In [10]:
table_ref = dataset_ref.table("posts_answers")
table = client.get_table(table_ref)
client.list_rows(table, max_results=5).to_dataframe()

Unnamed: 0,id,title,body,accepted_answer_id,answer_count,comment_count,community_owned_date,creation_date,favorite_count,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,parent_id,post_type_id,score,tags,view_count
0,58545647,,"<p>You can implement the <a href=""https://docs...",,,0,,2019-10-24 16:35:51.947000+00:00,,2019-10-24 16:35:51.947000+00:00,,,,,2541560,58545487,2,0,,
1,58545649,,"<p>You may be having an issue with the ""stage""...",,,0,,2019-10-24 16:35:59.377000+00:00,,2019-10-24 16:35:59.377000+00:00,,,,,4434749,56565949,2,0,,
2,58545664,,<p>I am not sure why you need that exactly but...,,,0,,2019-10-24 16:36:39.870000+00:00,,2019-10-24 16:36:39.870000+00:00,,,,,8343843,58545068,2,0,,
3,58545675,,<pre><code>Object delegateObj = readField(valu...,,,1,,2019-10-24 16:37:20.207000+00:00,,2019-10-24 16:37:20.207000+00:00,,,,,12269981,57195785,2,0,,
4,58545677,,<p>I had to remove the line</p>\n\n<pre><code>...,,,0,,2019-10-24 16:37:51.253000+00:00,,2019-10-24 16:37:51.253000+00:00,,,,,1775258,58428566,2,0,,


### 1) How long does it take for questions to receive answers?

You're interested in exploring the data to better understand how long it generally takes for questions to receive answers. You plan to use this knowledge to better design the order in which questions are presented to Stack Overflow users.

In [11]:
first_query = """
              SELECT q.id AS q_id,
                  MIN(TIMESTAMP_DIFF(a.creation_date, q.creation_date, SECOND)) as time_to_answer
              FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                  INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
              ON q.id = a.parent_id
              WHERE q.creation_date >= '2018-01-01' and q.creation_date < '2018-02-01'
              GROUP BY q_id
              ORDER BY time_to_answer
              """

first_result = client.query(first_query).result().to_dataframe()
print("Percentage of answered questions: %s%%" % \
      (sum(first_result["time_to_answer"].notnull()) / len(first_result) * 100))
print("Number of questions:", len(first_result))
first_result.head()

Percentage of answered questions: 100.0%
Number of questions: 134227


Unnamed: 0,q_id,time_to_answer
0,48100614,0
1,48541142,0
2,48221678,0
3,48552682,0
4,48553125,0


You're surprised at the results and strongly suspect that something is wrong with your query. In particular:

1. According to the query, 100% of the questions from January 2018 received an answer. But you know that ~80% of the questions on the site usually receive an answer.
2. The total number of questions is surprisingly low. You expected to see at least 150,000 questions represented in the table.

Given these observations, you think that the type of `JOIN` you have chosen has inadvertently excluded unanswered questions. Using the code cell below, can you figure out what type of `JOIN` to use to fix the problem so that the table includes unanswered questions?

In [13]:
correct_query = """
              SELECT q.id AS q_id,
                  MIN(TIMESTAMP_DIFF(a.creation_date, q.creation_date, SECOND)) as time_to_answer
              FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                  LEFT JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
              ON q.id = a.parent_id
              WHERE q.creation_date >= '2018-01-01' and q.creation_date < '2018-02-01'
              GROUP BY q_id
              ORDER BY time_to_answer
              """
correct_result = client.query(correct_query).result().to_dataframe()
print("Percentage of answered questions: %s%%" % \
      (sum(correct_result["time_to_answer"].notnull()) / len(correct_result) * 100))
print("Number of questions:", len(correct_result))

Percentage of answered questions: 82.4009331164247%
Number of questions: 162895
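To see why the `INNER JOIN` version always reports 100%, here is a minimal sketch on hypothetical toy data using Python's built-in `sqlite3` module. The table names, the integer `created` column (seconds), and the helper `answered_share` are assumptions made for illustration. Plain subtraction stands in for BigQuery's `TIMESTAMP_DIFF`.

```python
import sqlite3

# Hypothetical toy data: question 4 has no answers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER, created INTEGER);
    CREATE TABLE answers (id INTEGER, parent_id INTEGER, created INTEGER);
    INSERT INTO questions VALUES (1, 0), (2, 100), (3, 200), (4, 300);
    INSERT INTO answers VALUES (11, 1, 60), (12, 1, 30), (13, 2, 400), (14, 3, 260);
""")

def answered_share(join_type):
    """Return (number of questions, percentage answered) for a join type."""
    rows = conn.execute(f"""
        SELECT q.id, MIN(a.created - q.created) AS time_to_answer
        FROM questions AS q
        {join_type} JOIN answers AS a ON q.id = a.parent_id
        GROUP BY q.id
    """).fetchall()
    answered = sum(1 for _, t in rows if t is not None)
    return len(rows), 100 * answered / len(rows)

inner_stats = answered_share("INNER")
left_stats = answered_share("LEFT")
print("INNER JOIN:", inner_stats)
print("LEFT JOIN: ", left_stats)
```

With the `INNER JOIN`, the unanswered question never reaches the result, so both the row count and the percentage are misleading. The `LEFT JOIN` keeps it with a `NULL` time to answer.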


### 2) Initial questions and answers, Part 1

You're interested in understanding the initial experiences that users typically have with the Stack Overflow website. Is it more common for users to first ask questions or provide answers? After signing up, how long does it take for users to first interact with the website? 

You want to keep track of users who have asked questions, but have yet to provide answers. And, your table should also include users who have answered questions, but have yet to pose their own questions.

With this in mind, please fill in the appropriate `JOIN` (i.e., `INNER`, `LEFT`, `RIGHT`, or `FULL`) to return the correct information. 

To avoid returning too much data, we'll restrict our attention to questions and answers posed in January 2019. We'll amend the timeframe in Part 2 of this question to be more realistic!

In [14]:
q_and_a_query = """
                SELECT q.owner_user_id AS owner_user_id,
                    MIN(q.creation_date) AS q_creation_date,
                    MIN(a.creation_date) AS a_creation_date
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                    FULL JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                ON q.owner_user_id = a.owner_user_id 
                WHERE q.creation_date >= '2019-01-01' AND q.creation_date < '2019-02-01' 
                    AND a.creation_date >= '2019-01-01' AND a.creation_date < '2019-02-01'
                GROUP BY owner_user_id
                """

### 3) Initial questions and answers, Part 2

Now you'll address a more realistic (and complex!) scenario. To answer this question, you'll need to pull information from three different tables! The syntax is very similar to the case where we join only two tables. For instance, consider the three tables below.

![https://i.imgur.com/OyhYtD1.png](https://i.imgur.com/OyhYtD1.png)

We can use two different `JOINs` to link together information from all three tables, in a single query.

![https://i.imgur.com/G6buS7P.png](https://i.imgur.com/G6buS7P.png)
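As a rough sketch of chaining two joins, the example below uses Python's built-in `sqlite3` module on three hypothetical toy tables. The table names, columns, and rows are assumptions for illustration and do not reproduce the figures.

```python
import sqlite3

# Hypothetical toy data: only the first pet has a toy.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE owners (ID INTEGER, Name TEXT, Pet_ID INTEGER);
    CREATE TABLE pets (ID INTEGER, Name TEXT);
    CREATE TABLE toys (ID INTEGER, Type TEXT, Pet_ID INTEGER);
    INSERT INTO owners VALUES (1, 'Ana', 10), (2, 'Ben', 20);
    INSERT INTO pets VALUES (10, 'Rex'), (20, 'Tom');
    INSERT INTO toys VALUES (100, 'Ball', 10);
""")

# The second JOIN is applied to the result of the first one.
rows = conn.execute("""
    SELECT o.Name, p.Name, t.Type
    FROM owners AS o
    LEFT JOIN pets AS p ON o.Pet_ID = p.ID
    LEFT JOIN toys AS t ON p.ID = t.Pet_ID
    ORDER BY o.ID
""").fetchall()

for row in rows:
    print(row)
```

Each additional `JOIN` gets its own `ON` clause, and the join types can be mixed as needed.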

With this in mind, say you're interested in understanding users who joined the site in January 2019. You want to track their activity on the site: when did they post their first questions and answers, if ever?

In [15]:
three_tables_query = """
    SELECT u.id AS id,
        MIN(q.creation_date) AS q_creation_date,
        MIN(a.creation_date) AS a_creation_date 
    FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                         FULL JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                             ON q.owner_user_id = a.owner_user_id 
                         RIGHT JOIN `bigquery-public-data.stackoverflow.users` AS u
                             ON q.owner_user_id = u.id
    WHERE u.creation_date >= '2019-01-01' AND u.creation_date < '2019-02-01'  
    GROUP BY id
    """

### 4) How many distinct users posted on January 1, 2019?

In the code cell below, write a query that returns a table with a single column: `owner_user_id` - the IDs of all users who posted at least one question or answer on January 1, 2019. Each user ID should appear at most once.

In [16]:
all_users_query = """
    SELECT owner_user_id
    FROM `bigquery-public-data.stackoverflow.posts_questions`
    WHERE EXTRACT(DATE FROM creation_date) = '2019-01-01'
    UNION DISTINCT
    SELECT owner_user_id
    FROM `bigquery-public-data.stackoverflow.posts_answers`
    WHERE EXTRACT(DATE FROM creation_date) = '2019-01-01'
"""