# Joining Data

To get information that applies to a certain pet, we match the `ID` column in the `pets` table to the `Pet_ID` column in the `owners` table.

![https://i.imgur.com/Rx6L4m1.png](https://i.imgur.com/Rx6L4m1.png)

![https://i.imgur.com/eXvIORm.png](https://i.imgur.com/eXvIORm.png)

## JOIN

Using `JOIN`, we can write a query to create a table with just two columns: the name of the pet and the name of the owner.

![https://i.imgur.com/fLlng42.png](https://i.imgur.com/fLlng42.png)

We combine information from both tables by matching rows where the `ID` column in the `pets` table matches the `Pet_ID` column in the `owners` table.

In the query, `ON` determines which column in each table to use to combine the tables. Notice that since the `ID` column exists in both tables, we have to clarify which one to use. We use `p.ID` to refer to the `ID` column from the `pets` table, and `o.Pet_ID` refers to the `Pet_ID` column from the `owners` table.

In general, when you're joining tables, it's a good habit to specify which table each of your columns comes from. That way, you don't have to pull up the schema every time you go back to read the query.
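Since the join query itself appears only as an image, here is a minimal sketch of the same kind of query. It runs against an in-memory SQLite database instead of BigQuery so that it works without credentials. Only the `ID` and `Pet_ID` columns and the pet Tom (ID 4) come from the text; the `Name` columns and every other row are illustrative assumptions.

```python
import sqlite3

# In-memory stand-in for the tutorial tables (assumed layout and made-up rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pets (ID INTEGER, Name TEXT);
    CREATE TABLE owners (ID INTEGER, Name TEXT, Pet_ID INTEGER);
    INSERT INTO pets VALUES (1, 'Biscuit'), (2, 'Moon'), (3, 'Ripley'), (4, 'Tom');
    INSERT INTO owners VALUES (1, 'Ana', 1), (2, 'Ben', 3), (3, 'Cleo', 4), (4, 'Dev', 2);
""")

# Both tables have ID and Name columns, so every column is prefixed
# with its table alias (p for pets, o for owners).
join_query = """
    SELECT p.Name AS Pet_Name, o.Name AS Owner_Name
    FROM pets AS p
    INNER JOIN owners AS o
        ON p.ID = o.Pet_ID
    ORDER BY p.ID
"""
pet_owner_rows = conn.execute(join_query).fetchall()
for pet_name, owner_name in pet_owner_rows:
    print(pet_name, "is owned by", owner_name)
```

Writing a bare `ID` in this query would be ambiguous, which is why the aliases matter even in a two-table example.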

The type of `JOIN` we're using today is called an `INNER JOIN`. A row appears in the final output table only if the value in the column used to combine the tables shows up in both tables. For example, if Tom's ID number of 4 didn't exist in the `pets` table, we would get only 3 rows back from this query. There are other types of `JOIN`, but `INNER JOIN` is very widely used, so it's a good one to start with.
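The effect of an unmatched row can be checked directly. The sketch below uses an in-memory SQLite database as a stand-in for BigQuery, with made-up rows: pet ID 4 is deliberately missing from `pets`, so the owner pointing at that pet drops out of the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pets (ID INTEGER, Name TEXT);
    CREATE TABLE owners (ID INTEGER, Name TEXT, Pet_ID INTEGER);
    -- Pet 4 is deliberately absent from pets.
    INSERT INTO pets VALUES (1, 'Biscuit'), (2, 'Moon'), (3, 'Ripley');
    INSERT INTO owners VALUES (1, 'Ana', 1), (2, 'Ben', 3), (3, 'Cleo', 4), (4, 'Dev', 2);
""")

inner_rows = conn.execute("""
    SELECT o.Name, p.Name
    FROM owners AS o
    INNER JOIN pets AS p
        ON o.Pet_ID = p.ID
    ORDER BY o.ID
""").fetchall()

# Four owners go in, but only those with a matching pet come out.
print(len(inner_rows))
print(inner_rows)
```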

In [1]:
from google.cloud import bigquery
client = bigquery.Client()
dataset_ref = client.dataset("github_repos", project="bigquery-public-data")
dataset = client.get_dataset(dataset_ref)
licenses_ref = dataset_ref.table("licenses")
licenses_table = client.get_table(licenses_ref)
client.list_rows(licenses_table, max_results=5).to_dataframe()

| | repo_name | license |
|---|---|---|
| 0 | Manwar/WWW-Google-APIDiscovery | artistic-2.0 |
| 1 | FindAllTogether/LifeIDE | artistic-2.0 |
| 2 | skaji/perl6-HTTP-Tinyish | artistic-2.0 |
| 3 | jonathanstowe/Oyatul | artistic-2.0 |
| 4 | gitpan/App-FastishCGI | artistic-2.0 |


In [2]:
files_ref = dataset_ref.table("sample_files")
files_table = client.get_table(files_ref)
client.list_rows(files_table, max_results=5).to_dataframe()

| | repo_name | ref | path | mode | id | symlink_target |
|---|---|---|---|---|---|---|
| 0 | git/git | refs/heads/master | RelNotes | 40960 | 62615ffa4e97803da96aefbc798ab50f949a8db7 | Documentation/RelNotes/2.10.0.txt |
| 1 | np/ling | refs/heads/master | tests/success/plug_compose.t/plug_compose.ll | 40960 | 0c1605e4b447158085656487dc477f7670c4bac1 | ../../../fixtures/all/plug_compose.ll |
| 2 | np/ling | refs/heads/master | fixtures/strict-par-success/parallel_assoc_lef... | 40960 | b59bff84ec03d12fabd3b51a27ed7e39a180097e | ../all/parallel_assoc_left.ll |
| 3 | np/ling | refs/heads/master | fixtures/sequence/parallel_assoc_2tensor2_left.ll | 40960 | f29523e3fb65702d99478e429eac6f801f32152b | ../all/parallel_assoc_2tensor2_left.ll |
| 4 | np/ling | refs/heads/master | fixtures/success/my_dual.ll | 40960 | 38a3af095088f90dfc956cb990e893909c3ab286 | ../all/my_dual.ll |


Next, we write a query that uses information in both tables to determine how many files are released under each license.

In [4]:
query = """
        SELECT L.license, COUNT(1) AS number_of_files
        FROM `bigquery-public-data.github_repos.sample_files` AS sf
        INNER JOIN `bigquery-public-data.github_repos.licenses` AS L 
            ON sf.repo_name = L.repo_name
        GROUP BY L.license
        ORDER BY number_of_files DESC
        """
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query, job_config=safe_config)
file_count_by_license = query_job.to_dataframe()
file_count_by_license

| | license | number_of_files |
|---|---|---|
| 0 | mit | 20432844 |
| 1 | gpl-2.0 | 16867410 |
| 2 | apache-2.0 | 7123968 |
| 3 | gpl-3.0 | 4936531 |
| 4 | bsd-3-clause | 2943900 |
| 5 | agpl-3.0 | 1293773 |
| 6 | lgpl-2.1 | 793054 |
| 7 | bsd-2-clause | 694767 |
| 8 | lgpl-3.0 | 564433 |
| 9 | mpl-2.0 | 473078 |


![https://i.imgur.com/QeufD01.png](https://i.imgur.com/QeufD01.png)

We'll begin with the `JOIN` (highlighted in blue above). This specifies the sources of data and how to join them. We use `ON` to specify that we combine the tables by matching the values in the `repo_name` columns in the tables.

Next, we'll talk about `SELECT` and `GROUP BY` (highlighted in yellow). The `GROUP BY` breaks the data into a different group for each license, before we `COUNT` the number of rows in the `sample_files` table that correspond to each license. (Remember that you can count the number of rows with `COUNT(1)`.)

Finally, the `ORDER BY` (highlighted in purple) sorts the results so that licenses with more files appear first.
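To see the three pieces work together on data small enough to check by hand, the sketch below runs the same query shape against an in-memory SQLite database. The table and column names mirror the BigQuery tables, but the rows are made up for illustration, and SQLite stands in for BigQuery only so the example runs without credentials.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample_files (repo_name TEXT, path TEXT);
    CREATE TABLE licenses (repo_name TEXT, license TEXT);
    INSERT INTO licenses VALUES ('a/one', 'mit'), ('b/two', 'mit'), ('c/three', 'gpl-2.0');
    -- 'd/four' has no row in licenses, so the inner join drops its file.
    INSERT INTO sample_files VALUES
        ('a/one', 'README.md'), ('a/one', 'main.py'),
        ('b/two', 'index.js'),
        ('c/three', 'Makefile'),
        ('d/four', 'orphan.txt');
""")

license_counts = conn.execute("""
    SELECT L.license, COUNT(1) AS number_of_files
    FROM sample_files AS sf
    INNER JOIN licenses AS L
        ON sf.repo_name = L.repo_name
    GROUP BY L.license
    ORDER BY number_of_files DESC
""").fetchall()
print(license_counts)
```

Two repositories share the `mit` license here, so their three files land in one group, which is the same aggregation the BigQuery result shows at a much larger scale.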

## Exercises

In [5]:
from google.cloud import bigquery
client = bigquery.Client()
dataset_ref = client.dataset("stackoverflow", project="bigquery-public-data")
dataset = client.get_dataset(dataset_ref)

### 1) Explore the data

Before writing queries or JOIN clauses, you'll want to see what tables are available. 

In [8]:
# Collect the table names in the dataset, then print one per line
list_of_tables = [table.table_id for table in client.list_tables(dataset)]
for table_id in list_of_tables:
    print(table_id)

badges
comments
post_history
post_links
posts_answers
posts_moderator_nomination
posts_orphaned_tag_wiki
posts_privilege_wiki
posts_questions
posts_tag_wiki
posts_tag_wiki_excerpt
posts_wiki_placeholder
stackoverflow_posts
tags
users
votes


### 2) Review relevant tables

If you are interested in people who answer questions on a given topic, the `posts_answers` table is a natural place to look. Run the following cell, and look at the output.

In [9]:
answers_table_ref = dataset_ref.table("posts_answers")
answers_table = client.get_table(answers_table_ref)
client.list_rows(answers_table, max_results=5).to_dataframe()

| | id | title | body | accepted_answer_id | answer_count | comment_count | community_owned_date | creation_date | favorite_count | last_activity_date | last_edit_date | last_editor_display_name | last_editor_user_id | owner_display_name | owner_user_id | parent_id | post_type_id | score | tags | view_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58545647 | | `<p>You can implement the <a href="https://docs...` | | | 0 | | 2019-10-24 16:35:51.947000+00:00 | | 2019-10-24 16:35:51.947000+00:00 | | | | | 2541560 | 58545487 | 2 | 0 | | |
| 1 | 58545649 | | `<p>You may be having an issue with the "stage"...` | | | 0 | | 2019-10-24 16:35:59.377000+00:00 | | 2019-10-24 16:35:59.377000+00:00 | | | | | 4434749 | 56565949 | 2 | 0 | | |
| 2 | 58545664 | | `<p>I am not sure why you need that exactly but...` | | | 0 | | 2019-10-24 16:36:39.870000+00:00 | | 2019-10-24 16:36:39.870000+00:00 | | | | | 8343843 | 58545068 | 2 | 0 | | |
| 3 | 58545675 | | `<pre><code>Object delegateObj = readField(valu...` | | | 1 | | 2019-10-24 16:37:20.207000+00:00 | | 2019-10-24 16:37:20.207000+00:00 | | | | | 12269981 | 57195785 | 2 | 0 | | |
| 4 | 58545677 | | `<p>I had to remove the line</p>\n\n<pre><code>...` | | | 0 | | 2019-10-24 16:37:51.253000+00:00 | | 2019-10-24 16:37:51.253000+00:00 | | | | | 1775258 | 58428566 | 2 | 0 | | |


In [10]:
questions_table_ref = dataset_ref.table("posts_questions")
questions_table = client.get_table(questions_table_ref)
client.list_rows(questions_table, max_results=5).to_dataframe()

| | id | title | body | accepted_answer_id | answer_count | comment_count | community_owned_date | creation_date | favorite_count | last_activity_date | last_edit_date | last_editor_display_name | last_editor_user_id | owner_display_name | owner_user_id | parent_id | post_type_id | score | tags | view_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31568634 | keytool for Android debug key giving garbage v... | `<p>I am using macbook \nI typed below code on ...` | 31569628.0 | 1 | 0 | | 2015-07-22 16:16:14.847000+00:00 | | 2015-07-22 17:05:30.597000+00:00 | NaT | | | | 2020622 | | 1 | 2 | android\|keytool | 256 |
| 1 | 31600116 | type error in sqlalchemy/flask query | `<p>I was wondering if I could get some help on...` | 31600125.0 | 1 | 0 | | 2015-07-24 00:06:03.747000+00:00 | | 2015-07-27 19:32:02.880000+00:00 | 2015-07-27 19:32:02.880000+00:00 | | 5149754.0 | | 5149754 | | 1 | 1 | python\|flask\|sqlalchemy | 256 |
| 2 | 31616794 | file.isDirectory() returning false for directory | `<p>I am trying to display images stored in 'Pi...` | | 1 | 0 | | 2015-07-24 17:50:41.087000+00:00 | | 2015-08-02 17:41:52.860000+00:00 | NaT | | | | 4943245 | | 1 | 0 | listview | 256 |
| 3 | 31622328 | Responsive CSS Sprite (top to bottom sprite) | `<p>I am looking for a responsive sprite. I was...` | | 1 | 0 | | 2015-07-25 02:39:44.193000+00:00 | | 2015-07-25 10:32:49.033000+00:00 | 2015-07-25 02:46:42.330000+00:00 | | 5154415.0 | | 5154415 | | 1 | 0 | html\|css\|responsive-design\|sprite | 256 |
| 4 | 31648312 | Spring Webflow AttributeMap doesn't apply defa... | `<p>Regarding Spring Webflow 2.4.1.RELEASE.</p>...` | | 0 | 0 | | 2015-07-27 08:31:17.307000+00:00 | | 2015-07-30 10:27:07.910000+00:00 | 2015-07-30 10:27:07.910000+00:00 | | 2976062.0 | | 2976062 | | 1 | 1 | java\|spring-webflow | 256 |


### 3) Selecting the right questions

A lot of this data is text.

We'll explore one last technique in this course which you can apply to this text.

A `WHERE` clause can limit your results to rows with certain text using the `LIKE` operator. For example, to select just the third row of the `pets` table from the tutorial, we could use the query in the picture below.

![https://i.imgur.com/RccsXBr.png](https://i.imgur.com/RccsXBr.png)

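You can also use `%` as a "wildcard" that matches any number of characters, including none. For example, the pattern `'%bigquery%'` matches any string that contains `bigquery`.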
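The difference between a pattern with and without wildcards is easy to check on a few rows. The sketch below uses an in-memory SQLite table with made-up questions as a stand-in for `posts_questions`; the helper `matching_ids` is a hypothetical name introduced for this example. One caveat: SQLite's `LIKE` ignores case for ASCII letters by default, while BigQuery's is case-sensitive, so the example sticks to lowercase text where both behave the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts_questions (id INTEGER, title TEXT, tags TEXT);
    INSERT INTO posts_questions VALUES
        (1, 'Load JSON', 'google-bigquery|json'),
        (2, 'Flask type error', 'python|flask'),
        (3, 'Legacy SQL', 'bigquery'),
        (4, 'Joins in pandas', 'python|pandas');
""")

def matching_ids(pattern):
    """Return the ids of questions whose tags match a LIKE pattern."""
    rows = conn.execute(
        "SELECT id FROM posts_questions WHERE tags LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]

exact = matching_ids("bigquery")        # no wildcard: the whole string must match
contains = matching_ids("%bigquery%")   # wildcard on both sides: substring match
prefix = matching_ids("python%")        # wildcard at the end: prefix match
print(exact, contains, prefix)
```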

In [11]:
questions_query = """
                  SELECT id, title, owner_user_id
                  FROM `bigquery-public-data.stackoverflow.posts_questions`
                  WHERE tags LIKE '%bigquery%'
                  """
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
questions_query_job = client.query(questions_query, job_config=safe_config)
questions_results = questions_query_job.to_dataframe()
print(questions_results.head())

         id                                              title  owner_user_id
0  32011252             Schema to load JSON to Google BigQuery      3118765.0
1  31945303  BigQuery: TABLE_QUERY but columns differ betwe...      2286166.0
2  31779174       Combine hundreds of bigquery tables into one       730901.0
3  31768131  How to "ignore" missing columns in a bigquery ...      2254391.0
4  31934590             Parsing response from Google big query      1699730.0


### 4) Your first join

Write a query that returns the `id`, `body` and `owner_user_id` columns from the `posts_answers` table for answers to "bigquery"-related questions.

In [15]:
answers_query = """
                SELECT a.id, a.body, a.owner_user_id
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS q 
                INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                    ON q.id = a.parent_id
                WHERE q.tags LIKE '%bigquery%'
                """
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**11)
answers_query_job = client.query(answers_query, job_config=safe_config)
answers_results = answers_query_job.to_dataframe()
answers_results.head()

| | id | body | owner_user_id |
|---|---|---|---|
| 0 | 45542973 | `<p>You can write subquery which will result de...` | 1654021.0 |
| 1 | 45548183 | `<p>select \ndays,\nexact_count_distinct(user) ...` | 8130742.0 |
| 2 | 45554026 | `<p>Could you be missing the credentials or hav...` | 2607220.0 |
| 3 | 45578913 | `<p><a href="https://github.com/bomboradata/pub...` | 384554.0 |
| 4 | 45664808 | `<p>Recently I fixed a similar problem by speci...` | 634627.0 |


### 5) Answer the question

Write a new query that has a single row for each user who answered at least one question with a tag that includes the string "bigquery".

In [16]:
bigquery_experts_query = """
                SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS q 
                INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                    ON q.id = a.parent_id
                WHERE q.tags LIKE '%bigquery%'
                GROUP BY user_id
                """
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
bigquery_experts_query_job = client.query(bigquery_experts_query, job_config=safe_config)
bigquery_experts_results = bigquery_experts_query_job.to_dataframe()
print(bigquery_experts_results.head())

     user_id  number_of_answers
0  7567628.0                  1
1  1184156.0                  1
2   796963.0                  1
3   483567.0                  2
4  3777211.0                  7


### 6) Building a more generally useful service

How could you convert what you've done to a general function a website could call on the backend to get experts on any topic? 

In [19]:
def expert_finder(topic, client):
    '''
    Returns a DataFrame with the user IDs who have written Stack Overflow answers on a topic.

    Inputs:
        topic: A string with the topic of interest
        client: A Client object that specifies the connection to the Stack Overflow dataset

    Outputs:
        results: A DataFrame with columns for user_id and number_of_answers. Follows similar logic to bigquery_experts_results shown above.
    '''
    my_query = """
               SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
               FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
               INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                   ON q.id = a.parent_id
               WHERE q.tags LIKE CONCAT('%', @topic, '%')
               GROUP BY a.owner_user_id
               """

    # Set up the query (a real service would have good error handling for
    # queries that scan too much data). The topic is bound as a named query
    # parameter: a plain (non-f) string would leave '{topic}' in the SQL
    # unchanged, and pasting user input into SQL text is unsafe.
    safe_config = bigquery.QueryJobConfig(
        maximum_bytes_billed=10**10,
        query_parameters=[bigquery.ScalarQueryParameter("topic", "STRING", topic)],
    )
    my_query_job = client.query(my_query, job_config=safe_config)

    # API request - run the query, and return a pandas DataFrame
    results = my_query_job.to_dataframe()

    return results
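
To check the grouping logic of a function like `expert_finder` without BigQuery access, one option is a small local stand-in. In the sketch below, `expert_finder_local`, the table contents, and the use of SQLite with its `?` placeholder are all assumptions made for illustration, not part of the exercise. The point it shows is that the topic reaches the database as a bound value and is never pasted into the SQL text.

```python
import sqlite3

def expert_finder_local(topic, conn):
    """Hypothetical SQLite analogue of expert_finder, for local testing only."""
    query = """
        SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
        FROM posts_questions AS q
        INNER JOIN posts_answers AS a
            ON q.id = a.parent_id
        WHERE q.tags LIKE '%' || ? || '%'
        GROUP BY a.owner_user_id
        ORDER BY number_of_answers DESC, user_id
    """
    # The topic is supplied as a bound parameter, not formatted into the string.
    return conn.execute(query, (topic,)).fetchall()

# Made-up rows: user 100 answers two bigquery questions, user 200 answers one,
# and user 300 answers only a python question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts_questions (id INTEGER, tags TEXT);
    CREATE TABLE posts_answers (id INTEGER, parent_id INTEGER, owner_user_id INTEGER);
    INSERT INTO posts_questions VALUES (1, 'google-bigquery|sql'), (2, 'python'), (3, 'bigquery');
    INSERT INTO posts_answers VALUES (10, 1, 100), (11, 3, 100), (12, 3, 200), (13, 2, 300);
""")
experts = expert_finder_local("bigquery", conn)
print(experts)
```

The same function can be called with any other topic string, which is what makes the query reusable as a backend service.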