In [1]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/Users/anastasiyashabunevich/Desktop/Kaggle/ashabunevich_key.json"

In [2]:
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "stackoverflow" dataset
dataset_ref = client.dataset("stackoverflow", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

### 1) Explore the data

Before writing queries or **JOIN** clauses, you'll want to see what tables are available.

In [3]:
# Get a list of available tables 
tables = list(client.list_tables(dataset))
list_of_tables = [table.table_id for table in tables] 

# Print your answer
print(list_of_tables)

['badges', 'comments', 'post_history', 'post_links', 'posts_answers', 'posts_moderator_nomination', 'posts_orphaned_tag_wiki', 'posts_privilege_wiki', 'posts_questions', 'posts_tag_wiki', 'posts_tag_wiki_excerpt', 'posts_wiki_placeholder', 'stackoverflow_posts', 'tags', 'users', 'votes']


### 2) Review relevant tables

If you are interested in people who answer questions on a given topic, the `posts_answers` table is a natural place to look. Run the following cell, and look at the output.

In [4]:
# Construct a reference to the "posts_answers" table
answers_table_ref = dataset_ref.table("posts_answers")

# API request - fetch the table
answers_table = client.get_table(answers_table_ref)

# Preview the first five lines of the "posts_answers" table
client.list_rows(answers_table, max_results=5).to_dataframe()

Unnamed: 0,id,title,body,accepted_answer_id,answer_count,comment_count,community_owned_date,creation_date,favorite_count,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,parent_id,post_type_id,score,tags,view_count
0,64548693,,<p>My workaround without ejecting:</p>\n<ol>\n...,,,0,NaT,2020-10-27 05:20:53.417000+00:00,,2020-10-27 05:20:53.417000+00:00,NaT,,,,9428719,55821078,2,0,,
1,64548694,,<p>The execution thread may be secluded on dif...,,,1,NaT,2020-10-27 05:21:15.397000+00:00,,2020-10-27 05:21:15.397000+00:00,NaT,,,,13927193,33876455,2,0,,
2,64548698,,"<p><code>vw</code> is well supported, so can b...",,,0,NaT,2020-10-27 05:22:37.873000+00:00,,2020-10-27 05:22:37.873000+00:00,NaT,,,,8942566,64548101,2,0,,
3,64548705,,<p>This could be simple. Please check that you...,,,0,NaT,2020-10-27 05:23:43.193000+00:00,,2020-10-27 05:23:43.193000+00:00,NaT,,,,13832463,64276190,2,0,,
4,64548730,,<p>Install the flutter plugin on android studi...,,,3,NaT,2020-10-27 05:26:53.640000+00:00,,2020-10-27 05:26:53.640000+00:00,NaT,,,,11211493,64443398,2,0,,


It isn't clear yet how to find users who answered questions on any given topic. But `posts_answers` has a `parent_id` column. If you are familiar with the Stack Overflow site, you might figure out that `parent_id` is the ID of the question each answer responds to.

Look at `posts_questions` using the cell below.

In [5]:
# Construct a reference to the "posts_questions" table
questions_table_ref = dataset_ref.table("posts_questions")

# API request - fetch the table
questions_table = client.get_table(questions_table_ref)

# Preview the first five lines of the "posts_questions" table
client.list_rows(questions_table, max_results=5).to_dataframe()

Unnamed: 0,id,title,body,accepted_answer_id,answer_count,comment_count,community_owned_date,creation_date,favorite_count,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,parent_id,post_type_id,score,tags,view_count
0,12040975,File read lines blue screen windows,<p>What problem in this code?? It crash my win...,,1,2,NaT,2012-08-20 15:51:31.053000+00:00,,2013-03-28 11:06:44.740000+00:00,NaT,,,,719323,,1,0,windows-7|python-2.7,256
1,12045786,How to insert a UITableViewCell at the beginni...,"<p>I am trying to set up a UITableView, with x...",12047459.0,3,0,NaT,2012-08-20 21:56:03.590000+00:00,,2012-08-21 05:13:07.037000+00:00,2012-08-21 05:13:07.037000+00:00,,23897.0,,748343,,1,0,iphone|ios|uitableview,256
2,12073211,Composite-Component within a form that require...,<p>A quick background : I have put a captcha u...,14285905.0,1,4,NaT,2012-08-22 12:30:37.910000+00:00,,2013-01-11 20:11:41.810000+00:00,2017-05-23 12:20:08.800000+00:00,,-1.0,,686036,,1,0,java|forms|jsf-2|composite-component,256
3,12077718,Accessing non-static grid element in windows p...,<p>I've got a page:</p>\n\n<pre><code>&lt;phon...,12099231.0,1,0,NaT,2012-08-22 16:44:00.593000+00:00,,2012-08-23 20:11:53.943000+00:00,NaT,,,,1279293,,1,0,c#|windows-phone,256
4,12086962,mssql/php - Looping through resultset and perf...,<p>I am fairly new to MSSQL and have never use...,,1,0,NaT,2012-08-23 07:44:27.517000+00:00,,2012-08-30 05:38:19.107000+00:00,NaT,,,,1001034,,1,0,php|sql-server|in-memory,256


Are there any fields that identify what topic or technology each question is about? If so, how could you find the IDs of users who answered questions about a specific topic?

- Solution: `posts_questions` has a `tags` column that lists the topics and technologies each question is about. `posts_answers` has a `parent_id` column that identifies the question each answer responds to, and an `owner_user_id` column that holds the ID of the user who wrote the answer.

- You can join the two tables to determine the `tags` for each answer, then select the `owner_user_id` of the answers on the desired tag.
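
The lookup described in this solution can be sketched in plain Python before writing any SQL. The rows below are made-up stand-ins for the two tables (not real Stack Overflow data), and the variable names are only illustrative.

```python
# Toy stand-ins for the two tables (made-up rows, not real Stack Overflow data).
questions = [
    {"id": 1, "tags": "python|pandas"},
    {"id": 2, "tags": "google-bigquery|sql"},
    {"id": 3, "tags": "bigquery-sql"},
]
answers = [
    {"id": 10, "parent_id": 1, "owner_user_id": 100},
    {"id": 11, "parent_id": 2, "owner_user_id": 200},
    {"id": 12, "parent_id": 3, "owner_user_id": 300},
    {"id": 13, "parent_id": 2, "owner_user_id": 300},
]

# Step 1: look up each question's tags by its id.
tags_by_question = {q["id"]: q["tags"] for q in questions}

# Step 2: keep the answers whose parent question mentions the topic,
# and collect the ID of the user who wrote each of them.
topic = "bigquery"
answerers = [
    a["owner_user_id"]
    for a in answers
    if topic in tags_by_question.get(a["parent_id"], "")
]
print(answerers)  # [200, 300, 300]
```

User 300 appears twice because they answered two matching questions. A SQL join produces the same one-row-per-answer shape.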



### 3) Selecting the right questions

Write a query that selects the `id`, `title` and `owner_user_id` columns from the `posts_questions` table. 
- Restrict the results to rows that contain the word "bigquery" in the `tags` column. 
- Include rows where there is other text in addition to the word "bigquery" (e.g., if a row has a tag "bigquery-sql", your results should include that too).

In [6]:
questions_query = """
                  SELECT id, title, owner_user_id
                  FROM `bigquery-public-data.stackoverflow.posts_questions`
                  WHERE tags LIKE '%bigquery%'
                  """

# Set up the query (cancel the query if it would bill more than 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
questions_query_job = client.query(questions_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
questions_results = questions_query_job.to_dataframe()

# Preview results
print(questions_results.head())

         id                                              title  owner_user_id
0  64870863  I want to save spark JavaRDD data to bigquery ...      3020641.0
1  65163522  How can I translate an ENUM from my Avro schem...      2299087.0
2  64869110  How to see all shared queries in the "Project ...     14646421.0
3  64964489  PybigQuery: Slice a query beyond certain bytes...      8176451.0
4  65149611  Update specific BQ partition or data range wit...     13157288.0
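
To see which tag strings a pattern like `'%bigquery%'` matches, the same `WHERE` clause can be tried on a tiny in-memory SQLite table. This is only a sketch: the rows are invented, and SQLite stands in for BigQuery here. One difference to keep in mind is that SQLite's `LIKE` ignores ASCII case by default, while BigQuery's `LIKE` is case-sensitive, so the toy tags are kept lowercase.

```python
import sqlite3

# In-memory toy table with invented rows (not the real dataset).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts_questions (id INTEGER, tags TEXT)")
conn.executemany(
    "INSERT INTO posts_questions VALUES (?, ?)",
    [
        (1, "google-bigquery|sql"),
        (2, "bigquery-sql"),
        (3, "python|pandas"),
        (4, "bigquery"),
    ],
)

# % matches any run of characters, so the pattern finds "bigquery"
# anywhere inside the tags string.
matching_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM posts_questions WHERE tags LIKE '%bigquery%' ORDER BY id"
    )
]
print(matching_ids)  # [1, 2, 4]
conn.close()
```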


### 4) Your first join
Now that you have a query to select questions on any given topic (in this case, you chose "bigquery"), you can find the answers to those questions with a **JOIN**.  

Write a query that returns the `id`, `body` and `owner_user_id` columns from the `posts_answers` table for answers to "bigquery"-related questions. 
- You should have one row in your results for each answer to a question that has "bigquery" in the tags.  
- Remember you can get the tags for a question from the `tags` column in the `posts_questions` table.

In [11]:
answers_query = """
                SELECT a.id, a.body, a.owner_user_id
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS q 
                INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                    ON q.id = a.parent_id
                WHERE q.tags LIKE '%bigquery%'
                """

# Set up the query (cancel the query if it would bill more than about 26.4 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=26404192256)
answers_query_job = client.query(answers_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
answers_results = answers_query_job.to_dataframe()

# Preview results
print(answers_results.head())

         id                                               body  owner_user_id
0  32381210  <p>For uniquePageViews you better want to use ...      4274130.0
1  32460356  <p>Not knowing the internals of BigQuery, I wo...      2417948.0
2  32460605  <p>JOIN EACH should be used when your table yo...      2417948.0
3  32463035  <p>This question is a feature request which wo...      4270992.0
4  32565498  <p>How you can achieve by-</p>\n\n<pre><code>S...      3291973.0
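
The value passed to `maximum_bytes_billed` is a raw byte count, which is hard to read at a glance. The helper below, `bytes_to_gb`, is a hypothetical convenience function (not part of the BigQuery client) that converts the two limits used in this notebook to decimal gigabytes.

```python
def bytes_to_gb(n_bytes):
    """Convert a raw byte count to decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 10**9


limit_small = 10**10       # limit used for the single-table and grouped queries
limit_join = 26404192256   # limit used for the answers query

print(bytes_to_gb(limit_small))          # 10.0
print(round(bytes_to_gb(limit_join), 1)) # 26.4
```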


### 5) Answer the question
You have the join you need. But you want a list of users who have answered many questions, which requires more work beyond your previous result.

Write a new query that has a single row for each user who answered at least one question with a tag that includes the string "bigquery". Your results should have two columns:
- `user_id` - contains the `owner_user_id` column from the `posts_answers` table
- `number_of_answers` - contains the number of answers the user has written to "bigquery"-related questions

In [8]:

bigquery_experts_query = """
                         SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
                         FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                         INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                             ON q.id = a.parent_id
                         WHERE q.tags LIKE '%bigquery%'
                         GROUP BY a.owner_user_id
                         """

# Set up the query (cancel the query if it would bill more than 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
bigquery_experts_query_job = client.query(bigquery_experts_query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
bigquery_experts_results = bigquery_experts_query_job.to_dataframe()

# Preview results
print(bigquery_experts_results.head())

      user_id  number_of_answers
0  11329071.0                  3
1   4182296.0                  1
2   4522678.0                  1
3   9741888.0                 18
4   2666739.0                  1
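
The way `INNER JOIN`, `WHERE` and `GROUP BY` combine in this query is easier to follow on a handful of rows. The sketch below runs the same query shape against an in-memory SQLite database with invented data. The `ORDER BY` clause is an addition for readability and is not part of the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts_questions (id INTEGER, tags TEXT);
    CREATE TABLE posts_answers (id INTEGER, parent_id INTEGER, owner_user_id INTEGER);
    INSERT INTO posts_questions VALUES
        (1, 'google-bigquery'), (2, 'python'), (3, 'bigquery-sql');
    INSERT INTO posts_answers VALUES
        (10, 1, 100), (11, 1, 200), (12, 2, 100), (13, 3, 100), (14, 3, 300);
    """
)

# Join answers to their questions, keep the "bigquery" ones,
# then count the surviving answers per user.
rows = conn.execute(
    """
    SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
    FROM posts_questions AS q
    INNER JOIN posts_answers AS a
        ON q.id = a.parent_id
    WHERE q.tags LIKE '%bigquery%'
    GROUP BY a.owner_user_id
    ORDER BY number_of_answers DESC, user_id
    """
).fetchall()
print(rows)  # [(100, 2), (200, 1), (300, 1)]
conn.close()
```

User 100 also answered the `python` question, but that answer is filtered out before the count, so only two answers are attributed to them.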


### 6) Building a more generally useful service

How could you convert what you've done to a general function a website could call on the backend to get experts on any topic?  

In [None]:

def expert_finder(topic, client):
    '''
    Returns a DataFrame with the user IDs who have written Stack Overflow answers on a topic.

    Inputs:
        topic: A string with the topic of interest
        client: A Client object that specifies the connection to the Stack Overflow dataset

    Outputs:
        results: A DataFrame with columns for user_id and number_of_answers. Follows similar logic to bigquery_experts_results shown above.
    '''
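    # The topic is passed as a query parameter (@topic) so that it is actually
    # substituted into the query and cannot alter the SQL text.
    my_query = """
               SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
               FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
               INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                   ON q.id = a.parent_id
               WHERE q.tags LIKE CONCAT('%', @topic, '%')
               GROUP BY a.owner_user_id
               """

    # Set up the query (cancel the query if it would bill more than 10 GB)
    safe_config = bigquery.QueryJobConfig(
        maximum_bytes_billed=10**10,
        query_parameters=[bigquery.ScalarQueryParameter("topic", "STRING", topic)],
    )
    my_query_job = client.query(my_query, job_config=safe_config)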

    # API request - run the query, and return a pandas DataFrame
    results = my_query_job.to_dataframe()

    return results
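
One way to sanity-check the logic of a function like this without a BigQuery connection is to run the same query shape against an in-memory SQLite database. The sketch below is a toy analogue, not the notebook's function: `expert_finder_sqlite` is a hypothetical name, the tables hold invented rows, and the topic is bound as a query parameter rather than pasted into the SQL string.

```python
import sqlite3


def expert_finder_sqlite(topic, conn):
    """Toy analogue of expert_finder that runs against a SQLite connection.

    Returns a list of (user_id, number_of_answers) tuples.
    """
    query = """
        SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
        FROM posts_questions AS q
        INNER JOIN posts_answers AS a
            ON q.id = a.parent_id
        WHERE q.tags LIKE '%' || ? || '%'
        GROUP BY a.owner_user_id
        ORDER BY user_id
    """
    # The topic is bound to the ? placeholder instead of being
    # formatted into the SQL text.
    return conn.execute(query, (topic,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts_questions (id INTEGER, tags TEXT);
    CREATE TABLE posts_answers (id INTEGER, parent_id INTEGER, owner_user_id INTEGER);
    INSERT INTO posts_questions VALUES (1, 'google-bigquery'), (2, 'python|pandas');
    INSERT INTO posts_answers VALUES (10, 1, 100), (11, 1, 200), (12, 2, 100);
    """
)

bigquery_experts = expert_finder_sqlite("bigquery", conn)
python_experts = expert_finder_sqlite("python", conn)
print(bigquery_experts)  # [(100, 1), (200, 1)]
print(python_experts)    # [(100, 1)]
conn.close()
```

Binding the topic as a parameter means the same function works for any topic string and the caller's input never becomes part of the SQL text.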
