# Intro

Stack Overflow (stackoverflow.com) is a widely beloved Question and Answer site for technical questions. You'll probably use it yourself as you keep using SQL (or any programming language). 

Their data is publicly available. What useful things do you think you could build with it?

Here's one idea:
You could set up a service that identifies the Stack Overflow users who have demonstrated expertise with a specific technology by answering related questions about it, so someone could hire those experts for in-depth help.

In this exercise, you'll write the SQL queries that might serve as the foundation for this type of service.

As usual, run the following cell to set up our feedback system before moving on.

In [None]:
# Set up feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.sql.ex6 import *

# import package with helper functions 
import bq_helper

# create a helper object for this dataset
stack_overflow = bq_helper.BigQueryHelper(active_project="bigquery-public-data",
                                              dataset_name="stackoverflow")

# Questions

# 1) Explore the Data

Before writing queries or **JOIN** clauses, you'll want to see what tables are available. 

This may be a good time to practice **tab completion** for when you don't remember command names. If you type `stack_overflow.` and then hit tab, you will see a list of methods for the `stack_overflow` object. (Don't forget the dot before hitting tab.)
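
If tab completion is unavailable in your environment, Python's built-in `dir` gives a similar listing. The sketch below uses a hypothetical stand-in class (`FakeHelper`, with made-up method names that are not part of `bq_helper`) so that it runs anywhere; in the notebook you would pass `stack_overflow` instead.

```python
class FakeHelper:
    """Hypothetical stand-in for a helper object; not the real bq_helper API."""

    def list_things(self):
        return ["a", "b"]

    def head(self, name):
        return name


def public_methods(obj):
    """Return names of callable attributes that don't start with an underscore."""
    return [
        name
        for name in dir(obj)
        if not name.startswith("_") and callable(getattr(obj, name))
    ]


helper = FakeHelper()
print(public_methods(helper))
```

`dir` returns names in alphabetical order, which is roughly what the tab-completion menu shows.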

In [None]:
# Your code here

list_of_tables = ____    # get a list of available tables

print(list_of_tables)
q_1.check()

In [None]:
# q_1.solution()

# 2) Review Relevant Tables

If you are interested in people who answer questions on a given topic, the `posts_answers` table is a natural place to look. Run the following cell and look at the output.

In [None]:
stack_overflow.head('posts_answers')

It isn't clear yet how to find the users who answered questions on any given topic. But `posts_answers` has a `parent_id` column. If you are familiar with the Stack Overflow site, you might figure out that the `parent_id` is the ID of the question each post is answering.

Look at `posts_questions` using the line below.

In [None]:
stack_overflow.head('posts_questions')

Are there any fields that identify what topic or technology each question is about?

If so, how could you find the user IDs of users who answered questions about a specific topic?

Think about it, then check the solution by running the code in the next cell.

In [None]:
#q_2.solution()

# 3) Selecting The Right Questions

A lot of this data is text. 

Here is one last technique you'll learn in this course which you can apply to this text:

A **WHERE** clause can limit your results to rows with certain text using the **LIKE** operator. For example, to select just the third row of the `pets` table, we would write

`SELECT * FROM PETS WHERE NAME LIKE 'Ripley'`

![](https://i.imgur.com/Ef4Puo3.png)

You can also use `%` as a "wildcard" that matches any number of characters. So you can also get the third row with

`SELECT * FROM PETS WHERE NAME LIKE '%ipl%'`
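
To try **LIKE** without scanning any BigQuery data, here is a minimal sketch that runs both queries against an in-memory SQLite table. The table contents are made-up sample rows (only `Ripley` in the third row comes from the text above), and SQLite stands in for BigQuery purely for illustration. One difference worth knowing: SQLite's `LIKE` ignores case for ASCII letters by default, while BigQuery's `LIKE` is case-sensitive.

```python
import sqlite3

# Toy stand-in for the `pets` table; the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (id INTEGER, name TEXT, animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Bonkers", "Rabbit"),
        (2, "Moon", "Dog"),
        (3, "Ripley", "Cat"),
        (4, "Tom", "Cat"),
    ],
)

# No wildcard: the pattern must match the whole value.
exact = conn.execute("SELECT name FROM pets WHERE name LIKE 'Ripley'").fetchall()

# % on both sides: matches any value containing "ipl".
wildcard = conn.execute("SELECT name FROM pets WHERE name LIKE '%ipl%'").fetchall()

# % only at the end: matches any value starting with "T".
prefix = conn.execute("SELECT name FROM pets WHERE name LIKE 'T%'").fetchall()

print(exact, wildcard, prefix)
```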

Try this yourself.
Before finding users who have answered questions, write a query that selects the `id`, `title` and `owner_user_id` from the `posts_questions` table. Restrict the results to rows that contain the word **bigquery** in the `tags` column. Include rows where there is other text in addition to the word `bigquery` (e.g. if a row has a tag `bigquery-sql`, your results should include that too).

In [None]:
# Your code here
questions_query = \
"""
SELECT ____
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE ____
"""

questions_results = stack_overflow.query_to_pandas_safe(questions_query, max_gb_scanned=25) # this query reads a lot of data
print(questions_results.head())
q_3.check()


In [None]:
# q_3.hint()
# q_3.solution()

# 4) Your First Join
Now that you have a query to select questions on any given topic (in this case, you chose `bigquery`), you can find the answers to those questions with a **JOIN**.  

Write a SQL query that returns the `id`, `body` and `owner_user_id` from the `posts_answers` table for answers to `bigquery`-related questions. That is, you should have one row in your results for each answer to a question that has `bigquery` in its tags.

Here's a reminder of what a **JOIN** looked like in the tutorial:
```
SELECT p.Name AS Pet_Name, o.Name as Owner_Name
FROM `bigquery-public-data.pet_records.pets` as p
INNER JOIN `bigquery-public-data.pet_records.owners` as o ON p.ID = o.Pet_ID
```
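
The same join can be run locally as a quick sanity check. This is a sketch on an in-memory SQLite database with made-up rows, not the BigQuery dataset named in the reminder; the table and column names mirror the tutorial query, everything else is assumed for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pets (ID INTEGER, Name TEXT);
    CREATE TABLE owners (ID INTEGER, Name TEXT, Pet_ID INTEGER);
    INSERT INTO pets VALUES (1, 'Ripley'), (2, 'Moon'), (3, 'Tom');
    -- Cy points at a pet ID that does not exist in `pets`.
    INSERT INTO owners VALUES (10, 'Ana', 1), (11, 'Ben', 2), (12, 'Cy', 99);
    """
)

join_query = """
SELECT p.Name AS Pet_Name, o.Name AS Owner_Name
FROM pets AS p
INNER JOIN owners AS o ON p.ID = o.Pet_ID
ORDER BY p.ID
"""

rows = conn.execute(join_query).fetchall()
print(rows)
```

An **INNER JOIN** keeps only rows that have a match on both sides, so `Tom` (no owner) and `Cy` (no matching pet) are both missing from the result.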

It may be useful to scroll up and review the results from when you called **head** on `posts_answers` and `posts_questions`.  

In [None]:
from time import time


answers_query = \
"""
____
"""

answers_results = stack_overflow.query_to_pandas_safe(answers_query, max_gb_scanned=50) # query scans more than 1GB of data, but less than 2.
print(answers_results.head())
q_4.check()

In [None]:
# q_4.hint()
# q_4.solution()

# 5) Answer The Question
You have the merge you need. But you want a list of users who have answered many questions... which requires more work beyond your previous result.

Write a new query that selects data from the `posts_questions` and `posts_answers` tables. The results should have a single row for each user who answered at least one question with a tag that includes the string `bigquery`. Each row in your results should have two columns:
- a column called `user_id` that contains the `owner_user_id` from the `posts_answers` table
- a column called `number_of_answers` that contains the number of answers the user has written to `bigquery` questions
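
The general pattern here is a **JOIN**, filtered with **WHERE ... LIKE**, then collapsed to one row per group with **GROUP BY** and **COUNT**. The sketch below shows that pattern on a deliberately different toy problem (counting cats per owner) in an in-memory SQLite database. All table names, columns, and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE owners (id INTEGER, name TEXT);
    CREATE TABLE pets (id INTEGER, name TEXT, animal TEXT, owner_id INTEGER);
    INSERT INTO owners VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO pets VALUES
        (1, 'Ripley', 'cat', 1),
        (2, 'Tom', 'cat', 1),
        (3, 'Moon', 'dog', 2),
        (4, 'Misty', 'wildcat', 2),
        (5, 'Rex', 'dog', 3);
    """
)

cats_per_owner_query = """
SELECT o.name AS owner_name, COUNT(1) AS number_of_cats
FROM pets AS p
INNER JOIN owners AS o ON p.owner_id = o.id
WHERE p.animal LIKE '%cat%'
GROUP BY o.name
ORDER BY number_of_cats DESC, owner_name
"""

cats_per_owner = conn.execute(cats_per_owner_query).fetchall()
print(cats_per_owner)
```

The **WHERE** filter runs before grouping, so an owner with no matching rows (here `Cy`) does not appear at all, and `wildcat` counts because `%cat%` allows extra text around the match.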

In [None]:
# your code here
bigquery_experts_query = ____
bigquery_experts_results = ____

print(bigquery_experts_results)
q_5.check()

In [None]:
# q_5.hint()
# q_5.solution()

# Building A More Generally Useful Service

How could you convert what you've done so it's a general function a website could call on the backend to get experts on any topic?  

Think about it and then check the solution below.
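
Once you have thought it through, here is one possible shape for such a function, sketched against a toy in-memory SQLite database rather than BigQuery. The function name `experts_for_topic`, the simplified `questions` and `answers` tables, and all rows are assumptions made for illustration. The `?` placeholder is SQLite's parameter syntax; BigQuery client libraries have their own query-parameter mechanism, which is not shown here.

```python
import sqlite3


def experts_for_topic(conn, topic):
    """Return (user_id, number_of_answers) rows for questions tagged with `topic`."""
    query = """
    SELECT a.owner_user_id AS user_id, COUNT(1) AS number_of_answers
    FROM questions AS q
    INNER JOIN answers AS a ON q.id = a.parent_id
    WHERE q.tags LIKE ?
    GROUP BY a.owner_user_id
    ORDER BY number_of_answers DESC, user_id
    """
    # Bind the pattern as a parameter instead of pasting `topic` into the SQL text.
    return conn.execute(query, (f"%{topic}%",)).fetchall()


# Toy data standing in for the real tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE questions (id INTEGER, tags TEXT);
    CREATE TABLE answers (id INTEGER, parent_id INTEGER, owner_user_id INTEGER);
    INSERT INTO questions VALUES (1, 'bigquery|sql'), (2, 'python'), (3, 'google-bigquery');
    INSERT INTO answers VALUES (10, 1, 100), (11, 1, 101), (12, 3, 100), (13, 2, 102);
    """
)

print(experts_for_topic(conn, "bigquery"))
print(experts_for_topic(conn, "python"))
```

Passing the topic as a bound parameter keeps user input out of the SQL string itself. Note that any `%` or `_` inside `topic` would still act as a **LIKE** wildcard in this sketch.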

In [None]:
# q_6.solution()

# Congratulations
You know all the key components to use BigQuery and SQL effectively. Your SQL skills are sufficient to unlock many of the world's large datasets.

Want to go play with your new powers?  Kaggle has BigQuery datasets available [here](https://www.kaggle.com/datasets?sortBy=hottest&group=public&page=1&pageSize=20&size=sizeAll&filetype=fileTypeBigQuery).

# Feedback
Bring any questions or feedback to the [Learn Discussion Forum](https://www.kaggle.com/learn-forum).
