# Get Started

Stack Overflow (stackoverflow.com) is a technical Question and Answer site that is widely beloved in the programming community. You'll probably use it yourself as you keep using SQL (or any programming language). 

Stack Overflow data is publicly available. What cool things do you think it would be useful for?

Here's an idea:
You could set up a service that identifies the Stack Overflow users who have demonstrated expertise on a specific topic by answering related questions, so someone could hire those experts for in-depth assistance.

In this exercise, you'll write the SQL queries that would serve as the foundation for this type of service.

As usual, run the following cell to set up our feedback system before moving on.

In [6]:
# Set up feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.sql.ex6 import *

# import package with helper functions 
import bq_helper

# create a helper object for this dataset
stack_overflow = bq_helper.BigQueryHelper(active_project="bigquery-public-data",
                                              dataset_name="stackoverflow")




Then write the code to answer the questions below.

# Questions

# 1) Explore the Data

Before writing queries or **JOIN** clauses, you'll want to see what tables are available. 

This may be a good time to practice **tab completion** for when you don't remember command names. If you type `stack_overflow.` and then hit tab, you will see a list of methods for the `stack_overflow` object (don't forget the dot before hitting tab).
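If tab completion is unavailable in your environment, Python's built-in `dir` gives similar information. The sketch below is only an illustration: `public_methods` is a hypothetical helper name, not part of `bq_helper`, and it is demonstrated on a plain list rather than on the dataset helper.

```python
def public_methods(obj):
    """Return the sorted names of an object's public, callable attributes."""
    return sorted(
        name
        for name in dir(obj)
        if not name.startswith("_") and callable(getattr(obj, name))
    )


# Demonstrated on a built-in list; any object works the same way.
list_methods = public_methods([])
print(list_methods)
```

Calling `public_methods(stack_overflow)` in the notebook should list roughly the same names that tab completion shows.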

In [None]:
# Your code here

list_of_tables = ____    # get a list of available tables

print(list_of_tables)
q_1.check()

In [None]:
# q_1.solution()

# 2) Review Relevant Tables

If you are interested in people who answer questions, you could start by looking at the `posts_answers` table. Run the cell below and look at the output.

In [None]:
stack_overflow.head('posts_answers')

It's not clear from `posts_answers` alone how to find users who answer questions on a given topic. The table does have a `parent_id` column, though. If you are familiar with the Stack Overflow site, it will seem likely that the parent is the question that the post is answering.

Look at `posts_questions` using the line below.

In [None]:
stack_overflow.head('posts_questions')

Are there any fields that would let you identify what general topic each question is about?
If so, would that help you identify who is answering questions on each topic?

Think about it, then check the solution below.

In [None]:
#q_2.solution()

# 3) Selecting The Right Questions

A lot of this data is text. There's one last technique you'll learn in this course, and it will be key in this exercise.

You can use a **WHERE** clause that filters your results based on certain text. To select just the third row of the following table, we would write:

`SELECT * FROM PETS WHERE NAME LIKE 'Ripley'`

![](https://i.imgur.com/Ef4Puo3.png)

The **LIKE** operator allows you to use `%` as a "wildcard" that matches any number of characters. So we could also get the third row with:

`SELECT * FROM PETS WHERE NAME LIKE '%ipl%'`
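To try the two **LIKE** patterns without touching BigQuery, here is a minimal sketch using Python's built-in `sqlite3` module. The table contents are assumptions made up for illustration; only the name `Ripley` in the third row comes from the text. Note that SQLite's `LIKE` ignores case for ASCII letters, while BigQuery's is case-sensitive, so treat this as a demonstration of the `%` wildcard only.

```python
import sqlite3

# In-memory stand-in for the pets table; rows other than "Ripley" are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (id INTEGER, name TEXT, animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Dr. Harris Bonkers", "Rabbit"),
        (2, "Moon", "Dog"),
        (3, "Ripley", "Cat"),
        (4, "Tom", "Cat"),
    ],
)

# No wildcard: the pattern must match the whole name.
exact_match = conn.execute(
    "SELECT * FROM pets WHERE name LIKE 'Ripley'"
).fetchall()

# % on both sides: "ipl" may appear anywhere in the name.
wildcard_match = conn.execute(
    "SELECT * FROM pets WHERE name LIKE '%ipl%'"
).fetchall()

print(exact_match)
print(wildcard_match)
```

Both queries return the same single row here, because `Ripley` is the only name that contains `ipl`.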

**Now for your turn to use it:**
As a warm-up, before finding users who have answered questions on a specific topic: Write a query that selects the `id`, `title` and `owner_user_id` columns from the `posts_questions` table. Restrict the results to rows that contain the word **bigquery** in the `tags` column. Include rows where there is other text in addition to the word `bigquery` (e.g. if a row has a tag `bigquery-sql`, your results should include that).

In [None]:
# Your code here
bq_questions = \
"""
SELECT ____
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE ____
"""

bq_question_results = stack_overflow.query_to_pandas_safe(bq_questions)
print(bq_question_results.head())
q_3.check()


In [None]:
# q_3.hint()
# q_3.solution()

# 4) Your First Join
If you have a query to select questions on any given topic (in this case, you chose `bigquery`), you can find the answers with a **JOIN**.  

Write a SQL query that returns the `id`, `body` and `owner_user_id` columns from the `posts_answers` table for answers to `bigquery`-related questions. That is, you should have one row in your results for each answer to a question that has `bigquery` in its tags.

Here's a reminder of what a **JOIN** looked like in the tutorial
```
SELECT p.Name AS Pet_Name, o.Name as Owner_Name
FROM `bigquery-public-data.pet_records.pets` as p
INNER JOIN `bigquery-public-data.pet_records.owners` as o ON p.ID = o.Pet_ID
```
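The `bigquery-public-data.pet_records` tables above are tutorial examples, so the sketch below rebuilds a tiny version of them in an in-memory SQLite database to show what an **INNER JOIN** returns. All rows are made-up assumptions, and the syntax is SQLite's rather than BigQuery's (no backticked table names).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pets (ID INTEGER, Name TEXT);
    CREATE TABLE owners (ID INTEGER, Name TEXT, Pet_ID INTEGER);
    INSERT INTO pets VALUES (1, 'Moon'), (2, 'Ripley'), (3, 'Tom');
    INSERT INTO owners VALUES (1, 'Aubrey', 1), (2, 'Chett', 2);
    """
)

join_query = """
SELECT p.Name AS Pet_Name, o.Name AS Owner_Name
FROM pets AS p
INNER JOIN owners AS o ON p.ID = o.Pet_ID
ORDER BY p.ID
"""

# Each result row pairs a pet with the owner whose Pet_ID matches the pet's ID.
joined_rows = conn.execute(join_query).fetchall()
print(joined_rows)
```

The pet `Tom` has no matching row in `owners`, so an inner join leaves it out of the results.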

It may be useful to scroll up and review the results from when you called **head** on `posts_answers` and `posts_questions`.  

Since you could use this query to support a webpage that recommends experts, you should care about query speed. We've added code to report how long the query takes. As a warning, this query runs very slowly.

In [None]:
from time import time


answers_query = \
"""
____
"""

query_start_time = time()
answers_results = ____
print(answers_results.head())
print("Total running time: {}".format(time() - query_start_time))
q_4.check()

In [None]:
# q_4.hint()
# q_4.solution()

# 5) Answer The Question
You have the merge you need, but we wanted a list of users who have answered many questions, not just a list of question or answer IDs.

Write a new query that selects data from the `posts_questions` and `posts_answers` tables. The results should have a single row for each user who answered at least one question with a tag that includes the string `bigquery`. Each row should have two columns:
- a column called `user_id` that contains the `owner_user_id` from the `posts_answers` table
- a column called `number_of_answers` that contains the number of answers the user has written to `bigquery` questions
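
The general pattern here (join two tables, filter on text in one of them, then count rows per group in the other) can be tried on toy data first. The sketch below uses an in-memory SQLite database with hypothetical `pets` and `vet_visits` tables that are not part of the Stack Overflow dataset; it illustrates the technique and is not the exercise solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pets (id INTEGER, animal TEXT);
    CREATE TABLE vet_visits (id INTEGER, pet_id INTEGER, vet_id INTEGER);
    INSERT INTO pets VALUES (1, 'cat'), (2, 'dog'), (3, 'wildcat');
    INSERT INTO vet_visits VALUES
        (10, 1, 100), (11, 1, 101), (12, 3, 100), (13, 2, 101);
    """
)

visits_per_vet_query = """
SELECT v.vet_id AS vet_id, COUNT(1) AS number_of_visits
FROM pets AS p
INNER JOIN vet_visits AS v ON p.id = v.pet_id
WHERE p.animal LIKE '%cat%'
GROUP BY v.vet_id
ORDER BY v.vet_id
"""

# One row per vet, counting only visits by pets whose animal contains "cat".
visits_per_vet = conn.execute(visits_per_vet_query).fetchall()
print(visits_per_vet)
```

The visit by the dog is filtered out before grouping, so it does not contribute to any vet's count.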

In [None]:
# your code here
bigquery_experts_query = ____
bigquery_experts_results = ____

print(bigquery_experts_results)
q_5.check()

In [None]:
# q_5.hint()
# q_5.solution()

# Congratulations
You know all the key components to use BigQuery and SQL effectively. Your SQL skills are sufficient to unlock many of the world's large datasets.

Want to go play with your new powers?  Kaggle has BigQuery datasets available [here](https://www.kaggle.com/datasets?sortBy=hottest&group=public&page=1&pageSize=20&size=sizeAll&filetype=fileTypeBigQuery).

# Feedback
Bring any questions or feedback to the [Learn Discussion Forum](https://www.kaggle.com/learn-forum).
