# **SIG AIDA Data Science Workshop**
## _Structured Searching with SQL_


# Introduction
## What is SQL?

SQL stands for Structured Query Language. 

It is mainly used to interact with relational database systems, which store data in tables. SQL is a query language: we use it to ask the database to do things like read rows from a table or write rows to one.


In [None]:
#@title Please run this cell for setup! (click on the play button)

import pandas as pd
import csv
import urllib.request
import sqlite3
from pprint import pprint

conn = sqlite3.connect('example.db')

c = conn.cursor()
# Drop tables left over from a previous run so that to_sql() can recreate them
c.execute("DROP TABLE IF EXISTS uber")
c.execute("DROP TABLE IF EXISTS gpa")

uber_url = "https://raw.githubusercontent.com/fivethirtyeight/uber-tlc-foil-response/master/Uber-Jan-Feb-FOIL.csv"
uber_data = pd.read_csv(uber_url, index_col=0)
uber_data.to_sql('uber', conn)

gpa_url = "https://raw.githubusercontent.com/wadefagen/datasets/master/gpa/uiuc-gpa-dataset.csv"
gpa_data = pd.read_csv(gpa_url, index_col=0)
gpa_data.to_sql('gpa', conn)

def run_query(query):
    return pd.read_sql_query(query, conn)

print("Setup Complete!")

In [None]:
# How to query the DB: a simple get
query = """
           SELECT *
           FROM uber
        """
run_query(query)

## What did that do?

#### We got every row and every column from a table called _uber_

SQL is nice because you can sequentially read what you're telling it to do. Let's break down the query we asked SQL to execute above.

## Some basic keywords

### `SELECT`
We want SQL to _return_ things from a table.

What do you want from the table? A number? A list of rows?

In SQL, the asterisk `*` is a wildcard that essentially means "give me every column". In the example above, we told SQL to return every column of every row in the table.

You can also tell SQL to give you only the values for specific columns (see example below).

[W3 Schools tutorial](https://www.w3schools.com/sql/sql_select.asp)



### `FROM`

This tells SQL _where_ or _which table_ it should be looking to interact with.

You could be working with multiple tables in a single SQL query, so SQL needs to know which one(s) to go to.

[W3 Schools tutorial](https://www.w3schools.com/sql/sql_from.asp)

In [None]:
query = """
           SELECT dispatching_base_number, date, active_vehicles
           FROM uber;
        """
run_query(query)

### `WHERE`
This tells SQL what _condition_ a row must match to be included in the results.

It's rarely useful to get every single row in the table. Most of the time we're looking for rows that pertain to some date or some person.

[W3 Schools tutorial](https://www.w3schools.com/sql/sql_where.asp)
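
As a minimal sketch of how `WHERE` filters rows, the snippet below builds a tiny in-memory SQLite table so it can run independently of the workshop data. The table `rides`, its columns, and its values are made up for illustration and are not part of the `uber` dataset.

```python
import sqlite3

# Hypothetical toy table, separate from the workshop's example.db
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE rides (base TEXT, active_vehicles INTEGER)")
demo.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [("B1", 190), ("B2", 870), ("B3", 1132), ("B4", 870)],
)

# Only rows with at least 870 active vehicles are returned
busy = demo.execute(
    "SELECT base, active_vehicles FROM rides WHERE active_vehicles >= 870"
).fetchall()

# Conditions can be combined with AND (or OR)
exact = demo.execute(
    "SELECT base FROM rides WHERE active_vehicles = 870 AND base <> 'B2'"
).fetchall()

print(busy)
print(exact)
```

Besides `=`, a condition can use comparisons such as `<`, `>`, `<=`, `>=`, and `<>` (not equal).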

In [None]:
query = """
           SELECT dispatching_base_number, date, active_vehicles
           FROM uber
           WHERE active_vehicles = 870;
        """
run_query(query)

## Some more advanced keywords


In [None]:
# new dataset!
query = """
        SELECT *
        FROM gpa;
        """
run_query(query)

### `GROUP BY`
`GROUP BY` tells SQL to collect rows that share the same value in a column into groups, so that an aggregate function (such as `SUM()` or `COUNT()`) can be computed once per group. This makes more sense with examples, so we'll go into some here.

The code below first groups the rows by subject (all rows with Subject `AAS` become one group, all rows with Subject `STAT` become another, etc.) and then applies an aggregate function to each group, `SUM()` in this case.

Intuitively, this query will sum up the values of `A+` (which in this case is the number of students who got an A+) for every Subject, then show you the Subject and the sum of A+'s as two columns.

[W3 Schools tutorial](https://www.w3schools.com/sql/sql_groupby.asp)
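
To see the grouping on data small enough to check by hand, here is a minimal sketch using an in-memory SQLite table. The table `grades` and its numbers are invented for illustration; only the column names mimic the `gpa` dataset.

```python
import sqlite3

# Hypothetical toy table with made-up counts of A+ grades
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE grades (Subject TEXT, Number INTEGER, `A+` INTEGER)")
demo.executemany(
    "INSERT INTO grades VALUES (?, ?, ?)",
    [
        ("CS", 101, 10),
        ("CS", 225, 5),
        ("STAT", 100, 7),
        ("STAT", 400, 3),
        ("AAS", 100, 2),
    ],
)

# One output row per Subject: the sum of A+ and the number of rows in the group
totals = demo.execute(
    "SELECT Subject, SUM(`A+`), COUNT(*) FROM grades GROUP BY Subject"
).fetchall()

print(sorted(totals))
```

The five input rows collapse into three output rows, one for each distinct `Subject`.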


In [None]:
query = """
        SELECT Subject, SUM(`A+`)
        FROM gpa
        GROUP BY Subject;
        """
run_query(query)

### `ORDER BY`
This command, as the command name suggests, orders the results by a column. In the query below, we sort the rows by the column `A+` in DESCending order (as opposed to ASCending order).

Intuitively, this command sorts the rows by the number of students who received an A+.
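
The sketch below shows the same idea on a toy in-memory table (the table `grades` and its values are made up). It also uses `LIMIT`, a keyword not covered above, which keeps only the first few rows of the sorted result; that combination is a common way to answer "which row has the most ...?" questions.

```python
import sqlite3

# Hypothetical toy table with made-up counts of A+ grades
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE grades (Subject TEXT, Number INTEGER, `A+` INTEGER)")
demo.executemany(
    "INSERT INTO grades VALUES (?, ?, ?)",
    [("CS", 101, 10), ("STAT", 100, 7), ("CS", 225, 25), ("AAS", 100, 7)],
)

# Sort by A+ from largest to smallest; break ties alphabetically by Subject;
# then keep only the first three rows of the sorted result
top = demo.execute(
    """
    SELECT Subject, Number, `A+`
    FROM grades
    ORDER BY `A+` DESC, Subject ASC
    LIMIT 3
    """
).fetchall()

print(top)
```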

In [None]:
query = """
        SELECT YearTerm, Subject, Number, `Course Title`, `A+`
        FROM gpa
        ORDER BY `A+` DESC;
        """
run_query(query)

### `HAVING`
`HAVING` works like `WHERE`, but it filters whole groups after aggregation instead of individual rows, so its condition can use aggregate functions (such as `SUM`, `AVG`, `COUNT`).

The query below returns every `Subject` whose total number of `A+` grades is greater than 100.

(Notice that we can't use `WHERE` here: `WHERE` is applied to individual rows before `GROUP BY` runs, so it cannot refer to the result of the aggregate function `SUM`.)

[W3 Schools tutorial](https://www.w3schools.com/sql/sql_having.asp)
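
The difference between the two filters can be sketched on a toy in-memory table (the table `grades` and its values are made up for illustration). In the second query, `WHERE` first discards individual rows, and `HAVING` then discards whole groups.

```python
import sqlite3

# Hypothetical toy table with made-up counts of A+ grades
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE grades (Subject TEXT, Number INTEGER, `A+` INTEGER)")
demo.executemany(
    "INSERT INTO grades VALUES (?, ?, ?)",
    [
        ("CS", 101, 60),
        ("CS", 225, 50),
        ("STAT", 100, 70),
        ("STAT", 400, 20),
        ("AAS", 100, 30),
    ],
)

# HAVING only: keep subjects whose total A+ count exceeds 100
having_only = demo.execute(
    "SELECT Subject, SUM(`A+`) FROM grades GROUP BY Subject HAVING SUM(`A+`) > 100"
).fetchall()

# WHERE and HAVING together: drop rows with fewer than 50 A+ grades first,
# then keep subjects whose remaining total exceeds 60
both = demo.execute(
    """
    SELECT Subject, SUM(`A+`)
    FROM grades
    WHERE `A+` >= 50
    GROUP BY Subject
    HAVING SUM(`A+`) > 60
    """
).fetchall()

print(having_only)
print(sorted(both))
```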

In [None]:
query = """
        SELECT Subject, SUM(`A+`)
        FROM gpa
        GROUP BY Subject
        HAVING SUM(`A+`) > 100;
        """
run_query(query)

### `LIKE`
This command will try to match strings in a column based on patterns that you specify.

The query below will find all CS courses whose `Course Title` has the words `"Machine Learning"` somewhere within the string.

Note: in a pattern, `%` matches any sequence of characters (including none), while `_` matches exactly one character.

[W3 Schools tutorial](https://www.w3schools.com/sql/sql_like.asp)
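
Both wildcards can be tried on a toy in-memory table, as in the sketch below. The table `courses` and the course titles in it are invented for illustration. One SQLite-specific detail: by default, `LIKE` ignores case for ASCII letters, so a lowercase pattern matches too.

```python
import sqlite3

# Hypothetical toy table with made-up course titles
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE courses (Subject TEXT, Title TEXT)")
demo.executemany(
    "INSERT INTO courses VALUES (?, ?)",
    [
        ("CS", "Machine Learning"),
        ("CS", "Applied Machine Learning"),
        ("IS", "Data Structures"),
        ("STAT", "Statistical Learning"),
    ],
)

# % matches any number of characters (including zero) on either side
ml = demo.execute(
    "SELECT Title FROM courses WHERE Title LIKE '%machine learning%'"
).fetchall()

# _ matches exactly one character, so '_S' matches only two-character subjects
two_char = demo.execute(
    "SELECT DISTINCT Subject FROM courses WHERE Subject LIKE '_S'"
).fetchall()

print(sorted(ml))
print(sorted(two_char))
```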

In [None]:
# example using %
# there can be any number of characters before and after the words Machine Learning
query = """
        SELECT YearTerm, Subject, Number, `Course Title`
        FROM gpa
        WHERE `Course Title` LIKE '%Machine Learning%'
        AND Subject = 'CS'
        """
run_query(query)

In [None]:
# example using _
# matches two-character subjects: the first character can be anything, but the second must be 'S'
query = """
        SELECT YearTerm, Subject, Number, `Course Title`
        FROM gpa
        WHERE Subject LIKE '_S'
        """
run_query(query)

# Practice!
Now here's a chance for you to practice!

In [None]:
#@title Open this for hints (double click me)

# Close this cell by double clicking on the right hand side

# Hint 1: You can do this by just getting all the rows of the dataset and
#   scrolling to the bottom, if you like

# Hint 2: Use the WHERE clause (look back at the demo)

# Hint 3: GROUP BY might come in handy...
#   Also look up the list of all aggregate functions you can run on a group:
#   'sqlite3 aggregate functions' on Google should help

# Hint 4: Figure out which column contains the value you want, and look back at
#   the list of queries from above to see if one of them was similar to what
#   we want here.
#   Also, keep an eye out for what we have to do if a column name contains spaces

# Hint 5: GROUP BY... this should be similar to some of the above queries

# Hint 6: The aggregate function COUNT could come in handy here.
#   If you want to find the number of unique instructors, DISTINCT is a keyword
#   that you can use:
#     https://www.w3resource.com/sql/aggregate-functions/count-with-distinct.php

# Hint 7: How can you get the total number of students in a class?

# Hint 8: You can use arithmetic operations in SQL (at least the SQL we're using,
#   SQLite3)
#   Some helpful reading: https://www.w3resource.com/sqlite/arithmetic-operators.php#:~:text=There%20are%20four%20type%20of,multiplication(*)%20and%20division(%2F).&text=Expression%20made%20up%20of%20a,values%20or%20perform%20arithmetic%20calculations.
#   Extra reading: https://www.w3schools.com/sql/sql_operators.asp

# Hint 9: 

In [None]:
# Problem 1: Find the last date that is available in the table 'uber'
query = """
        
        """
run_query(query)

In [None]:
# Problem 2: Find the dispatching base number with the most active vehicles in the table 'uber' (repeat for trips)
query = """

        """
run_query(query)

In [None]:
# Problem 3: Find the date with the largest number of active vehicles in the table 'uber' (repeat for trips)
query = """

        """
run_query(query)

In [None]:
# Problem 4: Find a class you've taken on campus before in the table 'gpa'
# If you haven't taken a class before Fall 2019, then find a class you want to take!
query = """

        """
run_query(query)

In [None]:
# Problem 5: Find the instructor with the highest number of A's given in the table 'gpa'
# You can modify this for whatever grade you want to look at
query = """

        """
run_query(query)

In [None]:
# Problem 6: Find the department with the largest number of instructors in the table 'gpa'
query = """

        """
run_query(query)

In [None]:
# Problem 7: Find the class with the largest number of people in the table 'gpa'
query = """

        """
run_query(query)

In [None]:
# Problem 8: Find a GPA for each class in the table 'gpa'
#   Additional problem: now find the class with the highest GPA
query = """

        """
run_query(query)

In [None]:
# Problem 9: Find the department with the highest GPA in the table 'gpa'
#   Additional problem: semester with the highest GPA?
query = """

        """
run_query(query)

In [None]:
# Problem 10: Explore this dataset to your heart's content! Tell us if you find
#   anything interesting!
query = """

        """
run_query(query)


# More!

## `JOIN`
What if you have content in separate tables that you want to "join" together? For example, suppose one table contains weather data for each day of the year, while another table has the number of Uber rides that were requested each day. If we want to do some analysis on Uber ridership in relation to the weather, then we would need to "join" the two datasets via the date (find all rows where the specific date exists in both datasets and combine all the columns). This operation is called a "join", and there are several different types, such as inner joins and left joins.

Read more about them here: [W3 Schools SQL Join](https://www.w3schools.com/sql/sql_join.asp)
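
The weather scenario above can be sketched with two toy in-memory tables. The tables `weather` and `rides`, their columns, and all values are made up for illustration. An `INNER JOIN` keeps only dates present in both tables, while a `LEFT JOIN` keeps every row of the left table and fills in missing values from the right table with `NULL` (shown as `None` in Python).

```python
import sqlite3

# Hypothetical toy tables with made-up values
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE weather (date TEXT, temp_f INTEGER)")
demo.execute("CREATE TABLE rides (date TEXT, trips INTEGER)")
demo.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [("2015-01-01", 30), ("2015-01-02", 38), ("2015-01-03", 35)],
)
demo.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [("2015-01-01", 1000), ("2015-01-02", 1500), ("2015-01-04", 1200)],
)

# INNER JOIN: only dates that appear in both tables
inner = demo.execute(
    """
    SELECT w.date, w.temp_f, r.trips
    FROM weather AS w
    INNER JOIN rides AS r ON w.date = r.date
    ORDER BY w.date
    """
).fetchall()

# LEFT JOIN: every weather row, with trips set to NULL when no ride data exists
left = demo.execute(
    """
    SELECT w.date, w.temp_f, r.trips
    FROM weather AS w
    LEFT JOIN rides AS r ON w.date = r.date
    ORDER BY w.date
    """
).fetchall()

print(inner)
print(left)
```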

## Inserting into a dataset
We only dealt with queries that read data today, but a database also needs a way to get content into it. `CREATE TABLE` commands create new tables and `INSERT` commands add rows to them. Using SQL through another programming language can allow us to create and maintain databases!
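
A minimal sketch of both commands, run through Python's `sqlite3` module against an in-memory database. The table `workshops` and its rows are invented for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")

# CREATE TABLE defines the table name, its columns, and their types
demo.execute(
    """
    CREATE TABLE workshops (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        attendees INTEGER
    )
    """
)

# INSERT adds one row; the id is assigned automatically by SQLite
demo.execute(
    "INSERT INTO workshops (title, attendees) VALUES ('Structured Searching with SQL', 25)"
)

# executemany() runs the same INSERT once per tuple, filling the ? placeholders
demo.executemany(
    "INSERT INTO workshops (title, attendees) VALUES (?, ?)",
    [("Intro to pandas", 30), ("Data Visualization", 18)],
)
demo.commit()  # make the changes permanent

rows = demo.execute(
    "SELECT id, title, attendees FROM workshops ORDER BY id"
).fetchall()
print(rows)
```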

## SQL Injections
You may have heard of this before. This is a security concept, but the basic idea is that a website that builds its SQL queries out of text typed in by users might not check that text carefully enough. If the input is pasted directly into the query, nothing stops a person from typing characters such as `'` or `;` that change what the query does, which could break the website or expose data it was never meant to show.

Since this topic touches on hacking, a necessary disclaimer: we do not condone any illegal activity like stealing information or ruining websites. For this reason, there are small sandboxed environments, called Capture the Flag (CTF) challenges, where you can safely explore concepts such as this.
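
A throwaway in-memory database on your own machine is another safe place to see the idea. The sketch below is an illustration under made-up assumptions (the table `users`, its contents, and the malicious input are all invented): it shows how pasting user input into a query string changes the query's meaning, and how the `?` placeholders of Python's `sqlite3` module avoid that by treating the input purely as a value.

```python
import sqlite3

# Hypothetical toy table with made-up data
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE users (name TEXT, secret TEXT)")
demo.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)

# Input a malicious visitor might type into a "look up user" form
user_input = "nobody' OR '1'='1"

# Unsafe: the input becomes part of the SQL text, so the condition turns into
#   name = 'nobody' OR '1'='1'
# which is true for every row
unsafe_query = f"SELECT name, secret FROM users WHERE name = '{user_input}'"
leaked = demo.execute(unsafe_query).fetchall()

# Safer: the ? placeholder passes the input as a plain value, so the database
# simply looks for a user whose name is that exact odd-looking string
safe = demo.execute(
    "SELECT name, secret FROM users WHERE name = ?", (user_input,)
).fetchall()

print(len(leaked))
print(safe)
```

With the string-built query every row comes back, while the parameterized query finds no matching user.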

[Medium Article explaining SQL injections](https://medium.com/@TurtledCoder/ctflearn-com-basic-injection-4dc5114e911c)

If you want to skip straight to the challenge: [CTFlearn](https://ctflearn.com/challenge/88)