# SECTION 07: SQL DATABASES

- onl01-dtsc-ft-022221
- 03/05/21

## LEARNING OBJECTIVES:

- Understand what a relational database is and how it differs from a DataFrame/Excel sheet.
- Understand how to read a database map (AKA "Entity Relationship Diagram (ERD)")
    - Primary keys vs foreign keys
- Understand how to select, filter, order, and group data using SQL
- Understand the different types of Joins 


#### 🏝 Breakout Group Activity:  [Survive on sql-island](https://sql-island.informatik.uni-kl.de/)

## Questions:

- When doing some of the admin tasks like inserting rows, I tried to run an insert command again and got a "database is locked" error. In what circumstances does this happen, and are there best practices for closing the connection to avoid locking?


- The SQL quiz had a question about the role of Associated Entities on many-to-many joins. I went back through the lessons and couldn't find a section which talked about that term. What should we know about many-to-many joins other than that they can become very large?



- Can you rephrase the below statement? 
>"For example, let's say you have another table 'restaurants' that has many columns including name, city, and rating. If you were to join this 'restaurants' table with the offices table using the shared city column, you might get some unexpected behavior. That is, in the office table, there is only one office per city. However, because there will likely be more than one restaurant for each of these cities in your second table, you will get unique combinations of Offices and Restaurants from your join. If there are 513 restaurants for Boston in your restaurant table and 1 office for Boston, your joined table will have each of these 513 rows, one for each restaurant along with the one office.<br>
If you had 2 offices for Boston and 513 restaurants, your join would have 1026 rows for Boston; 513 for each restaurant along with the first office and 513 for each restaurant with the second office. Three offices in Boston would similarly produce 1539 rows; one for each unique combination of restaurants and offices. This is where you should be particularly careful of many to many joins as the resulting set size can explode drastically potentially consuming vast amounts of memory and other resources."
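
One way to check the row counts in the quoted statement is to build the two tables and count the joined rows. This is a minimal sketch using an in-memory database; the `restaurants` table, its columns, and the toy rows are assumptions made up for illustration, not part of our `data.sqlite`.

```python
import sqlite3

# Toy in-memory database; table and column names are illustrative assumptions
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE offices (officeCode INTEGER PRIMARY KEY, city TEXT)")
cur.execute("CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# 2 offices and 513 restaurants, all in Boston
cur.executemany("INSERT INTO offices (city) VALUES (?)", [("Boston",), ("Boston",)])
cur.executemany(
    "INSERT INTO restaurants (name, city) VALUES (?, ?)",
    [(f"Restaurant {i}", "Boston") for i in range(513)],
)

# Every office row pairs with every restaurant row that shares its city
n_rows = cur.execute(
    "SELECT COUNT(*) FROM offices JOIN restaurants USING(city)"
).fetchone()[0]
print(n_rows)  # 2 offices x 513 restaurants = 1026 rows
conn.close()
```

- Each extra Boston office adds another 513 rows, which is why joining on a column that repeats in both tables grows so quickly.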


# SQL Databases

<img src="https://raw.githubusercontent.com/learn-co-students/dsc-sql-introduction-online-ds-sp-000/master/images/Database-Schema.png">


- SQL is designed to work with **relational data**. 
- This really just means pieces of data that are **related to each other**.

- Each table has a **primary key** (like a DataFrame index), with a unique value for each row in the table.
- In the schema diagram, the name of the primary key is preceded by an asterisk (\*). 

- Columns that are the **primary key on one table** can also appear on **other tables**. 
    - There it is referred to as a **foreign key**, aka the primary key from a different ("foreign") table. 
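
The primary/foreign key relationship can be sketched with two small tables. This is a toy example under stated assumptions: the table layout loosely mirrors `offices` and `employees`, but the columns and rows here are invented, and SQLite only enforces foreign keys when `PRAGMA foreign_keys` is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# officeCode is the primary key of offices...
cur.execute("CREATE TABLE offices (officeCode INTEGER PRIMARY KEY, city TEXT)")
# ...and appears again in employees as a foreign key
cur.execute("""CREATE TABLE employees (
    employeeNumber INTEGER PRIMARY KEY,
    lastName TEXT,
    officeCode INTEGER,
    FOREIGN KEY (officeCode) REFERENCES offices (officeCode))""")

cur.execute("INSERT INTO offices VALUES (1, 'Boston')")
cur.execute("INSERT INTO employees VALUES (1002, 'Murphy', 1)")

# officeCode 99 does not exist in offices, so the insert should be rejected
try:
    cur.execute("INSERT INTO employees VALUES (1003, 'Smith', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
conn.close()
```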

### ⨠ Q: Why do we need databases? Why can't we just use a bunch of Pandas DataFrames?

- 

## Querying Databases - `SELECT`ing data



- To retrieve data from one or more tables you usually use a `SELECT` statement. 
```SQL
SELECT * FROM table;
```


> - NOTE: SQL queries do not _have_ to be all-caps, but it is a convention that helps differentiate SQL syntax from the names of tables/columns.



- A more advanced select query.
```SQL
SELECT col1, col2, col3
FROM table
WHERE records match criteria
LIMIT 100;
```

- **All select statements must:**
    1. **start with the `SELECT`**
    2. followed by **what you want to select**. Separate multiple column names with a `,` 
    3. Then specify where the data is coming `FROM` followed by the table name. 
    4. **Afterward, you can provide conditions such as filters or sorting**.

```SQL
SELECT *
FROM payments
ORDER BY amount DESC
LIMIT 10;
```



## SQL with `sqlite3`

- Use `sqlite3` for SQL queries in Python.
    1. Connect to database
    2. Create a cursor.
    3. Form your query
    4. Execute/fetch your results.

```python
import sqlite3
connection = sqlite3.connect('pet_database.db') # Creates pet_database.db if it doesn't exist; it stays empty until you create a table
cursor = connection.cursor()


# Select from table
cursor.execute('''SELECT name FROM cats;''').fetchall()

```

In [1]:
import pandas as pd
import os
os.getcwd()

'/Users/jamesirving/Documents/GitHub/_COHORT_NOTES/022221FT/Online-DS-FT-022221-Cohort-Notes/Phase_1/topic_07_SQL_and_relational_databases'

In [None]:
## Our data.sqlite is in the datasets folder in our note repo.
db = '../../datasets/SQL/data.sqlite'

In [None]:
import sqlite3
# connect to database
conn = sqlite3.connect(db)
cur = conn.cursor()

type(conn)

### How to get all of the table names in a database
>- In SQLite, the schema table that lists every table in a database is called `sqlite_master`.
>- We can find the names of all of the tables in a db using: 
>```python
>table_names = cur.execute("""
>SELECT name 
>FROM sqlite_master 
>WHERE type='table';
>""").fetchall()
>```

In [None]:
# Get table names
table_names = cur.execute("""
SELECT name 
FROM sqlite_master 
WHERE type='table';
""").fetchall()
table_names

In [None]:
## Clean up the list of table names to be just 1 list with strings
table_names = [x[0] for x in table_names]
table_names

<img src="https://raw.githubusercontent.com/learn-co-students/dsc-sql-introduction-online-ds-sp-000/master/images/Database-Schema.png" width=500>

### How to get the column names after executing a query:


In [None]:
df = pd.DataFrame(cur.execute("select * from products").fetchall())
df

> - the cursor has a `.description` that contains information about the column names

In [None]:
cur.description

In [None]:
## Get a list of just the column names from the description
col_names =[col[0] for col in cur.description]
print(col_names)

In [None]:
df = pd.DataFrame(cur.execute('select * from products').fetchall(),
                  columns=col_names)
df.head()

In [None]:
cur.execute("""PRAGMA table_info(products)""").fetchall()

>### Quick Activity: Make executing the query -> dataframe into a function

In [None]:
def query_to_df(query,cursor=cur):
    ## Use the cursor that was passed in (not the global cur) for the column names
    df = pd.DataFrame(cursor.execute(query).fetchall(), 
                      columns=[col[0] for col in cursor.description])
    return df

In [None]:
query_to_df("""SELECT * FROM EMPLOYEES""")
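
Pandas can also run a query and build the DataFrame (column names included) in one call with `pd.read_sql_query`. A minimal sketch, assuming a throwaway in-memory table with made-up rows rather than our real `products` table:

```python
import sqlite3
import pandas as pd

# Throwaway in-memory table; the rows are invented for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (productCode TEXT, quantityInStock INTEGER)")
conn.executemany("INSERT INTO products VALUES (?, ?)", [("A1", 10), ("B2", 25)])

# read_sql_query takes the query string and an open connection
df = pd.read_sql_query("SELECT * FROM products", conn)
print(df.columns.tolist())
conn.close()
```

- Note that it takes the connection (`conn`), not the cursor.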

# FILTERING AND ORDERING

- `ORDER BY` - `DESC`/`ASC`
- `LIMIT`
- `BETWEEN`
- `NULL`
- `COUNT`
- `GROUP BY`
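
A few of the keywords above can be sketched on a tiny table. The `payments` columns and rows below are assumptions invented for this example, not the real table in `data.sqlite`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE payments (customer TEXT, amount REAL, checkNumber TEXT)")
cur.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [("A", 100.0, "c1"), ("B", 250.0, None), ("C", 400.0, "c3"), ("D", 900.0, None)],
)

# BETWEEN is inclusive on both ends; ORDER BY ... DESC sorts largest first
between = cur.execute(
    """SELECT customer FROM payments
       WHERE amount BETWEEN 250 AND 900
       ORDER BY amount DESC
       LIMIT 2;"""
).fetchall()

# Missing values are tested with IS NULL (not = NULL)
missing = cur.execute(
    "SELECT COUNT(*) FROM payments WHERE checkNumber IS NULL;"
).fetchone()[0]
print(between, missing)
conn.close()
```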

In [None]:
query = """select * from products
GROUP BY productLine 
ORDER BY quantityInStock DESC;"""
query_to_df(query)
# data = cur.execute(query).fetchall()
# col_names =[col[0] for col in cur.description]
# df = pd.DataFrame(data,
#                   columns=col_names)
# df

## GROUPING DATA WITH SQL

- Like we do with Pandas, we can use GROUP BY statements in SQL and then apply **aggregate functions:**
    - `COUNT`
    - `MAX`
    - `MIN`
    - `SUM`
    - `AVG`
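
To make the Pandas comparison concrete, here is a rough side-by-side of `GROUP BY` + `COUNT` in SQL and `groupby` + `count` in Pandas. The rows are made up for illustration, and the single-table layout is a simplification of the real `offices`/`employees` tables.

```python
import sqlite3
import pandas as pd

# Made-up rows: (city, employeeNumber)
rows = [("Boston", 1), ("Boston", 2), ("Paris", 3)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (city TEXT, employeeNumber INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)

# SQL version
sql_counts = conn.execute(
    "SELECT city, COUNT(employeeNumber) FROM employees GROUP BY city ORDER BY city"
).fetchall()

# Pandas version of the same aggregation
df = pd.DataFrame(rows, columns=["city", "employeeNumber"])
pandas_counts = df.groupby("city")["employeeNumber"].count()

print(sql_counts)
print(pandas_counts.to_dict())
conn.close()
```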

In [None]:
query="""SELECT city, COUNT(employeeNumber)
FROM offices 
JOIN employees
USING(officeCode)
GROUP BY city
ORDER BY count(employeeNumber) DESC;"""
query_to_df(query)

## ALIASING

- `AS` assigns a temporary name (an alias) to a column or table within a query
- Useful for `JOIN`,`GROUP BY`, and aggregates.

In [None]:
query="""SELECT city, COUNT(employeeNumber) AS numEmployees
               FROM offices
               JOIN employees
               USING(officeCode)
               GROUP BY 1
               ORDER BY numEmployees DESC;"""
query_to_df(query)

In [None]:
query_to_df("""SELECT customerName,
               COUNT(customerName) AS number_purchases,
               MIN(amount) AS min_purchase,
               MAX(amount) AS max_purchase,
               AVG(amount) AS avg_purchase,
               SUM(amount) AS total_spent
               FROM customers
               JOIN payments
               USING(customerNumber)
               GROUP BY 1
               ORDER BY SUM(amount) DESC;""")

## The `WHERE` Clause

In general, the `WHERE` clause filters query results by some condition. As you are starting to see, you can also combine multiple conditions.

- 
```python
cur.execute("""SELECT * FROM customers WHERE city = 'Boston' OR city = 'Madrid';""")
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df
```


- To refine your searches, you can add `ORDER BY` and `LIMIT` clauses. 
    - The order by clause allows you to sort the results by a particular feature.
- Finally, the limit clause is typically the last argument in a SQL query and simply limits the output to a set number of results.

In [None]:
query_to_df("""SELECT * FROM customers WHERE city = 'Boston' OR city = 'Madrid';""")

## The `HAVING` clause

The `HAVING` clause works similarly to the `WHERE` clause, except it filters the grouped results **after** the `GROUP BY` clause has been applied, so it can test aggregates like `COUNT` or `SUM`.

In [None]:
query_to_df("""SELECT city, COUNT(customerNumber) AS number_customers
               FROM customers
               GROUP BY 1
               HAVING COUNT(customerNumber)>=5;""")

## Combining `WHERE` and `HAVING`

We can also use the `WHERE` and `HAVING` clauses in conjunction with each other for more complex rules.

- For example, let's say we want a list of customers who have made at least 3 purchases of over 50K each.

In [None]:
query_to_df("""SELECT customerName,
               COUNT(amount) AS number_purchases_over_50K
               FROM customers
               JOIN payments
               USING(customerNumber)
               WHERE amount >= 50000
               GROUP BY 1
               HAVING count(amount) >= 3
               ORDER BY count(amount) DESC;""")

In [None]:
query_to_df("""SELECT customerName,
               COUNT(amount) AS number_purchases_over_50K
               FROM customers
               JOIN payments
               USING(customerNumber)
               WHERE amount >= 50000
               GROUP BY 1
               HAVING number_purchases_over_50K >= 3
               ORDER BY number_purchases_over_50K DESC;""")
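
The difference in timing between `WHERE` (before grouping) and `HAVING` (after grouping) can be sketched on a toy table. The `payments` rows and the threshold of 2 are assumptions chosen to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [("A", 60000), ("A", 70000), ("A", 10), ("B", 80000), ("B", 20), ("B", 30)],
)

# WHERE drops the small payments first; HAVING then keeps groups with 2+ rows left
where_then_having = conn.execute("""
    SELECT customer, COUNT(amount) AS n
    FROM payments
    WHERE amount >= 50000
    GROUP BY customer
    HAVING n >= 2;
""").fetchall()

# Without WHERE, every payment is counted, so both customers pass HAVING
having_only = conn.execute("""
    SELECT customer, COUNT(amount) AS n
    FROM payments
    GROUP BY customer
    HAVING n >= 2
    ORDER BY customer;
""").fetchall()

print(where_then_having)
print(having_only)
conn.close()
```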


# JOINS

### TYPES OF JOINS

- Joins may be:
    - INNER (default)
    - OUTER
    - LEFT 
    - RIGHT
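
A small sketch of how `INNER` and `LEFT` joins differ, assuming two toy tables where one office has no employees. The rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE offices (officeCode INTEGER, city TEXT)")
conn.execute("CREATE TABLE employees (lastName TEXT, officeCode INTEGER)")
conn.executemany("INSERT INTO offices VALUES (?, ?)", [(1, "Boston"), (2, "Paris")])
conn.execute("INSERT INTO employees VALUES ('Murphy', 1)")  # nobody works in Paris

# INNER JOIN keeps only rows with a match in both tables
inner = conn.execute(
    "SELECT city, lastName FROM offices JOIN employees USING(officeCode)"
).fetchall()

# LEFT JOIN keeps every row from the left table and fills missing matches with NULL
left = conn.execute(
    """SELECT city, lastName FROM offices
       LEFT JOIN employees USING(officeCode)
       ORDER BY city"""
).fetchall()

print(inner)
print(left)
conn.close()
```

- `NULL` comes back to Python as `None`, so the Paris row shows up as `('Paris', None)` in the left join only.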

<img src="https://raw.githubusercontent.com/learn-co-students/dsc-join-statements-online-ds-sp-000/master/images/venn.png">


## `JOIN` Statements

### Primary vs Foreign Keys
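- primary key: a column whose value uniquely identifies each row in its own table.
- foreign key: a column that holds the primary key of a different ("foreign") table, linking the two tables.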


<img src="https://raw.githubusercontent.com/learn-co-students/dsc-sql-introduction-online-ds-sp-000/master/images/Database-Schema.png">

### The `USING` clause

- If the column name is identical in both tables, you can use the `USING` clause. 
- Rather than saying `ON tableA.column = tableB.column` we can simply say `USING(column)`. 
- Only works if the column is **identically named** in both tables.
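
A quick sketch comparing `ON` and `USING` on two toy tables (the rows are made up). Both return the same matches; with `SELECT *`, `USING` returns the shared column once, while `ON` returns it from both tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orderdetails (orderNumber INTEGER, productCode TEXT)")
conn.execute("CREATE TABLE products (productCode TEXT, productName TEXT)")
conn.execute("INSERT INTO orderdetails VALUES (1, 'P1')")
conn.execute("INSERT INTO products VALUES ('P1', 'Toy Car')")

# Join with ON: productCode appears once per table
cur_on = conn.execute(
    """SELECT * FROM orderdetails
       JOIN products ON orderdetails.productCode = products.productCode"""
)
cols_on = [col[0] for col in cur_on.description]

# Join with USING: the shared column appears only once
cur_using = conn.execute("SELECT * FROM orderdetails JOIN products USING(productCode)")
cols_using = [col[0] for col in cur_using.description]

print(cols_on)
print(cols_using)
conn.close()
```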

#### Task: Displaying product details along with order details

Let's say you need to generate some report that includes details about products from orders. To do that, we would need to take data from multiple tables in a single statement. 

In [None]:
query_to_df("""SELECT * 
               FROM orderdetails
               JOIN products
               ON orderdetails.productCode = products.productCode
               LIMIT 10;
               """)

## Types of Relationships Between Tables

#### One-to-One, One-to-Many, Many-to-Many Joins


- **Let's say we have tables A and B**


- **One-to-One joins:**
    - There is only 1 entry in table B that aligns with each individual entry in table A
    - e.g. A person and their social security number.


- **One-to-Many joins:**
    - There are multiple entries in table B that match a single entry in table A
    - e.g. Joining an order number from table A with the individual products in table B.


- **Many-to-Many joins:**
    - Multiple entries in table A match multiple entries in table B.
    - e.g. A = classes at a college, B = students.

### One-to-One

- There is only 1 entry in table B that aligns with each individual entry in table A
- e.g. A person and their social security number.

In [None]:
## Preview full offices table
query_to_df("Select * FROM offices;")

In [None]:
## Preview full employees table
query_to_df("Select * FROM employees;")

In [None]:
## Join employees to offices: each employee row matches exactly one office row
query_to_df('SELECT * FROM employees JOIN offices USING(officeCode);')

### One-to-Many

- There are multiple entries in table B that match a single entry in table A
- e.g. Joining an order number from table A with the individual products in table B.

In [None]:
## preview table 1 
query_to_df("SELECT * FROM products;")

In [None]:
## preview table 2
query_to_df("SELECT * FROM productlines;")

In [None]:
## One-to-Many Join
query_to_df("""SELECT * 
               FROM products
               JOIN productlines
               USING(productLine);""")

### Many-to-Many

- Multiple entries in table A match multiple entries in table B.
- e.g. A = classes at a college, B = students.

In [None]:
query_to_df("""SELECT * FROM offices
                        JOIN customers
                        USING(state);""")
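
A many-to-many relationship is commonly modeled with a third table (an associative or "junction" table) that holds one row per pairing. Below is a minimal sketch of the classes/students example; the `students`, `classes`, and `enrollments` tables and their rows are hypothetical and not part of our database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE classes (class_id INTEGER PRIMARY KEY, title TEXT)")

# Junction table: each row links one student to one class
conn.execute("""CREATE TABLE enrollments (
    student_id INTEGER REFERENCES students (student_id),
    class_id INTEGER REFERENCES classes (class_id),
    PRIMARY KEY (student_id, class_id))""")

conn.executemany("INSERT INTO students VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
conn.executemany("INSERT INTO classes VALUES (?, ?)", [(10, "SQL"), (20, "Stats")])
conn.executemany("INSERT INTO enrollments VALUES (?, ?)", [(1, 10), (1, 20), (2, 10)])

# Two one-to-many joins through the junction table recover the pairings
roster = conn.execute("""
    SELECT name, title
    FROM students
    JOIN enrollments USING(student_id)
    JOIN classes USING(class_id)
    ORDER BY name, title;
""").fetchall()
print(roster)
conn.close()
```

- The result has one row per actual enrollment, rather than every possible student/class combination.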

## SQL Subqueries

```python
cur.execute("""SELECT lastName, firstName, officeCode
               FROM employees
               WHERE officeCode IN (SELECT officeCode
                                    FROM offices 
                                    WHERE country = 'USA');
               """)
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df
```

# Pandas + SQL

## Using SQL-like queries in Pandas - `df.query()`

- Pandas DataFrames have a method called `.query()`
- This allows us to use SQL-like commands to reference data.
```python
## Normal Pandas Syntax
foo_df = bar_df.loc[bar_df['Col_1']>bar_df['Col_2']]
```

```python
## Using .query()
foo_df = bar_df.query("Col_1 > Col_2")
```
- How to use:
    - Enter the query as a single string, using just column names to reference data.
    - To use and/or statements, use `&` and `|`, respectively

```python
foo_df = bar_df.query("Col_1 > Col_2 & Col_2 <= Col_3")
```
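
A runnable version of the comparison above, using a small made-up `bar_df` (the values are invented so that only one row passes both conditions):

```python
import pandas as pd

# Made-up data for illustration
bar_df = pd.DataFrame({"Col_1": [5, 1, 9], "Col_2": [3, 4, 2], "Col_3": [3, 9, 1]})

## Normal Pandas Syntax: each condition needs its own parentheses
mask_df = bar_df.loc[(bar_df["Col_1"] > bar_df["Col_2"]) & (bar_df["Col_2"] <= bar_df["Col_3"])]

## Using .query(): same filter written as one string
query_df = bar_df.query("Col_1 > Col_2 & Col_2 <= Col_3")

print(query_df)
```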

## Using SQL syntax with `pandasql`


- There is a library called [pandasql](https://pypi.org/project/pandasql/) that allows for SQL queries on pandas DataFrames

We can install `pandasql` using the bash command `pip install pandasql`.

### Importing pandasql

- In order to use `pandasql`, we need to start by importing the `sqldf` function from `pandasql`
- You must also already have loaded in all dataframes that you wish to query
```python
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
```

### Writing Queries
```python
q = """SELECT
        m.date, m.beef, b.births
     FROM
        meats m
     INNER JOIN
        births b
           ON m.date = b.date;"""

results = pysqldf(q)

```

## DATABASE ADMIN 101

### Creating Tables/SQL Data Types

- `CREATE TABLE IF NOT EXISTS table_name`
    - Must include each column's name, its datatype, and whether it is the primary key
```python
cur.execute("""CREATE TABLE IF NOT EXISTS cats (
    id INTEGER PRIMARY KEY,
    name TEXT,
    age INTEGER,
    breed TEXT)
    """)
```
    

- Data types in SQLite3:
    - https://www.sqlite.org/datatype3.html
    
- Data types:
    - TEXT
    - INTEGER
    - REAL
    - BLOB
    - NULL
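
SQLite's built-in `typeof()` function reports which of these types a value is stored as. A small sketch with literal values (no table needed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One literal of each type; x'00' is SQLite's syntax for a BLOB literal
row = conn.execute(
    "SELECT typeof('Maru'), typeof(3), typeof(3.5), typeof(x'00'), typeof(NULL);"
).fetchone()
print(row)
conn.close()
```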

#### Dropping a Table
- `DROP TABLE table_name`
    - `DROP TABLE IF EXISTS table_name`
    
    

#### Adding values to a table
- `INSERT INTO table_name`
    - list the columns to fill in first and then the VALUES()   
```python
cur.execute('''INSERT INTO cats (name, age, breed) 
                  VALUES ('Maru', 3, 'Scottish Fold');
            ''')
```

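- To add multiple rows, use `executemany`:
```python
# executemany runs the statement once per row;
# cat_rows is assumed to be a list of (name, age, breed) tuples
cur.executemany('''INSERT INTO cats (name, age, breed) 
            VALUES (?, ?, ?);
      ''', cat_rows)
```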
- `UPDATE`

```python
for dct in contacts:
    fname = dct['firstName']
    lname = dct['lastName']
    role = dct['role']
    phone = dct['telephone ']
    street = dct['street']
    city = dct['city']
    state = dct['state']
    z = dct['zipcode ']

    cur.execute('''INSERT INTO contactInfo (firstname, lastname, role, telephone, street, city, state, zipcode)
VALUES (?,?,?,?,?,?,?,?)''', (fname, lname, role, phone, street, city, state, z))
```
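
Inserts are not saved to the database file until the connection commits them, and an open, uncommitted write from another connection is a likely cause of "database is locked" errors. One hedged pattern for committing and closing reliably is sketched below; it uses an in-memory database so it is self-contained, whereas a real notebook would pass a file path.

```python
import sqlite3
from contextlib import closing

# closing(...) guarantees conn.close() runs when the outer block ends
with closing(sqlite3.connect(":memory:")) as conn:
    # "with conn" commits if the block succeeds and rolls back if it raises
    with conn:
        conn.execute("CREATE TABLE IF NOT EXISTS cats (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("INSERT INTO cats (name) VALUES (?)", ("Maru",))

    n_cats = conn.execute("SELECT COUNT(*) FROM cats").fetchone()[0]

print(n_cats)  # the connection is closed by this point
```

- With a plain `conn`/`cur` pair, the equivalent steps are calling `conn.commit()` after the inserts and `conn.close()` when finished.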

In [None]:
# cur.execute("""CREATE TABLE cats (
#     id INTEGER PRIMARY KEY,
#     name TEXT,
#     age INTEGER,
#     breed TEXT)
#     """)

# 🏝 Breakout Group Activity:  [Survive on sql-island](https://sql-island.informatik.uni-kl.de/)

- Survive on sql-island
    - https://sql-island.informatik.uni-kl.de/
- 3 min walk through together before breakout rooms