<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice SQL with Pandas pt. 1

_Authors: Joseph Nelson (DC), Matt Brems (DC)_

---

### Review: Pandas and SQL


#### The pandas connector and functions for SQL

We can use SQL through pandas using the `pandas.io.sql` module:

```python
import pandas as pd
from pandas.io import sql

cars = pd.read_csv('./datasets/csv/car-names.csv', encoding='utf-8')
```


#### read_sql_table(table_name, con[, schema, ...])
- Read SQL database table into a DataFrame.

#### read_sql_query(sql, con[, index_col, ...])
- Read SQL query into a DataFrame.

#### read_sql(sql, con[, index_col, ...])
- Read SQL query or database table into a DataFrame.
- A convenience wrapper around read_sql_table() and read_sql_query().
- Delegates to the specific function depending on the provided input.

#### DataFrame.to_sql(name, con[, flavor, ...])
- Write records stored in a DataFrame to a SQL database.
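
As a quick illustration of how these functions fit together, here is a minimal round-trip sketch. It assumes a throwaway in-memory SQLite database and a made-up `toy_cars` table (neither is part of the lesson's datasets). `read_sql_table` is left out because it needs a SQLAlchemy connectable rather than a plain `sqlite3` connection.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data standing in for the lesson's CSV files.
toy = pd.DataFrame({"Id": [1, 2, 3], "Make": ["buick", "ford", "amc"]})

# An in-memory SQLite database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")

# DataFrame.to_sql writes the frame as a table named 'toy_cars'.
toy.to_sql(name="toy_cars", con=conn, if_exists="replace", index=False)

query = "SELECT Id, Make FROM toy_cars WHERE Id > 1 ORDER BY Id"

# read_sql_query runs a SQL string and returns the result as a DataFrame.
result = pd.read_sql_query(query, con=conn)

# Given a query string, read_sql delegates to read_sql_query.
same = pd.read_sql(query, con=conn)

print(result)
```

Both calls should return the same two rows, which is why the rest of the lesson can use `read_sql` throughout.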

---

### 1.  Create a SQL DB and tables using Pandas DFs and CSVs

First, we need to read our CSV file into a pandas DataFrame. We can then use pandas to write that DataFrame to a SQL table.

In [3]:
import pandas as pd
from pandas.io import sql

cars = pd.read_csv('./datasets/csv/car-names.csv', encoding = 'utf-8')
# If you don't specify the type encoding as 'utf-8' you're going to have a bad time when you try to convert to SQL


In [4]:
# Checking what our dataframe looks like
cars.head(3)

|   | Id | Model | Make |
|---|---|---|---|
| 0 | 1 | 'chevrolet' | 'chevrolet chevelle malibu' |
| 1 | 2 | 'buick' | 'buick skylark 320' |
| 2 | 3 | 'plymouth' | 'plymouth satellite' |


In [5]:
# Checking for nulls in our data
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 3 columns):
Id       406 non-null int64
Model    406 non-null object
Make     406 non-null object
dtypes: int64(1), object(2)
memory usage: 9.6+ KB


**Import the `sqlite3` package and connect to the database.**

```python
import sqlite3

connection = sqlite3.connect('./datasets/sql/Cars.db.sqlite')
```

In [8]:
# Import Sqlite3 Library
import sqlite3

# Establishing the Connection to our Database.  If no database exists here, this will create one.
connection = sqlite3.connect('./datasets/sql/Cars.db.sqlite')

# Keep in mind that relative paths are resolved against the directory your notebook was started in, so that is the base directory for all of our SQL actions from here.

#### Write the loaded CSV data to a SQL table.

```python
cars.to_sql(name = 'car_names', con = connection, if_exists = 'replace', index = False)
```

Important `.to_sql` arguments:
- `name` = the name of the table to write to, which matters when a SQL database holds multiple tables
- `con` = the connection object for the database where the data should be placed
- `if_exists` = what to do if the table already exists (`'fail'`, `'replace'`, or `'append'`)

If you check that directory now, you should see a `Cars.db.sqlite` file.

In [9]:
# Writes a DataFrame to a table in the SQL database
cars.to_sql(name = 'car_names', con = connection, if_exists = 'replace', index = False)

# name = the name of the table, which matters when a SQL database holds multiple tables
# con = the connection object for the database where the data should be placed
# if_exists = what to do if the table already exists ('fail', 'replace', or 'append')

> **Note:** If you want a temporary SQL database, the connection below gives you one stored in memory (RAM) rather than on disk.

``` python
conn = sqlite3.connect(':memory:')
```
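
To see how `if_exists` behaves without touching the files on disk, here is a small sketch against an in-memory database. The `demo_names` table and its rows are made up for illustration.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # temporary database held in RAM

# Made-up rows, not taken from the lesson's CSV files.
batch = pd.DataFrame({"Id": [1, 2], "Model": ["buick", "ford"]})

# 'replace' drops any existing table of that name and recreates it.
batch.to_sql(name="demo_names", con=conn, if_exists="replace", index=False)

# 'append' adds rows to the existing table instead.
batch.to_sql(name="demo_names", con=conn, if_exists="append", index=False)

# 'fail' (the default) raises ValueError when the table already exists.
try:
    batch.to_sql(name="demo_names", con=conn, if_exists="fail", index=False)
    raised = False
except ValueError:
    raised = True

row_count = conn.execute("SELECT COUNT(*) FROM demo_names").fetchone()[0]
print(row_count, raised)
```

The table ends up with four rows (two written, two appended), and the final call is rejected. Everything disappears once the connection is closed.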

### 2. Create a table in the cars database for car makers.

The table should be called `car_makers`

In [11]:
car_makers_csv = './datasets/csv/car-makers.csv'

# Creating a table for the car makers
makers = pd.read_csv(car_makers_csv, encoding = 'utf-8')

connection = sqlite3.connect('./datasets/sql/Cars.db.sqlite')

makers.to_sql(name = 'car_makers', con = connection, if_exists = 'replace', index = False)


### 3. Create a table in the cars database for the car data.

The table should be called `car_data`

In [14]:
car_data_csv = './datasets/csv/cars-data.csv'

# Creating a table for the car data
data = pd.read_csv(car_data_csv, encoding = 'utf-8')

connection = sqlite3.connect('./datasets/sql/Cars.db.sqlite')

data.to_sql(name = 'car_data', con = connection, if_exists = 'replace', index = False)


### 4. Using a query string, read the entire `car_names` table from your SQL database to a dataframe

Reading into a dataframe with a query string can be done with:
```python
# Using the read_sql from the Pandas SQL library and setting it equal to a DF object.
cars = sql.read_sql(query_string, con = connection)
```

In [12]:
#The SQL Sub-library from Pandas will allow us to run SQL queries within python.
from pandas.io import sql
# We already imported sqlite3, but it will also be needed for reading in SQL 
import sqlite3

# Specifying the SQL Path to the SQL Database
connection = sqlite3.connect('./datasets/sql/Cars.db.sqlite')

# This is our SQL Query
query = 'select * from car_names'

# Using the read_sql from the Pandas SQL library and setting it equal to a DF object.
cars = sql.read_sql(query, con = connection)

cars.head()

|   | Id | Model | Make |
|---|---|---|---|
| 0 | 1 | 'chevrolet' | 'chevrolet chevelle malibu' |
| 1 | 2 | 'buick' | 'buick skylark 320' |
| 2 | 3 | 'plymouth' | 'plymouth satellite' |
| 3 | 4 | 'amc' | 'amc rebel sst' |
| 4 | 5 | 'ford' | 'ford torino' |


> **Tip:** If you 'shift+tab' in the function call, you can see that the `read_sql` function takes the arguments 'sql' and 'con'

## Side note: normalized vs. denormalized databases

---

There are several ways to organize data in a relational database. Two common approaches are normalized and denormalized designs.

- __Normalized__ structures have a single table per entity, and use many foreign keys or link tables to connect the entities.

- __Denormalized__ structures have fewer tables and may (for example) place all of the tweets and the information on users in one table.

Each style has advantages and disadvantages. Denormalized tables duplicate a lot of information. For example, in a combined tweets/users table, we may store the address of each user. Now instead of storing this once per user, we are storing this once per tweet!

However, this makes the data easy to access if we ever need to find the tweet along with the user's location.

Normalized tables save the storage space by separating the information. However, if we ever need to access those two pieces of information, we would need to join the two tables, which can be a fairly slow operation.
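
The trade-off can be sketched with two tiny tables. The `users` and `tweets` tables below, along with their columns and values, are invented for illustration and follow the tweets example from the text.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Normalized: one table per entity, linked by a user_id foreign key.
users = pd.DataFrame({"user_id": [1, 2], "location": ["DC", "NYC"]})
tweets = pd.DataFrame({
    "tweet_id": [10, 11, 12],
    "user_id": [1, 1, 2],
    "text": ["hello", "sql is fun", "pandas too"],
})
users.to_sql(name="users", con=conn, index=False)
tweets.to_sql(name="tweets", con=conn, index=False)

# Reading a tweet together with its author's location requires a join.
denormalized = pd.read_sql(
    "SELECT t.tweet_id, t.text, u.location "
    "FROM tweets AS t JOIN users AS u ON t.user_id = u.user_id "
    "ORDER BY t.tweet_id",
    conn,
)

# Denormalized: store the joined result as a single table.
# The location 'DC' is now repeated once per tweet instead of once per user.
denormalized.to_sql(name="tweets_with_users", con=conn, index=False)
print(denormalized)
```

Querying `tweets_with_users` needs no join, at the cost of repeating each user's location on every one of their tweets.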


### 5. Write a python function to query a database using pandas and return a dataframe

The function should take two arguments:
- the query string
- the database connection object

In [16]:
# In the case that typing out sql.read_sql() is a little too much,
# we'll create a function shortcut.

CARS = sqlite3.connect('./datasets/sql/Cars.db.sqlite')

def Q(query, db=CARS):
    return sql.read_sql(query, db)

### 6. Select the first 5 rows of the `car_names` table

> Hint: the LIMIT command in SQL can limit the number of rows returned

In [17]:
Q('select * from car_names limit 5')

|   | Id | Model | Make |
|---|---|---|---|
| 0 | 1 | 'chevrolet' | 'chevrolet chevelle malibu' |
| 1 | 2 | 'buick' | 'buick skylark 320' |
| 2 | 3 | 'plymouth' | 'plymouth satellite' |
| 3 | 4 | 'amc' | 'amc rebel sst' |
| 4 | 5 | 'ford' | 'ford torino' |


### 7. Add the cars into the `car_names` table

The execute function will come in handy here. It will execute a sql command string.
```python
connection.execute()
```

In [18]:
ferrari = (None, 'Ferrari','The Ferrari')
tesla = [None, 'Tesla', None]

In [19]:
CARS.execute('INSERT INTO car_names VALUES (?, ?, ?)', ferrari)

<sqlite3.Cursor at 0x1125baf80>

In [20]:
CARS.execute('INSERT INTO car_names VALUES (?, ?, ?)',tesla)

<sqlite3.Cursor at 0x1125baf10>
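
The same pattern can be tried in isolation. The sketch below uses a made-up `demo_names` table in an in-memory database. It also shows `executemany`, which inserts several rows in one call, and `commit`, which this lesson does not otherwise need because it reads through the same connection it writes with.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_names (Id INTEGER, Model TEXT, Make TEXT)")

new_rows = [
    (None, "Ferrari", "The Ferrari"),  # Python None is stored as SQL NULL
    (None, "Tesla", None),
]

# Each ? placeholder is filled from the tuple by the driver,
# so values are never spliced into the SQL string by hand.
conn.executemany("INSERT INTO demo_names VALUES (?, ?, ?)", new_rows)

# Inserts belong to a transaction; commit so other connections can see them.
conn.commit()

null_makes = conn.execute(
    "SELECT COUNT(*) FROM demo_names WHERE Make IS NULL"
).fetchone()[0]
print(null_makes)
```

Only the Tesla row has a NULL `Make`, so the count is 1.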

### 8. Query the `car_names` table for all columns where `'Model' = 'Tesla'`

In [21]:
Q("SELECT * FROM car_names WHERE car_names.Model = 'Tesla'")

|   | Id | Model | Make |
|---|---|---|---|
| 0 |  | Tesla |  |


### 9. Select the first 5 rows from the `car_makers` table

In [22]:
Q('select * from car_makers limit 5')

|   | Id | Maker | FullName | Country |
|---|---|---|---|---|
| 0 | 1 | 'amc' | 'American Motor Company' | 1 |
| 1 | 2 | 'volkswagen' | 'Volkswagen' | 2 |
| 2 | 3 | 'bmw' | 'BMW' | 2 |
| 3 | 4 | 'gm' | 'General Motors' | 1 |
| 4 | 5 | 'ford' | 'Ford Motor Company' | 1 |


### 10. Select the first 5 rows from the `car_data` table

In [23]:
Q('select * from car_data limit 5')

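|   | Id | MPG | Cylinders | Edispl | Horsepower | Weight | Accelerate | Year |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 18 | 8 | 307.0 | 130 | 3504 | 12.0 | 1970 |
| 1 | 2 | 15 | 8 | 350.0 | 165 | 3693 | 11.5 | 1970 |
| 2 | 3 | 18 | 8 | 318.0 | 150 | 3436 | 11.0 | 1970 |
| 3 | 4 | 16 | 8 | 304.0 | 150 | 3433 | 12.0 | 1970 |
| 4 | 5 | 17 | 8 | 302.0 | 140 | 3449 | 10.5 | 1970 |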


## SQL join types

---

SQL joins are used when data is spread across different tables. A join operation combines rows from two or more tables into a single new table. For this to be possible, the tables need to share a common field.

Join operations can be thought of as operations between two sets, where records with the same key are combined and records missing in one set are either discarded or included as NULL values.

![join types](images/joins.gif)

Join Types:
- **INNER JOIN:** Returns only the rows that have a match in BOTH tables
- **LEFT JOIN:** Returns all rows from the left table, and the matched rows from the right table
- **RIGHT JOIN:** Returns all rows from the right table, and the matched rows from the left table
- **FULL JOIN:** Returns all rows from BOTH tables, matched where possible, with NULLs filling in wherever one side has no match
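
The four join types can be compared side by side with pandas' own `merge`, whose `how` argument mirrors the SQL keywords. This is only a sketch on two made-up frames, not the lesson's tables.

```python
import pandas as pd

# Hypothetical frames: Id 1 exists only on the left, Id 4 only on the right.
left = pd.DataFrame({"Id": [1, 2, 3], "Make": ["buick", "ford", "amc"]})
right = pd.DataFrame({"Id": [2, 3, 4], "MPG": [15, 18, 16]})

inner = left.merge(right, on="Id", how="inner")    # Ids 2, 3
left_j = left.merge(right, on="Id", how="left")    # Ids 1, 2, 3
right_j = left.merge(right, on="Id", how="right")  # Ids 2, 3, 4
full = left.merge(right, on="Id", how="outer")     # Ids 1, 2, 3, 4

for name, frame in [("inner", inner), ("left", left_j),
                    ("right", right_j), ("full", full)]:
    print(name, sorted(frame["Id"].tolist()))
```

Rows without a partner keep their own columns and get missing values (`NaN`) in the columns that come from the other frame.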

![sql join types](images/sql-joins.jpeg)

### `Id` is the matching column that we can use to join.

Let's check out the ways we can join these tables.

### 11. Practice inner joining

The most common type of join is the SQL `INNER JOIN` (simple join). An `INNER JOIN` returns all rows from multiple tables where the join condition is met.

If we `INNER JOIN` on "Id", this takes the intersection of the two tables, excluding the rows for which `Id` is null or unmatched in EITHER of the two tables.

Essentially, only rows whose `Id` appears in both tables will be kept.

**Select Make, MPG, Horsepower, and Year**:
- You will need to `INNER JOIN` the `car_names` and `car_data` tables on the `Id` column.


In [24]:
inner_join = Q('SELECT car_names."Make", car_data."MPG", car_data."Horsepower", car_data."Year" '
'FROM car_names '
'INNER JOIN car_data '
'ON car_names."Id"=car_data."Id"')
inner_join.head()

|   | Make | MPG | Horsepower | Year |
|---|---|---|---|---|
| 0 | 'chevrolet chevelle malibu' | 18 | 130 | 1970 |
| 1 | 'buick skylark 320' | 15 | 165 | 1970 |
| 2 | 'plymouth satellite' | 18 | 150 | 1970 |
| 3 | 'amc rebel sst' | 16 | 150 | 1970 |
| 4 | 'ford torino' | 17 | 140 | 1970 |


In [25]:
inner_join.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 4 columns):
Make          406 non-null object
MPG           406 non-null object
Horsepower    406 non-null object
Year          406 non-null int64
dtypes: int64(1), object(3)
memory usage: 12.8+ KB


### 12. Practice left joining

The `LEFT JOIN` keyword returns all rows from the left table (table1), with the matching rows in the right table (table2). The result is NULL in the right side when there is no match.

**Select Make, MPG, Horsepower, and Year**
- `SELECT FROM` the `car_names` table
- `LEFT JOIN` the `car_data` table by `Id`

In [26]:
left_join = Q('SELECT car_names."Make", car_data."MPG", car_data."Horsepower", car_data."Year" '
'FROM car_names '
'LEFT JOIN car_data '
'ON car_names."Id"=car_data."Id"')
left_join.head()

|   | Make | MPG | Horsepower | Year |
|---|---|---|---|---|
| 0 | 'chevrolet chevelle malibu' | 18 | 130 | 1970.0 |
| 1 | 'buick skylark 320' | 15 | 165 | 1970.0 |
| 2 | 'plymouth satellite' | 18 | 150 | 1970.0 |
| 3 | 'amc rebel sst' | 16 | 150 | 1970.0 |
| 4 | 'ford torino' | 17 | 140 | 1970.0 |


In [27]:
left_join.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408 entries, 0 to 407
Data columns (total 4 columns):
Make          407 non-null object
MPG           406 non-null object
Horsepower    406 non-null object
Year          406 non-null float64
dtypes: float64(1), object(3)
memory usage: 12.8+ KB
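
Notice that `Year` is now `float64` rather than `int64`. The left join produced NULLs for unmatched rows, and pandas represents those as `NaN`, which forces a float column. The small sketch below reproduces this with two made-up tables (`demo_names` and `demo_data`) and shows one way to pick out the unmatched rows.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Hypothetical tables: 'tesla' has no matching row in demo_data.
pd.DataFrame({"Id": [1, 2], "Make": ["buick", "tesla"]}).to_sql(
    name="demo_names", con=conn, index=False)
pd.DataFrame({"Id": [1], "Year": [1970]}).to_sql(
    name="demo_data", con=conn, index=False)

joined = pd.read_sql(
    "SELECT demo_names.Make, demo_data.Year "
    "FROM demo_names "
    "LEFT JOIN demo_data ON demo_names.Id = demo_data.Id "
    "ORDER BY demo_names.Id",
    conn,
)
print(joined.dtypes)

# Rows from the left table that found no partner on the right.
unmatched = joined[joined["Year"].isna()]
print(unmatched)
```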


###  Right joins and Full Outer Joins (unsupported)

> **No exercises for RIGHT and FULL OUTER because SQLite versions before 3.39.0 do not support them.**

The `RIGHT JOIN` keyword would return all rows from the right table (table2), with the matching rows in the left table (table1). The result is NULL in the left side when there is no match.

The `FULL OUTER JOIN` keyword returns all rows from the left table (table1) and from the right table (table2). The `FULL OUTER JOIN` keyword combines the result of both `LEFT` and `RIGHT` joins. In this case we could have NULL values on both sides.
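
Where `RIGHT JOIN` and `FULL OUTER JOIN` are unavailable, they can usually be emulated with `LEFT JOIN`. The sketch below shows one common approach on two made-up tables (`demo_names` and `demo_data`). It is an illustration under those assumptions, not part of the exercises.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Hypothetical tables: Id 1 exists only in demo_names, Id 3 only in demo_data.
pd.DataFrame({"Id": [1, 2], "Make": ["buick", "ford"]}).to_sql(
    name="demo_names", con=conn, index=False)
pd.DataFrame({"Id": [2, 3], "MPG": [15, 18]}).to_sql(
    name="demo_data", con=conn, index=False)

# RIGHT JOIN emulated: swap the table order and use LEFT JOIN.
right_join = pd.read_sql(
    "SELECT n.Make, d.Id, d.MPG "
    "FROM demo_data AS d "
    "LEFT JOIN demo_names AS n ON n.Id = d.Id "
    "ORDER BY d.Id",
    conn,
)

# FULL OUTER JOIN emulated: every left-join row, plus the rows of the
# right table that have no match in the left table.
full_join = pd.read_sql(
    "SELECT n.Id AS name_id, n.Make, d.Id AS data_id, d.MPG "
    "FROM demo_names AS n LEFT JOIN demo_data AS d ON n.Id = d.Id "
    "UNION ALL "
    "SELECT n.Id, n.Make, d.Id, d.MPG "
    "FROM demo_data AS d LEFT JOIN demo_names AS n ON n.Id = d.Id "
    "WHERE n.Id IS NULL",
    conn,
)
print(right_join)
print(full_join)
```

The emulated right join keeps both `demo_data` rows, and the emulated full join returns three rows: one matched pair and one unmatched row from each side.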

## Additional resources

---

A bit long-winded, but a good resource for explaining pandas functions from a SQL programmer's perspective (the opposite of ours):

Pydata Video:
https://www.youtube.com/watch?v=1uVWjdAbgBg

Associated GitHub Repo:
https://github.com/gjreda/pydata2014nyc/tree/master/data



Pandas Merge, Join and Concatenate
http://pandas.pydata.org/pandas-docs/stable/merging.html