<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice SQL with Pandas pt. 1

_Authors: Joseph Nelson (DC), Matt Brems (DC)_

---

### Review: Pandas and SQL


#### The pandas connector and functions for SQL

We can use SQL through pandas using the `pandas.io.sql` module:

```python
import pandas as pd
from pandas.io import sql

cars = pd.read_csv('data/csv/car-names.csv', encoding = 'utf-8')
```


#### read_sql_table(table_name, con[, schema, ...])
- Read a SQL database table into a DataFrame.
#### read_sql_query(sql, con[, index_col, ...])
- Read a SQL query into a DataFrame.
#### read_sql(sql, con[, index_col, ...])
- Read a SQL query or database table into a DataFrame.
- A convenience wrapper around `read_sql_table()` and `read_sql_query()`.
- Delegates to the specific function depending on the provided input.
#### DataFrame.to_sql(name, con[, schema, ...])
- Write records stored in a DataFrame to a SQL database.
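As a minimal sketch of how these functions fit together, the example below writes a small DataFrame to an in-memory SQLite database and reads it back. The table name `demo_cars` and its rows are made up for illustration. Because `read_sql_table` needs a SQLAlchemy connectable rather than a plain `sqlite3` connection, this sketch sticks to `read_sql_query` and `read_sql`.

```python
import sqlite3
import pandas as pd

# Hypothetical demo data; not the lab's CSV.
demo = pd.DataFrame({'Id': [1, 2], 'Model': ['ford', 'amc']})

conn = sqlite3.connect(':memory:')  # temporary database held in RAM

# Write the DataFrame to a table, then read it back two ways.
demo.to_sql(name='demo_cars', con=conn, if_exists='replace', index=False)
from_query = pd.read_sql_query('SELECT * FROM demo_cars', con=conn)
from_wrapper = pd.read_sql('SELECT Model FROM demo_cars WHERE Id = 2', con=conn)

print(from_query)
print(from_wrapper)
```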

---

### 1.  Create a SQL DB and tables using Pandas DFs and CSVs

First, we need to read our CSV file into a pandas DataFrame before we can write it to a SQL table.

In [6]:
import pandas as pd
from pandas.io import sql
cars=pd.read_csv('/Users/Mahendra/desktop/GA/hw/2.4.1_database-sql_with_pandas_1-lab/datasets/csv/car-names.csv', encoding = 'utf-8')
# If you don't specify the type encoding as 'utf-8' you're going to have a bad time when you try to convert to SQL
cars.head()

| | Id | Model | Make |
|---|---|---|---|
| 0 | 1 | 'chevrolet' | 'chevrolet chevelle malibu' |
| 1 | 2 | 'buick' | 'buick skylark 320' |
| 2 | 3 | 'plymouth' | 'plymouth satellite' |
| 3 | 4 | 'amc' | 'amc rebel sst' |
| 4 | 5 | 'ford' | 'ford torino' |


**Import the `sqlite3` package and connect to the database.**

```python
import sqlite3

connection = sqlite3.connect('./datasets/sql/Cars.db.sqlite')
```

In [8]:
# A:
import sqlite3
connection=sqlite3.connect('/Users/Mahendra/desktop/GA/hw/2.4.1_database-sql_with_pandas_1-lab/datasets/sql/Cars.db.sqlite')
c=connection.cursor()

#### Write the loaded CSV data to a SQL table.

```python
cars.to_sql(name = 'car_names', con = connection, if_exists = 'replace', index = False)
```

Important `.to_sql` arguments:
- `name` = name of the table to write to; useful if you have multiple tables in a SQL database
- `con` = the connection to the database where the data should be placed
- `if_exists` = what to do if the table already exists: `'fail'` (the default), `'replace'`, or `'append'`

If you check that directory now, you should see a `Cars.db.sqlite` file.
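The three `if_exists` options can be compared with a short sketch. It uses an in-memory database and a made-up table called `demo`, so it does not touch the lab's `Cars.db.sqlite` file.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
df = pd.DataFrame({'Id': [1, 2], 'Model': ['ford', 'amc']})  # made-up rows

# First write: the table does not exist yet, so the default 'fail' is fine.
df.to_sql(name='demo', con=conn, index=False)

# Second write with the default: the table exists, so pandas raises ValueError.
try:
    df.to_sql(name='demo', con=conn, index=False)
    failed = False
except ValueError:
    failed = True

# 'append' adds the rows to the existing table.
df.to_sql(name='demo', con=conn, if_exists='append', index=False)
rows_after_append = int(pd.read_sql('SELECT COUNT(*) AS n FROM demo', con=conn)['n'][0])

# 'replace' drops the table and writes it again from scratch.
df.to_sql(name='demo', con=conn, if_exists='replace', index=False)
rows_after_replace = int(pd.read_sql('SELECT COUNT(*) AS n FROM demo', con=conn)['n'][0])

print(failed, rows_after_append, rows_after_replace)
```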

In [10]:
# A:
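# to_sql writes the table as a side effect; it does not return a DataFrame, so there is nothing to assign
cars.to_sql(name = 'car_names', con = connection, if_exists = 'replace', index = False)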


> **Note:** Connecting to `':memory:'` as shown below gives you a temporary SQL database stored in memory (RAM) rather than on disk.

``` python
conn = sqlite3.connect(':memory:')
```
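A small sketch of what "temporary" means here: anything created through a `':memory:'` connection is gone once that connection closes. The table name `scratch` is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE scratch (x INTEGER)')
conn.execute('INSERT INTO scratch VALUES (1)')
count_before = conn.execute('SELECT COUNT(*) FROM scratch').fetchone()[0]
conn.close()

# A new ':memory:' connection starts from an empty database.
conn2 = sqlite3.connect(':memory:')
tables = conn2.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
conn2.close()

print(count_before, tables)
```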

### 2. Create a table in the cars database for car makers.

The table should be called `car_makers`

In [16]:
# A:
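# IF NOT EXISTS keeps a re-run from raising "OperationalError: table car_makers already exists"
c.execute('create table if not exists car_makers(field1 INTEGER PRIMARY KEY);')
connection.commit()  # commit must be called; a bare `connection.commit` does nothing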
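Re-running a plain `CREATE TABLE` raises `OperationalError` once the table exists. The sketch below shows one re-runnable alternative in an in-memory database. The column names mirror the `car_makers` output shown later in this lab; the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Column types here are guesses based on the values in the lab's output.
create_sql = '''
CREATE TABLE IF NOT EXISTS car_makers (
    Id INTEGER PRIMARY KEY,
    Maker TEXT,
    FullName TEXT,
    Country INTEGER
);
'''
cur.execute(create_sql)
cur.execute(create_sql)  # second run is a no-op instead of an OperationalError
conn.commit()            # note the parentheses: commit is a method call

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in cur.execute('PRAGMA table_info(car_makers)')]
print(columns)
```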

### 3. Create a table in the cars database for the car data.

The table should be called `car_data`

In [18]:
# A:
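# IF NOT EXISTS keeps a re-run from raising "OperationalError: table car_data already exists"
c.execute('create table if not exists car_data(field1 INTEGER PRIMARY KEY);')
connection.commit()  # commit must be called; a bare `connection.commit` does nothing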

### 4. Using a query string, read the entire `car_names` table from your SQL database to a dataframe

Reading into a dataframe with a query string can be done with:
```python
# Using the read_sql from the Pandas SQL library and setting it equal to a DF object.
cars = sql.read_sql(query_string, con = connection)
```

In [21]:
# A:
cars=sql.read_sql('select * from car_names',con=connection)
cars


| | Id | Model | Make |
|---|---|---|---|
| 0 | 1 | 'chevrolet' | 'chevrolet chevelle malibu' |
| 1 | 2 | 'buick' | 'buick skylark 320' |
| 2 | 3 | 'plymouth' | 'plymouth satellite' |
| 3 | 4 | 'amc' | 'amc rebel sst' |
| 4 | 5 | 'ford' | 'ford torino' |
| 5 | 6 | 'ford' | 'ford galaxie 500' |
| 6 | 7 | 'chevrolet' | 'chevrolet impala' |
| 7 | 8 | 'plymouth' | 'plymouth fury iii' |
| 8 | 9 | 'pontiac' | 'pontiac catalina' |
| 9 | 10 | 'amc' | 'amc ambassador dpl' |


> **Tip:** If you 'shift+tab' in the function call, you can see that the `read_sql` function takes the arguments 'sql' and 'con'

## Side note: normalized vs. denormalized databases

---

There are several ways to organize data in a relational database. Two common approaches are normalized and denormalized structures. As a running example, imagine a database of tweets and the users who wrote them.

- __Normalized__ structures have a single table per entity, and use many foreign keys or link tables to connect the entities.

- __Denormalized__ structures have fewer tables and may (for example) place all of the tweets and the information on users in one table.

Each style has advantages and disadvantages. Denormalized tables duplicate a lot of information. For example, in a combined tweets/users table, we may store the address of each user. Now instead of storing this once per user, we are storing it once per tweet!

However, this makes the data easy to access if we ever need to find a tweet along with its user's location.

Normalized tables save storage space by separating the information. However, if we ever need both pieces of information together, we have to join the two tables, which can be a fairly slow operation.
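The trade-off can be illustrated with a small sketch. The `users` and `tweets` tables and their contents are hypothetical. Joining the two normalized tables produces the denormalized view, in which each user's location is repeated once per tweet.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')

# Normalized: one table per entity, linked by user_id (made-up data).
users = pd.DataFrame({'user_id': [1, 2], 'location': ['DC', 'NYC']})
tweets = pd.DataFrame({'tweet_id': [10, 11, 12],
                       'user_id': [1, 1, 2],
                       'text': ['hello', 'sql', 'pandas']})
users.to_sql(name='users', con=conn, index=False)
tweets.to_sql(name='tweets', con=conn, index=False)

# The join rebuilds the denormalized view: location appears once per tweet.
denormalized = pd.read_sql(
    '''SELECT t.tweet_id, t.text, u.location
       FROM tweets AS t
       JOIN users AS u ON t.user_id = u.user_id
       ORDER BY t.tweet_id''',
    con=conn)
print(denormalized)
```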


### 5. Write a python function to query a database using pandas and return a dataframe

The function should take two arguments:
- the query string
- the database connection object
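One possible shape for such a function is sketched below. The name `query_to_df` and the demo table are hypothetical; the function simply wraps `pd.read_sql`.

```python
import sqlite3
import pandas as pd

def query_to_df(query, con):
    """Run a SQL query on `con` and return the result as a DataFrame."""
    return pd.read_sql(query, con=con)

# Usage on a throwaway in-memory table (made-up data).
demo_con = sqlite3.connect(':memory:')
pd.DataFrame({'Id': [1, 2, 3]}).to_sql(name='demo', con=demo_con, index=False)
result = query_to_df('SELECT * FROM demo WHERE Id > 1', demo_con)
print(result)
```

Wrapping the call this way keeps the connection handling in one place, so later cells only need to pass a query string and the connection.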

In [27]:
# A:
df=pd.read_sql('select * from car_names ;',con=connection)
df

| | Id | Model | Make |
|---|---|---|---|
| 0 | 1 | 'chevrolet' | 'chevrolet chevelle malibu' |
| 1 | 2 | 'buick' | 'buick skylark 320' |
| 2 | 3 | 'plymouth' | 'plymouth satellite' |
| 3 | 4 | 'amc' | 'amc rebel sst' |
| 4 | 5 | 'ford' | 'ford torino' |
| 5 | 6 | 'ford' | 'ford galaxie 500' |
| 6 | 7 | 'chevrolet' | 'chevrolet impala' |
| 7 | 8 | 'plymouth' | 'plymouth fury iii' |
| 8 | 9 | 'pontiac' | 'pontiac catalina' |
| 9 | 10 | 'amc' | 'amc ambassador dpl' |


### 6. Select the first 5 rows of the `car_names` table

> Hint: the LIMIT command in SQL can limit the number of rows returned
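A quick sketch of `LIMIT` on a made-up table follows. Pairing it with `ORDER BY` is a suggestion rather than a requirement: without an explicit order, SQL does not guarantee which rows come back first.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
pd.DataFrame({'Id': [3, 1, 2, 4]}).to_sql(name='demo', con=conn, index=False)

# ORDER BY makes "the first 2 rows" well defined before LIMIT trims the result.
first_two = pd.read_sql('SELECT * FROM demo ORDER BY Id LIMIT 2', con=conn)
print(first_two)
```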

In [26]:
# A:
pd.read_sql('select * from car_names limit 5;',con=connection)

| | Id | Model | Make |
|---|---|---|---|
| 0 | 1 | 'chevrolet' | 'chevrolet chevelle malibu' |
| 1 | 2 | 'buick' | 'buick skylark 320' |
| 2 | 3 | 'plymouth' | 'plymouth satellite' |
| 3 | 4 | 'amc' | 'amc rebel sst' |
| 4 | 5 | 'ford' | 'ford torino' |


### 7. Add the cars into the `car_names` table

The `execute` method will come in handy here. It executes a SQL command string.
```python
connection.execute()
```

In [29]:
ferrari = (None, 'Ferrari','The Ferrari')
tesla = [None, 'Tesla', None]

In [37]:
# A:
connection.execute('Insert into car_names values (?,?,?)',ferrari)
connection.execute('Insert into car_names values (?,?,?)',tesla)
connection.commit()  # call commit() with parentheses; a bare `connection.commit` only returns the method object
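The placeholder-based insert can also be sketched end to end on an in-memory database. The column types in the `CREATE TABLE` statement are assumptions, and `executemany` is used here as one way to insert several rows with a single call.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Assumed column types; the lab's real table is created by to_sql.
conn.execute('CREATE TABLE car_names (Id INTEGER, Model TEXT, Make TEXT)')

new_cars = [
    (None, 'Ferrari', 'The Ferrari'),
    (None, 'Tesla', None),  # None is stored as SQL NULL
]
# One "?" per column; sqlite3 fills them in from each tuple.
conn.executemany('INSERT INTO car_names VALUES (?, ?, ?)', new_cars)
conn.commit()  # the parentheses matter: this actually commits

rows = conn.execute(
    'SELECT Model, Make FROM car_names ORDER BY Model'
).fetchall()
print(rows)
```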

### 8. Query the `car_names` table for all columns where `'Model' = 'Tesla'`

In [40]:
# A:
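# SQL string literals take single quotes; double quotes are meant for identifiers such as column names
pd.read_sql("select * from car_names where Model = 'Tesla';",con=connection)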

| | Id | Model | Make |
|---|---|---|---|
| 0 | | Tesla | |
| 1 | | Tesla | |
| 2 | | Tesla | |
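When the value being matched comes from a variable, a parameterized query avoids building the SQL string by hand. This is a minimal sketch on a made-up table, using the `params` argument of `pd.read_sql` with the `?` placeholder style of `sqlite3`.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
pd.DataFrame({'Id': [1, 2], 'Model': ['ford', 'Tesla']}).to_sql(
    name='car_names', con=conn, index=False)

wanted_model = 'Tesla'
# The "?" is filled in by the driver from the params tuple.
teslas = pd.read_sql('SELECT * FROM car_names WHERE Model = ?',
                     con=conn, params=(wanted_model,))
print(teslas)
```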


In [55]:
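# Same query with a table-qualified column; uses pd.read_sql because no helper named Q is defined in this notebook
pd.read_sql("SELECT * FROM car_names WHERE car_names.\"Model\" = 'Tesla'", con=connection)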

### 9. Select the first 5 rows from the `car_makers` table

In [41]:
# A:
pd.read_sql('select * from car_makers limit 5;',con=connection)

| | Id | Maker | FullName | Country |
|---|---|---|---|---|
| 0 | 1 | 'amc' | 'American Motor Company' | 1 |
| 1 | 2 | 'volkswagen' | 'Volkswagen' | 2 |
| 2 | 3 | 'bmw' | 'BMW' | 2 |
| 3 | 4 | 'gm' | 'General Motors' | 1 |
| 4 | 5 | 'ford' | 'Ford Motor Company' | 1 |


### 10. Select the first 5 rows from the `car_data` table

In [42]:
# A:
pd.read_sql('select * from car_data limit 5;',con=connection)

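| | Id | MPG | Cylinders | Edispl | Horsepower | Weight | Accelerate | Year |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 18 | 8 | 307.0 | 130 | 3504 | 12.0 | 1970 |
| 1 | 2 | 15 | 8 | 350.0 | 165 | 3693 | 11.5 | 1970 |
| 2 | 3 | 18 | 8 | 318.0 | 150 | 3436 | 11.0 | 1970 |
| 3 | 4 | 16 | 8 | 304.0 | 150 | 3433 | 12.0 | 1970 |
| 4 | 5 | 17 | 8 | 302.0 | 140 | 3449 | 10.5 | 1970 |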


## SQL join types

---

SQL joins are used when data is spread across different tables. A join operation combines rows from two or more tables into a single new table. For this to be possible, the tables need to share a common field.

Join operations can be thought of as operations between two sets, where records with the same key are combined and records missing in one set are either discarded or included as NULL values.

![join types](images/joins.gif)

Join Types:
- **INNER JOIN:** Returns only the rows that have a match in BOTH tables
- **LEFT JOIN:** Returns all rows from the left table, and the matched rows from the right table
- **RIGHT JOIN:** Returns all rows from the right table, and the matched rows from the left table
- **FULL JOIN:** Returns all rows from BOTH tables, matched where possible and filled with NULL where there is no match
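The four join types can be compared side by side with `pd.merge`, whose `how` argument accepts `'inner'`, `'left'`, `'right'`, and `'outer'` (the pandas name for a full join). The two small tables are made up: they share `Id` values 2 and 3, while 1 appears only on the left and 4 only on the right.

```python
import pandas as pd

left = pd.DataFrame({'Id': [1, 2, 3], 'Make': ['a', 'b', 'c']})
right = pd.DataFrame({'Id': [2, 3, 4], 'MPG': [15, 18, 16]})

# Count the rows each join type returns.
row_counts = {how: len(pd.merge(left, right, on='Id', how=how))
              for how in ['inner', 'left', 'right', 'outer']}
print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```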

![sql join types](images/sql-joins.jpeg)

### `Id` is our matching column that we can use to merge.

Let's check out all the ways we can merge these tables.

### 11. Practice inner joining

The most common type of join is the `SQL INNER JOIN` (simple join). An `SQL INNER JOIN` returns all rows from multiple tables where the join condition is met.

If we `INNER JOIN` on `Id`, this takes the intersection of the two tables, excluding the rows whose `Id` is missing from EITHER of the two tables.

Essentially, only matching pairs of `Id` values from both tables will be taken.

**Select Make, MPG, Horsepower, and Year**:
- You will need to `INNER JOIN` the `car_names` and `car_data` tables on the `Id` column.
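A minimal sketch of the pattern on toy data follows. The rows are made up, and the table names reuse `car_names` and `car_data` only inside an in-memory database. The join condition goes in the `ON` clause.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
pd.DataFrame({'Id': [1, 2, 3],
              'Make': ['ford torino', 'amc rebel sst', 'the ferrari']}
             ).to_sql(name='car_names', con=conn, index=False)
pd.DataFrame({'Id': [1, 2], 'MPG': [17, 16]}
             ).to_sql(name='car_data', con=conn, index=False)

# Id 3 has no row in car_data, so the inner join leaves it out.
inner = pd.read_sql(
    '''SELECT cn.Make, cd.MPG
       FROM car_names AS cn
       INNER JOIN car_data AS cd ON cn.Id = cd.Id
       ORDER BY cn.Id''',
    con=conn)
print(inner)
```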


In [51]:
# A:
pd.read_sql('select CN.make,CD.MPG,CD.Horsepower,CD.Year from car_names CN inner join car_data CD on CN.id=CD.id;',con=connection)

| | Make | MPG | Horsepower | Year |
|---|---|---|---|---|
| 0 | 'chevrolet chevelle malibu' | 18 | 130 | 1970 |
| 1 | 'buick skylark 320' | 15 | 165 | 1970 |
| 2 | 'plymouth satellite' | 18 | 150 | 1970 |
| 3 | 'amc rebel sst' | 16 | 150 | 1970 |
| 4 | 'ford torino' | 17 | 140 | 1970 |
| 5 | 'ford galaxie 500' | 15 | 198 | 1970 |
| 6 | 'chevrolet impala' | 14 | 220 | 1970 |
| 7 | 'plymouth fury iii' | 14 | 215 | 1970 |
| 8 | 'pontiac catalina' | 14 | 225 | 1970 |
| 9 | 'amc ambassador dpl' | 15 | 190 | 1970 |


### 12. Practice left joining

The `LEFT JOIN` keyword returns all rows from the left table (table1), with the matching rows in the right table (table2). The result is NULL on the right side when there is no match.

**Select Make, MPG, Horsepower, and Year**
- `SELECT FROM` the `car_names` table
- `LEFT JOIN` the `car_data` table by `Id`
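The sketch below, on made-up rows, shows why the join condition belongs in `ON`. With `ON`, the unmatched left row is kept and gets a NULL (shown as `NaN` in pandas). Moving the same condition into `WHERE` filters that row out again, so the query behaves like an inner join.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
pd.DataFrame({'Id': [1, 2, 3],
              'Make': ['ford torino', 'amc rebel sst', 'the ferrari']}
             ).to_sql(name='car_names', con=conn, index=False)
pd.DataFrame({'Id': [1, 2], 'MPG': [17, 16]}
             ).to_sql(name='car_data', con=conn, index=False)

# Condition in ON: all three car_names rows survive.
left_on = pd.read_sql(
    '''SELECT cn.Make, cd.MPG
       FROM car_names AS cn
       LEFT JOIN car_data AS cd ON cn.Id = cd.Id
       ORDER BY cn.Id''',
    con=conn)

# Condition in WHERE: the unmatched row is filtered out after the join.
left_where = pd.read_sql(
    '''SELECT cn.Make, cd.MPG
       FROM car_names AS cn
       LEFT JOIN car_data AS cd
       WHERE cn.Id = cd.Id
       ORDER BY cn.Id''',
    con=conn)

print(len(left_on), len(left_where))
```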

In [54]:
# A:
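# The join condition goes in ON; in a WHERE clause it would drop the unmatched rows a LEFT JOIN is meant to keep
pd.read_sql('select CN.make,CD.MPG,CD.Horsepower,CD.Year from car_names CN left join car_data CD on CN.id=CD.id',con=connection)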

| | Make | MPG | Horsepower | Year |
|---|---|---|---|---|
| 0 | 'chevrolet chevelle malibu' | 18 | 130 | 1970 |
| 1 | 'buick skylark 320' | 15 | 165 | 1970 |
| 2 | 'plymouth satellite' | 18 | 150 | 1970 |
| 3 | 'amc rebel sst' | 16 | 150 | 1970 |
| 4 | 'ford torino' | 17 | 140 | 1970 |
| 5 | 'ford galaxie 500' | 15 | 198 | 1970 |
| 6 | 'chevrolet impala' | 14 | 220 | 1970 |
| 7 | 'plymouth fury iii' | 14 | 215 | 1970 |
| 8 | 'pontiac catalina' | 14 | 225 | 1970 |
| 9 | 'amc ambassador dpl' | 15 | 190 | 1970 |


###  Right joins and Full Outer Joins (unsupported)

> **No exercises for RIGHT and FULL OUTER because older versions of SQLite do not support them (support was added in SQLite 3.39.0).**

The `RIGHT JOIN` keyword returns all rows from the right table (table2), with the matching rows in the left table (table1). The result is NULL on the left side when there is no match.

The `FULL OUTER JOIN` keyword returns all rows from the left table (table1) and from the right table (table2). The `FULL OUTER JOIN` keyword combines the result of both `LEFT` and `RIGHT` joins. In this case we could have NULL values on both sides.
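Both joins can be emulated with `LEFT JOIN` alone, which is a common workaround on SQLite builds that lack them. The sketch uses made-up tables: a right join is a left join with the tables swapped, and a full outer join is a left join plus the right-table rows that found no match.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
pd.DataFrame({'Id': [1, 2, 3]}).to_sql(name='car_names', con=conn, index=False)
pd.DataFrame({'Id': [2, 3, 4]}).to_sql(name='car_data', con=conn, index=False)

# RIGHT JOIN emulation: start from car_data and left join car_names.
right_like = pd.read_sql(
    '''SELECT cn.Id AS name_id, cd.Id AS data_id
       FROM car_data AS cd
       LEFT JOIN car_names AS cn ON cn.Id = cd.Id''',
    con=conn)

# FULL OUTER JOIN emulation: left join, plus unmatched rows from car_data.
full_outer = pd.read_sql(
    '''SELECT cn.Id AS name_id, cd.Id AS data_id
       FROM car_names AS cn
       LEFT JOIN car_data AS cd ON cn.Id = cd.Id
       UNION ALL
       SELECT cn.Id, cd.Id
       FROM car_data AS cd
       LEFT JOIN car_names AS cn ON cn.Id = cd.Id
       WHERE cn.Id IS NULL''',
    con=conn)

print(len(right_like), len(full_outer))
```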

## Additional resources

---

A bit long-winded, but good resources for explaining pandas functions from a SQL programmer's perspective:
(The opposite of us.)

Pydata Video:
https://www.youtube.com/watch?v=1uVWjdAbgBg

Associated GitHub Repo:
https://github.com/gjreda/pydata2014nyc/tree/master/data



Pandas Merge, Join and Concatenate
http://pandas.pydata.org/pandas-docs/stable/merging.html