# Pandas and Postgres

We are going to show how we can get Pandas (and more generally, Python) to interact with the information in our PostgreSQL database. We are going to look at two different techniques:
- SQL functionality built into Pandas
  This will mostly be used when doing SELECTs. We will be able to get the information from our database and put it directly into a pandas dataframe.
- A cursor based method
  This is a method for executing arbitrary queries on our database (e.g. adding tables, INSERTing values, UPDATEs, or DELETEs). It can also be useful if we are pulling down a lot of data, as the cursor can fetch part of the data at a time (so we can process it as we go). The Pandas method pulls down the query all at once.

To build our skills using Python and Postgres, we are going to look at the `names` database we have already uploaded.

In [None]:
# Get pandas and postgres to work together
import psycopg2 as pg
import pandas as pd
import pandas.io.sql as pd_sql

# We are also going to do some basic viz
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# There is a bug in matplotlib. You cannot set the rc parameters in the same
# cell that you use the "%matplotlib inline" magic command
plt.style.use('ggplot')
plt.rc('font', size=18)

In [None]:
# Postgres info to connect

connection_args = {
    'host': 'localhost',  # We are connecting to our _local_ version of psql
    'dbname': 'names',    # DB that we are connecting to
    'port': 5432          # port we opened on AWS
}

# We will talk about this magic Python trick!
connection = pg.connect(**connection_args)

See the [Python tutorial](https://docs.python.org/3/tutorial/controlflow.html#unpacking-argument-lists) for more info on unpacking arguments.
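
The `**connection_args` trick can be tried without a database. The sketch below uses a hypothetical helper, `describe_connection` (it is not part of psycopg2), to show that unpacking the dictionary is equivalent to typing out each keyword argument by hand:

```python
def describe_connection(host, dbname, port):
    """Hypothetical stand-in for pg.connect: just echoes its arguments."""
    return "{}:{}/{}".format(host, port, dbname)

connection_args = {
    'host': 'localhost',
    'dbname': 'names',
    'port': 5432
}

# The ** operator turns each key/value pair into a keyword argument ...
unpacked = describe_connection(**connection_args)

# ... so the call above is the same as this one
explicit = describe_connection(host='localhost', dbname='names', port=5432)

print(unpacked)
print(unpacked == explicit)
```

The dictionary keys have to match the function's parameter names, which is why `connection_args` uses `host`, `dbname` and `port`.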

## Pandas and the read_sql command

Let's look through the `region` table:

In [None]:
query = "SELECT * FROM region;"

pd_sql.read_sql(query, connection)

This is pretty cool! We even get the column names to match! Notice that we have some problems with the states. For example, there are not enough of them AND `RI` is still in the `Midwest`! We will come back and fix these errors when we look at the cursor methods later.

For right now, let's look at a slightly more advanced SQL query. We checked in `psql` that the number of people named Alice had been dropping off a couple of years ago. Let's make a line plot of
* Year (x axis)
* Total number of people named Alice (y axis)

In [None]:
query_alice = """
    SELECT year, sum(freq) AS num_alice
        FROM name_freq
        WHERE name='Alice'
        GROUP BY name, year
        ORDER BY year
"""

alice_df = pd_sql.read_sql(query_alice, connection)

In [None]:
alice_df.head()

In [None]:
alice_df.plot(x='year', y='num_alice', figsize=(16, 10))
#
# Can also do
# plt.plot(alice_df['year'], alice_df['num_alice'], 'r-')
plt.title("Number of Alices born in the US per year");

## Using the cursor method

The pandas method above is cool, as it allows you to pull data down from PostgreSQL directly into a dataframe. It pulls everything at once, which is convenient if you have small datasets. If your datasets are large (e.g. hundreds of MB up to TB), you probably want to batch your data requests.

**Note**: This cell will almost certainly become dated. Students in the 2022 cohort will think "100MB is large? How quaint." C'est la vie.

Here is a mental model of a cursor. We are trying to execute
```sql
SELECT * FROM region;
```
with a cursor. The cursor `executes` and finds the index of the "first" row:
```
Database                Where the cursor is
.... record 0 ..... <--- cursor (i.e. cursor remembers 0)
.... record 1 .....
.... record 2 .....
.... ........ .....
.... record N .....
```
We can tell the cursor to `fetch` a record. It will return the record it is pointing at (in this case, record 0), and move to the next one.
```
After a call to cursor.fetchone()

Database                Where the cursor is
.... record 0 ..... 
.... record 1 ..... <--- cursor (i.e. cursor remembers 1) 
.... record 2 .....
.... ........ .....
.... record N .....

Cursor has returned record 0
```
We can keep calling fetch until the cursor tells us there are no more records.

**Mental model:**
A cursor (for a select) acts like a "bookmark" in your table.
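
To make the "keep fetching until there is nothing left" idea concrete, here is a minimal sketch of a batching loop. It assumes a toy `name_freq` table with made-up numbers, and it uses Python's built-in `sqlite3` module as a stand-in so it runs without a PostgreSQL server. The `fetchmany` call has the same shape on a psycopg2 cursor.

```python
import sqlite3

# In-memory stand-in database with made-up data (NOT the real names database)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE name_freq (name TEXT, year INTEGER, freq INTEGER)")
rows = [("Alice", 2000 + i, 10 * i) for i in range(10)]
conn.executemany("INSERT INTO name_freq VALUES (?, ?, ?)", rows)

cursor = conn.cursor()
cursor.execute("SELECT year, freq FROM name_freq ORDER BY year")

batch_size = 4
batch_sizes = []   # how many rows each fetch returned
total_freq = 0     # running total, updated one batch at a time

while True:
    batch = cursor.fetchmany(batch_size)
    if not batch:          # an empty list means the cursor is exhausted
        break
    batch_sizes.append(len(batch))
    total_freq += sum(freq for _, freq in batch)

print(batch_sizes, total_freq)
```

One caveat: a default psycopg2 cursor transfers the whole result set to your machine when `execute` is called, so the loop above only limits how much you process at once. To have the server hand rows over in pieces, psycopg2 offers named (server-side) cursors, created with `connection.cursor(name=...)`.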

In [None]:
# make a cursor
cursor = connection.cursor()

# make a query (sets the cursor pointing at the first record)
cursor.execute("SELECT * FROM region;")

In [None]:
# Nothing has been returned yet. Let's fetch a result
cursor.fetchone()

In [None]:
# Let's grab the next one as well
cursor.fetchone()

In [None]:
# Ok, now let's get the next 3
cursor.fetchmany(3)

In [None]:
# .... and the rest of them
cursor.fetchall()

We can use a loop to iterate through all the results.

In [None]:
# The cursor is exhausted after fetchall(), so this returns None
cursor.fetchone()

In [None]:
query_midwest = """
SELECT * FROM region WHERE region='Midwest'
"""

# Reuse our cursor to point at the first result.

cursor.execute(query_midwest)

# Each iteration will call 'fetchone' on the cursor
for result in cursor:
    message = "State {} is in the best region in the US".format(result[0])
    print(message)

Hey, wait a second ..... what is Rhode Island doing in there? Yesterday we moved RI to the Midwest. Let's fix that using an UPDATE.


In [None]:
query_fix_RI = """
UPDATE region SET region='New_England' WHERE state='RI'
"""

cursor.execute(query_fix_RI)

In [None]:
#  Let's check the result
cursor.execute(query_midwest)

for result in cursor:
    message = "State {} is in the best region in the US".format(result[0])
    print(message)

Yay! We moved Rhode Island back to its correct location!


....or did we? Let's jump to `psql` in our AWS terminal. At the terminal, start `psql`:
```bash
ubuntu@ip-xxx.xxx.xxx.xxx: psql
```

Then in `psql`, run the following commands
```sql
postgres=# \connect names;
names=# SELECT * FROM region WHERE state='RI';
```
The result that I get is 

| state | region |
|---|---|
| RI    | Midwest|

Let's check in Python again:

In [None]:
cursor.execute("SELECT * FROM region WHERE state='RI'")
cursor.fetchall()

Wat!!?!?!?!?!

<img width='300px' src='./images/vader_nonsense.jpg'/>

## Commits and rollbacks

When we execute a command in `psql`, it alters the underlying data immediately. __Most__ of the time, when executing commands from Python, it will make the changes off to the side, and wait for you to __commit__ those changes. If you have decided that you made a mistake, you can decide to __rollback__ those changes.

This is similar to `git` - the "master" copy on the database only changes when you "commit" your change. 

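There are a few things that you __cannot__ rewind with a rollback, because PostgreSQL refuses to run them inside a transaction at all:
- Creating databases
- Dropping databases

Making or dropping tables and views is different: in PostgreSQL these are part of the transaction too, so they wait for a commit and can be rolled back like any other change. Don't rely on this elsewhere, though. Some other databases (MySQL, for example) commit these statements immediately.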

Note that if you are doing `SELECT`s, you are not changing the underlying data, so you don't care about committing or not.
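
The "changes off to the side" behaviour can be reproduced in miniature. The sketch below is an illustration under stated assumptions, not our actual setup: it uses Python's built-in `sqlite3` module and a temporary file instead of PostgreSQL, with one connection (`writer`) playing the role of the notebook and a second one (`reader`) playing the role of `psql`.

```python
import os
import sqlite3
import tempfile

# A throwaway database file, so two separate connections can share it
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(db_path)   # plays the role of the notebook
reader = sqlite3.connect(db_path)   # plays the role of psql

writer.execute("CREATE TABLE region (state TEXT, region TEXT)")
writer.execute("INSERT INTO region VALUES ('RI', 'Midwest')")
writer.commit()

def region_of_ri(conn):
    """Return the region currently recorded for Rhode Island."""
    query = "SELECT region FROM region WHERE state = 'RI'"
    return conn.execute(query).fetchall()[0][0]

# Change the row, but do not commit yet
writer.execute("UPDATE region SET region = 'New_England' WHERE state = 'RI'")
seen_by_writer = region_of_ri(writer)   # the writer sees its own change
seen_by_reader = region_of_ri(reader)   # the other connection does not

writer.commit()
seen_after_commit = region_of_ri(reader)  # now everyone sees it

print(seen_by_writer, seen_by_reader, seen_after_commit)

writer.close()
reader.close()
```

The connection that made the change sees it right away, while everybody else only sees it after the commit. That is the same mismatch we just ran into between Python and `psql`.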

In [None]:
# We are really sure that Rhode Island belongs in New England. Let's commit.

cursor.execute('commit;')

Now go back to `psql` (on AWS) and see if it changed:
```sql
names=# SELECT * FROM region WHERE state='RI';
```

You should get the following:

| state |   region    |
|-------|-------------|
| RI    | New_England |

Success!!

### Warning!

Once you have committed through the cursor like this, it will keep auto-committing every statement that follows. This is annoying.

After doing a commit, you can start a new transaction explicitly using a "BEGIN" command. Then you can decide to commit or rollback that transaction.

In [None]:
# Let's make a mistake, but since we have committed already we should
# start a transaction explicitly.
cursor.execute('BEGIN;')

il_in_wrong_place = """
  UPDATE region SET region='South' WHERE state='IL'
"""

# put Illinois in the wrong region
cursor.execute(il_in_wrong_place)

# check to see what we have done
cursor.execute("SELECT * FROM region WHERE state='IL'")
cursor.fetchall()

**Question**: What will we get when we run
```sql
SELECT * FROM region WHERE state='IL';
```
on PostgreSQL? Try it!

In [None]:
cursor.execute('rollback;')

In [None]:
# check to see what we have done
cursor.execute("SELECT * FROM region WHERE state='IL'")
cursor.fetchall()

### Sad cursors

Typos happen. It's part of life. If we make a mistake in `psql`, then PostgreSQL yells at us, but we can try again and get on with our lives. Cursors in Python, on the other hand, freak out, and we have to explicitly roll them back.

In [None]:
cursor.execute('BEGIN;')

query_error = """
SELECT * FROM name_typo_freq WHERE name='Alice' AND state='IL';
"""

query_correct = """
SELECT * FROM name_freq WHERE name='Alice' AND state='IL';
"""

# Let's run the bad query:
cursor.execute(query_error)

In [None]:
# Ok, it correctly figured out my table is wrong. No problem!
# We know what the correct query is, so let's execute that instead
cursor.execute(query_correct)

In [None]:
# We have a sad cursor! We need to fix it with a rollback:
cursor.execute("rollback;")

In [None]:
cursor.execute(query_correct)

In [None]:
# Success! Let's look at the top 5 results
cursor.fetchmany(5)
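
The recovery step above can be wrapped in a small helper so that a typo never leaves the cursor sad. This is a minimal sketch: `fetch_or_rollback` is a hypothetical name, and the demo uses Python's built-in `sqlite3` with made-up data so that it runs without a server. With psycopg2 you would pass the real `connection` and `pg.Error` instead.

```python
import sqlite3

def fetch_or_rollback(connection, query, error_type=Exception):
    """Run `query` and return all rows; on an error, roll back and return None."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        return cursor.fetchall()
    except error_type as err:
        print("Query failed, rolling back:", err)
        connection.rollback()   # un-sticks the connection for the next query
        return None

# Stand-in database with made-up data (NOT the real names database)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE name_freq (name TEXT, state TEXT, freq INTEGER)")
conn.execute("INSERT INTO name_freq VALUES ('Alice', 'IL', 42)")
conn.commit()

bad = fetch_or_rollback(conn, "SELECT * FROM name_typo_freq", sqlite3.Error)
good = fetch_or_rollback(
    conn, "SELECT * FROM name_freq WHERE name='Alice'", sqlite3.Error
)
print(bad, good)
```

SQLite does not get stuck after a failed statement the way PostgreSQL does, so the rollback is harmless here. Against PostgreSQL it is the step that lets the next query run.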

## SQL practice


We still have the SQL exercises from last time to work through. 

1. Gentle start with SELECTs:
  Find the total number of children born in the US between 2000 and 2010, i.e. you should have one number.
2. Find the number of children born in the US between 2000 and 2010, broken down by region, i.e. you should have 6 numbers.
3. Find the number of children born for each year between 2000 and 2010, in each region (i.e. you should have 60 numbers returned). Put these in a dataframe. We can try and plot them (a line graph like the one for Alice, but one line per region).
4. Okay, let's clean up the data in a way that doesn't use a SELECT. We have a missing state! Find which state is missing.
  Hint #1: You can use `SELECT DISTINCT(state) FROM name_freq` to get a list of all 50 states + DC.
  Hint #2: You should take `region` and JOIN the list above on state. You should choose a LEFT or RIGHT JOIN so you keep the nulls.
  Hint #3: In SQL, trying `variable=null` doesn't work (null is not equal to itself, to help us not match missing data to other missing data). When looking for null values, use `WHERE variable IS null` instead.
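
Hint #3 is easy to check on a toy table. The sketch below assumes a made-up `pets` table that has nothing to do with the exercise data, and uses Python's built-in `sqlite3` so it runs anywhere. PostgreSQL treats the two comparisons the same way.

```python
import sqlite3

# Toy table with some missing owners (made-up data)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pets (name TEXT, owner TEXT);
    INSERT INTO pets VALUES ('Rex', 'Ann'), ('Tom', NULL), ('Kit', NULL);
""")

# Comparing with "=" never matches a null, not even another null
equals_null = conn.execute(
    "SELECT name FROM pets WHERE owner = NULL"
).fetchall()

# IS NULL is the test that actually finds the missing values
is_null = conn.execute(
    "SELECT name FROM pets WHERE owner IS NULL ORDER BY name"
).fetchall()

print(equals_null)
print(is_null)
```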