# Data, Databases, SQL, and interacting with databases using Python

For this exercise, we will be using data contained in the "homework" database on the Big Data for Social Science Class Server. This notebook will walk you through accessing the class homework data using IPython Notebook and help to familiarize you with the available class data.

## Table of Contents

- [Database Tables](#Database-Tables)

    - [StarMetrics Database](#StarMetrics-Database)
    - [UMETRICS Grants Database](#UMETRICS-Grants-Database)
    - [USPTO Patents](#USPTO-Patents)

- [Databases and Python](#Databases-and-Python)

    - [Making and using a database connection](#Making-and-using-a-database-connection)
    - [Troubleshooting problems with connections and cursors](#Troubleshooting-problems-with-connections-and-cursors)
    - [Querying the database](#Querying-the-database)
    
        - [Executing an SQL statement](#Executing-an-SQL-statement)
        - [Fetch column names from result row](#Fetch-column-names-from-result-row)
        - [Fetch a certain number of rows](#Fetch-a-certain-number-of-rows)
        - [Processing all rows in a large table](#Processing-all-rows-in-a-large-table)
        - [Processing a large table one row at a time](#Processing-a-large-table-one-row-at-a-time)
    
    - [Close what you open](#Close-what-you-open)
    
- [SQL](#SQL)

    - [Querying the database](#Querying-the-database)
    - [**Exercise 1**](#Exercise-1)
    - [WHERE clauses: Limiting the results](#WHERE-clauses:-Limiting-the-results)
    - [**Exercise 2**](#Exercise-2)
    - [JOIN: Connecting multiple tables](#JOIN:-Connecting-multiple-tables)
    - [GROUP BY and aggregate functions](#GROUP-BY-and-aggregate-functions)
    - [**Exercise 3**](#Exercise-3)
    - [ORDER BY](#ORDER-BY)
    - [Modifying the database](#Modifying-the-database)

- [Addendum - Different ways to make a cursor](#Addendum---Different-ways-to-make-a-cursor)
- [References](#References)

## Database Tables

- Back to the [Table of Contents](#Table-of-Contents)

For these exercises we will use tables in the "homework" database. These tables were created using data from the broader starmetrics, umetricsgrants and usptopatents databases. Each of these databases contains different types of information and is available for your use during this class.

You also have a personal database where you can create and modify tables as you wish, to support your work.  Your database has the same name as your username.

### Tables covered in class

Quick description of the data available: 

#### StarMetrics Database

- Back to the [Table of Contents](#Table-of-Contents)

The starmetrics database contains transactional data from universities that describe expenditures on federal research grants. The data includes four different types of expenditures: 

- 1) employee expenditures - this describes the people by occupation who charged time to federal grants.
- 2) vendor expenditures - this describes the businesses from which goods were bought using federal grant funds.
- 3) subaward expenditures - this describes the universities and other institutions that are paid to collaborate on federal grants.
- 4) award expenditures - this describes the overhead that is associated with each federal grant.

#### UMETRICS Grants Database 

- Back to the [Table of Contents](#Table-of-Contents)

The umetricsgrants database contains public data that describes NIH, NSF, USDA & NASA federal awards. This database was created by combining several small databases to capture all the grant data in one place. The structure of the database tables differs depending on the source of the data.

#### USPTO Patents

- Back to the [Table of Contents](#Table-of-Contents)

The USPTOPatents database contains public patent and inventor data. These data include all patents, inventors, assignees and their associated metadata on location, patent classes, etc. 

### Tables for this homework assignment

For this assignment, we will be connecting to the "homework" database and looking at the OSU_vendor and OSU_grant tables.  Basic information on these tables follows.  For more detailed information on the contents of each, see the Database Assignment Data Dictionary Word document in Moodle [http://jpsmonline.umd.edu/mod/resource/view.php?id=2436](http://jpsmonline.umd.edu/mod/resource/view.php?id=2436).

#### OSU_vendor

Ohio State University internal transaction data that describes the expenditures on federal grants. Note that this information does not tell you what goods/services were bought, but tells you who was paid and how much was spent. 

#### OSU_grant

Ohio State University public grant information from the NSF and NIH websites that describes the funded grant projects.

## Databases and Python

- Back to the [Table of Contents](#Table-of-Contents)

### Making and using a database connection

- Back to the [Table of Contents](#Table-of-Contents)

Python lets you interact with databases using SQL just like you would in any SQL GUI or terminal. Python code can run SELECTs, CREATEs, INSERTs, UPDATEs, DELETEs, and any other SQL, and the results are returned in a format that lets you work with them after the SQL statements finish.

To interact with a database using Python, first you have to connect to the database.

To create a database connection, you first must import that database's DB-API implementation, then you call the connect() function, passing it information on where to find the database to which you are trying to connect.

_Note: in most cases, it is best to place the values you pass to a function or method in variables:_

    # declare variables
    user = "username"
    password = "password"
    database = "homework"

    # invoke the connect() function, passing parameters in variables.
    db = MySQLdb.connect( user = user, passwd = password, db = database )

_rather than placing the values directly in the arguments in a function or method call:_

    # invoke the connect() function, passing values directly to function.
    db = MySQLdb.connect( user = "username", passwd = "password", db = "homework" )

_This allows you to change how you populate the values (looking up a value from a command line parameter, for example, or the result of a database query) without also changing the line where the function is invoked.  If you separate setting values and using values, you make it easier to isolate problems with **how you set the values** from problems with **how you use the values**._


In [None]:
# imports
import MySQLdb

# declare variables - 
user = "<username>"
password = "<password>"
database = "homework"

# invoke the connect() function, passing parameters in variables.
db = MySQLdb.connect( user = user, passwd = password, db = database )

# output basic database connection info.
print( db )

Next, you use the connection to create a cursor. A cursor takes an SQL statement written as a string in Python and passes it to the database, where it is executed. It then takes the results, converts them to a format you can work with in Python, and returns them to you.

To make a cursor, call the cursor() method on the connection object instance returned by the call to connect.

In [None]:
# assumes you have already opened the database using the connect() function.  If not:
#db = MySQLdb.connect( user = user, passwd = password, db = database )

# create mysql cursor that maps column names to values in the query result.
cursor = db.cursor( MySQLdb.cursors.DictCursor )

# output basic database cursor info.
print( cursor )

### Troubleshooting problems with connections and cursors

- Back to the [Table of Contents](#Table-of-Contents)

Throughout the following examples of querying the database, we will be re-using the connection and cursor created above.  If you want to run the code samples below, make sure that you:

- enter your credentials in the `user` and `password` variables (same credentials as for Jupyter).
- run the two code cells above, so you have a connection and a cursor.

This connection also might time out if it is idle for more than a few minutes.  If this happens, you'll see an exception and a stack trace with the message `OperationalError: (2006, 'MySQL server has gone away')` at the end.  Example:

    ---------------------------------------------------------------------------
    OperationalError                          Traceback (most recent call last)
    <ipython-input-3-3649a7756a8b> in <module>()
         10 # select
         11 sql_select = "SELECT COUNT( * ) AS 'transaction_count' FROM OSU_vendor"
    ---> 12 row_count = cursor.execute( sql_select )
         13 
         14 # get row and 'transaction_count'

    /usr/local/lib/python2.7/dist-packages/MySQLdb/cursors.pyc in execute(self, query, args)
        203             del tb
        204             self.messages.append((exc, value))
    --> 205             self.errorhandler(self, exc, value)
        206         self._executed = query
        207         if not self._defer_warnings: self._warning_check()

    /usr/local/lib/python2.7/dist-packages/MySQLdb/connections.pyc in defaulterrorhandler(***failed resolving arguments***)
         34     del cursor
         35     del connection
    ---> 36     raise errorclass, errorvalue
         37 
         38 re_numeric_part = re.compile(r"^(\d+)")

    OperationalError: (2006, 'MySQL server has gone away')
    
When you encounter this error, it means that your connection has timed out.  To resolve it, re-run the code cells above that create a connection and a cursor.  Once you have re-connected, go back to the cell you were trying to run and run it again; it should be able to access the database.
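
The reconnect-and-retry step can also be wrapped in a small helper. The following is only a minimal sketch under stated assumptions: `make_connection()` and `execute_with_retry()` are hypothetical names invented here, and Python's built-in `sqlite3` module stands in for the class MySQL server so the sketch runs anywhere. With MySQLdb, `make_connection()` would call `MySQLdb.connect()` as shown above, and the error to retry on would be `MySQLdb.OperationalError`.

```python
import sqlite3

def make_connection():
    # ASSUMPTION: in-memory SQLite stands in for the class MySQL server.
    # With MySQLdb, this would call MySQLdb.connect() with your credentials.
    return sqlite3.connect( ":memory:" )

def execute_with_retry( db, sql_string, retry_errors ):
    """Hypothetical helper: run sql_string; if the connection is dead,
    reconnect once and try again.  Returns ( connection, rows ) so the
    caller can keep using the live connection."""
    try:
        cursor = db.cursor()
        cursor.execute( sql_string )
    except retry_errors:
        # connection is unusable, so make a fresh one and run the query again.
        db = make_connection()
        cursor = db.cursor()
        cursor.execute( sql_string )
    rows = cursor.fetchall()
    cursor.close()
    return db, rows

# simulate a timed-out connection by closing it before we use it.
demo_db = make_connection()
demo_db.close()

# sqlite3 raises ProgrammingError when a closed connection is used.
demo_db, rows = execute_with_retry( demo_db, "SELECT 1 + 1", ( sqlite3.ProgrammingError, ) )
print( rows )
demo_db.close()
```

Returning the connection along with the rows keeps the caller from holding on to the dead connection after a retry.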

### Querying the database

- Back to the [Table of Contents](#Table-of-Contents)

#### Executing an SQL statement

- Back to the [Table of Contents](#Table-of-Contents)

To execute SQL, pass a string that contains an SQL statement to the cursor's execute() method, then call the fetchall() method to pull the rows returned by the query into a list you can loop over.

For example: 

In [None]:
# use cells at top to connect or re-connect to database and make cursor if needed

# declare variables
sql_select = ""
result_count = -1
query_results = None
current_row = None

# run a simple select statement against our homework data.
#    When executing a SELECT, call to "execute" returns the
#    number of rows returned by the SELECT.
sql_select = "SELECT * FROM OSU_vendor LIMIT 10;"
result_count = cursor.execute( sql_select )

# how many rows returned?
print( "Found " + str( result_count ) + " rows" )

# get list of rows returned, so we can loop over them.
query_results = cursor.fetchall()

# loop over the results
for current_row in query_results:
        
    print( "==> " + current_row[ "university" ] + " - " + current_row[ "agency_text" ] + " - $" + str( current_row[ "paymentamount" ] ) + " - " + current_row[ "institutionid" ] )

#-- END loop over results --#

<hr />

#### Fetch column names from result row

- Back to the [Table of Contents](#Table-of-Contents)

You can retrieve the names of the columns present in a given query's results by grabbing the names (or keys) out of a given row's dictionary:

In [None]:
# use cells at top to connect or re-connect to database and make cursor if needed

# declare variables
sql_select = ""
single_row = None
column_name_list = []

# select
sql_select = "SELECT * FROM OSU_vendor"
cursor.execute( sql_select )

# get a row, then get its keys
single_row = cursor.fetchone()
column_name_list = list( single_row.keys() )

# what are my column names?
print( column_name_list )

<hr />

#### Fetch a certain number of rows

- Back to the [Table of Contents](#Table-of-Contents)

In the above example, we use `cursor.fetchone()` to retrieve a single row from the results of the query.  To retrieve one or more rows, but not necessarily all, use the `cursor.fetchmany()` method, passing it a parameter named `size` that contains the number of rows you want returned.

Example:

In [None]:
# use cells at top to connect or re-connect to database and make cursor if needed

# declare variables
row_count = -1
desired_row_count = -1
result_list = []

# select
sql_select = "SELECT * FROM OSU_vendor"
row_count = cursor.execute( sql_select )

# total returned?
print( "Total rows = " + str( row_count ) )

# that is a lot!  Just get 10...
desired_row_count = 10
result_list = cursor.fetchmany( size = desired_row_count )

# loop over the results
for current_row in result_list:
        
    print( "==> " + current_row[ "university" ] + " - " + current_row[ "agency_text" ] + " - $" + str( current_row[ "paymentamount" ] ) + " - " + current_row[ "institutionid" ] )

#-- END loop over results --#

<hr />

#### Processing all rows in a large table

- Back to the [Table of Contents](#Table-of-Contents)

To pull the full table into memory, use `cursor.fetchall()`.

Be careful: this can take a while depending on how big the table is, and if your table is truly "big", it might not be possible at all if your computer doesn't have enough memory!

To see what we are getting into, we will use the SQL `COUNT()` function to see how many rows are in the table.  In order to access the `COUNT()` value in our dictionary result format, we'll assign this count a name using the `AS` SQL keyword.

Example:

In [None]:
# use cells at top to connect or re-connect to database and make cursor if needed

# declare variables
row_count = -1
single_row = -1
transaction_count = -1  # each row in OSU_vendor is a transaction!

# select
sql_select = "SELECT COUNT( * ) AS 'transaction_count' FROM OSU_vendor"
row_count = cursor.execute( sql_select )

# get row and 'transaction_count'
single_row = cursor.fetchone()
transaction_count = single_row[ 'transaction_count' ]
print( "transaction_count = " + str( transaction_count ) )

That's kind of a lot.  But we should see what happens...  Load 'em anyway!

In [None]:
# use cells at top to connect or re-connect to database and make cursor if needed

# declare variables
row_count = -1
single_row = -1
vendor_count = -1

# select
sql_select = "SELECT * FROM OSU_vendor"
row_count = cursor.execute( sql_select )

# default output
cursor.fetchall()

Jupyter eventually cuts off this statement, saving us from ourselves.

<hr />

#### Processing a large table one row at a time

- Back to the [Table of Contents](#Table-of-Contents)

If your data is too big to load into memory all at once, you can read it in one row at a time, as in the example below.  However, if you print every row in a Jupyter notebook, you'll likely kill your browser.  In the previous example, Jupyter detected that `cursor.fetchall()` was returning a lot of data and truncated that output before it could overwhelm the browser.

_The example below pulls rows in one at a time, so Jupyter won't know we need help.  70,000 rows might not sound like a lot, but it takes up enough space in browser memory to kill a browser.  I've hard-coded the number of rows below to 10.  Run it with all the rows at your own risk._

In [None]:
# use cells at top to connect or re-connect to database and make cursor if needed

# declare variables
row_count = -1
single_row = -1
row_counter = -1

# select
sql_select = "SELECT * FROM OSU_vendor"
cursor.execute( sql_select )

# you can also get rowcount from cursor
row_count = cursor.rowcount

# if we actually process all rows this way, it will kill browser.
#    Just do 10, to show how it works.
row_count = 10

# Use for loop to loop row_count times, fetching one row
#    each time through the loop.
for row_counter in range( row_count ):
        
    single_row = cursor.fetchone()
    print( "==> " + str( row_counter) + " - " + str( single_row ) )

#-- END loop over rows, fetching one at a time --#
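
A middle ground between `fetchall()` and `fetchone()` is to pull rows in fixed-size batches with `fetchmany()` and keep only a running summary rather than printing every row. The sketch below is illustrative only: it assumes an in-memory SQLite database (Python's built-in `sqlite3`) as a stand-in for the class MySQL server, and the table contents are invented sample data. The loop itself uses only the standard cursor methods described above.

```python
import sqlite3

# ASSUMPTION: in-memory SQLite stand-in with invented sample rows.
demo_db = sqlite3.connect( ":memory:" )
demo_cursor = demo_db.cursor()
demo_cursor.execute( "CREATE TABLE OSU_vendor ( uniqueawardnumber TEXT, paymentamount REAL )" )
sample_rows = [ ( "award-" + str( i % 5 ), float( i ) ) for i in range( 1, 26 ) ]
demo_cursor.executemany( "INSERT INTO OSU_vendor VALUES ( ?, ? )", sample_rows )

# select everything, but only hold one batch in memory at a time.
demo_cursor.execute( "SELECT uniqueawardnumber, paymentamount FROM OSU_vendor" )

batch_size = 10
rows_seen = 0
payment_total = 0.0

while True:

    # fetchmany() returns an empty list once the results are exhausted.
    batch = demo_cursor.fetchmany( batch_size )
    if not batch:
        break

    for award_number, payment in batch:
        rows_seen += 1
        payment_total += payment

#-- END loop over batches --#

print( "rows seen = " + str( rows_seen ) + "; payment total = " + str( payment_total ) )

demo_cursor.close()
demo_db.close()
```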

<hr />

### Close what you open

- Back to the [Table of Contents](#Table-of-Contents)

Once you are done working with a database, you need to remember to always close any cursors you opened, and then close the database connection itself.

If you don't close what you open, you run the risk of consuming all of a database's available connections (there is always a limit to the number of concurrent connections you can have to a database) and locking yourself and others out of the database entirely until the connections time out.

To close a cursor or the database connection, call the `close()` method on the object.

Example:

In [None]:
# close cursor
cursor.close()
    
# close connection
db.close()
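
One way to make sure `close()` is always called, even if a query raises an exception partway through, is `contextlib.closing` from the Python standard library, which calls `close()` on an object when its `with` block ends. This is a minimal sketch, assuming an in-memory SQLite database (`sqlite3`) as a stand-in for the class MySQL server; the same pattern should apply to any object that has a `close()` method.

```python
import sqlite3
from contextlib import closing

# ASSUMPTION: in-memory SQLite stands in for the class MySQL server.
with closing( sqlite3.connect( ":memory:" ) ) as demo_db:

    with closing( demo_db.cursor() ) as demo_cursor:

        demo_cursor.execute( "SELECT 'still open'" )
        status = demo_cursor.fetchone()[ 0 ]

    #-- END with cursor; cursor is closed here --#

#-- END with connection; connection is closed here --#

# using the closed connection now raises an error.
try:
    demo_db.execute( "SELECT 1" )
    closed_ok = False
except sqlite3.ProgrammingError:
    closed_ok = True

print( status + "; closed afterwards = " + str( closed_ok ) )
```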

To test your open_connection() function, first run the cell above that contains its function definition, so the function is created in memory and ready to use.

After the function definition has been run, use the code below to make a database connection using your function, create a cursor, then run a query.

## SQL

- Back to the [Table of Contents](#Table-of-Contents)

Now that we know how to make and use a connection to a database using IPython, we can begin to master some SQL basics to help you get started with understanding the data and databases available. 

SQL is a quirky language. It is different from procedural languages like Python and is designed for a very specific purpose: to interact with relational data. It isn't structured like other languages, and while it can make data access easy, it also can make tasks that would be easy in other languages (though perhaps not exceptionally performant) confoundingly complex.  Let's dive in so you can see it for yourself!

### Querying the database

- Back to the [Table of Contents](#Table-of-Contents)

The basic method of querying the database is to use a select statement:

    SELECT *
    FROM OSU_vendor; 

Where:

- Columns or variables that you would like returned are put in the SELECT clause (after the word "SELECT" but before the word "FROM").  An asterisk ( "\*" ) is a wildcard - it will return all columns for a given table.
- The name of the table (or names of the tables - more on this in a bit) you want to query is put after the word "FROM", in the FROM clause.
- It is considered good style to capitalize words in an SQL query that are SQL keywords rather than column names, table names, or values you are filtering on or searching for, i.e., SELECT, FROM, WHERE, etc.
- Although MySQL does not always require it, you should end SQL statements with a semi-colon.  Some contexts do require it, so it is better to get into the habit.

Instead of specifying “all” columns ( "\*" ), you can specify which columns you want by name, in a comma-delimited list after "SELECT":

    SELECT uniqueawardnumber, fipscode, paymentamount
    FROM OSU_vendor;

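You can specify calculations in the list of columns also.  For example, MySQL's DATEDIFF() function returns the number of days between two dates (subtracting one DATE column from another with "-" compares them as numbers, not as dates, so it does not reliably give a day count):

    SELECT uniqueawardnumber, ( DATEDIFF( periodenddate, periodstartdate ) + 1 )
    FROM OSU_vendor;

And you can give those new columns names:

    SELECT uniqueawardnumber, ( DATEDIFF( periodenddate, periodstartdate ) + 1 ) AS num_days
    FROM OSU_vendor;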
    
You can also use special keywords and functions in the SELECT clause.  For example, the keyword "DISTINCT", which only returns any given value in a given column once:

    SELECT DISTINCT uniqueawardnumber
    FROM OSU_vendor;
    
And "COUNT()", which returns a count of matching rows rather than a list:
    
    SELECT COUNT( DISTINCT uniqueawardnumber )
    FROM OSU_vendor;
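
To try these SELECT variations without touching the class server, here is a small runnable sketch. It assumes an in-memory SQLite database (Python's built-in `sqlite3`) in place of MySQL, and the rows in its `OSU_vendor` table are invented sample values, not real class data.

```python
import sqlite3

# ASSUMPTION: in-memory SQLite stand-in with invented sample rows.
demo_db = sqlite3.connect( ":memory:" )
demo_cursor = demo_db.cursor()
demo_cursor.execute( "CREATE TABLE OSU_vendor ( uniqueawardnumber TEXT, fipscode TEXT, paymentamount REAL )" )
demo_cursor.executemany(
    "INSERT INTO OSU_vendor VALUES ( ?, ?, ? )",
    [
        ( "A-001", "39049", 100.0 ),
        ( "A-001", "39049", 250.0 ),
        ( "A-002", "39041", 75.5 ),
        ( "A-003", "39049", 20.0 ),
    ]
)

# DISTINCT returns each award number once, even though A-001 has two rows.
demo_cursor.execute( "SELECT DISTINCT uniqueawardnumber FROM OSU_vendor ORDER BY uniqueawardnumber;" )
distinct_awards = [ row[ 0 ] for row in demo_cursor.fetchall() ]

# COUNT( DISTINCT ... ) counts those unique values instead of listing them.
demo_cursor.execute( "SELECT COUNT( DISTINCT uniqueawardnumber ) FROM OSU_vendor;" )
award_count = demo_cursor.fetchone()[ 0 ]

# COUNT( * ) counts every row in the table.
demo_cursor.execute( "SELECT COUNT( * ) FROM OSU_vendor;" )
row_count = demo_cursor.fetchone()[ 0 ]

print( distinct_awards )
print( "distinct awards = " + str( award_count ) + "; total rows = " + str( row_count ) )

demo_cursor.close()
demo_db.close()
```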

### Exercise 1

- Back to the [Table of Contents](#Table-of-Contents)

Use the code block below to interact with the database to answer the questions that follow.  Re-use the connection (`db`) and `cursor` you opened at the top of the notebook.

For each question, enter:

- The answer to the question.
- The SQL query you used to find the answer.

Questions:

- 1) Find the number of distinct vendors in the OSU_vendor database table.
- 2) Find the number of distinct topics in the OSU_grant database table.

Example code:

    # use cells at top to connect or re-connect to database and make cursor if needed

    # declare variables
    select_string = ""
    count_value = -1

    # Query template
    select_string = "SELECT COUNT( DISTINCT fipscode ) AS fipscode_count FROM OSU_vendor;"
    cursor.execute( select_string )
    row = cursor.fetchone()
    count_value = row[ "fipscode_count" ]

    print( "Answer = " + str( count_value ) + "; select SQL = " + select_string )

#### Exercise 1 work space

In [None]:
# use cells at top to connect or re-connect to database and make cursor if needed

# declare variables
select_string = ""
count_value = -1

# Query template
select_string = "SELECT COUNT( DISTINCT fipscode ) AS fipscode_count FROM OSU_vendor;"
cursor.execute( select_string )
row = cursor.fetchone()
count_value = row[ "fipscode_count" ]

print( "Answer = " + str( count_value ) + "; select SQL = " + select_string )

#### Question 1 - Answer

#### Question 1 - SQL

#### Question 2 - Answer

#### Question 2 - SQL

### WHERE clauses: Limiting the results

- Back to the [Table of Contents](#Table-of-Contents)

In a SELECT query, you can add a WHERE clause to limit the results:

    SELECT *
    FROM OSU_vendor
    WHERE periodstartdate = '2014-06-30';

Where:

- you are making conditional tests, just like in a Python "if" statement.
- EXCEPT here, instead of "==" being the equality operator, it is just "=".
- Comparison operators:

    - "**_`=`_**" - equal to
    - "**_`!=`_**" or "**_`<>`_**" - not equal to
    - "**_`<`_**" - less than
    - "**_`<=`_**" - less-than-or-equal-to
    - "**_`>`_**" - greater than
    - "**_`>=`_**" - greater-than-or-equal-to
    - "**_`LIKE`_**" and "**_`NOT LIKE`_**" - wild-card matching operator, where percent matches 0 or more characters ( "%" ) and an underscore matches any 1 character ( "_" ).
    - "**_`IN( value_list )`_**" and "**_`NOT IN( value_list )`_**" - checks whether the value to the left of the "IN", usually a column's value in a given row, is either IN or NOT IN the list on the right of the IN.
    
An example of using LIKE:

    SELECT *
    FROM OSU_vendor
    WHERE uniqueawardnumber LIKE '%EY022601%';

You can specify multiple conditions for matching in your WHERE clauses, as well, to more precisely filter the results of your query:

    SELECT *
    FROM OSU_vendor
    WHERE periodstartdate = '2014-06-30' AND agency_abbrev = 'NSF';
    
Note:

- when you are matching a column whose type is numeric, you just put the value in the query, with no quotation marks (just like in Python).
- when you are filtering a string column, you have to enclose the value you are looking for (the value on the right-hand side of the equal sign) in single-quotes. They should be single-quotes, too.  Unlike in Python, double-quotes have a different meaning than single-quotes in standard SQL (they mark identifiers such as column names), and can cause your query to fail.

Like "None" in Python, the signifier of an unset value in a column for a row is special - NULL.  To check for NULL, you use "IS NULL" or "IS NOT NULL", rather than the "=" or "!=".

    /* find missing values */
    SELECT *
    FROM OSU_vendor
    WHERE institutionid IS NULL;

You can also explicitly cut off the number of results your query returns using the LIMIT keyword.  Just LIMITing to 10 only returns the first 10 results for the query:

    SELECT *
    FROM OSU_vendor
    WHERE periodstartdate = '2014-06-30' AND agency_abbrev = 'NSF'
    LIMIT 10;
    
You can also use LIMIT to skip to the middle of the results by giving it two numbers, separated by a comma.  The first number is the number of records you want to skip, the second number is how many records you want to include after you skip:

    /* skip 10, then output 15 */
    SELECT *
    FROM OSU_vendor
    WHERE periodstartdate = '2014-06-30' AND agency_abbrev = 'NSF'
    LIMIT 10, 15;
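
The WHERE, LIKE, IS NULL and LIMIT behaviors described above can be checked with a small runnable sketch. It assumes an in-memory SQLite database (`sqlite3`) instead of the class MySQL server; the award numbers and institution IDs are invented sample values, and `fetch_award_numbers()` is a hypothetical helper defined only for this illustration. An ORDER BY is included so that the LIMIT results are predictable.

```python
import sqlite3

# ASSUMPTION: in-memory SQLite stand-in with invented sample rows.
demo_db = sqlite3.connect( ":memory:" )
demo_cursor = demo_db.cursor()
demo_cursor.execute(
    "CREATE TABLE OSU_vendor ( uniqueawardnumber TEXT, periodstartdate TEXT,"
    " agency_abbrev TEXT, institutionid TEXT )"
)
demo_cursor.executemany(
    "INSERT INTO OSU_vendor VALUES ( ?, ?, ?, ? )",
    [
        ( "1R01EY022601-01", "2014-06-30", "NIH", "inst-1" ),
        ( "IIS-1115005", "2014-06-30", "NSF", "inst-2" ),
        ( "IIS-1115005", "2014-03-31", "NSF", None ),
        ( "3R01EY022601-02S1", "2014-03-31", "NIH", "inst-1" ),
        ( "CHE-0000001", "2014-06-30", "NSF", "inst-3" ),
    ]
)

def fetch_award_numbers( sql_string ):
    """Hypothetical helper: run a SELECT, return first column of each row."""
    demo_cursor.execute( sql_string )
    return [ row[ 0 ] for row in demo_cursor.fetchall() ]

select_prefix = "SELECT uniqueawardnumber FROM OSU_vendor "

# LIKE with % wildcards on both sides matches the text anywhere in the value.
like_matches = fetch_award_numbers(
    select_prefix + "WHERE uniqueawardnumber LIKE '%EY022601%' ORDER BY uniqueawardnumber;" )

# two conditions combined with AND.
nsf_june = fetch_award_numbers(
    select_prefix + "WHERE periodstartdate = '2014-06-30' AND agency_abbrev = 'NSF'"
    " ORDER BY uniqueawardnumber;" )

# IS NULL finds the missing value; "= NULL" never matches anything.
null_matches = fetch_award_numbers( select_prefix + "WHERE institutionid IS NULL;" )
equals_null = fetch_award_numbers( select_prefix + "WHERE institutionid = NULL;" )

# LIMIT with two numbers: skip 1 row, then return the next 2.
page = fetch_award_numbers( select_prefix + "ORDER BY uniqueawardnumber LIMIT 1, 2;" )

print( like_matches )
print( nsf_june )
print( null_matches )
print( equals_null )
print( page )

demo_cursor.close()
demo_db.close()
```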

### Exercise 2

- Back to the [Table of Contents](#Table-of-Contents)

Use the code block below to interact with the database to answer the questions that follow.  Re-use the connection (`db`) and `cursor` you opened at the top of the notebook.

For each question, enter:

- The answer to the question.
- The SQL query you used to find the answer.

Questions:

- 3) Using any row in the OSU_grant table that is assigned topic "45", what is the text description of topic "45"?
- 4) Using the row in the OSU_grant table that refers to topic ID "45" and award "1115005", what is the "fit" of topic "45" to award number "1115005"?  _Note: "fit" refers to the "proportion" column, which describes how well a machine learning algorithm calculated that the topic relates to the grant. Proportion is a value between 0 and 1.  The higher the proportion, the better the "fit".  To answer, state the proportion of the matching row._

Example code:

    # use cells at top to connect or re-connect to database and make cursor if needed

    # declare variables
    select_string = ""
    column_value = -1

    # Query template
    select_string = "SELECT * FROM OSU_vendor WHERE program_title = 'Amazing program!';"
    cursor.execute( select_string )
    row = cursor.fetchone()
    column_value = row[ "university" ]

    print( "Answer = " + str( column_value ) + "; select SQL = " + select_string )

#### Exercise 2 work space

In [None]:
# use cells at top to connect or re-connect to database and make cursor if needed

# declare variables
select_string = ""
column_value = -1

# Query template
select_string = "SELECT * FROM OSU_vendor WHERE program_title = 'Amazing program!';"
cursor.execute( select_string )
row = cursor.fetchone()
column_value = row[ "university" ]

print( "Answer = " + str( column_value ) + "; select SQL = " + select_string )

#### Question 3 - Answer

#### Question 3 - SQL

#### Question 4 - Answer

#### Question 4 - SQL

### JOIN: Connecting multiple tables

- Back to the [Table of Contents](#Table-of-Contents)

We can specify multiple tables in the FROM clause of a select query. This is called a “join”. However, when we do, we need to remember to specify how to match up rows across the two tables. Usually, there is a column that is the same in both tables that can be used to match them up. For much of the starmetrics database, that would be a column like uniqueawardnumber or award_id. 

Also, we frequently give tables temporary short names to make it easy to refer to them.

    /* Lists the topics associated with each federal award */
    SELECT DISTINCT v.uniqueawardnumber, g.topic_text
    FROM OSU_vendor v, OSU_grant g
    WHERE v.uniqueawardnumber = g.uniqueawardnumber;

We can still use regular WHERE clauses in these queries, too, to further filter:

    /* Lists the topics for each federal NSF award in 2012 */
    SELECT DISTINCT v.uniqueawardnumber, g.topic_text
    FROM OSU_vendor v, OSU_grant g
    WHERE v.uniqueawardnumber = g.uniqueawardnumber
        AND v.agency_abbrev = 'NSF'
        AND YEAR( v.periodstartdate ) = 2012;

Table joins are the most important feature of SQL databases; they are very powerful and allow us to create all kinds of complex queries. You can also join more than two tables if you like.
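
The join described above can be tried on a toy dataset. This sketch assumes an in-memory SQLite database (`sqlite3`) in place of MySQL, with invented rows. It runs the comma-style join used in this section and, for comparison, the same join written with the explicit `INNER JOIN ... ON` syntax, which is standard SQL and returns the same rows here. Awards that appear in only one of the two tables are dropped from the results.

```python
import sqlite3

# ASSUMPTION: in-memory SQLite stand-in with invented sample rows.
demo_db = sqlite3.connect( ":memory:" )
demo_cursor = demo_db.cursor()
demo_cursor.execute( "CREATE TABLE OSU_vendor ( uniqueawardnumber TEXT, agency_abbrev TEXT, paymentamount REAL )" )
demo_cursor.execute( "CREATE TABLE OSU_grant ( uniqueawardnumber TEXT, topic_text TEXT )" )
demo_cursor.executemany(
    "INSERT INTO OSU_vendor VALUES ( ?, ?, ? )",
    [
        ( "A-001", "NSF", 100.0 ),
        ( "A-001", "NSF", 50.0 ),
        ( "A-002", "NIH", 75.0 ),
        ( "A-004", "NSF", 10.0 ),  # no matching grant row
    ]
)
demo_cursor.executemany(
    "INSERT INTO OSU_grant VALUES ( ?, ? )",
    [
        ( "A-001", "robotics" ),
        ( "A-001", "machine learning" ),
        ( "A-002", "vision" ),
        ( "A-003", "chemistry" ),  # no matching vendor row
    ]
)

# comma-style join: tables listed in FROM, match condition in WHERE.
sql_implicit = (
    "SELECT DISTINCT v.uniqueawardnumber, g.topic_text"
    " FROM OSU_vendor v, OSU_grant g"
    " WHERE v.uniqueawardnumber = g.uniqueawardnumber"
    " ORDER BY v.uniqueawardnumber, g.topic_text;"
)
demo_cursor.execute( sql_implicit )
implicit_rows = demo_cursor.fetchall()

# same join, written with explicit INNER JOIN ... ON.
sql_explicit = (
    "SELECT DISTINCT v.uniqueawardnumber, g.topic_text"
    " FROM OSU_vendor v INNER JOIN OSU_grant g"
    " ON v.uniqueawardnumber = g.uniqueawardnumber"
    " ORDER BY v.uniqueawardnumber, g.topic_text;"
)
demo_cursor.execute( sql_explicit )
explicit_rows = demo_cursor.fetchall()

print( implicit_rows )
print( implicit_rows == explicit_rows )

demo_cursor.close()
demo_db.close()
```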

### GROUP BY and aggregate functions

- Back to the [Table of Contents](#Table-of-Contents)

Often, you will want to aggregate over multiple rows. For example, "What are the total expenditures for each award in 2012?" To do this, use a GROUP BY clause:

    /* sum vendor expenditures by award and filter by 2012 */
    SELECT uniqueawardnumber, SUM(paymentamount)
    FROM OSU_vendor
    WHERE year(periodstartdate) = 2012
    GROUP BY uniqueawardnumber;

There are a number of useful aggregate functions:

- **_SUM(column)_** : Calculate the sum of column for all the rows in each group
- **_AVG(column)_** : Calculate the numeric average for all of the rows in each group
- **_COUNT(column)_** : Count the number of rows in each group where column is not NULL (use COUNT(\*) to count every row in the group)
- **_MIN(column) and MAX(column)_** : Find the minimum or maximum value of column in all the rows in each group

Often, it can be very powerful to combine GROUP BY and table joins. To figure out these queries, I recommend first getting the join to return the individual rows correctly, and then adding in the GROUP BY and aggregates.
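
The aggregate functions listed above can be seen side by side in a short sketch. It assumes an in-memory SQLite database (`sqlite3`) in place of MySQL, and the payment rows are invented for illustration.

```python
import sqlite3

# ASSUMPTION: in-memory SQLite stand-in with invented sample rows.
demo_db = sqlite3.connect( ":memory:" )
demo_cursor = demo_db.cursor()
demo_cursor.execute( "CREATE TABLE OSU_vendor ( uniqueawardnumber TEXT, paymentamount REAL )" )
demo_cursor.executemany(
    "INSERT INTO OSU_vendor VALUES ( ?, ? )",
    [
        ( "A-001", 100.0 ),
        ( "A-001", 50.0 ),
        ( "A-002", 75.0 ),
        ( "A-002", 25.0 ),
        ( "A-002", 20.0 ),
    ]
)

# one output row per award, with each aggregate computed within that group.
sql_select = (
    "SELECT uniqueawardnumber,"
    " SUM( paymentamount ), AVG( paymentamount ), COUNT( paymentamount ),"
    " MIN( paymentamount ), MAX( paymentamount )"
    " FROM OSU_vendor"
    " GROUP BY uniqueawardnumber"
    " ORDER BY uniqueawardnumber;"
)
demo_cursor.execute( sql_select )
grouped_rows = demo_cursor.fetchall()

for current_row in grouped_rows:

    print( "==> " + str( current_row ) )

#-- END loop over groups --#

demo_cursor.close()
demo_db.close()
```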

### Exercise 3

- Back to the [Table of Contents](#Table-of-Contents)

Use the code block below to interact with the database to answer the questions that follow.  Re-use the connection (`db`) and `cursor` you opened at the top of the notebook.

For each question, enter:

- The answer to the question.
- The SQL query you used to find the answer.

Questions:

- 5) Based on information in OSU_vendor, what were the total expenditures in 2012? 
- 6) What were the total expenditures in 2012 on NSF grants?
- 7) What were the total expenditures in 2012 on NSF grants with topic ID 45?

Example code:

    # use cells at top to connect or re-connect to database and make cursor if needed

    # declare variables
    select_string = ""
    column_value = -1

    # Query template
    select_string = "SELECT SUM(paymentamount) AS payment_sum"
    select_string += " FROM OSU_vendor"
    select_string += " WHERE year( periodstartdate ) = 2012;"
    cursor.execute( select_string )
    row = cursor.fetchone()
    column_value = row[ "payment_sum" ]

    print( "Answer = " + str( column_value ) + "; select SQL = " + select_string )

#### Exercise 3 work space

In [None]:
# use cells at top to connect or re-connect to database and make cursor if needed

# declare variables
select_string = ""
column_value = -1

# Query template
select_string = "SELECT SUM(paymentamount) AS payment_sum"
select_string += " FROM OSU_vendor"
select_string += " WHERE year( periodstartdate ) = 2012;"
cursor.execute( select_string )
row = cursor.fetchone()
column_value = row[ "payment_sum" ]

print( "Answer = " + str( column_value ) + "; select SQL = " + select_string )

#### Question 5 - Answer

#### Question 5 - SQL

#### Question 6 - Answer

#### Question 6 - SQL

#### Question 7 - Answer

#### Question 7 - SQL

### ORDER BY

- Back to the [Table of Contents](#Table-of-Contents)

Without an ORDER BY clause, the database does not guarantee that results come back in any particular order. To control the order of the results, use ORDER BY:

    SELECT v.uniqueawardnumber, v.paymentamount
    FROM OSU_vendor v, OSU_grant g
    WHERE v.uniqueawardnumber = g.uniqueawardnumber
        AND v.university = g.university
    ORDER BY g.award_id;

(After you specify which column to order by, you can optionally specify either ASC for ascending order, or DESC for descending order.)

Using ORDER BY with custom column names can be really useful when combined with GROUP BY:

    SELECT uniqueawardnumber, SUM(paymentamount) AS total_payments
    FROM OSU_vendor
    WHERE YEAR(periodstartdate) = 2012
    GROUP BY uniqueawardnumber
    ORDER BY total_payments DESC;
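
Here is a small runnable sketch of sorting grouped totals in both directions. It assumes an in-memory SQLite database (`sqlite3`) in place of MySQL, with invented payment rows; `total_payments` is simply a column alias chosen for this illustration.

```python
import sqlite3

# ASSUMPTION: in-memory SQLite stand-in with invented sample rows.
demo_db = sqlite3.connect( ":memory:" )
demo_cursor = demo_db.cursor()
demo_cursor.execute( "CREATE TABLE OSU_vendor ( uniqueawardnumber TEXT, paymentamount REAL )" )
demo_cursor.executemany(
    "INSERT INTO OSU_vendor VALUES ( ?, ? )",
    [
        ( "A-001", 100.0 ),
        ( "A-001", 50.0 ),
        ( "A-002", 120.0 ),
        ( "A-003", 300.0 ),
        ( "A-003", 200.0 ),
    ]
)

sql_base = (
    "SELECT uniqueawardnumber, SUM( paymentamount ) AS total_payments"
    " FROM OSU_vendor"
    " GROUP BY uniqueawardnumber"
    " ORDER BY total_payments "
)

# largest totals first.
demo_cursor.execute( sql_base + "DESC;" )
descending_totals = demo_cursor.fetchall()

# smallest totals first.
demo_cursor.execute( sql_base + "ASC;" )
ascending_totals = demo_cursor.fetchall()

print( descending_totals )
print( ascending_totals )

demo_cursor.close()
demo_db.close()
```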

### Modifying the database

- Back to the [Table of Contents](#Table-of-Contents)

In addition to retrieving information from an existing database, you can also insert data into a database, update existing rows, and delete records using SQL.  Permissions on the homework, starmetrics, umetricsgrants and usptopatents databases will not allow you to modify the databases. For these exercises, we open an additional connection to your individual user database. 

Here are some example queries:

- **CREATE**: Adding a table to a database

        CREATE TABLE cjones.data (
        ID int(11) auto_increment primary key, 
        name_first varchar(20),
        name_last varchar(30))

- **INSERT**: Adding a row to a table

        INSERT INTO cjones.data
        (name_first, name_last)
        VALUES ('Christina', 'Jones')

- **UPDATE**: Changing data that is already in a table

        UPDATE cjones.data
        SET name_last = 'Johnson'
        WHERE name_first = 'Christina'
        
- **ALTER TABLE**: Changing the structure of an existing table

        ALTER TABLE cjones.data
            ADD COLUMN gender VARCHAR(1) DEFAULT 'F'

- **DELETE**: Removing one or more rows from a table

        DELETE FROM cjones.data
        WHERE name_last = 'Johnson'

- **DROP TABLE**: Removing a table from a database

        DROP TABLE cjones.data


Lastly, you can also CREATE a table from the results of a query against an existing table. 

- **CREATE TABLE ... SELECT**: Creating a table from a query

        CREATE TABLE cjones.osu_vendor (
        SELECT * FROM homework.OSU_vendor
        WHERE year(periodstartdate) = 2012
        and agency_abbrev = 'NSF');
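
To see the modifying statements run from Python, here is a minimal sketch using the built-in `sqlite3` module and an in-memory database, so nothing on the class server is touched. It is an illustration under assumptions: the table is named `data` without the `cjones.` prefix, and SQLite's `INTEGER PRIMARY KEY AUTOINCREMENT` stands in for MySQL's `int(11) auto_increment primary key`. With MySQLdb the parameter placeholder is `%s` rather than `?`.

```python
import sqlite3

# In-memory SQLite database standing in for your user database (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE: AUTOINCREMENT here plays the role of MySQL's auto_increment.
cur.execute(
    "CREATE TABLE data ("
    "ID INTEGER PRIMARY KEY AUTOINCREMENT, "
    "name_first VARCHAR(20), "
    "name_last VARCHAR(30))"
)

# INSERT: pass values as parameters instead of pasting them into the SQL string.
cur.execute(
    "INSERT INTO data (name_first, name_last) VALUES (?, ?)",
    ("Christina", "Jones"),
)

# UPDATE: change the last name of every row whose first name matches.
cur.execute(
    "UPDATE data SET name_last = ? WHERE name_first = ?",
    ("Johnson", "Christina"),
)
conn.commit()  # make the inserted and updated data permanent

cur.execute("SELECT ID, name_first, name_last FROM data")
after_update = cur.fetchall()

# ALTER TABLE: existing rows report the default value for the new column.
cur.execute("ALTER TABLE data ADD COLUMN gender VARCHAR(1) DEFAULT 'F'")
cur.execute("SELECT gender FROM data")
genders = [row[0] for row in cur.fetchall()]

# DELETE: remove matching rows; the table itself remains.
cur.execute("DELETE FROM data WHERE name_last = ?", ("Johnson",))
conn.commit()
cur.execute("SELECT COUNT(*) FROM data")
remaining = cur.fetchone()[0]

# DROP TABLE: remove the table itself, then check SQLite's catalog for it.
cur.execute("DROP TABLE data")
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'data'")
table_still_exists = cur.fetchone() is not None

print(after_update, genders, remaining, table_still_exists)
conn.close()
```

The calls to `commit()` matter: a DB-API connection typically does not save INSERT, UPDATE, or DELETE changes until you commit them, and that is also the default behavior for MySQLdb connections.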

## Addendum - Different ways to make a cursor

- Back to the [Table of Contents](#Table-of-Contents)

The way you create your cursor dictates the format of the rows you'll get back.

If you create your cursor by calling `cursor()` with no arguments, then when you iterate over the results, each row is a tuple (an ordered, immutable sequence) of the values in the row.  To reference a given value, you use its position in the row, counting from 0.

Example that shows the tuple row format:

In [None]:
# use cells at top to connect or re-connect to database and make cursor if needed

# create cursor with call to cursor(), no arguments.
cursor = db.cursor()

# run a simple select statement against our homework data.
sql_select = "SELECT * FROM OSU_vendor"
cursor.execute( sql_select )

# output the underlying structure of a single row.
single_row = cursor.fetchone()
print( single_row )

# get university - the 7th column, which is index 6 because positions start at 0
university_value = single_row[ 6 ]
print( "====> university = " + university_value )

If you call the `cursor()` method with `MySQLdb.cursors.DictCursor` as the sole argument, when you access a row in the results from executing a query, each row is a dictionary that maps a column's name to that column's value, and you reference a given column by its name.  This is how our initial loop over rows worked above.

Example that shows the dictionary row format:

In [None]:
# use cells at top to connect or re-connect to database and make cursor if needed

# create cursor with call to cursor(), but asking it to make
#    a dictionary cursor (DictCursor).
cursor = db.cursor( MySQLdb.cursors.DictCursor )

# run a simple select statement against our homework data.
sql_select = "SELECT * FROM OSU_vendor"
cursor.execute( sql_select )

# output the underlying structure of a single row.
single_row = cursor.fetchone()
print( single_row )

# get university - who cares what position it is in!
university_value = single_row[ "university" ]
print( "====> university = " + university_value )

In general, we will use the dictionary cursor, so we open our cursors with:

    cursor = db.cursor( MySQLdb.cursors.DictCursor )
    
rather than

    cursor = db.cursor()

Referencing items in a row by name means you can change the order or contents of a query's results without changing the code that uses those results, unless you remove a column that the code references.  It also makes your code more self-documenting: the names of the columns you are retrieving are built right into the code that retrieves them.
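
If you are handed a plain tuple cursor, you can still get name-based access by pairing each row with the column names in `cursor.description`, which is part of the Python DB-API. The sketch below is an illustration only: it uses the built-in `sqlite3` module with a made-up, three-column stand-in for `OSU_vendor`, and `fetch_one_as_dict` is a hypothetical helper, not part of MySQLdb.

```python
import sqlite3

# In-memory SQLite database standing in for the class MySQL server (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical miniature version of OSU_vendor with one made-up row.
cur.execute(
    "CREATE TABLE OSU_vendor ("
    "uniqueawardnumber TEXT, paymentamount REAL, university TEXT)"
)
cur.execute("INSERT INTO OSU_vendor VALUES ('A-1', 100.0, 'OSU')")


def fetch_one_as_dict(cursor):
    """Return the next row as a {column name: value} dict, or None if no rows remain."""
    row = cursor.fetchone()
    if row is None:
        return None
    # The first item of each entry in cursor.description is the column name.
    column_names = [column[0] for column in cursor.description]
    return dict(zip(column_names, row))


# Two queries that return the university column in different positions.
cur.execute("SELECT uniqueawardnumber, paymentamount, university FROM OSU_vendor")
first = fetch_one_as_dict(cur)
cur.execute("SELECT university, uniqueawardnumber FROM OSU_vendor")
second = fetch_one_as_dict(cur)

# The same name-based lookup works for both, regardless of column position.
print(first["university"], second["university"])

conn.close()
```

With tuple rows, the first query would need index 2 and the second index 0 to reach the university; with names, the lookup code stays the same.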

## References 

- Back to the [Table of Contents](#Table-of-Contents)

The tables for this homework were created directly from the starmetrics and umetricsgrants databases. The following SQL code was used to generate the tables: 

In [None]:
CREATE TABLE homework.osu_vendor (
    SELECT periodstartdate, periodenddate, v.uniqueawardnumber, recipientaccountnumber, institutionid, paymentamount, v.university, v.cfda, 
        v.zipcode, fipscode, statecode, countycode, c.agency, agency_abbrev, agency_text, sub_agency_text, program_title
    FROM starmetrics.vendor v
        LEFT JOIN starmetrics.zip_to_fip z on z.zipcode = v.zipcode
        LEFT JOIN starmetrics.cfda c on c.cfda = v.cfda
    WHERE v.university = "OSU"
        AND periodstartdate >= "2011-01-01" AND v.zipcode != "" )

In [None]:
CREATE TABLE homework.OSU_grant (
    SELECT award_id, topic_id, model, application_id, proportion, seq, agency, topic_text, uniqueawardnumber, university 
    FROM umetricsgrants.topiclda t
        LEFT JOIN umetricsgrants.topiclda_text text using(topic_id, model)
        LEFT JOIN starmetrics.crosswalk c using(award_id) 
    WHERE t.model = "NSF"
        AND seq = 1
        AND university = "OSU")