# Visualization for Big Data - Exploring Large Data Sets

In this notebook, we will cover how to break off a subset of a large dataset like the data contained in the "homework" or other databases so you can interact and experiment with it without worrying about changing or breaking the master data.


## Table of Contents

- [Class Databases and Database Tables](#Class-Databases-and-Database-Tables)

    - [StarMetrics database - MySQL database name `starmetrics`](#StarMetrics-database---MySQL-database-name-starmetrics)
    - [UMETRICS grants database - MySQL database name `umetricsgrants`](#UMETRICS-grants-database---MySQL-database-name-umetricsgrants)
    - [Personal work database](#Personal-work-database)

- [Creating your own tables from existing database tables](#Creating-your-own-tables-from-existing-database-tables)

   - [Example: Copy data to a table in your personal work database](#Example:-Copy-data-to-a-table-in-your-personal-work-database)
   - [REALLY long-running queries](#REALLY-long-running-queries)
   - [`INSERT`ing into an existing table](#INSERTing-into-an-existing-table)

- [Keep on Visualization-ing](#Keep-on-Visualization-ing)

    - [Resources for Tableau](#Resources-for-Tableau)
    - [Resources for Keshif](#Resources-for-Keshif)

## Class Databases and Database Tables

- Back to the [Table of Contents](#Table-of-Contents)

For these exercises we will continue to use the tables in the "homework" database. 

### StarMetrics database - MySQL database name `starmetrics`

- Back to the [Table of Contents](#Table-of-Contents)

The **_StarMetrics database (MySQL database name `starmetrics`)_** contains transactional data from universities that describe expenditures on federal research grants. The data includes four different types of expenditures:

1. Employee expenditures - this describes the people by occupation who charged time to federal grants.
2. Vendor expenditures - this describes the businesses from which goods were bought using federal grant funds.
3. Subaward expenditures - this describes the universities and other institutions that are paid from federal grants to collaborate.
4. Award expenditures - this describes the overhead that is associated with each federal grant.

### UMETRICS grants database - MySQL database name `umetricsgrants`

- Back to the [Table of Contents](#Table-of-Contents)

The **_UMETRICS grants database (MySQL database name `umetricsgrants`)_** contains public data that describes NIH, NSF, USDA, and NASA federal awards. This database was created by combining several smaller databases to capture all the grant data in one place. The structure of the database tables differs depending on the source of the data.

## Creating your own tables from existing database tables

- Back to the [Table of Contents](#Table-of-Contents)

For this class, it is important to understand not only how to query an existing database, but also how to create your own tables. Tableau has powerful visualization capabilities, but when it reads a database table with millions or billions of rows, it slows down significantly, and can freeze up altogether, as your data grows larger and more complex.

Because of this, we suggest that you use SQL, IPython or Jupyter to filter your data before syncing to Tableau. 

Copying another table to create your own table on the class server is very simple. The SQL syntax for creating your own table based on another existing table is below, for reference:

    CREATE TABLE <destination_database>.<new_table_name>
    (
        SELECT *
        FROM <source_database>.<existing_table_name>
    );
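
As a minimal sketch of how this template could be filled in from Python, the helper below builds the statement as a string. The function names and the database and table names in the example call are hypothetical, and the identifier check (letters, digits, and underscores only) is an assumption made to keep the sketch simple, not a rule of the class server.

```python
import re

# Plain identifiers only: letters, digits, underscores (an assumption made
# for this sketch; quoted or exotic identifiers are deliberately rejected).
_IDENTIFIER = re.compile( r"[A-Za-z_][A-Za-z0-9_]*" )


def qualified_name( database, table ):
    """Return `database`.`table`, rejecting anything that is not a plain name."""
    for name in ( database, table ):
        if not _IDENTIFIER.fullmatch( name ):
            raise ValueError( "not a plain identifier: " + repr( name ) )
    return "`" + database + "`.`" + table + "`"


def build_copy_table_sql( dest_db, dest_table, source_db, source_table ):
    """Build a CREATE TABLE ... SELECT statement that copies a whole table."""
    destination = qualified_name( dest_db, dest_table )
    source = qualified_name( source_db, source_table )
    return "CREATE TABLE " + destination + " SELECT * FROM " + source + ";"


# hypothetical names, for illustration only
print( build_copy_table_sql( "my_work_db", "vendor_copy", "homework", "vendor" ) )
```

The resulting string can be passed to `cursor.execute()` or pasted into a MySQL client.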
  
### Example: Copy data to a table in your personal work database

- Back to the [Table of Contents](#Table-of-Contents)

For example, say we wanted to keep looking at grant payments like those in the `vendor` homework table, but we wanted to filter by a specific year.

To do this, first, connect to the database and make a cursor.

In [None]:
# imports
import pymysql

# declare connection variables.
user = ""
password = ""
database = ""
db = None
cursor = None

# configure connection to your personal work database.
user = "<username>"
password = "<password>"
database = "<database name>"

# invoke the connect() function, passing parameters in variables.
db = pymysql.connect( user = user, password = password, database = database )

# create mysql cursor that maps column names to values in the query result.
cursor = db.cursor( pymysql.cursors.DictCursor )

Query the database to find and display the years that are available to you in the `homework.vendor` table.

In [None]:
# declare query variables
select_string = ""
row_list = []
current_row = None
year = -1
year_counter = -1

# query for the distinct years present in the vendor table
select_string = "SELECT DISTINCT year(periodstartdate) AS year FROM homework.vendor;"
cursor.execute( select_string )
row_list = cursor.fetchall()

# loop over the distinct years
year_counter = 0
for current_row in row_list:

    # increment counter
    year_counter += 1
    
    # get year
    year = current_row[ "year" ]

    # print year
    print( "Year " + str( year_counter ) + ": " + str( year ) )
    
#-- END loop over years --#
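
The same loop can be written a little more compactly with `enumerate()`, which supplies the counter. This is only a sketch: the `row_list` below is a made-up stand-in for what `cursor.fetchall()` returns from a `DictCursor` (one dictionary per row), so it runs without a database connection.

```python
# Stand-in for the list that cursor.fetchall() returns when the cursor is a
# DictCursor: one dict per row. These values are made up for illustration.
row_list = [ { "year": 2010 }, { "year": 2011 }, { "year": 2012 } ]

# enumerate() supplies the counter, starting from 1
year_list = []
for year_counter, current_row in enumerate( row_list, start = 1 ):

    # get year and remember it for later filtering
    year = current_row[ "year" ]
    year_list.append( year )

    # print year
    print( "Year " + str( year_counter ) + ": " + str( year ) )

#-- END loop over years --#
```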

Then, pick the years you want from the list, along with the universities you are interested in, to filter out a subset of the vendor data, and store that data in a table in your own database.

For example, in the SQL statement below, a user named `jmorgan` combines `CREATE TABLE` and `SELECT` to filter out vendor records for two Big Ten rivals and their little brother, for years 2011 and later, where there is a known zip code, and then stores the matching vendor records, along with some data from other tables, in a new table named `rivals_vendor`.

    CREATE TABLE jmorgan.rivals_vendor
    SELECT periodstartdate, periodenddate, v.uniqueawardnumber, recipientaccountnumber, institutionid, paymentamount, v.university, v.cfda, v.zipcode, fipscode, statecode, countycode, c.agency, agency_abbrev, agency_text, sub_agency_text, program_title
    FROM starmetrics.vendor v
        LEFT JOIN starmetrics.zip_to_fip z on z.zipcode = v.zipcode
        LEFT JOIN starmetrics.cfda c on c.cfda = v.cfda
    WHERE v.university IN ( 'OSU', 'UMich', 'MSU' )
        AND periodstartdate >= '2011-01-01'
        AND v.zipcode != '';

Using this template, you should be able to retrieve subsets of data filtered all kinds of ways.  Just update the `WHERE` clause to match the way you'd like to subset the data.
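
If you build the `WHERE` clause in Python, one option is to let the database driver substitute the filter values rather than pasting them into the string yourself. The sketch below assumes PyMySQL's `%s` placeholder style. The function name `build_vendor_filter` is hypothetical, and the `SELECT *` is a simplification of the column list used above.

```python
def build_vendor_filter( universities, start_date ):
    """Return ( sql, params ) for a parameterized SELECT on starmetrics.vendor."""
    if not universities:
        raise ValueError( "need at least one university" )

    # one %s placeholder per university in the IN ( ... ) list
    placeholders = ", ".join( [ "%s" ] * len( universities ) )

    sql = (
        "SELECT * FROM starmetrics.vendor v"
        " WHERE v.university IN ( " + placeholders + " )"
        " AND periodstartdate >= %s"
        " AND v.zipcode != '';"
    )

    # values are passed separately, in the same order as the placeholders
    params = tuple( universities ) + ( start_date, )
    return sql, params


sql, params = build_vendor_filter( [ "OSU", "UMich", "MSU" ], "2011-01-01" )
print( sql )
print( params )

# with an open cursor (not created here), this could be run as:
#     cursor.execute( "CREATE TABLE <your_database>.rivals_vendor " + sql, params )
```

Passing the values as a second argument to `cursor.execute()` lets the driver handle quoting and escaping for you.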

You have a number of options when choosing how to run this SQL.  One is to execute it using Python, using your cursor object to run the `CREATE` statement as we have previously run SQL SELECT statements:

In [None]:
# use cells at top to connect or re-connect to database and make cursor if needed

# declare variables
SQL_string = ""
create_result = ""

# CREATE SQL string
SQL_string += "CREATE TABLE homework.rivals_vendor"
SQL_string += " SELECT periodstartdate, v.award_id, institutionid, paymentamount, cfda, fipscode, statecode, countycode, agency_abbrev, agency_text, sub_agency_text"
SQL_string += " FROM homework.vendor v"
SQL_string += " WHERE fipscode != '';"

# Run the CREATE statement - returns the number of rows created.
create_result = cursor.execute( SQL_string )

# output result
print( "CREATE result: " + str( create_result ) + " rows created!" )
print( "====> SQL: " + SQL_string )

**_NOTE: This query might take a while to run.  If you run it in a Jupyter notebook, as long as you see an asterisk ( "*" ) in the square brackets in `In [*]:` to the upper left of your code cell, the cell is still active, waiting for a response from the server._**

You could also just type this query directly into the Query area of a MySQL client like MySQL Workbench, Sequel Pro, or Navicat.
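
Whichever way you run the statement, it can be worth checking that the new table holds the number of rows you expect. The sketch below shows the idea using Python's built-in `sqlite3` module as a stand-in for the class MySQL server, so it runs anywhere. The tiny `vendor` table and its rows are made up for illustration. Note that SQLite requires the `AS` keyword in `CREATE TABLE ... AS SELECT`; MySQL accepts it too, but treats it as optional.

```python
import sqlite3

# in-memory SQLite database standing in for the class MySQL server
db = sqlite3.connect( ":memory:" )
cursor = db.cursor()

# tiny made-up source table, loosely modeled on homework.vendor
cursor.execute( "CREATE TABLE vendor ( institutionid TEXT, paymentamount REAL, fipscode TEXT )" )
cursor.executemany(
    "INSERT INTO vendor VALUES ( ?, ?, ? )",
    [ ( "A", 10.0, "39049" ), ( "B", 20.0, "" ), ( "C", 30.0, "26161" ) ]
)

# same CREATE TABLE ... SELECT pattern as above, with the AS keyword
cursor.execute( "CREATE TABLE rivals_vendor AS SELECT * FROM vendor WHERE fipscode != ''" )

# compare the new table's row count to the count of matching source rows
cursor.execute( "SELECT COUNT( * ) FROM rivals_vendor" )
new_count = cursor.fetchone()[ 0 ]
cursor.execute( "SELECT COUNT( * ) FROM vendor WHERE fipscode != ''" )
expected_count = cursor.fetchone()[ 0 ]

print( "rows copied: " + str( new_count ) + " of " + str( expected_count ) + " expected" )
db.close()
```

On the class server, the two `COUNT( * )` queries would be the same idea, run through your PyMySQL cursor against your own table and the source table.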

### REALLY long-running queries

- Back to the [Table of Contents](#Table-of-Contents)

On a REALLY large database, queries like these can run into hours, long enough that a connection over the network, even from a dedicated MySQL client like MySQL Workbench, could time out.  For really long-running queries like this, consider running the query on the server.

To run a long-running query on the server:

- connect to the server using SSH.
- run `screen` to start a background session (just type `screen` at the command line).
- then, two options:

    - _Run Query in Python:_
    
        - Write Python code similar to the above.
        - Save it to a `*.py` file (for this example, we'll call it `create_rivals.py`) and get that file to the server.
        - In a command shell on the server in the same directory where you stored the python file, run `ipython`.
        - in ipython, type `%run <code_file_path>` (for our example, `%run create_rivals.py`).
        - once you are done, type `quit` and hit `Enter`/`Return` to exit.

    - _MySQL command line:_
    
        - Open a command shell on the server.
        - open the MySQL command line client by typing the following into the shell:
        
                mysql -u <username> -p
                
        - enter your password.
        
        - Then, EITHER:
        
            - type your SQL query at the prompt, making sure to place a semi-colon ( ";" ) at the end.  You can split the SQL over as many lines as you want, and extra white space between words is ignored.  Just remember to type a semi-colon when your statement is done - the mysql command line doesn't consider a command complete until it reaches a semi-colon.

        - OR:

            - place your SQL in a text file in the same directory where you ran the mysql client (for our example, we'll call it `create_rivals.sql`).
            - then in the mysql command line client, use the `SOURCE` command to run your SQL file:
            
                      SOURCE create_rivals.sql;
                      
        - Once you are done, enter `quit` then press `Enter`/`Return` to exit.
    
- Regardless of how you do it, since you are in screen, if you need to scroll back, you'll need to enter copy-paste mode by typing:
        
        <Control>+A, then "["
                
    - Then, you'll be able to scroll up and down using the arrow keys, line by line.
    - To exit copy-paste mode, press the Escape key in the upper left of your keyboard ( `esc`/ `Esc` ).
    
- While the query or program is running, you can disconnect from and reconnect to the screen session, allowing it to continue running even if you disconnect from (or are disconnected from) the server.

    - To detach from a `screen` session but leave it running, type `Control+A, then D`.  This works even while a long-running query or Python program is executing.
    - To rejoin a screen session you have disconnected from:
    
        - connect and log in to the server once again.
        - reattach to your screen session using `screen -r` (the "-r" flag resumes, or reattaches to, a detached session).
        
                screen -r
        
        - if you get a message that someone else is connected to the session (you might get this if you were forcefully disconnected by a network problem, for example), add the "-d" flag (for "detach"), which detaches the session from the other connection first:
        
                screen -r -d
    
    - To exit and end a `screen` session, at the unix shell, type `exit`, then `Enter`/`Return`.

- For more on using `screen`:
    
    - screen quick reference: [http://aperiodic.net/screen/quick_reference](http://aperiodic.net/screen/quick_reference)
    - official screen manual: [http://www.gnu.org/software/screen/manual/screen.html](http://www.gnu.org/software/screen/manual/screen.html)
    - Arch Linux screen guide: [https://wiki.archlinux.org/index.php/GNU_Screen](https://wiki.archlinux.org/index.php/GNU_Screen)
    - screen beginner's tutorial: [http://www.kuro5hin.org/story/2004/3/9/16838/14935](http://www.kuro5hin.org/story/2004/3/9/16838/14935)
    - O'Reilly screen command reference: [http://archive.oreilly.com/linux/cmd/cmd.csp?path=s/screen](http://archive.oreilly.com/linux/cmd/cmd.csp?path=s/screen)
        
- For more information on using IPython to run python programs on the server: [http://nbviewer.ipython.org/gist/jonathanmorgan/a6a07dbf9986ccda2628#An-example-IPython-workflow](http://nbviewer.ipython.org/gist/jonathanmorgan/a6a07dbf9986ccda2628#An-example-IPython-workflow)    
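
For the Python option above, a `create_rivals.py` file might look something like the sketch below. It reuses the `CREATE TABLE` statement from earlier in this notebook and assumes PyMySQL is installed on the server. Taking the username as an argument and prompting for the password are choices made for this sketch so that no password is stored in the file. With that choice, the file would be run in ipython as `%run create_rivals.py <username>`. Run without a username, it only prints the SQL.

```python
# create_rivals.py - hypothetical stand-alone version of the notebook cell
import getpass
import sys
import time


def build_create_sql():
    """Return the CREATE TABLE ... SELECT statement used earlier in this notebook."""
    sql_string = "CREATE TABLE homework.rivals_vendor"
    sql_string += " SELECT periodstartdate, v.award_id, institutionid, paymentamount, cfda, fipscode, statecode, countycode, agency_abbrev, agency_text, sub_agency_text"
    sql_string += " FROM homework.vendor v"
    sql_string += " WHERE fipscode != '';"
    return sql_string


def main( argv ):

    # without a username, just show what would be run
    if len( argv ) < 2:
        print( "usage: create_rivals.py <username>" )
        print( "would run: " + build_create_sql() )
        return 1

    # imported here so the dry run above works even without pymysql installed
    import pymysql

    # prompt for the password rather than storing it in the file
    password = getpass.getpass( "MySQL password: " )
    db = pymysql.connect( user = argv[ 1 ], password = password )
    try:
        cursor = db.cursor()

        # time the statement so you know how long it ran while you were away
        start_time = time.time()
        row_count = cursor.execute( build_create_sql() )
        elapsed = time.time() - start_time
        print( "CREATE result: " + str( row_count ) + " rows in " + str( round( elapsed ) ) + " seconds" )
    finally:
        db.close()

    return 0


if __name__ == "__main__":
    exit_code = main( sys.argv )
```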

### INSERTing into an existing table

- Back to the [Table of Contents](#Table-of-Contents)

If you want to add rows to an existing table rather than create an entirely new table, you can use an INSERT-SELECT statement similar to the CREATE-SELECT statement.

Basic syntax:

    INSERT INTO <destination_database>.<destination_table_name>
    (
        SELECT *
        FROM <source_database>.<existing_table_name>
    );
    
So for our example above, to add vendor rows to our existing homework.rivals_vendor table:

    INSERT INTO homework.rivals_vendor
    SELECT periodstartdate, v.award_id, institutionid, paymentamount, cfda, fipscode, statecode, countycode, agency_abbrev, agency_text, sub_agency_text
    FROM homework.vendor v
    WHERE fipscode = '';

Run it however you want, based on the above.
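
One thing to watch for if you run an `INSERT` from Python: PyMySQL connections do not autocommit by default, so unlike `CREATE TABLE` (which MySQL commits implicitly), inserted rows may not be saved until you call `commit()` on the connection. The sketch below wraps that pattern in a hypothetical helper, `run_insert_select`. So that it runs anywhere, the demo uses Python's built-in `sqlite3` module and a few made-up rows as a stand-in for the class server. The helper itself only relies on the standard `cursor()`, `commit()`, and `rollback()` methods that a PyMySQL connection also provides.

```python
import sqlite3


def run_insert_select( connection, sql_string ):
    """Execute an INSERT ... SELECT, commit it, and return the number of rows added."""
    cursor = connection.cursor()
    try:
        cursor.execute( sql_string )
        row_count = cursor.rowcount
        connection.commit()
    except Exception:
        # undo the partial insert before re-raising the error
        connection.rollback()
        raise
    finally:
        cursor.close()
    return row_count


# --- demo against an in-memory SQLite stand-in (made-up rows) ---
db = sqlite3.connect( ":memory:" )
db.execute( "CREATE TABLE vendor ( institutionid TEXT, fipscode TEXT )" )
db.executemany( "INSERT INTO vendor VALUES ( ?, ? )", [ ( "A", "39049" ), ( "B", "" ), ( "C", "" ) ] )
db.execute( "CREATE TABLE rivals_vendor AS SELECT * FROM vendor WHERE fipscode != ''" )

# add the rows that the original filter left out
rows_added = run_insert_select( db, "INSERT INTO rivals_vendor SELECT * FROM vendor WHERE fipscode = ''" )
total_rows = db.execute( "SELECT COUNT( * ) FROM rivals_vendor" ).fetchone()[ 0 ]
print( "rows added: " + str( rows_added ) + ", total rows: " + str( total_rows ) )
db.close()
```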

## Keep on Visualization-ing

- Back to the [Table of Contents](#Table-of-Contents)

Try creating your own subset of the vendor table then designing your own dashboard of visualizations to compare/contrast/describe the data using Tableau.

For a refresher on how to make a database connection between Tableau and the class server, refer back to the Data Visualization Installation Guide.

### Resources for Tableau

- Back to the [Table of Contents](#Table-of-Contents)

Below you will find a link to a 5-minute video that describes how to create visualizations with Tableau using a very simple, but effective, approach. In addition, the handout follows the progression of the video, but is heavily annotated.

- Video: [https://www.youtube.com/watch?v=-4uNv6wuGQ8](https://www.youtube.com/watch?v=-4uNv6wuGQ8)
- Handout: [https://docs.google.com/presentation/d/1bPn44W15Jq3csc87vld0FWXZpu4cnoqe1Qqob57KvTQ/edit#slide=id.p](https://docs.google.com/presentation/d/1bPn44W15Jq3csc87vld0FWXZpu4cnoqe1Qqob57KvTQ/edit#slide=id.p)

### Resources for Keshif

- Back to the [Table of Contents](#Table-of-Contents)

Keshif ( [http://www.cs.umd.edu/hcil/keshif/](http://www.cs.umd.edu/hcil/keshif/) ) is another visualization tool, a dashboard visualization program built to run inside a web browser, that you could also try out.

Here are resources similar to those for Tableau above for Keshif:

- Video: [https://www.youtube.com/watch?v=3Hmvms-1grU](https://www.youtube.com/watch?v=3Hmvms-1grU)
- Handout: [https://docs.google.com/presentation/d/1beCw3KiFjWLdVfgp8EICFPNPiuu2UzX8PFbcirJFQVw/edit#slide=id.gc5246df19_0_81](https://docs.google.com/presentation/d/1beCw3KiFjWLdVfgp8EICFPNPiuu2UzX8PFbcirJFQVw/edit#slide=id.gc5246df19_0_81)
- Github site: [https://github.com/adilyalcin/Keshif/](https://github.com/adilyalcin/Keshif/)
