# Converting data from MySQL to PostgreSQL using pandas

# Table of Contents

- [Setup](#Setup)

    - [Setup - Imports](#Setup---Imports)
    - [Setup - Database](#Setup---Database)
    
        - [Setup - Database - SQLAlchemy](#Setup---Database---SQLAlchemy)
        - [Setup - Database - psycopg2](#Setup---Database---psycopg2)
    
    - [Setup - Functions](#Setup---Functions)
    
        - [Setup - Function `column_name_to_lower_case`](#Setup---Function-column_name_to_lower_case)
        
- [Migrate data from MySQL to PostgreSQL](#Migrate-data-from-MySQL-to-PostgreSQL)
- [Data cleanup](#Data-cleanup)

    - [Cleanup - `machine_learning.arra_funded`](#Cleanup---machine_learning.arra_funded)
    - [Cleanup - unique constraints for old multi-part keys](#Cleanup---unique-constraints-for-old-multi-part-keys)

- [Create tab-delimited files for each table](#Create-tab-delimited-files-for-each-table)
- [Update MySQL tables](#Update-MySQL-tables)

    - [`homework_MachineLearning2.sql`](#homework_MachineLearning2.sql)
    - [`homework_NSF_Award.sql`](#homework_NSF_Award.sql)
    - [`homework_TextAnalysis.sql`](#homework_TextAnalysis.sql)
    - [`homework_UCPay2011.sql`](#homework_UCPay2011.sql)
    - [`homework_grant.sql`](#homework_grant.sql)
    - [`homework_vendor.sql`](#homework_vendor.sql)

- [TODO](#TODO)
- [Finally...](#Finally...)

# Setup

- back to [Table of Contents](#Table-of-Contents)

## Setup - Imports

- back to [Table of Contents](#Table-of-Contents)

In [None]:
# imports
import datetime
import pandas
import psycopg2
import psycopg2.extras  # needed for psycopg2.extras.DictCursor below
import pymysql
import sqlalchemy

print( "packages imported at " + str( datetime.datetime.now() ) )

## Setup - Database

- back to [Table of Contents](#Table-of-Contents)

In [None]:
# Variables to hold connection information

# ==> MySQL
mysql_username = "<username>"
mysql_password = "<password>"
mysql_host = "localhost"
mysql_port = "3306"
mysql_database = "homework"
mysql_charset = "utf8"

# ==> PostgreSQL
pgsql_username = "<username>"
pgsql_password = "<password>"
pgsql_host = "localhost"
pgsql_port = "5432"
pgsql_database = "homework"
pgsql_encoding = "utf8"

print( "database connection info defined at " + str( datetime.datetime.now() ) )

### Setup - Database - SQLAlchemy

- back to [Table of Contents](#Table-of-Contents)

In [None]:
# Create SQLAlchemy connections to both MySQL and PostgreSQL

# declare variables
connection_string = ""
execution_option_dict = None
mysql_engine = None
pgsql_engine = None

# shared execution options
execution_option_dict = {}
execution_option_dict[ "stream_results" ] = True
#execution_option_dict[ "autocommit" ] = True

# ==> MySQL

# Create database engine for pandas.
connection_string = "mysql+pymysql://" + mysql_username + ":" + mysql_password + "@" + mysql_host + ":" + mysql_port + "/" + mysql_database + "?charset=" + mysql_charset
mysql_engine = sqlalchemy.create_engine( connection_string, execution_options = execution_option_dict )

# ==> PostgreSQL

# Create database engine for pandas.
connection_string = "postgresql+psycopg2://" + pgsql_username + ":" + pgsql_password + "@" + pgsql_host + ":" + pgsql_port + "/" + pgsql_database + "?client_encoding=" + pgsql_encoding
pgsql_engine = sqlalchemy.create_engine( connection_string, execution_options = execution_option_dict )

print( "sqlalchemy database engines created at " + str( datetime.datetime.now() ) )
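
The connection strings above are built by plain concatenation, which can break if a username or password contains characters such as `@`, `:` or `/`. Here is a minimal sketch of one way to guard against that, using only the standard library: percent-encode the credentials with `urllib.parse.quote_plus` before assembling the URL. The helper name `build_connection_string` and its parameters are hypothetical; they are not part of this notebook or of SQLAlchemy.

```python
from urllib.parse import quote_plus


def build_connection_string(driver, username, password, host, port, database, **query):
    """Assemble a database URL, percent-encoding the credentials.

    Hypothetical helper for illustration; not part of SQLAlchemy.
    """
    # Escape characters such as "@" or ":" that would otherwise be
    # read as URL delimiters.
    credentials = quote_plus(username) + ":" + quote_plus(password)
    url = driver + "://" + credentials + "@" + host + ":" + str(port) + "/" + database

    # Append optional query parameters (for example charset or client_encoding).
    if query:
        url += "?" + "&".join(
            key + "=" + quote_plus(str(value)) for key, value in query.items()
        )

    return url


# Example with a made-up password containing reserved characters.
example_url = build_connection_string(
    "mysql+pymysql", "user", "p@ss:word", "localhost", "3306", "homework", charset="utf8"
)
print(example_url)
```

The resulting string could be passed to `sqlalchemy.create_engine` in place of the concatenated one.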

### Setup - Database - psycopg2

- back to [Table of Contents](#Table-of-Contents)

Create connection and cursor for things that break SQLAlchemy (sigh).

In [None]:
# create psycopg2 connection to PostgreSQL using connection variables defined above.
pgsql_connection = psycopg2.connect( host = pgsql_host, port = pgsql_port, database = pgsql_database, user = pgsql_username, password = pgsql_password )

print( "psycopg2 database connection created at " + str( datetime.datetime.now() ) )

In [None]:
# create psycopg2 cursor using pgsql_connection.
pgsql_cursor = pgsql_connection.cursor( cursor_factory = psycopg2.extras.DictCursor )

print( "psycopg2 database cursor created at " + str( datetime.datetime.now() ) )

## Setup - Functions

- back to [Table of Contents](#Table-of-Contents)

Run the file `data_functions.py`, which contains re-usable database functions (including `column_names_to_lower_case`, used below).  A list of the functions it defines is printed in the output after you execute the file.

In [None]:
# Must be run in the /data folder.
%run data_functions.py
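
`data_functions.py` is not shown here, so the following is only a guess at the kind of logic `column_names_to_lower_case` needs, written against a plain list of names rather than a DataFrame. The function name `lower_case_mapping` is hypothetical. It builds an old-name to new-name mapping and refuses to proceed if two columns would collide after lower-casing.

```python
def lower_case_mapping(column_names):
    """Map each column name to its lower-case form.

    Hypothetical sketch; the real helper lives in data_functions.py.
    Raises ValueError if two names collapse to the same lower-case name.
    """
    mapping = {}
    seen = {}
    for name in column_names:
        lowered = name.lower()
        if lowered in seen:
            raise ValueError(
                "columns " + repr(seen[lowered]) + " and " + repr(name)
                + " both become " + repr(lowered)
            )
        seen[lowered] = name
        mapping[name] = lowered
    return mapping


# Column names taken from the original NSF_Award table.
print(lower_case_mapping(["AwardId", "FirstName", "LastName"]))
```

A mapping like this could then be handed to `DataFrame.rename( columns = ... )`.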

# Migrate data from MySQL to PostgreSQL

- back to [Table of Contents](#Table-of-Contents)

In [None]:
# database name
database_name = "homework"

# make a list of the names of the tables we want to migrate
homework_table_list = []
homework_table_list.append( "machine_learning" )
homework_table_list.append( "nsf_award" )
homework_table_list.append( "text_analysis" )
homework_table_list.append( "uc_pay_2011" )
homework_table_list.append( "ugrant" )
homework_table_list.append( "vendor" )

print( "table list populated at " + str( datetime.datetime.now() ) + ": " + str( homework_table_list ) )

In [None]:
# declare variables
table_select_string = ""
table_df = None

# for each homework database table, pull it in from MySQL, write it out to PostgreSQL.
for table_name in homework_table_list:
    
    print( "==> starting migration of table " + table_name + " at " + str( datetime.datetime.now() ) )
    
    # read the table into pandas from mysql
    table_select_string = "SELECT * FROM " + database_name + "." + table_name + ";"
    table_df = pandas.read_sql( table_select_string, con = mysql_engine )
    
    # convert column names to lower case
    table_df = column_names_to_lower_case( table_df )
    
    print( table_name + " column names: " + str( list( table_df.columns ) ) )
    
    # write the table into postgresql.
    table_df.to_sql( table_name, con = pgsql_engine )

    print( "<== migration of table " + table_name + " completed at " + str( datetime.datetime.now() ) )
    
#-- END loop over tables --#

# Data cleanup

- back to [Table of Contents](#Table-of-Contents)

## Cleanup - `machine_learning.arra_funded`

- back to [Table of Contents](#Table-of-Contents)

In the `machine_learning` table, convert `arra_funded` from text to int (\x01 ==> 1, \x00 ==> 0).  In the MySQL table, the column's type is "bit" (boolean), and pandas doesn't know what to do with that, so the value ends up in PostgreSQL as a string holding a hex-escaped byte value (`\x00` or `\x01`).  Interesting.  Wonder what the column's underlying storage format is?

In [None]:
# get distinct values
sql_string = "SELECT DISTINCT arra_funded AS unique_value FROM machine_learning;"

# run SQL.
pgsql_cursor.execute( sql_string )

# loop over results
for current_row in pgsql_cursor:
    
    # output unique_value
    print( str( current_row[ "unique_value" ] ) )
    
#-- END loop over distinct values in arra_funded --#

In [None]:
# First, make new column `arra_funded_int` that is of type "int".
sql_string = "ALTER TABLE public.machine_learning ADD COLUMN arra_funded_int int2;"

# run and commit.
#pgsql_cursor.execute( sql_string )
#pgsql_connection.commit()

print( str( datetime.datetime.now() ) + ": " + sql_string )

In [None]:
# Now, set arra_funded_int to 0 where arra_funded = "\x00"...
#     have to escape the back-slash with a back-slash ( "\\" ).
sql_string = "UPDATE public.machine_learning SET arra_funded_int = 0 WHERE arra_funded = '\\x00';"

# run and commit.
#pgsql_cursor.execute( sql_string )
#pgsql_connection.commit()

print( str( datetime.datetime.now() ) + ": " + sql_string )

In [None]:
# ...and set arra_funded_int to 1 where arra_funded = "\x01".
#     have to escape the back-slash with a back-slash ( "\\" ).
sql_string = "UPDATE public.machine_learning SET arra_funded_int = 1 WHERE arra_funded = '\\x01';"

# run and commit.
#pgsql_cursor.execute( sql_string )
#pgsql_connection.commit()

print( str( datetime.datetime.now() ) + ": " + sql_string )

In [None]:
# remove arra_funded column
sql_string = "ALTER TABLE public.machine_learning DROP COLUMN arra_funded;"

# run and commit.
#pgsql_cursor.execute( sql_string )
#pgsql_connection.commit()

print( str( datetime.datetime.now() ) + ": " + sql_string )

In [None]:
# and rename "arra_funded_int" to "arra_funded"
sql_string = "ALTER TABLE machine_learning RENAME COLUMN arra_funded_int TO arra_funded;"

# run and commit.
#pgsql_cursor.execute( sql_string )
#pgsql_connection.commit()

print( str( datetime.datetime.now() ) + ": " + sql_string )
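
The statements above add a column, fill it, drop the old column, and rename the new one. An alternative, sketched here on the assumption that the conversion happens in Python before the table is written, is to map each value to an integer directly. The function name `bit_to_int` is hypothetical. It accepts the raw byte a MySQL driver may return for a `bit(1)` column, a one-character string, and the four-character escaped text (`\x00` or `\x01`) that the UPDATE statements match on; missing values are assumed to arrive as `None`.

```python
def bit_to_int(value):
    """Convert a MySQL bit(1) value in one of several forms to 0, 1 or None.

    Hypothetical helper for illustration only.
    """
    if value is None:
        return None

    # Raw bytes such as b"\x00" or b"\x01".
    if isinstance(value, (bytes, bytearray)):
        return int.from_bytes(value, byteorder="big")

    # Already numeric (bool is a subclass of int).
    if isinstance(value, int):
        return int(value)

    text = str(value)

    # Plain digit text.
    if text in ("0", "1"):
        return int(text)

    # Escaped text: a backslash, an "x", then hex digits.
    if text.startswith("\\x"):
        return int(text[2:], 16)

    # One-character string holding the raw code point.
    if len(text) == 1:
        return ord(text)

    raise ValueError("unrecognised bit value: " + repr(value))


for raw in (b"\x00", b"\x01", "\\x01", "\x00", None):
    print(repr(raw), "==>", bit_to_int(raw))
```

Applied to the data frame before `to_sql`, something like `table_df[ "arra_funded" ].map( bit_to_int )` would avoid the text column altogether, though that has not been tried here.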

## Cleanup - unique constraints for old multi-part keys

- back to [Table of Contents](#Table-of-Contents)

Convert the old primary keys on `ugrant` and `vendor` to just be unique constraints (since each table now has a unique integer `index` column to serve as its key):

- ugrant

        PRIMARY KEY (`award_id`,`topic_id`)

- vendor

        PRIMARY KEY (`periodstartdate`,`institutionid`,`paymentamount`,`award_id`)

In [None]:
# ==> ugrant

# create SQL string
sql_string = "ALTER TABLE ugrant ADD CONSTRAINT ugrant_primary_key UNIQUE( award_id, topic_id );"

# run and commit.
#pgsql_cursor.execute( sql_string )
#pgsql_connection.commit()

print( str( datetime.datetime.now() ) + ": " + sql_string )

In [None]:
# ==> vendor

# create SQL string
sql_string = "ALTER TABLE vendor ADD CONSTRAINT vendor_primary_key UNIQUE( periodstartdate, institutionid, paymentamount, award_id );"

# run and commit.
#pgsql_cursor.execute( sql_string )
#pgsql_connection.commit()

print( str( datetime.datetime.now() ) + ": " + sql_string )
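
Both constraint statements above follow the same pattern, so they could be generated rather than typed. Here is a small sketch, with the hypothetical helper name `unique_constraint_sql`, that builds the `ALTER TABLE ... ADD CONSTRAINT ... UNIQUE` string from a table name and a column list. Identifiers are only checked against a conservative pattern, not quoted.

```python
import re

# Conservative pattern: letters, digits and underscores, not starting with a digit.
IDENTIFIER_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def unique_constraint_sql(table_name, column_names, constraint_name=None):
    """Build an ALTER TABLE statement adding a UNIQUE constraint.

    Hypothetical helper for illustration only.
    """
    # Default constraint name: <table>_unique_<col>_<col>...
    if constraint_name is None:
        constraint_name = table_name + "_unique_" + "_".join(column_names)

    # Reject anything that is not a simple identifier.
    for identifier in [table_name, constraint_name] + list(column_names):
        if not IDENTIFIER_PATTERN.match(identifier):
            raise ValueError("not a simple identifier: " + repr(identifier))

    return (
        "ALTER TABLE " + table_name
        + " ADD CONSTRAINT " + constraint_name
        + " UNIQUE( " + ", ".join(column_names) + " );"
    )


print(unique_constraint_sql("ugrant", ["award_id", "topic_id"], "ugrant_primary_key"))
```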

# Create tab-delimited files for each table

- back to [Table of Contents](#Table-of-Contents)

In [None]:
# declare variables
table_select_string = ""
table_df = None

# for each homework database table, pull it in from PostgreSQL, write it out to CSV file.
for table_name in homework_table_list:
    
    print( "==> starting tab-delimited export of table " + table_name + " at " + str( datetime.datetime.now() ) )
    
    # read the table into pandas from postgresql
    table_select_string = "SELECT * FROM " + table_name + ";"
    table_df = pandas.read_sql( table_select_string, con = pgsql_engine )
    
    # write the table to CSV file.
    output_file_path = table_name + ".tab.txt"
    table_df.to_csv( output_file_path, sep = '\t', encoding = 'utf-8', index = False )

    print( "<== tab-delimited export of table " + table_name + " to " + output_file_path + " completed at " + str( datetime.datetime.now() ) )
    
#-- END loop over tables --#
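
One way to sanity-check an export is to read the file back and compare its row count with the table's. This sketch uses only the standard `csv` module. The file it writes is a throwaway sample in a temporary directory, and `count_tab_rows` is a hypothetical helper.

```python
import csv
import os
import tempfile


def count_tab_rows(file_path, encoding="utf-8"):
    """Return (header, data_row_count) for a tab-delimited file with a header row."""
    with open(file_path, newline="", encoding=encoding) as tab_file:
        # csv.reader copes with quoted fields that contain tabs or newlines.
        reader = csv.reader(tab_file, delimiter="\t")
        header = next(reader, None)
        row_count = sum(1 for _ in reader)
    return header, row_count


# Write a small sample file shaped like the exports above, then read it back.
with tempfile.TemporaryDirectory() as temp_dir:
    sample_path = os.path.join(temp_dir, "sample.tab.txt")
    with open(sample_path, "w", newline="", encoding="utf-8") as sample_file:
        writer = csv.writer(sample_file, delimiter="\t")
        writer.writerow(["index", "award_id", "topic_id"])
        writer.writerow([0, "A-1", 7])
        writer.writerow([1, "A-2", 9])
    header, row_count = count_tab_rows(sample_path)

print(header, row_count)
```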

# Update MySQL tables

- back to [Table of Contents](#Table-of-Contents)

Next, so we have the same table and column names for MySQL and PostgreSQL, pull in data frames from PostgreSQL, then output them to an empty database named "homework" in MySQL.  One option is to just make all new tables:

In [None]:
# declare variables
table_select_string = ""
table_df = None

# for each homework database table, pull it in from PostgreSQL, write it out to MySQL.
for table_name in homework_table_list:
    
    print( "==> starting migration of table " + table_name + " from PostgreSQL to MySQL at " + str( datetime.datetime.now() ) )
    
    # read the table into pandas from postgresql
    table_select_string = "SELECT * FROM " + table_name + ";"
    table_df = pandas.read_sql( table_select_string, con = pgsql_engine )
    
    # write the table to MySQL database.
    #table_df.to_sql( table_name, con = mysql_engine, index = False )

    print( "<== migration of table " + table_name + " from PostgreSQL to MySQL completed at " + str( datetime.datetime.now() ) )
    
#-- END loop over tables --#

Turns out you lose all of the column typing that way, though.  So instead, we go through the dump file for each table and use vi to replace the upper- and mixed-case table and column names with all lower case, then add a unique integer primary key named `index`.  In other words, we just use vi to update the create scripts.
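
The vi substitutions listed in the sections below could also be scripted. Here is a rough sketch, assuming identifiers in the dump are always wrapped in backticks and that no backtick-quoted text appears inside the data values: lower-case every backtick-quoted name with a regular expression. Table renames that do more than change case (for example `MachineLearning2` to `machine_learning`) still need an explicit mapping, passed in here as the hypothetical `renames` argument.

```python
import re

# Matches a name wrapped in backticks, capturing the name itself.
BACKTICK_NAME = re.compile(r"`([^`]+)`")


def lower_case_identifiers(sql_text, renames=None):
    """Lower-case backtick-quoted identifiers, applying explicit renames first.

    Hypothetical helper for illustration only.
    """
    renames = renames or {}

    def replace(match):
        name = match.group(1)
        # Use the explicit rename if there is one, else just lower-case.
        return "`" + renames.get(name, name.lower()) + "`"

    return BACKTICK_NAME.sub(replace, sql_text)


sample = (
    "CREATE TABLE `MachineLearning2` ( `APPLICATION_ID` int(11) NOT NULL, "
    "`topic_id` int(11) NOT NULL );"
)
print(lower_case_identifiers(sample, {"MachineLearning2": "machine_learning"}))
```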

## `homework_MachineLearning2.sql`

- back to [Table of Contents](#Table-of-Contents)

Original CREATE TABLE:

    CREATE TABLE `MachineLearning2` (
      `APPLICATION_ID` int(11) NOT NULL,
      `CFDA_CODE` char(15) DEFAULT NULL,
      `YEAR` int(4) DEFAULT NULL,
      `ACTIVITY` char(3) DEFAULT NULL,
      `ADMINISTERING_IC` char(2) DEFAULT NULL,
      `ARRA_FUNDED` bit(1) DEFAULT NULL,
      `ORG_NAME` varchar(100) DEFAULT NULL,
      `ORG_DEPT` varchar(50) DEFAULT NULL,
      `topic_id` int(11) NOT NULL,
      `STUDY_SECTION` varchar(4) DEFAULT NULL,
      `TOTAL_COST` decimal(13,2) DEFAULT NULL,
      `ED_INST_TYPE` varchar(65) DEFAULT NULL
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Changes to `homework_MachineLearning2.sql`

- rename to: `homework-machine_learning.sql`
- table name: `MachineLearning2` ==> `machine_learning`
- column names (vi find and replace syntax):

        %s/`APPLICATION_ID`/`application_id`/g
        %s/`CFDA_CODE`/`cfda_code`/g
        %s/`YEAR`/`year`/g
        %s/`ACTIVITY`/`activity`/g
        %s/`ADMINISTERING_IC`/`administering_ic`/g
        %s/`ARRA_FUNDED`/`arra_funded`/g
        %s/`ORG_NAME`/`org_name`/g
        %s/`ORG_DEPT`/`org_dept`/g
        %s/`STUDY_SECTION`/`study_section`/g
        %s/`TOTAL_COST`/`total_cost`/g
        %s/`ED_INST_TYPE`/`ed_inst_type`/g

- added primary key `index`:

        `index` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
        PRIMARY KEY (`index`)
        
New CREATE TABLE:

    CREATE TABLE `machine_learning` (
      `index` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
      `application_id` int(11) NOT NULL,
      `cfda_code` char(15) CHARACTER SET utf8 DEFAULT NULL,
      `year` int(4) DEFAULT NULL,
      `activity` char(3) CHARACTER SET utf8 DEFAULT NULL,
      `administering_ic` char(2) CHARACTER SET utf8 DEFAULT NULL,
      `arra_funded` bit(1) DEFAULT NULL,
      `org_name` varchar(100) CHARACTER SET utf8 DEFAULT NULL,
      `org_dept` varchar(50) CHARACTER SET utf8 DEFAULT NULL,
      `topic_id` int(11) NOT NULL,
      `study_section` varchar(4) CHARACTER SET utf8 DEFAULT NULL,
      `total_cost` decimal(13,2) DEFAULT NULL,
      `ed_inst_type` varchar(65) CHARACTER SET utf8 DEFAULT NULL,
      PRIMARY KEY (`index`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
    
And, updated each INSERT statement, from:

    INSERT INTO `machine_learning` VALUES
    
to:

    INSERT INTO `machine_learning` ( `application_id`, `cfda_code`, `year`, `activity`, `administering_ic`, `arra_funded`, `org_name`, `org_dept`, `topic_id`, `study_section`, `total_cost`, `ed_inst_type` ) VALUES
    
vi search and replace syntax:

    %s/INSERT INTO `machine_learning`/INSERT INTO `machine_learning` ( `application_id`, `cfda_code`, `year`, `activity`, `administering_ic`, `arra_funded`, `org_name`, `org_dept`, `topic_id`, `study_section`, `total_cost`, `ed_inst_type` )/g
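
The same INSERT rewrite recurs for every table below, so here is a sketch of how it might be done in Python instead of vi. The helper name `add_insert_columns` is hypothetical. Unlike the vi command, it matches the full `INSERT INTO ... VALUES` prefix, so running it twice over the same text should leave already-rewritten statements alone.

```python
def add_insert_columns(sql_text, table_name, column_names):
    """Insert an explicit column list into each INSERT for table_name.

    Hypothetical helper for illustration only.
    """
    old_prefix = "INSERT INTO `" + table_name + "` VALUES"
    column_list = ", ".join("`" + name + "`" for name in column_names)
    new_prefix = "INSERT INTO `" + table_name + "` ( " + column_list + " ) VALUES"
    return sql_text.replace(old_prefix, new_prefix)


sample = "INSERT INTO `text_analysis` VALUES (1,'abc');"
rewritten = add_insert_columns(sample, "text_analysis", ["application_id", "abstract_text"])
print(rewritten)
```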

## `homework_NSF_Award.sql`

- back to [Table of Contents](#Table-of-Contents)

Original CREATE TABLE:

    CREATE TABLE `NSF_Award` (
      `AwardId` bigint(20) DEFAULT NULL,
      `FirstName` text,
      `LastName` text,
      `StartDate` text,
      `EndDate` text,
      `AwardTitle` text,
      `AwardEffectiveDate` text,
      `AwardExpirationDate` text,
      `Name` text,
      `CityName` text,
      `ZipCode` text,
      `PhoneNumber` double DEFAULT NULL,
      `StreetAddress` text,
      `CountryName` text,
      `StateName` text,
      `StateCode` text
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Changes to `homework_NSF_Award.sql`

- rename to: `homework-nsf_award.sql`
- table name: `NSF_Award` ==> `nsf_award`

        %s/`NSF_Award`/`nsf_award`/g

- column names (vi find and replace syntax):

        %s/`AwardId`/`awardid`/g
        %s/`FirstName`/`firstname`/g
        %s/`LastName`/`lastname`/g
        %s/`StartDate`/`startdate`/g
        %s/`EndDate`/`enddate`/g
        %s/`AwardTitle`/`awardtitle`/g
        %s/`AwardEffectiveDate`/`awardeffectivedate`/g
        %s/`AwardExpirationDate`/`awardexpirationdate`/g
        %s/`Name`/`name`/g
        %s/`CityName`/`cityname`/g
        %s/`ZipCode`/`zipcode`/g
        %s/`PhoneNumber`/`phonenumber`/g
        %s/`StreetAddress`/`streetaddress`/g
        %s/`CountryName`/`countryname`/g
        %s/`StateName`/`statename`/g
        %s/`StateCode`/`statecode`/g

- added primary key `index`:

        `index` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
        PRIMARY KEY (`index`)
        
New CREATE TABLE:

    CREATE TABLE `nsf_award` (
      `index` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
      `awardid` bigint(20) DEFAULT NULL,
      `firstname` text,
      `lastname` text,
      `startdate` text,
      `enddate` text,
      `awardtitle` text,
      `awardeffectivedate` text,
      `awardexpirationdate` text,
      `name` text,
      `cityname` text,
      `zipcode` text,
      `phonenumber` double DEFAULT NULL,
      `streetaddress` text,
      `countryname` text,
      `statename` text,
      `statecode` text,
      PRIMARY KEY (`index`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

And, updated each INSERT statement, from:

    INSERT INTO `nsf_award` VALUES
    
to:

    INSERT INTO `nsf_award` ( `awardid`, `firstname`, `lastname`, `startdate`, `enddate`, `awardtitle`, `awardeffectivedate`, `awardexpirationdate`, `name`, `cityname`, `zipcode`, `phonenumber`, `streetaddress`, `countryname`, `statename`, `statecode` ) VALUES
    
vi search and replace syntax:

    %s/INSERT INTO `nsf_award`/INSERT INTO `nsf_award` ( `awardid`, `firstname`, `lastname`, `startdate`, `enddate`, `awardtitle`, `awardeffectivedate`, `awardexpirationdate`, `name`, `cityname`, `zipcode`, `phonenumber`, `streetaddress`, `countryname`, `statename`, `statecode` )/g

## `homework_TextAnalysis.sql`

- back to [Table of Contents](#Table-of-Contents)

Original CREATE TABLE:

    CREATE TABLE `TextAnalysis` (
      `APPLICATION_ID` int(11) DEFAULT NULL,
      `ABSTRACT_TEXT` text CHARACTER SET utf8
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Changes to `homework_TextAnalysis.sql`

- rename to: `homework-text_analysis.sql`
- table name: `TextAnalysis` ==> `text_analysis`

        %s/`TextAnalysis`/`text_analysis`/g

- column names (vi find and replace syntax):

        %s/`APPLICATION_ID`/`application_id`/g
        %s/`ABSTRACT_TEXT`/`abstract_text`/g

- added primary key `index`:

        `index` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
        PRIMARY KEY (`index`)
        
New CREATE TABLE:

    CREATE TABLE `text_analysis` (
      `index` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
      `application_id` int(11) DEFAULT NULL,
      `abstract_text` text CHARACTER SET utf8,
      PRIMARY KEY (`index`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

And, updated each INSERT statement, from:

    INSERT INTO `text_analysis` VALUES
    
to:

    INSERT INTO `text_analysis` ( `application_id`, `abstract_text` ) VALUES
    
vi search and replace syntax:

    %s/INSERT INTO `text_analysis`/INSERT INTO `text_analysis` ( `application_id`, `abstract_text` )/g

## `homework_UCPay2011.sql`

- back to [Table of Contents](#Table-of-Contents)

Original CREATE TABLE:

    CREATE TABLE `UCPay2011` (
      `ID` bigint(20) DEFAULT NULL,
      `year` bigint(20) DEFAULT NULL,
      `campus` text,
      `name` text,
      `title` text,
      `gross` double DEFAULT NULL,
      `base` double DEFAULT NULL,
      `overtime` double DEFAULT NULL,
      `extra` bigint(20) DEFAULT NULL,
      `exclude` bigint(20) DEFAULT NULL
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Changes to `homework_UCPay2011.sql`

- rename to: `homework-uc_pay_2011.sql`
- table name: `UCPay2011` ==> `uc_pay_2011`

        %s/`UCPay2011`/`uc_pay_2011`/g

- column names (vi find and replace syntax):

        %s/`ID`/`id`/g

- added primary key `index`:

        `index` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
        PRIMARY KEY (`index`)
        
New CREATE TABLE:

    CREATE TABLE `uc_pay_2011` (
      `index` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
      `id` bigint(20) DEFAULT NULL,
      `year` bigint(20) DEFAULT NULL,
      `campus` text,
      `name` text,
      `title` text,
      `gross` double DEFAULT NULL,
      `base` double DEFAULT NULL,
      `overtime` double DEFAULT NULL,
      `extra` bigint(20) DEFAULT NULL,
      `exclude` bigint(20) DEFAULT NULL,
      PRIMARY KEY (`index`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

And, updated each INSERT statement, from:

    INSERT INTO `uc_pay_2011` VALUES
    
to:

    INSERT INTO `uc_pay_2011` ( `id`, `year`, `campus`, `name`, `title`, `gross`, `base`, `overtime`, `extra`, `exclude` ) VALUES
    
vi search and replace syntax:

    %s/INSERT INTO `uc_pay_2011`/INSERT INTO `uc_pay_2011` ( `id`, `year`, `campus`, `name`, `title`, `gross`, `base`, `overtime`, `extra`, `exclude` )/g

## `homework_grant.sql`

- back to [Table of Contents](#Table-of-Contents)

Original CREATE TABLE:

    CREATE TABLE `ugrant` (
      `award_id` varchar(50) NOT NULL,
      `topic_id` int(11) NOT NULL,
      `proportion` double DEFAULT NULL,
      `agency` varchar(5) DEFAULT NULL,
      `topic_text` text,
      PRIMARY KEY (`award_id`,`topic_id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Changes to `homework_grant.sql`

- rename to: `homework-grant.sql`
- remove primary key of award_id and topic_id.
- added primary key `index`:

        `index` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
        PRIMARY KEY (`index`)
        
- added UNIQUE constraint on award_id and topic_id.

        UNIQUE KEY `ugrant_unique_award_id_topic_id` (`award_id`,`topic_id`)
        
New CREATE TABLE:

    CREATE TABLE `ugrant` (
      `index` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
      `award_id` varchar(50) NOT NULL,
      `topic_id` int(11) NOT NULL,
      `proportion` double DEFAULT NULL,
      `agency` varchar(5) DEFAULT NULL,
      `topic_text` text,
      PRIMARY KEY (`index`),
      UNIQUE KEY `ugrant_unique_award_id_topic_id` (`award_id`,`topic_id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

And, updated each INSERT statement, from:

    INSERT INTO `ugrant` VALUES
    
to:

    INSERT INTO `ugrant` ( `award_id`, `topic_id`, `proportion`, `agency`, `topic_text` ) VALUES
    
vi search and replace syntax:

    %s/INSERT INTO `ugrant`/INSERT INTO `ugrant` ( `award_id`, `topic_id`, `proportion`, `agency`, `topic_text` )/g

## `homework_vendor.sql`

- back to [Table of Contents](#Table-of-Contents)

Original CREATE TABLE:

    CREATE TABLE `vendor` (
      `periodstartdate` date NOT NULL,
      `award_id` varchar(60) CHARACTER SET utf8 NOT NULL,
      `institutionid` varchar(200) CHARACTER SET utf8 NOT NULL,
      `paymentamount` double NOT NULL,
      `cfda` varchar(10) CHARACTER SET utf8 DEFAULT NULL,
      `fipscode` varchar(6) CHARACTER SET utf8 DEFAULT NULL,
      `statecode` varchar(10) DEFAULT NULL,
      `countycode` varchar(10) DEFAULT NULL,
      `agency_abbrev` varchar(30) CHARACTER SET utf8 DEFAULT NULL,
      `agency_text` varchar(70) CHARACTER SET utf8 DEFAULT NULL,
      `sub_agency_text` varchar(100) CHARACTER SET utf8 DEFAULT NULL,
      PRIMARY KEY (`periodstartdate`,`institutionid`,`paymentamount`,`award_id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Changes to `homework_vendor.sql`

- rename to: `homework-vendor.sql`
- remove primary key of `periodstartdate`,`institutionid`,`paymentamount`,`award_id`
- added primary key `index`:

        `index` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
        PRIMARY KEY (`index`)
        
- added UNIQUE constraint on `periodstartdate`,`institutionid`,`paymentamount`,`award_id`.

        UNIQUE KEY `vendor_unique_startdate_instid_payamt_award_id` (`periodstartdate`,`institutionid`,`paymentamount`,`award_id`)
        
New CREATE TABLE:

    CREATE TABLE `vendor` (
      `index` bigint(20) unsigned NOT NULL AUTO_INCREMENT,  
      `periodstartdate` date NOT NULL,
      `award_id` varchar(60) CHARACTER SET utf8 NOT NULL, 
      `institutionid` varchar(200) CHARACTER SET utf8 NOT NULL,
      `paymentamount` double NOT NULL, 
      `cfda` varchar(10) CHARACTER SET utf8 DEFAULT NULL,
      `fipscode` varchar(6) CHARACTER SET utf8 DEFAULT NULL,
      `statecode` varchar(10) DEFAULT NULL,
      `countycode` varchar(10) DEFAULT NULL,
      `agency_abbrev` varchar(30) CHARACTER SET utf8 DEFAULT NULL,
      `agency_text` varchar(70) CHARACTER SET utf8 DEFAULT NULL,
      `sub_agency_text` varchar(100) CHARACTER SET utf8 DEFAULT NULL,
      PRIMARY KEY (`index`),
      UNIQUE KEY `vendor_unique_startdate_instid_payamt_award_id` (`periodstartdate`,`institutionid`,`paymentamount`,`award_id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
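
The CREATE TABLE above uses abbreviated column names in its unique key name. One likely reason is that MySQL limits identifiers to 64 characters, and a key name spelling out all four columns in full would exceed that. A quick sketch of the check, with the hypothetical helper name `fits_mysql_identifier`:

```python
# MySQL caps identifier length (tables, columns, indexes) at 64 characters.
MYSQL_MAX_IDENTIFIER_LENGTH = 64


def fits_mysql_identifier(name):
    """Return True if name is within MySQL's identifier length limit."""
    return len(name) <= MYSQL_MAX_IDENTIFIER_LENGTH


vendor_columns = ["periodstartdate", "institutionid", "paymentamount", "award_id"]

# Key name built from the full column names.
full_name = "vendor_unique_" + "_".join(vendor_columns)

# Abbreviated key name used in the CREATE TABLE above.
short_name = "vendor_unique_startdate_instid_payamt_award_id"

for candidate in (full_name, short_name):
    print(candidate, len(candidate), fits_mysql_identifier(candidate))
```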

And, updated each INSERT statement, from:

    INSERT INTO `vendor` VALUES
    
to:

    INSERT INTO `vendor` ( `periodstartdate`, `award_id`,  `institutionid`, `paymentamount`,  `cfda`, `fipscode`, `statecode`, `countycode`, `agency_abbrev`, `agency_text`, `sub_agency_text` ) VALUES
    
vi search and replace syntax:

    %s/INSERT INTO `vendor`/INSERT INTO `vendor` ( `periodstartdate`, `award_id`,  `institutionid`, `paymentamount`,  `cfda`, `fipscode`, `statecode`, `countycode`, `agency_abbrev`, `agency_text`, `sub_agency_text` )/g

# TODO

- back to [Table of Contents](#Table-of-Contents)

TODO:

- update notebooks to refer to lower-case names, new table names, and to use the sqlalchemy way of doing SQL calls rather than a direct call via DBAPI.  See if it works for both mysql and postgresql.

# Finally...

- back to [Table of Contents](#Table-of-Contents)

Close engines, connections, etc.:

In [None]:
# sometimes you'll need to rollback.
pgsql_connection.rollback()

In [None]:
# close SQLAlchemy engines.
mysql_engine.dispose()
pgsql_engine.dispose()

# close psycopg2 DBAPI connection.
pgsql_cursor.close()
pgsql_connection.close()

print( "database connections closed (disposed) at " + str( datetime.datetime.now() ) )
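
If these steps were ever moved out of the notebook into a single script, the close calls could be registered up front so they run even when a step in the middle fails. This is a sketch using `contextlib.ExitStack` from the standard library. `FakeResource` is a stand-in invented for this example; in real use the callbacks would be the `close()` and `dispose()` methods of the cursor, connection, and engines above.

```python
from contextlib import ExitStack


class FakeResource:
    """Stand-in for a cursor, connection or engine; records when it is closed."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def close(self):
        self.log.append("closed " + self.name)


close_log = []

with ExitStack() as cleanup:
    # Register each close as soon as the resource exists.
    connection = FakeResource("connection", close_log)
    cleanup.callback(connection.close)

    cursor = FakeResource("cursor", close_log)
    cleanup.callback(cursor.close)

    # ... migration and cleanup work would go here ...

# Callbacks run in reverse order of registration when the block exits.
print(close_log)
```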