# Converting data from MySQL to PostgreSQL using pandas

# Table of Contents

- [Setup](#Setup)

    - [Setup - Imports](#Setup---Imports)
    - [Setup - Database](#Setup---Database)
    
        - [Setup - Database - SQLAlchemy](#Setup---Database---SQLAlchemy)
        - [Setup - Database - psycopg2](#Setup---Database---psycopg2)
    
    - [Setup - Functions](#Setup---Functions)
    
        - [Setup - Function `column_names_to_lower_case`](#Setup---Function-column_names_to_lower_case)
        
- [Migrate data from MySQL to PostgreSQL](#Migrate-data-from-MySQL-to-PostgreSQL)
- [Data cleanup](#Data-cleanup)

    - [Cleanup - `machine_learning.arra_funded`](#Cleanup---machine_learning.arra_funded)
    - [Cleanup - unique constraints for old multi-part keys](#Cleanup---unique-constraints-for-old-multi-part-keys)

- [Create tab-delimited files for each table](#Create-tab-delimited-files-for-each-table)
- [TODO](#TODO)
- [Finally...](#Finally...)

# Setup

- back to [Table of Contents](#Table-of-Contents)

## Setup - Imports

- back to [Table of Contents](#Table-of-Contents)

In [1]:
# imports
import datetime
import pandas
import psycopg2
import psycopg2.extras  # needed for psycopg2.extras.DictCursor, used below
import pymysql
import sqlalchemy

print( "packages imported at " + str( datetime.datetime.now() ) )

packages imported at 2016-12-01 11:26:08.379864


## Setup - Database

- back to [Table of Contents](#Table-of-Contents)

In [22]:
# Variables to hold connection information

# ==> MySQL
mysql_username = "<username>"
mysql_password = "<password>"
mysql_host = "localhost"
mysql_port = "3306"
mysql_database = "homework"
mysql_charset = "utf8"

mysql_host = "cuspdev.local"
mysql_username = "jonathanmorgan"
mysql_password = "today123"

# ==> PostgreSQL
pgsql_username = "<username>"
pgsql_password = "<password>"
pgsql_host = "localhost"
pgsql_port = "5432"
pgsql_database = "homework"
pgsql_encoding = "utf8"

pgsql_host = "cuspdev.local"
pgsql_username = "jonathanmorgan"
pgsql_password = "today123"

print( "database connection info defined at " + str( datetime.datetime.now() ) )

database connection info defined at 2016-12-01 14:16:11.584624
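
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. As one possible alternative, the sketch below reads the same settings from environment variables and falls back to the placeholder values used above. The helper name `load_connection_settings` and the environment variable names (`MYSQL_USERNAME`, `PGSQL_HOST`, and so on) are assumptions made for illustration; the original notebook does not define them.

```python
import os

def load_connection_settings( prefix_IN, default_port_IN ):

    # Hypothetical helper: build a dictionary of connection settings from
    #     environment variables named <prefix>_USERNAME, <prefix>_HOST, etc.
    #     Falls back to the notebook's placeholder values when a variable is unset.
    settings_OUT = {}
    settings_OUT[ "username" ] = os.environ.get( prefix_IN + "_USERNAME", "<username>" )
    settings_OUT[ "password" ] = os.environ.get( prefix_IN + "_PASSWORD", "<password>" )
    settings_OUT[ "host" ] = os.environ.get( prefix_IN + "_HOST", "localhost" )
    settings_OUT[ "port" ] = os.environ.get( prefix_IN + "_PORT", default_port_IN )
    settings_OUT[ "database" ] = os.environ.get( prefix_IN + "_DATABASE", "homework" )
    return settings_OUT

#-- END function load_connection_settings() --#

# one settings dictionary per database.
mysql_settings = load_connection_settings( "MYSQL", "3306" )
pgsql_settings = load_connection_settings( "PGSQL", "5432" )

print( "MySQL host: " + mysql_settings[ "host" ] )
print( "PostgreSQL host: " + pgsql_settings[ "host" ] )
```

The values could then be copied into `mysql_username`, `pgsql_host`, and the other variables above, so the rest of the notebook stays unchanged.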


### Setup - Database - SQLAlchemy

- back to [Table of Contents](#Table-of-Contents)

In [20]:
# Create SQLAlchemy connections to both MySQL and PostgreSQL

# declare variables
connection_string = ""
execution_option_dict = None
mysql_engine = None
pgsql_engine = None

# shared execution options
execution_option_dict = {}
execution_option_dict[ "stream_results" ] = True
#execution_option_dict[ "autocommit" ] = True

# ==> MySQL

# Create database engine for pandas.
connection_string = "mysql+pymysql://" + mysql_username + ":" + mysql_password + "@" + mysql_host + ":" + mysql_port + "/" + mysql_database + "?charset=" + mysql_charset
mysql_engine = sqlalchemy.create_engine( connection_string, execution_options = execution_option_dict )

# ==> PostgreSQL

# Create database engine for pandas.
connection_string = "postgresql+psycopg2://" + pgsql_username + ":" + pgsql_password + "@" + pgsql_host + ":" + pgsql_port + "/" + pgsql_database + "?client_encoding=" + pgsql_encoding
pgsql_engine = sqlalchemy.create_engine( connection_string, execution_options = execution_option_dict )

print( "sqlalchemy database engines created at " + str( datetime.datetime.now() ) )

sqlalchemy database engines created at 2016-12-01 13:58:29.290470
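
The connection strings above are built by plain concatenation, which works as long as the username and password contain no characters that are special in a URL (such as `@`, `/` or `:`). A minimal sketch of one way to guard against that, assuming percent-encoding with the standard library's `urllib.parse.quote` is acceptable, follows. The function name `build_connection_url` and the example password are hypothetical.

```python
from urllib.parse import quote

def build_connection_url( driver_IN, username_IN, password_IN, host_IN, port_IN, database_IN, query_IN = "" ):

    # Hypothetical helper: percent-encode the username and password so that
    #     characters like "@" and "/" cannot be mistaken for URL delimiters.
    url_OUT = driver_IN + "://" + quote( username_IN, safe = "" ) + ":" + quote( password_IN, safe = "" )
    url_OUT += "@" + host_IN + ":" + port_IN + "/" + database_IN

    # append the query string (charset, client_encoding, ...) if one was passed in.
    if query_IN:
        url_OUT += "?" + query_IN

    return url_OUT

#-- END function build_connection_url() --#

# example with a made-up password that contains "@" and "/".
example_url = build_connection_url( "postgresql+psycopg2", "example_user", "p@ss/word", "localhost", "5432", "homework", "client_encoding=utf8" )
print( example_url )
```

The resulting string can be passed to `sqlalchemy.create_engine()` in place of the concatenated `connection_string`.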


### Setup - Database - psycopg2

- back to [Table of Contents](#Table-of-Contents)

Create connection and cursor for things that break SQLAlchemy (sigh).

In [25]:
# create psycopg2 connection to PostgreSQL using connection variables defined above.
pgsql_connection = psycopg2.connect( host = pgsql_host, port = pgsql_port, database = pgsql_database, user = pgsql_username, password = pgsql_password )

print( "psycopg2 database connection created at " + str( datetime.datetime.now() ) )

psycopg2 database connection created at 2016-12-01 14:26:44.650466


In [26]:
# create psycopg2 cursor using pgsql_connection.
pgsql_cursor = pgsql_connection.cursor( cursor_factory = psycopg2.extras.DictCursor )

print( "psycopg2 database cursor created at " + str( datetime.datetime.now() ) )

psycopg2 database cursor created at 2016-12-01 14:26:46.303424


## Setup - Functions

- back to [Table of Contents](#Table-of-Contents)

Run the file `data_functions.py`, which contains re-usable database functions.  A list of the functions it declares is printed in the output after you execute the file.

In [24]:
# Must be run in the /data folder.
%run data_functions.py

Function column_names_to_lower_case() declared at 2016-12-01 14:21:58.253367
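
The contents of `data_functions.py` are not shown in this notebook. As a rough guess at what `column_names_to_lower_case()` does, the sketch below returns a copy of a DataFrame with every column name lower-cased. The name `column_names_to_lower_case_sketch` and its body are assumptions, not the actual code in that file.

```python
import pandas

def column_names_to_lower_case_sketch( df_IN ):

    # Assumed behavior: return a copy of the DataFrame whose column names
    #     are all lower case, leaving the DataFrame passed in untouched.
    df_OUT = df_IN.copy()
    df_OUT.columns = [ str( column_name ).lower() for column_name in df_OUT.columns ]
    return df_OUT

#-- END function column_names_to_lower_case_sketch() --#

# try it on a small made-up DataFrame with mixed-case column names.
example_df = pandas.DataFrame( { "AwardId": [ 1, 2 ], "FirstName": [ "a", "b" ] } )
lowered_df = column_names_to_lower_case_sketch( example_df )
print( list( lowered_df.columns ) )
```

Lower-casing matters here because PostgreSQL folds unquoted identifiers to lower case, so mixed-case column names would otherwise need to be double-quoted in every query.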


# Migrate data from MySQL to PostgreSQL

- back to [Table of Contents](#Table-of-Contents)

In [45]:
# database name
database_name = "homework"

# make a list of the names of the tables we want to migrate
homework_table_list = []
homework_table_list.append( "machine_learning" )
homework_table_list.append( "nsf_award" )
homework_table_list.append( "text_analysis" )
homework_table_list.append( "uc_pay_2011" )
homework_table_list.append( "ugrant" )
homework_table_list.append( "vendor" )

In [12]:
# declare variables
table_select_string = ""
table_df = None

# for each homework database table, pull it in from MySQL, write it out to PostgreSQL.
for table_name in homework_table_list:
    
    print( "==> starting migration of table " + table_name + " at " + str( datetime.datetime.now() ) )
    
    # read the table into pandas from mysql
    table_select_string = "SELECT * FROM " + database_name + "." + table_name + ";"
    table_df = pandas.read_sql( table_select_string, con = mysql_engine )
    
    # convert column names to lower case
    table_df = column_names_to_lower_case( table_df )
    
    print( table_name + " column names: " + str( list( table_df.columns ) ) )
    
    # write the table into postgresql.
    table_df.to_sql( table_name, con = pgsql_engine )

    print( "<== migration of table " + table_name + " completed at " + str( datetime.datetime.now() ) )
    
#-- END loop over tables --#

==> starting migration of table machine_learning at 2016-11-30 18:40:05.037373
machine_learning column names: ['application_id', 'cfda_code', 'year', 'activity', 'administering_ic', 'arra_funded', 'org_name', 'org_dept', 'topic_id', 'study_section', 'total_cost', 'ed_inst_type']
<== migration of table machine_learning completed at 2016-11-30 18:40:28.929735
==> starting migration of table nsf_award at 2016-11-30 18:40:28.929809
nsf_award column names: ['awardid', 'firstname', 'lastname', 'startdate', 'enddate', 'awardtitle', 'awardeffectivedate', 'awardexpirationdate', 'name', 'cityname', 'zipcode', 'phonenumber', 'streetaddress', 'countryname', 'statename', 'statecode']
<== migration of table nsf_award completed at 2016-11-30 18:40:40.296275
==> starting migration of table text_analysis at 2016-11-30 18:40:40.296351
text_analysis column names: ['application_id', 'abstract_text']
<== migration of table text_analysis completed at 2016-11-30 18:45:55.263207
==> starting migration of tabl
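
The loop above reads each table into memory in full before writing it out, which can be slow or memory-hungry for large tables such as `text_analysis`. One possible variation, sketched below, copies a table in chunks using the `chunksize` argument of `pandas.read_sql()` and `if_exists = "append"` in `to_sql()`. To keep the example runnable without a database server, two in-memory SQLite databases stand in for MySQL and PostgreSQL, and the table contents are made up. Unlike the loop above, this sketch passes `index = False`, so the DataFrame index is not written as a column.

```python
import sqlite3
import pandas

# stand-ins for the MySQL source and PostgreSQL target (assumption for the sketch).
source_connection = sqlite3.connect( ":memory:" )
target_connection = sqlite3.connect( ":memory:" )

# create a small made-up source table with mixed-case column names.
source_df = pandas.DataFrame( { "Application_ID": range( 10 ), "Abstract_Text": [ "text" ] * 10 } )
source_df.to_sql( "text_analysis", con = source_connection, index = False )

# copy the table 4 rows at a time.
rows_copied = 0
chunk_iterator = pandas.read_sql( "SELECT * FROM text_analysis;", con = source_connection, chunksize = 4 )
for chunk_df in chunk_iterator:

    # lower-case the column names, as in the migration loop above.
    chunk_df.columns = [ column_name.lower() for column_name in chunk_df.columns ]

    # the first chunk creates the table, later chunks append to it.
    chunk_df.to_sql( "text_analysis", con = target_connection, if_exists = "append", index = False )
    rows_copied += len( chunk_df )

#-- END loop over chunks --#

print( "rows copied: " + str( rows_copied ) )
```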

# Data cleanup

- back to [Table of Contents](#Table-of-Contents)

## Cleanup - `machine_learning.arra_funded`

- back to [Table of Contents](#Table-of-Contents)

In the `machine_learning` table, convert `arra_funded` from text to int (\x01 ==> 1, \x00 ==> 0).  In the MySQL table, the field's type is "bit" (boolean), and pandas doesn't know what to do with that, so the values end up in PostgreSQL as strings holding a hexadecimal escape (`\x00` or `\x01`) rather than a decimal integer.  Interesting.  I wonder what the underlying storage format is?

In [28]:
# get distinct values
sql_string = "SELECT DISTINCT arra_funded AS unique_value FROM machine_learning;"

# run SQL.
pgsql_cursor.execute( sql_string )

# loop over results
for current_row in pgsql_cursor:
    
    # output unique_value
    print( str( current_row[ "unique_value" ] ) )
    
#-- END loop over distinct values in arra_funded --#

\x00
\x01


In [38]:
# First, make new column `arra_funded_int` that is of type "int".
sql_string = "ALTER TABLE public.machine_learning ADD COLUMN arra_funded_int int2;"

# run and commit.
#pgsql_cursor.execute( sql_string )
#pgsql_connection.commit()

print( str( datetime.datetime.now() ) + ": " + sql_string )

2016-12-01 14:56:06.431710: ALTER TABLE public.machine_learning ADD COLUMN arra_funded_int int2;


In [39]:
# Now, set arra_funded_int to 0 where arra_funded = "\x00"...
#     have to escape the back-slash with a back-slash ( "\\" ).
sql_string = "UPDATE public.machine_learning SET arra_funded_int = 0 WHERE arra_funded = '\\x00';"

# run and commit.
#pgsql_cursor.execute( sql_string )
#pgsql_connection.commit()

print( str( datetime.datetime.now() ) + ": " + sql_string )

2016-12-01 14:56:18.087807: UPDATE public.machine_learning SET arra_funded_int = 0 WHERE arra_funded = '\x00';


In [40]:
# ...and set arra_funded_int to 1 where arra_funded = "\x01".
#     have to escape the back-slash with a back-slash ( "\\" ).
sql_string = "UPDATE public.machine_learning SET arra_funded_int = 1 WHERE arra_funded = '\\x01';"

# run and commit.
#pgsql_cursor.execute( sql_string )
#pgsql_connection.commit()

print( str( datetime.datetime.now() ) + ": " + sql_string )

2016-12-01 14:56:27.871735: UPDATE public.machine_learning SET arra_funded_int = 1 WHERE arra_funded = '\x01';


In [41]:
# remove arra_funded column
sql_string = "ALTER TABLE public.machine_learning DROP COLUMN arra_funded;"

# run and commit.
#pgsql_cursor.execute( sql_string )
#pgsql_connection.commit()

print( str( datetime.datetime.now() ) + ": " + sql_string )

2016-12-01 14:56:34.135046: ALTER TABLE public.machine_learning DROP COLUMN arra_funded;


In [42]:
# and rename "arra_funded_int" to "arra_funded"
sql_string = "ALTER TABLE machine_learning RENAME COLUMN arra_funded_int TO arra_funded;"

# run and commit.
#pgsql_cursor.execute( sql_string )
#pgsql_connection.commit()

print( str( datetime.datetime.now() ) + ": " + sql_string )

2016-12-01 14:56:40.826528: ALTER TABLE machine_learning RENAME COLUMN arra_funded_int TO arra_funded;
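
The SQL cleanup above could probably be avoided by converting the column in pandas before the table is written to PostgreSQL. The sketch below assumes the MySQL driver hands each "bit" value to pandas as a one-byte `bytes` object (`b"\x00"` or `b"\x01"`), which the notebook does not confirm. The helper name `bit_bytes_to_int` and the example data are made up.

```python
import pandas

def bit_bytes_to_int( value_IN ):

    # Hypothetical helper: turn a bytes value such as b"\x01" into an integer.
    #     Anything that is not bytes is returned unchanged.
    if isinstance( value_IN, ( bytes, bytearray ) ):
        return int.from_bytes( value_IN, byteorder = "big" )
    return value_IN

#-- END function bit_bytes_to_int() --#

# made-up rows standing in for the machine_learning table.
example_df = pandas.DataFrame( { "application_id": [ 1, 2, 3 ], "arra_funded": [ b"\x00", b"\x01", b"\x00" ] } )

# convert the column in place, then inspect the result.
example_df[ "arra_funded" ] = example_df[ "arra_funded" ].map( bit_bytes_to_int )
print( example_df[ "arra_funded" ].tolist() )
```

In the migration loop, the same `map()` call would go between `pandas.read_sql()` and `to_sql()` for the `machine_learning` table.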


## Cleanup - unique constraints for old multi-part keys

- back to [Table of Contents](#Table-of-Contents)

Convert the old multi-part primary keys on `ugrant` and `vendor` to unique constraints (since each table now has a unique integer primary key):

- ugrant

        PRIMARY KEY (`award_id`,`topic_id`)

- vendor

        PRIMARY KEY (`periodstartdate`,`institutionid`,`paymentamount`,`award_id`)

In [43]:
# ==> ugrant

# create SQL string
sql_string = "ALTER TABLE ugrant ADD CONSTRAINT ugrant_primary_key UNIQUE( award_id, topic_id );"

# run and commit.
#pgsql_cursor.execute( sql_string )
#pgsql_connection.commit()

print( str( datetime.datetime.now() ) + ": " + sql_string )

2016-12-01 14:56:57.451560: ALTER TABLE ugrant ADD CONSTRAINT ugrant_primary_key UNIQUE( award_id, topic_id );


In [44]:
# ==> vendor

# create SQL string
sql_string = "ALTER TABLE vendor ADD CONSTRAINT vendor_primary_key UNIQUE( periodstartdate, institutionid, paymentamount, award_id );"

# run and commit.
#pgsql_cursor.execute( sql_string )
#pgsql_connection.commit()

print( str( datetime.datetime.now() ) + ": " + sql_string )

2016-12-01 14:57:03.358691: ALTER TABLE vendor ADD CONSTRAINT vendor_primary_key UNIQUE( periodstartdate, institutionid, paymentamount, award_id );
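
Adding a unique constraint fails if any existing rows share the same key values, so it can help to check for duplicates first. The sketch below shows one way to do that in pandas with `DataFrame.duplicated()`. The helper name `count_rows_with_duplicate_key` and the example rows are invented for illustration; in practice the DataFrame would come from `pandas.read_sql()` against the `ugrant` or `vendor` table.

```python
import pandas

def count_rows_with_duplicate_key( df_IN, key_column_list_IN ):

    # Hypothetical helper: count rows whose combination of key column values
    #     appears more than once (keep = False flags every row in a duplicate group).
    duplicate_mask = df_IN.duplicated( subset = key_column_list_IN, keep = False )
    return int( duplicate_mask.sum() )

#-- END function count_rows_with_duplicate_key() --#

# made-up rows: the pair ( 11, 1 ) appears twice.
example_df = pandas.DataFrame( { "award_id": [ 10, 10, 11, 11 ], "topic_id": [ 1, 2, 1, 1 ], "amount": [ 5, 6, 7, 8 ] } )

duplicate_row_count = count_rows_with_duplicate_key( example_df, [ "award_id", "topic_id" ] )
print( "rows sharing an ( award_id, topic_id ) pair: " + str( duplicate_row_count ) )
```

A count of zero suggests the `ALTER TABLE ... ADD CONSTRAINT ... UNIQUE` statement should succeed for that set of columns.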


# Create tab-delimited files for each table

- back to [Table of Contents](#Table-of-Contents)

In [46]:
# declare variables
table_select_string = ""
table_df = None

# for each homework database table, pull it in from PostgreSQL, write it out to a tab-delimited file.
for table_name in homework_table_list:
    
    print( "==> starting tab-delimited export of table " + table_name + " at " + str( datetime.datetime.now() ) )
    
    # read the table into pandas from postgresql
    table_select_string = "SELECT * FROM " + table_name + ";"
    table_df = pandas.read_sql( table_select_string, con = pgsql_engine )
    
    # write the table to a tab-delimited file (to_csv with a tab separator).
    output_file_path = table_name + ".tab.txt"
    table_df.to_csv( output_file_path, sep = '\t', encoding = 'utf-8' )

    print( "<== tab-delimited export of table " + table_name + " to " + output_file_path + " completed at " + str( datetime.datetime.now() ) )
    
#-- END loop over tables --#

==> starting tab-delimited export of table machine_learning at 2016-12-01 16:13:32.364543
<== tab-delimited export of table machine_learning to machine_learning.tab.txt completed at 2016-12-01 16:13:33.933228
==> starting tab-delimited export of table nsf_award at 2016-12-01 16:13:33.933322
<== tab-delimited export of table nsf_award to nsf_award.tab.txt completed at 2016-12-01 16:13:34.823583
==> starting tab-delimited export of table text_analysis at 2016-12-01 16:13:34.823662
<== tab-delimited export of table text_analysis to text_analysis.tab.txt completed at 2016-12-01 16:14:55.268936
==> starting tab-delimited export of table uc_pay_2011 at 2016-12-01 16:14:55.269152
<== tab-delimited export of table uc_pay_2011 to uc_pay_2011.tab.txt completed at 2016-12-01 16:14:59.812808
==> starting tab-delimited export of table ugrant at 2016-12-01 16:14:59.812940
<== tab-delimited export of table ugrant to ugrant.tab.txt completed at 2016-12-01 16:14:59.920544
==> starting tab-delimited exp
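
Because free-text columns such as `abstract_text` may themselves contain tab characters, it is worth confirming that an exported file reads back cleanly. The sketch below round-trips a small made-up DataFrame through the same `to_csv()` options used above, writing to an in-memory buffer instead of a file. Since the export writes the DataFrame index as the first column, the read passes `index_col = 0`.

```python
import io
import pandas

# made-up rows, one of which contains an embedded tab.
example_df = pandas.DataFrame( { "application_id": [ 1, 2 ], "abstract_text": [ "first\tabstract", "second" ] } )

# write tab-delimited text to an in-memory buffer rather than a file.
output_buffer = io.StringIO()
example_df.to_csv( output_buffer, sep = "\t" )

# read it back, treating the first column as the index.
output_buffer.seek( 0 )
round_trip_df = pandas.read_csv( output_buffer, sep = "\t", index_col = 0 )

print( list( round_trip_df.columns ) )
print( round_trip_df[ "abstract_text" ].tolist() == example_df[ "abstract_text" ].tolist() )
```

With pandas' default quoting, a field that contains the separator is wrapped in quotes on output, which is what lets the embedded tab survive the round trip.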

# TODO

- back to [Table of Contents](#Table-of-Contents)

TODO:

- update notebooks to refer to lower-case names and new table names, and to use the SQLAlchemy way of making SQL calls rather than a direct call via DBAPI.  See if it works for both MySQL and PostgreSQL.
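
As a starting point for that TODO, here is a minimal sketch of running SQL through a SQLAlchemy engine with `sqlalchemy.text()` instead of a DBAPI cursor. It uses an in-memory SQLite engine and a made-up `machine_learning` table so that it runs without a database server. Whether the same pattern behaves identically against the MySQL and PostgreSQL engines defined above is exactly what the TODO still needs to verify.

```python
import sqlalchemy

# in-memory SQLite engine as a stand-in for mysql_engine / pgsql_engine (assumption).
example_engine = sqlalchemy.create_engine( "sqlite:///:memory:" )

# engine.begin() opens a transaction and commits it when the block exits.
with example_engine.begin() as connection:

    connection.execute( sqlalchemy.text( "CREATE TABLE machine_learning ( application_id INTEGER, arra_funded INTEGER )" ) )
    connection.execute( sqlalchemy.text( "INSERT INTO machine_learning VALUES ( 1, 0 ), ( 2, 1 ), ( 3, 1 )" ) )

#-- END transaction --#

# read the distinct values back, as the cleanup section does with a cursor.
with example_engine.connect() as connection:

    result = connection.execute( sqlalchemy.text( "SELECT DISTINCT arra_funded AS unique_value FROM machine_learning ORDER BY unique_value" ) )
    unique_value_list = [ current_row[ 0 ] for current_row in result ]

#-- END connection --#

print( unique_value_list )

example_engine.dispose()
```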

# Finally...

- back to [Table of Contents](#Table-of-Contents)

Close engines, connections, etc.:

In [31]:
# sometimes you'll need to roll back (for example, after a failed statement).
pgsql_connection.rollback()

In [19]:
# close SQLAlchemy engines.
mysql_engine.dispose()
pgsql_engine.dispose()

# close psycopg2 DBAPI connection.
pgsql_cursor.close()
pgsql_connection.close()

print( "database connections closed (disposed) at " + str( datetime.datetime.now() ) )

database connections closed (disposed) at 2016-12-01 13:58:24.887436
