# Building and updating a data table

As one derives more complicated data about entities of interest in research, it can be helpful to keep that information together in one place.  To do this, one can create a table that is intended to hold finalized data and update it with new derived information as that information is generated.

Below we outline:

- Creating a table, including updating its ownership and permissions so that others in your project can access and work with it.
- Updating that table by:

    - Creating a new copy of the table with additional rows, then deleting the old version.
    - Adding a column to a table, then updating columns in a table either by:

        - looping in Python and updating row-by-row
        - updating the column (or columns) all at once with a single UPDATE command.

# Table of Contents

- [Setup - Imports](#Setup---Imports)
- [Setup - Database](#Setup---Database)

    - [Setup - Database - SQLAlchemy](#Setup---Database---SQLAlchemy)
    - [Setup - Database - psycopg2](#Setup---Database---psycopg2)
    - [Setup - Database - rollback if needed](#Setup---Database---rollback-if-needed)
    
- [Overview](#Overview)
- [Create a new table](#Create-a-new-table)

    - [CREATE table](#CREATE-table)
    - [CREATE table from results of SELECT](#CREATE-table-from-results-of-SELECT)
    - [Provide access to table](#Provide-access-to-table)

- [Adding columns to data table](#Adding-columns-to-data-table)

    - [Option 1 - Add columns using CREATE TABLE AS](#Option-1---Add-columns-using-CREATE-TABLE-AS)
    - [Add columns to existing table](#Add-columns-to-existing-table)

        - [Option 2 - UPDATE all at once using SQL](#Option-2---UPDATE-all-at-once-using-SQL)
        - [Option 3 - For each row, calculate and store value using Python](#Option-3---For-each-row,-calculate-and-store-value-using-Python)
        
- [Doing work step-by-step using temporary tables](#Doing-work-step-by-step-using-temporary-tables)
- [Choosing between the three options](#Choosing-between-the-three-options)

# Setup - Imports

- back to [Table of Contents](#Table-of-Contents)

In [None]:
# general use imports
import datetime
import glob
import inspect
import numpy
import os
import six
import warnings

# pandas-related imports
import pandas
import sqlalchemy

# CSV file reading-related imports
import csv

# database interaction imports
import psycopg2
import psycopg2.extras

print( "Imports loaded at " + str( datetime.datetime.now() ) )

# Setup - Database

- back to [Table of Contents](#Table-of-Contents)

In [None]:
# schema name
schema_name = "ildoc"

# admin role
admin_role = schema_name + "_admin"
select_role = schema_name + "_select"

# ==> database table names - store reused database information in variables here.

# work table name
work_db_table = "person"

print( "Database variables initialized at " + str( datetime.datetime.now() ) )

In [None]:
# Database connection properties
db_host = "10.10.2.10"
db_port = -1
db_username = None
db_password = None
db_name = "appliedda"

print( "Database connection properties initialized at " + str( datetime.datetime.now() ) )

## Setup - Database - `SQLAlchemy`

- back to [Table of Contents](#Table-of-Contents)

Initialize database connections.  First, SQLAlchemy engine:

In [None]:
# initialize database connections
# Create connection to database using SQLAlchemy
#     (3 '/' indicates use environment settings for username, host, and port)
sqlalchemy_connection_string = "postgresql://"

if ( ( db_host is not None ) and ( db_host != "" ) ):
    sqlalchemy_connection_string += str( db_host )
#-- END check to see if host --#

sqlalchemy_connection_string += "/"

if ( ( db_name is not None ) and ( db_name != "" ) ):
    sqlalchemy_connection_string += str( db_name )
#-- END check to see if database name --#

# create engine.
pgsql_engine = sqlalchemy.create_engine( sqlalchemy_connection_string )

print( "SQLAlchemy engine created at " + str( datetime.datetime.now() ) )
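The string-building steps above can also be wrapped in a small function, which makes the resulting URL easy to inspect before it is handed to `sqlalchemy.create_engine()`. This is only a minimal sketch: `build_connection_string` is a hypothetical helper name, not part of SQLAlchemy, and it covers just the host and database name used in this notebook.

```python
def build_connection_string( host = None, database = None ):

    """
    Build a "postgresql://<host>/<database>" URL.  Empty or missing parts are
    left out, so those settings fall back to the environment, as noted above.
    """

    connection_string = "postgresql://"

    # add host only if one was provided.
    if host:
        connection_string += str( host )

    connection_string += "/"

    # add database name only if one was provided.
    if database:
        connection_string += str( database )

    return connection_string


# values match the connection properties used in this notebook.
example_url = build_connection_string( host = "10.10.2.10", database = "appliedda" )
print( example_url )
```

Printing the URL before creating the engine is a cheap way to catch a missing host or database name.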

## Setup - Database - `psycopg2`

- back to [Table of Contents](#Table-of-Contents)

And then a direct psycopg2 connection and cursor:

In [None]:
# create psycopg2 connection to Postgresql

# example connect() call that uses all the possible parameters
#pgsql_connection = psycopg2.connect( host = db_host, port = db_port, database = db_name, user = db_username, password = db_password )

# as with SQLAlchemy above, pass only host and database name; other settings come from the environment.
pgsql_connection = psycopg2.connect( host = db_host, database = db_name )

print( "Postgresql connection to database \"" + db_name + "\" created at " + str( datetime.datetime.now() ) )

In [None]:
# create a cursor that maps column names to values
pgsql_cursor = pgsql_connection.cursor( cursor_factory = psycopg2.extras.DictCursor )

print( "Postgresql cursor for database \"" + db_name + "\" created at " + str( datetime.datetime.now() ) )

## Setup - Database - rollback if needed

- back to [Table of Contents](#Table-of-Contents)

In [None]:
# rollback, in case you need it.
pgsql_connection.rollback()

print( "Postgresql connection for database \"" + db_name + "\" rolled back at " + str( datetime.datetime.now() ) )
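Most cells in this notebook follow the same pattern: execute a statement, then commit, with a rollback needed if the statement fails. The sketch below combines those steps in one hypothetical helper, `run_sql`. It is demonstrated against an in-memory `sqlite3` database purely so that it runs without a database server. The assumption is that a `psycopg2` connection can be passed in the same way, since both follow the Python DB-API (`cursor()`, `commit()`, `rollback()`).

```python
import sqlite3


def run_sql( connection, sql_string ):

    """Execute one statement and commit; roll back and re-raise on any error."""

    cursor = connection.cursor()
    try:
        cursor.execute( sql_string )
        connection.commit()
    except Exception:
        # roll back so the connection is usable for the next statement.
        connection.rollback()
        raise
    finally:
        cursor.close()


# stand-in connection so the sketch runs without a database server.
demo_connection = sqlite3.connect( ":memory:" )
run_sql( demo_connection, "CREATE TABLE person ( recptno INTEGER, ssn_hash TEXT );" )

# a statement that fails is rolled back, and the error is still reported.
failed = False
try:
    run_sql( demo_connection, "SELECT * FROM no_such_table;" )
except sqlite3.Error:
    failed = True

print( "Second statement failed and was rolled back: " + str( failed ) )
```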

# Overview

- back to [Table of Contents](#Table-of-Contents)

As one derives more complicated data about entities of interest in research, it can be helpful to keep that information together in one place.  To do this, one can create a table that is intended to hold finalized data and update it with new derived information as that information is generated.

When building up a data table like this, the basic strategy is to start with a table where each unit of interest for your analysis gets its own row (so, one person per row, or one company per row, or one set of quarterly earnings per row), then add columns to the table as you figure out logic to derive data points for use in your analysis and modeling.

There are a number of ways you can do this.  Below, we will use an example of creating a data table for heads of household from the IDHS benefits data.  We will show how to create a table with a row per person, then show three options for adding columns to hold new variables/features:

- Create a copy of the table that also includes new columns, using the `CREATE TABLE ... AS` SQL statement.
- Add the columns to your table, then update by:

    - using the SQL `UPDATE` statement to derive and set the values for the column using SQL.
    - using Python and SQL to derive column values in Python, then update the table row-by-row.
    

Each method has different strengths and weaknesses, and we'll discuss those in a summary section at the end of the notebook.

# Create a new table

- back to [Table of Contents](#Table-of-Contents)

First, you'll want to create a table to hold your analysis data and configure it so it is usable by the others in your project.

For this example, we'll create a table to hold the following information for each head of household:

- recptno
- sex
- rac
- rootrace
- foreignbn
- ssn_hash
- fname_hash
- lname_hash
- birth_date

## CREATE table

- back to [Table of Contents](#Table-of-Contents)

The basic syntax for creating a table is:

    CREATE TABLE <schema_name>.<table_name>
    (
        <column_list>
    );
    
So, to create a table to hold our head of household information:

In [None]:
# Create table - declare variables
table_name = ""
table_name = work_db_table

# generate SQL
sql_string = "CREATE TABLE " + schema_name + "." + table_name

# add columns
sql_string += " ("
sql_string += " id BIGSERIAL PRIMARY KEY"
sql_string += ", recptno bigint"
sql_string += ", sex bigint"
sql_string += ", rac bigint"
sql_string += ", rootrace bigint"
sql_string += ", foreignbn bigint"
sql_string += ", ssn_hash text"
sql_string += ", fname_hash text"
sql_string += ", lname_hash text"
sql_string += ", birth_date date"
sql_string += " )"
sql_string += ";"

print( "====> " + str( sql_string ) )

# run SQL
pgsql_cursor.execute( sql_string )
pgsql_connection.commit()

#temp_df = pandas.read_sql( sql_string, con = pgsql_engine )
#temp_length = len( temp_df )
#temp_df.head( n = temp_length )

print( "====> " + str( sql_string ) + " completed at " + str( datetime.datetime.now() ) )

## CREATE table from results of SELECT

- back to [Table of Contents](#Table-of-Contents)

When you start with a data table you want to build on, you can use the CREATE TABLE AS syntax to make a copy of an existing table, populated with the data from that table (or a subset).

The most basic form of the CREATE TABLE AS syntax:

    CREATE TABLE <schema_name>.<table_name>
    AS
    SELECT <column_list> FROM <source_schema_name>.<source_table_name>;
    
So, to create a data work table that includes our columns of interest for all the people in the `idhs.hh_member` table:

In [None]:
# Create table - declare variables
table_name = ""
table_name = work_db_table

# generate SQL
sql_string = "CREATE TABLE " + schema_name + "." + table_name
sql_string += " AS SELECT"
sql_string += " recptno"
sql_string += ", sex"
sql_string += ", rac"
sql_string += ", rootrace"
sql_string += ", foreignbn"
sql_string += ", ssn_hash"
sql_string += ", fname_hash"
sql_string += ", lname_hash"
sql_string += ", birth_date"
sql_string += " FROM idhs.hh_member"
#sql_string += " WHERE EXTRACT( year FROM birth_date ) = 1976"
#sql_string += " LIMIT 1000"
sql_string += ";"

print( "====> " + str( sql_string ) )

# run SQL
pgsql_cursor.execute( sql_string )
pgsql_connection.commit()

#temp_df = pandas.read_sql( sql_string, con = pgsql_engine )
#temp_length = len( temp_df )
#temp_df.head( n = temp_length )

print( "====> " + str( sql_string ) + " completed at " + str( datetime.datetime.now() ) )

When you create a table this way, you can expand the `AS` `SELECT` to join multiple tables and create derived columns (more on this later), and you can also use its `WHERE` clause to filter and subset.
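As a minimal sketch of filtering and deriving a column in the same statement, the example below copies only people born in 1976 and adds a derived `bdate_year` column. It uses an in-memory `sqlite3` database and made-up rows so that it runs without a server. The table names `hh_member` and `person_1976` and the sample data are assumptions for illustration. SQLite's `strftime()` stands in for PostgreSQL's `EXTRACT( year FROM birth_date )`, which appears in the commented-out line above.

```python
import sqlite3

connection = sqlite3.connect( ":memory:" )
cursor = connection.cursor()

# made-up source rows standing in for idhs.hh_member.
cursor.execute( "CREATE TABLE hh_member ( recptno INTEGER, sex INTEGER, birth_date TEXT );" )
cursor.executemany(
    "INSERT INTO hh_member VALUES ( ?, ?, ? );",
    [ ( 1, 1, "1976-03-14" ), ( 2, 2, "1980-11-02" ), ( 3, 2, "1976-07-30" ) ]
)

# copy a subset of rows and add a derived column in one statement.
# PostgreSQL equivalent of the strftime() call: EXTRACT( year FROM birth_date )
sql_string = "CREATE TABLE person_1976"
sql_string += " AS SELECT recptno, sex, birth_date"
sql_string += ", CAST( strftime( '%Y', birth_date ) AS INTEGER ) AS bdate_year"
sql_string += " FROM hh_member"
sql_string += " WHERE strftime( '%Y', birth_date ) = '1976'"
sql_string += ";"

cursor.execute( sql_string )
connection.commit()

# check the result: only the two 1976 rows, each with the derived year.
cursor.execute( "SELECT recptno, bdate_year FROM person_1976 ORDER BY recptno;" )
subset_rows = cursor.fetchall()
print( subset_rows )
```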

## Provide access to table

- back to [Table of Contents](#Table-of-Contents)

Once you've created a new table, you need to provide access to it for the rest of your project members.  This includes:

- Changing the ownership of the table so it is owned by your project's database admin role ("`<project>_admin`").
- Giving your project's admin role all privileges on the table.
- Giving your project's read-only role ("`<project>_select`") "SELECT" privileges on the table.

In [None]:
# UPDATE ownership - declare variables
table_name = ""
table_name = work_db_table

# generate SQL
sql_string = "ALTER TABLE " + schema_name + "." + table_name
sql_string += " OWNER TO " + admin_role
sql_string += ";"

# run SQL
pgsql_cursor.execute( sql_string )
pgsql_connection.commit()

#temp_df = pandas.read_sql( sql_string, con = pgsql_engine )
#temp_length = len( temp_df )
#temp_df.head( n = temp_length )

print( "====> " + str( sql_string ) + " completed at " + str( datetime.datetime.now() ) )

In [None]:
# admin_role privileges - declare variables
table_name = ""
table_name = work_db_table

# generate SQL
sql_string = "GRANT ALL PRIVILEGES ON TABLE " + schema_name + "." + table_name
sql_string += " TO " + admin_role
sql_string += ";"

# run SQL
pgsql_cursor.execute( sql_string )
pgsql_connection.commit()

#temp_df = pandas.read_sql( sql_string, con = pgsql_engine )
#temp_length = len( temp_df )
#temp_df.head( n = temp_length )

print( "====> " + str( sql_string ) + " completed at " + str( datetime.datetime.now() ) )

In [None]:
# select_role privileges - declare variables
table_name = ""
table_name = work_db_table

# generate SQL
sql_string = "GRANT SELECT ON TABLE " + schema_name + "." + table_name
sql_string += " TO " + select_role
sql_string += ";"

# run SQL
pgsql_cursor.execute( sql_string )
pgsql_connection.commit()

#temp_df = pandas.read_sql( sql_string, con = pgsql_engine )
#temp_length = len( temp_df )
#temp_df.head( n = temp_length )

print( "====> " + str( sql_string ) + " completed at " + str( datetime.datetime.now() ) )
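Because the same three statements are needed for every new table, it can help to generate them together. The sketch below is one way to do that. `build_access_sql` is a hypothetical helper name, and it only builds the SQL strings, so it runs without a database connection.

```python
def build_access_sql( schema, table, admin_role, select_role ):

    """Return the three access statements for one table, in the order to run them."""

    qualified_name = schema + "." + table

    return [
        "ALTER TABLE " + qualified_name + " OWNER TO " + admin_role + ";",
        "GRANT ALL PRIVILEGES ON TABLE " + qualified_name + " TO " + admin_role + ";",
        "GRANT SELECT ON TABLE " + qualified_name + " TO " + select_role + ";",
    ]


# values match the schema, table, and role variables set earlier in this notebook.
access_sql_list = build_access_sql( "ildoc", "person", "ildoc_admin", "ildoc_select" )

for access_sql in access_sql_list:
    print( access_sql )
    # with a live connection, run each one as in the cells above:
    # pgsql_cursor.execute( access_sql )
    # pgsql_connection.commit()
```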

# Adding columns to data table

- back to [Table of Contents](#Table-of-Contents)

Once you have created your initial table, you can start adding on additional columns with more detailed information on the person in each row.

There are a number of ways one can do this:

- Use the "CREATE TABLE AS" syntax to make a new copy of the table, adding the code to derive additional values into the `AS SELECT` section of the SQL.
- Add columns to your existing table using `ALTER TABLE`, then populate those columns by:

    - building the logic for deriving the values for new columns entirely in SQL and using "`UPDATE ... FROM`".
    - using a combination of Python and SQL to build logic for deriving values in Python for a given row, then updating the table row-by-row.

## Option 1 - Add columns using CREATE TABLE AS

- back to [Table of Contents](#Table-of-Contents)

The first option is to add columns to your data set by creating a new table from the existing one using "`CREATE TABLE ... AS`".  The SELECT includes all of the existing columns from the old table, along with the new columns and the logic to derive them.

When you use this method, each time you create a new table, you should also:

- provide access to the new table, as documented above in [Provide access to table](#Provide-access-to-table).
- "`DROP`" the previous version of the table, so that you free up space you no longer need.

For example, the SQL below creates a new table from our existing head of household table, adding a given person's "`docnbr`" from Illinois Department of Corrections data for any heads of household in our data table whose "`ssn_hash`" matches one in the "`ildoc.person`" table.

First, create the new table:

In [None]:
# Create table - declare variables
existing_table_name = work_db_table
new_table_name = existing_table_name + "_001"

# generate SQL
sql_string = "CREATE TABLE " + schema_name + "." + new_table_name
sql_string += " AS SELECT"
sql_string += " existing.recptno"
sql_string += ", existing.sex"
sql_string += ", existing.rac"
sql_string += ", existing.rootrace"
sql_string += ", existing.foreignbn"
sql_string += ", existing.ssn_hash"
sql_string += ", existing.fname_hash"
sql_string += ", existing.lname_hash"
sql_string += ", existing.birth_date"
sql_string += ", ildoc.ildoc_docnbr"
sql_string += " FROM " + schema_name + "." + existing_table_name + " existing"
sql_string += " LEFT OUTER JOIN ildoc.person ildoc ON ( existing.ssn_hash = ildoc.ssn_hash )"
#sql_string += " WHERE EXTRACT( year FROM birth_date ) = 1976"
#sql_string += " LIMIT 1000"
sql_string += ";"

print( "====> " + str( sql_string ) )

# run SQL
pgsql_cursor.execute( sql_string )
pgsql_connection.commit()

#temp_df = pandas.read_sql( sql_string, con = pgsql_engine )
#temp_length = len( temp_df )
#temp_df.head( n = temp_length )

print( "====> " + str( sql_string ) + " completed at " + str( datetime.datetime.now() ) )

Provide access to the new table: [Provide access to table](#Provide-access-to-table)

DROP the old table:

In [None]:
# Drop table - declare variables
existing_table_name = work_db_table
new_table_name = existing_table_name + "_001"

# generate SQL
sql_string = "DROP TABLE " + schema_name + "." + existing_table_name
sql_string += ";"

print( "====> " + str( sql_string ) )

# run SQL
pgsql_cursor.execute( sql_string )
pgsql_connection.commit()

#temp_df = pandas.read_sql( sql_string, con = pgsql_engine )
#temp_length = len( temp_df )
#temp_df.head( n = temp_length )

print( "====> " + str( sql_string ) + " completed at " + str( datetime.datetime.now() ) )

## Add columns to existing table

- back to [Table of Contents](#Table-of-Contents)

The next two options are ways to update a table in place by adding columns to hold new data, then populating the new columns.

First, add a column to the table:

In [None]:
# Add column
table_name = work_db_table
column_name = "<column_name>"
column_type = "<column_type>"

# generate SQL
sql_string = "ALTER TABLE " + schema_name + "." + table_name

# add the new column
sql_string += " ADD COLUMN " + column_name + " " + column_type

sql_string += ";"

# run SQL
pgsql_cursor.execute( sql_string )
pgsql_connection.commit()

#temp_df = pandas.read_sql( sql_string, con = pgsql_engine )
#temp_length = len( temp_df )
#temp_df.head( n = temp_length )

print( "====> " + str( sql_string ) + " completed at " + str( datetime.datetime.now() ) )

### Option 2 - UPDATE all at once using SQL

- back to [Table of Contents](#Table-of-Contents)

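A very basic example, much simpler than the join in Option 1: fill a date column by converting a text column, for every row where the date is still empty and the text value is present.  The `start_date` and `start_date_orig` columns are illustrative; they are not among the columns of the table created above.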

In [None]:
# ==> Set `start_date` from `start_date_orig`.

# declare variables
current_column = "start_date"

# UPDATE
sql_string = "UPDATE " + schema_name + "." + work_db_table
sql_string += " SET " + current_column + " = TO_DATE( " + current_column + "_orig, 'YYYY-MM-DD' )"

# WHERE clause
where_clause = "WHERE " + current_column + " IS NULL"
where_clause += " AND ( ( " + current_column + "_orig IS NOT NULL ) AND ( " + current_column + "_orig != '' ) )"
sql_string += " " + where_clause

sql_string += ";"

print( "SQL: " + sql_string )

# run SQL
pgsql_cursor.execute( sql_string )
pgsql_connection.commit()

print( "UPDATEd " + where_clause + " at " + str( datetime.datetime.now() ) )
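The list of options earlier mentions "`UPDATE ... FROM`", which is PostgreSQL's form for setting a column from the rows of another table. The sketch below builds such a statement to copy a column across a join on `ssn_hash`. It only constructs the SQL string, so it runs without a database. `build_update_from_sql` and the target table name `myproject.hh_person` are hypothetical, and the statement assumes the target column has already been added with `ALTER TABLE`.

```python
def build_update_from_sql( target_table, source_table, column_name, join_column ):

    """Build a PostgreSQL UPDATE ... FROM that copies one column across a join."""

    sql_string = "UPDATE " + target_table + " AS target"
    sql_string += " SET " + column_name + " = source." + column_name
    sql_string += " FROM " + source_table + " AS source"
    sql_string += " WHERE target." + join_column + " = source." + join_column

    # only touch rows that have not been filled in yet.
    sql_string += " AND target." + column_name + " IS NULL"
    sql_string += ";"

    return sql_string


# hypothetical target table; source table and column follow the Option 1 example.
update_sql = build_update_from_sql( "myproject.hh_person", "ildoc.person", "ildoc_docnbr", "ssn_hash" )
print( update_sql )
```

Rows in the target table with no match in the source table are left unchanged, which mirrors the `NULL` values a `LEFT OUTER JOIN` would produce in Option 1.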

### Option 3 - For each row, calculate and store value using Python

- back to [Table of Contents](#Table-of-Contents)

To update each row individually:

- Loop over your table of interest.
- For each row, get info needed from that row, then calculate whatever value you care about for that row.
- Store the value either back to the row in your table of interest, or store derived information elsewhere as appropriate.

In [None]:
# declare variables
sql_string = ""
id_column_name = ""
recptno_value = None
ssn_hash_value = None
bdate_year_value = None
work_cursor = None
work_year_18 = None
work_year_19 = None
work_year_20 = None
work_year_21 = None
work_year_in_list = []
in_q3_year_list = []
non_q3_year_list = []
years_worked_in_q3 = -1
years_worked_non_q3 = -1
years_worked_only_q3 = -1
row_counter = -1

# declare variables working with work cursor.
work_sql_string = ""
work_row = None
wage_year = None
wage_quarter = None

# make a work cursor, so you can query and update independent of your loop over your people.
work_cursor = pgsql_connection.cursor( cursor_factory = psycopg2.extras.DictCursor )

# get IDs from work table.
sql_string = "SELECT * FROM " + schema_name + "." + work_db_table

# only get rows that have not yet been updated - this lets you pick up if the program is interrupted.
#sql_string += " WHERE years_worked_in_q3 IS NULL"

sql_string += ";"

print( sql_string )

# get list of records in person file, so we can process one-by-one
pgsql_cursor.execute( sql_string )
row_counter = 0
for current_row in pgsql_cursor:
    
    # increment row Counter
    row_counter += 1
    
    # initialize variables to make sure we empty out from last row.
    recptno_value = None
    ssn_hash_value = None
    bdate_year_value = None
    work_year_18 = -1
    work_year_19 = -1
    work_year_20 = -1
    work_year_21 = -1
    work_year_in_list = []
    in_q3_year_list = []
    non_q3_year_list = []
    work_sql_string = ""
    years_worked_in_q3 = -1
    years_worked_non_q3 = -1
    years_worked_only_q3 = -1
    
    # get values from record
    recptno_value = current_row.get( "recptno", None )
    ssn_hash_value = current_row.get( "ssn_hash", None )
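    # note: assumes a `bdate_year` column has already been added to the table
    #     (for example, derived from `birth_date`).
    bdate_year_value = current_row.get( "bdate_year", None )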
    
    # for that recipient, perform logic to derive value for recipient.
    
    # Example: number of years worked Q3 for work_years 18-21,
    #     number of years worked in quarters outside of Q3,
    #     and number of years worked only in Q3.
    work_year_18 = bdate_year_value + 18
    work_year_19 = work_year_18 + 1
    work_year_20 = work_year_19 + 1
    work_year_21 = work_year_20 + 1
    
    # make list of work years, converted to strings for use in query.
    work_year_in_list = [ str( work_year_18 ), str( work_year_19 ), str( work_year_20 ), str( work_year_21 ) ]
    
    # get all wage records for this person in these years.
    in_q3_year_list = []
    non_q3_year_list = []
    
    # create SQL to retrieve wage records...
    work_sql_string = "SELECT * FROM ides.il_wage"
    
    # ...for the current person...
    work_sql_string += " WHERE ssn = '" + ssn_hash_value + "'"
    
    # ...in the specified years...
    work_sql_string += " AND year IN ( " + ", ".join( work_year_in_list ) + " )"
    
    # ...ordered by year and quarter, ascending.
    work_sql_string += " ORDER BY year ASC, quarter ASC"

    work_sql_string += ";"
    
    # call the query and loop over results
    work_cursor.execute( work_sql_string )
    for work_row in work_cursor:
        
        # ==> Here you do whatever work you need to for a given wage record.
        
        # get data
        wage_year = work_row.get( "year", None )
        wage_quarter = work_row.get( "quarter", None )
        
        # quarter 3?
        if wage_quarter == 3:
            
            # 3 - if year not already in list, add it (can have multiple rows per quarter, so don't count a year twice)
            if wage_year not in in_q3_year_list:
                
                in_q3_year_list.append( wage_year )
                
            #-- END check to see if we need to add year. --#
            
        else:
            
            # 1, 2, or 4 - if year not already in list, add it (can have multiple rows per quarter, so don't count a year twice)
            if wage_year not in non_q3_year_list:
                
                non_q3_year_list.append( wage_year )
                
            #-- END check to see if we need to add year. --#
            
        #-- END check what quarter. --#
        
    #-- END loop over wage records. --#
    
    # ==> calculate values you care about:
    
    years_worked_in_q3 = len( in_q3_year_list )
    years_worked_non_q3 = len( non_q3_year_list )
    years_worked_only_q3 = 0
    
    # loop over q3 year list
    for current_year in in_q3_year_list:
        
        # is that year also in non_q3 list?
        if current_year not in non_q3_year_list:
            
            # no - just q3
            years_worked_only_q3 += 1
            
        #-- END check to see if q3 year in non-q3 year list. --#
        
    #-- END loop over years in Q3 list. --#
    
    # ==> UPDATE
    work_sql_string = "UPDATE " + schema_name + "." + work_db_table
    work_sql_string += " SET years_worked_in_q3 = " + str( years_worked_in_q3 )
    work_sql_string += ", years_worked_non_q3 = " + str( years_worked_non_q3 )
    work_sql_string += ", years_worked_only_q3 = " + str( years_worked_only_q3 )
    work_sql_string += " WHERE recptno = " + str( recptno_value )
    work_sql_string += ";"
    
    # execute and commit.
    work_cursor.execute( work_sql_string )
    pgsql_connection.commit()

    # every <cluster_size> people, output a message
    cluster_size = 1000
    if ( ( row_counter % cluster_size ) == 0 ):
        
        print( "Processed " + str( row_counter ) + " people." )
        
        # if you want, you could also only commit every <cluster_size> UPDATES.
        # - in some cases, this will improve performance.
        # To do this, comment out the commit above, then uncomment this one.
        #pgsql_connection.commit()
        
    #-- END check to see if this is a multiple of <cluster_size> --#
    
#-- END loop over people. --#
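The per-person derivation in the loop above can be separated from the database work, which makes it easy to check on a few hand-made records before running it against the full table. The sketch below is one way to write that logic with Python sets. `count_q3_years` is a hypothetical function name, and the sample wage records are made up.

```python
def count_q3_years( wage_rows ):

    """
    wage_rows: iterable of ( year, quarter ) pairs for one person.
    Returns ( years_worked_in_q3, years_worked_non_q3, years_worked_only_q3 ).
    """

    in_q3_years = set()
    non_q3_years = set()

    # sets ignore repeats, so multiple rows for one quarter count a year once.
    for wage_year, wage_quarter in wage_rows:
        if wage_quarter == 3:
            in_q3_years.add( wage_year )
        else:
            non_q3_years.add( wage_year )

    # years with Q3 wages and no wages in any other quarter.
    only_q3_years = in_q3_years - non_q3_years

    return ( len( in_q3_years ), len( non_q3_years ), len( only_q3_years ) )


# made-up wage records: 1994 has Q3 only, 1995 has Q2 and Q3 (Q3 twice), 1996 has Q1 only.
example_rows = [ ( 1994, 3 ), ( 1995, 2 ), ( 1995, 3 ), ( 1995, 3 ), ( 1996, 1 ) ]
example_counts = count_q3_years( example_rows )
print( example_counts )
```

Inside the loop, the `( year, quarter )` pairs would come from each `work_row`, and the three returned counts would feed the `UPDATE` statement.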

# Doing work step-by-step using temporary tables

- back to [Table of Contents](#Table-of-Contents)

For new variables/features whose derivation involves particularly complicated logic, you will want to break up the work of creating these values to make it easier to validate the steps in your process.  One way to do this is to use the Python update pattern above to build out the feature derivation in Python code, step-by-step.  Another is to use the CREATE TABLE AS pattern to build up your variable/feature over multiple SQL statements, building up the results in temporary tables as you go and eventually joining the results from your temporary table into your main data table.

Example:
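A minimal sketch of the temporary-table approach, deriving the number of years each person worked in Q3 in three steps that can each be inspected before moving on. It uses an in-memory `sqlite3` database and made-up rows so that it runs without a server. The table names (`person`, `il_wage`, `tmp_q3_years`, `tmp_q3_counts`, `person_001`) and the data are assumptions for illustration. The assumption is that the same statements can be run on PostgreSQL through a `psycopg2` cursor, using schema-qualified names for the permanent tables.

```python
import sqlite3

connection = sqlite3.connect( ":memory:" )
cursor = connection.cursor()

# made-up data: a person table and a wage table.
cursor.execute( "CREATE TABLE person ( recptno INTEGER, ssn_hash TEXT );" )
cursor.execute( "CREATE TABLE il_wage ( ssn TEXT, year INTEGER, quarter INTEGER );" )
cursor.executemany( "INSERT INTO person VALUES ( ?, ? );", [ ( 1, "aaa" ), ( 2, "bbb" ) ] )
cursor.executemany(
    "INSERT INTO il_wage VALUES ( ?, ?, ? );",
    [ ( "aaa", 1994, 3 ), ( "aaa", 1995, 3 ), ( "aaa", 1995, 3 ), ( "bbb", 1994, 1 ) ]
)

# Step 1: one row per person per year worked in Q3 (check this table before moving on).
cursor.execute( """
    CREATE TEMPORARY TABLE tmp_q3_years AS
    SELECT DISTINCT ssn, year FROM il_wage WHERE quarter = 3;
""" )

# Step 2: collapse to one count per person.
cursor.execute( """
    CREATE TEMPORARY TABLE tmp_q3_counts AS
    SELECT ssn, COUNT( * ) AS years_worked_in_q3 FROM tmp_q3_years GROUP BY ssn;
""" )

# Step 3: join the result into a new version of the data table.
cursor.execute( """
    CREATE TABLE person_001 AS
    SELECT p.recptno, p.ssn_hash, COALESCE( c.years_worked_in_q3, 0 ) AS years_worked_in_q3
    FROM person p
    LEFT OUTER JOIN tmp_q3_counts c ON ( p.ssn_hash = c.ssn );
""" )
connection.commit()

cursor.execute( "SELECT recptno, years_worked_in_q3 FROM person_001 ORDER BY recptno;" )
step_rows = cursor.fetchall()
print( step_rows )
```

Each temporary table can be queried on its own to validate that step, and temporary tables are dropped automatically when the connection closes.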

# Choosing between the three options

- back to [Table of Contents](#Table-of-Contents)

Some considerations when choosing between these three options:

- _**"`CREATE TABLE AS`" is the quickest way to add columns to a table.**_

    - It is more efficient in a database to create new rows than it is to update existing rows.  This is especially true when you are working with a large data set.  If you are working with a large data set, however, particularly in a shared environment, cleaning up after yourself becomes very important - you'll need to be vigilant about dropping old versions of your table as you go so you don't take all the space in the database.

- _**If your tables are embedded in a system where many processes interact with them, you'll probably want to add on rather than create new.**_

    - If you move from implementing a data project to implementing a data system, you will likely want to move away from making new tables every time you want to add columns and start updating existing tables with new information, so that you minimize the chance of disrupting existing processes that refer to your tables.  Even in a project, adding on to existing tables could make sense if you have a lot of analysis code you want to re-run over time.  Up to a point, disciplined renaming as you go can mitigate this.  Eventually, though, as you build up additional tasks to run each time you recreate the table (indexes, primary key creation, etc.), you should consider adding on to existing tables rather than risking breaking code because you subtly altered the table from one version to the next.

- _**As complexity of logic increases, you'll want to break up your work**_

    - As complexity of logic for deriving a variable/feature increases, you'll want to find some way to break up the work of creating that feature, rather than trying to cram it all into one monolithic SQL statement.  You can do this entirely in SQL by incrementally building up needed values in a temporary table, then joining the results to your data table.  Alternatively, you can write the logic to derive the value in Python, then update each row with the result.  Or, you can do a combination.  Either way, you'll want to break up the work of deriving complex features so that you can verify/validate the steps in the process as you go.

- _**Sometimes reliability or comprehensibility trumps performance**_

    - Sometimes a slower Python loop can be preferable to a monolithic SQL statement if you expect problems that could cause your program to die in the middle.  A Python program can make incremental updates such that it can pick up where it left off if something goes wrong (like a surprise server shutdown).  Sometimes data or an algorithm will be susceptible to dying prematurely, and building robust code will be more important to you than pure performance.
    - The same sort of consideration can be true based on your comfort with different languages.  Sometimes, a slow Python, R, or Stata program you understand that completes in a week and that can pick up if it dies will still be faster than working for weeks on a program in a language with which you are less comfortable.

- _**EXCEPT when data is big enough that you have to tune for performance**_

    - On the other hand, sometimes, with really large data sets, something written a certain way is just too slow.  Then you have to bite the bullet and figure out a way to make your code complete, even if it means using a language you aren't as comfortable with.
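To illustrate the point about picking up where a program left off, the sketch below selects only rows whose derived column is still `NULL`, updates them one at a time, and commits once per batch. It runs against an in-memory `sqlite3` database with made-up data. The `name_length` column and the `process_pending_rows` function are hypothetical, and the `stop_after` parameter exists only to simulate an interruption. SQLite uses `?` as its parameter placeholder, where `psycopg2` uses `%s`.

```python
import sqlite3

connection = sqlite3.connect( ":memory:" )
read_cursor = connection.cursor()
work_cursor = connection.cursor()

# made-up table: name_length starts out NULL for every row.
read_cursor.execute( "CREATE TABLE person ( recptno INTEGER, fname TEXT, name_length INTEGER );" )
read_cursor.executemany(
    "INSERT INTO person ( recptno, fname ) VALUES ( ?, ? );",
    [ ( 1, "Ann" ), ( 2, "Robert" ), ( 3, "Li" ), ( 4, "Maria" ), ( 5, "Sam" ) ]
)
connection.commit()


def process_pending_rows( batch_size = 2, stop_after = None ):

    """Fill in name_length for rows where it is still NULL; return rows processed."""

    # only rows not yet updated, so a rerun skips finished work.
    read_cursor.execute( "SELECT recptno, fname FROM person WHERE name_length IS NULL ORDER BY recptno;" )
    pending_rows = read_cursor.fetchall()

    row_counter = 0
    for recptno_value, fname_value in pending_rows:

        if ( stop_after is not None ) and ( row_counter >= stop_after ):
            break  # simulate an interruption

        work_cursor.execute(
            "UPDATE person SET name_length = ? WHERE recptno = ?;",
            ( len( fname_value ), recptno_value )
        )
        row_counter += 1

        # commit once per batch instead of once per row.
        if ( row_counter % batch_size ) == 0:
            connection.commit()

    # commit any rows left over from the final partial batch.
    connection.commit()
    return row_counter


first_run_count = process_pending_rows( stop_after = 3 )
second_run_count = process_pending_rows()
print( first_run_count, second_run_count )
```

After a real crash, any rows updated since the last commit would be lost, but the `IS NULL` filter means the next run simply redoes them.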