# etlFFIEC Database Walkabout
There are four tables in the database. They are `dictionary`, `institution`, `period`, and `report`.

Each is detailed below. At the bottom, there's an example of analysis with pandas and SQL.

In [1]:
import json
from datetime import datetime
from pprint import pprint

import happybase
import pandas
import pandasql
import matplotlib

HBASE = 'node1.hwr.io'  # CHANGE ME
SAMPLE_BANKS = (131034, 720858, 229342, 819172, 65513, 753641)

def stringify_dict(dictionary):
    # Everything in HBase is a byte array, so convert the keys and values
    # of retrieved dictionaries into unicode strings.

    # The dict keys also retain the HBase column name as a prefix; strip that off.
    return {key.decode('utf-8').split(':', 1).pop(-1): 
            value.decode('utf-8') for key,value in dictionary.items()}

def period_to_datetime(period):
    return datetime.strptime(period, '%m/%d/%Y')

def datetime_to_period(value):
    return datetime.strftime(value, '%m/%d/%Y')

def filter_periods(periods, count, reverse):
    # convert string datestamps to python datetime, sort, return a slice of `count` length
    periods_we_care_about = [period_to_datetime(k) for k in periods]
    periods_we_care_about.sort(reverse=reverse)
    return [datetime_to_period(k) for k in periods_we_care_about[:count]]

# The `dictionary` table

`dictionary` has column `M`, for metadata. The row key for `dictionary` is the MDRM mnemonic and item code appended together, which makes the MDRM identifier. It is eight uppercase alphanumeric characters and looks something like this: `RCON2170`.
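
As a rough illustration of that row-key format, here is a minimal sketch that validates an identifier and splits it into its two parts. The four-plus-four split and the names `MDRM_PATTERN` and `split_mdrm` are assumptions made for this example; they are not part of the project.

```python
import re

# Assumption: a four-character mnemonic followed by a four-character item code,
# both uppercase alphanumerics, for eight characters in total.
MDRM_PATTERN = re.compile(r'([A-Z0-9]{4})([A-Z0-9]{4})')

def split_mdrm(identifier):
    """Return (mnemonic, item_code), or raise ValueError if the shape is wrong."""
    match = MDRM_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError('not an MDRM identifier: {!r}'.format(identifier))
    return match.group(1), match.group(2)

print(split_mdrm('RCON2170'))
```

A check like this could be handy before using user-supplied identifiers as row keys, since a lowercase or truncated key would simply return nothing from the table.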

This is the only table not populated from the FFIEC CDR. Instead, it is loaded from a CSV file downloaded from the Federal Reserve. Since this CSV essentially becomes a dependency of the project, it is downloaded into the Docker image at build time. The table is populated in two scenarios: when the `--init` flag is passed, or when the `--update-metadata` flag is passed.

In [2]:
# Scan the entire dictionary table into memory
hbase = happybase.Connection(HBASE)
dictionary = hbase.table('dictionary')
items = [item for item in dictionary.scan()]

mdrm = {}
for row, document in items:
    mdrm[row.decode('utf-8')] = stringify_dict(document)

In [None]:
for k,v in mdrm.items():
    pprint(k)
    pprint(v)
    break
    
# pay no mind to the encoding!

# The `period` table

`period` has column `I`, for institution. The row key for `period` is the datestamp format used by the FFIEC to represent a reporting period. This table is used for looking up all institutions that filed call reports for a given reporting period. Each column value is a JSON-encoded document of metadata about the institution at the given period.

In [3]:
hbase = happybase.Connection(HBASE)
period_table = hbase.table('period')

periods = {}
for row, value in period_table.scan():
    document = stringify_dict(value)
    for key in document:
        document[key] = json.loads(document[key])
    
    periods[row.decode('utf-8')] = document

In [None]:
for k,v in periods.items():
    pprint(k)
    pprint(v)
    break

# The `institution` table
`institution` has column `P`, for period. The row-key format is an [RSSDID](https://www.alacra.com/alacra/outside/lei/info/rssdid.html) which stands for 'Replication Server System Database ID', got all that? This table is used for looking up all reporting periods in which you'll find call reports for the given institution. 


In [4]:
hbase = happybase.Connection(HBASE)
institution_table = hbase.table('institution')

institutions = {}
for row, value in institution_table.scan():
    document = stringify_dict(value)
    for key in document:
        document[key] = json.loads(document[key])
    
    institutions[row.decode('utf-8')] = document

In [None]:
for k, v in institutions.items():
    pprint(k)
    pprint(v)
    break

# The `report` table

This is the main table; it holds all the call reports. There's a single column, `R`, for report.
The row key consists of the institution's RSSD ID and the reporting period, joined by a dash. Something like this: `131034-3/31/2001`. From what I've seen in testing, this results in a pretty good balance over regions. One usability improvement would be to reverse the order of the reporting period datestamp (year first), which would make it easy to scan an institution by year, like `row_prefix=b'534242-2017'`.
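
To make the year-first idea concrete, here is a small sketch comparing the current key format with one possible sortable variant. The function names and the `YYYY-MM-DD` layout are hypothetical; the project does not define them.

```python
from datetime import datetime

def current_row_key(rssd, period):
    # Format used by the table today: '<rssd>-<m/d/YYYY>'
    return '{}-{}'.format(rssd, period)

def sortable_row_key(rssd, period):
    # Hypothetical alternative: year-first datestamp, so every report an
    # institution filed in one year shares a common key prefix.
    stamp = datetime.strptime(period, '%m/%d/%Y').strftime('%Y-%m-%d')
    return '{}-{}'.format(rssd, stamp)

def parse_sortable_row_key(row_key):
    # Split on the first dash only; the datestamp itself contains dashes.
    rssd, stamp = row_key.split('-', 1)
    return int(rssd), datetime.strptime(stamp, '%Y-%m-%d')

keys = sorted(sortable_row_key(131034, p)
              for p in ('12/31/2001', '3/31/2002', '3/31/2001'))
in_2001 = [k for k in keys if k.startswith('131034-2001')]
print(in_2001)
```

With keys in this shape, plain lexicographic order is also chronological order, which is what a prefix scan relies on.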

The column key is formatted to include both an MDRM identifier and a field name, appended together and separated by a colon. For example, `RCON0010:bank_rssd_identifier`. Generally for analysis you'd really only care about the value field, such as `'R:RCON0010:value': '1672'`. It would probably be better to create a document for each reported metric, in the same fashion as the `institution` and `period` tables.
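
The per-metric document idea could look roughly like the sketch below. It regroups a flat report (keys as they appear after `stringify_dict`) into one dictionary per MDRM identifier and JSON-encodes each one. `group_by_metric` is a hypothetical helper, and the sample values are a handful of entries taken from the report output shown further down.

```python
import json
from collections import defaultdict

def group_by_metric(report):
    """Turn {'RCON0010:value': '1672', ...} into {'RCON0010': {'value': '1672', ...}}."""
    metrics = defaultdict(dict)
    for key, value in report.items():
        mdrm_id, field = key.split(':', 1)
        metrics[mdrm_id][field] = value
    return dict(metrics)

sample = {
    'RCON0010:bank_rssd_identifier': '720858',
    'RCON0010:call_date': '20010331',
    'RCON0010:value': '1672',
    'RCON0020:call_schedule': 'RCA',
    'RCON0020:value': '',
}

grouped = group_by_metric(sample)

# One JSON document per metric, as a single column value might store it
encoded = {mdrm_id: json.dumps(doc, sort_keys=True)
           for mdrm_id, doc in grouped.items()}
print(encoded['RCON0010'])
```

Stored this way, a report row would have one column per metric instead of one column per field, at the cost of a `json.loads` call on every read.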

In [5]:
hbase = happybase.Connection(HBASE, timeout=300000)  # connection with 5-minute timeout
reports_table = hbase.table('report')


# It may also be better to create a document for each reported metric, 
# in the same fashion as the `institution` and `period` tables,
# rather than N column entries for N fields

# Regardless, this is the time to go get coffee
call_reports = {}
for bank in SAMPLE_BANKS:
    prefix = bytes('{}-'.format(bank), 'utf-8')
    for row, report in reports_table.scan(row_prefix=prefix):
        call_reports[row.decode('utf-8')] = stringify_dict(report)

In [8]:
for k,v in call_reports.items():
    pprint(k)
    pprint(v)
    break

'720858-3/31/2001'
{'RCON0010:bank_rssd_identifier': '720858',
 'RCON0010:call_date': '20010331',
 'RCON0010:call_schedule': 'RCR',
 'RCON0010:last_update': '20050712',
 'RCON0010:line_number': '34',
 'RCON0010:mdrm_#': 'RCON0010',
 'RCON0010:short_definition': 'Cash and balances dues from depository '
                              'institutions',
 'RCON0010:value': '1672',
 'RCON0020:bank_rssd_identifier': '720858',
 'RCON0020:call_date': '20010331',
 'RCON0020:call_schedule': 'RCA',
 'RCON0020:last_update': '20050712',
 'RCON0020:line_number': '1a',
 'RCON0020:mdrm_#': 'RCON0020',
 'RCON0020:short_definition': 'Cash items in process of collection and '
                              'unposted debits',
 'RCON0020:value': '',
 'RCON0030:bank_rssd_identifier': '720858',
 'RCON0030:call_date': '20010331',
 'RCON0030:call_schedule': 'RCO',
 'RCON0030:last_update': '20050712',
 'RCON0030:line_number': '1a',
 'RCON0030:mdrm_#': 'RCON0030',
 'RCON0030:short_definition': 'Actual amount of all 

 'RCON1736:value': '0',
 'RCON1737:bank_rssd_identifier': '720858',
 'RCON1737:call_date': '20010331',
 'RCON1737:call_schedule': 'RCB',
 'RCON1737:last_update': '20050712',
 'RCON1737:line_number': '6a',
 'RCON1737:mdrm_#': 'RCON1737',
 'RCON1737:short_definition': 'Other domestic debt securities',
 'RCON1737:value': '0',
 'RCON1738:bank_rssd_identifier': '720858',
 'RCON1738:call_date': '20010331',
 'RCON1738:call_schedule': 'RCB',
 'RCON1738:last_update': '20050712',
 'RCON1738:line_number': '6a',
 'RCON1738:mdrm_#': 'RCON1738',
 'RCON1738:short_definition': 'Other domestic debt securities',
 'RCON1738:value': '0',
 'RCON1739:bank_rssd_identifier': '720858',
 'RCON1739:call_date': '20010331',
 'RCON1739:call_schedule': 'RCB',
 'RCON1739:last_update': '20050712',
 'RCON1739:line_number': '6a',
 'RCON1739:mdrm_#': 'RCON1739',
 'RCON1739:short_definition': 'Other domestic debt securities',
 'RCON1739:value': '0',
 'RCON1741:bank_rssd_identifier': '720858',
 'RCON1741:call_date': '20010

 'RCON3514:line_number': '2b2',
 'RCON3514:mdrm_#': 'RCON3514',
 'RCON3514:short_definition': 'Actual amount of unposted credits to time and '
                              'savings deposits',
 'RCON3514:value': '0',
 'RCON3520:bank_rssd_identifier': '720858',
 'RCON3520:call_date': '20010331',
 'RCON3520:call_schedule': 'RCO',
 'RCON3520:last_update': '20050712',
 'RCON3520:line_number': '3',
 'RCON3520:mdrm_#': 'RCON3520',
 'RCON3520:short_definition': "Uninvested trust funds (cash) held in bank's "
                              'own trust department (not included in total '
                              'deposits)',
 'RCON3520:value': '0',
 'RCON3529:bank_rssd_identifier': '720858',
 'RCON3529:call_date': '20010331',
 'RCON3529:call_schedule': 'RCN',
 'RCON3529:last_update': '20050712',
 'RCON3529:line_number': 'M5',
 'RCON3529:mdrm_#': 'RCON3529',
 'RCON3529:short_definition': 'Interest rate, foreign exchange rate, and other '
                              'commodity and equity con

 'RCON8695:value': '0',
 'RCON8696:bank_rssd_identifier': '720858',
 'RCON8696:call_date': '20010331',
 'RCON8696:call_schedule': 'RCL',
 'RCON8696:last_update': '20050712',
 'RCON8696:line_number': '11a',
 'RCON8696:mdrm_#': 'RCON8696',
 'RCON8696:short_definition': 'Futures contracts',
 'RCON8696:value': '0',
 'RCON8697:bank_rssd_identifier': '720858',
 'RCON8697:call_date': '20010331',
 'RCON8697:call_schedule': 'RCL',
 'RCON8697:last_update': '20050712',
 'RCON8697:line_number': '11b',
 'RCON8697:mdrm_#': 'RCON8697',
 'RCON8697:short_definition': 'Forward contracts',
 'RCON8697:value': '0',
 'RCON8698:bank_rssd_identifier': '720858',
 'RCON8698:call_date': '20010331',
 'RCON8698:call_schedule': 'RCL',
 'RCON8698:last_update': '20050712',
 'RCON8698:line_number': '11b',
 'RCON8698:mdrm_#': 'RCON8698',
 'RCON8698:short_definition': 'Forward contracts',
 'RCON8698:value': '0',
 'RCON8699:bank_rssd_identifier': '720858',
 'RCON8699:call_date': '20010331',
 'RCON8699:call_schedule': 'RC

 'RCONA550:short_definition': 'Over three months through 12 months',
 'RCONA550:value': '746',
 'RCONA551:bank_rssd_identifier': '720858',
 'RCONA551:call_date': '20010331',
 'RCONA551:call_schedule': 'RCB',
 'RCONA551:last_update': '20050712',
 'RCONA551:line_number': 'M2a3',
 'RCONA551:mdrm_#': 'RCONA551',
 'RCONA551:short_definition': 'Over one year through three years',
 'RCONA551:value': '2908',
 'RCONA552:bank_rssd_identifier': '720858',
 'RCONA552:call_date': '20010331',
 'RCONA552:call_schedule': 'RCB',
 'RCONA552:last_update': '20050712',
 'RCONA552:line_number': 'M2a4',
 'RCONA552:mdrm_#': 'RCONA552',
 'RCONA552:short_definition': 'Over three years through five years',
 'RCONA552:value': '2637',
 'RCONA553:bank_rssd_identifier': '720858',
 'RCONA553:call_date': '20010331',
 'RCONA553:call_schedule': 'RCB',
 'RCONA553:last_update': '20050712',
 'RCONA553:line_number': 'M2a5',
 'RCONA553:mdrm_#': 'RCONA553',
 'RCONA553:short_definition': 'Over five years through 15 years.',
 'R

 'RCONB835:value': '0',
 'RCONB836:bank_rssd_identifier': '720858',
 'RCONB836:call_date': '20010331',
 'RCONB836:call_schedule': 'RCN',
 'RCONB836:last_update': '20050712',
 'RCONB836:line_number': '2',
 'RCONB836:mdrm_#': 'RCONB836',
 'RCONB836:short_definition': 'Loans to depository institutions and '
                              'acceptances of other banks',
 'RCONB836:value': '0',
 'RCONB837:bank_rssd_identifier': '720858',
 'RCONB837:call_date': '20010331',
 'RCONB837:call_schedule': 'RCCI',
 'RCONB837:last_update': '20050712',
 'RCONB837:line_number': 'M5',
 'RCONB837:mdrm_#': 'RCONB837',
 'RCONB837:short_definition': 'To be completed by banks with $300 million or '
                              'more in total assets: Loans secured by real '
                              'estate to non-U.S. addressees (domicile) '
                              '(included in Schedule RC-C, part I, items 1.a '
                              'through 1.e, column B)',
 'RCONB837:value': '',
 'RCONB8

 'RIADC017:call_schedule': 'RIE',
 'RIADC017:last_update': '20050712',
 'RIADC017:line_number': '2a',
 'RIADC017:mdrm_#': 'RIADC017',
 'RIADC017:short_definition': 'Data processing expenses',
 'RIADC017:value': '23',
 'RIADC018:bank_rssd_identifier': '720858',
 'RIADC018:call_date': '20010331',
 'RIADC018:call_schedule': 'RIE',
 'RIADC018:last_update': '20050712',
 'RIADC018:line_number': '2d',
 'RIADC018:mdrm_#': 'RIADC018',
 'RIADC018:short_definition': 'Printing, stationery, and supplies',
 'RIADC018:value': '0',
 'RSSD9017:bank_rssd_identifier': '720858',
 'RSSD9017:call_date': '20010331',
 'RSSD9017:call_schedule': 'ENT',
 'RSSD9017:last_update': '20050712',
 'RSSD9017:line_number': '3',
 'RSSD9017:mdrm_#': 'RSSD9017',
 'RSSD9017:short_definition': 'Legal title of bank',
 'RSSD9017:value': '',
 'RSSD9050:bank_rssd_identifier': '720858',
 'RSSD9050:call_date': '20010331',
 'RSSD9050:call_schedule': 'ENT',
 'RSSD9050:last_update': '20050712',
 'RSSD9050:line_number': '2',
 'RSSD9050

In [6]:
# dig for interesting asset-related reports, enrich the item_name

possible_relevant_fields = {}

for key, value in mdrm.items():
    item_type = value['item_type']
    item_name = value['item_name'].lower()
    if 'total asset' in item_name:   
        possible_relevant_fields[key] = '{} {}'.format(item_type, item_name)

In [None]:
pprint(possible_relevant_fields)

In [7]:
# process call_reports into a time-series
data = []
for row_key, call_report in call_reports.items():  
    rssd, period = row_key.split('-')
    
    for key, value in call_report.items():
        mdrm_id, name = key.split(':')
        
        if mdrm_id not in possible_relevant_fields:
            continue
        
        if name != 'value':
            continue
         
        try:
            value = float(value)
            if value.is_integer():
                value = int(value)
        except ValueError:
            pass

        if isinstance(value, str) and not value:
            value = None
            
        data.append({'institution': institutions[rssd][period]['Name'],
                     'mdrm': mdrm_id,
                     'type': mdrm[mdrm_id]['item_type'],
                     'name': mdrm[mdrm_id]['item_name'],
                     'rssd': int(rssd), 
                     'value': value,
                     'period': period_to_datetime(period)})
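
The value handling in the loop above can be pulled out into a small function, which makes the rules easier to see and to test. This is a sketch of the same logic under a made-up name, `coerce_value`; it is not a function the project provides.

```python
def coerce_value(raw):
    """Convert a reported string to int, float, or None; leave other text alone."""
    if raw == '':
        # Empty strings mean nothing was reported for this item
        return None
    try:
        number = float(raw)
    except ValueError:
        # Not numeric (a bank name, for example): keep the original string
        return raw
    # Whole numbers become ints so they don't render as '1672.0'
    return int(number) if number.is_integer() else number

print([coerce_value(v) for v in ('1672', '12.5', '', 'N/A')])
```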

In [None]:
# `data` is a list of dicts, so just show the first record
for row in data:
    pprint(row)
    break


# Ok, so can you answer the question?
> how has asset growth over time compared with other / similar banks?

We'll look at MDRM ID RCON2170, derived total assets.

In [8]:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
reports_frame = pandas.DataFrame(data=data)

In [9]:
# let's compare 131034, 720858, 229342, 819172, 753641
# it seems there are no RCON2170 values for 65513
pysqldf('''select rssd, institution, type, name, strftime('%Y', period) as year, sum(value) as "sum"
                  from reports_frame where mdrm="RCON2170"
                  GROUP BY year, rssd
                  ORDER BY rssd, year''')

|  | rssd | institution | type | name | year | sum |
|---|---|---|---|---|---|---|
| 0 | 131034 | CITIZENS BANK OF OVIEDO, THE | derived | TOTAL ASSETS | 2001 | 437415.0 |
| 1 | 131034 | CITIZENS BANK OF OVIEDO, THE | derived | TOTAL ASSETS | 2002 | 493609.0 |
| 2 | 131034 | CITIZENS BANK OF OVIEDO, THE | derived | TOTAL ASSETS | 2003 | 536692.0 |
| 3 | 131034 | CITIZENS BANK OF OVIEDO, THE | derived | TOTAL ASSETS | 2004 | 584885.0 |
| 4 | 131034 | CITIZENS BANK OF OVIEDO, THE | derived | TOTAL ASSETS | 2005 | 663896.0 |
| 5 | 131034 | CITIZENS BANK OF FLORIDA | derived | TOTAL ASSETS | 2006 | 731743.0 |
| 6 | 131034 | CITIZENS BANK OF FLORIDA | derived | TOTAL ASSETS | 2007 | 751258.0 |
| 7 | 131034 | CITIZENS BANK OF FLORIDA | derived | TOTAL ASSETS | 2008 | 961270.0 |
| 8 | 131034 | CITIZENS BANK OF FLORIDA | derived | TOTAL ASSETS | 2009 | 915913.0 |
| 9 | 131034 | CITIZENS BANK OF FLORIDA | derived | TOTAL ASSETS | 2010 | 914478.0 |
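
To get from yearly totals to growth, one option is a year-over-year percentage change. The sketch below does that in plain Python on the ten rows shown above for RSSD 131034; `year_over_year` is a name made up for this example. Keep in mind that the `sum` column adds together every filing in a year, so these are changes in the yearly sums rather than in a single quarter-end balance.

```python
# (year, summed RCON2170 value) for RSSD 131034, copied from the output above
yearly_totals = [
    (2001, 437415.0), (2002, 493609.0), (2003, 536692.0), (2004, 584885.0),
    (2005, 663896.0), (2006, 731743.0), (2007, 751258.0), (2008, 961270.0),
    (2009, 915913.0), (2010, 914478.0),
]

def year_over_year(totals):
    """Return {year: fractional change from the previous year}."""
    growth = {}
    for (_, previous), (year, current) in zip(totals, totals[1:]):
        growth[year] = (current - previous) / previous
    return growth

growth = year_over_year(yearly_totals)
for year, change in growth.items():
    print('{}: {:+.1%}'.format(year, change))
```

Running the same function over each bank's rows would give comparable growth series, which is closer to what the question asks than the raw sums. On a DataFrame, pandas' `pct_change` computes the same figures.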


In [10]:
pysqldf('''select institution, type, name, mdrm, rssd, strftime('%Y', period) as year, sum(value)
                  from reports_frame where mdrm="RCFD5320" 
                  GROUP BY year, rssd
                  ORDER BY year, rssd''')

|  | institution | type | name | mdrm | rssd | year | sum(value) |
|---|---|---|---|---|---|---|---|
| 0 | DELTA NATIONAL BANK AND TRUST COMPANY | reported | TOTAL ASSETS (0% RISK-WEIGHT) | RCFD5320 | 65513 | 2001 | 488178.0 |
| 1 | DELTA NATIONAL BANK AND TRUST COMPANY | reported | TOTAL ASSETS (0% RISK-WEIGHT) | RCFD5320 | 65513 | 2002 | 458185.0 |
| 2 | DELTA NATIONAL BANK AND TRUST COMPANY | reported | TOTAL ASSETS (0% RISK-WEIGHT) | RCFD5320 | 65513 | 2003 | 383452.0 |
| 3 | DELTA NATIONAL BANK AND TRUST COMPANY | reported | TOTAL ASSETS (0% RISK-WEIGHT) | RCFD5320 | 65513 | 2004 | 255093.0 |
| 4 | DELTA NATIONAL BANK AND TRUST COMPANY | reported | TOTAL ASSETS (0% RISK-WEIGHT) | RCFD5320 | 65513 | 2005 | 182524.0 |
| 5 | DELTA NATIONAL BANK AND TRUST COMPANY | reported | TOTAL ASSETS (0% RISK-WEIGHT) | RCFD5320 | 65513 | 2006 | 113664.0 |
| 6 | DELTA NATIONAL BANK AND TRUST COMPANY | reported | TOTAL ASSETS (0% RISK-WEIGHT) | RCFD5320 | 65513 | 2007 | 85282.0 |
| 7 | DELTA NATIONAL BANK AND TRUST COMPANY | reported | TOTAL ASSETS (0% RISK-WEIGHT) | RCFD5320 | 65513 | 2008 | 318283.0 |
| 8 | DELTA NATIONAL BANK AND TRUST COMPANY | reported | TOTAL ASSETS (0% RISK-WEIGHT) | RCFD5320 | 65513 | 2009 | 927706.0 |
| 9 | DELTA NATIONAL BANK AND TRUST COMPANY | reported | TOTAL ASSETS (0% RISK-WEIGHT) | RCFD5320 | 65513 | 2010 | 935286.0 |


In [None]:
pysqldf('''select institution, type, name, mdrm, rssd, strftime('%Y', period) as year, sum(value)
                  from reports_frame
                  GROUP BY year, rssd
                  ORDER BY year, rssd, mdrm''')