
# Notes

Using pandas directly. These Excel files are machine generated, so there shouldn't be any consistency issues, 
and databaker would take forever on them because of their excessive depth (databaker has diminishing returns on lookup speed). 
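
The consistency assumption above is cheap to check before transforming anything. A minimal sketch, assuming the layout seen in this notebook (three identifier columns followed by date columns such as `'1998JAN'`); `column_problems` is a hypothetical helper, not part of the original script:

```python
import re

# Layout assumed from the source files used in this notebook.
EXPECTED_ID_COLUMNS = ["COMMODITY", "COUNTRY", "DIRECTION"]
DATE_COLUMN_RE = re.compile(r"^'?\d{4}[A-Z]{3}'?$")


def column_problems(columns):
    """Return a list of layout problems; an empty list means the columns look as expected."""
    columns = [str(c) for c in columns]
    problems = []
    if columns[:3] != EXPECTED_ID_COLUMNS:
        problems.append("first three columns are {}".format(columns[:3]))
    for name in columns[3:]:
        if not DATE_COLUMN_RE.match(name):
            problems.append("unexpected date column: {}".format(name))
    return problems


print(column_problems(["COMMODITY", "COUNTRY", "DIRECTION", "'1998JAN'", "'1998FEB'"]))
```

Once the dataframes are loaded, it could be called as `column_problems(dfI.columns)` and `column_problems(dfE.columns)`.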


## Usage

1.) Change the variables `importsInFile` and `exportsInFile` in the cell below.

2.) Use `Cell->Run All` from the ribbon above.


In [6]:

import pandas as pd

# ###########################
# CHANGE INPUT FILENAMES HERE
# ###########################

importsInFile = "tradeingoodscountrybycommodityimportsmay2018final.xlsx"
exportsInFile = "tradeingoodscountrybycommodityexportsmay2018final.xlsx"

dfI = pd.read_excel(importsInFile)   # create imports dataframe
dfE = pd.read_excel(exportsInFile)   # create exports dataframe


In [7]:
# Sanity check imports dataframe - literally just so we can eyeball the first three lines.
dfI[:3]

Unnamed: 0,COMMODITY,COUNTRY,DIRECTION,'1998JAN','1998FEB','1998MAR','1998APR','1998MAY','1998JUN','1998JUL',...,'2017AUG','2017SEP','2017OCT','2017NOV','2017DEC','2018JAN','2018FEB','2018MAR','2018APR','2018MAY'
0,0 Food & live animals,AD Andorra,IM Imports,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0 Food & live animals,AE United Arab Emirates,IM Imports,1,0,12,29,11,2,0,...,0,0,1,1,0,0,0,8,1,0
2,0 Food & live animals,AF Afghanistan,IM Imports,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
# Sanity check exports dataframe - literally just so we can eyeball the first three lines.
dfE[:3]

Unnamed: 0,COMMODITY,COUNTRY,DIRECTION,'1998JAN','1998FEB','1998MAR','1998APR','1998MAY','1998JUN','1998JUL',...,'2017AUG','2017SEP','2017OCT','2017NOV','2017DEC','2018JAN','2018FEB','2018MAR','2018APR','2018MAY'
0,0 Food & live animals,AD Andorra,EX Exports,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0 Food & live animals,AE United Arab Emirates,EX Exports,5,5,5,4,5,4,6,...,25,27,21,21,22,19,17,21,20,18
2,0 Food & live animals,AF Afghanistan,EX Exports,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# Some functions we'll reuse a few times.

def codelistify(cell):
    """
    Takes the contents of one 'cell', and styles it as per standard cmd codes. i.e `This Value` to `this-value`
    """
    return cell.lower().replace(" ", "-")


def fixTime(cell):
    """
    Takes the horrible date i.e '1998JAN' and returns something cmd friendly. i.e 1998JAN becomes 'Jan 1998'.
    """
    
    # get rid of pointless quotes
    cell = cell.replace("'", "")
    
    # Some validation, as this is most likely place to encounter 'fun'
    assert len(cell) == 7, "Aborting. Expecting 'date' to be 7 characters long (eg 1998JAN). We got: " + cell
    
    try:
        int(cell[:4])  # only checking that the year part is numeric
    except ValueError:
        raise ValueError("First 4 characters of 'date' should be a year, we got: " + cell[:4])
        
    return cell[-3:].title() + " " + cell[:4]


# Transformation Script

At this point the dataframes and functions are loaded (by executing the previous cells), so executing the following cell creates the 
final v4 file.
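
The date handling in `fixTime` is the most likely place for odd input to surface. As an alternative sketch (not the function used in this notebook), the standard library's `datetime.strptime` can do the validation and the reformatting in one go. `fix_time_strict` is a hypothetical name, and the month names depend on the locale, which is assumed to be English here:

```python
from datetime import datetime


def fix_time_strict(cell):
    """Turn "'1998JAN'" into 'Jan 1998', rejecting anything that is not year + month abbreviation."""
    cleaned = cell.replace("'", "")
    # strptime raises ValueError when the text is not a 4-digit year followed by
    # an abbreviated month name (matching is case-insensitive, so 'JAN' is fine).
    parsed = datetime.strptime(cleaned, "%Y%b")
    return parsed.strftime("%b %Y")


print(fix_time_strict("'1998JAN'"))
```

Unlike the length and year checks in `fixTime`, this also rejects a month part that is not a real month, such as `1998XXX`.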

## Explanation
We pivot the data one column at a time, building a list of dataframes and then concatenating them into 
one "master" dataframe, which is written to csv.

Looking at the source data (see the sanity checks above), each sub-dataset we create is made from
the first three columns, the fixed cmd columns (`months`, `uk-only`, `geography`) and one time column.
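
The column-at-a-time pivot described above can also be expressed as a single `DataFrame.melt` call. This is only a sketch on a toy frame shaped like the source data, not what this notebook's transformation cell does; `toy` and `long_df` are made-up names for illustration:

```python
import pandas as pd

# Toy frame with the same shape as the source: three id columns, then one column per month.
toy = pd.DataFrame({
    "COMMODITY": ["0 Food & live animals", "0 Food & live animals"],
    "COUNTRY": ["AD Andorra", "AE United Arab Emirates"],
    "DIRECTION": ["IM Imports", "IM Imports"],
    "'1998JAN'": [0, 1],
    "'1998FEB'": [0, 0],
})

# Wide to long: every date column becomes a block of rows, in column order.
long_df = toy.melt(
    id_vars=["COMMODITY", "COUNTRY", "DIRECTION"],
    var_name="time",
    value_name="v4_0",
)

print(long_df)
```

The rows come out in the same order as the loop-and-concatenate approach (all rows for the first date column, then the next), so the remaining steps (fixing the time label, splitting codes from labels, adding the fixed columns) could be applied to `long_df` in one pass.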



In [10]:

allFrames = [] # list for holding each sub-dataframe

# TODO - no real need to have both sources in memory at the same time
num = 0
for source in [dfI, dfE]:
        
    num += 1 # simple counter for feedback.
    
    # For each date column:
    for dateCol in source.columns.values[3:]:
        
        df = pd.DataFrame()

        df["v4_0"] = source[dateCol]
        
        df["months"] = "month"
        df["time"] = fixTime(dateCol)
        
        df["uk-only"] = "K0200001"
        df["geography"] = "United Kingdom"
        
        # For the three topic dimensions, they appear to have put the code and label together.
        # just need to split them out
        
        # TODO - messy. Less lambda more func
        # NOTE - replacing "/" with "-" as "/" has syntactical meaning in Cypher and breaks dimension importer
        df["trade-commodity"] = source["COMMODITY"].map(lambda x: x.split(" ")[0]).str.replace("/", "-")
        df["commodity"] = source["COMMODITY"].map(lambda x: " ".join(x.split(" ")[1:]))
        
        df["trade-country"] = source["COUNTRY"].map(lambda x: x.split(" ")[0])
        df["country"] = source["COUNTRY"].map(lambda x: " ".join(x.split(" ")[1:]))
        
        df["trade-direction"] = source["DIRECTION"].map(lambda x: x.split(" ")[0])
        df["direction"] = source["DIRECTION"].map(lambda x: x.split(" ")[1])
        
        allFrames.append(df)
        print("Generated sub-dataframe for {dc} from source {n} of 2.".format(dc=dateCol, n=num))
    
allDf = pd.concat(allFrames)

allDf.to_csv("v4_Trade.csv", index=False) # output to csv
print("v4 File successfully generated.")

# Sanity check output, 5 lines only
allDf[:5]

Generated sub-dataframe for '1998JAN' from source 1 of 2.
Generated sub-dataframe for '1998FEB' from source 1 of 2.
Generated sub-dataframe for '1998MAR' from source 1 of 2.
Generated sub-dataframe for '1998APR' from source 1 of 2.
Generated sub-dataframe for '1998MAY' from source 1 of 2.
Generated sub-dataframe for '1998JUN' from source 1 of 2.
Generated sub-dataframe for '1998JUL' from source 1 of 2.
Generated sub-dataframe for '1998AUG' from source 1 of 2.
Generated sub-dataframe for '1998SEP' from source 1 of 2.
Generated sub-dataframe for '1998OCT' from source 1 of 2.
Generated sub-dataframe for '1998NOV' from source 1 of 2.
Generated sub-dataframe for '1998DEC' from source 1 of 2.
Generated sub-dataframe for '1999JAN' from source 1 of 2.
Generated sub-dataframe for '1999FEB' from source 1 of 2.
Generated sub-dataframe for '1999MAR' from source 1 of 2.
Generated sub-dataframe for '1999APR' from source 1 of 2.
Generated sub-dataframe for '1999MAY' from source 1 of 2.
...

Generated sub-dataframe for '2009NOV' from source 1 of 2.
Generated sub-dataframe for '2009DEC' from source 1 of 2.
Generated sub-dataframe for '2010JAN' from source 1 of 2.
Generated sub-dataframe for '2010FEB' from source 1 of 2.
Generated sub-dataframe for '2010MAR' from source 1 of 2.
Generated sub-dataframe for '2010APR' from source 1 of 2.
Generated sub-dataframe for '2010MAY' from source 1 of 2.
Generated sub-dataframe for '2010JUN' from source 1 of 2.
Generated sub-dataframe for '2010JUL' from source 1 of 2.
Generated sub-dataframe for '2010AUG' from source 1 of 2.
Generated sub-dataframe for '2010SEP' from source 1 of 2.
Generated sub-dataframe for '2010OCT' from source 1 of 2.
Generated sub-dataframe for '2010NOV' from source 1 of 2.
Generated sub-dataframe for '2010DEC' from source 1 of 2.
Generated sub-dataframe for '2011JAN' from source 1 of 2.
Generated sub-dataframe for '2011FEB' from source 1 of 2.
Generated sub-dataframe for '2011MAR' from source 1 of 2.
...

Generated sub-dataframe for '2001APR' from source 2 of 2.
Generated sub-dataframe for '2001MAY' from source 2 of 2.
Generated sub-dataframe for '2001JUN' from source 2 of 2.
Generated sub-dataframe for '2001JUL' from source 2 of 2.
Generated sub-dataframe for '2001AUG' from source 2 of 2.
Generated sub-dataframe for '2001SEP' from source 2 of 2.
Generated sub-dataframe for '2001OCT' from source 2 of 2.
Generated sub-dataframe for '2001NOV' from source 2 of 2.
Generated sub-dataframe for '2001DEC' from source 2 of 2.
Generated sub-dataframe for '2002JAN' from source 2 of 2.
Generated sub-dataframe for '2002FEB' from source 2 of 2.
Generated sub-dataframe for '2002MAR' from source 2 of 2.
Generated sub-dataframe for '2002APR' from source 2 of 2.
Generated sub-dataframe for '2002MAY' from source 2 of 2.
Generated sub-dataframe for '2002JUN' from source 2 of 2.
Generated sub-dataframe for '2002JUL' from source 2 of 2.
Generated sub-dataframe for '2002AUG' from source 2 of 2.
...

Generated sub-dataframe for '2013FEB' from source 2 of 2.
Generated sub-dataframe for '2013MAR' from source 2 of 2.
Generated sub-dataframe for '2013APR' from source 2 of 2.
Generated sub-dataframe for '2013MAY' from source 2 of 2.
Generated sub-dataframe for '2013JUN' from source 2 of 2.
Generated sub-dataframe for '2013JUL' from source 2 of 2.
Generated sub-dataframe for '2013AUG' from source 2 of 2.
Generated sub-dataframe for '2013SEP' from source 2 of 2.
Generated sub-dataframe for '2013OCT' from source 2 of 2.
Generated sub-dataframe for '2013NOV' from source 2 of 2.
Generated sub-dataframe for '2013DEC' from source 2 of 2.
Generated sub-dataframe for '2014JAN' from source 2 of 2.
Generated sub-dataframe for '2014FEB' from source 2 of 2.
Generated sub-dataframe for '2014MAR' from source 2 of 2.
Generated sub-dataframe for '2014APR' from source 2 of 2.
Generated sub-dataframe for '2014MAY' from source 2 of 2.
Generated sub-dataframe for '2014JUN' from source 2 of 2.
...

Unnamed: 0,v4_0,months,time,uk-only,geography,trade-commodity,commodity,trade-country,country,trade-direction,direction
0,0,month,Jan 1998,K0200001,United Kingdom,0,Food & live animals,AD,Andorra,IM,Imports
1,1,month,Jan 1998,K0200001,United Kingdom,0,Food & live animals,AE,United Arab Emirates,IM,Imports
2,0,month,Jan 1998,K0200001,United Kingdom,0,Food & live animals,AF,Afghanistan,IM,Imports
3,0,month,Jan 1998,K0200001,United Kingdom,0,Food & live animals,AG,Antigua & Barbuda,IM,Imports
4,0,month,Jan 1998,K0200001,United Kingdom,0,Food & live animals,AI,Anguilla,IM,Imports
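
Each topic dimension in the output above ends up as a code column and a label column, split on the first space of the source cell. The transformation cell does this with inline lambdas; a minimal sketch of the same split as a named function, using `str.partition`, might look like the following (`split_code_label` is a hypothetical helper name):

```python
def split_code_label(cell):
    """Split a combined cell such as 'AE United Arab Emirates' into (code, label)."""
    # partition splits on the first space only, so spaces inside the label are kept.
    code, _, label = cell.partition(" ")
    return code, label


for cell in ["0 Food & live animals", "AE United Arab Emirates", "IM Imports"]:
    print(split_code_label(cell))
```

With pandas this could be applied per column, for example `source["COUNTRY"].map(lambda x: split_code_label(x)[0])` for the code part.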
