
# Dev Notes

Using pandas directly. There shouldn't be any consistency issues, as these Excel files are machine generated, 
and databaker would take forever given the excessive depth of the sheets (databaker has diminishing returns on lookup speed). 


## Usage

1. Change the variables `importsInFile` and `exportsInFile` in the cell below.

2. Use `Cell->Run All` from the ribbon above.

NOTE - this is not quick. Expect it to take 5-10 minutes; just leave it to run.




In [25]:

import pandas as pd
from databakerUtils.v4Functions import v4Integers
import requests

# ###########################
# CHANGE INPUT FILENAMES HERE
# ###########################

importsInFile = "countrybycommodityimportsfinal.xlsx"
exportsInFile = "countrybycommodityexportsfinal.xlsx"

dfI = pd.read_excel(importsInFile)   # create imports dataframe
dfE = pd.read_excel(exportsInFile)   # create exports dataframe

# BA have added a key at the bottom of the sheet.
# The next two lines drop any row containing empty cells; they are needed if the script returns an error.
dfI = dfI.dropna()
dfE = dfE.dropna()


# Sanity check imports dataframe - literally just so we can eyeball the first three lines.
dfI[:3]

Unnamed: 0,COMMODITY,COUNTRY,DIRECTION,1998JAN,1998FEB,1998MAR,1998APR,1998MAY,1998JUN,1998JUL,...,2018APR,2018MAY,2018JUN,2018JUL,2018AUG,2018SEP,2018OCT,2018NOV,2018DEC,2019JAN
0,0 Food & live animals,AD Andorra,IM Imports,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0 Food & live animals,AE United Arab Emirates,IM Imports,1.0,0.0,12.0,29.0,11.0,2.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0 Food & live animals,AF Afghanistan,IM Imports,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
# Sanity check exports dataframe - literally just so we can eyeball the first three lines.
dfE[:3]

# Functions we'll reuse a few times.

def fixTime(cell):
    """
    Takes the source date, e.g. '1998JAN', and returns something cmd friendly, e.g. 'Jan-98'.
    """
    
    # get rid of pointless quotes
    cell = cell.replace("'", "")
    
    # Some validation, as this is the most likely place to encounter 'fun'
    assert len(cell) == 7, "Aborting. Expecting 'date' to be 7 characters long (eg 1998JAN). We got: " + cell
    
    try:
        int(cell[:4]) # result unused, we only check that the first 4 characters parse as a number
    except ValueError:
        raise ValueError("First 4 characters of 'date' should be a year, we got: " + cell[:4])
        
    return cell[-3:].title() + "-" + cell[2:4]
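
`fixTime` checks the length and the year but accepts any three trailing characters as a month. A minimal sketch of a stricter variant follows, assuming the source only ever uses English three-letter month abbreviations; `fix_time_strict` and `MONTHS` are hypothetical names, not part of this notebook.

```python
# Hypothetical stricter variant of fixTime; not used elsewhere in the notebook.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

def fix_time_strict(cell):
    """Convert e.g. '1998JAN' to 'Jan-98', rejecting unknown months."""
    # strip stray quotes and surrounding whitespace
    cell = cell.replace("'", "").strip()
    year, month = cell[:4], cell[4:].upper()

    # one check covers the length, the year and the month abbreviation
    if len(cell) != 7 or not year.isdigit() or month not in MONTHS:
        raise ValueError("Expected a date like 1998JAN, got: " + repr(cell))

    return month.title() + "-" + year[2:]

print(fix_time_strict("1998JAN"))    # Jan-98
print(fix_time_strict("'2019jan'"))  # Jan-19
```

Raising `ValueError` rather than using `assert` means the check still runs if Python is started with `-O`.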

# Fetch the trade-commodity code list from the cmd api and build a {code: label} lookup
url = 'https://api.cmd-dev.onsdigital.co.uk/v1/code-lists/trade-commodity/editions/one-off/codes'
r = requests.get(url)
r.raise_for_status() # fail early if the api call did not succeed
wholeDict = r.json()
commodityDict = {}
for item in wholeDict['items']:
    commodityDict.update({item['id']:item['label']})

def CommodityLabels(value):
    # returns the trade-commodity label from the api for a given code
    return commodityDict[value]

# Same again for the trade-country code list
url = 'https://api.cmd-dev.onsdigital.co.uk/v1/code-lists/trade-country/editions/one-off/codes'
r = requests.get(url)
r.raise_for_status() # fail early if the api call did not succeed
wholeDict = r.json()
countryDict = {}
for item in wholeDict['items']:
    countryDict.update({item['id']:item['label']})

def CountryLabels(value):
    # returns the trade-country label from the api for a given code
    return countryDict[value]
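
The two lookups above follow the same pattern, and a code missing from the code list surfaces as a bare `KeyError` partway through the run. The sketch below shows one way to share the logic and name the offending code. It assumes only what the cell above already relies on (an `items` list whose entries carry `id` and `label`); `build_lookup`, `label_for` and the sample payload are hypothetical stand-ins, not real api output.

```python
def build_lookup(payload):
    """Build a {code: label} dict from a code-list response body."""
    return {item["id"]: item["label"] for item in payload["items"]}

def label_for(lookup, code):
    """Return the label for a code, naming the code if it is unknown."""
    try:
        return lookup[code]
    except KeyError:
        raise KeyError("Code {c!r} is not in the code list".format(c=code))

# Made-up sample standing in for r.json(); assumed shape, not real api output.
sample = {"items": [
    {"id": "AD", "label": "AD - Andorra"},
    {"id": "AE", "label": "AE - United Arab Emirates"},
]}

country_lookup = build_lookup(sample)
print(label_for(country_lookup, "AD"))  # AD - Andorra
```

With the real responses, `build_lookup(r.json())` would take the place of the two hand-written loops.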


# Transformation Script

At this point the dataframes and functions are loaded (by executing the previous cells), so executing the following cell creates the 
final v4 file.

## Explanation
We're just going to pivot the data one column at a time, building a list of dataframes and then concatenating them into 
one "master" dataframe, which is written to csv.

If you think of the source data (see the sanity checks above), each of the sub-datasets we're creating is made from
the first three columns, the fixed cmd columns and one time column.
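
For comparison, the same wide-to-long reshaping can be sketched in a single pass with `DataFrame.melt`. This is an alternative illustration, not what the cell below does. The two-row frame is a cut-down sample shaped like the sanity check output above, and `id_cols` and `long_df` are hypothetical names.

```python
import pandas as pd

# Cut-down sample in the same shape as the source sheets (two rows, two months).
source = pd.DataFrame({
    "COMMODITY": ["0 Food & live animals", "0 Food & live animals"],
    "COUNTRY": ["AD Andorra", "AE United Arab Emirates"],
    "DIRECTION": ["IM Imports", "IM Imports"],
    "1998JAN": [0.0, 1.0],
    "1998FEB": [0.0, 0.0],
})

# The first three columns identify a row; every other column is a month.
id_cols = ["COMMODITY", "COUNTRY", "DIRECTION"]

# melt turns each month column into (time, v4_0) rows, one month after another
long_df = source.melt(id_vars=id_cols, var_name="time", value_name="v4_0")

print(long_df.shape)             # (4, 5)
print(long_df["time"].tolist())  # ['1998JAN', '1998JAN', '1998FEB', '1998FEB']
```

The rows come out month by month, which matches the order produced by looping over the date columns and concatenating. The code and label splits and the fixed cmd columns would still need adding afterwards.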



In [27]:

allFrames = [] # list for holding each sub-dataframe

# TODO - no real need to have both sources in memory at the same time
num = 0
for source in [dfI, dfE]:
        
    num += 1 # simple counter for feedback.
    
    # For each date column:
    for dateCol in source.columns.values[3:]:
        
        df = pd.DataFrame()

        df["v4_0"] = source[dateCol]
        
        df["mmm-yy"] = fixTime(dateCol)
        df["time"] = fixTime(dateCol)
        
        df["uk-only"] = "K02000001"
        df["geography"] = "United Kingdom"
        
        # For the three topic dimensions, they appear to have put the code and label together.
        # just need to split them out
        
        # TODO - messy. Less lambda more func
        # NOTE - replacing "/" with "-" as "/" has syntactical meaning in Cypher and breaks dimension importer
        df["trade-commodity"] = source["COMMODITY"].map(lambda x: x.split(" ")[0]).str.replace("/", "-")
        df["commodity"] = source["COMMODITY"].map(lambda x: " ".join(x.split(" ")[1:]))
        
        df["trade-country"] = source["COUNTRY"].map(lambda x: x.split(" ")[0])
        df["country"] = source["COUNTRY"].map(lambda x: " ".join(x.split(" ")[1:]))
        
        df["trade-direction"] = source["DIRECTION"].map(lambda x: x.split(" ")[0])
        df["direction"] = source["DIRECTION"].map(lambda x: x.split(" ")[1])
        
        allFrames.append(df)
        print("Generated sub-dataframe for {dc} from source {n} of 2.".format(dc=dateCol, n=num))
    
allDf = pd.concat(allFrames)

allDf['v4_0'] = allDf['v4_0'].apply(v4Integers) #changes floats to string-integers

allDf['country'] = allDf['trade-country'].apply(CountryLabels) #change country labels
allDf['commodity'] = allDf['trade-commodity'].apply(CommodityLabels) #change commodity labels

allDf.to_csv("v4_Trade.csv", index=False) # output to csv
print("v4 File successfully generated.")

# Sanity check output, 5 lines only
allDf[:5]

Generated sub-dataframe for 1998JAN from source 1 of 2.
Generated sub-dataframe for 1998FEB from source 1 of 2.
Generated sub-dataframe for 1998MAR from source 1 of 2.
Generated sub-dataframe for 1998APR from source 1 of 2.
Generated sub-dataframe for 1998MAY from source 1 of 2.
Generated sub-dataframe for 1998JUN from source 1 of 2.
Generated sub-dataframe for 1998JUL from source 1 of 2.
Generated sub-dataframe for 1998AUG from source 1 of 2.
Generated sub-dataframe for 1998SEP from source 1 of 2.
Generated sub-dataframe for 1998OCT from source 1 of 2.
Generated sub-dataframe for 1998NOV from source 1 of 2.
Generated sub-dataframe for 1998DEC from source 1 of 2.
Generated sub-dataframe for 1999JAN from source 1 of 2.
Generated sub-dataframe for 1999FEB from source 1 of 2.
Generated sub-dataframe for 1999MAR from source 1 of 2.
Generated sub-dataframe for 1999APR from source 1 of 2.
Generated sub-dataframe for 1999MAY from source 1 of 2.
Generated sub-dataframe for 1999JUN from source 

Generated sub-dataframe for 2010APR from source 1 of 2.
Generated sub-dataframe for 2010MAY from source 1 of 2.
Generated sub-dataframe for 2010JUN from source 1 of 2.
Generated sub-dataframe for 2010JUL from source 1 of 2.
Generated sub-dataframe for 2010AUG from source 1 of 2.
Generated sub-dataframe for 2010SEP from source 1 of 2.
Generated sub-dataframe for 2010OCT from source 1 of 2.
Generated sub-dataframe for 2010NOV from source 1 of 2.
Generated sub-dataframe for 2010DEC from source 1 of 2.
Generated sub-dataframe for 2011JAN from source 1 of 2.
Generated sub-dataframe for 2011FEB from source 1 of 2.
Generated sub-dataframe for 2011MAR from source 1 of 2.
Generated sub-dataframe for 2011APR from source 1 of 2.
Generated sub-dataframe for 2011MAY from source 1 of 2.
Generated sub-dataframe for 2011JUN from source 1 of 2.
Generated sub-dataframe for 2011JUL from source 1 of 2.
Generated sub-dataframe for 2011AUG from source 1 of 2.
Generated sub-dataframe for 2011SEP from source 

Generated sub-dataframe for 2001JUL from source 2 of 2.
Generated sub-dataframe for 2001AUG from source 2 of 2.
Generated sub-dataframe for 2001SEP from source 2 of 2.
Generated sub-dataframe for 2001OCT from source 2 of 2.
Generated sub-dataframe for 2001NOV from source 2 of 2.
Generated sub-dataframe for 2001DEC from source 2 of 2.
Generated sub-dataframe for 2002JAN from source 2 of 2.
Generated sub-dataframe for 2002FEB from source 2 of 2.
Generated sub-dataframe for 2002MAR from source 2 of 2.
Generated sub-dataframe for 2002APR from source 2 of 2.
Generated sub-dataframe for 2002MAY from source 2 of 2.
Generated sub-dataframe for 2002JUN from source 2 of 2.
Generated sub-dataframe for 2002JUL from source 2 of 2.
Generated sub-dataframe for 2002AUG from source 2 of 2.
Generated sub-dataframe for 2002SEP from source 2 of 2.
Generated sub-dataframe for 2002OCT from source 2 of 2.
Generated sub-dataframe for 2002NOV from source 2 of 2.
Generated sub-dataframe for 2002DEC from source 

Generated sub-dataframe for 2013NOV from source 2 of 2.
Generated sub-dataframe for 2013DEC from source 2 of 2.
Generated sub-dataframe for 2014JAN from source 2 of 2.
Generated sub-dataframe for 2014FEB from source 2 of 2.
Generated sub-dataframe for 2014MAR from source 2 of 2.
Generated sub-dataframe for 2014APR from source 2 of 2.
Generated sub-dataframe for 2014MAY from source 2 of 2.
Generated sub-dataframe for 2014JUN from source 2 of 2.
Generated sub-dataframe for 2014JUL from source 2 of 2.
Generated sub-dataframe for 2014AUG from source 2 of 2.
Generated sub-dataframe for 2014SEP from source 2 of 2.
Generated sub-dataframe for 2014OCT from source 2 of 2.
Generated sub-dataframe for 2014NOV from source 2 of 2.
Generated sub-dataframe for 2014DEC from source 2 of 2.
Generated sub-dataframe for 2015JAN from source 2 of 2.
Generated sub-dataframe for 2015FEB from source 2 of 2.
Generated sub-dataframe for 2015MAR from source 2 of 2.
Generated sub-dataframe for 2015APR from source 

Unnamed: 0,v4_0,mmm-yy,time,uk-only,geography,trade-commodity,commodity,trade-country,country,trade-direction,direction
0,0,Jan-98,Jan-98,K02000001,United Kingdom,0,0 - Food & live animals,AD,AD - Andorra,IM,Imports
1,1,Jan-98,Jan-98,K02000001,United Kingdom,0,0 - Food & live animals,AE,AE - United Arab Emirates,IM,Imports
2,0,Jan-98,Jan-98,K02000001,United Kingdom,0,0 - Food & live animals,AF,AF - Afghanistan,IM,Imports
3,0,Jan-98,Jan-98,K02000001,United Kingdom,0,0 - Food & live animals,AG,AG - Antigua & Barbuda,IM,Imports
4,0,Jan-98,Jan-98,K02000001,United Kingdom,0,0 - Food & live animals,AI,AI - Anguilla,IM,Imports
