# Simple SDMX > V3 loadfile transformation

It's "simple" because we aren't resolving URIs etc.; we just use pre-downloaded standard XML definitions (Structures.xml) to transform the data. The standard definitions don't cover everything in our example SDMX file though, so a few dimensions remain unconverted at the end.

Longer term we'll need to use the global SDMX resources and APIs to get the specific structural data for each SDMX file. It's certainly something we can do, but it wasn't a priority for this first look.

The rest of this workbook just walks through the process, displaying the file at each stage. If you don't use Python it should still mostly make sense.

PLEASE NOTE - we're basically transforming well-defined concepts into text labels here so that we can load them and turn them back into well-defined concepts :) This is enough for alpha, but conversations will be needed around mapping SDMX concepts and codelists to the systems we are developing.


## Get the Codes

This code block just constructs some lookups (a Python dictionary) that we can use later to translate the SDMX codes into readable text. There's a preview of sorts at the end of this code block.


In [76]:


# Pull in the libraries we need
import pandas as pd
import requests
from bs4 import BeautifulSoup


def updateMapping(SDMXmapping, allCodeLists, justConcepts=False):
    
    for cl in allCodeLists:
        clId = cl['id']
        
        if not justConcepts:
            SDMXmapping.update({clId:{'name':'', 'codes': {}}})

        count = 0
        for c in cl.children:
            
            if count == 0:
                if justConcepts:
                    SDMXmapping.update({clId:c.text})
                else:
                    SDMXmapping[clId]['name'] = c.text
                                
            if not justConcepts:
                try:
                    SDMXmapping[clId]['codes'].update({c['id']:c.text.strip()})
                except (KeyError, TypeError):
                    pass # skip plain strings and children without an id
            count += 1  
    return SDMXmapping



# for now just downloaded it (security certificates).
with open('Structures.xml', 'r', encoding='utf8') as SDMXstruc:
    
    soup = BeautifulSoup(SDMXstruc, 'lxml')
    
    SDMXmapping = {}
    
    # get codelist mappings
    allCodeLists = soup.find_all('str:codelist')
    SDMXmapping = updateMapping(SDMXmapping, allCodeLists)
    
    # get concept mappings
    conceptList = soup.find_all('str:concept')
    SDMXmapping = updateMapping(SDMXmapping, conceptList, justConcepts=True)   



## Example 'lookup'
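
As a rough illustration of what the lookup built above might look like, here is a hand-written sketch. The two codelists and their labels are simplified from values that show up later in this workbook (`Frequency`, `Unit Multiplier`, `Millions`); the `lookup_label` helper is hypothetical and not part of the original code.

```python
# Hand-written sample in the same shape as SDMXmapping (values are illustrative).
sample_mapping = {
    'CL_FREQ': {'name': 'Frequency', 'codes': {'Q': 'Quarterly'}},
    'CL_UNIT_MULT': {'name': 'Unit Multiplier', 'codes': {'6': 'Millions'}},
}


def lookup_label(mapping, column, code):
    """Return the plain-text label for a code, or the code itself if unknown."""
    codelist = mapping.get('CL_' + column.upper())
    if not isinstance(codelist, dict):
        return code  # no codelist for this column
    return codelist['codes'].get(code, code)


print(lookup_label(sample_mapping, 'unit_mult', '6'))  # Millions
print(lookup_label(sample_mapping, 'ref_area', 'GB'))  # GB (no codelist, so unchanged)
```

Falling back to the original code keeps unmapped columns visible rather than silently blanking them.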

## Flatten the SDMX

Literally that. No other changes, 10 line preview at the end.

In [77]:
# flatten the sdmx

inputfile = 'T0111Quarterly_4(SDMXExport) (11).xml'

from bs4 import BeautifulSoup
import pandas as pd

with open(inputfile, 'r') as f:
    soup = BeautifulSoup(f, 'lxml')

    # for each data series
    dataSeries = soup.html.body.compactdata.find_all('na_:series') # TODO - the na_ prefix is specific to national accounts, needs a generic approach
    
    # build dict
    finalDict = {}
    initial_keys = ['obs_value', 'time_period', 'obs_status', 'conf_status']
    for ik in initial_keys:
        finalDict.update({ik:[]})
    
    # add the keys in use per series
    series_keys = dataSeries[0].attrs.keys()
    for sk in series_keys:
        finalDict.update({sk:[]})
        
        
    # EXTRACT
    for dSeries in dataSeries:
        for ob in dSeries.findChildren():    
            for ik in initial_keys:
                finalDict[ik].append(ob[ik])
            for sk in series_keys:
                finalDict[sk].append(dSeries[sk])
                
    obs_file = pd.DataFrame.from_dict(finalDict)


obs_file[:10] # preview


Unnamed: 0,accounting_entry,activity,adjustment,comment_ts,compiling_org,conf_status,counterpart_area,counterpart_sector,decimals,expenditure,...,prices,ref_area,ref_sector,sto,table_identifier,time_period,title,transformation,unit_measure,unit_mult
0,_Z,A,N,\n,GB1,F,W2,S1,3,_Z,...,_Z,GB,S1,EMP,T0111,1995-Q1,Employment by Industry,N,HW,6
1,_Z,A,N,\n,GB1,F,W2,S1,3,_Z,...,_Z,GB,S1,EMP,T0111,1995-Q2,Employment by Industry,N,HW,6
2,_Z,A,N,\n,GB1,F,W2,S1,3,_Z,...,_Z,GB,S1,EMP,T0111,1995-Q3,Employment by Industry,N,HW,6
3,_Z,A,N,\n,GB1,F,W2,S1,3,_Z,...,_Z,GB,S1,EMP,T0111,1995-Q4,Employment by Industry,N,HW,6
4,_Z,A,N,\n,GB1,F,W2,S1,3,_Z,...,_Z,GB,S1,EMP,T0111,1996-Q1,Employment by Industry,N,HW,6
5,_Z,A,N,\n,GB1,F,W2,S1,3,_Z,...,_Z,GB,S1,EMP,T0111,1996-Q2,Employment by Industry,N,HW,6
6,_Z,A,N,\n,GB1,F,W2,S1,3,_Z,...,_Z,GB,S1,EMP,T0111,1996-Q3,Employment by Industry,N,HW,6
7,_Z,A,N,\n,GB1,F,W2,S1,3,_Z,...,_Z,GB,S1,EMP,T0111,1996-Q4,Employment by Industry,N,HW,6
8,_Z,A,N,\n,GB1,F,W2,S1,3,_Z,...,_Z,GB,S1,EMP,T0111,1997-Q1,Employment by Industry,N,HW,6
9,_Z,A,N,\n,GB1,F,W2,S1,3,_Z,...,_Z,GB,S1,EMP,T0111,1997-Q2,Employment by Industry,N,HW,6
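
One possible way to deal with the `na_` TODO is to match elements by local name and ignore the namespace prefix entirely. The sketch below does this with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. The sample XML, its namespace URI and the `flatten` and `local_name` helpers are all made up for illustration; only the two observation values come from this workbook.

```python
import xml.etree.ElementTree as ET

# Tiny invented sample; the namespace URI is a placeholder.
SAMPLE = """<CompactData xmlns:na_="urn:example:na">
  <na_:DataSet>
    <na_:Series FREQ="Q" REF_AREA="GB">
      <na_:Obs TIME_PERIOD="1995-Q1" OBS_VALUE="279.902"/>
      <na_:Obs TIME_PERIOD="1995-Q2" OBS_VALUE="303.099"/>
    </na_:Series>
  </na_:DataSet>
</CompactData>"""


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tag names."""
    return tag.rsplit('}', 1)[-1]


def flatten(xml_text):
    """Return one dict per observation, combining series and observation attributes."""
    rows = []
    for series in ET.fromstring(xml_text).iter():
        if local_name(series.tag) != 'Series':
            continue
        for obs in series:
            if local_name(obs.tag) != 'Obs':
                continue
            # lower-case the keys to mirror the column names used in this workbook
            row = {key.lower(): value for key, value in series.attrib.items()}
            row.update({key.lower(): value for key, value in obs.attrib.items()})
            rows.append(row)
    return rows


for row in flatten(SAMPLE):
    print(row)
```

A list of dicts like this can be passed straight to `pd.DataFrame` to get the same kind of flat table as above.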


## Tidy things up a bit

Change the column order to something more like what we're used to.

10 line preview at the end.


In [78]:
# order it to be more like our typical load files

reorder = ['obs_value',
'obs_status',
'time_period',
'ref_area',        
'unit_measure',
'unit_mult',
'title',
'accounting_entry',
'activity',
'adjustment',
'comment_ts',
'compiling_org',
'conf_status',
'counterpart_area',
'counterpart_sector',
'decimals',
'expenditure',
'freq',
'instr_asset',
'prices',
'ref_sector',
'sto',
'table_identifier',
'transformation']

obs_file = obs_file[reorder]


obs_file[:10] # preview


Unnamed: 0,obs_value,obs_status,time_period,ref_area,unit_measure,unit_mult,title,accounting_entry,activity,adjustment,...,counterpart_sector,decimals,expenditure,freq,instr_asset,prices,ref_sector,sto,table_identifier,transformation
0,279.902,A,1995-Q1,GB,HW,6,Employment by Industry,_Z,A,N,...,S1,3,_Z,Q,_Z,_Z,S1,EMP,T0111,N
1,303.099,A,1995-Q2,GB,HW,6,Employment by Industry,_Z,A,N,...,S1,3,_Z,Q,_Z,_Z,S1,EMP,T0111,N
2,307.714,A,1995-Q3,GB,HW,6,Employment by Industry,_Z,A,N,...,S1,3,_Z,Q,_Z,_Z,S1,EMP,T0111,N
3,285.639,A,1995-Q4,GB,HW,6,Employment by Industry,_Z,A,N,...,S1,3,_Z,Q,_Z,_Z,S1,EMP,T0111,N
4,258.886,A,1996-Q1,GB,HW,6,Employment by Industry,_Z,A,N,...,S1,3,_Z,Q,_Z,_Z,S1,EMP,T0111,N
5,291.308,A,1996-Q2,GB,HW,6,Employment by Industry,_Z,A,N,...,S1,3,_Z,Q,_Z,_Z,S1,EMP,T0111,N
6,302.456,A,1996-Q3,GB,HW,6,Employment by Industry,_Z,A,N,...,S1,3,_Z,Q,_Z,_Z,S1,EMP,T0111,N
7,277.518,A,1996-Q4,GB,HW,6,Employment by Industry,_Z,A,N,...,S1,3,_Z,Q,_Z,_Z,S1,EMP,T0111,N
8,255.679,A,1997-Q1,GB,HW,6,Employment by Industry,_Z,A,N,...,S1,3,_Z,Q,_Z,_Z,S1,EMP,T0111,N
9,291.108,A,1997-Q2,GB,HW,6,Employment by Industry,_Z,A,N,...,S1,3,_Z,Q,_Z,_Z,S1,EMP,T0111,N
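
The explicit `reorder` list only works while every listed column exists in the file. A more forgiving variant, sketched below with a hypothetical `ordered_columns` helper, puts the preferred columns first when they are present and keeps everything else in its original order.

```python
def ordered_columns(existing, preferred):
    """Preferred columns first (if present), then the rest in their original order."""
    front = [col for col in preferred if col in existing]
    rest = [col for col in existing if col not in front]
    return front + rest


existing = ['accounting_entry', 'time_period', 'obs_value', 'ref_area']
preferred = ['obs_value', 'obs_status', 'time_period', 'ref_area']
print(ordered_columns(existing, preferred))
# ['obs_value', 'time_period', 'ref_area', 'accounting_entry']
```

With a DataFrame this could be applied as `obs_file[ordered_columns(list(obs_file.columns), reorder)]`.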


## Perform Lookups

Get rid of some of those SDMX codes (for now...) and replace them with their plain-text values.

4 line preview at the end.

In [79]:

# lookup the values ... do we even want to do this? debatable. SDMX is more fully ...
# ... described than text and that should be a good thing.

for i, row in obs_file.iterrows():
    
    for header in obs_file.columns.values:
            
        # some things we dont want as text :)
        if header not in ['decimals']:
            
            try:
                obs_file.at[i, header] = SDMXmapping['CL_' + header.upper()]['codes'][obs_file.at[i, header]]
            except (KeyError, TypeError):
                pass  # no codelist for this column, or the code isn't in the list
            
        if obs_file.at[i, header] == '_Z':  # dump "not applicable" entries
            obs_file.at[i, header] = ''

# hacky "drop empty columns" code.

# if a column has one unique entry equal to '' (or 'Not applicable'), add it to the list to drop
dropList = []
for col in obs_file.columns.values:
        if len(obs_file[col].unique()) == 1 and obs_file[col].unique()[0] == '':
            dropList.append(col)
        if len(obs_file[col].unique()) == 1 and obs_file[col].unique()[0] == 'Not applicable':
            dropList.append(col)

# then drop
for drop in dropList:
    
    # get column index number, then drop the column
    delMe = obs_file.columns.get_loc(drop)
    obs_file = obs_file.drop(obs_file.columns[delMe], axis=1)  

    
# redo the headers in keeping with SDMX definition
for header in obs_file.columns.values:
    
    try:
        newName = SDMXmapping['CL_' + header.upper()]['name']
        obs_file = obs_file.rename(columns={header:newName})
    except:
        pass

obs_file[:4]  # preview


Unnamed: 0,obs_value,Observation Status,time_period,ref_area,unit_measure,Unit Multiplier,title,Industrial activity code list,Adjustment indicator,comment_ts,compiling_org,Confidentiality Status,counterpart_area,counterpart_sector,Decimals,Frequency,ref_sector,sto,table_identifier,Transformation codes
0,279.902,Normal valueTo be used as default value if no ...,1995-Q1,GB,HW,Millions,Employment by Industry,"Agriculture, forestry and fishing",Neither seasonally adjusted nor calendar adjus...,\n,GB1,Free (free for publication)Used for observatio...,W2,S1,3,QuarterlyTo be used for data collected or diss...,S1,EMP,T0111,Non transformed data
1,303.099,Normal valueTo be used as default value if no ...,1995-Q2,GB,HW,Millions,Employment by Industry,"Agriculture, forestry and fishing",Neither seasonally adjusted nor calendar adjus...,\n,GB1,Free (free for publication)Used for observatio...,W2,S1,3,QuarterlyTo be used for data collected or diss...,S1,EMP,T0111,Non transformed data
2,307.714,Normal valueTo be used as default value if no ...,1995-Q3,GB,HW,Millions,Employment by Industry,"Agriculture, forestry and fishing",Neither seasonally adjusted nor calendar adjus...,\n,GB1,Free (free for publication)Used for observatio...,W2,S1,3,QuarterlyTo be used for data collected or diss...,S1,EMP,T0111,Non transformed data
3,285.639,Normal valueTo be used as default value if no ...,1995-Q4,GB,HW,Millions,Employment by Industry,"Agriculture, forestry and fishing",Neither seasonally adjusted nor calendar adjus...,\n,GB1,Free (free for publication)Used for observatio...,W2,S1,3,QuarterlyTo be used for data collected or diss...,S1,EMP,T0111,Non transformed data
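
Several labels in the preview above run a code's name straight into its description (for example "Normal valueTo be used as default value if no ..."). A likely cause is that `.text` joins the text of every child element. The sketch below shows one way to keep only the name, assuming each code has separate `Name` and `Description` children. That layout, the placeholder namespace URIs and the helper functions are assumptions for illustration, not something confirmed from Structures.xml.

```python
import xml.etree.ElementTree as ET

# Invented sample; the namespace URIs are placeholders, not the real SDMX ones.
SAMPLE = """<str:Codelist xmlns:str="urn:example:structure"
                          xmlns:com="urn:example:common" id="CL_OBS_STATUS">
  <com:Name>Observation Status</com:Name>
  <str:Code id="A">
    <com:Name>Normal value</com:Name>
    <com:Description>To be used as default value if no ...</com:Description>
  </str:Code>
</str:Codelist>"""


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tag names."""
    return tag.rsplit('}', 1)[-1]


def first_child_text(elem, name):
    """Text of the first direct child with the given local name, or ''."""
    for child in elem:
        if local_name(child.tag) == name:
            return (child.text or '').strip()
    return ''


def codelist_labels(xml_text):
    """Build {'name': ..., 'codes': {id: name}} using only the Name children."""
    root = ET.fromstring(xml_text)
    codes = {}
    for child in root:
        if local_name(child.tag) == 'Code':
            codes[child.get('id')] = first_child_text(child, 'Name')
    return {'name': first_child_text(root, 'Name'), 'codes': codes}


print(codelist_labels(SAMPLE))
# {'name': 'Observation Status', 'codes': {'A': 'Normal value'}}
```

The descriptions are not lost by doing this; they could be kept in a parallel dictionary if the loader ever needs them.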


## Remove what we don't need

At this stage we need to remove any columns that are not part of the V3 structure (or are just pointless, so anything with an identical or blank entry on every row for starters). Longer term this would have to be programmatic rather than explicit, as the list of columns will be subject to change.

Interestingly, a lot of these (unit of measure, title etc) are metadata and could feasibly be piped in that direction instead.

Disregarded columns containing (arguably) metadata:

* title
* unit
* unit multiplier (note to self: does this need incorporating into obs?)
* compiling org
* frequency
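
A programmatic version of this step could start by finding the columns that hold the same value on every row, since those are the natural candidates for metadata. The sketch below works on plain row dictionaries; the `split_constant_columns` helper is hypothetical and the sample rows are cut down from the previews in this workbook.

```python
def split_constant_columns(rows):
    """Split column names into (constant, varying) across a list of row dicts."""
    constant, varying = [], []
    for col in rows[0]:
        values = {row[col] for row in rows}
        (constant if len(values) == 1 else varying).append(col)
    return constant, varying


rows = [
    {'obs_value': '279.902', 'time_period': '1995-Q1',
     'title': 'Employment by Industry', 'unit_mult': '6'},
    {'obs_value': '303.099', 'time_period': '1995-Q2',
     'title': 'Employment by Industry', 'unit_mult': '6'},
]
constant, varying = split_constant_columns(rows)
print(constant)  # ['title', 'unit_mult']
print(varying)   # ['obs_value', 'time_period']
```

Constant columns are only candidates for removal. In this example file `ref_area` is constant too, yet it is still needed as the geography dimension, so some allow-list would have to sit alongside a check like this.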

In [80]:

print(obs_file.columns.values)

# we'll start by removing the obvious mismatches:
# a lot of this is arguably metadata
remove = ['Decimals','Unit Multiplier','Frequency', 'title', 'table_identifier',
          'Observation Status', 'compiling_org', 'Transformation codes', 'comment_ts']

for r in remove:
    # get column index number, then drop the column
    try:
        delMe = obs_file.columns.get_loc(r)
        obs_file = obs_file.drop(obs_file.columns[delMe], axis=1)
    except KeyError:
        print('Delete failed on', r)

obs_file[:4]

['obs_value' 'Observation Status' 'time_period' 'ref_area' 'unit_measure'
 'Unit Multiplier' 'title' 'Industrial activity code list'
 'Adjustment indicator' 'comment_ts' 'compiling_org'
 'Confidentiality Status' 'counterpart_area' 'counterpart_sector'
 'Decimals' 'Frequency' 'ref_sector' 'sto' 'table_identifier'
 'Transformation codes']


Unnamed: 0,obs_value,time_period,ref_area,unit_measure,Industrial activity code list,Adjustment indicator,Confidentiality Status,counterpart_area,counterpart_sector,ref_sector,sto
0,279.902,1995-Q1,GB,HW,"Agriculture, forestry and fishing",Neither seasonally adjusted nor calendar adjus...,Free (free for publication)Used for observatio...,W2,S1,S1,EMP
1,303.099,1995-Q2,GB,HW,"Agriculture, forestry and fishing",Neither seasonally adjusted nor calendar adjus...,Free (free for publication)Used for observatio...,W2,S1,S1,EMP
2,307.714,1995-Q3,GB,HW,"Agriculture, forestry and fishing",Neither seasonally adjusted nor calendar adjus...,Free (free for publication)Used for observatio...,W2,S1,S1,EMP
3,285.639,1995-Q4,GB,HW,"Agriculture, forestry and fishing",Neither seasonally adjusted nor calendar adjus...,Free (free for publication)Used for observatio...,W2,S1,S1,EMP


## Creating a V3 Version

This mainly consists of disregarding columns that aren't relevant to our format.

10 line preview at the end.


In [81]:

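# anything that we don't want generating its own dimension
# time, geography and anything relating to obs are handled specially, so they go in the list

# dumping these as pointless, if they are still there.
# TODO - pointless, the last part works or it doesn't
# NOTE - 'area' and 'unit' don't match the actual column names ('ref_area', 'unit_measure'),
# so those two columns still get their own dimensions in the loop below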
disregard = ['obs_value', 'obs_status', 'time_period', 'area', 'unit', 'unit_mult', 'table_identifier', 
             'transformation', 'freq', 'comment_ts', 'decimals', 'conf_status', 'compiling_org', 'title']
    
v3_file = pd.DataFrame(columns=['Observation', 'Data_Marking', 'Observation_Type_Value'])
v3_file['Observation'] = obs_file['obs_value']

# dim 1: time
v3_file['Dimension_Hierarchy_1'] = 'time'
v3_file['Dimension_Name_1'] = ''
v3_file['Dimension_Value_1'] = obs_file['time_period']

# dim 2: geography
v3_file['Dimension_Hierarchy_2'] = ''
v3_file['Dimension_Name_2'] = 'Geography'
v3_file['Dimension_Value_2'] = obs_file['ref_area']

# whatever's left
repeatHeaders = ['Dimension_Hierarchy_', 'Dimension_Name_', 'Dimension_Value_']
counter = 3 # as 1 and 2 are taken by time and geography
for header in obs_file.columns.values:
    if header not in disregard:
        v3_file['Dimension_Hierarchy_' + str(counter)] = ''
        v3_file['Dimension_Name_' + str(counter)] = header
        v3_file['Dimension_Value_' + str(counter)] = obs_file[header]
        counter += 1

v3_file.fillna('', inplace=True)

v3_file[:10]

Unnamed: 0,Observation,Data_Marking,Observation_Type_Value,Dimension_Hierarchy_1,Dimension_Name_1,Dimension_Value_1,Dimension_Hierarchy_2,Dimension_Name_2,Dimension_Value_2,Dimension_Hierarchy_3,...,Dimension_Value_8,Dimension_Hierarchy_9,Dimension_Name_9,Dimension_Value_9,Dimension_Hierarchy_10,Dimension_Name_10,Dimension_Value_10,Dimension_Hierarchy_11,Dimension_Name_11,Dimension_Value_11
0,279.902,,,time,,1995-Q1,,Geography,GB,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
1,303.099,,,time,,1995-Q2,,Geography,GB,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
2,307.714,,,time,,1995-Q3,,Geography,GB,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
3,285.639,,,time,,1995-Q4,,Geography,GB,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
4,258.886,,,time,,1996-Q1,,Geography,GB,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
5,291.308,,,time,,1996-Q2,,Geography,GB,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
6,302.456,,,time,,1996-Q3,,Geography,GB,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
7,277.518,,,time,,1996-Q4,,Geography,GB,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
8,255.679,,,time,,1997-Q1,,Geography,GB,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
9,291.108,,,time,,1997-Q2,,Geography,GB,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
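
The same mapping can be expressed for a single record, which may make the V3 layout easier to see: three observation columns, then one hierarchy/name/value triplet per dimension. This is only a sketch. The `to_v3_row` helper is hypothetical, and unlike the cell above it disregards `ref_area` by its actual column name so that geography is not repeated as a third dimension.

```python
def to_v3_row(record, disregard):
    """Turn one flat record into a V3-style row of dimension triplets."""
    row = {
        'Observation': record['obs_value'],
        'Data_Marking': '',
        'Observation_Type_Value': '',
        # dim 1: time
        'Dimension_Hierarchy_1': 'time',
        'Dimension_Name_1': '',
        'Dimension_Value_1': record['time_period'],
        # dim 2: geography
        'Dimension_Hierarchy_2': '',
        'Dimension_Name_2': 'Geography',
        'Dimension_Value_2': record['ref_area'],
    }
    counter = 3  # 1 and 2 are taken by time and geography
    for name, value in record.items():
        if name in disregard:
            continue
        row['Dimension_Hierarchy_%d' % counter] = ''
        row['Dimension_Name_%d' % counter] = name
        row['Dimension_Value_%d' % counter] = value
        counter += 1
    return row


record = {'obs_value': '279.902', 'time_period': '1995-Q1', 'ref_area': 'GB', 'sto': 'EMP'}
row = to_v3_row(record, disregard={'obs_value', 'time_period', 'ref_area'})
print(row['Dimension_Name_3'], row['Dimension_Value_3'])  # sto EMP
```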


## Reformat time and geography for our systems

Doing these in their own section as they're a pretty big deal.

Preview at the end.


In [82]:

# time reformatting function
def changeTime(df):
    
    for i, row in df.iterrows():
        if 'Q' in row['Dimension_Value_1']:   # it's a quarter
            # write through df.at, as assigning to `row` may only change a copy
            df.at[i, 'Dimension_Name_1'] = 'Quarter'
            df.at[i, 'Dimension_Value_1'] = row['Dimension_Value_1'].replace('-', '.')
        
        # TODO - NEED MORE SDMX TIME EXAMPLES
    
    # just in case
    assert len(df['Dimension_Name_1'].unique()) == 1
    
    return df


# geography reformatting function
def changeGeog(df):
    
    # TODO - this will be horrible. We're going to need a whole SDMX plain text > ONS geographic hierarchy mapping or
    # text matching of some kind. If we're very lucky someone else has already done it.
    
    # If there's only one level it's a national code of some kind
    if len(df['Dimension_Value_2'].unique()) == 1:
        oneCode = df['Dimension_Value_2'].unique()[0]
        
        natCodes = {
            "United Kingdom":"K02000001",
            "Great Britain":"K03000001",
            "England and Wales":"K04000001",
            "England":"E92000001",
            "Wales / Cymru":"W92000004",
            "Wales":"W92000004",
            "Cymru":"W92000004",
            "Northern Ireland":"N92000002",
            "Scotland":"S92000003",
            "GB":"K03000001",
                }
        
        # raise a clear error if we don't know the code
        try:
            oneCode = natCodes[oneCode]
        except KeyError:
            raise ValueError('Unknown national geography: ' + oneCode)
        
        df['Dimension_Value_2'] = oneCode
        df['Dimension_Hierarchy_2'] = '2011STATH'
    
    return df
    
v3_file = changeTime(v3_file)
v3_file = changeGeog(v3_file)

v3_file[:10]


Unnamed: 0,Observation,Data_Marking,Observation_Type_Value,Dimension_Hierarchy_1,Dimension_Name_1,Dimension_Value_1,Dimension_Hierarchy_2,Dimension_Name_2,Dimension_Value_2,Dimension_Hierarchy_3,...,Dimension_Value_8,Dimension_Hierarchy_9,Dimension_Name_9,Dimension_Value_9,Dimension_Hierarchy_10,Dimension_Name_10,Dimension_Value_10,Dimension_Hierarchy_11,Dimension_Name_11,Dimension_Value_11
0,279.902,,,time,Quarter,1995.Q1,2011STATH,Geography,K03000001,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
1,303.099,,,time,Quarter,1995.Q2,2011STATH,Geography,K03000001,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
2,307.714,,,time,Quarter,1995.Q3,2011STATH,Geography,K03000001,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
3,285.639,,,time,Quarter,1995.Q4,2011STATH,Geography,K03000001,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
4,258.886,,,time,Quarter,1996.Q1,2011STATH,Geography,K03000001,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
5,291.308,,,time,Quarter,1996.Q2,2011STATH,Geography,K03000001,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
6,302.456,,,time,Quarter,1996.Q3,2011STATH,Geography,K03000001,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
7,277.518,,,time,Quarter,1996.Q4,2011STATH,Geography,K03000001,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
8,255.679,,,time,Quarter,1997.Q1,2011STATH,Geography,K03000001,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
9,291.108,,,time,Quarter,1997.Q2,2011STATH,Geography,K03000001,,...,W2,,counterpart_sector,S1,,ref_sector,S1,,sto,EMP
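
For the "need more SDMX time examples" TODO, one possible shape is a small function that recognises each period format by pattern. Only the quarterly case is taken from this workbook. The `Month` and `Year` labels, the decision to leave their values unchanged, and the `convert_time` helper itself are guesses that would need checking against the V3 specification.

```python
import re


def convert_time(value):
    """Return (dimension_name, dimension_value) for an SDMX-style period string."""
    if re.fullmatch(r'\d{4}-Q[1-4]', value):
        return 'Quarter', value.replace('-', '.')  # as in changeTime above
    if re.fullmatch(r'\d{4}-(0[1-9]|1[0-2])', value):
        return 'Month', value  # assumed label, value left as-is
    if re.fullmatch(r'\d{4}', value):
        return 'Year', value   # assumed label, value left as-is
    raise ValueError('Unrecognised time period: ' + value)


print(convert_time('1995-Q1'))  # ('Quarter', '1995.Q1')
print(convert_time('1995'))     # ('Year', '1995')
```

Raising on anything unrecognised is deliberate here, so that a new period format shows up as an error rather than as a mislabelled dimension.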
