Skip to content

GSS-Cogs/HTMLSpecTest

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

InfoTransform Wiki

NB The functionality of these functions assumes that each tab is being processed out to it's own dataframe to then be joined together in stage 2 transform.
This also assumes when using the trace you are using the tab.name as the table name when storing each tab.

Functions

Main

Info Transform

infoTransform takes 3 arguments to function: tabName, tabTitle, and tabColumns.

The tabName (string), as one might expect, is the unchanged name of the tab in the excel document being transformed. (must remain unchanged as this is used to determine if the tab is alrady present in the spec upon rerunning)

The tabTitle (string) is usually the title of the tab given at the top of each sheet, if there isn't one then this can be chosen by the DE as it is not used to determine if the tab spec is already present in the info.json
This must be given as a string so if taking the title from the tab use cellCont to get the string value (see 'Cell Contents')

Lastly, the tabColumns (list) is given as the list of columns in the output dataset, this must be given as the columns variable that is being used for the trace as any column value passed to this function which isnt in the trace columns list will cause an error.

The function will run through the list of column headers call to the trace function for the most recent stored variable for each column.

The function will check if the transformStage section of the info.json file exists, if not it will create one and add the spec. If it does then it will pull out the list of specs and check if the given tab already exists in the list, if it does then it will replace the existing spec, if not then it will append it.

This function also reads and writes to the info.json file so there is no need to read/write it manually.



This function should be used in the loop for each tab (I have it just before the list of databaker dimensions)

def infoTransform(tabName, tabTitle, tabColumns):

    dictList = []

    with open('info.json') as info:
        data = info.read()

    infoData = json.loads(data)

    columnInfo = {}

    for i in tabColumns:
        underI = i.replace(' ', '_')
        columnInfo[i] = getattr(getattr(trace, underI), 'var')

    dicti = {'name' : tabName,
             'title' : tabTitle,
             'columns' : columnInfo}

    if infoData.get('transform').get('transformStage') == None:
        infoData['transform']['transformStage'] = []
        dictList.append(dicti)
    else:
        dictList = infoData['transform']['transformStage']
        index = next((index for (index, d) in enumerate(dictList) if d["name"] == tabName), None)
        if index is None :
            dictList.append(dicti)
        else:
            dictList[index] = dicti

    infoData['transform']['transformStage'] = dictList

    with open('info.json', 'w') as info:
        info.write(json.dumps(infoData, indent=4).replace('null', '"Not Applicable"'))

Info Comments

infoComments is very similar to infoTransform except its primary use is for appending post transform changes to the spec.

For example, when changing a DATAMARKER df = df.replace({'Marker' : {'*' : 'Statistical Disclosure Applied'}})
You would then add the trace trace.Marker("Change * DataMarker to 'Statistical Disclosure Applied'")
By running this function at the end of each tab's post transform script you can add this to the spec in the following form:

"Marker": [
"Change * DataMarker to 'Statistical Disclosure Applied'"]

This function takes arguments: tabName, and tabColumns.

For this function to work correctly each tab should be being transformed into its own dataframe and therefore the post transform changes should be tab by tab.

This is because this functions calls the trace function variables by tab name so there should be a table for each tab using the tabName as the trace table name.

As the columns variable that was used for the infoTransform will likely no longer match up with the tab currently being transformed, you can use list(df.columns) to get a list of the columns for the current dataframe.

def infoComments(tabName, tabColumns):

    with open('info.json') as info:
        data = info.read()

    infoData = json.loads(data)

    columnInfo = {}

    for i in tabColumns:
        comments = []
        underI = i.replace(' ', '_')
        for j in getattr(getattr(trace, underI), 'comments'):
            if j == []:
                continue
            else:
                comments.append(':'.join(str(j).split(':', 3)[3:])[:-2].strip().lstrip('\"').rstrip('\"'))
        columnInfo[i] = comments

    columnInfo = {key:val for key, val in columnInfo.items() if val != ""}
    columnInfo = {key:val for key, val in columnInfo.items() if val != []}

    dicti = {'name' : tabName,
             'columns' : columnInfo}

    dictList = infoData['transform']['transformStage']
    index = next((index for (index, d) in enumerate(dictList) if d["name"] == tabName), None)
    if index is None :
        print('Tab not found in Info.json')
    else:
        dictList[index]['postTransformNotes'] = dicti

    with open('info.json', 'w') as info:
        info.write(json.dumps(infoData, indent=4).replace('null', '"Not Applicable"'))

Info Notes

infoNotes is used to append any notes from the tabs to info.json.

It assumes that all the notes are saved under one string such as:

notes = """
Size of care homes is determined by number of beds.
Cumulative numbers are counts since data was first reported and will include cases which are no longer active
'w/c 15th June 2020' Based on return of 966 of Scotland's 1,080 adult care homes
'w/c 22nd June 2020' Based on return of 987 of Scotland's 1,080 adult care homes
'w/c 29th June 2020' Based on return of 1,006 of Scotland's 1,080 adult care homes
"""
def infoNotes(notes):

    with open('info.json') as info:
        data = info.read()

    infoData = json.loads(data)

    infoData['transform']['Stage One Notes'] = notes

    with open('info.json', 'w') as info:
        info.write(json.dumps(infoData, indent=4).replace('null', '"Not Applicable"'))

Excel Range

excelRange takes a bag of cells created with databaker and returns the lowest cell value and the highest cell value in the following form:

'{' + lowx + lowy + '-' + highx + highy + '}'

This is a core function for the transform spec as it can be used in conjuntion with the trace functions as follows:

trace.Period('Values found in range: {}', var = excelRange(period))

Which can then represent the location of the data being extracted as "Period": "{B6-B19}", for example

def excelRange(bag):
    xvalues = []
    yvalues = []
    for cell in bag:
        coordinate = cellLoc(cell)
        xvalues.append(''.join([i for i in coordinate if not i.isdigit()]))
        yvalues.append(int(''.join([i for i in coordinate if i.isdigit()])))
    high = 0
    low = 0
    for i in xvalues:
        if col2num(i) >= high:
            high = col2num(i)
        if low == 0:
            low = col2num(i)
        elif col2num(i) < low:
            low = col2num(i)
        highx = colnum_string(high)
        lowx = colnum_string(low)
    highy = str(max(yvalues))
    lowy = str(min(yvalues))

Cell Contents

cellCont is similar to excelRange except that it takes only a single cell as an argument, this is useful for tab titles and tab ranging periods.

def cellCont(cell):
    return re.findall(r"'([^']*)'", str(cell))[0]

Supplementary

Column to Number

col2num is used by excelRange. This function takes in a excel column coordinate (such as AX) and returns a number value of the location (eg 50).
This is to simplify the ranking of the cell values in the range.

def col2num(col):
    num = 0
    for c in col:
        if c in string.ascii_letters:
            num = num * 26 + (ord(c.upper()) - ord('A')) + 1
    return num

Cell Locate

cellLoc is used by excelRange. This function takes a single cell from a cell bag and returns the cell coordinate.
This is used in conjunction with col2num to rank cell values in a range.

def cellLoc(cell):
    return right(str(cell), len(str(cell)) - 2).split(" ", 1)[0]

Column Number to String

colnum_string is used by excelRange. This function is the reverse of col2num, it takes in a number (eg 50) and returns the corresponding column value (eg AX)
This is used in conjunction with col2num to rank cell values in a range.

def colnum_string(n):
    string = ""
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        string = chr(65 + remainder) + string
    return string

About

Handlebars representation of Spec Test Repo

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published