# TOS format parser

This notebook is an attempt to derive data validation rules from NHS metadata: the **Technical Output Specification** (TOS) of the [Hospital Episode Statistics Data Dictionary](https://digital.nhs.uk/data-and-information/data-tools-and-services/data-services/hospital-episode-statistics/hospital-episode-statistics-data-dictionary).

In [3]:
# Set Jupyter options
%load_ext autoreload
%autoreload 2



In [6]:
import os

import pandas as pd
import yaml

from field import Field
from rule import Rule

In [10]:
DEFAULT_TOS_PATH = "https://digital.nhs.uk/binaries/content/assets/website-assets/data-and-information/data-tools-and-services/data-services/hospital-episode-statistics/hes-data-dictionary/hes-tos-v1.15.xlsx"

In [12]:
tos_path = os.getenv('TOS_PATH', DEFAULT_TOS_PATH)
tos_file = pd.ExcelFile(tos_path)

Select the tabs we want to work with.

In [13]:
for sheet_name in {s for s in tos_file.sheet_names if 'CUREd' in s}:
    # TODO
    pass
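The loop above is still a placeholder. As a minimal sketch of the selection step, assuming the goal is to keep every tab whose name contains a keyword, the filter could be a small helper. The name `select_sheets`, the case-insensitive match, and the example sheet names other than `HES APC TOS` are assumptions for illustration and are not part of this notebook.

```python
def select_sheets(sheet_names, keyword):
    """Return the sheet names that contain keyword, in workbook order."""
    # A list (rather than a set) keeps the tabs in their original order.
    # The comparison is case-insensitive, unlike the `in` test in the cell above.
    needle = keyword.lower()
    return [name for name in sheet_names if needle in name.lower()]


# Hypothetical sheet names, for illustration only
example_names = ["Contents", "HES APC TOS", "HES OP TOS", "Change log"]
selected = select_sheets(example_names, "tos")
print(selected)
```

With a real workbook, `tos_file.sheet_names` would be passed in place of `example_names`.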

Load the `HES APC TOS` sheet into memory.

In [14]:
tos = pd.read_excel(tos_file, sheet_name='HES APC TOS', index_col='Field', skiprows=1)
tos.info()

<class 'pandas.core.frame.DataFrame'>
Index: 379 entries, A_NUMACP to YEAR
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   Field name                     379 non-null    object
 1   Format                         379 non-null    object
 2   HES Legacy Field Status (Y/N)  84 non-null     object
 3   Availability

Build a collection of rules from this metadata.

In [15]:
for row in tos.sample(5).reset_index().replace(pd.NA, None).to_dict(orient='records'):
    field = Field(
        name=row['Field'],
        title=row['Field name'],
        format_=row['Format'],
        values=row['Values'],
        description=row['Description'],
    )

    print(field, field.format, sep='\t')
    print(field.values)
    for rule in field.generate_rules():
        print(rule)
    print('_'*128)

PRESENT_ON_ADMISSION_INDICATOR	String(1)
Y = Patient Diagnosis Already Present
N = Patient Diagnosis Not Already Present
8 = Not Applicable (Indication Of This Patient Diagnosis On Admission Not Required Nationally)
9 = Not Known Where The Patient Diagnosis Was Present On Admission
description: Present On Admission Indicator is String(1)
expr: is.character(PRESENT_ON_ADMISSION_INDICATOR) & nchar(PRESENT_ON_ADMISSION_INDICATOR)
  == 1
name: PRESENT_ON_ADMISSION_INDICATOR String(1)

________________________________________________________________________________________________________________________________
MARSTAT	String(1)
From 1/10/2006 onwards: 
S = Single  
M = Married/Civil Partner  
D = Divorced/Person whose Civil Partnership has been dissolved  
W = Widowed/Surviving Civil Partner  
P = Separated  
N = Not disclosed.  
8 = Not applicable
9 = Not known

Prior to 1/10/2006: 
1 = Single 
2 = Married, including separated  
3 = Divorced  
4 = Widowed 
8 = Not applicable
9 = Not know

NotImplementedError: 
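The traceback is not shown, so the cause of the `NotImplementedError` is unknown. One plausible cause is that the `Values` cell for `MARSTAT` holds two date-dependent code lists rather than one. As a rough sketch, not the `Field` implementation, such a cell could be split into sections of `code = label` pairs. The function name `parse_values` and the convention that a line ending in `:` starts a new section are assumptions.

```python
import re

# One "code = label" pair per line, e.g. "Y = Patient Diagnosis Already Present"
CODE_LINE = re.compile(r"^(?P<code>\S+)\s*=\s*(?P<label>.+)$")


def parse_values(text):
    """Split a Values cell into {section heading: {code: label}}.

    A line ending in ":" starts a new section. Pairs that appear before
    any heading are stored under the key None. Lines that are neither a
    heading nor a "code = label" pair are ignored.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.endswith(":"):
            current = stripped[:-1]
            sections.setdefault(current, {})
            continue
        match = CODE_LINE.match(stripped)
        if match:
            sections.setdefault(current, {})[match["code"]] = match["label"]
    return sections


sample = (
    "From 1/10/2006 onwards: \n"
    "S = Single  \n"
    "M = Married/Civil Partner  \n"
    "\n"
    "Prior to 1/10/2006: \n"
    "1 = Single \n"
    "2 = Married, including separated  \n"
)
for heading, codes in parse_values(sample).items():
    print(heading, codes)
```

A single flat list such as the one for `PRESENT_ON_ADMISSION_INDICATOR` ends up under the `None` key, so both layouts pass through the same function.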

In [None]:
rules = []
for row in tos.sample(50).reset_index().replace(pd.NA, None).to_dict(orient='records'):
    print(row['Field'], row['Format'], row['Values'], sep='\t')
    field = Field(
        name=row['Field'],
        title=row['Field name'],
        format_=row['Format'],
        values=row['Values'],
        description=row['Description'],
    )

    for rule in field.generate_rules():
        rules.append(dict(rule))

    print('____________________')

In [None]:
rules_data = yaml.dump(dict(rules=rules))
print(rules_data)

with open('rules.yaml', 'w') as file:
    file.write(rules_data)
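The `Field` and `Rule` classes are imported from local modules that are not shown here. Judging only by the rule printed earlier for `PRESENT_ON_ADMISSION_INDICATOR`, each entry written to `rules.yaml` appears to carry a `name`, a `description`, and an R-style `expr`. A minimal sketch of how a `String(n)` format might map to such an entry follows; the function `string_length_rule` and its regular expression are hypothetical and cover only this one format.

```python
import re

# Matches fixed-length string formats such as "String(1)"
STRING_FORMAT = re.compile(r"^String\((\d+)\)$")


def string_length_rule(name, title, format_):
    """Build a rule dict for a fixed-length string format such as String(1)."""
    fmt = format_.strip()
    match = STRING_FORMAT.match(fmt)
    if match is None:
        # Other formats are out of scope for this sketch
        raise NotImplementedError(f"Unsupported format: {format_!r}")
    length = int(match.group(1))
    return {
        "name": f"{name} {fmt}",
        "description": f"{title} is {fmt}",
        "expr": f"is.character({name}) & nchar({name}) == {length}",
    }


rule = string_length_rule(
    name="PRESENT_ON_ADMISSION_INDICATOR",
    title="Present On Admission Indicator",
    format_="String(1)",
)
for key, value in rule.items():
    print(f"{key}: {value}")
```

Raising `NotImplementedError` for unrecognised formats keeps unsupported fields visible instead of silently producing no rule.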