# TOS format parser

This notebook attempts to derive data validation rules from NHS metadata, namely the **Technical Output Specification** (TOS). The TOS is published as the [Hospital Episode Statistics Data Dictionary](https://digital.nhs.uk/data-and-information/data-tools-and-services/data-services/hospital-episode-statistics/hospital-episode-statistics-data-dictionary).

In [1]:
# Set Jupyter options
%load_ext autoreload
%autoreload 2

In [2]:
import os

import pandas as pd
import yaml

from field import Field
from rule import Rule

In [3]:
# Constants
DEFAULT_TOS_PATH = "https://digital.nhs.uk/binaries/content/assets/website-assets/data-and-information/data-tools-and-services/data-services/hospital-episode-statistics/hes-data-dictionary/hes-tos-v1.15.xlsx"

In [6]:
tos_path = os.getenv('TOS_PATH', DEFAULT_TOS_PATH)
tos_file = pd.ExcelFile(tos_path)

Select the tabs we want to work with.

In [7]:
for sheet_name in {s for s in tos_file.sheet_names if 'CUREd' in s}:
    # TODO
    pass
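
The loop above is still a placeholder. As a minimal sketch of what sheet selection could look like, the hypothetical helper below filters worksheet names by a case-insensitive keyword and keeps workbook order (a set comprehension does not). Apart from `HES APC TOS`, the sheet names are invented for illustration, and `select_sheets` is not part of this project.

```python
def select_sheets(sheet_names, keyword):
    """Return the sheet names containing keyword (case-insensitive), in order."""
    needle = keyword.casefold()
    return [name for name in sheet_names if needle in name.casefold()]


# Hypothetical sheet names; only 'HES APC TOS' appears in this notebook
example_sheets = ["Cover", "HES APC TOS", "HES OP TOS", "Change log"]
print(select_sheets(example_sheets, "tos"))
```

In the notebook itself, `tos_file.sheet_names` would presumably take the place of `example_sheets`.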

Load the TOS into memory.

In [8]:
tos = pd.read_excel(tos_file, sheet_name='HES APC TOS', index_col='Field', skiprows=1)
tos.info()

<class 'pandas.core.frame.DataFrame'>
Index: 379 entries, A_NUMACP to YEAR
Data columns (total 12 columns):
 #   Column                                                                                                                                 Non-Null Count  Dtype 
---  ------                                                                                                                                 --------------  ----- 
 0   Field name                                                                                                                             379 non-null    object
 1   Format                                                                                                                                 379 non-null    object
 2   HES Legacy Field Status (Y/N)                                                                                                          84 non-null     object
 3   Availability                                                            

Build a collection of rules from this metadata.

In [13]:
rules = []
# Work on a random sample of 20 fields, with missing values replaced by None
records = tos.sample(20).reset_index().replace(pd.NA, None).to_dict(orient='records')
for row in records:
    print(row['Field'], row['Format'], sep='\t')
    field = Field(
        name=row['Field'],
        title=row['Field name'],
        format_=row['Format'],
        values=row['Values'],
        description=row['Description'],
    )

    # Store each generated rule as a plain dict so it can be serialised to YAML
    for rule in field.generate_rules():
        rules.append(dict(rule))


PROCODET	String(5)
PERSON_MARITAL_STATUS	String(1)
DELCHANG	String(1)
OPERSTAT	String(1)
PEREND	Date(YYYY-MM-DD)
PROVSPNO	Number
CURRWARD	String(4)
IMD04_DECILE	String(30)
CLASSPAT	String(1)
UCUM_UNIT_OF_MEASUREMENT	String(10)
LEGALGPA	String(1)
DISMETH_UNCLN	String(1)
OPERTN_COUNT	Number
ACPDQIND_n	String(1)
ELECDUR	Number
SUSSPELLID	String(38)
SERVICE_CODE	String(12)
ELECDATE	Date(YYYY-MM-DD)
RESRO	String(3)
ACTIVITY_LOCATION_TYPE_CODE	String(3)


In [14]:
print(yaml.dump(dict(rules=rules)))

rules:
- description: Provider Code of Treatment is String(5)
  expr: is.character(PROCODET) & nchar(PROCODET) == 5
  name: PROCODET String(5)
- description: Person Marital Status is String(1)
  expr: is.character(PERSON_MARITAL_STATUS) & nchar(PERSON_MARITAL_STATUS) == 1
  name: PERSON_MARITAL_STATUS String(1)
- description: Delivery Place Change Reason is String(1)
  expr: is.character(DELCHANG) & nchar(DELCHANG) == 1
  name: DELCHANG String(1)
- description: Operation Status Code is String(1)
  expr: is.character(OPERSTAT) & nchar(OPERSTAT) == 1
  name: OPERSTAT String(1)
- description: Cds Report Period End Date  is Date(YYYY-MM-DD)
  expr: grepl("^\d4-([0]\d|1[0-2])-([0-2]\d|3[01])$", PEREND)
  name: PEREND Date(YYYY-MM-DD)
- description: Hospital Provider Spell Identifier is Number
  expr: is.integer(PROVSPNO)
  name: PROVSPNO Number
- description: Current Electoral Ward is String(4)
  expr: is.character(CURRWARD) & nchar(CURRWARD) == 4
  name: CURRWARD String(4)
- description: I