<a href="https://colab.research.google.com/github/HomesHeatAndHealthyKids/DataLochAnalysis/blob/main/building_phenotype_list.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phenotype List
This code builds a list of condition codes and their mappings to phenotypes. The list serves two purposes:
1. **Lookup** - map ICD10 or Read V2 code to description of code
2. **Categorisation** - convert ICD10 or Read V2 code to a condition/phenotype

The list will combine the 22 chronic conditions investigated by Swann et al in a set of Harmonised Code Lists with a list of conditions taken from the DataLoch contract. Some of these lists will be constructed manually from the DataLoch contract. Others will be constructed directly from the [HDR UK Phenotype library](https://phenotypes.healthdatagateway.org/). The Harmonised Code Lists are stored in the HDR UK Phenotype library. The [pyconceptlibrary](https://swanseauniversitymedical.github.io/pyconceptlibraryclient/collections/) can be used to download phenotypes from the HDR UK phenotype library.

The resultant data will be reduced to several columns:
*   **phenotype_category** - ideally, any given code will only appear once in a given phenotype_category. There are two categories: "Chronic Paediatric Condition" and "Acute Respiratory Infection"
*   **code** - ICD10 or Read V2 code
*   **description** - long description of the code
*   **code_system** - indicates whether the code is ICD10 or Read V2
*   **clinical_subcategory** - subgrouping of phenotype codes (e.g. gastrointestinal device or gastrointestinal condition)
*   **phenotype_name** - name of the condition
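
As a rough illustration of the two purposes listed at the top, the sketch below builds a tiny table with these six columns and uses it for lookup and for categorisation. The two rows reuse codes that appear later in this notebook, but the table is illustrative only and is not the real phenotype list; the names `toy_list`, `code_to_description` and `code_to_phenotype` are made up for this example.

```python
import pandas as pd

# Toy phenotype list with the six target columns (illustrative rows only)
toy_list = pd.DataFrame(
    [
        {
            'phenotype_category': 'Acute Respiratory Infection',
            'code': 'R062',
            'description': 'Wheezing',
            'code_system': 'ICD10 codes',
            'clinical_subcategory': 'Wheezing',
            'phenotype_name': 'Wheezing',
        },
        {
            'phenotype_category': 'Acute Respiratory Infection',
            'code': 'H33z100',
            'description': 'Asthma attack',
            'code_system': 'Read codes v2',
            'clinical_subcategory': 'Acute Asthma',
            'phenotype_name': 'Acute Asthma',
        },
    ]
)

# 1. Lookup: code -> description
code_to_description = toy_list.set_index('code')['description'].to_dict()
print(code_to_description['R062'])

# 2. Categorisation: code -> phenotype name
code_to_phenotype = toy_list.set_index('code')['phenotype_name'].to_dict()
print(code_to_phenotype['H33z100'])
```

Both lookups only behave predictably if each code appears once, which is why the duplicate checks later in the notebook matter.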

Finally, I will compare these lists to the lists used for Eleanor's Dissertation.


The list of ICD10 codes that correspond to chronic conditions will be based on the list of 22 harmonised phenotypes built by Swann et al. These lists have been "harmonised" to include Read V2 codes, ICD10 codes, and SNOMED codes. The SNOMED codes will be filtered out as these are currently unused in Scotland.

The list of ARIs will be based on the DataLoch contract and a couple of phenotype codelists from HDR UK. The DataLoch contract gives a list of GP Read codes and ICD10 codes corresponding to clinical subcategories of ARI.



In [None]:
# Install the pyconcept library for accessing the HDR UK library and initialise the pre-requisites
!pip install git+https://github.com/SwanseaUniversityMedical/pyconceptlibraryclient.git@v1.0.0

import pandas as pd
from pyconceptlibraryclient import Client

# pyconceptlibraryclient is an interface to the following site
# Here is an example of a downloaded phenotype
# https://phenotypes.healthdatagateway.org/api/v1/phenotypes/PH1698/version/3628/detail/?format=json
client = Client(public=True)


## HDR UK Phenotype Download
Many conditions can be downloaded directly from HDR UK. This first set of routines gets a phenotype directly from HDR UK, assigns a phenotype category and removes any additional fields.

The list of HDR UK downloads includes the 22 Chronic conditions that have harmonised code lists by Swann et al. These conditions contain "harmonised" codelists - these are codelists that contain mappings for both primary ("Read V2") and secondary care ("ICD10").

Some Clinical Subcategories of Acute Respiratory Infection can also be downloaded directly from the HDR UK phenotype library.
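
The name handling in the routines below relies on the Swann phenotype names starting with a fixed prefix, which is why a slice at position 31 appears in the code. The sketch below is an alternative way of writing the same idea with the prefix spelled out; `NAME_PREFIX` and `strip_prefix` are hypothetical names for this illustration and are not part of the notebook.

```python
# Prefix used by the Swann et al. phenotype names in the HDR UK library
NAME_PREFIX = 'Chronic paediatric conditions: '


def strip_prefix(name: str) -> str:
    """Remove the chronic-conditions prefix if present (case-insensitive)."""
    if name.lower().startswith(NAME_PREFIX.lower()):
        return name[len(NAME_PREFIX):]
    return name


print(len(NAME_PREFIX))  # length of the prefix, including the trailing space
print(strip_prefix('Chronic paediatric conditions: Asthma'))
print(strip_prefix('Asthma Secondary care'))
```

Spelling out the prefix makes the slice length self-documenting, at the cost of assuming the library keeps this exact wording.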

In [None]:
def fix_name(condition_str: str) -> str:
    """
    fix_name - removes 'Chronic Paediatric Conditions:' from the start of the
    condition name if it is present
    """
    if condition_str[0:7]=='Chronic':
        return condition_str[31:]
    else:
        return condition_str

def categorise_phenotype(condition_str: str) -> str:
    """
    categorise_phenotype - apply category to the condition name
    All of Swann et al.'s harmonised conditions are Chronic
    """
    if condition_str[0:7]=='Chronic':
        return 'Chronic Paediatric Condition'
    else:
        return 'Acute Respiratory Infection'

def remap_icd_code(icd_code_str: str) -> str:
    """
    The HDR UK phenotype list stores ICD10 codes
    as J12.0 - we need to change that to J120 so that
    they are recognised within DataLoch
    so let's remove all of the occurrences of "."
    """
    return icd_code_str.replace('.', '')


def download_phenotype_df(phenotype_id: str, phenotype_version_id: int) -> pd.DataFrame:
    """
    download_phenotype_df - returns details of phenotype for the phenotype id
    and version in a DataFrame with codelists
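     - phenotype_category (Chronic Paediatric Condition or Acute Respiratory Infection)
     - code (ICD10, Read Code, or SNOMED code)
     - description (description of the condition represented by the code)
     - code_system (ICD10, Read Code, or SNOMED)
     - clinical_subcategory (concept name with the bracketed coding system removed)
     - phenotype_name (with 'Chronic Paediatric Conditions:' removed)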

    """
    print(f'Downloading {phenotype_id}, version {phenotype_version_id}')
    # Download the phenotype from concept library
    detail = client.phenotypes.get_codelist(
            phenotype_id, phenotype_version_id
        )

    df = pd.DataFrame(detail)  # Convert the codelist to a dataframe

    df['phenotype_category'] = df['phenotype_name'].apply(categorise_phenotype)

    # Extract the code type from the coding system
    df['code_system'] = df['coding_system'].apply(lambda x: x['name'])


    # Remove 'Chronic Paediatric Conditions:' from the beginning of the name
    df['phenotype_name'] = df['phenotype_name'].apply(fix_name)

    # Remove the text in brackets at the end of the concept name (ICD10, Read Code v2, etc.)
    df['concept_name'] = df['concept_name'].str.replace(r'\s*\(.*\)', '', regex=True).str.strip()

    df = df.rename(columns={'concept_name': 'clinical_subcategory'})

    # Remove additional columns
    df = df.reset_index()[
        [
            'phenotype_category',
            'code',
            'description',
            'code_system',
            'clinical_subcategory',
            'phenotype_name',
        ]
    ]

    return df

Let's test the download function by downloading a single Phenotype.

In [None]:
# Test the HDR UK phenotype download!
# Download the Asthma Secondary care phenotype
df = download_phenotype_df('PH783', 2207)
df[df['code_system']=='ICD10 codes']


Downloading PH783, version 2207


Unnamed: 0,phenotype_category,code,description,code_system,clinical_subcategory,phenotype_name
0,Acute Respiratory Infection,J45,Asthma,ICD10 codes,Asthma Secondary care - 2,Asthma Secondary care
1,Acute Respiratory Infection,J45.0,Extrinsic asthma,ICD10 codes,Asthma Secondary care - 2,Asthma Secondary care
2,Acute Respiratory Infection,J45.1,Intrinsic asthma,ICD10 codes,Asthma Secondary care - 2,Asthma Secondary care
3,Acute Respiratory Infection,J45.20,"Mild intermittent asthma, uncomplicated",ICD10 codes,Asthma Secondary care - 2,Asthma Secondary care
4,Acute Respiratory Infection,J45.21,Mild intermittent asthma with (acute) exacerba...,ICD10 codes,Asthma Secondary care - 2,Asthma Secondary care
5,Acute Respiratory Infection,J45.30,"Mild persistent asthma, uncomplicated",ICD10 codes,Asthma Secondary care - 2,Asthma Secondary care
6,Acute Respiratory Infection,J45.31,Mild persistent asthma with (acute) exacerbation,ICD10 codes,Asthma Secondary care - 2,Asthma Secondary care
7,Acute Respiratory Infection,J45.40,"Moderate persistent asthma, uncomplicated",ICD10 codes,Asthma Secondary care - 2,Asthma Secondary care
8,Acute Respiratory Infection,J45.41,Moderate persistent asthma with (acute) exacer...,ICD10 codes,Asthma Secondary care - 2,Asthma Secondary care
9,Acute Respiratory Infection,J45.50,"Severe persistent asthma, uncomplicated",ICD10 codes,Asthma Secondary care - 2,Asthma Secondary care


## DataLoch Contract CodeLists

### Initialise Downloads With DataLoch Phenotypes
Some of the downloads come from the DataLoch contract. These phenotypes are stored on HDR UK, but don't necessarily have the "harmonised" codelists for primary and secondary care.

The list of phenotypes to download from HDR UK is stored in a dictionary. The key of the dictionary is the phenotype identifier, and the value is the integer version number.

In [None]:
# Create a dictionary containing the phenotype id and the version id for
# downloading
dict_downloads = {}
#Code lists for outcomes:
#a) Admission with ARI
#HDR-UK phenotype code lists will be used.
#
#i)	Lower respiratory tract infections:
#https://phenotypes.healthdatagateway.org/phenotypes/PH488/version/1521/detail/
dict_downloads['PH488'] = 1521
#
#ii)	Upper respiratory tract infections: https://phenotypes.healthdatagateway.org/phenotypes/PH158/version/316/detail/
dict_downloads['PH158'] = 316
#
#iii)	In addition, to ensure we include children presenting with viral induced wheeze (for which there is no ICD-10 code), we will also include the following ICD-10 codes (as previously published in (1)).
#
#-	R06.2	Wheezing
#-	R06.0	Dyspnoea
#-	R06.8	Other abnormalities of breathing
df_wheeze = pd.DataFrame(
    [
        {'code': 'R06.2', 'description': 'Wheezing'},
        {'code': 'R06.0', 'description': 'Dyspnoea'},
        {'code': 'R06.8', 'description': 'Other abnormalities of breathing'}
    ]
)
# Are all of these categorisations good?
df_wheeze['code_system'] =  'ICD10 codes'
df_wheeze['phenotype_name'] = 'Wheezing'
df_wheeze['phenotype_category'] = 'Acute Respiratory Infection'
df_wheeze['clinical_subcategory'] = 'Wheezing'

# These conditions do not currently appear
#
#The ICD-10 codes outlined above will be included if present in diagnosis positions 1-3.
#
#iv)	In young children, there is no simple distinction between viral induced wheeze and asthma. As such, we will also include children admitted with asthma as viral infection is predominant trigger in this age group. The following ICD-10 codes will be included:
#-	J45 	Asthma
#-	J46 	Status asthmaticus
# These conditions appear as part of the Asthma condition from the Swann list

#These asthma ICD-10 codes will only be included if present in diagnosis position 1. This will ensure only admissions directly due to asthma are included and avoid including historical diagnoses of asthma (i.e., excluding asthma if only present as a comorbidity).
#
#b) GP-coded ARI
#HDR-UK phenotype code lists will be used.
#i)	Lower respiratory tract infections:
#https://phenotypes.healthdatagateway.org/phenotypes/PH899/version/1877/detail/ - ADDED
dict_downloads['PH899'] = 1877
#ii)	Upper respiratory tract infections
#https://phenotypes.healthdatagateway.org/phenotypes/PH906/version/1891/detail/  - ADDED
dict_downloads['PH906'] = 1891

# Livvy suggested the following downloads in e-mail (2026-1-23):
dict_downloads['PH475'] = 1516 # Otitis Media Read codes v2
dict_downloads['PH814'] = 2248 # Tonsillitis Primary care Read codes v2 SNOMED codes
dict_downloads['PH476'] = 1517 # Sore Throat Read codes v2




### Manual Code Lists

There are also a set of lists of ICD10 and Read V2 codes that refer to different subcategories of problem such as Wheezing. These code lists are written out in full in the DataLoch contract. This section includes those codes that are specified in full. These lists are manually converted to DataFrames that will be appended to the final aggregated DataFrame.

There is a problem with the Read codes from the contract: their lengths are mixed. Some are 5 characters long and some are 7 characters long. In all but one of the other records, the Read codes are 7 characters long. The GP Read codes data on DataLoch has two fields: FullReadCode (always 7 characters long) and ReadCode (always 5 characters long).

**Ideally, I would convert all read codes to 7 character read codes**
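
One possible way to do that conversion is sketched below: treat each 5-character code as a prefix and expand it to every 7-character code that starts with it. This assumes that a FullReadCode always begins with its ReadCode, which matches the examples noted in the contract comments (for instance `173e.` appearing as `173e.00` and `173e.11`) but has not been checked against DataLoch. The function `expand_read_code` and the sample list are hypothetical.

```python
def expand_read_code(short_code: str, full_codes: list[str]) -> list[str]:
    """Return every 7-character code in full_codes that starts with short_code."""
    return sorted(code for code in full_codes if code.startswith(short_code))


# Hypothetical sample of FullReadCode values; the real list would come from DataLoch
known_full_codes = ['173e.00', '173e.11', '173B.00', 'R060E00', '2326.00']

print(expand_read_code('173e.', known_full_codes))
print(expand_read_code('R060E', known_full_codes))
```

A 5-character code can expand to more than one 7-character code, so the conversion is one-to-many rather than a simple padding step.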


In [None]:
#
#iii)	Wheeze read codes (mapped from ICD-10 using (https://rdrr.io/rforge/CALIBERcodelists/man/CALIBERcodelists-package.html)
#  and from Clinical Terminology Read Code Browser - Scottish Read codes v2)
#
# -	R060900	[D]Wheezing
# -	1737.11  	Wheezing symptom
# -	232H.00	On examination - inspiratory wheeze
# -	173e.		Viral wheeze - 173e.00 and 173e.11 appear on DataLoch
# -	R060E		[D]Mild wheeze - R060E00 appears on DataLoch
# -	R060G		[D]Severe wheeze - R060G00 appears on DataLoch
# -	R060F		[D]Moderate wheeze - R060F00 appears on DataLoch
# -	173e.		Viral induced wheeze - 173e.00 and 173e.11 appear on DataLoch
# -	R060H		[D]Very severe wheeze - R060H00 appears on DataLoch
# -	2326.		O/E - expiratory wheeze - 2326.00 appears on DataLoch
# -	173B.		Nocturnal cough / wheeze  - 173B.00 appears on DataLoch
# -	232H.		On examination - inspiratory wheeze - 232H.00 appears on DataLoch
#
# Codes are a mixture of 5-digit read codes and 7 digit full read codes
df_wheeze_read = pd.DataFrame(
    [
        {'code': 'R060900', 'description': '[D]Wheezing'},
        {'code': '1737.11', 'description': 'Wheezing Symptom'},
        {'code': '232H.00', 'description': 'On examination - inspiratory wheeze'},
        {'code': '173e.00', 'description': 'Viral wheeze'},
        {'code': 'R060E00', 'description': '[D]Mild wheeze'},
        {'code': 'R060G00', 'description': '[D]Severe wheeze'},
        {'code': 'R060F00', 'description': '[D]Moderate wheeze'},
        {'code': '173e.11', 'description': 'Viral induced wheeze'},
        {'code': 'R060H00', 'description': '[D]Very severe wheeze'},
        {'code': '2326.00', 'description': 'O/E - expiratory wheeze'},
        {'code': '173B.00', 'description': 'Nocturnal cough / wheeze'},
        {'code': '232H.00', 'description': 'On examination - inspiratory wheeze'},
    ]
    )
df_wheeze_read['code_system'] = 'Read codes v2'
df_wheeze_read['phenotype_name'] = 'Wheezing'
df_wheeze_read['phenotype_category'] = 'Acute Respiratory Infection'
df_wheeze_read['clinical_subcategory'] = 'Wheezing'

# iv)	Acute asthma presentation (acute presentations filtered from HDR-UK phenotype code list
# and using Clinical Terminology Read Code Browser - Scottish Read codes v2)
# https://phenotypes.healthdatagateway.org/concepts/C2421/version/6242/detail/)
#
# -	H33z0	Status asthmaticus NOS
# -	H33z011 Severe asthma attack
# -	H33z1	Asthma attack - H33z100 appears on DataLoch
# -	H3311	Intrinsic asthma with status asthmaticus - H3311 does not appear on DataLoch
# -	H3301	Extrinsic asthma with status asthmaticus - H3301 does not appear on DataLoch

# Some problems here - These are the read codes that appear in the GP read codes on DataLoch
#   H333.00
#   H330.00
#   H33z.00
#   H33z100
#   H33..11
#   H33..00
#   H330.12
#   H33z011
#   H33xx11
#   H33zz00
# Read codes are a mixture of 5 and 7-digit read codes
df_acute_read = pd.DataFrame(
    [
        {'code': 'H33z000', 'description': 'Status asthmaticus NOS' },
        {'code': 'H33z011', 'description': 'Severe asthma attack'},
        {'code': 'H33z100', 'description': 'Asthma attack'},
        {'code': 'H331100', 'description': 'Intrinsic asthma with status asthmaticus'},
        {'code': 'H330100', 'description': 'Extrinsic asthma with status asthmaticus'}
    ]
    )
df_acute_read['code_system'] = 'Read codes v2'
df_acute_read['phenotype_name'] = 'Acute Asthma'
df_acute_read['phenotype_category'] = 'Acute Respiratory Infection'
df_acute_read['clinical_subcategory'] = 'Acute Asthma'

# Livvy asked for the following Read v2 codes to be added to the Acute Asthma via Slack
# see https://homesheatandh-fcg4352.slack.com/archives/D09FA0NH8SE/p1769175677711169


#Great, thanks those Read codes look sensible to ensure we are picking up acute asthma and not a label of chronic asthma. Can we please add:
#H33z111 Asthma attack NOS
#X102D Status asthmaticus
#Xa0lZ Asthmatic bronchitis
#Xa9zf Acute asthma
#XE0YU Intrinsic asthma with status asthmaticus
#XE0YV Status asthmaticus NOS
#XE0YW Asthma attack
#XM0s2 Asthma attack NOS

df_additional_acute_read = pd.DataFrame(
    [
        {'code': 'H33z111', 'description': 'Asthma attack NOS'},
        {'code': 'X102D00', 'description': 'Status asthmaticus'},
        {'code': 'Xa0lZ00', 'description': 'Asthmatic bronchitis'},
        {'code': 'Xa9zf00', 'description': 'Acute asthma'},
        {'code': 'XE0YU00', 'description': 'Intrinsic asthma with status asthmaticus'},
        {'code': 'XE0YV00', 'description': 'Status asthmaticus NOS'},
        {'code': 'XE0YW00', 'description': 'Asthma attack'},
        {'code': 'XM0s200', 'description': 'Asthma attack NOS'}
    ]
)

df_additional_acute_read['code_system'] = 'Read codes v2'
df_additional_acute_read['phenotype_name'] = 'Acute Asthma'
df_additional_acute_read['phenotype_category'] = 'Acute Respiratory Infection'
df_additional_acute_read['clinical_subcategory'] = 'Acute Asthma'



## Build Acute Asthma ICD10 codes
Use Luke's list to build up the ICD 10 codes for Acute Asthma - taken from Livvy's slack - see https://homesheatandh-fcg4352.slack.com/archives/D09FA0NH8SE/p1769176175753709
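
The filter used in this section drops ICD10 codes that have two or more characters after the dot. The sketch below shows the same regular expression applied to a small hand-written list using the standard `re` module; pandas `str.contains` performs an equivalent search. The sample codes are chosen for illustration and `EXTRA_DIGITS` is a made-up name.

```python
import re

# A dot followed by two alphanumeric characters marks a code with extra digits
EXTRA_DIGITS = re.compile(r'\.[A-Za-z0-9]{2}')

sample_codes = ['J45', 'J45.0', 'J45.20', 'J45.901', 'J46']

# Keep codes with at most one character after the dot
kept = [code for code in sample_codes if not EXTRA_DIGITS.search(code)]
removed = [code for code in sample_codes if EXTRA_DIGITS.search(code)]

print(kept)
print(removed)
```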

In [None]:
# Remove ICD10 codes that have two or more characters after the dot.
# Example: 'J45.20' is removed, while 'J45' and 'J45.9' are kept.
# The regex r'\.[A-Za-z0-9]{2}' matches a dot followed by two alphanumeric characters.

df_acute_icd = download_phenotype_df('PH783', 2207)
# Take a copy so that the assignments below do not modify a slice of the download
df_acute_icd = df_acute_icd[df_acute_icd['code_system']=='ICD10 codes'].copy()

# Ensure the 'code' column is string type so that string methods can be applied
df_acute_icd['code'] = df_acute_icd['code'].astype(str)

# Flag the ICD10 codes with two or more characters after the dot (e.g. 'J45.20', 'J45.21')
mask_problematic_icd10 = (df_acute_icd['code'].str.contains(r'\.[A-Za-z0-9]{2}', regex=True))

df_removed_problematic_icd10 = df_acute_icd[mask_problematic_icd10]

# Remove these rows from the main DataFrame
df_acute_icd = df_acute_icd[~mask_problematic_icd10].copy()





df_acute_icd['phenotype_name'] = 'Acute Asthma'
df_acute_icd['phenotype_category'] = 'Acute Respiratory Infection'
df_acute_icd['clinical_subcategory'] = 'Acute Asthma'

print(f"Number of rows removed from df: {len(df_removed_problematic_icd10)}")
print("\nHead of DataFrame with problematic ICD10 codes removed:")
print(f"\nnumber of rows = :{len(df_acute_icd)}")
df_acute_icd
print(f"\n")
df_acute_icd = df_acute_icd.drop_duplicates(subset=['code'], keep='first')
df_acute_icd

Downloading PH783, version 2207
Number of rows removed from df: 22

Head of DataFrame with problematic ICD10 codes removed:

number of rows = :12




Unnamed: 0,phenotype_category,code,description,code_system,clinical_subcategory,phenotype_name
0,Acute Respiratory Infection,J45,Asthma,ICD10 codes,Acute Asthma,Acute Asthma
1,Acute Respiratory Infection,J45.0,Extrinsic asthma,ICD10 codes,Acute Asthma,Acute Asthma
2,Acute Respiratory Infection,J45.1,Intrinsic asthma,ICD10 codes,Acute Asthma,Acute Asthma
11,Acute Respiratory Infection,J45.8,Chronic obstructive asthma,ICD10 codes,Acute Asthma,Acute Asthma
12,Acute Respiratory Infection,J45.9,Other forms of asthma,ICD10 codes,Acute Asthma,Acute Asthma
16,Acute Respiratory Infection,J46,Status asthmaticus,ICD10 codes,Acute Asthma,Acute Asthma


## Chronic conditions - "Harmonised" codelists
Swann et al created a list of 22 "harmonised" codelists for chronic conditions. "Harmonised" means that there are comparable Read codes, ICD 10 codes, and SNOMED-CT codes for the same phenotype. So, for example, the ICD10 codes for a patient who presents with Wheezing at a hospital would be similar to the Read codes for the Wheezing conditions. This piece of code searches the HDR UK library for the list of chronic conditions and adds them to the list of HDR UK downloads.
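
Because the search results arrive one page at a time, the download dictionary has to be assembled across pages. The sketch below shows one way to structure such a loop so that it stops after the last page. It uses a stand-in `fetch_page` callable and fake data rather than the real client; only the `page`, `total_pages` and `data` keys are taken from the responses used in this notebook, and `collect_all_pages`, `FAKE_PAGES` and the `PH_A`/`PH_B` identifiers are invented for the example.

```python
def collect_all_pages(fetch_page):
    """
    Gather the 'data' entries from every page of a paginated response.
    fetch_page(page_number) must return a dict with 'page', 'total_pages' and 'data'.
    """
    results = []
    page = 1
    while True:
        response = fetch_page(page)
        results.extend(response['data'])
        if response['page'] >= response['total_pages']:
            break  # last page reached, so do not request another
        page += 1
    return results


# Stand-in for the real client: two fake pages with made-up identifiers
FAKE_PAGES = {
    1: {'page': 1, 'total_pages': 2, 'data': [{'phenotype_id': 'PH_A', 'phenotype_version_id': 1}]},
    2: {'page': 2, 'total_pages': 2, 'data': [{'phenotype_id': 'PH_B', 'phenotype_version_id': 2}]},
}

items = collect_all_pages(lambda page: FAKE_PAGES[page])
downloads = {item['phenotype_id']: item['phenotype_version_id'] for item in items}
print(downloads)
```

Passing the fetch function in as an argument keeps the loop testable without network access.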

In [None]:
# Add in all of the phenotypes by Swann
SEARCH_TERM = 'swann' # Look for all phenotypes by Swann
# Search for all phenotypes by author 'Swann'
# Returns a list of 22 conditions

phenotypes = client.phenotypes.get(search=SEARCH_TERM)
page = phenotypes.get('page')
total_pages = phenotypes.get('total_pages')

# client.phenotypes.get only returns one page at a time,
# so add the current page and then download any remaining pages
while True:
    dict_cur = {phenotype['phenotype_id'] : phenotype['phenotype_version_id'] for phenotype in phenotypes['data']}
    dict_downloads = dict_downloads | dict_cur
    if page >= total_pages:
        break  # Last page processed - do not request a page beyond the end
    page = page + 1
    phenotypes = client.phenotypes.get(search=SEARCH_TERM, page=page)




In [None]:
# Build a dictionary of phenotype name -> detail URL for the Swann phenotypes
SEARCH_TERM = 'swann' # Look for all phenotypes by Swann
# Search for all phenotypes by author 'Swann'
# Returns a list of 22 conditions

phenotypes = client.phenotypes.get(search=SEARCH_TERM)
page = phenotypes.get('page')
total_pages = phenotypes.get('total_pages')
dict_temp = {}
# client.phenotypes.get only returns one page at a time,
# so add the current page and then download any remaining pages
while True:
    dict_cur = {phenotype['name']: f"https://phenotypes.healthdatagateway.org/concepts/{phenotype['phenotype_id']}/version/{phenotype['phenotype_version_id']}/detail/" for phenotype in phenotypes['data']}
    dict_temp = dict_temp | dict_cur
    if page >= total_pages:
        break  # Last page processed - do not request a page beyond the end
    page = page + 1
    phenotypes = client.phenotypes.get(search=SEARCH_TERM, page=page)

print(dict_temp)

{'Chronic paediatric conditions: Asthma': 'https://phenotypes.healthdatagateway.org/concepts/PH1690/version/3620/detail/', 'Chronic paediatric conditions: Cystic Fibrosis': 'https://phenotypes.healthdatagateway.org/concepts/PH1691/version/3622/detail/', 'Chronic paediatric conditions: Other respiratory (excluding asthma and cystic fibrosis)': 'https://phenotypes.healthdatagateway.org/concepts/PH1692/version/3623/detail/', 'Chronic paediatric conditions: Cardiovascular': 'https://phenotypes.healthdatagateway.org/concepts/PH1698/version/3628/detail/', 'Chronic paediatric conditions: Epilepsy': 'https://phenotypes.healthdatagateway.org/concepts/PH1699/version/3626/detail/', 'Chronic paediatric conditions: Headaches': 'https://phenotypes.healthdatagateway.org/concepts/PH1700/version/3625/detail/', 'Chronic paediatric conditions: Other neurological (excluding epilepsy and headaches)': 'https://phenotypes.healthdatagateway.org/concepts/PH1701/version/3627/detail/', 'Chronic paediatric condit

In [None]:
print(phenotypes['data'])

[{'phenotype_id': 'PH1715', 'phenotype_version_id': 3640, 'name': 'Chronic paediatric conditions: Transplant', 'publications': [], 'validation': '', 'citation_requirements': 'To our knowledge, these are the first set of harmonised code lists for chronic paediatric\nconditions that span commonly used primary and secondary care coding systems in the\nUK. We hope they will prove a valuable resource for the paediatric research community\nand welcome suggestions for further development ([Olivia.Swann@ed.ac.uk](mailto:Olivia.Swann@ed.ac.uk)).\n', 'created': '2025-01-19T17:09:18.457856Z', 'author': 'Swann OV, Williams TC, Fraser LK, Farrell J, Parker M, Kennedy J, Seabourne M, Brophy S, Harrison EM, Docherty AB, Pollock L', 'collections': [{'name': 'Phenotype Library', 'value': 18}], 'tags': None, 'organisation': None, 'world_access': 2, 'updated': '2025-01-19T17:09:18.431863Z', 'sex': [{'name': 'Both', 'value': '3'}], 'type': [{'name': 'Disease or syndrome', 'value': '2'}], 'ontology': [{'na

## Download and Combine the Codes
There is now a combined dictionary containing all of the information required to download the codes from the HDR UK library. The next cell downloads each phenotype, concatenates the results into a single DataFrame and appends the manually constructed lists.


The list removes the "." characters from the ICD10 codes because they do not appear to be present in SMR01 on DataLoch.
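
As a small worked example of that step, the sketch below strips the dots from the ICD10 rows of a toy table while leaving the Read codes untouched, since the dot is part of a Read code. The three rows are illustrative, and the vectorised `str.replace` is used here in place of the notebook's `remap_icd_code` helper.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        'code': ['J45.0', 'R06.2', '173e.00'],
        'code_system': ['ICD10 codes', 'ICD10 codes', 'Read codes v2'],
    }
)

# Only the ICD10 rows are changed; regex=False treats '.' as a literal dot
is_icd10 = toy['code_system'] == 'ICD10 codes'
toy.loc[is_icd10, 'code'] = toy.loc[is_icd10, 'code'].str.replace('.', '', regex=False)

print(toy['code'].tolist())
```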

In [None]:
# Download the list of phenotypes from the HDR UK library
lst_dfs = [download_phenotype_df(k, v) for k, v in dict_downloads.items()]
# Add the manually constructed code lists
lst_dfs = lst_dfs + [df_wheeze, df_wheeze_read, df_acute_read, df_additional_acute_read, df_acute_icd ]

df = pd.concat(lst_dfs)  # Append all of the dataframes together

# Remove the SNOMED CT codes because they are English GP codes
df = df[df.code_system != 'SNOMED  CT codes']

# Remove the "." from the ICD10 codes
df.loc[df['code_system'] == 'ICD10 codes', 'code'] = df.loc[df['code_system'] == 'ICD10 codes', 'code'].apply(remap_icd_code)
df.to_csv('raw_phenotype_list.csv', index=False)
print('Finished download')
df

Downloading PH488, version 1521
Downloading PH158, version 316
Downloading PH899, version 1877
Downloading PH906, version 1891
Downloading PH475, version 1516
Downloading PH814, version 2248
Downloading PH476, version 1517
Downloading PH1690, version 3620
Downloading PH1691, version 3622
Downloading PH1692, version 3623
Downloading PH1698, version 3628
Downloading PH1699, version 3626
Downloading PH1700, version 3625
Downloading PH1701, version 3627
Downloading PH1702, version 3624
Downloading PH1703, version 3619
Downloading PH1704, version 3629
Downloading PH1705, version 3630
Downloading PH1706, version 3631
Downloading PH1707, version 3632
Downloading PH1708, version 3633
Downloading PH1709, version 3634
Downloading PH1710, version 3635
Downloading PH1711, version 3636
Downloading PH1712, version 3637
Downloading PH1713, version 3638
Downloading PH1714, version 3639
Downloading PH1715, version 3640
Downloading PH1716, version 3641
Finished download


Unnamed: 0,phenotype_category,code,description,code_system,clinical_subcategory,phenotype_name
0,Acute Respiratory Infection,A15,"Respiratory tuberculosis, bacteriologically an...",ICD10 codes,Respiratory Tract Infection - Secondary care,Respiratory Tract Infection
1,Acute Respiratory Infection,A150,"Tuberculosis of lung, confirmed by sputum micr...",ICD10 codes,Respiratory Tract Infection - Secondary care,Respiratory Tract Infection
2,Acute Respiratory Infection,A151,"Tuberculosis of lung, confirmed by culture only",ICD10 codes,Respiratory Tract Infection - Secondary care,Respiratory Tract Infection
3,Acute Respiratory Infection,A152,"Tuberculosis of lung, confirmed histologically",ICD10 codes,Respiratory Tract Infection - Secondary care,Respiratory Tract Infection
4,Acute Respiratory Infection,A153,"Tuberculosis of lung, confirmed by unspecified...",ICD10 codes,Respiratory Tract Infection - Secondary care,Respiratory Tract Infection
...,...,...,...,...,...,...
1,Acute Respiratory Infection,J450,Extrinsic asthma,ICD10 codes,Acute Asthma,Acute Asthma
2,Acute Respiratory Infection,J451,Intrinsic asthma,ICD10 codes,Acute Asthma,Acute Asthma
11,Acute Respiratory Infection,J458,Chronic obstructive asthma,ICD10 codes,Acute Asthma,Acute Asthma
12,Acute Respiratory Infection,J459,Other forms of asthma,ICD10 codes,Acute Asthma,Acute Asthma


## Error Testing
### ICD10, Read V2 Codes With Different Descriptions
We may use the different codes as a lookup for a description of what they mean. It would be helpful if this was a unique one-to-one relationship. Unfortunately, several codes appear with slightly different descriptions. However, the names for the codes only differ by a small amount so this is not a big deal.


In [None]:
duplicate_codes_with_different_descriptions = \
    df.groupby('code').filter(lambda x: x['description'].nunique() > 1)

display(duplicate_codes_with_different_descriptions.sort_values('code', ascending=True))

Unnamed: 0,phenotype_category,code,description,code_system,clinical_subcategory,phenotype_name
155,Chronic Paediatric Condition,C108J00,Insulin dependent diabetes mellitus with neuro...,Read codes v2,Rheumatological or musculoskeletal,Rheumatological or musculoskeletal
101,Chronic Paediatric Condition,C108J00,Insulin dependent diab mell with neuropathic a...,Read codes v2,Diabetes,Diabetes
848,Chronic Paediatric Condition,C109000,Non-insulin-dependent diabetes mellitus with r...,Read codes v2,Renal and Genitourinary - non-congenital renal...,Renal and Genitourinary
105,Chronic Paediatric Condition,C109000,Non-insulin-dependent diabetes mellitus with r...,Read codes v2,Diabetes,Diabetes
1519,Chronic Paediatric Condition,C109200,Non-insulin-dependent diabetes mellitus with n...,Read codes v2,Other neurological - other neurological condition,Other neurological (excluding epilepsy and hea...
114,Chronic Paediatric Condition,C109200,Non-insulin-dependent diabetes mellitus with n...,Read codes v2,Diabetes,Diabetes
1,Chronic Paediatric Condition,J450,Predominantly allergic asthma,ICD10 codes,Asthma,Asthma
1,Acute Respiratory Infection,J450,Extrinsic asthma,ICD10 codes,Acute Asthma,Acute Asthma
2,Chronic Paediatric Condition,J451,Nonallergic asthma,ICD10 codes,Asthma,Asthma
2,Acute Respiratory Infection,J451,Intrinsic asthma,ICD10 codes,Acute Asthma,Acute Asthma


### Remove duplicate descriptions
Remove any duplicate descriptions by applying the first available mapping from code to description.
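
The effect of this step can be seen on a toy table, sketched below, where one code carries two different descriptions. The rows borrow the J450 and J451 descriptions shown in the check above; the table itself is illustrative.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        'code': ['J450', 'J450', 'J451'],
        'description': ['Predominantly allergic asthma', 'Extrinsic asthma', 'Nonallergic asthma'],
    }
)

# Series indexed by code, holding the first description seen for each code
first_description = toy.drop_duplicates(subset=['code'], keep='first').set_index('code')['description']

# Overwrite every description with the first one for its code
toy['description'] = toy['code'].map(first_description)

print(toy['description'].tolist())
```

Which description survives depends on row order, so the result changes if the DataFrames are concatenated in a different order.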

In [None]:
# Create a mapping of each code to its first description
code_to_first_description = df.drop_duplicates(subset=['code'], keep='first').set_index('code')['description']

# Update the 'description' column in the main DataFrame using this mapping
df['description'] = df['code'].map(code_to_first_description)

display(df.head())

Unnamed: 0,phenotype_category,code,description,code_system,clinical_subcategory,phenotype_name
0,Acute Respiratory Infection,A15,"Respiratory tuberculosis, bacteriologically an...",ICD10 codes,Respiratory Tract Infection - Secondary care,Respiratory Tract Infection
1,Acute Respiratory Infection,A150,"Tuberculosis of lung, confirmed by sputum micr...",ICD10 codes,Respiratory Tract Infection - Secondary care,Respiratory Tract Infection
2,Acute Respiratory Infection,A151,"Tuberculosis of lung, confirmed by culture only",ICD10 codes,Respiratory Tract Infection - Secondary care,Respiratory Tract Infection
3,Acute Respiratory Infection,A152,"Tuberculosis of lung, confirmed histologically",ICD10 codes,Respiratory Tract Infection - Secondary care,Respiratory Tract Infection
4,Acute Respiratory Infection,A153,"Tuberculosis of lung, confirmed by unspecified...",ICD10 codes,Respiratory Tract Infection - Secondary care,Respiratory Tract Infection


### Duplicate Codes - Codes that appear more than once in a single category
Ideally, a single Read V2 or ICD10 code would only appear once in a single category. For example, a Read V2 code would not be replicated across different phenotypes within the "Chronic conditions" category. Unfortunately, a significant number of rows (289) appear more than once in the chronic conditions. This seems to particularly affect the diabetes diagnoses, but other conditions are also affected.
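
To make the rule concrete, the sketch below flags within-category duplicates on a toy table. It uses `DataFrame.duplicated` with `keep=False`, which is an alternative to the `groupby(...).filter(...)` approach used in the notebook cell and should select the same rows. The four rows are illustrative.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        'phenotype_category': ['Chronic Paediatric Condition'] * 3 + ['Acute Respiratory Infection'],
        'code': ['C108J00', 'C108J00', 'J450', 'J450'],
        'phenotype_name': ['Diabetes', 'Rheumatological or musculoskeletal', 'Asthma', 'Acute Asthma'],
    }
)

# keep=False marks every row of a duplicated (category, code) pair, not just the repeats
is_dupe = toy.duplicated(subset=['phenotype_category', 'code'], keep=False)
dupes = toy[is_dupe]

print(dupes)
```

Here C108J00 is flagged because it appears twice within the chronic category, while J450 is not flagged because its two rows sit in different categories.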

In [None]:
def display_code_dupes(df):
    """
    Look for codes that are duplicated within a phenotype category.
    There are two phenotype categories - Acute Respiratory Infections and Chronic Conditions
    """
    duplicate_codes_in_category = df.groupby(['phenotype_category', 'code']).filter(lambda x: len(x) > 1)
    display(duplicate_codes_in_category.sort_values('code'))
    return duplicate_codes_in_category
duplicate_codes_in_category = display_code_dupes(df)
# Some codes are even duplicated within a phenotype name!! The codes appear in two separate clinical_subcategories

Unnamed: 0,phenotype_category,code,description,code_system,clinical_subcategory,phenotype_name
91,Acute Respiratory Infection,14B7.00,History of recurrent tonsillitis,Read codes v2,Tonsillitis Primary care - BREATHE recommended...,Tonsillitis Primary care
34,Acute Respiratory Infection,14B7.00,History of recurrent tonsillitis,Read codes v2,Tonsillitis Primary care - 1,Tonsillitis Primary care
92,Acute Respiratory Infection,14B8.00,History of tonsillitis,Read codes v2,Tonsillitis Primary care - BREATHE recommended...,Tonsillitis Primary care
35,Acute Respiratory Infection,14B8.00,History of tonsillitis,Read codes v2,Tonsillitis Primary care - 1,Tonsillitis Primary care
2,Acute Respiratory Infection,232H.00,On examination - inspiratory wheeze,Read codes v2,Wheezing,Wheezing
...,...,...,...,...,...,...
36,Chronic Paediatric Condition,PC6z.00,Hypospadias or epispadias NOS,Read codes v2,Renal and Genitourinary - congenital renal dis...,Renal and Genitourinary
1527,Chronic Paediatric Condition,SP08b00,De novo glomerulonephritis,Read codes v2,Renal and Genitourinary - non-congenital renal...,Renal and Genitourinary
47,Chronic Paediatric Condition,SP08b00,De novo glomerulonephritis,Read codes v2,Transplant,Transplant
1245,Chronic Paediatric Condition,ZV44000,[V]Has tracheostomy,Read codes v2,Other respiratory - respiratory devices,Other respiratory (excluding asthma and cystic...


In [None]:
# Export the duplicate codes so that Livvy can review and remove them
duplicate_codes_in_category.to_csv('code_dupes.csv')

### Eleanor's Dissertation

Eleanor's dissertation has two sets of ICD10 code mappings:

*   ARI - a list of ICD10 codes that should be treated as indicators of an Acute Respiratory Infection
*   Other - a set of categories for the ICD10 codes based on the first 3 characters of the code.

Here is a list of the mappings from Eleanor's dissertation. Many of these are "short" ICD10 codes that use only the first 3 characters. Others contain the full ICD10 code.

In [None]:
df_el_ari_maps = pd.DataFrame([{'code':'J00', 'description':'Common Cold'},
{'code':'J12', 'description':'Viral Pneumonia'},
{'code':'J14', 'description':'Other Pneumonia'},
{'code':'J15', 'description':'Other Pneumonia'},
{'code':'J16', 'description':'Other Pneumonia'},
{'code':'J18', 'description':'Other Pneumonia'},
{'code':'J20', 'description':'Acute Bronchiolitis'},
{'code':'J21', 'description':'Acute Bronchiolitis'},
{'code':'J22', 'description':'Unspecified ARI'},
{'code':'J34', 'description':'Upper Respiratory Tract Abcess'}, #  Should these even be here - these don't seem to be respiratory infections
{'code':'J36', 'description':'Upper Respiratory Tract Abcess'}, #  ???
{'code':'J39', 'description':'Upper Respiratory Tract Abcess'}, #  ??? These 3 codes are very rare in El's dissertation - <5 cases in total and mostly older children (none sub 18 months)
{'code':'J45', 'description':'Unspecified Asthma'},
{'code':'J85', 'description':'Pyogenic Lower Respiratory Tract Complication'},
{'code':'J86', 'description':'Pyogenic Lower Respiratory Tract Complication'},
{'code':'R06', 'description':'Breathing Abnormalities'},
{'code':'R060', 'description':'ARI'},
{'code':'R062', 'description':'ARI'},
{'code':'J210', 'description':'ARI'},
{'code':'J22X', 'description':'ARI'},
{'code':'R068', 'description':'ARI'},
{'code':'J189', 'description':'ARI'},
{'code':'J168', 'description':'ARI'},
{'code':'J860', 'description':'ARI'},
{'code':'J181', 'description':'ARI'},
{'code':'J218', 'description':'ARI'},
{'code':'J00X', 'description':'ARI'},
{'code':'J219', 'description':'ARI'},
{'code':'J150', 'description':'ARI'},
{'code':'J123', 'description':'ARI'},
{'code':'J13X', 'description':'ARI'},
{'code':'J211', 'description':'ARI'},
{'code':'J128', 'description':'ARI'},
{'code':'J158', 'description':'ARI'},
{'code':'H624 A', 'description':'ARI'},
{'code':'J129', 'description':'ARI'},
{'code':'J850', 'description':'ARI'},
{'code':'J459', 'description':'ARI'},
{'code':'J120', 'description':'ARI'},
{'code':'J121', 'description':'ARI'},
{'code':'J154', 'description':'ARI'},
{'code':'J208', 'description':'ARI'},
{'code':'J122', 'description':'ARI'},
{'code':'J869', 'description':'ARI'},
{'code':'J209', 'description':'ARI'},
{'code':'J390', 'description':'ARI'},
{'code':'J201', 'description':'ARI'},
{'code':'J14X', 'description':'ARI'},
{'code':'J206', 'description':'ARI'},
{'code':'J340', 'description':'ARI'},
{'code':'J36X', 'description':'ARI'},
{'code':'J391', 'description':'ARI'}
])
df_el_ari_maps

Unnamed: 0,code,description
0,J00,Common Cold
1,J12,Viral Pneumonia
2,J14,Other Pneumonia
3,J15,Other Pneumonia
4,J16,Other Pneumonia
5,J18,Other Pneumonia
6,J20,Acute Bronchiolitis
7,J21,Acute Bronchiolitis
8,J22,Unspecified ARI
9,J34,Upper Respiratory Tract Abcess


In [None]:
# Generate the list of codes from El's dissertation that do not appear in the global download in the category "Acute Respiratory Infection"
codes_in_el_ari_not_in_df = df_el_ari_maps[~df_el_ari_maps['code'].isin(df[df['phenotype_category']=='Acute Respiratory Infection']['code'])]
display(codes_in_el_ari_not_in_df)

Unnamed: 0,code,description
9,J34,Upper Respiratory Tract Abcess
11,J39,Upper Respiratory Tract Abcess
15,R06,Breathing Abnormalities
19,J22X,ARI
26,J00X,ARI
30,J13X,ARI
34,H624 A,ARI
47,J14X,ARI
50,J36X,ARI


J34, J39 and R06 do not appear in either list of ICD10 condition codes: the list of ICD10 codes that appeared in SMR01 for Eleanor, or the new list of ICD10 codes that DataLoch recently sent over. Is this because they are particularly rare? All the other ICD10 codes appear in El's list. So where do these three belong?
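
Several of the other unmatched codes in the table above (J22X, J00X, J13X, J14X, J36X) differ from a 3-character code only by a trailing X, and H624 A carries an extra letter after a space. A minimal sketch of normalising such codes before comparing the lists follows. It assumes that the X is a filler character and that anything after a space is a marker rather than part of the code; both assumptions should be checked against the SMR01 coding conventions. `normalise_icd10` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def normalise_icd10(code: str) -> str:
    """Reduce a code to a comparable form under the assumptions stated above."""
    c = code.strip().upper()
    c = c.split(" ")[0]              # assumption: text after a space is a marker, not code
    if len(c) == 4 and c.endswith("X"):
        c = c[:3]                    # assumption: 'X' pads a 3-character code to 4 characters
    return c

# Codes taken from the unmatched list shown above.
unmatched = pd.Series(["J22X", "J00X", "J13X", "H624 A", "J14X", "J36X", "J34"])
normalised = unmatched.map(normalise_icd10)
print(normalised.tolist())
```

Re-running the `isin` comparison on the normalised codes would show how many mismatches are purely a formatting difference.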

## Ranking Field
Add a ranking field to the list and use it to decide which row takes precedence when we want a unique mapping from condition codes to conditions. Livvy has supplied a rank for the duplicated codes so that each individual membership can be ranked. All codes with rank = 1 should be unique within a phenotype_category (Acute Respiratory Infection or Chronic Paediatric Condition).
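
The uniqueness requirement can be checked directly once ranks are in place. This is a minimal sketch on toy data: `rank_one_is_unique` is a hypothetical helper and the codes are placeholders.

```python
import pandas as pd

def rank_one_is_unique(frame: pd.DataFrame) -> bool:
    """Return True when no (phenotype_category, code) pair repeats among rank-1 rows."""
    rank_one = frame[frame["rank"] == 1]
    return not rank_one.duplicated(subset=["phenotype_category", "code"]).any()

# Toy data: the same placeholder code appears twice in "Chronic", once with rank 2.
toy = pd.DataFrame({
    "phenotype_category": ["Chronic", "Chronic", "Acute"],
    "code": ["code_a", "code_a", "code_a"],
    "rank": [1, 2, 1],
})
print(rank_one_is_unique(toy))
```

The check passes here because the second "Chronic" membership has rank 2; setting every rank to 1 would make it fail.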

In [None]:
# Apply Livvy's ranking to the codes so that we can use them to build a unique mapping of
# ICD10/read v2 codes to conditions
df_ranked = df[df.code != 'NA'].copy() # Remove the three codes that are listed as "NA"

# Create rank field, default to 1
df_ranked['rank'] = 1

# Read in Livvy's list of rank updates and changes from a public dropbox file
df_rank_updates = pd.read_csv('https://www.dropbox.com/scl/fi/1mhou1s8q1k1syx0r9dxp/code_dupes_chronic_conditions.csv?rlkey=0f6hx44magqoavocz8psosnhu&st=ev5f46rl&dl=1')

# Rename 'Rank' column in df_rank_updates to 'new_rank' to avoid confusion during merge
df_rank_updates_renamed = df_rank_updates.rename(columns={'Rank': 'new_rank'})

# Perform a left merge to bring the 'new_rank' into df_ranked
df_ranked = df_ranked.merge(
    df_rank_updates_renamed[
        ['phenotype_category',
         'phenotype_name',
         'clinical_subcategory',
         'code',
         'new_rank']
    ],
    on=['phenotype_category', 'phenotype_name', 'clinical_subcategory', 'code'],
    how='left'
)

# Update the 'rank' column in df_ranked where 'new_rank' is not NaN
df_ranked['rank'] = df_ranked['new_rank'].fillna(df_ranked['rank']).astype(int)

# Drop the temporary 'new_rank' column
df_ranked = df_ranked.drop(columns=['new_rank'])

display(df_ranked.head())
display(df_ranked['rank'].value_counts())
display(df_ranked['code_system'].value_counts())

Unnamed: 0,phenotype_category,code,description,code_system,clinical_subcategory,phenotype_name,rank
0,Acute Respiratory Infection,A15,"Respiratory tuberculosis, bacteriologically an...",ICD10 codes,Respiratory Tract Infection - Secondary care,Respiratory Tract Infection,1
1,Acute Respiratory Infection,A150,"Tuberculosis of lung, confirmed by sputum micr...",ICD10 codes,Respiratory Tract Infection - Secondary care,Respiratory Tract Infection,1
2,Acute Respiratory Infection,A151,"Tuberculosis of lung, confirmed by culture only",ICD10 codes,Respiratory Tract Infection - Secondary care,Respiratory Tract Infection,1
3,Acute Respiratory Infection,A152,"Tuberculosis of lung, confirmed histologically",ICD10 codes,Respiratory Tract Infection - Secondary care,Respiratory Tract Infection,1
4,Acute Respiratory Infection,A153,"Tuberculosis of lung, confirmed by unspecified...",ICD10 codes,Respiratory Tract Infection - Secondary care,Respiratory Tract Infection,1


Unnamed: 0_level_0,count
rank,Unnamed: 1_level_1
1,11372
2,143


Unnamed: 0_level_0,count
code_system,Unnamed: 1_level_1
Read codes v2,8525
ICD10 codes,2990


In [None]:
# Export the chronic conditions with ranks
# Used to supply Mat @ DataLoch with the
#df_ranked[df_ranked['phenotype_category']=='Chronic Paediatric Condition'].to_csv('chronic_conditions_with_ranks.csv', index=False)


## Remap Phenotype Names
Livvy supplied 3 additional phenotypes for the Ear, Nose and Throat categories. These phenotypes consist of "Read v2" GP codes for Tonsillitis, Sore Throats, and Otitis Media. I'm going to map the phenotype_name for those phenotypes, together with "Upper Respiratory Tract Infection (URTI)", to "Ear and Upper Respiratory Tract Infections". I'm also going to remap "Respiratory Tract Infection" and "Lower respiratory tract infection (LRTI)" to "Lower Respiratory Tract infection".
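
Because the remap relies on exact string matches, it can be worth checking for dictionary keys that never occur in the data. This is a minimal sketch on toy data; the key "Tonsillitis" is a deliberately hypothetical non-matching name.

```python
import pandas as pd

names = pd.Series(["Sore Throat", "Otitismedia", "Wheezing"])
remaps = {
    "Sore Throat": "Ear and Upper Respiratory Tract Infections",
    "Otitismedia": "Ear and Upper Respiratory Tract Infections",
    "Tonsillitis": "Ear and Upper Respiratory Tract Infections",  # hypothetical key, absent from `names`
}

# Keys that match nothing usually point to a spelling or capitalisation mismatch.
unused_keys = sorted(set(remaps) - set(names))

# Names without a matching key are left unchanged by replace().
remapped = names.replace(remaps)
print(unused_keys)
print(remapped.tolist())
```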

In [None]:
dict_remaps = {
    'Respiratory Tract Infection' : 'Lower Respiratory Tract infection',
    'Lower respiratory tract infection (LRTI)' : 'Lower Respiratory Tract infection',
    'Tonsillitis Primary care' : 'Ear and Upper Respiratory Tract Infections',
    'Sore Throat' : 'Ear and Upper Respiratory Tract Infections',
    'Otitismedia' : 'Ear and Upper Respiratory Tract Infections',
    'Upper Respiratory Tract Infection (URTI)' : 'Ear and Upper Respiratory Tract Infections',
}
df_ranked['phenotype_name'] = df_ranked['phenotype_name'].replace(dict_remaps)

## Remove Duplicates in the Acute Respiratory Infections

Livvy also asked for the duplicates that were introduced when the 3 additional phenotypes were added to the Acute Respiratory Infections to be removed.
This code dedupes on phenotype_category, code and rank, and shows the removed rows so that we can check whether duplicates were removed only from the Acute Respiratory Infections.
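
One way to confirm where the removed rows came from is to count them per category. This is a minimal sketch on toy data with placeholder codes and shortened category labels.

```python
import pandas as pd

toy = pd.DataFrame({
    "phenotype_category": ["Acute", "Acute", "Chronic", "Chronic"],
    "code": ["code_a", "code_a", "code_b", "code_b"],
    "rank": [1, 1, 1, 2],
})
subset = ["phenotype_category", "code", "rank"]

# Rows that drop_duplicates(keep='first') would discard.
removed = toy[toy.duplicated(subset=subset, keep="first")]

# A category missing from this count lost no rows.
removed_per_category = removed["phenotype_category"].value_counts()
print(removed_per_category)
```

The two "Chronic" rows survive because they differ in rank, so only the "Acute" category appears in the count.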

In [None]:
duplicate_subset_cols = ['phenotype_category', 'code', 'rank']

# Identify duplicate rows that will be removed
df_removed_duplicates = df_ranked[df_ranked.duplicated(subset=duplicate_subset_cols, keep='first')]

# Deduplicate df_ranked, keeping the first occurrence
df_ranked = df_ranked.drop_duplicates(subset=duplicate_subset_cols, keep='first')

print("Head of the deduplicated DataFrame (df_ranked):")
display(df_ranked.head())

print("\nHead of the DataFrame containing removed duplicate rows (df_removed_duplicates):")
display(df_removed_duplicates.head())

print(f"Total rows removed: {len(df_removed_duplicates)}")

Head of the deduplicated DataFrame (df_ranked):


Unnamed: 0,phenotype_category,code,description,code_system,clinical_subcategory,phenotype_name,rank
0,Acute Respiratory Infection,A15,"Respiratory tuberculosis, bacteriologically an...",ICD10 codes,Respiratory Tract Infection - Secondary care,Lower Respiratory Tract infection,1
1,Acute Respiratory Infection,A150,"Tuberculosis of lung, confirmed by sputum micr...",ICD10 codes,Respiratory Tract Infection - Secondary care,Lower Respiratory Tract infection,1
2,Acute Respiratory Infection,A151,"Tuberculosis of lung, confirmed by culture only",ICD10 codes,Respiratory Tract Infection - Secondary care,Lower Respiratory Tract infection,1
3,Acute Respiratory Infection,A152,"Tuberculosis of lung, confirmed histologically",ICD10 codes,Respiratory Tract Infection - Secondary care,Lower Respiratory Tract infection,1
4,Acute Respiratory Infection,A153,"Tuberculosis of lung, confirmed by unspecified...",ICD10 codes,Respiratory Tract Infection - Secondary care,Lower Respiratory Tract infection,1



Head of the DataFrame containing removed duplicate rows (df_removed_duplicates):


Unnamed: 0,phenotype_category,code,description,code_system,clinical_subcategory,phenotype_name,rank
260,Acute Respiratory Infection,Hyu0200,[X]Acute tonsillitis due to other specified or...,Read codes v2,Tonsillitis Primary care - 1,Ear and Upper Respiratory Tract Infections,1
261,Acute Respiratory Infection,14B7.00,History of recurrent tonsillitis,Read codes v2,Tonsillitis Primary care - BREATHE recommended...,Ear and Upper Respiratory Tract Infections,1
262,Acute Respiratory Infection,14B8.00,History of tonsillitis,Read codes v2,Tonsillitis Primary care - BREATHE recommended...,Ear and Upper Respiratory Tract Infections,1
263,Acute Respiratory Infection,2DB6.00,O/E - follicular tonsillitis,Read codes v2,Tonsillitis Primary care - BREATHE recommended...,Ear and Upper Respiratory Tract Infections,1
264,Acute Respiratory Infection,A340300,Streptococcal tonsillitis,Read codes v2,Tonsillitis Primary care - BREATHE recommended...,Ear and Upper Respiratory Tract Infections,1


Total rows removed: 38


## Final List
The final list of categories and conditions looks like this. This list is likely to require some final manual intervention before uploading to DataLoch.

In [None]:
import csv
df_ranked.to_csv('code_list.csv', index=False, quotechar='"', quoting=csv.QUOTE_NONNUMERIC)  # quote all non-numeric (object/string) columns
print('Code duplicates - Rank = 1')
df_deduped = df_ranked[df_ranked['rank']==1] # Unique condition code to condition mapping list
df_temp = display_code_dupes(df_deduped)

print('Code duplicates - Rank = 2')
df_deduped = df_ranked[df_ranked['rank']==2] # Rank-2 rows: the lower-priority memberships of duplicated codes
df_temp = display_code_dupes(df_deduped)
#display(df_ranked.head(10))

Code duplicates - Rank = 1


Unnamed: 0,phenotype_category,code,description,code_system,clinical_subcategory,phenotype_name,rank


Code duplicates - Rank = 2


Unnamed: 0,phenotype_category,code,description,code_system,clinical_subcategory,phenotype_name,rank


In [None]:
df_ranked[df_ranked['code']=='H036.00']

Unnamed: 0,phenotype_category,code,description,code_system,clinical_subcategory,phenotype_name,rank
253,Acute Respiratory Infection,H036.00,Acute viral tonsillitis,Read codes v2,Tonsillitis Primary care - 1,Ear and Upper Respiratory Tract Infections,1


A treemap of the dataframe shows the number of codes in each category. The Acute Respiratory Infections branch suggests that some care should be taken with the phenotype names.

In [None]:
import plotly.express as px
fig = px.treemap(df_ranked,
                 path=[ 'phenotype_category', 'phenotype_name', 'clinical_subcategory', 'code_system'])

fig.update_layout(title="Code Lists",
                  width=1000, height=700,)

fig.show()

In [None]:
import re

def test_readv2(df):

# 2) Focus only on rows that claim to be Read v2
    mask_readv2 = df["code_system"].astype(str).str.strip().str.casefold() == "read codes v2"
    readv2 = df.loc[mask_readv2].copy()

# 3) Clean the code column (uppercase, trim, drop internal spaces for validation)
    readv2["code_raw"] = readv2["code"]
    readv2["code_clean"] = (
        readv2["code_raw"]
        .astype(str)
        .str.strip()
        .str.upper()
    )
# A separate version without any whitespace for validation
    readv2["code_nospace"] = readv2["code_clean"].str.replace(r"\s+", "", regex=True)

# 4) Validation helpers
# Allowed characters only?
    readv2["has_only_allowed_chars"] = readv2["code_nospace"].str.fullmatch(r"[A-Z0-9.]+", na=False)

# Exactly 5 chars (bare Read v2 code)? len7 flags 7-char codes but is not used in the checks below
    readv2["len5"] = readv2["code_nospace"].str.len() == 5
    readv2["len7"] = readv2["code_nospace"].str.len() == 7
# Dots only at the end (i.e., no dot followed by an alphanumeric)
    readv2["has_internal_dot"] = readv2["code_nospace"].str.contains(r"\.[A-Z0-9]")

# "Full" 5-byte Read v2 format: length 5, allowed chars, and any dots must be trailing
    readv2["is_full_read_v2"] = (
        readv2["has_only_allowed_chars"] &
        readv2["len5"] &
        (~readv2["has_internal_dot"]) &
    # also ensure the code doesn't start with a dot
        readv2["code_nospace"].str.match(r"^[A-Z0-9]")
    )

# Truncated-but-fixable (1–4 alphanumerics, no dots)
    readv2["is_truncated_fixable"] = readv2["code_nospace"].str.fullmatch(r"[A-Z0-9]{1,4}", na=False)

# 5) Normalise/pad truncated Read v2 codes (optional)
    def pad_read_v2(code):
        if pd.isna(code):
            return None
        c = str(code).strip().upper()
        c = re.sub(r"\s+", "", c)          # remove spaces
        c = re.sub(r"\.*$", "", c)         # strip any trailing dots
        if not re.fullmatch(r"[A-Z0-9]{1,5}", c):  # must be only alphanumerics to be padded
            return None
        if len(c) > 5:
            return None
        return c + "." * (5 - len(c))

    readv2["read_v2_full_padded"] = readv2.apply(
         lambda row: row["code_nospace"] if row["is_full_read_v2"]
        else (pad_read_v2(row["code_nospace"]) if row["is_truncated_fixable"] else None),
        axis=1
     )

# 6) Summary
    total = len(readv2)
    n_full = int(readv2["is_full_read_v2"].sum())
    n_trunc = int(readv2["is_truncated_fixable"].sum())
    n_invalid = total - n_full - n_trunc

    print(f"Rows with code_system == 'Read v2': {total}")
    print(f"Full 5-byte Read v2 codes: {n_full}")
    print(f"Truncated but fixable (can be dot-padded): {n_trunc}")
    print(f"Invalid/other issues: {n_invalid}")

# 7) Inspect problems
    print("\nExamples of truncated (will be padded):")
    display(readv2.loc[readv2["is_truncated_fixable"], ["code_raw", "code_nospace", "read_v2_full_padded"]].head(20))

    print("\nExamples of invalid (bad chars, internal dots, wrong length):")
    display(readv2.loc[~(readv2["is_full_read_v2"] | readv2["is_truncated_fixable"]),
                       ["code_raw", "code_nospace", "has_only_allowed_chars", "len5", "has_internal_dot"]].head(20))

# 8) (Optional) Merge results back to the original dataframe and save
    df_out = df.copy()
    df_out.loc[mask_readv2, "code_full_read_v2"] = readv2["read_v2_full_padded"]
    df_out.loc[mask_readv2, "is_full_read_v2"] = readv2["is_full_read_v2"].values
    df_out.loc[mask_readv2, "validation_status"] = (
        readv2.apply(lambda r: "full" if r["is_full_read_v2"]
                     else ("padded" if pd.notna(r["read_v2_full_padded"]) else "invalid"), axis=1)
        .values
    )
    return df_out
df_temp = test_readv2(df_ranked)


Rows with code_system == 'Read v2': 8487
Full 5-byte Read v2 codes: 0
Truncated but fixable (can be dot-padded): 1
Invalid/other issues: 8486

Examples of truncated (will be padded):


Unnamed: 0,code_raw,code_nospace,read_v2_full_padded
2267,2828,2828,2828.0



Examples of invalid (bad chars, internal dots, wrong length):


Unnamed: 0,code_raw,code_nospace,has_only_allowed_chars,len5,has_internal_dot
121,H06..00,H06..00,True,False,True
122,H060.00,H060.00,True,False,True
123,H060000,H060000,True,False,False
124,H060.11,H060.11,True,False,True
125,H060300,H060300,True,False,False
126,H060400,H060400,True,False,False
127,H060500,H060500,True,False,False
128,H060600,H060600,True,False,False
129,H060700,H060700,True,False,False
130,H060800,H060800,True,False,False


In [None]:
display(df_temp[df_temp['is_full_read_v2']==True])
display(df_ranked[df_ranked['code']=='H33z000'])

Unnamed: 0,phenotype_category,code,description,code_system,clinical_subcategory,phenotype_name,rank,code_full_read_v2,is_full_read_v2,validation_status


Unnamed: 0,phenotype_category,code,description,code_system,clinical_subcategory,phenotype_name,rank
347,Chronic Paediatric Condition,H33z000,Status asthmaticus NOS,Read codes v2,Asthma,Asthma,1
11496,Acute Respiratory Infection,H33z000,Status asthmaticus NOS,Read codes v2,Acute Asthma,Acute Asthma,1


In [None]:
display(df_ranked[df_ranked['code']=='H060300'])

Unnamed: 0,phenotype_category,code,description,code_system,clinical_subcategory,phenotype_name,rank
125,Acute Respiratory Infection,H060300,Acute purulent bronchitis,Read codes v2,Lower respiratory tract infection - Primary Care,Lower Respiratory Tract infection,1


In [None]:
df_pivot = df.pivot_table(index=['phenotype_category', 'phenotype_name', 'clinical_subcategory'], columns='code_system', values='code', aggfunc='count')

df_pivot

Unnamed: 0_level_0,Unnamed: 1_level_0,code_system,ICD10 codes,Read codes v2
phenotype_category,phenotype_name,clinical_subcategory,Unnamed: 3_level_1,Unnamed: 4_level_1
Acute Respiratory Infection,Acute Asthma,Acute Asthma,6.0,13.0
Acute Respiratory Infection,Ear and Upper Respiratory Tract Infections,Ear and Upper Respiratory Tract Infections - Secondary care - Diagnoses,36.0,
Acute Respiratory Infection,Lower respiratory tract infection (LRTI),Lower respiratory tract infection - Primary Care,,46.0
Acute Respiratory Infection,Otitismedia,Otitismedia - Primary Care,,28.0
Acute Respiratory Infection,Respiratory Tract Infection,Respiratory Tract Infection - Secondary care,85.0,
Acute Respiratory Infection,Sore Throat,Sore Throat - Primary Care,,33.0
Acute Respiratory Infection,Tonsillitis Primary care,Tonsillitis Primary care - 1,,26.0
Acute Respiratory Infection,Tonsillitis Primary care,Tonsillitis Primary care - BREATHE recommended - 1,,23.0
Acute Respiratory Infection,Upper Respiratory Tract Infection (URTI),Upper Respiratory Tract Infection - Primary Care,,40.0
Acute Respiratory Infection,Wheezing,Wheezing,3.0,12.0


In [None]:
import plotly.express as px

fig = px.bar(df.pivot_table(index=['phenotype_category', 'phenotype_name', 'clinical_subcategory','code_system'], values='code', aggfunc='count').reset_index(),
             x='phenotype_name',
             y='code',
             color='code_system', # This will create two sets of bars (ICD10 and Read V2)
             facet_col='phenotype_category', # Separate charts for Chronic and Acute
             facet_col_wrap=1, # Arrange facets in a single column
             barmode='group', # Group bars side-by-side
             hover_data=['clinical_subcategory'], # Show subcategory on hover
             title='Count of Codes by Phenotype, Clinical Subcategory, and Code System',
             labels={'phenotype_name': 'Phenotype Name', 'code': 'Number of Codes', 'code_system': 'Code System'},
             height=800)

fig.update_layout(xaxis={'categoryorder':'category ascending'})
fig.show()

## TODO

1. Verify that all Read codes are in a consistent format: either full Read codes or short Read codes. They should be consistent within the chronic conditions in the phenotypes identified by Swann et al., but might be more problematic in the codes generated from the other lists used for Asthma.
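
A first pass at this check could classify each Read code by its shape. The validation output above suggests that almost all codes here are 7 characters long rather than 5. This is a minimal sketch, assuming the 7-character form is a 5-character Read code followed by a 2-character suffix. `read_code_shape` is a hypothetical helper, and "H06.." is an illustrative 5-character form rather than a value from the data. The sketch does not uppercase the codes, on the assumption that case is significant in Read v2.

```python
import re
import pandas as pd

def read_code_shape(code: str) -> str:
    """Classify a Read v2 code string as '7-char', '5-char' or 'other'."""
    c = str(code).strip()
    if re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9.]{4}[A-Za-z0-9]{2}", c):
        return "7-char"   # assumed: 5-character code plus 2-character suffix
    if re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9.]{4}", c):
        return "5-char"
    return "other"

codes = pd.Series(["H06..00", "H060.11", "H33z000", "H06..", "2828"])
shapes = codes.map(read_code_shape)
print(shapes.value_counts())
```

Grouping these shapes by phenotype_name would show which source lists mix formats.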
