# 01 Prep QOF data 

> A first look at the QOF data, prepping the data and associating GP practices to geographic identifiers.  

---

In [1]:
#|default_exp core.01_prep_data

In [2]:
#|hide
import nbdev; nbdev.nbdev_export()

In [3]:
#|hide
from nbdev.showdoc import show_doc

In [4]:
#|export
import dementia_inequalities as proj
from dementia_inequalities import const, log, utils, tools
import adu_proj.utils as adutils

In [5]:
#|export
import numpy as np 
import pandas as pd 



## Quality Outcomes Framework (QOF) 

I'm going to start by exploring the QOF dataset.

In [6]:
#| export 

# Useful to use this: const.data_path
df_QOF_prev = pd.read_csv(const.data_path+'/QOF_2022-23/PREVALENCE_2223.csv')

In [7]:
df_QOF_prev.head()

Unnamed: 0,PRACTICE_CODE,GROUP_CODE,REGISTER,PATIENT_LIST_TYPE,PRACTICE_LIST_SIZE
0,A81001,AF,112.0,TOTAL,3996
1,A81001,AST,317.0,06OV,3766
2,A81001,CAN,131.0,TOTAL,3996
3,A81001,CHD,150.0,TOTAL,3996
4,A81001,CKD,93.0,18OV,3179


In [8]:
df_QOF_prev[df_QOF_prev['GROUP_CODE']=='DEM']

Unnamed: 0,PRACTICE_CODE,GROUP_CODE,REGISTER,PATIENT_LIST_TYPE,PRACTICE_LIST_SIZE
6,A81001,DEM,22.0,TOTAL,3996
27,A81002,DEM,203.0,TOTAL,18315
48,A81004,DEM,100.0,TOTAL,11276
69,A81005,DEM,100.0,TOTAL,8005
90,A81006,DEM,156.0,TOTAL,14383
...,...,...,...,...,...
133839,Y07057,DEM,60.0,TOTAL,12005
133860,Y07059,DEM,110.0,TOTAL,19271
133881,Y07060,DEM,133.0,TOTAL,25901
133902,Y07274,DEM,2.0,TOTAL,1229


Let us look at how many people are recorded as having dementia 

In [9]:
total_dem = df_QOF_prev[df_QOF_prev['GROUP_CODE']=='DEM']['REGISTER'].sum()
print(f'The total number of people with dementia, across all practices in England, is {int(total_dem)}.')

The total number of people with dementia, across all practices in England, is 460330.


Let's look at practice information to try and understand these statistics a bit better 

In [10]:
df_practice_fem_age = pd.read_csv(const.data_path+'/GP_practices_dec_23/gp-reg-pat-prac-sing-age-female.csv')
df_practice_men_age = pd.read_csv(const.data_path+'/GP_practices_dec_23/gp-reg-pat-prac-sing-age-male.csv')
df_practice_map = pd.read_csv(const.data_path+'/GP_practices_dec_23/gp-reg-pat-prac-map.csv')

Right, I think it would be good to have the following in a single dataframe:
* ORG_CODE - check this lines up with the PRACTICE_CODE
* POSTCODE (in df_practice_map)
* PRACTICE_NAME (in df_practice_map)
* PCN_NAME (in df_practice_map)
* PRACTICE_LIST_SIZE (in df_QOF_prev)
* num women aged 65+ (any diag) (in df_practice_fem_age)
* num men aged 65+ (any diag) (in df_practice_men_age)
* num dementia (in df_QOF_prev)
* num dementia aged 65+ (not sure this info is available)

In [11]:
df_practice_map_geo = df_practice_map[['PRACTICE_CODE', 'PRACTICE_NAME', 'PRACTICE_POSTCODE', 'PCN_NAME']].copy()
df_practice_map_geo.head()

Unnamed: 0,PRACTICE_CODE,PRACTICE_NAME,PRACTICE_POSTCODE,PCN_NAME
0,A81001,THE DENSHAM SURGERY,TS18 1HU,STOCKTON PCN
1,A81002,QUEENS PARK MEDICAL CENTRE,TS18 2AW,NORTH STOCKTON PCN
2,A81004,ACKLAM MEDICAL CENTRE,TS5 8SB,GREATER MIDDLESBROUGH PCN
3,A81005,SPRINGWOOD SURGERY,TS14 7DJ,EAST CLEVELAND PCN
4,A81006,TENNANT STREET MEDICAL PRACTICE,TS18 2AT,NORTH STOCKTON PCN


In [12]:
# Only interested in Dementia diagnosis 
df_QOF_dem = df_QOF_prev[df_QOF_prev['GROUP_CODE']=='DEM'].copy()
# Drop columns which are not useful 
df_QOF_dem.drop(labels=['GROUP_CODE', 'PATIENT_LIST_TYPE'], axis=1, inplace=True)
df_QOF_dem.rename(columns={'REGISTER':'DEM_REGISTER'}, inplace=True)
df_QOF_dem.head()

Unnamed: 0,PRACTICE_CODE,DEM_REGISTER,PRACTICE_LIST_SIZE
6,A81001,22.0,3996
27,A81002,203.0,18315
48,A81004,100.0,11276
69,A81005,100.0,8005
90,A81006,156.0,14383


Check the practices are unique

In [13]:
df_QOF_dem['PRACTICE_CODE'].is_unique

True

In [14]:
print(f'Number of practices in QOF dataset: {len(df_QOF_dem)}. Number of practices in geographies dataset: {len(df_practice_map_geo)}. Difference: {len(df_QOF_dem)-len(df_practice_map_geo)}.')

Number of practices in QOF dataset: 6378. Number of practices in geographies dataset: 6328. Difference: 50.


I wonder where this difference of 50 GP practices comes from. It might be because the geography data comes from a December snapshot, whereas the QOF dataset spans the year. So maybe these 50 practices closed over the course of the year?
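One way to look into the mismatch is to list the codes that appear in only one of the two tables. The sketch below uses toy stand-ins for `df_QOF_dem` and `df_practice_map_geo` (the codes are made up) and relies on the `indicator=True` option of `pd.merge`. Note that a difference in row counts need not equal the number of unmatched codes, since codes can be missing on either side.

```python
import pandas as pd

# Toy stand-ins for df_QOF_dem and df_practice_map_geo (made-up practice codes)
qof = pd.DataFrame({'PRACTICE_CODE': ['A1', 'A2', 'A3', 'A4']})
geo = pd.DataFrame({'PRACTICE_CODE': ['A1', 'A2', 'A5']})

# An outer merge with indicator=True adds a '_merge' column saying
# whether each code was found in the left table, the right table, or both
both = pd.merge(qof, geo, on='PRACTICE_CODE', how='outer', indicator=True)

only_in_qof = both.loc[both['_merge'] == 'left_only', 'PRACTICE_CODE'].tolist()
only_in_geo = both.loc[both['_merge'] == 'right_only', 'PRACTICE_CODE'].tolist()

print('Only in QOF:', only_in_qof)
print('Only in geography data:', only_in_geo)
```

On the real data, the `left_only` codes would be the practices that end up with missing names and postcodes after the left merge.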

In [15]:
df_QOF_dem_geo = pd.merge(df_QOF_dem, df_practice_map_geo, on="PRACTICE_CODE", how='left')

In [16]:
df_QOF_dem_geo.head()

Unnamed: 0,PRACTICE_CODE,DEM_REGISTER,PRACTICE_LIST_SIZE,PRACTICE_NAME,PRACTICE_POSTCODE,PCN_NAME
0,A81001,22.0,3996,THE DENSHAM SURGERY,TS18 1HU,STOCKTON PCN
1,A81002,203.0,18315,QUEENS PARK MEDICAL CENTRE,TS18 2AW,NORTH STOCKTON PCN
2,A81004,100.0,11276,ACKLAM MEDICAL CENTRE,TS5 8SB,GREATER MIDDLESBROUGH PCN
3,A81005,100.0,8005,SPRINGWOOD SURGERY,TS14 7DJ,EAST CLEVELAND PCN
4,A81006,156.0,14383,TENNANT STREET MEDICAL PRACTICE,TS18 2AT,NORTH STOCKTON PCN


The age values in the practice-by-age datasets need tidying up: the top age band is stored as the string '95+', so the AGE column is not numeric.

In [17]:
# Women by age dataset
df_practice_fem_age.loc[df_practice_fem_age['AGE'] == '95+', 'AGE'] = 95
df_practice_fem_age['AGE'] = pd.to_numeric(df_practice_fem_age['AGE'], errors='coerce') # errors='coerce' replaces any remaining non-numeric values with NaN

# Men by age dataset
df_practice_men_age.loc[df_practice_men_age['AGE'] == '95+', 'AGE'] = 95
df_practice_men_age['AGE'] = pd.to_numeric(df_practice_men_age['AGE'], errors='coerce') # errors='coerce' replaces any remaining non-numeric values with NaN
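To make the effect of the tidy-up concrete, here is a minimal sketch on a toy series. The `'ALL'` label is a hypothetical example of a non-numeric value that might be left over in the AGE column; whether the real files contain such a label is an assumption.

```python
import pandas as pd

# Toy AGE column: numeric strings, the '95+' band, and a hypothetical 'ALL' label
ages = pd.Series(['0', '64', '95+', 'ALL'])

# Map the open-ended top band to its lower bound, as in the cell above
ages = ages.replace('95+', '95')

# Anything that still cannot be parsed as a number becomes NaN
numeric = pd.to_numeric(ages, errors='coerce')

print(numeric.tolist())
print('Values coerced to NaN:', int(numeric.isna().sum()))
```

Rows with NaN ages are excluded from any `AGE >= 65` filter automatically, because comparisons with NaN evaluate to False.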

In [18]:
def over_65_gender(df, # dataframe of patient counts by practice and single year of age
                   col_name): # column name to be assigned 
    # Note: the threshold is inclusive (AGE >= 65), despite the 'over 65' naming
    org_code = list(df['ORG_CODE'].unique())
    over_65_by_code = []
    for i in org_code:
        num_over_65 = df[(df['ORG_CODE'] == str(i))&(df['AGE'] >= 65)]['NUMBER_OF_PATIENTS'].sum()
        over_65_by_code.append(num_over_65)
    data = {'ORG_CODE': org_code, col_name: over_65_by_code} # renamed to avoid shadowing the built-in dict
    return pd.DataFrame(data)
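The loop in `over_65_gender` filters the whole dataframe once per practice, which can be slow with thousands of practices. A possible alternative is a single `groupby`. The sketch below is not the notebook's method; `over_65_by_practice` is a hypothetical name and the data is a toy example. The `reindex` step keeps practices with no patients aged 65 or over (as a count of 0) and preserves the order of first appearance, which should match what the loop returns.

```python
import pandas as pd

def over_65_by_practice(df, col_name):
    """Sum NUMBER_OF_PATIENTS for AGE >= 65 per ORG_CODE using one groupby."""
    org_codes = df['ORG_CODE'].unique()  # order of first appearance
    totals = (
        df.loc[df['AGE'] >= 65]
          .groupby('ORG_CODE')['NUMBER_OF_PATIENTS']
          .sum()
          .reindex(org_codes, fill_value=0)  # keep practices with no matching rows
    )
    return totals.rename(col_name).rename_axis('ORG_CODE').reset_index()

# Toy data: C3 has no patients aged 65 or over
toy = pd.DataFrame({
    'ORG_CODE': ['B2', 'B2', 'A1', 'A1', 'C3'],
    'AGE': [64, 65, 70, 95, 30],
    'NUMBER_OF_PATIENTS': [10, 20, 5, 1, 7],
})
result = over_65_by_practice(toy, 'WOMEN_OVER_65')
print(result)
```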

In [19]:
df_women_over_65 = over_65_gender(df_practice_fem_age, 'WOMEN_OVER_65')
df_men_over_65 = over_65_gender(df_practice_men_age, 'MEN_OVER_65')

In [20]:
print(f'Number of practices with women: {len(df_women_over_65)}. Number of practices with men: {len(df_men_over_65)}.')

Number of practices with women: 6328. Number of practices with men: 6328.


Merge the men's and women's data for those aged 65 and over

In [21]:
df_over_65_merge = pd.merge(df_women_over_65, df_men_over_65, on="ORG_CODE", how='left')
df_over_65_merge.head()

Unnamed: 0,ORG_CODE,WOMEN_OVER_65,MEN_OVER_65
0,A84002,1182,1060
1,A84005,1476,1271
2,A84006,3085,2729
3,A84007,1644,1423
4,A84008,838,751


Now merge the over 65 data with the dementia data and geographic info

In [22]:
df = pd.merge(df_QOF_dem_geo, df_over_65_merge, left_on="PRACTICE_CODE", right_on='ORG_CODE', how='left')

Check that the practice codes are recorded as expected 

In [23]:
df[pd.isna(df['PRACTICE_CODE'])]

Unnamed: 0,PRACTICE_CODE,DEM_REGISTER,PRACTICE_LIST_SIZE,PRACTICE_NAME,PRACTICE_POSTCODE,PCN_NAME,ORG_CODE,WOMEN_OVER_65,MEN_OVER_65


In [24]:
df[df['PRACTICE_CODE'] == ''].index

Index([], dtype='int64')

Drop the duplicated column (since PRACTICE_CODE and ORG_CODE are the same).

In [25]:
df.drop(labels=['ORG_CODE'], axis=1, inplace=True)

In [26]:
df.head()

Unnamed: 0,PRACTICE_CODE,DEM_REGISTER,PRACTICE_LIST_SIZE,PRACTICE_NAME,PRACTICE_POSTCODE,PCN_NAME,WOMEN_OVER_65,MEN_OVER_65
0,A81001,22.0,3996,THE DENSHAM SURGERY,TS18 1HU,STOCKTON PCN,456.0,433.0
1,A81002,203.0,18315,QUEENS PARK MEDICAL CENTRE,TS18 2AW,NORTH STOCKTON PCN,2278.0,1918.0
2,A81004,100.0,11276,ACKLAM MEDICAL CENTRE,TS5 8SB,GREATER MIDDLESBROUGH PCN,1194.0,1044.0
3,A81005,100.0,8005,SPRINGWOOD SURGERY,TS14 7DJ,EAST CLEVELAND PCN,1309.0,1119.0
4,A81006,156.0,14383,TENNANT STREET MEDICAL PRACTICE,TS18 2AT,NORTH STOCKTON PCN,1540.0,1319.0
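The counts in WOMEN_OVER_65 and MEN_OVER_65 now show as floats, which is what pandas does when a left merge leaves some rows unmatched and fills them with NaN. That is probably the practices present in the QOF data but absent from the December snapshot. A minimal sketch of how to check this, on toy frames with made-up values, using the `validate` option of `pd.merge` to also confirm that no practice code is duplicated on either side:

```python
import pandas as pd

# Toy stand-ins for df_QOF_dem_geo and df_over_65_merge (made-up values)
left = pd.DataFrame({'PRACTICE_CODE': ['A1', 'A2', 'A3'],
                     'DEM_REGISTER': [22.0, 203.0, 100.0]})
right = pd.DataFrame({'ORG_CODE': ['A1', 'A2'],
                      'WOMEN_OVER_65': [456, 2278]})

# validate='one_to_one' raises MergeError if keys repeat on either side
merged = pd.merge(left, right, left_on='PRACTICE_CODE', right_on='ORG_CODE',
                  how='left', validate='one_to_one')

# Rows with no partner in the right table have NaN in the right-hand columns
unmatched = merged.loc[merged['ORG_CODE'].isna(), 'PRACTICE_CODE'].tolist()

print('Unmatched practices:', unmatched)
print('dtype after merge:', merged['WOMEN_OVER_65'].dtype)
```

This check has to run before ORG_CODE is dropped; afterwards, `isna()` on one of the count columns gives the same information.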


In [30]:
df.to_csv(const.output_path+'/QOF_GP_dem.csv', index=False)

## Prescribing data 

This is the data on all the drugs which were prescribed in November 2023 (file `EPD_202311.csv`, YEAR_MONTH 202311) across all GP practices in England. 

In [31]:
df_prescribe = pd.read_csv(const.data_path+'/EPD_202311.csv')

In [33]:
len(df_prescribe)

17918320

In [32]:
df_prescribe.head()

Unnamed: 0,YEAR_MONTH,REGIONAL_OFFICE_NAME,REGIONAL_OFFICE_CODE,ICB_NAME,ICB_CODE,PCO_NAME,PCO_CODE,PRACTICE_NAME,PRACTICE_CODE,ADDRESS_1,...,BNF_CODE,BNF_DESCRIPTION,BNF_CHAPTER_PLUS_CODE,QUANTITY,ITEMS,TOTAL_QUANTITY,ADQUSAGE,NIC,ACTUAL_COST,UNIDENTIFIED
0,202311,EAST OF ENGLAND,Y61,NHS HERTFORDSHIRE AND WEST ESSEX INTEGRA,QM7,NHS HERTFORDSHIRE AND WEST ESSEX ICB - 0,07H00,THE RIVER SURGERY,F81216,THE RIVER SURGERY,...,20020200822,K-Lite bandage 10cm x 4.5m,20: Dressings,20.0,2,40.0,0.0,43.2,40.61396,N
1,202311,EAST OF ENGLAND,Y61,NHS HERTFORDSHIRE AND WEST ESSEX INTEGRA,QM7,NHS HERTFORDSHIRE AND WEST ESSEX ICB - 0,07H00,THE RIVER SURGERY,F81216,THE RIVER SURGERY,...,20020200824,K-Lite Long bandage 10cm x 5.25m,20: Dressings,60.0,1,60.0,0.0,74.4,69.91596,N
2,202311,EAST OF ENGLAND,Y61,NHS HERTFORDSHIRE AND WEST ESSEX INTEGRA,QM7,NHS HERTFORDSHIRE AND WEST ESSEX ICB - 0,07H00,THE RIVER SURGERY,F81216,THE RIVER SURGERY,...,20020200824,K-Lite Long bandage 10cm x 5.25m,20: Dressings,30.0,1,30.0,0.0,37.2,34.96418,N
3,202311,EAST OF ENGLAND,Y61,NHS HERTFORDSHIRE AND WEST ESSEX INTEGRA,QM7,NHS HERTFORDSHIRE AND WEST ESSEX ICB - 0,07H00,THE RIVER SURGERY,F81216,THE RIVER SURGERY,...,20020200824,K-Lite Long bandage 10cm x 5.25m,20: Dressings,10.0,1,10.0,0.0,12.4,11.66299,N
4,202311,EAST OF ENGLAND,Y61,NHS HERTFORDSHIRE AND WEST ESSEX INTEGRA,QM7,NHS HERTFORDSHIRE AND WEST ESSEX ICB - 0,07H00,THE RIVER SURGERY,F81216,THE RIVER SURGERY,...,20020200914,K-Soft Long bandage 10cm x 4.5m,20: Dressings,10.0,1,10.0,0.0,6.1,5.74374,N


I'm interested in prescriptions of any of the four anti-dementia drugs: 

* Donepezil AKA Aricept
* Rivastigmine AKA Exelon
* Galantamine AKA Reminyl
* Memantine AKA Ebixa or Marixino or Valios

In [36]:
df_prescribe[['BNF_CHEMICAL_SUBSTANCE', 'CHEMICAL_SUBSTANCE_BNF_DESCR', 'BNF_CODE', 'BNF_DESCRIPTION', 'BNF_CHAPTER_PLUS_CODE']].head()

Unnamed: 0,BNF_CHEMICAL_SUBSTANCE,CHEMICAL_SUBSTANCE_BNF_DESCR,BNF_CODE,BNF_DESCRIPTION,BNF_CHAPTER_PLUS_CODE
0,2002,Arm Sling/Bandages,20020200822,K-Lite bandage 10cm x 4.5m,20: Dressings
1,2002,Arm Sling/Bandages,20020200824,K-Lite Long bandage 10cm x 5.25m,20: Dressings
2,2002,Arm Sling/Bandages,20020200824,K-Lite Long bandage 10cm x 5.25m,20: Dressings
3,2002,Arm Sling/Bandages,20020200824,K-Lite Long bandage 10cm x 5.25m,20: Dressings
4,2002,Arm Sling/Bandages,20020200914,K-Soft Long bandage 10cm x 4.5m,20: Dressings


In [39]:
df_prescribe[df_prescribe['BNF_CODE']=='0411000D0']

Unnamed: 0,YEAR_MONTH,REGIONAL_OFFICE_NAME,REGIONAL_OFFICE_CODE,ICB_NAME,ICB_CODE,PCO_NAME,PCO_CODE,PRACTICE_NAME,PRACTICE_CODE,ADDRESS_1,...,BNF_CODE,BNF_DESCRIPTION,BNF_CHAPTER_PLUS_CODE,QUANTITY,ITEMS,TOTAL_QUANTITY,ADQUSAGE,NIC,ACTUAL_COST,UNIDENTIFIED
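The filter returns no rows. A likely reason is that BNF_CODE holds the full presentation-level code (see the 11-character dressing codes in the preview), so an exact comparison against the shorter string `'0411000D0'` can never match. A prefix match is one way around this. The sketch below uses toy rows; the two longer codes are made up, and treating `'0411000D0'` as the donepezil prefix is an assumption carried over from the cell above.

```python
import pandas as pd

# Toy rows: two made-up codes sharing the '0411000D0' prefix, plus one
# dressing code taken from the preview above
toy = pd.DataFrame({
    'BNF_CODE': ['0411000D0AAAAAA', '0411000D0BBAAAB', '20020200822'],
    'ITEMS': [3, 1, 2],
})

# Exact equality only matches codes that are identical to the short string
exact = toy[toy['BNF_CODE'] == '0411000D0']

# Prefix matching picks up every code that starts with it
prefix = toy[toy['BNF_CODE'].astype(str).str.startswith('0411000D0')]

print(len(exact), len(prefix))
```

The BNF_CHEMICAL_SUBSTANCE column shown earlier may offer a more direct filter, if it holds the shorter substance-level code for drugs as well.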


In [40]:
df_prescribe.apply(lambda row: row.astype(str).str.contains('donepezil').any(), axis=1)

KeyboardInterrupt: 
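The row-wise `apply` converts every cell of nearly 18 million rows to a string, which is why it had to be interrupted. Searching a single text column with the vectorised `.str.contains` should be far cheaper. A minimal sketch on toy rows follows; the donepezil description is made up for illustration. It also uses `case=False`, since a lowercase search term would otherwise miss capitalised descriptions.

```python
import pandas as pd

# Toy rows; 'Donepezil 10mg tablets' is a made-up description
toy = pd.DataFrame({
    'PRACTICE_CODE': ['F81216', 'F81216', 'A81001'],
    'BNF_DESCRIPTION': ['K-Lite bandage 10cm x 4.5m', 'Donepezil 10mg tablets', None],
    'ITEMS': [2, 5, 1],
})

# Vectorised, case-insensitive search on one column; missing values count as no match
mask = toy['BNF_DESCRIPTION'].str.contains('donepezil', case=False, na=False)
hits = toy[mask]

print(hits)
```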

In [None]:
# Want to reduce the dataframe down as much as possible 
df_prescribe[['PRACTICE_CODE', 'BNF_CODE', 'BNF_DESCRIPTION', 'ITEMS']]
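Selecting columns after loading still requires the full file to fit in memory. One option is to restrict the columns and rows at read time with the `usecols`, `dtype` and `chunksize` arguments of `pd.read_csv`. The sketch below reads a small in-memory CSV in place of `EPD_202311.csv`; the drug rows are toy data, and the `'0411000D0'` prefix is the same assumption as in the earlier filter.

```python
import io
import pandas as pd

# Small in-memory stand-in for the prescribing file (toy rows)
csv_text = """YEAR_MONTH,PRACTICE_CODE,BNF_CODE,BNF_DESCRIPTION,ITEMS,NIC
202311,F81216,20020200822,K-Lite bandage 10cm x 4.5m,2,43.2
202311,F81216,0411000D0AAAAAA,Toy dementia drug row,5,10.0
202311,A81001,0411000D0AAAAAA,Toy dementia drug row,1,2.0
"""

cols = ['PRACTICE_CODE', 'BNF_CODE', 'BNF_DESCRIPTION', 'ITEMS']

# Read only the needed columns, keep BNF_CODE as text, and process in chunks
chunks = pd.read_csv(io.StringIO(csv_text), usecols=cols,
                     dtype={'BNF_CODE': str}, chunksize=2)

# Keep only the rows of interest from each chunk, then stitch them together
kept = [chunk[chunk['BNF_CODE'].str.startswith('0411000D0')] for chunk in chunks]
df_small = pd.concat(kept, ignore_index=True)

print(df_small)
```

For the real file, the `io.StringIO(...)` argument would be replaced by the CSV path and `chunksize` set to something much larger.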