# Analysing data on hospital buildings (ERIC)

The Estates Returns Information Collection (ERIC) - data on NHS buildings including hospitals - is [published every October](https://digital.nhs.uk/data-and-information/publications/statistical/estates-returns-information-collection).

This notebook details the code needed to:

* Fetch the latest data
* Fetch the data from 5 other years
* Drill down to the columns that interest us (there are a *lot* of columns)

The [page publishing the data](https://digital.nhs.uk/data-and-information/publications/statistical/estates-returns-information-collection/estates-return-information-collection-2016-17) notes:

> Note: in 2019 we were advised of an error in Devonshire Partnership NHS Trust's submitted Oil Consumption figures. The correct figure for the aggregate site consumption is 40,798.8 kWh, rather than the reported 3,855 kWh.

> Note: 7th September 2021: When the revalidated data was released, only the revised headline figures, report (containing trust, site and PFI level data) and data quality statement were made available (figures in the underlying data .csv files were not updated, although revised trust, site and PFI revised figures were available in the data tables). We apologise for any confusion caused and have now published revised .csv files to accompany the release products. These are clearly labelled below.

For this analysis we are using the data marked "revalidation".


In [None]:
#import pandas for dealing with data
import pandas as pd
#we will need the math library too for detecting nan values
import math

## Import the data

Let's start by importing the earliest year's data.

In [None]:
sitedataurl = "https://files.digital.nhs.uk/F7/97436C/ERIC-201617-revalidation-Site%20Data-version2.csv"

In [None]:
#reading with the default encoding raises a UnicodeDecodeError, so we specify the encoding - thanks to https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas
sitedata1617 = pd.read_csv(sitedataurl, encoding = "ISO-8859-1")
sitedata1617.head(3)

Unnamed: 0,Trust Code,Trust Name,Commissioning Region,Trust Type,Site Code,Site Name,Site Type,Tenure,Post Code,Estates and facilities finance costs (£),...,Cost of cleaning the occupied floor area not requiring regular cleaning (£),Occupied floor area not requiring regular cleaning (%),Inpatient food service cost (£),Inpatient main meals requested (No.),Cost of feeding one inpatient per day (inpatient meal day) (£),Laundry and linen service cost (£),Pieces per annum (No.),Laundry and linen service used (Select),Portering service cost (£),Portering staff (WTE)
0,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,AGGRE,AGGREGATE SITE,Aggregate Site,,WR4 9RW,2728686,...,Not Applicable,Not Applicable,14458,9855,4.4,198,421,3. Hybrid,14500,0.6
1,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,R1A1P,EVESHAM COMMUNITY HOSPITAL,7. Community hospital (with inpatient beds),1. Freehold,WR11 1JT,510125,...,0,0,202047,72429,8.37,228381,200797,3. Hybrid,242003,8.0
2,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,R1A22,KEITH WINTER CLOSE MH UNIT,4. Mental Health (including Specialist services),1. Freehold,B61 0EX,74098,...,399,1,17660,11219,4.72,4813,10000,2. Full Service  In house,0,0.0


## Create a list of the keys we always want to use

We want to keep the keys that we'll use in *all* analyses: those that identify the site, trust, and region. There are a lot of keys...

In [None]:
#how many keys are there?
len(sitedata1617.keys())

128

In [None]:
#let's identify the ones we always want
print(sitedata1617.keys()[0:9])
#and store
keykeys = list(sitedata1617.keys()[0:9])

Index(['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type',
       'Site Code', 'Site Name', 'Site Type', 'Tenure', 'Post Code'],
      dtype='object')


## Create a subset focused on backlogs of repairs

With so many columns, we need to drill down to those relevant to our analysis. We start with columns about the backlog of repairs. These all have the word 'backlog' in them, so we find the column headings with this code:

In [None]:
#loop through the keys and extract the ones with backlog in them
backlog_keys = [key for key in sitedata1617.keys() if 'backlog' in key.lower()]

In [None]:
#check the results
backlog_keys

['Cost to eradicate high risk backlog (£)',
 'Cost to eradicate significant risk backlog (£)',
 'Cost to eradicate moderate risk backlog (£)',
 'Cost to eradicate low risk backlog (£)']
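
The same case-insensitive match can also be written against the column index directly. The following is a minimal sketch on a toy dataframe: the values are made up, and the column `Investment to reduce Backlog maintenance (£)` is invented here only to show that capitalisation does not matter.

```python
import pandas as pd

# Toy frame: only the first two column names mirror the ERIC file,
# the third is a made-up name and all values are placeholders
toy = pd.DataFrame({
    "Trust Code": ["R1A"],
    "Cost to eradicate high risk backlog (£)": ["0"],
    "Investment to reduce Backlog maintenance (£)": ["1,000"],
    "Portering staff (WTE)": ["0.6"],
})

# Case-insensitive, literal (non-regex) substring match on the column names
mask = toy.columns.str.contains("backlog", case=False, regex=False)
backlog_columns = list(toy.columns[mask])
print(backlog_columns)
```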

In [None]:
#add those keys to the ones we've already stored
thesekeys = keykeys+backlog_keys
print(thesekeys)

['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Site Code', 'Site Name', 'Site Type', 'Tenure', 'Post Code', 'Cost to eradicate high risk backlog (£)', 'Cost to eradicate significant risk backlog (£)', 'Cost to eradicate moderate risk backlog (£)', 'Cost to eradicate low risk backlog (£)']


In [None]:
#we have to specify the indices to prevent an error
bothkeys = thesekeys[:9]+backlog_keys

In [None]:
#use those to extract a subset
backlogdf = sitedata1617[bothkeys]
backlogdf.head(3)

Unnamed: 0,Trust Code,Trust Name,Commissioning Region,Trust Type,Site Code,Site Name,Site Type,Tenure,Post Code,Cost to eradicate high risk backlog (£),Cost to eradicate significant risk backlog (£),Cost to eradicate moderate risk backlog (£),Cost to eradicate low risk backlog (£)
0,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,AGGRE,AGGREGATE SITE,Aggregate Site,,WR4 9RW,0,316628,2049070,1218122
1,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,R1A1P,EVESHAM COMMUNITY HOSPITAL,7. Community hospital (with inpatient beds),1. Freehold,WR11 1JT,0,1401259,1635668,1077418
2,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,R1A22,KEITH WINTER CLOSE MH UNIT,4. Mental Health (including Specialist services),1. Freehold,B61 0EX,0,0,35623,30587


### Create a pivot table on high risk costs

If we try to pivot on total cost of backlogs by region, we hit two problems: the costs are stored as text, and there are non-numeric values in there.

In [None]:
backlogdf.pivot_table(index="Commissioning Region",
                      values="Cost to eradicate high risk backlog (£)",
                      aggfunc="sum")


Unnamed: 0_level_0,Cost to eradicate high risk backlog (£)
Commissioning Region,Unnamed: 1_level_1
Commissioning Region,Cost to eradicate high risk backlog (£)
LONDON COMMISSIONING REGION,"01,602,560916,00016,784,6457,279,2992,462,4295..."
MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,"00000000000000Not Applicable120,0000170,00032,..."
NORTH OF ENGLAND COMMISSIONING REGION,"000Not ApplicableNot Applicable00000050,000173..."
SOUTH OF ENGLAND COMMISSIONING REGION,"4,50022,220Not Applicable60,68351,3500149,8007..."
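
The run-together digits in that output are what "summing" text produces: the strings are joined end to end rather than added. A small sketch, using three of the figures visible in the output above and an explicit `object` dtype (an assumption made here so the values are treated as plain Python strings):

```python
import pandas as pd

# Three figures stored as text, as they are in the CSV
costs_as_text = pd.Series(["4,500", "22,220", "60,683"], dtype=object)

# Summing strings concatenates them instead of adding them
joined = costs_as_text.sum()
print(joined)
print(type(joined))
```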


### Filter out non-numbers and convert to integers

We can see a problem with the results caused by the data being stored as strings, and including values like 'Not Applicable'. Let's fix that.

In [None]:
#filter to the rows where the condition is True
backlogdf[backlogdf['Cost to eradicate high risk backlog (£)'] == 'Not Applicable']

Unnamed: 0,Trust Code,Trust Name,Commissioning Region,Trust Type,Site Code,Site Name,Site Type,Tenure,Post Code,Cost to eradicate high risk backlog (£),Cost to eradicate significant risk backlog (£),Cost to eradicate moderate risk backlog (£),Cost to eradicate low risk backlog (£)
15,R1C,SOLENT NHS TRUST,SOUTH OF ENGLAND COMMISSIONING REGION,COMMUNITY,R1C34,ROYAL SOUTH HANTS HOSPITAL,7. Community hospital (with inpatient beds),7. Leased from NHS Property Services,SO14 0YG,Not Applicable,Not Applicable,Not Applicable,Not Applicable
20,R1D,SHROPSHIRE COMMUNITY HEALTH NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,R1D21,LUDLOW HOSPITAL,7. Community hospital (with inpatient beds),7. Leased from NHS Property Services,SY8 1QX,Not Applicable,Not Applicable,Not Applicable,Not Applicable
48,R1K,LONDON NORTH WEST HEALTHCARE NHS TRUST,LONDON COMMISSIONING REGION,ACUTE - LARGE,R1K23,WILLESDEN CENTRE FOR HEALTH AND CARE,7. Community hospital (with inpatient beds),7. Leased from NHS Property Services,NW10 3RY,Not Applicable,Not Applicable,Not Applicable,Not Applicable
68,RAE,BRADFORD TEACHING HOSPITALS NHS FOUNDATION TRUST,NORTH OF ENGLAND COMMISSIONING REGION,ACUTE - TEACHING,RAE3A,WESTWOOD PARK,7. Community hospital (with inpatient beds),7. Leased from NHS Property Services,BD6 3NL,Not Applicable,Not Applicable,Not Applicable,Not Applicable
69,RAE,BRADFORD TEACHING HOSPITALS NHS FOUNDATION TRUST,NORTH OF ENGLAND COMMISSIONING REGION,ACUTE - TEACHING,RAE5H,WESTBOURNE GREEN COMMUNITY HOSPITAL,7. Community hospital (with inpatient beds),7. Leased from NHS Property Services,BD8 8RA,Not Applicable,Not Applicable,Not Applicable,Not Applicable
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1179,RYY,KENT COMMUNITY HEALTH NHS FOUNDATION TRUST,SOUTH OF ENGLAND COMMISSIONING REGION,COMMUNITY,RYYCH,VICTORIA HOSPITAL,7. Community hospital (with inpatient beds),7. Leased from NHS Property Services,CT14 9UA,Not Applicable,Not Applicable,Not Applicable,Not Applicable
1180,RYY,KENT COMMUNITY HEALTH NHS FOUNDATION TRUST,SOUTH OF ENGLAND COMMISSIONING REGION,COMMUNITY,RYYCM,WHITSTABLE & TANKERTON HOSPITAL,7. Community hospital (with inpatient beds),7. Leased from NHS Property Services,CT5 2HN,Not Applicable,Not Applicable,Not Applicable,Not Applicable
1181,RYY,KENT COMMUNITY HEALTH NHS FOUNDATION TRUST,SOUTH OF ENGLAND COMMISSIONING REGION,COMMUNITY,RYYD4,EDENBRIDGE HOSPITAL,7. Community hospital (with inpatient beds),7. Leased from NHS Property Services,TN8 5DA,Not Applicable,Not Applicable,Not Applicable,Not Applicable
1183,RYY,KENT COMMUNITY HEALTH NHS FOUNDATION TRUST,SOUTH OF ENGLAND COMMISSIONING REGION,COMMUNITY,RYYD9,SEVENOAKS HOSPITAL,7. Community hospital (with inpatient beds),7. Leased from NHS Property Services,TN13 3PG,Not Applicable,Not Applicable,Not Applicable,Not Applicable


There's also a problem with a line which repeats the column headers.

In [None]:
#filter to the rows where the condition is True
backlogdf[backlogdf['Cost to eradicate high risk backlog (£)'] == 'Cost to eradicate high risk backlog (£)']

Unnamed: 0,Trust Code,Trust Name,Commissioning Region,Trust Type,Site Code,Site Name,Site Type,Tenure,Post Code,Cost to eradicate high risk backlog (£),Cost to eradicate significant risk backlog (£),Cost to eradicate moderate risk backlog (£),Cost to eradicate low risk backlog (£)
1215,Trust Code,Trust Name,Commissioning Region,Trust Type,Site Code,Site Name,Site Type,Tenure,Post Code,Cost to eradicate high risk backlog (£),Cost to eradicate significant risk backlog (£),Cost to eradicate moderate risk backlog (£),Cost to eradicate low risk backlog (£)


We can calculate how many rows we *should* end up with once the 'Not Applicable' rows are removed.

In [None]:
#Calculate how many rows we should end up with when the 'Not Applicable' rows are removed
#start with the number of rows in the dataframe
total_rows = len(backlogdf)
print(total_rows)
#how many rows match the filter
not_applicable_rows = len(backlogdf[backlogdf['Cost to eradicate high risk backlog (£)'] == 'Not Applicable'])
print(not_applicable_rows)
#subtract one from the other
print(total_rows - not_applicable_rows)

1216
161
1055
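
'Not Applicable' is only one kind of non-numeric value. One way to list every value that would fail to convert, without knowing them in advance, is to attempt the conversion and look at what fails. This is a sketch on made-up data, not the method used in the rest of this notebook.

```python
import pandas as pd

column_name = "Cost to eradicate high risk backlog (£)"
# Made-up column containing each kind of value seen so far
toy = pd.DataFrame({column_name: ["0", "1,602,560", "Not Applicable", None, column_name]})

values = toy[column_name]
# Strip commas, then convert; anything that cannot be converted becomes NaN
as_numbers = pd.to_numeric(values.str.replace(",", "", regex=False), errors="coerce")

# The original values in the positions that failed to convert
non_numeric = values[as_numbers.isna()]
print(non_numeric.value_counts(dropna=False))
```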


Then apply a filter to remove the rows we don't want.

In [None]:
#filter to the rows where the condition is True
#.index gives the row labels of those rows, which .drop() then removes
backlog_filtered = backlogdf.drop(backlogdf[backlogdf['Cost to eradicate high risk backlog (£)'] == 'Not Applicable'].index)
#remove the extra row of headers too - this time inplace
backlog_filtered.drop(backlog_filtered[backlog_filtered['Cost to eradicate high risk backlog (£)'] == 'Cost to eradicate high risk backlog (£)'].index,
                      inplace = True)
backlog_filtered

Unnamed: 0,Trust Code,Trust Name,Commissioning Region,Trust Type,Site Code,Site Name,Site Type,Tenure,Post Code,Cost to eradicate high risk backlog (£),Cost to eradicate significant risk backlog (£),Cost to eradicate moderate risk backlog (£),Cost to eradicate low risk backlog (£)
0,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,AGGRE,AGGREGATE SITE,Aggregate Site,,WR4 9RW,0,316628,2049070,1218122
1,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,R1A1P,EVESHAM COMMUNITY HOSPITAL,7. Community hospital (with inpatient beds),1. Freehold,WR11 1JT,0,1401259,1635668,1077418
2,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,R1A22,KEITH WINTER CLOSE MH UNIT,4. Mental Health (including Specialist services),1. Freehold,B61 0EX,0,0,35623,30587
3,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,R1A49,HILL CREST MH UNIT,4. Mental Health (including Specialist services),4. SLA/lease from NHS,B98 7WG,0,0,0,16695
4,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,R1AAN,TENBURY COMMUNITY HOSPITAL,7. Community hospital (with inpatient beds),1. Freehold,WR15 8AF,0,4878,175215,116636
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1210,TAJ,BLACK COUNTRY PARTNERSHIP NHS FOUNDATION TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,CARE TRUST,TAJ52,PENN HOSPITAL,4. Mental Health (including Specialist services),1. Freehold,WV4 5HN,0,284034,135650,165600
1211,TAJ,BLACK COUNTRY PARTNERSHIP NHS FOUNDATION TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,CARE TRUST,TAJ54,RIDGE HILL HOSPITAL,5. Learning Disabilities,6. Local Investment Finance Trust (LIFT),DY8 5ST,0,0,0,0
1212,TAJ,BLACK COUNTRY PARTNERSHIP NHS FOUNDATION TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,CARE TRUST,TAJ55,DAISY BANK,5. Learning Disabilities,1. Freehold,WS5 3DY,0,32382,22000,0
1213,,,,,,,,,,,,,


### Try the pivot table again

Now that we've removed the 'Not Applicable' rows, will the pivot table work?

In [None]:
backlog_filtered.pivot_table(index="Commissioning Region",
                      values="Cost to eradicate high risk backlog (£)",
                      aggfunc="sum")


Unnamed: 0_level_0,Cost to eradicate high risk backlog (£)
Commissioning Region,Unnamed: 1_level_1
LONDON COMMISSIONING REGION,"01,602,560916,00016,784,6457,279,2992,462,4295..."
MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,"00000000000000120,0000170,00032,00054,411296,4..."
NORTH OF ENGLAND COMMISSIONING REGION,"00000000050,000173,84200354,108000002,313,5850..."
SOUTH OF ENGLAND COMMISSIONING REGION,"4,50022,22060,68351,3500149,800764,58925,00000..."


### Filter out NaN values (floats)

The figures are still being joined together as text rather than added up. We also have some `nan` values in there. These read as floats, so we can filter on type to see whether those rows can be safely removed.

In [None]:
#create a filter that looks for floats (NaN values)
backlog_filtered[[type(i) == float for i in backlog_filtered["Cost to eradicate high risk backlog (£)"]]]

Unnamed: 0,Trust Code,Trust Name,Commissioning Region,Trust Type,Site Code,Site Name,Site Type,Tenure,Post Code,Cost to eradicate high risk backlog (£),Cost to eradicate significant risk backlog (£),Cost to eradicate moderate risk backlog (£),Cost to eradicate low risk backlog (£)
1213,,,,,,,,,,,,,
1214,,,,,,,,,,,,,


They can: both rows are entirely empty.

In [None]:

#drop the rows where the value is a float (NaN); .index gives the labels of the matching rows
backlog_filtered = backlogdf.drop(backlogdf[[type(i) == float for i in backlogdf["Cost to eradicate high risk backlog (£)"]]].index)
#drop the rows marked 'Not Applicable'
backlog_filtered = backlog_filtered.drop(backlog_filtered[backlog_filtered['Cost to eradicate high risk backlog (£)'] == 'Not Applicable'].index)
#remove the extra row of headers too
backlog_filtered = backlog_filtered.drop(backlog_filtered[backlog_filtered['Cost to eradicate high risk backlog (£)'] == 'Cost to eradicate high risk backlog (£)'].index)
backlog_filtered

Unnamed: 0,Trust Code,Trust Name,Commissioning Region,Trust Type,Site Code,Site Name,Site Type,Tenure,Post Code,Cost to eradicate high risk backlog (£),Cost to eradicate significant risk backlog (£),Cost to eradicate moderate risk backlog (£),Cost to eradicate low risk backlog (£)
0,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,AGGRE,AGGREGATE SITE,Aggregate Site,,WR4 9RW,0,316628,2049070,1218122
1,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,R1A1P,EVESHAM COMMUNITY HOSPITAL,7. Community hospital (with inpatient beds),1. Freehold,WR11 1JT,0,1401259,1635668,1077418
2,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,R1A22,KEITH WINTER CLOSE MH UNIT,4. Mental Health (including Specialist services),1. Freehold,B61 0EX,0,0,35623,30587
3,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,R1A49,HILL CREST MH UNIT,4. Mental Health (including Specialist services),4. SLA/lease from NHS,B98 7WG,0,0,0,16695
4,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,COMMUNITY,R1AAN,TENBURY COMMUNITY HOSPITAL,7. Community hospital (with inpatient beds),1. Freehold,WR15 8AF,0,4878,175215,116636
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1208,TAJ,BLACK COUNTRY PARTNERSHIP NHS FOUNDATION TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,CARE TRUST,TAJ11,HEATH LANE HOSPITAL,6. Mental Health and Learning Disabilities,1. Freehold,B71 2BG,46428,87195,16400,10000
1209,TAJ,BLACK COUNTRY PARTNERSHIP NHS FOUNDATION TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,CARE TRUST,TAJ20,HALLAM STREET HOSPITAL,3. Mixed service hospital,2. Whole site - Private Finance Initiative (PFI),B71 4NH,0,0,0,0
1210,TAJ,BLACK COUNTRY PARTNERSHIP NHS FOUNDATION TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,CARE TRUST,TAJ52,PENN HOSPITAL,4. Mental Health (including Specialist services),1. Freehold,WV4 5HN,0,284034,135650,165600
1211,TAJ,BLACK COUNTRY PARTNERSHIP NHS FOUNDATION TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,CARE TRUST,TAJ54,RIDGE HILL HOSPITAL,5. Learning Disabilities,6. Local Investment Finance Trust (LIFT),DY8 5ST,0,0,0,0
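
The three separate drops above can also be expressed as a single boolean mask, using `isna()`/`notna()` in place of the type check. A minimal sketch on made-up rows (the site names are placeholders):

```python
import pandas as pd

column_name = "Cost to eradicate high risk backlog (£)"
# Made-up rows: one valid, one 'Not Applicable', one empty, one repeated header
toy = pd.DataFrame({
    "Site Name": ["SITE A", "SITE B", None, "Site Name"],
    column_name: ["0", "Not Applicable", None, column_name],
})

# Keep rows that are not missing and not one of the unwanted text values
unwanted = ["Not Applicable", column_name]
keep = toy[column_name].notna() & ~toy[column_name].isin(unwanted)

# .copy() gives an independent dataframe, so adding columns later is unambiguous
toy_filtered = toy[keep].copy()
print(toy_filtered)
```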


### Clean the commas from the figures so they can be converted to numbers

Now we should be able to convert the remaining strings to numbers - but we need to remove the commas first.

In [None]:
#create a new list
highriskcost_int = []
#loop through the strings
for i in backlog_filtered["Cost to eradicate high risk backlog (£)"]:
  #if it's a string, which they all should be now
  if type(i) == str:
    #remove the commas, otherwise it won't convert to an integer
    newfigure = int(i.replace(',',''))
    #add to the list
    highriskcost_int.append(newfigure)
  else:
    #report anything unexpected: its type and whether it is NaN
    print('Unexpected value of type', type(i))
    print(math.isnan(i))

In [None]:
#is it the same length as the dataframe?
len(highriskcost_int) == len(backlog_filtered)

True

In [None]:
#check we can add it back into the dataframe
backlog_filtered['highriskcost_int'] = highriskcost_int

We can now check the total and see whether it matches the figure from the same analysis performed in an Excel spreadsheet (947110283).

In [None]:
#check the total
sum(highriskcost_int)

947110283
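
The loop used above can also be written with pandas string methods, which clean the whole column in one step. A sketch on three figures taken from the London output earlier, assuming every remaining value is a string of digits and commas:

```python
import pandas as pd

column_name = "Cost to eradicate high risk backlog (£)"
toy_filtered = pd.DataFrame({column_name: ["0", "1,602,560", "916,000"]})

# Remove the commas from every value, then convert the column to integers
toy_filtered["highriskcost_int"] = (
    toy_filtered[column_name].str.replace(",", "", regex=False).astype(int)
)
total = int(toy_filtered["highriskcost_int"].sum())
print(total)
```

Unlike the loop, `astype(int)` raises an error on the first value it cannot convert, so it relies on the filtering steps having already run.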

## Pivot on total cost of backlog of high risk repairs

Now we can try that pivot again.

In [None]:
backlog_filtered.pivot_table(index="Commissioning Region",
                      values="highriskcost_int",
                      aggfunc="sum")


Unnamed: 0_level_0,highriskcost_int
Commissioning Region,Unnamed: 1_level_1
LONDON COMMISSIONING REGION,525903775
MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,194626219
NORTH OF ENGLAND COMMISSIONING REGION,103329921
SOUTH OF ENGLAND COMMISSIONING REGION,123250368


These values also match an analysis in Excel.

We can do the same pivot table by trust, too.

In [None]:
backlog_filtered.pivot_table(index="Trust Name",
                      values="highriskcost_int",
                      aggfunc="sum")


Unnamed: 0_level_0,highriskcost_int
Trust Name,Unnamed: 1_level_1
2GETHER NHS FOUNDATION TRUST,1664000
5 BOROUGHS PARTNERSHIP NHS FOUNDATION TRUST,0
AINTREE UNIVERSITY HOSPITAL NHS FOUNDATION TRUST,0
AIREDALE NHS FOUNDATION TRUST,385000
ALDER HEY CHILDRENS NHS FOUNDATION TRUST,50000
...,...
"WRIGHTINGTON, WIGAN AND LEIGH NHS FOUNDATION TRUST",0
WYE VALLEY NHS TRUST,14900000
YEOVIL DISTRICT HOSPITAL NHS FOUNDATION TRUST,0
YORK TEACHING HOSPITAL NHS FOUNDATION TRUST,2495085


## Fetch the other years (create some functions)

Now to compare that with other years.





In [None]:
#define a function, it takes one argument - the url of the CSV
def backlogdataonly(csvurl):
  #read in the CSV
  sitedata = pd.read_csv(csvurl, encoding = "ISO-8859-1")
  #store the first 9 column names
  keykeys = list(sitedata.keys()[0:9])
  print(keykeys)
  #loop through the keys and extract the ones with backlog in them
  backlog_keys = [key for key in sitedata.keys() if 'backlog' in key.lower()]
  #add those keys to the ones we've already stored
  bothkeys = keykeys[:9]+backlog_keys
  print(bothkeys)
  #use those to extract a subset
  backlogdf = sitedata[bothkeys]
  #drop the rows where the value is a float (NaN); .index gives the labels of the matching rows
  backlog_filtered = backlogdf.drop(backlogdf[[type(i) == float for i in backlogdf["Cost to eradicate high risk backlog (£)"]]].index)
  #drop the rows marked 'Not Applicable'
  backlog_filtered = backlog_filtered.drop(backlog_filtered[backlog_filtered['Cost to eradicate high risk backlog (£)'] == 'Not Applicable'].index)
  #remove the extra row of headers too
  backlog_filtered = backlog_filtered.drop(backlog_filtered[backlog_filtered['Cost to eradicate high risk backlog (£)'] == 'Cost to eradicate high risk backlog (£)'].index)
  #return the resulting dataframe to whatever called the function
  return backlog_filtered

In [None]:
#define a function that takes a column of number strings and returns a list of integers
def cleannumbers(column):
  #create a new list
  column_as_ints = []
  #loop through the strings
  for i in column:
    #if it's a string, which they all should be now
    if type(i) == str:
      #remove the commas, otherwise it won't convert to an integer
      newfigure = int(i.replace(',',''))
      #add to the list
      column_as_ints.append(newfigure)
    else:
      #report anything unexpected: its type and whether it is NaN
      print('Unexpected value of type', type(i))
      print(math.isnan(i))
  return column_as_ints

In [None]:
#test the function
backlogdf = backlogdataonly("https://files.digital.nhs.uk/A8/188D99/ERIC-201718-SiteData.csv")
#test the second function
backlogdf['high_risk_backlog_cost'] = cleannumbers(backlogdf['Cost to eradicate high risk backlog (£)'])
#show the results
backlogdf

Unnamed: 0,Trust Code,Trust Name,Commissioning Region,Trust Type,Site Code,Site Name,Site Type,Tenure,Leasehold Type,Cost to eradicate high risk backlog (£),Cost to eradicate significant risk backlog (£),Cost to eradicate moderate risk backlog (£),Cost to eradicate low risk backlog (£),high_risk_backlog_cost
0,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,NORTH OF ENGLAND COMMISSIONING REGION,ACUTE - TEACHING,AGGRE,AGGREGATE SITE,Aggregate Site,,,1858,0,421219,43534,1858
1,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,NORTH OF ENGLAND COMMISSIONING REGION,ACUTE - TEACHING,R0A01,ISLAND SITE,1. General acute hospital,2. Whole site - Private Finance Initiative (PFI),,200,417946,4292709,1769794,200
2,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,NORTH OF ENGLAND COMMISSIONING REGION,ACUTE - TEACHING,R0A07,WYTHENSHAWE HOSPITAL,1. General acute hospital,2. Whole site - Private Finance Initiative (PFI),,232177,14598996,6988717,5370371,232177
3,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,NORTH OF ENGLAND COMMISSIONING REGION,ACUTE - TEACHING,R0A09,TRAFFORD GENERAL HOSPITAL,1. General acute hospital,1. Freehold,,4440391,3561412,695601,6175566,4440391
4,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,NORTH OF ENGLAND COMMISSIONING REGION,ACUTE - TEACHING,R0A47,DERMOTT MURPHY CLOSE - LONG STAY UNIT,8. Other inpatient,1. Freehold,,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1154,TAJ,BLACK COUNTRY PARTNERSHIP NHS FOUNDATION TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,CARE TRUST,TAJ07,EDWARD STREET HOSPITAL,4. Mental Health (including Specialist services),1. Freehold,,130900,38200,64040,219528,130900
1155,TAJ,BLACK COUNTRY PARTNERSHIP NHS FOUNDATION TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,CARE TRUST,TAJ11,HEATH LANE HOSPITAL,6. Mental Health and Learning Disabilities,1. Freehold,,0,4800,0,23875,0
1156,TAJ,BLACK COUNTRY PARTNERSHIP NHS FOUNDATION TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,CARE TRUST,TAJ20,HALLAM STREET HOSPITAL,6. Mental Health and Learning Disabilities,2. Whole site - Private Finance Initiative (PFI),,0,0,0,26262,0
1157,TAJ,BLACK COUNTRY PARTNERSHIP NHS FOUNDATION TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,CARE TRUST,TAJ52,PENN HOSPITAL,4. Mental Health (including Specialist services),1. Freehold,,0,38350,194609,278156,0


### Fixing 21/22

This works on each of the last 5 years' data apart from 21/22. This is because the column heading comes out slightly differently: `'Cost to eradicate high risk backlog (Â£)'` instead of `'Cost to eradicate high risk backlog (£)'`. The stray `Â` is what a UTF-8 encoded `£` looks like when it is read as ISO-8859-1.

The same mismatch puts `ï»¿` (a UTF-8 byte order mark read as ISO-8859-1) at the front of the first column name, so we clean that too below.

In [None]:
backlogdataonly('https://files.digital.nhs.uk/72/97A33E/ERIC%20-%20202122%20-%20Site%20data.csv')

['ï»¿Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Status', 'Site Code', 'Site Name', 'Site Type', 'Tenure']
['ï»¿Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Status', 'Site Code', 'Site Name', 'Site Type', 'Tenure', 'Cost to eradicate high risk backlog (Â£)', 'Cost to eradicate significant risk backlog (Â£)', 'Cost to eradicate moderate risk backlog (Â£)', 'Cost to eradicate low risk backlog (Â£)', 'Percentage of GIA surveyed using risk adjusted backlog guidance (Select)', 'Methodology used to review costs to eradicate backlog (Select)', 'Methodology used to review costs to eradicate backlog - Reason (Notes)', 'Investment to reduce backlog maintenance - Critical Infrastructure Risk (Â£)', 'Investment to reduce backlog maintenance - non Critical Infrastructure Risk (Â£)']


KeyError: ignored
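
The garbled headings can be reproduced without downloading anything. This sketch encodes a header line as UTF-8 with a byte order mark, then decodes it two ways; the header text is shortened for illustration.

```python
# A header line as it would be stored in a UTF-8 file with a byte order mark
header = "\ufeffTrust Code,Cost to eradicate high risk backlog (£)"
raw_bytes = header.encode("utf-8")

# Decoding those bytes as ISO-8859-1 produces the garbled names seen above
read_as_latin1 = raw_bytes.decode("ISO-8859-1")
print(read_as_latin1)

# Decoding as utf-8-sig strips the byte order mark and keeps the pound sign
read_as_utf8 = raw_bytes.decode("utf-8-sig")
print(read_as_utf8)
```

`pd.read_csv` accepts `encoding="utf-8-sig"` as well. Whether that alone is enough for the 21/22 file is an inference from the garbled headings and has not been tested here.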

### Reshape wide to long

Let's try to fix it by reshaping the dataframe from wide to long before we filter. We can then filter on a single `values` column, with a `measure` column storing the original column names, so the filter no longer depends on the exact spelling of any one heading.

In [None]:

sitedata = pd.read_csv('https://files.digital.nhs.uk/72/97A33E/ERIC%20-%20202122%20-%20Site%20data.csv', encoding = "ISO-8859-1")
#store the first 9 column names
keykeys = list(sitedata.keys()[0:9])
print(keykeys)
#loop through the keys and extract the ones with backlog in them
backlog_keys = [key for key in sitedata.keys() if 'backlog' in key.lower()]
#add those keys to the ones we've already stored
bothkeys = keykeys[:9]+backlog_keys
print(bothkeys)
#use those to extract a subset
backlogdf = sitedata[bothkeys]

#reshape from wide to long
longversion = pd.melt(backlogdf, id_vars=list(sitedata.keys()[0:9]),var_name='measure', value_name='values')
print(longversion)
#drop the rows where the value is a float (NaN); .index gives the labels of the matching rows
backlog_filtered = longversion.drop(longversion[[type(i) == float for i in longversion["values"]]].index)
#drop the rows marked 'Not Applicable'
backlog_filtered = backlog_filtered.drop(backlog_filtered[backlog_filtered['values'] == 'Not Applicable'].index)
#remove any repeated header row: in long format its value is the same as its own column name
backlog_filtered = backlog_filtered.drop(backlog_filtered[backlog_filtered['values'] == backlog_filtered['measure']].index)

#rename columns
if 'Trust Code' in backlogdf.keys()[0]:
  replacename = backlogdf.keys()[0]
  backlogdf = backlogdf.rename(columns={replacename: 'Trust Code'})
print(backlogdf.keys())
#print(backlog_filtered)

['ï»¿Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Status', 'Site Code', 'Site Name', 'Site Type', 'Tenure']
['ï»¿Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Status', 'Site Code', 'Site Name', 'Site Type', 'Tenure', 'Cost to eradicate high risk backlog (Â£)', 'Cost to eradicate significant risk backlog (Â£)', 'Cost to eradicate moderate risk backlog (Â£)', 'Cost to eradicate low risk backlog (Â£)', 'Percentage of GIA surveyed using risk adjusted backlog guidance (Select)', 'Methodology used to review costs to eradicate backlog (Select)', 'Methodology used to review costs to eradicate backlog - Reason (Notes)', 'Investment to reduce backlog maintenance - Critical Infrastructure Risk (Â£)', 'Investment to reduce backlog maintenance - non Critical Infrastructure Risk (Â£)']
      ï»¿Trust Code                                     Trust Name  \
0               R0A     MANCHESTER UNIVERSITY NHS FOUNDATION TRUST   
1               R0A     MANCHESTER U
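
To see what the reshape does, here is `pd.melt` applied to a tiny made-up dataframe. The site codes and the shortened column names are placeholders, not values from the ERIC file.

```python
import pandas as pd

# Wide format: one row per site, one column per measure
toy = pd.DataFrame({
    "Site Code": ["AAA01", "AAA02"],
    "High risk (£)": ["0", "Not Applicable"],
    "Low risk (£)": ["1,000", "2,500"],
})

# Long format: one row per site per measure
long_toy = pd.melt(toy, id_vars=["Site Code"], var_name="measure", value_name="values")
print(long_toy)

# One filter on the single 'values' column now covers every measure
print(long_toy[long_toy["values"] != "Not Applicable"])
```

Because `values` is also the name of a dataframe attribute, the column has to be accessed with brackets (`long_toy["values"]`) rather than as `long_toy.values`.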

### Update the function

In [None]:
#define a function, it takes one argument - the url of the CSV
def backlogdataonly(csvurl):
  #read in the CSV
  sitedata = pd.read_csv(csvurl, encoding = "ISO-8859-1")
  #store the first 9 column names
  keykeys = list(sitedata.keys()[0:9])
  print(keykeys)
  #loop through the keys and extract the ones with backlog in them
  backlog_keys = [key for key in sitedata.keys() if 'backlog' in key.lower()]
  #add those keys to the ones we've already stored
  bothkeys = keykeys[:9]+backlog_keys
  print(bothkeys)
  #use those to extract a subset
  backlogdf = sitedata[bothkeys]
  #reshape from wide to long
  longversion = pd.melt(backlogdf, id_vars=list(sitedata.keys()[0:9]),var_name='measure', value_name='values')
  print(longversion)
  #drop the rows where the value is a float (NaN); .index gives the labels of the matching rows
  backlog_filtered = longversion.drop(longversion[[type(i) == float for i in longversion["values"]]].index)
  #drop the rows marked 'Not Applicable'
  backlog_filtered = backlog_filtered.drop(backlog_filtered[backlog_filtered['values'] == 'Not Applicable'].index)
  #remove any repeated header row: in long format its value is the same as its own column name
  backlog_filtered = backlog_filtered.drop(backlog_filtered[backlog_filtered['values'] == backlog_filtered['measure']].index)
  #return the resulting dataframe to whatever called the function
  return backlog_filtered

## To be continued in another notebook...

This notebook documents the exploration of the data. With that encoded in the functions above, we start a [new notebook to apply those to multiple spreadsheets](https://colab.research.google.com/drive/1qsGhqDeIMsdeI0ydyWw2pimSOZ9PDBM1?usp=sharing).
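
As a rough idea of what applying this to several years might look like, the sketch below reads a few CSVs in a loop, labels each with its year, and stacks them. It is not taken from the linked notebook: the in-memory CSV text, the `csv_by_year` dictionary and the `year` column are all hypothetical stand-ins for the real files.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the yearly CSV files (made-up values)
csv_by_year = {
    "2016/17": 'Site Code,Cost (£)\nAAA01,"1,000"\n',
    "2017/18": 'Site Code,Cost (£)\nAAA01,"2,500"\n',
}

frames = []
for year, csv_text in csv_by_year.items():
    # With real data this would be a call such as backlogdataonly(url)
    frame = pd.read_csv(io.StringIO(csv_text))
    # Record which year each row came from
    frame["year"] = year
    frames.append(frame)

# Stack the yearly dataframes into one
combined = pd.concat(frames, ignore_index=True)
print(combined)
```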