# Analysing data on hospital buildings (ERIC) - incidents

The Estates Returns Information Collection (ERIC) - data on NHS buildings including hospitals - is [published every October](https://digital.nhs.uk/data-and-information/publications/statistical/estates-returns-information-collection).

This notebook details the code needed to:

* Fetch the data from the most recent years of spreadsheets (currently 2)
* Drill down to the columns on incidents
* Combine the data

The [page publishing the data](https://digital.nhs.uk/data-and-information/publications/statistical/estates-returns-information-collection/estates-return-information-collection-2016-17) notes:

> Note: in 2019 we were advised of an error in Devonshire Partnership NHS Trust's submitted Oil Consumption figures. The correct figure for the aggregate site consumption is 40,798.8 kWh, rather than the reported 3,855 kWh.

> Note: 7th September 2021: When the revalidated data was released, only the revised headline figures, report (containing trust, site and PFI level data) and data quality statement were made available (figures in the underlying data .csv files were not updated, although revised trust, site and PFI revised figures were available in the data tables). We apologise for any confusion caused and have now published revised .csv files to accompany the release products. These are clearly labelled below.

For this analysis we are using the data marked "revalidation".


## Import the libraries

First we need to import the libraries needed.

In [None]:
#import pandas for dealing with data
import pandas as pd
#we will need the math library too for detecting nan values
import math
#requests for fetching URLs
import requests
#beautiful soup for drilling into them
from bs4 import BeautifulSoup

In [None]:
#import re for regex
import re
#import a library for downloading files
from google.colab import files

## Create functions to filter...

In a [previous notebook](https://colab.research.google.com/drive/1B7hT6PDdO-XZigGiI3n_iKNRRRN293cE?usp=sharing) we explored the data and codified that in some functions. Let's recreate those here.

The first one takes a large ERIC spreadsheet and filters it to just the key columns and those on incidents. It also filters out non-numbers.

In [None]:
#define a function, it takes one argument - the url of the CSV
def backlogdataonly(csvurl):
  #read in the CSV
  sitedata = pd.read_csv(csvurl, encoding = "ISO-8859-1")
  #store the first 10 column names
  keykeys = list(sitedata.keys()[0:10])
  print(keykeys)
  #loop through the keys and extract the ones with 'incident' in them
  backlog_keys = [key for key in sitedata.keys() if 'incident' in key.lower()]
  #add those keys to the ones we've already stored
  bothkeys = keykeys[:10]+backlog_keys
  print(bothkeys)
  #use those to extract a subset
  backlogdf = sitedata[bothkeys]
  #reshape from wide to long
  longversion = pd.melt(backlogdf, id_vars=list(sitedata.keys()[0:10]),var_name='measure', value_name='values')
  #print(longversion)
  #drop the rows where the value is a float - i.e. the empty (NaN) cells
  backlog_filtered = longversion.drop(longversion[[type(i) == float for i in longversion["values"]]].index)
  #the list of T/F selects the matching rows, and .index gives their row labels for .drop()
  #drop the rows marked 'Not Applicable' in the same way
  backlog_filtered = backlog_filtered.drop(backlog_filtered[backlog_filtered['values'] == 'Not Applicable'].index)
  #remove the extra row of headers too
  backlog_filtered = backlog_filtered.drop(backlog_filtered[backlog_filtered['values'] == 'Cost to eradicate high risk backlog (£)'].index)
  #rename columns where name has extra chars
  if 'Trust Code' in backlog_filtered.keys()[0]:
    print('HEY', backlog_filtered.keys()[0])
    replacename = backlog_filtered.keys()[0]
    print('HEYHEY', replacename)
    backlog_filtered = backlog_filtered.rename(columns={replacename: 'Trust Code'})
  if 'New Commissioning Region' in backlog_filtered.keys()[3]:
    print('HEY', backlog_filtered.keys()[3])
    replacename = backlog_filtered.keys()[3]
    print('HEYHEY', replacename)
    backlog_filtered = backlog_filtered.rename(columns={replacename: 'Commissioning Region'})
  #print(backlogdf.keys())
  #return the resulting dataframe to whatever called the function
  return(backlog_filtered)
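
To make the wide-to-long step in `backlogdataonly()` concrete, here is a minimal sketch on a made-up three-row table. The column names and values are invented for illustration and are not from ERIC, and it uses `dropna()` rather than the type check used in the function above.

```python
import pandas as pd

# toy stand-in for an ERIC site file (hypothetical values)
toy = pd.DataFrame({
    "Trust Code": ["AAA", "AAA", "BBB"],
    "Site Code": ["AAA01", "AAA02", "BBB01"],
    "RIDDOR incidents (No)": ["3", "Not Applicable", None],
    "Clinical service incidents (No)": ["1", "0", "2"],
})

# keep the identifier columns plus anything mentioning 'incident'
id_cols = ["Trust Code", "Site Code"]
incident_cols = [c for c in toy.columns if "incident" in c.lower()]

# reshape from wide to long: one row per site per measure
toy_long = pd.melt(toy[id_cols + incident_cols], id_vars=id_cols,
                   var_name="measure", value_name="values")

# drop missing values and the 'Not Applicable' placeholder
toy_long = toy_long.dropna(subset=["values"])
toy_long = toy_long[toy_long["values"] != "Not Applicable"]

print(toy_long)
```

Six melted rows go in and four come out: the missing value and the 'Not Applicable' row are removed.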

### ...And clean strings to numbers (FUNCTION)

The second function converts the values of a column from strings to numbers.

In [None]:
def cleannumbers(column):
  #create a new list
  column_as_ints = []
  #loop through the strings
  for i in column:
    #print(i)
    #if it's a string, which they all should be now
    if type(i) == str:
      #replace the comma, otherwise it won't convert to an integer
      newfigure = int(i.replace(',',''))
      #add to the list
      column_as_ints.append(int(newfigure))
    #or if it's a number
    elif type(i) == int:
      #add to the list
      column_as_ints.append(i)
    # or if it's anything else (e.g. a float or a NaN)
    else:
      if(math.isnan(i)):
        print('HUH', type(i))
      #add the value to the list unchanged
      column_as_ints.append(i)
  #return the now clean list
  return(column_as_ints)
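
As a sketch of what `cleannumbers()` is aiming for, the hypothetical helper below (`to_int_or_none` is not part of the original notebook) converts one value at a time and returns `None` for anything it cannot parse, so that missing values are explicit rather than passed through.

```python
import math

def to_int_or_none(value):
    """Convert '1,234' / 1234 / 1234.0 to an int; return None for NaN or non-numeric text."""
    # strings: strip thousands separators and whitespace before converting
    if isinstance(value, str):
        cleaned = value.replace(",", "").strip()
        try:
            return int(cleaned)
        except ValueError:
            return None
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    # floats: NaN becomes None, other floats are truncated to ints
    if isinstance(value, float):
        return None if math.isnan(value) else int(value)
    return None

cleaned_values = [to_int_or_none(v) for v in ["1,234", "56", 7, float("nan"), "Not Applicable"]]
print(cleaned_values)
```

Returning `None` is a design choice for this sketch; keeping `NaN` instead would suit a column that is going back into pandas.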

## Apply the functions to 2 years of data (SITE LEVEL)

This data first appears at site level in 21/22 (before then it was published at trust level), so working at site level future-proofs the analysis.

Now let's store the URLs of each dataset.

In [None]:
#store the ERIC homepage
ericurl = "https://digital.nhs.uk/data-and-information/publications/statistical/estates-returns-information-collection"
#store the base URL which we will need for relative URLs
baseurl = "https://digital.nhs.uk"
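
The base URL is stored because some links on the ERIC pages are relative paths, while the CSV download links are absolute (as the output further down shows). Concatenating `baseurl` with an absolute link would produce a broken URL, so one option, sketched here with the standard library's `urljoin`, is to let the URL parser decide. The relative path below is an assumption inferred from the year page URL printed later in this notebook.

```python
from urllib.parse import urljoin

baseurl = "https://digital.nhs.uk"

# a relative link, assumed to be the form used inside the <h3> tags on the ERIC homepage
relative_href = "/data-and-information/publications/statistical/estates-returns-information-collection/england-2022-23"
# an absolute link, as returned for the CSV downloads
absolute_href = "https://files.digital.nhs.uk/41/5787C9/ERIC%20-%202022_23%20-%20Site%20data.csv"

# urljoin prepends the base only when the href is relative
year_page = urljoin(baseurl, relative_href)
csv_link = urljoin(baseurl, absolute_href)

print(year_page)
print(csv_link)
```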

### Create a function to scrape the data CSV

This function will find the 'site data' CSV link on each page.

In [None]:
#define a function that takes a URL and returns the site data CSV link on that page
def fetchcsv_for_sites(url):
  # Send a GET request to the link URL
  link_response = requests.get(url)
  #parse into soup
  soup = BeautifulSoup(link_response.content, 'html.parser')
  # Find all links
  divboxlink = soup.find_all('a')
  #create an empty list
  matches = []
  #loop through each one
  for i in divboxlink:
    #get the link's href (None if the <a> tag has no href attribute)
    href = i.get('href')
    #look for the one about Site data
    if href and "Site" in href:
      #store that URL
      matches.append(href)
  #if the list has something in it
  if len(matches) >0:
    #return that URL
    return(matches[0])
  #otherwise
  else:
    #return a string we can pick up the other side
    return('NO LINK')

In [None]:
#create an empty list to store the URLs
csvurls = []

#some of this code generated by ChatGPT in response to the prompt:
#"write some python code which identifies the first link inside a <h3> tag at
#https://digital.nhs.uk/data-and-information/publications/statistical/estates-returns-information-collection and fetches that"
#fetch that page
response = requests.get(ericurl)
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
# Find all the <h3> tags
h3_tag = soup.find_all('h3')
#loop through the first 2 <h3> tags (the 2 most recent years)
for i in h3_tag[:2]:
  #find the first <a> and get the href= attribute
  yearpageurl = baseurl+i.find('a').get('href')
  print(yearpageurl)
  #run the function defined above to fetch the CSV link from that page
  sitedatacsvurl = fetchcsv_for_sites(yearpageurl)
  print(sitedatacsvurl)
  #add it to the list unless it's a 'NO LINK'
  if sitedatacsvurl != 'NO LINK':
    csvurls.append(sitedatacsvurl)


print(csvurls)

https://digital.nhs.uk/data-and-information/publications/statistical/estates-returns-information-collection/england-2022-23
https://files.digital.nhs.uk/41/5787C9/ERIC%20-%202022_23%20-%20Site%20data.csv
https://digital.nhs.uk/data-and-information/publications/statistical/estates-returns-information-collection/england-2021-22
https://files.digital.nhs.uk/EE/7E330D/ERIC%20-%20202122%20-%20Site%20Data%20v3.csv
['https://files.digital.nhs.uk/41/5787C9/ERIC%20-%202022_23%20-%20Site%20data.csv', 'https://files.digital.nhs.uk/EE/7E330D/ERIC%20-%20202122%20-%20Site%20Data%20v3.csv']


### Loop through the CSV urls

For now we only have one URL in our list, but once the new data is published there will be more. [UPDATE: there are now two]

In [None]:
csvurls

['https://files.digital.nhs.uk/41/5787C9/ERIC%20-%202022_23%20-%20Site%20data.csv',
 'https://files.digital.nhs.uk/EE/7E330D/ERIC%20-%20202122%20-%20Site%20Data%20v3.csv']

In [None]:
#create an empty dataframe
last5yrs = pd.DataFrame()

#loop through the URLs
for i in csvurls:
  print(i)
  thisyrdf = backlogdataonly(i)
  thisyrdf['year_range'] = i.split('ERIC')[1].split('-')[1].split('-')[0].replace('%20','').replace('_','')
  #DataFrame.append() was removed in pandas 2.0, so use pd.concat() instead
  last5yrs = pd.concat([last5yrs, thisyrdf], ignore_index = True)



https://files.digital.nhs.uk/41/5787C9/ERIC%20-%202022_23%20-%20Site%20data.csv
['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Site Code', 'Site Name', 'Post Code', 'Integrated Care Board', 'Local Authority', 'Site Type']
['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Site Code', 'Site Name', 'Post Code', 'Integrated Care Board', 'Local Authority', 'Site Type', 'Clinical service incidents caused by estates and infrastructure failure related to Critical Infrastructure Risk (No)', 'Clinical service incidents caused by estates and infrastructure failure related to non-Critical Infrastructure Risk (No)', 'Clinical service incidents caused by estates and infrastructure failure - other (No)', 'Estates and facilities related incidents related to Critical Infrastructure Risk (No.)', 'Estates and facilities related incidents related to Non-Critical Infrastructure Risk (No)', 'Estates and facilities incidents related - other (No)', 'Estates and facilitie



['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Status', 'Site Code', 'Site Name', 'Post Code', 'Site Type', 'Tenure']
['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Status', 'Site Code', 'Site Name', 'Post Code', 'Site Type', 'Tenure', 'Number of estates and facilities related incidents related to Critical Infrastructure Risk (No.)', 'Number of estates and facilities related incidents related to Non-Critical Infrastructure Risk (No)', 'Clinical service incidents caused by estates and infrastructure failure (No)', 'Estates and facilities RIDDOR incidents (No)']
HEY Trust Code
HEYHEY Trust Code
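
The chain of `.split()` and `.replace()` calls used to build `year_range` depends on the exact file-naming pattern. A slightly more explicit alternative, sketched below, decodes the URL and pulls out the year with a regular expression. The helper name `year_range_from_url` is hypothetical, and the pattern has only been checked against the two URLs shown above.

```python
import re
from urllib.parse import unquote

def year_range_from_url(csvurl):
    """Return a string such as '202223' from an ERIC CSV URL, or None if no year is found."""
    # turn '%20' back into spaces and keep only the filename
    filename = unquote(csvurl).rsplit("/", 1)[-1]
    # match '2022_23' or '202122': a four-digit year, optional underscore, two digits
    match = re.search(r"(20\d{2})_?(\d{2})", filename)
    if match is None:
        return None
    return match.group(1) + match.group(2)

urls = [
    "https://files.digital.nhs.uk/41/5787C9/ERIC%20-%202022_23%20-%20Site%20data.csv",
    "https://files.digital.nhs.uk/EE/7E330D/ERIC%20-%20202122%20-%20Site%20Data%20v3.csv",
]
year_ranges = [year_range_from_url(u) for u in urls]
print(year_ranges)
```

Returning `None` for an unrecognised filename makes a change in naming convention easy to spot, rather than producing a silently wrong label.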




### Check the measure text

We need to check if the measures are used consistently, without variation.

We can use the `.unique()` method to check all the unique values in that column.

In [None]:
last5yrs['year_range'].unique()

array(['202223', '202122'], dtype=object)

In [None]:
last5yrs['measure'].unique()

array(['Clinical service incidents caused by estates and infrastructure failure related to Critical Infrastructure Risk (No)',
       'Clinical service incidents caused by estates and infrastructure failure related to non-Critical Infrastructure Risk (No)',
       'Clinical service incidents caused by estates and infrastructure failure - other (No)',
       'Estates and facilities related incidents related to Critical Infrastructure Risk (No.)',
       'Estates and facilities related incidents related to Non-Critical Infrastructure Risk (No)',
       'Estates and facilities incidents related - other (No)',
       'Estates and facilities RIDDOR incidents (No)',
       'Most clinically impactful - Incident type (Select)',
       'Most clinically impactful - Number of recurring incidents (No.)',
       'Second most clinically impactful - Incident type (Select)',
       'Second most clinically impactful - Number of recurring incidents (No.)',
       'Third most clinically impactful - Incid
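
`.unique()` lists the labels but not which year each one comes from. One way to see that, sketched here on a few invented rows rather than the real data, is a cross-tabulation of `measure` against `year_range`: a zero shows that a label is missing in that year.

```python
import pandas as pd

# invented rows that mimic the shape of last5yrs
toy = pd.DataFrame({
    "measure": [
        "Number of estates and facilities related incidents related to Critical Infrastructure Risk (No.)",
        "Estates and facilities related incidents related to Critical Infrastructure Risk (No.)",
        "Estates and facilities RIDDOR incidents (No)",
        "Estates and facilities RIDDOR incidents (No)",
    ],
    "year_range": ["202122", "202223", "202122", "202223"],
})

# count rows per measure per year; 0 means the label does not appear that year
label_check = pd.crosstab(toy["measure"], toy["year_range"])
print(label_check)

# labels that appear in only one of the years
inconsistent = label_check[(label_check == 0).any(axis=1)].index.tolist()
print(inconsistent)
```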

## Make a copy of the 'most impactful' data

The 22/23 data includes some interesting new measures on the three most clinically impactful categories of incident at each site. We will need to exclude some of them in a moment because they contain categorical data, which trips up the code below.

But before we do that we need to make a copy of the data relating to that.

In [None]:
#how many rows do we start with?
len(last5yrs)

31749

In [None]:
#copy the rows for the 'most impactful' measures into a new dataframe
last5yrs_impactful = last5yrs[last5yrs['measure'].str.contains('impactful')].copy()
len(last5yrs_impactful)

1770

## Clean and filter (remove the new text measures)

With that done, we can now remove the data we've copied across by adding a `~` to the code used above (which makes it a negative filter). We are going to just filter out the category ('Incident type') columns and keep the numeric ones.

In [None]:
#how many do we get if we try to filter out the rows we don't want
last5yrs = last5yrs[~last5yrs['measure'].str.contains('Incident type')]
len(last5yrs)

30864
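
The effect of `~` is easiest to see on a tiny example. The sketch below uses three measure names from the 22/23 data with made-up values: `str.contains()` returns a True/False Series and `~` flips it.

```python
import pandas as pd

toy = pd.DataFrame({
    "measure": [
        "Most clinically impactful - Incident type (Select)",
        "Most clinically impactful - Number of recurring incidents (No.)",
        "Estates and facilities RIDDOR incidents (No)",
    ],
    "values": ["Roofs", "3", "1"],  # invented values
})

# True where the measure mentions 'Incident type'
is_category = toy["measure"].str.contains("Incident type")

# the positive filter keeps the category rows, the negative filter keeps everything else
category_rows = toy[is_category]
numeric_rows = toy[~is_category]

print(len(category_rows), len(numeric_rows))
```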

In [None]:
#check the measures left
last5yrs['measure'].unique()

array(['Clinical service incidents caused by estates and infrastructure failure related to Critical Infrastructure Risk (No)',
       'Clinical service incidents caused by estates and infrastructure failure related to non-Critical Infrastructure Risk (No)',
       'Clinical service incidents caused by estates and infrastructure failure - other (No)',
       'Estates and facilities related incidents related to Critical Infrastructure Risk (No.)',
       'Estates and facilities related incidents related to Non-Critical Infrastructure Risk (No)',
       'Estates and facilities incidents related - other (No)',
       'Estates and facilities RIDDOR incidents (No)',
       'Most clinically impactful - Number of recurring incidents (No.)',
       'Second most clinically impactful - Number of recurring incidents (No.)',
       'Third most clinically impactful - Number of recurring incidents (No.)',
       'Number of estates and facilities related incidents related to Critical Infrastructure Ri

### Clean the measures to be consistent

We have a change in measures too. In 21/22 there was this measure:

* Clinical service incidents caused by estates and infrastructure failure (No)

In 22/23 this is split into three:

* Clinical service incidents caused by estates and infrastructure failure related to Critical Infrastructure Risk (No)
* Clinical service incidents caused by estates and infrastructure failure related to non-Critical Infrastructure Risk (No)
* Clinical service incidents caused by estates and infrastructure failure - other (No)

Then there is a change in label:

* Number of estates and facilities related incidents related to Critical Infrastructure Risk (No.)
* Number of estates and facilities related incidents related to Non-Critical Infrastructure Risk (No)

Become:

* Estates and facilities related incidents related to Critical Infrastructure Risk (No.)
* Estates and facilities related incidents related to Non-Critical Infrastructure Risk (No)

And a new third column:

* Estates and facilities incidents related - other (No)

Let's do the simple cleaning first.



In [None]:
#Remove 'Number of '
last5yrs['measure'] = last5yrs['measure'].str.replace('Number of estates and facilities','Estates and facilities')

In [None]:
last5yrs['measure'].unique()

array(['Clinical service incidents caused by estates and infrastructure failure related to Critical Infrastructure Risk (No)',
       'Clinical service incidents caused by estates and infrastructure failure related to non-Critical Infrastructure Risk (No)',
       'Clinical service incidents caused by estates and infrastructure failure - other (No)',
       'Estates and facilities related incidents related to Critical Infrastructure Risk (No.)',
       'Estates and facilities related incidents related to Non-Critical Infrastructure Risk (No)',
       'Estates and facilities incidents related - other (No)',
       'Estates and facilities RIDDOR incidents (No)',
       'Most clinically impactful - Number of recurring incidents (No.)',
       'Second most clinically impactful - Number of recurring incidents (No.)',
       'Third most clinically impactful - Number of recurring incidents (No.)',
       'Clinical service incidents caused by estates and infrastructure failure (No)'],
      dt

### Clean the values from strings to numbers

We created a function to clean the values column, but haven't yet used it. In fact we don't need to because the numbers are too small to have commas, etc.

We can instead just convert them to integers this way.

In [None]:
#create a new column by converting each value in the 'values' column to an integer
last5yrs['valuesclean'] = [int(i) for i in last5yrs['values']]

### Test a pivot table

We can test it by generating a pivot table of incidents related to Critical Infrastructure Risk by commissioning region, and check if the values match what we get by doing the same in Excel (they do).

In [None]:
#check the keys we can use
last5yrs.keys()

Index(['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type',
       'Site Code', 'Site Name', 'Post Code', 'Integrated Care Board',
       'Local Authority', 'Site Type', 'measure', 'values', 'year_range',
       'Status', 'Tenure', 'valuesclean'],
      dtype='object')

In [None]:
last5yrs[(last5yrs.measure == 'Estates and facilities related incidents related to Critical Infrastructure Risk (No.)')].pivot_table(
    index="Commissioning Region",
    values="valuesclean",
    aggfunc="sum",
    columns = "year_range")


| Commissioning Region | 202122 | 202223 |
| --- | --- | --- |
| EAST OF ENGLAND COMMISSIONING REGION | 406 | 549 |
| LONDON COMMISSIONING REGION | 826 | 334 |
| MIDLANDS COMMISSIONING REGION | 165 | 41 |
| NORTH EAST AND YORKSHIRE COMMISSIONING REGION | 160 | 95 |
| NORTH WEST COMMISSIONING REGION | 124 | 232 |
| SOUTH EAST COMMISSIONING REGION | 56 | 264 |
| SOUTH WEST COMMISSIONING REGION | 79 | 100 |


### Add up the clinical services incidents

Resolving the division of the clinical services incidents into three separate categories is trickier. We need to add them up to create a total comparable to the previous measure. This might be better done at the pivot table stage (and similarly with the extra estates incidents category).

In [None]:
#list the measures so we can copy the ones we want into the code below
[i for i in last5yrs['measure'].unique()]

['Clinical service incidents caused by estates and infrastructure failure related to Critical Infrastructure Risk (No)',
 'Clinical service incidents caused by estates and infrastructure failure related to non-Critical Infrastructure Risk (No)',
 'Clinical service incidents caused by estates and infrastructure failure - other (No)',
 'Estates and facilities related incidents related to Critical Infrastructure Risk (No.)',
 'Estates and facilities related incidents related to Non-Critical Infrastructure Risk (No)',
 'Estates and facilities incidents related - other (No)',
 'Estates and facilities RIDDOR incidents (No)',
 'Most clinically impactful - Number of recurring incidents (No.)',
 'Second most clinically impactful - Number of recurring incidents (No.)',
 'Third most clinically impactful - Number of recurring incidents (No.)',
 'Clinical service incidents caused by estates and infrastructure failure (No)']

In [None]:
#store the measures
measures_to_include = ['Clinical service incidents caused by estates and infrastructure failure related to Critical Infrastructure Risk (No)',
                       'Clinical service incidents caused by estates and infrastructure failure related to non-Critical Infrastructure Risk (No)',
                       'Clinical service incidents caused by estates and infrastructure failure - other (No)',
                       'Clinical service incidents caused by estates and infrastructure failure (No)']

#pivot on those
last5yrs[last5yrs['measure'].isin(measures_to_include)].pivot_table(
    index="Commissioning Region",
    values="valuesclean",
    aggfunc="sum",
    columns="year_range"
)


| Commissioning Region | 202122 | 202223 |
| --- | --- | --- |
| EAST OF ENGLAND COMMISSIONING REGION | 1264 | 779 |
| LONDON COMMISSIONING REGION | 540 | 1248 |
| MIDLANDS COMMISSIONING REGION | 1022 | 466 |
| NORTH EAST AND YORKSHIRE COMMISSIONING REGION | 622 | 772 |
| NORTH WEST COMMISSIONING REGION | 1126 | 502 |
| SOUTH EAST COMMISSIONING REGION | 461 | 695 |
| SOUTH WEST COMMISSIONING REGION | 313 | 205 |


Because all the measures start with 'Clinical' another way of achieving the same results would be this:

In [None]:
last5yrs[last5yrs['measure'].str.contains('Clinical')].pivot_table(
    index="Commissioning Region",
    values="valuesclean",
    aggfunc="sum",
    columns = "year_range")

| Commissioning Region | 202122 | 202223 |
| --- | --- | --- |
| EAST OF ENGLAND COMMISSIONING REGION | 1264 | 779 |
| LONDON COMMISSIONING REGION | 540 | 1248 |
| MIDLANDS COMMISSIONING REGION | 1022 | 466 |
| NORTH EAST AND YORKSHIRE COMMISSIONING REGION | 622 | 772 |
| NORTH WEST COMMISSIONING REGION | 1126 | 502 |
| SOUTH EAST COMMISSIONING REGION | 461 | 695 |
| SOUTH WEST COMMISSIONING REGION | 313 | 205 |
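
Relying on which rows happen to match a filter leaves the comparison implicit. An alternative, sketched below on invented numbers, is to add a harmonised label column so that the three 22/23 categories and the single 21/22 measure share one name before pivoting. The column name `measure_group`, the shortened region name and the figures are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Commissioning Region": ["LONDON"] * 4,
    "year_range": ["202122", "202223", "202223", "202223"],
    "measure": [
        "Clinical service incidents caused by estates and infrastructure failure (No)",
        "Clinical service incidents caused by estates and infrastructure failure related to Critical Infrastructure Risk (No)",
        "Clinical service incidents caused by estates and infrastructure failure related to non-Critical Infrastructure Risk (No)",
        "Clinical service incidents caused by estates and infrastructure failure - other (No)",
    ],
    "valuesclean": [10, 4, 5, 6],  # invented figures
})

# every measure starting with this prefix is treated as the same underlying measure
prefix = "Clinical service incidents caused by estates and infrastructure failure"
is_clinical = toy["measure"].str.startswith(prefix)

# .where() keeps the original label where the condition is True, otherwise uses the new one
toy["measure_group"] = toy["measure"].where(~is_clinical, prefix + " (all)")

# pivot on the harmonised label: the three 22/23 rows are summed into one total
totals = toy.pivot_table(index=["Commissioning Region", "measure_group"],
                         columns="year_range", values="valuesclean", aggfunc="sum")
print(totals)
```

Keeping the original `measure` column alongside `measure_group` means the breakdown into three categories is still available for the 22/23 data.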


### Clean the site types

Until 2021/22 the 'Site type' column included ordinal prefixes, e.g. `1. General acute hospital`. But the most recent data is not numbered.

We can clean the data so that it's consistent and doesn't need cleaning later.

In [None]:
#pivot on the Site Type field
last5yrs[(last5yrs.measure == 'Estates and facilities related incidents related to Critical Infrastructure Risk (No.)')].pivot_table(
    index="Site Type",
    values="valuesclean",
    aggfunc="sum",
    columns = "year_range")


| Site Type | 202122 | 202223 |
| --- | --- | --- |
| Community hospital (with inpatient beds) | 28.0 | 60.0 |
| General acute hospital | 1580.0 | 1083.0 |
| Learning Disabilities | 0.0 | 6.0 |
| Mental Health (including Specialist services) | 53.0 | 365.0 |
| Mental Health and Learning Disabilities | 0.0 | 0.0 |
| Mixed service hospital | 11.0 | 17.0 |
| Non inpatient | 43.0 | 37.0 |
| Other Reportable Site | 12.0 | 10.0 |
| Other inpatient | 21.0 | 19.0 |
| Specialist hospital (acute only) | 65.0 | 18.0 |


In [None]:
#testing line
#re.sub(r'[0-9]\. ','','1. Learning Disabilities')
#use a raw string (r'...') so that the backslash is passed to the regex unchanged
sitetypeclean = [re.sub(r'[0-9]\. ','',i) for i in last5yrs['Site Type']]
sitetypeclean[:10]

['Non inpatient',
 'Non inpatient',
 'Other inpatient',
 'Non inpatient',
 'General acute hospital',
 'Non inpatient',
 'General acute hospital',
 'Non inpatient',
 'General acute hospital',
 'Mixed service hospital']
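
The list comprehension above assumes every `Site Type` is a string; `re.sub` raises a `TypeError` on a missing value. A defensive variant is sketched below. The helper name `strip_ordinal` is hypothetical, and the pattern is anchored to the start of the string so that digits elsewhere in a label are left alone.

```python
import re

def strip_ordinal(site_type):
    """Remove a leading '1. ' style prefix; pass non-strings (e.g. NaN or None) through unchanged."""
    if not isinstance(site_type, str):
        return site_type
    # ^ anchors the match to the start; \d+ also copes with two-digit prefixes
    return re.sub(r"^\d+\.\s*", "", site_type)

examples = ["1. General acute hospital", "General acute hospital", "10. Non inpatient", None]
stripped = [strip_ordinal(s) for s in examples]
print(stripped)
```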

### Add back into dataframe and pivot



In [None]:
#create a new column from that list
last5yrs['sitetypeclean'] = sitetypeclean

In [None]:
#pivot on that field
last5yrs[(last5yrs.measure == 'Estates and facilities related incidents related to Critical Infrastructure Risk (No.)')].pivot_table(
    index="sitetypeclean",
    values="valuesclean",
    aggfunc="sum",
    columns = "year_range")


| sitetypeclean | 202122 | 202223 |
| --- | --- | --- |
| Community hospital (with inpatient beds) | 28.0 | 60.0 |
| General acute hospital | 1580.0 | 1083.0 |
| Learning Disabilities | 0.0 | 6.0 |
| Mental Health (including Specialist services) | 53.0 | 365.0 |
| Mental Health and Learning Disabilities | 0.0 | 0.0 |
| Mixed service hospital | 11.0 | 17.0 |
| Non inpatient | 43.0 | 37.0 |
| Other Reportable Site | 12.0 | 10.0 |
| Other inpatient | 21.0 | 19.0 |
| Specialist hospital (acute only) | 65.0 | 18.0 |


### Add a 'year ending' column

For the visualisation we need a year ending column rather than '202122' so let's create that too.

In [None]:
#grab the last two digits of every string in year_range
#add '20' to the front of those, and store in a list
yearending = ['20'+i[-2:] for i in last5yrs['year_range']]

#add to dataframe
last5yrs['yearending'] = yearending
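
Prefixing '20' works for any year range in this century. If it is useful to have the value as an integer (for sorting, or for plotting on a numeric axis), a small helper such as the hypothetical `year_ending()` below does a similar job and checks the input format first. It assumes the range does not cross a century.

```python
def year_ending(year_range):
    """Convert a six-character range such as '202122' to the integer year it ends in (2022)."""
    if len(year_range) != 6 or not year_range.isdigit():
        raise ValueError(f"unexpected year_range: {year_range!r}")
    # the first two characters are the century, the last two are the final year
    return int(year_range[:2] + year_range[-2:])

year_endings = [year_ending(y) for y in ["202122", "202223"]]
print(year_endings)
```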

## Export the site level data

In [None]:
#create a CSV from the dataframe
last5yrs.to_csv('last2yrs_incidents.csv')
#import a library for downloading files
from google.colab import files
#download the file
files.download('last2yrs_incidents.csv')

## Add in codes for trusts that have changed

Many trusts only have figures for some years, either because they are new, or because they are older trusts that ceased to exist (changing name, or becoming part of a new or existing trust).

Some trusts have figures for all the years, but during that time acquired other trusts whose historical costs need to be factored in.

For example: in 2023 SOUTHPORT AND ORMSKIRK HOSPITAL NHS TRUST is going to be succeeded by the code RBN. RBN is the code for ST HELENS AND KNOWSLEY TEACHING HOSPITALS NHS TRUST, which has had quite low figures for the last 4 years (dropping from £957,207 to £129,277). Now, next year it's going to include the figures for Southport - the most recent of which was £54m.
So St Helens's figures will jump from £129k to £54m - *unless* we add in the historical figures of Southport to more accurately reflect the fact that costs are reported under just one code which previously were reported under two.

NHS Digital [publishes data on 'Successor Organisations'](https://digital.nhs.uk/services/organisation-data-service/export-data-files/csv-downloads/miscellaneous).

We've downloaded and unzipped the succarc.csv (Archived Successor Organisations) and succ.csv (Successor Organisations) files from that page and published as a Google Sheet, adding headings so that we can import and merge with our data here.

We need to import that and merge it with our existing data so we can pivot on the most recent codes.

In [None]:
#store the URL
succarcurl = "https://docs.google.com/spreadsheets/d/e/2PACX-1vRgtsnTAuRKfLgjwMzz_DdFDcxBF8t3KqX30HX9vKJZDeaMq2p-AZD-1dBAboOQwUHYwNaWkUwY56SG/pub?gid=409838561&single=true&output=csv"
#import
succarcdf = pd.read_csv(succarcurl)


In [None]:
succarcdf.keys()

Index(['Trust Code', 'Most Recent Trust Code', 'Most Recent Trust Name',
       'Notes on trust data', 'Source', 'Count'],
      dtype='object')

In [None]:
last5yrs.keys()

Index(['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type',
       'Site Code', 'Site Name', 'Post Code', 'Integrated Care Board',
       'Local Authority', 'Site Type', 'measure', 'values', 'year_range',
       'Status', 'Tenure', 'valuesclean', 'sitetypeclean', 'yearending'],
      dtype='object')

In [None]:
#merge the two dataframes on the 'Trust Code' column and store the result back in last5yrs
last5yrs = pd.merge(left = last5yrs,
                              right = succarcdf,
                              on = 'Trust Code')

last5yrs.head()

Unnamed: 0,Trust Code,Trust Name,Commissioning Region,Trust Type,Site Code,Site Name,Post Code,Integrated Care Board,Local Authority,Site Type,...,Status,Tenure,valuesclean,sitetypeclean,yearending,Most Recent Trust Code,Most Recent Trust Name,Notes on trust data,Source,Count
0,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,NORTH WEST COMMISSIONING REGION,ACUTE - TEACHING,C5Y2X,LANCE BURN HEALTH CENTRE,M6 5QX,NHS GREATER MANCHESTER ICB,SALFORD CITY COUNCIL,Non inpatient,...,,,0,Non inpatient,2023,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,,ERIC data,1
1,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,NORTH WEST COMMISSIONING REGION,ACUTE - TEACHING,C9Y7X,CRESCENT BANK,M8 9JS,NHS GREATER MANCHESTER ICB,MANCHESTER CITY COUNCIL,Non inpatient,...,,,0,Non inpatient,2023,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,,ERIC data,1
2,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,NORTH WEST COMMISSIONING REGION,ACUTE - TEACHING,H4S9Q-X,MLCO VIRTUAL WARD,M18 8HE,NHS GREATER MANCHESTER ICB,MANCHESTER CITY COUNCIL,Other inpatient,...,,,0,Other inpatient,2023,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,,ERIC data,1
3,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,NORTH WEST COMMISSIONING REGION,ACUTE - TEACHING,O3L2I,MEDWAY HEALTH CENTRE,M33 4PS,NHS GREATER MANCHESTER ICB,TRAFFORD METROPOLITAN BOROUGH COUNCIL,Non inpatient,...,,,0,Non inpatient,2023,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,,ERIC data,1
4,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,NORTH WEST COMMISSIONING REGION,ACUTE - TEACHING,R0A01,ISLAND SITE,M13 9WL,NHS GREATER MANCHESTER ICB,MANCHESTER CITY COUNCIL,General acute hospital,...,,,2,General acute hospital,2023,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,,ERIC data,1
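
`pd.merge()` defaults to an inner join, so any trust whose code is missing from the successor lookup would silently drop out of the data. A hedged sketch of one way to guard against that, using invented trust codes: merge with `how='left'`, list the unmatched codes, then fall back to the original code where no successor is listed.

```python
import pandas as pd

# invented data: 'ZZZ' has no entry in the lookup
incidents = pd.DataFrame({"Trust Code": ["AAA", "BBB", "ZZZ"], "valuesclean": [1, 2, 3]})
lookup = pd.DataFrame({"Trust Code": ["AAA", "BBB"], "Most Recent Trust Code": ["AAA", "CCC"]})

# a left join keeps every incident row, matched or not
merged = pd.merge(left=incidents, right=lookup, on="Trust Code", how="left", indicator=True)

# list the codes that found no match so they can be checked by hand
unmatched = merged.loc[merged["_merge"] == "left_only", "Trust Code"].tolist()
print(unmatched)

# where there is no successor entry, keep the original code
merged["Most Recent Trust Code"] = merged["Most Recent Trust Code"].fillna(merged["Trust Code"])

# totals by the most recent code: BBB's figure is now counted under CCC
totals_by_code = merged.groupby("Most Recent Trust Code")["valuesclean"].sum()
print(totals_by_code)
```

Comparing `len()` before and after the merge in the notebook itself is a quick way to check whether the inner join dropped any rows.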


### Export updated data

In [None]:
#create a CSV from the dataframe
last5yrs.to_csv('last2yrs_incidents.csv')
#import a library for downloading files
from google.colab import files
#download the file
files.download('last2yrs_incidents.csv')


## TBD: Analyse the data on most impactful incidents

Earlier we filtered out data on the categories of incidents which were most impactful. Now we need to return to that.

In [None]:
#what measures were retained
last5yrs_impactful['measure'].unique()

array(['Most clinically impactful - Incident type (Select)',
       'Most clinically impactful - Number of recurring incidents (No.)',
       'Second most clinically impactful - Incident type (Select)',
       'Second most clinically impactful - Number of recurring incidents (No.)',
       'Third most clinically impactful - Incident type (Select)',
       'Third most clinically impactful - Number of recurring incidents (No.)'],
      dtype=object)

In [None]:
#pivot on that field
last5yrs_impactful[(last5yrs_impactful.measure == 'Most clinically impactful - Incident type (Select)') &
 (last5yrs_impactful['Site Type'] == 'General acute hospital')].pivot_table(
    index="Trust Name",
    values="measure",
    aggfunc="count",
    columns = "values")
#Site Type


values,Alarms & detection systems,Electrical systems,Energy centre systems,External building works,External fabric,Fire safety,Fixed plant / Equipment,Heating systems,Hot and cold water systems,Internal fabric and fixtures,Lifts & Hoists,Medical gas pipeline services,Miscellaneous,Roofs,Structure,Ventilation systems
Trust Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
ASHFORD AND ST. PETER'S HOSPITALS NHS FOUNDATION TRUST,,,,1.0,,,,,,,,,,1.0,,
BEDFORDSHIRE HOSPITALS NHS FOUNDATION TRUST,,1.0,,,,,,,,,,,,,,
BOLTON NHS FOUNDATION TRUST,,1.0,,,,,,,,,,,,,,
BRADFORD TEACHING HOSPITALS NHS FOUNDATION TRUST,,1.0,,,,,,,,,,,,,,
COUNTESS OF CHESTER HOSPITAL NHS FOUNDATION TRUST,,1.0,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
WIRRAL UNIVERSITY TEACHING HOSPITAL NHS FOUNDATION TRUST,,,,,,,,,,,,,1.0,,,
WORCESTERSHIRE ACUTE HOSPITALS NHS TRUST,,,,1.0,,,,,,,,,,,,
"WRIGHTINGTON, WIGAN AND LEIGH NHS FOUNDATION TRUST",,,,,,,,,,,,,,,,1.0
WYE VALLEY NHS TRUST,,,,,,,,,1.0,,,,,,,


In [None]:
last5yrs_impactful

Unnamed: 0,Trust Code,Trust Name,Commissioning Region,Trust Type,Site Code,Site Name,Post Code,Integrated Care Board,Local Authority,Site Type,measure,values,year_range,Status,Tenure
18905,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,NORTH WEST COMMISSIONING REGION,ACUTE - TEACHING,R0A01,ISLAND SITE,M13 9WL,NHS GREATER MANCHESTER ICB,MANCHESTER CITY COUNCIL,General acute hospital,Most clinically impactful - Incident type (Sel...,Heating systems,202223,,
18906,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,NORTH WEST COMMISSIONING REGION,ACUTE - TEACHING,R0A07,WYTHENSHAWE HOSPITAL,M23 9LT,NHS GREATER MANCHESTER ICB,MANCHESTER CITY COUNCIL,General acute hospital,Most clinically impactful - Incident type (Sel...,Fire safety,202223,,
18907,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,NORTH WEST COMMISSIONING REGION,ACUTE - TEACHING,R0A21,LONGSIGHT HEALTH CENTRE,M13 0RR,NHS GREATER MANCHESTER ICB,MANCHESTER CITY COUNCIL,Non inpatient,Most clinically impactful - Incident type (Sel...,Hot and cold water systems,202223,,
18908,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,NORTH WEST COMMISSIONING REGION,ACUTE - TEACHING,R0A66,NORTH MANCHESTER GENERAL HOSPITAL,M8 5RB,NHS GREATER MANCHESTER ICB,MANCHESTER CITY COUNCIL,General acute hospital,Most clinically impactful - Incident type (Sel...,Hot and cold water systems,202223,,
18909,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,NORTH WEST COMMISSIONING REGION,ACUTE - TEACHING,R0AORS,OTHER REPORTABLE SITES,M13 9WL,,,Other Reportable Site,Most clinically impactful - Incident type (Sel...,Roofs,202223,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20670,RYY,KENT COMMUNITY HEALTH NHS FOUNDATION TRUST,SOUTH EAST COMMISSIONING REGION,COMMUNITY,RYYD9,SEVENOAKS HOSPITAL,TN13 3PG,NHS KENT AND MEDWAY ICB,KENT COUNTY COUNCIL,Community hospital (with inpatient beds),Third most clinically impactful - Number of re...,0,202223,,
20671,RYY,KENT COMMUNITY HEALTH NHS FOUNDATION TRUST,SOUTH EAST COMMISSIONING REGION,COMMUNITY,RYYDC,TONBRIDGE COTTAGE HOSPITAL,TN11 0NE,NHS KENT AND MEDWAY ICB,KENT COUNTY COUNCIL,Community hospital (with inpatient beds),Third most clinically impactful - Number of re...,0,202223,,
20672,RYY,KENT COMMUNITY HEALTH NHS FOUNDATION TRUST,SOUTH EAST COMMISSIONING REGION,COMMUNITY,RYYORS,OTHER REPORTABLE SITES,TN25 4AZ,,,Other Reportable Site,Third most clinically impactful - Number of re...,3,202223,,
20673,TAF,CAMDEN AND ISLINGTON NHS FOUNDATION TRUST,LONDON COMMISSIONING REGION,CARE TRUST,TAF01,ST PANCRAS HOSPITAL,NW1 0PE,,,Mixed service hospital,Third most clinically impactful - Number of re...,0,202223,,


### Create a function to scrape IMPACTFUL columns

The original CSV has further columns, such as `Most clinically impactful - Cost to rectify estates and infrastructure failure (£)` and `Most clinically impactful - Down time as a result of estates and infrastructure failure (Hrs)`.

We need to import the CSV again. We adapt our `backlogdataonly()` function from earlier to look for the columns containing 'impactful' instead.

In [None]:
#define a function, it takes one argument - the url of the CSV
def impactfuldataonly(csvurl):
  #read in the CSV
  sitedata = pd.read_csv(csvurl, encoding = "ISO-8859-1")
  #store the first 10 column names
  keykeys = list(sitedata.keys()[0:10])
  print(keykeys)
  #loop through the keys and extract the ones with 'impactful' in them
  backlog_keys = [key for key in sitedata.keys() if 'impactful' in key.lower()]
  #add those keys to the ones we've already stored
  bothkeys = keykeys[:10]+backlog_keys
  print(bothkeys)
  #use those to extract a subset
  backlogdf = sitedata[bothkeys]
  #reshape from wide to long
  longversion = pd.melt(backlogdf, id_vars=list(sitedata.keys()[0:10]),var_name='measure', value_name='values')
  #print(longversion)
  #drop the rows where the value is a float (which includes empty NaN cells)
  #the list of True/False picks out those rows, and .index converts it to a list of indices
  backlog_filtered = longversion.drop(longversion[[type(i) == float for i in longversion["values"]]].index)
  #drop the rows where the value is 'Not Applicable' too
  backlog_filtered = backlog_filtered.drop(backlog_filtered[backlog_filtered['values'] == 'Not Applicable'].index)
  #rename columns where name has extra chars
  if 'Trust Code' in backlog_filtered.keys()[0]:
    print('HEY', backlog_filtered.keys()[0])
    replacename = backlog_filtered.keys()[0]
    print('HEYHEY', replacename)
    backlog_filtered = backlog_filtered.rename(columns={replacename: 'Trust Code'})
  if 'New Commissioning Region' in backlog_filtered.keys()[3]:
    print('HEY', backlog_filtered.keys()[3])
    replacename = backlog_filtered.keys()[3]
    print('HEYHEY', replacename)
    backlog_filtered = backlog_filtered.rename(columns={replacename: 'Commissioning Region'})
  #print(backlogdf.keys())
  #return the resulting dataframe to whatever called the function
  return(backlog_filtered)
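
The core of that function is the "select, then melt, then filter" pattern. Here is a minimal sketch of it on a tiny made-up table, so it can be run without downloading anything. The two site rows and the `Gross internal floor area (m²)` column are invented for illustration; only the 'impactful' column name comes from the real data.

```python
import pandas as pd

# a tiny made-up 'wide' table standing in for the site CSV
wide = pd.DataFrame({
    "Site Code": ["A01", "B02"],
    "Site Name": ["SITE A", "SITE B"],
    "Most clinically impactful - Incident type (Select)": ["Roofs", "Not Applicable"],
    "Gross internal floor area (m²)": [1200, 800],  # hypothetical unrelated column
})

# the identifying columns we want to keep on every row
id_columns = ["Site Code", "Site Name"]
# the columns with 'impactful' in their name
impactful_columns = [c for c in wide.columns if "impactful" in c.lower()]

# reshape from wide to long: one row per site per measure
long_version = pd.melt(
    wide[id_columns + impactful_columns],
    id_vars=id_columns,
    var_name="measure",
    value_name="values",
)

# keep only the rows that hold a real answer
filtered = long_version[long_version["values"] != "Not Applicable"]
print(filtered)
```

The unrelated column is left behind at the selection step, and the 'Not Applicable' row is removed after the reshape.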

In [None]:
#only the most recent data has these columns, so we run it on that CSV
impactfuldata = impactfuldataonly(csvurls[0])
impactfuldata.head(3)

['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Number of sites - General acute hospital (No.)', 'Number of sites - Specialist hospital (acute only) (No.)', 'Number of sites - Mixed service hospital (No.)', 'Number of sites - Mental Health (including Specialist services) (No.)', 'Number of sites - Learning Disabilities (No.)', 'Number of sites - Mental Health and Learning Disabilities (No.)']
['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Number of sites - General acute hospital (No.)', 'Number of sites - Specialist hospital (acute only) (No.)', 'Number of sites - Mixed service hospital (No.)', 'Number of sites - Mental Health (including Specialist services) (No.)', 'Number of sites - Learning Disabilities (No.)', 'Number of sites - Mental Health and Learning Disabilities (No.)']
HEY Trust Code
HEYHEY Trust Code


Unnamed: 0,Trust Code,Trust Name,Commissioning Region,Trust Type,Number of sites - General acute hospital (No.),Number of sites - Specialist hospital (acute only) (No.),Number of sites - Mixed service hospital (No.),Number of sites - Mental Health (including Specialist services) (No.),Number of sites - Learning Disabilities (No.),Number of sites - Mental Health and Learning Disabilities (No.),measure,values


In [None]:
impactfuldata['measure'].unique()

array([], dtype=object)

In [None]:
impactfuldata.keys()

Index(['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type',
       'Site Code', 'Site Name', 'Post Code', 'Integrated Care Board',
       'Local Authority', 'Site Type', 'measure', 'values'],
      dtype='object')

### Export

In [None]:
impactfuldata.to_csv("impactfuldata.csv")
files.download('impactfuldata.csv')

### Pivot to see the most common categories

In its current shape the data can be pivoted to count how often each category occurs, but summing the related cost or number of incidents would require a two-step reshape which creates:

* a column for incident type
* a column for incident ranking (most, second, third)
* a column for measure (cost, incidents)
* a column for values

We could do that in R instead as it only involves a single CSV.
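
For comparison, here is a rough pandas sketch of what that two-step reshape might look like. It is not part of the analysis. The rows are made up, and the 'Third most' labels are an assumption that every ranking follows the same `<ranking> - <detail>` naming pattern as the 'Most clinically impactful' columns.

```python
import pandas as pd

cost_label = "Cost to rectify estates and infrastructure failure (£)"
type_label = "Incident type (Select)"

# made-up long data in the same shape as impactfuldata
long_df = pd.DataFrame({
    "Site Code": ["A01", "A01", "A01", "A01", "B02", "B02"],
    "measure": [
        "Most clinically impactful - " + type_label,
        "Most clinically impactful - " + cost_label,
        "Third most clinically impactful - " + type_label,  # assumed label
        "Third most clinically impactful - " + cost_label,  # assumed label
        "Most clinically impactful - " + type_label,
        "Most clinically impactful - " + cost_label,
    ],
    "values": ["Roofs", "1000", "Fire safety", "250", "Roofs", "500"],
})

# step 1: split the measure into a ranking and the detail being measured
parts = long_df["measure"].str.split(" - ", n=1, expand=True)
long_df["ranking"] = parts[0]
long_df["detail"] = parts[1]

# step 2: spread the details back out, one row per site per ranking
wide = long_df.pivot(
    index=["Site Code", "ranking"], columns="detail", values="values"
).reset_index()

# the cost column can now be summed by incident type
wide[cost_label] = pd.to_numeric(wide[cost_label])
cost_by_type = wide.groupby(type_label)[cost_label].sum()
print(cost_by_type)
```

Once each incident has its type and its cost on the same row, grouping by type becomes a one-liner.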

Meanwhile here's that pivot.

In [None]:
#pivot on that field
impactfuldata[(impactfuldata.measure == 'Most clinically impactful - Incident type (Select)')].pivot_table(
    index="values",
    values="measure",
    aggfunc="count",
    columns = "Site Type")



## Add the TRUST level data - define a new FUNCTION

Before we start to clean it up we need to add in historical data published at trust level.

We need to create an equivalent function for that trust level data.

In [None]:
#define a function, it takes one argument - the url of the CSV
def trustbacklogdataonly(csvurl):
  #read in the CSV
  sitedata = pd.read_csv(csvurl, encoding = "ISO-8859-1")
  #store the first 9 column names
  keykeys = list(sitedata.keys()[0:9])
  print(keykeys)
  #loop through the keys and extract the ones with 'incident' in them
  backlog_keys = [key for key in sitedata.keys() if 'incident' in key.lower()]
  #add those keys to the ones we've already stored
  bothkeys = keykeys[:9]+backlog_keys
  print(bothkeys)
  #use those to extract a subset
  backlogdf = sitedata[bothkeys]
  #reshape from wide to long
  longversion = pd.melt(backlogdf, id_vars=list(sitedata.keys()[0:9]),var_name='measure', value_name='values')
  print(len(longversion))
  #print(longversion)
  #drop the rows where the value is 'Not Applicable'
  #.index converts the matching rows to a list of indices
  backlog_filtered = longversion.drop(longversion[longversion['values'] == 'Not Applicable'].index)
  print(len(backlog_filtered))
  #backlog_filtered = backlog_filtered.drop(backlog_filtered[[type(i) == float for i in backlog_filtered["values"]]].index)
  #print(len(backlog_filtered))
  #remove the extra row of headers too
  backlog_filtered = backlog_filtered.drop(backlog_filtered[backlog_filtered['values'] == 'Cost to eradicate high risk backlog (£)'].index)
  print(len(backlog_filtered))
  #rename columns where name has extra chars
  if 'Trust Code' in backlog_filtered.keys()[0]:
    print('HEY', backlog_filtered.keys()[0])
    replacename = backlog_filtered.keys()[0]
    print('HEYHEY', replacename)
    backlog_filtered = backlog_filtered.rename(columns={replacename: 'Trust Code'})
  if 'New Commissioning Region' in backlog_filtered.keys()[3]:
    print('HEY', backlog_filtered.keys()[3])
    replacename = backlog_filtered.keys()[3]
    print('HEYHEY', replacename)
    backlog_filtered = backlog_filtered.rename(columns={replacename: 'Commissioning Region'})
  #print(backlogdf.keys())
  #return the resulting dataframe to whatever called the function
  return(backlog_filtered)

In [None]:
testdf = trustbacklogdataonly('https://files.digital.nhs.uk/83/4AF81B/ERIC%20-%20201819%20-%20TrustData%20v4.csv')
testdf

['Trust Code', 'Trust Name', 'Old Commissioning Region', 'New Commissioning Region', 'Trust Type', 'Number of sites - General acute hospital (No.)', 'Number of sites - Specialist hospital (acute only) (No.)', 'Number of sites - Mixed service hospital (No.)', 'Number of sites - Mental Health (including Specialist services) (No.)']
['Trust Code', 'Trust Name', 'Old Commissioning Region', 'New Commissioning Region', 'Trust Type', 'Number of sites - General acute hospital (No.)', 'Number of sites - Specialist hospital (acute only) (No.)', 'Number of sites - Mixed service hospital (No.)', 'Number of sites - Mental Health (including Specialist services) (No.)', 'Estates and Facilities RIDDOR incidents (No.)', 'Estates and facilities related incidents (No.)', 'Clinical service incidents caused by estates and infrastructure failure (No.)']
681
681
681
HEY Trust Code
HEYHEY Trust Code
HEY New Commissioning Region
HEYHEY New Commissioning Region


Unnamed: 0,Trust Code,Trust Name,Old Commissioning Region,Commissioning Region,Trust Type,Number of sites - General acute hospital (No.),Number of sites - Specialist hospital (acute only) (No.),Number of sites - Mixed service hospital (No.),Number of sites - Mental Health (including Specialist services) (No.),measure,values
0,R0A,MANCHESTER UNIVERSITY NHS FOUNDATION TRUST,NORTH OF ENGLAND COMMISSIONING REGION,NORTH WEST COMMISSIONING REGION,ACUTE - TEACHING,3,4,1,3,Estates and Facilities RIDDOR incidents (No.),7
1,R1A,WORCESTERSHIRE HEALTH AND CARE NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,MIDLANDS COMMISSIONING REGION,COMMUNITY,0,0,1,5,Estates and Facilities RIDDOR incidents (No.),0
2,R1C,SOLENT NHS TRUST,SOUTH OF ENGLAND COMMISSIONING REGION,SOUTH EAST COMMISSIONING REGION,COMMUNITY,0,0,1,0,Estates and Facilities RIDDOR incidents (No.),0
3,R1D,SHROPSHIRE COMMUNITY HEALTH NHS TRUST,MIDLANDS AND EAST OF ENGLAND COMMISSIONING REGION,MIDLANDS COMMISSIONING REGION,COMMUNITY,0,0,0,0,Estates and Facilities RIDDOR incidents (No.),2
4,R1F,ISLE OF WIGHT NHS TRUST,SOUTH OF ENGLAND COMMISSIONING REGION,SOUTH EAST COMMISSIONING REGION,ACUTE - MULTI-SERVICE,0,0,1,1,Estates and Facilities RIDDOR incidents (No.),8
...,...,...,...,...,...,...,...,...,...,...,...
676,RYY,KENT COMMUNITY HEALTH NHS FOUNDATION TRUST,SOUTH OF ENGLAND COMMISSIONING REGION,SOUTH EAST COMMISSIONING REGION,COMMUNITY,0,0,0,0,Clinical service incidents caused by estates a...,17
677,TAD,BRADFORD DISTRICT CARE NHS FOUNDATION TRUST,NORTH OF ENGLAND COMMISSIONING REGION,NORTH EAST AND YORKSHIRE COMMISSIONING REGION,CARE TRUST,0,0,1,1,Clinical service incidents caused by estates a...,0
678,TAF,CAMDEN AND ISLINGTON NHS FOUNDATION TRUST,LONDON COMMISSIONING REGION,LONDON COMMISSIONING REGION,CARE TRUST,0,0,1,8,Clinical service incidents caused by estates a...,16
679,TAH,SHEFFIELD HEALTH AND SOCIAL CARE NHS FOUNDATIO...,NORTH OF ENGLAND COMMISSIONING REGION,NORTH EAST AND YORKSHIRE COMMISSIONING REGION,CARE TRUST,0,0,6,1,Clinical service incidents caused by estates a...,0


In [None]:
testdf['measure'].unique()

array(['Estates and Facilities RIDDOR incidents (No.)',
       'Estates and facilities related incidents (No.)',
       'Clinical service incidents caused by estates and infrastructure failure (No.)'],
      dtype=object)

### Create a function to scrape the TRUST data CSV

We adapt the code from the function we created to grab the site CSV so that it grabs the trust CSV instead.

In [None]:
#define a function that takes a URL and returns the trust data CSV link on that page
def fetchcsv_for_trusts(url):
  # Send a GET request to the link URL
  link_response = requests.get(url)
  #parse into soup
  soup = BeautifulSoup(link_response.content, 'html.parser')
  # Find all links
  divboxlink = soup.find_all('a')
  #create an empty list
  matches = []
  #loop through each one
  for i in divboxlink:
    #grab the href= attribute (this is None if the link has no href)
    href = i.get('href')
    #look for the one about Trust data
    if href and "Trust" in href:
      #store that URL
      matches.append(href)
  #if the list has something in it
  if len(matches) >0:
    #return that URL
    return(matches[0])
  #otherwise
  else:
    #return a string we can pick up the other side
    return('NO LINK')
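
The matching logic can be tried out without fetching a live page. The sketch below is an illustration only: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs with no extra installs, and the HTML snippet, file paths and the `first_matching_link()` name are all made up.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (None when a tag has no href)."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))

def first_matching_link(html, keyword):
    # hypothetical helper: return the first link containing the keyword
    collector = LinkCollector()
    collector.feed(html)
    for href in collector.hrefs:
        # skip links with no href before testing for the keyword
        if href and keyword in href:
            return href
    # a string we can pick up the other side
    return "NO LINK"

# made-up HTML with one anchor that has no href at all
sample_html = (
    '<a name="top">Top</a>'
    '<a href="/files/ERIC-SiteData.csv">Site data</a>'
    '<a href="/files/ERIC-TrustData.csv">Trust data</a>'
)

print(first_matching_link(sample_html, "Trust"))
print(first_matching_link(sample_html, "PFI"))
```

The anchor without an `href` shows why it is worth checking that the attribute exists before searching inside it.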

In [None]:
#create an empty list to store the URLs
csvurls = []

#some of this code generated by ChatGPT in response to the prompt:
#"write some python code which identifies the first link inside a <h3> tag at
#https://digital.nhs.uk/data-and-information/publications/statistical/estates-returns-information-collection and fetches that"
#fetch that page
response = requests.get(ericurl)
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
# Find all the <h3> tags
h3_tag = soup.find_all('h3')
#loop through the years before the data shifted to site level
#those are index 2 (the third link) to 4
for i in h3_tag[2:5]:
  #find the first <a> and get the href= attribute
  yearpageurl = baseurl+i.find('a').get('href')
  print(yearpageurl)
  #run the function defined above to fetch the CSV link from that page
  trustdatacsvurl = fetchcsv_for_trusts(yearpageurl)
  print(trustdatacsvurl)
  #add it to the list unless it's a 'NO LINK'
  if trustdatacsvurl != 'NO LINK':
    csvurls.append(trustdatacsvurl)


print(csvurls)

https://digital.nhs.uk/data-and-information/publications/statistical/estates-returns-information-collection/england-2020-21
https://files.digital.nhs.uk/81/4A77B0/ERIC%20-%20202021%20-%20Trust%20data.csv
https://digital.nhs.uk/data-and-information/publications/statistical/estates-returns-information-collection/england-2019-20
https://files.digital.nhs.uk/84/07227E/ERIC%20-%20201920%20-%20TrustData.csv
https://digital.nhs.uk/data-and-information/publications/statistical/estates-returns-information-collection/england-2018-19
https://files.digital.nhs.uk/83/4AF81B/ERIC%20-%20201819%20-%20TrustData%20v4.csv
['https://files.digital.nhs.uk/81/4A77B0/ERIC%20-%20202021%20-%20Trust%20data.csv', 'https://files.digital.nhs.uk/84/07227E/ERIC%20-%20201920%20-%20TrustData.csv', 'https://files.digital.nhs.uk/83/4AF81B/ERIC%20-%20201819%20-%20TrustData%20v4.csv']


### Loop through the CSV urls for TRUSTS

In [None]:
csvurls

['https://files.digital.nhs.uk/81/4A77B0/ERIC%20-%20202021%20-%20Trust%20data.csv',
 'https://files.digital.nhs.uk/84/07227E/ERIC%20-%20201920%20-%20TrustData.csv',
 'https://files.digital.nhs.uk/83/4AF81B/ERIC%20-%20201819%20-%20TrustData%20v4.csv']

In [None]:
check2021 = trustbacklogdataonly(csvurls[0])

['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Number of sites - General acute hospital (No.)', 'Number of sites - Specialist hospital (acute only) (No.)', 'Number of sites - Mixed service hospital (No.)', 'Number of sites - Mental Health (including Specialist services) (No.)', 'Number of sites - Learning Disabilities (No.)']
['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Number of sites - General acute hospital (No.)', 'Number of sites - Specialist hospital (acute only) (No.)', 'Number of sites - Mixed service hospital (No.)', 'Number of sites - Mental Health (including Specialist services) (No.)', 'Number of sites - Learning Disabilities (No.)', 'RIDDOR incidents (No.)', 'Estates and facilities related incidents (No.)', 'Clinical service incidents caused by estates and infrastructure failure (No.)']
651
651
651
HEY Trust Code
HEYHEY Trust Code


In [None]:
len(check2021)

651

In [None]:
#create an empty dataframe
yrs18to20 = pd.DataFrame()

#loop through the URLs
for i in csvurls:
  print(i)
  #apply the function to the URL, and store the results in a new variable
  thisyrdf = trustbacklogdataonly(i)
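  #extract the year range from the URL and store it in a new column
  thisyrdf['year_range'] = i.split('ERIC')[1].split('-')[1].split('-')[0].replace('%20','')
  #check how many rows we have
  print(len(thisyrdf))
  #add to the previously empty dataframe
  #(concat is used because DataFrame.append was removed in pandas 2.0)
  yrs18to20 = pd.concat([yrs18to20, thisyrdf], ignore_index = True)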



https://files.digital.nhs.uk/81/4A77B0/ERIC%20-%20202021%20-%20Trust%20data.csv
['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Number of sites - General acute hospital (No.)', 'Number of sites - Specialist hospital (acute only) (No.)', 'Number of sites - Mixed service hospital (No.)', 'Number of sites - Mental Health (including Specialist services) (No.)', 'Number of sites - Learning Disabilities (No.)']
['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Number of sites - General acute hospital (No.)', 'Number of sites - Specialist hospital (acute only) (No.)', 'Number of sites - Mixed service hospital (No.)', 'Number of sites - Mental Health (including Specialist services) (No.)', 'Number of sites - Learning Disabilities (No.)', 'RIDDOR incidents (No.)', 'Estates and facilities related incidents (No.)', 'Clinical service incidents caused by estates and infrastructure failure (No.)']
651
651
651
HEY Trust Code
HEYHEY Trust Code
651
https://files.di

['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Number of sites - General acute hospital (No.)', 'Number of sites - Specialist hospital (acute only) (No.)', 'Number of sites - Mixed service hospital (No.)', 'Number of sites - Mental Health (including Specialist services) (No.)', 'Number of sites - Learning Disabilities (No.)']
['Trust Code', 'Trust Name', 'Commissioning Region', 'Trust Type', 'Number of sites - General acute hospital (No.)', 'Number of sites - Specialist hospital (acute only) (No.)', 'Number of sites - Mixed service hospital (No.)', 'Number of sites - Mental Health (including Specialist services) (No.)', 'Number of sites - Learning Disabilities (No.)', 'RIDDOR incidents (No.)', 'Estates and facilities related incidents (No.)', 'Clinical service incidents caused by estates and infrastructure failure (No.)']
672
672
672
HEY Trust Code
HEYHEY Trust Code
672
https://files.digital.nhs.uk/83/4AF81B/ERIC%20-%20201819%20-%20TrustData%20v4.csv


['Trust Code', 'Trust Name', 'Old Commissioning Region', 'New Commissioning Region', 'Trust Type', 'Number of sites - General acute hospital (No.)', 'Number of sites - Specialist hospital (acute only) (No.)', 'Number of sites - Mixed service hospital (No.)', 'Number of sites - Mental Health (including Specialist services) (No.)']
['Trust Code', 'Trust Name', 'Old Commissioning Region', 'New Commissioning Region', 'Trust Type', 'Number of sites - General acute hospital (No.)', 'Number of sites - Specialist hospital (acute only) (No.)', 'Number of sites - Mixed service hospital (No.)', 'Number of sites - Mental Health (including Specialist services) (No.)', 'Estates and Facilities RIDDOR incidents (No.)', 'Estates and facilities related incidents (No.)', 'Clinical service incidents caused by estates and infrastructure failure (No.)']
681
681
681
HEY Trust Code
HEYHEY Trust Code
HEY New Commissioning Region
HEYHEY New Commissioning Region
681


In [None]:
#check the unique values in the measure column
yrs18to20['measure'].unique()

array(['RIDDOR incidents (No.)',
       'Estates and facilities related incidents (No.)',
       'Clinical service incidents caused by estates and infrastructure failure (No.)',
       'Estates and Facilities RIDDOR incidents (No.)'], dtype=object)

In [None]:
#check the unique values in the 'year_range' column
yrs18to20['year_range'].unique()

array(['202021', '201920', '201819'], dtype=object)
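
Those year ranges were pulled out of each URL with a chain of `.split()` calls. As an alternative sketch (not what the loop above does), the same values can be extracted by decoding the URL and using a regular expression. The `year_range_from_url()` name is made up, and the pattern assumes the file name always contains `ERIC - ` followed by six digits.

```python
import re
from urllib.parse import unquote

def year_range_from_url(csvurl):
    # take the file name at the end of the URL and turn %20 back into spaces
    filename = unquote(csvurl.rsplit('/', 1)[-1])
    # look for 'ERIC - ' followed by six digits, e.g. 'ERIC - 202021'
    match = re.search(r'ERIC\s*-\s*(\d{6})', filename)
    # return None rather than raising an error if the pattern is missing
    return match.group(1) if match else None

urls = [
    'https://files.digital.nhs.uk/81/4A77B0/ERIC%20-%20202021%20-%20Trust%20data.csv',
    'https://files.digital.nhs.uk/84/07227E/ERIC%20-%20201920%20-%20TrustData.csv',
    'https://files.digital.nhs.uk/83/4AF81B/ERIC%20-%20201819%20-%20TrustData%20v4.csv',
]

print([year_range_from_url(u) for u in urls])
```

Returning `None` for an unexpected file name makes a changed naming scheme easier to spot than an `IndexError` from the split chain.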

## Clean trust data

This time the values in the spreadsheets *are* large enough to have commas, and some cells are empty (the cause of the error below), which means we can reuse the cleaning function created earlier in this notebook rather than converting the values directly.

In [None]:
#try to create a new column by converting each value directly to an integer
yrs18to20['valuesclean'] = [int(i) for i in yrs18to20['values']]

ValueError: cannot convert float NaN to integer

In [None]:
#apply the cleannumbers function to the column of values
#store in a new column
yrs18to20['valuesclean'] = cleannumbers(yrs18to20['values'])
#then show the results
yrs18to20['valuesclean']
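
The `cleannumbers()` function is defined earlier in the notebook and is not reproduced here. As a rough illustration of the kind of cleaning involved, the hypothetical helper below (`clean_numbers_sketch()`, which is not the notebook's function) strips commas and keeps missing values as missing instead of raising an error.

```python
import pandas as pd

def clean_numbers_sketch(values):
    # hypothetical stand-in, NOT the notebook's cleannumbers() function
    # treat everything as text so commas can be removed
    as_text = pd.Series(values, dtype="object").astype(str)
    as_text = as_text.str.replace(",", "", regex=False).str.strip()
    # anything that is not a number (e.g. a missing value) becomes NaN
    numbers = pd.to_numeric(as_text, errors="coerce")
    # Int64 (capital I) is pandas' integer type that allows missing values
    return numbers.astype("Int64")

# made-up values: one with a comma, one missing
result = clean_numbers_sketch(["1,234", "56", None, "7"])
print(result)
```

Using the nullable `Int64` type avoids the `cannot convert float NaN to integer` error, because missing cells are stored as `<NA>`.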

### Export for data validation done in Excel

At this point we export the dataset and check in Excel if the `valuesclean` column has the same values as the `values` column.

Because Excel treats them both as numeric columns, the formula `=O2=L2` can be typed in a new column and copied down to see if the two values are always the same. The process confirms that the result is `TRUE` for all rows.

In [None]:
#create a CSV from the dataframe
yrs18to20.to_csv('yrs18to20_incidents.csv')
#download the file
files.download('yrs18to20_incidents.csv')

In [None]:
yrs18to20[(yrs18to20.measure == 'Clinical service incidents caused by estates and infrastructure failure (No.)')].pivot_table(
    index="Commissioning Region",
    values="valuesclean",
    aggfunc="sum",
    columns = "year_range")


### Identify different terms used for measures

We can see RIDDOR incidents are reported under two different terms. Let's see what measures are used in each year.

In [None]:
#filter the dataframe to those where the year_range is 201819
#then show the unique values in the 'measure' column
yrs18to20['measure'][yrs18to20['year_range'] == '201819'].unique()

In [None]:
#repeat for 201920
yrs18to20['measure'][yrs18to20['year_range'] == '201920'].unique()

In [None]:
#repeat for 202021
yrs18to20['measure'][yrs18to20['year_range'] == '202021'].unique()

### Make the measures consistent

We need to make these measures consistent across the three years covered - and with the other site-level dataset.

Let's remind ourselves what they were there.

In [None]:
last5yrs['measure'].unique()

In [None]:
yrs18to20['measure'].unique()

So for RIDDOR we need to use `'Estates and facilities RIDDOR incidents (No)'`.

In [None]:
#check what we start with
print(yrs18to20['measure'].unique())
#replace any text that begins 'RIDDOR' with the specified string
measureclean = [re.sub('^RIDDOR.*','Estates and Facilities RIDDOR incidents (No.)',i) for i in yrs18to20['measure']]
#check
print(pd.Series(measureclean).unique())
#replace the string used in this data with a lower case f and no period on No.
measureclean = [re.sub('Estates and Facilities RIDDOR.*','Estates and facilities RIDDOR incidents (No)',i) for i in measureclean]
#check
print(pd.Series(measureclean).unique())
#assign to the dataframe
yrs18to20['measureclean'] = measureclean

There's also a subtle difference in how the clinical service incidents measure is named in each data frame:

In [None]:
print(last5yrs['measure'].unique()[-2])
print(yrs18to20['measureclean'].unique()[-1])

In [None]:
measureclean = [re.sub('Clinical service incidents caused by.*',
                       'Clinical service incidents caused by estates and infrastructure failure (No)',
                       i) for i in measureclean]
#check
print(pd.Series(measureclean).unique())
#assign to the dataframe
yrs18to20['measureclean'] = measureclean

In [None]:
#check it worked
print(last5yrs['measure'].unique()[-2])
print(yrs18to20['measureclean'].unique()[-1])
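
The same harmonising can also be written as a short list of rules applied in turn, which may be easier to extend if more naming variations turn up. This is a sketch rather than the code used above: the `harmonise()` name and the `rules` list are made up, while the labels are the ones seen in the data.

```python
import re

# each rule is a (pattern, replacement) pair
rules = [
    # both RIDDOR spellings become the site-level wording
    (r'^(Estates and Facilities )?RIDDOR.*',
     'Estates and facilities RIDDOR incidents (No)'),
    # drop the period after 'No' on the clinical service measure
    (r'^Clinical service incidents caused by.*',
     'Clinical service incidents caused by estates and infrastructure failure (No)'),
]

def harmonise(label):
    # apply every rule in turn to one label
    for pattern, replacement in rules:
        label = re.sub(pattern, replacement, label)
    return label

labels = [
    'RIDDOR incidents (No.)',
    'Estates and Facilities RIDDOR incidents (No.)',
    'Estates and facilities related incidents (No.)',
    'Clinical service incidents caused by estates and infrastructure failure (No.)',
]

for label in labels:
    print(label, '->', harmonise(label))
```

The patterns are case sensitive, so `Estates and facilities related incidents (No.)` is left untouched by the RIDDOR rule.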

## Make the two dataframes consistent

If we want to combine the dataframes we need them to have the same columns. Let's start by adding a 'yearending' column to the older data too.

In [None]:
#grab the last two digits of every string in year_range
#add '20' to the front of those, and store in a list
yearending = ['20'+i[-2:] for i in yrs18to20['year_range']]

#add to dataframe
yrs18to20['yearending'] = yearending
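
As a worked example of that conversion, here it is applied to the three year ranges in this data. The `year_ending()` name is made up, and the approach assumes every year falls in the 2000s.

```python
def year_ending(year_range):
    # take the last two digits and put '20' in front, e.g. '202021' -> '2021'
    return '20' + year_range[-2:]

for year_range in ['202021', '201920', '201819']:
    ending = year_ending(year_range)
    # sanity check: the year ending should be one more than the starting year
    assert int(ending) == int(year_range[:4]) + 1
    print(year_range, '->', ending)
```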

In [None]:
last5yrs.keys()

In [None]:
yrs18to20.keys()

Let's work on a copy, too.

In [None]:
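#.copy() makes an independent dataframe
#(plain assignment would only create a second name for last5yrs)
yrs21on = last5yrs.copy()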

In [None]:
#loop through the keys of one data frame
for i in yrs18to20.keys():
  print(i)
  #check if it's in the other data frame
  if i not in yrs21on.keys():
    print('not there, making new column')
    #create a new column for that
    yrs21on[i] = 'NA'
  else:
    print('already there')

In [None]:
yrs21on.keys()

Now let's reverse the process.

In [None]:
#loop through the keys of one data frame
for i in yrs21on.keys():
  print(i)
  #check if it's in the other data frame
  if i not in yrs18to20.keys():
    print('not there, making new column')
    #create a new column for that
    yrs18to20[i] = 'NA'
  else:
    print('already there')

Do both data frames have the same columns?

In [None]:
[i in yrs21on.keys() for i in yrs18to20.keys()]
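
The two loops above can also be expressed with `reindex()`, which adds any missing columns in one step. The sketch below uses two tiny made-up data frames rather than the real ones, and is an alternative illustration rather than the method used in this notebook.

```python
import pandas as pd

# made-up stand-ins for the trust-level and site-level data frames
older = pd.DataFrame({"Trust Code": ["R0A"], "Old Commissioning Region": ["NORTH"], "values": [7]})
newer = pd.DataFrame({"Trust Code": ["R0A"], "Site Code": ["R0A01"], "values": [3]})

# which columns appear in only one of the two?
only_in_one = set(older.columns) ^ set(newer.columns)
print(only_in_one)

# build one list of every column, keeping the order and dropping repeats
all_columns = list(dict.fromkeys(list(newer.columns) + list(older.columns)))

# give both data frames the full set of columns, filling the gaps with 'NA'
older_aligned = older.reindex(columns=all_columns, fill_value="NA")
newer_aligned = newer.reindex(columns=all_columns, fill_value="NA")

# both now have identical columns in the same order
print(list(older_aligned.columns) == list(newer_aligned.columns))
```

Checking the set of columns that appear in only one data frame is a quick way to see what will be filled with 'NA' before combining.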

## Combine the two data frames

Now that both data frames have the same columns, we can combine them using `concat()`.

First, it's worth adding a new column to specify the level the data was published at.

In [None]:
yrs21on['level_of_reporting'] = 'site'
yrs18to20['level_of_reporting'] = 'trust'

In [None]:
combined_df = pd.concat([yrs21on, yrs18to20], axis=0)
len(combined_df) == len(yrs21on)+len(yrs18to20)

In [None]:
combined_df

## Export as a CSV

Because it covers five years of data, we call it `last5yrs_incidents.csv` when we export it.

In [None]:
#create a CSV from the dataframe
combined_df.to_csv('last5yrs_incidents.csv')
#import a library for downloading files
from google.colab import files
#download the file
files.download('last5yrs_incidents.csv')