In [83]:
%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
cannot find .env file


# VMFI Data processing pipeline

This workbook emulates the data processing that currently occurs in the VMFI pipeline. The logic is largely based on the document [Insights data portal - Data sources and sql analysis](https://educationgovuk.sharepoint.com.mcas.ms/:w:/r/sites/VMFI/_layouts/15/Doc.aspx?sourcedoc=%7B38C1DC37-7CDB-48B8-9E22-284F4F311C0B%7D&file=1.%20Insights%20portal%20-%20data%20sources%20and%20sql%20analysis%20v010%20-%20Copy.docx&action=default&mobileredirect=true) and stays true to that document even where the existing stored procedures do something different. This will form the basis of a gap analysis going forward.

All data loaded in this workbook comes from the set of CSV files in the `data` folder alongside this workbook. Most of these datasets come from the list at the start of the linked document. However, fully implementing the pipeline also requires additional standing data, which has been exported from the development VMFI pipeline database. These exported files are currently:

| File name | DB Table |
|:----------|----------|
|standing_data_cdc.csv | standing_data.cdc |

In [1]:
import pandas as pd
import mappings
import schemas
import datetime
import time
import glob
import os

# Create and clean directory
from pathlib import Path
Path("output/pre-processing").mkdir(parents=True, exist_ok=True)

files = glob.glob("output/pre-processing/*")
for f in files:
    os.remove(f)

start_time = time.time()
current_year = 2022
accounts_return_period_start_date = datetime.date(current_year - 1, 9, 10)
academy_year_start_date = datetime.date(current_year - 1, 9, 1)
academy_year_end_date = datetime.date(current_year, 8, 31)
maintained_schools_year_start_date = datetime.date(current_year - 1, 4, 1)
maintained_schools_year_end_date = datetime.date(current_year, 3, 31)

## CDC data load and preparation

School buildings condition dataset, based on the surveys performed throughout 2018-2019.

The file `data/standing_data_cdc.csv` is an export of the `standing_data.cdc` table, without the Year and Import ID fields. In future this will likely have to be read directly from the source database as per [this document](https://educationgovuk.sharepoint.com.mcas.ms/:w:/r/sites/VMFI/_layouts/15/Doc.aspx?sourcedoc=%7B38C1DC37-7CDB-48B8-9E22-284F4F311C0B%7D&file=1.%20Insights%20portal%20-%20data%20sources%20and%20sql%20analysis%20v010%20-%20Copy.docx&action=default&mobileredirect=true).

In [85]:
cdc = pd.read_csv('data/standing_data_cdc.csv', encoding='utf8', index_col=schemas.cdc_index_col, usecols=schemas.cdc.keys(), dtype=schemas.cdc)

cdc['Total Internal Floor Area'] = cdc.groupby(by=['URN'])['GIFA'].sum()
cdc['Proportion Area'] = (cdc['GIFA'] / cdc['Total Internal Floor Area'])
cdc['Indicative Age'] = cdc['Block Age'].fillna('').map(mappings.map_block_age).astype('Int64')
cdc['Age Score'] = cdc['Proportion Area'] * (current_year - cdc['Indicative Age'])
cdc['Age Average Score'] = cdc.groupby(by=['URN'])['Age Score'].sum()
cdc = cdc[['Total Internal Floor Area', 'Age Average Score']].drop_duplicates()
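The cell above computes an area-weighted building age per school. As a minimal sketch of the same calculation on made-up data (the three survey blocks and their values below are hypothetical, and `Indicative Age` is assumed to be a construction year, as the subtraction from `current_year` suggests):

```python
import pandas as pd

# Hypothetical survey blocks: two blocks for school 1, one block for school 2.
blocks = pd.DataFrame(
    {
        "URN": [1, 1, 2],
        "GIFA": [300.0, 100.0, 500.0],
        "Indicative Age": [1980, 2000, 2010],
    }
).set_index("URN")

current_year = 2022

# transform("sum") returns one value per row, so no index alignment is needed.
blocks["Total Internal Floor Area"] = blocks.groupby("URN")["GIFA"].transform("sum")
blocks["Proportion Area"] = blocks["GIFA"] / blocks["Total Internal Floor Area"]
blocks["Age Score"] = blocks["Proportion Area"] * (current_year - blocks["Indicative Age"])

# One row per school: total floor area and the area-weighted age.
summary = blocks.groupby("URN").agg(
    total_area=("GIFA", "sum"),
    age_average_score=("Age Score", "sum"),
)
print(summary)
```

For school 1 the weights are 0.75 and 0.25, so the score is 0.75 * 42 + 0.25 * 22 = 37.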

In [86]:
cdc.to_csv('output/pre-processing/cdc.csv')
cdc

URN,Total Internal Floor Area,Age Average Score
100150,2803.0,48.358188
100162,2105.0,133.162945
100164,2934.0,97.0
100166,2040.0,91.705882
105304,1602.0,35.752809
...,...,...
144913,3111.0,16.704275
144917,2620.0,78.412214
105623,3382.0,7.0
144918,4733.0,19.009296


## School Census data load

*Pupil census* - DfE data collection providing information about school and pupil characteristics, for example the percentage of pupils claiming free school meals or having English as their second language.

*Workforce census* - Single reference for all school workforce statistics based on staff working in publicly funded schools in England.

The following code loads both the workforce and pupil census data and performs an `inner` join of the two datasets on URN.

In [87]:
school_workforce_census = pd.read_excel('data/School_Tables_School_Workforce_Census_2022.xlsx', header=5, index_col=schemas.workforce_census_index_col, usecols=schemas.workforce_census.keys(), dtype=schemas.workforce_census, na_values=["x","u"], keep_default_na=True).drop_duplicates()

school_pupil_census = pd.read_csv('data/standing_data_census_pupils.csv', encoding='utf8', index_col=schemas.pupil_census_index_col, usecols=schemas.pupil_census.keys(), dtype=schemas.pupil_census).drop_duplicates()

census = school_pupil_census.join(school_workforce_census, on='URN', how='inner', lsuffix='_pupil', rsuffix='_workforce')

census.drop(labels=['full time pupils', 'headcount of pupils'], axis=1, inplace=True)
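An inner join silently drops any school that appears in only one of the two census files. The following sketch, on hypothetical data with made-up column names, shows the effect and one way to list the URNs that fall out on each side:

```python
import pandas as pd

# Hypothetical pupil and workforce extracts, both indexed by URN.
pupils = pd.DataFrame(
    {"URN": [100, 101, 102], "pupils_fsm_pct": [12.5, 30.0, 8.1]}
).set_index("URN")
workforce = pd.DataFrame(
    {"URN": [101, 102, 103], "teachers_fte": [14.0, 22.5, 9.0]}
).set_index("URN")

# Only URNs present in both frames survive an inner join.
joined = pupils.join(workforce, how="inner")

# URNs that the join discards from each side.
only_pupils = pupils.index.difference(workforce.index)
only_workforce = workforce.index.difference(pupils.index)

print(joined)
print("Missing from workforce census:", list(only_pupils))
print("Missing from pupil census:", list(only_workforce))
```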
         


In [90]:
census.to_csv('output/pre-processing/census.csv')
census

URN,% of pupils known to be eligible for and claiming free school me,% of pupils known to be eligible for free school meals (Performa,number of pupils whose first language is known or believed to be other than English,Total School Workforce (Full-Time Equivalent),Total Number of Teachers (Full-Time Equivalent),Pupil: Teacher Ratio (Full-Time Equivalent of qualified and unqualified teachers),Number of Vacant Teacher Posts
141334,33.8,52.3,93.0,34.17,13.11,24.8,0.0
141396,23.4,60.3,236.0,82.47,34.00,18.3,0.0
141397,33.2,47.7,127.0,72.81,24.55,19.7,0.0
142223,5.1,8.7,343.0,99.66,47.12,23.0,0.0
144396,56.7,64.8,29.0,25.57,11.39,18.1,0.0
...,...,...,...,...,...,...,...
104642,2.4,2.6,14.0,34.47,15.80,26.6,0.0
104643,3.5,8.5,13.0,39.89,17.40,24.7,0.0
104645,32.9,33.8,43.0,26.47,12.40,19.1,0.0
104646,29.9,31.9,20.0,22.36,12.00,15.8,0.0


## Special Educational Needs (SEN) data load and preparation

The special educational needs dataset contains information about the number of pupils who require various SEN provisions. The following cell loads the `SEN` data, which originates from [here](https://explore-education-statistics.service.gov.uk/find-statistics/special-educational-needs-in-england#dataDownloads-1).

In [91]:
sen = pd.read_csv('data/SEN.csv', encoding='cp1252', index_col=schemas.sen_index_col, dtype=schemas.sen, usecols=schemas.sen.keys())
sen['Percentage SEN'] = (sen['EHC plan'] / sen['Total pupils']) * 100.0

In [92]:
sen.to_csv("output/pre-processing/sen.csv")
sen

URN,Total pupils,SEN support,EHC plan,EHC_Primary_need_spld,EHC_Primary_need_mld,EHC_Primary_need_sld,EHC_Primary_need_pmld,EHC_Primary_need_semh,EHC_Primary_need_slcn,EHC_Primary_need_hi,...,SUP_Primary_need_semh,SUP_Primary_need_slcn,SUP_Primary_need_hi,SUP_Primary_need_vi,SUP_Primary_need_msi,SUP_Primary_need_pd,SUP_Primary_need_asd,SUP_Primary_need_oth,SUP_Primary_need_nsa,Percentage SEN
100000,271,59,8,0,0,0,0,1,1,0,...,8,30,2,0,2,0,4,0,7,2.95203
100001,739,22,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.0
100002,269,22,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.0
100003,1045,145,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.0
100005,136,23,2,0,0,0,1,0,0,0,...,0,0,0,0,0,1,21,1,0,1.470588
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149557,41,2,3,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,7.317073
149632,1291,136,58,2,1,0,0,3,15,4,...,17,10,4,2,1,3,6,16,34,4.492641
149633,86,7,0,0,0,0,0,0,0,0,...,1,1,0,0,0,0,1,1,0,0.0
149635,654,30,10,2,0,0,0,5,1,0,...,7,1,2,1,0,0,1,4,0,1.529052


## AR Data load and preparation

This loads the annual accounts return (AR) dataset and the corresponding mapping file. The extract contains only the benchmarking section, which consists of submissions of costs, income, and balances of individual academies.

The mapping file contains the mapping from AR4 cell references to cost categories and descriptions.

In [4]:
ar_cell_mapping = pd.read_csv('data/AR_cell_mapping.csv', encoding='utf8', index_col=schemas.ar_cell_mapping_index_col, usecols=schemas.ar_cell_mapping.keys(), dtype=schemas.ar_cell_mapping)

ar_raw = pd.read_csv('data/AR_raw.csv', encoding='utf8', index_col=schemas.ar_index_col, usecols=schemas.ar.keys(), dtype=schemas.ar)

ar = ar_raw.reset_index().merge(ar_cell_mapping, right_on='cell', left_on='aruniquereference').set_index(schemas.ar_index_col)

pfi_schools = ar[ar['aruniquereference'] == 'BAE310-T'][['value']].map(mappings.map_is_pfi_school).rename(columns={'value': 'PFI School'})
academy_financial_position = ar[ar['aruniquereference'] == 'BAB030-T'][['value']].rename(columns={'value': 'Academy Balance'})
central_services_financial_position = ar[(ar['aruniquereference'] == 'BAB030-T') | (ar['aruniquereference'] == 'BTB030')].groupby('trustupin')['value'].sum().rename('Central Services Balance')
trust_financial_position = ar[(ar['aruniquereference'] == 'BTB030')][['trustupin', 'value']].rename(columns={'value': 'Trust Balance'}).set_index('trustupin')

ar = (ar.join(pfi_schools, how='left')
      .join(academy_financial_position, how='left')
      .join(trust_financial_position, on='trustupin', how='left')
      .join(central_services_financial_position, on='trustupin', how='left'))
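The balance columns above are built by filtering rows on their cell reference, aggregating per trust, and joining the result back onto every row of that trust. A minimal sketch of that pattern on hypothetical values (only the cell references `BAB030-T` and `BTB030` are taken from this workbook):

```python
import pandas as pd

# Hypothetical AR rows: three academy balances and one trust-level balance.
rows = pd.DataFrame(
    {
        "academyupin": [1, 2, 3, None],
        "trustupin": [10, 10, 20, 10],
        "aruniquereference": ["BAB030-T", "BAB030-T", "BAB030-T", "BTB030"],
        "value": [50.0, 70.0, 30.0, 100.0],
    }
)

# Sum academy and trust balances per trust.
is_balance = rows["aruniquereference"].isin(["BAB030-T", "BTB030"])
trust_totals = (
    rows[is_balance]
    .groupby("trustupin")["value"]
    .sum()
    .rename("Central Services Balance")
)

# Joining on the trustupin column broadcasts each total to all rows of the trust.
result = rows.join(trust_totals, on="trustupin", how="left")
print(result)
```

Every row for trust 10 receives 220.0 (50 + 70 + 100), including the trust-level row itself.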


In [None]:
ar.to_csv('output/pre-processing/ar.csv')
ar

academyupin,trustupin,companynumber,aruniquereference,value,Description L1,Metric,Metric ID,Cost Pool,Presentation name,Cost Pool ID,PFI School,Academy Balance,Trust Balance,Central Services Balance
111443,137157,8146633,BAE010-T,5687.0,Staff costs,Teaching staff,1,Teaching and Teaching support staff,Teaching and Teaching support staff,1,Non-PFI school,59.0,359.0,3506.0
111443,137157,8146633,BAE020-T,66.0,Staff costs,Supply teaching staff,5,Teaching and Teaching support staff,Teaching and Teaching support staff,1,Non-PFI school,59.0,359.0,3506.0
111443,137157,8146633,BAE030-T,916.0,Staff costs,Education support staff,2,Teaching and Teaching support staff,Teaching and Teaching support staff,1,Non-PFI school,59.0,359.0,3506.0
111443,137157,8146633,BAE040-T,623.0,Staff costs,Administrative and clerical staff,6,Non-educational support staff,Non-educational support staff,2,Non-PFI school,59.0,359.0,3506.0
111443,137157,8146633,BAE050-T,390.0,Staff costs,Premises staff,14,Premises,Premises,5,Non-PFI school,59.0,359.0,3506.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,136961,9040380,BTB030,136.0,Closing balance,Closing balance (Restricted and Unrestricted F...,0,,,0,,,136.0,2634.0
,141510,10269490,BTB030,0.0,Closing balance,Closing balance (Restricted and Unrestricted F...,0,,,0,,,0.0,6.0
,137157,8146633,BTB030,359.0,Closing balance,Closing balance (Restricted and Unrestricted F...,0,,,0,,,359.0,3506.0
,136943,7973953,BTB030,0.0,Closing balance,Closing balance (Restricted and Unrestricted F...,0,,,0,,,0.0,338.0


Create a summary table of the AR financial position of each distinct academy in the table.

In [None]:
academy_ar = ar.reset_index().drop_duplicates(subset=['academyupin'], ignore_index=True)[['academyupin', 'Academy Balance', 'Trust Balance', 'Central Services Balance', 'PFI School']].set_index('academyupin')

academy_ar['Central Services Financial Position'] = academy_ar['Central Services Balance'].map(mappings.map_is_surplus_deficit)
academy_ar['Academy Financial Position'] = academy_ar['Academy Balance'].map(mappings.map_is_surplus_deficit) 
academy_ar['Trust Financial Position'] = academy_ar['Trust Balance'].map(mappings.map_is_surplus_deficit) 
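The implementation of `mappings.map_is_surplus_deficit` is not shown in this workbook. The function below is a hypothetical stand-in (the name `classify_balance` is an assumption), written only to be consistent with the `academy_ar` rows displayed here, where a positive balance is labelled Surplus, zero or negative is labelled Deficit, and a missing balance is labelled Unknown:

```python
import math


def classify_balance(balance):
    """Hypothetical stand-in for mappings.map_is_surplus_deficit."""
    # A missing balance (None or NaN) cannot be classified.
    if balance is None or (isinstance(balance, float) and math.isnan(balance)):
        return "Unknown"
    # Assumption drawn from the displayed rows: a zero balance counts as Deficit.
    return "Surplus" if balance > 0 else "Deficit"


for value in (59.0, 0.0, -115.0, float("nan")):
    print(value, classify_balance(value))
```

Whether a zero balance should really be reported as a deficit may be worth checking against the source document during the gap analysis.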

In [None]:
academy_ar

academyupin,Academy Balance,Trust Balance,Central Services Balance,PFI School,Central Services Financial Position,Academy Financial Position,Trust Financial Position
111443,59.0,359.0,3506.0,Non-PFI school,Surplus,Surplus,Surplus
111451,0.0,1909.0,1909.0,Non-PFI school,Surplus,Deficit,Surplus
111453,0.0,3873.0,3873.0,Non-PFI school,Surplus,Deficit,Surplus
111710,1442.0,,1442.0,Non-PFI school,Surplus,Surplus,Unknown
113087,1655.0,0.0,1655.0,Non-PFI school,Surplus,Surplus,Deficit
...,...,...,...,...,...,...,...
163968,488.0,42.0,2672.0,Non-PFI school,Surplus,Surplus,Surplus
163969,0.0,0.0,0.0,Non-PFI school,Deficit,Deficit,Deficit
163970,-115.0,-1127.0,1544.0,Non-PFI school,Surplus,Deficit,Deficit
164084,31.0,-270.0,1394.0,Non-PFI school,Surplus,Surplus,Deficit


## Academy and maintained schools data load and preparation

This reads the main GIAS data (edubasealldataYYYYMMDD file) and the associated links file (links_edubasealldataYYYYMMDD file). Both are taken from the [GIAS Service](https://get-information-schools.service.gov.uk/help).

Other columns are tidied up by asserting the correct type for that column. This tidying phase is needed largely because, on load, integer columns that contain missing values would otherwise be inferred as float rather than integer.
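The type inference issue can be reproduced on a small in-memory CSV. This sketch uses hypothetical rows and assumes the nullable `Int64` dtype is what the schema dictionaries in `schemas` supply for such columns:

```python
import io

import pandas as pd

# Hypothetical two-row extract; the second row has no establishment number.
raw = io.StringIO("URN,EstablishmentNumber\n100000,3614\n100001,\n")

# Without a dtype, the missing value forces the column to float64.
inferred = pd.read_csv(raw)

# With the nullable Int64 dtype, the column stays integer and the gap becomes <NA>.
raw.seek(0)
typed = pd.read_csv(raw, dtype={"URN": "Int64", "EstablishmentNumber": "Int64"})

print(inferred.dtypes)
print(typed.dtypes)
```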

In [None]:
gias = pd.read_csv('data/edubasealldata20240312.csv', encoding='cp1252', 
                   index_col=schemas.gias_index_col, usecols=schemas.gias.keys(), dtype=schemas.gias)

gias_links = pd.read_csv('data/links_edubasealldata20240312.csv', encoding='cp1252', 
                         index_col=schemas.gias_links_index_col, usecols=schemas.gias_links.keys(), dtype=schemas.gias_links)

# GIAS transformations
gias['LA Establishment Number'] = gias['LA (code)'] + '-' + gias['EstablishmentNumber'].astype('string')
gias['LA Establishment Number'] = gias['LA Establishment Number'].astype('string')

gias['OpenDate'] = pd.to_datetime(gias['OpenDate'], dayfirst=True, format='mixed')
gias['CloseDate'] = pd.to_datetime(gias['CloseDate'], dayfirst=True, format='mixed')
gias['SchoolWebsite'] = gias['SchoolWebsite'].fillna('').map(mappings.map_school_website)
gias['Boarders (name)'] = gias['Boarders (name)'].fillna('').map(mappings.map_boarders)
gias['OfstedRating (name)'] = gias['OfstedRating (name)'].fillna('').map(mappings.map_ofsted_rating)
gias['NurseryProvision (name)'] = gias['NurseryProvision (name)'].fillna('')
gias['OfficialSixthForm (name)'] = gias['OfficialSixthForm (name)'].fillna('').map(mappings.map_sixth_form)
gias['AdmissionsPolicy (name)'] = gias['AdmissionsPolicy (name)'].fillna('').map(mappings.map_admission_policy)
gias['HeadName'] = gias['HeadTitle (name)'] + ' ' + gias['HeadFirstName'] + ' ' + gias['HeadLastName']

In the following cell, we find all the predecessor and merged links. The links are then ranked within each URN, ordered by `LinkEstablishedDate` with the most recent first. The linked GIAS data is then joined to the base GIAS data, which creates the overall school dataset. This dataset is then filtered to schools that are open (`CloseDate` is null) and, for schools that have links, to the link ranked 1.

In [None]:
gias_links = gias_links[
    gias_links['LinkType'].isin(['Predecessor', 
                                 'Predecessor - amalgamated', 
                                 'Predecessor - Split School', 
                                 'Predecessor - merged', 
                                 'Merged - expansion of school capacity', 
                                 'Merged - change in age range'])
].sort_values(by='LinkEstablishedDate', ascending=False)

gias_links['Rank'] = gias_links.groupby('URN').cumcount() + 1
gias_links['Rank'] = gias_links['Rank'].astype('Int64')

schools = gias.join(gias_links, on='URN', how='left', rsuffix='_links', lsuffix='_school').sort_values(by='URN')

schools = schools[
    schools['CloseDate'].isna() & ((schools['Rank'] == 1) | (schools['Rank'].isna()))
].drop(columns=['LinkURN', 'LinkName', 'LinkType', 'LinkEstablishedDate', 'Rank'])
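The ranking step relies on sorting first and then numbering rows within each group. A minimal sketch on hypothetical links (the URNs, link URNs, and dates are made up):

```python
import pandas as pd

# Hypothetical links: school 1 has two predecessors, school 2 has one.
links = pd.DataFrame(
    {
        "URN": [1, 1, 2],
        "LinkURN": [11, 12, 21],
        "LinkEstablishedDate": pd.to_datetime(
            ["2015-09-01", "2020-09-01", "2018-01-01"]
        ),
    }
)

# Most recent link first, then number the rows within each URN from 1.
links = links.sort_values("LinkEstablishedDate", ascending=False)
links["Rank"] = links.groupby("URN").cumcount() + 1

# Keeping Rank == 1 leaves the most recent link per school.
latest = links[links["Rank"] == 1].set_index("URN").sort_index()
print(latest)
```

Because `cumcount` follows the current row order, the sort must happen before the ranking; ties on the date keep whatever order the sort leaves them in.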

In [None]:
schools.to_csv('output/pre-processing/schools.csv')
schools.sort_index()

URN,LA (code),LA (name),EstablishmentNumber,EstablishmentName,TypeOfEstablishment (code),TypeOfEstablishment (name),EstablishmentStatus (code),EstablishmentStatus (name),OpenDate,CloseDate,...,UrbanRural (name),BoardingEstablishment (name),PreviousLA (code),PreviousLA (name),PreviousEstablishmentNumber,OfstedRating (name),MSOA (code),LSOA (code),LA Establishment Number,HeadName
100000,201,City of London,3614,The Aldgate School,2,Voluntary aided school,1,Open,NaT,NaT,...,(England/Wales) Urban major conurbation,,999,,,Outstanding,E02000001,E01032739,201-3614,Miss Alexandra Allan
100001,201,City of London,6005,City of London School for Girls,11,Other independent school,1,Open,1920-01-01,NaT,...,(England/Wales) Urban major conurbation,Does not have boarders,999,,,,E02000001,E01000002,201-6005,Mrs Jenny Brown
100002,201,City of London,6006,St Paul's Cathedral School,11,Other independent school,1,Open,1939-01-01,NaT,...,(England/Wales) Urban major conurbation,Has boarders,999,,,,E02000001,E01032739,201-6006,
100003,201,City of London,6007,City of London School,11,Other independent school,1,Open,1919-01-01,NaT,...,(England/Wales) Urban major conurbation,Does not have boarders,999,,,,E02000001,E01032739,201-6007,Mr Alan Bird
100005,202,Camden,1048,Thomas Coram Centre,15,Local authority nursery school,1,Open,NaT,NaT,...,(England/Wales) Urban major conurbation,,999,,,Outstanding,E02007115,E01000937,202-1048,Ms Perina Holness
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
402468,679,Monmouthshire,5500,King Henry viii 3-19 School,30,Welsh establishment,1,Open,2023-09-01,NaT,...,,,999,,,,999999999,999999999,679-5500,
402469,681,Cardiff,2333,Ysgol Gynradd Groes-Wen Primary,30,Welsh establishment,1,Open,2023-09-01,NaT,...,(England/Wales) Rural village,,999,,,,W02000380,W01001729,681-2333,
402470,668,Pembrokeshire,2398,Ysgol Bro Penfro,30,Welsh establishment,4,Proposed to open,2024-09-01,NaT,...,(England/Wales) Rural town and fringe,,999,,,,W02000140,W01000607,668-2398,
402471,679,Monmouthshire,2325,Ysgol Gymraeg Trefynwy,30,Welsh establishment,4,Proposed to open,2024-09-01,NaT,...,(England/Wales) Urban city and town,,999,,,,W02000339,W01001978,679-2325,


Merge the required GIAS, census, SEN, CDC, PFI, and AR data with the base academy data.

In [None]:
academies_list = pd.read_csv('data/master_list_raw.csv', encoding='utf8', index_col=schemas.academy_master_list_index_col, dtype=schemas.academy_master_list, usecols=schemas.academy_master_list.keys()).rename(columns={'UKPRN': 'Academy UKPRN'})

academies_base = academies_list.merge(schools.reset_index(), left_index=True, right_on='LA Establishment Number').set_index('URN')

academies = (academies_base.merge(census, on='URN', how='left')
             .merge(sen, on='URN', how='left')
             .merge(cdc, on='URN', how='left')
             .merge(academy_ar, left_on='Academy UPIN', right_index=True, how='left'))

academies['Type of Provision - Phase'] = academies.apply(lambda df: mappings.map_academy_phase_type(df['TypeOfEstablishment (code)'], df['Type of Provision - Phase']), axis=1)

# This fillna should not be needed, as the column comes from the original GIAS dataset where it was already filled, but it has to be repeated here.
academies['NurseryProvision (name)'] = academies['NurseryProvision (name)'].fillna('')
academies['NurseryProvision (name)'] = academies.apply(lambda df: mappings.map_nursery(df['NurseryProvision (name)'], df['Type of Provision - Phase']), axis=1)

academies['Status'] = academies.apply(lambda df: mappings.map_academy_status(pd.to_datetime(df['Date left or closed if in period']), 
                                                                             pd.to_datetime(df['Valid to']), 
                                                                             pd.to_datetime(df['OpenDate']), 
                                                                             pd.to_datetime(df['CloseDate']), 
                                                                             pd.to_datetime(accounts_return_period_start_date), pd.to_datetime(academy_year_start_date), pd.to_datetime(academy_year_end_date)), axis=1)

academies['SchoolPhaseType'] = academies.apply(lambda df: mappings.map_school_phase_type(df['TypeOfEstablishment (code)'], df['Type of Provision - Phase']), axis=1)

academies.rename(columns={
    'UKPRN_x':'UKPRN',
    '% of pupils known to be eligible for free school meals (Performa': 'Percentage Free school meals'
}, inplace=True)
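After these merges a URN is not guaranteed to be unique; the output of this workbook shows URN 149222 listed under two different trusts. A small sketch, on hypothetical rows, of how repeated URNs could be listed for review:

```python
import pandas as pd

# Hypothetical merged result in which school 1 appears under two trusts.
merged = pd.DataFrame(
    {"Academy Trust Name": ["Trust A", "Trust B", "Trust C"]},
    index=pd.Index([1, 1, 2], name="URN"),
)

# keep=False marks every occurrence of a repeated URN, not just the later ones.
repeated = merged[merged.index.duplicated(keep=False)]
print(repeated)
```

Whether such rows are legitimate (for example a school that moved between trusts in the period) or a data issue is a question for the gap analysis rather than something this check can decide.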

In [None]:
academies.to_csv('output/pre-processing/academies.csv')
academies.sort_index()

URN,Company Registration Number,Incorporation Date,Academy Trust UPIN,Academy UKPRN,Academy Trust Name,Academy Name,Academy UPIN,Trust Type,Date Opened,LA Name,...,Age Average Score,Academy Balance,Trust Balance,Central Services Balance,PFI School,Central Services Financial Position,Academy Financial Position,Trust Financial Position,Status,SchoolPhaseType
101849,02369239,1989-04-06 00:00:00.0000000,134879,10058152,The BRIT School Limited,BRIT School for Performing Arts and Technology,118384,Single Academy Trust (SAT),1991-10-22 00:00:00.0000000,Croydon London Borough Council,...,43.834208,910.0,,910.0,Non-PFI school,Surplus,Surplus,Unknown,Open,Secondary
105135,05210075,2004-08-19 00:00:00.0000000,135025,10058185,St Paul's Academy,St Paul's Academy - Greenwich,119110,Single Academy Trust (SAT),2005-09-01 00:00:00.0000000,Greenwich London Borough Council,...,16.7463,2916.0,,2916.0,Non-PFI school,Surplus,Surplus,Unknown,Open,Secondary
108420,04464331,2002-06-19 00:00:00.0000000,139672,10064192,Emmanuel Schools Foundation,Emmanuel College - Gateshead,118390,Multi Academy Trust (MAT),1991-04-05 00:00:00.0000000,Gateshead Council,...,37.0,805.0,1024.0,1829.0,Non-PFI school,Surplus,Surplus,Surplus,Open,Secondary
123627,02414699,1989-08-18 00:00:00.0000000,134882,10058155,Telford City Technology College Trust Limited,Thomas Telford School,118401,Single Academy Trust (SAT),1992-02-07 00:00:00.0000000,Telford and Wrekin Council,...,23.204891,9101.0,,9101.0,Non-PFI school,Surplus,Surplus,Unknown,Open,Secondary
129342,07525820,2011-02-10 00:00:00.0000000,135906,10058513,Tove Learning Trust,Grace Academy Solihull,118615,Multi Academy Trust (MAT),2006-09-01 00:00:00.0000000,Solihull Metropolitan Borough Council,...,16.930612,843.0,2428.0,9022.0,Non-PFI school,Surplus,Surplus,Surplus,Open,Secondary
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149222,10192252,2016-05-20 00:00:00.0000000,139657,10064173,Connect Academy Trust,Cockington Primary School,163969,Multi Academy Trust (MAT),2013-09-01 00:00:00.0000000,Torbay Council,...,,0.0,0.0,0.0,Non-PFI school,Deficit,Deficit,Deficit,(Re)opened in period,Primary
149222,07668923,2011-06-14 00:00:00.0000000,135225,10058887,Coast Academies,Cockington Primary School,163969,Multi Academy Trust (MAT),2013-09-01 00:00:00.0000000,Torbay Council,...,,0.0,0.0,0.0,Non-PFI school,Deficit,Deficit,Deficit,Closed in period,Primary
149299,09066969,2014-06-02 00:00:00.0000000,135834,10060816,Blessed Christopher Wharton Catholic Academy T...,"The Holy Family Catholic School, a Voluntary A...",162962,Multi Academy Trust (MAT),2022-08-01 00:00:00.0000000,Bradford Metropolitan District Council,...,41.71524,10.0,94.0,1843.0,Non-PFI school,Surplus,Surplus,Surplus,(Re)opened in period,Secondary
149396,08075785,2012-05-18 00:00:00.0000000,138199,10048061,The White Horse Federation,Peak Academy,164084,Multi Academy Trust (MAT),2012-09-01 00:00:00.0000000,Gloucestershire County Council,...,,31.0,-270.0,1394.0,Non-PFI school,Surplus,Surplus,Deficit,Closed in period,Special


Merge the required census and CDC data into the maintained schools dataset.

In [9]:
# Load raw list from CSV
maintained_schools_list = pd.read_csv('data/maintained_schools_raw.csv', encoding='utf8', index_col=schemas.maintained_schools_master_list_index_col, usecols=schemas.maintained_schools_master_list.keys(), dtype=schemas.maintained_schools_master_list)

In [None]:
# Merge maintained_schools_list with schools (metadata from GIAS)
maintained_schools = maintained_schools_list.merge(schools.reset_index(), left_index=True, right_on='URN')


In [None]:
# Merge in census and cdc data
maintained_schools = (maintained_schools
                      .merge(census, on='URN', how='left')
                      .merge(cdc, on='URN', how='left'))



In [None]:
# Compute columns
maintained_schools['PFI'] = maintained_schools['PFI'].map(lambda x: 'PFI school' if x == 'Y' else 'Non-PFI school')
maintained_schools['Status'] = maintained_schools.apply(lambda df: mappings.map_maintained_school_status(df['OpenDate'], df['CloseDate'], df['Period covered by return (months)'], pd.to_datetime(maintained_schools_year_start_date), pd.to_datetime(maintained_schools_year_end_date)), axis=1)
maintained_schools['School Balance'] = maintained_schools['Total Income   I01 to I18'] - maintained_schools['Total Expenditure  E01 to E32']
maintained_schools['School Financial Position'] = maintained_schools['School Balance'].map(mappings.map_is_surplus_deficit)
maintained_schools['SchoolPhaseType'] = maintained_schools.apply(lambda df: mappings.map_school_phase_type(df['TypeOfEstablishment (code)'], df['Overall Phase']), axis=1)
maintained_schools['Partial Years Present'] = maintained_schools['Period covered by return (months)'].map(lambda x: x != 12)
maintained_schools['Did Not Submit'] = maintained_schools['Did Not Supply flag'].map(lambda x: x == 1)
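The balance and the two boolean flags above can also be written as column-wise operations instead of `map` with a lambda. This is a sketch on hypothetical data; the short `income` and `expenditure` column names are stand-ins for the longer total columns used in the cell above:

```python
import pandas as pd

# Hypothetical returns: one full-year submission and one partial non-submission.
returns = pd.DataFrame(
    {
        "Period covered by return (months)": [12, 7],
        "Did Not Supply flag": [0, 1],
        "income": [1000.0, 400.0],
        "expenditure": [900.0, 450.0],
    }
)

# Balance is total income minus total expenditure.
returns["School Balance"] = returns["income"] - returns["expenditure"]

# .ne(12) and .eq(1) return boolean Series, equivalent to the lambdas.
returns["Partial Years Present"] = returns["Period covered by return (months)"].ne(12)
returns["Did Not Submit"] = returns["Did Not Supply flag"].eq(1)
print(returns)
```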

In [None]:
# Rename columns
maintained_schools.rename(columns={
    'PFI': 'PFI School',
    '% of pupils eligible for FSM': 'Percentage Free school meals',
    'No Pupils': 'Number of pupils',
}, inplace=True)

In [None]:
maintained_schools.set_index('URN', inplace=True)

In [None]:
maintained_schools

URN,Phase,Overall Phase,Lowest age of pupils,Highest age of pupils,Type,Number of pupils,Percentage Free school meals,Period covered by return (months),Did Not Supply flag,Federation,...,Total Number of Teachers (Full-Time Equivalent),Pupil: Teacher Ratio (Full-Time Equivalent of qualified and unqualified teachers),Number of Vacant Teacher Posts,Total Internal Floor Area,Age Average Score,Status,School Balance,School Financial Position,Partial Years Present,SchoolPhaseType
100000,Infant and junior,Primary,3.0,11.0,Voluntary aided school,271.0,18.1,12,0,No,...,15.64,17.3,0.0,3628.0,142.0,Open,-103843.66,Deficit,False,Primary
100005,Nursery,Nursery,2.0,5.0,Local authority nursery school,107.5,38.2,12,0,No,...,4.31,24.9,0.0,,,Open,-159542.44,Deficit,False,Nursery
100006,Pupil referral unit,Pupil referral unit,11.0,16.0,Pupil referral unit,49.0,68.4,12,0,No,...,9.80,5.0,1.0,1523.0,59.797111,Open,488879.73,Surplus,False,Pupil referral unit
100007,Pupil referral unit,Pupil referral unit,5.0,11.0,Pupil referral unit,19.0,100.0,12,0,No,...,15.80,1.2,0.0,,,Open,-70514.23,Deficit,False,Pupil referral unit
100008,Infant and junior,Primary,3.0,11.0,Community school,350.0,55.1,12,0,No,...,18.60,18.8,0.0,2951.0,117.0,Open,-116875.63,Deficit,False,Primary
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131818,Infant and junior,Primary,3.0,11.0,Community school,417.0,52.0,12,0,No,...,25.00,16.7,0.0,2819.0,56.148634,Open,-121353.13,Deficit,False,Primary
132176,Infant and junior,Primary,3.0,11.0,Voluntary aided school,278.0,45.9,12,0,No,...,14.40,19.3,2.0,2872.0,27.0,Open,89141.77,Surplus,False,Primary
132793,Infant and junior,Primary,4.0,11.0,Voluntary aided school,411.0,28.2,12,0,No,...,19.40,21.2,0.0,1639.0,117.0,Open,-92369.69,Deficit,False,Primary
132796,Infant and junior,Primary,2.0,11.0,Voluntary aided school,473.0,29.6,12,0,No,...,23.91,19.8,0.0,2736.0,17.0,Open,-161742.84,Deficit,False,Primary


In [None]:
maintained_schools.to_csv('output/pre-processing/maintained_schools.csv')


In [None]:
print(f'Processing Time: {time.time() - start_time} seconds')

Processing Time: 37.596153259277344 seconds


## Federation and MAT capture



In [10]:
group_links = pd.read_csv('data/alllinksdata20240417.csv', encoding='unicode-escape')