In [22]:
import numpy as np
%load_ext dotenv
%dotenv



# VMFI Data processing pipeline

This workbook aims to emulate the data processing that currently occurs in the VMFI pipeline. The logic and processing are largely based on the document [Insights data portal - Data sources and sql analysis](https://educationgovuk.sharepoint.com.mcas.ms/:w:/r/sites/VMFI/_layouts/15/Doc.aspx?sourcedoc=%7B38C1DC37-7CDB-48B8-9E22-284F4F311C0B%7D&file=1.%20Insights%20portal%20-%20data%20sources%20and%20sql%20analysis%20v010%20-%20Copy.docx&action=default&mobileredirect=true) and will stay true to that document even where the existing stored procedures do something different. This will form the basis of a gap analysis going forward.

All data loaded in this workbook comes from the set of CSV files in the `data` folder alongside this workbook. These datasets are, for the most part, from the list at the start of the linked document. However, additional standing data is required to fully implement the pipeline, so this data has been exported from the development VMFI pipeline database. These files are currently:

| File name | DB Table |
|:----------|----------|
|standing_data_cdc.csv | standing_data.cdc |

In [23]:
import pandas as pd
import mappings as mappings
import schemas as schemas
import datetime

from pathlib import Path
Path("output/pre-processing").mkdir(parents=True, exist_ok=True)

current_year = 2022
accounts_return_period_start_date = datetime.date(current_year - 1, 9, 10)
academy_year_start_date = datetime.date(current_year - 1, 9, 1)
academy_year_end_date = datetime.date(current_year, 8, 31)  # academy year runs 1 September to 31 August
maintained_schools_year_start_date = datetime.date(current_year - 1, 4, 1)  # start must precede the 31 March end date
maintained_schools_year_end_date = datetime.date(current_year, 3, 31)

## GIAS data load and preparation

This reads the main GIAS data (`edubasealldataYYYYMMDD` file) and the associated links file (`links_edubasealldataYYYYMMDD` file). Both are taken from the [GIAS Service](https://get-information-schools.service.gov.uk/help).

Other columns are tidied up by asserting the correct type for each column. This tidying phase is needed largely because, on load, integer columns would otherwise be inferred as floats rather than integers.
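The type inference issue can be seen on a toy file. This is a minimal sketch with made-up URNs and values; it assumes the schema dictionaries in `schemas` declare pandas' nullable `Int64` dtype for such columns, which is not confirmed here.

```python
import io

import pandas as pd

# Toy CSV with a missing value in an integer column (illustrative data only).
csv_text = "URN,EstablishmentNumber\n1,3614\n2,\n"

# Default inference: the missing value forces the whole column to float64.
inferred = pd.read_csv(io.StringIO(csv_text), index_col="URN")

# Declaring the nullable integer dtype keeps whole numbers and stores <NA> for gaps.
typed = pd.read_csv(
    io.StringIO(csv_text),
    index_col="URN",
    dtype={"EstablishmentNumber": "Int64"},
)

print(inferred["EstablishmentNumber"].dtype)
print(typed["EstablishmentNumber"].dtype)
```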

In [24]:
gias = pd.read_csv('data/edubasealldata20240312.csv', encoding='cp1252', 
                   index_col=schemas.gias_index_col, usecols=schemas.gias.keys(), dtype=schemas.gias)

gias_links = pd.read_csv('data/links_edubasealldata20240312.csv', encoding='cp1252', 
                         index_col=schemas.gias_links_index_col, usecols=schemas.gias_links.keys(), dtype=schemas.gias_links)

# GIAS transformations
gias['LA Establishment Number'] = gias['LA (code)'] + '-' + gias['EstablishmentNumber'].astype('string')
gias['LA Establishment Number'] = gias['LA Establishment Number'].astype('string')

gias['OpenDate'] = pd.to_datetime(gias['OpenDate'], dayfirst=True, format='mixed')
gias['CloseDate'] = pd.to_datetime(gias['CloseDate'], dayfirst=True, format='mixed')
gias['SchoolWebsite'] = gias['SchoolWebsite'].fillna('').map(mappings.map_school_website)
gias['Boarders (name)'] = gias['Boarders (name)'].fillna('').map(mappings.map_boarders)
gias['OfstedRating (name)'] = gias['OfstedRating (name)'].fillna('').map(mappings.map_ofsted_rating)
gias['NurseryProvision (name)'] = gias['NurseryProvision (name)'].fillna('')
gias['OfficialSixthForm (name)'] = gias['OfficialSixthForm (name)'].fillna('').map(mappings.map_sixth_form)
gias['AdmissionsPolicy (name)'] = gias['AdmissionsPolicy (name)'].fillna('').map(mappings.map_admission_policy)
gias['Head Name'] = gias['HeadTitle (name)'] + ' ' + gias['HeadFirstName'] + ' ' + gias['HeadLastName']

In the following cell, we select all the predecessor and merged links. The links are sorted by 'LinkEstablishedDate' (most recent first) and ranked within each URN, so that the most recently established link has Rank 1. The linked GIAS data is then joined to the base GIAS data to create the overall school dataset. This dataset is then filtered to schools that are open (CloseDate is null) and that either have no links or whose link is ranked 1.
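The ranking and filtering steps can be illustrated on toy data. This sketch uses made-up URNs and, as an assumption that goes beyond the cell below, parses the link dates to datetimes first so that the sort is chronological rather than alphabetical.

```python
import pandas as pd

# Toy links: URN 1 has two predecessor links, URN 2 has one (illustrative values).
links = pd.DataFrame(
    {
        "URN": [1, 1, 2],
        "LinkURN": [10, 11, 20],
        "LinkEstablishedDate": ["31-08-2023", "01-09-2019", "15-01-2021"],
    }
)

# Assumption: parse day-first strings so sorting compares real dates.
links["LinkEstablishedDate"] = pd.to_datetime(
    links["LinkEstablishedDate"], format="%d-%m-%Y"
)

# Most recent link first, then number the links within each URN starting at 1.
links = links.sort_values("LinkEstablishedDate", ascending=False)
links["Rank"] = links.groupby("URN").cumcount() + 1

# Toy schools: URN 2 is closed, URN 3 has no links at all.
schools = pd.DataFrame(
    {"CloseDate": [pd.NaT, pd.NaT, pd.Timestamp("2020-07-31")]},
    index=pd.Index([1, 3, 2], name="URN"),
)

joined = schools.join(links.set_index("URN"), how="left")

# Keep open schools, with either their rank-1 link or no link.
open_schools = joined[
    joined["CloseDate"].isna() & ((joined["Rank"] == 1) | joined["Rank"].isna())
]
print(open_schools)
```

In this toy case URN 1 keeps only its most recent link (LinkURN 10), URN 3 is kept with empty link columns, and the closed URN 2 is dropped.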

In [25]:
gias_links = gias_links[
    gias_links['LinkType'].isin(['Predecessor', 
                                 'Predecessor - amalgamated', 
                                 'Predecessor - Split School', 
                                 'Predecessor - merged', 
                                 'Merged - expansion of school capacity', 
                                 'Merged - change in age range'])
].sort_values(by='LinkEstablishedDate', ascending=False)

gias_links['Rank'] = gias_links.groupby('URN').cumcount() + 1
gias_links['Rank'] = gias_links['Rank'].astype('Int64')

schools = gias.join(gias_links, on='URN', how='left', rsuffix='_links', lsuffix='_school').sort_values(by='URN')

schools = schools[
    schools['CloseDate'].isna() & ((schools['Rank'] == 1) | (schools['Rank'].isna()))
]

In [26]:
schools.to_csv('output/pre-processing/schools.csv')
schools.sort_index()

URN,LA (code),LA (name),EstablishmentNumber,EstablishmentName,TypeOfEstablishment (code),TypeOfEstablishment (name),EstablishmentStatus (code),EstablishmentStatus (name),OpenDate,CloseDate,...,OfstedRating (name),MSOA (code),LSOA (code),LA Establishment Number,Head Name,LinkURN,LinkName,LinkType,LinkEstablishedDate,Rank
100000,201,City of London,3614,The Aldgate School,2,Voluntary aided school,1,Open,NaT,NaT,...,Outstanding,E02000001,E01032739,201-3614,Miss Alexandra Allan,,,,,
100001,201,City of London,6005,City of London School for Girls,11,Other independent school,1,Open,1920-01-01,NaT,...,,E02000001,E01000002,201-6005,Mrs Jenny Brown,,,,,
100002,201,City of London,6006,St Paul's Cathedral School,11,Other independent school,1,Open,1939-01-01,NaT,...,,E02000001,E01032739,201-6006,,,,,,
100003,201,City of London,6007,City of London School,11,Other independent school,1,Open,1919-01-01,NaT,...,,E02000001,E01032739,201-6007,Mr Alan Bird,,,,,
100005,202,Camden,1048,Thomas Coram Centre,15,Local authority nursery school,1,Open,NaT,NaT,...,Outstanding,E02007115,E01000937,202-1048,Ms Perina Holness,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
402468,679,Monmouthshire,5500,King Henry viii 3-19 School,30,Welsh establishment,1,Open,2023-09-01,NaT,...,,999999999,999999999,679-5500,,402093,Deri View Primary,Predecessor - amalgamated,31-08-2023,1
402469,681,Cardiff,2333,Ysgol Gynradd Groes-Wen Primary,30,Welsh establishment,1,Open,2023-09-01,NaT,...,,W02000380,W01001729,681-2333,,,,,,
402470,668,Pembrokeshire,2398,Ysgol Bro Penfro,30,Welsh establishment,4,Proposed to open,2024-09-01,NaT,...,,W02000140,W01000607,668-2398,,,,,,
402471,679,Monmouthshire,2325,Ysgol Gymraeg Trefynwy,30,Welsh establishment,4,Proposed to open,2024-09-01,NaT,...,,W02000339,W01001978,679-2325,,,,,,


## Academy and Maintained school lookup tables

Due to the way schools can be merged or re-assigned URNs, we have to create the following lookup tables so that we can efficiently join the datasets loaded below to the correct school records.

In [27]:
academies_list = pd.read_csv('data/master_list_raw.csv', encoding='utf8', index_col=schemas.academy_master_list_index_col, dtype=schemas.academy_master_list, usecols=schemas.academy_master_list.keys())

maintained_schools_list = pd.read_csv('data/maintained_schools_raw.csv', encoding='utf8', index_col=schemas.maintained_schools_master_list_index_col, usecols=schemas.maintained_schools_master_list.keys(), dtype=schemas.maintained_schools_master_list)

academies_lookup = academies_list.merge(schools.reset_index(), left_index=True, right_on='LA Establishment Number')[
    ['URN', 'LinkURN', 'LA Establishment Number', 'Academy UPIN', 'Academy Trust UPIN']
].set_index('URN')

maintained_schools_lookup = maintained_schools_list.merge(schools.reset_index(), left_index=True, right_on='URN')[
    ['URN', 'LinkURN', 'LA Establishment Number']
].set_index('URN')

In [28]:
academies_lookup

URN,LinkURN,LA Establishment Number,Academy UPIN,Academy Trust UPIN
148853,114221,840-3130,163480,139821
144542,111642,808-2343,138448,139821
144551,133301,808-2006,138465,139821
148854,114258,840-3441,163504,139821
136730,131185,826-4097,119734,134890
...,...,...,...,...
147031,141498,881-4033,151873,136062
145812,114802,881-2191,140554,136062
138865,131654,881-4007,121878,136062
136577,115319,881-5403,119652,136062


In [29]:
maintained_schools_lookup

URN,LinkURN,LA Establishment Number
100000,,201-3614
100005,,202-1048
100006,134643,202-1100
100007,,202-1101
100008,,202-2019
...,...,...
131818,104536,341-2230
132176,104606,341-2232
132793,104654,341-2233
132796,131028,341-2234


## CDC data load and preparation

School buildings condition dataset, based on the surveys performed throughout 2018-2019.

The data in the file `data/standing_data_cdc.csv` is just an export of the data in the `standing_data.cdc` table, without the Year and Import ID fields. In future this will likely have to be read directly from the source database as per [this document](https://educationgovuk.sharepoint.com.mcas.ms/:w:/r/sites/VMFI/_layouts/15/Doc.aspx?sourcedoc=%7B38C1DC37-7CDB-48B8-9E22-284F4F311C0B%7D&file=1.%20Insights%20portal%20-%20data%20sources%20and%20sql%20analysis%20v010%20-%20Copy.docx&action=default&mobileredirect=true).

In [30]:
cdc = pd.read_csv('data/standing_data_cdc.csv', encoding='utf8', index_col=schemas.cdc_index_col, usecols=schemas.cdc.keys(), dtype=schemas.cdc)

cdc['Total Internal Floor Area'] = cdc.groupby(by=['URN'])['GIFA'].sum()
cdc['Proportion Area'] = (cdc['GIFA'] / cdc['Total Internal Floor Area'])
cdc['Indicative Age'] = cdc['Block Age'].fillna('').map(mappings.map_block_age).astype('Int64')
cdc['Age Score'] = cdc['Proportion Area'] * (current_year - cdc['Indicative Age'])
cdc['Score'] = cdc.groupby(by=['URN'])['Age Score'].sum()
cdc = cdc[['Total Internal Floor Area', 'Score']].drop_duplicates()
cdc = pd.concat([
    academies_lookup.join(cdc, on='URN', how='left'),
    maintained_schools_lookup.join(cdc, on='URN', how='left')
]).sort_index()
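The score computed above is an area-weighted average building age per school. The worked example below reproduces the arithmetic on toy blocks; the values are made up, and it assumes `Indicative Age` holds an indicative construction year, as implied by the subtraction from `current_year`.

```python
import pandas as pd

current_year = 2022

# Toy blocks: school 1 has two blocks, school 2 has one (illustrative values).
blocks = pd.DataFrame(
    {
        "URN": [1, 1, 2],
        "GIFA": [300.0, 100.0, 500.0],
        "Indicative Age": [2002, 1962, 2012],
    }
)

# Share of each block in its school's total internal floor area.
total_area = blocks.groupby("URN")["GIFA"].transform("sum")
blocks["Proportion Area"] = blocks["GIFA"] / total_area

# Weight each block's age in years by its share of the floor area.
blocks["Age Score"] = blocks["Proportion Area"] * (
    current_year - blocks["Indicative Age"]
)

# One row per school: total area and area-weighted age.
summary = pd.DataFrame(
    {
        "Total Internal Floor Area": blocks.groupby("URN")["GIFA"].sum(),
        "Score": blocks.groupby("URN")["Age Score"].sum(),
    }
)
print(summary)
```

For school 1 the two blocks contribute 0.75 × 20 and 0.25 × 60 years, giving a score of 30; school 2 has a single 10-year-old block and scores 10.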


In [31]:
cdc.to_csv('output/pre-processing/cdc.csv')
cdc[cdc['Academy Trust UPIN'] == 122824]

URN,LinkURN,LA Establishment Number,Academy UPIN,Academy Trust UPIN,Total Internal Floor Area,Score
130912,101150,213-6905,118638,122824,14121.0,17.0
131749,106647,370-6905,118649,122824,12176.0,17.0
131895,107127,373-6905,118640,122824,12288.0,17.0
131896,107147,373-6906,118642,122824,25042.0,17.0
132727,103102,320-6905,118656,122824,13821.0,16.776427
...,...,...,...,...,...,...
147467,138171,839-4008,162328,122824,,
148003,100743,209-4001,162199,122824,,
148357,138193,839-2010,163257,122824,,
148393,133351,355-4005,133267,122824,,


## School Census data load

*Pupil Census* - DfE data collection providing information about school and pupil characteristics, for example the percentage of pupils claiming free school meals, or having English as their second language.

*Workforce census* - Single reference for all school workforce statistics based on staff working in publicly funded schools in England.

The following code loads both the workforce and pupil census data and performs an `inner` join by URN on the data sets.
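An `inner` join keeps only the URNs present in both datasets. The toy sketch below, with made-up URNs and values, shows the effect and one way of listing the schools that drop out.

```python
import pandas as pd

# Toy pupil census: three schools (illustrative values only).
pupils = pd.DataFrame(
    {"headcount of pupils": [271, 136, 19]},
    index=pd.Index([1, 2, 3], name="URN"),
)

# Toy workforce census: URN 2 is missing.
workforce = pd.DataFrame(
    {"Total Number of Teachers (Full-Time Equivalent)": [15.64, 9.80]},
    index=pd.Index([1, 3], name="URN"),
)

# Inner join: only URNs found in both frames survive.
census = pupils.join(workforce, how="inner")

# URNs that had pupil data but no workforce data.
missing = pupils.index.difference(census.index)

print(census)
print(list(missing))
```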

In [32]:
school_workforce_census = pd.read_excel('data/School_Tables_School_Workforce_Census_2022.xlsx', header=5, index_col=schemas.workforce_census_index_col, usecols=schemas.workforce_census.keys(), dtype=schemas.workforce_census, na_values=["x","u"], keep_default_na=True).drop_duplicates()

school_pupil_census = pd.read_csv('data/standing_data_census_pupils.csv', encoding='utf8', index_col=schemas.pupil_census_index_col, usecols=schemas.pupil_census.keys(), dtype=schemas.pupil_census).drop_duplicates()

# Suffixes only apply to overlapping column names; the pupil census is the left-hand frame.
census = school_pupil_census.join(school_workforce_census, on='URN', how='inner', lsuffix='_pupil', rsuffix='_workforce')

census = pd.concat([
    academies_lookup.join(census, on='URN', how='left'),
    maintained_schools_lookup.join(census, on='URN', how='left')
]).sort_index()
                 

In [33]:
census.to_csv('output/pre-processing/census.csv')
census

URN,LinkURN,LA Establishment Number,Academy UPIN,Academy Trust UPIN,full time pupils,headcount of pupils,% of pupils known to be eligible for and claiming free school me,number of pupils whose first language is known or believed to be other than English,Total School Workforce (Full-Time Equivalent),Total Number of Teachers (Full-Time Equivalent),Pupil: Teacher Ratio (Full-Time Equivalent of qualified and unqualified teachers),Number of Vacant Teacher Posts
100000,,201-3614,,,271.0,271.0,9.2,145.0,41.62,15.64,17.3,0.0
100005,,202-1048,,,79.0,136.0,38.2,38.0,37.48,4.31,24.9,0.0
100006,134643,202-1100,,,19.0,19.0,0.0,5.0,26.30,9.80,5.0,1.0
100007,,202-1101,,,19.0,19.0,0.0,4.0,31.58,15.80,1.2,0.0
100008,,202-2019,,,350.0,350.0,53.1,306.0,45.22,18.60,18.8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
149222,140063,880-2006,163969,139657,492.0,492.0,23.6,77.0,59.25,25.39,19.4,0.0
149222,140063,880-2006,163969,135225,492.0,492.0,23.6,77.0,59.25,25.39,19.4,0.0
149299,107428,380-4094,162962,135834,862.0,862.0,23.3,388.0,126.89,64.89,13.3,1.0
149396,138429,916-7026,164084,139729,60.0,60.0,63.3,0.0,28.36,13.00,4.6,0.0


## Special Educational Needs (SEN) data load and preparation

Special educational needs dataset. It contains information about the number of pupils who require various SEN provisions. This loads the `SEN` data, which originates from [here](https://explore-education-statistics.service.gov.uk/find-statistics/special-educational-needs-in-england#dataDownloads-1).

In [34]:
sen = pd.read_csv('data/SEN.csv', encoding='cp1252', index_col=schemas.sen_index_col, dtype=schemas.sen, usecols=schemas.sen.keys())

sen['% of pupils with special educational needs support'] = (sen['EHC plan'] / sen['Total pupils']) * 100.0

sen = pd.concat([
    academies_lookup.join(sen, on='URN', how='left'),
    maintained_schools_lookup.join(sen, on='URN', how='left')
]).sort_index()
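The percentage above divides by `Total pupils`, so a school with a zero pupil count would produce a division by zero. The sketch below shows one possible guard on toy data. Treating a zero count as unknown is an assumption, not something the source document specifies, and `% EHC plan` is a hypothetical column name used only here.

```python
import pandas as pd

# Toy SEN data: the second school has no pupils recorded (illustrative values).
sen = pd.DataFrame(
    {"Total pupils": [271, 0, 60], "EHC plan": [8, 0, 60]},
    index=pd.Index([1, 2, 3], name="URN"),
)

# Assumption: a zero pupil count means "unknown", so mask it to NaN before dividing.
denominator = sen["Total pupils"].where(sen["Total pupils"] > 0)

# Same formula as the cell above, but rows with no pupils yield NaN instead of inf/NaN noise.
sen["% EHC plan"] = sen["EHC plan"] / denominator * 100.0

print(sen)
```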

In [35]:
sen.to_csv("output/pre-processing/sen.csv")
sen

URN,LinkURN,LA Establishment Number,Academy UPIN,Academy Trust UPIN,Total pupils,SEN support,EHC plan,EHC_Primary_need_spld,EHC_Primary_need_mld,EHC_Primary_need_sld,...,SUP_Primary_need_semh,SUP_Primary_need_slcn,SUP_Primary_need_hi,SUP_Primary_need_vi,SUP_Primary_need_msi,SUP_Primary_need_pd,SUP_Primary_need_asd,SUP_Primary_need_oth,SUP_Primary_need_nsa,% of pupils with special educational needs support
100000,,201-3614,,,271,59,8,0,0,0,...,8,30,2,0,2,0,4,0,7,2.95203
100005,,202-1048,,,136,23,2,0,0,0,...,0,0,0,0,0,1,21,1,0,1.470588
100006,134643,202-1100,,,19,13,0,0,0,0,...,10,1,0,0,0,0,0,0,1,0.0
100007,,202-1101,,,19,10,9,0,0,0,...,9,1,0,0,0,0,0,0,0,47.368421
100008,,202-2019,,,350,61,12,0,0,2,...,5,19,0,0,0,0,3,4,0,3.428571
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149222,140063,880-2006,163969,139657,492,68,17,0,2,0,...,25,5,0,0,0,1,0,0,14,3.455285
149222,140063,880-2006,163969,135225,492,68,17,0,2,0,...,25,5,0,0,0,1,0,0,14,3.455285
149299,107428,380-4094,162962,135834,862,144,34,2,3,0,...,36,10,8,2,0,6,10,2,0,3.944316
149396,138429,916-7026,164084,139729,60,0,60,3,5,0,...,0,0,0,0,0,0,0,0,0,100.0


## AR Data load and preparation

This loads the annual accounts return (AR) dataset and the corresponding mapping file. This extract only contains the benchmarking section, which consists of submissions of costs, income, and balances of individual academies.

The mapping file contains the mapping from AR4 cell references to cost categories and descriptions.
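The mapping step can be sketched on toy data as follows. Unlike the cell below, which uses pandas' default inner merge, this sketch deliberately uses a left merge with `indicator=True` so that cell references without a mapping are surfaced rather than dropped. The reference `ZZZ999-T` and the UPIN are hypothetical.

```python
import pandas as pd

# Toy AR extract: one academy, three cell references (illustrative values).
ar_raw = pd.DataFrame(
    {
        "academyupin": [1, 1, 1],
        "aruniquereference": ["BAE010-T", "BAE020-T", "ZZZ999-T"],
        "value": [240.0, 0.0, 5.0],
    }
)

# Toy mapping from cell reference to metric and cost pool.
ar_cell_mapping = pd.DataFrame(
    {
        "cell": ["BAE010-T", "BAE020-T"],
        "Metric": ["Teaching staff", "Supply teaching staff"],
        "Cost Pool": ["Teaching and Teaching support staff"] * 2,
    }
)

# Left merge keeps every raw row; _merge records whether a mapping was found.
ar = ar_raw.merge(
    ar_cell_mapping,
    left_on="aruniquereference",
    right_on="cell",
    how="left",
    indicator=True,
)

# Cell references that an inner merge would have discarded.
unmapped = ar.loc[ar["_merge"] == "left_only", "aruniquereference"].tolist()
print(unmapped)
```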

In [36]:
ar_cell_mapping = pd.read_csv('data/AR_cell_mapping.csv', encoding='utf8', index_col=schemas.ar_cell_mapping_index_col, usecols=schemas.ar_cell_mapping.keys(), dtype=schemas.ar_cell_mapping)

ar_raw = pd.read_csv('data/AR_raw.csv', encoding='utf8', index_col=schemas.ar_index_col, usecols=schemas.ar.keys(), dtype=schemas.ar)

ar = ar_raw.reset_index().merge(ar_cell_mapping, right_on='cell', left_on='aruniquereference').set_index(schemas.ar_index_col)
ar = (academies_lookup.reset_index()
      .merge(ar, left_on='Academy UPIN', right_on='academyupin', how='left')
      .join(cdc[['Total Internal Floor Area', 'Score']], on='URN', how='left', rsuffix='_cdc', lsuffix='_ar')).drop(['trustupin', 'companynumber'], axis=1)

ar.set_index('URN', inplace=True)

In [37]:
ar.to_csv('output/pre-processing/ar.csv')
ar

URN,LinkURN,LA Establishment Number,Academy UPIN,Academy Trust UPIN,aruniquereference,value,Description L1,Metric,Metric ID,Cost Pool,Presentation name,Cost Pool ID,Total Internal Floor Area,Score
148853,114221,840-3130,163480,139821,BAE010-T,240.0,Staff costs,Teaching staff,1,Teaching and Teaching support staff,Teaching and Teaching support staff,1,,
148853,114221,840-3130,163480,139821,BAE020-T,0.0,Staff costs,Supply teaching staff,5,Teaching and Teaching support staff,Teaching and Teaching support staff,1,,
148853,114221,840-3130,163480,139821,BAE030-T,89.0,Staff costs,Education support staff,2,Teaching and Teaching support staff,Teaching and Teaching support staff,1,,
148853,114221,840-3130,163480,139821,BAE040-T,17.0,Staff costs,Administrative and clerical staff,6,Non-educational support staff,Non-educational support staff,2,,
148853,114221,840-3130,163480,139821,BAE050-T,25.0,Staff costs,Premises staff,14,Premises,Premises,5,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138300,119255,890-2221,122121,137035,BAE310-T,0.0,Other supplies and services,PFI charges,25,Other costs,Other costs,9,6531.0,91.396264
138300,119255,890-2221,122121,137035,BAE320-T,0.0,Funding costs,Interest charges for loan and bank,32,Other costs,Other costs,9,6531.0,91.396264
138300,119255,890-2221,122121,137035,BAI110-T,0.0,Self-generated income,Income from catering,33,Catering,Catering income,10,6531.0,91.396264
138300,119255,890-2221,122121,137035,BAITOT-T,3778.0,Total revenue income,Total revenue income,0,,,0,6531.0,91.396264


## Academies data load and preparation 

A list of academies expected to submit the AAR data, as well as the trust – academy relationship. The academies master list was originally provided to us by Cristelle Ngoune; however, it can be acquired by querying AnM (see attached script).

This reads academy data from the `master_list_raw` data file and joins it to the GIAS data (schools), school census, SEN and CDC data by the 'LA Establishment Number', 'URN', 'URN' and 'URN' keys respectively. Finally, some further mapping of the nursery provision and the academy status is carried out.
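Because the census, SEN and CDC frames all carry the lookup columns, each join produces duplicate column names. The pattern used in the next cell is to suffix the incoming duplicates and then filter them out with a negative-lookahead regex. Here is a minimal sketch on toy data, with made-up values and only `rsuffix` set for brevity:

```python
import pandas as pd

# Toy academy row and toy census row sharing the "Academy UPIN" column.
academies = pd.DataFrame(
    {"Academy UPIN": [111], "Academy Name": ["Example Academy"]},
    index=pd.Index([1], name="URN"),
)
census = pd.DataFrame(
    {"Academy UPIN": [111], "headcount of pupils": [500]},
    index=pd.Index([1], name="URN"),
)

# The overlapping column from the right-hand frame gets the "_census" suffix.
joined = academies.join(census, how="left", rsuffix="_census")

# Keep only the columns whose name does not contain "_census".
deduped = joined.filter(regex=r"^(?!.*_census)")

print(joined.columns.tolist())
print(deduped.columns.tolist())
```

When both `lsuffix` and `rsuffix` are supplied, as in the cell below, the left-hand copy of an overlapping column is renamed too, which is why names such as `Academy UPIN_academy` appear in the output.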

In [44]:
academies = academies_list.merge(schools.reset_index(), left_index=True, right_on='LA Establishment Number')
academies.set_index('URN', inplace=True)

academies = (academies.join(census, how='left', rsuffix='_census', lsuffix='_academy').filter(regex='^(?!.*_census)')
             .join(sen, on='URN', how='left', rsuffix='_sen', lsuffix='_academy').filter(regex='^(?!.*_sen)')
             .join(cdc, on='URN', how='left', rsuffix='_cdc', lsuffix='_academycdc').filter(regex='^(?!.*_cdc)'))

academies['Type of Provision - Phase'] = academies.apply(lambda df: mappings.map_academy_phase_type(df['TypeOfEstablishment (code)'], df['Type of Provision - Phase']), axis=1)

# Bizarre I shouldn't need this as this is coming from the original GIAS dataset, but I seem to have to do this twice. 
academies['NurseryProvision (name)'] = academies['NurseryProvision (name)'].fillna('')
academies['NurseryProvision (name)'] = academies.apply(lambda df: mappings.map_nursery(df['NurseryProvision (name)'], df['Type of Provision - Phase']), axis=1)

academies['Status'] = academies.apply(lambda df: mappings.map_academy_status(pd.to_datetime(df['Date left or closed if in period']), 
                                                                             pd.to_datetime(df['Valid to']), 
                                                                             pd.to_datetime(df['OpenDate']), 
                                                                             pd.to_datetime(df['CloseDate']), 
                                                                             pd.to_datetime(accounts_return_period_start_date), pd.to_datetime(academy_year_start_date), pd.to_datetime(academy_year_end_date)), axis=1)

academies.rename(columns={'UKPRN_x':'UKPRN'}, inplace=True)

In [45]:
academies.to_csv('output/pre-processing/academies.csv')
academies.sort_index()

URN,Company Registration Number,Incorporation Date,Academy Trust UPIN_academy,UKPRN,Academy Trust Name,Academy Name,Academy UPIN_academy,Trust Type,Date Opened,LA Name,...,SUP_Primary_need_vi,SUP_Primary_need_msi,SUP_Primary_need_pd,SUP_Primary_need_asd,SUP_Primary_need_oth,SUP_Primary_need_nsa,% of pupils with special educational needs support,Total Internal Floor Area,Score,Status
101849,02369239,1989-04-06 00:00:00.0000000,134879,10058152,The BRIT School Limited,BRIT School for Performing Arts and Technology,118384,Single Academy Trust (SAT),1991-10-22 00:00:00.0000000,Croydon London Borough Council,...,1,0,16,32,0,0,3.435934,27058.0,43.834208,Open
105135,05210075,2004-08-19 00:00:00.0000000,135025,10058185,St Paul's Academy,St Paul's Academy - Greenwich,119110,Single Academy Trust (SAT),2005-09-01 00:00:00.0000000,Greenwich London Borough Council,...,1,0,1,14,3,7,2.5,11352.0,16.7463,Open
108420,04464331,2002-06-19 00:00:00.0000000,139672,10064192,Emmanuel Schools Foundation,Emmanuel College - Gateshead,118390,Multi Academy Trust (MAT),1991-04-05 00:00:00.0000000,Gateshead Council,...,4,0,3,28,11,23,0.692521,15903.0,37.0,Open
123627,02414699,1989-08-18 00:00:00.0000000,134882,10058155,Telford City Technology College Trust Limited,Thomas Telford School,118401,Single Academy Trust (SAT),1992-02-07 00:00:00.0000000,Telford and Wrekin Council,...,5,1,7,6,0,0,0.12945,28298.0,23.204891,Open
129342,07525820,2011-02-10 00:00:00.0000000,135906,10058513,Tove Learning Trust,Grace Academy Solihull,118615,Multi Academy Trust (MAT),2006-09-01 00:00:00.0000000,Solihull Metropolitan Borough Council,...,5,1,2,68,21,0,2.447164,17294.0,16.930612,Open
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149396,08075785,2012-05-18 00:00:00.0000000,138199,10048061,The White Horse Federation,Peak Academy,164084,Multi Academy Trust (MAT),2012-09-01 00:00:00.0000000,Gloucestershire County Council,...,0,0,0,0,0,0,100.0,,,Closed in period
149396,08075785,2012-05-18 00:00:00.0000000,138199,10048061,The White Horse Federation,Peak Academy,164084,Multi Academy Trust (MAT),2012-09-01 00:00:00.0000000,Gloucestershire County Council,...,0,0,0,0,0,0,100.0,,,Closed in period
149396,08075785,2012-05-18 00:00:00.0000000,138199,10048061,The White Horse Federation,Peak Academy,164084,Multi Academy Trust (MAT),2012-09-01 00:00:00.0000000,Gloucestershire County Council,...,0,0,0,0,0,0,100.0,,,Closed in period
149396,08075785,2012-05-18 00:00:00.0000000,138199,10048061,The White Horse Federation,Peak Academy,164084,Multi Academy Trust (MAT),2012-09-01 00:00:00.0000000,Gloucestershire County Council,...,0,0,0,0,0,0,100.0,,,Closed in period


## Maintained schools data load and preparation

In [46]:
maintained_schools = maintained_schools_list.merge(schools.reset_index(), left_index=True, right_on='URN')

maintained_schools = (maintained_schools
                      .join(census, how='left', rsuffix='_census', lsuffix='_school').filter(regex='^(?!.*_census)')
                      .join(cdc, on='URN', how='left', rsuffix='_cdc', lsuffix='_school').filter(regex='^(?!.*_cdc)'))

maintained_schools['Status'] = maintained_schools.apply(lambda df: mappings.map_maintained_school_status(df['OpenDate'], df['CloseDate'], df['Period covered by return (months)'], pd.to_datetime(maintained_schools_year_start_date), pd.to_datetime(maintained_schools_year_end_date)), axis=1)

maintained_schools.set_index('URN', inplace=True)

In [47]:
maintained_schools.to_csv('output/pre-processing/maintained_schools.csv')
maintained_schools

URN,LA,LA Name,Region,Estab,LAEstab,School Name,Phase,Overall Phase,Lowest age of pupils,Highest age of pupils,...,number of pupils whose first language is known or believed to be other than English,Total School Workforce (Full-Time Equivalent),Total Number of Teachers (Full-Time Equivalent),Pupil: Teacher Ratio (Full-Time Equivalent of qualified and unqualified teachers),Number of Vacant Teacher Posts,LinkURN,LA Establishment Number,Total Internal Floor Area,Score,Status
100000,201,City of London,London,3614,2013614,The Aldgate School,Infant and junior,Primary,3.0,11.0,...,,,,,,,201-3614,3628.0,142.0,Open
100005,202,Camden,London,1048,2021048,Thomas Coram Centre,Nursery,Nursery,2.0,5.0,...,,,,,,,202-1048,,,Open
100006,202,Camden,London,1100,2021100,Heath School,Pupil referral unit,Pupil referral unit,11.0,16.0,...,,,,,,134643,202-1100,1523.0,59.797111,Open
100007,202,Camden,London,1101,2021101,Camden Primary Pupil Referral Unit,Pupil referral unit,Pupil referral unit,5.0,11.0,...,,,,,,,202-1101,,,Open
100008,202,Camden,London,2019,2022019,Argyle Primary School,Infant and junior,Primary,3.0,11.0,...,,,,,,,202-2019,2951.0,117.0,Open
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131818,341,Liverpool,North West,2230,3412230,Fazakerley Primary School,Infant and junior,Primary,3.0,11.0,...,,,,,,104536,341-2230,2819.0,56.148634,Open
132176,341,Liverpool,North West,2232,3412232,Kirkdale St Lawrence CofE VA Primary School,Infant and junior,Primary,3.0,11.0,...,,,,,,104606,341-2232,2872.0,27.0,Open
132793,341,Liverpool,North West,2233,3412233,St Matthew's Catholic Primary School,Infant and junior,Primary,4.0,11.0,...,,,,,,104654,341-2233,1639.0,117.0,Open
132796,341,Liverpool,North West,2234,3412234,St John's Catholic Primary School,Infant and junior,Primary,2.0,11.0,...,,,,,,131028,341-2234,2736.0,17.0,Open
