In [27]:
%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
cannot find .env file


# VMFI Data processing pipeline

This workbook aims to emulate the data processing that currently takes place in the VMFI pipeline. The logic is largely based on the document [Insights data portal - Data sources and sql analysis](https://educationgovuk.sharepoint.com.mcas.ms/:w:/r/sites/VMFI/_layouts/15/Doc.aspx?sourcedoc=%7B38C1DC37-7CDB-48B8-9E22-284F4F311C0B%7D&file=1.%20Insights%20portal%20-%20data%20sources%20and%20sql%20analysis%20v010%20-%20Copy.docx&action=default&mobileredirect=true) and stays true to that document even where the existing stored procedures do something different. This will form the basis of a gap analysis going forward.

All data loaded in this workbook comes from the set of CSV and Excel files in the `data` folder alongside this workbook. These datasets are, for the most part, taken from the list at the start of the linked document. However, fully implementing the pipeline requires additional standing data, which has been exported from the development VMFI pipeline database. These files are currently:

| File name | DB Table |
|:----------|----------|
|standing_data_cdc.csv | standing_data.cdc |

In [28]:
import pandas as pd
import mappings as mappings
import schemas as schemas
import datetime
import time
import glob
import os

# Create and clean directory
from pathlib import Path
Path("output/pre-processing").mkdir(parents=True, exist_ok=True)

files = glob.glob("output/pre-processing/*")
for f in files:
    os.remove(f)

start_time = time.time()
current_year = 2022
accounts_return_period_start_date = datetime.date(current_year - 1, 9, 10)
academy_year_start_date = datetime.date(current_year - 1, 9, 1)
academy_year_end_date = datetime.date(current_year, 8, 30)
# Start date moved to the previous calendar year so that it precedes the end date below.
maintained_schools_year_start_date = datetime.date(current_year - 1, 4, 1)
maintained_schools_year_end_date = datetime.date(current_year, 3, 31)

## CDC data load and preparation

This is the school buildings condition dataset, based on surveys performed throughout 2018-2019.

The file `data/standing_data_cdc.csv` is a direct export of the `standing_data.cdc` table, without the Year and Import ID fields. In future this will likely have to be read directly from the source database as per [this document](https://educationgovuk.sharepoint.com.mcas.ms/:w:/r/sites/VMFI/_layouts/15/Doc.aspx?sourcedoc=%7B38C1DC37-7CDB-48B8-9E22-284F4F311C0B%7D&file=1.%20Insights%20portal%20-%20data%20sources%20and%20sql%20analysis%20v010%20-%20Copy.docx&action=default&mobileredirect=true).

In [29]:
cdc = pd.read_csv('data/standing_data_cdc.csv', encoding='utf8', index_col=schemas.cdc_index_col, usecols=schemas.cdc.keys(), dtype=schemas.cdc)

cdc['Total Internal Floor Area'] = cdc.groupby(by=['URN'])['GIFA'].sum()
cdc['Proportion Area'] = (cdc['GIFA'] / cdc['Total Internal Floor Area'])
cdc['Indicative Age'] = cdc['Block Age'].fillna('').map(mappings.map_block_age).astype('Int64')
cdc['Age Score'] = cdc['Proportion Area'] * (current_year - cdc['Indicative Age'])
cdc['Age Average Score'] = cdc.groupby(by=['URN'])['Age Score'].sum()
cdc = cdc[['Total Internal Floor Area', 'Age Average Score']].drop_duplicates()
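The `Age Average Score` computed above is an area-weighted average of block ages: each block's age in years is weighted by its share of the school's total floor area. The following is a minimal sketch of that calculation on toy data. The values are invented for illustration, and the `Indicative Age` column is assumed to already hold the construction year that `mappings.map_block_age` would produce.

```python
import pandas as pd

# Toy data: two blocks for URN 1, one block for URN 2 (values are illustrative only).
blocks = pd.DataFrame(
    {
        "URN": [1, 1, 2],
        "GIFA": [300.0, 100.0, 500.0],         # floor area per block
        "Indicative Age": [1980, 2000, 2010],  # assumed output of the block-age mapping
    }
)
reference_year = 2022  # plays the role of current_year in the notebook

# Broadcast each school's total floor area back onto its blocks.
total_area = blocks.groupby("URN")["GIFA"].transform("sum")

# Weight each block's age in years by its share of the school's floor area.
blocks["Age Score"] = (blocks["GIFA"] / total_area) * (
    reference_year - blocks["Indicative Age"]
)

# Summing the weighted ages per school gives the area-weighted average age.
age_average = blocks.groupby("URN")["Age Score"].sum()
print(age_average)
```

Here `transform("sum")` returns one value per block, which avoids relying on index alignment when the per-school total is assigned back to the block-level rows.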

In [30]:
cdc.to_csv('output/pre-processing/cdc.csv')
cdc

URN,Total Internal Floor Area,Age Average Score
100150,2803.0,48.358188
100162,2105.0,133.162945
100164,2934.0,97.0
100166,2040.0,91.705882
105304,1602.0,35.752809
...,...,...
144913,3111.0,16.704275
144917,2620.0,78.412214
105623,3382.0,7.0
144918,4733.0,19.009296


## School Census data load

*Pupil census* - DfE data collection providing information about school and pupil characteristics, for example the percentage of pupils claiming free school meals or having English as their second language.

*Workforce census* - Single reference for all school workforce statistics, based on staff working in publicly funded schools in England.

The following code loads both the workforce and pupil census data and performs an `inner` join by URN on the data sets.

In [31]:
school_workforce_census = pd.read_excel('data/School_Tables_School_Workforce_Census_2022.xlsx', header=5, index_col=schemas.workforce_census_index_col, usecols=schemas.workforce_census.keys(), dtype=schemas.workforce_census, na_values=["x","u","c"], keep_default_na=True).drop_duplicates()

school_pupil_census = pd.read_csv('data/standing_data_census_pupils.csv', encoding='utf8', index_col=schemas.pupil_census_index_col, usecols=schemas.pupil_census.keys(), dtype=schemas.pupil_census).drop_duplicates()

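# The left frame is the pupil census, so it takes the '_pupil' suffix on any overlapping column names.
census = school_pupil_census.join(school_workforce_census, on='URN', how='inner', lsuffix='_pupil', rsuffix='_workforce')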

census.drop(labels=['full time pupils', 'headcount of pupils'], axis=1, inplace=True)
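Because the join is an `inner` join, any school that appears in only one of the two censuses is dropped from `census`. The sketch below, on invented data with hypothetical column names, shows that behaviour and one way to list the URNs that would be lost, which may be useful for the gap analysis.

```python
import pandas as pd

# Invented example frames, both indexed by URN as in the notebook.
pupils = pd.DataFrame(
    {"fsm_percent": [33.8, 23.4, 5.1]},
    index=pd.Index([101, 102, 103], name="URN"),
)
workforce = pd.DataFrame(
    {"teachers_fte": [13.1, 34.0]},
    index=pd.Index([102, 104], name="URN"),
)

# An inner join keeps only the URNs present in both frames.
joined = pupils.join(workforce, how="inner")

# URNs in the pupil census that have no workforce record and are therefore dropped.
dropped = pupils.index.difference(workforce.index)

print(joined)
print(list(dropped))
```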
         

In [32]:

# Rename Columns
census.rename(columns={
    "Total Number of Non-Classroom-based School Support Staff, (Other school support staff plus Administrative staff plus Technicians and excluding Auxiliary staff (Full-Time Equivalent)": "FullTimeOther",
    "Total Number of Non Classroom-based School Support Staff, Excluding Auxiliary Staff (Headcount)": "FullTimeOtherHeadCount",
}, inplace=True)

In [33]:
census.to_csv('output/pre-processing/census.csv')
census

URN,% of pupils known to be eligible for and claiming free school me,% of pupils known to be eligible for free school meals (Performa,number of pupils whose first language is known or believed to be other than English,Statutory Low Age,Total School Workforce (Headcount),Total Number of Teachers in the Leadership Group (Headcount),Total Number of Teachers (Headcount),Total Number of Teaching Assistants (Headcount),FullTimeOtherHeadCount,Total Number of Auxiliary Staff (Headcount),Total School Workforce (Full-Time Equivalent),Total Number of Teachers in the Leadership Group (Full-time Equivalent),Total Number of Teachers (Full-Time Equivalent),Total Number of Teaching Assistants (Full-Time Equivalent),FullTimeOther,Total Number of Auxiliary Staff (Full-Time Equivalent),Pupil: Teacher Ratio (Full-Time Equivalent of qualified and unqualified teachers),Teachers with Qualified Teacher Status (%) (Headcount),Number of Vacant Teacher Posts
141334,33.8,52.3,93.0,4,48,3,15,14,6,13,34.17,2.64,13.11,10.29,4.82,5.95,24.8,100.000000,0
141396,23.4,60.3,236.0,3,118,4,39,34,11,34,82.47,4.00,34.00,29.55,10.13,8.79,18.3,100.000000,0
141397,33.2,47.7,127.0,3,105,5,27,42,9,27,72.81,4.24,24.55,31.84,6.55,9.87,19.7,100.000000,0
142223,5.1,8.7,343.0,3,156,5,56,44,9,47,99.66,4.16,47.12,33.07,6.57,12.90,23.0,100.000000,0
144396,56.7,64.8,29.0,3,37,2,13,9,4,11,25.57,2.00,11.39,7.36,4.00,2.82,18.1,100.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
104642,2.4,2.6,14.0,4,52,4,18,10,6,18,34.47,3.60,15.80,6.27,5.22,7.18,26.6,100.000000,0
104643,3.5,8.5,13.0,3,68,3,19,24,6,19,39.89,3.00,17.40,11.74,3.34,7.41,24.7,100.000000,0
104645,32.9,33.8,43.0,7,37,3,13,10,4,10,26.47,3.00,12.40,6.78,3.19,4.10,19.1,92.307692,0
104646,29.9,31.9,20.0,3,29,2,12,10,2,5,22.36,2.00,12.00,6.24,1.44,2.68,15.8,100.000000,0


## Special Educational Needs (SEN) data load and preparation

The special educational needs dataset contains information about the number of pupils who require various SEN provisions. This cell loads the `SEN` data, which originates from [here](https://explore-education-statistics.service.gov.uk/find-statistics/special-educational-needs-in-england#dataDownloads-1).

In [34]:
sen = pd.read_csv('data/SEN.csv', encoding='cp1252', index_col=schemas.sen_index_col, dtype=schemas.sen, usecols=schemas.sen.keys())
sen['Percentage SEN'] = (sen['EHC plan'] / sen['Total pupils']) * 100.0
# For each primary need, combine the EHC-plan count with the SEN-support count.
for need in ['spld', 'mld', 'sld', 'pmld', 'semh', 'slcn', 'hi', 'vi', 'msi', 'pd', 'asd', 'oth']:
    sen[f'Primary Need {need.upper()}'] = sen[f'EHC_Primary_need_{need}'] + sen[f'SUP_Primary_need_{need}']
sen.rename(columns={'prov_slcn': 'Prov_SLCN', 'prov_hi':'Prov_HI', 'prov_vi':'Prov_VI', 'prov_msi': 'Prov_MSI', 'prov_pd':'Prov_PD', 'prov_asd':'Prov_ASD', 'prov_oth':'Prov_OTH'}, inplace=True)

sen = sen[['Total pupils', 'EHC plan', 'Percentage SEN', 'Primary Need SPLD', 'Primary Need MLD', 'Primary Need SLD', 'Primary Need PMLD', 'Primary Need SEMH', 'Primary Need SLCN', 'Primary Need HI', 'Primary Need VI', 'Primary Need MSI', 'Primary Need PD', 'Primary Need ASD', 'Primary Need OTH', 'Prov_SPLD', 'Prov_MLD', 'Prov_SLD', 'Prov_PMLD', 'Prov_SEMH', 'Prov_SLCN','Prov_HI','Prov_VI','Prov_MSI','Prov_PD','Prov_ASD','Prov_OTH']] 

In [35]:
sen.to_csv("output/pre-processing/sen.csv")
sen

URN,Total pupils,EHC plan,Percentage SEN,Primary Need SPLD,Primary Need MLD,Primary Need SLD,Primary Need PMLD,Primary Need SEMH,Primary Need SLCN,Primary Need HI,...,Prov_SLD,Prov_PMLD,Prov_SEMH,Prov_SLCN,Prov_HI,Prov_VI,Prov_MSI,Prov_PD,Prov_ASD,Prov_OTH
100000,271,8,2.95203,2,4,0,0,9,31,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100001,739,0,0.0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100002,269,0,0.0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100003,1045,0,0.0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100005,136,2,1.470588,0,0,0,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149557,41,3,7.317073,2,0,0,0,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
149632,1291,58,4.492641,31,15,0,0,20,25,8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
149633,86,0,0.0,2,1,0,0,1,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
149635,654,10,1.529052,15,1,0,0,12,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## KS2 and KS4 processing

In [36]:
ks2 = pd.read_excel('data/2022-2023_england_ks2revised.xlsx',usecols=schemas.ks2.keys())

In [37]:
# ks2 mappings
progress_columns = ['READPROG', 'MATPROG', 'WRITPROG']
for col in progress_columns:
    # Suppressed ('SUPP') and low-coverage ('LOWCOV') scores are counted as 0.
    ks2[col] = ks2[col].replace({'SUPP': 0, 'LOWCOV': 0})

# Overall progress is the sum of the reading, maths and writing progress scores.
ks2['Ks2Progress'] = ks2['READPROG'].astype(float) + ks2['MATPROG'].astype(float) + ks2['WRITPROG'].astype(float)


ks2 = ks2[['URN','Ks2Progress']].copy()
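The mapping above counts `SUPP` and `LOWCOV` as 0, so a school with suppressed scores is indistinguishable from one with a genuine progress score of 0. The sketch below is not part of the pipeline; it contrasts that behaviour with an alternative that treats the codes as missing, using invented values. Which treatment is correct is a question for the source document.

```python
import pandas as pd

cols = ["READPROG", "MATPROG", "WRITPROG"]

# Invented scores containing the two non-numeric codes.
scores = pd.DataFrame(
    {
        "READPROG": ["1.5", "SUPP", "-0.4"],
        "MATPROG": ["2.0", "0.3", "LOWCOV"],
        "WRITPROG": ["-1.0", "SUPP", "0.1"],
    }
)

# Non-numeric codes become NaN; genuine scores become floats.
numeric = scores[cols].apply(pd.to_numeric, errors="coerce")

# Mimics the notebook: codes contribute 0 to the total.
as_zero = numeric.fillna(0).sum(axis=1)

# Alternative: the total is missing unless all three scores are present.
as_missing = numeric.sum(axis=1, min_count=len(cols))

print(pd.DataFrame({"as_zero": as_zero, "as_missing": as_missing}))
```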

In [38]:
ks2.to_csv('output/ks2.csv')
ks2

Unnamed: 0,URN,Ks2Progress
0,100000.0,0.5
1,136807.0,13.0
2,139837.0,20.7
3,140686.0,-1.8
4,100008.0,5.5
...,...,...
16534,,-0.1
16535,,-2.0
16536,,-0.7
16537,,


In [39]:
ks4 = pd.read_excel('data/2022-2023_england_ks4revised.xlsx')

In [40]:
#ks4 mappings

ks4.rename(columns={
    'ATT8SCR':'AverageAttainment',
    'P8MEA':'Progress8Measure',
    'P8_BANDING':'Progress8Banding'
}, inplace=True)

ks4 = ks4[['URN','AverageAttainment','Progress8Measure','Progress8Banding']].copy()

In [41]:
ks4.to_csv('output/ks4.csv')
ks4

Unnamed: 0,URN,AverageAttainment,Progress8Measure,Progress8Banding
0,100003.0,36.8,NP,
1,100001.0,29.4,NP,
2,100544.0,6.8,NP,
3,,,,
4,100053.0,50.3,-0.16,Average
...,...,...,...,...
5808,112446.0,12,NP,
5809,134191.0,7,NP,
5810,,46.3,-0.1,Below average
5811,,44.6,,


## AR Data load and preparation

This loads the annual accounts return (AR) dataset and the corresponding mapping file. This extract contains only the benchmarking section, which consists of submissions of costs, income and balances for individual academies.

The mapping file contains the mapping from AR4 cell references to cost categories and descriptions.

In [42]:
aar = pd.read_excel('data/SFB_Academies_2022-23_20240418.xlsx', sheet_name='Academies')
aar.rename(columns={'Academy UPIN':'academyupin',
                                    'In year balance':'Academy Balance',
                                    'PFI':'PFI School',
                                    'Lead UPIN': 'trustupin'}, inplace=True)

academies_financial = aar[aar['MAT SAT or Central Services']=='Single Academy Trust (SAT)'].copy()
academy_financial_position = academies_financial[['academyupin','Academy Balance']]

trust_financial = aar[aar['MAT SAT or Central Services']=='Multi Academy Trust (MAT)'].copy()
trust_financial_position = trust_financial[['trustupin','Academy Balance']].groupby('trustupin').sum().rename(columns={'Academy Balance':'Trust Balance'})

central_services_financial = pd.read_excel('data/SFB_Academies_2022-23_20240418.xlsx', sheet_name='CentralServices')
central_services_financial.rename(columns={'Academy UPIN':'academyupin',
                                    'In Year Balance':'Academy Balance',
                                    'PFI':'PFI School',
                                    'Lead UPIN': 'trustupin'}, inplace=True)

central_services_financial_position = central_services_financial[['trustupin','Academy Balance']].groupby('trustupin').sum().rename(columns={'Academy Balance':'Central Services Balance'})

aar = aar.drop(columns=['Academy Balance'])

ar = (aar.merge(academy_financial_position, on='academyupin', how='left')
     .merge(trust_financial_position, on='trustupin', how='left')
     .merge(central_services_financial_position, on='trustupin', how='left'))

trust_agg = trust_financial[schemas.aar_aggregation_columns].groupby('trustupin').sum()
trust_agg = trust_agg.drop(columns=['academyupin'])
academy_agg = academies_financial[schemas.aar_aggregation_columns].groupby('academyupin').sum()
academy_agg = academy_agg.drop(columns=['trustupin'])
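The trust-level balance above is built by summing academy balances per trust and then attaching the total to every academy row with a left merge. A minimal sketch of that pattern on invented figures follows. The `validate` argument is an addition that is not used in the notebook; it makes pandas raise an error if the right-hand frame ever contains more than one row per trust.

```python
import pandas as pd

# Invented returns: two academies in trust 10, one academy in trust 20.
returns = pd.DataFrame(
    {
        "academyupin": [1, 2, 3],
        "trustupin": [10, 10, 20],
        "Academy Balance": [100.0, -40.0, 25.0],
    }
)

# One row per trust, holding the sum of its academies' balances.
trust_balance = (
    returns.groupby("trustupin")["Academy Balance"]
    .sum()
    .rename("Trust Balance")
    .reset_index()
)

# Attach the trust total to each academy row; row count is preserved by the left merge.
merged = returns.merge(
    trust_balance, on="trustupin", how="left", validate="many_to_one"
)
print(merged)
```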



In [43]:
# ar_cell_mapping = pd.read_csv('data/AR_cell_mapping.csv', encoding='utf8', index_col=schemas.ar_cell_mapping_index_col, usecols=schemas.ar_cell_mapping.keys(), dtype=schemas.ar_cell_mapping)

# ar_raw = pd.read_csv('data/AR_raw.csv', encoding='utf8', index_col=schemas.ar_index_col, usecols=schemas.ar.keys(), dtype=schemas.ar)

# ar = ar_raw.reset_index().merge(ar_cell_mapping, right_on='cell', left_on='aruniquereference').set_index(schemas.ar_index_col)

# pfi_schools = ar[ar['aruniquereference'] == 'BAE310-T'][['value']].map(mappings.map_is_pfi_school).rename(columns={'value': 'PFI School'})

# academy_financial_position = ar[ar['aruniquereference'] == 'BAB030-T'][['value']].rename(columns={'value': 'Academy Balance'})
# central_services_financial_position = ar[(ar['aruniquereference'] == 'BAB030-T') | (ar['aruniquereference'] == 'BTB030')].groupby('trustupin')['value'].sum().rename('Central Services Balance')
# trust_financial_position = ar[(ar['aruniquereference'] == 'BTB030')][['trustupin', 'value']].rename(columns={'value': 'Trust Balance'}).set_index('trustupin')

# teachingstaff = ar[ar['aruniquereference'] == 'BAE010-T'][['value']].rename(columns={'value': 'TeachingStaff'})
# cateringsupplies = ar[ar['aruniquereference'] == 'BAE250-T'][['value']].rename(columns={'value': 'CateringSupplies'}) 


# ar = (ar.join(pfi_schools, how='left')
#       .join(academy_financial_position, how='left')
#       .join(teachingstaff, how='left')
#       .join(cateringsupplies, how='left')
#       .join(trust_financial_position, on='trustupin', how='left')
#       .join(central_services_financial_position, on='trustupin', how='left'))

# trust_agg = (ar.reset_index()[['Cost Pool', 'Metric', 'academyupin', 'trustupin', 'value']].pivot_table(index='trustupin', columns=['Cost Pool','Metric'], values='value', aggfunc="sum"))
# trust_agg.columns = ['_'.join(str(s).strip() for s in col if s) for col in trust_agg.columns]
# trust_agg.reset_index().rename({'index':'trustupin'}).set_index('trustupin')

# academy_agg = (ar.reset_index()[['Cost Pool', 'Metric', 'academyupin', 'trustupin', 'value']].pivot_table(index='academyupin', columns=['Cost Pool','Metric'], values='value', aggfunc="sum"))
# academy_agg.columns =  ['_'.join(str(s).strip() for s in col if s) for col in academy_agg.columns]
# academy_agg.reset_index().rename({'index':'academyupin'}).set_index('academyupin')


In [44]:
ar.to_csv('output/pre-processing/ar.csv')
ar

Unnamed: 0,LAEstab,LA,Estab,URN,academyupin,School Name,Period covered by return,MAT SAT or Central Services,trustupin,UID,...,Premises Costs,Catering Expenses,Occupation Costs,Total Costs of Supplies and Services,Total Costs of Educational Supplies,Costs of Brought in Professional Services,Total Expenditure,Academy Balance,Trust Balance,Central Services Balance
0,8655405,865.0,5405.0,138623.0,111443,St John's Marlborough,12.0,Multi Academy Trust (MAT),137157,3065.0,...,719000.0,693000.0,1163000.0,1167000.0,750000.0,204000.0,10949000.0,,-1899000.0,-1830000.0
1,8655411,865.0,5411.0,138630.0,111451,Devizes School,12.0,Multi Academy Trust (MAT),138199,5315.0,...,523000.0,95000.0,326000.0,1398000.0,1038000.0,209000.0,7724000.0,,2880000.0,-8541000.0
2,8654071,865.0,4071.0,143005.0,111453,Avon Valley Academy,12.0,Multi Academy Trust (MAT),135112,2070.0,...,224000.0,43000.0,243000.0,409000.0,159000.0,122000.0,2949000.0,,1985000.0,-3191000.0
3,8655414,865.0,5414.0,136296.0,111710,Hardenhuish School,12.0,Single Academy Trust (SAT),135428,3309.0,...,481000.0,422000.0,771000.0,1331000.0,952000.0,71000.0,10613000.0,-1544000.0,,0.0
4,8454026,845.0,4026.0,137982.0,113087,Beacon Academy,12.0,Multi Academy Trust (MAT),136879,2259.0,...,578000.0,104000.0,398000.0,1341000.0,428000.0,106000.0,10356000.0,,-1232000.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10439,9335950,933.0,5950.0,150134.0,164746,The Sky Academy,11.0,Multi Academy Trust (MAT),140031,2516.0,...,148000.0,112000.0,205000.0,248000.0,184000.0,2000.0,2393000.0,,905000.0,-1067000.0
10440,8014018,801.0,4018.0,150226.0,164811,Lansdown Park Academy,1.0,Multi Academy Trust (MAT),135065,2516.0,...,6000.0,0.0,5000.0,2000.0,2000.0,0.0,99000.0,,5380000.0,-8019000.0
10441,8014018,801.0,4018.0,150226.0,164811,Lansdown Park Academy,11.0,Multi Academy Trust (MAT),140031,2516.0,...,194000.0,56000.0,221000.0,358000.0,232000.0,17000.0,2677000.0,,905000.0,-1067000.0
10442,8792010,879.0,2010.0,150037.0,164812,Hyde Park Infants’ School,1.0,Multi Academy Trust (MAT),139706,16439.0,...,5000.0,0.0,7000.0,1000.0,0.0,1000.0,94000.0,,1053000.0,-950000.0


Create a summary table of the AR financial position of each distinct academy in the table.

In [45]:
academy_ar = ar.reset_index().drop_duplicates(subset=['academyupin'], ignore_index=True)[
    ['academyupin', 'Academy Balance', 'Trust Balance', 'Central Services Balance', 'PFI School']
].set_index('academyupin')

academy_ar['Central Services Financial Position'] = academy_ar['Central Services Balance'].map(mappings.map_is_surplus_deficit)
academy_ar['Academy Financial Position'] = academy_ar['Academy Balance'].map(mappings.map_is_surplus_deficit) 
academy_ar['Trust Financial Position'] = academy_ar['Trust Balance'].map(mappings.map_is_surplus_deficit) 

academy_ar.merge(academy_agg, left_on='academyupin', right_index=True, how='left')

academyupin,Academy Balance_x,Trust Balance,Central Services Balance,PFI School,Central Services Financial Position,Academy Financial Position,Trust Financial Position,DFE/EFA Revenue grants (includes Coronavirus Government Funding,of which: Coronavirus Government Funding,SEN funding,...,Total Staff Costs,Maintenance & Improvement Costs,Premises Costs,Catering Expenses,Occupation Costs,Total Costs of Supplies and Services,Total Costs of Educational Supplies,Costs of Brought in Professional Services,Total Expenditure,"Share of Revenue Reserve, distributed on per pupil basis\n"
111443,,-1899000.0,-1830000.0,Not part of PFI,Deficit,Unknown,Deficit,,,,...,,,,,,,,,,
111451,,2880000.0,-8541000.0,Not part of PFI,Deficit,Unknown,Surplus,,,,...,,,,,,,,,,
111453,,1985000.0,-3191000.0,Not part of PFI,Deficit,Unknown,Surplus,,,,...,,,,,,,,,,
111710,-1544000.0,,0.0,Not part of PFI,Deficit,Deficit,Unknown,7685000.0,79000.0,323000.0,...,8030000.0,129000.0,481000.0,422000.0,771000.0,1331000.0,952000.0,71000.0,10613000.0,1366000.0
113087,,-1232000.0,0.0,Not part of PFI,Deficit,Unknown,Deficit,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
164644,,341000.0,-253000.0,Not part of PFI,Deficit,Unknown,Surplus,,,,...,,,,,,,,,,
164745,,1768000.0,-989000.0,Not part of PFI,Deficit,Unknown,Surplus,,,,...,,,,,,,,,,
164746,,5380000.0,-8019000.0,Not part of PFI,Deficit,Unknown,Surplus,,,,...,,,,,,,,,,
164811,,5380000.0,-8019000.0,Not part of PFI,Deficit,Unknown,Surplus,,,,...,,,,,,,,,,
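The implementation of `mappings.map_is_surplus_deficit` is not shown in this workbook. The sketch below is a hypothetical stand-in, named `classify_balance`, that is consistent with the output above: missing balances map to `Unknown`, positive balances to `Surplus`, and everything else (including a balance of exactly 0) to `Deficit`. The real function may differ.

```python
import math

import pandas as pd


def classify_balance(balance):
    """Hypothetical stand-in for mappings.map_is_surplus_deficit."""
    # A missing balance (None or NaN) cannot be classified.
    if balance is None or (isinstance(balance, float) and math.isnan(balance)):
        return "Unknown"
    # Assumption: zero is grouped with negative balances, as the output above suggests.
    return "Surplus" if balance > 0 else "Deficit"


balances = pd.Series([-1544000.0, float("nan"), 0.0, 2880000.0])
positions = balances.map(classify_balance)
print(list(positions))
```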


Now compute the trust financial position in the same manner as the individual academy position.

## Academy and maintained schools data load and preparation

This reads the main GIAS data (edubasealldataYYYYMMDD file) and the associated links file (links_edubasealldataYYYYMMDD file). This is taken from the [GIAS Service](https://get-information-schools.service.gov.uk/help)

Other columns are tidied up by asserting the correct type for each column. This tidying phase is needed largely because, on load, integer columns that contain missing values are inferred to be floats rather than integers.
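The type inference issue can be reproduced on a two-row CSV. In this sketch the `dtype` dictionary is a hypothetical stand-in for `schemas.gias`, whose contents are not shown here; it uses the nullable `Int64` dtype so that a missing value does not force the column to float.

```python
import io

import pandas as pd

# Two rows, the second with a missing EstablishmentNumber.
csv_text = "URN,EstablishmentNumber\n100000,3614\n100001,\n"

# Default inference: the missing value turns the whole column into float64.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit nullable integer dtypes keep the column integral and mark the gap as <NA>.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"URN": "Int64", "EstablishmentNumber": "Int64"},
)

print(inferred.dtypes)
print(typed.dtypes)
```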

In [46]:
gias = pd.read_csv('data/edubasealldata20240312.csv', encoding='cp1252', 
                   index_col=schemas.gias_index_col, usecols=schemas.gias.keys(), dtype=schemas.gias)

gias_links = pd.read_csv('data/links_edubasealldata20240312.csv', encoding='cp1252', 
                         index_col=schemas.gias_links_index_col, usecols=schemas.gias_links.keys(), dtype=schemas.gias_links)

# GIAS transformations
gias['LA Establishment Number'] = gias['LA (code)'] + '-' + gias['EstablishmentNumber'].astype('string')
gias['LA Establishment Number'] = gias['LA Establishment Number'].astype('string')

gias['OpenDate'] = pd.to_datetime(gias['OpenDate'], dayfirst=True, format='mixed')
gias['CloseDate'] = pd.to_datetime(gias['CloseDate'], dayfirst=True, format='mixed')
gias['SchoolWebsite'] = gias['SchoolWebsite'].fillna('').map(mappings.map_school_website)
gias['Boarders (name)'] = gias['Boarders (name)'].fillna('').map(mappings.map_boarders)
gias['OfstedRating (name)'] = gias['OfstedRating (name)'].fillna('').map(mappings.map_ofsted_rating)
gias['NurseryProvision (name)'] = gias['NurseryProvision (name)'].fillna('')
gias['OfficialSixthForm (name)'] = gias['OfficialSixthForm (name)'].fillna('').map(mappings.map_sixth_form)
gias['AdmissionsPolicy (name)'] = gias['AdmissionsPolicy (name)'].fillna('').map(mappings.map_admission_policy)
gias['HeadName'] = gias['HeadTitle (name)'] + ' ' + gias['HeadFirstName'] + ' ' + gias['HeadLastName']

In the following cell, we find all the predecessor and merged links. The links are sorted by `LinkEstablishedDate`, most recent first, and then ranked within each URN. The linked GIAS data is then joined to the base GIAS data, which creates the overall school data set. This data set is then filtered to schools that are open (`CloseDate` is null) and, where a school has links, to the link ranked 1.

In [47]:
gias_links = gias_links[
    gias_links['LinkType'].isin(['Predecessor', 
                                 'Predecessor - amalgamated', 
                                 'Predecessor - Split School', 
                                 'Predecessor - merged', 
                                 'Merged - expansion of school capacity', 
                                 'Merged - change in age range'])
].sort_values(by='LinkEstablishedDate', ascending=False)

gias_links['Rank'] = gias_links.groupby('URN').cumcount() + 1
gias_links['Rank'] = gias_links['Rank'].astype('Int64')

schools = gias.join(gias_links, on='URN', how='left', rsuffix='_links', lsuffix='_school').sort_values(by='URN')

schools = schools[
    schools['CloseDate'].isna() & ((schools['Rank'] == 1) | (schools['Rank'].isna()))
].drop(columns=['LinkURN', 'LinkName', 'LinkType', 'LinkEstablishedDate', 'Rank'])
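The ranking step relies on sorting before `cumcount`: once the links are in descending date order, the first row seen for each URN is its most recent link and receives rank 1. A minimal sketch on invented links:

```python
import pandas as pd

# Invented links: URN 1 has two predecessors, URN 2 has one.
links = pd.DataFrame(
    {
        "URN": [1, 1, 2],
        "LinkURN": [11, 12, 21],
        "LinkEstablishedDate": pd.to_datetime(
            ["2015-09-01", "2020-09-01", "2018-01-01"]
        ),
    }
)

# Most recent link first, so that cumcount numbers rows from newest to oldest.
links = links.sort_values("LinkEstablishedDate", ascending=False)
links["Rank"] = links.groupby("URN").cumcount() + 1

# Keep only the most recent link for each URN.
latest = links[links["Rank"] == 1].set_index("URN")["LinkURN"]
print(latest.sort_index())
```

Filtering on `Rank == 1` after a left join, while also keeping rows whose rank is missing, retains schools that have no links at all, which is what the `isna()` branch in the cell above does.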

In [48]:
schools.to_csv('output/pre-processing/schools.csv')
schools.sort_index()

URN,LA (code),LA (name),EstablishmentNumber,EstablishmentName,TypeOfEstablishment (code),TypeOfEstablishment (name),EstablishmentStatus (code),EstablishmentStatus (name),OpenDate,CloseDate,...,UrbanRural (name),BoardingEstablishment (name),PreviousLA (code),PreviousLA (name),PreviousEstablishmentNumber,OfstedRating (name),MSOA (code),LSOA (code),LA Establishment Number,HeadName
100000,201,City of London,3614,The Aldgate School,2,Voluntary aided school,1,Open,NaT,NaT,...,(England/Wales) Urban major conurbation,,999,,,Outstanding,E02000001,E01032739,201-3614,Miss Alexandra Allan
100001,201,City of London,6005,City of London School for Girls,11,Other independent school,1,Open,1920-01-01,NaT,...,(England/Wales) Urban major conurbation,Does not have boarders,999,,,,E02000001,E01000002,201-6005,Mrs Jenny Brown
100002,201,City of London,6006,St Paul's Cathedral School,11,Other independent school,1,Open,1939-01-01,NaT,...,(England/Wales) Urban major conurbation,Has boarders,999,,,,E02000001,E01032739,201-6006,
100003,201,City of London,6007,City of London School,11,Other independent school,1,Open,1919-01-01,NaT,...,(England/Wales) Urban major conurbation,Does not have boarders,999,,,,E02000001,E01032739,201-6007,Mr Alan Bird
100005,202,Camden,1048,Thomas Coram Centre,15,Local authority nursery school,1,Open,NaT,NaT,...,(England/Wales) Urban major conurbation,,999,,,Outstanding,E02007115,E01000937,202-1048,Ms Perina Holness
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
402468,679,Monmouthshire,5500,King Henry viii 3-19 School,30,Welsh establishment,1,Open,2023-09-01,NaT,...,,,999,,,,999999999,999999999,679-5500,
402469,681,Cardiff,2333,Ysgol Gynradd Groes-Wen Primary,30,Welsh establishment,1,Open,2023-09-01,NaT,...,(England/Wales) Rural village,,999,,,,W02000380,W01001729,681-2333,
402470,668,Pembrokeshire,2398,Ysgol Bro Penfro,30,Welsh establishment,4,Proposed to open,2024-09-01,NaT,...,(England/Wales) Rural town and fringe,,999,,,,W02000140,W01000607,668-2398,
402471,679,Monmouthshire,2325,Ysgol Gymraeg Trefynwy,30,Welsh establishment,4,Proposed to open,2024-09-01,NaT,...,(England/Wales) Urban city and town,,999,,,,W02000339,W01001978,679-2325,


Merge the required GIAS, census, SEN, CDC, PFI and AR data with the base academy data.

In [49]:
academies_list = pd.read_csv('data/master_list_raw.csv', encoding='utf8', index_col=schemas.academy_master_list_index_col, dtype=schemas.academy_master_list, usecols=schemas.academy_master_list.keys()).rename(columns={'UKPRN': 'Academy UKPRN'})

academies_base = academies_list.merge(schools.reset_index(), left_index=True, right_on='LA Establishment Number').set_index('URN')

academies = (academies_base.merge(census, on='URN', how='left')
             .merge(sen, on='URN', how='left')
             .merge(cdc, on='URN', how='left')
             .merge(academy_ar, left_on='Academy UPIN', right_index=True, how='left')
             .merge(trust_agg, left_on='Academy Trust UPIN', right_index=True, how='left')
             .merge(ks2, on='URN', how='left')
             .merge(ks4, on='URN', how='left'))
            

academies['Type of Provision - Phase'] = academies.apply(lambda df: mappings.map_academy_phase_type(df['TypeOfEstablishment (code)'], df['Type of Provision - Phase']), axis=1)

# Bizarre I shouldn't need this as this is coming from the original GIAS dataset, but I seem to have to do this twice. 
academies['NurseryProvision (name)'] = academies['NurseryProvision (name)'].fillna('')
academies['NurseryProvision (name)'] = academies.apply(lambda df: mappings.map_nursery(df['NurseryProvision (name)'], df['Type of Provision - Phase']), axis=1)

academies['Status'] = academies.apply(lambda df: mappings.map_academy_status(pd.to_datetime(df['Date left or closed if in period']), 
                                                                             pd.to_datetime(df['Valid to']), 
                                                                             pd.to_datetime(df['OpenDate']), 
                                                                             pd.to_datetime(df['CloseDate']), 
                                                                             pd.to_datetime(accounts_return_period_start_date), pd.to_datetime(academy_year_start_date), pd.to_datetime(academy_year_end_date)), axis=1)

academies['SchoolPhaseType'] = academies.apply(lambda df: mappings.map_school_phase_type(df['TypeOfEstablishment (code)'], df['Type of Provision - Phase']), axis=1)

academies.rename(columns={
    'UKPRN_x':'UKPRN',
    'Number of Pupils': 'Number of pupils',
    '% of pupils known to be eligible for free school meals (Performa': 'Percentage Free school meals'
}, inplace=True)

In [50]:
academies.to_csv('output/pre-processing/academies.csv')
academies.sort_index()

Unnamed: 0,URN,Company Registration Number,Incorporation Date,Academy Trust UPIN,Academy UKPRN,Academy Trust Name,Academy Name,Academy UPIN,Trust Type,Date Opened,...,Total Costs of Educational Supplies,Costs of Brought in Professional Services,Total Expenditure,"Share of Revenue Reserve, distributed on per pupil basis\n",Ks2Progress,AverageAttainment,Progress8Measure,Progress8Banding,Status,SchoolPhaseType
0,148853,10817580,2017-06-14 00:00:00.0000000,139821,10064612,1Excellence Multi Academy Trust,Evenwood Church of England Primary School,163480,Multi Academy Trust (MAT),2021-12-01 00:00:00.0000000,...,727000.0,158000.0,7172000.0,856000.0,11.0,,,,(Re)opened in period,Primary
1,144542,10817580,2017-06-14 00:00:00.0000000,139821,10064612,1Excellence Multi Academy Trust,Pentland Primary School,138448,Multi Academy Trust (MAT),2017-07-01 00:00:00.0000000,...,727000.0,158000.0,7172000.0,856000.0,-10.2,,,,Open,Primary
2,144551,10817580,2017-06-14 00:00:00.0000000,139821,10064612,1Excellence Multi Academy Trust,St Mark's Church of England Primary School - S...,138465,Multi Academy Trust (MAT),2017-07-01 00:00:00.0000000,...,727000.0,158000.0,7172000.0,856000.0,-3.9,,,,Open,Primary
3,148854,10817580,2017-06-14 00:00:00.0000000,139821,10064612,1Excellence Multi Academy Trust,St Michael's Church of England Primary School ...,163504,Multi Academy Trust (MAT),2021-12-01 00:00:00.0000000,...,727000.0,158000.0,7172000.0,856000.0,0.9,,,,(Re)opened in period,Primary
4,136730,07595434,2011-04-07 00:00:00.0000000,134890,10058682,5 Dimensions Trust,Shenley Brook End School,119734,Multi Academy Trust (MAT),2011-05-01 00:00:00.0000000,...,1425000.0,141000.0,23966000.0,4087000.0,,48.9,-0.13,Average,Open,Secondary
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10010,147031,07559293,2011-03-10 00:00:00.0000000,136062,10058621,Zenith Multi Academy Trust,Castle View School - Canvey Island,151873,Multi Academy Trust (MAT),2014-10-01 00:00:00.0000000,...,2277000.0,227000.0,27548000.0,5685000.0,,36.5,-0.75,Well below average,Open,Secondary
10011,145812,07559293,2011-03-10 00:00:00.0000000,136062,10058621,Zenith Multi Academy Trust,Laindon Park Primary School & Nursery,140554,Multi Academy Trust (MAT),2018-09-01 00:00:00.0000000,...,2277000.0,227000.0,27548000.0,5685000.0,1.9,,,,Open,Primary
10012,138865,07559293,2011-03-10 00:00:00.0000000,136062,10058621,Zenith Multi Academy Trust,The James Hornsby School,121878,Multi Academy Trust (MAT),2012-10-01 00:00:00.0000000,...,2277000.0,227000.0,27548000.0,5685000.0,,41.8,-0.32,Below average,Open,Secondary
10013,136577,07559293,2011-03-10 00:00:00.0000000,136062,10058621,Zenith Multi Academy Trust,The King John School,119652,Multi Academy Trust (MAT),2011-04-01 00:00:00.0000000,...,2277000.0,227000.0,27548000.0,5685000.0,,51.8,0.15,Average,Open,Secondary


Merge the required census and CDC data into the maintained schools data set.

In [51]:
# Load raw list from CSV
maintained_schools_list = pd.read_csv('data/maintained_schools_raw.csv', encoding='utf8', index_col=schemas.maintained_schools_master_list_index_col, usecols=schemas.maintained_schools_master_list.keys(), dtype=schemas.maintained_schools_master_list)

In [52]:
# Merge the maintained schools list with schools (metadata from GIAS) on URN
maintained_schools = maintained_schools_list.merge(schools.reset_index(), left_index=True, right_on='URN')


In [53]:
# Merge in SEN, census, CDC, KS2 and KS4 data
maintained_schools = (maintained_schools
                      .merge(sen, on='URN', how='left')
                      .merge(census, on='URN', how='left')
                      .merge(cdc, on='URN', how='left')
                      .merge(ks2, on='URN', how='left')
                      .merge(ks4, on='URN', how='left'))
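
Each left merge above relies on `URN` being unique in the right-hand frame; a repeated `URN` would silently duplicate school rows. The sketch below shows one way to guard against that with the `validate` and `indicator` arguments of `DataFrame.merge`. The frames `base` and `census_toy` and their values are hypothetical stand-ins, not the notebook's data.

```python
import pandas as pd

# Hypothetical stand-ins for maintained_schools and census.
base = pd.DataFrame({"URN": [100000, 100005, 100006],
                     "LAEstab": [2013614, 2021048, 2021100]})
census_toy = pd.DataFrame({"URN": [100000, 100005],
                           "Number of pupils": [271.0, 107.5]})

# validate="one_to_one" raises MergeError if URN repeats on either side;
# indicator=True adds a "_merge" column recording whether a match was found.
merged = base.merge(census_toy, on="URN", how="left",
                    validate="one_to_one", indicator=True)

# Schools with no census row keep NaN values and are flagged "left_only".
unmatched = merged.loc[merged["_merge"] == "left_only", "URN"].tolist()

# A duplicated URN on the right-hand side is rejected rather than fanned out.
dup = pd.concat([census_toy, census_toy.iloc[[0]]])
duplicate_error = None
try:
    base.merge(dup, on="URN", how="left", validate="one_to_one")
except pd.errors.MergeError as exc:
    duplicate_error = str(exc)
```

If every right-hand frame is expected to hold one row per school, the same `validate` argument could be added to each `.merge(...)` call in the chain.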



In [54]:
# Compute columns
maintained_schools['PFI'] = maintained_schools['PFI'].map(lambda x: 'PFI school' if x == 'Y' else 'Non-PFI school')
maintained_schools['Status'] = maintained_schools.apply(lambda df: mappings.map_maintained_school_status(df['OpenDate'], df['CloseDate'], df['Period covered by return (months)'], pd.to_datetime(maintained_schools_year_start_date), pd.to_datetime(maintained_schools_year_end_date)), axis=1)
maintained_schools['School Balance'] = maintained_schools['Total Income   I01 to I18'] - maintained_schools['Total Expenditure  E01 to E32']
maintained_schools['School Financial Position'] = maintained_schools['School Balance'].map(mappings.map_is_surplus_deficit)
maintained_schools['SchoolPhaseType'] = maintained_schools.apply(lambda df: mappings.map_school_phase_type(df['TypeOfEstablishment (code)'], df['Overall Phase']), axis=1)
maintained_schools['Partial Years Present'] = maintained_schools['Period covered by return (months)'].map(lambda x: x != 12)
maintained_schools['Did Not Submit'] = maintained_schools['Did Not Supply flag'].map(lambda x: x == 1)
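
The derived finance columns can be illustrated on a small frame. This is a sketch only: `mappings.map_is_surplus_deficit` is not shown in this workbook, so the rule used here (negative balance is "Deficit", anything else is "Surplus") is an assumption inferred from the output further down, and the figures are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical figures; column names copied from the cell above.
toy = pd.DataFrame(
    {
        "Total Income   I01 to I18": [1_000_000.0, 750_000.0],
        "Total Expenditure  E01 to E32": [1_100_000.0, 700_000.0],
        "Period covered by return (months)": [12, 7],
    },
    index=pd.Index([1, 2], name="URN"),
)

# School balance is income minus expenditure.
balance = toy["Total Income   I01 to I18"] - toy["Total Expenditure  E01 to E32"]

# Assumed rule: negative => "Deficit", otherwise "Surplus". The real mapping
# may treat zero or missing balances differently (here NaN falls to "Surplus").
position = pd.Series(np.where(balance < 0, "Deficit", "Surplus"), index=toy.index)

# A return covering anything other than 12 months counts as a partial year.
partial_year = toy["Period covered by return (months)"] != 12
```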

In [55]:
# Inspect the available column names
maintained_schools.columns.values

array(['LAEstab', 'Phase', 'Overall Phase', 'Lowest age of pupils',
       'Highest age of pupils', 'Type', 'No Pupils',
       '% of pupils eligible for FSM',
       'Period covered by return (months)', 'Did Not Supply flag',
       'Federation', 'Lead school in federation', 'No Teachers',
       'Urban  Rural', 'London Weighting', '% of pupils with EAL',
       '% of pupils who are Boarders', 'PFI', 'No of pupils in 6th form',
       '% of teachers with QTS', 'FTE of Teaching Assistants',
       'FTE of Support Staff', 'FTE of Admin Staff',
       'I01  Funds delegated by the LA',
       'I02  Funding for 6th form students', 'I03  SEN funding',
       'I04  Funding for minority ethnic pupils', 'I05  Pupil Premium',
       'I06  Other government grants', 'I07  Other grants and payments',
       'I08  Income from facilities and services',
       'I09  Income from catering',
       'I10  Receipts from supply teacher insurance claims',
       'I11  Receipts from other insurance claims',


In [56]:
# Rename columns
maintained_schools.rename(columns={
    'E22 Administrative supplies':'Administrative supplies_Administrative supplies (non educational)',
    'E06 Catering staff':'Catering_Catering staff',
    'E25  Catering supplies':'Catering_Catering supplies',
    'I09  Income from catering':'Catering_Income from catering',
    'E21  Exam fees':'Educational supplies_Examination fees',
    'E19  Learning resources (not ICT equipment)':'Educational supplies_Learning resources (not ICT equipment)',
    'E20  ICT learning resources':'IT_ICT learning resources',
    'E05 Administrative and clerical staff':'Non-educational support staff_Administrative and clerical staff',
    # '':'Non-educational support staff_Auditor costs',
    'E07  Cost of other staff':'Non-educational support staff_Other staff',
    'E28a  Bought in professional services - other (except PFI)':'Non-educational support staff_Professional services (non-curriculum)',
    'E30 Direct revenue financing (revenue contributions to capital)':'Other costs_Direct revenue financing',
    'E13  Grounds maintenance and improvement':'Other costs_Grounds maintenance',
    'E08  Indirect employee expenses':'Other costs_Indirect employee expenses',
    'E29  Loan interest':'Other costs_Interest charges for loan and bank',
    'E23  Other insurance premiums':'Other costs_Other insurance premiums',
    'E28b Bought in professional services - other (PFI)':'Other costs_PFI charges',
    'E17  Rates':'Other costs_Rent and rates',
    'E24  Special facilities ':'Other costs_Special facilities',
    'E09  Development and training':'Other costs_Staff development and training',
    'E11  Staff related insurance':'Other costs_Staff-related insurance',
    'E10  Supply teacher insurance':'Other costs_Supply teacher insurance',
    'E14  Cleaning and caretaking':'Premises_Cleaning and caretaking',
    'E12  Building maintenance and improvement':'Premises_Maintenance of premises',
    'E18  Other occupation costs':'Premises_Other occupation costs',
    'E04  Premises staff':'Premises_Premises staff',
    'E26 Agency supply teaching staff':'Teaching and Teaching support staff_Agency supply teaching staff',
    'E03 Education support staff':'Teaching and Teaching support staff_Education support staff',
    'E27  Bought in professional services - curriculum':'Teaching and Teaching support staff_Educational consultancy',
    'E02  Supply teaching staff':'Teaching and Teaching support staff_Supply teaching staff',
    'E01  Teaching Staff':'Teaching and Teaching support staff_Teaching staff',
    'E16  Energy':'Utilities_Energy',
    'E15  Water and sewerage':'Utilities_Water and sewerage: ',
    'PFI': 'PFI School',
    '% of pupils eligible for FSM': 'Percentage Free school meals',
    'No Pupils': 'Number of pupils',
    'I07  Other grants and payments': 'Other grants and payments'
}, inplace=True)
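
Several source column names in the mapping above contain double or trailing spaces. By default `DataFrame.rename` skips keys that do not match any column, so a mistyped key goes unnoticed. The sketch below, on a hypothetical two-column frame, shows how `errors="raise"` surfaces such mismatches.

```python
import pandas as pd

# Hypothetical frame; note the double space in "E16  Energy".
toy = pd.DataFrame({"E16  Energy": [1.0], "No Pupils": [271.0]})

# With the default errors="ignore", a key with a single space matches nothing
# and the frame comes back unchanged.
renamed = toy.rename(columns={"E16 Energy": "Utilities_Energy"})
silently_skipped = "Utilities_Energy" not in renamed.columns

# With errors="raise", the same mismatch becomes a KeyError.
try:
    toy.rename(columns={"E16 Energy": "Utilities_Energy"}, errors="raise")
    raised = False
except KeyError:
    raised = True
```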

In [57]:
maintained_schools.set_index('URN', inplace=True)

In [58]:
maintained_schools

Unnamed: 0_level_0,LAEstab,Phase,Overall Phase,Lowest age of pupils,Highest age of pupils,Type,Number of pupils,Percentage Free school meals,Period covered by return (months),Did Not Supply flag,...,Ks2Progress,AverageAttainment,Progress8Measure,Progress8Banding,Status,School Balance,School Financial Position,SchoolPhaseType,Partial Years Present,Did Not Submit
URN,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
100000,2013614,Infant and junior,Primary,3.0,11.0,Voluntary aided school,271.0,18.1,12,0,...,0.5,,,,Open,-103843.66,Deficit,Primary,False,False
100005,2021048,Nursery,Nursery,2.0,5.0,Local authority nursery school,107.5,38.2,12,0,...,,,,,Open,-159542.44,Deficit,Nursery,False,False
100006,2021100,Pupil referral unit,Pupil referral unit,11.0,16.0,Pupil referral unit,49.0,68.4,12,0,...,,,,,Open,488879.73,Surplus,Pupil referral unit,False,False
100007,2021101,Pupil referral unit,Pupil referral unit,5.0,11.0,Pupil referral unit,19.0,100.0,12,0,...,,,,,Open,-70514.23,Deficit,Pupil referral unit,False,False
100008,2022019,Infant and junior,Primary,3.0,11.0,Community school,350.0,55.1,12,0,...,5.5,,,,Open,-116875.63,Deficit,Primary,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131818,3412230,Infant and junior,Primary,3.0,11.0,Community school,417.0,52.0,12,0,...,-4.3,,,,Open,-121353.13,Deficit,Primary,False,False
132176,3412232,Infant and junior,Primary,3.0,11.0,Voluntary aided school,278.0,45.9,12,0,...,-3.8,,,,Open,89141.77,Surplus,Primary,False,False
132793,3412233,Infant and junior,Primary,4.0,11.0,Voluntary aided school,411.0,28.2,12,0,...,-5.6,,,,Open,-92369.69,Deficit,Primary,False,False
132796,3412234,Infant and junior,Primary,2.0,11.0,Voluntary aided school,473.0,29.6,12,0,...,9.1,,,,Open,-161742.84,Deficit,Primary,False,False


In [59]:
maintained_schools.to_csv('output/pre-processing/maintained_schools.csv')


## Federation Capture




In [60]:
group_links = pd.read_csv('data/alllinksdata20240417.csv', encoding='unicode-escape',  index_col=schemas.groups_index_col, usecols=schemas.groups.keys(), dtype=schemas.groups)

In [61]:
# Select the lead schools from the maintained schools list
federations = maintained_schools[['LAEstab']][maintained_schools['Federation']=='Lead school'].copy()
# Join the GIAS group links data on the shared URN index
federations = federations.join(group_links[['Group Name','Group UID','Closed Date']])


# remove federations with an associated closed date
federations = federations.loc[federations['Closed Date'].isna()]

# federations with a UID listed in the GIAS groups data are referred to as "Hard" federations
# while federations not listed in GIAS are referred to as "Soft" federations.
# Soft federation UIDs are a combination of their URN and LAEstab codes.

# create mask for soft federations
mask = federations['Group UID'].isna()

hard_federations = federations.loc[~mask].copy()
soft_federations = federations.loc[mask].copy()

# define members list for hard federations
group_links['Members'] = group_links.index
hard_members = group_links[['Members','Group UID']].groupby('Group UID').agg(list)

hard_federations = hard_federations.join(hard_members, on='Group UID')

# Rename columns
hard_federations.rename(columns={
    'Group Name': 'FederationName',
    'Group UID': 'FederationUid',
}, inplace=True)

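# For the soft federations, drop the empty GIAS 'Group UID' column and build the
# UID from the URN (index) and LAEstab code instead. Renaming 'Group UID' to
# 'FederationUid' here would create a second column with the same name.
soft_federations = soft_federations.drop(columns=['Group UID'])
soft_federations['FederationUid'] = soft_federations.index.astype(str) + soft_federations['LAEstab'].astype(str)

# Rename columns
soft_federations.rename(columns={
    'Group Name': 'FederationName',
}, inplace=True)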

# TODO - add in soft federation members and names (currently no mapping available)
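
The hard/soft split described in the comments above can be traced on a toy example. All names and values below are hypothetical; only the column names and the UID rule (URN followed by LAEstab) are taken from the cell above.

```python
import pandas as pd

# Two hypothetical lead schools, indexed by URN.
lead_schools = pd.DataFrame(
    {"LAEstab": [2042450, 2061104]},
    index=pd.Index([100234, 100391], name="URN"),
)
# Hypothetical GIAS links: only the first lead school belongs to a listed group.
links = pd.DataFrame(
    {"Group Name": ["Example Federation"] * 2, "Group UID": ["1652", "1652"]},
    index=pd.Index([100234, 100242], name="URN"),
)

feds = lead_schools.join(links[["Group Name", "Group UID"]])
is_soft = feds["Group UID"].isna()  # no GIAS group => soft federation

# Hard federations: collect the member URNs of each GIAS group.
members = links.reset_index().groupby("Group UID")["URN"].agg(list).rename("Members")
hard = feds.loc[~is_soft].join(members, on="Group UID")

# Soft federations: UID is the URN followed by the LAEstab code.
soft = feds.loc[is_soft].copy()
soft["FederationUid"] = soft.index.to_series().astype(str) + soft["LAEstab"].astype(str)
```

Here `hard` holds the one federation listed in the toy links data with its two members, and `soft` holds the remaining lead school with a UID built from its own codes.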

In [62]:
hard_federations.to_csv('output/pre-processing/hard_federations.csv')
hard_federations

Unnamed: 0_level_0,LAEstab,FederationName,FederationUid,Closed Date,Members
URN,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
100234,2042450,"The Viridis Federation of Orchard, Southwold &...",1652,,"[100234, 100242, 131141]"
100250,2042779,The LEAP Federation,17386,,"[100250, 100261, 130303]"
100258,2042864,New Wave Federation,1589,,"[100241, 100248, 100258]"
100263,2043358,Primary Advantage Federation,1473,,"[100224, 100225, 100232, 100263, 100266, 10026..."
101251,3021000,Barnet Early Years Alliance,5603,,"[101251, 101252, 101254]"
...,...,...,...,...,...
125246,9363928,Newlands CofE School Federation,17520,,"[125199, 125246]"
125288,9365206,The Federation of Holy Trinity and Pewley Down...,1025,,"[125288, 136755]"
125473,9367053,Federation of Manor Mead and Walton Leigh Schools,15744,,"[125468, 125473]"
131426,3112092,The Learning and Achieving Federation,17122,,"[102294, 131426]"


In [63]:
soft_federations.to_csv('output/pre-processing/soft_federations.csv')
soft_federations[['LAEstab']]

Unnamed: 0_level_0,LAEstab
URN,Unnamed: 1_level_1
100391,2061104
100472,2071010
105861,3547006
108960,8012073
109417,8221009
109864,8692131
109964,8693026
111246,8965204
113172,8782249
117386,9193006


### Timing (keep at the bottom)

In [64]:
print(f'Processing Time: {time.time() - start_time} seconds')

Processing Time: 106.42713356018066 seconds
