# Analysing the impact of COVID on special educational needs - notebook 2: change by pupil numbers

Data on pupil numbers is at https://explore-education-statistics.service.gov.uk/find-statistics/school-pupils-and-their-characteristics. The raw data can be downloaded, queries can be made, or particular queries re-loaded.  

One query of interest is ‘'Pupil characteristics - Ethnicity and Language' for Known or believed to be English, Known or believed to be other than English, Language unclassified, Non-maintained special school, Pupil referral unit and 4 other filters in England between 2020/21 and 2021/22’ at https://explore-education-statistics.service.gov.uk/data-tables/fast-track/09a7ce09-543a-47f4-bf13-36a85aa20858 

That table provides national figures, but ‘Step 3’ above the table can be edited to change it to all local authorities. 

Likewise, the time period can be edited to the previous five years. 

A version edited for those changes can be accessed at https://explore-education-statistics.service.gov.uk/data-tables/permalink/715a3b55-7f69-4526-b553-cc1b3d120f83 (a URL generated by using the ‘generate link’ option).

This lets us put SEN figures into the context of pupil numbers (alternative hypothesis: any rise in SEN numbers is simply due to a rise in pupil numbers) and of pupils for whom English is a second language (alternative hypothesis: any rise in language support is simply due to a rise in ESL pupils).


## Install the packages we need

In [None]:
#load rmagic to be able to run R
#from https://towardsdatascience.com/how-to-use-r-in-google-colab-b6e02d736497
%load_ext rpy2.ipython

In [None]:
%%R
#install the tidyverse package
install.packages('tidyverse')
library('tidyverse')
#install the downloader package: https://cran.r-project.org/web/packages/downloader/index.html
install.packages("downloader")
library(downloader)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.7      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [None]:
%%R
install.packages('stringi')
library(stringi)


## Import the zip file, extract the data

The data is published in a zip file that can be accessed from the 'Download all data' button at https://explore-education-statistics.service.gov.uk/find-statistics/school-pupils-and-their-characteristics#explore-data-and-files


In [None]:
%%R
#store the URL for the data zip file which is found by right-clicking on 
#'Download all data' at https://explore-education-statistics.service.gov.uk/find-statistics/school-pupils-and-their-characteristics#explore-data-and-files
zipurl <- "https://content.explore-education-statistics.service.gov.uk/api/releases/cf516998-1dc1-411d-8225-13f6320547fb/files"


In [None]:
%%R
#download the zip file from the url
downloader::download(zipurl, destfile = "datasets.zip", mode = "wb")
#unzip it into the current working directory
unzip("datasets.zip", exdir = "./")


Error in download.file(url, method = method, ...) : 
  cannot open URL 'https://content.explore-education-statistics.service.gov.uk/api/releases/cf516998-1dc1-411d-8225-13f6320547fb/files'

RInterpreterError: ignored
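
The download above failed when this notebook was run, so the later cells depend on the zip being fetched some other way. A minimal sketch of one way to guard the step follows. `try_download()` is a hypothetical helper, not part of the downloader package, and it assumes base R's `download.file()` is an acceptable stand-in for `downloader::download()`.

```r
# Hypothetical helper: returns TRUE if the file was fetched, FALSE otherwise
try_download <- function(url, destfile) {
  status <- tryCatch(
    # download.file() returns 0 on success
    utils::download.file(url, destfile = destfile, mode = "wb", quiet = TRUE),
    error = function(e) 1L,
    warning = function(w) 1L
  )
  identical(as.integer(status), 0L) && file.exists(destfile)
}

# Example with a URL that cannot be opened (a local file that does not exist)
ok <- try_download("file:///no/such/file.zip", tempfile(fileext = ".zip"))
if (!ok) message("Download failed: fetch the zip manually and upload it instead.")
```

With the real `zipurl`, the same check could decide whether to run `unzip()` or stop with a clear message.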

In [None]:

%%R 
#import the data extracted from the zip
#this is the name in 2022
data22 <- "data/spc_pupils_yeargroup_and_gender_.csv"
data <- readr::read_csv(data22)

data

## `filter()` the data to what we need

This is a very large dataframe, and we don't need all of it. Firstly, we're only interested in primary schools.

In [None]:
%%R
#filter to where column values match specified string
slcdf <- filter(data, phase_type_grouping == "State-funded primary")
print(slcdf)

### Filter: totals only

We also don't need the gender categories - just the total pupils.

In [None]:
%%R
#show the unique values and their frequency
table(slcdf['gender'])

In [None]:
%%R
#filter to where column values match specified string
slcdf_filter3 <- filter(slcdf, gender == "Total")
print(slcdf_filter3)

# A tibble: 23,625 × 17
   time_period time_identifier geographic_level country_code country_name
         <dbl> <chr>           <chr>            <chr>        <chr>       
 1      202122 Academic year   Local authority  E92000001    England     
 2      202122 Academic year   Local authority  E92000001    England     
 3      202122 Academic year   Local authority  E92000001    England     
 4      202122 Academic year   Local authority  E92000001    England     
 5      202122 Academic year   Local authority  E92000001    England     
 6      202122 Academic year   Local authority  E92000001    England     
 7      202122 Academic year   Local authority  E92000001    England     
 8      202122 Academic year   Local authority  E92000001    England     
 9      202122 Academic year   Local authority  E92000001    England     
10      202122 Academic year   Local authority  E92000001    England     
# … with 23,615 more rows, and 12 more variables: region_name <chr>,
#   region_code <ch

### Filter: local authorities only

We also only want to look at data for local authorities. 

In [None]:
%%R
#show the unique values and their frequency
table(slcdf_filter3['geographic_level'])

geographic_level
Local authority        National        Regional 
          22155             147            1323 


In [None]:
%%R
#filter to where column values match specified string
slcdf_filter4 <- filter(slcdf_filter3, geographic_level == "Local authority")
print(slcdf_filter4)

# A tibble: 22,155 × 17
   time_period time_identifier geographic_level country_code country_name
         <dbl> <chr>           <chr>            <chr>        <chr>       
 1      202122 Academic year   Local authority  E92000001    England     
 2      202122 Academic year   Local authority  E92000001    England     
 3      202122 Academic year   Local authority  E92000001    England     
 4      202122 Academic year   Local authority  E92000001    England     
 5      202122 Academic year   Local authority  E92000001    England     
 6      202122 Academic year   Local authority  E92000001    England     
 7      202122 Academic year   Local authority  E92000001    England     
 8      202122 Academic year   Local authority  E92000001    England     
 9      202122 Academic year   Local authority  E92000001    England     
10      202122 Academic year   Local authority  E92000001    England     
# … with 22,145 more rows, and 12 more variables: region_name <chr>,
#   region_code <ch

### Filter: year 1 only

Our analysis focuses on year 1, so we can narrow to that too.

In [None]:
%%R
#filter to where column values match specified string
yr1_totals <- filter(slcdf_filter4, ncyear == "Year 1")
print(yr1_totals)

# A tibble: 1,055 × 17
   time_period time_identifier geographic_level country_code country_name
         <dbl> <chr>           <chr>            <chr>        <chr>       
 1      202122 Academic year   Local authority  E92000001    England     
 2      202122 Academic year   Local authority  E92000001    England     
 3      202122 Academic year   Local authority  E92000001    England     
 4      202122 Academic year   Local authority  E92000001    England     
 5      202122 Academic year   Local authority  E92000001    England     
 6      202122 Academic year   Local authority  E92000001    England     
 7      202122 Academic year   Local authority  E92000001    England     
 8      202122 Academic year   Local authority  E92000001    England     
 9      202122 Academic year   Local authority  E92000001    England     
10      202122 Academic year   Local authority  E92000001    England     
# … with 1,045 more rows, and 12 more variables: region_name <chr>,
#   region_code <chr>
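
The four filters above could also be written as a single step. The sketch below assumes dplyr is installed and uses a small invented data frame in place of the published file. Only the column names are taken from the notebook.

```r
library(dplyr)

# Toy stand-in for the published data: values invented for illustration
toy <- data.frame(
  phase_type_grouping = c("State-funded primary", "State-funded primary",
                          "State-funded secondary", "State-funded primary"),
  gender = c("Total", "Boys", "Total", "Total"),
  geographic_level = c("Local authority", "Local authority",
                       "Local authority", "National"),
  ncyear = c("Year 1", "Year 1", "Year 7", "Year 1"),
  headcount = c(100, 52, 90, 1000)
)

# All four conditions in one filter() call: commas act as AND
yr1_toy <- toy %>%
  filter(
    phase_type_grouping == "State-funded primary",
    gender == "Total",
    geographic_level == "Local authority",
    ncyear == "Year 1"
  )
print(yr1_toy)
```

Keeping the filters separate, as the notebook does, makes it easier to check the row count after each step. The combined form is shorter once the conditions are settled.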

## Remove columns

We also have a number of columns we don't need.

In [None]:
%%R
#show the columns
colnames(yr1_totals)

 [1] "time_period"         "time_identifier"     "geographic_level"   
 [4] "country_code"        "country_name"        "region_name"        
 [7] "region_code"         "old_la_code"         "la_name"            
[10] "new_la_code"         "phase_type_grouping" "gender"             
[13] "ncyear"              "full_time"           "part_time"          
[16] "headcount"           "fte"                


Some columns only have one value.

In [None]:
%%R
#count the unique values in the time_identifier column
print(table(yr1_totals['time_identifier']))
#repeat for the geographic level column
print(table(yr1_totals['geographic_level']))
#repeat for country_name
print(table(yr1_totals['country_name']))
#repeat for country_code
print(table(yr1_totals['country_code']))
#repeat for phase_type_grouping
print(table(yr1_totals['phase_type_grouping']))
#repeat for gender
print(table(yr1_totals['gender']))
#repeat for part time
print(table(yr1_totals['part_time']))

time_identifier
Academic year 
         1055 
geographic_level
Local authority 
           1055 
country_name
England 
   1055 
country_code
E92000001 
     1055 
phase_type_grouping
State-funded primary 
                1055 
gender
Total 
 1055 
part_time
   0    1 
1054    1 
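
The same check can be done for every column at once instead of one `table()` call per column. This is a sketch in base R on an invented data frame. The column names echo the notebook, the values do not come from the real data.

```r
# Toy data frame: values invented for illustration
toy <- data.frame(
  time_identifier = c("Academic year", "Academic year", "Academic year"),
  la_name = c("Derby", "Rutland", "York"),
  part_time = c(0, 0, 1),
  stringsAsFactors = FALSE
)

# Count the distinct values in each column
n_unique <- vapply(toy, function(col) length(unique(col)), integer(1))
print(n_unique)

# Names of the columns that hold a single repeated value
constant_cols <- names(n_unique)[n_unique == 1]
print(constant_cols)
```

A column such as `part_time`, with one stray value, is not flagged by this approach, so it still needs the manual look taken above.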


We remove those columns by using `select()` and a minus before the columns we want to exclude.

Note: we remove `part_time` here because all but one of the 1,055 rows have 0 in that column; the one exception is Kent, with 1 part-time pupil out of almost 19,000.

We also remove `full_time` and `fte` because each, with that one exception, duplicates `headcount`, and headcount (pupils, regardless of full or part time) is the metric we want to use for Kent too.

In [None]:
%%R
#remove the specified columns
slcdf_col_filter1 <- select(yr1_totals, 
                            -c(time_identifier, 
                               geographic_level,
                               country_name,
                               country_code,
                               phase_type_grouping,
                               gender,
                               part_time,
                               fte,
                               full_time))
slcdf_col_filter1

# A tibble: 1,055 × 8
   time_period region_name    region_code old_la_code la_name new_la_code ncyear
         <dbl> <chr>          <chr>             <dbl> <chr>   <chr>       <chr> 
 1      202122 East of Engla… E12000006           919 Hertfo… E10000015   Year 1
 2      202122 North West     E12000002           344 Wirral  E08000015   Year 1
 3      202122 South West     E12000009           800 Bath a… E06000022   Year 1
 4      202122 South West     E12000009           865 Wiltsh… E06000054   Year 1
 5      202122 East of Engla… E12000006           821 Luton   E06000032   Year 1
 6      202122 East of Engla… E12000006           822 Bedford E06000055   Year 1
 7      202122 London         E12000007           319 Sutton  E09000029   Year 1
 8      202122 London         E12000007           301 Barkin… E09000002   Year 1
 9      202122 North West     E12000002           358 Traffo… E08000009   Year 1
10      202122 North West     E12000002           889 Blackb… E06000008   Year 1
# … wi

## Reshape long to wide using `spread()`

We now reshape to make that column of years into the column headings, so there's only one row per LA, and we can more easily calculate year on year change.

Note that R doesn't treat number-only column names as valid names, so the output shows each one wrapped in backticks.

In [None]:
%%R
#specify we want to convert the time_period column values to column names 
#and insert the values from the headcount column underneath
yr1_wide <- slcdf_col_filter1 %>% spread(time_period, headcount)
#show the results
yr1_wide[c(4,7,8,9,10,11,12,13)]

# A tibble: 155 × 8
   la_name        `201516` `201617` `201718` `201819` `201920` `202021` `202122`
   <chr>             <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 Derbyshire         8503     8733     8682     8341     8214     8152     8145
 2 Derby              3330     3320     3456     3344     3220     3211     3222
 3 Leicestershire     7688     7961     7889     7750     7596     7822     7808
 4 Leicester          4574     4739     4755     4672     4613     4546     4541
 5 Rutland             419      403      406      401      396      398      382
 6 Nottinghamshi…     9445     9662     9770     9432     9284     9177     9348
 7 Nottingham         3630     3755     3771     3761     3618     3628     3729
 8 Lincolnshire       7989     8124     8082     7736     7895     7731     7787
 9 Northamptonsh…     9226     9389     9294     9130     8877     8892       NA
10 North Northam…       NA       NA       NA       NA       NA       NA     4072
# … with
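
The tidyr documentation describes `spread()` as superseded by `pivot_wider()`. A sketch of the equivalent call, assuming a tidyr version that includes `pivot_wider()`, is shown below on a four-row extract whose figures are copied from the output above.

```r
library(tidyr)

# Long-format extract: one row per local authority per year
toy_long <- data.frame(
  la_name = c("Derby", "Derby", "Rutland", "Rutland"),
  time_period = c(202021, 202122, 202021, 202122),
  headcount = c(3211, 3222, 398, 382)
)

# names_from supplies the new column names, values_from fills the cells
toy_wide <- pivot_wider(toy_long, names_from = time_period, values_from = headcount)
print(toy_wide)
```

In the notebook this would correspond to `pivot_wider(slcdf_col_filter1, names_from = time_period, values_from = headcount)`.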

## Add a grand total

We will want a national change figure, too. So let's add that row.

In [None]:
%%R
#check how many rows so we don't overwrite any
nrow(yr1_wide)

[1] 155


In [None]:
%%R
#create a vector to contain the totals
#we start with six zeros as placeholders, one for each of the six non-year columns
#the first will later be replaced with the text 'Total'
#but for now we use numbers so the totals aren't stored as strings
totalsvec <- c(0,0,0,0,0,0)
#loop through the other column names
for (i in colnames(yr1_wide)[seq(7,13)]){
    #print the column name we're working with
    print(i)
    #add all the numbers in that column
    coltotal <- sum(yr1_wide[i][!is.na(yr1_wide[i])])
    print(coltotal)
    #add it to the vector
    totalsvec <- c(totalsvec,coltotal)
}
print(totalsvec)

[1] "201516"
[1] 641959
[1] "201617"
[1] 653309
[1] "201718"
[1] 652347
[1] "201819"
[1] 636409
[1] "201920"
[1] 624344
[1] "202021"
[1] 619221
[1] "202122"
[1] 624147
 [1]      0      0      0      0      0      0 641959 653309 652347 636409
[11] 624344 619221 624147
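
The loop above can be cross-checked with `colSums()`, which sums every selected column in one call and can skip `NA` values. A minimal sketch, using three rows whose figures are copied from the wide table above:

```r
# Three rows from the wide table, including the NA values for the
# two Northamptonshire authorities
toy_wide <- data.frame(
  la_name = c("Northamptonshire", "North Northamptonshire", "Rutland"),
  `202021` = c(8892, NA, 398),
  `202122` = c(NA, 4072, 382),
  check.names = FALSE
)

# Pick out the year columns by their six-digit names
year_cols <- grep("^[0-9]{6}$", colnames(toy_wide), value = TRUE)

# Sum each year column, ignoring missing values
totals <- colSums(toy_wide[year_cols], na.rm = TRUE)
print(totals)
```

Selecting columns by name pattern rather than by position means the code keeps working if a column is added or removed earlier in the notebook.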


In [None]:
%%R
#transpose the vector and convert to a dataframe
newdf <- as.data.frame(t(totalsvec))
print(newdf)
#assign the column names to that dataframe
colnames(newdf) <- colnames(yr1_wide)
print(newdf)
#change the first cell to 'total'
newdf[1,1] <- "Total"
print(newdf)

  V1 V2 V3 V4 V5 V6     V7     V8     V9    V10    V11    V12    V13
1  0  0  0  0  0  0 641959 653309 652347 636409 624344 619221 624147
  region_name region_code old_la_code la_name new_la_code ncyear 201516 201617
1           0           0           0       0           0      0 641959 653309
  201718 201819 201920 202021 202122
1 652347 636409 624344 619221 624147
  region_name region_code old_la_code la_name new_la_code ncyear 201516 201617
1       Total           0           0       0           0      0 641959 653309
  201718 201819 201920 202021 202122
1 652347 636409 624344 619221 624147


In [None]:
%%R
#bind the two dataframes together
yr1_wtotal <- rbind(yr1_wide,newdf)
#and check rows 152 to 155, the last rows of the original data (the new total row is row 156)
print(yr1_wtotal[seq(152,155),] )
#overwrite the original dataframe
yr1_wide <- yr1_wtotal

# A tibble: 4 × 13
  region_name        region_code old_la_code la_name new_la_code ncyear `201516`
  <chr>              <chr>             <dbl> <chr>   <chr>       <chr>     <dbl>
1 Yorkshire and The… E12000003           812 North … E06000012   Year 1     1972
2 Yorkshire and The… E12000003           813 North … E06000013   Year 1     1935
3 Yorkshire and The… E12000003           815 North … E10000023   Year 1     6197
4 Yorkshire and The… E12000003           816 York    E06000014   Year 1     1997
# … with 6 more variables: `201617` <dbl>, `201718` <dbl>, `201819` <dbl>,
#   `201920` <dbl>, `202021` <dbl>, `202122` <dbl>


## Calculate year on year changes

Now we can add new columns with the year on year changes.

In [None]:
%%R
#remind ourselves of the column names
colnames(yr1_wide)

 [1] "region_name" "region_code" "old_la_code" "la_name"     "new_la_code"
 [6] "ncyear"      "201516"      "201617"      "201718"      "201819"     
[11] "201920"      "202021"      "202122"     


In [None]:
%%R
#subtract 2020/21 from 2021/22 to get the change - and store in a new column
yr1_wide['YOY21to22'] <- yr1_wide['202122'] - yr1_wide['202021']
#show the results to check they tally
yr1_wide[, c(4,12,13,14)]

# A tibble: 156 × 4
   la_name                `202021` `202122` YOY21to22
   <chr>                     <dbl>    <dbl>     <dbl>
 1 Derbyshire                 8152     8145        -7
 2 Derby                      3211     3222        11
 3 Leicestershire             7822     7808       -14
 4 Leicester                  4546     4541        -5
 5 Rutland                     398      382       -16
 6 Nottinghamshire            9177     9348       171
 7 Nottingham                 3628     3729       101
 8 Lincolnshire               7731     7787        56
 9 Northamptonshire           8892       NA        NA
10 North Northamptonshire       NA     4072        NA
# … with 146 more rows


### Calculate them as percentages

Now add another column with those values as percentage changes.

In [None]:
%%R
#divide the change by the older 2020/21 figure to get the proportional change - and store in a new column
yr1_wide['YOYperc21to22'] <- yr1_wide['YOY21to22']/yr1_wide['202021']
#show the results to check they tally - the first figure should be around -0.09% or -0.0009 when rounded
yr1_wide[, c(4,12,13,14,15)]

# A tibble: 156 × 5
   la_name                `202021` `202122` YOY21to22 YOYperc21to22
   <chr>                     <dbl>    <dbl>     <dbl>         <dbl>
 1 Derbyshire                 8152     8145        -7     -0.000859
 2 Derby                      3211     3222        11      0.00343 
 3 Leicestershire             7822     7808       -14     -0.00179 
 4 Leicester                  4546     4541        -5     -0.00110 
 5 Rutland                     398      382       -16     -0.0402  
 6 Nottinghamshire            9177     9348       171      0.0186  
 7 Nottingham                 3628     3729       101      0.0278  
 8 Lincolnshire               7731     7787        56      0.00724 
 9 Northamptonshire           8892       NA        NA     NA       
10 North Northamptonshire       NA     4072        NA     NA       
# … with 146 more rows
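
As a worked check of the calculation, the sketch below recomputes the Derbyshire figure from the table above. `pct_change()` is a hypothetical helper name introduced here for illustration.

```r
# Year-on-year change as a proportion of the earlier year's figure
pct_change <- function(new, old) (new - old) / old

# Derbyshire: 8152 pupils in 2020/21 and 8145 in 2021/22
derbyshire <- pct_change(8145, 8152)
print(round(derbyshire, 6))

# Multiply by 100 to express the change as a percentage
print(round(derbyshire * 100, 2))

# A missing value in either year gives NA, as in the Northamptonshire rows
print(pct_change(NA, 8892))
```

The values stored in `YOYperc21to22` are proportions, so they need multiplying by 100 before being reported as percentages.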


## Repeat for other years

Can we codify this process and loop to create the other columns?

In [None]:
%%R
#loop through the column names from 8th to 13th position
for (i in seq(8,13)){
    print(i)
    thisyr <- colnames(yr1_wide)[i]
    previousyr <- colnames(yr1_wide)[i-1]
    print(thisyr)
    print(previousyr)
    #subtract the previous year from the current one to get the change - and store 
    YOYchange <- yr1_wide[thisyr] - yr1_wide[previousyr]
    YOYpercChange <- YOYchange/yr1_wide[previousyr]
    #extract the two digits for the later year
    toyr <- substr(thisyr,5,6)
    fromyr <- substr(previousyr,5,6)
    #create column names from these
    colname <- paste0("YOY",fromyr,"to",toyr)
    perccolname <- paste0("YOYperc",fromyr,"to",toyr)
    #create a new column with that name and the values calculated
    yr1_wide[colname] <- YOYchange
    yr1_wide[perccolname] <- YOYpercChange
}

[1] 8
[1] "201617"
[1] "201516"
[1] 9
[1] "201718"
[1] "201617"
[1] 10
[1] "201819"
[1] "201718"
[1] 11
[1] "201920"
[1] "201819"
[1] 12
[1] "202021"
[1] "201920"
[1] 13
[1] "202122"
[1] "202021"
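
The loop above relies on the year columns sitting in positions 7 to 13. A sketch of a variant that finds them by name instead follows, on a two-row extract with figures copied from the earlier output. It is an alternative under that assumption, not the approach used in this notebook.

```r
# Two local authorities, three years
toy <- data.frame(
  la_name = c("Derby", "Rutland"),
  `201920` = c(3220, 396),
  `202021` = c(3211, 398),
  `202122` = c(3222, 382),
  check.names = FALSE
)

# Find the year columns by pattern (six digits) rather than by position
year_cols <- grep("^[0-9]{6}$", colnames(toy), value = TRUE)

# Pair each year with the one before it
for (k in seq(2, length(year_cols))) {
  thisyr <- year_cols[k]
  previousyr <- year_cols[k - 1]
  # Build a suffix such as "20to21" from the last two digits of each year
  suffix <- paste0(substr(previousyr, 5, 6), "to", substr(thisyr, 5, 6))
  change <- toy[[thisyr]] - toy[[previousyr]]
  toy[[paste0("YOY", suffix)]] <- change
  toy[[paste0("YOYperc", suffix)]] <- change / toy[[previousyr]]
}
print(colnames(toy))
```

Because `year_cols` is computed before the loop starts, the new `YOY` columns never get mistaken for year columns.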


In [None]:
%%R
yr1_wide

# A tibble: 156 × 25
   region_name   region_code old_la_code la_name     new_la_code ncyear `201516`
   <chr>         <chr>             <dbl> <chr>       <chr>       <chr>     <dbl>
 1 East Midlands E12000004           830 Derbyshire  E10000007   Year 1     8503
 2 East Midlands E12000004           831 Derby       E06000015   Year 1     3330
 3 East Midlands E12000004           855 Leicesters… E10000018   Year 1     7688
 4 East Midlands E12000004           856 Leicester   E06000016   Year 1     4574
 5 East Midlands E12000004           857 Rutland     E06000017   Year 1      419
 6 East Midlands E12000004           891 Nottingham… E10000024   Year 1     9445
 7 East Midlands E12000004           892 Nottingham  E06000018   Year 1     3630
 8 East Midlands E12000004           925 Lincolnshi… E10000019   Year 1     7989
 9 East Midlands E12000004           928 Northampto… E10000021   Year 1     9226
10 East Midlands E12000004           940 North Nort… E06000061   Year 1       NA
# … wit

## Export results

Let's export as a CSV.

In [None]:
%%R
#export as a CSV
write.csv(yr1_wide, "yr1_totals.csv")
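
By default `write.csv()` also writes the row names, which show up as an extra first column when the file is read back in. The sketch below shows the difference on an invented two-row data frame written to temporary files.

```r
# Toy data frame: figures copied from the 2021/22 column above
toy <- data.frame(la_name = c("Derby", "Rutland"), headcount = c(3222, 382))

# Default: row names are written as an unnamed first column
with_rownames <- tempfile(fileext = ".csv")
write.csv(toy, with_rownames)
cols_default <- colnames(read.csv(with_rownames))
print(cols_default)

# row.names = FALSE keeps only the data columns
without_rownames <- tempfile(fileext = ".csv")
write.csv(toy, without_rownames, row.names = FALSE)
cols_clean <- colnames(read.csv(without_rownames))
print(cols_clean)
```

Whether to keep the row names is a matter of preference. Dropping them avoids a stray index column in any later merge.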

## Merge with the SEN analysis

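A glance at the totals already suggests that rising pupil numbers cannot account for the increase in demand for SEN support: the Year 1 headcount fell from 641,959 in 2015/16 to 619,221 in 2020/21, and rose by less than 1% to 624,147 in 2021/22.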

But we still need to merge it to put those increases into context.

We've uploaded the results of the analysis from the first notebook to Google Drive and published as a CSV, which is imported below. The data can also be generated by the other notebook and uploaded to the Files area on the left if needed.

In [None]:
%%R
#store the URL for the CSV
analysiscsvurl <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vTRm8tw8O_QO_o4bk4Y_3sUlX0N5ukNrCeHj08RyShS3cAEmKTZdvB8g48zDHbl8l_dmDtjOUUrB12L/pub?gid=814230511&single=true&output=csv"
#import the CSV into a dataframe
analysisdata <- readr::read_csv(analysiscsvurl)
analysisdata

Rows: 155 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): la_name
dbl (20): index, 201516, 201617, 201718, 201819, 201920, 202021, 202122, YOY...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 155 × 21
   index la_name  `201516` `201617` `201718` `201819` `201920` `202021` `202122`
   <dbl> <chr>       <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1     1 Barking…      241      299      285      287      282      316      319
 2     2 Barnet        254      239      241      250      234      209      244
 3     3 Barnsley      114      147      141      125      129      107      143
 4     4 Bath an…       82      101       95      118       96      115      157
 5     5 Bedford        93       84      115      110      107      130      112
 6     6 Bexley        209      215      

Let's hope the LA names are consistent. If they are, we should get a dataframe with 155 rows: the number in the analysis CSV. The pupil numbers dataframe has 156 rows, but the extra one is the total row, which has `0` in `la_name` and so should drop out of an inner join.

In [None]:
%%R
#merge the two dataframes on the la_name column - all = F makes it an inner join
merged_data <- merge(analysisdata, yr1_wide, by = "la_name", all = F, suffixes = c("_language","_pupils"))
#check results
print(nrow(merged_data))
print(colnames(merged_data))

[1] 155
 [1] "la_name"                "index"                  "201516_language"       
 [4] "201617_language"        "201718_language"        "201819_language"       
 [7] "201920_language"        "202021_language"        "202122_language"       
[10] "YOY21to22_language"     "YOYperc21to22_language" "YOY16to17_language"    
[13] "YOYperc16to17_language" "YOY17to18_language"     "YOYperc17to18_language"
[16] "YOY18to19_language"     "YOYperc18to19_language" "YOY19to20_language"    
[19] "YOYperc19to20_language" "YOY20to21_language"     "YOYperc20to21_language"
[22] "region_name"            "region_code"            "old_la_code"           
[25] "new_la_code"            "ncyear"                 "201516_pupils"         
[28] "201617_pupils"          "201718_pupils"          "201819_pupils"         
[31] "201920_pupils"          "202021_pupils"          "202122_pupils"         
[34] "YOY21to22_pupils"       "YOYperc21to22_pupils"   "YOY16to17_pupils"      
[37] "YOYperc16to17_pupils"   "Y
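
A row count of 155 suggests the names matched, but an inner join drops unmatched rows silently, so it can be worth listing any names that appear on one side only. The sketch below uses `setdiff()` on invented name vectors. The mismatched spelling is made up for illustration.

```r
# Toy name vectors: the spellings are invented to illustrate a mismatch
sen_names <- c("Derby", "Rutland", "Bristol, City of")
pupil_names <- c("Derby", "Rutland", "Bristol City of", "0")

# Names in the SEN data with no match in the pupil data
missing_from_pupils <- setdiff(sen_names, pupil_names)
print(missing_from_pupils)

# Names in the pupil data with no match in the SEN data
missing_from_sen <- setdiff(pupil_names, sen_names)
print(missing_from_sen)
```

In the notebook the equivalent calls would be `setdiff(analysisdata$la_name, yr1_wide$la_name)` and the reverse.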

## Export merged results

Let's export again.

In [None]:
%%R
#export as a CSV
write.csv(merged_data, "merged_sen_data.csv")

## Calculate SEN numbers as a percentage of pupils

We can now divide the number of pupils needing language support in year 1 by the total number of pupils in that year.

We will do that in a loop from the start.

In [None]:
%%R
print(colnames(merged_data))

 [1] "la_name"                "index"                  "201516_language"       
 [4] "201617_language"        "201718_language"        "201819_language"       
 [7] "201920_language"        "202021_language"        "202122_language"       
[10] "YOY21to22_language"     "YOYperc21to22_language" "YOY16to17_language"    
[13] "YOYperc16to17_language" "YOY17to18_language"     "YOYperc17to18_language"
[16] "YOY18to19_language"     "YOYperc18to19_language" "YOY19to20_language"    
[19] "YOYperc19to20_language" "YOY20to21_language"     "YOYperc20to21_language"
[22] "region_name"            "region_code"            "old_la_code"           
[25] "new_la_code"            "ncyear"                 "201516_pupils"         
[28] "201617_pupils"          "201718_pupils"          "201819_pupils"         
[31] "201920_pupils"          "202021_pupils"          "202122_pupils"         
[34] "YOY21to22_pupils"       "YOYperc21to22_pupils"   "YOY16to17_pupils"      
[37] "YOYperc16to17_pupils"   "YOY17to18

In [None]:
%%R
#loop through the column names from 3rd to 9th position - the SEN totals for each year
for (i in seq(3,9)){
    print(i)
    sentotal <- colnames(merged_data)[i]
    pupiltotal <- colnames(merged_data)[i+24]
    print(sentotal)
    print(pupiltotal)
    #divide the part by the whole to get a proportion (multiply by 100 for a percentage)
    senperc <- merged_data[sentotal]/merged_data[pupiltotal]
    #extract the year
    yearonly <- substr(sentotal,1,6)
    #create column names from these
    colname <- paste0(yearonly,"_senperc")
    #create a new column with that name and the values calculated
    merged_data[colname] <- senperc
}

[1] 3
[1] "201516_language"
[1] "201516_pupils"
[1] 4
[1] "201617_language"
[1] "201617_pupils"
[1] 5
[1] "201718_language"
[1] "201718_pupils"
[1] 6
[1] "201819_language"
[1] "201819_pupils"
[1] 7
[1] "201920_language"
[1] "201920_pupils"
[1] 8
[1] "202021_language"
[1] "202021_pupils"
[1] 9
[1] "202122_language"
[1] "202122_pupils"
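
The loop above pairs each `_language` column with the `_pupils` column 24 positions later, which depends on the column order staying fixed. A sketch of pairing them by name instead follows, on an invented data frame with round numbers. It is an alternative under that assumption, not the method used above.

```r
# Toy merged data: values invented so the results are easy to check
toy <- data.frame(
  la_name = c("A", "B"),
  `202021_language` = c(10, 30),
  `202122_language` = c(12, 20),
  `202021_pupils` = c(100, 200),
  `202122_pupils` = c(120, 200),
  check.names = FALSE
)

# Find the SEN total columns by their suffix
language_cols <- grep("^[0-9]{6}_language$", colnames(toy), value = TRUE)

for (sentotal in language_cols) {
  # Derive the matching pupil column name from the SEN column name
  pupiltotal <- sub("_language$", "_pupils", sentotal)
  yearonly <- substr(sentotal, 1, 6)
  toy[[paste0(yearonly, "_senperc")]] <- toy[[sentotal]] / toy[[pupiltotal]]
}
print(toy[, grep("_senperc$", colnames(toy))])
```

If a pupil column were missing, `toy[[pupiltotal]]` would be `NULL` and the assignment would fail, which is easier to spot than a silent mispairing by position.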


In [None]:
%%R
#check the column names now
colnames(merged_data)

 [1] "la_name"                "index"                  "201516_language"       
 [4] "201617_language"        "201718_language"        "201819_language"       
 [7] "201920_language"        "202021_language"        "202122_language"       
[10] "YOY21to22_language"     "YOYperc21to22_language" "YOY16to17_language"    
[13] "YOYperc16to17_language" "YOY17to18_language"     "YOYperc17to18_language"
[16] "YOY18to19_language"     "YOYperc18to19_language" "YOY19to20_language"    
[19] "YOYperc19to20_language" "YOY20to21_language"     "YOYperc20to21_language"
[22] "region_name"            "region_code"            "old_la_code"           
[25] "new_la_code"            "ncyear"                 "201516_pupils"         
[28] "201617_pupils"          "201718_pupils"          "201819_pupils"         
[31] "201920_pupils"          "202021_pupils"          "202122_pupils"         
[34] "YOY21to22_pupils"       "YOYperc21to22_pupils"   "YOY16to17_pupils"      
[37] "YOYperc16to17_pupils"   "YOY17to18

In [None]:
%%R
#show the columns for the latest year
print(merged_data[,c(9,33,52)])

    202122_language 202122_pupils 202122_senperc
1               319          3378     0.09443458
2               244          4052     0.06021718
3               143          2721     0.05255421
4               157          1854     0.08468177
5               112          2294     0.04882302
6               219          3039     0.07206318
7              1114         14999     0.07427162
8               281          2023     0.13890262
9               182          1492     0.12198391
10              290          3950     0.07341772
11               NA            NA             NA
12              338          3852     0.08774663
13               73          1349     0.05411416
14              529          7376     0.07171909
15              269          3458     0.07779063
16              176          2442     0.07207207
17              388          4997     0.07764659
18              266          4016     0.06623506
19              387          6206     0.06235901
20              152 

## Export merged data with percentages added

Now we can export that.

In [None]:
%%R
#export as a CSV
write.csv(merged_data, "merged_sen_data_withperc.csv")