# Analysing the impact of COVID on special educational needs

Go to [National statistics on special educational needs in England](https://explore-education-statistics.service.gov.uk/find-statistics/special-educational-needs-in-england) and click 'Download all data', then unzip the download. The file we need is called `sen_ncyear_.csv`. It has a column for each year group (column Q is `nc_year_1`) and a column (M) for `primary_need`, which you can filter down to 'Speech, Language and Communications needs'.


## Install the packages we need

In [1]:
#load rmagic to be able to run R
#from https://towardsdatascience.com/how-to-use-r-in-google-colab-b6e02d736497
%load_ext rpy2.ipython

In [77]:
%%R
#install the tidyverse package
install.packages('tidyverse')
library('tidyverse')
#install the downloader package: https://cran.r-project.org/web/packages/downloader/index.html
install.packages("downloader")
library(downloader)
#this helps us calculate grand totals later
install.packages('janitor')
library(janitor)


In [3]:
%%R
install.packages('stringi')
library(stringi)




## Import the zip file, extract the data

The data is published in a zip file that can be accessed from the 'Download all data' button at https://explore-education-statistics.service.gov.uk/find-statistics/special-educational-needs-in-england/2020-21#dataDownloads-1


In [4]:
%%R
#store the URL for the data zip file, which is found by right-clicking on 
#'Download all data' at https://explore-education-statistics.service.gov.uk/find-statistics/special-educational-needs-in-england/2020-21#dataDownloads-1
zipurl <- "https://content.explore-education-statistics.service.gov.uk/api/releases/0a424edd-2bc7-45aa-a45e-5ae6dc4f23ab/files"
#this is the 21/22 data - it overwrites the URL above, so this is the release that gets downloaded
zipurl <- "https://content.explore-education-statistics.service.gov.uk/api/releases/daf8d64d-ce21-4f21-c28c-08da3ee963bf/files"


In [5]:
%%R
#download the zip file from the url (mode="wb" keeps the binary file intact)
downloader::download(zipurl, destfile="datasets.zip", mode="wb")
#unzip it into the current working directory
unzip("datasets.zip", exdir = "./")






In [6]:

%%R 
#store the path to the CSV extracted from the zip - this was the name in 2021
data21 <- "data/sen_ncyear.csv"
#this is the name in 2022
data22 <- "data/sen_ncyear_.csv"
#import the 2022 file
data <- readr::read_csv(data22)



Rows: 218715 Columns: 48
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): time_identifier, geographic_level, country_code, country_name, reg...
dbl (37): time_period, old_la_code, number_of_pupils, nc_early_years, nc_rec...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
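
Because the CSV's name changed between releases (`sen_ncyear.csv` in 2021, `sen_ncyear_.csv` in 2022), one option is to look the file up by pattern rather than hard-coding it. The following is a minimal sketch, assuming the zip extracts into a `data/` folder as above. The `find_sen_file()` helper and the temporary demo folder are hypothetical names used only for illustration.

```r
# Hypothetical helper: return the path of the first CSV in `dir`
# whose name starts with "sen_ncyear" (this matches both release names)
find_sen_file <- function(dir = "data") {
  matches <- list.files(dir, pattern = "^sen_ncyear.*\\.csv$", full.names = TRUE)
  if (length(matches) == 0) stop("No sen_ncyear CSV found in ", dir)
  matches[1]
}

# Demonstration using an empty placeholder file in a temporary folder
demo_dir <- file.path(tempdir(), "sen_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "sen_ncyear_.csv"))
found <- find_sen_file(demo_dir)
print(basename(found))
```

In the notebook itself this could be used as `readr::read_csv(find_sen_file("data"))`, although hard-coding the name is just as valid when you know which release you have.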


## `filter()` the data to what we need

This is a very large dataframe, and we don't need all of it. Firstly, we're specifically interested in the figures on speech and language.

In [7]:
%%R
#filter to where column values match specified string
slcdf <- filter(data, primary_need == "Speech, Language and Communications needs")
print(slcdf)

# A tibble: 16,735 × 48
   time_period time_identifier geographic_level country_code country_name
         <dbl> <chr>           <chr>            <chr>        <chr>       
 1      202122 Academic year   National         E92000001    England     
 2      202122 Academic year   National         E92000001    England     
 3      202122 Academic year   National         E92000001    England     
 4      202122 Academic year   National         E92000001    England     
 5      202122 Academic year   National         E92000001    England     
 6      202122 Academic year   National         E92000001    England     
 7      202122 Academic year   National         E92000001    England     
 8      202122 Academic year   National         E92000001    England     
 9      202122 Academic year   National         E92000001    England     
10      202122 Academic year   National         E92000001    England     
# … with 16,725 more rows, and 43 more variables: region_name <chr>,
#   region_code <ch

### Filter: primary schools only

We also only want to look at primary schools.

In [8]:
%%R
#filter to where column values match specified string
slcdf <- filter(slcdf, phase_type_grouping == "State-funded primary")
print(slcdf)

# A tibble: 3,374 × 48
   time_period time_identifier geographic_level country_code country_name
         <dbl> <chr>           <chr>            <chr>        <chr>       
 1      202122 Academic year   National         E92000001    England     
 2      202122 Academic year   National         E92000001    England     
 3      202122 Academic year   National         E92000001    England     
 4      202122 Academic year   Regional         E92000001    England     
 5      202122 Academic year   Regional         E92000001    England     
 6      202122 Academic year   Regional         E92000001    England     
 7      202122 Academic year   Local authority  E92000001    England     
 8      202122 Academic year   Local authority  E92000001    England     
 9      202122 Academic year   Local authority  E92000001    England     
10      202122 Academic year   Local authority  E92000001    England     
# … with 3,364 more rows, and 43 more variables: region_name <chr>,
#   region_code <chr>

### Filter: totals only

We also don't need the sub-categories - just the total of those needing support.

In [9]:
%%R
#show the unique values and their frequency
table(slcdf['pupil_sen_status'])

pupil_sen_status
     SEN Support Statement or EHC            Total 
            1125             1124             1125 


In [10]:
%%R
#filter to where column values match specified string
slcdf_filter3 <- filter(slcdf, pupil_sen_status == "Total")
print(slcdf_filter3)

# A tibble: 1,125 × 48
   time_period time_identifier geographic_level country_code country_name
         <dbl> <chr>           <chr>            <chr>        <chr>       
 1      202122 Academic year   National         E92000001    England     
 2      202122 Academic year   Regional         E92000001    England     
 3      202122 Academic year   Local authority  E92000001    England     
 4      202122 Academic year   Local authority  E92000001    England     
 5      202122 Academic year   Local authority  E92000001    England     
 6      202122 Academic year   Local authority  E92000001    England     
 7      202122 Academic year   Local authority  E92000001    England     
 8      202122 Academic year   Local authority  E92000001    England     
 9      202122 Academic year   Local authority  E92000001    England     
10      202122 Academic year   Local authority  E92000001    England     
# … with 1,115 more rows, and 43 more variables: region_name <chr>,
#   region_code <chr>

### Filter: local authorities only

We also only want to look at data for local authorities. 

In [11]:
%%R
#show the unique values and their frequency
table(slcdf_filter3['geographic_level'])

geographic_level
Local authority        National        Regional 
           1055               7              63 


In [12]:
%%R
#filter to where column values match specified string
slcdf_filter4 <- filter(slcdf_filter3, geographic_level == "Local authority")
print(slcdf_filter4)

# A tibble: 1,055 × 48
   time_period time_identifier geographic_level country_code country_name
         <dbl> <chr>           <chr>            <chr>        <chr>       
 1      202122 Academic year   Local authority  E92000001    England     
 2      202122 Academic year   Local authority  E92000001    England     
 3      202122 Academic year   Local authority  E92000001    England     
 4      202122 Academic year   Local authority  E92000001    England     
 5      202122 Academic year   Local authority  E92000001    England     
 6      202122 Academic year   Local authority  E92000001    England     
 7      202122 Academic year   Local authority  E92000001    England     
 8      202122 Academic year   Local authority  E92000001    England     
 9      202122 Academic year   Local authority  E92000001    England     
10      202122 Academic year   Local authority  E92000001    England     
# … with 1,045 more rows, and 43 more variables: region_name <chr>,
#   region_code <chr>
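
The four filters above could also be written as one `filter()` call, because conditions separated by commas are combined with AND. This is a sketch on a small made-up tibble (`toy` and its values are assumptions for illustration, not the real data), using the same column names and strings as the cells above.

```r
library(dplyr)

# Toy stand-in for the full dataset (values are made up for illustration)
toy <- tibble(
  geographic_level = c("National", "Local authority", "Local authority", "Local authority"),
  phase_type_grouping = c("State-funded primary", "State-funded primary",
                          "State-funded secondary", "State-funded primary"),
  pupil_sen_status = c("Total", "Total", "Total", "SEN Support"),
  primary_need = rep("Speech, Language and Communications needs", 4),
  nc_year_1 = c(100, 10, 20, 5)
)

# Every condition must be TRUE for a row to be kept
toy_filtered <- toy %>%
  filter(
    primary_need == "Speech, Language and Communications needs",
    phase_type_grouping == "State-funded primary",
    pupil_sen_status == "Total",
    geographic_level == "Local authority"
  )
print(toy_filtered)
```

Filtering one step at a time, as this notebook does, has the advantage that you can check the row count after each step.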

## Remove columns

We also have a number of columns we don't need.

In [None]:
%%R
#show the columns
colnames(slcdf_filter4)

 [1] "time_period"             "time_identifier"        
 [3] "geographic_level"        "country_code"           
 [5] "country_name"            "region_name"            
 [7] "region_code"             "old_la_code"            
 [9] "la_name"                 "new_la_code"            
[11] "phase_type_grouping"     "pupil_sen_status"       
[13] "primary_need"            "number_of_pupils"       
[15] "nc_early_years"          "nc_reception"           
[17] "nc_year_1"               "nc_year_2"              
[19] "nc_year_3"               "nc_year_4"              
[21] "nc_year_5"               "nc_year_6"              
[23] "nc_year_7"               "nc_year_8"              
[25] "nc_year_9"               "nc_year_10"             
[27] "nc_year_11"              "nc_year_12"             
[29] "nc_year_13"              "nc_year_14"             
[31] "nc_not_followed"         "nc_early_years_percent" 
[33] "nc_reception_percent"    "nc_year_1_percent"      
[35] "nc_year_2_percent"       

Some columns now contain only one value, because of the filters we have applied.

In [None]:
%%R
#count the unique values in the time_identifier column
print(table(slcdf_filter4['time_identifier']))
#repeat for the geographic level column
print(table(slcdf_filter4['geographic_level']))
#repeat for country_name
print(table(slcdf_filter4['country_name']))
#repeat for country_code
print(table(slcdf_filter4['country_code']))
#repeat for phase_type_grouping
print(table(slcdf_filter4['phase_type_grouping']))
#repeat for pupil_sen_status
print(table(slcdf_filter4['pupil_sen_status']))

time_identifier
Academic year 
         1055 
geographic_level
Local authority 
           1055 
country_name
England 
   1055 
country_code
E92000001 
     1055 
phase_type_grouping
State-funded primary 
                1055 
pupil_sen_status
Total 
 1055 
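
Rather than checking each column by hand, the single-value columns can be found programmatically. The sketch below uses base R on a small made-up data frame (`toy` and its values are assumptions for illustration).

```r
# Toy data frame (made-up values): the first two columns never vary
toy <- data.frame(
  time_identifier = rep("Academic year", 3),
  country_name = rep("England", 3),
  la_name = c("Hartlepool", "Middlesbrough", "Darlington"),
  nc_year_1 = c(10, 20, 30)
)

# TRUE for each column that contains only one distinct value
is_constant <- vapply(toy, function(col) length(unique(col)) == 1, logical(1))
constant_cols <- names(toy)[is_constant]
print(constant_cols)
```

Note that this would also flag a column such as `primary_need`, which has one value after filtering but which you may still want to keep as a label.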


We remove those columns by using `select()` with a minus before the columns we want to exclude.

In [13]:
%%R
#remove the specified columns
slcdf_col_filter1 <- select(slcdf_filter4, 
                            -c(time_identifier, 
                               geographic_level,
                               country_name,
                               country_code,
                               phase_type_grouping,
                               pupil_sen_status))
slcdf_col_filter1

# A tibble: 1,055 × 42
   time_period region_name region_code old_la_code la_name           new_la_code
         <dbl> <chr>       <chr>             <dbl> <chr>             <chr>      
 1      202122 North East  E12000001           805 Hartlepool        E06000001  
 2      202122 North East  E12000001           806 Middlesbrough     E06000002  
 3      202122 North East  E12000001           807 Redcar and Cleve… E06000003  
 4      202122 North East  E12000001           808 Stockton-on-Tees  E06000004  
 5      202122 North East  E12000001           841 Darlington        E06000005  
 6      202122 North East  E12000001           840 County Durham     E06000047  
 7      202122 North East  E12000001           929 Northumberland    E06000057  
 8      202122 North East  E12000001           391 Newcastle upon T… E08000021  
 9      202122 North East  E12000001           392 North Tyneside    E08000022  
10      202122 North East  E12000001           393 South Tyneside    E08000023  
# … w

And we don't need years 7 onwards because that's after primary school.

In [14]:
%%R
#show the columns we want to exclude
print(colnames(slcdf_col_filter1)[17:25])
print(colnames(slcdf_col_filter1)[34:42])

[1] "nc_year_7"       "nc_year_8"       "nc_year_9"       "nc_year_10"     
[5] "nc_year_11"      "nc_year_12"      "nc_year_13"      "nc_year_14"     
[9] "nc_not_followed"
[1] "nc_year_7_percent"       "nc_year_8_percent"      
[3] "nc_year_9_percent"       "nc_year_10_percent"     
[5] "nc_year_11_percent"      "nc_year_12_percent"     
[7] "nc_year_13_percent"      "nc_year_14_percent"     
[9] "nc_not_followed_percent"


In [15]:
%%R
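#remove the specified columns (years 7 onwards, selected by position)
#all_of() tells select() that we are passing a vector of column names
slcdf_col_filter2 <- select(slcdf_col_filter1, 
                            -all_of(colnames(slcdf_col_filter1)[17:25])) %>% 
  select(-all_of(colnames(slcdf_col_filter1)[34:42]))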
#show the column names left
colnames(slcdf_col_filter2)

 [1] "time_period"            "region_name"            "region_code"           
 [4] "old_la_code"            "la_name"                "new_la_code"           
 [7] "primary_need"           "number_of_pupils"       "nc_early_years"        
[10] "nc_reception"           "nc_year_1"              "nc_year_2"             
[13] "nc_year_3"              "nc_year_4"              "nc_year_5"             
[16] "nc_year_6"              "nc_early_years_percent" "nc_reception_percent"  
[19] "nc_year_1_percent"      "nc_year_2_percent"      "nc_year_3_percent"     
[22] "nc_year_4_percent"      "nc_year_5_percent"      "nc_year_6_percent"     
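
Selecting columns by position breaks if a future release adds or reorders columns. An alternative is to match on the column names themselves. The sketch below uses the `matches()` and `starts_with()` selection helpers on a made-up tibble containing a subset of the real column names (the values, and the regular expression, are assumptions for illustration).

```r
library(dplyr)

# Toy tibble with a subset of the real column names (values are made up)
toy <- tibble(
  la_name = "Hartlepool",
  nc_year_1 = 1, nc_year_6 = 6, nc_year_7 = 7, nc_year_10 = 10,
  nc_not_followed = 0,
  nc_year_1_percent = 0.1, nc_year_7_percent = 0.7,
  nc_not_followed_percent = 0
)

# Drop years 7 to 14 (counts and percentages) and the 'not followed' columns
primary_only <- toy %>%
  select(
    -matches("^nc_year_([7-9]|1[0-4])(_percent)?$"),
    -starts_with("nc_not_followed")
  )
print(colnames(primary_only))
```

The pattern requires either a single digit from 7 to 9 or a two-digit number from 10 to 14 after `nc_year_`, so `nc_year_1` and `nc_year_1_percent` are left alone.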


## Pivot by LA and year using `group_by()` and `summarize()`

There are a number of ways to create pivot table-like summaries in R, including [the `pivottabler` package](https://cran.r-project.org/web/packages/pivottabler/vignettes/v00-vignettes.html) and `dplyr`. We'll be using the latter approach, [documented here](https://rstudio-conf-2020.github.io/r-for-excel/pivot-tables.html).

In [25]:
%%R
#the group_by line specifies the rows of the pivot table
slcdf_col_filter2 %>%
  group_by(la_name) %>% 
  summarize(count_of_la = n()) # this specifies the name and calculation to create the 'values' part

# A tibble: 154 × 2
   la_name                      count_of_la
   <chr>                              <int>
 1 Barking and Dagenham                   7
 2 Barnet                                 7
 3 Barnsley                               7
 4 Bath and North East Somerset           7
 5 Bedford                                7
 6 Bexley                                 7
 7 Birmingham                             7
 8 Blackburn with Darwen                  7
 9 Blackpool                              7
10 Bolton                                 7
# … with 144 more rows


### Adding rows

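This tells us that there are 7 rows for each of the LAs shown. Those are the figures for each of the 7 academic years covered by the data (1,055 rows across 154 LAs is slightly fewer than 154 × 7 = 1,078, so a few LAs must have fewer years). We need to add the years as rows so that they are separated out.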

We can begin to do this by adding the `time_period` column as a second argument in `group_by()`

In [26]:
%%R
#the group_by line specifies the rows of the pivot table
slcdf_col_filter2 %>%
  group_by(la_name, time_period) %>% 
  summarize(count_of_la = n()) # this specifies the name and calculation to create the 'values' part

`summarise()` has grouped output by 'la_name'. You can override using the
`.groups` argument.
# A tibble: 1,055 × 3
# Groups:   la_name [154]
   la_name              time_period count_of_la
   <chr>                      <dbl>       <int>
 1 Barking and Dagenham      201516           1
 2 Barking and Dagenham      201617           1
 3 Barking and Dagenham      201718           1
 4 Barking and Dagenham      201819           1
 5 Barking and Dagenham      201920           1
 6 Barking and Dagenham      202021           1
 7 Barking and Dagenham      202122           1
 8 Barnet                    201516           1
 9 Barnet                    201617           1
10 Barnet                    201718           1
# … with 1,045 more rows


### Sum, don't count

This puts the years in a second column rather than giving each year its own column heading, but we can reshape the table to fix that. First, let's change from counting to summing.

In [27]:
%%R
#the group_by line specifies the rows of the pivot table
slcdf_col_filter2 %>%
  group_by(la_name, time_period) %>% 
  summarize(sum_of_yr1 = sum(nc_year_1)) # now we 'sum' a specific column, 
  #and the resulting column is called 'sum_of_yr1'

`summarise()` has grouped output by 'la_name'. You can override using the
`.groups` argument.
# A tibble: 1,055 × 3
# Groups:   la_name [154]
   la_name              time_period sum_of_yr1
   <chr>                      <dbl>      <dbl>
 1 Barking and Dagenham      201516        241
 2 Barking and Dagenham      201617        299
 3 Barking and Dagenham      201718        285
 4 Barking and Dagenham      201819        287
 5 Barking and Dagenham      201920        282
 6 Barking and Dagenham      202021        316
 7 Barking and Dagenham      202122        319
 8 Barnet                    201516        254
 9 Barnet                    201617        239
10 Barnet                    201718        241
# … with 1,045 more rows
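
The message about grouped output appears because `summarize()` removes only the last grouping level, leaving the result grouped by `la_name`. One way to silence it and get an ungrouped result is the `.groups` argument. This is a sketch on a small tibble (`toy` is an assumption for illustration; the figures are the Barnet and Bexley values shown elsewhere in this notebook).

```r
library(dplyr)

toy <- tibble(
  la_name = c("Barnet", "Barnet", "Bexley", "Bexley"),
  time_period = c(202021, 202122, 202021, 202122),
  nc_year_1 = c(209, 244, 200, 219)
)

# .groups = "drop" returns an ungrouped tibble and suppresses the message
summary_tbl <- toy %>%
  group_by(la_name, time_period) %>%
  summarize(sum_of_yr1 = sum(nc_year_1), .groups = "drop")

print(is_grouped_df(summary_tbl))
```

An ungrouped result tends to behave more predictably in later steps such as adding rows or new columns.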


## Reshape long to wide using `spread()`

We now reshape to make that second column of years into the column headings, so there's only one row per LA, and we can more easily calculate year on year change.

Note that R treats number-only column names as non-syntactic, so it displays each one wrapped in backticks (`` ` ``), and we need to use backticks or quotes when referring to them.

In [149]:
%%R
#save the previous results
pivot_by_la_yr_yr1 <- slcdf_col_filter2 %>%
  group_by(la_name, time_period) %>% 
  summarize(sum_of_yr1 = sum(nc_year_1)) # now we 'sum' a specific column, 
  #and the resulting column is called 'sum_of_yr1'
#specify we want to convert the time_period column values to column names 
#and insert the values from the sum_of_yr1 column underneath
yr1_wide <- pivot_by_la_yr_yr1 %>% spread(time_period, sum_of_yr1)
#show the results
yr1_wide

`summarise()` has grouped output by 'la_name'. You can override using the
`.groups` argument.
# A tibble: 154 × 8
# Groups:   la_name [154]
   la_name        `201516` `201617` `201718` `201819` `201920` `202021` `202122`
   <chr>             <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 Barking and D…      241      299      285      287      282      316      319
 2 Barnet              254      239      241      250      234      209      244
 3 Barnsley            114      147      141      125      129      107      143
 4 Bath and Nort…       82      101       95      118       96      115      157
 5 Bedford              93       84      115      110      107      130      112
 6 Bexley              209      215      213      247      245      200      219
 7 Birmingham          765      836      868      900      899      936     1114
 8 Blackburn wit…      181      200      199      216      224      226      281
 9 Blackpool           146      139      170      
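
The `tidyr` documentation describes `spread()` as superseded by `pivot_wider()`, so the same reshape can also be sketched with the newer function. The example below uses a small tibble (`long` is an assumption for illustration; the figures are the Barnet and Bexley values from this notebook).

```r
library(tidyr)
library(dplyr)  # for tibble() and the pipe

long <- tibble(
  la_name = c("Barnet", "Barnet", "Bexley", "Bexley"),
  time_period = c(202021, 202122, 202021, 202122),
  sum_of_yr1 = c(209, 244, 200, 219)
)

# names_from supplies the new column headings, values_from fills the cells
wide <- long %>%
  pivot_wider(names_from = time_period, values_from = sum_of_yr1)
print(wide)
```

`spread()` still works, so there is no need to change the cell above unless you prefer the newer interface.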

## Add a grand total

We will want a national change figure, too. So let's add that row.

In [150]:
%%R
#check how many rows so we don't overwrite any
nrow(yr1_wide)

[1] 154


In [151]:
%%R
#create a vector to contain the totals
#we start with a placeholder number, which will later be replaced with the text that goes in the first column
#for now it needs to be a number so the other values aren't stored as strings
totalsvec <- c(0)
#loop through the other column names
for (i in colnames(yr1_wide)[seq(2,8)]){
    #print the column name we're working with
    print(i)
    #add all the numbers in that column
    coltotal <- sum(yr1_wide[i][!is.na(yr1_wide[i])])
    print(coltotal)
    #add it to the vector
    totalsvec <- c(totalsvec,coltotal)
}
print(totalsvec)

[1] "201516"
[1] 33464
[1] "201617"
[1] 35061
[1] "201718"
[1] 35949
[1] "201819"
[1] 37178
[1] "201920"
[1] 37632
[1] "202021"
[1] 38560
[1] "202122"
[1] 42341
[1]     0 33464 35061 35949 37178 37632 38560 42341


In [152]:
%%R
#transpose the vector and convert to a dataframe
newdf <- as.data.frame(t(totalsvec))
print(newdf)
#assign the column names to that dataframe
colnames(newdf) <- colnames(yr1_wide)
print(newdf)
#change the first cell to 'total'
newdf[1,1] <- "Total"
print(newdf)

  V1    V2    V3    V4    V5    V6    V7    V8
1  0 33464 35061 35949 37178 37632 38560 42341
  la_name 201516 201617 201718 201819 201920 202021 202122
1       0  33464  35061  35949  37178  37632  38560  42341
  la_name 201516 201617 201718 201819 201920 202021 202122
1   Total  33464  35061  35949  37178  37632  38560  42341


In [153]:
%%R
#bind the two dataframes together
yr1_wtotal <- rbind(yr1_wide,newdf)
#and check the last few rows
print(yr1_wtotal[seq(152,155),] )
#overwrite the original dataframe
yr1_wide <- yr1_wtotal

# A tibble: 4 × 8
# Groups:   la_name [4]
  la_name        `201516` `201617` `201718` `201819` `201920` `202021` `202122`
  <chr>             <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 Wolverhampton       102      121      133      142      151      188      246
2 Worcestershire      473      514      563      615      622      671      724
3 York                 65       61       90       70       62       90       74
4 Total             33464    35061    35949    37178    37632    38560    42341
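
The same total row can be built without a loop. The sketch below uses `dplyr` on a small tibble holding the three LAs shown above (`wide` and the other object names are assumptions for illustration). The `janitor` package loaded earlier also offers `adorn_totals()` as another route, which is not shown here.

```r
library(dplyr)

wide <- tibble(
  la_name = c("Wolverhampton", "Worcestershire", "York"),
  `202021` = c(188, 671, 90),
  `202122` = c(246, 724, 74)
)

# Sum every numeric column, ignoring NAs, then label the row
total_row <- wide %>%
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE))) %>%
  mutate(la_name = "Total")

# bind_rows() matches columns by name, so column order does not matter
wide_with_total <- bind_rows(wide, total_row)
print(wide_with_total)
```

If the table is still grouped by `la_name`, as it is in this notebook, it would need `ungroup()` first, otherwise `summarise()` returns one row per LA instead of a single total.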


## Calculate year on year changes

Now we can add new columns with the year on year changes.

In [154]:
%%R
#remind ourselves of the column names
colnames(yr1_wide)

[1] "la_name" "201516"  "201617"  "201718"  "201819"  "201920"  "202021" 
[8] "202122" 


In [155]:
%%R
#subtract the 2020/21 figure from the 2021/22 figure to get the change - and store in a new column
yr1_wide['YOY21to22'] <- yr1_wide['202122'] - yr1_wide['202021']
#show the results to check they tally
yr1_wide[, c(1,7,8,9)]

# A tibble: 155 × 4
# Groups:   la_name [155]
   la_name                      `202021` `202122` YOY21to22
   <chr>                           <dbl>    <dbl>     <dbl>
 1 Barking and Dagenham              316      319         3
 2 Barnet                            209      244        35
 3 Barnsley                          107      143        36
 4 Bath and North East Somerset      115      157        42
 5 Bedford                           130      112       -18
 6 Bexley                            200      219        19
 7 Birmingham                        936     1114       178
 8 Blackburn with Darwen             226      281        55
 9 Blackpool                         178      182         4
10 Bolton                            260      290        30
# … with 145 more rows


### Calculate them as percentages

Now add another column with those values as percentage changes.

In [156]:
%%R
#divide the change by the earlier (2020/21) figure to get the proportional change - and store in a new column
yr1_wide['YOYperc21to22'] <- yr1_wide['YOY21to22']/yr1_wide['202021']
#show the results to check they tally - the first figure should be around 0.01 (1%) when rounded
yr1_wide[, c(1,7,8,9,10)]

# A tibble: 155 × 5
# Groups:   la_name [155]
   la_name                      `202021` `202122` YOY21to22 YOYperc21to22
   <chr>                           <dbl>    <dbl>     <dbl>         <dbl>
 1 Barking and Dagenham              316      319         3       0.00949
 2 Barnet                            209      244        35       0.167  
 3 Barnsley                          107      143        36       0.336  
 4 Bath and North East Somerset      115      157        42       0.365  
 5 Bedford                           130      112       -18      -0.138  
 6 Bexley                            200      219        19       0.095  
 7 Birmingham                        936     1114       178       0.190  
 8 Blackburn with Darwen             226      281        55       0.243  
 9 Blackpool                         178      182         4       0.0225 
10 Bolton                            260      290        30       0.115  
# … with 145 more rows
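
The values in `YOYperc21to22` are proportions, so 0.00949 means roughly 0.9%. The sketch below repeats the calculation with `mutate()` for the first two LAs shown above and adds a rounded percentage column (`toy`, `changes` and `YOYperc21to22_pct` are assumptions for illustration).

```r
library(dplyr)

toy <- tibble(
  la_name = c("Barking and Dagenham", "Barnet"),
  `202021` = c(316, 209),
  `202122` = c(319, 244)
)

changes <- toy %>%
  mutate(
    # absolute change: later year minus earlier year
    YOY21to22 = `202122` - `202021`,
    # proportional change: change divided by the earlier year
    YOYperc21to22 = YOY21to22 / `202021`,
    # the same figure as a percentage, rounded to one decimal place
    YOYperc21to22_pct = round(YOYperc21to22 * 100, 1)
  )
print(changes)
```

For Barking and Dagenham that is (319 - 316) / 316, which matches the first row of the output above.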


## Repeat for other years

Can we codify this process and loop to create the other columns?

In [157]:
%%R
#loop through the column names from 3rd to 7th position
for (i in seq(3,7)){
    print(i)
    thisyr <- colnames(yr1_wide)[i]
    previousyr <- colnames(yr1_wide)[i-1]
    print(thisyr)
    print(previousyr)
    #subtract the previous year from the current one to get the change - and store it in a variable
    YOYchange <- yr1_wide[thisyr] - yr1_wide[previousyr]
    YOYpercChange <- YOYchange/yr1_wide[previousyr]
    #extract the last two digits of each year label (e.g. '17' from '201617')
    toyr <- substr(thisyr,5,6)
    fromyr <- substr(previousyr,5,6)
    #create column names from these
    colname <- paste0("YOY",fromyr,"to",toyr)
    perccolname <- paste0("YOYperc",fromyr,"to",toyr)
    #create a new column with that name and the values calculated
    yr1_wide[colname] <- YOYchange
    yr1_wide[perccolname] <- YOYpercChange
}

[1] 3
[1] "201617"
[1] "201516"
[1] 4
[1] "201718"
[1] "201617"
[1] 5
[1] "201819"
[1] "201718"
[1] 6
[1] "201920"
[1] "201819"
[1] 7
[1] "202021"
[1] "201920"


In [158]:
%%R
yr1_wide

# A tibble: 155 × 20
# Groups:   la_name [155]
   la_name        `201516` `201617` `201718` `201819` `201920` `202021` `202122`
   <chr>             <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 Barking and D…      241      299      285      287      282      316      319
 2 Barnet              254      239      241      250      234      209      244
 3 Barnsley            114      147      141      125      129      107      143
 4 Bath and Nort…       82      101       95      118       96      115      157
 5 Bedford              93       84      115      110      107      130      112
 6 Bexley              209      215      213      247      245      200      219
 7 Birmingham          765      836      868      900      899      936     1114
 8 Blackburn wit…      181      200      199      216      224      226      281
 9 Blackpool           146      139      170      180      203      178      182
10 Bolton              142      178      218      214      260
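
An alternative to looping over column positions is to work in long format and compare each year with the one before it using `lag()`. This is a sketch of that different technique, not the method used above. It runs on a small tibble with the Barnet and Bexley figures from this notebook, and the object and column names (`wide`, `yoy_long`, `pupils`, `yoy_change`, `yoy_perc_change`) are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

wide <- tibble(
  la_name = c("Barnet", "Bexley"),
  `201920` = c(234, 245),
  `202021` = c(209, 200),
  `202122` = c(244, 219)
)

yoy_long <- wide %>%
  # one row per LA per year
  pivot_longer(-la_name, names_to = "time_period", values_to = "pupils") %>%
  group_by(la_name) %>%
  # make sure years are in order within each LA before lagging
  arrange(time_period, .by_group = TRUE) %>%
  mutate(
    yoy_change = pupils - lag(pupils),
    yoy_perc_change = yoy_change / lag(pupils)
  ) %>%
  ungroup()
print(yoy_long)
```

The first year for each LA has no previous year to compare with, so its change is `NA`. The long result can be reshaped back to wide if one row per LA is needed.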

## Export results

Let's export as a CSV.

In [159]:
%%R
#export as a CSV
write.csv(yr1_wide, "yr1_analysis.csv")
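
By default `write.csv()` adds a column of row numbers, and `read.csv()` rewrites number-only column names when the file is read back in. The sketch below shows one way to avoid both, using a small data frame and a temporary file (`results`, `out_path` and `reread` are assumptions for illustration).

```r
# Small stand-in for the results table (figures from this notebook)
results <- data.frame(
  la_name = c("Barnet", "Bexley"),
  `202122` = c(244, 219),
  check.names = FALSE  # keep the number-only column name as it is
)

out_path <- file.path(tempdir(), "yr1_analysis_demo.csv")

# row.names = FALSE leaves out the extra column of row numbers
write.csv(results, out_path, row.names = FALSE)

# check.names = FALSE stops read.csv() from altering the year column names
reread <- read.csv(out_path, check.names = FALSE)
print(colnames(reread))
```

If the CSV is only going to be opened in a spreadsheet, the default export above works as it is; the extra first column can simply be deleted.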