<font style='font-size:1.5em'>**🧑‍🏫 Week 10 Lecture**</font><br>
<font style='font-size:1.3em;color:#888888'>NOTEBOOK 01: Analysis of Educational Attainment Across English Towns</font>

<font style='font-size:1.2em;color:#e26a4f;font-weight:bold'>LSE DS105A – Data for Data Science (2024/25) </font>



<div style="color: #333333; background-color:rgba(226, 106, 79, 0.075); border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px 0 20px 10px; margin: 10px 0 10px 0; flex: 1 1 calc(45% - 20px);min-width: 250px;max-width: 350px;align-items:top;min-height: calc(45% - 20px); box-sizing: border-box;font-size:0.9em;">

🗓️ **DATE:** 5 December 2024 

⌚ **TIME:** 16.00-18.00

📍 **LOCATION:** CLM.5.02
</div>


**AUTHORS:**  Dr. [Jon Cardoso-Silva](https://jonjoncardoso.github.io)

**DEPARTMENT:** [LSE Data Science Institute](https://lse.ac.uk/dsi)

**CONTEXT**: In July 2023, the UK Office for National Statistics (ONS) published the following analysis online: ["Why do children and young people in smaller towns do better academically than those in larger towns?"](https://www.ons.gov.uk/peoplepopulationandcommunity/educationandchildcare/articles/whydochildrenandyoungpeopleinsmallertownsdobetteracademicallythanthoseinlargertowns/2023-07-25). I replicate a few parts of this report in this notebook. Let's pretend this is a complete analysis after a thorough data exploration and cleaning process. Jupyter Notebooks are good for sharing with data-savvy colleagues, but what if we want to reach a broader audience? What if we want our research to be as engaging and accessible as the original ONS article?

**OBJECTIVE**: In this lecture, we'll transform our technical analysis (**this notebook**) into a public-facing website using GitHub Pages, making our research findings accessible to educators, policymakers, and anyone interested in understanding educational patterns across English towns.

---

**⚙️ SETUP**

Before you continue, set up your Python environment by following the instructions in the ['🐍 Python environment' section of the README](../README.md#🐍-python-environment).

<span style="color: #333333; background-color:rgba(226, 180, 79, 0.1); border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px 0 20px 10px; margin: 10px 0 10px 0; flex: 1 1 calc(45% - 20px);min-width: 250px;max-width: 450px;align-items:top;min-height: calc(45% - 20px); box-sizing: border-box;font-size:0.9em;display:block;">⚠️ **WARNING:** There is a new package to install this week: `openpyxl`. <br><br>Either update your environment using the `requirements.txt` file or install it manually by running `pip install openpyxl`. </span>
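If you are unsure whether the package is already available, you can look it up without importing it. This is only a minimal sketch: the helper name `is_installed` is hypothetical and not part of the course materials.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Hypothetical pre-flight check before running the rest of the notebook
if not is_installed("openpyxl"):
    print("openpyxl is missing: run `pip install openpyxl` and restart the kernel.")
```

If the message appears, install the package and restart the kernel before running the cells below.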

In [15]:
import os
import json
import requests

import pandas as pd

from tqdm.notebook import tqdm
tqdm.pandas()

from IPython.display import Image, display

# 1. Data Preparation

Normally, you would spend a lot of time cleaning and preparing the data. Here, the ONS has made our lives easier by providing a clean dataset for each plot, so all we need to do is load the data and start plotting.

In a real project, you would need to:

- **Load the data**: The data probably comes from a variety of sources, so you would need to load it from different files or databases.
- **Clean the data**: The data is likely to be messy, with missing values, inconsistent formatting, and various other issues. You would need to clean it up before you could do any analysis.
- **Transform the data**: You might need to reshape the data, merge it with other datasets, or aggregate it in various ways to make it suitable for analysis.
- **Summarise the data**: You would need to further reorganise the data to make it ready for plotting.
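
To make these four steps concrete, here is a toy sketch of what they might look like in pandas. All values and column names (`town`, `town_size`, `pct_level3`) are made up for illustration; they are not ONS data, and the "load" step is replaced by an inline DataFrame so the example runs on its own.

```python
import pandas as pd

# Load: hypothetical raw data (in practice this would come from files or databases)
raw = pd.DataFrame(
    {
        "town": ["A", "B", "C", "D"],
        "town_size": ["Small towns", "small towns ", "Large towns", "Large towns"],
        "pct_level3": ["49.0", "51.0", None, "47.0"],
    }
)

# Clean: normalise inconsistent labels, convert text to numbers, drop missing values
clean = raw.assign(
    town_size=raw["town_size"].str.strip().str.capitalize(),
    pct_level3=pd.to_numeric(raw["pct_level3"]),
).dropna(subset=["pct_level3"])

# Transform + summarise: one row per town size, ready for plotting
summary = clean.groupby("town_size", as_index=False)["pct_level3"].mean()
print(summary)
```

The resulting `summary` has one row per town size, which is the shape the ONS files used in this notebook already come in.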

## Preparing the Data for Plot 1

We'll use pandas' [`read_excel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) to read the data from an Excel file. This is why we need the `openpyxl` package: pandas relies on it to open `.xlsx` files.

In [None]:
file_path = "../data/ons_uk_education/edu_attainment_by_settlement_type.xlsx"
column_names = [
    "settlement_type",
    "age18_level3_qualifications",
    "age19_higher_education",
    "age19_further_education",
]

# After inspecting the spreadsheet, I found that I need to ignore the first 5 rows and only read the next 7 rows
df_edu_attainment = pd.read_excel(file_path, skiprows=5, nrows=7, names=column_names)

df_edu_attainment

| | settlement_type | age18_level3_qualifications | age19_higher_education | age19_further_education |
|---|---|---|---|---|
| 0 | Small towns | 49.7 | 20.6 | 33.3 |
| 1 | Medium towns | 47.7 | 20.1 | 32.6 |
| 2 | Large towns | 47.3 | 19.2 | 33.6 |
| 3 | Cities | 42.3 | 20.7 | 32.8 |
| 4 | Inner London | 53.5 | 14.0 | 49.3 |
| 5 | Outer London | 57.1 | 14.3 | 48.5 |


<div style="background-color: #fff; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px; margin: 10px; max-width:450px; box-sizing: border-box;font-size:0.9em;">

☝️ **TEACHING NOTES:**

- The data above has already been cleaned and summarised for us.
- **The data is 'tidy'**: each row represents a unique observation (a settlement type) and each column represents a variable about that observation.

This is what we aim for when producing a plot-ready dataset.

</div>

Despite all that, we still need to 'melt' the data to make it easier to plot. We'll use the [melt](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html) function from pandas to do this:

In [None]:
df_edu_attainment = df_edu_attainment.melt(id_vars=["settlement_type"], 
                                           var_name="attainment", 
                                           value_name="percentage")

df_edu_attainment

| | settlement_type | attainment | percentage |
|---|---|---|---|
| 0 | Small towns | age18_level3_qualifications | 49.7 |
| 1 | Medium towns | age18_level3_qualifications | 47.7 |
| 2 | Large towns | age18_level3_qualifications | 47.3 |
| 3 | Cities | age18_level3_qualifications | 42.3 |
| 4 | Inner London | age18_level3_qualifications | 53.5 |
| 5 | Outer London | age18_level3_qualifications | 57.1 |
| 6 | Small towns | age19_higher_education | 20.6 |
| 7 | Medium towns | age19_higher_education | 20.1 |
| 8 | Large towns | age19_higher_education | 19.2 |
| 9 | Cities | age19_higher_education | 20.7 |
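
To see what `melt` does in isolation, here is a small self-contained sketch using a few of the values from the tables above. The variable names `wide`, `long` and `wide_again` are just for illustration.

```python
import pandas as pd

# Wide format: one column per attainment measure (values copied from the tables above)
wide = pd.DataFrame(
    {
        "settlement_type": ["Small towns", "Medium towns", "Large towns"],
        "age18_level3_qualifications": [49.7, 47.7, 47.3],
        "age19_higher_education": [20.6, 20.1, 19.2],
    }
)

# Long format: one row per (settlement type, attainment measure) pair
long = wide.melt(id_vars=["settlement_type"], var_name="attainment", value_name="percentage")
print(long.shape)  # 3 settlement types x 2 measures = 6 rows, 3 columns

# pivot() reverses the melt and brings back one column per measure
wide_again = long.pivot(index="settlement_type", columns="attainment", values="percentage")
print(wide_again)
```

The long format is convenient for plotting because a single column (`attainment`) can be mapped to colour or fill, instead of adding one layer per measure.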


## Preparing the Data for Plot 2

Once again, the data is already cleaned, summarised and ready to plot.

In [18]:
file_path = "../data/ons_uk_education/edu_attainment_scores_by_town_size.xlsx"

# I can keep the column names but I found that I need to ignore the first 4 rows
df_edu_scores = pd.read_excel(file_path, skiprows=4)

display(df_edu_scores)

| | TOWN11CD | TOWN11NM | Town size | Educational attainment score |
|---|---|---|---|---|
| 0 | E34000007 | Carlton in Lindrick BUA | Small Towns | -0.534 |
| 1 | E34000016 | Dorchester (West Dorset) BUA | Small Towns | 1.952 |
| 2 | E34000020 | Ely BUA | Small Towns | -1.044 |
| 3 | E34000026 | Market Weighton BUA | Small Towns | -1.249 |
| 4 | E34000027 | Downham Market BUA | Small Towns | -1.169 |
| ... | ... | ... | ... | ... |
| 1099 | K06000004 | Chester BUASD | Large Towns | -0.811 |
| 1100 | Inner London BUAs | Inner London BUAs | London | 0.068 |
| 1101 | Outer london BUAs | Outer london BUAs | London | 1.262 |
| 1102 | Not BUA | Not BUA | Not BUA | 1.802 |


In [19]:
# Let's get a sense of the data types and the number of missing values
df_edu_scores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1104 entries, 0 to 1103
Data columns (total 4 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   TOWN11CD                      1104 non-null   object 
 1   TOWN11NM                      1104 non-null   object 
 2   Town size                     1104 non-null   object 
 3   Educational attainment score  1104 non-null   float64
dtypes: float64(1), object(3)
memory usage: 34.6+ KB
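
The `.info()` output already shows there are no missing values. A couple of complementary checks are sketched below on a handful of rows copied from the output above (the real DataFrame has 1,104 rows). Note that some rows, such as the London ones and `Not BUA`, carry a `Town size` that is not a town category; whether to keep them in a plot is a judgment call, and the `is_town` filter here is only one possible way to flag them.

```python
import pandas as pd

# A few rows copied from the output above, for illustration only
sample = pd.DataFrame(
    {
        "TOWN11NM": ["Ely BUA", "Chester BUASD", "Inner London BUAs", "Not BUA"],
        "Town size": ["Small Towns", "Large Towns", "London", "Not BUA"],
        "Educational attainment score": [-1.044, -0.811, 0.068, 1.802],
    }
)

# Count missing values per column (all zeros means nothing is missing)
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Count how many rows fall into each town-size category
rows_per_size = sample["Town size"].value_counts()
print(rows_per_size)

# Flag rows whose category is an actual town size (assumption: these contain "Towns")
is_town = sample["Town size"].str.contains("Towns")
print(sample[is_town])
```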


# 2. Plot 1: Education attainment of young people in England

I will write code to replicate the plot below:

![](../data/ons_uk_education/figure1_attainment_after18.png){width=50%}

<div style="background-color: #fff; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1); padding: 20px; margin: 10px; max-width:450px; box-sizing: border-box;font-size:0.9em;">

☝️ **TEACHING NOTES:**

Notice how the ONS applies the reporting best practices we've been discussing in the course:

- **Title**: The title reveals the main takeaway from the plot. It tells a story.

- **Subtitle**: The subtitle provides context and additional information.

- **Y-axis**: The y-axis is clearly labelled and the scale is appropriate. They put the text horizontally to make it easier to read.

- **X-axis**: There's no need for a title on the x-axis because the labels are self-explanatory.

- **Source**: The sources of data (without the technical details) are clearly stated in the caption.


</div>

## Replicating the Plot (the final code) 

Scroll down to the next section if you want a step-by-step guide to creating this plot.

In [None]:
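# NOTE: the source cell is truncated after `aes`; the mapping and geom below are assumptions.
# `ggplot`, `aes` and `geom_bar` must be imported from your plotting library (e.g. plotnine or lets-plot).
plot_edu_attainment = (
    ggplot(df_edu_attainment, aes(x="settlement_type", y="percentage", fill="attainment"))
    + geom_bar(stat="identity", position="dodge")
)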

# 3. Plot 2: Breakdown of education attainment scores by town

![](../data/ons_uk_education/figure2_edu_attainment_score_per_town.png){width=40%}