# Women's imprisonment rates
## ONS population by Police Force Area: Data QA
Checking that the values in my newly processed dataset are in line with those from the previous analysis.

## Loading newly processed data

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import src.utilities as utils
config = utils.read_config()

In [3]:
df = utils.load_data('interim', 'LA_PFA_population_women_2011-2023.csv')
df

2025-07-21 12:31:01,876 - INFO - Loaded data from data/interim/LA_PFA_population_women_2011-2023.csv


Unnamed: 0,ladcode,laname,year,freq,pfa
0,E06000001,Hartlepool,2011,37332,Cleveland
1,E06000001,Hartlepool,2012,37470,Cleveland
2,E06000001,Hartlepool,2013,37476,Cleveland
3,E06000001,Hartlepool,2014,37491,Cleveland
4,E06000001,Hartlepool,2015,37524,Cleveland
...,...,...,...,...,...
4116,W06000024,Merthyr Tydfil,2019,24168,South Wales
4117,W06000024,Merthyr Tydfil,2020,24134,South Wales
4118,W06000024,Merthyr Tydfil,2021,24061,South Wales
4119,W06000024,Merthyr Tydfil,2022,24056,South Wales


## Loading previous analysis' data

Steps to reproduce:
1. Download ONS mid-year estimates (with 2021 geog LA codes)
2. Process for adult women in England and Wales
3. Match LAs to earlier PFA codes and process

Notes on steps 1 and 2:

* Step 1: the pre-2021 Census data requires a manual download. It can be found within the zip file of the [Mid-2001 to mid-2020 detailed time series edition of this dataset](https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/populationestimatesforukenglandandwalesscotlandandnorthernireland). I will look to automate this in the future.
* Step 2: the data is then processed by `ons_comparator.py` to produce `LA_population_women_2001-2020.csv`, ready for matching with the PFA codes.

Both of these steps have already been completed in preparation for the QA process for the population data in `1.5-ah-pfa-population-qa.ipynb` and are documented in that notebook.
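For orientation, the kind of processing described in step 2 could look roughly like the sketch below. This is not the code in `ons_comparator.py`. The column names (`sex`, `age`, `population_<year>`), the coding of `sex == 2` for females, and the adult threshold of 18 are all assumptions made for illustration, and the input rows are made up.

```python
import pandas as pd


def women_population_long(df_wide: pd.DataFrame, min_age: int = 18) -> pd.DataFrame:
    """Reshape wide estimates to one row per LA and year for adult women."""
    keep = (
        (df_wide["sex"] == 2)                         # assumed coding: 2 = female
        & (df_wide["age"] >= min_age)                 # assumed adult threshold
        & df_wide["ladcode"].str[0].isin(["E", "W"])  # England and Wales codes only
    )
    women = df_wide[keep]
    year_cols = [c for c in women.columns if c.startswith("population_")]
    long = women.melt(
        id_vars=["ladcode", "laname"],
        value_vars=year_cols,
        var_name="year",
        value_name="freq",
    )
    long["year"] = long["year"].str.replace("population_", "", regex=False).astype(int)
    # Sum over single years of age to get one total per LA and year
    return long.groupby(["ladcode", "laname", "year"], as_index=False)["freq"].sum()


# Made-up rows (not ONS figures) to show the shape of the transformation
toy = pd.DataFrame({
    "ladcode": ["E99999999"] * 4 + ["S99999999"],
    "laname": ["Toytown"] * 4 + ["Elsewhere"],
    "sex": [1, 2, 2, 2, 2],
    "age": [30, 17, 18, 40, 40],
    "population_2001": [100, 10, 20, 30, 5],
    "population_2002": [110, 11, 21, 31, 6],
})
toy_long = women_population_long(toy)
print(toy_long)
```

The output has the same `ladcode, laname, year, freq` layout as the interim files loaded in this notebook.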

In [5]:
df_old = utils.load_data('interim', 'LA_population_women_2001-2020.csv')
df_old

2025-07-21 12:33:13,364 - INFO - Loaded data from data/interim/LA_population_women_2001-2020.csv


Unnamed: 0,ladcode,laname,year,freq
0,E06000001,Hartlepool,2001,35629
1,E06000001,Hartlepool,2002,35660
2,E06000001,Hartlepool,2003,35795
3,E06000001,Hartlepool,2004,35901
4,E06000001,Hartlepool,2005,36065
...,...,...,...,...
6615,W06000024,Merthyr Tydfil,2016,24249
6616,W06000024,Merthyr Tydfil,2017,24358
6617,W06000024,Merthyr Tydfil,2018,24426
6618,W06000024,Merthyr Tydfil,2019,24493


### 3. Matching PFA codes to `df_old`

In [6]:
from src.data.processing import la_to_pfa_matching

In [10]:
old_pfa_lookup_filename = config['data']['qaFilenames']['la_to_pfa_lookup']
old_pfa_lookup = utils.load_data('raw', old_pfa_lookup_filename)
old_pfa_lookup

2025-07-21 12:53:12,545 - INFO - Loaded data from data/raw/LA_to_PFA_(December_2022)_Lookup_in_EW.csv


Unnamed: 0,LAD22CD,LAD22NM,PFA22CD,PFA22NM
0,E08000001,Bolton,E23000005,Greater Manchester
1,E08000002,Bury,E23000005,Greater Manchester
2,E08000003,Manchester,E23000005,Greater Manchester
3,E08000004,Oldham,E23000005,Greater Manchester
4,E08000005,Rochdale,E23000005,Greater Manchester
...,...,...,...,...
336,E07000227,Horsham,E23000033,Sussex
337,E07000063,Lewes,E23000033,Sussex
338,E07000228,Mid Sussex,E23000033,Sussex
339,E07000064,Rother,E23000033,Sussex


In [11]:
df_old_qa = (
    la_to_pfa_matching.assign_pfa(old_pfa_lookup, df_old)
    .pipe(la_to_pfa_matching.filter_and_clean_data)
)
df_old_qa

2025-07-21 12:53:15,155 - INFO - Matching Local Authority Districts to Police Force Areas...
2025-07-21 12:53:15,157 - INFO - Creating lookup dictionary...
2025-07-21 12:53:15,159 - INFO - Standardising column names...
2025-07-21 12:53:15,218 - INFO - Filtering and cleaning population data...


Unnamed: 0,ladcode,laname,year,freq,pfa
0,E06000001,Hartlepool,2001,35629,Cleveland
1,E06000001,Hartlepool,2002,35660,Cleveland
2,E06000001,Hartlepool,2003,35795,Cleveland
3,E06000001,Hartlepool,2004,35901,Cleveland
4,E06000001,Hartlepool,2005,36065,Cleveland
...,...,...,...,...,...
6615,W06000024,Merthyr Tydfil,2016,24249,South Wales
6616,W06000024,Merthyr Tydfil,2017,24358,South Wales
6617,W06000024,Merthyr Tydfil,2018,24426,South Wales
6618,W06000024,Merthyr Tydfil,2019,24493,South Wales
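The log messages above mention creating a lookup dictionary. A minimal sketch of that idea follows. It is not the implementation in `la_to_pfa_matching`, and the function name `assign_pfa_sketch` and the `freq` values are placeholders. It also shows how a LAD code that is absent from the lookup ends up without a PFA.

```python
import pandas as pd

# A few rows copied from the December 2022 lookup shown above
lookup = pd.DataFrame({
    "LAD22CD": ["E08000001", "E08000002", "E07000227"],
    "LAD22NM": ["Bolton", "Bury", "Horsham"],
    "PFA22CD": ["E23000005", "E23000005", "E23000033"],
    "PFA22NM": ["Greater Manchester", "Greater Manchester", "Sussex"],
})

# Population rows; the freq values are placeholders, not ONS figures
population = pd.DataFrame({
    "ladcode": ["E08000001", "E07000227", "E06000066"],
    "laname": ["Bolton", "Horsham", "Somerset"],
    "year": [2014, 2014, 2014],
    "freq": [1000, 2000, 3000],
})


def assign_pfa_sketch(lookup: pd.DataFrame, population: pd.DataFrame) -> pd.DataFrame:
    """Attach a PFA name to each population row via a LAD code dictionary."""
    lad_to_pfa = dict(zip(lookup["LAD22CD"], lookup["PFA22NM"]))
    out = population.copy()
    out["pfa"] = out["ladcode"].map(lad_to_pfa)  # NaN where the code is not in the lookup
    return out


matched = assign_pfa_sketch(lookup, population)
unmatched = matched[matched["pfa"].isna()]
print(unmatched[["ladcode", "laname"]])
```

Listing the unmatched rows after the mapping step is a cheap way to spot codes that the chosen lookup vintage does not cover.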


### Comparing the old and new dataset values

In [12]:
df_old_qa.query('pfa == "Avon and Somerset" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
433,E06000022,Bath and North East Somerset,2014,75367,Avon and Somerset
453,E06000023,"Bristol, City of",2014,177077,Avon and Somerset
473,E06000024,North Somerset,2014,86277,Avon and Somerset
493,E06000025,South Gloucestershire,2014,108608,Avon and Somerset
3813,E07000187,Mendip,2014,45653,Avon and Somerset
3833,E07000188,Sedgemoor,2014,48795,Avon and Somerset
3853,E07000189,South Somerset,2014,67660,Avon and Somerset
4793,E07000246,Somerset West and Taunton,2014,62070,Avon and Somerset


In [13]:
df.query('pfa == "Avon and Somerset" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
276,E06000022,Bath and North East Somerset,2014,75340,Avon and Somerset
289,E06000023,"Bristol, City of",2014,178108,Avon and Somerset
302,E06000024,North Somerset,2014,86610,Avon and Somerset
315,E06000025,South Gloucestershire,2014,108410,Avon and Somerset
809,E06000066,Somerset,2014,225122,Avon and Somerset


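Hmmm, the four Somerset districts (Mendip, Sedgemoor, South Somerset, and Somerset West and Taunton) are missing from the new dataset, which has a single Somerset row instead. Checking for Mendip directly: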

In [10]:
df.query('laname == "Mendip" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
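Rather than querying one name at a time, the codes that appear in only one of the two datasets could be listed with an outer merge and pandas' `indicator` flag. The sketch below uses the Avon and Somerset LA codes and names copied from the two outputs above. The variable names are illustrative.

```python
import pandas as pd

old_las = pd.DataFrame({
    "ladcode": ["E06000022", "E06000023", "E06000024", "E06000025",
                "E07000187", "E07000188", "E07000189", "E07000246"],
    "laname": ["Bath and North East Somerset", "Bristol, City of",
               "North Somerset", "South Gloucestershire", "Mendip",
               "Sedgemoor", "South Somerset", "Somerset West and Taunton"],
})
new_las = pd.DataFrame({
    "ladcode": ["E06000022", "E06000023", "E06000024", "E06000025", "E06000066"],
    "laname": ["Bath and North East Somerset", "Bristol, City of",
               "North Somerset", "South Gloucestershire", "Somerset"],
})

# indicator=True adds a _merge column: left_only, right_only or both
presence = old_las.merge(
    new_las, on="ladcode", how="outer", suffixes=("_old", "_new"), indicator=True
)
only_old = presence.loc[presence["_merge"] == "left_only", "laname_old"].tolist()
only_new = presence.loc[presence["_merge"] == "right_only", "laname_new"].tolist()

print("Only in old data:", only_old)
print("Only in new data:", only_new)
```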


Checking population differences

In [16]:
new_pop_sum = df.query('pfa == "Avon and Somerset" and year == 2014')['freq'].sum()
old_pop_sum = df_old_qa.query('pfa == "Avon and Somerset" and year == 2014')['freq'].sum()
print(f'The new population for the Avon and Somerset PFA is {new_pop_sum}, and the old was {old_pop_sum}. A difference of {new_pop_sum - old_pop_sum}.')

The new population for the Avon and Somerset PFA is 673590, and the old was 671507. A difference of 2083.


In [12]:
pct_diff = ((new_pop_sum - old_pop_sum)/abs(old_pop_sum)) * 100
pct_diff

np.float64(0.3101978088091412)

A difference of about 0.3%, so not massively out. I will investigate where those missing LAs have gone.
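Written out with the two Avon and Somerset totals for 2014, the calculation above is as follows (the symbols $P_{\text{new}}$ and $P_{\text{old}}$ are just labels for the two sums):

$$
\frac{P_{\text{new}} - P_{\text{old}}}{|P_{\text{old}}|} \times 100
= \frac{673590 - 671507}{671507} \times 100
= \frac{2083}{671507} \times 100
\approx 0.31\%
$$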

According to the ONS' [*A Beginner's Guide to UK Geography*](https://geoportal.statistics.gov.uk/datasets/d1f39e20edb940d58307a54d6e1045cd/about), in 2023 "the four districts within the county of Somerset were merged to form Somerset UA".

According to the *ONS' coding and naming policy* and the UK Geography guide, codes starting with E06 refer to unitary authorities and codes starting with E07 refer to non-metropolitan districts. It therefore makes sense that the E07 values have been dropped and a new E06 value has appeared in the more recent dataset. The coding policy explains:

*"Instances must not be coded with, and/or be based on, inbuilt intelligence (for example, alphabetically or hierarchically). This is because any later change (like renaming) that may occur might upset this inbuilt intelligence."*

This also explains why there is no logical pattern to the last two numeric digits of the Somerset UA code.
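As a small illustration of the prefix convention, the sketch below labels a LAD code by its first three characters. The E06 and E07 meanings come from the guidance cited above. The E08, E09 and W06 entries are added from general knowledge of ONS codes and should be checked against the ONS register before being relied on. The function name `lad_type` is hypothetical.

```python
# Three-character entity prefixes for local authority districts
LAD_PREFIXES = {
    "E06": "Unitary authority (England)",
    "E07": "Non-metropolitan district (England)",
    # The entries below are not stated in this notebook; verify before use
    "E08": "Metropolitan district (England)",
    "E09": "London borough",
    "W06": "Unitary authority (Wales)",
}


def lad_type(ladcode: str) -> str:
    """Return the authority type implied by the code prefix, or 'Other'."""
    return LAD_PREFIXES.get(ladcode[:3], "Other")


for code in ["E07000187", "E06000066", "W06000024"]:
    print(code, "->", lad_type(code))
```

Only the prefix carries meaning here. In line with the coding policy quoted above, nothing should be inferred from the remaining digits.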

## Checking values for other UAs

There are three other UAs that have been created that I need to check:
1. North Yorkshire
2. Cumberland
3. Westmorland and Furness

In [30]:
df.query('pfa == "North Yorkshire" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
172,E06000014,York,2014,86320,North Yorkshire
796,E06000065,North Yorkshire,2014,250054,North Yorkshire


In [32]:
df_old_qa.query('pfa == "North Yorkshire" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
273,E06000014,York,2014,86247,North Yorkshire
3433,E07000163,Craven,2014,23833,North Yorkshire
3453,E07000164,Hambleton,2014,37436,North Yorkshire
3473,E07000165,Harrogate,2014,65720,North Yorkshire
3493,E07000166,Richmondshire,2014,19659,North Yorkshire
3513,E07000167,Ryedale,2014,22220,North Yorkshire
3533,E07000168,Scarborough,2014,46306,North Yorkshire
3553,E07000169,Selby,2014,34952,North Yorkshire


As the Beginner's Guide to UK Geography explains, "The seven districts within the county of North Yorkshire were merged to form North Yorkshire UA". So this appears to marry up with the data we have for North Yorkshire. I'll compare the population values again.

Building on the previous code, I will now check the values for the other UAs.

In [33]:
def compare_pfa_population(df_new: pd.DataFrame, df_old: pd.DataFrame, pfas: list, year: int) -> pd.DataFrame:
    """
    Compare population data for multiple PFAs between two datasets.

    Parameters
    ----------
    df_new : pd.DataFrame
        The new population dataset.
    df_old : pd.DataFrame
        The old population dataset.
    pfas : list
        List of PFAs to compare.
    year : int
        The year to compare.

    Returns
    -------
    pd.DataFrame
        A DataFrame summarizing the population comparison for each PFA.
    """
    results = []
    for pfa in pfas:
        new_pop_sum = df_new.query('pfa == @pfa and year == @year')['freq'].sum()
        old_pop_sum = df_old.query('pfa == @pfa and year == @year')['freq'].sum()
        diff = new_pop_sum - old_pop_sum
        pct_diff = (diff / abs(old_pop_sum)) * 100 if old_pop_sum != 0 else None

        results.append({
            'PFA': pfa,
            'Year': year,
            'New Population': new_pop_sum,
            'Old Population': old_pop_sum,
            'Difference': diff,
            'Percentage Difference': pct_diff
        })

    return pd.DataFrame(results)

In [37]:
# List of PFAs to compare
pfas_to_compare = ["Avon and Somerset", "North Yorkshire", "Cumbria"]

# Year to compare
comparison_year = 2014

# Call the function
comparison_results = compare_pfa_population(df, df_old_qa, pfas_to_compare, comparison_year)
comparison_results

Unnamed: 0,PFA,Year,New Population,Old Population,Difference,Percentage Difference
0,Avon and Somerset,2014,673590,671507,2083,0.310198
1,North Yorkshire,2014,336374,336373,1,0.000297
2,Cumbria,2014,207498,207419,79,0.038087


These values all show very small differences, ranging from under 0.001% for North Yorkshire to about 0.3% for Avon and Somerset. I will accept these as being within the margin of error for the data.

Checking Cumbria next

In [31]:
df.query('pfa == "Cumbria" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
770,E06000063,Cumberland,2014,113421,Cumbria
783,E06000064,Westmorland and Furness,2014,94077,Cumbria


In [38]:
df_old_qa.query('pfa == "Cumbria" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
1293,E07000026,Allerdale,2014,40130,Cumbria
1313,E07000027,Barrow-in-Furness,2014,27647,Cumbria
1333,E07000028,Carlisle,2014,44921,Cumbria
1353,E07000029,Copeland,2014,28330,Cumbria
1373,E07000030,Eden,2014,22031,Cumbria
1393,E07000031,South Lakeland,2014,44360,Cumbria


Again, the change in the number of local authority districts is explained by the restructuring of local governance in the area, and the population difference is minimal. The restructuring is described as follows: "The six districts in the county of Cumbria were split into two UAs, Cumberland UA (comprising the districts of Allerdale, Carlisle and Copeland) and Westmorland UA (comprising the districts of Barrow-in-Furness, Eden and South Lakeland)." The second of these appears in the new dataset as Westmorland and Furness.
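To check the split at UA level rather than only at PFA level, the old district values can be summed by their successor UA and compared with the new rows. The sketch below does this for Cumbria using the 2014 figures from the two outputs above. The `successor` mapping follows the quoted description.

```python
import pandas as pd

# Old district -> successor UA, following the description quoted above
successor = {
    "Allerdale": "Cumberland",
    "Carlisle": "Cumberland",
    "Copeland": "Cumberland",
    "Barrow-in-Furness": "Westmorland and Furness",
    "Eden": "Westmorland and Furness",
    "South Lakeland": "Westmorland and Furness",
}

# 2014 values copied from df_old_qa and df above
old_2014 = pd.Series({
    "Allerdale": 40130,
    "Barrow-in-Furness": 27647,
    "Carlisle": 44921,
    "Copeland": 28330,
    "Eden": 22031,
    "South Lakeland": 44360,
})
new_2014 = pd.Series({"Cumberland": 113421, "Westmorland and Furness": 94077})

# Grouping a Series by a dict maps each index label to its group
old_by_ua = old_2014.groupby(successor).sum()
ua_diff = new_2014 - old_by_ua
print(ua_diff)
```

The two UA-level differences (40 and 39) add up to the PFA-level difference of 79 reported for Cumbria in 2014.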

## Visualising the differences over the entire time period

In [40]:
def compare_pfa_population(df_new: pd.DataFrame, df_old: pd.DataFrame, pfas: list) -> pd.DataFrame:
    """
    Compare population data for multiple PFAs across every year present in both datasets.

    Parameters
    ----------
    df_new : pd.DataFrame
        The new population dataset.
    df_old : pd.DataFrame
        The old population dataset.
    pfas : list
        List of PFAs to compare.

    Returns
    -------
    pd.DataFrame
        A DataFrame summarizing the population comparison for each PFA and shared year.
    """
    shared_years = sorted(set(df_new['year']).intersection(set(df_old['year'])))
    comparison_results = []
    for year in shared_years:
        for pfa in pfas:
            new_pop_sum = df_new.query('pfa == @pfa and year == @year')['freq'].sum()
            old_pop_sum = df_old.query('pfa == @pfa and year == @year')['freq'].sum()
            diff = new_pop_sum - old_pop_sum
            pct_diff = (diff / abs(old_pop_sum)) * 100 if old_pop_sum != 0 else None

            comparison_results.append({
                'PFA': pfa,
                'Year': year,
                'New Population': new_pop_sum,
                'Old Population': old_pop_sum,
                'Difference': diff,
                'Percentage Difference': pct_diff
            })

    return pd.DataFrame(comparison_results)

In [97]:
# List of PFAs to compare
pfas_to_compare = ["Avon and Somerset", "North Yorkshire", "Cumbria"]

# Call the function
comparison_results = compare_pfa_population(df, df_old_qa, pfas_to_compare)
comparison_results

Unnamed: 0,PFA,Year,New Population,Old Population,Difference,Percentage Difference
0,Avon and Somerset,2011,652988,652988,0,0.0
1,North Yorkshire,2011,331242,331242,0,0.0
2,Cumbria,2011,207349,207349,0,0.0
3,Avon and Somerset,2012,659643,658816,827,0.125528
4,North Yorkshire,2012,333345,333067,278,0.083467
5,Cumbria,2012,207293,207356,-63,-0.030383
6,Avon and Somerset,2013,666906,665230,1676,0.251943
7,North Yorkshire,2013,334883,334893,-10,-0.002986
8,Cumbria,2013,206992,207085,-93,-0.044909
9,Avon and Somerset,2014,673590,671507,2083,0.310198


In [46]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio

from src.visualization import prt_theme
pio.templates.default = "prt_template"

In [99]:
no_modebar = {'displayModeBar': False}
# Filter data for visualization
sample_pfas = comparison_results['PFA'].unique()

# Dynamically calculate vertical spacing based on the number of rows
num_cols = 3
num_rows = -(-len(sample_pfas) // num_cols)  # Ceiling division to determine rows
max_vertical_spacing = 1 / (num_rows - 1) if num_rows > 1 else 0.1
vertical_spacing = min(0.02, max_vertical_spacing)  # Use 0.02 or the maximum allowed value

# Create subplots
fig = make_subplots(
    rows=num_rows, cols=num_cols,
    subplot_titles=sample_pfas,
    vertical_spacing=vertical_spacing,
    horizontal_spacing=0.08
)

for i, pfa in enumerate(sample_pfas):
    pfa_data = comparison_results[comparison_results['PFA'] == pfa]
    
    # Calculate row and column position (Plotly uses 1-based indexing)
    row = (i // num_cols) + 1
    col = (i % num_cols) + 1
    
    fig.add_trace(
        go.Scatter(
            x=pfa_data['Year'],
            y=pfa_data['Percentage Difference'],
            mode='lines+markers',
            name=pfa,
            showlegend=False,
            line=dict(width=2),
            marker=dict(size=6),
            hovertemplate="%{y:.2f}%<extra>Percentage difference</extra>",
        ),
        row=row, col=col
    )

# Update layout
fig.update_layout(
    margin=dict(l=30, r=15),
    height=300 * num_rows,  # Adjust height dynamically based on rows
    title_text="Percentage Differences in Population by PFA (New vs Old Data)",
)

# Update x and y axis labels
fig.update_xaxes(nticks=6)
fig.update_yaxes(nticks=5, ticksuffix="%", range=[-2.1, 6.1])

fig.show(config=no_modebar)

The differences are all very small, so I will accept these as being within the margin of error for the data.

### Exporting the visualisation
Exporting the visualisation to HTML for inclusion in the documentation

In [87]:
fig.write_html("docs/assets/pfa_population_comparison.html", config=no_modebar, include_plotlyjs="cdn")

### Visualising the differences in population for all PFAs

In [88]:
comparison_results = compare_pfa_population(df, df_old_qa, list(df['pfa'].unique()))
comparison_results

Unnamed: 0,PFA,Year,New Population,Old Population,Difference,Percentage Difference
0,Cleveland,2011,225604,225604,0,0.000000
1,Durham,2011,255745,255745,0,0.000000
2,Cheshire,2011,420894,420894,0,0.000000
3,Lancashire,2011,591265,591265,0,0.000000
4,Humberside,2011,373171,373171,0,0.000000
...,...,...,...,...,...,...
415,Metropolitan Police,2020,3614063,3486556,127507,3.657105
416,North Wales,2020,285472,288072,-2600,-0.902552
417,Dyfed-Powys,2020,215341,217989,-2648,-1.214740
418,South Wales,2020,539902,548809,-8907,-1.622969


In [96]:
no_modebar = {'displayModeBar': False}
# Filter data for visualization
sample_pfas = comparison_results['PFA'].unique()

# Dynamically calculate vertical spacing based on the number of rows
num_cols = 3
num_rows = -(-len(sample_pfas) // num_cols)  # Ceiling division to determine rows
max_vertical_spacing = 1 / (num_rows - 1) if num_rows > 1 else 0.1
vertical_spacing = min(0.02, max_vertical_spacing)  # Use 0.02 or the maximum allowed value

# Create subplots
fig = make_subplots(
    rows=num_rows, cols=num_cols,
    subplot_titles=sample_pfas,
    vertical_spacing=vertical_spacing,
    horizontal_spacing=0.08
)

for i, pfa in enumerate(sample_pfas):
    pfa_data = comparison_results[comparison_results['PFA'] == pfa]
    
    # Calculate row and column position (Plotly uses 1-based indexing)
    row = (i // num_cols) + 1
    col = (i % num_cols) + 1
    
    fig.add_trace(
        go.Scatter(
            x=pfa_data['Year'],
            y=pfa_data['Percentage Difference'],
            mode='lines+markers',
            name=pfa,
            showlegend=False,
            line=dict(width=2),
            marker=dict(size=6),
            hovertemplate="%{y:.2f}%<extra>Percentage difference</extra>",
        ),
        row=row, col=col
    )

# Update layout
fig.update_layout(
    margin=dict(l=30, r=15),
    height=300 * num_rows,  # Adjust height dynamically based on rows
    # title_text="Percentage Differences in Population by PFA (New vs Old Data)",
)

# Update x and y axis labels
fig.update_xaxes(nticks=6)
fig.update_yaxes(nticks=5, ticksuffix="%", range=[-2.1, 6.1])

fig.show(config=no_modebar)

### Checking which local authorities are represented in those PFAs with the largest differences

* Cambridgeshire
* Bedfordshire
* Thames Valley
* Metropolitan Police
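One way to shortlist PFAs like these, instead of reading them off the chart, is to rank each PFA by its largest absolute percentage difference. The sketch below applies this to a handful of rows copied from the `comparison_results` outputs above, so it cannot reproduce the full list. The 1% tolerance is an arbitrary value chosen for illustration.

```python
import pandas as pd

# Rows copied from the comparison_results outputs shown earlier
extract = pd.DataFrame({
    "PFA": ["Avon and Somerset", "Metropolitan Police", "North Wales",
            "Dyfed-Powys", "South Wales"],
    "Year": [2014, 2020, 2020, 2020, 2020],
    "Percentage Difference": [0.310198, 3.657105, -0.902552, -1.214740, -1.622969],
})

TOLERANCE_PCT = 1.0  # hypothetical cut-off, not a value used in this analysis

# Largest absolute difference per PFA across all years, biggest first
largest_abs_diff = (
    extract.assign(abs_pct=extract["Percentage Difference"].abs())
    .groupby("PFA")["abs_pct"]
    .max()
    .sort_values(ascending=False)
)
flagged = largest_abs_diff[largest_abs_diff > TOLERANCE_PCT]
print(flagged)
```

Run against the full `comparison_results` frame, the same ranking would show which PFAs warrant a closer look at their constituent local authorities.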

In [26]:
df_old_qa.query('pfa == "Thames Valley" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
673,E06000036,Bracknell Forest,2014,45845,Thames Valley
693,E06000037,West Berkshire,2014,61448,Thames Valley
713,E06000038,Reading,2014,62159,Thames Valley
733,E06000039,Slough,2014,52590,Thames Valley
753,E06000040,Windsor and Maidenhead,2014,58573,Thames Valley
773,E06000041,Wokingham,2014,62817,Thames Valley
793,E06000042,Milton Keynes,2014,99507,Thames Valley
1133,E06000060,Buckinghamshire,2014,207716,Thames Valley
3713,E07000177,Cherwell,2014,57109,Thames Valley
3733,E07000178,Oxford,2014,62277,Thames Valley


In [27]:
df.query('pfa == "Thames Valley" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
432,E06000036,Bracknell Forest,2014,46063,Thames Valley
445,E06000037,West Berkshire,2014,61927,Thames Valley
458,E06000038,Reading,2014,65617,Thames Valley
471,E06000039,Slough,2014,54458,Thames Valley
484,E06000040,Windsor and Maidenhead,2014,59637,Thames Valley
497,E06000041,Wokingham,2014,63109,Thames Valley
510,E06000042,Milton Keynes,2014,102025,Thames Valley
731,E06000060,Buckinghamshire,2014,209681,Thames Valley
2291,E07000177,Cherwell,2014,58044,Thames Valley
2304,E07000178,Oxford,2014,64558,Thames Valley


The following table shows the local authorities within these PFAs that had the largest differences between the 2021 rolled-forward and 2021 Census-based mid-year estimates, as reported by the ONS.

| Police Force Area | Local Authority | ONS % difference |
| ----------------- | --------------- | ---------------- |
| Cambridgeshire    | Cambridge       | -15.66           |
| Cambridgeshire    | Peterborough    | -6.53            |
| Bedfordshire      | Luton           | -5.69            |
| Thames Valley     | Reading         | -9.57            |
| Thames Valley     | Slough          | -6.53            |
| Thames Valley     | Oxford          | -5.56            |
In addition, the ONS highlights that London population estimates also saw larger differences, which is consistent with the larger gap seen for the Metropolitan Police PFA.

## Final thoughts
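The analysis shows that the changes to the ONS population data have had a minimal impact on the overall population of the PFAs. The newly created UAs (Somerset, North Yorkshire, Cumberland, and Westmorland and Furness) change which local authorities appear in the data, but the affected PFA totals for 2014 differ by about 0.3% or less. The larger differences, such as the 3.7% for the Metropolitan Police in 2020, occur in areas where the ONS itself reports larger differences between the rolled-forward and Census-based estimates. I will therefore accept the differences in population values as being within an acceptable margin of error.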