# Women's imprisonment rates
## ONS population by Police Force Area: Data QA
Checking that my new dataset values are in line with previous years

## Loading data

In [1]:
%load_ext autoreload
%autoreload 2

In [6]:
import pandas as pd
import src.utilities as utils
config = utils.read_config()

In [39]:
df = utils.load_data('interim', 'LA_PFA_population_women_2011-2023.csv')
df

2025-07-15 16:49:01,820 - INFO - Loaded data from data/interim/LA_PFA_population_women_2011-2023.csv


Unnamed: 0,ladcode,laname,year,freq,pfa
0,E06000001,Hartlepool,2011,37332,Cleveland
1,E06000001,Hartlepool,2012,37470,Cleveland
2,E06000001,Hartlepool,2013,37476,Cleveland
3,E06000001,Hartlepool,2014,37491,Cleveland
4,E06000001,Hartlepool,2015,37524,Cleveland
...,...,...,...,...,...
4116,W06000024,Merthyr Tydfil,2019,24168,South Wales
4117,W06000024,Merthyr Tydfil,2020,24134,South Wales
4118,W06000024,Merthyr Tydfil,2021,24061,South Wales
4119,W06000024,Merthyr Tydfil,2022,24056,South Wales


## Steps to complete for loading previous analysis' data:

1. Download ONS mid-year estimates (with 2021 geog LA codes)
2. Process for adult women in England and Wales
3. Match LAs to earlier PFA codes and process

1. The pre-2021 Census data has to be downloaded manually. It is in the zip file of the [Mid-2001 to mid-2020 detailed time series edition of this dataset](https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/populationestimatesforukenglandandwalesscotlandandnorthernireland). I will look to automate this in the future.

2. The data is then processed by `ons_comparator.py` to produce `LA_population_women_2001-2020.csv`, ready for matching with the PFA codes.
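
As a rough illustration of step 2, the sketch below filters a wide population table to adult women and melts it to one row per LA and year. It is not the code in `ons_comparator.py`: the function name `adult_women_long`, the column names (`sex`, `age`, `population_<year>`), the `"F"` sex code and the age threshold of 18 are all assumptions made for the toy data.

```python
import pandas as pd


def adult_women_long(wide: pd.DataFrame, min_age: int = 18, female_code: str = "F") -> pd.DataFrame:
    """Filter a wide table to adult women and return one row per LA and year.

    Hypothetical helper: column names and codes are assumptions, not the ONS schema.
    """
    women = wide[(wide["sex"] == female_code) & (wide["age"] >= min_age)]
    year_cols = [c for c in wide.columns if c.startswith("population_")]
    # Wide to long: one row per LA, age and year.
    long = women.melt(
        id_vars=["ladcode", "laname"],
        value_vars=year_cols,
        var_name="year",
        value_name="freq",
    )
    long["year"] = long["year"].str.replace("population_", "", regex=False).astype(int)
    # Sum over the remaining single years of age.
    return long.groupby(["ladcode", "laname", "year"], as_index=False)["freq"].sum()


# Toy data only; these are not ONS figures.
toy = pd.DataFrame({
    "ladcode": ["X1"] * 4,
    "laname": ["Toytown"] * 4,
    "sex": ["F", "F", "F", "M"],
    "age": [17, 18, 40, 40],
    "population_2019": [10, 20, 30, 99],
    "population_2020": [11, 21, 31, 98],
})
result = adult_women_long(toy)
print(result)
```

Keeping the age threshold as a parameter makes the definition of "adult" explicit and easy to check against the published methodology.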

In [4]:
df_old = utils.load_data('interim', 'LA_population_women_2001-2020.csv')
df_old

2025-07-15 13:31:58,980 - INFO - Loaded data from data/interim/LA_population_women_2001-2020.csv


Unnamed: 0,ladcode,laname,year,freq
0,E06000001,Hartlepool,2001,35629
1,E06000001,Hartlepool,2002,35660
2,E06000001,Hartlepool,2003,35795
3,E06000001,Hartlepool,2004,35901
4,E06000001,Hartlepool,2005,36065
...,...,...,...,...
6615,W06000024,Merthyr Tydfil,2016,24249
6616,W06000024,Merthyr Tydfil,2017,24358
6617,W06000024,Merthyr Tydfil,2018,24426
6618,W06000024,Merthyr Tydfil,2019,24493


### 3. Matching PFA codes to `df_old`

In [5]:
from src.data.processing import la_to_pfa_matching

In [7]:
old_pfa_lookup_filename = config['data']['qaFilenames']['la_to_pfa_lookup']
old_pfa_lookup = utils.load_data('raw', old_pfa_lookup_filename)
old_pfa_lookup

2025-07-15 13:36:34,207 - INFO - Loaded data from data/raw/LA_to_PFA_(December_2022)_Lookup_in_EW.csv


Unnamed: 0,LAD22CD,LAD22NM,PFA22CD,PFA22NM
0,E08000001,Bolton,E23000005,Greater Manchester
1,E08000002,Bury,E23000005,Greater Manchester
2,E08000003,Manchester,E23000005,Greater Manchester
3,E08000004,Oldham,E23000005,Greater Manchester
4,E08000005,Rochdale,E23000005,Greater Manchester
...,...,...,...,...
336,E07000227,Horsham,E23000033,Sussex
337,E07000063,Lewes,E23000033,Sussex
338,E07000228,Mid Sussex,E23000033,Sussex
339,E07000064,Rother,E23000033,Sussex


In [8]:
df_old_qa = (
    la_to_pfa_matching.assign_pfa(old_pfa_lookup, df_old)
    .pipe(la_to_pfa_matching.filter_and_clean_data)
)
df_old_qa

2025-07-15 13:37:39,233 - INFO - Matching Local Authority Districts to Police Force Areas...
2025-07-15 13:37:39,235 - INFO - Creating lookup dictionary...
2025-07-15 13:37:39,236 - INFO - Standardising column names...
2025-07-15 13:37:39,246 - INFO - Filtering and cleaning population data...


Unnamed: 0,ladcode,laname,year,freq,pfa
0,E06000001,Hartlepool,2001,35629,Cleveland
1,E06000001,Hartlepool,2002,35660,Cleveland
2,E06000001,Hartlepool,2003,35795,Cleveland
3,E06000001,Hartlepool,2004,35901,Cleveland
4,E06000001,Hartlepool,2005,36065,Cleveland
...,...,...,...,...,...
6615,W06000024,Merthyr Tydfil,2016,24249,South Wales
6616,W06000024,Merthyr Tydfil,2017,24358,South Wales
6617,W06000024,Merthyr Tydfil,2018,24426,South Wales
6618,W06000024,Merthyr Tydfil,2019,24493,South Wales


### Comparing the old and new dataset values

In [9]:
df_old_qa.query('pfa == "Avon and Somerset" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
433,E06000022,Bath and North East Somerset,2014,75367,Avon and Somerset
453,E06000023,"Bristol, City of",2014,177077,Avon and Somerset
473,E06000024,North Somerset,2014,86277,Avon and Somerset
493,E06000025,South Gloucestershire,2014,108608,Avon and Somerset
3813,E07000187,Mendip,2014,45653,Avon and Somerset
3833,E07000188,Sedgemoor,2014,48795,Avon and Somerset
3853,E07000189,South Somerset,2014,67660,Avon and Somerset
4793,E07000246,Somerset West and Taunton,2014,62070,Avon and Somerset


In [10]:
df.query('pfa == "Avon and Somerset" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
276,E06000022,Bath and North East Somerset,2014,75340,Avon and Somerset
289,E06000023,"Bristol, City of",2014,178108,Avon and Somerset
302,E06000024,North Somerset,2014,86610,Avon and Somerset
315,E06000025,South Gloucestershire,2014,108410,Avon and Somerset
809,E06000066,Somerset,2014,225122,Avon and Somerset


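Hmmm, the new dataset seems to be missing some LAs: Mendip, Sedgemoor, South Somerset, and Somerset West and Taunton are gone, and a single Somerset row (E06000066) has appeared instead.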

In [10]:
df.query('laname == "Mendip" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
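
Rather than querying one LA name at a time, an outer merge with `indicator=True` would list every code that appears in only one of the two datasets. A minimal sketch on a few rows copied from the Avon and Somerset outputs above; the frame names `old` and `new` are illustrative.

```python
import pandas as pd

# A subset of the 2014 Avon and Somerset rows shown above.
old = pd.DataFrame({
    "ladcode": ["E06000022", "E07000187", "E07000188"],
    "laname": ["Bath and North East Somerset", "Mendip", "Sedgemoor"],
})
new = pd.DataFrame({
    "ladcode": ["E06000022", "E06000066"],
    "laname": ["Bath and North East Somerset", "Somerset"],
})

# The _merge column labels each code as left_only, right_only or both.
coverage = old.merge(
    new, on="ladcode", how="outer", suffixes=("_old", "_new"), indicator=True
)
only_old = sorted(coverage.loc[coverage["_merge"] == "left_only", "ladcode"])
only_new = sorted(coverage.loc[coverage["_merge"] == "right_only", "ladcode"])
print("Only in old:", only_old)
print("Only in new:", only_new)
```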


Checking population differences

In [16]:
new_pop_sum = df.query('pfa == "Avon and Somerset" and year == 2014')['freq'].sum()
old_pop_sum = df_old_qa.query('pfa == "Avon and Somerset" and year == 2014')['freq'].sum()
print(f'The new population for the Avon and Somerset PFA is {new_pop_sum}, and the old was {old_pop_sum}. A difference of {new_pop_sum - old_pop_sum}.')

The new population for the Avon and Somerset PFA is 673590, and the old was 671507. A difference of 2083.


In [12]:
pct_diff = ((new_pop_sum - old_pop_sum)/abs(old_pop_sum)) * 100
pct_diff

np.float64(0.3101978088091412)

A difference of about 0.3%, so not massively out. I will investigate where those missing LAs have gone.
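
The same check could be run for every PFA and year at once instead of one query at a time. A minimal sketch, assuming both frames have `pfa`, `year` and `freq` columns as in the outputs above; `pfa_pct_diff` is a hypothetical helper and not part of `src`.

```python
import pandas as pd


def pfa_pct_diff(new: pd.DataFrame, old: pd.DataFrame) -> pd.DataFrame:
    """Sum freq by pfa and year in each frame, then compare the totals."""
    keys = ["pfa", "year"]
    totals = pd.concat(
        {
            "new": new.groupby(keys)["freq"].sum(),
            "old": old.groupby(keys)["freq"].sum(),
        },
        axis=1,
    )
    totals["diff"] = totals["new"] - totals["old"]
    totals["pct_diff"] = totals["diff"] / totals["old"].abs() * 100
    return totals.reset_index()


# The 2014 Avon and Somerset rows from the two outputs above.
new = pd.DataFrame({
    "pfa": ["Avon and Somerset"] * 5,
    "year": [2014] * 5,
    "freq": [75340, 178108, 86610, 108410, 225122],
})
old = pd.DataFrame({
    "pfa": ["Avon and Somerset"] * 8,
    "year": [2014] * 8,
    "freq": [75367, 177077, 86277, 108608, 45653, 48795, 67660, 62070],
})
summary = pfa_pct_diff(new, old)
print(summary)
```

Sorting the full result by the absolute value of `pct_diff` would surface the PFAs most affected by boundary or reconciliation changes.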

According to the ONS' [*A Beginner's Guide to UK Geography*](https://geoportal.statistics.gov.uk/datasets/d1f39e20edb940d58307a54d6e1045cd/about), in 2023 "the four districts within the county of Somerset were merged to form Somerset UA".

According to the *ONS' coding and naming policy* and the UK Geography guide, codes starting with E06 refer to unitary authorities and codes starting with E07 refer to non-metropolitan districts. It therefore makes sense that the four E07 rows have been dropped and a new E06 row has appeared in the more recent dataset. The coding policy explains:

*"Instances must not be coded with, and/or be based on, inbuilt intelligence (for example, alphabetically or hierarchically). This is because any later change (like renaming) that may occur might upset this inbuilt intelligence."*

This also explains why there is no logical pattern to the last two numeric digits of the Somerset UA code (E06000066): codes are deliberately assigned without any inbuilt meaning.
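
For a like-for-like comparison, the four former district codes can be rolled up to the Somerset UA code before comparing. A minimal sketch using the 2014 figures from the older dataset shown above; the mapping `SOMERSET_MERGE` is an assumption I have read off the two outputs, not an official ONS lookup.

```python
import pandas as pd

# Assumed mapping: the four former districts roll up to the Somerset UA code.
SOMERSET_MERGE = {
    "E07000187": "E06000066",  # Mendip
    "E07000188": "E06000066",  # Sedgemoor
    "E07000189": "E06000066",  # South Somerset
    "E07000246": "E06000066",  # Somerset West and Taunton
}

old_2014 = pd.DataFrame({
    "ladcode": ["E07000187", "E07000188", "E07000189", "E07000246"],
    "freq": [45653, 48795, 67660, 62070],
})

# Recode the districts (unmapped codes are left unchanged), then sum per new code.
old_2014["ladcode_new"] = old_2014["ladcode"].replace(SOMERSET_MERGE)
rolled_up = old_2014.groupby("ladcode_new")["freq"].sum()
print(rolled_up["E06000066"])  # 224178
```

The rolled-up total of 224,178 compares with 225,122 for Somerset in the newer dataset, a gap of 944, so the merger itself accounts for the missing rows but not for the whole difference in values.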

## Checking values for other UAs

I am reusing some of my previous analysis, carried out after the 2021 Census, in a new QA script that compares it against my new dataset. Note that this earlier data had not been reconciled post-Census at that stage, so it may also be helpful to compare against the subsequent adjusted mid-2011 to mid-2022 edition of the data at https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/populationestimatesforukenglandandwalesscotlandandnorthernireland

In [24]:
import src.data.qa.LA_PFA_QA as LA_PFA_QA

In [22]:
df = LA_PFA_QA.load_and_process_data()
df

2025-07-15 14:16:45,043 - INFO - Loaded data from data/raw/MYEB1_detailed_population_estimates_series_UK_(2021_geog21).csv
2025-07-15 14:16:45,973 - INFO - Loaded data from data/raw/MYEB2_detailed_components_of_change_for reconciliation_EW_(2021_geog21).csv
2025-07-15 14:16:46,007 - INFO - Combining 2021 census population figures with reconciliation data...
2025-07-15 14:16:46,009 - INFO - Preprocessing population data...
2025-07-15 14:16:46,011 - INFO - Standardising column names...
2025-07-15 14:16:46,016 - INFO - Filtering for England and Wales...
2025-07-15 14:16:46,031 - INFO - Filtering for adult women...
2025-07-15 14:16:46,039 - INFO - Preprocessing population data...
2025-07-15 14:16:46,040 - INFO - Standardising column names...
2025-07-15 14:16:46,049 - INFO - Filtering for England and Wales...
2025-07-15 14:16:46,060 - INFO - Filtering for adult women...
2025-07-15 14:16:46,127 - INFO - Melting DataFrame from wide to long format...
2025-07-15 14:16:46,228 - INFO - Cleaning y

Unnamed: 0,ladcode,laname,year,freq
0,E06000001,Hartlepool,2001,32246
1,E06000001,Hartlepool,2002,32300
2,E06000001,Hartlepool,2003,32464
3,E06000001,Hartlepool,2004,32710
4,E06000001,Hartlepool,2005,32946
...,...,...,...,...
6946,W06000024,Merthyr Tydfil,2017,22859
6947,W06000024,Merthyr Tydfil,2018,22943
6948,W06000024,Merthyr Tydfil,2019,22976
6949,W06000024,Merthyr Tydfil,2020,23025


Now that the reconciliation has been done, I will move on to matching the PFA codes to the ladcodes

In [30]:
df = LA_PFA_QA.load_and_process_data()
df

2025-07-15 15:44:59,844 - INFO - Loaded data from data/raw/MYEB1_detailed_population_estimates_series_UK_(2021_geog21).csv
2025-07-15 15:45:00,732 - INFO - Loaded data from data/raw/MYEB2_detailed_components_of_change_for reconciliation_EW_(2021_geog21).csv
2025-07-15 15:45:00,820 - INFO - Loaded data from data/raw/LA_to_PFA_(December_2024)_Lookup_in_EW.csv
2025-07-15 15:45:00,829 - INFO - Combining 2021 census population figures with reconciliation data...
2025-07-15 15:45:00,830 - INFO - Preprocessing population data...
2025-07-15 15:45:00,831 - INFO - Standardising column names...
2025-07-15 15:45:00,834 - INFO - Filtering for England and Wales...
2025-07-15 15:45:00,838 - INFO - Filtering for adult women...
2025-07-15 15:45:00,844 - INFO - Preprocessing population data...
2025-07-15 15:45:00,845 - INFO - Standardising column names...
2025-07-15 15:45:00,854 - INFO - Filtering for England and Wales...
2025-07-15 15:45:00,866 - INFO - Filtering for adult women...
2025-07-15 15:45:00,

Unnamed: 0,ladcode,laname,year,freq,pfa
0,E06000001,Hartlepool,2001,32246,Cleveland
1,E06000001,Hartlepool,2002,32300,Cleveland
2,E06000001,Hartlepool,2003,32464,Cleveland
3,E06000001,Hartlepool,2004,32710,Cleveland
4,E06000001,Hartlepool,2005,32946,Cleveland
...,...,...,...,...,...
6946,W06000024,Merthyr Tydfil,2017,22859,South Wales
6947,W06000024,Merthyr Tydfil,2018,22943,South Wales
6948,W06000024,Merthyr Tydfil,2019,22976,South Wales
6949,W06000024,Merthyr Tydfil,2020,23025,South Wales


Check that all of the LAs in the population data are present in the PFA lookup table

In [31]:
df.query('pfa == "Avon and Somerset" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
454,E06000022,Bath and North East Somerset,2014,71450,Avon and Somerset
475,E06000023,"Bristol, City of",2014,174782,Avon and Somerset
496,E06000024,North Somerset,2014,79439,Avon and Somerset
517,E06000025,South Gloucestershire,2014,105335,Avon and Somerset


In [32]:
local_authorities = ['Mendip', 'Somerset', 'Sedgemoor']
df.query('laname in @local_authorities')

Unnamed: 0,ladcode,laname,year,freq,pfa
3990,E07000187,Mendip,2001,38138,
3991,E07000187,Mendip,2002,38430,
3992,E07000187,Mendip,2003,38741,
3993,E07000187,Mendip,2004,39106,
3994,E07000187,Mendip,2005,39419,
3995,E07000187,Mendip,2006,39686,
3996,E07000187,Mendip,2007,40220,
3997,E07000187,Mendip,2008,40721,
3998,E07000187,Mendip,2009,40754,
3999,E07000187,Mendip,2010,40924,


Looks as though some values are missing in the processed data: the Mendip rows have an empty `pfa`, so they drop out of the Avon and Somerset query. I suspect this is because the PFA lookup table has been updated to include the new Somerset unitary authority, while the population data still has the old LAs. I will check the lookup file and, if necessary, use an older version of the PFA lookup table to match the LAs to the PFAs, then re-run the QA script.
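
A quick way to test this suspicion is to list the population LA codes that have no entry in the lookup. A minimal sketch on toy rows; `unmatched_las` is a hypothetical helper, and the lookup rows are illustrative, not copied from the real file.

```python
import pandas as pd


def unmatched_las(population: pd.DataFrame, lookup: pd.DataFrame, lookup_code_col: str) -> pd.DataFrame:
    """Return one row per LA in `population` whose code is absent from `lookup`."""
    known_codes = set(lookup[lookup_code_col])
    missing = population.loc[
        ~population["ladcode"].isin(known_codes), ["ladcode", "laname"]
    ]
    return missing.drop_duplicates().reset_index(drop=True)


population = pd.DataFrame({
    "ladcode": ["E06000022", "E07000187", "E07000187"],
    "laname": ["Bath and North East Somerset", "Mendip", "Mendip"],
    "year": [2014, 2001, 2002],
})
# Illustrative lookup rows using the 2024-style column names.
lookup_2024 = pd.DataFrame({
    "LAD24CD": ["E06000022", "E06000066"],
    "LAD24NM": ["Bath and North East Somerset", "Somerset"],
})
missing = unmatched_las(population, lookup_2024, "LAD24CD")
print(missing)
```

An empty result would mean every LA in the population data can be assigned a PFA with that lookup version.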

In [33]:
la_pfa_lookup = LA_PFA_QA.load_la_to_pfa_lookup()
la_pfa_lookup

2025-07-15 16:24:43,637 - INFO - Loaded data from data/raw/LA_to_PFA_(December_2024)_Lookup_in_EW.csv


Unnamed: 0,LAD24CD,LAD24NM,PFA24CD,PFA24NM
0,E07000136,Boston,E23000020,Lincolnshire
1,E07000137,East Lindsey,E23000020,Lincolnshire
2,E07000138,Lincoln,E23000020,Lincolnshire
3,E07000139,North Kesteven,E23000020,Lincolnshire
4,E07000140,South Holland,E23000020,Lincolnshire
...,...,...,...,...
327,W06000020,Torfaen,W15000002,Gwent
328,W06000021,Monmouthshire,W15000002,Gwent
329,W06000022,Newport,W15000002,Gwent
330,W06000023,Powys,W15000004,Dyfed-Powys


Right, that has loaded the December 2024 version of the lookup rather than the December 2022 version. I have now updated the script to load the 2022 version.
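
Because the lookup column names carry the vintage (`LAD22CD` versus `LAD24CD`), one possible guard against loading the wrong file is to read the vintage from the column names and fail early if it is not the one expected. A minimal sketch; `split_lookup_vintage` is a hypothetical helper and not how `LA_PFA_QA` does it.

```python
import re
from typing import Optional, Tuple

import pandas as pd

# Matches names such as LAD22CD or PFA24NM and captures the two-digit vintage.
VINTAGE_PATTERN = re.compile(r"^(LAD|PFA)(\d{2})(CD|NM)$")


def split_lookup_vintage(lookup: pd.DataFrame) -> Tuple[pd.DataFrame, Optional[str]]:
    """Strip the vintage from lookup column names and report which vintage was found."""
    renamed = {}
    vintages = set()
    for col in lookup.columns:
        match = VINTAGE_PATTERN.match(str(col))
        if match:
            prefix, vintage, suffix = match.groups()
            renamed[col] = f"{prefix}{suffix}".lower()
            vintages.add(vintage)
    if len(vintages) > 1:
        raise ValueError(f"Mixed vintages in lookup columns: {sorted(vintages)}")
    return lookup.rename(columns=renamed), (vintages.pop() if vintages else None)


lookup_2022 = pd.DataFrame({
    "LAD22CD": ["E08000001"],
    "LAD22NM": ["Bolton"],
    "PFA22CD": ["E23000005"],
    "PFA22NM": ["Greater Manchester"],
})
clean, vintage = split_lookup_vintage(lookup_2022)
if vintage != "22":
    raise ValueError(f"Expected the 2022 lookup, got vintage {vintage!r}")
print(list(clean.columns), vintage)
```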

In [34]:
la_pfa_lookup = LA_PFA_QA.load_la_to_pfa_lookup()
la_pfa_lookup

2025-07-15 16:26:28,427 - INFO - Loaded data from data/raw/LA_to_PFA_(December_2022)_Lookup_in_EW.csv


Unnamed: 0,LAD22CD,LAD22NM,PFA22CD,PFA22NM
0,E08000001,Bolton,E23000005,Greater Manchester
1,E08000002,Bury,E23000005,Greater Manchester
2,E08000003,Manchester,E23000005,Greater Manchester
3,E08000004,Oldham,E23000005,Greater Manchester
4,E08000005,Rochdale,E23000005,Greater Manchester
...,...,...,...,...
336,E07000227,Horsham,E23000033,Sussex
337,E07000063,Lewes,E23000033,Sussex
338,E07000228,Mid Sussex,E23000033,Sussex
339,E07000064,Rother,E23000033,Sussex


That's better. Let's re-run the QA script to check the values.

In [37]:
df = LA_PFA_QA.load_and_process_data()
df

2025-07-15 16:28:25,693 - INFO - Loaded data from data/raw/MYEB1_detailed_population_estimates_series_UK_(2021_geog21).csv
2025-07-15 16:28:26,713 - INFO - Loaded data from data/raw/MYEB2_detailed_components_of_change_for reconciliation_EW_(2021_geog21).csv
2025-07-15 16:28:26,753 - INFO - Loaded data from data/raw/LA_to_PFA_(December_2022)_Lookup_in_EW.csv
2025-07-15 16:28:26,757 - INFO - Combining 2021 census population figures with reconciliation data...
2025-07-15 16:28:26,758 - INFO - Preprocessing population data...
2025-07-15 16:28:26,758 - INFO - Standardising column names...
2025-07-15 16:28:26,762 - INFO - Filtering for England and Wales...
2025-07-15 16:28:26,766 - INFO - Filtering for adult women...
2025-07-15 16:28:26,771 - INFO - Preprocessing population data...
2025-07-15 16:28:26,772 - INFO - Standardising column names...
2025-07-15 16:28:26,782 - INFO - Filtering for England and Wales...
2025-07-15 16:28:26,794 - INFO - Filtering for adult women...
2025-07-15 16:28:26,

Unnamed: 0,ladcode,laname,year,freq,pfa
0,E06000001,Hartlepool,2001,32246,Cleveland
1,E06000001,Hartlepool,2002,32300,Cleveland
2,E06000001,Hartlepool,2003,32464,Cleveland
3,E06000001,Hartlepool,2004,32710,Cleveland
4,E06000001,Hartlepool,2005,32946,Cleveland
...,...,...,...,...,...
6946,W06000024,Merthyr Tydfil,2017,22859,South Wales
6947,W06000024,Merthyr Tydfil,2018,22943,South Wales
6948,W06000024,Merthyr Tydfil,2019,22976,South Wales
6949,W06000024,Merthyr Tydfil,2020,23025,South Wales


In [38]:
df.query('pfa == "Avon and Somerset" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
454,E06000022,Bath and North East Somerset,2014,71450,Avon and Somerset
475,E06000023,"Bristol, City of",2014,174782,Avon and Somerset
496,E06000024,North Somerset,2014,79439,Avon and Somerset
517,E06000025,South Gloucestershire,2014,105335,Avon and Somerset
4003,E07000187,Mendip,2014,42248,Avon and Somerset
4024,E07000188,Sedgemoor,2014,45771,Avon and Somerset
4045,E07000189,South Somerset,2014,64313,Avon and Somerset
5032,E07000246,Somerset West and Taunton,2014,56761,Avon and Somerset


That's better. Now I can check whether the values are in line with the new dataset.

In [40]:
df_old_qa.query('pfa == "Avon and Somerset" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
433,E06000022,Bath and North East Somerset,2014,75367,Avon and Somerset
453,E06000023,"Bristol, City of",2014,177077,Avon and Somerset
473,E06000024,North Somerset,2014,86277,Avon and Somerset
493,E06000025,South Gloucestershire,2014,108608,Avon and Somerset
3813,E07000187,Mendip,2014,45653,Avon and Somerset
3833,E07000188,Sedgemoor,2014,48795,Avon and Somerset
3853,E07000189,South Somerset,2014,67660,Avon and Somerset
4793,E07000246,Somerset West and Taunton,2014,62070,Avon and Somerset


Next step: check whether the dataframe `df` processed at the very start of the notebook is the same as the one processed in the script, or whether there are any differences following the reconciliation of the data. Then continue with the analysis.
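
A minimal sketch of that comparison, merging the two frames on `ladcode` and `year` and computing the difference per row. The values are the first two 2014 Avon and Somerset rows from the outputs above; the names `notebook_df` and `script_df` are illustrative.

```python
import pandas as pd

# Rows from the dataframe loaded at the start of the notebook.
notebook_df = pd.DataFrame({
    "ladcode": ["E06000022", "E06000023"],
    "year": [2014, 2014],
    "freq": [75340, 178108],
})
# The same LAs and year from the QA script output.
script_df = pd.DataFrame({
    "ladcode": ["E06000022", "E06000023"],
    "year": [2014, 2014],
    "freq": [71450, 174782],
})

# An inner merge keeps only LA and year pairs present in both frames.
compared = notebook_df.merge(
    script_df, on=["ladcode", "year"], how="inner", suffixes=("_notebook", "_script")
)
compared["diff"] = compared["freq_notebook"] - compared["freq_script"]
compared["pct_diff"] = compared["diff"] / compared["freq_script"] * 100
print(compared)
```

On these two rows the frames differ by a few thousand each, so the full comparison is worth running before continuing.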