# Women's imprisonment rates
## ONS population by Police Force Area: Data QA
Checking that my new dataset values are in line with previous years

## Loading data

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import src.utilities as utils
import src.data.processing.combine_custody_pfa_population as combine_custody_pfa_population

In [3]:
df1, df2 = combine_custody_pfa_population.load_data()

2025-07-10 14:56:27,690 - INFO - Loaded data from data/processed/PFA_custodial_sentences_all_FINAL.csv
2025-07-10 14:56:27,707 - INFO - Loaded data from data/interim/LA_PFA_population_women_2011-2023.csv


Repeating a check from the previous analysis to confirm that the filtering works and that all LAs are included

In [4]:
df2.query('pfa == "Avon and Somerset" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
276,E06000022,Bath and North East Somerset,2014,75340,Avon and Somerset
289,E06000023,"Bristol, City of",2014,178108,Avon and Somerset
302,E06000024,North Somerset,2014,86610,Avon and Somerset
315,E06000025,South Gloucestershire,2014,108410,Avon and Somerset
809,E06000066,Somerset,2014,225122,Avon and Somerset


Hmm, it seems some LAs are missing: the output has a single Somerset row (E06000066) rather than separate district rows

In [5]:
df2.query('laname == "Mendip" and year == 2014')

Unnamed: 0,ladcode,laname,year,freq,pfa
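
Rather than querying one LA name at a time, the same check could be run against a list of expected names. This is a minimal sketch on a toy frame that mirrors the 2014 Avon and Somerset rows above. The helper `missing_las` is hypothetical, and the names `Sedgemoor` and `South Somerset` are assumed examples of former districts that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy frame mirroring the LA names returned by the 2014 query above
df_demo = pd.DataFrame({
    "laname": [
        "Bath and North East Somerset",
        "Bristol, City of",
        "North Somerset",
        "South Gloucestershire",
        "Somerset",
    ],
    "year": [2014] * 5,
    "pfa": ["Avon and Somerset"] * 5,
})


def missing_las(df, expected_names, year):
    """Return the expected LA names that have no row for the given year."""
    present = set(df.loc[df["year"] == year, "laname"])
    return sorted(set(expected_names) - present)


# Assumption: an illustrative list of pre-merger districts to look for
expected_districts = ["Mendip", "Sedgemoor", "South Somerset"]
missing = missing_las(df_demo, expected_districts, 2014)
print(missing)
```

With the real data, `df_demo` would be replaced by `df2`.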


## UP TO HERE

Checking population differences

In [None]:
new_pop_sum = df2.query('pfa == "Avon and Somerset" and year == 2014')['freq'].sum()
old_pop_sum = 640099

In [None]:
pct_diff = ((new_pop_sum - old_pop_sum)/abs(old_pop_sum)) * 100
pct_diff

np.float64(5.23215940034276)

Just over 5% higher, so not massively out. I will investigate where those missing LAs have gone.
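
The arithmetic can be checked by hand from the five rows printed earlier. The sketch below wraps the same formula in a small helper. The function name `pct_diff_from_baseline` and the zero-baseline guard are my additions, not part of the project code.

```python
def pct_diff_from_baseline(new: float, old: float) -> float:
    """Percentage change of `new` relative to the baseline `old`."""
    if old == 0:
        raise ValueError("baseline must be non-zero")
    return (new - old) / abs(old) * 100


# The five 2014 Avon and Somerset populations shown in the query output
new_rows = [75340, 178108, 86610, 108410, 225122]
new_total = sum(new_rows)  # 673590
old_total = 640099         # figure from the previous analysis, as used above

result = pct_diff_from_baseline(new_total, old_total)
print(round(result, 2))  # 5.23
```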

According to the ONS' [*A Beginner's Guide to UK Geography*](https://geoportal.statistics.gov.uk/datasets/d1f39e20edb940d58307a54d6e1045cd/about), in 2023 "the four districts within the county of Somerset were merged to form Somerset UA".

According to the *ONS' coding and naming policy* and the UK Geography guide, codes starting with E06 refer to unitary authorities and codes starting with E07 refer to non-metropolitan districts. It therefore makes sense that the E07 values have been dropped and a new E06 value has appeared in this more recent dataset. The coding policy explains:

*"Instances must not be coded with, and/or be based on, inbuilt intelligence (for example, alphabetically or hierarchically). This is because any later change (like renaming) that may occur might upset this inbuilt intelligence."*

This also explains why there is no logical pattern to the last two numeric digits of the Somerset UA code (E06000066).
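
To see at a glance which entity types a dataset contains, the code prefix can be mapped to a label. This is a rough sketch covering only the two prefixes described above, with every other prefix deliberately left as `unmapped`. The helper `entity_type` is hypothetical, and `E07000187` is an illustrative E07 code that does not come from this notebook's data.

```python
import pandas as pd

# Only the prefixes discussed in the text; extend after checking the ONS policy
PREFIX_TO_ENTITY = {
    "E06": "unitary authority",
    "E07": "non-metropolitan district",
}


def entity_type(ladcodes: pd.Series) -> pd.Series:
    """Label each code by its three-character prefix."""
    return ladcodes.str[:3].map(PREFIX_TO_ENTITY).fillna("unmapped")


# E07000187 is an assumed example code; the others appear in the outputs here
codes = pd.Series(["E06000022", "E06000066", "E07000187", "W06000024"])
labels = entity_type(codes)
print(labels.tolist())
```

Applied to `df2["ladcode"]`, a `value_counts()` on the result would show whether any E07 rows remain.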

## Checking values for other UAs

I am reusing some of my previous analysis, carried out after the 2021 Census, to compare against my new dataset in a new QA script. Note that this data wasn't reconciled post-Census at this stage, so it may also be helpful to compare against the subsequent adjusted mid-2011 to mid-2022 edition of the data at https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/populationestimatesforukenglandandwalesscotlandandnorthernireland
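
One way such a comparison could work is an outer merge on code and year, which flags codes present in only one dataset and gives a percentage difference where both have a value. The sketch below uses made-up toy codes and populations, and `compare_populations` is a hypothetical helper rather than a function in `LA_PFA_QA`. It casts `year` to integers on both sides in case one dataset holds years as strings.

```python
import pandas as pd

# Toy data only: codes "A", "B", "C" and all populations are invented
new_demo = pd.DataFrame(
    {"ladcode": ["A", "B"], "year": [2014, 2014], "freq": [110, 200]}
)
old_demo = pd.DataFrame(
    {"ladcode": ["A", "C"], "year": ["2014", "2014"], "freq": [100, 50]}
)


def compare_populations(df_new, df_old):
    """Outer-merge two long-format tables and compute the % difference."""
    df_new = df_new.assign(year=df_new["year"].astype(int))
    df_old = df_old.assign(year=df_old["year"].astype(int))
    merged = df_new.merge(
        df_old,
        on=["ladcode", "year"],
        how="outer",
        suffixes=("_new", "_old"),
        indicator=True,  # adds a _merge column: left_only / right_only / both
    )
    merged["pct_diff"] = (
        (merged["freq_new"] - merged["freq_old"]) / merged["freq_old"] * 100
    )
    return merged


comparison = compare_populations(new_demo, old_demo)
print(comparison[["ladcode", "year", "_merge", "pct_diff"]])
```

Rows marked `left_only` or `right_only` are the codes to investigate, such as districts replaced by a new UA code.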

In [None]:
import src.data.LA_PFA_QA as LA_PFA_QA

In [None]:
df_population, df_reconciliation = LA_PFA_QA.load_data()

2025-07-07 16:13:36,145 - INFO - Loaded data from data/raw/MYEB1_detailed_population_estimates_series_UK_(2021_geog21).csv
2025-07-07 16:13:37,158 - INFO - Loaded data from data/raw/MYEB2_detailed_components_of_change_for reconciliation_EW_(2021_geog21).csv


In [None]:
df_population

Unnamed: 0,ladcode21,ladname21,country,sex,age,population_2021
0,E06000001,Hartlepool,E,1,0,446
1,E06000001,Hartlepool,E,1,1,477
2,E06000001,Hartlepool,E,1,2,506
3,E06000001,Hartlepool,E,1,3,464
4,E06000001,Hartlepool,E,1,4,524
...,...,...,...,...,...,...
68063,W06000024,Merthyr Tydfil,W,2,86,85
68064,W06000024,Merthyr Tydfil,W,2,87,54
68065,W06000024,Merthyr Tydfil,W,2,88,50
68066,W06000024,Merthyr Tydfil,W,2,89,50


In [None]:
df_reconciliation

Unnamed: 0,ladcode21,ladname21,country,sex,age,population_2001,population_2002,population_2003,population_2004,population_2005,...,population_2011,population_2012,population_2013,population_2014,population_2015,population_2016,population_2017,population_2018,population_2019,population_2020
0,E06000001,Hartlepool,E,1,0,519,499,513,517,551,...,555,557,509,514,517,511,509,464,491,455
1,E06000001,Hartlepool,E,1,1,550,520,511,508,518,...,584,559,557,507,515,522,526,496,477,489
2,E06000001,Hartlepool,E,1,2,548,558,517,506,513,...,561,575,574,554,516,523,526,525,516,484
3,E06000001,Hartlepool,E,1,3,523,549,554,511,501,...,565,565,586,581,551,531,543,530,542,516
4,E06000001,Hartlepool,E,1,4,589,527,553,574,510,...,546,552,561,591,584,564,532,550,525,539
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60237,W06000024,Merthyr Tydfil,W,2,86,39,43,39,37,51,...,62,57,56,73,85,69,74,74,71,63
60238,W06000024,Merthyr Tydfil,W,2,87,29,30,36,29,29,...,46,55,43,47,63,74,65,64,67,55
60239,W06000024,Merthyr Tydfil,W,2,88,27,22,25,31,24,...,48,38,43,35,40,55,66,63,55,60
60240,W06000024,Merthyr Tydfil,W,2,89,22,24,17,22,23,...,32,44,32,38,28,32,45,54,55,44


In [None]:
df_merged = LA_PFA_QA.combine_population_data(df_population, df_reconciliation)

2025-07-07 16:22:30,590 - INFO - Combining 2021 census population figures with reconciliation data...
2025-07-07 16:22:30,591 - INFO - Preprocessing population data...
2025-07-07 16:22:30,591 - INFO - Renaming columns with regex...
2025-07-07 16:22:30,606 - INFO - Filtering for adult women...
2025-07-07 16:22:30,620 - INFO - Preprocessing population data...
2025-07-07 16:22:30,622 - INFO - Renaming columns with regex...
2025-07-07 16:22:30,657 - INFO - Filtering for adult women...


In [None]:
df_merged

Unnamed: 0,ladcode,laname,country,sex,age,population_2001,population_2002,population_2003,population_2004,population_2005,...,population_2012,population_2013,population_2014,population_2015,population_2016,population_2017,population_2018,population_2019,population_2020,population_2021
0,E06000001,Hartlepool,E,2,18,558,595,645,640,665,...,655,667,628,622,546,548,554,618,521,521
1,E06000001,Hartlepool,E,2,19,490,465,537,580,556,...,517,583,612,575,577,497,461,489,553,452
2,E06000001,Hartlepool,E,2,20,506,465,442,505,549,...,611,499,570,564,556,544,475,454,494,509
3,E06000001,Hartlepool,E,2,21,498,497,450,439,503,...,598,605,478,554,575,552,545,481,465,489
4,E06000001,Hartlepool,E,2,22,467,494,489,468,459,...,558,635,618,504,582,595,578,559,509,502
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24158,W06000024,Merthyr Tydfil,W,2,86,39,43,39,37,51,...,57,56,73,85,69,74,74,71,63,85
24159,W06000024,Merthyr Tydfil,W,2,87,29,30,36,29,29,...,55,43,47,63,74,65,64,67,55,54
24160,W06000024,Merthyr Tydfil,W,2,88,27,22,25,31,24,...,38,43,35,40,55,66,63,55,60,50
24161,W06000024,Merthyr Tydfil,W,2,89,22,24,17,22,23,...,44,32,38,28,32,45,54,55,44,50


In [None]:
df = LA_PFA_QA.process_data(df_merged)
df

2025-07-07 16:26:14,196 - INFO - Melting DataFrame from wide to long format...
2025-07-07 16:26:14,437 - INFO - Cleaning year column...
2025-07-07 16:26:14,745 - INFO - Combining age groups for aggregation...


Unnamed: 0,ladcode,laname,year,freq
0,E06000001,Hartlepool,2001,32246
1,E06000001,Hartlepool,2002,32300
2,E06000001,Hartlepool,2003,32464
3,E06000001,Hartlepool,2004,32710
4,E06000001,Hartlepool,2005,32946
...,...,...,...,...
6946,W06000024,Merthyr Tydfil,2017,22859
6947,W06000024,Merthyr Tydfil,2018,22943
6948,W06000024,Merthyr Tydfil,2019,22976
6949,W06000024,Merthyr Tydfil,2020,23025


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6951 entries, 0 to 6950
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   ladcode  6951 non-null   object  
 1   laname   6951 non-null   category
 2   year     6951 non-null   object  
 3   freq     6951 non-null   int64   
dtypes: category(1), int64(1), object(2)
memory usage: 187.3+ KB
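
The log messages from `process_data` name three steps: melting wide to long, cleaning the year column, and combining ages. The sketch below shows how steps like these might look in pandas on a toy frame. It is not the actual `LA_PFA_QA` implementation: `wide_to_long` is a hypothetical helper and the toy codes and counts are invented. It also casts `year` to an integer, which may be worth doing here since `df.info()` reports `year` as `object`, whereas `df2` is queried with `year == 2014`.

```python
import pandas as pd

# Toy wide-format frame: invented codes, names and counts
wide_demo = pd.DataFrame({
    "ladcode": ["X1", "X1", "X2"],
    "laname": ["Area one", "Area one", "Area two"],
    "age": [18, 19, 18],
    "population_2001": [10, 20, 5],
    "population_2002": [11, 21, 6],
})


def wide_to_long(df):
    """Melt population_YYYY columns, clean the year, and sum over ages."""
    long = df.melt(
        id_vars=["ladcode", "laname", "age"],
        var_name="year",
        value_name="freq",
    )
    # Strip the column prefix and store the year as an integer
    long["year"] = (
        long["year"].str.replace("population_", "", regex=False).astype(int)
    )
    # Collapse single years of age into one total per LA and year
    return long.groupby(["ladcode", "laname", "year"], as_index=False)["freq"].sum()


long_demo = wide_to_long(wide_demo)
print(long_demo)
```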


In [None]:
df = LA_PFA_QA.load_and_process_data()
df

2025-07-08 15:34:43,903 - INFO - Loaded data from data/raw/MYEB1_detailed_population_estimates_series_UK_(2021_geog21).csv
2025-07-08 15:34:44,763 - INFO - Loaded data from data/raw/MYEB2_detailed_components_of_change_for reconciliation_EW_(2021_geog21).csv
2025-07-08 15:34:44,796 - INFO - Combining 2021 census population figures with reconciliation data...
2025-07-08 15:34:44,797 - INFO - Preprocessing population data...
2025-07-08 15:34:44,797 - INFO - Renaming columns with regex...
2025-07-08 15:34:44,828 - INFO - Filtering for adult women...
2025-07-08 15:34:44,876 - INFO - Preprocessing population data...
2025-07-08 15:34:44,876 - INFO - Renaming columns with regex...
2025-07-08 15:34:44,909 - INFO - Filtering for adult women...
2025-07-08 15:34:44,956 - INFO - Melting DataFrame from wide to long format...
2025-07-08 15:34:45,051 - INFO - Cleaning year column...
2025-07-08 15:34:45,382 - INFO - Combining age groups for aggregation...


Unnamed: 0,ladcode,laname,year,freq
0,E06000001,Hartlepool,2001,32246
1,E06000001,Hartlepool,2002,32300
2,E06000001,Hartlepool,2003,32464
3,E06000001,Hartlepool,2004,32710
4,E06000001,Hartlepool,2005,32946
...,...,...,...,...
6946,W06000024,Merthyr Tydfil,2017,22859
6947,W06000024,Merthyr Tydfil,2018,22943
6948,W06000024,Merthyr Tydfil,2019,22976
6949,W06000024,Merthyr Tydfil,2020,23025
