# Women's imprisonment rates
## Data Merging

### Importing pandas library and reading in data

In [7]:
import pandas as pd

#### LA PFA population data

In [8]:
df = pd.read_csv('data/interim/LA_population_female_2001_2021_PFAs_NOT_REBASED.csv')
df

Unnamed: 0,ladcode21,ladname21,year,population,pfa
0,E06000001,Hartlepool,2001,32246,Cleveland
1,E06000002,Middlesbrough,2001,50887,Cleveland
2,E06000003,Redcar and Cleveland,2001,50671,Cleveland
3,E06000004,Stockton-on-Tees,2001,66611,Cleveland
4,E06000005,Darlington,2001,35701,Durham
...,...,...,...,...,...
6925,W06000020,Torfaen,2021,34999,Gwent
6926,W06000021,Monmouthshire,2021,36922,Gwent
6927,W06000022,Newport,2021,60097,Gwent
6928,W06000023,Powys,2021,53829,Dyfed-Powys


#### Reading in the imprisonment figures from the processed CJS court outcomes by PFA dataset and dropping the percentage change column

In [9]:
df2 = pd.read_csv('data/processed/cust_sentences_total_table.csv', index_col = 'pfa')
df2.drop(columns='per_change_2014', inplace=True)

In [10]:
df2.head()

Unnamed: 0_level_0,2014,2015,2016,2017,2018,2019,2020,2021,2022
pfa,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Avon and Somerset,196,165,164,158,148,151,103,103,116
Bedfordshire,69,80,53,53,36,31,23,20,38
Cambridgeshire,91,89,112,115,116,89,78,47,68
Cheshire,169,181,167,172,176,149,123,117,74
Cleveland,91,78,108,152,140,98,55,103,100


#### Melting the dataframe

In [11]:
df3 = df2.melt(var_name="year", value_name="no_imp", ignore_index=False).copy()
df3

Unnamed: 0_level_0,year,no_imp
pfa,Unnamed: 1_level_1,Unnamed: 2_level_1
Avon and Somerset,2014,196
Bedfordshire,2014,69
Cambridgeshire,2014,91
Cheshire,2014,169
Cleveland,2014,91
...,...,...
Warwickshire,2022,41
West Mercia,2022,47
West Midlands,2022,216
West Yorkshire,2022,251
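
As a small illustration of what the melt step does, here is a sketch on a toy table. The frame name `wide` and the three-year subset are assumptions for the example; the values are copied from the first two rows of `df2` shown earlier. With `ignore_index=False` the `pfa` index is kept, and each year column becomes its own row.

```python
import pandas as pd

# Toy wide table: two PFAs, three of the year columns (values from the df2 output)
wide = pd.DataFrame(
    {"2014": [196, 69], "2015": [165, 80], "2016": [164, 53]},
    index=pd.Index(["Avon and Somerset", "Bedfordshire"], name="pfa"),
)

# ignore_index=False keeps the pfa index; each year column turns into rows
long = wide.melt(var_name="year", value_name="no_imp", ignore_index=False)

# The column labels were strings, so the melted year values are strings until cast
long["year"] = long["year"].astype(int)

print(long.sort_values(["pfa", "year"]))
```

Because the year labels come through as strings, a cast to integer is likely to be needed before the years can be compared with integer years in another table.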


#### Sorting by PFA and year

In [12]:
df3.sort_values(by=['pfa', 'year'], inplace=True)
df3

Unnamed: 0_level_0,year,no_imp
pfa,Unnamed: 1_level_1,Unnamed: 2_level_1
Avon and Somerset,2014,196
Avon and Somerset,2015,165
Avon and Somerset,2016,164
Avon and Somerset,2017,158
Avon and Somerset,2018,148
...,...,...
Wiltshire,2018,49
Wiltshire,2019,49
Wiltshire,2020,34
Wiltshire,2021,33


#### Resetting the index

In [13]:
df3.reset_index(drop=False, inplace=True)
df3

Unnamed: 0,pfa,year,no_imp
0,Avon and Somerset,2014,196
1,Avon and Somerset,2015,165
2,Avon and Somerset,2016,164
3,Avon and Somerset,2017,158
4,Avon and Somerset,2018,148
...,...,...,...
373,Wiltshire,2018,49
374,Wiltshire,2019,49
375,Wiltshire,2020,34
376,Wiltshire,2021,33


In [14]:
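# The melted year values come from column labels, so cast them to integers
# to match the integer years in the population data
df3['year'] = df3['year'].astype(int)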

### Filtering LA PFA population data to match time series of CJS PFA imprisonment data

In [15]:
df_1421 = df.query('2014 <= year <= 2021').copy()
df_1421

Unnamed: 0,ladcode21,ladname21,year,population,pfa
4290,E06000001,Hartlepool,2014,35016,Cleveland
4291,E06000002,Middlesbrough,2014,52198,Cleveland
4292,E06000003,Redcar and Cleveland,2014,51559,Cleveland
4293,E06000004,Stockton-on-Tees,2014,73669,Cleveland
4294,E06000005,Darlington,2014,39916,Durham
...,...,...,...,...,...
6925,W06000020,Torfaen,2021,34999,Gwent
6926,W06000021,Monmouthshire,2021,36922,Gwent
6927,W06000022,Newport,2021,60097,Gwent
6928,W06000023,Powys,2021,53829,Dyfed-Powys


Let's filter this new dataset to see which `ladname21` areas are included in Avon and Somerset.

In [16]:
df_1421.query('pfa == "Avon and Somerset" and year == 2014')

Unnamed: 0,ladcode21,ladname21,year,population,pfa
4311,E06000022,Bath and North East Somerset,2014,71450,Avon and Somerset
4312,E06000023,"Bristol, City of",2014,174782,Avon and Somerset
4313,E06000024,North Somerset,2014,79439,Avon and Somerset
4314,E06000025,South Gloucestershire,2014,105335,Avon and Somerset
4480,E07000187,Mendip,2014,42248,Avon and Somerset
4481,E07000188,Sedgemoor,2014,45771,Avon and Somerset
4482,E07000189,South Somerset,2014,64313,Avon and Somerset
4529,E07000246,Somerset West and Taunton,2014,56761,Avon and Somerset


In [17]:
df_2019 = pd.read_csv('../Women\'s PFA imprisonment rates/data/LA_population_female_2001_2019_PFAs_cleansed.csv')
df_2019.query('pfa == "Avon and Somerset" and year == 2014')

Unnamed: 0,ladcode19,name,year,population,pfa
4428,E06000022,Bath and North East Somerset,2014,75367,Avon and Somerset
4429,E06000023,"Bristol, City of",2014,177077,Avon and Somerset
4430,E06000024,North Somerset,2014,86277,Avon and Somerset
4431,E06000025,South Gloucestershire,2014,108608,Avon and Somerset
4605,E07000187,Mendip,2014,45653,Avon and Somerset
4606,E07000188,Sedgemoor,2014,48795,Avon and Somerset
4607,E07000189,South Somerset,2014,67660,Avon and Somerset
4654,E07000246,Somerset West and Taunton,2014,62070,Avon and Somerset


Right, the local areas included within the PFA are the same in both files, but the population numbers differ. Let's circle back to `LA_PFA_matching_2021_NOT_REBASED.ipynb` to see if something has gone awry earlier.

I have now checked directly with the ONS, who confirmed that this is the right dataset to be using. I just need to double check whether this includes the latest 2021 Census mid-year estimate figures, as the reconciliation dataset information note states that the 2021 figures are not the best ones to use here.

Following the check, the 2021 Census population figures hadn't been included, but this has now been rectified and the steps have been re-run.
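
The eyeball comparison of the two Avon and Somerset tables could also be done programmatically. The sketch below is one possible approach, not the method used in this notebook: it aligns the two files on area code and year and computes the difference. The frame names, the `_2021file`/`_2019file` suffixes and the `diff` column are made up for the example, and it assumes the `ladcode19` and `ladcode21` codes are directly comparable for these areas (they are identical in the tables above).

```python
import pandas as pd

# Values copied from the two Avon and Somerset tables (2014, first three areas)
pop_2021 = pd.DataFrame({
    "ladcode21": ["E06000022", "E06000023", "E06000024"],
    "year": [2014, 2014, 2014],
    "population": [71450, 174782, 79439],
})
pop_2019 = pd.DataFrame({
    "ladcode19": ["E06000022", "E06000023", "E06000024"],
    "year": [2014, 2014, 2014],
    "population": [75367, 177077, 86277],
})

# Outer join plus indicator shows any area that appears in only one file
compared = pop_2021.merge(
    pop_2019.rename(columns={"ladcode19": "ladcode21"}),  # assumption: codes match
    on=["ladcode21", "year"],
    how="outer",
    suffixes=("_2021file", "_2019file"),
    indicator=True,
)

# Difference between the two population figures for each area and year
compared["diff"] = compared["population_2021file"] - compared["population_2019file"]
print(compared)
```

Any row where `_merge` is not `both` would point to a change in which areas are included, while a non-zero `diff` points to a change in the population figures themselves.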

In [18]:
df4 = df_1421.groupby(['pfa', 'year'], as_index=False).agg({'population': 'sum'}).copy()
df4

Unnamed: 0,pfa,year,population
0,Avon and Somerset,2014,640099
1,Avon and Somerset,2015,649120
2,Avon and Somerset,2016,656882
3,Avon and Somerset,2017,662282
4,Avon and Somerset,2018,669435
...,...,...,...
331,Wiltshire,2017,276033
332,Wiltshire,2018,277329
333,Wiltshire,2019,277747
334,Wiltshire,2020,280304


In [19]:
df3.columns

Index(['pfa', 'year', 'no_imp'], dtype='object')

In [20]:
df4.columns

Index(['pfa', 'year', 'population'], dtype='object')

In [21]:
df3.dtypes

pfa       object
year       int64
no_imp     int64
dtype: object

In [22]:
df4.dtypes

pfa           object
year           int64
population     int64
dtype: object

In [23]:
df4.reset_index(drop=True, inplace=True)
df4

Unnamed: 0,pfa,year,population
0,Avon and Somerset,2014,640099
1,Avon and Somerset,2015,649120
2,Avon and Somerset,2016,656882
3,Avon and Somerset,2017,662282
4,Avon and Somerset,2018,669435
...,...,...,...
331,Wiltshire,2017,276033
332,Wiltshire,2018,277329
333,Wiltshire,2019,277747
334,Wiltshire,2020,280304


In [24]:
df3

Unnamed: 0,pfa,year,no_imp
0,Avon and Somerset,2014,196
1,Avon and Somerset,2015,165
2,Avon and Somerset,2016,164
3,Avon and Somerset,2017,158
4,Avon and Somerset,2018,148
...,...,...,...
373,Wiltshire,2018,49
374,Wiltshire,2019,49
375,Wiltshire,2020,34
376,Wiltshire,2021,33


#### Right, we have the same issue as we've had previously: the CJS data runs up to 2022, but the ONS population data only runs up to 2021

As a stopgap, the 2021 population figures are reused for 2022.

In [25]:
population_2022 = df4.query('year == 2021').copy()
population_2022.replace({'year': {2021: 2022}}, inplace=True)

In [26]:
population_2022

Unnamed: 0,pfa,year,population
7,Avon and Somerset,2022,684048
15,Bedfordshire,2022,265386
23,Cambridgeshire,2022,346854
31,Cheshire,2022,426291
39,Cleveland,2022,215172
47,Cumbria,2022,200205
55,Derbyshire,2022,412888
63,Devon and Cornwall,2022,704449
71,Dorset,2022,309007
79,Durham,2022,245552
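
For reference, the carry-forward step can be sketched end to end on a toy table. This is an illustrative variant that uses `assign` in place of `replace`; the frame names `pop`, `carried` and `extended` are assumptions for the example, and the values are copied from outputs in this notebook.

```python
import pandas as pd

# Toy stand-in for df4: two PFAs, two years each
pop = pd.DataFrame({
    "pfa": ["Avon and Somerset", "Avon and Somerset", "Essex", "Essex"],
    "year": [2020, 2021, 2020, 2021],
    "population": [676896, 684048, 702210, 704908],
})

# Take the 2021 rows and relabel them as 2022, leaving the population unchanged
carried = pop.loc[pop["year"] == 2021].assign(year=2022)

# Append, then restore the pfa/year ordering and a clean index
extended = (
    pd.concat([pop, carried], ignore_index=True)
    .sort_values(["pfa", "year"])
    .reset_index(drop=True)
)
print(extended)
```

Note that any rate calculated for 2022 from such a table rests on the 2021 population, so it should be revisited once a 2022 estimate is available.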


Now to add this to `df4`, then sort and reset the index.

In [27]:
df5 = pd.concat([df4, population_2022]).sort_values(by=['pfa', 'year']).reset_index(drop=True)
df5

Unnamed: 0,pfa,year,population
0,Avon and Somerset,2014,640099
1,Avon and Somerset,2015,649120
2,Avon and Somerset,2016,656882
3,Avon and Somerset,2017,662282
4,Avon and Somerset,2018,669435
...,...,...,...
373,Wiltshire,2018,277329
374,Wiltshire,2019,277747
375,Wiltshire,2020,280304
376,Wiltshire,2021,290739


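We now have the same number of rows in both dataframes, and both are sorted by `pfa` and `year` with a fresh index, so an index-based `join` should line the rows up. Let's see if we can now perform the merge.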

In [28]:
df6 = df5.join(df3, rsuffix = 'bleh')
df6

Unnamed: 0,pfa,year,population,pfableh,yearbleh,no_imp
0,Avon and Somerset,2014,640099,Avon and Somerset,2014,196
1,Avon and Somerset,2015,649120,Avon and Somerset,2015,165
2,Avon and Somerset,2016,656882,Avon and Somerset,2016,164
3,Avon and Somerset,2017,662282,Avon and Somerset,2017,158
4,Avon and Somerset,2018,669435,Avon and Somerset,2018,148
...,...,...,...,...,...,...
373,Wiltshire,2018,277329,Wiltshire,2018,49
374,Wiltshire,2019,277747,Wiltshire,2019,49
375,Wiltshire,2020,280304,Wiltshire,2020,34
376,Wiltshire,2021,290739,Wiltshire,2021,33


The `pfa` and `year` data appears to match up, which is good. Let's extract a few `pfa` entries and double check this is consistent.

In [29]:
pfa_test = ["Lincolnshire", "Staffordshire", "Cumbria", "West Midlands"]
for pfa in pfa_test:
    print(df6.query('pfa == @pfa'))

              pfa  year  population       pfableh  yearbleh  no_imp
189  Lincolnshire  2014      285633  Lincolnshire      2014      40
190  Lincolnshire  2015      288325  Lincolnshire      2015      52
191  Lincolnshire  2016      291373  Lincolnshire      2016      66
192  Lincolnshire  2017      294133  Lincolnshire      2017      61
193  Lincolnshire  2018      295902  Lincolnshire      2018      73
194  Lincolnshire  2019      298479  Lincolnshire      2019      65
195  Lincolnshire  2020      300527  Lincolnshire      2020      31
196  Lincolnshire  2021      303133  Lincolnshire      2021      43
197  Lincolnshire  2022      303133  Lincolnshire      2022      46
               pfa  year  population        pfableh  yearbleh  no_imp
288  Staffordshire  2014      436569  Staffordshire      2014     124
289  Staffordshire  2015      438671  Staffordshire      2015     126
290  Staffordshire  2016      442355  Staffordshire      2016     113
291  Staffordshire  2017      445389  St
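
Spot-checking a handful of PFAs can be complemented by a check over every row. The following is a minimal sketch, assuming a joined frame with the same column names as `df6`; the toy frame `joined` holds two Lincolnshire rows copied from the output above.

```python
import pandas as pd

# Toy stand-in for the joined frame, with the suffixed duplicate key columns
joined = pd.DataFrame({
    "pfa": ["Lincolnshire", "Lincolnshire"],
    "year": [2021, 2022],
    "population": [303133, 303133],
    "pfableh": ["Lincolnshire", "Lincolnshire"],
    "yearbleh": [2021, 2022],
    "no_imp": [43, 46],
})

# A row is consistent only if both key columns agree with their duplicates
keys_match = (joined["pfa"] == joined["pfableh"]) & (joined["year"] == joined["yearbleh"])

# Any rows listed here would indicate a misaligned join
mismatches = joined.loc[~keys_match]
print(f"{len(mismatches)} mismatched rows")
```

Running the same comparison on the full frame before dropping `pfableh` and `yearbleh` would confirm the alignment for all PFAs, not only the four printed.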

Okay, these are looking encouraging. Let's move on to dropping the columns we don't need and saving this out.

In [30]:
df6.drop(columns=['pfableh', 'yearbleh'], inplace=True)
df6

Unnamed: 0,pfa,year,population,no_imp
0,Avon and Somerset,2014,640099,196
1,Avon and Somerset,2015,649120,165
2,Avon and Somerset,2016,656882,164
3,Avon and Somerset,2017,662282,158
4,Avon and Somerset,2018,669435,148
...,...,...,...,...
373,Wiltshire,2018,277329,49
374,Wiltshire,2019,277747,49
375,Wiltshire,2020,280304,34
376,Wiltshire,2021,290739,33


In [31]:
df6.query('pfa == "Essex"')

Unnamed: 0,pfa,year,population,no_imp
99,Essex,2014,674460,198
100,Essex,2015,680328,162
101,Essex,2016,686461,176
102,Essex,2017,691254,176
103,Essex,2018,694665,181
104,Essex,2019,699717,155
105,Essex,2020,702210,91
106,Essex,2021,704908,76
107,Essex,2022,704908,92


In [32]:
df6.query('pfa == "Avon and Somerset"')

Unnamed: 0,pfa,year,population,no_imp
0,Avon and Somerset,2014,640099,196
1,Avon and Somerset,2015,649120,165
2,Avon and Somerset,2016,656882,164
3,Avon and Somerset,2017,662282,158
4,Avon and Somerset,2018,669435,148
5,Avon and Somerset,2019,671756,151
6,Avon and Somerset,2020,676896,103
7,Avon and Somerset,2021,684048,103
8,Avon and Somerset,2022,684048,116


In [33]:
df6.to_csv('data/interim/merged_rate_pop_2014-2022_NOT_REBASED.csv', index=False)