# Setting up the project

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

In [5]:
!pip install odfpy

Defaulting to user installation because normal site-packages is not writeable
Collecting odfpy
  Downloading odfpy-1.4.1.tar.gz (717 kB)
Building wheels for collected packages: odfpy
  Building wheel for odfpy (setup.py) ... done
  Created wheel for odfpy: filename=odfpy-1.4.1-py2.py3-none-any.whl size=160691 sha256=5e2e6d116abc955915a005b199e452be31ae6ad661f245e62ab19409ecb33ebc
  Stored in directory: /home/nimenides/.cache/pip/wheels/ea/af/da/2bdd7308f7b334429a558df1e36d31864cd19c07ede92ddf0e
Successfully built odfpy
Installing collected packages: odfpy
Successfully installed odfpy-1.4.1


# Data context
This data was published on 16 December 2021, together with a list of fields and their descriptions. CED stands for "conducted energy device", the formal term for a stun gun. The data covers police use of force in England and Wales from April 2020 to March 2021. Earlier releases were marked as "experimental", meaning that recording practices were still being refined and the data was possibly unreliable.

Data is not available for periods before April 2017, which is when recording it became a legal requirement for police forces.

## Limitations
The data does not represent **all** police use of force in England and Wales. The recorded numbers would be expected to increase as the recording of data improves (although from April 2020 onwards, the data is no longer marked as "experimental"). The data also originates from the reporting police officer, who reports *observed* characteristics such as ethnicity. In addition, individual police forces use their own recording systems and conduct their own quality assurance processes, so data quality can differ across the national dataset.

If multiple officers use force in an incident, each officer who used force must complete one use of force report per individual, detailing their *own* use of force. This means that the number of incidents reported doesn't tell us how many individual people experienced police use of force.

Not all police recording systems allow officers to report multiple reasons, impact factors and outcomes for an incident.

Injuries to individuals may be reported even if the injuries weren't caused by the officers themselves. This means that the number of injury incidents doesn't precisely equate to the number of individual people injured or killed by officers.

While the data does record mental and physical health conditions, some conditions such as autism or learning disabilities can be "invisible" or hard to observe, so the recording of these conditions may not be accurate.

# Data cleaning

## A glance at the data

For context, I split the original .xlsx file into fields.csv and police-use-of-force-statistics.csv, the latter containing the actual data. I did this to speed up data retrieval.
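
The split itself isn't shown in this notebook, so here is a minimal sketch of how it could be done. The helper `sheets_to_csv`, the workbook file name in the comment and the toy `demo_sheets` dict are all hypothetical. The `Unnamed: 0` column in the outputs below suggests the CSVs were probably written with the row index included; passing `index=False` avoids that.

```python
import tempfile
from pathlib import Path

import pandas as pd


def sheets_to_csv(sheets, out_dir):
    """Write each DataFrame in `sheets` (a name -> DataFrame dict) to <out_dir>/<name>.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, frame in sheets.items():
        target = out_dir / f"{name}.csv"
        # index=False keeps the row index out of the file, so reading it back
        # does not produce an extra "Unnamed: 0" column.
        frame.to_csv(target, index=False)
        written.append(target)
    return written


# With the real workbook, the dict would come from something like
#   sheets = pd.read_excel("workbook.xlsx", sheet_name=None)  # hypothetical file name
# Here a tiny hand-made dict stands in for it (assumed data, not the real sheets).
demo_sheets = {
    "fields": pd.DataFrame(
        {"Field": ["dog_bite"], "Description": ["Tactic used: dog bite"]}
    ),
    "police-use-of-force-statistics": pd.DataFrame({"dog_bite": ["no", "yes"]}),
}

with tempfile.TemporaryDirectory() as tmp:
    paths = sheets_to_csv(demo_sheets, tmp)
    round_trip = pd.read_csv(paths[0])  # read fields.csv back in
```

For CSVs that already contain the index column, `pd.read_csv(path, index_col=0)` is another way to stop it appearing as `Unnamed: 0`.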

In [20]:
fields_df = pd.read_csv('fields.csv')
fields_df

Unnamed: 0,Field,Description
0,location_street_highway,Incident location: street/highway
1,location_public_transport,Incident location: public transport
2,location_retail_premises,Incident location: retail premises
3,location_open_ground,"Incident location: open ground (e.g. park, car..."
4,location_licensed_premises,Incident location: licensed premises
...,...,...
79,firearms_not_stated,Tactic used: firearms not stated
80,other_improvised,Tactic used: other/improvised
81,dog_deployed,Tactic used: dog deployed
82,dog_bite,Tactic used: dog bite


In [33]:
police_df = pd.read_csv('police-use-of-force-statistics.csv')
police_df.head()

Unnamed: 0,location_street_highway,location_public_transport,location_retail_premises,location_open_ground,location_licensed_premises,location_sports_event_stadia,location_hospital_a_and_e,location_mental_health_setting,location_police_vehicle_w_cage,location_police_vehicle_wo_cage,...,aep_drawn,aep_used,aep_not_stated,firearms_aimed,firearms_fired,firearms_not_stated,other_improvised,dog_deployed,dog_bite,dog_not_stated
0,yes,no,no,no,no,no,no,no,no,no,...,no,no,no,no,no,no,no,no,no,no
1,yes,no,no,no,no,no,no,no,no,no,...,no,no,no,no,no,no,no,no,no,no
2,no,no,yes,no,no,no,no,no,no,no,...,no,no,no,no,no,no,no,no,no,no
3,no,no,no,no,no,no,no,no,no,no,...,no,no,no,no,no,no,no,no,no,no
4,no,no,no,yes,no,no,no,no,no,no,...,no,no,no,no,no,no,no,no,no,no


In [26]:
police_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 562280 entries, 0 to 562279
Data columns (total 84 columns):
 #   Column                                    Non-Null Count   Dtype 
---  ------                                    --------------   ----- 
 0   location_street_highway                   562280 non-null  object
 1   location_public_transport                 562280 non-null  object
 2   location_retail_premises                  562280 non-null  object
 3   location_open_ground                      562280 non-null  object
 4   location_licensed_premises                562280 non-null  object
 5   location_sports_event_stadia              562280 non-null  object
 6   location_hospital_a_and_e                 562280 non-null  object
 7   location_mental_health_setting            562280 non-null  object
 8   location_police_vehicle_w_cage            562280 non-null  object
 9   location_police_vehicle_wo_cage           562280 non-null  object
 10  location_dwelling               

At a glance, there are many columns (84), grouped under prefixes such as "location", and most look boolean in nature. We can probably reduce memory usage considerably by converting "yes" and "no" responses to True and False, so that these columns have a boolean dtype instead of "object".
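
As a rough sketch of that conversion, the snippet below maps "yes"/"no" to True/False on a toy frame and compares memory before and after. The frame `toy` and the helper `to_nullable_bool` are made up for illustration. I'm assuming pandas' nullable `"boolean"` dtype is acceptable here, so that any other value (for example "not_stated") becomes `<NA>` rather than being forced to True or False.

```python
import pandas as pd

# Toy stand-in for police_df (assumed data); the real frame has 562,280 rows.
toy = pd.DataFrame({
    "location_street_highway": ["yes", "no", "not_stated", "no"] * 1000,
    "baton_drawn": ["no", "no", "yes", "no"] * 1000,
})


def to_nullable_bool(series):
    """Map 'yes'/'no' to True/False; any other value becomes <NA>."""
    return series.map({"yes": True, "no": False}).astype("boolean")


converted = pd.DataFrame({col: to_nullable_bool(toy[col]) for col in toy.columns})

# deep=True counts the Python string objects, not just the pointers to them.
before = toy.memory_usage(deep=True).sum()
after = converted.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```

Whether treating "not_stated" as missing is appropriate depends on the analysis; keeping it as its own level in a `category` dtype would be an alternative.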

## Imputing null values

In [28]:
police_df.isnull().sum().sum()

18

In [38]:
police_df.columns[police_df.isnull().sum() > 0]

Index(['ced_highest_use'], dtype='object')

In [44]:
police_df['ced_highest_use'].value_counts()

not_applicable       527833
red_dot               17353
drawn                  7624
aimed                  3428
fired                  3208
not_stated             2432
arced                   308
drive_stun               64
angled_drive_stun        12
Name: ced_highest_use, dtype: int64

"ced_highest_use" is the only column with "NaN" values, and it already has a "not_applicable" value, so we can fill the nulls with that.

In [45]:
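# Assign back instead of inplace=True on a column selection, which newer pandas versions warn about
police_df['ced_highest_use'] = police_df['ced_highest_use'].fillna('not_applicable')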

In [46]:
police_df.isnull().sum().sum()

0
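
One way to double-check a fill like this is to confirm that the "not_applicable" count grew by exactly the number of nulls. In the real data the count goes from 527833 before the fill to 527851 after it, a difference of 18, which matches the null count above. Below is a minimal sketch of the same check on a made-up series (`ced` is a toy stand-in, not the real column).

```python
import pandas as pd

# Toy stand-in for police_df['ced_highest_use'] with two missing entries.
ced = pd.Series(["not_applicable", "red_dot", None, "drawn", None])

n_missing = int(ced.isna().sum())
count_before = int(ced.value_counts().get("not_applicable", 0))

filled = ced.fillna("not_applicable")
count_after = int(filled.value_counts().get("not_applicable", 0))

# Every missing entry should now be "not_applicable", and nothing else should change.
assert count_after - count_before == n_missing
assert filled.isna().sum() == 0
```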

In [49]:
for col in police_df.columns:
    print(police_df[col].value_counts())

no            303418
yes           258826
not_stated        36
Name: location_street_highway, dtype: int64
no            548040
not_stated     11423
yes             2817
Name: location_public_transport, dtype: int64
no            553216
yes             9028
not_stated        36
Name: location_retail_premises, dtype: int64
no            524592
yes            37652
not_stated        36
Name: location_open_ground, dtype: int64
no            559346
yes             2898
not_stated        36
Name: location_licensed_premises, dtype: int64
no            559891
not_stated      2250
yes              139
Name: location_sports_event_stadia, dtype: int64
no            544076
yes            18168
not_stated        36
Name: location_hospital_a_and_e, dtype: int64
no            556001
yes             6243
not_stated        36
Name: location_mental_health_setting, dtype: int64
no            552216
yes            10028
not_stated        36
Name: location_police_vehicle_w_cage, dtype: int64
no           

no     558663
yes      3617
Name: baton_drawn, dtype: int64
no     560332
yes      1948
Name: baton_used, dtype: int64
no     562209
yes        71
Name: baton_not_stated, dtype: int64
no     554269
yes      8011
Name: grouped_irritant_drawn, dtype: int64
no     552188
yes     10092
Name: grouped_irritant_used, dtype: int64
no     561857
yes       423
Name: irritant_spray_not_stated, dtype: int64
no     553255
yes      9025
Name: spit_guard, dtype: int64
no     561222
yes      1058
Name: shield, dtype: int64
no     527851
yes     34429
Name: ced, dtype: int64
not_applicable       527851
red_dot               17353
drawn                  7624
aimed                  3428
fired                  3208
not_stated             2432
arced                   308
drive_stun               64
angled_drive_stun        12
Name: ced_highest_use, dtype: int64
no     561395
yes       885
Name: aep_drawn, dtype: int64
no     562228
yes        52
Name: aep_used, dtype: int64
no     562279
yes         1
Name
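
Printing `value_counts()` for all 84 columns produces a lot of output. A more compact option, sketched below on a toy frame, is to group columns by the set of distinct values they contain, which separates strictly yes/no columns from those that also contain "not_stated" and from multi-category columns like "ced_highest_use". The helper `group_columns_by_values` and the frame `toy` are hypothetical.

```python
from collections import defaultdict

import pandas as pd

# Toy stand-in for police_df (assumed data).
toy = pd.DataFrame({
    "location_street_highway": ["yes", "no", "not_stated"],
    "location_retail_premises": ["no", "yes", "not_stated"],
    "baton_drawn": ["no", "yes", "no"],
    "ced_highest_use": ["not_applicable", "red_dot", "drawn"],
})


def group_columns_by_values(df):
    """Return {sorted tuple of distinct values: [columns containing exactly those values]}."""
    groups = defaultdict(list)
    for col in df.columns:
        key = tuple(sorted(df[col].dropna().unique()))
        groups[key].append(col)
    return dict(groups)


groups = group_columns_by_values(toy)
for values, cols in groups.items():
    print(values, "->", cols)
```

On the real frame this would be called as `group_columns_by_values(police_df)`.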

In [50]:
location_cols = [col for col in police_df.columns if 'location' in col]
location_cols

['location_street_highway',
 'location_public_transport',
 'location_retail_premises',
 'location_open_ground',
 'location_licensed_premises',
 'location_sports_event_stadia',
 'location_hospital_a_and_e',
 'location_mental_health_setting',
 'location_police_vehicle_w_cage',
 'location_police_vehicle_wo_cage',
 'location_dwelling',
 'location_police_station_ex_custody_block',
 'location_custody_block',
 'location_ambulance',
 'location_other']