
## Research Question: "Predict whether a stop and search will conclude in police action".

#### In this notebook we attempt to answer the research question using the London stop-and-search data set (https://www.kaggle.com/sohier/london-police-records?select=london-stop-and-search.csv). First, we clean null values out of the data.

We import the libraries.

In [33]:
import pandas as pd

We import the dataset, "london-stop-and-search.csv", retrieved from Kaggle (https://www.kaggle.com/sohier/london-police-records?select=london-stop-and-search.csv). After retrieval, it was cleaned by removing three columns ("part of police operation", "police operation", and "self-defined ethnicity").

In [36]:
data = pd.read_csv("data.csv")
print(data.info())

# Programmatically remove "Outcome linked to object of search" and "Removal of more than just outer clothing".
# Reason: both have too many nulls. The former is also only known after the police action, so it is irrelevant to the research question.
# TODO: discuss whether the CSV file needs to be modified to reflect this.
del data["Outcome linked to object of search"]
del data["Removal of more than just outer clothing"]
print("\n")
print(data.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 302623 entries, 0 to 302622
Data columns (total 12 columns):
 #   Column                                    Non-Null Count   Dtype  
---  ------                                    --------------   -----  
 0   Type                                      302623 non-null  object 
 1   Date                                      302623 non-null  object 
 2   Latitude                                  110615 non-null  float64
 3   Longitude                                 110615 non-null  float64
 4   Gender                                    299453 non-null  object 
 5   Age range                                 288579 non-null  object 
 6   Officer-defined ethnicity                 298958 non-null  object 
 7   Legislation                               302623 non-null  object 
 8   Object of search                          216156 non-null  object 
 9   Outcome linked to object of search        1206 non-null    object 
 10  Removal of more than

The columns "Type", "Date", "Gender", "Age range", "Officer-defined ethnicity", "Legislation", "Object of search", and "Outcome" have dtype object, so their null values cannot be replaced by a mean or a median. Latitude and Longitude, however, are floats, so we replace their null values with the column median.
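
The dtype split can be checked directly before choosing how to fill nulls. The following is a minimal sketch on a made-up toy frame, not the real data: the values and the names `toy`, `numeric_cols`, `non_numeric_cols`, and `filled` are illustrative assumptions. It separates numeric from non-numeric columns, counts nulls per column, and fills only the numeric nulls with the column median.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the real dataset.
toy = pd.DataFrame({
    "Latitude": [51.50, None, 51.52, 51.60],
    "Gender": ["Male", None, "Female", "Male"],
})

# Numeric columns have a median; the remaining columns do not.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Number of missing values in each column.
null_counts = toy.isna().sum()

# Fill nulls in the numeric columns with each column's median.
filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())

print(null_counts)
print(filled)
```

The non-numeric column keeps its null here, which is why those rows have to be handled separately.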

In [41]:
# Fill latitude and longitude nulls with the column median

# data["Latitude"] = missing_median(df=data, name="Latitude")
# data["Longitude"] = missing_median(df=data, name="Longitude")
lat_median = data["Latitude"].median()
lon_median = data["Longitude"].median()

data["Latitude"] = data["Latitude"].fillna(lat_median)
data["Longitude"] = data["Longitude"].fillna(lon_median)

print(data.info())

# Some "Age range" entries have an inexplicable value of "Oct-17"; those are set to None so they are removed as well.
# Reference for dictionary idea to replace values: https://stackoverflow.com/questions/17114904/python-pandas-replacing-strings-in-dataframe-with-numbers
oct17_to_None = {"Oct-17": None}
data = data.applymap(lambda s: oct17_to_None.get(s) if s in oct17_to_None else s)



<class 'pandas.core.frame.DataFrame'>
Int64Index: 205461 entries, 0 to 302622
Data columns (total 10 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Type                       205461 non-null  object 
 1   Date                       205461 non-null  object 
 2   Latitude                   205461 non-null  float64
 3   Longitude                  205461 non-null  float64
 4   Gender                     205461 non-null  object 
 5   Age range                  205461 non-null  object 
 6   Officer-defined ethnicity  205461 non-null  object 
 7   Legislation                205461 non-null  object 
 8   Object of search           205461 non-null  object 
 9   Outcome                    205461 non-null  object 
dtypes: float64(2), object(8)
memory usage: 17.2+ MB
None

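The "Oct-17" handling above scans every cell of the frame. A narrower variant, sketched here on a made-up toy frame (values and variable names are illustrative assumptions), touches only the "Age range" column. Option 2 rests on an assumption we have not verified against the source data: that "Oct-17" is the age band "10-17" after a spreadsheet converted it to a date.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the real dataset.
toy = pd.DataFrame({
    "Age range": ["18-24", "Oct-17", "over 34", "Oct-17"],
    "Gender": ["Male", "Female", "Male", "Male"],
})

# Option 1: blank the value so that a later dropna() removes those rows.
blanked = toy.copy()
blanked["Age range"] = blanked["Age range"].mask(blanked["Age range"] == "Oct-17")

# Option 2 (ASSUMPTION: "Oct-17" is a mangled "10-17"): restore the label instead.
restored = toy.copy()
restored["Age range"] = restored["Age range"].replace({"Oct-17": "10-17"})

print(blanked.dropna())
print(restored)
```

Option 1 matches what the notebook does, since the affected rows disappear at the dropna step. Option 2 would keep them, so the row counts in the outputs would differ.
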

In [42]:
# For the other columns, we need to drop the rows with null values
data = data.dropna()
print(data.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165651 entries, 0 to 302621
Data columns (total 10 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Type                       165651 non-null  object 
 1   Date                       165651 non-null  object 
 2   Latitude                   165651 non-null  float64
 3   Longitude                  165651 non-null  float64
 4   Gender                     165651 non-null  object 
 5   Age range                  165651 non-null  object 
 6   Officer-defined ethnicity  165651 non-null  object 
 7   Legislation                165651 non-null  object 
 8   Object of search           165651 non-null  object 
 9   Outcome                    165651 non-null  object 
dtypes: float64(2), object(8)
memory usage: 13.9+ MB
None

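Because dropna() discards every row that has at least one null, it can help to see which columns drive the loss. A minimal sketch on a made-up toy frame (values and variable names are illustrative assumptions, not results from the real data):

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the real dataset.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Age range": ["18-24", "25-34", None, "over 34"],
    "Object of search": ["Controlled drugs", "Stolen goods", "Controlled drugs", None],
})

# Share of missing values in each column.
missing_share = toy.isna().mean()

# Dropping on all columns removes any row with a null anywhere.
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

# Dropping on a subset only removes rows that are null in those columns.
cleaned_gender_only = toy.dropna(subset=["Gender"])

print(missing_share)
print(rows_dropped, len(cleaned_gender_only))
```

In this toy case each column is only 25% null, yet dropping on all columns removes three of the four rows, because the nulls fall in different rows.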