
## Research Question: "Predict whether a stop and search will conclude in police action".

#### In this notebook we attempt to answer the research question using the data set (https://www.kaggle.com/sohier/london-police-records?select=london-stop-and-search.csv). First, we clean the data of null values.

We import the libraries.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder


We import the dataset. It is "london-stop-and-search.csv", retrieved from Kaggle (https://www.kaggle.com/sohier/london-police-records?select=london-stop-and-search.csv), with three columns removed beforehand ("part of police operation", "police operation", and "self-defined ethnicity").

In [3]:
data = pd.read_csv("data.csv")

# Programmatically remove the columns:
#   - "Outcome linked to object of search"
#   - "Removal of more than just outer clothing"

# Reason: both have too many nulls. The former is also only known after the police action has taken place, so it is irrelevant to the research question.
# TODO: discuss whether the csv file needs to be modified to reflect this.

# We delete the columns using the "del" keyword
del data["Outcome linked to object of search"]
del data["Removal of more than just outer clothing"]

# We print information about our data object (loaded from the csv file): the number of non-empty ("non-null") values for each column, as well as the type of each column.
print(data.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 302623 entries, 0 to 302622
Data columns (total 10 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Type                       302623 non-null  object 
 1   Date                       302623 non-null  object 
 2   Latitude                   110615 non-null  float64
 3   Longitude                  110615 non-null  float64
 4   Gender                     299453 non-null  object 
 5   Age range                  288579 non-null  object 
 6   Officer-defined ethnicity  298958 non-null  object 
 7   Legislation                302623 non-null  object 
 8   Object of search           216156 non-null  object 
 9   Outcome                    302623 non-null  object 
dtypes: float64(2), object(8)
memory usage: 23.1+ MB
None


Columns for this dataset are either numerical (e.g. "Latitude") or categorical (e.g. "Gender", describing a category of things). 

For categorical columns, we can't calculate a median (a series of string values has no meaningful middle value), but we can do so for our numerical columns. The two numerical columns are Latitude and Longitude, so we fill their null values with the column median.

Furthermore, we can convert all values under "Date" to type DateTime using pd.to_datetime.

Finally, we drop all rows with empty values for the remaining columns using .dropna(), as we are unable to replace them with a median.
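The split between numerical and categorical columns described above can be checked in code rather than by eye. The following is a minimal sketch on a made-up three-row frame (the `toy` frame and its values are assumptions for illustration, not rows from the real dataset); it uses `select_dtypes` to separate the two kinds of column and computes medians only where they are defined.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "Latitude": [51.50, np.nan, 51.54],
    "Longitude": [-0.12, -0.10, np.nan],
    "Gender": ["Male", "Female", None],
})

# Split the columns by dtype: numbers vs. everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Medians exist only for the numeric columns; missing values are skipped.
medians = toy[numeric_cols].median()

print(numeric_cols)
print(non_numeric_cols)
print(medians)
```

On the real frame, the same two `select_dtypes` calls should single out "Latitude" and "Longitude" as the only columns for which median imputation applies.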

In [4]:
#Convert latitude and longitude nulls to median
lat_median = data["Latitude"].median()
lon_median = data["Longitude"].median()

data["Latitude"] = data["Latitude"].fillna(lat_median)
data["Longitude"] = data["Longitude"].fillna(lon_median)

#Change the "Date" column to type DateTime
data['Date'] = pd.to_datetime(data['Date'])

# Some of the values for "Age range" read "Oct-17", most likely the "10-17" age band mis-converted to a date by spreadsheet software. Those rows will be removed as well: we replace the value with None for the time being, so that it is swept up alongside any null values when we call dropna().
# Reference for the dictionary idea to replace values: https://stackoverflow.com/questions/17114904/python-pandas-replacing-strings-in-dataframe-with-numbers
oct17_to_None = {"Oct-17": None}
data = data.applymap(lambda s: oct17_to_None.get(s) if s in oct17_to_None else s)

#For the other columns, we'll need to drop the null values 
data = data.dropna()

#Let's print data.info again to see the changes
print(data.info())

#Notice: for each column, the number of non-null rows is equal to the number of rows in the table (165,651). Hence, no column has any empty values left.
#Notice: the type of Date is now datetime64[ns, UTC]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165651 entries, 0 to 302621
Data columns (total 10 columns):
 #   Column                     Non-Null Count   Dtype              
---  ------                     --------------   -----              
 0   Type                       165651 non-null  object             
 1   Date                       165651 non-null  datetime64[ns, UTC]
 2   Latitude                   165651 non-null  float64            
 3   Longitude                  165651 non-null  float64            
 4   Gender                     165651 non-null  object             
 5   Age range                  165651 non-null  object             
 6   Officer-defined ethnicity  165651 non-null  object             
 7   Legislation                165651 non-null  object             
 8   Object of search           165651 non-null  object             
 9   Outcome                    165651 non-null  object             
dtypes: datetime64[ns, UTC](1), float64(2), object(7)
memory 
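The cleaning cell above visits every cell of the frame with `applymap` (deprecated since pandas 2.1 in favour of `DataFrame.map`) even though "Oct-17" only appears under "Age range". A column-targeted alternative might look like the sketch below; the `toy` frame is an assumption made up for illustration, and `Series.mask` is a swapped-in technique, not the notebook's own method.

```python
import pandas as pd

# Toy frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "Age range": ["18-24", "Oct-17", None, "over 34", "25-34"],
    "Gender": ["Male", "Female", "Male", None, "Female"],
})

# Turn "Oct-17" into a missing value, touching only the affected column.
age = toy["Age range"]
toy["Age range"] = age.mask(age.eq("Oct-17"))

# Drop incomplete rows and report how many were lost.
rows_before = len(toy)
cleaned = toy.dropna()
rows_after = len(cleaned)
print(f"Dropped {rows_before - rows_after} of {rows_before} rows")
```

Comparing the row counts before and after `dropna()` in this way also makes it easy to see how much of the dataset the cleaning step discards.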

## Encoding data 

Most sklearn functions expect numerical input rather than string values. Unfortunately, our dataset is filled with strings, so we need to encode them into numerical types. Below, we build on the method outlined in ex5Part2 (Lab 5 of the Introduction to AI module), using LabelEncoder from sklearn to automatically encode our strings into numerals.

Our process will proceed as follows:

1. We build a dictionary to store an instance of LabelEncoder for each categorical column (the seven labels assigned as a list to the variable "categorical_cols" below). We exclude Date, Latitude, and Longitude because those are already of datetime or numerical type and don't need to be encoded.
2. We iterate over the categorical columns. For each one, we instantiate a corresponding encoder of type LabelEncoder() and fit it to the values in that column. The goal is to be able to access the specific encoder used for a particular column, which lets us see the original string values (using encoder.classes_) and map them to the encoded numbers. This benefit will be made clearer in the 'Encoder Mapper' below.
3. We copy our dataset "data" into a variable "data_encoded". We will replace the categorical values of "data_encoded" with the encoded numerals and leave the cleaned dataset of strings ("data") untouched, as it may prove useful later.
4. Finally, we iterate over the categorical columns in "data_encoded" and transform all strings to the appropriate numeral, using the corresponding encoder in the "encoders" dictionary, as shown in Lab 5.
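Before running the steps above on the real data, it may help to see them on a single toy column. This is a minimal sketch, assuming a made-up four-row "Gender" column (the names `toy`, `toy_encoder`, and `toy_encoded` are hypothetical); it shows that LabelEncoder sorts the distinct strings, numbers them from 0, and can map the numbers back.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for one categorical column of the dataset.
toy = pd.DataFrame({"Gender": ["Male", "Female", "Other", "Male"]})

# Fit one encoder to the column; classes_ holds the sorted distinct strings.
toy_encoder = LabelEncoder().fit(toy["Gender"])

# Encode a copy so the original strings stay available.
toy_encoded = toy.copy()
toy_encoded["Gender"] = toy_encoder.transform(toy["Gender"])

print(list(toy_encoder.classes_))                  # ['Female', 'Male', 'Other']
print(toy_encoded["Gender"].tolist())              # [1, 0, 2, 1]
print(list(toy_encoder.inverse_transform([1, 0]))) # ['Male', 'Female']
```

Keeping the fitted encoder around is what makes `inverse_transform` possible later, which is the reason for storing one encoder per column in a dictionary.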

In [7]:
#Build a dictionary of Encoders
encoders = {}

#List the categorical cols
categorical_cols = ["Type", "Gender", "Age range", "Officer-defined ethnicity", "Legislation", "Object of search", "Outcome"]

#Build an encoder for each categorical_col, and fit it to the values under that column
for label in categorical_cols:
    encoders[f"{label} Encoder"] = LabelEncoder()
    encoders[f"{label} Encoder"].fit(data[label])

#Our dictionary of encoders now has seven encoders, fitted to the categorical values in "data"
print(encoders)

{'Type Encoder': LabelEncoder(), 'Gender Encoder': LabelEncoder(), 'Age range Encoder': LabelEncoder(), 'Officer-defined ethnicity Encoder': LabelEncoder(), 'Legislation Encoder': LabelEncoder(), 'Object of search Encoder': LabelEncoder(), 'Outcome Encoder': LabelEncoder()}


In [10]:
#Encoder Mapper
# Using the seven encoders in the "encoders" dictionary, we print, for each column, every original string value alongside the numeral it will be encoded to (so a cell's value can be compared before and after the encoding).
for label in categorical_cols:
    print(f"Encoder dictionary for the column '{label}':")
    encoder = encoders[f"{label} Encoder"]
    categories = encoder.classes_
    # transform() accepts the whole array, so one call encodes every category
    numerals = encoder.transform(categories)
    for category, numeral in zip(categories, numerals):
        print(f"'{category}' will be encoded to '{numeral}'")
    print("\n")

Encoder dictionary for the column 'Type':
'Person and Vehicle search' will be encoded to '0'
'Person search' will be encoded to '1'
'Vehicle search' will be encoded to '2'


Encoder dictionary for the column 'Gender':
'Female' will be encoded to '0'
'Male' will be encoded to '1'
'Other' will be encoded to '2'


Encoder dictionary for the column 'Age range':
'18-24' will be encoded to '0'
'25-34' will be encoded to '1'
'over 34' will be encoded to '2'
'under 10' will be encoded to '3'


Encoder dictionary for the column 'Officer-defined ethnicity':
'Asian' will be encoded to '0'
'Black' will be encoded to '1'
'Mixed' will be encoded to '2'
'Other' will be encoded to '3'
'White' will be encoded to '4'


Encoder dictionary for the column 'Legislation':
'Criminal Justice Act 1988 (section 139B)' will be encoded to '0'
'Criminal Justice and Public Order Act 1994 (section 60)' will be encoded to '1'
'Firearms Act 1968 (section 47)' will be encoded to '2'
'Misuse of Drugs Act 1971 (section 23
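The printed mapping can also be kept as a plain dictionary for later lookups. Because a fitted LabelEncoder assigns code `i` to `classes_[i]`, enumerating `classes_` is enough. The sketch below fits a throwaway encoder on the three "Type" values listed above; the names `type_encoder` and `type_mapping` are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Throwaway encoder fitted on the three "Type" values shown in the output above.
type_encoder = LabelEncoder().fit(
    ["Person search", "Vehicle search", "Person and Vehicle search"]
)

# classes_[i] is encoded as i, so enumerate() yields the full mapping.
type_mapping = {str(category): code for code, category in enumerate(type_encoder.classes_)}

print(type_mapping)
```

The same comprehension could be applied to each entry of the "encoders" dictionary to obtain one lookup table per column.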

In [11]:
#We copy the "data" variable into "data_encoded", such that changing one won't impact the other
data_encoded = data.copy() 

#We perform the encoding on "data_encoded", reusing the encoders fitted above (there is no need to fit them again)
for label in categorical_cols:
    data_encoded[label] = encoders[f"{label} Encoder"].transform(data[label])

#Print data_encoded... notice all categorical values have been turned into numerals!
data_encoded.head()

| | Type | Date | Latitude | Longitude | Gender | Age range | Officer-defined ethnicity | Legislation | Object of search | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2015-03-02 16:40:00+00:00 | 51.512286 | -0.114491 | 1 | 1 | 0 | 4 | 8 | 6 |
| 1 | 1 | 2015-03-02 16:40:00+00:00 | 51.512286 | -0.114491 | 1 | 1 | 0 | 4 | 8 | 6 |
| 2 | 1 | 2015-03-02 18:45:00+00:00 | 51.512286 | -0.114491 | 1 | 1 | 4 | 4 | 8 | 6 |
| 4 | 0 | 2015-03-03 15:50:00+00:00 | 51.512286 | -0.114491 | 1 | 1 | 4 | 4 | 8 | 6 |
| 5 | 1 | 2015-03-03 20:20:00+00:00 | 51.512286 | -0.114491 | 1 | 1 | 0 | 3 | 3 | 2 |


In [12]:
#But "data" remains as it was...
data.head()

#You can use the Encoder Mapper to map between the numerical values in data_encoded and the original strings

| | Type | Date | Latitude | Longitude | Gender | Age range | Officer-defined ethnicity | Legislation | Object of search | Outcome |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Person search | 2015-03-02 16:40:00+00:00 | 51.512286 | -0.114491 | Male | 25-34 | Asian | Police and Criminal Evidence Act 1984 (section 1) | Stolen goods | Suspect arrested |
| 1 | Person search | 2015-03-02 16:40:00+00:00 | 51.512286 | -0.114491 | Male | 25-34 | Asian | Police and Criminal Evidence Act 1984 (section 1) | Stolen goods | Suspect arrested |
| 2 | Person search | 2015-03-02 18:45:00+00:00 | 51.512286 | -0.114491 | Male | 25-34 | White | Police and Criminal Evidence Act 1984 (section 1) | Stolen goods | Suspect arrested |
| 4 | Person and Vehicle search | 2015-03-03 15:50:00+00:00 | 51.512286 | -0.114491 | Male | 25-34 | White | Police and Criminal Evidence Act 1984 (section 1) | Stolen goods | Suspect arrested |
| 5 | Person search | 2015-03-03 20:20:00+00:00 | 51.512286 | -0.114491 | Male | 25-34 | Asian | Misuse of Drugs Act 1971 (section 23) | Controlled drugs | Nothing found - no further action |
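
The research question asks a yes/no question (did the stop end in police action?), while "Outcome" holds several distinct labels. One possible way to derive a binary target is sketched below on a toy series that uses only the two outcome labels visible above. Treating "Nothing found - no further action" as the only "no action" label is an assumption that would need checking against the full list of outcomes, and the names `NO_ACTION` and `police_action` are hypothetical.

```python
import pandas as pd

# Toy outcomes using only the two labels visible in the table above.
outcomes = pd.Series([
    "Suspect arrested",
    "Nothing found - no further action",
    "Suspect arrested",
])

# Assumption: this is the only label meaning "no police action".
NO_ACTION = "Nothing found - no further action"

# 1 = the stop concluded in police action, 0 = it did not.
police_action = (outcomes != NO_ACTION).astype(int)

print(police_action.tolist())  # [1, 0, 1]
```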
