
## Research Question: "Predict whether a stop and search will conclude in police action".

#### In this notebook we attempt to answer the research question using the London stop-and-search data set (https://www.kaggle.com/sohier/london-police-records?select=london-stop-and-search.csv). First, we clean the data by handling its null values.

We import the required libraries.

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder


We import the dataset. It is "london-stop-and-search.csv", retrieved from Kaggle (https://www.kaggle.com/sohier/london-police-records?select=london-stop-and-search.csv), with some columns already removed ("part of police operation", "police operation", and "self-defined ethnicity").

In [None]:
data = pd.read_csv("data.csv")

# Programmatically remove the columns:
#   - "Outcome linked to object of search"
#   - "Removal of more than just outer clothing"

# Reason: both have too many nulls. The former is also not known before the police action, so it is irrelevant to the research question.
# TODO: discuss whether the csv file needs to be modified to reflect this.
del data["Outcome linked to object of search"]
del data["Removal of more than just outer clothing"]

# We print information about our data object (the csv file), which displays the number of non-empty ("non-null") values for each column, as well as the types of their values.
# PS: Most columns have a "Dtype" of "object", which in pandas usually indicates string values.
# data.info() prints its summary directly and returns None, so it is not wrapped in print().
data.info()
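
The "too many nulls" reasoning above can be checked numerically. The following is a minimal sketch on a made-up toy frame, not the real dataset: the values, the variable names, and the 50% threshold are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", None],
    "Outcome linked to object of search": [None, None, True, None],
})

# Share of null values per column (0.0 = no nulls, 1.0 = all nulls).
null_share = toy.isna().mean()
print(null_share)

# Assumed cut-off: treat a column as too sparse if more than half is null.
threshold = 0.5
sparse_cols = null_share[null_share > threshold].index.tolist()

# Drop the sparse columns without modifying the original frame.
toy_trimmed = toy.drop(columns=sparse_cols)
print(toy_trimmed.columns.tolist())
```

Running the same `isna().mean()` call on the real `data` object would show which columns are sparse enough to justify removal.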


Columns in this dataset are either numerical (e.g. Latitude) or categorical (e.g. Gender, which describes a category of things).

For categorical columns we can't calculate a median, because a median is not meaningful for string values. We can do so for our numerical columns, which are Latitude and Longitude, and we use each column's median to fill its nulls.

Furthermore, we can convert all values under "Date" to type DateTime using pd.to_datetime.

Finally, we drop all remaining rows with empty values using .dropna().
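
To make these three steps concrete, here is a minimal sketch on a made-up four-row frame. The values and the variable names `toy` and `toy_clean` are assumptions for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy frame with one missing Latitude and one missing Gender.
toy = pd.DataFrame({
    "Date": [
        "2015-03-02T16:40:00+00:00",
        "2015-03-02T18:45:00+00:00",
        "2015-03-03T09:10:00+00:00",
        "2015-03-04T11:00:00+00:00",
    ],
    "Latitude": [51.50, None, 51.54, 51.52],
    "Gender": ["Male", "Female", None, "Male"],
})

# Step 1: fill numerical nulls with the column median (nulls are ignored
# when the median itself is computed).
toy["Latitude"] = toy["Latitude"].fillna(toy["Latitude"].median())

# Step 2: parse the date strings into datetime values.
toy["Date"] = pd.to_datetime(toy["Date"])

# Step 3: drop any row that still has a null (here, the row with no Gender).
toy_clean = toy.dropna()
print(toy_clean)
```

The row with the missing Latitude survives because it was filled in, while the row with the missing Gender is dropped.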

In [None]:
#Convert latitude and longitude nulls to median
lat_median = data["Latitude"].median()
lon_median = data["Longitude"].median()

data["Latitude"] = data["Latitude"].fillna(lat_median)
data["Longitude"] = data["Longitude"].fillna(lon_median)

#Change the "Date" column to type DateTime
data['Date'] = pd.to_datetime(data['Date'])

# Some of the values for "Age range" have an inexplicable value of "Oct-17"... those will be removed as well
# Reference for dictionary idea to replace values: https://stackoverflow.com/questions/17114904/python-pandas-replacing-strings-in-dataframe-with-numbers
oct17_to_None = {"Oct-17": None}
data = data.applymap(lambda s: oct17_to_None.get(s) if s in oct17_to_None else s)

#For other columns we'll need to drop the null values
data = data.dropna()

#Let's call data.info() again to see the changes (it prints its summary directly)
data.info()

#Notice: for each column, the number of non-null rows is equal to the number of rows in the table (165,651). Hence, no column has empty values under it.
#Notice: the type of Date is now datetime64[ns, UTC]

## Encoding data 

Most sklearn functions expect numerical input rather than string values, and our dataset is filled with strings. We therefore need to encode them as numbers. Below, we build on the method outlined in ex5Part2 (Lab 5 of the Introduction to AI module), using LabelEncoder from sklearn to automatically encode our strings into numerals.

Our process will proceed as follows:

1. We build a dictionary to store an instance of LabelEncoder for each 'categorical column' (the seven labels assigned as a list to the variable "categorical_cols" below). We exclude the non-categorical columns (i.e. Date, Latitude, and Longitude) because those don't need to be encoded.
2. We iterate over the categorical columns in our dataset. For each one, we instantiate and fit a corresponding encoder of type LabelEncoder(). The goal is to be able to access the specific encoder used for a particular column, which lets us see the original string values (using encoder.classes_) and map them to the encoded numbers. This benefit will be made clearer in the 'Encoder Mapper' below.
3. We copy our dataset "data" into a variable "data_encoded". We will replace the categorical values of "data_encoded" with the encoded numerals and leave the cleaned dataset of strings ("data") untouched, which may prove useful later.
4. We iterate over the categorical columns in "data_encoded" and transform all strings to the appropriate numeral, using the corresponding encoder in the "encoders" dictionary, as shown in Lab 5.
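
Before applying this to the real columns, it may help to see what a single LabelEncoder does. The following is a minimal sketch on a made-up list of values; `genders`, `encoder`, `mapping`, and `codes` are hypothetical names used only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values for one categorical column.
genders = ["Male", "Female", "Male", "Other", "Female"]

encoder = LabelEncoder()
encoder.fit(genders)

# classes_ holds the unique labels in sorted order; each label's position
# in classes_ is the integer it is encoded to.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# transform() replaces every string with its integer code.
codes = encoder.transform(genders)
print(codes.tolist())
```

Because the codes follow the sorted order of the labels, they carry no meaning beyond identifying the category.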

In [None]:
#Build a dictionary of Encoders
encoders = {}
categorical_cols = ["Type", "Gender", "Age range", "Officer-defined ethnicity", "Legislation", "Object of search", "Outcome"]

for label in categorical_cols:
    encoders[f"{label} Encoder"] = LabelEncoder()
    encoders[f"{label} Encoder"].fit(data[label])

print(encoders)

In [None]:
#Encoder Mapper
# The purpose is to access the encoders in the "encoders" dictionary and print, for each column, which numeral each original string value is encoded to.
for label in categorical_cols:
    print(f"Encoder dictionary for {label}:")
    encoder = encoders[f"{label} Encoder"]
    # transform() returns the numerals in the same order as the classes passed in
    for category, numeral in zip(encoder.classes_, encoder.transform(encoder.classes_)):
        print(f"'{category}' encoded to '{numeral}'")
    print("\n")

In [None]:
#We copy data, so data_encoded and data are two different objects (such that changing one won't impact the other)
data_encoded = data.copy() 

#We perform the encoding on data_encoded
#The encoders were already fitted when the dictionary was built, so we only need transform() here
for label in categorical_cols:
    data_encoded[label] = encoders[f"{label} Encoder"].transform(data[label])

#Print data_encoded... notice all categorical values have been converted to numbers!
data_encoded.head()

In [None]:
#But "data" remains as it was...
data.head()

#You can use the Encoder Mapper to map between the numerical values in data_encoded and the original strings
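
Besides reading the printed mapping, a fitted LabelEncoder can decode numerals directly with inverse_transform. This is a minimal sketch on a made-up column, assuming invented outcome values; `toy`, `enc`, `toy_encoded`, and `decoded` are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column; the outcome strings are invented for illustration.
toy = pd.DataFrame({"Outcome": ["Nothing found", "Suspect arrested", "Nothing found"]})

# Fit one encoder on the column and encode a copy of the frame.
enc = LabelEncoder().fit(toy["Outcome"])
toy_encoded = toy.copy()
toy_encoded["Outcome"] = enc.transform(toy["Outcome"])
print(toy_encoded["Outcome"].tolist())

# inverse_transform() maps the integer codes back to the original strings.
decoded = enc.inverse_transform(toy_encoded["Outcome"])
print(decoded.tolist())
```

The same call on `encoders["Outcome Encoder"]` would recover the original labels from a column of `data_encoded`, provided the encoder is the one that produced those codes.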