# Report 2: Theft Over Open Data (TOOD)

> This file serves as a reference for sections 3 - 5, which all involve model training, 
> testing, and deployment. Creativity and innovation are encouraged for this and other 
> sections moving forward.

**What is the purpose of this file?**

This notebook covers parts 3-5 of the assignment, namely:
1. Predictive Model Building - Building the predictive model using modules within `sklearn`.
2. Model Scoring and Evaluation - Scoring and evaluating the model on the training and test data.
3. Model Deployment - Deploying the model as `.pkl` files.

**NOTE**: The naming of this notebook is intentional; it stands for the following:
- `c309` - The course code, COMP309.
- `r2` - The report number, report 2 of the group project.
- `toodu` - The name we gave the dataset and will continue to use in this notebook.
- `model` - A generic name; this notebook contains sections 3 - 5, which all involve model training, testing, and deployment.

**NOTE**: The notebook below covers all aspects of sections 3-5. Where needed, additional information and insights appear throughout to explain the specifics of the model, the algorithms used, and how we handle imbalanced data when training the model.

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
DEFAULT_DATA_PATH = os.path.join(os.pardir, "data")

toodu_ft_df = pd.read_csv(os.path.join(DEFAULT_DATA_PATH, "Theft_Over_Open_Data_Cleaned.csv"))

**Observation**: The Theft Over dataset has a glaring issue: the records for different offence types overlap heavily, which makes them hard to tell apart. We therefore merged the 6 smaller offence types into `Theft Over`, since many of them are already classified as subcategories of it. The dataset sorts offences into 3 `UCR_CODE` categories: one for `Theft Over`, a second for `Motor Vehicle Over`, and a third for `Shoplifting`. `Shoplifting` shares many similarities with `Theft Over`; its only difference is that the theft is of merchandise from an open retail store. The two often target the same types of locations, so the difference in charge is not a meaningful distinction. The only offence that truly differentiates itself is Theft From Motor Vehicle Over; the rest almost always involve theft from similar locations.

When it comes to features, many columns appeared useful, such as latitude and longitude or the neighbourhood codes. When included, these had a higher importance than `PREMISES_TYPE` or `LOCATION_TYPE`, but they also hurt the model's ability to predict the outcome. This appears to be because the model then tried to associate a neighbourhood with a type of crime, which produced many incorrect predictions. Adding yet another feature could in turn cause the model to largely ignore those high-importance features. The two features we settled on were the most relevant, and their importance stayed consistent across different feature combinations instead of swinging back and forth.

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler

theft_over_categories = {
    "Theft - Misapprop Funds Over",
    "Theft Over - Bicycle",
    "Theft Over - Distraction",
    "Theft Over",
    "Theft Over - Shoplifting",
    "Theft Of Utilities Over",
    "Theft From Mail / Bag / Key"
}

# `replace` expects a list-like (not a set) of values to replace.
toodu_ft_df["OFFENCE"] = toodu_ft_df["OFFENCE"].replace(list(theft_over_categories), "Theft Over")

le = LabelEncoder()
toodu_ft_df["OFFENCE_ENCODED"] = le.fit_transform(toodu_ft_df["OFFENCE"])

categorical_features = [
    "PREMISES_TYPE",
    "LOCATION_TYPE"
]
label_encoders = {}

for col in categorical_features:
    label_encoders[col] = LabelEncoder()
    toodu_ft_df[col] = label_encoders[col].fit_transform(toodu_ft_df[col])
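Label encoding maps each category to an arbitrary integer, which a linear model such as logistic regression will read as an ordering. As a possible alternative (not what this notebook does), one-hot encoding creates one indicator column per category. The sketch below uses a made-up toy frame; only the column names come from this notebook, the category values are assumptions.

```python
import pandas as pd

# Hypothetical toy frame; the category values are made up for illustration.
toy_df = pd.DataFrame({
    "PREMISES_TYPE": ["House", "Commercial", "Outside", "House"],
    "LOCATION_TYPE": ["Single Home", "Retail Store", "Street", "Apartment"],
})

# One indicator column per category value, so no artificial ordering is implied.
toy_encoded = pd.get_dummies(
    toy_df, columns=["PREMISES_TYPE", "LOCATION_TYPE"], dtype=int
)
print(toy_encoded.columns.tolist())
```

The trade-off is a wider feature matrix, which may matter if `LOCATION_TYPE` has many distinct values.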

In [None]:
features_filtered = toodu_ft_df[categorical_features]
features_filtered

In [None]:
target_filtered = toodu_ft_df["OFFENCE_ENCODED"]
target_filtered

In [None]:
scaler = StandardScaler()

numerical_features_sf = scaler.fit_transform(features_filtered)
numerical_features_sf

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, cross_val_predict

x_train, x_test, y_train, y_test = train_test_split(numerical_features_sf, target_filtered, test_size=0.2, random_state=42)
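One caveat with the cells above: the scaler is fit on the full dataset before the split, so statistics from the test rows leak into training. A minimal sketch of a leak-free variant follows, assuming synthetic stand-in data from `make_classification` and hypothetical variable names. It splits first (with `stratify` to preserve the class ratio) and lets a pipeline fit the scaler on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): two features, imbalanced binary target.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.75, 0.25], random_state=42,
)

# Split first, stratifying so both splits keep the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training split only,
# then applies the same transformation when scoring the test split.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", random_state=42),
)
pipe.fit(X_tr, y_tr)
print(f"Test accuracy: {pipe.score(X_te, y_te):.3f}")
```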

In [None]:
!pip install imbalanced-learn

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# NOTE: We can use a better `sampling_strategy` than just "auto"
smote = SMOTE(random_state=42, sampling_strategy="auto")
x_train, y_train = smote.fit_resample(x_train, y_train)

In [None]:
SAMPLES_PER_CLASS = 1500

unique_classes = np.unique(y_train).tolist()
sampling_strategy = {cls: min(SAMPLES_PER_CLASS, (y_train == cls).sum()) for cls in unique_classes}

under_sampler = RandomUnderSampler(sampling_strategy=sampling_strategy, random_state=42)
x_train, y_train = under_sampler.fit_resample(x_train, y_train)

In [None]:
# NOTE: Start of Model Creating and Training

# NOTE: Also can add the following: n_jobs=5, C=100 | C=0.01
model = LogisticRegression(solver="liblinear", random_state=42)
model.fit(x_train, y_train)

score_model = lambda model, x_test, y_test: model.score(x_test, y_test)
score_model(model, x_test, y_test)

In [None]:
# NOTE: Start of Model Scoring and Evaluation (Classification Reports)
from sklearn.metrics import classification_report, accuracy_score

y_pred = model.predict(x_test)
y_pred_proba = model.predict_proba(x_test)[:10]
y_pred_inverse = le.inverse_transform(y_pred)
y_test_inverse = le.inverse_transform(y_test)

print(f"Classification Report:\n{classification_report(y_test, y_pred)}")
print(f"Accuracy Score: {accuracy_score(y_test, y_pred)}")
print(f"Training Data Score: {model.score(x_train, y_train)}")
print(f"Testing Data Score (Overall Accuracy): {model.score(x_test, y_test)}")

In [None]:
y_test.value_counts()

In [None]:
y_pred_proba

In [None]:
null_accuracy_score = (2092/(2092 + 700))
print(f"Null Accuracy Score: {null_accuracy_score}")
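The null accuracy is the accuracy of always predicting the most frequent class. Instead of typing the counts by hand, it can be derived from the label counts. The sketch below uses a hypothetical label series that reproduces the 2092 / 700 split; which class holds which count is an assumption.

```python
import pandas as pd

# Hypothetical test labels reproducing the 2092 / 700 split used above.
y_labels_demo = pd.Series([1] * 2092 + [0] * 700)

# Null accuracy: the share of the most frequent class.
null_accuracy = y_labels_demo.value_counts(normalize=True).max()
print(f"Null Accuracy Score: {null_accuracy}")
```

In the notebook itself the same expression could be applied to `y_test` directly, which keeps the figure correct if the split changes.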

In [None]:
# NOTE: Start of Confusion Matrix and Visualizations

from sklearn.metrics import confusion_matrix

# Rows of `cm` are the actual classes and columns are the predicted classes.
cm = confusion_matrix(y_test, y_pred)
labels = le.classes_  # original offence names, in encoded order

plt.figure(figsize=(10, 10))
sns.heatmap(cm, annot=True, cmap="YlOrRd", fmt="d", xticklabels=labels, yticklabels=labels)
plt.title("Confusion Matrix of Theft Incidents")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.xticks(rotation=0)
plt.yticks(rotation=90)
plt.show()
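For a binary problem, scikit-learn lays out the confusion matrix as `[[TN, FP], [FN, TP]]`, with the class encoded as 1 treated as positive. The small worked example below uses made-up labels to show how the four cells can be unpacked and cross-checked against `precision_score` and `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Tiny made-up labels, only to show the layout of the matrix.
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# Rows are actual classes, columns are predicted classes.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()

# Manual metrics from the four cells.
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)

# Cross-check against scikit-learn's own implementations.
print(manual_precision, precision_score(y_true_demo, y_pred_demo))
print(manual_recall, recall_score(y_true_demo, y_pred_demo))
```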

In [None]:
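# NOTE: Classification Accuracy and Error
# scikit-learn lays out a binary confusion matrix as [[TN, FP], [FN, TP]],
# treating the class encoded as 1 as the positive class.
TN, FP, FN, TP = cm.ravel()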

classification_accuracy = (TP + TN) / float(TP + TN + FP + FN)
classification_error = (FP + FN) / float(TP + TN + FP + FN)
precision_score = TP / float(TP + FP)
recall_score = TP / float(TP + FN)
true_positive_rate = TP / float(TP + FN)
false_positive_rate = FP / float(FP + TN)

print(f"""
Classification Accuracy: {classification_accuracy}
Classification Error:    {classification_error}
Precision Score:         {precision_score}
Recall Score:            {recall_score}
True Positive Rate:      {true_positive_rate}
False Positive Rate:     {false_positive_rate}
""")

In [None]:
# TODO: ROC (Receiver Operation Characteristics) Curve Here.
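One way the ROC curve could be produced is sketched below. It is not the notebook's final implementation: the data is synthetic (`make_classification`) and all variable names are hypothetical. The key step is passing the predicted probability of the positive class, not the hard predictions, to `roc_curve`.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), binary target.
X_roc, y_roc = make_classification(
    n_samples=600, n_features=2, n_informative=2, n_redundant=0, random_state=42
)
X_roc_tr, X_roc_te, y_roc_tr, y_roc_te = train_test_split(
    X_roc, y_roc, test_size=0.2, random_state=42
)

clf_roc = LogisticRegression(solver="liblinear", random_state=42)
clf_roc.fit(X_roc_tr, y_roc_tr)

# Column 1 of predict_proba holds the probability of the positive class.
proba_pos = clf_roc.predict_proba(X_roc_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_roc_te, proba_pos)
auc = roc_auc_score(y_roc_te, proba_pos)

plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, label=f"ROC (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve (synthetic data)")
plt.legend()
plt.show()
```

With the notebook's own objects, the equivalent inputs would presumably be `y_test` and `model.predict_proba(x_test)[:, 1]`.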

In [None]:
# TODO: Feature Importances Here.
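For a logistic regression, a rough stand-in for feature importance is the absolute value of each coefficient, which is only comparable across features when the inputs are standardized. The sketch below assumes synthetic data; the feature names are borrowed from this notebook purely as labels, and the resulting numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Names from the notebook, used only as labels; the values are synthetic.
feature_names = ["PREMISES_TYPE", "LOCATION_TYPE"]
X_imp, y_imp = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=42
)

# Standardize so coefficient magnitudes are on a comparable scale.
X_imp = StandardScaler().fit_transform(X_imp)

clf_imp = LogisticRegression(solver="liblinear", random_state=42)
clf_imp.fit(X_imp, y_imp)

# For a binary model, coef_ has shape (1, n_features).
importance = pd.Series(
    np.abs(clf_imp.coef_[0]), index=feature_names
).sort_values(ascending=False)
print(importance)
```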

In [42]:
toodu_ft_df.to_csv(os.path.join(DEFAULT_DATA_PATH, "Theft_Over_Data_Cleaned_Encoded.csv"), index=False)
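The deployment step (saving the model as a `.pkl` file) could look roughly like the sketch below. It uses the standard-library `pickle` module on a tiny made-up model; the file name `c309_r2_toodu_model.pkl` is hypothetical, and a temporary directory is used so the example leaves no files behind.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set, only so there is a fitted model to save.
X_small = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_small = np.array([0, 0, 1, 1])
clf_small = LogisticRegression(solver="liblinear", random_state=42)
clf_small.fit(X_small, y_small)

# Hypothetical file name; a temp directory keeps the example side-effect free.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "c309_r2_toodu_model.pkl")

    # Serialize the fitted model to disk.
    with open(model_path, "wb") as fh:
        pickle.dump(clf_small, fh)

    # Load it back, as a serving application would.
    with open(model_path, "rb") as fh:
        restored = pickle.load(fh)

# The restored model should reproduce the original predictions.
same_predictions = np.array_equal(
    clf_small.predict(X_small), restored.predict(X_small)
)
print(same_predictions)
```

For the real model, the fitted `LabelEncoder` objects and the `StandardScaler` would likely need to be saved the same way, since new inputs must go through the identical preprocessing before prediction.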