# Crime Analysis Modeling

The purpose of this notebook is to begin training machine learning algorithms to estimate the likelihood of becoming a victim of a crime, violent or non-violent, within the Kansas City metropolitan area based on the time of year and the individual's age range.

This is done in two parts: first through multi-label classification, predicting whether a crime is Is_Violent, Is_Property_Crime, Is_Crimes_Against_Person, or Is_Crimes_Against_Society, and then through single-label classification of Offense_Rank, which is ordinally encoded based on the UCR program classifications.

## Read in CSV

In [2]:
import pandas as pd
csv_file_path = '/Users/lindseysullivan/Documents/School/Kansas-City-Crime-Analysis/Data/Data_Sets/KCPD-5-Year-Analysis-Model.csv'
df = pd.read_csv(csv_file_path)
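
The loaded data set contains sine/cosine pairs such as sin_Months and cos_Months, which serve as the time-of-year features in the models that follow. How those columns were produced is not shown in this notebook. The sketch below assumes the common cyclical encoding, in which a periodic value is mapped onto the unit circle so that December and January end up adjacent. The helper name `cyclical_encode` and the sample months are hypothetical.

```python
import numpy as np
import pandas as pd

def cyclical_encode(values, period):
    """Map a periodic quantity onto the unit circle (assumed encoding, not confirmed by the source)."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

# Hypothetical month numbers (1 = January, 12 = December)
months = pd.Series([1, 2, 12])
sin_m, cos_m = cyclical_encode(months - 1, period=12)

# Euclidean distance between encoded months
points = np.column_stack([sin_m, cos_m])
jan_to_feb = np.linalg.norm(points[0] - points[1])
jan_to_dec = np.linalg.norm(points[0] - points[2])
print(f"Jan-Feb distance: {jan_to_feb:.4f}, Jan-Dec distance: {jan_to_dec:.4f}")
```

With this encoding, January is as close to December as it is to February, which a plain 1 to 12 integer column would not capture.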

# Multi-Label Classification Predictive Models

### Is_Violent/Type of Crime (Is_Property_Crime, Is_Crimes_Against_Person, Is_Crimes_Against_Society)
These targets were selected to quickly assess whether a crime would be violent, and how it would be classified, based on the time frame and the age_range explored below. They are modeled with Random Forest, K-Nearest Neighbor (K-NN), and XGBoost classifiers.
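
Because Offense_Rank is included among the input features in the cells below, it may be worth checking whether the four labels are already fully determined by that single column. If they are, near-perfect scores would reflect a lookup rather than a learned relationship with time or age. The sketch below is one possible check, run on a small made-up frame. The helper `labels_determined_by` and the toy values are hypothetical and are not taken from the KCPD data.

```python
import pandas as pd

def labels_determined_by(frame, feature, label_cols):
    """Return, per label, whether every value of `feature` maps to exactly one label value."""
    distinct_per_group = frame.groupby(feature)[label_cols].nunique()
    return (distinct_per_group <= 1).all()

# Toy data (hypothetical): Is_Violent follows Offense_Rank exactly, Is_Property_Crime does not
toy = pd.DataFrame({
    'Offense_Rank':      [1, 1, 2, 2, 3, 3],
    'Is_Violent':        [1, 1, 0, 0, 0, 0],
    'Is_Property_Crime': [0, 0, 1, 1, 0, 1],
})

result = labels_determined_by(toy, 'Offense_Rank', ['Is_Violent', 'Is_Property_Crime'])
print(result)
```

Running the same helper on `df` with the four label columns would show whether Offense_Rank alone explains them.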

## Random Forest

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, multilabel_confusion_matrix

# Feature and Target Setup
X = df[['sin_Quarter', 'cos_Quarter', 'sin_Months', 'cos_Months', 'sin_DayOfWeek', 'cos_DayOfWeek', 'Offense_Rank', 'age_range_encoded']]
y = df[['Is_Violent', 'Is_Property_Crime', 'Is_Crimes_Against_Person', 'Is_Crimes_Against_Society']]

# Splitting Data into Train/Test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initializing and training the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Predicting on the test set
y_pred = rf_classifier.predict(X_test)

# Accuracy 
accuracy = accuracy_score(y_test, y_pred)

# Evaluating the model
print(f'Model accuracy: {accuracy:.4f}')
print('--------------------')
print('Classification Report:')
print(classification_report(y_test, y_pred, target_names=y.columns,digits=3, zero_division=0))
print('--------------------')
print('Multilabel Confusion Matrix:')
print(multilabel_confusion_matrix(y_test, y_pred))
print('--------------------')
individual_accuracies = (y_pred == y_test).mean(axis=0)
print(f"Individual Label Accuracies: {individual_accuracies}")

Model accuracy: 0.9984
--------------------
Classification Report:
                           precision    recall  f1-score   support

               Is_Violent      1.000     0.998     0.999     14434
        Is_Property_Crime      0.999     1.000     1.000     29529
 Is_Crimes_Against_Person      0.999     0.994     0.996     11301
Is_Crimes_Against_Society      0.998     0.999     0.999     48228

                micro avg      0.999     0.999     0.999    103492
                macro avg      0.999     0.998     0.998    103492
             weighted avg      0.999     0.999     0.999    103492
              samples avg      0.999     0.999     0.999    103492

--------------------
Multilabel Confusion Matrix:
[[[74624     0]
  [   30 14404]]

 [[59512    17]
  [   12 29517]]

 [[77743    14]
  [   65 11236]]

 [[40756    74]
  [   35 48193]]]
--------------------
Individual Label Accuracies: Is_Violent                   0.999663
Is_Property_Crime            0.999674
Is_Crimes_Against_Person     0.999113
Is_Crimes_Against_Society    0.998776
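
The overall model accuracy reported above is lower than each individual label accuracy. This is expected: for multi-label targets, scikit-learn's `accuracy_score` counts a sample as correct only when every label in the row matches (subset accuracy). A minimal illustration with made-up arrays:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy multi-label targets (hypothetical): 4 samples, 4 labels
y_true_toy = np.array([[1, 0, 1, 0],
                       [0, 1, 0, 0],
                       [1, 0, 1, 0],
                       [0, 0, 0, 1]])
y_pred_toy = np.array([[1, 0, 1, 0],
                       [0, 1, 0, 0],
                       [1, 0, 0, 0],   # one label wrong in this row
                       [0, 0, 0, 1]])

# Subset accuracy: fraction of rows where all labels match
subset_accuracy = accuracy_score(y_true_toy, y_pred_toy)

# Per-label accuracy: fraction of correct predictions in each column
per_label_accuracy = (y_true_toy == y_pred_toy).mean(axis=0)

print(f"Subset accuracy: {subset_accuracy}")
print(f"Per-label accuracy: {per_label_accuracy}")
```

A single wrong label makes the whole row count as an error, so subset accuracy can never exceed the lowest per-label accuracy.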

## k-NN (k-Nearest Neighbor)

In [4]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, multilabel_confusion_matrix

# Feature and Target Setup
X = df[['sin_Quarter', 'cos_Quarter', 'sin_Months', 'cos_Months', 'sin_DayOfWeek', 'cos_DayOfWeek', 'Offense_Rank', 'age_range_encoded']]
y = df[['Is_Violent', 'Is_Property_Crime', 'Is_Crimes_Against_Person', 'Is_Crimes_Against_Society']]

# Splitting Data into Train/Test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

k = 5  # Number of neighbors to consider
knn = KNeighborsClassifier(n_neighbors=k)

knn.fit(X_train, y_train)

# Predicting on the test set
y_pred = knn.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)

# Evaluate the Model
print(f'Model Accuracy: {accuracy:.4f}')
print('--------------------')
print('Classification Report:')
print(classification_report(y_test, y_pred, target_names=y.columns, zero_division=0, digits=3))
print('--------------------')
print('Multilabel Confusion Matrix:')
print(multilabel_confusion_matrix(y_test, y_pred))
print('--------------------')
individual_accuracies = (y_pred == y_test).mean(axis=0)
print(f"Individual Label Accuracies: {individual_accuracies}")

Model Accuracy: 0.9994
--------------------
Classification Report:
                           precision    recall  f1-score   support

               Is_Violent      1.000     0.999     0.999     14434
        Is_Property_Crime      1.000     1.000     1.000     29529
 Is_Crimes_Against_Person      0.999     0.998     0.999     11301
Is_Crimes_Against_Society      0.999     1.000     1.000     48228

                micro avg      1.000     0.999     1.000    103492
                macro avg      1.000     0.999     0.999    103492
             weighted avg      1.000     0.999     1.000    103492
              samples avg      1.000     0.999     0.999    103492

--------------------
Multilabel Confusion Matrix:
[[[74624     0]
  [   15 14419]]

 [[59520     9]
  [    7 29522]]

 [[77751     6]
  [   24 11277]]

 [[40801    29]
  [   13 48215]]]
--------------------
Individual Label Accuracies: Is_Violent                   0.999832
Is_Property_Crime            0.999820
Is_Crimes_Against_Person     0.999663
Is_Crimes_Against_Society    0.999528

## XGBoost

In [6]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, multilabel_confusion_matrix

# Features and Target Setup
X = df[['sin_Quarter', 'cos_Quarter', 'sin_Months', 'cos_Months', 'sin_DayOfWeek', 'cos_DayOfWeek', 'Offense_Rank', 'age_range_encoded']]
y = df[['Is_Violent', 'Is_Property_Crime', 'Is_Crimes_Against_Person', 'Is_Crimes_Against_Society']]

# Split the data into training/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the XGBoost classifier
xgb_clf = xgb.XGBClassifier(objective='binary:logistic', eval_metric='logloss', use_label_encoder=False)

xgb_clf.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = xgb_clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy:.4f}')
print('--------------------')
print('Classification Report:')
print(classification_report(y_test, y_pred, target_names=y.columns,digits=3, zero_division=0))
print('--------------------')
print('Multilabel Confusion Matrix:')
print(multilabel_confusion_matrix(y_test, y_pred))
print('--------------------')
individual_accuracies = (y_pred == y_test).mean(axis=0)
print(f"Individual Label Accuracies: {individual_accuracies}")


Model Accuracy: 1.0000
--------------------
Classification Report:
                           precision    recall  f1-score   support

               Is_Violent      1.000     1.000     1.000     14434
        Is_Property_Crime      1.000     1.000     1.000     29529
 Is_Crimes_Against_Person      1.000     1.000     1.000     11301
Is_Crimes_Against_Society      1.000     1.000     1.000     48228

                micro avg      1.000     1.000     1.000    103492
                macro avg      1.000     1.000     1.000    103492
             weighted avg      1.000     1.000     1.000    103492
              samples avg      1.000     1.000     1.000    103492

--------------------
Multilabel Confusion Matrix:
[[[74624     0]
  [    0 14434]]

 [[59529     0]
  [    0 29529]]

 [[77757     0]
  [    0 11301]]

 [[40830     0]
  [    0 48228]]]
--------------------
Individual Label Accuracies: Is_Violent                   1.0
Is_Property_Crime            1.0
Is_Crimes_Against_Person     1.0
Is_Crimes_Against_Society    1.0

## Model Performance
### Random Forest
| **Label** | **Precision** | **Recall** | **F1-Score** | **Support** | **Individual Accuracy** |
|-----------|---------------|------------|--------------|-------------|-------------------------|
| Violent   | 1.000 | 0.998 | 0.999 | 14434 | 0.999663 |
| Property  | 0.999 | 1.000 | 1.000 | 29529 | 0.999674 |
| Person    | 0.999 | 0.994 | 0.996 | 11301 | 0.999113 |
| Society   | 0.998 | 0.999 | 0.999 | 48228 | 0.998776 |

**Model Accuracy (all four labels correct):** 0.9984

### K-Nearest Neighbor
| **Label** | **Precision** | **Recall** | **F1-Score** | **Support** | **Individual Accuracy** |
|-----------|---------------|------------|--------------|-------------|-------------------------|
| Violent   | 1.000 | 0.999 | 0.999 | 14434 | 0.999832 |
| Property  | 1.000 | 1.000 | 1.000 | 29529 | 0.999820 |
| Person    | 0.999 | 0.998 | 0.999 | 11301 | 0.999663 |
| Society   | 0.999 | 1.000 | 1.000 | 48228 | 0.999528 |

**Model Accuracy (all four labels correct):** 0.9994

### XGBoost
| **Label** | **Precision** | **Recall** | **F1-Score** | **Support** | **Individual Accuracy** |
|-----------|---------------|------------|--------------|-------------|-------------------------|
| Violent   | 1.000 | 1.000 | 1.000 | 14434 | 1.000 |
| Property  | 1.000 | 1.000 | 1.000 | 29529 | 1.000 |
| Person    | 1.000 | 1.000 | 1.000 | 11301 | 1.000 |
| Society   | 1.000 | 1.000 | 1.000 | 48228 | 1.000 |

**Model Accuracy (all four labels correct):** 1.0000
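
The Individual Accuracy values in these tables can be recomputed from the multilabel confusion matrices, since scikit-learn lays out each 2x2 matrix as [[TN, FP], [FN, TP]]. The check below copies the Random Forest matrices from the output printed earlier and derives the per-label accuracy as (TN + TP) / total. The variable names are illustrative.

```python
import numpy as np

# Random Forest multilabel confusion matrices, copied from the printed output
rf_mcm = np.array([
    [[74624, 0], [30, 14404]],    # Is_Violent
    [[59512, 17], [12, 29517]],   # Is_Property_Crime
    [[77743, 14], [65, 11236]],   # Is_Crimes_Against_Person
    [[40756, 74], [35, 48193]],   # Is_Crimes_Against_Society
])
label_names = ['Is_Violent', 'Is_Property_Crime',
               'Is_Crimes_Against_Person', 'Is_Crimes_Against_Society']

true_negatives = rf_mcm[:, 0, 0]
true_positives = rf_mcm[:, 1, 1]
totals = rf_mcm.sum(axis=(1, 2))  # every matrix sums to the test-set size

label_accuracy = (true_negatives + true_positives) / totals
for name, acc in zip(label_names, label_accuracy):
    print(f"{name:<28}{acc:.6f}")
```

The four results agree with the Individual Accuracy column of the Random Forest table.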


## Cross-Validation

In [7]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score, make_scorer
import xgboost as xgb
from sklearn.pipeline import Pipeline

# Define a custom scorer for multi-label classification
def multi_label_f1_score(y_true, y_pred, **kwargs):
    # average='samples' computes F1 for each sample across its labels, then averages over samples
    return f1_score(y_true, y_pred, average='samples', **kwargs)

# Create a scorer object
f1_scorer = make_scorer(multi_label_f1_score)

# Features and target setup
X = df[['sin_Quarter', 'cos_Quarter', 'sin_Months', 'cos_Months', 'sin_DayOfWeek', 
         'cos_DayOfWeek', 'Offense_Rank', 'age_range_encoded']]
y = df[['Is_Violent', 'Is_Property_Crime', 'Is_Crimes_Against_Person', 'Is_Crimes_Against_Society']]

# Define the cross-validation strategy
cv_strategy = KFold(n_splits=5, shuffle=True, random_state=42)

# Define the models with a pipeline that includes the scaler
models = {
    'Random Forest': Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
    ]),
    'K-Nearest Neighbor': Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', KNeighborsClassifier(n_neighbors=5))
    ]),
    'XGBoost': Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', xgb.XGBClassifier(
            objective='binary:logistic', eval_metric='logloss', use_label_encoder=False, random_state=42
        ))
    ])
}

# Perform cross-validation for each model
for model_name, pipeline in models.items():
    cv_scores = cross_val_score(pipeline, X, y, cv=cv_strategy, scoring=f1_scorer, n_jobs=-1)
    
    # Output the results
    print(f"{model_name} CV F1 Scores: {cv_scores}")
    print(f"{model_name} CV Average F1 Score: {cv_scores.mean():.4f}")
    print(f"{model_name} CV Standard Deviation of F1 Score: {cv_scores.std():.4f}\n")


Random Forest CV F1 Scores: [0.99862636 0.99881725 0.99856835 0.99847291 0.99855149]
Random Forest CV Average F1 Score: 0.9986
Random Forest CV Standard Deviation of F1 Score: 0.0001

K-Nearest Neighbor CV F1 Scores: [0.99484606 0.99428649 0.99438755 0.99320106 0.99399261]
K-Nearest Neighbor CV Average F1 Score: 0.9941
K-Nearest Neighbor CV Standard Deviation of F1 Score: 0.0005

XGBoost CV F1 Scores: [1. 1. 1. 1. 1.]
XGBoost CV Average F1 Score: 1.0000
XGBoost CV Standard Deviation of F1 Score: 0.0000
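
The custom scorer used in the cross-validation relies on `average='samples'`, which computes an F1 score for each row across its labels and then averages over rows. This differs from `average='macro'`, which computes F1 for each label column and averages those. A toy comparison with made-up arrays shows that the two can disagree:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-label data (hypothetical): 3 samples, 2 labels
y_true_toy = np.array([[1, 0],
                       [1, 1],
                       [0, 1]])
y_pred_toy = np.array([[1, 0],
                       [1, 0],   # second label missed for this sample
                       [0, 1]])

# Per-sample F1 values are 1, 2/3, 1, so the mean is 8/9
f1_samples = f1_score(y_true_toy, y_pred_toy, average='samples', zero_division=0)

# Per-label F1 values are 1 and 2/3, so the mean is 5/6
f1_macro = f1_score(y_true_toy, y_pred_toy, average='macro', zero_division=0)

print(f"samples-averaged F1: {f1_samples:.4f}")
print(f"macro-averaged F1:   {f1_macro:.4f}")
```

When comparing the cross-validation figures against the classification reports above, the `samples avg` row is the one computed the same way as the scorer.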

