# Violent Offender Risk Prediction and Analysis

In [77]:
# Import basic libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Generate Fake Violent Offender Data

In [79]:
# In order to create a synthetic dataset, Faker is imported.
# The language and locale is set to the UK:

from faker import Faker
fake = Faker("en_GB")
Faker.seed(7)

In [80]:
# Specify the number of rows of data required:
dflen = 25000

# Create the dataframe:
offenderdata = pd.DataFrame()

# Create the structure of the dataframe and fake data required:
offenderdata = offenderdata.assign(offender_id = pd.Series(fake.unique.random_int(min=1, max=30000) for i in range(dflen)),
                                   offence_date = pd.Series(fake.date_between_dates(pd.to_datetime('2013-01-01'),pd.to_datetime('2023-12-31')) for i in range(dflen)),
                                   name = pd.Series(fake.name() for i in range(dflen)),
                                   address = pd.Series(fake.address() for i in range(dflen)),
                                   latitude = pd.Series(fake.latitude() for i in range(dflen)),
                                   longitude = pd.Series(fake.longitude() for i in range(dflen)),
                                   age = pd.Series(fake.random_int(min=18, max=95) for i in range(dflen)),
                                   offence_type = pd.Series(fake.random.choice(['common assault', 'grievous bodily harm', 'domestic abuse', 'sexual assault', 'possession weapon',
                                                                               'aggravated robbery', 'serious violence', 'drug dealing']) for i in range(dflen)),
                                   mental_health = pd.Series(fake.random_int(min=0, max=1) for i in range(dflen)),
                                   alcohol = pd.Series(fake.random_int(min=0, max=1) for i in range(dflen)),
                                   drugs = pd.Series(fake.random_int(min=0, max=1) for i in range(dflen)),
                                   fixed_abode = pd.Series(fake.random_int(min=0, max=1) for i in range(dflen)),
                                   repeat_offender = pd.Series(fake.random_int(min=0, max=1) for i in range(dflen)))
                                   
# View the dataframe:
offenderdata.head(50)

In [81]:
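# Save the data as a CSV and re-import it as a new dataframe, so that later cells work from a fixed copy of the data.
# Note that to_csv also writes the dataframe index to the file as an unnamed first column: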
offenderdata.to_csv("offenderdata.csv")

In [82]:
alldata = pd.read_csv('offenderdata.csv')

alldata.head()

,Unnamed: 0,offender_id,offence_date,name,address,latitude,longitude,age,offence_type,mental_health,alcohol,drugs,fixed_abode,repeat_offender
0,0,10612,2021-10-26,Russell Fitzgerald,221 Albert haven\nEast Malcolm\nEX1 9UD,78.006032,165.836374,90,possession weapon,1,1,0,1,0
1,1,4944,2015-04-19,Mr Ryan Rogers,Studio 68o\nLeslie key\nSouth Marktown\nSP2 4RG,-87.960595,-47.011347,53,drug dealing,0,0,1,0,1
2,2,12938,2016-06-15,Carl Hughes,393 Lee radial\nSouth Claremouth\nN2T 5BJ,-21.703899,-166.73759,85,drug dealing,1,1,0,1,1
3,3,21330,2023-10-16,Dr Mark Smart,Flat 5\nJacqueline stream\nLake Marian\nFY4 9WN,21.765661,19.72342,29,grievous bodily harm,0,1,0,1,1
4,4,1583,2019-06-19,Lisa Thomas-Donnelly,318 Aaron tunnel\nAngelatown\nBB21 4TG,-72.218123,48.655279,25,grievous bodily harm,1,1,1,0,1


In [83]:
# Drop the 'Unnamed: 0' column, which is the old index that was written into the CSV file:
alldata = alldata.drop('Unnamed: 0', axis=1)

alldata.head()

,offender_id,offence_date,name,address,latitude,longitude,age,offence_type,mental_health,alcohol,drugs,fixed_abode,repeat_offender
0,10612,2021-10-26,Russell Fitzgerald,221 Albert haven\nEast Malcolm\nEX1 9UD,78.006032,165.836374,90,possession weapon,1,1,0,1,0
1,4944,2015-04-19,Mr Ryan Rogers,Studio 68o\nLeslie key\nSouth Marktown\nSP2 4RG,-87.960595,-47.011347,53,drug dealing,0,0,1,0,1
2,12938,2016-06-15,Carl Hughes,393 Lee radial\nSouth Claremouth\nN2T 5BJ,-21.703899,-166.73759,85,drug dealing,1,1,0,1,1
3,21330,2023-10-16,Dr Mark Smart,Flat 5\nJacqueline stream\nLake Marian\nFY4 9WN,21.765661,19.72342,29,grievous bodily harm,0,1,0,1,1
4,1583,2019-06-19,Lisa Thomas-Donnelly,318 Aaron tunnel\nAngelatown\nBB21 4TG,-72.218123,48.655279,25,grievous bodily harm,1,1,1,0,1


In [84]:
alldata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   offender_id      25000 non-null  int64  
 1   offence_date     25000 non-null  object 
 2   name             25000 non-null  object 
 3   address          25000 non-null  object 
 4   latitude         25000 non-null  float64
 5   longitude        25000 non-null  float64
 6   age              25000 non-null  int64  
 7   offence_type     25000 non-null  object 
 8   mental_health    25000 non-null  int64  
 9   alcohol          25000 non-null  int64  
 10  drugs            25000 non-null  int64  
 11  fixed_abode      25000 non-null  int64  
 12  repeat_offender  25000 non-null  int64  
dtypes: float64(2), int64(7), object(4)
memory usage: 2.5+ MB


In [85]:
alldata.describe()

,offender_id,latitude,longitude,age,mental_health,alcohol,drugs,fixed_abode,repeat_offender
count,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0,25000.0
mean,15001.25288,0.548297,0.016517,56.56976,0.49676,0.50084,0.50084,0.49448,0.50476
std,8655.048724,52.142384,103.502852,22.423747,0.5,0.500009,0.500009,0.49998,0.499987
min,1.0,-89.995554,-179.995751,18.0,0.0,0.0,0.0,0.0,0.0
25%,7517.5,-45.071835,-89.167705,37.0,0.0,0.0,0.0,0.0,0.0
50%,15001.5,0.621181,0.679561,57.0,0.0,1.0,1.0,0.0,1.0
75%,22485.25,46.00676,88.104499,76.0,1.0,1.0,1.0,1.0,1.0
max,30000.0,89.993324,179.992996,95.0,1.0,1.0,1.0,1.0,1.0


## Labelling of Offenders as High, Medium and Low Risk

The labelling of the data is based on past analysis, which identified that the greater the number of aggravating factors that co-exist (i.e. mental health issues, drug and alcohol abuse, no fixed address and past offending), the more likely an offender is to become either a victim or perpetrator of murder, attempted murder or manslaughter.

Consequently, the data has to be labelled according to the presence of these factors. Mental health is a serious factor, but combine that with drugs and you have an even greater risk (not least because of the company the individual may be keeping and possible debts they may owe). If you further combine that with having no fixed abode, the individual is even more vulnerable.

With this logic in mind, the most serious combinations of factors will be labelled as high risk, while the presence of some factors (or a number of lesser factors) will be labelled as medium risk. Everything else will be considered low risk.

In [88]:
# Label the 'high risk' and 'medium risk' cases, i.e. those offenders that are at a high or medium risk of committing a murder,
# attempted murder or manslaughter offence.
# Note: fixed_abode == 0 means the offender has no fixed abode (NFA).

# Define conditions where an offender is considered high or medium risk:
conditions = [
    (alldata['mental_health'] == 1) & (alldata['alcohol'] == 1) & (alldata['drugs'] == 1)
    & (alldata['fixed_abode'] == 0) & (alldata['repeat_offender'] == 1),  # HIGH: all markers (mental health, alcohol, drugs, NFA and repeat offender).
    (alldata['mental_health'] == 1) & (alldata['alcohol'] == 1) & (alldata['drugs'] == 1)
    & (alldata['fixed_abode'] == 0) & (alldata['repeat_offender'] == 0),  # HIGH: all markers but not a repeat offender.
    (alldata['mental_health'] == 1) & (alldata['alcohol'] == 1) & (alldata['drugs'] == 1)
    & (alldata['fixed_abode'] == 1) & (alldata['repeat_offender'] == 1),  # HIGH: all markers but has a fixed abode.
    (alldata['mental_health'] == 1) & (alldata['alcohol'] == 1) & (alldata['drugs'] == 1)
    & (alldata['fixed_abode'] == 1) & (alldata['repeat_offender'] == 0),  # HIGH: mental health, drugs and alcohol issues only.
    (alldata['mental_health'] == 1) & (alldata['alcohol'] == 0) & (alldata['drugs'] == 1)
    & (alldata['fixed_abode'] == 1) & (alldata['repeat_offender'] == 0),  # HIGH: mental health and drugs issues only.
    (alldata['mental_health'] == 1) & (alldata['alcohol'] == 0) & (alldata['drugs'] == 1)
    & (alldata['fixed_abode'] == 1) & (alldata['repeat_offender'] == 1),  # HIGH: mental health and drugs issues and is a repeat offender.
    (alldata['mental_health'] == 1) & (alldata['alcohol'] == 1) & (alldata['drugs'] == 0)
    & (alldata['fixed_abode'] == 0) & (alldata['repeat_offender'] == 1),  # HIGH: all markers except drugs.
    (alldata['mental_health'] == 1) & (alldata['alcohol'] == 0) & (alldata['drugs'] == 0)
    & (alldata['fixed_abode'] == 0) & (alldata['repeat_offender'] == 1),  # HIGH: all markers except drugs and alcohol.
    (alldata['mental_health'] == 1) & (alldata['alcohol'] == 1) & (alldata['drugs'] == 0)
    & (alldata['fixed_abode'] == 1) & (alldata['repeat_offender'] == 1),  # HIGH: all markers except drugs, and has a fixed abode.
    (alldata['mental_health'] == 0) & (alldata['alcohol'] == 1) & (alldata['drugs'] == 1)
    & (alldata['fixed_abode'] == 0) & (alldata['repeat_offender'] == 1),  # HIGH: drug and alcohol issues, NFA and is a repeat offender.
    (alldata['mental_health'] == 1) & (alldata['alcohol'] == 0) & (alldata['drugs'] == 0)
    & (alldata['fixed_abode'] == 1) & (alldata['repeat_offender'] == 0),  # MEDIUM: mental health issues only.
    (alldata['mental_health'] == 1) & (alldata['alcohol'] == 0) & (alldata['drugs'] == 0)
    & (alldata['fixed_abode'] == 0) & (alldata['repeat_offender'] == 0),  # MEDIUM: mental health issues and NFA.
    (alldata['mental_health'] == 0) & (alldata['alcohol'] == 1) & (alldata['drugs'] == 1)
    & (alldata['fixed_abode'] == 1) & (alldata['repeat_offender'] == 1),  # MEDIUM: drug and alcohol issues and is a repeat offender.
    (alldata['mental_health'] == 0) & (alldata['alcohol'] == 1) & (alldata['drugs'] == 0)
    & (alldata['fixed_abode'] == 0) & (alldata['repeat_offender'] == 1),  # MEDIUM: alcohol issues, NFA and is a repeat offender.
    (alldata['mental_health'] == 0) & (alldata['alcohol'] == 0) & (alldata['drugs'] == 0)
    & (alldata['fixed_abode'] == 0) & (alldata['repeat_offender'] == 1),  # MEDIUM: NFA and is a repeat offender.
    (alldata['mental_health'] == 0) & (alldata['alcohol'] == 0) & (alldata['drugs'] == 1)
    & (alldata['fixed_abode'] == 0) & (alldata['repeat_offender'] == 1)]  # MEDIUM: drug issues, NFA and is a repeat offender.

# Create a list of values to label each condition, in the same order (10 high risk conditions, then 6 medium risk conditions).
# The original list repeated one high risk condition; the duplicate had no effect and has been removed:
values = ['High Risk'] * 10 + ['Medium Risk'] * 6

# Create a new column that labels any case matching a condition as 'High Risk' or 'Medium Risk'.
# Unmatched cases receive the string '0' (an explicit string default keeps the dtype consistent with the string labels):
alldata['risk_level'] = np.select(conditions, values, default='0')

print(alldata.head(10))
  


   offender_id offence_date                  name  \
0        10612   2021-10-26    Russell Fitzgerald   
1         4944   2015-04-19        Mr Ryan Rogers   
2        12938   2016-06-15           Carl Hughes   
3        21330   2023-10-16         Dr Mark Smart   
4         1583   2019-06-19  Lisa Thomas-Donnelly   
5         2374   2018-01-09        Ronald Chapman   
6        26912   2017-12-02        Mr Glenn Clark   
7        17560   2022-07-17          Karen Murphy   
8         3085   2022-01-04        Joseph Johnson   
9        11983   2021-03-25             Gary Dyer   

                                             address   latitude   longitude  \
0            221 Albert haven\nEast Malcolm\nEX1 9UD  78.006032  165.836374   
1    Studio 68o\nLeslie key\nSouth Marktown\nSP2 4RG -87.960595  -47.011347   
2          393 Lee radial\nSouth Claremouth\nN2T 5BJ -21.703899 -166.737590   
3    Flat 5\nJacqueline stream\nLake Marian\nFY4 9WN  21.765661   19.723420   
4             318 Aar

In [89]:
# Change all zero labels in the risk_level column to 'Low Risk' (i.e. all the rest of the cases will be considered as low risk):

alldata['risk_level'] = alldata['risk_level'].replace('0', 'Low Risk')

alldata.head(100)

,offender_id,offence_date,name,address,latitude,longitude,age,offence_type,mental_health,alcohol,drugs,fixed_abode,repeat_offender,risk_level
0,10612,2021-10-26,Russell Fitzgerald,221 Albert haven\nEast Malcolm\nEX1 9UD,78.006032,165.836374,90,possession weapon,1,1,0,1,0,Low Risk
1,4944,2015-04-19,Mr Ryan Rogers,Studio 68o\nLeslie key\nSouth Marktown\nSP2 4RG,-87.960595,-47.011347,53,drug dealing,0,0,1,0,1,Medium Risk
2,12938,2016-06-15,Carl Hughes,393 Lee radial\nSouth Claremouth\nN2T 5BJ,-21.703899,-166.737590,85,drug dealing,1,1,0,1,1,High Risk
3,21330,2023-10-16,Dr Mark Smart,Flat 5\nJacqueline stream\nLake Marian\nFY4 9WN,21.765661,19.723420,29,grievous bodily harm,0,1,0,1,1,Low Risk
4,1583,2019-06-19,Lisa Thomas-Donnelly,318 Aaron tunnel\nAngelatown\nBB21 4TG,-72.218123,48.655279,25,grievous bodily harm,1,1,1,0,1,High Risk
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,3869,2013-06-24,Rachel Allen,050 Stacey mount\nEileentown\nLE65 7LW,-51.472389,-65.216263,20,domestic abuse,1,0,1,1,0,High Risk
96,16776,2016-12-11,Francesca Stewart,465 Patricia ranch\nLeonfort\nW6D 9QF,-73.686994,-152.403218,31,common assault,1,1,1,1,0,High Risk
97,13702,2022-12-21,Dr Bruce Thompson,95 Christine ways\nNew Emilyton\nL4 6XS,-61.098558,21.110015,52,sexual assault,0,0,1,0,1,Medium Risk
98,5406,2017-07-14,Phillip Reed,751 Jordan road\nCookfurt\nL7 4EA,31.654653,-147.582924,21,sexual assault,1,1,1,1,0,High Risk


## Data Pre-Processing for Machine Learning Models

In [91]:
# Check there are no null values in the risk_level column:

alldata['risk_level'].isna().sum()

0

In [92]:
# Import required libraries for pre-processing, the models we want to apply, model validation and model evaluation:

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler 

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.metrics import RocCurveDisplay

In [93]:
# Check the balance of categories in the risk_level column:

alldata['risk_level'].value_counts()

Low Risk       12574
High Risk       7831
Medium Risk     4595
Name: risk_level, dtype: int64

The classes are not balanced, so we will balance them before we train our Random Forest and Multinomial Logistic Regression models. When we come to train a deep learning model later on, we will not balance the dataset for that model, as it can cope with the imbalance.
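
The imbalance follows from the labelling rules themselves. Because each of the five flags is generated as an independent, uniformly random 0 or 1, every one of the 32 possible flag combinations is about equally common, so the class shares depend only on how many combinations each label covers. The sketch below counts them; the two sets are transcribed from the conditions in the labelling cell, and `label_risk` is a hypothetical helper name rather than anything defined in this notebook.

```python
from itertools import product

# Flag order: (mental_health, alcohol, drugs, fixed_abode, repeat_offender).
# Both sets are transcribed from the conditions in the labelling cell.
HIGH = {
    (1, 1, 1, 0, 1), (1, 1, 1, 0, 0), (1, 1, 1, 1, 1), (1, 1, 1, 1, 0),
    (1, 0, 1, 1, 0), (1, 0, 1, 1, 1), (1, 1, 0, 0, 1), (1, 0, 0, 0, 1),
    (1, 1, 0, 1, 1), (0, 1, 1, 0, 1),
}
MEDIUM = {
    (1, 0, 0, 1, 0), (1, 0, 0, 0, 0), (0, 1, 1, 1, 1),
    (0, 1, 0, 0, 1), (0, 0, 0, 0, 1), (0, 0, 1, 0, 1),
}


def label_risk(flags):
    """Hypothetical helper: map a 5-tuple of 0/1 flags to a risk label."""
    if flags in HIGH:
        return "High Risk"
    if flags in MEDIUM:
        return "Medium Risk"
    return "Low Risk"


# Count how many of the 2**5 = 32 flag combinations fall in each class.
shares = {"High Risk": 0, "Medium Risk": 0, "Low Risk": 0}
for combo in product((0, 1), repeat=5):
    shares[label_risk(combo)] += 1

for label, n in shares.items():
    print(f"{label}: {n} of 32 combinations")
```

The counts of 10, 6 and 16 combinations correspond to expected shares of 31.25%, 18.75% and 50%, which is close to the observed 7,831, 4,595 and 12,574 rows out of 25,000.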

In [95]:
# Drop unneeded columns.
# Note that as this is not a real data example, we will drop everything except the 0/1 categories and the target 'risk_level'.
# In a real data scenario, we would undoubtedly also want to use the severity of the offence the suspect has been arrested for
# (i.e. the 'offence_type' column) as a predictor, and maybe even age, if we found that to be relevant.  Real data would likely have
# more variables/columns than this to feed into the models:

alldata = alldata.drop(columns=['offender_id', 'offence_date', 'name', 'address',
                                'latitude', 'longitude', 'age', 'offence_type'])

alldata.head()

,mental_health,alcohol,drugs,fixed_abode,repeat_offender,risk_level
0,1,1,0,1,0,Low Risk
1,0,0,1,0,1,Medium Risk
2,1,1,0,1,1,High Risk
3,0,1,0,1,1,Low Risk
4,1,1,1,0,1,High Risk


In [96]:
# Check the data types of the remaining columns to ensure the feature columns are all int64:

alldata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   mental_health    25000 non-null  int64 
 1   alcohol          25000 non-null  int64 
 2   drugs            25000 non-null  int64 
 3   fixed_abode      25000 non-null  int64 
 4   repeat_offender  25000 non-null  int64 
 5   risk_level       25000 non-null  object
dtypes: int64(5), object(1)
memory usage: 1.1+ MB


In [97]:
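# Convert the categorical risk_level column into integer codes for use by the models. LabelEncoder numbers the classes in
# alphabetical order (High Risk = 0, Low Risk = 1, Medium Risk = 2), so the codes do not reflect the order of severity: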

le = LabelEncoder()
alldata['risk_level_num'] = le.fit_transform(alldata['risk_level'])

alldata.head(10)

,mental_health,alcohol,drugs,fixed_abode,repeat_offender,risk_level,risk_level_num
0,1,1,0,1,0,Low Risk,1
1,0,0,1,0,1,Medium Risk,2
2,1,1,0,1,1,High Risk,0
3,0,1,0,1,1,Low Risk,1
4,1,1,1,0,1,High Risk,0
5,0,1,1,0,1,High Risk,0
6,1,1,0,1,1,High Risk,0
7,1,1,0,0,1,High Risk,0
8,0,0,0,0,1,Medium Risk,2
9,1,0,1,0,1,Low Risk,1
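
The codes in `risk_level_num` come from sorting the distinct labels and numbering them in that order, which is why 'High Risk' is 0 and 'Low Risk' is 1. A minimal plain-Python sketch of the same mapping is given here for illustration; `label_to_code` and `code_to_label` are hypothetical names, and the sample labels are the first five rows of the table above.

```python
# The first five risk_level values from the table above.
labels = ["Low Risk", "Medium Risk", "High Risk", "Low Risk", "High Risk"]

# Sort the distinct labels, then number them in that order.
classes = sorted(set(labels))
label_to_code = {label: code for code, label in enumerate(classes)}

# The reverse mapping turns a numeric prediction back into a readable label.
code_to_label = {code: label for label, code in label_to_code.items()}

encoded = [label_to_code[label] for label in labels]
print(label_to_code)
print(encoded)
print([code_to_label[code] for code in encoded])
```

With the fitted encoder from the cell above, `le.inverse_transform` performs the same reverse lookup.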


In [98]:
# Drop the 'risk level' column as that is no longer needed:

alldata = alldata.drop('risk_level', axis=1)

alldata.head()

,mental_health,alcohol,drugs,fixed_abode,repeat_offender,risk_level_num
0,1,1,0,1,0,1
1,0,0,1,0,1,2
2,1,1,0,1,1,0
3,0,1,0,1,1,1
4,1,1,1,0,1,0


In [99]:
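# Take a copy of the dataframe for the scikit-learn-based models. We will need this original clean data again later to use for deep learning.
# (A plain assignment would only create a second name for the same dataframe, so .copy() is used.)

sk_modeldata = alldata.copy()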

In [100]:
# Set the independent (X) and dependent (y) variables:

X = sk_modeldata.drop('risk_level_num', axis=1)
y = sk_modeldata['risk_level_num']

In [101]:
# As we have found that the data is not balanced, create a SMOTE object to balance the data:

os = SMOTE(random_state=42)

In [102]:
# Balance the data

# Specify the new data sets.
os_data_X, os_data_y = os.fit_resample(X, y)  

# Create two DataFrames; one for X and one for y:
os_data_X = pd.DataFrame(data = os_data_X, columns = X.columns) 

os_data_y = pd.DataFrame(data = os_data_y, columns = ['risk_level_num'])

# View DataFrame.
print(os_data_X.head())

   mental_health  alcohol  drugs  fixed_abode  repeat_offender
0              1        1      0            1                0
1              0        0      1            0                1
2              1        1      0            1                1
3              0        1      0            1                1
4              1        1      1            0                1


In [103]:
print(os_data_y.head())

   risk_level_num
0               1
1               2
2               0
3               1
4               0


In [104]:
# Check to see if the data is now balanced:

os_data_y['risk_level_num'].value_counts()

1    12574
2    12574
0    12574
Name: risk_level_num, dtype: int64

In [105]:
# Split the balanced data into training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(os_data_X, os_data_y,
                                                    test_size=0.3,
                                                    random_state=42)
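
Note that the cells above resample the full dataset and split it afterwards, so rows synthesised from the whole dataset can land in the test set. A commonly used alternative is to split first and then oversample only the training portion. The sketch below shows that ordering on toy data using plain random duplication from the standard library, not SMOTE; all names and the toy rows are illustrative assumptions. With the objects in this notebook, the equivalent would be to call `train_test_split` on `X` and `y` first and then `fit_resample` on the training split only.

```python
import random
from collections import Counter

rng = random.Random(42)

# Toy labelled rows (features, label) with an imbalance similar in spirit
# to the notebook's data. The feature values are placeholders.
rows = [((0, 0), "Low")] * 60 + [((1, 1), "High")] * 30 + [((0, 1), "Medium")] * 10
rng.shuffle(rows)

# 1. Split first, so the test set contains original rows only.
cut = int(len(rows) * 0.7)
train, test = rows[:cut], rows[cut:]

# 2. Oversample the training set only (random duplication, not SMOTE).
by_label = {}
for row in train:
    by_label.setdefault(row[1], []).append(row)
target = max(len(group) for group in by_label.values())

balanced_train = []
for group in by_label.values():
    balanced_train.extend(group)
    # Top each class up to the size of the largest class.
    balanced_train.extend(rng.choices(group, k=target - len(group)))

print(Counter(label for _, label in balanced_train))
print(Counter(label for _, label in test))
```

The test set keeps its natural class mix, which gives a more realistic picture of how the model would behave on unseen cases.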

## Random Forest Model

In [107]:
# A Random Forest will be applied. This is more resource-intensive than a single Decision Tree (as it is an ensemble of many trees), but as
# the dataset does not have too many variables, this may work well.  Random Forests are generally more accurate and less prone to
# overfitting than a single Decision Tree:

# Create a forest object based on the 'RandomForestClassifier' (first of all, with some initial hyperparameters):
forest = RandomForestClassifier(n_estimators=200, 
                                criterion = 'gini', 
                                min_samples_split = 2,
                                min_samples_leaf = 2,
                                max_features = 'sqrt',  # 'auto' meant 'sqrt' for classifiers and is rejected by recent scikit-learn releases
                                bootstrap = True,
                                n_jobs = -1,
                                random_state = 42)

# Train the model and generate predictions (y is passed as a 1-D Series, which is the shape scikit-learn expects):
forest.fit(X_train, y_train['risk_level_num'])

training_preds = forest.predict(X_train)
y_predict = forest.predict(X_test)
y_probs = forest.predict_proba(X_test)


In [108]:
# Print out the metrics and classification report to evaluate the performance of the model:

print(metrics.accuracy_score(y_train, training_preds))
print(metrics.accuracy_score(y_test, y_predict))
print(classification_report(y_test, y_predict))

1.0
1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3751
           1       1.00      1.00      1.00      3753
           2       1.00      1.00      1.00      3813

    accuracy                           1.00     11317
   macro avg       1.00      1.00      1.00     11317
weighted avg       1.00      1.00      1.00     11317



In [109]:
# Build a confusion matrix to compare actuals vs. predictions:
from sklearn.metrics import confusion_matrix

# Use a separate name for the result so that the imported confusion_matrix function is not overwritten:
confusion_matrix_rf = confusion_matrix(y_test, y_predict)

confusion = pd.DataFrame(confusion_matrix_rf, index=['is_high', 'is_low', 'is_medium'],
                         columns=['predicted_high', 'predicted_low', 'predicted_medium'])

# View the output:
confusion

,predicted_high,predicted_low,predicted_medium
is_high,3751,0,0
is_low,0,3753,0
is_medium,0,0,3813


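The model achieves 100% accuracy on both the training and test sets. This is expected rather than impressive: the risk label is a deterministic function of the five binary features, so there are only 32 possible feature combinations and a tree-based model can learn the rule for each one exactly. The result reflects the very specific patterns applied to this synthetic data and is not the performance that should be expected on real data.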

## Multinomial Logistic Regression

In [113]:
# We will apply Multinomial Logistic Regression (MLR).
# For the purposes of the exercise, this model may be sufficient owing to the simple synthetic patterns in the fake data.
# Reasons for choice: it provides a probability score for each outcome (what we need here), we have few categorical features (this model
# deals best with fewer features), there are no missing values and the X variables are independent of each other (no multicollinearity).
# In a real scenario with real data (and consequently more complex patterns and more features), a more robust model
# could be required that balances computational cost with accuracy.

# Create a scaler that maps the X data to the range 0 to 1 (the features are already 0/1, so this leaves them unchanged):
scaler = MinMaxScaler(feature_range = (0, 1))

# Fit the scaler on the training data only:
scaler.fit(X_train)

# Scale the X_train data set:
X_train = scaler.transform(X_train)
# Scale the X_test data set with the same fitted scaler:
X_test = scaler.transform(X_test)

# Define and fit the MLR model (y is passed as a 1-D Series, which is the shape scikit-learn expects).
# Note: penalty='none' is the spelling used by the scikit-learn version this was run on; from version 1.2 onwards the
# equivalent setting is penalty=None.
MLR = LogisticRegression(random_state=42, 
                         multi_class='multinomial', 
                         penalty='none', 
                         solver='newton-cg').fit(X_train, y_train['risk_level_num'])

# Generate predictions and class probabilities from the fitted model:
MLR_training_preds = MLR.predict(X_train)
MLR_y_predict = MLR.predict(X_test)
MLR_y_probs = MLR.predict_proba(X_test)

# Retrieve the model's parameters using 'get_params':
params = MLR.get_params() 

# Print the parameters, intercept, and coefficients:
print(params)  
print("Intercept: \n", MLR.intercept_)
print("Coefficients: \n", MLR.coef_)

{'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'multinomial', 'n_jobs': None, 'penalty': 'none', 'random_state': 42, 'solver': 'newton-cg', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
Intercept: 
 [-6.28322103  3.43757923  2.8456418 ]
Coefficients: 
 [[ 4.50516858  1.92548162  2.27711653 -0.43633122  2.37436838]
 [-2.66649154 -0.73697263 -0.9966514   0.65219737 -2.17037351]
 [-1.83867704 -1.18850898 -1.28046513 -0.21586615 -0.20399487]]
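
To see how these numbers turn into a prediction, the sketch below recomputes class probabilities by hand: one linear score per class (intercept plus weighted sum of the five flags), followed by a softmax, which is how a multinomial logistic regression produces its probabilities. The values are copied from the printed output above, rounded as shown, and `softmax_proba` is a hypothetical helper name.

```python
import math

# Values copied from the printed intercept and coefficient arrays.
intercept = [-6.28322103, 3.43757923, 2.8456418]
coef = [
    [4.50516858, 1.92548162, 2.27711653, -0.43633122, 2.37436838],
    [-2.66649154, -0.73697263, -0.9966514, 0.65219737, -2.17037351],
    [-1.83867704, -1.18850898, -1.28046513, -0.21586615, -0.20399487],
]


def softmax_proba(x):
    """Return class probabilities in the order 0 = high, 1 = low, 2 = medium."""
    scores = [
        b + sum(w * xi for w, xi in zip(row, x))
        for b, row in zip(intercept, coef)
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Feature order: mental_health, alcohol, drugs, fixed_abode, repeat_offender.
for x in ([1, 1, 1, 1, 1], [0, 0, 0, 1, 1]):
    probs = softmax_proba(x)
    print(x, [round(p, 3) for p in probs])
```

For `[1, 1, 1, 1, 1]` the highest score belongs to class 0 (high risk), in line with the labelling rules. For `[0, 0, 0, 1, 1]`, which the rules label as low risk, the highest score belongs to class 2 (medium risk), which illustrates the kind of low/medium confusion this model makes.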


In [114]:
# Print out the metrics and classification report to evaluate the performance of the model:

print(metrics.accuracy_score(y_train, MLR_training_preds))
print(metrics.accuracy_score(y_test, MLR_y_predict))
print(classification_report(y_test, MLR_y_predict))

0.6813482295019883
0.6803039674825484
              precision    recall  f1-score   support

           0       0.81      0.81      0.81      3751
           1       0.57      0.56      0.56      3753
           2       0.66      0.67      0.66      3813

    accuracy                           0.68     11317
   macro avg       0.68      0.68      0.68     11317
weighted avg       0.68      0.68      0.68     11317



In [115]:
# Build a confusion matrix to compare actuals vs. predictions:
from sklearn.metrics import confusion_matrix

confusion_matrix_mlr = confusion_matrix(y_test, MLR_y_predict)

confusion = pd.DataFrame(confusion_matrix_mlr, index=['is_high', 'is_low', 'is_medium'],
                         columns=['predicted_high', 'predicted_low', 'predicted_medium'])

# View the output:
confusion

,predicted_high,predicted_low,predicted_medium
is_high,3051,348,352
is_low,706,2110,937
is_medium,0,1275,2538
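
The figures in the classification report can be recovered directly from this confusion matrix. As a worked check, the sketch below computes accuracy, precision and recall by hand from the counts in the table; the variable names are illustrative only.

```python
# Rows are actual classes, columns are predicted classes (high, low, medium).
matrix = [
    [3051, 348, 352],
    [706, 2110, 937],
    [0, 1275, 2538],
]
names = ["high", "low", "medium"]

total = sum(sum(row) for row in matrix)

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = sum(matrix[i][i] for i in range(3)) / total

# Recall: correct predictions over the actual count for that class (row sum).
recall = [matrix[i][i] / sum(matrix[i]) for i in range(3)]

# Precision: correct predictions over everything predicted as that class (column sum).
precision = [matrix[i][i] / sum(row[i] for row in matrix) for i in range(3)]

print(f"accuracy: {accuracy:.4f}")
for name, p, r in zip(names, precision, recall):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

The low risk row shows where the model struggles: 937 of the 3,753 low risk cases are predicted as medium, and 1,275 of the 3,813 medium risk cases are predicted as low.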


In contrast to the Random Forest, this model has not done well, at only 68% accuracy. It has particularly struggled to differentiate between medium and low risk cases. This is in some ways not surprising, because the patterns were put into the data artificially as specific combinations of factors, and a logistic regression without interaction terms fits a single linear weight per feature, so it cannot represent combination rules of this kind. The Random Forest, by contrast, splits on several features in sequence, so it can isolate each high and medium risk combination and treat anything else as low.

With this in mind (and for the purposes of this exercise), we will use the Random Forest model as the final solution. In a scenario with real data, we would expect more nuances in the data and genuine patterns to be found. As such, the MLR (with some hyperparameter tuning) may do considerably better, whereas the Random Forest would not be expected to achieve such a high accuracy.

## Test Predictions using the Random Forest Model

In [161]:
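import pickle

# Save the trained forest to disk, then load it back to confirm that the saved model can be used for predictions:
model_filename = 'rfmodel.pkl'
with open(model_filename, 'wb') as f:
    pickle.dump(forest, f)

with open(model_filename, 'rb') as f:
    model = pickle.load(f)

# Feature order: mental_health, alcohol, drugs, fixed_abode, repeat_offender.
print(model.predict([[1,0,0,1,1]]))

print(model.predict_proba([[1,0,0,1,1]]))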

[1]
[[0. 1. 0.]]




In [163]:
print(model.predict([[1,1,1,1,1]]))

print(model.predict_proba([[1,1,1,1,1]]))

[0]
[[1. 0. 0.]]




In [165]:
print(model.predict([[0,0,0,1,1]]))

print(model.predict_proba([[0,0,0,1,1]]))

[1]
[[0. 1. 0.]]




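We can see that the model returns a risk class for any values entered for the five X independent variables. Using the encoding 0 = high, 1 = low and 2 = medium, the three inputs above are predicted as low, high and low risk respectively, which matches the labelling rules.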

