# Loading Data and Making Predictions 

Here we briefly discuss how to load the [Cook County Dataset](https://datacatalog.cookcountyil.gov/Courts/Sentencing/tg8v-tm6u/data) and apply the preprocessing needed to compute a sentencing discrepancy.  We also demonstrate how the same functions can be called either through the notebook (Python) API or through the REST API when the Flask server is running.

In [2]:
import pickle
import json
import pandas as pd
import os
import requests
import sys

## Loading and Preparing the Dataset

The model was trained on the Cook County Dataset, so we will load this data and make the same modifications to the schema that were made during the training process.  Let's make sure we download the right dataset:

In [3]:
# Downloading Cook County Dataset
if os.path.isfile('../server/models/Sentencing-cook_county.csv'):
    print("Cook County Data Already Downloaded")
else:
    !wget -O ../server/models/Sentencing-cook_county.csv "https://datacatalog.cookcountyil.gov/api/views/tg8v-tm6u/rows.csv?accessType=DOWNLOAD"

--2020-10-15 04:47:46--  https://datacatalog.cookcountyil.gov/api/views/tg8v-tm6u/rows.csv?accessType=DOWNLOAD
Resolving datacatalog.cookcountyil.gov (datacatalog.cookcountyil.gov)... 52.206.140.205, 52.206.68.26, 52.206.140.199
Connecting to datacatalog.cookcountyil.gov (datacatalog.cookcountyil.gov)|52.206.140.205|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘../server/models/Sentencing-cook_county.csv’

../server/models/Se     [    <=>             ] 117.36M  8.97MB/s    in 13s     

2020-10-15 04:48:00 (9.04 MB/s) - ‘../server/models/Sentencing-cook_county.csv’ saved [123066339]
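If `wget` is not available (for example on Windows), the same check-then-download step can be sketched with the Python standard library alone. The helper name `download_if_missing` is hypothetical and not part of this project; it writes to a temporary `.part` file first so that an interrupted download does not leave a truncated CSV behind.

```python
import os
import shutil
import urllib.request


def download_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists.

    Returns True if a download happened, False if the file was already there.
    (Hypothetical helper, shown only as an alternative to the `!wget` cell.)
    """
    if os.path.isfile(dest):
        return False
    tmp = dest + ".part"
    # Stream the response to disk instead of holding ~120 MB in memory.
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    os.replace(tmp, dest)  # move into place only after a complete download
    return True


# Example call (not executed here because it needs network access):
# download_if_missing(
#     "https://datacatalog.cookcountyil.gov/api/views/tg8v-tm6u/rows.csv?accessType=DOWNLOAD",
#     "../server/models/Sentencing-cook_county.csv",
# )
```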



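We are going to use some data processing functions that were written especially for this project (`clean_data` and `estimate_discrepancy`, from the `predict` module).  They are imported further down, in the cell where they are first used.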

It is possible to train a new model using the [Model Notebook](Model.ipynb).  Each new model has a timestamp in its name, so we can confirm that the model loaded here is the one being used by the server by comparing model names:

In [4]:
model_name = 'sentence_pipe_mae1.555_2020-10-10_02h46m24s'

cwd = os.getcwd()
model_path = cwd + '/../server/models/' + model_name + '.pkl'

# loading trained model
with open(model_path, 'rb') as f:
    model = pickle.load(f)
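Because the model name carries its own metadata, it can be handy to split it into parts when comparing models. The sketch below assumes the naming pattern `<prefix>_mae<value>_<YYYY-MM-DD>_<HH>h<MM>m<SS>s`, which is inferred from the single name used in this notebook; the helper `parse_model_name` is hypothetical, and reading `mae` as the model's mean absolute error is also an assumption.

```python
import re
from datetime import datetime

# Pattern inferred from 'sentence_pipe_mae1.555_2020-10-10_02h46m24s' (assumption).
MODEL_NAME_PATTERN = re.compile(
    r"^(?P<prefix>.+)_mae(?P<mae>\d+(?:\.\d+)?)"
    r"_(?P<stamp>\d{4}-\d{2}-\d{2}_\d{2}h\d{2}m\d{2}s)$"
)


def parse_model_name(name):
    """Split a model name into prefix, error score and training timestamp."""
    match = MODEL_NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name!r}")
    return {
        "prefix": match["prefix"],
        "mae": float(match["mae"]),
        "trained_at": datetime.strptime(match["stamp"], "%Y-%m-%d_%Hh%Mm%Ss"),
    }


info = parse_model_name("sentence_pipe_mae1.555_2020-10-10_02h46m24s")
print(info["prefix"], info["mae"], info["trained_at"].isoformat())
```

With the timestamps parsed, two model names can be ordered by training time rather than compared only as strings.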

We can look at the columns that are used in the model prediction.  Knowing what columns serve as inputs to the model will help us formulate the `predict` requests.

In [5]:
orig_cols = model[0]._df_columns
print(orig_cols)

Index(['PRIMARY_CHARGE_FLAG', 'DISPOSITION_CHARGED_OFFENSE_TITLE',
       'CHARGE_COUNT', 'DISPOSITION_CHARGED_CLASS', 'CHARGE_DISPOSITION',
       'SENTENCE_JUDGE', 'SENTENCE_PHASE', 'AGE_AT_INCIDENT', 'RACE', 'GENDER',
       'LAW_ENFORCEMENT_AGENCY', 'UPDATED_OFFENSE_CATEGORY'],
      dtype='object')
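Since these twelve columns are what the model expects, a request payload can be checked against them before it is sent. The following is a minimal sketch; `REQUIRED_COLUMNS` is copied from the output above, while the helper `missing_columns` and the sample payload are hypothetical.

```python
# Column names copied from the printed `orig_cols` above.
REQUIRED_COLUMNS = [
    "PRIMARY_CHARGE_FLAG", "DISPOSITION_CHARGED_OFFENSE_TITLE",
    "CHARGE_COUNT", "DISPOSITION_CHARGED_CLASS", "CHARGE_DISPOSITION",
    "SENTENCE_JUDGE", "SENTENCE_PHASE", "AGE_AT_INCIDENT", "RACE", "GENDER",
    "LAW_ENFORCEMENT_AGENCY", "UPDATED_OFFENSE_CATEGORY",
]


def missing_columns(payload, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `payload`."""
    return [col for col in required if col not in payload]


# Illustrative, incomplete payload (values are made up).
payload = {"CHARGE_COUNT": 2, "RACE": "Black", "GENDER": "Female"}
print(missing_columns(payload))
```

An empty list means the payload names every model input; extra keys are ignored by this check.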


The raw CSV has 41 columns, many of which are not useful for any training strategy:

In [6]:
csv = '../server/models/Sentencing-cook_county.csv'
# read only the header line, and close the file when done
with open(csv) as f:
    all_cols = f.readline()
print(all_cols)
print("number of columns in csv file:  " + str(len(all_cols.split(","))))

CASE_ID,CASE_PARTICIPANT_ID,RECEIVED_DATE,OFFENSE_CATEGORY,PRIMARY_CHARGE_FLAG,CHARGE_ID,CHARGE_VERSION_ID,DISPOSITION_CHARGED_OFFENSE_TITLE,CHARGE_COUNT,DISPOSITION_DATE,DISPOSITION_CHARGED_CHAPTER,DISPOSITION_CHARGED_ACT,DISPOSITION_CHARGED_SECTION,DISPOSITION_CHARGED_CLASS,DISPOSITION_CHARGED_AOIC,CHARGE_DISPOSITION,CHARGE_DISPOSITION_REASON,SENTENCE_JUDGE,SENTENCE_COURT_NAME,SENTENCE_COURT_FACILITY,SENTENCE_PHASE,SENTENCE_DATE,SENTENCE_TYPE,CURRENT_SENTENCE_FLAG,COMMITMENT_TYPE,COMMITMENT_TERM,COMMITMENT_UNIT,LENGTH_OF_CASE_in_Days,AGE_AT_INCIDENT,RACE,GENDER,INCIDENT_CITY,INCIDENT_BEGIN_DATE,INCIDENT_END_DATE,LAW_ENFORCEMENT_AGENCY,LAW_ENFORCEMENT_UNIT,ARREST_DATE,FELONY_REVIEW_DATE,FELONY_REVIEW_RESULT,ARRAIGNMENT_DATE,UPDATED_OFFENSE_CATEGORY

number of columns in csv file:  41
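Counting columns with `split(",")` works here because none of the header names contain a comma. As a more defensive sketch, the standard `csv` module handles quoted fields correctly; the helper `read_header` and the in-memory sample below are illustrative only.

```python
import csv
import io


def read_header(handle):
    """Return the first row of a CSV stream as a list of column names."""
    return next(csv.reader(handle))


# In-memory sample with a quoted header containing a comma (made-up data).
sample = io.StringIO('CASE_ID,"TITLE, WITH COMMA",RACE\n1,x,y\n')
header = read_header(sample)
print(header, len(header))

# For a real file, open it with newline="" as the csv docs recommend:
# with open(csv_path, newline="") as f:
#     header = read_header(f)
```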


We can narrow these columns down to the features that are more likely to be relevant to modeling.  As we will see, this column list will be modified further as we home in on the particular model that will be used, but it is a much better place to start than the full dataset:

In [7]:
#Loading directly from Cook County Data
csv = '../server/models/Sentencing-cook_county.csv'
cols = ['CHARGE_COUNT',
        'CHARGE_DISPOSITION', 'UPDATED_OFFENSE_CATEGORY', 'PRIMARY_CHARGE_FLAG',
        'DISPOSITION_CHARGED_OFFENSE_TITLE', 'DISPOSITION_CHARGED_CLASS', 'SENTENCE_JUDGE',
        'SENTENCE_PHASE', 'COMMITMENT_TERM', 'COMMITMENT_UNIT', 'LENGTH_OF_CASE_in_Days',
        'AGE_AT_INCIDENT', 'RACE', 'GENDER', 'INCIDENT_CITY', 'LAW_ENFORCEMENT_AGENCY',
        'LAW_ENFORCEMENT_UNIT', 'SENTENCE_TYPE']

# dataset for including criminal history information
orig_data = pd.read_csv(csv, usecols=cols)



The loaded data contains many records for which there are no sentences applied.  We want to filter these out and apply some other operations.  This is done using `clean_data`.  Here, we only remove rows, and do not remove unnecessary columns (yet).

In [8]:
from predict import clean_data, estimate_discrepancy
data = clean_data(orig_data.copy(), removeColumns=False)
print("number of records in original set:" + str(len(orig_data)))
print("number of records in filtered set:" + str(len(data)))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['RACE'] = data['RACE'].map(standard_race_map)


number of records in original set:243006
number of records in filtered set:39973
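The `SettingWithCopyWarning` in the output above is raised when a column is assigned on a DataFrame that pandas considers a slice of another one. A common way to avoid it is to take an explicit `.copy()` right after filtering. The sketch below shows the pattern on a toy frame; `race_map` and the sample rows are made up and are not the project's `standard_race_map` or the actual internals of `clean_data`.

```python
import pandas as pd

# Hypothetical mapping, standing in for the project's own race map.
race_map = {"BLACK": "Black"}

df = pd.DataFrame({
    "RACE": ["BLACK", "White", "Asian"],
    "COMMITMENT_TERM": [3.0, None, 2.0],
})

# Filter rows first, then copy so later assignments act on an independent frame.
filtered = df[df["COMMITMENT_TERM"].notna()].copy()

# Map known values; keep the original value where the map has no entry.
filtered["RACE"] = filtered["RACE"].map(race_map).fillna(filtered["RACE"])

print(filtered)
```

The original `df` is left untouched, which also makes it safe to compare record counts before and after filtering.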


We can verify that the schema has not changed:

In [9]:
print("columns from original data load:\n")
print(orig_data.columns)
print("\ncolumns after clean_data\n")
print(data.columns)

columns from original data load:

Index(['PRIMARY_CHARGE_FLAG', 'DISPOSITION_CHARGED_OFFENSE_TITLE',
       'CHARGE_COUNT', 'DISPOSITION_CHARGED_CLASS', 'CHARGE_DISPOSITION',
       'SENTENCE_JUDGE', 'SENTENCE_PHASE', 'SENTENCE_TYPE', 'COMMITMENT_TERM',
       'COMMITMENT_UNIT', 'LENGTH_OF_CASE_in_Days', 'AGE_AT_INCIDENT', 'RACE',
       'GENDER', 'INCIDENT_CITY', 'LAW_ENFORCEMENT_AGENCY',
       'LAW_ENFORCEMENT_UNIT', 'UPDATED_OFFENSE_CATEGORY'],
      dtype='object')

columns after clean_data

Index(['PRIMARY_CHARGE_FLAG', 'DISPOSITION_CHARGED_OFFENSE_TITLE',
       'CHARGE_COUNT', 'DISPOSITION_CHARGED_CLASS', 'CHARGE_DISPOSITION',
       'SENTENCE_JUDGE', 'SENTENCE_PHASE', 'SENTENCE_TYPE', 'COMMITMENT_TERM',
       'COMMITMENT_UNIT', 'LENGTH_OF_CASE_in_Days', 'AGE_AT_INCIDENT', 'RACE',
       'GENDER', 'INCIDENT_CITY', 'LAW_ENFORCEMENT_AGENCY',
       'LAW_ENFORCEMENT_UNIT', 'UPDATED_OFFENSE_CATEGORY'],
      dtype='object')
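Comparing the two printed indexes by eye is error-prone, so the same check can be made programmatically. This is a small sketch with a hypothetical helper, `schema_diff`, that works on any two sequences of column names (for example `orig_data.columns` and `data.columns`).

```python
def schema_diff(before, after):
    """Return (removed, added) column names between two schemas, preserving order."""
    before, after = list(before), list(after)
    removed = [col for col in before if col not in after]
    added = [col for col in after if col not in before]
    return removed, added


# Illustrative column lists (shortened).
cols_before = ["CHARGE_COUNT", "RACE", "GENDER", "SENTENCE_TYPE"]
cols_after = ["CHARGE_COUNT", "RACE", "GENDER"]
print(schema_diff(cols_before, cols_after))
```

In the notebook, `schema_diff(orig_data.columns, data.columns)` would be expected to return two empty lists when `removeColumns=False`.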


## Making predictions with the Python API

We can take the first row (`rowNumber = 0`) and use that as an input to the `model.predict` method.  This model returns the expected length of a prison sentence in years for a person with a given set of features (including race).

In [10]:
rowNumber = 0
results = model.predict(data[orig_cols])
# this returns a list of results, we are only spot checking:
result = round(results[rowNumber], 3)

print("\nresult of model prediction: " + str(result) + " years")


result of model prediction: 4.132 years


The discrepancy calculation predicts the difference in sentencing if the person were of a different race.  For this calculation, we remove some columns and call `estimate_discrepancy`.

In [11]:
dataForDiscrepancyCalc = clean_data(data, removeColumns=True)[orig_cols]
print(dataForDiscrepancyCalc.columns)

# again, the full dataset is evaluated and we only spot check a single row
discrepancies  = estimate_discrepancy(model, dataForDiscrepancyCalc, return_pred=True)

discrepancy = round(discrepancies[0][rowNumber], 3)

print("difference in sentencing: "  + str(discrepancy) + " years")

Index(['PRIMARY_CHARGE_FLAG', 'DISPOSITION_CHARGED_OFFENSE_TITLE',
       'CHARGE_COUNT', 'DISPOSITION_CHARGED_CLASS', 'CHARGE_DISPOSITION',
       'SENTENCE_JUDGE', 'SENTENCE_PHASE', 'AGE_AT_INCIDENT', 'RACE', 'GENDER',
       'LAW_ENFORCEMENT_AGENCY', 'UPDATED_OFFENSE_CATEGORY'],
      dtype='object')
difference in sentencing: 0.211 years
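To make the idea of a discrepancy concrete, one common way to frame such a calculation is as a counterfactual: predict once with the record as given, once with the race field replaced, and take the difference. The sketch below illustrates only that general idea with a toy predictor. It is not the implementation of `estimate_discrepancy`, and the reference value, the sign convention, and the toy coefficients are all assumptions.

```python
def counterfactual_discrepancy(predict, record, attribute, reference_value):
    """Prediction for `record` minus the prediction with `attribute` replaced.

    `predict` is any callable mapping a record (dict) to a number.
    """
    counterfactual = dict(record, **{attribute: reference_value})
    return predict(record) - predict(counterfactual)


def toy_predict(record):
    """Made-up predictor in years; stands in for a trained model."""
    base = 2.0 + 0.5 * record["CHARGE_COUNT"]
    return base + (0.25 if record["RACE"] == "Black" else 0.0)


record = {"CHARGE_COUNT": 2, "RACE": "Black"}
gap = counterfactual_discrepancy(toy_predict, record, "RACE", "White")
print("difference in sentencing:", gap, "years")
```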


## Calling the REST API

### Launching the Flask Server and Making Requests from Notebook

The [README](https://github.ibm.com/nilmeier/Embrace-2020/blob/master/README.md) of this repository gives instructions on launching the server and making appropriate calls to the API using `curl` or Postman. To make REST calls from the notebook, you will need to launch the Flask server separately.

Download the repository and run the following command to start a Flask server running the model.
```
python manage.py run
```


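The server should be accessible at `localhost:5000`.  Note that the request cell below posts to port 3000, so make sure the port in `url` matches the one your server is actually listening on.
Use the `requests` module to send requests.  Here, we simply take the first row of our dataset and post a `predict` request.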

In [12]:
postJson = data.iloc[rowNumber,:].to_json()

print("JSON string used in REST call\n")
print(str(postJson))

url = "http://localhost:3000/predict"

resp = requests.post(url, data = postJson, headers = {'content-type':'application/json'})

print()
print(resp.text)
print("compare to (using direct python):\n  discrepancy:" + str(discrepancy))
print("  model_name: " + model_name)

JSON string used in REST call

{"PRIMARY_CHARGE_FLAG":false,"DISPOSITION_CHARGED_OFFENSE_TITLE":"[POSSESSION OF CONTROLLED SUBSTANCE WITH INTENT TO DELIVER\/ DELIVERY OF A CONTROLLED SUBSTANCE]","CHARGE_COUNT":2,"DISPOSITION_CHARGED_CLASS":"2","CHARGE_DISPOSITION":"Plea Of Guilty","SENTENCE_JUDGE":"Maura  Slattery Boyle","SENTENCE_PHASE":"Original Sentencing","SENTENCE_TYPE":"Prison","COMMITMENT_TERM":3.0,"COMMITMENT_UNIT":"Year(s)","LENGTH_OF_CASE_in_Days":336.0,"AGE_AT_INCIDENT":52.0,"RACE":"Black","GENDER":"Female","INCIDENT_CITY":"Chicago","LAW_ENFORCEMENT_AGENCY":"CHICAGO PD","LAW_ENFORCEMENT_UNIT":"District 25 - Grand Central","UPDATED_OFFENSE_CATEGORY":"Narcotics"}

{
  "model_name": "sentence_pipe_mae1.555_2020-10-10_02h46m24s", 
  "sentencing_discrepency": 0.211, 
  "severity": 0.555
}

compare to (using direct python):
  discrepancy:0.211
  model_name: sentence_pipe_mae1.555_2020-10-10_02h46m24s
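Rather than comparing the printed values by eye, the server response can be parsed and checked against the local result. The sketch below uses only the standard library; the helper `matches_local` and the tolerance are hypothetical, and the key `sentencing_discrepency` is spelled exactly as it appears in the response above.

```python
import json
import math


def matches_local(resp_text, local_discrepancy, local_model_name, tol=1e-3):
    """Check that a /predict response agrees with the locally computed values."""
    payload = json.loads(resp_text)
    same_model = payload.get("model_name") == local_model_name
    # A missing key becomes NaN, which never compares as close.
    server_value = payload.get("sentencing_discrepency", float("nan"))
    same_value = math.isclose(server_value, local_discrepancy, abs_tol=tol)
    return same_model and same_value


# Response text copied from the output above.
sample_response = """{
  "model_name": "sentence_pipe_mae1.555_2020-10-10_02h46m24s",
  "sentencing_discrepency": 0.211,
  "severity": 0.555
}"""

print(matches_local(sample_response, 0.211,
                    "sentence_pipe_mae1.555_2020-10-10_02h46m24s"))
```

In the notebook this would be called as `matches_local(resp.text, discrepancy, model_name)`.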


The severity calculation is slightly more involved, and is not discussed in this notebook.  You can review the code to see what it does if interested!

Notebook written by:
- [Noah Chasek Macfoy](https://www.linkedin.com/in/noah-chasek-macfoy) 
Data Scientist, IBM
- [Jerome Nilmeier, PhD](http://linkedin.com/in/nilmeier)
Developer Advocate and Data Scientist, IBM