# Lab 04, Feb 11 2020.
### Due Feb 18, 2020, before lab.
### Building your own COMPAS-like predictor

In this lab, you will build your own COMPAS like predictor with the data we have been using for the last two labs. To do this, we will use logistic regression. To recall logistic regression, refer to [videos on Logistic Regression posted on the schedule for today](https://www.youtube.com/watch?v=vN5cNN2-HWE) and to [Lab 1](https://render.githubusercontent.com/view/ipynb?commit=06e777dc2ca6fea2d8bd7e98919ac8dc8847c051&enc_url=68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f4275636b6e656c6c466169726e6573732f435343493337392d535032302f303665373737646332636136666561326438626437653938393139616338646338383437633035312f4c6162732f6c6162312f6c6162312e6970796e62&nwo=BucknellFairness%2FCSCI379-SP20&path=Labs%2Flab1%2Flab1.ipynb&repository_id=233859195&repository_type=Repository#Onto-Classification) on how to use it in Python.

### Part 1 Getting Started
To get started, as before, create a `lab4` folder in your local repo, with a `data` folder inside it. In the `data` folder, place the data file we have been using for labs 2 and 3. (The csv file can be downloaded from [here](https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv)).



In [None]:
# Making sure we have all the right libraries
%matplotlib inline

import pandas as pd
import pylab
import numpy as np

import matplotlib.pyplot as plt

Loading the data

In [None]:

data = pd.read_csv('data/compas-scores-two-years.csv', index_col=0)

Now we will filter out rows where **days_b_screening_arrest** is over 30 or under -30, leaving us with 6,172 rows. 

In [None]:
#Filtering out data
filterData = data[(data['days_b_screening_arrest'] <= 30) & (data['days_b_screening_arrest'] >= -30)]

In [None]:
#checking if the data is the correct size
filterData.shape
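
The filter above can also be written with `Series.between`, which is inclusive on both ends by default. The sketch below uses a small made-up column (not the COMPAS file) to show that the two forms select the same rows, and that rows with a missing value are dropped by either one.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the COMPAS csv)
toy = pd.DataFrame({'days_b_screening_arrest': [-45, -30, 0, 12, 30, 31, None]})
col = toy['days_b_screening_arrest']

# Explicit comparison, same form as the lab's filter
explicit = toy[(col <= 30) & (col >= -30)]

# Equivalent form: between() includes both endpoints by default
with_between = toy[col.between(-30, 30)]

print(explicit)
print(explicit.equals(with_between))
```

Comparisons against a missing value evaluate to `False`, so the row with `None` never passes the filter.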

We are going to use logistic regression to try to build our own predictor. To begin with, we will use the following variables.

* Age
* Sex
* Felony or Misdemeanor charge (`c_charge_degree`)
* Number of prior arrests (`priors_count`)

 We will also try this both with and without race as a predictive factor.

In case you haven't seen this before, here is a quick way of getting all the column names for your Pandas dataframe. As you can see there are 45+ features in this data frame, but we will find that only a few are enough to achieve an accuracy comparable to COMPAS.

In [None]:
filterData.columns

Here is a way of getting a quick look at the data by an attribute.

In [None]:
filterData.sex.value_counts()

The **crosstab** function in pandas is a useful function to familiarize yourself with.

In [None]:
# recidivism rates by race
recid_by_race = pd.crosstab(filterData.race, filterData.two_year_recid)
recid_by_race 

In [None]:
# You can insert another column in this dataframe called "rate"
recid_by_race['rate'] = recid_by_race[1]/recid_by_race.sum(axis = 1)
recid_by_race
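
As an alternative to adding the column by hand, `pd.crosstab` accepts `normalize='index'`, which divides each row by its row total. The sketch below uses made-up toy data (the names `toy`, `group`, and `outcome` are illustrative, not columns of the COMPAS file).

```python
import pandas as pd

# Toy data (hypothetical): a group label and a 0/1 outcome
toy = pd.DataFrame({
    'group':   ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
    'outcome': [1, 0, 0, 1, 1, 1, 0, 0],
})

# Raw counts per group and outcome
counts = pd.crosstab(toy.group, toy.outcome)

# normalize='index' turns each row into proportions,
# so column 1 holds the per-group rate directly
rates = pd.crosstab(toy.group, toy.outcome, normalize='index')

print(counts)
print(rates)
```

One practical difference: the normalized table is computed fresh each time, whereas re-running a cell that adds a `rate` column to an existing table would include that column in the row sums.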

**Task 1** Now create a similar dataframe to view recidivism rates by sex.

In [None]:
# Answer to Task 1


### Before you move further along, answer the following questions. Inspect the output of `filterData.columns` to answer the first two.

* Q1: Which variable in the COMPAS data represents the outcomes of the predictor built by Northpointe?
* Q2: Which variable represents the "reality"?
* Q3: What is a confusion matrix and why is it useful?

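We can also use the `crosstab` function to derive the confusion matrix for a predictor. Here we do so for the predictor analyzed by ProPublica, which we have seen in our prior labs: a person is predicted to recidivate when their `score_text` is anything other than 'Low'.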

In [None]:
guessed_recid = filterData['score_text']!= 'Low'
actual_recid = filterData.two_year_recid == 1
conf_matrix = pd.crosstab(actual_recid,guessed_recid, rownames = ['actual_recid'], colnames = ['guessed_recid'])
conf_matrix

### Part 2 Using logistic regression:
First, we will convert `age_cat`, `sex` and `c_charge_degree` (which are all categorical data) to what are called indicator variables. Make sure you read [this resource on logistic regression and so called dummy variables](https://www.statisticssolutions.com/dummy-coding-the-how-and-why/) before going any further.

 The code below uses the pandas function `get_dummies` to convert `age_cat`, `sex`, and `c_charge_degree` to dummy variables, and concatenates the resulting data with `priors_count` into a new data frame.

In [None]:
features = pd.concat(
    [pd.get_dummies(filterData.age_cat, prefix='age'),
     pd.get_dummies(filterData.sex, prefix='sex'),
     pd.get_dummies(filterData.c_charge_degree, prefix='degree'), # felony or misdemeanor charge ('F' or 'M')
     filterData.priors_count],
    axis=1)

Inspect the features dataframe.

In [None]:
features.head()

Now, we will drop one of the [dummy variables](https://en.wikiversity.org/wiki/Dummy_variable_(statistics)) for each category because they are redundant.

In [None]:
features.drop(['age_25 - 45', 'sex_Female', 'degree_M'], axis=1, inplace=True)


In [None]:
features.head()
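
To see why one dummy column per category is redundant, the sketch below uses a tiny made-up `sex` series (not the COMPAS data). Every row has exactly one dummy set to 1, so any one column can be reconstructed from the others.

```python
import pandas as pd

# Toy values (hypothetical)
sex = pd.Series(['Male', 'Female', 'Male', 'Male'], name='sex')

# astype(int) gives 0/1 columns regardless of the pandas version's default dtype
dummies = pd.get_dummies(sex, prefix='sex').astype(int)

# Each row has exactly one 1, so the dummy columns always sum to 1 ...
row_sums = dummies.sum(axis=1)

# ... which means sex_Female carries no information beyond sex_Male
recovered_female = 1 - dummies['sex_Male']

# Dropping one column per category loses nothing
reduced = dummies.drop(columns=['sex_Female'])
print(reduced)
```

The dropped category becomes the baseline: the remaining coefficient is read relative to it.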

In [None]:
# The true target variable we're trying
# to predict: whether someone is re-arrested.
target = filterData.two_year_recid

In [None]:
# get only the values in the features 
x = features.values
# the values in the target
y = target.values

In [None]:
# importing what we need
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [None]:
lr = LogisticRegression()

In [None]:
#fit the LR model to x. 
lr.fit(x,y)

Now, we will try to understand what the coefficients of the LR model mean.

In [None]:

coeffs = pd.DataFrame(np.exp(lr.coef_), columns=features.columns)
coeffs

The model thinks that:

* being young (<25) more than doubles your odds of recidivism
* but being older than 45 roughly halves your odds
* being male increases your odds by 40%
* every prior arrest increases your odds by 18%
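
Note that these statements are about odds, not probabilities. The sketch below converts an odds multiplier into a change in probability; the helper `apply_odds_ratio` and the baseline of 0.30 are illustrative assumptions, not values from the fitted model.

```python
def apply_odds_ratio(p, odds_ratio):
    """Return the probability obtained by multiplying the odds of p by odds_ratio."""
    odds = p / (1 - p)              # probability -> odds
    new_odds = odds * odds_ratio    # scale the odds
    return new_odds / (1 + new_odds)  # odds -> probability

baseline = 0.30  # hypothetical baseline probability
for ratio in (2.0, 0.5, 1.4, 1.18):
    print(ratio, round(apply_odds_ratio(baseline, ratio), 3))
```

With a baseline of 0.30, doubling the odds gives a probability of 6/13, about 0.46, not 0.60.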

Q: Explain why this is the case. You might want to review the [video on coefficients in logistic regression](https://www.youtube.com/watch?v=vN5cNN2-HWE).

**Task 2** Now make a prediction according to the LR model, and compare it to the actual target variable. 

In [None]:
# assume y_pred has the prediction from the LR model.
y_pred =  # fill in your answer here. In the code below
          # I'm assuming that y_pred is a numpy array

# this code converts the y_pred numpy array into a Pandas series
guessed=pd.Series(y_pred, index=features.index)==1

# This is the target variable, also as a series.
actual=filterData.two_year_recid==1

# The confusion matrix for the guessed and actual prediction
cm = pd.crosstab(guessed, actual, rownames=['guessed'], colnames=['actual'])
cm


The code below has helper functions for us to print the confusion matrix and the accuracy rates for a predictor. It will be of use later on.

In [None]:
# cm is a confusion matrix. The rows are guessed, the columns are actual 
def print_ppv_fpv(cm):
    # takes in a confusion matrix
    # the indices here are [col][row] or [actual][guessed]
    # prints the various metrics
    TN = cm[False][False]   
    TP = cm[True][True]
    FN = cm[True][False]
    FP = cm[False][True]
    print('Accuracy: ', (TN+TP)/(TN+TP+FN+FP))
    print('PPV: ', TP / (TP + FP))
    print('FPR: ', FP / (FP + TN))
    print('FNR: ', FN / (FN + TP))
    print()

def print_metrics(guessed, actual):
    # takes in series of guessed and actual predictions
    cm = pd.crosstab(guessed, actual, rownames=['guessed'], colnames=['actual'])
    print(cm)
    print()
    print_ppv_fpv(cm)
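
A quick way to sanity-check these formulas is to compute the four counts directly from boolean masks on a tiny hand-made example. The series `guessed_toy` and `actual_toy` below are hypothetical and small enough to verify by eye.

```python
import pandas as pd

# Hypothetical toy predictions for 8 people; True means "recidivates"
guessed_toy = pd.Series([True, True, True, False, False, False, False, True])
actual_toy = pd.Series([True, True, False, False, False, True, False, False])

# Count each cell of the confusion matrix with boolean masks
TP = int((guessed_toy & actual_toy).sum())    # guessed True,  actually True
FP = int((guessed_toy & ~actual_toy).sum())   # guessed True,  actually False
TN = int((~guessed_toy & ~actual_toy).sum())  # guessed False, actually False
FN = int((~guessed_toy & actual_toy).sum())   # guessed False, actually True

rates = {
    'Accuracy': (TN + TP) / (TN + TP + FN + FP),
    'PPV': TP / (TP + FP),
    'FPR': FP / (FP + TN),
    'FNR': FN / (FN + TP),
}
print(TP, FP, TN, FN)
print(rates)
```

Passing the same two series to `print_metrics` should report matching values, since both approaches count the same four cells.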

In [None]:
# computing the accuracy of our LR model.
print('Our Logistic Regression for the entire group')
print_metrics(guessed, actual)


**Task 3** Now use the **print_metrics** function to print the confusion matrix and all the rates for both the 'African-American' and 'Caucasian' groups according to our predictor.

In [None]:
# Answer to Task 3
print('White')


**Task 4** How does this compare to the rates for the COMPAS predictor from Labs 2 and 3?

In [None]:
# Answer to Task 4



**Task 5** Now, repeat the entire prediction process by including race in the feature variables. How do the different rates compare to the case when race was not explicitly a feature variable? Do you have an explanation for this?

In [None]:
#Answer to Task 5


**Task 6** Refer to Lab 1 to plot an ROC curve for the first LR predictor that does *not* include race as a feature. Plot the ROC curve for the entire population as well as the ROC curves for the 'African-American' and 'Caucasian' groups.

Plot another figure that shows the ROC curves when race is explicitly used as a feature.
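
As a reminder of the mechanics, here is a minimal sketch on synthetic data (not the COMPAS features) using `roc_curve` and `roc_auc_score` from `sklearn.metrics`. The arrays `x_toy`, `y_toy`, and `group` are made up for illustration; the approach in Lab 1 may differ in its details.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic data: the outcome depends on the first feature plus noise
rng = np.random.default_rng(0)
n = 400
x_toy = rng.normal(size=(n, 2))
y_toy = (x_toy[:, 0] + rng.normal(scale=1.0, size=n) > 0).astype(int)
group = rng.integers(0, 2, size=n)  # hypothetical group labels 0 / 1

model = LogisticRegression().fit(x_toy, y_toy)

# ROC curves need scores (probabilities), not hard 0/1 predictions
scores = model.predict_proba(x_toy)[:, 1]

# Curve for everyone
fpr, tpr, thresholds = roc_curve(y_toy, scores)
auc_all = roc_auc_score(y_toy, scores)

# Curve for one subgroup: apply the same boolean mask to labels and scores
mask = group == 0
fpr0, tpr0, _ = roc_curve(y_toy[mask], scores[mask])

print('AUC (all):', round(auc_all, 3))
# To draw the curves: plt.plot(fpr, tpr) and plt.plot(fpr0, tpr0)
```

The subgroup curve reuses the scores from the single model fit on everyone; only the rows being evaluated change.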

**Task 7** Now, we will use 7 features to train our logistic regression classifier: age, sex, number of juvenile misdemeanors, number of juvenile felonies, number of prior (nonjuvenile) crimes, crime degree, and crime charge. How does this compare to our first classifier, which used only four features?
