In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

from google.colab import drive

drive.mount('/content/drive')



# Week 5 - Logistic Regression with Generalized Linear Models

Load in the COMPAS dataset ('compas-scores-two-years.csv')

**OR...**
* If you have your cleaned dataset from last week, the changes in Questions 1-4 should already be saved in it. You can import that dataset instead and skip to ***Question 5***.

In [None]:
#Import cleaned dataset if you have it


1) There are a number of rows we can't use for analysis.


*   If the charge or arrest date of a defendant's COMPAS-scored crime was not within 30 days of the date the person was scored, we assume for data-quality reasons that we do not have the right offense. (Remove rows where days_b_screening_arrest > 30 OR days_b_screening_arrest < -30.)
*   The recidivist flag, is_recid, is -1 if ProPublica could not find a COMPAS case at all. Remove these rows.

*   Ordinary traffic offenses, those with a c_charge_degree of 'O', will not result in jail time and should be removed (there are only two of them).

**(Same thing we've been doing before - feel free to copy paste from past weeks or import your cleaned dataset)**

2) Filter the columns to only include ['name', 'age', 'c_charge_degree', 'race', 'score_text', 'sex', 'priors_count', 'days_b_screening_arrest', 'decile_score', 'is_recid', 'two_year_recid', 'c_jail_in', 'c_jail_out']

3) Create a column 'age_cat', where age is separated into 3 categories: > 45, 25 <= age <= 45, and < 25. Give the categories easy-to-read names.

4) Create a binary score category from 'score_text': 0 is 'Low', 1 is 'Medium' or 'High'.
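
If you are rebuilding the cleaned dataset from scratch, Questions 1-4 could look roughly like the sketch below. It runs on a tiny made-up DataFrame rather than the real CSV, carries only a few of the listed columns, and the names `keep`, `score_factor`, and the age labels are assumptions you can swap for your own.

```python
import pandas as pd

# Toy rows standing in for compas-scores-two-years.csv (all values are made up).
raw = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "age": [22, 30, 50, 41, 33, 60],
    "c_charge_degree": ["F", "M", "O", "F", "M", "F"],
    "score_text": ["High", "Low", "Low", "Medium", "Low", "Medium"],
    "days_b_screening_arrest": [-1, 45, 0, 3, -2, 10],
    "is_recid": [1, 0, 0, -1, 0, 0],
})

# 1) Keep rows scored within 30 days of the arrest, with a known recidivism
#    flag, and drop ordinary traffic offenses.
keep = (
    raw["days_b_screening_arrest"].between(-30, 30)
    & (raw["is_recid"] != -1)
    & (raw["c_charge_degree"] != "O")
)
df = raw.loc[keep].copy()

# 2) Keep only the columns needed for the analysis (a subset in this toy frame).
df = df[["name", "age", "c_charge_degree", "score_text",
         "days_b_screening_arrest", "is_recid"]]

# 3) Bin age into three readable categories: < 25, 25 to 45, > 45.
df["age_cat"] = pd.cut(
    df["age"],
    bins=[float("-inf"), 24, 45, float("inf")],
    labels=["Less than 25", "25-45", "Greater than 45"],
)

# 4) Binary score: 0 for 'Low', 1 for 'Medium' or 'High'.
df["score_factor"] = (df["score_text"] != "Low").astype(int)

print(df)
```

Note that `between(-30, 30)` is inclusive on both ends and evaluates to False for missing values, so rows with a missing days_b_screening_arrest are dropped as well.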

# Start of New Content

5) Create new columns that hold 'c_charge_degree', 'age_cat', 'race', and 'sex' (gender) in categorical form. Later hints assume names such as 'race_cat' and 'gender_cat'.
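
One possible way to do the conversion is `astype('category')`. The sketch below uses a three-row made-up DataFrame, and the new column names (`crime_cat`, `race_cat`, `gender_cat`) are assumptions, not names required by the exercise.

```python
import pandas as pd

# Made-up rows with the four qualitative columns from the cleaned dataset.
df = pd.DataFrame({
    "c_charge_degree": ["F", "M", "F"],
    "age_cat": ["25-45", "Less than 25", "Greater than 45"],
    "race": ["Caucasian", "African-American", "Hispanic"],
    "sex": ["Male", "Female", "Male"],
})

# Store each qualitative variable with pandas' categorical dtype.
df["crime_cat"] = df["c_charge_degree"].astype("category")
df["age_cat"] = df["age_cat"].astype("category")
df["race_cat"] = df["race"].astype("category")
df["gender_cat"] = df["sex"].astype("category")

print(df.dtypes)
print(df["race_cat"].cat.categories.tolist())
```

By default the categories are sorted, so the first level here is 'African-American'. That default ordering is why the next question asks you to relevel.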

In [None]:
# Convert the charge degree, age category, and race to categorical variables.
# This is necessary for statistical modeling as these are qualitative variables.




# Convert 'sex' into a categorical variable as well.
# In the ProPublica CSV, 'sex' is normally stored as the strings 'Female' and 'Male';
# map numeric codes to string labels only if your cleaned copy uses them.

6) Relevel the new age, race, and gender columns to have '25-45' (however you named this range in your age_cat function), 'Caucasian', and 'Male' as reference categories, respectively.
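
A minimal sketch of releveling, assuming your columns already have the categorical dtype. The helper `relevel` is hypothetical (it is not a pandas function); it just moves the chosen level to the front with `cat.reorder_categories`.

```python
import pandas as pd

def relevel(series, reference):
    """Return a copy of a categorical Series whose first category is `reference`."""
    others = [c for c in series.cat.categories if c != reference]
    return series.cat.reorder_categories([reference] + others)

# Made-up values; the categories start out in sorted order.
race_cat = pd.Series(
    ["Hispanic", "African-American", "Caucasian", "Other"], dtype="category"
)

releveled = relevel(race_cat, "Caucasian")
print(releveled.cat.categories.tolist())
```

Only the order of the categories changes. The values in each row stay the same.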

In [None]:
# Reorder categories so that the reference (or baseline) category for analysis is clear:




# The baseline categories are used in regression to compare the effects of other groups relative to these groups.

7) Fit a binomial GLM (logistic regression) that predicts the binary score category from gender, age category, race, priors_count, charge degree, and two_year_recid.

In [None]:
# Fit a generalized linear model (GLM) using logistic regression (binomial family)
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder






# The formula expresses the relationship between the outcome (score_factor) and predictors:
# gender, age, race, priors_count, crime type, and whether the defendant recidivated in two years.

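# For the following questions, you will need to calculate ***odds ratios***
* Use the NumPy function np.exp(model.params['predictor_cat[T.level]'])
  * For example, for the odds of a higher score for men relative to the reference gender, you would do: np.exp(model.params['gender_cat[T.Male]']). MAKE SURE that this column is the ***categorical*** version that you created earlier. The reference level never gets its own coefficient, so if 'Male' is your reference level (as in Question 6), the key is 'gender_cat[T.Female]' and the odds ratio for men relative to women is its reciprocal.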
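
To see what exponentiating a coefficient gives you, here is a small sketch with hypothetical numbers. The Series `params` only imitates the layout of `model.params`; its values are invented and do not come from the COMPAS data. The sketch also shows that an odds ratio is not the same thing as a ratio of probabilities.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients on the log-odds scale, laid out like model.params.
params = pd.Series({
    "Intercept": -1.5,
    "race_cat[T.African-American]": 0.5,
})

def sigmoid(x):
    """Convert log-odds to a probability."""
    return 1 / (1 + np.exp(-x))

beta = params["race_cat[T.African-American]"]
odds_ratio = np.exp(beta)

# Probabilities for a reference individual (all other predictors at their
# reference level or zero) and for the same individual in the other group.
p_reference = sigmoid(params["Intercept"])
p_group = sigmoid(params["Intercept"] + beta)
probability_ratio = p_group / p_reference

# The same ratio written in terms of the odds ratio and the baseline probability.
probability_ratio_closed = odds_ratio / (1 - p_reference + p_reference * odds_ratio)

print(odds_ratio, probability_ratio, probability_ratio_closed)
```

The probability ratio depends on the baseline probability, so it is always closer to 1 than the odds ratio. The two are close only when the outcome is rare.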

8) Adjusting for confounders, how much more likely are Black Defendants to get a higher score than White Defendants?

In [None]:
# Calculate the odds ratio for Black defendants compared to White defendants.
# 'Caucasian' is the reference level, so the coefficient to exponentiate is the one for the African-American level.





# In logistic regression, the coefficients represent changes in the log-odds of the outcome.
# Exponentiating these coefficients transforms them from the log-odds scale into odds ratios, which are much more interpretable.

# This odds ratio tells you how many times higher (or lower, if it is below 1) the odds of a higher score
# are for Black defendants than for White defendants, after adjusting for the other predictors.





# Example interpretation:
#     If the odds ratio is 1.5, the odds of a higher score for Black defendants are 1.5 times
#      (50% higher than) the odds for White defendants, holding all other variables in the model constant.
#      Conversely, if the odds ratio were 0.7, the odds for Black defendants would be 30% lower than for White defendants.
#      An odds ratio approximates a ratio of probabilities only when the outcome is rare.

9) How much more likely are men to get a higher score than women?

In [None]:
# Calculate the odds ratio for men compared to women. A positive coefficient for Male indicates higher odds of a high score.


10)  How much more likely are "youth" defendants to get a higher score than "middle aged" defendants? (However you named them in your dataset)

In [None]:
# Here we look at the odds ratio for defendants under 25 (youth) relative to the middle-aged group (25 - 45).
# An odds ratio above 1 indicates that being younger is associated with higher odds of a higher risk score.


11) How about if a defendant recidivated? What does this mean?

In [None]:
# Calculate the odds ratio for two_year_recid (i.e., recidivism within two years).


12) Hispanic vs. African American? (Hint: You'll have to create a new model.)

In [None]:
# To compare Hispanic vs. African-American, set 'African-American' as the reference category in the race factor.

# When you reorder a categorical variable using cat.reorder_categories,
# the first category in the new order becomes the reference (or baseline) category for the model.

# For the following questions, you will need to calculate ***p-values and SEs***
* Use the statsmodels attribute model.pvalues['predictor_cat[T.level]'] to find the p-value for a coefficient, and model.bse['predictor_cat[T.level]'] to find its standard error.
  * For example, for the coefficient on being a man, you would do: model.pvalues['gender_cat[T.Male]'] and model.bse['gender_cat[T.Male]']. MAKE SURE that this column is the ***categorical*** version that you created earlier

13) How can we interpret P>|z| for the Asian category (the dataset's race label is 'Asian')? Similarly, what about the standard error?

In [None]:
# Extract the p-value for the coefficient associated with Asian defendants.

# Extract the standard error for the Asian category coefficient.


14) What if we wanted to change two predictors at once to see their combined effect on the probability?

To do this, you'll need to make a dataframe of two rows: One represents a reference individual, the other is the same except for two categories changed. Your columns will be the predictors (6 columns).

Then, use the command

```
predicted_probabilities = model.predict(dataframe)
```




In [1]:
# Make your dataframe here. Hint: Create a DICTIONARY called reference_data that maps each
# predictor column to a list of two values, one per row. For example, to compare two people who
# are both Caucasian, you would do... 'race_cat' : ['Caucasian', 'Caucasian'], and so on for the other
# columns. For different characteristics... 'race_cat' : ['Caucasian', 'African-American'].




# Make a new dataframe using values dictated by your dictionary (hint: pd.DataFrame())



# Predict probabilities based on the new reference DataFrame using the fitted model.
# This gives us an idea of how the predicted probability of a Medium/High COMPAS score changes with different inputs.





# Display the predicted probabilities.



# Extra Exercises
* Congrats! You made it to the end of the main exercises. If you're still hankering for more logistic regression, you can continue your good work with the open-ended prompt below. If you've gotten here and you're kinda burnt out, that's also okay! You're free to go :)

15) Load the violent dataset, clean it, change respective variables to categorical, relevel, and run a GLM logistic regression. Then analyze it however you want! Take inspiration from the exercises above.