<a href="https://colab.research.google.com/github/NathaliL/Recidivism-Prediction/blob/main/BERT_Recidivism_Classifier_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
# Install necessary libraries
!pip install kaggle transformers scikit-learn pandas numpy



In [5]:
# Set up Kaggle API (Make sure to upload your kaggle.json file to the notebook environment)
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content'
!kaggle datasets download -d uocoeeds/recidivism
!unzip recidivism.zip

Dataset URL: https://www.kaggle.com/datasets/uocoeeds/recidivism
License(s): unknown
recidivism.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  recidivism.zip
replace recidivism_full.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 
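The download step above only works if `kaggle.json` is present in the directory named by `KAGGLE_CONFIG_DIR`. A minimal sketch of a pre-flight check follows. The helper name `kaggle_credentials_ready` is hypothetical, and it only inspects the file on disk; it does not call the Kaggle API or verify that the credentials are valid.

```python
import json
import os
from pathlib import Path


def kaggle_credentials_ready(config_dir=None):
    """Return True if kaggle.json exists in config_dir and holds both keys.

    Hypothetical helper: it checks the file only, not whether the
    credentials are accepted by Kaggle.
    """
    config_dir = config_dir or os.environ.get("KAGGLE_CONFIG_DIR", "/content")
    path = Path(config_dir) / "kaggle.json"
    if not path.is_file():
        return False
    try:
        creds = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return False
    # A Kaggle API token file stores a "username" and a "key" entry.
    return isinstance(creds, dict) and {"username", "key"} <= creds.keys()
```

Calling `kaggle_credentials_ready()` before the `!kaggle datasets download` line gives an early, readable failure instead of an error from the CLI.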

In [6]:
# Import libraries
import torch
import numpy as np
import pandas as pd
from transformers import BertTokenizer, BertModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [7]:
# Load and inspect the dataset from Kaggle
df = pd.read_csv('recidivism_full.csv')
print(df.head())
print(df.columns)

   ID Gender   Race Age_at_Release  Residence_PUMA Gang_Affiliated  \
0   1      M  BLACK          43-47              16           False   
1   2      M  BLACK          33-37              16           False   
2   3      M  BLACK    48 or older              24           False   
3   4      M  WHITE          38-42              16           False   
4   5      M  WHITE          33-37              16           False   

   Supervision_Risk_Score_First Supervision_Level_First  \
0                           3.0                Standard   
1                           6.0             Specialized   
2                           7.0                    High   
3                           7.0                    High   
4                           4.0             Specialized   

         Education_Level Dependents  ... DrugTests_Meth_Positive  \
0  At least some college  3 or more  ...                0.000000   
1   Less than HS diploma          1  ...                0.000000   
2  At least some col

In [9]:
# Identify the text and label columns.
# NOTE: 'ID' is a row identifier rather than free text, and 'Training_Sample' is
# used as the label here. Replace both with the actual text and outcome columns
# before interpreting the results.
texts = df['ID'].astype(str).tolist()  # BERT input: one string per row
labels = df['Training_Sample']  # Target passed to the classifier
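BERT expects natural-language input, so a tabular dataset like this one needs some text to encode. One possible approach, sketched here as an assumption rather than as this notebook's method, is to serialize selected columns of each row into a short string. The helper `row_to_text` and the choice of columns are hypothetical; the column names and values are taken from the `df.head()` output above.

```python
def row_to_text(row, columns):
    """Serialize selected fields of one record into a 'name: value' string.

    `row` can be any mapping-like object (a dict or a pandas Series).
    """
    return "; ".join(f"{col.replace('_', ' ')}: {row[col]}" for col in columns)


# Illustrative record copied from the first row of df.head().
example_row = {
    "Gender": "M",
    "Race": "BLACK",
    "Age_at_Release": "43-47",
    "Education_Level": "At least some college",
}
text_columns = ["Gender", "Age_at_Release", "Education_Level"]
print(row_to_text(example_row, text_columns))
```

With the DataFrame loaded, `df.apply(lambda r: row_to_text(r, text_columns), axis=1).tolist()` would produce one such string per row.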

In [10]:
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased').to(torch.device('cuda' if torch.cuda.is_available() else 'cpu'))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [11]:
def get_embeddings(text_list, batch_size=32):
    embeddings = []
    for i in range(0, len(text_list), batch_size):
        batch_texts = text_list[i:i+batch_size]
        inputs = tokenizer(batch_texts, return_tensors='pt', truncation=True, padding=True, max_length=512).to(torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
        with torch.no_grad():
            outputs = model(**inputs)
            batch_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()
            embeddings.append(batch_embeddings)
    return np.vstack(embeddings)
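`get_embeddings` walks the input in slices of `batch_size` and keeps the hidden state at position 0 (the `[CLS]` token) for each text. The slicing pattern in isolation, sketched with a hypothetical helper `iter_batches` and stand-in data, shows that the final batch may be shorter than the others:

```python
def iter_batches(items, batch_size=32):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts_demo = [str(i) for i in range(1, 71)]  # 70 stand-in strings
batch_sizes = [len(batch) for batch in iter_batches(texts_demo, batch_size=32)]
print(batch_sizes)  # [32, 32, 6]
```

Because every batch is padded independently (`padding=True`), a short final batch needs no special handling before `np.vstack`.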

In [12]:
# Generate embeddings for the entire dataset
embeddings = get_embeddings(texts)

In [13]:
numerical_features = df[['Residence_PUMA', 'Jobs_Per_Year' ,'Supervision_Risk_Score_First']]
# Combine embeddings with numerical features
features = np.hstack((embeddings, numerical_features.values))

In [14]:
valid_idx = ~np.isnan(features).any(axis=1)
features = features[valid_idx]
labels = labels[valid_idx]

In [15]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Train the Logistic Regression classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
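The warning above suggests scaling the data. The feature matrix mixes BERT embedding values with columns such as `Residence_PUMA`, which are likely on a different scale, and that can slow the solver. A minimal sketch of standardization in NumPy follows, with the statistics computed on the training split only. The helper names and the toy arrays are illustrative assumptions; scikit-learn's `StandardScaler` performs the same transformation.

```python
import numpy as np


def fit_standardizer(X_train):
    """Compute per-column mean and standard deviation from the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return mean, std


def standardize(X, mean, std):
    """Apply the training-set statistics to any split."""
    return (X - mean) / std


# Toy data: one small-valued column and one large-valued column.
X_train_demo = np.array([[0.1, 16.0], [0.3, 24.0], [0.2, 20.0]])
X_test_demo = np.array([[0.2, 28.0]])

mean, std = fit_standardizer(X_train_demo)
X_train_scaled = standardize(X_train_demo, mean, std)
X_test_scaled = standardize(X_test_demo, mean, std)
```

Fitting the statistics on the training split and reusing them for the test split keeps information from the test rows out of the preprocessing step.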


In [17]:
# Predict and evaluate the classifier
# y_pred = clf.predict(X_test)
# accuracy = accuracy_score(y_test, y_pred)
# precision = precision_score(y_test, y_pred, average='binary')
# recall = recall_score(y_test, y_pred, average='binary')
# f1 = f1_score(y_test, y_pred, average='binary')

# print(f"Accuracy: {accuracy}")
# print(f"Precision: {precision}")
# print(f"Recall: {recall}")
# print(f"F1 Score: {f1}")

from sklearn.model_selection import KFold

# `embeddings` is already a NumPy array (the output of np.vstack in get_embeddings).
# Note: this uses the BERT embeddings only, without the numerical features
# or the NaN filtering applied in the earlier cells.
X = np.array(embeddings)
y = df['Training_Sample'].values  # Labels as a NumPy array

kf5 = KFold(n_splits=5, shuffle=True, random_state=42)
kf10 = KFold(n_splits=10, shuffle=True, random_state=42)

def train_and_evaluate_kfold(kf, X, y):
    """Fit a fresh LogisticRegression on each fold and collect per-fold metrics."""
    accuracies = []
    precisions = []
    recalls = []
    f1s = []

    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Named fold_clf so it is not confused with the global BERT `model`.
        # Default max_iter=100; raise it or scale X if the solver reports non-convergence.
        fold_clf = LogisticRegression()
        fold_clf.fit(X_train, y_train)

        # Predict on the held-out fold
        y_pred = fold_clf.predict(X_test)

        # Record the metrics for this fold
        accuracies.append(accuracy_score(y_test, y_pred))
        precisions.append(precision_score(y_test, y_pred))
        recalls.append(recall_score(y_test, y_pred))
        f1s.append(f1_score(y_test, y_pred))

    # Return the per-fold lists of accuracy, precision, recall, and F1 score
    return accuracies, precisions, recalls, f1s

# Example usage for k=5:
acc_5, prec_5, rec_5, f1_5 = train_and_evaluate_kfold(kf5, X, y)
print(f"K=5 Fold Accuracy: {acc_5}")
print(f"K=5 Fold Precision: {prec_5}")
print(f"K=5 Fold Recall: {rec_5}")
print(f"K=5 Fold F1 Score: {f1_5}")

# Example usage for k=10:
acc_10, prec_10, rec_10, f1_10 = train_and_evaluate_kfold(kf10, X, y)
print(f"K=10 Fold Accuracy: {acc_10}")
print(f"K=10 Fold Precision: {prec_10}")
print(f"K=10 Fold Recall: {rec_10}")
print(f"K=10 Fold F1 Score: {f1_10}")





K=5 Fold Accuracy: [0.6980839945809948, 0.6957615637700794, 0.6951809560673505, 0.6907296303464293, 0.6895684149409715]
K=5 Fold Precision: [0.7012480499219969, 0.7001756783134881, 0.698828125, 0.6924271844660194, 0.6937182988685134]
K=5 Fold Recall: [0.9922737306843267, 0.9900634833011317, 0.9908612572694544, 0.9960893854748604, 0.9905292479108635]
K=5 Fold F1 Score: [0.8217550274223033, 0.8202606906014177, 0.8196082922918337, 0.8169530355097365, 0.8159706287287747]
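Recall near 0.99 alongside precision and accuracy near 0.70 is consistent with a classifier that labels almost every sample as positive. As a quick consistency check on the reported numbers, F1 is the harmonic mean of precision and recall; the sketch below recomputes it for the first K=5 fold from the printed values. The helper name is hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the first K=5 fold in the output above.
precision_fold1 = 0.7012480499219969
recall_fold1 = 0.9922737306843267
f1_fold1 = f1_from_precision_recall(precision_fold1, recall_fold1)
print(round(f1_fold1, 4))  # 0.8218
```

The result agrees with the first entry of the K=5 F1 list, so the high F1 here is driven almost entirely by recall.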


K=10 Fold Accuracy: [0.7058823529411765, 0.6927244582043344, 0.7047213622291022, 0.6938854489164087, 0.6904024767801857, 0.6995741385985288, 0.6895083236546651, 0.6887340301974448, 0.699961285327139, 0.6840882694541232]
K=10 Fold Precision: [0.7082362082362083, 0.6937111801242236, 0.7051978277734678, 0.6964632724446171, 0.6927265655387009, 0.703862660944206, 0.6915306915306916, 0.6922477600311648, 0.7028816199376947, 0.6862592448423511]
K=10 Fold Recall: [0.9950873362445415, 0.9972098214285714, 0.9983525535420099, 0.9944506104328524, 0.9944165270798436, 0.9906644700713894, 0.9955257270693513, 0.9921831379117811, 0.9933957072096863, 0.9943598420755781]
K=10 Fold F1 Score: [0.8275079437131184, 0.8182234432234432, 0.826551488974767, 0.8192, 0.816597890875745, 0.822992700729927, 0.8161393856029345, 0.8155117026158789, 0.8232611174458381, 0.8120681713496085]
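Per-fold lists are easier to compare once they are reduced to a mean and a spread. A minimal sketch using only the standard library is shown below; the accuracy values are copied from the K=5 and K=10 outputs above (rounded to six decimals for K=10), and the helper `summarize` is a hypothetical name.

```python
from statistics import mean, stdev

# Accuracy values copied from the K=5 output above.
acc_5_reported = [
    0.6980839945809948,
    0.6957615637700794,
    0.6951809560673505,
    0.6907296303464293,
    0.6895684149409715,
]

# Accuracy values from the K=10 output above, rounded to six decimals.
acc_10_reported = [
    0.705882, 0.692724, 0.704721, 0.693885, 0.690402,
    0.699574, 0.689508, 0.688734, 0.699961, 0.684088,
]


def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return mean(scores), stdev(scores)


for name, scores in [("K=5", acc_5_reported), ("K=10", acc_10_reported)]:
    avg, spread = summarize(scores)
    print(f"{name} accuracy: {avg:.4f} +/- {spread:.4f}")
```

The same helper can be applied to the precision, recall, and F1 lists returned by `train_and_evaluate_kfold`.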


