In [1]:
import sys
print(sys.version)

3.13.2 (v3.13.2:4f8bb3947cf, Feb  4 2025, 11:51:10) [Clang 15.0.0 (clang-1500.3.9.4)]


# Similarity Scoring: Identifying Similarly Situated Defendants

**Purpose:** Use Redo.io's `similarity_scoring` toolkit to find *similarly situated* defendants in the CDCR resentencing data and compare their sentencing outcomes.

**Research Question:** Among defendants who are highly similar on offense profile and criminal history, do Black or Hispanic defendants receive longer sentences than White defendants?

**CRJA Connection:** This analysis focuses on *pairwise* comparisons of "similarly situated" individuals. That is directly relevant to California Racial Justice Act (CRJA) motions, where public defenders present concrete case comparisons rather than only aggregate trends.

**Datasets Used:**

- `demographics.csv`
- `current_commitments.csv`
- `prior_commitments.csv`

**Tools Used:** Redo.io `similarity_scoring` toolkit

## Step 1: Setup & Imports

Configure Python path to use the cloned `similarity_scoring` toolkit and import core functions.
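
A minimal sketch of one way to make the path setup a little more defensive. The helper name `add_package_dir` is hypothetical and not part of the toolkit. It checks that the directory exists and places it at the front of `sys.path`, on the assumption that generic module names such as `config` could otherwise be shadowed by another installed module.

```python
import sys
from pathlib import Path


def add_package_dir(repo_root: Path, *parts: str) -> Path:
    """Put repo_root/parts... at the front of sys.path and return the directory."""
    pkg_dir = repo_root.joinpath(*parts).resolve()
    if not pkg_dir.is_dir():
        raise FileNotFoundError(f"Package directory not found: {pkg_dir}")
    if str(pkg_dir) not in sys.path:
        # Front of the path, so a generic name like `config` resolves here first
        sys.path.insert(0, str(pkg_dir))
    return pkg_dir
```

In this notebook it could be called as `add_package_dir(Path.cwd(), "similarity_scoring", "similarity_scoring")`.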

In [3]:
import sys, os

# Tell Python where the similarity_scoring package lives
sys.path.append(os.path.join(os.getcwd(), "similarity_scoring", "similarity_scoring"))

print(os.listdir(os.path.join(os.getcwd(), "similarity_scoring", "similarity_scoring")))

['config.py', 'vector_similarity.py', 'similarity_metrics.py', 'offense_helpers.py', 'compute_metrics.py', '__init__.py', 'run_similarity.py', 'sentencing_math.py']


In [5]:
import config as CFG
import compute_metrics as cm
import sentencing_math as sm
from similarity_metrics import (
    euclidean_similarity_named,
    tanimoto_from_named,
    jaccard_on_keys,
)
from vector_similarity import cosine_from_named

print("Imports successful!")

Imports successful!


## Step 2: Load CDCR Public Data

For this prototype, we use the public GitHub versions of the demographics, current commitments, and prior commitments tables from Redo's `resentencing_data_initiative` repository.

In [9]:
import ssl

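# TEMPORARY workaround: disable SSL certificate verification so we can read from GitHub.
# This affects every HTTPS request in the process; remove it outside of prototyping.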
ssl._create_default_https_context = ssl._create_unverified_context


In [12]:
import pandas as pd
import numpy as np
from pathlib import Path

In [10]:
# GitHub URLs for public CDCR data
demographics_url = "https://raw.githubusercontent.com/redoio/resentencing_data_initiative/main/data/demographics.csv"
current_url      = "https://raw.githubusercontent.com/redoio/resentencing_data_initiative/main/data/current_commitments.csv"
prior_url        = "https://raw.githubusercontent.com/redoio/resentencing_data_initiative/main/data/prior_commitments.csv"

print("Loading data from GitHub...")

demo_df    = pd.read_csv(demographics_url)
current_df = pd.read_csv(current_url)
prior_df   = pd.read_csv(prior_url)

print(f"Demographics:          {demo_df.shape}")
print(f"Current commitments:   {current_df.shape}")
print(f"Prior commitments:     {prior_df.shape}")

demo_df.head()


Loading data from GitHub...


URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1028)>
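
The remote read fails on certificate verification, and the cells that follow read local copies of the same three tables instead. A minimal sketch of making that fallback explicit is shown below. The helper `read_with_fallback` is hypothetical; it only assumes that the reader raises an `OSError` subclass on failure, which holds for the `URLError` shown above.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_fallback(primary, fallback, reader: Callable[[object], T]) -> T:
    """Try reader(primary); if it fails with a network/IO error, use reader(fallback)."""
    try:
        return reader(primary)
    except OSError as exc:  # urllib.error.URLError is a subclass of OSError
        print(f"Primary source failed ({exc!r}); falling back to {fallback}")
        return reader(fallback)
```

With pandas, this might be called as `read_with_fallback(demographics_url, "similarity_scoring/resources/demographics.csv", pd.read_csv)`, which keeps certificate verification enabled rather than switching it off.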

In [17]:
from pathlib import Path
data_dir = Path("similarity_scoring/resources")

DATA_PATHS = {
    'demographics': data_dir / "demographics.csv",
    'current_commits': data_dir / "current_commitments.csv",
    'prior_commits': data_dir / "prior_commitments.csv"
}

In [18]:
demo_df    = pd.read_csv(DATA_PATHS['demographics'])
current_df = pd.read_csv(DATA_PATHS['current_commits'])
prior_df   = pd.read_csv(DATA_PATHS['prior_commits'])

print(f"Demographics:          {demo_df.shape}")
print(f"Current commitments:   {current_df.shape}")
print(f"Prior commitments:     {prior_df.shape}")

demo_df.head()


Demographics:          (95476, 16)
Current commitments:   (369125, 34)
Prior commitments:     (191436, 13)


| Unnamed: 0 | cdcno | ethnicity | controlling offense | description | offense begin date | offense end date | controlling case number | controlling case sentencing county | sentence type | aggregate sentence in months | offense category | eprd mepd month and year | current location | aggregate sentence in years | time served in years | expected release date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2cf2a233c4 | Black | VC10851(a) | Vehicle Theft | 2022-12-07 | 2022-12-07 | FVI22003547 | San Bernardino | Second Striker | 32 | Property Crimes | APR24 | Central California Women's Facility | 2.7 | 2.6 | 2025-08-07 |
| 1 | 5a72696541 | White | PC187 2nd | Murder 2nd | 2012-09-18 | 2012-09-18 | 12F06402 | Sacramento | Life with Parole | 360 | Crimes Against Persons | NOV33 | Central California Women's Facility | 30.0 | 12.8 | 2042-09-18 |
| 2 | 7d608b6a4c | White | PC187 2nd | Murder 2nd | 2010-05-18 | 2010-05-18 | CM032513 | Butte | Life with Parole | 300 | Crimes Against Persons | SEP28 | Central California Women's Facility | 25.0 | 15.2 | 2035-05-18 |
| 3 | 39c1bc8c2f | Other | PC459 | Burglary 1st | 1998-01-20 | 1998-01-20 | CM010387 | Butte | Third Striker | 348 | Property Crimes | NOV22 | California Institution for Women | 29.0 | 27.5 | 2027-01-20 |
| 4 | 220f2cdfc5 | Black | PC187 | Murder 1st | 2017-03-21 | 2017-03-21 | BA455966 | Los Angeles | Life with Parole | 312 | Crimes Against Persons | MAY35 | California Institution for Women | 26.0 | 8.3 | 2043-03-21 |


### **Data Loaded Successfully**

- **95,476** individuals in demographics  
- **369,125** current commitments  
- **191,436** prior commitments  

All three tables are now in memory and ready for similarity scoring.

## Step 3: Compute Similarity Features for a Single Defendant

Here we use Redo.io's `compute_features()` function to build a feature vector for one person.
This vector summarizes their offense history and sentence profile and is the input to all
similarity metrics (cosine, Euclidean, Tanimoto, etc.).

In [20]:
# Pick one sample defendant from the demographics table
sample_id = demo_df["cdcno"].iloc[0]
print("Sample CDC Number:", sample_id)

# Compute features + auxiliary info for debugging
feats, aux = cm.compute_features(
    uid=sample_id,
    demo=demo_df,
    current_df=current_df,
    prior_df=prior_df,
    lists=CFG.OFFENSE_LISTS,
)

print("\n=== Feature Vector (for similarity) ===")
for k, v in feats.items():
    print(f"{k:20s} : {v}")

print("\n=== Auxiliary Info (for interpretation/QA) ===")
for k, v in aux.items():
    print(f"{k:20s} : {v}")

Sample CDC Number: 2cf2a233c4

=== Feature Vector (for similarity) ===
desc_nonvio_curr     : 1.0
desc_nonvio_past     : 1.0
severity_trend       : 0.0

=== Auxiliary Info (for interpretation/QA) ===
time_inputs          : TimeInputs(current_sentence_months=32.0, completed_months=31.200000000000003, past_time_months=nan, childhood_months=0.0)
pct_completed        : 97.50000000000001
time_outside         : 0.0
age_value            : nan
counts_by_category   : {'current': {'violent': 0, 'nonviolent': 1, 'other': 0, 'clash': 0}, 'prior': {'violent': 0, 'nonviolent': 2, 'other': 11, 'clash': 0}}
years_elapsed_from_commitments : None
years_elapsed_for_trend : 10.0


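## Step 4: Compare Two Defendants Using Similarity Metrics

We build feature vectors for the first two individuals in the demographics table and score the pair with each of the four imported metrics.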

In [21]:
# Choose two people
id1 = demo_df["cdcno"].iloc[0]
id2 = demo_df["cdcno"].iloc[1]

f1, _ = cm.compute_features(id1, demo_df, current_df, prior_df, CFG.OFFENSE_LISTS)
f2, _ = cm.compute_features(id2, demo_df, current_df, prior_df, CFG.OFFENSE_LISTS)

print("ID1:", id1)
print("ID2:", id2)

print("\nCosine similarity:   ", cosine_from_named(f1, f2))
print("Euclidean similarity:", euclidean_similarity_named(f1, f2))
print("Tanimoto similarity: ", tanimoto_from_named(f1, f2))
print("Jaccard on keys:     ", jaccard_on_keys(f1, f2))

ID1: 2cf2a233c4
ID2: 5a72696541

Cosine similarity:    0.6901355398841714
Euclidean similarity: 0.49392687973787863
Tanimoto similarity:  0.4878555511603685
Jaccard on keys:      nan


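**Interpretation:**

- **Cosine similarity = 0.69.** These two defendants have moderately similar feature vectors (offense mix, severity trend, etc.).
- **Euclidean similarity = 0.49.** This is a distance-based similarity. It is not on the same scale as cosine, so the two numbers are not directly comparable, but both point to a partial match.
- **Tanimoto similarity = 0.49.** This measures overlap in weighted features and likewise indicates a partial match.
- **Jaccard on keys = NaN.** Jaccard looks only at the presence or absence of keys, and our feature vectors can carry different keys depending on which fields exist (e.g., `freq_*` may be missing). A standard Jaccard index is 0 when two key sets are disjoint and is undefined (0/0) only when both sets are empty, so the NaN suggests the function found no usable keys on either side. It is worth checking the `jaccard_on_keys` implementation before treating NaN as expected for CDCR data.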
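
To make the four scores concrete, here is a plain-Python sketch of each metric on dict-shaped ("named") vectors. It is not the toolkit's implementation: the function names, the `1 / (1 + distance)` form of Euclidean similarity, the treatment of missing keys as 0, and the values in `b` are all assumptions chosen for illustration. Only the keys and values in `a` come from the feature vector printed in Step 3.

```python
import math


def _aligned(a: dict, b: dict):
    """Align two named vectors on the union of their keys (missing -> 0.0)."""
    keys = sorted(set(a) | set(b))
    return [a.get(k, 0.0) for k in keys], [b.get(k, 0.0) for k in keys]


def cosine(a: dict, b: dict) -> float:
    x, y = _aligned(a, b)
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else math.nan  # undefined for a zero vector


def euclidean_similarity(a: dict, b: dict) -> float:
    x, y = _aligned(a, b)
    return 1.0 / (1.0 + math.dist(x, y))  # assumed mapping of distance to (0, 1]


def tanimoto(a: dict, b: dict) -> float:
    x, y = _aligned(a, b)
    dot = sum(p * q for p, q in zip(x, y))
    denom = sum(p * p for p in x) + sum(q * q for q in y) - dot
    return dot / denom if denom else math.nan


def jaccard_keys(a: dict, b: dict) -> float:
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else math.nan


a = {"desc_nonvio_curr": 1.0, "desc_nonvio_past": 1.0, "severity_trend": 0.0}
b = {"desc_nonvio_curr": 0.0, "desc_nonvio_past": 1.0, "severity_trend": 0.5}  # hypothetical

for name, fn in [("cosine", cosine), ("euclidean", euclidean_similarity),
                 ("tanimoto", tanimoto), ("jaccard", jaccard_keys)]:
    print(f"{name:10s} {fn(a, b):.3f}")
```

Under these definitions, disjoint key sets give a Jaccard score of 0, and NaN appears only when both key sets are empty.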

## Step 5: Find the Top-K Most Similar Defendants to a Target

Here we compute the cosine similarity between a target person and *all other* defendants, then return the `k` highest-scoring matches.

In [None]:
def compute_feature_vector(uid):
    """Return the similarity feature dict for one CDC number."""
    feats, _ = cm.compute_features(
        uid=uid,
        demo=demo_df,
        current_df=current_df,
        prior_df=prior_df,
        lists=CFG.OFFENSE_LISTS,
    )
    return feats

def top_k_similar(target_uid, k=10):
    """Return the k (cdcno, cosine similarity) pairs most similar to target_uid."""
    # Build the target's features once, not once per comparison
    target_feats = compute_feature_vector(target_uid)
    if not target_feats:
        return []
    sims = []
    for uid in demo_df["cdcno"].unique():
        if uid == target_uid:
            continue
        other_feats = compute_feature_vector(uid)
        if not other_feats:
            continue
        score = cosine_from_named(target_feats, other_feats)
        # Skip None/NaN scores so they cannot disturb the sort
        if pd.notna(score):
            sims.append((uid, score))
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims[:k]

# Run it
target = demo_df["cdcno"].iloc[0]
print("Target ID:", target)
top10 = top_k_similar(target, k=10)
top10

In [None]:
# Convert the list of (cdcno, similarity) tuples to a DataFrame
top10_df = pd.DataFrame(top10, columns=["cdcno", "similarity"])
top10_df.head(10)
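
With 95,476 individuals in the demographics table, ranking needs one feature vector per person and only the best `k` scores. The sketch below shows that pattern on precomputed vectors, using `heapq.nlargest` instead of a full sort. The `features` dict, its IDs and values, and the local `cosine` helper are hypothetical stand-ins for the output of `compute_features` and `cosine_from_named`.

```python
import heapq
import math


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity over the union of keys; NaN if either vector is all zeros."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else math.nan


def top_k(target_id: str, features: dict, k: int = 10) -> list:
    """Return the k (id, score) pairs most similar to target_id, best first."""
    target = features[target_id]
    scored = ((uid, cosine(target, f)) for uid, f in features.items() if uid != target_id)
    valid = ((uid, s) for uid, s in scored if not math.isnan(s))  # drop undefined scores
    return heapq.nlargest(k, valid, key=lambda pair: pair[1])


# Hypothetical precomputed feature vectors, one per person
features = {
    "A": {"desc_nonvio_curr": 1.0, "severity_trend": 0.0},
    "B": {"desc_nonvio_curr": 1.0, "severity_trend": 0.1},
    "C": {"desc_nonvio_curr": 0.0, "severity_trend": 1.0},
    "D": {"desc_nonvio_curr": 0.0, "severity_trend": 0.0},  # zero vector, score is NaN
}
print([uid for uid, _ in top_k("A", features, k=2)])  # ['B', 'C']
```

Caching the vectors this way also makes it cheap to rank matches for several targets in the same session.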