In [1]:
import sys
print(sys.version)

3.13.2 (v3.13.2:4f8bb3947cf, Feb  4 2025, 11:51:10) [Clang 15.0.0 (clang-1500.3.9.4)]


# Similarity Scoring: Identifying Similarly Situated Defendants

**Purpose:** Use Redo.io's `similarity_scoring` toolkit to find *similarly situated* defendants in the CDCR resentencing data and compare their sentencing outcomes.

**Research Questions:**

- Among defendants who are highly similar on offense profile and criminal history, do Black or Hispanic defendants receive longer sentences than White defendants?
- For a specific defendant, who are the most similarly situated people in the CDCR data (same kinds of offenses, similar history), and how do their sentences compare, especially by ethnicity?

**CRJA Connection:** This analysis focuses on *pairwise* comparisons of "similarly situated" individuals, which is directly relevant for CRJA motions where public defenders present concrete case comparisons and not just aggregate trends.

**Datasets Used:**

- `demographics.csv`
- `current_commitments.csv`
- `prior_commitments.csv`

**Tools Used:** Redo.io `similarity_scoring` toolkit

This notebook:

- Turns each person’s offense profile and history into a feature vector, using `compute_features()` from the `similarity_scoring` repo.
- Compares these feature vectors between people using cosine, Euclidean, and Tanimoto similarity.
- For a chosen target CDC number, finds the top-K most similar defendants and pulls:
  - their ethnicity (from demographics), and
  - their sentencing information (from current commitments).

## Step 1: Setup & Imports

Add the cloned `similarity_scoring` toolkit to the Python path and import its core functions.

In [2]:
import sys, os

# Tell Python where the similarity_scoring package lives
sys.path.append(os.path.join(os.getcwd(), "similarity_scoring", "similarity_scoring"))

print(os.listdir(os.path.join(os.getcwd(), "similarity_scoring", "similarity_scoring")))

['config.py', 'vector_similarity.py', 'similarity_metrics.py', 'offense_helpers.py', 'compute_metrics.py', '__init__.py', 'run_similarity.py', '__pycache__', 'sentencing_math.py']
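The path setup above can be made a little more defensive. The sketch below is one possible approach, not part of the toolkit: `add_toolkit_to_path` is a hypothetical helper that checks the folder looks like the package before touching `sys.path`, and inserts it at the front so that generic module names such as `config` resolve to the toolkit's copy first.

```python
import sys
from pathlib import Path


def add_toolkit_to_path(toolkit_dir):
    """Put `toolkit_dir` at the front of sys.path (hypothetical helper).

    Raises FileNotFoundError if the folder has no __init__.py, so a wrong
    working directory fails here instead of at a later import.
    """
    toolkit_dir = Path(toolkit_dir).resolve()
    if not (toolkit_dir / "__init__.py").is_file():
        raise FileNotFoundError(f"No package found at {toolkit_dir}")

    path_str = str(toolkit_dir)
    if path_str not in sys.path:
        # Front of the list: the toolkit's `config` wins over any other `config`.
        sys.path.insert(0, path_str)
    return toolkit_dir
```

In this notebook the call would be `add_toolkit_to_path(Path.cwd() / "similarity_scoring" / "similarity_scoring")`, assuming the repo was cloned into the working directory as in the cell above.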


In [3]:
import config as CFG
import compute_metrics as cm
import sentencing_math as sm
from similarity_metrics import (
    euclidean_similarity_named,
    tanimoto_from_named,
    jaccard_on_keys,
)
from vector_similarity import cosine_from_named

print("Imports successful!")

Imports successful!


## Step 2: Define Local Paths and Load CDCR Public Data

For this prototype, we use the public versions of the demographics, current commitments, and prior commitments tables from https://data.world/redoio/

In [None]:
import ssl

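# TEMPORARY: disable SSL certificate verification so we can read from GitHub.
# This applies to every HTTPS request made through the default context in this
# session, so remove it once the certificate issue is resolved.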
ssl._create_default_https_context = ssl._create_unverified_context


In [4]:
import pandas as pd
import numpy as np
from pathlib import Path

In [None]:
# GitHub URLs for public CDCR data
demographics_url = "https://raw.githubusercontent.com/redoio/resentencing_data_initiative/main/data/demographics.csv"
current_url      = "https://raw.githubusercontent.com/redoio/resentencing_data_initiative/main/data/current_commitments.csv"
prior_url        = "https://raw.githubusercontent.com/redoio/resentencing_data_initiative/main/data/prior_commitments.csv"

print("Loading data from GitHub...")

demo_df    = pd.read_csv(demographics_url)
current_df = pd.read_csv(current_url)
prior_df   = pd.read_csv(prior_url)

print(f"Demographics:          {demo_df.shape}")
print(f"Current commitments:   {current_df.shape}")
print(f"Prior commitments:     {prior_df.shape}")

demo_df.head()


In [5]:
data_dir = Path("similarity_scoring/resources")

DATA_PATHS = {
    'demographics': data_dir / "demographics.csv",
    'current_commits': data_dir / "current_commitments.csv",
    'prior_commits': data_dir / "prior_commitments.csv"
}

In [6]:
demo_df    = pd.read_csv(DATA_PATHS['demographics'])
current_df = pd.read_csv(DATA_PATHS['current_commits'])
prior_df   = pd.read_csv(DATA_PATHS['prior_commits'])

print(f"Demographics:          {demo_df.shape}")
print(f"Current commitments:   {current_df.shape}")
print(f"Prior commitments:     {prior_df.shape}")

demo_df.head()



Demographics:          (95476, 16)
Current commitments:   (369125, 34)
Prior commitments:     (191436, 13)


Unnamed: 0,cdcno,ethnicity,controlling offense,description,offense begin date,offense end date,controlling case number,controlling case sentencing county,sentence type,aggregate sentence in months,offense category,eprd mepd month and year,current location,aggregate sentence in years,time served in years,expected release date
0,2cf2a233c4,Black,VC10851(a),Vehicle Theft,2022-12-07,2022-12-07,FVI22003547,San Bernardino,Second Striker,32,Property Crimes,APR24,Central California Women's Facility,2.7,2.6,2025-08-07
1,5a72696541,White,PC187 2nd,Murder 2nd,2012-09-18,2012-09-18,12F06402,Sacramento,Life with Parole,360,Crimes Against Persons,NOV33,Central California Women's Facility,30.0,12.8,2042-09-18
2,7d608b6a4c,White,PC187 2nd,Murder 2nd,2010-05-18,2010-05-18,CM032513,Butte,Life with Parole,300,Crimes Against Persons,SEP28,Central California Women's Facility,25.0,15.2,2035-05-18
3,39c1bc8c2f,Other,PC459,Burglary 1st,1998-01-20,1998-01-20,CM010387,Butte,Third Striker,348,Property Crimes,NOV22,California Institution for Women,29.0,27.5,2027-01-20
4,220f2cdfc5,Black,PC187,Murder 1st,2017-03-21,2017-03-21,BA455966,Los Angeles,Life with Parole,312,Crimes Against Persons,MAY35,California Institution for Women,26.0,8.3,2043-03-21


### **Data Loaded Successfully**

- **95,476** individuals in demographics  
- **369,125** current commitments  
- **191,436** prior commitments  

All three tables are ready for similarity scoring.

## Step 3: Build a Feature Vector for a Single Defendant

Here we use Redo.io's `compute_features()` function to build a feature vector for one person.
This vector summarizes their offense history and sentence profile and is the input to all
similarity metrics (cosine, Euclidean, Tanimoto, etc.).

In [7]:
def compute_feature_vector(uid):
    """
    Wrapper around cm.compute_features to get the feature dict for one defendant.
    """
    feats, _ = cm.compute_features(
        uid=uid,
        demo=demo_df,
        current_df=current_df,
        prior_df=prior_df,
        lists=CFG.OFFENSE_LISTS,
    )
    return feats

In [8]:
# Pick one sample defendant from the demographics table
sample_id = demo_df["cdcno"].iloc[0]
print("Sample CDC Number:", sample_id)

# Compute features + auxiliary info for debugging
feats, aux = cm.compute_features(
    uid=sample_id,
    demo=demo_df,
    current_df=current_df,
    prior_df=prior_df,
    lists=CFG.OFFENSE_LISTS,
)

print("\n=== Feature Vector (for similarity) ===")
for k, v in feats.items():
    print(f"{k:20s} : {v}")

print("\n=== Auxiliary Info (for interpretation/QA) ===")
for k, v in aux.items():
    print(f"{k:20s} : {v}")

Sample CDC Number: 2cf2a233c4

=== Feature Vector (for similarity) ===
desc_nonvio_curr     : 1.0
desc_nonvio_past     : 1.0
severity_trend       : 0.0

=== Auxiliary Info (for interpretation/QA) ===
time_inputs          : TimeInputs(current_sentence_months=32.0, completed_months=31.200000000000003, past_time_months=nan, childhood_months=0.0)
pct_completed        : 97.50000000000001
time_outside         : 0.0
age_value            : nan
counts_by_category   : {'current': {'violent': 0, 'nonviolent': 1, 'other': 0, 'clash': 0}, 'prior': {'violent': 0, 'nonviolent': 2, 'other': 11, 'clash': 0}}
years_elapsed_from_commitments : None
years_elapsed_for_trend : 10.0
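To make the link between the auxiliary counts and the feature vector more concrete, here is a minimal sketch of one reading that is consistent with the output above: `desc_nonvio_*` as the share of nonviolent offenses among violent plus nonviolent ones, ignoring `other` and `clash`. This is an inference from a single example, not the toolkit's documented formula; `nonviolent_share` is a hypothetical helper and should be checked against `compute_metrics.py`.

```python
import math


def nonviolent_share(counts):
    """Nonviolent / (violent + nonviolent); NaN when both counts are zero.

    ASSUMPTION: 'other' and 'clash' are ignored. This is inferred from one
    sample output and is not confirmed by the toolkit source.
    """
    violent = counts.get("violent", 0)
    nonviolent = counts.get("nonviolent", 0)
    denom = violent + nonviolent
    return nonviolent / denom if denom else math.nan


# Counts copied from the auxiliary output for the sample defendant.
counts_by_category = {
    "current": {"violent": 0, "nonviolent": 1, "other": 0, "clash": 0},
    "prior": {"violent": 0, "nonviolent": 2, "other": 11, "clash": 0},
}

inferred = {
    "desc_nonvio_curr": nonviolent_share(counts_by_category["current"]),
    "desc_nonvio_past": nonviolent_share(counts_by_category["prior"]),
}
print(inferred)
```

Both values come out as 1.0, matching the feature vector printed above. Note what this reading would imply: the 11 prior offenses in the `other` category would not influence the similarity features at all.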


## Step 4: Compare Two Defendants Using Similarity Metrics

In [9]:
# Choose two people
id1 = demo_df["cdcno"].iloc[0]
id2 = demo_df["cdcno"].iloc[1]

f1 = compute_feature_vector(id1)
f2 = compute_feature_vector(id2)

print("ID1:", id1)
print("ID2:", id2)

print("\nCosine similarity:   ", cosine_from_named(f1, f2))
print("Euclidean similarity:", euclidean_similarity_named(f1, f2))
print("Tanimoto similarity: ", tanimoto_from_named(f1, f2))
print("Jaccard on keys:     ", jaccard_on_keys(f1, f2))

ID1: 2cf2a233c4
ID2: 5a72696541

Cosine similarity:    0.6901355398841714
Euclidean similarity: 0.49392687973787863
Tanimoto similarity:  0.4878555511603685
Jaccard on keys:      nan


**Interpretation:**

Cosine similarity = 0.69. The two feature vectors (offense mix, severity trend, etc.) point in moderately similar directions.

Euclidean similarity = 0.49. This is a distance-based similarity. It is on a different scale from cosine, so the two numbers should not be compared directly.

Tanimoto similarity = 0.49. This measures overlap in weighted features and is close to the Euclidean value here.

Jaccard = NaN. Jaccard only considers the presence or absence of keys, and our feature vectors can have different keys depending on which fields exist (e.g., `freq_*` may be missing). Note that a standard Jaccard index is 0, not NaN, when two key sets share nothing; it is undefined only when both sets are empty. The NaN here therefore depends on how `jaccard_on_keys` handles its inputs, which should be checked in the toolkit source before treating it as expected for CDCR data.
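For readers who want to see what these four metrics compute, below is a minimal textbook sketch on named feature dicts. It is not the toolkit's implementation: the function names are hypothetical, missing keys are assumed to count as 0, and the distance-to-similarity mapping `1 / (1 + d)` is one common convention that may differ from `euclidean_similarity_named`.

```python
import math


def _aligned(a, b):
    """Align two feature dicts on the union of keys (missing keys -> 0.0, an assumption)."""
    keys = sorted(set(a) | set(b))
    return [a.get(k, 0.0) for k in keys], [b.get(k, 0.0) for k in keys]


def cosine(a, b):
    x, y = _aligned(a, b)
    dot = sum(p * q for p, q in zip(x, y))
    nx = math.sqrt(sum(p * p for p in x))
    ny = math.sqrt(sum(q * q for q in y))
    # A zero-length vector has no direction, so the result is undefined.
    return dot / (nx * ny) if nx and ny else math.nan


def euclidean_similarity(a, b):
    x, y = _aligned(a, b)
    return 1.0 / (1.0 + math.dist(x, y))  # assumed mapping from distance to similarity


def tanimoto(a, b):
    x, y = _aligned(a, b)
    dot = sum(p * q for p, q in zip(x, y))
    denom = sum(p * p for p in x) + sum(q * q for q in y) - dot
    return dot / denom if denom else math.nan


def jaccard_keys(a, b):
    ka, kb = set(a), set(b)
    union = ka | kb
    # 0.0 for disjoint key sets; NaN only when both dicts are empty.
    return len(ka & kb) / len(union) if union else math.nan


# Hypothetical pair of feature vectors, for illustration only.
a = {"desc_nonvio_curr": 1.0, "desc_nonvio_past": 1.0, "severity_trend": 0.0}
b = {"desc_nonvio_curr": 0.0, "desc_nonvio_past": 1.0, "severity_trend": 0.0}
print(cosine(a, b), euclidean_similarity(a, b), tanimoto(a, b), jaccard_keys(a, b))
```

In this sketch, cosine returns NaN whenever one vector is all zeros, which is one possible source of NaN scores when a defendant's features are all 0; whether the toolkit behaves the same way is an assumption to verify.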

## Step 5: Find the Top-K Most Similar Defendants to a Target

Here we compute similarity between a target person and *all other* defendants,
then return the top matches.

In [10]:
def similarity_to_target(target_uid, other_uid):
    """
    Compute cosine similarity between target_uid and other_uid.
    Returns None if either vector is empty.
    """
    f1 = compute_feature_vector(target_uid)
    f2 = compute_feature_vector(other_uid)
    if not f1 or not f2:
        return None
    return cosine_from_named(f1, f2)


def top_k_similar_fast(target_uid, k=10, sample_size=50):
    """
    Super-fast version:
    - Only compares against the first `sample_size` individuals
    - Precomputes the target vector once
    """
    sims = []

    target_feats = compute_feature_vector(target_uid)
    if not target_feats:
        print("No features for target; cannot compute similarity.")
        return []

    candidate_ids = demo_df["cdcno"].unique()[:sample_size]

    for uid in candidate_ids:
        if uid == target_uid:
            continue

        feats = compute_feature_vector(uid)
        if not feats:
            continue

        score = cosine_from_named(target_feats, feats)
        sims.append((uid, score))

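    # NaN scores are kept in `sims`. Because NaN never compares as larger or
    # smaller than another value, this sort cannot reliably order a list that
    # contains them, and NaN entries can take up top-k slots.
    sims = sorted(sims, key=lambda x: x[1], reverse=True)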
    return sims[:k]


In [11]:
target = demo_df["cdcno"].iloc[0]
print("Target ID:", target)

top10 = top_k_similar_fast(target, k=10, sample_size=50)
top10

Target ID: 2cf2a233c4


[('5a72696541', 0.6901355398841714),
 ('7d608b6a4c', nan),
 ('39c1bc8c2f', 1.0),
 ('0ecc570351', 1.0),
 ('03a747e771', 0.6901355398841714),
 ('40d174eb94', nan),
 ('a539d48493', nan),
 ('28fd24c3c4', nan),
 ('8b55645f1c', nan),
 ('21b0fdbf87', nan)]

In [12]:
# Drop NaN similarities
top10_clean = [(uid, score) for uid, score in top10 if not pd.isna(score)]

top10_df = pd.DataFrame(top10_clean, columns=["cdcno", "similarity"])
top10_df

Unnamed: 0,cdcno,similarity
0,5a72696541,0.690136
1,39c1bc8c2f,1.0
2,0ecc570351,1.0
3,03a747e771,0.690136
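Two things stand out in the results above: the raw top-10 list is not in descending order (0.69 appears before 1.0), and only four rows remain after dropping NaN. Both follow from NaN scores entering the sort, since comparisons with NaN are always false and NaN entries occupy slots in the top 10. A minimal sketch of a NaN-safe ranking step is shown below; `top_k_valid` is a hypothetical helper, and the IDs are made up for illustration.

```python
import math


def top_k_valid(scored, k=10):
    """Return the k highest-scoring (uid, score) pairs, skipping None and NaN scores."""
    valid = [
        (uid, score)
        for uid, score in scored
        if score is not None and not math.isnan(score)
    ]
    # sorted() is stable, so ties keep their original order.
    return sorted(valid, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up IDs and scores, for illustration only.
scored = [
    ("a", 0.69),
    ("b", math.nan),
    ("c", 1.0),
    ("d", 1.0),
    ("e", math.nan),
    ("f", 0.2),
]
print(top_k_valid(scored, k=3))
```

Filtering before taking the top k means the function returns k valid matches whenever at least k exist, instead of a shorter list after the fact.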


In [15]:
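# One row per person from current commitments (for summary).
# drop_duplicates keeps the first row in file order, which is not necessarily
# the controlling offense listed in the demographics table.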
current_summary = (
    current_df[[
        "cdcno",
        "sentence from abstract of judgement",
        "offense category",
        "sentencing county",
    ]]
    .drop_duplicates(subset=["cdcno"])
)

# Merge similarities with demographics + sentencing
similar_matches = (
    top10_df
    .merge(demo_df[["cdcno", "ethnicity"]], on="cdcno", how="left")
    .merge(current_summary, on="cdcno", how="left")
    .sort_values("similarity", ascending=False)
)

similar_matches

Unnamed: 0,cdcno,similarity,ethnicity,sentence from abstract of judgement,offense category,sentencing county
1,39c1bc8c2f,1.0,Other,1 Years 4 Months,Drug Crimes,San Bernardino
2,0ecc570351,1.0,Mexican,4 Years,Drug Crimes,San Bernardino
0,5a72696541,0.690136,White,Life with Parole,Crimes Against Persons,Sacramento
3,03a747e771,0.690136,Black,7 Years,Crimes Against Persons,San Diego


In [16]:
import re
import numpy as np

def parse_sentence_to_months(text):
    """
    Convert 'sentence from abstract of judgement' text to months.
    
    Examples:
      '4 Years'            -> 48
      '1 Years 4 Months'   -> 16
      '7 Months'           -> 7
      '36'                 -> 36  (assume already in months)
      'Life with Parole'   -> NaN (we skip life sentences for now)
    """
    if pd.isna(text):
        return np.nan
    
    s = str(text).strip().lower()
    
    # Skip life sentences for now – they aren't directly comparable
    if "life" in s:
        return np.nan
    
    years = 0
    months = 0
    
    # Find years
    m_years = re.search(r'(\d+)\s*year', s)
    if m_years:
        years = int(m_years.group(1))
    
    # Find months
    m_months = re.search(r'(\d+)\s*month', s)
    if m_months:
        months = int(m_months.group(1))
    
    # If we didn't find explicit years/months, maybe it's just a bare number
    if years == 0 and months == 0:
        m_plain = re.search(r'^\s*(\d+)\s*$', s)
        if m_plain:
            months = int(m_plain.group(1))
    
    total_months = years * 12 + months
    if total_months == 0:
        return np.nan
    
    return total_months

# Apply to current commitments
current_df["sentence_months"] = current_df["sentence from abstract of judgement"].apply(parse_sentence_to_months)

current_df[["sentence from abstract of judgement", "sentence_months"]].head(10)

Unnamed: 0,sentence from abstract of judgement,sentence_months
0,2 Years 8 Months,32.0
1,Life with Parole,
2,Life with Parole,
3,Life with Parole,
4,1 Years 4 Months,16.0
5,4 Years,48.0
6,3 Years,36.0
7,Life with Parole,
8,Life with Parole,
9,Life with Parole,


In [17]:
# Build a one-row-per-person summary from current commitments
current_summary = (
    current_df[[
        "cdcno",
        "sentence from abstract of judgement",
        "sentence_months",
        "offense category",
        "sentencing county",
    ]]
    .drop_duplicates(subset=["cdcno"])
)

# Filter to defendants with a numeric (non-NaN) sentence
valid_ids = current_summary[~current_summary["sentence_months"].isna()]["cdcno"].unique()
len(valid_ids), valid_ids[:5]


(64624,
 array(['2cf2a233c4', '39c1bc8c2f', '4151566ebc', 'ec4c0ee237',
        'e7f80e06ab'], dtype=object))

In [19]:
# Choose a concrete target defendant for the case-study
target_id = valid_ids[0]
print("Target CDC Number:", target_id)

top10 = top_k_similar_fast(target_id, k=10, sample_size=500)
top10_df = pd.DataFrame(top10, columns=["cdcno", "similarity"])
top10_df

Target CDC Number: 2cf2a233c4


Unnamed: 0,cdcno,similarity
0,729d47746f,1.0
1,5a72696541,0.690136
2,7d608b6a4c,
3,39c1bc8c2f,1.0
4,0ecc570351,1.0
5,03a747e771,0.690136
6,40d174eb94,
7,a539d48493,
8,28fd24c3c4,
9,8b55645f1c,
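As a next step toward the research question, the matched defendants can be grouped by ethnicity and their sentences summarized. The sketch below is a minimal, hedged illustration using the four matched rows from the merged table earlier in this notebook, with sentences converted to months by the same rules as `parse_sentence_to_months`. The helper `median_sentence_by_group` and the 0.9 similarity cutoff are assumptions made for the example, not part of the toolkit.

```python
import math
from collections import defaultdict
from statistics import median


def median_sentence_by_group(matches, min_similarity=0.9):
    """Median sentence (months) and count per ethnicity among close matches.

    Rows with a missing sentence (None or NaN) are skipped. A NaN similarity
    fails the >= comparison and is skipped as well.
    """
    groups = defaultdict(list)
    for row in matches:
        months = row["sentence_months"]
        if months is None or math.isnan(months):
            continue
        if row["similarity"] >= min_similarity:
            groups[row["ethnicity"]].append(months)
    return {group: (median(vals), len(vals)) for group, vals in groups.items()}


# Rows taken from the merged similarity table above; life sentences are NaN.
matches = [
    {"cdcno": "39c1bc8c2f", "similarity": 1.0, "ethnicity": "Other", "sentence_months": 16},
    {"cdcno": "0ecc570351", "similarity": 1.0, "ethnicity": "Mexican", "sentence_months": 48},
    {"cdcno": "5a72696541", "similarity": 0.690136, "ethnicity": "White", "sentence_months": math.nan},
    {"cdcno": "03a747e771", "similarity": 0.690136, "ethnicity": "Black", "sentence_months": 84},
]

summary = median_sentence_by_group(matches)
print(summary)
```

With one person per group, this output says nothing about disparity. It only shows the shape of the comparison; a meaningful analysis would need many more matches per group and a sentence value tied to the controlling offense.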
