# Community Risk Index

## Setting the Scene

Crime is a complex, multi-dimensional phenomenon driven by economic hardship, social cohesion, physical environment and formal policing capacity. Rather than looking at any single factor in isolation, the **Community Risk Index** brings together 25 variables, ranging from poverty rates to housing vacancy to police per capita, to create a holistic, data-driven ranking of neighbourhoods by their propensity for crime. This approach:

- **Reflects Social Disorganization Theory**, which links poverty, residential turnover, demographic heterogeneity and family structure to weakened informal social control;
- **Incorporates Routine Activity Theory**, which highlights how density, neighbourhood neglect and policing levels affect the “target–guardian” balance.

---

## Why this Problem?

- Crime not only inflicts economic and human costs; it also exacerbates inequality and erodes trust in institutions.
- Policy-makers and community groups need a transparent, quantitative tool to pinpoint at-risk areas and the underlying drivers of that risk, so that they can deploy resources where they will have the greatest impact.

---

## Why this Data?

- The UCI **“Communities & Crime”** dataset contains 147 community-level attributes for 2,200+ U.S. jurisdictions, of which we select 25 high-quality variables covering all theoretical pillars.
- It’s a **single-file, publicly available CSV**—no arduous merging or licensing—and the variables directly map to the social and physical dimensions that criminology research (Shaw & McKay; Cohen & Felson) identifies as critical.

---

## Sub-Indices → Final Index

1. **Socio-economic Disadvantage**  
   (e.g. poverty rate, median income, unemployment)

2. **Residential Instability & Family Structure**  
   (e.g. residential turnover, two-parent family rates)

3. **Ethnic & Cultural Heterogeneity**  
   (racial composition shares)

4. **Housing & Density**  
   (owner-occupancy, vacancy, density, housing age)

5. **Policing Capacity**  
   (officers per capita, vehicles per capita)

Each set of variables is first combined into a **sub-index** using equal weights, with PCA as a robustness check on the weighting; the five sub-indices are then **aggregated** into a single composite score. This workflow follows the ten steps of the **OECD Handbook on Constructing Composite Indicators** and meets the CA specification’s requirement to ground the index in theory, select appropriate data, and construct transparent sub-indices before forming the final measure.
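The sub-index step described above can be sketched on synthetic data. This is only an illustration under stated assumptions: the text does not specify a normalisation, so z-scores are assumed here; the helper names (`zscore`, `equal_weight_index`, `pca_first_component`), the `signs` mapping and the toy values are all hypothetical; and PCA is computed with a plain NumPy SVD rather than any particular library routine.

```python
import numpy as np
import pandas as pd

def zscore(df):
    """Standardise each column to mean 0 and (population) standard deviation 1."""
    return (df - df.mean()) / df.std(ddof=0)

def equal_weight_index(df, signs):
    """Average the z-scored columns after orienting them so higher = more risk."""
    z = zscore(df) * pd.Series(signs)
    return z.mean(axis=1)

def pca_first_component(df, signs):
    """Score each row on the first principal component of the oriented z-scores."""
    z = (zscore(df) * pd.Series(signs)).to_numpy()
    # SVD of the standardised matrix; rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    loadings = vt[0]
    # The sign of a principal axis is arbitrary; orient it so higher = more risk.
    if loadings.sum() < 0:
        loadings = -loadings
    return pd.Series(z @ loadings, index=df.index)

# Synthetic stand-in for one sub-index (values are invented, not real data).
rng = np.random.default_rng(0)
n = 200
hardship = rng.normal(size=n)
toy = pd.DataFrame({
    "PctPopUnderPov": 15 + 5 * hardship + rng.normal(scale=2, size=n),
    "PctUnemployed": 7 + 2 * hardship + rng.normal(scale=1, size=n),
    "medIncome": 40000 - 8000 * hardship + rng.normal(scale=3000, size=n),
})
# Assumed orientation: +1 if higher values mean more risk, -1 otherwise.
signs = {"PctPopUnderPov": 1, "PctUnemployed": 1, "medIncome": -1}

ew = equal_weight_index(toy, signs)
pc1 = pca_first_component(toy, signs)
agreement = ew.corr(pc1)  # a high correlation suggests the weighting choice matters little
print(f"Correlation between equal-weight and PCA scores: {agreement:.3f}")
```

Under these assumptions, the final composite would be formed the same way one level up: place the five sub-index scores side by side as columns and take their equal-weighted mean.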

In [8]:
# Cell 1: Imports & helper functions
import pandas as pd
import numpy as np


In [9]:
def load_data(filepath, encoding='latin1'):
    """Loads the CSV file."""
    return pd.read_csv(filepath, encoding=encoding)

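def recode_implausible_zeros(df, zero_to_nan):
    """Recode zeros to NaN in the listed columns, where zero is not a realistic value.

    Columns in `zero_to_nan` that are absent from `df` are skipped. The frame is
    modified in place and also returned for convenience.
    """
    for col in zero_to_nan:
        if col in df.columns:
            # mask() upcasts integer columns to float as needed, so NaN can be stored
            df[col] = df[col].mask(df[col] == 0)
    return df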
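A small sketch of how the load and recode steps interact, using an in-memory stand-in for the CSV. It assumes the raw file marks missing values with `?` (as the UCI distribution of this dataset does; the local `crimedata.csv` may differ), and the community names and numbers below are invented for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for the real file; values are invented for illustration.
raw = io.StringIO(
    "communityname,medIncome,murdPerPop\n"
    "Alphaville,52000,0\n"
    "Betatown,0,3.2\n"
    "Gammaburg,41000,?\n"
)

# Treat '?' as missing at load time so numeric columns are parsed as numbers
# rather than as strings.
df_toy = pd.read_csv(raw, na_values="?")

# Recode zeros only in columns where zero cannot be a real observation.
for col in ["medIncome"]:
    df_toy[col] = df_toy[col].mask(df_toy[col] == 0)

print(df_toy)
```

Passing `na_values` matters because a column containing `?` would otherwise be read as text, and a comparison such as `df[col] == 0` would then never match the string `"0"`.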

In [10]:
# Define the variables for each category with their rationales
categories = {
    'Socio-economic Disadvantage': {
        'PctPopUnderPov': 'Poverty rate—key strain driver',
        'medIncome': 'Median household income—wealth proxy',
        'PctUnemployed': 'Unemployment rate—economic stress',
        'PctLess9thGrade': 'Very low education—limits mobility',
        'PctNotHSGrad': 'Low completion rate—correlates with crime'
    },
    'Residential Instability & Family Structure': {
        'PctSameHouse85': '% in same home since 1985—inverse instability',
        'PctForeignBorn': '% immigrants—turnover proxy',
        'PctImmigRec5': '% arrived last 5 yrs—acute social change',
        'PctFam2Par': '% two-parent families—youth supervision',
        'PctKids2Par': '% children in two-parent homes'
    },
    'Ethnic & Cultural Heterogeneity': {
        'racepctblack': '% Black residents—demographic mix',
        'racePctWhite': '% White',
        'racePctAsian': '% Asian',
        'racePctHisp': '% Hispanic/Latino'
    },
    'Housing & Density': {
        'PctHousOwnOcc': 'Owner-occupancy—investment & stability proxy',
        'PctHousNoPhone': '% without phone—material deprivation indicator',
        'PctVacantBoarded': '% boarded homes—neighbourhood decay signal',
        'PopDens': 'Density—target availability & anonymity factor',
        'MedYrHousBuilt': 'Housing-stock age—physical environment factor'
    },
    'Crime Outcomes': {
        'murdPerPop': 'Murders per population—violent crime gauge',
        'robbbPerPop': 'Robberies per population',
        'assaultPerPop': 'Assaults per population',
        'larcPerPop': 'Larcenies per population',
        'autoTheftPerPop': 'Vehicle theft per population',
        'arsonsPerPop': 'Arsons per population'
    },
    'Human Capital & Mobility': {
        'PctSameCity85': '% in same city since 1985—city-level stability proxy',
        'PctSameState85': '% in same state since 1985—state-level stability',
        'PctBSorMore': '% with bachelor\'s degree or higher—education level',
        'PctEmploy': 'Employment rate—labour-market engagement',
        'PctWorkMom': '% working mothers—dual-earner family structure proxy'
    }
}


In [11]:
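# Flatten the list of variables to select from the CSV.
# The leading 'Ê' appears to be a stray byte in the file's header as decoded
# under latin1; it is kept because it is part of the column name as loaded.
selected_vars = ['Êcommunityname']  # Include community name as ID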
for category, vars_dict in categories.items():
    selected_vars.extend(vars_dict.keys())

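# Columns where zeros are recoded to NaN (treated as implausible values).
# Assumption: zero crime rates are treated as unreported here, although a true
# zero is possible for rare offences such as murder or arson in small communities.
zero_to_nan = [
    'medIncome', 'PopDens', 'MedYrHousBuilt',
    'murdPerPop', 'robbbPerPop', 'assaultPerPop',
    'larcPerPop', 'autoTheftPerPop', 'arsonsPerPop'
]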

# Load the raw data
df = load_data('crimedata.csv')


In [12]:
# Select only the relevant columns, warning about any that are absent
missing_vars = [var for var in selected_vars if var not in df.columns]
if missing_vars:
    print(f"Warning: The following columns were not found in the data: {missing_vars}")
df_selected = df[[var for var in selected_vars if var in df.columns]].copy()

# Clean the data by recoding implausible zeros
df_clean = recode_implausible_zeros(df_selected, zero_to_nan)
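After recoding, it can help to see how much of each variable is now missing before building sub-indices. The following is a minimal sketch, not part of the original workflow: `missing_share_by_category` is a hypothetical helper, and the toy frame and shortened category dictionary stand in for `df_clean` and `categories`.

```python
import numpy as np
import pandas as pd

def missing_share_by_category(df, categories):
    """Return the share of NaN values per variable, labelled with its category."""
    rows = []
    for category, vars_dict in categories.items():
        for var in vars_dict:
            if var in df.columns:  # skip variables that were not loaded
                rows.append({
                    "Category": category,
                    "Variable": var,
                    "MissingShare": df[var].isna().mean(),
                })
    report = pd.DataFrame(rows)
    return report.sort_values("MissingShare", ascending=False, ignore_index=True)

# Invented values standing in for the cleaned data.
toy = pd.DataFrame({
    "medIncome": [52000, np.nan, 41000, 38000],
    "PopDens": [1200.0, 950.0, np.nan, np.nan],
})
toy_categories = {
    "Socio-economic Disadvantage": {"medIncome": "Median household income"},
    "Housing & Density": {"PopDens": "Density", "MedYrHousBuilt": "Housing-stock age"},
}

report = missing_share_by_category(toy, toy_categories)
print(report)
```

In the notebook itself, the call would be `missing_share_by_category(df_clean, categories)`.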

In [13]:

# ────────────────────────────────────────────────────────────────────────────────
# Cell 6: Create metadata and save outputs
# ────────────────────────────────────────────────────────────────────────────────
# Create metadata DataFrame with variables and their rationales
metadata_rows = []
for category, vars_dict in categories.items():
    for var_name, rationale in vars_dict.items():
        if var_name in df_clean.columns:
            metadata_rows.append({
                'Category': category,
                'Variable': var_name,
                'Rationale': rationale
            })

metadata_df = pd.DataFrame(metadata_rows)

# Save to CSV files
df_clean.to_csv('community_risk_data.csv', index=False)
metadata_df.to_csv('variable_metadata.csv', index=False)
print("Saved CSV files: community_risk_data.csv, variable_metadata.csv")


Saved CSV files: community_risk_data.csv, variable_metadata.csv


In [14]:
# ────────────────────────────────────────────────────────────────────────────────
# Cell 7: Generate Excel file with multiple sheets
# ────────────────────────────────────────────────────────────────────────────────
# Save to Excel with each category in a separate sheet
with pd.ExcelWriter('community_risk_categories.xlsx') as writer:
    # First, save the entire dataset to a sheet
    df_clean.to_excel(writer, sheet_name='All Data', index=False)
    
    # Then save each category to its own sheet
    for category, vars_dict in categories.items():
        sheet_vars = ['Êcommunityname'] + list(vars_dict.keys())  # Include ID column
        sheet_vars = [var for var in sheet_vars if var in df_clean.columns]
        
        if len(sheet_vars) > 1:  # Only create sheet if we have the ID column and at least one variable
            df_clean[sheet_vars].to_excel(writer, sheet_name=category[:31], index=False)
    
    # Add metadata sheet
    metadata_df.to_excel(writer, sheet_name='Variable Metadata', index=False)

print("Saved Excel file: community_risk_categories.xlsx") 

Saved Excel file: community_risk_categories.xlsx
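The export above truncates category names with `category[:31]` because Excel limits sheet names to 31 characters. Truncation alone can produce duplicate names, and Excel also rejects the characters `\ / ? * [ ] :`. A hedged sketch of a more defensive approach follows; `safe_sheet_name` is a hypothetical helper, not something provided by pandas.

```python
import re

def safe_sheet_name(name, used, max_len=31):
    """Return a sheet name that fits Excel's length limit and is unique in `used`."""
    # Replace characters Excel rejects in sheet names: \ / ? * [ ] :
    cleaned = re.sub(r"[\\/?*\[\]:]", "_", name).strip()
    candidate = cleaned[:max_len]
    suffix = 2
    # Excel compares sheet names case-insensitively, so track lowercase forms.
    while candidate.lower() in used:
        tag = f"_{suffix}"
        # Shorten the base so the numeric tag still fits within the limit.
        candidate = cleaned[: max_len - len(tag)] + tag
        suffix += 1
    used.add(candidate.lower())
    return candidate

used = set()
a = safe_sheet_name("Residential Instability & Family Structure", used)
b = safe_sheet_name("Residential Instability & Family Ties", used)
c = safe_sheet_name("A/B: test", used)
print(a, b, c, sep="\n")
```

In the export loop, this would replace `sheet_name=category[:31]` with `sheet_name=safe_sheet_name(category, used)`, with `used` created once before the loop.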
