# Community Risk Index

## Setting the Scene

Crime is a complex, multi-dimensional phenomenon shaped by economic hardship, social cohesion, physical environment and formal policing capacity. Rather than examining any single factor in isolation, the **Community Risk Index** combines 25 variables, ranging from poverty rates to housing vacancy to police per capita, into a data-driven ranking of neighbourhoods by their propensity for crime. This approach:

- **Reflects Social Disorganization Theory**, which links poverty, residential turnover, demographic heterogeneity and family structure to weakened informal social control;
- **Incorporates Routine Activity Theory**, which highlights how density, neighbourhood neglect and policing levels affect the “target–guardian” balance.

---

## Why this Problem?

- Crime inflicts economic and human costs, deepens inequality and erodes trust in institutions.
- Policy-makers and community groups need a transparent, quantitative tool to pinpoint at-risk areas and the underlying drivers of that risk, so that they can deploy resources where they will have the greatest impact.

---

## Why this Data?

- The UCI **“Communities & Crime”** dataset contains 147 community-level attributes for 2,200+ U.S. jurisdictions, from which we select 25 high-quality variables covering all theoretical pillars.
- It is a **single, publicly available CSV file**, so no merging or licensing is required, and its variables map directly to the social and physical dimensions that criminology research (Shaw & McKay; Cohen & Felson) identifies as critical.

---

## Sub-Indices → Final Index

1. **Socio-economic Disadvantage**  
   (e.g. poverty rate, median income, unemployment)

2. **Residential Instability & Family Structure**  
   (e.g. residential turnover, two-parent family rates)

3. **Ethnic & Cultural Heterogeneity**  
   (racial composition shares)

4. **Housing & Density**  
   (owner-occupancy, vacancy, density, housing age)

5. **Policing Capacity**  
   (officers per capita, vehicles per capita)

Each set of variables is first combined into a **sub-index** (using equal weights and PCA for robustness), then the five sub-indices are **aggregated** into a single composite score. This workflow follows the ten-step checklist in the **OECD Handbook on Constructing Composite Indicators** (2008) and meets the CA specification’s requirement to ground an index in theory, select appropriate data, and construct transparent sub-indices before forming the final measure.
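
The two-stage aggregation described above can be sketched on toy data. This is a minimal illustration with equal weights only; the four communities, their values and the two group names are hypothetical, and only two sub-indices are shown to keep it short.

```python
import pandas as pd

# Toy data: four hypothetical communities, two sub-indices with two variables each.
toy = pd.DataFrame(
    {
        "PctPopUnderPov": [5.0, 12.0, 25.0, 30.0],
        "PctUnemployed": [3.0, 6.0, 9.0, 14.0],
        "PopDens": [800.0, 2500.0, 4000.0, 9000.0],
        "PctVacantBoarded": [0.5, 1.0, 4.0, 8.0],
    },
    index=["A", "B", "C", "D"],
)

# Hypothetical grouping of variables into sub-indices.
groups = {
    "disadvantage": ["PctPopUnderPov", "PctUnemployed"],
    "housing_density": ["PopDens", "PctVacantBoarded"],
}

# Step 1: z-score each variable so that different units become comparable.
z = (toy - toy.mean()) / toy.std(ddof=0)

# Step 2: equal-weight mean of the variables inside each sub-index.
sub_indices = pd.DataFrame(
    {name: z[cols].mean(axis=1) for name, cols in groups.items()}
)

# Step 3: equal-weight mean of the sub-indices gives the composite score.
composite = sub_indices.mean(axis=1)

# Rank communities from highest risk (1) to lowest.
ranking = composite.rank(ascending=False).astype(int)
print(pd.DataFrame({"composite": composite.round(2), "rank": ranking}))
```

Because every toy variable rises from community A to community D, D receives rank 1 and A rank 4. With real data, variables where a higher value means lower risk would need to be sign-flipped before step 2.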

In [38]:
# Cell 1: Imports & helper functions
import pandas as pd
import numpy as np


In [39]:
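def load_data(filepath, encoding='latin1'):
    """Load the CSV file into a DataFrame using the given encoding."""
    return pd.read_csv(filepath, encoding=encoding)

def recode_implausible_zeros(df, zero_to_nan):
    """Replace zeros with NaN in the listed columns, where zero is not realistic.

    Columns that are absent from df are skipped. df is modified in place and returned.
    """
    for col in zero_to_nan:
        if col in df.columns:
            # mask() upcasts integer columns to float so they can hold NaN
            df[col] = df[col].mask(df[col] == 0)
    return df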
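
The effect of the zero recoding can be seen on a small in-memory CSV. The sketch below is illustrative only: the three rows are invented, and the use of `?` as a missing-value code is an assumption about how the raw file marks gaps, handled here through the `na_values` parameter of `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical three-row extract; '?' as a missing-value code is an assumption.
raw = io.StringIO(
    "communityname,medIncome,murdPerPop\n"
    "Alphaville,42000,1.2\n"
    "Betatown,0,?\n"
    "Gammaburg,?,0\n"
)

# na_values turns '?' into NaN so that both columns parse as numeric.
toy = pd.read_csv(raw, na_values="?")

# Same idea as recode_implausible_zeros: zeros in the listed columns become NaN.
for col in ["medIncome"]:
    toy[col] = toy[col].mask(toy[col] == 0)

print(toy)
print(toy.isna().sum())
```

Which columns count as "zero is implausible" is a judgment call. A median income of zero is almost certainly a coding gap, whereas a zero crime rate can be genuine for a small community, so this sketch recodes only the income column.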

# Community Risk Index: Foundations & Variable Rationale

# 1. Thought Process & Research Foundations

1. **Social Disorganization Theory** (Shaw & McKay, 1942; Sampson et al., 1997)  
   - Four core community characteristics undermine “collective efficacy”:  
     1. Economic deprivation  
     2. Residential mobility  
     3. Ethnic heterogeneity  
     4. Family disruption  

2. **Routine Activity Theory** (Cohen & Felson, 1979)  
   - Physical environment (density, neglect) and guardianship (formal/informal) shape crime opportunities.

3. **OECD Composite-Index Handbook** (2008)  
   - Select indicators that are:  
     - Theoretically relevant  
     - High-quality (low missingness)  
     - Sufficiently diverse to capture sub-dimensions

4. **Practical constraints**  
   - Dropped police-capacity fields (> 80 % missing)  
   - Selected near-complete crime rates for robust outcomes (e.g. murder, robbery)  
   - Added **Human Capital & Mobility** pillar to proxy informal control when policing data are sparse (Sampson & Wilson, 1995)

---

# 2. Variable-by-Variable Rationale

## Pillar 1: Socio-economic Disadvantage  
| #  | Variable                 | What it measures                         | Why                                                             |
|----|--------------------------|------------------------------------------|-----------------------------------------------------------------|
| 1  | **PctPopUnderPov**       | % below poverty line                     | Poverty → strain & competition erode social norms (Shaw & McKay) |
| 2  | **medIncome**            | Median household income                  | Complements poverty rate; overall wealth level                  |
| 3  | **PctUnemployed**        | Unemployment rate                        | Joblessness → financial stress & idle time → higher crime       |
| 4  | **PctLess9thGrade**      | % without 9th-grade education            | Very low attainment limits legitimate opportunities             |
| 5  | **PctNotHSGrad**         | % without high-school diploma            | Broader slice of low-education disadvantage                     |

## Pillar 2: Residential Instability & Family Structure  
| #  | Variable                | What it measures                           | Why                                                                      |
|----|-------------------------|--------------------------------------------|--------------------------------------------------------------------------|
| 6  | **PctSameHouse85**      | % in same home since 1985                  | Long-term residents build trust & watchful eyes (inverse → turnover)      |
| 7  | **PctForeignBorn**      | % immigrants                               | High immigration flux can slow social-tie formation (Shaw & McKay)        |
| 8  | **PctImmigRec5**        | % of immigrants who arrived in last 5 years | Recent arrivals → acute disruption to networks (Sampson et al.)           |
| 9  | **PctFam2Par**          | % two-parent families                      | Two-parent households → more adult supervision                           |
| 10 | **PctKids2Par**         | % children in two-parent homes             | Focus on youth supervision                                               |

## Pillar 3: Ethnic & Cultural Heterogeneity  
| #   | Variable          | What it measures            | Why                                                                  |
|-----|-------------------|-----------------------------|----------------------------------------------------------------------|
| 11  | **racepctblack**  | % Black residents           | Diverse racial mix → slows shared informal-norm emergence            |
| 12  | **racePctWhite**  | % White residents           | As above                                                             |
| 13  | **racePctAsian**  | % Asian residents           | As above                                                             |
| 14  | **racePctHisp**   | % Hispanic/Latino residents | As above                                                             |
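
The four shares above describe composition, not mix as such: a community that is 95 % one group is homogeneous whichever group it is. One common way to collapse shares into a single heterogeneity score is the Blau (Herfindahl-based) index, 1 − Σp². The sketch below is an illustration of that technique, not the notebook's own method. The percentages are invented, and rescaling each row to sum to 1 is an assumption made because the published shares may not sum to 100 (Hispanic origin can overlap with the race categories, and other groups are omitted).

```python
import pandas as pd

# Hypothetical percentages for three communities (not taken from the dataset).
shares = pd.DataFrame(
    {
        "racepctblack": [2.0, 25.0, 50.0],
        "racePctWhite": [95.0, 25.0, 50.0],
        "racePctAsian": [2.0, 25.0, 0.0],
        "racePctHisp": [1.0, 25.0, 0.0],
    },
    index=["Homogeneous", "EvenFourWay", "EvenTwoWay"],
)

# Rescale each row to proportions that sum to 1 (an assumption; see the note above).
p = shares.div(shares.sum(axis=1), axis=0)

# Blau index: 0 when one group holds everyone, 1 - 1/k for k equally sized groups.
heterogeneity = 1.0 - (p ** 2).sum(axis=1)
print(heterogeneity.round(3))
```

The even four-way split scores 0.75 and the even two-way split 0.5, while the 95 % community scores just under 0.1.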

## Pillar 4: Housing & Density (Routine Activities)  
| #   | Variable                 | What it measures                         | Why                                                                  |
|-----|--------------------------|------------------------------------------|----------------------------------------------------------------------|
| 15  | **PctHousOwnOcc**        | % owner-occupied housing                 | Owners more likely to invest in upkeep (guardianship proxy)          |
| 16  | **PctHousNoPhone**       | % without telephone                      | Material deprivation; isolation from services                        |
| 17  | **PctVacantBoarded**     | % of vacant housing that is boarded up   | Physical decay → attracts offenders, reduces guardianship (Newman)   |
| 18  | **PopDens**              | Population density                       | High density → target availability & anonymity                       |
| 19  | **MedYrHousBuilt**       | Median year housing was built            | Very old → disinvestment; very new → lack of social roots            |

## Pillar 5: Crime Outcomes (Validation)  
| #   | Variable           | Crime rate per population | Why                                  |
|-----|--------------------|---------------------------|--------------------------------------|
| 20  | **murdPerPop**     | Murder                    | Outcome for validation               |
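| 21  | **robbbPerPop**    | Robbery                   | Outcome for validation               |
| 22  | **assaultPerPop**  | Assault                   | Outcome for validation               |
| 23  | **larcPerPop**     | Larceny                   | Outcome for validation               |
| 24  | **autoTheftPerPop**| Auto theft                | Outcome for validation               |
| 25  | **arsonsPerPop**   | Arson                     | Outcome for validation               |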
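
One simple way to use these outcomes for validation is to check whether communities that rank high on the index also rank high on a crime rate. The sketch below computes Spearman's rank correlation as the Pearson correlation of ranks. The five index scores and robbery rates are invented, and `risk_index` is a hypothetical column name for the finished composite.

```python
import pandas as pd

# Hypothetical index scores and one outcome for five communities.
toy = pd.DataFrame(
    {
        "risk_index": [-1.2, -0.4, 0.1, 0.6, 0.9],
        "robbbPerPop": [20.0, 55.0, 40.0, 130.0, 210.0],
    }
)


def spearman(a: pd.Series, b: pd.Series) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return a.rank().corr(b.rank())


rho = spearman(toy["risk_index"], toy["robbbPerPop"])
print(f"Spearman rho = {rho:.2f}")  # 0.90 for this toy data
```

A rank-based measure is used here because crime rates tend to be heavily skewed, so a few extreme communities would otherwise dominate a linear correlation.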

## Pillar 6: Human Capital & Mobility  
| #   | Variable               | What it measures                          | Why                                                                      |
|-----|------------------------|-------------------------------------------|--------------------------------------------------------------------------|
| 26  | **PctSameCity85**      | % in same city since 1985                 | City-level stability → social ties                                        |
| 27  | **PctSameState85**     | % in same state since 1985                | State-level stability → broader social-tie reinforcement                  |
| 28  | **PctBSorMore**        | % with bachelor’s degree or higher        | Higher education → social capital & economic opportunity                 |
| 29  | **PctEmploy**          | % employed                                | Employment → routine guardianship roles & less idle time                |
| 30  | **PctWorkMom**         | % of mothers employed                     | Dual earners indicate economic health (but may affect supervision)      |

---

# 3. How Variables Map to Pillars & Sub-Indices

| Pillar No. | Pillar Name                                    | Conceptual Focus                          | # Variables |
|------------|------------------------------------------------|-------------------------------------------|-------------|
| 1          | Socio-economic Disadvantage                    | Financial strain & deprivation            | 5           |
| 2          | Residential Instability & Family Structure     | Turnover & supervision capacity           | 5           |
| 3          | Ethnic & Cultural Heterogeneity                | Demographic mix                           | 4           |
| 4          | Housing & Density (Routine Activities)         | Physical environment & guardianship       | 5           |
| 5          | Crime Outcomes (Validation)                    | Validation & outcome measurement          | 6           |
| 6          | Human Capital & Mobility                       | Education, employment & broader stability | 5           |

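> **Note:** Within each pillar, variables are normalized (z-score or min–max), oriented so that higher values mean higher risk, and combined into a pillar score via equal-weight averages and PCA. The five predictor pillars (1–4 and 6) are then aggregated (equal weights + second-stage PCA) to yield the final Community Risk Index; Pillar 5 (Crime Outcomes) is held out and used only to validate the index.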
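
As a rough illustration of how an equal-weight pillar score and a PCA-based one relate, the sketch below builds a synthetic pillar of three indicators driven by one latent factor. All data are simulated and the `signs` vector is a hypothetical orientation choice. The PCA score is taken as the projection onto the first principal component, obtained from an SVD of the standardized matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic pillar: three indicators for 200 communities, driven by one latent
# "disadvantage" factor plus noise. The second column mimics an income-like
# variable that falls as disadvantage rises.
latent = rng.normal(size=200)
X = np.column_stack(
    [
        latent + 0.3 * rng.normal(size=200),   # poverty-like
        -latent + 0.3 * rng.normal(size=200),  # income-like (protective)
        latent + 0.3 * rng.normal(size=200),   # unemployment-like
    ]
)

# Orient every column so that higher means more risk, then z-score.
signs = np.array([1.0, -1.0, 1.0])
oriented = X * signs
Z = (oriented - oriented.mean(axis=0)) / oriented.std(axis=0)

# Equal-weight pillar score.
equal_weight = Z.mean(axis=1)

# PCA pillar score: projection onto the first principal component (via SVD).
_, _, vt = np.linalg.svd(Z, full_matrices=False)
loadings = vt[0]
if loadings.sum() < 0:  # the sign of a component is arbitrary, so fix it
    loadings = -loadings
pca_score = Z @ loadings

agreement = np.corrcoef(equal_weight, pca_score)[0, 1]
print(f"correlation between the two pillar scores: {agreement:.3f}")
```

When the indicators in a pillar share one dominant factor, as in this simulation, the two scores are nearly interchangeable. A low correlation on real data would suggest the pillar mixes more than one dimension.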

---

# Expert Advice & Best Practice

- **Missingness:** Select variables with < 10 % missingness to minimize imputation bias (OECD Step 2).  
- **Conceptual validity:** Ground each indicator in peer-reviewed theory (Shaw & McKay; Cohen & Felson; Sampson et al.).  
- **Validation:** Include several crime-type outcomes so that validation does not hinge on a single offence category.  
- **Supplement policing data:** When police data are sparse, use socio-economic proxies (the Human Capital & Mobility pillar) for informal control (Sampson & Wilson, 1995).  
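
The missingness rule above can be applied mechanically before any variable is assigned to a pillar. The sketch below is a minimal version on invented data. `screen_by_missingness` and the column `police_field_example` are hypothetical names, and the 10 % default simply mirrors the threshold stated above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete column, one lightly and one heavily missing.
n = 20
toy = pd.DataFrame(
    {
        "PctPopUnderPov": np.linspace(5, 30, n),
        "medIncome": np.linspace(25e3, 45e3, n),
        "police_field_example": np.full(n, np.nan),
    }
)
toy.loc[3, "medIncome"] = np.nan              # 5 % missing
toy.loc[0, "police_field_example"] = 210.0    # 95 % missing


def screen_by_missingness(df, threshold=0.10):
    """Split columns into (kept, dropped); keep those with share missing < threshold."""
    share_missing = df.isna().mean()
    kept = share_missing[share_missing < threshold].index.tolist()
    dropped = [col for col in df.columns if col not in kept]
    return kept, dropped


kept, dropped = screen_by_missingness(toy)
print("kept:", kept)
print("dropped:", dropped)
```

Here the two socio-economic columns pass the screen and the sparsely populated police column is dropped.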


In [None]:
# ────────────────────────────────────────────────────────────────────────────────
# Cell 3: Define data categories with variables and their rationales
# ────────────────────────────────────────────────────────────────────────────────
# Define the variables for each category with their rationales
categories = {
    'Socio-economic Disadvantage': {
        'PctPopUnderPov': 'Poverty rate—key strain driver',
        'medIncome': 'Median household income—wealth proxy',
        'PctUnemployed': 'Unemployment rate—economic stress',
        'PctLess9thGrade': 'Very low education—limits mobility',
        'PctNotHSGrad': 'Low completion rate—correlates with crime'
    },
    'Residential Instability & Family Structure': {
        'PctSameHouse85': '% in same home since 1985—inverse instability',
        'PctForeignBorn': '% immigrants—turnover proxy',
        'PctImmigRec5': '% arrived last 5 yrs—acute social change',
        'PctFam2Par': '% two-parent families—youth supervision',
        'PctKids2Par': '% children in two-parent homes'
    },
    'Ethnic & Cultural Heterogeneity': {
        'racepctblack': '% Black residents—demographic mix',
        'racePctWhite': '% White',
        'racePctAsian': '% Asian',
        'racePctHisp': '% Hispanic/Latino'
    },
    'Housing & Density': {
        'PctHousOwnOcc': 'Owner-occupancy—investment & stability proxy',
        'PctHousNoPhone': '% without phone—material deprivation indicator',
        'PctVacantBoarded': '% boarded homes—neighbourhood decay signal',
        'PopDens': 'Density—target availability & anonymity factor',
        'MedYrHousBuilt': 'Housing-stock age—physical environment factor'
    },
    'Crime Outcomes': {
        'murdPerPop': 'Murders per population—violent crime gauge',
        'robbbPerPop': 'Robberies per population',
        'assaultPerPop': 'Assaults per population',
        'larcPerPop': 'Larcenies per population',
        'autoTheftPerPop': 'Vehicle theft per population',
        'arsonsPerPop': 'Arsons per population'
    },
    'Human Capital & Mobility': {
        'PctSameCity85': '% in same city since 1985—city-level stability proxy',
        'PctSameState85': '% in same state since 1985—state-level stability',
        'PctBSorMore': '% with bachelor\'s degree or higher—education level',
        'PctEmploy': 'Employment rate—labour-market engagement',
        'PctWorkMom': '% working mothers—dual-earner family structure proxy'
    }
}
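
Column names in this dataset mix casing conventions (for example `racepctblack` next to `racePctWhite`), so a quick check that every requested name exists in the loaded frame can catch typos before selection fails. The helper below is a hypothetical sketch, shown on a two-column toy frame with a deliberately mis-cased key.

```python
import pandas as pd


def missing_columns(df, categories):
    """Map each category to the requested columns that are absent from df."""
    absent = {}
    for category, variables in categories.items():
        not_found = [name for name in variables if name not in df.columns]
        if not_found:
            absent[category] = not_found
    return absent


# Hypothetical two-column frame and a trimmed-down categories dict.
toy = pd.DataFrame({"racepctblack": [10.0], "racePctWhite": [80.0]})
toy_categories = {
    "Ethnic & Cultural Heterogeneity": {
        "racePctBlack": "% Black residents",  # wrong casing on purpose
        "racePctWhite": "% White",
    }
}

result = missing_columns(toy, toy_categories)
print(result)  # {'Ethnic & Cultural Heterogeneity': ['racePctBlack']}
```

An empty dictionary means every requested column was found.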

In [None]:

# ────────────────────────────────────────────────────────────────────────────────
# Cell 4: Prepare variable lists and load data
# ────────────────────────────────────────────────────────────────────────────────
# Flatten the list of variables to select from the CSV
# Keep the leading 'Ê': the name must match the CSV header exactly as pandas reads it
selected_vars = ['Êcommunityname']  # Community name, used as the ID column
for vars_dict in categories.values():
    selected_vars.extend(vars_dict.keys())

# Columns where zeros should become NaN (implausible values)
zero_to_nan = [
    'medIncome', 'PopDens', 'MedYrHousBuilt',
    'murdPerPop', 'robbbPerPop', 'assaultPerPop', 
    'larcPerPop', 'autoTheftPerPop', 'arsonsPerPop'
]

# Load the raw data
df = load_data('crimedata.csv')