# Census Data: Economic and Housing Characteristics

## Data Acquisition

**Source:** U.S. Census Bureau - American Community Survey 5-Year Estimates (2019-2023)  
**Acquisition Date:** December 2024  
**Method:** API access via `tidycensus` R package  
**Geographic Level:** Census Tract (Onondaga County, NY)  
**API Documentation:** https://api.census.gov/data/2023/acs/acs5/profile  
**tidycensus Package:** https://walker-data.com/tidycensus/

### Why We Used R for Census Data

We acquired the Census data with R's `tidycensus` package rather than Python because `tidycensus` is purpose-built for the U.S. Census Bureau APIs. It simplifies authenticating with the Census API, automatically handles variable code lookups, and retrieves both estimates and margins of error in a structured format. After downloading the data in R, we exported it to CSV for subsequent analysis in Python. This hybrid approach lets us use R's specialized Census tools while keeping our primary analysis workflow in Python.

### Tables Downloaded

We downloaded two American Community Survey (ACS) data profile tables for all Census tracts in Onondaga County, New York. **DP03 (Selected Economic Characteristics)** provides data on median household income, poverty rates, SNAP (food stamp) usage, and employment status. **DP04 (Selected Housing Characteristics)** contains information on housing age, specifically the percentage of housing units built before 1980 and before 1960, which serves as a proxy for heating system efficiency and weatherization needs. These tract-level data will be aggregated to the neighborhood level to align with our heating violations and rental registry analyses.
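For reference, the same tables can also be requested from the Census API directly, without `tidycensus`. The sketch below only builds the request URL for tract-level data in Onondaga County (state FIPS 36, county FIPS 067, matching the GEOIDs in our files); it does not send the request. The function name `build_profile_query` and its parameters are illustrative and are not part of our R workflow.

```python
from urllib.parse import urlencode

# Endpoint listed under "API Documentation" above
BASE_URL = "https://api.census.gov/data/2023/acs/acs5/profile"


def build_profile_query(variables, state_fips="36", county_fips="067", api_key=None):
    """Build a Census API URL for all tracts in one county (hypothetical helper)."""
    params = [
        ("get", ",".join(["NAME", *variables])),  # NAME returns the tract label
        ("for", "tract:*"),                       # every tract ...
        ("in", f"state:{state_fips}"),            # ... within this state
        ("in", f"county:{county_fips}"),          # ... and this county
    ]
    if api_key:
        params.append(("key", api_key))
    # Keep ':', '*' and ',' unescaped so the query stays readable
    return f"{BASE_URL}?{urlencode(params, safe=':*,')}"


url = build_profile_query(["DP03_0062E"])
print(url)
```

Any HTTP client can fetch the resulting URL; the API responds with a JSON array whose first row holds the column names.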

### Census Tract to Neighborhood Mapping

Census data is organized by Census Tract (approximately 70 tracts in Onondaga County), while our analysis requires neighborhood-level aggregation (roughly 35 Syracuse neighborhoods). We will map Census tracts to neighborhoods using spatial joins based on geographic boundaries, then aggregate tract-level statistics to the neighborhood level using a weighting method suited to each variable. This process keeps all datasets in our final vulnerability index on the same geography.

# Census Data Files: Documentation

## Overview

We acquired Census data from the American Community Survey (ACS) 5-Year Estimates (2019-2023) using R's `tidycensus` package. The data was downloaded at the **Census Tract level** for Onondaga County, NY, then processed and aggregated to the **neighborhood level** to align with our Syracuse Open Data analysis.

---

## File Descriptions

### 1. `dp03.csv` - Raw Economic Characteristics (DP03 Table)

**Source:** ACS 5-Year Data Profile - Selected Economic Characteristics  
**Geographic Level:** Census Tract  
**Records:** ~70 tracts in Onondaga County  

**Content:**
- Raw data from the Census DP03 table
- Includes ALL economic variables from the data profile
- Each row = 1 Census tract
- Columns include estimates (E suffix) and margins of error (M suffix)

**Key Variables (examples):**
- `DP03_0062E` - Median household income (estimate)
- `DP03_0128P` - Poverty rate percentage
- `DP03_0074P` - SNAP (food stamps) households percentage

**Note:** This is the complete raw table; only the variables we need are carried into `tract_features.csv`.

---

### 2. `dp04.csv` - Raw Housing Characteristics (DP04 Table)

**Source:** ACS 5-Year Data Profile - Selected Housing Characteristics  
**Geographic Level:** Census Tract  
**Records:** ~70 tracts in Onondaga County  

**Content:**
- Raw data from the Census DP04 table
- Includes ALL housing variables from the data profile
- Each row = 1 Census tract
- Columns include estimates (E suffix) and margins of error (M suffix)

**Key Variables (examples):**
- Housing age by decade (year structure built)
- Percentages for housing built pre-1980, pre-1960, pre-1940
- Occupancy status
- Housing costs

**Note:** This is the complete raw table; the housing age summaries derived from it are stored in `tract_features.csv`.

---

### 3. `tract_features.csv` - Processed Census Tract Features

**Source:** Derived from dp03.csv and dp04.csv (processed in R)  
**Geographic Level:** Census Tract  
**Records:** 69 tracts (Syracuse-area tracts only)  

**Content:**
This file contains the **cleaned and processed** economic and housing variables we selected for our vulnerability analysis. Each tract that intersects with Syracuse neighborhoods is included.

**Columns:**
- `GEOID` - Census tract identifier (11-digit FIPS code)
- `NAME` - Census tract name (e.g., "Census Tract 1; Onondaga County; New York")
- `median_household_income` - Median household income (dollars)
- `poverty_rate_pct` - Percentage of population below poverty line
- `snap_households_pct` - Percentage of households receiving SNAP benefits
- `pct_housing_pre_1980` - Percentage of housing built before 1980
- `pct_housing_pre_1960` - Percentage of housing built before 1960
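
Because `GEOID` packs the state, county, and tract codes into fixed-width positions, it is safest to treat it as a string whenever the file is read. The sketch below illustrates the layout; `split_tract_geoid` is a hypothetical helper name, not something in our pipeline.

```python
def split_tract_geoid(geoid) -> dict:
    """Split an 11-digit tract GEOID into state (2), county (3), and tract (6) codes."""
    # Restore leading zeros that are lost if the ID was parsed as an integer
    text = str(geoid).zfill(11)
    if len(text) != 11 or not text.isdigit():
        raise ValueError(f"not an 11-digit tract GEOID: {geoid!r}")
    tract = text[5:]
    # Tract codes are a 4-digit base plus a 2-digit suffix, e.g. 000501 -> "5.01"
    base, suffix = int(tract[:4]), tract[4:]
    label = str(base) if suffix == "00" else f"{base}.{suffix}"
    return {"state": text[:2], "county": text[2:5], "tract": tract, "label": label}


parts = split_tract_geoid(36067000501)
print(parts)
```

With pandas, passing `dtype={"GEOID": str}` to `pd.read_csv` keeps the identifier as text from the start.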

**Purpose:** 
- Filtered to only Syracuse-relevant tracts
- Selected only the variables needed for vulnerability scoring
- Combined economic (DP03) and housing (DP04) data
- Ready for tract-to-neighborhood mapping

---

### 4. `tract_neighborhood_map.csv` - Tract to Neighborhood Crosswalk

**Source:** Spatial join between Census tracts and Syracuse neighborhoods (processed in R using `sf` package)  
**Geographic Level:** Census Tract  
**Records:** 69 tracts  

**Content:**
This is a **crosswalk file** that maps each Census tract to a Syracuse neighborhood. This mapping was created using spatial analysis:
1. Tract centroids were calculated
2. Centroids were spatially joined to neighborhoods
3. Tracts whose centroids fell outside every neighborhood were assigned to the neighborhood with the largest overlap area

**Columns:**
- `GEOID` - Census tract identifier
- `Name` - Syracuse neighborhood name assigned to this tract

**Example:**
```
GEOID           Name
36067000100     Downtown
36067000200     University Hill
36067000300     Eastwood
```

**Purpose:**
- Enables aggregation from tract level → neighborhood level
- Many-to-one mapping (each tract is assigned to exactly one neighborhood; a neighborhood can contain several tracts)
- Handles boundary issues where tracts span multiple neighborhoods

**Quality Note:** All 69 tracts have been assigned to neighborhoods (no missing values)
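
The properties claimed for the crosswalk (every tract present once, none unassigned) can be checked with a few lines of pandas. This is a minimal sketch: `check_crosswalk` is a hypothetical helper, and the rows are toy data in the layout of `tract_neighborhood_map.csv` (the fourth assignment is made up for illustration).

```python
import pandas as pd


def check_crosswalk(crosswalk: pd.DataFrame) -> dict:
    """Summarise the integrity of a tract-to-neighborhood crosswalk."""
    return {
        "n_tracts": len(crosswalk),
        # A tract listed twice would be double-counted during aggregation
        "duplicate_geoids": int(crosswalk["GEOID"].duplicated().sum()),
        # A missing name means the spatial join left the tract unassigned
        "unassigned_tracts": int(crosswalk["Name"].isna().sum()),
        "n_neighborhoods": int(crosswalk["Name"].nunique()),
    }


example = pd.DataFrame({
    "GEOID": ["36067000100", "36067000200", "36067000300", "36067000400"],
    "Name": ["Downtown", "University Hill", "Eastwood", "Eastwood"],
})
report = check_crosswalk(example)
print(report)
```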

---

### 5. `neighborhood_features.csv` - Aggregated Neighborhood-Level Features

**Source:** Aggregated from `tract_features.csv` using `tract_neighborhood_map.csv` (processed in R)  
**Geographic Level:** Neighborhood  
**Records:** ~35 neighborhoods  

**Content:**
This is the **final aggregated dataset** at the neighborhood level. Census tract data was aggregated to neighborhoods using appropriate statistical methods for each variable type.

**Columns:**
- `neighborhood` - Syracuse neighborhood name (matches heating + rentals data)
- `n_tracts` - Number of Census tracts contributing to this neighborhood
- `neighborhood_population` - Total population (sum of tract populations)
- `median_household_income` - Median of tract medians (dollars)
- `poverty_rate_pct` - Population-weighted mean of tract poverty rates
- `snap_households_pct` - Population-weighted mean of tract SNAP rates
- `pct_housing_pre_1980` - Simple mean of tract percentages
- `pct_housing_pre_1960` - Simple mean of tract percentages

**Aggregation Methods:**
- **Median household income:** Median of tract-level medians (income distributions are skewed)
- **Poverty rate:** Population-weighted mean (larger tracts weighted more)
- **SNAP rate:** Population-weighted mean (larger tracts weighted more)
- **Housing age:** Simple mean of tract percentages (unweighted, so each tract counts equally regardless of how many housing units it contains)
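
The aggregation itself was done in R, but the same rules can be sketched in pandas. The example below assumes a tract-level `population` column (hypothetical; it is not among the documented columns of `tract_features.csv`) and uses toy values rather than real Census figures.

```python
import pandas as pd


def aggregate_to_neighborhoods(tract_features, tract_map, pop_col="population"):
    """Roll tract features up to neighborhoods using the rules listed above."""
    merged = tract_features.merge(
        tract_map.rename(columns={"Name": "neighborhood"}),
        on="GEOID", how="inner", validate="one_to_one",
    )
    # Pre-multiply rates by population so a grouped sum yields a weighted mean
    merged = merged.assign(
        poverty_x_pop=merged["poverty_rate_pct"] * merged[pop_col],
        snap_x_pop=merged["snap_households_pct"] * merged[pop_col],
    )
    grouped = merged.groupby("neighborhood")
    population = grouped[pop_col].sum()
    return pd.DataFrame({
        "n_tracts": grouped["GEOID"].count(),
        "neighborhood_population": population,
        "median_household_income": grouped["median_household_income"].median(),
        "poverty_rate_pct": grouped["poverty_x_pop"].sum() / population,
        "snap_households_pct": grouped["snap_x_pop"].sum() / population,
        "pct_housing_pre_1980": grouped["pct_housing_pre_1980"].mean(),
        "pct_housing_pre_1960": grouped["pct_housing_pre_1960"].mean(),
    }).reset_index()


# Toy tracts: two in one neighborhood, one in another
tracts = pd.DataFrame({
    "GEOID": ["A", "B", "C"],
    "population": [1000, 3000, 2000],
    "median_household_income": [40000, 60000, 50000],
    "poverty_rate_pct": [10.0, 30.0, 20.0],
    "snap_households_pct": [20.0, 40.0, 10.0],
    "pct_housing_pre_1980": [80.0, 60.0, 90.0],
    "pct_housing_pre_1960": [50.0, 30.0, 70.0],
})
crosswalk = pd.DataFrame({"GEOID": ["A", "B", "C"],
                          "Name": ["Eastwood", "Eastwood", "Downtown"]})
result = aggregate_to_neighborhoods(tracts, crosswalk)
print(result)
```

In the toy data, the two Eastwood tracts have poverty rates of 10% and 30%. Their simple mean is 20%, but the population-weighted mean is 25% because the larger tract has the higher rate.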

**Purpose:**
- Ready to merge with `heating_plus_rentals.csv`
- Neighborhood names standardized to match Syracuse Open Data
- Contains economic vulnerability and housing age metrics for vulnerability index

---

## Data Flow Summary
```
RAW CENSUS DATA (API)
    ↓
dp03.csv (Economic - all variables, all tracts)
dp04.csv (Housing - all variables, all tracts)
    ↓
tract_features.csv (Selected variables, Syracuse tracts only)
    ↓
tract_neighborhood_map.csv (Spatial mapping: tract → neighborhood)
    ↓
neighborhood_features.csv (Aggregated to neighborhood level)
    ↓
READY TO MERGE WITH heating_plus_rentals.csv
```

---

## Variable Selection Rationale

### Economic Vulnerability Indicators:
1. **Median Household Income** - Direct measure of economic resources
2. **Poverty Rate** - Percentage below federal poverty line
3. **SNAP Rate** - Indicator of food insecurity and low income

**Why these three?**
- Together they capture different dimensions of economic hardship
- Median income shows central tendency but can miss extremes
- Poverty rate captures proportion in severe hardship
- SNAP usage indicates households receiving government assistance

### Housing Age Indicators:
1. **Pre-1980 Housing** - Likely to have inefficient heating systems
2. **Pre-1960 Housing** - Very likely to need weatherization

**Why housing age matters for winter vulnerability:**
- Older buildings tend to have poorer insulation
- Their heating systems often predate energy efficiency standards
- Heating failures are more likely
- Energy costs for residents are typically higher

---

## Geographic Alignment Notes

**Challenge:** Census tracts don't align perfectly with neighborhood boundaries

**Solution Approach:**
1. Used tract centroids (geographic center points)
2. Assigned tract to neighborhood if centroid falls within neighborhood polygon
3. For 14 tracts with centroids outside neighborhoods, calculated overlap area and assigned to neighborhood with maximum overlap
4. Result: All 69 Syracuse tracts assigned to exactly one neighborhood

**Limitation:** Some neighborhoods span multiple tracts; some tracts span multiple neighborhoods. Our aggregation assumes tract characteristics are uniform within the tract, which is an approximation.
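
The assignment logic in the steps above (centroid first, largest overlap as a fallback) can be illustrated without any GIS library. In this toy sketch, axis-aligned boxes `(min_x, min_y, max_x, max_y)` stand in for real polygons; the actual mapping was done on true boundaries with R's `sf` package, and every name and coordinate here is hypothetical.

```python
def centroid(box):
    """Center point of an axis-aligned box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def contains(box, point):
    """True if the point lies inside the box (edges included)."""
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]


def overlap_area(a, b):
    """Area shared by two boxes, or 0 if they do not intersect."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def assign_tracts(tracts, neighborhoods):
    """Map each tract to one neighborhood: centroid first, then largest overlap."""
    assignment = {}
    for geoid, tract_box in tracts.items():
        point = centroid(tract_box)
        hit = next((name for name, box in neighborhoods.items()
                    if contains(box, point)), None)
        if hit is None:
            # Fallback: pick the neighborhood sharing the most area with the tract
            best = max(neighborhoods,
                       key=lambda name: overlap_area(tract_box, neighborhoods[name]))
            hit = best if overlap_area(tract_box, neighborhoods[best]) > 0 else None
        assignment[geoid] = hit
    return assignment


neighborhoods = {"A": (0, 0, 4, 4), "B": (6, 0, 10, 4)}
tracts = {
    "T1": (1, 1, 3, 3),      # centroid inside A
    "T2": (3, 0, 8, 4),      # centroid in the gap; overlaps B more than A
    "T3": (20, 20, 22, 22),  # touches no neighborhood
}
assignment = assign_tracts(tracts, neighborhoods)
print(assignment)
```

Tracts that overlap no neighborhood at all come back as `None`, which is how tracts outside the city would be identified and dropped.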

---

## Next Steps

This Census data will be merged with:
- `heating_violations_by_neighborhood.csv` (heating risk scores)
- `rental_compliance_by_neighborhood.csv` (landlord compliance)

The merged data will be used to build the **final Winter Vulnerability Index**, which combines:
- Heating violations (25 points)
- Rental non-compliance (20 points)  
- Economic vulnerability (25 points)
- Housing age (30 points)

**Total: 100-point vulnerability score per neighborhood**
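
The point allocation above fixes the weight of each component but not how raw values become points. One possible scheme, shown purely as an assumption, is to min-max scale each component to the 0-1 range across neighborhoods and multiply by its point weight. The component keys and the helper names below are hypothetical.

```python
# Point weights taken from the list above
WEIGHTS = {
    "heating_violations": 25,
    "rental_noncompliance": 20,
    "economic_vulnerability": 25,
    "housing_age": 30,
}


def min_max(values):
    """Scale values to 0-1; a constant column maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)
    return [(v - low) / (high - low) for v in values]


def vulnerability_scores(components):
    """Combine raw component values (higher = more vulnerable) into 0-100 scores.

    `components` maps each key in WEIGHTS to a list with one value per neighborhood.
    """
    n = len(next(iter(components.values())))
    totals = [0.0] * n
    for name, weight in WEIGHTS.items():
        for i, scaled in enumerate(min_max(components[name])):
            totals[i] += weight * scaled
    return totals


# Three toy neighborhoods: lowest, middle, and highest on every component
toy = {name: [0, 5, 10] for name in WEIGHTS}
scores = vulnerability_scores(toy)
print(scores)
```

Under this assumed scheme, a neighborhood that ranks highest on every component scores 100 and one that ranks lowest on every component scores 0.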

```python
from pathlib import Path

import numpy as np
import pandas as pd

# Folder holding the CSV files exported from R
DATA_DIR = Path('/Users/saiswethalakkoju/Downloads/Open_Data_Project')

# ==============================================================================
# LOAD ALL CENSUS DATA
# ==============================================================================

print("Loading Census data...")
print("="*60)

# Load tract features (processed in R - has economic + housing data)
tract_features = pd.read_csv(DATA_DIR / 'tract_features.csv')
print("\n✓ Tract Features loaded")
print(f"  Shape: {tract_features.shape}")
print(f"  Columns: {tract_features.columns.tolist()}")
print("\nFirst few rows:")
print(tract_features.head())

# Load tract-to-neighborhood mapping
tract_neighborhood_map = pd.read_csv(DATA_DIR / 'tract_neighborhood_map.csv')
print("\n✓ Tract-Neighborhood Map loaded")
print(f"  Shape: {tract_neighborhood_map.shape}")
print(f"  Unique tracts: {tract_neighborhood_map['GEOID'].nunique()}")
print(f"  Unique neighborhoods: {tract_neighborhood_map['Name'].nunique()}")
print("\nFirst few rows:")
print(tract_neighborhood_map.head())

# Load neighborhood features (aggregated in R)
neighborhood_features = pd.read_csv(DATA_DIR / 'neighborhood_features.csv')
print("\n✓ Neighborhood Features loaded")
print(f"  Shape: {neighborhood_features.shape}")
print(f"  Columns: {neighborhood_features.columns.tolist()}")
print("\nFirst few rows:")
print(neighborhood_features.head())

# Check for missing values in key columns
print("\n" + "="*60)
print("DATA QUALITY CHECK")
print("="*60)

print("\nTract Features - Missing values:")
print(tract_features.isnull().sum())

print("\nNeighborhood Features - Missing values:")
print(neighborhood_features.isnull().sum())

print("\nNeighborhood Features - Summary statistics:")
print(neighborhood_features.describe())

print("\n" + "="*60)
print("✓ ALL CENSUS DATA LOADED SUCCESSFULLY!")
print("="*60)
```

```
Loading Census data...

✓ Tract Features loaded
  Shape: (142, 7)
  Columns: ['GEOID', 'NAME', 'median_household_income', 'poverty_rate_pct', 'snap_households_pct', 'pct_housing_pre_1980', 'pct_housing_pre_1960']

First few rows:
         GEOID                                          NAME  \
0  36067000100     Census Tract 1; Onondaga County; New York   
1  36067000200     Census Tract 2; Onondaga County; New York   
2  36067000300     Census Tract 3; Onondaga County; New York   
3  36067000400     Census Tract 4; Onondaga County; New York   
4  36067000501  Census Tract 5.01; Onondaga County; New York   

   median_household_income  poverty_rate_pct  snap_households_pct  \
0                    88667              11.8                 10.7   
1                    37872              39.3                 30.7   
2                    68021              27.0                 16.4   
3                    68457              14.4                 16.7   
4                    30104
```


## Component 3: Census Data - Economic Vulnerability & Housing Age

**Status:** Complete  
**Coverage:** 35 neighborhoods (100% coverage)  
**Source:** American Community Survey 5-Year Estimates (2019-2023)  

### Data Acquisition Method

Census data was acquired using R's `tidycensus` package to access the Census Bureau API. We downloaded two data profile tables at the Census tract level for Onondaga County:
- **DP03:** Selected Economic Characteristics
- **DP04:** Selected Housing Characteristics

### Geographic Alignment Challenge

Census data is organized by Census tract (~70 tracts in Onondaga County), while our analysis requires neighborhood-level aggregation (~35 Syracuse neighborhoods). We performed spatial mapping:

1. **Loaded tract geometry:** Census TIGER shapefiles for tract boundaries
2. **Loaded neighborhood geometry:** Syracuse Open Data neighborhood boundaries
3. **Coordinate transformation:** Aligned both datasets to WGS 84 (EPSG:4326)
4. **Spatial join:** Assigned tracts to neighborhoods using tract centroids
5. **Handled edge cases:** For 14 tracts with centroids outside neighborhoods, calculated overlap area and assigned to neighborhood with maximum overlap
6. **Result:** All 69 Syracuse-area tracts mapped to exactly one neighborhood

### Variables Extracted

**Economic Vulnerability Indicators:**
- Median household income
- Poverty rate (percentage below federal poverty line)
- SNAP usage rate (percentage of households receiving food stamps)

**Housing Age Indicators:**
- Percentage of housing built before 1980
- Percentage of housing built before 1960

**Rationale:** Older housing stock (pre-1980) predates modern energy efficiency standards and is more likely to have inadequate insulation and inefficient heating systems.

### Aggregation Methodology

Tract-level data was aggregated to neighborhoods using appropriate methods for each variable type:
- **Median household income:** Median of tract-level medians (income distributions are skewed)
- **Poverty rate:** Population-weighted mean (larger tracts weighted proportionally)
- **SNAP rate:** Population-weighted mean
- **Housing age:** Simple mean of tract percentages (unweighted, so each tract counts equally regardless of how many housing units it contains)

### Key Findings

- Economic vulnerability and housing age are geographically concentrated
- Neighborhoods with high heating violations also show elevated poverty rates
- Significant proportion of Syracuse housing stock predates energy efficiency standards
- Census features are complete (no missing values) for every neighborhood that received at least one tract in the crosswalk

### Merging with Heating and Rental Data

Census neighborhood features were merged with the previously integrated heating and rental dataset using neighborhood name as the join key. Before merging, a naming conflict was resolved: "Hawley Green" and "Hawley-Green" appeared as separate entries in the heating and rental data due to inconsistent naming in the source violations dataset. These were combined into a single neighborhood by summing their violation and rental counts. The "Unknown" category was also removed as it does not represent a real neighborhood.

The final merge was performed as an outer join to ensure no neighborhoods were lost from either side. The resulting dataset contains 32 neighborhoods. Of these, 28 neighborhoods have complete data across all three sources. Franklin Square and Hawley Green have heating and rental data but no Census coverage, likely because they were not captured in the tract-to-neighborhood spatial mapping. South Campus and Winkworth have Census data only, as no heating violations or rental registry entries were recorded for these areas during the analysis period.
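
The cleaning and merge steps described above can be sketched in pandas as follows. The column names `violation_count` and `rental_count`, the helper names, and all values are hypothetical toy data, not the project's actual figures.

```python
import pandas as pd


def clean_heating_rentals(df: pd.DataFrame) -> pd.DataFrame:
    """Unify the Hawley Green spelling, drop 'Unknown', and sum duplicate rows."""
    out = df.copy()
    out["neighborhood"] = out["neighborhood"].replace({"Hawley-Green": "Hawley Green"})
    out = out[out["neighborhood"] != "Unknown"]
    # Rows that now share a name are combined by summing their counts
    return out.groupby("neighborhood", as_index=False).sum(numeric_only=True)


def merge_with_census(heating_rentals: pd.DataFrame, census: pd.DataFrame) -> pd.DataFrame:
    """Outer join so neighborhoods missing from either side are kept and flagged."""
    return heating_rentals.merge(census, on="neighborhood", how="outer", indicator=True)


heating_rentals = pd.DataFrame({
    "neighborhood": ["Eastwood", "Hawley Green", "Hawley-Green", "Unknown"],
    "violation_count": [5, 3, 2, 7],
    "rental_count": [10, 4, 1, 9],
})
census = pd.DataFrame({
    "neighborhood": ["Eastwood", "Winkworth"],
    "poverty_rate_pct": [20.0, 5.0],
})
merged = merge_with_census(clean_heating_rentals(heating_rentals), census)
print(merged)
```

The `_merge` column added by `indicator=True` labels each row as `both`, `left_only`, or `right_only`, which makes it easy to list neighborhoods that lack Census coverage or lack heating and rental records.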