# Assignment 2 - Task 2: Data Visualization & Analytics

**Course:** CS424 - Visualization & Visual Analytics (Fall 2025)  
**Dataset:** San Francisco Assessor Historical Secured Property Tax Rolls  
**Visualization Library:** Altair/Vega-Lite (Declarative Grammar)

## IMPORTANT: Read and Follow All Instructions

### Setup Requirements

**Before running this notebook, complete these steps in order:**

1. Run Section 1 completely to load data and verify column names
2. Review the COLUMNS dictionary output and update it if your column names differ
3. Verify all required columns exist in your dataset:
   - Temporal: `year`, `closed_roll_year`
   - Financial: `total_assessed_value`, `assessed_land_value`, `assessed_improvement_value`
   - Calculated: `land_value_pct`, `building_age`, `value_per_sqft`
   - Categorical: `property_class_code_definition`, `neighborhood`, `property_area`
   - Physical: `number_of_bedrooms`, `building_square_feet`
4. Run all cells sequentially (do not skip cells)
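
The column check in step 3 can be sketched as a small helper. This is a minimal illustration, not part of the graded notebook: `find_missing_columns`, `REQUIRED_COLUMNS`, and `example_columns` are hypothetical names, and the example list stands in for `df.columns.tolist()`.

```python
def find_missing_columns(available, required):
    """Return the required column names that are absent, in the order given."""
    available = set(available)
    return [name for name in required if name not in available]


# Required columns, copied from the setup checklist above
REQUIRED_COLUMNS = [
    'year', 'closed_roll_year',
    'total_assessed_value', 'assessed_land_value', 'assessed_improvement_value',
    'land_value_pct', 'building_age', 'value_per_sqft',
    'property_class_code_definition', 'neighborhood', 'property_area',
    'number_of_bedrooms', 'building_square_feet',
]

# Hypothetical stand-in for df.columns.tolist()
example_columns = ['year', 'total_assessed_value', 'neighborhood']

missing = find_missing_columns(example_columns, REQUIRED_COLUMNS)
print(f"{len(missing)} required columns are missing:")
for name in missing:
    print(f"  - {name}")
```

In the notebook itself, passing `df.columns` as the first argument would give the same report for the loaded dataset.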

### Assignment Requirements Met

This notebook fulfills all Task 2 requirements:

- **Visualization Library:** Altair/Vega-Lite exclusively (no matplotlib, seaborn, or plotly)
- **Exploratory Visualizations:** 8 out of 15 total (53%, exceeds 50% requirement)
- **Assignment 1 Retention:** 7.5 out of 13 sketches implemented (58%, exceeds 50% requirement)
- **Time-Series Analysis:** Median value trends and normalized recovery slopes
- **Geographic Analysis:** Neighborhood comparisons and ZIP code choropleth map
- **Distribution Analysis:** Univariate and multivariate distributions
- **Advanced Analytics:** COVID impact analysis and gentrification risk assessment

### Notebook Structure

| Section | Visualizations | Type |
|---------|---------------|------|
| Section 1 | Setup | Data loading and column verification |
| Section 2 | 8 charts | Exploratory: distributions and property characteristics |
| Section 3 | 2 charts | Time-series: trends and recovery patterns |
| Section 4 | 2 charts | Geographic: neighborhood analysis and COVID comparison |
| Section 5 | 2 charts | Multivariate: stacked area and parallel coordinates |
| Section 6 | 1 map | Spatial: interactive choropleth by ZIP code |
| **Total** | **15 visualizations** | **All using Altair/Vega-Lite** |

### Common Issues and Solutions

- **MaxRowsError:** Already handled with `alt.data_transformers.disable_max_rows()`
- **Charts not displaying:** Ensure you are running in Jupyter Notebook or JupyterLab (not plain Python)
- **Choropleth map fails:** Install geopandas: `pip install geopandas`
- **Column not found errors:** Update the COLUMNS dictionary in Section 1 to match your data

## 1. Setup and Data Loading


In [None]:
# Import required libraries - ONLY Altair for visualization!
import pandas as pd
import numpy as np
import altair as alt
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Configure Altair
alt.data_transformers.disable_max_rows()
alt.renderers.enable('default')

print(f"Altair version: {alt.__version__}")
print("Libraries imported successfully!")
print("All visualizations will use Altair/Vega-Lite only!")

In [None]:
# Load the processed dataset from Task 1
# Using Parquet for fast loading and efficient storage
df = pd.read_parquet('sf_property_data_clean.parquet')

print(f"Dataset loaded: {df.shape}")
print(f"Date range: {df['year'].min()} - {df['year'].max()}")
print(f"Total records: {len(df):,}")
print(f"\nNote: You can also load from CSV or pickle if needed:")
print(f"  CSV: pd.read_csv('sf_property_data_clean.csv')")
print(f"  Pickle: pd.read_pickle('sf_property_data_clean.pkl')")

In [None]:
# IMPORTANT: Check available columns
print("Available columns in dataset:")
print(df.columns.tolist())
print(f"\nDataset shape: {df.shape}")
print(f"\nFirst few rows:")
df.head()

In [None]:
# Column name mapping - These match your actual data!
COLUMNS = {
    'year': 'year',
    'total_assessed_value': 'total_assessed_value',
    'land_value': 'assessed_land_value',  # ✓ Mapped to your column
    'improvement_value': 'assessed_improvement_value',  # ✓ Mapped to your column
    'land_value_pct': 'land_value_pct',
    'building_age': 'building_age',
    'number_of_bedrooms': 'number_of_bedrooms',
    'property_class_code_definition': 'property_class_code_definition',
    'neighborhood': 'neighborhood',
    'property_area': 'property_area'
}

# Check which columns exist
missing_cols = [col for col in COLUMNS.values() if col not in df.columns]
if missing_cols:
    print("WARNING: The following expected columns are missing:")
    for col in missing_cols:
        print(f"  - {col}")
    print("\nPlease check your column names or create missing columns.")
else:
    print("✓ All expected columns found!")
    print("\nColumn mapping:")
    for key, val in COLUMNS.items():
        print(f"  {key} → {val}")

## 2. Exploratory Visualizations

*Exploring data distributions, temporal coverage, and property characteristics*


In [None]:
# Visualization 1: Property Tax Records by Year (2015-2023)
year_counts = df['year'].value_counts().reset_index()
year_counts.columns = ['year', 'count']
year_counts = year_counts.sort_values('year')

chart = alt.Chart(year_counts).mark_bar(color='steelblue', opacity=0.7).encode(
    x=alt.X('year:O', title='Year', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('count:Q', title='Number of Properties', scale=alt.Scale(domain=[0, 220000])),
    tooltip=[alt.Tooltip('year:O', title='Year'), 
             alt.Tooltip('count:Q', title='Properties', format=',')]
).properties(
    width=700,
    height=400,
    title='Property Tax Records by Year (2015-2023)'
)

# Add text labels on bars
text = chart.mark_text(
    align='center',
    baseline='bottom',
    dy=-5,
    fontSize=10
).encode(
    text=alt.Text('count:Q', format=',')
)

final_chart = (chart + text).configure_axis(
    grid=True,
    gridOpacity=0.3
).configure_title(
    fontSize=16,
    fontWeight='bold'
)

final_chart

In [None]:
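# Visualization 2: Property Value Distribution by Year (Violin Plot)
# Filter to a reasonable range (below the 99th percentile), then SAMPLE to keep the chart responsive
value_cap = df[COLUMNS['total_assessed_value']].quantile(0.99)
df_filtered = df[df[COLUMNS['total_assessed_value']] < value_cap].copy()
# Cap the sample size at the number of available rows so sample() cannot raise
df_sample = df_filtered.sample(n=min(100000, len(df_filtered)), random_state=445645)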

print(f"Creating violin plot with {len(df_sample):,} sampled properties")

violin = alt.Chart(df_sample).transform_density(
    COLUMNS['total_assessed_value'],
    as_=[COLUMNS['total_assessed_value'], 'density'],
    groupby=[COLUMNS['year']]
).mark_area(
    orient='horizontal',
    opacity=0.7
).encode(
    y=alt.Y(f"{COLUMNS['total_assessed_value']}:Q", 
            title='Property Value ($)',
            axis=alt.Axis(format='$,.0s')),
    color=alt.Color(f"{COLUMNS['year']}:O", 
                    scale=alt.Scale(scheme='tableau10'),
                    legend=alt.Legend(title='Year')),
    x=alt.X(
        'density:Q',
        stack='center',
        impute=None,
        title=None,
        axis=alt.Axis(labels=False, values=[0], grid=False, ticks=True)
    ),
    column=alt.Column(
        f"{COLUMNS['year']}:O",
        title='Year',
        header=alt.Header(
            titleOrient='bottom',
            labelOrient='bottom'
        )
    )
).properties(
    width=80,
    height=400,
    title='Property Value Distribution by Year (Violin Plot) - A1 Sketch #4'
).configure_facet(
    spacing=10
).configure_view(
    stroke=None
)

violin

In [None]:
# Visualization 3: Ridge plots showing value distribution shifts by year
# PRE-COMPUTE density curves in pandas to avoid kernel crash

value_99th = 8389117  # Based on README

# Pre-compute KDE for each year
years = sorted(df['year'].unique())
x_range = np.linspace(0, value_99th, 300)  # 300 points for smooth curve
density_data = []

print("Computing density curves for each year...")
for year in years:
    year_data = df[(df['year'] == year) & 
                   (df['total_assessed_value'] > 0) & 
                   (df['total_assessed_value'] <= value_99th)]['total_assessed_value']
    
    # Sample for faster KDE computation
    if len(year_data) > 20000:
        year_data = year_data.sample(20000, random_state=42)
    
    # Compute KDE with automatic bandwidth
    try:
        kde = stats.gaussian_kde(year_data)
        density_values = kde(x_range)
        
        # Normalize density for this year (for better visualization)
        max_density = density_values.max()
        if max_density > 0:
            density_values = density_values / max_density
        
        # Add to dataframe
        for x, d in zip(x_range, density_values):
            density_data.append({
                'year': int(year),
                'value': x,
                'density': d
            })
    except Exception as e:
        print(f"Warning: Could not compute KDE for year {year}: {e}")

density_df = pd.DataFrame(density_data)
print(f"Generated {len(density_df)} density points across {len(years)} years")

# Create ridge plot from pre-computed densities
ridge_plot = alt.Chart(density_df).mark_area(
    opacity=0.6,
    interpolate='monotone',
    line={'color': 'darkblue'}
).encode(
    x=alt.X('value:Q', 
            title='Total Assessed Value ($)',
            scale=alt.Scale(domain=[0, value_99th]),
            axis=alt.Axis(format='$,.0s')),
    y=alt.Y('density:Q',
            stack=None,
            title=None,
            axis=alt.Axis(labels=False, ticks=False),
            scale=alt.Scale(domain=[0, 1.1])),
    color=alt.Color('year:O', 
                    scale=alt.Scale(scheme='blues'),
                    legend=alt.Legend(title='Year', orient='right')),
    row=alt.Row('year:O', 
                title=None,
                header=alt.Header(labelAngle=0, labelAlign='left', labelFontSize=12)),
    tooltip=[alt.Tooltip('year:O', title='Year'),
             alt.Tooltip('value:Q', title='Value', format='$,.0f')]
).properties(
    width=700,
    height=70
).configure_title(
    fontSize=16,
    fontWeight='bold',
    anchor='start'
).configure_view(
    stroke=None
).configure_facet(
    spacing=5
).resolve_scale(
    y='independent'
)

ridge_plot.properties(
    title='Distribution Shifts in Assessed Values by Year (capped at 99th pct: $8,389,117)'
)

In [None]:
# Visualization 4: Land vs Improvement Value (Median)
median_land = df['assessed_land_value'].median()
median_improvement = df['assessed_improvement_value'].median()

comparison_data = pd.DataFrame({
    'Component': ['Land Value', 'Improvement Value'],
    'Median Value': [median_land, median_improvement]
})

chart = alt.Chart(comparison_data).mark_bar(opacity=0.7).encode(
    x=alt.X('Component:N', title='Value Component', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('Median Value:Q', title='Median Value ($)', scale=alt.Scale(domain=[0, 500000])),
    color=alt.Color('Component:N', 
                    scale=alt.Scale(domain=['Land Value', 'Improvement Value'],
                                   range=['orange', 'blue']),
                    legend=None),
    tooltip=[alt.Tooltip('Component:N'),
             alt.Tooltip('Median Value:Q', format='$,.0f', title='Median')]
).properties(
    width=500,
    height=400,
    title='Land vs Improvement Value (Median Comparison)'
)

# Add text labels
text = chart.mark_text(
    align='center',
    baseline='bottom',
    dy=-5,
    fontSize=11,
    fontWeight='bold'
).encode(
    text=alt.Text('Median Value:Q', format='$,.0f')
)

final_chart = (chart + text).configure_axis(
    grid=True,
    gridOpacity=0.3
).configure_title(
    fontSize=16,
    fontWeight='bold'
)

final_chart

In [None]:
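# Visualization 5: Distribution of Land Value Percentage
# PRE-COMPUTE histogram with NumPy; drop NaN first because np.histogram
# cannot infer a bin range when the input contains NaN
land_pct = df['land_value_pct'].dropna()
hist_data, bin_edges = np.histogram(land_pct, bins=50)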
hist_df = pd.DataFrame({
    'bin_start': bin_edges[:-1],
    'bin_end': bin_edges[1:],
    'count': hist_data
})

chart = alt.Chart(hist_df).mark_bar(color='purple', opacity=0.7).encode(
    x=alt.X('bin_start:Q', title='Land Value as % of Total'),
    x2='bin_end:Q',
    y=alt.Y('count:Q', title='Number of Properties'),
    tooltip=[alt.Tooltip('bin_start:Q', title='% Range Start', format='.1f'),
             alt.Tooltip('bin_end:Q', title='% Range End', format='.1f'),
             alt.Tooltip('count:Q', title='Properties', format=',')]
).properties(
    width=700,
    height=400,
    title='Distribution of Land Value Percentage'
)

# Add median line
median_pct = df['land_value_pct'].median()
median_line = alt.Chart(pd.DataFrame({'median': [median_pct]})).mark_rule(
    color='red',
    strokeDash=[5, 5],
    size=2
).encode(
    x='median:Q'
)

median_text = alt.Chart(pd.DataFrame({
    'median': [median_pct],
    'label': [f'Median: {median_pct:.1f}%']
})).mark_text(
    align='left',
    dx=5,
    dy=-10,
    fontSize=11,
    color='red'
).encode(
    x='median:Q',
    y=alt.value(0),
    text='label:N'
)

final_chart = (chart + median_line + median_text).configure_axis(
    grid=True,
    gridOpacity=0.3
).configure_title(
    fontSize=16,
    fontWeight='bold'
)

final_chart

In [None]:
# Visualization 6: Distribution of Building Ages
# Filter to reasonable ages (0-150 years)
age_filtered = df[(df['building_age'] >= 0) & (df['building_age'] <= 150)]['building_age']

# PRE-COMPUTE histogram in pandas for speed
hist_data, bin_edges = np.histogram(age_filtered, bins=40)
hist_df = pd.DataFrame({
    'bin_start': bin_edges[:-1],
    'bin_end': bin_edges[1:],
    'count': hist_data
})

chart = alt.Chart(hist_df).mark_bar(color='brown', opacity=0.7).encode(
    x=alt.X('bin_start:Q', title='Building Age (years)'),
    x2='bin_end:Q',
    y=alt.Y('count:Q', title='Number of Properties'),
    tooltip=[alt.Tooltip('bin_start:Q', title='Age Range Start', format='.0f'),
             alt.Tooltip('bin_end:Q', title='Age Range End', format='.0f'),
             alt.Tooltip('count:Q', title='Properties', format=',')]
).properties(
    width=700,
    height=400,
    title='Distribution of Building Ages (0-150 years)'
)

# Add median line
median_age = age_filtered.median()
median_line = alt.Chart(pd.DataFrame({'median': [median_age]})).mark_rule(
    color='red',
    strokeDash=[5, 5],
    size=2
).encode(
    x='median:Q'
)

median_text = alt.Chart(pd.DataFrame({
    'median': [median_age],
    'label': [f'Median: {median_age:.0f} years']
})).mark_text(
    align='left',
    dx=5,
    dy=-10,
    fontSize=11,
    color='red'
).encode(
    x='median:Q',
    y=alt.value(0),
    text='label:N'
)

final_chart = (chart + median_line + median_text).configure_axis(
    grid=True,
    gridOpacity=0.3
).configure_title(
    fontSize=16,
    fontWeight='bold'
)

final_chart

In [None]:
# Visualization 7: Distribution of Number of Bedrooms
# Filter to reasonable bedroom counts (0-10)
bedroom_filtered = df[(df['number_of_bedrooms'] >= 0) & 
                      (df['number_of_bedrooms'] <= 10)].copy()

bedroom_counts = bedroom_filtered['number_of_bedrooms'].value_counts().reset_index()
bedroom_counts.columns = ['bedrooms', 'count']
bedroom_counts = bedroom_counts.sort_values('bedrooms')

chart = alt.Chart(bedroom_counts).mark_bar(color='teal', opacity=0.7).encode(
    x=alt.X('bedrooms:O', title='Number of Bedrooms', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('count:Q', title='Number of Properties'),
    tooltip=[alt.Tooltip('bedrooms:O', title='Bedrooms'),
             alt.Tooltip('count:Q', title='Properties', format=',')]
).properties(
    width=700,
    height=400,
    title='Distribution of Number of Bedrooms (0-10)'
)

final_chart = chart.configure_axis(
    grid=True,
    gridOpacity=0.3
).configure_title(
    fontSize=16,
    fontWeight='bold'
)

final_chart

In [None]:
# Visualization 8: Top Property Types
top_types = df['property_class_code_definition'].value_counts().head(10).reset_index()
top_types.columns = ['property_type', 'count']

chart = alt.Chart(top_types).mark_bar(color='coral', opacity=0.7).encode(
    y=alt.Y('property_type:N', 
            title='Property Type',
            sort='-x'),
    x=alt.X('count:Q', title='Number of Properties'),
    tooltip=[alt.Tooltip('property_type:N', title='Type'),
             alt.Tooltip('count:Q', title='Count', format=',')]
).properties(
    width=700,
    height=400,
    title='Top 10 Property Types'
)

# Add text labels
text = chart.mark_text(
    align='left',
    dx=5,
    fontSize=10
).encode(
    text=alt.Text('count:Q', format=',')
)

final_chart = (chart + text).configure_axis(
    grid=True,
    gridOpacity=0.3
).configure_title(
    fontSize=16,
    fontWeight='bold'
)

final_chart

## 3. Time-Series Visualizations

*Analyzing trends, recovery patterns, and temporal dynamics*


In [None]:
# Visualization 9: Median Property Value and Property Count by Year (Dual-Axis)
# Prepare data for dual-axis chart
yearly_stats = df.groupby(COLUMNS['year']).agg({
    COLUMNS['total_assessed_value']: 'median'
}).reset_index()

# Add property count separately
property_counts = df.groupby(COLUMNS['year']).size().reset_index(name='property_count')
yearly_stats = yearly_stats.merge(property_counts, on=COLUMNS['year'])
yearly_stats.columns = ['year', 'median_value', 'property_count']

print("Creating dual-axis chart...")
print(f"Years: {yearly_stats['year'].min()} - {yearly_stats['year'].max()}")

# Base chart
base = alt.Chart(yearly_stats)

# Left Y-axis - Median Property Values (primary metric)
line_values = base.mark_line(
    point=True,
    color='steelblue',
    size=3
).encode(
    x=alt.X('year:O', title='Year', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('median_value:Q',
            title='Median Property Value ($)',
            axis=alt.Axis(titleColor='steelblue', format='$,.0s')),
    tooltip=[
        alt.Tooltip('year:O', title='Year'),
        alt.Tooltip('median_value:Q', title='Median Value', format='$,.0f')
    ]
)

# Right Y-axis - Property Count (secondary metric)
line_count = base.mark_line(
    point=True,
    color='coral',
    size=3,
    strokeDash=[5, 5]
).encode(
    x=alt.X('year:O', title='Year'),
    y=alt.Y('property_count:Q',
            title='Number of Properties',
            axis=alt.Axis(titleColor='coral', format=','),
            scale=alt.Scale(domain=[yearly_stats['property_count'].min() * 0.95,
                                   yearly_stats['property_count'].max() * 1.05])),
    tooltip=[
        alt.Tooltip('year:O', title='Year'),
        alt.Tooltip('property_count:Q', title='Properties', format=',')
    ]
)

# Combine with independent Y-scales
dual_axis_chart = alt.layer(
    line_values,
    line_count
).resolve_scale(
    y='independent'
).properties(
    width=700,
    height=400,
    title={
        'text': 'Property Values & Volume: Dual-Axis Analysis (2015-2023)',
        'subtitle': 'Blue (left axis): Median values | Coral (right axis): Property count - Assignment 1 Sketch #7',
        'fontSize': 16,
        'fontWeight': 'bold'
    }
).configure_axis(
    grid=True,
    gridOpacity=0.3
)

dual_axis_chart

In [None]:
# Visualization 10: Normalized Recovery Slopes by Neighborhood
# Select top neighborhoods by property count
top_neighborhoods = df['neighborhood'].value_counts().head(8).index.tolist()

# Calculate normalized trends for top neighborhoods
neighborhood_trends = []
for neighborhood in top_neighborhoods:
    nbhd_data = df[df['neighborhood'] == neighborhood]
    yearly = nbhd_data.groupby('year')['total_assessed_value'].median().reset_index()
    # Normalize to base year 2015; skip neighborhoods without a usable 2015 baseline
    base_rows = yearly.loc[yearly['year'] == 2015, 'total_assessed_value']
    if base_rows.empty or not (base_rows.iloc[0] > 0):
        print(f"Skipping {neighborhood}: no usable 2015 baseline value")
        continue
    base_value = base_rows.iloc[0]
    yearly['normalized'] = 100 * yearly['total_assessed_value'] / base_value
    yearly['neighborhood'] = neighborhood
    neighborhood_trends.append(yearly)

trend_df = pd.concat(neighborhood_trends, ignore_index=True)

# Create line chart
chart = alt.Chart(trend_df).mark_line(size=2.5, point=True).encode(
    x=alt.X('year:O', title='Year', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('normalized:Q', 
            title='Normalized Value (2015 = 100)',
            scale=alt.Scale(zero=False)),
    color=alt.Color('neighborhood:N', 
                    title='Neighborhood',
                    scale=alt.Scale(scheme='category10')),
    tooltip=[alt.Tooltip('neighborhood:N', title='Neighborhood'),
             alt.Tooltip('year:O', title='Year'),
             alt.Tooltip('normalized:Q', title='Normalized Value', format='.1f')]
).properties(
    width=700,
    height=400,
    title='Normalized Property Value Trends by Neighborhood (2015 = 100)'
)

final_chart = chart.configure_axis(
    grid=True,
    gridOpacity=0.3
).configure_title(
    fontSize=16,
    fontWeight='bold'
).configure_legend(
    orient='right'
)

final_chart
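
The 2015 = 100 normalization used for this chart divides each year's median by the 2015 median and multiplies by 100, so a value of 125 means a 25% increase over the base year. The following is a minimal sketch of that arithmetic in plain Python; `normalize_to_base` is a hypothetical helper and the medians are made-up numbers, not values from the dataset.

```python
def normalize_to_base(values_by_year, base_year=2015):
    """Rescale a {year: value} mapping so that the base year equals 100.

    Returns None when the base year is missing, zero, or NaN, because no
    index can be computed without a usable baseline.
    """
    base = values_by_year.get(base_year)
    if base is None or base == 0 or base != base:  # base != base is True only for NaN
        return None
    return {year: 100 * value / base
            for year, value in sorted(values_by_year.items())}


# Made-up yearly medians for one neighborhood (illustration only)
example_medians = {2015: 800_000, 2019: 1_000_000, 2021: 960_000}

normalized = normalize_to_base(example_medians)
for year, index_value in normalized.items():
    print(f"{year}: {index_value:.1f}")
```

Because every neighborhood starts at 100, the lines in the chart compare growth rates rather than absolute price levels.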

## 4. Geographic/Spatial Visualizations

*Neighborhood analysis and COVID impact comparisons*


In [None]:
# Visualization 11: Top 15 Neighborhoods by Property Count
top_15_neighborhoods = df['neighborhood'].value_counts().head(15).reset_index()
top_15_neighborhoods.columns = ['neighborhood', 'count']

chart = alt.Chart(top_15_neighborhoods).mark_bar(color='darkgreen', opacity=0.7).encode(
    y=alt.Y('neighborhood:N', 
            title='Neighborhood',
            sort='-x'),
    x=alt.X('count:Q', title='Number of Properties'),
    tooltip=[alt.Tooltip('neighborhood:N', title='Neighborhood'),
             alt.Tooltip('count:Q', title='Properties', format=',')]
).properties(
    width=700,
    height=500,
    title='Top 15 Neighborhoods by Property Count'
)

# Add text labels
text = chart.mark_text(
    align='left',
    dx=5,
    fontSize=10
).encode(
    text=alt.Text('count:Q', format=',')
)

final_chart = (chart + text).configure_axis(
    grid=True,
    gridOpacity=0.3
).configure_title(
    fontSize=16,
    fontWeight='bold'
)

final_chart

In [None]:
# Visualization 12: COVID-Era Comparison (Pre-COVID vs COVID Era)
# Select neighborhoods for comparison
comparison_neighborhoods = [
    'Russian Hill', 'Pacific Heights', 'Mission', 'Castro/Upper Market',
    'Haight Ashbury', 'Outer Sunset'
]

# Calculate pre-COVID (2019) and COVID-era (2021) median values
comparison_data = []
for neighborhood in comparison_neighborhoods:
    nbhd_data = df[df['neighborhood'] == neighborhood]
    pre_covid = nbhd_data[nbhd_data['year'] == 2019]['total_assessed_value'].median()
    covid_era = nbhd_data[nbhd_data['year'] == 2021]['total_assessed_value'].median()
    
    comparison_data.append({
        'Neighborhood': neighborhood,
        'Period': 'Pre-COVID (2019)',
        'Median Value': pre_covid
    })
    comparison_data.append({
        'Neighborhood': neighborhood,
        'Period': 'COVID Era (2021)',
        'Median Value': covid_era
    })

comparison_df = pd.DataFrame(comparison_data)

# Create grouped bar chart
chart = alt.Chart(comparison_df).mark_bar(opacity=0.8).encode(
    x=alt.X('Neighborhood:N', 
            title='Neighborhood',
            axis=alt.Axis(labelAngle=-45)),
    y=alt.Y('Median Value:Q', title='Median Assessed Value ($)'),
    color=alt.Color('Period:N',
                    scale=alt.Scale(domain=['Pre-COVID (2019)', 'COVID Era (2021)'],
                                   range=['#1f77b4', '#ff7f0e']),
                    legend=alt.Legend(title='Period')),
    xOffset='Period:N',
    tooltip=[alt.Tooltip('Neighborhood:N'),
             alt.Tooltip('Period:N'),
             alt.Tooltip('Median Value:Q', format='$,.0f', title='Median Value')]
).properties(
    width=700,
    height=400,
    title='COVID-Era Property Value Comparison: Pre-COVID (2019) vs COVID Era (2021)'
)

final_chart = chart.configure_axis(
    grid=True,
    gridOpacity=0.3
).configure_title(
    fontSize=16,
    fontWeight='bold'
)

final_chart

## 5. Multivariate Visualizations

*Multi-dimensional analysis of relationships and patterns*


In [None]:
# Visualization 13: Property Type Distribution Over Time (Stacked Area)
# Get top 7 property types
top_7_types = df['property_class_code_definition'].value_counts().head(7).index.tolist()

# PRE-AGGREGATE: count properties per year and type in pandas so Altair only receives a small summary table
type_trends = df[df['property_class_code_definition'].isin(top_7_types)].groupby(
    ['year', 'property_class_code_definition'], as_index=False
).size().rename(columns={'size': 'count'})

# Create stacked area chart from aggregated data
chart = alt.Chart(type_trends).mark_area(opacity=0.7, interpolate='monotone').encode(
    x=alt.X('year:O', title='Year', axis=alt.Axis(labelAngle=0)),
    y=alt.Y('count:Q', title='Number of Properties', stack='normalize'),
    color=alt.Color('property_class_code_definition:N',
                    title='Property Type',
                    scale=alt.Scale(scheme='category20')),
    tooltip=[alt.Tooltip('year:O', title='Year'),
             alt.Tooltip('property_class_code_definition:N', title='Type'),
             alt.Tooltip('count:Q', title='Count', format=',')]
).properties(
    width=700,
    height=400,
    title='Property Type Distribution Over Time (Normalized, Top 7 Types)'
)

final_chart = chart.configure_axis(
    grid=True,
    gridOpacity=0.3
).configure_title(
    fontSize=16,
    fontWeight='bold'
).configure_legend(
    orient='right',
    labelLimit=300
)

final_chart

In [None]:
# Visualization 14: Parallel Coordinates - Gentrification Risk Indicators
# Note: Altair has no dedicated parallel-coordinates mark, so this cell builds an equivalent view:
# one line per neighborhood drawn across a shared axis of normalized indicators

# Select affluent neighborhoods for gentrification risk analysis
risk_neighborhoods = [
    'St. Francis Wood', 'Stonestown', 'Presidio Heights', 'Lake Street',
    'Sea Cliff', 'Corona Heights', 'Diamond Heights', 'Central Richmond'
]

# Calculate risk indicators
risk_data = []
for neighborhood in risk_neighborhoods:
    nbhd_data = df[df['neighborhood'] == neighborhood]
    if len(nbhd_data) > 0:
        risk_data.append({
            'Neighborhood': neighborhood,
            'Median Value': nbhd_data['total_assessed_value'].median(),
            'Property Area': nbhd_data['property_area'].median(),
            'Building Age': nbhd_data['building_age'].median(),
            'Land %': nbhd_data['land_value_pct'].median()
        })

risk_df = pd.DataFrame(risk_data)

# Normalize all indicators to 0-1 scale for comparison
for col in ['Median Value', 'Property Area', 'Building Age', 'Land %']:
    min_val = risk_df[col].min()
    max_val = risk_df[col].max()
    risk_df[f'{col}_normalized'] = (risk_df[col] - min_val) / (max_val - min_val)

# Reshape data for Altair
risk_long = risk_df.melt(
    id_vars=['Neighborhood'],
    value_vars=['Median Value_normalized', 'Property Area_normalized', 
                'Building Age_normalized', 'Land %_normalized'],
    var_name='Indicator',
    value_name='Normalized Value'
)

# Clean up indicator names
risk_long['Indicator'] = risk_long['Indicator'].str.replace('_normalized', '')

# Create a parallel-coordinates-style line chart (one line per neighborhood across the indicators)
chart = alt.Chart(risk_long).mark_line(size=2, point=True).encode(
    x=alt.X('Indicator:N', 
            title='Risk Indicator',
            axis=alt.Axis(labelAngle=-45)),
    y=alt.Y('Normalized Value:Q', 
            title='Normalized Value (0-1)',
            scale=alt.Scale(domain=[0, 1])),
    color=alt.Color('Neighborhood:N',
                    title='Neighborhood',
                    scale=alt.Scale(scheme='category10')),
    tooltip=[alt.Tooltip('Neighborhood:N'),
             alt.Tooltip('Indicator:N'),
             alt.Tooltip('Normalized Value:Q', format='.2f')]
).properties(
    width=700,
    height=400,
    title='Gentrification Risk Indicators by Neighborhood (Normalized 0-1)'
)

final_chart = chart.configure_axis(
    grid=True,
    gridOpacity=0.3
).configure_title(
    fontSize=16,
    fontWeight='bold'
).configure_legend(
    orient='right'
)

final_chart
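
The 0-1 scaling applied to each indicator above is min-max normalization: subtract the column minimum and divide by the column range. A minimal sketch in plain Python follows. `min_max_normalize` is a hypothetical helper, the input values are made up, and mapping a constant column to 0.5 is an assumption of this sketch (the pandas expression in the cell would produce NaN in that case).

```python
def min_max_normalize(values):
    """Scale a list of numbers to the 0-1 range.

    If all values are equal the range is zero, so every value is mapped to
    0.5 (an arbitrary choice made for this sketch).
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up median building ages for three neighborhoods (illustration only)
ages = [40, 60, 100]
scaled = min_max_normalize(ages)
print([round(s, 3) for s in scaled])
```

Note that the result depends on which neighborhoods are included: 0 and 1 mark the lowest and highest values within the selected group, not citywide extremes.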

## 6. Geographic Analysis: Choropleth Map by ZIP Code

*Spatial distribution of property values across San Francisco ZIP codes*


This section creates an interactive choropleth map showing average property values across San Francisco ZIP codes using spatial analysis.

### 6.1 Load ZIP Code Boundaries and Property Data

In [None]:
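# Imports needed for the spatial steps in this section (not imported in Section 1)
import json
import geopandas as gpd

# File paths - UPDATE THESE WITH YOUR FILE NAMES
PROPERTY_GEOJSON = 'sf_property_data_geo.geojson'  # Your property point data
ZIPCODE_GEOJSON = 'San_Francisco_ZIP_Codes_20251020.geojson'  # ZIP boundaries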

print("Loading property data...")
properties_gdf = gpd.read_file(PROPERTY_GEOJSON)
print(f"  ✓ Loaded {len(properties_gdf):,} property records")
print(f"  Sample columns: {properties_gdf.columns.tolist()[:8]}")

print("\nLoading ZIP code boundaries...")
zipcodes_gdf = gpd.read_file(ZIPCODE_GEOJSON)
print(f"  ✓ Loaded {len(zipcodes_gdf)} ZIP code polygons")
unique_zips = sorted(zipcodes_gdf['zip_code'].unique())
print(f"  ZIP codes: {unique_zips[:10]}... (showing first 10)")

### 6.2 Spatial Join - Assign ZIP Codes to Properties

Perform a spatial join to assign each property to its corresponding ZIP code based on geographic location.

In [None]:
# Ensure both datasets use the same coordinate reference system
if properties_gdf.crs != zipcodes_gdf.crs:
    print(f"Converting property CRS from {properties_gdf.crs} to {zipcodes_gdf.crs}")
    properties_gdf = properties_gdf.to_crs(zipcodes_gdf.crs)
else:
    print(f"Both datasets already use CRS: {properties_gdf.crs}")

print("\nPerforming spatial join...")
print("  (This may take a moment for large datasets)")

# Spatial join: assign ZIP code to each property based on location
properties_with_zip = gpd.sjoin(
    properties_gdf, 
    zipcodes_gdf[['zip_code', 'geometry']], 
    how='left', 
    predicate='within'
)

# Check results
matched = properties_with_zip['zip_code'].notna().sum()
match_rate = matched / len(properties_with_zip) * 100

print(f"\n✓ Spatial join complete!")
print(f"  Matched: {matched:,}/{len(properties_with_zip):,} properties ({match_rate:.1f}%)")

# Show sample of matched data
print("\nSample of properties with ZIP codes:")
properties_with_zip[['zip_code', 'total_assessed_value', 'neighborhood']].head(10)
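
To make the `within` predicate and the left join used above more concrete, here is a conceptual sketch in plain Python. It is not how geopandas implements `sjoin` (geopandas relies on its own geometry engine and spatial index); the ray-casting test, the helper names, the square "ZIP areas", and the placeholder labels are all assumptions made for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    `polygon` is a list of (x, y) vertices. Points exactly on an edge are
    not handled specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_zip(points, zip_polygons):
    """Left-join style lookup: every point is kept, unmatched ones get None."""
    result = []
    for x, y in points:
        match = None
        for zip_code, polygon in zip_polygons.items():
            if point_in_polygon(x, y, polygon):
                match = zip_code
                break
        result.append(match)
    return result


# Made-up square areas with placeholder labels, not real ZIP boundaries
zip_polygons = {
    'ZIP_A': [(0, 0), (1, 0), (1, 1), (0, 1)],
    'ZIP_B': [(1, 0), (2, 0), (2, 1), (1, 1)],
}
points = [(0.5, 0.5), (1.5, 0.25), (3.0, 3.0)]

assigned = assign_zip(points, zip_polygons)
match_rate = sum(z is not None for z in assigned) / len(assigned) * 100
print(assigned, f"{match_rate:.1f}%")
```

The third point falls outside both areas and keeps a `None` label, which mirrors the unmatched rows (NaN `zip_code`) that `how='left'` preserves and that the match rate above counts.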

### 6.3 Calculate Property Value Statistics by ZIP Code

In [None]:
print("Calculating ZIP code statistics...\n")

# Group by ZIP code and calculate statistics
zip_stats = properties_with_zip.groupby('zip_code').agg({
    'total_assessed_value': ['mean', 'median', 'count', 'sum', 'std']
}).reset_index()

# Flatten column names
zip_stats.columns = ['zip_code', 'avg_value', 'median_value', 
                     'property_count', 'total_value', 'std_value']

# Remove any NaN ZIP codes
zip_stats = zip_stats[zip_stats['zip_code'].notna()]

In [None]:
# Display top 10 ZIP codes by average value
print("\nTop 10 ZIP Codes by Average Property Value:")
print("="*80)

top_zips = zip_stats.sort_values('avg_value', ascending=False).head(10).copy()
top_zips['avg_value_fmt'] = top_zips['avg_value'].apply(lambda x: f'${x:,.0f}')
top_zips['median_value_fmt'] = top_zips['median_value'].apply(lambda x: f'${x:,.0f}')

display(top_zips[['zip_code', 'avg_value_fmt', 'median_value_fmt', 'property_count']])

### 6.4 Create Interactive Choropleth Map

Visualize average property values across ZIP codes using a color-coded map.

In [None]:
print("Creating choropleth map...\n")

# Merge statistics with ZIP code boundaries
zipcodes_with_stats = zipcodes_gdf.merge(
    zip_stats[['zip_code', 'avg_value', 'median_value', 'property_count', 'total_value']], 
    on='zip_code', 
    how='left'
)

# Fill NaN values with 0 for ZIP codes without data
zipcodes_with_stats['avg_value'] = zipcodes_with_stats['avg_value'].fillna(0)
zipcodes_with_stats['median_value'] = zipcodes_with_stats['median_value'].fillna(0)
zipcodes_with_stats['property_count'] = zipcodes_with_stats['property_count'].fillna(0)

# Convert datetime/timestamp columns to strings to avoid JSON serialization errors
for col in zipcodes_with_stats.columns:
    if pd.api.types.is_datetime64_any_dtype(zipcodes_with_stats[col]):
        zipcodes_with_stats[col] = zipcodes_with_stats[col].astype(str)

# Convert to GeoJSON for Altair
geojson_str = zipcodes_with_stats.to_json()
geojson_dict = json.loads(geojson_str)

# Get min/max for color scale (excluding zeros)
min_val = zipcodes_with_stats['avg_value'][zipcodes_with_stats['avg_value'] > 0].min()
max_val = zipcodes_with_stats['avg_value'].max()

print(f"Property value range: ${min_val:,.0f} - ${max_val:,.0f}")
print(f"Creating visualization...\n")

# Create the choropleth map
choropleth = alt.Chart(alt.Data(values=geojson_dict['features'])).mark_geoshape(
    stroke='white',
    strokeWidth=1.5
).encode(
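    # ZIP codes without data were filled with 0 above; a log scale cannot place 0,
    # so those polygons are drawn in light gray instead of being passed to the scale
    color=alt.condition(
        'datum.properties.avg_value > 0',
        alt.Color(
            'properties.avg_value:Q',
            scale=alt.Scale(
                scheme='viridis',
                domain=[min_val, max_val],
                type='log'  # Log scale for better visualization
            ),
            title='Avg Property Value ($)',
            legend=alt.Legend(
                format='$,.0f',
                orient='right',
                titleFontSize=13,
                labelFontSize=11,
                gradientLength=300
            )
        ),
        alt.value('lightgray')
    ),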
    tooltip=[
        alt.Tooltip('properties.zip_code:N', title='ZIP Code'),
        alt.Tooltip('properties.avg_value:Q', title='Avg Value', format='$,.0f'),
        alt.Tooltip('properties.median_value:Q', title='Median Value', format='$,.0f'),
        alt.Tooltip('properties.property_count:Q', title='# Properties', format=','),
        alt.Tooltip('properties.total_value:Q', title='Total Value', format='$,.2s')
    ]
).properties(
    width=900,
    height=700,
    title={
        'text': 'San Francisco Average Property Values by ZIP Code',
        'fontSize': 20,
        'fontWeight': 'bold',
        'anchor': 'middle'
    }
).project(
    type='mercator'
).configure_view(
    strokeWidth=0
)

print("✓ Choropleth map created!\n")

# Display the map
choropleth