## Background and motivations





In urban planning, understanding the factors that drive housing development is crucial for sustainable growth. Cities often use zoning regulations to influence where and how much housing is built. However, these effects are not always straightforward to predict, as they involve a complex interplay of economic, demographic, and spatial factors. 

In this analysis, we aim to identify the best predictors of housing unit availability. The goal is to understand which factors or regulatory variables most effectively increase the number of housing units. The dataset covers the city of Minneapolis and describes social, economic, and spatial characteristics of its census tracts, which lets us examine potential impacts on housing supply.

To capture the causal relationships between these variables and housing availability, we use Pyro, a probabilistic programming library that supports causal modeling. It is especially useful when the underlying relationships between variables are complex and uncertain. You can read more about Pyro here: https://pyro.ai/


### The dataset

**Source and details here**
The dataset was obtained...

The Pearson correlation plot below shows the pairwise relationships between the variables in the dataset:

![Pearson correlation plot](../experimental_notebooks/zoning/corr_plot.png)

In the `zoning_tracts_data.ipynb` notebook, the data is restructured for modeling by aggregating housing units across Minneapolis by year and census tract ID. The total number of housing units in each census tract is the sum of the units on its parcels; because parcels do not overlap, this sum neither drops nor double-counts units. Two further metrics are derived for each tract: `median_value` and `summed_value`, the median and the sum of the housing unit values in that tract.
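
The aggregation step can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's code: the parcel-level column names (`units`, `value`) and the toy rows are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical parcel-level records; column names and values are illustrative only.
parcels = pd.DataFrame(
    {
        "year": [2019, 2019, 2019, 2020, 2020],
        "census_tract": ["A", "A", "B", "A", "B"],
        "units": [2, 3, 10, 6, 10],
        "value": [200_000.0, 300_000.0, 900_000.0, 550_000.0, 950_000.0],
    }
)

# One row per (year, census tract): sum the units, then take the
# median and the sum of the parcel values within each group.
tracts = parcels.groupby(["year", "census_tract"], as_index=False).agg(
    housing_units=("units", "sum"),
    median_value=("value", "median"),
    summed_value=("value", "sum"),
)
print(tracts)
```

Since each parcel appears in exactly one group, summing within groups preserves the citywide total of units.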

The model uses the following variables:

- **Categorical Variables**: `year`, `census_tract`

- **Continuous Variables**: `housing_units`, `total_value`, `median_value`, `mean_limit_original`, `median_distance`, `income`, `segregation_original`, `white_original`, `parcel_mean_sqm`, `parcel_median_sqm`, `parcel_sqm`, `downtown_overlap`, `university_overlap`

- **Outcome Variable**: `housing_units`

### Causal Modeling 

The `TractsModelContinuousInteractions` class constructs a causal model that includes continuous interaction terms, where variables like `income`, `limit`, and `distance` interact to predict outcomes. This is achieved using the following modeling approach:

1. **Model Structure**: 
   - The model is structured hierarchically, where each dependent variable (like `housing_units`, `income`, and `segregation`) is regressed on a combination of categorical and continuous predictors.
   - For example, the relationship for housing units can be expressed mathematically as:
     $$
     \text{housing\_units} = f(\text{distance}, \text{income}, \text{white}, \text{segregation}, \text{sqm}, \dots) + \epsilon
     $$
   - Here, $f$ represents the functional form (e.g., linear regression), and $\epsilon$ is the error term, capturing unobserved influences.

2. **Sampling**: 
   - Each random variable in the model is declared as a sample site with Pyro’s `pyro.sample` function, and the posterior distribution of the model parameters is then approximated with probabilistic inference. For instance, categorical variables are drawn from a categorical distribution, while continuous variables are drawn from a normal distribution:
     $$
     y \sim \text{Categorical}(\pi), \quad x \sim \mathcal{N}(\mu, \sigma)
     $$

3. **Components of the Model**: 
   - The model integrates multiple components that represent specific relationships. For example:
     - **Linear Components**: Components added with `add_linear_component` estimate relationships using linear regression equations.
     - **Ratio Components**: Components added with `add_ratio_component` handle variables that are ratios, such as `segregation` and `income`, allowing for multiplicative interactions between predictors.


#### Continuous Interactions
Continuous interactions are treated as additional predictors in the model. For instance, if `limit` interacts with `income`, this is captured by adding terms like `limit * income` in the regression structure, allowing us to explore how multiple factors together influence housing outcomes.
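
As a small illustration of an interaction entering as an extra predictor, the sketch below fits synthetic data by ordinary least squares with NumPy. Note the swap: this uses plain least squares rather than the Bayesian Pyro model, and the data, coefficients, and standardized scales are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
limit = rng.uniform(0.0, 1.0, size=n)   # hypothetical zoning limit on a 0-1 scale
income = rng.normal(0.0, 1.0, size=n)   # hypothetical standardized income

# Synthetic outcome with a known interaction coefficient of 1.5.
y = 0.5 * limit - 0.3 * income + 1.5 * limit * income + rng.normal(0.0, 0.1, size=n)

# The interaction is just one more column in the design matrix: limit * income.
X = np.column_stack([np.ones(n), limit, income, limit * income])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

names = ["intercept", "limit", "income", "limit*income"]
print(dict(zip(names, coef.round(2))))
```

A nonzero coefficient on the `limit*income` column means the effect of `limit` on the outcome depends on the level of `income`, which is what the interaction term is meant to capture.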

### WAIC and Model Evaluation
The **Watanabe-Akaike Information Criterion (WAIC)**, also known as the widely applicable information criterion, is used to evaluate and compare different model configurations. It is particularly suited to Bayesian models, as it uses the entire posterior distribution rather than just point estimates. WAIC starts from the **log pointwise predictive density (lppd)**, which measures how well the model predicts each observed data point when the likelihood is averaged across posterior samples, and then subtracts a penalty for model complexity to guard against overfitting.

$$
\text{WAIC} = -2 \cdot (\text{lppd} - \text{penalty})
$$
where:
   - **lppd** sums, over data points, the log of each point's likelihood averaged across posterior samples.
   - **penalty** accounts for model complexity by summing, over data points, the variance of the log-likelihood across posterior samples.

Lower WAIC values indicate a model that better balances predictive accuracy and complexity. By comparing WAIC values across models, we can determine which model best captures the data's patterns while avoiding overfitting. If one model has a significantly lower WAIC, it is considered to offer better predictive performance. Thus, WAIC guides the model development process, helping select a model that provides an optimal fit without unnecessary complexity.
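
The formula above can be computed directly from a matrix of pointwise log-likelihoods. The following is a minimal NumPy sketch, assuming the log-likelihoods have already been evaluated for each posterior draw and observation; the function name `waic` and the made-up input arrays are illustrative, not part of the project code.

```python
import numpy as np


def waic(log_lik):
    """WAIC from an (S, N) array: S posterior draws, N observations."""
    log_lik = np.asarray(log_lik, dtype=float)
    n_draws = log_lik.shape[0]
    # lppd: per observation, the log of the likelihood averaged over draws,
    # computed with a numerically stable log-mean-exp, then summed.
    col_max = log_lik.max(axis=0)
    log_mean_lik = col_max + np.log(np.exp(log_lik - col_max).sum(axis=0)) - np.log(n_draws)
    lppd = log_mean_lik.sum()
    # penalty: per observation, the variance of the log-likelihood across draws, summed.
    penalty = log_lik.var(axis=0, ddof=1).sum()
    return -2.0 * (lppd - penalty)


# Made-up log-likelihoods for two hypothetical models on the same 30 observations.
rng = np.random.default_rng(0)
better = rng.normal(-1.0, 0.05, size=(200, 30))
worse = rng.normal(-2.0, 0.05, size=(200, 30))
print(waic(better) < waic(worse))
```

The model whose draws assign higher likelihood to the observations ends up with the lower WAIC, matching the reading of the criterion given above.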

### Outliers

Urban zoning and housing are subject to many unpredictable factors, resulting in high variability and outliers. For instance, zones like university areas and downtown are governed by unique regulations, which introduces unexpected patterns in housing data.



### Interventions

Interventions in causal models allow us to simulate hypothetical policy changes. By directly manipulating a variable, such as setting zoning `limit` values to different levels, we can analyze potential impacts on housing units.

We begin with brute-force interventions by setting variables to extreme values (e.g., all zeros or all ones). This provides a general idea of the effect range based on zoning limits alone.

Using the `do` operator in Pyro, we simulate a realistic intervention scenario by adjusting `limit` values in line with Minneapolis reforms. This approach helps compare observed values with both factual and counterfactual predictions, examining how Minneapolis’ zoning changes might have impacted housing availability.
