# A Case Study: The Effect of Gun Ownership on Gun-Homicide Rates

We consider the problem of estimating the effect of gun
ownership on the homicide rate. For this purpose, we estimate the following partially
linear model

$$
 Y_{j,t} = \beta D_{j,(t-1)} + g(Z_{j,t}) + \epsilon_{j,t}.
$$

## Data

$Y_{j,t}$ is the log homicide rate in county $j$ at time $t$, $D_{j, t-1}$ is the log fraction of suicides committed with a firearm in county $j$ at time $t-1$, which we use as a proxy for gun ownership,  and  $Z_{j,t}$ is a set of demographic and economic characteristics of county $j$ at time $t$. The parameter $\beta$ is the effect of gun ownership on homicide rates, controlling for county-level demographic and economic characteristics.

The sample covers 195 large United States counties from 1980 through 1999, giving us 3900 observations.

In [None]:
# Import relevant packages
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV, LinearRegression, Ridge, Lasso, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.neural_network import MLPRegressor
import patsy
import warnings
from sklearn.base import BaseEstimator, clone
import statsmodels.api as sm
import statsmodels.formula.api as smf
warnings.simplefilter('ignore')

np.random.seed(1234)

In [None]:
file = "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/gun_clean.csv"
data = pd.read_csv(file)
data.shape

### Preprocessing

To account for heterogeneity across counties and for time trends, we partial out county-specific and time-specific effects from all variables in the following preprocessing step.

In [None]:
##################### Find Variable Names from Dataset ######################
def varlist(df=None, type=["numeric", "factor", "character"], pattern="", exclude=None):
    vars = []
    if any(t in type for t in ["numeric", "factor", "character"]):
        if "numeric" in type:
            vars += df.select_dtypes(include=["number"]).columns.tolist()
        if "factor" in type:
            vars += df.select_dtypes(include=["category"]).columns.tolist()
        if "character" in type:
            vars += df.select_dtypes(include=["object"]).columns.tolist()

    if exclude:
        vars = [var for var in vars if var not in exclude]

    if pattern:
        vars = [var for var in vars if re.search(pattern, var)]

    return vars


############################# Create Variables ##############################

# dummy variables for year and county fixed effects
fixed = [col for col in data.columns if "X_Jfips" in col]
year = varlist(data, pattern="X_Tyear")

# census control variables
census = []
census_var = ["^AGE", "^BN", "^BP", "^BZ", "^ED", "^EL", "^HI", "^HS", "^INC", "^LF", "^LN", "^PI", "^PO", "^PP", "^PV", "^SPR", "^VS"]

for pattern in census_var:
    census.extend(varlist(data, pattern=pattern))

################################ Variables ##################################
# treatment variable
d = "logfssl"
# outcome variable
y = "logghomr"
# other control variables
X1 = ["logrobr", "logburg", "burg_missing", "robrate_missing"]
X2 = ["newblack", "newfhh", "newmove", "newdens", "newmal"]


######################## Partial out Fixed Effects ##########################

# new dataset for partialled-out variables
rdata = pd.DataFrame(data["CountyCode"])

# variables to partial out
pvar = [y, d] + X1 + X2 + census

# partial out year and county fixed effect from variables in pvar
residuals = []
for var in pvar:
    formula = f"{var} ~ {' + '.join(year)} + {' + '.join(fixed)}"
    model = sm.OLS.from_formula(formula, data=data)
    result = model.fit()
    residuals.append(pd.Series(result.resid, name=var))

rdata = pd.concat([rdata] + residuals, axis=1)

rdata.head()
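
To illustrate what partialling out fixed effects does, here is a minimal sketch on a small simulated balanced panel. All names and numbers are illustrative and not taken from the gun data. It checks that the residual from regressing a variable on unit and period dummies coincides with the two-way demeaned variable, which holds when the panel is balanced.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods = 6, 5
unit = np.repeat(np.arange(n_units), n_periods)    # unit index per observation
period = np.tile(np.arange(n_periods), n_units)    # period index per observation
x = rng.normal(size=n_units * n_periods) + 0.5 * unit - 0.3 * period

# Dummy-variable regression: intercept + unit dummies + period dummies
# (the first category of each is dropped to avoid collinearity)
U = (unit[:, None] == np.arange(1, n_units)).astype(float)
T = (period[:, None] == np.arange(1, n_periods)).astype(float)
W = np.column_stack([np.ones_like(x), U, T])
coef, *_ = np.linalg.lstsq(W, x, rcond=None)
resid_dummies = x - W @ coef

# Two-way demeaning: subtract unit and period means, add back the grand mean
unit_mean = np.array([x[unit == u].mean() for u in range(n_units)])[unit]
period_mean = np.array([x[period == t].mean() for t in range(n_periods)])[period]
resid_within = x - unit_mean - period_mean + x.mean()

print(np.allclose(resid_dummies, resid_within))
```

In an unbalanced panel the simple demeaning formula no longer matches the dummy regression, so the regression-based approach used in the cell above is the safer general choice.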


Now, we can construct the treatment variable, the outcome variable and the matrix $Z$ that includes the control variables.

In [None]:
# Treatment variable
D = rdata[d]

# Outcome variable
Y = rdata[y]

# Construct matrix Z
Z_cols = X1 + X2 + census
Z = rdata[Z_cols]
Z.shape

We have 195 control variables in total. The control variables $Z_{j,t}$ are from the U.S. Census Bureau and contain demographic and economic characteristics of the counties such as the age distribution, the income distribution, crime rates, federal spending, home ownership rates, house prices, educational attainment, voting patterns, employment statistics, and migration rates.

In [None]:
clu = rdata["CountyCode"]  # for clustering SE
time = np.tile(np.arange(20), int(data.shape[0]/20))
time = pd.Series(time, name='time')
data = pd.concat([Y, D, Z, clu, pd.Series(time)], axis=1)

## The effect of gun ownership

### OLS

After preprocessing the data, as a baseline model, we first look at simple regression of $Y_{j,t}$ on $D_{j,t-1}$ without controls.

In [None]:
ols_model = smf.ols(formula = 'logghomr ~ 1 + logfssl', data = data).fit(cov_type='cluster', cov_kwds={"groups": data['CountyCode']})
ols_model.summary()

The point estimate is $0.282$, with a 95% confidence interval ranging from 0.155 to 0.41. This
suggests that gun ownership rates are positively related to gun homicide rates: if gun ownership increases by 1% relative
to a trend, then the predicted gun homicide rate goes up by 0.28%, without controlling for counties' characteristics.
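
Because both the outcome and the treatment are in logs, the coefficient reads as an elasticity. As a small illustrative calculation that uses only the point estimate reported above, the exact predicted percentage change for a 1% increase in the proxy can be compared with the usual first-order approximation:

```python
import numpy as np

beta_hat = 0.282        # point estimate reported above
pct_increase_d = 0.01   # a 1% increase in the gun-ownership proxy

# Log-log model: change in log(Y) = beta * change in log(D)
exact_change = np.exp(beta_hat * np.log(1 + pct_increase_d)) - 1
approx_change = beta_hat * pct_increase_d

print(f"exact: {100 * exact_change:.3f}%, approximation: {100 * approx_change:.3f}%")
```

For changes this small the two numbers are nearly identical, which is why the coefficient can be read directly as a percentage effect.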

Since our goal is to estimate the effect of gun ownership after controlling for a rich set of county characteristics, we next include the controls. First, we estimate the model by OLS and then by an array of modern regression methods using the double machine learning approach.

In [None]:
def formula_from_cols(df, y):
    return y + ' ~ ' + ' + '.join([col for col in df.columns if not col==y])

form = formula_from_cols(data,'logghomr')

In [None]:
ols_model = smf.ols(formula = form, data = data).fit(cov_type='cluster', cov_kwds={"groups": data['CountyCode']})
ols_model.summary()

After controlling for a rich set of characteristics, the point estimate of the effect of gun ownership falls to $0.19$.

# DML algorithm

Here we perform inference on the predictive coefficient $\beta$ in our partially linear statistical model,

$$
Y = D\beta + g(Z) + \epsilon, \quad E (\epsilon | D, Z) = 0,
$$

using the **double machine learning** approach.

For $\tilde Y = Y- E(Y|Z)$ and $\tilde D= D- E(D|Z)$, we can write

$$
\tilde Y = \beta \tilde D + \epsilon, \quad E (\epsilon |\tilde D) =0.
$$
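
To see where the residualized equation comes from, take conditional expectations of the partially linear model given $Z$. The short derivation below assumes only $E(\epsilon | D, Z) = 0$, as stated above; by the law of iterated expectations this implies $E(\epsilon | Z) = 0$.

$$
\begin{aligned}
E(Y | Z) &= \beta\, E(D | Z) + g(Z) + E(\epsilon | Z) = \beta\, E(D | Z) + g(Z), \\
Y - E(Y | Z) &= \beta \bigl(D - E(D | Z)\bigr) + \epsilon, \\
\tilde Y &= \beta \tilde D + \epsilon.
\end{aligned}
$$

Since $\tilde D$ is a function of $(D, Z)$, the condition $E(\epsilon | D, Z) = 0$ also gives $E(\epsilon | \tilde D) = 0$, so $\beta$ is the coefficient in a regression of $\tilde Y$ on $\tilde D$.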

Using cross-fitting, we employ modern regression methods
to build estimators $\hat \ell(Z)$ and $\hat m(Z)$ of $\ell(Z):=E(Y|Z)$ and $m(Z):=E(D|Z)$ to obtain the estimates of the residualized quantities:

$$
\tilde Y_i = Y_i  - \hat \ell (Z_i),   \quad \tilde D_i = D_i - \hat m(Z_i), \quad \text{ for each } i = 1,\dots,n.
$$

Finally, using ordinary least squares of $\tilde Y_i$ on $\tilde D_i$, we obtain the
estimate of $\beta$.
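
The residual-on-residual recipe above can be sketched on simulated data where the true coefficient is known. This is a minimal illustration under stated assumptions, not the implementation used for the gun data: the data-generating process, the choice of `LassoCV` for both nuisance functions, and all variable names are hypothetical, and the final regression omits the intercept for brevity.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
n, p, beta_true = 2000, 20, 0.5

# Simulated data with sparse linear nuisance functions (an assumption of this sketch)
Z = rng.normal(size=(n, p))
D = Z[:, 0] - 0.5 * Z[:, 1] + rng.normal(size=n)                   # m(Z) + noise
Y = beta_true * D + 2.0 * Z[:, 0] + Z[:, 2] + rng.normal(size=n)   # beta*D + g(Z) + noise

# Cross-fitting: out-of-fold predictions of E(Y|Z) and E(D|Z)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
l_hat = cross_val_predict(LassoCV(cv=5), Z, Y, cv=cv)
m_hat = cross_val_predict(LassoCV(cv=5), Z, D, cv=cv)

# Residualize, then regress the Y-residuals on the D-residuals
res_y = Y - l_hat
res_d = D - m_hat
beta_hat = np.sum(res_d * res_y) / np.sum(res_d ** 2)

# Heteroskedasticity-robust standard error of the final regression
eps = res_y - beta_hat * res_d
se = np.sqrt(np.sum(res_d ** 2 * eps ** 2)) / np.sum(res_d ** 2)

print(f"beta_hat = {beta_hat:.3f}, se = {se:.3f}")
```

Each observation's nuisance predictions come from models fitted on the other folds, which is what keeps overfitting in the first stage from contaminating the final estimate.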

The following algorithm consumes $Y, D, Z$ and machine learning methods for learning $E(Y|Z)$ and $E(D|Z)$; the residuals $\tilde Y$ and $\tilde D$ are obtained out-of-fold by cross-fitting. It then returns the estimated coefficient $\beta$ and the corresponding standard error from the final OLS regression, along with the cross-fitted predictions and residuals.

In [None]:
def dml(X, D, y, modely, modeld, *, nfolds, classifier=False, time = None, clu = None, cluster = True):
    '''
    DML for the Partially Linear Model setting with cross-fitting

    Input
    -----
    X: the controls
    D: the treatment
    y: the outcome
    modely: the ML model for predicting the outcome y
    modeld: the ML model for predicting the treatment D
    nfolds: the number of folds in cross-fitting
    classifier: bool, whether the modeld is a classifier or a regressor

    time: array of time indices, eg [0,1,...,T-1,0,1,...,T-1,...,0,1,...,T-1]
    clu: array of cluster indices, eg [1073, 1073, 1073, ..., 5055, 5055, 5055, 5055]
    cluster: bool, whether to use clustered standard errors

    Output
    ------
    point: the point estimate of the treatment effect of D on y
    stderr: the standard error of the treatment effect
    yhat: the cross-fitted predictions for the outcome y
    Dhat: the cross-fitted predictions for the treatment D
    resy: the outcome residuals
    resD: the treatment residuals
    epsilon: the final residual-on-residual OLS regression residual
    '''
    cv = KFold(n_splits=nfolds, shuffle=True, random_state=123) # shuffled k-folds
    yhat = cross_val_predict(modely, X, y, cv=cv, n_jobs=-1) # out-of-fold predictions for y
    # out-of-fold predictions for D
    # use predict or predict_proba dependent on classifier or regressor for D
    if classifier:
        Dhat = cross_val_predict(modeld, X, D, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]
    else:
        Dhat = cross_val_predict(modeld, X, D, cv=cv, n_jobs=-1)
    # calculate outcome and treatment residuals
    resy = y - yhat
    resD = D - Dhat

    if cluster:
        # final-stage OLS with standard errors clustered by county
        # (assumes `clu` is a pandas Series named 'CountyCode')
        dml_data = pd.concat([clu, pd.Series(time), pd.Series(resy, name='resy'), pd.Series(resD, name='resD')], axis=1)
        ols_mod = smf.ols(formula='resy ~ 1 + resD', data=dml_data).fit(cov_type='cluster', cov_kwds={"groups": dml_data['CountyCode']})
    else:
        # final-stage OLS with conventional (non-clustered) standard errors
        dml_data = pd.concat([pd.Series(resy, name='resy'), pd.Series(resD, name='resD')], axis=1)
        ols_mod = smf.ols(formula='resy ~ 1 + resD', data=dml_data).fit()

    # select the treatment coefficient by name rather than by position
    point = ols_mod.params['resD']
    stderr = ols_mod.bse['resD']
    epsilon = ols_mod.resid

    return point, stderr, yhat, Dhat, resy, resD, epsilon
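
The final stage above relies on cluster-robust standard errors. As a hedged sketch of what that computation involves, the following hand-rolled sandwich estimator handles a simple regression on simulated data. The helper `cluster_se` and the data are hypothetical, and no finite-sample correction is applied, so packaged routines that apply one will report slightly different numbers.

```python
import numpy as np

def cluster_se(x, y, groups):
    """Slope and cluster-robust SE for OLS of y on [1, x] (no finite-sample correction)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    u = y - X @ beta
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((2, 2))
    for g in np.unique(groups):
        s = X[groups == g].T @ u[groups == g]   # score summed within cluster g
        meat += np.outer(s, s)
    cov = bread @ meat @ bread
    return beta[1], np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
n_clusters, n_per = 50, 20
groups = np.repeat(np.arange(n_clusters), n_per)
# Regressor and error both share a cluster-level component
x = rng.normal(size=n_clusters)[groups] + rng.normal(size=n_clusters * n_per)
e = rng.normal(size=n_clusters)[groups] + rng.normal(size=n_clusters * n_per)
y = 1.0 + 0.2 * x + e

slope, se_cluster = cluster_se(x, y, groups)
# Treating every observation as its own cluster reproduces the HC0 robust SE
_, se_hc0 = cluster_se(x, y, np.arange(len(x)))
print(f"slope = {slope:.3f}, clustered SE = {se_cluster:.3f}, HC0 SE = {se_hc0:.3f}")
```

When errors and regressors are correlated within clusters, as in this simulation and plausibly within counties over time, the clustered standard error is typically larger than the one that ignores clustering.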

In [None]:
def summary(point, stderr, yhat, Dhat, resy, resD, epsilon, X, D, y, *, name):
    '''
    Convenience summary function that takes the results of the DML function
    and summarizes several estimation quantities and performance metrics.
    '''
    return pd.DataFrame({'estimate': point, # point estimate
                         'stderr': stderr, # standard error
                         'lower': point - 1.96*stderr, # lower end of 95% confidence interval
                         'upper': point + 1.96*stderr, # upper end of 95% confidence interval
                         'rmse y': np.sqrt(np.mean(resy**2)), # RMSE of model that predicts outcome y
                         'rmse D': np.sqrt(np.mean(resD**2)) # RMSE of model that predicts treatment D
                         }, index=[name])

In the following, we apply the DML approach with the different versions of lasso.


## Lasso

In [None]:
!pip install multiprocess
!git clone https://github.com/maxhuppertz/hdmpy.git

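The two commands in the cell above make `hdmpy` available for rigorous lasso: `pip install multiprocess` installs the `multiprocess` package, and `git clone https://github.com/maxhuppertz/hdmpy.git` clones `hdmpy` into the working directory. Outside a notebook, run them without the leading `!`.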

In [None]:
import hdmpy
from sklearn.base import BaseEstimator, clone

class RLasso(BaseEstimator):

    def __init__(self, *, post=True):
        self.post = post

    def fit(self, X, y):
        self.rlasso_ = hdmpy.rlasso(X, y, post=self.post)
        return self

    def predict(self, X):
        return np.array(X) @ np.array(self.rlasso_.est['beta']).flatten() + np.array(self.rlasso_.est['intercept'])

lasso_model = lambda: RLasso(post=False)

In [None]:
# DML with RLasso:
modely = make_pipeline(StandardScaler(), RLasso(post=False))
modeld = make_pipeline(StandardScaler(), RLasso(post=False))

# Run DML model with nfolds folds of cross-fitting
result_RLasso = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False, time = time, clu = clu, cluster = True)
table_RLasso = summary(*result_RLasso, Z, D, Y, name='RLasso')
table_RLasso

In [None]:
# DML with Post-Lasso:
modely = make_pipeline(StandardScaler(), RLasso(post=True))
modeld = make_pipeline(StandardScaler(), RLasso(post=True))

# Run DML model with nfolds folds of cross-fitting
result_post = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False, time = time, clu = clu, cluster = True)
table_post = summary(*result_post, Z, D, Y, name='Post Lasso')
table_post

In [None]:
# Now let's do cross-validated Lasso, Ridge, and ElasticNet
cv = KFold(n_splits=10, shuffle=True, random_state=123) # shuffled k-folds

In [None]:
# Define LassoCV models with n_splits folds of cross-validation
modely = make_pipeline(StandardScaler(), LassoCV(cv=cv))
modeld = make_pipeline(StandardScaler(), LassoCV(cv=cv))

# Run DML model with nfolds folds of cross-fitting
result_LassoCV = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False, time = time, clu = clu, cluster = True)
table_LassoCV = summary(*result_LassoCV, Z, D, Y, name='LassoCV')
table_LassoCV

In [None]:
# Define RidgeCV models with n_splits folds of cross-validation
modely = make_pipeline(StandardScaler(), RidgeCV(cv=cv))
modeld = make_pipeline(StandardScaler(), RidgeCV(cv=cv))

# Run DML model with nfolds folds of cross-fitting
result_RidgeCV = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False, time = time, clu = clu, cluster = True)
table_RidgeCV = summary(*result_RidgeCV, Z, D, Y, name='RidgeCV')
table_RidgeCV

In [None]:
# Define ElasticNetCV models with n_splits folds of cross-validation
modely = make_pipeline(StandardScaler(), ElasticNetCV(l1_ratio = 0.5, cv=cv))
modeld = make_pipeline(StandardScaler(), ElasticNetCV(l1_ratio = 0.5, cv=cv))

# Run DML model with nfolds folds of cross-fitting
result_ENetCV = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False, time = time, clu = clu, cluster = True)
table_ENetCV = summary(*result_ENetCV, Z, D, Y, name='ENetCV')
table_ENetCV

Here we also compute DML with OLS used as the ML method. By the Frisch-Waugh-Lovell (FWL) theorem, this produces results similar to the OLS regression with controls above; they differ slightly because we use cross-fitting.

In [None]:
# DML with OLS:
modely = make_pipeline(StandardScaler(), LinearRegression())
modeld = make_pipeline(StandardScaler(), LinearRegression())

# Run DML model with nfolds folds of cross-fitting
result_OLS = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False, time = time, clu = clu, cluster = True)
table_OLS = summary(*result_OLS, Z, D, Y, name='OLS (DML)')
table_OLS
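
The FWL equivalence invoked for the OLS case can be checked numerically. The sketch below uses simulated data with hypothetical names and no cross-fitting. It shows that the coefficient on the treatment from the full regression equals the coefficient from regressing in-sample residuals on residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 5
Z = rng.normal(size=(n, p))
D = Z @ rng.normal(size=p) + rng.normal(size=n)
Y = 0.3 * D + Z @ rng.normal(size=p) + rng.normal(size=n)

def ols_fit(X, y):
    """Least-squares coefficients of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

W = np.column_stack([np.ones(n), Z])   # controls plus intercept

# (1) Full regression of Y on D and the controls
beta_full = ols_fit(np.column_stack([D, W]), Y)[0]

# (2) Partial the controls out of Y and D in-sample, then regress residual on residual
res_y = Y - W @ ols_fit(W, Y)
res_d = D - W @ ols_fit(W, D)
beta_fwl = (res_d @ res_y) / (res_d @ res_d)

print(np.isclose(beta_full, beta_fwl))
```

With cross-fitting the residuals are computed out-of-fold rather than in-sample, so the exact equality is replaced by a close approximation.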

Next, we also apply Random Forest for comparison purposes.

### Random Forest


In [None]:
# DML with Random Forests. RFs don't require scaling but we do it for consistency
modely = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=123))
modeld = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=123))

# Run DML model with nfolds folds of cross-fitting (computationally intensive)
result_RF = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False, time = time, clu = clu, cluster = True)
table_RF = summary(*result_RF, Z, D, Y, name='RF')
table_RF

### Neural Networks

In [None]:
# DML with NNs
modely = make_pipeline(StandardScaler(),
                       MLPRegressor(hidden_layer_sizes=(16, 16), activation='relu',
                                    learning_rate_init=0.01,
                                    batch_size=10, max_iter=100))
modeld = make_pipeline(StandardScaler(),
                       MLPRegressor(hidden_layer_sizes=(16, 16), activation='relu',
                                    learning_rate_init=0.01,
                                    batch_size=10, max_iter=100))

# Run DML model with nfolds folds of cross-fitting
result_NN = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False, time = time, clu = clu, cluster = True)
table_NN = summary(*result_NN, Z, D, Y, name='NN')
table_NN

We conclude that gun ownership rates are positively related to gun homicide rates: if gun ownership increases by 1% relative
to a trend, then the predicted gun homicide rate goes up by about 0.20%, controlling for counties' characteristics.

Finally, we compare the methods by predictive performance, using the RMSEs for predicting $D$ and $Y$ computed above.


In [None]:
rmses = pd.concat([table_OLS, table_RLasso, table_post, table_LassoCV, table_ENetCV, table_RidgeCV, table_RF, table_NN], axis=0).iloc[:,-2:]
rmses

It looks like the best method for predicting D is ElasticNetCV, and the best method for predicting Y is RidgeCV.
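
Rather than reading the winners off the table by eye, the selection can be done programmatically. The sketch below uses a small table of placeholder values, which are illustrative only and are not results from the gun data, with the same column names as the RMSE table.

```python
import pandas as pd

# Placeholder RMSE values for illustration only (not results from the gun data)
rmses_demo = pd.DataFrame(
    {"rmse y": [0.50, 0.48, 0.52], "rmse D": [0.12, 0.13, 0.11]},
    index=["method A", "method B", "method C"],
)

# Row label with the smallest RMSE in each column
best = rmses_demo.idxmin()
print(best)
```

Applied to the actual table, `rmses.idxmin()` would return the best learner for each of the two prediction problems.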


In [None]:
# DML with Bests:
modely = make_pipeline(StandardScaler(), RidgeCV(cv=cv))
modeld = make_pipeline(StandardScaler(), ElasticNetCV(l1_ratio = 0.5, cv=cv))

# Run DML model with nfolds folds of cross-fitting
result_best = dml(Z, D, Y, modely, modeld, nfolds=5, classifier=False, time = time, clu = clu, cluster = True)
table_best = summary(*result_best, Z, D, Y, name='Best')
table_best

Let's organize the results in a table.

In [None]:
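# Refit the two clustered OLS baselines (`ols_model` was overwritten above), then collect all estimates
fit_kw = dict(cov_type='cluster', cov_kwds={"groups": data['CountyCode']})
simple_clu = smf.ols('logghomr ~ 1 + logfssl', data=data).fit(**fit_kw)
all_clu = smf.ols(form, data=data).fit(**fit_kw)
table = pd.concat([table_OLS, table_RLasso, table_post, table_LassoCV, table_ENetCV, table_RidgeCV, table_RF, table_NN, table_best], axis=0).iloc[:, 0:2]
table = pd.concat([pd.DataFrame({'estimate': [simple_clu.params[d]], 'stderr': [simple_clu.bse[d]]}, index=["Baseline (Y~D)"]),
                   pd.DataFrame({'estimate': [all_clu.params[d]], 'stderr': [all_clu.bse[d]]}, index=["Baseline (Y~D+Z)"]),
                   table], axis=0)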

In [None]:
print(table)