# Validation of other estimation methods for omega
- In this report, we estimate `omega` with methods other than the one used in the original paper and compare the results with AA-tests.
- `omega` is the vector of weights on the features, i.e., the units in the donor pool, in Synthetic Control Methods.
- The classical Synthetic Control Method (ADH) imposes the following restrictions:
    - non-negativity of weights
    - summing to one
    - no intercept
- The original paper allows an intercept when estimating `omega`. (It also adds an L2 regularization term to the loss function.)
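
To make the ADH restrictions concrete, the following is a minimal sketch that fits donor weights by constrained least squares on made-up data. The toy panel, the variable names, and the choice of `scipy.optimize.minimize` with SLSQP are assumptions for illustration; this is not how PySynthDID estimates `omega`.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical toy data: 12 pre-intervention periods, 5 donor units.
n_periods, n_donors = 12, 5
Y_donors = rng.normal(size=(n_periods, n_donors))
true_w = np.array([0.6, 0.4, 0.0, 0.0, 0.0])
y_treated = Y_donors @ true_w + rng.normal(scale=0.01, size=n_periods)


def pre_period_loss(w):
    """Mean squared pre-period gap; no intercept term, as in ADH."""
    return np.mean((y_treated - Y_donors @ w) ** 2)


w0 = np.full(n_donors, 1.0 / n_donors)  # start from uniform weights
result = minimize(
    pre_period_loss,
    w0,
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n_donors,  # non-negativity of weights
    constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],  # summing to one
)
omega_adh = result.x
print(np.round(omega_adh, 3))
```

Dropping the `bounds` or `constraints` arguments, or adding a constant term to the prediction, corresponds to relaxing the respective ADH condition.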

## Additional methods for PySynthDID
### (1) Search zeta by cross validation
- `zeta` is a hyper-parameter in the estimation of `omega`.
- In the original paper, a theoretical value is used for `zeta`.
- In this note, instead of the theoretical value, we select `zeta` by cross-validation within the pre-intervention period and compare the results. Two search strategies are used:
    - Grid Search
    - Bayesian Optimization
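
The idea of selecting `zeta` by cross-validation inside the pre-intervention period can be sketched as follows. This is a simplified stand-in, not PySynthDID's implementation: it uses an unconstrained ridge regression with an intercept instead of the constrained `omega` problem, the penalty scaling `zeta**2 * n_train` is an assumed choice for illustration, and the data and helper names (`fit_ridge`, `expanding_window_splits`, `cv_error`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy panel: 19 pre-intervention periods, 8 donor units.
n_periods, n_donors = 19, 8
X = rng.normal(size=(n_periods, n_donors)).cumsum(axis=0)  # donor outcomes
y = X[:, :3].mean(axis=1) + rng.normal(scale=0.1, size=n_periods)  # target unit


def fit_ridge(X_train, y_train, zeta):
    """Ridge fit with an unpenalized intercept (assumed penalty: zeta**2 * n_train)."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc, yc = X_train - x_mean, y_train - y_mean
    penalty = zeta**2 * len(y_train) * np.eye(X_train.shape[1])
    w = np.linalg.solve(Xc.T @ Xc + penalty, Xc.T @ yc)
    return w, y_mean - x_mean @ w


def expanding_window_splits(n, n_splits):
    """Yield (train, validation) indices so that training never uses future periods."""
    fold = n // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = n - (n_splits - k + 1) * fold
        yield np.arange(train_end), np.arange(train_end, train_end + fold)


def cv_error(zeta, n_splits=3):
    """Average validation MSE over the expanding-window folds."""
    errors = []
    for train, valid in expanding_window_splits(n_periods, n_splits):
        w, b = fit_ridge(X[train], y[train], zeta)
        errors.append(np.mean((y[valid] - (X[valid] @ w + b)) ** 2))
    return float(np.mean(errors))


# Grid search: evaluate each candidate and keep the one with the lowest CV error.
zeta_grid = np.logspace(-2, 1, 10)
cv_scores = np.array([cv_error(z) for z in zeta_grid])
best_zeta = zeta_grid[int(np.argmin(cv_scores))]
print("selected zeta:", best_zeta)
```

Bayesian optimization replaces the fixed grid with a sequential search over the same `cv_error` objective, which is why its cost also scales with the number of folds and evaluations.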

### (2) Significant relaxation of ADH conditions
- The ADH conditions make the weights easy to interpret, but they do not appear to be mathematically necessary.
- Here, we relax the `sum(w)=1` condition and the non-negativity constraint. Specifically, we fit Lasso, Ridge, and ElasticNet with cross-validation and adopt the resulting regression coefficients as `omega`.
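
As a rough illustration of this relaxation, the sketch below fits the three regularized regressions with scikit-learn's built-in cross-validated estimators and reads the coefficients as donor weights. The toy data, the alpha grid, and the use of `TimeSeriesSplit` are assumptions; how PySynthDID configures these models internally is not shown in this note.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)

# Hypothetical toy data: 19 pre-intervention periods, 8 donor units.
n_periods, n_donors = 19, 8
X = rng.normal(size=(n_periods, n_donors))
y = 0.7 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=n_periods)

cv = TimeSeriesSplit(n_splits=3)  # keep validation periods after training periods
models = {
    "Lasso": LassoCV(cv=cv),
    "Ridge": RidgeCV(alphas=np.logspace(-3, 2, 20), cv=cv),
    "ElasticNet": ElasticNetCV(l1_ratio=0.5, cv=cv),
}

omega = {}
for name, model in models.items():
    model.fit(X, y)
    omega[name] = model.coef_  # coefficients used as relaxed donor weights
    print(
        name,
        "sum of weights:", round(float(model.coef_.sum()), 3),
        "negative weights:", int((model.coef_ < 0).sum()),
    )
```

Unlike ADH weights, these coefficients are free to be negative and need not sum to one, and each model fits an intercept by default.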

In [None]:
import warnings

warnings.filterwarnings("ignore")

import sys
import os

sys.path.append(os.path.abspath("../"))

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from scipy.stats import spearmanr
plt.style.use('ggplot')

from tqdm import tqdm

from synthdid.model import SynthDID
from synthdid.sample_data import fetch_CaliforniaSmoking

In [None]:
df = fetch_CaliforniaSmoking()

PRE_TEREM = [1970, 1988]
POST_TEREM = [1989, 2000]

TREATMENT = ["California"]

df.head()

In [None]:
sdid = SynthDID(df, PRE_TEREM, POST_TEREM, TREATMENT)
sdid.fit(zeta_type="base", sparce_estimation=True)

In [None]:
# e.g., inspect the parameters estimated with ElasticNet
sdid.estimated_params(model="ElasticNet")

## (1) Search zeta by cross validation
- Selecting `zeta` by cross-validation did not yield a noticeable performance improvement over the theoretical value.
- Considering the cost of cross-validation, we believe that the choice of theoretical values in the original paper is reasonable.
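
A note on the metric: the evaluation cells in this report compute `np.sqrt((_outcome.mean() - _outcome.mean()["actual_y"]) ** 2)`, which is the absolute gap between post-period means rather than a per-period RMSE. The made-up numbers below show how the two quantities can differ when errors of opposite sign cancel.

```python
import numpy as np

# Hypothetical post-period outcomes: the prediction alternates above and below the actual value.
actual = np.array([100.0, 100.0, 100.0, 100.0])
predicted = np.array([110.0, 90.0, 110.0, 90.0])

# Metric used in this report: absolute difference of the period means.
abs_mean_gap = np.sqrt((predicted.mean() - actual.mean()) ** 2)

# Per-period root mean squared error.
rmse = np.sqrt(np.mean((predicted - actual) ** 2))

print(abs_mean_gap, rmse)  # 0.0 10.0
```

The mean-gap metric fits the question of how well the average post-period effect is recovered, while a per-period RMSE would also penalize year-by-year deviations.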

In [None]:
PRE_TEREM2 = [1970, 1979]
POST_TEREM2 = [1980, 1988]

_sdid = SynthDID(df, PRE_TEREM2, POST_TEREM2, TREATMENT)

print("zeta with original")
_sdid.fit(zeta_type="base")

_outcome = pd.DataFrame({"actual_y": _sdid.target_y()})

_outcome["did"] = _sdid.did_potentical_outcome()
_outcome["sc"] = _sdid.sc_potentical_outcome()
_outcome["sdid"] = _sdid.sdid_potentical_outcome()
print("original zeta : ", _sdid.zeta)

print("zeta with grid_search")
_sdid.fit(zeta_type="grid_search", cv=3, cv_split_type="TimeSeriesSplit", n_candidate=10)
_outcome["sdid_grid_search"] = _sdid.sdid_potentical_outcome()
print("grid_search zeta : ", _sdid.zeta)

print("zeta with bayesian optimization")
_sdid.fit(zeta_type="bayesian_opt", cv=3, cv_split_type="TimeSeriesSplit")
_outcome["sdid_bayesian_opt"] = _sdid.sdid_potentical_outcome()
print("bayesian_opt zeta : ", _sdid.zeta)

_outcome = _outcome.loc[POST_TEREM2[0] : POST_TEREM2[1]]

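# Note: this is the absolute gap between post-period means, |mean(estimate) - mean(actual)|,
# not a per-period RMSE, although the variable keeps the name "_rmse".
_rmse = np.sqrt((_outcome.mean() - _outcome.mean()["actual_y"]) ** 2)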
pd.DataFrame(_rmse).T[
    ["did", "sc", "sdid", "sdid_grid_search", "sdid_bayesian_opt"]
]


In [None]:
state_list = df.columns

result_rmse_list = []
result_zeta_dict = {}

for _state in tqdm(state_list):
    _sdid = SynthDID(df, PRE_TEREM2, POST_TEREM2, [_state])
    
    _sdid.fit(zeta_type="base")

    _outcome = pd.DataFrame({"actual_y": _sdid.target_y()})

    _outcome["did"] = _sdid.did_potentical_outcome()
    _outcome["sc"] = _sdid.sc_potentical_outcome()
    _outcome["sdid"] = _sdid.sdid_potentical_outcome()
    _base_zeta = _sdid.zeta
    
    print("zeta with bayesian optimization")
    _sdid.fit(zeta_type="bayesian_opt", cv=3, cv_split_type="TimeSeriesSplit")
    _outcome["sdid_bayesian_opt"] = _sdid.sdid_potentical_outcome()
    _cv_zeta = _sdid.zeta
    
    _outcome = _outcome.loc[POST_TEREM2[0] : POST_TEREM2[1]]

    # Absolute gap between the post-period means of each estimate and the actual outcome
    _rmse = np.sqrt((_outcome.mean() - _outcome.mean()["actual_y"]) ** 2)
    _rmse = pd.DataFrame(_rmse).T[
        ["did", "sc", "sdid", "sdid_bayesian_opt"]
    ]
    _rmse.index = [_state]

    result_rmse_list.append(_rmse)
    result_zeta_dict[_state] = {"base" : _base_zeta, "cv" : _cv_zeta} 
    
result_rmse = pd.concat(result_rmse_list)

In [None]:
pd.DataFrame(result_zeta_dict).T

In [None]:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1) 
_x = np.linspace(0, 50, 30)
_y = _x
sns.scatterplot(data=result_rmse, x="sdid_bayesian_opt", y="sdid", ax=ax)
ax.plot(_x, _y, color="black", linestyle="solid", linewidth=0.5)
ax.set_xlabel("RMSE : Synthetic Diff. in Diff with zeta CV search")
ax.set_ylabel("RMSE : Synthetic Diff. in Diff with original zeta")
#ax.set_xlim(0, 25)
#ax.set_ylim(0, 55)
plt.show()

In [None]:
result_rmse.mean()

---
## (2) Significant relaxation of ADH conditions
- The AA-test results below confirm that the original Synthetic Diff. in Diff. performs better than sdid with Lasso, Ridge, and ElasticNet.
- This seems consistent with the premise of the original paper. Synthetic Diff. in Diff. was developed to compensate for a shortcoming of classical SC: its weights concentrate on a few specific donors, which makes the results unstable.
- Among the sparse regressions, Ridge performed best, while Lasso behaved similarly to classical SC, which also seems consistent with the premise of the original paper.
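
For readers unfamiliar with the AA-test used here, the sketch below shows its structure on a made-up panel with no intervention: each unit is treated in turn as a pseudo-treated unit, a counterfactual is built from the remaining units, and the gap to the actual post-period outcome is recorded. For brevity it uses a plain DID counterfactual, and all data and names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical panel with no intervention: rows are years, columns are units.
years = np.arange(1970, 1989)
units = [f"unit_{i}" for i in range(6)]
common_trend = np.linspace(120.0, 100.0, len(years))[:, None]
unit_levels = rng.normal(scale=5.0, size=(1, len(units)))
noise = rng.normal(scale=1.0, size=(len(years), len(units)))
panel = pd.DataFrame(common_trend + unit_levels + noise, index=years, columns=units)

pre = panel.loc[1970:1979]   # pseudo pre-intervention period
post = panel.loc[1980:1988]  # pseudo post-intervention period

aa_errors = {}
for unit in units:
    donors = [u for u in units if u != unit]
    # DID counterfactual: donor post-period mean shifted by the pre-period level gap.
    level_gap = pre[unit].mean() - pre[donors].mean().mean()
    counterfactual_mean = post[donors].mean().mean() + level_gap
    # With no real treatment, any gap to the actual outcome is estimation error.
    aa_errors[unit] = abs(counterfactual_mean - post[unit].mean())

aa_result = pd.Series(aa_errors, name="abs_error")
print(aa_result.mean())
```

Because no unit is actually treated in the pseudo post-period, a smaller average error indicates a method that produces fewer spurious effects.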

In [None]:
state_list = df.columns

result_rmse_list = []

for _state in tqdm(state_list):
    _sdid = SynthDID(df, PRE_TEREM2, POST_TEREM2, [_state])
    _sdid.fit(zeta_type="base", sparce_estimation=True)

    _outcome = pd.DataFrame({"actual_y": _sdid.target_y()})

    _outcome["did"] = _sdid.did_potentical_outcome()
    _outcome["sc"] = _sdid.sc_potentical_outcome()
    _outcome["sdid"] = _sdid.sdid_potentical_outcome()
    _outcome["sdid_ElasticNet"] = _sdid.sparceReg_potentical_outcome(model="ElasticNet")
    _outcome["sdid_Lasso"] = _sdid.sparceReg_potentical_outcome(model="Lasso")
    _outcome["sdid_Ridge"] = _sdid.sparceReg_potentical_outcome(model="Ridge")
    _outcome = _outcome.loc[POST_TEREM2[0] : POST_TEREM2[1]]

    _rmse = np.sqrt((_outcome.mean() - _outcome.mean()["actual_y"]) ** 2)
    _rmse = pd.DataFrame(_rmse).T[
        ["did", "sc", "sdid", "sdid_ElasticNet", "sdid_Lasso", "sdid_Ridge"]
    ]
    _rmse.index = [_state]

    result_rmse_list.append(_rmse)
    
result_rmse = pd.concat(result_rmse_list)

In [None]:
result_rmse.mean()