# Other Omega Estimation Methods
- This report estimates `omega` with methods that differ from the original paper and checks the results with AA tests.
- `omega` is the vector of weights assigned to each feature (donor-pool unit) in Synthetic Control Methods.
- The classical Synthetic Control Method (ADH) imposes the following restrictions:
    - non-negativity of weights
    - weights summing to one
    - no intercept
- The original Synthetic Difference-in-Differences paper allows an intercept and also adds an L2 regularization term to the loss function.
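
For reference, the two weight problems can be written side by side. This is a sketch in outcome-only form, using notation recalled from the Synthetic Difference-in-Differences paper: $Y_{it}$ is the outcome of unit $i$ in period $t$, $\bar{Y}_{tr,t}$ is the average outcome of the treated units, $N_{co}$ is the number of donor units, and $T_{pre}$ is the number of pre-intervention periods. The symbols are assumptions of this note, not identifiers from the library.

$$
\hat{\omega}^{sc} = \arg\min_{\omega \in \Omega} \sum_{t=1}^{T_{pre}} \Big( \sum_{i=1}^{N_{co}} \omega_i Y_{it} - \bar{Y}_{tr,t} \Big)^2,
\qquad
\Omega = \Big\{ \omega \in \mathbb{R}_{+}^{N_{co}} : \sum_{i=1}^{N_{co}} \omega_i = 1 \Big\}
$$

$$
(\hat{\omega}_0, \hat{\omega}^{sdid}) = \arg\min_{\omega_0 \in \mathbb{R},\ \omega \in \Omega} \sum_{t=1}^{T_{pre}} \Big( \omega_0 + \sum_{i=1}^{N_{co}} \omega_i Y_{it} - \bar{Y}_{tr,t} \Big)^2 + \zeta^2 T_{pre} \lVert \omega \rVert_2^2
$$

The second problem keeps the same feasible set $\Omega$ and differs only in the intercept $\omega_0$ and the L2 penalty scaled by $\zeta$.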

## Additional methods for PySynthDID
### (1) Search zeta by cross validation
- `zeta` is a hyperparameter in the estimation of `omega`.
- The original paper uses a theoretically derived value for `zeta`.
- This module instead searches for a better `zeta` by cross-validation within the pre-intervention period, using either:
    - Grid Search
    - Bayesian Optimization
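
The grid-search idea can be sketched with NumPy alone. This is an illustration under stated assumptions, not the library's implementation: `project_simplex`, `fit_omega`, and `search_zeta` are hypothetical helpers, the data is synthetic, and the validation scheme (holding out the last few pre-intervention periods) is only one possible choice.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def fit_omega(X, y, zeta, n_iter=2000):
    """Simplex-constrained least squares with intercept and penalty zeta^2 * T * ||w||^2."""
    T, N = X.shape
    # Centering over time absorbs the intercept exactly.
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    penalty = zeta ** 2 * T
    # Step size 1/L, where L bounds the curvature of the objective.
    step = 1.0 / (2.0 * (np.linalg.norm(Xc, 2) ** 2 + penalty))
    w = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        grad = 2.0 * (Xc.T @ (Xc @ w - yc) + penalty * w)
        w = project_simplex(w - step * grad)
    intercept = y.mean() - X.mean(axis=0) @ w
    return intercept, w

def search_zeta(X, y, zeta_grid, n_val=3):
    """Hold out the last n_val pre-periods; return the zeta with the lowest validation RMSE."""
    X_tr, y_tr, X_val, y_val = X[:-n_val], y[:-n_val], X[-n_val:], y[-n_val:]
    scores = {}
    for zeta in zeta_grid:
        b, w = fit_omega(X_tr, y_tr, zeta)
        scores[zeta] = float(np.sqrt(np.mean((b + X_val @ w - y_val) ** 2)))
    return min(scores, key=scores.get), scores

# Synthetic pre-intervention panel: 19 periods, 8 donors (illustrative only).
rng = np.random.default_rng(0)
T_pre, n_donors = 19, 8
trend = np.linspace(0, 5, T_pre)
X = trend[:, None] + rng.normal(size=(T_pre, n_donors))
y = 3.0 + 0.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=T_pre)

zeta_grid = [0.01, 0.1, 0.5, 1.0, 5.0]
best_zeta, scores = search_zeta(X, y, zeta_grid)
intercept, omega = fit_omega(X, y, best_zeta)
print(best_zeta, np.round(omega, 3))
```

Bayesian optimization would replace the loop over `zeta_grid` with a sequential search over the same validation score.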

### (2) Significant relaxation of ADH conditions
- The ADH conditions are valuable for interpretability, but they do not appear to be mathematically necessary.
- Here we relax both the `sum(w)=1` condition and the non-negativity constraint. Specifically, we fit Lasso, Ridge, and ElasticNet with cross-validation and adopt the resulting regression coefficients as `omega`.
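
A minimal sketch of this idea with scikit-learn, assuming synthetic data and plain k-fold cross-validation. The variable names and the choice of `LassoCV`, `RidgeCV`, and `ElasticNetCV` are assumptions for illustration; the library's own fitting code may differ.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV

# Synthetic pre-intervention data: rows are periods, columns are donor units.
rng = np.random.default_rng(1)
T_pre, n_donors = 19, 10
X_pre = rng.normal(size=(T_pre, n_donors))
true_omega = np.zeros(n_donors)
true_omega[:3] = [0.6, 0.3, -0.2]  # a negative weight is allowed here
y_pre = 1.5 + X_pre @ true_omega + rng.normal(scale=0.05, size=T_pre)

# Each estimator tunes its own penalty strength by cross-validation.
models = {
    "Lasso": LassoCV(cv=3),
    "Ridge": RidgeCV(alphas=np.logspace(-3, 2, 20)),
    "ElasticNet": ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=3),
}

# The fitted coefficients play the role of omega; the intercept is fitted separately.
omega = {name: model.fit(X_pre, y_pre).coef_ for name, model in models.items()}
for name, w in omega.items():
    print(name, np.round(w, 3), "sum =", round(float(w.sum()), 3))
```

Nothing forces these weights to be non-negative or to sum to one, which is exactly the relaxation described above. Plain k-fold splitting ignores time order, so a time-aware split may be preferable on real panels.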

In [1]:
import warnings

warnings.filterwarnings("ignore")

import sys
import os

sys.path.append(os.path.abspath("../"))

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from scipy.stats import spearmanr
plt.style.use('ggplot')

from tqdm import tqdm

from synthdid.model import SynthDID
from synthdid.sample_data import fetch_CaliforniaSmoking

In [2]:
df = fetch_CaliforniaSmoking()

PRE_TEREM = [1970, 1988]
POST_TEREM = [1989, 2000]

TREATMENT = ["California"]

df.head()

| | Alabama | Arkansas | Colorado | Connecticut | Delaware | Georgia | Idaho | Illinois | Indiana | Iowa | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | West Virginia | Wisconsin | Wyoming | California |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1970 | 89.800003 | 100.300003 | 124.800003 | 120.0 | 155.0 | 109.900002 | 102.400002 | 124.800003 | 134.600006 | 108.5 | ... | 92.699997 | 99.800003 | 106.400002 | 65.5 | 122.599998 | 124.300003 | 114.5 | 106.400002 | 132.199997 | 123.0 |
| 1971 | 95.400002 | 104.099998 | 125.5 | 117.599998 | 161.100006 | 115.699997 | 108.5 | 125.599998 | 139.300003 | 108.400002 | ... | 96.699997 | 106.300003 | 108.900002 | 67.699997 | 124.400002 | 128.399994 | 111.5 | 105.400002 | 131.699997 | 121.0 |
| 1972 | 101.099998 | 103.900002 | 134.300003 | 110.800003 | 156.300003 | 117.0 | 126.099998 | 126.599998 | 149.199997 | 109.400002 | ... | 103.0 | 111.5 | 108.599998 | 71.300003 | 138.0 | 137.0 | 117.5 | 108.800003 | 140.0 | 123.5 |
| 1973 | 102.900002 | 108.0 | 137.899994 | 109.300003 | 154.699997 | 119.800003 | 121.800003 | 124.400002 | 156.0 | 110.599998 | ... | 103.5 | 109.699997 | 110.400002 | 72.699997 | 146.800003 | 143.100006 | 116.599998 | 109.5 | 141.199997 | 124.400002 |
| 1974 | 108.199997 | 109.699997 | 132.800003 | 112.400002 | 151.300003 | 123.699997 | 125.599998 | 131.899994 | 159.600006 | 116.099998 | ... | 108.400002 | 114.800003 | 114.699997 | 75.599998 | 151.800003 | 149.600006 | 119.900002 | 111.800003 | 145.800003 | 126.699997 |


In [3]:
sdid = SynthDID(df, PRE_TEREM, POST_TEREM, TREATMENT)
sdid.fit(zeta_type="base", sparce_estimation=True)

In [4]:
# e.g., inspect the donor weights estimated by ElasticNet
sdid.estimated_params(model="ElasticNet")

| | features | ElasticNet_weight |
|---|---|---|
| 0 | Alabama | -0.0 |
| 1 | Arkansas | -0.0 |
| 2 | Colorado | 0.063 |
| 3 | Connecticut | 0.12 |
| 4 | Delaware | 0.0 |
| 5 | Georgia | -0.0 |
| 6 | Idaho | 0.0 |
| 7 | Illinois | 0.203 |
| 8 | Indiana | -0.0 |
| 9 | Iowa | 0.0 |


## (1) Search zeta by cross validation
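- For now, the AA test was run only for California. The test treats the later part of the pre-intervention period (1980 to 1988) as a pseudo-intervention period, so a good estimator should reproduce the actual outcome there.
- Searching `zeta` gave a substantial improvement, as shown below: the SDID error falls from about 4.49 with the base `zeta` to about 1.23 with grid search and 1.52 with Bayesian optimization.
- TODO: Confirm the results for the other states at a later date.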
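
The logic of an AA test can be illustrated with a plain difference-in-differences counterfactual on synthetic data. This is a sketch only: `aa_test_error` is a hypothetical helper, not part of `synthdid`, and the panel below is made up so that the expected behavior is easy to see.

```python
import numpy as np

def aa_test_error(y_treated, Y_control, years, pseudo_start):
    """Absolute gap between actual and DID-predicted means in a placebo period.

    No intervention happens at pseudo_start, so a good estimator should
    return a value close to zero.
    """
    years = np.asarray(years)
    pre, post = years < pseudo_start, years >= pseudo_start
    control_avg = Y_control.mean(axis=1)
    # DID counterfactual: treated pre-period level plus the change seen in controls.
    counterfactual = y_treated[pre].mean() + (control_avg[post] - control_avg[pre].mean())
    return float(abs(y_treated[post].mean() - counterfactual.mean()))

years = np.arange(1970, 1989)
trend = 100.0 + 2.0 * (years - 1970)
Y_control = trend[:, None] + np.array([-3.0, 0.0, 4.0])  # three donor units

# Case 1: the pseudo-treated unit follows the same trend as the controls.
parallel = aa_test_error(trend + 5.0, Y_control, years, pseudo_start=1980)

# Case 2: the pseudo-treated unit drifts away, so DID misses it.
diverging = aa_test_error(trend + 0.5 * (years - 1970), Y_control, years, pseudo_start=1980)

print(parallel, diverging)
```

The same split is what `PRE_TEREM2` and `POST_TEREM2` encode in the cell below, with the estimators under comparison taking the place of the simple DID counterfactual.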

In [5]:
PRE_TEREM2 = [1970, 1979]
POST_TEREM2 = [1980, 1988]

_sdid = SynthDID(df, PRE_TEREM2, POST_TEREM2, TREATMENT)

print ("zeta with original")
_sdid.fit(zeta_type="base")

_outcome = pd.DataFrame({"actual_y": _sdid.target_y()})

_outcome["did"] = _sdid.did_potentical_outcome()
_outcome["sc"] = _sdid.sc_potentical_outcome()
_outcome["sdid"] = _sdid.sdid_potentical_outcome()

print ("zeta with grid_search")
_sdid.fit(zeta_type="grid_search")
_outcome["sdid_grid_search"] = _sdid.sdid_potentical_outcome()

print ("zeta with bayesian optimaization")
_sdid.fit(zeta_type="bayesian_opt")
_outcome["sdid_bayesian_opt"] = _sdid.sdid_potentical_outcome()
_outcome = _outcome.loc[POST_TEREM2[0] : POST_TEREM2[1]]

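# Note: sqrt(x ** 2) is |x|, so this is the absolute gap between the post-period
# mean of each estimate and that of the actual outcome, not a per-year RMSE.
_rmse = np.sqrt((_outcome.mean() - _outcome.mean()["actual_y"]) ** 2)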
pd.DataFrame(_rmse).T[
    ["did", "sc", "sdid", "sdid_grid_search", "sdid_bayesian_opt"]
]


zeta with original


  0%|          | 0/32 [00:00<?, ?it/s]

zeta with grid_search
cv: zeta


100%|██████████| 32/32 [04:29<00:00,  8.42s/it]


zeta with bayesian optimaization
|   iter    |  target   |   zeta    |
-------------------------------------
|  1        | -3.65     |  15.22    |
|  2        | -3.069    |  9.173    |
|  3        | -2.965    |  8.046    |
|  4        | -3.159    |  10.08    |
|  5        | -2.517    |  2.539    |
|  6        | -1.814    |  0.01     |
|  7        | -1.814    |  0.01     |
|  8        | -1.814    |  0.01     |
|  9        | -1.814    |  0.01     |
|  10       | -1.814    |  0.01     |
|  11       | -1.814    |  0.01     |
|  12       | -1.814    |  0.01     |
|  13       | -1.814    |  0.01     |
|  14       | -1.814    |  0.01     |
|  15

| | did | sc | sdid | sdid_grid_search | sdid_bayesian_opt |
|---|---|---|---|---|---|
| 0 | 11.50845 | 4.779877 | 4.485329 | 1.23445 | 1.522452 |
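
One caveat when reading these numbers: the `_rmse` expression in the cell above takes the square root of a squared difference of means, which is the absolute gap between period averages. Yearly errors of opposite sign cancel in that measure but not in a per-year RMSE. The toy example below, with made-up values and hypothetical helper names, shows the difference.

```python
import numpy as np

def abs_mean_gap(actual, predicted):
    """Absolute difference between the period means (what the notebook computes)."""
    return float(abs(np.mean(predicted) - np.mean(actual)))

def rmse(actual, predicted):
    """Root mean squared error over the individual periods."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

actual = [1.0, 2.0, 3.0]
predicted = [2.0, 2.0, 2.0]  # too high in the first year, too low in the last

gap = abs_mean_gap(actual, predicted)
per_year = rmse(actual, predicted)
print(gap, per_year)
```

Here the mean gap is zero while the per-year RMSE is not, so the two measures can rank estimators differently.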


---
## (2) Significant relaxation of ADH conditions
- The AA-test results below show that the original Synthetic Difference-in-Differences performs better than SDID with Lasso, Ridge, or ElasticNet weights: the mean error across states is about 4.45 for `sdid`, against 5.70 (Ridge), 6.61 (Lasso), and 7.48 (ElasticNet).
- TODO: Retry with a non-negativity constraint added to the sparse regression.
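
One way the TODO could be approached is scikit-learn's `positive=True` option, which restricts coefficients to be non-negative. The sketch below uses synthetic data and is only an assumption about how such a variant might look; it is not wired into `SynthDID`.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV

# Synthetic pre-intervention data: rows are periods, columns are donor units.
rng = np.random.default_rng(2)
T_pre, n_donors = 19, 10
X_pre = rng.normal(size=(T_pre, n_donors))
true_omega = np.zeros(n_donors)
true_omega[:3] = [0.5, 0.3, 0.2]
y_pre = 1.0 + X_pre @ true_omega + rng.normal(scale=0.05, size=T_pre)

# positive=True keeps every coefficient at or above zero during fitting.
nonneg_models = {
    "Lasso": LassoCV(cv=3, positive=True),
    "ElasticNet": ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=3, positive=True),
}
nonneg_omega = {
    name: model.fit(X_pre, y_pre).coef_ for name, model in nonneg_models.items()
}
for name, w in nonneg_omega.items():
    print(name, np.round(w, 3), "sum =", round(float(w.sum()), 3))
```

This restores non-negativity only; the weights are still free to sum to something other than one.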

In [6]:
state_list = df.columns

result_rmse_list = []

for _state in tqdm(state_list):
    _sdid = SynthDID(df, PRE_TEREM2, POST_TEREM2, [_state])
    _sdid.fit(zeta_type="base", sparce_estimation=True)

    _outcome = pd.DataFrame({"actual_y": _sdid.target_y()})

    _outcome["did"] = _sdid.did_potentical_outcome()
    _outcome["sc"] = _sdid.sc_potentical_outcome()
    _outcome["sdid"] = _sdid.sdid_potentical_outcome()
    _outcome["sdid_ElasticNet"] = _sdid.sparceReg_potentical_outcome(model="ElasticNet")
    _outcome["sdid_Lasso"] = _sdid.sparceReg_potentical_outcome(model="Lasso")
    _outcome["sdid_Ridge"] = _sdid.sparceReg_potentical_outcome(model="Ridge")
    _outcome = _outcome.loc[POST_TEREM2[0] : POST_TEREM2[1]]

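    # Note: sqrt(x ** 2) is |x|, so this is the absolute gap between the post-period
    # mean of each estimate and that of the actual outcome, not a per-year RMSE.
    _rmse = np.sqrt((_outcome.mean() - _outcome.mean()["actual_y"]) ** 2)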
    _rmse = pd.DataFrame(_rmse).T[
        ["did", "sc", "sdid", "sdid_ElasticNet", "sdid_Lasso", "sdid_Ridge"]
    ]
    _rmse.index = [_state]

    result_rmse_list.append(_rmse)
    
result_rmse = pd.concat(result_rmse_list)

100%|██████████| 39/39 [01:56<00:00,  2.98s/it]


In [7]:
result_rmse.mean()

did                9.436689
sc                 6.396863
sdid               4.448636
sdid_ElasticNet    7.483973
sdid_Lasso         6.606194
sdid_Ridge         5.696426
dtype: float64