# Tutorial

This notebook guides you through a simple estimation task with skillmodels. We again use the second example model from the CHS replication files, so we can reuse the model specification file written in the previous section.

You will then learn how to use this model specification to simulate or estimate a model. Moreover, you will learn how you can use estimagic to impose additional constraints on your parameter vector. 

In [1]:
import mkl
mkl.set_num_threads(1)

4

Setting the number of MKL threads to 1 is actually faster than allowing multithreading on small datasets. Note that this has to be done before numpy is imported, which is why it comes first: the next cell imports numpy directly, and importing pandas or SkillModel would import it as well.
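If the `mkl` package is not available, a common alternative is to limit the thread count through the `MKL_NUM_THREADS` environment variable. This is only a sketch and not part of the original notebook; it assumes numpy is linked against MKL and has not yet been imported in the current process.

```python
import os

# Assumption: numpy (and anything that imports it) has not been loaded yet.
# MKL reads this variable when the library is first loaded.
os.environ["MKL_NUM_THREADS"] = "1"
```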

In [2]:
import json
import pandas as pd
import numpy as np
from skillmodels import SkillModel
import warnings
warnings.simplefilter("ignore")

First, load the model specification and prepare the dataset. In particular, we have to set the index of the dataset. Skillmodels will assume that the first index level identifies the individual and the second the period. The names of the levels are irrelevant. 

Also, it is good practice not to use floats in an index, since inexact float comparisons can make lookups unreliable. The cell below therefore casts both index variables to int.
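The index requirements can be illustrated on a toy DataFrame before touching the real data. This is a minimal sketch; the data and the name `toy` are made up for illustration.

```python
import pandas as pd

# Toy data whose identifier columns were (as often happens) read in as floats.
toy = pd.DataFrame(
    {
        "caseid": [1.0, 1.0, 2.0, 2.0],
        "period": [0.0, 1.0, 0.0, 1.0],
        "y1": [0.5, 0.7, 0.1, 0.3],
    }
)

# Cast the identifiers to int, then use them as a two-level index:
# first level identifies the individual, second level the period.
for var in ["caseid", "period"]:
    toy[var] = toy[var].astype(int)
toy = toy.set_index(["caseid", "period"])

# Collect the dtype of each index level for a quick sanity check.
index_dtypes = [
    toy.index.get_level_values(i).dtype for i in range(toy.index.nlevels)
]
```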

In [3]:
with open("test_model2.json") as j:
    model_dict = json.load(j)
    
data = pd.read_stata("chs_test_ex2.dta")
# set anchoring outcome to nan in all but last period. 
data.loc[data["period"] != 7, "Q1"] = np.nan
for var in ["caseid", "period"]:
    data[var] = data[var].astype(int)
data.set_index(["caseid", "period"], inplace=True)
data.head()

caseid,period,index,y1,y2,y3,y4,y5,y6,y7,y8,y9,Q1,dy7,dy8,dy9,x1,x2,id
1,0,0,1.909221,2.053261,1.679474,1.205891,2.195575,1.499965,0.873044,1.790903,1.191478,,1.0,1.0,1.0,0.473032,1.0,0.0
1,1,1,0.92599,1.828494,1.412966,1.235554,0.636243,1.534268,0.873044,1.790903,1.191478,,1.0,1.0,1.0,0.473032,1.0,0.0
1,2,2,1.95716,2.251265,-0.373637,2.875391,1.838843,2.253621,0.873044,1.790903,1.191478,,1.0,1.0,1.0,0.473032,1.0,0.0
1,3,3,1.236615,1.160494,3.001797,1.181049,2.17063,1.125383,0.873044,1.790903,1.191478,,1.0,1.0,1.0,0.473032,1.0,0.0
1,4,4,2.091614,0.664072,2.015088,1.193765,0.36427,0.161003,0.873044,1.790903,1.191478,,1.0,1.0,1.0,0.473032,1.0,0.0


Next, we create an instance of `SkillModel`:

In [4]:
mod = SkillModel(model_dict=model_dict, dataset=data)

If you want to estimate a model, you can often greatly reduce the number of function evaluations needed during the optimization if you take some time to write down good start values. If you want to simulate a model, you will also need a parameter vector. 

Since skillmodels builds on estimagic, the parameter vector is not a numpy array but a pandas DataFrame. This DataFrame has a quite complicated MultiIndex that would be difficult to write down manually. Therefore, skillmodels has a helper function for that:

In [5]:
free, fixed = mod.start_params_helpers()
len(free)

236

You will typically only need the free parameters. Even this very simple example model has 236 of them! Next, let's save them as a csv file, so you can add your start values in the value column, and take a brief look at the first few parameters.

In [6]:
free.to_csv("start_params_template.csv")
free.head()

category,period,name1,name2,value,lower,upper
delta,0,y1,constant,,-inf,inf
delta,0,y1,x1,,-inf,inf
delta,0,y2,constant,,-inf,inf
delta,0,y2,x1,,-inf,inf
delta,0,y3,constant,,-inf,inf


The parameters have a MultiIndex with four levels. The first is the broad category of the parameters, i.e. it distinguishes loadings, control parameters, transition parameters and so on. The second level is the period in which that parameter is used. The last two levels contain more information on the particular parameter. The four levels together should give you enough information to understand what each parameter means.
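To see how the four index levels can be used for selecting parameters, here is a minimal sketch on a hand-built toy DataFrame. The name `toy_params` and its four rows are invented for illustration; only the level and column names follow the output above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("delta", 0, "y1", "constant"),
        ("delta", 0, "y1", "x1"),
        ("delta", 1, "y1", "constant"),
        ("trans", 0, "fac1", "fac2"),
    ],
    names=["category", "period", "name1", "name2"],
)
toy_params = pd.DataFrame(
    {"value": np.nan, "lower": -np.inf, "upper": np.inf}, index=index
)

# Select a whole category via the first index level.
deltas = toy_params.loc["delta"]

# Combine several index levels in one query string.
constants_p0 = toy_params.query(
    "category == 'delta' & period == 0 & name2 == 'constant'"
)
```

Index level names can be used in `query` just like column names, which is what makes the constraint definitions further down so compact.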

Let's now assume you filled out the value column of the csv file with good start values and saved it in a file called ``start_params.csv``. Since this is an example model I can cheat and use very good start values, namely the ones from the CHS replication files. 
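The value column does not have to be filled in by hand; it can also be done in pandas. The following sketch uses a made-up three-row template and hypothetical start values (1.0 and 0.2), and writes to an in-memory buffer instead of a file so that it runs on its own.

```python
import io

import numpy as np
import pandas as pd

levels = ["category", "period", "name1", "name2"]
index = pd.MultiIndex.from_tuples(
    [
        ("delta", 0, "y1", "constant"),
        ("delta", 0, "y1", "x1"),
        ("trans", 0, "fac1", "fac2"),
    ],
    names=levels,
)
template = pd.DataFrame(
    {"value": np.nan, "lower": -np.inf, "upper": np.inf}, index=index
)

# Hypothetical start values, assigned per category via boolean masks.
category = template.index.get_level_values("category")
template.loc[category == "delta", "value"] = 1.0
template.loc[category == "trans", "value"] = 0.2

# Round trip through csv, reading it back the same way as in the next cell.
buffer = io.StringIO()
template.to_csv(buffer)
buffer.seek(0)
start = pd.read_csv(buffer).set_index(levels)
```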

In [7]:
start_params = pd.read_csv("start_params.csv").set_index(
    ["category", "period", "name1", "name2"])
start_params.head()

category,period,name1,name2,value,lower,upper,chs_value,good_start_value,bad_start_value
delta,0,y1,constant,1.005455,-inf,inf,1.005455,1.0,0.0
delta,0,y1,x1,1.001618,-inf,inf,1.001618,1.0,0.0
delta,0,y2,constant,0.975992,-inf,inf,0.975992,1.0,0.0
delta,0,y2,x1,1.031439,-inf,inf,1.031439,1.0,0.0
delta,0,y3,constant,0.994139,-inf,inf,0.994139,1.0,0.0


In [8]:
sim_observed, sim_latent = mod.simulate(params=start_params, nobs=1000)
sim_observed.head()

id,period,Q1_fac1,constant,x1,y1,y2,y3,y4,y5,y6,y7,y8,y9
0,0,1.587361,1.0,0.773489,2.322101,1.719571,0.884202,1.276429,3.060186,2.681505,1.059527,1.89335,1.843493
0,1,1.920651,1.0,0.773489,1.650439,2.325665,2.863287,2.9704,2.699402,0.745351,,,
0,2,0.229962,1.0,0.773489,0.662036,2.826819,2.99426,1.787768,2.017581,2.725608,,,
0,3,1.395806,1.0,0.773489,1.713353,0.164938,1.277607,2.90227,2.710094,1.829509,,,
0,4,2.560069,1.0,0.773489,1.763881,0.270295,0.292526,1.547372,2.850572,2.259368,,,


Of course, you would usually rather simulate a dataset at parameters you estimated and not at ones you invented. For this you have to estimate the model first. This is as easy as typing:

``mod.fit(start_params=start_params)``

but to get the same values as CHS we will have to do a little more work. The reason is that, on top of the many constraints skillmodels generates automatically from the model specification, CHS impose three more constraints:

1. The constant in the linear transition equation is fixed to 0.
2. The initial mean of the states is not estimated but assumed to be zero.
3. The anchoring parameters (intercepts, control variables, loadings and SDs of measurement error) are pairwise equal across periods.

Fortunately, estimagic makes it easy to express such constraints:

In [9]:
additional_constraints = [
    {"loc": ("trans", 0, "fac2", "constant"), "type": "fixed", "value": 0},
    {"loc": "initial_mean", "type": "fixed", "value": 0},
    {"queries": [f"period == {i} & name1 == 'Q1_fac1'" for i in range(7)], "type": "pairwise_equality"}
]

Isn't this amazingly simple? If you are not impressed, take a moment to think how you would implement the last constraint if your parameters were just a numpy array instead of a DataFrame! If this motivates you to learn more about constraints in estimagic, check out the [documentation](https://estimagic.readthedocs.io/en/master/optimization/constraints.html). 
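To make the last constraint more concrete, the sketch below shows what the queries select on a toy parameter DataFrame and what pairwise equality means, under the assumption that the k-th parameter selected by each query is tied to the k-th parameter of every other query. The category label `loading`, the three periods and all values are hypothetical, and the equality is imposed by hand here, not by estimagic.

```python
import pandas as pd

levels = ["category", "period", "name1", "name2"]
rows = []
for period in range(3):
    rows.append(("delta", period, "Q1_fac1", "constant"))
    rows.append(("loading", period, "Q1_fac1", "fac1"))  # hypothetical label
    rows.append(("delta", period, "y1", "constant"))  # not an anchoring parameter

toy = pd.DataFrame(
    {"value": [float(i) for i in range(len(rows))]},
    index=pd.MultiIndex.from_tuples(rows, names=levels),
).sort_index()

# One query per period, built exactly like in the constraint above.
queries = [f"period == {i} & name1 == 'Q1_fac1'" for i in range(3)]
selections = [toy.query(q) for q in queries]

# Pairwise equality: copy the values of the first selection, position by
# position, onto the matching rows of all other selections.
reference = selections[0]["value"].to_numpy()
for sel in selections[1:]:
    toy.loc[sel.index, "value"] = reference
```

With a plain numpy array you would have to keep track of the positions of all these parameters yourself; with the labeled index each selection is a one-line query.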

Next we can call the fit method of SkillModel. The call is commented out because it takes quite a long time to run.

In [10]:
# info, params = mod.fit(
#     start_params=start_params, 
#     algorithm="scipy_L-BFGS-B",
#     user_constraints=additional_constraints,
#     dashboard=True,
#     db_options={"rollover": 10000},
#     algo_options={"ftol": 1e-8},
#     logging="log.db"
# )

The arguments of fit are all passed through to estimagic's maximize function. You can find more information on them in the [estimagic documentation](https://estimagic.readthedocs.io/en/master/optimization/index.html).

In [12]:
df = start_params.reset_index(drop=False)
df

,category,period,name1,name2,value,lower,upper,chs_value,good_start_value,bad_start_value
0,delta,0,y1,constant,1.005455,-inf,inf,1.005455,1.0,0.00
1,delta,0,y1,x1,1.001618,-inf,inf,1.001618,1.0,0.00
2,delta,0,y2,constant,0.975992,-inf,inf,0.975992,1.0,0.00
3,delta,0,y2,x1,1.031439,-inf,inf,1.031439,1.0,0.00
4,delta,0,y3,constant,0.994139,-inf,inf,0.994139,1.0,0.00
...,...,...,...,...,...,...,...,...,...,...
231,trans,0,fac1,fac2,0.174038,-inf,inf,0.174038,0.2,0.25
232,trans,0,fac1,fac3,0.166174,-inf,inf,0.166174,0.1,0.25
233,trans,0,fac1,phi,-0.407018,-inf,inf,-0.407018,-0.4,-0.20
234,trans,0,fac2,fac2,0.608871,-inf,inf,0.608871,0.6,0.50


In [26]:
full_trend_vars = ["y2", "y3"]


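def get_anchoring_constraint_loc(params, no_control_vars):
    """Return the index of all non-constant delta parameters of the given variables.

    ``params`` must contain the four index levels as regular columns.
    """
    levels = ["category", "period", "name1", "name2"]
    df = params[levels]
    # Keep delta parameters of the listed variables, dropping the constants.
    df = df.query(f"category == 'delta' & name1 in {no_control_vars} & name2 != 'constant'")
    df = df.set_index(levels)
    return df.index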

In [27]:
get_anchoring_constraint_loc(df, full_trend_vars)

MultiIndex([('delta', 0, 'y2', 'x1'),
            ('delta', 0, 'y3', 'x1'),
            ('delta', 1, 'y2', 'x1'),
            ('delta', 1, 'y3', 'x1'),
            ('delta', 2, 'y2', 'x1'),
            ('delta', 2, 'y3', 'x1'),
            ('delta', 3, 'y2', 'x1'),
            ('delta', 3, 'y3', 'x1'),
            ('delta', 4, 'y2', 'x1'),
            ('delta', 4, 'y3', 'x1'),
            ('delta', 5, 'y2', 'x1'),
            ('delta', 5, 'y3', 'x1'),
            ('delta', 6, 'y2', 'x1'),
            ('delta', 6, 'y3', 'x1'),
            ('delta', 7, 'y2', 'x1'),
            ('delta', 7, 'y3', 'x1')],
           names=['category', 'period', 'name1', 'name2'])