# `likelihood_inference` Tutorial

Using simplified examples, we demonstrate how to use `likelihood_inference`. We also compare its output side by side with Stata results as a check.

## When to use our functionality

`likelihood_inference` performs statistical inference for models estimated by maximum likelihood. The function also accepts design options for survey data. For details, please consult our background page [here]().

## How to use it

### Necessary objects 

To use our functionality, we need to define a few objects first. 

1) First, define your **dataset** as a pandas dataframe object. 

2) Second, define your model. We define a `logit` and `probit` function for illustration. 

3) Third, construct a dictionary whose keys are the names of the model's arguments and whose values are the objects passed for those arguments. This will become clearer below. This dictionary is your `log_like_kwargs`.

4) Finally, create a **design options dictionary** whose acceptable keys are "psu", "strata", "weight", and "fpc". Each value is the dataset column that corresponds to that key. For example, if you wish to cluster by school, you would define the key-value pair as `{"psu": data["school"]}`. Because survey datasets differ in the design information they provide, the dictionary can be empty, contain only a weight, contain only the primary sampling unit (also referred to as a cluster), or hold any other combination of the four options. The dictionary is then converted into a **dataframe**. This step is **not optional**. If you do not wish to specify any design options, pass the function an empty pandas dataframe.
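
The construction of the design options can be sketched with a toy dataset. The column names (`school`, `sampling_weight`) and all values are hypothetical and chosen only for illustration; only the keys "psu", "strata", "weight", and "fpc" come from the description above.

```python
import pandas as pd

# Hypothetical toy survey data; column names and values are illustrative only.
data = pd.DataFrame(
    {
        "outcome": [1, 0, 1, 1],
        "income": [2.0, 3.5, 1.0, 4.0],
        "school": [10, 10, 20, 20],
        "sampling_weight": [1.5, 0.5, 1.0, 1.0],
    }
)

# Keys come from {"psu", "strata", "weight", "fpc"}; values are dataset columns.
design_dict = {"psu": data["school"], "weight": data["sampling_weight"]}

# Convert the dictionary into a dataframe before handing it to the function.
design_options = pd.DataFrame(design_dict)

# With no design information, an empty dataframe is passed instead.
no_design = pd.DataFrame()
```

Each row of `design_options` lines up with the corresponding row of `data`, since both share the same index.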


Below, we present an example of the necessary objects and three illustrations of the function in action, using the logit and probit functions we have programmed.

In [1]:
import os

import numpy as np
import pandas as pd
from patsy import dmatrices

from estimagic.inference.src.functions.mle_unconstrained import mle_processing
from estimagic.inference.src.functions.se_estimation import (
    choose_case,
    cluster_robust_se,
    clustering,
    inference_table,
    likelihood_inference,
    np_hess,
    np_jac,
    observed_information_matrix,
    outer_product_of_gradients,
    robust_se,
    strata_robust_se,
    stratification,
    variance_estimator,
)

In [6]:
def logit(params, y, x, design_options):
    """Pseudo-log-likelihood contribution per individual.

    Args:
        params (pd.DataFrame): The index consists of the parameter names,
            the "value" column contains the parameter values.
        y (np.array): 1d numpy array with the dependent variable
        x (np.array): 2d numpy array with the independent variables
        design_options (pd.DataFrame): dataframe containing psu, stratum,
            population/design weight and/or a finite population corrector (fpc)

    Returns:
        c (np.array): 1d numpy array with likelihood contribution per individual

    """
    q = 2 * y - 1
    # likelihood contribution
    c = np.log(1 / (1 + np.exp(-(q * np.dot(x, params["value"])))))
    if "weight" in design_options:
        return c * design_options["weight"].to_numpy()
    else:
        return c

In [9]:
# Read data
data = pd.read_csv("data.csv")

# Create logit keyword arguments
formula = "eco_friendly ~ ppltrst + male + income"
y, x = dmatrices(formula, data, return_type="dataframe")
y = y[y.columns[0]]
design_options = pd.DataFrame()

logit_kwargs = {"y": y, "x": x, "design_options": design_options}

`logit` takes `params`, `y`, `x`, and `design_options`. Above, `logit_kwargs` supplies the last three. The `params` argument passed to `likelihood_inference` below holds parameters that have already been estimated. Estimagic uses the `maximize` function to estimate parameters; refer to its documentation [here](https://estimagic.readthedocs.io/en/latest/optimization/interface.html). Note that `design_options` appears in our `log_like_kwargs` only because our `logit` function can apply a weight to each likelihood contribution. Weights affect parameter estimation, which is why they enter the likelihood function itself. If you specify only a cluster, strata, and/or finite population corrector and no weight, pass the design options to `likelihood_inference` only; they are unnecessary inside the likelihood function.
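
To see why weights belong inside the likelihood function, consider a minimal sketch with made-up numbers. The helper `logit_contributions` is hypothetical and written in plain NumPy; it is not the `logit` function of this tutorial. A weight of 2 on an observation contributes exactly as much to the pseudo-log-likelihood as including that observation twice, so the weights change the objective that the parameters maximize.

```python
import numpy as np


def logit_contributions(beta, y, x, weight=None):
    """Per-observation logit log-likelihood, optionally weighted."""
    q = 2 * y - 1
    c = -np.log1p(np.exp(-q * (x @ beta)))
    return c if weight is None else c * weight


# Made-up data and parameters, for illustration only.
x = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
beta = np.array([0.2, -0.3])

# Weight the first observation twice as heavily as the others.
weight = np.array([2.0, 1.0, 1.0])
weighted_total = logit_contributions(beta, y, x, weight).sum()

# Equivalent unweighted data set: the first observation appears twice.
x_dup = np.vstack([x[:1], x])
y_dup = np.concatenate([y[:1], y])
duplicated_total = logit_contributions(beta, y_dup, x_dup).sum()
```

Both totals agree up to floating-point error, which is the sense in which a weight rescales an observation's contribution.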

`likelihood_inference` then takes `logit` as `log_like_obs`, the estimated `params`, `logit_kwargs` as `log_like_kwargs`, and `design_options` as `design_options`. `cov_type` tells the function which variance estimator to use. We allow three options: (1) `observed_information_matrix`, selected with `"hessian"`; (2) `outer_product_of_gradients`, selected with `"jacobian"`; and (3) White's robust standard errors, selected with `"sandwich"`. The default is `"jacobian"`. Explanations and details on each of these estimators can be found in the background section or in the docstring below.
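
As a rough illustration of how the three `cov_type` options differ, here is a sketch of the textbook formulas for a logit model, using analytic derivatives and simulated data. The helper names (`logit_scores`, `logit_hessian`, `toy_covariance`) are hypothetical, the parameter vector is arbitrary rather than estimated, and the library's own implementation may differ, for example by using numerical derivatives or finite-sample corrections.

```python
import numpy as np


def logit_scores(beta, y, x):
    """Per-observation gradient of the logit log-likelihood, shape (n, k)."""
    p = 1 / (1 + np.exp(-(x @ beta)))
    return (y - p)[:, None] * x


def logit_hessian(beta, x):
    """Hessian of the summed logit log-likelihood, shape (k, k)."""
    p = 1 / (1 + np.exp(-(x @ beta)))
    return -(x * (p * (1 - p))[:, None]).T @ x


def toy_covariance(beta, y, x, cov_type="jacobian"):
    """Textbook covariance estimators; hypothetical helper, not library code."""
    scores = logit_scores(beta, y, x)
    opg = scores.T @ scores  # outer product of gradients
    info = -logit_hessian(beta, x)  # observed information matrix
    if cov_type == "jacobian":
        return np.linalg.inv(opg)
    if cov_type == "hessian":
        return np.linalg.inv(info)
    if cov_type == "sandwich":
        bread = np.linalg.inv(info)
        return bread @ opg @ bread
    raise ValueError(f"Unknown cov_type: {cov_type}")


# Simulated data and an arbitrary parameter vector, for illustration only.
rng = np.random.default_rng(0)
x = np.column_stack([np.ones(200), rng.normal(size=200)])
beta = np.array([0.5, -1.0])
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-(x @ beta)))).astype(float)

standard_errors = {
    name: np.sqrt(np.diag(toy_covariance(beta, y, x, name)))
    for name in ("jacobian", "hessian", "sandwich")
}
```

Under a correctly specified model the three estimators target the same quantity, so their standard errors are typically of similar size in large samples.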

In [10]:
def likelihood_inference(
    log_like_obs, params, log_like_kwargs, design_options, cov_type="jacobian"
):
    """Pseudolikelihood estimation and inference.

    Args:
        log_like_obs (func): The pseudo-log-likelihood function. It is the
            log-likelihood contribution per individual.
        params (pd.DataFrame): The index consists of the parameter names specified
            by the user, the "value" column contains the parameter values.
        log_like_kwargs (dict): In addition to the params argument directly
            taken by likelihood_inference function, additional keyword arguments for the
            likelihood function may include dependent variable, independent variables
            and design options.
            Example of simple logit model arguments:
                log_like_kwargs = {
                    "y": y,
                    "x": x,
                    "design_options": design_options
                }
        design_options (pd.DataFrame): dataframe containing psu, stratum,
            population/design weight and/or a finite population corrector (fpc)
        cov_type (str): One of ["jacobian", "hessian", "sandwich"]. "jacobian"
            (outer product of gradients) and "hessian" (observed information
            matrix) only work when *design_options* is empty. "jacobian" is
            the default.

    Returns:
        model_inference_table (pd.DataFrame):
            - "value": params that maximize likelihood
            - "standard_error": standard errors of the params
            - "ci_lower": lower bound of the 95% confidence interval, based on
              the critical value of a normal distribution
            - "ci_upper": upper bound of the 95% confidence interval, based on
              the critical value of a normal distribution
        params_cov (pd.DataFrame): Covariance matrix of estimated parameters.
            Index and columns are the same as params.index.

    Examples:

        >>> from estimagic.inference.sample_models import logit
        >>> cc = choose_case
        >>> params = pd.DataFrame(data=[0.5, 0.5], columns=["value"])
        >>> x = np.array([[1., 5.], [1., 6.]])
        >>> y = np.array([[1., 1]])
        >>> d_opt = pd.DataFrame()
        >>> logit_kwargs = {"y": y, "x": x, "design_options": d_opt}
        >>> se, var = cc(logit, params, logit_kwargs, d_opt, cov_type="jacobian")
        >>> se, var
        (array([212.37277788,  40.10565957]), array([[45102.19678307, -8486.9195158 ],
               [-8486.9195158 ,  1608.46392969]]))

        >>> inf_table, cov = inference_table(params, se, var, cov_type="jacobian")

    """
    log_like_se, log_like_var = choose_case(
        log_like_obs, params, log_like_kwargs, design_options, cov_type
    )
    model_inference_table, params_cov = inference_table(
        params, log_like_se, log_like_var, cov_type
    )
    return model_inference_table, params_cov

## Example 1
### Logit illustration with design options not specified

When design options are not specified, your model has access to three variance estimators: (1) the robust or "sandwich" estimator, (2) the observed information matrix, and (3) the outer product of gradients. These are explained in the background section.

In [11]:
# Define design_options and parameters dataframe
design_options = pd.DataFrame()
params = pd.DataFrame(
    data=[0.9659383, 0.0109796, -0.1890401, -0.0064468],
    index=["Intercept", "ppltrst", "male", "income"],
    columns=["value"],
)

# Running a logit model with design options not specified, robust
inf_table, params_cov = likelihood_inference(
    logit, params, logit_kwargs, design_options, cov_type="sandwich"
)

# Stata Results
stata_params_dict = {
    "value": [0.9659383, 0.0109796, -0.1890401, -0.0064468],
    "sandwich_standard_error": [0.0474801, 0.0067696, 0.0315615, 0.0059065],
    "ci_lower": [0.8728789, -0.0022886, -0.2508995, -0.0180233],
    "ci_upper": [1.058998, 0.0242478, -0.1271807, 0.0051297],
}
stata_params_df = pd.DataFrame(
    stata_params_dict, index=["Intercept", "ppltrst", "male", "income"]
)
stata_params_df, inf_table

(              value  sandwich_standard_error  ci_lower  ci_upper
 Intercept  0.965938                 0.047480  0.872879  1.058998
 ppltrst    0.010980                 0.006770 -0.002289  0.024248
 male      -0.189040                 0.031561 -0.250899 -0.127181
 income    -0.006447                 0.005907 -0.018023  0.005130,
               value  sandwich_standard_errors  ci_lower  ci_upper
 Intercept  0.965938                  0.047480  0.872877  1.058999
 ppltrst    0.010980                  0.006770 -0.002289  0.024248
 male      -0.189040                  0.031561 -0.250901 -0.127180
 income    -0.006447                  0.005906 -0.018024  0.005130)
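
The Stata interval bounds above can be reproduced from the reported values and standard errors. The sketch below assumes 95% intervals based on the normal distribution, as the `likelihood_inference` docstring describes; it uses only the Stata numbers shown in this example.

```python
import numpy as np
from scipy import stats

# Stata estimates and sandwich standard errors from Example 1.
value = np.array([0.9659383, 0.0109796, -0.1890401, -0.0064468])
se = np.array([0.0474801, 0.0067696, 0.0315615, 0.0059065])

# Two-sided 95% critical value of the standard normal distribution.
z = stats.norm.ppf(0.975)

ci_lower = value - z * se
ci_upper = value + z * se
```

The resulting bounds agree with the Stata `ci_lower` and `ci_upper` columns up to rounding of the inputs.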

## Example 2
### Logit illustration with primary sampling units or "clusters"

Suppose your data has primary sampling units (psu) or "clusters". You may specify the cluster variable in the `inference_design_options`. Again, we take the estimated parameters as given. 

In [14]:
# Define design_options and parameters dataframe
inference_design_options = pd.DataFrame({"psu": data["psu"]})
params = pd.DataFrame(
    data=[0.9659383, 0.0109796, -0.1890401, -0.0064468],
    index=["Intercept", "ppltrst", "male", "income"],
    columns=["value"],
)

# Running a logit model with psu (clusters) specified, robust
inf_table, params_cov = likelihood_inference(
    logit, params, logit_kwargs, inference_design_options, cov_type="sandwich"
)

# Stata Results
stata_params_dict = {
    "value": [0.9659383, 0.0109796, -0.1890401, -0.0064468],
    "sandwich_standard_error": [0.0504775, 0.0071368, 0.0318001, 0.0064663],
    "ci_lower": [0.8669933, -0.0030098, -0.2513741, -0.0191218],
    "ci_upper": [1.064883, 0.024969, -0.1267061, 0.0062283],
}
stata_params_df = pd.DataFrame(
    stata_params_dict, index=["Intercept", "ppltrst", "male", "income"]
)
stata_params_df, inf_table

(              value  sandwich_standard_error  ci_lower  ci_upper
 Intercept  0.965938                 0.050478  0.866993  1.064883
 ppltrst    0.010980                 0.007137 -0.003010  0.024969
 male      -0.189040                 0.031800 -0.251374 -0.126706
 income    -0.006447                 0.006466 -0.019122  0.006228,
               value  sandwich_standard_errors  ci_lower  ci_upper
 Intercept  0.965938                  0.050478  0.867002  1.064874
 ppltrst    0.010980                  0.007137 -0.003009  0.024968
 male      -0.189040                  0.031800 -0.251368 -0.126712
 income    -0.006447                  0.006466 -0.019121  0.006227)

Compared to running the model without any design specifications, the standard errors have increased. This is expected: observations are no longer assumed to be independent of each other, only independent between *clusters*. The increase is small here simply because there are 11,015 clusters for 19,751 observations. As the number of clusters approaches the number of observations, the clustered standard errors approach those for independent observations. Likewise, we can expect a larger increase if fewer clusters are defined.
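
The limiting behavior described above can be sketched with the "meat" of a cluster-robust sandwich estimator: scores are first summed within each cluster, and the outer product is taken over the cluster sums. The helper `clustered_meat` is hypothetical, the scores are random numbers rather than real likelihood scores, and finite-sample corrections are omitted, so this is an illustration of the idea and not the library's implementation.

```python
import numpy as np


def clustered_meat(scores, cluster_ids):
    """Sum scores within clusters, then take the outer product of the sums."""
    _, codes = np.unique(cluster_ids, return_inverse=True)
    sums = np.zeros((codes.max() + 1, scores.shape[1]))
    np.add.at(sums, codes, scores)  # accumulate rows by cluster code
    return sums.T @ sums


# Random stand-in for per-observation scores (6 observations, 2 parameters).
rng = np.random.default_rng(0)
scores = rng.normal(size=(6, 2))

# Every observation in its own cluster reproduces the unclustered meat.
own_cluster = clustered_meat(scores, np.arange(6))
unclustered = scores.T @ scores

# Three clusters of two observations each add within-cluster cross terms.
grouped = clustered_meat(scores, np.array([0, 0, 1, 1, 2, 2]))
```

With one observation per cluster the two matrices coincide, which mirrors the statement that clustered standard errors approach the independent-observation ones as the number of clusters approaches the sample size.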

## Example 3
### Probit illustration with psu and strata

For the following illustration, we specify the psu and the strata. If a stratum has just one cluster, we use the "grand mean" method. More on this in the background. Finally, when clusters or strata are defined, only the robust or "sandwich" estimator is available.

In [3]:
from scipy import stats


def probit(params, y, x, design_options):
    """Refer to logit docstring for details!"""
    q = 2 * y - 1
    # likelihood contribution: log of the normal CDF at the signed linear index
    c = np.log(stats.norm.cdf(q * np.dot(x, params["value"])))
    if "weight" in design_options.columns:
        return c * design_options["weight"].to_numpy()
    else:
        return c

In [4]:
# Define design_options and parameters dataframe
inference_design_options = pd.DataFrame({"psu": data["psu"], "strata": data["stratum"]})
probit_design_options = pd.DataFrame()
params = pd.DataFrame(
    data=[0.595919, 0.0065084, -0.1136318, -0.0038559],
    index=["Intercept", "ppltrst", "male", "income"],
    columns=["value"],
)

# Defining probit keyword arguments.
probit_kwargs = {"y": y, "x": x, "design_options": probit_design_options}
inf_table, params_cov = likelihood_inference(
    probit, params, probit_kwargs, inference_design_options, cov_type="sandwich"
)

# For probit with psu, strata, robust
stata_params_dict = {
    "value": [0.595919, 0.0065084, -0.1136318, -0.0038559],
    "sandwich_standard_error": [0.029567, 0.0042209, 0.0189078, 0.0038124],
    "ci_lower": [0.5379617, -0.0017655, -0.1506948, -0.0113289],
    "ci_upper": [0.6538763, 0.0147822, -0.0765688, 0.0036172],
}
stata_params_df = pd.DataFrame(
    stata_params_dict, index=["Intercept", "ppltrst", "male", "income"]
)
stata_params_df, inf_table

(              value  sandwich_standard_error  ci_lower  ci_upper
 Intercept  0.595919                 0.029567  0.537962  0.653876
 ppltrst    0.006508                 0.004221 -0.001765  0.014782
 male      -0.113632                 0.018908 -0.150695 -0.076569
 income    -0.003856                 0.003812 -0.011329  0.003617,
               value  sandwich_standard_errors  ci_lower  ci_upper
 Intercept  0.595919                  0.029589  0.537924  0.653914
 ppltrst    0.006508                  0.004228 -0.001779  0.014795
 male      -0.113632                  0.018975 -0.150824 -0.076440
 income    -0.003856                  0.003819 -0.011342  0.003630)