In [1]:
%load_ext nb_black

In [2]:
import numpy as np
import pandas as pd
from skillmodels.config import TEST_DIR
import yaml
from skillmodels.visualize_factor_distributions import (
    univariate_densities,
    bivariate_density_contours,
    bivariate_density_surfaces,
    combine_distribution_plots,
)
from skillmodels.simulate_data import simulate_dataset
from skillmodels.filtered_states import get_filtered_states
from skillmodels.likelihood_function import get_maximization_inputs

# How to visualize the distribution of latent factors

We show how to create kernel density plots for pairs of latent factors in two or three dimensions. As an illustration, we use the same example as in the [introductory tutorial](../getting_started/tutorial.ipynb). For more details on how to obtain the filtered states, see that tutorial.

There are two kinds of data that can be visualized with the functions described below:
1. Filtered states, i.e. the estimates of the latent factors in an empirical dataset
2. Simulated states, i.e. a synthetic dataset of latent factors that is generated for a parametrized model. 

Below, we show how to get both kinds of datasets, how to visualize the distribution of latent factors given one dataset and how to visualize the difference in distributions between two datasets.
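To make concrete what a kernel density plot displays, here is a minimal sketch of a univariate Gaussian kernel density estimate evaluated on an evenly spaced grid. It is an illustration only: the helper `gaussian_kde_on_grid`, the rule-of-thumb bandwidth and the synthetic data are assumptions made for this example and are not necessarily what skillmodels or its plotting backend uses.

```python
import numpy as np


def gaussian_kde_on_grid(samples, n_points=50, bandwidth=None):
    """Evaluate a univariate Gaussian kernel density estimate on an even grid."""
    samples = np.asarray(samples, dtype=float)
    n_obs = samples.size
    if bandwidth is None:
        # Silverman-style rule of thumb; an illustrative default only.
        bandwidth = 1.06 * samples.std(ddof=1) * n_obs ** (-1 / 5)
    # Pad the grid so that the tails of the density are not cut off.
    grid = np.linspace(
        samples.min() - 3 * bandwidth, samples.max() + 3 * bandwidth, n_points
    )
    # Standardized distance between every grid point and every observation.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the kernels centered on the observations.
    density = kernel.sum(axis=1) / (n_obs * bandwidth)
    return grid, density


rng = np.random.default_rng(seed=0)
# Synthetic stand-in for the values of one latent factor in one period.
fake_factor = rng.normal(loc=0.0, scale=1.0, size=500)
grid, density = gaussian_kde_on_grid(fake_factor)
```

A finer grid gives a smoother curve at the cost of more kernel evaluations, which is the runtime and quality trade-off that a points-per-dimension setting controls.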

## Getting filtered states 

In [3]:
with open(TEST_DIR / "model2.yaml") as y:
    model_dict = yaml.load(y, Loader=yaml.FullLoader)
params = pd.read_csv(TEST_DIR / "regression_vault" / "one_stage_anchoring.csv")
params = params.set_index(["category", "period", "name1", "name2"])

data = pd.read_stata(TEST_DIR / "model2_simulated_data.dta")
data.set_index(["caseid", "period"], inplace=True)

## Plotting one dataset of states

In [4]:
kde_plots = univariate_densities(
    model_dict=model_dict, data=data, params=params, period=1
)
contour_plots = bivariate_density_contours(
    model_dict=model_dict, data=data, params=params, period=1
)



In [5]:
surface_plots = bivariate_density_surfaces(
    model_dict=model_dict, data=data, params=params, period=1
)

In [6]:
fig = combine_distribution_plots(
    kde_plots=kde_plots,
    contour_plots=contour_plots,
    surface_plots=surface_plots,
)

In [7]:
fig.show()

## Optional arguments of the plotting function

- You can omit the 3D plots in the upper triangle by leaving out `add_3d_plots=True`.
- You can adjust the trade-off between runtime and plot quality by changing `n_points`, the number of points per dimension. The default is 50.
- You can return the individual plots instead of a grid by setting `combine_plots_in_grid=False`. In that case, the function returns a dictionary of figures that you can save for later use.
- You can manually tweak the ranges over which the distributions are plotted by specifying `state_ranges`, a dictionary whose keys are the names of the latent factors and whose values are DataFrames with the columns "period", "minimum" and "maximum". The `state_ranges` define the axis limits of the plots.
- `lower_kde_kws` (dict): Keyword arguments for `seaborn.kdeplot`, used to generate the plots in the lower triangle of the grid, i.e. the two-dimensional KDE plot for each factor pair.
- `diag_kde_kws` (dict): Keyword arguments for `seaborn.kdeplot`, used to generate the plots on the diagonal of the grid, i.e. the one-dimensional KDE plot for each factor.
- `surface_kws` (dict): Keyword arguments for `Axes.plot_surface`, used to generate the plots in the upper triangle of the grid, i.e. the surface plot of the kernel density estimates for each factor pair.
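The `state_ranges` structure described in the list above can be sketched as follows. This is an illustration under stated assumptions: the helper `build_state_ranges` is hypothetical (it is not part of skillmodels), and the small `states` DataFrame with columns "period", "fac1" and "fac2" is a made-up stand-in for a dataset of latent factors.

```python
import pandas as pd

# Hypothetical stand-in for a long-format states DataFrame; the column names
# "period", "fac1" and "fac2" are assumptions made for this illustration.
states = pd.DataFrame(
    {
        "period": [0, 0, 0, 1, 1, 1],
        "fac1": [-1.0, 0.2, 1.5, -0.5, 0.8, 2.0],
        "fac2": [0.1, 0.4, 0.9, 0.3, 0.7, 1.2],
    }
)


def build_state_ranges(states, factors):
    """Return {factor: DataFrame with columns period, minimum, maximum}."""
    state_ranges = {}
    for factor in factors:
        # One row per period with the smallest and largest observed value.
        ranges = (
            states.groupby("period")[factor]
            .agg(minimum="min", maximum="max")
            .reset_index()
        )
        state_ranges[factor] = ranges
    return state_ranges


state_ranges = build_state_ranges(states, factors=["fac1", "fac2"])
```

Using the observed minimum and maximum per period is only one possible choice. Wider, hand-picked limits may be preferable when several datasets should share the same axes.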

## Getting simulated datasets (with and without policy)

One of the main applications of skill formation models is to simulate the effect of counterfactual policies. To visualize the effect of a policy on factor distributions, we first need to simulate a dataset in which a policy has been active.

In [8]:
sim_states = simulate_dataset(
    model_dict=model_dict,
    params=params,
    data=data,
)["anchored_states"]["states"]

In [9]:
policies = [
    {"period": 1, "factor": "fac1", "effect_size": 3.5, "standard_deviation": 0.0},
    {"period": 1, "factor": "fac2", "effect_size": 3.5, "standard_deviation": 0.0},
]

In [10]:
sim_states_policy = simulate_dataset(
    model_dict=model_dict,
    params=params,
    data=data,
    policies=policies,
)["anchored_states"]["states"]

## Plotting differences in distributions

In [11]:
# Map a label for each dataset to its simulated states.
states = {"baseline": sim_states, "subsidy": sim_states_policy}
kde_plots = univariate_densities(
    model_dict=model_dict, states=states, data=data, params=params, period=1
)
contour_plots = bivariate_density_contours(
    model_dict=model_dict, states=states, data=data, params=params, period=1
)



In [12]:
fig = combine_distribution_plots(kde_plots, contour_plots, None, showlegend=True)

In [13]:
fig.show()

All the optional arguments stay the same. The only difference is that 3D plots do not work for several datasets.
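As a numerical complement to the overlaid plots, the difference between two datasets can also be summarized by comparing factor means. The sketch below uses synthetic stand-ins rather than the actual `sim_states` and `sim_states_policy`. The column names "fac1" and "fac2" and the constant shift of 3.5 are assumptions made for illustration and do not claim to reproduce how skillmodels applies a policy.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
n_obs = 1000

# Hypothetical stand-ins for two state datasets in a single period; the
# column names are assumptions made for this illustration.
baseline = pd.DataFrame(
    {"fac1": rng.normal(0.0, 1.0, n_obs), "fac2": rng.normal(0.0, 1.0, n_obs)}
)
# An assumed deterministic shift, loosely mirroring a policy with
# effect_size=3.5 and standard_deviation=0.0.
subsidy = baseline + 3.5

# One row per factor, one column per dataset, plus the difference in means.
summary = pd.concat({"baseline": baseline.mean(), "subsidy": subsidy.mean()}, axis=1)
summary["difference"] = summary["subsidy"] - summary["baseline"]
```

A table like `summary` shows only the shift in location. The density plots remain the better tool for spotting changes in shape or spread.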

## Plotting with observed factors

In [16]:
model_dict["observed_factors"] = ["obs1"]

In [17]:
data["obs1"] = np.random.rand(data.shape[0])

In [18]:
params = get_maximization_inputs(model_dict=model_dict, data=data)["params_template"]
params["value"] = 0.1

In [22]:
kde_plots = univariate_densities(
    model_dict=model_dict, data=data, params=params, period=1, observed_factors=True
)
contour_plots = bivariate_density_contours(
    model_dict=model_dict, data=data, params=params, period=1, observed_factors=True
)

In [24]:
combine_distribution_plots(
    kde_plots=kde_plots,
    contour_plots=contour_plots,
    factor_order=["obs1", "fac1", "fac2"],
)
