# Testing `gilbert_elliot_model.determine_model`

This document builds on `distribution_analysis.ipynb` and focuses on the behavior of the package function `determine_model`.



In [None]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px

import gilbert_elliot_model as ge

from plotly.subplots import make_subplots

## Methodology
For these tests a true model was selected: two-parameter, three-parameter, or four-parameter.
For a given model, the following procedure was repeated 300 times:
* Generate error pattern of length `n_obs = 1000`, call error pattern `errors`
* Run `ge.determine_model(errors)`
* Save the true model, the estimated model, and the associated most likely parameters for all possible models that are output from `ge.determine_model(errors)`.

This was performed for each of the possible models.
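The procedure above can be sketched as follows. This is a minimal, self-contained illustration and not the code used to produce the results: `simulate_gilbert_elliot`, `run_trials`, and `placeholder_identify` are hypothetical names, and the simulator assumes the common Gilbert-Elliot convention in which `p` is the Good-to-Bad transition probability, `r` is the Bad-to-Good transition probability, and `k` and `h` are the probabilities of no error in the Good and Bad states. In the actual study the identification step is `ge.determine_model(errors)`.

```python
import random


def simulate_gilbert_elliot(n_obs, p, r, k=1.0, h=0.0, seed=None):
    """Simulate an error pattern (1 = error, 0 = no error).

    Assumed convention (not confirmed against the package):
    p = P(Good -> Bad), r = P(Bad -> Good),
    k = P(no error | Good), h = P(no error | Bad).
    """
    rng = random.Random(seed)
    # Draw the initial state from the stationary distribution, P(Bad) = p / (p + r).
    in_bad = rng.random() < p / (p + r)
    errors = []
    for _ in range(n_obs):
        p_no_error = h if in_bad else k
        errors.append(0 if rng.random() < p_no_error else 1)
        # Move to the next state of the two-state Markov chain.
        if in_bad:
            in_bad = rng.random() >= r
        else:
            in_bad = rng.random() < p
    return errors


def run_trials(n_trials, n_obs, true_params, identify):
    """Repeat the simulate-then-identify step and record each outcome."""
    records = []
    for trial in range(n_trials):
        errors = simulate_gilbert_elliot(n_obs, seed=trial, **true_params)
        records.append({'trial': trial, 'result': identify(errors)})
    return records


def placeholder_identify(errors):
    """Hypothetical stand-in for the identification step; returns the observed error rate."""
    return sum(errors) / len(errors)


# Two-parameter example (k = 1, h = 0), with far fewer trials than the 300 used in the study.
records = run_trials(n_trials=5, n_obs=1000,
                     true_params={'p': 0.05, 'r': 0.5},
                     identify=placeholder_identify)
print(records)
```

Seeding each trial with its index keeps the sketch reproducible; the original study does not state how its random patterns were seeded.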

Note that `n_obs = 1000` is very short for most practical applications.
Identification success would likely improve if `n_obs` were much larger, but that analysis would have taken longer to run and was not deemed necessary for this document.

## Conclusions
These conclusions are drawn from results shown in the [Analysis](#Analysis) section.

This analysis shows that the function `gilbert_elliot_model.determine_model()` tends to over-select the four-parameter model. However, it also shows that the relevant error statistics (error rate and expected burst length) are preserved regardless. It is best practice to use expert judgement when determining which model to use, but this function can still be a worthwhile tool for comparing model options as needed.
In particular, this analysis suggests that without a specific reason for using a more complicated model, a simpler one can work very well.
It is also important to note that if the error patterns had been longer, the correct model would likely have been identified more often.
This is because a longer pattern carries more information about the model parameters, which makes them easier to estimate.

## Analysis
The following sections show a series of plots for each model type. 
A trial is a success if, given an observation pattern `errors` generated by model X, model X is identified as the model most likely to have generated `errors`.
If any other model is identified as more likely, the trial is labeled a failure.
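As a small illustration of this definition, the sketch below computes an identification rate and the counts behind the histogram from a list of (true model, estimated model) pairs. The function name `identification_summary` and the toy data are hypothetical; the notebook itself computes the same quantity from the `model_type` and `model_est` columns of the loaded results.

```python
from collections import Counter


def identification_summary(trials):
    """Return the success rate (in percent) and counts of each estimated model.

    `trials` is a list of (true_model, estimated_model) pairs.
    """
    counts = Counter(estimated for _, estimated in trials)
    n_success = sum(1 for true, estimated in trials if true == estimated)
    rate = 100 * n_success / len(trials)
    return rate, counts


# Toy data: four trials generated by a two-parameter model.
trials = [('two', 'two'), ('two', 'four'), ('two', 'two'), ('two', 'three')]
rate, counts = identification_summary(trials)
print(f'identification rate: {rate:.1f}%')
print(dict(counts))
```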

The first plot is a histogram of which model type was identified as most likely, with the success rate given in the title.
The second plot shows the estimated model error rate vs the true model error rate.
In particular, suppose the parameters $(p, r, k, h)$ were used to generate an observation, `errors`, and $(\hat{p}, \hat{r}, \hat{k}, \hat{h})$ were returned as the parameter estimates associated with the most-likely model. We then compare the true model error rate $\bar{x}(p, r, k, h)$ with the estimated model error rate $\hat{\bar{x}}(\hat{p}, \hat{r}, \hat{k}, \hat{h})$.
The final plot does the same comparison for the true model burst length, $L_1(p, r, k, h)$, and the estimated model burst length, $\hat{L}_1(\hat{p}, \hat{r}, \hat{k}, \hat{h})$.
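To make the two compared statistics concrete, the sketch below evaluates them directly from model parameters. It is an illustration under stated assumptions, not the package implementation: the notebook obtains these values from `ge.model_error_statistics`, while the function names here are hypothetical. The sketch assumes the common convention in which $p$ and $r$ are the Good-to-Bad and Bad-to-Good transition probabilities and $k$ and $h$ are the probabilities of no error in the Good and Bad states. The burst length formula is given only for the case $k = 1$, where every error occurs in the Bad state.

```python
def stationary_error_rate(p, r, k=1.0, h=0.0):
    """Long-run error rate under the assumed parameter convention."""
    pi_good = r / (p + r)  # stationary probability of the Good state
    pi_bad = p / (p + r)   # stationary probability of the Bad state
    return (1 - k) * pi_good + (1 - h) * pi_bad


def gilbert_expected_burst_length(r, h=0.0):
    """Mean run of consecutive errors when k = 1 (no errors in the Good state).

    After an error the chain is in the Bad state, so the run continues only if
    the chain stays in Bad (probability 1 - r) and emits an error (probability 1 - h).
    The run length is therefore geometric with that continuation probability.
    """
    p_continue = (1 - r) * (1 - h)
    return 1 / (1 - p_continue)


# Two-parameter example: errors occur exactly when the chain is in the Bad state.
print(stationary_error_rate(p=0.05, r=0.5))
print(gilbert_expected_burst_length(r=0.5))
```

For a two-parameter model these reduce to $p / (p + r)$ and $1 / r$, which is a quick sanity check on any estimated parameters.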

The documentation for the function `gilbert_elliot_model.determine_model` is shown below.

In [None]:
help(ge.determine_model)

In [None]:
def update_df_error_stats(df, expected_stats=None, stats=None):
    """Add estimated, true, and difference columns for each error statistic.

    `expected_stats` is currently unused.
    """
    est_flag = '_estimate'
    target_flag = '_target'
    if stats is None:
        stats = ['error_rate', 'expected_burst_length']
    for out_stat in stats:
        # Statistic implied by the estimated parameters
        df[out_stat] = df.apply(
            lambda row: ge.model_error_statistics(p=row['p' + est_flag],
                                                  r=row['r' + est_flag],
                                                  k=row['k' + est_flag],
                                                  h=row['h' + est_flag])[out_stat],
            axis=1)
        # Statistic implied by the true (target) parameters
        expected_out_stat = 'expected_' + out_stat
        df[expected_out_stat] = df.apply(
            lambda row: ge.model_error_statistics(p=row['p' + target_flag],
                                                  r=row['r' + target_flag],
                                                  k=row['k' + target_flag],
                                                  h=row['h' + target_flag])[out_stat],
            axis=1)
        # Difference: true value minus estimated value
        diff_out_stat = out_stat + '_error'
        df[diff_out_stat] = df[expected_out_stat] - df[out_stat]
    return df

In [None]:
# Load in data
models = ['two', 'three', 'four']
template_string = 'determine_model_XXX_param.csv'

model_dfs = {}
for model in models:
    fname = template_string.replace('XXX', model)
    model_dfs[model] = pd.read_csv(fname)


In [None]:
# for model_name, df in model_dfs.items():
#     print(f"{model_name} accuracy: {np.mean(df['model_est'] == df['model_type'])}")

def model_analysis(df):
    # NOTE: the histogram title uses the notebook-level variable `model_name`,
    # which must be set before calling this function.
    figures = []

    # A trial succeeds when the estimated model matches the true model
    success_ix = df['model_est'] == df['model_type']
    id_rate = 100 * np.mean(success_ix)

    # Plot 1: histogram of which model was identified as most likely
    fig = px.histogram(df, x='model_est',
                       title=f'{model_name}-parameter identification rate: {id_rate:.1f}%'
                       )
    figures.append(fig)

    # Collect the true parameters and the estimates from the selected model
    params = ['p', 'r', 'k', 'h']
    new_rows = []
    for _, row in df.iterrows():
        new_row = {}
        new_row['model_target'] = row['model_type']
        new_row['model_estimate'] = row['model_est']
        new_row['success'] = row['model_type'] == row['model_est']
        for pr in params:
            # Columns are named '<parameter>_<model>', e.g. pr + '_' + model
            new_row[pr + '_target'] = row[pr + '_' + row['model_type']]
            new_row[pr + '_estimate'] = row[pr + '_' + row['model_est']]
        new_rows.append(new_row)
    new_df = pd.DataFrame(new_rows)
    new_df = update_df_error_stats(new_df)

    # Plots 2 and 3: estimated statistic (x) vs true statistic (y)
    stats = ['error_rate', 'expected_burst_length']
    for stat in stats:
        figi = px.scatter(new_df,
                          x=stat,
                          y='expected_' + stat,
                          color='success',
                          title=stat + ' expectation vs estimation',
                          )
        figures.append(figi)

    for fig in figures:
        fig.show(renderer='notebook')


### Two-parameter model analysis

In [None]:
model_name = 'two'
model_analysis(model_dfs[model_name])

### Three-parameter model analysis

In [None]:
model_name = 'three'
model_analysis(model_dfs[model_name])

### Four-parameter model analysis

In [None]:
model_name = 'four'
model_analysis(model_dfs[model_name])