# Lab 6 - Model comparison

In this exercise we will learn how to compare models using metrics that estimate out-of-sample predictive performance.

The main idea is to estimate how a model would behave on new data without splitting the dataset into training and test sets. This is done by evaluating the pointwise log-likelihood: an array holding the logarithm of the likelihood for each data point individually, computed for every posterior draw.

We will use this information with two metrics:

- Watanabe-Akaike Information Criterion (also known as Widely Applicable Information Criterion, WAIC), which is built from the pointwise log-likelihood averaged over posterior draws and an estimate of the effective number of parameters in the model
- PSIS-LOOCV - Pareto Smoothed Importance Sampling Leave-one-out Cross Validation. It estimates the value that leave-one-out cross-validation would give by using a modified importance sampling method, instead of running inference N times, where N is the number of data points, leaving one point out each time.

For this exercise the code is provided in the form of screenshots.
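
To make the two metrics concrete, here is a minimal sketch (not part of the lab code) that computes WAIC and a plain importance-sampling LOO estimate by hand from a log-likelihood matrix. The toy data, the analytic posterior for `mu`, and the helper names `log_mean_exp`, `waic` and `is_loo` are assumptions made for illustration. PSIS additionally smooths the largest importance weights, which this sketch leaves out.

```python
import numpy as np

def log_mean_exp(a, axis=0):
    """Numerically stable log(mean(exp(a))) along `axis`."""
    a_max = np.max(a, axis=axis, keepdims=True)
    out = np.log(np.mean(np.exp(a - a_max), axis=axis, keepdims=True)) + a_max
    return np.squeeze(out, axis=axis)

def waic(log_lik):
    """Return (elpd_waic, p_waic) for a log-likelihood matrix of shape (draws, points)."""
    lppd = log_mean_exp(log_lik, axis=0)       # log of the posterior-mean likelihood, per point
    p_waic = np.var(log_lik, axis=0, ddof=1)   # posterior variance of the log-likelihood, per point
    return np.sum(lppd - p_waic), np.sum(p_waic)

def is_loo(log_lik):
    """elpd from plain importance-sampling LOO (no Pareto smoothing)."""
    # p(y_i | y_-i) is approximated by the harmonic mean of p(y_i | theta_s) over draws s.
    return np.sum(-log_mean_exp(-log_lik, axis=0))

# Toy example (assumed): y ~ Normal(mu, 1) with a flat prior on mu,
# so the posterior of mu is Normal(mean(y), sd = 1/sqrt(N)).
rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=50)
mu_draws = rng.normal(y.mean(), 1.0 / np.sqrt(y.size), size=2000)

# log_lik[s, i] = log Normal(y[i] | mu_draws[s], 1)
log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu_draws[:, None]) ** 2

elpd_waic, p_waic = waic(log_lik)
elpd_loo = is_loo(log_lik)
print(f"elpd_waic={elpd_waic:.2f}  p_waic={p_waic:.2f}  elpd_is_loo={elpd_loo:.2f}")
```

With a single free parameter, `p_waic` should land near 1, and the two elpd estimates should nearly agree on a well-behaved model like this one.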

In [35]:
from cmdstanpy import CmdStanModel
import arviz as az
import numpy as np
import scipy.stats as stats
from scipy.stats import norm
import matplotlib.pyplot as plt
import pandas as pd

## Exercise 1 - generate data

1. Compile code_1.stan and code_2.stan
2. Generate data for the rest of the exercises.

In [36]:
F = 6           # F - number of letters in first name 
L = 6           # L - number of letters in last name 
N = (L+F)*100   # N = (L+F)*100

gen_quant = CmdStanModel(stan_file='code_1.stan')

samples = gen_quant.sample(data={'N': N}, 
                           fixed_param=True,
                           iter_sampling=1000, 
                           iter_warmup=0, 
                           chains = 1)

# Creation of pandas dataframe from resulting draws
df = samples.draws_pd()
display(df)


INFO:cmdstanpy:found newer exe file, not recompiling
INFO:cmdstanpy:CmdStan start processing
chain 1 |██████████| 00:00 Sampling completed
INFO:cmdstanpy:CmdStan done processing.

Unnamed: 0,lp__,accept_stat__,theta,y[1],y[2],y[3],y[4],y[5],y[6],y[7],...,y[1191],y[1192],y[1193],y[1194],y[1195],y[1196],y[1197],y[1198],y[1199],y[1200]
0,0.0,0.0,0.01,0.147432,0.337281,0.694644,1.124390,0.560981,-1.041810,0.869185,...,-0.858699,-0.928416,-0.583475,0.702746,-0.730756,1.833050,-1.715640,-0.108593,1.399110,-0.965111
1,0.0,0.0,0.01,0.360545,-0.091260,-1.062770,-1.276530,0.935135,-0.818048,-0.694577,...,0.030749,0.166271,0.185900,-0.670137,0.184587,0.645447,-1.100140,-0.415726,-0.623065,-0.160449
2,0.0,0.0,0.01,0.138783,0.760010,-0.650996,15.746500,0.300205,-0.012572,0.530823,...,-0.071499,-0.485011,0.414945,2.197030,-0.516201,-0.209986,1.246670,0.111284,-0.966353,0.969415
3,0.0,0.0,0.01,-0.774939,2.169030,-0.026868,2.488470,-0.790456,-1.069210,0.129276,...,0.884310,1.605740,-2.060960,-0.591254,0.674302,0.984685,0.162272,0.698009,-1.717430,-0.313426
4,0.0,0.0,0.01,-0.784996,-0.908145,0.723346,0.525329,0.213700,-0.122160,-0.009803,...,-1.076870,-0.344170,-0.434857,-2.017650,0.442977,0.137848,0.528688,-1.359680,-0.636072,-0.136038
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.01,-0.779451,-0.511974,1.057820,-2.175720,-0.776229,-0.567041,0.190016,...,-0.100845,-0.342686,0.804763,-0.301556,-2.113700,1.792690,-0.190678,-2.365180,0.061731,-0.942196
996,0.0,0.0,0.01,0.734438,1.401170,-0.272893,0.476263,7.959380,0.409941,0.379497,...,-0.405913,1.468650,-0.604401,0.957640,-0.861434,-0.485494,-0.323729,1.328650,0.578789,0.058941
997,0.0,0.0,0.01,0.173177,0.000756,-0.228164,-0.533564,0.570570,-1.362840,1.567910,...,0.885762,-1.272510,-0.951439,1.272390,-2.122500,1.019280,0.024941,-0.016005,-2.170290,-0.226614
998,0.0,0.0,0.01,-1.459300,0.580279,1.105340,0.192774,0.589961,1.383760,-0.212725,...,-1.012570,0.322716,0.688687,-2.331880,-1.175850,0.861797,-0.400651,-0.971605,-0.806640,0.473303


In [37]:
F = 6           # F - number of letters in first name 
L = 6           # L - number of letters in last name 
N = (L+F)*100   # N = (L+F)*100

gen_quant = CmdStanModel(stan_file='code_2.stan')

samples = gen_quant.sample(data={'N': N}, 
                           fixed_param=True,
                           iter_sampling=1000, 
                           iter_warmup=0, 
                           chains = 1)

# Creation of pandas dataframe from resulting draws
df = samples.draws_pd()
display(df)


INFO:cmdstanpy:found newer exe file, not recompiling
INFO:cmdstanpy:CmdStan start processing
chain 1 |██████████| 00:02 Sampling completed
INFO:cmdstanpy:CmdStan done processing.

Unnamed: 0,lp__,accept_stat__,"X[1,1]","X[2,1]","X[3,1]","X[4,1]","X[5,1]","X[6,1]","X[7,1]","X[8,1]",...,y[1191],y[1192],y[1193],y[1194],y[1195],y[1196],y[1197],y[1198],y[1199],y[1200]
0,0.0,0.0,-1.866130,0.905403,1.241160,0.901698,-0.772418,0.624116,-0.089948,-0.844375,...,2.660340,-3.954400,2.354620,-1.317260,-2.780080,-4.720880,0.781531,5.100360,2.161150,-1.023700
1,0.0,0.0,1.089860,-1.215210,0.274909,0.889937,-0.278354,1.656830,-0.651778,1.340400,...,2.134260,-0.638931,1.841900,1.001020,0.632327,0.176759,-4.660260,1.857420,7.572690,-2.546320
2,0.0,0.0,-0.297586,-1.668080,-0.695876,-1.110100,1.188050,-0.123772,0.376262,-1.553270,...,1.527500,-0.161663,-2.557690,1.052890,-0.402991,-3.128820,6.476220,1.192480,0.567378,6.157480
3,0.0,0.0,-0.588273,1.948580,0.269042,0.033962,1.089170,-0.057110,0.408362,-0.357169,...,-2.659590,-0.983285,1.021030,-1.358360,0.047771,1.749140,-5.591690,0.230935,2.306700,1.304460
4,0.0,0.0,1.210900,0.878756,-0.083978,-0.119786,0.301665,1.010530,-1.239400,1.497880,...,-0.739431,-6.753740,0.764196,0.964613,-2.942900,0.339135,1.166640,-1.800440,-0.477947,-0.173094
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,-0.454227,0.826103,-1.664240,-2.221110,-1.019100,-0.116910,-1.564120,0.495836,...,-3.068000,1.032480,2.133000,3.971990,-1.785230,4.293310,2.175650,-2.421920,-1.941230,2.959490
996,0.0,0.0,0.076357,0.138174,0.709832,-1.154750,1.515990,-0.931918,-0.197932,0.492320,...,1.890720,1.374240,-4.410470,6.027700,-1.010120,-1.483690,0.072129,-1.766510,2.657540,-1.515940
997,0.0,0.0,-0.227366,0.558113,-0.521765,0.939516,0.052594,0.968348,0.666795,1.107860,...,-0.663357,-7.880610,4.714770,-4.653020,-0.779238,-2.429780,-0.338320,-4.224790,1.236340,-0.526321
998,0.0,0.0,0.177105,-1.207410,-0.573684,0.048020,-1.129570,-1.162840,-0.864363,-0.868310,...,-3.551270,-0.174854,1.860520,-0.036575,-3.170540,-3.024580,1.761500,-3.330160,1.235530,4.628620


## Exercise 2 - compare normal and Student's t models for data from the first file

1. Compile both models
2. Fit both models
3. Using az.compare and az.plot_compare, analyze both models with the ```loo``` and ```waic``` criteria.

### 1. Compile models

In [38]:
F = 6           # F - number of letters in first name
L = 6           # L - number of letters in last name
N = (L+F)*100   # N = (L+F)*100

# Placeholder observations: this cell only checks that both models compile and run.
placeholder_data = {'N': N, 'y': [1.0]*N}

# code_3.stan: normal model (parameters mu and sigma in the output below)
gen_model = CmdStanModel(stan_file='code_3.stan')

# fixed_param=True with iter_warmup=0 skips inference: the parameters stay at
# their initial values, so these draws are not a posterior fit.
samples = gen_model.sample(data=placeholder_data,
                           fixed_param=True,
                           iter_sampling=1000,
                           iter_warmup=0,
                           chains=1)

# Creation of pandas dataframe from resulting draws
df = samples.draws_pd()
display(df)

#--------------------------------------------------#

# code_4.stan: Student's t model (parameters mu, sigma and nu in the output below)
gen_data = CmdStanModel(stan_file='code_4.stan')

samples = gen_data.sample(data=placeholder_data,
                          fixed_param=True,
                          iter_sampling=1000,
                          iter_warmup=0,
                          chains=1)

# Creation of pandas dataframe from resulting draws
df = samples.draws_pd()
display(df)

INFO:cmdstanpy:found newer exe file, not recompiling
INFO:cmdstanpy:CmdStan start processing
chain 1 |██████████| 00:01 Sampling completed
INFO:cmdstanpy:CmdStan done processing.

Unnamed: 0,lp__,accept_stat__,sigma,mu,log_lik[1],log_lik[2],log_lik[3],log_lik[4],log_lik[5],log_lik[6],...,y_hat[1191],y_hat[1192],y_hat[1193],y_hat[1194],y_hat[1195],y_hat[1196],y_hat[1197],y_hat[1198],y_hat[1199],y_hat[1200]
0,0.0,0.0,0.762677,-1.77806,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,...,-1.469220,-1.478370,-2.68093,-0.557914,-2.939420,-0.060771,-3.528650,-1.79870,-1.654380,-0.504699
1,0.0,0.0,0.762677,-1.77806,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,...,-1.773480,-2.451630,-2.01434,-1.121080,-1.645980,-1.519990,-1.375990,-2.16068,-2.534170,-2.327850
2,0.0,0.0,0.762677,-1.77806,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,...,-0.214452,-1.457880,-2.17218,-1.691370,-2.069170,-0.730311,-0.751317,-3.23701,-1.781350,-1.223470
3,0.0,0.0,0.762677,-1.77806,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,...,-2.626610,-0.629467,-1.66595,-3.600130,-1.592280,-2.246830,-1.504430,-1.30572,-2.551320,-1.612830
4,0.0,0.0,0.762677,-1.77806,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,...,-1.048040,-0.630297,-2.49699,-1.125190,-1.590330,-0.810262,-1.271180,-2.20233,-2.156800,-2.329920
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.762677,-1.77806,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,...,-1.889050,-1.610030,-1.62745,-1.302060,-0.896012,-2.811510,-2.378730,-2.02873,-2.449660,-2.255320
996,0.0,0.0,0.762677,-1.77806,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,...,-2.171390,-1.452760,-1.93502,-2.667480,-1.675330,-1.409650,-2.966620,-1.02974,-1.540320,-1.518410
997,0.0,0.0,0.762677,-1.77806,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,...,-1.178020,-2.557410,-2.43450,-3.123270,-1.905440,-2.535850,-1.573620,-1.85424,-0.576674,-3.183050
998,0.0,0.0,0.762677,-1.77806,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,-7.28198,...,-2.231130,-2.131730,-2.07940,-2.601110,-2.131010,-1.486790,-2.601250,-1.84935,-2.034160,-1.934630


INFO:cmdstanpy:found newer exe file, not recompiling
INFO:cmdstanpy:CmdStan start processing
chain 1 |██████████| 00:01 Sampling completed
INFO:cmdstanpy:CmdStan done processing.

Unnamed: 0,lp__,accept_stat__,sigma,mu,nu,log_lik[1],log_lik[2],log_lik[3],log_lik[4],log_lik[5],...,y_hat[1191],y_hat[1192],y_hat[1193],y_hat[1194],y_hat[1195],y_hat[1196],y_hat[1197],y_hat[1198],y_hat[1199],y_hat[1200]
0,0.0,0.0,1.42206,1.8065,0.294284,-2.30942,-2.30942,-2.30942,-2.30942,-2.30942,...,-68.026000,1325.70000,-6.66560,5.536180e+00,7.381510,1.61071,18.48300,10.116100,-4.73683,0.675965
1,0.0,0.0,1.42206,1.8065,0.294284,-2.30942,-2.30942,-2.30942,-2.30942,-2.30942,...,51233.900000,6.43806,1.98477,7.364460e+00,6384.290000,1.69480,52.35510,0.863594,4.60383,3.078770
2,0.0,0.0,1.42206,1.8065,0.294284,-2.30942,-2.30942,-2.30942,-2.30942,-2.30942,...,0.326895,7.28971,-33955.80000,-8.865380e-01,1.478220,-2.54999,13.40570,19.447300,10.52500,-1.527290
3,0.0,0.0,1.42206,1.8065,0.294284,-2.30942,-2.30942,-2.30942,-2.30942,-2.30942,...,2.464350,8678.02000,14.53470,1.111820e+00,5.471070,-103019.00000,25619.70000,1.417110,-52.43280,5.884250
4,0.0,0.0,1.42206,1.8065,0.294284,-2.30942,-2.30942,-2.30942,-2.30942,-2.30942,...,8605.490000,-156.32700,24.15140,6.422750e+00,2.653720,-7536.28000,-1.26130,154943.000000,123.82200,2.195450
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,1.42206,1.8065,0.294284,-2.30942,-2.30942,-2.30942,-2.30942,-2.30942,...,0.461489,-33.74800,1.13582,-2.434990e+00,2.026730,4.00584,-9.77908,16634.000000,5.97106,-33.247500
996,0.0,0.0,1.42206,1.8065,0.294284,-2.30942,-2.30942,-2.30942,-2.30942,-2.30942,...,2.137310,1.75346,-1.81215,1.414400e+06,-104.308000,4916.58000,2.29480,-0.542943,26.98330,2.360870
997,0.0,0.0,1.42206,1.8065,0.294284,-2.30942,-2.30942,-2.30942,-2.30942,-2.30942,...,1958.720000,169.55100,-15.44690,-6.638990e+02,-15.088200,2.97073,37.87780,-8.810020,420.81700,-8.353310
998,0.0,0.0,1.42206,1.8065,0.294284,-2.30942,-2.30942,-2.30942,-2.30942,-2.30942,...,-0.186071,-95.42900,2.23957,1.708140e+00,-1.597040,-52.35220,16.04070,0.516628,2.92039,4.419500


### 2. Fit models

In [43]:
# Simulate one dataset from code_1.stan and use its first draw of y as the observations.
sim = CmdStanModel(stan_file='code_1.stan').sample(data={'N': N},
                                                   fixed_param=True,
                                                   iter_sampling=1,
                                                   iter_warmup=0,
                                                   chains=1)
data_dict = {'N': N, 'y': sim.stan_variable('y')[0]}

# Fit the normal model (code_3.stan). No fixed_param and the default warmup,
# so the sampler actually explores the posterior.
fit_normal = gen_model.sample(data=data_dict)
display(fit_normal.draws_pd())

# Fit the Student's t model (code_4.stan) on the same data.
fit_student = gen_data.sample(data=data_dict)
display(fit_student.draws_pd())

### 3. Analyze both models with az.compare and az.plot_compare using the ```loo``` and ```waic``` criteria
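
The comparison step could look roughly like the sketch below. It is written against the ArviZ 0.x API (`az.from_dict` with `posterior=` and `log_likelihood=` keywords) and is self-contained: the "posterior draws" are stand-ins generated with NumPy, not real Stan output, and the toy data and variable names are assumptions. With real fits, each `InferenceData` object would instead come from something like `az.from_cmdstanpy(posterior=fit, log_likelihood='log_lik')`, where `fit` is a fitted cmdstanpy object.

```python
import numpy as np
import arviz as az
from scipy import stats

rng = np.random.default_rng(0)
y = rng.standard_t(df=3, size=200)   # heavy-tailed toy data (assumed)

n_chains, n_draws = 2, 500

# Stand-in draws for the location parameter of each model (NOT real posteriors).
mu_n = rng.normal(y.mean(), y.std() / np.sqrt(y.size), size=(n_chains, n_draws, 1))
mu_t = rng.normal(np.median(y), 0.1, size=(n_chains, n_draws, 1))

# Pointwise log-likelihood, shape (chain, draw, data point).
ll_normal = stats.norm.logpdf(y, loc=mu_n, scale=y.std())
ll_student = stats.t.logpdf(y, df=3, loc=mu_t, scale=1.0)

idata_normal = az.from_dict(posterior={"mu": mu_n[..., 0]},
                            log_likelihood={"log_lik": ll_normal})
idata_student = az.from_dict(posterior={"mu": mu_t[..., 0]},
                             log_likelihood={"log_lik": ll_student})

models = {"normal": idata_normal, "student": idata_student}

# One comparison table per criterion; rows are sorted from best to worst model.
cmp_loo = az.compare(models, ic="loo")
cmp_waic = az.compare(models, ic="waic")
print(cmp_loo)
print(cmp_waic)

# Plot the elpd estimates with their standard errors.
az.plot_compare(cmp_loo)
az.plot_compare(cmp_waic)
```

In the resulting tables, the `rank` column orders the models, and the difference columns show how far each model is from the best one relative to its standard error.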

## Exercise 3 - compare models with different numbers of predictors

1. Compile model
2. Compare models for 1, 2 and 3 predictors as in the previous exercise

### 1. Compile model

In [45]:
# gen_quanti = CmdStanModel(stan_file='code_5.stan')

# F = 6           # F - number of letters in first name 
# L = 6           # L - number of letters in last name 
# N = (L+F)*100   # N = (L+F)*100

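# # The data dict below lacks K, which code_5.stan declares in its data block (see the error below).
# samples = gen_quanti.sample(data={'N': N, 'y': [1.0]*N}, 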
#                            fixed_param=True,
#                            iter_sampling=1000, 
#                            iter_warmup=0, 
#                            chains = 1)

# # Creation of pandas dataframe from resulting draws
# df = samples.draws_pd()
# display(df)

INFO:cmdstanpy:found newer exe file, not recompiling
INFO:cmdstanpy:CmdStan start processing
chain 1 |          | 00:00 Status
ERROR:cmdstanpy:Chain [1] error: error during processing Operation not permitted
chain 1 |██████████| 00:00 Sampling completed
INFO:cmdstanpy:CmdStan done processing.

RuntimeError: Error during sampling:
Exception: variable does not exist; processing stage=data initialization; variable name=K; base type=int (in '/home/LaboratoryClasses_Data_Analitycs/Lab_6_Model_comparison/code_5.stan', line 3, column 4 to column 10)
Command and output files:
RunSet: chains=1, chain_ids=[1], num_processes=1
 cmd (chain 1):
	['/home/LaboratoryClasses_Data_Analitycs/Lab_6_Model_comparison/code_5', 'id=1', 'random', 'seed=61737', 'data', 'file=/tmp/tmpb1ms_eyc/xvzjz1on.json', 'output', 'file=/tmp/tmpb1ms_eyc/code_5-20230418132201.csv', 'method=sample', 'num_samples=1000', 'num_warmup=0', 'algorithm=fixed_param']
 retcodes=[1]
 per-chain output files (showing chain 1 only):
 csv_file:
	/tmp/tmpb1ms_eyc/code_5-20230418132201.csv
 console_msgs (if any):
	/tmp/tmpb1ms_eyc/code_5-20230418132201_0-stdout.txt

### 2. Compare models for 1, 2 and 3 predictors
Using az.compare and az.plot_compare, analyze the models with the ```loo``` and ```waic``` criteria.
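
The RuntimeError reported for code_5.stan says that the data is missing an integer `K`. A minimal sketch of how the data for the three fits could be prepared is shown below. Only `K` is confirmed by that error message: the helper `make_stan_data`, the data names `N`, `X` and `y`, and the toy predictor matrix are assumptions that would need to be checked against the data block of code_5.stan.

```python
import numpy as np

def make_stan_data(X, y, K):
    """Build a data dict that uses only the first K predictor columns (hypothetical helper)."""
    X = np.asarray(X)
    y = np.asarray(y)
    if not 1 <= K <= X.shape[1]:
        raise ValueError(f"K must be between 1 and {X.shape[1]}")
    return {
        "N": int(y.size),          # number of observations
        "K": int(K),               # number of predictors (the variable the error asks for)
        "X": X[:, :K].tolist(),    # assumed name of the N x K predictor matrix
        "y": y.tolist(),           # assumed name of the outcome vector
    }

# Toy data (assumed): 10 observations, 3 candidate predictors.
rng = np.random.default_rng(1)
X_full = rng.normal(size=(10, 3))
y_toy = X_full @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=10)

# One data dict per model size.
data_by_k = {K: make_stan_data(X_full, y_toy, K) for K in (1, 2, 3)}
for K, d in data_by_k.items():
    print(K, d["N"], d["K"], len(d["X"][0]))

# With a compiled model, the fits could then be collected like this:
# model_5 = CmdStanModel(stan_file='code_5.stan')
# fits = {K: model_5.sample(data=d) for K, d in data_by_k.items()}
```

Each fit can then be converted to an ArviZ object and passed to az.compare in a dict keyed by the number of predictors, as in the previous exercise.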