# Exploratory data analysis with BayesDB

Adapted by Jon Clucas of the [Child Mind Institute](https://www.childmind.org) from a notebook prepared for the DARPA PPAML PI meeting, July 2017, by [Feras Saad](http://fsaad.mit.edu) of the MIT Probabilistic Computing Project (Probcomp).

### Setting up the Jupyter environment

The first step is to load the `jupyter_probcomp.magics` library, which provides BayesDB hooks for data exploration, plotting, querying, and analysis through this Jupyter notebook environment. The second cell allows plots from matplotlib and JavaScript to be shown inline.

In [1]:
%load_ext jupyter_probcomp.magics

session_id: root@61e0523943b1_2017-12-15T21:26:45.455080_4


In [2]:
%matplotlib inline
%vizgpm inline

<IPython.core.display.Javascript object>

### Creating a BayesDB `.bdb` file on disk

We next use the `%bayesdb` magic to create a `.bdb` file on disk named `hbnq.bdb`. This file will store all the data and models created in this session.

In [3]:
!rm -f resources/hbnq.bdb
%bayesdb resources/hbnq.bdb

u'Loaded: resources/hbnq.bdb'

### Ingesting data from a `.csv` file into a BayesDB table

The HBNQ dataset is stored in the csv file `data/training1_ADHD_subtypes.csv`. Each column of the csv file is a variable, and each row is a record. We use the `CREATE TABLE` BQL query, with the pathname of the csv file, to convert the csv data into a database table named `hbnq_t`.

In [4]:
%bql DROP TABLE IF EXISTS "hbnq_t"
%bql CREATE TABLE "hbnq_t" FROM 'data/training1_ADHD_subtypes.csv'

Almost all datasets have missing values, marked by special tokens such as `NaN` or `NA` that indicate a particular cell is missing. In the HBNQ data, empty strings are used. To tell BayesDB to treat empty strings as SQL `NULL`, we use the `.nullify` command, followed by the name of the table and the string `''`, which represents missing data. More than 45,000 cells have been converted to `NULL`, illustrating that the data is quite sparse.

In [5]:
%bql .nullify hbnq_t ''

Nullified 45142 cells
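To make the effect of `.nullify` concrete, here is a minimal sketch of the same idea using only Python's standard `sqlite3` module. It is not BayesDB's implementation: the helper `nullify` and the three-row table `demo_t` are hypothetical stand-ins for illustration.

```python
import sqlite3


def nullify(conn, table, token):
    """Set every cell equal to `token` to NULL; return how many cells changed."""
    # Read the column names from the table's own schema.
    columns = [row[1] for row in conn.execute('PRAGMA table_info("{}")'.format(table))]
    changed = 0
    for col in columns:
        cur = conn.execute(
            'UPDATE "{0}" SET "{1}" = NULL WHERE "{1}" = ?'.format(table, col),
            (token,),
        )
        changed += cur.rowcount  # rows updated for this column
    return changed


# Hypothetical three-row table standing in for hbnq_t.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE demo_t ("EID" TEXT, "Age" TEXT, "APQ_P_01" TEXT)')
conn.executemany(
    "INSERT INTO demo_t VALUES (?, ?, ?)",
    [("A", "10.1", ""), ("B", "", ""), ("C", "15.1", "4.0")],
)
n_nullified = nullify(conn, "demo_t", "")
print("Nullified {} cells".format(n_nullified))
```

The count is per cell rather than per row, which is why a table with only 187 rows can report tens of thousands of nullified cells.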


### Running basic queries on the table using BQL and SQL

Now that the HBNQ dataset has been loaded into a table, and missing values converted to `NULL`, we can run standard SQL queries to explore the contents of the data. For example, we can select the first 5 records. Observe that each row in the table is a participant, and each column is a variable. Scroll through the names in the header of the table to get a sense of the variables in the dataset.

In [6]:
%bql SELECT * FROM "hbnq_t" LIMIT 5;

Unnamed: 0,EID,Sex,Age,APQ_P_01,APQ_P_02,APQ_P_03,APQ_P_04,APQ_P_05,APQ_P_06,APQ_P_07,APQ_P_08,APQ_P_09,APQ_P_10,APQ_P_11,APQ_P_12,APQ_P_13,APQ_P_14,APQ_P_15,APQ_P_16,APQ_P_17,APQ_P_18,APQ_P_19,APQ_P_20,APQ_P_21,APQ_P_22,APQ_P_23,APQ_P_24,APQ_P_25,APQ_P_26,APQ_P_27,APQ_P_28,APQ_P_29,APQ_P_30,APQ_P_31,APQ_P_32,APQ_P_33,APQ_P_34,APQ_P_35,APQ_P_36,APQ_P_37,APQ_P_38,APQ_P_39,APQ_P_40,APQ_P_41,APQ_P_42,APQ_SR_01,APQ_SR_01A,APQ_SR_02,APQ_SR_03,APQ_SR_04,APQ_SR_04A,APQ_SR_05,APQ_SR_06,APQ_SR_07,APQ_SR_07A,APQ_SR_08,APQ_SR_09,APQ_SR_09A,APQ_SR_10,APQ_SR_11,APQ_SR_11A,APQ_SR_12,APQ_SR_13,APQ_SR_14,APQ_SR_14A,APQ_SR_15,APQ_SR_15A,APQ_SR_16,APQ_SR_17,APQ_SR_18,APQ_SR_19,APQ_SR_20,APQ_SR_20A,APQ_SR_21,APQ_SR_22,APQ_SR_23,APQ_SR_24,APQ_SR_25,APQ_SR_26,APQ_SR_26A,APQ_SR_27,APQ_SR_28,APQ_SR_29,APQ_SR_30,APQ_SR_31,APQ_SR_32,APQ_SR_33,APQ_SR_34,APQ_SR_35,APQ_SR_36,APQ_SR_37,APQ_SR_38,APQ_SR_39,APQ_SR_40,APQ_SR_41,APQ_SR_42,ARI_P_01,ARI_P_02,ARI_P_03,ARI_P_04,ARI_P_05,ARI_P_06,ARI_P_07,ARI_S_01,ARI_S_02,ARI_S_03,ARI_S_04,ARI_S_05,ARI_S_06,ARI_S_07,ASSQ_01,ASSQ_02,ASSQ_03,ASSQ_04,ASSQ_05,ASSQ_06,ASSQ_07,ASSQ_08,ASSQ_09,ASSQ_10,ASSQ_11,ASSQ_12,ASSQ_13,ASSQ_14,ASSQ_15,ASSQ_16,ASSQ_17,ASSQ_18,ASSQ_19,ASSQ_20,ASSQ_21,ASSQ_22,ASSQ_23,ASSQ_24,ASSQ_25,ASSQ_26,ASSQ_27,financialsupport,Barratt_P1_Edu,Barratt_P2_Edu,Barratt_P1_Occ,Barratt_P2_Occ,C3SR_01,C3SR_02,C3SR_03,C3SR_04,C3SR_05,C3SR_06,C3SR_07,C3SR_08,C3SR_09,C3SR_10,C3SR_11,C3SR_12,C3SR_13,C3SR_14,C3SR_15,C3SR_16,C3SR_17,C3SR_18,C3SR_19,C3SR_20,C3SR_21,C3SR_22,C3SR_23,C3SR_24,C3SR_25,C3SR_26,C3SR_27,C3SR_28,C3SR_29,C3SR_30,C3SR_31,C3SR_32,C3SR_33,C3SR_34,C3SR_35,C3SR_36,C3SR_37,C3SR_38,C3SR_39,C3SR_NI,C3SR_PI,CCSC_01,CCSC_02,CCSC_03,CCSC_04,CCSC_05,CCSC_06,CCSC_07,CCSC_08,CCSC_09,CCSC_10,CCSC_11,CCSC_12,CCSC_13,CCSC_14,CCSC_15,CCSC_16,CCSC_17,CCSC_18,CCSC_19,CCSC_20,CCSC_21,CCSC_22,CCSC_23,CCSC_24,CCSC_25,CCSC_26,CCSC_27,CCSC_28,CCSC_29,CCSC_30,CCSC_31,CCSC_32,CCSC_33,CCSC_34,CCSC_35,CCSC_36,CCSC_37,CCSC_38,CCSC_39,CCSC_40,CCSC_41,CCSC_
42,CCSC_43,CCSC_44,CCSC_45,CCSC_46,CCSC_47,CCSC_48,CCSC_49,CCSC_50,CCSC_51,CCSC_52,CCSC_53,CCSC_54,CCSC_55,CCSC_56,CCSC_PFC,CCSC_DPS,CCSC_SU,CCSC_AC,CCSC_AA,CCSC_REP,CCSC_WT,CCSC_PCR,CCSC_POS,CCSC_REL,CCSC_SS,CCSC_SUPMF,CCSC_SUPOA,CCSC_SUPEER,CCSC_SUPSIB,CPIC_01,CPIC_02,CPIC_03,CPIC_04,CPIC_05,CPIC_06,CPIC_07,CPIC_08,CPIC_09,CPIC_10,CPIC_11,CPIC_12,CPIC_13,CPIC_14,CPIC_15,CPIC_16,CPIC_17,CPIC_18,CPIC_19,CPIC_20,CPIC_21,CPIC_22,CPIC_23,CPIC_24,CPIC_25,CPIC_26,CPIC_27,CPIC_28,CPIC_29,CPIC_30,CPIC_31,CPIC_32,CPIC_34,CPIC_33,CPIC_35,CPIC_36,CPIC_37,CPIC_38,CPIC_39,CPIC_40,CPIC_41,CPIC_42,CPIC_43,CPIC_44,CPIC_45,CPIC_46,CPIC_47,CPIC_48,CPIC_49,CPIC_50,CPIC_51,DTS_01,DTS_02,DTS_03,DTS_04,DTS_05,DTS_06,DTS_07,DTS_08,DTS_09,DTS_10,DTS_11,DTS_12,DTS_13,DTS_14,DTS_15,EHQ_01,EHQ_02,EHQ_03,EHQ_04,EHQ_05,EHQ_06,EHQ_07,EHQ_08,EHQ_09,EHQ_10,EHQ_11,EHQ_12,EHQ_13,EHQ_14,EHQ_15,MDD_1A,MDD_1B,MDD_2A,MDD_2B,MDD_3A,MDD_3B,MDD_4,MDD_5,MDD_6,MDD_7,MDD_8A,MDD_8B,MDD_9,SocAnx_01,SocAnx_02,SocAnx_03,SocAnx_04A,SocAnx_04B,SocAnx_05,DMDD_1A,DMDD_1B,DMDD_1C,DMDD_2A,DMDD_2B,DMDD_2C,DMDD_3A,DMDD_3B,DMDD_3C,DMDD_4A,DMDD_4B,DMDD_4C,DMDD_5A,DMDD_5B,DMDD_5C,DMDD_6A,DMDD_6B,DMDD_6C,DMDD_7A,DMDD_7B,DMDD_7C,DMDD_8A,DMDD_8B,DMDD_8C,DMDD_9A,DMDD_9B,DMDD_9C,DMDD_10A,DMDD_10B,DMDD_10C,Panic_A01,Panic_A02,Panic_A03,Panic_A01A,Panic_A01B,Panic_A02A,Panic_A02B,Panic_A03A,Panic_A03B,Panic_B01,Panic_B02,Panic_B03,Panic_B04,Panic_B05,Panic_B06,Panic_B07,Panic_B08,Panic_B09,Panic_B10,Panic_B11,Panic_B12,Panic_B13,FGC_Incomplete_Reason,FGC_PU,FGC_PU_Zone,FGC_SRL,FGC_SRL_Zone,FGC_SRR,FGC_SRR_Zone,FGC_GSD_Zone,FGC_GSD,FGC_GSND,FSQ_01,FSQ_02,FSQ_03,FSQ_04,FSQ_05a,FSQ_05b,FSQ_05c,FSQ_05d,FSQ_05e,FSQ_05f,FSQ_05g,FSQ_05h,FSQ_05i,FSQ_05j,FSQ_06,FSQ_07,FSQ_08,IAT_01,IAT_02,IAT_03,IAT_04,IAT_05,IAT_06,IAT_07,IAT_08,IAT_09,IAT_10,IAT_11,IAT_12,IAT_13,IAT_14,IAT_15,IAT_16,IAT_17,IAT_18,IAT_19,IAT_20,MFQ_P_01,MFQ_P_02,MFQ_P_03,MFQ_P_04,MFQ_P_05,MFQ_P_06,MFQ_P_07,MFQ_P_08,MFQ_P_09,MFQ_P_10,MFQ_P_11,MFQ_P_12,MFQ_P_13,MFQ_P_14,MF
Q_P_15,MFQ_P_16,MFQ_P_17,MFQ_P_18,MFQ_P_19,MFQ_P_20,MFQ_P_21,MFQ_P_22,MFQ_P_23,MFQ_P_24,MFQ_P_25,MFQ_P_26,MFQ_P_27,MFQ_P_28,MFQ_P_29,MFQ_P_30,MFQ_P_31,MFQ_P_32,MFQ_P_33,MFQ_P_34,MFQ_SR_01,MFQ_SR_02,MFQ_SR_03,MFQ_SR_04,MFQ_SR_05,MFQ_SR_06,MFQ_SR_07,MFQ_SR_08,MFQ_SR_09,MFQ_SR_10,MFQ_SR_11,MFQ_SR_12,MFQ_SR_13,MFQ_SR_14,MFQ_SR_15,MFQ_SR_16,MFQ_SR_17,MFQ_SR_18,MFQ_SR_19,MFQ_SR_20,MFQ_SR_21,MFQ_SR_22,MFQ_SR_23,MFQ_SR_24,MFQ_SR_25,MFQ_SR_26,MFQ_SR_27,MFQ_SR_28,MFQ_SR_29,MFQ_SR_30,MFQ_SR_31,MFQ_SR_32,MFQ_SR_33,NIH7_Incomplete_Reason,PAQ_A_01a,PAQ_A_01b,PAQ_A_01c,PAQ_A_01d,PAQ_A_01e,PAQ_A_01f,PAQ_A_01g,PAQ_A_01h,PAQ_A_01i,PAQ_A_01j,PAQ_A_01k,PAQ_A_01l,PAQ_A_01m,PAQ_A_01n,PAQ_A_01o,PAQ_A_01p,PAQ_A_01q,PAQ_A_01r,PAQ_A_01s,PAQ_A_01t,PAQ_A_01u,PAQ_A_01v,PAQ_A_01w_text,PAQ_A_01w,PAQ_A_01x_text,PAQ_A_01x,PAQ_A_01_Average,PAQ_A_02,PAQ_A_03,PAQ_A_04,PAQ_A_05,PAQ_A_06,PAQ_A_07,PAQ_A_08a,PAQ_A_08b,PAQ_A_08c,PAQ_A_08d,PAQ_A_08e,PAQ_A_08f,PAQ_A_08g,PAQ_A_09,Relation,PBQ_01,PBQ_01A,PBQ_02,PBQ_02A,PBQ_03,PBQ_03A,PBQ_03B,PBQ_03B_1,PBQ_03C,PBQ_04,PBQ_04A,PBQ_05,PBQ_05A,PBQ_05B,PBQ_05C,PBQ_06,PBQ_06A,PBQ_07,PBQ_07A,PBQ_07B,PBQ_07C,PBQ_08,PBQ_09,PBQ_09A,PBQ_10,PBQ_10A,PBQ_11,PBQ_11A,PBQ_12,PBQ_12A,PBQ_13,PBQ_13A,PBQ_13B,PBQ_13D,PBQ_13E,PBQ_13F,PBQ_13G,PBQ_13H,PBQ_14,PBQ_14A,PBQ_15,PBQ_16,PBQ_17,PBQ_17A,PBQ_17B,PBQ_18,PBQ_18A,PBQ_19,PBQ_19A,PBQ_21,PBQ_22,PBQ_23,PBQ_23A,PBQ_24,PBQ_24A,PBQ_25,PBQ_25A,PBQ_26,PBQ_26A,PBQ_27,PBQ_27A,PCIAT_01,PCIAT_02,PCIAT_03,PCIAT_04,PCIAT_05,PCIAT_06,PCIAT_07,PCIAT_08,PCIAT_09,PCIAT_10,PCIAT_11,PCIAT_12,PCIAT_13,PCIAT_14,PCIAT_15,PCIAT_16,PCIAT_17,PCIAT_18,PCIAT_19,PCIAT_20,PPS_M_01,PPS_M_02,PPS_M_03,PPS_M_04,PPS_M_05,PPS_M_06,PPS_M_Score,PPS_F_01,PPS_F_02,PPS_F_03,PPS_F_04,PPS_F_05,PPS_F_06,PPS_F_Score,PPVT_IncompleteReason,PPVT_Valid,PPVT4_StandardScore,PSI_01,PSI_02,PSI_03,PSI_04,PSI_05,PSI_06,PSI_07,PSI_08,PSI_09,PSI_10,PSI_11,PSI_12,PSI_13,PSI_14,PSI_15,PSI_16,PSI_17,PSI_18,PSI_19,PSI_20,PSI_21,PSI_22,PSI_23,PSI_24,PSI_25,PSI_26,PSI_27,PSI_28,PSI_29,PSI_30,P
SI_31,PSI_32,PSI_33,PSI_34,PSI_35,PSI_36,PSI_DC,PSI_PD,SCARED_P_01,SCARED_P_02,SCARED_P_03,SCARED_P_04,SCARED_P_05,SCARED_P_06,SCARED_P_07,SCARED_P_08,SCARED_P_09,SCARED_P_10,SCARED_P_11,SCARED_P_12,SCARED_P_13,SCARED_P_14,SCARED_P_15,SCARED_P_16,SCARED_P_17,SCARED_P_18,SCARED_P_19,SCARED_P_20,SCARED_P_21,SCARED_P_22,SCARED_P_23,SCARED_P_24,SCARED_P_25,SCARED_P_26,SCARED_P_27,SCARED_P_28,SCARED_P_29,SCARED_P_30,SCARED_P_31,SCARED_P_32,SCARED_P_33,SCARED_P_34,SCARED_P_35,SCARED_P_36,SCARED_P_37,SCARED_P_38,SCARED_P_39,SCARED_P_40,SCARED_P_41,SCARED_SR_01,SCARED_SR_02,SCARED_SR_03,SCARED_SR_04,SCARED_SR_05,SCARED_SR_06,SCARED_SR_07,SCARED_SR_08,SCARED_SR_09,SCARED_SR_10,SCARED_SR_11,SCARED_SR_12,SCARED_SR_13,SCARED_SR_14,SCARED_SR_15,SCARED_SR_16,SCARED_SR_17,SCARED_SR_18,SCARED_SR_19,SCARED_SR_20,SCARED_SR_21,SCARED_SR_22,SCARED_SR_23,SCARED_SR_24,SCARED_SR_25,SCARED_SR_26,SCARED_SR_27,SCARED_SR_28,SCARED_SR_29,SCARED_SR_30,SCARED_SR_31,SCARED_SR_32,SCARED_SR_33,SCARED_SR_34,SCARED_SR_35,SCARED_SR_36,SCARED_SR_37,SCARED_SR_38,SCARED_SR_39,SCARED_SR_40,SCARED_SR_41,SCQ_01,SCQ_02,SCQ_03,SCQ_04,SCQ_05,SCQ_06,SCQ_07,SCQ_08,SCQ_09,SCQ_10,SCQ_11,SCQ_12,SCQ_13,SCQ_14,SCQ_15,SCQ_16,SCQ_17,SCQ_18,SCQ_19,SCQ_20,SCQ_21,SCQ_22,SCQ_23,SCQ_24,SCQ_25,SCQ_26,SCQ_27,SCQ_28,SCQ_29,SCQ_30,SCQ_31,SCQ_32,SCQ_33,SCQ_34,SCQ_35,SCQ_36,SCQ_37,SCQ_38,SCQ_39,SCQ_40,SDQ_01,SDQ_02,SDQ_03,SDQ_04,SDQ_05,SDQ_06,SDQ_07,SDQ_08,SDQ_09,SDQ_10,SDQ_11,SDQ_12,SDQ_13,SDQ_14,SDQ_15,SDQ_16,SDQ_17,SDQ_18,SDQ_19,SDQ_20,SDQ_21,SDQ_22,SDQ_23,SDQ_24,SDQ_25,SDQ_26,SDQ_27,SDQ_28,SDQ_29_a,SDQ_29_b,SDQ_29_c,SDQ_29_d,SDQ_30,SWAN_01,SWAN_02,SWAN_03,SWAN_04,SWAN_05,SWAN_06,SWAN_07,SWAN_08,SWAN_09,SWAN_10,SWAN_11,SWAN_12,SWAN_13,SWAN_14,SWAN_15,SWAN_16,SWAN_17,SWAN_18,CSC_01C,CSC_01P,CSC_02C,CSC_02P,CSC_03C,CSC_03P,CSC_04C,CSC_04P,CSC_05C,CSC_05P,CSC_06C,CSC_06P,CSC_07C,CSC_07P,CSC_08C,CSC_08P,CSC_09C,CSC_09P,CSC_10C,CSC_10P,CSC_11C,CSC_11P,CSC_12C,CSC_12P,CSC_13C,CSC_13P,CSC_14C,CSC_14P,CSC_15C,CSC_15P,CSC_16C,CSC_16P,C
SC_17C,CSC_17P,CSC_18C,CSC_18P,CSC_19C,CSC_19P,CSC_20C,CSC_20P,CSC_21C,CSC_21P,CSC_22C,CSC_22P,CSC_23C,CSC_23P,CSC_24C,CSC_24P,CSC_25C,CSC_25P,CSC_26C,CSC_26P,CSC_27C,CSC_27P,CSC_28C,CSC_28P,CSC_29C,CSC_29P,CSC_30C,CSC_30P,CSC_31C,CSC_31P,CSC_32C,CSC_32P,CSC_33C,CSC_33P,CSC_34C,CSC_34P,CSC_35C,CSC_35P,CSC_36C,CSC_36P,CSC_37C,CSC_37P,CSC_38C,CSC_38P,CSC_39C,CSC_39P,CSC_40C,CSC_40P,CSC_41C,CSC_41P,CSC_42C,CSC_42P,CSC_43C,CSC_43P,CSC_44C,CSC_44P,CSC_45C,CSC_45P,CSC_46C,CSC_46P,CSC_47C,CSC_47P,CSC_48C,CSC_48P,CSC_49C,CSC_49P,CSC_50C,CSC_50P,CSC_51C,CSC_51P,CSC_52C,CSC_52P,CSC_53C,CSC_53P,CSC_54C,CSC_54P,CSC_55aC,CSC_55aP,CSC_55bC,CSC_55bP,CSC_55cC,CSC_55cP,CSC_55dC,CSC_55dP,CSC_55eC,CSC_55eP,CSC_55fC,CSC_55fP,CSC_55gC,CSC_55gP,CSC_55hC,CSC_55hP,CSC_55iC,CSC_55iP,ADHD subtype
0,NDARRM363BXZ,0,10.1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.0,5.0,5.0,1.0,4.0,5.0,3.0,2.0,3.0,2.0,1.0,5.0,4.0,1.0,5.0,1.0,1.0,4.0,1.0,1.0,5.0,5.0,4.0,3.0,4.0,5.0,4.0,1.0,1.0,4.0,1.0,1.0,2.0,5.0,1.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,2.0,1.0,3.0,5.0,1.0,1.0,,,,,,,,,,,,,,,0,0,1,0,2,0,0,0,1,2,1,1,1,0,1,2,1,1,1,1,0,1,2,1,0,0,0,,,,,,2,0,2,3,3,3,3,1,2,0,3,2,0,3,3,3,3,3,0,3,3,0,2,2,1,0,2,2,3,2,3,0,0,0,2,3,2,0,0,2,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,0.5,0,1,1,1,1,1,-1,-1,1,1,1,-1,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5,0,8.0,1,7.5,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2,2,2,0,0,1,1,0,2,2,0,2,1,1.0,1,0,2,1,1,1,0,2,1,2,1,0,0,2,1,2,2,2,2,2,0,1,0,0,2,1,1,2,2,0,0,0,1,1,0,0,0,0,0,0,1,2,1,2,1,2,2,2,2,0,0,1,0,0,0,0,0,0,0,2,2,0,0,0,2,0,2,1,0,1,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,1,1,0,0,0,0,0,1,1,1,1,0,1,2,2,1,1,1,1,1,2,2,0,1,1,1,2,2,2,0,0,1,1,1,1,2,1,2,3,2,1,1,1,0,3,1,1,1,2,2,2,2,1,1,,,,,,,,,,0,1.0,1,1.0,0,0.0,0,0.0,0,1.0,1,1.0,1,1.0,0,0.0,0,1.0,0,0.0,1,1.0,1,1.0,0,0.0,1,1.0,0,0.0,1,1.0,1,1.0,1,1.0,1,1.0,1,1.0,0,0.0,0,0.0,0,1.0,1,0.0,1,1.0,0,0.0,1,1.0,0,1.0,0,1.0,1,0.0,0,0.0,0,0.0,1,1.0,1,1.0,1,1.0,1,1.0,1,1.0,0,1.0,1,1.0,0,1.0,0,0.0,0,0.0,0,1.0,0,0.0,0,1.0,0,0.0,0,0.0,0,0.0,1,1.0,1,1.0,1,1.0,0,1.0,0.0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,ADHD-Combined Type
1,NDARUW586LLL,1,12.3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,1,0,0,1,0,0,2,2,1,0,0,0,1,2,1,2,0,0,0,1,0,0,0,0,1,,9.0,9.0,15.0,5.0,2,3,3,3,3,2,3,0,0,0,3,2,2,3,0,3,3,2,3,2,2,0,3,2,3,2,3,2,3,1,3,0,0,0,0,3,0,1,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1,1,1,1,1,-1,1,1,1,1,1,1,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,4.0,0,5.0,0,2.0,17.0,18.6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2,1,0,2,0,2,2,0,1,2,1,0,2,0.0,0,1,1,1,0,2,1,2,2,2,1,0,1,0,0,1,1,2,0,2,0,2,0,2,1,1,0,2,2,0,2,0,1,1,0,1,1,0,0,0,1,1,0,1,2,2,1,2,2,1,0,2,0,1,0,1,2,0,2,0,1,0,2,2,2,0,0,0,0,1,1,0,0,1,1,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,1,1,0,1,0,0,1,2,2,0,1,0,1,2,2,1,1,0,2,1,2,2,2,1,1,2,2,0,1,2,2,3,3,2,2,0,0,2,3,2,3,1,2,3,3,3,1,3,1.0,0.0,0.0,0.0,2.0,1.0,3.0,0.0,2.0,0,1.0,1,1.0,0,1.0,1,1.0,1,1.0,1,1.0,1,1.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,,0,0.0,0,0.0,1,1.0,0,0.0,1,1.0,0,0.0,1,1.0,1,1.0,0,,1,1.0,1,1.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,1.0,1,1.0,0,0.0,0,0.0,1,,1,1.0,1,1.0,0,1.0,0,0.0,0,,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,1,1.0,0,0.0,1,1.0,1,1.0,0.0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,ADHD-Inattentive Type
2,NDARNH263WZP,1,15.1,4.0,,3.0,4.0,3.0,5.0,3.0,2.0,5.0,4.0,3.0,1.0,4.0,4.0,5.0,4.0,3.0,3.0,5.0,4.0,5.0,2.0,4.0,1.0,1.0,4.0,5.0,1.0,2.0,3.0,1.0,2.0,1.0,2.0,2.0,4.0,3.0,1.0,2.0,4.0,2.0,1.0,2.0,1.0,1.0,2.0,4.0,2.0,1.0,3.0,3.0,1.0,3.0,3.0,2.0,3.0,1.0,1.0,2.0,2.0,3.0,1.0,4.0,2.0,1.0,4.0,1.0,4.0,2.0,2.0,2.0,3.0,1.0,2.0,3.0,5.0,2.0,3.0,2.0,3.0,2.0,1.0,4.0,3.0,2.0,3.0,5.0,1.0,3.0,5.0,1.0,1.0,1.0,,,,,,,,,,,,,,,0,0,0,0,2,0,0,2,1,1,2,2,2,0,1,2,2,1,2,1,0,0,0,0,1,0,0,,12.0,12.0,15.0,15.0,1,2,3,1,3,2,2,0,1,0,3,1,2,2,0,3,1,1,1,0,2,1,2,0,1,2,3,0,1,1,1,0,1,2,2,3,2,3,1,2,0,1.0,4.0,2.0,1.0,3.0,2.0,4.0,3.0,1.0,3.0,3.0,4.0,1.0,2.0,3.0,2.0,,3.0,1.0,3.0,3.0,4.0,3.0,3.0,2.0,3.0,4.0,2.0,2.0,3.0,4.0,1.0,3.0,1.0,4.0,1.0,1.0,4.0,3.0,4.0,1.0,2.0,2.0,1.0,3.0,3.0,3.0,3.0,2.0,1.0,1.0,4.0,3.0,3.0,4.0,1.0,2.75,3.0,3.5,3.0,3.75,2.25,3.0,3.0,2.75,2.25,1.5,1.0,0.75,2.8,1.0,2.0,1.0,0.0,0.0,1.0,2.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,2.0,2.0,2.0,1.0,0.0,2.0,0.0,,,,,,,,,,,,,,,,1.0,1.0,1,1,1,1,1,1,1,1,1,1,1,1,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,14.0,1,14.0,1,2.0,38.0,34.8,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,5.0,1.0,3.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,,,,1.25,3.0,2.0,1.0,1.0,1.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,2,0,1,0,0,0,0,1,0,0,0,1,,0,2,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,2,1,1,2,0,1,0,2,0,0,2,0,0,0,0,0,0,0,2,1,1,2,0,1,2,0,2,2,0,0,0,0,2,2,0,1,0,1,0,2,2,2,1,1,1,1,0,0,1,0,0,0,0,0,1,0,0,0,1,0,1,1,1,1,1,1,1,1,0,1,1,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,2,2,2,1,1,0,1,1,2,1,2,0,2,1,1,0,2,0,0,0,1,3,3,2,2,2,2,2,3,0,2,0,3,0,2,0,1,0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0,1.0,1,1.0,1,1.0,0,0.0,1,1.0,0,0.0,
0,0.0,1,0.0,0,0.0,0,0.0,1,1.0,0,1.0,0,0.0,0,0.0,0,0.0,0,0.0,0,1.0,0,0.0,1,1.0,1,1.0,0,1.0,0,0.0,0,,1,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,1,1.0,0,0.0,0,1.0,0,1.0,1,1.0,0,0.0,0,0.0,1,1.0,1,1.0,1,1.0,1,1.0,1,1.0,1,0.0,1,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,1,1.0,0,0.0,0,1.0,0,0.0,,0.0,1,0.0,1,0.0,0,0.0,0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,ADHD-Combined Type
3,NDAREY721PVD,0,11.3,5.0,5.0,1.0,3.0,3.0,1.0,3.0,3.0,5.0,1.0,5.0,1.0,5.0,5.0,3.0,5.0,1.0,5.0,5.0,5.0,1.0,1.0,3.0,1.0,1.0,4.0,5.0,1.0,1.0,1.0,2.0,3.0,1.0,1.0,1.0,4.0,4.0,1.0,2.0,4.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,5.0,5.0,5.0,1.0,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,2,1,0,0,0,0,0,0,1,0,0,,21.0,,40.0,,0,0,3,3,3,3,3,0,3,0,2,0,0,3,0,3,0,3,3,0,0,0,0,0,1,3,3,0,0,0,0,0,0,0,0,0,0,0,0,2,1,1.0,1.0,4.0,2.0,1.0,4.0,1.0,2.0,1.0,3.0,2.0,4.0,1.0,4.0,1.0,4.0,1.0,3.0,2.0,4.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,4.0,2.0,3.0,1.0,4.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,4.0,1.0,1.0,1.0,4.0,1.0,2.0,1.0,1.0,2.75,2.75,3.0,2.33,2.5,2.0,2.5,1.17,1.25,2.5,1.44,2.0,1.25,1.4,1.0,2.0,1.0,2.0,2.0,0.0,0.0,2.0,0.0,0.0,1.0,2.0,1.0,2.0,0.0,0.0,1.0,2.0,2.0,2.0,0.0,2.0,2.0,0.0,2.0,0.0,2.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,,,,,,,,,,,,,,,,1.0,1.0,1,1,1,1,1,1,1,1,1,1,1,1,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5,0,12.5,1,11.5,1,1.0,13.3,12.1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,2,0,2,0,2,0,2,2,0,0,0,2,0.0,0,1,1,0,0,2,0,2,1,0,2,0,2,1,0,1,2,1,0,2,1,1,0,2,2,2,0,1,2,2,1,0,2,0,0,2,1,0,0,0,0,0,1,2,0,1,0,0,1,0,2,2,0,1,2,0,0,2,1,0,0,2,0,0,2,2,2,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,2,1,0,2,0,1,0,1,2,1,0,0,0,0,2,1,2,0,0,2,0,0,0,1,1,2,3,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0,,0,,0,,0,,0,,1,,0,,0,,0,,0,,0,,1,,0,,0,,0,,0,,0,,0,,1,,1,,1,,0,,1,,0,,1,,0,,0,,0,,0,,0,,0,,0,,0,,1,,1,,1,,1,,0,,0,,0,,0,,0,
,0,,0,,0,,0,,1,,0,,1,,0,,0,,0,,0.0,,0,,0,,0,,0,,0,,,,,,,,,,,,ADHD-Inattentive Type
4,NDARAX573RMT,1,12.0,4.0,3.0,3.0,3.0,3.0,1.0,3.0,1.0,5.0,1.0,3.0,3.0,3.0,4.0,5.0,3.0,1.0,3.0,4.0,4.0,1.0,3.0,3.0,1.0,3.0,3.0,4.0,1.0,1.0,3.0,2.0,2.0,3.0,3.0,3.0,3.0,3.0,1.0,4.0,3.0,1.0,3.0,3.0,4.0,4.0,3.0,3.0,3.0,4.0,2.0,3.0,4.0,3.0,4.0,3.0,1.0,2.0,4.0,2.0,5.0,4.0,4.0,4.0,2.0,1.0,1.0,4.0,4.0,4.0,2.0,2.0,4.0,4.0,1.0,3.0,3.0,1.0,3.0,1.0,3.0,3.0,2.0,3.0,1.0,2.0,2.0,4.0,1.0,1.0,3.0,3.0,1.0,1.0,,,,,,,,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,,18.0,6.0,25.0,5.0,1,2,1,2,1,2,2,1,0,0,2,1,1,1,1,1,0,1,1,2,2,1,3,1,1,0,2,1,1,2,2,0,0,0,0,0,0,2,0,1,0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.83,1.75,1.75,2.0,1.92,2.0,1.0,1.94,2.0,1.75,2.0,2.0,1.0,1.0,2.0,2.0,2.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,2.0,2.0,2.0,0.0,2.0,1.0,1.0,2.0,2.0,1.0,0.0,2.0,2.0,2.0,2.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,,,,,,,,,,,,,,,-0.5,-1.0,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,11.0,1,12.0,1,2.0,17.7,13.6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,1,2,1,2,0,1,0,0,1,2,0,1,0,0.0,1,2,0,0,1,1,1,1,0,0,0,0,0,1,0,0,1,1,0,1,1,1,0,1,1,0,0,1,1,0,1,0,0,0,1,1,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,2,1,0,2,1,2,2,1,1,1,1,1,2,2,2,0,0,1,2,0,2,0,2,1,2,1,1,1,0,0,2,3,2,0,3,1,0,2,1,1,2.0,1.0,0.0,0.0,2.0,2.0,1.0,1.0,2.0,1,0.0,1,0.0,1,0.0,0,1.0,0,0.0,1,1.0,0,0.0,0,0.0,0,0.0,1,1.0,0,0.0,1,0.0,0,,0,0.0,0,0.0,0,1.0,0,1.0,0,0.0,1
,1.0,1,1.0,0,0.0,1,1.0,1,0.0,1,0.0,0,0.0,0,0.0,0,0.0,0,,0,0.0,1,1.0,0,0.0,0,0.0,0,0.0,1,1.0,1,1.0,1,1.0,1,1.0,1,0.0,1,0.0,1,0.0,0,0.0,0,0.0,0,0.0,1,0.0,0,0.0,0,0.0,0,0.0,0,0.0,1,0.0,1,1.0,1,0.0,0,0.0,0.0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,ADHD-Inattentive Type


We can also find the total number of records (i.e. participants).

In [7]:
%bql SELECT COUNT(*) FROM "hbnq_t";

Unnamed: 0,"""COUNT""(*)"
0,187
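Because missing cells are now `NULL`, the same kind of query can also measure how complete each variable is: in SQL, `COUNT(*)` counts rows while `COUNT(column)` counts only non-`NULL` cells. The sketch below shows this with the standard `sqlite3` module on a hypothetical three-row table `demo_t`; it does not touch the real `hbnq_t` data.

```python
import sqlite3

# Hypothetical table with NULLs already in place (None maps to SQL NULL).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE demo_t ("EID" TEXT, "Age" REAL, "APQ_P_01" REAL)')
conn.executemany(
    "INSERT INTO demo_t VALUES (?, ?, ?)",
    [("A", 10.1, None), ("B", 12.3, None), ("C", 15.1, 4.0)],
)

# COUNT(*) counts rows; COUNT("col") skips NULL cells in that column.
n_rows, n_age, n_apq = conn.execute(
    'SELECT COUNT(*), COUNT("Age"), COUNT("APQ_P_01") FROM demo_t'
).fetchone()

# Fraction of participants with an observed value for each variable.
coverage = {"Age": n_age / float(n_rows), "APQ_P_01": n_apq / float(n_rows)}
print(n_rows, n_age, n_apq)
```

The same `SELECT` should run unchanged through `%bql` against `hbnq_t`, assuming BQL passes standard aggregate functions through to SQLite as it does for `COUNT(*)` above.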


### Creating a BayesDB population for the HBNQ data

The notion of a "population" is a central concept in BayesDB. For a standard database table, such as `hbnq_t`, each column is associated with a [data type](https://sqlite.org/datatype3.html), which in sqlite3 are `TEXT`, `REAL`, `INTEGER`, and `BLOB`. For a BayesDB population, each variable is associated with a _statistical data type_. These statistical types, such as `NOMINAL`, `NUMERICAL`, `MAGNITUDE`, and `COUNTS`, specify the set of values and default probability distributions used for building probabilistic models of the data in the population. In this tutorial, we will use the `NUMERICAL` and `NOMINAL` statistical data types.

We can use the `GUESS SCHEMA FOR <table>` command from the Metamodeling Language (MML) in BayesDB to guess the statistical data types of variables in the table. The guesses use heuristics based on the contents in the cells. The `num_distinct` column shows the number of unique values for that variable, and the `reason` column explains which heuristic was used to make the guess.

In [8]:
%time
%mml GUESS SCHEMA FOR "hbnq_t"

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 6.91 µs


Unnamed: 0,column,stattype,num_distinct,reason
0,EID,key,187,This was the first column in the table with a...
1,Sex,nominal,2,There are fewer than 20 distinct numerical va...
2,Age,numerical,72,There are at least 20 unique numerical values...
3,APQ_P_01,nominal,4,There are fewer than 20 distinct numerical va...
4,APQ_P_02,nominal,5,There are fewer than 20 distinct numerical va...
5,APQ_P_03,nominal,6,There are fewer than 20 distinct numerical va...
6,APQ_P_04,nominal,6,There are fewer than 20 distinct numerical va...
7,APQ_P_05,nominal,6,There are fewer than 20 distinct numerical va...
8,APQ_P_06,nominal,6,There are fewer than 20 distinct numerical va...
9,APQ_P_07,nominal,5,There are fewer than 20 distinct numerical va...
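The `reason` column suggests a cutoff of 20 distinct numerical values. The following toy function sketches that kind of heuristic in plain Python. It is a simplification written for illustration, not BayesDB's actual guesser: the name `guess_stattype`, the treatment of non-numeric values, and the omission of `key` detection are all assumptions; only the threshold of 20 is taken from the output above.

```python
def guess_stattype(values, threshold=20):
    """Toy guess: return (stattype, num_distinct) for one column of values."""
    # Missing cells (None) carry no information about the type.
    distinct = set(v for v in values if v is not None)
    try:
        for v in distinct:
            float(v)  # raises if any observed value is not numeric
    except (TypeError, ValueError):
        return "nominal", len(distinct)
    # Few distinct numbers look like category codes; many look like a measurement.
    if len(distinct) < threshold:
        return "nominal", len(distinct)
    return "numerical", len(distinct)


# Hypothetical columns shaped like the ones in the output above.
sex = [0, 1, 1, 0, None]
age = [i / 10.0 for i in range(50, 122)]  # 72 distinct values
subtype = ["ADHD-Combined Type", "ADHD-Inattentive Type", None]

sex_guess = guess_stattype(sex)
age_guess = guess_stattype(age)
subtype_guess = guess_stattype(subtype)
print(sex_guess, age_guess, subtype_guess)
```

A rule like this explains why Likert-style questionnaire items, which take only a handful of values, end up as `nominal` even though they are stored as numbers.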


In [9]:
hbnq = %bql SELECT * FROM "hbnq_t";

*The `EID` column has been identified as the `key`, since it is a unique identifier for each row. Variables such as `Age` have been correctly guessed as `NUMERICAL`. A number of variables have been incorrectly guessed as `NOMINAL`, probably due to the sparsity of the dataset (the reason is that there are fewer than 20 distinct values for that variable in the table).*

Now that we know the HBNQ variables and their statistical data types, we use MML to create a population for the `hbnq_t` table. We will call the population `hbnq`. The population schema uses the statistical types guessed by BayesDB (from the previous cell) for all variables, except for a set of manual overrides for the incorrect guesses.

In [10]:
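# Columns that keep their guessed statistical types.
nonnumerical = ['EID', 'Sex', 'ADHD subtype']
# Despite its name, hbnq_nonnumerical holds the double-quoted names of every
# column NOT listed above, i.e. the variables named in the MODEL override of
# the CREATE POPULATION cell.
# c.encode("utf-8") assumes Python 2, where it turns unicode names into str.
hbnq_nonnumerical = ", ".join(
    '"{0}"'.format(c.encode("utf-8"))
    for c in hbnq.columns if c not in nonnumerical
)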

In [11]:
hbnq_nonnumerical

'"Age", "APQ_P_01", "APQ_P_02", "APQ_P_03", "APQ_P_04", "APQ_P_05", "APQ_P_06", "APQ_P_07", "APQ_P_08", "APQ_P_09", "APQ_P_10", "APQ_P_11", "APQ_P_12", "APQ_P_13", "APQ_P_14", "APQ_P_15", "APQ_P_16", "APQ_P_17", "APQ_P_18", "APQ_P_19", "APQ_P_20", "APQ_P_21", "APQ_P_22", "APQ_P_23", "APQ_P_24", "APQ_P_25", "APQ_P_26", "APQ_P_27", "APQ_P_28", "APQ_P_29", "APQ_P_30", "APQ_P_31", "APQ_P_32", "APQ_P_33", "APQ_P_34", "APQ_P_35", "APQ_P_36", "APQ_P_37", "APQ_P_38", "APQ_P_39", "APQ_P_40", "APQ_P_41", "APQ_P_42", "APQ_SR_01", "APQ_SR_01A", "APQ_SR_02", "APQ_SR_03", "APQ_SR_04", "APQ_SR_04A", "APQ_SR_05", "APQ_SR_06", "APQ_SR_07", "APQ_SR_07A", "APQ_SR_08", "APQ_SR_09", "APQ_SR_09A", "APQ_SR_10", "APQ_SR_11", "APQ_SR_11A", "APQ_SR_12", "APQ_SR_13", "APQ_SR_14", "APQ_SR_14A", "APQ_SR_15", "APQ_SR_15A", "APQ_SR_16", "APQ_SR_17", "APQ_SR_18", "APQ_SR_19", "APQ_SR_20", "APQ_SR_20A", "APQ_SR_21", "APQ_SR_22", "APQ_SR_23", "APQ_SR_24", "APQ_SR_25", "APQ_SR_26", "APQ_SR_26A", "APQ_SR_27", "APQ_SR_28"
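The cell that builds this string relies on Python 2 behaviour: under Python 3, `c.encode("utf-8")` returns `bytes`, and formatting it would embed a `b'...'` prefix in each name. A version-neutral sketch of the same list construction follows. The helper names `quote_identifier` and `numerical_column_list` are hypothetical, and the short `columns` list stands in for `hbnq.columns`. Doubling embedded double quotes is the standard SQL convention; whether BQL's parser accepts it has not been checked here.

```python
def quote_identifier(name):
    """Wrap a column name in double quotes, doubling any embedded quote."""
    return '"{}"'.format(name.replace('"', '""'))


def numerical_column_list(columns, excluded):
    """Comma-separated, quoted names of all columns not in `excluded`."""
    return ", ".join(quote_identifier(c) for c in columns if c not in excluded)


# Hypothetical stand-in for hbnq.columns.
columns = ["EID", "Sex", "Age", "APQ_P_01", "ADHD subtype"]
excluded = ["EID", "Sex", "ADHD subtype"]

column_list = numerical_column_list(columns, excluded)
print(column_list)  # "Age", "APQ_P_01"
```

Quoting every name matters for this table because at least one column, `ADHD subtype`, contains a space.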

In [12]:
%%mml -i hbnq_nonnumerical
CREATE POPULATION "hbnq" FOR "hbnq_t" WITH SCHEMA (
    -- Use the guesses from the previous cell for all variables.
    GUESS STATTYPES FOR (*);

    -- Manually override incorrectly guessed statistical types.
    MODEL
        "Age", "APQ_P_01", "APQ_P_02", "APQ_P_03", "APQ_P_04", "APQ_P_05", "APQ_P_06", "APQ_P_07", "APQ_P_08", "APQ_P_09", "APQ_P_10", "APQ_P_11", "APQ_P_12", "APQ_P_13", "APQ_P_14", "APQ_P_15", "APQ_P_16", "APQ_P_17", "APQ_P_18", "APQ_P_19", "APQ_P_20", "APQ_P_21", "APQ_P_22", "APQ_P_23", "APQ_P_24", "APQ_P_25", "APQ_P_26", "APQ_P_27", "APQ_P_28", "APQ_P_29", "APQ_P_30", "APQ_P_31", "APQ_P_32", "APQ_P_33", "APQ_P_34", "APQ_P_35", "APQ_P_36", "APQ_P_37", "APQ_P_38", "APQ_P_39", "APQ_P_40", "APQ_P_41", "APQ_P_42", "APQ_SR_01", "APQ_SR_01A", "APQ_SR_02", "APQ_SR_03", "APQ_SR_04", "APQ_SR_04A", "APQ_SR_05", "APQ_SR_06", "APQ_SR_07", "APQ_SR_07A", "APQ_SR_08", "APQ_SR_09", "APQ_SR_09A", "APQ_SR_10", "APQ_SR_11", "APQ_SR_11A", "APQ_SR_12", "APQ_SR_13", "APQ_SR_14", "APQ_SR_14A", "APQ_SR_15", "APQ_SR_15A", "APQ_SR_16", "APQ_SR_17", "APQ_SR_18", "APQ_SR_19", "APQ_SR_20", "APQ_SR_20A", "APQ_SR_21", "APQ_SR_22", "APQ_SR_23", "APQ_SR_24", "APQ_SR_25", "APQ_SR_26", "APQ_SR_26A", "APQ_SR_27", "APQ_SR_28", "APQ_SR_29", "APQ_SR_30", "APQ_SR_31", "APQ_SR_32", "APQ_SR_33", "APQ_SR_34", "APQ_SR_35", "APQ_SR_36", "APQ_SR_37", "APQ_SR_38", "APQ_SR_39", "APQ_SR_40", "APQ_SR_41", "APQ_SR_42", "ARI_P_01", "ARI_P_02", "ARI_P_03", "ARI_P_04", "ARI_P_05", "ARI_P_06", "ARI_P_07", "ARI_S_01", "ARI_S_02", "ARI_S_03", "ARI_S_04", "ARI_S_05", "ARI_S_06", "ARI_S_07", "ASSQ_01", "ASSQ_02", "ASSQ_03", "ASSQ_04", "ASSQ_05", "ASSQ_06", "ASSQ_07", "ASSQ_08", "ASSQ_09", "ASSQ_10", "ASSQ_11", "ASSQ_12", "ASSQ_13", "ASSQ_14", "ASSQ_15", "ASSQ_16", "ASSQ_17", "ASSQ_18", "ASSQ_19", "ASSQ_20", "ASSQ_21", "ASSQ_22", "ASSQ_23", "ASSQ_24", "ASSQ_25", "ASSQ_26", "ASSQ_27", "financialsupport", "Barratt_P1_Edu", "Barratt_P2_Edu", "Barratt_P1_Occ", "Barratt_P2_Occ", "C3SR_01", "C3SR_02", "C3SR_03", "C3SR_04", "C3SR_05", "C3SR_06", "C3SR_07", "C3SR_08", "C3SR_09", "C3SR_10", "C3SR_11", "C3SR_12", "C3SR_13", "C3SR_14", "C3SR_15", "C3SR_16", "C3SR_17", "C3SR_18", "C3SR_19", "C3SR_20", "C3SR_21", "C3SR_22", 
"C3SR_23", "C3SR_24", "C3SR_25", "C3SR_26", "C3SR_27", "C3SR_28", "C3SR_29", "C3SR_30", "C3SR_31", "C3SR_32", "C3SR_33", "C3SR_34", "C3SR_35", "C3SR_36", "C3SR_37", "C3SR_38", "C3SR_39", "C3SR_NI", "C3SR_PI", "CCSC_01", "CCSC_02", "CCSC_03", "CCSC_04", "CCSC_05", "CCSC_06", "CCSC_07", "CCSC_08", "CCSC_09", "CCSC_10", "CCSC_11", "CCSC_12", "CCSC_13", "CCSC_14", "CCSC_15", "CCSC_16", "CCSC_17", "CCSC_18", "CCSC_19", "CCSC_20", "CCSC_21", "CCSC_22", "CCSC_23", "CCSC_24", "CCSC_25", "CCSC_26", "CCSC_27", "CCSC_28", "CCSC_29", "CCSC_30", "CCSC_31", "CCSC_32", "CCSC_33", "CCSC_34", "CCSC_35", "CCSC_36", "CCSC_37", "CCSC_38", "CCSC_39", "CCSC_40", "CCSC_41", "CCSC_42", "CCSC_43", "CCSC_44", "CCSC_45", "CCSC_46", "CCSC_47", "CCSC_48", "CCSC_49", "CCSC_50", "CCSC_51", "CCSC_52", "CCSC_53", "CCSC_54", "CCSC_55", "CCSC_56", "CCSC_PFC", "CCSC_DPS", "CCSC_SU", "CCSC_AC", "CCSC_AA", "CCSC_REP", "CCSC_WT", "CCSC_PCR", "CCSC_POS", "CCSC_REL", "CCSC_SS", "CCSC_SUPMF", "CCSC_SUPOA", "CCSC_SUPEER", "CCSC_SUPSIB", "CPIC_01", "CPIC_02", "CPIC_03", "CPIC_04", "CPIC_05", "CPIC_06", "CPIC_07", "CPIC_08", "CPIC_09", "CPIC_10", "CPIC_11", "CPIC_12", "CPIC_13", "CPIC_14", "CPIC_15", "CPIC_16", "CPIC_17", "CPIC_18", "CPIC_19", "CPIC_20", "CPIC_21", "CPIC_22", "CPIC_23", "CPIC_24", "CPIC_25", "CPIC_26", "CPIC_27", "CPIC_28", "CPIC_29", "CPIC_30", "CPIC_31", "CPIC_32", "CPIC_34", "CPIC_33", "CPIC_35", "CPIC_36", "CPIC_37", "CPIC_38", "CPIC_39", "CPIC_40", "CPIC_41", "CPIC_42", "CPIC_43", "CPIC_44", "CPIC_45", "CPIC_46", "CPIC_47", "CPIC_48", "CPIC_49", "CPIC_50", "CPIC_51", "DTS_01", "DTS_02", "DTS_03", "DTS_04", "DTS_05", "DTS_06", "DTS_07", "DTS_08", "DTS_09", "DTS_10", "DTS_11", "DTS_12", "DTS_13", "DTS_14", "DTS_15", "EHQ_01", "EHQ_02", "EHQ_03", "EHQ_04", "EHQ_05", "EHQ_06", "EHQ_07", "EHQ_08", "EHQ_09", "EHQ_10", "EHQ_11", "EHQ_12", "EHQ_13", "EHQ_14", "EHQ_15", "MDD_1A", "MDD_1B", "MDD_2A", "MDD_2B", "MDD_3A", "MDD_3B", "MDD_4", "MDD_5", "MDD_6", "MDD_7", "MDD_8A", "MDD_8B", "MDD_9", 
"SocAnx_01", "SocAnx_02", "SocAnx_03", "SocAnx_04A", "SocAnx_04B", "SocAnx_05", "DMDD_1A", "DMDD_1B", "DMDD_1C", "DMDD_2A", "DMDD_2B", "DMDD_2C", "DMDD_3A", "DMDD_3B", "DMDD_3C", "DMDD_4A", "DMDD_4B", "DMDD_4C", "DMDD_5A", "DMDD_5B", "DMDD_5C", "DMDD_6A", "DMDD_6B", "DMDD_6C", "DMDD_7A", "DMDD_7B", "DMDD_7C", "DMDD_8A", "DMDD_8B", "DMDD_8C", "DMDD_9A", "DMDD_9B", "DMDD_9C", "DMDD_10A", "DMDD_10B", "DMDD_10C", "Panic_A01", "Panic_A02", "Panic_A03", "Panic_A01A", "Panic_A01B", "Panic_A02A", "Panic_A02B", "Panic_A03A", "Panic_A03B", "Panic_B01", "Panic_B02", "Panic_B03", "Panic_B04", "Panic_B05", "Panic_B06", "Panic_B07", "Panic_B08", "Panic_B09", "Panic_B10", "Panic_B11", "Panic_B12", "Panic_B13", "FGC_Incomplete_Reason", "FGC_PU", "FGC_PU_Zone", "FGC_SRL", "FGC_SRL_Zone", "FGC_SRR", "FGC_SRR_Zone", "FGC_GSD_Zone", "FGC_GSD", "FGC_GSND", "FSQ_01", "FSQ_02", "FSQ_03", "FSQ_04", "FSQ_05a", "FSQ_05b", "FSQ_05c", "FSQ_05d", "FSQ_05e", "FSQ_05f", "FSQ_05g", "FSQ_05h", "FSQ_05i", "FSQ_05j", "FSQ_06", "FSQ_07", "FSQ_08", "IAT_01", "IAT_02", "IAT_03", "IAT_04", "IAT_05", "IAT_06", "IAT_07", "IAT_08", "IAT_09", "IAT_10", "IAT_11", "IAT_12", "IAT_13", "IAT_14", "IAT_15", "IAT_16", "IAT_17", "IAT_18", "IAT_19", "IAT_20", "MFQ_P_01", "MFQ_P_02", "MFQ_P_03", "MFQ_P_04", "MFQ_P_05", "MFQ_P_06", "MFQ_P_07", "MFQ_P_08", "MFQ_P_09", "MFQ_P_10", "MFQ_P_11", "MFQ_P_12", "MFQ_P_13", "MFQ_P_14", "MFQ_P_15", "MFQ_P_16", "MFQ_P_17", "MFQ_P_18", "MFQ_P_19", "MFQ_P_20", "MFQ_P_21", "MFQ_P_22", "MFQ_P_23", "MFQ_P_24", "MFQ_P_25", "MFQ_P_26", "MFQ_P_27", "MFQ_P_28", "MFQ_P_29", "MFQ_P_30", "MFQ_P_31", "MFQ_P_32", "MFQ_P_33", "MFQ_P_34", "MFQ_SR_01", "MFQ_SR_02", "MFQ_SR_03", "MFQ_SR_04", "MFQ_SR_05", "MFQ_SR_06", "MFQ_SR_07", "MFQ_SR_08", "MFQ_SR_09", "MFQ_SR_10", "MFQ_SR_11", "MFQ_SR_12", "MFQ_SR_13", "MFQ_SR_14", "MFQ_SR_15", "MFQ_SR_16", "MFQ_SR_17", "MFQ_SR_18", "MFQ_SR_19", "MFQ_SR_20", "MFQ_SR_21", "MFQ_SR_22", "MFQ_SR_23", "MFQ_SR_24", "MFQ_SR_25", "MFQ_SR_26", "MFQ_SR_27", "MFQ_SR_28", 
"MFQ_SR_29", "MFQ_SR_30", "MFQ_SR_31", "MFQ_SR_32", "MFQ_SR_33", "NIH7_Incomplete_Reason", "PAQ_A_01a", "PAQ_A_01b", "PAQ_A_01c", "PAQ_A_01d", "PAQ_A_01e", "PAQ_A_01f", "PAQ_A_01g", "PAQ_A_01h", "PAQ_A_01i", "PAQ_A_01j", "PAQ_A_01k", "PAQ_A_01l", "PAQ_A_01m", "PAQ_A_01n", "PAQ_A_01o", "PAQ_A_01p", "PAQ_A_01q", "PAQ_A_01r", "PAQ_A_01s", "PAQ_A_01t", "PAQ_A_01u", "PAQ_A_01v", "PAQ_A_01w_text", "PAQ_A_01w", "PAQ_A_01x_text", "PAQ_A_01x", "PAQ_A_01_Average", "PAQ_A_02", "PAQ_A_03", "PAQ_A_04", "PAQ_A_05", "PAQ_A_06", "PAQ_A_07", "PAQ_A_08a", "PAQ_A_08b", "PAQ_A_08c", "PAQ_A_08d", "PAQ_A_08e", "PAQ_A_08f", "PAQ_A_08g", "PAQ_A_09", "Relation", "PBQ_01", "PBQ_01A", "PBQ_02", "PBQ_02A", "PBQ_03", "PBQ_03A", "PBQ_03B", "PBQ_03B_1", "PBQ_03C", "PBQ_04", "PBQ_04A", "PBQ_05", "PBQ_05A", "PBQ_05B", "PBQ_05C", "PBQ_06", "PBQ_06A", "PBQ_07", "PBQ_07A", "PBQ_07B", "PBQ_07C", "PBQ_08", "PBQ_09", "PBQ_09A", "PBQ_10", "PBQ_10A", "PBQ_11", "PBQ_11A", "PBQ_12", "PBQ_12A", "PBQ_13", "PBQ_13A", "PBQ_13B", "PBQ_13D", "PBQ_13E", "PBQ_13F", "PBQ_13G", "PBQ_13H", "PBQ_14", "PBQ_14A", "PBQ_15", "PBQ_16", "PBQ_17", "PBQ_17A", "PBQ_17B", "PBQ_18", "PBQ_18A", "PBQ_19", "PBQ_19A", "PBQ_21", "PBQ_22", "PBQ_23", "PBQ_23A", "PBQ_24", "PBQ_24A", "PBQ_25", "PBQ_25A", "PBQ_26", "PBQ_26A", "PBQ_27", "PBQ_27A", "PCIAT_01", "PCIAT_02", "PCIAT_03", "PCIAT_04", "PCIAT_05", "PCIAT_06", "PCIAT_07", "PCIAT_08", "PCIAT_09", "PCIAT_10", "PCIAT_11", "PCIAT_12", "PCIAT_13", "PCIAT_14", "PCIAT_15", "PCIAT_16", "PCIAT_17", "PCIAT_18", "PCIAT_19", "PCIAT_20", "PPS_M_01", "PPS_M_02", "PPS_M_03", "PPS_M_04", "PPS_M_05", "PPS_M_06", "PPS_M_Score", "PPS_F_01", "PPS_F_02", "PPS_F_03", "PPS_F_04", "PPS_F_05", "PPS_F_06", "PPS_F_Score", "PPVT_IncompleteReason", "PPVT_Valid", "PPVT4_StandardScore", "PSI_01", "PSI_02", "PSI_03", "PSI_04", "PSI_05", "PSI_06", "PSI_07", "PSI_08", "PSI_09", "PSI_10", "PSI_11", "PSI_12", "PSI_13", "PSI_14", "PSI_15", "PSI_16", "PSI_17", "PSI_18", "PSI_19", "PSI_20", "PSI_21", "PSI_22", "PSI_23", 
"PSI_24", "PSI_25", "PSI_26", "PSI_27", "PSI_28", "PSI_29", "PSI_30", "PSI_31", "PSI_32", "PSI_33", "PSI_34", "PSI_35", "PSI_36", "PSI_DC", "PSI_PD", "SCARED_P_01", "SCARED_P_02", "SCARED_P_03", "SCARED_P_04", "SCARED_P_05", "SCARED_P_06", "SCARED_P_07", "SCARED_P_08", "SCARED_P_09", "SCARED_P_10", "SCARED_P_11", "SCARED_P_12", "SCARED_P_13", "SCARED_P_14", "SCARED_P_15", "SCARED_P_16", "SCARED_P_17", "SCARED_P_18", "SCARED_P_19", "SCARED_P_20", "SCARED_P_21", "SCARED_P_22", "SCARED_P_23", "SCARED_P_24", "SCARED_P_25", "SCARED_P_26", "SCARED_P_27", "SCARED_P_28", "SCARED_P_29", "SCARED_P_30", "SCARED_P_31", "SCARED_P_32", "SCARED_P_33", "SCARED_P_34", "SCARED_P_35", "SCARED_P_36", "SCARED_P_37", "SCARED_P_38", "SCARED_P_39", "SCARED_P_40", "SCARED_P_41", "SCARED_SR_01", "SCARED_SR_02", "SCARED_SR_03", "SCARED_SR_04", "SCARED_SR_05", "SCARED_SR_06", "SCARED_SR_07", "SCARED_SR_08", "SCARED_SR_09", "SCARED_SR_10", "SCARED_SR_11", "SCARED_SR_12", "SCARED_SR_13", "SCARED_SR_14", "SCARED_SR_15", "SCARED_SR_16", "SCARED_SR_17", "SCARED_SR_18", "SCARED_SR_19", "SCARED_SR_20", "SCARED_SR_21", "SCARED_SR_22", "SCARED_SR_23", "SCARED_SR_24", "SCARED_SR_25", "SCARED_SR_26", "SCARED_SR_27", "SCARED_SR_28", "SCARED_SR_29", "SCARED_SR_30", "SCARED_SR_31", "SCARED_SR_32", "SCARED_SR_33", "SCARED_SR_34", "SCARED_SR_35", "SCARED_SR_36", "SCARED_SR_37", "SCARED_SR_38", "SCARED_SR_39", "SCARED_SR_40", "SCARED_SR_41", "SCQ_01", "SCQ_02", "SCQ_03", "SCQ_04", "SCQ_05", "SCQ_06", "SCQ_07", "SCQ_08", "SCQ_09", "SCQ_10", "SCQ_11", "SCQ_12", "SCQ_13", "SCQ_14", "SCQ_15", "SCQ_16", "SCQ_17", "SCQ_18", "SCQ_19", "SCQ_20", "SCQ_21", "SCQ_22", "SCQ_23", "SCQ_24", "SCQ_25", "SCQ_26", "SCQ_27", "SCQ_28", "SCQ_29", "SCQ_30", "SCQ_31", "SCQ_32", "SCQ_33", "SCQ_34", "SCQ_35", "SCQ_36", "SCQ_37", "SCQ_38", "SCQ_39", "SCQ_40", "SDQ_01", "SDQ_02", "SDQ_03", "SDQ_04", "SDQ_05", "SDQ_06", "SDQ_07", "SDQ_08", "SDQ_09", "SDQ_10", "SDQ_11", "SDQ_12", "SDQ_13", "SDQ_14", "SDQ_15", "SDQ_16", "SDQ_17", 
"SDQ_18", "SDQ_19", "SDQ_20", "SDQ_21", "SDQ_22", "SDQ_23", "SDQ_24", "SDQ_25", "SDQ_26", "SDQ_27", "SDQ_28", "SDQ_29_a", "SDQ_29_b", "SDQ_29_c", "SDQ_29_d", "SDQ_30", "SWAN_01", "SWAN_02", "SWAN_03", "SWAN_04", "SWAN_05", "SWAN_06", "SWAN_07", "SWAN_08", "SWAN_09", "SWAN_10", "SWAN_11", "SWAN_12", "SWAN_13", "SWAN_14", "SWAN_15", "SWAN_16", "SWAN_17", "SWAN_18", "CSC_01C", "CSC_01P", "CSC_02C", "CSC_02P", "CSC_03C", "CSC_03P", "CSC_04C", "CSC_04P", "CSC_05C", "CSC_05P", "CSC_06C", "CSC_06P", "CSC_07C", "CSC_07P", "CSC_08C", "CSC_08P", "CSC_09C", "CSC_09P", "CSC_10C", "CSC_10P", "CSC_11C", "CSC_11P", "CSC_12C", "CSC_12P", "CSC_13C", "CSC_13P", "CSC_14C", "CSC_14P", "CSC_15C", "CSC_15P", "CSC_16C", "CSC_16P", "CSC_17C", "CSC_17P", "CSC_18C", "CSC_18P", "CSC_19C", "CSC_19P", "CSC_20C", "CSC_20P", "CSC_21C", "CSC_21P", "CSC_22C", "CSC_22P", "CSC_23C", "CSC_23P", "CSC_24C", "CSC_24P", "CSC_25C", "CSC_25P", "CSC_26C", "CSC_26P", "CSC_27C", "CSC_27P", "CSC_28C", "CSC_28P", "CSC_29C", "CSC_29P", "CSC_30C", "CSC_30P", "CSC_31C", "CSC_31P", "CSC_32C", "CSC_32P", "CSC_33C", "CSC_33P", "CSC_34C", "CSC_34P", "CSC_35C", "CSC_35P", "CSC_36C", "CSC_36P", "CSC_37C", "CSC_37P", "CSC_38C", "CSC_38P", "CSC_39C", "CSC_39P", "CSC_40C", "CSC_40P", "CSC_41C", "CSC_41P", "CSC_42C", "CSC_42P", "CSC_43C", "CSC_43P", "CSC_44C", "CSC_44P", "CSC_45C", "CSC_45P", "CSC_46C", "CSC_46P", "CSC_47C", "CSC_47P", "CSC_48C", "CSC_48P", "CSC_49C", "CSC_49P", "CSC_50C", "CSC_50P", "CSC_51C", "CSC_51P", "CSC_52C", "CSC_52P", "CSC_53C", "CSC_53P", "CSC_54C", "CSC_54P", "CSC_55aC", "CSC_55aP", "CSC_55bC", "CSC_55bP", "CSC_55cC", "CSC_55cP", "CSC_55dC", "CSC_55dP", "CSC_55eC", "CSC_55eP", "CSC_55fC", "CSC_55fP", "CSC_55gC", "CSC_55gP", "CSC_55hC", "CSC_55hP", "CSC_55iC", "CSC_55iP"
    AS NUMERICAL;
);

### Creating an analysis schema for the population using CrossCat

Now that we have created the `hbnq` population, the next step is to analyze the data by building probabilistic models which explain the data generating process. Probabilistic data analyses in BayesDB are specified using an `ANALYSIS SCHEMA`. The default model discovery engine in BayesDB is Cross-Categorization [(CrossCat)](http://jmlr.org/papers/v17/11-392.html). CrossCat is a Bayesian factorial mixture model which learns a full joint distribution over all variables in the population, using a divide-and-conquer approach. We will explore CrossCat more in this notebook.

For now we use MML to declare an analysis schema named `hbnq_crosscat` for the `hbnq` population. Note that we have left the schema in `crosscat()` empty, which applies the built-in defaults. Specifying custom analysis schemas is the subject of another tutorial.

In [13]:
%time
%mml CREATE ANALYSIS SCHEMA "hbnq_crosscat" FOR "hbnq" WITH BASELINE crosscat();

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 6.91 µs


After specifying the `hbnq_crosscat` analysis schema, we now need to initialize `ANALYSES` for the schema. We can think of an `ANALYSIS SCHEMA` as specifying a hypothesis space of explanations for the data generating process for the population, and each `ANALYSIS` is a candidate hypothesis. We start by creating only 1 analysis, which is initialized __randomly__.

In [None]:
%mml INITIALIZE 1 ANALYSIS FOR "hbnq_crosscat";

### Visualizing a CrossCat hypothesis

As mentioned earlier, CrossCat learns the full joint distribution of all variables in the population using divide-and-conquer:

- First, CrossCat partitions the variables into a set of _views_; all the variables in a particular view are modeled jointly, and two variables in different views are independent of one another.
- Second, within each view, CrossCat clusters the rows using a non-parametric mixture model.

The name Cross-Categorization is derived from this two-step process: first categorize the variables into views, and then categorize the rows into clusters within each view of variables. It is important to note that two different views A and B are likely to induce different clusterings of the rows.
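
The two-step structure described above can be sketched in plain Python. This is only an illustration of what a randomly initialized hypothesis looks like, not BayesDB's implementation: it assumes both partitions are drawn from a Chinese restaurant process with concentration `alpha`, and the function names and column names are hypothetical.

```python
import random


def crp_partition(n_items, alpha, rng):
    """Assign n_items to groups using a Chinese restaurant process."""
    assignments = []
    counts = []  # counts[k] = number of items already placed in group k
    for _ in range(n_items):
        # An existing group k has weight counts[k]; a new group has weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(0)  # open a new group
        counts[k] += 1
        assignments.append(k)
    return assignments


def random_crosscat_state(columns, n_rows, alpha=1.0, seed=0):
    """Draw one random two-level partition: views of columns, clusters of rows."""
    rng = random.Random(seed)
    # Step 1: partition the variables into views.
    view_of = dict(zip(columns, crp_partition(len(columns), alpha, rng)))
    # Step 2: within each view, partition the rows into clusters.
    n_views = max(view_of.values()) + 1
    clusters = {v: crp_partition(n_rows, alpha, rng) for v in range(n_views)}
    return view_of, clusters


# Hypothetical column names, chosen only for illustration.
view_of, clusters = random_crosscat_state(["a", "b", "c", "d", "e"], n_rows=8)
print(view_of)   # column -> view index
print(clusters)  # view index -> cluster index of each row
```

Because each view carries its own row partition, the same row can sit in different clusters depending on which view is inspected.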

To get a sense of CrossCat's hypothesis space, we can render the hypothesis specified by a particular analysis using the `.render_crosscat [options] <analysis_schema> <analysis_identifier>` plotting command. The `--subsample=50` option says to show only a subsample of 50 rows in the rendering (even though `hbnq_crosscat` is modeling all participants in the `hbnq` population); `--rowlabels=EID` specifies which column in the table to use to label the rows. Finally, `hbnq_crosscat 0` means to render the first (and only) analysis in the `hbnq_crosscat` analysis schema.

In [None]:
%mml .render_crosscat \
    --subsample=50 --rowlabels=EID --xticklabelsize=small --yticklabelsize=xx-small --progress=True --width=64 \
    hbnq_crosscat 0

__To view a full-size image of the rendering, either double click the image, or right-click and select "Open image in new tab."__

Again, we emphasize that the CrossCat hypothesis shown above is __randomly__ initialized based on the two-step clustering process we have described. Each block of columns shows a view of dependent variables. The clusters within a view are demarcated using solid pink lines. Each row is a participant from `hbnq`. The color of a cell shows the magnitude of the data (normalized between 0 and 1, where light indicates lower values and dark indicates higher values).

### Using BQL to query CrossCat models

In the CrossCat rendering, each pair of variables is either in the same view (and therefore probably dependent), or in different views (and therefore independent). We can query the detected probable dependencies between all pairs of variables using the `DEPENDENCE PROBABILITY` estimator in BQL. The next query produces a heatmap of all pairs of dependencies. In the heatmap below, each row and column is a variable, and the color of a cell is a value between 0 and 1 (lighter is nearer to 0, and darker is nearer to 1) indicating the amount of evidence for a predictive relationship or dependency between these two variables. Since we have initialized only 1 CrossCat analysis, each cell is exactly either 0 (if those variables are in different views), or 1 (variables are in the same view). Confirm that the blocks shown in the heatmap match up with the blocks of variables from the rendering. 

In [None]:
%bql .interactive_heatmap ESTIMATE DEPENDENCE PROBABILITY FROM PAIRWISE VARIABLES OF hbnq;

### Improving CrossCat hypotheses using MML `ANALYZE`

Now that we have initialized a CrossCat hypothesis and investigated its state, it is time to improve our initial guess by exploring the hypothesis space to find hypotheses that better explain the data. In particular, our single CrossCat analysis has both spurious dependencies and independencies between variables which we would expect to be dependent. Study the heatmap and rendering: can you locate some of these pairs?

We can improve the CrossCat hypothesis by using the MML `ANALYZE` command, which takes the name of an analysis schema, a number of iterations or seconds, and optional arguments. It then searches for improved hypotheses. Here, the `OPTIMIZED` keyword tells BayesDB to use a faster inference engine; we generally recommend using this flag.

In [None]:
%mml ANALYZE "hbnq_crosscat" FOR 200 ITERATIONS WAIT (OPTIMIZED);

Let us look at the new CrossCat hypothesis after running 200 steps of analysis. Study the dependent variables. Can you identify a "theme" or "category" which summarizes each view? For example, the fourth view from the left has variables "residential electricity consumption", "electricity generation per person", and "hydro production", indicating that this view might summarize energy variables. Which dependencies or independencies are still spurious?

In [None]:
%mml .render_crosscat \
    --subsample=50 --rowlabels=EID --xticklabelsize=small --yticklabelsize=xx-small --progress=True --width=64 \
    hbnq_crosscat 0

We can again use BQL to visualize the probability that a dependence exists between each pair of variables. How does this heatmap differ qualitatively from the dependence probability heatmap we plotted prior to running `ANALYZE`?

In [None]:
%bql .interactive_heatmap ESTIMATE DEPENDENCE PROBABILITY FROM PAIRWISE VARIABLES OF hbnq;

Recall that in addition to learning a clustering of variables, CrossCat additionally learns a clustering of the rows within each view. These clusters are separated using pink lines in the CrossCat rendering. We can use the `SIMILARITY IN THE CONTEXT OF <variable>` query in BQL to study CrossCat's row partition in the view of `<variable>`.

In the heatmap below, each row and column is a participant, and the value of a cell (between 0 and 1) indicates the probability that those two participants are relevant for formulating predictions about each other. Do these clusters of participants make sense?

In [None]:
%bql .interactive_heatmap --label0=rowid --label1=EID --table=hbnq_t \
    ESTIMATE SIMILARITY IN THE CONTEXT OF "ADHD subtype" FROM PAIRWISE hbnq;
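
To make the idea of similarity "in the context of" a variable concrete, here is a minimal sketch. It assumes that similarity is the fraction of analyses in which two rows fall into the same cluster of the view containing the context variable; the data layout (`view_of`, `clusters`), the function name, and the toy values are hypothetical and do not reflect BayesDB's internal representation.

```python
def similarity_in_context(row_a, row_b, context, analyses):
    """Fraction of analyses in which two rows share a cluster in the view of `context`."""
    shared = 0
    for view_of, clusters in analyses:
        # Look up the row partition of the view that contains the context variable.
        rows = clusters[view_of[context]]
        if rows[row_a] == rows[row_b]:
            shared += 1
    return shared / len(analyses)


# Two toy analyses over variables "x" and "y" and three rows (all values hypothetical).
analyses = [
    ({"x": 0, "y": 1}, {0: [0, 0, 1], 1: [0, 1, 1]}),
    ({"x": 0, "y": 0}, {0: [0, 1, 1]}),
]

print(similarity_in_context(0, 1, "x", analyses))  # 0.5
print(similarity_in_context(1, 2, "y", analyses))  # 1.0
```

With a single analysis the result can only be 0 or 1, which is why a heatmap produced from one analysis looks "sharp".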

### Initializing more CrossCat analyses

So far, we used `INITIALIZE 1 ANALYSIS FOR hbnq_crosscat` to create a single analysis. As a result, all of our heatmaps (such as variable dependencies and row similarities) had "sharp" values (either 1 or 0). Since CrossCat has a very large hypothesis space, we can significantly improve modeling by creating an ensemble of analyses, where each analysis searches the hypothesis space for hypotheses that fit the data well. All queries in BQL will then become weighted averages of the query results from each individual analysis in the ensemble.

The `%multiprocess on` magic activates multiprocessing in BayesDB, which allows us to initialize, analyze, and run queries on analyses using multiple cores on the host machine.

In [None]:
%multiprocess on

The following MML command ensures the `hbnq_crosscat` analysis schema will have a total of 32 analyses in the ensemble (recall that we already initialized 1 analysis, so 31 new analyses will be added).

In [None]:
%mml INITIALIZE 32 MODELS IF NOT EXISTS FOR "hbnq_crosscat";

Again, we run analysis, this time for 150 iterations.

In [None]:
%mml ANALYZE "hbnq_crosscat" FOR 150 ITERATIONS WAIT (OPTIMIZED);

Let us produce some renderings of the analyses (here we choose 5, 7 and 15). Where is there consensus among these three analyses? Where do they disagree?

In [None]:
%mml .render_crosscat \
    --subsample=50 --rowlabels=EID --xticklabelsize=small --yticklabelsize=x-small hbnq_crosscat 5
%mml .render_crosscat \
    --subsample=50 --rowlabels=EID --xticklabelsize=small --yticklabelsize=x-small hbnq_crosscat 7
%mml .render_crosscat \
    --subsample=50 --rowlabels=EID --xticklabelsize=small --yticklabelsize=x-small hbnq_crosscat 15

### Exploring probable dependencies between variables and comparing CrossCat dependence probability to linear (Pearson R) correlation

As mentioned earlier, all BQL queries are aggregated across the 32 analyses in the ensemble. We will create a table named `dependencies` which contains the pairwise `DEPENDENCE PROBABILITY` values between the HBNQ variables. The value of a cell (between 0 and 1) is the fraction of analyses in the ensemble where those two variables are detected to be probably dependent (i.e. they are in the same view).

In [None]:
%%bql
CREATE TABLE dependencies AS
ESTIMATE
    DEPENDENCE PROBABILITY AS "depprob"
FROM PAIRWISE VARIABLES OF hbnq;
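
The "fraction of analyses where two variables share a view" can be computed by hand on a toy ensemble. The sketch below uses hypothetical variable names and view assignments, and a plain unweighted average, to show how the values in a table like `dependencies` arise; it is not how BayesDB stores or computes them internally.

```python
from itertools import combinations


def dependence_probability(view_assignments):
    """Map each pair of variables to the fraction of analyses placing them in one view."""
    columns = sorted(view_assignments[0])
    n_analyses = len(view_assignments)
    return {
        (c0, c1): sum(v[c0] == v[c1] for v in view_assignments) / n_analyses
        for c0, c1 in combinations(columns, 2)
    }


# Each dict is one analysis: variable name -> index of the view it belongs to.
ensemble = [
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 1, "c": 1},
    {"a": 0, "b": 0, "c": 0},
    {"a": 1, "b": 1, "c": 0},
]

depprob = dependence_probability(ensemble)
print(depprob[("a", "b")])  # 0.75
print(depprob[("a", "c")])  # 0.25
print(depprob[("b", "c")])  # 0.5
```

With four analyses the values can only be multiples of 0.25; a larger ensemble gives a finer-grained estimate.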

Here are five random rows from the `dependencies` table.

In [None]:
%bql SELECT * FROM "dependencies" ORDER BY RANDOM() LIMIT 5;

We again summarize the `dependencies` table using a heatmap. Study this dependence heatmap, and compare it to the heatmap produced when there was only 1 analysis. Which common-sense dependencies were missed by the single model, but identified by the ensemble as probably dependent?

In [None]:
%bql .interactive_heatmap SELECT name0, name1, depprob FROM dependencies;

Let us compare dependence probabilities from CrossCat to linear (Pearson r) correlation values, a very common technique for finding predictive relationships. We can compute the Pearson R (and its p-value) in BayesDB using the `CORRELATION` and `CORRELATION PVALUE` queries. The following cell creates a table named `correlations`, which contains the R and p-value for all pairs of variables.

In [None]:
%%bql
CREATE TABLE "correlations" AS
ESTIMATE
    CORRELATION AS "correlation",
    CORRELATION PVALUE AS "pvalue"
FROM PAIRWISE VARIABLES OF "hbnq"

Here are five random rows from the `correlations` table.

In [None]:
%bql SELECT * FROM "correlations" ORDER BY RANDOM() LIMIT 5;

__Emphasis__: There is a significant difference between `DEPENDENCE PROBABILITY`, `CORRELATION`, and `CORRELATION PVALUE`. We outline these differences below, which will help us make comparisons between predictive relationships detected by CrossCat versus Pearson correlation.

- `DEPENDENCE PROBABILITY`: Returns a value in [0,1] indicating the __probability there exists__ a predictive relationship (statistical dependence) between two variables.

- `CORRELATION`: Returns a value in [0,1] indicating the __strength__ of the linear relationship between two variables, where 0 means no linear correlation, and 1 means perfect linear correlation.

- `CORRELATION PVALUE`: Returns a value in (0, 1) indicating the tail probability of the observed correlation value between two variables, under the null hypothesis that the two variables have zero correlation.

Based on these distinctions, there is no immediate way to numerically compare `DEPENDENCE PROBABILITY` with `CORRELATION/CORRELATION PVALUE`. However, it is possible to compare the inferences about predictive relationships that each method gives rise to, which we do in the next section.
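
To illustrate what a correlation value and its p-value measure, the following sketch computes Pearson's r for a toy dataset and estimates a two-sided p-value with a permutation test. The permutation test is a stand-in chosen so the example needs only the standard library; it is not claimed to be the method behind `CORRELATION PVALUE`, and the function names and data are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_pvalue(xs, ys, n_permutations=2000, seed=0):
    """Estimate how often shuffled data yields a correlation at least as strong."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any relationship between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly inside (0, 1).
    return (hits + 1) / (n_permutations + 1)


xs = list(range(20))
ys = [2 * x + (-1) ** x for x in xs]  # a strong linear trend with a small wiggle
r = pearson_r(xs, ys)
p = permutation_pvalue(xs, ys)
print(r, p)
```

A small p-value only says that a correlation this strong is unlikely under the null hypothesis of zero correlation; it says nothing about the probability that a dependence exists, which is what `DEPENDENCE PROBABILITY` reports.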

Let us first produce a heatmap of the raw correlation values. The following query shows the raw correlation values (between 0 and 1) for all pairs of variables where the p-value is less than 0.01 (note that we are not accounting for multiple-testing using e.g. Bonferroni correction). Pairs of variables where the p-value exceeds 0.01 (and thus the null hypothesis of independence cannot be rejected) are shown in gray. The sparsity of the data makes it difficult to draw inferences about many variables.

In [None]:
%bql .interactive_heatmap SELECT name0, name1, "correlation" FROM "correlations" WHERE "pvalue" < 0.01

Explore the heatmap, and compare it to the heatmap from `DEPENDENCE PROBABILITY`. The patterns of dependence relationships differ significantly. How?

We can use BQL to find variables which CrossCat believes are probably dependent, but correlation believes are independent (either the null hypothesis of independence cannot be rejected, or the correlation value is significant and near zero).

In [None]:
%%sql
SELECT
    "name0",
    "name1",
    "dependencies"."depprob",
    "correlations"."correlation",
    "correlations"."pvalue"
FROM
    "dependencies"
    JOIN "correlations"
    USING ("name0", "name1")
WHERE
    -- CrossCat: probably dependent.
    "dependencies"."depprob" > 0.85
    AND (
    -- Correlation: cannot reject null hypothesis of independence.
    "correlations"."pvalue" > 0.05
    OR (
    -- Correlation: linear relationship is significant and near zero.
    "correlations"."pvalue" < 0.05 AND "correlations"."correlation" < 0.05))

Let us try to determine why the dependence probability of `female 5-9 years (%)` with `female 20-39 (%)` is very high according to CrossCat (greater than 0.90), but the linear correlation coefficient is near 0 (and the null hypothesis of independence cannot be rejected anyway).

The next cell produces a scatter plot of these two variables. We notice that they are indeed dependent, although the pattern is quite strange; it has a piecewise relationship where `female 20-39` is linearly correlated with `female 5-9 years` with an upward trend for values of `female 5-9 years` less than 10, but with a downward trend for values of `female 5-9 years` greater than 10. What are some possible explanations for the piecewise relationship?

In [None]:
%bql .interactive_scatter SELECT "adhd", "Anx" FROM "hbnq_t_subsample"

This finding illustrates that CrossCat analyses are able to detect non-linear and heteroskedastic predictive relationships between variables, which match our common-sense intuitions. A similar pattern can be seen in the scatter plot between `female 20-39 years` and `children per woman`, shown in the next cell.

In [None]:
%bql .interactive_scatter SELECT "Dx2", "asd" FROM "hbnq_t_subsample"
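
A piecewise relationship of the kind described above (an upward trend followed by a downward trend) can be reproduced on synthetic data to see why linear correlation misses it. The data below is made up for illustration and is unrelated to the variables in the tables; `pearson_r` is a hypothetical helper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# y rises with x up to x = 10, then falls: y is fully determined by x.
xs = list(range(21))
ys = [x if x <= 10 else 20 - x for x in xs]

r_all = pearson_r(xs, ys)             # the two trends cancel out
r_left = pearson_r(xs[:11], ys[:11])  # upward branch only
r_right = pearson_r(xs[10:], ys[10:])  # downward branch only
print(r_all, r_left, r_right)
```

Over the full range the coefficient is numerically zero even though `ys` is a deterministic function of `xs`, while each branch on its own is perfectly linear (correlation of 1 and -1 respectively).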

We can also use BQL to find variables which CrossCat believes are probably independent, but correlation believes are dependent (a statistically significant non-zero correlation value, where we are using an R cutoff of 0.15). The following query shows a list of such variables.

In [None]:
%%sql
SELECT
    "name0",
    "name1",
    "dependencies"."depprob",
    "correlations"."correlation",
    "correlations"."pvalue"
FROM
    "dependencies"
    JOIN "correlations"
    USING ("name0", "name1")
WHERE
    -- CrossCat: probably independent (low dependence probability).
    "dependencies"."depprob" < 0.05
    AND (
    -- Correlation: statistically significant dependence.
    "correlations"."pvalue" < 0.05 AND "correlations"."correlation" > 0.15)
LIMIT 10
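
The two filters above can be tried outside BayesDB on a toy database. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the `%%sql` cells, with three made-up variable pairs whose values are purely hypothetical; it only demonstrates how the `JOIN ... USING` and the `WHERE` clauses select disagreeing pairs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dependencies (name0 TEXT, name1 TEXT, depprob REAL);
CREATE TABLE correlations (name0 TEXT, name1 TEXT, correlation REAL, pvalue REAL);
-- Hypothetical values for three pairs of variables.
INSERT INTO dependencies VALUES ('a', 'b', 0.95), ('a', 'c', 0.02), ('b', 'c', 0.90);
INSERT INTO correlations VALUES
    ('a', 'b', 0.01, 0.60), ('a', 'c', 0.40, 0.001), ('b', 'c', 0.80, 0.001);
""")

QUERY = """
SELECT name0, name1, depprob, correlation, pvalue
FROM dependencies JOIN correlations USING (name0, name1)
WHERE {condition}
ORDER BY name0, name1
"""

# CrossCat: probably dependent; correlation: not significant, or significant but near zero.
crosscat_only = conn.execute(QUERY.format(
    condition="depprob > 0.85 AND (pvalue > 0.05 "
              "OR (pvalue < 0.05 AND correlation < 0.05))"
)).fetchall()

# CrossCat: probably independent; correlation: significant and above the cutoff.
correlation_only = conn.execute(QUERY.format(
    condition="depprob < 0.05 AND pvalue < 0.05 AND correlation > 0.15"
)).fetchall()

print(crosscat_only)     # [('a', 'b', 0.95, 0.01, 0.6)]
print(correlation_only)  # [('a', 'c', 0.02, 0.4, 0.001)]
```

The pair `('b', 'c')` appears in neither result because both methods agree that it is dependent.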

-----

Let us generate scatter plots of some of these variables, such as `armed forces personnel` versus `epidemic affected`, and `working hours per week` versus `air accidents killed`, which linear correlation spuriously detects as dependent, but CrossCat correctly identifies as probably independent.

%bql .interactive_scatter SELECT "armed forces personnel", "epidemic affected" FROM "gapminder_t_subsample"

%bql .interactive_scatter SELECT "working hours per week", "air accidents killed" FROM "gapminder_t_subsample"

We notice that linear correlation is deceived into detecting a dependency due to a single outlier in both cases. As a non-parametric mixture model, CrossCat is more robust to outliers and irregular patterns in the data, especially when there is insufficient evidence in the data to result in CrossCat reporting probable dependencies (as is the case in the two scatter plots, with only one data point deviating from the zero-dependence trend).

Bonferroni correction for multiple testing would perhaps render these correlations statistically insignificant. However, Bonferroni is also highly conservative, and would cause many common-sense relationships to become insignificant as well under linear correlation. These design trade-offs are very common when drawing inferences from frequentist methods such as Pearson R.
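
To see how conservative Bonferroni becomes for pairwise tests, the sketch below computes the corrected per-test threshold when every pair of variables is tested once. The function name and the variable count of 100 are hypothetical and chosen only to show the scale of the effect.

```python
def bonferroni_threshold(alpha, n_variables):
    """Return (number of pairwise tests, per-test significance threshold)."""
    # Every unordered pair of distinct variables is one hypothesis test.
    n_tests = n_variables * (n_variables - 1) // 2
    return n_tests, alpha / n_tests


n_tests, threshold = bonferroni_threshold(alpha=0.05, n_variables=100)
print(n_tests)    # 4950
print(threshold)  # roughly 1e-05
```

With 4950 tests, a correlation must reach a p-value near 0.00001 to stay significant at an overall level of 0.05, which is why many plausible relationships would be discarded along with the spurious ones.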

Some next questions you might explore include:

- For which variables do CrossCat and linear correlation agree about dependencies?
- Which pairs of variables have the most uncertainty about their dependence probability (a dependence probability value of 0.5 represents the most uncertainty, or a light green color)?

### Finding countries which are predictively relevant to one another

Now that we have explored dependencies between variables found by CrossCat, we next move to finding which countries in the Gapminder population are relevant to formulating predictions about one another. Recall from the visualizations of CrossCat that, within each _view_ of dependent variables, CrossCat learns a clustering of the countries. These country clusterings will typically differ based on the view. Also recall that different CrossCat analyses in the ensemble (we have 32 in total) will find different clusterings for both variables and rows, which we aggregate over using BQL queries.

We will use the BQL `PREDICTIVE RELEVANCE` query to find rows which are relevant for formulating predictions about one another, in the "context" of a variable in the population. The context of a variable is the set of variables in the CrossCat view for the context variable. A full description of this query can be found in 

> _Probabilistic Search for Structured Data via Probabilistic Programming and Nonparametric Bayes_. Saad, F.; Casarsa, L.; and Mansinghka, V. arXiv preprint, arXiv:1704.01087. 2017. [PDF](https://arxiv.org/abs/1704.01087).

In this tutorial, we will focus only on the simplest queries for exploratory analysis.

The next set of queries are more efficient to run without multiprocessing, so we will disable it.

%multiprocess off

Our first query finds the top 15 countries which are relevant to formulating predictions about the `life expectancy at birth` (and variables in its context) for the `United States`. We find that the result set contains rich nations with Western-style democracies and similar standards of living to that of the US. Notice that the set of relevant countries in this case is not based on geographic features.

%%bql
.interactive_bar 
ESTIMATE
    "country",
    PREDICTIVE RELEVANCE
        TO EXISTING ROWS (country='United States')
        IN THE CONTEXT OF "life expectancy at birth"
        AS "Relevance to United States (Life Expectancy)"
FROM gapminder
ORDER BY "Relevance to United States (Life Expectancy)"
DESC LIMIT 15

We can also find the list of countries which are relevant to the `United States` in the context of `armed forces personnel`. The result set is now much different, where almost no countries are predictively relevant. Why?

%%bql
.interactive_bar 
ESTIMATE
    "country",
    PREDICTIVE RELEVANCE
        TO EXISTING ROWS (country='United States')
        IN THE CONTEXT OF "armed forces personnel"
        AS "Relevance to United States (Military Personnel)"
FROM gapminder
ORDER BY "Relevance to United States (Military Personnel)" DESC LIMIT 10

Let us sort the countries by their `armed forces personnel`. We notice that there are other countries whose `armed forces personnel` is on the same scale as the US. But none of these countries are typically considered to share the same macroeconomic or political characteristics as the United States, so they are not detected as predictively relevant.

%bql SELECT "country", "armed forces personnel" FROM gapminder_t ORDER BY "armed forces personnel" DESC limit 10

We now find the top 15 countries which are relevant to formulating predictions about the `life expectancy at birth` (and variables in its context) for Ukraine. Notice that these countries are all Eastern European nations and former Soviet states. In this case, CrossCat has recovered relevances which are related to both geography and shared macroeconomic and political characteristics.

%%bql
.interactive_bar 
ESTIMATE
    "country",
    PREDICTIVE RELEVANCE
        TO EXISTING ROWS (country='Ukraine')
        IN THE CONTEXT OF "life expectancy at birth"
        AS "Relevance to Ukraine (Life Expectancy)"
FROM gapminder
ORDER BY "Relevance to Ukraine (Life Expectancy)" DESC LIMIT 15