<img src="resources/header.png"/>

# Questionnaire shortening

Authored by: Ulrich Schaechtle and Veronica Weiner of the MIT Probabilistic Computing Project (Probcomp). Prepared for: meeting with Arno Klein and Jon Clucas of the Child Mind Institute.

This notebook demonstrates how to shorten a questionnaire using BayesDB. Specifically, we show how to use BayesDB's conditional mutual information query to select a set of questions (here, 10 of the original ~800) that, taken together, are most informative about a given diagnosis (here, (i) autism and (ii) attention deficit hyperactivity disorder, ADHD).

## Outline:

1. Building a probabilistic model for the questionnaire data with BayesDB
2. Selecting questions for a shortened questionnaire
3. Results: (a) for autism, (b) for ADHD


## 1. Building a probabilistic model for the questionnaire data with BayesDB

In [None]:
%run resources/utils
%load_ext jupyter_probcomp.magics

In [None]:
%matplotlib inline
%vizgpm inline

In [None]:
%bayesdb resources/bdb/questionnaire_shortening.bdb

In [None]:
%bql CREATE TABLE "raw_questionnaire_responses" FROM 'resources/data/init_data.csv'

In [None]:
%bql .nullify raw_questionnaire_responses ''

In [None]:
%%mml
CREATE POPULATION questionnaire_responses_population FOR "raw_questionnaire_responses" WITH SCHEMA (
    GUESS STATTYPES FOR (*);
    MODEL
         "Neurodevelopmental Disorders",
         "Substance Related and Addictive Disorders",
         "Adjustment Disorder",
         "Feeding and Eating Disorders",
         "SCQ_30",
         "SCQ_01",
         "Schizophrenia Spectrum and other Psychotic Disorders",
         "Neurodevelopmental Disorder",
         "Trauma and Stressor Related Disorders",
         "Tic Disorder",
         "Elimination Disorders",
         "SCQ_28",
         "Other Conditions That May Be a Focus of Clinical Attention",
         "Bipolar and Related Disorders",
         "Obsessive Compulsive and Related Disorders",
         "Motor Disorder",
         "Somatic Symptom and Related Disorders",
         "Intellectual Disability",
         "Sleep-Wake Disorders" 
    AS 
        NOMINAL;
    MODEL 
         "Age" 
    AS 
        NUMERICAL;
    IGNORE
         "EID";
);

In [None]:
%mml CREATE ANALYSIS SCHEMA "questionnaire_responses_m" FOR questionnaire_responses_population WITH BASELINE crosscat();

In [None]:
%mml INITIALIZE 1 ANALYSES FOR "questionnaire_responses_m";

In [None]:
%mml ANALYZE "questionnaire_responses_m" FOR 240 MINUTES WAIT (OPTIMIZED);

## 2. Selecting questions for a shortened questionnaire

The data table (`raw_questionnaire_responses`) comprises roughly 800 unique questions drawn from a set
of questionnaires. The aim of asking subjects all of those questions is to gather information
about certain diagnoses, such as autism or ADHD. But having every child
answer all of the roughly 800 questions takes time. Can we select just the
10 most informative questions for a given diagnosis?
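
For reference, "most informative" can be made precise with conditional mutual information. The block below is a sketch in standard notation, not a description of BayesDB internals. The symbols are our own: $D$ is the diagnosis, $Q$ a candidate question, and $S$ the set of questions already selected. The second line states the greedy rule that the selection loop later in this notebook appears to follow.

```latex
% Conditional mutual information between a question Q and the diagnosis D,
% given the already selected questions S (symbols are illustrative).
I(Q; D \mid S)
  = \mathbb{E}_{p(q, d, s)}\left[
      \log \frac{p(q, d \mid s)}{p(q \mid s)\, p(d \mid s)}
    \right]

% Greedy step: add the candidate that is most informative about D
% beyond what the selected questions S already reveal.
Q^{\ast} = \arg\max_{Q \notin S} I(Q; D \mid S)
```

Conditioning on $S$ is what discourages redundant picks: a question that merely repeats information already carried by the selected questions scores close to zero.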


In [None]:
desired_number_of_questions = 10

We demonstrate that it is possible to shorten the Child Mind questionnaire by
generating shortened versions of it for the following diagnoses:
- Autism Spectrum Disorder
- Attention-Deficit/Hyperactivity Disorder

In [None]:
diagnoses = ["Autism Spectrum Disorder", "Attention-Deficit/Hyperactivity Disorder"]

The number of samples is a Monte Carlo accuracy parameter: more samples give a more
accurate estimate at a higher computational cost. It is also set to 10 for this demo.

In [None]:
n_samples = 10
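
To illustrate why the number of samples matters, here is a minimal, self-contained sketch of a Monte Carlo estimate of mutual information on a made-up joint distribution over two binary variables. It does not call BayesDB, and how BayesDB uses its samples internally is not shown in this notebook. The names `JOINT`, `mc_mi`, and `spread` are hypothetical. The code avoids `print` so that it runs under both Python 2 and Python 3.

```python
import math
import random

# Hypothetical joint distribution over two binary variables (x, y).
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
P_X = {0: 0.5, 1: 0.5}  # marginal of x implied by JOINT
P_Y = {0: 0.5, 1: 0.5}  # marginal of y implied by JOINT


def log_ratio(x, y):
    """Pointwise term log p(x, y) / (p(x) p(y)), in nats."""
    return math.log(JOINT[(x, y)] / (P_X[x] * P_Y[y]))


def exact_mi():
    """Exact mutual information: the expectation of log_ratio under JOINT."""
    return sum(p * log_ratio(x, y) for (x, y), p in JOINT.items())


def sample_pair(rng):
    """Draw one (x, y) pair from JOINT by inverting the cumulative sum."""
    u = rng.random()
    cumulative = 0.0
    for pair, p in sorted(JOINT.items()):
        cumulative += p
        if u < cumulative:
            return pair
    return pair  # guard against floating-point round-off


def mc_mi(n, rng):
    """Monte Carlo estimate: average log_ratio over n sampled pairs."""
    return sum(log_ratio(*sample_pair(rng)) for _ in range(n)) / n


def spread(n, repeats, rng):
    """Standard deviation of repeated n-sample estimates."""
    estimates = [mc_mi(n, rng) for _ in range(repeats)]
    mean = sum(estimates) / repeats
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / repeats)


rng = random.Random(0)
spread_10 = spread(10, 200, rng)      # noisy: comparable to the MI itself
spread_1000 = spread(1000, 200, rng)  # roughly ten times tighter
```

The spread of the estimate shrinks roughly with the square root of the number of samples, so 10 samples keeps a demo fast but leaves the scores noisy. A larger value is a reasonable choice when the ranking of questions matters.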

In [None]:
for diagnosis in diagnoses:
    print "==== Shorten questionnaire to detect: %s. ====" % diagnosis
    # Get the names of all candidate variables for the shortened questionnaire.
    df = %bql SELECT * FROM raw_questionnaire_responses LIMIT 1
    candidate_questions = df.columns.tolist()
    # Remove the diagnosis from the candidates.
    candidate_questions.remove(diagnosis)
    # Remove the subject ID from the candidates.
    candidate_questions.remove("EID")

    selected_questions = [] # Initialize: no questions selected yet.
    # While we don't have the desired number of questions, keep searching.
    while len(selected_questions) < desired_number_of_questions:
        current_scores = []
        # Loop through all the candidate questions.
        for next_column in candidate_questions:
            current_bql_pattern = get_bql_pattern(
                next_column,
                diagnosis,
                n_samples,
                selected_questions
            )
            if (not selected_questions) and next_column == candidate_questions[0]:
                print ""
                print "First CMI query run to shorten the questionnaire:"
                print ""
                print current_bql_pattern
            df = %bql {current_bql_pattern}
            current_scores.append(read_mi(df))
        new_question = candidate_questions[argmax(current_scores)]
        selected_questions.append(new_question)
        candidate_questions.remove(new_question)
    print ""
    print "Last CMI query run to shorten the questionnaire:"
    print current_bql_pattern
    print ""
    print "Questions selected:"
    for question in selected_questions:
        print "   -- " + question
    print "---------------------------------"
    print ""
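
The loop above performs greedy forward selection. As a rough, BayesDB-free illustration of the same idea, the sketch below runs it on synthetic binary data, scoring each candidate with a plug-in (count-based) estimate of conditional mutual information. This is an assumption-laden toy: the notebook's scores come from BayesDB queries against the fitted model, not from raw counts, and every name here (`conditional_mutual_information`, `greedy_select`, the `q*` columns) is hypothetical.

```python
import math
import random
from collections import Counter


def conditional_mutual_information(rows, x, y, given):
    """Plug-in estimate of I(x; y | given) in nats from a list of dict rows."""
    n = float(len(rows))
    c_xyz, c_xz, c_yz, c_z = Counter(), Counter(), Counter(), Counter()
    for row in rows:
        z = tuple(row[g] for g in given)
        c_xyz[(row[x], row[y], z)] += 1
        c_xz[(row[x], z)] += 1
        c_yz[(row[y], z)] += 1
        c_z[z] += 1
    total = 0.0
    for (vx, vy, z), count in c_xyz.items():
        # p(x, y | z) / (p(x | z) p(y | z)) written in terms of counts.
        ratio = (count * c_z[z]) / float(c_xz[(vx, z)] * c_yz[(vy, z)])
        total += (count / n) * math.log(ratio)
    return total


def greedy_select(rows, target, k):
    """Pick k columns, each maximizing CMI with target given earlier picks."""
    candidates = [name for name in sorted(rows[0]) if name != target]
    selected = []
    while len(selected) < k and candidates:
        scores = [
            conditional_mutual_information(rows, c, target, selected)
            for c in candidates
        ]
        best = candidates[scores.index(max(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected


def flip(bit, prob, rng):
    """Return bit flipped with probability prob."""
    return 1 - bit if rng.random() < prob else bit


# Synthetic data: one strong question, an exact copy of it, a weaker but
# independent question, and pure noise.
rng = random.Random(0)
rows = []
for _ in range(2000):
    d = rng.randint(0, 1)
    q1 = flip(d, 0.1, rng)
    rows.append({
        "diagnosis": d,
        "q1_strong": q1,
        "q2_copy_of_q1": q1,
        "q3_weaker": flip(d, 0.2, rng),
        "q4_noise": rng.randint(0, 1),
    })

selected = greedy_select(rows, "diagnosis", 2)
```

On this toy data the strong question is picked first. Its exact copy then scores zero, because it adds nothing once the original is known, so the second pick is the weaker but non-redundant question. The same effect may explain why the selected questions in the results below span several different behaviors rather than repeating near-duplicates.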

## 3a. Results for Autism

### Questions selected for a shortened questionnaire about autism


#### Autism Spectrum Screening Questionnaire (ASSQ)
- ASSQ_16:  [Child] can be with other children but only on his/her terms
- ASSQ_17:  [Child] lacks best friend
- ASSQ_20:  [Child] has clumsy, ill coordinated, ungainly, awkward movements or gestures
- ASSQ_11:  [Child] uses language freely but fails to make adjustments to fit social contexts or the needs of different listeners

#### Social Communication Questionnaire (SCQ)
- SCQ_03: Does she/he ever use odd phrases or say the same thing over and over in almost exactly the same way (either phrases that she/he hears other people use or ones that she/he makes up)?
- SCQ_04: Does she/he ever use socially inappropriate questions or statements? For example, does she/he ever regularly ask personal questions or make personal comments at awkward times?
- SCQ_08: Does she/he ever have things that she/he seems to have to do in a very particular way or order or rituals that she/he insists that you go through?
- SCQ_11: Does she/he ever have any interests that preoccupy her/him and might seem odd to other people (e.g., traffic lights, drainpipes, or timetables)?
- SCQ_31: Does she/he ever try to comfort you if you are sad or hurt?

#### Mood and Feelings Questionnaire (MFQ) Parent Report
- MFQ_P_30: S/he thought s/he could never be as good as other kids.

## 3b. Results for ADHD

### Questions selected for a shortened questionnaire about ADHD


#### The SWAN Rating Scale for ADHD (SWAN)
- SWAN_03: [Child] Listens when spoken to directly.
- SWAN_04: [Child] Follows through on instructions and finishes school work and chores
- SWAN_05: [Child] Organizes tasks and activities
- SWAN_07: [Child] Keeps track of things necessary for activities (doesn't lose them)
- SWAN_11: [Child] Stays seated (when required by class rules or social conventions)
- SWAN_14: [Child] Settles down and rests (controls excessive talking)
- SWAN_16: [Child] Reflects on questions (controls blurting out answers)
- SWAN_17: [Child] Awaits turn (stands in line and takes turns)
- SWAN_18: [Child] Enters into conversation and games without interrupting or intruding

#### Strengths and Difficulties Questionnaire (SDQ)
- SDQ_25: Good attention span, sees chores or homework through to the end