# UNSEEN prevalence of complex mental health difficulties

The purpose of this Jupyter notebook is to query the Connected Bradford database and calculate the prevalence of complex mental health difficulties within the Connected Bradford cohort.

The current approach to identifying patients with complex mental health difficulties is to search for the presence of either particular clinical codes and prescriptions of particular medications. The clinical conditions of interest are shown below with their parent SNOMED CT code. All these codes and their children are searched:

| Condition | Parent <br /> SNOMED code |
| --------- | ---------- |
| Bipolar disorder | 13746004 |
|         Borderline personality disorder | 20010003 |
|                      Chronic depression | 192080009 |
|  Chronic post-traumatic stress disorder | 313182004 |
|  Complex post-traumatic stress disorder | 443919007 |
|         Developmental academic disorder | 1855002 |
|                               Dysthymia | 78667006 |
|                    Personality disorder | 33449004 |
|                           Schizophrenia | 58214004 |

The medications of interest are shown below. The name of the medication is searched so any dose or, for example, salt of a medication are included in the search:

| Drugs used in psychoses <br /> and related disorders | Hypnotics and anxiolytics | Antidepressants |
| --- | --- | ---
| Risperidone | Diazepam | Clomipramine |
| Olanzapine | Zopiclione | Citalopram |
| Quetiapine | | Duloxetine |
| | | Excitalopram |
| | | Lupentioxol |
| | | Mirtazapine |
| | | Paroxetine |
| | | Sertraline |
| | | Trazodone |
| | | Venlafaxine |
| | | Floxetine |


The follow caveats apply:
- patients with a record in Connected Bradford's primary care dataset.
- patients aged between 18 and 70 years old, inclusive.
- identification based on the records of medication prescriptions only refer to
    - repeat presciptions,
    - within the previous year from the date the query is run,
    - by matching the name of the prescribed medication with the main drug name.
- identification based on the records of clinical conditions query parent codes and all children, active or inactive.

The first step is to set up the Google BigQuery client and load the codelists and list of medications. Each of the clinical codelists are defined using [OpenCodelist](https://opencodelists.org).

In [6]:
import pandas
import numpy
from google.cloud import bigquery
from datetime import date
from dateutil.relativedelta import relativedelta
from IPython.display import Markdown as md, display

# Instantiate bigQuery client.
client = bigquery.Client()

# Set folder path.
folder = '/home/jupyter/UNSEEN/codelists/'

# Clinical codes of interest.
codes_to_query_bipolar = pandas.read_csv(folder + "ciaranmci-bipolar-disorder-6a0308d7.csv")
codes_to_query_borderline = pandas.read_csv(folder + "ciaranmci-borderline-personality-disorder-1ed4af38.csv")
codes_to_query_depression = pandas.read_csv(folder + "ciaranmci-chronic-depression-53a65598.csv")
codes_to_query_chronicPTSD = pandas.read_csv(folder + "ciaranmci-chronic-post-traumatic-stress-disorder-3a96e263.csv")
codes_to_query_complexPTSD = pandas.read_csv(folder + "ciaranmci-complex-post-traumatic-stress-disorder-21876f2e.csv")
codes_to_query_devAcademicDisorder = pandas.read_csv(folder + "ciaranmci-developmental-academic-disorder-50f395a2.csv")
codes_to_query_dysthymia = pandas.read_csv(folder + "ciaranmci-dysthymia-6f6888c3.csv")
codes_to_query_personalityDisorder = pandas.read_csv(folder + "ciaranmci-personality-disorder-243a2f24.csv")
codes_to_query_schizophrenia = pandas.read_csv(folder + "ciaranmci-schizophrenia-05c53c03.csv")
codes_to_query_all = pandas.read_csv(folder + "ciaranmci-unseen-snomed-codes-to-identify-cmhd-0b2abbef.csv")

# Medications of interest.
medications_to_query_psychosisAndRelated = pandas.read_csv(folder + "UNSEEN medications_psychosisAndRelated.csv")
medications_to_query_hypnoticsAndAnxiolytics = pandas.read_csv(folder + "UNSEEN medications_hypnoticsAndAnxiolytics.csv")
medications_to_query_antidepressants = pandas.read_csv(folder + "UNSEEN medications_antidepressants.csv")
medications_to_query_all = pandas.read_csv(folder + "UNSEEN medications list.csv")


Next, we create a Python pandas.Dataframe that contains the results from a SQL query of the Connected Bradford dataset. The columns in our pandas.DataFrame will be:
- person_id
    - A unique identifer for each person.
    - NOTE: The count of unique person_id gives us our denominator in all prevalence estimates (n = 699,622). The count of person_id that results from the code below is equal to the count of unique people that have a clinical code or have have a prescription in the Connected Bradford primary care data table (v5), who are aged between 18 and 70, inclusive. This is a subset of the entire Connected Bradford dataset (n_all = 6,956,643) who are 18-70 year olds in Connected Bradford (n_18to70 = 1,384,509); counts taken on 24th Oct 2022.
- Bipolar
    - Integer indicator for the presence of a clinical code for bipolar disorder.
- Borderline
    - Integer indicator for the presence of a clinical code for borderline personality disorder.
- ChronicPTSD
    - Integer indicator for the presence of a clinical code for chronic post-traumatic stress disorder.
- ComplexPTSD
    - Integer indicator for the presence of a clinical code for complex posttraumatic stress disorder.
- Depression
    - Integer indicator for the presence of a clinical code for depression.
- DevAcademicDisorder
    - Integer indicator for the presence of a clinical code for developmental academic disorder.
- Dysthymia
    - Integer indicator for the presence of a clinical code for dysthymia disorder.
- PersonalityDisorder
    - Integer indicator for the presence of a clinical code for personality disorder.
- Schizophrenia
    - Integer indicator for the presence of a clinical code for schizophrenia disorder.
- Meds_PsychosisAndRelated
    - Integer indicator for the presence of any of the names of any of the medications of interest in the group "Drugs used in psychoses and related disorders".
- Meds_hypnoticsAndAnxiolytics
    - Integer indicator for the presence of any of the names of any of the medications of interest in the group "Hypnotics and anxiolytics".
- Meds_antidepressants
    - Integer indicator for the presence of any of the names of any of the medications of interest in the group "Antidepressants".
- ClinicalCode_Any
    - Integer indicator for the presence of a clinical code for any of the aforementioned conditions.
- Meds_Any
    - Integer indicator for the presence of any of the names of any of the medications of interest in any of the aforementioned groups of medications of interest.
- \_Any 
    - Integer indicator for the presence of a clinical code for any of the aforementioned conditions _OR_ the presence of any of the names of any of the medications of interest in any of the aforementioned groups of medications of interest.

In [4]:
sql = """
WITH
# The first CTE will specify the 'spine' of the data table by selecting the unique list of person IDs.
tbl_persons AS (
    SELECT
        DISTINCT person_id
    FROM
        yhcr-prd-phm-bia-core.CY_MYSPACE_CMC.person
    # Limiting to age range 18-70.
    WHERE
        (EXTRACT(YEAR FROM CURRENT_DATE()) - year_of_birth) BETWEEN 18 AND 70
),

# The following CTEs extract each clinical codelist into a SQL table before querying the person_ID 
# associated with the clinical codes.
#
#  ## Bipolar disorder
tbl_bipolar AS ( 
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_bipolar["code"].tolist())) + """'
                ]) AS snomedcode
),
tbl_bipolar_persons AS (
    SELECT
        DISTINCT person_id
        ,src_snomedcode
    FROM
        yhcr-prd-phm-bia-core.CY_MYSPACE_CMC.tbl_SRCode, tbl_bipolar
    WHERE
        src_snomedcode IN (tbl_bipolar.snomedcode)
),
#  ## Borderline personality disorder
tbl_borderline AS ( 
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_borderline["code"].tolist())) + """'
                ]) AS snomedcode
),
tbl_borderline_persons AS (
    SELECT
        DISTINCT person_id
        ,src_snomedcode
    FROM
        yhcr-prd-phm-bia-core.CY_MYSPACE_CMC.tbl_SRCode, tbl_borderline
    WHERE
        src_snomedcode IN (tbl_borderline.snomedcode)
),
#  ## Chronic PTSD
tbl_chronicPTSD AS ( 
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_chronicPTSD["code"].tolist())) + """'
                ]) AS snomedcode
),
tbl_chronicPTSD_persons AS (
    SELECT
        DISTINCT person_id
        ,src_snomedcode
    FROM
        yhcr-prd-phm-bia-core.CY_MYSPACE_CMC.tbl_SRCode, tbl_chronicPTSD
    WHERE
        src_snomedcode IN (tbl_chronicPTSD.snomedcode)
),
#  ## Complex PTSD
tbl_complexPTSD AS ( 
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_chronicPTSD["code"].tolist())) + """'
                ]) AS snomedcode
),
tbl_complexPTSD_persons AS (
    SELECT
        DISTINCT person_id
        ,src_snomedcode
    FROM
        yhcr-prd-phm-bia-core.CY_MYSPACE_CMC.tbl_SRCode, tbl_complexPTSD
    WHERE
        src_snomedcode IN (tbl_complexPTSD.snomedcode)
),
#  ## Depression
tbl_depression AS ( 
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_depression["code"].tolist())) + """'
                ]) AS snomedcode
),
tbl_depression_persons AS (
    SELECT
        DISTINCT person_id
        ,src_snomedcode
    FROM
        yhcr-prd-phm-bia-core.CY_MYSPACE_CMC.tbl_SRCode, tbl_depression
    WHERE
        src_snomedcode IN (tbl_depression.snomedcode)
),
#  ## Developmental academic disorder
tbl_devAcademicDisorder AS ( 
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_devAcademicDisorder["code"].tolist())) + """'
                ]) AS snomedcode
),
tbl_devAcademicDisorder_persons AS (
    SELECT
        DISTINCT person_id
        ,src_snomedcode
    FROM
        yhcr-prd-phm-bia-core.CY_MYSPACE_CMC.tbl_SRCode, tbl_devAcademicDisorder
    WHERE
        src_snomedcode IN (tbl_devAcademicDisorder.snomedcode)
),
#  ## Dysthymia
tbl_dysthymia AS ( 
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_dysthymia["code"].tolist())) + """'
                ]) AS snomedcode
),
tbl_dysthymia_persons AS (
    SELECT
        DISTINCT person_id
        ,src_snomedcode
    FROM
        yhcr-prd-phm-bia-core.CY_MYSPACE_CMC.tbl_SRCode, tbl_dysthymia
    WHERE
        src_snomedcode IN (tbl_dysthymia.snomedcode)
),
#  ## Personality disorder
tbl_personalityDisorder AS ( 
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_personalityDisorder["code"].tolist())) + """'
                ]) AS snomedcode
),
tbl_personalityDisorder_persons AS (
    SELECT
        DISTINCT person_id
        ,src_snomedcode
    FROM
        yhcr-prd-phm-bia-core.CY_MYSPACE_CMC.tbl_SRCode, tbl_personalityDisorder
    WHERE
        src_snomedcode IN (tbl_personalityDisorder.snomedcode)
),
#  ## Personality disorder
tbl_schizophrenia AS ( 
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_schizophrenia["code"].tolist())) + """'
                ]) AS snomedcode
),
tbl_schizophrenia_persons AS (
    SELECT
        DISTINCT person_id
        ,src_snomedcode
    FROM
        yhcr-prd-phm-bia-core.CY_MYSPACE_CMC.tbl_SRCode, tbl_schizophrenia
    WHERE
        src_snomedcode IN (tbl_schizophrenia.snomedcode)
),


# The following CTEs extract each medication list into a SQL table before querying the person_ID 
# associated with the medications (combined into medication type).
#
#  ## Drugs used in psychosis and related disorders.
tbl_meds_psychosisAndRelated AS (
    SELECT
        Medication
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, medications_to_query_psychosisAndRelated["Medication"].tolist())) + """'
                ]) AS Medication
),
tbl_meds_psychosisAndRelated_persons AS (
    SELECT
        DISTINCT Tblb.person_id,
        tbl_meds_psychosisAndRelated.Medication
    FROM
        tbl_meds_psychosisAndRelated
    LEFT JOIN
        yhcr-prd-phm-bia-core.CY_MYSPACE_CMC.tbl_SRPrimaryCareMedication AS Tblb
    ON
        Tblb.src_nameofmedication LIKE CONCAT('%',tbl_meds_psychosisAndRelated.Medication,'%')
    WHERE CAST(src_isrepeatmedication AS BOOL) IS TRUE 
        #AND CAST(src_datemedicationstart AS DATE) > DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
),
#  ## Hypnotics and anxiolyitcs
tbl_meds_hypnoticsAndAnxiolytics AS (
    SELECT
        Medication
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, medications_to_query_hypnoticsAndAnxiolytics["Medication"].tolist())) + """'
                ]) AS Medication
),
tbl_meds_hypnoticsAndAnxiolytics_persons AS (
    SELECT
        DISTINCT Tblb.person_id,
        tbl_meds_hypnoticsAndAnxiolytics.Medication
    FROM
        tbl_meds_hypnoticsAndAnxiolytics
    LEFT JOIN
        yhcr-prd-phm-bia-core.CY_MYSPACE_CMC.tbl_SRPrimaryCareMedication AS Tblb
    ON
        Tblb.src_nameofmedication LIKE CONCAT('%',tbl_meds_hypnoticsAndAnxiolytics.Medication,'%')
    WHERE CAST(src_isrepeatmedication AS BOOL) IS TRUE
        # AND CAST(src_datemedicationstart AS DATE) > DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
),
#  ## Antidepressants
tbl_meds_antidepressants AS (
    SELECT
        Medication
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, medications_to_query_antidepressants["Medication"].tolist())) + """'
                ]) AS Medication
),
tbl_meds_antidepressants_persons AS (
    SELECT
        DISTINCT Tblb.person_id,
        tbl_meds_antidepressants.Medication
    FROM
        tbl_meds_antidepressants
    LEFT JOIN
        yhcr-prd-phm-bia-core.CY_MYSPACE_CMC.tbl_SRPrimaryCareMedication AS Tblb
    ON
        Tblb.src_nameofmedication LIKE CONCAT('%',tbl_meds_antidepressants.Medication,'%')
    WHERE CAST(src_isrepeatmedication AS BOOL) IS TRUE
        # AND CAST(src_datemedicationstart AS DATE) > DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR)
)


# Finally, we use the above CTEs to define a table with one row per patient and one column for each
# clinical code and medication group. The code and medication columns are populated by interger
# values with '1' indicating that the code or medication is present in patient record and '0' indicating
# otherwise.
SELECT
    DISTINCT tbl_persons.person_id
    ,CASE WHEN tbl_bipolar_persons.person_id IS NULL THEN 0 ELSE 1 END AS Bipolar
    ,CASE WHEN tbl_borderline_persons.person_id IS NULL THEN 0 ELSE 1 END AS Borderline
    ,CASE WHEN tbl_chronicPTSD_persons.person_id IS NULL THEN 0 ELSE 1 END AS ChronicPTSD
    ,CASE WHEN tbl_complexPTSD_persons.person_id IS NULL THEN 0 ELSE 1 END AS ComplexPTSD
    ,CASE WHEN tbl_depression_persons.person_id IS NULL THEN 0 ELSE 1 END AS Depression
    ,CASE WHEN tbl_devAcademicDisorder_persons.person_id IS NULL THEN 0 ELSE 1 END AS DevAcademicDisorder
    ,CASE WHEN tbl_dysthymia_persons.person_id IS NULL THEN 0 ELSE 1 END AS Dysthymia
    ,CASE WHEN tbl_personalityDisorder_persons.person_id IS NULL THEN 0 ELSE 1 END AS PersonalityDisorder
    ,CASE WHEN tbl_schizophrenia_persons.person_id IS NULL THEN 0 ELSE 1 END AS Schizophrenia
    ,CASE WHEN tbl_meds_psychosisAndRelated_persons.person_id IS NULL THEN 0 ELSE 1 END AS Meds_PsychosisAndRelated
    ,CASE WHEN tbl_meds_hypnoticsAndAnxiolytics_persons.person_id IS NULL THEN 0 ELSE 1 END AS Meds_hypnoticsAndAnxiolytics
    ,CASE WHEN tbl_meds_antidepressants_persons.person_id IS NULL THEN 0 ELSE 1 END AS Meds_antidepressants
FROM tbl_persons
LEFT OUTER JOIN tbl_bipolar_persons ON tbl_persons.person_id = tbl_bipolar_persons.person_id
LEFT OUTER JOIN tbl_borderline_persons ON tbl_persons.person_id = tbl_borderline_persons.person_id
LEFT OUTER JOIN tbl_chronicPTSD_persons ON tbl_persons.person_id = tbl_chronicPTSD_persons.person_id
LEFT OUTER JOIN tbl_complexPTSD_persons ON tbl_persons.person_id = tbl_complexPTSD_persons.person_id
LEFT OUTER JOIN tbl_depression_persons ON tbl_persons.person_id = tbl_depression_persons.person_id
LEFT OUTER JOIN tbl_devAcademicDisorder_persons ON tbl_persons.person_id = tbl_devAcademicDisorder_persons.person_id
LEFT OUTER JOIN tbl_dysthymia_persons ON tbl_persons.person_id = tbl_dysthymia_persons.person_id
LEFT OUTER JOIN tbl_personalityDisorder_persons ON tbl_persons.person_id = tbl_personalityDisorder_persons.person_id
LEFT OUTER JOIN tbl_schizophrenia_persons ON tbl_persons.person_id = tbl_schizophrenia_persons.person_id
LEFT OUTER JOIN tbl_meds_psychosisAndRelated_persons ON tbl_persons.person_id = tbl_meds_psychosisAndRelated_persons.person_id
LEFT OUTER JOIN tbl_meds_hypnoticsAndAnxiolytics_persons ON tbl_persons.person_id = tbl_meds_hypnoticsAndAnxiolytics_persons.person_id
LEFT OUTER JOIN tbl_meds_antidepressants_persons ON tbl_persons.person_id = tbl_meds_antidepressants_persons.person_id
ORDER BY tbl_persons.person_id
"""

bqTable = client.query(sql).to_dataframe()
bqTable["ClinicalCode_Any"] = \
    bqTable[['Bipolar', 'Borderline', 'ChronicPTSD',
             'ComplexPTSD', 'DevAcademicDisorder', 'Depression',
             'Dysthymia','PersonalityDisorder', 'Schizophrenia']].max(axis = 1)
bqTable["Prescriptions_Any"] = \
    bqTable[['Meds_PsychosisAndRelated', 'Meds_hypnoticsAndAnxiolytics', 'Meds_antidepressants']].max(axis = 1)
bqTable
bqTable['_Any'] = bqTable.loc[:,bqTable.columns !=  'person_id'].max(axis = 1)

In [7]:
# Prepare header and note for presentation.
now = date.today()
#then = (now - relativedelta(years = 1)).strftime('%d-%b-%Y')
now = now.strftime('%d-%b-%Y')
display(
    md("""
## Prevalence per thousand, breakdown

To mitigate disclosure, counts are rounded to the nearest 5 before proportions are calculated.

The prevalence values refer to the period up to %s.
       """
       %(now)
       )
)

# Prepare table for presentation.
df_prevalence = \
    pd.DataFrame(data = {'numerator'   : (round(bqTable.loc[:,bqTable.columns !=  'person_id'].sum() / 5) * 5).astype(int),
                         'denominator' : (numpy.rint(np.repeat(bqTable.loc[:,bqTable.columns !=  'person_id'].shape[0],
                                                               bqTable.shape[1]-1, axis = 0) / 5) * 5).astype(int)})
df_prevalence['prevalence per thousand'] = \
    ((df_prevalence['numerator'] / df_prevalence['denominator'] ) * 1000).astype(int)                
df_prevalence


## Prevalence per thousand, breakdown

To mitigate disclosure, counts are rounded to the nearest 5 before proportions are calculated.

The prevalence values refer to the period up to 21-Nov-2022.
       

Unnamed: 0,numerator,denominator,prevalence per thousand
Bipolar,2220,699620,3
Borderline,485,699620,0
ChronicPTSD,120,699620,0
ComplexPTSD,120,699620,0
Depression,1230,699620,1
DevAcademicDisorder,2295,699620,3
Dysthymia,540,699620,0
PersonalityDisorder,3645,699620,5
Schizophrenia,2770,699620,3
Meds_PsychosisAndRelated,9820,699620,14


## Prevalence query, play area.

In [8]:
# Choose which columns to aggregate for a new, aggreageted prevalence estimate.
cols = \
    ['Bipolar', 'Borderline']

df_prevalence_play = \
    (round(pandas.DataFrame(data = {'numerator'   : bqTable[cols].sum()}).to_numpy().sum() / 699620 * 1000) / 5) * 5
print("The prevalence is", df_prevalence_play)

The prevalence is 4.0


# Entropy and hit rates of our caseness variable
We can also calculate the entropy and hit rates of the variable that represents complex mental health difficulties using the data we have to calculate the overall prevalence.

### Hit rates of our caseness variable
The two hit rates of the caseness variable refers to the classification accuracy values of classifying everyone as either having or not having complex mental health difficulties. For example, if 20% of our cohort were defined as having a record of complex mental health difficulties, then the hit rate (a.k.a. classification accuracy) for assuming everyone has complex mental health difficulties would be 20%. Conversely, the hit rate for assuming no one has complex mental health difficulties would be 80%. The better performing of these two assumptions is the simplest and best-yet rule we have for identifying complex mental health difficulties. We do this to illustrate the benchmark hit rate that we are trying to beat with our sets of indicative features, which should improve out classification accuracy to make up for complicating our classification rule.

### Entropy of our caseness variable
The entropy of our caseness variable is a measure of how uncertain we can be about a given patient's value. If the entropy is low, it tells us that most patients either have complex mental health difficulties or most patients do not have complex mental health difficulties. Either way, we learn from the entropy value that going through the effort of finding indicative features isn’t going to be much more of a help than classifying unclassified patients using the benchmark hit rate. If this is the case, perhaps a better use of our time would be to focus our research on defining CMHD so that it distinguishes our sample of patients better.

If the entropy value is high - though it is theoretically bounded by $log_{base}2$ where $base$ is the logarithmic base of our choosing - it tells us that we have a non-trivial mix of patients with and without complex mental health difficulties, in our sample. We would learn that we are justified in wanting to reduce the uncertainty (a.k.a. _surprisal_) inherent in our definition of complex mental health difficulties. Given that our caseness variable has only two possible values, the highest entropy we can expect is akin to knowing the outcome of a coin toss. We can calculate a scaled entropy - 0 to 1, or 0% to 100% - that tells us how close to the coin-toss scenario we would be if we were asked to identify whether a patient has complex mental health difficulties, but without any further information to help.
<br/>
<br/>
<br/>
<br/>

In the code block below, we calculate the:
1. hit rate assuming everyone has complex mental health difficulties;
2. hit rate assuming no one has complex mental health difficulties;
3. entropy of our variable that indicates complex mental health difficulties; _and_
4. entropy scaled to the theoretical maximum, given base _e_.

## Results

In [9]:
import scipy.stats
import math
# Calculate the hit rates.
hitRate_all = round(df_prevalence.iloc[-1]['prevalence per thousand'] / 10, 1)
hitRate_none = round(100 - hitRate_all, 1)
# Calculate the entropy and the entropy scaled to theoretical maximum.
ones = df_prevalence.iloc[-1]['numerator']
zeros = df_prevalence.iloc[-1]['denominator'] - df_prevalence.iloc[-1]['numerator']
a = pandas.Series([ele for ele in [0] for i in range(zeros)])
b = pandas.Series([ele for ele in [1] for i in range(ones)])
c = a.append(b)
myEntropy = scipy.stats.entropy(c.value_counts(), base = math.e)
myEntropy_scaled = round(myEntropy / math.log(2, math.e), 3) * 100
myEntropy = round(myEntropy, 3)
# Clean up.
del(a, b, c)
# Present results.
varnames = pandas.Series(["Entropy (nats)", "Scaled entropy (%)", "Hit rate_all (%)", "Hit rate_none (%)",])
varvals = pandas.Series([myEntropy, myEntropy_scaled, hitRate_all, hitRate_none])
pandas.concat([varnames, varvals], axis=1)
pandas.DataFrame(data = {"Variable" : varnames, "Value" : varvals})


Unnamed: 0,Variable,Value
0,Entropy (nats),0.4
1,Scaled entropy (%),57.7
2,Hit rate_all (%),13.7
3,Hit rate_none (%),86.3


Paying particular attention to the second and fourth rows of the table above, we now know that:
1. based on the scaled entropy, our variable for indicating complex mental health difficulties is 57.7% as uncertain/surprising/unforeseeable as it could possibly be (note: about half as bad as the coin-toss scenario); _and_
2. we would correctly classify 86.3% of patients in this sample if we simply assumed that no one has complex mental health difficulties.

The first point encourages us to continue with the initial research aim because our caseness variable does indeed appear to be inherently uncertain. Our intention to reduce this uncertainty by leveraging information from other sources seems like it will be worthwhile. Given this encourgement to continue, the second point defines a benchmark for the indicative performance of any feature set that we evalaute in our study. Specifically, any feature set that we suggest to improve our certainty of knowing that someone has complex mental health difficulties must correctly identify >86.3% of patients in our sample. Otherwise, the added feature set is a needless complication to our attempt to know whether someone has complex mental health difficulties.