# Caseness cohort breakdown

The purposes of this notebook are:

1. to produce the caseness array: an n-by-2 array containing the patient ID and a binary indicator of whether the patient's record meets our definition for the caseness of complex mental health difficulties.

2. to provide a breakdown of the count of patient records that met our definition for the caseness of complex mental health difficulties.
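
As a minimal sketch of the shape described in purpose 1, the toy example below builds an n-by-2 caseness array from two made-up flags. The values and the `toy_` names are hypothetical. The rule that caseness needs both a qualifying diagnosis and a qualifying medication mirrors the `caseness_1isYes` calculation later in this notebook.

```python
import pandas

# Hypothetical toy flags; the real values come from the queries later in this notebook.
toy_breakdown = pandas.DataFrame({
    "person_id": [101, 102, 103, 104],
    "anyDiagnosis": [True, True, False, False],
    "anyMedication": [True, False, True, False],
})

# A record meets the caseness definition only if it has both a qualifying
# diagnosis and a qualifying medication.
toy_breakdown["caseness_1isYes"] = (
    toy_breakdown["anyDiagnosis"] & toy_breakdown["anyMedication"]
).astype(int)

# The caseness array keeps only the identifier and the binary indicator.
toy_caseness_array = toy_breakdown[["person_id", "caseness_1isYes"]]
print(toy_caseness_array.shape)
```

Only the first toy patient is flagged because it is the only record with both flags set.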

## Imports

In [1]:
# Get helper functions.
%run 'UNSEEN_helper_functions.ipynb'
# Refresh stored variables, if they are present.
%store -r

## Load requisites

In [2]:
# Set index date. Usually CURRENT_DATE(), but December 2021 will be used until cB fixes the missing prescriptions.
myIndexDate =  '2021-12-31'

# Set the duration for which a person must have been registered with their current general practice, in years.
min_GP_registeration_duration = 1

# Set the capture window for criteria diagnoses and prescriptions, in years.
Dx_window = 10
Rx_window = 10
Rx_window_caseness = 10 # Must be <= Rx_window because Rx_window is applied first.

# Set parameters for disclosivity adjustments.
redaction_threshold = 7
target_round = 10

# Set the database attributes.
global server_id
server_id = 'yhcr-prd-phm-bia-core'
global database_id
database_id = 'CB_FDM_PrimaryCare_V7'

# Set folder location.
folder_loc = os.path.dirname(os.path.abspath("UNSEEN create caseness array.ipynb"))
folder = folder_loc + '/codelists/'

%store server_id database_id myIndexDate redaction_threshold \
target_round Dx_window Rx_window Rx_window_caseness min_GP_registeration_duration

Stored 'server_id' (str)
Stored 'database_id' (str)
Stored 'myIndexDate' (str)
Stored 'redaction_threshold' (int)
Stored 'target_round' (int)
Stored 'Dx_window' (int)
Stored 'Rx_window' (int)
Stored 'Rx_window_caseness' (int)
Stored 'min_GP_registeration_duration' (int)
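
The capture windows set above are applied later in SQL with `DATE_DIFF(..., YEAR) BETWEEN 0 AND <window>`. As I understand the BigQuery documentation, `DATE_DIFF` with the `YEAR` part counts the calendar-year boundaries between two dates rather than full elapsed years. The sketch below imitates that behaviour in plain Python to show which dates fall inside a window; the function names are hypothetical.

```python
from datetime import date


def year_boundaries_crossed(later: date, earlier: date) -> int:
    """Imitate DATE_DIFF(later, earlier, YEAR) by counting calendar-year boundaries."""
    return later.year - earlier.year


def in_capture_window(event: date, index_date: date, window_years: int) -> bool:
    """Imitate `DATE_DIFF(index_date, event, YEAR) BETWEEN 0 AND window_years`."""
    return 0 <= year_boundaries_crossed(index_date, event) <= window_years


index_date = date(2021, 12, 31)
print(in_capture_window(date(2011, 1, 1), index_date, 10))
print(in_capture_window(date(2010, 12, 31), index_date, 10))
```

Under this reading, an index date of 2021-12-31 with a 10-year window admits any start date from 1 January 2011 onward, which is slightly wider than ten elapsed years.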


## Load codelist CSV files.
We used [opencodelists.org](https://www.opencodelists.org) to create the codelists: sets of SNOMED-CT codes used to identify patients based on various attributes.

In [3]:
# Clinical codes of interest.
codes_to_query_mentalIllHealth = set( pandas.read_csv(folder + "mental_ill_health_codelist.txt", sep = '\t')["Id"] )
codes_to_query_bipolar = set( pandas.read_csv(folder + "ciaranmci-bipolar-disorder-6a0308d7.csv")["code"] )
codes_to_query_schizophrenia = set( pandas.read_csv(folder + "ciaranmci-schizophrenia-05c53c03.csv")["code"] )
codes_to_query_dementia = set( pandas.read_csv(folder + "bristol-dementia-snomed-ct-v13-7a6320f3.csv")["code"] )
# ## Exclude codes for Bipolar, Schizophrenia, and Dementia from the list of codes that define inclusion into the study population.
codes_to_exclude = set( codes_to_query_bipolar.union(codes_to_query_schizophrenia).union(codes_to_query_dementia) )
codes_to_query_population = codes_to_query_mentalIllHealth.difference(codes_to_exclude)

# ## Create codelist for the cases.
codes_to_query_borderline = set( pandas.read_csv(folder + "ciaranmci-borderline-personality-disorder-1ed4af38.csv")["code"] )
codes_to_query_chronicDepression = set( pandas.read_csv(folder + "ciaranmci-chronic-depression-53a65598.csv")["code"] )
codes_to_query_chronicPTSD = set( pandas.read_csv(folder + "ciaranmci-chronic-post-traumatic-stress-disorder-3a96e263.csv")["code"] )
codes_to_query_complexPTSD = set( pandas.read_csv(folder + "ciaranmci-complex-post-traumatic-stress-disorder-21876f2e.csv")["code"] )
codes_to_query_dysthymia = set( pandas.read_csv(folder + "ciaranmci-dysthymia-6f6888c3.csv")["code"] )
codes_to_query_personalityDisorder = set( pandas.read_csv(folder + "ciaranmci-personality-disorder-5c4cd31b.csv")["code"] )

codes_to_query_caseness = \
    set(
        codes_to_query_borderline.union(
            codes_to_query_chronicDepression.union(
                codes_to_query_chronicPTSD.union(
                    codes_to_query_complexPTSD.union(
                        codes_to_query_dysthymia.union(
                            codes_to_query_personalityDisorder
                        )
                    )
                )
            )
        )
    ).difference(codes_to_exclude)

# Medications of interest.
medications_to_query_psychosisAndRelated = pandas.read_csv(folder + "UNSEEN_medications_psychosisAndRelated.csv")
medications_to_query_hypnoticsAndAnxiolytics = pandas.read_csv(folder + "UNSEEN_medications_hypnoticsAndAnxiolytics.csv")
medications_to_query_antidepressants = pandas.read_csv(folder + "UNSEEN_medications_antidepressants.csv")
medications_to_query_all = pandas.read_csv(folder + "UNSEEN_medications_list.csv")
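
The set arithmetic in the cell above can be illustrated with toy codes. This is only a sketch: the code values are made up, and `set().union(...)` with several arguments is shown as a flatter alternative to nesting `.union()` calls, not as the notebook's own method.

```python
# Toy SNOMED-CT-like codes; these values are made up for illustration.
toy_mental_ill_health = {"A1", "A2", "B1", "S1", "D1"}
toy_bipolar = {"B1"}
toy_schizophrenia = {"S1"}
toy_dementia = {"D1"}

# set.union accepts several iterables at once, which avoids nesting.
toy_exclude = set().union(toy_bipolar, toy_schizophrenia, toy_dementia)

# The population codelist is everything that is not an exclusion code.
toy_population = toy_mental_ill_health - toy_exclude

# The caseness codelist is the union of the component lists, minus exclusions.
toy_case_lists = [{"A1"}, {"A2", "B1"}]
toy_caseness = set().union(*toy_case_lists) - toy_exclude

print(sorted(toy_population), sorted(toy_caseness))
```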

The script below is an edited version of the main script in `UNSEEN_create_caseness_variables.ipynb`. The main edit is that the `tbl_persons_with_caseness_codes` SQL Common Table Expression (CTE) is replaced by similar CTEs for each of the component diagnoses. I also replace `tbl_persons_with_medications` with similar CTEs for each of the component medications.

The component diagnoses are:
1. Borderline personality disorder
2. Chronic depression
3. Chronic posttraumatic stress disorder
4. Complex posttraumatic stress disorder
5. Dysthymia
6. Personality disorder

The component medications are:
1. Antidepressants
2. Hypnotics and anxiolytics
3. Medications associated with psychosis and related disorders
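
Each medication CTE in the script identifies prescriptions by pairing every prescription row with every medication pattern (a `CROSS JOIN`) and keeping the pairs where `REGEXP_CONTAINS` finds the pattern in the recorded name. A rough Python analogue is sketched below, using `re.search` in place of `REGEXP_CONTAINS`; the patterns and prescription names are hypothetical.

```python
import re

# Hypothetical patterns and prescription names, for illustration only.
medication_patterns = ["Sertraline", "Diazepam"]
prescription_names = [
    "Sertraline 50mg tablets",
    "Paracetamol 500mg tablets",
    "Diazepam 2mg tablets",
]

# Pair every prescription with every pattern (the cross join), then keep the
# pairs where the pattern is found anywhere in the name (the regex filter).
matches = [
    (name, pattern)
    for name in prescription_names
    for pattern in medication_patterns
    if re.search(pattern, name)
]
print(matches)
```

Like the SQL, this match is case-sensitive, so a pattern of `sertraline` would not match `Sertraline 50mg tablets`.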

## Define the BigQuery script that creates the study population.

I first specify the population of interest with `person_id` values. These `person_id` values are filtered as follows:
1. all people with a `person_id` in Connected Bradford's primary-care table,
2. aged between 18 and 70, inclusive, as of today's date (note: today's date is used rather than the `myIndexDate` date),
3. who have been registered with their practice for at least `min_GP_registeration_duration` years,
4. who have a record of a SNOMED-CT diagnostic code from `codes_to_query_population` within `Dx_window` years prior to the `myIndexDate` date, or who have a record of prescriptions for medicines from `medications_to_query_all` within `Rx_window` years prior to the `myIndexDate` date,
5. excluding those people who have a record of a SNOMED-CT diagnostic code from `codes_to_query_bipolar`, `codes_to_query_schizophrenia`, or `codes_to_query_dementia`, which are collated in `codes_to_exclude`.

The sequence I apply is to first use a single query to satisfy filters 1, 2, and 3. Then, I run two separate queries to satisfy each of the components of filter 4: one query considers medications, and the other considers diagnostic codes. Finally, I join the results of the filter-4 queries and apply the exclusion from filter 5.
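
The five filters can be sketched on toy data with pandas. This follows the prose description above rather than the SQL line for line: the capture windows are left out for brevity, `today` is a fixed stand-in for the current date so the result is reproducible, and every table and value is hypothetical.

```python
import pandas

index_date = pandas.Timestamp("2021-12-31")
today = pandas.Timestamp("2024-06-30")  # Fixed stand-in for today's date.
min_registration_years = 1

persons = pandas.DataFrame({
    "person_id": [1, 2, 3, 4],
    "year_of_birth": [1980, 1940, 1990, 1975],
    "registration_start": pandas.to_datetime(
        ["2015-01-01", "2010-01-01", "2021-06-01", "2000-01-01"]),
})
codes = pandas.DataFrame({"person_id": [1, 4, 4], "snomedcode": ["P1", "P2", "X1"]})
medications = pandas.DataFrame({"person_id": [1, 3]})
codes_population, codes_exclude = {"P1", "P2"}, {"X1"}

# Filters 1-3: present in the person table, aged 18-70, registered long enough.
age = today.year - persons["year_of_birth"]
cutoff = index_date - pandas.DateOffset(years=min_registration_years)
first_filters = persons[age.between(18, 70) & (persons["registration_start"] < cutoff)]

# Filter 4: a qualifying diagnostic code OR a qualifying prescription.
has_code = set(codes.loc[codes["snomedcode"].isin(codes_population), "person_id"])
has_medication = set(medications["person_id"])

# Filter 5: remove anyone holding an exclusion code.
excluded = set(codes.loc[codes["snomedcode"].isin(codes_exclude), "person_id"])

keep = (has_code | has_medication) - excluded
study_population = first_filters[first_filters["person_id"].isin(keep)]
print(study_population["person_id"].tolist())
```

In this toy data, person 2 fails the age filter, person 3 fails the registration filter, and person 4 is removed by the exclusion code, leaving only person 1.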

In [4]:
sql_declarations = \
"""
DECLARE myIndexDate DATE DEFAULT '""" + myIndexDate + """';
DECLARE min_GP_registeration_duration INT64 DEFAULT """ + str(min_GP_registeration_duration) + """;
DECLARE Rx_window INT64 DEFAULT """ + str(Rx_window) + """;
DECLARE Rx_window_caseness INT64 DEFAULT """ + str(Rx_window_caseness) + """;
DECLARE redaction_threshold INT64 DEFAULT """ + str(redaction_threshold) + """;
DECLARE target_round INT64 DEFAULT """ + str(target_round) + """;
"""

sql_studyPopulation = \
"""
WITH
# Set up table of SNOMED-CT codes that will be queried for filter 4.
tbl_codes_population AS (
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_population)) + """'
                ]) AS snomedcode
)
# Set up table of SNOMED-CT codes that will be excluded for filter 5.
,tbl_codes_to_exclude AS (
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_exclude)) + """'
                ]) AS snomedcode
)
# Set up table of medications that will be queried for filter 4.
,tbl_medications AS (
    SELECT
        nameofmedication
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, medications_to_query_all["Medication"].tolist())) + """'
                ]) AS nameofmedication
)
# Set up table of SNOMED-CT codes that will be queried to define the caseness of complex mental health difficulties.
,tbl_codes_caseness AS (
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_caseness)) + """'
                ]) AS snomedcode
)



# First query to satisfy filters 1, 2, and 3.
,tbl_persons_firstFilters AS (
    SELECT
        DISTINCT person.person_id
        ,person.year_of_birth
    FROM
        # Querying this table effectively applies filter 1.
        """ + server_id + """.""" + database_id + """.person
    # This join is filtering for GP registration.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srpatientregistration USING(person_id)
    WHERE
        # Apply filter 2.
        (EXTRACT(YEAR FROM CURRENT_DATE()) - person.year_of_birth) BETWEEN 18 AND 70
        AND
        # Apply filter 3.
        tbl_srpatientregistration.tbl_srpatientregistration_start_date <
            DATE_SUB(myIndexDate, INTERVAL min_GP_registeration_duration YEAR)
)     
# First component of filter 4: patients with SNOMED-CT codes of interest.
,tbl_persons_and_codes AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
        ,year_of_birth
        ,tbl_srcode.snomedcode
    FROM
        tbl_persons_firstFilters
    # This join appends the diagnostic SNOMED-CT codes, and filters for 
    # the patients for which we have diagnostic codes because it is an
    # INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srcode
        ON tbl_persons_firstFilters.person_id = tbl_srcode.person_id
    # This join is filtering for patients with the diagnostic SNOMED-CT codes
    # of interest by using an INNER JOIN, which acts like an intersection in
    # set operations.
    JOIN 
        tbl_codes_population
        ON tbl_srcode.snomedcode = tbl_codes_population.snomedcode
    WHERE
        # This filters for diagnoses prior to the index date.
        tbl_srcode.dateevent < myIndexDate
)
# Second component of filter 4: patients with prescriptions for medications of interest.
,tbl_persons_and_medications AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
        ,year_of_birth
        ,tbl_srprimarycaremedication.datemedicationstart # This extra column is needed for a later query to distinguish cases and controls.
    FROM
        tbl_persons_firstFilters
    # This join is appending the medication table so that I can query medications.
    # It also, effectively, removes any patients without a prescription because
    # it is an INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srprimarycaremedication
        ON tbl_persons_firstFilters.person_id = tbl_srprimarycaremedication.person_id
    # This cross join conveniently creates all possible combinations of values of the
    # previous join result and `tbl_medications`. This sets up my interim result to 
    # easily do a row-wise comparison of the medications of interest with the variously-
    # worded `nameofmedication` values in the database.
    CROSS JOIN
        tbl_medications
    WHERE
        # This filters for the medications of interest.
        REGEXP_CONTAINS(tbl_srprimarycaremedication.nameofmedication, tbl_medications.nameofmedication) = True
        AND
        DATE_DIFF(myIndexDate, CAST(tbl_srprimarycaremedication.datemedicationstart AS DATE), YEAR) BETWEEN 0 AND """ + str(Rx_window) + """
)
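# Combine components of filter 4 with an OR condition by using a `FULL OUTER JOIN`, then apply filter 5.
# Note: `NOT IN` evaluates to NULL when `snomedcode` is NULL and the exclusion list is non-empty,
# so rows without a diagnostic code do not pass this `WHERE` clause.
# This completes the sequence to create the study population.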
,tbl_studyPopulation_no_caseness AS (
    SELECT
        DISTINCT person_id
        ,tbl_persons_and_codes.year_of_birth
    FROM
        tbl_persons_and_codes
    FULL OUTER JOIN
        tbl_persons_and_medications USING(person_id)
    WHERE
        tbl_persons_and_codes.snomedcode NOT IN (SELECT snomedcode FROM tbl_codes_to_exclude)
)
"""
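
Each codelist CTE above repeats the same pattern: join the values into a quoted list and wrap it in `UNNEST`. The sketch below wraps that pattern in two hypothetical helpers, `sql_string_array` and `codelist_cte`, which are not part of `UNSEEN_helper_functions.ipynb`. It also escapes backslashes and single quotes, on the assumption that BigQuery string literals accept `\\` and `\'` escapes; the notebook's own string concatenation does no escaping.

```python
def sql_string_array(values):
    """Render values as an array literal of single-quoted strings."""
    quoted = []
    for value in sorted(map(str, values)):
        # Escape backslashes first, then single quotes, so the literal stays well-formed.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        quoted.append("'" + escaped + "'")
    return "[" + ", ".join(quoted) + "]"


def codelist_cte(cte_name, column_name, values):
    """Build one `name AS (SELECT col FROM UNNEST([...]) AS col)` fragment."""
    return (
        f"{cte_name} AS (\n"
        f"    SELECT {column_name}\n"
        f"    FROM UNNEST({sql_string_array(values)}) AS {column_name}\n"
        f")"
    )


print(codelist_cte("tbl_codes_toy", "snomedcode", {"222", "111"}))
```

Sorting the values makes the generated SQL text stable between runs, which a plain `set` does not guarantee.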

## Additional subqueries.

In [5]:
sql_caseness_components_codelist_CTEs = \
"""
,tbl_codes_borderlinePD AS (
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_borderline)) + """'
                ]) AS snomedcode
)
,tbl_codes_chronicDepression AS (
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_chronicDepression)) + """'
                ]) AS snomedcode
)
,tbl_codes_chronicPTSD AS (
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_chronicPTSD)) + """'
                ]) AS snomedcode
)
,tbl_codes_complexPTSD AS (
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_complexPTSD)) + """'
                ]) AS snomedcode
)
,tbl_codes_dysthymia AS (
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_dysthymia)) + """'
                ]) AS snomedcode
)
,tbl_codes_personalityDisorder AS (
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_personalityDisorder)) + """'
                ]) AS snomedcode
)
,tbl_codes_bipolar AS (
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_bipolar)) + """'
                ]) AS snomedcode
)
,tbl_codes_schizophrenia AS (
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_schizophrenia)) + """'
                ]) AS snomedcode
)
,tbl_codes_dementia AS (
    SELECT
        snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_dementia)) + """'
                ]) AS snomedcode
)




,tbl_medications_antidepressants AS (
    SELECT
        nameofmedication
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, medications_to_query_antidepressants["Medication"].tolist())) + """'
                ]) AS nameofmedication
)
,tbl_medications_hypnoticsAndAnxiolytics AS (
    SELECT
        nameofmedication
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, medications_to_query_hypnoticsAndAnxiolytics["Medication"].tolist())) + """'
                ]) AS nameofmedication
)
,tbl_medications_psychosisAndRelated AS (
    SELECT
        nameofmedication
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, medications_to_query_psychosisAndRelated["Medication"].tolist())) + """'
                ]) AS nameofmedication
)
"""



sql_caseness_components_CTEs = \
"""
,tbl_persons_with_borderlinePD_codes AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
        ,1 AS borderlinePD
    FROM
        tbl_persons_firstFilters
    # This join gets the diagnostic SNOMED-CT codes, and filters for 
    # the patients for which we have diagnostic codes because it is an
    # INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srcode USING(person_id)
    # This join is filtering for patients with the diagnostic SNOMED-CT codes
    # of interest by using an INNER JOIN, which acts like an intersection in
    # set operations.
    JOIN 
        tbl_codes_borderlinePD USING(snomedcode)
    WHERE
        # This filters for diagnoses prior to the index date.
        tbl_srcode.dateevent < myIndexDate
)
,tbl_persons_with_chronicDepression_codes AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
        ,1 AS chronicDepression
    FROM
        tbl_persons_firstFilters
    # This join gets the diagnostic SNOMED-CT codes, and filters for 
    # the patients for which we have diagnostic codes because it is an
    # INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srcode USING(person_id)
    # This join is filtering for patients with the diagnostic SNOMED-CT codes
    # of interest by using an INNER JOIN, which acts like an intersection in
    # set operations.
    JOIN 
        tbl_codes_chronicDepression USING(snomedcode)
    WHERE
        # This filters for diagnoses prior to the index date.
        tbl_srcode.dateevent < myIndexDate
)
,tbl_persons_with_chronicPTSD_codes AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
        ,1 AS chronicPTSD
    FROM
        tbl_persons_firstFilters
    # This join gets the diagnostic SNOMED-CT codes, and filters for 
    # the patients for which we have diagnostic codes because it is an
    # INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srcode USING(person_id)
    # This join is filtering for patients with the diagnostic SNOMED-CT codes
    # of interest by using an INNER JOIN, which acts like an intersection in
    # set operations.
    JOIN 
        tbl_codes_chronicPTSD USING(snomedcode)
    WHERE
        # This filters for diagnoses prior to the index date.
        tbl_srcode.dateevent < myIndexDate
)
,tbl_persons_with_complexPTSD_codes AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
        ,1 AS complexPTSD
    FROM
        tbl_persons_firstFilters
    # This join gets the diagnostic SNOMED-CT codes, and filters for 
    # the patients for which we have diagnostic codes because it is an
    # INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srcode USING(person_id)
    # This join is filtering for patients with the diagnostic SNOMED-CT codes
    # of interest by using an INNER JOIN, which acts like an intersection in
    # set operations.
    JOIN 
        tbl_codes_complexPTSD USING(snomedcode)
    WHERE
        # This filters for diagnoses prior to the index date.
        tbl_srcode.dateevent < myIndexDate
)
,tbl_persons_with_dysthymia_codes AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
        ,1 AS dysthymia
    FROM
        tbl_persons_firstFilters
    # This join gets the diagnostic SNOMED-CT codes, and filters for 
    # the patients for which we have diagnostic codes because it is an
    # INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srcode USING(person_id)
    # This join is filtering for patients with the diagnostic SNOMED-CT codes
    # of interest by using an INNER JOIN, which acts like an intersection in
    # set operations.
    JOIN 
        tbl_codes_dysthymia USING(snomedcode)
    WHERE
        # This filters for diagnoses prior to the index date.
        tbl_srcode.dateevent < myIndexDate
)
,tbl_persons_with_personalityDisorder_codes AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
        ,1 AS personalityDisorder
    FROM
        tbl_persons_firstFilters
    # This join gets the diagnostic SNOMED-CT codes, and filters for 
    # the patients for which we have diagnostic codes because it is an
    # INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srcode USING(person_id)
    # This join is filtering for patients with the diagnostic SNOMED-CT codes
    # of interest by using an INNER JOIN, which acts like an intersection in
    # set operations.
    JOIN 
        tbl_codes_personalityDisorder USING(snomedcode)
    WHERE
        # This filters for diagnoses prior to the index date.
        tbl_srcode.dateevent < myIndexDate
)
,tbl_persons_with_bipolar_codes AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
        ,1 AS bipolar
    FROM
        tbl_persons_firstFilters
    # This join gets the diagnostic SNOMED-CT codes, and filters for 
    # the patients for which we have diagnostic codes because it is an
    # INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srcode USING(person_id)
    # This join is filtering for patients with the diagnostic SNOMED-CT codes
    # of interest by using an INNER JOIN, which acts like an intersection in
    # set operations.
    JOIN 
        tbl_codes_bipolar USING(snomedcode)
    WHERE
        # This filters for diagnoses prior to the index date.
        tbl_srcode.dateevent < myIndexDate
)
,tbl_persons_with_schizophrenia_codes AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
        ,1 AS schizophrenia
    FROM
        tbl_persons_firstFilters
    # This join gets the diagnostic SNOMED-CT codes, and filters for 
    # the patients for which we have diagnostic codes because it is an
    # INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srcode USING(person_id)
    # This join is filtering for patients with the diagnostic SNOMED-CT codes
    # of interest by using an INNER JOIN, which acts like an intersection in
    # set operations.
    JOIN 
        tbl_codes_schizophrenia USING(snomedcode)
    WHERE
        # This filters for diagnoses prior to the index date.
        tbl_srcode.dateevent < myIndexDate
)
,tbl_persons_with_dementia_codes AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
        ,1 AS dementia
    FROM
        tbl_persons_firstFilters
    # This join gets the diagnostic SNOMED-CT codes, and filters for 
    # the patients for which we have diagnostic codes because it is an
    # INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srcode USING(person_id)
    # This join is filtering for patients with the diagnostic SNOMED-CT codes
    # of interest by using an INNER JOIN, which acts like an intersection in
    # set operations.
    JOIN 
        tbl_codes_dementia USING(snomedcode)
    WHERE
        # This filters for diagnoses prior to the index date.
        tbl_srcode.dateevent < myIndexDate
)




,tbl_persons_with_antidepressants_meds AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
        ,1 AS antidepressants
    FROM
        tbl_persons_firstFilters
    # This join is adding the medication table so that I can query medications.
    # It also, effectively, removes any patients without a prescription because
    # it is an INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srprimarycaremedication USING(person_id)
    # This cross join conveniently creates all possible combinations of values of the
    # previous join result and `tbl_medications`. This sets up my interim result to 
    # easily do a row-wise comparison of the medications of interest with the variously-
    # worded `nameofmedication` values in the database.
    CROSS JOIN
        tbl_medications_antidepressants
    WHERE
        # This filters for the medications of interest.
        REGEXP_CONTAINS(tbl_srprimarycaremedication.nameofmedication, tbl_medications_antidepressants.nameofmedication) = True
        AND
        DATE_DIFF(myIndexDate, CAST(tbl_srprimarycaremedication.datemedicationstart AS DATE), YEAR) BETWEEN 0 AND """ + str(Rx_window_caseness) + """
)
,tbl_persons_with_hypnoticsAndAnxiolytics_meds AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
        ,1 AS hypnoticsAndAnxiolytics
    FROM
        tbl_persons_firstFilters
    # This join is adding the medication table so that I can query medications.
    # It also, effectively, removes any patients without a prescription because
    # it is an INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srprimarycaremedication USING(person_id)
    # This cross join conveniently creates all possible combinations of values of the
    # previous join result and `tbl_medications`. This sets up my interim result to 
    # easily do a row-wise comparison of the medications of interest with the variously-
    # worded `nameofmedication` values in the database.
    CROSS JOIN
        tbl_medications_hypnoticsAndAnxiolytics
    WHERE
        # This filters for the medications of interest.
        REGEXP_CONTAINS(tbl_srprimarycaremedication.nameofmedication, tbl_medications_hypnoticsAndAnxiolytics.nameofmedication) = True
        AND
        DATE_DIFF(myIndexDate, CAST(tbl_srprimarycaremedication.datemedicationstart AS DATE), YEAR) BETWEEN 0 AND """ + str(Rx_window_caseness) + """
)
,tbl_persons_with_psychosisAndRelated_meds AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
        ,1 AS psychosisAndRelated
    FROM
        tbl_persons_firstFilters
    # This join is adding the medication table so that I can query medications.
    # It also, effectively, removes any patients without a prescription because
    # it is an INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srprimarycaremedication USING(person_id)
    # This cross join conveniently creates all possible combinations of values of the
    # previous join result and `tbl_medications`. This sets up my interim result to 
    # easily do a row-wise comparison of the medications of interest with the variously-
    # worded `nameofmedication` values in the database.
    CROSS JOIN
        tbl_medications_psychosisAndRelated
    WHERE
        # This filters for the medications of interest.
        REGEXP_CONTAINS(tbl_srprimarycaremedication.nameofmedication, tbl_medications_psychosisAndRelated.nameofmedication) = True
        AND
        DATE_DIFF(myIndexDate, CAST(tbl_srprimarycaremedication.datemedicationstart AS DATE), YEAR) BETWEEN 0 AND """ + str(Rx_window_caseness) + """
)



,tbl_studyPopulation_casenessBreakdown AS (
    SELECT
        DISTINCT tbl_studyPopulation_no_caseness.person_id
        ,CASE WHEN borderlinePD IS NULL THEN 0 ELSE 1 END AS borderlinePD
        ,CASE WHEN chronicDepression IS NULL THEN 0 ELSE 1 END AS chronicDepression
        ,CASE WHEN chronicPTSD IS NULL THEN 0 ELSE 1 END AS chronicPTSD
        ,CASE WHEN complexPTSD IS NULL THEN 0 ELSE 1 END AS complexPTSD
        ,CASE WHEN dysthymia IS NULL THEN 0 ELSE 1 END AS dysthymia
        ,CASE WHEN personalityDisorder IS NULL THEN 0 ELSE 1 END AS personalityDisorder
        
        ,CASE WHEN bipolar IS NULL THEN 0 ELSE 1 END AS bipolar
        ,CASE WHEN schizophrenia IS NULL THEN 0 ELSE 1 END AS schizophrenia
        ,CASE WHEN dementia IS NULL THEN 0 ELSE 1 END AS dementia
        
        ,CASE WHEN antidepressants IS NULL THEN 0 ELSE 1 END AS antidepressants
        ,CASE WHEN hypnoticsAndAnxiolytics IS NULL THEN 0 ELSE 1 END AS hypnoticsAndAnxiolytics
        ,CASE WHEN psychosisAndRelated IS NULL THEN 0 ELSE 1 END AS psychosisAndRelated
    FROM
        tbl_studyPopulation_no_caseness
    LEFT JOIN tbl_persons_with_borderlinePD_codes USING(person_id)
    LEFT JOIN tbl_persons_with_chronicDepression_codes USING(person_id)
    LEFT JOIN tbl_persons_with_chronicPTSD_codes USING(person_id)
    LEFT JOIN tbl_persons_with_complexPTSD_codes USING(person_id)
    LEFT JOIN tbl_persons_with_dysthymia_codes USING(person_id)
    LEFT JOIN tbl_persons_with_personalityDisorder_codes USING(person_id)
    
    LEFT JOIN tbl_persons_with_bipolar_codes USING(person_id)
    LEFT JOIN tbl_persons_with_schizophrenia_codes USING(person_id)
    LEFT JOIN tbl_persons_with_dementia_codes USING(person_id)
    
    LEFT JOIN tbl_persons_with_antidepressants_meds USING(person_id)
    LEFT JOIN tbl_persons_with_hypnoticsAndAnxiolytics_meds USING(person_id)
    LEFT JOIN tbl_persons_with_psychosisAndRelated_meds USING(person_id)
)
"""
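
The `tbl_studyPopulation_casenessBreakdown` CTE keeps every member of the study population and turns a missing match into 0. The same pattern can be sketched in pandas with left merges; the toy tables below are hypothetical and stand in for the per-component CTEs.

```python
import pandas

study_population = pandas.DataFrame({"person_id": [1, 2, 3]})

# Each component table lists only the people who have that component, flagged with 1.
with_dysthymia = pandas.DataFrame({"person_id": [2], "dysthymia": [1]})
with_antidepressants = pandas.DataFrame({"person_id": [1, 2], "antidepressants": [1, 1]})

breakdown = study_population
for component in (with_dysthymia, with_antidepressants):
    # how="left" keeps every member of the study population,
    # like LEFT JOIN ... USING(person_id).
    breakdown = breakdown.merge(component, on="person_id", how="left")

# Unmatched people have missing values; map them to 0,
# like CASE WHEN ... IS NULL THEN 0 ELSE 1 END.
flag_columns = ["dysthymia", "antidepressants"]
breakdown[flag_columns] = breakdown[flag_columns].notna().astype(int)
print(breakdown)
```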

## Final select.

In [6]:
sql_final_select =\
"""
SELECT * FROM tbl_studyPopulation_casenessBreakdown ORDER BY person_id
"""
cohort_breakdown_array = pandas.read_gbq(sql_declarations + sql_studyPopulation + sql_caseness_components_codelist_CTEs + sql_caseness_components_CTEs + sql_final_select)



## Calculate prevalence of components.
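
The next cell rounds each count to the nearest multiple of `target_round` before proportions are calculated. A minimal sketch of that rounding on made-up counts is given below. The final suppression step is an assumption about how `redaction_threshold` could be applied to individual counts; the cell itself only compares the threshold with the study-population and caseness totals.

```python
import pandas

redaction_threshold = 7
target_round = 10

# Made-up raw counts for three hypothetical components.
raw_counts = pandas.Series({"componentA": 494, "componentB": 3, "componentC": 1176})

# Round each count to the nearest multiple of target_round.
rounded = ((raw_counts / target_round).round() * target_round).astype(int)

# Assumption: suppress any count whose raw value is below the threshold
# by replacing it with a missing value.
redacted = rounded.where(raw_counts >= redaction_threshold)

print(rounded)
print(redacted)
```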

In [7]:
# Add columns.
cohort_breakdown_array['anyPD'] = \
    pandas.DataFrame( cohort_breakdown_array.loc[:, cohort_breakdown_array.columns.isin(['borderlinePD', 'personalityDisorder'])].any(axis = 1) )

cohort_breakdown_array['anyPTSD'] = \
    pandas.DataFrame( cohort_breakdown_array.loc[:, cohort_breakdown_array.columns.isin(['chronicPTSD', 'complexPTSD'])].any(axis = 1) )

cohort_breakdown_array['anyDepression'] = \
    pandas.DataFrame( cohort_breakdown_array.loc[:, cohort_breakdown_array.columns.isin(['chronicDepression', 'dysthymia'])].any(axis = 1) )

cohort_breakdown_array['anyExclusion'] = \
    pandas.DataFrame( cohort_breakdown_array.loc[:, cohort_breakdown_array.columns.isin(['bipolar', 'schizophrenia', 'dementia'])].any(axis = 1) )

cohort_breakdown_array['anyDiagnosis'] = \
    pandas.DataFrame( cohort_breakdown_array.loc[cohort_breakdown_array.anyExclusion == False,
                                                  cohort_breakdown_array.columns.isin(['anyPD', 'anyPTSD', 'anyDepression'])
                                                 ].any(axis = 1), columns = ['anyDiagnosis'] )

cohort_breakdown_array['count_of_diagnoses'] = \
    pandas.DataFrame( cohort_breakdown_array.loc[cohort_breakdown_array.anyExclusion == False,
                                                   cohort_breakdown_array.columns.isin(['borderlinePD', 'personalityDisorder', 'chronicPTSD', 'complexPTSD',
                                                                                          'chronicDepression', 'dysthymia'])
                                                  ].sum(axis = 1), columns = ['count_of_diagnoses'] )

cohort_breakdown_array['anyMedication'] = \
    pandas.DataFrame( cohort_breakdown_array.loc[cohort_breakdown_array.anyExclusion == False,
                                                   cohort_breakdown_array.columns.isin(['antidepressants', 'hypnoticsAndAnxiolytics', 'psychosisAndRelated'])
                                                  ].any(axis = 1) )

cohort_breakdown_array['count_of_medications'] = \
    pandas.DataFrame( cohort_breakdown_array.loc[cohort_breakdown_array.anyExclusion == False,
                                                   cohort_breakdown_array.columns.isin(['antidepressants', 'hypnoticsAndAnxiolytics', 'psychosisAndRelated'])
                                                  ].sum(axis = 1), columns = ['count_of_medications'] )

cohort_breakdown_array['caseness_1isYes'] = \
    pandas.DataFrame( cohort_breakdown_array.loc[cohort_breakdown_array.anyExclusion == False
                                                 ,cohort_breakdown_array.columns.isin(['anyDiagnosis', 'anyMedication'])
                                                ].all(axis = 1)
                     , columns = ['caseness_1isYes'] 
                    )

caseness_array =  \
    pandas.DataFrame(
        cohort_breakdown_array.loc[(cohort_breakdown_array.anyExclusion == False)
                                   ,cohort_breakdown_array.columns.isin(['person_id', 'caseness_1isYes'])
                                  ]).reset_index()

# Calculate base counts and round them to the nearest `target_round` before calculating proportions.
counts = \
    (round(cohort_breakdown_array.iloc[:,
                                        ~cohort_breakdown_array.columns.isin(['person_id', 'count_of_diagnoses', 'count_of_medications'])].sum().astype(float) / target_round) * target_round
    ).astype(int)
count_studyPopulation = len(cohort_breakdown_array)
percentage = round((counts / count_studyPopulation) * 100, 2 )
prevalence_per_thousand = round((counts / count_studyPopulation) * 1000, 2 )


# Calculate important cohort descriptive statistics.
count_caseness = counts['caseness_1isYes']
count_control = count_studyPopulation - count_caseness
caseness_prevalence = percentage['caseness_1isYes'] / 100
percentage_caseness = percentage['caseness_1isYes']

# Present feedback.
if count_studyPopulation < redaction_threshold:
    print('The count of patients in the Connected Bradford dataset demonstrating mental ill-health is so low that it does not meet our redaction threshold.')
    print('Therefore, no further action will be taken because the study population is too small.')

if count_caseness < redaction_threshold:
    print('The count of patients in the Connected Bradford dataset demonstrating the caseness of complex mental health difficulties is so low that it does not meet our redaction threshold.')
    print('Therefore, we will use the redaction threshold as the imputed count of caseness.')
    count_caseness = redaction_threshold
    caseness_prevalence = count_caseness / count_studyPopulation
    count_control = count_studyPopulation - count_caseness
else:
    print(f'\033[1mNOTE: The percentage of caseness among those with mental ill-health in the Connected Bradford dataset is {percentage_caseness}%\033[0m. \n')



display( pandas.DataFrame(data = {'counts' : counts, 'percentage' : percentage, 'prevalence_per_thousand' : prevalence_per_thousand} ) )

display( pandas.DataFrame(data = {'counts' : cohort_breakdown_array.iloc[:, cohort_breakdown_array.columns.isin(['count_of_diagnoses', 'count_of_medications'])].sum(),
                                  'percentage' : round(cohort_breakdown_array.iloc[:, cohort_breakdown_array.columns.isin(['count_of_diagnoses', 'count_of_medications'])].sum() / 
                                                       len(cohort_breakdown_array), 2),
                                  'prevalence_per_thousand' : round(cohort_breakdown_array.iloc[:, cohort_breakdown_array.columns.isin(['count_of_diagnoses', 'count_of_medications'])].sum() / 
                                                                    len(cohort_breakdown_array) * 10, 2)
                                 } ) )

%store sql_declarations sql_studyPopulation sql_caseness_components_codelist_CTEs sql_caseness_components_CTEs \
        caseness_array cohort_breakdown_array count_studyPopulation count_caseness count_control \
        caseness_prevalence percentage_caseness

NOTE: The percentage of caseness among those with mental ill-health in the Connected Bradford dataset is 2.16%.



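|                     | counts | percentage | prevalence_per_thousand |
| ------------------- | ------ | ---------- | ----------------------- |
| borderlinePD        | 490    | 0.31       | 3.09                    |
| chronicDepression   | 1180   | 0.74       | 7.45                    |
| chronicPTSD         | 120    | 0.08       | 0.76                    |
| complexPTSD         | 0      | 0.0        | 0.0                     |
| dysthymia           | 520    | 0.33       | 3.28                    |
| personalityDisorder | 3650   | 2.3        | 23.04                   |
| bipolar             | 1860   | 1.17       | 11.74                   |
| schizophrenia       | 2180   | 1.38       | 13.76                   |
| dementia            | 350    | 0.22       | 2.21                    |
| antidepressants     | 95340  | 60.17      | 601.71                  |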


|                      | counts   | percentage | prevalence_per_thousand |
| -------------------- | -------- | ---------- | ----------------------- |
| count_of_diagnoses   | 5048.0   | 0.03       | 0.32                    |
| count_of_medications | 131480.0 | 0.83       | 8.3                     |


Stored 'sql_declarations' (str)
Stored 'sql_studyPopulation' (str)
Stored 'sql_caseness_components_codelist_CTEs' (str)
Stored 'sql_caseness_components_CTEs' (str)
Stored 'caseness_array' (DataFrame)
Stored 'cohort_breakdown_array' (DataFrame)
Stored 'count_studyPopulation' (int)
Stored 'count_caseness' (int64)
Stored 'count_control' (int64)
Stored 'caseness_prevalence' (float64)
Stored 'percentage_caseness' (float64)
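
The rounding step used for the counts above can be sketched in isolation. This is a minimal illustration rather than the notebook's own code: the function name `round_counts`, the toy counts, and the population size of 100,000 are all made up for the example, and only the rounding to the nearest multiple of `target_round` mirrors the cell above.

```python
import pandas

def round_counts(raw_counts: pandas.Series, target_round: int = 10) -> pandas.Series:
    """Round each count to the nearest multiple of target_round (hypothetical helper)."""
    return ((raw_counts.astype(float) / target_round).round() * target_round).astype(int)

# Made-up raw counts, for illustration only.
raw_counts = pandas.Series({'borderlinePD': 487, 'complexPTSD': 3, 'dysthymia': 524})
rounded_counts = round_counts(raw_counts, target_round=10)

# Made-up study population size, for illustration only.
toy_population = 100_000

# Proportions are calculated from the rounded counts, never from the raw counts.
percentage = (rounded_counts / toy_population * 100).round(2)
prevalence_per_thousand = (rounded_counts / toy_population * 1000).round(2)

print(pandas.DataFrame({'counts': rounded_counts,
                        'percentage': percentage,
                        'prevalence_per_thousand': prevalence_per_thousand}))
```

Calculating the proportions after rounding means that no published figure can be used to recover an unrounded count.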


### Percentage with each count of diagnoses

In [8]:
percentage_with_at_least_one_diagnosis = \
    round(
        (
            round(
                len(
                    cohort_breakdown_array.loc[
                        (cohort_breakdown_array.count_of_diagnoses > 0)
                    ]
                ) / target_round
            ) * target_round
        ) / count_studyPopulation * 100
    , 1)
print(f'\033[1mNOTE: The percentage of patient records with at least one diagnosis is {percentage_with_at_least_one_diagnosis}%\033[0m')

display(
    pandas.DataFrame(
        data = {
            'count_with_each_count_of_diagnoses' : round(cohort_breakdown_array.count_of_diagnoses.value_counts() / target_round) * target_round,
            'percentage_with_each_count_of_diagnoses' : round(round(cohort_breakdown_array.count_of_diagnoses.value_counts() / target_round) * target_round / count_studyPopulation * 100,2)
        }
    )
)

NOTE: The percentage of patient records with at least one diagnosis is 2.9%


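| count_of_diagnoses | count_with_each_count_of_diagnoses | percentage_with_each_count_of_diagnoses |
| ------------------ | ---------------------------------- | --------------------------------------- |
| 0.0                | 149810.0                           | 94.55                                   |
| 1.0                | 4030.0                             | 2.54                                    |
| 2.0                | 470.0                              | 0.3                                     |
| 3.0                | 20.0                               | 0.01                                    |
| 4.0                | 0.0                                | 0.0                                     |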


### Percentage with each count of medications (note that 0 is not the most common count)

In [9]:
percentage_with_at_least_one_medication = \
    round(
        (
            round(
                len(
                    cohort_breakdown_array.loc[
                        (cohort_breakdown_array.count_of_medications > 0)
                    ]
                ) / target_round
            ) * target_round
        ) / count_studyPopulation * 100
    , 1)
print(f'\033[1mNOTE: The percentage of patient records with at least one medication is {percentage_with_at_least_one_medication}%\033[0m')

display(
    pandas.DataFrame(
        data = {
            'count_with_each_count_of_medications' : round(cohort_breakdown_array.count_of_medications.value_counts() / target_round) * target_round,
            'percentage_with_each_count_of_medications' : round(round(cohort_breakdown_array.count_of_medications.value_counts() / target_round) * target_round / count_studyPopulation * 100,2)
        }
    )
)

NOTE: The percentage of patient records with at least one medication is 62.6%


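| count_of_medications | count_with_each_count_of_medications | percentage_with_each_count_of_medications |
| -------------------- | ------------------------------------ | ----------------------------------------- |
| 1.0                  | 69770.0                              | 44.03                                     |
| 0.0                  | 55160.0                              | 34.81                                     |
| 2.0                  | 26510.0                              | 16.73                                     |
| 3.0                  | 2900.0                               | 1.83                                      |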


## Calculating the entropy of the caseness

In [10]:
print("\n Calculating the entropy of the caseness variable in nats...")
_, entropy_caseness_scaled = entropy_output(caseness_array.caseness_1isYes)
%store entropy_caseness_scaled


 Calculating the entropy of the caseness variable in nats...
	 Caseness variable entropy = 0.106 nats
	 The caseness variable's entropy is 15.4 % of its theoretical maximum

Stored 'entropy_caseness_scaled' (float64)
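
The helper `entropy_output` is defined in `UNSEEN_helper_functions.ipynb` and is not shown here. As a rough sketch of the kind of calculation it reports, assuming it uses the standard Shannon entropy of a binary variable in nats and scales it by the maximum possible entropy, $\ln 2$, the figures above can be approximated from the caseness rate alone. The function names below are hypothetical.

```python
import math

def binary_entropy_nats(p_yes: float) -> float:
    """Shannon entropy, in nats, of a binary variable with P(1) = p_yes."""
    if p_yes in (0.0, 1.0):
        return 0.0  # A constant variable carries no uncertainty.
    return -(p_yes * math.log(p_yes) + (1 - p_yes) * math.log(1 - p_yes))

def scaled_entropy_percent(p_yes: float) -> float:
    """Entropy as a percentage of the binary maximum, ln(2), reached at p_yes = 0.5."""
    return binary_entropy_nats(p_yes) / math.log(2) * 100

# Caseness rate taken from the 'Hit rate (all)' figure reported in this notebook.
p_caseness = 2.218 / 100

print(f'Entropy = {binary_entropy_nats(p_caseness):.3f} nats')
print(f'Scaled entropy = {scaled_entropy_percent(p_caseness):.1f} % of maximum')
```

Under these assumptions the sketch gives approximately 0.106 nats and approximately 15.4 %, in line with the output above.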


## Calculating hit rates

In [11]:
print("\n Calculating the hit rates of the caseness variable...")
hitRate_none, hitRate_all = hitrate_output(caseness_array.caseness_1isYes)
%store hitRate_none


 Calculating the hit rates of the caseness variable...
	 Hit rate (all) = 2.218 %
	 Hit rate (none) = 97.782 %
	 Odds (No : Yes) = 44-times less likely to demonstrate caseness than to not.
Stored 'hitRate_none' (float64)
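
The helper `hitrate_output` is also defined elsewhere. A minimal sketch of what the reported figures mean, assuming a hit rate is simply the percentage of records that a constant rule classifies correctly, is given below. The function names and toy labels are hypothetical; only the return order (none, then all) mirrors the call above.

```python
def hit_rates(labels):
    """Return (hit_rate_none, hit_rate_all), as percentages, for a list of 0/1 labels.

    hit_rate_all:  accuracy of the rule 'every record meets caseness'.
    hit_rate_none: accuracy of the rule 'no record meets caseness'.
    """
    n_records = len(labels)
    n_yes = sum(labels)
    hit_rate_all = n_yes / n_records * 100
    hit_rate_none = (n_records - n_yes) / n_records * 100
    return hit_rate_none, hit_rate_all

def odds_no_to_yes(labels):
    """Odds of not meeting caseness relative to meeting it."""
    n_yes = sum(labels)
    return (len(labels) - n_yes) / n_yes

# Made-up labels: 2 records with caseness among 100.
toy_labels = [1] * 2 + [0] * 98
toy_hit_rate_none, toy_hit_rate_all = hit_rates(toy_labels)
toy_odds = odds_no_to_yes(toy_labels)
print(toy_hit_rate_none, toy_hit_rate_all, toy_odds)
```

Applying the same ratio to the figures reported above, 97.782 / 2.218 is roughly 44, which is where the "44-times" odds statement comes from.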


## Concluding comments.

In [12]:
display(
    Markdown(
rf"""    
We now know that:
1. based on the scaled entropy, our variable indicating the caseness of complex mental health difficulties is $\le{round(entropy_caseness_scaled, 1)}\%$ as uncertain/surprising/unforeseeable
as it could possibly be; _and_

2. we would correctly classify $\ge{round(hitRate_none, 1)}\%$ of patient records in this sample if we simply assumed that none met our definition of complex mental health difficulties.

The first point tells us that definite caseness of complex mental health difficulties can be known with considerable certainty, in this dataset. There is only so much room for improvement via feature sets.

The second point defines a benchmark for the indicative performance of any feature set that we evaluate in our study. Specifically, any feature set that we suggest to improve our certainty of knowing
that someone has complex mental health difficulties must correctly identify $\ge{round(hitRate_none, 1)}\%$ of patient records in our sample. Otherwise, the added feature set is a needless
complication to our attempt to know whether or not someone has complex mental health difficulties (which we can often safely assume they don't). This is such a high benchmark that we will be very
unlikely to find such a feature set.

We must remember that we are not trying to out-predict an identification rule based on caseness prevalence. Rather, we are trying to find feature sets that correlate with
caseness. Large correlations would be difficult to find using variance-based methods like Pearson's product-moment correlation or regression because the variance of the
caseness variable is so low. Our approach based on mutual information is better suited to this situation because its fundamental concept is coincidence rather than covariance.
"""
        )
)

    
We now know that:
1. based on the scaled entropy, our variable indicating the caseness of complex mental health difficulties is $\le15.4\%$ as uncertain/surprising/unforeseeable
as it could possibly be; _and_

2. we would correctly classify $\ge97.8\%$ of patient records in this sample if we simply assumed that none met our definition of complex mental health difficulties.

The first point tells us that definite caseness of complex mental health difficulties can be known with considerable certainty, in this dataset. There is only so much room for improvement via feature sets.

The second point defines a benchmark for the indicative performance of any feature set that we evaluate in our study. Specifically, any feature set that we suggest to improve our certainty of knowing
that someone has complex mental health difficulties must correctly identify $\ge97.8\%$ of patient records in our sample. Otherwise, the added feature set is a needless
complication to our attempt to know whether or not someone has complex mental health difficulties (which we can often safely assume they don't). This is such a high benchmark that we will be very
unlikely to find such a feature set.

We must remember that we are not trying to out-predict an identification rule based on caseness prevalence. Rather, we are trying to find feature sets that correlate with
caseness. Large correlations would be difficult to find using variance-based methods like Pearson's product-moment correlation or regression because the variance of the
caseness variable is so low. Our approach based on mutual information is better suited to this situation because its fundamental concept is coincidence rather than covariance.


In [13]:
# Below, I compute the cells of the contingency table for a rule that says no one has caseness of complex mental health difficulties.
#
# True positives. Zero because the rule says no one has caseness, so no record is ever predicted positive.
tp = 0
# False positives. Zero for the same reason: the rule never predicts a positive.
fp = 0
# True negatives. Every record without caseness is classified correctly, so this is the 'none' hit rate applied to the study population.
tn = hitRate_none / 100 * count_studyPopulation
# False negatives. Every record with caseness is missed, so this is the 'all' hit rate (i.e. the caseness rate) applied to the study population.
fn = hitRate_all / 100 * count_studyPopulation

# Below, I compute the evaluation statistics.
#
# Class balance accuracy.
cba = round( 0.5 * ( (tp / max( (tp + fn), (tp + fp) ) ) + (tn / max( (tn + fp), (tn + fn) ) ) ), 2)
# Odds ratio.
OR = 'Not a number because one of the odds is zero.' if min( (tp * tn) , (fp * fn) ) == 0 else round( (tp * tn) / (fp * fn), 2)
# Positive predictive value.
ppv = 0 if (tp + fp) == 0 else round( tp / (tp + fp), 2)
# Negative predictive value.
npv = 0 if (tn + fn) == 0 else round( tn / (tn + fn), 2)

display(
    Markdown(
f"""    
If we assume a rule that says no record meets our definition for the caseness of complex mental health difficulties, then we get the following approximate values for our evaluation statistics:

| Statistic                      |    Value   |
| ------------------------------ | -----------|
| Normalised mutual information  | x \u2192 0 |
| Class balance accuracy         | {cba}      |
| Odds ratio                     | {OR}       |
| Positive predictive value      | {ppv}      |
| Negative predictive value      | {npv}      |

"""
    )
)

    
If we assume a rule that says no record meets our definition for the caseness of complex mental health difficulties, then we get the following approximate values for our evaluation statistics:

| Statistic                      |    Value   |
| ------------------------------ | -----------|
| Normalised mutual information  | x → 0 |
| Class balance accuracy         | 0.49      |
| Odds ratio                     | Not a number because one of the odds is zero.       |
| Positive predictive value      | 0      |
| Negative predictive value      | 0.98      |
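
The mutual information entry in the table can be illustrated with a short sketch. It assumes the usual plug-in estimate of mutual information from paired observations, in nats; the function name and toy data are hypothetical, and the notebook's own normalisation is not reproduced here. Because a rule that predicts the same value for every record is statistically independent of caseness, its mutual information is zero, and so is any normalised version of it.

```python
import math
from collections import Counter

def mutual_information_nats(x, y):
    """Plug-in estimate of the mutual information, in nats, between two paired sequences."""
    n = len(x)
    joint_counts = Counter(zip(x, y))
    x_counts = Counter(x)
    y_counts = Counter(y)
    mi = 0.0
    for (x_value, y_value), count in joint_counts.items():
        p_joint = count / n
        p_x = x_counts[x_value] / n
        p_y = y_counts[y_value] / n
        mi += p_joint * math.log(p_joint / (p_x * p_y))
    return mi

# Made-up caseness labels: 2 records with caseness among 100.
toy_caseness = [1] * 2 + [0] * 98

# The rule 'no record meets caseness' predicts 0 for every record.
rule_none = [0] * len(toy_caseness)
mi_rule_none = mutual_information_nats(rule_none, toy_caseness)

# For contrast, a perfect predictor shares all of the caseness variable's entropy.
mi_perfect = mutual_information_nats(toy_caseness, toy_caseness)

print(mi_rule_none, mi_perfect)
```

The contrast shows why a high hit rate alone says little here: the constant rule classifies 98 % of the toy records correctly yet shares no information with caseness.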

