# Caseness array

The purpose of this notebook is to produce the caseness array. The caseness array is an n-by-2 array that holds, for each patient, the patient ID and a binary indicator of whether the patient's record meets our definition of caseness of complex mental health difficulties.

## Imports

In [2]:
# Get helper functions.
%run 'UNSEEN_helper_functions.ipynb'
# Refresh stored variables, if they are present.
%store -r

## Load requisites

In [3]:
# Set index date. Usually CURRENT_DATE(), but 31 Dec 2021 is used until cB has fixed the missing prescriptions.
myIndexDate = '2021-12-31'

# Set the duration for which a person must have been registered with their current general practice, in years.
min_GP_registeration_duration = 1

# Set the capture window for criteria diagnoses and prescriptions, in years.
Dx_window = 10
Rx_window = 10
Rx_window_caseness = 10 # Must be <= Rx_window because Rx_window is applied first.

# Set parameters for disclosivity adjustments.
redaction_threshold = 7
target_round = 10

# Set the database attributes.
global server_id
server_id = 'yhcr-prd-phm-bia-core'
global database_id
database_id = 'CB_FDM_PrimaryCare_V7'

# Set folder location.
folder_loc = os.path.dirname(os.path.abspath("UNSEEN create caseness array.ipynb"))
folder = folder_loc + '/codelists/'

%store server_id database_id myIndexDate redaction_threshold \
target_round Dx_window Rx_window Rx_window_caseness min_GP_registeration_duration

Stored 'server_id' (str)
Stored 'database_id' (str)
Stored 'myIndexDate' (str)
Stored 'redaction_threshold' (int)
Stored 'target_round' (int)
Stored 'Dx_window' (int)
Stored 'Rx_window' (int)
Stored 'Rx_window_caseness' (int)
Stored 'min_GP_registeration_duration' (int)


## Load codelist CSV files.
We used [opencodelists.org](https://www.opencodelists.org) to define the codelists, which specify the sets of SNOMED-CT codes used to identify patients based on various attributes.

In [11]:
# Clinical codes of interest.
codes_to_query_mentalIllHealth = pandas.read_csv(folder + "mental_ill_health_codelist.txt", sep = '\t')
codes_to_query_bipolar = pandas.read_csv(folder + "ciaranmci-bipolar-disorder-6a0308d7.csv")
codes_to_query_schizophrenia = pandas.read_csv(folder + "ciaranmci-schizophrenia-05c53c03.csv")
# ## Exclude bipolar and schizophrenia from the study population.
codes_to_query_mentalIllHealth = pandas.DataFrame(
    list(
        set(codes_to_query_mentalIllHealth["Id"]).difference(
            set(codes_to_query_bipolar["code"]).union(
                set(codes_to_query_schizophrenia["code"])
            )
        )
    )
    ,columns = ["Id"]
)

# ## Create codelist for the cases.
codes_to_query_caseness = pandas.read_csv(folder + "ciaranmci-unseen-snomed-codes-to-identify-cmhd-0e6bb986.csv")
codes_to_query_devAcademicDisorder = pandas.read_csv(folder + "ciaranmci-developmental-academic-disorder-755c4650.csv")
# ## Exclude bipolar disorder, schizophrenia, and Developmental Academic Disorder from the cases.
codes_to_query_caseness = pandas.DataFrame(
    list(
        set(codes_to_query_caseness["code"]).difference(
            set(codes_to_query_bipolar["code"]).union(
                set(codes_to_query_schizophrenia["code"])
            ).union(
                set(codes_to_query_devAcademicDisorder["code"])
            )
        )
    )
    ,columns = ["code"]
)
codes_to_query_borderline = pandas.read_csv(folder + "ciaranmci-borderline-personality-disorder-1ed4af38.csv")
codes_to_query_chronicDepression = pandas.read_csv(folder + "ciaranmci-chronic-depression-53a65598.csv")
codes_to_query_chronicPTSD = pandas.read_csv(folder + "ciaranmci-chronic-post-traumatic-stress-disorder-3a96e263.csv")
codes_to_query_complexPTSD = pandas.read_csv(folder + "ciaranmci-complex-post-traumatic-stress-disorder-21876f2e.csv")
codes_to_query_dysthymia = pandas.read_csv(folder + "ciaranmci-dysthymia-6f6888c3.csv")
codes_to_query_personalityDisorder = pandas.read_csv(folder + "ciaranmci-personality-disorder-5c4cd31b.csv")

# Medications of interest.
medications_to_query_psychosisAndRelated = pandas.read_csv(folder + "UNSEEN_medications_psychosisAndRelated.csv")
medications_to_query_hypnoticsAndAnxiolytics = pandas.read_csv(folder + "UNSEEN_medications_hypnoticsAndAnxiolytics.csv")
medications_to_query_antidepressants = pandas.read_csv(folder + "UNSEEN_medications_antidepressants.csv")
medications_to_query_all = pandas.read_csv(folder + "UNSEEN_medications_list.csv")
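
The exclusion step used for the codelists above can be sketched with plain Python collections. This is a minimal illustration rather than the notebook's own code: the helper `exclude_codes` and the toy codes are hypothetical. Unlike `list(set(...))`, which does not guarantee any particular order, the sketch keeps the codes in their original order, which may make reruns easier to compare.

```python
# Toy codelists; the codes are made-up placeholders, not real SNOMED-CT codes.
codes_all = ["111", "222", "333", "444"]
codes_bipolar = ["222"]
codes_schizophrenia = ["444"]


def exclude_codes(base, *exclusions):
    """Return the codes in `base` that appear in none of the exclusion lists, keeping order."""
    excluded = set().union(*exclusions)
    return [code for code in base if code not in excluded]


codes_kept = exclude_codes(codes_all, codes_bipolar, codes_schizophrenia)
print(codes_kept)  # ['111', '333']
```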

## Define the BigQuery script that creates the study population.

I first specify the population of interest with `person_id` values. These `person_id` values are filtered as follows:
1. all people with a `person_id` in Connected Bradford's primary-care table,
2. aged between 18 and 70, inclusive, as of today's date (note: today's date is used rather than the `myIndexDate` date),
3. who have been registered with their practice for at least `min_GP_registeration_duration` years,
4. who have a record of a SNOMED-CT diagnostic code from `codes_to_query_mentalIllHealth` within `Dx_window` years prior to the `myIndexDate` date, or who have a record of prescriptions for medicines from `medications_to_query_all` within `Rx_window` years prior to the `myIndexDate` date,
5. excluding those people who have a record of a SNOMED-CT diagnostic code from `codes_to_query_schizophrenia` or `codes_to_query_bipolar`.

The sequence I apply is to first use a single query to satisfy filters 1, 2, and 3. Then, I run two separate queries to satisfy the components of filters 4 and 5: one query considers diagnostic codes, and the other considers medications. Finally, I combine the two sets of `person_id` values with UNION ALL and return the unique values.
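
The queries receive the codelists as array literals that are interpolated into the SQL text and read back with `UNNEST`. The following is a minimal sketch of that string-building step. The helper `to_unnest_array` and the toy codes are hypothetical, and the escaping of quotes and backslashes is an added precaution that the notebook's own code does not apply.

```python
def to_unnest_array(values):
    """Format values as the body of a BigQuery array literal of strings."""
    # Escape backslashes and single quotes so that a value cannot end the literal early.
    escaped = (str(v).replace("\\", "\\\\").replace("'", "\\'") for v in values)
    return ", ".join(f"'{v}'" for v in escaped)


toy_codes = [123456, 234567]  # made-up placeholders, not real SNOMED-CT codes
sql_snippet = (
    f"SELECT my_snomedcode FROM UNNEST([{to_unnest_array(toy_codes)}]) AS my_snomedcode"
)
print(sql_snippet)
# SELECT my_snomedcode FROM UNNEST(['123456', '234567']) AS my_snomedcode
```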

In [12]:
sql_declarations = \
"""
DECLARE myIndexDate DATE DEFAULT '""" + myIndexDate + """';
DECLARE min_GP_registeration_duration INT64 DEFAULT """ + str(min_GP_registeration_duration) + """;
DECLARE Rx_window INT64 DEFAULT """ + str(Rx_window) + """;
DECLARE Rx_window_caseness INT64 DEFAULT """ + str(Rx_window_caseness) + """;
DECLARE redaction_threshold INT64 DEFAULT """ + str(redaction_threshold) + """;
DECLARE target_round INT64 DEFAULT """ + str(target_round) + """;
"""

sql_studyPopulation = \
"""
WITH
# Set up table of SNOMED-CT codes that will be queried for filter 4.
tbl_codes_mentalIllHealth AS (
    SELECT
        my_snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_mentalIllHealth["Id"].tolist())) + """'
                ]) AS my_snomedcode
)
# Set up table of medications that will be queried for filter 4.
,tbl_medications AS (
    SELECT
        my_nameofmedication
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, medications_to_query_all["Medication"].tolist())) + """'
                ]) AS my_nameofmedication
)
# Set up table of SNOMED-CT codes that will be queried to define the caseness of complex mental health difficulties.
,tbl_codes_caseness AS (
    SELECT
        my_snomedcode
    FROM
        UNNEST([
                '""" + '\', \''.join(map(str, codes_to_query_caseness["code"].tolist())) + """'
                ]) AS my_snomedcode
)
# First query to satisfy filters 1, 2, and 3.
,tbl_persons_firstFilters AS (
    SELECT
        DISTINCT person.person_id
        ,person.year_of_birth
    FROM
        # Querying this table effectively applies filter 1.
        """ + server_id + """.""" + database_id + """.person
    # This join is filtering for GP registration.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srpatientregistration
        ON person.person_id = tbl_srpatientregistration.person_id
    WHERE
        # Apply filter 2.
        (EXTRACT(YEAR FROM CURRENT_DATE()) - person.year_of_birth) BETWEEN 18 AND 70
        AND
        # Apply filter 3.
        tbl_srpatientregistration.tbl_srpatientregistration_start_date <
            DATE_SUB(myIndexDate, INTERVAL min_GP_registeration_duration YEAR)
)     
# First component of filter 4: patients with SNOMED-CT codes of interest.
,tbl_persons_and_codes AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
        ,year_of_birth
        ,tbl_srcode.snomedcode
    FROM
        tbl_persons_firstFilters
    # This join gets the diagnostic SNOMED-CT codes, and filters for 
    # the patients for which we have diagnostic codes because it is an
    # INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srcode
        ON tbl_persons_firstFilters.person_id = tbl_srcode.person_id
    # This join is filtering for patients with the diagnostic SNOMED-CT codes
    # of interest by using an INNER JOIN, which acts like an intersection in
    # set operations.
    JOIN 
        tbl_codes_mentalIllHealth
        ON tbl_srcode.snomedcode = tbl_codes_mentalIllHealth.my_snomedcode
    WHERE
        # This filters for diagnoses prior to the index date.
        tbl_srcode.dateevent < myIndexDate
)
# Second component of filter 4: patients with prescriptions for medications of interest.
,tbl_persons_and_medications AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
        ,year_of_birth
        ,tbl_srprimarycaremedication.datemedicationstart # This extra column is needed for a later query to distinguish cases and controls.
    FROM
        tbl_persons_firstFilters
    # This join is adding the medication table so that I can query medications.
    # It also, effectively, removes any patients without a prescription because
    # it is an INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srprimarycaremedication
        ON tbl_persons_firstFilters.person_id = tbl_srprimarycaremedication.person_id
    # This cross join conveniently creates all possible combinations of values of the
    # previous join result and `tbl_medications`. This sets up my interim result to 
    # easily do a row-wise comparison of the medications of interest with the variously-
    # worded `nameofmedication` values in the database.
    CROSS JOIN
        tbl_medications
    WHERE
        # This filters for the medications of interest.
        REGEXP_CONTAINS(nameofmedication, tbl_medications.my_nameofmedication) = True
        AND
        DATE_DIFF(myIndexDate, CAST(tbl_srprimarycaremedication.datemedicationstart AS DATE), YEAR) BETWEEN 0 AND """ + str(Rx_window) + """
)
# Combine components of filter 4 with an OR statement.
# This completes the sequence to create the study population.
,tbl_studyPopulation_no_caseness AS (
    SELECT
        DISTINCT person_id
        ,year_of_birth
    FROM
        (
        SELECT person_id, year_of_birth FROM tbl_persons_and_codes
        UNION ALL
        SELECT person_id, year_of_birth FROM tbl_persons_and_medications
         )
)
"""

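The medication filter in the query above pairs every prescription record with every medication pattern and keeps the pairs where the pattern is found in the recorded name. The same idea can be sketched in plain Python with the standard `re` module. The records and patterns below are made up for illustration. Python's `re` and BigQuery's regular-expression engine differ in some details, so this only illustrates the logic.

```python
import re

# Made-up prescription records and patterns, for illustration only.
prescriptions = [
    (1, "Sertraline 50mg tablets"),
    (2, "Paracetamol 500mg tablets"),
    (3, "Diazepam 2mg tablets"),
]
patterns = ["Sertraline", "Diazepam"]

# Every (record, pattern) pair is considered, like the CROSS JOIN; the regular
# expression search then keeps only the pairs that match, like the WHERE clause.
matched_person_ids = sorted(
    {
        person_id
        for person_id, name in prescriptions
        for pattern in patterns
        if re.search(pattern, name)
    }
)
print(matched_person_ids)  # [1, 3]
```

Because the patterns are treated as regular expressions, the match is case-sensitive by default, and characters such as parentheses in a medication name would be read as regular-expression syntax rather than literal text.
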
## Define the BigQuery script that distinguishes the control-group patients from the patients with caseness of complex mental health difficulties.

The caseness and non-caseness groups are identified from `tbl_studyPopulation_no_caseness` by defining the caseness group as follows:
1. all people in `tbl_studyPopulation_no_caseness`,
2. who have a record of a SNOMED-CT diagnostic code from `codes_to_query_caseness` at any time prior to the `myIndexDate` date,
3. who have a record of prescriptions for medications from `medications_to_query_all` within `Rx_window_caseness` years prior to the `myIndexDate` date.

The sequence I apply is to first create a new SQL Common Table Expression (CTE) similar to `tbl_persons_and_codes` but where I select codes in `tbl_codes_caseness` rather than the codes in `tbl_codes_mentalIllHealth`. Then I create another CTE that filters the results from `tbl_persons_and_medications` for a `datemedicationstart` value within the bounds. I then do an INNER JOIN of these two CTEs to retrieve patients with a diagnostic code from `codes_to_query_caseness` _and_ recent prescriptions for medications from `medications_to_query_all`. I add a column called `caseness_1isYes` that equals 1 for all rows. Finally, I LEFT JOIN this result onto `tbl_studyPopulation_no_caseness` and use a CASE-WHEN statement to replace each NULL in `caseness_1isYes` with 0, which indicates that these patient records are controls.
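
The final step, a LEFT JOIN followed by a CASE-WHEN that turns NULL into 0, can be tried on toy data. The sketch below uses Python's built-in `sqlite3` module rather than BigQuery, and the table names `study_population` and `cases` are hypothetical stand-ins for the CTEs described above.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript(
    """
    CREATE TABLE study_population (person_id INTEGER);
    CREATE TABLE cases (person_id INTEGER, caseness_1isYes INTEGER);
    INSERT INTO study_population VALUES (1), (2), (3);
    INSERT INTO cases VALUES (2, 1);
    """
)
# Patients without a matching row in `cases` get NULL from the LEFT JOIN,
# which the CASE-WHEN statement converts to 0 (control).
rows = connection.execute(
    """
    SELECT
        study_population.person_id
        ,CASE WHEN cases.caseness_1isYes IS NULL THEN 0 ELSE 1 END AS caseness_1isYes
    FROM study_population
    LEFT JOIN cases
        ON study_population.person_id = cases.person_id
    ORDER BY study_population.person_id
    """
).fetchall()
connection.close()
print(rows)  # [(1, 0), (2, 1), (3, 0)]
```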

In [13]:
sql_caseness_CTEs = \
"""
,tbl_persons_with_caseness_codes AS (
    SELECT
        DISTINCT tbl_persons_firstFilters.person_id
    FROM
        tbl_persons_firstFilters
    # This join gets the diagnostic SNOMED-CT codes, and filters for 
    # the patients for which we have diagnostic codes because it is an
    # INNER JOIN.
    JOIN
        """ + server_id + """.""" + database_id + """.tbl_srcode
        ON tbl_persons_firstFilters.person_id = tbl_srcode.person_id
    # This join is filtering for patients with the diagnostic SNOMED-CT codes
    # of interest by using an INNER JOIN, which acts like an intersection in
    # set operations.
    JOIN 
        tbl_codes_caseness
        ON tbl_srcode.snomedcode = tbl_codes_caseness.my_snomedcode
    WHERE
        # This filters for diagnoses prior to the index date.
        tbl_srcode.dateevent < myIndexDate
)
,tbl_persons_with_caseness_medications AS (
    SELECT
        DISTINCT person_id
    FROM
        tbl_persons_and_medications
    WHERE
        DATE_DIFF(myIndexDate, CAST(datemedicationstart AS DATE), YEAR) BETWEEN 0 AND """ + str(Rx_window_caseness) + """

)
,tbl_persons_with_caseness_codes_and_medications AS (
    SELECT
        DISTINCT tbl_persons_with_caseness_codes.person_id
        ,1 AS caseness_1isYes
    FROM
        tbl_persons_with_caseness_codes
    JOIN
        tbl_persons_with_caseness_medications
        ON tbl_persons_with_caseness_codes.person_id = tbl_persons_with_caseness_medications.person_id
)
,tbl_studyPopulation AS (
    SELECT
        DISTINCT tbl_studyPopulation_no_caseness.person_id
        ,CASE WHEN caseness_1isYes IS NULL THEN 0 ELSE 1 END AS caseness_1isYes
    FROM
        tbl_studyPopulation_no_caseness
    LEFT JOIN
        tbl_persons_with_caseness_codes_and_medications
        ON tbl_studyPopulation_no_caseness.person_id = tbl_persons_with_caseness_codes_and_medications.person_id
)
"""

## Run the BigQuery query to produce the caseness_array.

In [8]:
sql_final_select =\
"""
SELECT * FROM tbl_studyPopulation ORDER BY person_id
"""
caseness_array = pandas.read_gbq(sql_declarations + sql_studyPopulation + sql_caseness_CTEs + sql_final_select)


# Calculate the prevalence of caseness among those with mental ill-health.
count_caseness = round(caseness_array.caseness_1isYes.sum() / target_round) * target_round
count_studyPopulation = round( len(caseness_array) / target_round ) * target_round
caseness_prevalence = count_caseness / count_studyPopulation
count_control = count_studyPopulation - count_caseness

if count_studyPopulation < redaction_threshold:
    print('The count of patients in the Connected Bradford dataset demonstrating mental ill-health is below our redaction threshold.')
    print('Therefore, no further action will be taken because the study population is too small.')

if count_caseness < redaction_threshold:
    print('The count of patients in the Connected Bradford dataset demonstrating the caseness of complex mental health difficulties is below our redaction threshold.')
    print('Therefore, we will use the redaction threshold as the imputed count of caseness.')
    count_caseness = redaction_threshold
    caseness_prevalence = count_caseness / count_studyPopulation
    count_control = count_studyPopulation - count_caseness
else:
    print(f'The prevalence of caseness among those with mental ill-health in the Connected Bradford dataset is \033[1m{round(caseness_prevalence * 100, 1)}%\033[0m. \n')

%store sql_declarations sql_studyPopulation caseness_array count_studyPopulation count_caseness count_control caseness_prevalence



The prevalence of caseness among those with mental ill-health in the Connected Bradford dataset is 2.0%. 

Stored 'sql_declarations' (str)
Stored 'sql_studyPopulation' (str)
Stored 'caseness_array' (DataFrame)
Stored 'count_studyPopulation' (int)
Stored 'count_caseness' (int)
Stored 'count_control' (int)
Stored 'caseness_prevalence' (float)
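
The disclosure adjustment applied to the counts above (round to the nearest `target_round`, then substitute `redaction_threshold` if the result is too small) can be written as two small functions. This is a sketch under the parameter values set earlier in this notebook; the function names are hypothetical.

```python
def round_to_nearest(count, target_round=10):
    """Round a count to the nearest multiple of `target_round`."""
    return round(count / target_round) * target_round


def disclosure_safe_count(count, redaction_threshold=7, target_round=10):
    """Round a count, then substitute the threshold if the result is below it."""
    rounded = round_to_nearest(count, target_round)
    return redaction_threshold if rounded < redaction_threshold else rounded


print(disclosure_safe_count(1234))  # 1230
print(disclosure_safe_count(3))     # 7
print(round_to_nearest(1245))       # 1240
```

Python's `round` sends ties to the nearest even number, which is why 1245 becomes 1240 rather than 1250. With a threshold of 7 and rounding to 10, the substitution only occurs when a count rounds to 0.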


## Calculating the entropy of the caseness

In [20]:
print("\n Calculating the entropy of the caseness variable in nats...")
entropy_caseness_scaled = entropy_output(caseness_array.caseness_1isYes)
%store entropy_caseness_scaled


 Calculating the entropy of the caseness variable in nats...
	 Caseness variable entropy = 0.097 nats
	 The caseness variable's entropy is 14.0 % of its theoretical maximum
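
The reported figures are consistent with the Shannon entropy of a binary variable, scaled by its maximum of ln(2) nats. The sketch below is an assumption about what is being calculated, not the code of the `entropy_output` helper. It uses the proportion of cases, 1.969 %, reported in the hit-rate output of this notebook.

```python
import math


def binary_entropy_nats(p):
    """Shannon entropy, in nats, of a binary variable with P(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


p_caseness = 0.01969  # proportion of cases, taken from the hit-rate output
entropy = binary_entropy_nats(p_caseness)
# The maximum entropy of a binary variable is ln(2) nats, reached at p = 0.5.
entropy_scaled = 100 * entropy / math.log(2)
print(f"{entropy:.3f} nats, {entropy_scaled:.1f} % of maximum")
# 0.097 nats, 14.0 % of maximum
```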



## Calculating hit rates

In [10]:
print("\n Calculating the hit rates of the caseness variable...")
hitRate_none, hitRate_all = hitrate_output(caseness_array.caseness_1isYes)
%store hitRate_none


 Calculating the hit rates of the caseness variable...
	 Hit rate (all) = 1.969 %
	 Hit rate (none) = 98.031 %
	 Odds (No : Yes) = 49-times less likely to demonstrate caseness than to not.
Stored 'hitRate_none' (float64)
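
The two hit rates are the percentages of records that would be classified correctly by a rule that labels no one as a case and by a rule that labels everyone as a case. The sketch below shows that calculation on a made-up vector; `hit_rates` is a hypothetical stand-in and is not the `hitrate_output` helper.

```python
def hit_rates(values):
    """Return (hit rate if none are cases, hit rate if all are cases) as percentages."""
    n = len(values)
    n_cases = sum(values)
    return 100 * (n - n_cases) / n, 100 * n_cases / n


toy_caseness = [0] * 49 + [1]  # made-up vector: 1 case among 50 records
rate_none, rate_all = hit_rates(toy_caseness)
print(rate_none, rate_all)   # 98.0 2.0
print(rate_none / rate_all)  # 49.0, the odds of No : Yes
```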


## Concluding comments.

In [3]:
display(
    Markdown(
f"""    
We now know that:
1. based on the scaled entropy, our variable indicating the caseness of complex mental health difficulties is $\le{round(entropy_caseness_scaled, 1)}\%$ as uncertain/surprising/unforeseeable
as it could possibly be; _and_

2. we would correctly classify $\ge{round(hitRate_none, 1)}\%$ of patient records in this sample if we simply assumed that none met our definition of complex mental health difficulties.

The first point tells us that definite caseness of complex mental health difficulties can be known with considerable certainty, in this dataset. There is only so much room for improvement via feature sets.

The second point defines a benchmark for the indicative performance of any feature set that we evaluate in our study. Specifically, any feature set that we suggest to improve our certainty of knowing
that someone has complex mental health difficulties must correctly identify $\ge{round(hitRate_none, 1)}\%$ of patient records in our sample. Otherwise, the added feature set is a needless
complication to our attempt to know whether or not someone has complex mental health difficulties (which we can often safely assume they don't). This is such a high benchmark that we will be very
unlikely to find such a feature set.

We must remember that we are not trying to out-predict an identification rule based on caseness prevalence. Rather, we are trying to find feature sets that correlate with this
caseness prevalence. Large correlations would be difficult to find using variance-based methods like Pearson's product moment correlation or regression methods because the variance of the
caseness variable is so low. Our approach based on mutual-information is better suited to this situation because its fundamental concept is coincidence rather than covariance.
"""
        )
)

    
We now know that:
1. based on the scaled entropy, our variable indicating the caseness of complex mental health difficulties is $\le14.0\%$ as uncertain/surprising/unforeseeable
as it could possibly be; _and_

2. we would correctly classify $\ge98.0\%$ of patient records in this sample if we simply assumed that none met our definition of complex mental health difficulties.

The first point tells us that definite caseness of complex mental health difficulties can be known with considerable certainty, in this dataset. There is only so much room for improvement via feature sets.

The second point defines a benchmark for the indicative performance of any feature set that we evaluate in our study. Specifically, any feature set that we suggest to improve our certainty of knowing
that someone has complex mental health difficulties must correctly identify $\ge98.0\%$ of patient records in our sample. Otherwise, the added feature set is a needless
complication to our attempt to know whether or not someone has complex mental health difficulties (which we can often safely assume they don't). This is such a high benchmark that we will be very
unlikely to find such a feature set.

We must remember that we are not trying to out-predict an identification rule based on caseness prevalence. Rather, we are trying to find feature sets that correlate with this
caseness prevalence. Large correlations would be difficult to find using variance-based methods like Pearson's product moment correlation or regression methods because the variance of the
caseness variable is so low. Our approach based on mutual-information is better suited to this situation because its fundamental concept is coincidence rather than covariance.
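
To illustrate what is meant by coincidence, the sketch below computes mutual information from co-occurrence counts alone. It is a minimal illustration under assumed toy data; the function name is hypothetical, and the estimator and normalisation used in the study may differ.

```python
import math
from collections import Counter


def mutual_information_nats(x, y):
    """Mutual information, in nats, between two equal-length discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    count_x = Counter(x)
    count_y = Counter(y)
    # Sum over observed (x, y) pairs of p(x, y) * ln( p(x, y) / (p(x) * p(y)) ).
    return sum(
        (n_xy / n) * math.log(n_xy * n / (count_x[a] * count_y[b]))
        for (a, b), n_xy in joint.items()
    )


caseness = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # made-up vector
feature_same = list(caseness)               # coincides perfectly with caseness
feature_constant = [0] * len(caseness)      # carries no information

mi_same = mutual_information_nats(caseness, feature_same)
mi_constant = mutual_information_nats(caseness, feature_constant)
print(round(mi_same, 3), mi_constant)  # 0.5 0.0
```

A feature that coincides perfectly with caseness shares all of the caseness variable's entropy, whereas a constant feature shares none.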


In [8]:
# Below, I compute the cells of the contingency table for a rule that says no one has caseness of complex mental health difficulties.
#
# True positives. Zero because the rule never predicts caseness, so no "positives" of any kind exist.
tp = 0
# False positives. Zero for the same reason.
fp = 0
# True negatives. Records without caseness, all of which the rule classifies correctly; this is the 'none' hit rate applied to the study population.
tn = hitRate_none / 100 * count_studyPopulation
# False negatives. Records with caseness, all of which the rule misses; this is the 'all' hit rate applied to the study population.
fn = hitRate_all / 100 * count_studyPopulation

# Below, I compute the evaluation statistics.
#
# Class balance accuracy.
cba = round( 0.5 * ( (tp / max( (tp + fn), (tp + fp) ) ) + (tn / max( (tn + fp), (tn + fn) ) ) ), 2)
# Odds ratio.
OR = 'Not a number because one of the odds is zero.' if min( (tp * tn) , (fp * fn) ) == 0 else round( (tp * tn) / (fp * fn), 2)
# Positive predictive value.
ppv = 0 if (tp + fp) == 0 else round( tp / (tp + fp), 2)
# Negative predictive value.
npv = 0 if (tn + fn) == 0 else round( tn / (tn + fn), 2)

display(
    Markdown(
f"""    
If we assume a rule that says no record meets our definition for the caseness of complex mental health difficulties, then we get the following approximate values for our evaluation statistics:

| Statistic                      |    Value   |
| ------------------------------ | -----------|
| Normalised mutual information  | x \u2192 0 |
| Class balance accuracy         | {cba}      |
| Odds ratio                     | {OR}       |
| Positive predictive value      | {ppv}      |
| Negative predictive value      | {npv}      |

"""
    )
)

    
If we assume a rule that says no record meets our definition for the caseness of complex mental health difficulties, then we get the following approximate values for our evaluation statistics:

| Statistic                      |    Value   |
| ------------------------------ | -----------|
| Normalised mutual information  | x → 0 |
| Class balance accuracy         | 0.49      |
| Odds ratio                     | Not a number because one of the odds is zero.       |
| Positive predictive value      | 0      |
| Negative predictive value      | 0.98      |

