# Feature-set array

The purpose of this notebook is to produce the initial feature-set array. The initial feature-set array is an n-by-p array containing the patient ID and, for each feature, a categorical vector indicating which level of the feature the patient is recorded as having expressed.

### Imports

In [1]:
import pandas
pandas.set_option('display.max_colwidth', None)
import numpy
from google.cloud import bigquery
from datetime import date
import scipy.stats
import math
import sklearn.metrics
import itertools
from IPython.display import display, Markdown, Latex

# Instantiate the BigQuery client.
client = bigquery.Client()

## Which feature sets to start with?

The intended use of the feature sets is to support clinicians in identifying patients with complex mental health difficulties so that resources can be appropriately delivered. Without information like feature sets, the clinician’s expected accuracy in identifying patients with complex mental health difficulties is 50%, a coin toss. Alternatively, with no other information, one could assume that everyone has complex mental health difficulties or that no one does. A balance must be struck between these extremes to facilitate optimal and appropriate use of healthcare resources that maximises benefit for the unidentified patients in need. Some boundaries can be defined for this purpose.

The first boundary is that any useful feature set must not be more than 50% prevalent because although a greater prevalence might improve informativeness in the worst-case scenario, it would be at the cost of positive and negative predictive values in any scenario.

The second boundary requires some analysis. If the caseness of complex mental health difficulties were a fundamental concept, then any useful feature set would need to be at least as prevalent as complex mental health difficulties, because a lesser prevalence would, at best, worsen informativeness and negative predictive value with no benefit to positive predictive value. However, the caseness of complex mental health difficulties is a composite of independent component criteria about records of medications and diagnoses. Specifically, it is indicated by the presence of a diagnosis and then further qualified by the presence of a prescription within three months of our data extraction.

Therefore, if we consider prevalence as a probability, the sentiment of the second boundary is better worded as "any useful feature set must be at least as prevalent _as the least-prevalent combination of criterion diagnosis and medication_". Practically, this equates to the product of the smallest condition and medication prevalence values.
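
As a worked illustration of this lower bound, the following sketch multiplies the smallest diagnosis prevalence by the smallest medication prevalence and converts both bounds into patient counts. The function name `prevalence_bounds` and all input values are hypothetical, chosen for easy arithmetic; they are not taken from the Connected Bradford data.

```python
import math


def prevalence_bounds(diagnosis_prevs, medication_prevs, n_patients, upper_prev=0.5):
    """Return (lower_prev, lower_count, upper_count) for feature-set prevalence.

    Prevalences are proportions in [0, 1]. The lower bound treats the criteria
    as independent, so the joint prevalence is the product of the two minima.
    """
    lower_prev = min(diagnosis_prevs) * min(medication_prevs)
    lower_count = math.floor(lower_prev * n_patients)
    upper_count = math.floor(upper_prev * n_patients)
    return lower_prev, lower_count, upper_count


# Hypothetical prevalences and cohort size, for illustration only.
lower_prev, lower_count, upper_count = prevalence_bounds(
    diagnosis_prevs=[0.25, 0.5],
    medication_prevs=[0.125, 0.5],
    n_patients=10_000,
)
print(lower_prev, lower_count, upper_count)
```

Rounding the counts down mirrors the `int(...)` truncation used elsewhere in this notebook.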

The prevalences of the component criterion diagnoses are available in our script that calculates the caseness variable for complex mental health difficulties.
<br>
<br>

__*__ Note that for all arguments presented, we assume that the only signal of interest is the presence of the feature set rather than its absence because it is not possible to distinguish missingness from the genuine absence of a feature set (and each would provide different, indistinguishable information).

In [4]:
%run ./"UNSEEN create caseness array.ipynb"
display(
    Markdown("""
Running our script to calculate the prevalence of our caseness variable shows that
no component diagnosis has a prevalence less than $%s\%%$. This sets the lower bound for feature-set prevalence.

We conclude based on our arguments that a feature-set prevalence must satisfy $%s\%% \le prevalence_{feature\ set_{i}} \le 50\%%$,
or $%s \le count\ of\ patients_{feature\ set_{i}} \le %s$
"""
       %(min_criterion_prev,
         min_criterion_prev,
         f'{int(min_criterion_prev / 100 * denominator_as_int):,}', 
         f'{int(0.5 * denominator_as_int):,}' )
       )
)


Running our script to calculate the prevalence of our caseness variable shows that
no component diagnosis has a prevalence less than $0.03\%$. This sets the lower bound for feature-set prevalence.

We conclude based on our arguments that a feature-set prevalence must satisfy $0.03\% \le prevalence_{feature\ set_{i}} \le 50\%$,
or $210 \le count\ of\ patients_{feature\ set_{i}} \le 351,605$


### What is the prevalence of single-SNOMED-CT code feature sets?
The first feature sets to be assessed are individual SNOMED-CT codes found in the Connected Bradford primary care table.

The tables output below show how many unique SNOMED-CT codes occur in a given number of patients' records. The first table aggregates the patient counts into ranges from $<10$ to $\ge 10,000,000$ by factors of 10. The second table aggregates the patient counts into the ranges defined by the arguments made previously.
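
To make the two groupings concrete, here is a minimal pure-Python sketch of the same binning logic. The helper names and the per-code counts are hypothetical, and the labels are simplified relative to the SQL that follows; only the bounds of 210 and 351,605 come from the result above.

```python
from collections import Counter


def decade_label(count):
    """Label a patient count by its power-of-ten range."""
    if count < 10:
        return "<10"
    if count >= 10_000_000:
        return "code >= 10,000,000"
    low = 10 ** (len(str(count)) - 1)  # largest power of ten not exceeding count
    return f"{low:,} <= code < {low * 10:,}"


def boundary_label(count, lower_bound, upper_bound):
    """Classify a patient count against the inclusive prevalence bounds."""
    if count < lower_bound:
        return "too infrequent"
    if count <= upper_bound:
        return "within bounds"
    return "too frequent"


# Hypothetical number of patients per code, for illustration only.
patients_per_code = {"code_a": 3, "code_b": 250, "code_c": 42_000, "code_d": 9}
lower_bound, upper_bound = 210, 351_605

decade_tally = Counter(decade_label(c) for c in patients_per_code.values())
boundary_tally = Counter(
    boundary_label(c, lower_bound, upper_bound) for c in patients_per_code.values()
)
print(decade_tally)
print(boundary_tally)
```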

In [7]:
# Declare your redaction threshold and target rounding number.
redaction_threshold = 7
target_round = 10
sql_variables = \
"""
DECLARE redaction_threshold INT64 DEFAULT """ + str(redaction_threshold) + """;
DECLARE target_round INT64 DEFAULT """ + str(target_round) + """;
"""

# Declare lower and upper boundaries for feature-set prevalence.
lower_bound = int(min_criterion_prev / 100 * denominator_as_int)
upper_bound = int(0.5 * denominator_as_int)
sql_variables = \
    sql_variables + \
"""
DECLARE lower_bound INT64 DEFAULT """ + str(lower_bound) + """;
DECLARE upper_bound INT64 DEFAULT """ + str(upper_bound) + """;
"""


sql_base = \
"""
WITH
tbl_persons AS (
SELECT
    DISTINCT person_id
FROM
    yhcr-prd-phm-bia-core.CY_MYSPACE_CMC.person
# Limiting to age range 18-70.
WHERE
    (EXTRACT(YEAR FROM CURRENT_DATE()) - year_of_birth) BETWEEN 18 AND 70
)
,tbl_patients_per_code AS (
SELECT
    DISTINCT a.src_snomedcode,
    COUNT(DISTINCT tbl_persons.person_id) AS count_patients_with_code
FROM
    `yhcr-prd-phm-bia-core.CY_FDM_PrimaryCare_v5.tbl_SRCode` AS a    
RIGHT JOIN
    tbl_persons
    ON a.person_id = tbl_persons.person_id
GROUP BY
    a.src_snomedcode
ORDER BY
    count_patients_with_code DESC
)
"""

sql_full_table = \
"""
,tbl_category_full AS
(
SELECT
  DISTINCT src_snomedcode
  ,CASE
    WHEN count_patients_with_code < 10 THEN "<10"
    WHEN count_patients_with_code < 100 THEN "10 =< code < 100"
    WHEN count_patients_with_code < 1000 THEN "100 =< code < 1,000"
    WHEN count_patients_with_code < 10000 THEN "1,000 =< code < 10,000"
    WHEN count_patients_with_code < 100000 THEN "10,000 =< code < 100,000"
    WHEN count_patients_with_code < 1000000 THEN "100,000 =< code < 1,000,000"
    WHEN count_patients_with_code < 10000000 THEN "1,000,000 =< code < 10,000,000"
    WHEN count_patients_with_code >= 10000000 THEN "code >= 10,000,000"
  END AS cnt_SNOMED
FROM tbl_patients_per_code
ORDER BY cnt_SNOMED
)

SELECT
  COUNT(cnt_SNOMED) AS This_many_codes__
  ,cnt_SNOMED AS __occur_for_this_many_patients
FROM tbl_category_full
GROUP BY cnt_SNOMED
ORDER BY This_many_codes__ DESC
"""
full_Table = client.query(sql_variables + sql_base + sql_full_table).to_dataframe()
display(full_Table)

sql_boundary_table = \
"""
,tbl_category_boundary AS
(
SELECT
  DISTINCT src_snomedcode
  ,CASE
    WHEN count_patients_with_code < lower_bound THEN "too infrequent (occurs in < """ + f'{lower_bound:,}' + """ patients' records)"
    WHEN count_patients_with_code <= upper_bound THEN "within bounds"
    ELSE "too frequent (occurs in > """ + f'{upper_bound:,}' + """ patients' records)"
  END AS cnt_SNOMED
FROM tbl_patients_per_code
ORDER BY cnt_SNOMED
)

SELECT
  COUNT(cnt_SNOMED) AS This_many_codes__
  ,cnt_SNOMED AS __occur_this_often
FROM tbl_category_boundary
GROUP BY cnt_SNOMED
ORDER BY This_many_codes__ DESC
"""
boundary_Table = client.query(sql_variables + sql_base + sql_boundary_table).to_dataframe()
display(boundary_Table)

# Prepare the table for extracting data.
boundary_Table.set_index('__occur_this_often', inplace = True)
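# Select the single count cell explicitly; calling int() on a whole row is deprecated in recent pandas.
n_within_bounds = int(boundary_Table.loc['within bounds', 'This_many_codes__'])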

| | `This_many_codes__` | `__occur_for_this_many_patients` |
|---|---|---|
| 0 | 40287 | <10 |
| 1 | 22996 | 10 =< code < 100 |
| 2 | 12139 | 100 =< code < 1,000 |
| 3 | 5483 | 1,000 =< code < 10,000 |
| 4 | 1357 | 10,000 =< code < 100,000 |
| 5 | 146 | 100,000 =< code < 1,000,000 |


| | `This_many_codes__` | `__occur_this_often` |
|---|---|---|
| 0 | 68164 | too infrequent (occurs in < 210 patients' records) |
| 1 | 14210 | within bounds |
| 2 | 34 | too frequent (occurs in > 351,605 patients' records) |


In [8]:
# Display message.
display(
    Markdown(
"""
The tables above show that most SNOMED-CT codes occur infrequently in patients' records, with a
handful of codes showing up in many patients' records.

Recall that for a feature to be informative, it must occur in
$%s \le count\ of\ patients_{feature\ set_{i}} \le %s$ patients' records.

#### Interim conclusion
We can infer that __%s feature sets (defined solely by the presence of a single SNOMED-CT code) might be
informative of the caseness of complex mental health difficulties, in our particular cohort within the Connected Bradford dataset__.
"""
        %(f'{int(min_criterion_prev / 100 * denominator_as_int):,}',
          f'{int(0.5 * denominator_as_int):,}',
         f'{n_within_bounds:,}')
    )
)


The tables above show that most SNOMED-CT codes occur infrequently in patients' records, with a
handful of codes showing up in many patients' records.

Recall that for a feature to be informative, it must occur in
$210 \le count\ of\ patients_{feature\ set_{i}} \le 351,605$ patients' records.

#### Interim conclusion
We can infer that __14,210 feature sets (defined solely by the presence of a single SNOMED-CT code) might be
informative of the caseness of complex mental health difficulties, in our particular cohort within the Connected Bradford dataset__.


#### Making a list of the single-feature feature sets of interest
The following code defines a list of SNOMED-CT codes (that appear in our cohort from the Connected Bradford dataset) that we will carry forward as single-feature feature sets.

In [10]:
sql_singleFS_select = \
"""
SELECT
    src_snomedcode
FROM
    tbl_patients_per_code
WHERE
    count_patients_with_code BETWEEN lower_bound AND upper_bound
"""
ls_single_feature_featureSet = client.query(sql_variables + sql_base + sql_singleFS_select).to_dataframe()
display(ls_single_feature_featureSet)
ls_single_feature_featureSet.to_csv("Feature set lab/" + "ls_single_feature_featureSet.csv", index = False)

| | `src_snomedcode` |
|---|---|
| 0 | 1020291000000106 |
| 1 | 1022791000000101 |
| 2 | 78564009 |
| 3 | 279991000000102 |
| 4 | 773011000000101 |
| ... | ... |
| 14205 | 184076002 |
| 14206 | 73103007 |
| 14207 | 224973000 |
| 14208 | 310088004 |


### What is the prevalence of pair- and triplet-composite feature sets?
The next question we might ask is whether pairs or triplets of SNOMED-CT codes might be informative of the caseness of complex mental health difficulties.

Re-running the previous SQL and Python scripts isn't a good idea because the volume of combinations quickly becomes too great. Instead, we will handle composite feature sets in a loop during the evaluation stage so as not to overburden memory.
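
To give a sense of scale, the 14,210 single-code feature sets found above yield about 100 million distinct pairs and hundreds of billions of distinct triplets. The sketch below shows one possible way to walk such combinations lazily, holding only one combination in memory at a time. It is an illustration rather than the evaluation-stage implementation: the function name, the toy `patients_by_code` mapping, and the assumption that a composite feature set means all of its codes co-occur in a patient's record are all hypothetical.

```python
import itertools
import math


def iter_composites_within_bounds(patients_by_code, size, lower_bound, upper_bound):
    """Lazily yield (codes, n_patients) for composites whose co-occurrence
    count lies within the inclusive bounds."""
    for combo in itertools.combinations(sorted(patients_by_code), size):
        # Patients whose records contain every code in the combination.
        shared = set.intersection(*(patients_by_code[code] for code in combo))
        if lower_bound <= len(shared) <= upper_bound:
            yield combo, len(shared)


# Number of candidate composites for the 14,210 single-code feature sets.
n_codes = 14_210
n_pairs = math.comb(n_codes, 2)
n_triplets = math.comb(n_codes, 3)
print(f"{n_pairs:,} pairs; {n_triplets:,} triplets")

# Toy mapping of code -> set of patient IDs, for illustration only.
toy_patients_by_code = {
    "A": {1, 2, 3, 4},
    "B": {2, 3, 4, 5},
    "C": {4, 5, 6},
}
pairs = list(
    iter_composites_within_bounds(toy_patients_by_code, 2, lower_bound=2, upper_bound=3)
)
print(pairs)
```

Because the function is a generator, the caller can evaluate and discard each composite in turn instead of materialising the full list of combinations.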