# Feature-set array

The purpose of this notebook is to produce the initial feature-set array. The initial feature-set array is an n-by-p array containing the patient ID and, for each feature, a categorical vector indicating which level of that feature the patient is recorded as having expressed.

### Imports

In [1]:
import pandas
import numpy
from google.cloud import bigquery
from datetime import date
import scipy.stats
import math
import sklearn.metrics

# Instantiate the BigQuery client.
client = bigquery.Client()

## Which feature sets to start with?
The initial feature sets will be individual SNOMED CT codes found in the Connected Bradford primary care table. However, many SNOMED CT codes are used so infrequently that they cannot provide any information about our caseness variable, i.e. complex mental health difficulties (CMHD).

The table below shows the number of unique SNOMED CT codes whose occurrence counts fall within each range from $<10$ to $\ge 10,000,000$, with range boundaries at successive powers of 10.

In [2]:
sql = """
WITH
tbl AS
(
SELECT
  src_snomedcode
  ,CASE
    WHEN COUNT(src_snomedcode) < 10 THEN "<10"
    WHEN COUNT(src_snomedcode) < 100 THEN "10 =< code < 100"
    WHEN COUNT(src_snomedcode) < 1000 THEN "100 =< code < 1,000"
    WHEN COUNT(src_snomedcode) < 10000 THEN "1,000 =< code < 10,000"
    WHEN COUNT(src_snomedcode) < 100000 THEN "10,000 =< code < 100,000"
    WHEN COUNT(src_snomedcode) < 1000000 THEN "100,000 =< code < 1,000,000"
    WHEN COUNT(src_snomedcode) < 10000000 THEN "1,000,000 =< code < 10,000,000"
    WHEN COUNT(src_snomedcode) >= 10000000 THEN "code >= 10,000,000"
  END AS cnt_SNOMED
FROM `yhcr-prd-phm-bia-core.CY_FDM_PrimaryCare_v5.tbl_SRCode`
GROUP BY src_snomedcode
)

SELECT
  cnt_SNOMED AS Count_category
  ,COUNT(cnt_SNOMED) AS Count_of_occurences
FROM tbl
GROUP BY cnt_SNOMED
ORDER BY Count_of_occurences DESC
"""
bqTable = client.query(sql).to_dataframe()
bqTable

Unnamed: 0,Count_category,Count_of_occurences
0,<10,38831
1,10 =< code < 100,25977
2,"100 =< code < 1,000",15262
3,"1,000 =< code < 10,000",7678
4,"10,000 =< code < 100,000",2975
5,"100,000 =< code < 1,000,000",538
6,"1,000,000 =< code < 10,000,000",94
7,"code >= 10,000,000",2
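
The binning performed by the query above can be sketched locally in plain Python. This is only an illustration on made-up data: `toy_codes` and the helper `count_category` are hypothetical names, and only the first few bins are reproduced.

```python
from collections import Counter

# Hypothetical toy data: one list entry per recorded code.
# These are placeholders, not real SNOMED CT codes or Connected Bradford records.
toy_codes = ["A"] * 3 + ["B"] * 12 + ["C"] * 150 + ["D"] * 7


def count_category(n_occurrences):
    """Label a code's occurrence count with its power-of-ten bin."""
    if n_occurrences < 10:
        return "<10"
    if n_occurrences < 100:
        return "10 =< code < 100"
    if n_occurrences < 1000:
        return "100 =< code < 1,000"
    return "code >= 1,000"


# Occurrences per code (mirrors the inner GROUP BY src_snomedcode).
occurrences_per_code = Counter(toy_codes)

# Number of distinct codes per bin (mirrors the outer GROUP BY cnt_SNOMED).
codes_per_category = Counter(
    count_category(n) for n in occurrences_per_code.values()
)
print(codes_per_category)
```

Here codes `A` and `D` fall in the lowest bin, while `B` and `C` each fall in a bin of their own.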


The table shows that most SNOMED CT codes are recorded infrequently. We also know that the prevalence of CMHD in the sample is 13.7%. If we assume that the only signal of interest is from the _presence_ of a clinical code rather than its absence*, then the minimum odds ratio for a given feature can only occur if a) the prevalence of the feature (or 1 - the prevalence) is 50%, and b) the instances of the feature are evenly distributed across the levels of the caseness variable. Taking the first criterion only, we would need at least $0.5\times n_{sample}$ instances of a SNOMED CT code, assuming that each code is recorded at most once per patient. Given that $n_{sample}\approx 699,620$, a clinical code needs to occur at least $349,810$ times to be of any use. By this logic, only the codes in categories 5', 6 and 7 should be taken forward for further study, where 5' is a narrowed version of category 5 defined by $349,810 \le code < 1,000,000$. This logic is followed while acknowledging that reaching $349,810$ occurrences is a necessary but not sufficient condition for a SNOMED CT code to be useful.
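
The threshold arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `min_code_occurrences` is made up for illustration, and it relies on the same assumption as the text, namely that each code is counted at most once per patient.

```python
import math


def min_code_occurrences(n_sample, required_prevalence=0.5):
    """Smallest occurrence count that reaches the required prevalence,
    assuming each code is recorded at most once per patient."""
    return math.ceil(required_prevalence * n_sample)


n_sample = 699_620  # approximate sample size quoted in the text
threshold = min_code_occurrences(n_sample)
print(threshold)  # 349810
```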

The final count of unique SNOMED CT codes that will be carried forward for study, `Count_of_codes`, is provided below.
<br/><br/><br/>

*I think I am justified in assuming that the presence of a clinical code is more informative than its absence because so many SNOMED CT codes are never used that distinguishing relevant from irrelevant absences is overly burdensome. Based on the counts shown in the table above, our sample cohort uses only approximately a quarter of the $352,567$ SNOMED CT codes.
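
The "approximately a quarter" figure in the footnote can be checked against the table. The sketch below simply sums the per-category counts copied from the table above and divides by the total of $352,567$ codes quoted in the footnote. The variable names are illustrative only.

```python
# Counts of unique codes per category, copied from the table above (rows 0 to 7).
codes_per_category = [38831, 25977, 15262, 7678, 2975, 538, 94, 2]

n_codes_used = sum(codes_per_category)
n_codes_total = 352_567  # total number of SNOMED CT codes quoted in the footnote

fraction_used = n_codes_used / n_codes_total
print(n_codes_used, round(fraction_used, 3))
```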

In [3]:
sql = """
WITH
tbl AS
(
SELECT
    src_snomedcode
    ,COUNT(src_snomedcode) AS cnt_code
FROM `yhcr-prd-phm-bia-core.CY_FDM_PrimaryCare_v5.tbl_SRCode`
GROUP BY src_snomedcode
)


SELECT
  COUNT(src_snomedcode) AS Count_of_codes
FROM tbl
WHERE
    cnt_code >= 349810
"""
bqTable = client.query(sql).to_dataframe()
bqTable

Unnamed: 0,Count_of_codes
0,216


## Creating the initial feature-set array

To produce the initial feature-set array, we need to define the list of unique SNOMED CT codes and check each patient's primary care record for those codes. The BigQuery query in the next code cell produces an n-by-p array where each column contains the number of times that a code is recorded for a given patient.
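
As a rough picture of the target shape, the same kind of patient-by-code count array can be built on toy data with `pandas.crosstab`. This is only a sketch and not the notebook's method: the records and code values are hypothetical, and the `_` column prefix is copied from the naming used in the BigQuery pivot.

```python
import pandas

# Hypothetical toy records: one row per (patient, recorded code) pair.
toy_records = pandas.DataFrame(
    {
        "person_id": [1, 1, 1, 2, 3, 3],
        "SNOMEDcode": ["111", "111", "222", "222", "111", "333"],
    }
)

# Rows are patients, columns are codes, cells are occurrence counts.
toy_array = pandas.crosstab(toy_records["person_id"], toy_records["SNOMEDcode"])

# Prefix the code columns with "_" and expose person_id as an ordinary column.
toy_array = toy_array.add_prefix("_").reset_index()
toy_array.columns.name = None
print(toy_array)
```

One difference worth noting is that `crosstab` only returns patients who appear in the records, whereas a query that left-joins from the table of persons also keeps patients with no codes of interest, as rows of zeros.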

In [3]:
# I'm thankful for the following stackoverflow thread about pivot queries:
# https://stackoverflow.com/questions/50293482/how-to-create-crosstab-with-two-field-in-bigquery-with-standart-or-legacy-sql.

sql_with = """
WITH
tbl_persons AS
(
SELECT
    DISTINCT person_id
FROM
    `yhcr-prd-phm-bia-core.CY_MYSPACE_CMC.person`
# Limiting to age range 18-70.
WHERE
    (EXTRACT(YEAR FROM CURRENT_DATE()) - year_of_birth) BETWEEN 18 AND 70
)
,tbl_codes_and_count AS
(
SELECT
    src_snomedcode
    ,COUNT(src_snomedcode) AS cnt_code
FROM `yhcr-prd-phm-bia-core.CY_FDM_PrimaryCare_v5.tbl_SRCode`
GROUP BY src_snomedcode
)
,tbl_codes_of_interest AS
(
SELECT
  src_snomedcode AS SNOMEDcode
FROM tbl_codes_and_count
WHERE
    cnt_code >= (SELECT COUNT(person_id)/2 FROM tbl_persons)
    # The justification for this filter is described in the
    # previous section of this notebook.
)
,tbl_persons_and_codes AS
(
SELECT
    tbl_persons.person_id
    ,tbl_codes.src_snomedcode
FROM 
    tbl_persons
LEFT JOIN
    `yhcr-prd-phm-bia-core.CY_FDM_PrimaryCare_v5.tbl_SRCode` AS tbl_codes
ON
    tbl_persons.person_id = tbl_codes.person_id
)
,tbl_persons_codes_of_interest AS
(
SELECT
  tbl_persons_and_codes.person_id
  ,tbl_codes_of_interest.SNOMEDcode
FROM
  tbl_persons_and_codes
LEFT JOIN
  tbl_codes_of_interest
ON 
  tbl_codes_of_interest.SNOMEDcode = tbl_persons_and_codes.src_snomedcode
)
"""
sql_pivot = """
SELECT
    CONCAT("SELECT person_id,", STRING_AGG(CONCAT("COUNTIF(SNOMEDcode='",SNOMEDcode,"') AS `_",SNOMEDcode,"`")), 
        " FROM `tbl_persons_codes_of_interest`",
        " GROUP BY person_id ORDER BY person_id")
FROM (  SELECT DISTINCT SNOMEDcode FROM `tbl_persons_codes_of_interest` ORDER BY SNOMEDcode  )
"""

sql = client.query(sql_with + sql_pivot).to_dataframe()['f0_'].iloc[0]
featureSet_array = client.query(sql_with + sql).to_dataframe()
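
The two-step pattern above first asks BigQuery to write the text of a pivot query (one `COUNTIF` column per code) and then runs that text. To make the generated text easier to picture, the sketch below assembles an equivalent string in plain Python. The helper `build_pivot_sql` and the codes passed to it are hypothetical, and sorting the codes in Python is an assumption about the intended column order.

```python
def build_pivot_sql(codes, source_table="tbl_persons_codes_of_interest"):
    """Assemble a COUNTIF pivot query like the text generated by the
    CONCAT/STRING_AGG query above (comma-separated, one column per code)."""
    countif_columns = ",".join(
        f"COUNTIF(SNOMEDcode='{code}') AS `_{code}`" for code in sorted(codes)
    )
    return (
        f"SELECT person_id,{countif_columns}"
        f" FROM `{source_table}`"
        " GROUP BY person_id ORDER BY person_id"
    )


toy_codes = ["222", "111"]  # hypothetical codes, not real SNOMED CT identifiers
pivot_sql = build_pivot_sql(toy_codes)
print(pivot_sql)
```

The leading underscore in each alias matters because the codes are numeric strings, and a column name starting with a digit would otherwise be awkward to reference downstream.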

## Creating the feature-set ID table
This table is a look-up table of feature-set IDs that shows which features make up each feature set. The table is instantiated on the assumption that feature sets will include no more than five features.

In [5]:
# Instantiate the feature set id table.
featureSet_ID_table = \
    pandas.DataFrame(columns = ['Feature set ID', 'Feature Set 1', 'Feature Set 2',
                               'Feature Set 3', 'Feature Set 4', 'Feature Set 5'
                               ])
# Populate the feature set id table with the individual features.
featureSet_ID_table['Feature set ID'] = \
    featureSet_ID_table['Feature Set 1'] = \
        featureSet_array.columns[featureSet_array.columns != 'person_id']
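
To show what the look-up table holds at this stage, here is a small sketch on a toy stand-in for `featureSet_array`. The names `toy_array` and `toy_ID_table` and the two code columns are hypothetical. Each single-feature set is given its own column name as its ID, and the remaining four slots are left empty.

```python
import pandas

# Hypothetical stand-in for featureSet_array with two code columns.
toy_array = pandas.DataFrame(
    {"person_id": [1, 2], "_111": [2, 0], "_222": [1, 1]}
)

# Every column except person_id is an individual feature.
feature_names = toy_array.columns[toy_array.columns != "person_id"]

# One row per single-feature set; unused feature slots are filled with NaN.
toy_ID_table = pandas.DataFrame(
    {"Feature set ID": feature_names, "Feature Set 1": feature_names},
    columns=["Feature set ID", "Feature Set 1", "Feature Set 2",
             "Feature Set 3", "Feature Set 4", "Feature Set 5"],
)
print(toy_ID_table)
```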