# Filtering feature sets based on prevalence

The intended use of the feature sets is to support clinicians in identifying patients with complex mental health difficulties so that resources can be appropriately delivered. Without information like feature sets, the clinician’s expected accuracy in identifying patients with complex mental health difficulties is $50\%$, a coin toss. Alternatively, with no other information, one could assume that everyone has complex mental health difficulties or that no one does. A balance must be struck between these extremes to facilitate optimal and appropriate use of healthcare resources that maximises benefit for the unidentified patients in need. Some boundaries can be defined for this purpose.

We will only consider feature sets whose prevalence is between $0.7$ and $1.4$ times the prevalence of the caseness variable. This is based on our simulations, which suggest that this prevalence range is needed for a feature set to be capable of a normalised mutual information $\ge0.8$ in the best-case scenario, across a range of caseness prevalence values. In middle- and worst-case scenarios, it would not be possible for feature sets to satisfy the $\ge0.8$ threshold for normalised mutual information. Further explanations are in `Supplementary material xNMIsimx`. Initially, we prefer our high but informative threshold over a lower and less-informative threshold because it maintains the caseness variable as the target reference rather than the candidate feature sets.
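The prevalence window described above can be sketched as follows. This is a minimal illustration rather than the notebook's own code: `prevalence_window` and `within_window` are hypothetical names, prevalence is assumed to be expressed as a proportion between 0 and 1, and the default scaling factors are simply the $0.7$ and $1.4$ stated in the text.

```python
def prevalence_window(caseness_prevalence, lower_scale=0.7, upper_scale=1.4):
    """Return (lower, upper) prevalence bounds for candidate feature sets.

    Hypothetical helper: `caseness_prevalence` is assumed to be a proportion
    in [0, 1]; the scaling factors default to the values quoted in the text.
    """
    if not 0.0 <= caseness_prevalence <= 1.0:
        raise ValueError("caseness_prevalence must be a proportion in [0, 1]")
    lower = caseness_prevalence * lower_scale
    # A prevalence cannot exceed 1, so the upper bound is capped.
    upper = min(1.0, caseness_prevalence * upper_scale)
    return lower, upper


def within_window(feature_prevalence, caseness_prevalence):
    """True if a feature set's prevalence lies inside the window."""
    lower, upper = prevalence_window(caseness_prevalence)
    return lower <= feature_prevalence <= upper
```

For example, with a caseness prevalence of `0.10`, a feature set with prevalence `0.13` would be kept and one with prevalence `0.05` would be dropped.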

### Imports and helper functions

In [2]:
%run 'UNSEEN_helper_functions.ipynb'
# Refresh stored variables.
%store -r

### Prerequisites

In [7]:
# Create the caseness variables if they are not already defined in this session.
if 'definiteCaseness_count' not in globals():
    %run ./"UNSEEN_create_caseness_variables.ipynb"

In [6]:
# Display message.
display(
    Markdown(
# Raw f-string so that LaTeX escapes such as \% are kept literally.
rf"""
Running our script to calculate the prevalence of our caseness variable shows that:
- the prevalence of 'Possible caseness' is ${round(possibleCaseness_prevalence,3):,}\%$, which equates to a redacted and rounded count of ${int(possibleCaseness_count):,}$
- the prevalence of 'Definite caseness' is ${round(definiteCaseness_prevalence,3):,}\%$, which equates to a redacted and rounded count of ${int(definiteCaseness_count):,}$

Our simulations suggested lower- and upper-bound scaling factors of $0.7$ and $1.4$, respectively, so that feature sets are capable of at least $80\%$ normalised mutual information
with the caseness variables, in the best-case scenario.

These data mean that:
- for the 'Possible caseness' variable, we will only consider feature sets that are present in at least ${int(possibleCaseness_count_LB):,}$
patients' records and no more than ${int(possibleCaseness_count_UB):,}$ patients' records.
- for the 'Definite caseness' variable, we will only consider feature sets that are present in at least ${int(definiteCaseness_count_LB):,}$
patients' records and no more than ${int(definiteCaseness_count_UB):,}$ patients' records.
"""
       )
)


Running our script to calculate the prevalence of our caseness variable shows that:
- the prevalence of 'Possible caseness' is $0.123\%$, which equates to a redacted and rounded count of $860$
- the prevalence of 'Definite caseness' is $0.009\%$, which equates to a redacted and rounded count of $60$

Our simulations suggested lower- and upper-bound scaling factors of $0.7$ and $1.4$, respectively, so that feature sets are capable of at least $80\%$ normalised mutual information
with the caseness variables, in the best-case scenario.

These data mean that:
- for the 'Possible caseness' variable, we will only consider feature sets that are present in at least $600$
patients' records and no more than $1,200$ patients' records.
- for the 'Definite caseness' variable, we will only consider feature sets that are present in at least $40$
patients' records and no more than $90$ patients' records.
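
Applying count bounds like these to a collection of candidate feature sets could look like the sketch below. It is an illustration under assumptions, not the project's implementation: `filter_feature_sets` is a hypothetical name, each feature set is assumed to be stored as one 0/1 indicator per patient record, and the toy data and bounds are invented purely to show the mechanics.

```python
def filter_feature_sets(feature_sets, count_lb, count_ub):
    """Keep feature sets present in between count_lb and count_ub records.

    Hypothetical helper: `feature_sets` maps a feature-set name to a list of
    0/1 indicators, one per patient record. Returns a dict mapping each
    retained name to the number of records in which it is present.
    """
    kept = {}
    for name, indicators in feature_sets.items():
        n_present = sum(1 for value in indicators if value)
        if count_lb <= n_present <= count_ub:
            kept[name] = n_present
    return kept


# Toy data (invented): three feature sets over 100 patient records.
toy_feature_sets = {
    "fs_a": [1] * 5 + [0] * 95,   # present in 5 records
    "fs_b": [1] * 50 + [0] * 50,  # present in 50 records
    "fs_c": [1] * 8 + [0] * 92,   # present in 8 records
}

# Toy bounds (invented) standing in for the lower and upper count bounds.
kept = filter_feature_sets(toy_feature_sets, count_lb=4, count_ub=9)
print(kept)
```

With these toy bounds, `fs_b` is excluded because it is present in far more records than the upper bound allows.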
