# Filtering feature sets based on prevalence

The intended use of the feature sets is to support clinicians in identifying patients with complex mental health difficulties so that resources can be delivered appropriately. Without information like feature sets, the clinician’s expected accuracy in identifying patients with complex mental health difficulties is 50%, which is a coin toss. Alternatively, with no other information, one could assume that everyone has complex mental health difficulties or that no one does. A balance must be struck between these extremes to facilitate optimal and appropriate use of healthcare resources that maximises benefit for the unidentified patients in need. Some boundaries can be defined for this purpose.

The first boundary is that any useful feature set must not be more than 50% prevalent. Although a greater prevalence might improve informativeness in the worst-case scenario, it would come at the cost of positive and negative predictive values in every scenario.
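To make the quantities in this argument concrete, the following minimal sketch computes a feature set's prevalence and its positive and negative predictive values from a 2x2 table. The function name `predictive_values` and all counts are hypothetical and chosen only for illustration; the sketch defines the terms and does not by itself demonstrate the 50% boundary.

```python
def predictive_values(tp, fp, fn, tn):
    """Summarise a 2x2 table of feature-set presence against caseness.

    tp: feature set present, patient is a case
    fp: feature set present, patient is not a case
    fn: feature set absent,  patient is a case
    tn: feature set absent,  patient is not a case
    """
    total = tp + fp + fn + tn
    return {
        # Proportion of all patients in whom the feature set is present.
        "feature_set_prevalence": (tp + fp) / total,
        # Proportion of patients with the feature set who are cases.
        "ppv": tp / (tp + fp),
        # Proportion of patients without the feature set who are not cases.
        "npv": tn / (tn + fn),
    }


# Hypothetical counts, for illustration only.
result = predictive_values(tp=30, fp=70, fn=20, tn=880)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```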

The second boundary requires some analysis. If the caseness of complex mental health difficulties were a fundamental concept, then any useful feature set would have to be at least as prevalent as complex mental health difficulties, because a lesser prevalence would, at best, reduce informativeness and negative predictive value with no benefit to positive predictive value. However, the caseness of complex mental health difficulties is a composite of independent component criteria about records of medications and diagnoses. Specifically, it is indicated by the presence of a diagnosis and then further qualified by the presence of a prescription within three months of our data extraction.
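The composite definition of caseness can be sketched as below. This is an illustration rather than the project's actual implementation: the function `is_case`, the extraction date, and the 91-day approximation of "three months" are all assumptions, as is the reading that the prescription must fall in the three months before extraction.

```python
from datetime import date, timedelta

# Hypothetical extraction date, for illustration only.
EXTRACTION_DATE = date(2023, 1, 1)
# Assumption: "three months" approximated as 91 days.
PRESCRIPTION_WINDOW = timedelta(days=91)


def is_case(has_diagnosis, last_prescription_date):
    """Return True when both component criteria are met.

    A patient counts as a case only if a criterion diagnosis is recorded
    and a prescription falls within the window before extraction.
    """
    if not has_diagnosis or last_prescription_date is None:
        return False
    elapsed = EXTRACTION_DATE - last_prescription_date
    return timedelta(0) <= elapsed <= PRESCRIPTION_WINDOW


print(is_case(True, date(2022, 11, 15)))   # diagnosis and recent prescription
print(is_case(True, date(2022, 6, 1)))     # prescription too old
print(is_case(False, date(2022, 12, 1)))   # no diagnosis
```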

Therefore, if we consider prevalence as a probability, the sentiment of the second boundary is better worded as "any useful feature set must be at least as prevalent _as the least-prevalent combination of criterion diagnosis and medication_". Practically, because the component criteria are treated as independent, this equates to the product of the smallest diagnosis prevalence and the smallest medication prevalence.
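As a minimal sketch of this arithmetic, the lower bound is the product of the smallest diagnosis prevalence and the smallest medication prevalence. The criterion names, the prevalence values, and the helper `least_prevalent_combination` are hypothetical; they are not taken from our data or scripts.

```python
# Hypothetical prevalence values (as proportions), for illustration only.
diagnosis_prevalences = {"diagnosis_A": 0.20, "diagnosis_B": 0.25}
medication_prevalences = {"medication_X": 0.50, "medication_Y": 0.40}


def least_prevalent_combination(diagnoses, medications):
    """Product of the smallest diagnosis and smallest medication prevalence.

    Multiplying the two probabilities assumes the criteria are independent.
    """
    return min(diagnoses.values()) * min(medications.values())


lower_bound = least_prevalent_combination(
    diagnosis_prevalences, medication_prevalences
)
print(f"Lower bound for feature-set prevalence: {lower_bound:.3f}")
```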

The prevalence values of the component criteria diagnoses are available in our script that calculates the prevalence of our definitive caseness variable for complex mental health difficulties.
<br/>
<br/>

__*__ Note that for all arguments presented, we assume that the only signal of interest is the presence of the feature set rather than its absence. This is because it is not possible to distinguish missingness from the genuine absence of a feature set: each would carry different information, but the two cannot be told apart in the data.

In [23]:

# Load the display helpers used below.
from IPython.display import Markdown, display

# Refresh stored variables.
%store -r

# Run the caseness script if its variables are not yet defined.
if 'max_criterion_prev' not in globals():
    %run ./"UNSEEN_create_caseness_variables.ipynb"

# Display message. A raw string keeps the LaTeX backslashes intact.
display(
    Markdown(r"""
Running our script to calculate the prevalence of our caseness variable shows that
no component diagnosis has a prevalence less than $%s\%%$. This sets the lower bound for feature-set prevalence.

We conclude based on our arguments that a feature set prevalence must satisfy $%s \le prevalence_{feature\ set_{i}} \le %s$,
or $%s \le count\ of\ patients_{feature\ set_{i}} \le %s$
"""
       %(min_criterion_prev,
         min_criterion_prev,
         max_criterion_prev,
         f'{min_criterion_count:,}',
         f'{max_criterion_count:,}' )
       )
)


Running our script to calculate the prevalence of our caseness variable shows that
no component diagnosis has a prevalence less than $0.03\%$. This sets the lower bound for feature-set prevalence.

We conclude based on our arguments that a feature set prevalence must satisfy $0.03 \le prevalence_{feature\ set_{i}} \le 0.5$,
or $210 \le count\ of\ patients_{feature\ set_{i}} \le 3,516$
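The count bounds reported above could be applied as a simple filter, sketched below. The feature-set names and their patient counts are hypothetical; only the two bounds, 210 and 3,516, are taken from the output above.

```python
# Bounds on patient counts, taken from the output above.
MIN_COUNT = 210
MAX_COUNT = 3_516

# Hypothetical feature sets and patient counts, for illustration only.
feature_set_counts = {
    "feature_set_1": 150,    # too rare: below the lower bound
    "feature_set_2": 210,    # exactly on the lower bound, so retained
    "feature_set_3": 1_200,  # within bounds
    "feature_set_4": 4_000,  # too common: above the upper bound
}

# Keep only the feature sets whose count lies within the inclusive bounds.
retained = {
    name: count
    for name, count in feature_set_counts.items()
    if MIN_COUNT <= count <= MAX_COUNT
}

print(sorted(retained))
```

Both bounds are treated as inclusive here, matching the $\le$ signs in the stated inequality.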
