<h1><center>D212: Data Mining II</center></h1>
<br>
<center>Task 3: Market Basket Analysis</center>
<br>
<center></center>
<br>
<center>Department of Information Technology, Western Governors University</center>
<br>
<center>Dr. Kesselly Kamara</center>
<br>
<center>July 30, 2024</center>
<br>
<br>
<br>
<br>

## A1. Research Question
As a hospital's analyst, I have been tasked to perform market basket analysis to find associations between patient prescriptions that could help the hospital in making strategic decisions. My research question is as follows: What other prescriptions are associated with the prescription and/or purchase of blood pressure medications?

While I could have left this question much broader, such as "Do any prescriptions have associations with any other prescriptions?", I imagine that in a real-world scenario the research question would be more targeted. The idea for this research question comes from my own experience. A doctor once told me anxiety and high blood pressure are often seen together, and if one is being treated for anxiety, one should also be treated for high blood pressure. I will put that association to the test here.

## A2. Analysis Goals
The goal of this analysis is to find prescriptions that frequently co-occur with blood pressure medications, which may point to conditions that commonly accompany high blood pressure. Doing so may help the hospital be proactive during appointments or hospital stays with patients who have this preexisting condition, so that patients get the best care possible. Physicians could use this analysis to ask the right questions and ensure every patient is on every medication they *need* to be on in order to be healthy.

I will note that market basket analysis fits much better in a retail setting than in a hospital setting. There is no storefront to rearrange and there are no customers to upsell or cross-sell. Using this market basket analysis for any of those purposes would be a breach of law or ethics. Thus, the sole goal of this market basket analysis is to improve patient care.

## B1. Explanation of Market Basket Analysis
Market basket analysis works by finding a set of association rules, a simple example of which is "if hotdog, then hotdog buns." In this example, "hotdog" is the antecedent and "hotdog buns" is the consequent. This rule implies that if someone purchases hotdogs, they are likely to also purchase hotdog buns.

Market basket analysis starts with a dataset of transactions. Using our dataset, we could make a complete list of all possible combinations of the items, including rules with more than one consequent or antecedent, but that list is far too large to reasonably work with. The more items we have, the more unwieldy the list becomes, and quickly. Thus, part of market basket analysis is finding a metric that will allow the quick identification of rules that are "interesting." For example, consider the dataset below:

| Transaction ID | Item1 | Item2 | Item3 |
| --- | --- | --- | --- |
| 001 | Hotdog | Hotdog bun | Mustard |
| 002 | Mustard | Ketchup | |
| 003 | Relish | | |
| 004 | Hotdog | Ketchup | |

In the above dataset, we probably wouldn't want to consider rules such as "if relish, then hotdog" because no transaction in the dataset contains both items. The number of transactions in which an itemset occurs (such as hotdogs and relish purchased together) divided by the total number of transactions is known as "support." The number of transactions in the above dataset with both relish and hotdogs is 0, and the total number of transactions in the dataset is 4. Thus, support for this itemset is 0/4 = 0. In this case, we might consider "pruning" this itemset, that is, discarding it from consideration. Support is only one metric that can be used for pruning; there are others. However, I will only discuss support here for simplicity's sake, and because it is the metric I use to prune in the analysis below.
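
To make the support calculation concrete, here is a minimal sketch in plain Python that uses the four toy transactions from the table above. The `support` helper is a name I made up for illustration; it is not part of any library.

```python
# Toy transactions from the table above (illustrative data, not the hospital dataset).
transactions = [
    {"Hotdog", "Hotdog bun", "Mustard"},
    {"Mustard", "Ketchup"},
    {"Relish"},
    {"Hotdog", "Ketchup"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


print(support({"Relish", "Hotdog"}, transactions))   # 0.0
print(support({"Hotdog"}, transactions))             # 0.5
print(support({"Hotdog", "Ketchup"}, transactions))  # 0.25
```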

The apriori algorithm is used to identify frequent itemsets based on a support threshold the analyst chooses. For example, in the above dataset, I could set the support threshold to 0.25, meaning an itemset needs to occur in at least 1 of the 4 transactions to be retained for consideration.
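
The level-wise idea behind apriori can be sketched in a few lines of plain Python. This is a simplified illustration on the toy dataset, not the mlxtend implementation used later in this paper, and the function name `frequent_itemsets` is my own.

```python
from itertools import combinations

# Toy transactions from the table above (illustrative data only).
transactions = [
    {"Hotdog", "Hotdog bun", "Mustard"},
    {"Mustard", "Ketchup"},
    {"Relish"},
    {"Hotdog", "Ketchup"},
]


def frequent_itemsets(transactions, min_support):
    """Level-wise search for itemsets whose support is >= min_support."""
    n = len(transactions)
    items = sorted({item for basket in transactions for item in basket})
    candidates = [frozenset([item]) for item in items]  # 1-item candidates
    frequent = {}
    k = 1
    while candidates:
        # Keep only the candidates that meet the support threshold.
        level = {}
        for candidate in candidates:
            sup = sum(1 for basket in transactions if candidate <= basket) / n
            if sup >= min_support:
                level[candidate] = sup
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates, keeping a
        # candidate only if every one of its k-item subsets is frequent.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


result = frequent_itemsets(transactions, min_support=0.25)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With a threshold of 0.25 every single item survives, but pairs such as relish and hotdog are discarded, and larger itemsets are only tested when all of their smaller subsets were frequent.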

Association rules are derived from the list of frequent itemsets by splitting them into antecedents and consequents, as in the first example (Sivek, 2021). From these, other metrics can be calculated, such as confidence and lift, which can be used to further prune the list of association rules. Because I use lift below, I will explain it here. Lift measures the strength of association between itemset A and itemset B by comparing how often they actually occur together with how often they would be expected to occur together if they were independent. It is calculated as the support of A and B together divided by the product of the support of A and the support of B: lift = support(A and B) / (support(A) × support(B)). A lift value greater than 1 means that A and B occur together more often than would be expected by chance, so the purchase of A increases the likelihood that B is purchased too. Thus, lift greater than 1 is a commonly used threshold for deciding whether to keep a rule.
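
As a small worked example of the lift formula, the sketch below computes lift for two rules on the toy hotdog dataset. The helper names `support` and `lift` are my own and are only meant to illustrate the arithmetic.

```python
# Toy transactions from the table above (illustrative data only).
transactions = [
    {"Hotdog", "Hotdog bun", "Mustard"},
    {"Mustard", "Ketchup"},
    {"Relish"},
    {"Hotdog", "Ketchup"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in transactions if itemset <= basket) / len(transactions)


def lift(antecedent, consequent, transactions):
    """support(A and B) / (support(A) * support(B))."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / (support(antecedent, transactions) * support(consequent, transactions))


print(lift({"Hotdog"}, {"Hotdog bun"}, transactions))  # 2.0
print(lift({"Hotdog"}, {"Ketchup"}, transactions))     # 1.0
```

A lift of 2.0 says hotdogs and buns appear together twice as often as independence would predict, while a lift of 1.0 for hotdogs and ketchup says the pairing is no more common than chance in this tiny dataset.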

Once the list of association rules has been pruned enough, one should be left with only "interesting rules" that are useful to the analyst. Rules with high lift, support, and confidence (or any other metric combination one wishes to use) indicate that the purchase of the antecedent (hotdogs) is positively associated with the purchase of the consequent (hotdog buns), meaning if someone purchases hotdogs, they are likely to also purchase hotdog buns. Therefore, when a store layout is planned, we may want to put the hotdogs near the hotdog buns for convenience.

## B2. Example Transaction
Below, I have printed the eighth row of the raw dataset (index 7) as an example. Because the raw file alternates blank rows with rows of data, this is the fourth actual transaction. Each transaction in the dataset is in its own row. There are 20 columns, each of which can hold a prescription the patient received/purchased. In this transaction, one can see that the patient received/purchased two different drugs: paroxetine and allopurinol.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.frequent_patterns import association_rules, apriori
from mlxtend.preprocessing import TransactionEncoder

#load csv into pandas dataframe.
df=pd.read_csv('C:/Users/essay/Documents/D212 PA Dataset/medical_market_basket.csv')

print(df.iloc[7])

Presc01     paroxetine
Presc02    allopurinol
Presc03            NaN
Presc04            NaN
Presc05            NaN
Presc06            NaN
Presc07            NaN
Presc08            NaN
Presc09            NaN
Presc10            NaN
Presc11            NaN
Presc12            NaN
Presc13            NaN
Presc14            NaN
Presc15            NaN
Presc16            NaN
Presc17            NaN
Presc18            NaN
Presc19            NaN
Presc20            NaN
Name: 7, dtype: object


## B3. Market Basket Analysis Assumptions
One of the major assumptions market basket analysis makes is that the transactions in the dataset are independent (Deniran, 2023). This means that each row in the dataset is a single transaction made by one person at some date and time. If the transactions in the dataset are not independent, the reliability of market basket analysis could be adversely affected, generating association rules that could be nonsense.

## C1. Dataset Transformation
The following steps were performed to transform the data for analysis:
1. Load the csv into a dataframe (see the code above).
2. Remove the blank rows that alternate with the rows of legitimate data.
3. Transform the dataframe into a list of lists, one inner list per transaction.
4. Transform the result from the previous step into a boolean dataframe, where each column is a product and each row is a transaction. The row values represent whether or not the product was purchased in that transaction.
5. Drop the "nan" column, because it represents empty cells rather than a product.
6. Export the clean data to a csv file.
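
Steps 2 through 4 can be sketched on a tiny made-up dataset in plain Python. This is only an illustration of the idea; the notebook cells below use pandas and mlxtend's TransactionEncoder instead, and the names `raw_rows`, `baskets`, and `encoded` are hypothetical. One difference to note: this sketch drops empty cells before encoding, whereas the notebook keeps them as the string 'nan' and drops that column afterward (step 5).

```python
# Hypothetical mini-dataset: three prescription slots per row, None for empty cells.
raw_rows = [
    [None, None, None],                     # blank spacer row
    ["paroxetine", "allopurinol", None],
    [None, None, None],                     # blank spacer row
    ["enalapril", None, None],
]

# Step 2: keep only rows whose first slot is filled.
rows = [row for row in raw_rows if row[0] is not None]

# Step 3: list of lists, one inner list per transaction, without empty slots.
baskets = [[item for item in row if item is not None] for row in rows]

# Step 4: one boolean column per distinct item, one row per transaction.
columns = sorted({item for basket in baskets for item in basket})
encoded = [{col: (col in basket) for col in columns} for basket in baskets]

print(columns)
for row in encoded:
    print(row)
```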

In [2]:
# Let's just look at our dataset first so we can get a general idea of what it contains/how it's structured.
df.head(10)

Unnamed: 0,Presc01,Presc02,Presc03,Presc04,Presc05,Presc06,Presc07,Presc08,Presc09,Presc10,Presc11,Presc12,Presc13,Presc14,Presc15,Presc16,Presc17,Presc18,Presc19,Presc20
0,,,,,,,,,,,,,,,,,,,,
1,amlodipine,albuterol aerosol,allopurinol,pantoprazole,lorazepam,omeprazole,mometasone,fluconozole,gabapentin,pravastatin,cialis,losartan,metoprolol succinate XL,sulfamethoxazole,abilify,spironolactone,albuterol HFA,levofloxacin,promethazine,glipizide
2,,,,,,,,,,,,,,,,,,,,
3,citalopram,benicar,amphetamine salt combo xr,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,,,,,
5,enalapril,,,,,,,,,,,,,,,,,,,
6,,,,,,,,,,,,,,,,,,,,
7,paroxetine,allopurinol,,,,,,,,,,,,,,,,,,
8,,,,,,,,,,,,,,,,,,,,
9,abilify,atorvastatin,folic acid,naproxen,losartan,,,,,,,,,,,,,,,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15002 entries, 0 to 15001
Data columns (total 20 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Presc01  7501 non-null   object
 1   Presc02  5747 non-null   object
 2   Presc03  4389 non-null   object
 3   Presc04  3345 non-null   object
 4   Presc05  2529 non-null   object
 5   Presc06  1864 non-null   object
 6   Presc07  1369 non-null   object
 7   Presc08  981 non-null    object
 8   Presc09  654 non-null    object
 9   Presc10  395 non-null    object
 10  Presc11  256 non-null    object
 11  Presc12  154 non-null    object
 12  Presc13  87 non-null     object
 13  Presc14  47 non-null     object
 14  Presc15  25 non-null     object
 15  Presc16  8 non-null      object
 16  Presc17  4 non-null      object
 17  Presc18  4 non-null      object
 18  Presc19  3 non-null      object
 19  Presc20  1 non-null      object
dtypes: object(20)
memory usage: 2.3+ MB


In [4]:
# It seems we have a blank row after every legitimate row of data. We need to drop these.
df = df[df['Presc01'].notna()]

# Fix index, because it's only odd numbers now and that's bothersome to me.
df.reset_index(drop=True, inplace=True)

# Check the dataframe to see if that worked
df.head(10)

Unnamed: 0,Presc01,Presc02,Presc03,Presc04,Presc05,Presc06,Presc07,Presc08,Presc09,Presc10,Presc11,Presc12,Presc13,Presc14,Presc15,Presc16,Presc17,Presc18,Presc19,Presc20
0,amlodipine,albuterol aerosol,allopurinol,pantoprazole,lorazepam,omeprazole,mometasone,fluconozole,gabapentin,pravastatin,cialis,losartan,metoprolol succinate XL,sulfamethoxazole,abilify,spironolactone,albuterol HFA,levofloxacin,promethazine,glipizide
1,citalopram,benicar,amphetamine salt combo xr,,,,,,,,,,,,,,,,,
2,enalapril,,,,,,,,,,,,,,,,,,,
3,paroxetine,allopurinol,,,,,,,,,,,,,,,,,,
4,abilify,atorvastatin,folic acid,naproxen,losartan,,,,,,,,,,,,,,,
5,cialis,,,,,,,,,,,,,,,,,,,
6,hydrochlorothiazide,glyburide,,,,,,,,,,,,,,,,,,
7,metformin,salmeterol inhaler,sertraline HCI,,,,,,,,,,,,,,,,,
8,metoprolol,carvedilol,losartan,,,,,,,,,,,,,,,,,
9,glyburide,,,,,,,,,,,,,,,,,,,


In [5]:
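# Make dataframe into a list of lists. [In-Text Citation: (Kmair, 2023).]
# Every cell is cast to a string, so empty cells become the string 'nan'.

values = df.values  # pull the underlying array once instead of once per row
rows = []
for i in range(len(df)):
    rows.append([str(values[i, j]) for j in range(df.shape[1])])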

In [6]:
# Use transaction encoder to transform dataset into boolean dataframe
# This can then be used in apriori algorithm.
encoder = TransactionEncoder()
array = encoder.fit(rows).transform(rows)
transactions = pd.DataFrame(array, columns=encoder.columns_)
transactions

Unnamed: 0,Duloxetine,Premarin,Yaz,abilify,acetaminophen,actonel,albuterol HFA,albuterol aerosol,alendronate,allopurinol,...,trazodone HCI,triamcinolone Ace topical,triamterene,trimethoprim DS,valaciclovir,valsartan,venlafaxine XR,verapamil SR,viagra,zolpidem
0,False,False,False,True,False,False,True,True,False,True,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7496,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7497,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7498,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7499,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [7]:
# Drop the 'nan' column, which is a placeholder for empty cells rather than a product
cleaned_transactions = transactions.drop(['nan'], axis = 1)
cleaned_transactions

Unnamed: 0,Duloxetine,Premarin,Yaz,abilify,acetaminophen,actonel,albuterol HFA,albuterol aerosol,alendronate,allopurinol,...,trazodone HCI,triamcinolone Ace topical,triamterene,trimethoprim DS,valaciclovir,valsartan,venlafaxine XR,verapamil SR,viagra,zolpidem
0,False,False,False,True,False,False,True,True,False,True,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7496,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7497,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7498,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7499,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [8]:
cleaned_transactions.to_csv('cleaned_transactions.csv', index = False)

## C2. Generation of Association Rules
Association rules are created using two functions: apriori and association_rules. First, to identify frequent itemsets and prune any itemsets below the chosen support level (a threshold the analyst sets fairly arbitrarily), the apriori algorithm is used. I set the support threshold equal to 0.01 because I wanted neither to exclude too many itemsets nor to include itemsets with too little support. A threshold of 0.01 means that an itemset had to occur at least roughly 75 times in the dataset (0.01 times 7501 transactions), which seemed reasonable to me.

After using apriori, association_rules is used to generate the actual association rules from those frequent itemsets while further reducing the number of rules based on another metric (in my case below, lift). The association_rules function separates each itemset into antecedent(s) and consequent(s) and provides support, confidence, lift, and several other metrics for each rule.
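
To illustrate how one frequent itemset turns into several candidate rules, here is a minimal sketch in plain Python. It only enumerates the antecedent/consequent splits; it does not compute metrics or apply thresholds the way association_rules does, and `candidate_rules` is a name I made up.

```python
from itertools import combinations


def candidate_rules(itemset):
    """Yield every (antecedent, consequent) split with both sides non-empty."""
    items = frozenset(itemset)
    for size in range(1, len(items)):
        for antecedent in combinations(sorted(items), size):
            antecedent = frozenset(antecedent)
            yield antecedent, items - antecedent


splits = list(candidate_rules({"abilify", "lisinopril", "carvedilol"}))
for antecedent, consequent in splits:
    print(sorted(antecedent), "->", sorted(consequent))
```

A three-item itemset yields six candidate rules, which is why the rules table contains each pair of medications twice, once in each direction.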

In [9]:
freq_items = apriori(cleaned_transactions, min_support = 0.01, use_colnames=True)
freq_items.head(5)

Unnamed: 0,support,itemsets
0,0.011998,(Duloxetine)
1,0.046794,(Premarin)
2,0.238368,(abilify)
3,0.015731,(acetaminophen)
4,0.011998,(actonel)


In [11]:
rules = association_rules(freq_items, metric = 'lift', min_threshold = 1.0)
rules.head(5)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Premarin),(diazepam),0.046794,0.163845,0.011598,0.247863,1.512793,0.003932,1.111706,0.355611
1,(diazepam),(Premarin),0.163845,0.046794,0.011598,0.070789,1.512793,0.003932,1.025824,0.405392
2,(abilify),(allopurinol),0.238368,0.033329,0.011598,0.048658,1.459926,0.003654,1.016113,0.41363
3,(allopurinol),(abilify),0.033329,0.238368,0.011598,0.348,1.459926,0.003654,1.168147,0.325896
4,(abilify),(amlodipine),0.238368,0.071457,0.023597,0.098993,1.385352,0.006564,1.030562,0.365218


## C3. Association Rules Metrics
Below you will find the code output of the association_rules function, which I stored in "rules." This rules table shows the support, lift, and confidence of every rule generated, as well as several other metrics. The full table with 406 rules is given and can be scrolled through in the ipynb version of this file. If you are using the pdf file to grade this paper, you will be unable to scroll.

In [12]:
pd.set_option('display.max_rows', None)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Premarin),(diazepam),0.046794,0.163845,0.011598,0.247863,1.512793,0.003932,1.111706,0.355611
1,(diazepam),(Premarin),0.163845,0.046794,0.011598,0.070789,1.512793,0.003932,1.025824,0.405392
2,(abilify),(allopurinol),0.238368,0.033329,0.011598,0.048658,1.459926,0.003654,1.016113,0.41363
3,(allopurinol),(abilify),0.033329,0.238368,0.011598,0.348,1.459926,0.003654,1.168147,0.325896
4,(abilify),(amlodipine),0.238368,0.071457,0.023597,0.098993,1.385352,0.006564,1.030562,0.365218
5,(amlodipine),(abilify),0.071457,0.238368,0.023597,0.330224,1.385352,0.006564,1.137144,0.299568
6,(abilify),(amphetamine salt combo),0.238368,0.068391,0.024397,0.102349,1.49653,0.008095,1.03783,0.435627
7,(amphetamine salt combo),(abilify),0.068391,0.238368,0.024397,0.356725,1.49653,0.008095,1.183991,0.356144
8,(amphetamine salt combo xr),(abilify),0.179709,0.238368,0.050927,0.283383,1.188845,0.00809,1.062815,0.193648
9,(abilify),(amphetamine salt combo xr),0.238368,0.179709,0.050927,0.213647,1.188845,0.00809,1.043158,0.208562


## C4. Top Three Association Rules
Below you will find the top three rules in the rules table in terms of the confidence value. The top three rules are as follows:
1. If someone buys/is prescribed amphetamine salt combo XR and lisinopril, then they are likely to buy/be prescribed abilify.
2. If someone buys/is prescribed lisinopril and atorvastatin, then they are likely to buy/be prescribed abilify.
3. If someone buys/is prescribed lisinopril and diazepam, then they are likely to buy/be prescribed abilify.

Since we already pruned based on support and lift, I chose to use confidence to identify the top three rules. Confidence is the proportion of transactions that contain every item in the rule (antecedent and consequent together) divided by the proportion of transactions that contain the antecedent (Sivek, 2021). In the case of rule one, this is the proportion of transactions that contain amphetamine salt combo xr, lisinopril, and abilify divided by the proportion of transactions that contain amphetamine salt combo xr and lisinopril, whether or not they contain anything else. In the case of rule two, this is the proportion of transactions that contain lisinopril, atorvastatin, and abilify divided by the proportion of transactions that contain lisinopril and atorvastatin. Lastly, in the case of rule three, this is the proportion of transactions that contain lisinopril, diazepam, and abilify divided by the proportion of transactions that contain lisinopril and diazepam.

The confidence value for rule one indicates that roughly 51% of the transactions that contain amphetamine salt combo xr and lisinopril also contain abilify. For rule two, roughly 50% of the transactions that contain lisinopril and atorvastatin also contain abilify. Finally, for rule three, roughly 47% of the transactions that contain lisinopril and diazepam also contain abilify. These confidence values are decently high. While there is no threshold for confidence above which a rule is considered "good" like there is for lift, 50% seems reasonably high to me, especially since abilify appears in only about 24% of all transactions. With confidences hovering around 50% for these three rules, we can conclude that the associations between the rules' antecedents and their consequents are reasonably strong.
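
As a quick arithmetic check, confidence can be recomputed from the support columns of the top-three output. The numbers below are copied from that output; the variable names are my own, and small differences are expected because the displayed values are rounded to six decimal places.

```python
# (antecedent support, rule support, reported confidence), copied from the rules table.
top_rules = {
    "amphetamine salt combo xr + lisinopril -> abilify": (0.019997, 0.010132, 0.506667),
    "lisinopril + atorvastatin -> abilify": (0.021997, 0.011065, 0.503030),
    "diazepam + lisinopril -> abilify": (0.023064, 0.010932, 0.473988),
}

recomputed = {}
for name, (antecedent_support, rule_support, reported) in top_rules.items():
    # confidence = support(antecedent and consequent) / support(antecedent)
    recomputed[name] = rule_support / antecedent_support
    print(f"{name}: recomputed {recomputed[name]:.4f}, reported {reported:.4f}")
```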

In [13]:
# Let's also check the stats for the confidence column to inform our final top 3 dataframe
# High confidence is good.
rules['confidence'].describe()

count    406.000000
mean       0.183657
std        0.103076
min        0.042506
25%        0.099265
50%        0.168497
75%        0.246304
max        0.506667
Name: confidence, dtype: float64

In [15]:
# Sort all rules by confidence and keep only the three highest
top_3 = rules.sort_values(by = 'confidence', ascending = False).head(3)
top_3

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
325,"(amphetamine salt combo xr, lisinopril)",(abilify),0.019997,0.238368,0.010132,0.506667,2.125563,0.005365,1.543848,0.540342
344,"(lisinopril, atorvastatin)",(abilify),0.021997,0.238368,0.011065,0.50303,2.110308,0.005822,1.532552,0.537969
390,"(diazepam, lisinopril)",(abilify),0.023064,0.238368,0.010932,0.473988,1.988472,0.005434,1.447937,0.508837


## D1. Analysis Results and Metric Significance
For this section, I am unsure if the rubric is attempting to instruct me to go back to my research question and answer it or if it intends for me to continue my analysis of the top three rules given above. I will, therefore, attempt to do both just in case.

With regard to the top three rules, let us first discuss support. Support is the proportion of transactions that contain the itemset in question. The apriori algorithm has already enforced a minimum support of 0.01, so no itemset for which we are generating association rules appears fewer than roughly 75 times in the dataset (0.01 times 7501 transactions). This ensures the itemsets we are generating rules for are not so infrequent that we can't reasonably draw conclusions. For example, if we only had one transaction out of 7501 where lisinopril, atorvastatin, and abilify were purchased together, would we really want to conclude that the combination means something because it happened once? Common sense should tell us no.

Now that support is defined, let us analyze support in our top three rules. Support for all three of the top three rules is just over 0.01. Since all three rules have similar support, let us just analyze rule one, since the pattern will be the same for the other two rules. A support of roughly 0.01 for rule one indicates that about 1% of the transactions contained all three medications: amphetamine salt combo xr, lisinopril, and abilify. This is roughly 75 transactions, which I have already determined is a reasonable number of transactions from which to draw conclusions.
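
To put numbers on "roughly 75," the sketch below converts the reported support values back into approximate transaction counts. It assumes the 7501 transactions shown in the df.info() output; the counts are recovered from rounded support values, so treat them as approximate.

```python
import math

n_transactions = 7501   # number of non-blank rows in the dataset
min_support = 0.01

# Smallest whole number of transactions that reaches the support threshold.
min_count = math.ceil(min_support * n_transactions)

# Support values copied from the top-three rules output.
rule_support = {"rule one": 0.010132, "rule two": 0.011065, "rule three": 0.010932}
rule_counts = {name: round(s * n_transactions) for name, s in rule_support.items()}

print(min_count)     # 76
print(rule_counts)   # {'rule one': 76, 'rule two': 83, 'rule three': 82}
```

By this arithmetic, rule one sits right at the minimum count the threshold allows, which is worth keeping in mind when weighing how much to read into it.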

Lift measures association strength and is, for the first rule, the support of amphetamine salt combo xr, lisinopril, and abilify together divided by the product of the support of amphetamine salt combo xr and lisinopril and the support of abilify. Lift takes into account how often the consequent would be expected to appear alongside the antecedent purely by chance. If lift is greater than 1, we can say that the antecedent (for example, amphetamine salt combo xr and lisinopril) and the consequent (abilify) appear together more often than we would expect if they were unrelated. That is, the fact that the antecedent was purchased increases the likelihood the consequent will be purchased. In the case of our top three rules, lift is high for all three, sitting around or above 2, meaning each combination occurs about twice as often as chance alone would predict. Because this metric is well above 1, we can say that the antecedents all increase the likelihood of the purchase of their consequents. Purchasing amphetamine salt combo xr and lisinopril increases the likelihood abilify will be purchased. Purchasing lisinopril and atorvastatin increases the likelihood abilify will be purchased. Purchasing lisinopril and diazepam increases the likelihood abilify will be purchased.
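
The lift values for the top three rules can be checked two equivalent ways: from the support columns, or as confidence divided by consequent support. The figures below are copied from the rules output, the variable names are my own, and the results agree with the reported lift only up to rounding of the displayed values.

```python
# (antecedent support, consequent support, rule support, confidence, reported lift)
top_rules = {
    "rule one": (0.019997, 0.238368, 0.010132, 0.506667, 2.125563),
    "rule two": (0.021997, 0.238368, 0.011065, 0.503030, 2.110308),
    "rule three": (0.023064, 0.238368, 0.010932, 0.473988, 1.988472),
}

lift_checks = {}
for name, (ant, cons, joint, confidence, reported) in top_rules.items():
    from_supports = joint / (ant * cons)   # support(A and B) / (support(A) * support(B))
    from_confidence = confidence / cons    # the same quantity via confidence
    lift_checks[name] = (from_supports, from_confidence, reported)
    print(f"{name}: {from_supports:.3f} vs {from_confidence:.3f} (reported {reported:.3f})")
```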

Lastly, we'll look at confidence for our top three rules. Confidence measures the likelihood that if the antecedent is purchased, the consequent will also be purchased. It is the probability someone will purchase the consequent given that they've purchased the antecedent. The confidence for each of the top three rules sits around 50%, which, as I stated before, seems reasonably high to me. There is no real threshold for confidence like there is for lift. Intuitively speaking, though, if the likelihood that you'll purchase abilify given that you've already purchased amphetamine salt combo xr and lisinopril is 50%, that seems high enough to me to call this a good rule.

Now that we've got one interpretation of the top three rules out of the way, let's look at answering the research question. Note that I will not be defining lift, confidence, and support again.

In [16]:
# Identify blood pressure meds
med_names = cleaned_transactions.columns.values
print(med_names)

['Duloxetine' 'Premarin' 'Yaz' 'abilify' 'acetaminophen' 'actonel'
 'albuterol HFA' 'albuterol aerosol' 'alendronate' 'allopurinol'
 'alprazolam' 'amitriptyline' 'amlodipine' 'amoxicillin' 'amphetamine'
 'amphetamine salt combo' 'amphetamine salt combo xr' 'atenolol'
 'atorvastatin' 'azithromycin' 'benazepril' 'benicar' 'boniva'
 'bupropion sr' 'carisoprodol' 'carvedilol' 'cefdinir' 'celebrex'
 'celecoxib' 'cephalexin' 'cialis' 'ciprofloxacin' 'citalopram'
 'clavulanate K+' 'clonazepam' 'clonidine HCI' 'clopidogrel'
 'clotrimazole' 'codeine' 'crestor' 'cyclobenzaprine' 'cymbalta'
 'dextroamphetamine XR' 'diazepam' 'diclofenac sodium'
 'doxycycline hyclate' 'enalapril' 'escitalopram' 'esomeprazole'
 'ezetimibe' 'fenofibrate' 'fexofenadine' 'finasteride'
 'flovent hfa 110mcg inhaler' 'fluconozole' 'fluoxetine HCI' 'fluticasone'
 'fluticasone nasal spray' 'folic acid' 'furosemide' 'gabapentin'
 'glimepiride' 'glipizide' 'glyburide' 'hydrochlorothiazide' 'hydrocodone'
 'hydrocortisone 2.5%

In [17]:
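# Isolate a list of bp meds based on my experience as a pharmacy tech.
# Names must match the column names exactly to be found in a rule. Note that the
# dataset lists 'clonidine HCI', so 'clonidine' as written here will not match it.
bp_meds = ['amlodipine', 'atenolol', 'benazepril', 'benicar', 'carvedilol', 'clonidine', 'enalapril', 'lisinopril', 'losartan', 'metoprolol', 'metoprolol succinate XL', 'metoprolol tartrate', 'valsartan', 'verapamil SR']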

In [18]:
# Isolate rules where antecedent contains a bp_med
mask = rules.antecedents.apply(lambda x: any(item in x for item in bp_meds))
ant_bp = rules[mask]
ant_bp

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
5,(amlodipine),(abilify),0.071457,0.238368,0.023597,0.330224,1.385352,0.006564,1.137144,0.299568
12,(carvedilol),(abilify),0.17411,0.238368,0.059725,0.343032,1.439085,0.018223,1.159314,0.369437
39,(lisinopril),(abilify),0.098254,0.238368,0.040928,0.416554,1.747522,0.017507,1.305401,0.474369
44,(metoprolol),(abilify),0.095321,0.238368,0.035729,0.374825,1.572463,0.013007,1.21827,0.402413
47,(metoprolol succinate XL),(abilify),0.04746,0.238368,0.015065,0.317416,1.331619,0.003752,1.115806,0.261443
56,(carvedilol),(alprazolam),0.17411,0.079323,0.013998,0.080398,1.013557,0.000187,1.001169,0.016196
63,(amlodipine),(amphetamine salt combo),0.071457,0.068391,0.011199,0.156716,2.291481,0.006311,1.10474,0.606974
65,(amlodipine),(amphetamine salt combo xr),0.071457,0.179709,0.014131,0.197761,1.10045,0.00129,1.022502,0.098306
66,(amlodipine),(atorvastatin),0.071457,0.129583,0.017598,0.246269,1.900474,0.008338,1.154811,0.510279
68,(carvedilol),(amlodipine),0.17411,0.071457,0.021197,0.121746,1.70376,0.008756,1.05726,0.500143


In [19]:
# Isolate rules where consequent contains a bp_med
mask = rules.consequents.apply(lambda x: any(item in x for item in bp_meds))
con_bp = rules[mask]
con_bp

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
4,(abilify),(amlodipine),0.238368,0.071457,0.023597,0.098993,1.385352,0.006564,1.030562,0.365218
13,(abilify),(carvedilol),0.238368,0.17411,0.059725,0.250559,1.439085,0.018223,1.102008,0.400606
38,(abilify),(lisinopril),0.238368,0.098254,0.040928,0.1717,1.747522,0.017507,1.088672,0.561638
45,(abilify),(metoprolol),0.238368,0.095321,0.035729,0.149888,1.572463,0.013007,1.064189,0.477993
46,(abilify),(metoprolol succinate XL),0.238368,0.04746,0.015065,0.063199,1.331619,0.003752,1.016801,0.326975
57,(alprazolam),(carvedilol),0.079323,0.17411,0.013998,0.176471,1.013557,0.000187,1.002866,0.014528
62,(amphetamine salt combo),(amlodipine),0.068391,0.071457,0.011199,0.163743,2.291481,0.006311,1.110355,0.604976
64,(amphetamine salt combo xr),(amlodipine),0.179709,0.071457,0.014131,0.078635,1.10045,0.00129,1.00779,0.111279
67,(atorvastatin),(amlodipine),0.129583,0.071457,0.017598,0.135802,1.900474,0.008338,1.074457,0.544355
68,(carvedilol),(amlodipine),0.17411,0.071457,0.021197,0.121746,1.70376,0.008756,1.05726,0.500143


In [20]:
bp_med_rules = pd.concat([ant_bp, con_bp])
bp_med_rules.sort_index()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
4,(abilify),(amlodipine),0.238368,0.071457,0.023597,0.098993,1.385352,0.006564,1.030562,0.365218
5,(amlodipine),(abilify),0.071457,0.238368,0.023597,0.330224,1.385352,0.006564,1.137144,0.299568
12,(carvedilol),(abilify),0.17411,0.238368,0.059725,0.343032,1.439085,0.018223,1.159314,0.369437
13,(abilify),(carvedilol),0.238368,0.17411,0.059725,0.250559,1.439085,0.018223,1.102008,0.400606
38,(abilify),(lisinopril),0.238368,0.098254,0.040928,0.1717,1.747522,0.017507,1.088672,0.561638
39,(lisinopril),(abilify),0.098254,0.238368,0.040928,0.416554,1.747522,0.017507,1.305401,0.474369
44,(metoprolol),(abilify),0.095321,0.238368,0.035729,0.374825,1.572463,0.013007,1.21827,0.402413
45,(abilify),(metoprolol),0.238368,0.095321,0.035729,0.149888,1.572463,0.013007,1.064189,0.477993
46,(abilify),(metoprolol succinate XL),0.238368,0.04746,0.015065,0.063199,1.331619,0.003752,1.016801,0.326975
47,(metoprolol succinate XL),(abilify),0.04746,0.238368,0.015065,0.317416,1.331619,0.003752,1.115806,0.261443


In [21]:
# When we concatenated, some rules got in there twice because both the antecedent
# and the consequent were bp meds. So we'll remove those dupes now.
bp_med_rules = bp_med_rules.drop_duplicates(keep='first')
bp_med_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
5,(amlodipine),(abilify),0.071457,0.238368,0.023597,0.330224,1.385352,0.006564,1.137144,0.299568
12,(carvedilol),(abilify),0.17411,0.238368,0.059725,0.343032,1.439085,0.018223,1.159314,0.369437
39,(lisinopril),(abilify),0.098254,0.238368,0.040928,0.416554,1.747522,0.017507,1.305401,0.474369
44,(metoprolol),(abilify),0.095321,0.238368,0.035729,0.374825,1.572463,0.013007,1.21827,0.402413
47,(metoprolol succinate XL),(abilify),0.04746,0.238368,0.015065,0.317416,1.331619,0.003752,1.115806,0.261443
56,(carvedilol),(alprazolam),0.17411,0.079323,0.013998,0.080398,1.013557,0.000187,1.001169,0.016196
63,(amlodipine),(amphetamine salt combo),0.071457,0.068391,0.011199,0.156716,2.291481,0.006311,1.10474,0.606974
65,(amlodipine),(amphetamine salt combo xr),0.071457,0.179709,0.014131,0.197761,1.10045,0.00129,1.022502,0.098306
66,(amlodipine),(atorvastatin),0.071457,0.129583,0.017598,0.246269,1.900474,0.008338,1.154811,0.510279
68,(carvedilol),(amlodipine),0.17411,0.071457,0.021197,0.121746,1.70376,0.008756,1.05726,0.500143


In [23]:
# top 10 rules for bp meds by confidence
top_10_bp = bp_med_rules.sort_values(by = 'confidence', ascending = False)
top_10_bp.head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
325,"(amphetamine salt combo xr, lisinopril)",(abilify),0.019997,0.238368,0.010132,0.506667,2.125563,0.005365,1.543848,0.540342
344,"(lisinopril, atorvastatin)",(abilify),0.021997,0.238368,0.011065,0.50303,2.110308,0.005822,1.532552,0.537969
390,"(diazepam, lisinopril)",(abilify),0.023064,0.238368,0.010932,0.473988,1.988472,0.005434,1.447937,0.508837
349,"(metoprolol, atorvastatin)",(abilify),0.023597,0.238368,0.011065,0.468927,1.967236,0.00544,1.434136,0.503555
361,"(carvedilol, doxycycline hyclate)",(abilify),0.025197,0.238368,0.011465,0.455026,1.908923,0.005459,1.397557,0.488452
367,"(carvedilol, glipizide)",(abilify),0.02293,0.238368,0.010265,0.447674,1.878079,0.004799,1.378954,0.478514
331,"(carvedilol, atorvastatin)",(abilify),0.035462,0.238368,0.015731,0.443609,1.861024,0.007278,1.368879,0.479672
377,"(carvedilol, lisinopril)",(abilify),0.039195,0.238368,0.017064,0.435374,1.826477,0.007722,1.348914,0.470957
383,"(metoprolol, carvedilol)",(abilify),0.027863,0.238368,0.011998,0.430622,1.806541,0.005357,1.337656,0.459252
378,"(abilify, lisinopril)",(carvedilol),0.040928,0.17411,0.017064,0.416938,2.394681,0.009938,1.41647,0.607262


Remember that the research question was this: What other prescriptions are associated with the prescription and/or purchase of blood pressure medications?

Above, I have isolated the top ten rules that answer this question, ordered by confidence, as we did before with the far more general "top three rules." These are rules in which a blood pressure medication appears in either the antecedent or the consequent. Support for these rules ranges from about 0.01 to nearly 0.02. For example, for the first rule in this list, about 1% of the transactions in the dataset contained all three items in the rule (amphetamine salt combo xr, lisinopril, and abilify). The pattern is the same for the rest of the rules. While this might be considered a little low (only 75 or so transactions in the dataset contained this itemset), it is above the minimum support we set when running the apriori algorithm, as it should be. Rules with support closer to 0.02 are better, with roughly 150 transactions containing the rule's itemset.
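To make the support figures concrete, here is a minimal sketch that converts a rule's support into an approximate transaction count. The dataset size `N_TRANSACTIONS = 7500` is an assumption inferred from the "75 or so transactions at about 1% support" estimate above, not a value read from the data, and the helper name `support_to_count` is hypothetical.

```python
# Assumed dataset size, inferred from "about 1% of transactions is roughly 75".
# Replace with the real number of transactions when running against the data.
N_TRANSACTIONS = 7500

def support_to_count(support, n_transactions=N_TRANSACTIONS):
    """Approximate number of transactions that contain a rule's full itemset."""
    return round(support * n_transactions)

# Support values copied from the top-ten table above.
first_rule_count = support_to_count(0.010132)   # (amphetamine salt combo xr, lisinopril) -> abilify
eighth_rule_count = support_to_count(0.017064)  # (carvedilol, lisinopril) -> abilify
print(first_rule_count, eighth_rule_count)
```

Under that assumed dataset size, even the best-supported rule in the top ten rests on fewer than 150 transactions, which is worth keeping in mind when weighing the rules.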

Lift for all of the rules ranges from 1.8 to 2.39, which is high. We've already set a threshold for lift to be greater than one; lower than that and the rule is "not interesting." Being well above 1 like this implies that the antecedents of each of these rules increase the likelihood of their respective consequents being purchased. In the case of the tenth rule in this list, for example, the purchase of both abilify and lisinopril increases the likelihood that carvedilol will be purchased too. The consequents have a significant association with the antecedents, more significant than would be expected due to random chance.
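As a quick check on how lift is read, the sketch below recomputes it for the tenth rule from the support columns in the table, using the standard definition of lift as the rule's support divided by the product of the antecedent and consequent supports. The function name `lift` is only illustrative; small differences from the table come from the rounded inputs.

```python
def lift(support, antecedent_support, consequent_support):
    """Lift = P(A and C) / (P(A) * P(C)); a value of 1.0 means A and C look independent."""
    return support / (antecedent_support * consequent_support)

# Tenth rule in the table: (abilify, lisinopril) -> (carvedilol)
rule_lift = lift(
    support=0.017064,
    antecedent_support=0.040928,
    consequent_support=0.17411,
)
print(f"lift = {rule_lift:.3f}")
```

A lift near 2.39 means the three items appear together a little more than twice as often as they would if the antecedent and consequent were independent.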

Lastly, confidence. For these rules, confidence ranges from roughly 0.42 to roughly 0.51. For the rule with confidence of 0.42, this means that given that the antecedent was purchased, the likelihood the consequent will also be purchased is 42%. I would prefer rules that are 50% or higher, but 42% is not bad by any means. Luckily, we do have two rules that are above 50%, which are strong rules by my definition of the word. Remember, what counts as a "strong" confidence level is largely up to the analyst and possibly the decision-makers involved in the research question. How strong is strong enough?
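The confidence figures can be reproduced the same way: confidence is the rule's support divided by the antecedent's support. The sketch below does this for four of the ten rules (values copied from the table) and applies the 50% preference mentioned above as a cutoff. The names `confidence`, `MIN_CONFIDENCE`, and `strong_rules` are illustrative, not part of the original notebook.

```python
# (antecedents, consequent, antecedent support, rule support), copied from the table above
rules = [
    ("amphetamine salt combo xr, lisinopril", "abilify", 0.019997, 0.010132),
    ("lisinopril, atorvastatin", "abilify", 0.021997, 0.011065),
    ("diazepam, lisinopril", "abilify", 0.023064, 0.010932),
    ("abilify, lisinopril", "carvedilol", 0.040928, 0.017064),
]

def confidence(support, antecedent_support):
    """Confidence = P(consequent | antecedent) = support / antecedent support."""
    return support / antecedent_support

MIN_CONFIDENCE = 0.5  # the analyst's preferred cutoff; a judgment call, not a fixed standard

strong_rules = [
    (antecedents, consequent)
    for antecedents, consequent, ant_support, support in rules
    if confidence(support, ant_support) >= MIN_CONFIDENCE
]
for antecedents, consequent in strong_rules:
    print(f"({antecedents}) -> ({consequent})")
```

Changing `MIN_CONFIDENCE` is the code-level version of asking "how strong is strong enough?"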

Because all three metrics are good, we can consider using these top ten rules to inform our patient care practices. It seems like there is a pattern to all but one of the rules: If a patient is prescribed/purchasing a blood pressure medication, they are likely to also purchase/be prescribed abilify.

## D2. Practical Significance
Again, because I am unsure what the rubric is asking for, I will answer this both in terms of the top three rules from section C4 and in terms of the answer to my research question.

First, let's look at the top three rules. Since we've determined that the associations in these rules are both interesting (due to support and lift) and strong (due to confidence and, to a degree, lift), we can use these three rules to inform how we interact with patients. There are three things we can tell physicians to look out for when conducting patient care:
1. If a patient is taking both amphetamine salt combo xr and lisinopril, the physician should perhaps consider the patient's need for abilify and conduct an examination to determine if the patient has a condition that warrants its use.
2. If a patient is taking both lisinopril and atorvastatin, the physician should perhaps consider the patient's need for abilify and conduct an examination to determine if the patient has a condition that warrants its use.
3. If a patient is taking both lisinopril and diazepam, the physician should perhaps consider the patient's need for abilify and conduct an examination to determine if the patient has a condition that warrants its use.

Abilify is commonly used to treat mental disorders including, but not limited to, schizophrenia and bipolar disorder, and is sometimes used in autism spectrum disorder, major depressive disorder, or Tourette syndrome (Gettu & Saadabadi, 2023). In order to improve patient care, physicians should screen for these conditions whenever a patient is already taking the specific medications listed above.

Now, let's speak to the practical significance of the ten rules isolated for the research question. These rules are also both interesting and strong, enough that they can be used to inform how a physician conducts patient care. There are ten things a physician can look out for when doing so:
1. If a patient is taking both amphetamine salt combo xr and lisinopril, the physician should perhaps consider the patient's need for abilify and conduct an examination to determine if the patient has a condition that warrants its use.
2. If a patient is taking both lisinopril and atorvastatin, the physician should perhaps consider the patient's need for abilify and conduct an examination to determine if the patient has a condition that warrants its use.
3. If a patient is taking both lisinopril and diazepam, the physician should perhaps consider the patient's need for abilify and conduct an examination to determine if the patient has a condition that warrants its use.
4. If a patient is taking both metoprolol and atorvastatin, the physician should perhaps consider the patient's need for abilify and conduct an examination to determine if the patient has a condition that warrants its use.
5. If a patient is taking both doxycycline hyclate and carvedilol, the physician should perhaps consider the patient's need for abilify and conduct an examination to determine if the patient has a condition that warrants its use.
6. If a patient is taking both carvedilol and glipizide, the physician should perhaps consider the patient's need for abilify and conduct an examination to determine if the patient has a condition that warrants its use.
7. If a patient is taking both carvedilol and atorvastatin, the physician should perhaps consider the patient's need for abilify and conduct an examination to determine if the patient has a condition that warrants its use.
8. If a patient is taking both carvedilol and lisinopril, the physician should perhaps consider the patient's need for abilify and conduct an examination to determine if the patient has a condition that warrants its use.
9. If a patient is taking both carvedilol and metoprolol, the physician should perhaps consider the patient's need for abilify and conduct an examination to determine if the patient has a condition that warrants its use.
10. If a patient is taking both abilify and lisinopril, the physician should perhaps consider the patient's need for carvedilol and conduct an examination to determine if the patient has a condition that warrants its use.

Of course, in the case of any of these rules, the physician should apply the medical knowledge that the analyst does not have. These rules suggest the co-occurrence of conditions treated by the antecedent medications and conditions treated by the consequent. Thus, these rules suggest that people who are medicated for high blood pressure and various other conditions (ADHD in the case of amphetamine salt combo xr, for example) also frequently have the conditions treated by abilify. The tenth rule, however, suggests that those with conditions treated by abilify and lisinopril frequently also have the conditions treated with carvedilol (a high blood pressure medication). It is important to note that these rules do *not* reveal a cause-and-effect relationship between any of the antecedents and their consequents, only a suggested co-occurrence.
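One way to see why these rules describe co-occurrence rather than direction: lift is symmetric, so swapping antecedent and consequent leaves it unchanged, while confidence changes. The sketch below illustrates this with the abilify / metoprolol succinate XL pair from the earlier rules table (rules 46 and 47). The variable names are illustrative, and the small differences from the table values come from rounded inputs.

```python
# Supports copied from rules 46 and 47 in the earlier rules table.
support_abilify = 0.238368
support_metoprolol_xl = 0.04746
support_both = 0.015065

# Confidence depends on which item is treated as the antecedent...
conf_abilify_to_metoprolol = support_both / support_abilify
conf_metoprolol_to_abilify = support_both / support_metoprolol_xl

# ...but lift is the same in both directions.
lift_abilify_to_metoprolol = support_both / (support_abilify * support_metoprolol_xl)
lift_metoprolol_to_abilify = support_both / (support_metoprolol_xl * support_abilify)

print(f"confidence: {conf_abilify_to_metoprolol:.3f} vs {conf_metoprolol_to_abilify:.3f}")
print(f"lift:       {lift_abilify_to_metoprolol:.3f} vs {lift_metoprolol_to_abilify:.3f}")
```

Because the association measure is identical whichever way the rule is written, the "if ... then ..." form should be read as "tends to appear together with," not "leads to."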

## D3. Recommended Action
No matter which way one considers answering the rubric's questions for sections D1 and D2, my suggestions for recommended actions are the same.

Because of the highly scientific nature of the subject of the data, I would recommend that a group of physicians and analysts discuss these rules together to ensure no harm could come from implementing a protocol that encourages doctors to screen for conditions they might not otherwise screen for during a patient visit. Since mental health checks do not seem invasive to me and blood pressure is regularly checked anyway, as an analyst I would say no harm could come of this, but then again, I am not a doctor. Patient safety is of the utmost concern; it is even more important than bettering patient care.

During this same discussion, physicians can also help weed out rules that simply do not make sense. As an analyst and former pharmacy technician, I would probably encourage the discussion and possible removal of the fifth rule in the top ten list because it states "if doxycycline hyclate and carvedilol, then abilify." Doxycycline is an antibiotic and carvedilol treats high blood pressure. Infections are extremely common in people with or without high blood pressure. The inclusion of doxycycline in this rule may really be a red herring.
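If the review group agrees that certain items should not drive a screening rule, removing the affected rules is a simple filter. The following is a minimal sketch, assuming rules are held as pairs of item sets; `EXCLUDED_ITEMS` and `drop_rules_with` are hypothetical names, and only three of the ten rules are shown.

```python
# (antecedents, consequents) for a few of the top-ten rules, as sets of item names
rules = [
    ({"amphetamine salt combo xr", "lisinopril"}, {"abilify"}),
    ({"carvedilol", "doxycycline hyclate"}, {"abilify"}),
    ({"carvedilol", "glipizide"}, {"abilify"}),
]

# Hypothetical exclusion list that a physician/analyst review might agree on.
EXCLUDED_ITEMS = {"doxycycline hyclate"}

def drop_rules_with(rules, excluded):
    """Keep only rules whose antecedents and consequents avoid every excluded item."""
    return [
        (antecedents, consequents)
        for antecedents, consequents in rules
        if not (antecedents | consequents) & excluded
    ]

kept_rules = drop_rules_with(rules, EXCLUDED_ITEMS)
print(len(kept_rules), "rules kept")
```

Keeping the exclusion list as data rather than hard-coding it makes it easy to record what the group decided and why.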

If, during the discussion, it is determined no harm will come to the patient if screenings are implemented, then I would recommend that physicians begin conducting mental health screening if any of the antecedents in the top three rules are found in the patient's history, as well as any of the antecedents of the first nine rules of the top ten rules. For the tenth rule of the top ten set of rules, if a patient is found to be taking both abilify and lisinopril, the physician should consider screening for conditions that would warrant the use of carvedilol, most likely high blood pressure.

From the viewpoint of a former pharmacy technician and current analyst, I would also suggest "cleaning" this data differently next time. Many medications in the dataset are given in both their generic and brand name forms. An example of this is Cymbalta, which is a brand name medication, and duloxetine, which is its generic form. Instead of including both as separate products, I would pick "duloxetine" as the name for the product and modify all transactions that have the product "Cymbalta" to say "duloxetine" instead. They are the same drug: they treat the same thing and have no reason to be separated.
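A minimal sketch of that brand-to-generic cleanup, assuming each transaction is a list of item-name strings. The lookup table `BRAND_TO_GENERIC` is hypothetical and only its Cymbalta entry comes from the text; the two toy transactions are made up for illustration and are not taken from the dataset.

```python
# Hypothetical lookup table; a real one would be reviewed by a pharmacist.
BRAND_TO_GENERIC = {"cymbalta": "duloxetine"}

def normalize_transaction(items, mapping=BRAND_TO_GENERIC):
    """Lower-case item names, map brand names to generics, and drop repeats."""
    normalized = []
    for item in items:
        name = item.strip().lower()
        name = mapping.get(name, name)
        if name not in normalized:  # a brand and its generic may both appear in one transaction
            normalized.append(name)
    return normalized

# Toy transactions for illustration only.
transactions = [
    ["Cymbalta", "lisinopril"],
    ["duloxetine", "cymbalta", "abilify"],
]
cleaned = [normalize_transaction(t) for t in transactions]
print(cleaned)
```

The de-duplication step matters: without it, a transaction listing both names would count the same drug twice after renaming.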

In addition to the brand name/generic name issue, there is also the issue of the same medication having a unique product for each of its long-acting and short-acting forms. An example of this is how the dataset includes unique products for all of the following, which are the same medication: metoprolol, metoprolol succinate XL, metoprolol tartrate. Next time, it would probably be better to clean the dataset so that any form of metoprolol is just "metoprolol," without the indicator for how long-acting the medication is. They are all the same medication and treat the same condition: high blood pressure.
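The formulation issue can be handled with a similar pass that collapses every variant of a drug to its base name. This is a sketch under stated assumptions: `BASE_NAMES` is a hypothetical, hand-maintained list (only metoprolol is taken from the text), and matching on the leading word is a simplification that should be checked against the full product list before use.

```python
# Hypothetical list of base drug names whose formulations should be pooled.
BASE_NAMES = ("metoprolol",)

def collapse_formulation(item, base_names=BASE_NAMES):
    """Return the base drug name if the item is one of its formulations."""
    name = item.strip().lower()
    for base in base_names:
        if name == base or name.startswith(base + " "):
            return base
    return name

def collapse_transaction(items):
    """Collapse formulations and remove the duplicates that collapsing creates."""
    return sorted({collapse_formulation(item) for item in items})

# Toy transaction for illustration only.
example = collapse_transaction(["metoprolol succinate XL", "metoprolol tartrate", "abilify"])
print(example)
```

Pooling variants this way would raise the support of the combined "metoprolol" item, so the apriori step and the rules would need to be regenerated on the cleaned data.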

## E. Panopto
Here is the link to my Panopto video: https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=905b8fa7-5346-4b81-9db9-b1c4014e3f60

## F. Code Sources

Kmair, S. (2023, December 19). Market Basket analysis - Sara Kmair - Medium. *Medium.* https://sarakmair.medium.com/market-basket-analysis-8dc699b7e27

## G. Content Sources

Deniran, O. H. (2023, November 27). Boosting Sales with Data: The Power of Market Basket Analysis in Retail. *Medium.* https://medium.com/@chemistry8526/boosting-sales-with-data-the-power-of-market-basket-analysis-in-retail-c79cc10a14df

Gettu, N., & Saadabadi, A. (2023, May 16). *Aripiprazole.* StatPearls - NCBI Bookshelf. https://www.ncbi.nlm.nih.gov/books/NBK547739/

Sivek, S. C., PhD. (2021, December 16). Market Basket Analysis 101: Key Concepts - towards Data science. *Medium.* https://towardsdatascience.com/market-basket-analysis-101-key-concepts-1ddc6876cd00