## 1. Importing Data and Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/drug-performance-evaluation/Drug_clean.csv')

#Checking first ten rows of the Dataframe
df.head(10)

## 2. Data Handling and Data Cleaning

In [None]:
df.dtypes

In [None]:
df.info()

In [None]:
df.describe()

The `describe` output shows that the EaseOfUse, Satisfaction and Effective columns are scored from 1 (min) to 5 (max). I was curious about this because the dataset description does not say how ease of use and effectiveness are graded.

Every column has a count of 685, which means that there are no missing values.

Next, I want to look at the unique values of the categorical columns to get an idea of what we're dealing with, e.g. how many conditions are included and how many drugs are used.



In [None]:
#We can also recheck that there are no missing values with the function below
df.isna().sum()

In [None]:
#Renaming Columns for ease of use
df = df.rename(columns = {'Condition': 'condition', 'Drug':'drug', 'Indication':'indication', 'Type':'type', 'Effective':'effective', 'Reviews':'review', 'EaseOfUse':'ease_of_use', 'Satisfaction': 'satisfaction', 'Information':'information'})

Let's look at some of the categorical columns.

In [None]:
#Checking Condition Column

df['condition'].value_counts()

In [None]:
#capitalizing the conditions for standard format
df.condition = df.condition.str.capitalize()

In [None]:
#check capitalization
df['condition'].value_counts()

In [None]:
df['drug'].value_counts()

There are 37 unique values in condition and 470 different drugs in this dataset. To make it clearer, I used `value_counts()` to display the 37 conditions. We can see that the most common condition is hypertension, followed by atopic dermatitis, fever, and reflux disease.

At the start of this notebook, I initially thought of making a model to predict the most effective drug given a condition, but the data is very imbalanced. I believe this kind of imbalance means the data will not support training a machine learning model, because:
1. Some conditions appear in the data only 1-3 times, which is insufficient for training on that particular condition.
2. If we were to train a model anyway, it could only be trained on conditions with a higher number of samples, like hypertension or dermatitis. Even so, these sample sizes are very low for training a machine learning model.
3. To train a model that would benefit the user, more data for each condition is needed.

For those reasons, I will only carry out EDA on this dataset. 😊
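
To make the imbalance concrete, a small helper can count how many conditions fall below a minimum sample size. This is only a sketch on made-up rows: the function name `rare_condition_summary`, the `min_samples` threshold of 4 and the toy data are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd


def rare_condition_summary(frame, column="condition", min_samples=4):
    """Count categories with fewer than `min_samples` rows (threshold is an assumption)."""
    counts = frame[column].value_counts()
    rare = counts[counts < min_samples]
    return {
        "n_conditions": int(counts.size),
        "n_rare": int(rare.size),
        "share_of_rows_in_rare": float(rare.sum() / counts.sum()),
    }


# Made-up rows standing in for df['condition']
toy = pd.DataFrame(
    {"condition": ["Hypertension"] * 6 + ["Fever"] * 4 + ["Gout"] * 2 + ["Colic"]}
)
summary = rare_condition_summary(toy)
print(summary)
```

On the real data, calling `rare_condition_summary(df)` would show how many of the 37 conditions are too small to learn from.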

In [None]:
df['indication'].value_counts()

**Off-label** refers to the practice of prescribing a drug for a different purpose than what the FDA approved. This practice is called “off-label” because the drug is being used in a way not described on its package insert. This insert is known as its “label.”

**On-label** means the drug is being used for the same indication, dose, route of administration, patient population, and drug formulation as approved. There is no deviation from the approved FDA label.

We can also see that there is an unnamed category under indication (shown as `\r\r\n`), which we will rename to 'Unknown'.
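
A slightly more general variant of this clean-up treats any cell that is empty after stripping whitespace as a placeholder. This is a sketch under that assumption; `fill_blank_labels` and the toy values are hypothetical and not part of the dataset.

```python
import pandas as pd


def fill_blank_labels(series, placeholder="Unknown"):
    """Replace cells that are empty after stripping whitespace with `placeholder`."""
    stripped = series.str.strip()
    # Keep non-empty labels (and missing values) as they are
    return stripped.where(stripped != "", placeholder)


# Made-up values; "\r\r\n" mimics the stray line-break cells
toy_indication = pd.Series(["On Label", "\r\r\n", "Off Label", "  "])
cleaned = fill_blank_labels(toy_indication)
print(cleaned.value_counts())
```

Unlike a literal replacement, this also trims stray whitespace around real labels, which may or may not be wanted.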

In [None]:
#Renaming \r\r\n in the indication column

df['indication'] = df['indication'].str.replace('\r\r\n', 'Unknown')

df['indication'].value_counts()

In [None]:
df['type'].value_counts()

In [None]:
#Renaming \r\r\n in the type column

df['type'] = df['type'].str.replace('\r\r\n', 'Unknown')

df['type'].value_counts()

**RX drugs** are drugs that require a medical prescription. <br>
**OTC drugs** are drugs that can be bought over the counter without a prescription.


In [None]:
#let's look at the review column 
df['review'].value_counts()

In [None]:
#We will change the data type into float which is more suitable than integers
df['review'] = df['review'].astype('float64')
df.dtypes

## 3. Exploratory Data Analysis

Since the effective, ease_of_use and satisfaction columns are scored within the range of 1-5, I will split each of them into 3 categories (low, medium, high):

- **effective** = below 2: Uneffective, 2 to 3: Slightly Effective, above 3 up to 5: Effective
- **ease_of_use** = below 2: Difficult, 2 to 3: Normal, above 3 up to 5: Easy
- **satisfaction** = below 2: Unsatisfied, 2 to 3: Normal, above 3 up to 5: Satisfied
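
The same banding can also be written without explicit loops. The sketch below uses `numpy.select` with the cut-offs used in this section (below 2, up to 3, up to 5); the helper name `bin_scores`, the "Out of range" fallback label and the toy scores are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def bin_scores(scores, labels):
    """Map 1-5 scores to three labels; conditions are checked in order."""
    scores = pd.Series(scores, dtype="float64")
    conditions = [scores < 2.0, scores <= 3.0, scores <= 5.0]
    # Anything above 5 or missing gets the (assumed) fallback label
    binned = np.select(conditions, labels, default="Out of range")
    return pd.Series(binned, index=scores.index)


effect_labels = ["Uneffective", "Slightly Effective", "Effective"]
toy_scores = [1.0, 1.99, 2.0, 3.0, 3.01, 5.0]
toy_levels = bin_scores(toy_scores, effect_labels)
print(toy_levels.tolist())
```

Because the fallback label makes out-of-range values visible, the result always has the same length as the input column.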

In [None]:
#Categorizing the Column of Effective

effectiveness = []

for score in df['effective']:
    if score < 2.0 : effectiveness.append('Uneffective')
    elif score <= 3.0 : effectiveness.append('Slightly Effective')
    elif score <= 5.0 : effectiveness.append('Effective')
        
easeofuse = []

for score in df['ease_of_use']:
    if score < 2.0 : easeofuse.append('Difficult')
    elif score <= 3.0 : easeofuse.append('Normal')
    elif score <= 5.0 : easeofuse.append('Easy')
        
satisfaction_level = []

for score in df['satisfaction']:
    if score < 2.0 : satisfaction_level.append('Unsatisfied')
    elif score <= 3.0 : satisfaction_level.append('Normal')
    elif score <= 5.0 : satisfaction_level.append('Satisfied')

In [None]:
df['level_of_effectiveness'] = effectiveness
df['level_of_use'] = easeofuse
df['level_of_satisfaction'] = satisfaction_level

In [None]:
#check
df.head(10)

### 3.1 Forms of drugs
>Drug forms play a vital role in ensuring that medications are safe, effective, and convenient for patients. The selection of the appropriate drug form is a critical decision in the development and administration of pharmaceuticals. It takes into account factors such as the drug's properties, patient needs, and regulatory requirements.

In [None]:
hyper = df[df['condition'] == 'Hypertension']
atopic_derm = df[df['condition'] == 'Atopic dermatitis']
fever =df[df['condition'] == 'Fever']

In [None]:
#color variables to keep my visualisation theme consistent
color1= "#361d32"
color2= "#543c52"
color3= "#f55a51"
color4= "#edd2cb"
color5= "#f1e8e6"

In [None]:
#set color and figure size
background_color = "#fafafa"
fig = plt.figure(figsize=(23, 12), facecolor=background_color)
gs = fig.add_gridspec(2, 2)

#add subplots and customizations
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[0, 1])
ax2 = fig.add_subplot(gs[1, 0])
ax3 = fig.add_subplot(gs[1, 1])

#plot and set each plot title
sns.countplot(data=df, x='Form', ax=ax0, color="#361d32")
ax0.set_title('Drug Forms Used for all conditions', fontsize=12, fontweight='bold', fontfamily='georgia')

sns.countplot(data=hyper, x='Form', ax=ax1, color="#543c52")
ax1.set_title('Drug Forms Used for Hypertension', fontsize=12, fontweight='bold', fontfamily='georgia')

sns.countplot(data=atopic_derm, x='Form', ax=ax2, color="#f55a51")
ax2.set_title('Drug Forms Used for Atopic Dermatitis', fontsize=12, fontweight='bold', fontfamily='georgia')

sns.countplot(data=fever, x='Form', ax=ax3,color="#edd2cb")
ax3.set_title('Drug Forms Used for Fever', fontsize=12, fontweight='bold', fontfamily='georgia')

#inserting formats for all ax subplots
for ax in (ax0, ax1, ax2, ax3):
    ax.set_facecolor(background_color)
    ax.grid(color='gray', linestyle=':', axis='y', zorder=0, dashes=(1,5))
    for container in ax.containers:
        ax.bar_label(container, label_type='edge')


plt.show()

> The most common drug forms are tablets, liquid (drink), and cream. I have also included a few conditions (hypertension, atopic dermatitis, and fever) to see which form is used most for each. As we can see, the most common form differs by condition. Tablets dominate hypertension treatment, which makes sense because the condition is treated systemically, whereas atopic dermatitis is a skin condition and is mostly treated with topical medication. Topical medication is applied to body surfaces such as the skin or mucous membranes and comes in many classes, including creams, foams, gels, lotions, and ointments. For fever, a variety of forms is used, from liquids and tablets to others.
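
The same comparison can be read off a table instead of four charts. This is a sketch using `pd.crosstab` on made-up rows; the form names and counts in `toy` are assumptions and do not come from the dataset.

```python
import pandas as pd

# Made-up rows standing in for df[['condition', 'Form']]
toy = pd.DataFrame({
    "condition": ["Hypertension"] * 3 + ["Atopic dermatitis"] * 3 + ["Fever"] * 3,
    "Form": ["Tablet", "Tablet", "Capsule",
             "Cream", "Cream", "Tablet",
             "Liquid (Drink)", "Liquid (Drink)", "Tablet"],
})

# Share of each form within a condition (each row sums to 1)
form_share = pd.crosstab(toy["condition"], toy["Form"], normalize="index")

# Most frequent form per condition
top_form = form_share.idxmax(axis=1)
print(top_form)
```

With the real data, `pd.crosstab(df['condition'], df['Form'], normalize='index')` would cover all conditions at once.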

**Notes**
What is the difference between capsules and tablets?
Tablets are compressed solid doses of medication, while capsules hold the medication inside a shell. Capsules generally can't be crushed or split, while tablets often can. For more information, you can check out https://diferr.com/difference-between-tablets-and-capsules/

### 3.2 Distribution of drug effectiveness, ease of use and satisfaction

In [None]:
#set figure size and gridspec for subplotting
fig = plt.figure(figsize=(12, 18), facecolor=background_color)
gs = fig.add_gridspec(3, 1)

#add subplots and customizations
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[1, 0])
ax2 = fig.add_subplot(gs[2, 0])

for ax in (ax0, ax1, ax2):
    ax.set_facecolor(background_color)

sns.kdeplot(data=df, x='effective', ax=ax0, fill=True, color=color1)
sns.kdeplot(data=df, x='satisfaction', ax=ax1, fill=True, color=color3)
sns.kdeplot(data=df, x='ease_of_use', ax=ax2, fill=True, color=color4)

ax0.set(
    title='Effectiveness Distribution', 
    xlabel='Effective index')

ax1.set(
    title='Satisfaction Distribution', 
    xlabel='Satisfaction index')

ax2.set(
    title='Ease of Use Distribution', 
    xlabel='Ease of Use index')

plt.show()

> From the Effectiveness Distribution, most effectiveness ratings fall between 3 and ~4. This means that people find most of the drugs quite effective for the condition being treated.
<br>
From the Satisfaction Distribution, most ratings fall between 2-3 and 4, which means that satisfaction with the drugs is mostly Normal to Satisfied.
<br>
From the Ease of Use Distribution, most ratings are quite high, showing that the drugs are generally easy to use.

Overall, we can say that the drugs in this dataset are rated fairly well for the conditions they treat.
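
A density curve is smoothed, so a few numbers can back up the visual reading. The sketch below computes quartiles and the share of ratings at or above 3 on made-up values; the cut-off of 3 and the toy ratings are assumptions for illustration.

```python
import pandas as pd

# Made-up ratings standing in for the three rating columns
toy = pd.DataFrame({
    "effective": [2.0, 3.0, 3.5, 4.0, 5.0],
    "satisfaction": [1.0, 2.5, 3.0, 3.5, 4.0],
    "ease_of_use": [3.0, 4.0, 4.0, 4.5, 5.0],
})

# One row per rating column: quartiles plus share of ratings >= 3
summary = toy.quantile([0.25, 0.5, 0.75]).T
summary.columns = ["q25", "median", "q75"]
summary["share_at_least_3"] = (toy >= 3).mean()
print(summary)
```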

### 3.3 Correlations between Numerical Variables

In [None]:
numcol_df = df[['satisfaction','effective','ease_of_use','review','Price']]
numcol_df

In [None]:
df.head(10)

In [None]:
#Heatmap of Review, Effective, Ease of Use, Satisfaction & Price

corr_heat = sns.heatmap(numcol_df.corr(), cmap="PiYG", annot=True)

corr_heat.set_title('Heatmap of Review, Effective, Ease of Use, Satisfaction & Price', y=1.05, fontweight='heavy')

plt.yticks(rotation=0)
plt.show()

> The heatmap tells us that there is a strong positive correlation between **effectiveness and satisfaction** among people using the drugs as treatments. This means that the more effective the drug, the more satisfied the user tends to be. There is also a positive correlation between ease of use and both effectiveness and satisfaction.
<br>
There is no positive correlation between the number of reviews and the rating variables, meaning that drugs with more reviews do not tend to have higher values for the other 3 ratings.
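
Since the ratings are bounded between 1 and 5 and review counts can be very skewed, a rank-based correlation is a reasonable cross-check. The sketch below computes Spearman correlation as Pearson correlation on average ranks; this is an added robustness check rather than the method used for the heatmap, and `spearman_matrix` and the toy values are assumptions.

```python
import pandas as pd


def spearman_matrix(frame):
    """Spearman rank correlation: Pearson correlation computed on average ranks."""
    return frame.rank(method="average").corr(method="pearson")


# Made-up values standing in for three numeric columns
toy = pd.DataFrame({
    "effective": [1.0, 2.0, 3.0, 4.0, 5.0],
    "satisfaction": [1.0, 1.5, 3.0, 4.5, 5.0],
    "review": [10.0, 500.0, 3.0, 120.0, 40.0],
})
rho = spearman_matrix(toy)
print(rho.round(2))
```

In this toy frame the two ratings rise together while the review counts barely track them, which is the kind of pattern described above.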

Now that we know there is some correlation going on, we can check it visually with scatterplots.

In [None]:
#Correlation Between Ratings

fig = plt.figure(figsize=(25, 8), facecolor=background_color)
gs = fig.add_gridspec(1,3)

#add subplots and customizations
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[0, 1])
ax2 = fig.add_subplot(gs[0, 2])

fig.suptitle('Correlation Between Ratings', fontweight='bold', size=20, fontfamily='georgia')

for ax in (ax0, ax1, ax2):
    ax.set_facecolor(background_color)

sns.scatterplot(ax=ax0,data=df, x='effective', y='ease_of_use', color = color3)
sns.scatterplot(ax=ax1,data=df, x='effective', y='satisfaction', color = color2)
sns.scatterplot(ax=ax2,data=df, x='ease_of_use', y='satisfaction', color = color4)

ax0.set_title('Correlation Between Ease of Use & Effectiveness', fontweight='bold', fontfamily='georgia', fontsize=12)
ax0.set_ylabel('Ease Of use', fontweight='bold')
ax0.set_xlabel('Effectiveness', fontweight='bold')

ax1.set_title('Correlation Between Satisfaction & Effectiveness', fontweight='bold',fontfamily='georgia', fontsize=12)
ax1.set_ylabel('Satisfaction', fontweight='bold')
ax1.set_xlabel('Effectiveness', fontweight='bold')

ax2.set_title('Correlation Between Satisfaction & Ease of Use', fontweight='bold',fontfamily='georgia', fontsize=12)
ax2.set_ylabel('Satisfaction', fontweight='bold')
ax2.set_xlabel('Ease of Use', fontweight='bold')

plt.show()

### 3.4 The highest rated and most effective drugs for condition
I want to show a comparison between OTC and RX drugs, sorted by the number of reviews.

The 3 conditions that I am interested in are:
1. Hypertension
2. Back Pain
3. Fever

These are some of the most common health conditions we experience in daily life. :)

In [None]:
#First I will filter the rows with conditions from different columns, sorting from most reviews to least

RX_hyperten = df.loc[(df['level_of_effectiveness']=='Effective') & (df['type']=='RX') & (df['condition']=='Hypertension')].sort_values(by='review', ascending=False)
#There are no OTC hypertension drugs in this dataset

RX_fever = df.loc[(df['level_of_effectiveness']=='Effective') & (df['type']=='RX') & (df['condition']=='Fever')].sort_values(by='review', ascending=False)
OTC_fever = df.loc[(df['level_of_effectiveness']=='Effective') & (df['type']=='OTC') & (df['condition']=='Fever')].sort_values(by='review', ascending=False)

#There are no RX drugs for back pain in this dataset
OTC_backpain = df.loc[(df['level_of_effectiveness']=='Effective') & (df['type']=='OTC') & (df['condition']=='Back pain')].sort_values(by='review', ascending=False)


In [None]:
RX_hyperten

In [None]:
#Getting the top 5 drugs
top_RX_hyperten = RX_hyperten[0:5]
top_RX_fever = RX_fever[0:5]
top_OTC_fever = OTC_fever[0:5]
top_OTC_backpain = OTC_backpain[0:5]
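
The four filters above follow the same pattern, so they could be wrapped in one helper. This is a sketch on made-up rows; `top_effective` and the toy data are hypothetical names for illustration, while the column names match the ones used in this notebook.

```python
import pandas as pd


def top_effective(frame, condition, drug_type, n=5):
    """Rows rated 'Effective' for one condition and type, most-reviewed first."""
    mask = (
        (frame["level_of_effectiveness"] == "Effective")
        & (frame["type"] == drug_type)
        & (frame["condition"] == condition)
    )
    return frame.loc[mask].sort_values(by="review", ascending=False).head(n)


# Made-up rows with the same column names as df
toy = pd.DataFrame({
    "drug": ["A", "B", "C", "D"],
    "condition": ["Fever", "Fever", "Fever", "Back pain"],
    "type": ["OTC", "OTC", "RX", "OTC"],
    "level_of_effectiveness": ["Effective", "Effective", "Effective", "Slightly Effective"],
    "review": [10.0, 250.0, 40.0, 5.0],
})
toy_top_otc_fever = top_effective(toy, "Fever", "OTC")
print(toy_top_otc_fever["drug"].tolist())
```

An empty result, as for `top_effective(toy, "Back pain", "OTC")` here, signals that no drug matches all three filters.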

In [None]:
#checking that hypertension does not have OTC type drugs
hyperten_all = df.loc[df['condition']=='Hypertension']
hyperten_all['type'].value_counts()

In [None]:
#checking that back pain does not have RX type drugs
backpain_all = df.loc[df['condition']=='Back pain']
backpain_all['type'].value_counts()

In [None]:
#set color and figure size
fig = plt.figure(figsize=(12, 5), facecolor=background_color)

fig.suptitle('Most Effective Prescription Drugs for Hypertension', fontweight='bold', size=12, fontfamily='georgia')

ax = sns.barplot(data=top_RX_hyperten,x='review',y='drug',color= color1, order=top_RX_hyperten.sort_values('effective', ascending=False).drug)

ax.set_ylabel(None)
ax.set_facecolor(background_color)

for p in ["top", "right"]:
        ax.spines[p].set_visible(False)
        
for j in ax.containers:
        ax.bar_label(j, label_type='edge')
        
fig.text(0.03, 0.89, "Drug Effectiveness ↑", fontweight='bold')

plt.show()

Reviews provide valuable insights into how well a medication works and whether it is safe for use. Patients who have taken the medication can report any side effects or adverse reactions, which can be crucial for identifying potential safety concerns. However, it's essential to approach reviews with a critical eye, as they can sometimes be biased or inaccurate. Not all reviews are equally reliable, and some may be influenced by factors such as personal preferences, expectations, or the placebo effect.

>From the above chart, we can see that although Lisinopril is ranked second for drug effectiveness, it has 2000+ more reviews than Nebivolol (ranked 1st).

It is also important to note that medications affect every individual differently, and a drug that works well for someone else might not work the same for you.

In [None]:
#set color and figure size
fig = plt.figure(figsize=(12, 10), facecolor=background_color)
gs = fig.add_gridspec(2, 1,hspace=0.3)

#add subplots and customizations
ax0 = fig.add_subplot(gs[0, 0])
ax1 = fig.add_subplot(gs[1, 0])

sns.barplot(ax=ax0, data=top_OTC_fever,x='review',y='drug',color= color2, order=top_OTC_fever.sort_values('effective', ascending=False).drug)
sns.barplot(ax=ax1, data=top_RX_fever,x='review',y='drug',color= color3, order=top_RX_fever.sort_values('effective', ascending=False).drug)

for ax in (ax0, ax1):
    ax.set_facecolor(background_color)
    ax.set_ylabel(None)
    #bars show the number of reviews; the top-to-bottom order encodes effectiveness
    ax.set_xlabel("Number of reviews")
    for p in ["top", "right"]:
        ax.spines[p].set_visible(False)
    for j in ax.containers:
        ax.bar_label(j, label_type='edge')

    
ax0.set_title("Most Effective Over-The-Counter (OTC) Drugs for Fever", fontfamily="georgia", fontsize=15, fontweight="bold")
ax1.set_title("Most Effective Prescription (RX) Drugs for Fever", fontfamily="georgia", fontsize=15, fontweight="bold")

#adding some extra text on figure
fig.text(0.03, 0.89, "Drug Effectiveness ↑", fontweight='bold')
fig.text(0.03, 0.45, "Drug Effectiveness ↑", fontweight='bold')


plt.show()

In [None]:
#set color and figure size
fig = plt.figure(figsize=(12, 5), facecolor=background_color)

fig.suptitle('Most Effective Over-The-Counter(OTC) Drugs for Back Pain', fontweight='bold', size=12, fontfamily='georgia')

ax = sns.barplot(data=top_OTC_backpain,x='review',y='drug',color= color1, order=top_OTC_backpain.sort_values('effective', ascending=False).drug)

ax.set_ylabel(None)
ax.set_facecolor(background_color)

for p in ["top", "right"]:
        ax.spines[p].set_visible(False)
        
for j in ax.containers:
        ax.bar_label(j, label_type='edge')
        
fig.text(0.03, 0.89, "Drug Effectiveness ↑", fontweight='bold')

plt.show()

## 4. Summary
In this analysis, we identified the top drugs used for three commonly occurring conditions: fever, back pain and hypertension.

We also learned that there are positive correlations between the rating variables effective, satisfaction and ease of use. This makes sense because the more effective a drug is at relieving the pain/symptoms of a condition, the more satisfied the user will be. Ease of use is also related to the satisfaction level of the drug.

Taking the number of reviews into account gives more insight into how reliable a drug's overall effectiveness rating is, because the rating is backed by a larger number of users rather than just one or two. A low number of reviews for a medication may not necessarily reflect its quality or effectiveness. It could be due to various factors, including the medication's newness, limited availability, or niche use. When considering a medication with a low number of reviews, it's essential to consult with healthcare professionals, consider other sources of information, and evaluate the available evidence to make an informed decision about its suitability for a particular condition or individual.
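
One way to combine a rating with its number of reviews is a shrinkage (Bayesian-average style) score, which pulls ratings backed by few reviews towards a prior mean. This technique is not used in the analysis above; the sketch below is only an illustration, and `shrunk_rating`, the prior mean of 3.5, the prior weight of 50 and the toy drugs are all assumptions.

```python
import pandas as pd


def shrunk_rating(rating, n_reviews, prior_mean, prior_weight=50):
    """Weighted average of the observed rating and a prior mean.

    With few reviews the result stays near `prior_mean`;
    with many reviews it approaches the observed rating.
    """
    return (n_reviews * rating + prior_weight * prior_mean) / (n_reviews + prior_weight)


# Made-up drugs: a perfect score from 2 reviews vs. 4.5 from 2000 reviews
toy = pd.DataFrame({
    "drug": ["Drug X", "Drug Y"],
    "effective": [5.0, 4.5],
    "review": [2.0, 2000.0],
})
toy["adjusted"] = shrunk_rating(toy["effective"], toy["review"], prior_mean=3.5)
ranked = toy.sort_values("adjusted", ascending=False)
print(ranked[["drug", "adjusted"]])
```

Here the heavily reviewed drug ranks first even though its raw rating is lower, which mirrors the Lisinopril and Nebivolol comparison. The choice of prior mean and weight changes the ranking, so they would need to be justified for real use.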