### Text Mining


## Exploring Patient Perspectives: Sentiment Analysis of Drug Experiences Using the Drug Review Dataset

In the realm of healthcare and pharmaceuticals, understanding patient experiences with specific drugs is crucial for enhancing treatment efficacy and patient care. The Drug Review Dataset, sourced from Druglib.com and generously donated in 2018, presents a valuable resource for analyzing patient-reported outcomes and sentiments associated with various medications. This dataset comprises patient reviews encompassing benefits, side effects, and overall comments, along with ratings on satisfaction, side effects, and effectiveness.

The objective of this analysis is threefold:

- Sentiment Analysis: To delve into the nuanced sentiments expressed by patients regarding the effectiveness and side effects of different drugs.
- Domain Transferability: To explore the transferability of sentiment analysis models across diverse medical conditions, shedding light on the generalizability of findings.
- Source Transferability: To investigate the applicability of sentiment analysis models across different data sources, comparing insights derived from the Druglib.com dataset with those from other pharmaceutical review platforms like Drugs.com.

By addressing these objectives, we aim to gain deeper insights into patient perspectives on drug experiences, potentially informing clinical decision-making and improving patient outcomes. This analysis not only contributes to the burgeoning field of text analytics in healthcare but also underscores the importance of harnessing patient-reported data for enhancing medical research and practice.

### Import common packages

In [184]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

# Set the NumPy random seed to 1 for reproducibility
np.random.seed(1)

### Load data

In [185]:
df = pd.read_csv('./Data/drug_data.tsv', delimiter='\t')
df.head(5)


Unnamed: 0.1,Unnamed: 0,urlDrugName,rating,effectiveness,sideEffects,condition,benefitsReview,sideEffectsReview,commentsReview
0,2202,enalapril,4,Highly Effective,Mild Side Effects,management of congestive heart failure,slowed the progression of left ventricular dys...,"cough, hypotension , proteinuria, impotence , ...","monitor blood pressure , weight and asses for ..."
1,3117,ortho-tri-cyclen,1,Highly Effective,Severe Side Effects,birth prevention,Although this type of birth control has more c...,"Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon...","I Hate This Birth Control, I Would Not Suggest..."
2,1146,ponstel,10,Highly Effective,No Side Effects,menstrual cramps,I was used to having cramps so badly that they...,Heavier bleeding and clotting than normal.,I took 2 pills at the onset of my menstrual cr...
3,3947,prilosec,3,Marginally Effective,Mild Side Effects,acid reflux,The acid reflux went away for a few months aft...,"Constipation, dry mouth and some mild dizzines...",I was given Prilosec prescription at a dose of...
4,1951,lyrica,2,Marginally Effective,Severe Side Effects,fibromyalgia,I think that the Lyrica was starting to help w...,I felt extremely drugged and dopey. Could not...,See above


## Select the variables which have the predictive power

In [186]:
df = df[["rating", "commentsReview"]]

In [187]:
df.head(5)

Unnamed: 0,rating,commentsReview
0,4,"monitor blood pressure , weight and asses for ..."
1,1,"I Hate This Birth Control, I Would Not Suggest..."
2,10,I took 2 pills at the onset of my menstrual cr...
3,3,I was given Prilosec prescription at a dose of...
4,2,See above


## Check the missing values

In [188]:
df.isna().sum()

rating            0
commentsReview    8
dtype: int64

## Dropping rows with missing values in the 'commentsReview' column to ensure data completeness

In [189]:
df = df.dropna(subset=['commentsReview'])
df[['commentsReview']].isna().sum()

commentsReview    0
dtype: int64

### Transforming the target variable into binary values, categorizing ratings as follows: 0 for ratings less than or equal to 5 (considered as "Bad"), and 1 for ratings greater than 5 (considered as "Good").

In [190]:
df['rating'] = df['rating'].apply(lambda x: 0 if x <= 5 else 1)
df['rating']

0       0
1       0
2       1
3       0
4       0
       ..
3102    1
3103    0
3104    0
3105    1
3106    0
Name: rating, Length: 3099, dtype: int64
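Before modeling, it can help to check how the two classes are distributed after binarization, since a strongly uneven split makes plain accuracy hard to interpret. The following is a minimal sketch on a small, hypothetical `ratings` series (not the actual dataset); it applies the same threshold as the cell above.

```python
import pandas as pd

# Hypothetical ratings, standing in for df['rating'] before binarization
ratings = pd.Series([4, 1, 10, 3, 2, 8, 9, 7, 6, 10])

# Same rule as above: 0 for ratings <= 5 ("Bad"), 1 for ratings > 5 ("Good")
labels = (ratings > 5).astype(int)

# Share of each class, ordered by class label
class_share = labels.value_counts(normalize=True).sort_index()
print(class_share)
```

On the real data, the same `value_counts(normalize=True)` call on `df['rating']` would show how dominant the "Good" class is.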

## Assign the input variable to X and the target variable to y

In [191]:
X = df['commentsReview']

In [192]:
y = df['rating']
y.unique()

array([0, 1])

### Install required packages

In [193]:
!pip3 install nltk
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('wordnet')



[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/poorna/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /Users/poorna/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/poorna/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Lemmatization of Text Data using NLTK

In [194]:
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize

# Define the corpus of documents
corpus = df['commentsReview'].tolist()

# Lemmatize the corpus using NLTK
transformed_corpus = []
wnl = WordNetLemmatizer()
for document in corpus:
    transformed_document = ""
    for word, tag in pos_tag(word_tokenize(document)):
        wntag = tag[0].lower()
        wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
        if not wntag:
            lemma = word
        else:
            lemma = wnl.lemmatize(word, wntag)
        transformed_document += lemma + " "
    transformed_corpus += [transformed_document]
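# Note: X was assigned from df['commentsReview'] before this step, so X still holds the
# original (unlemmatized) text unless it is re-assigned after this line
df['commentsReview'] = transformed_corpus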

df.head(5)


Unnamed: 0,rating,commentsReview
0,0,"monitor blood pressure , weight and ass for re..."
1,0,"I Hate This Birth Control , I Would Not Sugges..."
2,1,I take 2 pill at the onset of my menstrual cra...
3,0,I be give Prilosec prescription at a dose of 4...
4,0,See above
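One detail of the loop above: `pos_tag` returns Penn Treebank tags, in which adjectives start with `J` (`JJ`, `JJR`, `JJS`), whereas WordNet uses `a` for adjectives. Taking `tag[0].lower()` therefore yields `j` for adjectives, which is not in the accepted list, so adjectives are left unlemmatized. If lemmatizing adjectives is intended, an explicit mapping is one option. The helper below, `penn_to_wordnet`, is a hypothetical name introduced for illustration and is not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter, or None if there is no match."""
    if tag.startswith('J'):
        return 'a'  # adjectives: JJ, JJR, JJS
    if tag.startswith('V'):
        return 'v'  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('N'):
        return 'n'  # nouns: NN, NNS, NNP, NNPS
    if tag.startswith('RB'):
        return 'r'  # adverbs: RB, RBR, RBS (but not RP, which is a particle)
    return None


# Example tags as they might come back from pos_tag
for tag in ['JJ', 'VBD', 'NNS', 'RB', 'RP', 'DT']:
    print(tag, '->', penn_to_wordnet(tag))
```

In the loop, `wntag = penn_to_wordnet(tag)` could replace the two lines that derive `wntag` from the first letter of the tag.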


## Split the data into train and test

In [195]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [196]:
X_train.shape, y_train.shape

((2169,), (2169,))

In [197]:
X_test.shape, y_test.shape

((930,), (930,))
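The split above is neither seeded nor stratified, so the class proportions in the test set can vary from run to run. A possible variant, sketched here on hypothetical toy data, passes `random_state` and `stratify` to `train_test_split` so that the split is reproducible and both parts keep the original class ratio.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 short reviews, 14 labeled "good" (1) and 6 labeled "bad" (0)
texts = [f"review {i}" for i in range(20)]
labels = [1] * 14 + [0] * 6

# random_state makes the split repeatable; stratify keeps the 70/30 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=1, stratify=labels
)

print(len(X_tr), len(X_te))
print("share of class 1 in test:", sum(y_te) / len(y_te))
```

On the notebook's data, the equivalent call would be `train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)`.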

## Text Feature Extraction using TF-IDF Vectorization

In [198]:
# TF-IDF Vectorization with preprocessing, tokenization, and stop words removal
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer with English stop words and lowercase conversion
# Token pattern excludes digits and non-word characters
tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern=r"[^\W\d_]+")

# Transform the training data into TF-IDF features
X_train = tfidf_vect.fit_transform(X_train)

In [199]:
# Transform the test data with the vectorizer that was fitted on the training data.
# Be careful: fitting a new vectorizer on the test set would produce a different vocabulary,
# so the test features would not match the training features.
X_test = tfidf_vect.transform(X_test)

In [200]:
X_train.shape, X_test.shape

((2169, 6978), (930, 6978))
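The effect of fitting on the training set only can be seen on a tiny example. This is a minimal sketch with made-up sentences: words that appear only in the test document are dropped, and the test matrix has the same number of columns as the training matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, not taken from the dataset
train_docs = ["mild headache after dose", "no side effects at all"]
test_docs = ["severe headache and nausea"]

vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern=r"[^\W\d_]+")
train_features = vect.fit_transform(train_docs)  # learns the vocabulary from train only
test_features = vect.transform(test_docs)        # reuses that vocabulary

print(sorted(vect.vocabulary_))
print(train_features.shape, test_features.shape)
```

Because "nausea" and "severe" never occur in the training documents, they have no column and contribute nothing to `test_features`.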

In [201]:
# The result is a sparse matrix; displaying it directly shows only a summary
X_train

<2169x6978 sparse matrix of type '<class 'numpy.float64'>'
	with 42159 stored elements in Compressed Sparse Row format>

In [202]:
# These data sets are "sparse matrix". We can't see them unless we convert using toarray()
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

## Latent Semantic Analysis (Singular Value Decomposition)

In [203]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=300, n_iter=10) #n_components is the number of topics, which should be less than the number of features
X_train= svd.fit_transform(X_train)
X_test = svd.transform(X_test)


In [204]:
X_train.shape, X_test.shape

((2169, 300), (930, 300))
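The choice of 300 components can be checked by looking at how much variance the reduced representation retains. The sketch below uses a random matrix as a hypothetical stand-in for the TF-IDF features and sums `explained_variance_ratio_`, an attribute of the fitted `TruncatedSVD` object.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for a TF-IDF matrix: 50 documents, 20 features
rng = np.random.default_rng(1)
toy_features = rng.random((50, 20))

svd_toy = TruncatedSVD(n_components=5, n_iter=10, random_state=1)
reduced = svd_toy.fit_transform(toy_features)

# Fraction of the total variance kept by the 5 components
retained = svd_toy.explained_variance_ratio_.sum()
print(reduced.shape, f"retained variance: {retained:.3f}")
```

In the notebook, `svd.explained_variance_ratio_.sum()` after fitting would give the corresponding figure for the 300 components.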

## Random Forest

In [205]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf = RandomForestClassifier(n_estimators=300, max_leaf_nodes=10, n_jobs=-1) 
_ = rnd_clf.fit(X_train, y_train)

### Evaluating Model Performance

In [206]:
from sklearn.metrics import accuracy_score

In [207]:
# Train accuracy: not a good measure of model performance, because the model is scored on the data it was trained on
y_pred_train = rnd_clf.predict(X_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.7303


In [208]:
# Test accuracy
y_pred_test = rnd_clf.predict(X_test)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.7516


In [209]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[  0, 231],
       [  0, 699]])

### Discussion on the Results

The confusion matrix shows that the model predicted the second class (1) for every test instance: all 231 instances of the first class (0) were misclassified as class 1, while all 699 instances of class 1 were classified correctly. The test accuracy of 75.16% therefore equals the share of class 1 in the test set. The model is biased towards the majority class, which points to a class imbalance problem and no ability to identify the minority class.
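The comparison with a majority-class baseline can be worked out directly from the confusion matrix. The following sketch uses the Random Forest matrix reported above and computes the accuracy, the accuracy of always predicting the most frequent class, and the recall of each class.

```python
import numpy as np

# Random Forest test confusion matrix (rows: true class, columns: predicted class)
cm = np.array([[0, 231],
               [0, 699]])

accuracy = np.trace(cm) / cm.sum()             # correct predictions / all predictions
baseline = cm.sum(axis=1).max() / cm.sum()     # always predict the most frequent true class
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(f"accuracy: {accuracy:.4f}, majority baseline: {baseline:.4f}")
print("recall per class:", recall_per_class)
```

A recall of 0 for class 0 alongside a recall of 1 for class 1 is the signature of a model that always predicts the majority class.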

## Stochastic Gradient Descent Classifier

In [210]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=100)
_ = sgd_clf.fit(X_train, y_train)

### Evaluating Model Performance

In [211]:
#Train accuracy
y_pred_train = sgd_clf.predict(X_train)
print(f"Train acc: {accuracy_score(y_train, y_pred_train):.4f}")

Train acc: 0.8059


In [212]:
# Test accuracy
y_pred_test = sgd_clf.predict(X_test)
print(f"Test acc: {accuracy_score(y_test, y_pred_test):.4f}")

Test acc: 0.7613


In [213]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test)

array([[ 43, 188],
       [ 34, 665]])

### Discussion on the Results

In the confusion matrix:

- True Negatives (TN): 43 instances were correctly classified as the first class.
- False Positives (FP): 188 instances were incorrectly classified as the second class.
- False Negatives (FN): 34 instances were incorrectly classified as the first class.
- True Positives (TP): 665 instances were correctly classified as the second class.

Interpretation:

- The model correctly classified 43 instances as the first class and 665 instances as the second class, which corresponds to a test accuracy of about 76.13% (708 of 930).
- However, it misclassified 188 instances as the second class and 34 instances as the first class.

Overall, the confusion matrix suggests that the model predicts the second class far more accurately than the first class.

## Fitting data using Logistic Regression 

In [214]:
# Train the Logistic Regression model
model = LogisticRegression(solver='lbfgs', multi_class='ovr')  
# Use 'lbfgs' for text data. if solver is not specified by default the solver will be 'lbfgs' in sklearn as of the current documentation
# lbfgs - Limited-memory Broyden–Fletcher–Goldfarb–Shanno Algorithm
# The term Limited-memory simply means it stores only a few vectors that represent the approximation implicitly.

model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
log_reg_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", log_reg_accuracy)
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.7602150537634409
[[ 16 215]
 [  8 691]]


### Discussion on the Results

The confusion matrix shows the following:

- True Negatives (TN): 16 instances were correctly classified as the first class.
- False Positives (FP): 215 instances were incorrectly classified as the second class.
- False Negatives (FN): 8 instances were incorrectly classified as the first class.
- True Positives (TP): 691 instances were correctly classified as the second class.

Interpretation:
- The model achieved an accuracy of approximately 76.02%, indicating that it correctly predicted the class labels for about 76.02% of the instances.
- It correctly classified a relatively small number of instances as the first class (16), while a larger number were misclassified as the second class (215).
- Similarly, a small number of instances were incorrectly classified as the first class (8), while a larger number were correctly classified as the second class (691).

Overall, the model shows some ability to discriminate between the two classes, but there is room for improvement, particularly in reducing false positive predictions.
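One common way to reduce false positives on an imbalanced target is to reweight the classes during training. The sketch below is an illustration on synthetic one-dimensional data, not on the drug reviews: it compares a default `LogisticRegression` with one that uses `class_weight='balanced'` and reports the recall of the minority class. For brevity it scores on the training data, which would not be appropriate for a real evaluation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic, imbalanced toy data (hypothetical): 100 "bad" (0) vs 300 "good" (1) samples
rng = np.random.default_rng(1)
X_toy = np.concatenate([rng.normal(-1.0, 1.0, 100),
                        rng.normal(1.0, 1.0, 300)]).reshape(-1, 1)
y_toy = np.array([0] * 100 + [1] * 300)

plain = LogisticRegression().fit(X_toy, y_toy)
weighted = LogisticRegression(class_weight='balanced').fit(X_toy, y_toy)

# Recall of the minority class: share of true class-0 samples predicted as class 0
recall0_plain = recall_score(y_toy, plain.predict(X_toy), pos_label=0)
recall0_weighted = recall_score(y_toy, weighted.predict(X_toy), pos_label=0)
print(f"class-0 recall, default: {recall0_plain:.3f}, balanced: {recall0_weighted:.3f}")
```

Reweighting usually trades some majority-class accuracy for better minority-class recall, so whether it helps here would need to be checked on the actual test set.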

## Fitting data using KNN

In [215]:
# Define and train the KNN model
knn = KNeighborsClassifier(n_neighbors=5)  # Start with k=5, adjust as needed
knn.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn.predict(X_test)

# Evaluate model performance
knn_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", knn_accuracy)
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.7451612903225806
[[ 25 206]
 [ 31 668]]


### Discussion on the Results

The confusion matrix presents the following:
- True Negatives (TN): 25 instances were correctly classified as the first class.
- False Positives (FP): 206 instances were incorrectly classified as the second class.
- False Negatives (FN): 31 instances were incorrectly classified as the first class.
- True Positives (TP): 668 instances were correctly classified as the second class.

Interpretation:
- The accuracy of the model is approximately 74.52%, indicating that it correctly predicted the class labels for about 74.52% of the instances.
- It correctly identified a relatively small number of instances as the first class (25), while a substantial number were misclassified as the second class (206).
- A smaller number of instances were incorrectly classified as the first class (31), while most instances of the second class were classified correctly (668).

This model demonstrates some capability to differentiate between the two classes, but there's room for improvement, particularly in reducing false positives and false negatives.

## Fitting data using SVM

In [216]:
# Define and train the SVM model
svm_linear = SVC(kernel='linear')  # Starting with linear kernel
svm_linear.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svm_linear.predict(X_test)

# Evaluate model performance
svm_linear_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", svm_linear_accuracy)
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.7569892473118279
[[  9 222]
 [  4 695]]


### Discussion on the Results

The confusion matrix indicates the following:
- True Negatives (TN): 9 instances were correctly classified as the first class.
- False Positives (FP): 222 instances were incorrectly classified as the second class.
- False Negatives (FN): 4 instances were incorrectly classified as the first class.
- True Positives (TP): 695 instances were correctly classified as the second class.

Interpretation:
- The accuracy of the model is approximately 75.70%, indicating that it correctly predicted the class labels for about 75.70% of the instances.
- It correctly identified only a small number of instances as the first class (9), while a relatively large number were misclassified as the second class (222).
- A very small number of instances were incorrectly classified as the first class (4), while a substantial number were correctly classified as the second class (695).

Overall, this model demonstrates some ability to discriminate between the two classes, but there's notable room for improvement, particularly in reducing false positives.

In [217]:
svm_poly = SVC(kernel='poly')  # With polynomial kernel
svm_poly.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svm_poly.predict(X_test)

# Evaluate model performance
svm_poly_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", svm_poly_accuracy)
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.7505376344086021
[[  4 227]
 [  5 694]]


### Discussion on the Results

The confusion matrix shows:
- True Negatives (TN): Only 4 instances were correctly classified as the first class.
- False Positives (FP): 227 instances were incorrectly classified as the second class.
- False Negatives (FN): 5 instances were incorrectly classified as the first class.
- True Positives (TP): 694 instances were correctly classified as the second class.

Interpretation:
- The accuracy of the model is approximately 75.05%, indicating that it correctly predicted the class labels for about 75.05% of the instances.
- It correctly identified a very small number of instances as the first class (4), while a considerably large number were misclassified as the second class (227).
- Also, a very small number of instances were incorrectly classified as the first class (5), while a substantial number were correctly classified as the second class (694).

Overall, this model demonstrates some capability to differentiate between the two classes, but there's notable room for improvement, especially in reducing false positives and increasing true negatives.

In [218]:
svm_rbf = SVC(kernel='rbf')  # With RBF kernel
svm_rbf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svm_rbf.predict(X_test)

# Evaluate model performance
svm_rbf_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", svm_rbf_accuracy)
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.7602150537634409
[[ 24 207]
 [ 16 683]]


### Discussion on the Results

The confusion matrix shows:

- True Negatives (TN): 24 instances were correctly classified as the first class.
- False Positives (FP): 207 instances were incorrectly classified as the second class.
- False Negatives (FN): 16 instances were incorrectly classified as the first class.
- True Positives (TP): 683 instances were correctly classified as the second class.

Interpretation:
- The accuracy of the model is approximately 76.02%, indicating that it correctly predicted the class labels for about 76.02% of the instances.
- It correctly identified a relatively small number of instances as the first class (24), while a considerable number were misclassified as the second class (207).
- A small number of instances were incorrectly classified as the first class (16), while a substantial number were correctly classified as the second class (683).

Overall, this model demonstrates reasonable capability to differentiate between the two classes, but there's still room for improvement, particularly in reducing false positives.

## Fitting data with DecisionTree Classifier


In [219]:
# Define and train the Decision Tree model
dtree = DecisionTreeClassifier(max_depth=5)  # Start with a moderate depth, adjust as needed
dtree.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dtree.predict(X_test)

# Evaluate model performance
dtree_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", dtree_accuracy)
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.7397849462365591
[[ 11 220]
 [ 22 677]]


### Discussion on the Results

The confusion matrix shows:

- True Negatives (TN): 11 instances were correctly classified as the first class.
- False Positives (FP): 220 instances were incorrectly classified as the second class.
- False Negatives (FN): 22 instances were incorrectly classified as the first class.
- True Positives (TP): 677 instances were correctly classified as the second class.

Interpretation:
- The accuracy of the model is approximately 73.98%, indicating that it correctly predicted the class labels for about 73.98% of the instances.
- It correctly identified only a small number of instances as the first class (11), while a relatively large number were misclassified as the second class (220).
- A small number of instances were incorrectly classified as the first class (22), while a considerable number were correctly classified as the second class (677).

Overall, this model demonstrates some capability to differentiate between the two classes, but there's notable room for improvement, particularly in reducing false positives and false negatives.

## Fitting data with AdaBoost Classifier

In [220]:
# Define and train the Adaboost model (consider hyperparameter tuning)
adaboost = AdaBoostClassifier(n_estimators=100)  # Start with 100 base learners, adjust as needed
adaboost.fit(X_train, y_train)

# Make predictions on the test set
y_pred = adaboost.predict(X_test)

# Evaluate model performance
adaboost_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", adaboost_accuracy)
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.7096774193548387
[[ 68 163]
 [107 592]]


### Discussion on the Results

The provided confusion matrix reveals:
- True Negatives (TN): 68 instances were correctly classified as the first class.
- False Positives (FP): 163 instances were incorrectly classified as the second class.
- False Negatives (FN): 107 instances were incorrectly classified as the first class.
- True Positives (TP): 592 instances were correctly classified as the second class.

Interpretation:
- The accuracy of the model is approximately 70.97%, indicating that it correctly predicted the class labels for about 70.97% of the instances.
- It correctly identified a moderate number of instances as the first class (68), while a relatively large number were misclassified as the second class (163).
- A considerable number of instances were incorrectly classified as the first class (107), while a substantial number were correctly classified as the second class (592).

Overall, this model demonstrates some capability to differentiate between the two classes, but there's notable room for improvement, particularly in reducing false positives and false negatives.

## Fitting data with XGBoost Classifier

In [221]:
# Define and train the XGBoost model
xgb_model = XGBClassifier(objective='binary:logistic',  # Use logistic objective for binary classification
                          n_estimators=100,            # Start with 100 trees, adjust as needed
                          learning_rate=0.1,            # Learning rate, adjust as needed
                          max_depth=5)                 # Maximum tree depth, adjust as needed
xgb_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = xgb_model.predict(X_test)

# Evaluate model performance
xgb_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", xgb_accuracy)
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.7494623655913979
[[ 25 206]
 [ 27 672]]


### Discussion on the Results

The matrix has:

- True Negatives (TN): 25 instances were correctly classified as the first class.
- False Positives (FP): 206 instances were incorrectly classified as the second class.
- False Negatives (FN): 27 instances were incorrectly classified as the first class.
- True Positives (TP): 672 instances were correctly classified as the second class.

Interpretation:
- The accuracy of the model is approximately 74.95%, indicating that it correctly predicted the class labels for about 74.95% of the instances.
- It correctly identified a relatively small number of instances as the first class (25), while a considerable number were misclassified as the second class (206).
- A small number of instances were incorrectly classified as the first class (27), while a substantial number were correctly classified as the second class (672).

Overall, this model demonstrates reasonable capability to differentiate between the two classes, but there's still room for improvement, particularly in reducing false positives.

### Discussion on the Overall Results

When evaluating the performance of a classification model, accuracy is a commonly used metric to measure how well the model predicts the correct class labels. It represents the proportion of correctly predicted instances out of the total instances in the dataset.

In this case, we compared several models based on their test accuracy. The SGD Classifier has the highest test accuracy, 0.7613 (708 of 930 test reviews), narrowly ahead of Logistic Regression and the RBF-kernel SVM at 0.7602. For context, always predicting the majority class would score 699/930 ≈ 0.7516 on this test set, so the best models exceed that baseline by only about one percentage point, and several models fall below it.

Choosing the model with the highest accuracy is generally preferred as it indicates better overall performance in terms of correctly classifying instances. However, it's essential to consider other factors such as the specific objectives of the analysis, the distribution of classes, and potential biases in the data when selecting the best model for a particular task.
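Because the classes are imbalanced, it can be informative to rank the models by balanced accuracy (the mean of the per-class recalls) as well as by plain accuracy. The sketch below recomputes both metrics from the test confusion matrices reported above; the helper names `accuracy` and `balanced_accuracy` are introduced here for illustration.

```python
import numpy as np

# Test confusion matrices reported above (rows: true class, columns: predicted class)
matrices = {
    "Random Forest": [[0, 231], [0, 699]],
    "SGD": [[43, 188], [34, 665]],
    "Logistic Regression": [[16, 215], [8, 691]],
    "KNN": [[25, 206], [31, 668]],
    "SVM (linear)": [[9, 222], [4, 695]],
    "SVM (poly)": [[4, 227], [5, 694]],
    "SVM (RBF)": [[24, 207], [16, 683]],
    "Decision Tree": [[11, 220], [22, 677]],
    "AdaBoost": [[68, 163], [107, 592]],
    "XGBoost": [[25, 206], [27, 672]],
}


def accuracy(cm):
    """Share of all test instances that were classified correctly."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()


def balanced_accuracy(cm):
    """Mean of the per-class recalls, so both classes count equally."""
    cm = np.asarray(cm)
    return (np.diag(cm) / cm.sum(axis=1)).mean()


for name, cm in matrices.items():
    print(f"{name:20s} acc={accuracy(cm):.4f}  balanced acc={balanced_accuracy(cm):.4f}")

best_by_accuracy = max(matrices, key=lambda n: accuracy(matrices[n]))
best_by_balanced = max(matrices, key=lambda n: balanced_accuracy(matrices[n]))
print("best by accuracy:", best_by_accuracy)
print("best by balanced accuracy:", best_by_balanced)
```

On these matrices the two rankings disagree: the model with the highest accuracy is not the one with the highest balanced accuracy, which is one reason to look beyond accuracy when the minority class matters.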

## Conclusion

In conclusion, the analysis of the Drug Review Dataset offers valuable insights into patient experiences with various medications, encompassing benefits, side effects, and overall sentiments. Our objectives focused on sentiment analysis, exploring the transferability of models across medical conditions, and investigating the applicability of models across different data sources.

Based on our evaluation of multiple models, the SGD Classifier performed best, with a test accuracy of approximately 76.13%. This was the highest accuracy among the models compared, although all models struggled to identify negative reviews, so there is clear room for improvement before such a model could support treatment or patient-care decisions.

Through sentiment analysis, we gained insights into patient perspectives on drug experiences, which can inform clinical decision-making and improve patient outcomes. The domain and source transferability objectives were not evaluated in this notebook; testing the models across medical conditions and on reviews from other platforms such as Drugs.com remains future work.

Overall, this analysis contributes to the evolving field of text analytics in healthcare, emphasizing the significance of leveraging patient-reported data to advance medical research and practice. By understanding and incorporating patient perspectives, we can better tailor treatments and interventions, ultimately enhancing the quality of care and patient well-being.