<a href="https://colab.research.google.com/github/Ritu-95/Machine-Learning-An-Overview/blob/main/Drug_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Drug Sentiment Analysis

Drug sentiment analysis is the process of analyzing opinions, emotions, and attitudes expressed in text data about drugs or medications. This analysis can provide valuable insights into how people perceive and use drugs, and it can be useful for drug companies, healthcare providers, and policymakers.

### Problem Statement

The goal is to classify each user review as expressing positive or negative sentiment about the drug.

### Data Description

Data Source: https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29

- drugName (categorical): name of drug
- condition (categorical): name of condition
- review (text): patient review
- rating (numerical): 10-star patient rating
- date (date): date of review entry
- usefulCount (numerical): number of users who found the review useful

Each record represents a patient, identified by a unique ID, who bought a drug for a given condition and then wrote a review and gave a rating on the stated date. When other users read the review and find it helpful, they click a button that increments usefulCount by 1.

In [19]:
#import all the necessary packages

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
from matplotlib import style
style.use('ggplot')

In [30]:
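#read the train and test data
import csv
# Note: QUOTE_NONE treats quote characters as plain text, so a review that
# contains a line break can end up split across several rows
train = pd.read_csv('drugsComTrain_raw.tsv', sep='\t', quoting=csv.QUOTE_NONE, encoding='utf-8') #train data
test = pd.read_csv('drugsComTest_raw.tsv', sep='\t', quoting=csv.QUOTE_NONE, encoding='utf-8') #test data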

In [31]:
#rename the first column

train.rename(columns={'Unnamed: 0': 'uniqueID'}, inplace=True)
test.rename(columns={'Unnamed: 0': 'uniqueID'}, inplace=True)

In [32]:
#check the head of train data
train.head()

Unnamed: 0,uniqueID,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""""""It has no side effect, I take it in combina...",9.0,"May 20, 2012",27.0
1,95260,Guanfacine,ADHD,"""""""My son is halfway through his fourth week o...",,,
2,We have tried many different medications and s...,8.0,"April 27, 2010",192,,,
3,92703,Lybrel,Birth Control,"""""""I used to take another oral contraceptive, ...",,,
4,The positive side is that I didn&#039;t have a...,5.0,"December 14, 2009",17,,,


In [33]:
#check the head of test data
test.head()

Unnamed: 0,uniqueID,drugName,condition,review,rating,date,usefulCount
0,163740,Mirtazapine,Depression,"""""""I&#039;ve tried a few antidepressants over ...",10.0,"February 28, 2012",22.0
1,206473,Mesalamine,"Crohn's Disease, Maintenance","""""""My son has Crohn&#039;s disease and has don...",8.0,"May 17, 2009",17.0
2,159672,Bactrim,Urinary Tract Infection,"""""""Quick reduction of symptoms""""""",9.0,"September 29, 2017",3.0
3,39293,Contrave,Weight Loss,"""""""Contrave combines drugs that were used for ...",9.0,"March 5, 2017",35.0
4,97768,Cyclafem 1 / 35,Birth Control,"""""""I have been on this birth control for one c...",9.0,"October 22, 2015",4.0


The head of the train and test data shows 7 features, but none of them is a sentiment label that could serve as the target variable. We will derive the target from rating: a rating greater than 5 is labelled positive, and any other rating is labelled negative.
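The labelling rule can be illustrated on a handful of made-up ratings. This is only a minimal sketch: the helper name `rating_to_sentiment` and the sample values are assumptions for illustration and do not come from the dataset.

```python
from collections import Counter

def rating_to_sentiment(rating, threshold=5):
    """Return 1 (positive) if rating is strictly above threshold, else 0 (negative)."""
    return 1 if rating > threshold else 0

# Made-up ratings on the 1-10 scale, including the boundary value 5
sample_ratings = [1, 4, 5, 6, 9, 10]
labels = [rating_to_sentiment(r) for r in sample_ratings]

print(labels)           # a rating of exactly 5 falls in the negative class
print(Counter(labels))  # class counts, useful for spotting imbalance
```

Note that a rating of exactly 5 lands in the negative class under this rule, which is worth keeping in mind when reading the class balance.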

In [34]:
#check the shape of the given dataset
print(f'train has {train.shape[0]} rows and {train.shape[1]} columns')
print(f'test has {test.shape[0]} rows and {test.shape[1]} columns')

train has 83500 rows and 7 columns
test has 65987 rows and 7 columns


In [35]:
#check the columns in train
train.columns

Index(['uniqueID', 'drugName', 'condition', 'review', 'rating', 'date',
       'usefulCount'],
      dtype='object')

### Exploratory Data Analysis

Because neither set has target labels, we merge the train and test data and perform EDA and pre-processing on the merged data. We will then split the data into training and test sets in a roughly 67 : 33 ratio (test_size=0.33).

In [36]:
#merge train and test data

merge = [train,test]
merged_data = pd.concat(merge,ignore_index=True)

merged_data.shape   #check the shape of merged_data

(149487, 7)

##### Check number of uniqueIds to see if there's any duplicate record in our dataset

In [48]:
#check uniqueID
print(merged_data['uniqueID'].nunique()) #print the count so it is not discarded by the next statement
merged_data['rating']=merged_data['rating'].astype(float) #make sure rating is numeric

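The original analysis reported 215063 unique IDs, one per record. That count does not match the 149487 rows in merged_data shown above, so compare nunique() with the row count of the current run before concluding that every record is unique.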
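A uniqueness check is easier to trust when the number of unique IDs is compared directly with the number of rows. The following sketch uses a toy frame with a deliberately repeated ID; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy frame with a hypothetical repeated ID (2) to show what a duplicate looks like
toy = pd.DataFrame({
    "uniqueID": [1, 2, 2, 3],
    "rating": [9.0, 8.0, 8.0, 5.0],
})

n_rows = len(toy)
n_unique = toy["uniqueID"].nunique()
all_unique = n_unique == n_rows

# duplicated(keep=False) marks every row whose ID occurs more than once
repeated = toy[toy["uniqueID"].duplicated(keep=False)]

print(n_rows, n_unique, all_unique)
print(repeated)
```

The same comparison could be run on `merged_data` to confirm whether every record is unique.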

##### Check information of the merged data

In [49]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107696 entries, 0 to 149486
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   uniqueID     107696 non-null  object 
 1   drugName     107696 non-null  object 
 2   condition    107696 non-null  object 
 3   review       107696 non-null  object 
 4   rating       107696 non-null  float64
 5   date         107696 non-null  object 
 6   usefulCount  107696 non-null  float64
dtypes: float64(2), object(5)
memory usage: 6.6+ MB


### Data Pre-Processing

In [50]:
# check the null values
merged_data.isnull().sum()

uniqueID       0
drugName       0
condition      0
review         0
rating         0
date           0
usefulCount    0
dtype: int64

Null values occur only in condition and account for about 0.5 % of the total data, so we drop those records. The output above shows no nulls, which suggests it was captured after the rows had already been dropped.
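One way to quantify how much data a drop would remove is to compute the share of nulls per column first. This is a sketch on a hypothetical four-row frame, not on the real dataset.

```python
import pandas as pd

# Hypothetical frame with one missing condition
toy = pd.DataFrame({
    "drugName": ["A", "B", "C", "D"],
    "condition": ["x", None, "y", "z"],
})

# isnull() gives booleans; their mean is the fraction of nulls per column
null_share = toy.isnull().mean() * 100
print(null_share)

# Drop every row that contains at least one null value
cleaned = toy.dropna(axis=0)
print(len(toy), len(cleaned))
```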

In [51]:
# drop the null values
merged_data.dropna(inplace=True, axis=0)

#### Pre-Processing Reviews

Check the first few reviews

In [52]:
#check first three reviews
for i in merged_data['review'][0:3]:
    print(i,'\n')

side effect take combin bystol mg fish oil 

first time use form birth control glad went patch month first decrea libido subsid downsid made period longer day exact use period day max also made cramp inten first two day period never cramp use birth control happi patch 

suboxon complet turn life around feel healthier excel job alway money pocket save account none suboxon spent year abus oxycontin paycheck alreadi spent time got start resort scheme steal fund addict histori readi stop good chanc suboxon put path great life found side effect minim compar oxycontin actual sleep better slight constip truli amaz cost pale comparison spent oxycontin 



#### Steps for reviews pre-processing

- <b>Remove HTML tags</b>
    - Use BeautifulSoup from the bs4 module. Tags matching the pattern "64</span>..." have already been removed; get_text() strips any that remain.
- <b>Remove Stop Words</b>
    - Remove stopwords such as "a", "the", and "I".
- <b>Remove symbols and special characters</b>
    - Remove special characters such as '#', '&', and '@' from the reviews.
- <b>Tokenize</b>
    - Split each review into words on whitespace, e.g. "I might come" --> "I", "might", "come".
- <b>Stemming</b>
    - Reduce each word to its root form by removing suffixes, e.g. "wording" --> "word".
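To make the effect of each step visible, the sketch below walks one made-up review through the pipeline using only the standard library. It is not the NLTK-based implementation: `TOY_STOPWORDS` is a tiny hand-picked list and `crude_stem` is a rough suffix stripper standing in for the Snowball stemmer, both assumptions for illustration.

```python
import html
import re

# Assumption: a tiny hand-picked stopword list, far smaller than NLTK's English list
TOY_STOPWORDS = {"i", "a", "the", "it", "t", "didn", "have", "that", "is", "was"}

def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in, not the Snowball algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

raw = "I didn&#039;t have <b>any</b> headaches, it worked!"

no_entities = html.unescape(raw)                    # decode entities such as &#039;
no_tags = re.sub(r"<[^>]+>", " ", no_entities)      # drop HTML tags
letters_only = re.sub(r"[^a-zA-Z]", " ", no_tags)   # symbols and digits become spaces
tokens = letters_only.lower().split()               # lowercase, then split on whitespace
kept = [w for w in tokens if w not in TOY_STOPWORDS]
stems = [crude_stem(w) for w in kept]

print(tokens)
print(stems)
```

Printing the intermediate lists shows how much of the original wording survives, which helps when judging whether the stopword list or the stemmer is too aggressive for review text.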

In [53]:
#import the libraries for pre-processing
from bs4 import BeautifulSoup
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# Download the stop words corpus
nltk.download('stopwords')

stops = set(stopwords.words('english')) #english stopwords

stemmer = SnowballStemmer('english') #SnowballStemmer

def review_to_words(raw_review):
    # 1. Strip HTML tags
    review_text = BeautifulSoup(raw_review, 'html.parser').get_text()
    # 2. Replace every non-letter character with a space
    letters_only = re.sub('[^a-zA-Z]', ' ', review_text)
    # 3. Lowercase and split into words
    words = letters_only.lower().split()
    # 4. Remove stopwords
    meaningful_words = [w for w in words if w not in stops]
    # 5. Stem each remaining word
    stemming_words = [stemmer.stem(w) for w in meaningful_words]
    # 6. Join the stems back into a single space-separated string
    return ' '.join(stemming_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [54]:
#apply review_to_words function on reviews
merged_data['review'] = merged_data['review'].apply(review_to_words)

#### Create the target variable "sentiment" from rating

In [55]:
#create sentiment feature from ratings
#if rating > 5 sentiment = 1 (positive)
#if rating <= 5 sentiment = 0 (negative)
merged_data['sentiment'] = merged_data["rating"].apply(lambda x: 1 if x > 5 else 0)

We will predict the sentiment using the reviews only. So let's start building our model.

### Building Model

In [56]:
#import all the necessary packages

from sklearn.model_selection import train_test_split #import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer #import TfidfVectorizer 
from sklearn.metrics import confusion_matrix #import confusion_matrix
from sklearn.naive_bayes import MultinomialNB #import MultinomialNB
from sklearn.ensemble import RandomForestClassifier  #import RandomForestClassifier

Models cannot consume raw text directly, so the reviews must first be converted into numeric features. We will use TfidfVectorizer for this.
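For intuition about what the numeric features represent, the sketch below computes TF-IDF weights by hand on a three-document toy corpus. It assumes the smoothed IDF formula that scikit-learn documents as the TfidfVectorizer default, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. The corpus and variable names are made up for illustration.

```python
import math
from collections import Counter

# Made-up, already pre-processed documents
docs = [
    "pain gone",
    "pain worse pain",
    "no side effect",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(w for doc in tokenized for w in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(doc_tokens):
    """Return the L2-normalised TF-IDF vector of one document over vocab."""
    counts = Counter(doc_tokens)
    weights = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in weights))
    return [v / norm for v in weights]

matrix = [tfidf_row(doc) for doc in tokenized]

# Print one line per vocabulary term with its weight in each document
for w, col in zip(vocab, zip(*matrix)):
    print(f"{w:>7}", [round(v, 3) for v in col])
```

Terms that occur in many documents, such as "pain" here, receive a lower IDF than terms confined to a single document, so they contribute less to the final vector.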

In [57]:
# Creates TF-IDF vectorizer and transforms the corpus
vectorizer = TfidfVectorizer()
reviews_corpus = vectorizer.fit_transform(merged_data.review)
reviews_corpus.shape

(107696, 26211)

reviews_corpus now holds the independent features for our model.

#### Store Dependent feature in sentiment and split the Data into train and test

In [58]:
#dependent feature
sentiment = merged_data['sentiment']
sentiment.shape

(107696,)

In [59]:
#split the data in train and test
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(reviews_corpus,sentiment,test_size=0.33,random_state=42)
print('Train data shape ',X_train.shape,Y_train.shape)
print('Test data shape ',X_test.shape,Y_test.shape)

Train data shape  (72156, 26211) (72156,)
Test data shape  (35540, 26211) (35540,)


#### Apply Multinomial Naive Bayes

In [60]:
#fit the model and predict the output

clf = MultinomialNB().fit(X_train, Y_train) #fit the training data

pred = clf.predict(X_test) #predict the sentiment for test data

print("Accuracy: %s" % str(clf.score(X_test, Y_test))) #check accuracy
print("Confusion Matrix") 
print(confusion_matrix(pred, Y_test)) #rows = predicted, columns = actual, because pred is passed first

Accuracy: 0.7424310635903207
Confusion Matrix
[[ 1793   234]
 [ 8920 24593]]


Naive Bayes achieves an accuracy of 74.2%.
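Accuracy alone hides how the errors are distributed, so it can help to read the confusion matrix directly. The sketch below reuses the numbers printed above and assumes that rows correspond to predictions and columns to true labels, which follows from passing `pred` as the first argument to `confusion_matrix`.

```python
# Confusion matrix as printed above: rows are predicted labels, columns are true labels
cm = [[1793, 234],
      [8920, 24593]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

actual_negative = cm[0][0] + cm[1][0]   # column 0: reviews whose true label is 0
actual_positive = cm[0][1] + cm[1][1]   # column 1: reviews whose true label is 1
recall_negative = cm[0][0] / actual_negative
recall_positive = cm[1][1] / actual_positive

print(f"accuracy        = {accuracy:.4f}")
print(f"negative recall = {recall_negative:.4f}")
print(f"positive recall = {recall_positive:.4f}")
```

Under that reading, the model recovers almost all positive reviews but only a small fraction of the negative ones, so most of its accuracy comes from the majority class.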

#### Apply RandomForest

In [61]:
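#fit the model and predict the output

clf = RandomForestClassifier().fit(X_train, Y_train) #fit the training data

pred = clf.predict(X_test) #predict the sentiment for test data

print("Accuracy: %s" % str(clf.score(X_test, Y_test))) #check accuracy
print("Confusion Matrix")
print(confusion_matrix(pred, Y_test)) #rows = predicted, columns = actual, because pred is passed first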

Accuracy: 0.841024198086663
Confusion Matrix
[[ 5461   398]
 [ 5252 24429]]


### Conclusion

After using TfidfVectorizer to transform the reviews into vectors and applying MultinomialNB and RandomForestClassifier, we see that RandomForestClassifier outperforms MultinomialNB. It achieves an accuracy of 84.1 % without any parameter tuning, and tuning the classifier's parameters may improve this further.
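One possible way to approach the parameter tuning mentioned above is a cross-validated grid search. The sketch below runs on synthetic data from `make_classification` rather than the review features, and the grid values are arbitrary assumptions chosen to keep the run short, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF features and sentiment labels
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Small, arbitrary grid; a real search would likely cover more values
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data, the search would be fitted on `X_train` and `Y_train` only, keeping the test split untouched for the final evaluation.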