# Sentiment Analysis on Drug Reviews



## Approach

- Explore the dataset
- Clean the reviews
- Once the reviews are cleaned, build a bag-of-words model (tokenization)
- Derive the feature vectors from the bag of words
- Train a classifier (here a random forest, i.e. an ensemble of decision trees) to classify each review as positive, neutral, or negative
- Apply the classifier to the test dataset
- Store the test results in a new CSV file


labels:
- positive "2"
- negative "1"
- neutral "0"
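
The numeric labels can be turned back into readable names when inspecting data or predictions. The following is a minimal sketch; the `SENTIMENT_NAMES` dictionary and the toy label values are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical lookup table built from the label list above.
SENTIMENT_NAMES = {2: "positive", 1: "negative", 0: "neutral"}

# Toy label column for illustration only (not taken from the dataset).
toy_labels = pd.Series([2, 0, 1, 2], name="ratingSentiment")

# Series.map replaces each numeric code with its readable name.
readable = toy_labels.map(SENTIMENT_NAMES)
print(readable.tolist())  # ['positive', 'neutral', 'negative', 'positive']
```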


### Let's start by importing the necessary packages

In [16]:
import os

# to create a bag of words model
from sklearn.feature_extraction.text import CountVectorizer

# RandomForestClassifier will be our model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix


# Helper used to clean the reviews and remove stop words
from KaggleWord2VecUtility import KaggleWord2VecUtility

# To read and load the csv files
import pandas as pd
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

# To save and load our model
import pickle

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [6]:
# Load the dataset
df2 = pd.read_csv("drugsComTrain.csv", delimiter=",")

# Drop the columns that are not required;
# only "review" and "ratingSentiment" are needed
df2 = df2.drop(['Id', 'rating', 'ratingSentimentLabel'], axis=1)

# Let's see our dataset
df2.head()


| Unnamed: 0 | review | ratingSentiment |
| --- | --- | --- |
| 0 | "I've tried a few antidepressants over the yea... | 2 |
| 1 | "My son has Crohn's disease and has done very ... | 2 |
| 2 | "Quick reduction of symptoms" | 2 |
| 3 | "Contrave combines drugs that were used for al... | 2 |
| 4 | "I have been on this birth control for one cyc... | 2 |


### Cleaning the dataset

Once the dataset is loaded, we keep only the columns we need: "review" and "ratingSentiment".

We cannot send the raw reviews directly to our classifier, because the classifier may latch onto very common words ('the', 'and', 'or', 'this', 'that', etc.) that appear in almost every review and carry little sentiment. To avoid this, we remove these stop words from the reviews.
We do this with our helper class "[KaggleWord2VecUtility](https://github.com/wendykan/DeepLearningMovies/blob/master/KaggleWord2VecUtility.py)". Its `review_to_wordlist` method returns the cleaned review as a list of words and drops the stop words when the second argument is `True`.

We store the cleaned reviews and use them to train our model.
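
To make the cleaning step concrete, here is a simplified sketch of stop-word removal. It is not the helper's actual implementation: the `clean_review` function and the tiny hand-picked `STOP_WORDS` set are assumptions for illustration, standing in for the NLTK stop-word list that the notebook downloads.

```python
import re

# Tiny illustrative stop-word set (an assumption); a real list is much longer.
STOP_WORDS = {"the", "and", "or", "this", "that", "a", "is", "it", "i"}

def clean_review(text, remove_stopwords=True):
    """Keep letters only, lower-case the text, and optionally drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in STOP_WORDS]
    return " ".join(words)

print(clean_review("This drug is GREAT, and it worked in 2 days!"))
# drug great worked in days
```

Punctuation and digits are replaced by spaces before splitting, so "2" disappears along with the stop words.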

In [7]:

clean_train_reviews = []
print('Cleaning and parsing the training set reviews')
for review in df2['review']:
    # review_to_wordlist returns a list of words; True removes the stop words
    clean_train_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(review, True)))

Cleaning and parsing the training set reviews






### Create a bag of words

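A bag of words is essentially a large matrix with one row per review and one column per vocabulary word. Each cell holds the number of times that word occurs in that review, and these counts are the features used to train the classifier. With `max_features = 5000`, only the 5000 most frequent words are kept.

These features are stored in `train_data_features`.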

In [8]:
#Create the bag of words
print("Creating a bag of words \n")
vectorizer = CountVectorizer(analyzer="word", tokenizer = None,preprocessor = None,stop_words = None, max_features = 5000)
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = train_data_features.toarray()

Creating a bag of words 
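
To see what the count matrix looks like, here is a small sketch on two made-up reviews. The toy sentences and the variable names `toy_reviews`, `toy_vectorizer`, and `toy_counts` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up, already-cleaned reviews.
toy_reviews = ["good drug good", "bad drug"]

toy_vectorizer = CountVectorizer(analyzer="word")
toy_counts = toy_vectorizer.fit_transform(toy_reviews).toarray()

# vocabulary_ maps each word to its column index (assigned alphabetically).
columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(columns)              # ['bad', 'drug', 'good']
print(toy_counts.tolist())  # [[0, 1, 2], [1, 1, 0]]
```

Each row is one review and each column one word, so the first review has "good" twice and "drug" once.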



### Training Random Forest Classifier

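A random forest classifier is an ensemble of decision trees. Each tree is trained on a bootstrap sample of the training data and considers a random subset of the features at each split. To predict the sentiment (positive, negative, neutral) of a review, the forest combines the predictions of all its trees (here, 100 trees).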

In [9]:
print('Training the random forest')
forest = RandomForestClassifier(n_estimators = 100)
forest.fit(train_data_features,df2['ratingSentiment'])


Training the random forest


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
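
As an illustration of how a forest combines its trees, the sketch below averages the class probabilities of the individual trees and compares the result with the forest's own output. The toy feature matrix, labels, and the small `n_estimators=10` are assumptions chosen to keep the example fast; they are unrelated to the drug review data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy count features (rows = reviews, columns = words) and toy labels.
X = np.array([[2, 0, 0], [1, 0, 1], [0, 2, 0], [0, 1, 1], [0, 0, 2], [1, 1, 2]])
y = np.array([2, 2, 1, 1, 0, 0])

toy_forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Each fitted decision tree is available in estimators_.
per_tree = np.stack([tree.predict_proba(X) for tree in toy_forest.estimators_])
averaged = per_tree.mean(axis=0)

# The forest's probabilities are the mean of the trees' probabilities,
# and the predicted class is the one with the highest averaged probability.
predicted = toy_forest.classes_[averaged.argmax(axis=1)]
print(np.allclose(averaged, toy_forest.predict_proba(X)))
print((predicted == toy_forest.predict(X)).all())
```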

In [17]:
print("Model metrics - ")
print('Trained model',forest)
print("Train Accuracy :: ",accuracy_score(df2['ratingSentiment'], forest.predict(train_data_features)))

Model metrics - 
Trained model RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
Train Accuracy ::  0.9995164207697279
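
The accuracy above is measured on the same data the forest was trained on, so it tends to be optimistic. One common way to get a more realistic estimate is to hold out part of the labelled data. The sketch below shows the idea on synthetic reviews; the texts, the 75/25 split, and all variable names are assumptions for illustration and not results from this dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic reviews for illustration only (not from the dataset).
texts = (["great relief works well"] * 20
         + ["terrible nausea bad headache"] * 20
         + ["okay average no change"] * 20)
labels = [2] * 20 + [1] * 20 + [0] * 20

# Keep 25% of the labelled data aside, preserving the class balance.
X_train_txt, X_val_txt, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# Fit the vocabulary on the training portion only, then reuse it.
vec = CountVectorizer(max_features=5000)
X_train = vec.fit_transform(X_train_txt)
X_val = vec.transform(X_val_txt)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

val_pred = clf.predict(X_val)
val_acc = accuracy_score(y_val, val_pred)
# Rows are true labels, columns are predicted labels, in the order 0, 1, 2.
cm = confusion_matrix(y_val, val_pred, labels=[0, 1, 2])
print("Validation accuracy ::", val_acc)
print(cm)
```

The confusion matrix shows which sentiments get mixed up, which a single accuracy number hides.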


### Saving the trained model

Once the training is done, let's save the model

In [0]:
# Save the model to disk; the with-block closes the file afterwards
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(forest, f)

### Loading the model from disk

To load the trained model saved above

In [0]:
# Load the model from disk; the with-block closes the file afterwards
with open(filename, 'rb') as f:
    model = pickle.load(f)
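
A reloaded forest can only score new text if the same fitted vectorizer is available, because the columns of the feature matrix must match those seen during training. One option is to pickle both objects together. This is a sketch under assumptions: the toy data, the dictionary keys, and the file name `sentiment_bundle.pkl` are hypothetical and not part of the original notebook.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy training data for illustration only.
toy_texts = ["great relief", "bad nausea", "great results", "bad headache"]
toy_labels = [2, 1, 2, 1]

toy_vectorizer = CountVectorizer()
toy_model = RandomForestClassifier(n_estimators=10, random_state=0)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# Hypothetical file name; bundling both objects keeps them in sync.
bundle_path = os.path.join(tempfile.mkdtemp(), "sentiment_bundle.pkl")
with open(bundle_path, "wb") as f:
    pickle.dump({"vectorizer": toy_vectorizer, "model": toy_model}, f)

with open(bundle_path, "rb") as f:
    bundle = pickle.load(f)

# The restored pair can featurize and classify unseen text on its own.
new_features = bundle["vectorizer"].transform(["great relief"])
restored_pred = bundle["model"].predict(new_features)
print(restored_pred)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.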

### Testing the classifier

We repeat the steps we applied to the raw training data, i.e., clean the reviews and drop unwanted columns.

Since we are only predicting here, we don't need any column other than "review".

In [19]:
#Format testing data
clean_test_reviews=[]
dftest = pd.read_csv('drugsComTest.csv')
print('Cleaning and parsing the test data reviews')
dftest = dftest.drop(['Id'],axis = 1)

Cleaning and parsing the test data reviews


Let's see if we have only the reviews

In [20]:
dftest.head()

| Unnamed: 0 | review |
| --- | --- |
| 0 | "I've tried a few antidepressants over the yea... |
| 1 | "My son has Crohn's disease and has done very ... |
| 2 | "Quick reduction of symptoms" |
| 3 | "Contrave combines drugs that were used for al... |
| 4 | "I have been on this birth control for one cyc... |


### Extract the data features of test set

In [21]:
clean_test_reviews = []
for review in dftest['review']:
    clean_test_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(review, True)))

# Reuse the vocabulary fitted on the training set (transform, not fit_transform)
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()





### Predicting the sentiment

We can now use our classifier to predict the sentiment of our test dataset by running the classifier on the test data features.

In [22]:
print('Predicting the sentiment of the test reviews')
result = forest.predict(test_data_features)
output = pd.DataFrame( data={'review':dftest["review"],'Predictedsentiment':result} )
output.to_csv("TestBagOfWordsModel.csv")
print("Wrote results to file")

Predicting the sentiment of the test reviews
Wrote results to file
