## INTRODUCTION
The freestyle format of hackathons has time and again stimulated groundbreaking and innovative data insights and technologies. This project is based on the dataset published for the Kaggle University Club Hackathon: the UCI Machine Learning Drug Review Dataset.

### Prompt

Machine learning has permeated nearly all fields and disciplines of study. One hot topic is using natural language processing and sentiment analysis to identify, extract, and make use of subjective information. The UCI ML Drug Review dataset provides patient reviews on specific drugs along with related conditions and a 10-star patient rating system reflecting overall patient satisfaction. The data was obtained by crawling online pharmaceutical review sites. This data was published in a study on sentiment analysis of drug experience over multiple facets, e.g., sentiments learned on specific aspects such as effectiveness and side effects (see the acknowledgments section to learn more).

### Find the datasets in the GitHub directory or follow the link below
https://www.kaggle.com/datasets/jessicali9530/kuc-hackathon-winter-2018?datasetId=76158&sortBy=voteCount

### About the datasets
The Drug Review dataset has shape (161297, 7), i.e., it has 161297 data points (entries) and 7 features, including the review.

The features are 'uniqueID', a unique identifier for each entry; 'drugName', the name of the drug; 'condition', the condition the patient is suffering from; 'review', the patient's review; 'rating', the 10-star patient rating for the drug; 'date', the date of the entry; and 'usefulCount', the number of users who found the review useful.

### About the project
With this dataset, I have tried to answer the following questions:
* What insights can we gain from exploring and visualizing our data?

* How does sentiment play into rating and usefulness of reviews?

* Can we create a way for people to find the best medication for their illness?

* What machine learning models work best for predicting the sentiment or rating based on review?

### This notebook contains unsupervised sentiment analysis using the AFINN and VADER lexicons

In [1]:
# Import Libs.
import pandas as pd
import numpy as np
import text_normalizer as tn
import nltk

import spacy
from nltk.tokenize.toktok import ToktokTokenizer
import re
from bs4 import BeautifulSoup
from contractions import contractions_dict
import unicodedata

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

from sklearn import metrics
from afinn import Afinn
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

%matplotlib inline

from IPython.display import HTML

import warnings
warnings.simplefilter(action = 'ignore', category = Warning)

In [2]:
# To center our generated images
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

In [3]:
nlp = spacy.load('en_core_web_sm')
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')

In [4]:
# Import Datasets
df_train = pd.read_csv(r"D:\DataSc\Datasets\drugsComTrain_raw.csv")

In [5]:
df_train.head()

Unnamed: 0,uniqueID,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9,20-May-12,27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8,27-Apr-10,192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5,14-Dec-09,17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8,3-Nov-15,10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9,27-Nov-16,37


In [6]:
print('Train dataset shape:     ', df_train.shape)
print('Train dataset features:  ', list(df_train))

Train dataset shape:      (161297, 7)
Train dataset features:   ['uniqueID', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount']


In [7]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161297 entries, 0 to 161296
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   uniqueID     161297 non-null  int64 
 1   drugName     161297 non-null  object
 2   condition    160398 non-null  object
 3   review       161297 non-null  object
 4   rating       161297 non-null  int64 
 5   date         161297 non-null  object
 6   usefulCount  161297 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 8.6+ MB


## Data Preprocessing

In [8]:
# Convert date object into datetime
df_train['date'] = pd.to_datetime(df_train['date'], errors = 'coerce')

In [9]:
df_train.head()

Unnamed: 0,uniqueID,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9,2012-05-20,27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8,2010-04-27,192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5,2009-12-14,17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8,2015-11-03,10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9,2016-11-27,37


### Dropping Null values

In [10]:
print(len(df_train))
df_train.isna().sum()

161297


uniqueID         0
drugName         0
condition      899
review           0
rating           0
date             0
usefulCount      0
dtype: int64

In [11]:
# Drop null values
df_train = df_train[~(df_train.condition.isnull())]
print(len(df_train))
df_train.isna().sum()

160398


uniqueID       0
drugName       0
condition      0
review         0
rating         0
date           0
usefulCount    0
dtype: int64

Several entries have a condition of the form "</span....helpful". This is probably an error introduced by the data mining or crawling process during collection. I will remove the entries with such conditions.
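
The removal can also be done without positional bookkeeping. The following is a minimal sketch on a toy frame, not the notebook's own code: the toy values (including the `</span>` condition string) are made up for illustration, and `toy`, `has_span`, and `clean` are hypothetical names. It relies on pandas' `Series.str.contains` with `regex=False` for a literal substring match and `na=False` so that missing conditions count as non-matching.

```python
import pandas as pd

# Toy frame standing in for df_train; all values are invented for illustration.
toy = pd.DataFrame({
    "drugName": ["A", "B", "C", "D"],
    "condition": [
        "ADHD",
        "4</span> users found this comment helpful.",
        None,
        "Birth Control",
    ],
})

# Literal substring match; rows with a missing condition are treated as False.
has_span = toy["condition"].str.contains("</span>", regex=False, na=False)

# Keep rows that have a condition and do not contain the stray tag,
# then renumber the index without keeping the old one as a column.
clean = toy[toy["condition"].notna() & ~has_span].reset_index(drop=True)
print(clean)
```

Using `reset_index(drop=True)` avoids the extra `index` column that a plain `reset_index()` adds and that then has to be dropped again.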

### Dropping conditions with span

In [12]:
df_train = df_train.reset_index()

In [13]:
span_list = []

for i, j in enumerate(df_train['condition']):
    if '</span>' in j:
        span_list.append(i)

print(len(span_list))

900


In [14]:
df_train = df_train.drop(df_train.index[span_list])
df_train = df_train.drop(['index'], axis = 1)
df_train = df_train.reset_index()

In [15]:
df_train.head()

Unnamed: 0,index,uniqueID,drugName,condition,review,rating,date,usefulCount
0,0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9,2012-05-20,27
1,1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8,2010-04-27,192
2,2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5,2009-12-14,17
3,3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8,2015-11-03,10
4,4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9,2016-11-27,37


In [16]:
df_train.tail()

Unnamed: 0,index,uniqueID,drugName,condition,review,rating,date,usefulCount
159493,160393,191035,Campral,Alcohol Dependence,"""I wrote my first report in Mid-October of 201...",10,2015-05-31,125
159494,160394,127085,Metoclopramide,Nausea/Vomiting,"""I was given this in IV before surgey. I immed...",1,2011-11-01,34
159495,160395,187382,Orencia,Rheumatoid Arthritis,"""Limited improvement after 4 months, developed...",2,2014-03-15,35
159496,160396,47128,Thyroid desiccated,Underactive Thyroid,"""I&#039;ve been on thyroid medication 49 years...",10,2015-09-19,79
159497,160397,215220,Lubiprostone,"Constipation, Chronic","""I&#039;ve had chronic constipation all my adu...",9,2014-12-13,116


In [17]:
df_train.shape

(159498, 8)

In [18]:
df_train = df_train.drop(['index'], axis = 1)
df_train.head()

Unnamed: 0,uniqueID,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9,2012-05-20,27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8,2010-04-27,192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5,2009-12-14,17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8,2015-11-03,10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9,2016-11-27,37


In [19]:
print("Length of the new DataFrame:     ", len(df_train), "\n")
df_train.info()

Length of the new DataFrame:      159498 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159498 entries, 0 to 159497
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   uniqueID     159498 non-null  int64         
 1   drugName     159498 non-null  object        
 2   condition    159498 non-null  object        
 3   review       159498 non-null  object        
 4   rating       159498 non-null  int64         
 5   date         159498 non-null  datetime64[ns]
 6   usefulCount  159498 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(3)
memory usage: 8.5+ MB


Checking again for any missing/null values!

In [20]:
df_train.isnull().sum()

uniqueID       0
drugName       0
condition      0
review         0
rating         0
date           0
usefulCount    0
dtype: int64

In [21]:
df_train.isna().sum()

uniqueID       0
drugName       0
condition      0
review         0
rating         0
date           0
usefulCount    0
dtype: int64

Now, let's create a new DataFrame containing only the columns ['uniqueID', 'condition', 'review', 'rating'].

In [22]:
df_train_new = df_train[['uniqueID', 'condition', 'review', 'rating']].copy()
df_train.isnull().sum().head()

uniqueID     0
drugName     0
condition    0
review       0
rating       0
dtype: int64

In [23]:
print(len(df_train_new))
df_train_new.tail(10)

159498


Unnamed: 0,uniqueID,condition,review,rating
159488,132177,Anxiety,"""I was super against taking medication. I&#039...",9
159489,45410,Obsessive Compulsive Disorde,"""I have been off Prozac for about 4 weeks now....",8
159490,105263,Trigeminal Neuralgia,"""Up to 800mg seems to work about once every 2n...",1
159491,103458,High Blood Pressure,"""I have only been on Tekturna for 9 days. The ...",7
159492,164345,Birth Control,"""This would be my second month on Junel. I&#03...",6
159493,191035,Alcohol Dependence,"""I wrote my first report in Mid-October of 201...",10
159494,127085,Nausea/Vomiting,"""I was given this in IV before surgey. I immed...",1
159495,187382,Rheumatoid Arthritis,"""Limited improvement after 4 months, developed...",2
159496,47128,Underactive Thyroid,"""I&#039;ve been on thyroid medication 49 years...",10
159497,215220,"Constipation, Chronic","""I&#039;ve had chronic constipation all my adu...",9


## Sentiment Analysis

The following code is taken from the book "Practical Machine Learning with Python", which follows a structured and comprehensive three-tiered approach packed with concepts, methodologies, hands-on examples, and code. With over 500 pages, the book helps its readers master the essential skills needed to recognize and solve complex problems with Machine Learning and Deep Learning by following a data-driven mindset.

source: *https://github.com/dipanjanS/practical-machine-learning-with-python.git*


### Creating functions for model evaluation and corpus normalization

In [24]:
def get_metrics(true_labels, predicted_labels):
    print('Accuracy:', np.round(metrics.accuracy_score(true_labels, predicted_labels), 4))
    print('Precision:', np.round(metrics.precision_score(true_labels, predicted_labels, average = 'weighted'), 4))
    print('Recall:', np.round(metrics.recall_score(true_labels, predicted_labels, average = 'weighted'), 4))
    print('F1 Score:', np.round(metrics.f1_score(true_labels, predicted_labels, average = 'weighted'), 4))
                        
def train_predict_model(classifier, train_features, train_labels, test_features, test_labels):
    # build a model    
    classifier.fit(train_features, train_labels)
    # predict using a model
    predictions = classifier.predict(test_features) 
    return predictions    

def display_confusion_matrix(true_labels, predicted_labels, classes = [1, 0]):
    total_classes = len(classes)
    level_labels = [total_classes*[0], list(range(total_classes))]

    cm = metrics.confusion_matrix(y_true=true_labels, y_pred=predicted_labels, labels=classes)
    cm_frame = pd.DataFrame(data = cm, 
                            columns = pd.MultiIndex(levels = [['Predicted:'], classes], codes = level_labels), 
                            index = pd.MultiIndex(levels = [['Actual:'], classes], codes = level_labels)) 
    print(cm_frame) 
    
def display_classification_report(true_labels, predicted_labels, classes = [1, 0]):
    report = metrics.classification_report(y_true = true_labels, y_pred = predicted_labels, labels = classes) 
    print(report)
    
    
def display_model_performance_metrics(true_labels, predicted_labels, classes=[1, 0]):
    print('Model Performance metrics:')
    print('-'*30)
    
    get_metrics(true_labels = true_labels, predicted_labels = predicted_labels)
    print('\nModel Classification report:')
    print('-'*30)
    
    display_classification_report(true_labels = true_labels, predicted_labels = predicted_labels, classes = classes)
    print('\nPrediction Confusion Matrix:')
    print('-'*30)
    
    display_confusion_matrix(true_labels = true_labels, predicted_labels = predicted_labels, classes = classes)

In [25]:
CONTRACTION_MAP = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"I'd": "I would",
"I'd've": "I would have",
"I'll": "I will",
"I'll've": "I will have",
"I'm": "I am",
"I've": "I have",
"I&#039;ve": "I have",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}

In [26]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match) \
                                   if contraction_mapping.get(match) \
                                    else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

def remove_special_chars(text):
    # 'A-Z' (not 'A-z') so that [, \, ], ^, _ and ` are also removed
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
        
    filtered_text = " ".join(filtered_tokens)
    return filtered_text

In [27]:
def normalize_corpus(corpus, html_stripping=True, contaction_expansion=True, 
                     accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True):
    normalized_corpus = []
    #normalize each doc. in corpus
    for doc in corpus:
        # strip html
        if html_stripping:
            doc = strip_html_tags(doc)
        # remove accented chars
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # lower case
        if text_lower_case:
            doc = doc.lower()
        # expand contractions
        if contaction_expansion:
            doc = expand_contractions(doc)
        # remove extra newlines
        doc = re.sub(r'[\r\n]+', " ", doc)
        # insert space btw special chars
        special_char_pattern = re.compile(r'([{.(-)!}])')
        doc = special_char_pattern.sub(" \\1 ", doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special chars
        if special_char_removal:
            doc = remove_special_chars(doc)
        # remove extra white-space
        doc = re.sub(' +', ' ', doc)
        #remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc)
        
        normalized_corpus.append(doc) #append the doc into the list
    return normalized_corpus

### Testing our normalize function

In [28]:
sample_review_ids = [100, 2000, 10000]

# Print each sample review followed by its normalized form
for review_id in sample_review_ids:
    review = df_train_new.review.iloc[review_id]
    print(review)
    print(normalize_corpus([review]), "\n")

"I&#039;ve been on Latuda for a little under 2 and a half years. It almost completely stopped my psychotic symptoms except I still hear voices now and then, mainly when I try to go to sleep, but there are no delusions or paranoia while on the drug. I take cogentin in combination with it because it causes me to shake a lot. Main side effects I experience include anhedonia, shakiness, jaw clenching, and inability to sit still. However, I&#039;m happy with it because it actually works, while other antipsychotic meds I tried did not, and it doesn&#039;t cause the endless hunger that I experienced with drugs like Saphris, Haldol, Zyprexa, and Risperdal. It should be noted I am at the max daily dose, which is 160mg."
['latuda little 2 half year almost completely stop psychotic symptom except still hear voice mainly try go sleep delusion paranoia drug take cogentin combination cause shake lot main side effect experience include anhedonia shakiness jaw clenching inability sit still however hap

In [29]:
reviews = np.array(df_train_new.review)
ratings = np.array(df_train_new.rating)
print(reviews.shape[0])
print(ratings.shape[0])

159498
159498


## Normalization

In [30]:
%time norm_reviews = normalize_corpus(reviews)

CPU times: total: 31min 5s
Wall time: 36min 41s


In [31]:
df_norm = pd.DataFrame(list(zip(norm_reviews, ratings)),
                       columns = ['reviews', 'ratings'])

In [32]:
df_norm.head()

Unnamed: 0,reviews,ratings
0,side effect take combination bystolic 5 mg fis...,9
1,son halfway fourth week intuniv become concern...,8
2,use take another oral contraceptive 21 pill cy...,5
3,first time use form birth control glad go patc...,8
4,suboxone completely turn life around feel heal...,9


In [33]:
df_norm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159498 entries, 0 to 159497
Data columns (total 2 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   reviews  159498 non-null  object
 1   ratings  159498 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


## Modelling

In [34]:
afn = Afinn(emoticons=True)
afn_scores = []

for review in reviews:
    score = afn.score(review)
    afn_scores.append(score)

In [35]:
analyzer = SentimentIntensityAnalyzer()
vader_score = []

for review in norm_reviews:
    score = analyzer.polarity_scores(review)
    vader_score.append(score['compound'])

Based on the sentiment scores of both of our models, we can categorize sentiments as:
* Positive: a positive sentiment is treated as positive/good feedback for the drug, i.e., the drug proved to be effective.
* Negative: a negative sentiment is treated as negative/bad feedback for the drug, i.e., the drug did not prove to be effective.
* Neutral: a neutral sentiment is treated as neutral feedback for the drug, i.e., the drug was mildly effective.
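
A three-way mapping from score to label could look like the sketch below. The function name and the ±0.05 cut-offs are assumptions for illustration: ±0.05 is the convention commonly used for VADER's compound score, whereas the notebook's own code uses a binary split. The sample scores are VADER compound values taken from the results table in this notebook.

```python
def score_to_sentiment(score, pos_threshold=0.05, neg_threshold=-0.05):
    """Map a numeric sentiment score to 'Positive', 'Negative' or 'Neutral'.

    The default thresholds are an assumed convention, not the notebook's.
    """
    if score >= pos_threshold:
        return "Positive"
    if score <= neg_threshold:
        return "Negative"
    return "Neutral"


# Compound scores copied from three rows of the notebook's results table.
example_scores = [0.8892, 0.0, -0.6808]
labels = [score_to_sentiment(s) for s in example_scores]
print(labels)  # ['Positive', 'Neutral', 'Negative']
```

For AFINN, one possible choice is `pos_threshold=1` and `neg_threshold=-1`, which leaves only a score of 0 as neutral.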

Since we are performing an unsupervised task, there are no sentiment labels in the dataset. We can therefore derive labels from the ratings to evaluate our models.

Based on the ratings, we can categorize rating sentiments as:
* Positive (rating > 7): Drug was effective, and helped with the condition of the patient.
* Negative (rating < 3): Drug was ineffective, and did not help with the condition of the patient.
* Neutral (3 ≤ rating ≤ 7): Drug had a mild/low effect on the condition of the patient.

Note that the code in this notebook uses a simpler binary split for evaluation: Positive if rating > 5, otherwise Negative.
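
The two labelling schemes can be compared side by side. This is a small sketch with hypothetical function names; it assumes that the boundary ratings 3 and 7 count as Neutral in the three-way scheme, and it reproduces the binary rule (rating > 5) that the notebook's code applies.

```python
def rating_to_three_way(rating):
    """Three-way scheme from the text; ratings 3 and 7 are assumed Neutral."""
    if rating > 7:
        return "Positive"
    if rating < 3:
        return "Negative"
    return "Neutral"


def rating_to_binary(rating):
    """Binary scheme used by the notebook's code: Positive only above 5."""
    return "Positive" if rating > 5 else "Negative"


# Show how every possible rating is labelled under both schemes.
for rating in range(1, 11):
    print(rating, rating_to_three_way(rating), rating_to_binary(rating))
```

Ratings 3 to 5 are Neutral in the three-way scheme but Negative in the binary one, and ratings 6 and 7 are Neutral in one and Positive in the other, so the choice of scheme changes the class balance used for evaluation.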

### Here my assumption is simple: the higher the rating, the more effective the drug!

In [36]:
# Creating lists for sentiment results based on AFINN score, VADER score, and rating given by patient
afn_senti = []
rating_senti = []
vader_senti = []

for score in afn_scores:
    if score > 0:
        afn_senti.append('Positive')
    else:
        afn_senti.append('Negative')

for rating in ratings:
    if rating > 5:
        rating_senti.append('Positive')
    else:
        rating_senti.append('Negative')

for score in vader_score:
    if score > 0.1:
        vader_senti.append('Positive')
    else:
        vader_senti.append('Negative')

In [37]:
df_norm = pd.DataFrame(list(zip(norm_reviews, ratings, afn_scores, vader_score, afn_senti, vader_senti, rating_senti)),
                       columns = ['review', 'rating', 'afn_score', 'vader_score', 'afn_senti', 'vader_senti', 'rating_senti'])
df_norm.head(20)

Unnamed: 0,review,rating,afn_score,vader_score,afn_senti,vader_senti,rating_senti
0,side effect take combination bystolic 5 mg fis...,9,-1.0,0.0,Negative,Negative,Positive
1,son halfway fourth week intuniv become concern...,8,6.0,0.8892,Positive,Positive,Positive
2,use take another oral contraceptive 21 pill cy...,5,1.0,0.7506,Positive,Positive,Negative
3,first time use form birth control glad go patc...,8,5.0,0.6996,Positive,Positive,Positive
4,suboxone completely turn life around feel heal...,9,14.0,0.9307,Positive,Positive,Positive
5,2nd day 5 mg start work rock hard erection how...,2,-1.0,-0.6808,Negative,Negative,Negative
6,pull cumme bit take plan b 26 hour later take ...,1,0.0,0.0,Negative,Negative,Negative
7,abilify change life hope zoloft clonidine firs...,10,-7.0,-0.9559,Negative,Negative,Positive
8,nothing problem keppera constant shake arm leg...,1,-4.0,-0.2598,Negative,Negative,Negative
9,pill many year doctor change rx chateal effect...,8,7.0,0.8691,Positive,Positive,Positive


In [38]:
df_norm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159498 entries, 0 to 159497
Data columns (total 7 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   review        159498 non-null  object 
 1   rating        159498 non-null  int64  
 2   afn_score     159498 non-null  float64
 3   vader_score   159498 non-null  float64
 4   afn_senti     159498 non-null  object 
 5   vader_senti   159498 non-null  object 
 6   rating_senti  159498 non-null  object 
dtypes: float64(2), int64(1), object(4)
memory usage: 8.5+ MB


## Model Evaluation

We will look at the recall of our models, i.e., how well they identify the actual negative cases. This is crucial because flagging drugs with bad ratings, sentiment, and reviews helps people choose the right drug for their condition and avoid drugs that may not be effective at all!

In [39]:
display_model_performance_metrics(true_labels = df_norm.rating_senti, 
                                  predicted_labels = df_norm.afn_senti, 
                                  classes = ["Positive", "Negative"])

Model Performance metrics:
------------------------------
Accuracy: 0.5566
Precision: 0.6981
Recall: 0.5566
F1 Score: 0.5694

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    Positive       0.83      0.46      0.59    112003
    Negative       0.38      0.78      0.51     47495

    accuracy                           0.56    159498
   macro avg       0.61      0.62      0.55    159498
weighted avg       0.70      0.56      0.57    159498


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   Positive Negative
Actual: Positive      51660    60343
        Negative      10382    37113


In [41]:
display_model_performance_metrics(true_labels = df_norm.rating_senti, 
                                  predicted_labels = df_norm.vader_senti, 
                                  classes = ["Positive", "Negative"])

Model Performance metrics:
------------------------------
Accuracy: 0.5972
Precision: 0.691
Recall: 0.5972
F1 Score: 0.614

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    Positive       0.81      0.55      0.66    112003
    Negative       0.40      0.70      0.51     47495

    accuracy                           0.60    159498
   macro avg       0.61      0.63      0.58    159498
weighted avg       0.69      0.60      0.61    159498


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   Positive Negative
Actual: Positive      61841    50162
        Negative      14083    33412


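* Model 1 - AFINN: 
        weighted recall: *0.56* (Negative class: *0.78*, Positive class: *0.46*)
* Model 2 - VADER: 
        weighted recall: *0.60* (Negative class: *0.70*, Positive class: *0.55*)
        
Both models give a low weighted recall, mainly because many positive reviews are predicted as negative.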
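
Per-class recall can be recomputed directly from the confusion matrices printed above. The sketch below is plain Python with the counts copied from those outputs; the dictionary layout and the function names are my own choices for illustration. It also shows why the weighted recall in the reports equals the accuracy: weighting each class's recall by its support reduces to correct predictions divided by all predictions.

```python
# Counts copied from the two confusion matrices above.
# Layout: {model: {actual_label: {predicted_label: count}}}
confusion = {
    "AFINN": {
        "Positive": {"Positive": 51660, "Negative": 60343},
        "Negative": {"Positive": 10382, "Negative": 37113},
    },
    "VADER": {
        "Positive": {"Positive": 61841, "Negative": 50162},
        "Negative": {"Positive": 14083, "Negative": 33412},
    },
}


def per_class_recall(matrix, label):
    """Share of rows that truly belong to `label` and were predicted as `label`."""
    row = matrix[label]
    return row[label] / sum(row.values())


def weighted_recall(matrix):
    """Support-weighted mean of the per-class recalls (equal to accuracy)."""
    total = sum(sum(row.values()) for row in matrix.values())
    return sum(
        sum(row.values()) / total * per_class_recall(matrix, label)
        for label, row in matrix.items()
    )


for model, matrix in confusion.items():
    print(
        model,
        "negative recall:", round(per_class_recall(matrix, "Negative"), 2),
        "positive recall:", round(per_class_recall(matrix, "Positive"), 2),
        "weighted recall:", round(weighted_recall(matrix), 4),
    )
```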

Let's try to build a DNN for sentiment analysis...