<a href="https://colab.research.google.com/github/BrittonWinterrose/Drug_Review_NLP/blob/master/01_Drug_Review_Dataset_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Drug Review Dataset Exploration
## [Dataset](https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29)

This notebook is from my first day with the dataset, exploring the data and trying out some sentiment analysis and regression.

## Import the Dataset

In [2]:
# Getting started with drug data
# http://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29

!wget http://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip

--2019-01-01 00:46:51--  http://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42989872 (41M) [application/zip]
Saving to: ‘drugsCom_raw.zip’


2019-01-01 00:46:53 (17.8 MB/s) - ‘drugsCom_raw.zip’ saved [42989872/42989872]



In [4]:
!unzip drugsCom_raw.zip

Archive:  drugsCom_raw.zip
  inflating: drugsComTest_raw.tsv    
  inflating: drugsComTrain_raw.tsv   


In [7]:
!head -n5 drugsComTrain_raw.tsv

	drugName	condition	review	rating	date	usefulCount
206461	Valsartan	Left Ventricular Dysfunction	"""It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"""	9.0	May 20, 2012	27
95260	Guanfacine	ADHD	"""My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. 
We have tried many different medications and so far this is the most effective."""	8.0	April 27, 2010	192
92703

In [10]:
import numpy as np
import pandas as pd
from scipy import stats
import os
import re


df = pd.read_table('drugsComTrain_raw.tsv')
df.head()

### Thought flow for Depression Confidence Intervals
"""
I want to take the df and filter it by condition and drug, given a confidence
level and a sample size cutoff. Then loop through all the drugs for a specific
condition and calculate their mean, upper limit, and lower limit.
"""
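# Create Confidence Interval Function
def confidence_interval(data, ci_percent):
  # ci_percent is the confidence level as a fraction, e.g. 0.95 for 95%
  data = np.array(data) # Makes sure our data is in a numpy array
  mean = np.mean(data)
  n = len(data)
  stderr = stats.sem(data) # Standard error of the mean
  # Half-width of the two-sided interval, from the Student's t distribution
  interval = stderr * stats.t.ppf((1 + ci_percent) / 2., n - 1)
  return (mean, mean - interval, mean + interval)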


def condition_compare (df, condition_id, ci_percent, sample_size_cutoff):
  output_names = ["Drug Name", "Sample Mean", "Lower Bound", "Upper Bound", "Sample Size"]
  drug_compare = []
  data = df[df.condition == condition_id]
  for drug in data.drugName.unique():
    one_drug = data[data.drugName == drug].rating
    if one_drug.size > sample_size_cutoff:
      mean, ilower, iupper= confidence_interval(one_drug, ci_percent)
      entry = [drug, mean, ilower, iupper, one_drug.size]
      drug_compare.append(entry)
  return pd.DataFrame(drug_compare, columns=output_names)


df2 = condition_compare(df, "Depression", 0.95, 10).sort_values(by="Sample Mean", ascending=False)
df2.head(3)

Unnamed: 0,Drug Name,Sample Mean,Lower Bound,Upper Bound,Sample Size
62,Niacin,9.857143,9.647474,10.066812,14
47,Tramadol,9.288462,8.934,9.642923,52
68,Clomipramine,9.181818,8.10616,10.257476,11
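
For reference, `confidence_interval` above computes the usual two-sided Student's t interval for a mean. Writing $c$ for `ci_percent`, $n$ for the sample size, $\bar{x}$ for the sample mean, and $s$ for the sample standard deviation (with $n-1$ in the denominator, which is the default in `stats.sem`):

$$
\bar{x} \;\pm\; t_{\frac{1+c}{2},\,n-1}\,\frac{s}{\sqrt{n}}
$$

Nothing in this formula knows about the rating scale, so for a small sample with a mean near the top of the scale the upper bound can pass the highest possible rating. The Niacin row shows this (upper bound 10.066812, assuming ratings top out at 10).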


## Now an attempt at some sentiment analysis using scikit-learn



In [0]:
# For this first attempt I followed this tutorial. It is simple, but somewhat lacking.
# Nevertheless, it was reproducible using my dataset.
# https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

In [13]:
# I combined the pre-split test & train data. Didn't have to but felt right. 
df_train = pd.read_table('drugsComTrain_raw.tsv')
df_test = pd.read_table('drugsComTest_raw.tsv')

df_main = pd.concat([df_train, df_test], axis=0)
df_main.head(3)

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9.0,"May 20, 2012",27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8.0,"April 27, 2010",192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5.0,"December 14, 2009",17


#### Some RegEx to clean this text up. 

In [14]:
# Clean up text with RegEx
pd.set_option('display.width', 1000)
rx_pat = r"(\\r)|(\\n)|(\\t)|(\\f)|(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(&#039;)|(\d\s)|(\d)|(\/)"
rx_pat_wSpace = r"(\-)|(\\)|(\s{2,})"
    
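# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions may not apply to the original DataFrame.
df_main['review'] = df_main['review'].replace(to_replace=rx_pat, value=r'', regex=True)
df_main['review'] = df_main['review'].replace(to_replace=rx_pat_wSpace, value=r' ', regex=True)
df_main.review.head(10)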

0    It has no side effect I take it in combination...
1    My son is halfway through his fourth week of I...
2    I used to take another oral contraceptive whic...
3    This is my first time using any form of birth ...
4    Suboxone has completely turned my life around ...
5    nd day on mg started to work with rock hard er...
6    He pulled out but he cummed a bit in me I took...
7    Abilify changed my life There is hope I was on...
8     I Ve had nothing but problems with the Kepper...
9    I had been on the pill for many years When my ...
Name: review, dtype: object
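
To see what the two patterns do to a single review, here is a minimal sketch that applies them with the standard `re` module instead of pandas. The helper name `clean_review` and both sample strings are made up for illustration; only the two patterns come from the cell above.

```python
import re

# Same patterns as in the cell above.
RX_REMOVE = r"(\\r)|(\\n)|(\\t)|(\\f)|(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(&#039;)|(\d\s)|(\d)|(\/)"
RX_TO_SPACE = r"(\-)|(\\)|(\s{2,})"


def clean_review(text):
    """Drop punctuation, digits and HTML apostrophes, then turn hyphens,
    backslashes and runs of whitespace into a single space."""
    text = re.sub(RX_REMOVE, "", text)
    text = re.sub(RX_TO_SPACE, " ", text)
    return text


# Hypothetical samples, not taken from the dataset.
sample_a = '"I&#039;ve had no side-effects, great!"'
sample_b = "2nd day on 5mg"

cleaned_a = clean_review(sample_a)
cleaned_b = clean_review(sample_b)
print(cleaned_a)
print(cleaned_b)
```

Because every digit is removed, dose information such as "5mg" collapses to "mg", which may or may not be what you want for a drug review dataset.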

In [15]:
# Lowercase the text, then inspect the cleaned result.
df_main['review'] = df_main['review'].str.lower()

df_main['review'].head(5)
# Nailed it. 

0    it has no side effect i take it in combination...
1    my son is halfway through his fourth week of i...
2    i used to take another oral contraceptive whic...
3    this is my first time using any form of birth ...
4    suboxone has completely turned my life around ...
Name: review, dtype: object

In [0]:
# VECTORIZE IT (One Hot Encode It)
# Each word becomes one feature (column); binary=True records presence (1) or absence (0), not counts
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(binary=True)
cv.fit(df_main['review'])

# Define my X: a sparse matrix with n reviews (rows) and m token features (columns)
X = cv.transform(df_main['review'])

# Define my y as the raw rating
y = df_main["rating"]

In [17]:
# Do the matrix dimensions look right?
X

<215063x71335 sparse matrix of type '<class 'numpy.int64'>'
	with 12366059 stored elements in Compressed Sparse Row format>
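
The matrix above has one row per review and one column per distinct token. As a toy illustration of that layout, the sketch below builds the same kind of presence/absence matrix by hand for two short documents. It is not scikit-learn's implementation: `binary_bag_of_words` is a name introduced here, the token rule (lowercased words of two or more characters) is assumed to mirror the `CountVectorizer` default, and the output is a dense list of lists rather than a sparse matrix.

```python
import re


def binary_bag_of_words(docs):
    """Return (vocabulary, rows), where rows[i][j] is 1 if vocabulary[j]
    occurs in docs[i] and 0 otherwise."""
    token_re = re.compile(r"\b\w\w+\b")  # words with at least two characters
    tokenized = [set(token_re.findall(doc.lower())) for doc in docs]
    vocabulary = sorted(set().union(*tokenized))
    rows = [[1 if term in tokens else 0 for term in vocabulary]
            for tokens in tokenized]
    return vocabulary, rows


# Two made-up documents.
docs = ["it works it really works", "no side effect"]
vocabulary, rows = binary_bag_of_words(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Repeated words such as "works" still contribute a single 1, which is the effect of `binary=True`.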

In [18]:
y.head()

0    9.0
1    8.0
2    5.0
3    8.0
4    9.0
Name: rating, dtype: float64

In [19]:
np.size(X,0)

215063

In [0]:
# Binned by rating (same as the research paper)
y_rank = []
for i in y:
  if i <= 4:
    y_rank.append(-1)
  elif i >= 7:
    y_rank.append(1)
  else:
    y_rank.append(0)

In [21]:
# Make sure y_rank has the same number of rows (n) as X.
y_rank = np.asarray(y_rank)
np.size(y_rank,0)

215063
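
The binning rule above can also be written as a small function, which makes it easy to check how balanced the three bins are before reading too much into an accuracy number. This is a minimal sketch: `bin_rating` is a name introduced here, the first five ratings are the ones shown by `y.head()`, and the last one is a made-up low rating.

```python
from collections import Counter


def bin_rating(rating):
    """Map a rating to -1 (negative, <= 4), 1 (positive, >= 7) or 0 (neutral)."""
    if rating <= 4:
        return -1
    if rating >= 7:
        return 1
    return 0


ratings = [9.0, 8.0, 5.0, 8.0, 9.0, 1.0]
labels = [bin_rating(r) for r in ratings]
counts = Counter(labels)
print(labels)
print(counts)
```

On the full dataset the same count could be taken with `Counter(y_rank)`.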

In [22]:
np.random.seed()
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y_rank, train_size = 0.7)

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))
    
# Accuracy for C=0.01: 0.7125497915342768
# Accuracy for C=0.05: 0.7175405694446597
# Accuracy for C=0.25: 0.7197724701250794
# Accuracy for C=0.5: 0.7199584618484477



Accuracy for C=0.01: 0.7822191912459896
Accuracy for C=0.05: 0.7928672174088253
Accuracy for C=0.25: 0.8028022752987493
Accuracy for C=0.5: 0.8072815759698694
Accuracy for C=1: 0.8132643097382166


In [23]:
# Train the final model with the best-performing C value from above (C=1).
X_train, X_test, y_train, y_test = train_test_split(X, y_rank, train_size = 0.823)

final_model = LogisticRegression(C=1)
final_model.fit(X_train, y_train)
print ("Final Accuracy: %s" 
       % accuracy_score(y_test, final_model.predict(X_test)))



Final Accuracy: 0.819502456195655


In [24]:
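# Inspect the weights of each token.
# Note: coef_[0] is the coefficient row for the first label in final_model.classes_.
# get_feature_names() was removed in newer scikit-learn; get_feature_names_out() replaces it.
feature_to_coef = {
    word: coef for word, coef in zip(
        cv.get_feature_names_out(), final_model.coef_[0]
    )
}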
for best_positive in sorted(
    feature_to_coef.items(), 
    key=lambda x: x[1], 
    reverse=True)[:10]:
    print (best_positive)
print("\n")
    
#     ('excellent', 0.9288812418118644)
#     ('perfect', 0.7934641227980576)
#     ('great', 0.675040909917553)
#     ('amazing', 0.6160398142631545)
#     ('superb', 0.6063967799425831)
    
for best_negative in sorted(
    feature_to_coef.items(), 
    key=lambda x: x[1])[:10]:
    print (best_negative)
    
#     ('worst', -1.367978497228895)
#     ('waste', -1.1684451288279047)
#     ('awful', -1.0277001734353677)
#     ('poorly', -0.8748317895742782)
#     ('boring', -0.8587249740682945)

('rejecting', 2.4807666897364)
('association', 2.359297784471058)
('proair', 2.319277235606814)
('outfits', 2.3175035087548577)
('ponytail', 2.270379700558579)
('disappointed', 2.2315833736272137)
('bac', 2.131347660810012)
('scam', 2.129333979450242)
('doubting', 2.1255704237188664)
('advertise', 2.0964402402231)


('lifesaver', -2.9898448202957866)
('saver', -2.6957205162040636)
('saved', -2.507537430528717)
('doable', -2.453727059971392)
('blessing', -2.306927404512549)
('excellent', -2.300258083089314)
('endocrine', -2.2506912210059955)
('miracle', -2.1964994014355055)
('dip', -2.1444478560138642)
('aggravate', -2.0894909343327517)


#### The text tokens represented here were not congruent with the expected sentiment. Things like "miracle" and "aggravate" being so close together were suspect and caused me to seek a deeper understanding of NLP best practices. 
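
A possible explanation for the ordering above, offered as a sketch rather than a diagnosis: with three labels (-1, 0, 1), scikit-learn's `LogisticRegression` stores one coefficient row per class in `coef_`, ordered like `classes_`, so `coef_[0]` would be the row for the label -1. Under that reading, the "best positive" list is really the set of tokens that push a review most strongly toward the negative bin. The helper below selects the row by label instead of by position. `top_tokens_for_class` is a hypothetical name, and the toy coefficients are made up for illustration.

```python
def top_tokens_for_class(classes, coef_rows, feature_names, target_class, k=10):
    """Return the k (token, coefficient) pairs with the largest coefficient
    in the row that belongs to target_class."""
    row_index = list(classes).index(target_class)
    pairs = zip(feature_names, coef_rows[row_index])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


# Toy stand-ins for final_model.classes_, final_model.coef_ and the vocabulary.
classes = [-1, 0, 1]
feature_names = ["disappointed", "lifesaver", "took"]
coef_rows = [
    [2.2, -2.9, 0.1],   # row for label -1
    [0.0, 0.1, -0.1],   # row for label 0
    [-1.8, 2.5, 0.0],   # row for label 1
]

print(top_tokens_for_class(classes, coef_rows, feature_names, 1, k=1))
print(top_tokens_for_class(classes, coef_rows, feature_names, -1, k=1))
```

With the notebook's own objects the call would look like `top_tokens_for_class(final_model.classes_, final_model.coef_, cv.get_feature_names_out(), target_class=1)`.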