<a href="https://colab.research.google.com/github/ScottHay14/Natural-Language-Processing-Coursework/blob/main/Natural_Language_Processing_Coursework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Section 1 - Dataset

The Drug Reviews dataset from Druglib.com is a collection of patient reviews of specific drugs, along with the related conditions. The dataset contains the following 9 variables:
<br>
<br>reviewID
<br>urlDrugName
<br>rating
<br>effectiveness
<br>sideEffects
<br>condition
<br>benefitsReview
<br>sideEffectsReview
<br>commentsReview
<br>
<br>
The task performed in this coursework is text classification, with the goal of predicting drug effectiveness ratings from patients' reviews. The effectiveness variable is categorical and takes one of 5 values: Highly Effective, Considerably Effective, Moderately Effective, Marginally Effective, and Ineffective.
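
Most classifiers expect integer class ids rather than strings, so the five effectiveness labels could be mapped to numbers first. The sketch below is one possible way to do that and is not part of the original notebook. The names `EFFECTIVENESS_LEVELS`, `label_to_id`, `id_to_label` and `encode_labels` are hypothetical, and the least-to-most-effective ordering is an assumption based on the label names.

```python
# Ordered from least to most effective. This ranking is an assumption
# inferred from the label names, not something the dataset states.
EFFECTIVENESS_LEVELS = [
    "Ineffective",
    "Marginally Effective",
    "Moderately Effective",
    "Considerably Effective",
    "Highly Effective",
]

# Hypothetical lookup tables for converting between labels and class ids
label_to_id = {label: i for i, label in enumerate(EFFECTIVENESS_LEVELS)}
id_to_label = {i: label for label, i in label_to_id.items()}


def encode_labels(labels):
    """Convert effectiveness strings to integer class ids."""
    return [label_to_id[label] for label in labels]


example = ["Highly Effective", "Considerably Effective", "Ineffective"]
print(encode_labels(example))  # [4, 3, 0]
```

Keeping `id_to_label` around makes it easy to turn model predictions back into readable labels when reporting results.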



In [21]:
# Imports
import pandas as pd
import numpy as np

In [48]:
# Loading Data and combining the test and train dataset into one dataframe
test_data = "/content/Data/drugLibTest_raw.tsv"
train_data = "/content/Data/drugLibTrain_raw.tsv"

test_df = pd.read_csv(test_data, delimiter="\t")
train_df = pd.read_csv(train_data, delimiter="\t")

df = pd.concat([test_df, train_df], ignore_index=True)

In [86]:
# Exploring Data
print(df.head()) # Just printing first rows to see if loaded correctly


   Unnamed: 0 urlDrugName  rating           effectiveness  \
0        1366      biaxin       9  Considerably Effective   
1        3724    lamictal       9        Highly Effective   
2        3824    depakene       4    Moderately Effective   
3         969     sarafem      10        Highly Effective   
4         696    accutane      10        Highly Effective   

           sideEffects           condition  \
0    Mild Side Effects     sinus infection   
1    Mild Side Effects    bipolar disorder   
2  Severe Side Effects    bipolar disorder   
3      No Side Effects  bi-polar / anxiety   
4    Mild Side Effects        nodular acne   

                                      benefitsReview  \
0  The antibiotic may have destroyed bacteria cau...   
1  Lamictal stabilized my serious mood swings. On...   
2  Initial benefits were comparable to the brand ...   
3  It controlls my mood swings. It helps me think...   
4  Within one week of treatment superficial acne ...   

                   


In [105]:
# Combining the 3 review categories into one (benefitsReview, sideEffectsReview, commentsReview) for both the training dataset and the testing dataset

# Train dataset combined first
train_df["combined_review"] = train_df["benefitsReview"].fillna("").astype(str) + "\n\n" + train_df["sideEffectsReview"].fillna("").astype(str) + "\n\n" +  train_df["commentsReview"].fillna("").astype(str)
x_train = train_df["combined_review"].to_numpy()
y_train = train_df["effectiveness"].to_numpy()
print("Train dataset example")
print(x_train[0][:1000])
print(y_train[0])
print("\n")

# Test dataset combined after
test_df["combined_review"] = test_df["benefitsReview"].fillna("").astype(str) + "\n\n" + test_df["sideEffectsReview"].fillna("").astype(str) + "\n\n" +  test_df["commentsReview"].fillna("").astype(str)
x_test = test_df["combined_review"].to_numpy()
y_test = test_df["effectiveness"].to_numpy()
print("Test dataset example")
print(x_test[0][:1000])
print(y_test[0])

Train dataset example
slowed the progression of left ventricular dysfunction into overt heart failure 
alone or with other agents in the managment of hypertension 
mangagement of congestive heart failur

cough, hypotension , proteinuria, impotence , renal failure , angina pectoris , tachycardia , eosinophilic pneumonitis, tastes disturbances , anusease anorecia , weakness fatigue insominca weakness

monitor blood pressure , weight and asses for resolution of fluid
Highly Effective


Test dataset example
The antibiotic may have destroyed bacteria causing my sinus infection.  But it may also have been caused by a virus, so its hard to say.

Some back pain, some nauseau.

Took the antibiotics for 14 days. Sinus infection was gone after the 6th day.
Considerably Effective
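
Before modelling, it may be worth checking how evenly the five effectiveness labels are distributed, since a strong imbalance affects which evaluation metrics are meaningful. The sketch below uses only the standard library. `class_distribution` and `sample_labels` are hypothetical names, and the sample list is made-up illustration data rather than values from the dataset.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, share)}, most frequent label first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}


# Made-up stand-in for y_train; in the notebook, pass y_train itself
sample_labels = [
    "Highly Effective",
    "Highly Effective",
    "Considerably Effective",
    "Ineffective",
]

for label, (n, share) in class_distribution(sample_labels).items():
    print(f"{label}: {n} ({share:.0%})")
```

With pandas, the equivalent check is `value_counts()` called on the `effectiveness` column of the DataFrame. It is a Series method, so it is not available on a plain string such as a file path.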


In [96]:
# Preprocessing data
import nltk

nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

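def prep(X):
  # Build the stopword set and the stemmer once; rebuilding them for every
  # token or document makes preprocessing very slow
  stop_words = set(stopwords.words("english"))
  stemmer = SnowballStemmer("english")
  prep_sentences = []
  for x in X:
    token_text = word_tokenize(x)
    # Lowercase and keep alphabetic tokens only
    normd_text = [token.lower() for token in token_text if token.isalpha()]
    # Remove stopwords
    swr_text = [token for token in normd_text if token not in stop_words]
    # Stem each word and rejoin into one string per review
    prep_sentences.append(" ".join(stemmer.stem(word) for word in swr_text))
  return prep_sentences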

prep_x_train = prep(x_train)
prep_x_test = prep(x_test)

print("Preprocessed working for train dataset")
print(prep_x_train[0][:1000])

print("Preprocessed working for test dataset")
print(prep_x_test[0][:1000])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Preprocessed working for train dataset
although type birth control con pros help cramp also effect prevent pregnanc along use condom well heavi cycl cramp hot flash fatigu long last cycl month concid chang differ bc first time use kind bc unfortun due constant hassel happi result hate birth control would suggest anyon
Preprocessed working for test dataset
lamict stabil serious mood swing one minut claw wall pure mania next curl fetal posit bed contempl suicdi longer whim mood neither around lucki start pharmaceut almost immedi diagnos bipolar lamict give amaz clariti go day honest assess form real relationship lamitc lift fog guess could call medic realiz cloudi thought process use wonder feel interest hard dreamt begin lamict would dream mean dream sens abl imagin pictur scene asleep rem mayb everi two month dream everi night found closer take bedtim frequent intens dream drowsi bit mental numb take much feel sedat sinc abl clear honest assess emot thought determin much medic need tou

## Section 2 - Representation Learning

do later

In [None]:
# Install gensim, which provides Word2Vec
!pip install gensim

In [104]:
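from gensim.models import Word2Vec
import numpy as np

def word2vec_rep(sentence, w2v_model):
  # sentence must be a list of tokens (e.g. text.split()), not a raw string
  # key_to_index is a dict, so lookups are fast (index_to_key is a list)
  embs = [w2v_model.wv[word] for word in sentence if word in w2v_model.wv.key_to_index]
  if not embs:
    # No known words: return a zero vector instead of NaN from an empty mean
    return np.zeros(w2v_model.vector_size)
  sent_emb = np.mean(np.array(embs), 0)
  return sent_emb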
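
The averaging step in `word2vec_rep` can be illustrated without training a model. The sketch below is a toy stand-in: `toy_vectors` is a hypothetical hand-made lookup table, where a trained `Word2Vec` model would supply the vectors through `w2v_model.wv`. `mean_embedding` is a hypothetical helper, and the zero-vector fallback for reviews with no known words is an assumption about how such cases should be handled.

```python
import numpy as np

# Hypothetical 2-dimensional "embeddings" for illustration only
toy_vectors = {
    "mood": np.array([1.0, 0.0]),
    "swing": np.array([0.0, 1.0]),
}


def mean_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    embs = [vectors[t] for t in tokens if t in vectors]
    if not embs:
        return np.zeros(dim)
    return np.mean(np.array(embs), axis=0)


# prep() returns space-joined strings, so split them back into tokens first;
# iterating over the raw string would look up single characters instead
tokens = "lamict stabil mood swing".split()
print(mean_embedding(tokens, toy_vectors, dim=2))  # [0.5 0.5]
```

Words missing from the lookup (`lamict` and `stabil` here) are skipped, so the result is the mean of the two known vectors.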

## Section 3 - Algorithms

## Section 4 - Evaluation