<a href="https://colab.research.google.com/github/ScottHay14/Natural-Language-Processing-Coursework/blob/main/Natural_Language_Processing_Coursework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Section 1 - Dataset

The Drug Reviews dataset from Druglib.com is a collection of patient reviews of specific drugs, along with the conditions they were taken for. The dataset contains the following 9 variables:
<br>
<br>reviewID
<br>urlDrugName
<br>rating
<br>effectiveness
<br>sideEffects
<br>condition
<br>benefitsReview
<br>sideEffectsReview
<br>commentsReview
<br>
<br>
The task in this coursework is text classification, with the goal of predicting a drug's effectiveness rating from the patient's reviews. The effectiveness variable is categorical and has 5 classes: Highly Effective, Considerably Effective, Moderately Effective, Marginally Effective, and Ineffective.



In [132]:
# Google Colab deletes the folder between sessions, so the Data folder with the dataset has to be fetched again. Running this clones the GitHub repo and copies its Data folder to /content/Data
!git clone https://github.com/ScottHay14/Natural-Language-Processing-Coursework
!cp -r /content/Natural-Language-Processing-Coursework/Data /content/Data

Cloning into 'Natural-Language-Processing-Coursework'...
remote: Enumerating objects: 41, done.

In [133]:
# Imports
import pandas as pd
import numpy as np

In [134]:
# Loading Data and combining the test and train dataset into one dataframe
test_data = "/content/Data/drugLibTest_raw.tsv"
train_data = "/content/Data/drugLibTrain_raw.tsv"

test_df = pd.read_csv(test_data, delimiter="\t")
train_df = pd.read_csv(train_data, delimiter="\t")

df = pd.concat([test_df, train_df], ignore_index=True)

In [135]:
# Exploring Data
print(df.head()) # Just printing first rows to see if loaded correctly


print("\nClass distribution")
print(df["effectiveness"].value_counts()) # Shows a large class imbalance



   Unnamed: 0 urlDrugName  rating           effectiveness  \
0        1366      biaxin       9  Considerably Effective   
1        3724    lamictal       9        Highly Effective   
2        3824    depakene       4    Moderately Effective   
3         969     sarafem      10        Highly Effective   
4         696    accutane      10        Highly Effective   

           sideEffects           condition  \
0    Mild Side Effects     sinus infection   
1    Mild Side Effects    bipolar disorder   
2  Severe Side Effects    bipolar disorder   
3      No Side Effects  bi-polar / anxiety   
4    Mild Side Effects        nodular acne   

                                      benefitsReview  \
0  The antibiotic may have destroyed bacteria cau...   
1  Lamictal stabilized my serious mood swings. On...   
2  Initial benefits were comparable to the brand ...   
3  It controlls my mood swings. It helps me think...   
4  Within one week of treatment superficial acne ...   

                   

In [None]:
# Combining the 3 review columns (benefitsReview, sideEffectsReview, commentsReview) into one text field for both the training dataset and the testing dataset
REVIEW_COLUMNS = ["benefitsReview", "sideEffectsReview", "commentsReview"]

def combine_reviews(frame):
  # Missing reviews become empty strings; the three parts are joined by a blank line
  return frame[REVIEW_COLUMNS].fillna("").astype(str).agg("\n\n".join, axis=1)

# Train dataset combined first
train_df["combined_review"] = combine_reviews(train_df)
x_train = train_df["combined_review"].to_numpy()
y_train = train_df["effectiveness"].to_numpy()
print("Train dataset example")
print(x_train[0][:1000])
print(y_train[0])
print("\n")

# Test dataset combined after
test_df["combined_review"] = combine_reviews(test_df)
x_test = test_df["combined_review"].to_numpy()
y_test = test_df["effectiveness"].to_numpy()
print("Test dataset example")
print(x_test[0][:1000])
print(y_test[0])

Train dataset example
slowed the progression of left ventricular dysfunction into overt heart failure 
alone or with other agents in the managment of hypertension 
mangagement of congestive heart failur

cough, hypotension , proteinuria, impotence , renal failure , angina pectoris , tachycardia , eosinophilic pneumonitis, tastes disturbances , anusease anorecia , weakness fatigue insominca weakness

monitor blood pressure , weight and asses for resolution of fluid
Highly Effective


Test dataset example
The antibiotic may have destroyed bacteria causing my sinus infection.  But it may also have been caused by a virus, so its hard to say.

Some back pain, some nauseau.

Took the antibiotics for 14 days. Sinus infection was gone after the 6th day.
Considerably Effective


In [None]:
# Preprocessing data
import nltk

nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

def prep(X):
  # Build the stop-word set and the stemmer once, rather than once per token or per document
  stop_words = set(stopwords.words("english"))
  stemmer = SnowballStemmer("english")
  prep_sentences = []
  for x in X:
    token_text = word_tokenize(x)
    # Lowercase and keep purely alphabetic tokens (drops punctuation and numbers)
    normd_text = [token.lower() for token in token_text if token.isalpha()]
    swr_text = [token for token in normd_text if token not in stop_words]
    prep_sentences.append(" ".join(stemmer.stem(word) for word in swr_text))
  return prep_sentences

prep_x_train = prep(x_train)
prep_x_test = prep(x_test)

print("Preprocessed working for train dataset")
print(prep_x_train[0][:1000])

print("Preprocessed working for test dataset")
print(prep_x_test[0][:1000])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Preprocessed working for train dataset
slow progress left ventricular dysfunct overt heart failur alon agent manag hypertens mangag congest heart failur cough hypotens proteinuria impot renal failur angina pectori tachycardia eosinophil pneumon tast disturb anuseas anorecia weak fatigu insominca weak monitor blood pressur weight ass resolut fluid
Preprocessed working for test dataset
antibiot may destroy bacteria caus sinus infect may also caus virus hard say back pain nauseau took antibiot day sinus infect gone day


## Section 2 - Representation Learning

Term frequency-inverse document frequency (TF-IDF) measures how important a word is to one document within a collection of documents, also known as a corpus. For the drug dataset, each document is a combined review. TF has the formula

TF(t,d) = Number of times term t appears in document d / Total number of terms in document d

So if the word "slowed" appeared once in a combined review with a total of 75 words, TF = 1/75 ≈ 0.013.

The next part is IDF, which measures how rare a word is across the corpus. It has the formula

IDF(t) = log(Total number of documents in the corpus N / Number of documents containing term t)

So a common word like "drug", which appears in many reviews, has a low IDF.

The final part multiplies the two: TF-IDF(t,d) = TF(t,d) * IDF(t). A word gets a high TF-IDF when it appears often in one review but is uncommon across the corpus.
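The formulas above can be checked by hand with a minimal sketch. The three-review corpus below is made up for illustration and is not taken from the dataset, and the helper names `tf`, `idf` and `tf_idf` are hypothetical. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalises each row by default, so its values will differ from this textbook version.

```python
import math

# Toy corpus (hypothetical, already tokenised reviews); not from the Druglib data
corpus = [
    ["drug", "slowed", "heart", "failure"],
    ["drug", "caused", "cough"],
    ["drug", "helped", "mood", "swings"],
]

def tf(term, doc):
    # Share of the document's tokens that are `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(N / number of documents containing the term);
    # assumes the term occurs in at least one document
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

common = tf_idf("drug", corpus[0], corpus)    # appears in every review
rare = tf_idf("slowed", corpus[0], corpus)    # appears in one review only
print(f"drug: {common:.3f}, slowed: {rare:.3f}")  # drug: 0.000, slowed: 0.275
```

Both words have the same TF in the first review (1/4), so the difference in score comes entirely from IDF: "drug" occurs in all three reviews, giving log(3/3) = 0.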

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

In [None]:
# Representation Learning - TF-IDF
tfidf = TfidfVectorizer(max_features=5000)
x_train_tfidf = tfidf.fit_transform(prep_x_train)
x_test_tfidf = tfidf.transform(prep_x_test)

# Data Exploration
feature_names = tfidf.get_feature_names_out()
print(f"Sample {list(feature_names)[:15]}")


Sample ['aarm', 'aarp', 'abait', 'abandon', 'abat', 'abbout', 'abbsess', 'abcess', 'abdo', 'abdomen', 'abdomin', 'aberr', 'abfter', 'abil', 'abilifi']


## Section 3 - Algorithms

## Linear Support Vector Classification


## Multinomial Naive Bayes
Multinomial Naive Bayes is a variant of the Naive Bayes algorithm that models how often each word occurs in each class, and it is mainly used for text classification.
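To make the idea concrete, here is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing on a made-up four-review corpus. It is not scikit-learn's implementation, and the function names `train_mnb` and `predict_mnb` and the two toy labels are assumptions for illustration only. The notebook itself uses `MultinomialNB(alpha=1.0)` on TF-IDF features, which are fractional rather than integer counts.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    # Collect the vocabulary, documents per class, and word counts per class
    vocab = {word for doc in docs for word in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    return vocab, class_docs, word_counts, alpha

def predict_mnb(model, doc):
    vocab, class_docs, word_counts, alpha = model
    n_docs = sum(class_docs.values())
    scores = {}
    for label, n_label in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)  # log prior of the class
        for word in doc:
            if word in vocab:  # words never seen in training are skipped
                # Smoothed log likelihood of the word given the class
                score += math.log(
                    (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy reviews, already tokenised
toy_docs = [
    ["great", "relief", "no", "pain"],
    ["worked", "well", "pain", "gone"],
    ["no", "effect", "still", "sick"],
    ["useless", "did", "nothing"],
]
toy_labels = ["effective", "effective", "ineffective", "ineffective"]

model = train_mnb(toy_docs, toy_labels)
prediction = predict_mnb(model, ["pain", "gone", "great"])
print(prediction)  # effective
```

The smoothing term `alpha` keeps a word that was never seen with a class from driving that class's probability to zero.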

In [None]:
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)

# Initialize score lists for multiple metrics
svm_acc_scores = []
svm_f1_macro_scores = []
mnb_acc_scores = []
mnb_f1_macro_scores = []

kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)  # fixed seed (arbitrary value) so the folds are reproducible between runs

for fold, (train_idx, val_idx) in enumerate(kf.split(prep_x_train, y_train_encoded), 1):
    print(f"\nFold {fold}/10")

    x_train_fold = [prep_x_train[i] for i in train_idx]
    x_val_fold = [prep_x_train[i] for i in val_idx]
    y_train_fold = y_train_encoded[train_idx]
    y_val_fold = y_train_encoded[val_idx]

    tfidf_fold = TfidfVectorizer(
        max_features=5000,
        ngram_range=(1, 2),
        min_df=2,
        max_df=0.90,
        sublinear_tf=True
    )

    # Named separately so the fold matrix does not overwrite x_train_tfidf from Section 2
    x_fold_tfidf = tfidf_fold.fit_transform(x_train_fold)
    x_val_tfidf = tfidf_fold.transform(x_val_fold)

    svm = LinearSVC(
        class_weight="balanced", # balanced mode weights classes inversely to their frequency in y
        max_iter=1000,
        C=1.0
    )

    svm.fit(x_fold_tfidf, y_train_fold)
    svm_pred = svm.predict(x_val_tfidf)

    svm_acc = accuracy_score(y_val_fold, svm_pred)
    svm_f1_macro = f1_score(y_val_fold, svm_pred, average='macro')

    svm_acc_scores.append(svm_acc)
    svm_f1_macro_scores.append(svm_f1_macro)

    print(f"  LinearSVC Accuracy: {svm_acc:.4f}, F1-macro: {svm_f1_macro:.4f}")

    mnb = MultinomialNB(alpha=1.0)
    mnb.fit(x_fold_tfidf, y_train_fold)
    mnb_pred = mnb.predict(x_val_tfidf)

    mnb_acc = accuracy_score(y_val_fold, mnb_pred)
    mnb_f1_macro = f1_score(y_val_fold, mnb_pred, average='macro')

    mnb_acc_scores.append(mnb_acc)
    mnb_f1_macro_scores.append(mnb_f1_macro)

    print(f"  MultinomialNB Accuracy: {mnb_acc:.4f}, F1-macro: {mnb_f1_macro:.4f}")





Fold 1/10
  LinearSVC Accuracy: 0.4695, F1-macro: 0.3449
  MultinomialNB Accuracy: 0.4598, F1-macro: 0.1707

Fold 2/10
  LinearSVC Accuracy: 0.4630, F1-macro: 0.3886
  MultinomialNB Accuracy: 0.4534, F1-macro: 0.1640

Fold 3/10
  LinearSVC Accuracy: 0.4116, F1-macro: 0.3186
  MultinomialNB Accuracy: 0.4469, F1-macro: 0.1598

Fold 4/10
  LinearSVC Accuracy: 0.4598, F1-macro: 0.3611
  MultinomialNB Accuracy: 0.4469, F1-macro: 0.1573

Fold 5/10
  LinearSVC Accuracy: 0.4662, F1-macro: 0.4035
  MultinomialNB Accuracy: 0.4373, F1-macro: 0.1517

Fold 6/10
  LinearSVC Accuracy: 0.4148, F1-macro: 0.2935
  MultinomialNB Accuracy: 0.4405, F1-macro: 0.1502

Fold 7/10
  LinearSVC Accuracy: 0.4534, F1-macro: 0.3648
  MultinomialNB Accuracy: 0.4277, F1-macro: 0.1539

Fold 8/10
  LinearSVC Accuracy: 0.4484, F1-macro: 0.3659
  MultinomialNB Accuracy: 0.4613, F1-macro: 0.1688

Fold 9/10
  LinearSVC Accuracy: 0.4774, F1-macro: 0.3632
  MultinomialNB Accuracy: 0.4355, F1-macro: 0.1429

Fold 10/10
  Linea

## Section 4 - Evaluation