# Model 2.1 - Relate Review to Satisfaction

Goal 2.1 – Predict Satisfaction from Review Text

- Build a classification model to predict patient satisfaction ratings (1–5) based on free-text drug reviews from the WebMD dataset.
- Use a Hugging Face Transformer model (DistilBERT) fine-tuned locally for this multi-class classification task.
- Apply the model per drug, enabling drug-specific analysis of patient sentiment and satisfaction.

## Setup

In [1]:
%pip install transformers datasets scikit-learn pandas

Note: you may need to restart the kernel to use updated packages.


## Loading the CSV

In [2]:
import pandas as pd

# Load the WebMD reviews CSV (adjust the path to your local copy)
df = pd.read_csv('/Users/homecomputer/code/ElbediwiM/drug-analysis-review/Raw_Data/webmd.csv')

# Clean and prepare relevant columns
df = df[['Drug', 'Reviews', 'Satisfaction']].dropna()
df['Satisfaction'] = df['Satisfaction'].astype(int)

df.head()


| | Drug | Reviews | Satisfaction |
|---|---|---|---|
| 0 | 25dph-7.5peh | I'm a retired physician and of all the meds I ... | 5 |
| 1 | 25dph-7.5peh | cleared me right up even with my throat hurtin... | 5 |
| 2 | warfarin (bulk) 100 % powder | why did my PTINR go from a normal of 2.5 to ov... | 3 |
| 3 | warfarin (bulk) 100 % powder | FALLING AND DON'T REALISE IT | 1 |
| 4 | warfarin (bulk) 100 % powder | My grandfather was prescribed this medication ... | 1 |
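
Before training a five-class model it can help to look at how the ratings are distributed, since a skewed distribution makes plain accuracy harder to interpret. The sketch below is one way to do that; the helper name `label_distribution` and the toy rows are assumptions for illustration, not part of the original notebook or the WebMD data.

```python
import pandas as pd

def label_distribution(frame, label_col="Satisfaction"):
    """Return the share of rows per rating, ordered by rating value."""
    return frame[label_col].value_counts(normalize=True).sort_index()

# Toy frame for illustration only; these ratings are made up.
toy = pd.DataFrame({"Satisfaction": [1, 1, 5, 5, 5, 3, 4, 5]})
shares = label_distribution(toy)
print(shares)
```

On the real data the same call would be `label_distribution(df)`.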


## Train/Validation Split

In [3]:
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['Reviews'].tolist(),
    df['Satisfaction'].tolist(),
    test_size=0.2,
    random_state=42
)
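
The split above is purely random. If the ratings turn out to be imbalanced, one optional variation is to stratify on the label so that the training and validation sets keep roughly the same class proportions. The following is a minimal sketch under that assumption; `stratified_split` and the toy inputs are hypothetical names and data, and the notebook itself does not stratify.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

def stratified_split(texts, labels, test_size=0.2, seed=42):
    """Split texts/labels while preserving the label proportions."""
    return train_test_split(
        texts,
        labels,
        test_size=test_size,
        random_state=seed,
        stratify=labels,  # keep class shares equal in both parts
    )

# Toy data for illustration: 10 placeholder reviews per rating 1..5.
toy_labels = [rating for rating in range(1, 6) for _ in range(10)]
toy_texts = [f"review {i}" for i in range(len(toy_labels))]

tr_x, va_x, tr_y, va_y = stratified_split(toy_texts, toy_labels)
print(Counter(va_y))
```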

## Install PyTorch

In [4]:
%pip install torch torchvision

Note: you may need to restart the kernel to use updated packages.


## Tokenize with Hugging Face DistilBertTokenizer

In [5]:
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

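# Batched tokenization: call the tokenizer directly (batch_encode_plus is deprecated).
# truncation=True cuts each review to at most max_length tokens;
# padding=True pads every sequence to the longest one in the call.
train_encodings = tokenizer(
    train_texts,
    truncation=True,
    padding=True,
    max_length=512
)

val_encodings = tokenizer(
    val_texts,
    truncation=True,
    padding=True,
    max_length=512
)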
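
Because reviews longer than 512 tokens are truncated (512 is also the longest input `distilbert-base-uncased` accepts), it may be worth checking how many reviews are affected. The sketch below assumes the per-review token counts have already been collected, for example with `[len(ids) for ids in tokenizer(train_texts, truncation=False)["input_ids"]]`. The helper `truncation_rate` and the toy counts are hypothetical.

```python
def truncation_rate(token_counts, max_length=512):
    """Fraction of sequences that are longer than max_length tokens."""
    if not token_counts:
        return 0.0
    too_long = sum(n > max_length for n in token_counts)
    return too_long / len(token_counts)

# Toy token counts for illustration only; not measured on the WebMD data.
toy_counts = [12, 87, 301, 512, 640, 1024]
rate = truncation_rate(toy_counts)
print(f"{rate:.1%} of sequences exceed 512 tokens")
```

If the rate is small, truncating at 512 tokens loses little. If it is large, the tail of many reviews is being dropped and the results should be read with that in mind.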



## Wrap Encodings into a PyTorch Dataset

In [6]:
import torch

class ReviewDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        return {
            'input_ids': torch.tensor(self.encodings['input_ids'][idx]),
            'attention_mask': torch.tensor(self.encodings['attention_mask'][idx]),
            'labels': torch.tensor(self.labels[idx] - 1)  # shift Satisfaction from 1–5 → 0–4
        }

    def __len__(self):
        return len(self.labels)

# Create dataset objects
train_dataset = ReviewDataset(train_encodings, train_labels)
val_dataset = ReviewDataset(val_encodings, val_labels)
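
A quick way to confirm that the dataset wrapper behaves as intended is to pull one batch through a `DataLoader` and inspect the shapes and the shifted labels. The sketch below repeats the same layout in a stand-in class, `ToyReviewDataset`, so that it runs on its own. The token ids are arbitrary placeholders, not real tokenizer output.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyReviewDataset(Dataset):
    """Stand-in with the same layout as ReviewDataset above."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        return {
            "input_ids": torch.tensor(self.encodings["input_ids"][idx]),
            "attention_mask": torch.tensor(self.encodings["attention_mask"][idx]),
            "labels": torch.tensor(self.labels[idx] - 1),  # ratings 1-5 become 0-4
        }

    def __len__(self):
        return len(self.labels)

# Hand-written, already padded encodings (placeholder ids, illustration only).
toy_encodings = {
    "input_ids": [[101, 2307, 102, 0], [101, 2919, 4319, 102], [101, 7929, 102, 0]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0]],
}
toy_ratings = [5, 1, 3]

loader = DataLoader(ToyReviewDataset(toy_encodings, toy_ratings), batch_size=3)
batch = next(iter(loader))
print(batch["input_ids"].shape)
print(batch["labels"])
```

The same check on the real data would use `DataLoader(train_dataset, batch_size=...)`. The default collation works here because all sequences were padded to a common length during tokenization.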


In [7]:
!pip uninstall numpy -y
!pip install numpy==1.24.4

Found existing installation: numpy 1.24.4
Uninstalling numpy-1.24.4:
  Successfully uninstalled numpy-1.24.4
Collecting numpy==1.24.4
  Using cached numpy-1.24.4-cp310-cp310-macosx_10_9_x86_64.whl.metadata (5.6 kB)
Using cached numpy-1.24.4-cp310-cp310-macosx_10_9_x86_64.whl (19.8 MB)
Installing collected packages: numpy
Successfully installed numpy-1.24.4


## Loading the Pretrained DistilBERT for Classification

In [8]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=5  # five classes: Satisfaction 1–5, stored as labels 0–4
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
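
Once the model is fine-tuned, its outputs are class indices 0–4, so evaluation needs a step that turns logits into predictions and, for reporting, back into 1–5 ratings. The sketch below shows one possible metrics function in the shape that `Trainer` accepts through its `compute_metrics` argument. The choice of accuracy and macro F1, the helper `to_rating`, and the toy logits are assumptions, not something the notebook prescribes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Turn (logits, label_ids) into accuracy and macro-averaged F1."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # most likely class per review
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro", zero_division=0),
    }

def to_rating(class_ids):
    """Undo the 1-5 -> 0-4 shift applied in the dataset wrapper."""
    return np.asarray(class_ids) + 1

# Toy logits for illustration only: 4 reviews, 5 classes.
toy_logits = np.array([
    [0.1, 0.2, 0.3, 0.4, 2.0],
    [3.0, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 1.5, 0.2, 0.3],
    [0.1, 0.1, 0.2, 0.3, 1.0],
])
toy_label_ids = np.array([4, 0, 2, 3])

metrics = compute_metrics((toy_logits, toy_label_ids))
print(metrics)
print(to_rating(np.argmax(toy_logits, axis=-1)))
```

Macro F1 weights every rating equally, which may be more informative than accuracy if some ratings are rare.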
