# Drug Review Rating & Condition Prediction Using a Bidirectional LSTM

This notebook demonstrates how to analyze patient-written drug reviews from the Drugs.com dataset
using natural language processing (NLP) and deep learning.

We perform **two predictive tasks**:

### **1. Rating Prediction (Regression)**
Predict the numeric drug rating (1–10) directly from the review text.  
This models how satisfied a patient was with the medication based on their written experience.

### **2. Condition Prediction (Multi-Class Classification)**
Predict the medical condition associated with the review (e.g., *Depression*, *Pain*, *Birth Control*).  
This is a highly challenging task due to many condition categories, imbalance, and overlapping medical terminology.

Together, these tasks illustrate how neural networks can extract meaningful clinical and emotional signals from noisy, real-world patient narratives.

## 1. Load the Dataset

We begin by importing the raw training and test CSV files.  
The dataset contains:

- patient-written drug reviews  
- numeric ratings (1–10)  
- associated drug names and medical conditions  
- metadata such as review date and usefulness count

In [2]:
import pandas as pd

train_df = pd.read_csv("drugsComTrain_raw.csv")
test_df  = pd.read_csv("drugsComTest_raw.csv")

train_df.tail()

Unnamed: 0,uniqueID,drugName,condition,review,rating,date,usefulCount
161292,191035,Campral,Alcohol Dependence,"""I wrote my first report in Mid-October of 201...",10,31-May-15,125
161293,127085,Metoclopramide,Nausea/Vomiting,"""I was given this in IV before surgey. I immed...",1,1-Nov-11,34
161294,187382,Orencia,Rheumatoid Arthritis,"""Limited improvement after 4 months, developed...",2,15-Mar-14,35
161295,47128,Thyroid desiccated,Underactive Thyroid,"""I&#039;ve been on thyroid medication 49 years...",10,19-Sep-15,79
161296,215220,Lubiprostone,"Constipation, Chronic","""I&#039;ve had chronic constipation all my adu...",9,13-Dec-14,116


## 2. Text Cleaning

The raw review text contains artifacts such as HTML escape sequences (e.g., `&#039;`), inconsistent punctuation, and formatting noise that can interfere with tokenization.

We define a cleaning function to:

- decode HTML entities  
- lowercase all text  
- replace slashes with spaces  
- remove non-alphanumeric characters (while preserving apostrophes)  
- collapse multiple spaces into one  

This produces a cleaner, more consistent input representation for the model.

In [3]:
import re
import html

def clean_review(text):
    """
    Clean a single review:
    - Decode HTML entities (e.g., &#039; -> ')
    - Lowercase text
    - Replace slashes with spaces
    - Remove non-alphanumeric characters (except apostrophes)
    - Collapse multiple spaces
    """
    text = html.unescape(text)
    text = text.lower()
    text = text.replace("/", " ")
    text = re.sub(r"[^a-z0-9\s']", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text
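
To make the effect of each step concrete, the short sketch below applies the same operations one at a time to a made-up review string. The sample text is an assumption for illustration and is not taken from the dataset.

```python
import html
import re

# Hypothetical review text, invented for this example
sample = "I&#039;ve had NO side-effects/issues... 10/10!!"

decoded = html.unescape(sample)                    # &#039; becomes an apostrophe
lowered = decoded.lower()                          # normalize case
no_slash = lowered.replace("/", " ")               # "effects/issues" -> "effects issues"
stripped = re.sub(r"[^a-z0-9\s']", " ", no_slash)  # drop punctuation, keep apostrophes
cleaned = re.sub(r"\s+", " ", stripped).strip()    # collapse repeated whitespace

print(cleaned)  # i've had no side effects issues 10 10
```

Note that the hyphen in "side-effects" is replaced by a space, so hyphenated terms are split into separate tokens.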

### Apply Cleaning to the Dataset

We create a new column `clean_review` in both the training and test sets and apply the cleaning function to the raw `review` text.

In [4]:
train_df['clean_review'] = train_df['review'].astype(str).apply(clean_review)
test_df['clean_review']  = test_df['review'].astype(str).apply(clean_review)

## 3. Prepare Features and Labels (Rating Prediction)

For the rating prediction task, our target variable is the **numeric drug rating** on a 1–10 scale.

- Input features: cleaned review text (`clean_review`)
- Target labels: `rating` as a continuous numeric value

We extract the rating column from both the training and test sets as NumPy arrays.

In [5]:
y_train = train_df['rating'].astype(float).values
y_test  = test_df['rating'].astype(float).values

## 4. Tokenization and Sequence Preparation

Neural networks cannot operate directly on raw text, so we convert each cleaned review
into a sequence of integers:

- Each unique word receives an integer ID based on frequency.
- Words outside the most frequent `vocab_size` entries are mapped to a shared `<UNK>` token, which caps the vocabulary size.
- The model learns word meaning through embeddings.

### Steps Performed

1. **Fit the tokenizer on the training reviews**, building a word index.
2. **Convert each review into a sequence of token IDs**.
3. **Pad or truncate sequences** to a fixed maximum length  
   (required for batch training and LSTM layers).

We use a maximum sequence length of `max_len = 250`, which preserves enough context
for long medical reviews while keeping computation manageable.

In [6]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

X_train_text = train_df['clean_review']
X_test_text  = test_df['clean_review']

# Tokenizer
# Keep only the most frequent words from the training set (word indices below vocab_size);
# every other word is replaced with the <UNK> token.
vocab_size = 30000
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<UNK>")
tokenizer.fit_on_texts(X_train_text)

# Convert each review into a sequence of integers
train_seq = tokenizer.texts_to_sequences(X_train_text)
test_seq  = tokenizer.texts_to_sequences(X_test_text)

max_len = 250
X_train = pad_sequences(train_seq, maxlen=max_len, padding='post')
X_test  = pad_sequences(test_seq,  maxlen=max_len, padding='post')
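
The three steps above can be sketched in plain Python. This is a simplified stand-in for illustration, not the Keras implementation: the helper names (`build_word_index`, `encode`, `pad_or_truncate`) and the toy reviews are assumptions. It mirrors the settings used in this notebook: ID 0 is reserved for padding, ID 1 for unknown words, padding is added at the end (`padding='post'`), and overly long sequences keep their last `max_len` tokens, which is the `pad_sequences` default (`truncating='pre'`).

```python
from collections import Counter

PAD_ID = 0  # reserved for padding
OOV_ID = 1  # shared ID for out-of-vocabulary words


def build_word_index(texts, num_words):
    """Assign IDs by descending frequency, keeping only IDs below num_words."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        if rank >= num_words:
            break
        word_index[word] = rank
    return word_index


def encode(text, word_index):
    """Map each word to its ID, falling back to the OOV ID."""
    return [word_index.get(word, OOV_ID) for word in text.split()]


def pad_or_truncate(seq, max_len):
    """Keep the last max_len tokens, then right-pad with PAD_ID."""
    seq = seq[-max_len:]
    return seq + [PAD_ID] * (max_len - len(seq))


toy_reviews = ["the pill works", "the pill failed", "the rash"]
word_index = build_word_index(toy_reviews, num_words=4)
ids = encode("the new pill works", word_index)

print(word_index)               # {'the': 2, 'pill': 3}
print(ids)                      # [2, 1, 3, 1]
print(pad_or_truncate(ids, 6))  # [2, 1, 3, 1, 0, 0]
print(pad_or_truncate(ids, 3))  # [1, 3, 1]
```

Both "new" (never seen) and "works" (seen, but too rare for the vocabulary cap) end up with the same unknown-word ID.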

## 5. Build the Bidirectional LSTM Rating Model

To predict numeric drug ratings from review text, we build a neural model consisting of:

### **Embedding Layer**
Learns dense vector representations of words.  
Transforms each integer token into a 128-dimensional vector.

### **Bidirectional LSTM**
Reads the review text *forward and backward*, capturing long-term dependencies,
medical phrasing, and sentiment cues.

### **Dense Layers**
Further transform the LSTM output and map it to a single numeric rating.

### **Loss Function**
We use **Mean Squared Error (MSE)** since this is a regression task.
We also track **Mean Absolute Error (MAE)** for easier interpretability.

This architecture is a strong baseline for text regression on medium-sized datasets.

### 5.1 Bidirectional LSTM Model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

embedding_dim = 128 # Each word/token will be represented as a 128-dimensional vector

model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_len), # Dense Representation of Words
    Bidirectional(LSTM(128)), # Learns text sequence patterns
    Dropout(0.4),   # Randomly turns off 40% of neurons during training -> prevents overfitting
    Dense(64, activation='relu'), # Learns non-linear features
    Dropout(0.3), # Randomly turns off 30% of neurons during training -> prevents overfitting
    Dense(1)  # Final numeric rating regression output
])

model.compile(
    optimizer='adam',  # Adam (Adaptive Moment Estimation)
    loss='mse',        # Mean Squared Error for regression
    metrics=['mae']    # Mean Absolute Error for interpretability
)

model.summary()

### 5.2 Training with EarlyStopping and ModelCheckpoint

Training deep models for many epochs can lead to overfitting.
To ensure the model generalizes well, we use two callbacks:

- **EarlyStopping**  
  Monitors the validation loss and stops training when no further improvement is observed.
  This prevents overfitting and saves training time.

- **ModelCheckpoint**  
  Automatically saves the best-performing model weights to disk
  (based on the lowest validation loss).

Together, these callbacks ensure the final model represents the best validation performance
seen during training.


In [8]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
# EarlyStopping -> stops training when performance stops improving
# ModelCheckpoint -> saves the best model to disk during training

# Stop training when validation loss has not improved for 2 consecutive epochs
early_stop = EarlyStopping(
    monitor='val_loss',         # Metric to watch
    patience=2,                 # Number of epochs without improvement to tolerate before stopping
    restore_best_weights=True,  # Roll weights back to the epoch with the lowest validation loss
    verbose=1
)

# Save only the best model to disk
checkpoint = ModelCheckpoint(
    'best_bilstm_model.h5', # File path where best model will be saved
    monitor='val_loss',
    save_best_only=True,
    mode='min',
    verbose=1
)
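
The patience rule can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the Keras callback itself: `simulate_early_stopping` is a hypothetical helper, the loss values are invented, and options such as `min_delta` are ignored.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop early
    return len(val_losses), best_epoch    # ran through all epochs


# Hypothetical validation losses, invented for this example
hypothetical_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.60]

stop_epoch, best_epoch = simulate_early_stopping(hypothetical_losses, patience=2)
print(stop_epoch, best_epoch)  # 5 3
```

With `patience=2`, training stops after epoch 5 and epoch 3 is the one worth keeping; the better loss at epoch 6 is never reached. A larger patience trades extra training time for a chance to get past such plateaus.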

In [36]:
history = model.fit(
    X_train,
    y_train,
    epochs=10,
    batch_size=128,
    validation_split=0.1,
    shuffle=True,
    callbacks=[early_stop, checkpoint]
)

Epoch 1/10
Epoch 1: val_loss improved from inf to 5.15576, saving model to best_bilstm_model.h5
Epoch 2/10


Epoch 2: val_loss improved from 5.15576 to 4.23718, saving model to best_bilstm_model.h5
Epoch 3/10
Epoch 3: val_loss improved from 4.23718 to 4.01184, saving model to best_bilstm_model.h5
Epoch 4/10
Epoch 4: val_loss improved from 4.01184 to 3.62408, saving model to best_bilstm_model.h5
Epoch 5/10
Epoch 5: val_loss did not improve from 3.62408
Epoch 6/10
Epoch 6: val_loss improved from 3.62408 to 3.42918, saving model to best_bilstm_model.h5
Epoch 7/10
Epoch 7: val_loss did not improve from 3.42918
Epoch 8/10
Epoch 8: val_loss improved from 3.42918 to 3.38768, saving model to best_bilstm_model.h5
Epoch 9/10
Epoch 9: val_loss improved from 3.38768 to 3.35044, saving model to best_bilstm_model.h5
Epoch 10/10
Epoch 10: val_loss improved from 3.35044 to 3.29669, saving model to best_bilstm_model.h5


Optionally, load the best saved model later:

In [None]:
# from tensorflow.keras.models import load_model
# best_model = load_model('best_bilstm_model.h5')

## 6. Evaluate Model Performance

After training the Bidirectional LSTM, we evaluate its performance on the **unseen test set**.
Because this is a *regression* task (ratings range from 1 to 10), we use several complementary metrics:

### **Mean Absolute Error (MAE)**
Average absolute difference between predicted and true ratings.  
Easier to interpret: "on average, the model is off by X rating points."

### **Root Mean Squared Error (RMSE)**
Penalizes larger errors more strongly.  
Useful for detecting occasional large mistakes.

### **Pearson Correlation**
Measures how well the predicted ratings follow the same *trend* as the true ratings,  
even if they are slightly shifted up or down.

Together, these metrics give a robust understanding of model accuracy,
error magnitude, and predictive consistency.
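
As a small illustration of why correlation and error metrics complement each other, the sketch below uses made-up ratings in which every prediction is exactly one point too low. The helper functions and data are assumptions written for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mae(xs, ys):
    """Mean absolute error between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical ratings: each prediction is one point below the true value
true_ratings = [2.0, 4.0, 6.0, 8.0, 10.0]
shifted_preds = [r - 1.0 for r in true_ratings]

print(mae(true_ratings, shifted_preds))      # 1.0
print(pearson(true_ratings, shifted_preds))  # 1.0
```

The correlation is perfect even though every prediction is off by a full point, which is why correlation is reported alongside MAE and RMSE rather than instead of them.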

We load the model version that achieved the lowest validation loss during training.

In [9]:
from tensorflow.keras.models import load_model

best_model = load_model("best_bilstm_model.h5")

We produce model predictions on the padded test sequences.

In [10]:
y_pred = best_model.predict(X_test).flatten()

# Clip predictions to valid rating range
y_pred = np.clip(y_pred, 1, 10)



We then compute the regression metrics on the clipped predictions.

In [12]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
corr = np.corrcoef(y_test, y_pred)[0, 1]

print("Test MAE:", mae)
print("Test RMSE:", rmse)
print("Correlation:", corr)

Test MAE: 1.1949893416794524
Test RMSE: 1.8379271940707025
Correlation: 0.8314927405144703


### Interpretation of Results

The trained Bidirectional LSTM performs well on the unseen test set. The key regression metrics are:

- **Mean Absolute Error (MAE): 1.19**  
  On average, predictions differ from the true rating by about 1.19 points on the 1–10 scale.

- **Root Mean Squared Error (RMSE): 1.84**  
  RMSE penalizes larger errors more heavily. Because it is clearly higher than the MAE,
  some predictions miss by several points even though the typical error is close to one point.

- **Correlation (Pearson r): 0.83**  
  A correlation of 0.83 reflects a strong positive linear relationship between predicted and
  actual ratings: reviews with higher true ratings tend to receive higher predictions.

### 6.1 Example Predictions

To illustrate model behavior, we compare true and predicted ratings
for the first ten reviews in the test set.

In this small sample, most predictions land within about one point of the true rating, at both the high end (9–10) and the low end (2), where sentiment cues tend to be strong and clear.

Mid-range ratings are harder because the reviews often express mixed sentiment: Review 7 (true rating 6, predicted 3.00) is the largest miss in the sample.

Ten reviews are too few to draw conclusions about calibration, but the sample is broadly consistent with the test-set MAE of 1.19 and correlation of 0.83.

In [13]:
for i in range(10):
    print(f"Review {i+1}: True Rating = {y_test[i]}, Predicted = {y_pred[i]:.2f}")

Review 1: True Rating = 10.0, Predicted = 9.60
Review 2: True Rating = 8.0, Predicted = 9.20
Review 3: True Rating = 9.0, Predicted = 8.80
Review 4: True Rating = 9.0, Predicted = 8.89
Review 5: True Rating = 9.0, Predicted = 9.41
Review 6: True Rating = 4.0, Predicted = 4.45
Review 7: True Rating = 6.0, Predicted = 3.00
Review 8: True Rating = 9.0, Predicted = 9.24
Review 9: True Rating = 7.0, Predicted = 7.60
Review 10: True Rating = 2.0, Predicted = 2.16
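
As a quick worked example, the sketch below recomputes MAE and RMSE for just these ten reviews. The values are copied from the output above; the variable names are chosen for this illustration.

```python
import math

# (true rating, predicted rating) pairs copied from the ten sample outputs above
samples = [
    (10.0, 9.60), (8.0, 9.20), (9.0, 8.80), (9.0, 8.89), (9.0, 9.41),
    (4.0, 4.45), (6.0, 3.00), (9.0, 9.24), (7.0, 7.60), (2.0, 2.16),
]

abs_errors = [abs(t - p) for t, p in samples]
sample_mae = sum(abs_errors) / len(abs_errors)
sample_rmse = math.sqrt(sum(e ** 2 for e in abs_errors) / len(abs_errors))
worst = max(range(len(samples)), key=lambda i: abs_errors[i])

print(f"MAE  = {sample_mae:.2f}")   # MAE  = 0.68
print(f"RMSE = {sample_rmse:.2f}")  # RMSE = 1.07
print(f"Largest error: Review {worst + 1} ({abs_errors[worst]:.2f} points)")
# Largest error: Review 7 (3.00 points)
```

A single three-point miss is enough to push the RMSE well above the MAE, which is the same pattern seen in the full test-set metrics.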


### Summary

The Bidirectional LSTM model demonstrates strong performance on the task
of predicting numeric drug ratings from free-text patient reviews.

Its combination of low MAE, moderate RMSE, high correlation, and consistent
sample predictions suggests it captures sentiment and experiential cues 
effectively from medical review language.

Next, we extend the pipeline to multi-class **condition prediction**.

## 7. Predicting Medical Conditions from Review Text

In addition to predicting numeric drug ratings, we extend the model to perform
**multi-class classification**, where the goal is to infer the patient's medical 
condition based solely on their written review text.

This is a significantly more challenging task due to:

- A large number of possible condition categories
- Highly imbalanced class frequencies
- Overlapping language across many conditions
- Variability in patient vocabulary and writing style

Despite these difficulties, a Bidirectional LSTM can still learn 
useful patterns and reach an accuracy far above chance.

### 7.1 Encode Condition Labels

To train a multi-class classifier, each condition must be converted into an
integer label. We use `LabelEncoder` to map each unique condition in the 
training set to an integer class ID.

However, some conditions appear **only** in the test set and not in training.
To avoid unseen-label errors, we replace all such test-set conditions with 
the special label `"Unknown"`.

This ensures the model never encounters a condition class that it was not 
trained to recognize.

In [14]:
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Convert condition to string. Note that astype(str) turns missing values
# into the literal string "nan", so they are kept as their own class.
cond_train = train_df["condition"].astype(str)
cond_test  = test_df["condition"].astype(str)

# --- STEP 1: Identify unseen labels in test ---
unseen_labels = set(cond_test) - set(cond_train)

# Replace unseen test labels with "Unknown"
cond_test_fixed = cond_test.replace(list(unseen_labels), "Unknown")

# --- STEP 2: Make sure "Unknown" is a class the encoder knows about ---
# The extra entry is only used for fitting; cond_train itself is left unchanged
fit_labels = cond_train
if "Unknown" not in cond_train.values:
    fit_labels = pd.concat([cond_train, pd.Series(["Unknown"])])

# --- STEP 3: Fit the label encoder on the UPDATED training labels ---
label_encoder = LabelEncoder()
label_encoder.fit(fit_labels)

# --- STEP 4: Transform both datasets ---
y_train_cond = label_encoder.transform(cond_train)  # original training rows only
y_test_cond  = label_encoder.transform(cond_test_fixed)

num_classes = len(label_encoder.classes_)
print("Number of condition classes:", num_classes)


Number of condition classes: 886
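
The unseen-label handling can be seen on a toy example. This is a plain-Python sketch with invented condition lists, not the scikit-learn code used above; it only mirrors the fact that `LabelEncoder` assigns IDs in sorted class order.

```python
# Hypothetical toy labels, invented for this example
train_conditions = ["Pain", "Acne", "Pain", "Depression"]
test_conditions = ["Acne", "Gout", "Pain"]  # "Gout" never appears in training

known = set(train_conditions)
test_fixed = [c if c in known else "Unknown" for c in test_conditions]

# Sorted class list with "Unknown" added, mirroring LabelEncoder's sorted ordering
classes = sorted(known | {"Unknown"})
class_to_id = {name: i for i, name in enumerate(classes)}

y_train_toy = [class_to_id[c] for c in train_conditions]
y_test_toy = [class_to_id[c] for c in test_fixed]

print(classes)      # ['Acne', 'Depression', 'Pain', 'Unknown']
print(test_fixed)   # ['Acne', 'Unknown', 'Pain']
print(y_train_toy)  # [2, 0, 2, 1]
print(y_test_toy)   # [0, 3, 2]
```

Because no training example carries the "Unknown" label, the model has no opportunity to learn it, so test reviews mapped to "Unknown" will effectively always count as errors.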


### 7.2 Build the Condition Classification Model

We reuse the tokenized review text as input features, but modify the model's 
output layer and loss function to support multi-class classification:

- The final layer is a **Dense(num_classes, softmax)**, producing a probability
  distribution over all possible conditions.
- We use **sparse_categorical_crossentropy** as the loss function, since 
  the condition labels are integer-encoded.

This model learns textual patterns that are indicative of specific conditions,
such as symptom descriptions, drug usage context, or medical terminology.
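
To show what the softmax output and the sparse cross-entropy loss compute, here is a minimal plain-Python sketch for a single example with three classes. The logits are made up, and the helper functions are written for this illustration rather than taken from Keras.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probs, true_class_id):
    """Negative log-probability assigned to the true class (integer label)."""
    return -math.log(probs[true_class_id])


toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probs = softmax(toy_logits)
loss = sparse_categorical_crossentropy(probs, true_class_id=0)

print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(round(loss, 3))                # 0.417

# Loss of a uniform guess over 886 classes, as a reference point
uniform_loss = math.log(886)
print(round(uniform_loss, 2))        # 6.79
```

The last value gives a rough yardstick for reading training logs: a model that spreads its probability evenly over 886 classes would have a loss of about 6.79, so any validation loss well below that reflects learned structure.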

In [15]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

embedding_dim = 128
max_len = 250

cond_model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_len),
    Bidirectional(LSTM(64)),
    Dropout(0.4),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(num_classes, activation='softmax')  # multi-class output
])

cond_model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

cond_model.summary()


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 250, 128)          3840000   
                                                                 
 bidirectional_1 (Bidirecti  (None, 128)               98816     
 onal)                                                           
                                                                 
 dropout_2 (Dropout)         (None, 128)               0         
                                                                 
 dense_2 (Dense)             (None, 64)                8256      
                                                                 
 dropout_3 (Dropout)         (None, 64)                0         
                                                                 
 dense_3 (Dense)             (None, 886)               57590     
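
The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the layer sizes, using the standard LSTM count of four gates per direction, each with input weights, recurrent weights, and a bias. Variable names are chosen for this example.

```python
vocab_size, embedding_dim = 30000, 128
lstm_units, dense_units, num_classes = 64, 64, 886

# Embedding: one vector of size embedding_dim per vocabulary index
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bilstm_params = 2 * one_direction  # forward and backward copies

# Dense layers: weights plus biases; the BiLSTM output has 2 * lstm_units features
dense_params = (2 * lstm_units) * dense_units + dense_units
output_params = dense_units * num_classes + num_classes

print(embedding_params)  # 3840000
print(bilstm_params)     # 98816
print(dense_params)      # 8256
print(output_params)     # 57590
```

Almost all parameters sit in the embedding layer, so the vocabulary size and embedding dimension dominate the model's size.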
                                                      

### 7.3 Train the Condition Classifier

As before, we apply EarlyStopping and ModelCheckpoint to prevent overfitting
and ensure that we keep the best-performing model based on validation loss.

Condition prediction is more complex than rating prediction, so we allow up to
20 epochs; in the run below, validation loss is still improving at epoch 11.

In [16]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

early_stop_cond = EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True,
    verbose=1
)

checkpoint_cond = ModelCheckpoint(
    'best_condition_model.h5',
    monitor='val_loss',
    save_best_only=True,
    mode='min',
    verbose=1
)

history_cond = cond_model.fit(
    X_train,
    y_train_cond,
    epochs=20,
    batch_size=256,
    validation_split=0.1,
    shuffle=True,
    callbacks=[early_stop_cond, checkpoint_cond]
)


Epoch 1/20

Epoch 1: val_loss improved from inf to 3.39225, saving model to best_condition_model.h5
Epoch 2/20


Epoch 2: val_loss improved from 3.39225 to 2.80956, saving model to best_condition_model.h5
Epoch 3/20
Epoch 3: val_loss improved from 2.80956 to 2.46480, saving model to best_condition_model.h5
Epoch 4/20
Epoch 4: val_loss improved from 2.46480 to 2.17121, saving model to best_condition_model.h5
Epoch 5/20
Epoch 5: val_loss improved from 2.17121 to 2.01351, saving model to best_condition_model.h5
Epoch 6/20
Epoch 6: val_loss improved from 2.01351 to 1.86189, saving model to best_condition_model.h5
Epoch 7/20
Epoch 7: val_loss improved from 1.86189 to 1.78007, saving model to best_condition_model.h5
Epoch 8/20
Epoch 8: val_loss improved from 1.78007 to 1.72449, saving model to best_condition_model.h5
Epoch 9/20
Epoch 9: val_loss improved from 1.72449 to 1.70568, saving model to best_condition_model.h5
Epoch 10/20
Epoch 10: val_loss improved from 1.70568 to 1.70239, saving model to best_condition_model.h5
Epoch 11/20
Epoch 11: val_loss improved from 1.70239 to 1.65694, saving model to b

### 7.4 Evaluate Condition Prediction Accuracy

We evaluate the trained classifier on the held-out test set.  
Because this is a large multi-class problem with class imbalance, accuracy 
alone does not tell the full story — but it provides a clear baseline.
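
A toy example shows why accuracy can be misleading under class imbalance. The labels below are invented, and macro-averaged recall is used here only as one possible complementary metric; it is not computed elsewhere in this notebook.

```python
# Hypothetical imbalanced labels: 8 reviews of class "A", 2 of class "B"
y_true_toy = ["A"] * 8 + ["B"] * 2
y_pred_toy = ["A"] * 10  # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)


def recall(label):
    """Share of examples with this true label that were predicted correctly."""
    predictions = [p for t, p in zip(y_true_toy, y_pred_toy) if t == label]
    return sum(p == label for p in predictions) / len(predictions)


labels = sorted(set(y_true_toy))
macro_recall = sum(recall(label) for label in labels) / len(labels)

print(accuracy)      # 0.8
print(macro_recall)  # 0.5
```

The majority-class model looks respectable on accuracy while never recognizing class "B", which is the kind of failure a single accuracy number hides.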

In [17]:
import numpy as np
from tensorflow.keras.models import load_model

best_cond_model = load_model('best_condition_model.h5')

y_pred_proba = best_cond_model.predict(X_test)
y_pred_cond  = np.argmax(y_pred_proba, axis=1)

test_acc = np.mean(y_pred_cond == y_test_cond)
print("Test accuracy:", test_acc)

# Example: decode predicted condition names
for i in range(5):
    true_label = label_encoder.inverse_transform([y_test_cond[i]])[0]
    pred_label = label_encoder.inverse_transform([y_pred_cond[i]])[0]
    print(f"Review {i+1}: TRUE = {true_label}, PRED = {pred_label}")


Test accuracy: 0.654056466912175
Review 1: TRUE = Depression, PRED = Depression
Review 2: TRUE = Crohn's Disease, Maintenance, PRED = Crohn's Disease, Maintenance
Review 3: TRUE = Urinary Tract Infection, PRED = Pain
Review 4: TRUE = Weight Loss, PRED = Obesity
Review 5: TRUE = Birth Control, PRED = Birth Control


### 7.4.1 Example Predictions

To better understand how the condition classifier behaves on real reviews,
we display a small set of random examples from the test set, showing:

- the raw review text  
- the true condition label  
- the model’s predicted condition  

These qualitative examples help illustrate strengths and weaknesses that
numerical accuracy alone cannot show.

In [32]:
import random # Python’s built-in random module so we can randomly choose test examples

# pick 5 random indices from the test set
indices = random.sample(range(len(test_df)), 5)

for idx in indices:
    review_text = test_df.iloc[idx]["review"]
    true_cond = test_df.iloc[idx]["condition"]
    
    # prepare the review for prediction
    cleaned = clean_review(str(review_text))
    seq = tokenizer.texts_to_sequences([cleaned])
    seq = pad_sequences(seq, maxlen=max_len, padding="post")  # same settings as the training data

    # model prediction, using the checkpointed best model loaded in 7.4
    pred_class_id = np.argmax(best_cond_model.predict(seq, verbose=0))
    pred_cond = label_encoder.inverse_transform([pred_class_id])[0]
    
    print("------------------------------------------------")
    print(f"Review: {review_text[:500]}")           # limit to 500 chars
    print(f"True Condition:      {true_cond}")
    print(f"Predicted Condition: {pred_cond}")
    print("------------------------------------------------\n")

------------------------------------------------
Review: "I deployed to Iraq in 2012. I had my grand mal seizure at the age of 21 after a mission. Was sent home for testing. I was put on other sezuire meds that absolutely sucked. Then my neurologist prescribed me Topiramate. It was a rough first few weeks with new tastes and tingly hands and feet but slowly the almost constant minor seziures I had in my right arm were gone. I love this medicine. I haven&#039;t had a grand mal sense October 2013."
True Condition:      Seizures
Predicted Condition: Seizures
------------------------------------------------

------------------------------------------------
Review: "Propranolol has been very effective for my migraines.  Though I probably am on too low a dose at 60mg a day.  As far as side effects, I have had none.  And I have been on it for years and haven&#039;t gained one ounce."
True Condition:      Migraine Prevention
Predicted Condition: Migraine Prevention
----------------------------

### 7.5 Confusion Matrix for Top 20 Conditions

To better understand where the model performs well or struggles, we compute
a confusion matrix for the **20 most common conditions** in the test set.

This highlights which conditions the model confuses with one another.

In [18]:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Unique classes present in the TEST set
test_classes = np.unique(y_test_cond)
test_class_names = label_encoder.inverse_transform(test_classes)

# Confusion matrix limited to only classes present in test set
cm = confusion_matrix(y_test_cond, y_pred_cond, labels=test_classes)

cm_df = pd.DataFrame(
    cm,
    index=test_class_names,
    columns=test_class_names
)

# Show top 20 most common test conditions
cond_counts = np.bincount(y_test_cond)
top_idx = np.argsort(cond_counts)[::-1][:20]
top_labels = label_encoder.inverse_transform(top_idx)

cm_top = cm_df.loc[top_labels, top_labels]
cm_top


Unnamed: 0,Birth Control,Depression,Pain,Anxiety,Acne,Bipolar Disorde,Weight Loss,Insomnia,Obesity,ADHD,Emergency Contraception,Vaginal Yeast Infection,"Diabetes, Type 2",High Blood Pressure,Smoking Cessation,Abnormal Uterine Bleeding,Bowel Preparation,Migraine,ibromyalgia,Anxiety and Stress
Birth Control,9118,62,2,6,120,0,3,0,0,1,7,4,6,0,0,219,0,0,0,4
Depression,13,2392,31,234,0,117,2,33,5,23,1,0,3,11,5,0,0,2,12,90
Pain,5,35,1725,18,0,0,0,11,3,7,0,1,4,5,0,0,2,24,33,2
Anxiety,3,227,39,1344,1,20,0,26,0,7,0,0,2,4,3,0,0,1,8,38
Acne,99,6,4,1,1683,0,1,0,1,1,0,0,2,2,0,0,0,0,0,0
Bipolar Disorde,0,163,10,73,0,1007,2,14,3,10,0,0,1,5,0,0,0,1,1,0
Weight Loss,12,41,1,0,4,1,937,2,195,5,0,2,15,5,0,0,0,0,4,0
Insomnia,0,42,31,72,0,19,0,956,1,3,1,1,1,7,2,0,0,7,9,3
Obesity,1,38,4,0,0,3,620,6,448,7,2,0,15,2,4,0,0,0,1,1
ADHD,4,90,7,16,3,11,3,3,4,952,0,1,0,0,1,0,0,0,0,0


### Interpretation

Despite the large number of classes and high variability in patient-written reviews,
the Bidirectional LSTM achieves solid accuracy (~65%).
The confusion matrix shows strong performance for common conditions such as depression,
pain, anxiety, birth control, and ADHD. Errors concentrate in closely related pairs:
Depression and Anxiety are often mistaken for each other, and Obesity reviews are
predicted as Weight Loss (620) more often than as Obesity (448).

Rarer conditions are naturally harder to predict, partly due to class imbalance
and overlapping language across conditions.
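
To put numbers on a few rows, the sketch below computes the share of each row's counts that falls on the diagonal. The counts are copied from the matrix output above. Because only the 20 most common conditions appear as columns, these shares ignore predictions outside that set and are therefore upper bounds on per-class recall.

```python
# Rows copied from the confusion matrix output above (20 most common conditions as columns)
rows = {
    "Birth Control": [9118, 62, 2, 6, 120, 0, 3, 0, 0, 1, 7, 4, 6, 0, 0, 219, 0, 0, 0, 4],
    "Weight Loss":   [12, 41, 1, 0, 4, 1, 937, 2, 195, 5, 0, 2, 15, 5, 0, 0, 0, 0, 4, 0],
    "Obesity":       [1, 38, 4, 0, 0, 3, 620, 6, 448, 7, 2, 0, 15, 2, 4, 0, 0, 0, 1, 1],
}
# Column position of each condition's own (diagonal) cell
diagonal_column = {"Birth Control": 0, "Weight Loss": 6, "Obesity": 8}

share_correct = {
    name: counts[diagonal_column[name]] / sum(counts)
    for name, counts in rows.items()
}

for name, share in share_correct.items():
    print(f"{name}: {share:.3f}")
# Birth Control: 0.955
# Weight Loss: 0.766
# Obesity: 0.389
```

Most Obesity reviews are assigned to Weight Loss, which suggests the two labels describe largely the same patient language.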

# 8. Final Summary

This notebook explored two predictive modeling tasks using patient-written
drug reviews from the Drugs.com dataset:

---

## **1. Predicting Numeric Drug Ratings (Regression)**

We trained a Bidirectional LSTM model to estimate the numeric rating (1–10)
based solely on review text.

**Performance on the test set:**
- **MAE:** ~1.19  
- **RMSE:** ~1.84  
- **Correlation:** ~0.83  

The model shows strong alignment with human-provided ratings and captures
sentiment, treatment success, side-effects, and emotional descriptions.
Examples confirm that predictions closely match true ratings for both positive
and negative reviews.

---

## **2. Predicting Medical Conditions (Multi-Class Classification)**

We built a second Bidirectional LSTM classifier to infer the patient’s medical
condition from the review text.

Challenges included:
- Large number of classes  
- Class imbalance  
- Overlapping medical vocabulary  
- Rare conditions  

Despite this, the model reached about 65% test accuracy across 886 condition
classes, far above chance level. Example predictions show that the model captures
contextual signals such as symptoms, treatment goals, and patient history.

---

## **Key Takeaways**

- Text cleaning and preprocessing are essential for working with medical
  user-generated content.
- Bidirectional LSTMs are effective at capturing contextual meaning in
  medical reviews, enabling both regression and classification tasks.
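- Numeric rating prediction is easier than condition prediction, in part because
  the target is a single ordered scale where near misses incur only a small error,
  rather than one of 886 imbalanced classes.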
- Condition prediction performance indicates that patient language contains
  strong, learnable signals about medical context and treatment intent.

---