<a href="https://colab.research.google.com/github/ShFANI/ShFANI.github.io/blob/main/Healthcare_Recommender_CF_DrugsCom.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Collaborative Filtering Recommender System — Drugs.com Review Data

This Colab notebook builds a **collaborative filtering** recommender system using **real-world drug review ratings**.
We’ll treat:

- **User** = *medical condition* (e.g., “Depression”, “Birth Control”, “High Blood Pressure”)
- **Item** = *drug name*
- **Rating** = average patient satisfaction rating (1–10)

Then we’ll train a **matrix-factorization** model (latent factors) using `TruncatedSVD` on a sparse condition×drug matrix.

> **Important**: This is a data science demo, **not medical advice**. Patient reviews are biased, incomplete, and not a substitute for clinician guidance.
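To make the user/item/rating mapping concrete, here is a minimal sketch on a handful of made-up reviews. The drug names `DrugA`, `DrugB`, `DrugC` and the ratings are hypothetical, not taken from the dataset; the point is only to show how reviews collapse into one average rating per condition and drug.

```python
import pandas as pd

# Hypothetical toy reviews (illustrative values, not from the Drugs.com data)
toy = pd.DataFrame({
    "condition": ["Depression", "Depression", "Depression", "Anxiety", "Anxiety"],
    "drugName":  ["DrugA", "DrugA", "DrugB", "DrugA", "DrugC"],
    "rating":    [8, 6, 9, 4, 10],
})

# One row per condition ("user"), one column per drug ("item"),
# and the mean rating in each cell. Unobserved pairs come out as NaN.
toy_matrix = toy.pivot_table(
    index="condition", columns="drugName", values="rating", aggfunc="mean"
)
print(toy_matrix)
```

The two `Depression`/`DrugA` reviews (8 and 6) average to 7, while pairs with no reviews stay empty. Those empty cells are what the recommender later tries to fill in.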



In [1]:
# === 0) Runtime check ===
import sys, platform
print("Python:", sys.version)
print("Platform:", platform.platform())

Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
Platform: Linux-6.6.105+-x86_64-with-glibc2.35


## 1) Imports
We’ll use:
- `pandas` for data wrangling
- `scipy.sparse` for large sparse matrices
- `sklearn` `TruncatedSVD` for matrix factorization

In [2]:
import zipfile, io, os, re
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_squared_error, mean_absolute_error
from scipy import sparse

pd.set_option("display.max_colwidth", 120)

## 2) Download and load the dataset
This dataset is distributed as a zip file containing:

- `drugsComTrain_raw.tsv`
- `drugsComTest_raw.tsv`

Each row is a review with a rating (1–10), a drug name, and a condition.

In [3]:
# Download + unzip
!wget -q -O drugsCom_raw.zip "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip -q -o drugsCom_raw.zip

# Load TSVs
train_path = "drugsComTrain_raw.tsv"
test_path  = "drugsComTest_raw.tsv"

train_df = pd.read_csv(train_path, sep="\t")
test_df  = pd.read_csv(test_path,  sep="\t")

print("Train shape:", train_df.shape)
print("Test  shape:", test_df.shape)
train_df.head()

Train shape: (161297, 7)
Test  shape: (53766, 7)


Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil""",9.0,"May 20, 2012",27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he sta...",8.0,"April 27, 2010",192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 d...",5.0,"December 14, 2009",17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8...",8.0,"November 3, 2015",10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around. I feel healthier, I&#039;m excelling at my job and I always have mo...",9.0,"November 27, 2016",37
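The review text in the preview above contains HTML entities such as `&#039;` and is wrapped in literal double quotes. This notebook does not use the `review` column, but if you want readable text, a small helper along these lines should work. The function name `unescape_review` is hypothetical.

```python
import html

def unescape_review(text: str) -> str:
    """Decode HTML entities, then strip whitespace and the wrapping double quotes."""
    return html.unescape(text).strip().strip('"')

sample = '"I&#039;m glad I went with the patch"'
print(unescape_review(sample))
```

On a DataFrame this could be applied as `train_df["review"].map(unescape_review)`, assuming the column has no missing values.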


## 3) Basic cleaning
We’ll:
- drop rows missing `drugName`, `condition`, or `rating`
- normalize condition strings (strip whitespace)
- keep rating in numeric form

In [4]:
def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Standard columns in this dataset include: 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'
    df = df.dropna(subset=["drugName", "condition", "rating"])
    df["drugName"] = df["drugName"].astype(str).str.strip()
    df["condition"] = df["condition"].astype(str).str.strip()
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
    df = df.dropna(subset=["rating"])
    # Filter trivial noise: drop blank-like conditions (length <= 1).
    # Note: labels such as "Not Listed / Othe" are longer than that and pass through.
    df = df[df["condition"].str.len() > 1]
    return df

train_df = clean(train_df)
test_df  = clean(test_df)

print("Clean train:", train_df.shape, "Clean test:", test_df.shape)
train_df[["drugName","condition","rating","usefulCount"]].head()

Clean train: (160398, 7) Clean test: (53471, 7)


Unnamed: 0,drugName,condition,rating,usefulCount
0,Valsartan,Left Ventricular Dysfunction,9.0,27
1,Guanfacine,ADHD,8.0,192
2,Lybrel,Birth Control,5.0,17
3,Ortho Evra,Birth Control,8.0,10
4,Buprenorphine / naloxone,Opiate Dependence,9.0,37


## 4) Build a condition×drug ratings table
Because the raw dataset does not have persistent user IDs, we’ll use **condition** as the “user”.

For each *(condition, drug)* pair we compute:
- `n_reviews`: number of reviews
- `avg_rating`: average rating

Then we’ll filter to keep the matrix reasonably dense for a clean recommender demo.

In [5]:
# Combine the original train+test files so counts and averages use every review.
# The aggregated (condition, drug) pairs are split into train/test in section 5.
all_df = pd.concat([train_df, test_df], ignore_index=True)

agg = (
    all_df.groupby(["condition", "drugName"], as_index=False)
          .agg(n_reviews=("rating", "size"),
               avg_rating=("rating", "mean"),
               avg_useful=("usefulCount", "mean"))
)

print("Condition-drug pairs:", agg.shape)
agg.sort_values("n_reviews", ascending=False).head(10)

Condition-drug pairs: (9446, 5)


Unnamed: 0,condition,drugName,n_reviews,avg_rating,avg_useful
2185,Birth Control,Etonogestrel,4394,5.829768,6.789258
2182,Birth Control,Ethinyl estradiol / norethindrone,3081,5.646868,7.884778
2215,Birth Control,Levonorgestrel,2884,7.038835,8.352288
2251,Birth Control,Nexplanon,2883,5.64967,6.016649
2180,Birth Control,Ethinyl estradiol / levonorgestrel,2107,5.867584,7.288562
2183,Birth Control,Ethinyl estradiol / norgestimate,2097,5.87649,8.472103
3959,Emergency Contraception,Levonorgestrel,1651,8.472441,13.55421
9276,Weight Loss,Phentermine,1650,8.769697,31.258182
2196,Birth Control,Implanon,1496,6.175134,8.346257
9159,Vaginal Yeast Infection,Miconazole,1338,2.988042,6.331091


### 4.1 Filter to top conditions/drugs (for speed + density)
Real-world healthcare data is often *very sparse*. To keep this notebook fast and the results interpretable, we’ll:

- keep conditions with at least `min_condition_reviews` total reviews
- keep drugs with at least `min_drug_reviews` total reviews
- keep only *(condition, drug)* pairs with at least `min_pair_reviews` reviews

You can loosen/tighten these thresholds depending on your runtime.

In [6]:
min_condition_reviews = 300   # total reviews across all drugs
min_drug_reviews      = 300   # total reviews across all conditions
min_pair_reviews      = 5     # per (condition, drug) pair

cond_counts = all_df["condition"].value_counts()
drug_counts = all_df["drugName"].value_counts()

keep_conds = set(cond_counts[cond_counts >= min_condition_reviews].index)
keep_drugs = set(drug_counts[drug_counts >= min_drug_reviews].index)

filtered = agg[
    (agg["condition"].isin(keep_conds)) &
    (agg["drugName"].isin(keep_drugs)) &
    (agg["n_reviews"] >= min_pair_reviews)
].copy()

print("After filtering pairs:", filtered.shape)
print("Unique conditions:", filtered["condition"].nunique())
print("Unique drugs:", filtered["drugName"].nunique())

filtered.sample(5, random_state=7)

After filtering pairs: (616, 5)
Unique conditions: 75
Unique drugs: 155


Unnamed: 0,condition,drugName,n_reviews,avg_rating,avg_useful
6515,Obesity,Adipex-P,90,8.944444,49.922222
2183,Birth Control,Ethinyl estradiol / norgestimate,2097,5.87649,8.472103
3344,Depression,Cymbalta,411,6.489051,49.425791
8443,Sinusitis,Azithromycin,93,7.129032,26.473118
9361,ibromyalgia,Escitalopram,15,8.333333,30.533333
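To see how dense the filtered matrix is, we can divide the number of observed pairs by the number of cells. A minimal sketch using the counts printed above (616 pairs, 75 conditions, 155 drugs); the helper name `matrix_density` is hypothetical.

```python
def matrix_density(n_pairs: int, n_rows: int, n_cols: int) -> float:
    """Fraction of cells in an n_rows x n_cols matrix that hold an observed rating."""
    return n_pairs / (n_rows * n_cols)

# Counts taken from the filtering output above
density = matrix_density(n_pairs=616, n_rows=75, n_cols=155)
print(f"Density: {density:.1%}")
```

Even after filtering, only about 5% of the condition×drug cells are observed, so most predictions are extrapolations.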


## 5) Train/test split on observed pairs
We’ll split the **observed (condition, drug)** pairs into train/test.

Why?
- For collaborative filtering, we train on known interactions and evaluate how well we predict held-out ones.

We evaluate with:
- **RMSE** and **MAE** on held-out ratings.

> Note: This is a standard offline evaluation for explicit-feedback recommenders.

In [7]:
# Split condition-drug pairs
train_pairs, test_pairs = train_test_split(filtered, test_size=0.2, random_state=42)

print("Train pairs:", train_pairs.shape)
print("Test  pairs:", test_pairs.shape)

Train pairs: (492, 5)
Test  pairs: (124, 5)
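A random split over pairs can leave some test conditions or drugs with no training pairs at all (a cold-start case), and for those the model has nothing to learn from. The sketch below counts such cases. The helper name `cold_start_report` and the toy frames are hypothetical; in this notebook you would call it with `train_pairs` and `test_pairs`.

```python
import pandas as pd

def cold_start_report(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Count test pairs whose condition or drug never appears in the training pairs."""
    unseen_cond = ~test["condition"].isin(set(train["condition"]))
    unseen_drug = ~test["drugName"].isin(set(train["drugName"]))
    return {
        "test_pairs": len(test),
        "unseen_condition": int(unseen_cond.sum()),
        "unseen_drug": int(unseen_drug.sum()),
        "either_unseen": int((unseen_cond | unseen_drug).sum()),
    }

# Hypothetical toy split (placeholder drug names)
toy_train = pd.DataFrame({"condition": ["Pain", "Pain", "Acne"],
                          "drugName":  ["DrugA", "DrugB", "DrugC"]})
toy_test = pd.DataFrame({"condition": ["Pain", "Insomnia"],
                         "drugName":  ["DrugC", "DrugA"]})
report = cold_start_report(toy_train, toy_test)
print(report)
```

If `either_unseen` turns out to be large on the real split, the RMSE and MAE partly measure how well the global mean fits those pairs rather than what the latent factors learned.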


## 6) Create the sparse rating matrix
We map conditions and drugs to integer IDs, then build a sparse matrix `R`:

- rows = conditions
- cols = drugs
- values = average rating (1–10)

We’ll train `TruncatedSVD` (matrix factorization) on the **train matrix**.

In [8]:
# Encode IDs
cond2id = {c:i for i, c in enumerate(sorted(filtered["condition"].unique()))}
drug2id = {d:i for i, d in enumerate(sorted(filtered["drugName"].unique()))}
id2cond = {i:c for c,i in cond2id.items()}
id2drug = {i:d for d,i in drug2id.items()}

n_conds = len(cond2id)
n_drugs = len(drug2id)

def to_sparse(df_pairs: pd.DataFrame) -> sparse.csr_matrix:
    rows = df_pairs["condition"].map(cond2id).to_numpy()
    cols = df_pairs["drugName"].map(drug2id).to_numpy()
    vals = df_pairs["avg_rating"].to_numpy().astype(np.float32)
    return sparse.csr_matrix((vals, (rows, cols)), shape=(n_conds, n_drugs))

R_train = to_sparse(train_pairs)
R_test  = to_sparse(test_pairs)

print("R_train shape:", R_train.shape, "nnz:", R_train.nnz)
print("R_test  shape:", R_test.shape,  "nnz:", R_test.nnz)

R_train shape: (75, 155) nnz: 492
R_test  shape: (75, 155) nnz: 124


## 7) Train the matrix-factorization model (Collaborative Filtering)

**Idea**: approximate the sparse matrix with low-rank factors:

\[
R \approx U \Sigma V^\top
\]

`TruncatedSVD` learns latent factors that capture patterns like:
- drugs often co-rated highly for similar conditions
- conditions that share “preferred” drugs cluster in latent space

We’ll also anchor the reconstruction to the training data:
- shift the reconstructed matrix so that its overall mean matches the **global mean** of the training ratings.
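Before running this on the real matrix, it may help to see how the formula maps onto scikit-learn's objects. In the sketch below (a random toy matrix, with hypothetical names prefixed `toy_`), `transform` returns the product of the input with `components_.T`, which plays the role of $U\Sigma$, and `components_` plays the role of $V^\top$. Multiplying the two gives the low-rank approximation.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Toy 6x8 "ratings" matrix; zeros stand in for unobserved pairs
rng = np.random.default_rng(0)
dense = rng.integers(1, 11, size=(6, 8)).astype(np.float64)
dense[rng.random(dense.shape) < 0.5] = 0.0
toy_R = sparse.csr_matrix(dense)

toy_svd = TruncatedSVD(n_components=3, random_state=0)
toy_svd.fit(toy_R)

toy_U = toy_svd.transform(toy_R)   # (6 x 3): toy_R @ components_.T, i.e. U scaled by Sigma
toy_Vt = toy_svd.components_       # (3 x 8): V transposed

# Rank-3 approximation of the toy matrix
approx = toy_U @ toy_Vt

# The manual product matches scikit-learn's own inverse_transform
print(np.allclose(approx, toy_svd.inverse_transform(toy_U)))
```

Because `transform` already includes the singular values, no separate $\Sigma$ factor is needed when reconstructing.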

In [9]:
# Hyperparameters
n_components = 40  # latent factors; the matrix has only 75 conditions, so its rank is at most 75 (try 10–60)
random_state = 42

svd = TruncatedSVD(n_components=n_components, random_state=random_state)
svd.fit(R_train)

# Latent factors
U = svd.transform(R_train)          # (conditions x k); equals R_train @ Vt.T, i.e. already scaled by Σ
Vt = svd.components_                # (k x drugs)

# Reconstruct predictions for ALL condition-drug pairs
global_mean = train_pairs["avg_rating"].mean()

R_pred = (U @ Vt)  # (conditions x drugs) approximate ratings
R_pred = R_pred + (global_mean - R_pred.mean())  # shift so the overall mean matches global_mean

# Clip to rating scale [1, 10]
R_pred = np.clip(R_pred, 1, 10)

R_pred.shape, global_mean

((75, 155), np.float64(7.2025727033675775))
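One caveat about the cell above: `TruncatedSVD` treats every unstored cell of the sparse matrix as a rating of 0, not as "unknown". A common variation, sketched here as an alternative and not as what this notebook does, subtracts the mean of the observed ratings before factorizing, so that an empty cell means "average", and adds it back afterwards. The function name `centered_svd_predict` is hypothetical, and whether this improves the results on this dataset is untested.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

def centered_svd_predict(R: sparse.csr_matrix, k: int, seed: int = 42) -> np.ndarray:
    """Factorize mean-centered observed ratings and return a dense prediction matrix.

    Assumes every stored entry of R is an observed rating (no explicit zeros).
    """
    R = R.tocsr().astype(np.float64)
    mu = R.data.mean()                 # mean of observed ratings only
    R_centered = R.copy()
    R_centered.data -= mu              # unobserved cells stay 0, which now means "average"

    svd = TruncatedSVD(n_components=k, random_state=seed)
    svd.fit(R_centered)
    U = svd.transform(R_centered)
    pred = U @ svd.components_ + mu    # add the mean back
    return np.clip(pred, 1, 10)        # keep predictions on the 1-10 rating scale

# Hypothetical toy matrix: 3 conditions x 4 drugs
toy = sparse.csr_matrix(np.array([
    [8.0, 0.0, 6.0, 0.0],
    [0.0, 9.0, 0.0, 5.0],
    [7.0, 0.0, 0.0, 4.0],
]))
pred = centered_svd_predict(toy, k=2)
print(pred.round(2))
```

In this notebook the equivalent call would be `centered_svd_predict(R_train, k=n_components)`, and the result could be scored with the same RMSE/MAE code used for `R_pred`.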

## 8) Evaluate on held-out pairs (RMSE / MAE)
We compare model predictions against the held-out `avg_rating` in `test_pairs`.

In [10]:
test_rows = test_pairs["condition"].map(cond2id).to_numpy()
test_cols = test_pairs["drugName"].map(drug2id).to_numpy()
y_true = test_pairs["avg_rating"].to_numpy()
y_pred = R_pred[test_rows, test_cols]

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae  = mean_absolute_error(y_true, y_pred)

print(f"RMSE: {rmse:.4f}")
print(f"MAE : {mae:.4f}")

# Quick sanity check: show a few predictions
pd.DataFrame({
    "condition": test_pairs["condition"].values[:10],
    "drug": test_pairs["drugName"].values[:10],
    "true_avg_rating": y_true[:10],
    "pred_rating": y_pred[:10]
})

RMSE: 1.3716
MAE : 1.0553


Unnamed: 0,condition,drug,true_avg_rating,pred_rating
0,Anxiety and Stress,Paroxetine,6.787234,6.896181
1,Depression,Lexapro,7.620098,6.656908
2,Social Anxiety Disorde,Venlafaxine,7.173913,8.47016
3,Chronic Pain,Fentanyl,8.089431,7.441722
4,Benign Prostatic Hyperplasia,Cialis,7.808511,10.0
5,Birth Control,TriNessa,6.198444,7.194439
6,Anxiety,Hydroxyzine,5.779621,7.172212
7,Pain,Acetaminophen / hydrocodone,7.971239,6.530961
8,Birth Control,Depo-Provera,5.522907,6.931629
9,Depression,Quetiapine,7.073394,6.924125
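To put the RMSE above in context, one option is to compare it with a predictor that always outputs the mean training rating. A minimal sketch with made-up numbers; the helper names `rmse` and `constant_baseline_rmse` are hypothetical and not part of the notebook.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def constant_baseline_rmse(y_train, y_test) -> float:
    """RMSE of always predicting the mean of the training ratings."""
    const = float(np.mean(y_train))
    return rmse(y_test, np.full(len(y_test), const))

# Toy numbers: the training mean is 7, so the test errors are 0 and 2
baseline = constant_baseline_rmse([6.0, 8.0], [7.0, 9.0])
print(f"Baseline RMSE: {baseline:.4f}")
```

With the notebook's variables this would be `constant_baseline_rmse(train_pairs["avg_rating"], y_true)`. The SVD model is only adding value if its RMSE is clearly lower than that number.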


## 9) Recommend top-N drugs for a given condition
For a condition:
1. take its predicted scores across all drugs
2. remove drugs already observed for that condition in the training set
3. return the top-N by predicted rating

This is the typical inference flow for collaborative filtering recommenders.

In [11]:
# Precompute "seen" drugs per condition from TRAIN pairs (so we don't recommend what is already known in training)
seen_by_condition = (
    train_pairs.groupby("condition")["drugName"]
               .apply(set)
               .to_dict()
)

def recommend_for_condition(condition: str, top_n: int = 10) -> pd.DataFrame:
    condition = str(condition).strip()
    if condition not in cond2id:
        raise ValueError(f"Condition not found in filtered set. Try one of: {sorted(cond2id)[:15]} ...")

    c_id = cond2id[condition]
    scores = R_pred[c_id].copy()

    seen = seen_by_condition.get(condition, set())
    # mask seen items
    for d in seen:
        if d in drug2id:
            scores[drug2id[d]] = -np.inf

    top_ids = np.argsort(-scores)[:top_n]
    recs = pd.DataFrame({
        "drugName": [id2drug[i] for i in top_ids],
        "predicted_rating": scores[top_ids]
    })
    return recs

# Example: pick a condition that exists
example_condition = sorted(cond2id.keys())[0]
print("Example condition:", example_condition)
recommend_for_condition(example_condition, top_n=10)

Example condition: ADHD


Unnamed: 0,drugName,predicted_rating
0,Chantix,7.944148
1,Varenicline,7.943129
2,Effexor XR,7.316542
3,Sertraline,7.241568
4,Topamax,7.207634
5,Ativan,7.204571
6,Xanax,7.19943
7,Clonidine,7.187816
8,Adipex-P,7.180979
9,Hydroxyzine,7.17101


### 9.1 Try your own condition
Run the cell below and change `my_condition` to something in the filtered list.
Tip: print the top conditions first.

In [14]:
# Show some common conditions you can try
top_conditions = all_df["condition"].value_counts().head(25)
top_conditions

condition,count
Birth Control,38436
Depression,12164
Pain,8245
Anxiety,7812
Acne,7435
Bipolar Disorde,5604
Insomnia,4904
Weight Loss,4857
Obesity,4757
ADHD,4509


In [17]:
my_condition = top_conditions.index[3]  # <-- change this line, e.g., "Depression", "Birth Control", "High Blood Pressure"
recommend_for_condition(my_condition, top_n=15)

Unnamed: 0,drugName,predicted_rating
0,Cyclobenzaprine,7.518955
1,Lamictal,7.493766
2,Sertraline,7.381028
3,Olanzapine,7.290075
4,Seroquel,7.259819
5,Vilazodone,7.215272
6,Trintellix,7.1907
7,Hydroxyzine,7.172212
8,Mirtazapine,7.162562
9,Naproxen,7.157447


## 10) Optional: Explainability helpers (simple)
Collaborative filtering is often “black-box”. Here are two lightweight checks:

1) **Nearest-neighbor conditions** in latent space (which conditions behave similarly)  
2) **Nearest-neighbor drugs** in latent space

These can help validate whether embeddings make clinical sense (often imperfect with review data).

In [24]:
from sklearn.metrics.pairwise import cosine_similarity

# Latent embeddings
cond_emb = U  # (conditions x k)
drug_emb = Vt.T  # (drugs x k)

def nearest_conditions(condition: str, k: int = 10) -> pd.DataFrame:
    c_id = cond2id[condition]
    sims = cosine_similarity(cond_emb[c_id:c_id+1], cond_emb).ravel()
    top = np.argsort(-sims)[:k+1]  # include itself
    rows = []
    for idx in top:
        rows.append((id2cond[idx], float(sims[idx])))
    out = pd.DataFrame(rows, columns=["condition", "cosine_sim"])
    return out[out["condition"] != condition].head(k)

def nearest_drugs(drug: str, k: int = 10) -> pd.DataFrame:
    d_id = drug2id[drug]
    sims = cosine_similarity(drug_emb[d_id:d_id+1], drug_emb).ravel()
    top = np.argsort(-sims)[:k+1]
    rows = []
    for idx in top:
        rows.append((id2drug[idx], float(sims[idx])))
    out = pd.DataFrame(rows, columns=["drugName", "cosine_sim"])
    return out[out["drugName"] != drug].head(k)

# Demo neighbors
demo_condition = top_conditions.index[1]
print("Condition neighbors for:", demo_condition)
nearest_conditions(demo_condition, k=10)

Condition neighbors for: Depression


Unnamed: 0,condition,cosine_sim
1,Major Depressive Disorde,0.683776
2,Anxiety,0.530602
3,Panic Disorde,0.469244
4,Post Traumatic Stress Disorde,0.459512
5,ibromyalgia,0.453566
6,Hot Flashes,0.356107
7,Premenstrual Dysphoric Disorde,0.343345
8,Bipolar Disorde,0.335977
9,Anxiety and Stress,0.30994
10,Irritable Bowel Syndrome,0.287956


In [26]:
# Demo drug neighbors: pick the alphabetically first drug in the filtered set
some_drug = sorted(drug2id.keys())[0]
print("Drug neighbors for:", some_drug)
nearest_drugs(some_drug, k=10)

Drug neighbors for: Abilify


Unnamed: 0,drugName,cosine_sim
1,Latuda,0.763324
2,Lurasidone,0.762007
3,Wellbutrin,0.5188
4,Risperidone,0.471109
5,Aripiprazole,0.42542
6,Quetiapine,0.360505
7,Trintellix,0.267098
8,Vilazodone,0.267098
9,Viibryd,0.245037
10,Olanzapine,0.231678
