### 1. Loading the disease–symptom matrix

This step reads `DiseaseAndSymptoms.csv` into a pandas DataFrame.  
Each row is a disease record: the `diseases` column holds the disease name, and every other column is a symptom encoded as a 0/1 flag.  

We then:
- Extract a list of disease names (`diseases`).
- Collect the symptom column names (`symptom_cols`), excluding the `diseases` column.
- Convert all symptom columns to integer type to guarantee numeric 0/1 values.
- Build `symptom_matrix`, a NumPy array of shape `(num_diseases, num_symptoms)`, which we later use as the “canonical” symptom vector for each disease.


In [15]:
import pandas as pd
import numpy as np
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from numpy.linalg import norm
import joblib


In [19]:
# ----- 1.1 Load symptom matrix -----
df_symptoms = pd.read_csv("DiseaseAndSymptoms.csv")

# Disease names (strings)
diseases = df_symptoms["diseases"].tolist()

# Symptom columns (all numeric)
symptom_cols = [c for c in df_symptoms.columns if c != "diseases"]
df_symptoms[symptom_cols] = df_symptoms[symptom_cols].astype(int)

# Numeric matrix (rows = diseases, cols = symptoms)
symptom_matrix = df_symptoms[symptom_cols].values
print("symptom_matrix dtype:", symptom_matrix.dtype)
print("num diseases:", len(diseases), "num symptoms:", len(symptom_cols))

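# ----- 1.3 Drop symptoms with no positive examples (see section 4) -----
# NOTE: Y_symptoms is created in cell In [36], so that cell must run before this block.
# 1) Count positives per symptom
pos_counts = Y_symptoms.sum(axis=0)          # sum over rows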
keep_mask = pos_counts > 0                   # True where there is at least one 1
print("Symptoms with at least one positive:", keep_mask.sum(), "out of", len(symptom_cols))

# 2) Filter symptom columns and matrices
symptom_cols = [c for c, keep in zip(symptom_cols, keep_mask) if keep]
Y_symptoms = Y_symptoms[:, keep_mask]
symptom_matrix = symptom_matrix[:, keep_mask]

print("New Y_symptoms shape:", Y_symptoms.shape)

symptom_matrix dtype: int64
num diseases: 246945 num symptoms: 377
Symptoms with at least one positive: 328 out of 377
New Y_symptoms shape: (247479, 328)


### 2. Creating synthetic training texts from the matrix

Here we generate synthetic training data directly from the disease–symptom matrix.  

For each disease row:
- We find all symptoms whose value is 1 (i.e., symptoms that are present for that disease).
- We build a simple text description like `"patient has fever, runny nose, dry throat"` from those symptom names.
- We append this synthetic text to `texts`.
- We append the corresponding 0/1 symptom vector (the row from `symptom_matrix`) to `symptom_labels`.

The goal is to create a first batch of training examples where the mapping from text → symptom vector is perfectly clean (because the text is literally generated from the symptom vector).


In [20]:

# ----- 1.2 Synthetic texts from symptom matrix -----
texts = []
symptom_labels = []

for disease_vec in symptom_matrix:
    # Names of the symptoms flagged 1 in this row
    symptoms_present = [name for name, flag in zip(symptom_cols, disease_vec)
                        if flag == 1]
    if symptoms_present:
        text = "patient has " + ", ".join(s.replace("_", " ") for s in symptoms_present)
        texts.append(text)
        symptom_labels.append(disease_vec.astype(int))

print("Synthetic examples:", len(texts))


Synthetic examples: 246945


### 3. Adding real free‑text JSONL data (train + test)

In this step we load the real free‑text dataset stored as JSONL files (`train.jsonl` and `test.jsonl`).  
Each line in these files contains an object with:

- `input_text`: a natural‑language symptom description written in a more realistic style.
- `output_text`: the disease name corresponding to that description.

We then:

1. Load both JSONL files and concatenate them into a single list `real_data`.
2. Build a dictionary `disease_to_idx` that maps each disease name in the symptom matrix to its row index.
3. For each item in `real_data`:
   - We read the disease name (`output_text`) and symptom description (`input_text`).
   - If the disease exists in our matrix, we fetch the corresponding 0/1 symptom row.
   - We append the free‑text symptom description to `texts`.
   - We append the disease’s symptom row to `symptom_labels`.
   - If the disease name does not exist in the matrix, we record it in `unmatched` and skip it.

At the end, we combine the synthetic texts and real JSONL texts into `X_text`, and the corresponding symptom vectors into `Y_symptoms`.  
This gives us training pairs of the form: **(free text or synthetic text) → symptom vector** in the same 0/1 space as our disease–symptom matrix.
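
One detail of step 2 is worth checking on toy data. A dict comprehension keeps only the last index seen for a repeated key, so if a disease name occurs on several rows of the matrix (the repeated `viral exanthem` line in the section 6 output suggests it does), `disease_to_idx` points at just one of those rows. The sketch below uses made-up names (`toy_diseases`, `last_idx`, `rows_by_name`) and shows one possible alternative that keeps every row index. It is an illustration under those assumptions, not part of the original pipeline.

```python
from collections import defaultdict

# Hypothetical disease column with a repeated name
toy_diseases = ["common cold", "allergy", "common cold"]

# Same construction as disease_to_idx: later rows overwrite earlier ones
last_idx = {d: i for i, d in enumerate(toy_diseases)}

# Alternative: remember every row that carries a given disease name
rows_by_name = defaultdict(list)
for i, d in enumerate(toy_diseases):
    rows_by_name[d].append(i)

print(last_idx["common cold"])      # 2
print(rows_by_name["common cold"])  # [0, 2]
```

With all row indices available, the label for a real text could be, for example, the element-wise maximum or mean of the matching rows. Which choice is appropriate depends on how the matrix was built.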


In [35]:
def load_jsonl(filename):
    data = []
    with open(filename, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                data.append(json.loads(line))
    return data

real_data = load_jsonl("train.jsonl") + load_jsonl("test.jsonl")
print("Real JSONL rows:", len(real_data))

Real JSONL rows: 1065


### 4. Filtering out symptom labels that never appear

Some symptom columns in `Y_symptoms` may be all zeros — that is, no training example ever has that symptom as present.  

Logistic regression cannot be trained on a label that has only one class (all 0), so we:

- Compute `pos_counts`, the sum of each symptom column across all training examples.
- Build `keep_mask`, a boolean mask that is `True` for symptoms that have at least one positive label.
- Filter:
  - `symptom_cols` to keep only those symptoms with at least one positive.
  - `Y_symptoms` to keep only those columns.
  - `symptom_matrix` to keep the same subset of symptoms.

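This guarantees that every remaining symptom label has at least one positive example. It does not by itself guarantee that both classes land in the training split, so a very rare symptom can still end up all‑zero after `train_test_split`.

Note on cell order: the filtering code for this step is printed at the end of cell `In [19]` in section 1, while the cell shown below (`In [36]`) builds `Y_symptoms` as described in section 3 and has to run first.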
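
The filtering steps listed above can be sketched on a toy label matrix. The array `Y_toy` and the column names in `toy_cols` are invented for illustration; only the `pos_counts` / `keep_mask` logic mirrors the notebook.

```python
import numpy as np

# Toy label matrix: 4 examples x 3 symptoms; the middle symptom never occurs
Y_toy = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 0, 0],
])
toy_cols = ["fever", "hip weakness", "cough"]  # hypothetical column names

pos_counts = Y_toy.sum(axis=0)   # positives per symptom column
keep_mask = pos_counts > 0       # True where at least one example is positive

# Apply the same mask to the names and to the label columns
kept_cols = [c for c, keep in zip(toy_cols, keep_mask) if keep]
Y_kept = Y_toy[:, keep_mask]

print(kept_cols)     # ['fever', 'cough']
print(Y_kept.shape)  # (4, 2)
```

The same boolean mask has to be applied to every array that shares the symptom axis, otherwise column indices no longer line up with `symptom_cols`.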


In [36]:
# Map disease name -> row index (from symptom matrix).
# If a name occurs on several rows, the dict keeps only the last index.
disease_to_idx = {d: i for i, d in enumerate(diseases)}

unmatched = set()

for item in real_data:
    disease_name = item["output_text"]
    symptom_text = item["input_text"]
    if disease_name in disease_to_idx:
        idx = disease_to_idx[disease_name]
        row_vec = symptom_matrix[idx].astype(int)
        texts.append(symptom_text)
        symptom_labels.append(row_vec)
    else:
        unmatched.add(disease_name)

print("Total training examples:", len(texts))
print("Unmatched diseases (skipped):", len(unmatched))

# Convert to arrays
X_text = np.array(texts)
Y_symptoms = np.array(symptom_labels)
print("Y_symptoms dtype:", Y_symptoms.dtype, "shape:", Y_symptoms.shape)

Total training examples: 1599
Unmatched diseases (skipped): 11
Y_symptoms dtype: int64 shape: (248013, 328)


### 5. Training a multi‑label text → symptom classifier (TF‑IDF + Logistic Regression)

Here we train the core NLP model that predicts symptoms from text.

Steps:

1. **Train–test split**  
   We split `X_text` and `Y_symptoms` into training and test sets using `train_test_split`. This allows us to measure how well the model generalises.

2. **Text vectorisation (TF‑IDF)**  
   We create a `TfidfVectorizer` with:
   - A maximum vocabulary size (e.g. 5000 features).
   - Unigrams and bigrams (`ngram_range=(1, 2)`).
   - A minimum document frequency (`min_df=2`) to drop extremely rare tokens.

   We fit the vectorizer on the training texts and transform both train and test texts into TF‑IDF feature matrices (`X_train_vec`, `X_test_vec`).  
   This is the NLP featurisation step: raw text → numeric feature vectors.

3. **Multi‑label classifier**  
   We wrap `LogisticRegression` inside `MultiOutputClassifier`.  
   This creates one binary logistic regression model per symptom, so the model learns `P(symptom_k = 1 | text)` for every symptom `k`.

4. **Training and evaluation**  
   We fit the multi‑output classifier on the TF‑IDF features and `Y_train`.  
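   Then we predict on the test set and compute the micro‑averaged F1 score, which pools true positives, false positives and false negatives over all symptom labels. Most examples are synthetic texts built from their own label vectors, so this score mainly measures how well the model recovers symptoms from templated text and may say little about free‑text inputs.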
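
As a worked example of the metric in step 4, the sketch below computes micro‑averaged F1 by hand on a tiny multi‑label problem and compares it with `f1_score(..., average="micro")`. The arrays `Y_true_toy` and `Y_pred_toy` are invented; the point is only to show that every (example, symptom) cell is counted once.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy ground truth and predictions: 3 examples x 3 symptom labels
Y_true_toy = np.array([[1, 0, 1],
                       [0, 1, 0],
                       [1, 1, 0]])
Y_pred_toy = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [1, 0, 1]])

# Pool counts over every cell of the matrix
tp = int(((Y_true_toy == 1) & (Y_pred_toy == 1)).sum())  # 3
fp = int(((Y_true_toy == 0) & (Y_pred_toy == 1)).sum())  # 1
fn = int(((Y_true_toy == 1) & (Y_pred_toy == 0)).sum())  # 2

manual_f1 = 2 * tp / (2 * tp + fp + fn)
sklearn_f1 = f1_score(Y_true_toy, Y_pred_toy, average="micro")

print(manual_f1, sklearn_f1)
```

Because the counts are pooled, frequent labels contribute most of the cells that matter, so a high micro F1 can coexist with poor performance on rare symptoms.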


In [9]:
print(df_symptoms.dtypes)


diseases                            object
anxiety and nervousness              int64
depression                           int64
shortness of breath                  int64
depressive or psychotic symptoms     int64
                                     ...  
hip weakness                         int64
back swelling                        int64
ankle stiffness or tightness         int64
ankle weakness                       int64
neck weakness                        int64
Length: 378, dtype: object


In [23]:
# Split
X_train, X_test, Y_train, Y_test = train_test_split(
    X_text, Y_symptoms, test_size=0.2, random_state=42
)

# TF-IDF
vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),
    min_df=2
)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Multi-label classifier
clf = MultiOutputClassifier(
    LogisticRegression(max_iter=500, random_state=42)
)
clf.fit(X_train_vec, Y_train)

# Quick evaluation
Y_pred = clf.predict(X_test_vec)
print("Micro F1:", f1_score(Y_test, Y_pred, average="micro"))

# Save artifacts
joblib.dump(vectorizer, "vectorizer.pkl")
joblib.dump(clf, "symptom_model.pkl")
joblib.dump({"diseases": diseases, "symptom_cols": symptom_cols}, "metadata.pkl")
print("Saved vectorizer, model, metadata.")


Micro F1: 0.9976008831476716
Saved vectorizer, model, metadata.


In [27]:
counts = Y_symptoms.sum(axis=0)
for s, c in sorted(zip(symptom_cols, counts), key=lambda x: -x[1])[:30]:
    print(s, c)


sharp abdominal pain 32307
vomiting 27923
headache 24816
cough 24442
sharp chest pain 24016
nausea 23735
back pain 21857
shortness of breath 21346
fever 20541
dizziness 17362
abnormal appearing skin 16573
nasal congestion 16249
leg pain 16239
skin swelling 15315
depressive or psychotic symptoms 15064
lower abdominal pain 14984
sore throat 14005
burning abdominal pain 12981
skin rash 12473
skin lesion 12440
arm pain 11619
weakness 11598
low back pain 11053
ear pain 10633
depression 10556
side pain 10554
itching of skin 10511
diarrhea 10472
loss of sensation 10399
skin growth 10311


### 6. Predicting diseases via symptom similarity

The inference pipeline in the original approach works in two stages:

1. **Text → symptom probabilities**  
   - Given a new user text, we transform it with the trained TF‑IDF vectorizer.
   - For each internal logistic regression estimator (one per symptom), we call `predict_proba` to obtain `P(symptom_k = 1 | text)`.
   - We stack these probabilities into a vector `symptom_probs`, which represents the model’s belief about which symptoms the user has.

2. **Symptom probabilities → disease similarity scores**  
   - For each disease row in `symptom_matrix`, we compute the cosine similarity between:
     - `symptom_probs` (probabilities from the text model).
     - The disease’s 0/1 symptom vector.

   - Higher cosine similarity means the pattern of predicted symptoms is more similar to that disease’s canonical symptom profile.
   - We sort diseases by similarity and return the top‑k as candidate diagnoses, with similarity scores.

This implements the idea of “scan the text for symptoms, then choose diseases whose symptom vectors most closely match what the model extracted.”
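
The similarity stage can be illustrated on toy data. The sketch below is a vectorised variant of the row‑by‑row comparison (an assumption made for brevity; the notebook itself loops over rows), and `cosine_to_rows`, `toy_matrix` and `toy_probs` are hypothetical names.

```python
import numpy as np

def cosine_to_rows(probs, matrix):
    """Cosine similarity between one vector and every row of a matrix."""
    matrix = matrix.astype(float)
    dots = matrix @ probs
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(probs)
    # Zero-norm rows (or a zero query) get similarity 0 instead of NaN
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)

# Three toy disease rows over three symptoms; the last row has no symptoms
toy_matrix = np.array([[1, 1, 0],
                       [0, 0, 1],
                       [0, 0, 0]])
toy_probs = np.array([0.9, 0.8, 0.1])  # made-up symptom probabilities

sims = cosine_to_rows(toy_probs, toy_matrix)
best = int(np.argmax(sims))
print(best, sims)
```

Cosine similarity ignores vector length, so a disease row is rewarded for matching the direction of `toy_probs`. If the probabilities are skewed toward the wrong symptoms, the ranking is skewed with them.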


In [30]:
# Cosine similarity
def cosine_similarity(u, v):
    if norm(u) == 0 or norm(v) == 0:
        return 0.0
    return float(np.dot(u, v) / (norm(u) * norm(v)))

def predict_diseases_from_text(
    text,
    vectorizer,
    model,
    symptom_matrix,
    diseases,
    symptom_cols,
    top_k=5
):
    # 1. text -> symptom probabilities
    X_vec = vectorizer.transform([text])
    probs_per_symptom = []
    for est in model.estimators_:
        probs = est.predict_proba(X_vec)[:, 1]  # prob of label=1
        probs_per_symptom.append(probs[0])
    symptom_probs = np.array(probs_per_symptom)  # (num_symptoms,)

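    # Boost symptoms that are named verbatim in the text.
    # boost_exact_matches is defined in cell In [38] (section 7); run that cell first.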
    symptom_probs = boost_exact_matches(text, symptom_probs, symptom_cols, boost=0.7)

    # 2. similarity to each disease row
    scores = []
    for i, disease_vec in enumerate(symptom_matrix):
        score = cosine_similarity(symptom_probs, disease_vec.astype(float))
        scores.append((diseases[i], score))

    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]


# Reload artifacts (simulate real usage)
vectorizer = joblib.load("vectorizer.pkl")
model = joblib.load("symptom_model.pkl")
metadata = joblib.load("metadata.pkl")
diseases = metadata["diseases"]
symptom_cols = metadata["symptom_cols"]
symptom_matrix = df_symptoms[symptom_cols].values

# Test
user_text = "I have a runny nose, dry throat and a temperature for two days"
results = predict_diseases_from_text(user_text, vectorizer, model, symptom_matrix, diseases, symptom_cols, top_k=5)

for disease, score in results:
    print(f"{disease}: {score:.2f}")


drug reaction: 0.81
impetigo: 0.79
viral exanthem: 0.78
viral exanthem: 0.78
actinic keratosis: 0.77


### 7. Inspecting the model’s predicted symptoms

To understand why disease predictions looked wrong, we added a debugging step that:

- Takes `symptom_probs` (the probabilities for each symptom).
- Pairs each probability with its symptom name.
- Sorts these pairs in descending order of probability.
- Prints the top‑N symptoms and their probabilities.

For example, for the input `"I have a runny nose, dry throat and a temperature for two days"`, we observed that the model assigned relatively high probabilities to skin‑related symptoms (e.g. abnormal appearing skin, itching of skin) and very low probabilities to respiratory symptoms such as runny nose, sore throat, and coryza.

This revealed that the text → symptom model was not learning the intuitive mapping between common respiratory symptom phrases and their corresponding symptom labels, but was heavily biased toward skin and pain symptoms that are frequent in the training data.
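
The debugging step described above is not shown in the notebook. A minimal sketch of it could look like the following, where `top_symptoms` is a hypothetical helper and the probabilities are made‑up values rather than real model output.

```python
import numpy as np

def top_symptoms(symptom_probs, symptom_cols, n=10):
    """Return the n (symptom, probability) pairs with the highest probability."""
    pairs = sorted(zip(symptom_cols, symptom_probs),
                   key=lambda pair: pair[1], reverse=True)
    return pairs[:n]

# Illustrative inputs only; in the notebook these would come from the model
toy_cols = ["abnormal appearing skin", "coryza", "itching of skin", "sore throat"]
toy_probs = np.array([0.62, 0.05, 0.41, 0.03])

top3 = top_symptoms(toy_probs, toy_cols, n=3)
for name, p in top3:
    print(f"{name}: {p:.2f}")
```

In the real pipeline, `symptom_probs` would have to be returned from (or printed inside) `predict_diseases_from_text`, since that function currently only returns disease scores.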


In [38]:
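# Raise the probability of every symptom whose name appears verbatim in the text
# to at least `boost`. Note: symptom_probs is modified in place and also returned.
def boost_exact_matches(text, symptom_probs, symptom_cols, boost=0.5):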
    text_low = text.lower()
    for i, s in enumerate(symptom_cols):
        phrase = s.replace("_", " ")
        if phrase in text_low:
            symptom_probs[i] = max(symptom_probs[i], boost)
    return symptom_probs
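
To see what the boost does and does not catch, here is a small self‑contained sketch. It uses a non‑mutating variant (`boost_exact_matches_copy`, a hypothetical name) so the original probabilities stay available for comparison; the columns and values are invented.

```python
import numpy as np

def boost_exact_matches_copy(text, symptom_probs, symptom_cols, boost=0.5):
    """Like boost_exact_matches, but works on a copy of the input array."""
    text_low = text.lower()
    boosted = np.array(symptom_probs, dtype=float)  # np.array copies by default
    for i, s in enumerate(symptom_cols):
        if s.replace("_", " ").lower() in text_low:
            boosted[i] = max(boosted[i], boost)
    return boosted

toy_cols = ["fever", "sore throat", "skin rash"]  # hypothetical columns
toy_probs = np.array([0.05, 0.10, 0.60])          # made-up probabilities

boosted = boost_exact_matches_copy(
    "I have a sore throat and a temperature", toy_probs, toy_cols, boost=0.7
)
print(boosted)    # "sore throat" is raised to 0.7; "fever" stays at 0.05
print(toy_probs)  # unchanged
```

The example also shows the limit of exact matching: "temperature" does not contain the string "fever", so a paraphrased symptom gets no boost.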


### 9. Training a direct text → disease classifier (new approach)

In the new, simpler approach we ignore the symptom layer at prediction time and train a model directly from text to disease.

Steps:

1. **Build training data**  
   - Inputs: `input_text` from the JSONL files (free‑text symptom descriptions).
   - Labels: `output_text` from the same files (disease names).

2. **TF‑IDF features**  
   - We use `TfidfVectorizer` to convert each input text into a numeric feature vector, similar to the previous model.

3. **Multi‑class Logistic Regression**  
   - We train a single `LogisticRegression` classifier to predict the disease name for each text.
   - This is a standard multi‑class text classification setup: one label per example.

4. **Inference**  
   - For a new user text, we transform it with the vectorizer, call `predict_proba`, and obtain a probability distribution over all diseases in the training set.
   - We sort diseases by predicted probability and display the top‑k as percentages. These are model confidences (class probabilities), not similarity scores like those in the original pipeline.

This new model directly optimises for the final target (disease) rather than going through an intermediate symptom space.


In [39]:
# Training data from JSONL only
texts = [item["input_text"] for item in real_data]
labels = [item["output_text"] for item in real_data]

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

vec = TfidfVectorizer(max_features=5000, ngram_range=(1,2), min_df=2)
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)

joblib.dump(vec, "text_vec.pkl")
joblib.dump(clf, "text_disease_model.pkl")
print("Accuracy:", (clf.predict(X_test_vec) == y_test).mean())


Accuracy: 0.9014084507042254


In [40]:
def predict_diseases_text_only(text, vec, clf, top_k=5):
    X_vec = vec.transform([text])
    probs = clf.predict_proba(X_vec)[0]
    classes = clf.classes_
    pairs = sorted(zip(classes, probs), key=lambda x: x[1], reverse=True)
    return pairs[:top_k]


user_text = "I have a runny nose, dry throat and a temperature for two days"
for disease, p in predict_diseases_text_only(user_text, vec, clf):
    print(f"{disease}: {p:.1%}")



allergy: 31.5%
common cold: 9.1%
drug reaction: 4.8%
gastroesophageal reflux disease: 4.5%
diabetes: 4.4%


### 10. Why the original text → symptoms → disease pipeline gave bad results, and how the new text → disease model fixes it

In the original design, the pipeline had two stages:

1. **Text → symptoms (multi‑label)**  
   - The model tried to predict hundreds of symptom labels from each text.
   - Training labels came from a mix of:
     - Synthetic texts generated directly from the disease–symptom matrix.
     - Real JSONL texts mapped to symptom vectors via disease names.
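   - The label distribution was heavily imbalanced: many pain and skin‑related symptoms appeared extremely frequently, while some respiratory symptoms were comparatively under‑represented (although the counts printed after section 5 show that cough, nasal congestion and sore throat are themselves among the 30 most frequent labels).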

2. **Symptoms → disease (similarity)**  
   - The model’s predicted symptom probabilities were compared to each disease’s symptom row using cosine similarity.
   - Diseases whose symptom vectors aligned best with the (possibly biased) symptom probabilities were ranked highest.

Because the text → symptom model was trained on noisy, imbalanced data, it often predicted high probabilities for skin and pain symptoms even when the text clearly described respiratory symptoms. When these biased symptom vectors were fed into the similarity step, the system naturally preferred diseases with strong skin components (e.g. impetigo, drug reaction, actinic keratosis) even for inputs like “runny nose, dry throat, temperature”. In short, **errors at the symptom layer propagated and were amplified by the similarity step**.

The new approach removes this fragile intermediate step:

- Instead of predicting hundreds of symptoms, we train a **single multi‑class classifier** that maps text directly to diseases, using the JSONL data (`input_text → output_text`) as the supervision signal.
- TF‑IDF + Logistic Regression now optimises exactly the quantity we care about (disease label) and uses the best‑quality labels we have (disease names from the JSONL set), without having to infer a large, noisy symptom vector first.
- At inference, a single `predict_proba` call yields a probability distribution over diseases, which we report as confidence scores in place of the earlier similarity scores.

As a result, for the same input “I have a runny nose, dry throat and a temperature for two days”, the direct text → disease model sensibly ranks **allergy** and **common cold** highest, instead of unrelated skin conditions. The model is simpler, less error‑prone, and more closely aligned with the available data and the end goal of suggesting likely diseases from free‑text symptom descriptions.
