# Preprocessing & Text Vectorization Notebook

## Objective
This notebook implements the **EDA-driven preprocessing pipeline** for predicting medical conditions from drug side-effects text. It consolidates:

1. Data loading
2. Target consolidation (rare class handling)
3. Text truncation (99th percentile)
4. Text cleaning
5. Train–test split
6. Target label encoding
7. TF-IDF vectorization
8. Sparsity check and artifact export


## 1. Imports & Configuration

In [5]:
import pandas as pd
import numpy as np
import re
import joblib

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

## 2. Load Raw Dataset

In [6]:
DATA_PATH = 'drugs_side_effects_drugs_com.csv'

df = pd.read_csv(DATA_PATH)
print('Dataset shape:', df.shape)
df.head()

Dataset shape: (2931, 17)


Unnamed: 0,drug_name,medical_condition,side_effects,generic_name,drug_classes,brand_names,activity,rx_otc,pregnancy_category,csa,alcohol,related_drugs,medical_condition_description,rating,no_of_reviews,drug_link,medical_condition_url
0,doxycycline,Acne,"(hives, difficult breathing, swelling in your ...",doxycycline,"Miscellaneous antimalarials, Tetracyclines","Acticlate, Adoxa CK, Adoxa Pak, Adoxa TT, Alod...",87%,Rx,D,N,X,amoxicillin: https://www.drugs.com/amoxicillin...,Acne Other names: Acne Vulgaris; Blackheads; B...,6.8,760.0,https://www.drugs.com/doxycycline.html,https://www.drugs.com/condition/acne.html
1,spironolactone,Acne,hives ; difficulty breathing; swelling of your...,spironolactone,"Aldosterone receptor antagonists, Potassium-sp...","Aldactone, CaroSpir",82%,Rx,C,N,X,amlodipine: https://www.drugs.com/amlodipine.h...,Acne Other names: Acne Vulgaris; Blackheads; B...,7.2,449.0,https://www.drugs.com/spironolactone.html,https://www.drugs.com/condition/acne.html
2,minocycline,Acne,"skin rash, fever, swollen glands, flu-like sym...",minocycline,Tetracyclines,"Dynacin, Minocin, Minolira, Solodyn, Ximino, V...",48%,Rx,D,N,,amoxicillin: https://www.drugs.com/amoxicillin...,Acne Other names: Acne Vulgaris; Blackheads; B...,5.7,482.0,https://www.drugs.com/minocycline.html,https://www.drugs.com/condition/acne.html
3,Accutane,Acne,problems with your vision or hearing; muscle o...,isotretinoin (oral),"Miscellaneous antineoplastics, Miscellaneous u...",,41%,Rx,X,N,X,doxycycline: https://www.drugs.com/doxycycline...,Acne Other names: Acne Vulgaris; Blackheads; B...,7.9,623.0,https://www.drugs.com/accutane.html,https://www.drugs.com/condition/acne.html
4,clindamycin,Acne,hives ; difficult breathing; swelling of your ...,clindamycin topical,"Topical acne agents, Vaginal anti-infectives","Cleocin T, Clindacin ETZ, Clindacin P, Clindag...",39%,Rx,B,N,,doxycycline: https://www.drugs.com/doxycycline...,Acne Other names: Acne Vulgaris; Blackheads; B...,7.4,146.0,https://www.drugs.com/mtm/clindamycin-topical....,https://www.drugs.com/condition/acne.html


## 3. Target Consolidation (Rare Medical Conditions)

### Rationale
Medical conditions with very few samples destabilize multi-class classifiers. Based on EDA, conditions with fewer than **50 samples** are merged into an `Other` category.

In [7]:
MIN_CLASS_COUNT = 50

condition_counts = df['medical_condition'].value_counts()
rare_conditions = condition_counts[condition_counts < MIN_CLASS_COUNT].index

df['medical_condition'] = df['medical_condition'].replace(rare_conditions, 'Other')

print('Number of target classes after consolidation:', df['medical_condition'].nunique())

Number of target classes after consolidation: 23
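
To make the consolidation rule concrete, the sketch below applies the same count-then-replace logic to a toy label column. The helper name `consolidate_rare` and the toy labels are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

def consolidate_rare(labels: pd.Series, min_count: int, other: str = "Other") -> pd.Series:
    """Replace labels that occur fewer than `min_count` times with `other`."""
    counts = labels.value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent labels as they are; overwrite rare ones.
    return labels.where(~labels.isin(rare), other)

# Toy data (assumed): two frequent classes and two singletons.
toy = pd.Series(["Acne"] * 3 + ["Pain"] * 3 + ["Gout", "Colds"])
consolidated = consolidate_rare(toy, min_count=2)
print(consolidated.value_counts().to_dict())
```

Both singleton labels end up in `Other`, so the toy column goes from four classes to three while keeping every row.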


## 4. Side Effects Text Preparation

### 4.1 Handle Missing Values

In [8]:
df['side_effects'] = df['side_effects'].fillna('')

### 4.2 Compute Text Length & Truncation Threshold
Based on EDA, `side_effects` is truncated at the **99th percentile** of text length. This caps the longest ~1% of entries, which are structurally different regulatory monographs, while leaving the other ~99% of rows unchanged. No rows are dropped.


In [9]:
df['side_effects_length'] = df['side_effects'].apply(len)

truncation_threshold = int(df['side_effects_length'].quantile(0.99))
print(f'Truncating side_effects at {truncation_threshold} characters (99th percentile)')

Truncating side_effects at 4692 characters (99th percentile)


### 4.3 Truncate Text

In [10]:
def truncate_text(text, max_length):
    return text[:max_length]

df['side_effects'] = df['side_effects'].apply(
    lambda x: truncate_text(x, truncation_threshold)
)
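
As a small illustration of what percentile truncation does, the sketch below uses toy strings (assumed data, not the real dataset) with a single very long outlier. It shows that truncation shortens only the entries above the threshold and never removes rows.

```python
import pandas as pd

# Toy data (assumed): 99 short entries plus one very long outlier.
texts = pd.Series(["a" * n for n in range(1, 100)] + ["b" * 10_000])
lengths = texts.str.len()

# Same rule as above: integer 99th percentile of character length.
threshold = int(lengths.quantile(0.99))
truncated = texts.str[:threshold]

n_changed = int((lengths > threshold).sum())
print(f"threshold={threshold}, rows shortened={n_changed} of {len(texts)}")
```

Only the outlier is shortened here; every other entry is already below the threshold and passes through unchanged.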

## 5. Text Cleaning

### Rationale
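Lightweight normalization removes formatting noise while preserving clinical meaning: text is lowercased, characters other than letters, digits, spaces, and basic punctuation (`. , ; : ( ) -`) are replaced with spaces, and runs of whitespace are collapsed.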

In [11]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r"\n|\r|\t", " ", text)
    text = re.sub(r"[^a-z0-9.,;:()\- ]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


df['side_effects'] = df['side_effects'].apply(clean_text)

df.drop(columns=['side_effects_length'], inplace=True)
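
The following sketch repeats `clean_text` from the cell above and runs it on a made-up input string (an assumption for illustration) to show which characters survive the normalization.

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"\n|\r|\t", " ", text)             # control whitespace -> space
    text = re.sub(r"[^a-z0-9.,;:()\- ]+", " ", text)  # drop other symbols
    text = re.sub(r"\s+", " ", text).strip()          # collapse repeated spaces
    return text

sample = "Hives;\tDifficult breathing!\nSwelling of your face/lips"
cleaned = clean_text(sample)
print(cleaned)
# hives; difficult breathing swelling of your face lips
```

The semicolon is kept because it is in the allowed punctuation set, while `!` and `/` are replaced by spaces that are then collapsed.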

## 6. Train–Test Split

### Rationale
A **stratified 80/20 split** keeps the class proportions approximately the same in the training and test sets.


In [12]:
X = df['side_effects']
y = df['medical_condition']

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

print('Train size:', X_train.shape[0])
print('Test size:', X_test.shape[0])

# Save train and test data as CSV
train_df = pd.DataFrame({'side_effects': X_train, 'medical_condition': y_train})
test_df = pd.DataFrame({'side_effects': X_test, 'medical_condition': y_test})
train_df.to_csv('train_data.csv', index=False)
test_df.to_csv('test_data.csv', index=False)


Train size: 2344
Test size: 587
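
One way to confirm that stratification behaved as intended is to compare class shares in the two subsets. The sketch below does this on toy labels (assumed data); the same comparison could be run on `y_train` and `y_test`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumed): 80% "Other", 20% "Acne".
y = pd.Series(["Other"] * 80 + ["Acne"] * 20)
X = pd.Series([f"text {i}" for i in range(len(y))])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Class shares side by side; the two columns should be close to each other.
shares = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(shares)
```

If a class share differed noticeably between the two columns, that would suggest the split was not stratified on the intended label column.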


## 7. Encode Target Labels

In [None]:
label_encoder = LabelEncoder()

y_train_enc = label_encoder.fit_transform(y_train)
y_test_enc = label_encoder.transform(y_test)

joblib.dump(label_encoder, 'label_encoder.pkl')
print('Label encoder saved')

Label encoder saved
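
The saved encoder is what lets a later step turn integer predictions back into condition names. The sketch below shows that round trip on toy labels; the `predicted_codes` array is a hypothetical stand-in for model output.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(["Pain", "Acne", "Other", "Acne"])

# classes_ is sorted, so integer codes follow alphabetical order.
print(encoder.classes_.tolist())  # ['Acne', 'Other', 'Pain']
print(encoded.tolist())           # [2, 0, 1, 0]

# Hypothetical model predictions mapped back to condition names.
predicted_codes = np.array([0, 2])
decoded = encoder.inverse_transform(predicted_codes).tolist()
print(decoded)                    # ['Acne', 'Pain']
```

Note that `transform` raises an error for labels it did not see during `fit`, which is one reason rare conditions are consolidated before the stratified split.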


## 8. TF-IDF Vectorization

### Design Choices
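- Unigrams + bigrams (`ngram_range=(1, 2)`)
- Maximum 10,000 features (`max_features=10000`)
- Minimum document frequency = 3 documents (`min_df=3`)
- Maximum document frequency = 90% of documents (`max_df=0.9`)
- English stop words removed (`stop_words='english'`)
- Sublinear term-frequency scaling, `1 + log(tf)` (`sublinear_tf=True`)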

In [None]:
tfidf = TfidfVectorizer(
    ngram_range=(1, 2),
    max_features=10000,
    min_df=3,
    max_df=0.9,
    stop_words='english',
    sublinear_tf=True
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

joblib.dump(tfidf, 'tfidf_vectorizer.pkl')

['tfidf_vectorizer.pkl']

In [None]:
print('TF-IDF Vectorization Complete')
print('Train TF-IDF shape:', X_train_tfidf.shape)
print('Test TF-IDF shape:', X_test_tfidf.shape)

TF-IDF Vectorization Complete
Train TF-IDF shape: (2344, 10000)
Test TF-IDF shape: (587, 10000)


## 9. Feature Sparsity Sanity Check

In [None]:
density = X_train_tfidf.nnz / (
    X_train_tfidf.shape[0] * X_train_tfidf.shape[1]
)
print(f'TF-IDF matrix density: {density:.6f}')

TF-IDF matrix density: 0.016546


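**Expected:** very low density (around 1–2% or less), which is normal for sparse NLP features.

**Observation:** 0.0165 (about 1.65%), which is within the expected range and corresponds to roughly 165 non-zero features per document on average.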
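
The density figure is simply the number of stored non-zero entries divided by the total number of cells. The sketch below works this out for a tiny hand-made matrix (assumed values) and also derives the average number of non-zero features per row.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy 2 x 5 matrix (assumed values) with three non-zero entries.
dense = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.7, 0.0],
])
sparse = csr_matrix(dense)

density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
avg_nnz_per_row = sparse.nnz / sparse.shape[0]
print(f"density={density:.2f}, avg non-zeros per row={avg_nnz_per_row:.1f}")
# density=0.30, avg non-zeros per row=1.5
```

Multiplying density by the number of columns gives the same per-row average, which is often easier to interpret than the raw ratio.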

## 10. Outputs Ready for Modeling

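The following artifacts are now ready:

- `X_train_tfidf`, `X_test_tfidf` (sparse TF-IDF matrices)
- `y_train_enc`, `y_test_enc` (integer-encoded labels)
- `label_encoder.pkl` (fitted `LabelEncoder`)
- `tfidf_vectorizer.pkl` (fitted `TfidfVectorizer`)
- `train_data.csv`, `test_data.csv` (cleaned text and labels from the split)

The matrices and encoded labels are written to disk in the cell below so that the model training notebook or script can load them directly.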

In [None]:
from scipy.sparse import save_npz

# Save TF-IDF matrices
save_npz("X_train_tfidf.npz", X_train_tfidf)
save_npz("X_test_tfidf.npz", X_test_tfidf)

# Save labels
joblib.dump(y_train_enc, "y_train_enc.pkl")
joblib.dump(y_test_enc, "y_test_enc.pkl")

print("Preprocessing artifacts saved successfully")


Preprocessing artifacts saved successfully
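
For the modeling side, the saved files can be read back with `load_npz` and `joblib.load`. The sketch below is a self-contained round trip on toy data written to a temporary directory (an assumption so the example runs anywhere); in practice the paths would be the file names saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz

# Toy stand-ins (assumed) for the TF-IDF matrix and encoded labels.
X = csr_matrix(np.array([[0.0, 1.0], [0.5, 0.0]]))
y = np.array([1, 0])

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X_train_tfidf.npz")
    y_path = os.path.join(tmp, "y_train_enc.pkl")

    # Save, then load back the same way a training script would.
    save_npz(x_path, X)
    joblib.dump(y, y_path)
    X_loaded = load_npz(x_path)
    y_loaded = joblib.load(y_path)

print(X_loaded.shape, y_loaded.tolist())
```

Checking the loaded shapes against the ones printed in this notebook is a cheap guard against pairing a matrix with the wrong label file.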
