# MediMine: Data Preprocessing
**Author:** Hussam M. Bansao
**Goal:** Clean and transform the raw symptom dataset into a transactional format for Apriori.

## 1. Load the Dataset

In [6]:
import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder

# Load the raw dataset (the path is relative to the current working directory)
try:
    df = pd.read_csv('../data/raw/dataset.csv')
    print("✅ Dataset loaded successfully.")
    display(df.head())
except FileNotFoundError:
    print("❌ Error: File not found. Please ensure 'dataset.csv' is in '../data/raw/'.")

✅ Dataset loaded successfully.


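| | Disease | Symptom_1 | Symptom_2 | Symptom_3 | Symptom_4 | Symptom_5 | Symptom_6 | Symptom_7 | Symptom_8 | Symptom_9 | Symptom_10 | Symptom_11 | Symptom_12 | Symptom_13 | Symptom_14 | Symptom_15 | Symptom_16 | Symptom_17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Fungal infection | itching | skin_rash | nodal_skin_eruptions | dischromic _patches | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Fungal infection | skin_rash | nodal_skin_eruptions | dischromic _patches | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Fungal infection | itching | nodal_skin_eruptions | dischromic _patches | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Fungal infection | itching | skin_rash | dischromic _patches | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Fungal infection | itching | skin_rash | nodal_skin_eruptions | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |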


## 2. Data Cleaning
Symptom names in the raw dataset use underscores in place of spaces (e.g., `stomach_pain` instead of `stomach pain`). We replace the underscores with spaces and strip surrounding whitespace so the results are readable.

In [7]:
# Function to clean text
def clean_text(text):
    if isinstance(text, str):
        return text.replace('_', ' ').strip()
    return text

# Apply cleaning to every cell
# (DataFrame.map replaces applymap, which is deprecated since pandas 2.1)
df_clean = df.map(clean_text)

print("Cleaning complete. Sample:")
display(df_clean.iloc[0].values)

Cleaning complete. Sample:


array(['Fungal infection', 'itching', 'skin rash', 'nodal skin eruptions',
       'dischromic  patches', nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan], dtype=object)
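
One side effect is visible in the sample above: `dischromic _patches` comes out as `dischromic  patches` with a double space, because the underscore is replaced without collapsing the space that precedes it. The following is a minimal sketch of a stricter cleaner, assuming such double spaces are unwanted. `normalize_text` is a hypothetical helper, not part of this notebook.

```python
def normalize_text(text):
    """Replace underscores with spaces and collapse runs of whitespace."""
    if isinstance(text, str):
        # str.split() with no arguments splits on any run of whitespace and
        # ignores leading/trailing whitespace, so join() yields single spaces.
        return ' '.join(text.replace('_', ' ').split())
    # Leave NaN and other non-string values untouched
    return text


print(normalize_text('dischromic _patches'))  # dischromic patches
print(normalize_text(' skin_rash '))          # skin rash
```

Switching to this cleaner would change item names relative to the outputs shown in this notebook, so it would have to be applied consistently from this step onward.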

## 3. Transform to Transactions
The Apriori algorithm requires a list of transactions, where each transaction is a list of items. We build one transaction per dataframe row from its non-null cells, so each transaction holds the disease name followed by that row's symptoms.

In [8]:
transactions = []
for row in df_clean.values:
    # Keep every non-null cell in the row: the disease label plus its symptoms
    row_values = [str(x) for x in row if pd.notna(x)]
    transactions.append(row_values)

print(f"✅ Extracted {len(transactions)} transactions.")
print(f"Sample Transaction: {transactions[0]}")

✅ Extracted 4920 transactions.
Sample Transaction: ['Fungal infection', 'itching', 'skin rash', 'nodal skin eruptions', 'dischromic  patches']
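
To show the row-to-transaction step in isolation, here is a small sketch on a toy frame. The frame, its values, and the helper name `rows_to_transactions` are made up for illustration; only the column naming follows the dataset. It assumes missing symptoms are stored as NaN, as in the sample above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_clean: two rows, the second with a missing symptom
toy = pd.DataFrame({
    "Disease": ["Fungal infection", "Allergy"],
    "Symptom_1": ["itching", "chills"],
    "Symptom_2": ["skin rash", np.nan],
})


def rows_to_transactions(frame):
    """Return one list per row containing its non-null cells as strings."""
    return [[str(v) for v in row if pd.notna(v)] for row in frame.to_numpy()]


toy_transactions = rows_to_transactions(toy)
print(toy_transactions)
# [['Fungal infection', 'itching', 'skin rash'], ['Allergy', 'chills']]
```

Rows of different lengths are expected here: a transaction is only as long as the number of filled cells in its row.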


## 4. One-Hot Encoding
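We convert the list of transactions into a boolean matrix with one row per transaction and one column per distinct item. A cell is True when the item occurs in that transaction and False otherwise. This is the input format Apriori needs.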

In [9]:
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df_encoded = pd.DataFrame(te_ary, columns=te.columns_)

print(f"Encoded Data Shape: {df_encoded.shape}")
display(df_encoded.head())

Encoded Data Shape: (4920, 172)


| | (vertigo) Paroymsal Positional Vertigo | AIDS | Acne | Alcoholic hepatitis | Allergy | Arthritis | Bronchial Asthma | Cervical spondylosis | Chicken pox | Chronic cholestasis | ... | vomiting | watering from eyes | weakness in limbs | weakness of one body side | weight gain | weight loss | yellow crust ooze | yellow urine | yellowing of eyes | yellowish skin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
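
To make the encoding concrete, here is a dependency-free sketch of the same idea on two toy transactions. It is an illustration only, not mlxtend's implementation, and `one_hot` is a hypothetical helper. It assumes the columns are the sorted distinct items, which is consistent with the column order shown above (capitalized disease names before lowercase symptom names).

```python
def one_hot(transactions):
    """Return (columns, matrix) where matrix[i][j] is True if
    columns[j] occurs in transactions[i]."""
    # Distinct items across all transactions, in sorted order
    columns = sorted({item for t in transactions for item in t})
    matrix = [[col in t for col in columns] for t in transactions]
    return columns, matrix


toy = [
    ['Allergy', 'chills'],
    ['Fungal infection', 'itching', 'chills'],
]
columns, matrix = one_hot(toy)

print(columns)  # ['Allergy', 'Fungal infection', 'chills', 'itching']
for row in matrix:
    print(row)
# [True, False, True, False]
# [False, True, True, True]
```

Each row has exactly as many True values as its transaction has distinct items, which is a quick sanity check that also applies to `df_encoded`.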


## 5. Save Processed Data
We save the clean, encoded data to the `processed` folder for the next notebook to use.

In [10]:
df_encoded.to_csv('../data/processed/encoded_symptoms.csv', index=False)
print("✅ Processed data saved to '../data/processed/encoded_symptoms.csv'")

✅ Processed data saved to '../data/processed/encoded_symptoms.csv'
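
Because `index=False` omits the row index, the file contains only the item columns. The sketch below uses an in-memory buffer and a made-up two-column frame instead of the real file to show how a later notebook could reload the matrix. pandas should parse the `True`/`False` strings back into boolean columns, but it is worth checking the dtypes after loading.

```python
import io

import pandas as pd

# Toy stand-in for df_encoded
toy_encoded = pd.DataFrame({
    'Allergy': [True, False],
    'chills': [True, True],
})

# Write to an in-memory buffer rather than '../data/processed/...'
buffer = io.StringIO()
toy_encoded.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(reloaded.dtypes)
print(reloaded.equals(toy_encoded))
```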
