# CarePath AI — Notebook 01: Data exploration & preprocessing


This notebook will:
- Inspect the `data/` folder structure
- Load the Kaggle prototype dataset (Disease_symptom_and_patient_profile)
- Load a small sample from the DDXPlus dataset
- Clean/normalize symptom text into lists
- Produce simple outputs (symptom frequencies, disease→symptoms mapping)


Run cells one at a time and read the explanation below each cell.

In [1]:
from pathlib import Path

DATA_DIR = Path(r"C:\Users\ASUS\OneDrive\Desktop\CarePath AI\data")

patient_file = DATA_DIR / "Disease_symptom_and_patient_profile_dataset.csv"

ddx_dir = DATA_DIR / "ddxplus"

print("DATA_DIR:", DATA_DIR.resolve())

print("Patient dataset exists:", patient_file.exists())

print("DDX folder exists:", ddx_dir.exists())

DATA_DIR: C:\Users\ASUS\OneDrive\Desktop\CarePath AI\data
Patient dataset exists: True
DDX folder exists: True
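
The absolute path in the cell above only resolves on the machine it was written on. A minimal sketch of a more portable lookup, assuming the notebook is run from somewhere inside a project tree that contains a `data/` folder; `find_data_dir` is a hypothetical helper, not part of the original notebook:

```python
from pathlib import Path


def find_data_dir(start: Path, name: str = "data") -> Path:
    """Walk upward from `start` until a folder called `name` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / name).is_dir():
            return candidate / name
    raise FileNotFoundError(f"No '{name}' folder found at or above {start}")


# Usage (assumption: the working directory is inside the project):
# DATA_DIR = find_data_dir(Path.cwd())
```

The nearest matching folder wins, so the same cell works whether the notebook sits in the project root or in a subfolder.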


In [2]:
import pandas as pd 

df=pd.read_csv(patient_file)

print("Shape:", df.shape)

print("\nColumns Names:", df.columns.tolist())

print("\nFirst 5 rows: ")

df.head()

Shape: (349, 10)

Columns Names: ['Disease', 'Fever', 'Cough', 'Fatigue', 'Difficulty Breathing', 'Age', 'Gender', 'Blood Pressure', 'Cholesterol Level', 'Outcome Variable']

First 5 rows: 


| | Disease | Fever | Cough | Fatigue | Difficulty Breathing | Age | Gender | Blood Pressure | Cholesterol Level | Outcome Variable |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Influenza | Yes | No | Yes | Yes | 19 | Female | Low | Normal | Positive |
| 1 | Common Cold | No | Yes | Yes | No | 25 | Female | Normal | Normal | Negative |
| 2 | Eczema | No | Yes | Yes | No | 25 | Female | Normal | Normal | Negative |
| 3 | Asthma | Yes | Yes | No | Yes | 25 | Male | Normal | Normal | Positive |
| 4 | Asthma | Yes | Yes | No | Yes | 25 | Male | Normal | Normal | Positive |


In [3]:
# .sum() must be called; a bare .sum prints the bound method instead of the counts
print("Missing values per column:\n", df.isnull().sum())

print("\nNumber of duplicate rows: ", df.duplicated().sum())

print("\nData types: ")
print(df.dtypes)

print("\nBasic statistics: ")
print(df.describe())
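
A per-column missing-value count requires calling the method, `df.isnull().sum()`. Writing `df.isnull().sum` without parentheses hands `print` the bound method itself, which shows up as a `<bound method DataFrame.sum of ...>` dump of the boolean mask. A small sketch on a toy frame (the values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy frame with hypothetical values; None marks a missing entry.
toy = pd.DataFrame({
    "Fever": ["Yes", None, "No"],
    "Age": [19, 25, None],
    "Gender": ["Female", "Male", "Male"],
})

missing_counts = toy.isnull().sum()  # Series with one count per column
print(missing_counts)
print("Total missing:", int(missing_counts.sum()))
```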

In [4]:
print("Shape before removing duplicates: ", df.shape)

df=df.drop_duplicates()

print("Shape after removing duplicates: ", df.shape)

Shape before removing duplicates:  (349, 10)
Shape after removing duplicates:  (300, 10)
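
The shapes above imply 349 − 300 = 49 duplicate rows were removed. `drop_duplicates()` keeps the original row labels, which is why the index in later `df.head()` outputs jumps from 3 to 5. A small sketch on a toy frame (hypothetical values) showing both effects; the `reset_index(drop=True)` call is an optional step the notebook does not take:

```python
import pandas as pd

toy = pd.DataFrame({
    "Disease": ["Asthma", "Asthma", "Eczema", "Asthma"],
    "Fever": ["Yes", "Yes", "No", "Yes"],
})

n_dupes = int(toy.duplicated().sum())           # repeats after the first occurrence
deduped = toy.drop_duplicates()                 # surviving rows keep labels 0 and 2
deduped_reset = deduped.reset_index(drop=True)  # relabels the rows 0 and 1

print("Duplicates:", n_dupes)
print("Index after drop:", deduped.index.tolist())
print("Index after reset:", deduped_reset.index.tolist())
```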


In [5]:
categorical_cols=df.select_dtypes(include='object').columns

for col in categorical_cols:
    print(f"\n{col}-> Unique values: {df[col].unique()}")


Disease-> Unique values: ['Influenza' 'Common Cold' 'Eczema' 'Asthma' 'Hyperthyroidism'
 'Allergic Rhinitis' 'Anxiety Disorders' 'Diabetes' 'Gastroenteritis'
 'Pancreatitis' 'Rheumatoid Arthritis' 'Depression' 'Liver Cancer'
 'Stroke' 'Urinary Tract Infection' 'Dengue Fever' 'Hepatitis'
 'Kidney Cancer' 'Migraine' 'Muscular Dystrophy' 'Sinusitis'
 'Ulcerative Colitis' 'Bipolar Disorder' 'Bronchitis' 'Cerebral Palsy'
 'Colorectal Cancer' 'Hypertensive Heart Disease' 'Multiple Sclerosis'
 'Myocardial Infarction (Heart...' 'Urinary Tract Infection (UTI)'
 'Osteoporosis' 'Pneumonia' 'Atherosclerosis'
 'Chronic Obstructive Pulmonary...' 'Epilepsy' 'Hypertension'
 'Obsessive-Compulsive Disorde...' 'Psoriasis' 'Rubella' 'Cirrhosis'
 'Conjunctivitis (Pink Eye)' 'Liver Disease' 'Malaria' 'Spina Bifida'
 'Kidney Disease' 'Osteoarthritis' 'Klinefelter Syndrome' 'Acne'
 'Brain Tumor' 'Cystic Fibrosis' 'Glaucoma' 'Rabies' 'Chickenpox'
 'Coronary Artery Disease' 'Eating Disorders (Anorexia,...' 'Fi

In [6]:
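# Binary columns: map each column explicitly rather than using a frame-wide
# replace, which avoids the downcasting FutureWarning that replace() emits
# in recent pandas versions.
yes_no_map = {"Yes": 1, "No": 0}
for col in ["Fever", "Cough", "Fatigue", "Difficulty Breathing"]:
    df[col] = df[col].map(yes_no_map)

df["Gender"] = df["Gender"].map({"Male": 1, "Female": 0})
df["Outcome Variable"] = df["Outcome Variable"].map({"Positive": 1, "Negative": 0})

# Ordinal columns share the same Low < Normal < High ordering.
ordinal_map = {"Low": 0, "Normal": 1, "High": 2}
df["Blood Pressure"] = df["Blood Pressure"].map(ordinal_map)
df["Cholesterol Level"] = df["Cholesterol Level"].map(ordinal_map)

print(df.head())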

       Disease  Fever  Cough  Fatigue  Difficulty Breathing  Age  Gender  \
0    Influenza      1      0        1                     1   19       0   
1  Common Cold      0      1        1                     0   25       0   
2       Eczema      0      1        1                     0   25       0   
3       Asthma      1      1        0                     1   25       1   
5       Eczema      1      0        0                     0   25       0   

   Blood Pressure  Cholesterol Level  Outcome Variable  
0               0                  1                 1  
1               1                  1                 0  
2               1                  1                 0  
3               1                  1                 1  
5               1                  1                 1  


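
The encoding step assumes every value in a column appears in its mapping. A minimal sketch of a column-wise encoder that fails loudly on an unexpected category instead of leaving it unencoded; `encode_column` is a hypothetical helper and the toy values are illustrative:

```python
import pandas as pd


def encode_column(series: pd.Series, mapping: dict) -> pd.Series:
    """Map categories to integers and raise if any value is not in `mapping`."""
    encoded = series.map(mapping)
    # Values that were present in the input but produced NaN were not in the mapping.
    unmapped = series[encoded.isna() & series.notna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped values in '{series.name}': {list(unmapped)}")
    return encoded.astype("Int64")  # nullable integer dtype preserves genuine NaNs


toy = pd.DataFrame({"Blood Pressure": ["Low", "Normal", "High"]})
levels = {"Low": 0, "Normal": 1, "High": 2}
toy["Blood Pressure"] = encode_column(toy["Blood Pressure"], levels)
print(toy)
```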


In [7]:
from sklearn.preprocessing import LabelEncoder

le=LabelEncoder()

df["Disease"]=le.fit_transform(df["Disease"])

df.head()

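| | Disease | Fever | Cough | Fatigue | Difficulty Breathing | Age | Gender | Blood Pressure | Cholesterol Level | Outcome Variable |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | 1 | 0 | 1 | 1 | 19 | 0 | 0 | 1 | 1 |
| 1 | 24 | 0 | 1 | 1 | 0 | 25 | 0 | 1 | 1 | 0 |
| 2 | 37 | 0 | 1 | 1 | 0 | 25 | 0 | 1 | 1 | 0 |
| 3 | 6 | 1 | 1 | 0 | 1 | 25 | 1 | 1 | 1 | 1 |
| 5 | 37 | 1 | 0 | 0 | 0 | 25 | 0 | 1 | 1 | 1 |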
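
`LabelEncoder` assigns codes by sorting the class names, so the integers are identifiers, not magnitudes. A short sketch of how the fitted encoder can translate codes back to disease names; the five-item list is a toy sample, so its codes differ from those in the full dataset:

```python
from sklearn.preprocessing import LabelEncoder

# Toy sample of disease names (the real column has many more classes).
diseases = ["Influenza", "Common Cold", "Eczema", "Asthma", "Eczema"]

le = LabelEncoder()
codes = le.fit_transform(diseases)           # classes are sorted alphabetically
names = le.inverse_transform(codes)          # back to the original strings
code_to_name = dict(enumerate(le.classes_))  # lookup table: code -> name

print(list(codes))
print(code_to_name)
```

Keeping the fitted `le` object around is what makes the encoded `Disease` column interpretable later.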


In [8]:
from sklearn.model_selection import train_test_split

X=df.drop("Outcome Variable", axis=1)

y=df["Outcome Variable"]

X_train, X_test, y_train, y_test=train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train Shape:", X_train.shape, y_train.shape)

print("Test Shape:", X_test.shape, y_test.shape)

Train Shape: (240, 9) (240,)
Test Shape: (60, 9) (60,)
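
Passing `stratify=y` asks `train_test_split` to keep the class proportions roughly equal in both partitions. A minimal check on synthetic labels (the 144/156 class counts are an assumption chosen for illustration, not read from the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 300 rows, 144 of class 0 and 156 of class 1.
y = np.array([0] * 144 + [1] * 156)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Overall positive rate:", round(float(y.mean()), 3))
print("Train positive rate:  ", round(float(y_train.mean()), 3))
print("Test positive rate:   ", round(float(y_test.mean()), 3))
```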


### Baseline: Logistic Regression
Logistic Regression Accuracy: 0.7333333333333333

Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.69      0.71        29
           1       0.73      0.77      0.75        31

    accuracy                           0.73        60
   macro avg       0.73      0.73      0.73        60
weighted avg       0.73      0.73      0.73        60


Confusion Matrix:
[[20  9]
 [ 7 24]]
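
The figures in the classification report follow directly from the confusion matrix. A worked sketch using the logistic regression matrix reported above, where rows are true classes and columns are predicted classes (scikit-learn's convention):

```python
# Confusion matrix from the logistic regression results above.
cm = [[20, 9],
      [7, 24]]

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"accuracy    = {accuracy:.2f}")
print(f"class 0: precision = {precision_0:.2f}, recall = {recall_0:.2f}")
print(f"class 1: precision = {precision_1:.2f}, recall = {recall_1:.2f}")
```

The row sums (29 and 31) are the `support` values in the report.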

In [9]:
# from sklearn.linear_model import LogisticRegression
# from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# log_reg=LogisticRegression(max_iter=1000, random_state=42)

# log_reg.fit(X_train, y_train)

# y_pred=log_reg.predict(X_test)

# accuracy=accuracy_score(y_test, y_pred)
# print("Logistic Regression Accuracy:", accuracy)

# print("\nClassification Report:")
# print(classification_report(y_test, y_pred))

# print("\nConfusion Matrix:")
# print(confusion_matrix(y_test, y_pred))

### Improved Model: Random Forest
Random Forest Accuracy: 0.8

Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.72      0.78        29
           1       0.77      0.87      0.82        31

    accuracy                           0.80        60
   macro avg       0.81      0.80      0.80        60
weighted avg       0.80      0.80      0.80        60


Confusion Matrix:
[[21  8]
 [ 4 27]]

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

rf_clf=RandomForestClassifier(
    n_estimators=100,
    random_state=42
)
rf_clf.fit(X_train, y_train)
y_pred_rf=rf_clf.predict(X_test)

accuracy_rf=accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy:", accuracy_rf)

print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))

Random Forest Accuracy: 0.8

Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.72      0.78        29
           1       0.77      0.87      0.82        31

    accuracy                           0.80        60
   macro avg       0.81      0.80      0.80        60
weighted avg       0.80      0.80      0.80        60


Confusion Matrix:
[[21  8]
 [ 4 27]]
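
With only 60 test rows, each misclassified example shifts accuracy by about 1.7 percentage points, so a single split gives a noisy estimate. One common complement, not used in the original notebook, is stratified k-fold cross-validation. A minimal sketch on synthetic data standing in for `X` and `y` (the `make_classification` data is an assumption for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the notebook's features and labels (300 rows, 9 features).
X, y = make_classification(n_samples=300, n_features=9, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

The spread across folds gives a rough sense of how much a reported accuracy could vary with a different split.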


### Advanced Model: XGBoost
XGBoost Accuracy: 0.7666666666666667

Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.72      0.75        29
           1       0.76      0.81      0.78        31

    accuracy                           0.77        60
   macro avg       0.77      0.77      0.77        60
weighted avg       0.77      0.77      0.77        60


Confusion Matrix:
[[21  8]
 [ 6 25]]

In [11]:
# !pip install xgboost

In [12]:
# from xgboost import XGBClassifier
# from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# xgb_clf=XGBClassifier(
#     n_estimators=100,
#     learning_rate=0.1,
#     max_depth=3,
#     subsample=0.8,
#     colsample_bytree=0.8,
#     random_state=42,
#     eval_metric="logloss"
# )
# xgb_clf.fit(X_train, y_train)
# y_pred_xgb=xgb_clf.predict(X_test)

# accuracy_xgb=accuracy_score(y_test, y_pred_xgb)
# print("XGBoost Accuracy:", accuracy_xgb)

# print("\nClassification Report:")
# print(classification_report(y_test, y_pred_xgb))

# print("\nConfusion Matrix:")
# print(confusion_matrix(y_test, y_pred_xgb))

In [13]:
import joblib

joblib.dump(rf_clf, "disease_model.pkl")

print("Random Forest model saved as 'disease_model.pkl'")

Random Forest model saved as 'disease_model.pkl'
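
A saved model is only useful if it can be reloaded and given inputs encoded the same way as the training data. A minimal round-trip sketch, using a small synthetic model and a temporary directory as stand-ins (both are assumptions for illustration; the notebook itself writes to `disease_model.pkl` in the working directory):

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the notebook's training data.
X, y = make_classification(n_samples=100, n_features=9, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "disease_model.pkl"
    joblib.dump(model, path)       # same call the notebook uses
    restored = joblib.load(path)   # reload into a fresh object

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("Predictions match after reload:", same_predictions)
```

Because the model was trained on integer-coded `Disease` values, the fitted `LabelEncoder` would need to be persisted in the same way for new inputs to be encoded consistently.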
