<a href="https://colab.research.google.com/github/SriCharan2705/HealthSync-AI/blob/main/HealthSync_AI_DataPreprocessingAndModelTraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

HealthSync AI: Data Preprocessing and Model Training

Step 1: Install Required Libraries

In [1]:
!pip install pandas numpy scikit-learn joblib imbalanced-learn xgboost
!pip install gradio



Step 2: Import Libraries

In [2]:
import pandas as pd
import numpy as np
import joblib
import os
from google.colab import drive  # For Google Drive Access
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report


Step 3: Mount Google Drive

In [3]:
drive.mount('/content/drive')


Mounted at /content/drive


Step 4: Load Dataset from Google Drive

In [4]:
file_path = "/content/drive/My Drive/HealthSyncAI/Synthetic_health_dataset.csv"
df = pd.read_csv(file_path)

# Display first few rows
print(" Dataset Loaded Successfully!")
print(df.head())


 Dataset Loaded Successfully!
   glucose  cholesterol  HbA1c   BMI  blood_pressure  sleep_hours  age  \
0      109          202    6.7  31.8             143          8.2   51   
1       97          198    4.5  24.5             118          7.4   24   
2      112          228    6.3  19.6             100          8.7   56   
3      130          245    5.1  32.0              99          6.4   26   
4       95          220    5.4  24.4             114          6.8   44   

  smoking_status alcohol_intake physical_activity diet_habits  gender  \
0            Yes             No               Yes    Balanced    Male   
1             No            Yes               Yes       Vegan    Male   
2             No             No                No    High-Fat  Female   
3             No            Yes                No    High-Fat    Male   
4             No             No                No    Low-Carb    Male   

  medical_history disease_status  
0             NaN       Diabetes  
1   Heart Diseas
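
Before going further, it can help to confirm that the loaded frame has the columns the later steps rely on. The sketch below is an illustration only: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, the column list is copied from the `df.head()` output, and a toy frame stands in for the real CSV.

```python
import pandas as pd

# Column names copied from the df.head() output above
EXPECTED_COLUMNS = [
    "glucose", "cholesterol", "HbA1c", "BMI", "blood_pressure", "sleep_hours",
    "age", "smoking_status", "alcohol_intake", "physical_activity",
    "diet_habits", "gender", "medical_history", "disease_status",
]

def missing_columns(frame: pd.DataFrame) -> list:
    """Return the expected columns that are absent from the frame (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in frame.columns]

# Toy frame standing in for the real CSV; it deliberately lacks two columns
toy = pd.DataFrame({col: [] for col in EXPECTED_COLUMNS if col not in ("age", "gender")})
print(missing_columns(toy))
```

On the real data, an empty list would indicate that every expected column is present.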

Step 5: Handle Missing Values

In [5]:
print("\n Checking Missing Values...")
print(df.isnull().sum())


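# Fill numeric gaps with each column's median
df.fillna(df.median(numeric_only=True), inplace=True)

# Fill the remaining (categorical) gaps with each column's most frequent value
df.fillna(df.mode().iloc[0], inplace=True)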

print(" Missing Values Handled Successfully!")



 Checking Missing Values...
glucose                  0
cholesterol              0
HbA1c                    0
BMI                      0
blood_pressure           0
sleep_hours              0
age                      0
smoking_status           0
alcohol_intake           0
physical_activity        0
diet_habits              0
gender                   0
medical_history      12641
disease_status           0
dtype: int64
 Missing Values Handled Successfully!
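
The two `fillna` calls can be traced on a small example. The sketch below uses made-up values, not rows from the dataset. It also shows one possible alternative for `medical_history`: if a missing entry means "no recorded history" (an assumption, since the source does not say), an explicit category such as the hypothetical `"No History"` label keeps that information instead of overwriting all 12,641 gaps with the most frequent condition.

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the real dataset
toy = pd.DataFrame({
    "glucose": [90.0, np.nan, 110.0, 130.0],
    "medical_history": ["Diabetes", np.nan, "Diabetes", "Heart Disease"],
})

# Numeric gaps take the column median
filled = toy.fillna(toy.median(numeric_only=True))
# Remaining (categorical) gaps take the most frequent value
filled = filled.fillna(filled.mode().iloc[0])

# Alternative (assumption): treat "no recorded history" as its own category
explicit = toy.copy()
explicit["medical_history"] = explicit["medical_history"].fillna("No History")

print(filled)
print(explicit["medical_history"].value_counts())
```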


Step 6: Define Features (X) and Target (y)

In [6]:
X = df.drop(columns=['disease_status'])
y = df['disease_status']


label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

joblib.dump(label_encoder, "/content/drive/My Drive/HealthSyncAI/label_encoder.pkl")
print(" Target Labels Encoded Successfully!")


 Target Labels Encoded Successfully!


Step 5: Encode Labels

In [7]:

# Convert the categorical target values into numeric form
X = df.drop(columns=['disease_status'])
y = df['disease_status']

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)


joblib.dump(label_encoder, "label_encoder.pkl")

print("Target labels encoded successfully:", label_encoder.classes_)


Target labels encoded successfully: ['Diabetes' 'Healthy' 'Heart Disease' 'Hypertension']
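
The order of `label_encoder.classes_` determines the integer codes that appear later in the class distribution and the classification report. A minimal sketch, using the four class names printed above on a made-up label list:

```python
from sklearn.preprocessing import LabelEncoder

# Class names as printed by the cell above; the list itself is made up
labels = ["Healthy", "Diabetes", "Hypertension", "Heart Disease", "Healthy"]

le = LabelEncoder()
codes = le.fit_transform(labels)

# Classes are sorted alphabetically, so the position in classes_ is the integer code
mapping = {name: int(code) for code, name in enumerate(le.classes_)}
print(mapping)
print(le.inverse_transform([1, 2]))
```

With this mapping, code 1 corresponds to Healthy and code 2 to Heart Disease.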



Step 7: One-Hot Encode Categorical Variables & Scale Numerical Data

In [8]:
# Convert the features into a machine-readable form
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

categorical_columns = ['smoking_status', 'alcohol_intake', 'physical_activity',
                       'diet_habits', 'gender', 'medical_history']
numerical_columns = ['glucose', 'cholesterol', 'HbA1c', 'BMI',
                     'blood_pressure', 'sleep_hours', 'age']


X_encoded = encoder.fit_transform(X[categorical_columns])
X_encoded_df = pd.DataFrame(X_encoded, columns=encoder.get_feature_names_out())


scaler = StandardScaler()
X_numeric_scaled = scaler.fit_transform(X[numerical_columns])
X_numeric_scaled_df = pd.DataFrame(X_numeric_scaled, columns=numerical_columns)


X_transformed = pd.concat([X_numeric_scaled_df, X_encoded_df], axis=1)

print(" Categorical Data Encoded & Numerical Features Scaled Successfully!")


 Categorical Data Encoded & Numerical Features Scaled Successfully!


Step 7: Handle Class Imbalance using SMOTE

In [9]:
# SMOTE: Synthetic Minority Over-sampling Technique
smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X_transformed, y)

print("Dataset balanced successfully! New class distribution:")
print(pd.Series(y_balanced).value_counts())

Dataset balanced successfully! New class distribution:
0    17165
2    17165
3    17165
1    17165
Name: count, dtype: int64
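
SMOTE balances the classes by creating new minority samples rather than copying existing ones. The sketch below is a simplified illustration of the interpolation idea, not imbalanced-learn's implementation: each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbours. The data and the `synthesize` helper are made up for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Toy minority-class points in two dimensions (made-up values)
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])

# For each point, find its nearest minority neighbours (index 0 is the point itself)
nn = NearestNeighbors(n_neighbors=3).fit(minority)
_, neighbor_idx = nn.kneighbors(minority)

def synthesize(i: int) -> np.ndarray:
    """Create one synthetic point between point i and a random neighbour."""
    j = rng.choice(neighbor_idx[i, 1:])
    lam = rng.random()  # interpolation weight in [0, 1)
    return minority[i] + lam * (minority[j] - minority[i])

synthetic = np.array([synthesize(i) for i in range(len(minority))])
print(synthetic)
```

Because every synthetic point is a blend of two real minority samples, it stays inside the region the minority class already occupies.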


Step 8: Split Data into Training & Test Sets

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")


Training set size: (54928, 22)
Test set size: (13732, 22)
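
Note that this notebook resamples before splitting, so the test set contains synthetic samples, which can make the reported scores look better than they would on untouched data. A common alternative is to split first and oversample only the training part. The sketch below shows that ordering on made-up data, using plain random oversampling from `sklearn.utils.resample` as a stand-in for SMOTE; `SMOTE(...).fit_resample(X_tr, y_tr)` would slot into the same place.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data: 90 samples of class 0, 10 of class 1 (made-up values)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# 1) Split first, stratified so both parts keep the original class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2) Oversample the minority class in the training part only
minority_mask = y_tr == 1
n_majority = int((~minority_mask).sum())
X_min_up = resample(
    X_tr[minority_mask], replace=True, n_samples=n_majority, random_state=42
)
X_tr_bal = np.vstack([X_tr[~minority_mask], X_min_up])
y_tr_bal = np.array([0] * n_majority + [1] * len(X_min_up))

print(np.bincount(y_tr_bal))  # training labels after balancing
print(np.bincount(y_te))      # test labels, left untouched
```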


Step 9: Train Random Forest Model

In [11]:
# A random forest trains many decision trees and combines their votes into one prediction
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)


joblib.dump(clf, "/content/drive/My Drive/HealthSyncAI/trained_model.pkl")
joblib.dump(encoder, "/content/drive/My Drive/HealthSyncAI/encoder.pkl")
joblib.dump(scaler, "/content/drive/My Drive/HealthSyncAI/scaler.pkl")

print(" Model Trained & Saved Successfully!")


 Model Trained & Saved Successfully!
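
The saved files can be loaded back with `joblib.load` in a later session. A minimal round-trip sketch, assuming a temporary directory in place of Google Drive and a small model trained on made-up data:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training data (made-up values) standing in for X_train / y_train
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(60, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)

# Save to a temporary directory instead of Google Drive
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "trained_model.pkl")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions
same = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same)
```

The encoder, scaler, and label encoder need to be restored the same way, since predictions depend on all four objects.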


Step 10: Evaluate Model

In [12]:
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"\n Model Accuracy: {accuracy * 100:.2f}%")
print("\n Classification Report:\n", classification_report(y_test, y_pred))



 Model Accuracy: 92.22%

 Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      3502
           1       0.96      0.72      0.82      3465
           2       0.77      0.97      0.86      3330
           3       1.00      1.00      1.00      3435

    accuracy                           0.92     13732
   macro avg       0.93      0.92      0.92     13732
weighted avg       0.93      0.92      0.92     13732
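
In the report above, class 1 (Healthy) has a recall of 0.72 and class 2 (Heart Disease) a precision of 0.77, which suggests that some Healthy samples are being predicted as Heart Disease. A confusion matrix would confirm which classes are mixed up. The sketch below uses made-up labels to show how to read one; on the real data, `confusion_matrix(y_test, y_pred)` would give the actual counts.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels using the notebook's encoding: 1 = Healthy, 2 = Heart Disease
y_true = [1, 1, 1, 1, 2, 2, 2, 2]
y_hat = [1, 1, 2, 2, 2, 2, 2, 2]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=[1, 2])
print(cm)
```

In this toy case the top-right cell counts Healthy samples predicted as Heart Disease.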



Step 11: Define Diet Recommendations

In [13]:
diet_recommendations = {
    "Diabetes": {
        "foods_to_eat": {
            "Leafy Greens": ["Rich in Vitamin C, Vitamin K, and Magnesium (e.g., Spinach, Kale)"],
            "Whole Grains": ["High in Fiber and B Vitamins (e.g., Brown Rice, Quinoa, Oats)"],
            "Lean Proteins": ["Helps with muscle maintenance and blood sugar control (e.g., Chicken, Fish, Tofu, Eggs)"],
            "Low-GI Fruits": ["Provides natural sweetness with lower blood sugar impact (e.g., Berries, Apples, Pears)"],
            "Nuts & Seeds": ["Rich in healthy fats, fiber, and Vitamin E (e.g., Almonds, Walnuts, Chia Seeds)"],
            "Legumes": ["Good source of fiber and plant-based protein (e.g., Lentils, Chickpeas, Black Beans)"]
        },
        "foods_to_avoid": {
            "Sugary Drinks": ["Causes rapid spikes in blood sugar (e.g., Sodas, Energy Drinks, Sweetened Teas)"],
            "Refined Carbs": ["Lacks fiber and nutrients, raises blood sugar quickly (e.g., White Bread, White Pasta, Pastries)"],
            "Fried Foods": ["High in unhealthy fats, raises cholesterol and insulin resistance (e.g., French Fries, Fried Chicken)"],
            "High-Sugar Fruits": ["Can spike blood sugar levels (e.g., Watermelon, Mango, Pineapple, Grapes)"]
        }
    },
    "Hypertension": {
        "foods_to_eat": {
            "Bananas": ["Rich in Potassium, helps reduce blood pressure"],
            "Leafy Greens": ["High in Magnesium, helps relax blood vessels"],
            "Beets": ["Contains Nitrates, which help lower blood pressure"],
            "Garlic": ["Rich in Allicin, reduces inflammation and pressure"],
            "Fatty Fish": ["Contains Omega-3, reduces hypertension (e.g., Salmon, Mackerel)"],
            "Nuts": ["Provides healthy fats, Magnesium, and Potassium (e.g., Walnuts, Almonds)"],
            "Low-fat Dairy": ["Good source of Calcium and Protein, helps manage blood pressure"]
        },
        "foods_to_avoid": {
            "Processed Foods": ["High in sodium and preservatives (e.g., Canned Soups, Frozen Meals)"],
            "Salty Foods": ["Raises blood pressure levels (e.g., Chips, Pickles, Salted Nuts)"],
            "Alcohol": ["Can increase blood pressure over time"],
            "Caffeine": ["May cause temporary blood pressure spikes (e.g., Coffee, Energy Drinks)"],
            "Red Meat": ["Contains saturated fats that may increase heart disease risk"]
        }
    },
    "Heart Disease": {
        "foods_to_eat": {
            "Oatmeal": ["High in Soluble Fiber, reduces bad cholesterol"],
            "Salmon": ["Rich in Omega-3 Fatty Acids, reduces inflammation"],
            "Avocados": ["Contains healthy Monounsaturated Fats, good for the heart"],
            "Nuts": ["Rich in Omega-3, reduces heart disease risk"],
            "Olive Oil": ["Provides healthy fats and antioxidants"],
            "Dark Chocolate": ["High in Flavonoids, improves blood circulation"],
            "Legumes": ["High in Fiber and Protein, lowers cholesterol (e.g., Lentils, Beans)"]
        },
        "foods_to_avoid": {
            "Fried Foods": ["High in trans fats, increases cholesterol"],
            "Trans Fats": ["Found in processed snacks and margarine"],
            "Red Meat": ["High in saturated fat, can lead to heart disease"],
            "High-Sugar Foods": ["Increases inflammation and risk of diabetes"],
            "Excessive Dairy": ["Can be high in saturated fat, affecting heart health"]
        }
    },
    "Healthy": {
        "foods_to_eat": {
            "Balanced Diet": ["A mix of Proteins, Carbs, and Healthy Fats"],
            "Fruits & Vegetables": ["Provides essential Vitamins, Minerals, and Antioxidants"],
            "Whole Grains": ["High in Fiber, keeps digestion and energy levels stable"],
            "Healthy Fats": ["Supports brain function and reduces inflammation (e.g., Nuts, Avocados, Olive Oil)"]
        },
        "foods_to_avoid": {
            "Excess Processed Foods": ["High in unhealthy fats, sodium, and preservatives"],
            "Too Much Sugar & Salt": ["Can lead to metabolic disorders and hypertension"]
        }
    }
}
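
Each entry in this dictionary maps a food to a one-element list, so code that displays it needs to take the first item of that list. The sketch below shows one way to turn a section into display lines; `format_section` and `sample_recommendations` are hypothetical names, and the sample is a cut-down copy of the structure above.

```python
# A cut-down dictionary with the same shape as diet_recommendations
sample_recommendations = {
    "Hypertension": {
        "foods_to_eat": {"Bananas": ["Rich in Potassium, helps reduce blood pressure"]},
        "foods_to_avoid": {"Alcohol": ["Can increase blood pressure over time"]},
    }
}

def format_section(recommendations: dict, disease: str, section: str) -> list:
    """Return '- food: note' lines for one section (hypothetical helper)."""
    entries = recommendations.get(disease, {}).get(section, {})
    # Each value is a one-element list, so take its first item
    return [f"- {food}: {notes[0]}" for food, notes in entries.items()]

print(format_section(sample_recommendations, "Hypertension", "foods_to_eat"))
print(format_section(sample_recommendations, "Unknown", "foods_to_avoid"))
```

An unknown disease name yields an empty list rather than an error.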



Step 12: User Interactive Prediction Function

In [14]:
def predict_disease_and_diet():
    print("\n Enter Your Health Details  ")

    try:
        # Collect user input with validation
        glucose = float(input("Enter Glucose Level (70-200 mg/dL): "))
        if not (70 <= glucose <= 200):
            raise ValueError("Invalid Glucose Level")

        cholesterol = float(input("Enter Cholesterol Level (100-300 mg/dL): "))
        if not (100 <= cholesterol <= 300):
            raise ValueError("Invalid Cholesterol Level")

        HbA1c = float(input("Enter HbA1c Level (4-10 %): "))
        if not (4 <= HbA1c <= 10):
            raise ValueError("Invalid HbA1c Level")

        BMI = float(input("Enter BMI (15-40): "))
        if not (15 <= BMI <= 40):
            raise ValueError("Invalid BMI")

        blood_pressure = float(input("Enter Blood Pressure (80-180 mmHg): "))
        if not (80 <= blood_pressure <= 180):
            raise ValueError("Invalid Blood Pressure")

        sleep_hours = float(input("Enter Sleep Hours per Night (2-10): "))
        if not (2 <= sleep_hours <= 10):
            raise ValueError("Invalid Sleep Hours")

        age = int(input("Enter Age (18-80): "))
        if not (18 <= age <= 80):
            raise ValueError("Invalid Age")

        smoking_status = input("Do you smoke? (Yes/No): ").strip().lower()
        if smoking_status not in ["yes", "no"]:
            raise ValueError("Invalid Smoking Status")

        alcohol_intake = input("Do you consume alcohol? (Yes/No): ").strip().lower()
        if alcohol_intake not in ["yes", "no"]:
            raise ValueError("Invalid Alcohol Intake")

        physical_activity = input("Do you exercise regularly? (Yes/No): ").strip().lower()
        if physical_activity not in ["yes", "no"]:
            raise ValueError("Invalid Physical Activity")

        diet_habits = input("Describe your diet (Balanced/High-Fat/High-Sugar): ").strip().lower()
        if diet_habits not in ["balanced", "high-fat", "high-sugar"]:
            raise ValueError("Invalid Diet Habits")

        gender = input("Enter Gender (Male/Female): ").strip().lower()
        if gender not in ["male", "female"]:
            raise ValueError("Invalid Gender")

        medical_history = input("Any major medical history? (Diabetes/Heart Disease/None): ").strip().lower()
        if medical_history not in ["diabetes", "heart disease", "none"]:
            raise ValueError("Invalid Medical History")

        # Create input DataFrame
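        sample_input = {
            "glucose": glucose,
            "cholesterol": cholesterol,
            "HbA1c": HbA1c,
            "BMI": BMI,
            "blood_pressure": blood_pressure,
            "sleep_hours": sleep_hours,
            "age": age,
            # Restore the training data's capitalization (e.g. "Yes", "High-Fat");
            # the encoder ignores categories it has not seen, including lower-cased ones
            "smoking_status": smoking_status.title(),
            "alcohol_intake": alcohol_intake.title(),
            "physical_activity": physical_activity.title(),
            "diet_habits": diet_habits.title(),
            "gender": gender.title(),
            "medical_history": medical_history.title()
        }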

        input_df = pd.DataFrame([sample_input])
        input_categorical = encoder.transform(input_df[categorical_columns])
        input_numerical = scaler.transform(input_df[numerical_columns])
        input_final = pd.DataFrame(np.hstack((input_numerical, input_categorical)),
                                   columns=list(numerical_columns) + list(encoder.get_feature_names_out()))

        predicted_label = clf.predict(input_final)[0]
        predicted_disease = label_encoder.inverse_transform([predicted_label])[0]

        # Get detailed diet recommendation (fall back to empty sections)
        diet_info = diet_recommendations.get(predicted_disease, {"foods_to_eat": {}, "foods_to_avoid": {}})

        print(f"\n **Predicted Disease**: {predicted_disease}")

        print("\n **Recommended Foods to Eat**:")
        for food, details in diet_info.get("foods_to_eat", {}).items():
            print(f"- {food}: {details[0]}")

        print("\n **Foods to Avoid**:")
        for food, reason in diet_info.get("foods_to_avoid", {}).items():
            # Each value is a one-element list, so print its first item
            print(f"- {food}: {reason[0]}")

    except ValueError as e:
        print(f"\n Invalid Input: {e}")



Step 13: Run User Interactive Prediction

In [15]:
import gradio as gr
def predict_with_gradio(glucose, cholesterol, HbA1c, BMI, blood_pressure, sleep_hours,
                        age, smoking_status, alcohol_intake, physical_activity,
                        diet_habits, gender, medical_history):

    try:
        sample_input = {
            "glucose": glucose,
            "cholesterol": cholesterol,
            "HbA1c": HbA1c,
            "BMI": BMI,
            "blood_pressure": blood_pressure,
            "sleep_hours": sleep_hours,
            "age": age,
            # Radio values already match the training data's capitalization,
            # so pass them through unchanged for the encoder to recognize them
            "smoking_status": smoking_status,
            "alcohol_intake": alcohol_intake,
            "physical_activity": physical_activity,
            "diet_habits": diet_habits,
            "gender": gender,
            "medical_history": medical_history
        }

        input_df = pd.DataFrame([sample_input])
        input_categorical = encoder.transform(input_df[categorical_columns])
        input_numerical = scaler.transform(input_df[numerical_columns])
        input_final = pd.DataFrame(np.hstack((input_numerical, input_categorical)),
                                   columns=list(numerical_columns) + list(encoder.get_feature_names_out()))

        predicted_label = clf.predict(input_final)[0]
        predicted_disease = label_encoder.inverse_transform([predicted_label])[0]

        diet_info = diet_recommendations.get(predicted_disease, {"foods_to_eat": {}, "foods_to_avoid": {}})

        output = f"### 🤕 Predicted Disease: {predicted_disease}\n\n"
        output += "  **🥗 Recommended Foods to Eat:**\n"
        for food, details in diet_info.get("foods_to_eat", {}).items():
            output += f"- {food}: {details[0]}\n"

        output += "\n**❌ Foods to Avoid:**\n"
        for food, reason in diet_info.get("foods_to_avoid", {}).items():
            # Each value is a one-element list, so show its first item
            output += f"- {food}: {reason[0]}\n"

        return output

    except Exception as e:
        return f"Error: {e}"

# Create interface
gr.Interface(
    fn=predict_with_gradio,
    inputs=[
        gr.Number(label="Glucose (70-200)"),
        gr.Number(label="Cholesterol (100-300)"),
        gr.Number(label="HbA1c (4-10)"),
        gr.Number(label="BMI (15-40)"),
        gr.Number(label="Blood Pressure (80-180)"),
        gr.Number(label="Sleep Hours (2-10)"),
        gr.Number(label="Age (18-80)"),
        gr.Radio(["Yes", "No"], label="Smoking Status"),
        gr.Radio(["Yes", "No"], label="Alcohol Intake"),
        gr.Radio(["Yes", "No"], label="Physical Activity"),
        gr.Radio(["Balanced", "High-Fat", "High-Sugar"], label="Diet Habits"),
        gr.Radio(["Male", "Female"], label="Gender"),
        gr.Radio(["Diabetes", "Heart Disease", "None"], label="Medical History"),
    ],
    outputs=gr.Markdown(),
    title=" 🩺HealthSync AI : Smart Diagnosis and Custom Nutrition Guide ",
    description="Enter your health details below:"
).launch()


It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://e01af67afe492aed91.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
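
Unlike the console function in Step 12, the Gradio handler does not check that numeric inputs fall inside the ranges shown in the field labels. One way to add that check is sketched below; `VALID_RANGES` and `out_of_range` are hypothetical names, and the ranges are copied from the Step 12 prompts. The handler could call such a helper first and return an error message when the list is not empty.

```python
# Ranges copied from the prompts in Step 12
VALID_RANGES = {
    "glucose": (70, 200),
    "cholesterol": (100, 300),
    "HbA1c": (4, 10),
    "BMI": (15, 40),
    "blood_pressure": (80, 180),
    "sleep_hours": (2, 10),
    "age": (18, 80),
}

def out_of_range(values: dict) -> list:
    """Return numeric fields that are missing or outside their range (hypothetical helper)."""
    problems = []
    for name, (low, high) in VALID_RANGES.items():
        value = values.get(name)
        if value is None or not (low <= value <= high):
            problems.append(name)
    return problems

sample = {"glucose": 250, "cholesterol": 180, "HbA1c": 5.5, "BMI": 22,
          "blood_pressure": 120, "sleep_hours": 7, "age": 30}
print(out_of_range(sample))
```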


