### Logistic Regression Model Implementation

In this section, we implement a logistic regression classifier to predict whether an individual is obese or non-obese using the provided dataset. 

All data handling, training, and evaluation steps are encapsulated in a single Python class for modularity and easy integration with the project’s architecture.

### Data Preprocessing
To prepare the data for modeling, we perform several steps:

* Data Loading & Cleaning:

 Read the CSV dataset (detailed_meals_macros_.csv) into a pandas DataFrame. Use the DataCleaner class (from the Data_Preprocessing module) to drop duplicate entries, remove rows with missing values, and normalize column names (strip whitespace, replace spaces with underscores, and convert to lowercase for consistency).

* Feature Engineering: 

 Leverage the DataPreprocessor class (from Data_Preprocessing) to create additional features. We convert height from centimeters to meters and calculate the BMI for each individual. An obesity label is then derived from a BMI threshold: the class below applies a single configurable cutoff (bmi_threshold, BMI ≥ 30 by default), whereas gender-specific cutoffs (e.g., BMI ≥ 30 for males and BMI ≥ 25 for females) would require extending the labeling step. The label is stored in a new column (obese in the code below, with 1 for obese and 0 for non-obese).


* Encoding Categorical Variables: 

 Categorical features such as activity level and dietary preference are converted to numeric values. The class below one-hot encodes every text (object-typed) column with pandas get_dummies (drop_first=True) rather than scikit-learn’s LabelEncoder, which avoids imposing an arbitrary order on the categories. Gender, if stored as text, is encoded the same way.

* Feature Selection: 

 We drop irrelevant or non-numeric columns that won’t be used in modeling. In particular, textual columns like the meal suggestions (breakfast_suggestion, lunch_suggestion, dinner_suggestion, snack_suggestion) and the disease column are removed, since they contain free-text or goal descriptions that the model cannot use directly. We also drop height, weight, and BMI from the feature matrix (the class default, drop_height_weight_from_X=True): the obesity label is computed directly from BMI, so keeping any of these columns would leak the target into the features.

* Normalization:

 All remaining feature columns (which are now numeric) are standardized with StandardScaler to zero mean and unit variance so that they share a similar scale. This helps the logistic regression solver converge and keeps the L2 penalty from weighting features by their units. The scaler is fit on the training split only and then applied to the test split.
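
The BMI and label derivation described above can be illustrated with a minimal sketch. The toy rows and the `BMI_THRESHOLD` name are assumptions made for illustration; only the `height` (cm) and `weight` (kg) column names follow the normalized names used in this section.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the project's dataset.
# Height is in centimeters and weight in kilograms.
toy = pd.DataFrame({
    "height": [180.0, 160.0, 170.0],
    "weight": [81.0, 89.6, 57.8],
})

BMI_THRESHOLD = 30.0  # assumed single cutoff, for illustration only

height_m = toy["height"] / 100.0              # centimeters -> meters
toy["bmi"] = toy["weight"] / height_m ** 2    # BMI = kg / m^2
toy["obese"] = (toy["bmi"] >= BMI_THRESHOLD).astype(int)

print(toy.round(2))
```

Because the label is a deterministic function of BMI, a model that sees height, weight, or BMI as inputs can reproduce the label almost trivially, which is worth keeping in mind when choosing the feature set.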

### Model Training and Evaluation

With the data preprocessed, we proceed to model training and evaluation:

1) Train-Test Split:

    The cleaned and engineered dataset is split into training and testing subsets with scikit-learn’s train_test_split. The class below calls it directly and defaults to 80% train and 20% test; the ratio is configurable through test_size. We stratify the split on the obesity label to maintain the same proportion of obese vs. non-obese in both sets.


2) Logistic Regression Model:

   We initialize a scikit-learn LogisticRegression classifier and fit it on the training data. The defaults (L2 penalty, lbfgs solver, C = 1.0) are kept, with max_iter raised to 1000 to give the solver room to converge; these can be adjusted if needed. Training learns a weight for each feature in order to best separate obese vs. non-obese individuals.

3) Prediction:

   After training, we use the model to predict labels on the test set. These predictions are then compared to the true labels to evaluate performance.

4) Evaluation Metrics:

   We compute key classification metrics: accuracy (the fraction of correct predictions), the confusion matrix (to see counts of true positives, true negatives, false positives, false negatives), and the classification report (which includes precision, recall, and F1-score for each class) on the test data. These metrics provide a comprehensive overview of how well the model is performing in classifying individuals as obese or non-obese.
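
The four steps above can be sketched end to end on synthetic data. This is an illustrative sketch only: the random features and labels are stand-ins, not the project's dataset, and the variable names are chosen here for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the project's dataset): 200 rows, 3 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# 1) Stratified split keeps the class ratio similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# 2) Fit the classifier, 3) predict on the test split, 4) score.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_s, y_train)
y_pred = clf.predict(X_test_s)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted
print(f"accuracy: {acc:.3f}")
print(cm)
print(classification_report(y_test, y_pred, target_names=["Non-Obese", "Obese"]))
```

Accuracy equals the sum of the confusion matrix diagonal divided by the number of test rows, so the two outputs can be cross-checked against each other.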

 


The entire process is encapsulated in the LogisticRegressionModel class below. This class can be integrated into the larger project (e.g., used within a Jupyter notebook or a Streamlit app) by instantiating it and calling its methods to prepare data, train the model, and retrieve evaluation results.

### Importing Libraries 

```python
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
```

Now we are ready for the main part:

```python
class LogisticRegressionModel:
    """
    Logistic Regression model for Obese vs Non-Obese classification.

    Designed to work cleanly with your current repo structure without requiring:
    from Data_Loader... or from Data_Preprocessing...

    You can pass a preprocessed df OR let this class load the CSV.
    """

    def __init__(
        self,
        df: pd.DataFrame = None,
        data_path: str = "data/raw/detailed_meals_macros_.csv",
        target_column: str = "obese",
        test_size: float = 0.2,
        random_state: int = 42,
        bmi_threshold: float = 30.0,
        drop_height_weight_from_X: bool = True
    ):
        # Configuration
        self.df = df
        self.data_path = data_path
        self.target_column = target_column
        self.test_size = test_size
        self.random_state = random_state
        self.bmi_threshold = bmi_threshold
        self.drop_height_weight_from_X = drop_height_weight_from_X

        # Fitted objects, populated by build_model()
        self.model = None
        self.scaler = None

        # Data splits, populated by split_data()
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None
        self._X_test_scaled = None

        # Feature names, populated by build_features()
        self.feature_columns_ = None
```

Note: We used the pipeline approach in the previous presentation, so we can use it again here to avoid issues.
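
As one reading of the note above, scaling and classification can be bundled in a scikit-learn `Pipeline` so that the scaler is always fit on training data only. The sketch below uses synthetic data, and the step names `"scaler"` and `"clf"` are arbitrary labels chosen for this example; it is not the pipeline from the earlier presentation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(120, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The step names "scaler" and "clf" are arbitrary labels chosen here.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(C=1.0, max_iter=1000)),
])

# fit() fits the scaler on X_train only, then trains the classifier on the
# scaled output; predict() and score() reapply the same fitted scaler.
pipe.fit(X_train, y_train)
print("test accuracy:", pipe.score(X_test, y_test))
```

With this design there is no separately stored scaled test matrix to keep in sync, since the pipeline applies the fitted scaler on every call.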

```python
    # Methods of LogisticRegressionModel (continued from the class definition
    # above), so every def is indented one level inside the class body.

    def load_data(self) -> pd.DataFrame:
        # Prefer a DataFrame passed to __init__; otherwise read the CSV
        if self.df is not None:
            return self.df.copy()

        return pd.read_csv(self.data_path)

    def _normalize_columns(self, df: pd.DataFrame) -> pd.DataFrame:
        # Strip whitespace, lowercase, and replace spaces with underscores
        df = df.copy()
        df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

        # Handle duplicate-like names such as dinner_protein.1
        # Make columns unique if pandas auto-added suffixes
        # (safe no-op if already unique)
        new_cols = []
        seen = {}
        for c in df.columns:
            if c not in seen:
                seen[c] = 0
                new_cols.append(c)
            else:
                seen[c] += 1
                new_cols.append(f"{c}_{seen[c]}")
        df.columns = new_cols

        return df

    def _create_obesity_label(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Creates BMI + obese label if not already present.
        Assumes height in cm and weight in kg if available.
        """
        df = df.copy()

        if self.target_column in df.columns:
            return df

        # Try common column names after normalization
        height_col = "height"
        weight_col = "weight"

        if height_col in df.columns and weight_col in df.columns:
            height_m = df[height_col] / 100.0
            bmi = df[weight_col] / (height_m ** 2)

            df["bmi"] = bmi
            df[self.target_column] = (df["bmi"] >= self.bmi_threshold).astype(int)
        else:
            raise ValueError(
                f"Cannot create target '{self.target_column}'. "
                f"Expected columns '{height_col}' and '{weight_col}' to compute BMI."
            )

        return df

    def _drop_text_columns(self, df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()

        # Drop any column that looks like suggestion text
        drop_cols = [c for c in df.columns if "suggestion" in c]
        # Your dataset also has a "disease" column that looks like goal text
        if "disease" in df.columns:
            drop_cols.append("disease")

        existing = [c for c in drop_cols if c in df.columns]
        if existing:
            df.drop(columns=existing, inplace=True)

        return df

    def _encode_categoricals(self, df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()

        # Identify object columns except target
        cat_cols = [
            c for c in df.columns
            if df[c].dtype == "object" and c != self.target_column
        ]

        if cat_cols:
            # dtype=int yields 0/1 columns instead of booleans
            df = pd.get_dummies(df, columns=cat_cols, drop_first=True, dtype=int)

        return df

    def preprocess(self) -> pd.DataFrame:
        df = self.load_data()
        df = self._normalize_columns(df)
        df = df.drop_duplicates()

        # Basic missing-value handling
        # (We'll do final numeric fill later)
        df = df.dropna(subset=[c for c in df.columns if c in ["height", "weight"]], how="any")

        df = self._create_obesity_label(df)
        df = self._drop_text_columns(df)
        df = self._encode_categoricals(df)

        # height, weight, and bmi stay in df for reference; build_features()
        # drops them from X when drop_height_weight_from_X is True.

        # Fill remaining missing numeric with median
        for c in df.columns:
            if c != self.target_column and pd.api.types.is_numeric_dtype(df[c]):
                df[c] = df[c].fillna(df[c].median())

        self.df = df
        return df

    def build_features(self):
        if self.df is None:
            self.preprocess()

        df = self.df.copy()

        y = df[self.target_column].astype(int)

        X = df.drop(columns=[self.target_column])

        # Remove the direct BMI inputs from the features to reduce leakage
        if self.drop_height_weight_from_X:
            for col in ["height", "weight", "bmi"]:
                if col in X.columns:
                    X = X.drop(columns=[col])

        # Keep feature names for later reference
        self.feature_columns_ = X.columns.tolist()

        return X, y

    def split_data(self):
        X, y = self.build_features()

        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            X, y,
            test_size=self.test_size,
            random_state=self.random_state,
            stratify=y
        )

    def build_model(self, C: float = 1.0, max_iter: int = 1000):
        if self.X_train is None:
            self.split_data()

        # Fit the scaler on the training split only, then reuse it on the test split
        self.scaler = StandardScaler()
        X_train_scaled = self.scaler.fit_transform(self.X_train)
        X_test_scaled = self.scaler.transform(self.X_test)

        self.model = LogisticRegression(C=C, max_iter=max_iter)
        self.model.fit(X_train_scaled, self.y_train)

        # Store scaled test for evaluation convenience
        self._X_test_scaled = X_test_scaled

    def evaluate_model(self):
        if self.model is None:
            self.build_model()

        y_pred = self.model.predict(self._X_test_scaled)

        acc = accuracy_score(self.y_test, y_pred)
        cm = confusion_matrix(self.y_test, y_pred)
        report = classification_report(self.y_test, y_pred, target_names=["Non-Obese", "Obese"])

        return {
            "accuracy": acc,
            "confusion_matrix": cm,
            "classification_report": report
        }

    def get_coefficients(self) -> pd.DataFrame:
        if self.model is None or self.feature_columns_ is None:
            raise ValueError("Train the model before requesting coefficients.")

        coef = self.model.coef_.ravel()
        return pd.DataFrame({
            "feature": self.feature_columns_,
            "coefficient": coef
        }).sort_values(by="coefficient", ascending=False)

    def predict(self, new_df: pd.DataFrame):
        if self.model is None or self.scaler is None:
            raise ValueError("Train the model before calling predict().")

        tmp = self._normalize_columns(new_df)
        tmp = self._drop_text_columns(tmp)

        # One-hot encode without drop_first: a small batch may not contain
        # every category, so dropping its first level would mis-encode rows.
        cat_cols = [c for c in tmp.columns if tmp[c].dtype == "object"]
        if cat_cols:
            tmp = pd.get_dummies(tmp, columns=cat_cols, dtype=int)

        # Align columns to training features: missing dummies become 0,
        # and columns not seen during training are discarded
        tmp = tmp.reindex(columns=self.feature_columns_, fill_value=0)

        X_scaled = self.scaler.transform(tmp)
        return self.model.predict(X_scaled)
```
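
The column-alignment step in predict() deserves a closer look, because one-hot encoding a small batch of new rows rarely produces the same columns as the training data. The following is a minimal sketch on hypothetical toy data (the `age` and `activity_level` columns and their values are assumptions made for illustration): the new rows are encoded with every level kept, and `reindex` then selects exactly the training columns.

```python
import pandas as pd

# Hypothetical training frame with one categorical column.
train = pd.DataFrame({
    "age": [25, 40, 33],
    "activity_level": ["low", "high", "moderate"],
})
train_X = pd.get_dummies(train, columns=["activity_level"], drop_first=True, dtype=int)
feature_columns = train_X.columns.tolist()

# A single new row: with drop_first=True its only category would be dropped,
# so encode every level here and let reindex select the training columns.
new = pd.DataFrame({"age": [51], "activity_level": ["moderate"]})
new_X = pd.get_dummies(new, columns=["activity_level"], dtype=int)
new_X = new_X.reindex(columns=feature_columns, fill_value=0)

print(feature_columns)
print(new_X)
```

Applying `drop_first=True` to the single new row instead would remove its only dummy column, and the row would be silently treated as the baseline category.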

### Summary

In the code above, the LogisticRegressionModel class handles data loading, preprocessing, model training, and evaluation. It reimplements the cleaning and feature-engineering steps internally, so it does not import the project’s DataCleaner or DataPreprocessor utilities. After instantiating the class and calling build_model(), call evaluate_model() to obtain the accuracy, confusion matrix, and classification report for the test data. These metrics indicate how well the logistic regression model classifies individuals as obese or non-obese.
