#### K-Means Clustering

In [338]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA , KernelPCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import optuna

In [339]:
# Noramlization
Numerical_features=[ 'Age','Height','Weight' ,'FCVC' ,'NCP' ,'CH2O' ,'FAF' , 'TUE' ]

In [340]:
import pandas as pd

`🧰 ColumnTransformer with MinMaxScaler — Explanation `

The `ColumnTransformer` in Scikit-learn allows you to apply different preprocessing techniques to specific columns of your dataset. This is especially useful when you have a mix of numerical, categorical, and binary features.

`🧾 ColumnTransformer Component Breakdown`

| **Component**               | **Purpose**                                                                     |
|-----------------------------|----------------------------------------------------------------------------------|
| `'Normalization'`           | Just a **name** for this transformation step (can be anything)                  |
| `MinMaxScaler()`            | Applies **Min-Max scaling** to selected columns (scales values to 0–1 range)   |
| `NumericalFeatures`         | List of **column names or indices** that are numeric and should be scaled       |
| `remainder='passthrough'`   | Tells it to **leave the rest of the columns unchanged**                         |



In [341]:
''' When we have a mix of numerical features and categorical features we use this just to normalize only the numerical features ''' 
PreprocessingFeatures = ColumnTransformer(
    [
        ('Normalization', MinMaxScaler(), Numerical_features)
    ],
    remainder='passthrough',n_jobs=2
)


`⚙️ Pipeline in Scikit-learn`

When building machine learning models in Scikit-learn, you can either manually preprocess your data and train the model, or use a `Pipeline` to automate and streamline the process.

In [342]:
pca=PCA(0.99,random_state=30)

ClusteringAnalysis = Pipeline(
    [
        ('Preprocessing',PreprocessingFeatures), #0
        ('DimensionalReduction',pca),            #1
        ('Clustering',KMeans(n_clusters=4,random_state=30)), #2
    ] 
)

# three clusters underweight or normal weight, overweight and obese.

In [343]:
df=pd.read_csv('ObesityDataSet_Cleaned.csv')

In [344]:
LabelsClusters = ClusteringAnalysis.fit_predict(df)

In [345]:
pca.explained_variance_ratio_.sum()
# this value closed to one means the new principle components represent the old dataset with old features well

0.9999294424826446

In [346]:
TransformedDataset = ClusteringAnalysis[:2].transform(df) # here we only do the preprocesssing and dimensionality reduction

In [347]:
TransformedDataset.shape #here we see we have 7 principle components

(2111, 1)

In [348]:
_score = silhouette_score(TransformedDataset,LabelsClusters)
print(f'Silhouette Score :: {_score:.4f}')

Silhouette Score :: 0.5710


`📊 What is the Silhouette Score?`

The **Silhouette Score** measures how similar a data point is to **its own cluster** compared to **other clusters**.

It tells you how well each point fits into its assigned cluster.

---

`📐 Formula Intuition:`

For each data point:

- Let **a** = average distance to all other points in the **same cluster** (intra-cluster distance)
- Let **b** = average distance to all points in the **nearest other cluster** (nearest-cluster distance)

Then the **Silhouette Score** is computed as:

\[
\text{Silhouette Score} = \frac{b - a}{\max(a, b)}
\]

---

`🧠 What Do the Values Mean?`

| **Silhouette Score**       | **Interpretation**                                                 |
|----------------------------|---------------------------------------------------------------------|
| **+1.0 (close to 1)**       | Perfectly matched to its own cluster, far from other clusters       |
| **0**                      | On or very close to the boundary between two clusters               |
| **-1.0 (close to -1)**      | Likely assigned to the wrong cluster                                |


A **higher silhouette score** indicates better-defined and more clearly separated clusters.


#### Preprocesising Pipeline

here we will use the StandardScaler library here for noramization

In [349]:
from sklearn.preprocessing import StandardScaler

In [350]:
# these are the features that we are going to use for normalization
df[Numerical_features] 

Unnamed: 0,Age,Height,Weight,FCVC,NCP,CH2O,FAF,TUE
0,21.000000,1.620000,64.000000,2.0,3.0,2.000000,0.000000,1.000000
1,21.000000,1.520000,56.000000,3.0,3.0,3.000000,3.000000,0.000000
2,23.000000,1.800000,77.000000,2.0,3.0,2.000000,2.000000,1.000000
3,27.000000,1.800000,87.000000,3.0,3.0,2.000000,2.000000,0.000000
4,22.000000,1.780000,89.800000,2.0,1.0,2.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...
2106,20.976842,1.710730,131.408528,3.0,3.0,1.728139,1.676269,0.906247
2107,21.982942,1.748584,133.742943,3.0,3.0,2.005130,1.341390,0.599270
2108,22.524036,1.752206,133.689352,3.0,3.0,2.054193,1.414209,0.646288
2109,24.361936,1.739450,133.346641,3.0,3.0,2.852339,1.139107,0.586035


In [351]:
df_copy=df.copy()
df_copy.head()
df_copy=df_copy.drop(columns=['NObeyesdad'],axis=1)
df.head()


Unnamed: 0.1,Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,NObeyesdad,BMI,MTRANS_Automobile,MTRANS_Bike,MTRANS_Motorbike,MTRANS_Public_Transportation,MTRANS_Walking
0,0,0,21.0,1.62,64.0,1,0,2.0,3.0,1,0,2.0,0,0.0,1.0,0,1,24.386526,0,0,0,1,0
1,1,0,21.0,1.52,56.0,1,0,3.0,3.0,1,1,3.0,1,3.0,0.0,1,1,24.238227,0,0,0,1,0
2,2,1,23.0,1.8,77.0,1,0,2.0,3.0,1,0,2.0,0,2.0,1.0,2,1,23.765432,0,0,0,1,0
3,3,1,27.0,1.8,87.0,0,0,3.0,3.0,1,0,2.0,0,2.0,0.0,2,5,26.851852,0,0,0,0,1
4,4,1,22.0,1.78,89.8,0,0,2.0,1.0,1,0,2.0,0,0.0,0.0,1,6,28.342381,0,0,0,1,0


In [352]:
X_train, X_test, y_train, y_test = train_test_split(df_copy,df['NObeyesdad'],test_size=0.2,random_state=30)

### Logistic Regression

In [353]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [354]:
PreprocessingFeatures = ColumnTransformer(
    [
        ('Normalization', StandardScaler(), Numerical_features)
    ],
    remainder='passthrough')

In [355]:
PreprocessingFeatures

In [356]:
def objective(trial):
    # Suggest hyperparameters
    C = trial.suggest_float('LogisticRegressionModel__C',1e-10,2,log=True) #considering float values between 1e-10 and 2 
    #use log=True for logarithmic scale
    l1_ratio=trial.suggest_float('LogisticRegressionModel__l1_ratio',0,1) #considering float values between 0 and 1
    
    # Create pipeline
    pipeline = Pipeline([
        ('Preprocessing',PreprocessingFeatures),
        ('LogisticRegressionModel', LogisticRegression(
            penalty='elasticnet',
            solver='saga',
            random_state=40,
            l1_ratio=l1_ratio,
            C=C
            ))
    ])

    # Cross-validate
    score = cross_val_score(pipeline, X_train, y_train, cv=3, scoring='accuracy')
    return score.mean()
    #optuna tries to maximize the accuracy score
    # we can also use other scoring metrics like f1, precision, recall, roc_auc
    

In [357]:
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.RandomSampler(seed=30))
study.optimize(objective, n_trials=15)

[I 2025-08-03 07:20:59,133] A new study created in memory with name: no-name-7679875f-f1a9-4c55-8b0c-40c2374734d0
[I 2025-08-03 07:20:59,468] Trial 0 finished with value: 0.23696242591269867 and parameters: {'LogisticRegressionModel__C': 0.0004318589122639343, 'LogisticRegressionModel__l1_ratio': 0.38074848963511654}. Best is trial 0 with value: 0.23696242591269867.
[I 2025-08-03 07:20:59,754] Trial 1 finished with value: 0.25770054929426117 and parameters: {'LogisticRegressionModel__C': 0.0006762018606196053, 'LogisticRegressionModel__l1_ratio': 0.16365072610275333}. Best is trial 1 with value: 0.25770054929426117.
[I 2025-08-03 07:21:00,141] Trial 2 finished with value: 0.26362437290485435 and parameters: {'LogisticRegressionModel__C': 0.8238572402170163, 'LogisticRegressionModel__l1_ratio': 0.34666184037976566}. Best is trial 2 with value: 0.26362437290485435.
[I 2025-08-03 07:21:00,516] Trial 3 finished with value: 0.26362437290485435 and parameters: {'LogisticRegressionModel__C': 

In [358]:
print("Best params:", study.best_trial.params)
print("Best score:", study.best_trial.value)

Best params: {'LogisticRegressionModel__C': 0.8238572402170163, 'LogisticRegressionModel__l1_ratio': 0.34666184037976566}
Best score: 0.26362437290485435


In [359]:
from sklearn.metrics import accuracy_score

In [360]:
best_params = study.best_trial.params

In [361]:
best_params

{'LogisticRegressionModel__C': 0.8238572402170163,
 'LogisticRegressionModel__l1_ratio': 0.34666184037976566}

In [362]:

# Rebuild pipeline with best params
best_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('LogisticRegressionModel', LogisticRegression(
        C=best_params['LogisticRegressionModel__C'],
        l1_ratio=best_params['LogisticRegressionModel__l1_ratio'],
        penalty='elasticnet',
        solver='saga',
        random_state=30
    ))
])

# Fit the full pipeline on all training data
best_pipeline.fit(X_train, y_train)

y_pred_test = best_pipeline.predict(X_test)
y_pred_train = best_pipeline.predict(X_train)
test_score=accuracy_score(y_test, y_pred_test)
train_score=accuracy_score(y_train, y_pred_train)
print(f" test accuracy score: {test_score:.4f} and train accuracy score: {train_score:.4f}")


 test accuracy score: 0.8865 and train accuracy score: 0.9224




### Random Forest Algorithm

In [363]:
from sklearn.ensemble import RandomForestClassifier

In [364]:
def objective(trial):
    # Suggest hyperparameters
    n_estimators = trial.suggest_int('RandomForestModel__n_estimators',1,100) #considering integer values between 1 and 100 
    max_depth = trial.suggest_int( 'RandomForestModel__max_depth',1,12) 
    criterion=trial.suggest_categorical('RandomForestModel__criterion',['gini','entropy']) #considering float values between 0 and 1
    
    # Create pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('RandomForestModel', RandomForestClassifier(
            random_state=30,
            n_estimators=n_estimators,
            max_depth=max_depth,
            criterion=criterion,
            n_jobs=-1
            ))
    ])

    # Cross-validate
    score = cross_val_score(pipeline, X_train, y_train, cv=3, scoring='accuracy')
    return score.mean()
    #optuna tries to maximize the accuracy score
    # we can also use other scoring metrics like f1, precision, recall, roc_auc
    

In [365]:
study_2= optuna.create_study(direction='maximize', sampler=optuna.samplers.RandomSampler(seed=30))
study_2.optimize(objective, n_trials=15)

[I 2025-08-03 07:21:02,576] A new study created in memory with name: no-name-204365d2-fb63-4b92-8d53-cecef3b76ad6
[I 2025-08-03 07:21:02,969] Trial 0 finished with value: 0.9745263996257973 and parameters: {'RandomForestModel__n_estimators': 65, 'RandomForestModel__max_depth': 5, 'RandomForestModel__criterion': 'gini'}. Best is trial 0 with value: 0.9745263996257973.
[I 2025-08-03 07:21:03,472] Trial 1 finished with value: 0.9763015450613031 and parameters: {'RandomForestModel__n_estimators': 97, 'RandomForestModel__max_depth': 5, 'RandomForestModel__criterion': 'gini'}. Best is trial 1 with value: 0.9763015450613031.
[I 2025-08-03 07:21:03,831] Trial 2 finished with value: 0.9834179292849482 and parameters: {'RandomForestModel__n_estimators': 59, 'RandomForestModel__max_depth': 5, 'RandomForestModel__criterion': 'entropy'}. Best is trial 2 with value: 0.9834179292849482.
[I 2025-08-03 07:21:04,127] Trial 3 finished with value: 0.9893417528955414 and parameters: {'RandomForestModel__n_

In [366]:
best_params= study_2.best_trial.params
best_params.keys()

dict_keys(['RandomForestModel__n_estimators', 'RandomForestModel__max_depth', 'RandomForestModel__criterion'])

In [367]:


# Rebuild pipeline with best params
RandomForest_best_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('LogisticRegressionModel',RandomForestClassifier(
            random_state=30,
            n_estimators=best_params['RandomForestModel__n_estimators'],
            max_depth=best_params['RandomForestModel__max_depth'],
            criterion=best_params['RandomForestModel__criterion'],
            n_jobs=-1
            ))
])

# Fit the full pipeline on all training data
RandomForest_best_pipeline.fit(X_train, y_train)
y_pred_test = RandomForest_best_pipeline.predict(X_test)
y_pred_train = RandomForest_best_pipeline.predict(X_train)

test_score=accuracy_score(y_test, y_pred_test)
train_score=accuracy_score(y_train, y_pred_train)
print(f" test accuracy score: {test_score:.4f} and train accuracy score: {train_score:.4f}")


 test accuracy score: 0.9882 and train accuracy score: 1.0000


### XGBoost

In [368]:
from xgboost import XGBClassifier

In [369]:
def objective(trial):
    # Suggest hyperparameters
    n_estimators = trial.suggest_int('XGBoostModel__n_estimators',1,100) #considering integer values between 1 and 100 
    max_depth = trial.suggest_int( 'XGBoostModel__max_depth',1,12) 
    learning_rate=trial.suggest_float('XGBoostModel__learning_rate',0.01,1,log=True) #considering float values between 0 and 1
    
    # Create pipeline
    pipeline = Pipeline([
        ('Preprocessing',PreprocessingFeatures),
        ('XGBoostModel', XGBClassifier(
            random_state=30,
            n_estimators=n_estimators,
            max_depth=max_depth,
            learning_rate=learning_rate,
            n_jobs=-1
            ))
    ])

    # Cross-validate
    score = cross_val_score(pipeline, X_train, y_train, cv=3, scoring='accuracy')
    return score.mean()
    #optuna tries to maximize the accuracy score
    # we can also use other scoring metrics like f1, precision, recall, roc_auc
    

In [370]:
study_3= optuna.create_study(direction='maximize', sampler=optuna.samplers.RandomSampler(seed=30))
study_3.optimize(objective, n_trials=15)

[I 2025-08-03 07:21:07,744] A new study created in memory with name: no-name-78107785-6236-4eee-8b98-996f8ae001f6
[I 2025-08-03 07:21:08,521] Trial 0 finished with value: 0.9881534062775885 and parameters: {'XGBoostModel__n_estimators': 65, 'XGBoostModel__max_depth': 5, 'XGBoostModel__learning_rate': 0.21188285231627235}. Best is trial 0 with value: 0.9881534062775885.
[I 2025-08-03 07:21:08,847] Trial 1 finished with value: 0.9846010084090272 and parameters: {'XGBoostModel__n_estimators': 17, 'XGBoostModel__max_depth': 12, 'XGBoostModel__learning_rate': 0.04935415046269999}. Best is trial 0 with value: 0.9881534062775885.
[I 2025-08-03 07:21:09,813] Trial 2 finished with value: 0.9875613399661617 and parameters: {'XGBoostModel__n_estimators': 100, 'XGBoostModel__max_depth': 3, 'XGBoostModel__learning_rate': 0.14838449980044874}. Best is trial 0 with value: 0.9881534062775885.
[I 2025-08-03 07:21:10,200] Trial 3 finished with value: 0.985193074720454 and parameters: {'XGBoostModel__n_e

In [371]:
best_params= study_3.best_trial.params
best_params.keys()

dict_keys(['XGBoostModel__n_estimators', 'XGBoostModel__max_depth', 'XGBoostModel__learning_rate'])

In [372]:

# Rebuild pipeline with best params
XGBClassifier_best_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('LogisticRegressionModel',XGBClassifier(
            random_state=30,
            n_estimators=best_params['XGBoostModel__n_estimators'],
            max_depth=best_params['XGBoostModel__max_depth'],
            learning_rate=best_params['XGBoostModel__learning_rate'],
            n_jobs=-1
            ))
])

# Fit the full pipeline on all training data
RandomForest_best_pipeline.fit(X_train, y_train)
y_pred_test = RandomForest_best_pipeline.predict(X_test)
y_pred_train = RandomForest_best_pipeline.predict(X_train)

test_score=accuracy_score(y_test, y_pred_test)
train_score=accuracy_score(y_train, y_pred_train)
print(f" test accuracy score: {test_score:.4f} and train accuracy score: {train_score:.4f}")

 test accuracy score: 0.9882 and train accuracy score: 1.0000
