<a href="https://www.kaggle.com/code/cid007/obesity-risk-eda-bl?scriptVersionId=162017463" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<div style="border-radius:10px;border:#D2222D solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">

## 🚀 Getting Started
This is a multi-class classification task: predicting obesity risk in individuals, which is related to cardiovascular disease.

## 🔧 Tools and Libraries

We will be using Python for this project, along with several libraries for data analysis and machine learning. Here are the main libraries we'll be using:

- **Pandas**: For data manipulation and analysis.
- **NumPy**: For numerical computations.
- **Matplotlib and Seaborn**: For data visualization.
- **Scikit-learn**: For machine learning tasks, including data preprocessing, model training, and model evaluation.
- **Gradient Boosting (e.g., XGBoost, LightGBM)**: An ensemble method that builds decision trees sequentially. It often yields high predictive performance and handles complex relationships and feature interactions.

## 📈 Workflow

Here's a brief overview of our workflow for this project:

1. **Data Loading and Preprocessing**: Load the data and preprocess it for analysis and modeling. This includes handling missing values, encoding categorical variables, and scaling numerical variables.

2. **Exploratory Data Analysis (EDA)**: Explore the data to gain insights and understand the relationships between the different **`features`** and the target.

3. **Model Training**: Train the model on the preprocessed data.

4. **Model Evaluation**: Evaluate the model's performance using various metrics, such as accuracy, precision, recall, F1-score, Cohen's Kappa, and Matthews Correlation Coefficient.

5. **Error Analysis**: Analyze the instances where the model made errors to gain insights into potential improvements.

6. **Future Work**: Based on our findings, suggest potential directions for future work.
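
The metrics named in step 4 are all available in scikit-learn. A minimal sketch on made-up labels follows; the `y_true` and `y_pred` arrays are illustrative only and are not results from this notebook.

```python
import numpy as np
from sklearn import metrics

# Toy labels for a three-class problem (illustrative only)
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1])

# Macro averaging gives every class the same weight, whatever its size
scores = {
    "accuracy": metrics.accuracy_score(y_true, y_pred),
    "precision_macro": metrics.precision_score(y_true, y_pred, average="macro"),
    "recall_macro": metrics.recall_score(y_true, y_pred, average="macro"),
    "f1_macro": metrics.f1_score(y_true, y_pred, average="macro"),
    "cohen_kappa": metrics.cohen_kappa_score(y_true, y_pred),
    "mcc": metrics.matthews_corrcoef(y_true, y_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```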

<div style="border-radius:10px;border:#D2222D solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">
    
# <span style="color:#094863; font-size: 1%|;">Loading Libraries</span>

In [None]:
# Data Manipulation and Analysis
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Utilities
import pprint
import warnings

# Data Preprocessing
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, QuantileTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn import metrics, linear_model, tree, naive_bayes,neighbors, ensemble, neural_network, svm

# Feature Selection
from sklearn.feature_selection import mutual_info_classif


# Model Building
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import lightgbm as lgb
import xgboost as xgb

# Cross validation
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score, cross_val_predict

# Statistical Analysis
from scipy.stats import chi2_contingency

# Hyperparameter Tuning
import optuna

# Data Splitting
from sklearn.model_selection import train_test_split

# Metrics
from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap

def set_color_map(color_list):
    cmap_custom = ListedColormap(color_list)
    print("Notebook Color Schema:")
    sns.palplot(sns.color_palette(color_list))
    plt.show()
    return cmap_custom

color_list = ['royalblue', 'cyan','yellow', 'orange']
cmap_custom = set_color_map(color_list)

In [None]:
!pip install catboost

In [None]:
from catboost import CatBoostClassifier

In [None]:
# ignore warnings
warnings.filterwarnings("ignore", category= UserWarning)
optuna.logging.set_verbosity(optuna.logging.WARNING)

In [None]:
# Define the style
rc = {
    "axes.facecolor": "#dcf5f7",
    "figure.facecolor": "#dcf5f7",
    "axes.edgecolor": "#000000",
    "grid.color": "#094863",
    "font.family": "arial",
    "axes.labelcolor": "#000000",
    "xtick.color": "#000000",
    "ytick.color": "#000000",
    "grid.alpha": 0.4,
}
sns.set(rc=rc)

<div style="border-radius:10px;border:#D2222D solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">

# <span style="color:#094863; font-size: 1%|;"> Import Data and Exploration</span>

In [None]:
# import the csv files
train = pd.read_csv('/kaggle/input/playground-series-s4e2/train.csv')
test = pd.read_csv('/kaggle/input/playground-series-s4e2/test.csv')
submission=pd.read_csv('/kaggle/input/playground-series-s4e2/sample_submission.csv')

In [None]:
train_df=train.copy()
test_df=test.copy()

In [None]:
train_df.shape, test_df.shape

In [None]:
train_df.head()

In [None]:
train_df.info()

In [None]:
test_df.info()

In [None]:
# Define the function that creates a missing-value heatmap
def plot_missing_data(dataset, title):
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.set_title(title)
    sns.heatmap(dataset, cbar=False, ax=ax)

In [None]:
plot_missing_data(train_df.isnull(),"Training Data")

In [None]:
plot_missing_data(test_df.isnull(),"Test Data")

#### There appear to be no missing values in either the train or the test data.
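
A heatmap can hide a handful of missing cells in a large frame, so a numeric count is a useful cross-check. A minimal sketch on a small made-up frame follows; the `demo` DataFrame and the `missing_summary` helper are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and percentages, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(df),
    })
    return summary.sort_values("missing", ascending=False)

# Made-up frame with one missing numeric cell and one missing string cell
demo = pd.DataFrame({
    "Age": [21.0, np.nan, 35.0, 40.0],
    "Gender": ["Male", "Female", None, "Female"],
    "Weight": [64.0, 80.5, 77.0, 90.2],
})
summary = missing_summary(demo)
print(summary)
```

On the notebook's data the call would be `missing_summary(train_df)`.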

In [None]:
# Check duplicate values
train_df.duplicated().sum(), test_df.duplicated().sum()

#### There appear to be no duplicate rows in either the train or the test data.

In [None]:
train_df.drop(['id'],axis=1).describe().T.style.bar(subset=['mean'],color='#7BCC70')\
    .background_gradient(subset=['std'], cmap='Reds')\
    .background_gradient(subset=['50%'], cmap='coolwarm')

In [None]:
train_df.describe(include="object").T.style.bar(subset=['unique'],color='#7BCC70')\
    .background_gradient(subset=['freq'], cmap='Reds')

In [None]:
string_columns=[f for f in train_df.columns if train_df[f].dtype == object and f != 'NObeyesdad']
numeric_columns=[f for f in train_df.columns if f not in string_columns and f not in ['id', 'NObeyesdad']]
print(string_columns)
print(numeric_columns)

In [None]:
def unique_values(data):
    """Return a two-row table: non-null count and number of unique values per column."""
    unq = pd.DataFrame(data.count())
    unq.columns = ['Total']
    unq['Uniques'] = [data[col].nunique() for col in data.columns]
    return unq.T

In [None]:
unique_values(train_df[string_columns])

In [None]:
#Check categories for each categorical attribute
pd.set_option('display.max_colwidth',0)
cat=[]
for col in string_columns:
    catlist=train_df[col].value_counts().index.to_list()
    cat.append([col,catlist])
pd.DataFrame(cat,columns=['Column Name','Categories']).set_index('Column Name').rename_axis(None)

<div style="border-radius:10px;border:#D2222D solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">

# <span style="color:#094863; font-size: 1%|;"> Exploratory Data Analysis </span>

In [None]:
# Let's check the correlation matrix to see how the numeric features relate to each other

corrMatrix = train_df[numeric_columns].corr()
sns.heatmap(corrMatrix, annot=True,cmap='RdYlGn')
plt.show()


<div style="border-radius:10px;border:#D2222D solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">
<strong>1. Weight is strongly correlated with Age, Height, FCVC and CH2O. Using feature engineering, we can derive new features from these or drop some and check the results.</strong><br>
<strong>2. Age is inversely correlated with FAF.</strong><br>
<strong>3. All these correlations can be checked further by drawing scatter plots of the related variables.</strong>
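
Before drawing scatter plots, it can help to rank the feature pairs by absolute correlation so that only the strongest relationships are plotted. A minimal sketch on synthetic data follows; the `top_correlated_pairs` helper and the generated columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=3):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

rng = np.random.default_rng(0)
height = rng.normal(1.70, 0.09, 500)
demo = pd.DataFrame({
    "Height": height,
    "Weight": 60 * height + rng.normal(0, 3, 500),  # built to track Height
    "Noise": rng.normal(0, 1, 500),                  # unrelated column
})
pairs = top_correlated_pairs(demo)
print(pairs)
```

On the notebook's data the call would be `top_correlated_pairs(train_df[numeric_columns])`.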

In [None]:
# Check distribution of numeric features
fig, axe = plt.subplots(nrows=4, ncols=2, figsize=(20, 20))
axe = axe.flatten()
sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})
for ax, feature in zip(axe, numeric_columns):
    sns.histplot(data=train_df, x=feature, kde=True, ax=ax)
    ax.set_title(feature)
    ax.set_ylabel("")
    ax.set_xlabel("")

<div style="border-radius:10px;border:#D2222D solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">
<strong>1. Age, Height and Weight are not normally distributed; they may need to be transformed toward a normal distribution.</strong><br>
<strong>2. The other numeric features require feature engineering and further probing. Binning might be one option.</strong>
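
One way to pull a skewed feature toward a normal shape is `QuantileTransformer` with `output_distribution="normal"`, which is already imported at the top of this notebook. A minimal sketch on synthetic right-skewed data follows; the generated `age` column is made up, and whether this helps on the real features is untested here.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(42)
# Right-skewed stand-in for a feature such as Age (synthetic, not the real column)
age = rng.lognormal(mean=3.1, sigma=0.35, size=(2000, 1))

# Map the empirical quantiles of the feature onto a standard normal distribution
qt = QuantileTransformer(n_quantiles=500, output_distribution="normal", random_state=42)
age_gauss = qt.fit_transform(age)

skew_before = skew(age.ravel())
skew_after = skew(age_gauss.ravel())
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Tree-based models are unaffected by monotonic transforms of a feature, so this step matters mainly for distance-based or linear models such as the KNN baseline.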

In [None]:
# Check categorical features.
def plot_categorical_variables(df):
    for column in df.columns:
        if df[column].dtype == 'object' or len(df[column].unique()) < 10:
            plt.figure(figsize=(12, 6))
            sns.countplot(x=column, data=df,palette='rainbow')
            plt.title(f'Distribution of {column}')
            plt.show()

In [None]:
plot_categorical_variables(train_df)

#### There are seven categories of obesity risk. They appear roughly evenly distributed, so the data can be treated as balanced.

<div style="border-radius:10px;border:#D2222D solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">

# <span style="color:#094863; font-size: 1%|;"> Target Feature Distribution</span>

In [None]:
train_df['NObeyesdad'].value_counts(normalize=True).plot.bar(figsize=(12,6))
plt.xlabel('Obesity category')
plt.ylabel('Share of training samples')
plt.title('Target class distribution')
plt.show()


The class shares do not differ much, so the target is roughly evenly distributed.
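
The visual impression can be backed by a single number, such as the ratio between the largest and smallest class shares. A minimal sketch follows; the `imbalance_ratio` helper and the toy label series are hypothetical.

```python
import pandas as pd

def imbalance_ratio(labels):
    """Ratio of the most frequent to the least frequent class share (1.0 means perfectly balanced)."""
    shares = labels.value_counts(normalize=True)
    return shares.max() / shares.min()

# Toy labels: 30, 25 and 20 rows for three made-up classes
toy = pd.Series(["A"] * 30 + ["B"] * 25 + ["C"] * 20)
ratio = imbalance_ratio(toy)
print(f"imbalance ratio: {ratio:.2f}")
```

On the notebook's data the call would be `imbalance_ratio(train_df['NObeyesdad'])`.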

# <span style="color:#094863; font-size: 1%|;"> Train/Test distribution Check</span>



<div style="border-radius:10px;border:#D2222D solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">
<strong>To check whether the train and test data come from the same distribution, we combine them and add a feature that marks the origin of each row.
We then plot each variable with the data type (train or test) as the hue.</strong>
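
A numeric complement to overlaid histograms is a two-sample Kolmogorov-Smirnov test per numeric feature. A minimal sketch on synthetic frames follows; the `ks_drift_report` helper and the generated data are assumptions, and the notebook itself relies on the plots only.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def ks_drift_report(train, test, columns):
    """Run a two-sample KS test per column and return statistics and p-values."""
    rows = []
    for col in columns:
        stat, p_value = ks_2samp(train[col], test[col])
        rows.append({"feature": col, "ks_stat": stat, "p_value": p_value})
    return pd.DataFrame(rows).set_index("feature")

rng = np.random.default_rng(1)
# "same" is drawn from one distribution in both frames; "shifted" has a different mean in test
train_demo = pd.DataFrame({"same": rng.normal(0, 1, 1000), "shifted": rng.normal(0, 1, 1000)})
test_demo = pd.DataFrame({"same": rng.normal(0, 1, 1000), "shifted": rng.normal(1, 1, 1000)})

report = ks_drift_report(train_demo, test_demo, ["same", "shifted"])
print(report)
```

A small p-value for a feature suggests that its train and test distributions differ; with large samples even tiny differences can become significant, so the KS statistic itself is worth reading too.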


In [None]:
train_df['Data Type']='Train'
test_df['Data Type']='Test'

all=pd.concat([train_df.drop(['NObeyesdad'],axis=1),test_df],ignore_index=True)
all.shape


In [None]:
all.head()

In [None]:
# Check numeric features
plt.figure(figsize=(10,4*len(numeric_columns)))
for i, f in enumerate(numeric_columns,1):
    plt.subplot(len(numeric_columns),1,i)
    sns.histplot(data=all,x=f,hue='Data Type',kde=True,element='step',stat='density',common_norm=False,palette='bright')
    plt.title(f'Distribution of {f} by Data Type')
    #plt.xlabel('')
plt.tight_layout()
plt.show()

<div style="border-radius:10px;border:#D2222D solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">

## <span style="color:#094863; font-size: 1%|;"> Train/Test categorical features check</span>


In [None]:
#Plot categorical features
plt.figure(figsize=(10,4*len(string_columns)))
for i , f in enumerate(string_columns,1):
    plt.subplot(len(string_columns),1,i)
    sns.countplot(data=all,x=f,hue='Data Type',palette = 'pastel')
    plt.title(f'Distribution of {f} by Data Type')
plt.tight_layout()
plt.show()

<div style="border-radius:10px;border:#D2222D solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">

<strong>From the graphs above, the train and test features appear to follow the same distribution, which suggests they were drawn from the same source.</strong>
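
For categorical features, the same question can be asked with a chi-square test of independence between category and data source, using `chi2_contingency`, which is imported at the top of the notebook. A minimal sketch on made-up data follows; the column name `feature` and its values are illustrative.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy combined frame: one categorical feature plus a train/test indicator
combined = pd.DataFrame({
    "feature": ["Sometimes"] * 80 + ["Frequently"] * 20 + ["Sometimes"] * 78 + ["Frequently"] * 22,
    "Data Type": ["Train"] * 100 + ["Test"] * 100,
})

# Contingency table of category counts per data source
table = pd.crosstab(combined["feature"], combined["Data Type"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

A large p-value means the test found no evidence that the category shares differ between train and test.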

## <span style="color:#094863; font-size: 1%|;"> Prepare Data For Modelling</span>

In [None]:
category_mapping={
    'Obesity_Type_III':0,
    'Obesity_Type_II':1,
    'Normal_Weight':2,
    'Obesity_Type_I':3,
    'Insufficient_Weight':4,
    'Overweight_Level_II':5,
    'Overweight_Level_I':6,    
}
train_df['y']=train_df['NObeyesdad'].map(category_mapping)
train_df['y'].head()
# Display only: drop() returns a new frame, so 'Data Type' stays in train_df
train_df.drop(['Data Type'],axis=1)

In [None]:
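# Add ordinal (binned) features.
# pd.cut uses right-closed intervals by default, so a value equal to the lowest
# edge (e.g. FAF == 0) falls outside every bin and becomes NaN; the fill step
# in a later cell replaces those NaNs with the column minimum.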
def ord_feature(df):
    df['Age_Cat'] = pd.cut(df['Age'], bins=[0, 20, 30, 40,50,60, float('inf')],labels=[0,1,2,3,4,5])
    df['FCVC_Cat'] = pd.cut(df['FCVC'], bins=[1,2,3,4,5, float('inf')],labels=[1,2,3,4,5])
    df['NCP_Cat'] = pd.cut(df['NCP'], bins=[1,2,3,4,5, float('inf')],labels=[1,2,3,4,5])
    df['CH2O_Cat'] = pd.cut(df['CH2O'], bins=[0, 1, 2, 3,4, float('inf')],labels=[0,1,2,3,4])
    df['FAF_Cat'] = pd.cut(df['FAF'], bins=[0, 0.5, 1.0, 1.5, 2.5, 3.5, float('inf')],labels=[0,1,2,3,4,5])
    df['TUE_Cat'] = pd.cut(df['TUE'], bins=[0, 0.5, 1.0, 1.5, 2, 3, float('inf')],labels=[0,1,2,3,4,5])
    return df

In [None]:
ord_feature(train_df)
ord_feature(test_df)
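
`pd.cut` uses right-closed intervals by default, so a value equal to the lowest bin edge falls outside every bin and becomes `NaN`. Passing `include_lowest=True` closes the first interval on the left instead. A minimal sketch with made-up values, reusing the FAF bin edges from the function above:

```python
import pandas as pd

values = pd.Series([0.0, 0.4, 1.0, 2.7])
bins = [0, 0.5, 1.0, 1.5, 2.5, 3.5, float("inf")]
labels = [0, 1, 2, 3, 4, 5]

# Default behaviour: the first bin is (0, 0.5], so 0.0 is not covered
default_cut = pd.cut(values, bins=bins, labels=labels)
# include_lowest=True makes the first bin include its left edge
closed_cut = pd.cut(values, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(closed_cut.tolist())
```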

In [None]:
def fill(df,col): 
    minm=df[col].min()
    #print(minm)
    df[col]=df[col].fillna(minm)
    return df

In [None]:
for c in train_df.columns:
    fill(train_df,c)
for c in test_df.columns:
    fill(test_df,c)

In [None]:
catcol=[c for c in train_df.columns if train_df[c].dtype=='category']
#print(catcol)
train_df[catcol]=train_df[catcol].astype(int)
#print(train_df.dtypes)
test_df[catcol]=test_df[catcol].astype(int)

In [None]:
#split data into train and val set
X_train,X_test,y_train,y_test=train_test_split(train_df.drop(['id','NObeyesdad','y','Data Type'],axis=1),train_df['y'],test_size=0.2,random_state=42)
X_train.shape, y_train.shape, X_test.shape,y_test.shape

In [None]:
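from sklearn.preprocessing import LabelEncoder

# Fit one encoder per feature column on the full training frame so that
# X_train, X_test (the validation split) and train_df share the same integer codes
for colname in X_train.select_dtypes(['object','bool']).columns:
    le = LabelEncoder().fit(train_df[colname])
    X_train[colname] = le.transform(X_train[colname])
    X_test[colname] = le.transform(X_test[colname])
    train_df[colname] = le.transform(train_df[colname])

# Encode the remaining object columns of train_df (e.g. 'NObeyesdad', 'Data Type')
for colname in train_df.select_dtypes(['object','bool']).columns:
    train_df[colname]=LabelEncoder().fit_transform(train_df[colname])
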


In [None]:
# Use StandardScaler for numeric data transformation (fit on the training split only)
sc=StandardScaler()
sc.fit(X_train[numeric_columns])


X_train[numeric_columns]=sc.transform(X_train[numeric_columns])
X_test[numeric_columns]=sc.transform(X_test[numeric_columns])
train_df[numeric_columns]=sc.transform(train_df[numeric_columns])

train_df.head()


In [None]:
X_train[numeric_columns].describe().T.style.bar(subset=['mean'],color='#7BCC70')\
    .background_gradient(subset=['std'], cmap='Reds')\
    .background_gradient(subset=['50%'], cmap='coolwarm')
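
The encoding and scaling steps above can also be bundled into a single `ColumnTransformer` inside a `Pipeline`, so that the transformers are fit on the training split only and reused unchanged on validation and test data. A minimal sketch on a made-up frame follows; the column names, the `OrdinalEncoder` settings and the logistic regression model are assumptions, not the notebook's setup.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

train_demo = pd.DataFrame({
    "Age": [21, 35, 44, 19, 52, 30],
    "Weight": [60.0, 95.0, 110.0, 52.0, 88.0, 70.0],
    "Transport": ["Walk", "Car", "Car", "Bike", "Car", "Walk"],
})
y_demo = [0, 1, 1, 0, 1, 0]
# The test frame contains a category ("Bus") never seen during fitting
test_demo = pd.DataFrame({"Age": [28], "Weight": [72.0], "Transport": ["Bus"]})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Weight"]),
    # Unseen categories are mapped to -1 instead of raising an error
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["Transport"]),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
model.fit(train_demo, y_demo)

# The fitted preprocessing step is applied to new data without refitting
encoded_test = model.named_steps["prep"].transform(test_demo)
prediction = model.predict(test_demo)
print(encoded_test, prediction)
```

Because the encoder is fit once, a category that appears only in the test data gets an explicit code rather than silently shifting the codes of the other categories.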

<div style="border-radius:10px;border:#D2222D solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">

# <span style="color:#094863; font-size: 1%|;"> Define Baseline Models</span>

In [None]:
rand=42
import lightgbm as lgb
import xgboost as xgb
import catboost
class_models = {
    #Tree
    'decision_tree':{
        'model': tree.DecisionTreeClassifier(max_depth=7, random_state=rand)
    },
    
    #Nearest Neighbors
    'knn':{'model': neighbors.KNeighborsClassifier(n_neighbors=7)},
    
    #Ensemble Methods
    'gradient_boosting':{
        'model': ensemble.GradientBoostingClassifier(n_estimators=210)
    },
    
    'random_forest':{
        'model':ensemble.RandomForestClassifier(
            max_depth=11,class_weight='balanced', random_state=rand
        )
    },
    
    'XGBoost':{
        # class_weight is not an XGBClassifier parameter, so it is omitted here
        'model': xgb.XGBClassifier(
            max_depth=7, eval_metric='mlogloss', random_state=rand
        )
    },
    
    'LightGBM':{
        'model':lgb.LGBMClassifier(
            num_leaves=35,max_depth=7,class_weight='balanced', random_state=rand
        )
    },
    
    'CatBoost':{
        'model': catboost.CatBoostClassifier(
                iterations=100, depth=6, learning_rate=0.1,
                   loss_function='MultiClass', verbose=False
        )
    }

}

<div style="border-radius:10px;border:#D2222D solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">

## <span style="color:#094863; font-size: 1%|;"> Training And Inference</span>

In [None]:
from sklearn.preprocessing import label_binarize
for model_name, model_info in class_models.items():
    fitted_model = model_info['model'].fit(X_train, y_train)
    y_train_pred = fitted_model.predict(X_train)
    y_test_pred = fitted_model.predict(X_test)
    
    model_info['fitted'] = fitted_model
    model_info['preds'] = y_test_pred
    model_info['Accuracy_train'] = metrics.accuracy_score(y_train, y_train_pred)
    model_info['Accuracy_test'] = metrics.accuracy_score(y_test, y_test_pred)
    model_info['Recall_train'] = metrics.recall_score(y_train, y_train_pred, average='macro')
    model_info['Recall_test'] = metrics.recall_score(y_test, y_test_pred, average='macro')
    
    # For models supporting predict_proba, calculate additional metrics
    if hasattr(fitted_model, "predict_proba"):
        y_test_prob = fitted_model.predict_proba(X_test)
        # ROC AUC calculation for multi-class requires binarized labels
        y_test_binarized = label_binarize(y_test, classes=np.unique(y_train))
        if y_test_binarized.shape[1] == 1:  # Binarize returns a single column for two classes
            y_test_binarized = np.hstack((1 - y_test_binarized, y_test_binarized))
        model_info['ROC_AUC_test'] = metrics.roc_auc_score(y_test_binarized, y_test_prob, multi_class='ovr', average='macro')
    else:
        model_info['ROC_AUC_test'] = np.nan

    model_info['F1_test'] = metrics.f1_score(y_test, y_test_pred, average='macro')
    model_info['MCC_test'] = metrics.matthews_corrcoef(y_test, y_test_pred)

# Create a DataFrame to display metrics
class_metrics = pd.DataFrame.from_dict(
    class_models, orient='index',
    columns=['Accuracy_train', 'Accuracy_test', 'Recall_train', 'Recall_test', 'ROC_AUC_test', 'F1_test', 'MCC_test']
)

# Display the metrics, sorted by ROC_AUC_test score
styled_metrics = (
    class_metrics.sort_values(by='ROC_AUC_test', ascending=False)
    .style.format("{:.3f}")
    .background_gradient(cmap='plasma', low=1, high=0.1, subset=['Accuracy_train', 'Accuracy_test'])
    .background_gradient(cmap='viridis', low=1, high=0.1, subset=['Recall_train', 'Recall_test', 'ROC_AUC_test', 'F1_test', 'MCC_test'])
)
styled_metrics


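<strong>From the table above, XGBoost appears to be the most successful baseline. Note that the cells below fit LightGBM on the full training data for the final prediction and keep the XGBoost prediction line commented out.</strong>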
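
A single 80/20 split can rank close models differently from run to run. Repeated stratified cross-validation, using the `RepeatedStratifiedKFold` and `cross_val_score` imports from the top of the notebook, gives a mean and a spread per model. A minimal sketch on synthetic data follows; the dataset and the two candidate models are stand-ins, not the notebook's data or results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem (placeholder for the real training data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=42
)
candidates = {
    "decision_tree": DecisionTreeClassifier(max_depth=7, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, max_depth=11, random_state=42),
}
# 5 folds repeated twice gives 10 accuracy scores per model
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

cv_results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, scoring="accuracy", cv=cv)
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the score intervals of two models overlap heavily, the choice between them is not settled by accuracy alone.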

In [None]:
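# Fit LightGBM on all the training data
# Note: this rebinds the name `lgb` from the lightgbm module alias to the fitted classifier

lgb=LGBMClassifier(
            num_leaves=35,max_depth=7,class_weight='balanced', random_state=rand
        )
lgb.fit(train_df.drop(['id','NObeyesdad','y','Data Type'],axis=1),train_df['y'])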

In [None]:
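from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

def train_skl(x, y, xe, folds, how='log'):
    """Fit one model per fold; return out-of-fold and fold-averaged test probabilities.

    xe is the test feature matrix. It is a parameter here because the original
    body referenced it without defining it.
    """
    # Positional NumPy indexing works for DataFrames and arrays alike
    x, y, xe = np.asarray(x), np.asarray(y), np.asarray(xe)
    n_folds = len(folds)
    n_classes = len(np.unique(y))
    oof = np.zeros((len(y), n_classes))
    preds = np.zeros((len(xe), n_classes))
    
    print('='*30)
    for idx in range(n_folds):
        print("FOLD:", idx)
        tr_idx, val_idx = folds[idx]
        xt, yt = x[tr_idx], y[tr_idx]
        xv, yv = x[val_idx], y[val_idx]
        
        # Class names imported earlier are used because `lgb` is rebound to a model above
        if how == 'log':
            # Assumption: the 'log' branch was missing from the original cell;
            # logistic regression is inferred from the default value of `how`
            model = LogisticRegression(max_iter=1000)
        elif how == 'xgb':
            model = XGBClassifier(n_estimators=100,max_depth=3,learning_rate=0.2,eval_metric = 'mlogloss',subsample=0.9,colsample_bytree=0.85)
        elif how == 'xgb1':
            model = XGBClassifier(eval_metric = 'mlogloss')
        elif how =='lgb':
            model = LGBMClassifier(max_depth=3)
        elif how =='ctb':
            model = CatBoostClassifier(iterations=100, depth=6, learning_rate=0.1,
                   loss_function='MultiClass', verbose=False)
        elif how == 'mlp':
            model = MLPClassifier(hidden_layer_sizes=(100,100,),
                                  random_state=777, max_iter=300)
        elif how == 'tree':
            model = DecisionTreeClassifier(max_depth=5) 
        else: 
            model = AdaBoostClassifier( RandomForestClassifier(n_estimators=100, max_depth=4) )
        model.fit(xt, yt)
        
        # Held-out probabilities for this fold, plus this fold's share of the test average
        oof[val_idx] = model.predict_proba(xv)
        preds += model.predict_proba(xe) / n_folds
        print('='*30)
    return oof, preds
#===================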

In [None]:
X=train_df.drop(['id','NObeyesdad','y','Data Type'],axis=1)
y=train_df['y']

In [None]:
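from sklearn.model_selection import StratifiedKFold

# MultilabelStratifiedKFold was never imported and the target has a single label per row,
# so scikit-learn's StratifiedKFold is swapped in here
NFOLDS = 5  # assumption: the fold count was not defined in the original notebook
skf = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=42)
FOLDS = list(skf.split(X, y))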
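
For scikit-learn-compatible models, out-of-fold probabilities like those collected by `train_skl` can also be obtained with `cross_val_predict`, which is imported at the top of the notebook. A minimal sketch on synthetic data follows; the dataset and the decision tree are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem (placeholder for X and y)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each row's probabilities come from the model trained on the folds that exclude that row
oof_proba = cross_val_predict(
    DecisionTreeClassifier(max_depth=5, random_state=0),
    X_demo, y_demo, cv=skf_demo, method="predict_proba",
)
oof_labels = oof_proba.argmax(axis=1)
oof_accuracy = (oof_labels == y_demo).mean()
print(oof_proba.shape, f"OOF accuracy: {oof_accuracy:.3f}")
```

Unlike `train_skl`, this does not produce fold-averaged test predictions, so it only covers the validation half of that function.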

<div style="border-radius:10px;border:#D2222D solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">

## <span style="color:#094863; font-size: 1%|;"> Prediction and Submission</span>

In [None]:
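# Caution: these encoders are fit on the test frame alone, so the integer codes
# match the training codes only if each column has the same categories in both frames
for colname in test_df.select_dtypes(['object','bool']).columns:
    test_df[colname]=LabelEncoder().fit_transform(test_df[colname])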

test_df[numeric_columns]=sc.transform(test_df[numeric_columns])

test_df.head()

In [None]:
train_df.dtypes

In [None]:
test_df.dtypes

In [None]:
#submission['y_test_pred'] = class_models['XGBoost']['model'].predict(test_df.drop(['id','Data Type'],axis=1))
submission['y_test_pred'] = lgb.predict(test_df.drop(['id','Data Type'],axis=1))

In [None]:
submission.head()

In [None]:
# Map the numeric predictions back to category names
numeric_to_category={v:k for k,v in category_mapping.items()}
submission['NObeyesdad']=submission['y_test_pred'].map(numeric_to_category)
submission.drop(['y_test_pred'],axis=1).head()

In [None]:
submission.drop(['y_test_pred'],axis=1).to_csv('submission.csv',index=False)
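
Before uploading, it is worth confirming that every numeric prediction maps back to a label, since `Series.map` silently produces `NaN` for keys missing from the dictionary. A minimal sketch with a shortened, made-up mapping follows; the `_demo` names are hypothetical and the codes differ from `category_mapping`.

```python
import pandas as pd

# Shortened illustrative mapping (not the notebook's full category_mapping)
category_mapping_demo = {"Normal_Weight": 0, "Obesity_Type_I": 1, "Overweight_Level_I": 2}
numeric_to_category_demo = {v: k for k, v in category_mapping_demo.items()}

preds_demo = pd.Series([2, 0, 1, 1])
labels_demo = preds_demo.map(numeric_to_category_demo)

# Any prediction without a matching key would show up as NaN here
unmapped = int(labels_demo.isna().sum())
print(labels_demo.tolist(), "unmapped:", unmapped)
```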

<div style="border-radius:10px;border:#D2222D solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">

## <span style="color:#094863; font-size: 1%|;"> Thank You</span>