
## Instructions to follow:

- Please include all your code here. Make sure that your code is CLEAN, READABLE, and REPRODUCIBLE.
- Put your data set into a **datasets** folder.
- Put your images (if available) into an **img** folder.
- Please write YOUR OWN code. **DO NOT copy** my code or someone else's code.

## Data: Estimation of obesity levels based on eating habits and physical condition

* This [obesity dataset](https://archive-beta.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition) 
includes data for the estimation of obesity levels in individuals from Mexico, Peru, and Colombia, based on their eating habits and physical condition.

* The dataset is originally available from the UC Irvine Machine Learning Repository via this [link](https://archive-beta.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition). The related paper is available 
[here](https://www.sciencedirect.com/science/article/pii/S2352340919306985).

* The variables in the dataset are: 

   * `Gender`: A binary variable with levels **Female** and **Male**.
   * `Age`: A numerical variable in years.
   * `Height`: A numerical variable in meters.
   * `Weight`: A numerical variable in kilograms.
   * `family_history_with_overweight`: A binary variable with levels **Yes** and **No** showing whether a family member suffered/suffers from overweight.
   * `FAVC` : A binary variable with levels **Yes** and **No** showing whether high-caloric food is consumed frequently.
   * `FCVC` : A numerical variable here. Frequency of consumption of vegetables (**Interestingly, it is a categorical variable in the paper**).
   * `NCP` : A numerical variable here. Number of main meals (**Interestingly, it is a categorical variable in the paper**).
   * `CAEC` : An ordinal variable with four levels **No**, **Sometimes**, **Frequently**, and **Always** showing consumption of food between meals.    
   * `SMOKE` : A binary variable with levels **Yes** and **No** showing smoking habit.
   * `CH2O` : A numerical variable here. Consumption of water daily (**Interestingly, it is a categorical variable in the paper**).
   * `SCC` : A binary variable with levels **Yes** and **No** showing calories consumption monitoring.
   * `FAF` : A numerical variable here. Physical activity frequency (**Interestingly, it is a categorical variable in the paper**).
   * `TUE` : A numerical variable here. Time using technology devices (**Interestingly, it is  a categorical variable in the paper**). 
   * `CALC` : An ordinal variable with four levels **No**, **Sometimes**, **Frequently**, and **Always** showing the frequency of
   alcohol consumption.
   * `MTRANS` : A nominal variable with five levels **Public_Transportation**, **Automobile**, **Walking**, **Motorbike**, 
   and **Bike** showing the transportation type used.
   * `NObeyesdad`:  Another nominal variable (**not clear**). 
   
* A portion of the data set is shown below:   

In [226]:
# import the data set
import pandas as pd

df = pd.read_csv('datasets/ObesityDataSet_raw_and_data_sinthetic.csv')

In [227]:
# check the first five lines
df.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II


We intentionally created some missing values in the variables `Age`, `family_history_with_overweight`, and `CALC` as given below:

In [228]:
import numpy as np

np.random.seed(42)

for col in ['Age','family_history_with_overweight','CALC']:
    
    df.loc[df.sample(frac=np.random.randint(5,15)/100).index, col] = np.nan

df.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II
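To check how much was blanked out, the share of missing values per column can be counted. The sketch below is only an illustration: the `toy` frame and its values are made up (not taken from the dataset), while the masking loop mirrors the cell above. Because `np.random.randint(5, 15)` draws an integer from 5 to 14, each column should lose between 5% and 14% of its rows.

```python
import numpy as np
import pandas as pd

np.random.seed(42)

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Age": np.arange(1000, dtype=float),
    "family_history_with_overweight": ["yes"] * 1000,
    "CALC": ["Sometimes"] * 1000,
})
cols = ["Age", "family_history_with_overweight", "CALC"]

for col in cols:
    # Mask a random 5% to 14% of the rows in this column
    frac = np.random.randint(5, 15) / 100
    toy.loc[toy.sample(frac=frac).index, col] = np.nan

# Share of missing values per column
missing_share = toy[cols].isna().mean()
print(missing_share)
```

On the real data, the same check is `df[cols].isna().mean()`.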


In [229]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          2111 non-null   object 
 1   Age                             1879 non-null   float64
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2005 non-null   object 
 5   FAVC                            2111 non-null   object 
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   CAEC                            2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  CH2O                            2111 non-null   float64
 11  SCC                             2111 non-null   object 
 12  FAF                             21

**Step 1:** To turn this problem into a regression problem, follow the [Centers for Disease Control and Prevention (CDC)](https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html#:~:text=With%20the%20metric%20system%2C%20the,to%20obtain%20height%20in%20meters.) and create the BMI variable as follows:


$$
\text{BMI} = \frac{\text{weight (kg)}}{[\text{height (m)}]^2},
$$

and also include the variables `Gender`, `Age`, `family_history_with_overweight`, `FAVC`, `FCVC`, `NCP`, `CAEC`,
`SMOKE`, `CH2O`, `SCC`, `FAF`, `TUE`, `CALC`, and `MTRANS` in your dataset. Show the first 5 lines of your dataset.
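As a quick sanity check of the formula, here is a minimal worked example using the weight and height of the first row shown earlier (64.0 kg, 1.62 m); the variable names are only illustrative.

```python
# Values taken from row 0 of the dataset preview
weight_kg = 64.0
height_m = 1.62

# BMI = weight (kg) / height (m) squared
bmi = weight_kg / height_m ** 2
print(round(bmi, 2))
```

The result should agree with the `BMI` value reported for row 0 once the column is created.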

In [230]:
# create the BMI variable
data = df.copy()
data['BMI'] = df['Weight'] / (df['Height'] ** 2)

# drop the unnecessary columns
data.drop(['Weight', 'Height', 'NObeyesdad'], axis=1, inplace=True)

In [231]:
# show the first 5 lines
data.head(5)

Unnamed: 0,Gender,Age,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,BMI
0,Female,21.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,24.386526
1,Female,21.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,,Public_Transportation,24.238227
2,Male,23.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,23.765432
3,Male,27.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,26.851852
4,Male,22.0,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,28.342381


**Step 2:** Create a SINGLE **pipeline object** with [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) such that
 
   - Numeric features are:
 
     - imputed with strategy="mean" and
     - scaled with Z-transformation;

   - Nominal features are:

     - imputed with strategy="most_frequent" and
     - one-hot encoded appropriately;

   - Ordinal features are:

      - imputed with strategy="most_frequent" and
      - ordinal encoded appropriately,
      
   - and all these **transform** steps are finally assembled with the [SequentialFeatureSelector](https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/) algorithm from the [mlxtend](https://rasbt.github.io/mlxtend/) library to select the best subset of features with forward=False (in other words, backward elimination), scoring='r2', and cv=5. Tell us the optimum combination of features that predicts BMI. How did you select this combination?
   
- Note that this question involves use of [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) in the intermediate steps. Answers avoiding use of [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) will NOT BE ACCEPTED.  
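To make the numeric branch concrete, the following sketch runs mean imputation followed by the Z-transformation on a handful of made-up ages (the `ages` array is illustrative, not the real column). It assumes only the standard scikit-learn behaviour that `StandardScaler` divides by the population standard deviation, and compares the pipeline output with a manual computation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative ages with one missing entry
ages = np.array([[21.0], [21.0], [np.nan], [23.0], [27.0]])

numeric_sketch = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
scaled = numeric_sketch.fit_transform(ages)

# Manual version: fill with the mean, then z = (x - mean) / std (ddof=0)
filled = np.where(np.isnan(ages), np.nanmean(ages), ages)
manual = (filled - filled.mean()) / filled.std()

print(scaled.ravel())
```

A value imputed with the mean lands at (approximately) zero after scaling, so near-zero entries in a scaled column with missing data are presumably imputed rows.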

In [232]:
from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

from sklearn.metrics import r2_score

from sklearn import set_config
set_config(transform_output="pandas")

### Split the Data

In [233]:
# split the data: 20% for test and 80% for train
data_y = data["BMI"]
data_X = data.drop(["BMI"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.2, random_state=3333)

### Column Transformer

In [234]:
#numeric columns
numeric_features = ["Age", "FCVC", "NCP", "CH2O", "FAF", "TUE"]

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])


In [235]:
#nominal categorical columns
categorical_features = ["MTRANS", "Gender", "family_history_with_overweight", "FAVC", "SMOKE", "SCC"]

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(sparse_output=False, handle_unknown='ignore', drop='first'))]) # drop first column to avoid dummy variable trap

In [236]:
#ordinal categorical columns
ordinal_features = ["CAEC", "CALC"]

ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan))]) 
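Note that an `OrdinalEncoder` without an explicit `categories` argument assigns codes in sorted order of the labels, which for these strings is not the same as the natural frequency order. If the intended order is no < Sometimes < Frequently < Always (an assumption based on the variable description, with the lowercase `no` seen in the data), the order can be passed explicitly. The sketch below contrasts the two on made-up rows; it illustrates an alternative and is not the configuration used for the results in this notebook.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Assumed natural order of the levels
levels = ["no", "Sometimes", "Frequently", "Always"]

# Illustrative rows: columns stand for CAEC and CALC
toy = np.array([["Sometimes", "no"],
                ["no", "Frequently"],
                ["Always", "Sometimes"],
                ["Frequently", "Always"]], dtype=object)

# Default: categories are sorted, giving Always < Frequently < Sometimes < no
default_codes = OrdinalEncoder().fit_transform(toy)

# Explicit: one ordered list of levels per column
explicit_codes = OrdinalEncoder(categories=[levels, levels]).fit_transform(toy)

print(default_codes[:, 0])
print(explicit_codes[:, 0])
```

For a linear model the choice matters, because the codes enter the regression as numbers.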

In [237]:
#apply different transformations on different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('ord', ordinal_transformer, ordinal_features)])

### Feature Selection

In [238]:
#apply feature selection
feature_selector = SFS(LinearRegression(), 
                       k_features='best', # select the best subset of features
                       forward=False, # backward elimination
                       scoring='r2', # R2 score
                       cv=5, # 5-fold cross validation
                       verbose=2) # verbose = 2 to see the process

### Pipeline

In [239]:
# schema of the pipeline
pipe_sfs = Pipeline(steps=[('preprocessor', preprocessor),
                       ('feature_selector', feature_selector),
                       ('regressor', LinearRegression())])

pipe_sfs

**Step 3:** Report $R^2$ of the **best model (i.e. the model with optimum features)** found in **Step 2** on train and test data, respectively. (Note here that the number of features in train and test datasets should be less than the original number of features due to feature selection).

In [240]:
#train the model
pipe_sfs.fit(X_train, y_train)


[2023-11-23 19:31:19] Features: 16/1 -- score: 0.4303815731094235
[2023-11-23 19:31:19] Features: 15/1 -- score: 0.43082198043698483
[2023-11-23 19:31:19] Features: 14/1 -- score: 0.43112941285031
[2023-11-23 19:31:19] Features: 13/1 -- score: 0.4313274341837353
[2023-11-23 19:31:19] Features: 12/1 -- score: 0.43054991917138546
[2023-11-23 19:31:20] Features: 11/1 -- score: 0.4297205812277113
[2023-11-23 19:31:20] Features: 10/1 -- score: 0.4287184874331659
[2023-11-23 19:31:20] Features: 9/1 -- score: 0.4264036613544812
[2023-11-23 19:31:20] Features: 8/1 -- score: 0.4240683899802292
[2023-11-23 19:31:20] Features: 7/1 -- score: 0.41746419959199893
[2023-11-23 19:31:20] Features: 6/1 -- score: 0.40682579498238497
[2023-11-23 19:31:20] Features: 5/1 -- score: 0.3930931495957742
[2023-11-23 19:31:20] Features: 4/1 -- score: 0.347032373215099
[2023-11-23 19:31:20] Features: 3/1 -- score: 0.32887965246477957
[2023-11-23 19:31:20] Features: 2/1 -- score: 0.27045851904085205
[2023-11-23 19

In [241]:
selected_features_indices = pipe_sfs.named_steps['feature_selector'].k_feature_idx_

# Display the optimum combination of features
print('Optimum combination of features:', selected_features_indices)

Optimum combination of features: (0, 1, 2, 3, 4, 5, 8, 10, 11, 12, 14, 15, 16)
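The choice of subset size can be read off the verbose log. As a small illustration, the sketch below copies the logged cross-validated scores (rounded to six decimals) into a dictionary and picks the size with the highest score. The log shown here does not include the full 17-feature score and is cut off below two features, so this only compares the sizes that are visible.

```python
# Cross-validated R^2 by number of remaining features, copied from the SFS log
cv_r2 = {
    16: 0.430382, 15: 0.430822, 14: 0.431129, 13: 0.431327,
    12: 0.430550, 11: 0.429721, 10: 0.428718, 9: 0.426404,
    8: 0.424068, 7: 0.417464, 6: 0.406826, 5: 0.393093,
    4: 0.347032, 3: 0.328880, 2: 0.270459,
}

# Subset size with the highest logged score
best_k = max(cv_r2, key=cv_r2.get)
print(best_k, cv_r2[best_k])
```

The differences between 12 and 16 features are in the third decimal place, so the preference for any one size in that range is weak.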


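13 of the 17 transformed features are kept. Because `k_features='best'`, SFS returns the subset with the highest 5-fold cross-validated $R^2$ among the sizes it evaluates, which the log above places at 13 features (about 0.4313). The selected features are shown below: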

In [242]:
#show the selected features
pipe_sfs.named_steps['feature_selector'].k_feature_names_

('num__Age',
 'num__FCVC',
 'num__NCP',
 'num__CH2O',
 'num__FAF',
 'num__TUE',
 'cat__MTRANS_Public_Transportation',
 'cat__Gender_Male',
 'cat__family_history_with_overweight_yes',
 'cat__FAVC_yes',
 'cat__SCC_yes',
 'ord__CAEC',
 'ord__CALC')
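It can also help to list which transformed columns were eliminated. The sketch below uses the 17 column names produced by the preprocessor (as they appear in the transformed training data) together with the selected indices printed above; the list names are illustrative.

```python
# The 17 transformed column names, in preprocessor output order
all_columns = [
    "num__Age", "num__FCVC", "num__NCP", "num__CH2O", "num__FAF", "num__TUE",
    "cat__MTRANS_Bike", "cat__MTRANS_Motorbike",
    "cat__MTRANS_Public_Transportation", "cat__MTRANS_Walking",
    "cat__Gender_Male", "cat__family_history_with_overweight_yes",
    "cat__FAVC_yes", "cat__SMOKE_yes", "cat__SCC_yes",
    "ord__CAEC", "ord__CALC",
]

# Indices reported by k_feature_idx_
selected_idx = (0, 1, 2, 3, 4, 5, 8, 10, 11, 12, 14, 15, 16)

selected = [all_columns[i] for i in selected_idx]
dropped = [name for i, name in enumerate(all_columns) if i not in selected_idx]
print(dropped)
```

The eliminated columns are three of the four `MTRANS` indicators and the `SMOKE` indicator.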

In [243]:
#predict the train and test data
y_train_pred_sfs = pipe_sfs.predict(X_train)
y_test_pred_sfs = pipe_sfs.predict(X_test)

In [244]:
#calculate the R2 score
r2_train_sfs = r2_score(y_train, y_train_pred_sfs)
r2_test_sfs = r2_score(y_test, y_test_pred_sfs)

In [245]:
#print the R2 score
print(" R2 score on train data: ", r2_train_sfs, "\n", "-"*43, "\n", "R2 score on test data: ", r2_test_sfs)

 R2 score on train data:  0.4438662144907958 
 ------------------------------------------- 
 R2 score on test data:  0.46241677266488324


**Step 4:** Repeat **Step 2** when the final estimator is only linear regression (i.e., without feature selection). 

In [246]:
# create the pipeline without feature selection
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       ('regressor', LinearRegression())])

pipe

**Step 5:** Report the $R^2$ of the linear regression model without any feature selection on train and test data, respectively.

In [247]:
#train the model
pipe.fit(X_train, y_train)

In [248]:
# show the transformed train data
pipe.named_steps['preprocessor'].transform(X_train)

Unnamed: 0,num__Age,num__FCVC,num__NCP,num__CH2O,num__FAF,num__TUE,cat__MTRANS_Bike,cat__MTRANS_Motorbike,cat__MTRANS_Public_Transportation,cat__MTRANS_Walking,cat__Gender_Male,cat__family_history_with_overweight_yes,cat__FAVC_yes,cat__SMOKE_yes,cat__SCC_yes,ord__CAEC,ord__CALC
703,-2.327518e-01,1.103361,0.399707,-0.010412,0.292868,-0.391656,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,2.0
1732,1.192456e+00,0.310620,0.399707,0.115481,0.037397,-0.457894,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,2.0,2.0
227,-5.709030e-02,1.103361,0.399707,-0.016278,-0.012879,0.563231,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,2.0
2021,2.711099e-01,1.103361,0.399707,-1.588795,-1.076472,-0.371303,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,2.0,2.0
710,5.863229e-16,-2.260192,1.084953,-1.438169,1.165113,0.563231,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50,-5.521960e-01,1.103361,0.399707,1.626876,-1.190872,0.563231,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,2.0,2.0
194,-2.221255e-01,1.103361,0.399707,1.626876,2.343105,-1.074980,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,3.0
424,1.428227e+00,-0.770333,0.399707,-1.659433,-1.190872,2.201443,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,2.0,2.0
1069,1.050524e+00,-0.713643,0.333537,-0.042038,-0.028884,-0.528353,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,3.0


In [249]:
#predict the train and test data
y_train_pred = pipe.predict(X_train)
y_test_pred = pipe.predict(X_test)

In [250]:
#calculate the R2 score
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

In [251]:
#print the R2 score
print(" R2 score on train data: ", r2_train, "\n", "-"*43, "\n", "R2 score on test data: ", r2_test)

 R2 score on train data:  0.44596688666350814 
 ------------------------------------------- 
 R2 score on test data:  0.4635061865728807


**Step 6:** Compare these results in **Step 5** with the ones in **Step 3** and comment on it.

In [252]:
#compare the results in a table
pd.DataFrame({"R2 score on train data": [r2_train, r2_train_sfs],
              "R2 score on test data": [r2_test, r2_test_sfs]},
             index=["Without feature selection", "With feature selection"])

Unnamed: 0,R2 score on train data,R2 score on test data
Without feature selection,0.445967,0.463506
With feature selection,0.443866,0.462417


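The $R^2$ on the test data is almost the same for both models (0.4635 without feature selection versus 0.4624 with it), so backward elimination did not improve predictive performance; it only yields a slightly smaller model (13 features instead of 17) at essentially no cost. 
Feature selection tends to help when a model has many redundant or noisy features and overfits the training data. 
Here the model is a plain linear regression and shows no sign of overfitting: in both cases the test $R^2$ is slightly higher than the train $R^2$. Removing four features therefore changes the scores only in the third decimal place.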
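The comparison can be quantified with a few lines of arithmetic. The sketch below uses the rounded scores from the table above; the dictionary layout is only for illustration.

```python
# R^2 scores from the comparison table (rounded to six decimals)
scores = {
    "Without feature selection": {"train": 0.445967, "test": 0.463506},
    "With feature selection": {"train": 0.443866, "test": 0.462417},
}

# Train minus test: a large positive gap would suggest overfitting
gaps = {name: s["train"] - s["test"] for name, s in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: train - test = {gap:+.6f}")

# Difference in test R^2 between the two models
test_diff = (scores["Without feature selection"]["test"]
             - scores["With feature selection"]["test"])
print(f"Test R^2 difference: {test_diff:.6f}")
```

Both gaps are negative and the test difference is about 0.001, which is small enough that a different train/test split could plausibly change its sign.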