# GOAL:
- Experiment to determine the best imputation methods
- Idea: Try different imputation strategies, evaluate each one with the baseline model (decision trees or random forests), and select the best.

Approach: Make the pipeline as efficient as possible by using *"configurations"* for imputing values (e.g. impute variable $X$ in a manner $f$)
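
To make the idea concrete, here is a minimal sketch of how a configuration key can be read. It assumes keys take the form `"option"` or `"option;arg"`, as in the configurations defined later in this notebook; the `parse_step` helper and `example_config` are hypothetical names used only for illustration.

```python
def parse_step(key):
    """Split a configuration key into (option, arg); arg is None when absent."""
    option, _, arg = key.partition(";")
    return option, (arg or None)


# Hypothetical configuration: each key names a step,
# each value lists the columns that step applies to.
example_config = {
    "simp_imputer;median": ["age", "height"],
    "add_bmi": None,
}

steps = [parse_step(key) for key in example_config]
print(steps)  # [('simp_imputer', 'median'), ('add_bmi', None)]
```

Because Python dictionaries keep insertion order, the steps run in the order the keys are written, so the order of keys in a configuration matters.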

In [1]:
import pandas as pd
import sklearn as sk


In [2]:
df = pd.read_csv(r"../our data/no_outliers.csv")
df.head(3)

Unnamed: 0.1,Unnamed: 0,age,alcohol_freq,caloric_freq,devices_perday,eat_between_meals,gender,height,meals_perday,monitor_calories,parent_overweight,physical_activity_perweek,siblings,smoke,transportation,veggies_freq,water_daily,weight,obese_level
0,0,21.0,Never,no,up to 5,Sometimes,Female,1.62,3.0,no,yes,,3.0,no,Public,Sometimes,1 to 2,64.0,Normal_Weight
1,1,23.0,Frequently,no,up to 5,Sometimes,Male,1.8,3.0,no,yes,3 to 4,0.0,no,Public,Sometimes,1 to 2,77.0,Normal_Weight
2,2,,Frequently,no,up to 2,Sometimes,Male,1.8,3.0,no,no,3 to 4,2.0,no,Walk,Always,1 to 2,87.0,Overweight_Level_I


In [3]:
# Drop leftover index columns written by earlier CSV exports
df = df.drop(columns=[c for c in df.columns if c.startswith('Unnamed')])

## Separate explanatory variables and target

In [4]:
df_var, df_target = df.drop("obese_level", axis=1), df['obese_level']

## Separate train and validation

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df_var, df_target,
    test_size = .33,
    random_state=20
)

In [6]:
df.dtypes

age                          float64
alcohol_freq                  object
caloric_freq                  object
devices_perday                object
eat_between_meals             object
gender                        object
height                       float64
meals_perday                 float64
monitor_calories              object
parent_overweight             object
physical_activity_perweek     object
siblings                     float64
smoke                         object
transportation                object
veggies_freq                  object
water_daily                   object
weight                       float64
obese_level                   object
dtype: object

# Start experimenting
## Setup configurations and results

In [7]:
from sklearn.experimental import enable_iterative_imputer

from data_preprocesser import preprocesser


In [8]:
my_preprocesser = preprocesser()

# Data preprocesser:
def run(o_train, o_val, config):
    """Apply each step of one configuration, in the order its keys are defined.

    Keys look like "option" or "option;arg"; values list the affected columns.
    """
    # Work on copies, since preprocesser modifies its inputs in place
    train = o_train.copy()
    val = o_val.copy()

    for key, columns in config.items():
        parts = key.split(";")

        if len(parts) == 1:  # Options without an extra argument
            option = parts[0]
            if option == "knn_imputer":
                train, val = my_preprocesser.knn_imputer(train, val, columns)
            elif option == "add_bmi":
                train, val = my_preprocesser.add_bmi(train, val)

        elif len(parts) == 2:  # Options with one extra argument
            option, arg = parts
            if option == "encode_data":
                train, val = my_preprocesser.encode_data(train, val, columns, type=arg)
            elif option == "simp_imputer":
                train, val = my_preprocesser.simp_imputer(train, val, columns, strategy=arg)
            elif option == "scaler":
                train, val = my_preprocesser.scaler(train, val, columns, method=arg)
            elif option == "constant_imputer":
                train, val = my_preprocesser.constant_imputer(train, val, columns, filling=arg)
            elif option == "iterative_imputer":
                train, val = my_preprocesser.iterative_imputer(train, val, columns, estimator=arg)

    return train, val

Preprocesser loaded
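
The `data_preprocesser` module is not shown in this notebook, so the following is only a sketch of what a step such as `simp_imputer` might do, not its actual implementation. It assumes the step wraps scikit-learn's `SimpleImputer`, fits it on the training split only, and reuses the fitted statistics on the validation split so that no information leaks from validation into training. The `simple_impute` function and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def simple_impute(train, val, columns, strategy="median"):
    """Fit the imputer on train only, then fill both frames (copies are returned)."""
    train, val = train.copy(), val.copy()
    imputer = SimpleImputer(strategy=strategy)
    train[columns] = imputer.fit_transform(train[columns])  # learn the statistic here
    val[columns] = imputer.transform(val[columns])          # reuse it, do not refit
    return train, val


# Toy data, made up for illustration
toy_train = pd.DataFrame({"age": [20.0, np.nan, 30.0, 40.0]})
toy_val = pd.DataFrame({"age": [np.nan, 50.0]})

filled_train, filled_val = simple_impute(toy_train, toy_val, ["age"])
print(filled_val["age"].tolist())  # [30.0, 50.0]: the train median fills the gap
```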


# Start making some configurations
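
The configurations in this section repeat the same column lists many times. One way to cut that repetition, sketched here under the assumption that the key format stays `"option;arg"`, is to define the lists once and build each dictionary from them. `NUMERIC`, `CATEGORICAL`, and `make_config` are hypothetical names, not part of the notebook's code.

```python
NUMERIC = ["age", "height", "siblings", "weight"]
CATEGORICAL = [
    "alcohol_freq", "caloric_freq", "devices_perday", "eat_between_meals",
    "gender", "meals_perday", "monitor_calories", "parent_overweight",
    "physical_activity_perweek", "smoke", "transportation", "veggies_freq",
    "water_daily",
]


def make_config(numeric_step,
                categorical_step="simp_imputer;most_frequent",
                encoding_step="encode_data;one_hot"):
    """Build a configuration dict; steps run in the order they are inserted."""
    return {
        numeric_step: list(NUMERIC),
        categorical_step: list(CATEGORICAL),
        encoding_step: list(CATEGORICAL),
    }


# Same steps and columns as the simplest configuration (median + mode + one-hot)
baseline = make_config("simp_imputer;median")
print(list(baseline))
```

Each call copies the lists, so editing one configuration afterwards does not silently change the others.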

In [9]:
configs = []

In [10]:
config_0 = {
    "simp_imputer;median": ["age","height","siblings","weight"],
    "simp_imputer;most_frequent":    [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'eat_between_meals',
                            'gender',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'physical_activity_perweek',
                            'smoke',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
                            ],
    "encode_data;one_hot": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'eat_between_meals',
                            'gender',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'physical_activity_perweek',
                            'smoke',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
    ]
} # Simplest configuration: impute everything with mode or median

configs.append(config_0)

In [11]:
config_1 = {
    "iterative_imputer;none": ['age', 'weight', 'height', 'siblings'],
    "simp_imputer;most_frequent": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'eat_between_meals',
                            'gender',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'physical_activity_perweek',
                            'smoke',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
                            ],
    "encode_data;one_hot": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'eat_between_meals',
                            'gender',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'physical_activity_perweek',
                            'smoke',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
    ]


} # Impute the numerical columns with the iterative imputer and the rest with the most frequent value
configs.append(config_1)

In [12]:
config_2 = {
    "scaler;standard": ['age', 'weight', 'height'],
    "iterative_imputer;lr": ['age', 'weight', 'height'],
    "simp_imputer;most_frequent": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'eat_between_meals',
                            'gender',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'smoke',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
                            ],
    "constant_imputer;0": [
                            'physical_activity_perweek'
                            ],
    "encode_data;one_hot": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'eat_between_meals',
                            'gender',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'physical_activity_perweek',
                            'smoke',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
    ]
} # Scales numerical values with standard, imputes with linear regression, imputes categorical with mode and physical activity with 0
configs.append(config_2)

In [13]:
config_3 = {
    "scaler;standard": ['age', 'weight', 'height'],
    "iterative_imputer;lr": ['age', 'weight', 'height'],
    "simp_imputer;most_frequent": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'eat_between_meals',
                            'gender',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'smoke',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
                            ],
    "constant_imputer;0": [
                            'physical_activity_perweek'
                            ],
    "encode_data;ordinal": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'eat_between_meals',
                            'gender',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'physical_activity_perweek',
                            'smoke',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
    ]
} # Same as config_2, but encodes every categorical column with ordinal values instead of one-hot
configs.append(config_3)

In [14]:
config_4 = {
    "scaler;standard": ['age', 'weight', 'height'],
    "iterative_imputer;lr": ['age', 'weight', 'height'],
    "simp_imputer;most_frequent": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'eat_between_meals',
                            'gender',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'smoke',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
                            ],
    "constant_imputer;0": [
                            'physical_activity_perweek'
                            ],
    "encode_data;ordinal": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'physical_activity_perweek',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
                            "eat_between_meals"
    ],
    "encode_data;one_hot": [
        "gender",
        "smoke",
    ]
} # Same as config_3, but one-hot encodes gender and smoke instead of using ordinal values
configs.append(config_4)

In [15]:
config_5 = {
    "add_bmi": None,
    "scaler;standard": ['age', 'weight', 'height'],
    "iterative_imputer;lr": ['age', 'weight', 'height'],
    "simp_imputer;most_frequent": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'eat_between_meals',
                            'gender',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'smoke',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
                            ],
    "constant_imputer;0": [
                            'physical_activity_perweek'
                            ],
    "encode_data;ordinal": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'eat_between_meals',
                            'gender',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'physical_activity_perweek',
                            'smoke',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
                            "bmi_class"
    ],
} # Same as config_3 (standard scaling, linear-regression imputation, mode for categorical, 0 for physical activity),
# but also adds the BMI class column.
configs.append(config_5)

In [16]:
config_6 = {
    "add_bmi": None,
    "scaler;standard": ['age', 'weight', 'height'],
    "iterative_imputer;lr": ['age', 'weight', 'height'],
    "simp_imputer;most_frequent": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'eat_between_meals',
                            'gender',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'smoke',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
                            ],
    "constant_imputer;0": [
                            'physical_activity_perweek'
                            ],
    "encode_data;ordinal": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'physical_activity_perweek',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
                            "bmi_class",
                            "eat_between_meals"
    ],
    "encode_data;one_hot": [
        "gender",
        "smoke",
    ]
} # Same as config_4, but with the BMI class column added
configs.append(config_6)

In [17]:
config_7 = {
    "add_bmi": None,
    "scaler;standard": ['age', 'weight', 'height'],
    "iterative_imputer;lr": ['age', 'weight', 'height'],
    "knn_imputer":[
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'eat_between_meals',
                            'gender',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'physical_activity_perweek',
                            'smoke',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
                            ],
    "constant_imputer;0": [
                            'physical_activity_perweek'
                            ],

} # Same as config_6, but imputes the categorical columns with KNN and skips the encoding step
configs.append(config_7)

In [18]:
config_8 = {
    "add_bmi": None,
    "scaler;standard": ['age', 'weight', 'height'],
    "knn_imputer": ['age', 'weight', 'height'],
    "simp_imputer;most_frequent": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'eat_between_meals',
                            'gender',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'smoke',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
                            ],
    "constant_imputer;0": [
                            'physical_activity_perweek'
                            ],
    "encode_data;ordinal": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'physical_activity_perweek',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
                            "bmi_class",
                            "eat_between_meals"
    ],
    "encode_data;one_hot": [
        "gender",
        "smoke",
    ]
} # Same as config_6, but imputes the numerical columns with KNN instead of linear regression
configs.append(config_8)


In [19]:
config_9 = {
    "add_bmi": None,
    "scaler;standard": ['age', 'weight', 'height'],
    "knn_imputer": ['age', 'weight', 'height'],
    "simp_imputer;most_frequent": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'eat_between_meals',
                            'gender',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'smoke',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
                            'physical_activity_perweek'
                            ],
    "encode_data;ordinal": [
                            'alcohol_freq',
                            'caloric_freq',
                            'devices_perday',
                            'meals_perday',
                            'monitor_calories',
                            'parent_overweight',
                            'physical_activity_perweek',
                            'transportation',
                            'veggies_freq',
                            'water_daily',
                            "bmi_class",
                            "eat_between_meals"
    ],
    "encode_data;one_hot": [
        "gender",
        "smoke",
    ]
} # Same as config_8, but imputes physical activity with the mode instead of 0

# IMPORTANT OBSERVATION: In some categories performance slightly improves with this
configs.append(config_9)


# Mass Experiments with Baseline Model

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
tree_rf = RandomForestClassifier(random_state=42)

# For loop with configurations
for i,config in enumerate(configs):
    print("="*50)
    print(f"CONFIGURATION {i}")
    print([e for e in config])
    print()
    train_curr, test_curr = run(X_train, X_test, config)
    tree_rf.fit(train_curr, y_train)

    print(classification_report(y_test, tree_rf.predict(test_curr)))

#   importances = tree_rf.feature_importances_
#   importances_df = pd.DataFrame(importances, index=train_curr.columns, columns=['importance'])
#   importances_df = importances_df.sort_values('importance', ascending=False)
#   print(importances_df)
#   print()


CONFIGURATION 0
['simp_imputer;median', 'simp_imputer;most_frequent', 'encode_data;one_hot']

                     precision    recall  f1-score   support

Insufficient_Weight       0.95      0.92      0.93        75
      Normal_Weight       0.68      0.74      0.71        70
     Obesity_Type_I       0.91      0.93      0.92        80
    Obesity_Type_II       0.99      0.92      0.95        76
   Obesity_Type_III       0.99      1.00      0.99        75
 Overweight_Level_I       0.85      0.87      0.86        79
Overweight_Level_II       0.89      0.84      0.86        75

           accuracy                           0.89       530
          macro avg       0.89      0.89      0.89       530
       weighted avg       0.89      0.89      0.89       530

CONFIGURATION 1
['iterative_imputer;none', 'simp_imputer;most_frequent', 'encode_data;one_hot']

                     precision    recall  f1-score   support

Insufficient_Weight       0.95      0.92      0.93        75
      Normal

# Conclusions
As we can see, the configurations with good performance are:
- CONFIGURATION 3: Scale with standard scaling, impute numerical columns with iterated linear regressions, impute qualitative columns with the mode (except 0 for activity) and encode everything with ordinal values
- CONFIGURATION 6: As configuration 3, but add a BMI class column and encode some columns with ordinal values and others (gender, smoke) with one-hot
- CONFIGURATION 7: As configuration 6, but impute the qualitative columns with KNN instead of the mode, without encoding
- CONFIGURATION 8: As configuration 6, but impute the numerical columns with KNN instead of linear regressions
- CONFIGURATION 9: As configuration 8, but impute every qualitative column with the mode, including activity

For precise values see `analysis.txt`. Overall, we consider the last configuration the "best one", as it preserves a *"minimum amount"* of variance (e.g. it classifies certain classes better, where previous approaches had lower precision on those classes)
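
Printed classification reports are hard to compare side by side. A minimal sketch of one alternative, assuming the predictions of each configuration are kept in a dictionary: collect accuracy and macro-averaged F1 into a single table and sort it. The `summarize` helper is hypothetical, and the toy labels below are made up for illustration; they are not results from this notebook.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score


def summarize(y_true, predictions):
    """Return one row per configuration, sorted by macro F1 (best first)."""
    rows = [
        {
            "config": name,
            "accuracy": accuracy_score(y_true, y_pred),
            "macro_f1": f1_score(y_true, y_pred, average="macro"),
        }
        for name, y_pred in predictions.items()
    ]
    return pd.DataFrame(rows).set_index("config").sort_values("macro_f1", ascending=False)


# Toy labels, not real results
toy_true = ["a", "b", "a", "b"]
toy_predictions = {
    "config_x": ["a", "b", "a", "b"],  # all correct
    "config_y": ["a", "a", "a", "b"],  # one mistake
}

summary = summarize(toy_true, toy_predictions)
print(summary)
```

In the experiment loop, the dictionary could be filled with `tree_rf.predict(test_curr)` for each configuration index.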

In [21]:
importances = tree_rf.feature_importances_
importances_df = pd.DataFrame(importances, index=train_curr.columns, columns=['importance'])
importances_df = importances_df.sort_values('importance', ascending=False)
print(importances_df) # Importances of last model


                                   importance
weight                               0.306011
bmi_class_encoded                    0.134463
age                                  0.102683
height                               0.089391
gender_1.0                           0.068019
veggies_freq_encoded                 0.044252
alcohol_freq_encoded                 0.037868
meals_perday_encoded                 0.032371
eat_between_meals_encoded            0.029643
caloric_freq_encoded                 0.023886
parent_overweight_encoded            0.023314
transportation_encoded               0.023192
devices_perday_encoded               0.021875
water_daily_encoded                  0.020343
physical_activity_perweek_encoded    0.017830
siblings                             0.017396
monitor_calories_encoded             0.006222
smoke_1.0                            0.001240
