<div class="alert alert-info alert-warning" style="background-color: white; color: black; text-align: center;">
    <h1><span style="color: red;">Ozan MÖHÜRCÜ</span></h1>
    <h1><span style="color: red;">Data Analyst | Data Scientist</span></h1>

 <div style="text-align: center; font-family: Arial, sans-serif; margin-top: 20px;">
        <a href="https://www.linkedin.com/in/ozanmhrc/" style="text-decoration: none; color: #fff; margin-right: 10px;">
            <span style="background-color: #0077B5; padding: 8px 20px; border-radius: 5px; font-size: 14px; display: inline-block; width: 120px; text-align: center;">LinkedIn</span>
        </a>
        <a href="https://github.com/Ozan-Mohurcu" style="text-decoration: none; color: #fff; margin-right: 10px;">
            <span style="background-color: #333; padding: 8px 20px; border-radius: 5px; font-size: 14px; display: inline-block; width: 120px; text-align: center;">GitHub</span>
        </a>
    </div>
</div>

<div style="background-color: white; color: black; padding: 20px; border-radius: 8px;">
  <h2 style="color: red;">What is AutoML</h2>
  <p>
      
- AutoML (Automated Machine Learning) automates the creation, training, and optimization of machine learning models with minimal human intervention.

- It automates tasks such as data preprocessing, model selection, and hyperparameter tuning, allowing even non-expert users to build effective models.
- AutoML is especially useful for saving time and reducing the need for deep machine learning expertise.
  </p>
</div>

<div style="background-color: white; color: black; padding: 20px; border-radius: 8px;">
  <h2 style="color: red;">Libraries Import</h2>
</div>

In [1]:
%%capture
!pip install flaml 

import pandas as pd
import numpy as np
from flaml import AutoML
import warnings
warnings.filterwarnings('ignore')

<div style="background-color: white; color: black; padding: 20px; border-radius: 8px;">
  <h2 style="color: red;">Data Loading</h2>
</div>

In [2]:
train = pd.read_csv('/kaggle/input/playground-series-s5e5/train.csv', index_col='id')
test = pd.read_csv('/kaggle/input/playground-series-s5e5/test.csv', index_col='id')
sub = pd.read_csv('/kaggle/input/playground-series-s5e5/sample_submission.csv', index_col='id')

<div style="background-color: white; color: black; padding: 20px; border-radius: 8px;">
  <h2 style="color: red;">Feature Engineering</h2>
  <p>
      
- Feature Engineering is the process of creating meaningful input features from raw data to improve the performance of machine learning models.  
- It involves transforming, selecting, or generating new features.  
- Good feature engineering can significantly enhance model accuracy and efficiency.  
- Even simple models can perform well with well-crafted features.
  </p>
</div>

In [3]:
def feature_engineering(df):
    df = df.copy()
    df['BMI'] = df['Weight'] / ((df['Height'] / 100) ** 2)
    df['Body_Temp_Duration'] = df['Body_Temp'] * df['Duration']
    df['Weight_Heart_Rate'] = df['Weight'] * df['Heart_Rate']
    df = pd.get_dummies(df, columns=['Sex'], drop_first=True)
    return df


train_fe = feature_engineering(train)
test_fe = feature_engineering(test)
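
<div style="background-color: white; color: black; padding: 20px; border-radius: 8px;">
  <p>
    As a quick sanity check of the derived features, the sketch below applies the same three formulas to two hypothetical rows. The values are made up for illustration and are not taken from the competition data, and <code>add_features</code> is only a local copy of the function above.
  </p>
</div>

```python
import pandas as pd

# Hypothetical two-row frame; the values are illustrative only.
toy = pd.DataFrame({
    "Sex": ["male", "female"],
    "Height": [170.0, 160.0],      # centimetres
    "Weight": [70.0, 60.0],        # kilograms
    "Duration": [10.0, 20.0],
    "Heart_Rate": [100.0, 110.0],
    "Body_Temp": [40.0, 40.5],
})


def add_features(df):
    """Same derived columns as feature_engineering() above."""
    df = df.copy()
    df["BMI"] = df["Weight"] / ((df["Height"] / 100) ** 2)
    df["Body_Temp_Duration"] = df["Body_Temp"] * df["Duration"]
    df["Weight_Heart_Rate"] = df["Weight"] * df["Heart_Rate"]
    return pd.get_dummies(df, columns=["Sex"], drop_first=True)


toy_fe = add_features(toy)
print(toy_fe[["BMI", "Body_Temp_Duration", "Weight_Heart_Rate", "Sex_male"]])
```

Because `drop_first=True` keeps only `Sex_male`, a frame that contained a single `Sex` value would end up with no dummy column at all, so it is worth confirming that the train and test frames have matching columns after this step.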

<div style="background-color: white; color: black; padding: 20px; border-radius: 8px;">
  <h2 style="color: red;">Understanding the Code: Data Preparation</h2>
  <p>
    1. <strong>X_train = train_fe.drop(columns='Calories')</strong> creates the feature matrix by removing the target column 'Calories' from the dataset.<br>
    2. <strong>y_train = np.log1p(train_fe['Calories'])</strong> transforms the target variable with <code>log(1 + x)</code>, the natural logarithm of one plus the value, which helps stabilize variance and reduce skewness.<br>
    3. This process prepares the data for training machine learning models by separating input features (X_train) and the transformed target variable (y_train).
  </p>
</div>

In [4]:
X_train = train_fe.drop(columns='Calories')
y_train = np.log1p(train_fe['Calories'])
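
<div style="background-color: white; color: black; padding: 20px; border-radius: 8px;">
  <p>
    The sketch below illustrates why the log transform pairs well with the RMSLE score reported later in this notebook: RMSE measured on <code>log1p</code> values is the same number as RMSLE on the original scale. The arrays are hypothetical values, not competition data.
  </p>
</div>

```python
import numpy as np

# Hypothetical targets and predictions on the original Calories scale.
y_true = np.array([10.0, 50.0, 120.0, 300.0])
y_pred = np.array([12.0, 45.0, 130.0, 280.0])

# log1p and expm1 are inverses, so the transform loses no information.
assert np.allclose(np.expm1(np.log1p(y_true)), y_true)

# RMSE computed on log1p-transformed values ...
rmse_log = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# ... matches RMSLE written out from its definition on the original scale.
rmsle = np.sqrt(np.mean(np.log((1 + y_pred) / (1 + y_true)) ** 2))

print(f"RMSE on log scale: {rmse_log:.6f}")
print(f"RMSLE:             {rmsle:.6f}")
```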

<div style="background-color: white; color: black; padding: 20px; border-radius: 8px;">
  <h2 style="color: red;">Using AutoML for Regression</h2>
  <p>
    1. <strong>aml = AutoML()</strong> creates a new AutoML instance to automate the machine learning workflow.<br>
    2. The <strong>fit()</strong> method trains the model using <code>X_train</code> as input features and <code>y_train</code> as the target variable.<br>
    3. The <strong>task='regression'</strong> parameter specifies that the model is solving a regression problem.<br>
    4. <strong>metric='rmse'</strong> sets Root Mean Squared Error as the metric to optimize; because <code>y_train</code> is log1p-transformed, this corresponds to RMSLE on the original Calories scale.<br>
    5. <strong>time_budget=3600</strong> limits the search to 3600 seconds (one hour) to manage computational resources.<br>
    6. <strong>eval_method='cv'</strong> and <strong>n_splits=5</strong> enable 5-fold cross-validation for more robust model evaluation.<br>
    7. <strong>estimator_list=['xgboost', 'lgbm', 'catboost']</strong> restricts the search to these three popular gradient boosting algorithms.<br>
    8. <strong>ensemble=True</strong> asks FLAML to build a stacked ensemble from the searched models once the search finishes, which can improve overall prediction performance.<br>
    9. <strong>verbose=1</strong> keeps the training log brief; higher values print more detail, which is useful for monitoring progress.
  </p>
</div>

In [5]:
aml = AutoML()
aml.fit(
    X_train,
    y_train,
    task='regression',
    metric='rmse',  
    time_budget=3600, # 1 Hour
    eval_method='cv',
    n_splits=5,
    estimator_list=['xgboost', 'lgbm', 'catboost'], 
    ensemble=True,
    verbose=1
)

<div style="background-color: white; color: black; padding: 20px; border-radius: 8px;">
  <h2 style="color: red;">AutoML Results Summary</h2>
  <p>
    1. <strong>aml.best_estimator</strong> displays the name of the best-performing model found during the AutoML process.<br>
    2. <strong>aml.best_config</strong> and <strong>aml.best_loss</strong> provide the optimal hyperparameters and the lowest validation loss achieved, respectively.
  </p>
</div>

In [6]:
print("The best model:", aml.best_estimator)
print("Best configuration:", aml.best_config)
print("Best validation loss:", aml.best_loss)

The best model: catboost
Best configuration: {'early_stopping_rounds': 11, 'learning_rate': 0.005, 'n_estimators': 8192}
Best validation loss: 0.060354510942818615
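
<div style="background-color: white; color: black; padding: 20px; border-radius: 8px;">
  <p>
    Because the model was fitted on <code>log1p(Calories)</code>, its predictions have to be mapped back with <code>np.expm1</code> before they are submitted. The helper below is a minimal sketch of that step. <code>to_submission</code> and <code>ConstantLogModel</code> are hypothetical names introduced here, and the 0 to 400 clip simply mirrors the range used in the validation cell further down.
  </p>
</div>

```python
import numpy as np
import pandas as pd


def to_submission(model, X, clip_max=400.0):
    """Predict on the log1p scale, then map back to the Calories scale."""
    log_pred = np.asarray(model.predict(X))
    calories = np.clip(np.expm1(log_pred), 0.0, clip_max)
    return pd.DataFrame({"Calories": calories}, index=X.index)


class ConstantLogModel:
    """Hypothetical stand-in for any fitted model with a predict() method."""

    def predict(self, X):
        return np.full(len(X), np.log1p(100.0))


# Hypothetical feature frame indexed by id, as in the notebook's test data.
X_demo = pd.DataFrame(
    {"Duration": [5.0, 15.0]},
    index=pd.Index([750000, 750001], name="id"),
)
demo_sub = to_submission(ConstantLogModel(), X_demo)
print(demo_sub)
```

With the fitted object from the cell above, the call would presumably look like `to_submission(aml, test_fe).to_csv('submission.csv')`; that line is an assumption, since the notebook does not show how the individual submissions were produced.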


<div style="background-color: white; color: black; padding: 20px; border-radius: 8px;">
  <h2 style="color: red;">Data Preparation and Model Training Pipeline</h2>
  <p>
    1. The <strong>feature_engineering()</strong> function creates new meaningful features such as BMI, Body_Temp_Duration, and Weight_Heart_Rate, and applies one-hot encoding to the 'Sex' column.<br>
    2. The target variable 'Calories' is log-transformed using <code>np.log1p</code> to stabilize variance and reduce skewness.<br>
    3. Numerical and categorical columns are identified for preprocessing, where numerical features are standardized and categorical features are one-hot encoded.<br>
    4. A scikit-learn <code>Pipeline</code> combines the preprocessing steps with a LightGBM regressor whose hyperparameters are fixed in <code>lgbm_params</code> (for example, 1125 trees and 110 leaves).<br>
    5. A 5-fold cross-validation strategy is implemented using <code>KFold</code> to split data into training and validation sets.<br>
    6. For each fold, the pipeline is trained on the other four folds and its predictions on the held-out fold are stored, yielding out-of-fold predictions for every training row.<br>
    7. Finally, the out-of-fold predictions are transformed back from the log scale with <code>np.expm1</code>, clipped to the range [0, 400], and scored with RMSLE against the original <code>Calories</code> values.
  </p>
</div>

In [7]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import KFold
from lightgbm import LGBMRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

print("Train data columns:", train.columns.tolist())


def feature_engineering(df):
    df = df.copy()
    # BMI
    if 'Weight' in df.columns and 'Height' in df.columns:
        df['BMI'] = df['Weight'] / ((df['Height'] / 100) ** 2)
    else:
        print("Error: Weight or Height column is missing!")
        df['BMI'] = 0
    # Body_Temp_Duration
    if 'Body_Temp' in df.columns and 'Duration' in df.columns:
        df['Body_Temp_Duration'] = df['Body_Temp'] * df['Duration']
    else:
        print("Error: Body_Temp or Duration column is missing!")
        df['Body_Temp_Duration'] = 0
    # Weight_Heart_Rate
    if 'Weight' in df.columns and 'Heart_Rate' in df.columns:
        df['Weight_Heart_Rate'] = df['Weight'] * df['Heart_Rate']
    else:
        print("Error: Weight or Heart_Rate column is missing!")
        df['Weight_Heart_Rate'] = 0
    # One-hot encoding for Sex
    if 'Sex' in df.columns:
        df = pd.get_dummies(df, columns=['Sex'], drop_first=True, dummy_na=False)
    else:
        print("Error: Sex column is missing!")
        df['Sex_male'] = 0
    return df


train_fe = feature_engineering(train)
print("Columns after feature engineering:", train_fe.columns.tolist())


if 'Calories' in train_fe.columns:
    X_train = train_fe.drop(columns='Calories')
    y_train = np.log1p(train_fe['Calories'])
else:
    raise ValueError("Error: Calories column missing in train_fe!")


numerical_cols = [col for col in ['Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Body_Temp', 
                                  'BMI', 'Body_Temp_Duration', 'Weight_Heart_Rate'] if col in X_train.columns]
categorical_cols = [col for col in ['Sex_male'] if col in X_train.columns]
print("Numeric columns:", numerical_cols)
print("Categorical columns:", categorical_cols)


preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_cols)
    ])

lgbm_params = {
    'n_estimators': 1125,
    'num_leaves': 110,
    'min_child_samples': 9,
    'learning_rate': 0.0179455702408711,
    'colsample_bytree': 0.5979737441060009,
    'reg_alpha': 0.001975258376030875,
    'reg_lambda': 0.005106256873241264,
    'max_bin': 2**10,  # log_max_bin=10
    'random_state': 42,
    'verbose': -1
}


pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LGBMRegressor(**lgbm_params))
])


kf = KFold(n_splits=5, shuffle=True, random_state=42)
oof_preds = np.zeros(X_train.shape[0])

for fold, (train_idx, val_idx) in enumerate(kf.split(X_train)):
    print(f"Fold {fold+1}/5")
    X_train_fold, X_val_fold = X_train.iloc[train_idx], X_train.iloc[val_idx]
    y_train_fold, y_val_fold = y_train.iloc[train_idx], y_train.iloc[val_idx]
    pipeline.fit(X_train_fold, y_train_fold)
    oof_preds[val_idx] = pipeline.predict(X_val_fold)

# Validation RMSLE
y_pred_orig = np.expm1(oof_preds)
y_pred_orig = np.clip(y_pred_orig, a_min=0, a_max=400)
rmsle = np.sqrt(mean_squared_log_error(train['Calories'], y_pred_orig))
print(f"Validation RMSLE: {rmsle:.6f}")

Train data columns: ['Sex', 'Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Body_Temp', 'Calories']
Columns after feature engineering: ['Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Body_Temp', 'Calories', 'BMI', 'Body_Temp_Duration', 'Weight_Heart_Rate', 'Sex_male']
Numeric columns: ['Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Body_Temp', 'BMI', 'Body_Temp_Duration', 'Weight_Heart_Rate']
Categorical columns: ['Sex_male']
Fold 1/5
Fold 2/5
Fold 3/5
Fold 4/5
Fold 5/5
Validation RMSLE: 0.059882
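
<div style="background-color: white; color: black; padding: 20px; border-radius: 8px;">
  <p>
    The out-of-fold bookkeeping used above can be sketched with NumPy alone. The data are random, and the "model" is just the mean of each training fold, a stand-in chosen so the example runs without LightGBM. The point is only that every row receives exactly one prediction, from a model that never saw it.
  </p>
</div>

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_folds = 20, 5
y = rng.normal(size=n_rows)  # hypothetical log-scale target

# Shuffle the row indices once and cut them into five validation folds.
folds = np.array_split(rng.permutation(n_rows), n_folds)

oof = np.full(n_rows, np.nan)
for val_idx in folds:
    train_idx = np.setdiff1d(np.arange(n_rows), val_idx)
    # Stand-in "model": predict the mean of the training rows.
    fold_prediction = y[train_idx].mean()
    oof[val_idx] = fold_prediction

# Every row was predicted exactly once, by a model fitted without it.
assert not np.isnan(oof).any()
oof_rmse = np.sqrt(np.mean((oof - y) ** 2))
print(f"OOF RMSE of the mean predictor: {oof_rmse:.4f}")
```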


<div style="background-color: white; color: black; padding: 20px; border-radius: 8px;">
  <h2 style="color: red;">Ensemble Submission Creation</h2>
  <p>
    <strong>Why and How:</strong><br>
    1. We load five submission files, each containing predictions from different models or folds.<br>
    2. By taking the row-wise median of the predictions across these files, we dampen the influence of any single model's outlying predictions, improving overall robustness.<br>
    3. The <code>id</code> column is taken from the sample submission file, assuming all files list the same IDs in the same order.<br>
    4. The median predictions are written into that DataFrame, which is then saved as a combined submission file.<br>
    5. This simple ensemble technique often leads to better performance than using a single model's predictions.
  </p>
</div>

In [8]:
df1 = pd.read_csv("/kaggle/input/my-best-sub/submission_1.csv")
df2 = pd.read_csv("/kaggle/input/my-best-sub/submission_2.csv")
df3 = pd.read_csv("/kaggle/input/my-best-sub/submission_3.csv")
df4 = pd.read_csv("/kaggle/input/my-best-sub/submission_4.csv")
df5 = pd.read_csv("/kaggle/input/my-best-sub/submission_5.csv")

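# The sample submission only supplies the id column and the output layout;
# it does not contain ground-truth labels.
submission = pd.read_csv("/kaggle/input/playground-series-s5e5/sample_submission.csv")

# Row-wise median across the five prediction files (assumes identical row order).
all_preds = np.stack([df['Calories'] for df in [df1, df2, df3, df4, df5]], axis=1)
submission['Calories'] = np.median(all_preds, axis=1)
submission.to_csv('submission.csv', index=False)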
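
<div style="background-color: white; color: black; padding: 20px; border-radius: 8px;">
  <p>
    The blend above relies on every file listing the same IDs in the same order. The sketch below makes that assumption explicit and shows how the median ignores a single outlying prediction. <code>median_blend</code> and the three small frames are hypothetical and exist only for this illustration.
  </p>
</div>

```python
import numpy as np
import pandas as pd


def median_blend(frames, target="Calories"):
    """Row-wise median of `target` across frames that share the same ids."""
    ids = frames[0]["id"]
    for frame in frames[1:]:
        if not frame["id"].equals(ids):
            raise ValueError("submission files are not aligned on id")
    stacked = np.column_stack([frame[target].to_numpy() for frame in frames])
    return pd.DataFrame({"id": ids, target: np.median(stacked, axis=1)})


# Three hypothetical submissions; the third has an outlier in the first row.
subs = [
    pd.DataFrame({"id": [1, 2], "Calories": [100.0, 50.0]}),
    pd.DataFrame({"id": [1, 2], "Calories": [102.0, 52.0]}),
    pd.DataFrame({"id": [1, 2], "Calories": [160.0, 51.0]}),
]
blend = median_blend(subs)
print(blend)
```

For the first row the median is 102.0, while a plain mean would be pulled up to about 120.7 by the outlier, which is one reason to prefer the median when blending only a handful of files.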

<div style="background-color: white; color: black; padding: 25px; border-radius: 10px; font-family: Verdana, sans-serif; line-height: 1.6; max-width: 700px;">

  <h2 style="color: red; margin-bottom: 15px;">Project Summary & Key Highlights 🚀</h2>

  <p>
    In this project, we predicted <strong>Calories burned</strong> using physical and activity data, combining powerful feature engineering with advanced models and ensemble techniques.
  </p>

  <h3 style="color: #b22222; margin-top: 30px;">Feature Engineering 🛠️</h3>
  <ul>
    <li>Created features like <em>BMI</em>, <em>Body_Temp_Duration</em>, and <em>Weight_Heart_Rate</em> to enhance model understanding.</li>
    <li>Applied one-hot encoding to categorical variables for better representation.</li>
  </ul>

  <h3 style="color: #b22222; margin-top: 30px;">AutoML Framework & Models 🤖</h3>
  <p>We used FLAML's AutoML to tune and select models within a fixed time budget, reducing manual effort.</p>

  <div style="display: flex; gap: 20px; margin-top: 10px;">
    <div style="flex: 1; background: #f0f0f0; padding: 15px; border-radius: 8px; text-align: center;">
      <h4>CatBoost 🐱</h4>
      <p><strong>RMSLE:</strong> 0.05930</p>
      <p><strong>Train Time:</strong> ~45 mins (higher due to complex features and bins)</p>
    </div>
    <div style="flex: 1; background: #f0f0f0; padding: 15px; border-radius: 8px; text-align: center;">
      <h4>LightGBM 🌲</h4>
      <p><strong>RMSLE:</strong> 0.05937</p>
      <p><strong>Train Time:</strong> ~35 mins (due to max_bin=750 increasing training complexity)</p>
    </div>
    <div style="flex: 1; background: #f0f0f0; padding: 15px; border-radius: 8px; text-align: center;">
      <h4>XGBoost ⚡</h4>
      <p><strong>RMSLE:</strong> 0.05925</p>
      <p><strong>Train Time:</strong> ~40 mins (more exhaustive training)</p>
    </div>
  </div>

  <h3 style="color: #b22222; margin-top: 30px;">Validation Strategy 🔍</h3>
  <ul>
    <li>Implemented <strong>5-fold cross-validation</strong> to ensure robust performance estimates.</li>
    <li>Blended several submissions with a row-wise median to reduce variance.</li>
  </ul>

  <h3 style="color: #b22222; margin-top: 30px;">Ensembling 🤝</h3>
  <p>Combined multiple model predictions by taking their row-wise median, aiming to improve prediction stability and accuracy.</p>

  <p style="margin-top: 30px;">
    This approach balances feature richness, model power, and validation rigor to deliver reliable calorie predictions.
  </p>

</div>

<div style="background-color: white; color: black; padding: 25px; border-radius: 10px; font-family: Verdana, sans-serif; max-width: 750px; line-height: 1.6;">

  <h2 style="color: red; margin-bottom: 15px;">What is AutoML? 🤖✨</h2>

  <p>
    <strong>AutoML (Automated Machine Learning)</strong> automates the process of building, training, and tuning machine learning models, making ML accessible even to non-experts. It streamlines complex steps like data preprocessing, feature engineering, model selection, and hyperparameter tuning.
  </p>

  <h3 style="color: #b22222; margin-top: 25px;">Types of AutoML 🧰</h3>
  <ul>
    <li><strong>Neural Architecture Search (NAS) 🧠:</strong> Automatically designs optimal neural network architectures, especially for deep learning tasks.</li>
    <li><strong>Hyperparameter Optimization ⚙️:</strong> Finds the best hyperparameters for given models using methods like Bayesian optimization, grid search, or random search.</li>
    <li><strong>Feature Engineering Automation 🔧:</strong> Generates and selects the most relevant features automatically to improve model performance.</li>
    <li><strong>Full Pipeline Automation 🚀:</strong> Covers end-to-end workflows from data cleaning to deployment (e.g., Google AutoML, H2O Driverless AI).</li>
  </ul>

  <h3 style="color: #b22222; margin-top: 25px;">Where is AutoML Used? 🌍</h3>
  <ul>
    <li>Businesses without deep ML expertise but needing predictive analytics.</li>
    <li>Rapid prototyping and proof-of-concept projects.</li>
    <li>Data scientists aiming to save time on repetitive tasks.</li>
    <li>Large-scale automated model tuning in production environments.</li>
  </ul>

  <h3 style="color: #b22222; margin-top: 25px;">Key Benefits & KPIs 📈</h3>
  <ul>
    <li>⚡ <strong>Faster model development:</strong> Cuts development time by automating repetitive tasks.</li>
    <li>🎯 <strong>Improved accuracy:</strong> Finds better hyperparameters and model architectures.</li>
    <li>📊 <strong>Scalability:</strong> Easily applies to diverse datasets and problem types.</li>
    <li>🛠️ <strong>Reduced manual errors:</strong> Automates tedious processes, reducing human mistakes.</li>
  </ul>

  <h3 style="color: #b22222; margin-top: 25px;">Popular AutoML Tools & Examples 🛠️</h3>
  <ul>
    <li><strong>Google Cloud AutoML:</strong> User-friendly cloud service for image, video, text, and tabular data.</li>
    <li><strong>H2O Driverless AI:</strong> Advanced enterprise tool with strong feature engineering capabilities.</li>
    <li><strong>Auto-sklearn:</strong> Open-source Python library built on scikit-learn, great for tabular data.</li>
    <li><strong>FLAML:</strong> Lightweight, efficient AutoML for fast hyperparameter tuning.</li>
  </ul>

  <p style="margin-top: 30px; font-size: 0.9em; color: #555;">
    For more details, check out the <a href="https://en.wikipedia.org/wiki/Automated_machine_learning" target="_blank" rel="noopener noreferrer">AutoML Wikipedia page</a> and <a href="https://www.automl.org/" target="_blank" rel="noopener noreferrer">automl.org</a>.
  </p>

</div>
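
<div style="background-color: white; color: black; padding: 20px; border-radius: 8px;">
  <p>
    To make the hyperparameter optimization idea above concrete, here is a minimal random search sketch in plain Python. It is not FLAML's search strategy. <code>validation_loss</code> is a hypothetical stand-in for a cross-validated score, and the search ranges are arbitrary choices made for illustration.
  </p>
</div>

```python
import random


def validation_loss(learning_rate, num_leaves):
    """Hypothetical stand-in for a cross-validated loss (lower is better)."""
    return (learning_rate - 0.05) ** 2 + ((num_leaves - 64) / 256) ** 2


def random_search(n_trials, seed=0):
    """Sample configurations at random and keep the best one seen."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = {
            # Log-uniform sample between 1e-3 and 1.
            "learning_rate": 10 ** rng.uniform(-3, 0),
            "num_leaves": rng.randint(4, 256),
        }
        loss = validation_loss(**config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


best_config, best_loss = random_search(n_trials=200)
print(best_config, best_loss)
```

AutoML libraries replace the toy loss with real model training and usually sample more cleverly than uniformly at random, but the loop of proposing a configuration, scoring it, and keeping the best is the same.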

<div style="background-color: white; color: black; padding: 20px; border-radius: 8px; font-family: Arial, sans-serif;">
  <p>🎉 Thank you to everyone who read this far! 🎉</p>
  <p>🙏 Thank you so much for your support and interest! 🙏 I am grateful to each and every one of you for taking your valuable time to review this project. I hope the information I provided was useful and everything about the project was as you expected. 🚀</p>
  <p>💡 If you have any questions or feedback, please feel free to let me know. 💡</p>
  <p>🔗 See you in the next project! 🔗</p>
</div>