{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Experiment: End-to-End ML Pipeline\n",
    "\n",
    "This notebook demonstrates a complete experiment workflow that:\n",
    "\n",
    "- Loads a sample dataset\n",
    "- Preprocesses the data\n",
    "- Selects important features (with an option to extend/adapt to an adaptive approach)\n",
    "- Selects and tunes models\n",
    "- Evaluates model performance\n",
    "- Applies interpretability methods using SHAP\n",
    "\n",
    "All logs and results are saved for further review."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup & Imports\n",
    "\n",
    "Import necessary libraries and modules from the project. Adjust module paths as necessary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Basic libraries\n",
    "import os\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# For dataset loading and evaluation\n",
    "from sklearn.datasets import load_iris\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import accuracy_score, f1_score, confusion_matrix\n",
    "\n",
    "# Optional: SHAP for interpretability\n",
    "import shap\n",
    "\n",
    "# Import your pipeline modules (adjust the import paths if needed)\n",
    "from src import preprocess, feature_selection, model_selection, hyperparam_tuning, evaluation, pipeline_manager\n",
    "\n",
    "# Setup logging\n",
    "import logging\n",
    "\n",
    "LOG_FILE = 'results/logs.txt'\n",
    "logging.basicConfig(filename=LOG_FILE, level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')\n",
    "logging.info('Experiment started')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Load and Prepare Data\n",
    "\n",
    "For this experiment we use the Iris dataset. You can replace this section with your own dataset loader if needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load Iris dataset\n",
    "data = load_iris()\n",
    "X = pd.DataFrame(data.data, columns=data.feature_names)\n",
    "y = pd.Series(data.target, name='target')\n",
    "\n",
    "print('Dataset shape:', X.shape)\n",
    "\n",
    "# Log dataset information\n",
    "logging.info(f'Dataset loaded with shape: {X.shape}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Data Preprocessing\n",
    "\n",
    "Apply preprocessing steps (e.g., handling missing values, scaling, encoding). We assume a pre-built preprocessor exists in `src/preprocess.py`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Instantiate and run the preprocessor\n",
    "preprocessor = preprocess.Preprocessor()\n",
    "X_preprocessed = preprocessor.fit_transform(X)\n",
    "\n",
    "logging.info('Data preprocessing completed')\n",
    "print('Preprocessing complete. Sample data:')\n",
    "print(X_preprocessed.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Feature Selection\n",
    "\n",
    "Select important features from the dataset. \n",
    "\n",
    "If desired, you could extend this section to include an **adaptive feature selection** strategy that tailors the feature subset based on dataset characteristics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Instantiate and apply feature selection\n",
    "fs = feature_selection.FeatureSelector()\n",
    "X_selected = fs.select_features(X_preprocessed, y)\n",
    "\n",
    "logging.info('Feature selection completed')\n",
    "print('Selected features:')\n",
    "print(X_selected.columns.tolist())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Split Data into Train/Test Sets\n",
    "\n",
    "Standard train-test split to evaluate model performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)\n",
    "\n",
    "logging.info('Data split into train and test sets')\n",
    "print('Train and test sets created')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Model Selection & Hyperparameter Tuning\n",
    "\n",
    "Use your existing pipeline components to select the best model and tune hyperparameters. \n",
    "\n",
    "Here, we assume `model_selection.select_best_model` returns a preliminary best estimator and that `hyperparam_tuning.tune` optimizes it further using Optuna."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Model selection\n",
    "model_candidate = model_selection.select_best_model(X_train, y_train)\n",
    "logging.info(f'Initial model selected: {model_candidate.__class__.__name__}')\n",
    "print('Initial model selected:', model_candidate.__class__.__name__)\n",
    "\n",
    "# Hyperparameter tuning using Optuna (if integrated in your pipeline)\n",
    "tuned_model = hyperparam_tuning.tune(model_candidate, X_train, y_train)\n",
    "logging.info('Hyperparameter tuning completed')\n",
    "print('Hyperparameter tuning complete. Model details:')\n",
    "print(tuned_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Evaluation\n",
    "\n",
    "Evaluate the tuned model on the test set and log performance metrics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate predictions and evaluate\n",
    "y_pred = tuned_model.predict(X_test)\n",
    "\n",
    "acc = accuracy_score(y_test, y_pred)\n",
    "f1 = f1_score(y_test, y_pred, average='weighted')\n",
    "cm = confusion_matrix(y_test, y_pred)\n",
    "\n",
    "logging.info(f'Accuracy: {acc}')\n",
    "logging.info(f'F1 Score: {f1}')\n",
    "logging.info(f'Confusion Matrix: {cm}')\n",
    "\n",
    "print(f'Accuracy: {acc:.4f}')\n",
    "print(f'F1 Score: {f1:.4f}')\n",
    "print('Confusion Matrix:')\n",
    "print(cm)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Interpretability using SHAP\n",
    "\n",
    "Use SHAP to explain model predictions. This step is especially useful for understanding feature contributions. \n",
    "\n",
    "Note: Ensure that your tuned model is compatible with SHAP. You may need to use the appropriate SHAP explainer (e.g., `TreeExplainer` for tree-based models)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate SHAP values\n",
    "explainer = shap.Explainer(tuned_model.predict, X_train)\n",
    "shap_values = explainer(X_test)\n",
    "\n",
    "logging.info('SHAP analysis completed')\n",
    "\n",
    "shap.summary_plot(shap_values, X_test, plot_type='bar')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. Logging & Saving Results\n",
    "\n",
    "Log key results and optionally save the tuned model for future use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example: Save the tuned model using joblib\n",
    "import joblib\n",
    "model_path = 'results/tuned_model.pkl'\n",
    "joblib.dump(tuned_model, model_path)\n",
    "logging.info(f'Model saved to {model_path}')\n",
    "print(f'Model saved to {model_path}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 10. Conclusion\n",
    "\n",
    "This experiment notebook has demonstrated an end-to-end pipeline:\n",
    "\n",
    "- **Data Loading & Preprocessing**\n",
    "- **Feature Selection**\n",
    "- **Model Selection & Hyperparameter Tuning**\n",
    "- **Evaluation & Interpretability**\n",
    "\n",
    "All important steps are logged and key results are visualized. \n",
    "\n",
    "Feel free to extend this notebook with additional research components such as adaptive feature selection strategies or meta-learning based model selection techniques."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.x"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

