{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Exploratory Data Analysis (EDA) and Preprocessing Strategy\n",
    "\n",
    "**Goal:** Understand the Kepler exoplanet dataset and decide on preprocessing steps for model training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import os\n",
    "\n",
    "# Adjust path to import from src (if notebooks are in a subdir)\n",
    "import sys\n",
    "module_path = os.path.abspath(os.path.join('..')) # Adjust if your notebooks folder is elsewhere\n",
    "if module_path not in sys.path:\n",
    "    sys.path.append(module_path)\n",
    "\n",
    "from src import config\n",
    "from src import data_loader\n",
    "\n",
    "pd.set_option('display.max_columns', 100)\n",
    "pd.set_option('display.max_rows', 100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.1 Load Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_df = data_loader.load_data(config.RAW_DATA_FILE)\n",
    "if raw_df is not None:\n",
    "    print(f\"Dataset shape: {raw_df.shape}\")\n",
    "    display(raw_df.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.2 Basic Information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if raw_df is not None:\n",
    "    raw_df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if raw_df is not None:\n",
    "    # Clean column names for easier access\n",
    "    raw_df.columns = raw_df.columns.str.strip()\n",
    "    display(raw_df.describe(include='all'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.3 Target Variable Analysis (`koi_disposition`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if raw_df is not None and config.TARGET_COLUMN in raw_df.columns:\n",
    "    print(f\"Value counts for target variable '{config.TARGET_COLUMN}':\")\n",
    "    print(raw_df[config.TARGET_COLUMN].value_counts())\n",
    "    \n",
    "    plt.figure(figsize=(8, 5))\n",
    "    sns.countplot(x=config.TARGET_COLUMN, data=raw_df, order=raw_df[config.TARGET_COLUMN].value_counts().index)\n",
    "    plt.title(f'Distribution of {config.TARGET_COLUMN}')\n",
    "    plt.show()\n",
    "    \n",
    "    # Proposed mapping for binary classification\n",
    "    print(f\"\\nPositive labels (map to 1): {config.POSITIVE_LABELS}\")\n",
    "    print(f\"Negative label (map to 0): {config.NEGATIVE_LABEL}\")\n",
    "else:\n",
    "    print(f\"Target column '{config.TARGET_COLUMN}' not found.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The raw three-way labels are somewhat imbalanced, with `FALSE POSITIVE` the largest class. `CONFIRMED` and `CANDIDATE` are combined into the positive class (exoplanet), so the class balance should be re-checked after the binary mapping."
   ]
  },
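  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the binary mapping described above, run on a tiny hand-made series rather than the real data. The label strings and the `map_target` helper are illustrative assumptions: the project reads the actual values from `config.POSITIVE_LABELS` and `config.NEGATIVE_LABEL`, and `src/preprocessor.py` may implement the mapping differently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Assumed label values for illustration; the project reads these from config.py\n",
    "POSITIVE_LABELS = ['CONFIRMED', 'CANDIDATE']\n",
    "NEGATIVE_LABEL = 'FALSE POSITIVE'\n",
    "\n",
    "def map_target(labels: pd.Series) -> pd.Series:\n",
    "    # Map dispositions to 1/0; labels outside both sets become NaN and are dropped\n",
    "    mapping = {label: 1 for label in POSITIVE_LABELS}\n",
    "    mapping[NEGATIVE_LABEL] = 0\n",
    "    return labels.map(mapping).dropna().astype(int)\n",
    "\n",
    "# Hand-made example; 'NOT DISPOSITIONED' stands in for any unexpected label\n",
    "toy = pd.Series(['CONFIRMED', 'FALSE POSITIVE', 'CANDIDATE', 'FALSE POSITIVE', 'NOT DISPOSITIONED'])\n",
    "y_toy = map_target(toy)\n",
    "\n",
    "# Class balance after the binary mapping\n",
    "print(y_toy.value_counts(normalize=True))"
   ]
  },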
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.4 Missing Values Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if raw_df is not None:\n",
    "    missing_values = raw_df.isnull().sum()\n",
    "    missing_percent = missing_values / len(raw_df) * 100\n",
    "    missing_df = pd.DataFrame({'count': missing_values, 'percent': missing_percent})\n",
    "    missing_df = missing_df[missing_df['count'] > 0].sort_values(by='percent', ascending=False)\n",
    "    \n",
    "    if not missing_df.empty:\n",
    "        print(\"Features with missing values:\")\n",
    "        display(missing_df)\n",
    "        \n",
    "        plt.figure(figsize=(12, 8))\n",
    "        sns.barplot(x=missing_df.index, y='percent', data=missing_df)\n",
    "        plt.xticks(rotation=90)\n",
    "        plt.title('Percentage of Missing Values by Feature')\n",
    "        plt.ylabel('Percentage Missing (%)')\n",
    "        plt.tight_layout()\n",
    "        plt.show()\n",
    "    else:\n",
    "        print(\"No missing values found in the dataset.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many features have missing values, most notably the error columns (`_err1`, `_err2`), including those for stellar parameters such as `koi_steff_err1`. Median imputation of numerical features is a reasonable first strategy. Any column that is entirely NaN has no median to impute and should be dropped instead."
   ]
  },
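  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small sketch of median imputation on a toy frame, using plain pandas for illustration. The column names here are hypothetical, and the project's own pipeline may handle this differently. A column with no observed values is dropped first, since it has no median to fill with."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Toy frame with hypothetical columns; 'all_missing' mimics a column with no observed values\n",
    "toy = pd.DataFrame({\n",
    "    'a': [1.0, np.nan, 3.0, 100.0],\n",
    "    'b': [np.nan, 2.0, 2.0, 4.0],\n",
    "    'all_missing': [np.nan] * 4,\n",
    "})\n",
    "\n",
    "# Drop columns with no observed values, then fill the rest with per-column medians\n",
    "toy = toy.dropna(axis=1, how='all')\n",
    "imputed = toy.fillna(toy.median())\n",
    "\n",
    "# The outlier 100.0 in 'a' does not pull the fill value the way a mean would\n",
    "print(imputed)"
   ]
  },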
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.5 Feature Selection and Dropping Columns\n",
    "\n",
    "Based on `config.py` and initial understanding:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if raw_df is not None:\n",
    "    print(\"Columns to be dropped (from config.py):\")\n",
    "    # Filter to show only columns that actually exist in the DataFrame\n",
    "    existing_cols_to_drop = [col for col in config.FEATURES_TO_DROP if col in raw_df.columns]\n",
    "    print(existing_cols_to_drop)\n",
    "    \n",
    "    # koi_pdisposition vs koi_disposition\n",
    "    if 'koi_pdisposition' in raw_df.columns and 'koi_disposition' in raw_df.columns:\n",
    "        print(\"\\nComparison of 'koi_pdisposition' and 'koi_disposition':\")\n",
    "        print(\"koi_pdisposition value counts:\")\n",
    "        print(raw_df['koi_pdisposition'].value_counts())\n",
    "        # Cross-tabulate to show how the two disposition columns relate\n",
    "        display(pd.crosstab(raw_df['koi_disposition'], raw_df['koi_pdisposition']))\n",
    "    \n",
    "    # koi_score is a derived score; inspect its distribution before dropping it\n",
    "    if 'koi_score' in raw_df.columns:\n",
    "        print(\"\\n'koi_score' distribution:\")\n",
    "        raw_df['koi_score'].hist(bins=50)\n",
    "        plt.title('koi_score distribution')\n",
    "        plt.show()\n",
    "        print(\"Dropping 'koi_score' as it might be a pre-computed probability or cause leakage.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`koi_pdisposition` is the disposition based on Kepler data alone, so it largely duplicates `koi_disposition` and would leak the label if kept as a feature. `koi_score` is a confidence score for the disposition and is dropped for the same reason.\n",
    "\n",
    "The `FEATURES_TO_DROP` list in `config.py` looks reasonable for a first pass."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.6 Numerical Feature Analysis (Post Initial Drops)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if raw_df is not None:\n",
    "    df_temp = raw_df.copy()\n",
    "    df_temp.columns = df_temp.columns.str.strip() # Ensure clean names\n",
    "    \n",
    "    # Simulate dropping columns as per config (excluding target for now)\n",
    "    cols_to_drop_sim = [col for col in config.FEATURES_TO_DROP if col in df_temp.columns and col != config.TARGET_COLUMN]\n",
    "    df_features = df_temp.drop(columns=cols_to_drop_sim)\n",
    "    \n",
    "    # Identify numerical features from the remaining set\n",
    "    numerical_features = df_features.select_dtypes(include=np.number).columns.tolist()\n",
    "    if config.TARGET_COLUMN in numerical_features: # Should not be, but as a check\n",
    "        numerical_features.remove(config.TARGET_COLUMN)\n",
    "        \n",
    "    print(f\"Identified {len(numerical_features)} numerical features after initial drops.\")\n",
    "    # print(numerical_features)\n",
    "    \n",
    "    # Display histograms for a subset of numerical features\n",
    "    if numerical_features:\n",
    "        print(\"\\nPlotting histograms for a sample of numerical features...\")\n",
    "        # Fixed seed so the same features are sampled on every run\n",
    "        rng = np.random.default_rng(42)\n",
    "        sample_num_features = rng.choice(numerical_features, size=min(len(numerical_features), 9), replace=False).tolist()\n",
    "        df_features[sample_num_features].hist(bins=20, figsize=(15, 10), layout=(-1, 3))\n",
    "        plt.tight_layout()\n",
    "        plt.show()\n",
    "    \n",
    "    # Correlation Heatmap (for numerical features only)\n",
    "    if numerical_features and len(numerical_features) > 1:\n",
    "        print(\"\\nPlotting correlation heatmap for numerical features (can be slow if many features)...\")\n",
    "        # For performance, consider a subset or sampling for the heatmap if too many features\n",
    "        # We also need to handle NaNs before .corr()\n",
    "        df_corr = df_features[numerical_features].copy()\n",
    "        # Simple median imputation for heatmap purposes only\n",
    "        df_corr = df_corr.fillna(df_corr.median())\n",
    "        \n",
    "        # Select a smaller subset for cleaner visualization if many features\n",
    "        if len(numerical_features) > 30:\n",
    "            print(\"Selecting top 30 features by variance for correlation heatmap due to high dimensionality.\")\n",
    "            # Select features with highest variance for a more manageable heatmap\n",
    "            top_variance_features = df_corr.var().nlargest(30).index\n",
    "            correlation_matrix = df_corr[top_variance_features].corr()\n",
    "        else:\n",
    "            correlation_matrix = df_corr.corr()\n",
    "            \n",
    "        plt.figure(figsize=(18, 15))\n",
    "        sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', vmin=-1, vmax=1)\n",
    "        plt.title('Correlation Heatmap of Numerical Features')\n",
    "        plt.show()\n",
    "    else:\n",
    "        print(\"Not enough numerical features to plot a correlation heatmap.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many features are highly skewed (e.g., `koi_period`, `koi_duration`). Standardization puts them on a common scale but does not remove skew, so a log transform may be worth testing for scale-sensitive models. Some features are likely to be highly correlated (e.g., a feature and its error terms, or different flux measurements). Random Forest tolerates correlated features reasonably well, but severe multicollinearity would need attention for other model types."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.7 Categorical Features (Post Initial Drops)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if raw_df is not None:\n",
    "    # Using df_features from previous cell (raw_df after config drops)\n",
    "    categorical_features = df_features.select_dtypes(include='object').columns.tolist()\n",
    "    if config.TARGET_COLUMN in categorical_features:\n",
    "        categorical_features.remove(config.TARGET_COLUMN) # Target is handled separately\n",
    "        \n",
    "    if categorical_features:\n",
    "        print(f\"Identified categorical features: {categorical_features}\")\n",
    "        for cat_col in categorical_features:\n",
    "            print(f\"\\nValue counts for {cat_col}:\")\n",
    "            print(df_features[cat_col].value_counts(dropna=False))\n",
    "    else:\n",
    "        print(\"No remaining categorical features (excluding target) after initial drops and type selection.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After dropping IDs and text fields such as `koi_comment`, the Kepler dataset contains mostly numerical data or flags that can be treated as numerical (e.g., `koi_fpflag_nt`, which is 0 or 1). `src/preprocessor.py` currently drops any remaining object-type columns. If an important categorical feature is identified, it will need encoding (e.g., one-hot encoding) instead."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.8 Preprocessing Strategy Summary\n",
    "\n",
    "Based on the EDA, the `src/preprocessor.py` strategy is reasonable:\n",
    "\n",
    "1.  **Load Data**: Using `data_loader.py`.\n",
    "2.  **Clean Column Names**: Strip leading and trailing whitespace.\n",
    "3.  **Target Variable Encoding**: Map `koi_disposition` to binary (1 for `CONFIRMED`/`CANDIDATE`, 0 for `FALSE POSITIVE`). Filter out any other disposition values.\n",
    "4.  **Feature Dropping**: Remove columns listed in `config.FEATURES_TO_DROP` (identifiers, redundant info like `koi_pdisposition`, `koi_score`, text comments).\n",
    "5.  **Handle Remaining Categorical Features**: Currently, `preprocessor.py` drops any remaining non-numerical features. This is acceptable as most data is numerical/flags.\n",
    "6.  **Missing Value Imputation**: Use `SimpleImputer` with the `median` strategy for all numerical features. The median is robust to outliers.\n",
    "7.  **Feature Scaling**: Use `StandardScaler` on numerical features after imputation. This is important for many ML algorithms, though Random Forest is less sensitive to it.\n",
    "8.  **Train-Test Split**: Stratified split to maintain class proportions, using parameters from `config.py`. To avoid leakage, the imputer and scaler should be fit on the training split only."
   ]
  },
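  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Steps 6 to 8 could be sketched roughly as follows. This is an illustration on synthetic data, not the code in `src/preprocessor.py`: the array shapes, `test_size`, and `random_state` are placeholder assumptions. The split is done first here so that the imputer and scaler learn their statistics from the training rows only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Synthetic stand-in for the feature matrix and binary target (not Kepler data)\n",
    "rng = np.random.default_rng(0)\n",
    "X = rng.normal(size=(200, 4))\n",
    "X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values\n",
    "y = rng.integers(0, 2, size=200)\n",
    "\n",
    "# Stratified split; test_size and random_state are placeholder values\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.2, stratify=y, random_state=42\n",
    ")\n",
    "\n",
    "# Median imputation followed by standardization\n",
    "preprocess = Pipeline([\n",
    "    ('impute', SimpleImputer(strategy='median')),\n",
    "    ('scale', StandardScaler()),\n",
    "])\n",
    "\n",
    "# Fit on the training split only, then apply the same transform to the test split\n",
    "X_train_prep = preprocess.fit_transform(X_train)\n",
    "X_test_prep = preprocess.transform(X_test)\n",
    "\n",
    "print(X_train_prep.shape, X_test_prep.shape)"
   ]
  },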
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook provides the rationale for the steps implemented in `src/preprocessor.py`. The next notebook (`02_model_training_and_evaluation.ipynb`) uses the preprocessed data to train and evaluate models."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}