# 🛰️ Kepler Mission - Exoplanet Classification Analysis
## NASA Space Apps Challenge - Complete 3-Class Implementation

### 🎯 **Objective**
Classify Kepler Objects of Interest (KOIs) into three categories using the Exoplanet Archive disposition column (`koi_disposition`):
- **CONFIRMED**: Validated exoplanets (2,746 samples)
- **CANDIDATE**: Objects awaiting validation (1,979 samples)  
- **FALSE POSITIVE**: Ruled-out objects (4,839 samples)
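
The class counts above imply a useful reference point: a model that always predicts the largest class would already be right about half the time. A minimal sketch of that arithmetic, using only the counts listed above (the variable names are illustrative):

```python
# Class counts as listed in the objective above.
class_counts = {
    "CONFIRMED": 2746,
    "CANDIDATE": 1979,
    "FALSE POSITIVE": 4839,
}

total = sum(class_counts.values())

# Share of each class, in percent.
class_shares = {name: 100 * count / total for name, count in class_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / total

for name, share in class_shares.items():
    print(f"{name}: {share:.1f}%")
print(f"Majority-class baseline ({majority_class}): {baseline_accuracy:.3f}")
```

Any trained model should be judged against this baseline rather than against zero.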

### 📊 **Dataset Information**
- **Source**: NASA Exoplanet Archive - Kepler Cumulative Table
- **File**: `cumulative_2025.09.25_10.52.58.csv`
- **Total Samples**: 9,564 Kepler Objects of Interest
- **Target Variable**: `koi_disposition` (3-class classification)
- **Mission**: Kepler Space Telescope (2009-2018)

### 🔬 **Analysis Approach**
1. **Data Loading & Exploration**: Comprehensive EDA with target analysis
2. **Feature Engineering**: Astronomical feature derivation and preprocessing
3. **Model Training**: Five-model comparison (Logistic Regression, Random Forest, Extra Trees, XGBoost, SVM)
4. **Evaluation**: Cross-validation, feature importance, and performance analysis
5. **Results**: Integration-ready outputs for multi-dataset comparison

---

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')

# Set style for plots
plt.style.use('default')
sns.set_palette("husl")

print("🚀 Kepler Analysis - Libraries Imported Successfully!")
print("📅 Analysis Date: September 26, 2025")
print("🎯 Target: koi_disposition (3-class classification)")

## 📂 Data Loading and Initial Exploration

In [None]:
# Load the Kepler dataset
print("📂 Loading Kepler Dataset...")
kepler_data = pd.read_csv('cumulative_2025.09.25_10.52.58.csv', comment='#', low_memory=False)

print(f"✅ Dataset loaded successfully!")
print(f"📊 Dataset shape: {kepler_data.shape}")
print(f"📋 Columns: {kepler_data.shape[1]}")
print(f"📈 Rows: {kepler_data.shape[0]:,}")

# Display basic information
print(f"\n🔍 Dataset Overview:")
print(f"  • Memory usage: {kepler_data.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
print(f"  • Data types: {kepler_data.dtypes.value_counts().to_dict()}")
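
The `comment='#'` argument matters because CSV exports from the NASA Exoplanet Archive typically begin with a block of `#`-prefixed lines describing the columns. The sketch below uses a made-up three-row excerpt (placeholder KOI names and values, not real archive rows) to show how pandas skips that preamble:

```python
import io

import pandas as pd

# Hypothetical excerpt mimicking the layout of an archive export:
# '#'-prefixed description lines, then a regular CSV header and rows.
raw_csv = """# This file mimics an archive export (illustrative text)
# COLUMN kepoi_name: KOI Name
# COLUMN koi_disposition: Exoplanet Archive Disposition
# COLUMN koi_period: Orbital Period [days]
kepoi_name,koi_disposition,koi_period
K00000.01,CONFIRMED,2.47
K00000.02,FALSE POSITIVE,
K00000.03,CANDIDATE,4.89
"""

# comment='#' drops every line that starts with '#', so the first
# remaining line is used as the header.
sample = pd.read_csv(io.StringIO(raw_csv), comment="#")

print(sample.shape)
print(sample.dtypes)
print(sample["koi_period"].isna().sum())
```

The empty field in the second row is read as `NaN`, which is what the later imputation step has to handle.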

In [None]:
# Analyze target variable - koi_disposition (the correct 3-class target)
print("🎯 Target Variable Analysis: 'koi_disposition'")
print("=" * 50)

target_counts = kepler_data['koi_disposition'].value_counts()
target_pct = kepler_data['koi_disposition'].value_counts(normalize=True) * 100

print("📊 Class Distribution:")
for class_name, count in target_counts.items():
    percentage = target_pct[class_name]
    print(f"  • {class_name}: {count:,} samples ({percentage:.1f}%)")

print(f"\n📈 Total samples: {target_counts.sum():,}")

# Verify we have all three classes
print(f"\n✅ Confirmed 3-class problem:")
print(f"  • CONFIRMED class: {target_counts.get('CONFIRMED', 0):,} samples")
print(f"  • CANDIDATE class: {target_counts.get('CANDIDATE', 0):,} samples") 
print(f"  • FALSE POSITIVE class: {target_counts.get('FALSE POSITIVE', 0):,} samples")

# Compare with the binary version (koi_pdisposition)
print(f"\n🔄 Comparison with koi_pdisposition (binary version):")
binary_counts = kepler_data['koi_pdisposition'].value_counts()
print(f"  • Binary version has {len(binary_counts)} classes: {list(binary_counts.index)}")
print(f"  • Complete version has {len(target_counts)} classes: {list(target_counts.index)}")

In [None]:
# Visualize target distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1: 3-class target distribution (koi_disposition)
target_counts.plot(kind='bar', ax=ax1, color=['#2E86AB', '#A23B72', '#F18F01'], 
                   edgecolor='black', alpha=0.8)
ax1.set_title('Kepler Target Distribution\n(koi_disposition - 3-class)', 
              fontsize=14, fontweight='bold')
ax1.set_xlabel('Disposition Class')
ax1.set_ylabel('Number of Objects')
ax1.tick_params(axis='x', rotation=45)
ax1.grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(target_counts.values):
    ax1.text(i, v + 50, f'{v:,}\n({target_pct.iloc[i]:.1f}%)', 
             ha='center', va='bottom', fontweight='bold')

# Plot 2: Comparison pie chart
ax2.pie(target_counts.values, labels=target_counts.index, autopct='%1.1f%%',
        colors=['#2E86AB', '#A23B72', '#F18F01'], startangle=90)
ax2.set_title('Class Distribution\n(Proportional View)', 
              fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("📊 Visualization: 3-class target distribution displayed")

## 🔧 Feature Engineering & Data Preprocessing

In [None]:
# Identify astronomical features
print("🔍 Astronomical Feature Identification")
print("=" * 40)

# Key astronomical parameters
astronomical_keywords = ['koi_', 'pl_', 'st_', 'period', 'radius', 'mass', 'temp', 'mag', 'depth', 'duration', 'snr']
astronomical_features = []

for col in kepler_data.columns:
    col_lower = col.lower()
    if any(keyword in col_lower for keyword in astronomical_keywords):
        if col not in ['koi_disposition', 'koi_pdisposition']:  # Exclude target columns
            astronomical_features.append(col)

print(f"📡 Identified astronomical features ({len(astronomical_features)}):")
for i, feature in enumerate(astronomical_features[:15], 1):
    print(f"  {i:2d}. {feature}")
if len(astronomical_features) > 15:
    print(f"  ... and {len(astronomical_features) - 15} more features")

# Analyze missing data
print(f"\n🔍 Missing Data Analysis:")
missing_data = kepler_data[astronomical_features].isnull().sum()
missing_percent = (missing_data / len(kepler_data)) * 100
missing_df = pd.DataFrame({
    'Feature': missing_data.index,
    'Missing_Count': missing_data.values,
    'Missing_Percent': missing_percent.values
}).sort_values('Missing_Percent', ascending=False)

# Show top 10 features with missing data
print("📊 Top 10 features with missing data:")
for _, row in missing_df.head(10).iterrows():
    if row['Missing_Count'] > 0:
        print(f"  • {row['Feature']}: {row['Missing_Count']:,} ({row['Missing_Percent']:.1f}%)")
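
One caveat with substring matching on `koi_`: it selects every KOI column. The full cumulative table includes columns such as `koi_score` and the `koi_fpflag_*` flags, which are themselves products of the vetting process and can leak the label into the features. Whether they are present here depends on which columns were downloaded. The sketch below shows one way to screen them out. The exclusion list and the `drop_leaky_columns` helper are assumptions for illustration, not part of the original analysis:

```python
# Hypothetical exclusion list: exact names and prefixes of columns that
# encode the vetting outcome.
LEAKY_EXACT = {"koi_disposition", "koi_pdisposition", "koi_score"}
LEAKY_PREFIXES = ("koi_fpflag_",)


def drop_leaky_columns(columns):
    """Return the columns that are not on the exclusion list, in order."""
    return [
        col
        for col in columns
        if col not in LEAKY_EXACT and not col.startswith(LEAKY_PREFIXES)
    ]


# Illustrative column names; the real list would come from kepler_data.columns.
example_columns = [
    "koi_disposition", "koi_score", "koi_fpflag_nt", "koi_fpflag_ss",
    "koi_period", "koi_depth", "koi_prad",
]
print(drop_leaky_columns(example_columns))
```

Comparing model accuracy with and without such columns is a quick way to see how much of the score depends on them.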

In [None]:
# Feature preprocessing and engineering
print("⚙️ Feature Preprocessing Pipeline")
print("=" * 35)

# Separate features by type
numerical_columns = kepler_data[astronomical_features].select_dtypes(include=[np.number]).columns.tolist()
categorical_columns = kepler_data[astronomical_features].select_dtypes(include=['object']).columns.tolist()

print(f"📊 Feature breakdown:")
print(f"  • Numerical features: {len(numerical_columns)}")
print(f"  • Categorical features: {len(categorical_columns)}")

# Prepare feature matrix X and target vector y
X = kepler_data[astronomical_features].copy()
y = kepler_data['koi_disposition'].copy()

print(f"\n🎯 Target preparation:")
print(f"  • Target variable: koi_disposition")
print(f"  • Classes: {y.value_counts().to_dict()}")

# Encode target variable
target_encoder = LabelEncoder()
y_encoded = target_encoder.fit_transform(y.astype(str))
target_classes = target_encoder.classes_

print(f"  • Encoded classes: {dict(zip(range(len(target_classes)), target_classes))}")

# Handle numerical features
print(f"\n🔢 Numerical feature processing:")
numerical_imputer = SimpleImputer(strategy='median')
X_numerical = X[numerical_columns].copy()

# Replace infinite values
X_numerical = X_numerical.replace([np.inf, -np.inf], np.nan)

# Impute missing values
X_numerical_imputed = pd.DataFrame(
    numerical_imputer.fit_transform(X_numerical),
    columns=numerical_columns,
    index=X.index
)

print(f"  • Features processed: {len(numerical_columns)}")
print(f"  • Missing values imputed with median")
print(f"  • Infinite values replaced with NaN then imputed")

# Handle categorical features
print(f"\n📝 Categorical feature processing:")
if categorical_columns:
    categorical_imputer = SimpleImputer(strategy='most_frequent')
    label_encoders = {}
    
    X_categorical = X[categorical_columns].copy()
    X_categorical_imputed = pd.DataFrame(
        categorical_imputer.fit_transform(X_categorical),
        columns=categorical_columns,
        index=X.index
    )
    
    # Label encode categorical features
    for col in categorical_columns:
        le = LabelEncoder()
        X_categorical_imputed[col] = le.fit_transform(X_categorical_imputed[col].astype(str))
        label_encoders[col] = le
    
    # Combine numerical and categorical features
    X_processed = pd.concat([X_numerical_imputed, X_categorical_imputed], axis=1)
    print(f"  • Categorical features encoded: {len(categorical_columns)}")
else:
    X_processed = X_numerical_imputed
    label_encoders = {}
    print(f"  • No categorical features found")

print(f"\n✅ Preprocessing completed")
print(f"  • Final feature matrix shape: {X_processed.shape}")
print(f"  • Target vector shape: {y_encoded.shape}")
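
`LabelEncoder` is designed for target labels. For feature columns, scikit-learn's `OrdinalEncoder` does the equivalent job and can be told how to treat categories it did not see during fitting. A minimal sketch on made-up data (the column name `provenance` and its values are placeholders):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical feature; values are placeholders, not real archive entries.
train = pd.DataFrame({"provenance": ["A", "B", "A", "C"]})
new = pd.DataFrame({"provenance": ["B", "D"]})  # "D" was never seen in training

# Unknown categories are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

encoded_new = encoder.transform(new)
print(encoder.categories_)
print(encoded_new.ravel())
```

This matters once the saved encoders are applied to another mission's data, where unseen category values are likely.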

In [None]:
# Feature scaling
print("📏 Feature Scaling")
print("=" * 20)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_processed)

print(f"✅ Features scaled using StandardScaler")
print(f"  • Scaled feature matrix shape: {X_scaled.shape}")
print(f"  • Features centered (mean ≈ 0) and scaled (std ≈ 1)")

# Create DataFrame for easier handling
X_scaled_df = pd.DataFrame(X_scaled, columns=X_processed.columns, index=X_processed.index)

# Verify scaling
print(f"\n🔍 Scaling verification:")
print(f"  • Mean of first 5 features: {X_scaled_df.iloc[:, :5].mean().round(3).tolist()}")
print(f"  • Std of first 5 features: {X_scaled_df.iloc[:, :5].std().round(3).tolist()}")

# Final dataset summary
print(f"\n📋 Final Dataset Summary:")
print(f"  • Total samples: {X_scaled.shape[0]:,}")
print(f"  • Total features: {X_scaled.shape[1]:,}")
print(f"  • Target classes: {len(target_classes)} (3-class classification)")
print(f"  • Class distribution: {dict(zip(target_classes, np.bincount(y_encoded)))}")
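
The imputers and scaler above are fitted on the full dataset before any train/test split, so statistics from the eventual test rows influence the preprocessing. One common way to avoid this is to wrap the steps in a scikit-learn `Pipeline`, which refits them on the training portion of every split. The sketch below uses synthetic data from `make_classification` as a stand-in for the Kepler features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the Kepler feature matrix.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=42,
)

# Knock out roughly 5% of the values to mimic missing measurements.
rng = np.random.default_rng(42)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Imputation and scaling are re-fitted inside each CV fold.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

A fitted pipeline can also be pickled as a single object, which keeps the preprocessing and the model in sync.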

## 🤖 Model Training Pipeline

In [None]:
# Model Training Pipeline
print("🚂 Kepler Model Training Pipeline:")
print("=" * 50)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)

print(f"📊 Data split:")
print(f"  • Training set: {X_train.shape}")
print(f"  • Test set: {X_test.shape}")

# Check class distribution in splits
train_dist = np.bincount(y_train)
test_dist = np.bincount(y_test)

print(f"\n🎯 Training set distribution:")
for i, class_name in enumerate(target_classes):
    print(f"  • {class_name}: {train_dist[i]:,} ({train_dist[i]/len(y_train):.1%})")

print(f"\n🎯 Test set distribution:")
for i, class_name in enumerate(target_classes):
    print(f"  • {class_name}: {test_dist[i]:,} ({test_dist[i]/len(y_test):.1%})")

# Initialize models (consistent with TOI and K2 analysis)
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Extra Trees': ExtraTreesClassifier(n_estimators=100, random_state=42),
    'XGBoost': xgb.XGBClassifier(random_state=42, eval_metric='mlogloss'),
    'SVM': SVC(random_state=42, probability=True)
}

print(f"\n🤖 Models to train: {list(models.keys())}")

In [None]:
# Train and evaluate all models
print("🎯 Training 5 models...")
print()

results = {}
trained_models = {}

for name, model in models.items():
    print(f"🔄 Training {name}...")
    
    # Train the model
    model.fit(X_train, y_train)
    trained_models[name] = model
    
    # Make predictions
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    # Cross-validation on the training set (stratified, shuffled folds)
    cv_scores = cross_val_score(
        model, X_train, y_train, scoring='accuracy',
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    )
    
    results[name] = {
        'accuracy': accuracy,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'predictions': y_pred
    }
    
    print(f"✅ {name} - Accuracy: {accuracy:.4f} | CV: {cv_scores.mean():.4f} (±{cv_scores.std():.4f})")

print(f"\n🏆 Model Training Completed!")
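
Because the three classes are imbalanced, accuracy alone can hide weak performance on the smaller classes. A possible extension, not part of the original loop, is to score each model with `cross_validate` on several metrics at once. The sketch uses synthetic data and a single model for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced 3-class data (class weights are illustrative).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3,
    weights=[0.5, 0.3, 0.2], random_state=42,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=cv,
    scoring=["accuracy", "balanced_accuracy", "f1_macro"],
)

for metric in ("accuracy", "balanced_accuracy", "f1_macro"):
    values = cv_results[f"test_{metric}"]
    print(f"{metric}: {values.mean():.3f} ± {values.std():.3f}")
```

A large gap between accuracy and balanced accuracy would suggest that the majority class is carrying the score.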

In [None]:
# Performance comparison and ranking
print("🏆 Model Performance Ranking:")
print("=" * 40)

# Create results DataFrame
results_df = pd.DataFrame({
    'Model': results.keys(),
    'Test_Accuracy': [results[name]['accuracy'] for name in results.keys()],
    'CV_Mean': [results[name]['cv_mean'] for name in results.keys()],
    'CV_Std': [results[name]['cv_std'] for name in results.keys()]
}).sort_values('Test_Accuracy', ascending=False)

print(results_df.to_string(index=False, float_format='%.6f'))

# Identify best model
best_model_name = results_df.iloc[0]['Model']
best_model = trained_models[best_model_name]
best_predictions = results[best_model_name]['predictions']

print(f"\n🥇 Best Model: {best_model_name}")
print(f"  • Test Accuracy: {results[best_model_name]['accuracy']:.4f}")
print(f"  • Cross-validation: {results[best_model_name]['cv_mean']:.4f} (±{results[best_model_name]['cv_std']:.4f})")

# Detailed classification report for best model
print(f"\n📋 Classification Report - {best_model_name}:")
print(classification_report(y_test, best_predictions, target_names=target_classes))

# Confusion matrix
print(f"\n🔢 Confusion Matrix:")
cm = confusion_matrix(y_test, best_predictions)
print(cm)
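
The confusion matrix can be reduced to per-class recall and precision by dividing the diagonal by the row sums and column sums respectively. A small worked example with made-up counts (rows are actual classes, columns are predictions, as in scikit-learn's `confusion_matrix`):

```python
import numpy as np

# Made-up 3x3 confusion matrix: rows = actual, columns = predicted.
cm_demo = np.array([
    [80, 15,  5],
    [10, 85,  5],
    [ 5, 10, 85],
])

correct = np.diag(cm_demo)
recall = correct / cm_demo.sum(axis=1)     # share of each actual class that was found
precision = correct / cm_demo.sum(axis=0)  # share of each predicted class that is right
accuracy = correct.sum() / cm_demo.sum()

print("recall:   ", recall.round(3))
print("precision:", precision.round(3))
print("accuracy: ", round(accuracy, 3))
```

These are the same quantities that `classification_report` prints, computed by hand so the link to the matrix is explicit.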

In [None]:
# Visualization of results
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Plot 1: Model performance comparison
ax1 = axes[0]
models_list = results_df['Model'].tolist()
accuracies = results_df['Test_Accuracy'].tolist()
cv_means = results_df['CV_Mean'].tolist()

x_pos = np.arange(len(models_list))
ax1.bar(x_pos, accuracies, alpha=0.7, label='Test Accuracy', color='skyblue', edgecolor='black')
ax1.plot(x_pos, cv_means, 'ro-', label='CV Mean', linewidth=2, markersize=6)
ax1.set_xlabel('Models')
ax1.set_ylabel('Accuracy')
ax1.set_title('Kepler Dataset - Model Performance', fontweight='bold')
ax1.set_xticks(x_pos)
ax1.set_xticklabels(models_list, rotation=45, ha='right')
ax1.legend()
ax1.grid(axis='y', alpha=0.3)
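# Derive the lower limit from the scores so that bars below 0.9 are not clipped
ax1.set_ylim(max(0.0, min(accuracies + cv_means) - 0.05), 1.0)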

# Plot 2: Cross-validation performance
ax2 = axes[1]
cv_means_plot = results_df['CV_Mean'].tolist()
cv_stds_plot = results_df['CV_Std'].tolist()

ax2.bar(x_pos, cv_means_plot, yerr=cv_stds_plot, alpha=0.7, 
        color='lightgreen', edgecolor='black', capsize=5)
ax2.set_xlabel('Models')
ax2.set_ylabel('CV Accuracy')
ax2.set_title('Cross-Validation Performance', fontweight='bold')
ax2.set_xticks(x_pos)
ax2.set_xticklabels(models_list, rotation=45, ha='right')
ax2.grid(axis='y', alpha=0.3)

# Plot 3: Confusion matrix heatmap
ax3 = axes[2]
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=target_classes, yticklabels=target_classes, ax=ax3)
ax3.set_title(f'Confusion Matrix - {best_model_name}', fontweight='bold')
ax3.set_xlabel('Predicted')
ax3.set_ylabel('Actual')

plt.tight_layout()
plt.show()

print("📊 Model evaluation visualizations completed")

## 📊 Feature Importance Analysis

In [None]:
# Feature importance analysis for tree-based models
print("📊 Feature Importance Analysis")
print("=" * 35)

tree_models = ['Random Forest', 'Extra Trees', 'XGBoost']
feature_names = X_processed.columns

for model_name in tree_models:
    if model_name in trained_models:
        model = trained_models[model_name]
        
        if hasattr(model, 'feature_importances_'):
            importance = model.feature_importances_
            
            # Create importance DataFrame
            importance_df = pd.DataFrame({
                'Feature': feature_names,
                'Importance': importance
            }).sort_values('Importance', ascending=False)
            
            print(f"\n🌲 {model_name} - Top 10 Important Features:")
            print(importance_df.head(10).to_string(index=False))
            
            # Plot feature importance for best model
            if model_name == best_model_name:
                plt.figure(figsize=(12, 8))
                top_features = importance_df.head(15)
                plt.barh(range(len(top_features)), top_features['Importance'], 
                        color='steelblue', alpha=0.8, edgecolor='black')
                plt.yticks(range(len(top_features)), top_features['Feature'])
                plt.xlabel('Feature Importance')
                plt.title(f'Top 15 Feature Importances - {best_model_name}\n(Kepler 3-Class Classification)', 
                         fontweight='bold', fontsize=14)
                plt.gca().invert_yaxis()
                plt.grid(axis='x', alpha=0.3)
                plt.tight_layout()
                plt.show()

print(f"\n✅ Feature importance analysis completed for tree-based models")
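
The `feature_importances_` values above are derived from the fitted trees on the training data. For Random Forest and Extra Trees they are impurity-based and can favor features with many distinct values. Permutation importance on held-out data is a common cross-check. A sketch on synthetic data (with `shuffle=False`, `make_classification` places the informative features in the first columns):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, shuffle=False, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out set and measure the score drop.
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)

ranking = perm.importances_mean.argsort()[::-1]
for idx in ranking[:5]:
    print(f"feature_{idx}: {perm.importances_mean[idx]:.3f} ± {perm.importances_std[idx]:.3f}")
```

Features that rank high in both views are the safer ones to interpret.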

## 💾 Results Summary and Integration

In [None]:
# Kepler Analysis Final Summary and Integration Preparation
print("🎯 Kepler Dataset Analysis - Final Summary")
print("=" * 50)

print(f"📊 Dataset Statistics:")
print(f"  • Total samples: {len(kepler_data):,}")
print(f"  • Original features: {len(kepler_data.columns)}")
print(f"  • Processed features: {X_scaled.shape[1]}")
print(f"  • Target variable: koi_disposition (3-class classification)")

print(f"\n🏆 Model Performance (3-Class Classification):")
for i, (_, row) in enumerate(results_df.iterrows(), 1):
    print(f"  {i}. {row['Model']}: {row['Test_Accuracy']:.4f} accuracy")

print(f"\n🥇 Best Performing Model:")
print(f"  • Model: {best_model_name}")
print(f"  • Test Accuracy: {results[best_model_name]['accuracy']:.4f}")
print(f"  • Cross-validation: {results[best_model_name]['cv_mean']:.4f} ± {results[best_model_name]['cv_std']:.4f}")

print(f"\n🎯 Target Classes (3-Class Problem):")
for class_name, count in target_counts.items():
    percentage = (count / target_counts.sum()) * 100
    print(f"  • {class_name}: {count:,} samples ({percentage:.1f}%)")

# Feature importance for best model
if best_model_name in tree_models and hasattr(trained_models[best_model_name], 'feature_importances_'):
    model = trained_models[best_model_name]
    importance = model.feature_importances_
    
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importance
    }).sort_values('Importance', ascending=False)
    
    print(f"\n🌲 {best_model_name} - Top 10 Important Features:")
    print(importance_df.head(10).to_string(index=False))

# Save results for integration
print(f"\n💾 Saving Kepler Results for Multi-Dataset Integration:")
kepler_output_data = {
    'processed_features': X_processed,
    'scaled_features': X_scaled,
    'target': y_encoded,
    'target_names': target_classes,
    'feature_names': feature_names.tolist(),
    'best_model': best_model,
    'best_model_name': best_model_name,
    'results_summary': results_df,
    'dataset_info': {
        'name': 'Kepler',
        'samples': len(kepler_data),
        'original_features': len(kepler_data.columns),
        'processed_features': X_scaled.shape[1],
        'target_variable': 'koi_disposition',
        'classes': 3,
        'class_distribution': dict(zip(target_classes, np.bincount(y_encoded)))
    },
    'preprocessing_info': {
        'scaler': scaler,
        'target_encoder': target_encoder,
        'label_encoders': label_encoders,
        'numerical_imputer': numerical_imputer,
        # Store the fitted imputer from preprocessing, not a fresh unfitted one
        'categorical_imputer': categorical_imputer if categorical_columns else None
    }
}

# Save results
import pickle
with open('kepler_analysis_corrected_results.pkl', 'wb') as f:
    pickle.dump(kepler_output_data, f)

print("✅ Results saved to 'kepler_analysis_corrected_results.pkl'")

print(f"\n🔬 Key Insights:")
print("  • Kepler dataset successfully analyzed using 3-class classification")
print("  • Target: koi_disposition (includes CONFIRMED exoplanets)")
print("  • Consistent methodology with TOI and K2 datasets")
print("  • Results ready for multi-dataset comparison and integration")

print(f"\n🚀 Integration Readiness:")
print("  • ✅ Kepler dataset analysis completed (3-class)")
print("  • ✅ TOI dataset analysis completed") 
print("  • ✅ K2 dataset analysis completed")
print("  • 🔄 Ready for cross-dataset comparison")
print("  • 🔄 Ready for unified preprocessing pipeline")
print("  • 🔄 Ready for merged multi-mission training")

print(f"\n📋 Next Steps:")
print("  1. Update cross-dataset comparison with corrected Kepler results")
print("  2. Unified preprocessing pipeline development")
print("  3. Multi-dataset integration and training")
print("  4. Final NASA Space Apps challenge solution")

print("\n" + "=" * 50)
print("✨ Kepler 3-Class Analysis Pipeline Completed Successfully! ✨")
print("🌟 Ready for Multi-Dataset Integration Phase! 🌟")
print("=" * 50)
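
To reuse the saved artifacts, the pickle can be loaded back and the stored preprocessing objects applied to new rows in the same order they were fitted. The round trip below is a sketch with a small stand-in dictionary written to a temporary directory. The dictionary mirrors only part of the `kepler_output_data` structure, and the feature names are placeholders. Pickle files should only be loaded from trusted sources, and they may not load cleanly across library versions.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the saved results: a fitted scaler plus some metadata.
X_fit = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
payload = {
    "preprocessing_info": {"scaler": StandardScaler().fit(X_fit)},
    "feature_names": ["feature_a", "feature_b"],  # placeholder names
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_results.pkl")
    with open(path, "wb") as f:
        pickle.dump(payload, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored scaler reproduces the original transformation:
# a row equal to the column means is mapped to zeros.
X_new = np.array([[2.0, 20.0]])
scaled_new = restored["preprocessing_info"]["scaler"].transform(X_new)
print(scaled_new)
```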