# EXOPLANET CLASSIFICATION WITH kNN AND PCA

This notebook analyzes the NASA Kepler Object of Interest (KOI) dataset to classify potential exoplanet candidates as 'CONFIRMED' vs 'CANDIDATE' vs 'FALSE POSITIVE' using k-Nearest Neighbors. We'll compare performance with and without PCA dimensionality reduction.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix, classification_report
import time
import os
import psutil
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Define a colorblind-friendly palette (Okabe-Ito)
COLORBLIND_PALETTE = {
    'blue': '#0072B2',
    'orange': '#E69F00',
    'green': '#009E73',
    'red': '#D55E00',
    'purple': '#CC79A7',
    'yellow': '#F0E442',
    'cyan': '#56B4E9',
    'grey': '#999999'
}

# Colors for the three classes (consistent across all visualizations)
CLASS_COLORS = {
    'FALSE POSITIVE': COLORBLIND_PALETTE['red'],
    'CANDIDATE': COLORBLIND_PALETTE['blue'],
    'CONFIRMED': COLORBLIND_PALETTE['green']
}

# Colors for with/without PCA comparison
PCA_COLORS = {
    'Without PCA': COLORBLIND_PALETTE['orange'],
    'With PCA': COLORBLIND_PALETTE['purple']
}

## 1. Data Loading and Exploratory Data Analysis

First, we'll load the KOI dataset and prepare it for multi-class classification of exoplanets. We will also examine the class distribution and key features of our dataset.

In [3]:
def load_and_preprocess_data(csv_path='koi_data.csv'):
    """
    Load and preprocess the KOI dataset for multi-class classification
    Parameters:
    -----------
    csv_path : str
        Path to the CSV file
    Returns:
    --------
    X_train, X_test, y_train, y_test, feature_names, scaler, class_names
    """
    df = pd.read_csv(csv_path)

    # Display some basic information
    print(f"Dataset shape: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")

    # Check for missing values
    missing_values = df.isnull().sum().sum()
    print(f"Total missing values: {missing_values}")

    # Create multi-class classification target
    class_mapping = {'FALSE POSITIVE': 0, 'CANDIDATE': 1, 'CONFIRMED': 2}
    class_names = ['FALSE POSITIVE', 'CANDIDATE', 'CONFIRMED']
    df['target'] = df['koi_disposition'].map(class_mapping)

    # Display class distribution
    class_counts = df['target'].value_counts()
    print("\nClass distribution:")
    for class_id, count in class_counts.items():
        class_name = class_names[class_id]
        percentage = count / len(df) * 100
        print(f"{class_name}: {count} ({percentage:.2f}%)")

    # Select features - exclude non-numerical and target columns
    exclude_cols = ['koi_disposition', 'target']
    feature_cols = [col for col in df.columns if col not in exclude_cols]

    # Split data
    X = df[feature_cols]
    y = df['target']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Create a scaler for later use
    scaler = StandardScaler()
    print(f"Training set shape: {X_train.shape}")
    print(f"Test set shape: {X_test.shape}")
    return X_train, X_test, y_train, y_test, feature_cols, scaler, class_names

def interactive_class_distribution(y_train, class_names):
    """
    Create an interactive pie chart showing class distribution
    """
    class_counts = pd.Series(y_train).value_counts().sort_index()
    values = class_counts.values
    labels = [class_names[i] for i in class_counts.index]

    # Calculate percentages
    percentages = [f"{v/sum(values)*100:.1f}%" for v in values]
    labels_with_pct = [f"{label} ({pct})" for label, pct in zip(labels, percentages)]

    # Use consistent colors
    colors = [CLASS_COLORS[class_name] for class_name in class_names]
    fig = go.Figure(data=[go.Pie(
        labels=labels_with_pct,
        values=values,
        marker=dict(colors=colors),
        textinfo='label+value',
        hoverinfo='label+percent',
        textfont=dict(size=14),
        hole=0.4,
    )])
    fig.update_layout(
        title={
            'text': 'Exoplanet Classification Distribution',
            'y':0.95,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        annotations=[dict(
            text='KOI Dataset',
            x=0.5, y=0.5,
            font=dict(size=15),
            showarrow=False
        )],
        height=500,
        legend=dict(
            title_text='Classes',
            yanchor="top",
            y=0.99,
            xanchor="left",
            x=0.01
        )
    )
    return fig

def interactive_feature_correlations(X_train, feature_names, threshold=0.7):
    """
    Create an interactive correlation heatmap and scatter plot matrix for features
    """
    # Calculate correlation matrix
    corr_matrix = X_train.corr()

    # Create a mask for the upper triangle
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

    # Convert masked correlation matrix to suitable format for Plotly
    z_vals = corr_matrix.values

    z_vals[mask] = np.nan  # Set upper triangle to NaN

    # Create interactive heatmap
    heatmap_fig = go.Figure(data=go.Heatmap(
        z=z_vals,
        x=corr_matrix.columns,
        y=corr_matrix.columns,
        colorscale='RdBu_r',
        zmid=0,
        colorbar=dict(title='Correlation'),
        hovertemplate='%{y} & %{x}<br>Correlation: %{z:.3f}<extra></extra>'
    ))
    heatmap_fig.update_layout(
        title={
            'text': 'Feature Correlation Matrix',
            'y':0.95,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        height=700,
        width=750
    )

    # Find highly correlated features
    high_corr_features = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                feat1 = corr_matrix.columns[i]
                feat2 = corr_matrix.columns[j]
                corr_value = corr_matrix.iloc[i, j]
                high_corr_features.append((feat1, feat2, corr_value))

    # Print highly correlated features
    if high_corr_features:
        print(f"\nHighly correlated features (|correlation| > {threshold}):")
        for feat1, feat2, corr in high_corr_features:
            print(f"{feat1} and {feat2}: {corr:.3f}")

    # Create interactive scatter plot matrix for the top correlated features
    if high_corr_features:
        # Get unique features from high correlation pairs
        unique_features = set()
        for feat1, feat2, _ in high_corr_features:
            unique_features.add(feat1)
            unique_features.add(feat2)
        # Limit to 6 features to keep plot readable
        plot_features = list(unique_features)[:min(6, len(unique_features))]

        # Create a DataFrame with only these features
        scatter_df = X_train[plot_features].copy()

        # Add color by randomly sampling a subset of points to prevent overplotting
        sample_size = min(1000, len(scatter_df))

        scatter_df = scatter_df.sample(sample_size, random_state=42)

        # Create scatter plot matrix
        scatter_fig = px.scatter_matrix(
            scatter_df,
            dimensions=plot_features,
            opacity=0.7,
            title="Scatter Plot Matrix of Highly Correlated Features",
            height=700,
            width=800,
        )

        # Update layout for better readability
        scatter_fig.update_layout(
            title={
                'text': 'Scatter Plot Matrix of Highly Correlated Features',
                'y':0.98,
                'x':0.5,
                'xanchor': 'center',
                'yanchor': 'top'
            },
            font=dict(size=10)
        )

        # Update traces for better aesthetics
        scatter_fig.update_traces(
            diagonal_visible=False,
            showupperhalf=False,
            marker=dict(color=COLORBLIND_PALETTE['blue']),
        )
    else:
        scatter_fig = None
        print(f"\nNo highly correlated features found (|correlation| > {threshold})")
    return heatmap_fig, scatter_fig, high_corr_features


In [4]:
def visualize_feature_correlations(X_train, feature_names, threshold=0.7):
    """
    Create correlation heatmap of the features and identify highly correlated features
    Parameters:
    -----------
    X_train : DataFrame
        Training data
    feature_names : list
        Names of features
    threshold : float
        Correlation threshold to identify highly correlated features
    Returns:
    --------
    high_corr_features : list
        List of tuples containing highly correlated feature pairs
    """
    # Calculate correlation matrix
    corr_matrix = X_train.corr()

    # Create a mask for the upper triangle
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

    # Create the heatmap
    plt.figure(figsize=(18, 14))
    sns.heatmap(corr_matrix, mask=mask, cmap='coolwarm', center=0,
                square=True, linewidths=0.5, annot=False)
    plt.title('Feature Correlation Matrix', fontsize=16)
    plt.tight_layout()
    plt.show()

    # Find highly correlated features
    high_corr_features = []

    # Create a more detailed correlation matrix visualization for highly correlated features
    high_corr_pairs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                feat1 = corr_matrix.columns[i]
                feat2 = corr_matrix.columns[j]
                corr_value = corr_matrix.iloc[i, j]
                high_corr_features.append((feat1, feat2, corr_value))
                high_corr_pairs.append((feat1, feat2))

    if high_corr_features:
        print(f"\nHighly correlated features (|correlation| > {threshold}):")
        for feat1, feat2, corr in high_corr_features:
            print(f"{feat1} and {feat2}: {corr:.3f}")
        # Create a scatter plot matrix for highly correlated features
        if len(high_corr_features) > 0:
            # Get unique features from high correlation pairs
            unique_features = set()
            for feat1, feat2, _ in high_corr_features:
                unique_features.add(feat1)
                unique_features.add(feat2)
            # Limit to at most 6 features to keep the plot readable
            plot_features = list(unique_features)[:min(6, len(unique_features))]
            # Create scatter plot matrix
            print("\nScatter plot matrix for selected highly correlated features:")
            plt.figure(figsize=(15, 15))
            scatter_matrix = pd.plotting.scatter_matrix(
                X_train[plot_features],
                alpha=0.5,
                figsize=(15, 15),
                diagonal='kde'
            )
            # Rotate axis labels
            for ax in scatter_matrix.flatten():
                ax.xaxis.label.set_rotation(45)
                ax.yaxis.label.set_rotation(0)
                ax.yaxis.label.set_ha('right')
            plt.tight_layout()
            plt.show()
    else:
        print(f"\nCorrelated features found (|correlation| > {threshold})")
    return high_corr_features

Results Discussion:

### Feature Correlations

The scatter plot matrix shows correlations between key features. We observe:
- Strong correlations between different magnitude measurements (koi_imag, koi_gmag, koi_rmag, koi_zmag, koi_jmag)
- These correlations make sense since they represent observations of the same stars at different wavelengths
- The high correlations (>0.7) suggest that feature selection or dimensionality reduction could be beneficial

## 3. Principal Component Analysis (PCA)

PCA transforms our dataset into a new coordinate system to identify the directions of maximum variance.

In [5]:
def perform_pca_analysis(X_train, feature_names, scaler, class_names, y_train, n_components=20):
    """
    Perform PCA analysis and create interactive visualizations, including class separation.
    Parameters:
    -----------
    X_train : DataFrame
        Training data
    feature_names : list
        Names of features
    scaler : StandardScaler
        Fitted scaler
    class_names : list
        Names of classes
    y_train : Series
        Training labels
    n_components : int
        Number of PCA components to analyze
    Returns:
    --------
    optimal_components : int
        Optimal number of components based on 95% variance
    pca : PCA
        Fitted PCA object
    """
    # Scale the data
    X_scaled = scaler.fit_transform(X_train)

    # Apply PCA
    pca = PCA(n_components=min(n_components, len(feature_names)))
    pca.fit(X_scaled)

    # Get explained variance ratio
    explained_variance = pca.explained_variance_ratio_
    cumulative_variance = np.cumsum(explained_variance)

    # Find optimal number of components for 95% variance
    optimal_components = np.argmax(cumulative_variance >= 0.95) + 1

    # Create interactive bar chart for explained variance
    variance_fig = go.Figure()

    # Individual explained variance
    variance_fig.add_trace(go.Bar(
        x=list(range(1, len(explained_variance) + 1)),
        y=explained_variance,
        name='Individual',
        marker_color=COLORBLIND_PALETTE['blue'],
        hovertemplate='PC%{x}<br>Variance Explained: %{y:.4f}<extra></extra>'
    ))
    # Cumulative explained variance
    variance_fig.add_trace(go.Scatter(
        x=list(range(1, len(cumulative_variance) + 1)),
        y=cumulative_variance,
        mode='lines+markers',
        name='Cumulative',
        marker_color=COLORBLIND_PALETTE['orange'],
        hovertemplate='PC%{x}<br>Cumulative Variance: %{y:.4f}<extra></extra>'
    ))
    # Add threshold line
    variance_fig.add_shape(
        type="line",
        x0=0.5,
        y0=0.95,
        x1=len(explained_variance) + 0.5,
        y1=0.95,
        line=dict(
            color=COLORBLIND_PALETTE['red'],
            width=2,
            dash="dash",
        ),
        name="95% Threshold"
    )
    # Add optimal components line
    variance_fig.add_shape(
        type="line",
        x0=optimal_components,
        y0=0,
        x1=optimal_components,
        y1=1,
        line=dict(
            color=COLORBLIND_PALETTE['green'],
            width=2,
            dash="dash",
        ),
        name=f"Optimal Components: {optimal_components}"
    )
    variance_fig.update_layout(
        title={
            'text': 'Explained Variance by Principal Components',
            'y':0.95,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        xaxis_title="Principal Component",
        yaxis_title="Variance Explained",
        legend=dict(
            orientation="h",
            yanchor="bottom",
            y=1.02,
            xanchor="right",
            x=1
        ),
        annotations=[
            dict(
                x=optimal_components,
                y=0.5,
                xref="x",
                yref="y",
                text=f"Optimal: PC{optimal_components}",
                showarrow=True,
                arrowhead=1,
                ax=-40,
                ay=0
            ),
            dict(
                x=len(explained_variance) / 2,
                y=0.95,
                xref="x",
                yref="y",
                text="95% Threshold",
                showarrow=True,
                arrowhead=1,
                ax=0,
                ay=-40
            )
        ],
        height=600
    )
    # Transform data to visualize first few principal components
    pca_result = pca.transform(X_scaled)

    # Create dataframe with PCA results and class labels
    pca_df = pd.DataFrame(data=pca_result[:, 0:3], columns=['PC1', 'PC2', 'PC3'])
    pca_df['Class'] = y_train.values
    pca_df['Class_Name'] = pca_df['Class'].map({i: name for i, name in enumerate(class_names)})

    # Create interactive 2D scatter plot
    scatter_2d_fig = px.scatter(
        pca_df, x='PC1', y='PC2',
        color='Class_Name',
        color_discrete_map={class_name: CLASS_COLORS[class_name] for class_name in class_names},
        hover_name='Class_Name',
        hover_data={'PC1': ':.3f', 'PC2': ':.3f', 'Class': False, 'Class_Name': False},
        labels={'PC1': f'PC1 ({explained_variance[0]:.2%} variance)',
                'PC2': f'PC2 ({explained_variance[1]:.2%} variance)'},
        title='PCA: First Two Principal Components by Class',
        opacity=0.7,
        height=600
    )
    scatter_2d_fig.update_layout(
        title={
            'text': 'PCA: First Two Principal Components by Class',
            'y':0.95,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        legend=dict(
            title="Exoplanet Class",
            orientation="h",
            yanchor="bottom",
            y=1.02,
            xanchor="right",
            x=1
        ),
        xaxis=dict(
            title=f"PC1 ({explained_variance[0]:.2%} variance explained)"
        ),
        yaxis=dict(
            title=f"PC2 ({explained_variance[1]:.2%} variance explained)"
        ),
        # Add annotations for class centroids
        annotations=[]
    )
    # Add class centroid annotations
    for cls_name in class_names:
        cls_data = pca_df[pca_df['Class_Name'] == cls_name]
        centroid_x = cls_data['PC1'].mean()
        centroid_y = cls_data['PC2'].mean()
        scatter_2d_fig.add_annotation(
            x=centroid_x,
            y=centroid_y,
            text=f"{cls_name} centroid",
            showarrow=True,
            arrowhead=2,
            arrowsize=1,
            arrowcolor=CLASS_COLORS[cls_name],
            ax=-30,
            ay=-30,
            font=dict(
                size=12,
                color=CLASS_COLORS[cls_name]
            )
        )
    # Create interactive 3D scatter plot
    scatter_3d_fig = px.scatter_3d(
        pca_df, x='PC1', y='PC2', z='PC3',
        color='Class_Name',
        color_discrete_map={class_name: CLASS_COLORS[class_name] for class_name in class_names},
        hover_name='Class_Name',
        hover_data={'PC1': ':.3f', 'PC2': ':.3f', 'PC3': ':.3f', 'Class': False, 'Class_Name': False},
        labels={'PC1': f'PC1 ({explained_variance[0]:.2%})',
                'PC2': f'PC2 ({explained_variance[1]:.2%})',
                'PC3': f'PC3 ({explained_variance[2]:.2%})'},
        title='PCA: First Three Principal Components by Class',
        opacity=0.7,
        height=700
    )
    scatter_3d_fig.update_traces(marker=dict(size=3,
                                      sizemode='diameter'))
    scatter_3d_fig.update_layout(
        title={
            'text': 'PCA: First Three Principal Components by Class',
            'y':0.95,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        scene=dict(
            xaxis_title=f"PC1 ({explained_variance[0]:.2%})",
            yaxis_title=f"PC2 ({explained_variance[1]:.2%})",
            zaxis_title=f"PC3 ({explained_variance[2]:.2%})"
        ),
        legend=dict(
            title="Exoplanet Class"
        )
    )
    # Calculate feature importance through PCA components
    components = pca.components_

    # Get the absolute loadings of features in each component
    feature_importance = np.abs(components).sum(axis=0)

    # Normalize to get relative importance
    feature_importance = feature_importance / feature_importance.sum()

    # Create a DataFrame with feature names and their importance
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': feature_importance
    }).sort_values('Importance', ascending=False)

    # Create interactive bar chart for feature importance
    top_n = 15  # show top 15 features
    top_features = importance_df.head(top_n)
    importance_fig = px.bar(
        top_features,
        y='Feature',
        x='Importance',
        orientation='h',
        color='Importance',
        color_continuous_scale=px.colors.sequential.Viridis,
        title=f'Top {top_n} Features by PCA Importance',
        height=600
    )
    importance_fig.update_layout(
        title={
            'text': f'Top {top_n} Features by PCA Importance',
            'y':0.95,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        yaxis=dict(
            title="",
            autorange="reversed"  # display in descending order of importance
        ),
        xaxis=dict(
            title="Importance Score"
        ),
        coloraxis_showscale=False
    )
    # Add value annotations
    for i, feature in enumerate(top_features['Feature']):
        importance_value = top_features.loc[top_features['Feature'] == feature, 'Importance'].values[0]
        importance_fig.add_annotation(
            x=importance_value + 0.01,  # Add a small offset
            y=feature,
            text=f"{importance_value:.3f}",
            showarrow=False,
            font=dict(
                size=10
            )
        )
    print(f"Number of components needed to retain 95% variance: {optimal_components}")

    # Print top features for each of the first 5 components
    print("\nTop features in each principal component:")

    for i in range(min(5, optimal_components)):
        # Get the loadings for this component
        loadings = components[i]
        # Sort by absolute value
        sorted_indices = np.argsort(np.abs(loadings))[::-1]
        print(f"\nPC{i+1} (explains {explained_variance[i]:.2%} of variance):")
        for j in range(5):  # Show top 5 features
            idx = sorted_indices[j]
            sign = "+" if loadings[idx] >= 0 else "-"
            print(f"  {sign} {feature_names[idx]} ({loadings[idx]:.3f})")
    return optimal_components, pca, variance_fig, scatter_2d_fig, scatter_3d_fig, importance_fig

Results Discussion

### 3.1 Explained Variance

The PCA explained variance plot shows:
- The first component explains approximately 24% of variance
- We need 19 components to retain 95% of the original information
- The variance is distributed across many components, suggesting complex underlying patterns

### 3.2 Class Separation in PCA Space

The scatter plot of the first two principal components shows:
- Substantial overlap between classes in the central region
- Some FALSE POSITIVE examples have more extreme values, particularly on the negative side of PC1
- PC1 and PC2 together explain about 37% of the variance, which provides some separation but is insufficient for perfect classification
- This visualization confirms why machine learning algorithms are necessary for effective classification

## 4. k-Nearest Neighbors Classification

We'll now build kNN models with and without PCA to classify exoplanets.

### 4.1 kNN without PCA (Baseline)

First, we'll train a kNN classifier using all features without dimensionality reduction.

### 4.2 kNN with PCA

Next, we'll train a kNN classifier using the principal components that explain 95% of the variance.

In [6]:
# Define helper function
def get_memory_usage():
    """Get current memory usage of the process in MB"""
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024  # Convert to MB

def run_knn_experiment(X_train, X_test, y_train, y_test, feature_names, scaler, class_names, with_pca=True, n_components=10):
    """
    Run KNN experiment with or without PCA for multi-class classification
    Parameters:
    -----------
    X_train, X_test, y_train, y_test : DataFrame/Series
        Training and test data
    feature_names : list
        List of feature names
    scaler : StandardScaler
        Fitted scaler
    class_names : list
        Names of classes
    with_pca : bool
        Whether to use PCA for dimensionality reduction
    n_components : int
        Number of PCA components to use if with_pca is True
    Returns:
    --------
    results : dict
        Dictionary with experiment results
    """
    experiment_name = "KNN with PCA" if with_pca else "KNN without PCA"
    print(f"\n{'='*50}")
    print(f"Running {experiment_name} - Multi-class Classification")
    print(f"{'='*50}")

    # Log initial memory usage
    initial_memory = get_memory_usage()
    print(f"Initial memory usage: {initial_memory:.2f} MB")

    # Create pipeline
    if with_pca:
        pipeline = Pipeline([
            ('scaler', scaler),
            ('pca', PCA(n_components=n_components)),
            ('knn', KNeighborsClassifier())
        ])
        print(f"Using PCA with {n_components} components")
    else:
        pipeline = Pipeline([
            ('scaler', scaler),
            ('knn', KNeighborsClassifier())
        ])
        print("Using all features without PCA")

    # Define parameter grid for grid search
    param_grid = {
        'knn__n_neighbors': [3, 5, 7, 9, 11, 15],
        'knn__weights': ['uniform', 'distance'],
        'knn__metric': ['euclidean', 'manhattan', 'minkowski']
    }

    # Create grid search
    grid_search = GridSearchCV(
        pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1
    )

    # Measure training time
    print("\nTraining model...")
    start_time = time.time()
    grid_search.fit(X_train, y_train)
    training_time = time.time() - start_time

    # Log memory usage after training
    final_memory = get_memory_usage()
    memory_used = final_memory - initial_memory

    # Get best model
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_
    print(f"Training completed in {training_time:.2f} seconds")
    print(f"Memory usage: {memory_used:.2f} MB")
    print(f"Best parameters: {best_params}")

    # Make predictions
    y_pred = best_model.predict(X_test)

    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='weighted')
    print("\nModel performance:")
    print(f"  Accuracy:  {accuracy:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall:    {recall:.4f}")
    print(f"  F1 Score:  {f1:.4f}")

    # Print detailed classification report
    print("\nDetailed Classification Report:")
    print(classification_report(y_test, y_pred, target_names=class_names))

    # Create confusion matrix
    cm = confusion_matrix(y_test, y_pred)

    # Create interactive confusion matrix
    cm_fig = px.imshow(
        cm,
        x=class_names,
        y=class_names,
        text_auto=True,
        color_continuous_scale='Blues',
        labels=dict(x="Predicted", y="True", color="Count"),
        title=f"Confusion Matrix - {experiment_name}"
    )
    cm_fig.update_layout(
        title={
            'text': f"Confusion Matrix - {experiment_name}",
            'y':0.95,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        height=600,
        width=700,
        xaxis=dict(
            title="Predicted Class",
            tickangle=-45
        ),
        yaxis=dict(
            title="True Class"
        )
    )
    # Calculate class-specific metrics
    class_precision, class_recall, class_f1, support = precision_recall_fscore_support(
        y_test, y_pred, labels=range(len(class_names))
    )

    # Create interactive grouped bar chart
    class_metrics_fig = go.Figure()

    # Add trace for Precision
    class_metrics_fig.add_trace(go.Bar(
        x=class_names,
        y=class_precision,
        name='Precision',
        marker_color=COLORBLIND_PALETTE['blue'],
        text=[f"{val:.2f}" for val in class_precision],
        textposition='auto',
        hovertemplate='Metric=Precision<br>Class=%{x}<br>Score=%{y:.4f}<br>Support=%{customdata}<extra></extra>',
        customdata=support
    ))

    # Add trace for Recall
    class_metrics_fig.add_trace(go.Bar(
        x=class_names,
        y=class_recall,
        name='Recall',
        marker_color=COLORBLIND_PALETTE['orange'],
        text=[f"{val:.2f}" for val in class_recall],
        textposition='auto',
        hovertemplate='Metric=Recall<br>Class=%{x}<br>Score=%{y:.4f}<br>Support=%{customdata}<extra></extra>',
        customdata=support
    ))

    # Add trace for F1 Score
    class_metrics_fig.add_trace(go.Bar(
        x=class_names,
        y=class_f1,
        name='F1 Score',
        marker_color=COLORBLIND_PALETTE['green'],
        text=[f"{val:.2f}" for val in class_f1],
        textposition='auto',
        hovertemplate='Metric=F1 Score<br>Class=%{x}<br>Score=%{y:.4f}<br>Support=%{customdata}<extra></extra>',
        customdata=support
    ))

    # Update layout
    class_metrics_fig.update_layout(
        title={
            'text': f"Class-specific Performance Metrics - {experiment_name}",
            'y': 0.95,
            'x': 0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        xaxis_title="Exoplanet Class",
        yaxis_title="Score",
        yaxis=dict(range=[0, 1.0]),
        legend=dict(
            title="Metric",
            orientation="h",
            yanchor="bottom",
            y=1.02,
            xanchor="right",
            x=1
        ),
        barmode='group',
        height=500
    )

    # Store results
    results = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'training_time': training_time,
        'memory_usage': memory_used,
        'best_params': best_params,
        'class_precision': class_precision,
        'class_recall': class_recall,
        'class_f1': class_f1,
        'support': support,  # Add support to the results dictionary
        'confusion_matrix': cm,
        'cm_fig': cm_fig,
        'class_metrics_fig': class_metrics_fig
    }
    return results

## 5. Results Comparison

Compare the performance of kNN models with and without PCA.

In [7]:
def compare_results(results_without_pca, results_with_pca, class_names):
    """
    Compare and visualize results with and without PCA for multi-class classification
    Parameters:
    -----------
    results_without_pca : dict
        Results dictionary for KNN without PCA
    results_with_pca : dict
        Results dictionary for KNN with PCA
    class_names : list
        Names of classes
    """
    print("\n\n" + "="*80)
    print("COMPARISON OF RESULTS: KNN WITHOUT PCA vs. KNN WITH PCA")
    print("="*80)
    # Compare metrics
    metrics = ['accuracy', 'precision', 'recall', 'f1']
    print("\nOverall Performance Metrics:")
    print(f"{'Metric':<12} {'Without PCA':<15} {'With PCA':<15} {'Difference':<12} {'% Change':<10}")
    print("-"*65)
    for metric in metrics:
        without_pca = results_without_pca[metric]
        with_pca = results_with_pca[metric]
        diff = with_pca - without_pca
        pct_change = (diff / without_pca) * 100 if without_pca != 0 else float('inf')
        print(f"{metric.capitalize():<12} {without_pca:.4f} {' '*8} {with_pca:.4f} {' '*8} {diff:.4f} {' '*6} {pct_change:+.2f}%")

    # Create interactive bar chart for metrics comparison using GraphObjects
    perf_comparison_fig = go.Figure()

    # Define values for both approaches
    without_pca_values = [results_without_pca[m] for m in metrics]
    with_pca_values = [results_with_pca[m] for m in metrics]

    # Add trace for Without PCA
    perf_comparison_fig.add_trace(go.Bar(
        x=['Accuracy', 'Precision', 'Recall', 'F1 Score'],
        y=without_pca_values,
        name='Without PCA',
        marker_color=PCA_COLORS['Without PCA'],
        text=[f"{val:.3f}" for val in without_pca_values],
        textposition='auto',
        hovertemplate='Metric=%{x}<br>Score=%{y:.4f}<br>Approach=Without PCA<extra></extra>'
    ))

    # Add trace for With PCA
    perf_comparison_fig.add_trace(go.Bar(
        x=['Accuracy', 'Precision', 'Recall', 'F1 Score'],
        y=with_pca_values,
        name='With PCA',
        marker_color=PCA_COLORS['With PCA'],
        text=[f"{val:.3f}" for val in with_pca_values],
        textposition='auto',
        hovertemplate='Metric=%{x}<br>Score=%{y:.4f}<br>Approach=With PCA<extra></extra>'
    ))

    # Update layout
    perf_comparison_fig.update_layout(
        title={
            'text': 'Performance Metrics: With vs. Without PCA',
            'y': 0.95,
            'x': 0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        xaxis_title="Metric",
        yaxis_title="Score",
        yaxis=dict(range=[0, 1.0]),
        legend=dict(
            title="",
            orientation="h",
            yanchor="bottom",
            y=1.02,
            xanchor="right",
            x=1
        ),
        barmode='group'
    )

    # Create interactive bar chart for class-specific F1 scores using GraphObjects
    f1_comparison_fig = go.Figure()

    # Get support values from results (assuming both have the same support values)
    support_values = results_without_pca.get('support', np.ones(len(class_names)))

    # Add trace for Without PCA
    f1_comparison_fig.add_trace(go.Bar(
        x=class_names,
        y=results_without_pca['class_f1'],
        name='Without PCA',
        marker_color=PCA_COLORS['Without PCA'],
        text=[f"{val:.3f}" for val in results_without_pca['class_f1']],
        textposition='auto',
        hovertemplate='Class=%{x}<br>F1 Score=%{y:.4f}<br>Approach=Without PCA<br>Support=%{customdata}<extra></extra>',
        customdata=support_values
    ))

    # Add trace for With PCA
    f1_comparison_fig.add_trace(go.Bar(
        x=class_names,
        y=results_with_pca['class_f1'],
        name='With PCA',
        marker_color=PCA_COLORS['With PCA'],
        text=[f"{val:.3f}" for val in results_with_pca['class_f1']],
        textposition='auto',
        hovertemplate='Class=%{x}<br>F1 Score=%{y:.4f}<br>Approach=With PCA<br>Support=%{customdata}<extra></extra>',
        customdata=support_values
    ))

    # Update layout
    f1_comparison_fig.update_layout(
        title={
            'text': 'F1 Score by Class: With vs. Without PCA',
            'y': 0.95,
            'x': 0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        xaxis_title="Exoplanet Class",
        yaxis_title="F1 Score",
        yaxis=dict(range=[0, 1.0]),
        legend=dict(
            title="",
            orientation="h",
            yanchor="bottom",
            y=1.02,
            xanchor="right",
            x=1
        ),
        barmode='group',
        height=500
    )

    # Create interactive bar chart for computational metrics
    comp_df = pd.DataFrame({
        'Metric': ['Training Time (s)', 'Memory Usage (MB)'],
        'Without PCA': [results_without_pca['training_time'], results_without_pca['memory_usage']],
        'With PCA': [results_with_pca['training_time'], results_with_pca['memory_usage']]
    })
    # Melt the DataFrame for easier plotting
    melted_comp_df = pd.melt(comp_df, id_vars=['Metric'], var_name='PCA', value_name='Value')
    # Create interactive subplots for computational metrics
    comp_fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=("Training Time Comparison", "Memory Usage Comparison"),
        specs=[[{"type": "bar"}, {"type": "bar"}]]
    )
    # Add traces for training time
    time_df = melted_comp_df[melted_comp_df['Metric'] == 'Training Time (s)']
    colors = [PCA_COLORS['Without PCA'] if pca == 'Without PCA' else PCA_COLORS['With PCA'] for pca in time_df['PCA']]
    comp_fig.add_trace(
        go.Bar(
            x=time_df['PCA'],
            y=time_df['Value'],
            marker_color=colors,
            text=time_df['Value'].apply(lambda x: f"{x:.2f}s"),
            textposition='auto',
            hovertemplate='%{x}: %{y:.2f} seconds<extra></extra>',
            showlegend=False
        ),
        row=1, col=1
    )
    # Add traces for memory usage
    mem_df = melted_comp_df[melted_comp_df['Metric'] == 'Memory Usage (MB)']
    colors = [PCA_COLORS['Without PCA'] if pca == 'Without PCA' else PCA_COLORS['With PCA'] for pca in mem_df['PCA']]
    comp_fig.add_trace(
        go.Bar(
            x=mem_df['PCA'],
            y=mem_df['Value'],
            marker_color=colors,
            text=mem_df['Value'].apply(lambda x: f"{x:.2f}MB"),
            textposition='auto',
            hovertemplate='%{x}: %{y:.2f} MB<extra></extra>',
            showlegend=False
        ),
        row=1, col=2
    )
    comp_fig.update_layout(
        title={
            'text': 'Computational Efficiency Comparison',
            'y':0.95,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        height=500,
        width=900,
        yaxis1=dict(title="Seconds"),
        yaxis2=dict(title="Megabytes (MB)")
    )
    # Compare confusion matrices
    cm_comparison_fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=("Confusion Matrix - Without PCA", "Confusion Matrix - With PCA"),
        specs=[[{"type": "heatmap"}, {"type": "heatmap"}]]
    )
    # Without PCA confusion matrix
    cm_comparison_fig.add_trace(
        go.Heatmap(
            z=results_without_pca['confusion_matrix'],
            x=class_names,
            y=class_names,
            colorscale='Blues',
            text=results_without_pca['confusion_matrix'],
            texttemplate="%{text}",
            hovertemplate='True: %{y}<br>Predicted: %{x}<br>Count: %{z}<extra></extra>'
        ),
        row=1, col=1
    )
    # With PCA confusion matrix
    cm_comparison_fig.add_trace(
        go.Heatmap(
            z=results_with_pca['confusion_matrix'],
            x=class_names,
            y=class_names,
            colorscale='Blues',
            text=results_with_pca['confusion_matrix'],
            texttemplate="%{text}",
            hovertemplate='True: %{y}<br>Predicted: %{x}<br>Count: %{z}<extra></extra>'
        ),
        row=1, col=2
    )
    cm_comparison_fig.update_layout(
        title={
            'text': 'Confusion Matrix Comparison',
            'y':0.95,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        height=600,
        width=2000
    )
    # Update x and y axis labels
    cm_comparison_fig.update_xaxes(title_text="Predicted", tickangle=-45, row=1, col=1)
    cm_comparison_fig.update_xaxes(title_text="Predicted", tickangle=-45, row=1, col=2)
    cm_comparison_fig.update_yaxes(title_text="True", row=1, col=1)
    cm_comparison_fig.update_yaxes(title_text="True", row=1, col=2)

    # Print conclusion
    print("\nCONCLUSION:")

    # Compare accuracy
    if results_with_pca['accuracy'] > results_without_pca['accuracy']:
        acc_msg = f"PCA IMPROVED accuracy by {(results_with_pca['accuracy'] - results_without_pca['accuracy']) * 100:.2f}%"
    elif results_with_pca['accuracy'] < results_without_pca['accuracy']:
        acc_msg = f"PCA REDUCED accuracy by {(results_without_pca['accuracy'] - results_with_pca['accuracy']) * 100:.2f}%"
    else:
        acc_msg = "PCA had NO EFFECT on accuracy"

    # Compare training time
    time_without = results_without_pca['training_time']
    time_with = results_with_pca['training_time']
    time_diff = time_with - time_without
    time_pct = (time_diff / time_without) * 100
    if time_with < time_without:
        time_msg = f"PCA REDUCED training time by {-time_pct:.2f}%"
    elif time_with > time_without:
        time_msg = f"PCA INCREASED training time by {time_pct:.2f}%"
    else:
        time_msg = "PCA had NO EFFECT on training time"

    # Compare memory usage
    mem_without = results_without_pca['memory_usage']
    mem_with = results_with_pca['memory_usage']
    mem_diff = mem_with - mem_without
    mem_pct = (mem_diff / mem_without) * 100
    if mem_with < mem_without:
        mem_msg = f"PCA REDUCED memory usage by {-mem_pct:.2f}%"
    elif mem_with > mem_without:
        mem_msg = f"PCA INCREASED memory usage by {mem_pct:.2f}%"
    else:
        mem_msg = "PCA had NO EFFECT on memory usage"
    print(f"• {acc_msg}")
    print(f"• {time_msg}")
    print(f"• {mem_msg}")
    overall_msg = "Based on these results, "
    if results_with_pca['accuracy'] >= results_without_pca['accuracy'] and (time_with < time_without or mem_with < mem_without):
        overall_msg += "PCA is RECOMMENDED for this dataset as it maintained or improved accuracy while reducing computational resources."
    elif results_with_pca['accuracy'] < results_without_pca['accuracy'] and (time_with < time_without or mem_with < mem_without):
        overall_msg += "there is a TRADE-OFF: PCA reduces computational resources but at the cost of some accuracy."
    elif results_with_pca['accuracy'] >= results_without_pca['accuracy'] and time_with >= time_without and mem_with >= mem_without:
        overall_msg += "PCA improved accuracy but did not reduce computational resources, so its use depends on whether accuracy or efficiency is more important."
    else:
        overall_msg += "PCA is NOT RECOMMENDED for this dataset as it neither improved accuracy nor reduced computational resources."
    print(f"• {overall_msg}")

    # Return comparison figures
    return perf_comparison_fig, f1_comparison_fig, comp_fig, cm_comparison_fig

In [8]:
def main():
    """
    Main function that runs the exoplanet classification analysis with interactive visualizations
    """
    # Load and preprocess data
    X_train, X_test, y_train, y_test, feature_names, scaler, class_names = load_and_preprocess_data('koi_data.csv')

    # Create class distribution visualization
    class_dist_fig = interactive_class_distribution(y_train, class_names)
    class_dist_fig.show()

    # Visualize feature correlations
    print("\n" + "="*50)
    print("FEATURE CORRELATION ANALYSIS")
    print("="*50)
    heatmap_fig, scatter_fig, high_corr = interactive_feature_correlations(X_train, feature_names)
    heatmap_fig.show()
    if scatter_fig:
        scatter_fig.show()

    # Perform PCA analysis
    print("\n" + "="*50)
    print("PCA ANALYSIS")
    print("="*50)
    optimal_components, pca, variance_fig, scatter_2d_fig, scatter_3d_fig, importance_fig = perform_pca_analysis(
        X_train, feature_names, scaler, class_names, y_train, n_components=20
    )
    # Show PCA visualizations
    variance_fig.show()
    scatter_2d_fig.show()
    scatter_3d_fig.show()
    importance_fig.show()

    # Run KNN without PCA
    results_without_pca = run_knn_experiment(
        X_train, X_test, y_train, y_test, feature_names, scaler, class_names, with_pca=False
    )
    # Show confusion matrix and class metrics for KNN without PCA
    results_without_pca['cm_fig'].show()
    results_without_pca['class_metrics_fig'].show()

    # Run KNN with PCA
    results_with_pca = run_knn_experiment(
        X_train, X_test, y_train, y_test, feature_names, scaler, class_names,
        with_pca=True, n_components=optimal_components
    )
    # Show confusion matrix and class metrics for KNN with PCA
    results_with_pca['cm_fig'].show()
    results_with_pca['class_metrics_fig'].show()

    # Compare results
    perf_comparison_fig, f1_comparison_fig, comp_fig, cm_comparison_fig = compare_results(
        results_without_pca, results_with_pca, class_names
    )
    # Show comparison visualizations
    perf_comparison_fig.show()
    f1_comparison_fig.show()
    comp_fig.show()
    cm_comparison_fig.show()

    # Return important figures for further analysis
    return {
        'class_distribution': class_dist_fig,
        'correlation_heatmap': heatmap_fig,
        'correlation_scatter': scatter_fig,
        'pca_variance': variance_fig,
        'pca_2d': scatter_2d_fig,
        'pca_3d': scatter_3d_fig,
        'feature_importance': importance_fig,
        'confusion_matrix_no_pca': results_without_pca['cm_fig'],
        'class_metrics_no_pca': results_without_pca['class_metrics_fig'],
        'confusion_matrix_with_pca': results_with_pca['cm_fig'],
        'class_metrics_with_pca': results_with_pca['class_metrics_fig'],
        'performance_comparison': perf_comparison_fig,
        'f1_comparison': f1_comparison_fig,
        'computational_comparison': comp_fig,
        'confusion_matrix_comparison': cm_comparison_fig
    }

if __name__ == "__main__":
    figures = main()

Dataset shape: (7785, 39)
Columns: ['koi_disposition', 'koi_fpflag_nt', 'koi_fpflag_ss', 'koi_fpflag_co', 'koi_fpflag_ec', 'koi_period', 'koi_time0bk', 'koi_time0', 'koi_impact', 'koi_duration', 'koi_depth', 'koi_ror', 'koi_srho', 'koi_prad', 'koi_sma', 'koi_incl', 'koi_teq', 'koi_insol', 'koi_dor', 'koi_max_sngle_ev', 'koi_max_mult_ev', 'koi_model_snr', 'koi_count', 'koi_num_transits', 'koi_tce_plnt_num', 'koi_steff', 'koi_slogg', 'koi_srad', 'koi_smass', 'ra', 'dec', 'koi_kepmag', 'koi_gmag', 'koi_rmag', 'koi_imag', 'koi_zmag', 'koi_jmag', 'koi_hmag', 'koi_kmag']
Total missing values: 0

Class distribution:
FALSE POSITIVE: 3744 (48.09%)
CONFIRMED: 2616 (33.60%)
CANDIDATE: 1425 (18.30%)
Training set shape: (6228, 38)
Test set shape: (1557, 38)



FEATURE CORRELATION ANALYSIS

No highly correlated features found (|correlation| > 0.7)



PCA ANALYSIS
Number of components needed to retain 95% variance: 20

Top features in each principal component:

PC1 (explains 23.19% of variance):
  + koi_imag (0.329)
  + koi_kepmag (0.329)
  + koi_zmag (0.328)
  + koi_rmag (0.328)
  + koi_jmag (0.324)

PC2 (explains 12.23% of variance):
  + koi_sma (0.413)
  + koi_period (0.396)
  + koi_time0bk (0.355)
  + koi_time0 (0.355)
  + koi_dor (0.340)

PC3 (explains 8.98% of variance):
  + koi_max_mult_ev (0.438)
  + koi_max_sngle_ev (0.430)
  + koi_depth (0.425)
  + koi_model_snr (0.411)
  + koi_fpflag_ss (0.321)

PC4 (explains 7.57% of variance):
  + koi_impact (0.450)
  + koi_ror (0.430)
  + koi_prad (0.413)
  - koi_incl (-0.207)
  + koi_fpflag_co (0.188)

PC5 (explains 6.58% of variance):
  + koi_ror (0.354)
  + koi_impact (0.338)
  + koi_prad (0.325)
  + koi_count (0.281)
  + koi_tce_plnt_num (0.243)



Running KNN without PCA - Multi-class Classification
Initial memory usage: 294.90 MB
Using all features without PCA

Training model...
Fitting 5 folds for each of 36 candidates, totalling 180 fits
Training completed in 50.23 seconds
Memory usage: 4.75 MB
Best parameters: {'knn__metric': 'euclidean', 'knn__n_neighbors': 15, 'knn__weights': 'uniform'}

Model performance:
  Accuracy:  0.8009
  Precision: 0.7915
  Recall:    0.8009
  F1 Score:  0.7895

Detailed Classification Report:
                precision    recall  f1-score   support

FALSE POSITIVE       0.97      0.94      0.96       749
     CANDIDATE       0.49      0.31      0.38       285
     CONFIRMED       0.70      0.87      0.78       523

      accuracy                           0.80      1557
     macro avg       0.72      0.71      0.70      1557
  weighted avg       0.79      0.80      0.79      1557




Running KNN with PCA - Multi-class Classification
Initial memory usage: 300.62 MB
Using PCA with 20 components

Training model...
Fitting 5 folds for each of 36 candidates, totalling 180 fits
Training completed in 17.73 seconds
Memory usage: 0.18 MB
Best parameters: {'knn__metric': 'manhattan', 'knn__n_neighbors': 11, 'knn__weights': 'uniform'}

Model performance:
  Accuracy:  0.7919
  Precision: 0.7793
  Recall:    0.7919
  F1 Score:  0.7783

Detailed Classification Report:
                precision    recall  f1-score   support

FALSE POSITIVE       0.97      0.94      0.95       749
     CANDIDATE       0.46      0.28      0.35       285
     CONFIRMED       0.69      0.86      0.76       523

      accuracy                           0.79      1557
     macro avg       0.70      0.69      0.69      1557
  weighted avg       0.78      0.79      0.78      1557





COMPARISON OF RESULTS: KNN WITHOUT PCA vs. KNN WITH PCA

Overall Performance Metrics:
Metric       Without PCA     With PCA        Difference   % Change  
-----------------------------------------------------------------
Accuracy     0.8009          0.7919          -0.0090        -1.12%
Precision    0.7915          0.7793          -0.0122        -1.54%
Recall       0.8009          0.7919          -0.0090        -1.12%
F1           0.7895          0.7783          -0.0112        -1.42%

CONCLUSION:
• PCA REDUCED accuracy by 0.90%
• PCA REDUCED training time by 64.69%
• PCA REDUCED memory usage by 96.22%
• Based on these results, there is a TRADE-OFF: PCA reduces computational resources but at the cost of some accuracy.
