# Calculate Baseline Distance Metrics for Reliability Analysis

This notebook computes a baseline nearest-neighbor distance for each class: the average distance between each correctly classified sample and its nearest correctly classified neighbor of the same class, scaled up by that class's misclassification ratio. The Streamlit app uses these baselines to score the reliability of personality predictions.

## Methodology

1. **Load Model and Sample Data**: Load the trained XGBoost model and sample 1000 data points
2. **Identify Classifications**: Separate correctly classified and misclassified samples by class
3. **Nearest Neighbor Analysis**: For each correctly classified sample, find the distance to its nearest correctly classified neighbor of the same class
4. **Weighted Average**: Multiply each class's average distance by `1 + misclassified / correctly_classified`, then average the class baselines weighted by their correctly classified counts
5. **Save Results**: Export the baseline metrics for use in the Streamlit app
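
The steps above can be condensed into a minimal sketch on toy data. The function name `class_baselines` and the toy arrays are illustrative assumptions, not part of the notebook; the scaling follows the formula used later in Section 5.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def class_baselines(X, y_true, y_pred):
    """Per-class baseline: mean same-class nearest-neighbor distance among
    correctly classified samples, scaled by (1 + misclassified / correct).
    Hypothetical helper for illustration only."""
    baselines = {}
    for cls in np.unique(y_true):
        in_class = y_true == cls
        correct = in_class & (y_pred == cls)
        n_correct = int(correct.sum())
        n_wrong = int(in_class.sum()) - n_correct
        if n_correct < 2:
            # Not enough points to have a neighbor other than the point itself
            baselines[str(cls)] = 0.0
            continue
        nn = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(X[correct])
        distances, _ = nn.kneighbors(X[correct])
        # Column 0 is the point itself (distance 0); column 1 is its nearest neighbor
        avg_nn = distances[:, 1].mean()
        baselines[str(cls)] = float(avg_nn * (1 + n_wrong / n_correct))
    return baselines

# Toy data: class "A" is always predicted correctly, class "B" has one error
X = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 3.0],
              [5.0, 5.0], [5.0, 7.0], [9.0, 9.0]])
y_true = np.array(["A", "A", "A", "B", "B", "B"])
y_pred = np.array(["A", "A", "A", "B", "B", "A"])

baselines = class_baselines(X, y_true, y_pred)
print(baselines)
```

In this toy case class "B" has two correct samples two units apart and one misclassified sample, so its average distance of 2 is scaled by 1 + 1/2.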

## 1. Import Required Libraries

Import all necessary libraries for data processing, model loading, and distance calculations.

In [1]:
import pandas as pd
import numpy as np
import pickle
import json
from sklearn.neighbors import NearestNeighbors
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")
print("Numpy version:", np.__version__)
print("Pandas version:", pd.__version__)

Libraries imported successfully!
Numpy version: 1.26.4
Pandas version: 2.2.3


## 2. Load Model and Data

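Load the trained XGBoost model and the training data. Section 3 draws a random sample of up to 1,000 rows from this data to calculate the baseline distances. If `train.csv` is missing, the cell falls back to synthetic demonstration data, which should not be used for real baselines.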

In [2]:
# Load the trained model and its components
try:
    with open('xgboost_personality_model.pkl', 'rb') as f:
        components = pickle.load(f)
    
    model = components['model']
    target_encoder = components['target_encoder']
    label_encoders = components['label_encoders']
    feature_columns = components['feature_columns']
    categorical_columns = components['categorical_columns']
    
    print("✅ Model loaded successfully!")
    print(f"Feature columns: {feature_columns}")
    print(f"Categorical columns: {categorical_columns}")
    print(f"Classes: {target_encoder.classes_}")
except FileNotFoundError:
    print("❌ Model file not found! Please ensure the model has been trained and saved.")
    raise

# Load training data - we'll try to find a training dataset
try:
    # Try to load original training data if available
    train_data = pd.read_csv('train.csv')  # Adjust filename as needed
    print(f"✅ Training data loaded! Shape: {train_data.shape}")
except FileNotFoundError:
    print("❌ Training data file 'train.csv' not found!")
    # Create sample data for demonstration (you should replace this with actual data)
    print("Creating sample data for demonstration...")
    np.random.seed(42)
    n_samples = 2000
    train_data = pd.DataFrame({
        'Time_spent_Alone': np.random.uniform(0, 12, n_samples),
        'Stage_fear': np.random.choice(['Yes', 'No'], n_samples),
        'Social_event_attendance': np.random.randint(0, 11, n_samples),
        'Going_outside': np.random.randint(0, 8, n_samples),
        'Drained_after_socializing': np.random.choice(['Yes', 'No'], n_samples),
        'Friends_circle_size': np.random.randint(1, 21, n_samples),
        'Post_frequency': np.random.randint(0, 11, n_samples),
        'Personality': np.random.choice(['Introvert', 'Extrovert'], n_samples)
    })
    print(f"✅ Sample data created! Shape: {train_data.shape}")

print(f"\nDataset Info:")
print(f"- Total samples: {len(train_data)}")
print(f"- Features: {list(train_data.columns)}")
if 'Personality' in train_data.columns:
    print(f"- Class distribution:\n{train_data['Personality'].value_counts()}")
else:
    print("- Looking for target column...")
    target_cols = [col for col in train_data.columns if col.lower() in ['personality', 'target', 'label', 'class']]
    if target_cols:
        target_col = target_cols[0]
        print(f"- Found target column: {target_col}")
        print(f"- Class distribution:\n{train_data[target_col].value_counts()}")
        train_data = train_data.rename(columns={target_col: 'Personality'})
    else:
        print("- No clear target column found in the data")

✅ Model loaded successfully!
Feature columns: ['Time_spent_Alone', 'Stage_fear', 'Social_event_attendance', 'Going_outside', 'Drained_after_socializing', 'Friends_circle_size', 'Post_frequency']
Categorical columns: ['Stage_fear', 'Drained_after_socializing']
Classes: ['Extrovert' 'Introvert']
✅ Training data loaded! Shape: (18524, 9)

Dataset Info:
- Total samples: 18524
- Features: ['id', 'Time_spent_Alone', 'Stage_fear', 'Social_event_attendance', 'Going_outside', 'Drained_after_socializing', 'Friends_circle_size', 'Post_frequency', 'Personality']
- Class distribution:
Personality
Extrovert    13699
Introvert     4825
Name: count, dtype: int64
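
The loading cell expects the pickle file to hold a dictionary with five keys. A minimal sketch of that layout follows, with a round trip through `pickle` and a key check. The `None` model placeholder and the encoder classes `["No", "Yes"]` are assumptions made for illustration; the real file stores the trained XGBoost model and the encoders fitted at training time.

```python
import pickle
from sklearn.preprocessing import LabelEncoder

categorical_columns = ["Stage_fear", "Drained_after_socializing"]

# Illustrative bundle with the same keys the notebook reads
components = {
    "model": None,  # placeholder (assumption); the real bundle holds the trained classifier
    "target_encoder": LabelEncoder().fit(["Extrovert", "Introvert"]),
    "label_encoders": {col: LabelEncoder().fit(["No", "Yes"]) for col in categorical_columns},
    "feature_columns": [
        "Time_spent_Alone", "Stage_fear", "Social_event_attendance", "Going_outside",
        "Drained_after_socializing", "Friends_circle_size", "Post_frequency",
    ],
    "categorical_columns": categorical_columns,
}

# Round trip through pickle, as the training and loading steps would do via a file
restored = pickle.loads(pickle.dumps(components))

# Fail early if a key the notebook relies on is absent
required = {"model", "target_encoder", "label_encoders", "feature_columns", "categorical_columns"}
missing = required - restored.keys()
print("Missing keys:", missing)
print("Classes:", list(restored["target_encoder"].classes_))
```

Checking the keys right after loading gives a clearer error than a `KeyError` raised partway through the cell.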


## 3. Identify Correctly and Misclassified Samples

Preprocess the data, make predictions, and separate correctly classified from misclassified samples for each class.
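
The preprocessing described above (mode imputation for categorical columns, median imputation for numeric ones, then label encoding) could be factored into a helper so the same steps are reusable elsewhere. This is a sketch under assumptions: the name `preprocess` and the toy frame are hypothetical, and the encoder is fitted here on `["No", "Yes"]` only for the demonstration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def preprocess(df, categorical_columns, label_encoders):
    """Impute missing values, then label-encode categorical columns.
    Hypothetical helper mirroring the cell below; assumes every column
    has at least one non-missing value."""
    out = df.copy()
    for col in out.columns:
        if out[col].isnull().any():
            if col in categorical_columns:
                # Most frequent value for categorical features
                out[col] = out[col].fillna(out[col].mode()[0])
            else:
                # Median for numeric features
                out[col] = out[col].fillna(out[col].median())
    for col in categorical_columns:
        out[col] = label_encoders[col].transform(out[col])
    return out

toy = pd.DataFrame({
    "Time_spent_Alone": [1.0, None, 5.0],
    "Stage_fear": ["No", None, "No"],
})
encoders = {"Stage_fear": LabelEncoder().fit(["No", "Yes"])}

clean = preprocess(toy, ["Stage_fear"], encoders)
print(clean)
```

Note that this imputes from the statistics of the frame it is given, as the notebook cell does with the sample; it does not reuse statistics from training time.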

In [4]:
# Sample 1000 data points for analysis (adjust as needed)
sample_size = min(1000, len(train_data))
sample_data = train_data.sample(n=sample_size, random_state=42).reset_index(drop=True)
print(f"Sampled {sample_size} data points for analysis")

# Prepare features for prediction
X_sample = sample_data[feature_columns].copy()
y_true = sample_data['Personality'].copy()

# Check for missing values and handle them
print(f"\nChecking for missing values:")
for col in X_sample.columns:
    missing_count = X_sample[col].isnull().sum()
    if missing_count > 0:
        print(f"  {col}: {missing_count} missing values")
        if col in categorical_columns:
            # For categorical columns, fill with the most common value
            mode_value = X_sample[col].mode()
            if len(mode_value) > 0:
                X_sample[col] = X_sample[col].fillna(mode_value[0])
                print(f"    Filled with mode: {mode_value[0]}")
            else:
                # If no mode available, use a default value
                if col == 'Stage_fear' or col == 'Drained_after_socializing':
                    X_sample[col] = X_sample[col].fillna('No')
                    print(f"    Filled with default: 'No'")
        else:
            # For numerical columns, fill with median
            median_value = X_sample[col].median()
            X_sample[col] = X_sample[col].fillna(median_value)
            print(f"    Filled with median: {median_value}")

# Encode categorical features
for feature in categorical_columns:
    if feature in X_sample.columns and feature in label_encoders:
        le = label_encoders[feature]
        try:
            X_sample[feature] = le.transform(X_sample[feature])
        except ValueError as e:
            print(f"Error encoding {feature}: {e}")
            # Check unique values in the sample vs trained encoder
            print(f"  Sample unique values: {X_sample[feature].unique()}")
            print(f"  Encoder classes: {le.classes_}")
            raise

# Ensure all features are present and in correct order
X_sample = X_sample.reindex(columns=feature_columns, fill_value=0)

print(f"Feature matrix shape: {X_sample.shape}")
print(f"True labels shape: {y_true.shape}")

# Make predictions
y_pred = model.predict(X_sample)
y_pred_labels = target_encoder.inverse_transform(y_pred)

# Calculate accuracy
accuracy = (y_true == y_pred_labels).mean()
print(f"\nModel accuracy on sample: {accuracy:.3f}")

# Analyze by class
results_by_class = {}
for class_name in target_encoder.classes_:
    class_mask = (y_true == class_name)
    class_samples = class_mask.sum()
    correct_predictions = ((y_true == class_name) & (y_pred_labels == class_name)).sum()
    misclassified = class_samples - correct_predictions
    
    results_by_class[class_name] = {
        'total_samples': int(class_samples),
        'correctly_classified': int(correct_predictions),
        'misclassified': int(misclassified),
        'accuracy': correct_predictions / class_samples if class_samples > 0 else 0,
        'misclassification_rate': misclassified / class_samples if class_samples > 0 else 0
    }
    
    print(f"\n{class_name}:")
    print(f"  Total samples: {class_samples}")
    print(f"  Correctly classified: {correct_predictions}")
    print(f"  Misclassified: {misclassified}")
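    # Reuse the guarded values so an empty class does not raise ZeroDivisionError
    print(f"  Accuracy: {results_by_class[class_name]['accuracy']:.3f}")
    print(f"  Misclassification rate: {results_by_class[class_name]['misclassification_rate']:.3f}")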

# Create masks for correctly classified samples by class
correct_introvert_mask = (y_true == 'Introvert') & (y_pred_labels == 'Introvert')
correct_extrovert_mask = (y_true == 'Extrovert') & (y_pred_labels == 'Extrovert')

print(f"\nCorrectly classified Introverts: {correct_introvert_mask.sum()}")
print(f"Correctly classified Extroverts: {correct_extrovert_mask.sum()}")

# Extract feature arrays for correctly classified samples
X_correct_introvert = X_sample[correct_introvert_mask].values
X_correct_extrovert = X_sample[correct_extrovert_mask].values

print(f"\nShape of correctly classified Introvert features: {X_correct_introvert.shape}")
print(f"Shape of correctly classified Extrovert features: {X_correct_extrovert.shape}")

Sampled 1000 data points for analysis

Checking for missing values:
  Time_spent_Alone: 62 missing values
    Filled with median: 2.0
  Stage_fear: 93 missing values
    Filled with mode: No
  Social_event_attendance: 63 missing values
    Filled with median: 5.0
  Going_outside: 61 missing values
    Filled with median: 4.0
  Drained_after_socializing: 69 missing values
    Filled with mode: No
  Friends_circle_size: 57 missing values
    Filled with median: 8.0
  Post_frequency: 77 missing values
    Filled with median: 5.0
Feature matrix shape: (1000, 7)
True labels shape: (1000,)

Model accuracy on sample: 0.940

Extrovert:
  Total samples: 717
  Correctly classified: 709
  Misclassified: 8
  Accuracy: 0.989
  Misclassification rate: 0.011

Introvert:
  Total samples: 283
  Correctly classified: 231
  Misclassified: 52
  Accuracy: 0.816
  Misclassification rate: 0.184

Correctly classified Introverts: 231
Correctly classified Extroverts: 709

Shape of correctly classified Introvert features: (231, 7)
Shape of correctly classified Extrovert features: (709, 7)
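
The per-class counts above can also be read off a confusion table, which avoids hard-coding the class names. The sketch below uses `pd.crosstab` on toy labels; the toy data and variable names are assumptions for illustration.

```python
import pandas as pd

classes = ["Extrovert", "Introvert"]
actual = pd.Series(
    ["Introvert", "Introvert", "Extrovert", "Extrovert", "Extrovert"], name="actual"
)
predicted = pd.Series(
    ["Introvert", "Extrovert", "Extrovert", "Extrovert", "Extrovert"], name="predicted"
)

# Rows are true classes, columns are predicted classes.
# Reindexing guarantees every class appears even if it was never predicted.
confusion = pd.crosstab(actual, predicted).reindex(
    index=classes, columns=classes, fill_value=0
)

per_class = {}
for cls in classes:
    total = int(confusion.loc[cls].sum())
    correct = int(confusion.loc[cls, cls])
    per_class[cls] = {
        "total_samples": total,
        "correctly_classified": correct,
        "misclassified": total - correct,
    }

print(confusion)
print(per_class)
```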

## 4. Compute Nearest Neighbor Distances for Correctly Classified Samples

For each correctly classified sample, compute the Euclidean distance to its nearest neighbor of the same class (excluding itself).

In [5]:
def calculate_nearest_neighbor_distances(feature_matrix):
    """
    Calculate the distance from each sample to its nearest neighbor (excluding itself).
    
    Args:
        feature_matrix (numpy.ndarray): Matrix of features for samples of the same class
    
    Returns:
        numpy.ndarray: Array of distances to nearest neighbors
    """
    if len(feature_matrix) < 2:
        return np.array([])
    
    # Use NearestNeighbors to find the 2 closest points (including self)
    # We need k=2 because the closest will be the point itself
    nn = NearestNeighbors(n_neighbors=2, metric='euclidean')
    nn.fit(feature_matrix)
    
    # Get distances to neighbors (first column is distance to self = 0, second is nearest neighbor)
    distances, indices = nn.kneighbors(feature_matrix)
    
    # Return distances to nearest neighbors (exclude self-distance)
    nearest_neighbor_distances = distances[:, 1]
    
    return nearest_neighbor_distances

# Calculate nearest neighbor distances for each class
print("Calculating nearest neighbor distances...")

# For Introverts
if len(X_correct_introvert) >= 2:
    introvert_nn_distances = calculate_nearest_neighbor_distances(X_correct_introvert)
    introvert_avg_distance = np.mean(introvert_nn_distances)
    print(f"\nIntrovert class:")
    print(f"  Number of correctly classified samples: {len(X_correct_introvert)}")
    print(f"  Average nearest neighbor distance: {introvert_avg_distance:.6f}")
    print(f"  Distance statistics:")
    print(f"    Min: {np.min(introvert_nn_distances):.6f}")
    print(f"    Max: {np.max(introvert_nn_distances):.6f}")
    print(f"    Std: {np.std(introvert_nn_distances):.6f}")
    print(f"    Median: {np.median(introvert_nn_distances):.6f}")
else:
    introvert_nn_distances = np.array([])
    introvert_avg_distance = 0.0
    print(f"\nIntrovert class: Not enough samples for nearest neighbor calculation")

# For Extroverts
if len(X_correct_extrovert) >= 2:
    extrovert_nn_distances = calculate_nearest_neighbor_distances(X_correct_extrovert)
    extrovert_avg_distance = np.mean(extrovert_nn_distances)
    print(f"\nExtrovert class:")
    print(f"  Number of correctly classified samples: {len(X_correct_extrovert)}")
    print(f"  Average nearest neighbor distance: {extrovert_avg_distance:.6f}")
    print(f"  Distance statistics:")
    print(f"    Min: {np.min(extrovert_nn_distances):.6f}")
    print(f"    Max: {np.max(extrovert_nn_distances):.6f}")
    print(f"    Std: {np.std(extrovert_nn_distances):.6f}")
    print(f"    Median: {np.median(extrovert_nn_distances):.6f}")
else:
    extrovert_nn_distances = np.array([])
    extrovert_avg_distance = 0.0
    print(f"\nExtrovert class: Not enough samples for nearest neighbor calculation")

# Store raw distances for visualization if needed
raw_distances = {
    'Introvert': introvert_nn_distances,
    'Extrovert': extrovert_nn_distances
}

Calculating nearest neighbor distances...

Introvert class:
  Number of correctly classified samples: 231
  Average nearest neighbor distance: 1.632476
  Distance statistics:
    Min: 1.000000
    Max: 9.219544
    Std: 0.675451
    Median: 1.414214

Extrovert class:
  Number of correctly classified samples: 709
  Average nearest neighbor distance: 1.319735
  Distance statistics:
    Min: 0.000000
    Max: 4.242641
    Std: 0.449314
    Median: 1.414214
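
The Extrovert minimum of 0.000000 means at least two correctly classified Extrovert rows have identical feature vectors. The toy sketch below shows how duplicates produce a zero nearest-neighbor distance. It also shows that calling `kneighbors()` without a query matrix makes scikit-learn skip each point as its own neighbor, which gives the same distances as taking the second column.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Two identical rows and one distinct row (toy data)
points = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]])

nn = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(points)

# Approach used in the notebook: query with the fitted points, drop the self match
distances, _ = nn.kneighbors(points)
nearest_via_column = distances[:, 1]

# Alternative: no query matrix, so each point is excluded from its own neighbors
loo_distances, _ = nn.kneighbors(n_neighbors=1)
nearest_via_loo = loo_distances[:, 0]

print(nearest_via_column)
print(nearest_via_loo)
```

Both arrays hold a distance of 0 for the two duplicate rows and 5 for the third row.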


## 5. Calculate Weighted Average Distance for Each Class

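Scale each class's average nearest-neighbor distance by `1 + misclassified / correctly_classified`, so that classes that are harder to classify get a larger baseline. The overall baseline is then the average of the class baselines, weighted by each class's correctly classified count.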

In [6]:
def calculate_weighted_baseline_distance(avg_nn_distance, misclassified_count, correctly_classified_count):
    """
    Calculate weighted baseline distance using misclassification ratio as weight.
    
    Formula: weighted_distance = avg_nn_distance * (1 + misclassified_ratio)
    where misclassified_ratio = misclassified_count / correctly_classified_count
    
    This gives higher baseline distances for classes that are harder to classify correctly.
    """
    if correctly_classified_count == 0:
        return 0.0
    
    misclassified_ratio = misclassified_count / correctly_classified_count
    weighted_distance = avg_nn_distance * (1 + misclassified_ratio)
    
    return weighted_distance

print("Calculating weighted baseline distances...")
print("=" * 60)

weighted_baselines = {}

for class_name in target_encoder.classes_:
    class_stats = results_by_class[class_name]
    
    if class_name == 'Introvert':
        avg_distance = introvert_avg_distance
    else:  # Extrovert
        avg_distance = extrovert_avg_distance
    
    # Calculate weighted baseline
    weighted_baseline = calculate_weighted_baseline_distance(
        avg_distance,
        class_stats['misclassified'],
        class_stats['correctly_classified']
    )
    
    weighted_baselines[class_name] = weighted_baseline
    
    # Weight actually applied above: misclassified / correctly classified.
    # (This differs from 'misclassification_rate', which divides by the class total.)
    applied_ratio = (
        class_stats['misclassified'] / class_stats['correctly_classified']
        if class_stats['correctly_classified'] > 0 else 0.0
    )
    
    print(f"\n{class_name} Class Analysis:")
    print(f"  Correctly classified samples: {class_stats['correctly_classified']}")
    print(f"  Misclassified samples: {class_stats['misclassified']}")
    print(f"  Misclassification ratio: {applied_ratio:.3f}")
    print(f"  Average NN distance: {avg_distance:.6f}")
    print(f"  Weighted baseline distance: {weighted_baseline:.6f}")
    print(f"  Weight factor: {(1 + applied_ratio):.3f}")

# Calculate overall weighted average across both classes
total_correct = sum(results_by_class[cls]['correctly_classified'] for cls in target_encoder.classes_)
if total_correct > 0:
    overall_weighted_baseline = sum(
        weighted_baselines[cls] * results_by_class[cls]['correctly_classified'] 
        for cls in target_encoder.classes_
    ) / total_correct
else:
    overall_weighted_baseline = 0.0

print(f"\n" + "=" * 60)
print(f"OVERALL WEIGHTED BASELINE DISTANCE: {overall_weighted_baseline:.6f}")
print(f"=" * 60)

# Compare with simple average
simple_avg_baseline = np.mean([introvert_avg_distance, extrovert_avg_distance])
print(f"\nComparison:")
print(f"  Simple average baseline: {simple_avg_baseline:.6f}")
print(f"  Weighted average baseline: {overall_weighted_baseline:.6f}")
print(f"  Difference: {abs(overall_weighted_baseline - simple_avg_baseline):.6f}")

# Store all calculated values
baseline_metrics = {
    'class_specific': {
        'Introvert': {
            'average_nn_distance': float(introvert_avg_distance),
            'weighted_baseline_distance': float(weighted_baselines['Introvert']),
            'correctly_classified_count': int(results_by_class['Introvert']['correctly_classified']),
            'misclassified_count': int(results_by_class['Introvert']['misclassified']),
            'misclassification_rate': float(results_by_class['Introvert']['misclassification_rate'])
        },
        'Extrovert': {
            'average_nn_distance': float(extrovert_avg_distance),
            'weighted_baseline_distance': float(weighted_baselines['Extrovert']),
            'correctly_classified_count': int(results_by_class['Extrovert']['correctly_classified']),
            'misclassified_count': int(results_by_class['Extrovert']['misclassified']),
            'misclassification_rate': float(results_by_class['Extrovert']['misclassification_rate'])
        }
    },
    'overall': {
        'weighted_baseline_distance': float(overall_weighted_baseline),
        'simple_average_baseline': float(simple_avg_baseline),
        'total_samples_analyzed': int(sample_size),
        'total_correctly_classified': int(total_correct)
    },
    'methodology': {
        'description': "Weighted average nearest neighbor distance for correctly classified samples",
        'formula': "weighted_distance = avg_nn_distance * (1 + misclassified_count / correctly_classified_count)",
        'sample_size': int(sample_size),
        'random_seed': 42
    }
}

print(f"\nBaseline metrics calculated and stored!")

Calculating weighted baseline distances...

Extrovert Class Analysis:
  Correctly classified samples: 709
  Misclassified samples: 8
  Misclassification ratio: 0.011
  Average NN distance: 1.319735
  Weighted baseline distance: 1.334626
  Weight factor: 1.011

Introvert Class Analysis:
  Correctly classified samples: 231
  Misclassified samples: 52
  Misclassification ratio: 0.225
  Average NN distance: 1.632476
  Weighted baseline distance: 1.999960
  Weight factor: 1.225

OVERALL WEIGHTED BASELINE DISTANCE: 1.498128

Comparison:
  Simple average baseline: 1.476105
  Weighted average baseline: 1.498128
  Difference: 0.022023

Baseline metrics calculated and stored!
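
As a check on the arithmetic, the reported baselines can be reproduced from the printed counts and average distances. The multiplier for each class is `1 + misclassified / correctly_classified`, which is 1 + 52/231 ≈ 1.225 for Introverts and 1 + 8/709 ≈ 1.011 for Extroverts. The inputs below are the values printed above, already rounded to six decimals, so the results match only up to rounding.

```python
# Values copied from the printed output above
stats = {
    "Introvert": {"avg_nn": 1.632476, "correct": 231, "wrong": 52},
    "Extrovert": {"avg_nn": 1.319735, "correct": 709, "wrong": 8},
}

# Per-class baseline: average distance scaled by (1 + misclassified / correct)
weighted = {
    cls: s["avg_nn"] * (1 + s["wrong"] / s["correct"])
    for cls, s in stats.items()
}

# Overall baseline: class baselines averaged by correctly classified counts
total_correct = sum(s["correct"] for s in stats.values())
overall = sum(weighted[cls] * stats[cls]["correct"] for cls in stats) / total_correct

for cls, value in weighted.items():
    print(f"{cls}: {value:.6f}")
print(f"Overall: {overall:.6f}")
```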


## 6. Save Results to JSON File

Save the computed baseline metrics to a JSON file for use in the Streamlit app.

In [7]:
# Save baseline metrics to JSON file
output_filename = 'baseline_distance_metrics.json'

try:
    with open(output_filename, 'w') as f:
        json.dump(baseline_metrics, f, indent=2)
    
    print(f"✅ Baseline metrics saved to '{output_filename}'")
    print(f"\nFile contents preview:")
    print(f"  Overall weighted baseline: {baseline_metrics['overall']['weighted_baseline_distance']:.6f}")
    print(f"  Introvert weighted baseline: {baseline_metrics['class_specific']['Introvert']['weighted_baseline_distance']:.6f}")
    print(f"  Extrovert weighted baseline: {baseline_metrics['class_specific']['Extrovert']['weighted_baseline_distance']:.6f}")
    
except Exception as e:
    print(f"❌ Error saving baseline metrics: {e}")

# Display final summary for easy reference
print(f"\n" + "=" * 80)
print(f"FINAL BASELINE DISTANCE METRICS SUMMARY")
print(f"=" * 80)
print(f"For use in Streamlit app:")
print(f"")
print(f"OVERALL_BASELINE_DISTANCE = {overall_weighted_baseline:.6f}")
print(f"")
print(f"CLASS_SPECIFIC_BASELINES = {{")
print(f"    'Introvert': {weighted_baselines['Introvert']:.6f},")
print(f"    'Extrovert': {weighted_baselines['Extrovert']:.6f}")
print(f"}}")
print(f"")
print(f"This replaces the old hardcoded value of 9.341171")
print(f"=" * 80)

# Create a simple verification
print(f"\n🔍 Verification:")
print(f"  Old baseline (hardcoded): 9.341171")
print(f"  New baseline (calculated): {overall_weighted_baseline:.6f}")
print(f"  Difference: {abs(9.341171 - overall_weighted_baseline):.6f}")

if abs(9.341171 - overall_weighted_baseline) > 1.0:
    print(f"  ⚠️  Significant difference detected - this is expected as we're using the correct methodology now")
else:
    print(f"  ✅ Values are similar - good consistency check")

print(f"\n📝 Next Steps:")
print(f"1. Update the Streamlit app to load baseline metrics from '{output_filename}'")
print(f"2. Replace hardcoded BASELINE_CORRECT_DISTANCE with the appropriate baseline")
print(f"3. Consider using class-specific baselines for more accurate reliability scoring")

✅ Baseline metrics saved to 'baseline_distance_metrics.json'

File contents preview:
  Overall weighted baseline: 1.498128
  Introvert weighted baseline: 1.999960
  Extrovert weighted baseline: 1.334626

FINAL BASELINE DISTANCE METRICS SUMMARY
For use in Streamlit app:

OVERALL_BASELINE_DISTANCE = 1.498128

CLASS_SPECIFIC_BASELINES = {
    'Introvert': 1.999960,
    'Extrovert': 1.334626
}

This replaces the old hardcoded value of 9.341171

🔍 Verification:
  Old baseline (hardcoded): 9.341171
  New baseline (calculated): 1.498128
  Difference: 7.843043
  ⚠️  Significant difference detected - this is expected as we're using the correct methodology now

📝 Next Steps:
1. Update the Streamlit app to load baseline metrics from 'baseline_distance_metrics.json'
2. Replace hardcoded BASELINE_CORRECT_DISTANCE with the appropriate baseline
3. Consider using class-specific baselines for more accurate reliability scoring
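
A minimal sketch of the first and third next steps: reading the JSON file and picking the class-specific baseline, with the overall value as a fallback. The helper `get_baseline` is a hypothetical name, and the file is written to a temporary directory here only to keep the example self-contained. How the app turns a baseline into a reliability score is not shown in this notebook and is left out.

```python
import json
import os
import tempfile

# Subset of the saved structure, using the values reported above
metrics = {
    "class_specific": {
        "Introvert": {"weighted_baseline_distance": 1.999960},
        "Extrovert": {"weighted_baseline_distance": 1.334626},
    },
    "overall": {"weighted_baseline_distance": 1.498128},
}

def get_baseline(metrics, predicted_class):
    """Return the class-specific baseline, or the overall one for unknown classes.
    Hypothetical helper for illustration."""
    entry = metrics["class_specific"].get(predicted_class)
    if entry is not None:
        return entry["weighted_baseline_distance"]
    return metrics["overall"]["weighted_baseline_distance"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "baseline_distance_metrics.json")
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)
    # The app would perform only this load step against the real file
    with open(path) as f:
        loaded = json.load(f)

print(get_baseline(loaded, "Introvert"))
print(get_baseline(loaded, "Unknown"))
```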
