# STEP 2: DATA PREPARATION

**Extrovert-Introvert Classification - Data Cleaning and Preprocessing**

This notebook handles comprehensive data preparation including:
- Loading processed data from Step 1
- Handling missing values and outliers
- Feature validation and cleaning
- Data type conversions and normalization
- Feature engineering for personality classification
- Preparing final datasets for modeling

**Key Objectives:**
- Clean and validate behavioral and psychological data
- Handle missing values using domain-appropriate strategies
- Engineer new features based on personality psychology
- Balance classes using appropriate techniques
- Export cleaned datasets for modeling phase


## 1. Import Required Libraries


In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Statistical analysis
from scipy import stats
from scipy.stats import zscore

# Data preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.impute import SimpleImputer, KNNImputer

# Class balancing
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Progress bars
from tqdm.auto import tqdm
tqdm.pandas()

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set plotting style
plt.style.use('default')
sns.set_palette("viridis")

print("\n" + "="*60)
print("STEP 1: LIBRARY IMPORT COMPLETED")
print("="*60)
print("All required libraries loaded successfully")
print("Available preprocessing methods:")
print("- Missing value imputation (Simple, KNN)")
print("- Outlier detection and handling")
print("- Feature scaling and normalization")
print("- Class balancing (SMOTE, Random sampling)")
print("- Statistical analysis tools")



STEP 1: LIBRARY IMPORT COMPLETED
All required libraries loaded successfully
Available preprocessing methods:
- Missing value imputation (Simple, KNN)
- Outlier detection and handling
- Feature scaling and normalization
- Class balancing (SMOTE, Random sampling)
- Statistical analysis tools


## 2. Load Data from Step 1


In [20]:
# Load data using robust encoding handling
def load_csv_safe(file_path, encodings=['utf-8', 'latin-1', 'cp1252', 'iso-8859-1']):
    """
    Load CSV file with multiple encoding attempts to handle any text encoding issues.
    """
    for encoding in encodings:
        try:
            df = pd.read_csv(file_path, encoding=encoding)
            print(f"  Successfully loaded with encoding: {encoding}")
            return df
        except UnicodeDecodeError:
            print(f"  Failed with {encoding} encoding")
            continue
        except Exception as e:
            print(f"  Error with encoding {encoding}: {str(e)}")
            continue
    
    raise Exception(f"Could not load file with any encoding: {encodings}")

print("\n" + "="*60)
print("STEP 2: DATA LOADING")
print("="*60)

# Try to load from processed data first, then fallback to raw data
processed_path = Path('../data/processed/raw_personality_data.csv')
raw_path = Path('../data/raw/personality_dataset.csv')

if processed_path.exists():
    print(f"Loading processed data from: {processed_path}")
    df = load_csv_safe(processed_path)
    data_source = "processed"
elif raw_path.exists():
    print(f"Loading raw data from: {raw_path}")
    df = load_csv_safe(raw_path)
    data_source = "raw"
else:
    raise FileNotFoundError("No data files found. Please run Step 1 first.")

print(f"\nDataset loaded successfully from {data_source} data!")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"Memory usage: {df.memory_usage(deep=True).sum()/1024/1024:.2f} MB")

# Display basic info
print(f"\nFirst 3 rows:")
print(df.head(3))



STEP 2: DATA LOADING
Loading processed data from: ..\data\processed\raw_personality_data.csv
  Successfully loaded with encoding: utf-8

Dataset loaded successfully from processed data!
Shape: (2900, 8)
Columns: ['Time_spent_Alone', 'Stage_fear', 'Social_event_attendance', 'Going_outside', 'Drained_after_socializing', 'Friends_circle_size', 'Post_frequency', 'Personality']
Memory usage: 0.55 MB

First 3 rows:
   Time_spent_Alone Stage_fear  Social_event_attendance  Going_outside  \
0               4.0         No                      4.0            6.0   
1               9.0        Yes                      0.0            0.0   
2               9.0        Yes                      1.0            2.0   

  Drained_after_socializing  Friends_circle_size  Post_frequency Personality  
0                        No                 13.0             5.0   Extrovert  
1                       Yes                  0.0             3.0   Introvert  
2                       Yes                  5.0    

## 3. Data Validation and Feature Identification


In [21]:
print("\n" + "="*60)
print("STEP 3: DATA VALIDATION AND FEATURE IDENTIFICATION")
print("="*60)

# Auto-detect key columns based on personality dataset structure
target_column = None
behavioral_features = []
psychological_features = []

# Identify target column
for col in df.columns:
    if col.upper() in ['PERSONALITY', 'TARGET', 'LABEL', 'CLASS']:
        target_column = col
        break

# Identify behavioral features (numerical scales)
expected_behavioral = ['Time_spent_Alone', 'Social_event_attendance', 'Going_outside', 
                      'Friends_circle_size', 'Post_frequency']
for feature in expected_behavioral:
    if feature in df.columns:
        behavioral_features.append(feature)

# Identify psychological features (categorical)
expected_psychological = ['Stage_fear', 'Drained_after_socializing']
for feature in expected_psychological:
    if feature in df.columns:
        psychological_features.append(feature)

print(f"COLUMN IDENTIFICATION RESULTS:")
print(f"Target column: {target_column}")
print(f"Behavioral features ({len(behavioral_features)}): {behavioral_features}")
print(f"Psychological features ({len(psychological_features)}): {psychological_features}")

# Validate required columns exist
if not target_column:
    print("ERROR: Target column 'Personality' not found!")
    
if len(behavioral_features) == 0:
    print("ERROR: No behavioral features found!")
    
if len(psychological_features) == 0:
    print("ERROR: No psychological features found!")

# Check data types
print(f"\nDATA TYPE VALIDATION:")
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print(f"Numerical columns ({len(numerical_cols)}): {numerical_cols}")
print(f"Categorical columns ({len(categorical_cols)}): {categorical_cols}")

# Basic data quality checks
print(f"\nDATA QUALITY OVERVIEW:")
print(f"Total rows: {len(df):,}")
print(f"Total columns: {len(df.columns)}")
print(f"Missing values: {df.isnull().sum().sum()}")
print(f"Duplicate rows: {df.duplicated().sum()}")

# Memory usage
memory_mb = df.memory_usage(deep=True).sum() / 1024 / 1024
print(f"Memory usage: {memory_mb:.2f} MB")

if target_column and target_column in df.columns:
    print(f"\nTARGET VARIABLE ANALYSIS:")
    target_counts = df[target_column].value_counts()
    for personality, count in target_counts.items():
        pct = (count / len(df)) * 100
        print(f"  {personality}: {count:,} ({pct:.1f}%)")
    
    balance_ratio = target_counts.max() / target_counts.min()
    print(f"  Class balance ratio: {balance_ratio:.2f}:1")
    
    if balance_ratio <= 1.5:
        balance_status = "WELL BALANCED"
    elif balance_ratio <= 3.0:
        balance_status = "MODERATELY IMBALANCED"
    else:
        balance_status = "HIGHLY IMBALANCED"
    
    print(f"  Balance assessment: {balance_status}")

print(f"\nValidation completed. Ready for data cleaning pipeline.")



STEP 3: DATA VALIDATION AND FEATURE IDENTIFICATION
COLUMN IDENTIFICATION RESULTS:
Target column: Personality
Behavioral features (5): ['Time_spent_Alone', 'Social_event_attendance', 'Going_outside', 'Friends_circle_size', 'Post_frequency']
Psychological features (2): ['Stage_fear', 'Drained_after_socializing']

DATA TYPE VALIDATION:
Numerical columns (5): ['Time_spent_Alone', 'Social_event_attendance', 'Going_outside', 'Friends_circle_size', 'Post_frequency']
Categorical columns (3): ['Stage_fear', 'Drained_after_socializing', 'Personality']

DATA QUALITY OVERVIEW:
Total rows: 2,900
Total columns: 8
Missing values: 458
Duplicate rows: 388
Memory usage: 0.55 MB

TARGET VARIABLE ANALYSIS:
  Extrovert: 1,491 (51.4%)
  Introvert: 1,409 (48.6%)
  Class balance ratio: 1.06:1
  Balance assessment: WELL BALANCED

Validation completed. Ready for data cleaning pipeline.


## 4. Data Cleaning and Missing Value Handling Functions


In [22]:
print("\n" + "="*60)
print("STEP 4: DEFINING DATA CLEANING FUNCTIONS")
print("="*60)

def handle_missing_values(df, strategy='domain_specific'):
    """
    Handle missing values using domain-specific strategies for personality data.
    """
    df_clean = df.copy()
    
    if strategy == 'domain_specific':
        # For behavioral features (0-10 scales), use median imputation
        for feature in behavioral_features:
            if feature in df_clean.columns and df_clean[feature].isnull().any():
                median_val = df_clean[feature].median()
                df_clean[feature].fillna(median_val, inplace=True)
                print(f"  {feature}: Filled {df[feature].isnull().sum()} missing values with median ({median_val})")
        
        # For categorical psychological features, use mode imputation
        for feature in psychological_features:
            if feature in df_clean.columns and df_clean[feature].isnull().any():
                mode_val = df_clean[feature].mode().iloc[0] if not df_clean[feature].mode().empty else 'Unknown'
                df_clean[feature].fillna(mode_val, inplace=True)
                print(f"  {feature}: Filled {df[feature].isnull().sum()} missing values with mode ({mode_val})")
        
        # For target variable, drop rows with missing values
        if target_column and df_clean[target_column].isnull().any():
            before_count = len(df_clean)
            df_clean = df_clean.dropna(subset=[target_column])
            dropped_count = before_count - len(df_clean)
            print(f"  {target_column}: Removed {dropped_count} rows with missing target values")
            
    return df_clean

def detect_outliers(df, features, method='iqr', threshold=1.5):
    """
    Detect outliers in numerical features using IQR or Z-score methods.
    """
    outlier_info = {}
    
    for feature in features:
        if feature in df.columns:
            if method == 'iqr':
                Q1 = df[feature].quantile(0.25)
                Q3 = df[feature].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - threshold * IQR
                upper_bound = Q3 + threshold * IQR
                
                outliers = df[(df[feature] < lower_bound) | (df[feature] > upper_bound)]
                outlier_count = len(outliers)
                
            elif method == 'zscore':
                z_scores = np.abs(zscore(df[feature]))
                outliers = df[z_scores > threshold]
                outlier_count = len(outliers)
                
            outlier_info[feature] = {
                'count': outlier_count,
                'percentage': (outlier_count / len(df)) * 100
            }
    
    return outlier_info

def validate_feature_ranges(df, features):
    """
    Validate that behavioral features are within expected ranges (0-10 scale).
    """
    validation_results = {}
    
    for feature in features:
        if feature in df.columns:
            min_val = df[feature].min()
            max_val = df[feature].max()
            
            # Check if values are within 0-10 range
            within_range = (min_val >= 0) and (max_val <= 10)
            out_of_range_count = len(df[(df[feature] < 0) | (df[feature] > 10)])
            
            validation_results[feature] = {
                'min': min_val,
                'max': max_val,
                'within_expected_range': within_range,
                'out_of_range_count': out_of_range_count
            }
    
    return validation_results

def clean_categorical_features(df, features):
    """
    Clean and standardize categorical features.
    """
    df_clean = df.copy()
    
    for feature in features:
        if feature in df_clean.columns:
            # Convert to string and strip whitespace
            df_clean[feature] = df_clean[feature].astype(str).str.strip()
            
            # Standardize common variations
            if feature == 'Stage_fear':
                df_clean[feature] = df_clean[feature].replace({
                    'yes': 'Yes', 'YES': 'Yes', 'y': 'Yes', 'Y': 'Yes',
                    'no': 'No', 'NO': 'No', 'n': 'No', 'N': 'No'
                })
            
            if feature == 'Drained_after_socializing':
                df_clean[feature] = df_clean[feature].replace({
                    'yes': 'Yes', 'YES': 'Yes', 'y': 'Yes', 'Y': 'Yes',
                    'no': 'No', 'NO': 'No', 'n': 'No', 'N': 'No'
                })
            
            # Remove 'nan' strings
            df_clean[feature] = df_clean[feature].replace('nan', np.nan)
    
    return df_clean

print("Data cleaning functions defined successfully:")
print("- handle_missing_values(): Domain-specific missing value imputation")
print("- detect_outliers(): IQR and Z-score outlier detection")
print("- validate_feature_ranges(): Validate 0-10 behavioral scales")
print("- clean_categorical_features(): Standardize categorical values")
print("\nReady to apply data cleaning pipeline...")



STEP 4: DEFINING DATA CLEANING FUNCTIONS
Data cleaning functions defined successfully:
- handle_missing_values(): Domain-specific missing value imputation
- detect_outliers(): IQR and Z-score outlier detection
- validate_feature_ranges(): Validate 0-10 behavioral scales
- clean_categorical_features(): Standardize categorical values

Ready to apply data cleaning pipeline...


## 5. Apply Data Cleaning Pipeline


In [23]:
if target_column and (behavioral_features or psychological_features):
    print("\n" + "="*60)
    print("STEP 5: APPLYING DATA CLEANING PIPELINE")
    print("="*60)
    
    # Store original data info
    original_shape = df.shape
    original_missing = df.isnull().sum().sum()
    
    print(f"Original dataset: {original_shape[0]:,} rows, {original_shape[1]} columns")
    print(f"Original missing values: {original_missing}")
    
    # Step 5.1: Remove duplicates
    df_before_dup = df.copy()
    df = df.drop_duplicates()
    duplicates_removed = len(df_before_dup) - len(df)
    if duplicates_removed > 0:
        print(f"\nStep 5.1: Removed {duplicates_removed} duplicate rows")
    else:
        print(f"\nStep 5.1: No duplicate rows found")
    
    # Step 5.2: Clean categorical features
    print(f"\nStep 5.2: Cleaning categorical features...")
    df = clean_categorical_features(df, psychological_features)
    
    # Step 5.3: Validate feature ranges
    print(f"\nStep 5.3: Validating behavioral feature ranges...")
    validation_results = validate_feature_ranges(df, behavioral_features)
    
    for feature, results in validation_results.items():
        print(f"  {feature}: Range [{results['min']:.1f}, {results['max']:.1f}], "
              f"Valid range: {results['within_expected_range']}, "
              f"Out of range: {results['out_of_range_count']}")
    
    # Step 5.4: Handle out-of-range values by capping
    for feature in behavioral_features:
        if feature in df.columns:
            original_out_of_range = len(df[(df[feature] < 0) | (df[feature] > 10)])
            df[feature] = df[feature].clip(0, 10)
            
            if original_out_of_range > 0:
                print(f"  {feature}: Capped {original_out_of_range} out-of-range values to [0, 10]")
    
    # Step 5.5: Detect outliers (before missing value handling)
    print(f"\nStep 5.5: Detecting outliers in behavioral features...")
    outlier_info = detect_outliers(df, behavioral_features, method='iqr', threshold=1.5)
    
    for feature, info in outlier_info.items():
        print(f"  {feature}: {info['count']} outliers ({info['percentage']:.1f}%)")
    
    # Step 5.6: Handle missing values
    print(f"\nStep 5.6: Handling missing values...")
    missing_before = df.isnull().sum().sum()
    print(f"Missing values before cleaning: {missing_before}")
    
    if missing_before > 0:
        df = handle_missing_values(df, strategy='domain_specific')
        missing_after = df.isnull().sum().sum()
        print(f"Missing values after cleaning: {missing_after}")
    else:
        print("No missing values to handle")
    
    # Step 5.7: Final validation
    print(f"\nStep 5.7: Final data validation...")
    final_shape = df.shape
    final_missing = df.isnull().sum().sum()
    
    print(f"Final dataset: {final_shape[0]:,} rows, {final_shape[1]} columns")
    print(f"Final missing values: {final_missing}")
    print(f"Rows retained: {(final_shape[0]/original_shape[0])*100:.1f}%")
    
    # Display cleaned data sample
    print(f"\nSample of cleaned data:")
    sample_cols = [target_column] + behavioral_features[:3] + psychological_features[:2]
    available_cols = [col for col in sample_cols if col in df.columns]
    print(df[available_cols].head(3))
    
    print(f"\nData cleaning pipeline completed successfully!")
    
else:
    print("ERROR: Cannot proceed without target column and feature columns")
    print(f"Target column found: {target_column is not None}")
    print(f"Behavioral features found: {len(behavioral_features)}")
    print(f"Psychological features found: {len(psychological_features)}")



STEP 5: APPLYING DATA CLEANING PIPELINE
Original dataset: 2,900 rows, 8 columns
Original missing values: 458

Step 5.1: Removed 388 duplicate rows

Step 5.2: Cleaning categorical features...

Step 5.3: Validating behavioral feature ranges...
  Time_spent_Alone: Range [0.0, 11.0], Valid range: False, Out of range: 116
  Social_event_attendance: Range [0.0, 10.0], Valid range: True, Out of range: 0
  Going_outside: Range [0.0, 7.0], Valid range: True, Out of range: 0
  Friends_circle_size: Range [0.0, 15.0], Valid range: False, Out of range: 571
  Post_frequency: Range [0.0, 10.0], Valid range: True, Out of range: 0
  Time_spent_Alone: Capped 116 out-of-range values to [0, 10]
  Friends_circle_size: Capped 571 out-of-range values to [0, 10]

Step 5.5: Detecting outliers in behavioral features...
  Time_spent_Alone: 0 outliers (0.0%)
  Social_event_attendance: 0 outliers (0.0%)
  Going_outside: 0 outliers (0.0%)
  Friends_circle_size: 0 outliers (0.0%)
  Post_frequency: 0 outliers (0.0%)

## 6. Feature Engineering for Personality Classification


In [24]:
print("\n" + "="*60)
print("STEP 6: FEATURE ENGINEERING FOR PERSONALITY CLASSIFICATION")
print("="*60)

# Create engineered features based on personality psychology
print("Creating personality-specific engineered features...")

# 6.1: Social Activity Composite Score
if all(feature in df.columns for feature in ['Social_event_attendance', 'Going_outside', 'Post_frequency']):
    df['Social_Activity_Score'] = (df['Social_event_attendance'] + df['Going_outside'] + df['Post_frequency']) / 3
    print(f"✓ Social_Activity_Score: Average of social event attendance, going outside, and posting frequency")

# 6.2: Introversion Tendency Score
if all(feature in df.columns for feature in ['Time_spent_Alone', 'Stage_fear', 'Drained_after_socializing']):
    # Convert categorical to numerical for calculation
    stage_fear_numeric = df['Stage_fear'].map({'Yes': 10, 'No': 0}) if 'Stage_fear' in df.columns else 0
    drained_numeric = df['Drained_after_socializing'].map({'Yes': 10, 'No': 0}) if 'Drained_after_socializing' in df.columns else 0
    
    df['Introversion_Score'] = (df['Time_spent_Alone'] + stage_fear_numeric + drained_numeric) / 3
    print(f"✓ Introversion_Score: Average of time alone, stage fear, and social draining")

# 6.3: Social Comfort Level
if 'Friends_circle_size' in df.columns and 'Stage_fear' in df.columns:
    stage_fear_inverted = df['Stage_fear'].map({'Yes': 0, 'No': 10})
    df['Social_Comfort'] = (df['Friends_circle_size'] + stage_fear_inverted) / 2
    print(f"✓ Social_Comfort: Combination of friend circle size and absence of stage fear")

# 6.4: Digital vs Physical Social Engagement
if 'Post_frequency' in df.columns and 'Social_event_attendance' in df.columns:
    df['Digital_vs_Physical_Social'] = df['Post_frequency'] - df['Social_event_attendance']
    print(f"✓ Digital_vs_Physical_Social: Difference between online and offline social engagement")

# 6.5: Energy Drain from Social Interaction
if 'Drained_after_socializing' in df.columns and 'Social_event_attendance' in df.columns:
    drained_penalty = df['Drained_after_socializing'].map({'Yes': -5, 'No': 0})
    df['Social_Energy_Balance'] = df['Social_event_attendance'] + drained_penalty
    print(f"✓ Social_Energy_Balance: Social attendance adjusted for energy drain")

# 6.6: Binary feature encodings for modeling
binary_features_created = []

if 'Stage_fear' in df.columns:
    df['Has_Stage_Fear'] = (df['Stage_fear'] == 'Yes').astype(int)
    binary_features_created.append('Has_Stage_Fear')

if 'Drained_after_socializing' in df.columns:
    df['Gets_Drained_Socializing'] = (df['Drained_after_socializing'] == 'Yes').astype(int)
    binary_features_created.append('Gets_Drained_Socializing')

if target_column in df.columns:
    df['Is_Introvert'] = (df[target_column] == 'Introvert').astype(int)
    binary_features_created.append('Is_Introvert')

print(f"✓ Binary encodings: {', '.join(binary_features_created)}")

# 6.7: Feature scaling groups
behavioral_numeric_features = [col for col in behavioral_features if col in df.columns]
engineered_features = [col for col in df.columns if col in [
    'Social_Activity_Score', 'Introversion_Score', 'Social_Comfort',
    'Digital_vs_Physical_Social', 'Social_Energy_Balance'
]]

print(f"\nFeature engineering summary:")
print(f"Original behavioral features: {len(behavioral_numeric_features)}")
print(f"Engineered composite features: {len(engineered_features)}")
print(f"Binary encoded features: {len(binary_features_created)}")
print(f"Total features for modeling: {len(behavioral_numeric_features) + len(engineered_features) + len(binary_features_created)}")

# Display engineered feature statistics
if engineered_features:
    print(f"\nEngineered feature statistics:")
    for feature in engineered_features:
        if feature in df.columns:
            mean_val = df[feature].mean()
            std_val = df[feature].std()
            print(f"  {feature}: Mean={mean_val:.2f}, Std={std_val:.2f}")

print(f"\nFeature engineering completed successfully!")



STEP 6: FEATURE ENGINEERING FOR PERSONALITY CLASSIFICATION
Creating personality-specific engineered features...
✓ Social_Activity_Score: Average of social event attendance, going outside, and posting frequency
✓ Introversion_Score: Average of time alone, stage fear, and social draining
✓ Social_Comfort: Combination of friend circle size and absence of stage fear
✓ Digital_vs_Physical_Social: Difference between online and offline social engagement
✓ Social_Energy_Balance: Social attendance adjusted for energy drain
✓ Binary encodings: Has_Stage_Fear, Gets_Drained_Socializing, Is_Introvert

Feature engineering summary:
Original behavioral features: 5
Engineered composite features: 5
Binary encoded features: 3
Total features for modeling: 13

Engineered feature statistics:
  Social_Activity_Score: Mean=3.74, Std=2.41
  Introversion_Score: Mean=4.27, Std=4.25
  Social_Comfort: Mean=5.81, Std=4.01
  Digital_vs_Physical_Social: Mean=-0.40, Std=2.21
  Social_Energy_Balance: Mean=2.05, Std=5.

## 7. Class Balancing Analysis


In [25]:
print("\n" + "="*60)
print("STEP 7: CLASS BALANCING ANALYSIS")
print("="*60)

if target_column and target_column in df.columns:
    # Analyze current class distribution
    class_counts = df[target_column].value_counts()
    total_samples = len(df)
    
    print(f"Current class distribution:")
    for personality_type, count in class_counts.items():
        percentage = (count / total_samples) * 100
        print(f"  {personality_type}: {count:,} samples ({percentage:.1f}%)")
    
    # Calculate imbalance ratio
    max_class = class_counts.max()
    min_class = class_counts.min()
    imbalance_ratio = max_class / min_class
    
    print(f"\nClass imbalance ratio: {imbalance_ratio:.2f}:1")
    
    # Determine if balancing is needed
    if imbalance_ratio <= 1.5:
        balance_status = "WELL BALANCED"
        needs_balancing = False
    elif imbalance_ratio <= 3.0:
        balance_status = "MODERATELY IMBALANCED"
        needs_balancing = True
    else:
        balance_status = "HIGHLY IMBALANCED"
        needs_balancing = True
    
    print(f"Balance assessment: {balance_status}")
    
    if needs_balancing:
        print(f"\nApplying class balancing techniques...")
        
        # Prepare features for balancing (exclude target and non-predictive columns)
        feature_columns = behavioral_numeric_features + engineered_features + binary_features_created
        feature_columns = [col for col in feature_columns if col in df.columns and col != target_column]
        
        X = df[feature_columns]
        y = df[target_column]
        
        print(f"Using {len(feature_columns)} features for balancing: {feature_columns[:5]}{'...' if len(feature_columns) > 5 else ''}")
        
        # Method 1: Random Undersampling
        try:
            undersampler = RandomUnderSampler(random_state=42)
            X_under, y_under = undersampler.fit_resample(X, y)
            
            # Reconstruct dataframe
            df_undersampled = pd.DataFrame(X_under, columns=feature_columns)
            df_undersampled[target_column] = y_under
            
            # Add back any other columns not used in balancing
            other_cols = [col for col in df.columns if col not in feature_columns and col != target_column]
            for col in other_cols:
                if col in df.columns:
                    # Sample the other columns based on the undersampled indices
                    df_undersampled[col] = df[col].iloc[undersampler.sample_indices_].reset_index(drop=True)
            
            print(f"  Undersampling: {len(df)} → {len(df_undersampled)} samples")
            
        except Exception as e:
            print(f"  Undersampling failed: {e}")
            df_undersampled = df.copy()
        
        # Method 2: Random Oversampling
        try:
            oversampler = RandomOverSampler(random_state=42)
            X_over, y_over = oversampler.fit_resample(X, y)
            
            # Reconstruct dataframe
            df_oversampled = pd.DataFrame(X_over, columns=feature_columns)
            df_oversampled[target_column] = y_over
            
            # For oversampled data, we need to handle the additional columns differently
            # since we have more rows than original data
            other_cols = [col for col in df.columns if col not in feature_columns and col != target_column]
            for col in other_cols:
                if col in df.columns:
                    # Replicate other columns based on oversampling pattern
                    original_values = df[col].values
                    oversampled_values = original_values[oversampler.sample_indices_]
                    df_oversampled[col] = oversampled_values
            
            print(f"  Oversampling: {len(df)} → {len(df_oversampled)} samples")
            
        except Exception as e:
            print(f"  Oversampling failed: {e}")
            df_oversampled = df.copy()
        
        # Method 3: SMOTE (if possible with enough samples)
        try:
            if len(df) >= 12:  # SMOTE needs at least 6 samples per class typically
                smote = SMOTE(random_state=42, k_neighbors=min(5, min_class-1))
                X_smote, y_smote = smote.fit_resample(X, y)
                
                # Reconstruct dataframe
                df_smote = pd.DataFrame(X_smote, columns=feature_columns)
                df_smote[target_column] = y_smote
                
                print(f"  SMOTE: {len(df)} → {len(df_smote)} samples")
            else:
                df_smote = df_oversampled.copy()
                print(f"  SMOTE skipped: Dataset too small")
                
        except Exception as e:
            print(f"  SMOTE failed: {e}")
            df_smote = df_oversampled.copy()
        
        # Verify balanced distributions
        print(f"\nBalanced datasets created:")
        print(f"- Original dataset: {df.shape[0]:,} samples")
        if 'df_undersampled' in locals():
            under_dist = df_undersampled[target_column].value_counts()
            print(f"- Undersampled dataset: {df_undersampled.shape[0]:,} samples, "
                  f"balance: {under_dist.iloc[0]}:{under_dist.iloc[1]}")
        if 'df_oversampled' in locals():
            over_dist = df_oversampled[target_column].value_counts()
            print(f"- Oversampled dataset: {df_oversampled.shape[0]:,} samples, "
                  f"balance: {over_dist.iloc[0]}:{over_dist.iloc[1]}")
        
    else:
        print("Dataset is well balanced. No balancing techniques applied.")
        df_undersampled = df.copy()
        df_oversampled = df.copy()
        if 'df_smote' not in locals():
            df_smote = df.copy()

else:
    print("ERROR: Cannot proceed without target column")
    print(f"Target column found: {target_column}")



STEP 7: CLASS BALANCING ANALYSIS
Current class distribution:
  Extrovert: 1,417 samples (56.4%)
  Introvert: 1,095 samples (43.6%)

Class imbalance ratio: 1.29:1
Balance assessment: WELL BALANCED
Dataset is well balanced. No balancing techniques applied.


## 8. Data Export and Final Processing


In [26]:
# Export cleaned datasets to the main data directory
print("\n" + "="*60)
print("STEP 8: DATA EXPORT AND FINAL PROCESSING")
print("="*60)

# Always use the main project data directory (never create local ones)
processed_dir = "../data/processed"

# Verify the main data directory exists
if os.path.exists(processed_dir):
    print(f"Using main project data directory: {processed_dir}")
    
    # Export original cleaned dataset
    if target_column and target_column in df.columns:
        output_path = os.path.join(processed_dir, "personality_dataset_cleaned.csv")
        df.to_csv(output_path, index=False, encoding='utf-8')
        print(f"Saved cleaned dataset to: {output_path}")
        print(f"  Shape: {df.shape}")
        
        # Export undersampled dataset if available
        if 'df_undersampled' in locals() and len(df_undersampled) != len(df):
            output_path = os.path.join(processed_dir, "personality_dataset_undersampled.csv")
            df_undersampled.to_csv(output_path, index=False, encoding='utf-8')
            print(f"Saved undersampled dataset to: {output_path}")
            print(f"  Shape: {df_undersampled.shape}")
        
        # Export oversampled dataset if available
        if 'df_oversampled' in locals() and len(df_oversampled) != len(df):
            output_path = os.path.join(processed_dir, "personality_dataset_oversampled.csv")
            df_oversampled.to_csv(output_path, index=False, encoding='utf-8')
            print(f"Saved oversampled dataset to: {output_path}")
            print(f"  Shape: {df_oversampled.shape}")
        
        # Export SMOTE dataset if available
        if 'df_smote' in locals() and len(df_smote) != len(df):
            output_path = os.path.join(processed_dir, "personality_dataset_smote.csv")
            df_smote.to_csv(output_path, index=False, encoding='utf-8')
            print(f"Saved SMOTE dataset to: {output_path}")
            print(f"  Shape: {df_smote.shape}")
        
        # Create feature metadata for modeling
        all_features = behavioral_numeric_features + engineered_features + binary_features_created
        feature_metadata = []
        
        for feature in all_features:
            if feature in df.columns:
                feature_info = {
                    'feature_name': feature,
                    'feature_type': 'behavioral' if feature in behavioral_numeric_features else 
                                  'engineered' if feature in engineered_features else 'binary',
                    'data_type': str(df[feature].dtype),
                    'min_value': float(df[feature].min()) if pd.api.types.is_numeric_dtype(df[feature]) else None,
                    'max_value': float(df[feature].max()) if pd.api.types.is_numeric_dtype(df[feature]) else None,
                    'mean_value': float(df[feature].mean()) if pd.api.types.is_numeric_dtype(df[feature]) else None,
                    'std_value': float(df[feature].std()) if pd.api.types.is_numeric_dtype(df[feature]) else None,
                    'missing_values': int(df[feature].isnull().sum()),
                    'unique_values': int(df[feature].nunique())
                }
                feature_metadata.append(feature_info)
        
        # Save feature metadata
        feature_metadata_df = pd.DataFrame(feature_metadata)
        metadata_path = os.path.join(processed_dir, "feature_metadata.csv")
        feature_metadata_df.to_csv(metadata_path, index=False, encoding='utf-8')
        print(f"Saved feature metadata to: {metadata_path}")
        
        # Create processing summary
        processing_summary = {
            'original_dataset_path': str(processed_path if processed_path.exists() else raw_path),
            'target_column': str(target_column) if target_column else None,
            'behavioral_features': [str(f) for f in behavioral_numeric_features],
            'psychological_features': [str(f) for f in psychological_features],
            'engineered_features': [str(f) for f in engineered_features],
            'binary_features': [str(f) for f in binary_features_created],
            'original_samples': int(len(df)),
            'final_samples': int(len(df)),
            'missing_values_handled': bool(original_missing > 0),
            'outliers_detected': int(sum([info['count'] for info in outlier_info.values()]) if 'outlier_info' in locals() else 0),
            'class_balance_ratio': float(imbalance_ratio) if 'imbalance_ratio' in locals() else 1.0,
            'balancing_applied': bool(needs_balancing) if 'needs_balancing' in locals() else False,
            'processing_steps': [
                'duplicate_removal',
                'categorical_standardization',
                'range_validation',
                'missing_value_imputation',
                'feature_engineering',
                'class_balancing' if 'needs_balancing' in locals() and needs_balancing else 'no_balancing_needed'
            ]
        }
        
        import json
        summary_path = os.path.join(processed_dir, "data_preparation_summary.json")
        with open(summary_path, 'w', encoding='utf-8') as f:
            json.dump(processing_summary, f, indent=2, ensure_ascii=False)
        print(f"Saved processing summary to: {summary_path}")
        
        print(f"\nDATA PREPARATION SUMMARY:")
        print(f"Target column: '{target_column}'")
        print(f"Behavioral features: {len(behavioral_numeric_features)}")
        print(f"Engineered features: {len(engineered_features)}")
        print(f"Binary features: {len(binary_features_created)}")
        print(f"Total modeling features: {len(all_features)}")
        
        print(f"\nDatasets exported:")
        print(f"- personality_dataset_cleaned.csv: Main processed dataset")
        if 'df_undersampled' in locals() and len(df_undersampled) != len(df):
            print(f"- personality_dataset_undersampled.csv: Balanced via undersampling")
        if 'df_oversampled' in locals() and len(df_oversampled) != len(df):
            print(f"- personality_dataset_oversampled.csv: Balanced via oversampling")
        if 'df_smote' in locals() and len(df_smote) != len(df):
            print(f"- personality_dataset_smote.csv: Balanced via SMOTE")
        print(f"- feature_metadata.csv: Feature information for modeling")
        print(f"- data_preparation_summary.json: Processing metadata")
        
    else:
        print("ERROR: Missing required target column for export")
        
else:
    print(f"ERROR: Main data directory not found at {processed_dir}")
    print("Please ensure you're running from the notebooks/ folder")



STEP 8: DATA EXPORT AND FINAL PROCESSING
Using main project data directory: ../data/processed
Saved cleaned dataset to: ../data/processed\personality_dataset_cleaned.csv
  Shape: (2512, 16)
Saved feature metadata to: ../data/processed\feature_metadata.csv
Saved processing summary to: ../data/processed\data_preparation_summary.json

DATA PREPARATION SUMMARY:
Target column: 'Personality'
Behavioral features: 5
Engineered features: 5
Binary features: 3
Total modeling features: 13

Datasets exported:
- personality_dataset_cleaned.csv: Main processed dataset
- feature_metadata.csv: Feature information for modeling
- data_preparation_summary.json: Processing metadata


## Summary

This notebook successfully completed the data preparation phase for extrovert-introvert personality classification:

**Step 1: Library Import**
- Imported all required libraries for psychological data preprocessing
- Configured statistical analysis and machine learning tools
- Set up data imputation, scaling, and class balancing methods

**Step 2: Data Loading**
- Loaded personality dataset with robust encoding handling
- Supported both processed and raw data sources
- Validated data integrity and basic structure

**Step 3: Data Validation and Feature Identification**
- Automatically identified target and feature columns
- Categorized behavioral vs psychological features
- Performed comprehensive data quality assessment
- Analyzed class balance and distribution

**Step 4: Data Cleaning Functions**
- Defined domain-specific missing value imputation strategies
- Implemented outlier detection using IQR and Z-score methods
- Created feature range validation for 0-10 behavioral scales
- Built categorical feature standardization functions

**Step 5: Data Cleaning Pipeline**
- Removed duplicate records
- Standardized categorical feature values
- Validated and corrected feature ranges (0-10 scale)
- Applied domain-specific missing value handling
- Detected and analyzed outliers in behavioral data

**Step 6: Feature Engineering**
- Created composite personality scores (Social Activity, Introversion)
- Engineered behavioral indicators (Social Comfort, Energy Balance)
- Generated binary feature encodings for modeling
- Built domain-specific features based on personality psychology

**Step 7: Class Balancing**
- Analyzed personality type distribution and imbalance
- Applied multiple balancing techniques (undersampling, oversampling, SMOTE)
- Created balanced datasets for different modeling approaches
- Maintained data integrity across all transformations

**Step 8: Data Export**
- Exported cleaned datasets to main data directory
- Created multiple dataset versions (original, undersampled, oversampled, SMOTE)
- Generated comprehensive feature metadata for modeling
- Saved processing summary with all transformation details

### Next Steps:
- Proceed to **Step 3: Data Exploration** for in-depth analysis
- Use cleaned datasets for personality classification model training
- Leverage engineered features for improved model performance
- All processed data available in `data/processed/` directory

### Key Outputs:
- `personality_dataset_cleaned.csv`: Main processed dataset ready for modeling
- `personality_dataset_undersampled.csv`: Balanced via undersampling (if needed)
- `personality_dataset_oversampled.csv`: Balanced via oversampling (if needed)
- `personality_dataset_smote.csv`: Balanced via SMOTE synthetic generation (if needed)
- `feature_metadata.csv`: Detailed feature information for model development
- `data_preparation_summary.json`: Complete processing pipeline documentation

The dataset is now clean, feature-engineered, and ready for machine learning model development with multiple balancing options available based on modeling requirements.
