# 02 - Data Preprocessing for Clustering

## Overview
This notebook prepares the data for clustering analysis by handling missing values, feature selection, scaling, and dimensionality reduction.

**Prerequisites:**
- Complete `01_Data_Exploration_EDA.ipynb` first
- Dataset: `../Data/eda_complete_dataset.csv`

**Objectives:**
- Feature engineering and selection
- Handle missing values and decide how to treat outliers
- Scale and normalize features
- Apply dimensionality reduction techniques
- Prepare data for clustering algorithms

**Outputs:**
- Preprocessed dataset ready for clustering
- Scaled features
- PCA and t-SNE transformed data
- Feature importance analysis

## 1. Library Imports and Setup

In [1]:
# Standard Data Manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Preprocessing and Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import PowerTransformer, LabelEncoder, OneHotEncoder

# Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Feature Selection
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest, f_classif

# Outlier Detection
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

# Utilities
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Preprocessing libraries imported successfully!")

Preprocessing libraries imported successfully!


## 2. Load Preprocessed Data from EDA

In [2]:
# Load the dataset from EDA notebook
try:
    data = pd.read_csv('../Data/eda_complete_dataset.csv')
    print(f"Dataset loaded successfully: {data.shape}")
except FileNotFoundError:
    print("EDA dataset not found. Please run 01_Data_Exploration_EDA.ipynb first.")
    # Fallback to original dataset
    data = pd.read_csv('../Data/company_esg_financial_dataset.csv')
    print(f"Using original dataset: {data.shape}")

print(f"Columns: {list(data.columns)}")
data.head()

Dataset loaded successfully: (11000, 16)
Columns: ['CompanyID', 'CompanyName', 'Industry', 'Region', 'Year', 'Revenue', 'ProfitMargin', 'MarketCap', 'GrowthRate', 'ESG_Overall', 'ESG_Environmental', 'ESG_Social', 'ESG_Governance', 'CarbonEmissions', 'WaterUsage', 'EnergyConsumption']


|   | CompanyID | CompanyName | Industry | Region | Year | Revenue | ProfitMargin | MarketCap | GrowthRate | ESG_Overall | ESG_Environmental | ESG_Social | ESG_Governance | CarbonEmissions | WaterUsage | EnergyConsumption |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Company_1 | Retail | Latin America | 2015 | 459.2 | 6.0 | 337.5 | NaN | 57.0 | 60.7 | 33.5 | 76.8 | 35577.4 | 17788.7 | 71154.7 |
| 1 | 1 | Company_1 | Retail | Latin America | 2016 | 473.8 | 4.6 | 366.6 | 3.2 | 56.7 | 58.9 | 32.8 | 78.5 | 37314.7 | 18657.4 | 74629.4 |
| 2 | 1 | Company_1 | Retail | Latin America | 2017 | 564.9 | 5.2 | 313.4 | 19.2 | 56.5 | 57.6 | 34.0 | 77.8 | 45006.4 | 22503.2 | 90012.9 |
| 3 | 1 | Company_1 | Retail | Latin America | 2018 | 558.4 | 4.3 | 283.0 | -1.1 | 58.0 | 62.3 | 33.4 | 78.3 | 42650.1 | 21325.1 | 85300.2 |
| 4 | 1 | Company_1 | Retail | Latin America | 2019 | 554.5 | 4.9 | 538.1 | -0.7 | 56.6 | 63.7 | 30.0 | 76.1 | 41799.4 | 20899.7 | 83598.8 |


## 3. Feature Engineering and Selection

In [3]:
# Define feature categories
financial_features = ['Revenue', 'ProfitMargin', 'MarketCap', 'GrowthRate']
esg_features = ['ESG_Overall', 'ESG_Environmental', 'ESG_Social', 'ESG_Governance']
environmental_features = ['CarbonEmissions', 'WaterUsage', 'EnergyConsumption']
categorical_features = ['Industry', 'Region']
identifier_features = ['CompanyID', 'CompanyName', 'Year']

# Combine numerical features for clustering
numerical_features = financial_features + esg_features + environmental_features

print(f"Financial features: {financial_features}")
print(f"ESG features: {esg_features}")
print(f"Environmental features: {environmental_features}")
print(f"Categorical features: {categorical_features}")
print(f"\nTotal numerical features for clustering: {len(numerical_features)}")

Financial features: ['Revenue', 'ProfitMargin', 'MarketCap', 'GrowthRate']
ESG features: ['ESG_Overall', 'ESG_Environmental', 'ESG_Social', 'ESG_Governance']
Environmental features: ['CarbonEmissions', 'WaterUsage', 'EnergyConsumption']
Categorical features: ['Industry', 'Region']

Total numerical features for clustering: 11


## 4. Handle Missing Values and Outliers

In [4]:
# Temporal imputation for missing values in numerical features
# Fill missing values using forward fill, then backward fill within each company (or entity) over time

# Sort data by identifier and time
data.sort_values(by=['CompanyID', 'Year'], inplace=True)

missing_check = data[numerical_features].isnull().sum()
print("Missing values in numerical features:")
print(missing_check[missing_check > 0])

if missing_check.sum() > 0:
    for feature in numerical_features:
        if data[feature].isnull().sum() > 0:
            # Forward fill, then backward fill, both restricted to each company's own rows.
            # (Chaining .bfill() after a grouped .ffill() would fill across company boundaries.)
            data[feature] = data.groupby('CompanyID')[feature].transform(
                lambda s: s.ffill().bfill()
            )
            remaining_count = data[feature].isnull().sum()
            if remaining_count == 0:
                print(f"Filled {feature} missing values using temporal imputation (ffill/bfill).")
            else:
                print(f"{remaining_count} missing values remain in {feature} after temporal imputation.")
else:
    print("No missing values found in numerical features.")




Missing values in numerical features:
GrowthRate    1000
dtype: int64
Filled GrowthRate missing values using temporal imputation (ffill/bfill).
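The difference between a group-wise fill and a chained fill only shows up when a company has no observed value at all, so it is easy to miss. The sketch below uses a small hypothetical panel (the values and the all-missing company are invented for illustration) to show that `.groupby(...).ffill().bfill()` lets the backward fill borrow from the next company, while `transform` keeps both fills inside each company.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: company 2 has no observed GrowthRate at all
toy = pd.DataFrame({
    "CompanyID": [1, 1, 2, 2, 3, 3],
    "Year": [2015, 2016, 2015, 2016, 2015, 2016],
    "GrowthRate": [np.nan, 3.2, np.nan, np.nan, 5.0, 7.5],
}).sort_values(["CompanyID", "Year"])

# Chained call: ffill runs per company, but bfill runs over the whole column,
# so company 2 silently inherits 5.0 from company 3
chained = toy.groupby("CompanyID")["GrowthRate"].ffill().bfill()

# Group-wise call: both fills stay inside each company's own time series,
# so company 2 remains missing and can be handled explicitly
grouped = toy.groupby("CompanyID")["GrowthRate"].transform(lambda s: s.ffill().bfill())

print(pd.DataFrame({"chained": chained, "grouped": grouped}))
```

When every company has at least one observed value, both variants give the same result; the group-wise form simply makes the intended behaviour explicit.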


### 4.1 Strategic Recommendation: Retain All Data Points Without Outlier Removal

**Why We Avoid Outlier Detection for Clustering:**

- **Business Insights in "Outliers":**  
    In ESG and financial datasets, statistical "outliers" often represent unique companies, market leaders, disruptors, or emerging segments. Removing these points risks losing critical business intelligence.

- **Legitimacy of Extreme Values:**  
    Extreme values are expected across diverse industries, regions, and market segments. For example, tech giants, energy firms, or companies in high-growth markets naturally exhibit values far from the median.

- **Statistical Issues Already Addressed:**  
    Our preprocessing pipeline applies a Yeo-Johnson power transformation followed by standard scaling, correcting for skewness and differences in scale without discarding data. This keeps distance-based clustering from being dominated by a few extreme values.

- **Data Evidence:**  
    Exploratory analysis shows that statistical outliers correspond to meaningful business groups (e.g., top revenue companies, ESG leaders, or regional specialists), not errors or noise.

- **Maximum Information Preservation:**  
    By retaining all data points, we maximize the diversity and richness of clusters. Our approach ensures that clustering reflects the full spectrum of business realities, not just the statistical average.

**Conclusion:**  
Outlier removal is not recommended for this business context. Our pipeline preserves all information, corrects statistical challenges, and enables clustering to reveal actionable, real-world business segments.
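To make the cost of removal concrete, one option is to count how many rows a conventional rule would drop and to flag them instead. The following is a minimal sketch on synthetic right-skewed data; the helper `iqr_outlier_mask`, the 1.5 multiplier, and the column values are assumptions for illustration, not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, heavily right-skewed columns standing in for real features
toy = pd.DataFrame({
    "Revenue": rng.lognormal(mean=6.0, sigma=1.5, size=2000),
    "CarbonEmissions": rng.lognormal(mean=10.0, sigma=2.0, size=2000),
})

def iqr_outlier_mask(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return a boolean frame marking values outside the k * IQR fences."""
    q1, q3 = frame.quantile(0.25), frame.quantile(0.75)
    iqr = q3 - q1
    return (frame < q1 - k * iqr) | (frame > q3 + k * iqr)

mask = iqr_outlier_mask(toy)
rows_flagged = mask.any(axis=1)

print("Share flagged per feature:")
print(mask.mean().round(3))
print(f"Rows an IQR filter would drop: {rows_flagged.sum()} of {len(toy)}")

# Keep every row and carry the flag along for later cluster interpretation
toy["any_iqr_outlier"] = rows_flagged
```

Carrying the flag as a column keeps the information available when interpreting clusters, without changing which rows enter the clustering.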

### 4.2 Categorical Variable Encoding Implementation

**Decision: Implement One-Hot Encoding for Industry and Region**

Based on business analysis, we one-hot encode the categorical variables so that clustering can use them:
- **Industry patterns**: Technology, Healthcare, and Finance have distinct ESG and financial profiles
- **Regional differences**: Europe, North America, and Asia have different regulatory environments and ESG standards
- **Business context**: Cross-industry and regional patterns should make the resulting clusters easier to interpret

In [5]:
# CATEGORICAL VARIABLE ENCODING IMPLEMENTATION
print("IMPLEMENTING ONE-HOT ENCODING FOR CATEGORICAL VARIABLES")
print("="*60)

# Guard: this cell needs `data` from the loading cell in Section 2
if 'data' not in globals():
    raise RuntimeError("`data` is not defined. Run the data-loading cell in Section 2 first.")

print("1. ANALYZING CATEGORICAL VARIABLES:")
print(f"   • Industry categories: {data['Industry'].nunique()}")
print(f"   • Region categories: {data['Region'].nunique()}")
print(f"   • Industries: {sorted(data['Industry'].unique())}")
print(f"   • Regions: {sorted(data['Region'].unique())}")

# Create one-hot encoder (drop='first' to avoid multicollinearity)
categorical_encoder = OneHotEncoder(drop='first', sparse_output=False)

# Fit and transform categorical features
categorical_data = data[categorical_features]
categorical_encoded = categorical_encoder.fit_transform(categorical_data)

# Get feature names for encoded variables
encoded_feature_names = categorical_encoder.get_feature_names_out(categorical_features)
print(f"\n2. ONE-HOT ENCODING RESULTS:")
print(f"   • Original categorical features: {len(categorical_features)}")
print(f"   • Encoded features created: {len(encoded_feature_names)}")
print(f"   • Encoded feature names: {list(encoded_feature_names)}")

# Create DataFrame with encoded features
categorical_encoded_df = pd.DataFrame(
    categorical_encoded, 
    columns=encoded_feature_names,
    index=data.index
)

print(f"\n   ✅ Categorical encoding shape: {categorical_encoded_df.shape}")
print(f"   ✅ Sample of encoded features:")
print(categorical_encoded_df.head(3))

IMPLEMENTING ONE-HOT ENCODING FOR CATEGORICAL VARIABLES
1. ANALYZING CATEGORICAL VARIABLES:
   • Industry categories: 9
   • Region categories: 7
   • Industries: ['Consumer Goods', 'Energy', 'Finance', 'Healthcare', 'Manufacturing', 'Retail', 'Technology', 'Transportation', 'Utilities']
   • Regions: ['Africa', 'Asia', 'Europe', 'Latin America', 'Middle East', 'North America', 'Oceania']

2. ONE-HOT ENCODING RESULTS:
   • Original categorical features: 2
   • Encoded features created: 14
   • Encoded feature names: ['Industry_Energy', 'Industry_Finance', 'Industry_Healthcare', 'Industry_Manufacturing', 'Industry_Retail', 'Industry_Technology', 'Industry_Transportation', 'Industry_Utilities', 'Region_Asia', 'Region_Europe', 'Region_Latin America', 'Region_Middle East', 'Region_North America', 'Region_Oceania']

   ✅ Categorical encoding shape: (11000, 14)
   ✅ Sample of encoded features:
   Industry_Energy  Industry_Finance  Industry_Healthcare  \
0              0.0               0.0  

In [6]:
# COMBINE NUMERICAL AND CATEGORICAL FEATURES
print("\n3. COMBINING NUMERICAL AND CATEGORICAL FEATURES:")

# Reuse X if an earlier cell already defined it; otherwise build it from the imputed numerical columns
if 'X' not in locals():
    X = data[numerical_features].copy()
    print("   • Created numerical features matrix")

# Combine numerical and encoded categorical features
X_combined = pd.concat([
    X,  # Original numerical features (11 features)
    categorical_encoded_df  # Encoded categorical features
], axis=1)

print(f"   • Original numerical features: {X.shape[1]}")
print(f"   • Encoded categorical features: {categorical_encoded_df.shape[1]}")
print(f"   • Combined feature set: {X_combined.shape[1]} features")
print(f"   • Combined dataset shape: {X_combined.shape}")

# Update feature lists for enhanced clustering
combined_features = list(X.columns) + list(encoded_feature_names)
print(f"   • Total features for enhanced clustering: {len(combined_features)}")

print(f"\n4. BUSINESS CONTEXT PRESERVATION:")
print("   Sample companies with their categorical encodings:")
for idx in range(3):
    company_name = data.iloc[idx]['CompanyName']
    industry = data.iloc[idx]['Industry'] 
    region = data.iloc[idx]['Region']
    print(f"   • {company_name} ({industry}, {region})")
    
    # Show which categorical features are active (value = 1)
    active_features = []
    for feature in encoded_feature_names:
        if categorical_encoded_df.iloc[idx][feature] == 1:
            active_features.append(feature)
    print(f"     Active encodings: {active_features}")
    print()

print(f"✅ ENHANCED FEATURE SET READY FOR PREPROCESSING")
print(f"   📊 {X_combined.shape[1]} total features preserve business context")


3. COMBINING NUMERICAL AND CATEGORICAL FEATURES:
   • Created numerical features matrix
   • Original numerical features: 11
   • Encoded categorical features: 14
   • Combined feature set: 25 features
   • Combined dataset shape: (11000, 25)
   • Total features for enhanced clustering: 25

4. BUSINESS CONTEXT PRESERVATION:
   Sample companies with their categorical encodings:
   • Company_1 (Retail, Latin America)
     Active encodings: ['Industry_Retail', 'Region_Latin America']

   • Company_1 (Retail, Latin America)
     Active encodings: ['Industry_Retail', 'Region_Latin America']

   • Company_1 (Retail, Latin America)
     Active encodings: ['Industry_Retail', 'Region_Latin America']

✅ ENHANCED FEATURE SET READY FOR PREPROCESSING
   📊 25 total features preserve business context


In [7]:
# ENHANCED PREPROCESSING PIPELINE FOR COMBINED DATASET
print("\n" + "="*60)
print("ENHANCED PREPROCESSING: NUMERICAL + CATEGORICAL FEATURES")
print("="*60)

print("5. POWER TRANSFORMATION FOR COMBINED DATASET:")
# Apply power transformation only to numerical features 
# (categorical encoded features are already 0/1 and don't need transformation)
X_combined_transformed = X_combined.copy()

# Apply power transformation only to original numerical features
print("   Applying power transformation to numerical features...")
if 'power_transformer' not in locals():
    from sklearn.preprocessing import PowerTransformer
    power_transformer = PowerTransformer(method='yeo-johnson', standardize=False)
    
# Transform only numerical features
numerical_indices = [X_combined.columns.get_loc(col) for col in numerical_features if col in X_combined.columns]
X_combined_transformed.iloc[:, numerical_indices] = power_transformer.fit_transform(X_combined.iloc[:, numerical_indices])

print(f"   ✅ Power transformation applied to {len(numerical_features)} numerical features")
print(f"   ✅ Categorical encoded features kept as binary (0/1)")

print("\n6. SCALING COMBINED DATASET:")
# Scale the combined dataset.
# Note: StandardScaler also rescales the one-hot columns, so they are no longer 0/1 afterwards.
from sklearn.preprocessing import StandardScaler
combined_scaler = StandardScaler()
X_combined_scaled = combined_scaler.fit_transform(X_combined_transformed)
X_combined_scaled_df = pd.DataFrame(
    X_combined_scaled, 
    columns=combined_features, 
    index=X_combined.index
)

print(f"   ✅ StandardScaler applied to all {len(combined_features)} features")
print(f"   📊 Final combined scaled dataset shape: {X_combined_scaled_df.shape}")

print("\n7. ENHANCED APPROACH SUMMARY:")
print(f"   🔍 Original approach: {len(numerical_features)} numerical features only")  
print(f"   🚀 Enhanced approach: {X_combined_scaled_df.shape[1]} features (numerical + categorical)")
print(f"   📈 Business context preserved through Industry/Region encoding")
print(f"   ⚡ Ready for enhanced clustering with richer feature representation")

# Show sample of scaled combined features
print(f"\n8. SAMPLE OF ENHANCED FEATURES:")
print("   First 3 companies with combined scaled features:")
sample_features = ['Revenue', 'ESG_Overall'] + list(encoded_feature_names[:3])
available_features = [f for f in sample_features if f in X_combined_scaled_df.columns]
print(X_combined_scaled_df[available_features].head(3))


ENHANCED PREPROCESSING: NUMERICAL + CATEGORICAL FEATURES
5. POWER TRANSFORMATION FOR COMBINED DATASET:
   Applying power transformation to numerical features...
   ✅ Power transformation applied to 11 numerical features
   ✅ Categorical encoded features kept as binary (0/1)

6. SCALING COMBINED DATASET:
   ✅ StandardScaler applied to all 25 features
   📊 Final combined scaled dataset shape: (11000, 25)

7. ENHANCED APPROACH SUMMARY:
   🔍 Original approach: 11 numerical features only
   🚀 Enhanced approach: 25 features (numerical + categorical)
   📈 Business context preserved through Industry/Region encoding
   ⚡ Ready for enhanced clustering with richer feature representation

8. SAMPLE OF ENHANCED FEATURES:
   First 3 companies with combined scaled features:
    Revenue  ESG_Overall  Industry_Energy  Industry_Finance  \
0 -1.445479     0.149811         -0.34796         -0.356925   
1 -1.410825     0.130932         -0.34796         -0.356925   
2 -1.218985     0.118347         -0.3479
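The sample above shows values such as -0.34796 in the one-hot columns, because `StandardScaler` was applied to every column. If the dummies should stay binary, one possible approach is to scale only the numerical columns with a `ColumnTransformer`. This is a sketch on hypothetical values, and whether to scale dummies is a modelling choice rather than a fixed rule.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical mini feature matrix: two numerical columns and one dummy
toy = pd.DataFrame({
    "Revenue": [459.2, 473.8, 564.9, 9000.0],
    "ESG_Overall": [57.0, 56.7, 56.5, 80.0],
    "Industry_Energy": [0.0, 0.0, 0.0, 1.0],
})
numeric_cols = ["Revenue", "ESG_Overall"]

# Scaling everything: the dummy column is centred and rescaled too
scaled_all = StandardScaler().fit_transform(toy)

# Scaling numerical columns only; remaining columns pass through unchanged
selective = ColumnTransformer(
    [("num", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)
scaled_selective = selective.fit_transform(toy)

print("Dummy after scaling all columns:    ", scaled_all[:, 2].round(3))
print("Dummy after scaling numerical only: ", scaled_selective[:, 2])
```

Scaled dummies carry a weight that depends on category frequency (rare categories get larger magnitudes), whereas unscaled dummies contribute a fixed 0/1 difference; either can be defensible depending on how much influence Industry and Region should have on distances.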

In [8]:
# ENHANCED PCA ANALYSIS WITH CATEGORICAL FEATURES
print("\n" + "="*60)
print("ENHANCED PCA ANALYSIS")
print("="*60)

print("9. APPLYING PCA TO ENHANCED FEATURE SET:")
# Apply PCA to the combined scaled dataset
from sklearn.decomposition import PCA
pca_enhanced = PCA()
X_enhanced_pca = pca_enhanced.fit_transform(X_combined_scaled_df)

# Calculate cumulative explained variance
cumsum_var_enhanced = np.cumsum(pca_enhanced.explained_variance_ratio_)

# Find number of components for 95% variance
n_components_95_enhanced = np.argmax(cumsum_var_enhanced >= 0.95) + 1

print(f"   • Enhanced features: {X_combined_scaled_df.shape[1]}")
print(f"   • Components for 95% variance: {n_components_95_enhanced}")
print(f"   • Variance explained by first 3 components: {cumsum_var_enhanced[2]:.3f}")

# Create optimal PCA dataframe
pca_enhanced_optimal = PCA(n_components=n_components_95_enhanced)
X_enhanced_pca_optimal = pca_enhanced_optimal.fit_transform(X_combined_scaled_df)

pca_enhanced_columns = [f'PCA_Enhanced_{i+1}' for i in range(n_components_95_enhanced)]
X_enhanced_pca_df = pd.DataFrame(
    X_enhanced_pca_optimal, 
    columns=pca_enhanced_columns, 
    index=X_combined_scaled_df.index
)

print(f"   ✅ Enhanced PCA dataset shape: {X_enhanced_pca_df.shape}")

# Compare with original numerical-only approach (if available)
print(f"\n10. APPROACH COMPARISON:")
if 'n_components_95' in locals():
    print(f"   Original (numerical-only):")
    print(f"   • Features: {len(numerical_features)}")
    print(f"   • PCA components for 95% variance: {n_components_95}")
    
print(f"   Enhanced (with categorical encoding):")
print(f"   • Features: {X_combined_scaled_df.shape[1]}")
print(f"   • PCA components for 95% variance: {n_components_95_enhanced}")

print(f"\n11. TOP PCA COMPONENTS (ENHANCED APPROACH):")
for i in range(min(5, n_components_95_enhanced)):
    variance_pct = pca_enhanced.explained_variance_ratio_[i] * 100
    print(f"   • PC{i+1}: {variance_pct:.1f}% of variance")

print(f"\n🎯 ENHANCED PCA READY FOR CLUSTERING!")
print(f"✅ Business context preserved in dimensionality reduction")
print(f"✅ {n_components_95_enhanced} optimal components capture industry/regional patterns")


ENHANCED PCA ANALYSIS
9. APPLYING PCA TO ENHANCED FEATURE SET:
   • Enhanced features: 25
   • Components for 95% variance: 17
   • Variance explained by first 3 components: 0.403
   ✅ Enhanced PCA dataset shape: (11000, 17)

10. APPROACH COMPARISON:
   Enhanced (with categorical encoding):
   • Features: 25
   • PCA components for 95% variance: 17

11. TOP PCA COMPONENTS (ENHANCED APPROACH):
   • PC1: 16.8% of variance
   • PC2: 13.4% of variance
   • PC3: 10.1% of variance
   • PC4: 5.1% of variance
   • PC5: 5.0% of variance

🎯 ENHANCED PCA READY FOR CLUSTERING!
✅ Business context preserved in dimensionality reduction
✅ 17 optimal components capture industry/regional patterns
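The two-step pattern used here (fit a full PCA, then find the first index where cumulative variance reaches 95%) can also be expressed by passing a fraction to `n_components`. The sketch below checks on synthetic data that both routes select the same number of components; the data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic correlated data: 10 observed features driven by 4 latent factors
latent = rng.normal(size=(500, 4))
X_toy = latent @ rng.normal(size=(4, 10)) + 0.1 * rng.normal(size=(500, 10))

# Route 1: full PCA, then the first component count reaching 95% cumulative variance
full = PCA().fit(X_toy)
cumulative = np.cumsum(full.explained_variance_ratio_)
n_manual = int(np.argmax(cumulative >= 0.95) + 1)

# Route 2: let scikit-learn pick the count from a variance fraction
auto = PCA(n_components=0.95, svd_solver="full").fit(X_toy)

print(f"Manual selection: {n_manual} components")
print(f"PCA(n_components=0.95): {auto.n_components_} components")
```

The manual route is still useful when the cumulative curve itself is needed, for example to plot a scree chart.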


In [9]:
# SAVE ENHANCED DATASETS WITH CATEGORICAL ENCODING
print("\n" + "="*60)
print("SAVING ENHANCED DATASETS")
print("="*60)

print("12. CREATING ENHANCED T-SNE:")
# Create t-SNE for enhanced approach
from sklearn.manifold import TSNE
tsne_enhanced = TSNE(n_components=2, random_state=42, perplexity=30)
X_enhanced_tsne = tsne_enhanced.fit_transform(X_enhanced_pca_optimal[:, :5])

X_enhanced_tsne_df = pd.DataFrame(
    X_enhanced_tsne, 
    columns=['t-SNE_Enhanced_1', 't-SNE_Enhanced_2'], 
    index=X_combined_scaled_df.index
)

print(f"   ✅ Enhanced t-SNE shape: {X_enhanced_tsne_df.shape}")

print("\n13. SAVING ENHANCED DATASETS:")
# Save enhanced preprocessing datasets
X_combined_scaled_df.to_csv('../Data/scaled_features_with_categorical.csv', index=False)
X_enhanced_pca_df.to_csv('../Data/pca_features_with_categorical.csv', index=False) 
X_enhanced_tsne_df.to_csv('../Data/tsne_features_with_categorical.csv', index=False)
categorical_encoded_df.to_csv('../Data/categorical_encoded_features.csv', index=False)

# Update the complete preprocessed dataset
if 'preprocessed_data' in locals():
    preprocessed_data_enhanced = preprocessed_data.copy()
else:
    preprocessed_data_enhanced = data.copy()

# Add enhanced scaled features
for col in X_combined_scaled_df.columns:
    preprocessed_data_enhanced[f'{col}_enhanced_scaled'] = X_combined_scaled_df[col]

# Add enhanced PCA features
for col in X_enhanced_pca_df.columns:
    preprocessed_data_enhanced[col] = X_enhanced_pca_df[col]

# Add enhanced t-SNE features
for col in X_enhanced_tsne_df.columns:
    preprocessed_data_enhanced[col] = X_enhanced_tsne_df[col]

# Save complete enhanced dataset
preprocessed_data_enhanced.to_csv('../Data/preprocessed_complete_dataset_enhanced.csv', index=False)

print("   ✅ Enhanced datasets saved:")
print("   - ../Data/scaled_features_with_categorical.csv")
print("   - ../Data/pca_features_with_categorical.csv") 
print("   - ../Data/tsne_features_with_categorical.csv")
print("   - ../Data/categorical_encoded_features.csv")
print("   - ../Data/preprocessed_complete_dataset_enhanced.csv")

# Save encoders and transformers for consistency
import pickle
with open('../Data/categorical_encoder.pkl', 'wb') as f:
    pickle.dump(categorical_encoder, f)
with open('../Data/combined_scaler.pkl', 'wb') as f:
    pickle.dump(combined_scaler, f)
    
print("   • Categorical encoder and combined scaler saved")

print(f"\n🎯 ENHANCED PREPROCESSING COMPLETE!")
print("="*65)
print(f"✅ DUAL APPROACH READY FOR CLUSTERING COMPARISON:")
print(f"   1. Original: {len(numerical_features)} numerical features only")
print(f"   2. Enhanced: {X_combined_scaled_df.shape[1]} features (numerical + categorical)")
print(f"✅ Business context preserved through Industry/Region encoding")
print(f"✅ Both approaches optimized for clustering algorithms")
print(f"📊 Ready for comprehensive clustering analysis!")


SAVING ENHANCED DATASETS
12. CREATING ENHANCED T-SNE:
   ✅ Enhanced t-SNE shape: (11000, 2)

13. SAVING ENHANCED DATASETS:
   ✅ Enhanced datasets saved:
   - ../Data/scaled_features_with_categorical.csv
   - ../Data/pca_features_with_categorical.csv
   - ../Data/tsne_features_with_categorical.csv
   - ../Data/categorical_encoded_features.csv
   - ../Data/preprocessed_complete_dataset_enhanced.csv
   • Categorical encoder and combined scaler saved

🎯 ENHANCED PREPROCESSING COMPLETE!
✅ DUAL APPROACH READY FOR CLUSTERING COMPARISON:
   1. Original: 11 numerical features only
   2. Enhanced: 25 features (numerical + categorical)
✅ Business context preserved through Industry/Region encoding
✅ Both approaches optimized for clustering algorithms
📊 Ready for comprehensive clustering analysis!
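Applying the same preprocessing to new rows later requires every fitted object in the chain, including the power transformer as well as the encoder and scaler. One way to keep them together is to pickle them as a single bundle. The sketch below round-trips such a bundle through a temporary file; the file name, dictionary keys, and synthetic data are assumptions for illustration.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(0)
X_train = rng.lognormal(size=(200, 3))  # synthetic stand-in for numerical features
X_new = rng.lognormal(size=(5, 3))      # rows arriving after the fit

# Fit the two steps in the same order as the notebook: power transform, then scale
power = PowerTransformer(method="yeo-johnson", standardize=False).fit(X_train)
scaler = StandardScaler().fit(power.transform(X_train))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "preprocessing_bundle.pkl"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump({"power": power, "scaler": scaler}, f)
    with open(path, "rb") as f:
        bundle = pickle.load(f)

# The restored objects reproduce the original transformation on unseen rows
expected = scaler.transform(power.transform(X_new))
restored = bundle["scaler"].transform(bundle["power"].transform(X_new))
print(np.allclose(expected, restored))
```

Pickled scikit-learn objects are only guaranteed to load under the same library version, so recording the version alongside the file is a sensible precaution.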


## 5. Feature Scaling and Normalization

### 5.1 Advanced Transformation for Skewed Features

Based on EDA analysis, 5 features show extreme skewness (|skew| >= 1.0):
- CarbonEmissions: 15.848
- EnergyConsumption: 15.654  
- WaterUsage: 14.386
- MarketCap: 8.884
- Revenue: 7.369

Standard scaling is a linear transformation: it changes location and scale but leaves skewness unchanged. We therefore apply a power transformation first, then scale.
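The claim about scaling and skewness can be checked directly. Skewness is invariant under $x \mapsto (x - \mu)/\sigma$ with $\sigma > 0$, so `StandardScaler` cannot reduce it, whereas a Yeo-Johnson transform can. The sketch below uses a synthetic lognormal sample as an assumed stand-in for a skewed feature such as Revenue.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(0)
# Synthetic right-skewed feature (single column)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(5000, 1))

skew_raw = skew(x[:, 0])
# Linear rescaling: skewness is unchanged
skew_scaled = skew(StandardScaler().fit_transform(x)[:, 0])
# Non-linear power transform: skewness is reduced
power = PowerTransformer(method="yeo-johnson", standardize=False)
skew_power = skew(power.fit_transform(x)[:, 0])

print(f"raw:            {skew_raw:.3f}")
print(f"StandardScaler: {skew_scaled:.3f}")
print(f"Yeo-Johnson:    {skew_power:.3f}")
```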

In [115]:
# Prepare the data for scaling
X = data[numerical_features].copy()

print("PREPROCESSING PIPELINE WITH SKEWNESS CORRECTION")
print("="*60)

# Step 1: Apply Power Transformation to handle skewness
from sklearn.preprocessing import PowerTransformer
from scipy.stats import skew

print("\n1. ANALYZING ORIGINAL SKEWNESS:")
original_skew = {}
for feature in numerical_features:
    skewness = skew(X[feature].dropna())
    original_skew[feature] = skewness
    if abs(skewness) >= 1.0:
        print(f"   • {feature}: {skewness:.3f} (HIGH SKEWNESS)")

# Apply Yeo-Johnson Power Transformation (handles positive and negative values)
print("\n2. APPLYING YEO-JOHNSON POWER TRANSFORMATION:")
power_transformer = PowerTransformer(method='yeo-johnson', standardize=False)
X_power_transformed = power_transformer.fit_transform(X)
X_power_df = pd.DataFrame(X_power_transformed, columns=numerical_features, index=X.index)

# Check skewness after power transformation
print("   Post-transformation skewness:")
for i, feature in enumerate(numerical_features):
    new_skewness = skew(X_power_transformed[:, i])
    improvement = abs(original_skew[feature]) - abs(new_skewness)
    if abs(original_skew[feature]) >= 1.0:
        print(f"   • {feature}: {new_skewness:.3f} (improved by {improvement:.3f})")

# Step 2: Apply different scaling methods to power-transformed data
print("\n3. APPLYING SCALING TO POWER-TRANSFORMED DATA:")
scalers = {
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler()
}

scaled_data = {}

for name, scaler in scalers.items():
    scaled_features = scaler.fit_transform(X_power_df)
    scaled_data[name] = pd.DataFrame(scaled_features, columns=numerical_features, index=X.index)
    print(f"   {name} applied to power-transformed data.")

# Use StandardScaler on power-transformed data as default
X_scaled = scaled_data['StandardScaler']

print(f"\n4. FINAL PREPROCESSING SUMMARY:")
print(f"   ✅ Power transformation applied to fix skewness")
print(f"   ✅ Standard scaling applied for clustering")
print(f"   📊 Final preprocessed data shape: {X_scaled.shape}")

# Verify final skewness
print(f"\n5. FINAL SKEWNESS VERIFICATION:")
final_highly_skewed = 0
for i, feature in enumerate(numerical_features):
    final_skewness = skew(X_scaled.iloc[:, i])
    if abs(final_skewness) >= 1.0:
        final_highly_skewed += 1
        print(f"   ⚠️  {feature}: {final_skewness:.3f} (still high)")

if final_highly_skewed == 0:
    print("   ✅ All features now have acceptable skewness (|skew| < 1.0)")
else:
    print(f"   ⚠️  {final_highly_skewed} features still highly skewed")

PREPROCESSING PIPELINE WITH SKEWNESS CORRECTION

1. ANALYZING ORIGINAL SKEWNESS:
   • Revenue: 7.369 (HIGH SKEWNESS)
   • MarketCap: 8.884 (HIGH SKEWNESS)
   • CarbonEmissions: 15.848 (HIGH SKEWNESS)
   • WaterUsage: 14.386 (HIGH SKEWNESS)
   • EnergyConsumption: 15.654 (HIGH SKEWNESS)

2. APPLYING YEO-JOHNSON POWER TRANSFORMATION:
   Post-transformation skewness:
   • Revenue: 0.008 (improved by 7.361)
   • MarketCap: -0.001 (improved by 8.883)
   • CarbonEmissions: -0.014 (improved by 15.834)
   • WaterUsage: -0.001 (improved by 14.385)
   • EnergyConsumption: 0.019 (improved by 15.634)

3. APPLYING SCALING TO POWER-TRANSFORMED DATA:
   StandardScaler applied to power-transformed data.
   MinMaxScaler applied to power-transformed data.
   RobustScaler applied to power-transformed data.

4. FINAL PREPROCESSING SUMMARY:
   ✅ Power transformation applied to fix skewness
   ✅ Standard scaling applied for clustering
   📊 Final preprocessed data shape: (11000, 11)

5. FINAL SKEWNESS VERIFI

## 6. Dimensionality Reduction

In [116]:
# Apply PCA to the properly transformed and scaled data
print("\n6. DIMENSIONALITY REDUCTION WITH PCA:")
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Calculate cumulative explained variance
cumsum_var = np.cumsum(pca.explained_variance_ratio_)

# Find number of components for 95% variance
n_components_95 = np.argmax(cumsum_var >= 0.95) + 1

print(f"   Number of components for 95% variance: {n_components_95}")
print(f"   Explained variance by first 3 components: {cumsum_var[2]:.3f}")

# Create PCA dataframe with optimal components
pca_optimal = PCA(n_components=n_components_95)
X_pca_optimal = pca_optimal.fit_transform(X_scaled)

pca_columns = [f'PCA_{i+1}' for i in range(n_components_95)]
X_pca_df = pd.DataFrame(X_pca_optimal, columns=pca_columns, index=X_scaled.index)

print(f"   ✅ PCA transformed data shape: {X_pca_df.shape}")

# Show PCA component importance
print(f"\n   PCA Components Explained Variance:")
for i in range(min(5, n_components_95)):
    print(f"   • PC{i+1}: {pca.explained_variance_ratio_[i]:.3f} ({pca.explained_variance_ratio_[i]*100:.1f}%)")


6. DIMENSIONALITY REDUCTION WITH PCA:
   Number of components for 95% variance: 7
   Explained variance by first 3 components: 0.727
   ✅ PCA transformed data shape: (11000, 7)

   PCA Components Explained Variance:
   • PC1: 0.314 (31.4%)
   • PC2: 0.241 (24.1%)
   • PC3: 0.171 (17.1%)
   • PC4: 0.087 (8.7%)
   • PC5: 0.070 (7.0%)


In [117]:
# Apply t-SNE for visualization (using first few PCA components)
print("\n7. T-SNE VISUALIZATION PREPARATION:")
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_pca_optimal[:, :5])  # Use first 5 PCA components

X_tsne_df = pd.DataFrame(X_tsne, columns=['t-SNE_1', 't-SNE_2'], index=X_scaled.index)

print(f"   ✅ t-SNE transformed data shape: {X_tsne_df.shape}")
print(f"   ✅ Ready for clustering visualization")


7. T-SNE VISUALIZATION PREPARATION:
   ✅ t-SNE transformed data shape: (11000, 2)
   ✅ Ready for clustering visualization


## 7. Save Preprocessed Data

In [118]:
# Combine all preprocessed data and save the datasets
print("\n8. SAVING IMPROVED PREPROCESSED DATASETS:")

# Save power-transformed features (before scaling) for reference
X_power_df.to_csv('../Data/power_transformed_features.csv', index=False)

# Start from the imputed dataset and append the derived columns
preprocessed_data = data.copy()

# Add power-transformed + scaled features
for col in X_scaled.columns:
    preprocessed_data[f'{col}_scaled'] = X_scaled[col]

# Add PCA features (computed from the power-transformed, scaled features)
for col in X_pca_df.columns:
    preprocessed_data[col] = X_pca_df[col]

# Add t-SNE features
for col in X_tsne_df.columns:
    preprocessed_data[col] = X_tsne_df[col]

# Save all datasets
preprocessed_data.to_csv('../Data/preprocessed_complete_dataset.csv', index=False)
X_scaled.to_csv('../Data/scaled_features.csv', index=False)  # Now power-transformed + scaled
X_pca_df.to_csv('../Data/pca_features.csv', index=False)
X_tsne_df.to_csv('../Data/tsne_features.csv', index=False)

print("   ✅ Improved preprocessed datasets saved:")
print("   - ../Data/preprocessed_complete_dataset.csv")
print("   - ../Data/scaled_features.csv (power-transformed + scaled)")
print("   - ../Data/pca_features.csv (based on improved features)")
print("   - ../Data/tsne_features.csv")
print("   - ../Data/power_transformed_features.csv (intermediate step)")

print(f"\n🎯 PREPROCESSING COMPLETE - READY FOR IMPROVED CLUSTERING!")
print("="*65)
print(f"✅ Skewness corrected with Yeo-Johnson power transformation")
print(f"✅ Features properly scaled with StandardScaler")
print(f"✅ Dimensionality reduced with PCA ({n_components_95} components)")
print(f"✅ Visualization prepared with t-SNE")
print(f"📊 Final dataset shapes:")
print(f"   • Scaled features: {X_scaled.shape}")
print(f"   • PCA features: {X_pca_df.shape}")
print(f"   • t-SNE features: {X_tsne_df.shape}")

# Save transformation objects for consistency
import pickle
with open('../Data/power_transformer.pkl', 'wb') as f:
    pickle.dump(power_transformer, f)
print("   • Power transformer saved for future use")


8. SAVING IMPROVED PREPROCESSED DATASETS:
   ✅ Improved preprocessed datasets saved:
   - ../Data/preprocessed_complete_dataset.csv
   - ../Data/scaled_features.csv (power-transformed + scaled)
   - ../Data/pca_features.csv (based on improved features)
   - ../Data/tsne_features.csv
   - ../Data/power_transformed_features.csv (intermediate step)

🎯 PREPROCESSING COMPLETE - READY FOR IMPROVED CLUSTERING!
✅ Skewness corrected with Yeo-Johnson power transformation
✅ Features properly scaled with StandardScaler
✅ Dimensionality reduced with PCA (7 components)
✅ Visualization prepared with t-SNE
📊 Final dataset shapes:
   • Scaled features: (11000, 11)
   • PCA features: (11000, 7)
   • t-SNE features: (11000, 2)
   • Power transformer saved for future use
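The power transform, scaling, and PCA steps above can also be chained in a scikit-learn `Pipeline`, so that a single object is fitted and reused. The following is a sketch under stated assumptions (synthetic data, hypothetical step names), not a replacement for the cells above; it checks that the pipeline reproduces the step-by-step result.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(7)
X_toy = rng.lognormal(size=(400, 6))  # synthetic stand-in for the numerical features

# One object holding all three steps in order
preprocess = Pipeline([
    ("power", PowerTransformer(method="yeo-johnson", standardize=False)),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95, svd_solver="full")),
])
X_reduced = preprocess.fit_transform(X_toy)

# The same steps applied by hand, for comparison
manual = PowerTransformer(method="yeo-johnson", standardize=False).fit_transform(X_toy)
manual = StandardScaler().fit_transform(manual)
manual = PCA(n_components=0.95, svd_solver="full").fit_transform(manual)

print(X_reduced.shape, np.allclose(X_reduced, manual))
```

A fitted pipeline exposes `transform` for new rows, which avoids having to remember the order in which separately saved transformers must be applied.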
