# 02 - Data Preprocessing for Clustering

## Overview
This notebook prepares the data for clustering analysis by handling missing values, feature selection, scaling, and dimensionality reduction.

**Prerequisites:**
- Complete `01_Data_Exploration_EDA.ipynb` first
- Dataset: `../Data/eda_complete_dataset.csv`

**Objectives:**
- Feature engineering and selection
- Handle outliers and missing values
- Scale and normalize features
- Apply dimensionality reduction techniques
- Prepare data for clustering algorithms

**Outputs:**
- Preprocessed dataset ready for clustering
- Scaled features
- PCA and t-SNE transformed data
- Feature importance analysis

## 1. Library Imports and Setup

In [None]:
# Standard Data Manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Preprocessing and Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import PowerTransformer, LabelEncoder, OneHotEncoder

# Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Feature Selection
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest, f_classif

# Outlier Detection
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

# Utilities
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Preprocessing libraries imported successfully!")

Preprocessing libraries imported successfully!


## 2. Load Preprocessed Data from EDA

In [26]:
# Load the dataset from EDA notebook
try:
    data = pd.read_csv('../Data/eda_complete_dataset.csv')
    print(f"Dataset loaded successfully: {data.shape}")
except FileNotFoundError:
    print("EDA dataset not found. Please run 01_Data_Exploration_EDA.ipynb first.")
    # Fallback to original dataset
    data = pd.read_csv('../Data/company_esg_financial_dataset.csv')
    print(f"Using original dataset: {data.shape}")

print(f"Columns: {list(data.columns)}")
data.head()

Dataset loaded successfully: (11000, 16)
Columns: ['CompanyID', 'CompanyName', 'Industry', 'Region', 'Year', 'Revenue', 'ProfitMargin', 'MarketCap', 'GrowthRate', 'ESG_Overall', 'ESG_Environmental', 'ESG_Social', 'ESG_Governance', 'CarbonEmissions', 'WaterUsage', 'EnergyConsumption']


| | CompanyID | CompanyName | Industry | Region | Year | Revenue | ProfitMargin | MarketCap | GrowthRate | ESG_Overall | ESG_Environmental | ESG_Social | ESG_Governance | CarbonEmissions | WaterUsage | EnergyConsumption |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Company_1 | Retail | Latin America | 2015 | 459.2 | 6.0 | 337.5 | NaN | 57.0 | 60.7 | 33.5 | 76.8 | 35577.4 | 17788.7 | 71154.7 |
| 1 | 1 | Company_1 | Retail | Latin America | 2016 | 473.8 | 4.6 | 366.6 | 3.2 | 56.7 | 58.9 | 32.8 | 78.5 | 37314.7 | 18657.4 | 74629.4 |
| 2 | 1 | Company_1 | Retail | Latin America | 2017 | 564.9 | 5.2 | 313.4 | 19.2 | 56.5 | 57.6 | 34.0 | 77.8 | 45006.4 | 22503.2 | 90012.9 |
| 3 | 1 | Company_1 | Retail | Latin America | 2018 | 558.4 | 4.3 | 283.0 | -1.1 | 58.0 | 62.3 | 33.4 | 78.3 | 42650.1 | 21325.1 | 85300.2 |
| 4 | 1 | Company_1 | Retail | Latin America | 2019 | 554.5 | 4.9 | 538.1 | -0.7 | 56.6 | 63.7 | 30.0 | 76.1 | 41799.4 | 20899.7 | 83598.8 |


## 3. Feature Engineering and Selection

In [27]:
# Define feature categories
financial_features = ['Revenue', 'ProfitMargin', 'MarketCap', 'GrowthRate']
esg_features = ['ESG_Overall', 'ESG_Environmental', 'ESG_Social', 'ESG_Governance']
environmental_features = ['CarbonEmissions', 'WaterUsage', 'EnergyConsumption']
categorical_features = ['Industry', 'Region']
identifier_features = ['CompanyID', 'CompanyName', 'Year']

# Combine numerical features for clustering
numerical_features = financial_features + esg_features + environmental_features

print(f"Financial features: {financial_features}")
print(f"ESG features: {esg_features}")
print(f"Environmental features: {environmental_features}")
print(f"Categorical features: {categorical_features}")
print(f"\nTotal numerical features for clustering: {len(numerical_features)}")

Financial features: ['Revenue', 'ProfitMargin', 'MarketCap', 'GrowthRate']
ESG features: ['ESG_Overall', 'ESG_Environmental', 'ESG_Social', 'ESG_Governance']
Environmental features: ['CarbonEmissions', 'WaterUsage', 'EnergyConsumption']
Categorical features: ['Industry', 'Region']

Total numerical features for clustering: 11
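Because the loading cell can fall back to the original dataset, it may be worth confirming that every listed feature actually exists before indexing with these lists. The following is a minimal sketch on a toy frame; `check_columns` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def check_columns(df, required):
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


# Toy frame standing in for `data`; it deliberately lacks GrowthRate
toy = pd.DataFrame({"Revenue": [1.0], "MarketCap": [2.0]})
missing = check_columns(toy, ["Revenue", "MarketCap", "GrowthRate"])
print(f"Missing columns: {missing}")
```

In the notebook this could be called as `check_columns(data, numerical_features + categorical_features)` right after the feature lists are defined.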


## 4. Handle Missing Values and Outliers

In [28]:
# Check for missing values in numerical features
missing_check = data[numerical_features].isnull().sum()
print("Missing values in numerical features:")
print(missing_check[missing_check > 0])

# Handle missing values if any
if missing_check.sum() > 0:
    # Fill missing values with median for numerical features
    for feature in numerical_features:
        if data[feature].isnull().sum() > 0:
            median_value = data[feature].median()
            data[feature] = data[feature].fillna(median_value)
            print(f"Filled {feature} missing values with median: {median_value:.2f}")
else:
    print("No missing values found in numerical features.")

Missing values in numerical features:
GrowthRate    1000
dtype: int64
Filled GrowthRate missing values with median: 4.90
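The cell above fills every gap with one global median. As a point of comparison, the sketch below shows the same global fill next to a per-group fill on toy data. The per-industry variant is an illustrative alternative, not something the notebook does, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy data: one missing GrowthRate per industry (values are illustrative only)
toy = pd.DataFrame({
    "Industry": ["Retail", "Retail", "Retail", "Energy", "Energy", "Energy"],
    "GrowthRate": [np.nan, 2.0, 4.0, np.nan, 10.0, 20.0],
})

# Global median, as in the cell above
global_median = toy["GrowthRate"].median()
toy["GrowthRate_global"] = toy["GrowthRate"].fillna(global_median)

# Per-industry median (assumed alternative): each gap is filled from its own group
group_median = toy.groupby("Industry")["GrowthRate"].transform("median")
toy["GrowthRate_by_industry"] = toy["GrowthRate"].fillna(group_median)

print(toy)
```

Which variant is preferable depends on whether growth rates differ systematically between groups; the global median is the simpler choice and keeps the imputation independent of the categorical columns.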


## 5. Feature Scaling and Normalization

### 5.1 Advanced Transformation for Skewed Features

The EDA found five features with high skewness (|skew| >= 1.0):
- CarbonEmissions: 15.848
- EnergyConsumption: 15.654
- WaterUsage: 14.386
- MarketCap: 8.884
- Revenue: 7.369

Standard scaling is a linear transformation: it shifts and rescales each feature but leaves its skewness unchanged. We therefore apply a Yeo-Johnson power transformation first, then scale.
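A quick way to see why scaling alone is not enough is to compare skewness before and after each step on synthetic data. The sketch below uses a log-normal sample as a stand-in for a heavily right-skewed feature; the sample and its parameters are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(42)
# Synthetic right-skewed feature (stand-in for e.g. Revenue)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(5000, 1))

skew_raw = skew(x[:, 0])

# Standardization is linear, so skewness is unchanged
skew_scaled = skew(StandardScaler().fit_transform(x)[:, 0])

# Yeo-Johnson reshapes the distribution, so skewness drops
power = PowerTransformer(method="yeo-johnson", standardize=False)
skew_power = skew(power.fit_transform(x)[:, 0])

print(f"raw: {skew_raw:.3f}, scaled: {skew_scaled:.3f}, power: {skew_power:.3f}")
```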

In [33]:
# Prepare the data for scaling
X = data[numerical_features].copy()

print("PREPROCESSING PIPELINE WITH SKEWNESS CORRECTION")
print("="*60)

# Step 1: Apply Power Transformation to handle skewness
from scipy.stats import skew  # PowerTransformer is already imported in Section 1

print("\n1. ANALYZING ORIGINAL SKEWNESS:")
original_skew = {}
for feature in numerical_features:
    skewness = skew(X[feature].dropna())
    original_skew[feature] = skewness
    if abs(skewness) >= 1.0:
        print(f"   • {feature}: {skewness:.3f} (HIGH SKEWNESS)")

# Apply Yeo-Johnson Power Transformation (handles positive and negative values)
print("\n2. APPLYING YEO-JOHNSON POWER TRANSFORMATION:")
power_transformer = PowerTransformer(method='yeo-johnson', standardize=False)
X_power_transformed = power_transformer.fit_transform(X)
X_power_df = pd.DataFrame(X_power_transformed, columns=numerical_features, index=X.index)

# Check skewness after power transformation
print("   Post-transformation skewness:")
for i, feature in enumerate(numerical_features):
    new_skewness = skew(X_power_transformed[:, i])
    improvement = abs(original_skew[feature]) - abs(new_skewness)
    if abs(original_skew[feature]) >= 1.0:
        print(f"   • {feature}: {new_skewness:.3f} (improved by {improvement:.3f})")

# Step 2: Apply different scaling methods to power-transformed data
print("\n3. APPLYING SCALING TO POWER-TRANSFORMED DATA:")
scalers = {
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler(),
    'RobustScaler': RobustScaler()
}

scaled_data = {}

for name, scaler in scalers.items():
    scaled_features = scaler.fit_transform(X_power_df)
    scaled_data[name] = pd.DataFrame(scaled_features, columns=numerical_features, index=X.index)
    print(f"   {name} applied to power-transformed data.")

# Use StandardScaler on power-transformed data as default
X_scaled = scaled_data['StandardScaler']

print(f"\n4. FINAL PREPROCESSING SUMMARY:")
print(f"   ✅ Power transformation applied to fix skewness")
print(f"   ✅ Standard scaling applied for clustering")
print(f"   📊 Final preprocessed data shape: {X_scaled.shape}")

# Verify final skewness
print(f"\n5. FINAL SKEWNESS VERIFICATION:")
final_highly_skewed = 0
for i, feature in enumerate(numerical_features):
    final_skewness = skew(X_scaled.iloc[:, i])
    if abs(final_skewness) >= 1.0:
        final_highly_skewed += 1
        print(f"   ⚠️  {feature}: {final_skewness:.3f} (still high)")

if final_highly_skewed == 0:
    print("   ✅ All features now have acceptable skewness (|skew| < 1.0)")
else:
    print(f"   ⚠️  {final_highly_skewed} features still highly skewed")

PREPROCESSING PIPELINE WITH SKEWNESS CORRECTION

1. ANALYZING ORIGINAL SKEWNESS:
   • Revenue: 7.369 (HIGH SKEWNESS)
   • MarketCap: 8.884 (HIGH SKEWNESS)
   • CarbonEmissions: 15.848 (HIGH SKEWNESS)
   • WaterUsage: 14.386 (HIGH SKEWNESS)
   • EnergyConsumption: 15.654 (HIGH SKEWNESS)

2. APPLYING YEO-JOHNSON POWER TRANSFORMATION:
   Post-transformation skewness:
   • Revenue: 0.008 (improved by 7.361)
   • MarketCap: -0.001 (improved by 8.883)
   • CarbonEmissions: -0.014 (improved by 15.834)
   • WaterUsage: -0.001 (improved by 14.385)
   • EnergyConsumption: 0.019 (improved by 15.634)

3. APPLYING SCALING TO POWER-TRANSFORMED DATA:
   StandardScaler applied to power-transformed data.
   MinMaxScaler applied to power-transformed data.
   RobustScaler applied to power-transformed data.

4. FINAL PREPROCESSING SUMMARY:
   ✅ Power transformation applied to fix skewness
   ✅ Standard scaling applied for clustering
   📊 Final preprocessed data shape: (11000, 11)

5. FINAL SKEWNESS VERIFICATION:
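The three scalers fitted above differ mainly in how they react to extreme values. The sketch below applies each one to a small made-up column with a single outlier. It works on raw values rather than power-transformed ones, purely to make the difference visible.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Toy column with one extreme value (illustrative only)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(x)  # zero mean, unit variance
minmax = MinMaxScaler().fit_transform(x)      # rescaled to [0, 1]
robust = RobustScaler().fit_transform(x)      # centered on median, scaled by IQR

for name, values in [("standard", standard), ("minmax", minmax), ("robust", robust)]:
    print(name, np.round(values.ravel(), 3))
```

`RobustScaler` keeps the four typical values evenly spaced around zero, whereas the outlier compresses them under the other two scalers. After the Yeo-Johnson step the features are close to symmetric, which is one reason `StandardScaler` is a reasonable default here.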

## 6. Dimensionality Reduction

In [34]:
# Apply PCA to the properly transformed and scaled data
print("\n6. DIMENSIONALITY REDUCTION WITH PCA:")
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Calculate cumulative explained variance
cumsum_var = np.cumsum(pca.explained_variance_ratio_)

# Find number of components for 95% variance
n_components_95 = np.argmax(cumsum_var >= 0.95) + 1

print(f"   Number of components for 95% variance: {n_components_95}")
print(f"   Explained variance by first 3 components: {cumsum_var[2]:.3f}")

# Create PCA dataframe with optimal components
pca_optimal = PCA(n_components=n_components_95)
X_pca_optimal = pca_optimal.fit_transform(X_scaled)

pca_columns = [f'PCA_{i+1}' for i in range(n_components_95)]
X_pca_df = pd.DataFrame(X_pca_optimal, columns=pca_columns, index=X_scaled.index)

print(f"   ✅ PCA transformed data shape: {X_pca_df.shape}")

# Show PCA component importance
print(f"\n   PCA Components Explained Variance:")
for i in range(min(5, n_components_95)):
    print(f"   • PC{i+1}: {pca.explained_variance_ratio_[i]:.3f} ({pca.explained_variance_ratio_[i]*100:.1f}%)")


6. DIMENSIONALITY REDUCTION WITH PCA:
   Number of components for 95% variance: 7
   Explained variance by first 3 components: 0.727
   ✅ PCA transformed data shape: (11000, 7)

   PCA Components Explained Variance:
   • PC1: 0.315 (31.5%)
   • PC2: 0.241 (24.1%)
   • PC3: 0.171 (17.1%)
   • PC4: 0.087 (8.7%)
   • PC5: 0.070 (7.0%)
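The component count above comes from the cumulative explained variance ratio. As a cross-check, scikit-learn's `PCA` also accepts a float `n_components` between 0 and 1 and selects the count itself. The sketch below compares both routes on synthetic data; `X_demo` is an assumed stand-in for `X_scaled`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for X_scaled: 500 rows, 11 columns with decreasing spread
X_demo = rng.normal(size=(500, 11)) * np.linspace(3.0, 0.3, 11)

# Manual route, as in the cell above: first index where cumulative variance reaches 95%
cumsum_var = np.cumsum(PCA(svd_solver="full").fit(X_demo).explained_variance_ratio_)
n_manual = int(np.argmax(cumsum_var >= 0.95) + 1)

# Built-in route: a float n_components asks PCA to keep enough components for 95%
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_demo)

print(f"manual: {n_manual}, built-in: {pca_95.n_components_}")
```

Both routes should agree except in the edge case where the cumulative ratio lands exactly on the threshold. The manual route has the advantage of keeping the full variance curve available for plotting.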


In [35]:
# Apply t-SNE for visualization (using first few PCA components)
print("\n7. T-SNE VISUALIZATION PREPARATION:")
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_pca_optimal[:, :5])  # Use first 5 PCA components

X_tsne_df = pd.DataFrame(X_tsne, columns=['t-SNE_1', 't-SNE_2'], index=X_scaled.index)

print(f"   ✅ t-SNE transformed data shape: {X_tsne_df.shape}")
print(f"   ✅ Ready for clustering visualization")


7. T-SNE VISUALIZATION PREPARATION:
   ✅ t-SNE transformed data shape: (11000, 2)
   ✅ Ready for clustering visualization
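A t-SNE layout depends on `perplexity` and the random seed, and `TSNE` offers no `transform` method for projecting new rows, so the embedding is best treated as a visualization aid rather than as clustering input. The sketch below runs two perplexity values on small synthetic blobs; the data and the perplexity values are assumptions chosen to keep the run fast.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Two synthetic blobs in 5 dimensions (stand-in for the first 5 PCA components)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(60, 5)),
    rng.normal(loc=6.0, scale=1.0, size=(60, 5)),
])

# Fit one embedding per perplexity value; perplexity must be below the sample count
embeddings = {
    p: TSNE(n_components=2, perplexity=p, random_state=42).fit_transform(X_demo)
    for p in (5, 30)
}

for p, emb in embeddings.items():
    print(f"perplexity={p}: shape={emb.shape}")

# TSNE only exposes fit / fit_transform, so new points cannot be projected later
print("has transform:", hasattr(TSNE, "transform"))
```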


## 7. Save Preprocessed Data

In [36]:
# Save datasets with improved preprocessing
print("\n8. SAVING IMPROVED PREPROCESSED DATASETS:")

# Save power-transformed features (before scaling) for reference
X_power_df.to_csv('../Data/power_transformed_features.csv', index=False)

# Combine the original columns with all derived features (aligned on the shared index)
preprocessed_data = data.copy()

# Add power-transformed + scaled features
for col in X_scaled.columns:
    preprocessed_data[f'{col}_scaled'] = X_scaled[col]

# Add PCA features (computed from the scaled features)
for col in X_pca_df.columns:
    preprocessed_data[col] = X_pca_df[col]

# Add t-SNE features
for col in X_tsne_df.columns:
    preprocessed_data[col] = X_tsne_df[col]

# Save all datasets
preprocessed_data.to_csv('../Data/preprocessed_complete_dataset.csv', index=False)
X_scaled.to_csv('../Data/scaled_features.csv', index=False)  # Now power-transformed + scaled
X_pca_df.to_csv('../Data/pca_features.csv', index=False)
X_tsne_df.to_csv('../Data/tsne_features.csv', index=False)

print("   ✅ Improved preprocessed datasets saved:")
print("   - ../Data/preprocessed_complete_dataset.csv")
print("   - ../Data/scaled_features.csv (power-transformed + scaled)")
print("   - ../Data/pca_features.csv (based on improved features)")
print("   - ../Data/tsne_features.csv")
print("   - ../Data/power_transformed_features.csv (intermediate step)")

print(f"\n🎯 PREPROCESSING COMPLETE - READY FOR IMPROVED CLUSTERING!")
print("="*65)
print(f"✅ Skewness corrected with Yeo-Johnson power transformation")
print(f"✅ Features properly scaled with StandardScaler")
print(f"✅ Dimensionality reduced with PCA ({n_components_95} components)")
print(f"✅ Visualization prepared with t-SNE")
print(f"📊 Final dataset shapes:")
print(f"   • Scaled features: {X_scaled.shape}")
print(f"   • PCA features: {X_pca_df.shape}")
print(f"   • t-SNE features: {X_tsne_df.shape}")

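# Save the fitted power transformer so the same transformation can be reapplied later
# (the fitted StandardScaler and PCA objects are not saved in this cell)
import pickle
with open('../Data/power_transformer.pkl', 'wb') as f:
    pickle.dump(power_transformer, f)
print("   • Power transformer saved for future use")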


8. SAVING IMPROVED PREPROCESSED DATASETS:
   ✅ Improved preprocessed datasets saved:
   - ../Data/preprocessed_complete_dataset.csv
   - ../Data/scaled_features.csv (power-transformed + scaled)
   - ../Data/pca_features.csv (based on improved features)
   - ../Data/tsne_features.csv
   - ../Data/power_transformed_features.csv (intermediate step)

🎯 PREPROCESSING COMPLETE - READY FOR IMPROVED CLUSTERING!
✅ Skewness corrected with Yeo-Johnson power transformation
✅ Features properly scaled with StandardScaler
✅ Dimensionality reduced with PCA (7 components)
✅ Visualization prepared with t-SNE
📊 Final dataset shapes:
   • Scaled features: (11000, 11)
   • PCA features: (11000, 7)
   • t-SNE features: (11000, 2)
   • Power transformer saved for future use
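Only the power transformer is pickled above. If new rows ever need the same treatment, one option is to wrap all three fitted steps in a single scikit-learn `Pipeline` and persist that object instead. The sketch below is an assumed alternative, not the notebook's method. It uses synthetic data and an in-memory pickle round trip rather than writing a file.

```python
import pickle

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.lognormal(size=(200, 4))  # synthetic stand-in for the numerical features

# Same order as the notebook: power transform, then scale, then PCA
pipeline = Pipeline([
    ("power", PowerTransformer(method="yeo-johnson", standardize=False)),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
])
Z = pipeline.fit_transform(X_demo)

# Round-trip through pickle; in practice this would be a file on disk
restored = pickle.loads(pickle.dumps(pipeline))
Z_restored = restored.transform(X_demo)

print("round trip matches:", np.allclose(Z, Z_restored))
```

Persisting one pipeline object keeps the three steps in sync, so a later notebook cannot accidentally pair a saved transformer with a refitted scaler.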