# ESG Data Augmentation with Real Labels

This notebook will:
1. **Read and analyze the ESG features data** from the file `esg_features_with_tiers_labels.csv`
2. **Augment the data** to increase the dataset size using several methods
3. **Use the real E, S, G labels** instead of synthetic labels
4. **Apply special handling to integer columns** and ratio columns
5. **Output the augmented dataset** with real labels for training

**Dataset characteristics:**
- Has real labels: `e_score`, `s_score`, `g_score`
- All feature columns are integers, except `esg_pos_ratio` and `esg_neg_ratio`
- `esg_tier` is dropped; `esg_cluster` is kept
- Noise injection must maintain the integer constraints
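
The integer constraint in the last bullet can be sketched as follows: add Gaussian noise scaled to each count, clip at zero, and round back to whole numbers. This is a minimal vectorized sketch, not the notebook's own implementation; the helper name `add_integer_noise` and the use of `np.random.default_rng` are assumptions made for illustration.

```python
import numpy as np

def add_integer_noise(values, noise_factor=0.05, rng=None):
    """Perturb non-negative integer counts with Gaussian noise, then round back to integers."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    # Noise scale grows with the magnitude of each count; the 0.1 offset
    # keeps the standard deviation positive when a count is zero.
    std = noise_factor * (np.abs(values) + 0.1)
    noisy = values + rng.normal(0.0, std)
    # Clip at zero (counts cannot be negative) and round to restore integers.
    return np.rint(np.clip(noisy, 0, None)).astype(int)

counts = np.array([0, 3, 355, 708])  # illustrative counts
noisy_counts = add_integer_noise(counts, noise_factor=0.05, rng=np.random.default_rng(42))
print(noisy_counts)
```

Because the noise scale is proportional to the count, small counts rarely change after rounding, while large counts move by a few percent.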

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")

Libraries imported successfully!


In [9]:
# Read the ESG features data with real labels
esg_data = pd.read_csv('esg_features_with_tiers_labels.csv')

# Drop the esg_tier column as requested
if 'esg_tier' in esg_data.columns:
    esg_data = esg_data.drop('esg_tier', axis=1)
    print("✅ Removed 'esg_tier' column as requested")

print("=== ESG FEATURES DATA with REAL LABELS ===")
print(f"Dataset shape: {esg_data.shape}")
print(f"Number of companies: {len(esg_data)}")

# Identify column types for augmentation
ratio_cols = ['esg_pos_ratio', 'esg_neg_ratio']
label_cols = ['e_score', 's_score', 'g_score']
metadata_cols = ['filename', 'esg_cluster']
integer_cols = [col for col in esg_data.columns 
                if col not in ratio_cols + label_cols + metadata_cols]

print(f"\nColumn categorization:")
print(f"   Integer columns: {len(integer_cols)}")
print(f"   Ratio columns: {len(ratio_cols)} {ratio_cols}")
print(f"   Label columns: {len(label_cols)} {label_cols}")
print(f"   Metadata columns: {len(metadata_cols)} {metadata_cols}")

# Show the first few rows
print(f"\nFirst 3 rows preview:")
display_cols = ['filename', 'total_esg_mentions', 'esg_pos_ratio', 'esg_neg_ratio', 'esg_cluster', 'e_score', 's_score', 'g_score']
print(esg_data[display_cols].head(3))

# Analyze the real ESG labels
print(f"\n=== REAL ESG LABELS ANALYSIS ===")
for label_col in label_cols:
    scores = esg_data[label_col]
    print(f"{label_col}: μ={scores.mean():.1f}, σ={scores.std():.1f}, range=[{scores.min():.1f}, {scores.max():.1f}]")

# ESG Cluster distribution
print(f"\nESG Cluster distribution:")
print(esg_data['esg_cluster'].value_counts().sort_index())

# Data quality checks
print(f"\n=== DATA QUALITY CHECKS ===")
print(f"Missing values: {esg_data.isnull().sum().sum()}")

# Check integer constraints
print(f"\nInteger columns validation (sample):")
for col in integer_cols[:5]:  # Check first 5 as sample
    is_integer = esg_data[col].apply(lambda x: x == int(x) if pd.notnull(x) else True).all()
    print(f"   {col}: {'✅' if is_integer else '❌'} All integers")

print(f"\nRatio columns validation:")
for col in ratio_cols:
    min_val, max_val = esg_data[col].min(), esg_data[col].max()
    print(f"   {col}: range=[{min_val:.3f}, {max_val:.3f}]")

✅ Removed 'esg_tier' column as requested
=== ESG FEATURES DATA with REAL LABELS ===
Dataset shape: (49, 63)
Number of companies: 49

Column categorization:
   Integer columns: 56
   Ratio columns: 2 ['esg_pos_ratio', 'esg_neg_ratio']
   Label columns: 3 ['e_score', 's_score', 'g_score']
   Metadata columns: 2 ['filename', 'esg_cluster']

First 3 rows preview:
          filename  total_esg_mentions  esg_pos_ratio  esg_neg_ratio  \
0  AR BBC 2023.txt                 355       0.814085       0.185915   
1  AR BBC 2024.txt                 708       0.922316       0.077684   
2  AR BID 2024.txt                 891       0.955129       0.044871   

   esg_cluster  e_score  s_score  g_score  
0            2     70.6     65.6     66.4  
1            2     68.1     71.0     73.0  
2            1     59.7     60.9     65.0  

=== REAL ESG LABELS ANALYSIS ===
e_score: μ=77.3, σ=14.2, range=[35.0, 95.0]
s_score: μ=77.2, σ=13.2, range=[35.0, 88.0]
g_score: μ=80.3, σ=11.2, range=[57.2, 95.0]
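
The integer validation in the cell above applies a Python lambda to every value. A vectorized variant might look like the sketch below. `is_integer_valued` is a hypothetical helper, not part of the notebook, and the small frame reuses the three preview rows only for demonstration.

```python
import numpy as np
import pandas as pd

def is_integer_valued(series):
    """Return True if every non-null value in the series is a whole number."""
    values = pd.to_numeric(series.dropna(), errors="coerce")
    # A coerced NaN means a non-numeric entry, which fails the check.
    if values.isna().any():
        return False
    return bool(np.all(np.mod(values, 1) == 0))

demo = pd.DataFrame({
    "total_esg_mentions": [355, 708, 891],
    "esg_pos_ratio": [0.814085, 0.922316, 0.955129],
})
for col in demo.columns:
    print(col, is_integer_valued(demo[col]))
```

Running the check on every integer column, rather than on the first five as a sample, would cost little with this form.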

In [10]:
# Data augmentation functions for ESG features with real labels
def augment_esg_data_with_labels(df, methods=('noise', 'interpolation', 'scaling', 'synthetic'),
                                samples_per_method=2, noise_factor=0.05):
    """
    Augment ESG features data with real labels:
    
    - Maintain integer constraints for all feature columns except the ratio columns
    - Preserve label relationships
    - Handle different column types appropriately
    """
    
    # Define column types
    ratio_cols = ['esg_pos_ratio', 'esg_neg_ratio']
    label_cols = ['e_score', 's_score', 'g_score'] 
    metadata_cols = ['filename', 'esg_cluster']
    integer_cols = [col for col in df.columns 
                    if col not in ratio_cols + label_cols + metadata_cols]
    
    print(f"🔧 Augmentation setup:")
    print(f"   Integer columns: {len(integer_cols)}")
    print(f"   Ratio columns: {len(ratio_cols)}")
    print(f"   Label columns: {len(label_cols)}")
    
    augmented_data = []
    
    # Keep the original data
    print(f"📊 Original data: {len(df)} samples")
    for idx, row in df.iterrows():
        augmented_data.append(row.to_dict())
    
    # Method 1: Noise injection with integer constraints
    if 'noise' in methods:
        print(f"🔊 Applying noise injection...")
        for i in range(samples_per_method):
            for idx, row in df.iterrows():
                new_row = row.copy()
                
                # Integer columns: add noise then round
                for col in integer_cols:
                    original_value = row[col]
                    # Generate small noise for integer columns
                    std = noise_factor * (abs(original_value) + 0.1)
                    noise = np.random.normal(0, std)
                    new_value = max(0, original_value + noise)
                    new_row[col] = int(round(new_value))  # Round to integer
                
                # Ratio columns: normal noise
                for col in ratio_cols:
                    original_value = row[col]
                    std = noise_factor * 0.1  # Smaller noise for ratios
                    noise = np.random.normal(0, std)
                    new_value = np.clip(original_value + noise, 0, 1)  # Keep in [0,1]
                    new_row[col] = new_value
                
                # Labels: small noise but reasonable ranges
                for col in label_cols:
                    original_value = row[col]
                    std = noise_factor * 5  # Std of 5 * noise_factor score points (0.25 at the default 0.05)
                    noise = np.random.normal(0, std)
                    new_value = np.clip(original_value + noise, 0, 100)  # Keep in [0,100]
                    new_row[col] = round(new_value, 1)
                
                # Metadata: keep esg_cluster, tag the filename with the method and sample index
                new_row['filename'] = f"{row['filename']}_noise_{i+1}"
                
                augmented_data.append(new_row)
    
    # Method 2: Interpolation
    if 'interpolation' in methods:
        print(f"🔄 Applying interpolation...")
        for i in range(samples_per_method):
            for idx in range(len(df)):
                # Pick 2 random samples from the same cluster if possible
                cluster = df.iloc[idx]['esg_cluster']
                same_cluster = df[df['esg_cluster'] == cluster]
                
                if len(same_cluster) > 1:
                    pair = same_cluster.sample(2)
                else:
                    pair = df.sample(2)
                
                row1, row2 = pair.iloc[0], pair.iloc[1]
                
                # Interpolate with a random weight
                alpha = np.random.uniform(0.3, 0.7)
                new_row = {}
                
                # Integer columns: interpolate then round
                for col in integer_cols:
                    interpolated = alpha * row1[col] + (1 - alpha) * row2[col]
                    new_row[col] = int(round(interpolated))
                
                # Ratio columns: normal interpolation
                for col in ratio_cols:
                    new_row[col] = alpha * row1[col] + (1 - alpha) * row2[col]
                
                # Labels: interpolate
                for col in label_cols:
                    interpolated = alpha * row1[col] + (1 - alpha) * row2[col]
                    new_row[col] = round(interpolated, 1)
                
                # Metadata
                new_row['filename'] = f"{row1['filename']}_interp_{i+1}"
                new_row['esg_cluster'] = row1['esg_cluster']  # Keep from first sample
                
                augmented_data.append(new_row)
    
    # Method 3: Feature Scaling
    if 'scaling' in methods:
        print(f"📏 Applying feature scaling...")
        for i in range(samples_per_method):
            for idx, row in df.iterrows():
                new_row = row.copy()
                
                # Scale random subset of integer features
                features_to_scale = np.random.choice(integer_cols, 
                                                   size=max(1, len(integer_cols)//3), 
                                                   replace=False)
                
                for col in features_to_scale:
                    scale_factor = np.random.uniform(0.8, 1.2)
                    scaled_value = row[col] * scale_factor
                    new_row[col] = int(round(max(0, scaled_value)))
                
                # Adjust ratios slightly
                for col in ratio_cols:
                    scale_factor = np.random.uniform(0.95, 1.05)
                    scaled_value = row[col] * scale_factor
                    new_row[col] = np.clip(scaled_value, 0, 1)
                
                # Labels: minor scaling
                for col in label_cols:
                    scale_factor = np.random.uniform(0.95, 1.05)
                    scaled_value = row[col] * scale_factor
                    new_row[col] = round(np.clip(scaled_value, 0, 100), 1)
                
                new_row['filename'] = f"{row['filename']}_scale_{i+1}"
                
                augmented_data.append(new_row)
    
    # Method 4: Synthetic Generation based on clusters
    if 'synthetic' in methods:
        print(f"🤖 Applying synthetic generation...")
        for cluster in df['esg_cluster'].unique():
            cluster_data = df[df['esg_cluster'] == cluster]
            
            for i in range(samples_per_method):
                # Calculate cluster statistics
                new_row = {}
                
                # Integer columns: sample from cluster distribution
                for col in integer_cols:
                    cluster_mean = cluster_data[col].mean()
                    cluster_std = cluster_data[col].std()
                    if pd.isna(cluster_std) or cluster_std == 0:  # std is NaN for a one-row cluster
                        cluster_std = 0.1
                    
                    synthetic_value = np.random.normal(cluster_mean, cluster_std)
                    new_row[col] = int(round(max(0, synthetic_value)))
                
                # Ratio columns: cluster-based generation
                for col in ratio_cols:
                    cluster_mean = cluster_data[col].mean()
                    cluster_std = cluster_data[col].std()
                    if pd.isna(cluster_std) or cluster_std == 0:  # std is NaN for a one-row cluster
                        cluster_std = 0.01
                    
                    synthetic_value = np.random.normal(cluster_mean, cluster_std)
                    new_row[col] = np.clip(synthetic_value, 0, 1)
                
                # Labels: cluster-based with some variation
                for col in label_cols:
                    cluster_mean = cluster_data[col].mean()
                    cluster_std = cluster_data[col].std()
                    if pd.isna(cluster_std) or cluster_std == 0:  # std is NaN for a one-row cluster
                        cluster_std = 2.0
                    
                    synthetic_value = np.random.normal(cluster_mean, cluster_std)
                    new_row[col] = round(np.clip(synthetic_value, 0, 100), 1)
                
                # Metadata
                new_row['filename'] = f"synthetic_cluster_{cluster}_{i+1}.txt"
                new_row['esg_cluster'] = cluster
                
                augmented_data.append(new_row)
    
    return pd.DataFrame(augmented_data)

print("🛠️ Enhanced data augmentation functions defined successfully!")

🛠️ Enhanced data augmentation functions defined successfully!


In [11]:
# Perform data augmentation with real labels
print("=== PERFORMING DATA AUGMENTATION with REAL LABELS ===")

# Augment the data with the enhanced function
augmented_esg_data = augment_esg_data_with_labels(
    esg_data, 
    methods=['noise', 'interpolation', 'scaling', 'synthetic'],
    samples_per_method=2,  # 2 samples per method per original row (per cluster for 'synthetic')
    noise_factor=0.04  # Small noise factor to preserve the data's characteristics
)

print(f"\n=== AUGMENTATION RESULTS ===")
print(f"Original dataset: {esg_data.shape}")
print(f"Augmented dataset: {augmented_esg_data.shape}")
print(f"Increase factor: {len(augmented_esg_data) / len(esg_data):.1f}x")

# Check integer constraints
print(f"\n=== INTEGER CONSTRAINT VALIDATION ===")
ratio_cols = ['esg_pos_ratio', 'esg_neg_ratio']
label_cols = ['e_score', 's_score', 'g_score']
metadata_cols = ['filename', 'esg_cluster']
integer_cols = [col for col in augmented_esg_data.columns 
                if col not in ratio_cols + label_cols + metadata_cols]

# Check first 5 integer columns
print("Integer columns validation (sample):")
for col in integer_cols[:5]:
    # Check if all values are integers
    is_integer = augmented_esg_data[col].apply(lambda x: x == int(x) if pd.notnull(x) else True).all()
    print(f"   {col}: {'✅' if is_integer else '❌'} All integers")

# Check augmentation quality
print(f"\n=== QUALITY CHECK ===")
key_features = ['total_esg_mentions', 'esg_pos_ratio', 'total_pos_environmental', 
                'total_pos_social', 'total_pos_governance']

print("Original data statistics:")
orig_stats = esg_data[key_features].describe()
print(orig_stats.loc[['mean', 'std', 'min', 'max']].round(2))

print("\nAugmented data statistics:")
aug_stats = augmented_esg_data[key_features].describe()
print(aug_stats.loc[['mean', 'std', 'min', 'max']].round(2))

# Compare the differences in mean and std
print(f"\n=== STATISTICAL COMPARISON ===")
for feature in key_features:
    orig_mean, orig_std = esg_data[feature].mean(), esg_data[feature].std()
    aug_mean, aug_std = augmented_esg_data[feature].mean(), augmented_esg_data[feature].std()
    
    mean_diff = abs(aug_mean - orig_mean) / orig_mean * 100
    std_diff = abs(aug_std - orig_std) / orig_std * 100 if orig_std > 0 else 0
    
    print(f"{feature}:")
    print(f"  Mean difference: {mean_diff:.1f}%")
    print(f"  Std difference: {std_diff:.1f}%")

=== PERFORMING DATA AUGMENTATION with REAL LABELS ===
🔧 Augmentation setup:
   Integer columns: 56
   Ratio columns: 2
   Label columns: 3
📊 Original data: 49 samples
🔊 Applying noise injection...
🔄 Applying interpolation...
📏 Applying feature scaling...
🤖 Applying synthetic generation...

=== AUGMENTATION RESULTS ===
Original dataset: (49, 63)
Augmented dataset: (351, 63)
Increase factor: 7.2x

=== INTEGER CONSTRAINT VALIDATION ===
Integer columns validation (sample):
   pos_env_climate_action: ‚úÖ All integers
   neg_env_climate_action: ‚úÖ All integers
   pos_env_energy_transition: ‚úÖ All integers
   neg_env_energy_transition: ‚úÖ All integers
   pos_env_water_stewardship: ‚úÖ All integers

=== QUALITY CHECK ===
Original data statistics:
      total_esg_mentions  esg_pos_ratio  total_pos_environmental  \
mean              403.02           0.93                    61.41   
std               255.02           0.04                    29.08   
min                 3.00 
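
The 351-row result reported above follows from how each method contributes rows. The three per-row methods (noise, interpolation, scaling) each add `samples_per_method` rows per original row, while synthetic generation adds `samples_per_method` rows per cluster. Below is a minimal sketch of that arithmetic, assuming the dataset has four clusters (the label-by-cluster output lists clusters 0 through 3). `expected_rows` is a hypothetical helper, not part of the notebook.

```python
def expected_rows(n_original, n_clusters, samples_per_method=2, per_row_methods=3):
    """Rows after augmentation: originals + per-row methods + per-cluster synthetic rows."""
    per_row = per_row_methods * samples_per_method * n_original
    synthetic = samples_per_method * n_clusters
    return n_original + per_row + synthetic

total = expected_rows(n_original=49, n_clusters=4)
print(total, round(total / 49, 1))  # 351 7.2
```

Only 8 of the 351 rows come from cluster-level synthetic sampling; the other 294 added rows are perturbations or blends of existing rows.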

In [12]:
# Analyze the real ESG labels
print("=== ANALYZING REAL ESG LABELS ===")
print("✅ Using real E, S, G labels from dataset!")

# Statistics for the real labels
print(f"\n=== REAL ESG LABELS STATISTICS ===")
label_cols = ['e_score', 's_score', 'g_score']
label_stats = augmented_esg_data[label_cols].describe()
print(label_stats.round(2))

# Compute the correlations between the scores
print(f"\n=== INTER-SCORE CORRELATIONS ===")
correlation_matrix = augmented_esg_data[label_cols].corr()
print(correlation_matrix.round(3))

# Analyze the label distribution by cluster
print(f"\n=== LABEL DISTRIBUTION BY CLUSTER ===")
for cluster in sorted(augmented_esg_data['esg_cluster'].unique()):
    cluster_data = augmented_esg_data[augmented_esg_data['esg_cluster'] == cluster]
    print(f"\nCluster {cluster} ({len(cluster_data)} samples):")
    for col in label_cols:
        mean_score = cluster_data[col].mean()
        std_score = cluster_data[col].std()
        print(f"   {col}: μ={mean_score:.1f}, σ={std_score:.1f}")

# Check for any anomalies
print(f"\n=== DATA QUALITY CHECKS ===")
for col in label_cols:
    scores = augmented_esg_data[col]
    print(f"{col}:")
    print(f"   Range: [{scores.min():.1f}, {scores.max():.1f}]")
    print(f"   Missing values: {scores.isnull().sum()}")
    print(f"   Outliers (>3σ): {len(scores[abs(scores - scores.mean()) > 3*scores.std()])}")

print(f"\n✅ Real labels analysis completed!")

=== ANALYZING REAL ESG LABELS ===
✅ Using real E, S, G labels from dataset!

=== REAL ESG LABELS STATISTICS ===
       e_score  s_score  g_score
count   351.00   351.00   351.00
mean     77.26    77.10    80.22
std      13.81    12.85    11.18
min      34.60    34.10    56.30
25%      71.35    70.95    71.85
50%      82.10    82.60    85.50
75%      87.40    87.90    87.90
max      98.40    92.40    99.30

=== INTER-SCORE CORRELATIONS ===
         e_score  s_score  g_score
e_score    1.000    0.955    0.847
s_score    0.955    1.000    0.815
g_score    0.847    0.815    1.000

=== LABEL DISTRIBUTION BY CLUSTER ===

Cluster 0 (163 samples):
   e_score: μ=83.2, σ=9.8
   s_score: μ=84.1, σ=10.1
   g_score: μ=87.0, σ=2.0

Cluster 1 (51 samples):
   e_score: μ=53.4, σ=4.0
   s_score: μ=56.3, σ=4.0
   g_score: μ=60.9, σ=3.0

Cluster 2 (93 samples):
   e_score: μ=71.9, σ=1.9
   s_score: μ=71.2, σ=4.7
   g_score: μ=72.1, σ=2.5

Cluster 3 (44 samples):
   e_score: μ=94.1, σ
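
The per-cluster loop in the cell above can also be expressed as a single `groupby` aggregation. This is a rough sketch rather than the notebook's code; the demo frame reuses the three preview rows shown earlier, so the numbers are only illustrative.

```python
import pandas as pd

demo = pd.DataFrame({
    "esg_cluster": [2, 2, 1],
    "e_score": [70.6, 68.1, 59.7],
    "s_score": [65.6, 71.0, 60.9],
    "g_score": [66.4, 73.0, 65.0],
})
label_cols = ["e_score", "s_score", "g_score"]

# One row per cluster, with mean, sample std, and row count for each label.
summary = demo.groupby("esg_cluster")[label_cols].agg(["mean", "std", "count"])
print(summary.round(2))
```

Note that the sample standard deviation of a cluster with a single row is `NaN`, which is worth keeping in mind whenever cluster statistics feed a random sampler.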

In [13]:
# Prepare the data with real labels
print("=== PREPARING DATA with REAL LABELS ===")

# Define the features and targets with real labels
exclude_cols = ['filename', 'esg_cluster', 'e_score', 's_score', 'g_score']
feature_cols = [col for col in augmented_esg_data.columns if col not in exclude_cols]
X = augmented_esg_data[feature_cols]

print(f"📊 Features prepared:")
print(f"  Number of features: {len(feature_cols)}")
print(f"  Data shape: {X.shape}")
print(f"  Sample features: {feature_cols[:8]}...")

print(f"\n📈 Real Target statistics:")
label_cols = ['e_score', 's_score', 'g_score']
for label_col in label_cols:
    scores = augmented_esg_data[label_col]
    print(f"  {label_col}: μ={scores.mean():.1f}, σ={scores.std():.1f}, range=[{scores.min():.1f}, {scores.max():.1f}]")

print(f"\n✅ Data preparation with real labels completed!")
print(f"\n📊 FINAL AUGMENTED DATASET with REAL LABELS:")
print(f"   • Original dataset: {len(esg_data)} companies")
print(f"   • Augmented dataset: {len(augmented_esg_data)} samples ({len(augmented_esg_data)/len(esg_data):.1f}x increase)")
print(f"   • Features: {len(feature_cols)} columns")
print(f"   • Real labels: E, S, G scores")

=== PREPARING DATA with REAL LABELS ===
📊 Features prepared:
  Number of features: 58
  Data shape: (351, 58)
  Sample features: ['pos_env_climate_action', 'neg_env_climate_action', 'pos_env_energy_transition', 'neg_env_energy_transition', 'pos_env_water_stewardship', 'neg_env_water_stewardship', 'pos_env_biodiversity_nature', 'neg_env_biodiversity_nature']...

📈 Real Target statistics:
  e_score: μ=77.3, σ=13.8, range=[34.6, 98.4]
  s_score: μ=77.1, σ=12.9, range=[34.1, 92.4]
  g_score: μ=80.2, σ=11.2, range=[56.3, 99.3]

✅ Data preparation with real labels completed!

📊 FINAL AUGMENTED DATASET with REAL LABELS:
   • Original dataset: 49 companies
   • Augmented dataset: 351 samples (7.2x increase)
   • Features: 58 columns
   • Real labels: E, S, G scores


In [14]:
# Save the augmented dataset with real labels
print("=== SAVING AUGMENTED DATASET with REAL LABELS ===")

# Save the complete augmented dataset with real labels
output_filename = 'augmented_esg_dataset_with_real_labels.csv'
augmented_esg_data.to_csv(output_filename, index=False)

print(f"✅ Saved augmented dataset: {output_filename}")
print(f"   • Shape: {augmented_esg_data.shape}")
print(f"   • Columns: {list(augmented_esg_data.columns)}")

# Display sample of the final dataset
print(f"\n📋 Sample of augmented dataset with real labels:")
sample_cols = ['filename', 'total_esg_mentions', 'esg_pos_ratio', 'esg_cluster', 'e_score', 's_score', 'g_score']
if all(col in augmented_esg_data.columns for col in sample_cols):
    print(augmented_esg_data[sample_cols].head())

print(f"\n🎉 DATASET AUGMENTATION with REAL LABELS COMPLETED!")
print(f"🚀 Ready for XGBoost training with real E, S, G labels!")

# Final summary
print(f"\n📊 FINAL SUMMARY:")
print(f"   • Original dataset: {len(esg_data)} companies")
print(f"   • Augmented dataset: {len(augmented_esg_data)} samples ({len(augmented_esg_data)/len(esg_data):.1f}x increase)")
print(f"   • Features: {len([col for col in augmented_esg_data.columns if col not in ['filename', 'esg_cluster', 'e_score', 's_score', 'g_score']])} feature columns")
print(f"   • Real labels: E ({augmented_esg_data['e_score'].mean():.1f}±{augmented_esg_data['e_score'].std():.1f}), S ({augmented_esg_data['s_score'].mean():.1f}±{augmented_esg_data['s_score'].std():.1f}), G ({augmented_esg_data['g_score'].mean():.1f}±{augmented_esg_data['g_score'].std():.1f})")
print(f"   • Integer constraints: Maintained for all integer feature columns")
print(f"   • Output file: {output_filename}")

# Validation summary
print(f"\n🔍 VALIDATION SUMMARY:")
ratio_cols = ['esg_pos_ratio', 'esg_neg_ratio']
label_cols = ['e_score', 's_score', 'g_score']
metadata_cols = ['filename', 'esg_cluster']
integer_cols = [col for col in augmented_esg_data.columns 
                if col not in ratio_cols + label_cols + metadata_cols]

print(f"   • Integer columns: {len(integer_cols)} (should all be integers)")
print(f"   • Ratio columns: {len(ratio_cols)} (float values in [0,1])")
print(f"   • Real labels: {len(label_cols)} (E, S, G scores)")
print(f"   • No esg_tier column: {'✅' if 'esg_tier' not in augmented_esg_data.columns else '❌'}")
print(f"   • esg_cluster preserved: {'✅' if 'esg_cluster' in augmented_esg_data.columns else '❌'}")

=== SAVING AUGMENTED DATASET with REAL LABELS ===
✅ Saved augmented dataset: augmented_esg_dataset_with_real_labels.csv
   • Shape: (351, 63)
   • Columns: ['filename', 'pos_env_climate_action', 'neg_env_climate_action', 'pos_env_energy_transition', 'neg_env_energy_transition', 'pos_env_water_stewardship', 'neg_env_water_stewardship', 'pos_env_biodiversity_nature', 'neg_env_biodiversity_nature', 'pos_env_pollution_prevention', 'neg_env_pollution_prevention', 'pos_env_circular_economy', 'neg_env_circular_economy', 'pos_env_sustainable_practices', 'neg_env_sustainable_practices', 'pos_social_diversity_inclusion', 'neg_social_diversity_inclusion', 'pos_social_workforce_development', 'neg_social_workforce_development', 'pos_social_health_safety', 'neg_social_health_safety', 'pos_social_human_rights', 'neg_social_human_rights', 'pos_social_community_engagement', 'neg_social_community_engagement', 'pos_social_customer_stakeholder', 'neg_social_customer_stakeholder', 'pos_social_financ