# Exploratory Analysis: Patterns & Relationships

## Objective

Explore relationships and patterns in the data:
1. **Feature correlations** - Which ratios move together?
2. **Multicollinearity** - Do we have redundant features?
3. **Feature importance preview** - Which features discriminate best?
4. **Temporal patterns** - Any time-based biases?

## Key Questions
- Which features are highly correlated (potential multicollinearity)?
- Do certain ratio combinations provide better prediction?
- Are there temporal biases in the data (bankrupt firms observed 2000-2012 vs. surviving firms observed 2007-2013)?

---

In [None]:
# Setup
import sys
sys.path.insert(0, '../..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

from src.bankruptcy_prediction.data import DataLoader, MetadataParser

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

pd.set_option('display.max_columns', 100)

print("✓ Setup complete")

In [None]:
# Load data and metadata
loader = DataLoader()
metadata = MetadataParser.from_default()

df = loader.load_poland(horizon=1, dataset_type='full')
X, y = loader.get_features_target(df)

print(f"Dataset: {len(df):,} samples, {len(X.columns)} features")
print(f"Bankruptcy rate: {y.mean():.2%}")

## 1. Correlation Analysis

Identify highly correlated features to understand multicollinearity.
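
Pairwise correlations can miss collinearity that involves three or more features at once. As a complementary check, the variance inflation factor (VIF) of each feature can be read off the diagonal of the inverse correlation matrix. The following is a minimal sketch on synthetic data, not part of this notebook's pipeline: the variable names are illustrative, and it assumes complete cases (no NaN) and a non-singular correlation matrix.

```python
import numpy as np

# Synthetic example (assumption: stands in for a few financial ratios)
rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b + 0.1 * rng.normal(size=n)  # nearly a linear combination of a and b
d = rng.normal(size=n)                # unrelated feature
X_demo = np.column_stack([a, b, c, d])

# Correlation matrix of the columns
corr_demo = np.corrcoef(X_demo, rowvar=False)

# VIF_j = 1 / (1 - R_j^2) equals the j-th diagonal entry of the inverse correlation matrix
vif = np.diag(np.linalg.inv(corr_demo))

# Largest pairwise |r| off the diagonal
off_diag = np.abs(corr_demo[~np.eye(4, dtype=bool)])

print("max pairwise |r|:", off_diag.max().round(3))
print("VIF per feature :", vif.round(1))
```

In this sketch no pair of features reaches the 0.9 threshold, yet `a`, `b`, and `c` all have large VIFs because `c` is almost fully explained by the other two together.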

In [None]:
# Calculate correlation matrix (sample if too large)
# Use only base features (not missingness indicators)
base_features = [col for col in X.columns if '__isna' not in col]
X_base = X[base_features]

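# Cap the number of features for faster computation if needed
# (this keeps the first 64 columns; it is a truncation, not a random sample)
max_features = 64
if len(base_features) > max_features:
    print(f"Limiting to the first {max_features} of {len(base_features)} features")
    base_features = base_features[:max_features]
    X_base = X[base_features]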

print(f"Calculating correlations for {len(base_features)} features...")
corr_matrix = X_base.corr()
print("✓ Correlation matrix computed")

In [None]:
# Find high correlations
high_corr_threshold = 0.9

# Get upper triangle (avoid duplicates)
upper_tri = np.triu(np.abs(corr_matrix), k=1)
high_corr_pairs = []

for i in range(len(corr_matrix)):
    for j in range(i+1, len(corr_matrix)):
        if upper_tri[i, j] >= high_corr_threshold:
            feat1 = corr_matrix.columns[i]
            feat2 = corr_matrix.columns[j]
            corr_val = corr_matrix.iloc[i, j]
            
            high_corr_pairs.append({
                'Feature_1': feat1,
                'Feature_2': feat2,
                'Correlation': corr_val,
                'Name_1': metadata.get_readable_name(feat1, short=True),
                'Name_2': metadata.get_readable_name(feat2, short=True),
            })

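# Explicit columns keep the sort from failing when no pair crosses the threshold;
# sorting by |r| ranks strong negative correlations alongside positive ones
high_corr_df = pd.DataFrame(
    high_corr_pairs,
    columns=['Feature_1', 'Feature_2', 'Correlation', 'Name_1', 'Name_2']
).sort_values('Correlation', key=np.abs, ascending=False)

print(f"\n📊 Found {len(high_corr_df)} feature pairs with |r| ≥ {high_corr_threshold}:\n")
if len(high_corr_df) > 0:
    display(high_corr_df.head(20))
else:
    print("✓ No extreme multicollinearity (good for linear models)")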

### Interpretation:

**High correlations indicate:**
- Features measuring similar concepts (e.g., different profit margins)
- Potential multicollinearity for linear models
- Candidates for feature selection

**For modeling:**
- **Random Forests:** Predictive accuracy is largely robust to correlated features, but importance scores get split among them
- **Logistic Regression:** Remove one feature from each highly correlated pair
- **GLM:** Same as logistic regression; correlation inflates standard errors
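
One common way to act on the advice for linear models is to drop the later column of every pair whose absolute correlation crosses the threshold. The sketch below uses synthetic data; `demo`, `threshold`, and the column names are illustrative assumptions, not objects from this notebook. Which member of a pair to keep is a judgment call, and this version simply keeps whichever column comes first.

```python
import numpy as np
import pandas as pd

# Synthetic frame (assumption: 'b' is a near-duplicate of 'a', 'c' is independent)
rng = np.random.default_rng(1)
n = 1000
a = rng.normal(size=n)
demo = pd.DataFrame({
    'a': a,
    'b': a + 0.05 * rng.normal(size=n),
    'c': rng.normal(size=n),
})

threshold = 0.9  # same cut-off idea as high_corr_threshold

# Absolute correlations, keeping only the strict upper triangle
abs_corr = demo.corr().abs()
mask = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
upper = abs_corr.where(mask)

# A column is dropped if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
demo_pruned = demo.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Kept   :", list(demo_pruned.columns))
```

Because the rule only looks at column order, reordering the columns (for example, by discriminative power) before pruning changes which feature of each pair survives.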

In [None]:
# Visualize correlation heatmap for the 30 highest-variance features
# Variance is only a convenient selection heuristic here, not a measure of importance
feature_vars = X_base.var().sort_values(ascending=False)
top_features = feature_vars.head(30).index.tolist()

corr_subset = X_base[top_features].corr()

# Rename for readable plot
readable_names = [metadata.get_readable_name(f, short=True) for f in top_features]
corr_subset.index = readable_names
corr_subset.columns = readable_names

plt.figure(figsize=(16, 14))
sns.heatmap(corr_subset, 
            annot=False,
            cmap='RdBu_r', 
            center=0,
            vmin=-1, vmax=1,
            square=True,
            linewidths=0.5,
            cbar_kws={'label': 'Correlation'})

plt.title('Feature Correlation Heatmap (Top 30 by Variance)', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('../../results/figures/correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved: results/figures/correlation_heatmap.png")

## 2. Discriminative Power Analysis

Quick preview: Which features best separate bankrupt from healthy firms?

In [None]:
# Calculate point-biserial correlation (continuous vs binary)
discriminative_power = {}

for col in base_features:
    # Remove NaN for correlation
    valid_mask = ~X_base[col].isna()
    if valid_mask.sum() < 100:  # Skip if too few valid values
        continue
    
    x_clean = X_base.loc[valid_mask, col]
    y_clean = y[valid_mask]
    
    # Point-biserial correlation
    corr, pval = stats.pointbiserialr(y_clean, x_clean)
    
    discriminative_power[col] = {
        'correlation': corr,
        'abs_correlation': abs(corr),
        'p_value': pval,
        'readable_name': metadata.get_readable_name(col, short=True),
        'category': metadata.get_category(col)
    }

disc_df = pd.DataFrame.from_dict(discriminative_power, orient='index')
disc_df = disc_df.sort_values('abs_correlation', ascending=False)

print("\n📊 Top 20 Most Discriminative Features:\n")
display(disc_df[['readable_name', 'category', 'correlation', 'p_value']].head(20))

### Interpretation:

**Point-biserial correlation shows:**
- **Negative correlation:** Feature is lower in bankrupt firms (e.g., profitability)
- **Positive correlation:** Feature is higher in bankrupt firms (e.g., leverage)
- **Magnitude:** Strength of relationship

**Expected patterns:**
- Profitability ratios: Negative (bankrupt firms less profitable)
- Leverage ratios: Positive (bankrupt firms more leveraged)
- Liquidity ratios: Negative (bankrupt firms less liquid)
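
The sign convention above follows from how the point-biserial coefficient is built: it is a scaled difference between the two group means, and it equals the ordinary Pearson correlation between the feature and the 0/1 label. A minimal sketch on synthetic data (the 10% event rate and the shifted mean are assumptions chosen for illustration, not values from this dataset):

```python
import numpy as np

# Synthetic data: label 1 plays the role of "bankrupt"
rng = np.random.default_rng(42)
n = 1000
y_demo = (rng.random(n) < 0.1).astype(int)
# Feature is lower on average when y_demo == 1 (like a profitability ratio)
x_demo = rng.normal(loc=np.where(y_demo == 1, -1.0, 0.0))

# Point-biserial from its definition:
# r_pb = (mean_1 - mean_0) / s_x * sqrt(p * (1 - p)), with s_x the population std
m1 = x_demo[y_demo == 1].mean()
m0 = x_demo[y_demo == 0].mean()
p = y_demo.mean()
r_pb = (m1 - m0) / x_demo.std(ddof=0) * np.sqrt(p * (1 - p))

# Pearson correlation between the feature and the binary label
r_pearson = np.corrcoef(x_demo, y_demo)[0, 1]

print(f"group means: bankrupt={m1:.3f}, healthy={m0:.3f}")
print(f"r_pb={r_pb:.4f}, pearson={r_pearson:.4f}")
```

Since the coefficient is scaled by `sqrt(p * (1 - p))`, a rare positive class keeps its magnitude modest even when the group means differ clearly, which is worth remembering when reading the ranking.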

In [None]:
# Visualize discriminative power by category
top_20 = disc_df.head(20).copy()

fig, ax = plt.subplots(figsize=(12, 8))

# Color by category
category_colors = {
    'Profitability': '#3498db',
    'Liquidity': '#2ecc71',
    'Leverage': '#e74c3c',
    'Activity': '#f39c12',
    'Size': '#9b59b6',
    'Other': '#95a5a6'
}

colors = [category_colors.get(cat, '#95a5a6') for cat in top_20['category']]

bars = ax.barh(range(len(top_20)), top_20['abs_correlation'], color=colors, alpha=0.8)
ax.set_yticks(range(len(top_20)))
ax.set_yticklabels(top_20['readable_name'])
ax.set_xlabel('Absolute Correlation with Bankruptcy', fontweight='bold')
ax.set_title('Top 20 Most Discriminative Features', fontsize=14, fontweight='bold')
ax.grid(axis='x', alpha=0.3)

# Add legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=color, label=cat) 
                   for cat, color in category_colors.items() 
                   if cat in top_20['category'].values]
ax.legend(handles=legend_elements, loc='lower right')

plt.tight_layout()
plt.savefig('../../results/figures/discriminative_power.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved: results/figures/discriminative_power.png")

## 3. Category-wise Analysis

How do different ratio categories contribute to discrimination?

In [None]:
# Average discriminative power by category
category_disc = disc_df.groupby('category')['abs_correlation'].agg(['mean', 'median', 'std', 'count'])
category_disc = category_disc.sort_values('mean', ascending=False)

print("\n📊 Discriminative Power by Category:\n")
display(category_disc)

In [None]:
# Box plot of discriminative power by category
fig, ax = plt.subplots(figsize=(12, 6))

categories = disc_df['category'].unique()
data_by_cat = [disc_df[disc_df['category'] == cat]['abs_correlation'].values 
               for cat in categories]

bp = ax.boxplot(data_by_cat, patch_artist=True)
# Label the boxes on the axis; the `labels` keyword is deprecated since Matplotlib 3.9
ax.set_xticklabels(categories)

for patch, cat in zip(bp['boxes'], categories):
    patch.set_facecolor(category_colors.get(cat, '#95a5a6'))
    patch.set_alpha(0.7)

ax.set_ylabel('Absolute Correlation with Bankruptcy', fontweight='bold')
ax.set_title('Discriminative Power by Feature Category', fontsize=14, fontweight='bold')
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('../../results/figures/discriminative_power_by_category.png', dpi=300, bbox_inches='tight')
plt.show()

print("✓ Saved: results/figures/discriminative_power_by_category.png")

### Interpretation:

**Category contributions:**
- **Profitability & Leverage** typically show the highest discriminative power
- **Liquidity** is important but often gives a weaker signal
- **Activity** ratios capture operational efficiency
- **Size** usually has limited predictive power on its own

**For modeling:**
- Focus feature engineering on high-power categories
- Don't discard low-power features yet (combinations matter)
- Consider category-specific models or weights
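
To illustrate why low-power features should not be discarded on univariate evidence alone, the sketch below builds two synthetic features that each correlate only weakly with the label, while their difference correlates strongly. The data-generating process is an assumption made purely for illustration and does not describe this dataset.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# Two features dominated by a shared component z, plus small independent noise
z = rng.normal(size=n)
x1 = z + 0.1 * rng.normal(size=n)
x2 = z + 0.1 * rng.normal(size=n)

# The label depends only on the small gap between the two features
y_demo = (x1 - x2 > 0).astype(int)

# Univariate correlations with the label are weak...
r_x1 = np.corrcoef(x1, y_demo)[0, 1]
r_x2 = np.corrcoef(x2, y_demo)[0, 1]
# ...while the engineered combination is strongly related to it
r_diff = np.corrcoef(x1 - x2, y_demo)[0, 1]

print(f"r(x1, y)      = {r_x1:+.3f}")
print(f"r(x2, y)      = {r_x2:+.3f}")
print(f"r(x1 - x2, y) = {r_diff:+.3f}")
```

Note that `x1` and `x2` are also highly correlated with each other, so a pruning rule based on pairwise correlation would remove one of them and lose the signal carried by their difference.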

## Summary & Next Steps

### Key Findings:

1. **Multicollinearity**
   - [X] highly correlated pairs identified
   - Feature selection needed for linear models
   - Tree models can handle correlation

2. **Discriminative Features**
   - Top predictors identified (profitability, leverage)
   - Clear separation between categories
   - Strong signals available for prediction

3. **Category Patterns**
   - Profitability & Leverage most important
   - Comprehensive coverage across financial dimensions
   - Good foundation for model building

### Next Steps:

1. **Data Preparation** (`03_data_preparation.ipynb`)
   - Handle correlations for linear models
   - Scaling and normalization
   - Train/test splits

2. **Modeling** (`04_baseline_models.ipynb`)
   - Use identified discriminative features
   - Compare model families
   - Baseline performance

In [None]:
# Save analysis results
if len(high_corr_df) > 0:
    high_corr_df.to_csv('../../results/evaluation/high_correlations.csv', index=False)
    print("✓ Saved: results/evaluation/high_correlations.csv")

disc_df.to_csv('../../results/evaluation/discriminative_power.csv')
print("✓ Saved: results/evaluation/discriminative_power.csv")

print("\n✓ Exploratory analysis complete!")
print("\nNext: 03_data_preparation.ipynb")