# ESG and Financial Performance Clustering Analysis

## Project Overview
This notebook applies clustering to ESG (Environmental, Social, and Governance) and financial performance data to identify distinct company profiles. The goal is to understand how companies group by their ESG scores and financial metrics.

**Objectives:**
- Identify distinct clusters of companies based on ESG and financial performance
- Analyze relationships between ESG scores and financial metrics
- Provide actionable insights for investment and sustainability strategies

## 1. Library Imports and Setup
Setting up all required libraries for data manipulation, visualization, and clustering analysis.

In [12]:
# Standard Data Manipulation and Analysis
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Statistical Analysis
import scipy.stats as stats
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import pdist, squareform

# Machine Learning - Clustering
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering, SpectralClustering
from sklearn.mixture import GaussianMixture

# Machine Learning - Preprocessing and Metrics
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Clustering Validation and Analysis
from kneed import KneeLocator
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

# Utilities
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

print("All clustering and EDA modules imported successfully!")

All clustering and EDA modules imported successfully!


## 2. Data Loading and Initial Inspection
Loading the ESG and financial performance dataset and performing initial data exploration.

In [13]:
# Load the dataset
# Dataset is now available in the Data directory
try:
    # Try loading from Data directory
    data = pd.read_csv('../Data/company_esg_financial_dataset.csv')
    print("Dataset loaded successfully from ../Data/")
except FileNotFoundError:
    try:
        # Try loading from current directory (if file was moved)
        data = pd.read_csv('company_esg_financial_dataset.csv')
        print("Dataset loaded successfully from current directory")
    except FileNotFoundError:
        print("Dataset not found. Please make sure the file exists.")
        data = None

if data is not None:
    print(f"Dataset shape: {data.shape}")
    print(f"Columns: {list(data.columns)}")
    print(f"Memory usage: {data.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    print("\nFirst few rows:")
    print(data.head())

Dataset loaded successfully from ../Data/
Dataset shape: (11000, 16)
Columns: ['CompanyID', 'CompanyName', 'Industry', 'Region', 'Year', 'Revenue', 'ProfitMargin', 'MarketCap', 'GrowthRate', 'ESG_Overall', 'ESG_Environmental', 'ESG_Social', 'ESG_Governance', 'CarbonEmissions', 'WaterUsage', 'EnergyConsumption']
Memory usage: 3.19 MB

First few rows:
   CompanyID CompanyName Industry         Region  Year  Revenue  ProfitMargin  \
0          1   Company_1   Retail  Latin America  2015    459.2           6.0   
1          1   Company_1   Retail  Latin America  2016    473.8           4.6   
2          1   Company_1   Retail  Latin America  2017    564.9           5.2   
3          1   Company_1   Retail  Latin America  2018    558.4           4.3   
4          1   Company_1   Retail  Latin America  2019    554.5           4.9   

   MarketCap  GrowthRate  ESG_Overall  ESG_Environmental  ESG_Social  \
0      337.5         NaN         57.0               60.7        33.5   
1      366.6     

## 3. Exploratory Data Analysis (EDA)

### 3.1 Dataset Overview and Basic Statistics
Understanding the structure, data types, and basic statistical properties of the dataset.

### 3.2 Missing Values and Data Quality Assessment
Identifying and handling missing values, outliers, and data quality issues.
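
A minimal sketch of the missing-value check could look like the following. The `toy` frame is a hypothetical stand-in that borrows a few column names from the dataset. Its values are illustrative only, apart from the first-year `GrowthRate` being `NaN`, which mirrors the preview in Section 2.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "CompanyID": [1, 1, 2, 2],
    "Year": [2015, 2016, 2015, 2016],
    "Revenue": [459.2, 473.8, 120.0, 130.5],
    "GrowthRate": [np.nan, 3.2, np.nan, 8.8],  # first year of each company has no growth rate
})

# Count and percentage of missing values per column, worst columns first.
missing_summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": toy.isna().mean() * 100,
}).sort_values("missing_count", ascending=False)

print(missing_summary)
```

On the real data, the same two expressions can be applied to `data` directly.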

### 3.3 Distribution Analysis
Analyzing the distribution of key variables including financial metrics and ESG scores.

### 3.4 Correlation Analysis
Examining relationships between financial performance metrics and ESG scores.
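
As a sketch of this step, the snippet below computes a Spearman correlation matrix across ESG and financial columns. The data is synthetic and the positive link between `ESG_Overall` and `ProfitMargin` is built in on purpose for illustration; it says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data; the ESG/margin relationship is an assumption baked in for the demo.
esg = rng.uniform(20, 90, n)
toy = pd.DataFrame({
    "ESG_Overall": esg,
    "ProfitMargin": 0.1 * esg + rng.normal(0, 2, n),
    "Revenue": rng.lognormal(6, 1, n),
})

# Spearman is rank-based, so it is less sensitive to skewed columns such as Revenue.
corr = toy.corr(method="spearman")
print(corr.round(2))
```

The resulting matrix can be passed to `sns.heatmap(corr, annot=True)` for a visual overview.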

### 3.5 Industry and Regional Analysis
Exploring patterns across different industries and geographical regions.

## 4. Data Preprocessing for Clustering

### 4.1 Feature Selection
Selecting the most relevant features for clustering analysis.

### 4.2 Data Cleaning and Outlier Treatment
Handling missing values, outliers, and data inconsistencies.

### 4.3 Feature Scaling and Normalization
Standardizing features to ensure equal contribution to clustering algorithms.
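
To illustrate why scaling matters, the sketch below standardizes the `Revenue` and `ProfitMargin` values of the five rows shown in the Section 2 preview. Unscaled, `Revenue` would dominate any Euclidean distance because its values are roughly 100 times larger.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Revenue and ProfitMargin for the five preview rows of Company_1.
X = np.array([
    [459.2, 6.0],
    [473.8, 4.6],
    [564.9, 5.2],
    [558.4, 4.3],
    [554.5, 4.9],
])

# After scaling, each column has mean 0 and unit (population) standard deviation.
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.round(2))
```

`RobustScaler`, which the notebook also imports, centers on the median and scales by the interquartile range, which may be preferable if the outlier treatment in 4.2 leaves heavy tails.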

### 4.4 Dimensionality Reduction
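Applying PCA and t-SNE for visualization. PCA components may also serve as clustering inputs, whereas t-SNE embeddings do not preserve global distances and are best reserved for plotting.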
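
A minimal PCA sketch, assuming a scaled numeric feature matrix, might look like this. The matrix here is synthetic (five hypothetical features, two of them strongly correlated) so that the first component visibly captures the shared variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical feature matrix: 4 independent columns plus a near-copy of column 0.
base = rng.normal(size=(100, 4))
X = np.column_stack([base, base[:, 0] + 0.1 * rng.normal(size=100)])

# PCA is sensitive to scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

The explained variance ratio indicates how much information the 2D projection keeps, which helps judge how far a scatter plot of the components can be trusted.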

## 5. Optimal Number of Clusters

### 5.1 Elbow Method
Plotting the within-cluster sum of squares (WCSS) against k and choosing the k at which the curve bends.
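
The elbow computation can be sketched as follows. The data comes from `make_blobs` with four hand-picked centers, an assumption made so that the bend is easy to see; real ESG data will rarely produce such a clean curve.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups (illustrative assumption).
centers = [[0, 0], [8, 8], [-8, 8], [8, -8]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.6, random_state=0)

k_values = list(range(1, 9))
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)  # inertia_ is the WCSS of the fitted model

for k, value in zip(k_values, wcss):
    print(f"k={k}: WCSS={value:.1f}")
```

WCSS always tends to fall as k grows, so the useful signal is where the decrease flattens. The `KneeLocator` class the notebook imports from `kneed` could be used to locate that bend automatically.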

### 5.2 Silhouette Analysis
Evaluating cluster quality using silhouette scores for different k values.
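
A sketch of the silhouette sweep, again on synthetic blobs with four hand-picked centers (an assumption for illustration), could be:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (illustrative assumption).
centers = [[0, 0], [8, 8], [-8, 8], [8, -8]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.6, random_state=0)

# The silhouette score needs at least 2 clusters, so the sweep starts at k=2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print({k: round(s, 3) for k, s in scores.items()})
print("Best k by silhouette:", best_k)
```

Scores range from -1 to 1; values near 1 indicate compact, well-separated clusters.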

### 5.3 Gap Statistic and Other Methods
Cross-checking the choice of k with the gap statistic and other criteria.

## 6. Clustering Algorithm Implementation

### 6.1 K-Means Clustering
Implementing K-Means clustering with optimal parameters.
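
One way to implement this step is a scikit-learn pipeline that scales the features and then fits K-Means, as sketched below. The feature columns reuse names from the dataset, but the values are randomly generated, and `k = 3` is a placeholder rather than a result of the analysis in Section 5.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for the selected clustering features.
features = pd.DataFrame({
    "ProfitMargin": rng.normal(8, 4, 150),
    "GrowthRate": rng.normal(5, 3, 150),
    "ESG_Overall": rng.uniform(20, 90, 150),
})

k = 3  # placeholder; in practice chosen in Section 5
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=k, n_init=10, random_state=42),
)

labels = pipeline.fit_predict(features)
clustered = features.assign(Cluster=labels)

# Centroids are expressed in scaled feature space.
centroids = pipeline.named_steps["kmeans"].cluster_centers_
print(clustered["Cluster"].value_counts().sort_index())
```

Keeping the scaler inside the pipeline ensures that any new rows are transformed with the same parameters before being assigned to a cluster via `pipeline.predict`.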

### 6.2 Hierarchical Clustering
Implementing agglomerative and divisive hierarchical clustering methods.
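
The sketch below covers only the agglomerative variant, using the SciPy functions imported in Section 1. It builds a Ward linkage matrix and cuts the tree into a fixed number of clusters. The three blob centers are an assumption chosen to make the grouping unambiguous.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with three well-separated groups (illustrative assumption).
centers = [[0, 0], [10, 0], [0, 10]]
X, y_true = make_blobs(n_samples=90, centers=centers, cluster_std=0.5, random_state=0)

# Ward linkage merges, at each step, the pair of clusters that least increases WCSS.
Z = linkage(X, method="ward")

# Cut the tree so that at most 3 flat clusters remain (labels start at 1).
labels = fcluster(Z, t=3, criterion="maxclust")

print("Agreement with generating labels (ARI):", adjusted_rand_score(y_true, labels))
```

Passing `Z` to `dendrogram` draws the merge tree, which can help pick the cut height visually.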

### 6.3 DBSCAN Clustering
Density-based clustering to identify clusters of varying shapes and sizes.
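
As an illustration of the density-based idea, the sketch below runs DBSCAN on the two-moons toy dataset, a shape that centroid-based methods handle poorly. The `eps` and `min_samples` values are assumptions tuned to this toy data and would need re-tuning on scaled ESG features.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-circles with light noise.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: points needed to form a dense core.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# DBSCAN marks points that belong to no dense region with the label -1.
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))

print(f"Clusters found: {n_clusters}, noise points: {n_noise}")
```

Unlike K-Means, DBSCAN does not take the number of clusters as input; it follows from the density parameters.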

### 6.4 Gaussian Mixture Models
Probabilistic clustering using Gaussian Mixture Models.
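
A minimal sketch of the probabilistic view: a Gaussian Mixture Model returns, for each row, a membership probability per component rather than a single hard label. The data is synthetic and the choice of three components is an assumption.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with three groups (illustrative assumption).
centers = [[0, 0], [6, 6], [-6, 6]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

probs = gmm.predict_proba(X)   # shape (n_samples, n_components), rows sum to 1
hard_labels = gmm.predict(X)   # most probable component per row

print("Membership probabilities of the first row:", probs[0].round(3))
print("BIC:", round(gmm.bic(X), 1))
```

Comparing `gmm.bic(X)` across several values of `n_components` is one common way to choose the number of components; lower is better.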

## 7. Cluster Validation and Evaluation

### 7.1 Internal Validation Metrics
Evaluating clusters using silhouette score, Calinski-Harabasz index, and Davies-Bouldin index.
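
The three metrics can be computed from a feature matrix and a label vector, as in the sketch below. The data and the K-Means labeling are synthetic placeholders.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data with three groups (illustrative assumption).
centers = [[0, 0], [7, 7], [-7, 7]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

metrics = {
    "silhouette": silhouette_score(X, labels),                # higher is better, in [-1, 1]
    "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better, minimum 0
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

All three need only the data and the labels, so they apply to any of the algorithms in Section 6. For DBSCAN, noise points (label -1) are usually excluded first.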

### 7.2 Algorithm Comparison
Comparing the performance of different clustering algorithms.
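
One possible comparison, sketched on synthetic data, scores each algorithm with the silhouette coefficient and measures how much two labelings agree using the adjusted Rand index (ARI). The two models and their settings are illustrative choices.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data with three groups (illustrative assumption).
centers = [[0, 0], [7, 7], [-7, 7]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative (Ward)": AgglomerativeClustering(n_clusters=3, linkage="ward"),
}

labelings = {name: model.fit_predict(X) for name, model in models.items()}

for name, labels in labelings.items():
    print(f"{name}: silhouette={silhouette_score(X, labels):.3f}")

# ARI ignores label numbering: 1.0 means identical partitions, about 0 means chance-level agreement.
agreement = adjusted_rand_score(labelings["KMeans"], labelings["Agglomerative (Ward)"])
print(f"ARI between the two labelings: {agreement:.3f}")
```

High agreement between independent algorithms is a sign that the cluster structure is stable rather than an artifact of one method.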

## 8. Cluster Analysis and Interpretation

### 8.1 Cluster Characteristics
Analyzing the characteristics of each identified cluster.

### 8.2 Cluster Profiling
Creating detailed profiles for each cluster including ESG and financial characteristics.
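
Cluster profiles can be sketched as a grouped aggregation over the labeled rows. The `clustered` frame below is a hypothetical five-row example with made-up values and an assumed `Cluster` column holding the labels.

```python
import pandas as pd

# Hypothetical labeled rows (values are made up for illustration).
clustered = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 1],
    "ESG_Overall": [60.0, 70.0, 30.0, 40.0, 50.0],
    "ProfitMargin": [10.0, 12.0, 4.0, 5.0, 6.0],
})

# One row per cluster: size plus the mean of each feature.
profile = clustered.groupby("Cluster").agg(
    n_rows=("ESG_Overall", "size"),
    esg_mean=("ESG_Overall", "mean"),
    margin_mean=("ProfitMargin", "mean"),
)

print(profile)
```

Reporting means on the original (unscaled) features keeps the profiles interpretable in business terms.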

### 8.3 Industry and Regional Patterns
Analyzing how clusters relate to industry sectors and geographical regions.
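
A cross-tabulation is one simple way to relate clusters to a categorical column. In the sketch below, `Industry` is a real column name from the dataset, while the rows, the "Energy" category, and the `Cluster` labels are hypothetical.

```python
import pandas as pd

# Hypothetical labeled rows (illustrative only).
toy = pd.DataFrame({
    "Industry": ["Retail", "Retail", "Energy", "Energy"],
    "Cluster": [0, 1, 1, 1],
})

# Share of each industry's rows that falls into each cluster (rows sum to 1).
share = pd.crosstab(toy["Industry"], toy["Cluster"], normalize="index")

print(share)
```

The same call with `toy["Region"]` in place of `toy["Industry"]` would give the regional breakdown.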

## 9. Visualization and Insights

### 9.1 Cluster Visualization
Creating comprehensive visualizations of clusters in 2D and 3D space.

### 9.2 Interactive Dashboards
Creating interactive visualizations for cluster exploration.

### 9.3 Business Intelligence Insights
Extracting actionable business insights from the clustering results.

## 10. Conclusions and Recommendations

### 10.1 Key Findings
Summarizing the main discoveries from the clustering analysis.

### 10.2 Investment and Business Strategy Recommendations
Providing actionable recommendations based on cluster characteristics.

### 10.3 Future Work and Limitations
Discussing limitations of the current analysis and suggestions for future improvements.