## EDA of inx_Future_Employee_Performance

This script defines an **Exploratory Data Analysis (EDA) system** in Python. It loads an employee performance dataset and generates a suite of static charts, a summary-statistics file, and an interactive dashboard. The workflow is packaged in an `EDAVisualizer` class, so a full EDA runs with a single call to `run_all_eda()`.

### **Initialization**
- Loads the dataset from the given path.
- Creates an output folder (`eda_outputs/`) to store all generated charts and files.
- Prints the number of rows and columns loaded.
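
The initialization steps above can be sketched as a small standalone function. This is a minimal illustration rather than the class itself: `load_dataset` is a hypothetical helper name, and the two-row CSV is made up purely to exercise the call.

```python
import os
import tempfile

import pandas as pd


def load_dataset(data_path, output_dir="eda_outputs/"):
    """Read a CSV and make sure the output folder exists (hypothetical helper)."""
    df = pd.read_csv(data_path)
    os.makedirs(output_dir, exist_ok=True)  # no error if the folder already exists
    print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
    return df


# Tiny made-up CSV, written to a temporary folder only to demonstrate the call
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    pd.DataFrame({"Age": [30, 41], "PerformanceRating": [3, 4]}).to_csv(csv_path, index=False)
    toy_df = load_dataset(csv_path, output_dir=os.path.join(tmp, "eda_outputs"))
    out_dir_exists = os.path.isdir(os.path.join(tmp, "eda_outputs"))
```

Passing `exist_ok=True` makes the function safe to call repeatedly, which matters when a notebook cell is re-run.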

---

### **Target Variable Analysis**
- `plot_target_distribution()`
  - Creates bar and pie charts showing how employee performance ratings are distributed.
  - Saves `target_distribution.png`.
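
Both charts are driven by the same table of rating counts. A minimal sketch of that table, using invented ratings rather than the real dataset:

```python
import pandas as pd

# Invented ratings, for illustration only
ratings = pd.Series([2, 3, 3, 3, 4, 3, 2, 4, 3, 3], name="PerformanceRating")

# Counts per rating, ordered by rating value (the bar chart heights)
counts = ratings.value_counts().sort_index()

# Share of each rating in percent (what the pie chart labels show)
percentages = (counts / counts.sum() * 100).round(1)

print(pd.DataFrame({"count": counts, "percent": percentages}))
```

Sorting by index keeps the bars in rating order instead of frequency order.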

---

### **Numerical Feature Analysis**
- `plot_numerical_distributions()`
  - Generates histogram plots for all numerical features.
  - Adds mean and median markers.
  - Saves `numerical_distributions.png`.
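
The histograms are arranged on a subplot grid whose row count the script derives with integer ceiling division. The sketch below isolates that arithmetic; `grid_shape` is a hypothetical helper name, not part of the class.

```python
def grid_shape(n_plots, n_cols=4):
    """Return (n_rows, n_cols, n_unused) for a grid that holds n_plots charts."""
    n_rows = (n_plots + n_cols - 1) // n_cols  # ceiling division without math.ceil
    n_unused = n_rows * n_cols - n_plots       # trailing axes to drop with fig.delaxes
    return n_rows, n_cols, n_unused


print(grid_shape(18))           # e.g. 18 numeric features in 4 columns
print(grid_shape(6, n_cols=3))  # a grid that fills exactly
```

The `n_unused` value corresponds to the empty axes that the script deletes after plotting.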

---

### **Categorical Feature Analysis**
- `plot_categorical_distributions()`
  - Creates horizontal count bar charts for every categorical variable.
  - Saves `categorical_distributions.png`.

---

### **Performance vs Categories**
- `plot_performance_by_categories()`
  - Shows average performance rating across categories like gender, education, job role, overtime, etc.
  - Saves `performance_by_categories.png`.
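
Each panel reduces to a group-wise mean sorted from best to worst. A minimal sketch with invented rows (the column names follow the script; the values are made up):

```python
import pandas as pd

# Invented rows, for illustration only
toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "No", "Yes", "No"],
    "PerformanceRating": [2, 3, 4, 3, 3],
})

# Average rating per category, highest first
perf_by_cat = (
    toy.groupby("OverTime")["PerformanceRating"]
       .mean()
       .sort_values(ascending=False)
)
print(perf_by_cat)
```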

---

### **Correlation Analysis**
- `plot_correlation_heatmap()`
  - Creates a lower-triangular heatmap (the upper triangle and the diagonal are masked) of pairwise correlations among all numeric features.
  - Saves `correlation_heatmap.png`.

- `plot_performance_correlations()`
  - Plots how strongly each numeric variable correlates with performance rating.
  - Saves `performance_correlations.png`.
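
Two small computations sit behind these plots: a boolean mask that hides the redundant upper triangle, and the target column of the correlation matrix. The sketch below shows both on an invented three-column frame.

```python
import numpy as np
import pandas as pd

# Invented data: B rises with A, PerformanceRating falls with A
toy = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8],
    "PerformanceRating": [4, 3, 2, 1],
})

corr = toy.corr()

# True marks cells to hide: the upper triangle including the diagonal
mask = np.triu(np.ones_like(corr, dtype=bool))

# Correlation of every other feature with the target, most negative first
target_corr = corr["PerformanceRating"].drop("PerformanceRating").sort_values()

print(mask)
print(target_corr)
```

Because a correlation matrix is symmetric, masking one triangle removes duplicated cells without losing information.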

---

### **Boxplots by Performance**
- `plot_boxplots_by_performance()`
  - Shows how numerical features vary across performance rating groups.
  - Saves `boxplots_by_performance.png`.
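
A box plot summarizes each group by its quartiles, so tabulating those quartiles is a quick way to cross-check a chart. A minimal sketch, assuming invented ages:

```python
import pandas as pd

# Invented values, for illustration only
toy = pd.DataFrame({
    "PerformanceRating": [3, 3, 3, 4, 4],
    "Age": [30, 40, 50, 20, 60],
})

# One row per rating, one column per quantile (0.25, 0.5, 0.75)
quartiles = (
    toy.groupby("PerformanceRating")["Age"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```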

---

### **Satisfaction Metrics Analysis**
- `plot_satisfaction_analysis()`
  - Overlays bars (average performance rating) and a line (employee count) for each level of job satisfaction, environment satisfaction, work-life balance, and job involvement.
  - Saves `satisfaction_analysis.png`.
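
The bars and the line come from a single grouped aggregation that returns the mean and the count side by side. A minimal sketch with invented rows:

```python
import pandas as pd

# Invented rows, for illustration only
toy = pd.DataFrame({
    "JobSatisfaction": [1, 1, 2, 3, 3, 3],
    "PerformanceRating": [2, 3, 3, 3, 4, 4],
})

# 'mean' feeds the bars, 'count' feeds the line on the secondary axis
grouped = toy.groupby("JobSatisfaction")["PerformanceRating"].agg(["mean", "count"])
print(grouped)
```

Showing the count next to the mean helps flag averages that rest on very few employees.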

---

### **Tenure Analysis**
- `plot_tenure_analysis()`
  - Analyzes performance across experience bins (0–1, 1–3, 3–5 years, etc.).
  - Highlights the “sweet spot” with highest performance.
  - Saves `tenure_analysis.png`.
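
One detail of the binning is easy to miss: by default `pd.cut` builds intervals that are open on the left, so a tenure of exactly 0 falls outside the first bin unless `include_lowest=True` is passed. The sketch below compares both behaviours on invented tenures, using the same bin edges and labels as the script.

```python
import pandas as pd

# Invented tenures in years, for illustration only
tenure = pd.Series([0, 1, 2, 4, 7, 10, 20])

bins = [0, 1, 3, 5, 8, 12, 50]
labels = ["0-1", "1-3", "3-5", "5-8", "8-12", "12+"]

# Default: intervals are (0, 1], (1, 3], ... so a tenure of 0 becomes NaN
default_bins = pd.cut(tenure, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 1]
inclusive_bins = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"tenure": tenure, "default": default_bins, "inclusive": inclusive_bins}))
```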

---

### **Department-Level Insights**
- `plot_department_comparison()`
  - Compares performance across departments using:
    - Horizontal bar charts  
    - Violin plots  
    - Size vs performance scatter  
    - High performer percentages  
  - Saves `department_comparison.png`.
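
The high-performer percentage can be computed as the mean of a boolean mask per department. This is a sketch of one way to do it, not necessarily how the class does it; the rows are invented. A department with no high performers then shows 0% rather than a missing value.

```python
import pandas as pd

# Invented rows, for illustration only; HR has no rating of 4 or above
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "HR", "HR", "HR"],
    "PerformanceRating": [4, 3, 3, 2, 3],
})

# The mean of True/False values is the fraction of True values in each group
is_high = toy["PerformanceRating"] >= 4
high_perf_pct = is_high.groupby(toy["Department"]).mean() * 100

print(high_perf_pct.sort_values())
```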

---

### **Interactive Dashboard**
- `create_interactive_dashboard()`
  - Builds a 4-panel interactive dashboard using Plotly:
    - Performance distribution  
    - Department performance  
    - Satisfaction vs performance  
    - Tenure vs performance  
  - Saves `interactive_dashboard.html`.

---

### **Summary Statistics**
- `generate_summary_statistics()`
  - Computes:
    - Average performance  
    - High performer %  
    - Low performer %  
    - Standard deviation  
    - Total employees  
  - Saves `summary_statistics.csv`.
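
The summary is a one-row table. A minimal sketch on invented ratings, using the same thresholds as the script (4 and above counts as high, 2 and below as low):

```python
import pandas as pd

# Invented ratings, for illustration only
ratings = pd.Series([2, 3, 3, 4, 3], name="PerformanceRating")

summary = pd.DataFrame({
    "Total Employees": [len(ratings)],
    "Avg Performance": [ratings.mean()],
    "Std Performance": [ratings.std()],  # pandas default: sample std (ddof=1)
    "High Performers (%)": [(ratings >= 4).mean() * 100],
    "Low Performers (%)": [(ratings <= 2).mean() * 100],
})
print(summary.to_string(index=False))
```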

---


### **Main Function**
- Loads the dataset.
- Creates the `EDAVisualizer` object.
- Runs the full EDA pipeline.

---

### **Overall Purpose**
This code acts as an **automated EDA engine**: it produces static and interactive visualizations that expose patterns, trends, and correlations in employee performance, including department-level differences. The outputs suit analytical reporting, dashboard building, and data preparation for machine learning.





import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

class EDAVisualizer:
    """Create exploratory data analysis visualizations."""

    def __init__(self, data_path):
        """Initialize with dataset path."""
        self.df = pd.read_csv(data_path)
        self.output_dir = "eda_outputs/"

        # Create output directory
        import os
        os.makedirs(self.output_dir, exist_ok=True)

        print(f" Dataset loaded: {self.df.shape[0]} rows, {self.df.shape[1]} columns")

    def plot_target_distribution(self):
        """Visualize performance rating distribution."""
        print("\n Creating target distribution plot...")

        fig, axes = plt.subplots(1, 2, figsize=(14, 5))

        # Count plot
        performance_counts = self.df['PerformanceRating'].value_counts().sort_index()
        colors = ['#FF6B6B', '#FFA500', '#FFD93D', '#6BCF7F', '#4ECDC4']

        axes[0].bar(performance_counts.index, performance_counts.values, color=colors)
        axes[0].set_xlabel('Performance Rating', fontsize=12, fontweight='bold')
        axes[0].set_ylabel('Number of Employees', fontsize=12, fontweight='bold')
        axes[0].set_title('Distribution of Performance Ratings', fontsize=14, fontweight='bold')
        axes[0].grid(axis='y', alpha=0.3)

        # Add count labels
        for i, (idx, val) in enumerate(performance_counts.items()):
            axes[0].text(idx, val + 5, str(val), ha='center', fontweight='bold')

        # Percentage pie chart
        axes[1].pie(performance_counts.values, labels=performance_counts.index,
                   autopct='%1.1f%%', colors=colors, startangle=90)
        axes[1].set_title('Performance Rating Percentage', fontsize=14, fontweight='bold')

        plt.tight_layout()
        plt.savefig(f'{self.output_dir}target_distribution.png', dpi=300, bbox_inches='tight')
        plt.close()

        print(f" Saved: {self.output_dir}target_distribution.png")

    def plot_numerical_distributions(self):
        """Plot distributions of numerical features."""
        print("\n Creating numerical feature distributions...")

        numerical_cols = self.df.select_dtypes(include=[np.number]).columns.tolist()
        numerical_cols = [col for col in numerical_cols if col != 'PerformanceRating']

        n_cols = 4
        n_rows = (len(numerical_cols) + n_cols - 1) // n_cols

        fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, n_rows * 4))
        axes = axes.flatten()

        for idx, col in enumerate(numerical_cols):
            axes[idx].hist(self.df[col].dropna(), bins=30, color='skyblue',
                          edgecolor='black', alpha=0.7)
            axes[idx].set_xlabel(col, fontsize=10, fontweight='bold')
            axes[idx].set_ylabel('Frequency', fontsize=10)
            axes[idx].set_title(f'Distribution of {col}', fontsize=11, fontweight='bold')
            axes[idx].grid(axis='y', alpha=0.3)

            # Add statistics
            mean_val = self.df[col].mean()
            median_val = self.df[col].median()
            axes[idx].axvline(mean_val, color='red', linestyle='--', label=f'Mean: {mean_val:.2f}')
            axes[idx].axvline(median_val, color='green', linestyle='--', label=f'Median: {median_val:.2f}')
            axes[idx].legend(fontsize=8)

        # Remove extra subplots
        for idx in range(len(numerical_cols), len(axes)):
            fig.delaxes(axes[idx])

        plt.tight_layout()
        plt.savefig(f'{self.output_dir}numerical_distributions.png', dpi=300, bbox_inches='tight')
        plt.close()

        print(f" Saved: {self.output_dir}numerical_distributions.png")

    def plot_categorical_distributions(self):
        """Plot distributions of categorical features."""
        print("\n Creating categorical feature distributions...")

        categorical_cols = self.df.select_dtypes(include=['object']).columns.tolist()
        categorical_cols = [col for col in categorical_cols if col != 'PerformanceRating']

        n_cols = 3
        n_rows = (len(categorical_cols) + n_cols - 1) // n_cols

        fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, n_rows * 4))
        axes = axes.flatten()

        for idx, col in enumerate(categorical_cols):
            value_counts = self.df[col].value_counts()
            axes[idx].barh(value_counts.index, value_counts.values, color='coral')
            axes[idx].set_xlabel('Count', fontsize=10, fontweight='bold')
            axes[idx].set_ylabel(col, fontsize=10, fontweight='bold')
            axes[idx].set_title(f'Distribution of {col}', fontsize=11, fontweight='bold')
            axes[idx].grid(axis='x', alpha=0.3)

            # Add count labels
            for i, (cat, count) in enumerate(value_counts.items()):
                axes[idx].text(count + 5, i, str(count), va='center', fontweight='bold')

        # Remove extra subplots
        for idx in range(len(categorical_cols), len(axes)):
            fig.delaxes(axes[idx])

        plt.tight_layout()
        plt.savefig(f'{self.output_dir}categorical_distributions.png', dpi=300, bbox_inches='tight')
        plt.close()

        print(f" Saved: {self.output_dir}categorical_distributions.png")

    def plot_performance_by_categories(self):
        """Plot performance ratings across categorical variables."""
        print("\n Creating performance by category plots...")

        categorical_cols = ['Department', 'Gender', 'Education', 'OverTime', 'JobRole']
        categorical_cols = [col for col in categorical_cols if col in self.df.columns]

        fig, axes = plt.subplots(2, 3, figsize=(18, 10))
        axes = axes.flatten()

        for idx, col in enumerate(categorical_cols):
            perf_by_cat = self.df.groupby(col)['PerformanceRating'].mean().sort_values(ascending=False)

            axes[idx].barh(perf_by_cat.index, perf_by_cat.values,
                          color=plt.cm.RdYlGn(perf_by_cat.values / 5))
            axes[idx].set_xlabel('Average Performance Rating', fontsize=10, fontweight='bold')
            axes[idx].set_ylabel(col, fontsize=10, fontweight='bold')
            axes[idx].set_title(f'Avg Performance by {col}', fontsize=11, fontweight='bold')
            axes[idx].axvline(x=3.0, color='red', linestyle='--', alpha=0.5, label='Baseline (3.0)')
            axes[idx].grid(axis='x', alpha=0.3)
            axes[idx].legend()

            # Add value labels
            for i, (cat, val) in enumerate(perf_by_cat.items()):
                axes[idx].text(val + 0.05, i, f'{val:.2f}', va='center', fontweight='bold')

        # Remove extra subplot
        if len(categorical_cols) < len(axes):
            for idx in range(len(categorical_cols), len(axes)):
                fig.delaxes(axes[idx])

        plt.tight_layout()
        plt.savefig(f'{self.output_dir}performance_by_categories.png', dpi=300, bbox_inches='tight')
        plt.close()

        print(f" Saved: {self.output_dir}performance_by_categories.png")

    def plot_correlation_heatmap(self):
        """Create correlation heatmap for numerical features."""
        print("\n Creating correlation heatmap...")

        # Select numerical columns
        numerical_df = self.df.select_dtypes(include=[np.number])

        # Calculate correlation
        corr_matrix = numerical_df.corr()

        # Create heatmap
        plt.figure(figsize=(16, 14))
        mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

        sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f',
                   cmap='RdYlGn', center=0, square=True, linewidths=1,
                   cbar_kws={"shrink": 0.8})

        plt.title('Correlation Heatmap - Numerical Features',
                 fontsize=16, fontweight='bold', pad=20)
        plt.tight_layout()
        plt.savefig(f'{self.output_dir}correlation_heatmap.png', dpi=300, bbox_inches='tight')
        plt.close()

        print(f" Saved: {self.output_dir}correlation_heatmap.png")

    def plot_performance_correlations(self):
        """Plot features most correlated with performance."""
        print("\n Creating performance correlation plot...")

        numerical_df = self.df.select_dtypes(include=[np.number])

        if 'PerformanceRating' in numerical_df.columns:
            correlations = numerical_df.corr()['PerformanceRating'].drop('PerformanceRating')
            correlations = correlations.sort_values(ascending=True)

            # Plot
            fig, ax = plt.subplots(figsize=(10, 12))
            colors = ['red' if x < 0 else 'green' for x in correlations.values]

            ax.barh(correlations.index, correlations.values, color=colors, alpha=0.7)
            ax.set_xlabel('Correlation with Performance Rating', fontsize=12, fontweight='bold')
            ax.set_ylabel('Features', fontsize=12, fontweight='bold')
            ax.set_title('Feature Correlations with Performance Rating',
                        fontsize=14, fontweight='bold')
            ax.axvline(x=0, color='black', linestyle='-', linewidth=0.8)
            ax.grid(axis='x', alpha=0.3)

            # Add value labels
            for i, (idx, val) in enumerate(correlations.items()):
                ax.text(val + 0.01 if val > 0 else val - 0.01, i, f'{val:.3f}',
                       va='center', fontweight='bold', fontsize=9)

            plt.tight_layout()
            plt.savefig(f'{self.output_dir}performance_correlations.png', dpi=300, bbox_inches='tight')
            plt.close()

            print(f" Saved: {self.output_dir}performance_correlations.png")

    def plot_boxplots_by_performance(self):
        """Create boxplots of numerical features by performance rating."""
        print("\n Creating boxplots by performance rating...")

        numerical_cols = ['Age', 'ExperienceYearsAtThisCompany', 'ExperienceYearsInCurrentRole',
                         'YearsSinceLastPromotion', 'TrainingTimesLastYear', 'MonthlyIncome']
        numerical_cols = [col for col in numerical_cols if col in self.df.columns]

        n_cols = 3
        n_rows = (len(numerical_cols) + n_cols - 1) // n_cols

        fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, n_rows * 5))
        axes = axes.flatten()

        for idx, col in enumerate(numerical_cols):
            sns.boxplot(data=self.df, x='PerformanceRating', y=col,
                       palette='Set2', ax=axes[idx])
            axes[idx].set_xlabel('Performance Rating', fontsize=11, fontweight='bold')
            axes[idx].set_ylabel(col, fontsize=11, fontweight='bold')
            axes[idx].set_title(f'{col} by Performance Rating', fontsize=12, fontweight='bold')
            axes[idx].grid(axis='y', alpha=0.3)

        # Remove extra subplots
        for idx in range(len(numerical_cols), len(axes)):
            fig.delaxes(axes[idx])

        plt.tight_layout()
        plt.savefig(f'{self.output_dir}boxplots_by_performance.png', dpi=300, bbox_inches='tight')
        plt.close()

        print(f" Saved: {self.output_dir}boxplots_by_performance.png")

    def plot_satisfaction_analysis(self):
        """Analyze satisfaction metrics."""
        print("\n Creating satisfaction analysis...")

        satisfaction_cols = ['JobSatisfaction', 'EnvironmentSatisfaction',
                           'WorkLifeBalance', 'JobInvolvement']
        satisfaction_cols = [col for col in satisfaction_cols if col in self.df.columns]

        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        axes = axes.flatten()

        for idx, col in enumerate(satisfaction_cols):
            # Group by satisfaction level and calculate average performance
            grouped = self.df.groupby(col)['PerformanceRating'].agg(['mean', 'count'])

            # Create bar plot
            ax1 = axes[idx]
            bars = ax1.bar(grouped.index, grouped['mean'],
                          color=plt.cm.RdYlGn(grouped['mean'] / 5),
                          edgecolor='black', alpha=0.7)
            ax1.set_xlabel(col, fontsize=11, fontweight='bold')
            ax1.set_ylabel('Avg Performance Rating', fontsize=11, fontweight='bold', color='blue')
            ax1.set_title(f'Performance vs {col}', fontsize=12, fontweight='bold')
            ax1.tick_params(axis='y', labelcolor='blue')
            ax1.grid(axis='y', alpha=0.3)

            # Add count on secondary y-axis
            ax2 = ax1.twinx()
            ax2.plot(grouped.index, grouped['count'], color='red',
                    marker='o', linewidth=2, markersize=8, label='Employee Count')
            ax2.set_ylabel('Employee Count', fontsize=11, fontweight='bold', color='red')
            ax2.tick_params(axis='y', labelcolor='red')

            # Add value labels on bars
            for i, (level, mean_val) in enumerate(grouped['mean'].items()):
                ax1.text(level, mean_val + 0.05, f'{mean_val:.2f}',
                        ha='center', fontweight='bold')

        plt.tight_layout()
        plt.savefig(f'{self.output_dir}satisfaction_analysis.png', dpi=300, bbox_inches='tight')
        plt.close()

        print(f" Saved: {self.output_dir}satisfaction_analysis.png")

    def plot_tenure_analysis(self):
        """Analyze performance by tenure."""
        print("\n Creating tenure analysis...")

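        # Create tenure bins; include_lowest=True keeps employees with 0 years
        # in the first bin (pd.cut intervals are otherwise open on the left)
        self.df['TenureBin'] = pd.cut(self.df['ExperienceYearsAtThisCompany'],
                                      bins=[0, 1, 3, 5, 8, 12, 50],
                                      labels=['0-1', '1-3', '3-5', '5-8', '8-12', '12+'],
                                      include_lowest=True)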

        fig, axes = plt.subplots(1, 2, figsize=(16, 6))

        # Average performance by tenure
        tenure_perf = self.df.groupby('TenureBin')['PerformanceRating'].mean()
        axes[0].plot(tenure_perf.index, tenure_perf.values, marker='o',
                    linewidth=3, markersize=12, color='#4ECDC4')
        axes[0].fill_between(range(len(tenure_perf)), tenure_perf.values, alpha=0.3)
        axes[0].set_xlabel('Years at Company', fontsize=12, fontweight='bold')
        axes[0].set_ylabel('Average Performance Rating', fontsize=12, fontweight='bold')
        axes[0].set_title('The Tenure Sweet Spot', fontsize=14, fontweight='bold')
        axes[0].grid(True, alpha=0.3)
        axes[0].axhline(y=3.0, color='red', linestyle='--', label='Baseline (3.0)')

        # Highlight sweet spot
        max_idx = tenure_perf.idxmax()
        max_val = tenure_perf.max()
        axes[0].scatter([max_idx], [max_val], color='gold', s=300,
                       zorder=5, edgecolor='black', linewidth=2)
        axes[0].annotate('Sweet Spot', xy=(max_idx, max_val),
                        xytext=(max_idx, max_val + 0.2),
                        fontsize=12, fontweight='bold', ha='center',
                        arrowprops=dict(arrowstyle='->', color='black', lw=2))
        axes[0].legend()

        # Distribution by tenure
        tenure_dist = self.df['TenureBin'].value_counts().sort_index()
        axes[1].bar(tenure_dist.index, tenure_dist.values,
                   color='coral', edgecolor='black', alpha=0.7)
        axes[1].set_xlabel('Years at Company', fontsize=12, fontweight='bold')
        axes[1].set_ylabel('Number of Employees', fontsize=12, fontweight='bold')
        axes[1].set_title('Employee Distribution by Tenure', fontsize=14, fontweight='bold')
        axes[1].grid(axis='y', alpha=0.3)

        # Add count labels
        for i, (cat, count) in enumerate(tenure_dist.items()):
            axes[1].text(i, count + 5, str(count), ha='center', fontweight='bold')

        plt.tight_layout()
        plt.savefig(f'{self.output_dir}tenure_analysis.png', dpi=300, bbox_inches='tight')
        plt.close()

        print(f" Saved: {self.output_dir}tenure_analysis.png")

    def plot_department_comparison(self):
        """Compare performance across departments."""
        print("\n Creating department comparison...")

        if 'Department' not in self.df.columns:
            print("  Department column not found. Skipping...")
            return

        fig, axes = plt.subplots(2, 2, figsize=(16, 12))

        # 1. Average performance by department
        dept_perf = self.df.groupby('Department')['PerformanceRating'].mean().sort_values(ascending=True)
        axes[0, 0].barh(dept_perf.index, dept_perf.values,
                       color=plt.cm.RdYlGn(dept_perf.values / 5))
        axes[0, 0].set_xlabel('Average Performance Rating', fontsize=11, fontweight='bold')
        axes[0, 0].set_ylabel('Department', fontsize=11, fontweight='bold')
        axes[0, 0].set_title('Average Performance by Department', fontsize=12, fontweight='bold')
        axes[0, 0].axvline(x=self.df['PerformanceRating'].mean(),
                          color='red', linestyle='--', label='Company Average')
        axes[0, 0].legend()
        axes[0, 0].grid(axis='x', alpha=0.3)

        for i, (dept, val) in enumerate(dept_perf.items()):
            axes[0, 0].text(val + 0.02, i, f'{val:.2f}', va='center', fontweight='bold')

        # 2. Performance distribution by department
        dept_order = dept_perf.index.tolist()
        sns.violinplot(data=self.df, x='Department', y='PerformanceRating',
                      order=dept_order, palette='Set2', ax=axes[0, 1])
        axes[0, 1].set_xlabel('Department', fontsize=11, fontweight='bold')
        axes[0, 1].set_ylabel('Performance Rating', fontsize=11, fontweight='bold')
        axes[0, 1].set_title('Performance Distribution by Department', fontsize=12, fontweight='bold')
        axes[0, 1].tick_params(axis='x', rotation=45)
        axes[0, 1].grid(axis='y', alpha=0.3)

        # 3. Department size vs performance
        dept_size = self.df['Department'].value_counts()
        dept_perf_full = self.df.groupby('Department')['PerformanceRating'].mean()

        scatter_data = pd.DataFrame({
            'Size': dept_size,
            'Performance': dept_perf_full
        })

        axes[1, 0].scatter(scatter_data['Size'], scatter_data['Performance'],
                          s=300, alpha=0.6, c=scatter_data['Performance'],
                          cmap='RdYlGn', edgecolors='black', linewidth=2)

        for dept, row in scatter_data.iterrows():
            axes[1, 0].annotate(dept, (row['Size'], row['Performance']),
                               fontsize=9, fontweight='bold', ha='center')

        axes[1, 0].set_xlabel('Department Size (# Employees)', fontsize=11, fontweight='bold')
        axes[1, 0].set_ylabel('Average Performance', fontsize=11, fontweight='bold')
        axes[1, 0].set_title('Department Size vs Performance', fontsize=12, fontweight='bold')
        axes[1, 0].grid(True, alpha=0.3)

        # 4. High performers percentage by department
        # The mean of a boolean mask is the share of True values, so a department
        # with no high performers gets 0% instead of NaN.
        high_perf_pct = ((self.df['PerformanceRating'] >= 4)
                         .groupby(self.df['Department']).mean() * 100).sort_values(ascending=True)

        axes[1, 1].barh(high_perf_pct.index, high_perf_pct.values, color='green', alpha=0.7)
        axes[1, 1].set_xlabel('High Performers %', fontsize=11, fontweight='bold')
        axes[1, 1].set_ylabel('Department', fontsize=11, fontweight='bold')
        axes[1, 1].set_title('High Performers (Rating ≥4) by Department',
                            fontsize=12, fontweight='bold')
        axes[1, 1].grid(axis='x', alpha=0.3)

        for i, (dept, val) in enumerate(high_perf_pct.items()):
            axes[1, 1].text(val + 1, i, f'{val:.1f}%', va='center', fontweight='bold')

        plt.tight_layout()
        plt.savefig(f'{self.output_dir}department_comparison.png', dpi=300, bbox_inches='tight')
        plt.close()

        print(f" Saved: {self.output_dir}department_comparison.png")

    def create_interactive_dashboard(self):
        """Create interactive Plotly dashboard."""
        print("\n Creating interactive dashboard...")

        # Create subplots
        fig = make_subplots(
            rows=2, cols=2,
            subplot_titles=('Performance Distribution',
                          'Performance by Department',
                          'Satisfaction vs Performance',
                          'Tenure vs Performance'),
            specs=[[{'type': 'bar'}, {'type': 'box'}],
                   [{'type': 'scatter'}, {'type': 'scatter'}]]
        )

        # 1. Performance distribution
        perf_counts = self.df['PerformanceRating'].value_counts().sort_index()
        fig.add_trace(
            go.Bar(x=perf_counts.index, y=perf_counts.values,
                  marker_color=['#FF6B6B', '#FFA500', '#FFD93D', '#6BCF7F', '#4ECDC4'],
                  name='Count'),
            row=1, col=1
        )

        # 2. Performance by department
        if 'Department' in self.df.columns:
            for dept in self.df['Department'].unique():
                dept_data = self.df[self.df['Department'] == dept]['PerformanceRating']
                fig.add_trace(
                    go.Box(y=dept_data, name=dept),
                    row=1, col=2
                )

        # 3. Satisfaction vs Performance
        if 'JobSatisfaction' in self.df.columns:
            fig.add_trace(
                go.Scatter(x=self.df['JobSatisfaction'],
                          y=self.df['PerformanceRating'],
                          mode='markers',
                          marker=dict(size=8, opacity=0.5, color=self.df['PerformanceRating'],
                                    colorscale='RdYlGn', showscale=True),
                          name='Employees'),
                row=2, col=1
            )

        # 4. Tenure vs Performance
        if 'ExperienceYearsAtThisCompany' in self.df.columns:
            fig.add_trace(
                go.Scatter(x=self.df['ExperienceYearsAtThisCompany'],
                          y=self.df['PerformanceRating'],
                          mode='markers',
                          marker=dict(size=8, opacity=0.5, color=self.df['PerformanceRating'],
                                    colorscale='Viridis'),
                          name='Employees'),
                row=2, col=2
            )

        # Update layout
        fig.update_layout(
            title_text="Employee Performance Interactive Dashboard",
            title_font_size=20,
            showlegend=False,
            height=800
        )

        fig.write_html(f'{self.output_dir}interactive_dashboard.html')
        print(f" Saved: {self.output_dir}interactive_dashboard.html")

    def generate_summary_statistics(self):
        """Generate and save summary statistics."""
        print("\n Generating summary statistics...")

        # Overall statistics
        summary = pd.DataFrame({
            'Total Employees': [len(self.df)],
            'Avg Performance': [self.df['PerformanceRating'].mean()],
            'Std Performance': [self.df['PerformanceRating'].std()],
            'High Performers (%)': [(self.df['PerformanceRating'] >= 4).sum() / len(self.df) * 100],
            'Low Performers (%)': [(self.df['PerformanceRating'] <= 2).sum() / len(self.df) * 100]
        })

        summary.to_csv(f'{self.output_dir}summary_statistics.csv', index=False)
        print(f" Saved: {self.output_dir}summary_statistics.csv")

        # Print to console
        print("\n" + "="*60)
        print("SUMMARY STATISTICS")
        print("="*60)
        print(summary.to_string(index=False))
        print("="*60)

    def run_all_eda(self):
        """Run all EDA visualizations."""
        print("\n" + "="*60)
        print(" RUNNING COMPLETE EDA ANALYSIS")
        print("="*60)

        self.plot_target_distribution()
        self.plot_numerical_distributions()
        self.plot_categorical_distributions()
        self.plot_performance_by_categories()
        self.plot_correlation_heatmap()
        self.plot_performance_correlations()
        self.plot_boxplots_by_performance()
        self.plot_satisfaction_analysis()
        self.plot_tenure_analysis()
        self.plot_department_comparison()
        self.create_interactive_dashboard()
        self.generate_summary_statistics()

        print("\n" + "="*60)
        print("EDA ANALYSIS COMPLETE!")
        print(f" All visualizations saved to: {self.output_dir}")
        print("="*60)


def main():
    """Main execution function."""
    # Update with your data path
    data_path = "/content/INX_Future_Inc_Employee_Performance_CDS_Project2_Data_V1.8.csv"

    # Create visualizer
    visualizer = EDAVisualizer(data_path)

    # Run all EDA
    visualizer.run_all_eda()


if __name__ == "__main__":
    main()

 Dataset loaded: 1200 rows, 28 columns

 RUNNING COMPLETE EDA ANALYSIS

 Creating target distribution plot...
 Saved: eda_outputs/target_distribution.png

 Creating numerical feature distributions...
 Saved: eda_outputs/numerical_distributions.png

 Creating categorical feature distributions...
 Saved: eda_outputs/categorical_distributions.png

 Creating performance by category plots...
 Saved: eda_outputs/performance_by_categories.png

 Creating correlation heatmap...
 Saved: eda_outputs/correlation_heatmap.png

 Creating performance correlation plot...
 Saved: eda_outputs/performance_correlations.png

 Creating boxplots by performance rating...
 Saved: eda_outputs/boxplots_by_performance.png

 Creating satisfaction analysis...
 Saved: eda_outputs/satisfaction_analysis.png

 Creating tenure analysis...
 Saved: eda_outputs/tenure_analysis.png

 Creating department comparison...
  Department column not found. Skipping...

 Creating interactive dashboard...
 Saved: eda_outputs/in