# Exploratory Data Analysis (EDA) Demo

This notebook walks through an exploratory data analysis (EDA) workflow in Python, from inspecting the dataset's structure to statistical testing and report generation.

## Learning Objectives
- Understand dataset structure and basic statistics
- Identify and handle missing values
- Analyze distributions of variables
- Explore relationships between variables
- Detect outliers using statistical methods
- Generate professional analysis reports

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')  # Hides all warnings for cleaner demo output; disable when debugging

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")

In [None]:
# Import our custom EDA analyzer
import sys
sys.path.append('../src')
from eda_analyzer import DataAnalyzer

print("EDA analyzer imported successfully!")

## Step 1: Initialize the Data Analyzer

Let's create an instance of our DataAnalyzer class with sample data.

In [None]:
# Initialize the analyzer with sample data
analyzer = DataAnalyzer()

print("Data analyzer initialized with sample dataset")
print(f"Dataset shape: {analyzer.df.shape}")
print(f"Numerical columns: {len(analyzer.numeric_columns)}")
print(f"Categorical columns: {len(analyzer.categorical_columns)}")

## Step 2: Basic Dataset Information

Let's start with understanding our dataset structure.

In [None]:
# Get basic information about the dataset
analyzer.basic_info()

## Step 3: Missing Values Analysis

Identifying and understanding missing values is crucial for data quality.

In [None]:
# Analyze missing values
missing_data = analyzer.missing_values_analysis()
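The internals of `missing_values_analysis()` are not shown in this notebook. As a rough sketch of the kind of summary such a step usually produces, the following uses plain pandas only. The helper name `summarize_missing` and the toy frame are assumptions made for illustration, not part of `DataAnalyzer`.

```python
import numpy as np
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually contain gaps
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Hypothetical toy frame, not the notebook's sample dataset
toy = pd.DataFrame({
    "age": [25, 32, 41, 29],
    "satisfaction_score": [7.5, np.nan, 6.0, np.nan],
    "department": ["IT", "HR", None, "IT"],
})

missing_summary = summarize_missing(toy)
print(missing_summary)
```

Reporting percentages next to raw counts makes it easier to decide whether a column can be imputed or should be dropped.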

## Step 4: Numerical Variables Analysis

Let's explore the distributions and characteristics of numerical variables.

In [None]:
# Analyze numerical variables
desc_stats = analyzer.numerical_analysis()
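What `numerical_analysis()` computes is not documented here. A minimal sketch of a comparable summary, assuming only pandas, extends `describe()` with skewness and kurtosis. The function name `describe_numeric` and the toy values are hypothetical.

```python
import pandas as pd


def describe_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Describe numeric columns and append shape statistics."""
    numeric = df.select_dtypes(include="number")
    table = numeric.describe().T          # one row per column
    table["skew"] = numeric.skew()        # > 0 means a longer right tail
    table["kurtosis"] = numeric.kurt()    # excess kurtosis (normal = 0)
    return table


# Hypothetical toy frame, not the notebook's sample dataset
toy = pd.DataFrame({
    "age": [25, 30, 35, 40, 45],
    "income": [30000, 35000, 40000, 45000, 200000],
    "department": ["IT", "HR", "IT", "Sales", "HR"],
})

numeric_summary = describe_numeric(toy)
print(numeric_summary[["mean", "50%", "skew", "kurtosis"]])
```

A mean well above the median together with a large positive skew, as in the toy `income` column, is a common sign that a log transform may be worth trying.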

## Step 5: Categorical Variables Analysis

Understanding the distribution of categorical variables.

In [None]:
# Analyze categorical variables
analyzer.categorical_analysis()
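For readers without access to the analyzer's source, a frequency table is the usual starting point for a categorical column. The sketch below assumes plain pandas. The helper `categorical_summary` and the toy series are illustrative names, not part of `DataAnalyzer`.

```python
import pandas as pd


def categorical_summary(series: pd.Series) -> pd.DataFrame:
    """Count each category (including missing) and its share in percent."""
    counts = series.value_counts(dropna=False)
    pct = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "pct": pct})


# Hypothetical toy column, not the notebook's sample dataset
departments = pd.Series(["IT", "HR", "IT", "Sales", "IT", None], name="department")

dept_summary = categorical_summary(departments)
print(dept_summary)
```

Passing `dropna=False` keeps missing values visible as their own row, so gaps in a categorical column are not silently ignored.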

## Step 6: Correlation Analysis

Exploring relationships between numerical variables.

In [None]:
# Perform correlation analysis
correlation_matrix = analyzer.correlation_analysis()
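A full correlation matrix can be hard to scan once there are many columns. One possible follow-up, sketched here with pandas and NumPy only, lists each strongly correlated pair once. The function `top_correlations`, its 0.7 threshold, and the toy data are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def top_correlations(df: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """Return variable pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.select_dtypes(include="number").corr()  # Pearson by default
    # Keep the strict upper triangle so each pair appears exactly once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)


# Hypothetical toy frame, not the notebook's sample dataset
toy = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "experience": [3, 8, 13, 18, 23, 28],
    "satisfaction_score": [5, 7, 4, 8, 5, 6],
})

strong_pairs = top_correlations(toy)
print(strong_pairs)
```

Pearson correlation only captures linear association. Passing `method="spearman"` to `corr()` is a common alternative when relationships are monotonic but not linear.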

## Step 7: Outlier Detection

Identifying outliers using the IQR (Interquartile Range) method.

In [None]:
# Detect outliers
outlier_summary = analyzer.outlier_detection()
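The IQR method can be written in a few lines. The sketch below assumes the conventional multiplier of 1.5, which may or may not be what `outlier_detection()` uses internally. The helper `iqr_outliers` and the toy values are hypothetical.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] and the bounds used."""
    clean = series.dropna()
    q1, q3 = clean.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (clean < lower) | (clean > upper)
    return clean[mask], (lower, upper)


# Hypothetical toy incomes in thousands, not the notebook's sample dataset
incomes = pd.Series([30, 35, 40, 45, 50, 55, 60, 500], name="income")

outliers, (lower, upper) = iqr_outliers(incomes)
print(f"Bounds: [{lower}, {upper}]")
print(outliers)
```

Because quartiles are barely affected by extreme values, the fences stay stable even when the outliers themselves are very large, which is the main reason this rule is preferred over mean-and-standard-deviation cutoffs for skewed data.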

## Step 8: Custom Analysis

Let's perform some additional custom analysis on our dataset.

In [None]:
# Custom analysis: Income vs Education relationship
plt.figure(figsize=(10, 6))
plt.scatter(analyzer.df['education_years'], analyzer.df['income'], alpha=0.6)
plt.xlabel('Education Years')
plt.ylabel('Income')
plt.title('Income vs Education Years')

# Add correlation coefficient
corr_coef = analyzer.df['education_years'].corr(analyzer.df['income'])
plt.text(0.05, 0.95, f'Correlation: {corr_coef:.3f}', 
         transform=plt.gca().transAxes, 
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Department-wise analysis
plt.figure(figsize=(12, 8))

# Subplot 1: Average income by department
plt.subplot(2, 2, 1)
dept_income = analyzer.df.groupby('department')['income'].mean().sort_values(ascending=False)
dept_income.plot(kind='bar', color='skyblue')
plt.title('Average Income by Department')
plt.ylabel('Income')
plt.xticks(rotation=45)

# Subplot 2: Average satisfaction by department
plt.subplot(2, 2, 2)
dept_satisfaction = analyzer.df.groupby('department')['satisfaction_score'].mean().sort_values(ascending=False)
dept_satisfaction.plot(kind='bar', color='lightgreen')
plt.title('Average Satisfaction by Department')
plt.ylabel('Satisfaction Score')
plt.xticks(rotation=45)

# Subplot 3: Remote work distribution
plt.subplot(2, 2, 3)
remote_work_dist = analyzer.df['remote_work'].value_counts()
plt.pie(remote_work_dist.values, labels=remote_work_dist.index, autopct='%1.1f%%')
plt.title('Remote Work Distribution')

# Subplot 4: Age distribution by department
plt.subplot(2, 2, 4)
analyzer.df.boxplot(column='age', by='department', ax=plt.gca())
plt.title('Age Distribution by Department')
plt.suptitle('')  # Remove automatic title

plt.tight_layout()
plt.show()

## Step 9: Statistical Testing

Let's perform some statistical tests to validate our observations.

In [None]:
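# Test for normality in numerical variables
print("Normality Tests (Shapiro-Wilk):")
print("=" * 40)

for col in analyzer.numeric_columns[:3]:  # Test first 3 for brevity
    values = analyzer.df[col].dropna()
    if len(values) > 5000:  # SciPy notes the p-value may be inaccurate above N = 5000
        print(f"{col}: skipped ({len(values)} values exceed the Shapiro-Wilk limit)")
        continue
    stat, p_value = stats.shapiro(values)
    # p > 0.05 means normality is not rejected; it does not prove the data are normal
    is_normal = "Yes" if p_value > 0.05 else "No"
    print(f"{col}: p-value = {p_value:.6f}, Normal: {is_normal}")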

print("\n" + "=" * 50)

# Test for difference in income between remote and non-remote workers
# Drop missing incomes first: ttest_ind returns NaN if either sample contains NaN
remote_yes = analyzer.df.loc[analyzer.df['remote_work'] == 'Yes', 'income'].dropna()
remote_no = analyzer.df.loc[analyzer.df['remote_work'] == 'No', 'income'].dropna()

stat, p_value = stats.ttest_ind(remote_yes, remote_no)  # Student's t-test, assumes equal variances
print("T-test for income difference (remote vs non-remote):")
print(f"t-statistic: {stat:.4f}")
print(f"p-value: {p_value:.6f}")
print(f"Significant difference: {'Yes' if p_value < 0.05 else 'No'}")

print(f"\nMean income (remote): ${remote_yes.mean():.2f}")
print(f"Mean income (non-remote): ${remote_no.mean():.2f}")
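Student's t-test assumes roughly normal data with equal variances in both groups, which skewed income data may not satisfy. As a hedged sketch of two common cross-checks, the code below runs Welch's t-test and the Mann-Whitney U test on synthetic log-normal samples and adds Cohen's d as an effect size. The groups, their parameters, and the seed are invented for illustration and do not come from the notebook's dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed incomes for two groups (synthetic data)
group_a = rng.lognormal(mean=11.0, sigma=0.3, size=200)
group_b = rng.lognormal(mean=11.3, sigma=0.3, size=200)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U: rank-based, makes no normality assumption
mann_whitney = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Cohen's d with a pooled standard deviation (valid here because group sizes are equal)
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd

print(f"Welch t-test p-value:   {welch.pvalue:.3g}")
print(f"Mann-Whitney U p-value: {mann_whitney.pvalue:.3g}")
print(f"Cohen's d:              {cohens_d:.2f}")
```

When the parametric and rank-based tests agree, the conclusion is less likely to be an artifact of the distributional assumptions. The effect size indicates whether a statistically significant difference is also large enough to matter.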

## Step 10: Generate Complete Report

Finally, let's generate a comprehensive analysis report.

In [None]:
# Generate complete EDA report
analyzer.generate_report()

## Conclusion

In this EDA demo, we've successfully:
1. Analyzed dataset structure and basic statistics
2. Identified and visualized missing values
3. Explored distributions of numerical and categorical variables
4. Analyzed correlations between variables
5. Detected outliers using statistical methods
6. Performed custom analysis and statistical testing
7. Generated comprehensive visualizations and reports

## Key Insights
- The sample dataset contains 1,000 employee records
- Age and experience are strongly correlated
- The income distribution is right-skewed, consistent with a log-normal shape
- Average satisfaction varies across departments
- Missing values are limited (5% of satisfaction scores)

## Next Steps
- Handle missing values appropriately
- Consider data transformation for skewed variables
- Investigate outliers for business insights
- Prepare data for machine learning models
- Create interactive dashboards for stakeholders
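
The first two next steps can be sketched in a few lines of pandas. Median imputation and a `log1p` transform are only one reasonable choice among several, and the toy frame and the new column names (`satisfaction_missing`, `log_income`) are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the notebook's sample dataset
toy = pd.DataFrame({
    "income": [30000, 45000, 60000, 250000],
    "satisfaction_score": [7.0, np.nan, 5.0, 9.0],
})

prepared = toy.copy()

# Record which rows were imputed before filling, so the information is not lost
prepared["satisfaction_missing"] = prepared["satisfaction_score"].isna()
median_score = prepared["satisfaction_score"].median()
prepared["satisfaction_score"] = prepared["satisfaction_score"].fillna(median_score)

# log1p compresses the long right tail of income and is defined at zero
prepared["log_income"] = np.log1p(prepared["income"])

print(prepared)
print("Skew before:", round(prepared["income"].skew(), 3))
print("Skew after: ", round(prepared["log_income"].skew(), 3))
```

Keeping an indicator column for imputed rows lets a later model learn whether missingness itself carries signal.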