# 📊 Salifort Motors Employee Retention Analysis
## Phase 1: Data Exploration and Understanding (PACE - Plan & Analyze)

**Project Overview:** This notebook focuses on understanding our dataset, exploring employee retention patterns, and setting the foundation for our predictive model.

**Business Context:** Salifort Motors is experiencing employee turnover challenges. Our goal is to identify key factors driving attrition and build predictive capabilities to support HR decision-making.

---

### 📋 Table of Contents
1. [Project Setup & Data Loading](#setup)
2. [Data Overview & Quality Assessment](#overview)
3. [Exploratory Data Analysis](#eda)
4. [Key Insights & Observations](#insights)
5. [Next Steps](#next-steps)

---

## 🛠️ Project Setup & Data Loading {#setup}

Let's start by importing necessary libraries and loading our dataset.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

print("✅ Libraries imported successfully!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"📈 NumPy version: {np.__version__}")

In [None]:
# Load the dataset
# Note: Update the file path based on your data location
try:
    df = pd.read_csv('../data/raw/hr_dataset.csv')
except FileNotFoundError:
    print("❌ Dataset file not found. Please ensure the file is in the correct location.")
    print("Expected location: ../data/raw/hr_dataset.csv")
    raise  # Stop here: every later cell depends on `df` being defined
else:
    print("✅ Dataset loaded successfully!")
    print(f"📏 Dataset shape: {df.shape}")
    print(f"👥 Number of employees: {df.shape[0]:,}")
    print(f"📊 Number of features: {df.shape[1]}")

## 📊 Data Overview & Quality Assessment {#overview}

Let's examine the structure and quality of our data.

In [None]:
# Display basic information about the dataset
print("📋 DATASET INFORMATION")
print("=" * 50)
df.info()
print("\n")

# Display first few rows
print("👀 FIRST 5 ROWS")
print("=" * 50)
display(df.head())

In [None]:
# Check for missing values
print("🔍 MISSING VALUES ANALYSIS")
print("=" * 50)
missing_data = df.isnull().sum()
missing_percent = (missing_data / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Percentage': missing_percent
}).sort_values('Missing Count', ascending=False)

total_missing = int(missing_data.sum())

if total_missing == 0:
    print("✅ No missing values found in the dataset!")
else:
    # Show only the columns that actually contain missing values
    print(missing_df[missing_df['Missing Count'] > 0])
    print(f"⚠️  Total missing values: {total_missing}")
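
Missing values are only one aspect of data quality. Fully duplicated rows are another common issue worth checking before any modeling. The cell below is a minimal sketch: `summarize_duplicates` is a hypothetical helper (not part of the original notebook), and `toy` is made-up data used purely for illustration.

```python
import pandas as pd

def summarize_duplicates(frame: pd.DataFrame) -> dict:
    """Count rows that are exact copies of an earlier row (all columns equal)."""
    dup_mask = frame.duplicated(keep="first")
    n_dup = int(dup_mask.sum())
    pct = round(100 * n_dup / len(frame), 2) if len(frame) else 0.0
    return {"duplicate_rows": n_dup, "duplicate_pct": pct}

# Toy data for illustration only; not taken from the HR dataset
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.38, 0.11],
    "left": [1, 0, 1, 1],
})

dup_summary = summarize_duplicates(toy)
print(dup_summary)
```

On the real data this would be called as `summarize_duplicates(df)`. Whether duplicates should be dropped is a judgment call, since two employees can legitimately share identical values.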

In [None]:
# Statistical summary
print("📊 STATISTICAL SUMMARY")
print("=" * 50)
display(df.describe())

print("\n📈 CATEGORICAL VARIABLES SUMMARY")
print("=" * 50)
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    print(f"\n{col.upper()}:")
    print(df[col].value_counts())

## 🔍 Exploratory Data Analysis {#eda}

Now let's dive deep into understanding patterns in our employee data.

In [None]:
# Target variable analysis
print("🎯 TARGET VARIABLE ANALYSIS")
print("=" * 50)

# Assuming 'left' is our target variable (adjust if different)
target_col = 'left'  # Update this if your target column has a different name

if target_col in df.columns:
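    # Sort by class label so that 0 (stayed) always comes before 1 (left),
    # regardless of which class is more frequent; the pie chart labels rely on this order
    retention_counts = df[target_col].value_counts().sort_index()
    retention_percent = df[target_col].value_counts(normalize=True).sort_index() * 100
    
    print("Employee Retention Status:")
    print(f"Stayed: {retention_counts.loc[0]:,} ({retention_percent.loc[0]:.1f}%)")
    print(f"Left: {retention_counts.loc[1]:,} ({retention_percent.loc[1]:.1f}%)")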
    
    # Visualize retention distribution
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # Count plot
    sns.countplot(data=df, x=target_col, ax=ax1)
    ax1.set_title('Employee Retention Distribution', fontsize=14, fontweight='bold')
    ax1.set_xlabel('Employee Status (0=Stayed, 1=Left)')
    ax1.set_ylabel('Count')
    
    # Pie chart
    colors = ['#2E8B57', '#CD5C5C']
    ax2.pie(retention_counts.values, labels=['Stayed', 'Left'], autopct='%1.1f%%', colors=colors, startangle=90)
    ax2.set_title('Retention Rate Distribution', fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
else:
    print("❌ Target column 'left' not found. Please check column names.")
    print(f"Available columns: {list(df.columns)}")

In [None]:
# Correlation analysis
print("🔗 CORRELATION ANALYSIS")
print("=" * 50)

# Select numeric columns for correlation
numeric_cols = df.select_dtypes(include=[np.number]).columns
correlation_matrix = df[numeric_cols].corr()

# Create correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='RdYlBu_r', center=0, 
            square=True, linewidths=0.5, cbar_kws={"shrink": .8})
plt.title('Feature Correlation Matrix', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Show correlations with target variable
if target_col in correlation_matrix.columns:
    target_correlations = correlation_matrix[target_col].abs().sort_values(ascending=False)
    print(f"\n🎯 Features most correlated with {target_col.upper()} (absolute values, direction not shown):")
    print(target_correlations.drop(target_col).head(10))
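
Ranking by absolute correlation hides whether a feature is associated with more or fewer departures. The sketch below keeps the sign while still ordering by strength. `signed_target_correlations` is a hypothetical helper and `toy` is made-up data, so the numbers say nothing about the real dataset.

```python
import pandas as pd

def signed_target_correlations(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric feature with the target,
    ordered by magnitude but keeping the sign."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data for illustration only
toy = pd.DataFrame({
    "satisfaction_level": [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],
    "number_project": [3, 4, 3, 6, 7, 5],
    "left": [0, 0, 0, 1, 1, 1],
})

ranked = signed_target_correlations(toy, "left")
print(ranked)
```

A negative value means higher feature values go with staying and a positive value means they go with leaving. Correlation measures linear association only, so it can miss non-linear patterns.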

In [None]:
# Distribution analysis of key features
print("📊 FEATURE DISTRIBUTION ANALYSIS")
print("=" * 50)

# Key features to analyze (adjust based on your dataset)
key_features = ['satisfaction_level', 'last_evaluation', 'number_project', 
                'average_montly_hours', 'time_spend_company']  # Update these names as needed

# Filter features that exist in the dataset
available_features = [col for col in key_features if col in df.columns]

if available_features:
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    axes = axes.ravel()
    
    for i, feature in enumerate(available_features[:6]):
        if i < len(axes):
            sns.histplot(data=df, x=feature, hue=target_col if target_col in df.columns else None, 
                        kde=True, ax=axes[i], alpha=0.7)
            axes[i].set_title(f'Distribution of {feature.replace("_", " ").title()}', 
                            fontweight='bold')
            axes[i].set_xlabel(feature.replace("_", " ").title())
    
    # Remove empty subplots
    for j in range(len(available_features), len(axes)):
        fig.delaxes(axes[j])
    
    plt.tight_layout()
    plt.show()
else:
    print("⚠️  Key features not found with expected names. Available columns:")
    print(list(df.columns))
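
Histograms show the shape of each distribution; a per-group summary table puts numbers on the differences between employees who stayed and those who left. A minimal sketch follows, assuming a hypothetical helper `compare_groups` and made-up `toy` data.

```python
import pandas as pd

def compare_groups(frame: pd.DataFrame, target: str, features: list) -> pd.DataFrame:
    """Mean and median of each feature, split by the target class."""
    return frame.groupby(target)[features].agg(["mean", "median"])

# Toy data for illustration only
toy = pd.DataFrame({
    "satisfaction_level": [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],
    "number_project": [3, 4, 3, 6, 7, 5],
    "left": [0, 0, 0, 1, 1, 1],
})

group_summary = compare_groups(toy, "left", ["satisfaction_level", "number_project"])
print(group_summary)
```

With the notebook's variables, the equivalent call would be `compare_groups(df, target_col, available_features)`.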

## 💡 Key Insights & Observations {#insights}

The cell below summarizes the key findings from our exploratory analysis:

In [None]:
# Generate automated insights based on the data
print("🔍 AUTOMATED INSIGHTS GENERATION")
print("=" * 50)

insights = []

# Dataset size insight
insights.append(f"📏 Dataset contains {df.shape[0]:,} employee records with {df.shape[1]} features")

# Missing data insight
missing_count = df.isnull().sum().sum()
if missing_count == 0:
    insights.append("✅ Dataset is complete with no missing values")
else:
    insights.append(f"⚠️  Dataset has {missing_count} missing values that need attention")

# Target variable insight
if target_col in df.columns:
    left_rate = df[target_col].mean() * 100
    insights.append(f"📊 Employee turnover rate: {left_rate:.1f}%")
    
    # The 20% and 10% cut-offs are illustrative thresholds; adjust them to your business context
    if left_rate > 20:
        insights.append("🚨 High turnover rate indicates significant retention challenges")
    elif left_rate > 10:
        insights.append("⚠️  Moderate turnover rate requires attention")
    else:
        insights.append("✅ Low turnover rate indicates good retention")

# Feature correlation insights
if target_col in df.columns and target_col in correlation_matrix.columns:
    high_corr_features = correlation_matrix[target_col].abs().sort_values(ascending=False).drop(target_col).head(3)
    insights.append(f"🔗 Features most correlated with {target_col}: {', '.join(high_corr_features.index)}")

# Print insights
for i, insight in enumerate(insights, 1):
    print(f"{i}. {insight}")

print("\n" + "="*50)
print("📝 RECOMMENDATIONS FOR NEXT PHASE:")
print("="*50)
recommendations = [
    "🧹 Proceed to data cleaning and preprocessing",
    "📊 Focus on highly correlated features for modeling",
    "🎯 Consider feature engineering opportunities",
    "⚖️  Address class imbalance if present",
    "🔄 Validate findings with domain experts"
]

for i, rec in enumerate(recommendations, 1):
    print(f"{i}. {rec}")
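
The recommendation to address class imbalance "if present" can be made concrete by measuring it first. The sketch below reports the count and share of each class and the majority-to-minority ratio. `class_balance` is a hypothetical helper and `toy_target` is made-up data.

```python
import pandas as pd

def class_balance(target: pd.Series) -> pd.DataFrame:
    """Count and percentage share of each class, ordered by class label."""
    counts = target.value_counts().sort_index()
    share = (100 * counts / counts.sum()).round(1)
    return pd.DataFrame({"count": counts, "share_pct": share})

# Toy target for illustration only: three employees stayed, one left
toy_target = pd.Series([0, 0, 0, 1], name="left")

balance = class_balance(toy_target)
imbalance_ratio = balance["count"].max() / balance["count"].min()

print(balance)
print(f"Majority-to-minority ratio: {imbalance_ratio:.1f}")
```

On the real data this would be `class_balance(df[target_col])`. How large a ratio calls for resampling or class weights depends on the model and the evaluation metric, so treat any cut-off as a judgment call.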

## 🚀 Next Steps {#next-steps}

Based on our exploration, the next steps in our analysis will be:

1. **Data Cleaning** (`02_data_cleaning.ipynb`)
   - Handle any data quality issues
   - Address outliers and anomalies
   - Prepare data for modeling

2. **Feature Engineering**
   - Create new meaningful features
   - Transform categorical variables
   - Scale numerical features

3. **Model Development** (`03_modeling.ipynb`)
   - Build and compare multiple models
   - Hyperparameter tuning
   - Model validation

4. **Results Analysis** (`04_results.ipynb`)
   - Model interpretation
   - Business insights
   - Recommendations
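
As a preview of the feature engineering step, here is a minimal pandas-only sketch of one-hot encoding a categorical column and standardizing a numeric one. The helper `prepare_features`, the choice of columns, and the `toy` values are assumptions for illustration; the actual notebooks may use different tools and columns.

```python
import pandas as pd

def prepare_features(frame: pd.DataFrame, categorical: list, numeric: list) -> pd.DataFrame:
    """One-hot encode categorical columns and z-score scale numeric columns."""
    encoded = pd.get_dummies(frame, columns=categorical, dtype=int)
    for col in numeric:
        mean = encoded[col].mean()
        std = encoded[col].std(ddof=0)
        # Guard against constant columns, which have zero standard deviation
        encoded[col] = (encoded[col] - mean) / std if std > 0 else 0.0
    return encoded

# Toy data for illustration only
toy = pd.DataFrame({
    "salary": ["low", "high", "low", "medium"],
    "average_montly_hours": [150, 250, 200, 200],
})

prepared = prepare_features(toy, categorical=["salary"], numeric=["average_montly_hours"])
print(prepared)
```

In a real pipeline the scaling statistics should be computed on the training split only and then reused on the test split, so that no information leaks from the test data.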

---

**📊 Data Exploration Complete!** 

*Next notebook: `02_data_cleaning.ipynb`*