# Exploratory Data Analysis (EDA) — Employee Performance

This notebook walks through loading an employee performance dataset, cleaning it, exploring it with count plots, boxplots, and a correlation heatmap, and saving a cleaned CSV for modelling. To use your own data, place the CSV in the notebook's folder, set `DATA_FILE` in step 2 to its name (the default is `employee_performance.csv`), and run all cells (Kernel ▶ Restart & Run All).

In [None]:
# 1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# set_theme is the current name; sns.set is kept only as an alias
sns.set_theme(style='whitegrid')
%matplotlib inline


In [None]:
# 2. Load dataset (place your CSV in the same folder and update the filename if needed)
DATA_FILE = 'employee_performance.csv'
if os.path.exists(DATA_FILE):
    df = pd.read_csv(DATA_FILE)
    print('Loaded', DATA_FILE)
else:
    # Create a small sample dataframe scaffold so the notebook runs and demonstrates methods
    print(f'WARNING: {DATA_FILE} not found in the notebook folder. Creating a small sample dataframe to demonstrate the EDA steps.')
    df = pd.DataFrame({
        'EmployeeID': [1001,1002,1003,1004,1005],
        'Age': [29, 35, 40, 28, 50],
        'Gender': ['Male','Female','Male','Female','Male'],
        'Department': ['Sales','R&D','R&D','Sales','HR'],
        'JobRole': ['Sales Executive','Research Scientist','Laboratory Technician','Sales Executive','HR Manager'],
        'MonthlyIncome': [4000, 6500, 3000, 4200, 9000],
        'YearsAtCompany': [2, 7, 3, 1, 20],
        'OverTime': ['Yes','No','No','Yes','No'],
        'JobSatisfaction': [3,4,2,3,4],
        'WorkLifeBalance': [2,3,3,2,4],
        'PerformanceRating': [3,4,2,3,5]
    })
df.head()

In [None]:
# 3. Quick inspection
print('Shape:', df.shape)
print('\nInfo:')
df.info()  # info() prints its report directly and returns None, so no print() wrapper
print('\nDescribe:')
display(df.describe(include='all'))

In [None]:
# 4. Missing values and duplicates
print('Missing values per column:')
print(df.isnull().sum())

print('\nDuplicate rows:', df.duplicated().sum())


In [None]:
# 5. Handling missing values (general approach) - adapt per column as needed
# include='number' catches every numeric dtype (int32, float32, ...), not only int64/float64
num_cols = df.select_dtypes(include='number').columns.tolist()
cat_cols = df.select_dtypes(include=['object','category']).columns.tolist()
print('Numeric cols:', num_cols)
print('Categorical cols:', cat_cols)

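# Fill numeric with median and categorical with mode (demonstration).
# Assign the result back: calling fillna(inplace=True) on df[c] is chained
# assignment, which recent pandas versions warn about and may not apply to df.
for c in num_cols:
    if df[c].isnull().any():
        df[c] = df[c].fillna(df[c].median())
for c in cat_cols:
    if df[c].isnull().any():
        df[c] = df[c].fillna(df[c].mode().iloc[0])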

# Drop duplicates
df.drop_duplicates(inplace=True)
print('After cleaning - shape:', df.shape)
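
The median/mode fill above is a blanket rule. As a minimal sketch of "adapt per column", the cell below uses a hypothetical `fill_plan` dictionary and a hypothetical `impute` helper (neither is part of the notebook) on a toy frame whose column names mirror the sample data. The strategies chosen per column are illustrative assumptions, not recommendations.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; values are made up for illustration only.
toy = pd.DataFrame({
    'Age': [29, np.nan, 40, 28],
    'MonthlyIncome': [4000, 6500, np.nan, 4200],
    'Department': ['Sales', None, 'Sales', 'HR'],
})

# Hypothetical per-column plan: 'median', 'mean', 'mode', or a literal fill value.
fill_plan = {'Age': 'median', 'MonthlyIncome': 'mean', 'Department': 'mode'}

def impute(frame, plan):
    """Return a copy of frame with each planned column filled by its strategy."""
    out = frame.copy()
    for col, how in plan.items():
        if col not in out.columns:
            continue  # skip columns the dataset does not have
        if how == 'median':
            value = out[col].median()
        elif how == 'mean':
            value = out[col].mean()
        elif how == 'mode':
            value = out[col].mode().iloc[0]
        else:
            value = how  # anything else is treated as a literal fill value
        out[col] = out[col].fillna(value)
    return out

filled = impute(toy, fill_plan)
print(filled)
```

Working on a copy keeps the original frame available for comparing distributions before and after imputation.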


In [None]:
# 6. Drop irrelevant columns example
if 'EmployeeID' in df.columns:
    df.drop(columns=['EmployeeID'], inplace=True)
df.head()

In [None]:
# 7. Summary statistics
display(df.describe(include='all'))

# Unique values for categorical columns
for c in cat_cols:
    print(f"{c}: {df[c].nunique()} unique values -> {list(df[c].unique())[:10]}")


In [None]:
# 8. Count plots for key categorical features
# Create each figure only when its column exists, so a missing column
# does not leave an empty figure behind.
for col, label in [('Department', 'Department'), ('JobRole', 'Job Role')]:
    if col in df.columns:
        plt.figure(figsize=(10,6))
        sns.countplot(data=df, x=col, order=df[col].value_counts().index)
        plt.title(f'Count of employees by {label}')
        plt.xticks(rotation=45, ha='right')
        plt.tight_layout()
        plt.show()


In [None]:
# 9. Boxplots to detect outliers for numeric columns
# Recompute the numeric columns, since EmployeeID was dropped in step 6
num_cols = df.select_dtypes(include='number').columns.tolist()
for c in num_cols:
    plt.figure(figsize=(8,3))
    sns.boxplot(x=df[c])
    plt.title(f'Boxplot of {c}')
    plt.tight_layout()
    plt.show()
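
The boxplots show outliers visually. With seaborn's default whisker setting, points drawn beyond the whiskers are those more than 1.5 × IQR outside the quartiles. The sketch below applies that same rule numerically so the flagged values can be listed; `iqr_outliers` is a hypothetical helper, not part of the notebook, and the values are taken from the sample `YearsAtCompany` column.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values of series lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Values from the sample YearsAtCompany column
years = pd.Series([2, 7, 3, 1, 20], name='YearsAtCompany')
flagged = iqr_outliers(years)
print(flagged)
```

A flagged value is a candidate for review rather than automatic removal: a 20-year tenure may be perfectly valid data.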


In [None]:
# 10. Correlation heatmap for numeric features
if len(num_cols) > 1:
    plt.figure(figsize=(8,6))
    sns.heatmap(df[num_cols].corr(), annot=True, fmt='.2f', cmap='Blues')
    plt.title('Correlation Heatmap')
    plt.show()


In [None]:
# 11. Save cleaned dataset
CLEAN_FILE = 'clean_employee_performance.csv'
df.to_csv(CLEAN_FILE, index=False)
print('Saved cleaned dataset as', CLEAN_FILE)


### Next steps (Component 3 - Modelling)

1. Feature engineering (create derived features, encode categorical variables)
2. Split data into train/test sets and scale features if required
3. Train baseline models (Logistic Regression / Random Forest) and evaluate
4. Use SHAP/feature importance for explainability
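
As a rough sketch of steps 1 and 2, the cell below one-hot encodes categorical columns and makes a reproducible 80/20 split using pandas only. It assumes `PerformanceRating` is the modelling target, which the notebook does not state, and it uses a small stand-in frame instead of the cleaned CSV. In practice scikit-learn's `train_test_split` is the more common tool for the split.

```python
import pandas as pd

# Small stand-in frame; in practice this would be the cleaned CSV saved in step 11.
data = pd.DataFrame({
    'Age': [29, 35, 40, 28, 50],
    'Department': ['Sales', 'R&D', 'R&D', 'Sales', 'HR'],
    'OverTime': ['Yes', 'No', 'No', 'Yes', 'No'],
    'PerformanceRating': [3, 4, 2, 3, 5],
})

TARGET = 'PerformanceRating'  # assumed target column

# Step 1: one-hot encode the categorical columns
X = pd.get_dummies(data.drop(columns=[TARGET]), columns=['Department', 'OverTime'])
y = data[TARGET]

# Step 2: reproducible 80/20 split by sampling row labels for the test set
test_idx = X.sample(frac=0.2, random_state=42).index
X_train, X_test = X.drop(index=test_idx), X.loc[test_idx]
y_train, y_test = y.drop(index=test_idx), y.loc[test_idx]

print(X_train.shape, X_test.shape)
```

Any scaling should be fitted on `X_train` only and then applied to `X_test`, so that test-set statistics do not leak into training.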
