## Attrition: ***a company losing its workforce***

**Attrition is a process in which the workforce dwindles at a company, following a period in which a number of people retire or resign and are not replaced.**
- A reduction in staff due to attrition is often called a hiring freeze and is seen as a less disruptive way to trim the workforce and reduce payroll than layoffs.
- In this notebook, our aim is to analyze the dataset feature by feature and find the reasons behind employee attrition.
- What are the top factors that lead to employee attrition?

# Description of the Dataset

- **Employee ID**: A unique identifier assigned to each employee.
- **Age**: The age of the employee, ranging from 18 to 60 years.
- **Gender**: The gender of the employee.
- **Years at Company**: The number of years the employee has been working at the company.
- **Monthly Income**: The monthly salary of the employee, in dollars.
- **Job Role**: The department or role the employee works in, encoded into categories such as Finance, Healthcare, Technology, Education, and Media.
- **Work-Life Balance**: The employee's perceived balance between work and personal life (Poor, Below Average, Good, Excellent).
- **Job Satisfaction**: The employee's satisfaction with their job (Very Low, Low, Medium, High).
- **Performance Rating**: The employee's performance rating (Low, Below Average, Average, High).
- **Number of Promotions**: The total number of promotions the employee has received.
- **Distance from Home**: The distance between the employee's home and workplace, in miles.
- **Education Level**: The highest education level attained by the employee (High School, Associate Degree, Bachelor’s Degree, Master’s Degree, PhD).
- **Marital Status**: The marital status of the employee (Divorced, Married, Single).
- **Job Level**: The job level of the employee (Entry, Mid, Senior).
- **Company Size**: The size of the company the employee works for (Small, Medium, Large).
- **Company Tenure**: The total number of years the employee has been working in the industry.
- **Remote Work**: Whether the employee works remotely (Yes or No).
- **Leadership Opportunities**: Whether the employee has leadership opportunities (Yes or No).
- **Innovation Opportunities**: Whether the employee has opportunities for innovation (Yes or No).
- **Company Reputation**: The employee's perception of the company's reputation (Very Poor, Poor, Good, Excellent).
- **Employee Recognition**: The level of recognition the employee receives (Very Low, Low, Medium, High).
- **Attrition**: Whether the employee has left the company, encoded as **0 (Stayed) and 1 (Left)**.


<h2>Some Python Libraries</h2>

<p style="text-align: justify;">First, let's import some libraries to help us manipulate the dataset: `pandas`, `numpy`, `matplotlib`, and `seaborn`. In this tutorial, we implement a Logistic Regression with `scikit-learn`. The goal is to keep things as simple as possible, so we rely on ready-made libraries and their functionality rather than building the model from scratch.</p>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Data Collection

In [None]:
df = pd.read_csv('train.csv')

In [None]:
df.head()

In [None]:
# Display basic dataset information
print("Dataset Shape:", df.shape)
print("Columns in Dataset:", df.columns)

### Data Exploration

#### Checking for Missing Values

In [None]:
missing_values = df.isnull().sum()
print("\nMissing Values:\n", missing_values)
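
The cell above only counts missing values. If any column reports a non-zero count, one possible way to fill the gaps is sketched below: the median for numeric columns and the most frequent value for categorical ones. The toy frame and its values are made up for illustration; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, values are invented
toy = pd.DataFrame({
    "Monthly Income": [5000.0, np.nan, 7000.0, 6000.0],
    "Job Role": ["Finance", "Media", None, "Finance"],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Numeric column: fill with the median, which is robust to outliers
        toy[col] = toy[col].fillna(toy[col].median())
    else:
        # Categorical column: fill with the most frequent value
        toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy.isnull().sum())
```

Whether imputation is appropriate at all depends on how many values are missing and why; dropping the affected rows is a reasonable alternative when there are only a few.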


#### Showing Summary statistics

In [None]:
summary_stats = df.describe()
print("\nSummary Statistics:\n", summary_stats)


#### Calculate How many Stayed and Left

In [None]:
df["Attrition"].value_counts()

### Handling Duplicated Values

In [None]:
df.duplicated().sum()
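
The cell above reports how many rows are exact duplicates. If that count is greater than zero, a minimal sketch for removing them could look like this (the toy frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with one fully duplicated row
toy = pd.DataFrame({
    "Employee ID": [1, 2, 2, 3],
    "Attrition": ["Stayed", "Left", "Left", "Stayed"],
})

n_dupes = toy.duplicated().sum()          # rows identical to an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Duplicates found: {n_dupes}, rows kept: {len(deduped)}")
```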

### **Preprocessing and Feature Engineering**

### Drop Unneeded Columns

In [None]:
df.columns

In [None]:
df.drop(columns="Overtime", inplace=True)

In [None]:
df.drop(columns='Number of Promotions',inplace=True)

In [None]:
df.drop(columns='Gender',inplace=True)

In [None]:
df.drop(columns='Age',inplace=True)

In [None]:
df.drop(columns='Education Level',inplace=True)

In [None]:
df.drop(columns='Number of Dependents',inplace=True)

In [None]:
df.drop(columns='Company Size',inplace=True)

In [None]:
df.drop(columns='Leadership Opportunities',inplace=True)

In [None]:
df.drop(columns='Innovation Opportunities',inplace=True)

In [None]:
df.drop(columns='Company Reputation',inplace=True)

In [None]:
df.drop(columns='Employee Recognition',inplace=True)

In [None]:
df.drop(columns='Company Tenure',inplace=True)

In [None]:
df.drop(columns='Employee ID',inplace=True)
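
The thirteen separate `drop` cells above can also be written as a single call. The sketch below uses a small hypothetical frame; `errors="ignore"` makes the call safe to re-run, because columns that are already gone are skipped instead of raising a `KeyError`.

```python
import pandas as pd

# Same columns as in the individual drop cells above
cols_to_drop = [
    "Overtime", "Number of Promotions", "Gender", "Age", "Education Level",
    "Number of Dependents", "Company Size", "Leadership Opportunities",
    "Innovation Opportunities", "Company Reputation", "Employee Recognition",
    "Company Tenure", "Employee ID",
]

# Hypothetical toy frame containing only some of those columns
toy = pd.DataFrame({
    "Employee ID": [1, 2],
    "Age": [30, 41],
    "Gender": ["Male", "Female"],
    "Monthly Income": [5000, 7200],
    "Attrition": ["Stayed", "Left"],
})

# Returns a new frame; labels not present in the frame are ignored
reduced = toy.drop(columns=cols_to_drop, errors="ignore")
print(reduced.columns.tolist())
```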

### Handling Outliers

In [None]:
# Boxplot to detect outliers in numerical columns
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[["Monthly Income"]], palette="coolwarm")
plt.title("Monthly Income Checking for Outliers", fontsize=14)
plt.show()

In [None]:
# Boxplot to detect outliers in numerical columns
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[["Distance from Home"]], palette="coolwarm")
plt.title("Distance from home Checking for Outliers", fontsize=14)
plt.show()

In [None]:
# Compute Q1 (25%) and Q3 (75%)
Q1 = df["Monthly Income"].quantile(0.25)
Q3 = df["Monthly Income"].quantile(0.75)

# Compute IQR (Interquartile Range)
IQR = Q3 - Q1

# Define outlier thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detect outliers
outliers = df[(df["Monthly Income"] < lower_bound) | (df["Monthly Income"] > upper_bound)]
print(f"Number of outliers in Monthly Income: {len(outliers)}")

#### Apply capping (Winsorization)

In [None]:
df["Monthly Income"] = np.where(df["Monthly Income"] > upper_bound, upper_bound, df["Monthly Income"])
df["Monthly Income"] = np.where(df["Monthly Income"] < lower_bound, lower_bound, df["Monthly Income"])
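
The capping step above can be wrapped in a small reusable helper. This is a sketch, not part of the original notebook: `cap_iqr` is a hypothetical name, and it uses `Series.clip` in place of the two `np.where` calls, which should give the same result.

```python
import pandas as pd

def cap_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Invented incomes with one extreme value
income = pd.Series([4000.0, 4500.0, 5000.0, 5500.0, 6000.0, 20000.0])
capped = cap_iqr(income)

print(capped.tolist())
```

Here Q1 = 4625 and Q3 = 5875, so the upper bound is 5875 + 1.5 × 1250 = 7750 and the extreme value 20000 is pulled down to it, while the other values are unchanged.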


#### After Handling The Outliers

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[["Monthly Income"]], palette="coolwarm")
plt.title("Monthly Income After Handling Outliers", fontsize=14)
plt.show()

## **Exploratory Data Analysis(EDA)**

### **Univariate Analysis**

#### Attrition Distribution

In [None]:

# Distribution of the target variable (Attrition)
plt.figure(figsize=(6, 4))  # Sets the plot size to 6 inches wide and 4 inches tall.
sns.countplot(x='Attrition', data=df, palette='Set2')  # Creates a bar chart for the 'Attrition' column with colorful bars.
plt.title('Distribution of Attrition')  # Adds a title to the chart.
plt.xlabel('Attrition (Stayed/Left)')  # Labels the x-axis as "Attrition (Stayed/Left)".
plt.ylabel('Count')  # Labels the y-axis as "Count".
plt.show()  # Displays the chart.


#### Job Role Distribution

In [None]:
# Set figure size
plt.figure(figsize=(10, 5))

# Create a grouped bar plot
sns.countplot(x="Job Role", hue="Attrition", data=df, palette="viridis")

# Titles and labels
plt.title("Attrition by Job Roles", fontsize=14)
plt.xlabel("Job Role", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(rotation=45)  # Rotate x-axis labels for readability
plt.legend(title="Attrition")  # Keep seaborn's own labels so they always match the hue order

# Show plot
plt.show()

#### Job Satisfaction

In [None]:
# Plot Attrition by Job Satisfaction
plt.figure(figsize=(8, 6))
sns.countplot(x='Job Satisfaction', hue='Attrition', data=df)
plt.title('Attrition by Job Satisfaction')
plt.show()

#### Work-Life Balance

In [None]:
# Plot Attrition by Work-Life Balance
plt.figure(figsize=(8, 6))
sns.countplot(x='Work-Life Balance', hue='Attrition', data=df)
plt.title('Attrition by Work-Life Balance')
plt.show()

#### Marital Status

In [None]:
# Plot Attrition by Marital Status
plt.figure(figsize=(8, 6))
sns.countplot(x='Marital Status', hue='Attrition', data=df)
plt.title('Attrition by Marital Status')
plt.show()


### **Bivariate Analysis**

#### Monthly Income vs Attrition

In [None]:
# Set figure size
plt.figure(figsize=(8, 5))

# Boxplot for Monthly Income vs. Attrition
sns.boxplot(x="Attrition", y="Monthly Income", data=df, palette="viridis")

# Titles and labels
plt.title("Monthly Income by Attrition", fontsize=14)
plt.xlabel("Attrition", fontsize=12)
plt.ylabel("Monthly Income", fontsize=12)

# Show plot
plt.show()

#### Encoding Categorical Data

In [None]:
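from sklearn.preprocessing import LabelEncoder

# Columns to label-encode. LabelEncoder assigns codes in alphabetical order of the
# labels, so the codes of ordered scales (e.g. Job Satisfaction) do not follow their natural order.
categorical_cols = ['Job Role', 'Marital Status', 'Job Level',
                    'Remote Work', 'Work-Life Balance', 'Job Satisfaction', 'Performance Rating']

# Initialize label encoders
label_encoders = {col: LabelEncoder() for col in categorical_cols}

# Apply label encoding
for col in categorical_cols:
    df[col] = label_encoders[col].fit_transform(df[col])

# Encode the target explicitly: LabelEncoder would give "Left" -> 0 and "Stayed" -> 1,
# the reverse of the 0 = Stayed / 1 = Left convention used in this notebook.
# Assumes the raw labels are "Stayed" and "Left".
if not pd.api.types.is_numeric_dtype(df["Attrition"]):
    df["Attrition"] = df["Attrition"].map({"Stayed": 0, "Left": 1})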
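
`LabelEncoder` numbers the labels in alphabetical order, which scrambles ordered scales such as Job Satisfaction. The sketch below contrasts that with an explicit ordinal encoding. It assumes the raw labels are spelled exactly as in the dataset description (Very Low, Low, Medium, High); the sample values are invented.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented sample of Job Satisfaction answers
satisfaction = pd.Series(["High", "Very Low", "Medium", "Low", "High"])

# Alphabetical codes: High=0, Low=1, Medium=2, Very Low=3
alphabetical_codes = LabelEncoder().fit_transform(satisfaction)

# Explicit order taken from the dataset description (assumed spelling)
order = ["Very Low", "Low", "Medium", "High"]
ordinal_codes = pd.Categorical(satisfaction, categories=order, ordered=True).codes

print("LabelEncoder:", list(alphabetical_codes))
print("Ordinal     :", list(ordinal_codes))
```

With the ordinal codes, a larger number always means higher satisfaction, so correlations and products built from the column are easier to interpret.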

####  Creating relevant interaction features

In [None]:
# 1️⃣ Salary-to-Performance Ratio
df["Salary_Performance_Ratio"] = df["Monthly Income"] / (df["Performance Rating"] + 1)  # Avoid division by zero

# 2️⃣ Categorizing Tenure into Groups
def tenure_category(years):
    if years < 2:
        return "Short-Term"
    elif 2 <= years < 5:
        return "Medium-Term"
    else:
        return "Long-Term"

df["Tenure_Group"] = df["Years at Company"].apply(tenure_category)

# Encode Tenure Groups
tenure_mapping = {"Short-Term": 1, "Medium-Term": 2, "Long-Term": 3}
df["Tenure_Group"] = df["Tenure_Group"].map(tenure_mapping)
# 3️⃣ Interaction Feature: Work-Life Balance & Job Satisfaction
df["WorkLife_Satisfaction_Score"] = df["Work-Life Balance"] * df["Job Satisfaction"]

# 4️⃣ Normalized Income Based on Job Level
df["Income_JobLevel_Ratio"] = df["Monthly Income"] / (df["Job Level"] + 1)  # Avoid division by zero

# Display first few rows
df.head()
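
The tenure grouping above can also be expressed without a Python-level function, using `pd.cut`. This is an alternative sketch with the same cut-offs (under 2 years, 2 to under 5, 5 and above); the sample values are invented.

```python
import numpy as np
import pandas as pd

# Invented tenure values in years
years = pd.Series([0, 1, 2, 4, 5, 12])

# right=False makes each bin include its left edge: [-inf, 2), [2, 5), [5, inf)
tenure_group = pd.cut(
    years,
    bins=[-np.inf, 2, 5, np.inf],
    labels=[1, 2, 3],  # 1 = Short-Term, 2 = Medium-Term, 3 = Long-Term
    right=False,
).astype(int)

print(tenure_group.tolist())
```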

#### Normalizing Numerical Features

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Define numerical columns to normalize
numerical_cols = ["Monthly Income", "Distance from Home", "Years at Company"]

# Apply Min-Max Scaling
scaler = MinMaxScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

# Display dataset after normalization
df.head()
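
Min-max scaling maps each column to the range [0, 1] with the formula (x - min) / (max - min). The sketch below checks that formula against `MinMaxScaler` on a few invented values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented monthly incomes, shaped as a single column
values = np.array([[2000.0], [5000.0], [8000.0]])

# Manual min-max scaling, computed per column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# Same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(values)

print(manual.ravel(), scaled.ravel())
```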


### **Multivariate Analysis**

#### Correlation Analysis

In [None]:
# Columns up to and including Attrition (the engineered features come after it)
df1 = df.loc[:, :"Attrition"]
plt.figure(figsize=(12, 8))  # Adjust figure size
corr_matrix = df1.corr()  # Compute correlation matrix

# Plot heatmap with improved styling
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="seismic", linewidths=0.6, annot_kws={"size": 8})

# Rotate x-axis labels for readability
plt.xticks(rotation=45, ha="right", fontsize=10)
plt.yticks(fontsize=10)
plt.title("Feature Correlation Heatmap", fontsize=14)
plt.show()

# 📌 Key observations (hand-written notes; check the values against the heatmap above)
print("\n🔍 Key Insights from the Correlation Matrix:")
print("- Strong positive correlation (above 0.5) between Years at Company and Job Level, indicating that employees with more years in the company tend to hold higher positions.")
print("- Monthly Income has a moderate positive correlation with Job Level, suggesting that higher-level employees earn more.")
print("- Attrition shows a weak negative correlation with Work-Life Balance (-0.31) and Job Satisfaction (-0.22), meaning that employees with lower satisfaction and poor work-life balance are more likely to leave.")
print("- Remote Work has a slight negative correlation with Attrition (-0.23), suggesting that employees who work remotely may be more likely to stay.")
print("- Distance from Home has a small but notable correlation with Attrition (0.22), meaning employees with longer commutes might be more likely to leave.")
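
The observations printed above are written by hand, so they can drift from the data if the preprocessing changes. One way to avoid that is to read the numbers from the correlation matrix directly. The sketch below does this on a small invented frame; in the notebook the same lines would run on `df1`.

```python
import pandas as pd

# Invented, already-encoded toy data
toy = pd.DataFrame({
    "Work-Life Balance": [0, 1, 2, 3, 0, 1, 2, 3],
    "Distance from Home": [30, 5, 12, 8, 45, 20, 3, 10],
    "Monthly Income": [4000, 5200, 6100, 7000, 3900, 5000, 6500, 7200],
    "Attrition": [1, 0, 0, 0, 1, 1, 0, 0],
})

# Correlation of every feature with the target, strongest (by magnitude) first
target_corr = (
    toy.corr()["Attrition"]
    .drop("Attrition")
    .sort_values(key=abs, ascending=False)
)

for feature, r in target_corr.items():
    print(f"- {feature}: r = {r:+.2f}")
```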

### **Advanced Data Analysis**

In [None]:
from scipy.stats import ttest_ind, chi2_contingency, f_oneway
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Define features and target variable
X = df.drop(columns=["Attrition"])
y = df["Attrition"]

# Categorical predictors (the target is never tested against itself)
categorical_features = [col for col in categorical_cols if col != "Attrition"]

# Step 1: Perform Statistical Tests

# Welch's T-Test (Numerical Features)
t_test_results = {feature: ttest_ind(df[df["Attrition"] == 0][feature], 
                                     df[df["Attrition"] == 1][feature], 
                                     equal_var=False)[1] 
                  for feature in numerical_cols}

# ANOVA (Numerical Features)
anova_results = {feature: f_oneway(df[df["Attrition"] == 0][feature], 
                                   df[df["Attrition"] == 1][feature])[1] 
                 for feature in numerical_cols}

# Chi-Squared Test (Categorical Features)
chi2_results = {}
for feature in categorical_features:
    contingency_table = pd.crosstab(df[feature], df["Attrition"])
    chi2_results[feature] = chi2_contingency(contingency_table)[1]

# Step 2: Feature Selection using SelectKBest
select_kbest = SelectKBest(score_func=f_classif, k=5)  # Select top 5 features
X_selected = select_kbest.fit_transform(X, y)
selected_features = X.columns[select_kbest.get_support()]

# Step 3: Visualize Feature Importance
plt.figure(figsize=(8, 5))
sns.barplot(y=selected_features, x=select_kbest.scores_[select_kbest.get_support()], palette="viridis")
plt.xlabel("ANOVA F-score")
plt.ylabel("Features")
plt.title("Top 5 Features Influencing Employee Attrition")
plt.grid(axis="x", linestyle="--", alpha=0.7)
plt.show()

# Display Results
print("\n🔍 **T-Test Results (P-values):**")
print(t_test_results)

print("\n📌 **ANOVA Results (P-values):**")
print(anova_results)

print("\n📊 **Chi-Squared Test Results (P-values):**")
print(chi2_results)

print("\n🚀 **Top 5 Selected Features (SelectKBest - ANOVA F-test):**")
print(list(selected_features))
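
To make the chi-squared p-values above easier to read, here is a small worked example on invented data. It builds the contingency table for one categorical feature against Attrition and unpacks everything `chi2_contingency` returns, including the counts expected if the two variables were independent.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented data: 20 remote and 20 on-site employees
toy = pd.DataFrame({
    "Remote Work": ["Yes"] * 20 + ["No"] * 20,
    "Attrition": [0] * 16 + [1] * 4 + [0] * 8 + [1] * 12,
})

# Observed counts: rows = Remote Work, columns = Attrition
table = pd.crosstab(toy["Remote Work"], toy["Attrition"])

chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print("Expected under independence:\n", expected)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```

A small p-value means the observed counts differ from the expected ones by more than chance would usually explain. It says nothing about how strong the association is.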

### **The Final Correlation**

In [None]:
# Compute the correlation matrix
corr_matrix = df.corr()

# Set up the figure size
plt.figure(figsize=(12, 8))

# Create the heatmap with better readability
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="seismic", linewidths=0.6, annot_kws={"size": 8})

# Improve label readability
plt.xticks(rotation=45, ha="right", fontsize=10)
plt.yticks(fontsize=10)
plt.title("Feature Correlation Heatmap", fontsize=12)

# Show the heatmap
plt.tight_layout()
plt.show()

#### Stacked Bar Chart: Attrition Distribution by Job Role

In [None]:
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

# --- 1. Stacked Bar Chart: Attrition by Job Role ---
job_role_counts = df.groupby(["Job Role", "Attrition"]).size().unstack()
# DataFrame.plot opens its own figure, so the size is passed here rather than through plt.figure
job_role_counts.plot(kind="bar", stacked=True, colormap="coolwarm", edgecolor="black", figsize=(14, 6))

plt.title("Attrition Distribution by Job Role")
plt.xlabel("Job Role")
plt.ylabel("Employee Count")
plt.xticks(rotation=45)
plt.legend(title="Attrition (0 = Stayed, 1 = Left)")
plt.show()
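
Stacked counts make large job roles look like they have more attrition simply because they have more employees. A complementary view is the attrition rate within each role, which `pd.crosstab` can compute with `normalize="index"`. The toy data below is invented.

```python
import pandas as pd

# Invented data: 4 Finance and 2 Media employees
toy = pd.DataFrame({
    "Job Role": ["Finance"] * 4 + ["Media"] * 2,
    "Attrition": [0, 0, 0, 1, 0, 1],
})

# Each row sums to 1: share of employees who stayed (0) and left (1) per role
rates = pd.crosstab(toy["Job Role"], toy["Attrition"], normalize="index")

print(rates)
```

Media has fewer leavers in absolute terms (one, same as Finance) but a higher attrition rate (50% against 25%).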

#### Kernel Density Estimation (KDE): Monthly Income Distribution

In [None]:
plt.figure(figsize=(10, 6))
# fill=True replaces the deprecated shade=True argument
sns.kdeplot(df[df["Attrition"] == 0]["Monthly Income"], label="Stayed", fill=True, color="blue")
sns.kdeplot(df[df["Attrition"] == 1]["Monthly Income"], label="Left", fill=True, color="red")

plt.title("KDE Plot: Monthly Income Distribution by Attrition")
plt.xlabel("Monthly Income")
plt.ylabel("Density")
plt.legend()
plt.show()

#### Box Plot: Work-Life Balance vs Attrition

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x="Attrition", y="Work-Life Balance", data=df, palette="coolwarm")

plt.title("Box Plot: Work-Life Balance by Attrition")
plt.xlabel("Attrition (0 = Stayed, 1 = Left)")
plt.ylabel("Work-Life Balance")
plt.show()

#### Pair Plot: Relationships Between Key Features 

In [None]:
key_features = ["Monthly Income", "Years at Company", "Work-Life Balance", "Job Satisfaction", "Attrition"]
sns.pairplot(df[key_features], hue="Attrition", palette="coolwarm", diag_kind="kde")

plt.suptitle("Pair Plot of Key Features Affecting Attrition", fontsize=16, y=1.02)  # y > 1 keeps the title clear of the top row
plt.show()

#### Swarm Plot: Job Level vs Monthly Income

In [None]:
# Convert Attrition to numeric (if stored as text)
if df["Attrition"].dtype == "object":
    df["Attrition"] = df["Attrition"].map({"Stayed": 0, "Left": 1})  # Adjust based on actual values
    df.dropna(subset=["Attrition"], inplace=True)  # Remove rows where Attrition couldn't be converted

# Take a random sample of up to 1,000 rows so the swarm plot stays readable and fast
df_sample = df[df["Attrition"].notnull()].sample(n=1000, random_state=42) if len(df) > 1000 else df

# Generate Swarm Plot
plt.figure(figsize=(12, 6))
sns.swarmplot(x="Job Level", y="Monthly Income", hue="Attrition", data=df_sample, 
              palette="coolwarm", alpha=0.5, size=3, dodge=True)

plt.title("Swarm Plot: Job Level vs Monthly Income")
plt.xlabel("Job Level")
plt.ylabel("Monthly Income")
plt.legend(title="Attrition (0 = Stayed, 1 = Left)")
plt.show()
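
The sampling line in the cell above draws rows at random, which does not by itself guarantee that employees who left are represented in the same proportion as in the full data. If that matters, a stratified sample is one option. The sketch below uses `groupby(...).sample` on an invented frame with 10% leavers.

```python
import pandas as pd

# Invented data: 90 employees who stayed, 10 who left
toy = pd.DataFrame({
    "Attrition": [0] * 90 + [1] * 10,
    "Monthly Income": range(100),
})

# Draw 20% from each Attrition group so the class ratio is preserved
stratified = toy.groupby("Attrition").sample(frac=0.2, random_state=42)

print(stratified["Attrition"].value_counts())
```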



In [None]:
# Define the file path for the final cleaned and edited dataset
final_dataset_path = "Final_Cleaned_Dataset.csv"

# Save the dataset to a new CSV file without the index column
df.to_csv(final_dataset_path, index=False)

print(f"Dataset successfully saved as {final_dataset_path}")