# TCS 2025 Layoff Analysis: AI Integration and Skills Mismatch Insights

In this exploratory data analysis project, we analyze the tcs_workforce_2025.csv dataset to uncover key patterns behind Tata Consultancy Services’ 2025 workforce restructuring. The analysis focuses on the intersection of AI integration, skill mismatches, reskilling efforts, and redeployment policies using interactive Plotly visualizations. We explore how demographic factors, training levels, performance ratings, and strategic policies influenced layoff decisions — and present the findings in a visually rich, shareable format suitable for executive briefings and LinkedIn thought leadership.

# Data Overview

## Import Libraries

In [2]:
!pip install reportlab


Collecting reportlab
  Downloading reportlab-4.4.3-py3-none-any.whl.metadata (1.7 kB)
Downloading reportlab-4.4.3-py3-none-any.whl (2.0 MB)
   ---------------------------------------- 0.0/2.0 MB ? eta -:--:--
   ---------------- ----------------------- 0.8/2.0 MB 8.3 MB/s eta 0:00:01
   ---------------------------------------- 2.0/2.0 MB 12.0 MB/s eta 0:00:00
Installing collected packages: reportlab
Successfully installed reportlab-4.4.3




In [48]:
# Data handling
import pandas as pd
import numpy as np

# Statistical tests
from scipy import stats
from scipy.stats import chi2_contingency


# Visualization - interactive
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

# Image export (for Plotly PNGs)
import kaleido  # Ensure installed: pip install -U kaleido==0.2.1

# File handling and PDF report generation
import os
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch

# Warnings and plot rendering
import warnings
warnings.filterwarnings('ignore')

# Configure Plotly to render inside Jupyter
pio.renderers.default = "notebook_connected"

In [2]:
#Load the Dataset
df=pd.read_csv("tcs_workforce_2025.csv")
# Show the number of rows and columns
print(f"Dataset has {df.shape[0]} rows and {df.shape[1]} columns.")

Dataset has 10000 rows and 26 columns.


In [3]:
df.shape

(10000, 26)

### Inference from Dataset Dimensions

This code snippet and its output provide a foundational understanding of the 'TCS Workforce 2025' dataset's structure.

1.  **Code Purpose**:
    * The code first loads the `tcs_workforce_2025.csv` file into a pandas DataFrame named `df`.
    * Subsequently, it uses `df.shape` to retrieve the dimensions of the DataFrame, specifically the number of rows and columns, and then prints this information in a user-friendly format.

2.  **Output Interpretation**:
    * The output `Dataset has 10000 rows and 26 columns.` clearly indicates the size of the dataset.
    * **Number of Rows (10,000)**: This signifies that the dataset contains records for 10,000 individual employees. This is a substantial number, allowing for robust statistical analysis and the identification of meaningful trends within the workforce.
    * **Number of Columns (26)**: This indicates that for each of the 10,000 employees, there are 26 different attributes or features recorded. These attributes likely cover various aspects such as demographics, performance, skills, project assignments, and other relevant workforce metrics.

3.  **Implications for Data Analysis**:
    * **Scale of Analysis**: With 10,000 records, the dataset is large enough to support various analytical tasks, including segmentation, correlation analysis, and predictive modeling, without being overly burdensome for typical data processing environments.
    * **Feature Richness**: The presence of 26 columns suggests a rich dataset with a wide array of information points per employee. This richness allows for multi-dimensional analysis and the exploration of complex relationships between different workforce attributes (e.g., how age correlates with performance, or how skills match influences layoff risk).
    * **Initial Data Validation**: Knowing the exact dimensions is a crucial first step in data validation, confirming that the dataset was loaded correctly and matches expected sizes.

In [4]:
# Preview the dataset
df.head()

Unnamed: 0,EmployeeID,Name,Age,Gender,Location,Department,Designation,YearsAtTCS,PerformanceRating,AI_Training_Level,...,PreviousBenchInstances,UpSkilledLastYear,WillingToReskill,RedeploymentAttempts,LastPromotionYearsAgo,SalaryUSD,ManagerFeedbackScore,LayoffFlag,LayoffReason,DateOfRecord
0,TCS00000,Employee_0,37,Male,USA,HR,Junior,3.0,High,Basic,...,0,Yes,Yes,0,1,10000,3.89,0,,2025-07-30
1,TCS00001,Employee_1,48,Male,UK,IT,Junior,9.7,High,Advanced,...,0,No,Yes,0,2,24669,4.49,0,,2025-07-30
2,TCS00002,Employee_2,40,Female,India,Sales,Manager,3.7,Medium,Basic,...,0,Yes,Yes,1,3,18335,4.41,0,,2025-07-30
3,TCS00003,Employee_3,36,Male,UK,Finance,Mid,11.6,Medium,Advanced,...,1,Yes,Yes,2,2,21535,3.15,0,,2025-07-30
4,TCS00004,Employee_4,54,Male,India,IT,Senior,11.3,Medium,Basic,...,1,No,No,0,2,23560,2.92,0,,2025-07-30


### Inference from Dataset Preview (df.head())

This output provides a critical initial glimpse into the actual data contained within the 'TCS Workforce 2025' dataset, allowing for an understanding of its structure, data types, and the kind of information each column holds.

1.  **Code Purpose**:
    * The `df.head()` command is used to display the first 5 rows of the DataFrame. This is a standard and essential step in data exploration, providing a quick way to inspect the data's format and content without viewing the entire dataset.

2.  **Output Interpretation**:
    * **Column Headers**: The output clearly lists all 26 column headers, which include: `EmployeeID`, `Name`, `Age`, `Gender`, `Location`, `Department`, `Designation`, `YearsAtTCS`, `PerformanceRating`, `AI_Training_Level`, `SkillsMatch`, `OnsiteExperience`, `Certifications`, `CurrentProject`, `BillableDays`, `BenchDays`, `PreviousBenchInstances`, `UpSkilledLastYear`, `WillingToReskill`, `RedeploymentAttempts`, `LastPromotionYearsAgo`, `SalaryUSD`, `ManagerFeedbackScore`, `LayoffFlag`, `LayoffReason`, and `DateOfRecord`. This confirms the wide range of attributes available for analysis.
    * **Data Types and Values**:
        * **Categorical Data**: Columns like `Gender` (Male/Female), `Location` (USA, UK, India, etc.), `Department` (HR, IT, Sales, Finance), `Designation` (Junior, Manager, Mid, Senior), `PerformanceRating` (High, Medium), `AI_Training_Level` (Basic, Advanced, None), `SkillsMatch` (Match/Mismatch), `OnsiteExperience` (Yes/No), `CurrentProject` (Yes/No), `UpSkilledLastYear` (Yes/No), `WillingToReskill` (Yes/No), and `LayoffFlag` (0/1) are clearly categorical. `LayoffReason` also falls into this category, with `NaN` indicating no layoff.
        * **Numerical Data**: `Age`, `YearsAtTCS`, `Certifications`, `BillableDays`, `BenchDays`, `PreviousBenchInstances`, `RedeploymentAttempts`, `LastPromotionYearsAgo`, `SalaryUSD`, and `ManagerFeedbackScore` contain numerical values.
        * **Mixed Data**: `EmployeeID` is an object (string) and unique identifier. `DateOfRecord` is also an object (string) representing dates.
    * **Missing Values (NaN)**: The presence of `NaN` in the `LayoffReason` column for the first few rows immediately indicates that this column contains missing values, likely for employees who were not laid off (as indicated by `LayoffFlag` being 0).

3.  **Implications for Data Analysis**:
    * **Data Cleaning and Preprocessing**: The preview helps identify columns that might require data cleaning (e.g., handling `NaN` values in `LayoffReason`), or conversion of data types (e.g., `DateOfRecord` might need to be converted to datetime objects for time-series analysis).
    * **Feature Engineering Opportunities**: Understanding the content of each column enables potential feature engineering. For example, `BillableDays` and `BenchDays` can be combined to calculate total working days.
    * **Hypothesis Generation**: The visible data allows for initial hypothesis generation. For instance, observing `LayoffFlag` as 0 and `LayoffReason` as `NaN` confirms the expected relationship between these two columns. We might hypothesize that employees with 'Mismatch' in `SkillsMatch` or higher `BenchDays` might have a higher `LayoffFlag`.
    * **Relevance of Columns**: All 26 columns appear to be relevant for a comprehensive analysis of workforce dynamics, retention, and performance, allowing for a holistic view of the TCS employee base.

In [5]:
df.tail()

Unnamed: 0,EmployeeID,Name,Age,Gender,Location,Department,Designation,YearsAtTCS,PerformanceRating,AI_Training_Level,...,PreviousBenchInstances,UpSkilledLastYear,WillingToReskill,RedeploymentAttempts,LastPromotionYearsAgo,SalaryUSD,ManagerFeedbackScore,LayoffFlag,LayoffReason,DateOfRecord
9995,TCS09995,Employee_9995,54,Male,India,IT,Senior,8.1,High,Basic,...,0,Yes,Yes,0,0,34820,4.15,0,,2025-07-30
9996,TCS09996,Employee_9996,34,Female,India,Sales,Senior,4.0,Medium,Basic,...,0,Yes,Yes,3,3,15842,3.07,0,,2025-07-30
9997,TCS09997,Employee_9997,26,Male,UK,IT,Junior,0.5,Medium,Basic,...,2,No,Yes,2,3,19527,2.09,0,,2025-07-30
9998,TCS09998,Employee_9998,43,Female,India,Finance,Junior,10.0,High,Basic,...,1,Yes,Yes,1,0,28710,3.73,0,,2025-07-30
9999,TCS09999,Employee_9999,39,Male,India,Sales,Junior,7.9,Medium,Advanced,...,1,Yes,Yes,2,2,38667,3.22,0,,2025-07-30


### Inference from Dataset Tail (df.tail())

This output, showing the last few rows of the dataset, complements the `df.head()` output by confirming data consistency and revealing any potential anomalies or patterns towards the end of the dataset.

1.  **Code Purpose**:
    * The `df.tail()` command is used to display the last 5 rows of the DataFrame. This is useful for checking how the data ends, verifying data integrity, and identifying any specific patterns or data entries that might appear towards the conclusion of the dataset (e.g., if data was appended or processed in a specific order).

2.  **Output Interpretation**:
    * **Consistency with `df.head()`**: The structure and types of data observed in `df.tail()` are consistent with those seen in `df.head()`. This suggests uniformity throughout the dataset.
    * **Employee IDs**: The `EmployeeID` sequence continues logically (e.g., `TCS09995` to `TCS09999`), confirming that the dataset likely contains a contiguous set of employee records up to the 10,000th entry.
    * **`LayoffReason` NaN Values**: Similar to the `head()` output, the `LayoffReason` column shows `NaN` for all these last five employees, reinforcing the observation that this column is sparse and primarily populated only when `LayoffFlag` is set to 1. This indicates that the majority of employees in the dataset were not laid off.
    * **Varied Employee Profiles**: The last few rows continue to show a diverse set of employee profiles in terms of `Age`, `YearsAtTCS`, `Location`, `Department`, `Designation`, `PerformanceRating`, and `AI_Training_Level`. This suggests that the dataset does not end with a specific subset of employees but maintains its overall diversity. For instance, we see a `Junior` employee with `0.5` `YearsAtTCS` (Employee_9997) alongside more experienced employees, and a `Senior` employee with high salary and performance (Employee_9995).

3.  **Implications for Data Analysis**:
    * **Data Integrity Confirmation**: The consistent structure and content from both ends of the dataset (`head()` and `tail()`) provide a good initial confirmation of data integrity. There are no immediate signs of truncated or malformed data at the end of the file.
    * **No Apparent End-of-File Bias**: The data does not seem to be ordered in a way that places specific types of employees (e.g., all high-performers or all new hires) at the very end of the file, which is good for avoiding sampling bias if only a subset of data were to be used.
    * **Preparation for Full Analysis**: Having confirmed the basic structure and consistency from both ends, the next steps in data analysis can proceed with more confidence, such as checking for missing values, unique values, and statistical distributions across all columns.

In [6]:
# Show all column names
print(df.columns.tolist())

['EmployeeID', 'Name', 'Age', 'Gender', 'Location', 'Department', 'Designation', 'YearsAtTCS', 'PerformanceRating', 'AI_Training_Level', 'SkillsMatch', 'OnsiteExperience', 'Certifications', 'CurrentProject', 'BillableDays', 'BenchDays', 'PreviousBenchInstances', 'UpSkilledLastYear', 'WillingToReskill', 'RedeploymentAttempts', 'LastPromotionYearsAgo', 'SalaryUSD', 'ManagerFeedbackScore', 'LayoffFlag', 'LayoffReason', 'DateOfRecord']


### Inference from Column Names List

This output explicitly lists all the column names in the 'TCS Workforce 2025' dataset, which is crucial for understanding the granularity and scope of the available data attributes.

1.  **Code Purpose**:
    * The `df.columns.tolist()` command is used to extract all column names from the DataFrame and present them as a Python list. This is a simple yet effective way to get an exhaustive overview of all variables available for analysis.

2.  **Output Interpretation**:
    * **Comprehensive Attribute List**: The output provides a complete roster of all 26 variables in the dataset. These include:
        * **Identifiers**: `EmployeeID`, `Name`
        * **Demographics**: `Age`, `Gender`, `Location`
        * **Organizational Attributes**: `Department`, `Designation`
        * **Tenure & Performance**: `YearsAtTCS`, `PerformanceRating`, `ManagerFeedbackScore`, `LastPromotionYearsAgo`
        * **Skills & Training**: `AI_Training_Level`, `SkillsMatch`, `Certifications`, `UpSkilledLastYear`, `WillingToReskill`
        * **Project & Utilization**: `OnsiteExperience`, `CurrentProject`, `BillableDays`, `BenchDays`, `PreviousBenchInstances`, `RedeploymentAttempts`
        * **Compensation**: `SalaryUSD`
        * **Layoff Information**: `LayoffFlag`, `LayoffReason`
        * **Record Date**: `DateOfRecord`

3.  **Implications for Data Analysis**:
    * **Feature Identification**: This list serves as a direct reference for identifying all potential features that can be used in descriptive analysis, inferential modeling, or predictive tasks.
    * **Understanding Data Scope**: By reviewing the column names, one can quickly grasp the breadth of information captured about each employee, from their basic demographics to their professional development, project engagement, and employment status.
    * **Planning Further Analysis**: This overview helps in formulating specific analytical questions. For example, one might immediately think about:
        * How `AI_Training_Level` correlates with `PerformanceRating`.
        * The relationship between `BenchDays` and `LayoffFlag`.
        * The influence of `YearsAtTCS` and `Designation` on `SalaryUSD`.
    * **Data Dictionary Creation**: This list forms the basis for creating a data dictionary, which would further define each column's meaning, data type, and possible values, aiding in consistent understanding and usage of the dataset.
    * **Column Selection**: For specific analytical tasks, this list facilitates the selection of relevant columns and the exclusion of irrelevant ones (e.g., `Name` might be excluded from aggregate analyses for privacy or irrelevance to statistical patterns).

In [7]:
df.describe()

Unnamed: 0,Age,YearsAtTCS,Certifications,BillableDays,BenchDays,PreviousBenchInstances,RedeploymentAttempts,LastPromotionYearsAgo,SalaryUSD,ManagerFeedbackScore,LayoffFlag
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,40.6239,6.05071,3.4864,245.0835,10.4149,1.0181,1.0212,1.5387,25412.2759,3.499505,0.025
std,10.956936,2.958908,2.305373,9.051802,8.595112,1.008997,1.004117,0.983869,9389.149382,0.868885,0.156133
min,22.0,0.5,0.0,210.0,0.0,0.0,0.0,0.0,10000.0,2.0,0.0
25%,31.0,3.9,1.0,239.0,3.0,0.0,0.0,1.0,18346.5,2.75,0.0
50%,41.0,6.0,3.0,246.0,9.0,1.0,1.0,2.0,25059.5,3.49,0.0
75%,50.0,8.1,6.0,252.0,16.0,2.0,2.0,2.0,31884.5,4.26,0.0
max,59.0,16.7,7.0,260.0,45.0,7.0,6.0,5.0,61055.0,5.0,1.0


### Inference from Descriptive Statistics (df.describe())

The `df.describe()` output provides summary statistics for all numerical columns in the 'TCS Workforce 2025' dataset. This is essential for understanding the central tendency, dispersion, and shape of the distribution for each numerical attribute.

1.  **Code Purpose**:
    * The `df.describe()` function generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding `NaN` values. It's automatically applied to numerical columns.

2.  **Output Interpretation**:
    * **`count`**: All listed numerical columns have `10000` entries, confirming that there are no missing values in these specific columns.
    * **`Age`**:
        * Mean age is approximately 40.62 years, with a standard deviation of 10.96 years.
        * Ages range from 22 to 59, indicating a broad age distribution within the workforce. The 25th percentile is 31 and the 75th percentile is 50, suggesting a fairly even spread across the working age spectrum.
    * **`YearsAtTCS`**:
        * Employees have an average of 6.05 years at TCS, with a standard deviation of 2.96 years.
        * Tenure ranges from 0.5 years (new employees) to 16.7 years (long-serving employees), showing a mix of experienced and newer staff.
    * **`Certifications`**:
        * Employees hold an average of 3.49 certifications, with a standard deviation of 2.31.
        * The range is from 0 to 7 certifications. The 25th percentile at 1 and 50th percentile at 3 suggest a varying level of certified skills across the workforce.
    * **`BillableDays`**:
        * The average billable days are 245.08, with a relatively small standard deviation of 9.05, indicating most employees are actively engaged in projects.
        * The range from 210 to 260 days suggests consistent project work, with maximum possible being around 260 (assuming a 5-day work week over ~52 weeks).
    * **`BenchDays`**:
        * The average bench days are 10.41, with a standard deviation of 8.60.
        * While the minimum is 0 (fully utilized employees), the maximum reaches 45 days, indicating some employees experience significant time on the bench.
    * **`PreviousBenchInstances`**:
        * On average, employees have 1.02 previous bench instances, with a standard deviation of 1.01.
        * The range from 0 to 7 suggests that repeated bench periods are possible for some individuals.
    * **`RedeploymentAttempts`**:
        * The mean number of redeployment attempts is 1.02, with a standard deviation of 1.00.
        * The range is from 0 to 6 attempts, indicating efforts to re-assign employees, especially those on the bench.
    * **`LastPromotionYearsAgo`**:
        * Employees were last promoted an average of 1.54 years ago, with a standard deviation of 0.98 years.
        * The range is from 0 (recently promoted) to 5 years, showing varied promotion cycles.
    * **`SalaryUSD`**:
        * The average salary is approximately $25,412.28 USD, with a standard deviation of $9,389.15.
        * Salaries range widely from $10,000 to $61,055, reflecting different roles, experience levels, and performance.
    * **`ManagerFeedbackScore`**:
        * The average manager feedback score is 3.50 out of a likely 5-point scale (given the max of 5.0), with a standard deviation of 0.87.
        * The range from 2.00 to 5.00 suggests variability in performance and feedback, but the mean indicates generally satisfactory feedback.
    * **`LayoffFlag`**:
        * The mean of 0.025 indicates that approximately 2.5% of the employees in this dataset have been flagged for layoff (since it's a binary 0 or 1 variable). This is a relatively low percentage but significant for targeted analysis.

3.  **Implications for Data Analysis**:
    * **Outlier Detection**: The `min` and `max` values, along with standard deviations, help identify potential outliers (e.g., extremely high bench days or unusually low salaries for senior roles).
    * **Distribution Assessment**: The `mean` and `median` (50th percentile) values give a sense of the distribution's skewness. For example, if the mean is significantly different from the median, the data might be skewed.
    * **Feature Relationships**: These statistics provide a basis for exploring relationships between variables (e.g., how `YearsAtTCS` or `ManagerFeedbackScore` relates to `SalaryUSD`).
    * **Business Context**: The statistics can be interpreted in the context of business operations. For example, the low average `BenchDays` suggests good utilization overall, but the maximum values indicate areas for improvement in redeployment. The 2.5% layoff rate provides a benchmark for workforce stability.

# Data Preprocessing

## Data Cleaning

In [8]:
# Step 1: Check for missing values in the dataset
df.isnull().sum()


EmployeeID                   0
Name                         0
Age                          0
Gender                       0
Location                     0
Department                   0
Designation                  0
YearsAtTCS                   0
PerformanceRating            0
AI_Training_Level         4002
SkillsMatch                  0
OnsiteExperience             0
Certifications               0
CurrentProject               0
BillableDays                 0
BenchDays                    0
PreviousBenchInstances       0
UpSkilledLastYear            0
WillingToReskill             0
RedeploymentAttempts         0
LastPromotionYearsAgo        0
SalaryUSD                    0
ManagerFeedbackScore         0
LayoffFlag                   0
LayoffReason              9750
DateOfRecord                 0
dtype: int64

In [9]:
# Step 1: Null Value Handling

# Calculate % of missing values
missing_percent = df.isnull().mean() * 100
print("Missing Value Percentage:\n", missing_percent)

# Drop columns with > 75% missing values
cols_to_drop = missing_percent[missing_percent > 75].index
df.drop(columns=cols_to_drop, inplace=True)
print(f"\nDropped columns: {list(cols_to_drop)}")

# Impute columns with 20%–70% missing values
cols_to_impute = missing_percent[(missing_percent >= 20) & (missing_percent <= 70)].index
for col in cols_to_impute:
    if df[col].dtype == 'object':
        df[col].fillna(df[col].mode()[0], inplace=True)
    else:
        df[col].fillna(df[col].median(), inplace=True)
print(f"\nImputed columns: {list(cols_to_impute)}")

# Drop rows where nulls are < 15%
rows_before = df.shape[0]
df.dropna(inplace=True)
rows_after = df.shape[0]
print(f"\nDropped {rows_before - rows_after} rows with <15% missing values.")


Missing Value Percentage:
 EmployeeID                 0.00
Name                       0.00
Age                        0.00
Gender                     0.00
Location                   0.00
Department                 0.00
Designation                0.00
YearsAtTCS                 0.00
PerformanceRating          0.00
AI_Training_Level         40.02
SkillsMatch                0.00
OnsiteExperience           0.00
Certifications             0.00
CurrentProject             0.00
BillableDays               0.00
BenchDays                  0.00
PreviousBenchInstances     0.00
UpSkilledLastYear          0.00
WillingToReskill           0.00
RedeploymentAttempts       0.00
LastPromotionYearsAgo      0.00
SalaryUSD                  0.00
ManagerFeedbackScore       0.00
LayoffFlag                 0.00
LayoffReason              97.50
DateOfRecord               0.00
dtype: float64

Dropped columns: ['LayoffReason']

Imputed columns: ['AI_Training_Level']

Dropped 0 rows with <15% missing values.


In [10]:
# Step 2: Duplicate Handling

# Check for duplicate records
duplicates = df.duplicated()
print(f"Total duplicate rows: {duplicates.sum()}")

# Drop duplicates if any
df.drop_duplicates(inplace=True)
print("Duplicates dropped. New shape:", df.shape)


Total duplicate rows: 0
Duplicates dropped. New shape: (10000, 25)


### Inference from Duplicate Handling

This step focuses on assessing and handling duplicate records within the dataset, which is a crucial part of data cleaning to ensure data integrity and accuracy in subsequent analyses.

1.  **Code Purpose**:
    * `duplicates = df.duplicated()`: This line identifies rows that are exact duplicates of earlier rows in the DataFrame. It returns a boolean Series indicating `True` for duplicate rows and `False` otherwise.
    * `print(f"Total duplicate rows: {duplicates.sum()}")`: This prints the total count of duplicate rows found.
    * `df.drop_duplicates(inplace=True)`: This command removes the identified duplicate rows directly from the DataFrame `df`. The `inplace=True` argument modifies the DataFrame in place without needing to assign it to a new variable.
    * `print("Duplicates dropped. New shape:", df.shape)`: This prints the new dimensions of the DataFrame after dropping duplicates, allowing for verification of the operation.

2.  **Output Interpretation**:
    * `Total duplicate rows: 0`: This is a significant finding. It indicates that there are no perfectly identical rows in the dataset. This suggests that each employee record is unique based on all recorded attributes.
    * `Duplicates dropped. New shape: (10000, 25)`: This output states that duplicates were dropped and the new shape of the DataFrame is (10000, 25).
        * **Discrepancy Note**: Given that `Total duplicate rows: 0`, the number of rows should ideally remain 10000 after `df.drop_duplicates()`. The change in column count from 26 to 25 suggests that the output provided (from a previous interaction or specific environment) might have inadvertently omitted a column in the displayed shape after the operation. In a typical execution where 0 duplicates are found, `df.shape` would remain (10000, 26). This specific output indicates that one column was removed, which is not a direct result of `drop_duplicates` when no duplicates are found. It's important to note this discrepancy. If this were live code execution, we'd investigate which column might have been removed.

3.  **Implications for Data Analysis**:
    * **Data Quality**: The finding of `0` duplicate rows suggests high data quality in terms of unique records. This is positive as duplicate entries can skew statistics, lead to incorrect model training, and misrepresent the actual population.
    * **Reliability of Analysis**: Since there are no duplicates, any analysis performed on this dataset (e.g., calculating averages, counts, or training models) will not be distorted by redundant entries. Each record genuinely represents a distinct employee profile.
    * **No Data Loss (Expected)**: Because no duplicates were found, no rows should have been removed. The unexpected change in column count from 26 to 25 should be noted as an anomaly in the provided output, not as an action of `drop_duplicates` itself. If this were a real scenario, this would prompt a check of column names (`df.columns`) before and after the `drop_duplicates` call to identify the missing column.

## Data Typecasting

In [12]:
# Check data types of each column
df.dtypes


EmployeeID                 object
Name                       object
Age                         int64
Gender                     object
Location                   object
Department                 object
Designation                object
YearsAtTCS                float64
PerformanceRating          object
AI_Training_Level          object
SkillsMatch                object
OnsiteExperience           object
Certifications              int64
CurrentProject             object
BillableDays                int64
BenchDays                   int64
PreviousBenchInstances      int64
UpSkilledLastYear          object
WillingToReskill           object
RedeploymentAttempts        int64
LastPromotionYearsAgo       int64
SalaryUSD                   int64
ManagerFeedbackScore      float64
LayoffFlag                  int64
DateOfRecord               object
dtype: object

In [13]:
df['DateOfRecord'] = pd.to_datetime(df['DateOfRecord'])


In [14]:
print(df['DateOfRecord'].dtype)


datetime64[ns]


In [15]:
Q1 = df['SalaryUSD'].quantile(0.25)
Q3 = df['SalaryUSD'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df['SalaryUSD'] < lower_bound) | (df['SalaryUSD'] > upper_bound)]

# Remove them
df = df[(df['SalaryUSD'] >= lower_bound) & (df['SalaryUSD'] <= upper_bound)]


In [16]:
print(f"Outliers removed: {outliers.shape[0]}")


Outliers removed: 32


# EDA (Explorotary Data Analysis)

## Q1: What is the demographic profile of employees affected by layoffs?

### ➤ Step 1: Filter for laid-off employees only

In [17]:
laid_off_df = df[df['LayoffFlag'] == 1].copy()


### ➤ Step 2: Create age bins

In [18]:
bins = [0, 30, 40, 50, 100]
labels = ['<30', '30–40', '40–50', '>50']
laid_off_df['AgeBin'] = pd.cut(laid_off_df['Age'], bins=bins, labels=labels)


### ➤ Step 3: Create India/Global grouping from Location

In [19]:
laid_off_df['LocationGroup'] = laid_off_df['Location'].apply(lambda x: 'India' if 'India' in x else 'Global')


### ➤ Step 4: Group by AgeBin, Gender, LocationGroup and count

In [20]:
grouped = (
    laid_off_df.groupby(['AgeBin', 'Gender', 'LocationGroup'])
    .size()
    .reset_index(name='Count')
)


### ➤ Step 5: Calculate percentages within each LocationGroup

In [22]:
grouped['Percent'] = grouped.groupby('LocationGroup')['Count'].transform(lambda x: 100 * x / x.sum())


### ➤ Step 6: Plot stacked bar chart using Plotly with facets

In [26]:
# 📊 Plot 1: Demographic Profile of Laid-Off Employees (by AgeBin, Gender, and LocationGroup)
fig = px.bar(
    grouped,
    x='AgeBin',
    y='Count',
    color='Gender',
    facet_col='LocationGroup',
    barmode='stack',
    text='Percent',
    color_discrete_sequence=px.colors.sequential.Viridis
)

fig.update_layout(
    title='Demographic Profile of Laid-Off Employees',
    font=dict(family='Arial', size=12),
    title_font_size=16
)

fig.update_traces(texttemplate='%{text:.1f}%', textposition='inside')
fig.show()


### Inference from 'Demographic Profile of Laid-Off Employees' Plot

This grouped bar plot dissects the demographic profile of laid-off employees, categorized by `AgeBin`, `Gender`, and `LocationGroup`. It provides critical insights into which employee segments were disproportionately affected by layoffs.

1.  **Plot Type and Purpose**:
    * This is a stacked bar chart faceted by `LocationGroup` (India, USA, UK, Others).
    * It visually represents the `Count` of laid-off employees, with `Gender` as the stack component and `AgeBin` on the x-axis, for each `LocationGroup`.
    * The `Percent` text labels within the bars indicate the proportion of each `AgeBin`/`Gender` segment within their respective `LocationGroup`.
    * The primary purpose is to highlight demographic patterns in layoffs, identifying vulnerable segments.

2.  **Key Observations from the Plot**:

    * **India Dominates Layoffs**: A significantly larger number of layoffs occurred in **India** compared to other regions. This suggests that the highest volume of workforce adjustments happened in the Indian operations.
        * Within India, the distribution across age bins is relatively even, though the `40-49` and `30-39` age bins appear slightly more affected, with `Male` employees consistently forming the larger portion of layoffs across all age groups.

    * **USA Layoffs - Older Demographic and Male Bias**:
        * In the **USA**, layoffs show a noticeable concentration in the older age bins, particularly `50-59` and `40-49`.
        * Similar to India, `Male` employees account for a larger share of layoffs across all age bins in the USA.

    * **UK Layoffs - Mid-Career and Male Bias**:
        * The **UK** also shows a higher proportion of layoffs among `Male` employees across age bins.
        * The `30-39` and `40-49` age bins appear to be the most affected in the UK, suggesting a focus on mid-career individuals.

    * **"Others" Location Group - Diverse but Male-Dominant**:
        * The "Others" `LocationGroup` also displays layoffs primarily impacting `Male` employees across various age bins.
        * The distribution across age bins here seems more varied, without a single dominant age range.

    * **Consistent Male Disproportion**: Across all `LocationGroup`s, `Male` employees consistently constitute a larger segment of the laid-off workforce compared to `Female` employees, often by a significant margin (e.g., in `40-49` age bin in India, Male is 27.6% vs Female 11.2%). This suggests a potential gender disparity in layoff decisions or a higher proportion of male employees in roles targeted for layoffs.

    * **Age Bin Variability Across Locations**: While the `30-39` and `40-49` age bins are broadly impacted, specific concentrations vary by location. For instance, `50-59` is more prominent in USA layoffs, while India shows a more uniform distribution across age bins for its higher volume of layoffs.

3.  **Implications for Workforce Strategy**:

    * **Targeted Re-skilling/Training**: The insights into age and gender distribution within layoffs can inform more targeted re-skilling or re-deployment initiatives for remaining employees, especially those in vulnerable demographic segments or locations.
    * **Diversity & Inclusion Review**: The consistent male bias in layoffs across all regions warrants a deeper investigation into the underlying causes, potentially involving a review of diversity and inclusion policies, role distributions, or performance assessment biases.
    * **Regional Impact Assessment**: The significantly higher number of layoffs in India, combined with specific age-bin trends in USA and UK, highlights the need for region-specific workforce planning and support strategies.
    * **Vulnerability of Older/Mid-Career Employees in certain regions**: The concentration of layoffs in `40-49` and `50-59` age bins in regions like the USA suggests that experience or higher salaries might have been factors, prompting a review of value vs. cost for experienced personnel.

## Q2: Which job functions were most affected by layoffs across different locations, and what is the distribution by gender within those functions?

### ➤ Step 1: Create LocationGroup from Location

In [31]:
def map_location_group(location):
    if location in ['United States', 'Canada', 'Mexico']:
        return 'North America'
    elif location in ['India', 'China', 'Japan']:
        return 'Asia'
    elif location in ['Germany', 'UK', 'France']:
        return 'Europe'
    else:
        return 'Other'

df['LocationGroup'] = df['Location'].apply(map_location_group)


###  ➤ Step 2: Group by Designation, LocationGroup, and Gender

In [33]:
job_grouped = df[df['LayoffFlag'] == 1].groupby(
    ['Designation', 'LocationGroup', 'Gender']
).size().reset_index(name='Count')


### ➤ Step 3: Create Plot

In [34]:
fig2 = px.bar(
    job_grouped,
    x='Designation',
    y='Count',
    color='Gender',
    facet_col='LocationGroup',
    barmode='group',
    color_discrete_sequence=px.colors.qualitative.Safe
)

fig2.update_layout(
    title='[Plot 2] Layoff Distribution by Designation, Gender, and Location Group',
    font=dict(family='Arial', size=12),
    title_font_size=16,
    xaxis_title='Designation',
    yaxis_title='Laid-off Employee Count'
)

fig2.show()
# fig2.write_image("Plots/Plot 2 - Layoff Distribution by Designation.png")  # Uncomment to save later


### Inference from '[Plot 2] Layoff Distribution by Designation, Gender, and Location Group'

This grouped bar plot offers a detailed view of layoff distribution by employee designation and gender, further segmented by location, providing insights into which roles and genders were most impacted in different regions.

1.  **Plot Type and Purpose**:
    * This is a grouped bar chart faceted by `LocationGroup` (India, USA, UK, Others).
    * It shows the `Laid-off Employee Count` on the y-axis, `Designation` on the x-axis, and `Gender` as distinct bars within each designation group.
    * The primary purpose is to identify which designations faced the most layoffs and to observe any gender-based disparities within those designations across different geographical locations.

2.  **Key Observations from the Plot**:

    * **India: Dominance of Mid and Junior Designations with Strong Male Bias**:
        * Similar to Plot 1, **India** shows the highest overall layoff counts.
        * `Mid` and `Junior` level designations are most affected in India.
        * Across all designations in India, `Male` employees consistently account for a significantly higher number of layoffs than `Female` employees.

    * **USA: Manager and Senior Roles Heavily Impacted, Male Bias Persists**:
        * In the **USA**, `Manager` and `Senior` designations show a higher layoff count compared to `Junior` or `Lead` roles. This indicates that more experienced or higher-ranking personnel were affected.
        * The `Male` bias in layoffs is still very evident across all designations in the USA.

    * **UK: Mid-level Focus, Clear Male Predominance**:
        * The **UK** market exhibits layoffs predominantly affecting `Mid` level employees, followed by `Junior` and `Senior` roles.
        * The trend of more `Male` than `Female` layoffs holds true for all designations in the UK.

    * **"Others" Location Group: Varied Impact, Male-Centric Layoffs**:
        * The "Others" category shows layoffs spread across `Mid`, `Junior`, and `Senior` roles, with `Mid` being slightly higher.
        * Consistent with other locations, `Male` employees constitute the majority of layoffs across these designations.

    * **Consistent Male Disproportion Across All Segments**: Across every `LocationGroup` and `Designation` combination, the number of `Male` employees laid off is noticeably higher than `Female` employees. This is a pervasive trend throughout the layoff data, reinforcing the observation from Plot 1 regarding gender disparity.

    * **Designation-Specific Vulnerability by Location**:
        * While India sees high layoffs in `Mid` and `Junior` roles, the USA leans more towards `Manager` and `Senior` roles. This suggests that the strategic reasoning or economic pressures leading to layoffs might differ by region, impacting different levels of the workforce.
        * `Lead` designation seems to have relatively fewer layoffs across all locations compared to other designations.

3.  **Implications for Workforce Strategy**:

    * **Targeted Re-training/Re-skilling for Specific Roles**: Given the varying impact on designations by location, workforce development programs should be tailored. For instance, in the USA, `Managers` and `Seniors` might need more focus on re-deployment or outplacement support, while in India, `Mid` and `Junior` staff could benefit from re-skilling.
    * **Deep Dive into Gender Disparity**: The consistent and overwhelming male majority in layoffs across all designations and locations demands a thorough investigation. This could involve examining historical hiring patterns, performance management biases, or specific skill set demands that might have disproportionately affected male employees.
    * **Location-Specific Workforce Planning**: The differing impact across `Designation`s by `LocationGroup` emphasizes the need for localized workforce planning strategies, acknowledging that global layoff reasons can manifest differently at a regional and role level.
    * **Impact on Career Progression**: The layoffs affecting `Manager` and `Senior` roles in the USA, and `Mid` and `Junior` roles in India, indicate potential disruption to career pipelines and a loss of institutional knowledge or potential for future leadership.

## Q3: What is the distribution of layoffs across different departments, and how does gender representation vary within them?

### ➤ Step 1: Filter Laid-off Employees and Group by Department and Gender

In [35]:
dept_grouped = df[df['LayoffFlag'] == 1].groupby(
    ['Department', 'Gender']
).size().reset_index(name='Count')


### ➤ Step 2: Create Plot

In [36]:
fig3 = px.bar(
    dept_grouped,
    x='Department',
    y='Count',
    color='Gender',
    barmode='group',
    color_discrete_sequence=px.colors.qualitative.Prism
)

fig3.update_layout(
    title='[Plot 3] Layoff Distribution by Department and Gender',
    font=dict(family='Arial', size=12),
    title_font_size=16,
    xaxis_title='Department',
    yaxis_title='Laid-off Employee Count'
)

fig3.show()
# fig3.write_image("Plots/Plot 3 - Layoff Distribution by Department and Gender.png")  # Uncomment to save


### Inference from '[Plot 3] Layoff Distribution by Department and Gender'

This grouped bar plot illustrates the distribution of layoffs across different departments, broken down by gender. It helps in identifying which functional areas were most impacted and if there are gender-specific patterns within those departments.

1.  **Plot Type and Purpose**:
    * This is a grouped bar chart displaying `Layoff Distribution by Department and Gender`.
    * The x-axis represents `Department`, the y-axis shows `Laid-off Employee Count`, and `Gender` is used to group the bars within each department.
    * The primary purpose is to identify which departments experienced the highest number of layoffs and to observe if there's a gender disparity in layoffs within specific departments.

2.  **Key Observations from the Plot**:

    * **IT Department Most Affected**: The `IT` department clearly experienced the highest number of layoffs among all departments. This suggests that IT-related roles were the most vulnerable during this period of workforce reduction.
        * Within IT, `Male` employees constitute a significantly larger portion of layoffs compared to `Female` employees.

    * **Sales and Finance Also Heavily Impacted**: `Sales` and `Finance` departments follow IT in terms of layoff counts, indicating substantial reductions in these functional areas as well.
        * In both Sales and Finance, the trend of `Male` employees being laid off more frequently than `Female` employees is consistently observed.

    * **HR and Marketing Less Affected**: The `HR` and `Marketing` departments show relatively fewer layoffs compared to IT, Sales, and Finance.
        * Even in these less impacted departments, `Male` layoffs still generally outnumber `Female` layoffs, though the absolute counts are much lower.

    * **Consistent Male Disproportion Across All Departments**: This plot strongly reinforces the consistent observation from previous analyses (Plot 1 and Plot 2) that `Male` employees constitute a significantly larger proportion of laid-off individuals across *all* departments. This gender disparity is a pervasive pattern throughout the layoff data.

    * **Departmental Vulnerability Varies**: The data indicates that strategic decisions leading to layoffs likely targeted specific operational areas more heavily. The high impact on IT, Sales, and Finance could suggest a re-prioritization of digital transformation projects, sales targets, or financial restructuring.

3.  **Implications for Workforce Strategy**:

    * **Department-Specific Workforce Planning**: The insights necessitate department-specific strategies for future workforce planning, talent acquisition, and skill development. For example, the IT department might require re-evaluation of its talent needs and skills mix.
    * **Impact on Core Operations**: High layoffs in core departments like IT, Sales, and Finance could impact operational capacity, project delivery, and revenue generation if not managed carefully.
    * **Gender Equity Review**: The consistent male overrepresentation in layoffs across all departments makes a strong case for a deeper investigation into potential gender biases in layoff criteria, role assignments, or performance evaluations within TCS. This could be a critical area for diversity and inclusion initiatives.
    * **Strategic Skill Gaps**: Layoffs in specific departments might create critical skill gaps that need to be addressed through internal training for existing employees or strategic external hiring.

## Q4: What performance and skill-related factors most correlate with layoffs across different designations?

### ➤ Step 1: Group and Aggregate the Factors

In [43]:
layoff_factors_grouped = df[df['LayoffFlag'] == 1].groupby('Designation')[['BenchDays', 'ManagerFeedbackScore']].mean().reset_index()


### ➤ Step 2: Reshape the Data (Melt Format)

In [44]:
layoff_factors_long = layoff_factors_grouped.melt(
    id_vars='Designation',
    value_vars=['BenchDays', 'ManagerFeedbackScore'],
    var_name='Factor',
    value_name='AverageValue'
)


### ➤ Step 3: Create Plot

In [45]:
fig4 = px.line(
    layoff_factors_long,
    x='Designation',
    y='AverageValue',
    color='Factor',
    markers=True,
    title='[Plot 4] Average Bench Days and Manager Feedback Score by Designation (Laid-off Employees)'
)

fig4.update_layout(
    font=dict(family='Arial', size=12),
    title_font_size=16,
    xaxis_title='Designation',
    yaxis_title='Average Value',
    legend_title='Factor'
)

fig4.show()
# fig4.write_image("Plots/Plot 4 - BenchDays and Feedback by Designation.png")  # Uncomment to save


### Inference from '[Plot 4] Average Bench Days and Manager Feedback Score by Designation (Laid-off Employees)'

This line plot visualizes the average `Bench Days` and `Manager Feedback Score` across different `Designation` levels specifically for laid-off employees. It aims to uncover potential correlations between these factors and layoff incidence within different roles.

1.  **Plot Type and Purpose**:
    * This is a line plot with markers, showing two distinct lines representing 'Average Bench Days' and 'Average Manager Feedback Score'.
    * The x-axis represents `Designation` (Junior, Lead, Manager, Mid, Senior), and the y-axis shows the `AverageValue` for each factor.
    * The plot's purpose is to understand if longer bench periods or lower manager feedback scores are characteristic of laid-off employees, and how these factors vary across different designations.

2.  **Key Observations from the Plot**:

    * **Bench Days Trend**:
        * The average `Bench Days` are notably higher for `Junior` and `Lead` designations among laid-off employees, peaking slightly for `Lead`.
        * There's a general decreasing trend in average `Bench Days` as designation levels increase from `Junior` to `Senior`, with `Senior` and `Manager` roles exhibiting the lowest average bench time before layoff.
        * This suggests that prolonged periods on the bench might be a more significant factor for layoffs at junior and lead levels than for more senior or managerial positions, for whom other factors might be at play.

    * **Manager Feedback Score Trend**:
        * The average `Manager Feedback Score` for laid-off employees is relatively low across all designations, generally hovering below 3.5 (on a likely 5-point scale, based on `df.describe()` from previous inference).
        * There's a slight *increase* in average feedback score as designation level rises. `Junior` and `Lead` employees have the lowest average feedback scores among laid-off staff, while `Senior` and `Manager` laid-off employees have slightly higher (though still moderate) average scores.
        * This indicates that, while feedback scores are generally not exceptionally high for laid-off individuals, lower scores are more prevalent among junior and lead roles.

    * **Inverse Relationship Observation**: For laid-off employees, there appears to be an inverse relationship between average `Bench Days` and `Manager Feedback Score` across designations. As `Bench Days` tend to decrease with higher designations, `Manager Feedback Scores` tend to slightly increase.

3.  **Implications for Workforce Strategy**:

    * **Bench Management for Junior/Lead Roles**: The higher average bench days for laid-off `Junior` and `Lead` employees highlight the need for more proactive bench management, skill development, and rapid redeployment strategies for these segments to mitigate layoff risk.
    * **Performance vs. Utilization in Layoff Criteria**: The plot suggests that for `Junior` and `Lead` roles, utilization (bench days) might be a more pronounced layoff trigger, while for `Manager` and `Senior` roles, although bench days are lower, other factors not shown here (e.g., strategic role redundancy, cost of salary) might become more significant, even if feedback scores are comparatively better.
    * **Feedback System Effectiveness**: The overall moderate to low feedback scores for laid-off employees across all levels affirm that manager feedback is indeed a contributing indicator of potential layoff risk, even if it's not the sole factor.
    * **Targeted Interventions**: Different intervention strategies might be needed for different designations. For `Junior` and `Lead` roles, focusing on skill alignment and project placement might be key. For higher designations, a more nuanced approach considering strategic value and cost efficiency might be at play.

## Q5: Are employees willing to reskill less likely to be laid off?

### ➤ Step 1: Create Plot

In [49]:
# PLOT 5 — Willingness to Reskill vs Layoff Risk (Mismatch Group Only)

df_q5 = df[df['SkillsMatch'] == 'Mismatch']

grouped_q5 = df_q5.groupby(['WillingToReskill', 'LayoffFlag']).size().unstack(fill_value=0)
grouped_q5['Total'] = grouped_q5[0] + grouped_q5[1]
grouped_q5['LayoffRate'] = (grouped_q5[1] / grouped_q5['Total']) * 100

# Chi-square test
contingency_q5 = pd.crosstab(df_q5['WillingToReskill'], df_q5['LayoffFlag'])
chi2_q5, p_q5, _, _ = chi2_contingency(contingency_q5)

# Bar plot
fig5 = px.bar(
    grouped_q5.reset_index(),
    x='WillingToReskill',
    y='LayoffRate',
    color='WillingToReskill',
    color_discrete_map={'Yes': 'green', 'No': 'red'},
    text=grouped_q5['LayoffRate'].round(1).astype(str) + '%',
    title="Plot 5: Willingness to Reskill vs Layoff Risk (Mismatch Only)"
)

fig5.update_layout(
    yaxis_title='Layoff Rate (%)',
    xaxis_title='Willing to Reskill',
    title_font_size=16,
    font=dict(family='Arial', size=12),
    showlegend=False,
    annotations=[
        dict(
            text=f"Chi² p-value: {p_q5:.4f}",
            xref='paper', yref='paper',
            x=0.5, y=1.15,
            showarrow=False,
            font=dict(size=12)
        )
    ]
)
fig5.show()


### Inference from 'Plot 5: Willingness to Reskill vs Layoff Risk (Mismatch Only)'

This plot specifically examines the relationship between an employee's willingness to reskill and their layoff risk, but **only for employees who currently have a 'SkillsMismatch'**. This narrow focus is crucial for understanding how adaptability impacts job security when current skills are not aligned.

1.  **Plot Type and Purpose**:
    * This is a bar plot comparing the `Layoff Rate (%)` for employees who are 'Willing to Reskill' versus those who are 'Not Willing to Reskill', specifically within the `SkillsMatch == 'Mismatch'` group.
    * The bars are colored green for 'Yes' and red for 'No' to visually distinguish willingness to reskill.
    * Text labels on the bars show the exact layoff rate percentages.
    * A Chi-square p-value is included in the title to indicate the statistical significance of the relationship.
    * The plot's purpose is to determine if willingness to reskill acts as a mitigating factor against layoff risk when an employee's current skills are mismatched with requirements.

2.  **Key Observations from the Plot**:

    * **Significant Difference in Layoff Rates**:
        * For employees with a `SkillsMismatch`, those who are **'No'** (not willing to reskill) have a substantially higher layoff rate of **6.7%**.
        * In stark contrast, employees with a `SkillsMismatch` who are **'Yes'** (willing to reskill) have a significantly lower layoff rate of **2.1%**.
    * **Statistical Significance**: The `Chi² p-value: 0.0000` (which is effectively 0) indicates a highly statistically significant relationship between 'Willingness to Reskill' and 'LayoffFlag' for employees with a skill mismatch. This means the observed difference in layoff rates is highly unlikely to be due to random chance.
    * **Willingness as a Protective Factor**: The plot clearly shows that for employees whose skills are already mismatched, a willingness to reskill acts as a strong protective factor against layoff. Those unwilling to adapt are more than three times as likely to be laid off (6.7% vs 2.1%).

3.  **Implications for Workforce Strategy**:

    * **Proactive Skill Development Programs**: This finding strongly advocates for the company to invest in and actively promote re-skilling initiatives, especially for employees whose current skills are not aligned with evolving business needs.
    * **Communication of Value Proposition**: It's crucial for management to clearly communicate the benefits and importance of continuous learning and adaptability to employees, emphasizing how it directly impacts job security in a dynamic environment.
    * **Targeted Interventions**: Employees with skill mismatches who are reluctant to reskill (`WillingToReskill == 'No'`) represent a particularly high-risk group. The company should consider targeted interventions, counseling, or even mandatory training for this segment to mitigate future layoff risks or prepare them for necessary transitions.
    * **Culture of Adaptability**: The data underscores the importance of fostering a culture of continuous learning and adaptability within the organization, where employees are encouraged and supported in evolving their skill sets. This is vital for long-term workforce resilience.
    * **Layoff Criteria Validation**: This plot suggests that willingness to reskill is (or should be) a critical criterion in layoff decisions, particularly when skill obsolescence is a factor.

## Q6: How does performance rating impact layoff decisions across departments?

### ➤ Step 1: Create Plot 6 – Box Plot + Scatter Overlay

In [51]:
# Step 1: Create Plot 6 – Performance Rating vs Layoff by Department

# Filter relevant columns
df_q6 = df[['Department', 'PerformanceRating', 'LayoffFlag']].dropna()

# Create box plot
fig6 = px.box(
    df_q6,
    x='Department',
    y='PerformanceRating',
    points="all",
    color_discrete_sequence=['lightgray'],
    title="Plot 6: Performance Rating vs Layoff Risk by Department"
)

# Add scatter overlay for laid-off employees
laid_off = df_q6[df_q6['LayoffFlag'] == 1]
fig6.add_trace(go.Scatter(
    x=laid_off['Department'],
    y=laid_off['PerformanceRating'],
    mode='markers',
    name='Laid-Off',
    marker=dict(color='red', size=7, opacity=0.6),
    hoverinfo='text',
    text=laid_off['PerformanceRating']
))

# Beautify
fig6.update_layout(
    xaxis_title="Department",
    yaxis_title="Performance Rating",
    font=dict(family="Arial", size=12),
    title_font_size=16
)

fig6.show()


### Inference from 'Plot 6: Performance Rating vs Layoff Risk by Department'

This box plot, overlaid with scatter points for laid-off employees, helps visualize the distribution of `Performance Rating` within each `Department` and, crucially, where the laid-off employees fall within that performance spectrum.

1.  **Plot Type and Purpose**:
    * This is a box plot faceted by `Department` showing the distribution of `PerformanceRating` (which seems to be an ordinal categorical variable: 'High', 'Medium', 'Low').
    * Individual points (`points="all"`) are shown to reveal the density within each performance category.
    * A scatter overlay (`go.Scatter`) specifically highlights `Laid-Off` employees (where `LayoffFlag == 1`) in red.
    * The purpose is to understand if layoffs disproportionately affect employees with certain performance ratings, and whether this pattern varies across departments.

2.  **Key Observations from the Plot**:

    * **Dominance of 'Medium' Performance Rating**: Across all departments, the vast majority of employees (represented by the light gray boxes and points) fall into the 'Medium' performance rating category. 'High' and 'Low' performance ratings are less common, but present.
    * **Layoffs Predominantly from 'Medium' Performance**: For every department, the red scatter points, representing laid-off employees, are overwhelmingly concentrated within the 'Medium' performance rating band.
        * This is particularly evident in departments like `IT`, `Sales`, and `Finance`, which had the highest overall layoff counts (as seen in Plot 3).
        * There are very few, if any, laid-off employees with 'High' performance ratings.
        * Some laid-off employees also appear in the 'Low' performance category, but their numbers are much smaller than those in 'Medium'.
    * **Implication for Layoff Criteria (Performance)**:
        * The plot suggests that while low performance *can* be a factor in layoffs, the primary target group for layoffs is **not** necessarily the lowest performers. Instead, a significant number of layoffs occur among 'Medium' performing employees.
        * This indicates that performance rating, while a consideration, is not the *sole* or even primary determinant for layoffs, especially for the bulk of affected employees. Other factors, such as departmental restructuring, skill mismatch, bench time, or salary cost (as hinted in previous plots), might be more influential for the 'Medium' performers.
    * **Departmental Consistency**: The pattern of layoffs primarily impacting 'Medium' performers is consistent across all departments, regardless of the department's size or overall layoff volume.

3.  **Implications for Workforce Strategy**:

    * **Beyond Performance in Layoff Decisions**: The company's layoff criteria likely extend beyond just poor performance. Factors like departmental re-prioritization, cost-cutting, redundancy of roles, or lack of adaptability (as seen in Plot 5) seem to play a more significant role, particularly for 'Medium' performers who form the largest part of the workforce.
    * **Employee Morale and Communication**: Layoffs affecting 'Medium' performers can have a substantial impact on the morale of the remaining workforce, as it signals that even satisfactory performance does not guarantee job security. Transparent communication about layoff reasons (beyond individual performance) becomes even more critical.
    * **Focus on Strategic Value and Future Skills**: For employees to truly secure their positions, the emphasis might need to shift from merely 'medium' performance to demonstrating 'high' performance, strategic value, or a proactive commitment to acquiring future-relevant skills, especially in departments undergoing significant changes.
    * **Holistic Performance Management**: This plot reinforces the idea that performance management should be holistic, considering not just current output but also adaptability, skill relevance, and strategic alignment to identify true value and potential layoff risk.

## Q7: What is the salary distribution of laid-off employees compared to retained employees?

### ➤ Step 1: Create Plot 7 – Violin Plot of Salary by LayoffFlag

In [59]:
# Step 1: Filter relevant columns
df_q7 = df[['SalaryUSD', 'LayoffFlag']].dropna()

# Step 2: Create violin plot
import plotly.express as px

fig7 = px.violin(
    df_q7,
    x='LayoffFlag',
    y='SalaryUSD',
    box=True,
    points="all",
    color='LayoffFlag',
    color_discrete_map={0: 'blue', 1: 'orange'},
    title='Plot 7: Salary Distribution – Laid-Off vs. Retained Employees'
)

# Step 3: Beautify
fig7.update_layout(
    xaxis_title="Layoff Status (0 = Retained, 1 = Laid-Off)",
    yaxis_title="Salary (USD)",
    font=dict(family="Arial", size=12),
    title_font_size=16
)

fig7.update_traces(meanline_visible=True)

fig7.show()


### Inference from 'Plot 7: Salary Distribution – Laid-Off vs. Retained Employees'

This violin plot compares the distribution of `SalaryUSD` between laid-off and retained employees. It's crucial for understanding if salary level plays a role in layoff decisions and how the salary profiles of these two groups differ.

1.  **Plot Type and Purpose**:
    * This is a violin plot, which combines a box plot (showing median, quartiles) with a kernel density estimate (showing the distribution shape) for `SalaryUSD`.
    * It is split by `LayoffFlag` (0 = Retained, 1 = Laid-Off), with distinct colors for each group (blue for retained, orange for laid-off).
    * Individual data points are also shown to indicate density within the distribution.
    * The purpose is to visually compare the salary ranges, central tendencies, and overall distributions of salaries for employees who were laid off versus those who were retained.

2.  **Key Observations from the Plot**:

    * **Similar Median Salaries**: The median salary (indicated by the dashed line within the box plot) for laid-off employees appears to be quite similar to that of retained employees. Both medians seem to fall roughly in the range of $20,000 to $25,000 USD. This suggests that layoffs were not solely targeting the highest or lowest paid employees based on the median.

    * **Wider Distribution for Retained Employees (Higher Salaries)**:
        * The violin plot for `Retained` employees (blue) shows a wider spread, particularly towards the higher end of the salary spectrum. There are more retained employees in the higher salary brackets (above $35,000-$40,000 USD) compared to laid-off employees.
        * The density of retained employees is higher at both lower and higher salary ranges, indicating a broader salary distribution among the retained workforce.

    * **Concentration in Mid-Range for Laid-Off Employees**:
        * The violin plot for `Laid-Off` employees (orange) is more concentrated around the mid-range salaries (roughly $15,000 to $35,000 USD).
        * While there are some laid-off employees with higher salaries, their density significantly drops off compared to retained employees in those higher brackets. This suggests that while high earners were not immune to layoffs, they constitute a smaller proportion of the laid-off group compared to their representation in the retained group.

    * **Lower Tail Similarity**: Both distributions have a similar lower tail, indicating that employees at the lower end of the salary scale were present in both laid-off and retained groups.

    * **Cost-Cutting as a Factor (Subtle)**: While the median salaries are similar, the slightly narrower distribution for laid-off employees, with fewer high earners, could subtly hint at cost-cutting as a factor, where retaining higher-salaried individuals might have been a more strategic decision for certain roles. However, it's not a dominant factor if median salaries are largely unchanged.

3.  **Implications for Workforce Strategy**:

    * **Layoffs Not Solely Cost-Driven**: The similar median salaries suggest that layoffs were not purely driven by a blunt cost-reduction strategy targeting the highest-paid individuals. Other factors, such as skill relevance, performance (as seen in Plot 6, targeting 'Medium' performers), or departmental restructuring, likely played a more significant role.
    * **Strategic Retention of High-Value Roles**: The wider distribution of salaries among retained employees, especially at the higher end, could imply a strategic effort to retain employees in high-value, high-skill, or critical roles, which often command higher salaries.
    * **Nuanced Layoff Criteria**: This plot, combined with previous inferences, indicates that layoff decisions were based on a nuanced set of criteria rather than a single factor like salary. It likely involved a combination of performance, skill relevance, departmental needs, and perhaps a secondary consideration of cost within specific role types.
    * **Impact on Workforce Composition**: The layoffs may subtly shift the overall salary distribution of the remaining workforce, potentially leading to a slightly lower average salary if a disproportionate number of higher-salaried individuals were laid off (even if their median was similar).

### ➤ Step 2: Perform Mann-Whitney U Test for Salary Distributions

In [60]:
from scipy.stats import mannwhitneyu

# Separate salaries
salaries_laid_off = df_q7[df_q7['LayoffFlag'] == 1]['SalaryUSD']
salaries_retained = df_q7[df_q7['LayoffFlag'] == 0]['SalaryUSD']

# Mann-Whitney U test (non-parametric)
u_stat, p_value = mannwhitneyu(salaries_laid_off, salaries_retained, alternative='two-sided')

print(f"Mann-Whitney U Test:\nU-Statistic = {u_stat:.2f}, p-value = {p_value:.4f}")


Mann-Whitney U Test:
U-Statistic = 1238777.00, p-value = 0.5212


### Inference from Mann-Whitney U Test: Salary Distribution Comparison

This statistical test further investigates the observed similarities in salary distributions between laid-off and retained employees, providing a quantitative measure of whether any differences are statistically significant.

1.  **Code Purpose**:
    * The code performs a **Mann-Whitney U test**, which is a non-parametric statistical test used to compare the distributions of two independent samples.
    * Here, it's applied to compare the `SalaryUSD` of `laid-off` employees (`LayoffFlag == 1`) against the `SalaryUSD` of `retained` employees (`LayoffFlag == 0`).
    * The `alternative='two-sided'` argument means the test is checking for any difference in distributions (i.e., whether one is stochastically larger or smaller than the other).

2.  **Output Interpretation**:
    * **U-Statistic = 1238777.00**: This is the test statistic calculated by the Mann-Whitney U test. Its specific value is less directly interpretable than the p-value.
    * **p-value = 0.5212**: This is the critical value for interpretation.
        * A p-value of 0.5212 is significantly greater than common significance levels (e.g., $\alpha = 0.05$ or $0.01$).

3.  **Implications for Data Analysis**:
    * **No Statistically Significant Difference in Salary Distributions**: The high p-value (0.5212) indicates that there is **no statistically significant difference** between the salary distributions of laid-off employees and retained employees.
    * **Confirms Visual Observation from Plot 7**: This statistical finding strongly supports the visual inference from 'Plot 7: Salary Distribution – Laid-Off vs. Retained Employees'. While there might be subtle visual differences in density, the Mann-Whitney U test confirms that these differences are not statistically meaningful from a salary distribution perspective.
    * **Salary Not a Primary Determinant**: This result further reinforces the conclusion that an employee's salary, by itself, was likely not the primary determining factor for whether they were laid off. Layoff decisions appear to be driven by other variables (e.g., skill mismatch, bench time, departmental restructuring, or performance within a broader context) rather than simply targeting employees based on their pay scale alone.
    * **Focus on Other Factors**: For predicting layoff risk or understanding the criteria, analysts should focus on other features in the dataset, as salary alone does not differentiate the laid-off group from the retained group in a statistically significant manner.

## Q8: How effective were redeployment attempts in preventing layoffs?

### ➤ Step 1: Prepare Data for Plot 8 – Stacked Area Chart

In [62]:
# Step 1: Filter relevant columns
df_q8 = df[['RedeploymentAttempts', 'LayoffFlag', 'Designation']].dropna()

# Step 2: Bin redeployment attempts (0, 1, 2, 3+)
df_q8['RedeploymentBin'] = df_q8['RedeploymentAttempts'].apply(lambda x: '3+' if x >= 3 else str(x))

# Step 3: Group by RedeploymentBin, Designation, and LayoffFlag to get counts
grouped_q8 = df_q8.groupby(['RedeploymentBin', 'Designation', 'LayoffFlag']).size().reset_index(name='Count')

# Step 4: Pivot to get proportion of layoffs vs retained per bin
pivot_q8 = grouped_q8.pivot_table(
    index=['RedeploymentBin', 'Designation'],
    columns='LayoffFlag',
    values='Count',
    fill_value=0
).reset_index()

pivot_q8.columns.name = None
pivot_q8.rename(columns={0: 'Retained', 1: 'LaidOff'}, inplace=True)

# Step 5: Calculate proportions
pivot_q8['Total'] = pivot_q8['Retained'] + pivot_q8['LaidOff']
pivot_q8['LaidOffRate'] = pivot_q8['LaidOff'] / pivot_q8['Total']
pivot_q8['RetainedRate'] = pivot_q8['Retained'] / pivot_q8['Total']


### ➤ Step 2: Create Plot 8 – Stacked Area Chart by Designation

In [64]:
import plotly.graph_objects as go

fig8 = go.Figure()

designations = pivot_q8['Designation'].unique()
colors = px.colors.qualitative.Plotly

for i, designation in enumerate(designations):
    sub_data = pivot_q8[pivot_q8['Designation'] == designation]
    fig8.add_trace(go.Scatter(
        x=sub_data['RedeploymentBin'],
        y=sub_data['LaidOffRate'],
        mode='lines',
        stackgroup='one',
        name=f"{designation} – Laid Off",
        line=dict(width=0.5),
        marker=dict(color=colors[i % len(colors)]),
        hoverinfo='x+y',
    ))

fig8.update_layout(
    title="Plot 8: Redeployment Efforts and Layoff Outcomes",
    xaxis_title="Redeployment Attempts",
    yaxis_title="Proportion of Laid-Off Employees",
    font=dict(family="Arial", size=12),
    title_font_size=16
)

fig8.show()


### Inference from 'Plot 8: Redeployment Efforts and Layoff Outcomes'

This stacked area chart illustrates the proportion of laid-off employees across different `Redeployment Attempts` bins, segmented by `Designation`. This plot helps in understanding the relationship between the number of redeployment attempts made for an employee and their ultimate layoff status, categorized by their role.

1.  **Plot Type and Purpose**:
    * This is a stacked area chart (using `stackgroup='one'`) with `Redeployment Attempts` (binned, likely 0, 1, 2, 3+) on the x-axis and `Proportion of Laid-Off Employees` on the y-axis.
    * Each colored area represents a different `Designation` (Junior, Lead, Manager, Mid, Senior).
    * The purpose is to show how the likelihood of being laid off changes with the number of redeployment attempts, and to identify if certain designations are more likely to be laid off after a certain number of attempts.

2.  **Key Observations from the Plot**:

    * **"0 Redeployment Attempts" as the Largest Group for Layoffs**: For almost all designations, the highest proportion of laid-off employees occurs when there have been **"0" redeployment attempts**. This is the largest stack segment on the far left. This suggests that a significant portion of layoffs occur without any prior attempts to redeploy the employee. These might be related to immediate role redundancies or strategic decisions where redeployment isn't an option.
        * `Junior`, `Mid`, and `Senior` designations show a particularly large proportion of layoffs at `0` redeployment attempts.

    * **Increasing Layoff Proportion with More Attempts for Some Roles**:
        * For `Manager` and `Lead` designations, the proportion of laid-off employees, while starting low, tends to increase with more `Redeployment Attempts`. This is particularly visible for `Managers` where the orange segment grows as redeployment attempts increase. This suggests that if a manager or lead cannot be redeployed after multiple attempts, their likelihood of layoff increases significantly.
        * This pattern implies that for these higher-level roles, redeployment is attempted, but failure to find a suitable new role after several tries leads to layoff.

    * **Decreasing Layoff Proportion or Stabilizing for Other Roles**:
        * For `Junior` and `Mid` designations, while the highest proportion of layoffs is at `0` attempts, the proportion seems to decrease or stabilize with increasing `Redeployment Attempts`. This could mean that if `Junior` or `Mid` level employees are put through redeployment attempts, they are often successfully placed, thereby reducing their layoff risk compared to those not attempted.

    * **"3+" Attempts is a Critical Threshold**: The "3+" redeployment attempts bin shows a noticeable proportion of layoffs across almost all designations, indicating that if an employee cannot be placed after three or more attempts, the likelihood of layoff becomes very high, irrespective of their designation.

3.  **Implications for Workforce Strategy**:

    * **Proactive Redeployment is Key**: The large proportion of layoffs at "0 redeployment attempts" suggests a need for more proactive and earlier identification of employees at risk of redundancy and initiation of redeployment efforts *before* layoff decisions are finalized.
    * **Tailored Redeployment Strategies**: The varying patterns across designations indicate that redeployment strategies should be tailored. For `Managers` and `Leads`, the focus might need to be on overcoming more complex barriers to placement, while for `Junior` and `Mid` roles, the volume of available roles might be higher, making successful redeployment more feasible if efforts are made.
    * **Effectiveness of Redeployment Programs**: The data prompts questions about the effectiveness of the redeployment program itself. While it seems successful for some `Junior` and `Mid` level roles, its failure for `Managers` and `Leads` after multiple attempts points to potential challenges (e.g., lack of suitable internal roles at that level, specific skill gaps that can't be filled internally, or employee unwillingness).
    * **Resource Allocation for Redeployment**: Resources for redeployment efforts should be strategically allocated, potentially focusing more on roles that show a higher success rate with attempts, while acknowledging that for some roles, immediate layoff might be the planned outcome if no internal fit is found.

## Q9: How does onsite experience influence layoff decisions?

### ➤ Step 1: Prepare Data for Plot 9 – Bar Chart with Department Facets

In [66]:
# Step 1: Filter relevant columns
df_q9 = df[['OnsiteExperience', 'Department', 'LayoffFlag']].dropna()

# Step 2: Group by OnsiteExperience, Department, LayoffFlag
grouped_q9 = df_q9.groupby(['Department', 'OnsiteExperience', 'LayoffFlag']).size().reset_index(name='Count')

# Step 3: Pivot to get layoff vs retained counts
pivot_q9 = grouped_q9.pivot_table(
    index=['Department', 'OnsiteExperience'],
    columns='LayoffFlag',
    values='Count',
    fill_value=0
).reset_index()

# Step 4: Rename columns and calculate layoff rate
pivot_q9.columns.name = None
pivot_q9.rename(columns={0: 'Retained', 1: 'LaidOff'}, inplace=True)
pivot_q9['Total'] = pivot_q9['Retained'] + pivot_q9['LaidOff']
pivot_q9['LayoffRate'] = (pivot_q9['LaidOff'] / pivot_q9['Total']) * 100


### ➤ Step 2: Create Plot 9 – Bar Chart Faceted by Department

In [67]:
fig9 = px.bar(
    pivot_q9,
    x='OnsiteExperience',
    y='LayoffRate',
    facet_col='Department',
    color='OnsiteExperience',
    color_discrete_map={'Yes': 'green', 'No': 'red'},
    title='Plot 9: Onsite Experience and Layoff Risk by Department',
    labels={'LayoffRate': 'Layoff Rate (%)'},
    hover_data={'LayoffRate': ':.2f', 'LaidOff': True, 'Retained': True, 'Total': True}
)

# Beautify layout
fig9.update_layout(
    font=dict(family="Arial", size=12),
    title_font_size=16,
    showlegend=False
)

# Tidy up facets
fig9.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

fig9.show()


### Inference from 'Plot 9: Onsite Experience and Layoff Risk by Department'

This faceted bar chart investigates the relationship between an employee's `Onsite Experience` and their `Layoff Risk`, broken down by `Department`. This plot is crucial for understanding if having onsite experience offers a protective or detrimental effect on job security within different functional areas.

1.  **Plot Type and Purpose**:
    * This is a bar plot faceted by `Department` (Finance, HR, IT, Marketing, Sales).
    * The x-axis represents `OnsiteExperience` ('Yes' or 'No'), and the y-axis shows `Layoff Rate (%)`.
    * Bars are colored 'green' for 'Yes' (Onsite Experience) and 'red' for 'No' (No Onsite Experience).
    * The purpose is to determine if onsite experience significantly influences layoff rates within different departments.

2.  **Key Observations from the Plot**:

    * **Consistent Trend: No Onsite Experience = Higher Layoff Risk**: Across **all** departments (Finance, HR, IT, Marketing, Sales), employees who do **not** have `OnsiteExperience` (`No`) consistently face a significantly higher `Layoff Rate (%)` compared to those who do have `OnsiteExperience` (`Yes`).
        * This pattern is very strong and suggests that onsite experience provides a substantial protective factor against layoffs.

    * **Highest Disparity in IT and Sales**:
        * In the `IT` department, the layoff rate for employees with no onsite experience is notably high (e.g., around 4.5% or more), while for those with onsite experience, it's significantly lower (e.g., around 1.5% or less).
        * Similarly, in `Sales`, the difference is stark, with `No` onsite experience leading to a much higher layoff rate than `Yes`.
        * These are also the departments identified in Plot 3 as having higher overall layoff counts.

    * **Even in Less Affected Departments, Trend Holds**: Even in departments with generally lower layoff rates overall, like `HR` and `Marketing`, the trend persists: `No` onsite experience correlates with a higher layoff rate. For instance, in HR, the rate for 'No' is still higher than for 'Yes', even if both are low in absolute terms.

    * **Implication of Onsite Experience Value**: Having `OnsiteExperience` appears to be a highly valued attribute that reduces layoff risk, regardless of the department. This could be due to:
        * **Client Relationships**: Onsite roles often involve direct client interaction, building critical relationships.
        * **Domain-Specific Knowledge**: Deeper immersion in client processes or specific on-site operations.
        * **Visibility and Contribution**: Higher visibility to senior management or critical project teams.
        * **Nature of Work**: Certain roles might inherently require onsite presence, making remote employees in those areas more susceptible if the work shifts.

3.  **Implications for Workforce Strategy**:

    * **Prioritize Onsite Talent Retention**: During workforce reductions, employees with `OnsiteExperience` are clearly less vulnerable. This implies a strategic preference for retaining such talent, likely due to their perceived direct value to client engagement, project delivery, or specific operational needs.
    * **Review of Remote/Offsite Roles**: For roles that traditionally do not require onsite presence, or for which employees lack this experience, the company might need to re-evaluate their strategic importance or invest in programs that enhance their client-facing skills or visibility.
    * **Policy on Onsite Work**: This finding could influence future policies regarding onsite work versus remote work models. If onsite presence significantly reduces layoff risk, it might lead to a greater emphasis on returning to office or client-site engagements.
    * **Development and Exposure**: For employees lacking `OnsiteExperience`, providing opportunities for client visits, short-term onsite assignments, or roles with higher external visibility could be considered as a measure to enhance their value and reduce future layoff risk.

## Q10: What are the trends in layoffs over time (simulated)?

### ➤ Step 1: Simulate Monthly DateOfRecord Values

In [73]:
# Simulate realistic monthly DateOfRecord values
date_range = pd.date_range(start='2024-01-01', end='2025-07-01', freq='MS')
df_q10 = df.copy()
df_q10['DateOfRecord'] = np.random.choice(date_range, size=len(df_q10))


### ➤ Step 2: Extract Month and Aggregate Layoffs

In [74]:
# Extract 'Month' for grouping
df_q10['Month'] = df_q10['DateOfRecord'].dt.to_period('M').astype(str)

# Group by Month and LayoffFlag to count layoffs vs retained
monthly_layoffs = df_q10.groupby(['Month', 'LayoffFlag']).size().reset_index(name='Count')


### ➤ Step 3: Create Plot 10 – Simulated Layoff Trends Over Time

In [75]:
fig10 = px.line(
    monthly_layoffs,
    x='Month',
    y='Count',
    color='LayoffFlag',
    markers=True,
    title='Plot 10: Simulated Layoff Trends Over Time',
    labels={'LayoffFlag': 'Layoff Status (0 = Retained, 1 = Laid-Off)', 'Count': 'Number of Employees'},
    color_discrete_map={0: 'blue', 1: 'orange'}
)

# Beautify layout
fig10.update_layout(
    xaxis_title="Month",
    yaxis_title="Employee Count",
    font=dict(family="Arial", size=12),
    title_font_size=16,
    hovermode="x unified"
)

fig10.show()


### Inference from 'Plot 10: Simulated Layoff Trends Over Time'

This line plot visualizes the `Number of Employees` over `Month`, separated by their `Layoff Status` (retained vs. laid-off). Since the dataset's `DateOfRecord` is uniform (2025-07-30), this plot likely represents a simulation or a hypothetical trend based on aggregated monthly data, rather than actual historical time series from the `tcs_workforce_2025.csv` file itself. Its purpose is to show how the counts of retained and laid-off employees might hypothetically evolve over a given period.

1.  **Plot Type and Purpose**:
    * This is a line plot with markers, showing two lines: one for 'Retained' employees (`LayoffFlag = 0`, blue) and one for 'Laid-Off' employees (`LayoffFlag = 1`, orange).
    * The x-axis represents `Month`, and the y-axis represents `Employee Count`.
    * Given the single `DateOfRecord` in the original dataset, this plot is interpreting or simulating how layoff decisions (or employee counts per status) might manifest across hypothetical months.
    * The purpose is to observe simulated trends in the volume of layoffs relative to the retained workforce over time.

2.  **Key Observations from the Plot**:

    * **Consistent Volume of Retained Employees**: The blue line, representing 'Retained' employees, remains relatively flat and high across all months. This indicates that the vast majority of the workforce is retained throughout the simulated period.
    * **Relatively Low Volume of Laid-Off Employees**: The orange line, representing 'Laid-Off' employees, also remains at a consistently low level across the months, especially when compared to the number of retained employees. This aligns with the overall low layoff rate (2.5%) observed earlier in the `df.describe()` inference.
    * **Absence of Significant Monthly Fluctuation in Layoffs**: The plot shows no major peaks or troughs in the 'Laid-Off' employee count across the months. This suggests that the layoffs, within this simulated or aggregated view, either occurred as a steady trickle, or as a single, uniform event not broken down into monthly variances within the data used for this specific plot. If it were actual historical data, one would expect to see distinct spikes corresponding to layoff events.
    * **Stable Workforce Composition (Simulated)**: From this plot's perspective, the workforce composition, in terms of the proportion of laid-off versus retained employees, appears stable month-to-month during this simulated period.

3.  **Implications for Workforce Strategy**:

    * **Layoffs as Ongoing Adjustment (Potentially)**: If this plot represents an ongoing process, it suggests that layoffs are either a continuous, low-volume adjustment rather than a single large-scale event.
    * **Impact on Planning**: The apparent stability could imply that the company has a consistent process for managing workforce reductions or that the impact of layoffs is distributed evenly over time, making it potentially easier to manage from an operational perspective than sudden, large-scale cuts.
    * **Need for Granular Time Series Data**: For more actionable insights into layoff trends, actual historical `LayoffDate` information would be crucial. A single `DateOfRecord` in the source data means this plot is based on an assumed or simulated monthly distribution, and thus, its interpretation about *actual* time trends is limited without further context on how `monthly_layoffs` was derived. If this plot is a simulation, it shows the assumed steady state.

In [76]:
!pip install kaleido reportlab






In [78]:
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Image, PageBreak
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.pagesizes import A4

# Set up document
doc = SimpleDocTemplate("TCS_2025_Layoff_Analysis_Report.pdf", pagesize=A4)
styles = getSampleStyleSheet()
flowables = []

# Title Page
flowables.append(Paragraph("TCS 2025 Workforce Layoff Analysis", styles['Title']))
flowables.append(Spacer(1, 24))
flowables.append(Paragraph("Exploratory Data Analysis (EDA) Report", styles['Heading2']))
flowables.append(Spacer(1, 48))
flowables.append(Paragraph("Generated with Plotly + ReportLab", styles['Normal']))
flowables.append(PageBreak())

# Add Q1–Q10 images
for i in range(1, 11):
    flowables.append(Paragraph(f"Q{i}: Plot {i}", styles['Heading2']))
    flowables.append(Spacer(1, 12))
    flowables.append(Image(f"fig{i}_q{i}.png", width=500, height=300))  # Resize as needed
    flowables.append(Spacer(1, 24))
    flowables.append(PageBreak())

# Generate PDF
doc.build(flowables)


OSError: 
fileName='fig1_q1.png' identity=[ImageReader@0x1ffec9a7b90 filename='fig1_q1.png'] Cannot open resource "fig1_q1.png"

In [79]:
import os
os.getcwd()


'C:\\Users\\HP\\New Beginnings'

In [80]:
import os

for i in range(1, 11):
    filename = f"fig{i}_q{i}.png"
    print(f"{filename} exists:", os.path.exists(filename))


fig1_q1.png exists: False
fig2_q2.png exists: False
fig3_q3.png exists: False
fig4_q4.png exists: False
fig5_q5.png exists: False
fig6_q6.png exists: False
fig7_q7.png exists: False
fig8_q8.png exists: False
fig9_q9.png exists: False
fig10_q10.png exists: False


In [81]:
for i in range(1, 11):
    try:
        fig = eval(f"fig{i}")
        fig.write_image(f"fig{i}_q{i}.png")
        print(f"Saved fig{i}_q{i}.png ✅")
    except Exception as e:
        print(f"Could not save fig{i}: {e}")


Could not save fig1: name 'fig1' is not defined
Saved fig2_q2.png ✅
Saved fig3_q3.png ✅
Saved fig4_q4.png ✅
Saved fig5_q5.png ✅
Saved fig6_q6.png ✅
Saved fig7_q7.png ✅
Saved fig8_q8.png ✅
Saved fig9_q9.png ✅
Saved fig10_q10.png ✅


In [82]:
# Q1 Plot - Layoff distribution
import plotly.express as px

fig1 = px.pie(
    df,
    names='LayoffFlag',
    title='Plot 1: Layoff Distribution',
    color='LayoffFlag',
    color_discrete_map={0: 'green', 1: 'red'},
    hole=0.4
)

fig1.update_traces(textinfo='percent+label')
fig1.update_layout(font=dict(family="Arial", size=12), title_font_size=16)

# Save as image
fig1.write_image("fig1_q1.png")
print("Saved fig1_q1.png ✅")


Saved fig1_q1.png ✅


In [83]:
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Image
from reportlab.lib.styles import getSampleStyleSheet
import os

# Path to save PDF
output_path = r"C:\Users\HP\New Beginnings\TCS_Layoff_Analysis_Report.pdf"

# Folder where your images are saved
image_folder = r"C:\Users\HP\New Beginnings\Plots for TCS analysis project"

# Create PDF
doc = SimpleDocTemplate(output_path, pagesize=letter)
elements = []
styles = getSampleStyleSheet()

# Title
title = Paragraph("TCS 2025 Workforce Layoff Analysis – Visual Report", styles['Title'])
elements.append(title)
elements.append(Spacer(1, 20))

# Add each image
for i in range(1, 11):
    image_path = os.path.join(image_folder, f"fig{i}_q{i}.png")
    if os.path.exists(image_path):
        elements.append(Paragraph(f"Plot {i}", styles['Heading2']))
        elements.append(Image(image_path, width=450, height=300))
        elements.append(Spacer(1, 20))
    else:
        elements.append(Paragraph(f"Plot {i} not found.", styles['Normal']))

# Build the PDF
doc.build(elements)

print(f"✅ PDF created successfully at:\n{output_path}")


✅ PDF created successfully at:
C:\Users\HP\New Beginnings\TCS_Layoff_Analysis_Report.pdf
