In [4]:
!pip install --upgrade --force-reinstall numpy pandas

Collecting numpy
  Downloading numpy-2.2.6-cp310-cp310-macosx_14_0_arm64.whl.metadata (62 kB)
Collecting pandas
  Downloading pandas-2.3.2-cp310-cp310-macosx_11_0_arm64.whl.metadata (91 kB)
Collecting python-dateutil>=2.8.2 (from pandas)
  Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl.metadata (8.4 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting six>=1.5 (from python-dateutil>=2.8.2->pandas)
  Using cached six-1.17.0-py2.py3-none-any.whl.metadata (1.7 kB)
Downloading numpy-2.2.6-cp310-cp310-macosx_14_0_arm64.whl (5.3 MB)
Downloading pandas-2.3.2-cp310-cp310-macosx_11_0_arm64.whl (10.8 MB)

## Use Case Intro: Employee Attrition
## You are a Data Scientist working at Jio

- The company is facing a huge problem of employee attrition
- Your task is to help the company find a solution to this problem.

#### Why is attrition a problem?

  - New hires often ask for higher compensation
  - New employees have to be trained
  - Finding a replacement takes a lot of time and resources

#### What can be done to solve the problem?

1. Identify the employees who may leave in the future.
  - Targeted approaches can be undertaken to retain such employees.
  - These might include addressing their problems with the company, and so on.

2. Help identify the key indicators/factors leading to an employee leaving.
  - #### What reasons can you think of that contribute to attrition?
    - Forcing employees to come to the office daily
    - Unhealthy culture, etc.
  - Identifying these key factors helps in taking better measures to improve employee retention



In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import io

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

### Now let's import our dataset


In [None]:
!gdown 16KtxSt_QEGQvfluEaMls5cCHPwhRXgCk

In [None]:
df = pd.read_csv("HR-Employee-Attrition.csv")
df.info()

In [None]:
df.head()


#### What can we see from this info ?
- The dataset has around 1500 samples
- It contains information about :

    1. Employee demographics\
     Eg: Age, Gender, Marital Status

    2. Employee work-life\
     Eg: Working hours, job satisfaction etc

#### How can we use this information for our problem?

To understand this, let's analyze the features


### EDA

First, let's try to find their distributions

#### How can we do that ?
- Plotting their histograms
- Recall why we do that ?

In [None]:
df.hist(figsize = (20,20))
plt.show()

#### What can we observe from these plots ?

1. Many histograms are tail-heavy:

  - Many attributes are right-skewed\
 (e.g. MonthlyIncome, DistanceFromHome, YearsAtCompany)

  - Data transformation methods (for example a log transform) **may be** required to reduce the skew before standardisation
    - Recall why standardisation is preferred?

2. Some features seem to have normal distributions

  - Eg: Age:
    - Slightly right-skewed normal distribution
    - Bulk of the staff between 25 and 45 years old

3. Some features are constant

  - Eg: EmployeeCount and StandardHours are constant values for all employees.

  - They're likely to be redundant features.

  - #### How can these features contribute to our problem ?
    - Constant features are not in any way useful for predictions
    - So we can drop these features from the dataset

4. Some features seem to be uniformly distributed.

  - Eg: EmployeeNumber

  - **Uniformly distributed and constant features won't contribute** to our analysis. Why?
    - EmployeeNumber behaves like an identifier: each value is equally likely to occur, so it tells us nothing about whether an employee leaves

  - #### So what should we do?
    - We can drop these features from our dataset

5. Some features are categorical, i.e. **binomially/multinomially distributed**

  - Eg: WorkLifeBalance, StockOptionLevel etc

  - #### Can we use these features directly in our problem?
    - No. They will first have to be encoded

  - #### Recall which encoding has to be used for which features

    - Binary Encoding (0/1): Features with only 2 unique values

    - Label Encoding (0, 1, 2, 3, ...): More than 2 unique values having a particular order

    - OneHot Encoding ([0 0 0 1], ...): More than 2 unique values having no order

    - Target Encoding ([0.1, 0.33, ...]): Features with a lot of unique values having no order


6. We can also see from these features that their ranges vary a lot

  - Recall why different feature scales can be a problem

  - We will deal with this problem later
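
To make observation 1 more concrete, here is a minimal sketch of how skewness could be measured and reduced. It uses synthetic log-normal data as a stand-in for a right-skewed column such as MonthlyIncome (the `income` values are made up, not taken from the dataset). `np.log1p` is only one common choice of transformation and is an assumption here, not necessarily what this notebook applies later.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for a column like MonthlyIncome (assumption)
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=8.5, sigma=0.6, size=1000), name="income")

skew_before = income.skew()      # a value well above 0 means a long right tail
income_log = np.log1p(income)    # log(1 + x) compresses the large values
skew_after = income_log.skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

On the real data, `df.skew(numeric_only=True)` would give the same measure for every numeric column at once.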

First, let's remove the features that won't contribute to our analysis


In [None]:
df.drop(['EmployeeCount', 'EmployeeNumber', 'StandardHours', 'Over18'], axis=1, inplace=True)
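
The columns above were picked by reading the histograms. As a rough sketch, constant and identifier-like columns could also be found programmatically with `nunique()`. The `toy` frame below is made up for illustration. The "one distinct value per row" rule is a heuristic, since a continuous measurement can also be unique per row.

```python
import pandas as pd

# Toy frame mimicking the column types discussed above (values are made up)
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 3, 4, 5, 6],        # unique per row, like an ID
    "StandardHours": [80, 80, 80, 80, 80, 80],   # constant
    "Age": [29, 41, 35, 29, 52, 41],             # ordinary feature
})

n_unique = toy.nunique()

# Columns with a single value carry no information
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with one distinct value per row behave like identifiers
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print("constant:", constant_cols)
print("identifier-like:", id_like_cols)
```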

Now let's encode our categorical features

#### Which encoding technique should we use?

  - It depends upon:
    - The number of unique values a feature has
    - Whether the feature values have a natural order

Let's first check how many unique values each feature has


In [None]:
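def unique_vals(col):
    # Print the number of unique values for categorical (object-dtype) columns only
    if col.dtype == "object":
        print(f'{col.name}: {col.nunique()}')

# apply() calls unique_vals once per column; assign to _ to hide the returned Series of None
_ = df.apply(unique_vals)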

#### On the basis of this info, which encoding technique should we use?

 - We will use binary encoding for features with 2 or fewer unique values.
 - For features with fewer than 6 unique values we will use OneHot encoding.
 - The rest of the categorical features will be target encoded.
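
The rule of thumb above can be written down as a small function. This is only a sketch: `suggest_encoding` and its `onehot_max` parameter are hypothetical names introduced for illustration, and the `toy` frame contains made-up values.

```python
import pandas as pd

def suggest_encoding(col: pd.Series, onehot_max: int = 5) -> str:
    """Hypothetical helper: choose an encoding from the number of unique values."""
    n = col.nunique()
    if n <= 2:
        return "binary"
    if n <= onehot_max:      # "fewer than 6 unique values"
        return "onehot"
    return "target"

# Made-up categorical columns with 2, 3 and 7 unique values
toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "No", "Yes", "No", "No", "Yes"],
    "MaritalStatus": ["Single", "Married", "Divorced", "Single", "Married", "Single", "Married"],
    "JobRole": ["A", "B", "C", "D", "E", "F", "G"],
})

plan = {name: suggest_encoding(toy[name]) for name in toy.columns}
print(plan)
```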


In [None]:
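from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def label_encode(ser):
    # Binary-encode object columns with at most 2 unique values.
    # LabelEncoder assigns codes in sorted order (e.g. No -> 0, Yes -> 1);
    # note that this also encodes the target column, Attrition.
    if ser.dtype == "object" and ser.nunique() <= 2:
        print(ser.name)
        # Fresh encoder per column; keep the result a Series with the original index
        encoded = LabelEncoder().fit_transform(ser)
        return pd.Series(encoded, index=ser.index, name=ser.name)

    return ser

df = df.apply(label_encode)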

In [None]:
# One-hot encode the low-cardinality categorical columns (drop_first avoids a redundant dummy column)
df = pd.get_dummies(df, columns = ["BusinessTravel", "Department", "MaritalStatus"], drop_first = True)

In [None]:
df.head()

#### Let's analyse the target feature now

In [None]:
target = df['Attrition'].copy()
df = df.drop(["Attrition"], axis = 1)
type(target)

In [None]:
target.value_counts()

#### What can we infer from this info ?
  - The dataset is heavily imbalanced: employees who stayed far outnumber those who left
  - Recall how we deal with imbalanced data
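
As a small sketch, the degree of imbalance can be quantified from `value_counts()`. The labels in `toy_target` are made up and do not reproduce the real class counts.

```python
import pandas as pd

# Made-up labels: 1 = employee left, 0 = employee stayed
toy_target = pd.Series([0] * 17 + [1] * 3)

counts = toy_target.value_counts()
shares = toy_target.value_counts(normalize=True)   # class proportions
imbalance_ratio = counts.max() / counts.min()       # majority / minority

print(shares)
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```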

For this dataset we will use the SMOTE oversampling technique to balance the data

But SMOTE is applied only to the training set, so that the test set contains only real samples

So we need to split the data first

#### In what sets should we split it ?

  - Train/test set

  - #### Why not create a validation set?
    - We only have a small amount of data
    - And we want to train the model on as much data as possible
    - So we will use K-Fold cross validation instead
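
A minimal sketch of what K-Fold cross validation could look like with scikit-learn. The data is synthetic (`make_classification`), and the stratified variant `StratifiedKFold`, the logistic regression model and the F1 metric are all assumptions made for illustration, not choices taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (roughly 85% / 15%), a stand-in for the training set
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.85, 0.15], random_state=0)

# 5 folds; stratification keeps the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# One F1 score per fold; each fold serves once as the validation set
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print(scores.round(3), "mean:", scores.mean().round(3))
```

Every sample is used for validation exactly once and for training in the other folds, so no data has to be set aside permanently.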

#### What ratios should we use for splitting?
  - 75%/25% for train/test looks enough

Let's split the dataset now

In [None]:
# Since we have class imbalance (i.e. more employees with Attrition=0 than Attrition=1),
# let's use stratify=target to keep the same class ratio in the train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df,
                                                    target,
                                                    test_size=0.25,
                                                    random_state=7,
                                                    stratify=target)

print("Shape of X_train: ", X_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of y_test: ", y_test.shape)
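
To see what `stratify` does, here is a small self-contained check on made-up data (200 rows, 15% positives). The names `X_tr`, `X_te`, `y_tr` and `y_te` are used only in this sketch, so they don't clash with the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 200 rows, 15% positive class
X = np.arange(200).reshape(-1, 1)
y = np.array([1] * 30 + [0] * 170)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=7, stratify=y
)

# With stratify, the positive share is almost identical in both parts
print(f"train positive share: {y_tr.mean():.3f}")
print(f"test positive share:  {y_te.mean():.3f}")
```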

In [None]:
len(X_train.columns)

Now we will perform target encoding. The encoder is fit on the training set only and then applied to the test set

In [None]:
!pip install category_encoders

In [None]:
import category_encoders as ce

ce_target = ce.TargetEncoder(cols = ['EducationField', 'JobRole'])
X_train = ce_target.fit_transform(X_train, y_train)
X_test = ce_target.transform(X_test)
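
To illustrate the idea behind target encoding, here is a simplified sketch with plain pandas: each category is replaced by the mean of the target within that category, learned on the training data only. The data is made up. The `TargetEncoder` from `category_encoders` additionally smooths these means towards the overall mean, so its numbers will generally differ from this plain version.

```python
import pandas as pd

# Made-up training data: 1 = left, 0 = stayed
train = pd.DataFrame({
    "JobRole": ["Sales", "Sales", "Sales", "Lab", "Lab", "HR"],
    "Attrition": [1, 0, 1, 0, 0, 1],
})
test = pd.DataFrame({"JobRole": ["Lab", "Sales", "Manager"]})

# Learn the per-category mean of the target on the training set only
category_means = train.groupby("JobRole")["Attrition"].mean()
global_mean = train["Attrition"].mean()

# Apply to the test set; categories unseen in training fall back to the global mean
test["JobRole_encoded"] = test["JobRole"].map(category_means).fillna(global_mean)
print(test)
```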

### Upsampling using SMOTE

In [None]:
from imblearn.over_sampling import SMOTE
from collections import Counter

smt = SMOTE()
X_sm, y_sm = smt.fit_resample(X_train, y_train)

print('Resampled dataset shape {}'.format(Counter(y_sm)))

In [None]:
X_sm.shape

In [None]:
X_sm
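
For intuition, the core step of SMOTE can be sketched in a few lines of NumPy: a synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration on made-up 2D points, not the `imblearn` implementation, and `synth_sample` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class points in 2D
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synth_sample(points, i, k=2):
    """Create one synthetic point from points[i] and one of its k nearest neighbours."""
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # skip position 0, the point itself
    j = rng.choice(neighbours)
    gap = rng.random()                        # uniform in [0, 1)
    return points[i] + gap * (points[j] - points[i]), j

new_point, j = synth_sample(minority, i=0)
print(new_point, "lies between", minority[0], "and", minority[j])
```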

### Preprocessed data

In [None]:
!gdown 19L3rYatfhbBL1r5MHrv-p_oM2wlvrhqk
!gdown 1OHLKJwA3qZopKPvlKoRldM6BvA1A4dYF
!gdown 1N7O_fWCTJLu8SIa_paKcDEzllgpMk8sK
!gdown 12Bh2AN8LcZAlg20ehpQrEWccUDaSdsOG

In [None]:
import pickle
# Load data (deserialize)
with open('preprocessed_X_sm.pickle', 'rb') as handle:
    X_sm = pickle.load(handle)

with open('X_test.pickle', 'rb') as handle:
    X_test = pickle.load(handle)

with open('y_sm.pickle', 'rb') as handle:
    y_sm = pickle.load(handle)

with open('y_test.pickle', 'rb') as handle:
    y_test = pickle.load(handle)