### HR Employee Attrition dataset (https://www.kaggle.com/datasets/saurabhbadole/hr-employee-attrition) 
#### This dataset contains historical employee records for one organization, including demographics, job-related factors, and attrition status. It has 35 columns and 1470 rows. This is a classification problem, and I will fit a logistic regression model to it.

In [None]:
# Needed imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# pip install scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)

In [3]:
# load the dataset:
df = pd.read_csv("HR-Employee-Attrition.csv")

df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


#### Target variable: the "Attrition" column, a binary Yes/No flag that records whether the employee left the company. There are many other variables. I will choose the supporting (predictor) variables based on their correlation with the "Attrition" column. First, I will clean up the dataset:
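The correlation-based selection mentioned above could look roughly like the sketch below. It is only an illustration: the toy frame reuses the five rows printed by `df.head()`, the helper name `rank_by_target_correlation` is my own, and the Yes/No to 1/0 mapping is an assumption about how the target will be encoded.

```python
import pandas as pd

# Toy frame built from the five rows shown by df.head(); not the full dataset.
sample = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
    "DistanceFromHome": [1, 8, 2, 3, 2],
    "TotalWorkingYears": [8, 10, 7, 8, 6],
    "YearsAtCompany": [6, 10, 0, 8, 2],
})

def rank_by_target_correlation(frame, target="Attrition"):
    """Hypothetical helper: Pearson correlation of each numeric column with the 0/1 target."""
    encoded = frame[target].map({"No": 0, "Yes": 1})
    numeric = frame.drop(columns=[target]).select_dtypes(include="number")
    corr = numeric.corrwith(encoded)
    # Order by absolute strength so strong negative correlations are not pushed to the bottom.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

ranking = rank_by_target_correlation(sample)
print(ranking)
```

With only five rows the numbers mean nothing; on the real `df` the same call would give a first shortlist of candidate predictors.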

### Cleaning up the dataset:

In [None]:
# Checking for duplicates.
# Result: no duplicates.
df.duplicated().sum()

0

In [None]:
# Checking the target variable.
# It has only Yes/No values. I will use LabelEncoder to convert the column values to 0 or 1.
# Before applying LabelEncoder I will check which other variables also need encoding, so I can encode them all at once.
df['Attrition'].value_counts()

Attrition
No     1233
Yes     237
Name: count, dtype: int64
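As a small preview of the encoding step planned in the comments above, here is a minimal sketch. It assumes `LabelEncoder` from `sklearn.preprocessing` (not yet imported in this notebook) and uses the five `Attrition` values from `df.head()` plus the counts printed above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The five Attrition values shown by df.head(); a stand-in for df['Attrition'].
attrition = pd.Series(["Yes", "No", "Yes", "No", "No"], name="Attrition")

encoder = LabelEncoder()
encoded = encoder.fit_transform(attrition)

# LabelEncoder sorts the labels, so "No" becomes 0 and "Yes" becomes 1.
print(list(encoder.classes_))  # ['No', 'Yes']
print(encoded.tolist())        # [1, 0, 1, 0, 0]

# Class balance taken from the value_counts() output above.
counts = pd.Series({"No": 1233, "Yes": 237})
leaver_share = counts["Yes"] / counts.sum()
print(round(leaver_share, 3))  # 0.161
```

Only about 16% of the employees left, so the classes are imbalanced, and accuracy alone can look good even for a model that rarely predicts "Yes".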

In [None]:
df.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

In [10]:
# Some columns are not useful for predicting attrition: "Over18" - all employees are adults; "EmployeeNumber" - just an identifier;
# "EmployeeCount" - constant value for this dataset; "EducationField" and "Department" - these columns are categorical,
# they could be useful, but we have enough other candidates for supporting variables. The "Education" column is a numerical
# feature that represents education level, but it may not influence attrition as strongly as salary or job satisfaction.
# I will drop those columns:

columns_to_drop = [
    'Over18',
    'EmployeeCount',
    'EmployeeNumber',
    'Department',
    'EducationField',
    'Education'
]

df = df.drop(columns_to_drop, axis=1)
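
A quick way to back up the "constant value" reasoning is to count distinct values per column. The sketch below is only an illustration on the five rows from `df.head()`; the variable name `constant_cols` is my own, and the check would need to run on the full `df` before drawing conclusions.

```python
import pandas as pd

# A few columns from the five rows shown by df.head(); not the full dataset.
sample = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "EmployeeCount": [1, 1, 1, 1, 1],
    "EmployeeNumber": [1, 2, 4, 5, 7],
    "StandardHours": [80, 80, 80, 80, 80],
})

# Number of distinct values per column.
distinct = sample.nunique()

# Columns with a single distinct value carry no information for a model.
constant_cols = distinct[distinct == 1].index.tolist()
print(constant_cols)  # ['EmployeeCount', 'StandardHours']
```

On these five rows "StandardHours" is also constant (always 80), so it may be worth checking on the full dataset as well.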

In [11]:
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,1,2,Female,94,3,2,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,8,3,Male,61,2,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,2,4,Male,92,2,1,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,3,4,Female,56,3,1,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,2,1,Male,40,3,1,...,4,80,1,6,3,3,2,2,2,2


In [None]:
# Among the columns visible above, two hold categorical values besides the target: "BusinessTravel" and "Gender"
# (the truncated output also hides "JobRole", "MaritalStatus" and "OverTime", which are text columns too).
# Let's check the "BusinessTravel" column. It has three different values. I will keep it and convert the categories into numerical form
# using one-hot encoding. I kept this variable because frequent business travel is often associated with higher attrition due to stress and job demands.

df['BusinessTravel'].value_counts()

BusinessTravel
Travel_Rarely        1043
Travel_Frequently     277
Non-Travel            150
Name: count, dtype: int64
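
The one-hot encoding step for "BusinessTravel" could be sketched with `pd.get_dummies`, as below. This is an assumption about the approach rather than the final code: the toy frame reuses the five rows from `df.head()`, and the category list is declared explicitly so that "Non-Travel" still gets a column even though it does not appear in those rows.

```python
import pandas as pd

travel_levels = ["Non-Travel", "Travel_Rarely", "Travel_Frequently"]

# Age and BusinessTravel from the five rows shown by df.head().
sample = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "BusinessTravel": pd.Categorical(
        ["Travel_Rarely", "Travel_Frequently", "Travel_Rarely",
         "Travel_Frequently", "Travel_Rarely"],
        categories=travel_levels,
    ),
})

# One 0/1 column per category.
encoded = pd.get_dummies(sample, columns=["BusinessTravel"], dtype=int)
print(encoded)

# drop_first=True removes the first category ("Non-Travel"), which then acts as the baseline.
reduced = pd.get_dummies(sample, columns=["BusinessTravel"], drop_first=True, dtype=int)
print(list(reduced.columns))
```

For logistic regression the `drop_first=True` variant is often preferred, because the full set of dummy columns always sums to 1 and is therefore perfectly collinear with the intercept.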