# **Workforce_Attrition_Prediction** 

### This dataset describes employee attrition. The analysis looks for factors or patterns that lead to attrition. If any are found, employers can take precautions to prevent it, since from the employer's point of view attrition is a loss to the company in both monetary and non-monetary terms.

### **Import packages**

In [None]:
##Importing the packages
#Data processing packages
import numpy as np 
import pandas as pd 

#Visualization packages
import matplotlib.pyplot as plt 
import seaborn as sns 

#Machine Learning packages
from sklearn.svm import SVC,NuSVC
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB,MultinomialNB
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis, LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler	
from sklearn.metrics import confusion_matrix

#Suppress warnings
import warnings
warnings.filterwarnings('ignore')

### **Import data**

In [None]:
#Import Employee Attrition data
data=pd.read_csv('Workforce_attrition_dataset.csv')

### **Check for null values and remediate any that are found**

In [None]:
data.head()

In [None]:
data.info()

**COMMENT:** The output above shows that there are no null values.
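Reading null counts off `data.info()` works, but an explicit count per column is easier to scan. The following is a minimal sketch on a small, made-up frame (`toy` and its values are assumptions for illustration, not rows from the attrition dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the attrition data
toy = pd.DataFrame({
    "Age": [41, 49, np.nan, 33],
    "Attrition": ["Yes", "No", "Yes", None],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# True when at least one value is missing anywhere in the frame
has_nulls = bool(toy.isnull().values.any())
print("Any nulls:", has_nulls)
```

On the real data, the same two calls would be applied to `data` instead of `toy`.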

### **Check for and remove any fields that do not add value**

In [None]:
data['Over18'].value_counts()

**COMMENT:** The output above shows that all employees are over 18, so this field does not add any value.

In [None]:
data.describe()

**COMMENT:** The standard deviation (std) of the fields "EmployeeCount" and "StandardHours" is zero, so these fields do not add value and can be removed.

In [None]:
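#EmployeeCount and Over18 do not add value, so they are removed.
#StandardHours is left in place here; it is dropped later when the demographic features are selected.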
data = data.drop(['EmployeeCount','Over18'], axis = 1)
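Constant columns such as those noted above (zero standard deviation, or a single category) can also be detected programmatically rather than by inspection. This is a sketch under the assumption that a column with exactly one distinct value carries no information; the `toy` frame is made up for illustration:

```python
import pandas as pd

# Hypothetical toy frame with three constant columns
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
})

# nunique() counts distinct values per column; a count of 1 means the column is constant
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)

# Drop every constant column in one step
reduced = toy.drop(columns=constant_cols)
print(list(reduced.columns))
```

This catches numeric and text columns alike, whereas `describe()` only reports the standard deviation of numeric fields.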

In [None]:
data.head()

### **Convert Categorical values to Numeric Values**

#### **Perform datatype conversion or translation wherever required**

The "Attrition" field has the values **Yes/No**, but machine learning algorithms need numeric values.
The values **Yes/No** are therefore translated to binary **1/0**.

In [None]:
#A lambda function is a small anonymous function.
#A lambda function can take any number of arguments, but can only have one expression.
# 0:No, 1: Yes
data['Attrition']=data['Attrition'].apply(lambda x : 1 if x=='Yes' else 0)

In [None]:
# Ensure column is string type and strip any spaces
data['BusinessTravel'] = data['BusinessTravel'].astype(str).str.strip()

# Convert categorical variable BusinessTravel to numerical values
# 0: Non-Travel, 1: Travel_Frequently, 2: Travel_Rarely
data['BusinessTravel'] = data['BusinessTravel'].apply(
    lambda x: 1 if x == 'Travel_Frequently' else (2 if x == 'Travel_Rarely' else 0)
)


In [None]:
# Convert categorical variables Department to numerical values
# 0: Sales, 1: Research & Development, 2: Human Resources
data['Department'] = data['Department'].apply(
    lambda x: 0 if x == 'Sales' else (1 if x == 'Research & Development' else 2)
)

In [None]:
# Convert categorical variable EducationField to numerical values
# 0: Life Sciences, 1: Medical, 2: Marketing, 3: Technical Degree, 4: any other value (e.g. Human Resources)
data['EducationField'] = data['EducationField'].apply(
    lambda x: 0 if x == 'Life Sciences' else (1 if x == 'Medical' else (2 if x == 'Marketing' else (3 if x == 'Technical Degree' else 4)))
)

In [None]:
# Convert categorical values of Gender to numerical values
# 0: Male, 1: Female
data['Gender'] = data['Gender'].apply(
    lambda x: 0 if x == "Male" else 1
)

In [None]:
# Convert categorical variable JobRole to numerical values
# 0: Sales Executive, 1: Research Scientist, 2: Laboratory Technician, 3: Manufacturing Director,
# 4: Healthcare Representative, 5: Manager, 6: Sales Representative, 7: Research Director,
# 8: Human Resources
data['JobRole'] = data['JobRole'].apply(
    lambda x: 0 if x == 'Sales Executive'
    else 1 if x == 'Research Scientist'
    else 2 if x == 'Laboratory Technician'
    else 3 if x == 'Manufacturing Director'
    else 4 if x == 'Healthcare Representative'
    else 5 if x == 'Manager'
    else 6 if x == 'Sales Representative'
    else 7 if x == 'Research Director'
    else 8
)

In [None]:
# Convert categorical variables MaritalStatus to numerical values
# 0: Single, 1: Married, 2: Divorced
data['MaritalStatus']=data['MaritalStatus'].apply(lambda x : 0 if x=='Single' else (1 if x=='Married' else 2))

In [None]:
# Convert categorical variables OverTime to numerical values
# 0: No, 1: Yes
data['OverTime']=data['OverTime'].apply(lambda x : 0 if x=='No' else 1)

In [None]:
# pd.get_dummies(data) would one-hot encode any remaining categorical columns.
# It is left disabled because the columns above were already encoded by hand.
# data=pd.get_dummies(data)
data.info()

In [None]:
data.head()

**COMMENT:** Comparing the output of **data.head()** before and after the conversion shows that **all the fields now have numerical values.**
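The nested lambdas used for the conversions can also be written as dictionary lookups with `Series.map`, which keeps each code table readable in one place. A minimal sketch on made-up rows (the `toy` frame and the dictionary names are assumptions; the codes mirror the ones used in this notebook):

```python
import pandas as pd

# Hypothetical toy frame with two of the categorical fields
toy = pd.DataFrame({
    "Attrition": ["Yes", "No", "No"],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel"],
})

# One dictionary per field documents the encoding explicitly
attrition_codes = {"No": 0, "Yes": 1}
travel_codes = {"Non-Travel": 0, "Travel_Frequently": 1, "Travel_Rarely": 2}

toy["Attrition"] = toy["Attrition"].map(attrition_codes)
toy["BusinessTravel"] = toy["BusinessTravel"].map(travel_codes)
print(toy)
```

One difference in behavior: `map` yields `NaN` for a value that is missing from the dictionary, whereas the lambdas silently assign the final `else` code. That makes unexpected categories visible instead of hiding them.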

### **General preprocessing of data**

##### **Separating the Feature and Target Matrices**

In [None]:
#Separating Feature and Target matrices
X = data.drop(['Attrition'], axis=1)
y=data['Attrition']

##### **Scaling the data values to standardize the range of independent variables**

In [None]:
#Feature scaling is a method used to standardize the range of independent variables or features of data.
#Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. 
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
X = scale.fit_transform(X)

##### **Split the data into Training set and Testing set**

In [None]:
# Split the data into Training set and Testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size =0.2,random_state=42)
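In this notebook the scaler is fitted on the full feature matrix before the split, so the test rows influence the scaling statistics. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only. All names below (`features`, `labels`, `scaler`, and so on) are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 rows, 3 numeric features, binary labels
rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
labels = rng.integers(0, 2, size=100)

# Split first so the test rows stay unseen
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler()
f_train_scaled = scaler.fit_transform(f_train)

# Reuse the training statistics on the test rows
f_test_scaled = scaler.transform(f_test)

print(f_train_scaled.mean(axis=0).round(6))
```

With this ordering the training columns have mean zero after scaling, while the test columns are only approximately centered, which is the expected result.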

### **Function definition**

#### These functions will be used to prepare machine learning models

In [None]:
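#Function to Train and Test Machine Learning Model
#Model is the display name. The estimator and the true test labels default to the
#notebook-level variables `model` and `y_test`, so calls of the form
#train_test_ml_model(X_train,y_train,X_test,Model) work unchanged.
def train_test_ml_model(X_train,y_train,X_test,Model,estimator=None,y_true=None):
    estimator = model if estimator is None else estimator
    y_true = y_test if y_true is None else y_true

    estimator.fit(X_train,y_train) #Train the Model
    y_pred = estimator.predict(X_test) #Use the Model for prediction

    # Test the Model: accuracy is the share of predictions on the diagonal of the confusion matrix
    cm = confusion_matrix(y_true,y_pred)
    accuracy = round(100*np.trace(cm)/np.sum(cm),1)

    #Plot/Display the results
    cm_plot(cm,Model)
    print('Accuracy of the Model',Model,str(accuracy)+'%')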

In [None]:
#Function to plot Confusion Matrix
def cm_plot(cm,Model):
    plt.clf()
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Wistia)
    classNames = ['Negative','Positive']
    plt.title('Comparison of Prediction Result for '+ Model)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    tick_marks = np.arange(len(classNames))
    plt.xticks(tick_marks, classNames, rotation=45)
    plt.yticks(tick_marks, classNames)
    s = [['TN','FP'], ['FN', 'TP']]
    for i in range(2):
        for j in range(2):
            plt.text(j,i, str(s[i][j])+" = "+str(cm[i][j]))
    plt.show()
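The TN/FP/FN/TP cells and the trace-based accuracy can be checked on a small worked example. The labels below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels (1 = attrition, 0 = no attrition)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true labels, columns are predicted labels: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Correct predictions sit on the diagonal, so trace / total is the accuracy
accuracy_from_cm = np.trace(cm) / np.sum(cm)
print("Accuracy:", accuracy_from_cm)

# Recall on the positive class: share of actual leavers that were caught
recall = tp / (tp + fn)
print("Recall:", recall)
```

Here the accuracy is 0.8 while recall on the positive class is 0.75. When the positive class is rare, the two can diverge much further, so it may be worth reading the confusion matrix alongside the accuracy figure.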

### **PERFORM PREDICTIONS USING DIFFERENT MACHINE LEARNING ALGORITHMS**

#### These predictions are done for the purpose of deciding which ML model has to be used

In [None]:
from sklearn.svm import SVC,NuSVC  #Import packages related to Model
Model = "SVC"
model=SVC() #Create the Model

train_test_ml_model(X_train,y_train,X_test,Model)

In [None]:
from sklearn.svm import SVC,NuSVC  #Import packages related to Model
Model = "NuSVC"
model=NuSVC(nu=0.285)#Create the Model

train_test_ml_model(X_train,y_train,X_test,Model)

In [None]:
from xgboost import XGBClassifier  #Import packages related to Model
Model = "XGBClassifier"
model=XGBClassifier() #Create the Model

train_test_ml_model(X_train,y_train,X_test,Model)

In [None]:
from sklearn.neighbors import KNeighborsClassifier  #Import packages related to Model
Model = "KNeighborsClassifier"
model=KNeighborsClassifier()

train_test_ml_model(X_train,y_train,X_test,Model)

In [None]:
from sklearn.naive_bayes import GaussianNB,MultinomialNB  #Import packages related to Model
Model = "GaussianNB"
model=GaussianNB()

train_test_ml_model(X_train,y_train,X_test,Model)

In [None]:
from sklearn.linear_model import SGDClassifier, LogisticRegression #Import packages related to Model
Model = "SGDClassifier"
model=SGDClassifier()

train_test_ml_model(X_train,y_train,X_test,Model)

In [None]:
from sklearn.linear_model import SGDClassifier, LogisticRegression #Import packages related to Model
Model = "LogisticRegression"
model=LogisticRegression()

train_test_ml_model(X_train,y_train,X_test,Model)

In [None]:
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier #Import packages related to Model
Model = "DecisionTreeClassifier"
model=DecisionTreeClassifier()

train_test_ml_model(X_train,y_train,X_test,Model)

In [None]:
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier #Import packages related to Model
Model = "ExtraTreeClassifier"
model=ExtraTreeClassifier()

train_test_ml_model(X_train,y_train,X_test,Model)

In [None]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis, LinearDiscriminantAnalysis #Import packages related to Model
Model = "QuadraticDiscriminantAnalysis"
model = QuadraticDiscriminantAnalysis()

train_test_ml_model(X_train,y_train,X_test,Model)

In [None]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis, LinearDiscriminantAnalysis #Import packages related to Model
Model = "LinearDiscriminantAnalysis"
model=LinearDiscriminantAnalysis()

train_test_ml_model(X_train,y_train,X_test,Model)

In [None]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier #Import packages related to Model
Model = "RandomForestClassifier"
model=RandomForestClassifier()

train_test_ml_model(X_train,y_train,X_test,Model)

In [None]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier #Import packages related to Model
Model = "AdaBoostClassifier"
model=AdaBoostClassifier()

train_test_ml_model(X_train,y_train,X_test,Model)

In [None]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier #Import packages related to Model
Model = "GradientBoostingClassifier"
model=GradientBoostingClassifier()

train_test_ml_model(X_train,y_train,X_test,Model)
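The per-model cells above repeat the same steps, so the comparison can also be run as a loop that collects the accuracies in one place. This is a sketch on synthetic data from `make_classification`, with only three of the classifiers; every name in it (`candidates`, `scores`, `demo_X`, and so on) is an assumption for illustration, and the scores it prints say nothing about the attrition dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem standing in for the real data
demo_X, demo_y = make_classification(n_samples=300, n_features=8, random_state=42)
d_train, d_test, t_train, t_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

# Display name -> estimator, mirroring the Model/model pairs used in the cells
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
}

# Fit each estimator and record its test accuracy
scores = {}
for name, estimator in candidates.items():
    estimator.fit(d_train, t_train)
    scores[name] = accuracy_score(t_test, estimator.predict(d_test))

# Print the models from best to worst
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {100 * score:.1f}%")
```

Collecting the results in a dictionary makes it straightforward to sort them or turn them into a table when deciding which model to keep.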

## **For Employee Demographics**

#### Preparing data for ML model

In [None]:
#Making data ready for prediction
A = data.drop([
    'Attrition', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome',
    'Education', 'EducationField', 'EnvironmentSatisfaction', 'HourlyRate',
    'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MonthlyIncome',
    'MonthlyRate', 'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike',
    'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
    'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
    'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
    'YearsSinceLastPromotion', 'YearsWithCurrManager', 'EmployeeNumber'], axis=1)
B=data['Attrition']

In [None]:
# Split the data into Training set and Testing set
from sklearn.model_selection import train_test_split
A_train, A_test, B_train, B_test = train_test_split(A,B,test_size =0.2,random_state=42)

#### Training the ML model for required data

In [None]:
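# Model used for prediction: LogisticRegression (accuracy 89.5%)
from sklearn.linear_model import SGDClassifier, LogisticRegression #Import packages related to Model
Model = "LogisticRegression"
model=LogisticRegression()

# The function scores against the notebook-level y_test, which holds the same rows as
# B_test because both splits use the same test_size and random_state
train_test_ml_model(A_train,B_train,A_test,Model)
# Making a prediction on new data, with columns in the same order as A
# (Age, Gender, MaritalStatus): 28 years old, Gender 1 (Female), MaritalStatus 0 (Single)
new_data = pd.DataFrame([[28, 1, 0]], columns=A.columns)
new_predictions = model.predict(new_data)
print("Predictions on new data:", new_predictions)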
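Besides the hard 0/1 label, a fitted `LogisticRegression` can report class probabilities through `predict_proba`, which may be more useful when ranking employees by attrition risk. The sketch below trains on a handful of made-up rows (the `demo` frame, `demo_target`, and `clf` are assumptions for illustration), so the probabilities it prints are not meaningful for the real dataset:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical demographic rows using the same three columns as A
demo = pd.DataFrame({
    "Age": [22, 25, 28, 31, 35, 40, 45, 50, 55, 58],
    "Gender": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "MaritalStatus": [0, 0, 0, 1, 1, 1, 2, 1, 2, 1],
})
demo_target = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # made-up attrition labels

clf = LogisticRegression(max_iter=1000).fit(demo, demo_target)

# New employee described with the same column names used for training
new_employee = pd.DataFrame([[28, 1, 0]], columns=demo.columns)

# One row per employee; columns follow clf.classes_ (here 0 = stays, 1 = leaves)
proba = clf.predict_proba(new_employee)
print("Classes:", clf.classes_)
print("Probabilities:", proba)
```

Passing a DataFrame with matching column names, rather than a bare list, also avoids scikit-learn's warning about missing feature names.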

In [None]:
A

In [None]:
B