<center><h1 class="list-group-item list-group-item-success">HR Analytics: Job Change Prediction</h1></center>
<img src = "https://www.valamis.com/documents/10197/605345/hr-analytics.png">

### Context
A company active in Big Data and Data Science wants to hire data scientists from among the people who successfully pass courses conducted by the company. Many people sign up for the training. The company wants to know which of these candidates really want to work for it after training and which are looking for new employment elsewhere, because this helps reduce cost and time and improves the quality of training, course planning, and the categorization of candidates. Information on demographics, education, and experience is available from each candidate's signup and enrollment.

This dataset is also designed to help HR research understand the factors that lead a person to leave their current job. Using model(s) built on the current credentials, demographics, and experience data, you will predict the probability that a candidate will look for a new job or will work for the company, and interpret the factors that affect the employee's decision.

The data is divided into train and test sets. The target isn't included in the test set, but a file with the test target values is available for related tasks. A sample submission corresponding to the enrollee_id values of the test set is also provided, with columns: enrollee_id, target

### Contents:
<font size = 3.5 color = "blue">
<li>Importing Packages</li>
<li>Importing Data</li>
<li>Analysing Data</li>
<li>Data Overview</li>
<li>Transforming data to required format</li>
<li>Visualization</li>
<li>Data Preprocessing</li>
<li>One Hot encoding</li>
<li>Filling NA Values</li>
<li>Data Upscaling</li>
<li>Training Models</li>
<li>Evaluation Metrics</li>
</font>


## Importing Packages

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import random
import statistics 
from pandas import get_dummies
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error
from sklearn.metrics import roc_curve  # roc_auc_score and XGBClassifier are already imported above
from imblearn.pipeline import Pipeline as imbPipe
import warnings
warnings.filterwarnings("ignore")

## Importing Data

In [None]:
df_train = pd.read_csv("../input/hr-analytics-job-change-of-data-scientists/aug_train.csv")
df_test = pd.read_csv("../input/hr-analytics-job-change-of-data-scientists/aug_test.csv")

# Analysing Data

In [None]:
df_train

In [None]:
df_train.info()

In [None]:
df_train.isnull().sum()

### More NA Values 😫

In [None]:
df_test.isnull().sum()

# Data Overview

In [None]:
# Aggregate only the numeric columns; max/mean/min are not meaningful for the text columns
display(
    df_train.groupby(['gender','education_level','experience','company_size'])[['city_development_index','target']]
    .agg(["max",'mean',"min"])
    .style.background_gradient(cmap="viridis")
)

# Visualization

#### Countplots with respect to education level
##### Education Level
##### This dataset contains 5 education levels:

<li>Graduate<br></li>
<li>Masters<br></li>
<li>High School<br></li>
<li>PhD<br></li>
<li>Primary School</li>


In [None]:
#Countplots showing the frequency of each category with respect to education level 
plt.figure(figsize=[15,17])
plot=["relevent_experience", "education_level","major_discipline", "experience","company_size","company_type", "training_hours","target"]
n=1
for f in plot:
    plt.subplot(4,2,n)
    sns.countplot(x=f, hue='education_level', edgecolor="black", alpha=0.7, data=df_train)
    sns.despine()
    plt.title("Countplot of {}  by education_level".format(f))
    n=n+1
plt.tight_layout()
plt.show()

In [None]:
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

el = df_train['education_level'].value_counts().reset_index()
el.columns = [
    'education_level', 
    'percent'
]
el['percent'] /= len(df_train)

fig = px.pie(
    el, 
    names='education_level', 
    values='percent', 
    title='Education_level', 
    width=800,
    height=500 
)

fig.show()

### Pairplot with numerical values

In [None]:
sns.pairplot(df_train)

### Exploring the Distribution of df.experience

In [None]:
# Experience
# Enrollee's total experience in years
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

ep = df_train['experience'].value_counts().reset_index()
ep.columns = [
    'experience', 
    'percent'
]
ep['percent'] /= len(df_train)

fig = px.pie(
    ep, 
    names='experience', 
    values='percent', 
    title='Experience', 
    width=800,
    height=500 
)

fig.show()

### Transforming data to required format

In [None]:
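def clean_experience(df):
    # Map the open-ended buckets to numbers: ">20" -> 27 and "<1" -> 0
    # (27 is the placeholder value chosen in this notebook for "more than 20 years").
    # A single vectorized replace avoids chained assignment inside a row loop.
    df["experience"] = df["experience"].replace({">20": 27, "<1": 0})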
clean_experience(df_train)
clean_experience(df_test)

df_train["experience"] = df_train["experience"].fillna(0)
df_train["experience"] = df_train['experience'].astype('int')
df_test["experience"] = df_test["experience"].fillna(0)
df_test["experience"] = df_test['experience'].astype('int')

### Distribution of Training Hours

In [None]:
# Training_hours
f, axes = plt.subplots(1,1, figsize = (16, 5))
# histplot with kde=True replaces the deprecated sns.distplot
g1 = sns.histplot(df_train["training_hours"], kde=True, color="blue", ax=axes)
plt.title("Distribution of training_hours")

### Education Level Vs Training Hours

In [None]:
# education_level:training_hours
et = df_train.sort_values(by='training_hours', ascending=True)[:7000]
figure = plt.figure(figsize=(10,6))
sns.barplot(y=et.education_level, x=et.training_hours)
plt.xticks()
plt.xlabel('training_hours')
plt.ylabel('education_level')
plt.title('education_level:training_hours ')
plt.show()

# Data Preprocessing 

In [None]:
def clean_NAN(df):
    df["gender"] = df["gender"].fillna("Unknown")
    df["education_level"]=df["education_level"].fillna("Unknown")
    df["major_discipline"] = df["major_discipline"].fillna("Unknown")
    df["experience"] = df["experience"].fillna(df["experience"].mean())
    df["company_type"] = df["company_type"].fillna("Unknown")
clean_NAN(df_train)
clean_NAN(df_test)

#### NaN values are replaced with "Unknown"

In [None]:
def clean_company_size_1(df):
    # Normalise the three irregular labels so every value has the form "min-max".
    # The upper bound 20000 for "10000+" is a placeholder chosen in this notebook.
    # A vectorized replace keeps one value per row (appending to lists inside the
    # loop added matched rows twice and shifted every later row).
    df["company_size"] = df["company_size"].replace(
        {"10/49": "10-49", "<10": "1-9", "10000+": "10000-20000"}
    )
    # Split "min-max" into two integer columns
    bounds = df["company_size"].str.split("-", n=1, expand=True)
    df["company_size_min"] = bounds[0].astype("int")
    df["company_size_max"] = bounds[1].astype("int")
df_train["company_size"]=df_train["company_size"].fillna("0-0")
df_test["company_size"]=df_test["company_size"].fillna("0-0")
clean_company_size_1(df_train)
clean_company_size_1(df_test)

#### Cleaning company_size to attain the required format and splitting it into min and max company_size

In [None]:
def clean_last_new_job(df):
    # "never" -> 0 and ">4" -> 6 (6 is the placeholder chosen in this notebook).
    # NaN is left untouched here; those rows are dropped later with dropna.
    # (Comparing with `== np.NaN` is always False, so it never matched anyway.)
    cleaned = df["last_new_job"].replace({"never": 0, ">4": 6})
    # Convert the remaining digit strings ("1".."4") to numbers
    df["last_new_job"] = pd.to_numeric(cleaned)
clean_last_new_job(df_train)
clean_last_new_job(df_test)

#### Cleaning last_new_job to attain the required format for both test and train

In [None]:
def clean_city(df):
    # Strip the "city_" prefix and keep the numeric city code,
    # e.g. "city_103" -> 103
    df["city"] = df["city"].str.replace("city_", "", regex=False).astype("int")
clean_city(df_train)
clean_city(df_test)

In [None]:
def clean_relevent_experience(df):
    # Encode the two labels as 1 / 0. A vectorized map keeps one value per row
    # (the list-append loop added matched rows twice and misaligned the column).
    df["relevent_experience"] = df["relevent_experience"].map(
        {"Has relevent experience": 1, "No relevent experience": 0}
    )
clean_relevent_experience(df_train)
clean_relevent_experience(df_test)

## One Hot encoding
One-hot encoding converts each categorical variable into a set of 0/1 indicator columns, one per category, so that ML algorithms that expect numeric input can use it for prediction.

In [None]:
def one_hot_encoding(df):
    enrolled_dummies = pd.get_dummies(df["enrolled_university"], dummy_na=True)
    gender_dummies = pd.get_dummies(df["gender"], dummy_na=True)
    education_dummies = pd.get_dummies(df["education_level"],dummy_na=True)
    stream_dummies = pd.get_dummies(df["major_discipline"],dummy_na=True)
    company_dummies = pd.get_dummies(df["company_type"],dummy_na=True)
    df["Type_no_enrollment"] = enrolled_dummies["no_enrollment"]
    df["Type_Full_time_course"] = enrolled_dummies["Full time course"]
    df["Type_Part_time_course"]=enrolled_dummies["Part time course"]
    df["Gender_Male"] = gender_dummies["Male"]
    df["Gender_Female"] =gender_dummies["Female"]
    df["Gender_Unknown"]=gender_dummies["Unknown"]
    df["Gender_Other"]=gender_dummies["Other"]
    df["Education_Graduate"] = education_dummies["Graduate"]
    df["Education_Masters"] = education_dummies["Masters"]
    df["Education_High_School"] = education_dummies["High School"]
    df["Education_Primary_School"] = education_dummies["Primary School"]
    df["Education_Phd"] = education_dummies["Phd"]
    df["Education_Unknown"] = education_dummies["Unknown"]
    df["Stream_STEM"] = stream_dummies["STEM"]
    df["Stream_Humanities"] = stream_dummies["Humanities"]
    df["Stream_Other"] = stream_dummies["Other"]
    df["Stream_Business_Degree"] = stream_dummies["Business Degree"]
    df["Stream_Arts"] = stream_dummies["Arts"]
    df["Stream_No_Major"] = stream_dummies["No Major"]
    df["Stream_Unknown"] = stream_dummies["Unknown"]
    df["Company_Pvt_Ltd"] = company_dummies["Pvt Ltd"]
    df["Company_Funded_Startup"] = company_dummies["Funded Startup"]
    df["Company_Public_Sector"]=company_dummies["Public Sector"]
    df["Company_Early_Stage_Startup"] = company_dummies["Early Stage Startup"]
    df["Company_NGO"] = company_dummies["NGO"]
    df["Company_Other"] = company_dummies["Other"]
    df["Company_Unknown"] = company_dummies["Unknown"]
one_hot_encoding(df_train)
one_hot_encoding(df_test)
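The column-by-column assignments above can also be written more compactly. The following is only a sketch on a toy frame (the toy values and the `Gender` prefix are assumptions for illustration, not the notebook's data); `reindex` is one way to keep the train and test columns aligned when a category is missing from one of them.

```python
import pandas as pd

# Toy frames standing in for df_train / df_test (illustrative values only)
toy_train = pd.DataFrame({"gender": ["Male", "Female", "Unknown", "Other"]})
toy_test = pd.DataFrame({"gender": ["Male", "Male", "Female"]})

# One call creates one indicator column per category, named "<prefix>_<category>"
train_enc = pd.get_dummies(toy_train, columns=["gender"], prefix="Gender", dtype=int)
test_enc = pd.get_dummies(toy_test, columns=["gender"], prefix="Gender", dtype=int)

# Categories absent from the test set would be missing columns; add them as zeros
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```

Looking up columns by name, as the function above does, raises a `KeyError` if a category never occurs in one of the two frames, which is the case the `reindex` step guards against.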

In [None]:
df_train = df_train.dropna(subset=['enrolled_university',"last_new_job"])
df_test = df_test.dropna(subset=['enrolled_university',"last_new_job"])

In [None]:
def clean_company_size_2(df):
    # Rows whose company_size was missing were encoded as "0-0" earlier;
    # replace those zeros with the integer column mean. Working on the column
    # directly keeps values aligned with the rows that survived dropna.
    for col in ["company_size_min", "company_size_max"]:
        df[col] = df[col].replace(0, int(df[col].mean()))
df_train["company_size_min"] = df_train["company_size_min"].fillna(int(df_train["company_size_min"].mean()))
df_train["company_size_max"] = df_train["company_size_max"].fillna(int(df_train["company_size_max"].mean()))

df_test["company_size_min"] = df_test["company_size_min"].fillna(int(df_test["company_size_min"].mean()))
df_test["company_size_max"] = df_test["company_size_max"].fillna(int(df_test["company_size_max"].mean()))

clean_company_size_2(df_test)
clean_company_size_2(df_train)

### Replacing the NaN values with the average

In [None]:
df_test.isnull().sum()

<center><img src="https://media.tenor.com/images/aa37ff519d18dc4b51b8a55fb36e27e7/tenor.gif"></img></center><br>
<center><font size = 4 color = "red">Data Cleaning done successfully ✨</font></center>

In [None]:
# Target
# 0 – Not looking for job change,
# 1 – Looking for a job change
# As you can see, here we have imbalanced data, the number of 1 ( Looking for a job change) < 0 (Not looking for job change)
mnj = df_train['target'].value_counts()  
plt.figure(figsize=(6,4))
sns.barplot(x=mnj.index, y=mnj.values, alpha=0.8)  # keyword arguments are required by recent seaborn versions
plt.ylabel('Number of Data', fontsize=12)
plt.xlabel('target', fontsize=9)
plt.xticks(rotation=90)
plt.show();

In [None]:
df_test.index = np.arange(0,len(df_test))

In [None]:
df_test_copy = df_test.copy()
df_test

In [None]:
df_train = df_train.drop(['enrollee_id','gender','enrolled_university','education_level','major_discipline','company_type','company_size'],axis=1)
df_test = df_test.drop(['enrollee_id','gender','enrolled_university','education_level','major_discipline','company_type',"company_size"],axis=1)

In [None]:
X = df_train.drop("target",axis=1)
Y = pd.DataFrame(df_train["target"])

## Data Upscaling

In [None]:
smote = SMOTE()
X, Y = smote.fit_resample(X, Y)

In [None]:
X

In [None]:
Y["target"].value_counts()

### Data Balanced Successfully 🤘
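To make the balancing step less of a black box: SMOTE builds each synthetic minority row by interpolating between a real minority row and one of its nearest minority neighbours. The sketch below illustrates that idea with NumPy and scikit-learn's `NearestNeighbors`; the helper `smote_like_oversample` and the toy data are hypothetical and simplified, and this is not imblearn's implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Simplified illustration of the SMOTE idea; hypothetical helper, not imblearn's code.
    """
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)          # random minority rows
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]   # one of their k neighbours
    gap = rng.random((n_new, 1))                             # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Toy minority class: 20 points in 2-D (made-up data)
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_like_oversample(X_min, n_new=30)
print(X_new.shape)  # (30, 2)
```

Because synthetic rows are built from real ones, oversampling before the train/validation split can put near-copies of validation rows into the training set. Resampling only the training portion, for example with SMOTE as a step of the imblearn pipeline imported at the top, is the more cautious order.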

In [None]:
df_train_final = X.copy()
df_train_final['target'] = Y
df_test_final = df_test.copy()

In [None]:
cols_to_be_normalized = ["city","city_development_index","experience","last_new_job","training_hours","company_size_min","company_size_max"]
cols_not_to_be_normalized = ["relevent_experience","Type_no_enrollment","Type_Full_time_course","Type_Part_time_course","Gender_Male","Gender_Female","Gender_Unknown",
                            "Gender_Other","Education_Graduate","Education_Masters","Education_High_School","Education_Primary_School","Education_Phd",
                            "Education_Unknown","Stream_STEM","Stream_Humanities","Stream_Other","Stream_Business_Degree","Stream_Arts","Stream_No_Major",
                            "Stream_Unknown","Company_Pvt_Ltd","Company_Funded_Startup","Company_Public_Sector", "Company_Early_Stage_Startup", "Company_NGO",
                            "Company_Other", "Company_Unknown", "target"]

# Normalization
### Normalizing the train & test data for better accuracy

In [None]:
train_normalized = normalize(df_train_final[cols_to_be_normalized])
train_boolean = df_train_final[cols_not_to_be_normalized]
df_train_normalized = pd.DataFrame(train_normalized,columns = cols_to_be_normalized)
df_train_boolean = pd.DataFrame(train_boolean,columns=cols_not_to_be_normalized)
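It is worth knowing what `normalize` does here: by default (`norm="l2"`, `axis=1`) it rescales each row to unit length, which is different from scaling each column to a common range. The sketch below contrasts it with `MinMaxScaler` on a made-up matrix; the values and the column interpretation are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import normalize, MinMaxScaler

# Toy matrix: columns stand in for training_hours and company_size_max (made-up values)
X_toy = np.array([[10.0, 99.0],
                  [50.0, 4999.0],
                  [200.0, 49.0]])

row_scaled = normalize(X_toy)                     # each ROW is divided by its own L2 norm
col_scaled = MinMaxScaler().fit_transform(X_toy)  # each COLUMN is mapped to [0, 1]

print(np.linalg.norm(row_scaled, axis=1))              # every row has length 1
print(col_scaled.min(axis=0), col_scaled.max(axis=0))  # per-column min and max
```

With row-wise normalization a large value such as `company_size_max` dominates its row, so the other features shrink toward zero. If per-feature scaling is the goal, a column scaler fitted on the training data and applied to the test data with `transform` may be the better fit.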

In [None]:
cols_to_be_normalized = ["city","city_development_index","experience","last_new_job","training_hours","company_size_min","company_size_max"]
cols_not_to_be_normalized = ["relevent_experience","Type_no_enrollment","Type_Full_time_course","Type_Part_time_course","Gender_Male","Gender_Female","Gender_Unknown",
                            "Gender_Other","Education_Graduate","Education_Masters","Education_High_School","Education_Primary_School","Education_Phd",
                            "Education_Unknown","Stream_STEM","Stream_Humanities","Stream_Other","Stream_Business_Degree","Stream_Arts","Stream_No_Major",
                            "Stream_Unknown","Company_Pvt_Ltd","Company_Funded_Startup","Company_Public_Sector", "Company_Early_Stage_Startup", "Company_NGO",
                            "Company_Other", "Company_Unknown"]

In [None]:
test_normalized = normalize(df_test_final[cols_to_be_normalized])
test_boolean = df_test_final[cols_not_to_be_normalized]
df_test_normalized = pd.DataFrame(test_normalized,columns = cols_to_be_normalized)
df_test_boolean = pd.DataFrame(test_boolean,columns=cols_not_to_be_normalized)

In [None]:
df_train_final = df_train_normalized.merge(df_train_boolean,left_index=True, right_index=True)
df_test_final = df_test_normalized.merge(df_test_boolean,left_index=True, right_index=True)
df_test_final.index = np.arange(0,len(df_test_final))
df_test_final


### Splitting independent & dependent variables

In [None]:
X = df_train_final.drop("target",axis = 1)
Y = df_train_final["target"]

### Train Test Split 

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42,shuffle=True, stratify = Y)
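The `stratify` argument keeps the class proportions the same in both parts of the split. A small sketch on made-up labels (80 zeros, 20 ones) shows the effect; the toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 80 zeros and 20 ones (made-up data)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, shuffle=True, stratify=y_toy
)

# The share of class 1 in each part stays close to the overall 20 %
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```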

# <font color = "red">Logistic Regression</font>
Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.).
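As a rough illustration of where that probability comes from: the model computes a linear score and passes it through the sigmoid function. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the HR dataset) and checks the manual calculation against `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Linear score z = w.x + b, squashed into (0, 1) by the sigmoid
z = model.decision_function(X_toy)
p_manual = 1.0 / (1.0 + np.exp(-z))

p_sklearn = model.predict_proba(X_toy)[:, 1]  # probability of class 1
print(np.allclose(p_manual, p_sklearn))
```

For a binary problem, `predict` then amounts to thresholding this probability at 0.5.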

In [None]:
logitsic_model = LogisticRegression()

### Fitting the data into Logistic Model

In [None]:
logitsic_model.fit(X_train,Y_train)

### Predicting X_test with the trained model

In [None]:
Y_pred = logitsic_model.predict(X_test)

### Evaluation Metrics


#### Classification Report

In [None]:
print(classification_report(Y_test,Y_pred))

#### Confusion Matrix

In [None]:
confusion_matrix(Y_test,Y_pred)
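For reading the matrix: scikit-learn puts the true classes in rows and the predicted classes in columns, so for a binary target the four cells are TN, FP, FN, TP. A short sketch on made-up labels shows how precision and recall follow from those cells.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted job-changers, how many were right
recall = tp / (tp + fn)     # of the actual job-changers, how many were found
print(tn, fp, fn, tp, precision, recall)
```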

#### Mean Squared Error

In [None]:
print(mean_squared_error(Y_test,Y_pred))

#### Accuracy Score

In [None]:
print ("Accuracy : ", accuracy_score(Y_test, Y_pred)) 
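A note on how these two numbers relate: with hard 0/1 predictions, each row's squared error is 1 when the prediction is wrong and 0 when it is right, so the mean squared error equals the error rate, that is, 1 minus the accuracy. The sketch below checks this on made-up labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up labels for illustration
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1, 1, 0, 1])

mse = mean_squared_error(y_true, y_hat)
acc = accuracy_score(y_true, y_hat)
print(mse, acc, np.isclose(mse, 1 - acc))
```

In other words, for a classifier's label predictions the MSE carries the same information as the accuracy score.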

# <font color = "red">Gradient Boosting Classifier</font>
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
# Try several learning rates and compare the classifier's training and
# validation accuracy at each one:

lr_list = [0.005, 0.0075, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 0.88, 0.9, 1]

for learning_rate in lr_list:
    gb_clf = GradientBoostingClassifier(n_estimators=50, learning_rate=learning_rate, max_features=2, max_depth=2, random_state=0)
    gb_clf.fit(X_train, Y_train)

    print("Learning rate: ", learning_rate)
    print("Accuracy score (training): {0:.3f}".format(gb_clf.score(X_train, Y_train)))
    print("Accuracy score (validation): {0:.3f}".format(gb_clf.score(X_test, Y_test)))
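To show what "an ensemble of weak prediction models" and the learning rate mean in practice, here is a from-scratch sketch of boosting with shallow regression trees. It assumes squared-error loss and synthetic regression data for simplicity; `GradientBoostingClassifier` optimizes a classification loss instead, so this illustrates the principle and is not the library's algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only for illustration
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
pred = np.full_like(y_toy, y_toy.mean())  # start from a constant model
errors = []
for _ in range(50):
    residual = y_toy - pred  # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residual)
    pred += learning_rate * tree.predict(X_toy)  # shrink each tree's contribution
    errors.append(np.mean((y_toy - pred) ** 2))

print(errors[0], errors[-1])
```

A small learning rate makes each tree contribute little, so more trees are needed. This is why, with `n_estimators` fixed at 50, the smallest rates in the loop above can be expected to underfit.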

# <font color = "red">XGBoost</font>
XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. Artificial neural networks tend to do best on prediction problems involving unstructured data (images, text, etc.), whereas tree-based methods such as XGBoost are often a strong choice for small-to-medium structured (tabular) data like this dataset.

In [None]:
import warnings
warnings.filterwarnings("ignore")
XGBoost_pipe = imbPipe([
    ("XGBoost", XGBClassifier(random_state=42,n_jobs=-1,tree_method="hist"))
])

params={
    "XGBoost__max_depth": [20,21],
    "XGBoost__min_child_weight":[22,23],
    "XGBoost__n_estimators":[25,27],
    "XGBoost__subsample":[0.4,0.5,0.6],
    "XGBoost__colsample_bytree":[0.4,0.5,0.6],
    "XGBoost__gamma":[1,2,3],
    
}

XGBoost_grid = GridSearchCV(XGBoost_pipe, params, n_jobs=-1,cv=3,scoring="roc_auc")
XGBoost_grid.fit(X_train, Y_train)
print("Best Parameters for Model:  ",XGBoost_grid.best_params_)
Y_pred=XGBoost_grid.predict(X_train)
print("\n")
print(classification_report(Y_train, Y_pred))
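Grid search tries every combination of the listed values, so the cost grows multiplicatively. As a quick sanity check, `ParameterGrid` can count the candidates for the grid above. The keys follow the `<step name>__<parameter>` pattern that pipelines use.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above; keys follow the "<step name>__<parameter>" pattern
params = {
    "XGBoost__max_depth": [20, 21],
    "XGBoost__min_child_weight": [22, 23],
    "XGBoost__n_estimators": [25, 27],
    "XGBoost__subsample": [0.4, 0.5, 0.6],
    "XGBoost__colsample_bytree": [0.4, 0.5, 0.6],
    "XGBoost__gamma": [1, 2, 3],
}

n_candidates = len(ParameterGrid(params))
n_fits = n_candidates * 3  # cv=3 folds per candidate
print(n_candidates, n_fits)  # 216 648
```

With the default `refit=True`, one more fit on the full training data follows for the best candidate.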


## Classification report of the XGBoost model

In [None]:
Y_pred=XGBoost_grid.predict(X_test)  
print(classification_report(Y_test, Y_pred))

## Accuracy score

In [None]:
accuracy_score(Y_test,Y_pred)
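Since the grid search is scored with `roc_auc`, it can be informative to report ROC AUC on the validation split as well. ROC AUC is computed from scores or probabilities rather than hard labels. The sketch below uses synthetic data and a logistic regression as a stand-in classifier (both are assumptions for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data and a stand-in classifier, used only for illustration
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probabilities rank the rows; hard labels collapse the ranking to one threshold
auc_proba = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])
auc_labels = roc_auc_score(y_va, clf.predict(X_va))
print(auc_proba, auc_labels)
```

For the fitted grid search, the analogous call would presumably be `roc_auc_score(Y_test, XGBoost_grid.predict_proba(X_test)[:, 1])`.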

### Based on the above models, XGBoost has the highest accuracy 💥

### Applying our model to the given test data

In [None]:
X_test = df_test_final.copy()

In [None]:
Y_pred = XGBoost_grid.predict(X_test)  

In [None]:
submission = pd.DataFrame(df_test_copy['enrollee_id'])
submission["target"] = Y_pred

In [None]:
filename = 'submission.csv'
submission.to_csv(filename,index=False)

<center><img src="https://media1.giphy.com/media/2vq9I9HGKrpjaHNLVb/giphy.gif"></img></center><br>

# Thank You 🤗
### I hope you had a good time reading my notebook. Please do support and comment! 😎