###### Problem Statement:
- HRWorks supports several information technology (IT) companies in India with their talent acquisition. One of the challenges they face is that about 30% of the candidates who accept a job offer do not join the company. This leads to a huge loss of revenue and time, as the companies have to initiate the recruitment process again to fill the workforce demand. HRWorks wants to find out whether a model can be built to predict the likelihood of a candidate joining the company.

#### 1. Identify and define the problem statement clearly, and mention why it is necessary for an organisation to solve the problem.

HRWorks is a workforce supply company that places candidates with IT companies throughout India. One of its major problems is that candidates back out after accepting the offer letter. This causes a huge loss for the company in terms of money, resources, and time. The business problem is therefore to build a model that predicts whether a candidate will join, and to identify the factors that influence this decision so that the issue becomes easier to address.

#### 2. Define any hypothesis if possible

Hypothesis:
- Null Hypothesis: All the candidates will join the firms
- Alternate Hypothesis : Not all candidates join the firms

#### 3. Do the EDA of the dataset and write the observations you got from the dataset

## Importing all the necessary libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# Suppress library warnings to keep the notebook output readable
# (valid actions: "error", "ignore", "always", "default", "module" or "once")
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv("../input/hrworks-dataset/hr_data.csv")

In [None]:
df.head()

In [None]:
df.dtypes

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.describe(include="object")

## Data Preprocessing

First let us check for missing values in our data set

In [None]:
df.isnull().sum()

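We can see that there is a very small number of null values in two of the features. We can either drop these rows or fill the values in. I am filling them in: 'Offered band' is forward-filled, and missing 'Age' values are replaced with the mean age of the candidate's gender, on the assumption that Gender influences Age.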

In [None]:
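# Forward-fill the missing bands (assign back instead of the deprecated chained inplace fill)
df['Offered band'] = df['Offered band'].ffill()
# Replace missing ages with the mean age of the same gender
df['Age'] = df['Age'].fillna(df.groupby('Gender')['Age'].transform('mean'))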

In [None]:
df.isnull().sum()
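The group-wise imputation used above can be checked on a small example. The following is a minimal sketch on made-up data (the `toy` frame and its values are assumptions, not rows from the HRWorks dataset); it shows that `transform('mean')` returns one group mean per row, so it lines up with the original column inside `fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age per gender
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Age": [30.0, np.nan, 34.0, 26.0, 28.0, np.nan],
})

# transform('mean') broadcasts each group's mean back to every row of that group
group_mean = toy.groupby("Gender")["Age"].transform("mean")

# Only the missing entries are replaced; existing ages are left untouched
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy)
```

In this toy frame the missing male age becomes 32.0 and the missing female age becomes 27.0, the means of the observed values in each group.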

Let us now drop the unwanted columns that add no value to our data.

In [None]:
df.drop('SLNO', axis=1, inplace=True)
# Assign the result back; without this the drop below has no effect on df
df = df.drop(columns=['Offered band','Candidate relocate actual'])

Next, let us convert our target variable from categorical to numerical value for ease of analysis and model building.

In [None]:
df['Status'] = df['Status'].map(lambda x:1 if x=='Joined' else 0)

In [None]:
df['DOJ Extended'].value_counts()

In [None]:
df['LOB'].value_counts()

In [None]:
df['Gender'].value_counts()

In [None]:
df['Candidate Source'].value_counts()

In [None]:
df['Location'].value_counts()

In [None]:
df.head()

## Analysis
### Univariate Analysis

In [None]:
Y = df["Status"]
total = len(Y)*1.
ax = sns.countplot(x="Status", data=df)
for p in ax.patches:
    ax.annotate('{:.2f}%'.format(100*p.get_height()/total), (p.get_x()+0.1, p.get_height()+5))
plt.show()

This shows the percentage of candidates who did not join versus the percentage who joined. We can see that the classes are imbalanced. We will deal with this during model building.
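The same imbalance can also be read off numerically. A minimal sketch, assuming a made-up 80/20 target (the `toy_status` series is hypothetical): the largest class share is also the accuracy a model would get by always predicting the majority class, which is a useful baseline to keep in mind when reading accuracy scores later.

```python
import pandas as pd

# Hypothetical target: 8 candidates joined (1), 2 did not (0)
toy_status = pd.Series([1] * 8 + [0] * 2, name="Status")

# Share of each class instead of raw counts
shares = toy_status.value_counts(normalize=True)

# Accuracy of a trivial "always predict the majority class" model
majority_baseline = shares.max()

print(shares)
print(majority_baseline)
```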

### Analysis with categorical variables

In [None]:
def plot_withY(label, dataset):
    # Count plot of one categorical column, split by the target variable
    plt.figure(figsize=(10,10))
    sns.countplot(x=label, data=dataset, hue="Status")
    plt.xticks(rotation=45)
    plt.show()

In [None]:
plot_withY('Gender', df)

In [None]:
plot_withY('Candidate Source', df)

In [None]:
plot_withY('Joining Bonus', df)

After plotting each of the categorical variables, nothing worthwhile can be summarized from the raw counts, because the large class imbalance dominates every plot. So, next let us look at the numerical variables.
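Because raw counts are dominated by the larger class, one option is to compare row-normalised shares per category instead. A minimal sketch on made-up data (the `toy` frame is an assumption, not the HRWorks dataset), using `pd.crosstab` with `normalize="index"` so that each row sums to 1:

```python
import pandas as pd

# Hypothetical data: 6 male and 4 female candidates with their join status
toy = pd.DataFrame({
    "Gender": ["Male"] * 6 + ["Female"] * 4,
    "Status": [1, 1, 1, 1, 1, 0, 1, 1, 1, 0],
})

# Each row gives the share of not-joined (0) and joined (1) within that category
join_rate = pd.crosstab(toy["Gender"], toy["Status"], normalize="index")
print(join_rate)
```

Categories whose joined share differs clearly from the overall share would be candidates for a closer look; similar shares across categories support the conclusion that the variable has little effect.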

### Analysis of Numerical variables

In [None]:
def plot_num(feature):
    sns.boxplot(data=df, x='Status', y = feature)
    plt.show()

In [None]:
plot_num('Duration to accept offer')

In [None]:
plot_num('Notice period')

In [None]:
plot_num('Pecent hike expected in CTC')

In [None]:
plot_num('Percent hike offered in CTC')

In [None]:
plot_num('Percent difference CTC')

In [None]:
plot_num('Rex in Yrs')

In [None]:
plot_num('Age')

In [None]:
plt.figure(figsize=(20,10))
# Expected versus offered hike, coloured by whether the candidate joined
sns.scatterplot(x= 'Percent hike offered in CTC', y= 'Pecent hike expected in CTC', hue='Status', data=df)
plt.show()

We can see that there is a positive, roughly linear relationship between the expected and offered CTC hike among the candidates who joined.
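A visual impression of linearity can be backed up with a correlation coefficient per group. The sketch below uses synthetic data (the columns `offered` and `expected`, and the way they are generated, are assumptions for illustration only) and computes the Pearson correlation separately for each value of `Status`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: expected hike = offered hike plus some noise
offered = rng.uniform(0, 100, size=200)
expected = offered + rng.normal(0, 5, size=200)
toy = pd.DataFrame({
    "offered": offered,
    "expected": expected,
    "Status": rng.integers(0, 2, size=200),
})

# Pearson correlation between the two columns, computed within each group
corr_by_status = {
    status: group["offered"].corr(group["expected"])
    for status, group in toy.groupby("Status")
}
print(corr_by_status)
```

A value close to 1 in a group indicates a strong positive linear relationship there; comparing the values across groups shows whether the pattern is specific to the joined candidates.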

In [None]:
plt.figure(figsize = (7,7))

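# numeric_only=True is needed on recent pandas versions, where corr() no longer skips text columns
sns.heatmap(df.corr(numeric_only=True), annot = True)
plt.show()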

The analysis leads to the following observations:

- None of the categorical variables showed a visible effect on our target variable.
- Among the numerical variables, there are some observations worth noting.
- The age plot shows some outliers in both groups, and among the candidates who did not join there are some extreme outliers around age 60. So there is a chance that age affects our target variable.
- For years of experience, there are some extreme outliers among the candidates who joined. We can assume that years of experience has a slight effect on the target variable.
- The hike offered in CTC has a large number of outliers at very high values, which suggests that more candidates join when offered a higher CTC. The hike expected also shows more high outliers among the candidates who joined. So these two features appear to be related: when candidates expect a higher hike and that expectation is met, they tend to join.
- A shorter notice period (peaking at 40 to 45 days) goes with higher chances of the candidate joining. The longer the notice period, the more time candidates have to reconsider, and they may decide not to join.
- There are also many outliers in the feature "Duration to accept offer" among the candidates who did not join. Reducing that duration may help to keep candidates from dropping out.
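The outliers mentioned in these observations are the points that a box plot draws beyond its whiskers. A minimal sketch of how they could be counted per group with the usual 1.5 × IQR rule, on made-up ages (the helper `count_iqr_outliers` and the `toy` frame are hypothetical, not part of the original analysis):

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series) -> int:
    """Count values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(is_outlier.sum())

# Hypothetical ages: the not-joined group (0) contains one extreme value
toy = pd.DataFrame({
    "Status": [0] * 6 + [1] * 6,
    "Age": [25, 26, 27, 28, 29, 60, 25, 26, 27, 28, 29, 30],
})

# One outlier count per target class
outliers_by_status = toy.groupby("Status")["Age"].apply(count_iqr_outliers)
print(outliers_by_status)
```

Counting outliers this way makes statements such as "more outliers among the candidates who joined" checkable rather than visual.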

#### Q4. Develop machine learning algorithms and compare different models.

## Machine Learning models

Now we will build models using different algorithms and find out which one performs best.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn import metrics

In [None]:
x_feature = list(df.columns)
x_feature.remove('Status')
encoded_data = pd.get_dummies(df[x_feature], drop_first = True) 
y = df['Status']
x = encoded_data

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y,train_size=0.7, random_state = 42)
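With an imbalanced target, a plain random split can leave the train and test sets with slightly different class ratios. One option, shown here as a sketch on made-up arrays (`X_toy` and `y_toy` are assumptions), is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 80% positive class
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 80 + [0] * 20)

# stratify=y_toy preserves the 80/20 ratio in both the train and the test part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, stratify=y_toy, random_state=42
)

print(y_tr.mean(), y_te.mean())
```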

#### Logistic Regression

In [None]:
log_reg=LogisticRegression(max_iter=55,solver= "newton-cg")
log_reg.fit(x_train,y_train)
y_pred = log_reg.predict(x_test)
log_reg.predict_proba(x_test)

In [None]:
print(metrics.classification_report(y_test, y_pred))
print(metrics.confusion_matrix(y_test,y_pred))
print(metrics.accuracy_score(y_test,y_pred))

In [None]:
from sklearn.model_selection import GridSearchCV
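grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}
# The default lbfgs solver does not support the l1 penalty; liblinear supports both l1 and l2
clf = GridSearchCV(LogisticRegression(solver='liblinear'),grid,cv=10,scoring = 'roc_auc')
clf.fit(x_train, y_train)
# Predictions of the best estimator on the held-out test set
train_predictions = clf.predict(x_test)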

In [None]:
print(metrics.classification_report(y_test, train_predictions))

#### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf=RandomForestClassifier()
clf.fit(x_train,y_train)

In [None]:
y_pred_clf=clf.predict(x_test)
print(metrics.confusion_matrix(y_test,y_pred_clf))
print(metrics.accuracy_score(y_test,y_pred_clf))
print(metrics.classification_report(y_test,y_pred_clf))

We observed above that our data is imbalanced. So, before coming to any conclusion, we need to apply a sampling technique to overcome this issue.

In [None]:
from sklearn.utils import resample

In [None]:
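# Draw as many 'Joined' rows (Status == 1) as there are 'Not Joined' rows (Status == 0).
# Note: resample samples with replacement by default.
x_train_u, y_train_u = resample(x_train[y_train==1],
                               y_train[y_train==1],
                               n_samples = x_train[y_train==0].shape[0],
                               random_state = 1)

# Combine all 'Not Joined' rows with the resampled 'Joined' rows into a balanced training set
x_train_u = np.concatenate((x_train[y_train==0],x_train_u))
y_train_u = np.concatenate((y_train[y_train==0],y_train_u))
print(x_train_u.shape)
print(y_train_u.shape)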

In [None]:
log_reg_up = LogisticRegression()
log_reg_up.fit(x_train_u, y_train_u)
y_pred_up = log_reg_up.predict(x_test)
print(metrics.classification_report(y_test, y_pred_up))
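As an alternative to resampling the training rows, scikit-learn estimators such as `LogisticRegression` accept `class_weight="balanced"`, which reweights the classes inversely to their frequency. The following is only a sketch on synthetic data (generated with `make_classification`; all variable names are assumptions), comparing recall on the minority class with and without the weighting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 20% class 0 and 80% class 1
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=5, weights=[0.2, 0.8], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, stratify=y_toy, random_state=42
)

# Same model, once unweighted and once with balanced class weights
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Recall on the minority class (label 0)
recall_plain = recall_score(y_te, plain.predict(X_te), pos_label=0)
recall_weighted = recall_score(y_te, weighted.predict(X_te), pos_label=0)
print(recall_plain, recall_weighted)
```

Unlike resampling, this keeps every training row, so it may be worth trying alongside the resampled models.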

In [None]:
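# Random forest trained on the balanced training set
rf_up = RandomForestClassifier()
rf_up.fit(x_train_u, y_train_u)
# Predict with the random forest itself (not with the logistic regression model)
y_pred_clfu = rf_up.predict(x_test)
print(metrics.classification_report(y_test, y_pred_clfu))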

##### Let us now fine-tune the parameters and cross-validate both algorithms

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Logistic Regression Cross Validation
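grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}
# liblinear supports both the l1 and l2 penalties (the default lbfgs solver rejects l1)
LR = GridSearchCV(LogisticRegression(solver='liblinear'),grid,cv=10,scoring = 'roc_auc')
LR.fit(x_train_u, y_train_u)
# Predictions of the best estimator on the held-out test set
train_predictions = LR.predict(x_test)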

In [None]:
print(metrics.classification_report(y_test,train_predictions))
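Besides the classification report, a fitted `GridSearchCV` object records which parameter combination won and its mean cross-validated score, through the attributes `best_params_` and `best_score_`. A minimal sketch on synthetic data (the data and the smaller grid are assumptions chosen to keep the example fast):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

param_grid = {"C": np.logspace(-2, 2, 5), "penalty": ["l1", "l2"]}
search = GridSearchCV(
    LogisticRegression(solver="liblinear"),  # liblinear handles both penalties
    param_grid,
    cv=5,
    scoring="roc_auc",
)
search.fit(X_toy, y_toy)

# Winning combination and its mean cross-validated ROC AUC
print(search.best_params_)
print(round(search.best_score_, 3))
```

Reporting these two values makes it clear which settings the final predictions come from.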

In [None]:
# Random Forest Cross Validation
# 'auto' is left out of max_features: it was equivalent to 'sqrt' for classifiers
# and has been removed in recent scikit-learn versions
tuned_parameters =  [{'n_estimators': [200, 500],
                      'max_features': ['sqrt', 'log2'],
                      'max_depth' : [4,5,6,7,8],
                      'criterion' :['gini', 'entropy']}]
RF = GridSearchCV(RandomForestClassifier(n_jobs=-1,max_features= 'sqrt' ,n_estimators=50, oob_score = True),
                  tuned_parameters,
                  cv=5,
                  scoring = 'roc_auc')
RF.fit(x_train_u,y_train_u)
y_pred_rf = RF.predict(x_test)
print(metrics.classification_report(y_test,y_pred_rf))

#### Ensemble Technique

Let us now try ensemble technique to see if the model can be further improved.

In [None]:
from imblearn.ensemble import EasyEnsembleClassifier

In [None]:
ensem=EasyEnsembleClassifier()

In [None]:
ensem.fit(x_train_u, y_train_u)
y_pred_en = ensem.predict(x_test)
print(metrics.confusion_matrix(y_test, y_pred_en))
print(metrics.accuracy_score(y_test,y_pred_en))
print(metrics.classification_report(y_test,y_pred_en))

After cross-validation, we can see that the Logistic Regression model has the best scores.
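When comparing models on imbalanced data, a single threshold-free number per model can make the comparison easier to read than several classification reports. The sketch below is one way to do this, on synthetic data (the data, the `candidates` dictionary and the settings are assumptions, not the models trained above): it scores each model by ROC AUC computed from its predicted probabilities on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the encoded features
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=8, weights=[0.2, 0.8], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, stratify=y_toy, random_state=42
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

auc_scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    # Probability of the positive class (label 1) for each test row
    proba_positive = model.predict_proba(X_te)[:, 1]
    auc_scores[name] = roc_auc_score(y_te, proba_positive)

print(auc_scores)
```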

In [None]:
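# Coefficients of the first logistic regression model, used as a rough importance measure
importance = log_reg.coef_[0]
# summarize feature importance, labelling each score with its column name
for name, score in zip(x_train.columns, importance):
    print('Feature: %s, Score: %.5f' % (name, score))
# plot feature importance
plt.figure(figsize=(20,10))
plt.bar(range(len(importance)), importance)
plt.show()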