# HR Analytics: Job Change of Data Scientists

## Index
- [1. Import libraries and download data](#section1)
- [2. EDA](#section2)
- [3. Data Engineering](#section3)
- [4. Cleaning Data](#section4)
- [5. Modelling](#section5)


## 1. Import libraries and download data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

import re

import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
path = '/kaggle/input/hr-analytics-job-change-of-data-scientists/'
train = pd.read_csv(path + 'aug_train.csv')
test = pd.read_csv(path + 'aug_test.csv')
sample_submission = pd.read_csv(path + 'sample_submission.csv')

## 2. EDA

We are going to analyse the data.

### 2.1. Structure

Let's see what shape the data has, what type of features there are and whether they contain null values for each data set, train and test.

#### 2.1.1. Train

- Shape & dataframe's head

In [None]:
print('Train shape:', train.shape)
train.head()

- Type of features

In [None]:
train.info()

- Null values

In [None]:
df_null = pd.DataFrame(train.isnull().sum())
df_null = df_null.rename(columns={0:'Number of null values'})
df_null['Percentage null values'] = round(train.isnull().sum() / len(train) * 100, 2)
df_null

#### 2.1.2. Test

- Shape & dataframe's head

In [None]:
print('Test shape:', test.shape)
test.head()

- Type of features

In [None]:
test.info()

- Null values

In [None]:

df_null = pd.DataFrame(test.isnull().sum())
df_null = df_null.rename(columns={0:'Number of null values'})
df_null['Percentage null values'] = round(test.isnull().sum() / len(test) * 100, 2)
df_null

As expected, the test set has one feature fewer than the train set: the target. Of the 13 features they share, 3 are numerical and 10 are categorical. Both sets have missing values in the same features, in similar percentages.
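
The null-value tables for both sets can also be built with one small helper and compared side by side. This is a minimal sketch: `null_summary` is a hypothetical helper name, and the toy frames below are made-up stand-ins for `train` and `test`.

```python
import pandas as pd

def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    n_null = df.isnull().sum()
    return pd.DataFrame({
        'Number of null values': n_null,
        'Percentage null values': (n_null / len(df) * 100).round(2),
    })

# Toy frames standing in for train and test (values are made up).
toy_train = pd.DataFrame({'gender': ['Male', None, 'Female', None],
                          'training_hours': [10, 20, 30, 40]})
toy_test = pd.DataFrame({'gender': ['Male', 'Male', None, 'Other'],
                         'training_hours': [5, 15, 25, 35]})

# Put the missing-value percentages of both sets in one table.
comparison = pd.concat(
    {'train': null_summary(toy_train)['Percentage null values'],
     'test': null_summary(toy_test)['Percentage null values']},
    axis=1)
print(comparison)
```

With the real data, the same call would be `null_summary(train)` and `null_summary(test)`.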

### 2.2 Features

Let's study the features from both sets, and how the target is represented in them.

#### 2.2.1. Target

Note that the target indicates whether the person in the sample is looking for a job change (=1) or not (=0). As the plot below shows, the target is binary and imbalanced: far fewer people are looking for a job change than are not.

In [None]:
fig, axes = plt.subplots(nrows=1,ncols=1, figsize=(16,4))
aux = train['target'].value_counts().to_frame()
plt.title('Frequency of Target')
aux.plot.bar(ax = axes)
plt.show()
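
The imbalance can also be expressed as numbers rather than read from the bar chart. The sketch below uses a made-up toy target, so the proportions shown are not those of the real data.

```python
import pandas as pd

# Toy target column (made up): three "not looking" for one "looking".
toy_target = pd.Series([0.0, 0.0, 0.0, 1.0], name='target')

# Share of each class, and how many times larger the majority class is.
proportions = toy_target.value_counts(normalize=True)
imbalance_ratio = proportions.max() / proportions.min()

print(proportions)
print('Majority/minority ratio:', imbalance_ratio)
```

On the notebook's data, the equivalent call would be `train['target'].value_counts(normalize=True)`.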

#### 2.2.2. City

The city feature is grouped, and we only show cities that appear more than 50 times in the train set plot and more than 10 times in the test set plot. Both plots have a similar shape, and most of the cities are the same. In addition, the train plot is split by target, and the proportion of target values is not the same in every city.

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(16,8))
plt.subplots_adjust(hspace = 0.45)
# Train
axes[0].title.set_text('Frequency of the cities (represented >50)- Train set')
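# Filter on the counts Series itself: the column name produced by
# value_counts().to_frame() differs between pandas versions.
city_counts = train['city'].value_counts()
inde_50 = list(city_counts[city_counts > 50].index)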
df_city_target = train.groupby('city')['target'].value_counts().to_frame().unstack()
df_city_target[df_city_target.index.isin(inde_50)].reindex(inde_50).plot.bar(ax = axes[0], stacked = True)
axes[0].legend(['Not job change', 'Job change'])


# Test
axes[1].title.set_text('Frequency of the cities (represented >10)- Test set')
# Same version-independent filtering as for the train set
city_counts_test = test['city'].value_counts()
city_counts_test[city_counts_test > 10].to_frame().plot.bar(ax = axes[1])

plt.show()

#### 2.2.3. City development index

This feature takes values between 0 and 1 (it is scaled), so we use a histogram and a density function to plot it. In the train set, we distinguish between the total, people not looking for a job change, and people looking for a job change. These three plots and the test plot have a similar shape; what changes slightly is how the data concentrates around the two peaks near 0.6 and 0.9.

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(16,8))
plt.subplots_adjust(hspace = 0.45)

train_not_looking = train[train['target']==0].city_development_index
train_looking = train[train['target']==1].city_development_index
train_total = train.city_development_index

# train
sns.distplot(train_not_looking, ax=axes[0] )
sns.distplot(train_looking, ax = axes[0])
sns.distplot(train_total, ax = axes[0] ).set_title('city_development_index histogram and density function - Train set')
axes[0].legend(['Not job change', 'Job change','Total'])


#test
sns.distplot(test.city_development_index, ax = axes[1]).set_title('city_development_index histogram and density function - Test set')
plt.show()

#### 2.2.4. Gender

The gender plots have the same shape for the train and test sets. Most people in both sets are men. Although the group sizes differ considerably, the percentages of people looking and not looking for a job change are fairly homogeneous across genders.

In [None]:
df_gender_target = train.groupby('gender')['target'].value_counts().to_frame().unstack()
df = pd.DataFrame(df_gender_target.sum(axis=1))
df = df.rename(columns={0:'Total'})
df['Percen 0 gender'] = np.round((df_gender_target.target[[0.0]][0.0].values/df.Total.values) * 100,2)
df['Percen 1 gender'] = np.round((df_gender_target.target[[1.0]][1.0].values/df.Total.values) * 100,2)
df
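
The same per-category percentages can be obtained more compactly with `pd.crosstab` and row normalisation. This is a sketch on a made-up toy frame, not a replacement for the table above.

```python
import pandas as pd

# Toy data (made up) with the same two columns used above.
toy = pd.DataFrame({
    'gender': ['Male', 'Male', 'Male', 'Female', 'Female', 'Other'],
    'target': [0.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})

# normalize='index' divides each row by its total, so every row sums to 1.
pct = (pd.crosstab(toy['gender'], toy['target'], normalize='index') * 100).round(2)
print(pct)
```

The same pattern applies to any categorical feature by swapping the first argument, for example `train['relevent_experience']`.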

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(16,10))
plt.subplots_adjust(hspace = 0.4)
# Train
df_gender_target = train.groupby('gender')['target'].value_counts().to_frame().unstack()
df_gender_target.reindex(['Male', 'Female', 'Other']).plot.bar(ax = axes[0], stacked = True)
axes[0].title.set_text('Frequency of Gender- Train set')
axes[0].legend(['Not job change', 'Job change'])
# Test
axes[1].title.set_text('Frequency of Gender- Test set')
aux = test['gender'].value_counts().to_frame()
aux.plot.bar(ax = axes[1])

plt.show()

#### 2.2.5. Professional Experience

##### 2.2.5.1. Relevent_experience


For the train and test sets, relevent_experience has the same shape. Most people in these sets have relevant experience. However, the percentage of people looking for a job change is higher among those who do not have relevant experience.

In [None]:
df_rel_exp_target = train.groupby('relevent_experience')['target'].value_counts().to_frame().unstack()
df = pd.DataFrame(df_rel_exp_target.sum(axis=1))
df = df.rename(columns={0:'Total'})
# Percentage of target 0 and target 1 within each relevent_experience category
df['Percen 0 relevent_experience'] = np.round((df_rel_exp_target.target[[0.0]][0.0].values/df.Total.values) * 100,2)
df['Percen 1 relevent_experience'] = np.round((df_rel_exp_target.target[[1.0]][1.0].values/df.Total.values) * 100,2)
df

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(16,10))
plt.subplots_adjust(hspace = 0.7)
# Train
df_rel_exp_target = train.groupby('relevent_experience')['target'].value_counts().to_frame().unstack()
df_rel_exp_target.plot.bar(ax = axes[0], stacked = True,)
axes[0].set_xticklabels(list(df_rel_exp_target.index.values),rotation=25, ha='right')
axes[0].title.set_text('Frequency of relevent_experience- Train set')
axes[0].legend(['Not job change', 'Job change'])
# Test
axes[1].title.set_text('Frequency of relevent_experience- Test set')
aux = test['relevent_experience'].value_counts().to_frame()
aux.plot.bar(ax = axes[1])
axes[1].set_xticklabels(list(aux.index.values),rotation=25, ha='right')
plt.show()

##### 2.2.5.2. Experience

The experience plot has the same shape for the train and test data. Moreover, the proportion of people looking or not looking for a job change varies across the different levels of experience.

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(16,8))
plt.subplots_adjust(hspace = 0.5)
# Train
df_experience_target = train.groupby('experience')['target'].value_counts().to_frame().unstack()
# Experience levels in ascending order
ind = ['<1','1','2','3', '4', '5', '6', '7', '8', '9','10', '11', '12', '13', '14',
     '15', '16', '17', '18', '19', '20', '>20']
df_experience_target.reindex(index=ind).plot.bar(ax = axes[0], stacked = True)
axes[0].set_xticklabels(ind,rotation=0, ha='right')
axes[0].title.set_text('Frequency of experience- Train set')
axes[0].legend(['Not job change', 'Job change'])
# Test
axes[1].title.set_text('Frequency of experience- Test set')
aux = test['experience'].value_counts().to_frame()
aux = aux.reindex(index=ind)
aux.plot.bar(ax = axes[1])
axes[1].set_xticklabels(ind,rotation=0, ha='right')
plt.show()

##### 2.2.5.3. Company_size

Both plots have a similar distribution across company size, and the proportion of people looking or not looking for a job change does not seem to differ much between company sizes.

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(16,8))
plt.subplots_adjust(hspace = 0.5)
# Train
df_size_comp_target = train.groupby('company_size')['target'].value_counts().to_frame().unstack()
ind = ['<10','10/49','50-99', '100-500', '500-999', '1000-4999', '5000-9999', '10000+']
df_size_comp_target.reindex(index=ind).plot.bar(ax = axes[0], stacked = True)
axes[0].set_xticklabels(ind,rotation=25, ha='right')
axes[0].title.set_text('Frequency of company_size - Train set')
axes[0].legend(['Not job change', 'Job change'])
# Test
axes[1].title.set_text('Frequency of company_size - Test set')
aux = test['company_size'].value_counts().to_frame()
aux = aux.reindex(index=ind)
aux.plot.bar(ax = axes[1])
axes[1].set_xticklabels(ind,rotation=25, ha='right')
plt.show()

##### 2.2.5.4. Company_type 

Both graphs have the same shape. The proportion of people looking or not looking for a job change differs across company types.

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(16,8))
plt.subplots_adjust(hspace = 0.5)
# Train
df_type_comp_target = train.groupby('company_type')['target'].value_counts().to_frame().unstack()
ind =['Pvt Ltd','Public Sector','Funded Startup','Early Stage Startup', 'NGO', 'Other']
df_type_comp_target.reindex(ind).plot.bar(ax = axes[0], stacked=True)
axes[0].set_xticklabels(ind,rotation=25, ha='right')
axes[0].title.set_text('Frequency of company_type - Train set')
axes[0].legend(['Not job change', 'Job change'])
# Test
axes[1].title.set_text('Frequency of company_type - Test set')
aux = test['company_type'].value_counts().to_frame()
# Reorder the bars so they match the tick labels set from ind below
aux = aux.reindex(index=ind)
aux.plot.bar(ax = axes[1])
axes[1].set_xticklabels(ind,rotation=25, ha='right')
plt.show()

##### 2.2.5.5. Last_new_job

The two plots of last_new_job have a similar shape, with 1 as the most frequent value. In the first plot, the proportion of target values differs across the values of last_new_job.

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(16,8))
plt.subplots_adjust(hspace = 0.5)
# Train
df_last_new_job_target = train.groupby('last_new_job')['target'].value_counts().to_frame().unstack()
ind = ['never','1','2','3', '4', '>4']
df_last_new_job_target.reindex(index=ind).plot.bar(ax = axes[0], stacked=True)
axes[0].legend(['Not job change', 'Job change'])
axes[0].title.set_text('Frequency of last_new_job - Train set')
# Test
axes[1].title.set_text('Frequency of last_new_job - Test set')
aux = test['last_new_job'].value_counts().to_frame()
aux = aux.reindex(index=ind)
aux.plot.bar(ax = axes[1])

plt.show()

##### 2.2.5.6. Training hours 

The training hours feature is grouped into intervals to make it tidier and to reveal its structure. Both plots then have nearly the same shape. However, the proportion of people looking or not looking for a job change differs across training hours.

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(16,9))
plt.subplots_adjust(hspace = 0.6)

# Train
aux = train['training_hours'].value_counts().to_frame()
n = np.linspace(min(aux.index), max(aux.index), 70, endpoint = True,dtype = int)
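# include_lowest=True keeps the minimum value inside the first interval
train['train_hours_2'] = pd.cut(train.training_hours, n, include_lowest=True)
test['train_hours_2'] = pd.cut(test.training_hours, n, include_lowest=True)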

df_train_hours_target = train.groupby('train_hours_2')['target'].value_counts().to_frame().unstack()
df_train_hours_target.sort_index().plot.bar(ax = axes[0], stacked = True)
axes[0].title.set_text('Frequency of training_hours - Train set')
axes[0].legend(['Not job change', 'Job change'])

# Test

aux_1 = test['train_hours_2'].value_counts().to_frame()
aux_1.sort_index().plot.bar(ax = axes[1])
axes[1].title.set_text('Frequency of training_hours - Test set')
axes[1].legend(['training hours'])
plt.show()
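
One detail of `pd.cut` is worth checking when the bin edges start at the minimum of the data: intervals are right-closed by default, so the smallest value falls outside the first bin unless `include_lowest=True` is passed. The sketch below shows this on made-up hours.

```python
import numpy as np
import pandas as pd

# Toy training hours (made up) and integer edges built as in the cell above.
hours = pd.Series([1, 4, 9, 10])
edges = np.linspace(hours.min(), hours.max(), 4, endpoint=True, dtype=int)

# Default behaviour: bins are (1, 4], (4, 7], (7, 10], so the value 1 is not binned.
default_bins = pd.cut(hours, edges)

# With include_lowest=True the first interval also contains the minimum.
closed_bins = pd.cut(hours, edges, include_lowest=True)

print(default_bins.isna().sum(), closed_bins.isna().sum())
```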

#### 2.2.6. Education

##### 2.2.6.1. Enrolled_university

Both bar charts again have the same shape. Most people are not enrolled in university. However, the proportion of target values differs across the three types of enrollment. The highest proportion of people looking for a job change is in the Full time course group.

In [None]:
df_enro_uni_target = train.groupby('enrolled_university')['target'].value_counts().to_frame().unstack()
df = pd.DataFrame(df_enro_uni_target.sum(axis=1))
df = df.rename(columns={0:'Total'})
df['Percen 0 enrolled_university'] = np.round((df_enro_uni_target.target[[0.0]][0.0].values/df.Total.values) * 100,2)
df['Percen 1 enrolled_university'] = np.round((df_enro_uni_target.target[[1.0]][1.0].values/df.Total.values) * 100,2)
df

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(16,9))
plt.subplots_adjust(hspace = 0.6)

# Train
df_enro_uni_target = train.groupby('enrolled_university')['target'].value_counts().to_frame().unstack()
df_enro_uni_target.sort_index().plot.bar(ax = axes[0], stacked = True)
axes[0].title.set_text('Frequency of enrolled university - Train set')
axes[0].legend(['Not job change', 'Job change'])
axes[0].set_xticklabels(df_enro_uni_target.index.values,rotation=25, ha='right')

# Test

aux_1 = test['enrolled_university'].value_counts().to_frame()
aux_1.sort_index().plot.bar(ax = axes[1])
axes[1].title.set_text('Frequency of enrolled university - Test set')
axes[1].set_xticklabels(aux_1.index.sort_values(),rotation=25, ha='right')

plt.show()

##### 2.2.6.2. Education_level

Both bar plots have the same shape. Most people in the data have the Graduate level, and this group also has the highest proportion of people looking for a job change.

In [None]:
df_edu_level_target = train.groupby('education_level')['target'].value_counts().to_frame().unstack()
df = pd.DataFrame(df_edu_level_target.sum(axis=1))
df = df.rename(columns={0:'Total'})
df['Percen 0 education_level'] = np.round((df_edu_level_target.target[[0.0]][0.0].values/df.Total.values) * 100,2)
df['Percen 1 education_level'] = np.round((df_edu_level_target.target[[1.0]][1.0].values/df.Total.values) * 100,2)
df


In [None]:
fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(16,9))
plt.subplots_adjust(hspace = 0.6)

# Train
df_edu_level_target = train.groupby('education_level')['target'].value_counts().to_frame().unstack()
ind = ['Primary School','High School','Graduate','Masters','Phd'] 
df_edu_level_target.reindex(ind).plot.bar(ax = axes[0], stacked = True)
axes[0].title.set_text('Frequency of education level - Train set')
axes[0].legend(['Not job change', 'Job change'])
axes[0].set_xticklabels(ind,rotation=25, ha='right')

# Test

aux_1 = test['education_level'].value_counts().to_frame()
aux_1.reindex(ind).plot.bar(ax = axes[1])
axes[1].title.set_text('Frequency of education_level - Test set')
axes[1].set_xticklabels(ind,rotation=25, ha='right')

plt.show()

##### 2.2.6.3. Major_discipline

This feature also has the same shape for the train and test sets. Most people have a major discipline in STEM. The proportion of target values seems similar across the different major disciplines.

In [None]:
df_MajDisci_target = train.groupby('major_discipline')['target'].value_counts().to_frame().unstack()
df = pd.DataFrame(df_MajDisci_target.sum(axis=1))
df = df.rename(columns={0:'Total'})
df['Percen 0 major_discipline'] = np.round((df_MajDisci_target.target[[0.0]][0.0].values/df.Total.values) * 100,2)
df['Percen 1 major_discipline'] = np.round((df_MajDisci_target.target[[1.0]][1.0].values/df.Total.values) * 100,2)
df


In [None]:
fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(16,9))
plt.subplots_adjust(hspace = 0.6)

# Train
df_MajDisci_target = train.groupby('major_discipline')['target'].value_counts().to_frame().unstack()
ind=['STEM','Business Degree', 'Humanities', 'Arts', 'No Major', 'Other']
df_MajDisci_target.reindex(ind).plot.bar(ax = axes[0], stacked = True)
axes[0].title.set_text('Frequency of major discipline - Train set')
axes[0].legend(['Not job change', 'Job change'])
axes[0].set_xticklabels(ind,rotation=25, ha='right')

# Test

aux_1 = test['major_discipline'].value_counts().to_frame()
aux_1.reindex(ind).plot.bar(ax = axes[1])
axes[1].title.set_text('Frequency of major discipline - Test set')
axes[1].set_xticklabels(ind,rotation=25, ha='right')

plt.show()

## 3. Data Engineering

The data is composed of different types of features, as you can see below:
- Numerical: city_development_index and training_hours.
- Categorical:
    - Nominal: city, gender, relevent_experience, enrolled_university, major_discipline and company_type.
    - Ordinal: education_level, company_size, experience and last_new_job.

In this section, we are going to convert categorical data to numerical.

### 3.1. Nominal Features

#### 3.1.1. City

We convert the feature to a number, keeping only the numeric part of the city code.

In [None]:
def find_number(text):
    num = re.findall(r'[0-9]+',text)
    return " ".join(num)

train['city'] = train['city'].apply(lambda x: find_number(x))
train['city']= train['city'].astype(int)
test['city'] = test['city']. apply(lambda x: find_number(x))
test['city']= test['city'].astype(int)
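
A vectorised alternative to the `apply` calls is pandas' own string accessor. The sketch below assumes city codes of the form `city_<number>`; the toy values are made up for illustration.

```python
import pandas as pd

# Toy city codes (made up), assumed to follow the pattern 'city_<number>'.
toy_city = pd.Series(['city_103', 'city_40', 'city_21'])

# str.extract returns the first capture group; expand=False gives a Series.
city_num = toy_city.str.extract(r'(\d+)', expand=False).astype(int)
print(city_num.tolist())
```

Unlike `find_number`, this keeps only the first run of digits, which is equivalent when each code contains a single number.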

#### 3.1.2. Gender

We transform the feature by mapping Other to 0, Male to 1 and Female to 2.

In [None]:
train['gender']= train.gender.replace({'Male':1, 'Female':2, 'Other':0})
test['gender']= test.gender.replace({'Male':1, 'Female':2, 'Other':0})

#### 3.1.3. Relevent_experience
We transform the feature into a binary one: Has relevent experience becomes 1 and No relevent experience becomes 0.

In [None]:
train['relevent_experience']=train.relevent_experience.replace({'Has relevent experience':1, 'No relevent experience':0})
test['relevent_experience']=test.relevent_experience.replace({'Has relevent experience':1, 'No relevent experience':0})

#### 3.1.4 Major_discipline and company_type

We will apply one-hot encoding to these two features in the Modelling section.
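
As a small illustration of what one-hot encoding produces, the sketch below uses `pd.get_dummies` on a made-up toy frame. This is only an illustration: the notebook itself relies on scikit-learn's `OneHotEncoder` later on.

```python
import pandas as pd

# Toy frame (made up) with the two nominal features.
toy = pd.DataFrame({
    'major_discipline': ['STEM', 'Arts', 'STEM'],
    'company_type': ['Pvt Ltd', 'NGO', 'Pvt Ltd'],
})

# Each category becomes its own 0/1 column; no order between categories is implied.
encoded = pd.get_dummies(toy, columns=['major_discipline', 'company_type'], dtype=int)
print(encoded)
```

Every row has exactly one 1 per original feature, which is why this encoding suits nominal features better than integer codes.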

### 3.2. Ordinal Features

We assign a number to each string value according to its level.
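
Hand-written mappings are easy to get subtly wrong, so a quick sanity check before calling `replace` can help. In this sketch, `unmapped_values` and `has_unique_codes` are hypothetical helper names, and the toy mapping deliberately omits one level.

```python
import pandas as pd

def unmapped_values(series: pd.Series, mapping: dict) -> set:
    """Return the non-null values of the series that have no entry in the mapping."""
    return set(series.dropna().unique()) - set(mapping)

def has_unique_codes(mapping: dict) -> bool:
    """Check that no two categories share the same numeric code."""
    return len(set(mapping.values())) == len(mapping)

# Toy column (made up) and a mapping that forgets 'Phd' on purpose.
toy_level = pd.Series(['Graduate', 'Masters', None, 'Phd', 'High School'])
level_map = {'Primary School': 1, 'High School': 2, 'Graduate': 3, 'Masters': 4}

missing = unmapped_values(toy_level, level_map)
print('Values without a code:', missing)
print('Codes are unique:', has_unique_codes(level_map))
```

Note that a Python dict literal silently keeps only the last value for a repeated key, so a repeated key shows up as a gap in the codes rather than as an error.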

#### 3.2.1. Education_level

In [None]:
train['education_level'] = train.education_level.replace({'Primary School': 1,
                                'High School': 2,
                                'Graduate': 3,
                                'Masters': 4,
                                'Phd': 5})
test['education_level'] = test.education_level.replace({'Primary School': 1,
                                'High School': 2,
                                'Graduate': 3,
                                'Masters': 4,
                                'Phd': 5})

#### 3.2.2. Company_size

In [None]:
# Each size band gets one consecutive code (a dict key can only appear once)
train['company_size'] = train.company_size.replace({'<10':0,'10/49':1,'50-99':2, '100-500': 3,
                            '500-999':4, '1000-4999':5, '5000-9999':6,'10000+':7})
test['company_size'] = test.company_size.replace({'<10':0,'10/49':1,'50-99':2, '100-500': 3,
                            '500-999':4, '1000-4999':5, '5000-9999':6,'10000+':7})

#### 3.2.3. Experience

In [None]:
train['experience'] = train.experience.replace({'<1':0,'>20':21})
train['experience'] = train['experience'].astype(str).astype(float)
test['experience'] = test.experience.replace({'<1':0,'>20':21})
test['experience'] = test['experience'].astype(str).astype(float)
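
An alternative to the `astype(str).astype(float)` round trip is `pd.to_numeric`, which converts the strings and keeps missing values as NaN. This is a sketch on made-up values, with the two open-ended labels mapped to string codes first.

```python
import pandas as pd

# Toy experience column (made up), including a missing value.
toy_exp = pd.Series(['<1', '5', '>20', None, '12'])

# Map the open-ended labels, then convert everything to numbers in one step.
exp_num = pd.to_numeric(toy_exp.replace({'<1': '0', '>20': '21'}))
print(exp_num.tolist())
```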

#### 3.2.4. Last_new_job

In [None]:
train['last_new_job'] = train.last_new_job.replace({'never':0, '>4':5})
train['last_new_job'] = train['last_new_job'].astype(str).astype(float)
test['last_new_job'] = test.last_new_job.replace({'never':0, '>4':5})
test['last_new_job'] = test['last_new_job'].astype(str).astype(float)

## 4. Cleaning data

In this section, we are going to fill the missing values. We will use two processes: first, for the features that have been converted to numerical, we will use the KNN imputer; second, for the features that are still categorical, we will use the mode.

### 4.1. KNN imputer

In [None]:
from sklearn.impute import KNNImputer

In [None]:
col_miss = ['gender', 'education_level','experience','company_size', 'last_new_job']
train_miss_knn = train[['enrollee_id'] + col_miss]
train_no_miss_knn = train.drop(col_miss, axis=1)

test_miss_knn = test[['enrollee_id'] + col_miss]
test_no_miss_knn = test.drop(col_miss, axis=1)

In [None]:
knn = KNNImputer(n_neighbors=5)
knn.fit(train_miss_knn)
train_miss_knn = pd.DataFrame(np.round(knn.transform(train_miss_knn)),columns = train_miss_knn.columns )
test_miss_knn = pd.DataFrame(np.round(knn.transform(test_miss_knn)),columns = train_miss_knn.columns )
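
To make the behaviour of `KNNImputer` concrete, here is a minimal sketch on a made-up array. Each missing entry is replaced by the mean of that feature over the nearest rows, where distance is computed on the features both rows have.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (made up): the second row is missing its second feature.
X_toy = np.array([[1.0, 2.0],
                  [2.0, np.nan],
                  [3.0, 6.0],
                  [4.0, 8.0]])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_toy)

# The two nearest rows (by the first feature) are [1, 2] and [3, 6],
# so the gap is filled with the mean of 2 and 6.
print(X_filled[1, 1])
```

Because the result is an average, it can fall between two category codes, which is why the cell above rounds the imputed values.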

In [None]:
df_train = pd.merge(train_miss_knn, train_no_miss_knn, on='enrollee_id')
df_test = pd.merge(test_miss_knn, test_no_miss_knn, on='enrollee_id')

### 4.2. Mode


In [None]:
# Assign the result back instead of using inplace=True on a column selection
df_train['enrolled_university'] = df_train['enrolled_university'].fillna(df_train['enrolled_university'].mode()[0])
df_test['enrolled_university'] = df_test['enrolled_university'].fillna(df_test['enrolled_university'].mode()[0])

df_train['major_discipline'] = df_train['major_discipline'].fillna(df_train['major_discipline'].mode()[0])
df_test['major_discipline'] = df_test['major_discipline'].fillna(df_test['major_discipline'].mode()[0])

# Fill company_type with its own mode
df_train['company_type'] = df_train['company_type'].fillna(df_train['company_type'].mode()[0])
df_test['company_type'] = df_test['company_type'].fillna(df_test['company_type'].mode()[0])
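
A common variant of mode imputation computes the fill value on the train set only and reuses it for the test set, so that no information from the test set shapes the preprocessing. The sketch below shows that variant on made-up frames; it is an assumption for illustration and differs from the cell above, which uses each set's own mode.

```python
import pandas as pd

# Toy frames (made up) standing in for df_train and df_test.
toy_train = pd.DataFrame({'company_type': ['Pvt Ltd', 'Pvt Ltd', None, 'NGO']})
toy_test = pd.DataFrame({'company_type': [None, 'NGO']})

for col in ['company_type']:
    # The most frequent value is learned from the train set only.
    fill_value = toy_train[col].mode()[0]
    toy_train[col] = toy_train[col].fillna(fill_value)
    toy_test[col] = toy_test[col].fillna(fill_value)

print(toy_test['company_type'].tolist())
```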

## 5. Modelling

Our problem is a binary classification, and we are going to use a neural network built with Keras as the model and use it to make the predictions.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.metrics import roc_curve, auc

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping

from imblearn.over_sampling import SMOTE

In [None]:
features_num = ['gender', 'education_level', 'experience', 'company_size', 'last_new_job', 'city',
                'city_development_index','relevent_experience', 'training_hours']
features_cat = ['enrolled_university', 'major_discipline','company_type']

When we plotted the target, we saw that the values looking and not looking for a job change are imbalanced. We will therefore use SMOTE to create a more balanced sample.

In [None]:
df_train_X = df_train[features_num + features_cat]

preprocessor = make_column_transformer(
                (StandardScaler(), features_num),
                (OneHotEncoder(), features_cat))

X = preprocessor.fit_transform(df_train_X)
Y = df_train[['target']]
smote = SMOTE(random_state = 550)
X_smote, Y_smote = smote.fit_resample(X,Y)

smote = SMOTE(random_state = 450)
X_smote1, Y_smote1 = smote.fit_resample(X,Y)


df_train_X = pd.concat([pd.DataFrame(X_smote), pd.DataFrame(X_smote1)], axis = 0).reset_index(drop = True)
df_train_y = pd.concat([Y_smote, Y_smote1], axis = 0).reset_index(drop = True)
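
For intuition about what SMOTE adds to the data, the sketch below shows only its core idea: a synthetic minority sample is placed at a random position on the segment between a minority point and one of its minority neighbours. This is a simplified illustration on made-up points, not imblearn's implementation, which also performs the neighbour search and decides how many samples to create. `synthetic_point` is a hypothetical helper name.

```python
import numpy as np

def synthetic_point(x, neighbor, rng):
    """Interpolate between a minority sample and one of its minority neighbours."""
    gap = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + gap * (neighbor - x)

rng = np.random.default_rng(0)

# Toy minority-class samples (made up).
minority = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [2.0, 0.0]])

# A new sample somewhere between the first two minority points.
new_point = synthetic_point(minority[0], minority[1], rng)
print(new_point)
```

Since synthetic samples are built from existing rows, resampling before splitting can place closely related points in both the train and validation sets, which is worth keeping in mind when reading the validation scores.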


In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(df_train_X, df_train_y, test_size=0.3, random_state = 540)

In [None]:
Input_nodes = [X_valid.shape[1]]

In [None]:
model=keras.Sequential([
        layers.Dense(512, activation = 'relu', input_shape = Input_nodes), 
        layers.Dropout(0.3),
        layers.BatchNormalization(),
        layers.Dense(512, activation = 'relu'),
        layers.Dropout(0.3),
        layers.BatchNormalization(),
        layers.Dense(1, activation = 'sigmoid'),
])

In [None]:
model.compile(
            loss='binary_crossentropy',
            optimizer='adam',
            metrics=[tf.keras.metrics.AUC()],
)


In [None]:
early_stopping = keras.callbacks.EarlyStopping(
                                    patience = 10,
                                    min_delta = 0.001,
                                    restore_best_weights= True)

In [None]:
history = model.fit(
            X_train, y_train,
            validation_data = (X_valid, y_valid),
            batch_size = 128,
            epochs = 70,
            callbacks = [early_stopping],
            verbose = 1,
)

In [None]:
history_df = pd.DataFrame(history.history)

In [None]:
fig, axes = plt.subplots(nrows=2,ncols=1, figsize=(10,8))
history_df.loc[:, ['loss', 'val_loss']].plot(ax = axes[0])
history_df.loc[:, ['auc', 'val_auc']].plot(ax = axes[1])
axes[0].set_xlabel('epochs')
axes[1].set_xlabel('epochs')
plt.show()

In [None]:
#test
df_test = df_test[features_num + features_cat]
X_test = preprocessor.transform(df_test)


In [None]:
test_preds = model.predict(X_test)
sample_submission['target'] = [ 1 if i>=0.5 else 0 for i in test_preds]
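
The thresholding step can also be written in vectorised form. The sketch below assumes the predictions arrive as an array of shape `(n, 1)`, which is the usual output for a single sigmoid unit; the toy probabilities are made up.

```python
import numpy as np

# Toy predicted probabilities (made up), one column as from a sigmoid output layer.
toy_preds = np.array([[0.12], [0.50], [0.87]])

# Flatten to one dimension, compare with the 0.5 threshold, and cast to 0/1.
labels = (toy_preds.ravel() >= 0.5).astype(int)
print(labels.tolist())
```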

In [None]:
sample_submission.to_csv('submission.csv', index=False)
sample_submission.head()

## Reference 

https://www.kaggle.com/nkitgupta/who-will-leave-a-job-test-auc-0-93