# Problem statement

A training institute that runs analytics and data science courses wants to expand its business into manpower recruitment (data science only).

The company gets a large number of signups for its trainings and now wants to connect these enrollees with clients who are looking to hire in the same domain. Before that, it needs to know which of these candidates are really looking for new employment. It has enrollee information covering demographics, education, experience, and training-related features.

To understand the factors that lead a person to look for a job change, the agency wants you to design a model that uses an enrollee's current credentials, demographics, and experience to predict the probability that the enrollee is looking for a new job.

![](https://datahack-prod.s3.ap-south-1.amazonaws.com/__sized__/contest_cover/JanataHack-HR-Analytics-thumbnail-1200x1200.png)

In [1]:
## import necessary libraries.

import numpy as np ## NumPy (arrays and conversion of data frames to arrays).
import pandas as pd ## pandas (loading data, creating data frames, etc.).
import os ## For getting and setting the working directory used to read/write files.
from sklearn.model_selection import train_test_split ## For splitting data into train and validation.
from sklearn.preprocessing import LabelEncoder ## For label encoding(converting categorical values to label).
from xgboost import XGBClassifier ## XG boost model.
from sklearn.model_selection import GridSearchCV ## For Grid search(cross validation).
from sklearn.metrics import accuracy_score ## For getting accuracy value.
from sklearn.metrics import confusion_matrix ## For getting confusion matrix.
from sklearn.metrics import classification_report ## For per-class precision, recall and F1-score.
from sklearn.naive_bayes import GaussianNB ## Naive Bayes Model.
from sklearn.neighbors import KNeighborsClassifier ## KNN Model.
from sklearn.ensemble import RandomForestClassifier ## Random Forest  Model.
from sklearn.ensemble import BaggingClassifier ## Bagging Model.
from sklearn.ensemble import AdaBoostClassifier ## AdaBoost Model.
from sklearn.ensemble import GradientBoostingClassifier ## GradientBoost Model.
from sklearn.svm import SVC ## SVC Model.
from sklearn.impute import SimpleImputer ## For imputing NA values.

In [2]:
## Get current working directory.
os.getcwd()

'D:\\Python\\Pratice'

In [3]:
## Set working directory.
os.chdir(r"D:\DataScience\Pratice\HR Analytics")
os.getcwd()

'D:\\DataScience\\Pratice\\HR Analytics'

In [4]:
## Read data files.
train = pd.read_csv("train.csv",header='infer',sep=',')
test = pd.read_csv("test.csv",header='infer',sep=',')

In [5]:
## Check first 5 records of train data.
train.head()

,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,23798,city_149,0.689,Male,Has relevent experience,no_enrollment,Graduate,STEM,3,100-500,Pvt Ltd,1,106,0
1,29166,city_83,0.923,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,<10,Funded Startup,1,69,0
2,46,city_16,0.91,,Has relevent experience,no_enrollment,Graduate,STEM,6,50-99,Public Sector,2,4,0
3,18527,city_64,0.666,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,50-99,Pvt Ltd,1,26,0
4,21751,city_100,0.887,,No relevent experience,no_enrollment,Masters,STEM,8,,,2,88,1


In [7]:
## Check last 5 records of train data.
train.tail()

,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
18354,25366,city_103,0.92,Male,Has relevent experience,Full time course,Graduate,STEM,5,<10,Pvt Ltd,1,71,0
18355,25545,city_160,0.92,Male,No relevent experience,no_enrollment,Graduate,Humanities,15,50-99,Pvt Ltd,1,160,0
18356,11514,city_114,0.926,Male,Has relevent experience,no_enrollment,Masters,STEM,11,50-99,Pvt Ltd,3,18,0
18357,1689,city_75,0.939,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,10/49,Pvt Ltd,3,41,0
18358,5995,city_105,0.794,Female,Has relevent experience,no_enrollment,Graduate,STEM,>20,100-500,Pvt Ltd,2,84,0


In [6]:
## Check first 5 records of test data.
test.head()

,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours
0,16548,city_33,0.448,,No relevent experience,Full time course,Graduate,STEM,<1,1000-4999,Public Sector,,15
1,12036,city_28,0.939,Male,No relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,1.0,94
2,11061,city_103,0.92,Male,No relevent experience,Full time course,Graduate,STEM,3,,,1.0,17
3,5032,city_104,0.924,Male,No relevent experience,no_enrollment,Phd,STEM,>20,50-99,Pvt Ltd,2.0,76
4,17599,city_77,0.83,Male,Has relevent experience,no_enrollment,Graduate,STEM,6,<10,Pvt Ltd,2.0,65


In [8]:
## Check last 5 records of test data.
test.tail()

,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours
15016,11308,city_46,0.762,,Has relevent experience,no_enrollment,Masters,STEM,>20,500-999,Pvt Ltd,>4,68
15017,14612,city_21,0.624,Male,Has relevent experience,Full time course,Masters,STEM,4,1000-4999,Pvt Ltd,1,320
15018,33346,city_16,0.91,Male,Has relevent experience,no_enrollment,High School,,9,1000-4999,Pvt Ltd,4,13
15019,14506,city_64,0.666,,No relevent experience,Full time course,Graduate,STEM,5,,,1,38
15020,32641,city_21,0.624,Male,No relevent experience,no_enrollment,Graduate,,3,,,,100


In [9]:
## Check dimensions of train data.
train.shape

(18359, 14)

In [10]:
## Check dimensions of test data.
test.shape

(15021, 13)

In [11]:
## Check summary statistics of train data.
train.describe(include='all')

,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
count,18359.0,18359,18359.0,14261,18359,18017,17902,15521,18300,13580,13320,17992.0,18359.0,18359.0
unique,,123,,3,2,3,5,6,22,8,6,6.0,,
top,,city_103,,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,1.0,,
freq,,4358,,12884,13596,13659,10769,13738,3437,3120,10051,7567.0,,
mean,16729.360096,,0.84714,,,,,,,,,,65.899014,0.132088
std,9643.749725,,0.110189,,,,,,,,,,60.8853,0.338595
min,1.0,,0.448,,,,,,,,,,1.0,0.0
25%,8378.5,,0.796,,,,,,,,,,23.0,0.0
50%,16706.0,,0.91,,,,,,,,,,47.0,0.0
75%,25148.5,,0.92,,,,,,,,,,89.0,0.0


In [12]:
## Check summary statistics of test data.
test.describe(include='all')

,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours
count,15021.0,15021,15021.0,11633,15021,14742,14626,12628,14977,10970,10691,14717.0,15021.0
unique,,123,,3,2,3,5,6,22,8,6,6.0,
top,,city_103,,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,1.0,
freq,,3494,,10578,11102,11228,8743,11117,2713,2577,8063,6246.0,
mean,16643.004327,,0.846683,,,,,,,,,,65.158179
std,9626.895233,,0.109709,,,,,,,,,,59.719211
min,6.0,,0.448,,,,,,,,,,1.0
25%,8316.0,,0.794,,,,,,,,,,23.0
50%,16664.0,,0.91,,,,,,,,,,47.0
75%,24908.0,,0.92,,,,,,,,,,89.0


In [13]:
## Check column data types of train data.
train.dtypes

enrollee_id                 int64
city                       object
city_development_index    float64
gender                     object
relevent_experience        object
enrolled_university        object
education_level            object
major_discipline           object
experience                 object
company_size               object
company_type               object
last_new_job               object
training_hours              int64
target                      int64
dtype: object

In [14]:
## Check column data types of test data.
test.dtypes

enrollee_id                 int64
city                       object
city_development_index    float64
gender                     object
relevent_experience        object
enrolled_university        object
education_level            object
major_discipline           object
experience                 object
company_size               object
company_type               object
last_new_job               object
training_hours              int64
dtype: object

In [15]:
## Get train data column names.
train.columns

Index(['enrollee_id', 'city', 'city_development_index', 'gender',
       'relevent_experience', 'enrolled_university', 'education_level',
       'major_discipline', 'experience', 'company_size', 'company_type',
       'last_new_job', 'training_hours', 'target'],
      dtype='object')

In [16]:
## Get test data column names.
test.columns

Index(['enrollee_id', 'city', 'city_development_index', 'gender',
       'relevent_experience', 'enrolled_university', 'education_level',
       'major_discipline', 'experience', 'company_size', 'company_type',
       'last_new_job', 'training_hours'],
      dtype='object')

In [17]:
## Get index range of train data.
train.index

RangeIndex(start=0, stop=18359, step=1)

In [18]:
## Get index range of test data.
test.index

RangeIndex(start=0, stop=15021, step=1)

In [19]:
## Set index.
train.set_index('enrollee_id',inplace=True)

In [20]:
## Check first record of train data after setting index.
train.head(1)

enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
23798,city_149,0.689,Male,Has relevent experience,no_enrollment,Graduate,STEM,3,100-500,Pvt Ltd,1,106,0


In [21]:
## Set index.
test.set_index('enrollee_id',inplace=True)

In [22]:
## Check first record of test data after setting index.
test.head(1)

enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours
16548,city_33,0.448,,No relevent experience,Full time course,Graduate,STEM,<1,1000-4999,Public Sector,,15


In [23]:
## Get object data type column names from train data.
obj_col = train.select_dtypes('object').columns
obj_col

Index(['city', 'gender', 'relevent_experience', 'enrolled_university',
       'education_level', 'major_discipline', 'experience', 'company_size',
       'company_type', 'last_new_job'],
      dtype='object')

In [24]:
## Data type conversion(Convert object data type to category).
def dtypesConversion(df):
    for col in obj_col:
        df[col] = df[col].astype('category')

In [25]:
## Convert object data type to category for train data.
dtypesConversion(train)

In [26]:
## Check train data columns data types after conversion.
train.dtypes

city                      category
city_development_index     float64
gender                    category
relevent_experience       category
enrolled_university       category
education_level           category
major_discipline          category
experience                category
company_size              category
company_type              category
last_new_job              category
training_hours               int64
target                       int64
dtype: object

In [27]:
## Convert object data type to category for test data.
dtypesConversion(test)

In [28]:
## Check test data columns data types after conversion.
test.dtypes

city                      category
city_development_index     float64
gender                    category
relevent_experience       category
enrolled_university       category
education_level           category
major_discipline          category
experience                category
company_size              category
company_type              category
last_new_job              category
training_hours               int64
dtype: object

In [29]:
## Check null values for train data.
train.isna().sum()

city                         0
city_development_index       0
gender                    4098
relevent_experience          0
enrolled_university        342
education_level            457
major_discipline          2838
experience                  59
company_size              4779
company_type              5039
last_new_job               367
training_hours               0
target                       0
dtype: int64

In [30]:
## Check null values for test data.
test.isna().sum()

city                         0
city_development_index       0
gender                    3388
relevent_experience          0
enrolled_university        279
education_level            395
major_discipline          2393
experience                  44
company_size              4051
company_type              4330
last_new_job               304
training_hours               0
dtype: int64
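
Raw counts are easier to compare as a share of rows: in the train data, `gender` is missing in roughly 22% of rows (4098 of 18359) and `company_type` in roughly 27% (5039 of 18359). A minimal sketch of such a summary is shown below. The frame `toy` and the name `missing_summary` are hypothetical and only stand in for `train` or `test`.

```python
import numpy as np
import pandas as pd

## Hypothetical toy frame standing in for train/test.
toy = pd.DataFrame({
    "gender": ["Male", np.nan, "Female", np.nan],
    "company_size": ["50-99", np.nan, np.nan, np.nan],
    "training_hours": [106, 69, 4, 26],
})

## Count and percentage of missing values per column, worst column first.
missing_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)

print(missing_summary)
```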

In [31]:
## Split data into train and validation(80:20 ratio).
X_train,X_test,y_train,y_test = train_test_split(train.drop('target',axis=1),train['target'],test_size=0.2,random_state=1234)
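
The summary statistics above show a `target` mean of about 0.132, so the positive class is a minority. A plain random split can leave slightly different class ratios in train and validation. As a hedged sketch (not what this notebook does), `train_test_split` accepts a `stratify` argument that keeps the ratio the same in both parts. The toy data and the names `toy_X`, `toy_y`, `X_tr`, `X_val`, `y_tr`, `y_val` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## Hypothetical toy data with about 13% positives, similar to the target mean above.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"training_hours": rng.integers(1, 300, size=1000)})
toy_y = pd.Series([1] * 130 + [0] * 870, name="target")

## stratify=toy_y keeps the class ratio equal in train and validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1234, stratify=toy_y
)

print(y_tr.mean(), y_val.mean())
```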

In [32]:
## Check first 5 records of train data.
X_train.head()

enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours
18323,city_103,0.92,Male,No relevent experience,no_enrollment,High School,,5,,,never,105
21072,city_103,0.92,,Has relevent experience,no_enrollment,Graduate,STEM,2,50-99,Pvt Ltd,1,124
17061,city_115,0.789,Male,Has relevent experience,no_enrollment,Graduate,STEM,4,10000+,Pvt Ltd,2,36
28418,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,33
325,city_36,0.893,Male,No relevent experience,Full time course,Graduate,STEM,10,,,never,29


In [33]:
## Check first 5 records of train target data.
y_train.head()

enrollee_id
18323    0
21072    0
17061    0
28418    0
325      0
Name: target, dtype: int64

In [34]:
## Check first 5 records of validation data.
X_test.head()

enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours
6719,city_75,0.939,Male,No relevent experience,no_enrollment,Primary School,,4,,,never,27
16933,city_16,0.91,Male,Has relevent experience,no_enrollment,Graduate,STEM,6,50-99,Pvt Ltd,2,184
18198,city_16,0.91,Male,Has relevent experience,no_enrollment,Graduate,STEM,19,50-99,Pvt Ltd,1,116
22562,city_67,0.855,Male,Has relevent experience,no_enrollment,Phd,STEM,>20,<10,Pvt Ltd,>4,92
19193,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,100-500,Pvt Ltd,2,70


In [35]:
## Check first 5 records of validation target data.
y_test.head()

enrollee_id
6719     0
16933    0
18198    0
22562    0
19193    0
Name: target, dtype: int64

In [36]:
## Check train data dimensions.
X_train.shape

(14687, 12)

In [37]:
## Check dimensions of validation data.
X_test.shape

(3672, 12)

In [38]:
## Check shape of train data.
train.shape

(18359, 13)

In [39]:
## Get category data type columns from train and validation data.

catcols_train = X_train.select_dtypes('category')
catcols_test = X_test.select_dtypes('category')

In [40]:
## Get category data type columns from test data.
test_catcols = test.select_dtypes('category')

In [41]:
## Get train data category column names.
catcols_train.columns

Index(['city', 'gender', 'relevent_experience', 'enrolled_university',
       'education_level', 'major_discipline', 'experience', 'company_size',
       'company_type', 'last_new_job'],
      dtype='object')

In [43]:
## Instantiate an imputer to fill NAs in category columns with the mode.
cat_imputer = SimpleImputer(strategy = 'most_frequent')

In [44]:
## Fit imputer.
cat_imputer.fit(catcols_train)

## Impute train data category columns with mode and prepare a data frame.
X_train_imp_cat = cat_imputer.transform(catcols_train)
X_train_imp_cat = pd.DataFrame(X_train_imp_cat,columns=catcols_train.columns,index=catcols_train.index)

In [45]:
## Impute validation data category columns with mode and prepare a data frame.
X_test_imp_cat = cat_imputer.transform(catcols_test)
X_test_imp_cat = pd.DataFrame(X_test_imp_cat,columns=catcols_test.columns,index=catcols_test.index)

In [46]:
## Impute test data category columns with mode and prepare a data frame.
test_imp_cat = cat_imputer.transform(test_catcols)
test_imp_cat = pd.DataFrame(test_imp_cat,columns=test_catcols.columns,index=test_catcols.index)
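
Mode imputation fills every gap with the most common value, so for a column such as `company_type` the information that a value was absent is lost. A possible alternative, sketched below under assumptions, is `SimpleImputer(strategy='constant')`, which writes an explicit label instead. The toy frames, the variable names, and the label `"Missing"` are hypothetical; whether this helps the model would need to be checked on the validation data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

## Hypothetical toy frames standing in for catcols_train and catcols_test.
toy_train = pd.DataFrame({
    "gender": ["Male", np.nan, "Female", "Male"],
    "company_type": ["Pvt Ltd", "Pvt Ltd", np.nan, "NGO"],
}, dtype=object)
toy_valid = pd.DataFrame({
    "gender": [np.nan, "Female"],
    "company_type": [np.nan, "Pvt Ltd"],
}, dtype=object)

## Mode imputation as in the notebook: fit on train only, reuse on validation.
mode_imputer = SimpleImputer(strategy="most_frequent").fit(toy_train)
valid_mode = pd.DataFrame(mode_imputer.transform(toy_valid),
                          columns=toy_valid.columns, index=toy_valid.index)

## Alternative: keep missingness visible as its own category.
flag_imputer = SimpleImputer(strategy="constant", fill_value="Missing").fit(toy_train)
valid_flag = pd.DataFrame(flag_imputer.transform(toy_valid),
                          columns=toy_valid.columns, index=toy_valid.index)

print(valid_mode)
print(valid_flag)
```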

In [47]:
## Check NA for train data after imputation.
X_train_imp_cat.isna().sum()

city                   0
gender                 0
relevent_experience    0
enrolled_university    0
education_level        0
major_discipline       0
experience             0
company_size           0
company_type           0
last_new_job           0
dtype: int64

In [48]:
## Check NA for validation after imputation.
X_test_imp_cat.isna().sum()

city                   0
gender                 0
relevent_experience    0
enrolled_university    0
education_level        0
major_discipline       0
experience             0
company_size           0
company_type           0
last_new_job           0
dtype: int64

In [49]:
## Check NA for test after imputation.
test_imp_cat.isna().sum()

city                   0
gender                 0
relevent_experience    0
enrolled_university    0
education_level        0
major_discipline       0
experience             0
company_size           0
company_type           0
last_new_job           0
dtype: int64

In [50]:
## Get numeric columns from train data.
X_train_numcols = X_train.select_dtypes(include='number')

In [51]:
## Check first 5 rows of numeric columns of train data.
X_train_numcols.head()

enrollee_id,city_development_index,training_hours
18323,0.92,105
21072,0.92,124
17061,0.789,36
28418,0.92,33
325,0.893,29


In [52]:
## Check dimensions of numeric columns of train data.
X_train_numcols.shape

(14687, 2)

In [53]:
## Check dimensions of category columns of train data.
X_train_imp_cat.shape

(14687, 10)

In [54]:
## Check dimensions of train data.
X_train.shape

(14687, 12)

In [55]:
## Get numeric columns from validation data.
X_test_numcols = X_test.select_dtypes(include='number')

In [56]:
## Check first 5 rows of numeric columns of validation data.
X_test_numcols.head()

enrollee_id,city_development_index,training_hours
6719,0.939,27
16933,0.91,184
18198,0.91,116
22562,0.855,92
19193,0.92,70


In [57]:
## Get numeric columns from test data.
test_numcols = test.select_dtypes(include='number')

In [58]:
## Check first 5 rows of numeric columns of test data.
test_numcols.head()

enrollee_id,city_development_index,training_hours
16548,0.448,15
12036,0.939,94
11061,0.92,17
5032,0.924,76
17599,0.83,65


In [59]:
## Check dimension of numeric columns of test data.
test_numcols.shape

(15021, 2)

In [60]:
## Check dimension of category columns of test data.
test_imp_cat.shape

(15021, 10)

In [61]:
## Get dimensions of test data.
test.shape

(15021, 12)

In [62]:
## Get test data columns data types.
test.dtypes

city                      category
city_development_index     float64
gender                    category
relevent_experience       category
enrolled_university       category
education_level           category
major_discipline          category
experience                category
company_size              category
company_type              category
last_new_job              category
training_hours               int64
dtype: object

In [63]:
## Check dimension of numeric columns of train data.
X_train_numcols.shape

(14687, 2)

In [64]:
## Check dimension of category columns of train data.
X_train_imp_cat.shape

(14687, 10)

In [65]:
## Check first record of numeric column of train data.
X_train_numcols.head(1)

enrollee_id,city_development_index,training_hours
18323,0.92,105


In [66]:
## Check first record of category column of train data.
X_train_imp_cat.head(1)

enrollee_id,city,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
18323,city_103,Male,No relevent experience,no_enrollment,High School,STEM,5,50-99,Pvt Ltd,never


In [67]:
## Concat numeric and category columns of train data.
train_data = pd.concat([X_train_numcols, X_train_imp_cat], axis=1,sort=False)

In [68]:
## Get first record train data.
train_data.head(1)

enrollee_id,city_development_index,training_hours,city,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
18323,0.92,105,city_103,Male,No relevent experience,no_enrollment,High School,STEM,5,50-99,Pvt Ltd,never


In [69]:
## Get last record of train data.
train_data.tail(1)

enrollee_id,city_development_index,training_hours,city,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
9946,0.92,112,city_103,Male,Has relevent experience,no_enrollment,Graduate,STEM,16,10000+,Pvt Ltd,>4


In [70]:
## Check dimensions of train data.
train_data.shape

(14687, 12)

In [71]:
## Check null values of train data.
train_data.isna().sum()

city_development_index    0
training_hours            0
city                      0
gender                    0
relevent_experience       0
enrolled_university       0
education_level           0
major_discipline          0
experience                0
company_size              0
company_type              0
last_new_job              0
dtype: int64

In [72]:
## Concat numeric and category columns of validation data.
valid_data = pd.concat([X_test_numcols, X_test_imp_cat], axis=1)

In [73]:
## Check dimensions of validation data.
valid_data.shape

(3672, 12)

In [74]:
## Concat numeric and category columns of test data.
test_data = pd.concat([test_numcols,test_imp_cat],axis=1)

In [75]:
## Check test data dimensions.
test_data.shape

(15021, 12)
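
The select, impute, and concat steps above can also be expressed as one object. The following is a minimal sketch, assuming a `ColumnTransformer` that imputes the category columns and passes the numeric ones through unchanged. It is an alternative to the manual steps in this notebook, and the toy frame and the names `toy`, `cat_cols`, `preprocess`, and `prepared_df` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

## Hypothetical toy frame with two numeric and two category-like columns.
toy = pd.DataFrame({
    "city_development_index": [0.92, 0.789, 0.624],
    "training_hours": [105, 36, 100],
    "gender": pd.Series(["Male", np.nan, "Male"], dtype=object),
    "company_size": pd.Series(["50-99", "50-99", np.nan], dtype=object),
})
cat_cols = ["gender", "company_size"]
num_cols = ["city_development_index", "training_hours"]

## Impute category columns with the mode; numeric columns pass through as they are.
preprocess = ColumnTransformer(
    [("cat", SimpleImputer(strategy="most_frequent"), cat_cols)],
    remainder="passthrough",
)
prepared = preprocess.fit_transform(toy)

## Transformed columns come first, then the passthrough columns in original order.
prepared_df = pd.DataFrame(prepared, columns=cat_cols + num_cols, index=toy.index)
print(prepared_df)
```

Fitting `preprocess` on the train split and calling `transform` on validation and test would mirror the fit-once, transform-many pattern used with `cat_imputer` above.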

In [76]:
## Check first record of train data.
train_data.head(1)

enrollee_id,city_development_index,training_hours,city,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
18323,0.92,105,city_103,Male,No relevent experience,no_enrollment,High School,STEM,5,50-99,Pvt Ltd,never


In [77]:
## Check first record of validation data.
valid_data.head(1)

enrollee_id,city_development_index,training_hours,city,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
6719,0.939,27,city_75,Male,No relevent experience,no_enrollment,Primary School,STEM,4,50-99,Pvt Ltd,never


In [78]:
## Check first record of test data.
test_data.head(1)

enrollee_id,city_development_index,training_hours,city,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
16548,0.448,15,city_33,Male,No relevent experience,Full time course,Graduate,STEM,<1,1000-4999,Public Sector,1


In [79]:
## To perform label encoding, we fit the label encoders on train, validation and test data combined.
## (Train and test may not contain the same labels, so we combine the data, fit the label encoders once,
## and then transform train, validation and test individually.)
combined_data = pd.concat([train_data, valid_data, test_data])
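
Combining the data is needed because `LabelEncoder.transform` raises an error on labels it did not see during fitting. A hedged alternative sketch, not the approach used in this notebook, is `OrdinalEncoder` with `handle_unknown='use_encoded_value'`, which can be fitted on train data only and maps unseen labels to a chosen code. The toy frames and the code `-1` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

## Hypothetical toy frames: the test frame holds labels never seen in train.
toy_train = pd.DataFrame({
    "city": ["city_103", "city_21", "city_16"],
    "gender": ["Male", "Female", "Male"],
})
toy_test = pd.DataFrame({
    "city": ["city_21", "city_999"],
    "gender": ["Other", "Male"],
})

## Fit on train only; unseen labels are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(toy_train)
encoded = encoder.transform(toy_test)

print(encoded)
```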

In [80]:
## Check first 5 records of combined data.
combined_data.head()

enrollee_id,city_development_index,training_hours,city,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
18323,0.92,105,city_103,Male,No relevent experience,no_enrollment,High School,STEM,5,50-99,Pvt Ltd,never
21072,0.92,124,city_103,Male,Has relevent experience,no_enrollment,Graduate,STEM,2,50-99,Pvt Ltd,1
17061,0.789,36,city_115,Male,Has relevent experience,no_enrollment,Graduate,STEM,4,10000+,Pvt Ltd,2
28418,0.92,33,city_103,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,1
325,0.893,29,city_36,Male,No relevent experience,Full time course,Graduate,STEM,10,50-99,Pvt Ltd,never


In [81]:
## Check last 5 records of combined data.
combined_data.tail()

enrollee_id,city_development_index,training_hours,city,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
11308,0.762,68,city_46,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,500-999,Pvt Ltd,>4
14612,0.624,320,city_21,Male,Has relevent experience,Full time course,Masters,STEM,4,1000-4999,Pvt Ltd,1
33346,0.91,13,city_16,Male,Has relevent experience,no_enrollment,High School,STEM,9,1000-4999,Pvt Ltd,4
14506,0.666,38,city_64,Male,No relevent experience,Full time course,Graduate,STEM,5,50-99,Pvt Ltd,1
32641,0.624,100,city_21,Male,No relevent experience,no_enrollment,Graduate,STEM,3,50-99,Pvt Ltd,1


In [82]:
## Check dimensions of train data.
train_data.shape

(14687, 12)

In [83]:
## Check dimensions of validation data.
valid_data.shape

(3672, 12)

In [84]:
## Check dimensions of test data.
test_data.shape

(15021, 12)

In [85]:
## Check dimensions of combined data.
combined_data.shape

(33380, 12)

In [86]:
## Check NA values for combined data.
combined_data.isna().sum()

city_development_index    0
training_hours            0
city                      0
gender                    0
relevent_experience       0
enrolled_university       0
education_level           0
major_discipline          0
experience                0
company_size              0
company_type              0
last_new_job              0
dtype: int64

In [87]:
## Check null values for combined data.
combined_data.isnull().sum()

city_development_index    0
training_hours            0
city                      0
gender                    0
relevent_experience       0
enrolled_university       0
education_level           0
major_discipline          0
experience                0
company_size              0
company_type              0
last_new_job              0
dtype: int64

In [88]:
## Check combined data columns data types.
combined_data.dtypes

city_development_index    float64
training_hours              int64
city                       object
gender                     object
relevent_experience        object
enrolled_university        object
education_level            object
major_discipline           object
experience                 object
company_size               object
company_type               object
last_new_job               object
dtype: object

In [89]:
## Instantiate label encoder.
le_city = LabelEncoder()
le_gender = LabelEncoder()
le_re = LabelEncoder()
le_eu = LabelEncoder()
le_el = LabelEncoder()
le_md = LabelEncoder()
le_exp = LabelEncoder()
le_cs = LabelEncoder()
le_ct = LabelEncoder()
le_nj = LabelEncoder()
le_target = LabelEncoder()

In [90]:
## Fit and transform the label encoder on combined data.
combined_data['city'] = le_city.fit_transform(combined_data['city'])
combined_data['gender'] = le_gender.fit_transform(combined_data['gender'])
combined_data['relevent_experience'] = le_re.fit_transform(combined_data['relevent_experience'])
combined_data['enrolled_university'] = le_eu.fit_transform(combined_data['enrolled_university'])
combined_data['education_level'] = le_el.fit_transform(combined_data['education_level'])
combined_data['major_discipline'] = le_md.fit_transform(combined_data['major_discipline'])
combined_data['experience'] = le_exp.fit_transform(combined_data['experience'])
combined_data['company_size'] = le_cs.fit_transform(combined_data['company_size'])
combined_data['company_type'] = le_ct.fit_transform(combined_data['company_type'])
combined_data['last_new_job'] = le_nj.fit_transform(combined_data['last_new_job'])
y = le_target.fit_transform(train['target'])

In [91]:
## Transform the label encoder on train data.
train_data['city'] = le_city.transform(train_data['city'])
train_data['gender'] = le_gender.transform(train_data['gender'])
train_data['relevent_experience'] = le_re.transform(train_data['relevent_experience'])
train_data['enrolled_university'] = le_eu.transform(train_data['enrolled_university'])
train_data['education_level'] = le_el.transform(train_data['education_level'])
train_data['major_discipline'] = le_md.transform(train_data['major_discipline'])
train_data['experience'] = le_exp.transform(train_data['experience'])
train_data['company_size'] = le_cs.transform(train_data['company_size'])
train_data['company_type'] = le_ct.transform(train_data['company_type'])
train_data['last_new_job'] = le_nj.transform(train_data['last_new_job'])
y_train = le_target.transform(y_train)

In [92]:
## Transform the label encoder on validation data.
valid_data['city'] = le_city.transform(valid_data['city'])
valid_data['gender'] = le_gender.transform(valid_data['gender'])
valid_data['relevent_experience'] = le_re.transform(valid_data['relevent_experience'])
valid_data['enrolled_university'] = le_eu.transform(valid_data['enrolled_university'])
valid_data['education_level'] = le_el.transform(valid_data['education_level'])
valid_data['major_discipline'] = le_md.transform(valid_data['major_discipline'])
valid_data['experience'] = le_exp.transform(valid_data['experience'])
valid_data['company_size'] = le_cs.transform(valid_data['company_size'])
valid_data['company_type'] = le_ct.transform(valid_data['company_type'])
valid_data['last_new_job'] = le_nj.transform(valid_data['last_new_job'])
y_test = le_target.transform(y_test)

In [93]:
## Transform the label encoder on test data.
test_data['city'] = le_city.transform(test_data['city'])
test_data['gender'] = le_gender.transform(test_data['gender'])
test_data['relevent_experience'] = le_re.transform(test_data['relevent_experience'])
test_data['enrolled_university'] = le_eu.transform(test_data['enrolled_university'])
test_data['education_level'] = le_el.transform(test_data['education_level'])
test_data['major_discipline'] = le_md.transform(test_data['major_discipline'])
test_data['experience'] = le_exp.transform(test_data['experience'])
test_data['company_size'] = le_cs.transform(test_data['company_size'])
test_data['company_type'] = le_ct.transform(test_data['company_type'])
test_data['last_new_job'] = le_nj.transform(test_data['last_new_job'])
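
The four encoding cells above repeat one line per column and per split. As a minimal sketch, the same idea can be written with a dictionary of encoders and a loop. The frames and category values below are toy stand-ins, not the notebook's data, and `encoders` and `categorical_cols` are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames standing in for combined_data / train_data (hypothetical values).
combined = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Male"],
    "company_type": ["Pvt Ltd", "NGO", "Pvt Ltd", "Public Sector"],
})
train_part = combined.iloc[:2].copy()

categorical_cols = ["gender", "company_type"]

# Fit one encoder per column on the combined data so every category is known.
encoders = {col: LabelEncoder().fit(combined[col]) for col in categorical_cols}

# Apply the fitted encoders to any split with the same loop.
for col in categorical_cols:
    train_part[col] = encoders[col].transform(train_part[col])

print(train_part)
```

Keeping the fitted encoders in one dictionary also makes it easy to call `inverse_transform` on a column later if the original labels are needed.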

In [94]:
## Get the first record of train data.
train_data.head(1)

enrollee_id,city_development_index,training_hours,city,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
18323,0.92,105,5,1,1,2,1,5,15,4,5,5


In [95]:
## Get the first record of validation data.
valid_data.head(1)

enrollee_id,city_development_index,training_hours,city,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
6719,0.939,27,97,1,1,2,4,5,14,4,5,5


In [96]:
## Get the first record of test data.
test_data.head(1)

enrollee_id,city_development_index,training_hours,city,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
16548,0.448,15,73,1,1,0,0,5,20,2,4,0


In [97]:
## Copy data from test to temp.
temp = test_data.copy()

In [98]:
xgb = XGBClassifier() ## Instantiate XGBClassifier model.

optimization_dict = {'max_depth': [2,3,4,5,6,7], ## trying with different max_depth,n_estimators to find best model.
                      'n_estimators': [50,60,70,80,90,100,150,200]} 

## Build best model with Grid Search params.
model = GridSearchCV(xgb, ## XGB model.
                     optimization_dict, ## Dictionary with different max_depth, n_estimators values.
                     scoring='accuracy', ## Metric used to rank the candidate models.
                     verbose=1, ## Print progress messages.
                     n_jobs=-1) ## Number of jobs to run in parallel. '-1' means use all processors.

%time model.fit(train_data, y_train) ## Fit a model.
print(model.best_score_) ## Display best score value.
print(model.best_params_) ## Display best parameters.

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   10.0s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:   36.8s
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed:   59.2s finished


Wall time: 1min 1s
0.8679104164537197
{'max_depth': 3, 'n_estimators': 90}
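
Once the search has finished, the winning settings do not have to be retyped by hand. The sketch below shows one way to reuse them through `best_estimator_`. It runs on synthetic data, and scikit-learn's `GradientBoostingClassifier` is an assumed stand-in for `XGBClassifier`; the grid values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced data standing in for train_data / y_train.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.85, 0.15], random_state=0)

param_grid = {"max_depth": [2, 3], "n_estimators": [20, 40]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for XGBClassifier
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

# With the default refit=True, best_estimator_ is already retrained on all
# of X with best_params_, so it can be used for prediction directly.
best_model = search.best_estimator_
print(search.best_params_)
print(best_model.predict(X[:5]))
```

Reusing `best_estimator_` avoids a mismatch between the parameters the search selected and the ones typed into a later cell.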


In [606]:
## Build a model with manually chosen params (these differ from the grid search best params above: max_depth=3, n_estimators=90).
model = XGBClassifier(max_depth=7,           ## Depth of the tree.
                      n_estimators=200,      ## number of trees.
                      learning_rate = 0.001, ## learning rate.
                      booster ='gbtree',     ## tree type.
                      random_state=1234)     ## seed value.
## Fit a model.
%time model.fit(train_data, y_train)

Wall time: 1min 9s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.001, max_delta_step=0, max_depth=7,
              min_child_weight=1, missing=None, n_estimators=2000, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=1234,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [607]:
## Get the predictions on train data.
train_pred = model.predict(train_data)

In [608]:
## Display accuracy value for train data.
print("Train Accuracy :",accuracy_score(y_train,train_pred))

Train Accuracy : 0.869340232858991


In [609]:
## Get the predictions on validation data.
validation_pred = model.predict(valid_data)

In [610]:
## Display accuracy value for validation data.
print("Validation Accuracy :",accuracy_score(y_test,validation_pred))

Validation Accuracy : 0.8681917211328976


In [611]:
## Get the confusion matrix for train data.
confusion_matrix_train = confusion_matrix(y_train, train_pred)
print(confusion_matrix_train)

[[12746     0]
 [ 1919    22]]


In [612]:
## Get the confusion matrix for validation data.
confusion_matrix_test = confusion_matrix(y_test, validation_pred)
print(confusion_matrix_test)

[[3187    1]
 [ 483    1]]


In [613]:
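## Compute accuracy, TNR (true negative rate) and TPR (true positive rate) for train data from the confusion matrix.
Accuracy_Train=(confusion_matrix_train[0,0]+confusion_matrix_train[1,1])/(confusion_matrix_train[0,0]+confusion_matrix_train[0,1]+confusion_matrix_train[1,0]+confusion_matrix_train[1,1])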
TNR_Train= confusion_matrix_train[0,0]/(confusion_matrix_train[0,0]+confusion_matrix_train[0,1])
TPR_Train= confusion_matrix_train[1,1]/(confusion_matrix_train[1,0]+confusion_matrix_train[1,1])

print("Train TNR: ",TNR_Train)
print("\n")
print("Train TPR: ",TPR_Train)
print("\n")
print("Train Accuracy: ",Accuracy_Train)

Train TNR:  1.0


Train TPR:  0.011334363730036065


Train Accuracy:  0.869340232858991


In [614]:
## Compute accuracy, TNR and TPR for validation data (the *_Test variables hold validation metrics).
Accuracy_Test=(confusion_matrix_test[0,0]+confusion_matrix_test[1,1])/(confusion_matrix_test[0,0]+confusion_matrix_test[0,1]+confusion_matrix_test[1,0]+confusion_matrix_test[1,1])
TNR_Test= confusion_matrix_test[0,0]/(confusion_matrix_test[0,0] +confusion_matrix_test[0,1])
TPR_Test= confusion_matrix_test[1,1]/(confusion_matrix_test[1,0] +confusion_matrix_test[1,1])

print("Validation TNR: ",TNR_Test)
print("\n")
print("Validation TPR: ",TPR_Test)
print("\n")
print("Validation Accuracy: ",Accuracy_Test)

Validation TNR:  0.9996863237139272


Validation TPR:  0.002066115702479339


Validation Accuracy:  0.8681917211328976
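
The accuracy, TNR and TPR arithmetic is repeated for every model and split. A small helper can hold it in one place, as in the sketch below; `rates_from_confusion_matrix` is a hypothetical name. It is applied here to the XGBoost validation confusion matrix printed earlier.

```python
import numpy as np

def rates_from_confusion_matrix(cm):
    """Return (accuracy, TNR, TPR) for a 2x2 matrix laid out as
    [[TN, FP], [FN, TP]], the layout sklearn's confusion_matrix uses."""
    cm = np.asarray(cm, dtype=float)
    tn, fp = cm[0]
    fn, tp = cm[1]
    accuracy = (tn + tp) / cm.sum()
    tnr = tn / (tn + fp)  # share of class 0 rows predicted as 0
    tpr = tp / (fn + tp)  # share of class 1 rows predicted as 1
    return accuracy, tnr, tpr

# Validation confusion matrix of the XGBoost model from this notebook.
accuracy, tnr, tpr = rates_from_confusion_matrix([[3187, 1], [483, 1]])
print(accuracy, tnr, tpr)
```

The numbers show why accuracy alone is misleading on this data: a model that predicts class 0 for almost every row still reaches about 87% accuracy while its TPR stays close to zero.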


In [158]:
## Get the predictions on test data.
y_pred = model.predict(temp)

In [159]:
## Display test predictions.
y_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
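
The problem statement asks for the probability that an enrollee is looking for a new job, while `predict` returns hard labels only. A minimal sketch of working with probabilities through `predict_proba` follows. It assumes synthetic data and uses `LogisticRegression` as a stand-in model; the 0.2 cut-off is a hypothetical value, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced data (hypothetical stand-in for the notebook's data).
X, y = make_classification(n_samples=500, n_features=6,
                           weights=[0.87, 0.13], random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Column 1 of predict_proba holds the estimated probability of class 1.
proba_pos = clf.predict_proba(X)[:, 1]

default_pred = (proba_pos >= 0.5).astype(int)  # the usual default cut-off
lowered_pred = (proba_pos >= 0.2).astype(int)  # hypothetical lower cut-off

print("positives at 0.5:", default_pred.sum())
print("positives at 0.2:", lowered_pred.sum())
```

Lowering the cut-off can only keep or increase the number of rows labelled as class 1, which is one way to trade TNR for TPR when the positive class is rare.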

In [160]:
## Copy temp data into temp1.
temp1 = temp.copy()

In [161]:
## Check first 5 records of temp1.
temp1.head()

enrollee_id,city_development_index,training_hours,city,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
16548,0.448,15,73,1,1,0,0,5,20,2,4,0
12036,0.939,94,70,1,1,2,0,5,21,4,5,0
11061,0.92,17,5,1,1,0,0,5,13,4,5,0
5032,0.924,76,6,1,1,2,3,5,21,4,5,1
17599,0.83,65,105,1,0,2,0,5,16,7,5,1


In [162]:
## Do inverse transform on test predictions to get the original target values (0, 1).
temp1['target'] = le_target.inverse_transform(y_pred)

In [163]:
## Reset index.
temp1.reset_index(inplace=True)

In [164]:
## Copy enrollee_id, target column from temp1 to to_submit_1.
to_submit_1 = temp1[['enrollee_id', 'target']]

In [165]:
## Display value counts for target column of to_submit_1.
to_submit_1.target.value_counts()

1    14324
0      697
Name: target, dtype: int64

In [113]:
## Check dimensions of test data.
test_data.shape

(15021, 13)

In [119]:
## Store to_submit_1 into csv file with name xgb_model 
to_submit_1.to_csv('xgb_model.csv',index = False)

In [None]:
## Build different models.

In [234]:
## Instantiate KNN model.
#model = KNeighborsClassifier(algorithm = 'brute', n_neighbors = 3,metric = "euclidean")

In [235]:
## Instantiate Naive Bayes Model.
## model = GaussianNB()

In [416]:
## Instantiate Random Forest Model.
## model = RandomForestClassifier(n_estimators=1000,max_depth=7,n_jobs=-1,class_weight = 'balanced') #class_weight = 'balanced'

In [417]:
## Instantiate Bagging classifier Model.
## model  = BaggingClassifier(n_estimators=200)

In [418]:
## Instantiate Adaboost classifier Model.
## model = AdaBoostClassifier(n_estimators=100,learning_rate=.001)

In [419]:
## Instantiate Gradient boosting classifier Model.
## model = GradientBoostingClassifier(n_estimators=200,learning_rate=0.3)

In [420]:
## Instantiate SVC Model.
## model = SVC(C=10,kernel='rbf')
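
Rather than uncommenting one candidate at a time, several models can be scored in a single loop. The sketch below assumes synthetic data, and it uses `balanced_accuracy` as the metric, which is not what the notebook uses. Only three of the candidates above are included to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced data standing in for train_data / y_train.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.85, 0.15], random_state=2)

candidates = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=50, max_depth=7,
                                            class_weight="balanced",
                                            random_state=2),
}

# Score every candidate with the same folds and the same metric.
scores = {}
for name, estimator in candidates.items():
    cv_scores = cross_val_score(estimator, X, y, cv=3,
                                scoring="balanced_accuracy")
    scores[name] = cv_scores.mean()

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Balanced accuracy averages the recall of both classes, so a model that ignores the minority class scores about 0.5 instead of about 0.87.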

In [591]:
## Random forest gave the best result compared to the other classifier models.
model = RandomForestClassifier(n_estimators=2000,         ## The number of trees in the forest.
                               max_depth=7,               ## The maximum depth of the tree.
                               n_jobs=-1,                 ## The number of jobs to run in parallel. -1 means using all processors.
                               class_weight = 'balanced', ## Weight classes inversely proportional to their frequencies.
                               criterion='entropy')       ## The function to measure the quality of a split.

In [592]:
## Fit a model.
%time model.fit(train_data, y_train)

Wall time: 15.3 s


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
                       criterion='entropy', max_depth=7, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=2000,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)
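
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below works that formula out for the train class counts shown in this notebook's confusion matrices (12746 rows of class 0, 1941 rows of class 1). `y_counts` is a hypothetical array rebuilt from those counts, not the real `y_train`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild a label vector with the train class counts from the notebook:
# 12746 rows of class 0 and 1941 rows of class 1.
y_counts = np.array([0] * 12746 + [1] * 1941)

# 'balanced' gives each class the weight n_samples / (n_classes * count).
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]),
                               y=y_counts)
print(dict(zip([0, 1], weights)))
```

The minority class receives the larger weight, so errors on class 1 count for more during training. That is consistent with the random forest predicting class 1 far more often than the unweighted XGBoost model.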

In [593]:
## Get prediction on train and validation data.
predict_train = model.predict(train_data)
predict_validation = model.predict(valid_data)

In [594]:
## Display accuracy value for train data.
print("Train Accuracy :",accuracy_score(y_train,predict_train))

Train Accuracy : 0.7424933614761353


In [595]:
## Display accuracy value for validation data.
print("Validation Accuracy :",accuracy_score(y_test,predict_validation))

Validation Accuracy : 0.7165032679738562


In [596]:
## Get confusion matrix for train data.
confusion_matrix_train = confusion_matrix(y_train, predict_train)
print(confusion_matrix_train)

[[9809 2937]
 [ 845 1096]]


In [597]:
## Get confusion matrix for validation data.
confusion_matrix_validation = confusion_matrix(y_test, predict_validation)
print(confusion_matrix_validation)

[[2421  767]
 [ 274  210]]


In [598]:
## Compute accuracy, TNR and TPR for train data from the random forest confusion matrix.
Accuracy_Train=(confusion_matrix_train[0,0]+confusion_matrix_train[1,1])/(confusion_matrix_train[0,0]+confusion_matrix_train[0,1]+confusion_matrix_train[1,0]+confusion_matrix_train[1,1])
TNR_Train= confusion_matrix_train[0,0]/(confusion_matrix_train[0,0]+confusion_matrix_train[0,1])
TPR_Train= confusion_matrix_train[1,1]/(confusion_matrix_train[1,0]+confusion_matrix_train[1,1])

print("Train TNR: ",TNR_Train)
print("\n")
print("Train TPR: ",TPR_Train)
print("\n")
print("Train Accuracy: ",Accuracy_Train)

Train TNR:  0.7695747685548407


Train TPR:  0.5646573930963421


Train Accuracy:  0.7424933614761353


In [599]:
## Compute accuracy, TNR and TPR for validation data (the *_Test variables hold validation metrics).
Accuracy_Test=(confusion_matrix_validation[0,0]+confusion_matrix_validation[1,1])/(confusion_matrix_validation[0,0]+confusion_matrix_validation[0,1]+confusion_matrix_validation[1,0]+confusion_matrix_validation[1,1])
TNR_Test= confusion_matrix_validation[0,0]/(confusion_matrix_validation[0,0] +confusion_matrix_validation[0,1])
TPR_Test= confusion_matrix_validation[1,1]/(confusion_matrix_validation[1,0] +confusion_matrix_validation[1,1])

print("Validation TNR: ",TNR_Test)
print("\n")
print("Validation TPR: ",TPR_Test)
print("\n")
print("Validation Accuracy: ",Accuracy_Test)

Validation TNR:  0.7594102885821832


Validation TPR:  0.43388429752066116


Validation Accuracy:  0.7165032679738562


In [600]:
## Get the predictions on test data.
y_pred = model.predict(temp)

In [602]:
## Do inverse transform on test predictions to get their original values.
temp['target'] = le_target.inverse_transform(y_pred)

In [603]:
## Reset index.
temp.reset_index(inplace=True)

In [604]:
## Copy enrollee_id, target columns from temp to to_submit.
to_submit = temp[['enrollee_id', 'target']]

In [605]:
## Display value counts for target column of to_submit.
to_submit.target.value_counts()

0    10861
1     4160
Name: target, dtype: int64

In [575]:
## Store to_submit into csv file with name randomforest. 
to_submit.to_csv('randomforest.csv',index = False)