# 🐍 Next DS Job: XGBoost, LightGBM, LogReg, R.Forest

# SUMMARY

1.  This notebook implements several ML models on the **HR Analytics: Job Change of Data Scientists** dataset.
> **(XGBoost, LightGBM, LogisticRegression, RandomForest)**
1.  The dataset is preprocessed in several ways.
> **(KNNImputer, LabelEncoding, OneHotEncoding)**
1.  Finally, the results are compared to find the best combination of model and preprocessing.
1.  The best model is used to predict the test data.

# Table Of Contents

* [1. EDA & Preparing the Mappings](#chapter1)
* [2. Data Preparation](#chapter2)
    * [2.1. Handle Missing Values](#chapter2.1)
    * [2.2. Dataset for LightGBM](#chapter2.2)
    * [2.3. Correlation](#chapter2.3)
    * [2.4. Dataset for LogisticRegression, RandomForest, XGBoost](#chapter2.4)
        * [2.4.1 Encoding with manual mapping](#chapter2.4.1)
        * [2.4.2 KNN Imputer](#chapter2.4.2)
    * [2.5. Datasets Summary](#chapter2.5)
* [3. Models](#chapter3)
    * [3.1. Model 1: Logistic Regression](#chapter3.1)
    * [3.2. Model 2: Random Forest](#chapter3.2)
    * [3.3. Model 3: XGBoost](#chapter3.3)
    * [3.4. Model 4: LightGBM](#chapter3.4)
* [4. Model Comparisons](#chapter4)
* [5. Feature Importances](#chapter5)
* [6. Predict aug_test.csv](#chapter6)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_recall_fscore_support
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from datetime import datetime

import xgboost as xgb


In [None]:
train_data = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
test_data = pd.read_csv('../input/hr-analytics-job-change-of-data-scientists/aug_test.csv')

# 1. EDA & Preparing the mappings <a class="anchor" id="chapter1"></a>

In [None]:
train_data.head()

In [None]:
print('Train shape:', train_data.shape)
print('Test shape:', test_data.shape)

In [None]:
train_data.info()

In [None]:
def perc_missing(df):
    # report missing values of the given dataframe, as counts and as percentages
    print('***  Count Missing Values ***')
    print(df.isnull().sum().sort_values(ascending=False))
    print('\n---------------\n')
    print('*** Percentage Missing Values ***')
    print((df.isnull().sum() / len(df) * 100).sort_values(ascending=False))

In [None]:
perc_missing(train_data)

In [None]:
# columns that have at least one null value
train_data.columns[train_data.isnull().any()].tolist()

- Save the `enrollee_id` and `target` columns as Series for later use.

In [None]:
enrollee_id = test_data['enrollee_id']
target = train_data['target']

### **--> City**
- **Nominal Categorical Attribute**: There is no ordering between the cities.

In [None]:
train_data.city.value_counts().sort_values()

- `LabelEncoder` can be used to encode each city as an integer.

### **--> City Development Index**
- **Continuous Numerical Attribute**:

In [None]:
train_data.city_development_index.hist()

In [None]:
train_data.city_development_index.isnull().sum()

- There are no null values in the city_development_index column.

### **--> Gender**
- **Nominal Categorical Attribute**: There is no ordering between gender categories.

In [None]:
train_data.gender.value_counts().sort_values()

In [None]:
plt.figure(figsize=(8,5))
patches, texts, autotexts = plt.pie(x=train_data.gender.value_counts().tolist(), labels=train_data.gender.value_counts().index, autopct='%1.2f%%')

#make percent texts bigger
plt.setp(autotexts, fontsize=14)

#make label texts bigger
plt.setp(texts, fontsize=14)

In [None]:
train_data.gender.isnull().sum()

- There are null values in the Gender column. They will be handled later.
- Categorical values are converted to numerical values by mapping them with a dictionary.
- For the LGBM model, missing values are left as NaN.

In [None]:
map_gender_lgbm = {'Other': 0, 'Female':1, 'Male':2}
map_gender = {'null': 0, 'Other': 1, 'Female':2, 'Male':3}
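
How the two dictionaries behave with `Series.map` can be seen on a toy column. This is a minimal sketch: the sample values are made up for illustration, and the two dictionaries are copies of the ones defined above.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (illustrative data, not from the dataset)
gender = pd.Series(['Male', 'Female', np.nan, 'Other'])

map_gender_lgbm = {'Other': 0, 'Female': 1, 'Male': 2}
map_gender = {'null': 0, 'Other': 1, 'Female': 2, 'Male': 3}

# LightGBM variant: values absent from the dictionary (here NaN) stay NaN
lgbm_encoded = gender.map(map_gender_lgbm)

# Other models: fill NaN with the string 'null' first, so it maps to 0
others_encoded = gender.fillna('null').map(map_gender)

print(lgbm_encoded.tolist())
print(others_encoded.tolist())
```

The same pattern applies to every `map_*_lgbm` / `map_*` pair defined in this chapter.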

### **--> Relevant Experience**
- **Ordinal Categorical Attribute**

In [None]:
train_data.relevent_experience.value_counts()

In [None]:
data = train_data.relevent_experience.value_counts().tolist()
labels = train_data.relevent_experience.value_counts().index

patches, texts, autotexts = plt.pie(x=data, labels=labels, autopct='%1.2f%%')

#make percent texts bigger
plt.setp(autotexts, fontsize=14)

#make label texts bigger
plt.setp(texts, fontsize=14)

In [None]:
train_data.relevent_experience.isnull().sum()

- There are no missing values in the Relevant Experience column.
- This is an ordinal categorical variable, so there is an ordering between categories.
- Having relevant experience > Having no experience.
- So, while converting them to numerical values, the ordering can be preserved.

In [None]:
map_relevent_experience_lgbm = {'No relevent experience': 0, 'Has relevent experience': 1}
map_relevent_experience = {'null':0, 'No relevent experience': 1, 'Has relevent experience': 2}

### **--> Enrolled University**
- **Ordinal Categorical Attribute**

In [None]:
train_data.enrolled_university.value_counts()

In [None]:
data = train_data.enrolled_university.value_counts().tolist()
labels = train_data.enrolled_university.value_counts().index

patches, texts, autotexts = plt.pie(x=data, labels=labels, autopct='%1.2f%%')

#make percent texts bigger
plt.setp(autotexts, fontsize=14)

#make label texts bigger
plt.setp(texts, fontsize=14)

In [None]:
train_data.enrolled_university.isnull().sum()

- There are missing values.
- While converting ordinal categorical values to numerical values, the ordering can be considered.

In [None]:
map_enrolled_university_lgbm = {'no_enrollment': 0, 'Part time course': 1, 'Full time course' : 2}
map_enrolled_university = {'null': 0, 'no_enrollment': 1, 'Part time course': 2, 'Full time course' : 3}

### **--> Education Level**
- **Ordinal Categorical Attribute**

In [None]:
train_data.education_level.value_counts()

In [None]:
data = train_data.education_level.value_counts().tolist()
labels = train_data.education_level.value_counts().index

patches, texts, autotexts = plt.pie(x=data, labels=labels, autopct='%1.2f%%')

#make percent texts bigger
plt.setp(autotexts, fontsize=14)

#make label texts bigger
plt.setp(texts, fontsize=14)

In [None]:
train_data.education_level.isnull().sum()

- There are missing values.
- For LightGBM, missing values remain NaN after conversion; for the other models they are encoded as 0.

In [None]:
map_education_level_lgbm = {'Primary School':0, 'High School':1, 'Graduate': 2, 'Masters':3, 'Phd':4}
map_education_level = {'null': 0, 'Primary School':1, 'High School':2, 'Graduate': 3, 'Masters':4, 'Phd':5}
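
As an alternative to a hand-written dictionary, an ordered pandas `Categorical` yields the same integer codes as `map_education_level_lgbm`. This is only a sketch of an alternative, not what the notebook uses; the sample values are made up.

```python
import numpy as np
import pandas as pd

# Categories listed from lowest to highest level
levels = ['Primary School', 'High School', 'Graduate', 'Masters', 'Phd']

# Toy column with one missing value (illustrative data)
education = pd.Series(['Graduate', 'Phd', np.nan, 'Primary School'])

ordered = pd.Categorical(education, categories=levels, ordered=True)

# Categorical codes use -1 for missing values; turn that back into NaN
codes = pd.Series(ordered.codes, dtype='float64')
codes = codes.where(codes >= 0)

print(codes.tolist())
```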

### **--> Major Discipline**
- **Nominal Categorical Attribute**

In [None]:
train_data.major_discipline.value_counts()

In [None]:
train_data.major_discipline.isnull().sum()

- There are a lot of missing values.
- For LightGBM, missing values remain NaN after conversion; for the other models they are encoded as 0.

In [None]:
map_major_discipline_lgbm = {'No Major':0, 'Arts':1, 'Business Degree':2, 'Other': 3, 'Humanities':4, 'STEM':5}
map_major_discipline = {'null': 0, 'No Major':1, 'Arts':2, 'Business Degree':3, 'Other': 4, 'Humanities':5, 'STEM':6}

### **--> Experience**
- **Numerical Categorical Variable**

- This is a numerical variable that includes non-numeric values such as `<1` and `>20`.
- These values can be handled with a mapping.

In [None]:
train_data.experience.value_counts().sort_index()

In [None]:
plt.figure(figsize=(10,5))
# pass the data by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x=train_data.experience, order=['<1','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','>20'])

- These are ordinal categorical values stored as strings.
- They can be converted to numerical values by handling the `>20` and `<1` values.
- If the model is LGBM, missing values are left as NaN; otherwise they are encoded.

In [None]:
def experience_replace(exp, model):
    # missing value: keep NaN for LGBM, otherwise encode it as 22
    if pd.isna(exp):
        return exp if model == 'LGBM' else 22
    if exp == '>20':
        return 21
    if exp == '<1':
        return 0
    return int(exp)
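
The same conversion can also be written without a row-wise function. The sketch below is a vectorized alternative under the same conventions (`<1` becomes 0, `>20` becomes 21, and 22 is the sentinel for missing values); the sample values are made up.

```python
import numpy as np
import pandas as pd

# Toy experience column (illustrative data)
experience = pd.Series(['<1', '3', '>20', np.nan, '15'])

# Replace the two open-ended labels, then parse everything as a number.
# NaN survives both steps, which is the behaviour wanted for LightGBM.
numeric = pd.to_numeric(experience.replace({'<1': '0', '>20': '21'}))

# Variant for the other models: 22 marks "missing"
with_sentinel = numeric.fillna(22).astype(int)

print(numeric.tolist())
print(with_sentinel.tolist())
```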

### **--> Company Size**
- **Ordinal Categorical Variable**

In [None]:
train_data.company_size.value_counts().sort_values()

In [None]:
# number of instances based on company_size ordered by company_size
plt.figure(figsize=(10,5))
sns.countplot(x=train_data.company_size, order=['<10','10/49','50-99','100-500','500-999','1000-4999','5000-9999','10000+'])

In [None]:
map_company_size_lgbm = {'<10': 0,'10/49': 1,'50-99': 2,'100-500': 3,'500-999': 4,'1000-4999':5, '5000-9999':6, '10000+':7}
map_company_size = {'null': 0, '<10': 1,'10/49': 2,'50-99': 3,'100-500': 4,'500-999': 5,'1000-4999':6, '5000-9999':7, '10000+':8}

### **--> Company Type**
- **Nominal Categorical Variable**

In [None]:
train_data.company_type.value_counts()

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x=train_data.company_type)

In [None]:
map_company_type_lgbm = {
    'Pvt Ltd': 5,
    'Funded Startup':4,
    'Early Stage Startup':3,
    'Other':2,
    'Public Sector':1,
    'NGO':0
}

map_company_type = {
    'Pvt Ltd': 6,
    'Funded Startup':5,
    'Early Stage Startup':4,
    'Other':3,
    'Public Sector':2,
    'NGO':1,
    'null':0
}

### **--> Last New Job**
- **Numerical Categorical Variable**

- This is a numerical variable that includes non-numeric values such as `never` and `>4`.
- These values can be handled with a mapping.

In [None]:
train_data.last_new_job.value_counts()

- Convert `>4` and `never` to numerical values.
- If the model is LGBM, missing values are left as NaN; otherwise they are encoded.

In [None]:
# function for replacing the values and converting them to integers
def lastnewjob_replace(lnj, model):
    # missing value: keep NaN for LGBM, otherwise encode it as 6
    if pd.isna(lnj):
        return lnj if model == 'LGBM' else 6
    if lnj == '>4':
        return 5
    if lnj == 'never':
        return 0
    return int(lnj)

### **--> Training Hours**

In [None]:
train_data.training_hours.hist()

In [None]:
train_data.training_hours.isnull().sum()

- There is no missing value.

### **--> Target**
- Binary classification problem

In [None]:
train_data.target.hist()

# 2. Data Preparation <a class="anchor" id="chapter2"></a>

- Merge the train and test datasets before encoding.

In [None]:
all_data = pd.concat([train_data.drop(['target','enrollee_id'], axis=1), test_data.drop(['enrollee_id'], axis=1)], axis=0)
all_data

## 2.1 Handle Missing Values <a class="anchor" id="chapter2.1"></a>

- Missing values can be handled in various ways.
    - **Deleting** rows or columns that include at least one missing value.
    - **Encoding** all missing values in a given column as the same number.
    - **Imputation**: Filling the missing values with a relevant value. The relevant value can be
        - the median, mode, or mean of the column.
        - a value predicted by another machine learning algorithm.
            - Ex: the K-Nearest Neighbors algorithm, which derives a value from the most similar instances.
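
The three strategies listed above can be sketched on a toy frame. The column names mirror the dataset, but the values and the chosen sentinels are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (illustrative data)
df = pd.DataFrame({
    'experience': [1.0, np.nan, 5.0, 10.0],
    'company_size': [2.0, 3.0, np.nan, 3.0],
})

# 1) Deleting: drop every row that has at least one missing value
deleted = df.dropna()

# 2) Encoding: replace missing values with one fixed number per column
encoded = df.fillna({'experience': 22, 'company_size': 0})

# 3) Imputation: fill with a column statistic (the median here)
imputed = df.fillna(df.median())

print(len(deleted), encoded.loc[1, 'experience'], imputed.loc[1, 'experience'])
```

Deleting is the simplest option but discards half of this toy frame, which is why the notebook relies on encoding and imputation instead.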

The models will be run with the dataset configurations below.

- LightGBM:
    - missing values left as NaN
- Logistic Regression, Random Forest, XGBoost (missing values should be handled for these models)
    - missing values encoded with mapping
    - missing values imputed with KNN

In [None]:
def convert_dataset(df_data, model):
    
    # df_data itself is not modified;
    # the conversion is applied to a copy, which is returned
    temp_data = df_data.copy()
    
    le = LabelEncoder()
    temp_data.city = le.fit_transform(temp_data.city)
    temp_data.last_new_job = temp_data.last_new_job.apply(lastnewjob_replace, args=(model,))
    temp_data.experience = temp_data.experience.apply(experience_replace, args=(model,))
    
    # convert categorical values, leave missing values as NaN
    if model == 'LGBM':
        temp_data.gender = temp_data.gender.map(map_gender_lgbm)
        temp_data.relevent_experience = temp_data.relevent_experience.map(map_relevent_experience_lgbm)
        temp_data.enrolled_university = temp_data.enrolled_university.map(map_enrolled_university_lgbm)
        temp_data.education_level = temp_data.education_level.map(map_education_level_lgbm)
        temp_data.major_discipline = temp_data.major_discipline.map(map_major_discipline_lgbm)
        temp_data.company_size = temp_data.company_size.map(map_company_size_lgbm)
        temp_data.company_type = temp_data.company_type.map(map_company_type_lgbm)
        
    # convert categorical values, encode missing values
    else:
        # first fill NaN values with the string 'null'; the mappers translate 'null' to 0
        temp_data.fillna('null', inplace=True)
        temp_data.gender = temp_data.gender.map(map_gender)
        temp_data.relevent_experience = temp_data.relevent_experience.map(map_relevent_experience)
        temp_data.enrolled_university = temp_data.enrolled_university.map(map_enrolled_university)
        temp_data.education_level = temp_data.education_level.map(map_education_level)
        temp_data.major_discipline = temp_data.major_discipline.map(map_major_discipline)
        temp_data.company_size = temp_data.company_size.map(map_company_size)
        temp_data.company_type = temp_data.company_type.map(map_company_type)
        
    return temp_data

## 2.2 Dataset for LightGBM <a class="anchor" id="chapter2.2"></a>

- Categorical values are encoded.
- Missing values are left as NaN. LightGBM can automatically handle missing values.
- Check: https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#
- **convert_dataset(df, 'LGBM')** function will encode the values while considering the order, and leaving missing values as NaN

In [None]:
convert_dataset(all_data, 'LGBM')

In [None]:
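all_data = convert_dataset(all_data, 'LGBM')
# .copy() makes the slices independent frames, so the columns added below do not trigger SettingWithCopyWarning
lgbm_train_data = all_data.iloc[0:train_data.shape[0], :].copy()
lgbm_test_data = all_data.iloc[train_data.shape[0]:, :].copy()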

# add the id column to test data
lgbm_test_data.loc[:,'enrollee_id'] = enrollee_id.values
# add the target column to train data
lgbm_train_data.loc[:,'target'] = target.values

## 2.3 Correlation among Columns <a class="anchor" id="chapter2.3"></a>

- Notable correlations
    - Experience vs Relevant Experience: This relationship is self-explanatory.
    - Experience vs Enrolled University: People with more experience are usually not enrolled in a university.
    - Experience vs Last New Job: As experience increases, the gap in years between the previous job and the current job also increases.
    - City development index vs Target (looking for a job change): Negative correlation.

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(lgbm_train_data.corr(), annot=True)

## 2.4 Dataset for LogisticRegression, RandomForest, XGBoost <a class="anchor" id="chapter2.4"></a>

### **2.4.1. Encoding with manual mapping**

- Call the convert_dataset function with the model name `Others`.
- All missing values are encoded.

In [None]:
all_data = pd.concat([train_data.drop(['target','enrollee_id'], axis=1), test_data.drop(['enrollee_id'], axis=1)], axis=0)
all_data = convert_dataset(all_data, 'Others')

In [None]:
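# .copy() makes the slices independent frames, so the columns added below do not trigger SettingWithCopyWarning
others_train_data = all_data.iloc[0:train_data.shape[0], :].copy()
others_test_data = all_data.iloc[train_data.shape[0]:, :].copy()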

# add the id column to test data
others_test_data.loc[:,'enrollee_id'] = enrollee_id.values
# add the target column to train data
others_train_data.loc[:,'target'] = target.values

In [None]:
others_train_data

In [None]:
others_test_data

### **2.4.2. KNN Imputer**

- First, convert the original dataset without filling missing values.
- The conversion parameter is `LGBM`, because that mode leaves missing values as NaN.
- Then, the missing values are imputed with the KNN algorithm.

In [None]:
# Get the original data, merge it.
all_data = pd.concat([train_data.drop(['target','enrollee_id'], axis=1), test_data.drop(['enrollee_id'], axis=1)], axis=0)
# convert it
all_data = convert_dataset(all_data, 'LGBM')

In [None]:
missing_columns = all_data.columns[all_data.isnull().any()].tolist()
missing_columns

- Since the categorical columns are labeled with numerical values, the values found by KNN should be rounded to integer.

In [None]:
knn_imputer = KNNImputer(n_neighbors=3)

arr = knn_imputer.fit_transform(all_data.loc[:,all_data.columns != 'target'])
all_data_knn_imputed = pd.DataFrame(arr, columns = all_data.loc[:,all_data.columns != 'target'].columns)

In [None]:
all_data_knn_imputed[missing_columns] = np.round(all_data_knn_imputed[missing_columns])
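
The impute-then-round step can be checked on a tiny matrix. This is a minimal sketch with made-up values; it uses `n_neighbors=2` rather than the notebook's 3 only because the toy matrix is so small.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: column 0 is numeric, column 1 is an integer-coded category
X = np.array([
    [1.0, 0.0],
    [1.1, np.nan],   # the category of this row is missing
    [0.9, 1.0],
    [5.0, 2.0],
    [5.2, 2.0],
])

imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

# The imputed value is an average of neighbour codes, so it can be fractional;
# rounding brings it back to a valid integer code
X_imputed[:, 1] = np.round(X_imputed[:, 1])

print(X_imputed)
```

Averaging integer codes is easiest to justify for ordinal columns; for nominal columns the rounded value is a valid code, but the average itself has no natural meaning.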

In [None]:
all_data_knn_imputed

- Split the dataset into train and test.

In [None]:
train_data_knn_imputed = all_data_knn_imputed.iloc[0:train_data.shape[0], :].copy()
test_data_knn_imputed = all_data_knn_imputed.iloc[train_data.shape[0]:, :].copy()

# add the id column to test data
test_data_knn_imputed.loc[:,'enrollee_id'] = enrollee_id.values
# add the target column to train data
train_data_knn_imputed.loc[:,'target'] = target.values

In [None]:
train_data_knn_imputed

In [None]:
test_data_knn_imputed

## 2.5 Datasets Summary <a class="anchor" id="chapter2.5"></a>

- **lgbm_train_data**           : Dataset prepared for the LGBM model. All missing values are left as **NaN**.
- **others_train_data**         : Dataset prepared for the models other than LGBM. All missing values are handled with **manual mapping**.
- **train_data_knn_imputed**    : Dataset prepared for the models other than LGBM. All missing values are handled with the **KNN** imputer.

The corresponding test datasets are prepared as well.

# 3. Models <a class="anchor" id="chapter3"></a>

- Function for saving and printing model results for comparison

In [None]:
auc_scores = []
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
model_list = []
imputer_list = []
timestamp = []

In [None]:
def print_report(estimator,X,y, model, imputer):
    
    print('\n\n','*'*15,'REPORT','*'*15,'\n')
    print('Model: ', model)
    print('Imputer:', imputer)
    
    if model == 'LGBM':
        auc = roc_auc_score(y, estimator.predict(X))
        y_predict = [1 if x > 0.5 else 0 for x in estimator.predict(X) ]
        cmatrix = confusion_matrix(y, y_predict)
    else:
        auc = roc_auc_score(y, estimator.predict_proba(X)[:,1])
        y_predict= estimator.predict(X)
        
        
    precision, recall, fscore, support = precision_recall_fscore_support(y, y_predict)
    accuracy = accuracy_score(y, y_predict)
    
    
    #print
    print('AUC: ', auc)
    print('*'*40)
    print(classification_report(y, y_predict))
    print('*'*40)
    
    if model == 'LGBM':
        sns.heatmap(cmatrix, annot=True, fmt='d', cmap='Blues')
    else:
        plot_confusion_matrix(estimator,X, y, values_format='d')
    
    
    #save
    auc_scores.append(auc)
    precision_scores.append(precision[1])
    f1_scores.append(fscore[1])
    recall_scores.append(recall[1])
    accuracy_scores.append(accuracy)
    model_list.append(model)
    imputer_list.append(imputer)
    timestamp.append(datetime.now())
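
`print_report` stores the precision, recall and F1 of the positive class (index 1). In the LGBM branch it first thresholds the predictions at 0.5, because a LightGBM booster trained with the binary objective returns probabilities rather than labels. The sketch below reproduces that arithmetic by hand on hypothetical probabilities.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.60, 0.80, 0.40, 0.30, 0.90])

# Same rule as in print_report: probability > 0.5 means class 1
y_pred = (y_prob > 0.5).astype(int)

# Confusion-matrix counts for the positive class
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(y_pred.tolist(), precision, recall, f1)
```

AUC, by contrast, is computed from the raw probabilities and does not depend on the 0.5 threshold.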

## 3.1 Model 1: Logistic Regression <a class="anchor" id="chapter3.1"></a>

- For nominal categorical variables, one-hot encoding is used to avoid introducing an artificial ordering between values.
- Tree-based models, such as Decision Trees, Random Forests, and Boosted Trees, typically don't perform well with one-hot encodings that have many levels.
- For Logistic Regression, one-hot encoding is used for nominal categorical variables and label encoding for ordinal ones.
- For RandomForest, XGBoost, and LightGBM, label encoding is used.

- Since the target classes are imbalanced, the train/validation split is stratified on the target value.
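
The effect of stratification can be seen on a synthetic imbalanced target. A minimal sketch, assuming an 80/20 class ratio chosen purely for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 80 negatives, 20 positives (illustrative)
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Both parts keep the original 20% share of positives
print(y_train.mean(), y_valid.mean())
```

Without `stratify=y`, the share of positives in the validation part would vary from split to split, which makes scores on small validation sets harder to compare.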

In [None]:
def Logistic_Regression(df, imputer):
    
    # one hot encoding for the categorical variables listed below;
    # a single call is used so that all three encodings are kept
    train_data_ohe = pd.get_dummies(
        df,
        columns=['gender', 'enrolled_university', 'major_discipline'],
        prefix=['G', 'EU', 'MD'],
        prefix_sep='_'
    )
    
    # Grid Search
    X = train_data_ohe.drop(['target'],axis=1)
    y = train_data_ohe['target']

    X_train, X_valid, y_train, y_valid = train_test_split(X,y, test_size=0.2, stratify= y)

    lr = LogisticRegression(max_iter=2000)

    params={'C':np.logspace( -10, 1, 15)}

    gs_lr = GridSearchCV(lr, param_grid = params, scoring=('roc_auc'), cv=5, n_jobs=-1)
    gs_lr.fit(X_train,y_train)
    
    print('Best parameters: ', gs_lr.best_params_)
    print('Best score: ', gs_lr.best_score_)
    print_report(gs_lr,X_valid,y_valid, 'LogisticRegression', imputer)
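
The difference between one-hot encoding and integer (label) encoding discussed for logistic regression can be illustrated on a toy frame. The values are made up; the `MD` prefix follows the function above.

```python
import pandas as pd

# Toy frame with one nominal and one ordinal column (illustrative data)
df = pd.DataFrame({
    'major_discipline': ['STEM', 'Arts', 'STEM', 'Humanities'],
    'education_level': ['Graduate', 'Masters', 'Phd', 'Graduate'],
})

# Ordinal feature: integer codes that preserve the order
df['education_level'] = df['education_level'].map(
    {'Graduate': 3, 'Masters': 4, 'Phd': 5}
)

# Nominal feature: one indicator column per category, no implied order
encoded = pd.get_dummies(df, columns=['major_discipline'], prefix='MD', prefix_sep='_')

print(list(encoded.columns))
```

Note that each `pd.get_dummies` call must start from the result of the previous one (or list all columns in one call); otherwise earlier encodings are lost.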

In [None]:
Logistic_Regression(train_data_knn_imputed, 'KNN')

In [None]:
Logistic_Regression(others_train_data, 'Manuel_Mapping')

## 3.2. Model 2: Random Forest <a class="anchor" id="chapter3.2"></a>

In [None]:
def Random_Forest(df, imputer):
 
    X = df.drop(['target'], axis=1)
    y = df['target']


    X_train, X_valid, y_train, y_valid = train_test_split(X,y, test_size=0.2, stratify= y)

    rf_clf = RandomForestClassifier(n_estimators=100)

    param_grid = {
        'max_depth': range(2,20,2),
        'criterion': ['gini','entropy'],
        'min_samples_split' : [2,5,10,20,50,100,150]
    }

    ss = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    gs_rf = GridSearchCV(rf_clf,param_grid, cv=ss.split(X_train,y_train), scoring='roc_auc', n_jobs=-1)
    gs_rf.fit(X_train,y_train)
    
    
    print('Best parameters: ', gs_rf.best_params_)
    print('Best score: ', gs_rf.best_score_)
    print_report(gs_rf,X_valid,y_valid, 'RandomForest', imputer)

In [None]:
Random_Forest(train_data_knn_imputed, 'KNN')

In [None]:
Random_Forest(others_train_data, 'Manuel_Mapping')

## 3.3 Model 3: XGBoost <a class="anchor" id="chapter3.3"></a>

In [None]:
# gamma (alias: min_split_loss in XGBoost): minimum loss reduction required to make a further partition on a leaf node of the tree.
# The larger gamma is, the more conservative the algorithm will be.

# colsample_bytree: the subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.


def XGBoost_(df, imputer):

    X = df.drop(['target'], axis=1)
    y = df['target']


    X_train, X_valid, y_train, y_valid = train_test_split(X,y, test_size=0.2, stratify= y)
    
    xgb_clf = xgb.XGBClassifier(use_label_encoder=False)


    parameters = {
         "eta"    : [0.01, 0.05, 0.10] ,
         "max_depth"        : [ 5, 6, 8],
         "gamma"            : [ 0.3, 0.4, 0.5 ],
         "colsample_bytree" : [ 0.4, 0.5 , 0.7 ]
         }


    gs_xgboost = GridSearchCV(xgb_clf, parameters, n_jobs=-1, scoring='roc_auc', cv=3)
    gs_xgboost.fit(X_train,y_train)
    
    print('Best parameters: ', gs_xgboost.best_params_)
    print('Best score: ', gs_xgboost.best_score_)
    print_report(gs_xgboost,X_valid,y_valid, 'XGBoost', imputer)
    
    return gs_xgboost
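
The cost of the grid search follows directly from the size of the parameter grid. The sketch below counts the combinations for the grid used in `XGBoost_`, with `cv=3` as in the function.

```python
from itertools import product

# Same grid as in XGBoost_
parameters = {
    "eta": [0.01, 0.05, 0.10],
    "max_depth": [5, 6, 8],
    "gamma": [0.3, 0.4, 0.5],
    "colsample_bytree": [0.4, 0.5, 0.7],
}
cv_folds = 3

# Every combination of one value per parameter
combinations = list(product(*parameters.values()))

# Each combination is fitted once per fold
n_fits = len(combinations) * cv_folds

print(len(combinations), n_fits)
```

With the default `refit=True`, `GridSearchCV` adds one final fit of the best combination on the full training split.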

In [None]:
xgboost_knn_model = XGBoost_(train_data_knn_imputed, 'KNN')

In [None]:
xgboost = XGBoost_(others_train_data, 'Manuel_Mapping')

## 3.4. Model 4: LightGBM <a class="anchor" id="chapter3.4"></a>

- For LightGBM, the dataset is manually encoded and missing values are left as NaN.

In [None]:
def LightGBM_(df, imputer):
    X = df.drop(['target'], axis=1)
    y= df['target']

    cat_features = ['city', 'gender', 'enrolled_university', 'education_level', 'major_discipline', 'company_size', 'company_type']

    X_train, X_valid, y_train, y_valid = train_test_split(X,y, test_size=0.2, shuffle=True, stratify=y, random_state=1301)
    
    train_lgbm_dataset_format = lgb.Dataset(X_train, y_train, categorical_feature=cat_features)
    valid_lgbm_dataset_format = lgb.Dataset(X_valid, y_valid, categorical_feature=cat_features)
    
    params = {'objective':'binary',
          'metric' : 'auc',
          'boosting_type' : 'gbdt',
          'colsample_bytree' : 0.93,
          'num_leaves' : 50,
          'max_depth' : -1,
          'n_estimators' : 1000,
          'min_child_samples': 200, 
          'min_child_weight': 0.08,
          'reg_alpha': 2,
          'reg_lambda': 5,
          'subsample': 0.9,
          'verbose' : -1,
          'num_threads' : 4,
          'learning_rate': 0.015,
          'random_seed' : 100
        }
    
    lgbm = lgb.train(params,
                 train_lgbm_dataset_format,
                 3000,
                 valid_sets=valid_lgbm_dataset_format,
                 early_stopping_rounds= 40,
                 verbose_eval= 10
                 )

    print_report(lgbm, X_valid, y_valid, model='LGBM', imputer=imputer)
    
    return lgbm

In [None]:
lgbm = LightGBM_(lgbm_train_data, 'Manuel_Mapping')

# 4. Model Comparisons <a class="anchor" id="chapter4"></a>

In [None]:
results = {'Timestamp': timestamp, 'Model':model_list, 'Imputer': imputer_list, 'AUC':auc_scores, 'Accuracy':accuracy_scores, 'Precision': precision_scores, 'Recall': recall_scores, 'F1_Score': f1_scores}
results = pd.DataFrame(results)
results

- Based on the AUC score, the best model is **XGBoost** trained on the manually mapped dataset.
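
Picking the best row programmatically is a one-liner on a frame shaped like `results`. The scores below are placeholders for illustration only and are not the results of this notebook.

```python
import pandas as pd

# Placeholder scores for illustration only; NOT the results of this notebook
results_demo = pd.DataFrame({
    'Model': ['A', 'B', 'C'],
    'Imputer': ['KNN', 'Manuel_Mapping', 'KNN'],
    'AUC': [0.70, 0.80, 0.75],
})

# Sort by AUC, highest first, and take the top row
best = results_demo.sort_values('AUC', ascending=False).iloc[0]

print(best['Model'], best['Imputer'], best['AUC'])
```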

# 5. Feature Importances <a class="anchor" id="chapter5"></a>

- Feature importances based on the **XGBoost** model.

In [None]:
feature_importances = pd.concat(
    [pd.DataFrame(xgboost.best_estimator_.feature_importances_, columns=['Importances']),
     pd.Series(others_train_data.drop(['target'], axis=1).columns, name='Features')], axis=1).sort_values(by='Importances', ascending=False)

In [None]:
plt.figure(figsize=(10,5))
ax = sns.barplot(x='Features', y='Importances', data=feature_importances)

plt.xticks(rotation=30)
ax.set_title('Feature Importances', fontsize='18')

# 6. Predict aug_test.csv <a class="anchor" id="chapter6"></a>

- **others_test_data** is a dataframe that holds preprocessed aug_test.csv records 

In [None]:
others_test_data

In [None]:
y_pred = xgboost.predict_proba(others_test_data.drop(['enrollee_id'], axis=1))[:,1]

In [None]:
submission = pd.concat([others_test_data['enrollee_id'], pd.Series(y_pred, name='target')], axis=1)
submission

In [None]:
submission.to_csv('submission.csv',index=False)