![AIcrowd-Logo](https://raw.githubusercontent.com/AIcrowd/AIcrowd/master/app/assets/images/misc/aicrowd-horizontal.png)

# Code for [LABOR Challenge](https://www.aicrowd.com/challenges/labor) on AIcrowd
#### Author: Team BlitzCA

## Install Necessary Packages 📚

In [3]:
!pip install numpy
!pip install pandas
!pip install scikit-learn
!pip install catboost==0.22

Collecting catboost==0.22
  Downloading https://files.pythonhosted.org/packages/94/ec/12b9a42b2ea7dfe5b602f235692ab2b61ee1334ff34334a15902272869e8/catboost-0.22-cp36-none-manylinux1_x86_64.whl (64.4MB)
Installing collected packages: catboost
Successfully installed catboost-0.22


## Download Data
The first step is to download our train and test data. We will train a model on the train data, make predictions on the test data, and then submit those predictions to the challenge.


In [4]:
# Download the datasets into a fresh data/ directory
!rm -rf data
!mkdir data 
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-practice-challenges/public/labor/v0.1/test.csv
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-practice-challenges/public/labor/v0.1/train.csv
!mv test.csv data/test.csv
!mv train.csv data/train.csv

--2020-07-25 09:40:55--  https://s3.eu-central-1.wasabisys.com/aicrowd-practice-challenges/public/labor/v0.1/test.csv
Resolving s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)... 130.117.252.11, 130.117.252.10, 130.117.252.12, ...
Connecting to s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)|130.117.252.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 554341 (541K) [text/csv]
Saving to: ‘test.csv’


2020-07-25 09:40:56 (1.15 MB/s) - ‘test.csv’ saved [554341/554341]

--2020-07-25 09:40:58--  https://s3.eu-central-1.wasabisys.com/aicrowd-practice-challenges/public/labor/v0.1/train.csv
Resolving s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)... 130.117.252.17, 130.117.252.10, 130.117.252.11, ...
Connecting to s3.eu-central-1.wasabisys.com (s3.eu-central-1.wasabisys.com)|130.117.252.17|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1803530 (1.7M) [text/csv]
Saving to: ‘train.csv’


2020-07-25 


## Import packages

In [22]:
import pandas as pd
import numpy as np
import warnings
from sklearn.model_selection import KFold, StratifiedKFold
from catboost import CatBoostRegressor, CatBoostClassifier
from sklearn.metrics import f1_score, confusion_matrix

warnings.filterwarnings('ignore')
%matplotlib inline
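# Use the full option name: the short form 'max_column' can match several options in newer pandas
pd.set_option('display.max_columns', 100)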

## Load Data




In [11]:
train_path = "data/train.csv" 
test_path = "data/test.csv"

In [12]:
train = pd.read_csv(train_path)
test = pd.read_csv(test_path)

## Visualize the data 👀

In [13]:
train.head()

| Unnamed: 0 | duration | wage-increase-first-year | wage-increase-second-year | wage-increase-third-year | cost-of-living-adjustment | working-hours | pension | standby-pay | shift-differential | education-allowance | statutory-holidays | vacation | longterm-disability-assistance | contribution-to-dental-plan | bereavement-assistance | contribution-to-health-plan | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 3.597483 | 4.0 | 5.0 | 0 | 40.0 | 2 | 8.32238 | 3.0 | 0 | 11.0 | 0 | 1 | 1 | 1 | 2 | 1 |
| 1 | 3 | 3.968619 | 4.0 | 5.1 | 1 | 40.0 | 2 | 2.0 | 3.0 | 0 | 12.0 | 1 | 1 | 2 | 1 | 2 | 1 |
| 2 | 2 | 6.328544 | 5.08968 | 5.0 | 0 | 35.915468 | 2 | 2.0 | 4.0 | 0 | 12.0 | 1 | 1 | 1 | 1 | 2 | 1 |
| 3 | 2 | 4.348288 | 5.336979 | 5.0 | 0 | 37.651356 | 2 | 2.0 | 3.0 | 0 | 15.0 | 1 | 1 | 2 | 1 | 2 | 1 |
| 4 | 2 | 3.530789 | 2.892247 | 2.029438 | 0 | 40.0 | 2 | 2.0 | 4.0 | 0 | 11.0 | 1 | 1 | 1 | 1 | 2 | 1 |


## Create Features




In [14]:
len_train = len(train)
data = pd.concat([train, test])

In [15]:
data['duration_first_year_mean'] = data.groupby('duration')['wage-increase-first-year'].transform('mean')
data['duration_first_year_std'] = data.groupby('duration')['wage-increase-first-year'].transform('std')

data['duration_second_year_mean'] = data.groupby('duration')['wage-increase-second-year'].transform('mean')
data['duration_second_year_std'] = data.groupby('duration')['wage-increase-second-year'].transform('std')

data['duration_third_year_mean'] = data.groupby('duration')['wage-increase-third-year'].transform('mean')
data['duration_third_year_std'] = data.groupby('duration')['wage-increase-third-year'].transform('std')

data['pension_standby-pay_mean'] = data.groupby('pension')['standby-pay'].transform('mean')
data['pension_standby-pay_std'] = data.groupby('pension')['standby-pay'].transform('std')

data['pension_working-hours_mean'] = data.groupby('pension')['working-hours'].transform('mean')

data['assistance'] = data['longterm-disability-assistance'] + data['bereavement-assistance']
data['contribution_plans'] = data['contribution-to-dental-plan'] + data['contribution-to-health-plan']
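
The group statistics above rely on `groupby(...).transform(...)`, which returns one value per original row rather than one per group, so the result can be assigned straight back as a new column. A minimal sketch on a made-up three-row frame (the values are illustrative and not taken from the challenge files):

```python
import pandas as pd

# Toy data: illustrative values only, not from the challenge files
toy = pd.DataFrame({
    "duration": [1, 1, 2],
    "wage-increase-first-year": [2.0, 4.0, 5.0],
})

# transform('mean') broadcasts each group's mean back onto its rows,
# so the result is aligned with the original index
toy["duration_first_year_mean"] = (
    toy.groupby("duration")["wage-increase-first-year"].transform("mean")
)

# The sample standard deviation of a single-row group is undefined (NaN)
toy["duration_first_year_std"] = (
    toy.groupby("duration")["wage-increase-first-year"].transform("std")
)

print(toy)
```

Because `data` is the concatenation of `train` and `test`, the statistics in the cell above are computed over both sets together.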

In [16]:
def workers(x):
    if x >= 35.0:
        return 'Hard_workers'
    else:
        return 'Lazy_workers'
data['workers_cat'] = data['working-hours'].map(workers)
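
The same 35-hour threshold can also be applied to the whole column at once. The sketch below compares the row-wise `map` with a vectorized `np.where` on a few made-up values of `working-hours`; it is an alternative formulation, not a change to the feature itself.

```python
import numpy as np
import pandas as pd

# Illustrative working-hours values, including the boundary case 35.0
hours = pd.Series([40.0, 35.0, 34.9, 37.651356], name="working-hours")

def workers(x):
    # Same rule as in the notebook cell: 35 hours or more counts as 'Hard_workers'
    return 'Hard_workers' if x >= 35.0 else 'Lazy_workers'

mapped = hours.map(workers)

# Vectorized equivalent: one comparison over the whole column
vectorized = np.where(hours >= 35.0, 'Hard_workers', 'Lazy_workers')

print(mapped.tolist())
print(vectorized.tolist())
```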

In [17]:
data = data[['bereavement-assistance', 'class', 'contribution-to-dental-plan',
       'contribution-to-health-plan', 'cost-of-living-adjustment', 'duration',
       'education-allowance', 'longterm-disability-assistance', 'pension',
       'shift-differential', 'standby-pay', 'statutory-holidays', 'vacation',
       'wage-increase-first-year', 'wage-increase-second-year',
       'wage-increase-third-year', 'working-hours', 'duration_first_year_mean',
       'duration_first_year_std', 'duration_second_year_mean',
       'duration_second_year_std', 'duration_third_year_mean',
       'duration_third_year_std', 'pension_standby-pay_mean',
       'pension_standby-pay_std', 'pension_working-hours_mean', 'assistance',
       'contribution_plans', 'workers_cat']]

In [18]:
train = data[:len_train]
test = data[len_train:]

In [19]:
X = train.drop(columns='class')
y = train['class']
tes = test.drop(columns='class')

In [20]:
# Positional indices of the non-float columns, which are passed to CatBoost as categorical features
cate_features_index = np.where(X.dtypes != float)[0]
cate_features_index

array([ 0,  1,  2,  3,  4,  5,  6,  7, 11, 25, 26, 27])
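
Positional indices such as the array above are hard to read back. A small sketch, on a made-up frame that borrows three column names from the notebook, shows how the same dtype test can be mapped to column names for a quick check of which features end up categorical:

```python
import numpy as np
import pandas as pd

# Toy frame: column names borrowed from the notebook, values are illustrative
toy = pd.DataFrame({
    "pension": [2, 1],                                # int -> treated as categorical
    "working-hours": [40.0, 35.9],                    # float -> treated as numeric
    "workers_cat": ["Hard_workers", "Hard_workers"],  # string -> treated as categorical
})

# Same test as in the notebook: every non-float column is selected
idx = np.where(toy.dtypes != float)[0]

# Translate positions into names to inspect the selection
names = toy.columns[idx].tolist()
print(idx, names)
```

Note that this rule marks every integer column as categorical, including counts such as `duration`; whether that helps is a modelling choice the notebook does not evaluate separately.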

## Train Model and Predict




In [23]:
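err = []           # per-fold validation scores
y_pred_totcb = []  # per-fold predictions on the test set

# random_state is omitted: it has no effect unless shuffle=True,
# and recent scikit-learn versions raise an error if it is set without shuffling.
fold = StratifiedKFold(n_splits=10)
for train_index, test_index in fold.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    m1 = CatBoostClassifier(iterations=5000, learning_rate=0.1, random_seed=1234, eval_metric='F1')
    # eval_set holds the training fold (logged as `test`) and the held-out fold (logged as `test1`)
    m1.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)],
           early_stopping_rounds=100, verbose=100, cat_features=cate_features_index)
    preds = m1.predict(X_test)
    # Note: the value printed as "err" is the square root of the fold's F1 score, not an error
    score = np.sqrt(f1_score(y_test, preds))
    print("err: ", score)
    err.append(score)
    # Predict on the test set with this fold's model
    p2 = m1.predict(tes)
    y_pred_totcb.append(p2)
np.mean(err)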

0:	learn: 0.9272901	test: 0.9272901	test1: 0.9200574	best: 0.9200574 (0)	total: 146ms	remaining: 12m 11s
100:	learn: 0.9776520	test: 0.9778107	test1: 0.9704198	best: 0.9709385 (94)	total: 8.1s	remaining: 6m 32s
200:	learn: 0.9818770	test: 0.9817411	test1: 0.9718377	best: 0.9720697 (199)	total: 16.2s	remaining: 6m 26s
300:	learn: 0.9836092	test: 0.9832352	test1: 0.9730375	best: 0.9732569 (250)	total: 24s	remaining: 6m 15s
400:	learn: 0.9849836	test: 0.9845305	test1: 0.9735147	best: 0.9739919 (371)	total: 32.2s	remaining: 6m 9s
Stopped by overfitting detector  (100 iterations wait)

bestTest = 0.9739918874
bestIteration = 371

Shrink model to first 372 iterations.
err:  0.9869102732152069
0:	learn: 0.9266895	test: 0.9266895	test1: 0.9270858	best: 0.9270858 (0)	total: 91.8ms	remaining: 7m 39s
100:	learn: 0.9773792	test: 0.9773696	test1: 0.9731144	best: 0.9735903 (80)	total: 8.16s	remaining: 6m 35s
200:	learn: 0.9816852	test: 0.9810210	test1: 0.9762131	best: 0.9764454 (190)	total: 16.2s	re

0.9884380889321569
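
The loop above uses `StratifiedKFold` so that each validation fold keeps roughly the same class proportions as the full training set. A minimal sketch on made-up labels (8 of class 0, 4 of class 1) illustrates this; `shuffle=True` is an addition here, used so that `random_state` actually takes effect.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: illustrative only, 8 samples of class 0 and 4 of class 1
y_toy = np.array([0] * 8 + [1] * 4)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

# shuffle=True is required for random_state to have any effect
fold = StratifiedKFold(n_splits=4, shuffle=True, random_state=1234)

fold_counts = []
for train_idx, val_idx in fold.split(X_toy, y_toy):
    # Count how many samples of each class land in this validation fold
    counts = np.bincount(y_toy[val_idx], minlength=2)
    fold_counts.append(counts.tolist())
    print(counts)
```

Each validation fold receives two samples of class 0 and one of class 1, matching the 2:1 ratio of the full label vector.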

In [24]:
predic = np.mean(y_pred_totcb, 0)

In [25]:
submission = pd.DataFrame(predic)

In [26]:
submission = submission.astype(int)

In [27]:
submission.to_csv('best.csv', header=['class'],index=False)
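
One detail of the last cells is worth spelling out: averaging label predictions across folds and then casting with `astype(int)` truncates toward zero. Assuming the labels are 0 and 1 (an assumption; the notebook does not list the class values), a test row is then written as 1 only when every fold predicted 1. The sketch below contrasts this with a majority vote on hypothetical fold predictions; the source does not report whether a majority vote would score differently.

```python
import numpy as np

# Hypothetical predictions for 3 test rows from 4 folds (labels assumed to be 0/1)
fold_preds = [
    np.array([1, 0, 1]),
    np.array([1, 0, 1]),
    np.array([1, 1, 0]),
    np.array([1, 0, 1]),
]

# Same averaging as in the notebook: mean label per test row across folds
mean_pred = np.mean(fold_preds, 0)

# astype(int) truncates, so only unanimous rows stay at 1
truncated = mean_pred.astype(int)

# Majority vote: a row is 1 when at least half of the folds predicted 1
majority = (mean_pred >= 0.5).astype(int)

print(mean_pred, truncated, majority)
```

In this toy case the third row is predicted as 1 by three of four folds, yet truncation writes it as 0 while the majority vote keeps it at 1.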