<h2 style=color:green align='left'> Table of Contents </h2>

##### 1) Load Required Libraries
##### 2) Read Data
##### 3) EDA (Exploratory Data Analysis)

>    3.1) Drop Unwanted Columns

>    3.2) Missing Values

>    3.3) Variable Analysis

>    3.4) Outliers

>    3.5) Relation between Features 

>    3.6) Skewness and Kurtosis 

##### 4) Model Building and Evaluation

>    4.1) XGBoost

>    4.2) LightAutoML

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> 1) Load Required Libraries </h1>

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

plt.style.use("fivethirtyeight")
sns.set_style("darkgrid")

In [2]:
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, auc, roc_curve, roc_auc_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> 2) Read Data </h1>

In [3]:
train = pd.read_csv("/kaggle/input/tabular-playground-series-may-2021/train.csv")
test = pd.read_csv("/kaggle/input/tabular-playground-series-may-2021/test.csv")
sub = pd.read_csv("/kaggle/input/tabular-playground-series-may-2021/sample_submission.csv")

all_df = pd.concat([train, test], axis=0)
all_df = all_df.drop(['id', 'target'], axis=1)

In [4]:
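# Encode the target classes as integers 0..n_classes-1
lbe = LabelEncoder()
train['target'] = lbe.fit_transform(train['target'])

# Label-encode every feature column, fitted on train and test together
le = LabelEncoder()
for col in all_df.columns:
    all_df[col] = le.fit_transform(all_df[col])

# Compress the encoded values with log(1 + x)
for col in all_df.columns:
    all_df[col] = np.log1p(all_df[col])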
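
To see what the two preprocessing loops do, here is a minimal sketch on a made-up two-column frame (the column names and values are illustrative assumptions, not taken from the competition data). Each column is first mapped to the integer rank of its sorted unique values, then passed through `np.log1p`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for two feature columns (values are made up for illustration).
toy = pd.DataFrame({"feature_0": [0, 3, 3, 10], "feature_1": [5, 5, 0, 1]})

encoded = toy.copy()
for col in encoded.columns:
    # LabelEncoder maps the sorted unique values of a column to 0..n_unique-1.
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

# log1p(x) = log(1 + x) compresses larger ranks and keeps 0 at 0.
transformed = np.log1p(encoded)

print(encoded)
print(transformed.round(3))
```

Note that after label encoding the original magnitudes are gone: the value 10 in `feature_0` becomes rank 2, and only the ordering of the values survives.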

In [5]:
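# Split the combined frame back into its train and test parts (positional slices).
# .copy() avoids assigning the target column to a view of all_df.
train_df = all_df[:len(train)].copy()
train_df['target'] = train['target']
test_df = all_df[len(train):].copy()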

<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> 4) Model Building and Evaluation </h1>

In [6]:
# Independent variables (features)
X = train_df.drop('target', axis=1)

# Dependent variable (target)
y = train_df['target']

In [7]:
# Split the data 80:20 into a training set and a hold-out set.
# test_size=0.2 keeps 20% of the rows for evaluation.
# random_state fixes the shuffle seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [8]:
print("Length of X_train is: {X_train}".format(X_train = len(X_train)))
print("Length of X_test is: {X_test}".format(X_test = len(X_test)))
print("Length of y_train is: {y_train}".format(y_train = len(y_train)))
print("Length of y_test is: {y_test}".format(y_test = len(y_test)))

Length of X_train is: 80000
Length of X_test is: 20000
Length of y_train is: 80000
Length of y_test is: 20000
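
The split above is purely random. If the classes are imbalanced, passing `stratify=y` to `train_test_split` keeps the class mix the same in both parts. A small sketch on made-up labels (the 10/60/20/10 mix is an assumption for illustration, not the real class distribution):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with an uneven class mix (made up for illustration).
y_toy = np.array([0] * 10 + [1] * 60 + [2] * 20 + [3] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

# Per-class row counts in the 20-row hold-out part.
test_counts = np.bincount(y_te, minlength=4)
print(test_counts)
```

In the notebook the equivalent change would be adding `stratify=y` to the existing call; whether it helps depends on how uneven the target really is.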


<h1 style="background-color:orange; font-family:newtimeroman; font-size:160%; text-align:left;"> 4.1) XGBoost </h1>

In [9]:
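# The target is already integer-encoded above, so XGBoost's own label encoder is not needed.
xgb = XGBClassifier(random_state=42, use_label_encoder=False)
# Fit on the training split only, so that X_test stays unseen for evaluation.
xgb = xgb.fit(X_train, y_train)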



In [10]:
y_pred_xgb = xgb.predict_proba(X_test)
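
The hold-out probabilities in `y_pred_xgb` are not scored anywhere in the notebook. The LightGBM log further down reports `multi_logloss`, so a natural check is the multiclass log loss. A minimal NumPy sketch follows; `multiclass_log_loss` is a hypothetical helper name, and `sklearn.metrics.log_loss` computes the same quantity.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    y_true = np.asarray(y_true)
    # Clip to avoid log(0), then renormalise so each row is still a distribution.
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)
    # Pick, for every row, the probability of its true (integer-encoded) class.
    true_class_proba = proba[np.arange(len(y_true)), y_true]
    return float(-np.log(true_class_proba).mean())

# Toy check: a uniform guess over four classes.
toy_labels = np.array([0, 1, 2, 3])
uniform = np.full((4, 4), 0.25)
loss_uniform = multiclass_log_loss(toy_labels, uniform)
print(loss_uniform)
```

In the notebook this would be called as `multiclass_log_loss(y_test, y_pred_xgb)`. Uniform guessing over four classes scores log(4) ≈ 1.386, which gives a baseline to read the value against. The result is only an honest estimate when the model has not been fit on the hold-out rows.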

In [11]:
# Predict on the preprocessed test_df (50 features). Passing the raw `test` frame, which still
# contains `id` (51 columns), fails with:
# "XGBoostError: ... (51 vs. 50) : Number of columns in data must equal to trained model."
y_pred_xgb_test = xgb.predict_proba(test_df)
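
A column mismatch like 51 vs. 50 (the raw `test` frame with its `id` column against the 50 trained features) can be caught before calling `predict_proba`. The sketch below uses a hypothetical helper, `align_features`, on a made-up frame. In this notebook, alignment alone would not be sufficient, because the raw `test` frame has also not gone through the encoding and `log1p` steps; `test_df` has.

```python
import pandas as pd

def align_features(frame, trained_columns):
    """Return `frame` restricted to `trained_columns`, in the same order."""
    trained_columns = list(trained_columns)
    missing = [c for c in trained_columns if c not in frame.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    # Extra columns such as `id` are dropped by selecting only the trained ones.
    return frame.loc[:, trained_columns]

# Toy frame with an extra `id` column (values are made up for illustration).
raw = pd.DataFrame({"id": [100, 101], "feature_0": [1, 0], "feature_1": [2, 3]})
trained_columns = ["feature_0", "feature_1"]

aligned = align_features(raw, trained_columns)
print(aligned.shape)
```

Here `trained_columns` would be `X.columns` from the modelling section.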



<h1 style="background-color:orange; font-family:newtimeroman; font-size:160%; text-align:left;"> 4.2) LightAutoML </h1>

In [12]:
pip install -U lightautoml

Collecting lightautoml
  Downloading LightAutoML-0.2.13-py3-none-any.whl (250 kB)
Collecting importlib-metadata<2.0,>=1.0
  Downloading importlib_metadata-1.7.0-py2.py3-none-any.whl (31 kB)
Collecting json2html
  Downloading json2html-1.3.0.tar.gz (7.0 kB)
Collecting autowoe>=1.2
  Downloading AutoWoE-1.2.5-py3-none-any.whl (204 kB)
Collecting lightgbm<3.0,>=2.3
  Downloading lightgbm-2.3.1-py2.py3-none-manylinux1_x86_64.whl (1.2 MB)
Collecting log-calls
  Downloading log_calls-0.3.2.tar.gz (232 kB)
Collecting poetry-core<2.0.0,>=1.0.0
  Downloading poetry_core-1.0.3-py2.py3-none-any.whl (424 kB)
Collecting efficientnet-pytorch
  Downloading efficientnet_pytor

In [13]:
# LightAutoML presets and task definition, plus the log-loss metric
from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.tasks import Task
from sklearn.metrics import log_loss

In [14]:
N_THREADS = 4        # number of threads for LightGBM and linear models
N_FOLDS = 5          # number of cross-validation folds for AutoML
RANDOM_STATE = 2021  # fixed seed for reproducibility
TEST_SIZE = 0.2      # hold-out fraction for the metric check
TIMEOUT = 14400      # AutoML time budget in seconds (4 hours)

In [15]:
%%time

automl = TabularUtilizedAutoML(
    task = Task('multiclass'),
    timeout = TIMEOUT,
    cpu_limit = N_THREADS,
    reader_params = {'n_jobs': N_THREADS},
    tuning_params = {'max_tuning_iter': 20, 'max_tuning_time': 50},
)

CPU times: user 28.2 ms, sys: 1.02 ms, total: 29.2 ms
Wall time: 28 ms


In [16]:
target_column = 'target'

roles = {
    'target': target_column
}

lightml_pred = automl.fit_predict(train_df, roles = roles)
print('lightml_pred:\n{}\nShape = {}'.format(lightml_pred[:10], lightml_pred.shape))

Current random state: {'reader_params': {'random_state': 42}, 'general_params': {'return_all_predictions': False}}
Found reader_params in kwargs, need to combine
Merged variant for reader_params = {'n_jobs': 4, 'random_state': 42}
Start automl preset with listed constraints:
- time: 14399.996123075485 seconds
- cpus: 4 cores
- memory: 16 gb

Train data shape: (100000, 51)
Feats was rejected during automatic roles guess: []


Layer 1 ...
Train process start. Time left 14357.884085178375 secs
Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...

===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====

Linear model: C = 1e-05 score = -1.1102790553703905
Linear model: C = 5e-05 score = -1.104653170647472
Linear model: C = 0.0001 score = -1.1038413102462887
Linear model: C = 0.0005 score = -1.1037279969975353
Linear model: C = 0.001 score = -1.1038292608886957
Linear model: C = 0.005 score = -1.1038546926751733

===== Start working with fold 1 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====

L

Copying TaskTimer may affect the parent PipelineTimer, so copy will create new unlimited TaskTimer


Start fitting Lvl_0_Pipe_1_Mod_1_LightGBM ...

===== Start working with fold 0 for Lvl_0_Pipe_1_Mod_1_LightGBM =====

Training until validation scores don't improve for 200 rounds
[100]	valid's multi_logloss: 1.09824
[200]	valid's multi_logloss: 1.10207
[300]	valid's multi_logloss: 1.10801
Early stopping, best iteration is:
[100]	valid's multi_logloss: 1.09824
Lvl_0_Pipe_1_Mod_1_LightGBM fitting and predicting completed
Start fitting Lvl_0_Pipe_1_Mod_1_LightGBM ...

===== Start working with fold 0 for Lvl_0_Pipe_1_Mod_1_LightGBM =====

Training until validation scores don't improve for 200 rounds
[100]	valid's multi_logloss: 1.09795
[200]	valid's multi_logloss: 1.09972
[300]	valid's multi_logloss: 1.10363
Early stopping, best iteration is:
[118]	valid's multi_logloss: 1.09788
Lvl_0_Pipe_1_Mod_1_LightGBM fitting and predicting completed
Start fitting Lvl_0_Pipe_1_Mod_1_LightGBM ...

===== Start working with fold 0 for Lvl_0_Pipe_1_Mod_1_LightGBM =====

Training until validation scores d

In [17]:
%%time

test_pred = automl.predict(test_df)
print('Prediction for test set:\n{}\nShape = {}'.format(test_pred[:5], test_pred.shape))

Prediction for test set:
array([[0.0972752 , 0.6254245 , 0.1639225 , 0.11337782],
       [0.08227223, 0.70333004, 0.12477012, 0.08962765],
       [0.08403666, 0.6446773 , 0.1732978 , 0.09798831],
       [0.08272669, 0.55320585, 0.27393508, 0.09013242],
       [0.0732189 , 0.6282202 , 0.18970132, 0.10885958]], dtype=float32)
Shape = (50000, 4)
CPU times: user 7min 24s, sys: 555 ms, total: 7min 25s
Wall time: 2min 8s


<h1 style="background-color:LimeGreen; font-family:newtimeroman; font-size:200%; text-align:center; border-radius: 15px 50px;"> Submission </h1>

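# Name the probability columns after the original class labels
# (lbe.classes_; xgb.classes_ only holds the encoded integers).
submission = pd.DataFrame(y_pred_xgb_test, columns=lbe.classes_)
submission

# The ids come from the raw test frame.
submission.insert(0, 'id', test['id'])
submission

submission.to_csv("submission.csv", index = False)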

In [18]:
sub.iloc[:, 1:] = test_pred.data
sub.to_csv('light_automl_1.csv', index = False)
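
Before writing a submission file it can help to check that the probability array has the expected shape and that every row sums to one. The sketch below is one way to do that; `build_submission` is a hypothetical helper, and the ids, probabilities, and `Class_1`..`Class_4` column names in the toy call are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def build_submission(ids, proba, class_names):
    """Assemble an id + per-class probability frame after basic sanity checks."""
    proba = np.asarray(proba, dtype=float)
    class_names = list(class_names)
    if proba.shape != (len(ids), len(class_names)):
        raise ValueError(f"Unexpected probability shape {proba.shape}")
    # float32 predictions rarely sum to exactly 1, so allow a small tolerance.
    if not np.allclose(proba.sum(axis=1), 1.0, atol=1e-4):
        raise ValueError("Each row of probabilities should sum to 1")
    out = pd.DataFrame(proba, columns=class_names)
    out.insert(0, "id", np.asarray(ids))
    return out

# Toy inputs (made up for illustration).
toy_ids = [100000, 100001]
toy_proba = [[0.1, 0.6, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]]
toy_classes = ["Class_1", "Class_2", "Class_3", "Class_4"]

toy_submission = build_submission(toy_ids, toy_proba, toy_classes)
print(toy_submission)
```

With the objects from this notebook, the call would look like `build_submission(test['id'], test_pred.data, sub.columns[1:])`, followed by `to_csv(..., index=False)`.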


In [19]:
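# Equivalent column-wise assignment. Use the underlying ndarray (test_pred.data): assigning the
# LightAutoML dataset object itself raises "AssertionError: Numpy dataset support only np.ndarray features".
sub[sub.columns[1:]] = test_pred.data
sub.to_csv('alternative.csv', index=False)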