This notebook presents an exploratory analysis of the [Kaggle Tabular Playground of May 2021](https://www.kaggle.com/c/tabular-playground-series-may-2021/data) data. It also uses the tools provided by [xplotter](https://github.com/ThiagoPanini/xplotter) and [mlcomposer](https://github.com/ThiagoPanini/mlcomposer), two Python packages I wrote and published on PyPI. They are my effort at coding useful functions that make Exploratory Data Analysis and the Machine Learning process a lot easier for Data Scientists and Data Analysts, delivering customizable matplotlib/seaborn charts with just a few lines of code. I really hope you all enjoy it!

<div align="center">
    <img src="https://i.imgur.com/5XFP1Ha.png" height=300 width=200 alt="xplotter Logo">
    <img src="https://i.imgur.com/MIcPH8g.png" width=450 height=450 alt="mlcomposer logo">
</div>

<a id="top"></a>
<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Table of Contents</h3>

* [1. Libraries and Project Variables](#1)
* [2. Reading the Data](#2)
* [3. EDA: Exploring Insights with xplotter](#3)
    - [3.1 Target Class Balance](#3.1)
    - [3.2 Correlation Matrix](#3.2)
    - [3.3 Distribution Analysis](#3.3)
    - [3.4 Categorical Countplots](#3.4)
* [4. ML: Training Models with mlcomposer](#4)
    - [4.1 Transformers Module](#4.1)
        - [4.1.1 Selecting Features](#4.1.1)
        - [4.1.2 Target Transformation](#4.1.2)
        - [4.1.3 Split the Data](#4.1.3)
        - [4.1.4 Prep Pipelines](#4.1.4)
    - [4.2 Trainer Module](#4.2)
        - [4.2.1 Initial Setup](#4.2.1)
        - [4.2.2 Training Models](#4.2.2)
        - [4.2.3 Evaluating Performance](#4.2.3)
* [5. Hyperparameter Tuning](#5)
    - [5.1 Feature Selection](#5.1)
* [6. Submitting Results](#6)

<a id="1"></a>
<font color="darkslateblue" size=+2.5><b>1. Libraries and Project Variables</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Let's get to work by importing libraries and defining project variables. This kicks off the implementation and helps us organize the code.

In [None]:
!pip install xplotter --upgrade
!pip install mlcomposer --upgrade

In [None]:
# Standard python libraries
import os
import numpy as np  # used later for np.unique, np.linspace and np.arange
import pandas as pd
from warnings import filterwarnings
filterwarnings('ignore')

# Importing the xplotter plotting functions
from xplotter.insights import *

In [None]:
# Path variables
PROJECT_PATH = '../input/tabular-playground-series-may-2021'
TRAIN_FILEPATH = os.path.join(PROJECT_PATH, 'train.csv')
TEST_FILEPATH = os.path.join(PROJECT_PATH, 'test.csv')

<a id="2"></a>
<font color="darkslateblue" size=+2.5><b>2. Reading the Data</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

After importing libraries and defining user variables, let's read the data and make the first contact with the content available.

In [None]:
# Reading the data
df_train = pd.read_csv(TRAIN_FILEPATH)
print(f'Data shape: {df_train.shape}')
df_train.head()

OK, we can see that the training data has 100,000 rows and 52 columns divided into:
* 1 id column
* 1 target column
* 50 data features (`feature_0` to `feature_49`)

>**Note from competitions page:** The dataset is used for this competition is synthetic, but based on a real dataset and generated using a CTGAN. The original dataset deals with predicting the category on an eCommerce product given various attributes about the listing. Although the features are anonymized, they have properties relating to real-world features.

Now that we have read the data and taken a first look at it, we can dig a little deeper into the features to extract useful information before training Machine Learning models. This is where the `xplotter` package comes into play: with `xplotter` we can call ready-made functions to visualize the content of a dataset in a faster and more beautiful way.

<a id="3"></a>
<font color="darkslateblue" size=+2.5><b>3. EDA: Exploring Insights with xplotter</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

The `xplotter` package was built precisely to make the work of data scientists easier when it comes to insights and exploratory data analysis. The next steps rely on the tools provided by the xplotter library to draw beautiful charts and get a deep understanding of our data.

<a id="3.1"></a>
<font color="dimgrey" size=+2.0><b>3.1 Target Class Balance</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

In [None]:
plot_donut_chart(df=df_train, col='target')

With one line of code, the `plot_donut_chart()` function from `xplotter` delivers a complete donut chart with information about our target class balance. The chart above shows how the data is distributed across the 4 target classes, and from it we can point out the following:

* The *Class_1* category is the one with the fewest rows (8,480)
* The *Class_2* category is the one with the most rows (57,497)

There is much more to explore but, for now, we can keep in mind that this imbalance will probably affect any Machine Learning classification model we train later. Let's keep going.
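
To put numbers on the imbalance, the two counts quoted above can be turned into shares and a ratio. This is only a quick arithmetic sketch based on those counts and the 100,000 training rows; the variable names are illustrative and not part of `xplotter`.

```python
# Counts quoted above for the smallest and the largest class
n_rows = 100_000
class_counts = {"Class_1": 8_480, "Class_2": 57_497}

# Share of the training set taken by each class
shares = {name: count / n_rows for name, count in class_counts.items()}

# How many Class_2 rows exist for each Class_1 row
imbalance_ratio = class_counts["Class_2"] / class_counts["Class_1"]

for name, share in shares.items():
    print(f"{name}: {share:.1%} of the rows")
print(f"Class_2 / Class_1 ratio: {imbalance_ratio:.2f}")
```

A model that always predicts *Class_2* would already be right for about 57% of the rows, so accuracy alone is a weak yardstick for this problem.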

<a id="3.2"></a>
<font color="dimgrey" size=+2.0><b>3.2 Correlation Matrix</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

In [None]:
# Creating a numerical target from original one
df_train_corr = pd.get_dummies(df_train)

# Plotting positive correlation matrix
target_list = ['target_Class_' + str(i) for i in range(1, 5)]
for target_class in target_list:
    # Creating a to_drop list for not considering other classes on correlation
    to_drop = ['target_Class_' + str(i) for i in range(1, 5)]
    to_drop.remove(target_class)
    
    # Applying xplotter function
    plot_corr_matrix(df=df_train_corr.drop(to_drop, axis=1), corr_col=target_class, n_vars=10)

One of the most powerful functions in the `xplotter` package is `plot_corr_matrix()`! With it, we can easily plot a beautiful correlation matrix with custom parameters. We just need to pass the DataFrame and the correlation column (`corr_col` parameter). Additionally, to make the process computationally cheaper, we can pass the `n_vars` parameter to keep only the top N features in the matrix.

The calls above show the 10 features most correlated with each target class. The for loop iterates over each target class column created by `pd.get_dummies()` and uses the `corr_col` parameter of `plot_corr_matrix()` to set a different correlation column as the target of the analysis. With this we can see how each feature relates to each class output.

The features have little "real world meaning", though, and the cells of the matrices show that the correlation values are low for every class. On the other hand, `plot_corr_matrix()` also lets us look at correlation from the **negative** perspective through the `corr` parameter. Let's see the top 10 features with the most negative correlation with each target class:

In [None]:
# Plotting negative correlation matrix
target_list = ['target_Class_' + str(i) for i in range(1, 5)]
for target_class in target_list:
    # Creating a to_drop list for not considering other classes on correlation
    to_drop = ['target_Class_' + str(i) for i in range(1, 5)]
    to_drop.remove(target_class)
    
    # Applying xplotter function
    plot_corr_matrix(df=df_train_corr.drop(to_drop, axis=1), corr='negative',
                     corr_col=target_class, n_vars=10)

In the same way, the sequence of plots above shows a correlation analysis for each target class from the negative perspective. Knowing which features move with or against each class is really important for guiding decisions along the project development.
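
The idea behind both views can be reproduced with plain pandas. The sketch below is an assumption about the general technique (one-hot encode the target, correlate each feature with one dummy column, keep the top N), not a copy of how `plot_corr_matrix()` is implemented; the toy data and the names `toy`, `target_col` and `corr_with_target` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for df_train (values are made up)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature_0": rng.integers(0, 5, size=200),
    "feature_1": rng.integers(0, 5, size=200),
    "feature_2": rng.integers(0, 5, size=200),
    "target": rng.choice(["Class_1", "Class_2"], size=200),
})

# One-hot encode the string target, as done with pd.get_dummies() above
toy_corr = pd.get_dummies(toy, dtype=float)

# Correlation of every feature with a single target dummy column
target_col = "target_Class_1"
feature_cols = [c for c in toy_corr.columns if c.startswith("feature_")]
corr_with_target = toy_corr[feature_cols].corrwith(toy_corr[target_col])

# Top-N most positive and most negative correlations
n_vars = 2
top_positive = corr_with_target.nlargest(n_vars)
top_negative = corr_with_target.nsmallest(n_vars)
print(top_positive)
print(top_negative)
```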

<a id="3.3"></a>
<font color="dimgrey" size=+2.0><b>3.3 Distribution Analysis</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

A good way to see the distributions of numerical features is through a histogram, boxplot or another distribution chart. In the `xplotter` package, it's possible to plot custom individual distribution charts or multiple distribution charts with just one function.

Let's see the distribution of 9 features (*feature_1* to *feature_9*).

In [None]:
# Multiple distribution plots
col_list = ['feature_' + str(i) for i in range(1, 10)]
plot_multiple_distplots(df=df_train, col_list=col_list)

By setting some parameters it's possible to look at the distributions in another way: the `kind` argument (passed through kwargs) can be used to plot not only histograms, but also kdeplots, boxplots, boxenplots or stripplots. Let's see another set of features (*feature_10* to *feature_18*).

In [None]:
# Plotting multiple boxenplots for features
col_list = ['feature_' + str(i) for i in range(10, 19)]
plot_multiple_distplots(df=df_train, col_list=col_list, kind='boxen')

Looking at the feature distributions, we can see that some of them behave like categorical variables. In the histograms, we can see discrete spikes at specific x-axis points. Because of that, it could be useful to generate countplots instead of distplots for these features. Fortunately, the `xplotter` package can also plot multiple countplots at once, and that's what we will see in the next section.
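
One simple way to shortlist such features is to count the distinct values in each column. The sketch below uses a toy frame and a hypothetical `max_unique` threshold; neither comes from `xplotter`, and the right threshold for the real data is a judgment call.

```python
import pandas as pd

# Toy frame standing in for df_train (values are made up)
toy = pd.DataFrame({
    "feature_a": [0, 1, 0, 2, 1, 0, 0, 1],                   # few distinct values
    "feature_b": [0.1, 2.3, 4.7, 1.9, 3.3, 5.2, 0.8, 6.4],   # many distinct values
})

# Hypothetical threshold: treat a column as "categorical-like" when it has
# at most this many distinct values
max_unique = 3

n_unique = toy.nunique()
countplot_candidates = n_unique[n_unique <= max_unique].index.tolist()
print(countplot_candidates)
```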

<a id="3.4"></a>
<font color="dimgrey" size=+2.0><b>3.4 Categorical Countplots</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

In [None]:
# Plotting countplots for some features
col_list = ['feature_2', 'feature_5', 'feature_9', 'feature_18', 'feature_20', 'feature_22',
            'feature_23']
plot_multiple_countplots(df=df_train, col_list=col_list, n_cols=2, palette='viridis')

After a short journey with `xplotter` to understand the data and visualize useful patterns through multiple distribution charts, correlation matrices and countplots, we have enough information to start another journey: data modelling, by training and evaluating Machine Learning models.

Let's do it using `mlcomposer`: another homemade package, built to encapsulate the hard work data scientists face when training and evaluating models. It offers classes and methods that cover what you need to apply Machine Learning in diverse contexts with few lines of code.

<a id="4"></a>
<font color="darkslateblue" size=+2.5><b>4. ML: Training Models with mlcomposer</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

As I promised you, this section dives deep into an excellent package that will help you apply ML like you never did before. Meet `mlcomposer`, a new way of telling computers to learn from data. In this Tabular Playground Series, we will apply the methods and functions of this package so you can see its power.

<div align="center">
    <img src="https://i.imgur.com/MIcPH8g.png" width=500 height=500 alt="mlcomposer logo">
</div>

<a id="4.1"></a>
<font color="dimgrey" size=+2.0><b>4.1 Transformers Module</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Well, after studying the dataset a little, we can say that there isn't much data transformation to apply before training machine learning models. The data has already been preprocessed by another flow, so the features are already built and the target class is already defined.

Even so, let's take the opportunity to show some of the tools in the `mlcomposer` package, like its useful `transformers` module. With this module, we can use Python classes to apply data transformations and build efficient pipelines for processing and preparing data for training models.

<a id="4.1.1"></a>
<font color="dimgrey" size=+1.0><b>4.1.1 Selecting Features</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Since there aren't many transformations to do in this application, let's start by importing a class that selects only the initial features to be used in the training steps.

In [None]:
# Importing class
from mlcomposer.transformers import ColumnSelection

# Instancing object and transforming
INITIAL_FEATURES = df_train.drop('id', axis=1).columns
selector = ColumnSelection(features=INITIAL_FEATURES)
df_selected = selector.fit_transform(df_train)

# Results
print(f'Columns of original dataset: {df_train.shape[1]}')
print(f'Columns of selected dataset: {df_selected.shape[1]}')

<a id="4.1.2"></a>
<font color="dimgrey" size=+1.0><b>4.1.2 Target Transformation</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

After starting the preparation section by selecting the features from the initial training dataset, we need to handle the transformation of our target column (originally made of strings).

Most machine learning models don't accept string entries, so they require an encoding step that turns strings into numbers. The mlcomposer package doesn't provide anything for this because scikit-learn already has the `OneHotEncoder()` and `LabelEncoder()` classes. Here, though, let's keep it simple and create a custom transformation class that extracts the class number from the class string. It is also a good coding exercise, since it shows how a transformation class can be built and then used in further preparation pipelines.

In [None]:
# Importing classes
from sklearn.base import BaseEstimator, TransformerMixin

# Creating custom class
class TargetExtractor(BaseEstimator, TransformerMixin):
    
    def __init__(self, old_target_name='target', new_target_name='target'):
        self.old_target_name = old_target_name
        self.new_target_name = new_target_name
    
    def fit(self, df, y=None):
        return self
    
    def transform(self, df, y=None):
        # Creating new target column
        df_copy = df.copy()
        df_copy[self.new_target_name] = df_copy[self.old_target_name].apply(lambda x: int(x[-1]))
        
        # Verifying if names are equal for dropping the old column
        if self.old_target_name == self.new_target_name:
            return df_copy
        else:
            return df_copy.drop(self.old_target_name, axis=1)

In [None]:
# Executing custom class for transforming the target
target_prep = TargetExtractor()
df_target_prep = target_prep.fit_transform(df_selected)

# Results
print(f'Samples of old target column: \n{df_selected["target"].values[:5]}')
print(f'\nSamples of new target column: \n{df_target_prep["target"].values[:5]}')

Well done!
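
One caveat about `int(x[-1])`: it reads only the last character, which is fine for `Class_1` to `Class_4` but would misread a label such as `Class_10`. A minimal sketch of a parsing helper that splits on the underscore instead is shown below; `extract_class_number` is a hypothetical name and not part of `mlcomposer`.

```python
def extract_class_number(label: str) -> int:
    """Return the integer after the last underscore, e.g. 'Class_3' -> 3."""
    return int(label.rsplit("_", 1)[1])


labels = ["Class_1", "Class_4", "Class_10"]

# Splitting on the underscore keeps multi-digit class numbers intact
print([extract_class_number(label) for label in labels])

# Reading only the last character truncates 'Class_10' to 0
print([int(label[-1]) for label in labels])
```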

<a id="4.1.3"></a>
<font color="dimgrey" size=+1.0><b>4.1.3 Split the Data</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

When training machine learning models, it's important to validate the results on a set not seen by the models. With this need in mind, the `mlcomposer` package brings the `DataSplitter` class, which enables splitting the data as a step of a training pipeline. Let's see how it works.

In [None]:
# Importing class
from mlcomposer.transformers import DataSplitter

# Initializing object and applying transformation
splitter = DataSplitter(target='target')
X_train, X_val, y_train, y_val = splitter.fit_transform(df_target_prep)

# Results
print(f'Shape of X_train: {X_train.shape}')
print(f'Shape of X_val: {X_val.shape}')

<a id="4.1.4"></a>
<font color="dimgrey" size=+1.0><b>4.1.4 Prep Pipelines</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

As said before, there are few transformations to make on this dataset. Just to keep the structure and be prepared for further ideas that may come up along the project, let's create a pipeline for data preparation. In the future we can add to this structure:

* PCA pipelines
* New custom features implementation
* Feature selection pipelines

In [None]:
# Creating prep pipelines for training and testing
from sklearn.pipeline import Pipeline

# Defining variables
TARGET = 'target'
INITIAL_FEATURES = ['feature_' + str(i) for i in range(50)] + [TARGET]
TEST_FEATURES = INITIAL_FEATURES[:-1]

# Train and test pipelines
train_prep_pipe = Pipeline([
    ('selector', ColumnSelection(features=INITIAL_FEATURES)),
    ('target_encoder', TargetExtractor()),
    ('splitter', DataSplitter(target=TARGET))
])

test_prep_pipe = Pipeline([
    ('selector', ColumnSelection(features=TEST_FEATURES))
])

# Reading the data and applying pipelines
df_train = pd.read_csv(TRAIN_FILEPATH)
df_test = pd.read_csv(TEST_FILEPATH)

X_train, X_val, y_train, y_val = train_prep_pipe.fit_transform(df_train)
X_test = test_prep_pipe.fit_transform(df_test)

Very good! Now we're ready to use the `mlcomposer.trainer` module to try out multiclass classification models and reach our goal! Keep watching!

<a id="4.2"></a>
<font color="dimgrey" size=+2.0><b>4.2 Trainer Module</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

After preparing our data with the `transformers` module from mlcomposer, we will use this section to apply the tools available in the `trainer` module. The one thing we must do before training anything is to prepare a dictionary of models and their hyperparameter search spaces (if applicable) following the structure:

    set_classifiers = {
        'model_name': {
            'model': ModelClass(),
            'params': dictionary_params
        }
    }
    
After doing that, we can initialize an object and execute its methods for training and evaluating different models at once. Let's do it!
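
To make the role of this dictionary concrete, here is a rough sketch of what a trainer could do with it: loop over the entries and cross-validate each model. It is an assumption about the general pattern, not the actual `mlcomposer` implementation. The `model` key follows the `set_classifiers` cell in section 4.2.1, while `toy_classifiers`, the synthetic data and the accuracy scoring are illustrative choices.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic multiclass data standing in for X_train / y_train
X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=5,
                                   n_classes=3, random_state=42)

# One entry per model: the estimator object plus its search space
toy_classifiers = {
    "DecisionTreeClassifier": {
        "model": DecisionTreeClassifier(random_state=42),
        "params": {"max_depth": [3, 5]},
    },
    "LogisticRegression": {
        "model": LogisticRegression(max_iter=1000),
        "params": {},
    },
}

# Cross-validate every entry and keep the mean accuracy per model.
# The 'params' entry would feed a grid or random search; it is unused here.
cv_scores = {}
for name, spec in toy_classifiers.items():
    estimator = clone(spec["model"])
    scores = cross_val_score(estimator, X_toy, y_toy, cv=3, scoring="accuracy")
    cv_scores[name] = scores.mean()

print(cv_scores)
```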

<a id="4.2.1"></a>
<font color="dimgrey" size=+1.0><b>4.2.1 Initial Setup</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

>Note: unhide the cell below to see hyperparameters definition

In [None]:
# Setting up hyperparameters for DecisionTrees
dtree_tunning_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 6, 7, 8, 9, 10],
    'class_weight': [None, 'balanced'],
    'random_state': [42]
}

# Setting up hyperparameters for RandomForest
forest_tunning_grid = {
    #'bootstrap': [True, False],
    'class_weight': [None, 'balanced'],
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 6, 7, 8, 9, 10],
    #'max_features': [None, 'auto', 'sqrt', 'log2'],
    #'max_leaf_nodes': np.arange(3, 50, 2),
    #'min_impurity_decrease': np.linspace(0, 1, 50),
    #'min_samples_leaf': np.arange(1, 100, 1),
    #'min_samples_split': np.arange(2, 100, 1),
    #'min_weight_fraction_leaf': np.linspace(0, 1, 50),
    'n_estimators': [500],
    #'oob_score': [True, False],
    'random_state': [42]
}

# Setting up hyperparameters for LightGBM
lgbm_tunning_grid = {
    'boosting_type': ['gbdt'],
    'class_weight': [None, 'balanced'],
    #'colsample_bytree': np.linspace(.5, 1, 50),
    #'importance_type': ['split', 'gain'],
    'learning_rate': [0.003, 0.01, 0.03, 0.1, 0.3, 1, 3],
    'max_depth': [5, 6, 7, 8, 9, 10],
    #'min_child_samples': np.arange(10, 50, 1),
    #'min_child_weight': np.linspace(1e-4, 1, 100),
    'n_estimators': [500],
    'num_leaves': [5, 10, 15, 20],
    'objective': ['multiclass'],
    'random_state': [42],
    #'reg_alpha': np.linspace(.0, 1.0, 50)
}

In [None]:
# Importing models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

# Defining variables
N_CLASSES = len(np.unique(y_train))
TARGET_NAMES = ['Class_' + str(i) for i in range(1, N_CLASSES + 1)]

# Initializing objects
dtree = DecisionTreeClassifier()
forest = RandomForestClassifier()
lgbm = LGBMClassifier(objective='multiclass', num_class=N_CLASSES)

# Creating set classifiers
model_obj = [dtree, forest, lgbm]
model_names = [type(model).__name__ for model in model_obj]
model_params = [dtree_tunning_grid, forest_tunning_grid, lgbm_tunning_grid]
set_classifiers = {name: {'model': obj, 'params': param} for (name, obj, param) in zip(model_names, model_obj, model_params)}

print(f'Classifiers to be trained: \n\n{model_names}')

<a id="4.2.2"></a>
<font color="dimgrey" size=+1.0><b>4.2.2 Training Models</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

With the `set_classifiers` dictionary defined, you will now see how easy it is to train and evaluate all selected models through a `mlcomposer.trainer` class built to handle the hard work of those steps.

To train the models for TPS May 21 we will use the `MulticlassClassifier` class and call its `fit()` method with a few parameters.

In [None]:
# Importing class
from mlcomposer.trainer import MulticlassClassifier

# Initializing object and training models
trainer = MulticlassClassifier()
trainer.fit(set_classifiers, X_train, y_train, random_search=False, cv=5, n_jobs=3)

Easy as that! Now we have all three models trained inside the `trainer` object. Let's move on to another section and evaluate the models using the same object in an easy and interpretable way.

>Note: Applying random search here is expensive. We can try it in future versions.
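
To get a feel for that cost, we can count how many model fits a search implies. The sketch below uses the decision tree grid from section 4.2.1 and assumes that `random_search=True` would wrap scikit-learn's `RandomizedSearchCV` with its default `n_iter=10`; the notebook does not confirm this, so treat the numbers as an estimate.

```python
from math import prod

# Decision tree grid from section 4.2.1
dtree_tunning_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 6, 7, 8, 9, 10],
    "class_weight": [None, "balanced"],
    "random_state": [42],
}

cv = 5        # folds passed to trainer.fit() above
n_iter = 10   # default number of candidates sampled by RandomizedSearchCV

# Every combination an exhaustive grid search would try
n_combinations = prod(len(values) for values in dtree_tunning_grid.values())

# A random search samples at most n_iter candidates from that space
n_candidates = min(n_iter, n_combinations)

print(f"Grid combinations: {n_combinations}")
print(f"Fits for an exhaustive search: {n_combinations * cv}")
print(f"Fits for a random search: {n_candidates * cv}")
```

The same count applies per model, and the random forest and LightGBM grids use 500 estimators each, which is why the search is skipped in this version.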

<a id="4.2.3"></a>
<font color="dimgrey" size=+1.0><b>4.2.3 Evaluating Performance</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

So far we've seen that the mlcomposer package has a class called `MulticlassClassifier` with a `fit()` method that trains multiple models at once, as long as they are correctly specified in the `set_classifiers` dictionary. After training them, what can we do? It's time to show two excellent methods: one extracts a complete report of metrics for a multiclass classification problem, and the other plots those metrics in a beautiful chart.

We're talking about the `evaluate_performance()` and `plot_metrics()` methods of the trainer object. Let's see what we can get from them.

In [None]:
# Extracting a complete performance report in a DataFrame format
metrics = trainer.evaluate_performance(X_train, y_train, X_val, y_val, target_names=TARGET_NAMES)
metrics

It seems our models didn't perform well on this multiclass classification task. Let's see it in a matplotlib chart.

In [None]:
# Plotting metrics
trainer.plot_metrics(metrics)

Well, it's clear that on classic classification metrics like accuracy, precision, recall and f1-score our candidate models perform poorly. Even so, let's explore other mlcomposer features by plotting a customized confusion matrix.

___
* **_Confusion Matrix_**
___

In [None]:
# Plotting a confusion matrix for training and validation data
trainer.plot_confusion_matrix(classes=TARGET_NAMES)

By the way, the matrix on the left shows that the models perform reasonably well only for the *Class_2* category, and this is something we can investigate in the future.

For the sake of completeness, let's take a look at the feature importances of each model using another useful mlcomposer method called `plot_feature_importance()`.

___
* **_Feature Importances_**
___

In [None]:
# Plotting feature importances
trainer.plot_feature_importance(features=TEST_FEATURES)

It seems that *feature_38* is at the top for all models.

___
* **_Extracting Log Loss_**
___

Now, to prepare the submission for this task, we can write code that extracts the log loss metric for each trained model. After that, we can compare them and select the best one to score the test sample.
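
As a reminder of what this metric measures, multiclass log loss is the mean negative log-probability that a model assigns to the true class of each row. The sketch below computes it by hand in plain Python with made-up probabilities; `multiclass_log_loss` is an illustrative helper, and it uses 0-based class indices, whereas the notebook's target values run from 1 to 4.

```python
from math import log


def multiclass_log_loss(y_true, probas, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each row.

    y_true: list of 0-based class indices
    probas: list of per-row probability lists, one value per class
    """
    total = 0.0
    for true_idx, row in zip(y_true, probas):
        # Clip to avoid log(0) when a model is fully confident and wrong
        p = min(max(row[true_idx], eps), 1 - eps)
        total -= log(p)
    return total / len(y_true)


# Two made-up rows with four classes each
y_true = [1, 0]
probas = [
    [0.1, 0.7, 0.1, 0.1],      # confident and correct
    [0.25, 0.25, 0.25, 0.25],  # uniform guess
]
print(round(multiclass_log_loss(y_true, probas), 4))
```

A model that outputs a uniform 0.25 for all four classes scores ln(4) ≈ 1.386, which is a handy reference point when reading the losses printed below.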

In [None]:
# Importing modules
from sklearn.metrics import log_loss

# Defining variables
y_val_encoded = pd.get_dummies(y_val)

# Iterating for each trained model on trainer class
for name in model_names:
    model = trainer.get_estimator(model_name=name)
    
    # Predicting score for validation set
    val_probas = model.predict_proba(X_val)
    val_loss = log_loss(y_val_encoded, val_probas)
    print(f'Model: {name} - Log Loss: {round(val_loss, 5)}')

<a id="5"></a>
<font color="darkslateblue" size=+2.5><b>5. Hyperparameter Tuning</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Well, after training some baseline models and looking at their performance, it's a good idea to explore tools for tuning the hyperparameters of a good candidate model and find the combination that performs best on this task. So, we're going to take the `LightGBM` model and explore some of its hyperparameters through `RandomizedSearchCV`. We will also set the optimization metric to `neg_log_loss` to find the best model for this task.

In [None]:
# Importing libraries
from sklearn.model_selection import RandomizedSearchCV

# Setting up hyperparameters for LightGBM
lgbm_tunning_grid = {
    'boosting_type': ['gbdt'],
    'class_weight': [None, 'balanced'],
    #'colsample_bytree': np.linspace(.5, 1, 50),
    'importance_type': ['split', 'gain'],
    'learning_rate': [0.003, 0.01, 0.03, 0.1, 0.3, 1, 3],
    'max_depth': [5, 6, 7, 8, 9, 10],
    #'min_child_samples': np.arange(10, 50, 1),
    #'min_child_weight': np.linspace(1e-4, 1, 100),
    'n_estimators': [500, 650, 700, 800],
    'num_leaves': [5, 10, 15, 20],
    'objective': ['multiclass'],
    'random_state': [42],
    'reg_alpha': np.linspace(.0, 1.0, 50)
}

# Initializing a new model and applying random search
lgbm = LGBMClassifier()
rnd_search = RandomizedSearchCV(lgbm, lgbm_tunning_grid, scoring='neg_log_loss', cv=5, verbose=10,
                                random_state=42, n_jobs=5)
rnd_search.fit(X_train, y_train)

In [None]:
# Extracting the best estimator
best_model = rnd_search.best_estimator_
print(f'Best hyperparameters: \n{rnd_search.best_params_}')

# Computing log loss for the best model
y_probas = best_model.predict_proba(X_val)
loss = log_loss(y_val_encoded, y_probas)
print(f'\nLoss after hyperparameter tunning: {round(loss, 5)}')

<a id="5.1"></a>
<font color="dimgrey" size=+1.0><b>5.1 Feature Selection</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

As another attempt to improve performance, let's use the mlcomposer class called `FeatureSelection`. The `transform()` method of this class uses a list of feature importances to select the dataset columns based on a `k` parameter passed as an argument. When we put it in a pipeline and apply `RandomizedSearchCV`, it's possible to tune this `k` parameter to find the best value under an optimization rule.

For this last approach, let's use the full data to train and build a final model. To compare results, we will only use the `neg_log_loss` metric obtained by cross validation on this full set.
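
The top-`k` selection described above can be sketched with NumPy alone. How `FeatureSelection` is implemented inside `mlcomposer` is not shown in this notebook, so the function below is an assumption about the general idea; `select_top_k` is a hypothetical helper, and keeping the original column order is a design choice of this sketch.

```python
import numpy as np


def select_top_k(X, importances, k):
    """Keep the k columns of X with the highest importance, in original column order."""
    importances = np.asarray(importances)
    # Indices of the k largest importances, then sorted to preserve column order
    top_idx = np.sort(np.argsort(importances)[::-1][:k])
    return X[:, top_idx], top_idx


# Made-up data: 3 rows, 4 features
X_toy = np.array([
    [1.0, 10.0, 100.0, 1000.0],
    [2.0, 20.0, 200.0, 2000.0],
    [3.0, 30.0, 300.0, 3000.0],
])
toy_importances = [0.05, 0.40, 0.15, 0.40]

X_selected, kept = select_top_k(X_toy, toy_importances, k=2)
print(kept)
print(X_selected)
```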

In [None]:
# Importing class
from mlcomposer.transformers import FeatureSelection

# Concatenating data
X = pd.concat([X_train, X_val])  # DataFrame.append was removed in pandas 2.0
y = np.concatenate([y_train, y_val])

# Extracting feature importances for model trained just before
feature_importance = best_model.feature_importances_

# Creating a tuning pipeline with the FeatureSelection class
tunning_pipeline = Pipeline([
    ('selector', FeatureSelection(feature_importance, k=len(TEST_FEATURES))),
    ('model', best_model)
])

# Defining a hyperparameter search for the tuning pipeline (k hyperparameter only)
tunning_param_grid = {
    'selector__k': np.arange(5, len(TEST_FEATURES) + 1)
}

# Defining random search and training it
tunning_search = RandomizedSearchCV(tunning_pipeline, tunning_param_grid, scoring='neg_log_loss', cv=5,
                                    n_jobs=5, verbose=10, random_state=42)
tunning_search.fit(X, y)

How about the best hyperparameters for the feature selection pipeline?

In [None]:
# Best params
tunning_search.best_params_

It seems that one feature was discarded from the final model.

In [None]:
# Results
final_model = tunning_search.best_estimator_
final_log_loss = -tunning_search.best_score_
print(f'Log loss using cross validation: {round(final_log_loss, 5)}')

<a id="6"></a>
<font color="darkslateblue" size=+2.5><b>6. Submitting Results</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

After training, evaluating and selecting the best model for this task, we can use the test data for computing the scores for each element.

In [None]:
# Extracting scores for test data
X_test = test_prep_pipe.fit_transform(df_test)
test_proba = final_model.predict_proba(X_test)

# Generating submission dataset
for i in range(N_CLASSES):
    df_test['Class_' + str(i + 1)] = test_proba[:, i]
    
df_sub = df_test.loc[:, ['id'] + TARGET_NAMES]
df_sub.to_csv('initial_sub.csv', index=False)
df_sub.head()

Please tell me what you think about the `xplotter` and `mlcomposer` packages by leaving a comment or an upvote. Your opinion is really important, and I'm really excited to show you new implementations in those packages.

* **xplotter on Github:** https://github.com/ThiagoPanini/xplotter
* **xplotter on PyPI:** https://pypi.org/project/xplotter/


* **mlcomposer on Github:** https://github.com/ThiagoPanini/mlcomposer
* **mlcomposer on PyPI:** https://pypi.org/project/mlcomposer/
___

<font size="+1" color="black"><b>You can also visit my other kernels by clicking on the buttons</b></font><br>

<a href="https://www.kaggle.com/thiagopanini/pycomp-predicting-survival-on-titanic-disaster" class="btn btn-primary" style="color:white;">Titanic EDA</a>
<a href="https://www.kaggle.com/thiagopanini/pycomp-exploring-and-modeling-housing-prices" class="btn btn-primary" style="color:white;">Housing Prices</a>
<a href="https://www.kaggle.com/thiagopanini/predicting-restaurant-s-rate-in-bengaluru" class="btn btn-primary" style="color:white;">Bengaluru's Restaurants</a>
<a href="https://www.kaggle.com/thiagopanini/sentimental-analysis-on-e-commerce-reviews" class="btn btn-primary" style="color:white;">Sentimental Analysis E-Commerce</a>