## Introduction
In this notebook I create a model using LightGBM and then tune its hyperparameters to improve it 😄. Without hyperparameter tuning, the model scores approximately 0.77 when submitted to the competition. After hyperparameter tuning, it scores approximately 0.88 🤞. I used Optuna for the hyperparameter tuning. I will keep adding links to useful learning resources as well.

Thank you !! 
Enjoy :)

## Importing the libraries and loading the data 

In [None]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn import metrics 
import lightgbm as lgb
import optuna
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
train = pd.read_csv('../input/tabular-playground-series-mar-2021/train.csv')

In [None]:
test = pd.read_csv('../input/tabular-playground-series-mar-2021/test.csv')

In [None]:
submission = pd.read_csv('../input/tabular-playground-series-mar-2021/sample_submission.csv',index_col='id')

## Feature Engineering and Feature Selection

This section includes - 
1. Understanding and visualizing the data.
2. Applying label encoder to the categorical features.
3. Checking for missing values.
4. Dropping some of the unimportant features. 
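
For step 2, it helps to see why the encoder is fitted on the train and test values together. The sketch below is a plain-Python illustration, not the scikit-learn implementation. The category values are hypothetical and do not come from the competition data. It builds one sorted vocabulary from both sets, so a category that appears only in the test set still receives a code.

```python
# Hypothetical category values, not taken from the competition data.
train_values = ["B", "A", "C", "A"]
test_values = ["A", "D", "B"]  # "D" never appears in the training values

# One sorted vocabulary built from both sets, similar to fitting a
# LabelEncoder on the concatenated train and test values.
classes = sorted(set(train_values) | set(test_values))
code_of = {category: code for code, category in enumerate(classes)}

train_codes = [code_of[value] for value in train_values]
test_codes = [code_of[value] for value in test_values]

print(classes)      # ['A', 'B', 'C', 'D']
print(train_codes)  # [1, 0, 2, 0]
print(test_codes)   # [0, 3, 1]
```

If the vocabulary were built from the training values alone, looking up "D" would fail, which is the situation the combined fit avoids.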

In [None]:
# Pandas Profiling on the training set. 
prof = ProfileReport(train)
prof.to_notebook_iframe()

In [None]:
# Pandas Profiling on the test set
prof = ProfileReport(test)
prof.to_notebook_iframe()

In [None]:
#Label encoder
for c in train.columns:
    if train[c].dtype=='object': 
        lbl = LabelEncoder()
        lbl.fit(list(train[c].values) + list(test[c].values))
        train[c] = lbl.transform(train[c].values)
        test[c] = lbl.transform(test[c].values)
        
display(train.head())

In [None]:
#checking for missing values in training set
train.isnull().sum()

In [None]:
#Checking for missing values in test set
test.isnull().sum()

In [None]:
# Separating the target and removing the id column
target = train.pop('target')
train.pop('id')
train.info()
test.pop('id')

## Model Creation and Hyper Parameter Optimisation

First I use Optuna to search for the best hyperparameters for a LightGBM model. After that I train a simple LightGBM model with default hyperparameters. The reason for doing so is to compare the ROC AUC score of both models and see how much the tuned model improves on the initial one.

Link to learn about optuna for lgbm - https://github.com/optuna/optuna/blob/master/examples/lightgbm/lightgbm_simple.py
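
Conceptually, a study calls the objective many times with different parameter values and keeps the trial with the best score. The sketch below mimics that loop with plain random search in pure Python. It is not Optuna's algorithm (Optuna uses a TPE sampler by default), and `toy_objective` and `sample_params` are hypothetical stand-ins for training a model and for the `trial.suggest_*` calls.

```python
import math
import random


def toy_objective(params):
    # Hypothetical stand-in for "train a model and return validation AUC":
    # the score peaks at lambda_l1 = 0.1 and num_leaves = 64.
    penalty = (math.log10(params["lambda_l1"]) + 1.0) ** 2
    penalty += ((params["num_leaves"] - 64) / 64.0) ** 2
    return 1.0 - 0.01 * penalty


def sample_params(rng):
    # Log-uniform draw for the regulariser, uniform integer for the leaves.
    log_low, log_high = math.log(1e-8), math.log(10.0)
    return {
        "lambda_l1": math.exp(rng.uniform(log_low, log_high)),
        "num_leaves": rng.randint(2, 256),
    }


rng = random.Random(0)  # seeded so the sketch is reproducible
trials = []
for _ in range(50):
    params = sample_params(rng)
    trials.append((toy_objective(params), params))

# With direction="maximize", the best trial is the one with the highest score.
best_score, best_params = max(trials, key=lambda trial: trial[0])
print(best_score, best_params)
```

The log-uniform draw matters for parameters such as `lambda_l1`, whose useful values span many orders of magnitude.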

In [None]:

def objective(trial, data=train, target=target):
    # Hold out 10% of the data to score each trial.
    X_train, X_test, y_train, y_test = train_test_split(data, target, train_size=0.9)
    dtrain = lgb.Dataset(X_train, label=y_train)
    param = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'verbosity': -1,
        'boosting_type': 'gbdt',
        'lambda_l1': trial.suggest_loguniform('lambda_l1', 1e-8, 10.0),
        'lambda_l2': trial.suggest_loguniform('lambda_l2', 1e-8, 10.0),
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
        'feature_fraction': trial.suggest_uniform('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.4, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
    }
    model = lgb.train(param, dtrain)
    # With the binary objective, predict() returns the probability of class 1.
    y_pred = model.predict(X_test)
    # Score the probabilities directly: rounding them to labels first
    # discards the ranking information that ROC AUC is based on.
    auc_roc_score = roc_auc_score(y_test, y_pred)
    return auc_roc_score
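
ROC AUC depends on how the predictions rank the samples, so it matters whether it receives probabilities or rounded labels. The sketch below is a minimal pure-Python illustration with hypothetical labels and probabilities. The helper `roc_auc` is written here for illustration and is not scikit-learn's implementation.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The double loop is quadratic in the
    number of samples, so this is for illustration only.
    """
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.30, 0.40, 0.80, 0.20, 0.45]
y_label = [round(p) for p in y_prob]  # thresholding at 0.5, like np.rint here

auc_prob = roc_auc(y_true, y_prob)    # every positive outranks every negative
auc_label = roc_auc(y_true, y_label)  # most pairs collapse into ties
print(auc_prob, auc_label)
```

In this toy data the probabilities rank every positive above every negative, giving an AUC of 1.0, while the rounded labels give about 0.667 because two of the three positives fall below the 0.5 threshold.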

In [None]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

In [None]:
params = study.best_params
# params['n_estimators'] = 2000
# LightGBM's name for the ROC AUC metric is 'auc'.
params['metric'] = 'auc'

The dictionary `params` now contains the best hyperparameter values that Optuna found.
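
Note that `study.best_params` holds only the values that were suggested through the `trial`, not the settings that were fixed inside the objective. A minimal sketch of combining the two is shown below. The tuned values are hypothetical placeholders, not results from this notebook.

```python
# Settings that were fixed inside the objective and therefore not tuned.
fixed_params = {
    "objective": "binary",
    "boosting_type": "gbdt",
    "verbosity": -1,
}

# Hypothetical values standing in for study.best_params.
tuned_params = {
    "lambda_l1": 0.5,
    "lambda_l2": 0.001,
    "num_leaves": 100,
    "feature_fraction": 0.6,
    "bagging_fraction": 0.8,
    "bagging_freq": 3,
    "min_child_samples": 50,
}

# Later keys win on conflict, so tuned values would override fixed ones.
final_params = {**fixed_params, **tuned_params}
print(final_params)
```

Merging the dictionaries this way keeps the final model configured the same way as the models that were scored during tuning.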

In [None]:
X_train,X_test,y_train,y_test = train_test_split(train,target,train_size=0.9)

First I create a LightGBM model with default hyperparameters, and then a second model that uses the tuned hyperparameters. The tuned hyperparameters give only a very slight improvement on the hold-out set, but that slight improvement corresponds to a much bigger jump on the leaderboard.

In [None]:
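from lightgbm import LGBMClassifier
# Named 'baseline' so that the 'lgb' module imported earlier is not overwritten.
baseline = LGBMClassifier()
baseline.fit(X_train, y_train)
# predict() returns hard class labels, so this AUC is based on labels.
y_preds = baseline.predict(X_test)
# roc_auc_score expects the true labels first, then the predictions.
print(metrics.roc_auc_score(y_test, y_preds))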

In [None]:


check = LGBMClassifier(**params)
check.fit(X_train, y_train)
y_preds = check.predict(X_test)
# roc_auc_score expects the true labels first, then the predictions.
print(metrics.roc_auc_score(y_test, y_preds))

In my run, the roc_auc_score without hyperparameter tuning was around 0.81645 and the roc_auc_score with the tuned hyperparameters was around 0.81780, a difference of 0.00135. Because the train/validation split is not seeded, the exact values change from run to run.

In [None]:
# predict_proba returns one column per class; keep the probability of class 1.
output = check.predict_proba(test)[:, 1]
submission['target'] = output

In [None]:
submission.head()

In [None]:
submission.to_csv('lgbm.csv')

**Thank you so much for your time. If you find this notebook useful, kindly upvote** :)