# Optuna: A hyperparameter optimization framework

* *In this kernel I will use the amazing **Optuna** to find the best hyperparameters for LGBM*

**Optuna is an automatic hyperparameter optimization framework designed particularly for machine learning. It features an imperative, define-by-run user API. Code written with Optuna is highly modular, and users can construct the hyperparameter search space dynamically.**
* To learn more about Optuna, check this [link](https://optuna.org/)

# Basic Concepts
We use the terms study and trial as follows:
* Study: optimization based on an objective function
* Trial: a single execution of the objective function

In [None]:
#!pip install optuna 
import optuna

In [None]:
from lightgbm import LGBMClassifier
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
train = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv')
test  = pd.read_csv('../input/tabular-playground-series-apr-2021/test.csv')
sub = pd.read_csv('../input/tabular-playground-series-apr-2021/sample_submission.csv')

In [None]:
train.head()

## Impute missing values

In [None]:
#complete embarked with mode
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
test['Embarked'] = test['Embarked'].fillna(test['Embarked'].mode()[0])

#complete sex with mode
train['Sex'] = train['Sex'].fillna(train['Sex'].mode()[0])
test['Sex'] = test['Sex'].fillna(test['Sex'].mode()[0])

#complete missing age with mean
train['Age'] = train['Age'].fillna(train['Age'].mean())
test['Age'] = test['Age'].fillna(test['Age'].mean())

#complete missing fare with median (each frame uses its own median)
train['Fare'] = train['Fare'].fillna(train['Fare'].median())
test['Fare'] = test['Fare'].fillna(test['Fare'].median())
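An alternative to per-frame statistics is to compute every fill value on the training data and reuse it for the test data, so both frames are imputed the same way. A minimal sketch on made-up toy frames follows; `toy_train`, `toy_test`, and `fill_values` are illustrative names, not objects from this kernel.

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", "S", None],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 30.0],
    "Embarked": [None, "C"],
})

# Fill values are computed from the training frame only.
fill_values = {
    "Age": toy_train["Age"].mean(),
    "Embarked": toy_train["Embarked"].mode()[0],
}

# fillna with a dict fills each column with its own value.
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)

print(toy_train)
print(toy_test)
```

Whether this beats per-frame imputation depends on the data; it is shown here only as one option.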

In [None]:
columns = [c for c in train.columns if c not in ['PassengerId','Cabin','Ticket','Survived','Name']]

## One-Hot Encoding of Categorical Features

In [None]:
train_objs_num = len(train)
dataset = pd.concat(objs=[train[columns], test[columns]], axis=0)
dataset_preprocessed = pd.get_dummies(dataset,columns=['Sex','Embarked','Parch','SibSp'])
train_preprocessed = dataset_preprocessed[:train_objs_num]
test_preprocessed = dataset_preprocessed[train_objs_num:]
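The reason for concatenating train and test before calling `pd.get_dummies` can be seen on a toy example: if a category appears in only one frame, encoding the frames separately produces mismatched columns. The frames below are made up for illustration.

```python
import pandas as pd

toy_train = pd.DataFrame({"Embarked": ["S", "C"]})
toy_test = pd.DataFrame({"Embarked": ["S", "Q"]})

# Encoding separately: each frame only gets columns for the categories it contains.
separate_train = pd.get_dummies(toy_train, columns=["Embarked"])
separate_test = pd.get_dummies(toy_test, columns=["Embarked"])

# Encoding after concatenation: both halves share the same columns.
combined = pd.get_dummies(pd.concat([toy_train, toy_test], axis=0), columns=["Embarked"])
joint_train = combined[:len(toy_train)]
joint_test = combined[len(toy_train):]

print(list(separate_train.columns), list(separate_test.columns))
print(list(joint_train.columns))
```

A model fitted on `joint_train` can predict on `joint_test` directly because the feature columns line up.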

## Let's build our optimization function using Optuna

### This function trains an LGBMClassifier model. It takes
* trial (the Optuna trial object for one execution, used to suggest hyperparameter values)
* the data
* the target
#### and returns
* accuracy on a held-out validation split
## Notes:
* I took the LGBMClassifier hyperparameters from the official LightGBM documentation.
* If you would like to add more parameters or change them, check this [link](https://lightgbm.readthedocs.io/en/latest/Parameters.html)
* I also used early_stopping_rounds to avoid overfitting
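The idea behind early stopping can be sketched in plain Python: keep training while the validation loss improves, and stop once it has not improved for a fixed number of rounds. This is a simplified illustration of the patience rule, not LightGBM's implementation; `best_round_with_patience` and the loss values are made up.

```python
def best_round_with_patience(val_losses, patience):
    """Return (best_round, rounds_run) for a list of per-round validation losses."""
    best_round, best_loss = 0, float("inf")
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best score: remember where it happened.
            best_round, best_loss = round_idx, loss
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_round, round_idx + 1
    return best_round, len(val_losses)

losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]
print(best_round_with_patience(losses, patience=3))
print(best_round_with_patience(losses, patience=10))
```

With a small patience the loop stops before it ever sees the late improvement at the last round, which is why a large `n_estimators` is usually paired with a reasonably generous patience.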

In [None]:
def objective(trial,data=train_preprocessed,target=train['Survived']):
    
    # Stratify on the target passed to the function so the split keeps the class balance
    train_x, test_x, train_y, test_y = train_test_split(
        data, target, test_size=0.2, random_state=42, stratify=target)
    param = {
        'metric': 'binary_logloss', 
        'random_state': 48,
        'n_estimators': 20000,
        'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-3, 10.0),
        'reg_lambda': trial.suggest_loguniform('reg_lambda', 1e-3, 10.0),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.3,0.4,0.5,0.6,0.7,0.8,0.9, 1.0]),
        'subsample': trial.suggest_categorical('subsample', [0.4,0.5,0.6,0.7,0.8,1.0]),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 5),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.008,0.01,0.014,0.017,0.02]),
        'max_depth': trial.suggest_categorical('max_depth', [30,100]),
        'num_leaves' : trial.suggest_int('num_leaves', 10, 300),
        'min_child_samples': trial.suggest_int('min_child_samples', 10, 300),
        'cat_smooth' : trial.suggest_int('cat_smooth', 1, 100)
    }
    model = LGBMClassifier(**param)   
    
    model.fit(train_x,train_y,eval_set=[(test_x,test_y)],early_stopping_rounds=100,verbose=False)
    
    preds = model.predict(test_x)
    
    accuracy = accuracy_score(test_y, preds)
    
    return accuracy

## Everything is ready, so let's start 🏄
* The objective function returns the accuracy, which we want to maximize; that's why I set direction='maximize'
* You can vary n_trials (the number of trials to run)

In [None]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

In [None]:
study.trials_dataframe()

# Let's do some quick visualization for hyperparameter optimization analysis
### Optuna provides various plotting functions in optuna.visualization to analyze optimization results visually

In [None]:
#plot_optimization_history: shows the scores from all trials as well as the best score so far at each point.
optuna.visualization.plot_optimization_history(study)

In [None]:
#plot_parallel_coordinate: interactively visualizes the hyperparameters and scores
optuna.visualization.plot_parallel_coordinate(study)

In [None]:
#plot_slice: plots the objective value against each hyperparameter, one panel per parameter.
#You can see where in the hyperparameter space the search went and which parts were explored more.
optuna.visualization.plot_slice(study)

In [None]:
#plot_contour: plots parameter interactions on an interactive chart. You can choose which hyperparameters you would like to explore.
optuna.visualization.plot_contour(study, params=['num_leaves',
                            'max_depth',
                            'subsample',
                            'learning_rate'])

In [None]:
#Visualize parameter importances.
optuna.visualization.plot_param_importances(study)

In [None]:
#Visualize empirical distribution function
optuna.visualization.plot_edf(study)
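For readers unfamiliar with the term, an empirical distribution function gives, for each threshold, the fraction of observed values at or below it. A minimal pure-Python sketch with made-up scores follows; `empirical_cdf` and `toy_scores` are illustrative names, not Optuna functions.

```python
def empirical_cdf(values, threshold):
    """Fraction of values that are less than or equal to threshold."""
    return sum(v <= threshold for v in values) / len(values)

# Hypothetical objective values from five trials.
toy_scores = [0.70, 0.72, 0.75, 0.75, 0.78]

for t in (0.70, 0.75, 0.80):
    print(t, empirical_cdf(toy_scores, t))
```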

# Let's create an LGBMClassifier model with the best hyperparameters

In [None]:
params=study.best_params   
params['random_state'] = 48
params['n_estimators'] = 20000 
params['metric'] = 'binary_logloss'

In [None]:
params

In [None]:
kf = StratifiedKFold(n_splits=5,random_state=48,shuffle=True)                  
accuracy=[]   # list contains accuracy for each fold  
n=0   
for trn_idx, test_idx in kf.split(train_preprocessed,train['Survived']):
    X_tr,X_val=train_preprocessed.iloc[trn_idx],train_preprocessed.iloc[test_idx]
    y_tr,y_val=train['Survived'].iloc[trn_idx],train['Survived'].iloc[test_idx]
    model = LGBMClassifier(**params) 
    model.fit(X_tr,y_tr,eval_set=[(X_val,y_val)],early_stopping_rounds=200,verbose=False) 
    sub[str(n)] = model.predict(test_preprocessed)
    accuracy.append(accuracy_score(y_val, model.predict(X_val))) 
    print(n+1,accuracy[n])                                                                                       
    n+=1 

In [None]:
np.mean(accuracy)  

In [None]:
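# plot_importance is a LightGBM function, so import it from lightgbm directly
import lightgbm as lgb
# Note: `model` is the classifier fitted on the last fold above
lgb.plot_importance(model, max_num_features=10, figsize=(10,10))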
plt.show()


# Submission

In [None]:
df=sub[['0','1','2','3','4']].mode(axis=1) # select the most frequent predicted class by our model
sub['Survived']=df[0]    
sub=sub[['PassengerId','Survived']]
sub['Survived']=sub['Survived'].apply(lambda x : int(x))
sub.to_csv('submission.csv',index=False)
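The majority vote across folds can be checked on a small made-up table. With five binary votes per row there is never a tie, so the first column returned by `mode(axis=1)` is the majority class. `toy_votes` is an illustrative frame, not the kernel's `sub`.

```python
import pandas as pd

# Each column holds one fold's predictions; each row is one passenger.
toy_votes = pd.DataFrame({
    "0": [1, 0, 1],
    "1": [1, 0, 0],
    "2": [0, 0, 1],
    "3": [1, 1, 1],
    "4": [0, 0, 0],
})

# Row-wise mode: the most frequent predicted class per passenger.
majority = toy_votes.mode(axis=1)[0].astype(int)
print(majority.tolist())
```

An odd number of folds is what makes this safe for a binary target; with an even number, ties would be possible and `mode` would return more than one column.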

In [None]:
sub
