# 🌿 Extra Trees Baseline Model and Optuna Hyperparam Optimization...
Extra Trees (extremely randomized trees) is an ensemble machine learning algorithm that combines the predictions of many decision trees. <br>
It often performs as well as or better than a random forest, even though it uses a simpler procedure to build the individual trees in the ensemble.
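
To make the "simpler procedure" concrete: when a random forest evaluates a feature, it searches the candidate thresholds for the best split, whereas Extra Trees draws a threshold at random between the feature's minimum and maximum. The following is a minimal, dependency-free sketch of that one difference for a single feature. The function names (`gini`, `split_impurity`, `best_threshold`, `random_threshold`) and the toy data are made up for illustration and are not scikit-learn APIs.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Random-forest style: try every midpoint between sorted unique values."""
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))

def random_threshold(values, rng):
    """Extra-Trees style: draw one threshold uniformly between min and max."""
    return rng.uniform(min(values), max(values))

# Toy, made-up data: one feature and two classes.
values = [0.1, 0.4, 0.5, 0.9, 1.3, 1.7]
labels = [0, 0, 0, 1, 1, 1]
rng = random.Random(0)

t_best = best_threshold(values, labels)
t_rand = random_threshold(values, rng)
print("searched split impurity:", split_impurity(values, labels, t_best))
print("random split impurity:  ", split_impurity(values, labels, t_rand))
```

A single random split is never better than the searched one, but it is much cheaper to compute, and averaging many such trees is what lets the ensemble stay competitive.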

**Data Description**

For this challenge, you will be predicting bacteria species based on repeated lossy measurements of DNA snippets. <br>
Snippets of length 10 are analyzed using Raman spectroscopy that calculates the histogram of bases in the snippet. <br>
In other words, the DNA segment $ATATGGCCTT$ becomes $A_{2}T_{4}G_{2}C_{2}$.
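
As a small illustration of the histogram idea, the sketch below counts the bases of a snippet and formats them in the same order (A, T, G, C) as the example above. The helper name `snippet_to_histogram` is hypothetical and is not part of the competition data or code.

```python
from collections import Counter

def snippet_to_histogram(snippet, bases="ATGC"):
    """Count each base in the snippet and format the counts as e.g. 'A2T4G2C2'."""
    counts = Counter(snippet)
    return "".join(f"{base}{counts[base]}" for base in bases)

print(snippet_to_histogram("ATATGGCCTT"))  # A2T4G2C2
```

Note that the order of the bases within the snippet is lost, which is why the measurements are described as lossy.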

Each row of data contains a spectrum of histograms generated by repeated measurements of a sample. Each row holds the output for all 286 histogram <br>
possibilities (e.g., $A_{0}T_{0}G_{0}C_{10}$ to $A_{10}T_{0}G_{0}C_{0}$), from which a bias spectrum (of totally random ATGC) is subtracted. <br>
The data (both train and test) also contains simulated measurement errors (of varying rates) for many of the samples, which makes the problem more challenging.
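
The figure of 286 follows from counting the ways to distribute 10 bases over the four letters, which is $\binom{13}{3} = 286$. The sketch below enumerates them and builds names in the `A0T0G0C10` style used by the feature columns. The enumeration order is an assumption made for illustration and may differ from the column order in the CSV files.

```python
from math import comb

SNIPPET_LENGTH = 10

def histogram_names(length=SNIPPET_LENGTH):
    """List every (A, T, G, C) count combination that sums to `length`."""
    names = []
    for a in range(length + 1):
        for t in range(length - a + 1):
            for g in range(length - a - t + 1):
                c = length - a - t - g  # the remaining bases must all be C
                names.append(f"A{a}T{t}G{g}C{c}")
    return names

names = histogram_names()
print(len(names), names[0], names[-1])
print(comb(SNIPPET_LENGTH + 3, 3))  # stars-and-bars count for 4 categories
```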


**Notebook Goals**
* Identify optimal model architecture using Optuna.
* Develop a baseline model to understand the competition data.
* Apply the insights acquired from Hyperparam Optimization.

**The Strategy is the Following**

1. Installing Libraries for the Model.
2. Importing Required Libraries & Notebook Setup.
3. Load the CSV into a Dataframe.
4. Visualize the Information Loaded.
5. Data Pre-Processing.
6. Feature Engineering.
7. Develop a Simple Model.
8. Learn from the Model.
9. Think of New Ideas for Future Improvements.
10. Submit the Model for Ranking.

**Update 02/11/2022**
* Baseline Model and Notebook Created.
* Added Optuna Hyperparam Optimization.

**Update 02/12/2022**
* Implemented the Memory Efficient functions to Help Pycaret.


**Credits and Notebooks Used**
* https://www.kaggle.com/sfktrkl/tps-feb-2022
* https://www.kaggle.com/hamzaghanmi/train-test-286


# 1. Install Libraries & Setup the Notebook

In [None]:
# Nothing needs to be installed

# 2. Importing Required Libraries & Notebook Setup.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
%%time
import optuna
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

In [None]:
%%time
# I like to disable my Notebook Warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
%%time
# Notebook Configuration...

# Amount of data we want to load into the Model...
DATA_ROWS = None
# Dataframe, the amount of rows and cols to visualize...
NROWS = 50
NCOLS = 15
# Main data location path...
BASE_PATH = '/kaggle/input/tabular-playground-series-feb-2022/'

In [None]:
%%time
# Configure notebook display settings to show floats with 5 decimal places, tables look nicer.
pd.options.display.float_format = '{:,.5f}'.format
pd.set_option('display.max_columns', NCOLS) 
pd.set_option('display.max_rows', NROWS)

# 3. Loading the Train and Test Datasets into a Dataframe.

In [None]:
%%time
trn_data = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/train.csv')
tst_data = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/test.csv')

In [None]:
%%time
sub = pd.read_csv('../input/tabular-playground-series-feb-2022/sample_submission.csv')

# 4. Visualize the Information Loaded.

In [None]:
%%time
trn_data.head()

In [None]:
%%time
tst_data.head()

In [None]:
%%time
trn_data.describe()

# 5. Feature Engineering.

In [None]:
%%time
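# Drop training rows that duplicate an earlier row on every feature column
# Note: ('row_id') is a plain string, not a tuple, so compare against the name directly.
cols = [col for col in tst_data.columns if col != 'row_id']
trn_data.drop_duplicates(subset = cols, keep = 'first', inplace = True)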

In [None]:
trn_data.info()

In [None]:
# Checking the intersection between Train and Test: rows whose feature values match exactly...
merge = pd.merge(trn_data, tst_data, how = 'inner', on = cols)

# Map each matching test row_id (row_id_y) to its training row_id (row_id_x).
train_test_map = dict(zip(merge['row_id_y'], merge['row_id_x']))

In [None]:
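# Group on the feature columns only; grouping on row_id as well would make every group size 1.
cols = [col for col in trn_data.columns if col not in ('row_id', 'target')]

# COUNT = number of rows sharing the same feature values.
# Training duplicates were already dropped above, so COUNT is 1 for every training row.
trn_data['COUNT'] = trn_data.groupby(cols)['A0T0G0C10'].transform('size')
tst_data['COUNT'] = tst_data.groupby(cols)['A0T0G0C10'].transform('size')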

In [None]:
# row_id is an identifier, not a measurement, so keep it out of the feature list.
ignore = ['row_id', 'target']
features = [feat for feat in trn_data.columns if feat not in ignore]

In [None]:
%%time
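def create_features(df):
    """
    Add row-wise summary statistics computed over the columns listed in `features`.
    """
    # Take a snapshot first so the new columns do not leak into later statistics.
    values = df[features]
    row_mean = values.mean(axis = 1)

    df['A_sum'] = values.sum(axis = 1)
    df['A_min'] = values.min(axis = 1)
    df['A_max'] = values.max(axis = 1)
    df['A_std'] = values.std(axis = 1)
    # DataFrame.mad() was removed in pandas 2.0; compute the mean absolute deviation directly.
    df['A_mad'] = values.sub(row_mean, axis = 0).abs().mean(axis = 1)
    df['A_var'] = values.var(axis = 1)
    df['A_mean'] = row_mean
    # Number of strictly positive values per row.
    df['A_positive'] = values.gt(0).sum(axis = 1)

    return df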

In [None]:
%%time
#trn_data = create_features(trn_data)
#tst_data = create_features(tst_data)

In [None]:
%%time
trn_data.head()

In [None]:
%%time
# row_id is an identifier, not a measurement, so keep it out of the model inputs.
ignore = ['row_id', 'target']
features = [feat for feat in trn_data.columns if feat not in ignore]

# 6. Data Processing.

In [None]:
%%time
from sklearn.preprocessing import LabelEncoder
target_encoder = LabelEncoder()
trn_data['target_enc'] = target_encoder.fit_transform(trn_data['target'])
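
For readers unfamiliar with label encoding, the sketch below shows the idea in plain Python: each distinct class name gets an integer code in sorted order, and the same table is used to map predictions back to names. The function name `fit_label_mapping` and the placeholder labels are invented for illustration; the notebook itself uses scikit-learn's `LabelEncoder`.

```python
def fit_label_mapping(labels):
    """Return (label -> code dict, list of classes) with classes in sorted order."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}, classes

# Placeholder labels, not the real species names from the dataset.
labels = ["species_c", "species_a", "species_b", "species_a"]

to_code, classes = fit_label_mapping(labels)
encoded = [to_code[label] for label in labels]   # like fit_transform
decoded = [classes[code] for code in encoded]    # like inverse_transform

print(encoded)
print(decoded)
```

Keeping the fitted encoder around matters because the model predicts integer codes, and the submission needs the original class names.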

In [None]:
X = trn_data[features]
y = trn_data['target_enc']

# 7. Baseline Model Configuration & Training.

In [None]:
%%time
N_SPLITS = 10
folds = StratifiedKFold(n_splits = N_SPLITS, shuffle = True)

In [None]:
%%time
# Baseline hyperparameters; N_SPLITS and folds are defined in the previous cell.
n_estimators = 128
max_depth = 64
min_samples_split = 3
min_samples_leaf = 1
criterion = 'gini'

scores  = []
y_probs = []

for fold, (trn_id, val_id) in enumerate(folds.split(X, y)):  
    X_train, y_train = X.iloc[trn_id], y.iloc[trn_id]
    X_valid, y_valid = X.iloc[val_id], y.iloc[val_id]
    
    model = ExtraTreesClassifier(n_estimators = n_estimators,
                                 max_depth = max_depth,
                                 min_samples_split = min_samples_split,
                                 min_samples_leaf = min_samples_leaf,
                                 criterion = criterion,
                                 random_state = 69,
                                 n_jobs = -1)
    model.fit(X_train, y_train)
    
    valid_pred = model.predict(X_valid)
    valid_score = accuracy_score(y_valid, valid_pred)
    
    print("Fold:", fold, "Accuracy:", valid_score)
    scores.append(valid_score)
    y_probs.append(model.predict_proba(tst_data[features]))
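
The loop above relies on stratified folds, meaning each validation fold keeps roughly the same class proportions as the full training set. The sketch below is a simplified, dependency-free illustration of that idea; it is not scikit-learn's `StratifiedKFold` implementation, and the function name `stratified_folds` and the toy labels are made up.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits, seed=0):
    """Assign each sample index to a fold so class proportions stay balanced."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds

# Toy, imbalanced labels: 20 samples of class "a" and 10 of class "b".
labels = ["a"] * 20 + ["b"] * 10
folds = stratified_folds(labels, n_splits=5)

for fold_id, val_idx in enumerate(folds):
    print(fold_id, Counter(labels[i] for i in val_idx))
```

With these toy labels every fold holds four samples of class "a" and two of class "b", so each validation score is measured on the same class mix.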

In [None]:
%%time
print("Mean accuracy score:", np.array(scores).mean())

# 8. Optuna Model Configuration Hyperparameter Search.

In [None]:
%%time
N_SPLITS = 10
folds = StratifiedKFold(n_splits = N_SPLITS, shuffle = True)

In [None]:
%%time
X_train, X_valid, y_train, y_valid = train_test_split(X, y)

def objective(trial):
    n_estimators = trial.suggest_int("n_estimators", 8, 2048)
    max_depth = trial.suggest_int("max_depth", 4, 2048)
    min_samples_split = trial.suggest_int("min_samples_split", 2, 16)
    min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 8)
    criterion = trial.suggest_categorical("criterion", ['gini', 'entropy'])
    
    # Use all CPU cores, as in the baseline model, to keep each trial reasonably fast.
    clf = ExtraTreesClassifier(n_estimators = n_estimators,
                               max_depth = max_depth,
                               min_samples_split = min_samples_split,
                               min_samples_leaf = min_samples_leaf,
                               criterion = criterion,
                               random_state = 69,
                               n_jobs = -1)
    
    clf.fit(X_train, y_train)
    return clf.score(X_valid, y_valid)

#study = optuna.create_study(direction = "maximize")
#study.optimize(objective, n_trials = 30)

In [None]:
%%time
#parameters = study.best_params
#parameters
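
The search pattern used by `objective` is: sample a set of hyperparameters, score them, and keep the best set seen so far. The sketch below shows that loop without Optuna, using plain uniform random search over the same integer ranges. This is a swapped-in, simpler technique (Optuna's samplers are more sophisticated), and `toy_score` is a made-up stand-in for fitting and scoring a real model.

```python
import random

# Same inclusive integer ranges as in the objective function.
SEARCH_SPACE = {
    "n_estimators": (8, 2048),
    "max_depth": (4, 2048),
    "min_samples_split": (2, 16),
    "min_samples_leaf": (1, 8),
}
CRITERIA = ["gini", "entropy"]

def sample_params(rng):
    """Draw one hyperparameter set uniformly at random."""
    params = {name: rng.randint(low, high) for name, (low, high) in SEARCH_SPACE.items()}
    params["criterion"] = rng.choice(CRITERIA)
    return params

def toy_score(params):
    """Made-up stand-in for clf.fit(...) followed by clf.score(...); not a real model."""
    distance = abs(params["n_estimators"] - 500) / 500
    return 1.0 / (1.0 + distance + params["min_samples_leaf"] / 8)

def random_search(score_fn, n_trials, seed=0):
    """Keep the best-scoring parameter set over n_trials random draws."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search(toy_score, n_trials=30)
print(best_params, round(best_score, 4))
```

In the notebook, the role of `best_params` is played by `study.best_params` once the commented-out study is run.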

In [None]:
%%time
#print("Mean accuracy score:", np.array(scores).mean())

# 9. Optuna Hyperparam Optimized Model.

In [None]:
%%time
N_SPLITS = 10
folds = StratifiedKFold(n_splits = N_SPLITS, shuffle = True)

In [None]:
%%time

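# Hard-coded hyperparameters for the tuned model.
# Note: n_estimators and max_depth lie outside the ranges searched in section 8,
# so they presumably come from a search run with a different configuration.
n_estimators = 2373
max_depth = 3691
min_samples_split = 3
min_samples_leaf = 1
criterion = 'gini'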

scores  = []
y_probs = []

for fold, (trn_id, val_id) in enumerate(folds.split(X, y)):  
    X_train, y_train = X.iloc[trn_id], y.iloc[trn_id]
    X_valid, y_valid = X.iloc[val_id], y.iloc[val_id]
    
    model = ExtraTreesClassifier(n_estimators = n_estimators,
                                 max_depth = max_depth,
                                 min_samples_split = min_samples_split,
                                 min_samples_leaf = min_samples_leaf,
                                 criterion = criterion,
                                 random_state = 69,
                                 n_jobs = -1)
    model.fit(X_train, y_train)
    
    valid_pred = model.predict(X_valid)
    valid_score = accuracy_score(y_valid, valid_pred)
    
    print("Fold:", fold, "Accuracy:", valid_score)
    scores.append(valid_score)
    y_probs.append(model.predict_proba(tst_data[features]))

In [None]:
%%time
print("Mean accuracy score:", np.array(scores).mean())

# 10. Prediction Post Processing and Model Submission.

In [None]:
%%time
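# Average the class probabilities predicted by the model of each fold.
y_prob = sum(y_probs) / len(y_probs)
# Add small hand-picked offsets to three of the ten classes before taking the argmax.
y_prob += np.array([0, 0, 0.03, 0.036, 0, 0, 0, 0.027, 0, 0])
y_pred_tuned = target_encoder.inverse_transform(np.argmax(y_prob, axis=1))
# Share of test predictions per class, in percent.
pd.Series(y_pred_tuned, index=tst_data.index).value_counts().sort_index() / len(tst_data) * 100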
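
The post-processing above has three steps: average the per-fold probabilities, add a per-class offset, and pick the class with the highest adjusted value. The sketch below walks through those steps in plain Python on tiny made-up numbers (two folds, two rows, three classes) to show how a small offset can flip a near-tie. The function names and all values are illustrative and are not taken from the notebook's results.

```python
def average_probabilities(fold_probs):
    """Element-wise mean of several [rows x classes] probability tables."""
    n_folds = len(fold_probs)
    n_rows = len(fold_probs[0])
    n_classes = len(fold_probs[0][0])
    return [
        [sum(fold[r][c] for fold in fold_probs) / n_folds for c in range(n_classes)]
        for r in range(n_rows)
    ]

def predict_with_bias(probs, bias):
    """Add a per-class offset to each row, then return the argmax class index."""
    predictions = []
    for row in probs:
        adjusted = [p + b for p, b in zip(row, bias)]
        predictions.append(max(range(len(adjusted)), key=adjusted.__getitem__))
    return predictions

# Made-up probabilities from two folds for two test rows and three classes.
fold_probs = [
    [[0.50, 0.45, 0.05], [0.10, 0.80, 0.10]],
    [[0.48, 0.47, 0.05], [0.20, 0.70, 0.10]],
]
mean_probs = average_probabilities(fold_probs)

print(predict_with_bias(mean_probs, [0.0, 0.0, 0.0]))   # no offset
print(predict_with_bias(mean_probs, [0.0, 0.05, 0.0]))  # offset favours class 1
```

Only the first row changes its predicted class here, because its top two averaged probabilities are within the size of the offset.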

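In [None]:
%%time
# Start from the model predictions...
sub["target"] = y_pred_tuned

In [None]:
sub[sub['row_id'] == 262823]

In [None]:
# ...then overwrite test rows that exactly match a training row with the known training label.
# This must run after the assignment above, otherwise the predictions would overwrite it.
for key in train_test_map:
    sub.loc[sub['row_id'] == key, 'target'] = trn_data.loc[trn_data['row_id'] == train_test_map[key], 'target'].iloc[0]

In [None]:
sub[sub['row_id'] == 262823]

In [None]:
%%time
sub.to_csv("submission.csv", index=False)
sub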