# PyCaret Classifier, Quick Implementation 🧪 ...
In this notebook I use PyCaret to create a basic implementation of a multiclass classifier.


**Notebook Goal:** Develop a quick classifier to understand the competition and where it can be improved, and to get a general idea of how preprocessing and feature engineering could be used in future iterations of the model.

**The Strategy is the Following**

1. Installing Libraries for the Model.
2. Importing Required Libraries & Notebook Setup.
3. Load the CSV into a Dataframe.
4. Visualize the Information Loaded.
5. Data Pre-Processing.
6. Feature Engineering.
7. Develop a Simple Model.
8. Learn from the Model.
9. Submit the Model for Ranking.
10. Think of New Ideas for Future Improvements.

**Update 02/06/2022**
* Baseline Model and Notebook Created.
* Train a model using a PyCaret Classifier.

**Update 02/07/2022**
* Implemented memory-efficient functions to help PyCaret.

**Credits and Resources**


# 1. Install Libraries & Setup the Notebook

In [None]:
%%capture
# Install PyCaret...
!pip install pycaret[full]

# 2. Importing Required Libraries & Notebook Setup.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
%%time
# Import PyCaret
from pycaret.classification import *

# Visualization library
import matplotlib.pyplot as plt

In [None]:
%%time
# I like to disable my Notebook Warnings because it looks better.
import warnings
warnings.filterwarnings('ignore')

In [None]:
%%time
# Notebook Configuration...

# Amount of data we want to load into the Model...
DATA_ROWS = None
# Dataframe, the amount of rows and cols to visualize...
NROWS = 50
NCOLS = 15
# Main data location path...
BASE_PATH = '/kaggle/input/tabular-playground-series-feb-2022/'

In [None]:
%%time
# Configure notebook display settings to only use 5 decimal places, tables look nicer.
pd.options.display.float_format = '{:,.5f}'.format
pd.set_option('display.max_columns', NCOLS) 
pd.set_option('display.max_rows', NROWS)

# 3. Loading the Train and Test Datasets into a Dataframe.

In [None]:
%%time
trn_data = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/train.csv', index_col = 0)
tst_data = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2022/test.csv', index_col = 0)

In [None]:
%%time
sub = pd.read_csv('../input/tabular-playground-series-feb-2022/sample_submission.csv')

In [None]:
%%time
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

In [None]:
%%time
trn_data = reduce_mem_usage(trn_data)

In [None]:
%%time
tst_data = reduce_mem_usage(tst_data)
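
One thing to keep in mind with `reduce_mem_usage`: it picks the float type from the value range only, so a column can end up as `float16` even if it needs more precision. The sketch below is a small check on a made-up column (the name `feature` and its values are assumptions, not competition data) that compares the memory saved with the rounding error introduced.

```python
import numpy as np
import pandas as pd

# Toy column (made-up values) standing in for one feature of the training data.
toy = pd.DataFrame({"feature": np.linspace(-0.01, 0.01, 1000)})

as_f16 = toy["feature"].astype(np.float16)
as_f32 = toy["feature"].astype(np.float32)

# Largest absolute round-trip error introduced by each cast.
err_f16 = (as_f16.astype(np.float64) - toy["feature"]).abs().max()
err_f32 = (as_f32.astype(np.float64) - toy["feature"]).abs().max()

# Bytes used by the column values in each representation.
bytes_f64 = toy["feature"].memory_usage(index=False)
bytes_f16 = as_f16.memory_usage(index=False)

print(f"float16 max error: {err_f16:.2e}, float32 max error: {err_f32:.2e}")
print(f"float64 bytes: {bytes_f64}, float16 bytes: {bytes_f16}")
```

If the float16 error matters for the features at hand, stopping at `float32` is a reasonable compromise.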

# 4. Visualize the Information Loaded.

In [None]:
%%time
trn_data.head()

In [None]:
%%time
tst_data.head()

In [None]:
%%time
trn_data.describe()

In [None]:
%%time
pd.Series(trn_data['target'], index=trn_data.index).value_counts().sort_index() / len(trn_data) * 100
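
The class-share calculation above can also be written with `value_counts(normalize=True)`. A minimal sketch on made-up labels (the label values are placeholders, not the competition classes):

```python
import pandas as pd

# Made-up labels standing in for the 'target' column.
labels = pd.Series(["a", "b", "a", "c", "a", "b", "a", "c", "a", "b"], name="target")

# Share of each class in percent, ordered by class name.
share = labels.value_counts(normalize=True).sort_index() * 100
print(share)
```

A roughly flat distribution suggests that plain accuracy is a fair metric; a skewed one would call for stratified folds or class weights.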

# 5. Data Pre-Processing.

In [None]:
%%time
# Label encoding the Target variable....

from sklearn.preprocessing import LabelEncoder
target_encoder = LabelEncoder()
trn_data['encoded_target'] = target_encoder.fit_transform(trn_data['target'])
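
The encoder is kept around because the submission step later needs to map integer class indices back to the original labels. A small round-trip sketch, using made-up label names rather than the real targets:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels; the real notebook encodes trn_data['target'].
labels = np.array(["delta", "alpha", "charlie", "alpha", "bravo"])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; a code is an index into it.
print(encoder.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original strings.
decoded = encoder.inverse_transform(codes)
print(decoded)
```

Because the codes follow the sorted label order, a column index picked with `np.argmax` can be decoded with the same encoder, provided the score columns are in that order.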

In [None]:
%%time
# Create a list of the features for the model...

ignore = ['target', 'encoded_target']
features = [feat for feat in trn_data.columns if feat not in ignore]
features_pycaret = [feat for feat in trn_data.columns if feat not in ignore] + ['target']

# 6. Feature Engineering.

In [None]:
%%time
cols = [col for col in trn_data.columns if 'target' not in col]

trn_data['COUNT'] = trn_data.groupby(cols)['A0T0G0C10'].transform('size')
tst_data['COUNT'] = tst_data.groupby(cols)['A0T0G0C10'].transform('size')
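
To see what the `COUNT` feature holds, here is a minimal sketch on a toy frame (column names `f1`, `f2` and the values are made up). Each row gets the number of rows that share exactly the same feature values, so the information survives when duplicates are dropped later.

```python
import pandas as pd

toy = pd.DataFrame({
    "f1": [1, 1, 2, 1],
    "f2": [0, 0, 5, 0],
    "target": ["x", "x", "y", "x"],
})

# Group on every non-target column, as in the cell above.
cols = [col for col in toy.columns if "target" not in col]

# transform('size') broadcasts the group size back to every row of the group.
toy["COUNT"] = toy.groupby(cols)["f1"].transform("size")
print(toy)

# Dropping duplicates afterwards keeps one row per group, with its COUNT.
deduped = toy.drop_duplicates(keep="first")
print(deduped)
```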

In [None]:
%%time
# Define a simple function to create features across columns...

def create_features(df, features):
    """
    Create row-wise aggregate features across the given columns...
    """
    df['A_sum'] = df[features].sum(axis = 1)
    df['A_min'] = df[features].min(axis = 1)
    df['A_max'] = df[features].max(axis = 1)    
    df['A_std'] = df[features].std(axis = 1)
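    # DataFrame.mad was removed in pandas 2.0, so compute the mean absolute deviation by hand.
    df['A_mad'] = df[features].sub(df[features].mean(axis = 1), axis = 0).abs().mean(axis = 1)
    
    df['A_mean'] = df[features].mean(axis = 1)
    # Count positive values across the feature columns; after reduce_mem_usage they may no longer be float64.
    df['A_positive'] = df[features].gt(0).sum(axis = 1)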
    
    return df
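
The row-wise aggregates are easy to verify by hand on a tiny frame. The sketch below uses made-up columns `a`, `b`, `c` and computes the mean absolute deviation manually, since `DataFrame.mad` is no longer available from pandas 2.0 onward.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, -2.0],
    "b": [2.0, 0.0],
    "c": [3.0, 2.0],
})
feats = ["a", "b", "c"]

# Row-wise mean, then the mean of absolute deviations from it.
row_mean = toy[feats].mean(axis=1)
row_mad = toy[feats].sub(row_mean, axis=0).abs().mean(axis=1)

# Number of strictly positive feature values in each row.
row_positive = toy[feats].gt(0).sum(axis=1)

print(pd.DataFrame({"mean": row_mean, "mad": row_mad, "positive": row_positive}))
```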

In [None]:
%%time
# Uses the newly defined function to create new features; disabled for now...

#trn_data = create_features(trn_data,features)
#tst_data = create_features(tst_data,features)

In [None]:
%%time
# Removing the duplicated rows from the dataset...

trn_data = trn_data.drop_duplicates(keep = 'first')

In [None]:
%%time
# Review one more time the first five rows to check everything is in order.
trn_data.head()

# 7. Model Configuration & Training.

In [None]:
%%time
# Creates a PyCaret setup object, to be used in the modeling stage...

exp_mclf101 = setup(data = trn_data[features_pycaret], 
                    target = 'target',
                    use_gpu = True,
                    fold_strategy = 'stratifiedkfold',
                    fold = 3,
                    silent = True,
                    session_id = 69) 
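
The `fold_strategy = 'stratifiedkfold'` and `fold = 3` arguments ask for three stratified folds. The sketch below uses scikit-learn's `StratifiedKFold` directly on made-up labels to show what stratification preserves; it is an illustration of the idea, not PyCaret's internal code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up data: 6 samples of class 0 and 3 samples of class 1.
X = np.zeros((9, 1))
y = np.array([0] * 6 + [1] * 3)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=69)

# Count the classes that land in each validation fold.
fold_counts = [
    np.bincount(y[valid_idx], minlength=2).tolist()
    for _, valid_idx in skf.split(X, y)
]
print(fold_counts)
```

Every validation fold keeps the 2:1 class ratio of the full data, which makes the per-fold scores comparable.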

In [None]:
%%time
# Creates a classifier, PyCaret model / Extra Trees
# The 'development' branch sets n_estimators explicitly and keeps the other scikit-learn defaults...

model_run = 'development'

if model_run == 'train':
    n_estimators = 1186 
    max_depth = 1845
    min_samples_split = 3
    min_samples_leaf = 1
    criterion = 'gini'
else:
    n_estimators = 1536
    max_depth = None
    min_samples_split = 2
    min_samples_leaf = 1
    criterion = 'gini'

extra_tree = create_model('et',
                          n_estimators = n_estimators,
                          max_depth = max_depth,
                          min_samples_split = min_samples_split,
                          min_samples_leaf = min_samples_leaf,
                          criterion = criterion,
                          random_state = 69,
                          n_jobs = -1)

In [None]:
%%time
# Print the model parameters; it gives a good idea of the defaults
print(extra_tree)

# 8. Learning from the Model.
In this section I use two powerful visualizations, Feature Importance and the Confusion Matrix...
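
For readers without a PyCaret session at hand, both views can be approximated with scikit-learn alone. This is a minimal sketch on synthetic data from `make_classification` (not the competition data), assuming the `'et'` estimator corresponds to scikit-learn's `ExtraTreesClassifier`; the small `n_estimators` is only there to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem; sizes and seeds are arbitrary choices.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=69)
X_trn, X_val, y_trn, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=69)

clf = ExtraTreesClassifier(n_estimators=50, random_state=69, n_jobs=-1)
clf.fit(X_trn, y_trn)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_val, clf.predict(X_val), labels=[0, 1, 2])
print(cm)

# Impurity-based importances, one value per feature, normalized to sum to 1.
importances = clf.feature_importances_
print(importances.round(3))
```

Off-diagonal cells in the matrix show which classes get confused with each other, which is what I look at first when deciding where to adjust the predictions.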

In [None]:
%%time
#plot_model(extra_tree, 'feature')

In [None]:
%%time
#plt.figure(figsize=(10,10))
#plot_model(extra_tree, plot = 'confusion_matrix', scale = 0.8)

# 9. Submit the Model for Ranking.
In this section I export the raw predictions (class probabilities) from PyCaret and compile them for the final submission.

In [None]:
%%time
# Run inference once and reuse the result for both the labels and the raw scores.
predictions = predict_model(extra_tree, data=tst_data, raw_score=True)
tst_data['pred'] = predictions['Label']

In [None]:
%%time
score_features = [feat for feat in predictions.columns if 'Score' in feat]
y_prob = predictions[score_features].to_numpy()
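# Hand-set per-class offsets added to the raw scores before the argmax (one value per class, 10 classes).
y_prob += np.array([0, 0, 0.03, 0.036, 0, 0, 0, 0.027, 0, 0])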
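
The effect of adding per-class offsets is easiest to see on a few rows. The sketch below uses three made-up classes and made-up probabilities; only near-ties are flipped, while confident predictions stay the same.

```python
import numpy as np

# Made-up class probabilities for three samples and three classes.
probs = np.array([
    [0.50, 0.48, 0.02],   # near-tie between class 0 and class 1
    [0.10, 0.70, 0.20],   # confident class 1
    [0.34, 0.33, 0.33],   # almost uniform
])

# Hypothetical offsets that favor class 1 slightly.
offsets = np.array([0.0, 0.03, 0.0])

before = probs.argmax(axis=1)
after = (probs + offsets).argmax(axis=1)

print("before:", before)
print("after: ", after)
```

The adjusted scores no longer sum to 1, which is fine here because only the argmax is used.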

In [None]:
%%time
y_pred_tuned = target_encoder.inverse_transform(np.argmax(y_prob, axis=1))
pd.Series(y_pred_tuned, index=tst_data.index).value_counts().sort_index() / len(tst_data) * 100

In [None]:
%%time
sub["target"] = y_pred_tuned
sub.to_csv("submission_02042022.csv", index = False)

# 10. New Ideas for Future Improvement to The Model.