# CatBoostClassifier

By: Jhonnatan Torres

The goal of this notebook is to test a **CatBoostClassifier** on a text classification task. According to the [documentation](https://catboost.ai/en/docs/), the *text_features* parameter takes a list of the names of the columns that contain text in the dataset.
___

## Data
The data contains two columns, *text* and *author*. This dataset was used in a Kaggle competition some years ago, which took place in October. The three authors are **Edgar Allan Poe (EAP)**, **Mary Shelley (MWS)** and **HP Lovecraft (HPL)**.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
from catboost import CatBoostClassifier

In [None]:
train = pd.read_csv('/kaggle/input/spooky-author-identification/train.zip', usecols=['text', 'author'])
test = pd.read_csv('/kaggle/input/spooky-author-identification/test.zip')

In [None]:
train.sample(10)

## ML Model
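* Holding out 25% of the training data as a test set, stratified by author
* Using a CatBoostClassifier model; the task_type is set to "GPU", but it can be set to "CPU"
* The loss function is *MultiClass*, the multi-class log loss (LogLoss), which is also the metric defined for this competition
* Setting early_stopping_rounds=25 to limit overfitting: training stops once the loss on the evaluation set has not improved for 25 iterations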
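
To make the metric concrete, here is a minimal pure-Python sketch of the multi-class log loss: the mean negative log of the probability assigned to the true author. The function name, the clipping constant `eps`, and the toy probabilities are illustrative assumptions, not part of CatBoost or of the competition code.

```python
import math

def multiclass_log_loss(y_true, y_prob, labels, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for true_label, row in zip(y_true, y_prob):
        # Clip to avoid log(0), then renormalise so the row sums to 1.
        clipped = [min(max(p, eps), 1 - eps) for p in row]
        norm = sum(clipped)
        p_true = clipped[labels.index(true_label)] / norm
        total += -math.log(p_true)
    return total / len(y_true)

labels = ["EAP", "HPL", "MWS"]            # column order of the probabilities
y_true = ["EAP", "MWS"]                   # toy ground truth (made up)
y_prob = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]  # toy predictions (made up)
loss = multiclass_log_loss(y_true, y_prob, labels)
print(round(loss, 4))
```

A confident wrong prediction is penalised heavily, which is why well-calibrated probabilities matter more here than raw accuracy.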

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train.drop(columns='author'), train['author'], test_size=0.25, stratify=train['author'])
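
The `stratify` argument keeps the author proportions roughly equal in both splits. The sketch below illustrates the idea in plain Python; it is not scikit-learn's actual algorithm, and the helper name and the toy label counts are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test split.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (made-up counts, not the real dataset distribution).
toy_labels = ["EAP"] * 40 + ["HPL"] * 28 + ["MWS"] * 32
train_idx, test_idx = stratified_split(toy_labels)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(test_counts)
```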

In [None]:
X_train

In [None]:
model = CatBoostClassifier(text_features=['text'], random_state=1234, auto_class_weights='Balanced', loss_function='MultiClass',
                           task_type='GPU', devices="0:1")
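
As a rough illustration of what `auto_class_weights='Balanced'` does: as far as I understand the CatBoost documentation, each class is weighted by the size of the largest class divided by its own size, so rarer authors count more in the loss. The sketch below assumes that formula and unweighted samples; the helper name and the toy label counts are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by (largest class count) / (its own count).

    Assumption: this mirrors the 'Balanced' formula described in the
    CatBoost docs for the case where every sample has weight 1.
    """
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / n for label, n in counts.items()}

# Toy labels (made up, not the real dataset distribution).
toy_labels = ["EAP"] * 4 + ["HPL"] * 2 + ["MWS"] * 3
weights = balanced_class_weights(toy_labels)
print(weights)
```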

In [None]:
model.fit(X_train, y_train, eval_set=(X_test, y_test), early_stopping_rounds=25, verbose=100)
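
The early stopping rule can be sketched in a few lines: keep track of the best evaluation loss and stop once it has not improved for a given number of rounds. This is only an illustration of the idea, not CatBoost's overfitting detector; the function name and the loss values are made up.

```python
def early_stopping_round(eval_losses, patience=25):
    """Return (best_iteration, stopped_at) for a list of evaluation losses."""
    best_loss = float("inf")
    best_iter = -1
    for i, loss in enumerate(eval_losses):
        if loss < best_loss:
            # New best: remember it and keep training.
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_iter, i
    return best_iter, len(eval_losses) - 1

# Toy loss curve (made up), with a small patience so the stop is visible.
losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.74]
best_iter, stopped_at = early_stopping_round(losses, patience=3)
print(best_iter, stopped_at)
```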

## Predictions
* You can get either the class probabilities (`predict_proba`) or the predicted classes (`predict`)

In [None]:
print(model.predict_proba(X_test)[0:5])
print(y_test[0:5])
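
The two outputs are related: assuming the default behaviour, the predicted class is the one with the highest probability in each row. A minimal sketch with made-up probabilities, assuming the columns are ordered EAP, HPL, MWS:

```python
labels = ["EAP", "HPL", "MWS"]  # assumed column order of the probabilities

# Toy probability rows (made up), one row per text.
proba = [
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.25, 0.50, 0.25],
]

# For each row, pick the label whose probability is largest (argmax).
predicted = [labels[max(range(len(row)), key=row.__getitem__)] for row in proba]
print(predicted)
```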

In [None]:
predictions = model.predict(X_test)

In [None]:
# Flatten the results: predict returns one single-element row per sample here
predictions = [item for sublist in predictions for item in sublist]

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test, predictions))
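
To read the confusion matrix: in scikit-learn's convention, rows are the true authors and columns are the predicted authors, so the diagonal holds the correct predictions. A small pure-Python sketch of the same counting, with made-up labels and a hypothetical helper name:

```python
def confusion_counts(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows = true class, columns = predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

labels = ["EAP", "HPL", "MWS"]
y_true = ["EAP", "EAP", "HPL", "MWS", "MWS"]  # toy ground truth (made up)
y_pred = ["EAP", "HPL", "HPL", "MWS", "EAP"]  # toy predictions (made up)
matrix = confusion_counts(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```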

## Retraining the model
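* The model is retrained on the full training set in order to submit the results to the competition
* No eval_set is passed this time, so early_stopping_rounds has no evaluation data to monitor and training is expected to run for the full number of iterations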

In [None]:
model.fit(train.drop(columns='author'), train['author'], early_stopping_rounds=25, verbose=100)

In [None]:
preds = model.predict_proba(test.drop(columns='id'))
# model.classes_ holds the label order of the probability columns (EAP, HPL, MWS)
preds_df = pd.DataFrame(preds, columns=model.classes_)
preds_df = pd.concat(objs=[test['id'], preds_df], axis="columns")

In [None]:
preds_df.sample(5)

Exporting the predictions to CSV format

In [None]:
preds_df.to_csv("submission.csv", index=False)

In [None]:
pd.read_csv('/kaggle/input/spooky-author-identification/sample_submission.zip').head()

## Closing Comments
* I got a LogLoss score of 0.38 (late submission) on the competition leaderboard (LB)
* If you are in a hurry, **CatBoostClassifier** can be very handy: you can use *categorical features* without additional transformations or preprocessing, *numeric features* without scaling or normalizing (because this is a tree-based model), and *text features* without additional preprocessing or a vectorizer