# Buscalibre Numeric Model

The preprocessed dataset has two types of data: numeric and text. First, we build an estimator using only the numeric part, comparing several machine learning models.

## Libraries and Data

Import the necessary packages.

In [1]:
import numpy as np
import pandas as pd
import os

Load the training dataset and shuffle it, since its rows come grouped in sections.

In [2]:
path_folder = os.getcwd().replace("\\", "/") + "/"
path_parent = os.path.dirname(os.getcwd()).replace("\\", "/") + "/"
train = pd.read_csv(path_parent + "data_analysis/train_2.csv")
test = pd.read_csv(path_parent + "data_analysis/test_2.csv")

In [3]:
train = train.sample(frac=1, random_state=123).reset_index(drop=True)
X = train.drop(columns=["isbn", "review", "topic", "review_cleaned"])
y = train["topic"]

Define the cross validated scorer function.

Note: As we saw in the Exploratory Analysis part, the target labels are imbalanced, so we use the following metrics:

- [weighted f1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html): used instead of accuracy, as it is better suited to imbalanced multiclass targets.
- [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix): to visualize the results.
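To make the weighted f1 score concrete, here is a minimal from-scratch sketch: the F1 of each class is computed separately and then averaged with weights proportional to the number of true samples of that class. The helper `weighted_f1` and the toy labels are illustrative assumptions, not part of the notebook; the notebook itself relies on `f1_score(..., average="weighted")`.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of the per-class F1 scores (illustrative helper)."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += n_true * f1
    return total / len(y_true)

# Toy labels, made up for this example.
y_true = ["a", "a", "a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "a", "b", "b", "b", "a", "c"]
print(weighted_f1(y_true, y_pred))
```

Because each class is weighted by its support, a frequent class still dominates the average, but a model that ignores the small classes is penalised through their zero F1 terms.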

In [4]:
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score

In [5]:
def cross_score(model, k=10):
    kf = StratifiedKFold(n_splits=k, shuffle=True)
    scores = cross_val_score(model, X, y, cv=kf, scoring="f1_weighted")
    return np.mean(scores)

### Train-Test Split

Ideally, a model's score is validated with cross-validation alone, but we also hold out a validation split to inspect each model's behaviour, such as its confusion matrix.

In [6]:
from sklearn.model_selection import train_test_split
X_hold, X_val, y_hold, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=999)
print(f"Holding set shapes: {X_hold.shape}, {y_hold.shape}")
print(f"Validation set shapes: {X_val.shape}, {y_val.shape}")

Holding set shapes: (2130, 26), (2130,)
Validation set shapes: (533, 26), (533,)


### Common Label

The topic (target label) "grandes-descuentos" has the greatest number of samples. So a baseline prediction is to assign every sample to it.

If we do so, we get a score of:

In [7]:
%%time
y_pred = ["grandes-descuentos" for i in range(len(y_val))]
score = f1_score(y_val, y_pred, average="weighted")
print(f"Predicting with the common label has an f1 score of: {score}.\n")
print(f"And the following Validation confusion matrix: \n {confusion_matrix(y_val, y_pred)} \n")

Predicting with the common label has an f1 score of: 0.24682589316735654.

And the following Validation confusion matrix: 
 [[  0   0   0   0   0   0   0   0   0  23   0   0   0]
 [  0   0   0   0   0   0   0   0   0  38   0   0   0]
 [  0   0   0   0   0   0   0   0   0  17   0   0   0]
 [  0   0   0   0   0   0   0   0   0  88   0   0   0]
 [  0   0   0   0   0   0   0   0   0  46   0   0   0]
 [  0   0   0   0   0   0   0   0   0  15   0   0   0]
 [  0   0   0   0   0   0   0   0   0   8   0   0   0]
 [  0   0   0   0   0   0   0   0   0  14   0   0   0]
 [  0   0   0   0   0   0   0   0   0   3   0   0   0]
 [  0   0   0   0   0   0   0   0   0 223   0   0   0]
 [  0   0   0   0   0   0   0   0   0  31   0   0   0]
 [  0   0   0   0   0   0   0   0   0  18   0   0   0]
 [  0   0   0   0   0   0   0   0   0   9   0   0   0]] 

Wall time: 143 ms
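As a sanity check, the baseline score can be reproduced by hand from the confusion matrix above. Only "grandes-descuentos" has a non-zero F1: its recall is 1 and its precision is 223/533, and its weight in the average is also 223/533. The short sketch below just redoes that arithmetic; the variable names are ours.

```python
# Counts read off the validation confusion matrix above.
n_major = 223   # validation samples labelled "grandes-descuentos"
n_total = 533   # all validation samples

precision = n_major / n_total   # every sample is predicted as the majority class
recall = 1.0                    # every majority sample is recovered
f1_major = 2 * precision * recall / (precision + recall)

# All other classes have F1 = 0, so only the majority term survives.
weighted_f1 = (n_major / n_total) * f1_major
print(round(weighted_f1, 6))  # 0.246826
```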


## Base Models

We use several simple predictors provided by [Scikit-Learn](https://scikit-learn.org/) and [TensorFlow](https://www.tensorflow.org).

The hyperparameters were chosen with the [Optuna package](https://optuna.org/), running on [Google Colab](https://colab.research.google.com/) servers.

### Random Forest Classifier

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

In [9]:
%%time
ovr_rfc = OneVsRestClassifier(RandomForestClassifier(**{
    'n_estimators': 334,
    'criterion': 'entropy',
    'max_depth': 42,
    'min_samples_split': 9,
    'min_samples_leaf': 4,
    'max_features': 0.45448755763486154,
    'random_state': 555
}))
score = cross_score(ovr_rfc)
print(f"Random Forest Classifier has a cross validated f1 score of: {score}. \n")
ovr_rfc.fit(X_hold, y_hold)
ovr_rfc_pred = ovr_rfc.predict(X_val)
print(f"And the following Validation confusion matrix: \n {confusion_matrix(y_val, ovr_rfc_pred)} \n")

Random Forest Classifier has a cross validated f1 score of: 0.6582078030807285. 

And the following Validation confusion matrix: 
 [[ 15   5   0   0   0   0   0   0   0   3   0   0   0]
 [  7  21   0   4   0   0   0   0   0   4   1   1   0]
 [  0   0   5   4   5   0   0   1   0   2   0   0   0]
 [  0   2   1  67   3   0   0   0   0  10   4   1   0]
 [  1   1   2   9  21   2   0   2   0   7   0   1   0]
 [  0   0   0   8   3   3   0   0   0   0   1   0   0]
 [  0   0   0   0   0   0   6   1   0   0   0   1   0]
 [  0   1   0   1   3   0   0   4   0   2   2   1   0]
 [  0   0   0   1   2   0   0   0   0   0   0   0   0]
 [  0   5   0   3   2   0   0   1   0 209   0   3   0]
 [  0   3   0  10   0   0   0   0   0   4  14   0   0]
 [  0   4   0   3   1   0   0   0   0   4   0   6   0]
 [  0   0   0   7   0   0   0   0   0   0   2   0   0]] 

Wall time: 4min 35s


### Logistic Regression

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [11]:
%%time
logreg = make_pipeline(StandardScaler(), MinMaxScaler(), LogisticRegression(
    C=8.261486231908338,
    tol=0.8728213920467933,
    intercept_scaling=9.117615728181427,
    multi_class="multinomial",
    max_iter=10_000,
    random_state=555
))
score = cross_score(logreg)
print(f"Logistic Regression has a cross validated f1 score of: {score}. \n")
logreg.fit(X_hold, y_hold)
logreg_pred = logreg.predict(X_val)
print(f"And the following Validation confusion matrix: \n {confusion_matrix(y_val, logreg_pred)} \n")

Logistic Regression has a cross validated f1 score of: 0.642663188564323. 

And the following Validation confusion matrix: 
 [[ 15   3   0   0   1   0   0   0   0   4   0   0   0]
 [  4  19   0   5   0   0   0   0   0   7   3   0   0]
 [  0   0   5   5   3   0   1   2   0   1   0   0   0]
 [  0   3   0  68   2   0   0   0   0  10   4   1   0]
 [  1   2   4  12  16   1   0   2   0   8   0   0   0]
 [  0   1   0   7   3   3   0   0   0   1   0   0   0]
 [  0   0   0   0   0   0   6   2   0   0   0   0   0]
 [  0   1   0   6   1   0   0   3   0   1   0   2   0]
 [  0   0   0   1   2   0   0   0   0   0   0   0   0]
 [  1   3   0   6   3   0   0   0   0 209   0   1   0]
 [  1   2   0   9   0   0   0   0   0   6  13   0   0]
 [  0   1   0   4   1   0   0   0   0   5   0   7   0]
 [  0   0   0   7   0   0   0   0   0   0   2   0   0]] 

Wall time: 2.74 s


### XGBoost Classifier

In [12]:
from xgboost import XGBClassifier

In [13]:
%%time
ovr_xgb = OneVsRestClassifier(XGBClassifier(**{
    'n_estimators': 378,
    'learning_rate': 0.02950073992817461,
    'base_score': 0.9187179242725662,
    'verbosity': 0,
    'use_label_encoder': False,
    'random_state': 555
}))
score = cross_score(ovr_xgb)
print(f"XGBoost Classifier has a cross validated f1 score of: {score}. \n")
ovr_xgb.fit(X_hold, y_hold)
ovr_xgb_pred = ovr_xgb.predict(X_val)
print(f"And the following Validation confusion matrix: \n {confusion_matrix(y_val, ovr_xgb_pred)} \n")

XGBoost Classifier has a cross validated f1 score of: 0.6586022709564061. 

And the following Validation confusion matrix: 
 [[ 14   4   0   0   0   0   0   0   0   5   0   0   0]
 [  7  22   0   2   0   0   0   0   0   4   3   0   0]
 [  0   0   5   2   6   0   1   1   0   2   0   0   0]
 [  0   3   0  64   3   1   0   1   0   7   6   1   2]
 [  1   1   1  11  22   1   0   1   1   6   0   1   0]
 [  0   0   0   8   2   2   0   0   1   0   1   1   0]
 [  0   0   0   0   0   0   5   2   0   0   0   1   0]
 [  1   1   0   3   2   0   0   5   0   1   0   1   0]
 [  0   0   1   1   1   0   0   0   0   0   0   0   0]
 [  1   6   0   3   2   1   0   0   0 208   0   2   0]
 [  0   3   0   9   0   0   0   1   0   5  13   0   0]
 [  0   3   0   4   1   0   0   0   0   3   0   7   0]
 [  0   0   0   6   0   0   0   0   0   1   1   0   1]] 

Wall time: 1min 37s


### Light GBM Classifier

In [14]:
from lightgbm import LGBMClassifier

In [15]:
%%time
lgb = LGBMClassifier(**{
    'num_leaves': 27,
    'n_estimators': 268,
    'learning_rate': 0.018813923117324143,
    'random_state': 555
})
score = cross_score(lgb)
print(f"LGBM Classifier has a cross validated f1 score of: {score}. \n")
lgb.fit(X_hold, y_hold)
lgb_pred = lgb.predict(X_val)
print(f"And the following Validation confusion matrix: \n {confusion_matrix(y_val, lgb_pred)} \n")

LGBM Classifier has a cross validated f1 score of: 0.6452160708723848. 

And the following Validation confusion matrix: 
 [[ 15   5   0   0   1   0   0   0   0   2   0   0   0]
 [  7  20   0   3   0   0   0   0   0   6   2   0   0]
 [  0   0   2   2   6   0   1   2   1   3   0   0   0]
 [  0   3   0  64   6   0   0   0   0   8   5   1   1]
 [  1   0   4  10  19   3   0   1   0   7   0   1   0]
 [  0   0   0   6   4   2   0   0   0   1   1   0   1]
 [  0   0   0   0   0   0   5   2   0   0   0   1   0]
 [  1   1   0   3   1   0   0   4   0   2   1   1   0]
 [  0   0   0   1   2   0   0   0   0   0   0   0   0]
 [  2   5   0   2   3   0   0   1   0 208   0   2   0]
 [  1   2   0   6   1   0   0   1   0   5  14   1   0]
 [  0   3   0   4   1   0   0   1   0   3   0   6   0]
 [  0   0   0   6   0   0   0   0   0   1   1   0   1]] 

Wall time: 32 s


### CatBoost Classifier

In [16]:
from catboost import CatBoostClassifier

In [17]:
%%time
ovr_cat = OneVsRestClassifier(CatBoostClassifier(**{
    'iterations': 330,
    'learning_rate': 0.043379595491767745,
    'depth': 7,
    'l2_leaf_reg': 0.5416613355579589,
    'border_count': 212,
    'loss_function': 'MultiClass',
    'verbose': False,
    'random_state': 555
}))
score = cross_score(ovr_cat)
print(f"CatBoost Classifier has a cross validated f1 score of: {score}. \n")
ovr_cat.fit(X_hold, y_hold)
ovr_cat_pred = ovr_cat.predict(X_val)
print(f"And the following Validation confusion matrix: \n {confusion_matrix(y_val, ovr_cat_pred)} \n")

CatBoost Classifier has a cross validated f1 score of: 0.6505277205855613. 

And the following Validation confusion matrix: 
 [[ 15   3   0   0   0   0   0   0   0   5   0   0   0]
 [  6  20   0   4   0   0   0   0   0   4   4   0   0]
 [  0   0   6   3   3   0   0   2   1   2   0   0   0]
 [  0   2   1  64   3   0   0   0   0  10   6   1   1]
 [  1   2   4   7  21   2   0   2   0   5   0   2   0]
 [  0   0   0   7   4   3   0   0   0   0   1   0   0]
 [  0   0   0   0   0   0   6   1   0   0   0   1   0]
 [  0   1   0   1   3   0   0   4   0   2   2   1   0]
 [  0   0   0   1   2   0   0   0   0   0   0   0   0]
 [  1   5   0   4   1   0   0   0   0 209   0   3   0]
 [  0   5   0   7   0   0   0   1   0   3  14   0   1]
 [  0   4   0   3   1   0   0   0   0   3   0   7   0]
 [  0   0   0   6   0   0   0   0   0   0   2   0   1]] 

Wall time: 7min 6s


### Feedforward Network Classifier

In [18]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.base import BaseEstimator, ClassifierMixin



In [19]:
%%time
tf.random.set_seed(555)
class NetworkClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, ini_neurons=60, optimizer="adam", epochs=200):
        self.ini_neurons = ini_neurons
        self.optimizer = optimizer
        self.epochs = epochs
    # Fit Function, fit a Neural Network with one hidden layer of ini_neurons neurons.
    # The network is built here (not in __init__) so that refitting starts from scratch.
    def fit(self, X, y):
        y_enc = pd.get_dummies(y)
        self.cols = y_enc.columns
        y_np = y_enc.to_numpy(dtype="float32")
        X_np = X.to_numpy()
        self.model = Sequential()
        self.model.add(Dense(self.ini_neurons, input_shape=(X.shape[1], ), activation="relu"))
        # One output unit per class seen during fit (13 topics in this dataset).
        self.model.add(Dense(len(self.cols), activation="softmax"))
        self.model.compile(
            optimizer=self.optimizer, loss="categorical_crossentropy", metrics=["accuracy"]
        )
        self.model.fit(
            X_np, y_np, epochs=self.epochs, verbose=0
        )
        return self
    # Predict Function, return the label with the highest predicted probability.
    def predict(self, X):
        X_np = X.to_numpy()
        y_hat = self.model.predict(X_np)
        y_df = pd.DataFrame(data=y_hat, columns=self.cols)
        y_pred = y_df.idxmax(axis=1)
        return y_pred

nc = NetworkClassifier()
score = cross_score(nc)
print(f"Network Classifier has a cross validated f1 score of: {score}. \n")
nc.fit(X_hold, y_hold)
nc_pred = nc.predict(X_val)
print(f"And the following Validation confusion matrix: \n {confusion_matrix(y_val, nc_pred)} \n")

Network Classifier has a cross validated f1 score of: 0.6458845412913149. 

And the following Validation confusion matrix: 
 [[ 16   3   0   0   0   0   0   0   0   4   0   0   0]
 [  7  22   0   4   0   0   0   2   0   2   1   0   0]
 [  0   0   3   3   6   0   0   3   0   1   0   1   0]
 [  0   6   0  66   2   0   0   6   0   5   2   1   0]
 [  1   1   1  10  21   0   1   7   0   4   0   0   0]
 [  0   1   0   8   3   3   0   0   0   0   0   0   0]
 [  0   0   0   0   0   1   5   2   0   0   0   0   0]
 [  0   1   0   2   3   0   0   5   0   1   0   2   0]
 [  0   0   0   1   2   0   0   0   0   0   0   0   0]
 [  5   4   0   4   5   1   0   4   0 194   4   2   0]
 [  1   4   0   5   0   0   0   3   0   4  13   1   0]
 [  0   3   0   3   2   0   0   0   0   3   0   7   0]
 [  0   0   0   7   0   0   0   0   0   0   2   0   0]] 

Wall time: 1min 33s
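The network classifier above relies on a one-hot round trip: `pd.get_dummies` turns the topic labels into indicator columns for training, and `idxmax` maps each row of predicted probabilities back to a label. A minimal sketch of that round trip follows; the topic names and probabilities are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical labels, only for illustration.
y_toy = pd.Series(["novela", "cocina", "novela", "infantil"])

# Encode: one indicator column per class (columns are sorted alphabetically).
y_enc = pd.get_dummies(y_toy)
cols = y_enc.columns

# Pretend these are softmax outputs, one row per sample, one column per class
# in the same order as `cols` (cocina, infantil, novela).
probs = np.array([
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
])

# Decode: pick the column name with the highest probability in each row.
decoded = pd.DataFrame(probs, columns=cols).idxmax(axis=1)
print(decoded.tolist())
```

Storing the column order at fit time is what keeps the decoding consistent with the encoding.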


## Correlation

Every model reaches a similar, acceptable cross-validated f1 score (roughly 0.64 to 0.66). To decide which models to use, we first check how correlated their validation predictions are.

In [20]:
from sklearn.preprocessing import LabelEncoder

In [21]:
le = LabelEncoder()
le.fit(y)
stack = np.column_stack([
    le.transform(ovr_rfc_pred), 
    le.transform(logreg_pred), 
    le.transform(ovr_xgb_pred), 
    le.transform(lgb_pred), 
    le.transform(ovr_cat_pred), 
    le.transform(nc_pred)
])
stack = pd.DataFrame(data=stack, columns=["rfc", "logreg", "xgb", "lgb", "catb", "network"])

In [22]:
stack.corr()

| | rfc | logreg | xgb | lgb | catb | network |
|---|---|---|---|---|---|---|
| rfc | 1.0 | 0.825543 | 0.856202 | 0.846 | 0.893799 | 0.817743 |
| logreg | 0.825543 | 1.0 | 0.74302 | 0.724786 | 0.810144 | 0.833218 |
| xgb | 0.856202 | 0.74302 | 1.0 | 0.868415 | 0.825789 | 0.75354 |
| lgb | 0.846 | 0.724786 | 0.868415 | 1.0 | 0.827099 | 0.730183 |
| catb | 0.893799 | 0.810144 | 0.825789 | 0.827099 | 1.0 | 0.772371 |
| network | 0.817743 | 0.833218 | 0.75354 | 0.730183 | 0.772371 | 1.0 |
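The correlation above is computed on label-encoded topics, so its values depend on the integer order that `LabelEncoder` assigns to the classes. As a complementary check (an assumption of ours, not something the notebook does), one can measure how often two models predict exactly the same label. The helper `agreement_matrix` and the toy predictions below are hypothetical.

```python
import numpy as np
import pandas as pd

def agreement_matrix(preds):
    """Fraction of samples on which each pair of models predicts the same label."""
    names = list(preds)
    out = pd.DataFrame(index=names, columns=names, dtype=float)
    for a in names:
        for b in names:
            same = np.asarray(preds[a]) == np.asarray(preds[b])
            out.loc[a, b] = float(np.mean(same))
    return out

# Toy predictions from three hypothetical models.
toy_preds = {
    "m1": ["a", "b", "a", "c"],
    "m2": ["a", "b", "b", "c"],
    "m3": ["c", "b", "a", "a"],
}
agreement = agreement_matrix(toy_preds)
print(agreement)
```

In the notebook this could be called with the validation predictions (for example `ovr_rfc_pred`, `logreg_pred`, `lgb_pred`); pairs with low agreement are the more promising candidates for an ensemble.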


In summary, we have two candidate models: the Random Forest Classifier, which got one of the best cross-validated f1 scores (0.658, essentially tied with XGBoost), and a combination of the least linearly correlated models: Logistic Regression + LightGBM + Network Classifier. We fit the combination next.

## Weighted Voting Model

One simple way to combine and possibly improve the predictions is to build a [Voting Classifier](https://en.wikipedia.org/wiki/Ensemble_learning). Each estimator makes its own prediction, and we keep the most voted label. One can also put a weight on each estimator, to favor it over the others and get an overall improved prediction.

This time, the weights were chosen using Optuna.

Define and instantiate the Weighted Averaging Estimator class.
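As a small worked example of weighted voting, the sketch below tallies the weighted votes of three models for three samples. The function `weighted_vote` and the toy predictions are illustrative assumptions; the weights `[0.3, 0.4, 0.3]` are the ones used later in this notebook.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """predictions: one list of labels per model; weights: one weight per model."""
    voted = []
    for votes in zip(*predictions):          # votes of all models for one sample
        tally = defaultdict(float)
        for label, weight in zip(votes, weights):
            tally[label] += weight           # each model adds its weight to its label
        voted.append(max(tally, key=tally.get))
    return voted

toy_predictions = [
    ["a", "a", "a"],   # model 1
    ["b", "b", "b"],   # model 2
    ["a", "b", "c"],   # model 3
]
result = weighted_vote(toy_predictions, weights=[0.3, 0.4, 0.3])
print(result)  # ['a', 'b', 'b']
```

In the third sample every model votes for a different label, so the label of the model with the largest weight wins. This is where the weights matter most.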

In [23]:
from sklearn.base import clone

In [24]:
class WeightedAveragingEstimator(BaseEstimator, ClassifierMixin):
    def __init__(self, models, weights=None):
        self.models = models
        self.weights = weights
    # Fit Function, fit cloned models to prevent overwriting in cross validation.
    def fit(self, X, y):
        self.cols = pd.get_dummies(y).columns
        self.models_ = [clone(x) for x in self.models]
        for model in self.models_:
            model.fit(X, y)
        return self
    # Predict Function, add up the weighted votes of each model and keep the most voted label.
    def predict(self, X):
        # Fall back to equal weights when none are given.
        weights = self.weights if self.weights is not None else [1.0] * len(self.models_)
        sum_ = pd.DataFrame(dtype=float, columns=self.cols)
        for i, model in enumerate(self.models_):
            y_pred_ = model.predict(X)
            y_hat = weights[i] * pd.get_dummies(y_pred_)
            sum_ = sum_.add(y_hat, fill_value=0)
        # fillna returns a new frame, so the result must be assigned back.
        sum_ = sum_.fillna(value=0)
        y_pred = sum_.idxmax(axis=1)
        return y_pred

In [25]:
models = (
    make_pipeline(StandardScaler(), MinMaxScaler(), LogisticRegression(
        C=8.261486231908338,
        tol=0.8728213920467933,
        intercept_scaling=9.117615728181427,
        multi_class="multinomial",
        max_iter=10_000,
        random_state=555
    )),
    LGBMClassifier(**{
        'num_leaves': 27,
        'n_estimators': 268,
        'learning_rate': 0.018813923117324143,
        'random_state': 555
    }),
    NetworkClassifier()
)
weights = [0.3, 0.4, 0.3]
wae = WeightedAveragingEstimator(models=models, weights=weights)

In [26]:
%%time
score = cross_score(wae)
print(f"Weighted Averaging Estimator has a cross validated f1 score of: {score}. \n")
wae.fit(X_hold, y_hold)
y_pred = wae.predict(X_val)
print(f"And the following Validation confusion matrix: \n {confusion_matrix(y_val, y_pred)} \n")

Weighted Averaging Estimator has a cross validated f1 score of: 0.6496798494514795. 

And the following Validation confusion matrix: 
 [[ 15   4   0   0   0   0   0   0   0   4   0   0   0]
 [  7  19   0   6   0   0   0   0   0   4   2   0   0]
 [  0   0   3   4   5   0   1   2   1   1   0   0   0]
 [  0   4   0  69   3   0   0   0   0   7   4   1   0]
 [  1   2   3  11  19   2   0   2   0   6   0   0   0]
 [  0   1   0   7   4   3   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   6   2   0   0   0   0   0]
 [  0   1   0   5   1   0   0   4   0   1   0   2   0]
 [  0   0   0   1   2   0   0   0   0   0   0   0   0]
 [  1   5   0   4   3   0   0   1   0 206   0   3   0]
 [  1   3   0   8   0   0   0   1   0   5  12   1   0]
 [  0   3   0   4   2   0   0   0   0   3   0   6   0]
 [  0   0   0   7   0   0   0   0   0   0   2   0   0]] 

Wall time: 2min 3s


## Conclusion

All models performed reasonably well, but we keep the CatBoost Classifier model as it got the best rating.

# Review Model

The next step is to build a classifier that takes only the review data as input, and then compare it with the numeric model (the CatBoost Classifier) to see if we can get improvements.

In [27]:
path_folder = os.getcwd().replace("\\", "/") + "/"
path_parent = os.path.dirname(os.getcwd()).replace("\\", "/") + "/"
train = pd.read_csv(path_parent + "data_analysis/train_2.csv")
test = pd.read_csv(path_parent + "data_analysis/test_2.csv")

In [28]:
train = train.sample(frac=1, random_state=123).reset_index(drop=True)
X = train["review_cleaned"]
y = train["topic"]

## Word Embedding

Machines only understand numbers, so to process text with a classifier, we have to represent each word as a vector in some vector space.

There are many ways to do that. One of the easiest is to use a pre-trained word embedding matrix. Here we use the [GloVe Embedding](https://nlp.stanford.edu/projects/glove/) to map each word into a 300-dimensional real vector space.
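To illustrate what an embedding lookup provides, here is a minimal sketch with a made-up 3-dimensional embedding (the real vectors used below have 300 dimensions). The words, the vector values and the helper `cosine` are assumptions for illustration only. The point is that words with related meanings get vectors pointing in similar directions.

```python
import numpy as np

# Hypothetical toy embedding: word -> vector.
toy_embedding = {
    "libro": np.array([0.9, 0.1, 0.0]),
    "novela": np.array([0.8, 0.2, 0.1]),
    "cocina": np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_related = cosine(toy_embedding["libro"], toy_embedding["novela"])
sim_unrelated = cosine(toy_embedding["libro"], toy_embedding["cocina"])
print(sim_related > sim_unrelated)  # True

# Words missing from the embedding have no vector and must be skipped.
print(toy_embedding.get("xyz") is None)  # True
```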

### Lemmatizer

Two inflected forms of the same word can look quite different without changing the meaning of a sentence. It is a good idea to reduce each word to its base form to make the classifier's job easier. That process is known as lemmatization.

Import the Spanish lemmatizer.

In [29]:
import spacy
import spacy_spanish_lemmatizer
nlp = spacy.load("es_core_news_sm")
nlp.replace_pipe("lemmatizer", "spanish_lemmatizer")

<spacy_spanish_lemmatizer.main.SpacyCustomLemmatizer at 0x2dcaf96cc40>

Define the lemmatizer function and apply it to our data.

In [30]:
import time
def lemmatizer(X):
    start = time.time()
    lemma = X.apply(lambda x: " ".join([token.lemma_ for token in nlp(x)]))
    end = time.time()
    print(f"Lemmatization Done in {(end - start)//60:.2f} minutes")
    return lemma

In [31]:
X_lemma = lemmatizer(X)
X_lemma = pd.DataFrame(data=X_lemma)

Lemmatization Done in 14.00 minutes


### GloVe Embedding

Create a dictionary with each word and its corresponding vector in the GloVe Embedding.

In [32]:
from tqdm import tqdm
embedding_vector = {}
with open(path_folder + "SBW-vectors-300-min5.txt", encoding="utf8") as f:
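    for line in tqdm(f):
        value = line.split(" ")
        # Skip a possible header line (e.g. "<vocabulary size> <dimension>"), if present.
        if len(value) < 3:
            continue
        word = value[0]
        coef = np.array(value[1:], dtype="float32")
        embedding_vector[word] = coef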

1000654it [01:43, 9701.24it/s] 


Create a dataframe of shape (*, 300) that holds one word vector per row, with the words as index.

In [44]:
def get_vocab_df(X):
    # Collect the unique words of the lemmatized reviews (first column of X).
    X_vocab = [item for i in range(X.shape[0]) for item in X.iloc[i, 0].split(" ")]
    X_vocab = list(set(X_vocab))
    vocab_df = pd.DataFrame(data=[], columns=[str(i) for i in range(300)])
    for word in tqdm(X_vocab):
        try:
            temp = embedding_vector[word]
            temp = pd.Series(temp, name=word)
            vocab_df.at[word, :] = temp.values
        except (KeyError, ValueError):
            # Skip words missing from the embedding or with a malformed vector.
            pass
    return vocab_df

In [45]:
vocab_df = get_vocab_df(X_lemma)

100%|████████████████████████████████████████████████████████████████████████████| 28355/28355 [26:49<00:00, 17.62it/s]
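Filling a dataframe one row at a time is slow, which is consistent with the long running time reported above. A possible alternative, sketched here under the assumption that all vectors have the same length, is to stack the known vectors once and build the dataframe in a single step. The helper `build_vocab_df` and the toy embedding are hypothetical.

```python
import numpy as np
import pandas as pd

def build_vocab_df(words, embedding, dim):
    """Build a (n_known_words, dim) dataframe indexed by word in one shot."""
    known = sorted(w for w in set(words) if w in embedding)
    if known:
        matrix = np.vstack([embedding[w] for w in known])
    else:
        matrix = np.empty((0, dim))
    return pd.DataFrame(matrix, index=known, columns=[str(i) for i in range(dim)])

# Toy 3-dimensional embedding, made up for this example.
toy_embedding = {
    "libro": np.array([0.9, 0.1, 0.0], dtype="float32"),
    "novela": np.array([0.8, 0.2, 0.1], dtype="float32"),
}
toy_vocab_df = build_vocab_df(["libro", "novela", "xyz", "libro"], toy_embedding, dim=3)
print(toy_vocab_df)
```

Words without a vector ("xyz" here) are dropped, as in the loop above, and the result keeps a numeric dtype instead of `object`.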


## Clustering the Words

Now that each word is embedded in a vector space, we cluster the words into sets of similar words.

We are using the [Gaussian Mixture](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html) algorithm.

In [46]:
from sklearn.mixture import GaussianMixture

In [49]:
n_clusters = 100
clustering = GaussianMixture(n_components=n_clusters)
clustering.fit(vocab_df)
labels = clustering.predict(vocab_df)
clusters = np.unique(labels)
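To show what the Gaussian Mixture step does on a small scale, the sketch below fits a two-component mixture to two well-separated synthetic blobs. The data, the seed and the variable names are assumptions for illustration. `fit_predict` returns one hard cluster label per point, while `predict_proba` gives the soft membership probabilities.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic, well-separated 2D blobs (toy data).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# Fixing random_state makes the cluster assignment reproducible.
gm = GaussianMixture(n_components=2, random_state=0)
toy_labels = gm.fit_predict(points)       # hard assignment, one label per point
toy_probs = gm.predict_proba(points)      # soft assignment, rows sum to 1

print(np.unique(toy_labels[:50]), np.unique(toy_labels[50:]))
```

Passing a `random_state` to the 100-component model above would make its clusters, and therefore the features derived from them, reproducible between runs.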

Store the cluster labels in a Series indexed by word.

In [50]:
words_labels = pd.Series(data=labels, name="label", index=vocab_df.index)

Create a new dataset in which the columns are the cluster labels and each value is the number of words in the review that belong to that cluster.

In [52]:
zeros_mat = np.zeros(shape=(X.shape[0], n_clusters))
X_new = pd.DataFrame(data=zeros_mat, columns=[str(i) for i in range(n_clusters)])
for i in tqdm(range(X.shape[0])):
    for word in X_lemma.iloc[i][0].split(" "):
        if word in vocab_df.index:
            col = str(words_labels[word])
            X_new.at[i, col] = X_new.at[i, col] + 1

100%|█████████████████████████████████████████████████████████████████████████████| 2663/2663 [00:09<00:00, 285.12it/s]
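The counting step above can be seen on a toy example: each review becomes a row, and each column counts how many of its words fall into a given cluster. The function `cluster_counts`, the word-to-cluster mapping and the reviews below are made up for illustration.

```python
import pandas as pd

def cluster_counts(docs, word_to_cluster, n_clusters):
    """Count, for each document, how many of its words belong to each cluster."""
    rows = []
    for doc in docs:
        counts = [0] * n_clusters
        for word in doc.split(" "):
            label = word_to_cluster.get(word)   # None for words without a cluster
            if label is not None:
                counts[label] += 1
        rows.append(counts)
    return pd.DataFrame(rows, columns=[str(i) for i in range(n_clusters)])

# Hypothetical mapping from word to cluster label.
toy_word_to_cluster = {"libro": 0, "novela": 0, "cocina": 1, "receta": 1}
toy_docs = ["libro novela libro", "cocina receta xyz"]
toy_counts = cluster_counts(toy_docs, toy_word_to_cluster, n_clusters=3)
print(toy_counts)
```

Words that have no vector, and therefore no cluster, are ignored, as in the loop above.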


## Classifier

The classifier that maps each row's cluster counts to a topic should be as simple as possible. We use Logistic Regression.

In [68]:
clf = OneVsRestClassifier(LogisticRegression(
    max_iter=10_000,
    random_state=555
))
clf.fit(X_new, y)
y_pred = clf.predict(X_new)
print(f"Training Confusion Matrix: {confusion_matrix(y, y_pred)} \n")
print(f"Training f1 Score:")
print(f1_score(y, y_pred, average="weighted"))

Training Confusion Matrix: [[ 43   8   0   0   0   0   0   0   0  61   1   0   0]
 [  5  65   0   2   0   0   0   2   0 109   4   2   0]
 [  0   0  25   3   0   1   0   0   1  53   1   0   0]
 [  3   6   0 153   3   3   0   2   0 253  13   1   1]
 [  0   3   1  19  40   0   0   6   0 157   0   1   2]
 [  0   0   2  14   2  16   0   1   0  41   0   0   0]
 [  0   0   0   0   0   0  39   0   0   0   0   0   0]
 [  0   1   0   3   6   0   0  27   0  30   1   1   0]
 [  0   0   0   0   0   0   0   0  14   4   0   0   0]
 [ 12  11   9  56  19   2   2   7   1 980   8   5   1]
 [  3  17   0  23   1   0   0   1   0  62  46   2   2]
 [  3   4   0   2   0   0   0   0   0  60   0  24   0]
 [  0   0   0   5   0   1   0   0   0  17   1   0  21]] 

Training f1 Score:
0.522565289069367


# Things to be done

A more elaborate scraping step could gather an excerpt of each book instead of a review submitted by a user on the website. Some reviews are completely corrupted, and some books have no review at all. We will stick to the numeric CatBoost Classifier, with final scores of: