### Imports

In [17]:
from sentence_transformers import SentenceTransformer, util  # SentenceBERT
import csv

### Dataset

The **Quora Duplicate Questions dataset** is used for this project.
- It contains over 400,000 question pairs.
- The dataset can be obtained from: [First Quora Dataset Release: Question Pairs](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs)
- Each pair may or may not be a duplicate, that is, two questions that ask the same thing with different wording. An attribute in the dataset indicates whether each pair is a duplicate.
- The questions are predominantly in English. Queries in other languages can still be matched because the embedding model used below is multilingual.

For simplicity, this application will use only a subset of the full dataset.

In [None]:
corpus_sentences = set()
dataset_path = "./datasets/quoraDuplicateQuestionsReduced.tsv"
with open(dataset_path, encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        corpus_sentences.add(row['question1'])
        corpus_sentences.add(row['question2'])
corpus_sentences = list(corpus_sentences)
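
Because a `set` of strings has no guaranteed iteration order, the position of each question in `corpus_sentences` (and therefore each `corpus_id`) can change between runs. A minimal sketch of an order-preserving alternative is shown below, assuming the same `question1`/`question2` columns. The sample rows, the `is_duplicate` column, and the helper name `load_unique_questions` are made up for illustration.

```python
import csv
import io

# Hypothetical in-memory sample standing in for the TSV file.
# Only the question1/question2 column names come from the notebook.
sample_tsv = (
    "question1\tquestion2\tis_duplicate\n"
    "How do I learn Python?\tWhat is the best way to learn Python?\t1\n"
    "What is data science?\tHow do I learn Python?\t0\n"
)

def load_unique_questions(file_obj):
    """Collect unique questions while keeping first-seen order."""
    reader = csv.DictReader(file_obj, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
    seen = {}  # dicts preserve insertion order (Python 3.7+)
    for row in reader:
        seen.setdefault(row["question1"], None)
        seen.setdefault(row["question2"], None)
    return list(seen)

questions = load_unique_questions(io.StringIO(sample_tsv))
print(questions)
```

With a stable order, saved embeddings can be reloaded later and still line up with the same sentence list.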

### Embeddings

In [None]:
model = SentenceTransformer('quora-distilbert-multilingual')
print("Encoding the data. This may take a while...")
corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_tensor=True)
print("Data encoded. Total of {} sentences/embeddings".format(len(corpus_sentences)))

Encoding the data. This may take a while...


Batches: 100%|██████████| 2264/2264 [05:22<00:00,  7.02it/s]


Data encoded. Total of 72423 sentences/embeddings


In [None]:
print("Embedding dimension: ", len(corpus_embeddings[0]), "\n")
print("Embedding of the first sentence: ", corpus_embeddings[0], "\n")

Embedding dimension:  768 

Embedding of the first sentence:  tensor([ 1.8629e-01,  1.1262e-01,  5.4526e-01,  3.2218e-01, -6.4382e-02,
         1.7457e-01, -4.5411e-01, -2.1163e-01, -2.0066e-01, -2.3809e-01,
        -6.9773e-02,  3.5057e-01,  1.8053e-01, -6.4607e-02,  2.4889e-05,
         2.9427e-01, -1.2328e-01, -1.1485e-01,  2.4043e-02, -4.7279e-01,
        -3.4201e-01,  2.7796e-01,  3.2721e-01, -7.6693e-02,  1.1853e-01,
         4.1939e-01,  4.5358e-01, -3.4031e-01, -1.3234e-01, -8.4394e-02,
        -1.0700e-01,  3.3024e-02,  2.6077e-01, -2.7364e-01,  3.5028e-01,
        -3.0256e-02, -9.4805e-02, -1.1167e-01,  2.7996e-01, -1.6700e-01,
         4.3810e-01, -9.9588e-03,  1.7853e-01, -5.2103e-02, -2.4134e-01,
         2.8020e-01, -3.2572e-01,  2.3961e-01, -2.4937e-01,  2.2240e-02,
        -2.3590e-01, -1.9295e-01,  8.6322e-02, -5.3046e-03, -3.3325e-01,
        -3.9985e-01,  5.0177e-01,  5.3615e-01, -8.0466e-02,  7.0358e-02,
         3.0628e-01, -3.7138e-01,  5.4547e-02,  6.6831e-01,

### Search Function

Given a query, the search function described below queries the corpus and prints a ranked list of the top k results (where k=5 in the example).

The `util.semantic_search` function from SBERT implements an optimized search.
-   Documentation: [sentence_transformers.util.semantic_search](https://www.sbert.net/docs/package_reference/util.html#sentence_transformers.util.semantic_search)
-   This function performs the search by calculating the Cosine Similarity between the embeddings of the input query (or queries) and the embeddings of the documents in the corpus.
-   It is suitable for Information Retrieval / Semantic Search tasks on corpora containing up to one million entries.
-   For larger corpora (beyond one million entries), an Approximate Nearest Neighbor (ANN) search approach is recommended.
-   A popular library for ANN search is FAISS: [FAISS GitHub](https://github.com/facebookresearch/faiss)
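
The cosine-similarity ranking described above can be sketched in plain Python. This only illustrates the computation and is not the implementation of `util.semantic_search`, which operates on tensors. The names `cosine_similarity` and `top_k_search` and the toy vectors are hypothetical.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_search(query_embedding, corpus_embeddings, top_k=5):
    """Score every corpus vector against the query and keep the best top_k."""
    scored = (
        {"corpus_id": idx, "score": cosine_similarity(query_embedding, emb)}
        for idx, emb in enumerate(corpus_embeddings)
    )
    # nlargest returns the hits sorted by decreasing score
    return heapq.nlargest(top_k, scored, key=lambda hit: hit["score"])

# Toy 2-dimensional "embeddings" (real ones here have 768 dimensions)
toy_corpus = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
hits = top_k_search([1.0, 0.1], toy_corpus, top_k=2)
for hit in hits:
    print(hit["corpus_id"], round(hit["score"], 3))
```

This exhaustive scan compares the query with every corpus entry, which is why approximate methods become attractive as the corpus grows.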

In [None]:
def search(question, top_k=5):
    # Encode the query with the same model used for the corpus
    question_embedding = model.encode(question, convert_to_tensor=True)
    # semantic_search returns one ranked hit list per query; only one query is passed here
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]

    print("Query:", question)
    print("Results:")
    for hit in hits:
        print("\t{:.3f}\t{}".format(hit['score'], corpus_sentences[hit['corpus_id']]))

### Search Tests

In [22]:
search("How can I learn Python online?")

Query: How can I learn Python online?
Results:
	0.980	What's the best way to learn Python?
	0.980	How do I learn Python in easy way?
	0.980	What can I do if I learn Python?
	0.979	How do I learn Python systematically?
	0.979	Where should I start at to learn about how to do Python?


In [23]:
search("Como eu posso aprender Python online?")

Query: Como eu posso aprender Python online?
Results:
	0.980	How do I learn Python in easy way?
	0.980	What's the best way to learn Python?
	0.980	What can I do if I learn Python?
	0.980	How do I learn Python systematically?
	0.979	How do I learn Python?


In [None]:
search("How can I live a happier life?")
search("Como escolher um bom vinho?")
search("Best practices for data science")

Query: How can I live a happier life?
Results:
	0.972	How do we live a happy life?
	0.965	What is the best way to live a happy and successful life?
	0.961	Life Advice: How can I make my life simpler?
	0.959	What is the best advice for a happy life?
	0.956	How can I make my life better?
Query: Como escolher um bom vinho?
Results:
	0.946	What is wine made from?
	0.945	What type of grapes are used to make wine?
	0.928	How wine is good for health?
	0.927	Do wine grapes make for good eating?
	0.925	What is a good white wine sweetness scale?
Query: Best practices for data science
Results:
	0.951	What is the best way to get started with data science?
	0.939	How do I get started in data science?
	0.936	What is data science
	0.933	How do I learn Data Science by “doing it”?
	0.928	What is actually a data science?


----------------

## Text Classification using SentenceBERT Model Embeddings

This section covers the implementation of text classification using embeddings generated by the SentenceBERT model as input to a classifier.

### Packages

In [28]:
# Data processing
import pandas as pd
import numpy as np
# Machine learning
from sentence_transformers import SentenceTransformer  # SentenceBERT
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Date and time
from datetime import datetime

### Dataset

The e-commerce dataset contains the category and description of each product. There are 4 product categories: "Electronics", "Household", "Books" and "Clothing & Accessories".<br>
The complete dataset is available at:<br>
https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification<br>
In this application, we use a reduced version of the dataset.

In [None]:
df = pd.read_csv("./datasets/ecommerceDatasetReduced.csv")
df = df[["category", "product"]]

df.head()

|   | category | product |
|---|----------|---------|
| 0 | Household | K London Multicolor Men's Wallet A Bi-Fold Wal... |
| 1 | Household | VAPOK Neon Plastic and Microfiber Duster(Multi... |
| 2 | Household | Sehaz Artworks Tree Bird Round Wood Wall Clock... |
| 3 | Clothing & Accessories | W for Woman Women's Cape Gilet |
| 4 | Household | Usha 3732 300-Watt Hand Mixer with 2 Hooks (Bl... |


### EDA

In [30]:
print('Total number of products: {}'.format(len(df)))
print(40*'-')
print('Breakdown by category:')
print(df["category"].value_counts())
print(40*'-')
nr_categories = len(df["category"].unique())
print("Number of categories: {n}".format(n=nr_categories))

Total number of products: 5000
----------------------------------------
Breakdown by category:
category
Household                 1901
Books                     1204
Electronics               1015
Clothing & Accessories     880
Name: count, dtype: int64
----------------------------------------
Number of categories: 4


In [None]:
n=90
print('Category: ',df['category'][n])
print(100*'-')
print('Product:')
print(df['product'][n])

Category:  Household
----------------------------------------------------------------------------------------------------
Product:
Forzza Zoey Laptop Table (Walnut with Black Frame) Ergonomically designed foldable laptop table is made with 12mm MDF with melamine. The legs are made in black powder coated metal. This table is easily foldable. It is ideal to place your laptop and work while in bed or on your couch. It could also be used for placing your cup of coffee. It is easily stackable and can be easily wiped with a damp cloth. This well finished product will be a good investment. Available at unbelievable prices.


### Embeddings

In [None]:
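model = SentenceTransformer('all-MiniLM-L6-v2')
# Encode all descriptions in one batched call instead of one call per row
embeddings = model.encode(df["product"].tolist())
df["embedding"] = list(embeddings)  # one NumPy vector per row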



### Train/Test Split

In [None]:
X = df['embedding'].tolist()
y = df['category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=df['category'])

In [None]:
y_train.value_counts()/y.value_counts()

category
Household                 0.700158
Books                     0.700166
Electronics               0.699507
Clothing & Accessories    0.700000
Name: count, dtype: float64
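
The ratios above are all close to 0.70 because `stratify` makes the split respect the class proportions. The idea can be sketched in plain Python using the category counts from the EDA. `stratified_split` is a hypothetical helper and not scikit-learn's implementation, so the exact counts per class may differ by one from the output above.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) with roughly equal class ratios in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Label list rebuilt from the category counts printed in the EDA
labels = (["Household"] * 1901 + ["Books"] * 1204
          + ["Electronics"] * 1015 + ["Clothing & Accessories"] * 880)
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
total_counts = Counter(labels)
for category, total in total_counts.items():
    print(category, round(train_counts[category] / total, 3))
```

Without stratification, a random split could by chance leave the smallest category under-represented in the test set.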

### Classifier Training

In [None]:
model_sbert = LogisticRegression()
start_time = datetime.now()
model_sbert.fit(X_train, y_train)
end_time = datetime.now()
training_time_sbert = (end_time - start_time).total_seconds()

### Model Evaluation

In [None]:
predicted_train_sbert = model_sbert.predict(X_train)
accuracy_train_sbert = accuracy_score(y_train, predicted_train_sbert)
print('Accuracy on training data: {:.1%}'.format(accuracy_train_sbert))

predicted_test_sbert = model_sbert.predict(X_test)
accuracy_test_sbert = accuracy_score(y_test, predicted_test_sbert)
print('Accuracy on test data:     {:.1%}'.format(accuracy_test_sbert))

print('Training time: {:.1f}s'.format(training_time_sbert))

Accuracy on training data: 93.9%
Accuracy on test data:     94.1%
Training time: 0.1s
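
Since the categories are imbalanced, overall accuracy can hide weaker performance on the smaller classes. A minimal sketch of per-class recall in plain Python follows. The label lists are invented for illustration and are not the notebook's predictions, and `per_class_recall` is a hypothetical helper.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

# Invented labels, for illustration only
y_true = ["Books", "Books", "Household", "Household", "Household", "Electronics"]
y_pred = ["Books", "Household", "Household", "Household", "Household", "Electronics"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_recall(y_true, y_pred)

print("accuracy:", round(accuracy, 3))
for label, recall in recalls.items():
    print(label, round(recall, 3))
```

In the notebook itself, `sklearn.metrics.classification_report(y_test, predicted_test_sbert)` would report comparable per-class figures.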


### Test with an Artificial Neural Network Classifier

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
model_mlp_sbert = MLPClassifier()
start_time = datetime.now()
model_mlp_sbert.fit(X_train, y_train)
end_time = datetime.now()
# Separate variable names keep the logistic regression results available for comparison
training_time_mlp = (end_time - start_time).total_seconds()

In [None]:
predicted_train_mlp = model_mlp_sbert.predict(X_train)
accuracy_train_mlp = accuracy_score(y_train, predicted_train_mlp)
print('Accuracy on training data: {:.1%}'.format(accuracy_train_mlp))

predicted_test_mlp = model_mlp_sbert.predict(X_test)
accuracy_test_mlp = accuracy_score(y_test, predicted_test_mlp)
print('Accuracy on test data:     {:.1%}'.format(accuracy_test_mlp))

print('Training time: {:.1f}s'.format(training_time_mlp))

Accuracy on training data: 100.0%
Accuracy on test data:     93.3%
Training time: 3.6s
