Importing libraries

In [1]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.model_selection import train_test_split

import pandas as pd
import collections
import logging
import random

Configuring the logging level

In [2]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Reading the text data

In [3]:
frzd_olist_order_reviews = pd.read_parquet('../dataset/delivery/dlzd_olist_order_reviews.parquet.snappy')

In [40]:
frzd_olist_order_reviews.head()

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_comment_title_and_message,review_creation_date,review_answer_timestamp
0,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,[],"[parabens, lojas, lannister, adorei, comprar, ...","[parabens, lojas, lannister, adorei, comprar, ...",2018-03-01,2018-03-02 10:26:53
1,8670d52e15e00043ae7de4c01cc2fe06,b9bf720beb4ab3728760088589c62129,4,[recomendo],"[aparelho, eficiente, no, site, marca, do, apa...","[recomendo, aparelho, eficiente, no, site, mar...",2018-05-22,2018-05-23 16:45:47
2,4b49719c8a200003f700d3d986ea1a19,9d6f15f95d01e79bd1349cc208361f09,4,[],"[mas, um, pouco, travando, pelo, valor, ta, boa]","[mas, um, pouco, travando, pelo, valor, ta, boa]",2018-02-16,2018-02-20 10:52:22
3,d21bbc789670eab777d27372ab9094cc,4fc44d78867142c627497b60a7e0228a,5,[otimo],"[loja, nota]","[otimo, loja, nota]",2018-07-10,2018-07-11 14:10:25
4,0e0190b9db53b689b285d3f3916f8441,79832b7cb59ac6f887088ffd686e1d5e,5,[],"[obrigado, pela, atencao, amim, dispensada]","[obrigado, pela, atencao, amim, dispensada]",2017-12-01,2017-12-09 22:58:58


Splitting into training and test sets with stratification

In [74]:
X = frzd_olist_order_reviews.loc[:, ['review_comment_title_and_message', 'review_score']]
y = frzd_olist_order_reviews.loc[:, ['review_score']]

In [75]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=51
)
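
To illustrate what `stratify=y` preserves, here is a minimal pure-Python sketch of a stratified split. It is a simplified stand-in, not scikit-learn's implementation; `stratified_split` and the toy `labels` list are hypothetical names used only for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=51):
    """Return (train_idx, test_idx), keeping each label's share roughly equal."""
    rng = random.Random(seed)

    # Group row positions by label.
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    # Shuffle inside each group and send the same fraction of it to the test set.
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy, made-up score distribution (not the Olist data).
labels = [5] * 50 + [4] * 20 + [3] * 10 + [2] * 10 + [1] * 10
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)
print(test_counts)
```

Each score keeps the same proportion in both subsets, which matters here because the review scores are unevenly distributed.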

Creating a tag to mark each document.

In [78]:
train_documents = []

for index, data in X_train.iterrows():
    train_documents.append(TaggedDocument(data['review_comment_title_and_message'].tolist(), [str(data['review_score']) + '_star_score']))

In [105]:
test_documents = []

for index, data in X_test.iterrows():
    test_documents.append(TaggedDocument(data['review_comment_title_and_message'].tolist(), [str(data['review_score']) + '_star_score']))
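
Because the tag is the star score rather than a unique document id, every review with the same score shares a single tag, and therefore a single document vector. The sketch below shows this with a hypothetical `TaggedDoc` namedtuple standing in for gensim's `TaggedDocument` and a few made-up rows.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in with the same two fields as gensim's TaggedDocument.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Made-up rows: (tokens, review_score).
rows = [
    (["otimo", "loja", "nota"], 5),
    (["recomendo", "aparelho", "eficiente"], 4),
    (["muito", "boa"], 4),
    (["nao", "recebi", "produto"], 1),
]

# Same tagging scheme as the cells above: one tag per document, derived from the score.
documents = [TaggedDoc(tokens, [f"{score}_star_score"]) for tokens, score in rows]

# Count how many documents share each tag.
tag_counts = Counter(doc.tags[0] for doc in documents)
print(tag_counts)
```

With the real data this scheme yields five tags in total, which is why the vocabulary log reports "5 unique tags".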

Defining the Doc2Vec model

In [100]:
model = Doc2Vec(
    vector_size=400, # embedding vector size
    min_count=4, # minimum number of occurrences for a word to be included in training
    sample=10**-5, # threshold for downsampling very frequent words
    window=6, # maximum distance between the predicted word and its context words
    shrink_windows=True,
    hs=0, # no hierarchical softmax; negative sampling is used to speed up training
    negative=20, # number of random "noise" words drawn for each predicted word
    dm=1, # use the PV-DM model
    dm_concat=1, # concatenate the paragraph vector with the context word vectors
    dbow_words=1, # trains word vectors alongside doc vectors in DBOW mode (dm=0); PV-DM already trains them
    workers=8, # number of worker threads used for parallel training
    seed=51, # fixed seed for reproducibility (fully deterministic runs also need workers=1)
)

2024-09-21 10:47:17,117 : INFO : using concatenative 5200-dimensional layer1
2024-09-21 10:47:17,121 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dm/c,d400,n20,w6,mc4,s1e-05,t8>', 'datetime': '2024-09-21T10:47:17.121579', 'gensim': '4.3.3', 'python': '3.9.19 (main, May  6 2024, 20:12:36) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'created'}
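
The "5200-dimensional layer1" in the log follows from `dm_concat=1`: the tag vector and the `2 * window` context word vectors are concatenated rather than averaged. A quick check of the arithmetic, assuming gensim's default of one tag vector per document (`dm_tag_count=1`):

```python
vector_size = 400
window = 6
dm_tag_count = 1  # assumed gensim default: one tag vector per document

# One tag vector plus `window` word vectors on each side of the target word.
layer1_size = (dm_tag_count + 2 * window) * vector_size
print(layer1_size)
```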


Building the vocabulary

In [101]:
model.build_vocab(train_documents)

2024-09-21 10:47:18,725 : INFO : collecting all words and their counts
2024-09-21 10:47:18,728 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags
2024-09-21 10:47:18,780 : INFO : PROGRESS: at example #10000, processed 83266 words (1659515 words/s), 6819 word types, 5 tags
2024-09-21 10:47:18,806 : INFO : collected 9668 word types and 5 unique tags from a corpus of 18721 examples and 157562 words
2024-09-21 10:47:18,807 : INFO : Creating a fresh vocabulary
2024-09-21 10:47:18,821 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=4 retains 2657 unique words (27.48% of original 9668, drops 7011)', 'datetime': '2024-09-21T10:47:18.821824', 'gensim': '4.3.3', 'python': '3.9.19 (main, May  6 2024, 20:12:36) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'prepare_vocab'}
2024-09-21 10:47:18,822 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=4 leaves 147935 word corpus (93.89% of original 157562, dro
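
The `sample` threshold explains why the training log reports far fewer "effective words" than raw words. The sketch below assumes gensim applies the word2vec-style keep probability, with a threshold count of `sample * retained_total`; `keep_probability` is a hypothetical helper, and the word counts in the loop are made up for illustration.

```python
import math


def keep_probability(count, retained_total, sample):
    """Probability that one occurrence of a word is kept during training."""
    threshold = sample * retained_total
    p = (math.sqrt(count / threshold) + 1) * (threshold / count)
    return min(p, 1.0)  # rare words are always kept


retained_total = 147935  # words left after min_count, from the log above
sample = 1e-5

# Hypothetical corpus counts, from a rare word to a very frequent one.
for count in (4, 100, 1000, 5000):
    print(count, round(keep_probability(count, retained_total, sample), 4))
```

Under this assumption, with such a small corpus even a word that appears only four times is occasionally dropped, and frequent words are dropped most of the time.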

In [102]:
model.train(
    train_documents,
    total_examples=model.corpus_count,
    epochs=20
)

2024-09-21 10:47:19,665 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 8 workers on 2658 vocabulary and 5200 features, using sg=0 hs=0 sample=1e-05 negative=20 window=6 shrink_windows=True', 'datetime': '2024-09-21T10:47:19.665967', 'gensim': '4.3.3', 'python': '3.9.19 (main, May  6 2024, 20:12:36) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'train'}
2024-09-21 10:47:20,819 : INFO : EPOCH 0 - PROGRESS: at 57.68% examples, 19799 words/s, in_qsize 7, out_qsize 1
2024-09-21 10:47:20,870 : INFO : EPOCH 0: training on 157562 raw words (38640 effective words) took 1.2s, 32960 effective words/s
2024-09-21 10:47:21,812 : INFO : EPOCH 1: training on 157562 raw words (38565 effective words) took 0.9s, 42231 effective words/s
2024-09-21 10:47:22,709 : INFO : EPOCH 2: training on 157562 raw words (38469 effective words) took 0.9s, 44032 effective words/s
2024-09-21 10:47:23,591 : INFO : EPOCH 3: training on 157562 raw words (38390 effective words)

Sanity check: computing each document's similarity against the whole set

In [106]:
train_ranks = []

for document in train_documents:
    inferred_vector = model.infer_vector(document.words)
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    rank = [docid for docid, sim in sims].index(document.tags[0])
    train_ranks.append(rank)

In [107]:
collections.Counter(train_ranks)

Counter({0: 11462, 1: 3440, 2: 1638, 3: 1147, 4: 1034})

11462 of the 18721 training reviews (61.2%) rank their own star-score tag as the most similar, which is a good sign

In [111]:
test_ranks = []

for document in test_documents:
    inferred_vector = model.infer_vector(document.words)
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    rank = [docid for docid, sim in sims].index(document.tags[0])
    test_ranks.append(rank)

In [112]:
collections.Counter(test_ranks)

Counter({0: 2505, 1: 793, 2: 570, 3: 432, 4: 381})

2505 of the 4681 test reviews (53.5%) rank their own star-score tag as the most similar, which is a good sign
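
The shares can be computed directly from the two `Counter` outputs above. This is a small sketch; `summarize` is a hypothetical helper, and the counts are copied from the cell outputs.

```python
from collections import Counter

# Rank counts copied from the notebook outputs above.
train_rank_counts = Counter({0: 11462, 1: 3440, 2: 1638, 3: 1147, 4: 1034})
test_rank_counts = Counter({0: 2505, 1: 793, 2: 570, 3: 432, 4: 381})


def summarize(rank_counts):
    """Return (total documents, share ranked first, mean rank)."""
    total = sum(rank_counts.values())
    top1_share = rank_counts[0] / total
    mean_rank = sum(rank * n for rank, n in rank_counts.items()) / total
    return total, top1_share, mean_rank


for name, counts in (("train", train_rank_counts), ("test", test_rank_counts)):
    total, top1_share, mean_rank = summarize(counts)
    print(f"{name}: n={total}, rank-0 share={top1_share:.1%}, mean rank={mean_rank:.2f}")
```

As a point of comparison, with five tags a uniformly random ranking would put the correct tag first about 20% of the time.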

In [72]:
doc_id = 55

inferred_vector = model.infer_vector(train_documents[doc_id].words)
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

print('Document ({}) | tag ({}): «{}»\n'.format(doc_id, train_documents[doc_id].tags[0], ' '.join(train_documents[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
# for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
#     # print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_documents[sims[index][0]].words)))
#     print()
sims

Document (55) | tag (4_star_score): «muito boa»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec<dm/c,d300,n20,w5,mc2,s1e-06,t8>:



[('3_star_score', 0.08512957394123077),
 ('1_star_score', 0.0543084517121315),
 ('2_star_score', 0.020415732637047768),
 ('5_star_score', -0.06042415648698807),
 ('4_star_score', -0.09239460527896881)]
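
The scores returned by `most_similar` are cosine similarities between the inferred vector and each tag vector. A minimal pure-Python sketch of that ranking, using made-up 3-dimensional vectors (`cosine_similarity`, `inferred`, and `tag_vectors` are illustrative names, not part of gensim):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors, purely for illustration.
inferred = [0.2, -0.1, 0.4]
tag_vectors = {
    "5_star_score": [0.3, -0.2, 0.5],
    "1_star_score": [-0.4, 0.1, -0.3],
}

# Rank tags from most to least similar, as most_similar does.
sims = sorted(
    ((tag, cosine_similarity(inferred, vec)) for tag, vec in tag_vectors.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(sims)
```

Values close to 0, like those in the output above, mean the inferred vector is nearly orthogonal to every tag vector, so the ranking for such a short document carries little signal.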

In [135]:
semantic_query = 'estou com problemas com meu numero da claro'

inferred_vector = model.infer_vector(semantic_query.split(' '))
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

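# Index 0 is the most similar entry.
# Note: looking up train_documents by sims[index][0] assumes integer document tags.
index = 0
print(u'%s %s: «%s»\n' % ('MOST', sims[index], ' '.join(train_documents[sims[index][0]].words)))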

MOST (5807, -0.22642099857330322): «suporta ate»



Testing the model with the test set

In [137]:
# Pick a random document from the test set
doc_id = random.randint(0, len(test_documents) - 1)
inferred_vector = model.infer_vector(test_documents[doc_id].words)
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

# Compare this document against the training set
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_documents[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_documents[sims[index][0]].words)))

Test Document (3221): «otimo»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec<dm/c,d300,n20,w5,mc5,s1e-05,t8>:

MOST (8283, 0.22291600704193115): «muito rui eles nao ciporta com criente»

MEDIAN (12816, 0.0006949880626052618): «nota»

LEAST (13287, -0.24041034281253815): «toner ja esta instalado funcionando perfeitamente»



Saving the Doc2Vec model

In [27]:
model.save('../model/doc2vec_telecom_pandemic_claims')

2024-09-19 19:02:23,668 : INFO : Doc2Vec lifecycle event {'fname_or_handle': '../model/doc2vec_telecom_pandemic_claims', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2024-09-19T19:02:23.667075', 'gensim': '4.3.3', 'python': '3.9.19 (main, May  6 2024, 20:12:36) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'saving'}
2024-09-19 19:02:23,672 : INFO : storing np array 'syn1neg' to ../model/doc2vec_telecom_pandemic_claims.syn1neg.npy
2024-09-19 19:02:23,999 : INFO : not storing attribute cum_table
2024-09-19 19:02:24,126 : INFO : saved ../model/doc2vec_telecom_pandemic_claims
