Importing libraries

In [2]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.model_selection import train_test_split

import pandas as pd
import collections
import logging
import random

Configuring the log level

In [3]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Reading the review data

In [4]:
frzd_olist_order_reviews = pd.read_parquet('../dataset/delivery/dlzd_olist_order_reviews.parquet.snappy')

In [5]:
frzd_olist_order_reviews.head()

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_comment_title_and_message,review_creation_date,review_answer_timestamp
0,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,[],"[parabens, lojas, lannister, adorei, comprar, ...","[parabens, lojas, lannister, adorei, comprar, ...",2018-03-01,2018-03-02 10:26:53
1,8670d52e15e00043ae7de4c01cc2fe06,b9bf720beb4ab3728760088589c62129,4,[recomendo],"[aparelho, eficiente, no, site, marca, do, apa...","[recomendo, aparelho, eficiente, no, site, mar...",2018-05-22,2018-05-23 16:45:47
2,4b49719c8a200003f700d3d986ea1a19,9d6f15f95d01e79bd1349cc208361f09,4,[],"[mas, um, pouco, travando, pelo, valor, ta, boa]","[mas, um, pouco, travando, pelo, valor, ta, boa]",2018-02-16,2018-02-20 10:52:22
3,d21bbc789670eab777d27372ab9094cc,4fc44d78867142c627497b60a7e0228a,5,[otimo],"[loja, nota]","[otimo, loja, nota]",2018-07-10,2018-07-11 14:10:25
4,0e0190b9db53b689b285d3f3916f8441,79832b7cb59ac6f887088ffd686e1d5e,5,[],"[obrigado, pela, atencao, amim, dispensada]","[obrigado, pela, atencao, amim, dispensada]",2017-12-01,2017-12-09 22:58:58


Splitting into training and test sets with stratification

In [6]:
X = frzd_olist_order_reviews.loc[:, ['review_comment_title_and_message', 'review_score']]
y = frzd_olist_order_reviews.loc[:, ['review_score']]

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=51
)
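
To make the effect of `stratify=y` concrete, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation; the helper `stratified_split` and the toy labels are hypothetical and only illustrate that each score keeps roughly the same share in both sets.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=51):
    """Return (train_idx, test_idx) with similar label shares in both parts."""
    # Group row positions by label so each class is split on its own.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical, imbalanced toy labels standing in for review_score.
toy_scores = [5] * 60 + [4] * 20 + [1] * 20
train_idx, test_idx = stratified_split(toy_scores)

train_share = Counter(toy_scores[i] for i in train_idx)
test_share = Counter(toy_scores[i] for i in test_idx)
print(train_share, test_share)
```

With an imbalanced target such as `review_score`, a plain random split could leave the rarer scores under-represented in the test set; splitting per class avoids that.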

Creating a tag to label each document.

In [9]:
train_documents = []

for index, data in X_train.iterrows():
    train_documents.append(TaggedDocument(data['review_comment_title_and_message'].tolist(), [str(data['review_score']) + '_star_score']))

In [10]:
test_documents = []

for index, data in X_test.iterrows():
    test_documents.append(TaggedDocument(data['review_comment_title_and_message'].tolist(), [str(data['review_score']) + '_star_score']))
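
Because every document is tagged with its star score, the model learns one vector per score rather than one per review (the vocabulary log reports 5 unique tags). The sketch below contrasts that choice with per-document tags. `TaggedDoc` is a stand-in namedtuple used only so the example runs without gensim, and the toy reviews and both helper functions are hypothetical.

```python
from collections import namedtuple

# Stand-in with the same two fields as a tagged document (assumption for this sketch).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Hypothetical tokenized reviews paired with a star score.
toy_reviews = [
    (["otimo", "loja", "nota"], 5),
    (["recomendo"], 4),
    (["muito", "boa"], 4),
    (["nao", "recebi"], 1),
]


def tag_by_score(tokens, score):
    """Tagging used in this notebook: one shared tag per star score."""
    return TaggedDoc(list(tokens), [f"{score}_star_score"])


def tag_by_index(tokens, index):
    """Alternative (not used here): one unique integer tag per document."""
    return TaggedDoc(list(tokens), [index])


by_score = [tag_by_score(tokens, score) for tokens, score in toy_reviews]
by_index = [tag_by_index(tokens, i) for i, (tokens, _) in enumerate(toy_reviews)]

unique_score_tags = {tag for doc in by_score for tag in doc.tags}
unique_index_tags = {tag for doc in by_index for tag in doc.tags}
print(len(unique_score_tags), len(unique_index_tags))
```

With score tags, a similarity query returns score labels; with integer tags it returns positions that can be used to look documents up in the training list.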

Defining the Doc2Vec model

In [334]:
model = Doc2Vec(
    vector_size=400, # size of the embedding vector
    min_count=3, # minimum number of occurrences for a word to enter training
    sample=10**-5, # threshold above which very frequent words are randomly downsampled
    window=7, # maximum distance between the predicted word and its context words
    shrink_windows=True,
    hs=0, # no hierarchical softmax; negative sampling is used to speed up training
    negative=5, # 5 random "noise" words are drawn for each predicted word
    dm=1, # use the PV-DM model
    dm_concat=1, # concatenate the paragraph vector with the context word vectors
    dbow_words=1, # also train word vectors (this flag applies to PV-DBOW, dm=0; PV-DM trains them regardless)
    workers=8, # number of worker threads used for parallelism
    seed=51, # fixed seed for reproducibility (fully deterministic only with workers=1)
)

2024-09-21 20:44:40,599 : INFO : using concatenative 6000-dimensional layer1
2024-09-21 20:44:40,601 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dm/c,d400,n5,w7,mc3,s1e-05,t8>', 'datetime': '2024-09-21T20:44:40.601207', 'gensim': '4.3.3', 'python': '3.9.19 (main, May  6 2024, 20:12:36) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'created'}
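
The 6000-dimensional layer reported in the log follows from `dm_concat=1`: the document vector and the vectors of the 2 × `window` context words are concatenated. A minimal arithmetic sketch, assuming one tag per document (the function name is hypothetical):

```python
def concat_layer1_size(vector_size, window, doc_tags_per_example=1):
    """Input layer width when the doc vector and context word vectors are concatenated."""
    # One slot per document tag plus one slot for each context word on both sides.
    return (doc_tags_per_example + 2 * window) * vector_size


layer1 = concat_layer1_size(vector_size=400, window=7)
print(layer1)
```

This is why concatenation makes the model much larger than averaging: the input layer grows linearly with the window size.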


Building the vocabulary

In [335]:
model.build_vocab(train_documents)

2024-09-21 20:44:42,257 : INFO : collecting all words and their counts
2024-09-21 20:44:42,261 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags
2024-09-21 20:44:42,348 : INFO : PROGRESS: at example #10000, processed 83266 words (1014113 words/s), 6819 word types, 5 tags
2024-09-21 20:44:42,383 : INFO : collected 9668 word types and 5 unique tags from a corpus of 18721 examples and 157562 words
2024-09-21 20:44:42,383 : INFO : Creating a fresh vocabulary
2024-09-21 20:44:42,404 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=3 retains 3297 unique words (34.10% of original 9668, drops 6371)', 'datetime': '2024-09-21T20:44:42.404599', 'gensim': '4.3.3', 'python': '3.9.19 (main, May  6 2024, 20:12:36) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'prepare_vocab'}
2024-09-21 20:44:42,406 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=3 leaves 149855 word corpus (95.11% of original 157562, dro

In [336]:
model.train(
    train_documents,
    total_examples=model.corpus_count,
    epochs=70
)

2024-09-21 20:44:42,786 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 8 workers on 3298 vocabulary and 6000 features, using sg=0 hs=0 sample=1e-05 negative=5 window=7 shrink_windows=True', 'datetime': '2024-09-21T20:44:42.786958', 'gensim': '4.3.3', 'python': '3.9.19 (main, May  6 2024, 20:12:36) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'train'}
2024-09-21 20:44:43,747 : INFO : EPOCH 0: training on 157562 raw words (40637 effective words) took 0.9s, 44565 effective words/s
2024-09-21 20:44:44,625 : INFO : EPOCH 1: training on 157562 raw words (40673 effective words) took 0.8s, 48665 effective words/s
2024-09-21 20:44:45,530 : INFO : EPOCH 2: training on 157562 raw words (40794 effective words) took 0.9s, 46281 effective words/s
2024-09-21 20:44:46,349 : INFO : EPOCH 3: training on 157562 raw words (40620 effective words) took 0.8s, 51195 effective words/s
2024-09-21 20:44:47,135 : INFO : EPOCH 4: training on 157562 raw words (40645
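
Each epoch trains on roughly 40,600 effective words out of 157,562 raw words, about a quarter. Most of the reduction comes from `sample=1e-5`, which randomly drops frequent words. The sketch below assumes gensim applies the word2vec-style keep probability shown in the comments; treat the formula and the example word counts as assumptions rather than a statement of the library's internals.

```python
import math


def keep_probability(word_count, retained_total, sample):
    """Chance that one occurrence of a word survives downsampling (assumed rule)."""
    # Words whose count is near or below this threshold are always kept.
    threshold = sample * retained_total
    prob = (math.sqrt(word_count / threshold) + 1) * (threshold / word_count)
    return min(prob, 1.0)


RETAINED_TOTAL = 149855  # retained corpus size taken from the build_vocab log
SAMPLE = 1e-5            # same value passed to Doc2Vec above

# Hypothetical word counts: rare, moderately frequent, very frequent.
for count in (3, 100, 5000):
    print(count, round(keep_probability(count, RETAINED_TOTAL, SAMPLE), 4))
```

On a corpus this small, such an aggressive threshold discards most occurrences of common words, so a larger `sample` value may be worth testing.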

Sanity check: ranking each document's own score tag by similarity to its inferred vector

In [337]:
train_ranks = []

for document in train_documents:
    inferred_vector = model.infer_vector(document.words)
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    rank = [docid for docid, sim in sims].index(document.tags[0])
    train_ranks.append(rank)

In [338]:
collections.Counter(train_ranks)

Counter({0: 12932, 1: 3471, 2: 1259, 3: 653, 4: 406})

12,932 of 18,721 training reviews (69.1%) are most similar to their own score tag, which is a good sign
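
The rank counts can be turned into summary figures directly. This sketch copies the training `Counter` printed above and computes the share of documents whose own tag ranks first, plus the mean rank; the helper name `summarize_ranks` is hypothetical.

```python
from collections import Counter

# Rank counts copied from the training-set output above.
train_rank_counts = Counter({0: 12932, 1: 3471, 2: 1259, 3: 653, 4: 406})


def summarize_ranks(rank_counts):
    """Return (total documents, share ranked first, mean rank)."""
    total = sum(rank_counts.values())
    top1_share = rank_counts[0] / total
    mean_rank = sum(rank * count for rank, count in rank_counts.items()) / total
    return total, top1_share, mean_rank


total, top1_share, mean_rank = summarize_ranks(train_rank_counts)
print(f"{rank_counts_label}: {top1_share:.1%} ranked first, mean rank {mean_rank:.2f}"
      if (rank_counts_label := f"{total} documents") else "")
```

The same function applies unchanged to the test-set counter, which makes the two sets easy to compare.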

In [339]:
test_ranks = []

for document in test_documents:
    inferred_vector = model.infer_vector(document.words)
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    rank = [docid for docid, sim in sims].index(document.tags[0])
    test_ranks.append(rank)

In [340]:
collections.Counter(test_ranks)

Counter({0: 2736, 1: 812, 2: 470, 3: 365, 4: 298})

2,736 of 4,681 test reviews (58.4%) are most similar to their own score tag, which is a good sign

In [326]:
doc_id = 55

inferred_vector = model.infer_vector(train_documents[doc_id].words)
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

print('Document ({}) | tag ({}): «{}»\n'.format(doc_id, train_documents[doc_id].tags[0], ' '.join(train_documents[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
# for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
#     # print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_documents[sims[index][0]].words)))
#     print()
sims

Document (55) | tag (4_star_score): «muito boa»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec<dm/c,d400,n5,hs,w7,mc3,s1e-05,t8>:



[('4_star_score', 0.6903237104415894),
 ('5_star_score', 0.5428512096405029),
 ('3_star_score', 0.28243574500083923),
 ('2_star_score', -0.06443776190280914),
 ('1_star_score', -0.2582360506057739)]
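
The scores in this list are cosine similarities between the inferred vector and each tag vector, sorted in descending order. A minimal sketch of that ranking with hypothetical 2-dimensional vectors (the real vectors have 400 dimensions):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank_tags(query, tag_vectors):
    """Return (tag, similarity) pairs, most similar first."""
    sims = [(tag, cosine_similarity(query, vec)) for tag, vec in tag_vectors.items()]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)


# Hypothetical tag vectors and query vector, chosen only for illustration.
toy_tag_vectors = {
    "5_star_score": [0.9, 0.1],
    "3_star_score": [0.5, 0.5],
    "1_star_score": [-0.8, 0.3],
}
toy_query = [1.0, 0.2]

ranking = rank_tags(toy_query, toy_tag_vectors)
print(ranking)
```

A negative similarity, as for the 1- and 2-star tags above, means the vectors point in broadly opposite directions.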

In [135]:
semantic_query = 'estou com problemas com meu numero da claro'

inferred_vector = model.infer_vector(semantic_query.split(' '))
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

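# `index` was left over from a loop in another cell; select the top match explicitly.
# The lookup into train_documents assumes integer per-document tags, as in the run that produced the output below.
index = 0
print(u'%s %s: «%s»\n' % ('MOST', sims[index], ' '.join(train_documents[sims[index][0]].words)))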

MOST (5807, -0.22642099857330322): «suporta ate»



Testing the model on the test set

In [137]:
# Pick a random document from the test set
doc_id = random.randint(0, len(test_documents) - 1)
inferred_vector = model.infer_vector(test_documents[doc_id].words)
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

# Compare this document against the training set
# (indexing train_documents by sims[index][0] assumes integer per-document tags,
# as in the run that produced the output below)
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_documents[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_documents[sims[index][0]].words)))

Test Document (3221): «otimo»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec<dm/c,d300,n20,w5,mc5,s1e-05,t8>:

MOST (8283, 0.22291600704193115): «muito rui eles nao ciporta com criente»

MEDIAN (12816, 0.0006949880626052618): «nota»

LEAST (13287, -0.24041034281253815): «toner ja esta instalado funcionando perfeitamente»



Saving the Doc2Vec model

In [27]:
model.save('../model/doc2vec_telecom_pandemic_claims')

2024-09-19 19:02:23,668 : INFO : Doc2Vec lifecycle event {'fname_or_handle': '../model/doc2vec_telecom_pandemic_claims', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2024-09-19T19:02:23.667075', 'gensim': '4.3.3', 'python': '3.9.19 (main, May  6 2024, 20:12:36) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'saving'}
2024-09-19 19:02:23,672 : INFO : storing np array 'syn1neg' to ../model/doc2vec_telecom_pandemic_claims.syn1neg.npy
2024-09-19 19:02:23,999 : INFO : not storing attribute cum_table
2024-09-19 19:02:24,126 : INFO : saved ../model/doc2vec_telecom_pandemic_claims
