The goal is to build a news classifier using the provided dataset, for which a word2vec representation will be created.

# Building a custom **Word2Vec** representation

**spaCy** is a library designed specifically for production use; it helps build applications that process large volumes of text. [Documentation](https://spacy.io/)


In this study we will:

 - configure the model
 - build the vocabulary from the corpus
 - train the Word2Vec representation

In [None]:
import pandas as pd
import spacy

In [None]:
dados_treino = pd.read_csv("/content/drive/MyDrive/word2ver/treino.csv")
dados_treino.sample(5)

Unnamed: 0,title,text,date,category,subcategory,link
10004,Traficantes controlam quadrilhas de presídios ...,"Do interior de presídios federais, os trafican...",2015-06-16,cotidiano,,http://www1.folha.uol.com.br/cotidiano/2015/06...
67731,Gabriel não deve reforçar o Santos contra o Sã...,O atacante Gabriel voltou a treinar no Santos ...,2015-02-13,esporte,,http://www1.folha.uol.com.br/esporte/2015/02/1...
86552,Força-tarefa interdita frigorífico no Paraná p...,Uma força-tarefa de órgãos do trabalho interdi...,2015-05-14,mercado,,http://www1.folha.uol.com.br/mercado/2015/05/1...
84043,"DAO, o projeto que quer mudar o mundo",Aposto que a maioria dos leitores nunca ouviu ...,2016-05-23,colunas,ronaldolemos,http://www1.folha.uol.com.br/colunas/ronaldole...
24320,Câmara aprova projeto que prevê PPP em termina...,Vereadores de São Paulo aprovaram na noite des...,2015-05-13,cotidiano,,http://www1.folha.uol.com.br/cotidiano/2015/05...


In [None]:
#!python -m spacy download pt_core_news_sm

✔ Download and installation successful
You can now load the model via spacy.load('pt_core_news_sm')


In [None]:
# create the nlp object
nlp = spacy.load("pt_core_news_sm")

## Preprocessing

- Remove stop words, accents, numbers...

### Example

In [None]:
texto = "Rio de Janeiro é uma cidade maravilhosa"
doc = nlp(texto)  # turns the string into a Doc (a sequence of tokens)

In [None]:
doc

Rio de Janeiro é uma cidade maravilhosa

In [None]:
type(doc)

spacy.tokens.doc.Doc

In [None]:
doc[1].is_stop

True

In [None]:
doc[2]

Janeiro

### Text cleaning

In [None]:
textos_para_tratamento = (titulos.lower() for titulos in dados_treino["title"])  # a generator expression, so the lowercased titles are produced lazily

Remove stop words and non-alphabetic characters, returning only titles with more than 2 words.

In [None]:
def trata_textos(doc):  # takes a Doc rather than a raw string, so the token attributes are preserved
  tokens_validos = []
  for token in doc:
    e_valido = not token.is_stop and token.is_alpha  # checks whether the token is valid
    if e_valido:
      tokens_validos.append(token.text)

  if len(tokens_validos) > 2:  # only sentences with more than two words
    return " ".join(tokens_validos)


texto = "Rio de Janeiro 1212122 c****é uma cidade maravilhosa!"  # test
doc = nlp(texto)
trata_textos(doc)


'Rio Janeiro cidade maravilhosa'

In [None]:
from time import time

t0 = time()
textos_tratados = [trata_textos(doc) for doc in nlp.pipe(textos_para_tratamento,
                                                        batch_size = 1000,
                                                        n_process = -1)]

tf = time() - t0

print(tf/60)                             

3.100439675649007


In [None]:
titulos_tratados = pd.DataFrame({"titulo": textos_tratados})
titulos_tratados.head()

Unnamed: 0,titulo
0,polêmica marine le pen abomina negacionistas h...
1,macron e le pen a o turno frança revés siglas ...
2,apesar larga vitória legislativas macron terá ...
3,governo antecipa balanço e alckmin anuncia que...
4,queda maio a atividade econômica sobe junho bc


# Model configuration

Hyperparameters are the parameters that configure how the model will be trained, which is why they are set before the training phase.

In [None]:
from gensim.models import Word2Vec

w2v_modelo = Word2Vec(sg = 0,
                      window = 2,
                      size = 300,
                      min_count = 5,
                      alpha = 0.03,
                      min_alpha = 0.007)


**Hyperparameters**:
- sg = training architecture:
  - 1 = skip-gram (1 for True)
  - 0 = CBOW (0 for False)
- window = maximum number of words considered before and after the target word as its context
- size = dimensionality of the word vectors (the same fixed size for every word); note that gensim 4.0 renamed this parameter to `vector_size`
- min_count = only words with a total frequency of at least min_count are kept
- alpha = initial learning rate
- min_alpha = minimum learning rate, which alpha decays to during training; it must be smaller than alpha
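
To make two of these hyperparameters concrete, here is a minimal sketch in plain Python. It is not gensim's internal code: `filter_by_min_count` and `alpha_at` are hypothetical helpers introduced only for illustration, and the second one assumes the learning rate decays linearly from `alpha` to `min_alpha` over the course of training.

```python
from collections import Counter


def filter_by_min_count(sentences, min_count):
    """Return the set of words that occur at least `min_count` times in total."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, freq in counts.items() if freq >= min_count}


def alpha_at(progress, alpha=0.03, min_alpha=0.007):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training,
    assuming a linear decay from `alpha` down to `min_alpha`."""
    return alpha - (alpha - min_alpha) * progress


# Toy corpus (made up): each title is already a list of tokens.
toy_corpus = [["rio", "janeiro", "cidade"], ["rio", "cidade"], ["rio", "praia"]]
vocab = filter_by_min_count(toy_corpus, min_count=2)
print(sorted(vocab))  # ['cidade', 'rio']

# Learning rate at the start, the middle, and the end of training.
print(alpha_at(0.0), alpha_at(0.5), alpha_at(1.0))
```

With `min_count = 5` on the real corpus, the same kind of filtering discards the rarest words before training starts.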

In [None]:
w2v_modelo

<gensim.models.word2vec.Word2Vec at 0x7f06a4aeb8d0>

# Building the vocabulary from the corpus

In [None]:
# drop empty and duplicate entries
print(len(titulos_tratados))

titulos_tratados = titulos_tratados.dropna().drop_duplicates()

print(len(titulos_tratados))

90000
86113


In [None]:
lista_lista_tokens = [titulo.split(" ") for titulo in titulos_tratados.titulo]

In [None]:
import logging  # lets us follow the processing progress

logging.basicConfig(format="%(asctime)s : - %(message)s", level = logging.INFO)

w2v_modelo = Word2Vec(sg = 0,
                      window = 2,
                      size = 300,
                      min_count = 5,
                      alpha = 0.03,
                      min_alpha = 0.007)

w2v_modelo.build_vocab(lista_lista_tokens, progress_per=5000)  # logs progress every 5000 titles

2021-01-06 13:03:39,591 : - collecting all words and their counts
2021-01-06 13:03:39,592 : - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-01-06 13:03:39,612 : - PROGRESS: at sentence #5000, processed 34716 words, keeping 10129 word types
2021-01-06 13:03:39,627 : - PROGRESS: at sentence #10000, processed 69298 words, keeping 14909 word types
2021-01-06 13:03:39,646 : - PROGRESS: at sentence #15000, processed 103841 words, keeping 18223 word types
2021-01-06 13:03:39,661 : - PROGRESS: at sentence #20000, processed 138620 words, keeping 20969 word types
2021-01-06 13:03:39,684 : - PROGRESS: at sentence #25000, processed 173257 words, keeping 23410 word types
2021-01-06 13:03:39,703 : - PROGRESS: at sentence #30000, processed 207976 words, keeping 25453 word types
2021-01-06 13:03:39,718 : - PROGRESS: at sentence #35000, processed 242567 words, keeping 27263 word types
2021-01-06 13:03:39,741 : - PROGRESS: at sentence #40000, processed 277254 words, keeping 2899

# Training the Word2Vec representation

### Training with the CBOW architecture
We selected this architecture above by setting the `sg` hyperparameter to *0*.
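
The difference between CBOW and skip-gram can be illustrated by the training examples each one derives from a sentence. The following is a simplified sketch in plain Python. `context_of`, `cbow_examples`, and `skipgram_examples` are hypothetical helpers, and the sketch uses a fixed window while ignoring details of a real implementation such as subsampling and negative sampling.

```python
def context_of(tokens, position, window):
    """Words up to `window` positions to the left and right of `position`."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: the whole context is used together to predict the center word."""
    return [(context_of(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window):
    """Skip-gram: the center word predicts each context word separately."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_of(tokens, i, window)]


titulo = ["rio", "janeiro", "cidade", "maravilhosa"]
print(cbow_examples(titulo, window=2)[0])       # (['janeiro', 'cidade'], 'rio')
print(skipgram_examples(titulo, window=2)[:2])  # [('rio', 'janeiro'), ('rio', 'cidade')]
```

CBOW produces one example per word, while skip-gram produces one per (word, context word) pair, which is one reason skip-gram generally takes longer to train on the same corpus.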

In [None]:
dir(w2v_modelo)

['__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_adapt_by_suffix',
 '_check_input_data_sanity',
 '_check_training_sanity',
 '_clear_post_train',
 '_do_train_epoch',
 '_do_train_job',
 '_get_job_params',
 '_get_thread_working_mem',
 '_job_producer',
 '_load_specials',
 '_log_epoch_end',
 '_log_epoch_progress',
 '_log_progress',
 '_log_train_end',
 '_minimize_model',
 '_raw_word_count',
 '_save_specials',
 '_set_train_params',
 '_smart_save',
 '_train_epoch',
 '_train_epoch_corpusfile',
 '_update_job_params',
 '_worker_loop',
 '_worker_loop_corpusfile',
 'accuracy',
 'alpha',
 'batch_words',
 'build_vocab',
 'build_vocab_from_freq',
 'ca

In [None]:
w2v_modelo.corpus_count

86113

When training a neural network it is common to monitor its **loss**, which is usually updated at every epoch, but gensim's Word2Vec offers no direct way to do this.

The method that returns the stored loss is `model.get_latest_training_loss()`. However, the value is not computed per epoch; it is accumulated over the whole training run. We can work around this by configuring a callback that prints the loss at the end of each epoch.
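
The arithmetic behind this workaround is simple: if the reported loss is a running total, the loss of a single epoch is the difference between two consecutive readings. A minimal sketch, using made-up values and a hypothetical helper `per_epoch_losses`:

```python
def per_epoch_losses(cumulative_losses):
    """Convert running-total loss readings (one per epoch) into per-epoch losses."""
    losses = []
    previous = 0.0
    for total in cumulative_losses:
        losses.append(total - previous)
        previous = total
    return losses


# Made-up running totals, for illustration only.
readings = [100.0, 180.0, 240.0]
print(per_epoch_losses(readings))  # [100.0, 80.0, 60.0]
```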

In [None]:
from gensim.models.callbacks import CallbackAny2Vec

# callback that prints the loss at the end of each epoch
class callback(CallbackAny2Vec):
     def __init__(self):
       self.epoch = 0

     def on_epoch_end(self, model):
       loss = model.get_latest_training_loss()
       if self.epoch == 0:
           print('Loss após a época {}: {}'.format(self.epoch, loss))
       else:
           print('Loss após a época {}: {}'.format(self.epoch, loss- self.loss_previous_step))
       self.epoch += 1
       self.loss_previous_step = loss

In [None]:
w2v_modelo.train(lista_lista_tokens,
                total_examples=w2v_modelo.corpus_count,
                epochs = 30,
                compute_loss = True,
                callbacks=[callback()])

2021-01-06 13:43:51,375 : - Effective 'alpha' higher than previous training cycles
2021-01-06 13:43:51,376 : - training model with 3 workers on 13006 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=2
2021-01-06 13:43:52,400 : - EPOCH 1 - PROGRESS: at 70.25% examples, 351331 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:43:52,754 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:43:52,767 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:43:52,779 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:43:52,780 : - EPOCH - 1 : training on 597929 raw words (503117 effective words) took 1.4s, 363205 effective words/s


Loss após a época 0: 167536.46875


2021-01-06 13:43:53,810 : - EPOCH 2 - PROGRESS: at 70.26% examples, 346397 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:43:54,167 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:43:54,175 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:43:54,189 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:43:54,190 : - EPOCH - 2 : training on 597929 raw words (502920 effective words) took 1.4s, 359480 effective words/s


Loss após a época 1: 169891.5625


2021-01-06 13:43:55,207 : - EPOCH 3 - PROGRESS: at 68.61% examples, 344002 words/s, in_qsize 4, out_qsize 1
2021-01-06 13:43:55,565 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:43:55,587 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:43:55,600 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:43:55,601 : - EPOCH - 3 : training on 597929 raw words (502893 effective words) took 1.4s, 360185 effective words/s


Loss após a época 2: 162861.15625


2021-01-06 13:43:56,628 : - EPOCH 4 - PROGRESS: at 70.25% examples, 347712 words/s, in_qsize 4, out_qsize 1
2021-01-06 13:43:56,952 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:43:56,987 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:43:56,991 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:43:56,992 : - EPOCH - 4 : training on 597929 raw words (502935 effective words) took 1.4s, 364615 effective words/s


Loss após a época 3: 148732.5


2021-01-06 13:43:58,029 : - EPOCH 5 - PROGRESS: at 70.26% examples, 344442 words/s, in_qsize 3, out_qsize 2
2021-01-06 13:43:58,365 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:43:58,367 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:43:58,393 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:43:58,394 : - EPOCH - 5 : training on 597929 raw words (502931 effective words) took 1.4s, 361652 effective words/s


Loss após a época 4: 148933.3125


2021-01-06 13:43:59,409 : - EPOCH 6 - PROGRESS: at 70.26% examples, 352307 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:43:59,782 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:43:59,795 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:43:59,801 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:43:59,802 : - EPOCH - 6 : training on 597929 raw words (502843 effective words) took 1.4s, 360478 effective words/s


Loss após a época 5: 137653.0625


2021-01-06 13:44:00,847 : - EPOCH 7 - PROGRESS: at 70.29% examples, 342037 words/s, in_qsize 6, out_qsize 0
2021-01-06 13:44:01,182 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:01,218 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:01,219 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:01,222 : - EPOCH - 7 : training on 597929 raw words (502890 effective words) took 1.4s, 357310 effective words/s


Loss após a época 6: 139424.9375


2021-01-06 13:44:02,253 : - EPOCH 8 - PROGRESS: at 70.26% examples, 346893 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:02,595 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:02,599 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:02,623 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:02,624 : - EPOCH - 8 : training on 597929 raw words (503058 effective words) took 1.4s, 362141 effective words/s


Loss após a época 7: 119856.625


2021-01-06 13:44:03,640 : - EPOCH 9 - PROGRESS: at 68.60% examples, 342566 words/s, in_qsize 4, out_qsize 1
2021-01-06 13:44:03,990 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:04,019 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:04,029 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:04,030 : - EPOCH - 9 : training on 597929 raw words (502956 effective words) took 1.4s, 360181 effective words/s


Loss após a época 8: 122648.0


2021-01-06 13:44:05,058 : - EPOCH 10 - PROGRESS: at 71.91% examples, 359768 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:05,380 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:05,389 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:05,404 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:05,407 : - EPOCH - 10 : training on 597929 raw words (502759 effective words) took 1.4s, 371454 effective words/s


Loss após a época 9: 125676.875


2021-01-06 13:44:06,439 : - EPOCH 11 - PROGRESS: at 68.60% examples, 337111 words/s, in_qsize 4, out_qsize 1
2021-01-06 13:44:06,792 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:06,808 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:06,813 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:06,814 : - EPOCH - 11 : training on 597929 raw words (502789 effective words) took 1.4s, 359999 effective words/s


Loss após a época 10: 117605.125


2021-01-06 13:44:07,842 : - EPOCH 12 - PROGRESS: at 70.26% examples, 347257 words/s, in_qsize 6, out_qsize 1
2021-01-06 13:44:08,159 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:08,193 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:08,202 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:08,203 : - EPOCH - 12 : training on 597929 raw words (503018 effective words) took 1.4s, 364861 effective words/s


Loss após a época 11: 108164.5


2021-01-06 13:44:09,219 : - EPOCH 13 - PROGRESS: at 70.26% examples, 352531 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:09,594 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:09,598 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:09,620 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:09,622 : - EPOCH - 13 : training on 597929 raw words (502905 effective words) took 1.4s, 358128 effective words/s


Loss após a época 12: 110423.125


2021-01-06 13:44:10,641 : - EPOCH 14 - PROGRESS: at 70.26% examples, 350909 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:11,026 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:11,042 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:11,050 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:11,051 : - EPOCH - 14 : training on 597929 raw words (502715 effective words) took 1.4s, 355041 effective words/s


Loss após a época 13: 113484.125


2021-01-06 13:44:12,091 : - EPOCH 15 - PROGRESS: at 70.26% examples, 343959 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:12,442 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:12,460 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:12,472 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:12,473 : - EPOCH - 15 : training on 597929 raw words (502978 effective words) took 1.4s, 357107 effective words/s


Loss após a época 14: 106016.625


2021-01-06 13:44:13,508 : - EPOCH 16 - PROGRESS: at 70.26% examples, 345615 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:13,852 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:13,869 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:13,878 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:13,879 : - EPOCH - 16 : training on 597929 raw words (502888 effective words) took 1.4s, 361129 effective words/s


Loss após a época 15: 98886.5


2021-01-06 13:44:14,931 : - EPOCH 17 - PROGRESS: at 71.95% examples, 350009 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:15,245 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:15,261 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:15,275 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:15,276 : - EPOCH - 17 : training on 597929 raw words (502670 effective words) took 1.4s, 364805 effective words/s


Loss após a época 16: 101754.0


2021-01-06 13:44:16,316 : - EPOCH 18 - PROGRESS: at 73.60% examples, 360169 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:16,602 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:16,629 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:16,633 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:16,634 : - EPOCH - 18 : training on 597929 raw words (503089 effective words) took 1.3s, 373819 effective words/s


Loss após a época 17: 104839.0


2021-01-06 13:44:17,655 : - EPOCH 19 - PROGRESS: at 70.26% examples, 349966 words/s, in_qsize 6, out_qsize 1
2021-01-06 13:44:17,993 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:18,004 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:18,013 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:18,015 : - EPOCH - 19 : training on 597929 raw words (502923 effective words) took 1.4s, 367385 effective words/s


Loss após a época 18: 98434.25


2021-01-06 13:44:19,033 : - EPOCH 20 - PROGRESS: at 71.91% examples, 358805 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:19,341 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:19,362 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:19,371 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:19,372 : - EPOCH - 20 : training on 597929 raw words (502876 effective words) took 1.3s, 373481 effective words/s


Loss após a época 19: 88142.75


2021-01-06 13:44:20,396 : - EPOCH 21 - PROGRESS: at 71.93% examples, 356381 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:20,718 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:20,724 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:20,737 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:20,738 : - EPOCH - 21 : training on 597929 raw words (502736 effective words) took 1.4s, 370612 effective words/s


Loss após a época 20: 99578.0


2021-01-06 13:44:21,753 : - EPOCH 22 - PROGRESS: at 70.26% examples, 352120 words/s, in_qsize 4, out_qsize 1
2021-01-06 13:44:22,083 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:22,093 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:22,102 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:22,104 : - EPOCH - 22 : training on 597929 raw words (502834 effective words) took 1.4s, 371677 effective words/s


Loss após a época 21: 85414.25


2021-01-06 13:44:23,140 : - EPOCH 23 - PROGRESS: at 71.91% examples, 353470 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:23,453 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:23,463 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:23,477 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:23,480 : - EPOCH - 23 : training on 597929 raw words (503099 effective words) took 1.4s, 369105 effective words/s


Loss após a época 22: 92733.75


2021-01-06 13:44:24,516 : - EPOCH 24 - PROGRESS: at 71.91% examples, 354984 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:24,831 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:24,852 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:24,855 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:24,856 : - EPOCH - 24 : training on 597929 raw words (502835 effective words) took 1.4s, 370214 effective words/s


Loss após a época 23: 86502.0


2021-01-06 13:44:25,885 : - EPOCH 25 - PROGRESS: at 73.58% examples, 363667 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:26,182 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:26,212 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:26,225 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:26,226 : - EPOCH - 25 : training on 597929 raw words (502761 effective words) took 1.4s, 370483 effective words/s


Loss após a época 24: 77674.25


2021-01-06 13:44:27,255 : - EPOCH 26 - PROGRESS: at 71.95% examples, 360619 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:27,569 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:27,589 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:27,597 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:27,598 : - EPOCH - 26 : training on 597929 raw words (502809 effective words) took 1.3s, 373761 effective words/s


Loss após a época 25: 80075.5


2021-01-06 13:44:28,655 : - EPOCH 27 - PROGRESS: at 71.95% examples, 347086 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:28,963 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:28,985 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:28,988 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:28,990 : - EPOCH - 27 : training on 597929 raw words (502695 effective words) took 1.4s, 365083 effective words/s


Loss após a época 26: 83533.75


2021-01-06 13:44:30,018 : - EPOCH 28 - PROGRESS: at 73.60% examples, 364523 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:30,303 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:30,324 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:30,329 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:30,330 : - EPOCH - 28 : training on 597929 raw words (502770 effective words) took 1.3s, 378991 effective words/s


Loss após a época 27: 85629.25


2021-01-06 13:44:31,359 : - EPOCH 29 - PROGRESS: at 71.91% examples, 355095 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:31,683 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:31,720 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:31,723 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:31,726 : - EPOCH - 29 : training on 597929 raw words (502726 effective words) took 1.4s, 363271 effective words/s


Loss após a época 28: 81238.25


2021-01-06 13:44:32,762 : - EPOCH 30 - PROGRESS: at 73.59% examples, 361007 words/s, in_qsize 5, out_qsize 0
2021-01-06 13:44:33,055 : - worker thread finished; awaiting finish of 2 more threads
2021-01-06 13:44:33,063 : - worker thread finished; awaiting finish of 1 more threads
2021-01-06 13:44:33,084 : - worker thread finished; awaiting finish of 0 more threads
2021-01-06 13:44:33,088 : - EPOCH - 30 : training on 597929 raw words (502863 effective words) took 1.3s, 372574 effective words/s
2021-01-06 13:44:33,089 : - training on a 17937870 raw words (15086281 effective words) took 41.7s, 361676 effective words/s


Loss após a época 29: 80066.5


(15086281, 17937870)

In [None]:
w2v_modelo.wv.most_similar("google")  # test

2021-01-06 13:44:33,897 : - precomputing L2-norms of word weight vectors


[('apple', 0.4273000955581665),
 ('facebook', 0.38305389881134033),
 ('uber', 0.3599740266799927),
 ('fbi', 0.35088932514190674),
 ('amazon', 0.3478612005710602),
 ('netanyahu', 0.3400176763534546),
 ('disney', 0.33000797033309937),
 ('software', 0.3272336423397064),
 ('news', 0.3262626528739929),
 ('snapchat', 0.3204635977745056)]
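
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that measure in plain Python follows. It uses made-up 3-dimensional vectors rather than the trained 300-dimensional ones, and `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, not gensim functions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(word, vectors):
    """Other words sorted from most to least similar to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up toy vectors, for illustration only.
toy_vectors = {
    "google": [1.0, 0.0, 1.0],
    "apple": [1.0, 0.0, 0.0],
    "messi": [0.0, 1.0, 0.0],
}
print(rank_by_similarity("google", toy_vectors))
```

A score close to 1 means the two vectors point in nearly the same direction; a score close to 0 means they are nearly orthogonal.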

In [None]:
w2v_modelo.wv.most_similar("microsoft")  # test

[('telefónica', 0.41046327352523804),
 ('amazon', 0.40845006704330444),
 ('braskem', 0.40362513065338135),
 ('unilever', 0.3990749716758728),
 ('canais', 0.395320326089859),
 ('sky', 0.3939288258552551),
 ('tesla', 0.3899308145046234),
 ('lego', 0.3851560354232788),
 ('viajante', 0.37406665086746216),
 ('netflix', 0.3730001151561737)]

In [None]:
w2v_modelo.wv.most_similar("barcelona")  # test

[('bayern', 0.47407281398773193),
 ('madrid', 0.44254249334335327),
 ('botafogo', 0.4367230534553528),
 ('leicester', 0.43565401434898376),
 ('barça', 0.4255511164665222),
 ('chelsea', 0.4184545874595642),
 ('juventus', 0.41743004322052),
 ('liverpool', 0.41165515780448914),
 ('lazio', 0.4115499258041382),
 ('munique', 0.40442782640457153)]

In [None]:
w2v_modelo.wv.most_similar("messi")  # test

[('suárez', 0.5018256306648254),
 ('neymar', 0.4094392657279968),
 ('barça', 0.40654322504997253),
 ('tevez', 0.4012410342693329),
 ('cristiano', 0.3918797969818115),
 ('ronaldo', 0.3839137554168701),
 ('calleri', 0.3779301047325134),
 ('bauza', 0.3762350082397461),
 ('chuteiras', 0.37445375323295593),
 ('maradona', 0.36999812722206116)]

### Training with the Skip-gram architecture

In [None]:
w2v_modelo_sg = Word2Vec(sg = 1,
                      window = 5,
                      size = 300,
                      min_count = 5,
                      alpha = 0.03,
                      min_alpha = 0.007)

w2v_modelo_sg.build_vocab(lista_lista_tokens, progress_per=5000)

w2v_modelo_sg.train(lista_lista_tokens,
                total_examples=w2v_modelo_sg.corpus_count,
                epochs = 30)




2021-01-06 13:48:38,673 : - collecting all words and their counts
2021-01-06 13:48:38,674 : - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-01-06 13:48:38,695 : - PROGRESS: at sentence #5000, processed 34716 words, keeping 10129 word types
2021-01-06 13:48:38,709 : - PROGRESS: at sentence #10000, processed 69298 words, keeping 14909 word types
2021-01-06 13:48:38,725 : - PROGRESS: at sentence #15000, processed 103841 words, keeping 18223 word types
2021-01-06 13:48:38,738 : - PROGRESS: at sentence #20000, processed 138620 words, keeping 20969 word types
2021-01-06 13:48:38,752 : - PROGRESS: at sentence #25000, processed 173257 words, keeping 23410 word types
2021-01-06 13:48:38,771 : - PROGRESS: at sentence #30000, processed 207976 words, keeping 25453 word types
2021-01-06 13:48:38,787 : - PROGRESS: at sentence #35000, processed 242567 words, keeping 27263 word types
2021-01-06 13:48:38,801 : - PROGRESS: at sentence #40000, processed 277254 words, keeping 2899

(15088210, 17937870)

In [None]:
w2v_modelo_sg.wv.most_similar("google")  # test

[('reguladores', 0.41893380880355835),
 ('apple', 0.3909544348716736),
 ('android', 0.38632649183273315),
 ('buffett', 0.3862176239490509),
 ('patentes', 0.372905433177948),
 ('concorda', 0.36265307664871216),
 ('yahoo', 0.3612287938594818),
 ('anunciantes', 0.3608386516571045),
 ('verizon', 0.35400474071502686),
 ('warren', 0.35376566648483276)]

In [None]:
w2v_modelo_sg.wv.most_similar("microsoft")  # test

[('linkedin', 0.5076127648353577),
 ('chips', 0.49765723943710327),
 ('kraft', 0.469170480966568),
 ('heinz', 0.4532616436481476),
 ('unilever', 0.4500690698623657),
 ('software', 0.4486843943595886),
 ('silício', 0.4459594488143921),
 ('verizon', 0.4356909394264221),
 ('ciberataques', 0.432153582572937),
 ('telefónica', 0.42697417736053467)]

In [None]:
w2v_modelo_sg.wv.most_similar("barcelona")  # test


[('celta', 0.5669006705284119),
 ('espanyol', 0.5228727459907532),
 ('supercopa', 0.49703750014305115),
 ('villarreal', 0.49059662222862244),
 ('athletic', 0.48977184295654297),
 ('sevilla', 0.48323675990104675),
 ('madrid', 0.46111834049224854),
 ('wolfsburg', 0.4586392045021057),
 ('valencia', 0.4560597836971283),
 ('monaco', 0.45213747024536133)]

In [None]:
w2v_modelo_sg.wv.most_similar("messi")  # test

[('suárez', 0.5066846609115601),
 ('barça', 0.49429234862327576),
 ('benzema', 0.4662396013736725),
 ('finalizações', 0.4626116156578064),
 ('cavani', 0.4623824954032898),
 ('celta', 0.45204073190689087),
 ('cristiano', 0.4513094425201416),
 ('neymar', 0.44874292612075806),
 ('espanyol', 0.43521976470947266),
 ('neuer', 0.4306640625)]

In this context, the word analogies will be evaluated directly in the classifier.

In [None]:
w2v_modelo.wv.save_word2vec_format("/content/drive/MyDrive/word2ver/modelo_cbow.txt", binary=False)
w2v_modelo_sg.wv.save_word2vec_format("/content/drive/MyDrive/word2ver/modelo_skipgram.txt", binary=False)

2021-01-06 14:08:45,201 : - storing 13006x300 projection weights into /content/drive/MyDrive/word2ver/modelo_cbow.txt
2021-01-06 14:08:48,110 : - storing 13006x300 projection weights into /content/drive/MyDrive/word2ver/modelo_skipgram.txt
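
With `binary=False`, the files are written in the plain-text word2vec format: a header line with the vocabulary size and the vector dimensionality, followed by one line per word containing the word and its vector components. In practice gensim's `KeyedVectors.load_word2vec_format` reads these files back. The sketch below only illustrates the layout using the standard library, with a hypothetical helper `read_word2vec_text` and a tiny made-up file held in memory.

```python
import io


def read_word2vec_text(handle):
    """Parse the plain-text word2vec format from an open text handle.

    Returns (vectors, dimensions), where `vectors` maps each word to a list of floats.
    """
    header = handle.readline().split()
    vocab_size, dimensions = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header does not match the number of vectors read")
    return vectors, dimensions


# Made-up two-word file with 3-dimensional vectors, for illustration only.
sample = io.StringIO("2 3\nrio 0.1 0.2 0.3\ncidade 0.4 0.5 0.6\n")
vectors, dimensions = read_word2vec_text(sample)
print(dimensions, vectors["rio"])  # 3 [0.1, 0.2, 0.3]
```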


# Classifier