### Creating the corpus:
---
A corpus is a collection of texts gathered and organized according to specific criteria for analysis. It serves as a basis for work in computational linguistics, natural language processing (NLP), dictionary compilation, and AI.

In this example corpus, I am using text taken from the article "Computing Machinery and Intelligence" by A. M. Turing.

In [2]:
tiny_corpus = ["I propose to consider the question, Can machines think?", 
               "This should begin with definitions of the meaning of the terms Machine and think."
]
tiny_corpus_lowered = [sentence.lower() for sentence in tiny_corpus]
print(tiny_corpus_lowered)

['i propose to consider the question, can machines think?', 'this should begin with definitions of the meaning of the terms machine and think.']


### Tokenization
---
The process of splitting a text into smaller units, called tokens.

In [3]:
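# Split each lowercased sentence on whitespace (punctuation stays attached to the words).
tiny_corpus_tokenized = [sentence.split() for sentence in tiny_corpus_lowered]
print(tiny_corpus_tokenized)

# Flatten the per-sentence token lists into a single list of words.
words = [word for sentence in tiny_corpus_tokenized for word in sentence]
print(words)

# A set removes duplicates; its iteration order is arbitrary and can change between runs.
unique_words = list(set(words))
print(unique_words)

id2word = {nr: word for nr, word in enumerate(unique_words)}
print(id2word)

# Invert the mapping (idx avoids shadowing the built-in id).
word2id = {word: idx for idx, word in id2word.items()}
print(word2id)

[['i', 'propose', 'to', 'consider', 'the', 'question,', 'can', 'machines', 'think?'], ['this', 'should', 'begin', 'with', 'definitions', 'of', 'the', 'meaning', 'of', 'the', 'terms', 'machine', 'and', 'think.']]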
['i', 'propose', 'to', 'consider', 'the', 'question,', 'can', 'machines', 'think?', 'this', 'should', 'begin', 'with', 'definitions', 'of', 'the', 'meaning', 'of', 'the', 'terms', 'machine', 'and', 'think.']
['propose', 'terms', 'machine', 'this', 'meaning', 'consider', 'think?', 'begin', 'to', 'the', 'think.', 'definitions', 'should', 'of', 'question,', 'i', 'can', 'and', 'machines', 'with']
{0: 'propose', 1: 'terms', 2: 'machine', 3: 'this', 4: 'meaning', 5: 'consider', 6: 'think?', 7: 'begin', 8: 'to', 9: 'the', 10: 'think.', 11: 'definitions', 12: 'should', 13: 'of', 14: 'question,', 15: 'i', 16: 'can', 17: 'and', 18: 'machines', 19: 'with'}
{'propose': 0, 'terms': 1, 'machine': 2, 'this': 3, 'meaning': 4, 'consider': 5, 'think?': 6, 'begin': 7, 'to': 8, 'the': 9, 'think.': 10, 'definitions': 11, 'should': 12, 'of': 13, 'question,': 14, 'i': 15, 'can': 16, 'and': 17, 'machines': 18, 'with': 19}
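
Whitespace splitting leaves punctuation attached to the tokens, so `think?` and `think.` end up as two different vocabulary entries. A minimal sketch of one possible alternative, assuming punctuation can simply be dropped: it extracts runs of word characters with `re.findall` and sorts the vocabulary so that ids are reproducible. The names `sentences`, `tokenize`, and `vocab` are hypothetical and not part of the notebook.

```python
import re

# The two corpus sentences, already lowercased.
sentences = [
    "i propose to consider the question, can machines think?",
    "this should begin with definitions of the meaning of the terms machine and think.",
]

def tokenize(sentence):
    """Return the runs of word characters in a sentence, dropping punctuation."""
    return re.findall(r"\w+", sentence)

tokens = [tokenize(s) for s in sentences]

# Sorting the set gives a stable word order, unlike list(set(...)).
vocab = sorted({t for sent in tokens for t in sent})

print(tokens[0])
print(len(vocab))  # 19 unique words, since "think" is now a single entry
```

With this tokenizer the vocabulary shrinks from 20 to 19 entries. Whether punctuation should be dropped or kept as separate tokens depends on the task.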

### One-Hot Encoding:
---


In [4]:
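import numpy as np

num_word = len(unique_words)
word2one_hot = dict()

# Build one vector per word: all zeros except a 1 at the position of the word's id.
for i in range(num_word):
    zero_vec = np.zeros(num_word)
    zero_vec[i] = 1
    # list() keeps NumPy scalars, so the values print as np.float64(...).
    word2one_hot[id2word[i]] = list(zero_vec)

print(word2one_hot)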

{'propose': [np.float64(1.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0)], 'terms': [np.float64(0.0), np.float64(1.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0)], 'machine': [np.float64(0.0), np.float64(0.0), np.float64(1.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(
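
The same encoding can be sketched more compactly with an identity matrix, where row `i` is the one-hot vector for word id `i`. This is an illustrative alternative, not the notebook's method: the three-word `vocab`, `word_to_id`, and `encode` are hypothetical names, and `.tolist()` is used so the output shows plain Python integers.

```python
import numpy as np

# Hypothetical three-word vocabulary, used only for illustration.
vocab = ["can", "machines", "think"]
word_to_id = {word: idx for idx, word in enumerate(vocab)}

# Row i of the identity matrix is the one-hot vector for word id i.
one_hot = np.eye(len(vocab), dtype=int)

def encode(word):
    """Look up the one-hot row for a word and return it as a plain Python list."""
    return one_hot[word_to_id[word]].tolist()

print(encode("machines"))  # [0, 1, 0]

# Two different one-hot vectors never share a non-zero position,
# so their dot product is always 0.
print(int(one_hot[0] @ one_hot[2]))  # 0
```

Because every pair of distinct one-hot vectors is orthogonal, this representation carries no information about how similar two words are.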

### Skip-gram:
---
Skip-gram is a model that, given a selected target word, predicts the words that appear in its context. The word vectors are the parameters of the model, and training them this way captures semantic relations between words.
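
Skip-gram training data is usually described as (center word, context word) pairs drawn from a sliding window. A minimal sketch of that pair generation follows, assuming a symmetric window; the function name `skipgram_pairs` and the `window` parameter are hypothetical, and no model is trained here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the window to the sentence boundaries.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["can", "machines", "think"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
# [('can', 'machines'), ('machines', 'can'), ('machines', 'think'), ('think', 'machines')]
```

Each pair becomes one training example: the model sees the center word and is asked to predict the context word.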

According to the idea behind Word2Vec, the more often two words appear in the same contexts, the closer their meanings tend to be.
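
This intuition can be illustrated without training anything, by counting which words share a sentence and comparing the resulting count vectors with cosine similarity. Note that this is a count-based stand-in, not Word2Vec itself, and that the toy sentences and the names `counts`, `index`, and `cosine` are assumptions made for the example.

```python
import numpy as np

# Hypothetical toy sentences, not taken from the article.
sentences = [
    ["machines", "can", "think"],
    ["humans", "can", "think"],
    ["machines", "can", "learn"],
]
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

# counts[i, j] = number of times word j appears in the same sentence as word i.
counts = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for w in s:
        for c in s:
            if w != c:
                counts[index[w], index[c]] += 1

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_machines_humans = cosine(counts[index["machines"]], counts[index["humans"]])
sim_machines_can = cosine(counts[index["machines"]], counts[index["can"]])

print(round(sim_machines_humans, 2))  # 0.87
print(round(sim_machines_can, 2))     # 0.39
```

Here "machines" and "humans" never occur together, yet they score as more similar than "machines" and "can" because they share the same neighbours.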