<a id = "1.0"></a>
<p style="font-size:30px;color:#16268a;font-family:'Avantgarde';text-align:center;border-radius:5px;">
<strong><i>Ensemble of robust classifiers</i></strong></p>

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:center;border-radius:5px;">
<strong>______________________________</strong></p> 

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:left;border-radius:5px;">
<strong><i>Bag of Words: English</i></strong></p>

*Example taken from https://scikit-learn.org/ and adapted for the course*

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?',]

***CountVectorizer:*** implements both tokenization and occurrence counting in a single class

In [4]:
vectorizer = CountVectorizer()

***Fitting***

In [5]:
X = vectorizer.fit_transform(corpus)

***Detected words***

In [6]:
print(vectorizer.get_feature_names())  # on scikit-learn >= 1.2, use get_feature_names_out() instead

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


**Processed sentences**
* This is the first document.
* This document is the second document.
* And this is the third one.
* Is this the first document?

Each term found is assigned a unique integer **index** that corresponds to a column of the resulting matrix.
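
The indexing step can be sketched in plain Python. This is only an illustration, assuming tokens are lowercase runs of two or more word characters (the default `token_pattern` of `CountVectorizer`); the helpers `tokenize` and `fit_bag_of_words` are hypothetical names, not part of scikit-learn.

```python
import re
from collections import Counter

# Assumption: mimics CountVectorizer's default tokenization
# (lowercase, tokens of two or more word characters).
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def fit_bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of documents."""
    tokenized = [tokenize(doc) for doc in docs]
    # Sort the distinct terms and give each one a column index.
    terms = sorted({tok for toks in tokenized for tok in toks})
    vocabulary = {term: idx for idx, term in enumerate(terms)}
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[term] for term in terms])
    return vocabulary, matrix

docs = ['This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?']

vocabulary, matrix = fit_bag_of_words(docs)
print(vocabulary['this'])
for row in matrix:
    print(row)
```

Because the terms are sorted alphabetically before indexing, `this` ends up in the last column.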

In [7]:
print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


In [8]:
vectorizer.vocabulary_.get('this')

8

In [9]:
vectorizer.transform(['Something completely new.']).toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

In [10]:
vectorizer.transform(['Document one new.']).toarray()

array([[0, 1, 0, 0, 1, 0, 0, 0, 0]], dtype=int64)
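
The two results above show that words never seen during fitting are simply ignored at transform time. A rough sketch of that behavior, assuming the vocabulary printed earlier and a hypothetical helper `transform` (not the scikit-learn method itself):

```python
import re

# Assumption: vocabulary copied from the fitted English example above.
vocabulary = {'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4,
              'second': 5, 'the': 6, 'third': 7, 'this': 8}

def transform(text, vocabulary):
    """Count only the tokens that already have a column index."""
    row = [0] * len(vocabulary)
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        idx = vocabulary.get(token)
        if idx is not None:  # out-of-vocabulary words are dropped
            row[idx] += 1
    return row

unseen = transform('Something completely new.', vocabulary)
partial = transform('Document one new.', vocabulary)
print(unseen)
print(partial)
```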

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:left;border-radius:5px;">
<strong><i>Bag of Words: Spanish</i></strong></p>

In [11]:
corpus = ['Este es el primer documento.',
          'Este documento es el segundo documento.',
          'Y este es el tercero.',
          '¿Es este el primer documento?']

In [12]:
vectorizer = CountVectorizer()

In [13]:
X = vectorizer.fit_transform(corpus)

In [14]:
print(vectorizer.get_feature_names())

['documento', 'el', 'es', 'este', 'primer', 'segundo', 'tercero']
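
The conjunction `y` is missing from the vocabulary above. This is consistent with the default `token_pattern` of `CountVectorizer`, `(?u)\b\w\w+\b`, which keeps only tokens of two or more characters. A small sketch with the standard `re` module, used here as a stand-in for the vectorizer's tokenizer:

```python
import re

sentence = 'Y este es el tercero.'

# Default-style pattern: at least two word characters per token.
default_tokens = re.findall(r"\b\w\w+\b", sentence.lower())
# Alternative pattern that also keeps single-character tokens.
all_tokens = re.findall(r"\b\w+\b", sentence.lower())

print(default_tokens)
print(all_tokens)
```

Passing `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` should keep one-letter words if they matter for the task.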


In [15]:
print(X.toarray())

[[1 1 1 1 1 0 0]
 [2 1 1 1 0 1 0]
 [0 1 1 1 0 0 1]
 [1 1 1 1 1 0 0]]


In [16]:
vectorizer.vocabulary_.get('este')

3

In [17]:
vectorizer.transform(['Algo completamente nuevo.']).toarray()

array([[0, 0, 0, 0, 0, 0, 0]], dtype=int64)

In [18]:
vectorizer.transform(['Primer documento nuevo.']).toarray()

array([[1, 0, 0, 0, 1, 0, 0]], dtype=int64)

***Drawback:*** the model discards word order, so context is lost. A statement and a question built from the same words get identical vectors, as the first and fourth sentences of each corpus above show.
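
A quick way to see this loss of order, sketched with a `Counter` as a stand-in for one row of the count matrix (the helper `bag` is illustrative only):

```python
import re
from collections import Counter

def bag(text):
    """Unordered word counts, using a CountVectorizer-like token pattern."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

statement = 'This is the first document.'
question = 'Is this the first document?'

same_bag = bag(statement) == bag(question)
print(same_bag)
```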

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:center;border-radius:5px;">
<strong>______________________________</strong></p> 

***Bigrams:*** counting pairs of adjacent words (bigrams) can mitigate this to some extent.

In [19]:
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?',]

In [20]:
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))

In [21]:
X2 = vectorizer2.fit_transform(corpus)

In [22]:
print(vectorizer2.get_feature_names())

['and this', 'document is', 'first document', 'is the', 'is this', 'second document', 'the first', 'the second', 'the third', 'third one', 'this document', 'this is', 'this the']


**Processed sentences**
* This is the first document.
* This document is the second document.
* And this is the third one.
* Is this the first document?

In [23]:
print(X2.toarray())

[[0 0 1 1 0 0 1 0 0 0 0 1 0]
 [0 1 0 1 0 1 0 1 0 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 1 0 1 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 1]]
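
With bigrams, the first and fourth sentences no longer share the same row. A minimal sketch of why, using a hypothetical helper `word_bigrams` rather than the vectorizer itself:

```python
import re

def word_bigrams(text):
    """Pairs of adjacent tokens, joined with a space."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

statement = word_bigrams('This is the first document.')
question = word_bigrams('Is this the first document?')
shared = sorted(set(statement) & set(question))

print(statement)
print(question)
print(shared)
```

Only the bigrams that follow the verb are shared, so the two sentences become distinguishable.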


<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:center;border-radius:5px;">
<strong>______________________________</strong></p> 

***Modeling a longer text***

In [30]:
from dividing_into_sentences import read_text_file, preprocess_text, divide_into_sentences_nltk

In [31]:
def get_sentences(filename):
    sherlock_holmes_text = read_text_file(filename)
    sherlock_holmes_text = preprocess_text(sherlock_holmes_text)
    sentences = divide_into_sentences_nltk(sherlock_holmes_text)
    return sentences

In [32]:
def get_new_sentence_vector(sentence, vectorizer):
    new_sentence_vector = vectorizer.transform([sentence])
    return new_sentence_vector

In [33]:
def create_vectorizer(sentences):
    vectorizer = CountVectorizer(max_df=0.6) # Ignore terms that appear in more than 60% of the sentences
    X = vectorizer.fit_transform(sentences)
    return (vectorizer, X)    
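
The effect of `max_df` can be illustrated without scikit-learn. The sketch below assumes the small English corpus from the first example and a hypothetical helper `document_frequencies`. It drops every term whose document frequency exceeds the threshold.

```python
import re

def document_frequencies(docs):
    """Fraction of documents in which each term appears at least once."""
    counts = {}
    for doc in docs:
        for term in set(re.findall(r"\b\w\w+\b", doc.lower())):
            counts[term] = counts.get(term, 0) + 1
    return {term: count / len(docs) for term, count in counts.items()}

docs = ['This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?']

max_df = 0.6
df = document_frequencies(docs)
kept = sorted(term for term, frac in df.items() if frac <= max_df)
dropped = sorted(term for term, frac in df.items() if frac > max_df)

print(kept)
print(dropped)
```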

In [34]:
def create_bigram_vectorizer(sentences):
    bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = bigram_vectorizer.fit_transform(sentences)
    return (bigram_vectorizer, X)

In [36]:
sentences = get_sentences("./data/sherlock_holmes_1.txt")

In [37]:
(vectorizer, X) = create_vectorizer(sentences)

In [38]:
print(X)

  (0, 113)	1
  (0, 98)	1
  (0, 46)	1
  (0, 97)	1
  (0, 53)	1
  (0, 10)	1
  (0, 0)	1
  (0, 123)	1
  (1, 38)	1
  (1, 94)	1
  (1, 40)	1
  (1, 43)	1
  (1, 63)	1
  (1, 41)	1
  (1, 115)	1
  (1, 11)	1
  (1, 78)	1
  (1, 69)	1
  (2, 97)	1
  (2, 41)	1
  (2, 47)	1
  (2, 45)	1
  (2, 28)	1
  (2, 24)	1
  (2, 87)	1
  :	:
  (9, 85)	1
  (9, 56)	1
  (9, 14)	1
  (9, 66)	1
  (9, 20)	1
  (9, 106)	1
  (9, 102)	1
  (9, 70)	1
  (10, 113)	1
  (10, 123)	2
  (10, 43)	1
  (10, 108)	1
  (10, 75)	1
  (10, 118)	2
  (10, 107)	1
  (10, 52)	1
  (10, 4)	1
  (10, 76)	1
  (10, 15)	1
  (10, 126)	1
  (10, 109)	1
  (10, 55)	1
  (10, 23)	1
  (10, 88)	1
  (10, 60)	1
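
`print(X)` on a SciPy sparse matrix lists only the non-zero entries, one per line, as `(row, column)` followed by the count. A rough sketch of that representation, using two rows from the earlier English example and a hypothetical helper `to_triples`:

```python
dense = [[0, 1, 1, 1, 0, 0, 1, 0, 1],
         [0, 2, 0, 1, 0, 1, 1, 0, 1]]

def to_triples(rows):
    """Keep only non-zero cells as (row, column, value) triples."""
    return [(i, j, value)
            for i, row in enumerate(rows)
            for j, value in enumerate(row)
            if value != 0]

triples = to_triples(dense)
for i, j, value in triples:
    print((i, j), value)
```

The order in which SciPy prints the entries may differ from this sketch. Only the set of triples is meaningful.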


In [39]:
print(X.todense())

[[1 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 0 1]]


In [40]:
print(len(X.todense()))

11


In [41]:
print(vectorizer.get_feature_names())

['_the_', 'abhorrent', 'actions', 'adjusted', 'adler', 'admirable', 'admirably', 'admit', 'akin', 'all', 'always', 'any', 'as', 'balanced', 'be', 'but', 'cold', 'crack', 'delicate', 'distracting', 'disturbing', 'doubt', 'drawing', 'dubious', 'eclipses', 'emotion', 'emotions', 'excellent', 'eyes', 'factor', 'false', 'felt', 'finely', 'for', 'from', 'gibe', 'grit', 'has', 'have', 'he', 'heard', 'her', 'high', 'him', 'himself', 'his', 'holmes', 'in', 'instrument', 'into', 'introduce', 'intrusions', 'irene', 'is', 'it', 'late', 'lenses', 'love', 'lover', 'machine', 'memory', 'men', 'mental', 'mention', 'might', 'mind', 'more', 'most', 'motives', 'name', 'nature', 'never', 'not', 'observer', 'observing', 'of', 'one', 'or', 'other', 'own', 'particularly', 'passions', 'perfect', 'placed', 'position', 'power', 'precise', 'predominates', 'questionable', 'reasoner', 'reasoning', 'results', 'save', 'seen', 'seldom', 'sensitive', 'sex', 'she', 'sherlock', 'sneer', 'softer', 'spoke', 'strong', 'suc

In [42]:
new_sentence = "And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory."
new_sentence_vector = get_new_sentence_vector(new_sentence, vectorizer)

In [43]:
analyze = vectorizer.build_analyzer()
print(analyze(new_sentence))

['and', 'yet', 'there', 'was', 'but', 'one', 'woman', 'to', 'him', 'and', 'that', 'woman', 'was', 'the', 'late', 'irene', 'adler', 'of', 'dubious', 'and', 'questionable', 'memory']


In [44]:
print(new_sentence_vector)

  (0, 4)	1
  (0, 15)	1
  (0, 23)	1
  (0, 43)	1
  (0, 52)	1
  (0, 55)	1
  (0, 60)	1
  (0, 75)	1
  (0, 76)	1
  (0, 88)	1
  (0, 107)	1
  (0, 108)	1
  (0, 109)	1
  (0, 113)	1
  (0, 118)	2
  (0, 123)	2
  (0, 126)	1


In [45]:
print(new_sentence_vector.todense())

[[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  1 1 0 0 0 1 0 0 0 0 2 0 0 0 0 2 0 0 1]]


In [46]:
print(vectorizer.get_feature_names())

['_the_', 'abhorrent', 'actions', 'adjusted', 'adler', 'admirable', 'admirably', 'admit', 'akin', 'all', 'always', 'any', 'as', 'balanced', 'be', 'but', 'cold', 'crack', 'delicate', 'distracting', 'disturbing', 'doubt', 'drawing', 'dubious', 'eclipses', 'emotion', 'emotions', 'excellent', 'eyes', 'factor', 'false', 'felt', 'finely', 'for', 'from', 'gibe', 'grit', 'has', 'have', 'he', 'heard', 'her', 'high', 'him', 'himself', 'his', 'holmes', 'in', 'instrument', 'into', 'introduce', 'intrusions', 'irene', 'is', 'it', 'late', 'lenses', 'love', 'lover', 'machine', 'memory', 'men', 'mental', 'mention', 'might', 'mind', 'more', 'most', 'motives', 'name', 'nature', 'never', 'not', 'observer', 'observing', 'of', 'one', 'or', 'other', 'own', 'particularly', 'passions', 'perfect', 'placed', 'position', 'power', 'precise', 'predominates', 'questionable', 'reasoner', 'reasoning', 'results', 'save', 'seen', 'seldom', 'sensitive', 'sex', 'she', 'sherlock', 'sneer', 'softer', 'spoke', 'strong', 'suc

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:center;border-radius:5px;">
<strong>______________________________</strong></p> 

In [47]:
(bigram_vectorizer, X) = create_bigram_vectorizer(sentences)

In [48]:
print(X)

  (0, 269)	1
  (0, 229)	1
  (0, 118)	1
  (0, 226)	1
  (0, 136)	1
  (0, 20)	1
  (0, 0)	1
  (0, 299)	1
  (0, 275)	1
  (0, 230)	1
  (0, 119)	1
  (0, 228)	1
  (0, 137)	1
  (0, 21)	1
  (0, 1)	1
  (1, 93)	1
  (1, 221)	1
  (1, 101)	1
  (1, 108)	1
  (1, 156)	1
  (1, 103)	1
  (1, 278)	1
  (1, 31)	1
  (1, 190)	1
  (1, 167)	1
  :	:
  (10, 307)	1
  (10, 261)	1
  (10, 141)	1
  (10, 60)	1
  (10, 210)	1
  (10, 151)	1
  (10, 30)	1
  (10, 308)	1
  (10, 262)	1
  (10, 285)	1
  (10, 45)	1
  (10, 187)	1
  (10, 300)	1
  (10, 271)	1
  (10, 109)	1
  (10, 251)	1
  (10, 301)	1
  (10, 288)	1
  (10, 253)	1
  (10, 142)	1
  (10, 8)	1
  (10, 180)	1
  (10, 61)	1
  (10, 27)	1
  (10, 211)	1


In [49]:
print(X.todense())

[[1 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 0 1 1]]


In [50]:
print(len(X.todense()))

11


In [51]:
print(bigram_vectorizer.get_feature_names())

['_the_', '_the_ woman', 'abhorrent', 'abhorrent to', 'actions', 'adjusted', 'adjusted temperament', 'adler', 'adler of', 'admirable', 'admirable things', 'admirably', 'admirably balanced', 'admit', 'admit such', 'akin', 'akin to', 'all', 'all emotions', 'all his', 'always', 'always _the_', 'and', 'and actions', 'and finely', 'and observing', 'and predominates', 'and questionable', 'and sneer', 'and that', 'and yet', 'any', 'any emotion', 'any other', 'as', 'as his', 'as lover', 'balanced', 'balanced mind', 'be', 'be more', 'but', 'but admirably', 'but as', 'but for', 'but one', 'cold', 'cold precise', 'crack', 'crack in', 'delicate', 'delicate and', 'distracting', 'distracting factor', 'disturbing', 'disturbing than', 'doubt', 'doubt upon', 'drawing', 'drawing the', 'dubious', 'dubious and', 'eclipses', 'eclipses and', 'emotion', 'emotion akin', 'emotion in', 'emotions', 'emotions and', 'excellent', 'excellent for', 'eyes', 'eyes she', 'factor', 'factor which', 'false', 'false positio

In [52]:
new_sentence = "I had seen little of Holmes lately."
new_sentence_vector = bigram_vectorizer.transform([new_sentence])

In [53]:
print(new_sentence_vector)
print(new_sentence_vector.todense())

  (0, 118)	1
  (0, 179)	1
  (0, 219)	1
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


In [54]:
new_sentence1 = " And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory."
new_sentence_vector1 = vectorizer.transform([new_sentence1])

In [55]:
print(new_sentence_vector1)
print(new_sentence_vector1.todense())

  (0, 4)	1
  (0, 15)	1
  (0, 23)	1
  (0, 43)	1
  (0, 52)	1
  (0, 55)	1
  (0, 60)	1
  (0, 75)	1
  (0, 76)	1
  (0, 88)	1
  (0, 107)	1
  (0, 108)	1
  (0, 109)	1
  (0, 113)	1
  (0, 118)	2
  (0, 123)	2
  (0, 126)	1
[[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  1 1 0 0 0 1 0 0 0 0 2 0 0 0 0 2 0 0 1]]


<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:center;border-radius:5px;">
<strong>______________________________</strong></p> 

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:left;border-radius:5px;">
<strong><i>CBOW</i></strong></p> 

***Continuous Bag-of-Words Word2Vec:*** an architecture for learning word vectors that predicts each word from its context, using both the preceding and the following words. The CBOW objective function is:

![Alt Text](./img/clow.png)
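
Written out, and assuming the usual notation ($T$ words in the training text, a context of $n$ words on each side), the objective is commonly given as the average log-probability of each word given its surrounding context:

$$
J_\theta = \frac{1}{T}\sum_{t=1}^{T}\log p\left(w_t \mid w_{t-n},\ldots,w_{t-1},w_{t+1},\ldots,w_{t+n}\right)
$$

Training maximizes $J_\theta$, which is equivalent to minimizing the categorical cross-entropy between the predicted distribution and the observed center word.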

![Alt Text](./img/word2vec_diagrams.png)

***Paper:*** [Efficient Estimation of Word Representations in Vector Space](https://paperswithcode.com/method/cbow-word2vec)

*Code adapted for the class from the ***fasttext*** quick start guide*

In [56]:
import numpy as np
np.random.seed(13)

In [57]:
import gensim

In [58]:
import random
from IPython.display import Image

In [71]:
import keras.backend as K
from keras.models import Sequential, Model
from keras.layers import Embedding, Reshape, Activation, Input, Lambda, Dense
from keras.preprocessing import sequence
from keras.layers import Dot
from keras.utils import np_utils
from keras.utils.data_utils import get_file
from keras.utils.vis_utils import model_to_dot
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences

In [63]:
window_size = 4

In [73]:
def skipgrams(sequence, vocabulary_size,
              window_size=window_size, negative_samples=1., shuffle=True,
              categorical=False, sampling_table=None, seed=None):
    couples = []
    labels = []
    for i, wi in enumerate(sequence):
        if not wi:
            continue
        if sampling_table is not None:
            if sampling_table[wi] < random.random():
                continue

        window_start = max(0, i - window_size)
        window_end = min(len(sequence), i + window_size + 1)
        for j in range(window_start, window_end):
            if j != i:
                wj = sequence[j]
                if not wj:
                    continue
                couples.append([wi, wj])
                if categorical:
                    labels.append([0, 1])
                else:
                    labels.append(1)

    if negative_samples > 0:
        num_negative_samples = int(len(labels) * negative_samples)
        words = [c[0] for c in couples]
        random.shuffle(words)

        couples += [[words[i % len(words)],
                    random.randint(1, vocabulary_size - 1)]
                    for i in range(num_negative_samples)]
        if categorical:
            labels += [[1, 0]] * num_negative_samples
        else:
            labels += [0] * num_negative_samples

    if shuffle:
        if seed is None:
            seed = random.randint(0, 10**7)  # integer bound; float bounds are rejected by recent Python versions
        random.seed(seed)
        random.shuffle(couples)
        random.seed(seed)
        random.shuffle(labels)
        
    return couples, labels

def generate_data_for_cbow(corpus, window_size, V):
    maxlen = window_size*2
    corpus = tokenizer.texts_to_sequences(corpus)
    for words in corpus:
        L = len(words)
        for index, word in enumerate(words):
            contexts = []
            labels   = []            
            s = index - window_size
            e = index + window_size + 1
            
            contexts.append([words[i] for i in range(s, e) if 0 <= i < L and i != index])
            labels.append(word)
            x = pad_sequences(contexts, maxlen=maxlen)
            y = np_utils.to_categorical(labels, V)
            yield (x, y)
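
To make the generator's output concrete, here is a dependency-free sketch of the same windowing idea. It assumes left-padding with zeros, which is the default of `pad_sequences` (`padding='pre'`); the function name `cbow_examples` is hypothetical.

```python
def cbow_examples(sequence, window_size):
    """Yield (padded_context, target) pairs for one tokenized sentence."""
    maxlen = 2 * window_size
    for index, target in enumerate(sequence):
        start = max(0, index - window_size)
        # Words before and after the target, excluding the target itself.
        context = sequence[start:index] + sequence[index + 1:index + window_size + 1]
        # Left-pad with 0 so every context has the same length.
        padded = [0] * (maxlen - len(context)) + context
        yield padded, target

examples = list(cbow_examples([1, 2, 3, 4, 5], window_size=2))
for context, target in examples:
    print(context, '->', target)
```

Index 0 is reserved for padding, which is why the vocabulary size is `len(tokenizer.word_index) + 1`.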

In [65]:
path = './data/alice.txt'
corpus = open(path, encoding="utf-8").readlines()

In [66]:
corpus = [sentence for sentence in corpus if sentence.count(' ') >= 2]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
V=len(tokenizer.word_index) + 1

In [67]:
embedding_dim = 100
# inputs
w_inputs = Input(shape=(1, ), dtype='int32')
w = Embedding(V, embedding_dim)(w_inputs)

# context
c_inputs = Input(shape=(1, ), dtype='int32')
c = Embedding(V, embedding_dim)(c_inputs)
o = Dot(axes=2)([w, c])
o = Reshape((1,), input_shape=(1, 1))(o)
o = Activation('sigmoid')(o)

sg_model = Model(inputs=[w_inputs, c_inputs], outputs=o)
sg_model.compile(loss='binary_crossentropy', optimizer='adam')
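
The skip-gram model above scores a (word, context) pair with the dot product of two embeddings passed through a sigmoid, and is trained with binary cross-entropy against label 1 for observed pairs and 0 for negative samples. A small numeric sketch with made-up three-dimensional vectors (all values are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pair_probability(word_vec, context_vec):
    """Probability that the pair is a true (word, context) pair."""
    return sigmoid(sum(w * c for w, c in zip(word_vec, context_vec)))

def binary_cross_entropy(label, prob):
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

word = [0.5, -0.2, 0.1]      # made-up embedding of the center word
context = [0.4, 0.3, -0.1]   # made-up embedding of a context word

p = pair_probability(word, context)
loss_if_positive = binary_cross_entropy(1, p)
loss_if_negative = binary_cross_entropy(0, p)
print(p, loss_if_positive, loss_if_negative)
```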

In [68]:
cbow = Sequential()
cbow.add(Embedding(input_dim=V, output_dim=embedding_dim, input_length=window_size*2))
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embedding_dim,)))
cbow.add(Dense(V, activation='softmax'))
cbow.compile(loss='categorical_crossentropy', optimizer='adadelta')
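
The three CBOW layers amount to: look up the context embeddings, average them, and apply a softmax over the vocabulary. A toy forward pass in plain Python, with tiny hand-picked weights that are purely illustrative:

```python
import math

def cbow_forward(context_ids, embeddings, dense_weights, dense_bias):
    """Return a probability for every vocabulary entry."""
    dim = len(embeddings[0])
    # Average of the context embeddings (the Lambda / K.mean step).
    mean = [sum(embeddings[i][d] for i in context_ids) / len(context_ids)
            for d in range(dim)]
    # Dense layer: one logit per vocabulary entry.
    logits = [sum(mean[d] * dense_weights[d][v] for d in range(dim)) + dense_bias[v]
              for v in range(len(dense_bias))]
    # Numerically stable softmax.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # V = 3, dim = 2
dense_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # shape (dim, V)
dense_bias = [0.0, 0.0, 0.0]

probs = cbow_forward([1, 2], embeddings, dense_weights, dense_bias)
print(probs)
```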

<a id = "1.0"></a>
<p style="font-size:20px;color:#16268a;font-family:'Avantgarde';text-align:left;border-radius:5px;">
<strong><i>Training</i></strong></p>

In [69]:
for ite in range(5):
    loss = 0.
    for i, doc in enumerate(tokenizer.texts_to_sequences(corpus)):
        data, labels = skipgrams(sequence=doc, vocabulary_size=V, window_size=5, negative_samples=5.)
        x = [np.array(x) for x in zip(*data)]
        y = np.array(labels, dtype=np.int32)
        if x:
            loss += sg_model.train_on_batch(x, y)

    print(ite, loss)

0 1109.1710671558976
1 759.6094448566437
2 702.9872647300363
3 676.4245354682207
4 652.0895389690995


In [74]:
for ite in range(5):
    loss  =  0.
    for  x, y in generate_data_for_cbow(corpus, window_size, V):
        loss += cbow.train_on_batch(x, y)

    print(ite, loss)

0 250932.1743850708
1 250805.3332901001
2 250678.60900306702
3 250551.89191818237
4 250425.04319000244


<a id = "1.0"></a>
<p style="font-size:20px;color:#16268a;font-family:'Avantgarde';text-align:left;border-radius:5px;">
<strong><i>Saving the generated vectors</i></strong></p>

In [75]:
with open('./data/sg_vectors.txt' ,'w') as f:
    f.write('{} {}\n'.format(V-1, embedding_dim))
    vectors = sg_model.get_weights()[0]
    for word, i in tokenizer.word_index.items():
        f.write('{} {}\n'.format(word, ' '.join(map(str, list(vectors[i, :])))))

In [76]:
with open('./data/cbow_vectors.txt' ,'w') as f:
    f.write('{} {}\n'.format(V-1, embedding_dim))
    vectors = cbow.get_weights()[0]
    for word, i in tokenizer.word_index.items():
        f.write('{} {}\n'.format(word, ' '.join(map(str, list(vectors[i, :])))))

<a id = "1.0"></a>
<p style="font-size:20px;color:#16268a;font-family:'Avantgarde';text-align:left;border-radius:5px;">
<strong><i>Loading the vectors</i></strong></p>

In [81]:
#sg_model = gensim.models.KeyedVectors.load_word2vec_format( open('./data/sg_vectors.txt', 'r'), binary=False)
sg_model = gensim.models.KeyedVectors.load_word2vec_format('./data/sg_vectors.txt', binary=False, encoding='latin-1')

In [82]:
#cbow_model = gensim.models.KeyedVectors.load_word2vec_format(open('./data/cbow_vectors.txt', 'r'), binary=False)
cbow_model = gensim.models.KeyedVectors.load_word2vec_format('./data/cbow_vectors.txt', binary=False, encoding='latin-1')
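
Entries such as `queen\x92s` in the results below are what a Windows-1252 right single quote (byte `0x92`) looks like when read back as Latin-1, which suggests the files were written with the platform's default encoding. One way to avoid this, sketched here with hypothetical helpers `save_vectors` and `load_vectors` and a toy dictionary, is to use UTF-8 explicitly on both sides of the word2vec text format:

```python
import os
import tempfile

def save_vectors(path, vectors):
    """Write {word: [floats]} in the word2vec text format, as UTF-8."""
    dim = len(next(iter(vectors.values())))
    with open(path, 'w', encoding='utf-8') as f:
        f.write('{} {}\n'.format(len(vectors), dim))  # header: count and dimension
        for word, vec in vectors.items():
            f.write('{} {}\n'.format(word, ' '.join(map(str, vec))))

def load_vectors(path):
    """Read the same format back into a dictionary."""
    with open(path, encoding='utf-8') as f:
        count, dim = map(int, f.readline().split())
        vectors = {}
        for line in f:
            parts = line.rstrip('\n').split(' ')
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    assert len(vectors) == count
    return vectors, dim

toy = {'queen': [0.1, 0.2], 'queen\u2019s': [0.3, 0.4]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_vectors.txt')
    save_vectors(path, toy)
    loaded, dim = load_vectors(path)

print(loaded)
```

With files written this way, `load_word2vec_format` can be called with its default `encoding='utf8'`.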

In [83]:
sg_model.most_similar(positive=['queen'])

[('hearts', 0.7326700091362),
 ('wildly', 0.6947518587112427),
 ('became', 0.6890791058540344),
 ('fire', 0.6887832880020142),
 ('cook', 0.6798040270805359),
 ('turning', 0.6791201233863831),
 ('queen\x92s', 0.676758348941803),
 ('croquet', 0.6745284795761108),
 ('top', 0.6716916561126709),
 ('hush', 0.6687166094779968)]
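
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that ranking over a made-up two-dimensional vocabulary (the vectors and the helper `most_similar` are illustrative, not the gensim implementation):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

toy_vectors = {
    'queen': [1.0, 1.0],
    'king': [1.0, 0.8],
    'rabbit': [-1.0, 0.5],
}

ranking = most_similar('queen', toy_vectors)
print(ranking)
```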

In [84]:
cbow_model.most_similar(positive=['queen'])

[('losing', 0.40262794494628906),
 ('needs', 0.3390108048915863),
 ('thoughts', 0.32639408111572266),
 ('whiles', 0.3247351348400116),
 ('search', 0.28745707869529724),
 ('positively', 0.2724922299385071),
 ('ones', 0.2723672091960907),
 ('sentence', 0.26836562156677246),
 ('stored', 0.26779720187187195),
 ('pack', 0.2639751434326172)]

In [85]:
sg_model.most_similar(positive=['alice'])

[('thought', 0.7039021253585815),
 ('nonsense', 0.5913527607917786),
 ('\x93it', 0.5814018845558167),
 ('doubtfully', 0.5789979100227356),
 ('cautiously', 0.574916422367096),
 ('rather', 0.5677836537361145),
 ('\x93i\x92m', 0.5670735239982605),
 ('perhaps', 0.5468480587005615),
 ('said', 0.5464442372322083),
 ('glad', 0.5305448174476624)]

In [86]:
cbow_model.most_similar(positive=['alice'])

[('display', 0.388045072555542),
 ('a\x97i\x92m', 0.30560874938964844),
 ('guests', 0.2873057425022125),
 ('draw', 0.282014936208725),
 ('burning', 0.28113794326782227),
 ('profit', 0.27617281675338745),
 ('time', 0.2714170217514038),
 ('clearly', 0.26782727241516113),
 ('exclamation', 0.26630762219429016),
 ('\x93tut', 0.25979748368263245)]

In [87]:
sg_model.most_similar(positive=['the'])

[('queen', 0.6171810030937195),
 ('queen\x92s', 0.6143390536308289),
 ('lobster', 0.5978270173072815),
 ('hearts', 0.593090832233429),
 ('country', 0.5838220715522766),
 ('croquet', 0.5799708366394043),
 ('archive', 0.5592383146286011),
 ('march', 0.558046281337738),
 ('literary', 0.5445029735565186),
 ('king', 0.5424962043762207)]

In [88]:
cbow_model.most_similar(positive=['the'])

[('pronounced', 0.3117009699344635),
 ('tidy', 0.29297709465026855),
 ('compilation', 0.29158133268356323),
 ('prosecute', 0.2840246856212616),
 ('expense', 0.27993249893188477),
 ('trouble', 0.278850257396698),
 ('sea\x97\x94', 0.27178046107292175),
 ('watch', 0.2663557231426239),
 ('theirs', 0.2655295133590698),
 ('change', 0.2634385824203491)]

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:center;border-radius:5px;">
<strong>______________________________</strong></p> 

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:left;border-radius:5px;">
<strong><i>Term Frequency-Inverse Document Frequency (TF-IDF)</i></strong></p>

In a large text corpus, some words appear very often (for example, "the", "a", "is" in English) and therefore ***carry very little meaningful information about the actual content of the document***. If the raw counts were fed directly to a classifier, those very frequent terms would overshadow the frequencies of rarer but more interesting terms.
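
The re-weighting can be sketched in plain Python. The version below assumes the smoothed inverse document frequency that `TfidfVectorizer` applies by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row; the function `tfidf` is a hypothetical helper.

```python
import math
import re

def tfidf(docs):
    """Return (terms, idf, rows) with L2-normalized TF-IDF rows."""
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    terms = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Number of documents containing each term.
    df = {t: sum(t in toks for toks in tokenized) for t in terms}
    # Smoothed IDF: terms present in every document get weight 1.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in terms}
    rows = []
    for toks in tokenized:
        weights = [toks.count(t) * idf[t] for t in terms]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return terms, idf, rows

docs = ['This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?']

terms, idf, rows = tfidf(docs)
print(idf['the'], idf['second'])
```

A word that occurs in every sentence, such as `the`, keeps the minimum weight, while a word found in a single sentence, such as `second`, is boosted.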

In [89]:
import nltk
import string

In [90]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [91]:
from nltk.stem.snowball import SnowballStemmer

In [93]:
from removing_stopwords import read_in_csv

In [94]:
stemmer = SnowballStemmer('english')
stopwords_file_path = "./data/stopwords.csv"
sentences = get_sentences("./data/sherlock_holmes_1.txt")

In [95]:
def tokenize_and_stem(sentence):
    tokens = nltk.word_tokenize(sentence)
    filtered_tokens = [t for t in tokens if t not in string.punctuation]
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

In [96]:
def create_char_vectorizer(sentences):
    #Create TF-IDF object
    tfidf_char_vectorizer = TfidfVectorizer(analyzer='char_wb', max_df=0.90, max_features=200000,
                                        min_df=0.05, use_idf=True, ngram_range=(1,3))
    tfidf_char_vectorizer = tfidf_char_vectorizer.fit(sentences)
    tfidf_matrix = tfidf_char_vectorizer.transform(sentences)
    print(tfidf_matrix)
    dense_matrix = tfidf_matrix.todense()
    print(dense_matrix)
    print(tfidf_char_vectorizer.get_feature_names())
    analyze = tfidf_char_vectorizer.build_analyzer()
    print(analyze("To Sherlock Holmes she is always _the_ woman."))
    return (tfidf_char_vectorizer, tfidf_matrix)

In [97]:
def create_vectorizer(sentences):
    #Create TF-IDF object
    stopword_list = read_in_csv(stopwords_file_path)
    stemmed_stopwords = [tokenize_and_stem(stopword)[0] for stopword in stopword_list]
    stopword_list = stopword_list + stemmed_stopwords
    tfidf_vectorizer = TfidfVectorizer(max_df=0.90, max_features=200000,
                                        min_df=0.05, stop_words=stopword_list,
                                        use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))
    tfidf_vectorizer = tfidf_vectorizer.fit(sentences)
    tfidf_matrix = tfidf_vectorizer.transform(sentences)
    print(tfidf_matrix)
    dense_matrix = tfidf_matrix.todense()
    print(dense_matrix)
    print(tfidf_vectorizer.get_feature_names())
    return (tfidf_vectorizer, tfidf_matrix)

In [98]:
(vectorizer, matrix) = create_vectorizer(sentences)

  (0, 195)	0.2892833606818738
  (0, 167)	0.33843668854613723
  (0, 166)	0.33843668854613723
  (0, 165)	0.33843668854613723
  (0, 84)	0.33843668854613723
  (0, 83)	0.33843668854613723
  (0, 82)	0.33843668854613723
  (0, 1)	0.33843668854613723
  (0, 0)	0.33843668854613723
  (1, 160)	0.408248290463863
  (1, 159)	0.408248290463863
  (1, 158)	0.408248290463863
  (1, 115)	0.408248290463863
  (1, 78)	0.408248290463863
  (1, 77)	0.408248290463863
  (2, 194)	0.28867513459481287
  (2, 193)	0.28867513459481287
  (2, 164)	0.28867513459481287
  (2, 146)	0.28867513459481287
  (2, 145)	0.28867513459481287
  (2, 144)	0.28867513459481287
  (2, 60)	0.28867513459481287
  (2, 59)	0.28867513459481287
  (2, 58)	0.28867513459481287
  (2, 51)	0.28867513459481287
  :	:
  (9, 57)	0.18475579480398302
  (9, 52)	0.13888402122461266
  (9, 39)	0.18475579480398302
  (9, 38)	0.18475579480398302
  (9, 37)	0.18475579480398302
  (9, 30)	0.18475579480398302
  (9, 29)	0.18475579480398302
  (9, 28)	0.18475579480398302
  (10

In [99]:
analyze = vectorizer.build_analyzer()
print(analyze("To Sherlock Holmes she is always _the_ woman."))

['sherlock', 'holm', '_the_', 'woman', 'sherlock holm', 'holm _the_', '_the_ woman', 'sherlock holm _the_', 'holm _the_ woman']
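
After stemming and stop-word removal, the sentence is apparently reduced to four tokens, and `ngram_range=(1,3)` then emits every run of one, two, and three adjacent tokens. A short sketch of that expansion, with a hypothetical helper `word_ngrams`:

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """All n-grams of adjacent tokens, shortest first."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[i:i + n]))
    return ngrams

tokens = ['sherlock', 'holm', '_the_', 'woman']
ngrams = word_ngrams(tokens)
print(ngrams)
```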


<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:center;border-radius:5px;">
<strong>______________________________</strong></p> 

In [100]:
from sklearn.datasets import fetch_20newsgroups

In [101]:
newsgroups_train = fetch_20newsgroups(subset='train')

In [102]:
newsgroups_test = fetch_20newsgroups(subset='test')

In [103]:
x_train = newsgroups_train.data

In [104]:
x_test = newsgroups_test.data

In [105]:
y_train = newsgroups_train.target

In [106]:
y_test = newsgroups_test.target

In [107]:
print("Categories of the 20 data sources:")
print(newsgroups_train.target_names)
print("___________________________")
print("Example of an email:")
print(x_train[0])
print("___________________________")
print("Example target:")
print(y_train[0])
print(newsgroups_train.target_names[y_train[0]])

Categories of the 20 data sources:
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
___________________________
Example of an email:
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine 

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:center;border-radius:5px;">
<strong>______________________________</strong></p> 

In [108]:
import nltk

In [109]:
import string

In [110]:
import pandas as pd

In [111]:
from nltk.corpus import stopwords

In [112]:
from nltk.stem import WordNetLemmatizer

In [113]:
from nltk import pos_tag

In [114]:
from nltk.stem import PorterStemmer

In [115]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [116]:
def preprocessing(text):
    text2 = " ".join("".join([" " if ch in string.punctuation else ch for ch in text]).split())
    tokens = [word for sent in nltk.sent_tokenize(text2) for word in nltk.word_tokenize(sent)]
    tokens = [word.lower() for word in tokens]
    stopwds = stopwords.words('english')
    tokens = [token for token in tokens if token not in stopwds]
    tokens = [word for word in tokens if len(word)>=3]
    stemmer = PorterStemmer()

    try:
        tokens = [stemmer.stem(word) for word in tokens]
    except Exception:
        pass  # keep the unstemmed tokens if stemming fails
        
    tagged_corpus = pos_tag(tokens)    
    Noun_tags = ['NN','NNP','NNPS','NNS']
    Verb_tags = ['VB','VBD','VBG','VBN','VBP','VBZ']
    lemmatizer = WordNetLemmatizer()

    def prat_lemmatize(token,tag):
        if tag in Noun_tags:
            return lemmatizer.lemmatize(token,'n')
        elif tag in Verb_tags:
            return lemmatizer.lemmatize(token,'v')
        else:
            return lemmatizer.lemmatize(token,'n')
    
    pre_proc_text =  " ".join([prat_lemmatize(token,tag) for token,tag in tagged_corpus])             

    return pre_proc_text

In [117]:
x_train_preprocessed  = []

In [118]:
for i in x_train:
    x_train_preprocessed.append(preprocessing(i))

In [119]:
x_test_preprocessed = []

In [120]:
for i in x_test:
    x_test_preprocessed.append(preprocessing(i))

In [121]:
vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2),  stop_words='english', 
                             max_features= 10000,strip_accents='unicode',  norm='l2')

In [122]:
x_train_2 = vectorizer.fit_transform(x_train_preprocessed).todense()

In [123]:
x_test_2 = vectorizer.transform(x_test_preprocessed).todense()

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:left;border-radius:5px;">
<strong><i>Deep Learning modules</i></strong></p> 

In [124]:
import numpy as np

In [125]:
from keras.models import Sequential

In [126]:
from keras.layers.core import Dense, Dropout, Activation

In [127]:
from keras.optimizers import Adadelta,Adam,RMSprop

In [128]:
from keras.utils import np_utils

In [129]:
from sklearn.metrics import accuracy_score,classification_report

<a id = "1.0"></a>
<p style="font-size:20px;color:#16268a;font-family:'Avantgarde';text-align:left;border-radius:5px;">
<strong><i>Hyperparameters</i></strong></p>

In [130]:
np.random.seed(1337) 
nb_classes = 20
batch_size = 64
nb_epochs = 20

In [131]:
Y_train = np_utils.to_categorical(y_train, nb_classes)
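
`to_categorical` turns each integer class label into a one-hot row of length `nb_classes`, which is the target shape that `categorical_crossentropy` expects. A dependency-free sketch of the same encoding (the helper `to_one_hot` is illustrative):

```python
def to_one_hot(labels, num_classes):
    """One row per label, with a single 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = to_one_hot([7, 0, 19], num_classes=20)
print(encoded[0])
```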

In [132]:
model = Sequential()

In [133]:
model.add(Dense(1000,input_shape= (10000,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))

In [134]:
model.add(Dense(500))
model.add(Activation('relu'))
model.add(Dropout(0.5))

In [135]:
model.add(Dense(50))
model.add(Activation('relu'))
model.add(Dropout(0.5))

In [136]:
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

In [137]:
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [138]:
print (model.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_1 (Dense)             (None, 1000)              10001000  
                                                                 
 activation_1 (Activation)   (None, 1000)              0         
                                                                 
 dropout (Dropout)           (None, 1000)              0         
                                                                 
 dense_2 (Dense)             (None, 500)               500500    
                                                                 
 activation_2 (Activation)   (None, 500)               0         
                                                                 
 dropout_1 (Dropout)         (None, 500)               0         
                                                                 
 dense_3 (Dense)             (None, 50)               
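
The `Param #` column can be checked by hand: a `Dense` layer holds one weight per input/unit pair plus one bias per unit. A minimal sketch, where the helper `dense_params` is hypothetical and the layer sizes are taken from the cells above:

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters: weights plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input size, three hidden layers, and nb_classes outputs.
layer_sizes = [10000, 1000, 500, 50, 20]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))
```

The first two values agree with the rows for `dense_1` and `dense_2` in the summary; almost all of the parameters sit in the first layer because of the 10000-dimensional TF-IDF input.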

<a id = "1.0"></a>
<p style="font-size:20px;color:#16268a;font-family:'Avantgarde';text-align:left;border-radius:5px;">
<strong><i>Model Training</i></strong></p> 

In [139]:
model.fit(x_train_2, Y_train, batch_size=batch_size, epochs=nb_epochs,verbose=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x25c383b7520>

<a id = "1.0"></a>
<p style="font-size:20px;color:#16268a;font-family:'Avantgarde';text-align:left;border-radius:5px;">
<strong><i>Model Prediction</i></strong></p> 

In [141]:
y_train_predclass = model.predict(x_train_2,batch_size=batch_size)



In [142]:
y_test_predclass = model.predict(x_test_2,batch_size=batch_size)



In [148]:
print("Train accuracy: {}".format(round(accuracy_score(y_train, np.argmax(y_train_predclass, axis=1)), 3)))
print("Test accuracy: {}".format(round(accuracy_score(y_test, np.argmax(y_test_predclass, axis=1)), 3)))

Train accuracy: 0.999
Test accuracy: 0.802


In [147]:
print ("Test Classification Report\n")
print (classification_report(y_test,np.argmax(y_test_predclass,axis=1)))

Test Classification Report

              precision    recall  f1-score   support

           0       0.81      0.71      0.76       319
           1       0.67      0.71      0.69       389
           2       0.73      0.65      0.69       394
           3       0.66      0.72      0.68       392
           4       0.79      0.72      0.76       385
           5       0.80      0.77      0.78       395
           6       0.77      0.82      0.79       390
           7       0.81      0.87      0.84       396
           8       0.96      0.87      0.91       398
           9       0.93      0.90      0.91       397
          10       0.96      0.96      0.96       399
          11       0.91      0.89      0.90       396
          12       0.63      0.74      0.68       393
          13       0.87      0.85      0.86       396
          14       0.90      0.90      0.90       394
          15       0.82      0.88      0.85       398
          16       0.78      0.83      0.81       364
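
For reference, each row of the report can be derived from three counts per class. The sketch below uses toy labels (not taken from the dataset) and a hypothetical helper `per_class_report`; `classification_report` performs the equivalent computation for every class.

```python
def per_class_report(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [0, 0, 0, 1, 1, 2]   # toy ground-truth labels
y_pred = [0, 0, 1, 1, 2, 2]   # toy predictions
precision, recall, f1 = per_class_report(y_true, y_pred, label=0)
print(precision, round(recall, 3), round(f1, 3))
```

The `support` column is simply the number of true samples of each class in the test set.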

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:center;border-radius:5px;">
<strong>______________________________</strong></p> 

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:left;border-radius:5px;">
<strong><i>Word Embeddings</i></strong></p> 

Word embeddings are dense vectors obtained as a by-product of training a ***neural network*** to predict a word from the other words in the sentence. Words that occur in similar contexts end up with similar vectors.
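
"Similar vectors" is usually measured with cosine similarity. The following sketch uses made-up 3-dimensional vectors purely for illustration; the values and the dictionary `toy_vectors` are hypothetical and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings; real models use tens to hundreds of dimensions.
toy_vectors = {
    "holmes": [0.9, 0.1, 0.3],
    "watson": [0.8, 0.2, 0.4],
    "banana": [0.1, 0.9, -0.2],
}
sim_close = cosine_similarity(toy_vectors["holmes"], toy_vectors["watson"])
sim_far = cosine_similarity(toy_vectors["holmes"], toy_vectors["banana"])
print(round(sim_close, 3), round(sim_far, 3))
```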

In [149]:
from gensim.models import KeyedVectors

In [150]:
import numpy as np

In [151]:
w2vec_model_path = "./models/40/model.bin"

In [152]:
def load_model(path):
    # Use the path argument instead of the global variable
    model = KeyedVectors.load_word2vec_format(path, binary=True)
    return model

In [153]:
def get_sentence_vector(word_vectors):
    # Sentence vector = centroid (element-wise mean) of its word vectors
    matrix = np.array(word_vectors)
    centroid = np.mean(matrix, axis=0)
    return centroid

In [154]:
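def get_word_vectors(sentence, model):
    # Accept a raw string or a list of tokens; iterating over a string
    # directly would look up single characters instead of words.
    words = sentence.split() if isinstance(sentence, str) else sentence
    word_vectors = []
    for word in words:
        try:
            word_vector = model.get_vector(word.lower())
            word_vectors.append(word_vector)
        except KeyError:
            # Skip out-of-vocabulary words
            continue
    return word_vectors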
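
The two helpers above amount to averaging the vectors of the known words in a sentence. A dependency-free sketch of that idea follows; the dictionary `toy_embeddings` and the function `average_vectors` are hypothetical, and the guard for sentences without any known word is an addition that the cells above do not include.

```python
def average_vectors(tokens, embeddings):
    """Centroid of the vectors of the tokens found in `embeddings`."""
    found = [embeddings[t] for t in tokens if t in embeddings]
    if not found:
        return None  # no known words: there is nothing to average
    dims = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dims)]

# Made-up 2-dimensional vectors; "for" is deliberately out of vocabulary.
toy_embeddings = {"love": [1.0, 0.0], "irene": [0.0, 1.0], "adler": [1.0, 1.0]}
tokens = "love for irene adler".split()
centroid = average_vectors(tokens, toy_embeddings)
print(centroid)
```

Averaging discards word order, so it works best as a quick baseline for sentence similarity.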

In [157]:
model = load_model(w2vec_model_path)

In [161]:
print(model['holmes'])

[-0.309647 -0.127936 -0.136244 -0.252969  0.410695  0.206325  0.119236
 -0.244745 -0.436801  0.058889  0.237439  0.247656  0.072103  0.044183
 -0.424878  0.367344  0.153287  0.343856  0.232269 -0.181432 -0.050021
  0.225756  0.71465  -0.564166 -0.168468 -0.153668  0.300445 -0.220122
 -0.021261  0.25779  -0.581744  0.320341 -0.236189  0.224906  0.029358
 -0.295143  0.483847 -0.05832   0.010784  0.050842 -0.034141  0.420114
  0.126926 -0.405974 -0.421415  0.006092 -0.137557  0.038477  0.100005
  0.151401  0.287163 -0.433263 -0.249083 -0.057834  0.367427 -0.181977
  0.31608   0.063203 -0.486009 -0.127354 -0.283149  0.028113 -0.150146
 -0.38704   0.033237  0.146932  0.470853 -0.151154  0.064424  0.146739
 -0.164267 -0.094909  0.443384 -0.055244  0.117268 -0.221496 -0.185951
  0.056249 -0.176986 -0.449508  0.345431 -0.096014 -0.19798   0.117698
 -0.162563 -0.181655 -0.18644  -0.158727  0.595464  0.161437 -0.382661
  0.148537  0.173535  0.370556 -0.346765  0.055452  0.024405 -0.002895
  0.08

In [162]:
print(model.most_similar(['holmes'], topn=15))

[('sherlock', 0.8416915535926819), ('parker', 0.8099909424781799), ('moriarty', 0.8039607405662537), ('sawyer', 0.8002702593803406), ('moore', 0.7932804822921753), ('wolfe', 0.7923581600189209), ('hale', 0.7910093069076538), ('doyle', 0.7906038165092468), ('holmes.the', 0.7895271182060242), ('watson', 0.788769006729126), ('yates', 0.7882786393165588), ('stevenson', 0.7879440188407898), ('spencer', 0.7877693176269531), ('goodwin', 0.7866846323013306), ('baxter', 0.7864187359809875)]


In [163]:
sentence = "It was not that he felt any emotion akin to love for Irene Adler."
word_vectors = get_word_vectors(sentence, model)
sentence_vector = get_sentence_vector(word_vectors)
words = ['banana', 'apple', 'computer', 'strawberry']
print(model.doesnt_match(words))

computer


In [164]:
word = "cup"
words = ['glass', 'computer', 'pencil', 'watch']
print(model.most_similar_to_given(word, words))

glass


<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:center;border-radius:5px;">
<strong>______________________________</strong></p> 

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:left;border-radius:5px;">
<strong><i>Word2vec Model</i></strong></p> 

The ***word2vec*** algorithm uses a neural network model to learn word associations from a large text corpus. Once trained, the model can detect synonymous words or suggest additional words for a partial sentence.

![Alt Text](./img/word2vec_translation.png)

#### Source: Python Deep Learning Projects, curated for this course

In [165]:
import requests
import os
import re
import multiprocessing

In [166]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import nltk

In [167]:
import gensim.models.word2vec as w2v

In [168]:
import sklearn.manifold

In [169]:
import tensorflow as tf

In [170]:
#nltk.download("punkt")
#nltk.download("stopwords")

In [171]:
def sentence_to_wordlist(raw):
    # Replace every non-letter character with a space, then split and lowercase
    clean = re.sub("[^a-zA-Z]", " ", raw)
    words = clean.split()
    return [word.lower() for word in words]
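
A quick check of what this cleaning step keeps and drops, using a made-up sentence (the function is repeated so that the snippet runs on its own):

```python
import re

def sentence_to_wordlist(raw):
    # Replace every non-letter character with a space, then split and lowercase
    clean = re.sub("[^a-zA-Z]", " ", raw)
    return [word.lower() for word in clean.split()]

example = "The Earth's crust, in 1830, was little understood."
word_list = sentence_to_wordlist(example)
print(word_list)
```

Digits and punctuation disappear, and the apostrophe splits `Earth's` into `earth` and `s`. Accented letters would be removed as well, so this regular expression is only suitable for plain English text.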

#### Preprocessing applied to the text

![Alt Text](./img/learning-word-vectors-1.png)

***Text taken from Principles of Geology by Sir Charles Lyell, Project Gutenberg***

In [172]:
filepath = 'http://www.gutenberg.org/files/33224/33224-0.txt'
corpus_raw = requests.get(filepath).text

In [173]:
# Split the raw text into sentences with the Punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
raw_sentences = tokenizer.tokenize(corpus_raw)

In [174]:
# Turn each non-empty sentence into a list of lowercase words
sentences = [sentence_to_wordlist(raw) for raw in raw_sentences if raw]
token_count = sum(len(sentence) for sentence in sentences)
print(f'The book corpus contains {token_count} tokens.')

The book corpus contains 425633 tokens.


#### Model definition

In [175]:
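# Dimensionality of the word vectors
num_features = 300

# Minimum frequency for a word to be kept in the vocabulary
min_word_count = 3

# Number of parallel worker threads
num_workers = multiprocessing.cpu_count()

# Context window size
context_size = 7

# Subsampling threshold for very frequent words
downsampling = 1e-3

# Random seed
seed = 1

# vector_size is the gensim 4.x name (size in gensim 3.x); without it,
# num_features is ignored and the default of 100 dimensions applies.
model2vec = w2v.Word2Vec(sg=1, seed=seed, workers=num_workers, vector_size=num_features,
                         min_count=min_word_count, window=context_size, sample=downsampling)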
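
The `sample` threshold controls how aggressively very frequent words are randomly dropped during training. The sketch below assumes gensim follows the rule of the original word2vec C code, where a word that makes up a fraction `f` of the corpus is kept with probability `(sqrt(f / t) + 1) * t / f`, capped at 1. The function name `keep_probability` and the example fractions are hypothetical.

```python
import math

def keep_probability(word_fraction, sample=1e-3):
    """Probability of keeping one occurrence of a word during training.

    word_fraction: share of all corpus tokens taken up by the word.
    sample: the subsampling threshold (the `sample` argument of Word2Vec).
    """
    if sample <= 0:
        return 1.0  # subsampling disabled
    ratio = word_fraction / sample
    return min(1.0, (math.sqrt(ratio) + 1) / ratio)

# From a very frequent word (5% of tokens) to a rarer one (0.05% of tokens).
for fraction in (0.05, 0.005, 0.0005):
    print(fraction, round(keep_probability(fraction), 3))
```

Only words whose share of the corpus is well above the threshold are thinned out, which speeds up training and reduces the weight of function words such as "the".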

In [176]:
model2vec.build_vocab(sentences)

***Note:*** Increasing the number of dimensions lets the vectors encode finer distinctions between words, but it also increases training time and memory use, and the benefit may be limited on a small corpus.

***Note:*** The ***context_size*** parameter (passed to `Word2Vec` as `window`) sets the maximum distance between the current word and the predicted word within a sentence.
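
To see what the window means in practice, the sketch below lists the (center, context) pairs a skip-gram model would train on for a short sentence. It is a simplification: the function `skipgram_pairs` is hypothetical, and it always uses the full window, whereas gensim is understood to pick a random effective window up to the maximum for each center word.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["the", "crust", "of", "the", "earth"]
pairs = skipgram_pairs(tokens, window=2)
print(len(pairs))
print(pairs[:4])
```

A larger window produces more pairs per word and tends to capture topical relatedness, while a smaller one emphasises words that are directly adjacent.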

#### Training

In [177]:
model2vec.train(sentences, total_examples=model2vec.corpus_count, epochs=10)

In [180]:
model2vec.save('./models/sample.w2v')

#### Model evaluation

In [178]:
print('Similar a: "earth":')
for sWord in model2vec.wv.most_similar("earth"):
    print(sWord)
    
print('\nSimilar a: "human":')
for sWord in model2vec.wv.most_similar("human"):
    print(sWord)
    
print('\nContribucion Positiva y Negativa:')
for sWord in model2vec.wv.most_similar_cosmul(positive=['earth', 'moon'], negative=['orbit']):
    print(sWord)

Similar a: "earth":
('crust', 0.7069793939590454)
('globe', 0.6538490056991577)
('orbit', 0.6276617050170898)
('inequalities', 0.6221923232078552)
('planet', 0.6116704940795898)
('surface', 0.5956591963768005)
('moon', 0.5919835567474365)
('sun', 0.5879665017127991)
('remodelled', 0.5859501361846924)
('unevenness', 0.5838420987129211)

Similar a: "human":
('race', 0.6516473889350891)
('art', 0.6346074938774109)
('man', 0.6225540041923523)
('population', 0.6143702864646912)
('affairs', 0.6132031679153442)
('industry', 0.607807993888855)
('men', 0.5850975513458252)
('beings', 0.5833526253700256)
('organized', 0.5775673389434814)
('wickedness', 0.5763292908668518)

Contribucion Positiva y Negativa:
('employed', 0.8177217841148376)
('sun', 0.8086581826210022)
('crust', 0.7970605492591858)
('remodelled', 0.7948926091194153)
('attendant', 0.7932011485099792)
('sound', 0.792156994342804)
('globe', 0.7861462235450745)
('god', 0.7815254926681519)
('spirits', 0.7794240117073059)
('rays', 0.77813

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:center;border-radius:5px;">
<strong>______________________________</strong></p> 

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:left;border-radius:5px;">
<strong><i>Train Word2Vec</i></strong></p> 

In [196]:
import gensim
from gensim.test.utils import datapath
from gensim.models import KeyedVectors
import pickle
from nltk.tokenize import  word_tokenize

In [184]:
from bag_of_words import get_sentences

In [186]:
word2vec_model_path = "models/word2vec.model"
books_dir = "books/"
evaluation_file = "test/questions-words.txt"
pretrained_model_path = "models/40/model.bin"

In [187]:
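def train_word2vec(words, word2vec_model_path):
    # Earlier configuration, kept for reference (gensim 3.x parameter names):
    #model = gensim.models.Word2Vec(
    #    words,
    #    size=50,
    #    window=7,
    #    min_count=1,
    #    workers=10)
    # Passing the sentences to the constructor builds the vocabulary and
    # already runs the default number of training epochs.
    model = gensim.models.Word2Vec(words, window=5, min_count=5)
    # Continue training for 200 more epochs
    model.train(words, total_examples=len(words), epochs=200)
    # Persist the trained model and close the file handle when done
    with open(word2vec_model_path, 'wb') as f:
        pickle.dump(model, f)
    return model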

In [188]:
from os import listdir
from os.path import isfile, join

def get_all_book_sentences(directory):
    # Collect every .rtf file in the directory
    text_files = [join(directory, f) for f in listdir(directory) if isfile(join(directory, f)) and ".rtf" in f]
    all_sentences = []
    for text_file in text_files:
        # Append the sentences of each book to a single list
        all_sentences.extend(get_sentences(text_file))
    return all_sentences

In [189]:
def test_model(w1):
    # Load the pickled model and print the 10 words most similar to w1
    with open(word2vec_model_path, 'rb') as f:
        model = pickle.load(f)
    # The vocabulary is available as model.wv.index_to_key (model.wv.vocab in gensim 3.x)
    words = model.wv.most_similar(w1, topn=10)
    print(words)

In [190]:
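def evaluate_model(model, filename):
    # KeyedVectors.accuracy() was removed in gensim 4.x; evaluate_word_analogies() replaces it
    return model.wv.evaluate_word_analogies(filename)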
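
The evaluation file is expected to follow the word-analogy format of `questions-words.txt`: a line starting with `:` opens a section, and every other line holds four words `a b c d`, read as "a is to b as c is to d". A small parser sketch, assuming that layout; the function `parse_analogies` is hypothetical and the sample rows are written in the style of that file.

```python
def parse_analogies(lines):
    """Group 'a b c d' analogy rows under the ': section' header preceding them."""
    sections = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections[current] = []
        else:
            a, b, c, expected = line.lower().split()
            sections[current].append((a, b, c, expected))
    return sections

sample = [
    ": capital-common-countries",
    "Athens Greece Baghdad Iraq",
    "Athens Greece Berlin Germany",
    ": family",
    "boy girl brother sister",
]
sections = parse_analogies(sample)
print({name: len(rows) for name, rows in sections.items()})
```

For each row, the evaluation checks whether the vector nearest to `b - a + c` is the expected word `d`, and reports the share of correct answers per section.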

In [192]:
sentences = get_all_book_sentences(books_dir)

In [195]:
sentences = [word_tokenize(s.lower()) for s in sentences]

In [197]:
model = train_word2vec(sentences,word2vec_model_path)

In [198]:
oneWord = "river"
test_model(oneWord)

[('banks', 0.6390682458877563), ('mountains', 0.6331812739372253), ('woods', 0.6327189803123474), ('illinois', 0.6184098720550537), ('avenue', 0.6137031316757202), ('stream', 0.6122629642486572), ('raft', 0.5993295907974243), ('shore', 0.5992862582206726), ('hill', 0.5925582647323608), ('island', 0.5775993466377258)]


In [199]:
model = pickle.load(open(word2vec_model_path, 'rb'))

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:center;border-radius:5px;">
<strong>______________________________</strong></p> 

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:left;border-radius:5px;">
<strong><i>Word2Vec Spanish</i></strong></p> 

[Training word2vec in Spanish, example adapted for the class](https://github.com/dccuchile/spanish-word-embeddings)

In [200]:
from gensim.models.keyedvectors import KeyedVectors

Alternatively, pretrained vectors are available at [Word vectors for 157 languages](https://fasttext.cc/docs/en/crawl-vectors.html)

In [201]:
wordvectors_file_vec = './data/fasttext-sbwc.3.6.e20.vec'

In [202]:
cantidad = 100000

In [203]:
wordvectors = KeyedVectors.load_word2vec_format(wordvectors_file_vec, limit=cantidad)

#### Finding analogies or words that share a similar context

[Reference: KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html)

In [204]:
wordvectors.most_similar_cosmul(positive=['rey','mujer'],negative=['hombre'])

[('reina', 0.9141532778739929),
 ('infanta', 0.8582409620285034),
 ('berenguela', 0.8470728993415833),
 ('princesa', 0.8445042371749878),
 ('consorte', 0.835599422454834),
 ('emperatriz', 0.8247664570808411),
 ('regente', 0.8239888548851013),
 ('infantas', 0.8104740381240845),
 ('hermanastra', 0.8072930574417114),
 ('regencia', 0.8037239909172058)]

In [205]:
wordvectors.most_similar_cosmul(positive=['actor','mujer'],negative=['hombre'], topn=10)

[('actriz', 0.9687139391899109),
 ('compositora', 0.855713427066803),
 ('cantante', 0.8482002019882202),
 ('actrices', 0.845941424369812),
 ('dramaturga', 0.8354867696762085),
 ('presentadora', 0.8346402645111084),
 ('bailarina', 0.8301039934158325),
 ('coprotagonista', 0.8284398317337036),
 ('guionista', 0.828334629535675),
 ('cantautora', 0.8273791670799255)]

In [206]:
wordvectors.most_similar_cosmul(positive=['hijo','mujer'],negative=['hombre'], topn=5)

[('hija', 0.9641352295875549),
 ('esposa', 0.911634087562561),
 ('madre', 0.9057635068893433),
 ('nieta', 0.8976945877075195),
 ('hermanastra', 0.8958925604820251)]

In [207]:
wordvectors.most_similar_cosmul(positive=['yerno','mujer'],negative=['hombre'])

[('nuera', 0.8991931080818176),
 ('cuñada', 0.8967029452323914),
 ('esposa', 0.8791162967681885),
 ('hija', 0.8787108659744263),
 ('suegra', 0.8752366304397583),
 ('sobrina', 0.8678680658340454),
 ('hermanastra', 0.8615662455558777),
 ('viuda', 0.8587483167648315),
 ('yernos', 0.8577941656112671),
 ('nieta', 0.8574916124343872)]

In [208]:
wordvectors.most_similar_cosmul(positive=['jugar','canta'],negative=['cantar'])

[('juega', 0.927038848400116),
 ('jugará', 0.9030497670173645),
 ('juegue', 0.8957996368408203),
 ('jugando', 0.8832089304924011),
 ('juegan', 0.868077278137207),
 ('jugado', 0.8658615946769714),
 ('jugó', 0.8645128607749939),
 ('juegas', 0.8533657789230347),
 ('jugaría', 0.8508267402648926),
 ('jugara', 0.8470849394798279)]

In [209]:
wordvectors.most_similar_cosmul(positive=['jugar','cantaría'],negative=['cantar'])

[('jugaría', 1.002570629119873),
 ('jugarían', 0.9512909650802612),
 ('jugara', 0.9422452449798584),
 ('disputaría', 0.918655276298523),
 ('jugará', 0.908361554145813),
 ('jugaran', 0.8989545106887817),
 ('jugase', 0.8874877095222473),
 ('disputarían', 0.8822468519210815),
 ('jugó', 0.8740343451499939),
 ('ficharía', 0.8733251094818115)]
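
Scores slightly above 1, such as the first result above, are possible because `most_similar_cosmul` does not return a plain cosine. As far as the gensim documentation describes it, the method implements the 3CosMul objective: cosines are shifted to the range [0, 1], the positive terms are multiplied together, and the product is divided by the negative terms plus a small constant. The sketch below uses made-up 3-dimensional vectors; the dictionary `vec` and the function `cosmul_score` are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosmul_score(candidate, positive, negative, eps=1e-6):
    """3CosMul: product of shifted positive cosines over product of shifted negative ones."""
    numerator = math.prod((1 + cosine(candidate, p)) / 2 for p in positive)
    denominator = math.prod((1 + cosine(candidate, n)) / 2 for n in negative) + eps
    return numerator / denominator

# Hypothetical axes: royalty, female, person.
vec = {
    "rey": [1.0, 0.0, 1.0],
    "hombre": [0.0, 0.0, 1.0],
    "mujer": [0.0, 1.0, 1.0],
    "reina": [1.0, 1.0, 1.0],
    "mesa": [-0.5, 0.1, 0.3],
}
positive = [vec["rey"], vec["mujer"]]
negative = [vec["hombre"]]
# Query words are excluded from the candidates, as gensim does with its results.
scores = {w: cosmul_score(vec[w], positive, negative) for w in ("reina", "mesa")}
best = max(scores, key=scores.get)
print(best, {w: round(s, 3) for w, s in scores.items()})
```

Because the score is a ratio, a candidate that is close to the positive words and only moderately close to the negative word can exceed 1.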

In [210]:
wordvectors.most_similar_cosmul(positive=['ir','jugando'],negative=['jugar'])

[('yendo', 0.907002329826355),
 ('ido', 0.8450857996940613),
 ('saliendo', 0.832144021987915),
 ('caminando', 0.8135581612586975),
 ('yéndose', 0.8133329153060913),
 ('acercando', 0.8035196661949158),
 ('iremos', 0.8023999333381653),
 ('marchando', 0.8001841902732849),
 ('parando', 0.7995682954788208),
 ('irá', 0.7987060546875)]

In [211]:
wordvectors.most_similar_cosmul(positive=['santiago','venezuela'],negative=['chile'])

[('caracas', 0.9048638343811035),
 ('barinas', 0.871845543384552),
 ('brión', 0.8565776944160461),
 ('cojedes', 0.851475715637207),
 ('cumaná', 0.8507834076881409),
 ('guanare', 0.8507249355316162),
 ('maturín', 0.8474243879318237),
 ('mariño', 0.8468520641326904),
 ('barquisimeto', 0.8451403379440308),
 ('falcón', 0.8430415987968445)]

In [212]:
wordvectors.most_similar_cosmul(positive=['habana','chile'],negative=['santiago'])

[('cuba', 0.9638005495071411),
 ('venezuela', 0.8891815543174744),
 ('colombia', 0.876230001449585),
 ('cubana', 0.8471046686172485),
 ('nicaragua', 0.8443881273269653),
 ('cubanos', 0.8370179533958435),
 ('ecuador', 0.8361554145812988),
 ('brasil', 0.8355840444564819),
 ('cubano', 0.8315702080726624),
 ('panamá', 0.8302189111709595)]

In [213]:
wordvectors.most_similar_to_given('santiago', ['cuba','chile', 'brasil'] )

'chile'

In [214]:
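# Caution: n_similarity expects two lists of words. Given plain strings it iterates over
# their characters, so this compares letter vectors; wordvectors.similarity('santiago', 'chile')
# gives the word-to-word cosine.
wordvectors.n_similarity('santiago', 'chile')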

0.7982443

In [215]:
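# Caution: n_similarity expects lists of words; for two single words use
# wordvectors.similarity('chile', 'brasil') instead.
wordvectors.n_similarity('chile', 'brasil')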

0.9034706

#### Finding the word in a list that is farthest from the rest

In [216]:
wordvectors.doesnt_match(['blanco','azul','rojo','chile'])

'chile'

In [217]:
wordvectors.doesnt_match(['sol','luna','almuerzo','jupiter'])

'almuerzo'
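
One way to picture `doesnt_match`: normalise every word vector, average them, and return the word least aligned with that average. The sketch below follows that reading of the method with made-up 2-dimensional vectors; the function `odd_one_out` and the dictionary `toy` are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def odd_one_out(words, embeddings):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = {w: unit(embeddings[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dims)])
    scores = {w: sum(a * b for a, b in zip(u, mean)) for w, u in units.items()}
    return min(scores, key=scores.get)

# Hypothetical axes: colour, country.
toy = {"blanco": [0.9, 0.1], "azul": [1.0, 0.0], "rojo": [0.8, 0.2], "chile": [0.1, 1.0]}
result = odd_one_out(["blanco", "azul", "rojo", "chile"], toy)
print(result)
```

The outlier itself contributes to the mean, so with very short lists the result can be less stable than with four or more words.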

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:center;border-radius:5px;">
<strong>______________________________</strong></p> 

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:left;border-radius:5px;">
<strong><i>Bidirectional Encoder Representations from Transformers - BERT</i></strong></p> 

***BERT*** is built on the Transformer, an attention-based architecture that learns contextual relations between the words of a text. In its basic form, a Transformer has two separate parts: an encoder that reads the input text and a decoder that produces a prediction for the task. BERT uses only the encoder, since its purpose is to produce a representation of the language.

In [219]:
from sentence_transformers import SentenceTransformer

In [220]:
from dividing_into_sentences import read_text_file, divide_into_sentences_nltk

In [221]:
text = read_text_file("./data/sherlock_holmes.txt")

In [222]:
sentences = divide_into_sentences_nltk(text)

In [223]:
model = SentenceTransformer('bert-base-nli-mean-tokens')

In [224]:
sentence_embeddings = model.encode(["the beautiful lake"])

In [225]:
print("Sentence embeddings:")
print(sentence_embeddings)

Sentence embeddings:
[[-7.61982948e-02 -5.74669957e-01  1.08264244e+00  7.36554921e-01
   5.51345766e-01 -9.39117789e-01 -2.80430317e-01 -5.41625559e-01
   7.50949204e-01 -4.40971464e-01  5.31526744e-01 -5.41882992e-01
   1.92792937e-01  3.44117552e-01  1.50266457e+00 -6.26990139e-01
  -2.42829040e-01 -3.66734654e-01  5.57459831e-01 -2.21802637e-01
  -9.69591737e-01 -4.38950121e-01 -7.93552458e-01 -5.84922850e-01
  -1.55690759e-01  2.12004229e-01  4.02014196e-01 -2.63063848e-01
   6.21910214e-01  5.97238004e-01  9.78123173e-02  7.20052481e-01
  -4.66322720e-01  3.86450231e-01 -8.24903488e-01  1.09985709e+00
  -3.59135151e-01 -4.31918919e-01  2.56567057e-02  5.73160291e-01
   2.40237564e-01 -7.67570674e-01  9.38899815e-01 -3.60024393e-01
  -8.77115607e-01 -2.47681111e-01 -8.65838528e-01  1.04203582e+00
   3.65989417e-01 -6.47720471e-02 -7.04247296e-01  5.91089716e-03
  -8.04807484e-01  2.21370250e-01 -1.79775149e-01  8.04759324e-01
  -4.44356680e-01 -4.46378887e-01  7.55991489e-02 -2.17
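
The model name `bert-base-nli-mean-tokens` suggests how one vector per sentence is obtained: the token vectors produced by BERT are averaged (mean pooling), with padding positions ignored. A minimal sketch of that pooling step with made-up 2-dimensional token vectors; the function `mean_pool` and all values are hypothetical, and BERT-base models work with 768 dimensions.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dims = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dims)]

# Hypothetical token vectors for "[CLS] the beautiful lake [SEP] [PAD]".
token_embeddings = [[0.2, 0.0], [0.4, 0.2], [0.6, 0.4], [0.8, 0.6], [0.0, 0.8], [9.0, 9.0]]
attention_mask = [1, 1, 1, 1, 1, 0]
sentence_vector = mean_pool(token_embeddings, attention_mask)
print([round(x, 3) for x in sentence_vector])
```

Unlike the word2vec centroid, each token vector here already depends on the whole sentence, so the same word contributes differently in different contexts.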

<a id = "1.0"></a>
<p style="font-size:25px;color:#16268a;font-family:'Avantgarde';text-align:center;border-radius:5px;">
<strong>______________________________</strong></p> 