*Problem Statement:*

    Implement the Continuous Bag of Words (CBOW) Model for the given (textual document 1) using the below steps:
    a. Data preparation
    b. Generate training data
    c. Train model
    d. Output



In [115]:
"""
The speed of transmission is an important point of difference between the two viruses. Influenza has a shorter median incubation period (the time from infection to appearance of symptoms) and a shorter serial interval (the time between successive cases) than COVID-19 virus. The serial interval for COVID-19 virus is estimated to be 5-6 days, while for influenza virus, the serial interval is 3 days. This means that influenza can spread faster than COVID-19.

Further, transmission in the first 3-5 days of illness, or potentially pre-symptomatic transmission –transmission of the virus before the appearance of symptoms – is a major driver of transmission for influenza. In contrast, while we are learning that there are people who can shed COVID-19 virus 24-48 hours prior to symptom onset, at present, this does not appear to be a major driver of transmission.

The reproductive number – the number of secondary infections generated from one infected individual – is understood to be between 2 and 2.5 for COVID-19 virus, higher than for influenza. However, estimates for both COVID-19 and influenza viruses are very context and time-specific, making direct comparisons more difficult.
"""

'\nThe speed of transmission is an important point of difference between the two viruses. Influenza has a shorter median incubation period (the time from infection to appearance of symptoms) and a shorter serial interval (the time between successive cases) than COVID-19 virus. The serial interval for COVID-19 virus is estimated to be 5-6 days, while for influenza virus, the serial interval is 3 days. This means that influenza can spread faster than COVID-19.\n\nFurther, transmission in the first 3-5 days of illness, or potentially pre-symptomatic transmission –transmission of the virus before the appearance of symptoms – is a major driver of transmission for influenza. In contrast, while we are learning that there are people who can shed COVID-19 virus 24-48 hours prior to symptom onset, at present, this does not appear to be a major driver of transmission.\n\nThe reproductive number – the number of secondary infections generated from one infected individual – is understood to be betwe

In [116]:
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda
from keras.utils import to_categorical
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
import pandas as pd

In [117]:
data=open('./LP-IV-datasets/CBOW(Ass5)/CBOW.txt','r')
corona_data = [text for text in data if text.count(' ') >= 2]
vectorize = Tokenizer()

In [118]:
vectorize.fit_on_texts(corona_data)
corona_data = vectorize.texts_to_sequences(corona_data)
word2id=vectorize.word_index
word2id['PAD'] = 0

id2word={v:k for k,v in word2id.items()}

In [119]:
# Find total no of words
total_vocab = len(word2id)
window_size = 2

In [120]:
# Generate the pairs of Context words and target words
def cbow_model(data, window_size, total_vocab):
    total_length = window_size*2
    for text in data:
        text_len = len(text)
        for idx, word in enumerate(text):
            context_word = []
            target   = []
            begin = idx - window_size
            end = idx + window_size + 1
            context_word.append([text[i] for i in range(begin, end) if 0 <= i < text_len and i != idx])
            target.append(word)
            contextual = sequence.pad_sequences(context_word, maxlen=total_length)
            final_target = to_categorical(target, total_vocab)
            yield(contextual, final_target)

Create Neural Network model with following parameters :

    Model type : sequential
    
    Layers : Dense , Lambda , embedding. Compile

    Options : (loss='categorical_crossentropy', optimizer='adam')

In [121]:
model = Sequential()
model.add(Embedding(input_dim=total_vocab, output_dim=100, input_length=window_size*2))
model.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(100,)))
model.add(Dense(total_vocab, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
for i in range(10):
    cost = 0
    for x, y in cbow_model(data, window_size, total_vocab):
        cost += model.train_on_batch(x, y) # type: ignore
    print(i, cost)

0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0


In [122]:
# Create vector file of some word for testing
dimensions=100
vect_file = open('./LP-IV-datasets/CBOW(Ass5)/vectors.txt' ,'w')
vect_file.write('{} {}\n'.format(total_vocab,dimensions))

8

In [123]:
# Assign weights to your trained model
weights = model.get_weights()[0]
for text, i in vectorize.word_index.items():
    final_vec = ' '.join(map(str, list(weights[i, :])))
    vect_file.write('{} {}\n'.format(text, final_vec))
vect_file.close()

In [124]:
from sklearn.metrics.pairwise import euclidean_distances
weights=model.get_weights()[0][:]
distance=euclidean_distances(weights)
data=pd.DataFrame(distance,index=word2id.keys()) # type: ignore
data.columns=word2id.keys() # type: ignore

data

Unnamed: 0,the,of,influenza,covid,19,virus,for,transmission,is,to,...,both,very,context,specific,making,direct,comparisons,more,difficult,PAD
the,0.000000,0.416938,0.373206,0.422391,0.446623,0.423370,0.430138,0.425963,0.414923,0.436728,...,0.402639,0.450444,0.421117,0.395164,0.429711,0.423255,0.443211,0.398158,0.445656,0.469780
of,0.416938,0.000000,0.396720,0.424540,0.370634,0.443016,0.436251,0.411362,0.440748,0.375826,...,0.417697,0.438962,0.429598,0.413331,0.430085,0.402068,0.420750,0.413995,0.408109,0.379795
influenza,0.373206,0.396720,0.000000,0.416751,0.386385,0.381383,0.426269,0.388344,0.439700,0.398538,...,0.406843,0.418233,0.386656,0.395612,0.419992,0.411951,0.376090,0.378439,0.414763,0.420073
covid,0.422391,0.424540,0.416751,0.000000,0.379004,0.395000,0.383626,0.400065,0.448840,0.408111,...,0.418244,0.419685,0.390090,0.383997,0.408722,0.404076,0.405887,0.403589,0.431757,0.403858
19,0.446623,0.370634,0.386385,0.379004,0.000000,0.426840,0.411385,0.384151,0.433585,0.384891,...,0.426656,0.427476,0.436468,0.435027,0.416077,0.390885,0.441585,0.422849,0.407687,0.404386
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
direct,0.423255,0.402068,0.411951,0.404076,0.390885,0.410741,0.366366,0.414419,0.414323,0.400893,...,0.436311,0.410979,0.426324,0.405778,0.403044,0.000000,0.416081,0.413523,0.384870,0.386098
comparisons,0.443211,0.420750,0.376090,0.405887,0.441585,0.433938,0.425441,0.419907,0.402662,0.388535,...,0.416147,0.447323,0.381423,0.423221,0.379927,0.416081,0.000000,0.417633,0.394862,0.379225
more,0.398158,0.413995,0.378439,0.403589,0.422849,0.408963,0.413588,0.400032,0.395680,0.405780,...,0.387126,0.397739,0.385745,0.388575,0.418895,0.413523,0.417633,0.000000,0.396620,0.394657
difficult,0.445656,0.408109,0.414763,0.431757,0.407687,0.423480,0.387720,0.393285,0.399423,0.404191,...,0.433592,0.373644,0.403370,0.378475,0.415173,0.384870,0.394862,0.396620,0.000000,0.385975


In [125]:
def SearchWord(WordList):
  ans={}
  for word in WordList:
    if(word in word2id):
      ans[word]=[id2word[idx] for idx in distance[word2id[word]-1].argsort()[0:5]+1]
  return ans

In [127]:
SearchWord(['covid'])

{'covid': ['covid', 'period', 'we', '5', 'while']}