######AdiPokharna
# **Chit 6 or 7 or 8**

*Problem Statement:*

    Implement the Continuous Bag of Words (CBOW) Model for the given (textual document 1) using the below steps:
    a. Data preparation
    b. Generate training data
    c. Train model
    d. Output



# Importing libraries
a. Data preparation

In [1]:
import numpy as np
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda
from tensorflow.python.keras import utils
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
import gensim

b. Generate training data

This will be same for 6, 7, 8 Chits only the data in the file will be different for each chit

In [2]:
# Open a text file in write mode
with open('corona.txt', 'w') as file:
    # Write the message to the file
    file.write("""
The speed of transmission is an important point of difference between the two viruses.
Influenza has a shorter median incubation period (the time from infection to appearance of symptoms)
and a shorter serial interval (the time between successive cases) than COVID-19 virus.
The serial interval for COVID-19 virus is estimated to be 5-6 days, while for influenza virus, the serial interval is 3 days.
This means that influenza can spread faster than COVID-19.

Further, transmission in the first 3-5 days of illness, or potentially pre-symptomatic transmission –
transmission of the virus before the appearance of symptoms – is a major driver of transmission for influenza.
In contrast, while we are learning that there are people who can shed COVID-19 virus 24-48 hours prior to symptom onset,
at present, this does not appear to be a major driver of transmission.

The reproductive number – the number of secondary infections generated from one infected individual –
is understood to be between 2 and 2.5 for COVID-19 virus, higher than for influenza.
However, estimates for both COVID-19 and influenza viruses are very context and time-specific, making direct comparisons more difficult.
""")

In [3]:
data=open('corona.txt','r')
corona_data = [text for text in data if text.count(' ') >= 2]
vectorize = Tokenizer()

## Fit data to tokenizer


In [4]:
vectorize.fit_on_texts(corona_data)
corona_data = vectorize.texts_to_sequences(corona_data)

In [5]:
# Find total no of words and total no of sentences.
total_vocab = sum(len(s) for s in corona_data)
word_count = len(vectorize.word_index) + 1
window_size = 2

c. Train model

In [6]:
# Generate the pairs of Context words and target words
def cbow_model(data, window_size, total_vocab):
    total_length = window_size*2
    for text in data:
        text_len = len(text)
        for idx, word in enumerate(text):
            context_word = []
            target   = []
            begin = idx - window_size
            end = idx + window_size + 1
            context_word.append([text[i] for i in range(begin, end) if 0 <= i < text_len and i != idx])
            target.append(word)
            contextual = sequence.pad_sequences(context_word, total_length=total_length)
            final_target = utils.to_categorical(target, total_vocab)
            yield(contextual, final_target)

Create Neural Network model with following parameters :

    Model type : sequential
    
    Layers : Dense , Lambda , embedding. Compile

    Options : (loss='categorical_crossentropy', optimizer='adam')

In [7]:
model = Sequential()
model.add(Embedding(input_dim=total_vocab, output_dim=100, input_length=window_size*2))
model.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(100,)))
model.add(Dense(total_vocab, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
for i in range(10):
    cost = 0
    for x, y in cbow_model(data, window_size, total_vocab):
        cost += model.train_on_batch(contextual, final_target)
    print(i, cost)

0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0


In [8]:
# Create vector file of some word for testing
dimensions=100
vect_file = open('/content/vectors.txt' ,'w')
vect_file.write('{} {}\n'.format(total_vocab,dimensions))

8

In [9]:
# Assign weights to your trained model
weights = model.get_weights()[0]
for text, i in vectorize.word_index.items():
    final_vec = ' '.join(map(str, list(weights[i, :])))
    vect_file.write('{} {}\n'.format(text, final_vec))
vect_file.close()

d. Output

In [10]:
# Use the vectors created in Gemsim
cbow_output = gensim.models.KeyedVectors.load_word2vec_format('/content/vectors.txt', binary = False, limit=100)

In [11]:
# choose the word to get similar type of words
cbow_output.most_similar(positive=['virus'])

[('24', 0.2239864021539688),
 ('major', 0.21599477529525757),
 ('covid', 0.20103251934051514),
 ('period', 0.18392768502235413),
 ('faster', 0.15826314687728882),
 ('symptomatic', 0.15524835884571075),
 ('viruses', 0.15504062175750732),
 ('comparisons', 0.146196648478508),
 ('number', 0.1459341198205948),
 ('3', 0.14284861087799072)]