<a href="https://colab.research.google.com/github/Alihassan7726/Language-Models/blob/main/Word_level_language_model_using_GloVe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('gdrive')

Mounted at gdrive


In [None]:
import string
import numpy as np

In [None]:
# load doc into memory
def load_doc(filename):
	# open the file read-only; the context manager closes it even on error
	with open(filename, 'r') as file:
		# read all text
		text = file.read()
	return text

In [None]:
# load document
in_filename = '/content/gdrive/MyDrive/Colab DataSets/socrates.txt'
doc = load_doc(in_filename)
print(doc[:200])

BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what


### Text Cleaning

In [None]:
# turn a doc into clean tokens
def clean_doc(doc):
	# replace '--' with a space ' '
	doc = doc.replace('--', ' ')
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', string.punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# make lower case
	tokens = [word.lower() for word in tokens]
	return tokens

In [None]:
# clean document
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

['book', 'i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new', 'thing', 'i', 'was', 'delighted', 'with', 'the', 'procession', 'of', 'the', 'inhabitants', 'but', 'that', 'of', 'the', 'thracians', 'was', 'equally', 'if', 'not', 'more', 'beautiful', 'when', 'we', 'had', 'finished', 'our', 'prayers', 'and', 'viewed', 'the', 'spectacle', 'we', 'turned', 'in', 'the', 'direction', 'of', 'the', 'city', 'and', 'at', 'that', 'instant', 'polemarchus', 'the', 'son', 'of', 'cephalus', 'chanced', 'to', 'catch', 'sight', 'of', 'us', 'from', 'a', 'distance', 'as', 'we', 'were', 'starting', 'on', 'our', 'way', 'home', 'and', 'told', 'his', 'servant', 'to', 'run', 'and', 'bid',

In [None]:
## We can organize the long list of tokens into sequences of 50 input words and 1 output word.
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
	# select sequence of tokens
	seq = tokens[i-length:i]
	# convert into a line
	line = ' '.join(seq)
	# store
	sequences.append(line)
print('Total Sequences: %d' % len(sequences))
sequences[:2]

Total Sequences: 118633


['book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was',
 'i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted']
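
The windowing step above can be sketched on a toy token list. The helper name `make_windows`, the toy tokens, and the window size of 3 are illustrative assumptions; the loop bounds mirror the notebook's `range(length, len(tokens))`.

```python
# Toy illustration of the sliding window; a window of 3 (2 inputs + 1 output)
# is chosen only for readability, the notebook uses 50 + 1.
def make_windows(tokens, length):
    """Return every run of `length` consecutive tokens, joined by spaces."""
    return [' '.join(tokens[i - length:i]) for i in range(length, len(tokens))]

toy_tokens = ['i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus']
toy_windows = make_windows(toy_tokens, length=3)

for window in toy_windows:
    # the last word of each window is the prediction target
    *context, target = window.split()
    print(context, '->', target)
```

With these bounds the loop yields `len(tokens) - length` windows, so the window ending on the very last token is not produced; `range(length, len(tokens) + 1)` would include it.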

In [None]:
'\n'.join(sequences[:3])

'book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was\ni i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted\ni went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with'

In [None]:
# save sequences to file, one sequence per line
def save_doc(lines, filename):
	data = '\n'.join(lines)
	with open(filename, 'w') as file:
		file.write(data)

# save sequences to file
out_filename = '/content/gdrive/MyDrive/Colab DataSets/socrates_sequences.txt'
save_doc(sequences, out_filename)

### Building GloVe model to learn embeddings

In [None]:

# load
in_filename = '/content/gdrive/MyDrive/Colab DataSets/socrates_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
lines[:3]

['book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was',
 'i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted',
 'i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with']

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from nltk.tokenize import word_tokenize
from tensorflow.keras.preprocessing.sequence import pad_sequences


In [None]:
## Creating the corpus (a list of token lists) to build GloVe
from tqdm import tqdm
def create_corpus(lines):
    corpus = []
    for line in tqdm(lines):
        # word_tokenize already returns a list of tokens
        corpus.append(word_tokenize(line))
    return corpus

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
corpus = create_corpus(lines)
corpus[:3]

100%|██████████| 118633/118633 [00:24<00:00, 4840.66it/s]


[['book',
  'i',
  'i',
  'went',
  'down',
  'yesterday',
  'to',
  'the',
  'piraeus',
  'with',
  'glaucon',
  'the',
  'son',
  'of',
  'ariston',
  'that',
  'i',
  'might',
  'offer',
  'up',
  'my',
  'prayers',
  'to',
  'the',
  'goddess',
  'bendis',
  'the',
  'thracian',
  'artemis',
  'and',
  'also',
  'because',
  'i',
  'wanted',
  'to',
  'see',
  'in',
  'what',
  'manner',
  'they',
  'would',
  'celebrate',
  'the',
  'festival',
  'which',
  'was',
  'a',
  'new',
  'thing',
  'i',
  'was'],
 ['i',
  'i',
  'went',
  'down',
  'yesterday',
  'to',
  'the',
  'piraeus',
  'with',
  'glaucon',
  'the',
  'son',
  'of',
  'ariston',
  'that',
  'i',
  'might',
  'offer',
  'up',
  'my',
  'prayers',
  'to',
  'the',
  'goddess',
  'bendis',
  'the',
  'thracian',
  'artemis',
  'and',
  'also',
  'because',
  'i',
  'wanted',
  'to',
  'see',
  'in',
  'what',
  'manner',
  'they',
  'would',
  'celebrate',
  'the',
  'festival',
  'which',
  'was',
  'a',
  'new',
  

In [None]:
!pip install glove-python-binary

Collecting glove-python-binary
  Downloading https://files.pythonhosted.org/packages/cc/11/d8510a80110f736822856db566341dd2e1e7c3af536f77e409a6c09e0c22/glove_python_binary-0.2.0-cp37-cp37m-manylinux1_x86_64.whl (948kB)

In [None]:
from glove import Corpus, Glove

In [None]:
glove_corpus = Corpus()
glove_corpus.fit(corpus, window=5)

In [None]:
glove = Glove(no_components=300, learning_rate=0.05)
glove_corpus.matrix

<7408x7408 sparse matrix of type '<class 'numpy.float64'>'
	with 298273 stored elements in COOrdinate format>

In [None]:
glove.fit(glove_corpus.matrix, epochs=100, no_threads=4, verbose=True)

Epoch 90
Epoch 91
Epoch 92
Epoch 93
Epoch 94
Epoch 95
Epoch 96
Epoch 97
Epoch 98
Epoch 99


In [None]:
glove.add_dictionary(glove_corpus.dictionary)

In [None]:
glove.most_similar('glaucon')

[('requested', 0.6448431484418098),
 ('perverse', 0.6101077233205933),
 ('admirer', 0.5511954142977035),
 ('dear', 0.5458896901192136)]

In [None]:
glove.most_similar('celebrate')

[('imperishable', 0.8802873262458321),
 ('overpowered', 0.87995233435774),
 ('scourged', 0.8741833316427847),
 ('maddest', 0.8725698252917115)]
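
To make the similarity scores above easier to interpret, here is a minimal sketch of a nearest-neighbour query, assuming the scores are cosine similarities between word vectors. The function `cosine_neighbors` and the toy 2-dimensional vectors are hypothetical and are not part of the `glove` package.

```python
import numpy as np

def cosine_neighbors(vectors, words, query, k=4):
    """Rank `words` by cosine similarity to `query`, excluding the query itself."""
    q = vectors[words.index(query)]
    # cosine similarity = dot product divided by the product of the norms
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return [(words[i], float(sims[i])) for i in order if words[i] != query][:k]

# Toy vocabulary and vectors (made up for illustration only)
toy_words = ['glaucon', 'socrates', 'city', 'festival']
toy_vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

neighbors = cosine_neighbors(toy_vectors, toy_words, 'glaucon', k=2)
print(neighbors)
```

A score near 1 means the two vectors point in almost the same direction. On a small corpus, rare words have few co-occurrence counts, so their nearest neighbours can look arbitrary.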

In [None]:
print(glove.word_vectors.shape)

(7408, 300)


In [None]:
dir(glove)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_similarity_query',
 'add_dictionary',
 'alpha',
 'biases_sum_gradients',
 'dictionary',
 'fit',
 'inverse_dictionary',
 'learning_rate',
 'load',
 'load_stanford',
 'max_count',
 'max_loss',
 'most_similar',
 'most_similar_paragraph',
 'no_components',
 'random_state',
 'save',
 'transform_paragraph',
 'vectors_sum_gradients',
 'word_biases',
 'word_vectors']

In [None]:
glove.word_vectors

array([[ 0.08130634, -0.02137271, -0.11304898, ...,  0.15917209,
        -0.00745882, -0.41285779],
       [ 0.30462347, -0.35249524, -0.23140721, ...,  0.62549208,
         0.38317353,  0.05759389],
       [ 0.1384654 , -0.03672953, -0.14108673, ...,  0.00177653,
        -0.0516151 ,  0.09366849],
       ...,
       [ 0.15985889, -0.11373138, -0.16251841, ..., -0.08066703,
         0.01054506, -0.19335764],
       [ 0.12837377,  0.02778205, -0.12772698, ..., -0.00648152,
         0.0386688 , -0.13218314],
       [ 0.12355593, -0.11804162, -0.13651158, ...,  0.1011807 ,
         0.00142053, -0.12206934]])

In [None]:
## Use the inverse dictionary (index -> word) to build a word -> vector lookup
word_idx_dic = glove.inverse_dictionary
print(len(word_idx_dic))

7408


In [None]:
embedding_index = {}

for idx , word in word_idx_dic.items():
  vector = glove.word_vectors[idx]
  embedding_index[word] = vector
print(len(embedding_index))

7408


In [None]:
embedding_index.get('celebrate')

array([ 1.01768066e-01, -1.06537480e-01, -1.04187260e-01,  1.39433275e-01,
       -1.12835522e-01,  9.21147328e-02, -9.69899151e-02, -4.43506222e-03,
       -1.08592464e-01, -9.60213695e-02,  1.05557103e-01,  2.25099322e-01,
       -7.53851966e-02,  7.00105532e-03,  1.09569938e-01, -6.89260435e-02,
        1.48587259e-01, -7.10263363e-02,  1.37428969e-01, -1.21929492e-01,
        2.43641872e-02, -1.88357000e-02, -1.95394740e-01, -9.52634369e-02,
        8.37063504e-02,  1.07416652e-01, -1.43695577e-01,  1.19678248e-01,
        1.12664278e-01, -5.47817834e-02, -5.80592204e-02,  1.66667546e-01,
        1.03633242e-03,  3.08846982e-02,  4.86717378e-02,  5.43536095e-02,
       -1.09048738e-01, -1.11992439e-01,  7.66844376e-04, -3.55252456e-02,
       -6.04879009e-03, -1.15408970e-01, -9.40256948e-02, -9.96294587e-02,
       -1.79972905e-01, -8.90505854e-02, -3.69110507e-02, -1.04634338e-01,
        2.90704536e-02,  7.06109918e-02,  1.82023879e-01,  4.27836149e-03,
       -7.92387286e-02, -

## Creating weight matrix for words

In [None]:
# Transforms each text in texts to a sequence of integers.
tokenizer_obj = Tokenizer() 
tokenizer_obj.fit_on_texts(lines) 
sequences=tokenizer_obj.texts_to_sequences(lines) 
print(len(sequences))
sequences[:3]

118633


[[1046,
  11,
  11,
  1045,
  329,
  7409,
  4,
  1,
  2873,
  35,
  213,
  1,
  261,
  3,
  2251,
  9,
  11,
  179,
  817,
  123,
  92,
  2872,
  4,
  1,
  2249,
  7408,
  1,
  7407,
  7406,
  2,
  75,
  120,
  11,
  1266,
  4,
  110,
  6,
  30,
  168,
  16,
  49,
  7405,
  1,
  1609,
  13,
  57,
  8,
  549,
  151,
  11,
  57],
 [11,
  11,
  1045,
  329,
  7409,
  4,
  1,
  2873,
  35,
  213,
  1,
  261,
  3,
  2251,
  9,
  11,
  179,
  817,
  123,
  92,
  2872,
  4,
  1,
  2249,
  7408,
  1,
  7407,
  7406,
  2,
  75,
  120,
  11,
  1266,
  4,
  110,
  6,
  30,
  168,
  16,
  49,
  7405,
  1,
  1609,
  13,
  57,
  8,
  549,
  151,
  11,
  57,
  1147],
 [11,
  1045,
  329,
  7409,
  4,
  1,
  2873,
  35,
  213,
  1,
  261,
  3,
  2251,
  9,
  11,
  179,
  817,
  123,
  92,
  2872,
  4,
  1,
  2249,
  7408,
  1,
  7407,
  7406,
  2,
  75,
  120,
  11,
  1266,
  4,
  110,
  6,
  30,
  168,
  16,
  49,
  7405,
  1,
  1609,
  13,
  57,
  8,
  549,
  151,
  11,
  57,
  1147,
  35]]

In [None]:
word_index=tokenizer_obj.word_index
vocab_size = len(word_index)+1
print('Number of unique words:',vocab_size)

Number of unique words: 7410


In [None]:
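embedding_matrix = np.zeros((vocab_size, 300))
for word, i in tokenizer_obj.word_index.items():
    # look up the GloVe vector learned above; words without one keep a zero row
    embedding_vector = embedding_index.get(word)
    # test against None: the truth value of a NumPy array is ambiguous
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector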
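
A lookup that fails silently leaves rows of the weight matrix at zero, so it can be worth checking how many rows were actually filled before training. The sketch below uses toy stand-ins for `word_index` and `embedding_index`; the helper `embedding_coverage` and all toy values are assumptions for illustration.

```python
import numpy as np

def embedding_coverage(matrix):
    """Fraction of rows, excluding padding row 0, that hold a non-zero vector."""
    rows = matrix[1:]
    return float(np.count_nonzero(np.any(rows != 0, axis=1)) / len(rows))

# Toy stand-ins for the notebook's word_index and embedding_index (hypothetical values)
toy_word_index = {'the': 1, 'of': 2, 'piraeus': 3}
toy_embedding_index = {'the': np.array([0.1, 0.2]), 'of': np.array([0.3, -0.4])}

toy_matrix = np.zeros((len(toy_word_index) + 1, 2))
for word, i in toy_word_index.items():
    vector = toy_embedding_index.get(word)
    if vector is not None:  # `if vector:` would raise ValueError on a NumPy array
        toy_matrix[i] = vector

coverage = embedding_coverage(toy_matrix)
print('coverage: %.2f' % coverage)
```

Running `embedding_coverage(embedding_matrix)` on the real matrix should give a value close to 1 if nearly every tokenizer word has a GloVe vector; a value of 0 would mean the Embedding layer starts from all-zero weights.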

In [None]:
#padding=pad_sequences(sequences,maxlen=51,truncating='post',padding='pre')

#print(len(sequences))
#print(padding.shape)
#padding

In [None]:
idx = glove.dictionary['i']
glove.word_vectors[idx]

array([-1.76497423e-01,  7.76970039e-03,  2.64180867e-01,  3.62026901e-01,
        8.35547312e-02,  3.37949357e-02,  9.88658683e-02,  1.16313651e-01,
        1.80103364e-01, -2.87718805e-02,  7.97602320e-01, -1.40438220e-01,
        1.63725932e-02,  2.50410568e-01,  1.52533625e-01, -1.04266798e+00,
       -1.07756724e-01,  8.57791113e-02, -1.00160423e-01,  2.38509429e-04,
        5.45795347e-01,  9.45021226e-02,  6.54612394e-02, -1.22585088e-02,
       -6.01224104e-02,  3.13713700e-02, -2.48669530e-01, -2.32619519e-01,
        3.74613181e-02, -2.58312116e-01, -2.02133301e-02, -9.82397446e-02,
       -1.21153163e-01,  3.89657812e-01,  5.18970503e-02,  9.00497488e-02,
       -2.80083208e-02, -4.60343219e-02, -8.85630357e-02,  4.65506323e-02,
       -7.29143433e-02,  4.62133745e-02,  2.37168424e-01,  1.69533508e-01,
        2.82708793e-01, -4.61569560e-02,  5.66565920e-01,  1.69704842e-01,
       -7.47799647e-01, -4.06274476e-01, -7.59670907e-01, -1.46794198e-01,
        1.24081321e-01, -

In [None]:
embedding_index.get('i')

array([-1.76497423e-01,  7.76970039e-03,  2.64180867e-01,  3.62026901e-01,
        8.35547312e-02,  3.37949357e-02,  9.88658683e-02,  1.16313651e-01,
        1.80103364e-01, -2.87718805e-02,  7.97602320e-01, -1.40438220e-01,
        1.63725932e-02,  2.50410568e-01,  1.52533625e-01, -1.04266798e+00,
       -1.07756724e-01,  8.57791113e-02, -1.00160423e-01,  2.38509429e-04,
        5.45795347e-01,  9.45021226e-02,  6.54612394e-02, -1.22585088e-02,
       -6.01224104e-02,  3.13713700e-02, -2.48669530e-01, -2.32619519e-01,
        3.74613181e-02, -2.58312116e-01, -2.02133301e-02, -9.82397446e-02,
       -1.21153163e-01,  3.89657812e-01,  5.18970503e-02,  9.00497488e-02,
       -2.80083208e-02, -4.60343219e-02, -8.85630357e-02,  4.65506323e-02,
       -7.29143433e-02,  4.62133745e-02,  2.37168424e-01,  1.69533508e-01,
        2.82708793e-01, -4.61569560e-02,  5.66565920e-01,  1.69704842e-01,
       -7.47799647e-01, -4.06274476e-01, -7.59670907e-01, -1.46794198e-01,
        1.24081321e-01, -

In [None]:
# separate into input and output
import numpy as np
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]

print(vocab_size)
del(sequences)

7410
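
For reference, the one-hot step can be sketched in plain NumPy. The helper `one_hot` is a hypothetical stand-in for `to_categorical`, and the size estimate assumes the targets are stored as `float32`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """NumPy sketch of one-hot encoding: row r has a 1.0 in column labels[r]."""
    out = np.zeros((len(labels), num_classes), dtype='float32')
    out[np.arange(len(labels)), labels] = 1.0
    return out

toy_y = np.array([2, 0, 3])
toy_encoded = one_hot(toy_y, num_classes=4)
print(toy_encoded)

# Rough size of the dense target array at the notebook's scale, assuming float32
n_sequences, vocab = 118633, 7410
approx_gib = n_sequences * vocab * 4 / 2**30
print('approx. %.2f GiB' % approx_gib)
```

If memory becomes a constraint, one option is to keep `y` as integer class indices and compile with `loss='sparse_categorical_crossentropy'`, which avoids materialising the dense array.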


In [None]:
X[:3]

array([[1046,   11,   11, 1045,  329, 7409,    4,    1, 2873,   35,  213,
           1,  261,    3, 2251,    9,   11,  179,  817,  123,   92, 2872,
           4,    1, 2249, 7408,    1, 7407, 7406,    2,   75,  120,   11,
        1266,    4,  110,    6,   30,  168,   16,   49, 7405,    1, 1609,
          13,   57,    8,  549,  151,   11],
       [  11,   11, 1045,  329, 7409,    4,    1, 2873,   35,  213,    1,
         261,    3, 2251,    9,   11,  179,  817,  123,   92, 2872,    4,
           1, 2249, 7408,    1, 7407, 7406,    2,   75,  120,   11, 1266,
           4,  110,    6,   30,  168,   16,   49, 7405,    1, 1609,   13,
          57,    8,  549,  151,   11,   57],
       [  11, 1045,  329, 7409,    4,    1, 2873,   35,  213,    1,  261,
           3, 2251,    9,   11,  179,  817,  123,   92, 2872,    4,    1,
        2249, 7408,    1, 7407, 7406,    2,   75,  120,   11, 1266,    4,
         110,    6,   30,  168,   16,   49, 7405,    1, 1609,   13,   57,
           8,  549,  1

In [None]:
# create train and validation sets
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=42)

print('Train shape:', X_tr.shape, 'Val shape:', X_val.shape)
del(X)
del(y)

Train shape: (106769, 50) Val shape: (11864, 50)


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, GlobalMaxPooling1D, Dense
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.initializers import Constant

In [None]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 300, input_length=50 , weights=[embedding_matrix] , trainable = True))
#model.add(LSTM(100, return_sequences=True))
model.add(Bidirectional(LSTM(300,return_sequences=True)))
#model.add(LSTM(100))

model.add(GlobalMaxPooling1D())
model.add(Dense(300, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

#Add loss function, metrics, optimizer
model.compile(optimizer='adam', loss='categorical_crossentropy',metrics=['accuracy']) 

#Adding callbacks
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1,patience=5)  
mc=ModelCheckpoint('/content/gdrive/MyDrive/Colab Notebooks/best_model.h5', monitor='val_accuracy', mode='max', save_best_only=True,verbose=1)


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 50, 300)           2223000   
_________________________________________________________________
bidirectional (Bidirectional (None, 50, 600)           1442400   
_________________________________________________________________
global_max_pooling1d (Global (None, 600)               0         
_________________________________________________________________
dense (Dense)                (None, 300)               180300    
_________________________________________________________________
dense_1 (Dense)              (None, 7410)              2230410   
Total params: 6,076,110
Trainable params: 6,076,110
Non-trainable params: 0
_________________________________________________________________
None


In [None]:
from time import time
t = time()
history = model.fit(np.array(X_tr),np.array(y_tr),batch_size=100,epochs=100,
                    validation_data=(np.array(X_val),np.array(y_val)),verbose=1,callbacks=[es,mc],
                    )
print("Total time taken to run : {} mins".format(np.round((time()-t)/60,decimals = 2)))


Epoch 1/100

Epoch 00001: val_accuracy improved from -inf to 0.05934, saving model to /content/gdrive/MyDrive/Colab Notebooks/best_model.h5
Epoch 2/100

Epoch 00002: val_accuracy did not improve from 0.05934
Epoch 3/100

Epoch 00003: val_accuracy did not improve from 0.05934
Epoch 4/100

Epoch 00004: val_accuracy did not improve from 0.05934
Epoch 5/100

Epoch 00005: val_accuracy did not improve from 0.05934
Epoch 6/100

Epoch 00006: val_accuracy did not improve from 0.05934
Epoch 7/100

Epoch 00007: val_accuracy did not improve from 0.05934
Epoch 8/100

Epoch 00008: val_accuracy did not improve from 0.05934
Epoch 9/100

Epoch 00009: val_accuracy did not improve from 0.05934
Epoch 10/100

Epoch 00010: val_accuracy did not improve from 0.05934
Epoch 11/100

Epoch 00011: val_accuracy did not improve from 0.05934
Epoch 12/100

Epoch 00012: val_accuracy did not improve from 0.05934
Epoch 13/100

Epoch 00013: val_accuracy did not improve from 0.05934
Epoch 14/100

Epoch 00014: val_accuracy 

## So GloVe doesn't work well here. We will try again in a separate file with another embedding technique.