## Embedding Synthesis Demo for GloVe

To make sure you have everything you need, this notebook starts from scratch. \
The one step not covered is gathering the corpus (wikitext or any other); we assume it has already been collected and is stored on disk. \
We recommend running this notebook in a virtual environment. \
Once that is set up, let's install the dependencies!

In [None]:
!pip3 install torch tqdm wandb nltk selenium beautifulsoup4 numpy requests

_**Note**_: Section 5 uses the Selenium web crawler with Firefox, so make sure that browser is installed. Otherwise you will have to get the driver for your own browser (e.g. [ChromeDriver](https://chromedriver.chromium.org/home) for Chrome users) and then modify the [function](#driver) accordingly.

There are six sections to this notebook:
* [Section 1: Preparing Corpus](#s1)
* [Section 2: Preparing Training Data](#s2)
* [Section 3: Training with Pytorch and Wandb](#s3)
* [Section 4: Testing](#s4)
* [Section 5: Inference](#s5)
* [Section 6: Incorporate Synthetic Vector Into GloVe](#s6)

## <a class="anchor" id="s1">Section 1: Preparing Corpus</a>

In [2]:
from tqdm import tqdm
import requests



url = 'https://nlp.stanford.edu/data/glove.6B.zip'

dest_file = './glove_6B.zip'
# Streaming, so we can iterate over the response.
response = requests.get(url, stream=True)
total_size_in_bytes= int(response.headers.get('content-length', 0))
block_size = 1024 #1 Kibibyte
progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True)
with open(dest_file, 'wb') as file:
    for data in response.iter_content(block_size):
        progress_bar.update(len(data))
        file.write(data)
progress_bar.close()
if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
    print("ERROR, something went wrong")

100%|██████████████████████████████████████████████████████████████████████████████| 862M/862M [03:39<00:00, 3.93MiB/s]


In [6]:
#extract to folder
import zipfile
with zipfile.ZipFile(dest_file, 'r') as zip_ref:
    zip_ref.extractall(dest_file.replace('.zip','') )  #use the filename as destination dir


In [9]:
#generate the glove files
import argparse
import numpy as np
import sys
import json

def generate(file):
    words = []
    vectors = {}
    with open(file, 'r',encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines:
            _temp = line.rstrip().split(' ')
            words.append(_temp[0])
            vectors[_temp[0]] = [float(x) for x in _temp[1:]]

    vocab_size = len(words)
    vocab = {w: idx for idx, w in enumerate(words)}
    ivocab = {idx: w for idx, w in enumerate(words)}

    vector_dim = len(vectors[ivocab[0]])
    W = np.zeros((vocab_size, vector_dim))
    for word, v in vectors.items():
        if word == '<unk>':
            continue
        W[vocab[word], :] = v

    # normalize each word vector to unit L2 norm
    # note: an all-zero row (e.g. a skipped '<unk>') would divide by zero here
    d = (np.sum(W ** 2, 1) ** (0.5))
    W_norm = (W.T / d).T
    return (W_norm, vocab, ivocab)

glove_file = f"{dest_file.replace('.zip','')}/glove.6B.100d.txt"
(W_norm, vocab, ivocab) = generate(glove_file)

#save the files as npy for easier loading 
np.save('./glove6B100d.npy',W_norm)
with open('./glove_vocab.json','w') as f:
    json.dump(vocab,f)
with open('./glove_ivocab.json','w') as f:
    json.dump(ivocab,f)
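
The normalization step in `generate` divides each row by its Euclidean (L2) norm, so every stored vector has length 1 and the dot product of two rows equals their cosine similarity. Below is a minimal pure-Python sketch of the same parsing and normalization on two made-up lines; the sample words and values are illustrative and not taken from GloVe, and `parse_and_normalize` is a hypothetical name.

```python
import math

# Two made-up lines in the GloVe text format: a word followed by its values.
sample_lines = ["cat 3.0 4.0", "dog 0.0 2.0"]

def parse_and_normalize(lines):
    """Parse 'word v1 v2 ...' lines and scale each vector to unit L2 norm."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        values = [float(x) for x in parts[1:]]
        norm = math.sqrt(sum(v * v for v in values))
        vectors[parts[0]] = [v / norm for v in values]
    return vectors

unit_vectors = parse_and_normalize(sample_lines)
print(unit_vectors["cat"])  # [0.6, 0.8]
```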

## <a class="anchor" id="s2">Section 2: Preparing Training Data</a>

In [12]:
#this section covers the preparation pipeline. It is assumed that the wikitext is already downloaded and extracted
# the wikipedia dump is here https://dumps.wikimedia.org/enwiki/20210920/enwiki-20210920-pages-articles-multistream.xml.bz2 
# the source code to extract it is here https://github.com/attardi/wikiextractor 


#eventually, this part of the code just need the path to all the text files, 
#so it's up to you to implement it how you like
import os, re
wiki_dataset_dir = "D:/DATASETS_UNZIP/Datasets/text"
#Extracting from wiki 
filepath_list = []
wiki=True
if wiki == True:
    for folder in os.listdir(wiki_dataset_dir):
        for file in os.listdir(wiki_dataset_dir+'/'+folder):
            filepath_list.append(wiki_dataset_dir+'/'+folder+'/'+file)
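
Since this step only needs a flat list of file paths, one possible alternative to the nested `os.listdir` loops is a recursive `pathlib` walk. The sketch below assumes the extracted corpus sits in subfolders such as `AA/wiki_00` (an assumption about the extractor's output layout); `list_text_files` is a hypothetical helper, demonstrated on a throwaway directory.

```python
import tempfile
from pathlib import Path

def list_text_files(root_dir):
    """Return every file under root_dir (any depth) as a sorted list of path strings."""
    return sorted(str(p) for p in Path(root_dir).rglob('*') if p.is_file())

# Tiny demonstration on a temporary directory that mimics an AA/wiki_00 layout.
with tempfile.TemporaryDirectory() as tmp:
    for folder, name in [("AA", "wiki_00"), ("AA", "wiki_01"), ("AB", "wiki_00")]:
        (Path(tmp) / folder).mkdir(exist_ok=True)
        (Path(tmp) / folder / name).write_text("sample text", encoding="utf-8")
    demo_paths = list_text_files(tmp)
    demo_names = [Path(p).relative_to(tmp).as_posix() for p in demo_paths]

print(demo_names)  # ['AA/wiki_00', 'AA/wiki_01', 'AB/wiki_00']
```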

In [8]:
from nltk.corpus import stopwords

In [None]:
#run this if you don't have the nltk stopwords
import nltk
nltk.download('stopwords')

In [9]:
stopword_list = stopwords.words('english')

In [10]:
#extract the words surrounding the target word in a sentence

def extract(sentence,target_word,context_length,pad,debug=True):
    # debug is unused; it is kept so existing calls keep working
    target_word = target_word.lower()
    sentence = sentence.lower()
    if target_word not in sentence: #cheap substring check to skip most lines early
        return
    
    s = sentence.strip()
    s = re.sub(r'[\n\r ]+',' ',s)    #collapse whitespace
    s = re.sub('[^a-z ]+','',s)      #keep only lowercase letters and spaces
    
    t = target_word
    raw_tokens = s.split(' ')
    tokens = [i for i in raw_tokens if i != '' and i not in stopword_list]
    word_list = []
    if t in tokens:
        __index = tokens.index(t)        # only the first occurrence is used; later occurrences are ignored
        if __index < context_length:      #not enough words on the left: pad front
            word_list += [pad for _ in range(context_length-__index)]
            word_list += tokens[:__index]
        else:
            word_list += tokens[__index-context_length:__index]
        if __index + context_length >= len(tokens):   #not enough words on the right: pad back
            word_list += tokens[__index+1:]
            word_list += [pad for _ in range(context_length + __index + 1 - len(tokens))]
        else:
            word_list += tokens[__index+1:__index+context_length+1]
        return word_list
    else:   #target not found as a whole token
        return None
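
The windowing logic in `extract` always returns `2 * context_length` tokens: the words to the left and right of the target, padded when the sentence is too short. Here is a compact sketch of just that step, on an already tokenized sentence; `context_window` is a hypothetical helper written for illustration, not part of the notebook.

```python
def context_window(tokens, index, context_length, pad="<pad>"):
    """Return context_length tokens on each side of tokens[index], padded to a fixed size."""
    left = tokens[max(0, index - context_length):index]
    right = tokens[index + 1:index + 1 + context_length]
    left = [pad] * (context_length - len(left)) + left      # pad at the front
    right = right + [pad] * (context_length - len(right))   # pad at the back
    return left + right

tokens = ["felt", "tired", "long", "walk", "home"]
window = context_window(tokens, tokens.index("tired"), context_length=3)
print(window)  # ['<pad>', '<pad>', 'felt', 'long', 'walk', 'home']
```

Note that the target word itself is left out of the window, which is why the model later has to predict its vector from the context alone.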


In [17]:
!mkdir processed_data

In [None]:
#crawling multiple words at once, writing the file into target directory  

target_words = ['tired','pointless']
                
context_length = 10

for target_word in target_words:
    line_written = 0
    corpus = f"./processed_data/{target_word}_corpus_c{context_length}.txt"
    with open(corpus,'a') as o10:
        for file in tqdm(filepath_list):
            with open(file,'r',encoding='utf-8') as f:
                lines = f.readlines()
                for line in lines:
                    result = extract(line,target_word,context_length,"<pad>",True)
                    if result == None:
                        continue
                    else:
                        o10.write(f'{target_word}:')
                        o10.write(''.join([i+' ' for i in result] ))
                        o10.write('\n')
                        line_written += 1
                        if line_written % 10000 == 0:
                            print(f"Written {line_written} into file {target_word}")
        print(f"\n\n DONE WITH CORPUS {corpus}\n\n")
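
Each line written by the loop above has the form `target:tok tok ... ` (with a trailing space), and Section 4 reads it back with `split(':')`. The round trip can be sketched as follows; `format_corpus_line` and `parse_corpus_line` are hypothetical names used only for this illustration.

```python
def format_corpus_line(target_word, context_tokens):
    """Build one line the way the loop above writes it: 'target:tok tok ... ' (trailing space)."""
    return f"{target_word}:" + ''.join(token + ' ' for token in context_tokens)

def parse_corpus_line(line):
    """Split a corpus line back into (target_word, context_tokens)."""
    target_word, context = line.split(':', 1)
    return target_word, context.split()

line = format_corpus_line("tired", ["<pad>", "felt", "long", "walk"])
print(repr(line))               # 'tired:<pad> felt long walk '
print(parse_corpus_line(line))  # ('tired', ['<pad>', 'felt', 'long', 'walk'])
```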

## <a class="anchor" id="s3">Section 3: Training with Pytorch and Wandb</a>

This section is monitored with wandb. Please change the config accordingly if you want to rerun the experiment.

In [23]:
import json
import numpy as np
import wandb
import random
from datetime import date

wandb.init(project='Synthetic Net')
config = wandb.config



In [24]:
import torch
import torch.nn as nn

class DenseNet(nn.Module):
    def __init__(self,context_length,embed_size=100):
        super().__init__()
        self.n = context_length*2
        self.embed_size = embed_size
        self.act = nn.ReLU()
        self.out = nn.Tanh() 
        self.hidden1 = nn.Linear(self.n*self.embed_size,2048)
        self.hidden2 = nn.Linear(2048,512)
        self.hidden3 = nn.Linear(512,self.embed_size)
 
    def forward(self,x):
        x = x.view(x.size(0), -1)
        x = self.act(self.hidden1(x))
        x = self.act(self.hidden2(x))
        x = self.out(self.hidden3(x))
        return x
    
config.context_length = 10
model = DenseNet(context_length = config.context_length)
print(model)

DenseNet(
  (act): ReLU()
  (out): Tanh()
  (hidden1): Linear(in_features=2000, out_features=2048, bias=True)
  (hidden2): Linear(in_features=2048, out_features=512, bias=True)
  (hidden3): Linear(in_features=512, out_features=100, bias=True)
)
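
The layer sizes printed above follow directly from the configuration: the input is the concatenation of `2 * context_length` context vectors of `embed_size` values each. A short arithmetic check of the input width and the number of trainable parameters (weights plus biases of the three `Linear` layers):

```python
context_length = 10
embed_size = 100

# (inputs, outputs) of the three Linear layers printed above
layers = [(2 * context_length * embed_size, 2048), (2048, 512), (512, embed_size)]

# each Linear layer has inputs*outputs weights plus one bias per output
params_per_layer = [n_in * n_out + n_out for n_in, n_out in layers]
total_params = sum(params_per_layer)

print(layers[0][0])       # 2000
print(params_per_layer)   # [4098048, 1049088, 51300]
print(total_params)       # 5198436
```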


In [25]:
#util file contains the loading functions
from util import *

#Loading the data
W_norm,vocab,ivocab = load_glove(        
        weight_file = './glove6B100d.npy',
        vocab_file = './glove_vocab.json',
        ivocab_file='./glove_ivocab.json'
)
    
config.batch_size = 64

#configure the files used for training. Can load multiple files
#usually load with negative samples to avoid overfitting
files_for_training =['tired'] 

training_files = [f'./processed_data/{x}_corpus_c10.txt' for x in files_for_training]

training_data = load_training_batch(training_files,config.batch_size)

#for logging purpose, provide data lineage
config.data = "wiki_only"

train_tensor = get_embedding(training_data,W_norm,vocab)
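
`util.py` is not shown in this notebook. Judging only from how `training_data` is indexed in Section 4 (a list of batches, each a list of `target:context` strings), a batching helper could look roughly like the sketch below. The name `batch_lines` and its behaviour, including keeping a shorter final batch, are assumptions for illustration and not the actual `load_training_batch`.

```python
def batch_lines(lines, batch_size):
    """Group lines into consecutive batches (hypothetical helper; the last batch may be shorter)."""
    return [lines[i:i + batch_size] for i in range(0, len(lines), batch_size)]

demo_lines = [f"tired:context {i} " for i in range(5)]
batches = batch_lines(demo_lines, batch_size=2)
print([len(b) for b in batches])  # [2, 2, 1]
```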

In [177]:
### checking vocab
print(len(vocab))
print(len(ivocab))

400000
400000


In [26]:
import torch.optim as optim


config.lr = 0.0005
config.momentum = 0.005
optimizer = optim.SGD(model.parameters(),lr=config.lr,momentum=config.momentum,weight_decay=0.01)
criterion = nn.L1Loss()

def cosim(v1,v2):
    return np.dot(v1,v2)/(np.linalg.norm(v1)*np.linalg.norm(v2))

#scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, debug_set.shape[0], eta_min=config.lr)
#learning rate adjustment -- try 0.001

def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        features,labels = batch
        batch_size = features.shape[0]
        predictions = model(torch.Tensor(features)).squeeze(1)
        loss = criterion(predictions,torch.Tensor(labels))      
        loss.backward()
        
        optimizer.step()
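        epoch_loss += loss.item()    #summed (not averaged) over the batches
        
        #mean cosine similarity of this batch; only the last batch's value is returned
        cosim_score = np.mean([cosim(labels[i],predictions[i].detach().numpy()) for i in range(batch_size) ])
        
    return epoch_loss,cosim_score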
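
The `cosim` helper above measures only the angle between two vectors and ignores their lengths. A pure-Python sketch of the same formula on three easy cases (`cosine_similarity` is a stand-in name for this illustration):

```python
import math

def cosine_similarity(v1, v2):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction, different length)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```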

In [36]:
config.epochs = 40   #usually 40 is the best

best_valid_loss = float('inf')

for epoch in tqdm(range(config.epochs)):   
    train_loss,cosim_score= train(model,iter(train_tensor), optimizer, criterion)

    #epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    #if valid_loss < best_valid_loss:
     #   best_valid_loss = valid_loss
      #  torch.save(model.state_dict(), 'tut1-model.pt')
    
    #print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    wandb.log({"loss":train_loss,"cosim_score":cosim_score})
    print(f'Epoch:{epoch+1:02}\t|\tTrain Loss: {train_loss:.3f}\t|\tCosim score: {cosim_score:.3f}')

    

Epoch:01	|	Train Loss: 18.281	|	Cosim score: 0.139
Epoch:02	|	Train Loss: 17.854	|	Cosim score: 0.194
Epoch:03	|	Train Loss: 17.436	|	Cosim score: 0.249
Epoch:04	|	Train Loss: 17.026	|	Cosim score: 0.301
Epoch:05	|	Train Loss: 16.624	|	Cosim score: 0.352
Epoch:06	|	Train Loss: 16.229	|	Cosim score: 0.400
Epoch:07	|	Train Loss: 15.840	|	Cosim score: 0.444
Epoch:08	|	Train Loss: 15.458	|	Cosim score: 0.486
Epoch:09	|	Train Loss: 15.083	|	Cosim score: 0.524
Epoch:10	|	Train Loss: 14.713	|	Cosim score: 0.560
Epoch:11	|	Train Loss: 14.349	|	Cosim score: 0.592
Epoch:12	|	Train Loss: 13.992	|	Cosim score: 0.621
Epoch:13	|	Train Loss: 13.640	|	Cosim score: 0.648
Epoch:14	|	Train Loss: 13.294	|	Cosim score: 0.672
Epoch:15	|	Train Loss: 12.955	|	Cosim score: 0.694
Epoch:16	|	Train Loss: 12.622	|	Cosim score: 0.714
Epoch:17	|	Train Loss: 12.295	|	Cosim score: 0.732
Epoch:18	|	Train Loss: 11.975	|	Cosim score: 0.749
Epoch:19	|	Train Loss: 11.661	|	Cosim score: 0.764
Epoch:20	|	Train Loss: 11.354	|	Cosim score: 0.778
Epoch:21	|	Train Loss: 11.053	|	Cosim score: 0.790
Epoch:22	|	Train Loss: 10.760	|	Cosim score: 0.802
Epoch:23	|	Train Loss: 10.473	|	Cosim score: 0.813
Epoch:24	|	Train Loss: 10.192	|	Cosim score: 0.823
Epoch:25	|	Train Loss: 9.918	|	Cosim score: 0.832
Epoch:26	|	Train Loss: 9.651	|	Cosim score: 0.841
Epoch:27	|	Train Loss: 9.391	|	Cosim score: 0.849
Epoch:28	|	Train Loss: 9.138	|	Cosim score: 0.856
Epoch:29	|	Train Loss: 8.893	|	Cosim score: 0.863
Epoch:30	|	Train Loss: 8.655	|	Cosim score: 0.870
Epoch:31	|	Train Loss: 8.424	|	Cosim score: 0.876
Epoch:32	|	Train Loss: 8.202	|	Cosim score: 0.882
Epoch:33	|	Train Loss: 7.987	|	Cosim score: 0.887
Epoch:34	|	Train Loss: 7.779	|	Cosim score: 0.892
Epoch:35	|	Train Loss: 7.579	|	Cosim score: 0.897
Epoch:36	|	Train Loss: 7.386	|	Cosim score: 0.902
Epoch:37	|	Train Loss: 7.199	|	Cosim score: 0.906
Epoch:38	|	Train Loss: 7.019	|	Cosim score: 0.910
Epoch:39	|	Train Loss: 6.844	|	Cosim score: 0.914
Epoch:40	|	Train Loss: 6.676	|	Cosim score: 0.918

100%|██████████████████████████████████████████████████████████████████████████████████| 40/40 [05:29<00:00,  8.24s/it]

In [28]:
! mkdir output
torch.save(model.state_dict(),f'output/{date.today().strftime("%Y-%m")}_{config.data}_{wandb.run.name}.pt')
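
The checkpoint name combines the current year and month, the `config.data` tag and the wandb run name. A small sketch of that naming scheme, useful for predicting which file Section 4 has to load; `checkpoint_path` is a hypothetical helper, and the day of the month is an example value.

```python
from datetime import date

def checkpoint_path(run_date, data_tag, run_name):
    """Mirror the naming scheme used above: output/<YYYY-MM>_<data tag>_<wandb run name>.pt"""
    return f'output/{run_date.strftime("%Y-%m")}_{data_tag}_{run_name}.pt'

# The date and run name here are example values.
example_path = checkpoint_path(date(2021, 9, 30), "wiki_only", "cool-hill-39")
print(example_path)  # output/2021-09_wiki_only_cool-hill-39.pt
```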

## <a class="anchor" id="s4">Section 4: Testing</a>

In [33]:
## section 4: Testing using analogy tests
model_to_test =  DenseNet(context_length = 10)
model_to_test.load_state_dict(torch.load('./output/2021-09_wiki_only_cool-hill-39.pt'))

##Testing has 3 unit tests
#Test 1 sentence
#Test 1 batch
#custom sentence


#Test 1 -- random sentence in training
random_sent = random.choice(training_data)[random.randint(0,config.batch_size-1)]
y,x = random_sent.split(':')
x = re.sub(r'[\n\r ]+',' ',x).strip()
sample_tensor = torch.Tensor([[get_glove_vec(word,W_norm,vocab) for word in x.split(' ')]])
sample_output = model_to_test(sample_tensor)
target_label = np.array(get_glove_vec(y,W_norm,vocab))

output1 = sample_output.squeeze(1)
vec_output1 = output1.detach().numpy()
print(vec_output1.shape)

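def __distance(W, vocab, ivocab, vec_output):
    # rows of W are unit-norm, so each dot product is the cosine similarity
    # scaled by the norm of vec_output (a similarity: higher means closer)
    dist = np.dot(W, vec_output.T).squeeze(1)
    print(dist.shape)
    a = np.argsort(-dist)[:10]    # indices of the 10 highest scores

    print("\n                               Word       Unnormalized Cosine distance\n")
    print("---------------------------------------------------------\n")
    for i,x in enumerate(a):
        print("%d%35s\t\t%f" % (i,ivocab[str(x)], dist[x]))   # ivocab keys are strings after the JSON round trip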
print(f"Test 1 -- sample sentence: \n\n{random_sent}\n\n")

__distance(W_norm,vocab,ivocab,vec_output1)


print(f"\n\n\t\tCosim score: {cosim(vec_output1,target_label)}")


(1, 100)
Test 1 -- sample sentence: 

tired:short run run sweet road runner real scored music ten feathered clippety clobbered used set generic musical cues follow action 



(400000,)

                               Word       Unnormalized Cosine distance

---------------------------------------------------------

0                    multilateralism		0.190550
1                          resurging		0.161833
2                        replicators		0.156357
3                       undiminished		0.147284
4                            sinning		0.145458
5                            finning		0.143951
6                             glatch		0.143233
7                      neoliberalism		0.142868
8                      americanizing		0.142485
9                          labarbera		0.142431


		Cosim score: [0.09054023]
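
The scores in the table are dot products between unit-length GloVe rows and the raw model output, so they scale with the length of the output vector, while the ranking of the neighbours does not. A minimal sketch with made-up 2-d vectors (`top_k_by_dot` is a hypothetical helper, not the notebook's `__distance`):

```python
def top_k_by_dot(unit_rows, query, k):
    """Return the k (word, score) pairs with the highest dot product against query."""
    scores = {
        word: sum(a * b for a, b in zip(row, query))
        for word, row in unit_rows.items()
    }
    return sorted(scores.items(), key=lambda item: -item[1])[:k]

# Made-up unit-length vectors standing in for rows of W_norm.
unit_rows = {"north": [0.0, 1.0], "east": [1.0, 0.0], "northeast": [0.6, 0.8]}

short_query = [0.0, 0.5]   # points north, length 0.5
long_query = [0.0, 5.0]    # same direction, length 5

print([w for w, _ in top_k_by_dot(unit_rows, short_query, 2)])  # ['north', 'northeast']
print([w for w, _ in top_k_by_dot(unit_rows, long_query, 2)])   # ['north', 'northeast']
```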


In [85]:
#test 2: test by batch

random_batch = random.choice(training_data)
sample_batch_tensor = []
target_batch_tensor = []
for sentence in random_batch:
    y,x = sentence.split(':')
    x = re.sub(r'[\n\r ]+',' ',x).strip()
    sample_tensor = [get_glove_vec(word,W_norm,vocab) for word in x.split(' ')]
    target_batch_tensor.append(get_glove_vec(y,W_norm,vocab))
    sample_batch_tensor.append(sample_tensor)
    
sample_batch_tensor = torch.Tensor(np.array(sample_batch_tensor))
target_batch_tensor = np.array(target_batch_tensor)

sample_output = model(sample_batch_tensor)

output2 = torch.mean(sample_output,0)   #mean over the batch dimension
vec_output2 = output2.detach().numpy().reshape((1,100))

print(f"Test 2 -- sample batch: \n\n")

__distance(W_norm,vocab,ivocab,vec_output2)


print(f"\n\n\t\tCosim score: {cosim(vec_output2,target_label)}")

Test 2 -- sample batch: 


(400000,)

                               Word       Unnormalized Cosine distance

---------------------------------------------------------

0                              tired		0.596813
1                             scared		0.462481
2                              weary		0.460485
3                                 'm		0.446562
4                         frustrated		0.442676
5                           fatigued		0.438139
6                          exhausted		0.436740
7                              bored		0.434672
8                             afraid		0.433339
9                               feel		0.431909


		Cosim score: [0.14825795]
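
Test 2 averages the per-sentence predictions of a whole batch into a single vector before looking up its neighbours. The element-wise mean it relies on can be sketched in plain Python as follows; `mean_vector` is a hypothetical stand-in for `torch.mean(x, 0)` and the values are made up.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (the role torch.mean(x, 0) plays above)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

# Made-up 2-d "predictions" for three sentences.
predictions = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
averaged = mean_vector(predictions)
print(averaged)  # [0.5, 0.5]
```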


In [35]:
###Test 3: Custom


random_sent = 'pacific disaster response fund support armenian government fight spread covid year bank committed million loan electric networks armenia ensure electricity '
target_word = 'pneumonia'
target_label = np.array(get_glove_vec(target_word,W_norm,vocab))
random_sent = re.sub(r'[\n\r ]+',' ',random_sent).strip()

sample_tensor = torch.Tensor([[get_glove_vec(word,W_norm,vocab) for word in random_sent.split(' ')]])
sample_output = model(sample_tensor)
output3 = sample_output.squeeze(1)
vec_output3 = output3.detach().numpy()
print(f"Test 3: Custom Test\n\n{random_sent}\n\n")
__distance(W_norm,vocab,ivocab,vec_output3)
print(f"\n\n\t\tCosim score: {cosim(vec_output3,target_label)}")


Test 3: Custom Test

pacific disaster response fund support armenian government fight spread covid year bank committed million loan electric networks armenia ensure electricity


(400000,)

                               Word       Unnormalized Cosine distance

---------------------------------------------------------

0                    multilateralism		0.190550
1                          resurging		0.161833
2                        replicators		0.156357
3                       undiminished		0.147284
4                            sinning		0.145458
5                            finning		0.143951
6                             glatch		0.143233
7                      neoliberalism		0.142868
8                      americanizing		0.142485
9                          labarbera		0.142431


		Cosim score: [0.12973318]


## <a class="anchor" id="s5">Section 5: Inference</a>

**Note**: the browser driver is created in <a class="anchor" id="driver">get_google_search_page</a>, which currently uses Firefox.
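
One way to make the browser swappable without editing the function body is to pass the driver constructor in as an argument. The sketch below is an assumption about how that could look, not the notebook's implementation: `get_search_url` and `FakeDriver` are hypothetical names, and `FakeDriver` exists only so the example runs without a browser.

```python
from urllib.parse import quote_plus

def get_search_url(input_text, driver_factory):
    """Open a Google search with whichever driver driver_factory creates and return the final URL."""
    driver = driver_factory()
    try:
        driver.get(f"https://www.google.com/search?q={quote_plus(input_text)}")
        return driver.current_url
    finally:
        driver.quit()   # always release the browser, even if get() fails

class FakeDriver:
    """Stand-in used only for this demonstration so that no browser is needed."""
    def __init__(self):
        self.current_url = None
        self.closed = False
    def get(self, url):
        self.current_url = url
    def quit(self):
        self.closed = True

search_url = get_search_url("new york", FakeDriver)
print(search_url)  # https://www.google.com/search?q=new+york
```

With Selenium installed, `webdriver.Firefox` or `webdriver.Chrome` could be passed in place of `FakeDriver`.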

In [3]:
## section 5: generating for an unknown word 

#we will crawl some data for an unknown word using selenium, and export it to a text file

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pprint
import time
import argparse
from nltk.tokenize import sent_tokenize
import re
import requests
from bs4 import BeautifulSoup


def get_google_search_page(input_text):
    fox = webdriver.Firefox()
    fox.get(f"https://www.google.com/search?q={input_text}")
    time.sleep(2)    #give the page time to load and redirect
    cur_url = fox.current_url
    fox.quit()       #quit() closes every window and ends the driver session
    return cur_url

def get_top_results(URL):
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')

    all_urls = []
    #result links are wrapped as /url?q=<target>&sa=...; visit each link once
    for link in soup.findAll('a'):
        url = str(link.get('href'))
        if url.startswith('/url?'):
            all_urls.append(url[7:])    #drop the leading '/url?q='
    return all_urls

def filter_duplicate(url_list,max_url=5):
    domain_list = set()
    result = dict()
    blacklist = ["youtube.com"]
    for url in url_list:
        domain = re.findall(r'^https://www\.(.*?)/',url)
        url = re.findall(r'^https://(.*?)&sa',url) # cut off the extra parameters Google appends after &sa
        #skip urls that do not match, repeated domains and blacklisted domains
        if domain != [] and url != [] and domain[0] not in domain_list and domain[0] not in blacklist:
            domain_list.add(domain[0])   #store the bare domain so the membership check above works
            result['www.'+domain[0]] = "https://"+url[0]
            if len(domain_list) == max_url:
                break
    return result



def extract_text_elems_from_urls(urls):
    stored_text = dict()
    for url_ in urls.values():
        #print(f'========{url_}===========')
        try: 
            childpage = requests.get(url_)
            childsoup = BeautifulSoup(childpage.content, 'html.parser')
            stored_text[url_] = {'h1':set(),'p':set()}
            for tagment in ['h1','p']:
                    elements = childsoup.findAll(tagment)
                    for element in elements:
                        stored_text[url_][tagment].add(element.text)
        except Exception:   #skip pages that fail to download or parse
            print(f"ERROR at url {url_}")
        #print("========================================================")
    return stored_text


def clean_sentence(sent):
    sent = re.sub(r'[ \r\n]+',' ',sent)   #collapse runs of whitespace
    return sent.lower()


def write_text_dict_to_file(stored_text,outputfile='test.txt'):
    with open(outputfile,'w',encoding='utf-8') as f:
        for url,texts in stored_text.items():
            f.write(f'=========={url}===========\n')
            clean_h1 = []
            f.write(f'<<h1>>\n')
            for h1 in texts['h1']:
                clean_h1 += sent_tokenize(h1)
            for sentence in clean_h1:
                f.write(clean_sentence(sentence))
                f.write('\n')
            f.write(f'<<p>>\n')
            clean_p = []
            for p in texts['p']:
                clean_p += sent_tokenize(p)
            for sentence in clean_p:
                f.write(clean_sentence(sentence))
                f.write('\n')
            f.write('\n\n')
            print(f"Finishing with url: {url}")
    print("Document ready")


def main(input_text,output,num_pages=10):
    print("Getting the google page...")
    gg_url = get_google_search_page(input_text)
    print("Get the first page results....")
    top_search = get_top_results(gg_url) 
    print(f"Duplicate and filter to {num_pages} pages...")
    unique_top_search = filter_duplicate(top_search,num_pages)
    print("Extracting the h1 and p elements from these pages...")
    stored_text = extract_text_elems_from_urls(unique_top_search)
    print("Writing to file")
    write_text_dict_to_file(stored_text,output)

    
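import os
os.makedirs('./inference', exist_ok=True)   #the output directory must exist before writing
for i in ['moderna','astrazenecca','washington','raffles','tekong','obama','NYC']:
    main(i,f'./inference/{i}.txt',num_pages = 100)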

Getting the google page...
Get the first page results....
Duplicate and filter to 100 pages...
Extracting the h1 and p elements from these pages...
ERROR at url https://www.straitstimes.com/world/heart-inflammation-rates-higher-after-moderna-covid-19-shot-than-pfizer-vaccine-canada-data
Writing to file
Finishing with url: https://www.channelnewsasia.com/asia/japan-takeda-moderna-covid-19-vaccine-contaminant-human-error-recall-2214936
Finishing with url: https://www.pharmaceutical-technology.com/comment/moderna-vaccine-recalled-novovax-replacement/
Finishing with url: https://www.todayonline.com/singapore/singapore-sends-100000-doses-moderna-covid-19-vaccine-brunei
Finishing with url: https://www.investing.com/news/stock-market-news/moderna-biontech-pfizer-fall-on-merck-covid19-pill-news-2632204
Finishing with url: https://www.businessinsider.com/moderna-vaccines-might-not-need-boosters-like-pfizer-2021-9
Finishing with url: https://www.latimes.com/science/story/2021-09-17/study-finds-b
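
The crawler recovers each target address from a redirect link of the form `/url?q=<target>&sa=...` by slicing and regular expressions. As an alternative sketch, the standard library's `urllib.parse` can do the same while also decoding percent-escapes. The helper names `target_from_redirect` and `domain_of` and the sample link are made up for illustration.

```python
from urllib.parse import parse_qs, urlparse

def target_from_redirect(href):
    """Return the destination of a '/url?q=<target>&sa=...' link, or None if it is not one."""
    parsed = urlparse(href)
    if parsed.path != '/url':
        return None
    targets = parse_qs(parsed.query).get('q')
    return targets[0] if targets else None

def domain_of(url):
    """Host name without a leading 'www.'."""
    host = urlparse(url).netloc
    return host[4:] if host.startswith('www.') else host

href = '/url?q=https://www.example.com/news/story&sa=U&ved=abc'   # made-up link
target = target_from_redirect(href)
print(target)             # https://www.example.com/news/story
print(domain_of(target))  # example.com
```

Deduplicating on `domain_of(target)` would also cover sites that do not use a `www.` prefix, which the regular expression in `filter_duplicate` skips.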

In [13]:
#clean the crawled text for inference: keep only the context window around each target word
for inference_target in ['NYC','obama','raffles','washington','tekong']:
    inference_file = f"./inference/{inference_target}.txt"
    inference_output_file = f"./inference/clean_{inference_target}.txt"
    with open(inference_output_file,'w') as o:
        with open(inference_file,'r',encoding='utf-8') as f:
            lines = f.readlines()
            line_written =0
            for line in tqdm(lines):
                result = extract(line,inference_target,10,"<pad>",True)
                if result == None:
                    continue
                else:
                    print(result)
                    o.write(f'{inference_target}:')
                    o.write(''.join([i+' ' for i in result] ))
                    o.write('\n')
                    line_written += 1
                    if line_written % 10000 == 0:
                        print(f"Written {line_written} into file {inference_target}")

100%|████████████████████████████████████████████████████████████████████████████| 362/362 [00:00<00:00, 122357.81it/s]


['time', 'visiting', 'city', 'together', 'meghan', 'baby', 'shower', 'son', 'archie', 'mountbattenwindsor', 'back', 'harry', 'london', 'engagement', 'highly', 'anticipated', 'visit', 'also', 'duchesss', 'first']
['<pad>', '<pad>', '<pad>', 'federal', 'judge', 'deals', 'blow', 'covid', 'vaccine', 'mandate', 'teachers', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'bennett', 'sukkot', 'time', 'stroll', 'memory', 'lane', 'th', 'ave', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'also', 'part', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad

100%|█████████████████████████████████████████████████████████████████████████████| 399/399 [00:00<00:00, 13304.77it/s]


['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'barack', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'married', 'michelle', 'robinson', 'lawyer', 'also', 'excelled', 'harvard', 'law', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'elected', 'illinois', 'senate', 'us', 'senate', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'barack', 'elected', 'president', 'became', 'first', 'african', 'american', 'hold', 'office', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'learn', 'barack', 'obamas', 'spouse', 'michelle', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad

100%|█████████████████████████████████████████████████████████████████████████████| 406/406 [00:00<00:00, 16918.40it/s]


['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'seychelles', 'praslin', 'seychelles', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'part', 'accor', 'copyright', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'singapore', 'singapore', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', 'creative', 'expression', 'longlasting', 'inspiration', 'combine', 'issue', 'magazine', 'inspired', 'five', 'senses', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>'

100%|████████████████████████████████████████████████████████████████████████████████| 89/89 [00:00<00:00, 9890.39it/s]


['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'university', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', 'return', 'inperson', 'campus', 'life', 'may', 'prompt', 'anxiety', 'tips', 'university', 'psychology', 'professor', 'jane', 'simoni', 'help', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'university', 'seattle', 'wa', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'discover', 'best', 'restaurants', 'bars', 'dc', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'stay', 'current', 'things', 'around', 'dc', 'signing', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>',

100%|███████████████████████████████████████████████████████████████████████████████| 52/52 [00:00<00:00, 26057.80it/s]

['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'located', 'northeastern', 'coast', 'singapore', 'pulau', 'land', 'area', 'hectares', 'singapores', 'largest', 'natural', 'offshore', 'island', '<pad>', '<pad>']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'originally', 'consisting', 'two', 'islands', 'pulau', 'besar', 'meaning', 'big', 'tekong', 'island', 'malay', 'pulau', 'tekong', 'kechil', 'meaning']
['<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'pulau', 'home', 'dozen', 'villages', 'largest', 'singapores', 'natural', 'offshore', 'islands', 'holds', 'special']
['<pad>', '<pad>', 'first', 'edition', 'memorial', 'halls', 'public', 'lecture', 'series', 'pulau', 'mapping', 'research', 'consultant', 'mok', 'ly', 'yng', 'share', 'history', 'tekong', 'island']
['<pad>', '<pad>', '<pad>', '<pad>', 'join', 'mr', 'moks', 'talk', 'learn', 'transformation', 'years', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
['<pad>', '<pad




In [95]:
with open(inference_output_file,'r') as f:
    batch = f.readlines()
    
sample_batch_tensor = []
target_batch_tensor = []
for sentence in batch:
    y,x = sentence.split(':',1)   #split on the first colon only: target word, then its context
    x = re.sub(r'[\n\r ]+',' ',x).strip()
    sample_tensor = [get_glove_vec(word,W_norm,vocab) for word in x.split(' ')]
    target_batch_tensor.append(get_glove_vec(y,W_norm,vocab))
    sample_batch_tensor.append(sample_tensor)
    
sample_batch_tensor = torch.Tensor(np.array(sample_batch_tensor))
target_batch_tensor = np.array(target_batch_tensor)

sample_output = model(sample_batch_tensor)

output = torch.mean(sample_output,0)   #mean across the per-context output embeddings
infer_vec_output = output.detach().numpy().reshape((1,100))

print(f"Inference batch: {inference_target}\n\n")

__distance(W_norm,vocab,ivocab,infer_vec_output)


print(f"\n\n\t\tCosim score: {cosim(infer_vec_output,target_label)}")

Inference batch: sian


(400000,)

                               Word       Unnormalized Cosine distance

---------------------------------------------------------

0                              tired		0.531570
1                             scared		0.414386
2                              weary		0.412458
3                                 'm		0.400601
4                         frustrated		0.399057
5                          exhausted		0.392542
6                             afraid		0.389355
7                               feel		0.388964
8                              bored		0.386259
9                           fatigued		0.386248


		Cosim score: [0.15159323]
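
The cell above averages the model outputs over all context windows and then scores the result against the target vector. The sketch below illustrates those two steps on toy data using NumPy only. `mean_pool` and `cosine_similarity` are hypothetical names for illustration; the notebook's own `cosim` helper is defined elsewhere and may differ in detail (for example, in how it handles normalization).

```python
import numpy as np

def mean_pool(context_vectors):
    """Average per-context vectors (n_contexts x dim) into a single (1 x dim) vector."""
    return np.asarray(context_vectors, dtype=np.float64).mean(axis=0, keepdims=True)

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length (any shape is flattened)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Guard against zero vectors, for which the cosine is undefined.
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy stand-in for the model outputs: two contexts, two dimensions.
toy_outputs = np.array([[1.0, 0.0], [0.0, 1.0]])
pooled = mean_pool(toy_outputs)                      # shape (1, 2)
score = cosine_similarity(pooled, [1.0, 1.0])
print(pooled.shape, score)
```

Averaging rather than summing keeps the magnitude of the pooled vector independent of how many context sentences were collected, although cosine similarity itself is unaffected by that scale.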


## <a class="anchor" id="s6">Section 6: Incorporating the Synthesized Vector Back into GloVe</a>

In [194]:
## section 6: writing the vectors back 


#choose an output vec that performs well 
output_vec = infer_vec_output
embeddings = {inference_target:output_vec}  #word: embedding

original_glove = './glove_6B/glove.6B.100d.txt'
modified_glove = './glove_6B/modded_glove.6B.100d.txt'

#the word may already exist in the GloVe vocabulary; if so, overwrite its line in place instead of appending a duplicate
existing = []
with open(modified_glove,'w',encoding='utf-8') as f:
    with open(original_glove,'r',encoding='utf-8') as i:
        data = i.readlines()
        for line in data:
            word = line.rstrip().split(' ')[0]
            if word in embeddings:  
                f.write(f"{word} ")  
                f.write(' '.join([f'{i:.5f}' for i in embeddings[word].flatten().tolist()]))
                f.write('\n')
                existing.append(word)
            else:
                f.write(line)
        #append any words that were not already present in the original file
        non_existing = [w for w in embeddings.keys() if w not in existing]
        for new_word in non_existing:
            f.write(f"{new_word} ")  
            f.write(' '.join([f'{v:.5f}' for v in embeddings[new_word].flatten().tolist()]))
            f.write('\n')

#checking if it has been incorporated
with open(modified_glove,'r',encoding='utf-8') as f:
    temp = f.readlines()
print(len(temp))
(m_W_norm, mod_vocab, m_ivocab) = generate(modified_glove)
print(len(mod_vocab))
print(existing,non_existing)

400000
400000
['sian'] []
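
The replace-or-append logic above can also be wrapped in a reusable function and checked on a small toy file before touching the full 400,000-line GloVe file. The following is a minimal sketch under that assumption; `merge_embeddings` and the toy file names are hypothetical and not part of the original notebook.

```python
import os
import tempfile
import numpy as np

def merge_embeddings(original_path, modified_path, embeddings):
    """Copy a GloVe-style text file, overwriting words found in `embeddings`
    and appending the rest. Returns (replaced, appended) word lists."""
    def format_line(word, vec):
        values = ' '.join(f'{v:.5f}' for v in np.asarray(vec).flatten().tolist())
        return f'{word} {values}\n'

    replaced = []
    with open(original_path, 'r', encoding='utf-8') as src, \
         open(modified_path, 'w', encoding='utf-8') as dst:
        for line in src:                      # stream line by line, no readlines()
            word = line.split(' ', 1)[0]
            if word in embeddings:
                dst.write(format_line(word, embeddings[word]))
                replaced.append(word)
            else:
                # Make sure the last line ends with a newline before appending.
                dst.write(line if line.endswith('\n') else line + '\n')
        appended = [w for w in embeddings if w not in replaced]
        for w in appended:
            dst.write(format_line(w, embeddings[w]))
    return replaced, appended

# Toy check: 'cat' already exists and is overwritten, 'sian' is new and appended.
tmp_dir = tempfile.mkdtemp()
toy_original = os.path.join(tmp_dir, 'toy.txt')
toy_modified = os.path.join(tmp_dir, 'toy_modified.txt')
with open(toy_original, 'w', encoding='utf-8') as f:
    f.write('the 0.1 0.2\ncat 0.3 0.4\n')

replaced, appended = merge_embeddings(
    toy_original, toy_modified,
    {'cat': np.array([[1.0, 2.0]]), 'sian': np.array([[3.0, 4.0]])},
)
with open(toy_modified, encoding='utf-8') as f:
    merged_lines = f.read().splitlines()
print(replaced, appended, merged_lines)
```

In the run above, `sian` was already in the GloVe vocabulary, so the line count stays at 400,000; a word that is genuinely new would increase it by one.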
