## **Custom Embeddings**

This notebook is best run on Google Colab.

For Google Colab:
1. Make sure you have added the shared folder "digital-forest".
2. Mount Google Drive in the Colab environment:
    1. Click the folder icon on the left.
    2. Click the folder icon that carries the Google Drive logo.
    3. This should mount the drive.
    4. All files in your drive are now directly accessible from your Colab environment.

For running in a local environment:
1. Change the root path to the local directory.
2. If you hit any errors, double-check the file directory.
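
The two setups above differ only in the root path, so one option is to pick the root at run time. The sketch below assumes the shared folder is mounted at `/content/drive/MyDrive/digital-forest` on Colab (the path used in the next section) and uses a hypothetical `./digital-forest` directory for local runs; `running_in_colab`, `root_dir`, and `data_dir` are illustrative names, so adjust them to your own layout.

```python
from pathlib import Path

def running_in_colab():
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
        return True
    except ImportError:
        return False

COLAB_ROOT = Path('/content/drive/MyDrive/digital-forest')
LOCAL_ROOT = Path('./digital-forest')  # hypothetical local copy of the shared folder

root_dir = COLAB_ROOT if running_in_colab() else LOCAL_ROOT
data_dir = root_dir / 'mdpi'

# Fail early with a readable message instead of an error deep inside the notebook
if not data_dir.is_dir():
    print("Directory not found, double-check the path:", data_dir)
```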



## 1. Load the data

**Note:** The directory here should match the directory in your Google Colab drive.
To get the path:
1. Explore the folders in the Files section.
2. Right-click the folder whose path you would like to copy.
3. Click "Copy path" in the dropdown.

In [1]:
mdpi_dir = '/content/drive/MyDrive/digital-forest/mdpi'

In [2]:
# Get a list of all files in given directory
from os import walk
filenames = next(walk(mdpi_dir), (None, None, []))[2]  # [] if no file

In [4]:
filenames[:3]

['10.3390_rs13153009.html',
 '10.3390_rs13152956.html',
 '10.3390_rs13152892.html']
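
The listing above returns every file in the directory, whatever its type. If the folder might also hold other files, a filter along these lines keeps only the HTML articles. `list_html_files` is a hypothetical helper, not part of the original notebook.

```python
import os

def list_html_files(directory):
    """Return the sorted names of the .html files directly inside `directory`."""
    if not os.path.isdir(directory):
        return []  # same fallback as the listing above: an empty list
    return sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith('.html')
        and os.path.isfile(os.path.join(directory, name))
    )
```

With this helper, `filenames = list_html_files(mdpi_dir)` would replace the `walk` call.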

## 2. Get text data from all files

In [35]:
from bs4 import BeautifulSoup
import re

def extract_text_from_html(mdpi_dir, mdpi_file_name):
    with open(mdpi_dir + '/' + mdpi_file_name, "r", encoding='utf-8') as f:
        html_file = f.read()
    soup = BeautifulSoup(html_file, 'html.parser')
    
    article = soup.find('article')
    text_list = article.find_all(text=True)
    article_text = " ".join(text_list)
    
    # Remove \n characters
    clean_text = article_text.replace('\n', ' ')
    # Replace every run of characters other than letters, periods, and commas
    # (digits, symbols, whitespace) with a single space
    clean_text = re.sub('[^.,A-Za-z]+', ' ', clean_text)
    # Convert all text to lower
    clean_text = clean_text.lower()
    
    return clean_text
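
To see what the three cleaning steps do, the sketch below applies them to a made-up string rather than a real article. `clean_article_text` is a hypothetical name that repeats only the cleaning part of the function above.

```python
import re

def clean_article_text(article_text):
    """Apply the same cleaning steps as extract_text_from_html."""
    clean_text = article_text.replace('\n', ' ')
    # Anything that is not a letter, '.' or ',' collapses to one space
    clean_text = re.sub('[^.,A-Za-z]+', ' ', clean_text)
    return clean_text.lower()

sample = "Remote Sensing 2021, 13(15), 3009.\nNDVI-based"
print(clean_article_text(sample))
# remote sensing , , . ndvi based
```

Note that digits disappear entirely while the punctuation around them stays, which is why isolated commas and periods appear in the output.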

In [36]:
# Get all the text data from the articles
mdpi_corpus = []
failed_files = []
for file_name in filenames:
    # Text extraction can fail for some files.
    # Catch the exception and record the file name so the failure can be analyzed later
    try:
        extracted_text = extract_text_from_html(mdpi_dir, file_name)
        mdpi_corpus.append(extracted_text)
    except Exception as e:
        failed_files.append(file_name)
        print("Error while extracting text for {}".format(file_name), e)

Error while extracting text for 10.3390_rs90201023.html 'NoneType' object has no attribute 'find_all'
Error while extracting text for 10.3390_rs90201024.html 'NoneType' object has no attribute 'find_all'
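
Both failures come from `soup.find('article')` returning `None`, which happens when a page has no `<article>` element. One quick way to confirm this for the files in `failed_files` is to look for the tag in the raw HTML. The helper below is a hypothetical diagnostic; the regular expression is a rough check, not a full HTML parse.

```python
import re

ARTICLE_TAG = re.compile(r'<article[\s>]', flags=re.IGNORECASE)

def has_article_tag(html_text):
    """Return True if the raw HTML contains an opening <article> tag."""
    return ARTICLE_TAG.search(html_text) is not None

print(has_article_tag('<html><body><article class="x"><p>text</p></article></body></html>'))
# True
print(has_article_tag('<html><body><div>Page not found</div></body></html>'))
# False
```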


In [37]:
print("Successfully processed {} records".format(len(mdpi_corpus)))

Successfully processed 402 records


## 3. Set up GloVe embeddings

**This is required on Google Colab because data is not stored permanently.**

In [10]:
!wget https://nlp.stanford.edu/data/glove.6B.zip

--2022-04-15 15:39:05--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-04-15 15:39:05--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2022-04-15 15:41:46 (5.12 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]



In [11]:
!unzip "glove.6B.zip"

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


## 4. Set up a pipeline to build custom embeddings

In [12]:
import gensim
from gensim.test.utils import get_tmpfile, datapath
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import Word2Vec

In [13]:
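# Absolute path to the unzipped file. datapath() resolves names inside gensim's
# bundled test data, so it is not needed for a file of our own.
glove_file = '/content/glove.6B.50d.txt'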
tmp_file = get_tmpfile("test_word2vec.txt")

_ = glove2word2vec(glove_file, tmp_file)
glove_vectors = KeyedVectors.load_word2vec_format(tmp_file)
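
The conversion step is needed because GloVe's text files lack the header line that the word2vec text format starts with: the number of vectors followed by their dimension. The sketch below reproduces that idea in plain Python on two made-up 3-dimensional vectors. It illustrates the file format and is not gensim's implementation; `add_word2vec_header` is a hypothetical name.

```python
def add_word2vec_header(glove_lines):
    """Prepend the '<count> <dimension>' header used by the word2vec text format."""
    rows = [line.rstrip('\n') for line in glove_lines if line.strip()]
    if not rows:
        return ['0 0']
    # Each row is a word followed by its vector components
    dimension = len(rows[0].split(' ')) - 1
    return ['{} {}'.format(len(rows), dimension)] + rows

toy_glove = ['forest 0.1 0.2 0.3\n', 'canopy 0.4 0.5 0.6\n']
for line in add_word2vec_header(toy_glove):
    print(line)
# 2 3
# forest 0.1 0.2 0.3
# canopy 0.4 0.5 0.6
```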

### 4.1 Process the corpus into the input format required by the Word2Vec algorithm

In [15]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [14]:
# Combine all the records into one single string (not used by the steps below)
full_text = " ".join(mdpi_corpus)

In [39]:
sentences = []
for document in mdpi_corpus:
    # Break each document in the corpus into a list of sentences
    sent_list = sent_tokenize(document)
    # Break each sentence into a list of words
    for sent in sent_list:
        word_list = word_tokenize(sent)
        sentences.append(word_list)

In [41]:
print("We have {} sentences in the corpus".format(len(sentences)))

We have 377179 sentences in the corpus
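
The result is a list of sentences, each of which is a list of word tokens. The sketch below shows that shape on a made-up document, using two simple regular expressions as a stand-in for NLTK's `sent_tokenize` and `word_tokenize`. It assumes the text has already been cleaned as in section 2 (lowercase letters, periods, and commas only) and will not match NLTK's output in general.

```python
import re

def split_sentences(text):
    """Split after each period that is followed by whitespace."""
    return [s for s in re.split(r'(?<=\.)\s+', text.strip()) if s]

def tokenize(sentence):
    """Return words and the punctuation marks '.' and ',' as separate tokens."""
    return re.findall(r'[a-z]+|[.,]', sentence)

document = 'forests store carbon. lidar maps canopy height, and structure.'
toy_sentences = [tokenize(s) for s in split_sentences(document)]
print(toy_sentences)
# [['forests', 'store', 'carbon', '.'], ['lidar', 'maps', 'canopy', 'height', ',', 'and', 'structure', '.']]
```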


### 4.2 Set up the Word2Vec model

In [46]:
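# Build a Word2Vec model on our dataset.
# This notebook uses the gensim 3.x API; gensim 4 renamed `size` to `vector_size`
# and replaced `wv.vocab` with `wv.key_to_index`.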
base_model = Word2Vec(size=50, window=5, min_count=3, workers=4)
base_model.build_vocab(sentences)

In [52]:
total_examples = base_model.corpus_count

In [59]:
# Unique words in the vocabulary
len(base_model.wv.vocab)

26528
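
The vocabulary size depends on `min_count`: words that occur fewer than `min_count` times in the corpus are ignored. The sketch below mimics that filter with a plain `Counter` on a made-up corpus. `toy_vocab` is a hypothetical helper that shows the counting rule only, not gensim's vocabulary builder.

```python
from collections import Counter

def toy_vocab(sentences, min_count):
    """Return the words that occur at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

toy_corpus = [
    ['forest', 'canopy', 'height'],
    ['forest', 'canopy'],
    ['forest', 'lidar'],
]
print(sorted(toy_vocab(toy_corpus, min_count=1)))
# ['canopy', 'forest', 'height', 'lidar']
print(sorted(toy_vocab(toy_corpus, min_count=3)))
# ['forest']
```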

In [71]:
# Statistics of our vocabulary
unique_words = set(base_model.wv.vocab.keys()) - set(glove_vectors.vocab.keys())
common_words = set(base_model.wv.vocab.keys()).intersection(set(glove_vectors.vocab.keys()))

print("Unique words to our corpus {}".format(len(unique_words)))
print("Common words between corpus and glove {}".format(len(common_words)))

Unique words to our corpus 6268
Common words between corpus and glove 20260
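
The two counts add up to the vocabulary size reported earlier, and their ratio gives the share of our vocabulary that GloVe does not cover. The short check below only redoes that arithmetic from the printed numbers; the variable names are illustrative.

```python
unique_count = 6268    # words only in our corpus (printed above)
common_count = 20260   # words shared with GloVe (printed above)

vocab_size = unique_count + common_count
out_of_glove_rate = unique_count / vocab_size

print(vocab_size)
# 26528
print("{:.1%} of our vocabulary is missing from GloVe".format(out_of_glove_rate))
# 23.6% of our vocabulary is missing from GloVe
```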


### 4.3 Train Word2Vec model

In [72]:
# Add GloVe's words to our model's vocabulary.
# This call extends the vocabulary only; it does not copy GloVe's weights into the model.
base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)

In [None]:
# train on your data
base_model.train(sentences, total_examples=total_examples, epochs=100)
base_model_wv = base_model.wv

### 4.4 Analyze our embeddings

In [77]:
list(unique_words)[:10]

['qair',
 'k.l',
 'singto',
 'forclime',
 'tection',
 'channan',
 'logsig',
 'mizoue',
 'y.o',
 'g.o']

In [78]:
'geoinform' in common_words

False

In [74]:
base_model_wv.most_similar('geoinform')

[('energy', 0.5695731043815613),
 ('topography', 0.49855130910873413),
 ('poulin', 0.4958604872226715),
 ('levin', 0.4894944727420807),
 ('sun', 0.48915255069732666),
 ('vicarious', 0.48822200298309326),
 ('quantification', 0.48101136088371277),
 ('gasparini', 0.470840185880661),
 ('cools', 0.46899086236953735),
 ('ahola', 0.4679374396800995)]
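
The scores returned by `most_similar` are cosine similarities between the query word's vector and every other vector in the vocabulary, sorted from highest to lowest. The sketch below shows the same ranking in plain Python on made-up 2-dimensional vectors; `cosine`, `toy_most_similar`, and `toy_vectors` are hypothetical names and the numbers have no connection to the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def toy_most_similar(query, vectors, topn=3):
    """Rank every other word by cosine similarity to `query`."""
    scores = [(word, cosine(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

toy_vectors = {
    'forest': [1.0, 0.0],
    'woodland': [0.9, 0.1],
    'canopy': [0.6, 0.8],
    'sensor': [0.0, 1.0],
}
for word, score in toy_most_similar('forest', toy_vectors):
    print(word, round(score, 3))
```

In this toy example the ranking is `woodland`, then `canopy`, then `sensor`, because the vectors point progressively further away from `forest`.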