Text Representations: Words to Numbers
---

Computers cannot operate on words or text directly. Text first has to be represented as meaningful sequences of numbers. These sequences of (usually real-valued) numbers are called vectors.

Where are these word vectors used?

- Text Classification and Summarization tasks
- Similar word search, e.g. synonyms or logically related words
- Machine Translation (e.g. translating text from English to German)
- Understanding similar texts (e.g. Facebook feed articles)
- Question Answering and carrying out tasks (e.g. chatbots that schedule appointments)

Usage
---

We will use a machine learning or deep learning model for classification, together with the following text vectorization methods:
- One-hot encoding
- TF-IDF
- word2vec by Google
- GloVe by Stanford
- fastText by Facebook
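
To make the two simplest methods in the list concrete, here is a minimal sketch of one-hot encoding and TF-IDF in plain Python. The toy corpus and the helper names `one_hot` and `tf_idf` are made up for illustration, and the TF-IDF weighting is the plain unsmoothed textbook variant; libraries typically apply their own smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): each document is a list of tokens.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "away"],
]

# Vocabulary: every distinct token gets a fixed position in the vector.
vocab = sorted({token for doc in docs for token in doc})
index = {token: i for i, token in enumerate(vocab)}


def one_hot(token):
    """Vector of zeros with a single 1 at the token's position."""
    vec = [0] * len(vocab)
    vec[index[token]] = 1
    return vec


def tf_idf(doc):
    """Term frequency times inverse document frequency, one weight per vocabulary entry."""
    counts = Counter(doc)
    vec = [0.0] * len(vocab)
    for token, count in counts.items():
        tf = count / len(doc)                      # how often the token occurs in this document
        df = sum(1 for d in docs if token in d)    # in how many documents it occurs
        vec[index[token]] = tf * math.log(len(docs) / df)
    return vec


print(vocab)
print(one_hot("cat"))
print([round(w, 3) for w in tf_idf(docs[0])])
```

Note how "the", which appears in every document, gets a TF-IDF weight of zero, while one-hot vectors carry no notion of similarity between words at all. Both limitations are part of the motivation for the learned embeddings below.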

Sentence and Document Embeddings 
---

Lastly, we look at text sequences larger than words and build sentence and document embeddings. doc2vec is a popular adaptation of word2vec for this purpose. We will use gensim and gensim-data to experiment with and evaluate the methods above.

Checklist
---

Level: ADVANCED 

- Introducing gensim and gensim-data
- word2vec, GloVe, and more modern options: ConceptNet-Numberbatch and fastText
- Understanding Word Vectors
- Integrating with Text Classification

What will you be able to do by the end of it?
- SKILL 1: Vectorization of Text
- SKILL 2: Using gensim and gensim-data for topic modeling
- SKILL 3: Using word2vec, GloVe and fastText 
- SKILL 4: Integrating Text Representations with Classification and Basic Visualization
- SKILL 5: Creating sentence and document vectors for Information Retrieval by using word2vec adaptations: sent2vec and doc2vec


In [1]:
import gensim

In [2]:
print(f'gensim: {gensim.__version__}')

gensim: 3.4.0


Let's download some pre-trained GloVe embeddings:

In [3]:
from tqdm import tqdm
class TqdmUpTo(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None: self.total = tsize
        self.update(b * bsize - self.n)

def get_data(url, filename):
    """
    Download data if the filename does not exist already
    Uses Tqdm to show download progress
    """
    import os
    from urllib.request import urlretrieve
    
    if not os.path.exists(filename):

        dirname = os.path.dirname(filename)
        if dirname:  # empty when filename has no directory component
            os.makedirs(dirname, exist_ok=True)

        with TqdmUpTo(unit='B', unit_scale=True, miniters=1, desc=url.split('/')[-1]) as t:
            urlretrieve(url, filename, reporthook=t.update_to)

In [4]:
embedding_url = 'http://nlp.stanford.edu/data/glove.6B.zip'

In [5]:
get_data(embedding_url, 'data/glove.6B.zip')

### GloVe or word2vec?

In general, I recommend using **GloVe over word2vec**. This is because it outperforms word2vec on most machine learning and NLP benchmarks in academia, as well as in my own limited experience.

I am convinced enough to skip the original word2vec completely here. But for the sake of completeness, we will see the following:

- How to use the original embeddings? Example: GloVe
- How to handle Out of Vocabulary words? Hint: fastText
- How to train your own word2vec vectors on your own corpus?

In [6]:
# We need to run this only once; you can also unzip manually into the data directory
# !unzip data/glove.6B.zip
# !mv glove.6B.300d.txt data/glove.6B.300d.txt 
# !mv glove.6B.200d.txt data/glove.6B.200d.txt 
# !mv glove.6B.100d.txt data/glove.6B.100d.txt 
# !mv glove.6B.50d.txt data/glove.6B.50d.txt 

### How to use pre-trained embeddings?

**Challenge**: The file formats used by word2vec and GloVe are slightly different from each other. We'd like a consistent API to look up any word embedding.

**Solution**: We can get one by converting the GloVe file to word2vec format with `glove2word2vec`, a conversion script that ships with `gensim`.

In [7]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'data/glove.6B.300d.txt'
word2vec_output_file = 'data/glove.6B.300d.word2vec.txt'

In [9]:
import os
if not os.path.exists(word2vec_output_file):
    glove2word2vec(glove_input_file, word2vec_output_file)
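
To see what the conversion amounts to: a word2vec text file starts with a header line of the form `<vocab_size> <dimensions>`, while a GloVe text file starts directly with the first word and its vector. The sketch below illustrates that idea on a tiny made-up file. The helper name `add_word2vec_header` and the toy file contents are hypothetical, and this is not gensim's actual implementation; it reads the whole file into memory, which is fine for a toy file but not for the real 300d vectors.

```python
import os
import tempfile


def add_word2vec_header(glove_path, word2vec_path):
    """Copy a GloVe text file and prepend the '<vocab_size> <dimensions>' header line."""
    with open(glove_path, encoding="utf-8") as src:
        lines = [line for line in src if line.strip()]
    dimensions = len(lines[0].split()) - 1  # the first field is the word itself
    with open(word2vec_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(lines)} {dimensions}\n")
        dst.writelines(lines)
    return len(lines), dimensions


# Tiny made-up GloVe-style file: one word per line, followed by its vector.
workdir = tempfile.mkdtemp()
toy_glove = os.path.join(workdir, "toy_glove.txt")
toy_word2vec = os.path.join(workdir, "toy_word2vec.txt")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("king 0.1 0.2 0.3\nqueen 0.2 0.1 0.4\n")

print(add_word2vec_header(toy_glove, toy_word2vec))
with open(toy_word2vec, encoding="utf-8") as f:
    print(f.read())
```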

### KeyedVectors API
We now have the simple task of loading the vectors from a file. We do this using the `KeyedVectors` API in gensim. The word we want to look up is the _key_ and the numerical representation of that word is the corresponding _value_.

In [10]:
%%time
from gensim.models import KeyedVectors
filename = word2vec_output_file
# load the Stanford GloVe vectors (converted to word2vec text format)
model = KeyedVectors.load_word2vec_format(filename, binary=False)
# binary=False for human-readable text (.txt) files, binary=True for .bin files


In [11]:
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

[('queen', 0.6713277101516724)]
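
The idea behind this analogy query can be sketched with plain vector arithmetic and cosine similarity. The 3-dimensional vectors below are hand-picked toy values, not real GloVe vectors, and the helper names `cosine` and `analogy` are made up for this illustration. gensim's real implementation differs in details such as vector normalization, so treat this only as a picture of the idea.

```python
import math

# Hand-picked 3-dimensional toy vectors (NOT real GloVe values).
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the closest remaining word."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # The query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


print(analogy(positive=["woman", "king"], negative=["man"], vectors=toy_vectors))
```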


In [12]:
model.most_similar('india')

[('indian', 0.7355823516845703),
 ('pakistan', 0.7285579442977905),
 ('delhi', 0.6846907138824463),
 ('bangladesh', 0.6203191876411438),
 ('lanka', 0.609517514705658),
 ('sri', 0.6011613607406616),
 ('kashmir', 0.5746493935585022),
 ('nepal', 0.5421023368835449),
 ('pradesh', 0.5405811071395874),
 ('maharashtra', 0.518537700176239)]

#### What is missing in both word2vec and GloVe? 
Both GloVe and word2vec cannot handle words which they did not see during training. These words are called "out of vocabulary" or **OOV** in the literature.

This is evident if you try to look up a noun that is not frequently used, e.g. a name: the model throws a `not in vocabulary` error.

In [13]:
try:
    model.most_similar('nirant')
except KeyError as e:  # gensim raises KeyError for words missing from the vocabulary
    print(e)

"word 'nirant' not in vocabulary"


### How to handle OOV words? 
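Mikolov, the lead author of word2vec, and his colleagues at Facebook extended the approach to create fastText. It works on character n-grams rather than only on whole words, so a vector for an unseen word can be composed from the n-grams it contains.

We can train our own fastText embeddings, which can handle OOV tokens as well.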
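
To illustrate the n-gram idea, here is a minimal sketch that extracts character n-grams from a word wrapped in `<` and `>` boundary markers, using lengths 3 to 6 (the defaults in the fastText library). The helper name `char_ngrams` is made up for this illustration; the real library additionally hashes n-grams into buckets and keeps a vector for the whole word, which is left out here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", min_n=3, max_n=3))

# A word missing from the vocabulary still shares n-grams with words seen in training.
shared = set(char_ngrams("indians")) & set(char_ngrams("india"))
print(sorted(shared))
```

Because the vector of a word is built from the vectors of its n-grams, words with overlapping spellings end up close to each other, even if one of them never appeared in the training data.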

### Get the Dataset

We download the subtitles of several TED talks from a public dataset. We will train our fastText embeddings on these, as well as word2vec embeddings for comparison.

In [14]:
ted_dataset = "https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip"
get_data(ted_dataset, "data/ted_en.zip")

In [15]:
import zipfile
import lxml.etree
# extract subtitle
with zipfile.ZipFile('data/ted_en.zip', 'r') as z:
    doc = lxml.etree.parse(z.open('ted_en-20160408.xml', 'r'))
input_text = '\n'.join(doc.xpath('//content/text()'))

In [16]:
input_text[:500]

"Here are two reasons companies fail: they only do more of the same, or they only do what's new.\nTo me the real, real solution to quality growth is figuring out the balance between two activities: exploration and exploitation. Both are necessary, but it can be too much of a good thing.\nConsider Facit. I'm actually old enough to remember them. Facit was a fantastic company. They were born deep in the Swedish forest, and they made the best mechanical calculators in the world. Everybody used them. A"

Since we are using subtitles from TED talks, there are some fillers which are not useful. These are often descriptions of sounds in parentheses, such as applause or laughter, and the speaker's name at the start of a line.

Let's remove these fillers:

In [17]:
import re
# remove text in parentheses, e.g. sound descriptions
input_text_noparens = re.sub(r'\([^)]*\)', '', input_text)

# store as list of sentences
sentences_strings_ted = []
for line in input_text_noparens.split('\n'):
    m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
    sentences_strings_ted.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)

# store as list of lists of words
sentences_ted = []
for sent_str in sentences_strings_ted:
    tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()
    sentences_ted.append(tokens)

Exercise for the reader: 
    Replace the `.split()` used above with the tokenizer from spaCy and see how `sentences_ted` changes

In [18]:
print(sentences_ted[:2])

[['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new'], ['to', 'me', 'the', 'real', 'real', 'solution', 'to', 'quality', 'growth', 'is', 'figuring', 'out', 'the', 'balance', 'between', 'two', 'activities', 'exploration', 'and', 'exploitation']]


Notice that `sentences_ted` is now a list of lists. Each element of the outer list is a sentence, and each sentence is a list of tokens (e.g. words).

This is the structure that `gensim` expects for training text embeddings.
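
The cleaning steps can be checked on a small input. The transcript snippet below is made up for illustration and is not taken from the TED dataset; the regular expressions mirror the ones used above. As a small assumption of this sketch, whitespace-only sentence fragments are dropped so that no empty token lists appear in the result.

```python
import re

# Made-up transcript snippet (hypothetical, not from the TED dataset).
toy_text = "Chris Anderson: Welcome back. (Applause)\nThank you all. (Laughter) Let's begin."

# 1. Drop sound descriptions in parentheses, such as (Applause).
no_parens = re.sub(r"\([^)]*\)", "", toy_text)

# 2. Strip a short "Speaker:" prefix, then split each line into sentences on '.'.
sentence_strings = []
for line in no_parens.split("\n"):
    match = re.match(r"^(?:(?P<precolon>[^:]{0,20}):)?(?P<postcolon>.*)$", line)
    sentence_strings.extend(s for s in match.group("postcolon").split(".") if s.strip())

# 3. Lowercase, keep only letters and digits, and split into tokens.
toy_sentences = [re.sub(r"[^a-z0-9]+", " ", s.lower()).split() for s in sentence_strings]
print(toy_sentences)
```

The result is a list of sentences, each a list of tokens, with the speaker name and the sound descriptions removed.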

### Train fastText Embeddings

In [19]:
from gensim.models.fasttext import FastText

In [20]:
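%%time
import multiprocessing
# workers is the number of training threads and must be a positive integer
model_ted = FastText(sentences_ted, size=100, window=5, min_count=5, workers=multiprocessing.cpu_count(), sg=1)
# sg=1 selects skip-gram; otherwise CBOW is used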

CPU times: user 8.87 s, sys: 444 ms, total: 9.31 s
Wall time: 9.17 s


In [21]:
model_ted.wv.most_similar("india")

[('indians', 0.5911639928817749),
 ('indian', 0.5406097769737244),
 ('indiana', 0.4898717999458313),
 ('indicated', 0.4400438070297241),
 ('indicate', 0.4042605757713318),
 ('internal', 0.39166826009750366),
 ('interior', 0.3871103823184967),
 ('byproducts', 0.3752930164337158),
 ('princesses', 0.37265270948410034),
 ('indications', 0.369659960269928)]

### Train word2vec Embeddings

In [22]:
from gensim.models.word2vec import Word2Vec

In [23]:
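%%time
import multiprocessing
# workers is the number of training threads and must be a positive integer
model_word2vec_ted = Word2Vec(sentences=sentences_ted, size=100, window=5, min_count=5, workers=multiprocessing.cpu_count(), sg=1)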

CPU times: user 2.6 s, sys: 80 ms, total: 2.68 s
Wall time: 1.39 s


In [24]:
model_word2vec_ted.wv.most_similar("india")

[('174', 0.38140085339546204),
 ('specifies', 0.3672548234462738),
 ('praising', 0.36509859561920166),
 ('offshore', 0.3498424291610718),
 ('objection', 0.34258580207824707),
 ('h', 0.34086084365844727),
 ('slapped', 0.3404659032821655),
 ('iconography', 0.3358972668647766),
 ('paintbrush', 0.33297550678253174),
 ('sprinter', 0.33028823137283325)]

## fastText or word2vec? 

According to the preliminary comparisons by gensim: 
> fastText embeddings are significantly better than word2vec at encoding syntactic information. This is expected, since most syntactic analogies are morphology based, and the char n-gram approach of fastText takes such information into account. The original word2vec model seems to perform better on semantic tasks, since words in semantic analogies are unrelated to their char n-grams, and the added information from irrelevant char n-grams worsens the embeddings.

> Source: [word2vec fasttext comparison notebook](https://github.com/RaRe-Technologies/gensim/blob/37e49971efa74310b300468a5b3cf531319c6536/docs/notebooks/Word2Vec_FastText_Comparison.ipynb)

In general, prefer fastText for most web-scale systems because of its ability to handle words which it has not seen in training. It is definitely better than word2vec on small data, and at least as good as word2vec on larger datasets.

Here is a rule of thumb for commercial applications: fastText > GloVe > word2vec. This is obviously not always true, but empirically it is the most common result.

# Document Embeddings
This section is based on the [Doc2Vec API Tutorial](https://github.com/RaRe-Technologies/gensim/blob/37e49971efa74310b300468a5b3cf531319c6536/docs/notebooks/doc2vec-wikipedia.ipynb) from the gensim repository.

In [None]:
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint
import multiprocessing

Notice that gensim already implements a reader for the Wikipedia dump format, `WikiCorpus`. Additionally, `doc2vec` expects its input in a particular format: as objects of the `TaggedDocument` class.
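
A tagged document is essentially a pair of fields: `words`, the list of tokens, and `tags`, a list of labels identifying the document. The sketch below shows the shape of such a corpus. `ToyTaggedDocument` is a stand-in named tuple defined here only so that the example runs without gensim installed, and the two documents and the class `TaggedToyCorpus` are made up for illustration.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument (words, tags).
# Defined here only for illustration; use gensim's own class in real code.
ToyTaggedDocument = namedtuple("ToyTaggedDocument", ["words", "tags"])

# Made-up documents keyed by a title, which serves as the tag.
raw_docs = {
    "Cats": "cats are small carnivorous mammals",
    "Dogs": "dogs are domesticated descendants of wolves",
}


class TaggedToyCorpus:
    """Re-iterable corpus: yields one tagged document per raw document."""

    def __init__(self, docs):
        self.docs = docs

    def __iter__(self):
        for title, text in self.docs.items():
            yield ToyTaggedDocument(words=text.split(), tags=[title])


corpus = TaggedToyCorpus(raw_docs)
first = next(iter(corpus))
print(first.words, first.tags)
```

The corpus is written as a class with `__iter__` rather than as a one-shot generator because training passes over the data several times: once to build the vocabulary and then once per epoch.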

## Preparing the corpus

1. Download the latest Wikipedia dump (filename: enwiki-latest-pages-articles.xml.bz2) from [Wiki Dumps here](https://dumps.wikimedia.org/enwiki/latest/)
1. Convert the articles to `WikiCorpus`. `WikiCorpus` constructs a corpus from a Wikipedia (or other MediaWiki-based) database dump.

In [None]:
# This downloads a large file (more than 15 GB); do not download multiple copies
get_data('https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2', "data/enwiki-latest-pages-articles.xml.bz2")

enwiki-latest-pages-articles.xml.bz2:   5%|▍         | 699M/15.2G [06:06<1:55:38, 2.09MB/s]   

In [None]:
wiki = WikiCorpus("data/enwiki-latest-pages-articles.xml.bz2")

In [None]:
class TaggedWikiDocument(object):
    def __init__(self, wiki):
        self.wiki = wiki
        self.wiki.metadata = True
    def __iter__(self):
        for content, (page_id, title) in self.wiki.get_texts():
            yield TaggedDocument([c.decode("utf-8") for c in content], [title])

In [None]:
documents = TaggedWikiDocument(wiki)

In [None]:
next(iter(documents))  # peek at the first document without loading the whole corpus into memory

In [None]:
pre = Doc2Vec(min_count=0)
pre.scan_vocab(documents)