Text Representations: Words to Numbers
---

Computers cannot operate on words or text directly. Text first has to be represented as meaningful sequences of numbers. These sequences of real numbers are called vectors.

Where are these word vectors used?

- Text classification and summarization
- Finding similar words, e.g. synonyms or logically related words
- Machine translation (e.g. translating text from English to German)
- Understanding similar texts (e.g. articles in a Facebook feed)
- Question answering and carrying out tasks (e.g. chatbots that schedule appointments)

Usage
---

Use a machine learning or deep learning model for classification with one of the following text vectorization methods:
- One-hot encoding
- TF-IDF
- word2vec by Google
- GloVe by Stanford
- fastText by Facebook
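
The first two methods in the list can be sketched in a few lines of plain Python. This is a minimal illustration on a made-up toy corpus, not what a production library does: the `docs` data and the helper names `one_hot` and `tf_idf` are assumptions for this example, and the TF-IDF variant used here (term frequency times `log(N / df)`) is only one of several common weightings.

```python
import math
from collections import Counter

# Toy corpus (hypothetical example data), already tokenized
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "dog", "barked"],
]

# Sorted vocabulary so that every word gets a stable column index
vocab = sorted({word for doc in docs for word in doc})
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def tf_idf(doc):
    """TF-IDF with tf = count / len(doc) and idf = log(N / df)."""
    counts = Counter(doc)
    n_docs = len(docs)
    vec = [0.0] * len(vocab)
    for word, count in counts.items():
        # document frequency: number of documents containing the word
        df = sum(1 for d in docs if word in d)
        vec[index[word]] = (count / len(doc)) * math.log(n_docs / df)
    return vec

print(vocab)
print(one_hot("dog"))
print([round(x, 3) for x in tf_idf(docs[0])])
```

In this toy corpus "the" occurs in every document, so its TF-IDF weight is zero, while the rarer "cat" gets the largest weight in the first document. Both representations are sparse and as wide as the vocabulary, which is one motivation for the dense embeddings listed next.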

Sentence and Document Embeddings 
---

Lastly, we look at text sequences longer than words and build sentence and document embeddings. doc2vec is a popular adaptation of word2vec for this purpose. We will use gensim and gensim-data to explore and evaluate the methods above.

Checklist
---

Level: ADVANCED 

- Introducing gensim and gensim-data
- word2vec, GloVe, and more modern options: ConceptNet Numberbatch and fastText
- Understanding Word Vectors
- Integrating with Text Classification

What will you be able to do by the end of it?
- SKILL 1: Vectorization of Text
- SKILL 2: Using gensim and gensim-data for topic modeling
- SKILL 3: Using word2vec, GloVe and fastText 
- SKILL 4: Integrating Text Representations with Classification and Basic Visualization
- SKILL 5: Creating sentence and document vectors for Information Retrieval by using word2vec adaptations: sent2vec and doc2vec


In [1]:
import spacy

In [2]:
nlp = spacy.load('en')

If there is an error above, try:
- Windows Shell: ```python -m spacy download en``` as **Administrator**
- Linux Terminal: ```sudo python -m spacy download en```

In [3]:
import gensim

In [4]:
print(f'Using spacy: {spacy.__version__}, gensim: {gensim.__version__}')

Using spacy: 2.0.11, gensim: 3.4.0


Let's download some pre-trained GloVe embeddings:

In [5]:
from tqdm import tqdm
class TqdmUpTo(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None: self.total = tsize
        self.update(b * bsize - self.n)

def get_data(url, filename):
    """
    Download data if the filename does not exist already
    Uses Tqdm to show download progress
    """
    import os
    from urllib.request import urlretrieve
    
    if not os.path.exists(filename):

        dirname = os.path.dirname(filename)
        # Create the target directory only if the path actually contains one
        if dirname:
            os.makedirs(dirname, exist_ok=True)

        with TqdmUpTo(unit='B', unit_scale=True, miniters=1, desc=url.split('/')[-1]) as t:
            urlretrieve(url, filename, reporthook=t.update_to)

In [6]:
embedding_url = 'http://nlp.stanford.edu/data/glove.6B.zip'

In [7]:
get_data(embedding_url, 'data/glove.6B.zip')

In [8]:
# We need to run this only once; you can also unzip manually into the data directory
# !unzip data/glove.6B.zip
# !mv glove.6B.300d.txt data/glove.6B.300d.txt 
# !mv glove.6B.200d.txt data/glove.6B.200d.txt 
# !mv glove.6B.100d.txt data/glove.6B.100d.txt 
# !mv glove.6B.50d.txt data/glove.6B.50d.txt 

In [9]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'data/glove.6B.300d.txt'
word2vec_output_file = 'data/glove.6B.300d.word2vec.txt'

In [10]:
import os  # get_data imports os only inside its own function scope
if not os.path.exists(word2vec_output_file):
    glove2word2vec(glove_input_file, word2vec_output_file)

(400000, 300)

In [11]:
%%time
from gensim.models import KeyedVectors
filename = word2vec_output_file 
# load the Stanford GloVe model
model = KeyedVectors.load_word2vec_format(filename, binary=False)

In [12]:
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

[('queen', 0.6713277101516724)]
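
To make the arithmetic behind this query concrete, here is a simplified sketch on hand-made 2-dimensional vectors. The `vectors` dictionary and the `cosine` and `analogy` helpers are hypothetical and exist only for this illustration. The sketch adds and subtracts the raw vectors and ranks the remaining words by cosine similarity; gensim's `most_similar` follows the same idea but, as far as I know, also normalizes the vectors before combining them.

```python
import math

# Hypothetical 2-d toy vectors, chosen by hand for illustration only;
# real GloVe vectors have 50 to 300 dimensions.
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [-0.5, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive, negative):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # The query words themselves are excluded from the candidates
    exclude = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy(positive=["woman", "king"], negative=["man"]))  # prints: queen
```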


A shortcoming of GloVe and word2vec is that they cannot handle out-of-vocabulary words:

In [57]:
model.most_similar('china')


[('chinese', 0.7886985540390015),
 ('beijing', 0.7727974653244019),
 ('taiwan', 0.6810802817344666),
 ('shanghai', 0.6243096590042114),
 ('mainland', 0.6230916380882263),
 ('guangdong', 0.6093515753746033),
 ('tibet', 0.581537663936615),
 ('hong', 0.5814417600631714),
 ('kong', 0.575353741645813),
 ('korea', 0.5709474682807922)]

In [18]:
model['nirant']

KeyError: "word 'nirant' not in vocabulary"

We can train our own fastText embeddings, which can handle out-of-vocabulary words as well.
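
fastText copes with unseen words by representing each word through its character n-grams, so a new word still overlaps with words seen during training. The sketch below only shows how such n-grams can be extracted. `char_ngrams` is a hypothetical helper written for this illustration, the range of 3 to 6 characters is assumed to match fastText's usual defaults, and the hashing of n-grams into buckets and the combination of their vectors are left out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Two related words share several n-grams, which is what lets a model
# build a vector for a word it never saw as a whole.
shared = set(char_ngrams("china")) & set(char_ngrams("chinese"))
print(sorted(shared))
```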

### Creating our own fastText embedding

In [20]:
ted_dataset = "https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip"
get_data(ted_dataset, "data/ted_en.zip")

ted_en-20160408.zip&filename=ted_en-20160408.zip: 16.0MB [00:05, 2.90MB/s]


In [21]:
import zipfile
import lxml.etree
# extract subtitle
with zipfile.ZipFile('data/ted_en.zip', 'r') as z:
    doc = lxml.etree.parse(z.open('ted_en-20160408.xml', 'r'))
input_text = '\n'.join(doc.xpath('//content/text()'))

In [26]:
input_text[:2000]

'Here are two reasons companies fail: they only do more of the same, or they only do what\'s new.\nTo me the real, real solution to quality growth is figuring out the balance between two activities: exploration and exploitation. Both are necessary, but it can be too much of a good thing.\nConsider Facit. I\'m actually old enough to remember them. Facit was a fantastic company. They were born deep in the Swedish forest, and they made the best mechanical calculators in the world. Everybody used them. And what did Facit do when the electronic calculator came along? They continued doing exactly the same. In six months, they went from maximum revenue ... and they were gone. Gone.\nTo me, the irony about the Facit story is hearing about the Facit engineers, who had bought cheap, small electronic calculators in Japan that they used to double-check their calculators.\n(Laughter)\nFacit did too much exploitation. But exploration can go wild, too.\nA few years back, I worked closely alongside a 

Clearly, some of this information is redundant and does not help us understand the talk, such as the sound descriptions in parentheses and the speakers' names. We strip these out with regular expressions.


In [33]:
import re
# remove parenthesized annotations such as (Laughter)
input_text_noparens = re.sub(r'\([^)]*\)', '', input_text)
# store as list of sentences
sentences_strings_ted = []
for line in input_text_noparens.split('\n'):
    # drop a short speaker prefix (up to 20 characters before a colon), keep the rest
    m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
    sentences_strings_ted.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)
# store as list of lists of words
sentences_ted = []
for sent_str in sentences_strings_ted:
    tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()
    sentences_ted.append(tokens)
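
To see what these cleaning steps do, it can help to run them on a single made-up line. The sketch below wraps the same three regular expressions in a hypothetical helper, `clean_line`, and applies it to an invented transcript line; it is only an illustration of the cell above, not a replacement for it.

```python
import re

def clean_line(line):
    """Apply the cleaning steps to one line and return a list of token lists."""
    # 1. drop parenthesized annotations such as (Applause)
    line = re.sub(r'\([^)]*\)', '', line)
    # 2. drop a short speaker prefix of up to 20 characters before a colon
    m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
    # 3. split into sentences on '.', lowercase, keep only letters and digits
    return [re.sub(r"[^a-z0-9]+", " ", sent.lower()).split()
            for sent in m.group('postcolon').split('.') if sent]

print(clean_line("Moderator: Thank you. (Applause) What's next?"))
# [['thank', 'you'], ['what', 's', 'next']]
```

Note that the apostrophe in "What's" is treated as a separator, which is why "s" shows up as its own token.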

Exercise for the reader:
    Replace the `.split()` used above with the tokenizer from spaCy and see how `sentences_ted` changes

In [36]:
sentences_ted[:2]

[['here',
  'are',
  'two',
  'reasons',
  'companies',
  'fail',
  'they',
  'only',
  'do',
  'more',
  'of',
  'the',
  'same',
  'or',
  'they',
  'only',
  'do',
  'what',
  's',
  'new'],
 ['to',
  'me',
  'the',
  'real',
  'real',
  'solution',
  'to',
  'quality',
  'growth',
  'is',
  'figuring',
  'out',
  'the',
  'balance',
  'between',
  'two',
  'activities',
  'exploration',
  'and',
  'exploitation']]

In [37]:
from gensim.models.fasttext import FastText

In [52]:
%%time
model_ted = FastText(sentences_ted, size=100, window=5, min_count=5, workers=4, sg=1)

In [56]:
model_ted.wv.most_similar("china")

[('india', 0.8381025791168213),
 ('brazil', 0.7989770174026489),
 ('gaia', 0.788372278213501),
 ('australia', 0.7769520282745361),
 ('russia', 0.7750577926635742),
 ('ecosia', 0.7711703777313232),
 ('canada', 0.7652829885482788),
 ('asia', 0.7639362812042236),
 ('uganda', 0.7624210715293884),
 ('america', 0.7615689039230347)]