<h1 align=center><font size=5>Word Embeddings in Python</font></h1>

### Table of contents

- [Objective](#objective)
- [One-hot encoding](#one_hot)
- [Encode each word with a unique number](#integer_enc)
- [Word embeddings](#word_embeddings)
- [References](#ref)

### Objective <a id="objective"></a>

In this notebook, we look at different ways to convert strings to numbers (that is, to vectorize text) before feeding them to machine learning models.

### One-hot encoding <a id="one_hot"></a>

As a first idea, we might "one-hot" encode each word in our vocabulary. Consider the sentence "The cat sat on the mat". The vocabulary (the set of unique words) in this sentence is (cat, mat, on, sat, the). To represent each word, we create a zero vector with length equal to the vocabulary size, then place a one at the index that corresponds to the word.

In [1]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

text = 'The   cat sat on  the mat.'
# split() collapses repeated whitespace but keeps punctuation, so the last token is 'mat.'
text = text.lower().split()
print(text)

label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(text)
print(integer_encoded)

onehot_encoded = to_categorical(integer_encoded)
print(onehot_encoded)

['the', 'cat', 'sat', 'on', 'the', 'mat.']
[4 0 3 2 4 1]
[[0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 0.]]


&#x270d; What are downsides to this approach?

> This approach is inefficient. A one-hot encoded vector is sparse (meaning most elements are zero). Imagine we have 10,000 words in the vocabulary. To one-hot encode each word, we would create a vector where 99.99% of the elements are zero.
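
To make the sparsity concrete, here is a minimal sketch in plain Python. The vocabulary size comes from the example above; `word_position` is an arbitrary, hypothetical index chosen only for illustration.

```python
# Illustrative sketch: how sparse is a single one-hot vector?
vocab_size = 10_000   # vocabulary size from the example above
word_position = 42    # hypothetical index of the word being encoded

# Start from all zeros and set exactly one element to 1.
one_hot = [0] * vocab_size
one_hot[word_position] = 1

zero_fraction = one_hot.count(0) / vocab_size
print(f"Zero elements: {zero_fraction:.2%}")  # Zero elements: 99.99%
```

Every word in the vocabulary needs its own vector of this length, so memory grows with the square of the vocabulary size if all vectors are stored densely.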

### Encode each word with a unique number <a id="integer_enc"></a>

A second approach we might try is to encode each word using a unique number. Continuing the example above, we could assign 1 to "cat", 2 to "mat", and so on. We could then encode the sentence "The cat sat on the mat" as a dense vector like [5, 1, 4, 3, 5, 2]. 
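
The numbering described above can be reproduced with a few lines of plain Python. This is only a sketch: sorting the vocabulary alphabetically and counting from 1 are choices made here to match the example, not a required convention.

```python
sentence = "The cat sat on the mat"
tokens = sentence.lower().split()

# Sort the unique words so the numbering is reproducible, then count from 1.
vocabulary = sorted(set(tokens))
word_to_id = {word: idx for idx, word in enumerate(vocabulary, start=1)}

# Replace every token with its integer id.
encoded = [word_to_id[token] for token in tokens]
print(word_to_id)  # {'cat': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(encoded)     # [5, 1, 4, 3, 5, 2]
```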

&#x270d; What are pros and cons of this approach?

> This approach is efficient. Instead of a sparse vector, we now have a dense one (where every element carries a value).

> There are two downsides to this approach, however:
>
> - The integer-encoding is arbitrary (it does not capture any relationship between words).
> - An integer-encoding can be challenging for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful.

The `Tokenizer` text tokenization utility class in TensorFlow lets us vectorize a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector in which the coefficient for each token can be binary, based on word count, or based on tf-idf.

Arguments:

- __num_words__: the maximum number of words to keep, based on word frequency. Only the most common `num_words-1` words will be kept.
- __filters__: a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the `'` character.
- __lower__: boolean. Whether to convert the texts to lowercase.
- __split__: str. Separator for word splitting.
- __char_level__: if True, every character will be treated as a token.
- __oov_token__: if given, it will be added to `word_index` and used to replace out-of-vocabulary words during `texts_to_sequences` calls.

By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the `'` character). These sequences are then split into lists of tokens, which are then indexed or vectorized. Note that `0` is a reserved index that won't be assigned to any word.
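
The behaviour described above can be approximated in plain Python. The sketch below is a simplified stand-in written for this notebook, not the library's implementation: the helper names `simple_tokenize`, `build_word_index`, and `to_sequences` are hypothetical, and only the lowercasing, punctuation filtering, frequency-ordered indexing, and `num_words` cut-off are modelled.

```python
import string
from collections import Counter

# Assumption: mirrors the documented default filter (all punctuation except
# the apostrophe, plus tabs and line breaks).
FILTERS = string.punctuation.replace("'", "") + "\t\n"


def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Assign indices from 1, most frequent word first (0 stays reserved)."""
    counts = Counter(tok for text in texts for tok in simple_tokenize(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words=None):
    """Map words to indices, keeping only indices below num_words if it is set."""
    sequences = []
    for text in texts:
        ids = [word_index[t] for t in simple_tokenize(text) if t in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences


docs = ["The cat sat on the mat.", "The dog sat on the log."]
word_index = build_word_index(docs)
print(word_index)
# {'the': 1, 'sat': 2, 'on': 3, 'cat': 4, 'mat': 5, 'dog': 6, 'log': 7}
print(to_sequences(docs, word_index, num_words=4))
# [[1, 2, 3, 1], [1, 2, 3, 1]]
```

With `num_words=4`, only indices 1 to 3 survive, which is why the description says the most common `num_words-1` words are kept.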

#### Text Tokenization

Here, we learn how to tokenize a text and then turn sentences into sequences using TensorFlow.

In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ['The cat sat on the mat.',
         'The dog sat on the log.',
         'Dogs and cats living together.']

tokenizer = Tokenizer(num_words = 20) 
tokenizer.fit_on_texts(texts)

word_index = tokenizer.word_index
print('Word index:\n', word_index)

sequences = tokenizer.texts_to_sequences(texts) # Transforms each text into a sequence of integers
print('Sequences:\n', sequences)

Word index:
 {'the': 1, 'sat': 2, 'on': 3, 'cat': 4, 'mat': 5, 'dog': 6, 'log': 7, 'dogs': 8, 'and': 9, 'cats': 10, 'living': 11, 'together': 12}
Sequences:
 [[1, 4, 2, 3, 1, 5], [1, 6, 2, 3, 1, 7], [8, 9, 10, 11, 12]]


#### Test Sequence

In [3]:
X_train = ['The cat sat on the mat.']

tokenizer = Tokenizer(num_words = 20) 
tokenizer.fit_on_texts(X_train)

word_index = tokenizer.word_index
print('Word index:\n', word_index)

X_train_seq = tokenizer.texts_to_sequences(X_train)
print('Sequences:\n', X_train_seq)
# --------------------------------------------------------
X_test = ['The dog sat on the log.']

X_test_seq = tokenizer.texts_to_sequences(X_test)
print('Test sequence:\n', X_test_seq)  # unseen words ('dog', 'log') are dropped

Word index:
 {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
Sequences:
 [[1, 2, 3, 4, 1, 5]]
Test sequence:
 [[1, 3, 4, 1]]


#### Out Of Vocabulary (OOV) words

In [4]:
X_train = ['The cat sat on the mat.']

tokenizer = Tokenizer(num_words = 20, oov_token = '<OOV>') 
tokenizer.fit_on_texts(X_train)

word_index = tokenizer.word_index
print('Word index:\n', word_index)

X_train_seq = tokenizer.texts_to_sequences(X_train)
print('Sequences:\n', X_train_seq)
# --------------------------------------------------------
X_test = ['The dog sat on the log.']

X_test_seq = tokenizer.texts_to_sequences(X_test)
print('Test sequence:\n', X_test_seq)

Word index:
 {'<OOV>': 1, 'the': 2, 'cat': 3, 'sat': 4, 'on': 5, 'mat': 6}
Sequences:
 [[2, 3, 4, 5, 2, 6]]
Test sequence:
 [[2, 1, 4, 5, 2, 1]]
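
The difference between the last two cells comes down to what happens when a lookup fails. The following plain-Python sketch illustrates that logic under simplifying assumptions; `encode` is a hypothetical helper, and the two word indexes are copied from the outputs above.

```python
def encode(tokens, word_index, oov_token=None):
    """Look up each token; unknown tokens are dropped or mapped to oov_token."""
    oov_id = word_index.get(oov_token) if oov_token is not None else None
    encoded = []
    for token in tokens:
        if token in word_index:
            encoded.append(word_index[token])
        elif oov_id is not None:
            encoded.append(oov_id)  # replace the unseen word
        # otherwise the unseen word is silently skipped
    return encoded


test_tokens = ["the", "dog", "sat", "on", "the", "log"]

plain_index = {"the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
oov_index = {"<OOV>": 1, "the": 2, "cat": 3, "sat": 4, "on": 5, "mat": 6}

print(encode(test_tokens, plain_index))                   # [1, 3, 4, 1]
print(encode(test_tokens, oov_index, oov_token="<OOV>"))  # [2, 1, 4, 5, 2, 1]
```

Using an OOV token keeps the sequence length equal to the number of words in the sentence, so word positions are preserved even when some words are unknown.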


#### Padding

Reference: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences


In [5]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ['I love my dog', 
             'You love my dog!',
             'Do you think my dog is amazing?']

tokenizer = Tokenizer(num_words = 20, oov_token = '<OOV>') 
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print('Word index:\n', word_index)

sequences = tokenizer.texts_to_sequences(sentences)
print('Sequences:\n', sequences)

padded = pad_sequences(sequences)
print('Padded sequences:\n', padded)

matrix2 = tokenizer.texts_to_matrix(['I love my dog']) 
print(matrix2)

Word index:
 {'<OOV>': 1, 'my': 2, 'dog': 3, 'love': 4, 'you': 5, 'i': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
Sequences:
 [[6, 4, 2, 3], [5, 4, 2, 3], [7, 5, 8, 2, 3, 9, 10]]
Padded sequences:
 [[ 0  0  0  6  4  2  3]
 [ 0  0  0  5  4  2  3]
 [ 7  5  8  2  3  9 10]]
[[0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]


In [6]:
padded = pad_sequences(sequences, padding = 'post', maxlen = 5, truncating = 'post') 
print('Padded sequences:\n', padded)
print('Padded shape:', padded.shape)

Padded sequences:
 [[6 4 2 3 0]
 [5 4 2 3 0]
 [7 5 8 2 3]]
Padded shape: (3, 5)
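
To see what the `padding`, `truncating`, and `maxlen` arguments do, here is a plain-Python sketch of the same idea. The function `pad` is a hypothetical helper written for illustration; it assumes the defaults are `'pre'` for both padding and truncating, which is consistent with the outputs above.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Plain-Python sketch of padding/truncating lists to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Truncate: "pre" drops items from the front, "post" from the end.
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        # Pad: "pre" puts the fill values in front, "post" at the end.
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


seqs = [[6, 4, 2, 3], [5, 4, 2, 3], [7, 5, 8, 2, 3, 9, 10]]

print(pad(seqs, maxlen=5, padding="post", truncating="post"))
# [[6, 4, 2, 3, 0], [5, 4, 2, 3, 0], [7, 5, 8, 2, 3]]
print(pad(seqs, maxlen=5))
# [[0, 6, 4, 2, 3], [0, 5, 4, 2, 3], [8, 2, 3, 9, 10]]
```

Note how the third sequence keeps its first five items with `truncating="post"` but its last five with the `'pre'` setting.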


In [7]:
texts = ['The the the the the cat sat on the mat cat.']
tokenizer = Tokenizer(num_words = 10) 
tokenizer.fit_on_texts(texts)


word_index = tokenizer.word_index
print('Word index:', word_index)

sequences = tokenizer.texts_to_sequences(texts)
print('Sequences:', sequences)

for mode in ['binary', 'count', 'freq', 'tfidf']:
    matrix = tokenizer.texts_to_matrix(texts, mode) # Convert a list of texts to a Numpy matrix.
    print('-'*20, mode, '-'*20)
    print(matrix)

Word index: {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
Sequences: [[1, 1, 1, 1, 1, 2, 3, 4, 1, 5, 2]]
-------------------- binary --------------------
[[0. 1. 1. 1. 1. 1. 0. 0. 0. 0.]]
-------------------- count --------------------
[[0. 6. 2. 1. 1. 1. 0. 0. 0. 0.]]
-------------------- freq --------------------
[[0.         0.54545455 0.18181818 0.09090909 0.09090909 0.09090909
  0.         0.         0.         0.        ]]
-------------------- tfidf --------------------
[[0.         1.13196106 0.6865121  0.40546511 0.40546511 0.40546511
  0.         0.         0.         0.        ]]
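
The four modes can be recomputed by hand from the sequence printed above. The sketch below is an illustration, not the library's code: the `row` helper is hypothetical, and the tf-idf formula used (tf = 1 + ln(count), idf = ln(1 + N / (1 + df))) is an assumption that happens to reproduce the numbers in the output above.

```python
import math
from collections import Counter

sequence = [1, 1, 1, 1, 1, 2, 3, 4, 1, 5, 2]  # encoded text from the cell above
num_words = 10       # width of each output row
document_count = 1   # only one text was fitted
doc_frequency = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1}  # documents containing each index

counts = Counter(sequence)


def row(mode):
    """Build one matrix row; column j describes the word with index j."""
    values = [0.0] * num_words
    for j, c in counts.items():
        if mode == "binary":
            values[j] = 1.0
        elif mode == "count":
            values[j] = float(c)
        elif mode == "freq":
            values[j] = c / len(sequence)
        elif mode == "tfidf":
            # Assumed formula: tf = 1 + ln(count), idf = ln(1 + N / (1 + df)).
            tf = 1 + math.log(c)
            idf = math.log(1 + document_count / (1 + doc_frequency[j]))
            values[j] = tf * idf
    return values


for mode in ["binary", "count", "freq", "tfidf"]:
    print(mode, [round(v, 4) for v in row(mode)])
```

Column 0 is always zero because index `0` is reserved and never assigned to a word.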


### Word embeddings <a id="word_embeddings"></a>

Word embeddings give us an efficient, dense representation in which similar words have similar encodings. Importantly, we do not have to specify this encoding by hand. An embedding is a dense vector of floating-point values (the length of the vector is a parameter you specify). The values are not set manually; they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer). Word embeddings commonly range from 8 dimensions (for small datasets) up to 1024 dimensions (for large datasets). A higher-dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.
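
Mechanically, an embedding is a lookup table with one row of floats per word index. The sketch below shows only that lookup in plain Python; the sizes are hypothetical and the values stay random because no training happens here. In Keras the same role is played by `tf.keras.layers.Embedding(input_dim, output_dim)`, whose rows are updated during training.

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

vocab_size = 7      # hypothetical: 6 words plus the reserved index 0
embedding_dim = 8   # length of each embedding vector (a parameter we choose)

# The embedding table: one row of floats per word index. In a real model these
# values start out random and are adjusted by training; here they stay random.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(sequence):
    """Replace each integer index with its dense vector (a row lookup)."""
    return [embedding_table[idx] for idx in sequence]


vectors = embed([1, 4, 2, 3, 1, 5])
print(len(vectors), len(vectors[0]))  # 6 8
```

Both occurrences of index `1` map to the same row, so a word always receives the same vector wherever it appears in the sequence.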

### References <a id="ref"></a>

- https://keras.io/preprocessing/text/
    
- https://www.tensorflow.org/tutorials/text/word_embeddings