## Text Processing
Before doing further analysis we can try to preprocess the text:
- Lower-casing: convert all words to lower case (when case doesn't matter)
- Word stemming: strip suffixes to reduce words to a common stem, e.g. [writing, writes,...] -> write
- Lemmatization: group inflected forms under their dictionary form (lemma), e.g. [writing, wrote, written] -> write
- Stop words: very common words in a language [I, and, is,...]; in most cases they're not needed
- Punctuation: sometimes carries useful meaning, and sometimes not
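
As a rough illustration of the simpler steps above (lower-casing, punctuation removal, stop-word filtering), here is a minimal sketch using only the standard library. The `preprocess` helper and the tiny `STOP_WORDS` set are assumptions made for this example; stemming and lemmatization are left out because they usually rely on a dedicated NLP library.

```python
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"i", "and", "is", "the", "a", "on"}

def preprocess(text, remove_stop_words=True):
    """Lower-case, strip ASCII punctuation, and optionally drop stop words."""
    text = text.lower()
    # Delete every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = text.split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return words

print(preprocess("The cat sat on the mat, and the dog barked!"))
# ['cat', 'sat', 'mat', 'dog', 'barked']
```

Whether to apply each step depends on the task: for example, sentiment analysis may want to keep punctuation such as "!".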

### Tokenization
A token is the basic unit we process; it does not have to be a single word.
We could have tokens formed by multiple words: [I am] counts as 1 token
But for simplicity, our tokenizer just lower-cases the text and treats each word as one token

In [16]:
import numpy as np
import pandas as pd
from collections import Counter

In [17]:
corpus = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "The cat chased the mouse",
    "The dog barked loudly"
]

In [21]:
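def tokenize(text) -> list:
    """Lower-case the text and split it on whitespace."""
    text = text.lower()
    # split() with no argument also handles tabs, newlines and repeated spaces
    tokens = text.split()
    return tokens

# Tokenize every phrase in the corpus
out = [tokenize(phrase) for phrase in corpus]
print(out)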

[['the', 'cat', 'sat', 'on', 'the', 'mat'], ['the', 'dog', 'chased', 'the', 'cat'], ['the', 'cat', 'chased', 'the', 'mouse'], ['the', 'dog', 'barked', 'loudly']]
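
The idea of tokens made of several words can be sketched with word n-grams. The `ngram_tokens` helper below is a hypothetical name introduced for this example, not part of the notebook; it slides a window of `n` words over the lower-cased text.

```python
def ngram_tokens(text, n=2):
    """Return lower-cased tokens made of n consecutive words."""
    words = text.lower().split()
    # One token per window of n adjacent words.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngram_tokens("The cat sat on the mat"))
# ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
```

With `n=1` this reduces to the simple word-level tokenizer used in the rest of this section.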


### Bag of Words (BoW)
Build a dictionary (vocabulary) of all unique words in the corpus.
Using this dictionary, we encode each phrase as a fixed-length array.
The output array has the same size as the dictionary; each entry holds how many times the corresponding word appears in the phrase.

This has 3 downsides:
- Two sentences with the same BoW can have different meanings
    e.g. Alex is smarter than Bob; Bob is smarter than Alex. Same BoW, different meanings
- Produces a sparse vector (a big array that is mostly zeros, which wastes space and carries little information)
- Doesn't capture semantic similarity
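
The first downside can be checked directly. This small sketch counts the words of the two example sentences with `collections.Counter` and compares the results; word order is the only difference, and BoW discards it.

```python
from collections import Counter

# Word counts ignore order, so both sentences produce the same bag.
bow_a = Counter("alex is smarter than bob".split())
bow_b = Counter("bob is smarter than alex".split())

print(bow_a == bow_b)  # True
```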

In [30]:
# Build the vocabulary, ordered by token frequency
tokenized = []
for phrase in corpus:
    tokenized += tokenize(phrase)  # concatenate the tokens of every phrase
# Count frequency
counter = Counter(tokenized)
sorted_tokens = sorted(counter.items(), key=lambda x: x[1], reverse=True)
w2i = {token: idx for idx, (token, _) in enumerate(sorted_tokens)}
# Optional unknown-token slot (disabled): w2i["UNK"] = len(w2i)
vocab = list(w2i)

def bow_vec(tokens):
    vec = np.zeros(len(vocab))
    counts = Counter(tokens)  # occurrences of each token in this phrase
    for token, count in counts.items():
        if token in w2i:  # tokens outside the vocabulary are ignored
            vec[w2i[token]] = count
    return vec

example = "The cat is playing with the dog"
tokens = tokenize(example)
bow = bow_vec(tokens)
print(bow)

[2. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
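
In the output above, "is", "playing" and "with" are silently dropped because they are not in the vocabulary. One common option, hinted at by the disabled `UNK` line, is to reserve one extra slot for unknown tokens. The sketch below is a standalone, plain-Python variant; `bow_vec_unk` and the `"<unk>"` marker are names assumed for this example.

```python
from collections import Counter

corpus = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "The cat chased the mouse",
    "The dog barked loudly",
]

def tokenize(text):
    return text.lower().split()

# Vocabulary ordered by descending frequency (ties keep first-seen order).
counts = Counter(tok for doc in corpus for tok in tokenize(doc))
ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
w2i = {tok: idx for idx, (tok, _) in enumerate(ordered)}

UNK = "<unk>"        # hypothetical marker for out-of-vocabulary tokens
w2i[UNK] = len(w2i)  # last index of the vector

def bow_vec_unk(tokens):
    vec = [0] * len(w2i)
    for tok in tokens:
        # Unknown tokens are all counted in the UNK slot.
        vec[w2i.get(tok, w2i[UNK])] += 1
    return vec

print(bow_vec_unk(tokenize("The cat is playing with the dog")))
# [2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 3]
```

The last entry now records that three tokens were outside the vocabulary instead of discarding them.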


### Term Frequency (TF)
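Instead of raw counts, we can measure how important a word is within a document.

BoW counts grow with the length of the document and treat every word alike, including filler words (is, on,...). TF divides each count by the document length, so it measures the share of the document that a word takes up.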
$$
TF(w,d) = \frac{\text{count of word } w \text{ in document } d}{\text{total words in } d}
$$
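
To make the formula concrete, here is a small plain-Python sketch. `tf_dict` is a hypothetical helper that returns a dictionary rather than a vector. It also shows that repeating a document three times leaves its TF values unchanged, which is the length normalization the formula provides.

```python
from collections import Counter

def tf_dict(tokens):
    """Map each token to count / total number of tokens."""
    total = len(tokens)
    return {tok: count / total for tok, count in Counter(tokens).items()}

example = "the cat is playing with the dog".split()
print(tf_dict(example)["the"])  # 2 occurrences out of 7 tokens

short_doc = "the cat sat".split()
long_doc = short_doc * 3  # same content repeated three times
print(tf_dict(short_doc) == tf_dict(long_doc))  # True
```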

In [32]:
def tf(tokens):
    vec = np.zeros(len(vocab))
    counts = Counter(tokens)
    for token, count in counts.items():
        if token in w2i:
            # Relative frequency; the denominator also counts unknown tokens
            vec[w2i[token]] = count / len(tokens)
        # else:
        #     # If unknown, assign it to the unknown-word slot in the array
        #     vec[w2i["UNK"]] += count / len(tokens)
    return vec

example = "The cat is playing with the dog"
tokens = tokenize(example)
tf_vec = tf(tokens)
print(tf_vec)

[0.28571429 0.14285714 0.14285714 0.         0.         0.
 0.         0.         0.         0.        ]


### Inverse Document Frequency (IDF)
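IDF reduces the weight of words that are common across the documents of the corpus.

Rare words get a high IDF, common words get a low IDF
$$
IDF(w) = \log\frac{N}{1 + n_{w}}
$$
where $N$ is the number of documents and $n_{w}$ is the number of documents containing word $w$

Adding 1 to the denominator avoids dividing by 0 when a word appears in no document. The code below uses a smoothed variant, $IDF(w) = \log\frac{1 + N}{1 + n_{w}} + 1$, which also keeps the weight positive for words that appear in every document.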
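
As a small worked example on the four-phrase corpus, the sketch below compares the formula above with the smoothed form `log((1 + N) / (1 + n_w)) + 1` that the `idf` function in this section computes. The helper names `doc_freq`, `idf_basic` and `idf_smooth` are assumptions made for this illustration.

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the cat chased the mouse".split(),
    "the dog barked loudly".split(),
]
N = len(docs)

def doc_freq(word):
    """Number of documents that contain the word."""
    return sum(1 for doc in docs if word in doc)

def idf_basic(word):
    return math.log(N / (1 + doc_freq(word)))

def idf_smooth(word):
    return math.log((1 + N) / (1 + doc_freq(word))) + 1

for word in ["the", "cat", "mouse"]:
    print(word, doc_freq(word), round(idf_basic(word), 3), round(idf_smooth(word), 3))
# the 4 -0.223 1.0
# cat 3 0.0 1.223
# mouse 1 0.693 1.916
```

With the basic form, a word that occurs in every document ("the") gets a slightly negative weight, while the smoothed form never drops below 1.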

In [22]:
def idf(corpus_tokens):
    N = len(corpus_tokens)
    vec = np.zeros(len(vocab))
    for token,idx in w2i.items():
        contains = sum(1 for doc in corpus_tokens if token in doc)
        vec[idx] = np.log((N + 1) / (contains + 1)) + 1 # Smoothed
    return vec
idf_vec = idf(out)

Now that we have TF and IDF:
$\text{TF-IDF}(w,d) = TF(w,d) \cdot IDF(w)$
For each word, its weight is determined by:
- Its importance in the document (TF)
- Its rarity across the corpus (IDF)

It fixes a few issues of BoW:
- Down-weights stop words
- Highlights keywords: words that are rare across the corpus get a higher score
- Balance across documents: long documents don't dominate short ones
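
To see the down-weighting at work, the following plain-Python sketch scores every word of the first corpus phrase. It assumes the smoothed IDF `log((1 + N) / (1 + n_w)) + 1`; the `tfidf` helper returning a dictionary is a name introduced for this example.

```python
import math
from collections import Counter

corpus = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "The cat chased the mouse",
    "The dog barked loudly",
]
docs = [phrase.lower().split() for phrase in corpus]
N = len(docs)

def tfidf(tokens):
    """Return {token: TF * smoothed IDF} for one tokenized document."""
    scores = {}
    for tok, count in Counter(tokens).items():
        df = sum(1 for doc in docs if tok in doc)  # document frequency
        idf = math.log((1 + N) / (1 + df)) + 1     # smoothed IDF
        scores[tok] = (count / len(tokens)) * idf
    return scores

scores = tfidf(docs[0])
for tok, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok:>4} {score:.3f}")
```

In this phrase "the" occurs twice as often as "mat", yet their scores end up close (about 0.33 versus 0.32), and "cat", which appears in three of the four phrases, scores lowest. With this smoothed IDF the down-weighting is mild; a corpus with more documents separates common and rare words more strongly.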

In [33]:
tfidf_matrix = tf_vec * idf_vec
print(tfidf_matrix)

[0.28571429 0.17473479 0.21583223 0.         0.         0.
 0.         0.         0.         0.        ]


We can also perform an L2 normalization:
$$
\hat{v}_{i} = \frac{v_{i}}{\sqrt{v_{1}^2+v_{2}^2+\dots+v_{n}^2}}
$$
This makes documents comparable even if they have different lengths
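
One way to see this: after L2 normalization, the dot product of two vectors equals their cosine similarity, which depends only on direction and not on magnitude. The sketch below uses plain Python lists and made-up vectors; `l2_normalize_list` and `dot` are hypothetical helpers, separate from the NumPy function in this section.

```python
import math

def l2_normalize_list(vec):
    """Scale a list of numbers to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)  # avoid dividing by 0
    return [x / norm for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

short_doc = [2.0, 1.0, 0.0]    # made-up weights for a short document
long_doc = [20.0, 10.0, 0.0]   # same proportions, ten times larger
other_doc = [0.0, 1.0, 3.0]    # different word distribution

u = l2_normalize_list(short_doc)
v = l2_normalize_list(long_doc)
w = l2_normalize_list(other_doc)

print(dot(u, v))  # close to 1: same direction despite different lengths
print(dot(u, w))  # much smaller: different content
```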

In [35]:
def l2_normalize(vec):
    norm = np.sqrt(np.sum(vec ** 2))
    if norm == 0:
        return vec  # avoid div by 0
    return vec / norm

tfidf_normalized = l2_normalize(tfidf_matrix)
print(tfidf_normalized)


[0.71709584 0.43855558 0.54170339 0.         0.         0.
 0.         0.         0.         0.        ]


### Embeddings
The representations above describe which words a document contains, but they cannot be used to predict the next word

The methods above are useful for topic detection, text classification, and similar tasks.

To predict the next word, we need to capture the relationship between the text that came before and the word being predicted

To do that 