## Representing text

If we want to solve NLP tasks with neural networks, we need some way to represent text as tensors. Computers already represent textual characters as numbers, using encodings such as ASCII or UTF-8, and map those numbers to glyphs in the fonts on your screen.

<figure><img src="https://hostux.social/system/media_attachments/files/110/744/867/966/745/527/original/2da950ca853de826.png" alt="" width="1000"><figcaption><p>Source: Microsoft Learning</p></figcaption></figure>
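
As a small illustration of characters being stored as numbers, Python's built-ins expose these codes directly. This is only a sketch; the variable names are chosen for illustration.

```python
# Each character maps to a numeric code point; an encoding turns code points into bytes.
text = "Hello"

code_points = [ord(ch) for ch in text]      # Unicode code points
utf8_bytes = list(text.encode("utf-8"))     # byte values under UTF-8

print(code_points)
print(utf8_bytes)

# For plain ASCII characters the UTF-8 bytes equal the code points;
# a non-ASCII character (here e with an acute accent) needs more than one byte.
accented = "\u00e9"
print(list(accented.encode("utf-8")))
```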

We understand what each letter **represents**, and how all characters come together to form the words of a sentence. However, computers by themselves do not have such an understanding, and a neural network has to learn the meaning during training.

Therefore, we can use different approaches when representing text:

* **Character-level representation**, in which we represent text by treating each character as a number. Given that we have $C$ different characters in our text corpus, the word *Hello* would be represented by a $5 \times C$ tensor: each letter corresponds to one row, one-hot encoded across the $C$ columns.
* **Word-level representation**, in which we create a **vocabulary** of all words in our text, and then represent words using one-hot encoding. This approach is somewhat better, because each letter by itself does not have much meaning, and thus by using higher-level semantic concepts (words) we simplify the task for the neural network. However, given the large dictionary size, we need to deal with high-dimensional sparse tensors.
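
A minimal sketch of the character-level case, using plain Python lists in place of tensors. The toy alphabet here is built from the word itself, so $C = 4$; in practice $C$ would be the number of distinct characters in the whole corpus. All names are illustrative.

```python
word = "Hello"

# Toy "corpus alphabet": the distinct characters of the word, in sorted order.
alphabet = sorted(set(word))
char_to_index = {ch: i for i, ch in enumerate(alphabet)}
C = len(alphabet)

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]

# One row per character, one column per alphabet entry: a 5 x C matrix.
encoded = [one_hot(char_to_index[ch], C) for ch in word]

for ch, row in zip(word, encoded):
    print(ch, row)
```

The same construction applies at word level, with the alphabet replaced by a word vocabulary, which is why the rows become very long and sparse.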


To unify those approaches, we typically call an atomic piece of text a **token**. In some cases tokens can be letters, in other cases words or parts of words.

> For example, we can choose to tokenize `indivisible` as `in #divis #ible`, where the # sign indicates that the token is a continuation of the previous word. This would allow the root `divis` to always be represented by one token, corresponding to one core meaning.

The process of converting text into a sequence of tokens is called **tokenization**. Next, we need to assign a number to each token, which we can then feed into a neural network. This is called **vectorization**, and is normally done by building a token vocabulary.
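
The two steps can be sketched without any library. The regular-expression tokenizer and the helper names below are illustrative assumptions, not the tokenizer used later in this notebook.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

toy_corpus = ["The dog ran fast.", "The dog likes hot dogs."]

# Tokenization: text -> list of tokens.
tokenized = [simple_tokenize(doc) for doc in toy_corpus]

# Vocabulary: every distinct token gets an integer id, in order of first appearance.
token_counts = Counter(tok for doc in tokenized for tok in doc)
token_to_id = {tok: i for i, tok in enumerate(token_counts)}

# Vectorization: tokens -> ids.
vectorized = [[token_to_id[tok] for tok in doc] for doc in tokenized]

print(tokenized[0])
print(vectorized[0])
print(vectorized[1])
```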

In [None]:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

In [None]:
pip install "portalocker>=2.0.0"

## Text classification task

We will start with a simple text classification task based on the **AG_NEWS** dataset. The task is to classify news headlines into one of four categories: World, Sports, Business or Sci/Tech. This dataset is built into `TorchText`, and we can easily download it by calling `torchtext.datasets.AG_NEWS`.

In [None]:
import torch
import torchtext
import os
import collections

os.makedirs('./data', exist_ok=True)
train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
classes = ['World', 'Sports', 'Business', 'Sci/Tech']

In [None]:
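# Check the platform: use the Apple Silicon GPU (MPS) when PyTorch reports it as available, otherwise the CPU
import os, platform

torch_device = 'mps' if platform.system() == 'Darwin' and torch.backends.mps.is_available() else 'cpu'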

In [None]:
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'

In [None]:
torch_device

Here, `train_dataset` and `test_dataset` are iterators that return pairs of a label (the class number) and the text, for example:

In [None]:
print(train_dataset)

Let's print out the first 5 news headlines from our dataset:

In [None]:
for i, x in zip(range(5), train_dataset):
    # Labels run from 1 to 4, so subtract 1 to index into `classes`
    print(f'**{classes[x[0]-1]}** -> {x[1]}\n')

Because the datasets are iterators, we need to convert them to lists if we want to use the data multiple times:

In [None]:
train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
train_dataset = list(train_dataset)
test_dataset = list(test_dataset)

## Tokenization and Vectorization

Now we need to convert text into numbers that can be represented as tensors. If we want a word-level representation, we need to do two things:
* convert the text to tokens (**tokenization**)
* build a **vocabulary** of those tokens

In [None]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
first_sentence = train_dataset[0][1]
second_sentence = train_dataset[1][1]

f_tokens = tokenizer(first_sentence)
s_tokens = tokenizer(second_sentence)

print(f'\nfirst token list: \n{f_tokens}')
print(f'\nsecond token list: \n{s_tokens}')

Next, to convert text to numbers, we will need to build a vocabulary of all tokens. We first count the tokens using a `Counter` object, and then create a `Vocab` object that helps us deal with vectorization:

In [None]:
counter = collections.Counter()
for (label, line) in train_dataset:
    counter.update(tokenizer(line))

vocab = torchtext.vocab.vocab(counter, min_freq=1)

To see how each word maps to the vocabulary, we loop through the tokens of each sentence and look up each token's index in `vocab`. Each word or character is displayed with its corresponding index. For example, the word `the` appears several times in both sentences, and every occurrence maps to the same unique index in the vocab.

In [None]:
word_lookup_f = [(vocab[w], w) for w in f_tokens]
print(f'\nIndex lookup in 1st sentence:\n{word_lookup_f}')

word_lookup_s = [(vocab[w], w) for w in s_tokens]
print(f'\nIndex lookup in 2nd sentence:\n{word_lookup_s}')

Using the vocabulary, we can easily encode our tokenized string into a sequence of numbers:

In [None]:
vocab_size = len(vocab)
print(f'Vocab size: {vocab_size}')

stoi = vocab.get_stoi()  # string-to-index dictionary, built once rather than on every lookup

def encode(text):
    return [stoi[s] for s in tokenizer(text)]

vec=encode(train_dataset[0][1])
print(vec)

The torchtext `vocab.get_stoi` method returns a dictionary that converts string representations into numbers (the name *stoi* stands for string-to-integer). To convert a numeric representation back into text, we can use the list returned by `vocab.get_itos` to perform the reverse lookup:

In [None]:
itos = vocab.get_itos()  # index-to-string list, built once

def decode(vec):
    return [itos[i] for i in vec]

print(decode(vec))

## Bag of Words text representation

Because words represent meaning, sometimes we can figure out the meaning of a text by just looking at the individual words, regardless of their order in the sentence. For example, when classifying news, words like `weather` and `snow` are likely to indicate `weather forecast`, while words like `stocks` and `dollar` would count towards `financial news`.

**Bag of Words** (BoW) vector representation is the most commonly used traditional vector representation. Each word is linked to a vector index, and the vector element at that index contains the number of occurrences of the word in a given document.

<figure><img src="https://hostux.social/system/media_attachments/files/110/745/067/186/920/286/original/18c9a8e5cd6c7dfb.png" alt="" width="1000"><figcaption><p>Source: Microsoft Learning</p></figcaption></figure>

> Note: You can also think of BoW as a sum of all one-hot-encoded vectors for individual words in the text.
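
The note above can be checked on a toy example in plain Python. The four-word vocabulary and the helper names are made up for illustration.

```python
toy_vocab = ["dog", "hot", "likes", "my"]
toy_index = {word: i for i, word in enumerate(toy_vocab)}

def one_hot_word(word):
    """One-hot vector for a single word of the toy vocabulary."""
    vec = [0] * len(toy_vocab)
    vec[toy_index[word]] = 1
    return vec

toy_tokens = ["my", "dog", "likes", "hot", "dog"]

# Sum the one-hot vectors element by element.
bow_from_one_hot = [sum(column) for column in zip(*(one_hot_word(t) for t in toy_tokens))]

# Count occurrences directly, for comparison.
bow_from_counts = [toy_tokens.count(word) for word in toy_vocab]

print(bow_from_one_hot)
print(bow_from_counts)
```

Both lists agree, because adding a one-hot vector increments exactly one position, which is the same as counting that word once.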

Below is an example of how to generate a bag-of-words representation using the Scikit Learn Python library:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
    'I like hot dogs.',
    'The dog ran fast.',
    'Its hot outside.',
]

vectorizer.fit_transform(corpus)
vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()

To compute a bag-of-words vector from the vector representation of our AG_NEWS dataset, we can use the following function:

> Note: Here we use the global `vocab_size` variable to specify the default size of the vocabulary. Since the vocabulary is often pretty big, we can limit it to the most frequent words. Try lowering the `vocab_size` value and running the code below, and see how it affects the accuracy. You should expect some drop in accuracy, but not a dramatic one, in exchange for higher performance.

In [None]:
vocab_size = len(vocab)

def to_bow(text, bow_vocab_size=vocab_size):
    """
    Convert text string to a bag-of-words tensor.
    """
    res = torch.zeros(bow_vocab_size, dtype=torch.float32)
    for i in encode(text):
        if i < bow_vocab_size:
            res[i] += 1
    return res

print(f'sample text:{train_dataset[0][1]}')
print(f'bow vector: {to_bow(train_dataset[0][1])}')
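
One caveat when lowering the vocabulary size: a cut-off of the form `i < bow_vocab_size` keeps the most frequent words only if token indices are ordered by frequency. Whether that holds for the `vocab` built above depends on how it was constructed, so it is worth verifying. The plain-Python sketch below, with a hypothetical toy counter and helper names, shows one way to obtain such an ordering with `Counter.most_common`.

```python
from collections import Counter

toy_counter = Counter("the cat sat on the mat the end".split())

# most_common() sorts tokens by descending count, so index 0 is the most frequent token.
freq_ordered = [tok for tok, _ in toy_counter.most_common()]
toy_stoi = {tok: i for i, tok in enumerate(freq_ordered)}

def toy_bow(tokens, size):
    """Count tokens whose index is below `size`; rarer tokens are dropped."""
    vec = [0] * size
    for tok in tokens:
        idx = toy_stoi[tok]
        if idx < size:
            vec[idx] += 1
    return vec

print(freq_ordered[0])
print(toy_bow("the cat sat on the mat".split(), size=3))
```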

## Training BoW classifier

Now that we have learned how to build a bag-of-words representation of our text, let's train a classifier on top of it.

First, we need to convert our dataset for training so that all positional vector representations are converted to bag-of-words representations. This can be achieved by passing the `bowify` function as the `collate_fn` parameter to the standard torch `DataLoader`:

In [None]:
from torch.utils.data import DataLoader

# This collate function gets a list of batch_size (label, text) tuples and returns a pair of label and feature tensors for the whole minibatch
def bowify(b):
    return(
        torch.LongTensor([t[0]-1 for t in b]),
        torch.stack([to_bow(t[1]) for t in b])
    )

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=bowify) # collate_fn is the function that merges the list of samples into a mini-batch
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=True, collate_fn=bowify)


Now let's define a simple classifier neural network that contains one linear layer. The size of the input vector equals `vocab_size`, and the output size corresponds to the number of classes (4). Because we are solving a classification task, the final activation function is `LogSoftmax`.

In [None]:
net = torch.nn.Sequential(
    torch.nn.Linear(vocab_size, 4),
    torch.nn.LogSoftmax(dim=1)
)

net=net.to(torch_device)
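
To make the role of `LogSoftmax` concrete, here is a plain-Python sketch of what it computes for a single example, together with the negative log-likelihood of the true class, which is the kind of loss usually paired with it. The scores are made-up numbers and the function names are illustrative.

```python
import math

def log_softmax(scores):
    """log(exp(s_k) / sum_j exp(s_j)), computed stably by subtracting the max score."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

toy_scores = [2.0, 0.5, -1.0, 0.0]   # one raw score per class (World, Sports, Business, Sci/Tech)
toy_log_probs = log_softmax(toy_scores)

print([round(math.exp(lp), 3) for lp in toy_log_probs])  # class probabilities
print(round(nll(toy_log_probs, target=0), 3))            # loss if the true class is index 0
```

The exponentiated outputs sum to 1, and the loss shrinks as the score of the true class grows relative to the others.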

Now we will define a standard PyTorch training loop. Because our dataset is quite large, we will train for only one epoch, and sometimes for even less than an epoch (the `epoch_size` parameter allows us to limit training). We also report the accumulated training accuracy during training; the reporting frequency is specified using the `report_freq` parameter.

In [None]:
def train_epoch(net, dataloader, lr=0.01, optimizer=None, loss_fn=torch.nn.NLLLoss(), epoch_size=None, report_freq=200):
    optimizer = optimizer or torch.optim.Adam(net.parameters(), lr=lr)
    net.train()
    total_loss, acc, count, i = 0, 0, 0, 0
    for labels, features in dataloader:
        labels, features = labels.to(torch_device), features.to(torch_device)
        optimizer.zero_grad()
        output = net(features)
        loss = loss_fn(output, labels)
        loss.backward()
        optimizer.step()
        # item() extracts a Python number, so the running totals do not keep the autograd graph alive
        total_loss += loss.item()
        _, predicted = torch.max(output, 1)
        acc += (predicted == labels).sum().item()
        count += len(labels)
        i += 1
        if i % report_freq == 0:
            print(f'iteration {count}, loss {total_loss/count}, accuracy {acc/count}')
        if epoch_size and count >= epoch_size:
            break

    return total_loss/count, acc/count

# epoch_size=1 stops after the first minibatch (a quick test); raise it or omit it for real training
train_epoch(net, train_loader, epoch_size=1)

## BiGrams, TriGrams and N-Grams

One limitation of the bag-of-words approach is that some words are part of multi-word expressions. For example, the expression `hot dog` has a completely different meaning than the words `hot` and `dog` in other contexts. If we always represent the words `hot` and `dog` by the same vectors, it can confuse our model.

To address this, **N-gram representations** are often used in document classification, where the frequency of each word, **bi-word** or **tri-word** is a useful feature for training classifiers.

* In a bigram representation, for example, we add all word pairs to the vocabulary, in addition to the original words.
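
A minimal plain-Python sketch of that idea: keep the single tokens and add every pair of adjacent tokens. The function name and the choice of joining pairs with a space are illustrative.

```python
def unigrams_and_bigrams(tokens):
    """Return the single tokens followed by all pairs of adjacent tokens joined by a space."""
    pairs = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return list(tokens) + pairs

print(unigrams_and_bigrams(["i", "like", "hot", "dogs"]))
```

With this representation `hot dogs` becomes a feature of its own, separate from `hot` and `dogs`.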

Below is an example of how to generate a bigram bag-of-words representation using Scikit Learn:

In [None]:
bigram_vectorizer = CountVectorizer(ngram_range=(1,2), token_pattern=r'\b\w+\b', min_df=1)
corpus = [
    'I like hot dogs.',
    'The dog ran fast.',
    'Its hot outside.',
]

bigram_vectorizer.fit_transform(corpus)
print("Vocabulary: ", bigram_vectorizer.vocabulary_)
bigram_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()

The **main drawback of the N-gram approach** is that the vocabulary size starts to grow extremely fast. In practice, we need to combine the N-gram representation with some dimensionality reduction technique, such as embeddings, which we will discuss in the next notebook.

To use an N-gram representation on our **AG News** dataset, we need to build a special n-gram vocabulary:

In [None]:
counter = collections.Counter()
for (label, line) in train_dataset:
    l = tokenizer(line)
    counter.update(torchtext.data.utils.ngrams_iterator(l, ngrams=2))

bi_vocab = torchtext.vocab.vocab(counter, min_freq=1)

print(f'Bigram vocabulary length: {len(bi_vocab)}')

We could then use the same code as above to train the classifier; however, it would be very memory-inefficient. In the next notebook, we will train a bigram classifier using embeddings.

> Note: We can keep only those n-grams that occur in the text at least a specified number of times. This ensures that infrequent bigrams are omitted, and decreases the dimensionality significantly. To do this, set the `min_freq` parameter to a higher value, and observe how the length of the vocabulary changes.
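
The effect of such a threshold can be sketched in plain Python. The counts below are invented, and `filter_by_min_freq` is a hypothetical helper that keeps entries occurring at least `min_freq` times.

```python
from collections import Counter

toy_ngram_counts = Counter({"hot": 3, "dog": 2, "hot dog": 2, "dog ran": 1, "ran fast": 1})

def filter_by_min_freq(counts, min_freq):
    """Keep only the entries that occur at least `min_freq` times."""
    return {ngram: c for ngram, c in counts.items() if c >= min_freq}

for threshold in (1, 2, 3):
    kept = filter_by_min_freq(toy_ngram_counts, threshold)
    print(threshold, len(kept), sorted(kept))
```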

## Term Frequency-Inverse Document Frequency (TF-IDF)

In the BoW representation, word occurrences are evenly weighted, regardless of the word itself. However, it is clear that frequent words, such as *a*, *in*, etc., are much less important for classification than specialized terms. In fact, in most NLP tasks some words are more relevant than others.

**TF-IDF** stands for **term frequency-inverse document frequency**. It is a variation of bag of words, where instead of a count or a binary 0/1 value indicating the appearance of a word in a document, a floating-point value is used, which is related to the frequency of word occurrence in the corpus.

More formally, the weight $w_{ij}$ of a word $i$ in the document $j$ is defined as:

$$w_{ij} = tf_{ij} \times \log({N \over df_i})$$

where
* $i$ is the word
* $j$ is the document
* $w_{ij}$ is the weight or the importance of the word in the document
* $tf_{ij}$ is the number of occurrences of word $i$ in the document $j$, i.e. the BoW value we have seen before
* $N$ is the number of documents in the collection
* $df_i$ is the number of documents containing the word $i$ in the whole collection

The TF-IDF value $w_{ij}$ increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently than others. For example, if a word appears in every document in the collection, then $df_{i}=N$ and $w_{ij} = 0$, so such terms are completely disregarded.
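
A worked example of the formula on a toy collection of three tokenized documents, in plain Python. The documents and function names are made up for illustration, and the natural logarithm is assumed.

```python
import math

toy_docs = [
    ["the", "dog", "ran"],
    ["the", "hot", "dog"],
    ["the", "weather", "is", "hot", "hot"],
]
n_docs = len(toy_docs)

def doc_freq(word):
    """Number of documents that contain `word`."""
    return sum(1 for doc in toy_docs if word in doc)

def tf_idf_weight(word, doc):
    """w = tf * log(N / df), with tf the raw count of `word` in `doc`."""
    tf = doc.count(word)
    return tf * math.log(n_docs / doc_freq(word))

last_doc = toy_docs[2]
for word in ["the", "hot", "weather"]:
    print(word, round(tf_idf_weight(word, last_doc), 3))
```

Here `the` occurs in every document, so its weight is 0, while `weather`, which occurs in only one document, gets a higher weight than `hot` even though `hot` appears twice.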

We can easily create TF-IDF vectorization of text using Scikit Learn:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectorizer.fit_transform(corpus)
vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()

Or we can implement it ourselves. First, let's compute the document frequency $df_{i}$ for each word $i$. We can represent it as a tensor of size `vocab_size`. To speed up processing, we limit the computation to the first `N` documents; the cell below uses a very small `N` just to test the whole process, and a larger value such as `N=1000` gives more representative frequencies. For each input sentence, we compute the set of words (represented by their numbers), and increase the corresponding counters:

In [None]:
N = 5 # small value for testing the whole process
df = torch.zeros(vocab_size)
for _, line in train_dataset[:N]:
    for i in set(encode(line)):
        df[i] += 1

Now that we have document frequencies for each word, we can define a `tf_idf` function that takes a string and produces a TF-IDF vector. We will use `to_bow` defined above to calculate the term frequency vector, and multiply it by the inverse document frequency of the corresponding term. Remember that all tensor operations are element-wise, which allows us to implement the whole computation as a tensor formula:

> Here we use $\log({N+1\over df_i+1})$ instead of $\log({N\over df_i})$. This yields similar results, but prevents division by 0 in those cases when $df_i=0$.

In [None]:
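def tf_idf(s):
    bow = to_bow(s)
    # element-wise product of term frequencies and smoothed inverse document frequencies
    return bow * torch.log((N+1) / (df+1))

print(tf_idf(train_dataset[0][1]))  # with such a small N, the weights are not meaningful yet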

However, even though TF-IDF representations assign frequency-based weights to different words, they are unable to represent meaning or order. As the famous linguist J.R. Firth said in 1935, "The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously." We will learn in later notebooks how to capture contextual information from text using language modeling.