In [None]:
import numpy as np
import pandas as pd
import sentence_transformers
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Text vectorisation

For most computational analysis of text, you need an appropriate way to represent your text as numbers. Typically this is done with an embedding: the text is represented as a vector, which acts as a geometric representation of its meaning. Here we are going to discuss a few different ways to embed texts, along with their strengths and weaknesses.

This overview uses a collection of online news articles from October 2021 that use either of the terms `climate change` or `global warming`.

In [None]:
df_news = pd.read_json('data/cc_gw_news_blogs_2021-10-01_2021-10-31.json')

The simplest way to represent text as a vector is a `bag of words` approach. This approach defines a corpus of words, corresponding to the elements of a vector, and counts their frequency within a text.
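As a minimal sketch of the idea, the cell below builds count vectors by hand for two made-up sentences (they are illustrative and not taken from the news dataset). The names `toy_texts`, `vocab` and `bow_vectors` are assumptions introduced for this example only.

```python
from collections import Counter

# Two made-up texts (not from the news dataset)
toy_texts = ["the cat sat on the mat", "the dog sat"]

# The distinct words define the elements of the vector
vocab = sorted(set(" ".join(toy_texts).split()))

# Each text becomes a vector of word counts, one entry per word in vocab
bow_vectors = []
for text in toy_texts:
    counts = Counter(text.split())
    bow_vectors.append([counts[word] for word in vocab])

print(vocab)
for vector in bow_vectors:
    print(vector)
```

Words that a text does not use get a count of `0`, which is why these vectors quickly fill with zeros as the word list grows.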

Let's look at this for the first five headlines. Here the corpus will be all words that appear in the headlines.

In [None]:
headlines = [df_news.title[i].lower() for i in range(5)]
corpus = list(set(' '.join(headlines).split()))
print(corpus)

There are also standard tools available to handle this through libraries like `sklearn`.

In [None]:
vectorizer = CountVectorizer()
vectorizer.fit_transform(headlines)
print(vectorizer.get_feature_names_out())

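Notice how we have slightly different terms in our corpus - this is because these standard libraries automatically include a range of preprocessing steps to streamline the text analysis process. By default, `CountVectorizer` lowercases the text and keeps only tokens of two or more alphanumeric characters, so punctuation and single-character words are dropped.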
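One way to inspect this preprocessing is to call the vectoriser's analyser directly. The sketch below uses a made-up sentence (not from the dataset) and the default `CountVectorizer` settings; `sample` and `tokens` are names introduced for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function CountVectorizer uses to turn raw text into tokens
analyzer = CountVectorizer().build_analyzer()

sample = "Is the U.S. ready for Climate Change?"  # made-up example text
tokens = analyzer(sample)
print(tokens)
```

Comparing `tokens` with `sample.lower().split()` shows where the two approaches diverge, for example on punctuation attached to words.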

Also, see how the corpus is already getting large - and you might notice that some of the included words carry little information, such as `in`, `to`, `or`. We call these stopwords and have standard methods (and default lists) to remove them.

In [None]:
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(headlines)
print(vectorizer.get_feature_names_out())

We can also provide custom lists of stopwords - it makes sense to ignore `climate` since it features in our search terms.

In [None]:
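from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # public import of sklearn's built-in English stopword list
stop = list(ENGLISH_STOP_WORDS) + ['climate']  # extend the default list with our search term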
vectorizer = CountVectorizer(stop_words=stop)
X = vectorizer.fit_transform(headlines)
print(vectorizer.get_feature_names_out())

Note that `sklearn` only provides English language stopwords, but other libraries have similar tools with a wider range of languages. Let's now look at the vector representations of the headlines.

In [None]:
print(X.toarray())  ## The normal representation is more memory efficient, but this is easier for a human to read.

There are a few points to note about bag of words representations. Firstly, the representation is typically quite sparse: most entries in the array are `0`, because each headline only uses a few of the words in the corpus. It's also rare for any of the entries to be greater than `1` - this is a feature of choosing short headlines as our corpus. The last issue to consider is that we lose all relationships between words besides appearing in the same headline - we don't know if words are consecutive or at either end of the text. However, this vectorisation does allow us to quickly compare texts for linguistic overlap.
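To make the points about sparsity and overlap concrete, here is a small sketch on made-up headlines (assumptions for illustration, not rows from `df_news`). It measures the fraction of zero entries and uses the dot product between count vectors as a rough overlap score.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up headlines for illustration
toy_headlines = [
    "climate summit opens in glasgow",
    "glasgow hosts climate summit",
    "the cat ate its supper",
]

X_toy = CountVectorizer().fit_transform(toy_headlines)

# Fraction of entries in the count matrix that are zero
n_rows, n_cols = X_toy.shape
sparsity = 1 - X_toy.nnz / (n_rows * n_cols)
print(f"sparsity: {sparsity:.2f}")

# Dot products between count vectors; when no word repeats within a headline,
# entry [i, j] is the number of words shared by headlines i and j
overlap = (X_toy @ X_toy.T).toarray()
print(overlap)
```

The first two headlines share several words and so score higher with each other than with the third, even though word order is ignored.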

The next embedding technique we're going to look at is called term frequency-inverse document frequency (`TF-IDF`). TF-IDF is a great way to compare texts and identify which words are more important to a text. It does this by weighting the term frequency calculated in the bag of words representation by the inverse document frequency (`IDF`). The IDF measures how rare a term is across all the texts you're comparing: the more texts a term appears in, the lower its IDF. By adding this factor, we penalise terms that are common to many texts, as we expect their frequency to be naturally higher. We calculate this representation in a similar way.

In [None]:
tf_vectorizer = TfidfVectorizer(stop_words=stop)
X_tf = tf_vectorizer.fit_transform(headlines)
print(tf_vectorizer.get_feature_names_out())
print(X_tf.toarray())

While there is a limit to the conclusions that we can draw from such a small corpus, it does show the value of TF-IDF when comparing texts. Looking at the third row, we can see that each of its non-zero features is scored highly - this tells us that the words used in that title rarely appear in the other headlines we consider.

So far, the embedding schemes we've considered have maintained a direct link between the words in the text and the values in the representation. This can become unmanageable with larger corpora. Let's look at an example with the first 1000 article bodies.

In [None]:
bodies = [df_news.body[i] for i in range(1000)]  # article bodies, kept separate from `headlines`
X_tf = tf_vectorizer.fit_transform(bodies)
print(X_tf.toarray().shape)

Now we have vectors of 22,652 numbers for each piece of text in our corpus - and this issue only increases when you consider longer texts. Vectors of this size can make many computational methods very slow, so we need to reduce them.
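Before turning to other representations, note that one simple way to shrink TF-IDF vectors is to prune the vocabulary. This is an aside rather than part of the notebook's main approach: the sketch uses made-up texts and the `min_df` and `max_features` parameters of `TfidfVectorizer`, with threshold values chosen purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up texts for illustration
toy_docs = [
    "climate change policy",
    "climate summit policy",
    "global warming summit",
    "climate report",
]

# No pruning: one dimension per distinct term
full = TfidfVectorizer().fit_transform(toy_docs)

# min_df=2 keeps only terms that appear in at least two texts
pruned = TfidfVectorizer(min_df=2).fit_transform(toy_docs)

# max_features=2 keeps only the two most frequent terms across all texts
capped = TfidfVectorizer(max_features=2).fit_transform(toy_docs)

print(full.shape, pruned.shape, capped.shape)
```

Pruning reduces the number of dimensions but still ties each one to a single word, so it does not capture similarity between different words with related meanings.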

# Text embeddings

Text embeddings are the next step in this process. They take high-dimensional vector representations like we've seen in the previous cases and *embed* them in lower-dimensional spaces. These spaces are typically trained on large volumes of text and can capture meaning in the proximity of the vectors.

There are a number of different pre-trained models for deriving these embeddings. Here we'll look at one of the leading tools: `BERT`. BERT was one of the first large language models to be made publicly available and was trained on a large corpus of text drawn from English Wikipedia and a collection of books. Since its release, there have been a number of updated and fine-tuned models made available that leverage the general understanding of BERT.

Many of these models are included in the `sentence_transformers` library and detailed on [Hugging Face](https://huggingface.co/). Here, we'll demonstrate a small, fast model, but by updating the model name it is easy to access any of the available options.

In [None]:
sentence_model = sentence_transformers.SentenceTransformer('all-MiniLM-L6-v2')
embeddings = sentence_model.encode(df_news.title[:1000].tolist(), show_progress_bar=True)  # pass a plain list of strings
print(embeddings.shape)

These embeddings correspond to a 384-dimensional vector space in which texts that are close together are close in meaning. Note that this 384-dimensional space is much smaller than the 22,652 dimensions we found with TF-IDF, even if this model is a little slower to produce embeddings.

There are a few pros and cons to be aware of with text embeddings. Since the model we're applying is pre-trained, adding new data is easy and only needs computation on the new data (compared to TF-IDF, which needs the IDF values to be updated). In addition, the notion of semantic similarity in this space makes it easier to compare the meaning of texts. The drawbacks to embeddings come in the representation. Unlike the sparse bag of words and TF-IDF representations, BERT embeddings are dense, with almost every entry non-zero, and can therefore become large for significant corpora. The dimensions of the vector space are also removed from the underlying meaning of the words - they are instead *latent* dimensions that collect the relevance of many underlying words.

One other caveat to note is that some BERT models have limits to the length of text that they accept as an input. You may not encounter this with your data, but be aware that splitting longer texts may give more reliable measures of the semantic information in the text - the examples here are *sentence transformers*, designed to work on shorter texts.
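One simple way to split longer texts is to cut them into fixed-size groups of words before embedding. The helper below, `chunk_words`, is a hypothetical function written for this sketch (it is not part of `sentence_transformers`), and the limit of 100 words is an arbitrary assumption. Models count sub-word tokens rather than words, so a word limit is only a rough proxy for the model's actual input limit.

```python
def chunk_words(text, max_words=100):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


long_text = "word " * 250  # stand-in for a long article body
chunks = chunk_words(long_text, max_words=100)

# Number of words in each chunk
print([len(chunk.split()) for chunk in chunks])
```

Each chunk could then be embedded separately, for example by passing `chunks` to the model's `encode` method, and the resulting vectors compared or averaged depending on the task.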

The notion of semantic similarity in the text embeddings means that we can determine if two texts have similar meaning by taking the cosine similarity of their embedding vectors. This measures their proximity in the embedding space - which is trained such that texts with similar meanings should be close. Cosine similarity ranges from -1 to 1: identical texts will have a cosine similarity of 1, whereas unrelated texts will have a cosine similarity close to 0.
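Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. The sketch below computes it by hand on small made-up vectors (not real embeddings) and compares the result with `sklearn`; the function name `cosine` is introduced for this example.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the length
c = np.array([3.0, 0.0, -1.0])  # orthogonal to a (dot product is zero)

print(cosine(a, b))  # approximately 1: vector length does not matter
print(cosine(a, c))  # 0: orthogonal vectors

# sklearn expects 2D arrays with one row per vector
print(cosine_similarity([a], [b])[0, 0])
```

Because only direction matters, a long text and a short text can still score as very similar if their embeddings point the same way.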

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
sents = ['I took a walk down the street.', 'I strolled along the road.', 'The cat ate its supper.']
embs = sentence_model.encode(sents, show_progress_bar=True)
print(cosine_similarity(embs))

## Exercises

Here are some exercises to practise deriving and comparing representations of texts within a corpus.

Find the five most similar headlines among the first 1000 headlines to the provided text.

In [None]:
ref_text = ['President Biden visited other G7 leaders in London']

Which of the first 1000 articles uses *climate* the most?