> # Pre-Lab Instructions
> <img src="https://github.com/Minyall/sc207_290_public/blob/main/images/attention.webp?raw=true" height=200>

> For this lab you will need:
> - DATA: `farage_reform_dataset.parquet` - Download from Moodle and upload to this Colab session.
> - INSTALL: You will need to install `embedding-atlas`. Use the cell below.

In [None]:
# Uncomment the line below and run 
# ! pip install embedding-atlas

# Let it completely finish before moving on

<img src="https://github.com/Minyall/sc207_290_public/blob/main/images/vector.png?raw=true" height=300 align="right">

# Vectorise!

How do computers 'understand' text, and what can we do with that understanding? Having learned to clean and tokenise our text last week, this week we're going to get to grips with:
1. Different ways of representing texts
2. How to use those representations to summarise texts, individually and as groups
3. How those representations can be used to determine the similarity of texts, and identify groups of them.

Understanding this will help you understand what is happening when we do topic modelling in the next session, as our topic modelling process uses these fundamentals.

Broadly, this session is about how you translate a text into a "vector" (a series of numbers representing the text), the different approaches you can use to do that, and what you can do with vectorised texts.



In [54]:
# We start with Pandas and we'll introduce additional libraries as we use them
import pandas as pd

<img src="https://github.com/Minyall/sc207_290_public/blob/main/images/Scikit_learn_logo_small.svg?raw=true" height=100 align="right">

# 1. Representing Text

We'll use this section to learn about the 'Vectorizer' tools provided by the library [Scikit-Learn](https://scikit-learn.org/stable/). Scikit-Learn was originally intended to be a library for teaching machine learning techniques. Machine learning is a branch of computer science that is about training computers to find patterns, make predictions, classify data and adapt without human intervention. It is a very broad field. These days Scikit-Learn has become a key library for researchers and the commercial sector wanting to use machine learning for their analysis.


## 1.1 Counts
The most fundamental way to represent texts as numbers is to count the frequency of words. It makes sense to consider that if a document mentions a word a lot, it is probably about the thing the word represents. 


In [55]:
#*
# A very simple nonsensical example for us to grasp the basics.
test_corpus = ["Apple, Apple, Apple, Pear",
               "Apple, Apple, Apple, Bannana",
               "Pear, Pear, Pear, Bannana",
               "Pear, Pear, Pear, Apple"]

In [56]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

matrix = cv.fit_transform(test_corpus)
matrix


<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 8 stored elements and shape (4, 3)>

The 'features' are the unique words across the entire corpus. Whilst they are used multiple times, there are only three unique words.

In [57]:
cv.get_feature_names_out()

array(['apple', 'bannana', 'pear'], dtype=object)

To actually see the matrix we convert it to a 'dense' format - essentially showing a frequency count for each "feature" per document.

In [58]:
matrix.todense()

matrix([[3, 0, 1],
        [3, 1, 0],
        [0, 1, 3],
        [1, 0, 3]])

We can make this easier to interpret by laying it out in a dataframe

In [59]:

fruit_vectors = pd.DataFrame(matrix.todense(), columns=cv.get_feature_names_out())
fruit_vectors

Unnamed: 0,apple,bannana,pear
0,3,0,1
1,3,1,0
2,0,1,3
3,1,0,3


For each document, we have a row of numbers, a vector. The numbers act as a kind of 'signature', representing the extent to which each document expresses different words.

- The vector for the first document is `[3,0,1]`, for the second it is `[3,1,0]`.
- For the third it is `[0,1,3]` and the fourth `[1,0,3]`. 

- You might already get a sense that we can compare these documents just by their vectors. That the first two vectors have most of their weight in the first feature, whilst the last two have more of the weight in the last feature. If we took away the words themselves, just based on these numbers you might be able to make a good guess as to which documents were more alike. We'll come back to this later.
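
To make the counting step less of a black box, the sketch below rebuilds the same vectors by hand using only the Python standard library. It is an illustration rather than Scikit-Learn's actual implementation: `simple_tokenise`, `toy_corpus` and `hand_vectors` are names invented here, and the regular expression is an assumption chosen to mimic `CountVectorizer`'s default behaviour of lowercasing and keeping tokens of two or more characters.

```python
import re
from collections import Counter

# The same toy corpus as above, repeated so this sketch runs on its own.
toy_corpus = ["Apple, Apple, Apple, Pear",
              "Apple, Apple, Apple, Bannana",
              "Pear, Pear, Pear, Bannana",
              "Pear, Pear, Pear, Apple"]

def simple_tokenise(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# One Counter of word frequencies per document.
doc_counts = [Counter(simple_tokenise(doc)) for doc in toy_corpus]

# The features are the sorted unique words across the whole corpus.
vocabulary = sorted(set(word for counts in doc_counts for word in counts))

# Each document becomes a row: one count per feature, 0 if the word is absent.
hand_vectors = [[counts[word] for word in vocabulary] for counts in doc_counts]

print(vocabulary)
for row in hand_vectors:
    print(row)
```

The printed rows should match the dataframe above, which shows that a count vector is nothing more than a word tally laid out in a fixed column order.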

Let's move on to proper data. We'll see what it looks like vectorising the articles we cleaned.

In [65]:
articles = pd.read_parquet('farage_reform_dataset.parquet')
articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   id                  500 non-null    object             
 1   type                500 non-null    object             
 2   sectionId           500 non-null    object             
 3   sectionName         500 non-null    object             
 4   webPublicationDate  500 non-null    datetime64[ns, UTC]
 5   webTitle            500 non-null    object             
 6   webUrl              500 non-null    object             
 7   apiUrl              500 non-null    object             
 8   tags                500 non-null    object             
 9   isHosted            500 non-null    bool               
 10  pillarId            500 non-null    object             
 11  pillarName          500 non-null    object             
 12  byline              500 non-null    

In [66]:
cv = CountVectorizer()
article_vectors = cv.fit_transform(articles['cleaned_text'])
article_vectors

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 217688 stored elements and shape (500, 21041)>

As we can see, there are more than 21,000 unique features, which would create a dataframe of 21K columns. Remember when we pre-processed our text? We took a number of steps to reduce the range of variation in the words used, such as lemmatisation. If we use our tokens instead, does this improve?

In [67]:
cv = CountVectorizer() # It's important to recreate the vectoriser because 
                        # running .fit_transform updates and changes the vectorizer object we call cv
article_vectors = cv.fit_transform(articles['tokens'])
article_vectors

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 151326 stored elements and shape (500, 15555)>

We've reduced by a few thousand unique features. Remember a feature exists if the word is used once or more across all documents. However, how useful is a word that is only ever used once? Also, are words informative or representative of a document if that word is used in every single document? Scikit vectorisers have a few features that can help us to reduce this range of features further.

> ### Filtering Features
> - `stop_words`: Whilst we used Spacy's stop word filters, Sklearn have their own additional list which we can apply at the vectorisation stage.
> - `min_df`: Minimum document frequency. The proportion of documents a token must occur in to be included. Filters out very low frequency words.
> - `max_df`: Maximum document frequency. If a token occurs in too many documents it can be excluded.
> If we provide an integer it represents a set minimum or maximum number of documents. Providing a float between 0.0 - 1.0 indicates a proportion.
> - `min_df=5` means any feature that occurs in fewer than 5 documents will be excluded.
> - `min_df=0.5` means any feature that occurs in fewer than 50% of the documents will be excluded.
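
The filtering rules above can be sketched in plain Python. This is a simplified illustration, not Scikit-Learn's code: `filter_vocabulary` and the mini corpus `toy_docs` are hypothetical, and the sketch assumes the documents are already tokenised into lists of words.

```python
from collections import Counter

# Hypothetical, already-tokenised mini corpus (not the lab dataset).
toy_docs = [["farage", "reform", "party"],
            ["party", "labour", "vote"],
            ["party", "reform", "poll"],
            ["party", "labour", "budget"]]

def filter_vocabulary(tokenised_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies between min_df and max_df."""
    n_docs = len(tokenised_docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenised_docs for term in set(doc))
    # Integers are document counts; floats are proportions of the corpus.
    min_count = min_df if isinstance(min_df, int) else min_df * n_docs
    max_count = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_count <= df <= max_count)

kept = filter_vocabulary(toy_docs, min_df=2, max_df=0.95)
print(kept)
```

Here `party` is dropped because it appears in all four documents (more than 95% of them), and the words used in only one document are dropped by `min_df=2`, leaving `labour` and `reform`.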


In [68]:
cv = CountVectorizer(stop_words='english', min_df=5, max_df=0.95)
article_vectors = cv.fit_transform(articles['tokens'])
article_vectors

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 127898 stored elements and shape (500, 4610)>

Those filters have massively reduced the number of features, removing many that would not be informative in representing the text. We'll wrap this in a dataframe for easier interpretation.

In [69]:

article_vectors = pd.DataFrame(article_vectors.todense(), columns=cv.get_feature_names_out())
article_vectors

Unnamed: 0,aamna,abandon,abbott,abide,ability,able,abolish,abolition,abortion,abroad,...,youtube,yusuf,yvette,zack,zarah,zealand,zelenskyy,zero,zia,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
496,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
497,0,0,0,0,1,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
498,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


We can get an overall picture of the most common words across the corpus...

In [70]:
# Top 10 words overall
article_vectors.sum().sort_values(ascending=False).head(10)

party         2363
labour        1917
uk            1867
people        1445
government    1383
year          1370
election      1036
trump         1034
right          887
starmer        879
dtype: int64

In [71]:
articles['sectionName'].unique()

array(['UK news', 'Technology', 'US news', 'Politics', 'Business',
       'World news', 'Society', 'Media', 'Environment', 'News',
       'Education', 'Law', 'Australia news', 'Global development',
       'Science'], dtype=object)

In [72]:
# Top 10 words per group of texts
SECTION = "Society"
group_indexes = articles.groupby('sectionName').get_group(SECTION).index
article_vectors.loc[group_indexes].sum().sort_values(ascending=False).head(10)

labour        48
change        46
government    45
vote          38
people        35
child         34
minister      33
year          33
mp            31
law           30
dtype: int64

In [73]:
# Top 10 words per individual document
STORY_INDEX = 1
print(articles.loc[STORY_INDEX,'webTitle'])
print()
print(article_vectors.loc[STORY_INDEX].sort_values(ascending=False).head(10))

UK traffic to popular porn sites slumps after age checks introduced

age             12
site             8
act              7
check            6
content          4
uk               4
child            3
people           3
introduction     3
visit            3
Name: 1, dtype: int64


## 1.2 TFIDF
So far we've been using simple frequencies to explore this topic as it helps if at least one part of the process is familiar! However, word frequencies aren't necessarily the best way to represent what a document is about. Just because a word is used a great deal, doesn't necessarily mean that word is the most representative.

Term Frequency Inverse Document Frequency (TFIDF) is an approach to measuring word frequency that can be thought of as giving higher scores to words of greater "significance".

TFIDF is not a simple word frequency, instead it assigns a word a score based on...

- The frequency of that word in a document
- How many other words are in that document
- How many documents are in the overall corpus
- How many of those documents also contain that word.

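#### The formula for those interested
- TFIDF = term frequency * inverse document frequency
- term frequency = number of occurrences of a term within a single document, sometimes divided by the number of terms in the document.
- inverse document frequency = number of documents within the entire corpus / number of documents the term occurs in. In practice the logarithm of this ratio is usually taken, and Scikit-Learn's `TfidfVectorizer` also applies smoothing and normalises each document's vector by default, so its exact scores differ from a hand calculation.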
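
As a worked example of the formula, here is a minimal hand calculation using only the standard library. It assumes the common textbook form (term frequency divided by document length, multiplied by the natural log of the document ratio), so the numbers will not match `TfidfVectorizer`, which uses a smoothed variant and normalisation. The documents in `tfidf_docs` and the function `tfidf_scores` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical tokenised documents, chosen so "party" appears in every one.
tfidf_docs = [["party", "party", "party", "age", "age"],
              ["party", "labour", "labour"],
              ["party", "election"]]

n_docs = len(tfidf_docs)
# Document frequency: the number of documents each term occurs in.
doc_freq = Counter(term for doc in tfidf_docs for term in set(doc))

def tfidf_scores(doc):
    counts = Counter(doc)
    scores = {}
    for term, count in counts.items():
        tf = count / len(doc)                    # share of the document's terms
        idf = math.log(n_docs / doc_freq[term])  # rarer across the corpus -> larger
        scores[term] = tf * idf
    return scores

scores = [tfidf_scores(doc) for doc in tfidf_docs]
print(scores[0])
```

In the first document `party` is the most frequent word, yet it scores 0 because it occurs in every document, while `age` scores higher because it is specific to that document. This is the sense in which TFIDF rewards "significance" rather than raw frequency.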

In [74]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english', min_df=5, max_df=0.95)
tfidf_vectors = tfidf.fit_transform(articles['tokens'])
tfidf_vectors = pd.DataFrame(tfidf_vectors.todense(), columns=tfidf.get_feature_names_out())
tfidf_vectors

Unnamed: 0,aamna,abandon,abbott,abide,ability,able,abolish,abolition,abortion,abroad,...,youtube,yusuf,yvette,zack,zarah,zealand,zelenskyy,zero,zia,zone
0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.047029,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000
1,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000
2,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.050636,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000
3,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.063075,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.059908
4,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.029727,0.0,0.000000
496,0.0,0.0,0.0,0.0,0.000000,0.034861,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.034022,0.0,0.000000
497,0.0,0.0,0.0,0.0,0.030104,0.045046,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000
498,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000


In [75]:
#*
# Comparing Top 10 words overall
tfidf_vectors.mean().sort_values(ascending=False).head(10)

party         0.065902
labour        0.057428
uk            0.043369
trump         0.039387
people        0.038527
government    0.037883
election      0.035478
year          0.033891
musk          0.030347
starmer       0.030153
dtype: float64

In [76]:
#*

# Top 10 words per group of texts
SECTION = "Society"
group_indexes = articles.groupby('sectionName').get_group(SECTION).index
tfidf_vectors.loc[group_indexes].sum().sort_values(ascending=False).head(10)

assist        1.061655
child         0.922160
change        0.882139
labour        0.698213
die           0.695323
vote          0.694621
government    0.688314
dying         0.687261
winter        0.670620
council       0.665862
dtype: float64

In [77]:
#*
# Top 10 words per individual document
STORY_INDEX = 1
print(articles.loc[STORY_INDEX,'webTitle'])
print()
print(tfidf_vectors.loc[STORY_INDEX].sort_values(ascending=False).head(10))

UK traffic to popular porn sites slumps after age checks introduced

age             0.428634
site            0.360651
check           0.259795
act             0.201273
content         0.186911
introduction    0.175696
traffic         0.172289
app             0.140184
pornography     0.129307
ofcom           0.125633
Name: 1, dtype: float64


In [78]:
#*
# We can compare the top tens

SECTION = "Society"
group_indexes = articles.groupby('sectionName').get_group(SECTION).index

#ignore this mess 
tfidf_words = tfidf_vectors.loc[group_indexes].sum().sort_values(ascending=False).head(10).reset_index(name='tfidf_score').rename(columns={'index':'tfidf_word'})
count_words = article_vectors.loc[group_indexes].sum().sort_values(ascending=False).head(10).reset_index(name='count').rename(columns={'index':'count_word'})
pd.concat([tfidf_words, count_words], axis=1)



Unnamed: 0,tfidf_word,tfidf_score,count_word,count
0,assist,1.061655,labour,48
1,child,0.92216,change,46
2,change,0.882139,government,45
3,labour,0.698213,vote,38
4,die,0.695323,people,35
5,vote,0.694621,child,34
6,government,0.688314,minister,33
7,dying,0.687261,year,33
8,winter,0.67062,mp,31
9,council,0.665862,law,30


## 1.3 Summary: Scoring for Text Representation
In this first half we've examined how we can translate texts into numerical form to allow us to summarise and quickly generate representations of them at the corpus, subset and individual level. Being able to take a collection of texts and generate a human-interpretable summary of what that group of texts is about is important in text analysis, particularly when we start using more complex analysis techniques to find those groups, i.e. topics or themes.

However, representing texts as numbers also allows us to find those groups of texts in the first place, by identifying which texts are more or less similar in their content. In the next half we'll work through how that works at a really basic level and finish with the more complex version that is being used in research and commercial contexts.


<img src="https://github.com/Minyall/sc207_290_public/blob/main/images/space_rabbit.png?raw=true" height=300 align="right">


# 2: Vectors for Similarity and Difference
We can also use vectors to identify whether documents are similar or dissimilar, allowing us to find groups, themes or topics across texts. The way the vectors are created has a direct impact on what 'similarity' means and what it can tell us.

First we'll work on understanding how a vector can convey similarity or difference. Then we'll finish by looking at *Embeddings*, a way of creating vectors that tries to capture the meaning of texts.

## 2.1 How do vectors express similarity and difference?

For this section we're going to use a contrived set of example texts to illustrate how vectors can be used to determine document similarity and identify groups of documents.

> #### Examples disclaimer
> <img src="https://github.com/Minyall/sc207_290_public/blob/main/images/marfusha.jpg?raw=true" height=100 align="right">
> The following examples make claims about the distinction between rabbits and astronauts. These claims are purely for teaching purposes. Neither the instructor nor the institution makes any binding claims about the capabilities of rabbits in general to traverse space, and both recognise the proud history of actual rabbits in space.

><sub> Look up "Marfusha Rabbit" </sub>

In [79]:

# A: All rabbit focussed
a = """Rabbits do not do much in space.
In fact rabbits spend most of their time being rabbits on earth,
because they are rabbits."""

# B: All astronaut focussed
b = """ Astronauts are serious and scientific. They explore space,
because they are astronauts and that is what astronauts do. Important astronauts stuff.
"""

# C: Balanced evenly between both rabbit and astronaut
c = """The previous sentences were about rabbits, and astronauts respectively.
Astronauts are not much like rabbits, as they do more in space but this sentence addresses both topics."""

# D: Balanced except for a slight preference for rabbits
d = """This sentence talks about rabbits mostly because we all love rabbits. We
also enjoy astronauts though we mention astronauts one less time than rabbits."""

# E: Mentions neither rabbits nor astronauts
e = """This sentence has had enough of the nonsense and will mention neither people that float in space, nor small fluffy creatures."""

example_texts = [a,b,c,d,e]

We'll vectorise as before into simple counts, dropping stop words.

In [80]:
cv = CountVectorizer(stop_words='english')
matrix = cv.fit_transform(example_texts)
example_vectors = pd.DataFrame(matrix.todense(), columns=cv.get_feature_names_out())
example_vectors

Unnamed: 0,addresses,astronauts,creatures,earth,enjoy,explore,fact,float,fluffy,important,...,scientific,sentence,sentences,small,space,spend,stuff,talks,time,topics
0,0,0,0,1,0,0,1,0,0,0,...,0,0,0,0,1,1,0,0,1,0
1,0,4,0,0,0,1,0,0,0,1,...,1,0,0,0,1,0,1,0,0,0
2,1,2,0,0,0,0,0,0,0,0,...,0,1,1,0,1,0,0,0,0,1
3,0,2,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,1,1,0
4,0,0,1,0,0,0,0,1,1,0,...,0,1,0,1,1,0,0,0,0,0


In [81]:
# Just to prove that rabbits, astronauts and space are the key words in these examples

example_vectors.sum().sort_values(ascending=False).head()

rabbits       9
astronauts    8
space         4
sentence      3
time          2
dtype: int64

For our example we're going to focus only on the top three features. We'll also add a column that labels each of the examples, a column containing the text itself, and a column we'll need to specify the size of the marker when we visualise this in a minute.

In [82]:
#*
labels = ["A: All Rabbits","B: All Astronauts","C: Balanced","D: Balanced but more rabbit", "E: Mentions neither"]
main_features = example_vectors.loc[:, ['rabbits', 'astronauts','space']]
main_features['labels'] = labels
main_features['marker_size'] = 10
main_features['text'] = example_texts
main_features


Unnamed: 0,rabbits,astronauts,space,labels,marker_size,text
0,4,0,1,A: All Rabbits,10,Rabbits do not do much in space.\nIn fact rabb...
1,0,4,1,B: All Astronauts,10,Astronauts are serious and scientific. They e...
2,2,2,1,C: Balanced,10,"The previous sentences were about rabbits, and..."
3,3,2,0,D: Balanced but more rabbit,10,This sentence talks about rabbits mostly becau...
4,0,0,1,E: Mentions neither,10,This sentence has had enough of the nonsense a...


Throughout this module we're going to use a different plotting library called `Plotly`. Seaborn is a good fundamental library for visualisation, but as visualisation becomes more about exploration of data, we need to use more interactive tools. It is also good to be familiar with a range of plotting libraries.

Plotly's Express package works similarly to Seaborn, with a simple one line command to build interactive plots.

We're going to use `px.scatter` to create a scatter plot of the frequency of rabbits vs the frequency of astronauts.

In [83]:
# *
import plotly.express as px
px.scatter(data_frame=main_features,
            x='astronauts', y='rabbits',
              color='labels',
                size='marker_size',
                  hover_data='text',
                  range_x=(-0.5,4.5), # This argument and the one below are purely to force a square plot for teaching purposes.
                  range_y=(-0.5,4.5))

# IF YOU ARE READING THE PDF YOU WON'T SEE THE PLOTS BELOW. RUN THE FULL NOTEBOOK IN COLAB OR SIMILAR TO SEE THE RESULT

Just using two of the features we get a visual representation of the 'distance' between different documents based on the words they use. We have 4 regions.
1. Bottom left is no rabbits, no astronauts.
2. Top left is rabbits only.
3. Bottom right is astronauts only.
4. In the middle we have those documents more balanced in their rabbit to astronaut ratio.

If we visualise in three dimensions we can include another feature, which adds another dimension to the distance or closeness of documents. Going from two to three dimensions introduces more variance into the positions of the markers, making the distances slightly more accurate by allowing the documents to be positioned in one more direction.

Adding this shows that whilst documents C and D are still close, they aren't necessarily as close as we thought if we consider the use of the word 'space'.

In [84]:
px.scatter_3d(data_frame=main_features, 
              x='astronauts', y='rabbits',
                z='space', color='labels', 
                size='marker_size', 
                hover_data='text', 
                range_x=(-0.5,4.5), # This argument and the one below are purely to force a square plot for teaching purposes.
                  range_y=(-0.5,4.5))

When we talk about 'distance' between documents, this is not just a visual intuition but something measurable using 'cosine distance'. This metric is commonly used for vectors that represent text to measure the distance between two points.

In [85]:
#*
from scipy.spatial.distance import pdist, squareform

# By default, pdist gives you distances as a list of values, squareform reshapes this to make it more interpretable for us as a grid.
distances = squareform(pdist(main_features[['rabbits','astronauts','space']], metric='cosine'))

# We'll take that grid of distances and set both the column and index names to our labels.
pd.DataFrame(distances, columns=main_features['labels'], index=main_features['labels'])

labels,A: All Rabbits,B: All Astronauts,C: Balanced,D: Balanced but more rabbit,E: Mentions neither
A: All Rabbits,0.0,0.941176,0.272393,0.192793,0.757464
B: All Astronauts,0.941176,0.0,0.272393,0.461862,0.757464
C: Balanced,0.272393,0.272393,0.0,0.0755,0.666667
D: Balanced but more rabbit,0.192793,0.461862,0.0755,0.0,1.0
E: Mentions neither,0.757464,0.757464,0.666667,1.0,0.0


> #### Interpreting the numbers
> - 0 means no distance, the two vectors are identical - in our case this only happens when comparing a document with itself.
> - 1 means entirely different: the two vectors could not be more different in their feature scores. This does not mean that semantically the sentences couldn't be more different, just on the basis of what features we are measuring. In our case only docs D and E achieve this as they are completely opposite in their use of the three words. D uses `astronauts` and `rabbits` but not `space`, E uses `space`, but not the others.
> - All the other scores vary between 1 and 0, based on the similarity or dissimilarity of their word usage. A good example to check this is for document D which is meant to be balanced, but leans a little more `rabbit`. We can see that D is closer to A, the all rabbits document, than it is to B the all astronauts one, but overall it is closest to C, the Balanced document, but not exactly the same.
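
To see where these numbers come from, cosine distance can be computed by hand as 1 minus the cosine of the angle between two vectors. The sketch below is a minimal standard-library version for illustration, not SciPy's implementation. `cosine_distance` is a name chosen here, and the vectors are the `[rabbits, astronauts, space]` counts from the `main_features` table.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# [rabbits, astronauts, space] counts taken from the table above.
doc_a = [4, 0, 1]
doc_b = [0, 4, 1]
doc_d = [3, 2, 0]
doc_e = [0, 0, 1]

print(round(cosine_distance(doc_a, doc_b), 6))  # 0.941176
print(round(cosine_distance(doc_d, doc_e), 6))  # 1.0
```

D and E share no words at all among the three features, so their dot product is 0 and the distance is exactly 1. The same function works unchanged for vectors of any length, which is why adding more features changes the distances but not the method.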


Now imagine that we added a fourth feature, and then a fifth, and so on. In this example we have over 20 features we could include. In our proper news dataset we had thousands. Whilst we can't visualise that many dimensions directly, they can still be included in the computer's determination of 'distance'.


In [86]:
#*
# Recall how many features we had in our rabbits example
example_vectors.shape

(5, 28)

In [137]:
#*
# Using all the features
all_feature_distances = squareform(pdist(example_vectors, metric='cosine'))
pd.DataFrame(all_feature_distances, columns=main_features['labels'], index=main_features['labels'])


labels,A: All Rabbits,B: All Astronauts,C: Balanced,D: Balanced but more rabbit,E: Mentions neither
A: All Rabbits,0.0,0.952381,0.50901,0.349186,0.927261
B: All Astronauts,0.952381,0.0,0.50901,0.599499,0.927261
C: Balanced,0.50901,0.50901,0.0,0.369107,0.833333
D: Balanced but more rabbit,0.349186,0.599499,0.369107,0.0,0.847056
E: Mentions neither,0.927261,0.927261,0.833333,0.847056,0.0


Broadly, the distances will be similar, but these distances integrate a consideration of each possible word in the corpus. Now some documents may be considered closer due to a similarity in the *other* words used, not just our key two or three.

## 2.2 Embeddings
Whilst TFIDF is still good for generating keywords to represent collections of text, we've moved on when it comes to determining document similarity. The vectorisation methods we've covered have some key issues:
- Bag of Words: Order is completely lost, all that matters is the presence of words, not where they come in the text or their context.
- With the main unit of analysis being individual words, the actual meaning of the text can get lost. Two texts might say the same thing using different words and be considered dissimilar.
- Word meaning can vary depending on the words that come before and after it.
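
The first of these issues is easy to demonstrate. In the small sketch below, two sentences that describe opposite events produce exactly the same bag of words. The sentences and the helper `bag_of_words` are invented for illustration.

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace; enough for this punctuation-free example.
    return Counter(text.lower().split())

first = "the dog chased the rabbit"
second = "the rabbit chased the dog"

# Same words, same counts: the two bags are indistinguishable.
print(bag_of_words(first) == bag_of_words(second))  # True
```

Any count-based or TFIDF vector built from these two sentences would therefore be identical, giving a cosine distance of 0 even though who chased whom has been reversed.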

#### BERT
**B**idirectional **E**ncoder **R**epresentations from **T**ransformers. BERT is a language model trained on large amounts of text.

When creating vectors to represent texts it can better express the underlying semantic meaning of the words being used as well as consider words in their larger context. 

Generating vectors from an embedding model like BERT can take a while, but they are much better at capturing the nuance of document similarity.

Because embedding models consider the whole text for context, we feed them the original text rather than tokens, as they use the punctuation as part of the decision-making process.

We're going to use some text extracts to demonstrate this. 

In [176]:
space = "Space, often referred to as outer space, is the vast expanse that begins beyond Earth's atmosphere and continues indefinitely in all directions. It is a near-vacuum containing a low density of particles, predominantly hydrogen and helium, as well as electromagnetic radiation, magnetic fields, and cosmic rays. While often considered empty, space is punctuated by celestial bodies like planets, stars, asteroids, and galaxies, each governed by the laws of physics, particularly gravity. The study of space encompasses a multitude of scientific disciplines, including astronomy, astrophysics, and cosmology, all seeking to understand the origin, evolution, and ultimate fate of the universe."
religion = "Religion is a complex and multifaceted social phenomenon that encompasses beliefs, practices, and moral codes centered around a system of faith and worship, often involving a supernatural or transcendent reality. It typically provides a framework for understanding the meaning of life, the universe, and humanity's place within it, often offering explanations for existential questions and providing guidance for ethical behavior. Religions are diverse, ranging from organized institutions with established doctrines and rituals to personal and individual expressions of spirituality, and play a significant role in shaping cultures, societies, and individual identities around the globe."
cheese = "Cheese, a dairy product made by coagulating milk, comes in a vast and diverse array of types, each characterized by unique flavors, textures, and production methods. These variations arise from factors such as the type of milk used (cow, goat, sheep, etc.), the specific cultures introduced, the temperature and humidity during aging, and the length of the aging process. Broad categories include fresh cheeses like ricotta and mozzarella, soft cheeses like brie and camembert, semi-hard cheeses like cheddar and gouda, hard cheeses like parmesan and pecorino romano, and blue cheeses like gorgonzola and Roquefort, each offering a distinct culinary experience."
sociology = "Sociology is the scientific study of human society and social behavior. It examines the structure, development, and functioning of human social life, including social institutions, relationships, and interactions. Sociologists explore a wide range of topics, such as social inequality, culture, deviance, social change, and the impact of social forces on individual lives. Through research methods like surveys, interviews, and observations, sociology aims to understand how individuals are shaped by their social environment and how social structures and processes affect human actions and experiences."
rabbits = "Rabbits are small, furry mammals known for their long ears, powerful hind legs, and fluffy tails. Found in various habitats worldwide, they are herbivores with a diet primarily consisting of grasses, vegetables, and fruits. Their rapid breeding cycle contributes to their abundance, and they are often prey animals for larger predators. Rabbits exhibit social behaviors, living in groups called colonies, and are known for their characteristic hopping gait and twitching noses. While often kept as pets, wild rabbits play an important role in their respective ecosystems."
dogs = "Dogs, often hailed as “man's best friend,” are domesticated canines renowned for their loyalty, intelligence, and diverse breeds. From tiny Chihuahuas to towering Great Danes, dogs exhibit a wide range of sizes, temperaments, and purposes. They have been companions to humans for thousands of years, evolving from wolves into specialized working animals, beloved pets, and invaluable assistance animals. Their keen senses, trainability, and unwavering affection make them cherished members of families worldwide, offering comfort, protection, and unconditional love."
computer_hardware = "Computer hardware encompasses the physical components of a computer system, the tangible parts you can see and touch. These pieces work together to execute instructions and perform tasks. Essential components include the central processing unit (CPU), the “brain” of the computer, responsible for processing data; random access memory (RAM), which provides short-term data storage for quick access; storage devices like hard drives (HDDs) or solid-state drives (SSDs) for persistent data storage; and input/output devices such as keyboards, mice, and monitors that allow users to interact with the system. The motherboard acts as the central hub, connecting all these components and facilitating communication between them. Performance, speed, and overall capabilities of a computer system are heavily reliant on the quality and configuration of its hardware."
economics = "Economics is the study of how societies allocate scarce resources to satisfy unlimited wants and needs. It examines how individuals, businesses, and governments make decisions in the face of scarcity, focusing on production, distribution, and consumption of goods and services. The field encompasses both microeconomics, which analyzes individual markets and consumer behavior, and macroeconomics, which studies the economy as a whole, focusing on issues like inflation, unemployment, and economic growth. Ultimately, economics seeks to understand how to optimize resource allocation to improve the overall well-being of society."
physics = "Physics, at its core, is the study of the fundamental constituents of the universe and the forces that govern their interactions. It seeks to understand everything from the smallest subatomic particles to the largest structures in the cosmos, aiming to uncover the underlying laws that dictate how matter, energy, space, and time behave. Through observation, experimentation, and mathematical modeling, physicists develop theories that describe and predict natural phenomena. This pursuit of knowledge leads to advancements in technology, providing us with a deeper understanding of the world around us and enabling the creation of innovative tools and solutions that impact countless aspects of modern life, from medicine and communication to transportation and energy production. Ultimately, physics strives to provide a unified and comprehensive understanding of the universe, answering the fundamental questions of existence and continuously pushing the boundaries of human knowledge."

snippet_texts = [space, religion, cheese, sociology, rabbits, dogs, computer_hardware, economics, physics]
snippet_labels = ["Space", "Religion", "Cheese", "Sociology", "Rabbits", "Dogs", "Computer Hardware", "Economics", "Physics"]

In [177]:

from sentence_transformers import SentenceTransformer
transformer = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = transformer.encode(snippet_texts)


Rather than having one column per feature/word, model-created embeddings have a fixed number of features that are meant to capture the semantic dimensions of a text. In the case of this model, that is 384 columns.

In [178]:
embeddings.shape

(9, 384)
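
To see why a fixed width matters, compare it with a word-count representation, where every new word adds a column. The sketch below is a plain-Python illustration using made-up sentences (`toy_texts` and `count_columns` are hypothetical names, not part of the lab). It only counts how many columns a word-count table would need.

```python
# Toy illustration: a word-count table needs one column per distinct word,
# so its width grows as the texts introduce new vocabulary.
toy_texts = [
    "rabbits eat grass",
    "dogs chase rabbits",
]

def count_columns(texts):
    """Return the number of distinct words (columns) across the texts."""
    vocabulary = set()
    for text in texts:
        vocabulary.update(text.lower().split())
    return len(vocabulary)

columns_before = count_columns(toy_texts)
columns_after = count_columns(toy_texts + ["physics studies energy"])

print(columns_before)  # 5 distinct words
print(columns_after)   # 8 distinct words
```

An embedding model like the one above returns the same number of values for every text (384 here), no matter how many different words appear.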

In [None]:
#*
# For reference, let's generate a word-frequency-based similarity table for these examples
cv = CountVectorizer(stop_words='english')
snippet_counts = cv.fit_transform(snippet_texts)
distances = squareform(pdist(snippet_counts.todense(), metric='cosine'))

# We'll take that grid of distances and set both the column and index names to our labels.
px.imshow(pd.DataFrame(distances, columns=snippet_labels, index=snippet_labels), color_continuous_scale='BuGn_r')
# The darker the square the closer the two documents are according to the model

In [180]:
# *
distances = squareform(pdist(embeddings, metric='cosine'))
px.imshow(pd.DataFrame(distances, columns=snippet_labels, index=snippet_labels), color_continuous_scale='BuGn_r')

According to the word counts there is very little overlap between the different extracts of text. They may use a few common words but otherwise they are distant.

However, with the semantic embeddings, there is less distance between certain text extracts. For example the docs about rabbits and dogs are considered closer in topic. Sociology is closer to economics and religion. Physics and Space are very close, whilst interestingly Space and Religion are also closer - perhaps because both extracts talk about the larger place of humanity.
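
One way to read a heatmap like this more systematically is to look up each document's nearest neighbour. Below is a minimal sketch, assuming a small made-up distance grid (`toy_labels` and `toy_distances` are illustrative values, not the real output above).

```python
import numpy as np

toy_labels = ["Rabbits", "Dogs", "Physics", "Space"]

# Made-up symmetric cosine distances, for illustration only
toy_distances = np.array([
    [0.00, 0.40, 0.90, 0.80],
    [0.40, 0.00, 0.90, 0.85],
    [0.90, 0.90, 0.00, 0.30],
    [0.80, 0.85, 0.30, 0.00],
])

# Ignore the diagonal: every document is at distance 0 from itself
masked = toy_distances.copy()
np.fill_diagonal(masked, np.inf)

# For each row, find the column holding the smallest remaining distance
nearest = {
    label: toy_labels[int(idx)]
    for label, idx in zip(toy_labels, masked.argmin(axis=1))
}
print(nearest)
```

To try this on the lab data, you could swap in `distances` and `snippet_labels` from the cells above.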

### Embedding Atlas
To demonstrate this at scale we'll use a new tool called Embedding Atlas and our Guardian articles. Embedding Atlas can take text and generate the 384-feature embeddings for us, and it then uses those embeddings to create a simplified two-dimensional version of them so that we can actually visualise them.

<sub>If you are able to see the reality in 384 dimensions feel free to disregard the above</sub>
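
To give a feel for what "simplifying to two dimensions" means, here is a rough sketch that squashes 384 columns down to 2 using PCA, a simple linear technique. This is a swapped-in method for illustration and not necessarily what Embedding Atlas does internally. The `fake_embeddings` array is random numbers standing in for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real embeddings: 9 documents x 384 features of random numbers
fake_embeddings = rng.normal(size=(9, 384))

# PCA via singular value decomposition:
# 1. centre each feature on zero
centred = fake_embeddings - fake_embeddings.mean(axis=0)
# 2. find the directions along which the documents vary the most
_, _, components = np.linalg.svd(centred, full_matrices=False)
# 3. keep only the first two directions as x and y coordinates
coords_2d = centred @ components[:2].T

print(coords_2d.shape)  # (9, 2)
```

Each document now has just an x and a y value, which is what lets us draw it as a point on a flat map.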

In [None]:
from embedding_atlas.widget import EmbeddingAtlasWidget
from embedding_atlas.projection import compute_text_projection

to_visualise = articles[['webTitle','sectionName','webUrl','cleaned_text','tokens','wordcount']].copy()
compute_text_projection(to_visualise, x='x', y='y', text='cleaned_text')
widget = EmbeddingAtlasWidget(to_visualise, x='x', y='y', text='cleaned_text', show_charts=False)
widget
# This will take approx 2 minutes to load on Colab for the first run.


### Task

**Explore the Embedding Atlas interface**. Before starting click on the cog to adjust the settings. Switch tooltip to `webTitle` and under `Text Style` set `webUrl` to `Link`.

1. Zoom in to different areas of the documents. What do you notice about the labels?
2. Click on individual points to see which stories they represent.
3. Pick a cluster of documents. What do you think is the underlying topic or theme linking them based on the titles? You can even follow the link to read the article if it helps.
4. Switch the `Display` setting (dropdown menu at the top right) to `Density`. Slowly increase the `Threshold` slider until you see the patterns. What do you think these lines are telling you?
5. Switch the `Color` dropdown to `sectionName`. These are the sections the newspaper assigned to each article. Are articles of the same section always grouped together? Do you have an explanation for why that is?

> #### Other features:
> - At the top right there are buttons that give you access to other panels. The first from the right is the charts button, the second is the table button. 
> - The charts section allows you to examine charts the tool has already made out of the dataset, and make your own. Try adding a chart, such as a boxplot of wordcount grouped by section.
> - The table section shows you the full table for the dataset.
> - Hidden at the bottom are two dotted selection tools, a square and a loop. You can use these to select some of the points. Once selected the table and charts view will filter to show you only information relevant to your selection.


## 2.3 Summary: Similarity and Difference
Embedding Atlas combines both text representation, in the form of TFIDF generated labels for collections of documents, and text similarity, in the form of the embeddings used to position the documents in relation to one another based on their semantic similarity.

This combination of trained model embeddings with a summary representation of a collection of texts is rapidly becoming the standard for text analysis. In the next session we'll look at another tool, specifically designed for identifying topics in large collections of text, that uses these techniques.

