# Analysis of Sentence Embeddings in Aesop's Fables

Qi Wu

University of Illinois at Urbana-Champaign

---

## Introduction:
This demo explores the intersection of natural language processing (NLP) and classic literature. The goal is to use modern machine learning techniques to analyze the linguistic structure of Aesop's Fables and to generate summaries of these influential stories.

Here's a breakdown of this demo:
### Part 1: Sentence Embedding Generation
* Using **Sentence-BERT**, a state-of-the-art model for generating sentence embeddings that capture semantic meaning, we'll map each sentence from `Aesop_Fables.txt` to a vector in a high-dimensional space.

### Part 2: Semantic Similarity Computation
* By calculating **cosine similarities** between sentence embeddings, we'll identify and rank the sentences in Aesop's Fables that are semantically close to the given phrase.

### Part 3: 🌟 Fable Summarization:
* In the realm of storytelling, fables are known for their brevity and moral lessons. However, even these concise tales can benefit from summarization, especially when looking to quickly grasp the underlying message or for educational purposes. Harnessing the insights gained from our analysis, we'll attempt to summarize each fable in 3 sentences using **K-Means Clustering**.


## Table of Contents


1.   [Installing Packages & Importing Libraries](#s1)
2.   [Reading & Cleaning the Data](#s2)
3.   [Sentence Embedding Generation](#s3)
4.   [Semantic Similarity Computation](#s4)
5.   [Fable Summarization](#s5) 🌟





<a id='s1'></a>
## Installing Packages & Importing Libraries

In [1]:
! pip install sentence-transformers
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import re
import numpy as np




<a id='s2'></a>
## Reading & Cleaning the Data

- **Unicode Character and Quote Removal**: The function begins by replacing non-breaking spaces (`\xa0`) with regular spaces and removing double quotation marks from the text. Characters like these can interfere with text analysis.

- **Paragraph Splitting**: The text is then split into paragraphs, one per fable, wherever a line separator (`\u2028`) is followed by a newline.

- **Sentence Tokenization**: Each paragraph is further split by single newline characters (`\n`), and each resulting line is stripped of leading and trailing whitespace. Non-empty lines are then split into individual sentences based on punctuation marks (periods, exclamation points, and question marks).

- **Sentence Collection**: The sentences extracted from each line are added to a list, which the function returns together with the list of paragraphs.
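
The sentence tokenization step can be illustrated in isolation. The following is a minimal sketch, assuming the same lookbehind pattern that `text_processing` uses; `sample_line` is a made-up string, not a line from the data file.

```python
import re

# Hypothetical sample line (not taken from Aesop_Fables.txt).
sample_line = "  Ho! ho! quoth he, that's for me.  Is it a pearl?  "

# Split on whitespace that follows '.', '!' or '?'. The lookbehind keeps
# the punctuation attached to the sentence it ends.
pieces = re.split(r'(?<=[.!?])\s+', sample_line.strip())

# Strip each piece and drop any empty strings.
pieces = [p.strip() for p in pieces if p.strip()]

print(pieces)
```

A pattern like this also splits after abbreviations such as `Mr.`, so it is a heuristic rather than a full sentence tokenizer.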


In [2]:
def text_processing(text):
    # Replace non-breaking spaces and remove double quotes
    text = text.replace('\xa0', ' ').replace('"', '')
    # Fables are separated by a line separator followed by a newline
    paragraphs = text.split('\u2028\n')
    storys = []
    sentences = []

    for paragraph in paragraphs:
        # Remaining line separators inside a fable become spaces
        paragraph = paragraph.replace('\u2028', ' ')
        lines = paragraph.split('\n')
        for line in lines:
            # Trim the line
            line = line.strip()
            if line:
                # Split the line after sentence-ending punctuation
                line_sentences = re.split(r'(?<=[.!?])\s+', line)
                # Strip each sentence and drop empty strings
                line_sentences = [s.strip() for s in line_sentences if s.strip()]
                sentences.extend(line_sentences)
        storys.append(paragraph)

    return sentences, storys

Below the function definition, the code demonstrates how to use this `text_processing` function:

❗️❗️❗️ Be sure to upload the provided file `Aesop_Fables.txt` first. You can upload it manually or use the code below in Google Colab:

```
from google.colab import files
uploaded = files.upload()
```


- A variable `file_path` is set with the filename 'Aesop_Fables.txt'.
- The file is opened and read into the variable `text`.
- The `text_processing` function is called with `text` as its argument, and the resulting list of cleaned sentences is stored in the variable `sentences`.
- Finally, the first 20 sentences from the list are printed out as a sample of the processed data.

In [3]:
file_path = 'Aesop_Fables.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

sentences, _ = text_processing(text)
sentences[:20]

['The Cock and the Pearl',
 'A cock was once strutting up and down the farmyard among the hens when suddenly he espied something shinning amid the straw.',
 'Ho!',
 'ho!',
 'quoth he, that’s for me, and soon rooted it out from beneath the straw.',
 'What did it turn out to be but a Pearl that by some chance had been lost in the yard?',
 'You may be a treasure, quoth Master Cock, to men that prize you, but for me I would rather have a single barley-corn than a peck of pearls.',
 'Precious things are for those that can prize them.',
 'The Wolf and the Lamb',
 'Once upon a time a Wolf was lapping at a spring on a hillside, when, looking up, what should he see but a Lamb just beginning to drink a little lower down.',
 'There’s my supper, thought he, if only I can find some excuse to seize it.',
 'Then he called out to the Lamb, How dare you muddle the water from which I am drinking?',
 'Nay, master, nay, said Lambikin; if the water be muddy up there, I cannot be the cause of it, for it run

<a id='s3'></a>
## Sentence Embedding Generation

- The `SentenceTransformer` is instantiated with a pre-trained model **'all-mpnet-base-v2'**. This model is chosen for its ability to generate meaningful sentence embeddings that reflect the context and semantic content of the input sentences.

- The `sbert_model.encode()` method is called on the list of sentences to generate their embeddings. This method processes each sentence and converts it into a fixed-size vector in high-dimensional space.

- The commented-out loop, if enabled, prints each sentence alongside its embedding.

In [None]:
sbert_model = SentenceTransformer('all-mpnet-base-v2')
embeddings = sbert_model.encode(sentences)

# for sentence, embedding in zip(sentences, embeddings):
#     print("Sentence:", sentence)
#     print("Embedding:", embedding)
#     print("")

<a id='s4'></a>
## Semantic Similarity Computation

- **Cosine Similarity Function**: A function named `cosine` is defined to calculate the cosine similarity between two vectors. Cosine similarity depends only on the angle between the two embedding vectors, so it measures how similar two sentences are in meaning independently of the magnitudes of the vectors.

In [6]:
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
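
As a small worked example of this function, the sketch below applies it to hand-picked 2-D vectors (hypothetical values, not real embeddings) to show that the score depends on direction and not on vector length.

```python
import numpy as np

def cosine(u, v):
    # Dot product divided by the product of the vector lengths.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 2-D vectors chosen for easy arithmetic.
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 2.0])

sim_ab = cosine(a, b)           # vectors 45 degrees apart
sim_ac = cosine(a, c)           # orthogonal vectors
sim_scaled = cosine(10 * a, b)  # rescaling a vector leaves the score unchanged

print(sim_ab, sim_ac, sim_scaled)
```

Here `sim_ab` and `sim_scaled` agree, and the orthogonal pair scores zero.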

- **Sample Query**: "The brown cow crossed the road."

- **Query Vectorization**: The query sentence is encoded into a vector using the `sbert_model.encode()` method.

- **Similarity Computation**: For each sentence in the collection of embeddings, the cosine similarity between its embedding and the query vector is computed.

- **Pairing and Sorting**: Each similarity score is paired with its corresponding sentence, and these pairs are sorted in descending order of similarity. This sorting allows for the ranking of sentences by how semantically similar they are to the query.

- **Output**: The code then prints the top 10 sentences along with their similarity scores, showing which sentences from Aesop's Fables are most semantically similar to the sample query.
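
The steps above can also be sketched in vectorized form. The example below is an illustration only: it uses small made-up vectors (`toy_embeddings`, `toy_query`) in place of real SBERT output, and it assumes that normalizing the vectors first and taking a matrix product is an acceptable substitute for calling a similarity function once per sentence.

```python
import numpy as np

# Toy stand-ins for sentence embeddings (hypothetical values, 3 dimensions).
toy_sentences = ["sentence A", "sentence B", "sentence C"]
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.6, 0.8],
])
toy_query = np.array([0.0, 1.0, 0.0])

# Normalize rows and the query so a dot product equals cosine similarity.
unit_rows = toy_embeddings / np.linalg.norm(toy_embeddings, axis=1, keepdims=True)
unit_query = toy_query / np.linalg.norm(toy_query)
scores = unit_rows @ unit_query

# Indices sorted from most to least similar.
order = np.argsort(-scores)
ranked = [(toy_sentences[i], float(scores[i])) for i in order]
print(ranked)
```

For a corpus of this size the loop-based version is fast enough; the vectorized form mainly helps when there are many more sentences to rank.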

In [7]:
# Query example:
query = "The brown cow crossed the road."
query_vec = sbert_model.encode([query])[0]

similarities = [cosine(query_vec, vec) for vec in embeddings]

similarity_pairs = list(zip(sentences, similarities))
sorted_pairs = sorted(similarity_pairs, key=lambda x: x[1], reverse=True)

top_ten = sorted_pairs[:10]
for sentence, sim in top_ten:
    print("Sentence =", sentence, "\nSimilarity score =", sim, "\n")


Sentence = A Goat passed by shortly afterwards, and asked the Fox what he was doing down there. 
Similarity score = 0.36523297 

Sentence = Then a Sheep went in, and before she came out a Calf came up to receive the last wishes of the Lord of the Beasts. 
Similarity score = 0.32096326 

Sentence = But soon the Ox, returning from its afternoon work, came up to the Manger and wanted to eat some of the straw. 
Similarity score = 0.31628248 

Sentence = It happened that a Dog had got a piece of meat and was carrying it home in his mouth to eat it in peace. 
Similarity score = 0.30660456 

Sentence = So, she went to the horse, and asked him to carry her away from the hounds on his back. 
Similarity score = 0.2977109 

Sentence = Next time however he came near the King of Beasts he stopped at a safe distance and watched him pass by. 
Similarity score = 0.29741022 

Sentence = The Fox and the Goat 
Similarity score = 0.29278317 

Sentence = With the Farmer came his Lapdog, who danced about an

<a id='s5'></a>
## 🌟 Fable Summarization


- **Text Processing**: The given fable is processed using the `text_processing` function to split the text into individual sentences.

- **Embedding Generation**: The SBERT model encodes these sentences into embeddings that capture their semantic content.

- **K-Means Clustering**: The embeddings are then clustered using the K-Means algorithm. This is used to find groups of sentences that are semantically similar.

- **Centroid Identification**: For each cluster, the centroid (the mean point of all embeddings in the cluster) is identified.

- **Selection of Representative Sentences**: The sentence closest to each centroid is selected as the representative for that cluster. This is done by computing the Euclidean distance from every sentence embedding to the centroid and choosing the sentence with the smallest distance.

- **Result**: The selected sentences are combined to form the summary of the fable.
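
The centroid and representative-selection steps above can be sketched with NumPy alone. This is a toy illustration under stated assumptions: `points` are made-up 2-D vectors standing in for sentence embeddings, and `labels` are written by hand, whereas in the notebook the cluster assignment comes from K-Means.

```python
import numpy as np

# Toy 2-D "embeddings" with hand-written cluster labels (hypothetical data).
points = np.array([
    [0.0, 0.0], [0.2, 0.0], [1.0, 0.0],   # cluster 0
    [5.0, 5.0], [5.0, 6.0], [5.0, 5.4],   # cluster 1
])
labels = np.array([0, 0, 0, 1, 1, 1])

representatives = []
for k in np.unique(labels):
    # Centroid: mean of all points assigned to cluster k.
    centroid = points[labels == k].mean(axis=0)
    # Euclidean distance from every point to this centroid.
    distances = np.linalg.norm(points - centroid, axis=1)
    # Keep the index of the nearest point.
    representatives.append(int(np.argmin(distances)))

# Sort indices so the chosen sentences keep their original order.
representatives = sorted(representatives)
print(representatives)
```

Sorting the indices at the end matters for summaries, because it presents the selected sentences in the order they appear in the fable.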

In [8]:
def summarize_fable(fable, model, num_clusters=1):
    sentences, _ = text_processing(fable)
    # Skip the first sentence, which is the fable's title
    sentences = sentences[1:]
    embeddings = model.encode(sentences)

    # Group the sentence embeddings into num_clusters clusters
    kmeans = KMeans(n_clusters=num_clusters, n_init='auto')
    kmeans.fit(embeddings)

    closest_indices = []
    for i in range(num_clusters):
        centroid = kmeans.cluster_centers_[i]
        # Index of the sentence whose embedding is nearest to this centroid
        closest_idx = np.argmin(np.linalg.norm(embeddings - centroid, axis=1))
        closest_indices.append(closest_idx)

    # Restore the original sentence order
    closest_indices = sorted(closest_indices)
    summary_sentences = [sentences[idx] for idx in closest_indices]
    return '\n'.join(summary_sentences)

### Example Usage:

In [9]:
_, fable_text = text_processing(text)
for fable in fable_text:
    summary = summarize_fable(fable, sbert_model, num_clusters = 3)
    print("Summary of:", fable.split('\n')[0].strip(),"\n", summary)
    print("")

Summary of: The Cock and the Pearl 
 A cock was once strutting up and down the farmyard among the hens when suddenly he espied something shinning amid the straw.
Ho!
You may be a treasure, quoth Master Cock, to men that prize you, but for me I would rather have a single barley-corn than a peck of pearls.

Summary of: The Wolf and the Lamb 
 Then he called out to the Lamb, How dare you muddle the water from which I am drinking?
.WARRA WARRA WARRA WARRA WARRA
Any excuse will serve a tyrant.

Summary of: The Dog and the Shadow 
 It happened that a Dog had got a piece of meat and was carrying it home in his mouth to eat it in peace.
As he crossed, he looked down and saw his own shadow reflected in the water beneath.
Beware lest you lose the substance by grasping at the shadow.

Summary of: The Lion’s Share 
 They hunted and they hunted till at last they surprised a Stag, and soon took its life.
Then the Lion took his stand in front of the carcass and pronounced judgment:  The first quarter