Here, we will build a simple summarization pipeline using the `transformers` library.

In [2]:
from transformers.utils import logging
logging.set_verbosity_error() # suppress warnings

from transformers import pipeline 
import torch

The model used here is [`facebook/bart-large-cnn`](https://huggingface.co/facebook/bart-large-cnn), a BART model fine-tuned for summarization on the CNN/Daily Mail dataset.

In [3]:
summarizer = pipeline(task="summarization",
                      model="facebook/bart-large-cnn",
                      torch_dtype=torch.bfloat16)

In [5]:
text = """Jakarta is the capital and most populous city of Indonesia, with
          an estimated population of 10.56 million as of 2020,
          in an area of more than 661.5 square kilometers.
          Jakarta sits on the northwest coast of the island of Java.
          A historic mix of cultures – Javanese, Malay, Chinese, Arab, Indian and European – has influenced its architecture, language and cuisine.
          The old town, Kota Tua, is home to Dutch colonial buildings, Glodok (Jakarta’s Chinatown), and \
          the old port of Sunda Kelapa, where traditional wooden schooners dock."""

In [6]:
summary = summarizer(text,
                     min_length=10,
                     max_length=100)

In [7]:
summary

[{'summary_text': 'Jakarta is the capital and most populous city of Indonesia. It has an estimated population of 10.56 million as of 2020. The old town, Kota Tua, is home to Dutch colonial buildings.'}]
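
The pipeline returns a list with one dictionary per input text, and the summary string sits under the `summary_text` key. A minimal sketch of pulling it out, using the output printed above as a stand-in so that it runs without downloading the model:

```python
# Stand-in for the `summary` object returned by the summarizer above
# (copied from the printed output, so no model download is needed here).
summary = [{'summary_text': 'Jakarta is the capital and most populous city of Indonesia. '
                            'It has an estimated population of 10.56 million as of 2020. '
                            'The old town, Kota Tua, is home to Dutch colonial buildings.'}]

# One dict per input text; a single string was passed, so take element 0.
summary_text = summary[0]['summary_text']
print(summary_text)

# Rough length check in whitespace-separated words (not model tokens).
word_count = len(summary_text.split())
print(word_count, "words")
```

Note that `min_length` and `max_length` are counted in model tokens, so the word count here is only a rough proxy for them.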

## Sentence Embeddings

In this section, we measure how similar two pieces of text are, which is useful for information retrieval and for grouping or clustering text. To install the required library:
```
!pip install sentence-transformers
```

In [1]:
from transformers.utils import logging
logging.set_verbosity_error() # suppress warning messages

from sentence_transformers import SentenceTransformer
from sentence_transformers import util
model = SentenceTransformer("all-MiniLM-L6-v2")

In the cell above, we import the `SentenceTransformer` class and use it to load [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), a sentence embedding model that converts input text into 384-dimensional embedding vectors.

In [2]:
sentences1 = ['People love cats',
              'Students are learning robotics',
              'The plan is fantastic']

In [3]:
embeddings1 = model.encode(sentences1, convert_to_tensor=True) # return a torch.Tensor instead of a NumPy array

In [4]:
print(embeddings1)

tensor([[ 0.0573,  0.0129,  0.0817,  ...,  0.0944,  0.1124,  0.0369],
        [ 0.0120, -0.0606,  0.0035,  ...,  0.0517, -0.0638,  0.0297],
        [-0.0612,  0.0818,  0.0060,  ..., -0.0038, -0.0387,  0.0129]])

In [5]:
sentences2 = ['Girls like kittens',
              'Professors teach AI',
              'The new plan is so good']

In [6]:
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

In [7]:
print(embeddings2)

tensor([[-0.0079, -0.0359,  0.0422,  ...,  0.0699,  0.0833, -0.0151],
        [ 0.0108, -0.0791, -0.0280,  ...,  0.0540,  0.0025, -0.0517],
        [-0.0291, -0.0010,  0.0850,  ..., -0.0615, -0.0089,  0.0600]])


To measure how close the sentences are, we compute the cosine similarity between their embeddings.
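
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite direction) through 0 (orthogonal) to 1 (same direction). A minimal pure-Python sketch of that formula follows; the helper name `cosine_similarity` and the toy vectors are illustrative and not part of `sentence_transformers`:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (illustrative helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-D vectors, chosen only to show the range of the score.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```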

In [9]:
cosine_scores = util.cos_sim(embeddings1, embeddings2)

In [10]:
# cosine_scores[i][j] is the similarity between sentences1[i] and sentences2[j]
print(cosine_scores)

tensor([[ 0.5480, -0.0190,  0.0796],
        [ 0.1634,  0.4903,  0.0945],
        [ 0.0403,  0.0667,  0.7678]])


In particular, the diagonal of the matrix compares each sentence in `sentences1` with its counterpart in `sentences2`:
* "People love cats" and "Girls like kittens" have a similarity of 0.5480
* "Students are learning robotics" and "Professors teach AI" have a similarity of 0.4903
* "The plan is fantastic" and "The new plan is so good" have a similarity of 0.7678
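
Because the score matrix holds every pair, the closest sentence in `sentences2` for each sentence in `sentences1` is the column with the highest score in that row. The sketch below uses the printed scores as plain Python lists, so it runs without the model; with the real tensor, `cosine_scores.argmax(dim=1)` should give the same indices.

```python
sentences1 = ['People love cats',
              'Students are learning robotics',
              'The plan is fantastic']
sentences2 = ['Girls like kittens',
              'Professors teach AI',
              'The new plan is so good']

# Values copied from the printed cosine_scores tensor.
scores = [[0.5480, -0.0190, 0.0796],
          [0.1634,  0.4903, 0.0945],
          [0.0403,  0.0667, 0.7678]]

best_matches = []
for i, row in enumerate(scores):
    # Index of the highest-scoring sentence in sentences2 for this row.
    j = max(range(len(row)), key=lambda k: row[k])
    best_matches.append((sentences1[i], sentences2[j], row[j]))

for s1, s2, score in best_matches:
    print(f"{s1!r} -> {s2!r} ({score:.4f})")
```

Each sentence's best match is its counterpart at the same index, which is why the diagonal entries are the largest in their rows.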