<a href="https://colab.research.google.com/github/Aditya100300/LLMs_from_scratch/blob/main/Chapter_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Sentence Transformers are powerful encoders

# **Sentence Transformers, Cosine Similarity, and Hugging Face API Examples**

We will:
1. Install and use **sentence-transformers** for embedding sentences.
2. Compare sentences via **cosine similarity**.
3. Show an alternative, simple Python function for bag-of-words cosine similarity.
4. Demonstrate the **Hugging Face Inference APIs** and local pipelines for text generation (Bloom, BLOOMZ), question answering, summarization, and image generation.

Let's get started!


In [1]:
!pip install sentence_transformers


In [2]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Explanation:
- **sentence_transformers**: A library that simplifies creating and using pretrained sentence-level embedding models.
- **SentenceTransformer**: The main class for loading a chosen model (here `all-mpnet-base-v2`).
- **requests**: For making HTTP requests. It is imported later in the notebook, where we call the Hugging Face Inference APIs.
- **pprint**: Pretty-prints JSON or dictionary outputs. It is also imported later, alongside `requests`.


In [3]:
sentences = [
    "it caught him off guard that space smelled of seared steak",
    "she could not decide between painting her teeth or brushing her nails",
    "he thought there'd be sufficient time is he hid his watch",
    "the bees decided to have a mutiny against their queen",
    "the sign said there was road work ahead so she decided to speed up",
    "on a scale of one to ten, what's your favorite flavor of color?",
    "flying stinging insects rebelled in opposition to the matriarch"
]

In [4]:
long_text = ["""
Comparing the cosine similarity scores, we see that by introducing the word "elephant" instead of "mouse" in Sentence 2, the cosine similarity score increased from approximately 0.856 (without noise) to approximately 0.898 (with noise).
This demonstrates how noise can affect the similarity measurement between sentences.
"""]


### Explanation:
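- **all-mpnet-base-v2**: The model loaded above, known for strong performance on semantic similarity tasks.
- **sentences**: Seven short test sentences. The fourth and the last describe the same event (a bee mutiny) in different words.
- **long_text**: A demonstration list holding a single multi-line string.
- The next two cells print the length of each list for clarity.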


In [5]:
len(long_text)

1

In [6]:
len(sentences)

7

Convert the text into a dense vector

In [7]:
embeddings = model.encode(sentences)
embeddings.shape

(7, 768)

### Explanation:
- **model.encode()**: Takes a list of strings and returns a NumPy array of shape (7, 768): one 768-dimensional vector per sentence.
- We store the array in `embeddings`.


In [8]:
#the bees decided to have a mutiny against their queen
#"flying stinging insects rebelled in opposition to the matriarch"

In [9]:
# embeddings[0]

In [10]:
# "it caught him off guard that space smelled of seared steak",


In [11]:
from sentence_transformers.util import cos_sim

scores = cos_sim(embeddings[0], embeddings[1:])

print(sentences[0])
for i, score in enumerate(scores[0]):
    print(f"{round(score.item(), 4)} | {sentences[i+1]}")

it caught him off guard that space smelled of seared steak
0.1293 | she could not decide between painting her teeth or brushing her nails
0.3083 | he thought there'd be sufficient time is he hid his watch
0.1389 | the bees decided to have a mutiny against their queen
0.1416 | the sign said there was road work ahead so she decided to speed up
0.097 | on a scale of one to ten, what's your favorite flavor of color?
0.1232 | flying stinging insects rebelled in opposition to the matriarch


### Explanation:
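- **cos_sim(a, b)**: The sentence_transformers utility function for cosine similarity. It returns a PyTorch tensor with one row per vector in `a` and one column per vector in `b`.
- We compare `embeddings[0]` (a single 768-dimensional vector, treated as shape 1 x 768) against `embeddings[1:]` (shape 6 x 768), so `scores` has shape 1 x 6.
- **.item()**: Converts a single-element PyTorch tensor to a Python float.
- Output: Each remaining sentence is printed with its similarity to the first one. All scores are low here, because none of the other sentences is about the same topic.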
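To make the computation behind `cos_sim` concrete, here is a minimal pure-Python sketch of cosine similarity and of ranking candidates against a query vector. The function names `cosine_similarity` and `rank_by_similarity` and the 3-dimensional vectors are made up for illustration; real embeddings from the model have 768 dimensions, and the library's own implementation works on tensors rather than lists.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional "embeddings" (hypothetical values, not model output)
query = [1.0, 0.0, 1.0]
candidates = [
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_by_similarity(query, candidates)
for index, score in ranking:
    print(f"{round(score, 4)} | candidate {index}")
```

The ranking puts the identical vector first, the partially aligned one second, and the orthogonal one last, which is the same ordering logic the cell above applies to sentence embeddings.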


**Simple exercise:**

- Sentence 1: The boy ran upstairs
- Sentence 2: The girl ran upstairs
- Sentence 3: The cat ran upstairs
- Sentence 4: The cat ran outside

Cosine similarity scores:

- Sentence 1 and 2: 0.74999
- Sentence 1 and 3: 0.7009
- Sentence 1 and 4:

In [12]:
#Sentence 1 and 2 : 0.74999
#Sentence 1 and 3: 0.7009
#Sentence 3 and 4: 0.3317
#Sentence 1 and 4: 0.3897

In [13]:

# Compare the last sentence against all the others
# scores = cos_sim(embeddings[-1], embeddings[:-1])

# print(sentences[-1])
# for i, score in enumerate(scores[0]):
#     print(f"{round(score.item(), 4)} | {sentences[i]}")

## Cosine Similarity

In [14]:
import numpy as np

def calculate_cosine_similarity(sentence1, sentence2):
    # Create a set of unique words from both sentences
    word_set = set(sentence1.split() + sentence2.split())

    # Create vectors representing the frequency of each word in each sentence
    vector1 = np.array([sentence1.split().count(word) for word in word_set])
    vector2 = np.array([sentence2.split().count(word) for word in word_set])

    # Calculate the dot product of the two vectors
    dot_product = np.dot(vector1, vector2)

    # Calculate the magnitude of each vector
    magnitude1 = np.linalg.norm(vector1)
    magnitude2 = np.linalg.norm(vector2)

    # Calculate the cosine similarity
    cosine_similarity = dot_product / (magnitude1 * magnitude2)

    return cosine_similarity

# Test the function
sentence1 = "The cat chased the mouse."
sentence2 = "The cat chased the mouse in the garden."
sentence3 = "The mouse chased the elephant in the garden."

similarity_score = calculate_cosine_similarity(sentence1, sentence2)
print(f"The cosine similarity between the two sentences is: {similarity_score}")


The cosine similarity between the two sentences is: 0.7071067811865475


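### Explanation:
- **calculate_cosine_similarity**: This is a simple bag-of-words function that doesn’t use embeddings at all.
- We compare two pairs of sentences (1 vs. 2, then 2 vs. 3). The score reflects how much the word counts of the two sentences overlap, ignoring synonyms, word order, and context.
- Because the function splits on whitespace only, tokens keep their case and punctuation: `The` and `the`, or `mouse.` and `mouse`, count as different words.
- Great for an educational contrast with neural embeddings.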
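Since `calculate_cosine_similarity` splits on whitespace, case and punctuation affect its scores. The sketch below is one possible variant, not the notebook's function: it lowercases the text and strips punctuation before counting. The names `tokenize` and `bow_cosine_similarity` and the regular expression are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bow_cosine_similarity(text1, text2):
    """Cosine similarity between the word-count vectors of two texts."""
    counts1 = Counter(tokenize(text1))
    counts2 = Counter(tokenize(text2))
    # Only words present in both texts contribute to the dot product
    dot = sum(counts1[word] * counts2[word] for word in counts1)
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm1 * norm2)

s1 = "The cat chased the mouse."
s2 = "The cat chased the mouse in the garden."
print(tokenize(s1))
print(round(bow_cosine_similarity(s1, s2), 4))
```

With normalization, `The` and `the` merge into one token and `mouse.` matches `mouse`, so this pair scores about 0.909 instead of the 0.707 reported by the whitespace-only version.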


In [15]:
# vector1 = np.array([sentence1.split().count(word) for word in word_set])

In [16]:

similarity_score = calculate_cosine_similarity(sentence2, sentence3)
print(f"The cosine similarity between the two sentences is: {similarity_score}")

The cosine similarity between the two sentences is: 0.8999999999999998
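
Both recorded scores can be checked by hand. For two word-count vectors $\mathbf{a}$ and $\mathbf{b}$ over the combined vocabulary, the function computes

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$$

Keep in mind that `split()` preserves case and punctuation, so `The`/`the` and `mouse.`/`mouse` are separate tokens. Sentences 1 and 2 share `The`, `cat`, `chased`, and `the` (once in sentence 1, twice in sentence 2):

$$\mathbf{a} \cdot \mathbf{b} = 1 + 1 + 1 + 1 \cdot 2 = 5, \qquad \lVert \mathbf{a} \rVert = \sqrt{5}, \qquad \lVert \mathbf{b} \rVert = \sqrt{10}, \qquad \frac{5}{\sqrt{5}\,\sqrt{10}} = \frac{5}{\sqrt{50}} \approx 0.7071$$

Sentences 2 and 3 share `The`, `chased`, `the` (twice in each), `mouse`, `in`, and `garden.`:

$$\mathbf{a} \cdot \mathbf{b} = 1 + 1 + 2 \cdot 2 + 1 + 1 + 1 = 9, \qquad \lVert \mathbf{a} \rVert = \lVert \mathbf{b} \rVert = \sqrt{10}, \qquad \frac{9}{10} = 0.9$$

The printed value 0.8999999999999998 differs from 0.9 only by floating-point rounding.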


# Hugging Face APIs
Let’s look at how the Hugging Face APIs give access to models such as Bloom and RoBERTa-base. First, sign up for Hugging Face and copy a token for API access: after signing up, click the profile icon at the top right, choose Settings, and then Access Tokens.

# **Hugging Face Inference API Examples**

We'll walk through a few examples of calling external Hugging Face endpoints for text generation, Q&A, and summarization. Note these can require an API token with an upgraded account if the model is large or behind a paywall.

---

## **Example 1: Sentence Completion with Bloom**  

We'll call the `bigscience/bloom` model through the Inference API for sentence completion. The code below authenticates with a Hugging Face token, then sends the input text together with a set of generation parameters.

In [36]:
from huggingface_hub import notebook_login
from huggingface_hub import HfFolder


#enter your API key, you can make one for free on HF
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [37]:
from huggingface_hub import InferenceApi

inference = InferenceApi("bigscience/bloom",token=HfFolder.get_token())



In [38]:
import time

def infer(prompt,
          max_length = 32,
          top_k = 0,
          num_beams = 0,
          no_repeat_ngram_size = 2,
          top_p = 0.9,
          seed=42,
          temperature=0.7,
          greedy_decoding = False,
          return_full_text = False):


    top_k = None if top_k == 0 else top_k
    do_sample = False if num_beams > 0 else not greedy_decoding
    num_beams = None if (greedy_decoding or num_beams == 0) else num_beams
    no_repeat_ngram_size = None if num_beams is None else no_repeat_ngram_size
    top_p = None if num_beams else top_p
    early_stopping = None if num_beams is None else num_beams > 0

    params = {
        "max_new_tokens": max_length,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "do_sample": do_sample,
        "seed": seed,
        "early_stopping":early_stopping,
        "no_repeat_ngram_size":no_repeat_ngram_size,
        "num_beams":num_beams,
        "return_full_text":return_full_text
    }

    s = time.time()
    response = inference(prompt, params=params)
    #print(response)
    proc_time = time.time()-s
    #print(f"Processing time was {proc_time} seconds")
    return response

In [39]:
prompt = "The thing that makes large language models interesting is"
resp = infer(prompt)

resp

{'error': 'Cannot override task for LLM models'}

### Explanation:
- **notebook_login()**: Authenticates with Hugging Face. Skip it if you already have a token set up.
- **InferenceApi**: A client bound to the model we specify, `bigscience/bloom`.
- **infer(...)**: A convenience function that builds the generation parameters and sends them to Bloom together with the prompt.
- We test it with the prompt "The thing that makes large language models interesting is". In the recorded run, the API returned `{'error': 'Cannot override task for LLM models'}` instead of generated text.
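
The parameter handling inside `infer` is easier to follow when separated from the API call. The sketch below copies that logic into a standalone function, here given the hypothetical name `build_generation_params`, so the three decoding modes (sampling, beam search, greedy) can be inspected offline. It makes no request and says nothing about which parameters a given endpoint accepts.

```python
def build_generation_params(max_length=32, top_k=0, num_beams=0,
                            no_repeat_ngram_size=2, top_p=0.9, seed=42,
                            temperature=0.7, greedy_decoding=False,
                            return_full_text=False):
    """Mirror the parameter normalization in infer(), without calling any API."""
    top_k = None if top_k == 0 else top_k
    # Beam search and greedy decoding both turn sampling off
    do_sample = False if num_beams > 0 else not greedy_decoding
    num_beams = None if (greedy_decoding or num_beams == 0) else num_beams
    # These two settings are only kept when beam search is active
    no_repeat_ngram_size = None if num_beams is None else no_repeat_ngram_size
    early_stopping = None if num_beams is None else num_beams > 0
    # Nucleus sampling (top_p) is dropped when beam search is active
    top_p = None if num_beams else top_p
    return {
        "max_new_tokens": max_length,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": temperature,
        "do_sample": do_sample,
        "seed": seed,
        "early_stopping": early_stopping,
        "no_repeat_ngram_size": no_repeat_ngram_size,
        "num_beams": num_beams,
        "return_full_text": return_full_text,
    }

sampling = build_generation_params()                   # defaults: sampling
beams = build_generation_params(num_beams=4)           # beam search
greedy = build_generation_params(greedy_decoding=True) # greedy decoding

for name, params in [("sampling", sampling), ("beams", beams), ("greedy", greedy)]:
    print(name, params["do_sample"], params["num_beams"], params["top_p"])
```

With the defaults the function requests sampling with `top_p=0.9`; passing `num_beams=4` switches to beam search and clears `top_p`; `greedy_decoding=True` disables both sampling and beams.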


In [24]:
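import requests
from pprint import pprint
from huggingface_hub import HfFolder

API_URL = 'https://api-inference.huggingface.co/models/bigscience/bloomz'
# Send the saved access token as a Bearer header with every request
headers = {'Authorization': f'Bearer {HfFolder.get_token()}'}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()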

params = {'max_length': 200, 'top_k': 10, 'temperature': 2.5}
output = query({
    'inputs': 'Sherlock Holmes is a',
    'parameters': params,
})

pprint(output)

{'error': 'Invalid username or password.'}


### Explanation:
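- **API_URL**: Points to the `bloomz` variant this time.
- We define a small `query` function that:
  1) Takes a Python dictionary of inputs and parameters
  2) Sends a POST request
  3) Returns the JSON response
- We test it with `'Sherlock Holmes is a'` as the prompt.
- The recorded output, `{'error': 'Invalid username or password.'}`, is an authentication failure: the Inference API expects the access token in an `Authorization: Bearer <token>` header on each request.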
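
Because the endpoint reports failures as a JSON body rather than raising an exception, it helps to check the response before using it. The helper below is a minimal sketch with a hypothetical name, `extract_generated_text`. It assumes a successful text-generation response is a list of dictionaries with a `'generated_text'` key, and that errors arrive as a dictionary with an `'error'` key, as in the outputs recorded in this notebook; other tasks and endpoints may use different shapes.

```python
def extract_generated_text(response):
    """Return the generated strings, or raise if the API reported an error.

    Assumed shapes (not verified against every endpoint):
      success: [{"generated_text": "..."}, ...]
      failure: {"error": "..."}
    """
    if isinstance(response, dict) and "error" in response:
        raise RuntimeError(f"Inference API error: {response['error']}")
    return [item["generated_text"] for item in response]

# A made-up successful response, for illustration only
ok_response = [{"generated_text": "Sherlock Holmes is a fictional detective"}]
print(extract_generated_text(ok_response))

# The error recorded in this notebook
bad_response = {"error": "Invalid username or password."}
try:
    extract_generated_text(bad_response)
except RuntimeError as exc:
    print(exc)
```

Raising early keeps an authentication or quota problem from surfacing later as a confusing `KeyError` or `TypeError`.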


## **Example 2: Question Answering**

Here we use `deepset/roberta-base-squad2`, a RoBERTa-base model fine-tuned for extractive question answering. Instead of calling the hosted API, the cells below run the model locally through the `transformers` library: we provide a context passage and a question, and the model extracts the answer span from the context.

In [25]:
!pip install transformers



In [26]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "deepset/roberta-base-squad2"

# a) Get predictions
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'Why is model conversion important?',
    'context': 'The option to convert models between FARM and transformers gives freedom to the user and let people easily switch between frameworks.'
}
res = nlp(QA_input)




Device set to use cuda:0


In [27]:
res

{'score': 0.21171417832374573,
 'start': 59,
 'end': 84,
 'answer': 'gives freedom to the user'}

In [28]:

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# b) Load model & tokenizer
# return_dict=False makes the model return a plain (start_scores, end_scores) tuple
model = AutoModelForQuestionAnswering.from_pretrained(model_name,return_dict=False)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
question = "What has Huggingface done ?"

# Encode the question and the context together as one input sequence
encoding = tokenizer(question, text, return_tensors="pt")
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]

# Per-token scores for where the answer span starts and ends
start_scores, end_scores = model(input_ids, attention_mask=attention_mask)
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())

# Keep the tokens between the highest-scoring start and end positions, then decode them
answer_tokens = all_tokens[torch.argmax(start_scores) :torch.argmax(end_scores)+1]
answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens))

In [29]:
answer

' democratized NLP'

### Explanation:
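- **pipeline('question-answering')**: The simpler approach: pass a question and a context, and get back the answer text with its score and character offsets.
- We also show a lower-level approach that works with the model's raw outputs.
- **start_scores, end_scores**: Tensors holding one score per input token for being the start or the end of the answer.
- We take the highest-scoring start and end positions and decode the tokens between them into a text answer.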
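
The span-selection step can be illustrated without loading a model. The sketch below uses a hypothetical helper, `select_answer_span`, plain Python lists, and made-up scores; the token strings are simplified stand-ins and do not reproduce the tokenizer's real output.

```python
def select_answer_span(tokens, start_scores, end_scores):
    """Pick the tokens between the best-scoring start and end positions."""
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    end = max(range(len(end_scores)), key=end_scores.__getitem__)
    if end < start:
        # Independent argmaxes are not guaranteed to form a valid span
        return []
    return tokens[start:end + 1]

# Simplified tokens for "<question> </s> <context>" (illustrative only)
tokens = ["<s>", "What", "has", "Huggingface", "done", "?", "</s>",
          "Huggingface", "has", "democratized", "NLP", ".", "</s>"]

# Made-up scores: the start peaks at "democratized", the end at "NLP"
start_scores = [0.1, 0.0, 0.2, 0.3, 0.1, 0.0, 0.0, 0.4, 0.9, 5.2, 1.1, 0.0, 0.0]
end_scores   = [0.0, 0.1, 0.0, 0.2, 0.3, 0.0, 0.1, 0.2, 0.3, 1.4, 6.0, 0.5, 0.0]

print(select_answer_span(tokens, start_scores, end_scores))
```

Taking the two argmaxes independently is the simplest strategy and matches the cell above. A common refinement, not shown here, is to search for the best-scoring pair with the start at or before the end.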


## **Example 3: Summarization**

Large language models can also summarize text. Let’s summarize a longer passage describing large language models using the Bart Large CNN model. We change the API URL and pass the input text below:

In [42]:
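from huggingface_hub import HfFolder

API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn"
# Send the saved access token as a Bearer header with every request
headers = {"Authorization": f"Bearer {HfFolder.get_token()}"}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()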

params = {'do_sample': False}

full_text = '''AI applications are summarizing articles, writing stories and
engaging in long conversations — and large language models are doing
the heavy lifting.

A large language model, or LLM, is a deep learning model that can
understand, learn, summarize, translate, predict, and generate text and other
content based on knowledge gained from massive datasets.

Large language models - successful applications of
transformer models. They aren’t just for teaching AIs human languages,
but for understanding proteins, writing software code, and much, much more.

In addition to accelerating natural language processing applications —
like translation, chatbots, and AI assistants — large language models are
used in healthcare, software development, and use cases in many other fields.'''

output = query({
    'inputs': full_text,
    'parameters': params
})

pprint(output)

{'error': 'Invalid username or password.'}


### Explanation:
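- **params = {'do_sample': False}**: Disables sampling, which makes the summary deterministic.
- We pass `'inputs': full_text` for summarization.
- **pprint(output)**: On success, this should print the generated summary under the key `'summary_text'`. The recorded run instead returned `{'error': 'Invalid username or password.'}`, an authentication failure: the Inference API expects the access token in an `Authorization: Bearer <token>` header.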




## **Example 4: Image Generation from Text**

We demonstrate a local `diffusers` pipeline that generates an image from a text prompt with Stable Diffusion v1.5. (A GPU is strongly recommended; the model weights run to several gigabytes.)


In [8]:
# Upgrade dependencies
!pip install --upgrade transformers diffusers


Collecting transformers
  Using cached transformers-4.50.3-py3-none-any.whl.metadata (39 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Using cached tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Using cached transformers-4.50.3-py3-none-any.whl (10.2 MB)
Using cached tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.13.3
    Uninstalling tokenizers-0.13.3:
      Successfully uninstalled tokenizers-0.13.3
  Attempting uninstall: transformers
    Found existing installation: transformers 4.31.0
    Uninstalling transformers-4.31.0:
      Successfully uninstalled transformers-4.31.0
Successfully installed tokenizers-0.21.1 transformers-4.50.3


In [3]:
from diffusers import StableDiffusionPipeline
import torch

# Load the Stable Diffusion v1.5 model
model_id = "runwayml/stable-diffusion-v1-5"

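# Pick the device first so the dtype can match it: float16 on GPU, float32 on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Initialize the pipeline without specifying the revision
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype)

# Move the pipeline to the selected device
pipe = pipe.to(device)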

# Define your prompt
prompt = "A boat in the water under a sunset sky"

# Generate the image
image = pipe(prompt).images[0]

# Display the image
image.show()

# Optionally, save the image
image.save("generated_boat.png")



# **Notebook Summary**

We covered:

1. **Sentence Transformers**:
   - Installation, loading `all-mpnet-base-v2`.
   - Encoding sentences.
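   - Cosine similarity using both a neural (sentence embedding) approach and a custom bag-of-words approach.

2. **Hugging Face Inference API** and pipeline examples:
   - Bloom text generation (`InferenceApi`) and BLOOMZ through a raw `requests` call.
   - QA with `deepset/roberta-base-squad2`, run locally with `transformers`.
   - Summarization with `bart-large-cnn`.
   - Image generation with a local Stable Diffusion pipeline (`diffusers`).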

This demonstrates an end-to-end approach for analyzing and comparing textual data, plus harnessing LLMs for generation, Q&A, or summarization tasks.
