
# Sentence Similarity with BERT
## Faith Mumbua | Natural Language Processing

# Introduction


*   In this assignment, i use a pre-trained BERT model from Hugging Face to compute sentence similarity
*   using contextual embeddings. i'll extract [CLS] token embeddings for sentence pairs, calculate cosine


*   similarity scores, predict similarity based on a threshold, and evaluate the model's performance.


















# 1. Install and import required libraries

In [6]:
!pip install transformers

from transformers import BertTokenizer, TFBertModel
import tensorflow as tf
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity



# 2. Load pre-trained BERT tokenizer and model

In [7]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

# 3. Define sentence pairs and labels


In [8]:
# 5 given + 5 new sentence pairs
sentence_pairs = [
    ("How do I learn Python?", "What is the best way to study Python?"),
    ("What is AI?", "How to cook pasta?"),
    ("How do I bake a chocolate cake?", "Give me a chocolate cake recipe."),
    ("How can I improve my coding skills?", "Tips for becoming better at programming."),
    ("Where can I buy cheap laptops?", "Best sites to find affordable computers."),
    # Additional 5 sentence pairs
    ("I love playing football.", "Soccer is my favorite sport."),
    ("The weather is sunny today.", "It is raining heavily in Nairobi."),
    ("Can I travel to Mombasa without an ID?", "Do I need identification to travel to Mombasa?"),
    ("She is reading a novel.", "He is watching a movie."),
    ("My phone battery dies quickly.", "My smartphone doesn’t last long on charge."),
]

# Ground truth similarity labels
labels = [1, 0, 1, 1, 1, 1, 0, 1, 0, 1]

# 4. Define function to get [CLS] embedding for a sentence

In [9]:
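def get_sentence_embedding(sentence):
    """Return the final-layer [CLS] embedding of a sentence as a (1, 768) NumPy array."""
    # Tokenize one sentence; truncation caps the input at the model's maximum length
    inputs = tokenizer(sentence, return_tensors='tf', padding=True, truncation=True)
    outputs = bert_model(inputs)
    # Position 0 of the last hidden state holds the [CLS] token
    cls_embedding = outputs.last_hidden_state[:, 0, :]
    return cls_embedding.numpy()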

# 5. Calculate cosine similarity for each pair and predict similarity

In [10]:
THRESHOLD = 0.7  # pairs scoring above this value are predicted similar
predictions = []

for sent1, sent2 in sentence_pairs:
    emb1 = get_sentence_embedding(sent1)
    emb2 = get_sentence_embedding(sent2)
    # cosine_similarity returns a 1x1 matrix for two single-row inputs
    sim_score = cosine_similarity(emb1, emb2)[0][0]
    pred = 1 if sim_score > THRESHOLD else 0
    predictions.append(pred)

    print(f"\nSentence 1: {sent1}")
    print(f"Sentence 2: {sent2}")
    print(f"Cosine Similarity: {sim_score:.4f} → Predicted Similar: {pred}")


Sentence 1: How do I learn Python?
Sentence 2: What is the best way to study Python?
Cosine Similarity: 0.9743 → Predicted Similar: 1

Sentence 1: What is AI?
Sentence 2: How to cook pasta?
Cosine Similarity: 0.9033 → Predicted Similar: 1

Sentence 1: How do I bake a chocolate cake?
Sentence 2: Give me a chocolate cake recipe.
Cosine Similarity: 0.8938 → Predicted Similar: 1

Sentence 1: How can I improve my coding skills?
Sentence 2: Tips for becoming better at programming.
Cosine Similarity: 0.8633 → Predicted Similar: 1

Sentence 1: Where can I buy cheap laptops?
Sentence 2: Best sites to find affordable computers.
Cosine Similarity: 0.8750 → Predicted Similar: 1

Sentence 1: I love playing football.
Sentence 2: Soccer is my favorite sport.
Cosine Similarity: 0.9594 → Predicted Similar: 1

Sentence 1: The weather is sunny today.
Sentence 2: It is raining heavily in Nairobi.
Cosine Similarity: 0.8443 → Predicted Similar: 1

Sentence 1: Can I travel to Mombasa without an ID?
Sentence

# 6. Evaluate accuracy

In [11]:
# Count the pairs whose thresholded prediction matches the ground-truth label
correct = sum(pred == label for pred, label in zip(predictions, labels))
accuracy = correct / len(labels)
print(f"\nAccuracy: {accuracy:.2%}")


Accuracy: 70.00%
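
The accuracy above depends on the 0.7 threshold. As a rough sketch of that sensitivity, the snippet below re-scores the seven pairs whose similarity values are printed in full above, together with their labels, at a few candidate thresholds. The helper `accuracy_at`, the name `scored_pairs`, and the candidate thresholds are illustrative assumptions; the three pairs whose scores are cut off in the output are left out.

```python
# The seven similarity scores printed in full above, paired with their labels.
# The remaining three pairs are omitted because their scores are cut off in the output.
scored_pairs = [
    (0.9743, 1), (0.9033, 0), (0.8938, 1), (0.8633, 1),
    (0.8750, 1), (0.9594, 1), (0.8443, 0),
]


def accuracy_at(threshold, pairs):
    """Accuracy when pairs scoring above `threshold` are predicted similar."""
    hits = sum((score > threshold) == bool(label) for score, label in pairs)
    return hits / len(pairs)


# Candidate thresholds are arbitrary illustrative values, not tuned settings.
for threshold in (0.70, 0.85, 0.90, 0.95):
    print(f"threshold={threshold:.2f}  accuracy={accuracy_at(threshold, scored_pairs):.2%}")
```

On a sample this small the sweep only illustrates how much the result depends on the cut-off; a threshold picked this way would still need to be checked on held-out pairs.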


# Theory Questions

# How does BERT differ from traditional NLP approaches like Bag of Words or TF-IDF?
Traditional approaches such as Bag of Words (BoW) and TF-IDF represent text as sparse vectors of word counts or weights. A word contributes the same way wherever it appears, so word order and context are ignored.
BERT generates contextual embeddings, so the same word receives a different vector depending on the surrounding words.
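
A minimal sketch of how a bag-of-words representation discards word order, using only the standard library. The helper `bag_of_words` and the two example sentences are illustrative assumptions, not part of the assignment code.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count lowercase whitespace-separated tokens; word positions are discarded."""
    return Counter(sentence.lower().split())


bow_a = bag_of_words("the dog bites the man")
bow_b = bag_of_words("the man bites the dog")

# Both sentences contain the same words with the same counts, so their
# bag-of-words representations are equal even though the meanings differ.
print(bow_a == bow_b)
```

A contextual model has access to word positions, so it is not forced to treat these two sentences as identical.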

# What is the role of the encoder in the BERT model?
BERT is a stack of Transformer encoder layers that use self-attention to learn relationships between the words in a sentence. In this assignment,
I use the encoder's final-layer output for the [CLS] token to represent the sentence.

# What are contextual embeddings?
Contextual embeddings are vector representations of words that depend on how each word is used in context.
BERT computes them dynamically for every token from the entire sentence, which lets it capture nuances in meaning, such as "bank" in "river bank" versus "bank account".

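# Why is the [CLS] token used for sentence similarity?
The [CLS] token is a special token added to the beginning of every input sequence.
During pre-training, its final hidden state feeds the next-sentence-prediction classifier, so it is commonly treated as a summary of the whole input, and I use it to represent the sentence when computing similarity. Without fine-tuning, [CLS] embeddings of different sentences tend to lie close together, which is consistent with every fully printed score in this notebook exceeding 0.84, including the scores for unrelated pairs.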

# What is cosine similarity and why is it useful?
Cosine similarity is the cosine of the angle between two vectors: their dot product divided by the product of their norms. It ranges from -1 to 1 and indicates how closely the vectors point in the same direction, regardless of magnitude.
That makes it well suited to comparing high-dimensional embeddings like those from BERT.
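
To make the definition concrete, here is a small pure-Python sketch of cosine similarity on toy vectors. The function `cosine` and the example vectors are illustrative assumptions; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])  # scaled copy
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])                    # right angle
opposite = cosine([1.0, 2.0], [-1.0, -2.0])                    # reversed direction

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

The first pair differs only in length, so its similarity is at the maximum; this is the sense in which the measure ignores magnitude.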
