# 🌟 Sentence Similarity Highlighter 🌟

This Jupyter Notebook walks through the Sentence Similarity Highlighter, a tool that highlights similar sentences between two text files using sentence embeddings and cosine similarity.

## 📚 Overview

The **Sentence Similarity Highlighter** tool reads two text files, segments each into sentences, computes the cosine similarity between sentences at the same position in each file (the first with the first, the second with the second, and so on), and writes both texts to an HTML file with each sentence pair colored by its similarity score.

### 🚀 Features

- **Advanced Similarity Measurement**: Uses sentence embeddings from the `sentence-transformers` library and compares them with cosine similarity.
- **Dynamic Color Mapping**: Displays sentences in one of five color bands, from blue (low similarity) to red (high similarity).
- **Color Legend**: Includes a legend to interpret the color coding effectively.

## 🛠️ Requirements

Before running the code, ensure you have the following libraries installed:

```bash
pip install nltk sentence-transformers numpy
```

## 📜 Code Walkthrough
Let's go through the code step by step:

1. Import Libraries

In [None]:
import os
import nltk
from sentence_transformers import SentenceTransformer, util
import numpy as np

2. Download NLTK Data


In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')  # Newer NLTK releases load the sentence tokenizer from 'punkt_tab'

3. Load the Sentence Transformer Model


In [None]:
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

4. Read File Function

In [None]:
def read_file(file_path):
    """Reads a text file and returns its content as a string."""
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

5. Segment Sentences

In [None]:
def segment_sentences(text):
    """Segments text into sentences."""
    return nltk.sent_tokenize(text)
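
To show why a trained tokenizer is used here rather than a simple pattern, the sketch below splits text with a naive regular expression. `naive_split` is a hypothetical helper for illustration only and is not part of the tool; it breaks on abbreviations such as "Dr.", which is the kind of case the Punkt tokenizer behind `nltk.sent_tokenize` is designed to handle better.

```python
import re

def naive_split(text):
    """Splits after '.', '!' or '?' followed by whitespace. Not abbreviation-aware."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# The abbreviation "Dr." is wrongly treated as the end of a sentence
print(naive_split("Dr. Smith arrived. He sat down."))
# → ['Dr.', 'Smith arrived.', 'He sat down.']
```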

6. Compute Similarity

In [None]:
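def compute_similarity(sentences1, sentences2):
    """Computes the cosine similarity matrix for all sentence pairs.

    Entry [i, j] compares sentences1[i] with sentences2[j]; the diagonal
    holds the scores for sentences at the same position in both files.
    """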
    embeddings1 = model.encode(sentences1, convert_to_tensor=True)
    embeddings2 = model.encode(sentences2, convert_to_tensor=True)
    cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)
    return cosine_scores
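
To make the scores concrete, here is a minimal pure-Python sketch of cosine similarity and of how the diagonal of the score matrix lines up with sentence positions. The three-dimensional vectors are made-up stand-ins for real embeddings (the model produces vectors with many more dimensions), and `cosine` and `similarity_matrix` are hypothetical helper names, not functions from `sentence-transformers`.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors1, vectors2):
    """Entry [i][j] compares vectors1[i] with vectors2[j]."""
    return [[cosine(u, v) for v in vectors2] for u in vectors1]

# Toy "embeddings" (assumed values, for illustration only)
file1_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
file2_vectors = [[2.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

matrix = similarity_matrix(file1_vectors, file2_vectors)

# The diagonal holds the positional pairs that the highlighter colors
paired = min(len(file1_vectors), len(file2_vectors))
diagonal = [matrix[i][i] for i in range(paired)]
print(diagonal)  # → [1.0, 0.0]
```

The first pair points in the same direction, so it scores 1.0 regardless of vector length; the second pair is orthogonal and scores 0.0.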

7. Color Mapping Function

In [None]:
def color_from_similarity(score):
    """Maps similarity scores to a color gradient with more categories."""
    if score < 0.2:
        return 'rgb(0, 0, 255)'  # Blue
    elif score < 0.4:
        return 'rgb(173, 216, 230)'  # Light Blue
    elif score < 0.6:
        return 'rgb(128, 0, 128)'  # Purple
    elif score < 0.8:
        return 'rgb(255, 160, 122)'  # Light Red
    else:
        return 'rgb(255, 0, 0)'  # Red
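
The function above uses five fixed bands. If a smoother transition is preferred, one possible alternative (not what this notebook does) is to blend blue into red linearly. `gradient_color` is a hypothetical name, and clamping to the range 0 to 1 is an assumption made here because cosine scores can be negative.

```python
def gradient_color(score):
    """Linearly blends blue (score 0) into red (score 1)."""
    clamped = max(0.0, min(1.0, score))  # Cosine scores can fall below 0
    red = round(255 * clamped)
    blue = 255 - red
    return f'rgb({red}, 0, {blue})'

for s in (-0.3, 0.0, 1.0):
    print(s, gradient_color(s))
```

A continuous gradient shows finer differences between scores, while fixed bands are easier to match against a legend.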

8. Highlight Sentences

In [None]:
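import html

def highlight_sentences(sentences1, sentences2, sim_matrix):
    """Wraps sentences in HTML <span> tags with background colors based on similarity."""
    html_sentences1 = []
    html_sentences2 = []
    paired = min(len(sentences1), len(sentences2))

    for i in range(paired):
        score = sim_matrix[i, i].item()  # Diagonal entry: sentence i of file 1 vs. sentence i of file 2
        color = color_from_similarity(score)
        style = f'background-color:{color}; color:white'
        # Escape the text so characters such as < and & cannot break the markup
        html_sentences1.append(f'<span style="{style}">{html.escape(sentences1[i])}</span>')
        html_sentences2.append(f'<span style="{style}">{html.escape(sentences2[i])}</span>')

    # Sentences without a counterpart in the other file are kept, but left uncolored
    html_sentences1 += [f'<span>{html.escape(s)}</span>' for s in sentences1[paired:]]
    html_sentences2 += [f'<span>{html.escape(s)}</span>' for s in sentences2[paired:]]

    return ' '.join(html_sentences1), ' '.join(html_sentences2)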
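
Because sentences are compared by position, a single inserted sentence in one file shifts every later pair. Since the similarity matrix already holds all pairwise scores, one possible variation (an assumption, not part of this tool) is to look up the best match for each sentence instead of reading the diagonal. The sketch below uses plain nested lists with made-up scores, and `best_matches` is a hypothetical helper.

```python
def best_matches(sim_rows):
    """For each row, returns (column index of the highest score, that score)."""
    matches = []
    for row in sim_rows:
        best_j = max(range(len(row)), key=lambda j: row[j])
        matches.append((best_j, row[best_j]))
    return matches

# Made-up scores: 2 sentences in file 1 (rows) vs. 3 sentences in file 2 (columns)
scores = [
    [0.10, 0.92, 0.30],
    [0.85, 0.20, 0.40],
]
print(best_matches(scores))  # → [(1, 0.92), (0, 0.85)]
```

With the tensor returned by `compute_similarity`, calling `sim_matrix.tolist()` should give nested lists in this shape.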

9. Save HTML Output

In [None]:
def save_html(content1, content2, output_file):
    """Saves the highlighted content from two files to an HTML file."""
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write('<html><head><meta charset="utf-8"><style>body { font-family: Arial, sans-serif; }</style></head><body>')
        file.write('<h2>Color Legend</h2>')
        file.write('<div style="margin-bottom: 20px;">')
        file.write('<span style="background-color:rgb(0,0,255); color:white; padding:5px;">Low Similarity</span>')
        file.write('<span style="background-color:rgb(173,216,230); color:white; padding:5px; margin-left: 20px;">Medium-Low Similarity</span>')
        file.write('<span style="background-color:rgb(128,0,128); color:white; padding:5px; margin-left: 20px;">Medium Similarity</span>')
        file.write('<span style="background-color:rgb(255,160,122); color:white; padding:5px; margin-left: 20px;">Medium-High Similarity</span>')
        file.write('<span style="background-color:rgb(255,0,0); color:white; padding:5px; margin-left: 20px;">High Similarity</span>')
        file.write('</div>')
        file.write('<h2>File 1 Content</h2>')
        file.write(content1)
        file.write('<h2>File 2 Content</h2>')
        file.write(content2)
        file.write('</body></html>')
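
The thresholds and colors appear twice in this notebook: once in `color_from_similarity` and once in the legend written by `save_html`. A possible way to keep them in sync is to drive both from one table. This is a sketch under that assumption; `BANDS`, `color_for`, and `legend_html` are hypothetical names, and the values are copied from the two functions above.

```python
# (upper bound, color, label) for each band
BANDS = [
    (0.2, 'rgb(0, 0, 255)', 'Low Similarity'),
    (0.4, 'rgb(173, 216, 230)', 'Medium-Low Similarity'),
    (0.6, 'rgb(128, 0, 128)', 'Medium Similarity'),
    (0.8, 'rgb(255, 160, 122)', 'Medium-High Similarity'),
    (float('inf'), 'rgb(255, 0, 0)', 'High Similarity'),
]

def color_for(score):
    """Returns the color of the first band whose upper bound exceeds the score."""
    for upper, color, _label in BANDS:
        if score < upper:
            return color
    return BANDS[-1][1]  # Fallback for values that compare false, such as NaN

def legend_html():
    """Builds the legend markup from the same table used for coloring."""
    spans = [
        f'<span style="background-color:{color}; color:white; padding:5px;">{label}</span>'
        for _upper, color, label in BANDS
    ]
    return '<div style="margin-bottom: 20px;">' + ' '.join(spans) + '</div>'

print(color_for(0.35))
print(legend_html())
```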

10. Main Function

In [None]:
def main(file1_path, file2_path, output_file):
    """Main function to process files and save highlighted sentences to an HTML file."""
    text1 = read_file(file1_path)
    text2 = read_file(file2_path)
    
    sentences1 = segment_sentences(text1)
    sentences2 = segment_sentences(text2)
    
    sim_matrix = compute_similarity(sentences1, sentences2)
    
    highlighted_content1, highlighted_content2 = highlight_sentences(sentences1, sentences2, sim_matrix)
    
    save_html(highlighted_content1, highlighted_content2, output_file)

## 🏁 Running the Tool

In [None]:
if __name__ == "__main__":
    file1_path = 'file1.txt'  # Path to the first text file
    file2_path = 'file2.txt'  # Path to the second text file
    output_file = 'highlighted.html'
    main(file1_path, file2_path, output_file)
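
If the code is moved from the notebook into a script, the hard-coded paths could be replaced with command-line arguments. The sketch below is one way to do that with the standard `argparse` module; `parse_args` and the `-o/--output` flag are assumptions for illustration, not an existing interface of this tool.

```python
import argparse

def parse_args(argv=None):
    """Parses the two input paths and an optional output path."""
    parser = argparse.ArgumentParser(
        description='Highlight similar sentences between two text files.')
    parser.add_argument('file1', help='path to the first text file')
    parser.add_argument('file2', help='path to the second text file')
    parser.add_argument('-o', '--output', default='highlighted.html',
                        help='HTML file to write')
    return parser.parse_args(argv)

# An explicit list is passed here; inside Jupyter, the default sys.argv
# contains the kernel's own arguments rather than file paths
args = parse_args(['file1.txt', 'file2.txt'])
print(args.file1, args.file2, args.output)
# main(args.file1, args.file2, args.output)  # Call this where main() is defined
```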