## Semantic search

In [59]:
%pip install -q sentencepiece torch transformers numpy pandas protobuf scikit-learn




[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m23.3.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [60]:
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity



#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

def get_embedding(tokenizer, model, data):

    # Tokenize sentences
    encoded_input = tokenizer(data, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    # Perform pooling. In this case, mean pooling over the non-padding tokens.
    return mean_pooling(model_output, encoded_input['attention_mask'])


# dataset we want embeddings for
data = pd.DataFrame({
    'title': [
        'Infringement of Copyright: Legal Proceedings and Remedies',
        'Harnessing Hugging Face: Revolutionizing NLP with Transformers Library',
        'Patent Breach Litigation: Analysis and Enforcement Strategies',
        'Quantum Computing Unleashed: Breaking Down Complexities',
        'Microbiome Mysteries: Unlocking Gut Health Secrets',
        'Solar Power Surge: Illuminating Renewable Energys Future'],
    'content': [
        'This document addresses the legal framework surrounding copyright violation, including statutory damages, injunctive relief, and defendant liabilities....',
        'Explore how Hugging Faces Transformers library is transforming NLP. Unmatched ease-of-use, advanced models, and community-driven innovation redefine AIs future...',
        'Detailing patent infringement cases, this brief explores judicial remedies, infringement criteria, and defense strategies in patent law...',
        'Demystifying quantum computings complexities. Insight into qubits, quantum supremacy, and its potential to revolutionize technology....',
        'Exploring the gut microbiomes impact on health. New research unveils links between bacteria diversity and disease prevention...',
        'Analyzing solar energys rapid growth. Innovations in photovoltaic technology and global impact on sustainable energy shift...']
})

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-distilroberta-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-distilroberta-base-v2')

sentence_embeddings = get_embedding(tokenizer, model, data['content'].tolist())

print("Sentence embeddings:")
print(sentence_embeddings)


Sentence embeddings:
tensor([[ 0.4828,  0.3504,  0.0829,  ...,  0.0499, -0.1595, -0.3722],
        [ 0.2384, -0.3978, -0.0407,  ..., -1.2387, -0.1273,  0.3967],
        [ 0.4384,  0.6839,  0.1998,  ..., -0.1153, -0.3081, -0.4717],
        [ 0.4972,  0.7983,  0.0944,  ..., -0.1175, -0.4632,  0.5812],
        [ 0.0112, -0.5605, -0.3236,  ..., -0.0455, -0.0269,  0.3353],
        [-0.1307, -0.1798, -0.2947,  ...,  1.0156,  0.3974, -0.4707]])
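
The `mean_pooling` helper above averages token embeddings while ignoring padding positions. A minimal sketch of the same arithmetic on plain Python lists may make that easier to follow. The function name `masked_mean` and the toy values are illustrative assumptions, not part of the notebook or of any library.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1; entries with 0 are ignored."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_vectors, attention_mask):
        for i in range(dim):
            # Multiplying by the mask zeroes out padding tokens
            totals[i] += vector[i] * keep
    # Guard against an all-padding input, mirroring torch.clamp(..., min=1e-9)
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in totals]


# Hypothetical toy input: two real tokens followed by one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[100.0, 100.0]` has no effect on the result, which is the reason the attention mask is passed to the pooling step.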


In [61]:
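query = "intellectual property infringement"


def search(query, top_n=5):
    """Return the top_n rows of `data` ranked by cosine similarity to the query."""
    # Embed the query with the same tokenizer and model used for the documents
    query_embedding = get_embedding(tokenizer, model, [query])
    # cosine_similarity returns shape (1, n_documents); keep the single row
    similarities = cosine_similarity(query_embedding, sentence_embeddings)[0]
    # Negate so that argsort orders from most to least similar
    top_indices = np.argsort(-similarities)[:top_n]
    top_results = data.iloc[top_indices].reset_index(drop=True)
    top_results['similarity'] = similarities[top_indices]
    return top_results


results = search(query, top_n=5)
print(results)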

                                               title  \
0  Infringement of Copyright: Legal Proceedings a...   
1  Patent Breach Litigation: Analysis and Enforce...   
2  Harnessing Hugging Face: Revolutionizing NLP w...   
3  Quantum Computing Unleashed: Breaking Down Com...   
4  Solar Power Surge: Illuminating Renewable Ener...   

                                             content  similarity  
0  This document addresses the legal framework su...    0.614404  
1  Detailing patent infringement cases, this brie...    0.503443  
2  Explore how Hugging Faces Transformers library...    0.100846  
3  Demystifying quantum computings complexities. ...    0.063512  
4  Analyzing solar energys rapid growth. Innovati...    0.007385  
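
The ranking in `search` comes down to two steps: compute the cosine similarity between the query vector and each document vector, then sort in descending order. The following is a rough, dependency-free sketch of that idea. The helpers `cosine` and `rank` and the two-dimensional toy vectors are assumptions made for illustration; the notebook itself relies on scikit-learn's `cosine_similarity` and `np.argsort`. The sketch also assumes no vector is all zeros.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank(query_vec, doc_vecs, top_n=5):
    """Return (index, score) pairs for the top_n most similar documents."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    # Sort document indices from highest to lowest score
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [(i, scores[i]) for i in order[:top_n]]


# Hypothetical toy vectors standing in for sentence embeddings
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = rank([1.0, 0.0], docs, top_n=2)
print([i for i, _ in hits])  # [0, 2]
```

Document 0 points in the same direction as the query and ranks first; document 2 shares one component and ranks second; document 1 is orthogonal to the query and falls outside the top two.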
