## Exercise 2
This notebook demonstrates the use of SBERT (Sentence-BERT) for implementing a semantic search system as part of an assignment. SBERT encodes sentences into dense vector representations, enabling efficient similarity-based retrieval. The project aims to create an innovative solution using semantic search, potentially integrating datasets, models, and interactive tools to deliver meaningful results.

### Data Loading and Preprocessing
This code sets up the environment by installing the necessary libraries, including sentence_transformers for SBERT models and datasets for accessing pre-built datasets. It then loads the "Natural Questions" dataset from Hugging Face, specifically the training split, which contains question-and-answer pairs suitable for semantic search tasks. Finally, it prints the first data entry to inspect the structure and ensure the dataset includes relevant fields for queries and answers. This step is critical for understanding the dataset and aligning it with the project’s goals.

In [None]:
!pip install -U sentence_transformers -q
!pip install datasets

from datasets import load_dataset
# First, load the dataset (with query and answer information)
# by its dataset id on the Hugging Face Hub
dataset_id = "sentence-transformers/natural-questions"
dataset_file = load_dataset(dataset_id, split="train")

print(dataset_file[0])

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Generating train split:   0%|          | 0/100231 [00:00<?, ? examples/s]

{'query': 'when did richmond last play in a preliminary final', 'answer': "Richmond Football Club Richmond began 2017 with 5 straight wins, a feat it had not achieved since 1995. A series of close losses hampered the Tigers throughout the middle of the season, including a 5-point loss to the Western Bulldogs, 2-point loss to Fremantle, and a 3-point loss to the Giants. Richmond ended the season strongly with convincing victories over Fremantle and St Kilda in the final two rounds, elevating the club to 3rd on the ladder. Richmond's first final of the season against the Cats at the MCG attracted a record qualifying final crowd of 95,028; the Tigers won by 51 points. Having advanced to the first preliminary finals for the first time since 2001, Richmond defeated Greater Western Sydney by 36 points in front of a crowd of 94,258 to progress to the Grand Final against Adelaide, their first Grand Final appearance since 1982. The attendance was 100,021, the largest crowd to a grand final sinc

The output confirms successful installation of the required libraries and proper loading of the "Natural Questions" dataset. A sample entry from the dataset is displayed, showing a query and its detailed answer. The query asks about Richmond's last preliminary final, and the answer provides an in-depth response about their 2017 performance, including milestones and achievements. This dataset is relevant for semantic search because it pairs complex questions with comprehensive answers, allowing for effective training and testing of retrieval systems.
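The field check described above can also be done programmatically. The following is a minimal sketch, assuming each record is a plain dictionary with `query` and `answer` keys as in the printed entry. The helper name `invalid_record_indices` and the sample records are hypothetical and used only for illustration.

```python
# Hypothetical records shaped like the entry printed above (texts shortened).
sample_records = [
    {"query": "when did richmond last play in a preliminary final",
     "answer": "Richmond Football Club Richmond began 2017 with 5 straight wins"},
    {"query": "who made the song my achy breaky heart", "answer": ""},
    {"query": "a record without an answer field"},
]

REQUIRED_FIELDS = ("query", "answer")


def invalid_record_indices(records, required=REQUIRED_FIELDS):
    """Return the indices of records with a missing or blank required field."""
    bad = []
    for index, record in enumerate(records):
        for field in required:
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                bad.append(index)
                break  # one problem is enough to flag the record
    return bad


print(invalid_record_indices(sample_records))
```

On the real data, the same helper could be pointed at a slice such as `dataset_file.select(range(32))`, assuming each row comes back as a plain dictionary.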

### Semantic Search Implementation Using SBERT
This code uses the pre-trained allenai-specter model, loaded through the SentenceTransformers library, to implement a semantic search system. It takes the first 32 records of the dataset and joins each record's query and answer fields with a [SEP] token into a single text. The model encodes these texts, and the resulting embeddings are stored in corpus_embeddings so that they are computed once and reused for every search.

The search_papers function performs semantic search by encoding a user-provided query into an embedding and finding the most similar texts in the pre-encoded corpus using cosine similarity. The results are ranked by similarity scores, and the top 5 matches are displayed along with their respective answers. This approach leverages SBERT's ability to produce meaningful vector representations, enabling accurate and context-aware retrieval for complex queries.

This is demonstrated with the example query, “when did Richmond last play in a preliminary final,” which retrieves the most relevant responses based on semantic similarity.
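To make the ranking step concrete, here is a minimal sketch of cosine-similarity retrieval in plain Python. It is not the library's implementation. The three-dimensional vectors are toy stand-ins for real embeddings, and the names `cosine_similarity` and `top_k` are assumptions made for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus_vecs, k=5):
    """Score every corpus vector against the query and keep the k best."""
    scored = [
        {"corpus_id": i, "score": cosine_similarity(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:k]


# Toy embeddings (assumed values, not produced by any model).
toy_corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
toy_query = [1.0, 0.1, 0.0]

hits = top_k(toy_query, toy_corpus, k=2)
print([hit["corpus_id"] for hit in hits])
```

The hit dictionaries use the same `corpus_id` and `score` keys that the notebook reads from its search results, so the sketch lines up with the code that follows.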

In [None]:
from sentence_transformers import SentenceTransformer, util


# Use the allenai-specter model with SentenceTransformers
model = SentenceTransformer('allenai-specter')

# Prepare paper texts by combining query and answer fields
paper_texts = [
    record['query'] + '[SEP]' + record['answer'] for record in dataset_file.select(range(32))
]

# Compute embeddings for all paper texts
corpus_embeddings = model.encode(paper_texts, convert_to_tensor=True, show_progress_bar=True)

# Function to search for answers given a query
def search_papers(query):
    # Encode the query
    query_embedding = model.encode(query, convert_to_tensor=True)

    # Perform semantic search
    search_hits = util.semantic_search(query_embedding, corpus_embeddings)
    search_hits = search_hits[0]  # Get the hits for the first query

    print("\n\nQuery:", query)
    print("Most similar answers:")
    for hit in search_hits[:5]:  # Limit to top 5 results for clarity
        related_text = dataset_file[int(hit['corpus_id'])]  # Access related record
        print("{:.2f}\tAnswer: {}".format(
            hit['score'], related_text['answer']
        ))

# Example usage
search_papers("when did richmond last play in a preliminary final")


The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.





Query: when did richmond last play in a preliminary final
Most similar answers:
0.87	Answer: Jack Scott (singer) At the beginning of 1960, Scott again changed record labels, this time to Top Rank Records.[1] He then recorded four Billboard Hot 100 hits – "What in the World's Come Over You" (#5), "Burning Bridges" (#3) b/w "Oh Little One" (#34), and "It Only Happened Yesterday" (#38).[1] "What in the World's Come Over You" was Scott's second gold disc winner.[6] Scott continued to record and perform during the 1960s and 1970s.[1] His song "You're Just Gettin' Better" reached the country charts in 1974.[1] In May 1977, Scott recorded a Peel session for BBC Radio 1 disc jockey, John Peel.
0.86	Answer: Cooley High Cooley High is a 1975 American coming-of-age/ drama film that follows the narrative of high school seniors and best-friends, Leroy "Preach" Jackson (Glynn Turman) and Richard "Cochise" Morris (Lawrence Hilton-Jacobs). Written by Eric Monte, directed by Michael Schultz and produ


The output shows the results of the semantic search for the example query.

The allenai-specter model encoded both the corpus and the query, and the query "when did Richmond last play in a preliminary final" returned a list of answers ranked by similarity score.

The top result has a score of 0.87 but is about the singer Jack Scott and is unrelated to the query. The fourth result, with a score of 0.84, contains the correct and detailed response about Richmond's 2017 AFL performance. An irrelevant passage outranking the correct one within such a narrow score range shows that the similarity scores do not separate relevant from irrelevant passages well here. One likely factor is that allenai-specter was trained on scientific paper titles and abstracts rather than on general question answering. The system works, but it may benefit from techniques such as re-ranking or additional filtering to improve precision.
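One simple form of re-ranking blends each similarity score with the share of query terms that appear in the retrieved text. The sketch below is an illustration under stated assumptions, not a tuned method. The stopword list, the `weight` parameter and the shortened texts are made up for the example, and the scores 0.87 and 0.84 are used only to mirror the situation described above.

```python
import re

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = frozenset({"a", "an", "the", "in", "of", "did", "when", "who", "is"})


def content_words(text):
    """Lowercased word set with the illustrative stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def rerank(query, hits, texts, weight=0.5):
    """Blend each hit's embedding score with its query-term overlap and re-sort."""
    query_terms = content_words(query)
    reranked = []
    for hit in hits:
        doc_terms = content_words(texts[hit["corpus_id"]])
        overlap = len(query_terms & doc_terms) / len(query_terms) if query_terms else 0.0
        combined = (1 - weight) * hit["score"] + weight * overlap
        reranked.append({**hit, "overlap": overlap, "combined": combined})
    reranked.sort(key=lambda hit: hit["combined"], reverse=True)
    return reranked


# Shortened stand-ins for two retrieved passages.
texts = [
    "Jack Scott (singer) At the beginning of 1960, Scott again changed record labels",
    "Richmond defeated Greater Western Sydney to progress to the Grand Final "
    "after reaching the preliminary finals",
]
hits = [{"corpus_id": 0, "score": 0.87}, {"corpus_id": 1, "score": 0.84}]

reranked = rerank("when did richmond last play in a preliminary final", hits, texts)
print([hit["corpus_id"] for hit in reranked])
```

Exact word matching is crude (for instance, "finals" does not match "final"), so a real system would more likely use a cross-encoder or a proper lexical scorer. The sketch only shows where such a step would sit.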

### Executing Semantic Search with a New Query
This block reuses the search_papers function to search for the query "who made the song my achy breaky heart." It encodes the query into a dense vector using the SBERT model and calculates the semantic similarity with precomputed corpus embeddings. The function retrieves and ranks the most relevant responses based on similarity scores, displaying the top 5 answers.

This step tests the system's ability to find accurate and contextually relevant answers for a new query, showcasing the model's generalizability and retrieval performance for different types of questions.
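Retrieval performance can also be summarized with numbers rather than judged by eye. Since every record pairs a query with its own answer, one option is to check where that answer lands in the ranking for its query. The sketch below computes hit rate at k and mean reciprocal rank. The function names and the example rankings are hypothetical and not taken from the notebook's output.

```python
def hit_rate_at_k(rankings, gold_ids, k=5):
    """Fraction of queries whose correct corpus_id appears in the top k."""
    if not gold_ids:
        return 0.0
    found = sum(1 for ranked, gold in zip(rankings, gold_ids) if gold in ranked[:k])
    return found / len(gold_ids)


def mean_reciprocal_rank(rankings, gold_ids):
    """Average of 1/rank of the correct corpus_id (0 when it is not retrieved)."""
    if not gold_ids:
        return 0.0
    total = 0.0
    for ranked, gold in zip(rankings, gold_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_ids)


# Hypothetical ranked corpus ids for three queries.
rankings = [
    [7, 3, 9, 0, 4],  # correct id 0 is at rank 4
    [1, 5, 2, 8, 6],  # correct id 1 is at rank 1
    [4, 6, 8, 3, 9],  # correct id 2 is not in the top 5
]
gold_ids = [0, 1, 2]

print(hit_rate_at_k(rankings, gold_ids, k=5))
print(mean_reciprocal_rank(rankings, gold_ids))
```

Running every query in the corpus through the search function and feeding the resulting rankings into these helpers would give a single figure to compare models or preprocessing choices against.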

In [None]:
search_papers("who made the song my achy breaky heart")



Query: who made the song my achy breaky heart
Most similar answers:
0.89	Answer: Achy Breaky Heart "Achy Breaky Heart" is a country song written by Don Von Tress. Originally titled "Don't Tell My Heart" and performed by The Marcy Brothers in 1991, its name was later changed to "Achy Breaky Heart" and performed by Billy Ray Cyrus on his 1992 album Some Gave All. The song is Cyrus' debut single and signature song, it made him famous and has been his most successful song. It became the first single ever to achieve triple Platinum status in Australia[1] and also 1992's best-selling single in the same country.[2][3] In the United States it became a crossover hit on pop and country radio, peaking at number 4 on the Billboard Hot 100 and topping the Hot Country Songs chart, becoming the first country single to be certified Platinum since Kenny Rogers and Dolly Parton's "Islands in the Stream" in 1983.[4] The single topped in several countries, and after being featured on Top of the Pops in 

The semantic search retrieves the correct and highly relevant answer as the top result.

The query, "who made the song my achy breaky heart," produces the top result with a score of 0.89, correctly identifying "Achy Breaky Heart" as a song performed by Billy Ray Cyrus and detailing its origins and success. However, subsequent results, while semantically related to music or artists, are less relevant to the query, showing some limitations in precision. This demonstrates that the model effectively retrieves the correct answer but still ranks unrelated entries with similar semantic structures highly. Further refinement or filtering could improve accuracy in such cases.

### Combining Semantic Search with Text Summarization
This code integrates semantic search with text summarization to condense the most relevant answers for a given query. It uses the pipeline function from the transformers library to create a summarizer.

The function first performs a semantic search with the SBERT model to retrieve the top 5 most similar answers from the corpus. These answers are concatenated into a single string, which the summarizer condenses into a brief summary. The summary's maximum length in tokens is set by the max_summary_length parameter (60 by default). The example query, "who made the song my achy breaky heart," demonstrates how the function retrieves answers and summarizes them.

In [None]:
from transformers import pipeline

# Summarization pipeline
summarizer = pipeline("summarization")

# Collect the relevant answers from the search function
def search_papers_and_summarize(query, max_summary_length=60):
    # Encode the query
    query_embedding = model.encode(query, convert_to_tensor=True)

    # Perform semantic search
    search_hits = util.semantic_search(query_embedding, corpus_embeddings)
    search_hits = search_hits[0]  # Get the hits for the first query

    # Collect answers from top hits
    answers = []
    for hit in search_hits[:5]:  # Limit to top 5 results
        related_text = dataset_file[int(hit['corpus_id'])]
        answers.append(related_text['answer'])

    # Combine answers into a single text for summarization
    combined_text = " ".join(answers)

    # Summarize the combined text; truncation=True cuts input that exceeds the model's maximum length
    summary = summarizer(combined_text, max_length=max_summary_length, truncation=True, clean_up_tokenization_spaces=True)
    print("Summary:")
    print(summary[0]['summary_text'])

# Example usage
search_papers_and_summarize("who made the song my achy breaky heart")


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Summary:
 Billy Ray Cyrus' "Achy Breaky Heart" is his signature song. It is considered by some as one of the worst songs of all time, featuring at number two in VH1 and Blender's list of the "50 Most Awesomely Bad Songs Ever" Jack Scott


This output indicates that the summarization pipeline used the default model, sshleifer/distilbart-cnn-12-6, as no specific model was provided during setup. This model is optimized for text summarization and provides a lightweight alternative to larger BART models. A cautionary note mentions that using a pipeline without explicitly specifying the model and its version is not recommended in production environments, as it might lead to unexpected behaviors or updates affecting results.
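The warning can be addressed by naming the model and revision explicitly. The sketch below pins the pipeline to the identifiers printed in the warning. The helper name `build_summarizer` is an assumption, and the import sits inside the function so that defining the helper does not load transformers or download anything.

```python
# Identifiers copied from the warning message above.
SUMMARIZER_MODEL = "sshleifer/distilbart-cnn-12-6"
SUMMARIZER_REVISION = "a4f8f3e"


def build_summarizer(model_id=SUMMARIZER_MODEL, revision=SUMMARIZER_REVISION):
    """Create a summarization pipeline pinned to an explicit model and revision."""
    # Imported lazily: the model is only loaded when this function is called.
    from transformers import pipeline

    return pipeline("summarization", model=model_id, revision=revision)
```

Calling `build_summarizer()` in place of `pipeline("summarization")` should load the same model as before while making the choice explicit and reproducible.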

The summary generated highlights key information about Billy Ray Cyrus's "Achy Breaky Heart," identifying it as his signature song. It briefly notes its controversial reception, including being listed as one of VH1 and Blender's "50 Most Awesomely Bad Songs Ever." The mention of "Jack Scott" at the end suggests the text might have included extraneous or unrelated information, possibly due to the concatenation of multiple results during summarization. This output demonstrates the pipeline's ability to extract relevant information but also points to the need for careful preprocessing to avoid irrelevant details.
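One possible preprocessing step is to limit what reaches the summarizer: keep only the best-scoring hits and cap the number of words. The following is a minimal sketch under those assumptions. The function name `build_summary_input`, its parameters, the shortened texts and the score of 0.80 are hypothetical.

```python
def build_summary_input(hits, texts, max_hits=1, max_words=200):
    """Join the answers of the best-scoring hits, capped at max_words words."""
    ordered = sorted(hits, key=lambda hit: hit["score"], reverse=True)
    words = []
    for hit in ordered[:max_hits]:
        words.extend(texts[hit["corpus_id"]].split())
    return " ".join(words[:max_words])


# Shortened stand-ins for two retrieved answers.
texts = [
    "Achy Breaky Heart is a country song written by Don Von Tress.",
    "Jack Scott (singer) changed record labels in 1960.",
]
hits = [{"corpus_id": 1, "score": 0.80}, {"corpus_id": 0, "score": 0.89}]

summary_input = build_summary_input(hits, texts, max_hits=1)
print(summary_input)
```

With `max_hits=1`, only the top-ranked answer is passed on, which would keep unrelated passages such as the Jack Scott entry out of the summary. The cost is that useful detail in lower-ranked hits is discarded.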

### Executing Semantic Search and Summarization for a New Query
This line executes the search_papers_and_summarize function with the query "Who is wimpy kid." It combines semantic search with text summarization to retrieve and summarize the most relevant answers from the corpus. The query is encoded into an embedding, and the top 5 similar results are retrieved using semantic search. These results are concatenated into a single text, which is then passed through the summarizer to generate a concise summary. This demonstrates the functionality of the pipeline for a new query, showcasing its ability to retrieve and summarize contextually relevant information.

In [None]:
search_papers_and_summarize("Who is wimpy kid")

Summary:
 The show first premiered on Cartoon Network on August 13, 2004, as a 90-minute television film. The series finished its run on May 3, 2009, with a total of six seasons and seventy-nine episodes. Reruns have aired on Boomerang from August 11,


The summary for the query "Who is wimpy kid" provides a brief overview of a show, including its premiere on Cartoon Network on August 13, 2004, as a 90-minute film, its conclusion on May 3, 2009, after six seasons and seventy-nine episodes, and mentions reruns on Boomerang beginning on August 11.

While the summary captures general information about a television series, it does not address the query about "Wimpy Kid." Because the corpus contains only the first 32 records of the dataset, it may not include any passage relevant to this query. The semantic search still returns its highest-scoring hits regardless of how relevant they are, and the summarizer then condenses whatever text it is given. This highlights the importance of searching a larger corpus and of post-retrieval filtering, so that a summary is produced only when the retrieved text actually matches the query.
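A lightweight post-retrieval check is to look at how the scores are spread before summarizing. If the top score barely exceeds the rest, the ranking gives little evidence that any hit is truly relevant. The sketch below illustrates this idea. The function name `is_confident`, the `min_gap` value and the example score lists are assumptions, not measured results.

```python
from statistics import median


def is_confident(scores, min_gap=0.05):
    """Return True when the top score clearly exceeds the median of the rest."""
    if len(scores) < 2:
        return False  # nothing to compare the top score against
    ordered = sorted(scores, reverse=True)
    return ordered[0] - median(ordered[1:]) >= min_gap


# Hypothetical score lists: one tightly packed, one with a clear winner.
flat_scores = [0.87, 0.86, 0.85, 0.84, 0.83]
peaked_scores = [0.89, 0.70, 0.68, 0.66, 0.65]

print(is_confident(flat_scores))
print(is_confident(peaked_scores))
```

A wrapper around the summarization function could call such a check and report that no relevant passage was found instead of summarizing unrelated text. A suitable threshold would need to be chosen empirically for the model in use.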