<a href="https://colab.research.google.com/github/Sweta-Das/DLCourse-LLMUsingSemanticSearch/blob/main/LLM_DenseRetrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install cohere

In [None]:
!pip install python-dotenv

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # reading local .env file

In [None]:
import cohere
# Assumes the key is stored in the COHERE_API_KEY environment variable (e.g. via .env)
co = cohere.Client(os.environ["COHERE_API_KEY"])

In [None]:
!pip install weaviate-client

In [None]:
import weaviate
auth_config = weaviate.auth.AuthApiKey(
    api_key = '76320a90-53d8-42bc-b41d-678647c6672e'
)

In [None]:
client = weaviate.Client(
    url = "https://cohere-demo.weaviate.network/",
    auth_client_secret = auth_config,
    additional_headers= {
        # Assumes the key is stored in the COHERE_API_KEY environment variable
        "X-Cohere-Api-Key": os.environ["COHERE_API_KEY"]}
)

client.is_ready()

True

## Part-1: Vector Database for Semantic Search

In [None]:
def dense_retrieval(
    query,
    results_lang= 'en',
    properties = ["text", "title", "url", "views", "lang",
                  "_additional {distance}"],
    num_results= 5):

  nearText = {"concepts": [query]}

  # To filter by language
  where_filter = {
      "path": ["lang"],
      "operator": "Equal",
      "valueString": results_lang
  }

  response = (
      client.query
      .get("Articles", properties)
      .with_near_text(nearText)
      .with_where(where_filter)
      .with_limit(num_results)
      .do()
  )

  result = response['data']['Get']['Articles']

  return result

Here, `nearText` holds the query in the form Weaviate expects for a semantic (near-text) search. The `dense_retrieval` function is used below with queries such as "What is the capital of Canada?"
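
As a rough illustration of what a near-text search does behind the scenes, the sketch below ranks a few documents by the similarity between a query vector and document vectors. The three-dimensional vectors are made up for the example, and cosine similarity is an assumption: the metric actually used is configured in the vector database and may differ.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar, with scores."""
    # Normalise so that the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    order = np.argsort(-sims)  # descending similarity
    return order, sims[order]

# Toy "embeddings" (hypothetical values, not produced by a real model)
doc_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.7, 0.6, 0.1],
])
query_vec = np.array([1.0, 0.2, 0.0])

order, sims = rank_by_cosine(query_vec, doc_vecs)
print(order, sims)
```

In the real pipeline, the vectors come from an embedding model and the database performs the ranking for us.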

In [None]:
# print_result could also be imported from a local utils module:
# from utils import print_result
def print_result(result):
  # Print every item as "key:value" lines, separated by blank lines
  for i, item in enumerate(result):
    print(f'item{i}')
    for key in item.keys():
      print(f"{key}:{item.get(key)}")
      print()
    print()

### Query-1 (Basic Query)

In [None]:
query = "Who wrote Hamlet?"
dense_retrieval_res = dense_retrieval(query)
print_result(dense_retrieval_res)

item0
_additional:{'distance': -154.75647}

lang:en

text:There are many works that have been pointed to as possible sources for Shakespeare's play—from ancient Greek tragedies to Elizabethan plays. The editors of the Arden Shakespeare question the idea of "source hunting", pointing out that it presupposes that authors always require ideas from other works for their own, and suggests that no author can have an original idea or be an originator. When Shakespeare wrote there were many stories about sons avenging the murder of their fathers, and many about clever avenging sons pretending to be foolish in order to outsmart their foes. This would include the story of the ancient Roman, Lucius Junius Brutus, which Shakespeare apparently knew, as well as the story of Amleth, which was preserved in Latin by 13th-century chronicler Saxo Grammaticus in his "Gesta Danorum", and printed in Paris in 1514. The Amleth story was subsequently adapted and then published in French in 1570 by the 16th-cen

### Query-2 (Medium Query)

In [None]:
query2 = "What is the capital of Canada?"
dense_retrieval_res = dense_retrieval(query2)
print_result(dense_retrieval_res)

item0
_additional:{'distance': -150.8129}

lang:en

text:The governor general of the province had designated Kingston as the capital in 1841. However, the major population centres of Toronto and Montreal, as well as the former capital of Lower Canada, Quebec City, all had legislators dissatisfied with Kingston. Anglophone merchants in Quebec were the main group supportive of the Kingston arrangement. In 1842, a vote rejected Kingston as the capital, and study of potential candidates included the then-named Bytown, but that option proved less popular than Toronto or Montreal. In 1843, a report of the Executive Council recommended Montreal as the capital as a more fortifiable location and commercial centre, however, the Governor General refused to execute a move without a parliamentary vote. In 1844, the Queen's acceptance of a parliamentary vote moved the capital to Montreal.

title:Ottawa

url:https://en.wikipedia.org/wiki?curid=22219

views:2000


item1
_additional:{'distance': -150.2

The output above lists passages from Wikipedia pages that are relevant to our query, such as the article on Ottawa.

### Comparing Dense Retrieval with keyword search

In [None]:
# function defining keyword search
def keyword_search(query, client, results_lang = 'en',
                   properties= ["title", "url", "text"], num_results = 3):
  where_filter = {
      "path": ["lang"],
      "operator": "Equal",
      "valueString": results_lang
  }

  response = (
      client.query.get("Articles", properties)
      .with_bm25(query = query)
      .with_where(where_filter)
      .with_limit(num_results)
      .do()
  )

  result = response['data']['Get']['Articles']
  return result

In [None]:
query = "What is the capital of Canada?"
keyword_search_res = keyword_search(query, client)
print_result(keyword_search_res)

item0
text:In his 1990 book, "Continental Divide: the Values and Institutions of the United States and Canada," Seymour Martin Lipset argues that the presence of the monarchy in Canada helps distinguish Canadian identity from American identity. Since at least the 1930s, supporters of the Crown have held the opinion that the Canadian monarch is also one of the rare unified elements of Canadian society, focusing both "the historic consciousness of the nation" and various forms of patriotism and national love "[on] the point around which coheres the nation's sense of a continuing personality". Former Governor General Vincent Massey articulated in 1967 that the monarchy "is part of ourselves. It is linked in a very special way with our national life. It stands for qualities and institutions which mean Canada to every one of us and which for all our differences and all our variety have kept Canada Canadian." But, according to Arthur Bousfield and Gary Toffoli, Canadians were, through the la

Looking at the output, all of the results are related to Canada, but none of them directly answers the query when keyword search is used.<br/>
The comparison shows that, for this query, dense retrieval gives more relevant results than keyword search.
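
To see why keyword search can miss the answer, here is a minimal sketch that scores documents by the number of distinct query words they share. This is a simplified word-overlap score, not BM25, and the two documents are invented for the illustration.

```python
import re

def tokenize(text):
    # Lowercase and keep only word characters
    return re.findall(r"\w+", text.lower())

def overlap_score(query, document):
    """Count the distinct query words that also appear in the document."""
    return len(set(tokenize(query)) & set(tokenize(document)))

query = "What is the capital of Canada?"
docs = [
    # Shares many words with the query but does not answer it
    "Canada is a country where the monarchy is part of what Canada is",
    # Answers the question but shares only common words with the query
    "Ottawa is the seat of the federal government",
]

scores = [overlap_score(query, d) for d in docs]
print(scores)
```

The first document wins on word overlap even though the second one is the useful answer, which is the kind of mismatch dense retrieval is meant to reduce.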

### Query-3 (Complicated Query)

In [None]:
# Using Keyword search
query3 = "Tallest person in history?"
keyword_search_res = keyword_search(query3, client)
print_result(keyword_search_res)

item0
text:The population of Japan peaked at 128,083,960 in 2008. It had decreased by 2,373,960 by December 2020. In 2011, the economy of China became the world's second largest. Japan's economy descended to third largest by nominal GDP. Despite Japan's economic difficulties, this period also saw Japanese popular culture, including video games, anime, and manga, expanding worldwide, especially among young people. In March 2011, the Tokyo Skytree became the tallest tower in the world at , displacing the Canton Tower. It is the second tallest structure in the world after the Burj Khalifa ().

title:History of Japan

url:https://en.wikipedia.org/wiki?curid=25890428


item1
text:Alpinism author Jon Krakauer (1997) wrote in "Into Thin Air" that it would be a bigger challenge to climb the second-highest peak of each continent, known as the Seven Second Summits – a feat that was not accomplished until January 2013. This discussion had previously been published in an article titled "The Second

In [None]:
# Using dense retrieval
query3 = "Tallest person in history?"
dense_retrieval_res = dense_retrieval(query3)
print_result(dense_retrieval_res)

item0
_additional:{'distance': -147.69855}

lang:en

text:Robert Pershing Wadlow (February 22, 1918 July 15, 1940), also known as the Alton Giant and the Giant of Illinois, was a man who was the tallest person in recorded history for whom there is irrefutable evidence. He was born and raised in Alton, Illinois, a small city near St. Louis, Missouri.

title:Robert Wadlow

url:https://en.wikipedia.org/wiki?curid=359117

views:3000


item1
_additional:{'distance': -147.39816}

lang:en

text:Bol came from a family of extraordinarily tall men and women. He said: "My mother was , my father , and my sister is . And my great-grandfather was even taller—." His ethnic group, the Dinka, and the Nilotic people of which they are a part, are among the tallest populations in the world. Bol's hometown, Turalei, is the origin of other exceptionally tall people, including basketball player Ring Ayuel. "I was born in a village, where you cannot measure yourself," Bol reflected. "I learned I was 7 foot 7 

Again, the comparison shows that dense retrieval gives better results for this query.

### Dense Retrieval in multilingual model
Dense retrieval also works well for queries written in a different language from the documents.

In [None]:
# Query in Arabic
query4 = "أطول رجل في التاريخ"
dense_retrieval_res = dense_retrieval(query4)
print_result(dense_retrieval_res)

item0
_additional:{'distance': -147.40509}

lang:en

text:Robert Pershing Wadlow (February 22, 1918 July 15, 1940), also known as the Alton Giant and the Giant of Illinois, was a man who was the tallest person in recorded history for whom there is irrefutable evidence. He was born and raised in Alton, Illinois, a small city near St. Louis, Missouri.

title:Robert Wadlow

url:https://en.wikipedia.org/wiki?curid=359117

views:3000


item1
_additional:{'distance': -147.05692}

lang:en

text:Kösen turned 40 years old on 10 December 2022. He celebrated his birthday a few days early by visiting the Ripley's Believe It or Not! museum in Orlando, Florida, USA and posing next to a life-sized statue of Robert Wadlow, the tallest man ever at 272 cm (8 ft 11.1 in).

title:Sultan Kösen

url:https://en.wikipedia.org/wiki?curid=8445237

views:2000


item2
_additional:{'distance': -146.87894}

lang:en

text:Bol and Gheorghe Mureșan are the two tallest players in the history of the National Basketball 

### Exploration using dense retrieval

In [None]:
query5 = "film about a time travel paradox"
dense_retrieval_res = dense_retrieval(query5)
print_result(dense_retrieval_res)

item0
_additional:{'distance': -151.42374}

lang:en

text:Although the production contracted out various effect houses to try to make the time travelling effects feel like more of a spectacle, they found the resulting work "just completely wrong" tonally and instead focused on a more low-key approach. Curtis has opined "that in the end it turns out to be a kind of anti–time travel time travel movie. It uses all the time travel stuff but without it feeling like it's a science fiction thing particularly or without it feeling that time travel can actually solve your life."

title:About Time (2013 film)

url:https://en.wikipedia.org/wiki?curid=38094431

views:2000


item1
_additional:{'distance': -150.89359}

lang:en

text:The film was directed by Wells's great-grandson Simon Wells, with an even more revised plot that incorporated the ideas of paradoxes and changing the past. The place is changed from Richmond, Surrey, to downtown New York City, where the Time Traveller moves forward in ti

## Part-2: Building Semantic Search from Scratch
Building a vector search database

In [None]:
!pip install annoy

Annoy (Approximate Nearest Neighbors Oh Yeah) is a library, usable from Python, for approximate nearest-neighbor search over vectors.
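
For contrast, an exact nearest-neighbor search compares the query against every stored vector, which becomes slow for large collections. The sketch below is such a brute-force baseline in NumPy, using the Euclidean distance between unit-normalised vectors (the definition Annoy documents for its `angular` metric). The random vectors and the helper name `exact_nearest` are assumptions made for the example.

```python
import numpy as np

def exact_nearest(query_vec, vectors, k=3):
    """Brute-force k nearest neighbours by angular distance."""
    # Normalise everything to unit length
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    # Distance from the query to every stored vector: O(n) work per query
    dists = np.linalg.norm(unit - q, axis=1)
    ids = np.argsort(dists)[:k]
    return ids, dists[ids]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 8))  # stand-in for real embeddings

ids, dists = exact_nearest(vectors[42], vectors, k=3)
print(ids, dists)
```

An approximate index trades a small amount of accuracy for much faster lookups than this exhaustive scan.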

In [None]:
# Importing libraries
import annoy
from annoy import AnnoyIndex
import numpy as np
import pandas as pd
import re

In [None]:
# Getting the text archive - Interstellar Wikipedia page
text = """
Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan.
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine.
Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind.

Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007.
Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar.
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm.
Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles.
Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects.

Interstellar premiered on October 26, 2014, in Los Angeles.
In the United States, it was first released on film stock, expanding to venues using digital projectors.
The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014.
It received acclaim for its performances, direction, screenplay, musical score, visual effects, ambition, themes, and emotional weight.
It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics. Since its premiere, Interstellar gained a cult following,[5] and now is regarded by many sci-fi experts as one of the best science-fiction films of all time.
Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades"""

### Chunking text

In [None]:
# Splitting into a list of sentences
texts = text.split('.')

# Cleaning up to remove leading/trailing spaces & new lines
texts = np.array([t.strip(' \n') for t in texts])

In [None]:
texts

array(['Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan',
       'It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine',
       'Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind',
       'Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007',
       'Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar',
       'Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm',
       'Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles',

With the text split into sentences, many individual sentences no longer carry enough context on their own. For example, "It stars Matthew McConaughey, ..." does not say which film it refers to. If we embed such sentences as they are, the embeddings will not capture what the sentence is really about. We can mitigate this by prepending the title of the Wikipedia page to each sentence.

In [None]:
title = 'Interstellar (film)'

texts = np.array([f"{title} {t}" for t in texts])
print(texts)

['Interstellar (film) Interstellar (film) Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan'
 'Interstellar (film) Interstellar (film) It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine'
 'Interstellar (film) Interstellar (film) Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind'
 'Interstellar (film) Interstellar (film) Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007'
 'Interstellar (film) Interstellar (film) Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar'
 'Interstellar (film) Interstellar (film) Cinematographer Ho

In practice, it is often more practical to split the text into paragraphs.
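
A minimal sketch of paragraph-level chunking that also prepends the page title, as was done for the sentences above. The helper name `chunk_by_paragraph` and the sample text are assumptions made for the illustration.

```python
import re

def chunk_by_paragraph(text, title):
    """Split text on blank lines and prefix every paragraph with the title."""
    paragraphs = re.split(r"\n\s*\n", text)
    chunks = []
    for p in paragraphs:
        # Collapse internal line breaks and extra spaces
        cleaned = " ".join(p.split())
        if cleaned:  # skip empty paragraphs
            chunks.append(f"{title} {cleaned}")
    return chunks

sample = "First sentence.\nSecond sentence.\n\n\nAnother paragraph."
chunks = chunk_by_paragraph(sample, "Some title")
print(chunks)
```

Dropping empty chunks up front avoids sending blank strings to the embedding model.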

In [None]:
# Splitting into a list of paragraphs
texts = text.split('\n\n')

# Cleaning up to remove leading/trailing spaces & new lines
texts = np.array([t.strip(' \n') for t in texts])

In [None]:
texts

array(['Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan.\nIt stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine.\nSet in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind.',
       'Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007.\nCaltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar.\nCinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm.\nPrincipal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles.\nInterstellar uses extensive practical 

### Embedding the texts

In [None]:
# Getting the embeddings
response = co.embed(
    texts = texts.tolist()
).embeddings

In [None]:
embeds = np.array(response)
embeds.shape

(15, 4096)

`embeds.shape` shows that there are 15 chunks (the sentence-level split) and that each one is represented by a single 4096-dimensional vector that captures its meaning and context.

### Creating search index

In [None]:
search_index = AnnoyIndex(embeds.shape[1], 'angular')

# Adding all the vectors to the search index
for i in range(len(embeds)):
  search_index.add_item(i, embeds[i])

search_index.build(10) # 10 trees
search_index.save('test.ann') # saving file

True

After creating the search index, we can embed a query and retrieve its nearest neighbors among the 15 sentences.

In [None]:
# Basic search function
pd.set_option('display.max_colwidth', None)

def search(query):

  # Getting the query's embedding
  query_embed = co.embed(texts= [query]).embeddings

  # Retrieving the nearest 3 neighbors from vector search index
  similar_item_ids = search_index.get_nns_by_vector(query_embed[0], 3,
                                                    include_distances= True)

  # Formatting the results
  results = pd.DataFrame(data = {'texts': texts[similar_item_ids[0]],
                                 'distance': similar_item_ids[1]})

  print(texts[similar_item_ids[0]])

  return results

In [None]:
query6 = "How much did the film make?"
search(query6)

['Interstellar (film) Interstellar (film) The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014'
 'Interstellar (film) Interstellar (film) Interstellar premiered on October 26, 2014, in Los Angeles'
 'Interstellar (film) Interstellar (film) In the United States, it was first released on film stock, expanding to venues using digital projectors']


|   | texts | distance |
|---|---|---|
| 0 | Interstellar (film) Interstellar (film) The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014 | 1.010105 |
| 1 | Interstellar (film) Interstellar (film) Interstellar premiered on October 26, 2014, in Los Angeles | 1.154889 |
| 2 | Interstellar (film) Interstellar (film) In the United States, it was first released on film stock, expanding to venues using digital projectors | 1.168121 |


The output shows the 3 sentences, out of 15, that are closest to the query. The distance measures how close each text is to the query in embedding space; with Annoy's `angular` metric, a smaller distance means a closer match.
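
To make these distances easier to interpret, the sketch below converts them to cosine similarity. It relies on Annoy's documented definition of the `angular` metric as the Euclidean distance between normalised vectors, i.e. sqrt(2 * (1 - cos)). The helper name `angular_to_cosine` is an assumption for the example.

```python
def angular_to_cosine(distance):
    """Convert an Annoy 'angular' distance to cosine similarity.

    angular = sqrt(2 * (1 - cos))  =>  cos = 1 - angular**2 / 2
    """
    return 1.0 - (distance ** 2) / 2.0

# Distances taken from the search output above
for d in [1.010105, 1.154889, 1.168121]:
    print(d, round(angular_to_cosine(d), 3))
```

A distance of 0 corresponds to a cosine similarity of 1 (same direction), and a distance of sqrt(2) corresponds to a cosine similarity of 0.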