Author: Arun Prasad
Built on Sunday 10 November 2024 at Build Club (Sydney, Australia)

License: MIT License Copyright 2024 AI WHISPR PTY. LTD.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Cohere for Wikipedia RAG Project

I've set up a Jupyter Notebook to run and test code for a Retrieval-Augmented Generation (RAG) system on a Wikipedia article. Working through the RAG workflow step by step has given me a solid foundation for building, testing, and refining a RAG system.


Key actions
1. Created a dedicated environment for the project.
2. Installed and configured the necessary libraries.
3. Fetched data from Wikipedia.
4. Divided the text into manageable chunks and generated embeddings for semantic search.
5. Crafted an augmented prompt using the retrieved text chunks and sent it to a language model to generate a detailed response.

By running the RAG workflow below, I am building a vector database and populating it with many different repositories.

1. Installed the Cohere library to connect to Cohere's API for generating embeddings.
2. Imported the library.
3. Set up the API key.


In [1]:
%pip install "cohere<5" --quiet

# Import the Cohere Python library. Set up the API key.

In [2]:
import cohere

In [3]:
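import os

# Read the key from an environment variable instead of hard-coding it in the notebook.
# The variable name COHERE_API_KEY is an assumption; use whatever name you export.
API_KEY = os.environ["COHERE_API_KEY"]
co = cohere.Client(API_KEY)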

# Install the Wikipedia library, import the library, read the Wikipedia page

In [4]:
!pip install wikipedia --quiet

In [5]:
import wikipedia

In [6]:
article = wikipedia.page('Wild Robot')
text = article.content
print(f"The text has roughly {len(text.split())} words.")

The text has roughly 3033 words.


# Create Text Chunks.

In [7]:
%pip install -qU langchain-text-splitters --quiet

Note: you may need to restart the kernel to use updated packages.


In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [9]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)

chunks_ = text_splitter.create_documents([text])
chunks = [c.page_content for c in chunks_]
print(f"The text has been broken down in {len(chunks)} chunks.")

The text has been broken down in 62 chunks.
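
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunks. It is a simplified fixed-width sliding window, not what `RecursiveCharacterTextSplitter` does internally (that splitter prefers to break on separators such as paragraphs and sentences). The function name `sliding_window_chunks` and the sample string are hypothetical, used only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-width windows; neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return windows


# Tiny illustrative input so the overlap is easy to see.
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.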


# Create Embeddings for the text chunks

In [10]:
model="embed-english-v3.0"
response = co.embed(
    texts= chunks,
    model=model,
    input_type="search_document",
    embedding_types=['float']
)
embeddings = response.embeddings.float
print(f"We just computed {len(embeddings)} embeddings.")


We just computed 62 embeddings.


# Install numpy; use it to store embeddings.

In [13]:
!pip install numpy --quiet

In [14]:
import numpy as np
vector_database = {i: np.array(embedding) for i, embedding in enumerate(embeddings)}

# Create an embedding for the query; use cosine similarity for semantic search to retrieve relevant text.

1. Define a sample query to test our RAG model’s retrieval capabilities. Run this code in a new cell in your notebook.

In [22]:
query = "who is the director of the moive The Wild Robot? list out the names of all actors in the moive and their roles"

Embedding the Query

2. To compare the query with our text chunks, create an embedding for the query. Enter the following code in a new cell.

This code generates an embedding for the query, stored in query_embedding. This allows us to compare the query to each chunk based on meaning.

In [23]:
response = co.embed(
    texts=[query],
    model=model,
    input_type="search_query",
    embedding_types=['float']
)
query_embedding = response.embeddings.float[0]
print("query_embedding: ", query_embedding)

query_embedding:  [-0.0496521, -0.038269043, -0.0317688, 4.7326088e-05, -0.022232056, -0.0102005005, 0.020309448, 0.047851562, -0.044433594, 0.035308838, 0.0004208088, 0.03048706, -0.034332275, 5.4240227e-05, 0.015312195, -0.029891968, -0.007320404, 0.085510254, 0.07183838, 0.012046814, 0.02671814, 0.022598267, -0.016479492, -0.035125732, 0.0044670105, -0.0029411316, -0.018966675, -0.022613525, -0.012138367, 0.0072517395, 0.0035533905, 0.029052734, 0.00642395, 0.0129852295, -0.010025024, -0.016571045, -0.008041382, 0.00096797943, 0.02357483, -0.03567505, 0.043273926, -0.026550293, -0.013977051, -0.0065612793, -0.027420044, 0.032165527, 0.0058288574, 0.010322571, 0.04736328, 0.030517578, -0.026306152, -0.023712158, -0.022033691, -0.01423645, -0.04095459, -0.040893555, -0.015213013, -0.07788086, 0.0021572113, 0.04232788, 0.029251099, -0.00843811, 0.026382446, -0.026763916, 0.005252838, -0.007671356, 0.013198853, 0.019134521, 0.026138306, -0.010757446, -0.012420654, 0.024993896, 0.0117874

Performing Semantic Search Using Cosine Similarity

We’ll use cosine similarity to identify the chunks most relevant to the query. The code below calculates a similarity score between the query and every chunk, then selects the 10 chunks that most closely match the query. It prints the scores, the indices, and the content of those chunks, showing the most relevant sections. Run it in a new cell:

In [27]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarities = [cosine_similarity(query_embedding, chunk) for chunk in embeddings]
print("similarity scores: ", similarities)

sorted_indices = np.argsort(similarities)[::-1]
top_indices = sorted_indices[:10]
print("Here are the indices of the top 10 chunks after retrieval: ", top_indices)

top_chunks_after_retrieval = [chunks[i] for i in top_indices]
print("Here are the top 10 chunks after retrieval: ")
for t in top_chunks_after_retrieval:
    print("== " + t)

similarity scores:  [np.float64(0.6219875948073462), np.float64(0.21998720667926924), np.float64(0.3396590041313462), np.float64(0.4555954948148758), np.float64(0.15750122806654596), np.float64(0.2752094846704467), np.float64(0.2938197166203306), np.float64(0.15325545167078092), np.float64(0.20132181667719123), np.float64(0.11234805595958627), np.float64(0.21581867488546055), np.float64(0.1396626949202155), np.float64(0.24388295611978753), np.float64(0.3140320858386818), np.float64(0.12843673126698063), np.float64(0.18912504457755364), np.float64(0.3371557151760913), np.float64(0.3260076069799601), np.float64(0.3049394291600581), np.float64(0.3776809057658108), np.float64(0.33487191177673775), np.float64(0.2111742342232352), np.float64(0.5488856823898015), np.float64(0.33138073834143594), np.float64(0.2635394712506127), np.float64(0.10834247450939043), np.float64(0.25688155097015447), np.float64(0.2525985301944642), np.float64(0.3191863809879187), np.float64(0.3234387662416202), np.flo
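
As a sanity check on the ranking step, here is a small sketch with hand-made 2-D vectors where the expected order is obvious. It uses only the standard library and picks the top results with `heapq.nlargest` instead of a full sort. The names `top_k`, `toy_vectors`, and `toy_query` are hypothetical and stand in for the real embeddings above.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, vectors, k):
    """Return the k best (score, index) pairs, highest score first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(vectors)]
    return heapq.nlargest(k, scored)


# Toy data: vector 0 points the same way as the query, vector 3 the opposite way.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_query = [2.0, 0.0]

ranked = top_k(toy_query, toy_vectors, k=2)
for score, index in ranked:
    print(f"chunk {index}: {score:.3f}")
# chunk 0: 1.000
# chunk 2: 0.707
```

Cosine similarity ignores vector length, which is why the query `[2.0, 0.0]` scores 1.0 against `[1.0, 0.0]` even though the two differ in magnitude.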

# Create Augmented Prompt for LLM using Query + top 3 ranked search results

Finally, we’ll create an augmented prompt to send to the language model. Enter this code in a new cell.

This block prepares a prompt combining the query with relevant chunks. The "preamble" sets the model's task and answer style, and "documents" holds the top three retrieved chunks. When passed to co.chat(), the model uses these to generate a detailed response, displayed as the final answer.


In [28]:
preamble = """
## Task & Context
You help people answer their questions and other requests interactively. You will be asked a very wide array of requests on all kinds of topics. You will be equipped with a wide range of search engines or similar tools to help you, which you use to research your answer. You should focus on serving the user's needs as best you can, which will be wide-ranging.

## Style Guide
Unless the user asks for a different style of answer, you should answer in full sentences, using proper grammar and spelling.
"""

In [31]:
documents = [
    {"title": "chunk 0", "snippet": top_chunks_after_retrieval[0]},
    {"title": "chunk 1", "snippet": top_chunks_after_retrieval[1]},
    {"title": "chunk 2", "snippet": top_chunks_after_retrieval[2]},
]

response = co.chat(
  message=query,
  documents=documents,
  preamble=preamble,
  model="command-r",
  temperature=0.3
)

print("Final answer:")
print(response.text)


Final answer:
The Wild Robot is an upcoming American animated science fiction film directed and written by Chris Sanders. Based on the 2016 novel of the same name by Peter Brown, the movie has a stellar cast of voices, including:
- Lupita Nyong'o as Roz
- Pedro Pascal
- Kit Connor
- Bill Nighy
- Stephanie Hsu
- Mark Hamill
- Catherine O'Hara
- Matt Berry
- Ving Rhames
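
The `documents` list above is written out by hand for three chunks. If you want to experiment with a different number of retrieved chunks, a small helper can build the same structure for any `k`. This is a minimal sketch: `build_documents` is a hypothetical helper, not part of the Cohere library, and the toy passages stand in for `top_chunks_after_retrieval`.

```python
def build_documents(ranked_chunks: list[str], k: int = 3) -> list[dict[str, str]]:
    """Turn the first k ranked chunks into title/snippet dicts, in rank order."""
    return [
        {"title": f"chunk {rank}", "snippet": chunk}
        for rank, chunk in enumerate(ranked_chunks[:k])
    ]


# Placeholder passages; in the notebook you would pass top_chunks_after_retrieval.
toy_ranked = ["first passage", "second passage", "third passage", "fourth passage"]
toy_documents = build_documents(toy_ranked, k=3)
print(toy_documents)
```

With `k=3` this yields the same shape as the hand-written list, so the result can be passed as the `documents` argument in the cell above.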


In [32]:
!pip install graphviz


Collecting graphviz
  Downloading graphviz-0.20.3-py3-none-any.whl.metadata (12 kB)
Downloading graphviz-0.20.3-py3-none-any.whl (47 kB)
Installing collected packages: graphviz
Successfully installed graphviz-0.20.3
Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
os.environ["PATH"] += os.pathsep + "/usr/local/bin"  # Adjust if needed