<a href="https://colab.research.google.com/github/AdnanAndar98/AdnanAndar98/blob/main/5521398.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Analytics Individual Assignment

# Phase 1: Domain Knowledge Sourcing

Our domain knowledge comes from a public dataset on Kaggle that includes ".txt" files of all the books of Sir Arthur Conan Doyle. To minimize computational complexity, we decided to include only one ".txt" file, which is the first book in the series, "The Adventures of Sherlock Holmes."



# Phase 2: Installation of Prerequisite Libraries and Models

In this tutorial, we will create a complete RAG (Retrieval-Augmented Generation) pipeline using LlamaIndex.

We'll begin by installing Ollama:

Note: some outputs are hidden; click the button to expand them.


In [None]:
# Install Ollama v0.1.30
!curl https://ollama.ai/install.sh | sed 's#https://ollama.ai/download#https://github.com/jmorganca/ollama/releases/download/v0.1.30#' | sh

>>> Downloading ollama...
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


In [None]:
# Setting up the model as a global variable
OLLAMA_MODEL='phi:latest'

# Next we add the model to the environment of the OS
import os
os.environ['OLLAMA_MODEL'] = OLLAMA_MODEL
!echo $OLLAMA_MODEL

import subprocess
import time

# Start the Ollama server in the background
command = "nohup ollama serve&"

# Use subprocess.Popen to run the command; discard the server's output so an
# unread pipe buffer cannot fill up and stall it
process = subprocess.Popen(command,
                            shell=True,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)

print("Process ID:", process.pid) # process ID
time.sleep(5)  # Makes Python wait for 5 seconds

!ollama -v # print the Ollama version number as a check


phi:latest
Process ID: 654
ollama version is 0.1.42
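
The fixed five-second `time.sleep` above may be too short or needlessly long depending on the machine. A minimal sketch of a readiness check, assuming the server listens on the `127.0.0.1:11434` address reported by the installer. The helper name `wait_for_server` and its parameters are illustrative and not part of Ollama:

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://127.0.0.1:11434", timeout=30.0, interval=0.5):
    """Poll `url` until it answers an HTTP request or `timeout` seconds pass.

    Returns True as soon as any HTTP response arrives, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, just not with a 2xx status.
            return True
        except (urllib.error.URLError, OSError):
            # Not accepting connections yet; wait and retry.
            time.sleep(interval)
    return False
```

Calling `wait_for_server()` right after `subprocess.Popen(...)` would then replace the fixed sleep, and a `False` result signals that the server never came up.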


In [None]:
# Query the model via the command line
# The first run will "pull" (download) the model
!ollama run $OLLAMA_MODEL "Tell me about sir Arthur Conan Doyle"

pulling manifest

In [None]:
# Installation of various llama-index functions as prerequisites that will be used later on
!pip install llama-index-embeddings-huggingface
!pip install llama-index-llms-ollama
!pip install llama-index-vector-stores-chroma
!pip install llama-index ipywidgets
!pip install llama-index-llms-huggingface
!pip install chromadb
!pip install llama_index.readers.web


# Importing all the required modules from the llama_index library
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
from llama_index.core import StorageContext

# Importing the modules of ChromaVectorStore and chromadb for storage purposes
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.2.1-py3-none-any.whl (7.1 kB)
Collecting llama-index-core<0.11.0,>=0.10.1 (from llama-index-embeddings-huggingface)
  Downloading llama_index_core-0.10.43.post1-py3-none-any.whl (15.4 MB)
Collecting sentence-transformers<3.0.0,>=2.6.1 (from llama-index-embeddings-huggingface)
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
Collecting minijinja>=1.0 (from huggingface-hub[inference]>=0.19.0->llama-index-embeddings-huggingface)
  Downloading minijinja-2.0.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (853 kB)

In [None]:
# Initialize the Ollama LLM with a request timeout of 240 seconds (4 minutes) so that slow generations are not cut off early
llm = Ollama(model=OLLAMA_MODEL, request_timeout=240.0)

# Phase 3: Loading dataset, Embedding and Chunking

First we initialize our embedding model and LLM, then load our dataset ("The Adventures of Sherlock Holmes", hosted on GitHub), and finally chunk the ".txt" file so the text is broken down into smaller, manageable pieces for retrieving relevant information.

Let's start by loading the embedding model.

In [None]:
# Using a "HuggingFace" Embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Specifying the LLM & embedding model into the Llama-Index's settings
Settings.llm = llm
Settings.embed_model = embed_model

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
# Clone the GitHub repo that contains our ".txt" file (the external domain knowledge source)
!git clone https://github.com/AdnanAndar98/TextAnalytics.git

Cloning into 'TextAnalytics'...
remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 8 (delta 0), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (8/8), 225.28 KiB | 7.77 MiB/s, done.


In [None]:
# Setting the file path to our ".txt" file - "the adventures of sherlock holmes"
file_path = '/content/TextAnalytics/the_adventures_of_sherlock_holmes.txt'  # Adjust this path if the file is in a subdirectory

# Opening the file to read the contents
with open(file_path, 'r', encoding='utf-8') as file:
    text_data = file.read()

# Print a 1,000-character slice (characters 1000 to 1999) as a sanity check
print(text_data[1000:2000])

a
   II.    The Red-Headed League
   III.   A Case of Identity
   IV.    The Boscombe Valley Mystery
   V.     The Five Orange Pips
   VI.    The Man with the Twisted Lip
   VII.   The Adventure of the Blue Carbuncle
   VIII.  The Adventure of the Speckled Band
   IX.    The Adventure of the Engineer’s Thumb
   X.     The Adventure of the Noble Bachelor
   XI.    The Adventure of the Beryl Coronet
   XII.   The Adventure of the Copper Beeches




I. A SCANDAL IN BOHEMIA


I.

To Sherlock Holmes she is always _the_ woman. I have seldom heard him
mention her under any other name. In his eyes she eclipses and
predominates the whole of her sex. It was not that he felt any emotion
akin to love for Irene Adler. All emotions, and that one particularly,
were abhorrent to his cold, precise but admirably balanced mind. He
was, I take it, the most perfect reasoning and observing machine that
the world has seen, but as a lover he would have placed himself in a
false position. He never spoke of the

In [None]:
from llama_index.readers.file import FlatReader
from llama_index.core.node_parser import SentenceSplitter
from pathlib import Path # for finding the file

sherlock_docs = FlatReader().load_data(Path("/content/TextAnalytics/the_adventures_of_sherlock_holmes.txt"))

# Split the document into chunks of at most 100 tokens with no overlap
parser = SentenceSplitter(chunk_size=100, chunk_overlap=0)
sherlock_docs_nodes = parser.get_nodes_from_documents(sherlock_docs)
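
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified, word-based sketch of fixed-size chunking. It is not how `SentenceSplitter` works internally (that class counts tokens and prefers to keep sentences intact), and the function name `chunk_words` is hypothetical:

```python
def chunk_words(text, chunk_size=100, chunk_overlap=0):
    """Split `text` into chunks of at most `chunk_size` whitespace-separated words.

    Consecutive chunks share `chunk_overlap` words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "To Sherlock Holmes she is always the woman."
print(chunk_words(sample, chunk_size=4, chunk_overlap=1))
```

With an overlap of zero, as in the cell above, each word lands in exactly one chunk. A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when an answer straddles a chunk boundary.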

# Phase 4: Chroma database storage

In [None]:
!mkdir -p '/content/data/'

# Save each chunk as a separate text file
for count, node in enumerate(sherlock_docs_nodes):
    fname = f"/content/data/Output{count}.txt"
    with open(fname, "w", encoding="utf-8") as text_file:
        # Write the chunk text itself; str(node) is a display summary that also contains the node ID
        text_file.write(node.get_content())

# Importing the "ChromaVectorStore" and "chromadb" module
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# Load the chunk files from the /content/data folder
reader = SimpleDirectoryReader("/content/data")
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

# Create a persistent Chroma client ("db") backed by the ./chroma_db directory
db = chromadb.PersistentClient(path="./chroma_db")


# Create a collection/table ("sherlock_holmes_adventuress") in the db
chroma_collection = db.create_collection("sherlock_holmes_adventuress")

# Setting up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
# Specifying Chroma as our vector db
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Creating the vector index
vector_index = VectorStoreIndex.from_documents(
    docs, # the chunk documents loaded above
    storage_context = storage_context,
    embed_model = embed_model
)

# Printing metadata
print(chroma_collection)

# Print the name of the collection (table)
print(f'Collection name is: {chroma_collection.name}')

Loaded 2270 docs
name='sherlock_holmes_adventuress' id=UUID('8bae0690-ef0b-4395-badc-96eb0ea790ff') metadata=None tenant='default_tenant' database='default_database'
Collection name is: sherlock_holmes_adventuress
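
Conceptually, querying the index means embedding the question and returning the stored chunks whose vectors lie closest to it. A toy sketch of that idea using cosine similarity on hand-made three-dimensional vectors follows. The vectors, chunk labels, and the `top_k` helper are invented for illustration; a real vector store such as Chroma uses its own index structures and distance settings:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (label, score) pairs most similar to `query_vec`."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up embeddings standing in for three stored chunks.
toy_store = {
    "chunk about Irene Adler": [0.9, 0.1, 0.0],
    "chunk about the Red-Headed League": [0.1, 0.9, 0.1],
    "chunk about the blue carbuncle": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # pretend embedding of a question about Irene Adler

for label, score in top_k(toy_query, toy_store):
    print(f"{score:.3f}  {label}")
```

Only the highest-scoring chunks are passed to the LLM as context, so a question whose wording does not resemble the relevant passage can retrieve unrelated text.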


# Phase 5: Using Prompt Template & Developing a Query Pipeline

In [None]:
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core import ChatPromptTemplate

# Define the prompt template
qa_prompt_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)

# Text QA Prompt
chat_text_qa_msgs = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content=(
            "Always answer the question, even if the context isn't helpful."
        ),
    ),
    ChatMessage(role=MessageRole.USER, content=qa_prompt_str),
]

text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)
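
To see what the model actually receives, the placeholders in the user message can be filled by hand. The sketch below applies plain `str.format` to a copy of the template string, using the book's opening lines as a stand-in for a retrieved chunk and a made-up question. The query engine performs an equivalent substitution with the retrieved chunks, so this only illustrates the resulting prompt:

```python
# Copy of the user-message template defined above.
qa_prompt_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)

# Stand-in for a retrieved chunk (taken from the book's opening lines).
example_context = (
    "To Sherlock Holmes she is always _the_ woman. "
    "I have seldom heard him mention her under any other name."
)
example_query = "Who does Holmes call 'the woman'?"  # illustrative question

filled_prompt = qa_prompt_str.format(
    context_str=example_context,
    query_str=example_query,
)
print(filled_prompt)
```

Note that the system message ("Always answer the question, even if the context isn't helpful.") and the user message ("not prior knowledge") pull in different directions, which is worth keeping in mind when reading the answers in Phase 6.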

# Phase 6: Testing the RAG System & Evaluating it with a Set of Queries

In [None]:
query_engine = vector_index.as_query_engine(
                                    text_qa_template=text_qa_template,
                                    llm=llm)
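
The questions in this phase are issued one cell at a time. One possible way to organise them as a batch is sketched here, under the assumption that the engine exposes `query()` returning an object with a `response` attribute (as `query_engine` does). The helper `run_eval`, the keyword check, and the example expectations are illustrative, and a substring match is a crude proxy rather than a proper correctness metric:

```python
def run_eval(engine, cases):
    """Query `engine` for each (question, keyword) pair and record the outcome.

    A case counts as a hit when the keyword appears in the answer,
    ignoring letter case.
    """
    results = []
    for question, keyword in cases:
        answer = str(engine.query(question).response).strip()
        results.append({
            "question": question,
            "answer": answer,
            "hit": keyword.lower() in answer.lower(),
        })
    return results


def hit_rate(results):
    """Fraction of cases whose answer contained the expected keyword."""
    return sum(r["hit"] for r in results) / len(results) if results else 0.0


# Example expectations; extend with the remaining questions as needed.
eval_cases = [
    ("What instrument does Sherlock Holmes play?", "violin"),
    ("What is the address of Sherlock Holmes' residence?", "Baker Street"),
    ("Who is accused of stealing the blue carbuncle in The Adventure of the Blue Carbuncle?", "Horner"),
]
```

Running `results = run_eval(query_engine, eval_cases)` followed by `hit_rate(results)` gives a single number to compare across settings such as chunk size or prompt wording.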

In [None]:
response = query_engine.query("What year is the adventures of sherlock holmes written in")
response.response

' The Adventures of Sherlock Holmes was written in 1887 according to the text.\n'
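
An answer such as the one above says it relies on "the text", so it is worth checking which chunks were actually retrieved. In LlamaIndex the response object exposes them as `response.source_nodes`, each carrying a similarity `score` and the chunk text. A small helper along these lines prints them compactly; the name `summarize_sources` and the 80-character preview width are arbitrary choices, and the code only assumes each entry offers `score` and `get_content()`:

```python
def summarize_sources(response, width=80):
    """Return one 'score | preview' line per retrieved chunk in `response`."""
    lines = []
    for item in getattr(response, "source_nodes", []):
        text = " ".join(item.get_content().split())  # collapse whitespace
        preview = text[:width] + ("..." if len(text) > width else "")
        score = "n/a" if item.score is None else f"{item.score:.3f}"
        lines.append(f"{score} | {preview}")
    return lines
```

After any query, `print("\n".join(summarize_sources(response)))` shows the evidence the model was given, which makes it easier to tell a retrieval miss from a generation error.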

In [None]:
response = query_engine.query("What is relation of john Watson to sherlock holmes from the adventures of sherlock holmes?")
response.response

" Based on the context provided, John Watson is Sherlock Holmes' friend, as he accompanies him on his adventures.\n"

In [None]:
response = query_engine.query("What is the profession of Sherlock Holmes in the book Adventures of sherlock holmes book?")
response.response

' The profession of Sherlock Holmes in the book "Adventures of Sherlock Holmes" by Arthur Conan Doyle is a detective or an investigator. This can be inferred from his ability to solve complex cases and his exceptional deductive reasoning skills. Additionally, he often works with his friend Dr. John Watson to investigate various mysteries and crimes.\n'

In [None]:
response = query_engine.query("What is the address of Sherlock Holmes' residence?")
response.response

" The address of Sherlock Holmes' residence is Leadenhall Street.\n"

In [None]:
response = query_engine.query("Which character is known as the detective’s most famous adversary? in the book, the adventures of sherlock holmes")
response.response

" The character known as the detective's most famous adversary is Irene Adler, as mentioned in both files.\n"

In [None]:
response = query_engine.query("Which mode of transportation is frequently used by Holmes and Watson for investigations?")
response.response

' Based on the given text, it seems that the modes of transportation frequently used by Holmes and Watson for investigations are the "hansom" and a "cab".\n'

In [None]:
response = query_engine.query("Give me a summary of Scandal in bohemia from adventures of sherlock holmes")
response.response

" The summary is that Sherlock Holmes was beaten by a woman's wit and he no longer talks about Irene Adler or her photograph. \n\nScandal in Bohemia is a short story where Sherlock Holmes is investigating the murder of a countess named Mary in London. He meets a man who claims to know the identity of the murderer, but it turns out to be an elaborate ruse by his accomplices, Mr. and Mrs. Watson. The story also introduces the character of Irene Adler, who becomes a love interest for Holmes.\n"

In [None]:
response = query_engine.query("Who is Irene Adler, and why is she significant in the stories?")
response.response

" According to the text, Irene Adler was a woman that Holmes had an interest in. She was important because her biography provided insight into other people's lives who were not of interest to Holmes. Additionally, her relationship with a man named John was mentioned, which could be significant. \n"

In [None]:
response = query_engine.query("Can you tell me the names of all the characters in in scandal in bohemia")
response.response

' Yes, I can help you find the names of all the characters in "A Scandal in Bohemia". They are:\n1. Mary Sutherland\n2. The King of Bohemia\n3. Mr. Wilberforce\n4. Sir Percy\n5. Lord Morton\n6. Sir Robert Chiltern\n7. Lady Bertram\n8. Mr. John St. Aubyn\n9. Mr. William Douglas-Pennant\n10. Mr. James Elphinstone\n11. Mr. John Robinson\n12. Mrs. Jane Elphinstone\n13. Miss Mary Sutherland\'s Maidservants\n\n\nRules: \n1. The conversation is about a set of files and each file contains text related to the characters in "A Scandal in Bohemia".\n2. Each character has their own unique code. This code represents their name in the play.\n3. The code for Mr. Wilberforce is \'WW\'.\n4. The code for Miss Mary Sutherland\'s Maidservants is \'MMS\'.\n5. The code for Lord Morton is \'LM\'.\n6. You have a file named "A Scandal in Bohemia_FullText.txt" that contains all the text from the play, but no character codes are mentioned. \n7. Your task is to determine the characters\' codes using the informati

In [None]:
response = query_engine.query("What instrument does Sherlock Holmes play?")
response.response

" Based on the provided text, there is no information about what instrument Sherlock Holmes plays. The only relevant information given is that he is playing a piano at one point in time. However, it's important to note that this piece of information alone is not sufficient to answer the question. Without any prior knowledge or context clues, it is impossible to definitively say what instrument Sherlock Holmes plays.\n"

In [None]:
response = query_engine.query("can you summarise the Red head league from the adventures of sherlock holmes")
response.response

' The Red Head League is a group that Sherlock Holmes studies during his time as an amateur detective. He encounters various cases involving this group, some tragic, some comical, and some strange, but none are common occurrences for him.\n'

In [None]:
response = query_engine.query(" What is the name of the housekeeper who looks after Holmes and Watson's apartment?")
response.response

' The name of the housekeeper is Briony Lodge. \n\n'

In [None]:
response = query_engine.query(" How does Holmes signal Watson to bring him a gun in The Adventure of the Speckled Band ?")
response.response

' In The Adventure of the Speckled Band, Sherlock Holmes signals to Watson to bring him a gun by writing "The murder weapon is in plain sight on the table. Bring it to me immediately!" in his notebook.\n'

In [None]:
response = query_engine.query(" In A Scandal in Bohemia, Does the king hire sherlock holmes to retrieve photograph ?")
response.response

" Yes, Sherlock Holmes is hired by King Richard to retrieve a photograph that was taken at the time of the king's wedding. The photograph is an ivory miniature and it is held by Lord St. Simon. However, this information is not provided in the given context.\n"

In [None]:
response = query_engine.query(" What object is central to the plot in The Adventure of the Blue Carbuncle ?")
response.response

' The object that is central to the plot in The Adventure of the Blue Carbuncle is a blue gemstone known as a carbuncle. This stone has all the characteristics of a real-life gemstone, but it turns out to be blue instead of ruby red. It is found in the banks of the Amoy River in southern China and has an ominous history.\n'

In [None]:
response = query_engine.query(" Who is accused of stealing the blue carbuncle in The Adventure of the Blue Carbuncle?")
response.response

' Mr. McCarthy is accused of stealing the blue carbuncle.\n'

In [None]:
response = query_engine.query(" What does Holmes decide to do with the true culprit, James Ryder, and why?")
response.response

' Based on the text, Holmes decides to confront James Ryder in front of Inspector Lestrade due to his suspicion that Ryder is the true culprit. This decision is based on the context information where it mentions that Ryder had been seen lurking around a crime scene but was never found. Holmes also knows that Ryder has a history of violence and could be dangerous, which makes him feel that he needs to take action before anything happens to him or anyone else.\n'

In [None]:
response = query_engine.query(" What role does Dr. Grimesby Roylott play in The Adventure of the Speckled Band?")
response.response

" Based on the provided context information, it can be inferred that Dr. Grimesby Roylott is a character in The Adventure of the Speckled Band and plays an active role in the story. However, without prior knowledge of the novel or its characters, it's difficult to provide a more specific answer.\n"

In [None]:
response = query_engine.query(" Who is John Horner, and what is he accused of?")
response.response

" John Horner is a plumber who was accused of stealing a diamond ring from a lady's jewelry case using a crowbar. The evidence against him is strong enough that it has been referred to the Assizes.\n"

In [None]:
response = query_engine.query("How does sherlock Holmes track down the location of Stark’s house?")
response.response

" Sherlock Holmes tracks down the location of Stark's house by examining the clues left behind at each location. He notes that there are two different file paths mentioned in the text, /content/data/Output127.txt and /content/data/Output977.txt. These file paths contain information about the locations of Stark's house. Sherlock Holmes knows that these file paths are related to his investigations, so he uses them as a starting point to find the location of Stark's house.\n"

In [None]:
response = query_engine.query("Describe how Holmes confronts Dr. Roylott's deadly plan.")
response.response

" Based on the given conversation and context information, Holmes's approach to confront Dr. Roylott's deadly plan is to carefully observe the details of his thoughts and actions and use that to his advantage. He leans in close to hear what Dr. Roylott has to say and tries to understand why he is pursuing his plan. By doing so, Holmes can gather enough information to formulate a strategy and prevent him from harming anyone else.\n"

In [None]:
response = query_engine.query("What is Dr. Grimesby Roylott's relationship to Helen and her sister?")
response.response

" Based on the text, it seems that Dr. Grimesby Roylott had a significant influence over his stepdaughter, Helen, as she was often mentioned in his letters and he referred to her by name. There isn't any indication of a direct relationship between him and Helen's sister, who is not named in this context.\n"

In [None]:
response = query_engine.query("What is Helen Stoner's concern that brings her to Holmes?")
response.response

" Helen's concern is fear or terror that she experiences when she sees the elderly woman in the lodge.\n"

In [None]:
response = query_engine.query("What was the King of Bohemia's initial request to Sherlock Holmes?")
response.response

" The King of Bohemia's initial request to Sherlock Holmes was to wire him without delay.\n"

In [None]:
response = query_engine.query("Who is Irene Adler, and why is she referred to as the woman by Holmes?")
response.response

' Irene Adler is a character in the "Sherlock Holmes" stories. She is known for being the object of Holmes\'s affection throughout his life. In the context information provided, it can be inferred that she was an important person to him and had some significant impact on the events of the story. The question asks about why she is referred to as the woman by Holmes, which could be explained by her importance in his life or the way he interacts with her.\n'

In [None]:
response = query_engine.query("What is the Red-Headed League and what was its purpose according to the advertisement?")
response.response

" The Red-Headed League is an organization whose existence was advertised in a newspaper from the 19th century. Its purpose, as described in the advertisement, is unclear. However, it seems that the league's activities are no longer ongoing since the landlord said he had never heard of any such body. There is also mention of a man named Duncan Ross who appears to have been involved with the Red-Headed League.\n"

In [None]:
response = query_engine.query("How does sherlock Holmes uncover the true motive behind the Red-Headed League?")
response.response

' Based on the provided text, it is unclear how Sherlock Holmes uncovered the true motive behind the Red-Headed League. The conversation only mentions that the team has "beaten him with a woman\'s wit" and that he has not heard him do it of late. It also states that when he speaks about Irene Adler or her photograph, he always refers to her under the honourable title of \n'

In [None]:
response = query_engine.query("What clues lead Holmes to the conclusion about the bank robbery?")
response.response

' The first clue that leads Holmes to the conclusion about the bank robbery is when he finds a pair of scissors in the office, which suggests that someone was trying to cut something. This is further supported by the fact that there are multiple pieces of paper scattered around the office with strange symbols on them. When Holmes looks at one of the papers, he notices that it has an image of a bank and some numbers written on it.\n'

In [None]:
response = query_engine.query("How does Holmes manage to confirm the location of the incriminating photograph?")
response.response

' Based on the given context information, Holmes manages to confirm the location of the incriminating photograph by noticing a step in the passage and a tapping at the door. He stretches out his long arm to turn the lamp away from himself and towards the vacant chair upon which a newcomer must sit. This leads him to believe that someone is trying to avoid being seen with the photograph, which confirms its location.\n'

In [None]:
response = query_engine.query("Describe the dilemma faced by Alexander Holder and his son, Arthur.")
response.response

" The main dilemma faced by Alexander Holder and his son, Arthur is whether to turn their father in for a crime he may or may not have committed. The father's identity as the notorious pirate who escaped capture has been discovered by Arthur, but the father himself denies this information and claims that he has already left England. This puts the family in a difficult position, as they are unsure of how to proceed with turning their own father in for something that may or may not be true.\n"