# Retrieval-Augmented Generation

## Package Installation and Import

In [66]:
# Install Ollama v0.1.30
!curl https://ollama.ai/install.sh | sed 's#https://ollama.ai/download#https://github.com/jmorganca/ollama/releases/download/v0.1.30#' | sh

>>> Downloading ollama...
>>> Installing ollama to /usr/local/bin...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


In [67]:
%%capture
# Set the model name as a global variable
OLLAMA_MODEL='phi:latest'

# Add the model to the environment of the operating system
import os
os.environ['OLLAMA_MODEL'] = OLLAMA_MODEL
!echo $OLLAMA_MODEL # print the global variable to check it saved

import subprocess
import time

# Start the Ollama server in the background
# ("nohup" keeps it alive after the shell exits; "&" detaches it)
command = "nohup ollama serve &"

# Launch the server; discard its output so an unread pipe cannot fill up and block it
process = subprocess.Popen(command,
                            shell=True,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)

time.sleep(5)  # Give the server a few seconds to start

# Install prerequisites
!pip install llama-index-embeddings-huggingface
!pip install llama-index-llms-ollama
!pip install llama-index ipywidgets
!pip install llama-index-llms-huggingface
!pip install llama-index-readers-web
!pip install llama-index-vector-stores-chroma
!pip install chromadb

# Import required modules from the llama_index library
from llama_index.core import VectorStoreIndex, SummaryIndex, SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
from llama_index.core import StorageContext

# Import ChromaVectorStore and chromadb module
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# Use the global variable (OLLAMA_MODEL) as our LLM
# Allow up to 8 minutes per request, since generation is slow on CPU-only machines
llm = Ollama(model=OLLAMA_MODEL, request_timeout=480.0)
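A fixed `time.sleep(5)` assumes the server is ready after five seconds, which may not hold on a slow machine. The following is a minimal sketch of polling instead, assuming the server answers plain HTTP GET requests at the address printed by the installer (`127.0.0.1:11434`). The helper names `ollama_is_up` and `wait_until_ready` are hypothetical and not part of any library.

```python
import time
import urllib.error
import urllib.request


def ollama_is_up(url="http://127.0.0.1:11434", timeout=2.0):
    """Return True if an HTTP GET to the server succeeds (assumed health check)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or reset: the server is not ready yet
        return False


def wait_until_ready(probe, max_wait=30.0, interval=0.5,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() repeatedly until it returns True or max_wait seconds pass."""
    deadline = clock() + max_wait
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Intended use in the notebook, after launching "ollama serve":
# if not wait_until_ready(ollama_is_up):
#     raise RuntimeError("Ollama server did not start in time")
```

Passing the probe in as a callable keeps the waiting logic independent of the network, so it can be exercised with a stand-in function.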

In [3]:
# Query the model via the command line
# The first run will "pull" (download) the model

# Test question 1: general question

!ollama run $OLLAMA_MODEL "Give me a comprehensive introduction of the shipping company Yellow Corp."

pulling manifest
pulling 04778965089b...   0% ▕▏    0 B/1.6 GB

In [4]:
# Test question 2: specific question

!ollama run $OLLAMA_MODEL "Who were the victim and perpetrator in the murder-suicide incident in Little Egg Harbor, New Jersey?"

[?25l⠙ [?25h[?25l[?25l[2K[1G[?25h[2K[1G[?25h As an[?25l[?25h AI[?25l[?25h language[?25l[?25h model[?25l[?25h,[?25l[?25h I[?25l[?25h do[?25l[?25h not[?25l[?25h have[?25l[?25h access[?25l[?25h to[?25l[?25h up[?25l[?25h-[?25l[?25hto[?25l[?25h-[?25l[?25hdate[?25l[?25h news[?25l[?25h or[?25l[?25h events[?25l[?25h.[?25l[?25h However[?25l[?25h,[?25l[?25h as[?25l[?25h of[?25l[?25h September[?25l[?25h 2021[?25l[?25h,[?25l[?25h there[?25l[?25h is[?25l[?25h no[?25l[?25h record[?25l[?25h of[?25l[?25h a[?25l[?25h recent[?25l[?25h murder[?25l[?25h-[?25l[?25hsu[?25l[?25hicide[?25l[?25h incident[?25l[?25h that[?25l[?25h occurred[?25l[?25h in[?25l[?25h Little[?25l[?25h Egg[?25l[?25h Harbor[?25l[?25h,[?25l[?25h New[?25l[?25h Jersey[?25l[?25h.[?25l[?25h It[?25l[?25h's[?25l[?25h always[?25l[?25h important[?25l[?25h to[?25l[?25h check[?25l[?25h reliable[?25l[?25h sources[?25l[?25h for

In [74]:
# Test question 3: complex question
!ollama run $OLLAMA_MODEL "Why 911 calls for severe allergic reactions nearly doubled in summer? What measures can be taken to prevent serious allergic reaction?"

[?25l⠙ [?25h[?25l[2K[1G⠹ [?25h[?25l[2K[1G⠸ [?25h[?25l[2K[1G⠼ [?25h[?25l[2K[1G⠴ [?25h[?25l[2K[1G⠴ [?25h[?25l[2K[1G⠧ [?25h[?25l[2K[1G⠇ [?25h[?25l[2K[1G⠏ [?25h[?25l[2K[1G⠋ [?25h[?25l[2K[1G⠙ [?25h[?25l[2K[1G⠹ [?25h[?25l[2K[1G⠸ [?25h[?25l[2K[1G⠼ [?25h[?25l[2K[1G⠴ [?25h[?25l[2K[1G⠦ [?25h[?25l[2K[1G⠧ [?25h[?25l[2K[1G⠇ [?25h[?25l[2K[1G⠏ [?25h[?25l[2K[1G⠋ [?25h[?25l[2K[1G⠙ [?25h[?25l[2K[1G⠹ [?25h[?25l[2K[1G⠸ [?25h[?25l[2K[1G⠼ [?25h[?25l[2K[1G⠴ [?25h[?25l[2K[1G⠦ [?25h[?25l[2K[1G⠧ [?25h[?25l[2K[1G⠇ [?25h[?25l[2K[1G⠏ [?25h[?25l[2K[1G⠋ [?25h[?25l[2K[1G⠙ [?25h[?25l[?25l[2K[1G[?25h[2K[1G[?25h I[?25l[?25h do[?25l[?25h not[?25l[?25h have[?25l[?25h access[?25l[?25h to[?25l[?25h recent[?25l[?25h data[?25l[?25h or[?25l[?25h statistics[?25l[?25h regarding[?25l[?25h the[?25l[?25h increase[?25l[?25h of[?25l[?25h 911[?25l[?25h calls[?25l[?25h related

In [71]:
# Test question 4: question with answer not given in the input context
!ollama run $OLLAMA_MODEL "Who is Emma Stone?"

[?25l⠙ [?25h[?25l[2K[1G⠹ [?25h[?25l[2K[1G⠹ [?25h[?25l[2K[1G⠼ [?25h[?25l[2K[1G⠴ [?25h[?25l[2K[1G⠦ [?25h[?25l[2K[1G⠦ [?25h[?25l[2K[1G⠇ [?25h[?25l[2K[1G⠏ [?25h[?25l[2K[1G⠋ [?25h[?25l[2K[1G⠙ [?25h[?25l[2K[1G⠹ [?25h[?25l[2K[1G⠸ [?25h[?25l[2K[1G⠼ [?25h[?25l[2K[1G⠴ [?25h[?25l[2K[1G⠦ [?25h[?25l[2K[1G⠧ [?25h[?25l[2K[1G⠇ [?25h[?25l[2K[1G⠏ [?25h[?25l[2K[1G⠋ [?25h[?25l[2K[1G⠙ [?25h[?25l[2K[1G⠹ [?25h[?25l[2K[1G⠸ [?25h[?25l[2K[1G⠼ [?25h[?25l[2K[1G⠴ [?25h[?25l[2K[1G⠦ [?25h[?25l[2K[1G⠧ [?25h[?25l[2K[1G⠇ [?25h[?25l[2K[1G⠏ [?25h[?25l[2K[1G⠏ [?25h[?25l[2K[1G⠙ [?25h[?25l[2K[1G⠹ [?25h[?25l[2K[1G⠸ [?25h[?25l[2K[1G⠼ [?25h[?25l[2K[1G⠴ [?25h[?25l[2K[1G⠦ [?25h[?25l[2K[1G⠧ [?25h[?25l[2K[1G⠇ [?25h[?25l[2K[1G⠏ [?25h[?25l[2K[1G⠋ [?25h[?25l[2K[1G⠙ [?25h[?25l[2K[1G⠹ [?25h[?25l[?25l[2K[1G[?25h[2K[1G[?25h Emma[?25l[?25h Stone[?25l[?25h is[?25l

## Data Loading

News data

In [12]:
# data loading

# Install the 'datasets' library from Hugging Face
!pip install datasets

# Import the 'load_dataset' function from the 'datasets' library
from datasets import load_dataset

# Load the 'News_August_2023' dataset from Hugging Face
dataset = load_dataset("RealTimeData/News_August_2023")



In [13]:
import pandas as pd
import re

# Convert the dataset to a pandas DataFrame
df = pd.DataFrame(dataset['train'])

# Display the first few rows of the DataFrame
print(df.head())

  authors              date_download date_modify         date_publish  \
0      []  2023-08-01 01:20:55+00:00           _  2023-08-01 01:10:02   
1      []  2023-08-01 01:20:06+00:00           _  2023-08-01 01:13:54   
2      []  2023-08-01 01:20:19+00:00           _  2023-08-01 01:07:57   
3      []  2023-08-01 01:20:00+00:00           _  2023-08-01 00:37:29   
4      []  2023-08-01 01:20:37+00:00           _  2023-08-01 01:11:50   

                                         description  \
0  A consultant cardiologist at the Federal Medic...   
1  The Nasarawa State government is taking measur...   
2  Lawyers are divided over the renewed moves to ...   
3  D’Tigress will face the winners between Mozamb...   
4  Liver cancer patients are being spared overnig...   

                                            filename  \
0  https%3A%2F%2Fdailytrust.com%2Ftherapeutic-lif...   
1  https%3A%2F%2Fdailytrust.com%2Fhow-nasarawa-go...   
2  https%3A%2F%2Fdailytrust.com%2Fnba-conference-...   


In [14]:
df

Unnamed: 0,authors,date_download,date_modify,date_publish,description,filename,image_url,language,localpath,maintext,source_domain,title,title_page,title_rss,url
0,[],2023-08-01 01:20:55+00:00,_,2023-08-01 01:10:02,A consultant cardiologist at the Federal Medic...,https%3A%2F%2Fdailytrust.com%2Ftherapeutic-lif...,https://dailytrust.com/wp-content/uploads/2018...,en,_,A consultant cardiologist at the Federal Medic...,dailytrust.com,‘Therapeutic lifestyle modification’ lowers ri...,_,_,https://dailytrust.com/therapeutic-lifestyle-m...
1,[],2023-08-01 01:20:06+00:00,_,2023-08-01 01:13:54,The Nasarawa State government is taking measur...,https%3A%2F%2Fdailytrust.com%2Fhow-nasarawa-go...,https://dailytrust.com/wp-content/uploads/2022...,en,_,The Nasarawa State government is taking measur...,dailytrust.com,How Nasarawa govt is responding to diphtheria ...,_,_,https://dailytrust.com/how-nasarawa-govt-is-re...
2,[],2023-08-01 01:20:19+00:00,_,2023-08-01 01:07:57,Lawyers are divided over the renewed moves to ...,https%3A%2F%2Fdailytrust.com%2Fnba-conference-...,https://dailytrust.com/wp-content/uploads/2022...,en,_,Lawyers are divided over the renewed moves to ...,dailytrust.com,NBA Conference: Lawyers divided over parallel ...,_,_,https://dailytrust.com/nba-conference-lawyers-...
3,[],2023-08-01 01:20:00+00:00,_,2023-08-01 00:37:29,D’Tigress will face the winners between Mozamb...,https%3A%2F%2Fdailytrust.com%2Fdtigress-to-fac...,https://dailytrust.com/wp-content/uploads/2022...,en,_,D’Tigress will face the winners between Mozamb...,dailytrust.com,D’Tigress to face Mozambique or Cote d’Ivoire ...,_,_,https://dailytrust.com/dtigress-to-face-mozamb...
4,[],2023-08-01 01:20:37+00:00,_,2023-08-01 01:11:50,Liver cancer patients are being spared overnig...,https%3A%2F%2Fdailytrust.com%2Fradioactive-bea...,https://dailytrust.com/wp-content/uploads/2022...,en,_,Liver cancer patients are being spared overnig...,dailytrust.com,Radioactive beads in the wrist that can fight ...,_,_,https://dailytrust.com/radioactive-beads-in-th...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5054,"[Carrie Young, Natalie Herbick]",2023-08-01 01:26:26+00:00,_,2023-08-01 01:01:07,She was diagnosed with stage 2B HER2-POSITIVE ...,https%3A%2F%2Fwww.wkbn.com%2Fnews%2Fohio%2Fbre...,https://www.wkbn.com/wp-content/uploads/sites/...,en,_,CLEVELAND (WJW) – Lesley Kiraly Hosta was just...,www.wkbn.com,"Breast cancer survivor says research, newer dr...",_,_,https://www.wkbn.com/news/ohio/breast-cancer-s...
5055,[Brooke Williams],2023-08-01 01:26:20+00:00,_,2023-08-01 00:19:52,Country artist Luke Bryan invited a local girl...,https%3A%2F%2Fwww.wkbn.com%2Fnews%2Fnational-w...,https://www.wkbn.com/wp-content/uploads/sites/...,en,_,DENVER (KDVR) — Country artist Luke Bryan invi...,www.wkbn.com,Child with cancer gets invited back stage to L...,_,_,https://www.wkbn.com/news/national-world/luke-...
5056,[Stephanie Whiteside],2023-08-01 01:04:45+00:00,_,2023-07-31 23:15:01,Social media has gone wild as people claim the...,https%3A%2F%2Ffox2now.com%2Fnews%2Fnational%2F...,https://fox2now.com/wp-content/uploads/sites/1...,en,_,"(NewsNation) — As in decades past, the questio...",fox2now.com,Did the government confirm aliens exist?,_,_,https://fox2now.com/news/national/did-the-gove...
5057,[Brooke Williams],2023-08-01 01:04:51+00:00,_,2023-08-01 00:18:40,Country artist Luke Bryan invited a local girl...,https%3A%2F%2Ffox2now.com%2Fnews%2Fnational%2F...,https://fox2now.com/wp-content/uploads/sites/1...,en,_,DENVER (KDVR) — Country artist Luke Bryan invi...,fox2now.com,Child with cancer gets invited back stage to L...,_,_,https://fox2now.com/news/national/luke-bryan-i...


In [15]:
# Remove duplicate rows based on the 'maintext' column
df = df.drop_duplicates(subset=['maintext'])

# Display the DataFrame after removing duplicates
df

Unnamed: 0,authors,date_download,date_modify,date_publish,description,filename,image_url,language,localpath,maintext,source_domain,title,title_page,title_rss,url
0,[],2023-08-01 01:20:55+00:00,_,2023-08-01 01:10:02,A consultant cardiologist at the Federal Medic...,https%3A%2F%2Fdailytrust.com%2Ftherapeutic-lif...,https://dailytrust.com/wp-content/uploads/2018...,en,_,A consultant cardiologist at the Federal Medic...,dailytrust.com,‘Therapeutic lifestyle modification’ lowers ri...,_,_,https://dailytrust.com/therapeutic-lifestyle-m...
1,[],2023-08-01 01:20:06+00:00,_,2023-08-01 01:13:54,The Nasarawa State government is taking measur...,https%3A%2F%2Fdailytrust.com%2Fhow-nasarawa-go...,https://dailytrust.com/wp-content/uploads/2022...,en,_,The Nasarawa State government is taking measur...,dailytrust.com,How Nasarawa govt is responding to diphtheria ...,_,_,https://dailytrust.com/how-nasarawa-govt-is-re...
2,[],2023-08-01 01:20:19+00:00,_,2023-08-01 01:07:57,Lawyers are divided over the renewed moves to ...,https%3A%2F%2Fdailytrust.com%2Fnba-conference-...,https://dailytrust.com/wp-content/uploads/2022...,en,_,Lawyers are divided over the renewed moves to ...,dailytrust.com,NBA Conference: Lawyers divided over parallel ...,_,_,https://dailytrust.com/nba-conference-lawyers-...
3,[],2023-08-01 01:20:00+00:00,_,2023-08-01 00:37:29,D’Tigress will face the winners between Mozamb...,https%3A%2F%2Fdailytrust.com%2Fdtigress-to-fac...,https://dailytrust.com/wp-content/uploads/2022...,en,_,D’Tigress will face the winners between Mozamb...,dailytrust.com,D’Tigress to face Mozambique or Cote d’Ivoire ...,_,_,https://dailytrust.com/dtigress-to-face-mozamb...
4,[],2023-08-01 01:20:37+00:00,_,2023-08-01 01:11:50,Liver cancer patients are being spared overnig...,https%3A%2F%2Fdailytrust.com%2Fradioactive-bea...,https://dailytrust.com/wp-content/uploads/2022...,en,_,Liver cancer patients are being spared overnig...,dailytrust.com,Radioactive beads in the wrist that can fight ...,_,_,https://dailytrust.com/radioactive-beads-in-th...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5050,[Sean Lafferty],2023-08-01 01:26:32+00:00,_,2023-08-01 01:04:26,The latest adjustment to the average National ...,https%3A%2F%2Fwww.wkbn.com%2Fnews%2Fpennsylvan...,https://www.wkbn.com/wp-content/uploads/sites/...,en,_,(WJET) – The latest adjustment to the average ...,www.wkbn.com,National Fuel reducing charges starting August 1,_,_,https://www.wkbn.com/news/pennsylvania/nationa...
5051,[Jacob Thompson],2023-08-01 01:26:08+00:00,_,2023-08-01 01:15:53,During Monday's Youngstown City Council meetin...,https%3A%2F%2Fwww.wkbn.com%2Fnews%2Flocal-news...,https://www.wkbn.com/wp-content/uploads/sites/...,en,_,"YOUNGSTOWN, Ohio (WKBN) – During Monday’s Youn...",www.wkbn.com,Council hashes out what to do with vacant buil...,_,_,https://www.wkbn.com/news/local-news/youngstow...
5052,[Jennifer Rodriguez],2023-08-01 01:26:14+00:00,_,2023-08-01 00:32:40,The timeline of the construction was also ques...,https%3A%2F%2Fwww.wkbn.com%2Fnews%2Flocal-news...,https://www.wkbn.com/wp-content/uploads/sites/...,en,_,"YOUNGSTOWN, Ohio (WKBN) – The topic of downtow...",www.wkbn.com,Why multiple roads are being dug up at once,_,_,https://www.wkbn.com/news/local-news/youngstow...
5053,[Desirae Gostlin],2023-08-01 01:26:02+00:00,_,2023-07-31 23:55:01,The Mahoning Valley Scrappers become the Mahon...,https%3A%2F%2Fwww.wkbn.com%2Fnews%2Flocal-news...,https://www.wkbn.com/wp-content/uploads/sites/...,en,_,"NILES, Ohio (WKBN) – The Mahoning Valley Scrap...",www.wkbn.com,Scrappers host ‘Hot Peppers in Oil’ night,_,_,https://www.wkbn.com/news/local-news/niles-new...


In [17]:
# create an empty directory called "news_data"
!mkdir -p '/content/news_data/'

# Write each article's 'maintext' to its own text file, numbered consecutively
for count, data_content in enumerate(df['maintext']):
    fname = f"/content/news_data/Output{count}.txt"
    with open(fname, "w", encoding="utf-8") as text_file:
        text_file.write(data_content) # Write the article text to the file
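The export step above can also be wrapped in a small reusable function. This is a sketch only: the name `write_texts` is hypothetical, and skipping missing or blank entries is an added assumption about the data, not something the notebook checks.

```python
import tempfile
from pathlib import Path


def write_texts(texts, out_dir, prefix="Output"):
    """Write each non-empty text to <out_dir>/<prefix><n>.txt and return the file count."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = 0
    for text in texts:
        # Assumption: rows may hold None or whitespace-only text; skip those
        if not isinstance(text, str) or not text.strip():
            continue
        (out_path / f"{prefix}{written}.txt").write_text(text, encoding="utf-8")
        written += 1
    return written


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as demo_dir:
    demo_count = write_texts(["first article", None, "   ", "second article"], demo_dir)
    demo_names = sorted(p.name for p in Path(demo_dir).iterdir())

print(demo_count, demo_names)
```

In the notebook this would be called as `write_texts(df['maintext'], '/content/news_data')`.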

## Chunking

### Semantic Splitter

In [19]:
# Chunking: semantic splitter
# Load documents from the "/content/news_data" folder
reader = SimpleDirectoryReader("/content/news_data") # one document per article file
docs = reader.load_data()

# Print the number of documents loaded
print(f"Loaded {len(docs)} docs")

# Initialize a HuggingFace Embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Re-create the LLM client with a longer request timeout (25 minutes)
llm = Ollama(model=OLLAMA_MODEL, request_timeout=1500.0)

# Specify the LLM and embedding model into LlamaIndex's settings
Settings.llm = llm
Settings.embed_model = embed_model

Loaded 3305 docs




In [20]:
from llama_index.core.node_parser import SemanticSplitterNodeParser

# Initialize a SemanticSplitterNodeParser with specified parameters
parser = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=90, embed_model=embed_model
)

# Parse the documents into semantic nodes using the parser
semantic_nodes = parser.get_nodes_from_documents(docs)

# Print the semantic nodes (for a large corpus this output can exceed the notebook's IOPub data rate limit)
print(semantic_nodes)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
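The idea behind `breakpoint_percentile_threshold=90` can be illustrated with a simplified, library-free sketch: measure the cosine distance between neighbouring sentence embeddings and start a new chunk wherever the distance exceeds the 90th percentile of all distances. This is not LlamaIndex's implementation. It ignores `buffer_size`, uses toy two-dimensional vectors in place of real embeddings, and all function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (pct / 100.0) * (len(ordered) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def split_by_breakpoints(sentences, embeddings, threshold_pct=90):
    """Group consecutive sentences, breaking where the distance exceeds the percentile cutoff."""
    if len(sentences) < 2:
        return [list(sentences)] if sentences else []
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    cutoff = percentile(distances, threshold_pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > cutoff:
            chunks.append(current)  # large topic shift: close the current chunk
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(current)
    return chunks


toy_sentences = ["s1", "s2", "s3", "s4"]
toy_embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
toy_chunks = split_by_breakpoints(toy_sentences, toy_embeddings)
print(toy_chunks)
```

A lower threshold produces more, smaller chunks, while a higher one merges more sentences into each chunk.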



In [21]:
semantic_nodes

[TextNode(id_='3aae4846-dea0-4e5c-b7d6-e36c89003f4d', embedding=None, metadata={'file_path': '/content/news_data/Output0.txt', 'file_name': 'Output0.txt', 'file_type': 'text/plain', 'file_size': 4383, 'creation_date': '2024-05-26', 'last_modified_date': '2024-05-26'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='a815fb28-2834-497f-bae3-bcafd1d169fb', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': '/content/news_data/Output0.txt', 'file_name': 'Output0.txt', 'file_type': 'text/plain', 'file_size': 4383, 'creation_date': '2024-05-26', 'last_modified_date': '2024-05-26'}, hash='c3086f8209ea77b87beb04e2e38395f2e068c3c6e07c3f2184ea0664145fad3a'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(nod

In [22]:
# Access the text of the first semantic node
semantic_nodes[0].text

'A consultant cardiologist at the Federal Medical Centre (FMC), Idi Aba, Abeokuta, Ogun State, Dr Akinlolu Ajani, has proffered ways to reduce the risk of having diabetes.\nHe said they include suggested therapeutic lifestyle modification, eating less of carbohydrates but more of vegetables and fruits; exercise regularly, and abstaining from passive and active smoking, not taking alcohol, reduction of salt intake, and psychological or physical stress.\nAjani said this while speaking with Daily Trust shortly after giving a health talk at Ibara Baptist Church, Abeokuta, Ogun State.\nHe spoke on ‘Diabetes Mellitus, A Ravaging Disease’, in commemoration of the annual health week of the Nigerian Baptist Convention.\nAjani blamed increasing cases of diabetes on “family history, sedentary lifestyle and what we are eating.”\nThe medical expert said it’s always good to be knowledgeable about prevention and management of the disease, adding that it’s “not the matter of ‘I reject it in Jesus Name

In [23]:
# Extract the text of each chunk from the semantic nodes
all_texts = [node.text for node in semantic_nodes]

# 'all_texts' contains all the extracted texts from each TextNode
all_texts

['A consultant cardiologist at the Federal Medical Centre (FMC), Idi Aba, Abeokuta, Ogun State, Dr Akinlolu Ajani, has proffered ways to reduce the risk of having diabetes.\nHe said they include suggested therapeutic lifestyle modification, eating less of carbohydrates but more of vegetables and fruits; exercise regularly, and abstaining from passive and active smoking, not taking alcohol, reduction of salt intake, and psychological or physical stress.\nAjani said this while speaking with Daily Trust shortly after giving a health talk at Ibara Baptist Church, Abeokuta, Ogun State.\nHe spoke on ‘Diabetes Mellitus, A Ravaging Disease’, in commemoration of the annual health week of the Nigerian Baptist Convention.\nAjani blamed increasing cases of diabetes on “family history, sedentary lifestyle and what we are eating.”\nThe medical expert said it’s always good to be knowledgeable about prevention and management of the disease, adding that it’s “not the matter of ‘I reject it in Jesus Nam

In [25]:
!mkdir -p '/content/splitted_data/' # create the "splitted_data" directory if it does not exist

# Save each chunk to its own numbered text file
for count, doc in enumerate(all_texts):
    fname = f"/content/splitted_data/Output{count}.txt"
    with open(fname, "w", encoding="utf-8") as text_file:
        text_file.write(doc)

## Embedding & Vector Database Setup

In [26]:
# Import ChromaVectorStore and chromadb module
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

# Initialize a HuggingFace Embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Initialize the LLM client with a 25-minute request timeout
llm = Ollama(model=OLLAMA_MODEL, request_timeout=1500.0)

# Specify the LLM and embedding model into LlamaIndex's settings
Settings.llm = llm
Settings.embed_model = embed_model

# Load documents
reader = SimpleDirectoryReader("/content/splitted_data") # load the chunk files written above
docs = reader.load_data()
print(f"Loaded {len(docs)} docs")

# Create client ("db") and a database ("chroma_db")
db = chromadb.PersistentClient(path="./chroma_db")

# Create the collection (table) in the db, or reuse it if it already exists
chroma_collection = db.get_or_create_collection("my-db")

# Set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
# Specify Chroma as our vector db
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create the vector index
vector_index = VectorStoreIndex.from_documents(
    docs, # the chunk documents loaded above
    storage_context = storage_context,
    embed_model = embed_model
)

# Print the metadata
print(chroma_collection)

# Print the name of the collection (table)
print(f'Collection name is: {chroma_collection.name}')

Loaded 11887 docs
name='my-db' id=UUID('b09726b5-7c29-43ab-b0d3-c49aecc998bf') metadata=None tenant='default_tenant' database='default_database'
Collection name is: my-db
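Conceptually, the vector index answers a query by embedding it and returning the stored chunks whose embeddings are most similar. The sketch below shows that idea with a brute-force cosine-similarity search over toy vectors. It is an illustration only: Chroma uses its own index structure, and the function names and `k=2` are assumptions chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(doc_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
toy_hits = top_k([1.0, 0.0], toy_docs, k=2)
print(toy_hits)
```

The text of the returned chunks is what gets passed to the LLM as context.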


## Prompt Template Setup

In [43]:
# Prompt Template Setup
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core import ChatPromptTemplate

# Define the QA prompt string
qa_prompt_str = (
    "Below is the context information.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)

# Define the text QA prompt messages
chat_text_qa_msgs = [
    ChatMessage(
        role=MessageRole.SYSTEM,
        content=(
            "Please just say 'I don't know' if the answer is not provided in the given context."
        ),
    ),
    ChatMessage(role=MessageRole.USER, content=qa_prompt_str),
]

# Create the ChatPromptTemplate with the defined messages
text_qa_template = ChatPromptTemplate(chat_text_qa_msgs)
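To make the two placeholders concrete, here is a plain-Python sketch of how a prompt like this gets filled: the retrieved chunks replace `{context_str}` and the user's question replaces `{query_str}`. The helper `build_prompt` and the blank-line separator between chunks are assumptions for illustration, and LlamaIndex's own formatting may differ.

```python
QA_PROMPT = (
    "Below is the context information.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)


def build_prompt(chunks, question):
    """Join retrieved chunks and substitute them, with the question, into the template."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return QA_PROMPT.format(context_str=context, query_str=question)


example_prompt = build_prompt(
    ["Chunk one text.", "Chunk two text."],
    "Who is Emma Stone?",
)
print(example_prompt)
```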

## Query Testing

In [33]:
# Test1: general question
print(
    vector_index.as_query_engine(
        response_mode = 'tree_summarize',
        text_qa_template=text_qa_template,
        llm=llm,
    ).query("Give me a comprehensive introduction of the shipping company Yellow Corp.")
)

 Based on the context information provided, it appears that Yellow Corporation is an American trucking firm specializing in less-than-truckload service. The company was founded by two brothers who started their business with one truck and have since expanded to a fleet of over 2,000 trucks. They primarily transport goods for large corporations like Walmart and The Home Depot.

However, Yellow Corporation has recently faced financial trouble and announced its closure. This has raised concerns about the impact it will have on supply chains across the country, as well as leaving many employees without jobs. Despite the uncertainty, it is clear that Yellow Corp had a significant presence in the trucking industry and played an important role in transporting goods to customers.



In [37]:
# Test2: specific question
print(
    vector_index.as_query_engine(
        response_mode = 'tree_summarize',
        text_qa_template=text_qa_template,
        llm=llm,
    ).query("Who were the victim and perpetrator in the murder-suicide incident in Little Egg Harbor, New Jersey?")
)

 The woman found dead in her apartment was identified as Kimberly Hoffman, 49, and her attacker was her ex-husband, Carl Schulz Jr., 52.



In [73]:
# Test3: complex question
print(
    vector_index.as_query_engine(
        response_mode = 'tree_summarize',
        text_qa_template=text_qa_template,
        llm=llm,
    ).query("Why 911 calls for severe allergic reactions nearly doubled in summer? What measures can be taken to prevent serious allergic reaction?")
)

 In the summertime, people are more prone to insect stings and exposure to allergens such as peanuts, milk, and eggs at picnics and barbeques. As a result, BCEHS sees an increase of almost double the usual calls to 911 for severe allergic reactions. To prevent serious allergic reactions, it is important to stay vigilant and watch for the signs, including severe skin rash, swollen lips and eyes, swelling in the tongue or throat with difficulty swallowing, and trouble breathing. People with severe allergies should always have Epipen on hand and make sure they are not expired. If a person experiences anaphylaxis, it is important to call 911 immediately while remaining calm and following the advice of dispatch staff.



In [75]:
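# Test4: question whose answer is not in the indexed news data
# Note: in LlamaIndex the 'tree_summarize' mode appears to read its prompt from
# summary_template, so the text_qa_template passed here may not be applied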
print(
    vector_index.as_query_engine(
        response_mode = 'tree_summarize',
        text_qa_template=text_qa_template,
        llm=llm,
    ).query("Who is Emma Stone?")
)

 Based on the context information provided in the given text files, it seems that there is no direct mention of Emma Stone as a person or character. However, based on her presence in popular media such as movies like "La La Land" and "The Help," she can be identified as an actress who has achieved great success in Hollywood.
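The system prompt asks the model to say 'I don't know' when the context lacks the answer, so Test4 can be checked mechanically. Below is a rough sketch of such a check. The function `is_abstention` and its simple substring match are assumptions, not an established evaluation method. Applied to the opening of the Test4 answer above, it reports that the model did not abstain.

```python
def is_abstention(answer, marker="i don't know"):
    """Return True if the answer contains the abstention phrase (case-insensitive)."""
    normalized = answer.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    return marker in normalized


test4_answer = (
    "Based on the context information provided in the given text files, "
    "it seems that there is no direct mention of Emma Stone as a person or character."
)
abstained = is_abstention(test4_answer)
print(abstained)
```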

