# Demo: comparing RAG performance with and without HelloRAG
1. Follow the [How to use HelloRAG](https://hellorag.ai/tutorial) tutorial to process your PDF(s) and export the resulting zip file(s).
2. Place the exported zip file(s) in the `./hellorag-result` directory:
>./hellorag-result  
> ├──test_only.pdf.zip  
> ├──some_other.pdf.zip  
> ...
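
Before building the index, it can help to confirm that the exported archives are where step 2 expects them. The following is a minimal sketch using only the standard library; the helper name `list_exported_zips` is hypothetical and not part of HelloRAG.

```python
from pathlib import Path
import zipfile


def list_exported_zips(result_dir="./hellorag-result"):
    """Return the sorted names of readable .zip archives in result_dir."""
    root = Path(result_dir)
    if not root.is_dir():
        # A missing directory simply means nothing has been exported yet.
        return []
    return sorted(
        p.name for p in root.glob("*.zip") if zipfile.is_zipfile(p)
    )


# Prints an empty list if the directory is missing or holds no valid archives.
print(list_exported_zips())
```

An empty list usually means the zip files were exported somewhere else or the download was incomplete.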


# 0. Setup:

## Install the required packages

In [None]:
!pip install lxml==5.1.0
!pip install llama-index==0.10.18
!pip install reportlab==4.0.6
!pip install beautifulsoup4==4.12.3
!pip install llama-cpp-python==0.2.23
!pip install pandas==2.2.0
print('Ready')

## Configure the LLM and embedding models

In [1]:
import openai
import pandas as pd
from llama_index.core.settings import Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Set your OpenAI API key, or configure your own LLM and embedding models
openai.api_key = "sk-*"
Settings.llm = OpenAI(model="gpt-4-0125-preview")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
# Use the public node_parser setter rather than the private _node_parser field
Settings.node_parser = SentenceSplitter()

print('model config done!')

model config done!
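
Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. As a hedged alternative, the sketch below reads the key from the `OPENAI_API_KEY` environment variable; the helper name `resolve_api_key` is hypothetical.

```python
import os


def resolve_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key stored in var_name, or raise if it is missing."""
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} before running the notebook")
    return key


# Possible use in the configuration cell, instead of a literal key:
# openai.api_key = resolve_api_key()
```

Passing a plain dict as `env` keeps the helper easy to check without touching the real environment.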


-------------------------------------------------------------------------------------------------------------------

# 1. RAG without using HelloRAG results:

## 1.1 Build index (no need to rebuild if file(s) unchanged) 

In [2]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Put your documents in a directory and point doc_dir at it
# (renamed from `dir`, which shadows the Python built-in)
doc_dir = './nohellorag'

documents = SimpleDirectoryReader(doc_dir).load_data()
vector_index = VectorStoreIndex.from_documents(documents)
vector_index.storage_context.persist(persist_dir="./nohellorag-index")

print('Index Built')

Index Built
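
The heading notes that the index only needs rebuilding when the files change. One way to check this automatically is to fingerprint the document directory and compare it with a saved value. This is a sketch under stated assumptions: the helper names and the stamp-file idea are illustrative and not part of LlamaIndex or HelloRAG, and the stamp file should live outside the document directory so it does not alter the fingerprint.

```python
import hashlib
import json
from pathlib import Path


def directory_fingerprint(doc_dir):
    """Hash every file's relative path and contents under doc_dir."""
    digest = hashlib.sha256()
    root = Path(doc_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()


def needs_rebuild(doc_dir, stamp_file):
    """Return True when no stamp exists or the directory contents changed."""
    stamp = Path(stamp_file)
    if not stamp.exists():
        return True
    saved = json.loads(stamp.read_text(encoding="utf-8"))
    return saved.get("sha256") != directory_fingerprint(doc_dir)


def record_fingerprint(doc_dir, stamp_file):
    """Save the current fingerprint; call this after a successful build."""
    payload = {"sha256": directory_fingerprint(doc_dir)}
    Path(stamp_file).write_text(json.dumps(payload), encoding="utf-8")
```

In this notebook, the build cell could be wrapped in `if needs_rebuild('./nohellorag', './nohellorag.stamp'):` and followed by `record_fingerprint(...)` once `persist` has returned.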


## 1.2 Query

In [3]:
from llama_index.core import load_index_from_storage, StorageContext

# Reload the persisted index instead of rebuilding it
storage_context = StorageContext.from_defaults(persist_dir="./nohellorag-index")
vector_index = load_index_from_storage(storage_context)
engine = vector_index.as_query_engine(similarity_top_k=3)
# Your query
response = engine.query("Tell me about the Empire Corridor")
print(response)
print('--------------------------------------------------------------------------------')
print('References:')
data = []  
for source_node in response.source_nodes:   
    row_data = {  
        'File Name': source_node.metadata['file_name'], 
        'Page No.': source_node.metadata['page_label'],  
        'Score': source_node.score, 
        'Text': source_node.text,  
    }  
    data.append(row_data)  
  
df = pd.DataFrame(data)  
pd.set_option('display.max_colwidth', None)
styled_df = df.style.set_table_styles([
     {'selector': 'th, td', 'props': [('text-align', 'left'),('white-space','pre-wrap')]}
]).hide(axis='index')

# display styled DataFrame
display(styled_df)

The Empire Corridor is a significant passenger rail corridor in New York State, extending 461 miles from Penn Station in New York City to Niagara Falls, New York. It passes through key cities such as Poughkeepsie, Albany, Schenectady, Amsterdam, Utica, Syracuse, Rochester, and Buffalo. This corridor, historically part of the New York Central Railroad's main line, is now served by trains collectively known as the Empire Service. The Empire Service name, originally introduced by the New York Central in 1967 and discontinued by Amtrak in 1971, was reinstated by Amtrak in 1972. Over the years, the service has seen various changes, including the addition and discontinuation of individual train names, with a notable shift in 1995 when individual names were replaced with the Empire Service brand, only to have individual names briefly return from 1996 to 1999 before settling back to the Empire Service branding.
--------------------------------------------------------------------------------
References:

File Name,Page No.,Score,Text
Amtrak Routes Tables.pdf,1,0.831836,"1 Amtrak Routes Amtrak operates inter-city rail service in 46 of the 48 contiguous U.S. states and three Canadian provinces. Amtrak is a portmanteau of the words America and trak, the latter itself a sensational spelling of track. Amtrak service is divided into three categories of routes: Northeast Corridor routes, state-supported routes, and long distance routes. These types indicate how the service is funded. Northeast Corridor service is directly subsidized by federal appropriations. Federally-supported long distance services are subsidized by appropriations under a separate line item from the NEC in federal budgets. Additionally, Amtrak partners with 17 states to provide additional short- and medium-distance services desired by those states. They are subsidized by periodic payments to Amtrak from the state partners. Three routes – the Carolinian, Northeast Regional, and Vermonter – are state-subsidized only on the sections of their routes off the Northeast Corridor (north of New Haven, and south of Washington). The Northeast Regional and San Joaquin have branches served by different trips, while the Empire Builder and Lake Shore Limited split into two sections to serve branches. On the Capitol Corridor, Cascades, Empire Service, Keystone Service, Northeast Regional, and Pacific Surfliner, some or all trips do not run the full length of the route. Empire Corridor The Empire Corridor is a 461-mile (742 km) passenger rail corridor in New York State running between Penn Station in New York City and Niagara Falls, New York. Major cities on the route include Poughkeepsie, Albany, Schenectady, Amsterdam, Utica, Syracuse, Rochester, and Buffalo. Much of the corridor was once part of the New York Central Railroad's main line. Trains operating over the Empire Corridor (the former New York Central Railroad Water Level Route) are now collectively known as the Empire Service. 
The name was used by the New York Central beginning in 1967, but dropped by Amtrak in 1971. Amtrak restored the Empire Service brand with the June 11, 1972, timetable, and added individual train names on the May 19, 1974, timetable. As was done on the Northeast Corridor with NortheastDirect, individual train names for New York-Albany and New York-Niagara Falls service were dropped on October 28, 1995, and replaced with Empire. The individual names were re-added in November 1996, but dropped in favor of Empire Service in May 1999."
Amtrak Routes Tables.pdf,3,0.817947,"3 Full list of current and discontinued Empire Corridor routes operated by Amtrak since May 1, 1971 (continued from Page 2) Name Route Service began Service ended Notes Cayuga New York City – Schenectady October 28, 1984 April 4, 1987 Central Park New York City – Albany April 2, 1995 October 27, 1995 Merged into Empire Service DeWitt Clinton New York City – Albany May 19, 1974 April 25, 1981 Previously unnamed; replaced by Rip Van Winkle Electric City Express New York City – Schenectady April 26, 1981 Replaced Salt City Express Empire Service New York City – Buffalo May 1, 1971 May 18, 1974 Inherited from PC Empire Service; unnamed until June 11, 1972. Individual names applied on May 19, 1974. New York City – Niagara Falls October 28, 1995 present Merged from various individual train names. Individual names restored under the Empire Service brand from November 1996 to May 1999. Empire State Express New York City – Buffalo May 19, 1974 October 30, 1974 Previously unnamed New York City – Detroit October 31, 1974 April 24, 1976 Renamed Niagara Rainbow New York City – Buffalo January 8, 1978 October 28, 1978 Renamed from Water Level Express New York City – Niagara Falls October 29, 1978"
Amtrak Routes Tables.pdf,2,0.812325,"2 Full list of current and discontinued Empire Corridor routes operated by Amtrak since May 1, 1971 Name Route Service began Service ended Notes Adirondack New York City – Montreal August 6, 1974 April 1, 1995 Joint operation with Empire State Express/DeWitt Clinton until April 1975 Washington, D.C. – Montreal April 2, 1995 April 13, 1996 New York City – Montreal April 14, 1996 present Bear Mountain New York City – Albany February 15, 1977 April 29, 1978 August 3, 1980 October 25, 1980 April 26, 1981 Renamed from Henry Hudson Catskill New York City – Albany October 27, 1991 October 30, 1993 New York City – Schenectady October 31, 1993 May 4, 1994 New York City – Syracuse May 5, 1994 October 29, 1994 New York City – Albany August 6, 1974 April 1, 1995 Joint operation with Empire State Express/DeWitt Clinton until April 1975 New York City – Niagara Falls April 2, 1995 April 13, 1996"


-------------------------------------------------------------------------------------------------------------------

# 2. RAG using HelloRAG results:

## 2.1 Build index (no need to rebuild if file(s) unchanged)

In [3]:
from llama_index.core.settings import Settings
from hellorag_llama_pack.hellorag_llama_index_pack.base import HelloragLlamaindexPack

print('Building index  ...')
hellorag_pack = HelloragLlamaindexPack(
        base_path="./hellorag-result",
        need_refresh=True,
        index_path="./hellorag-files",
    )
print('Index Built')


Building index  ...


Generating embeddings:   0%|          | 0/9 [00:00<?, ?it/s]

Index Built


## 2.2 Query

In [4]:
from llama_index.core.settings import Settings
from hellorag_llama_pack.hellorag_llama_index_pack.base import HelloragLlamaindexPack
import base64  
import io  
from IPython.display import display, Image 
hellorag_pack = HelloragLlamaindexPack(
        index_path="./hellorag-files",
        top_k=3
    )
# your query
response = hellorag_pack.run("How many U.S. States does Amtrak operate in? ")
print(response)
def show_hellorag_references(source_nodes):
    print('--------------------------------------------------------------------------------')
    print('References:')
    data = []  
    # Iterate over the argument, not the global `response`
    for source_node in source_nodes:
        base64_str=getattr(source_node.node, 'image', None) 
        row_data = {  
            'File Name': source_node.metadata['file_name'], 
            'Page No.': source_node.metadata['page_label'],  
            'Score': source_node.score, 
            'Text': source_node.text,  
            'Table': source_node.metadata['table_html'] if 'table_html' in source_node.metadata else '',  
            'Image': f'<img src="data:image/png;base64,{base64_str}" alt="Image" style="max-width: 200px; max-height: 200px;"/>' if base64_str else '',  
        }  
        data.append(row_data)  

    df = pd.DataFrame(data)  
    pd.set_option('display.max_colwidth', None)
    styled_df = df.style.set_table_styles([
         {'selector': 'th, td', 'props': [('text-align', 'left'),('white-space','pre-wrap')]}
    ]).hide(axis='index')

    # display styled DataFrame
    display(styled_df)
    
show_hellorag_references(response.source_nodes)

Amtrak operates inter-city rail service in 46 of the 48 contiguous U.S. states.
--------------------------------------------------------------------------------
References:


File Name,Page No.,Score,Text,Table,Image
Amtrak Routes Tables.pdf,1,0.859739,"Amtrak Routes■■■■Amtrak operates inter-city rail service in 46 of the 48 contiguous U.S. states and three Canadian provinces. Amtrak is a portmanteau of the words America and trak, the latter itself a sensational spelling of track. Amtrak service is divided into three categories of routes: Northeast Corridor routes, state-supported routes, and long distance routes. These types indicate how the service is funded. Northeast Corridor service is directly subsidized by federal appropriations. Federally-supported long distance services are subsidized by appropriations under a separate line item from the NEC in federal budgets. Additionally, Amtrak partners with 17 states to provide additional short- and medium-distance services desired by those states. They are subsidized by periodic payments to Amtrak from the state partners. Three routes – the Carolinian, Northeast Regional, and Vermonter – are state-subsidized only on the sections of their routes off the Northeast Corridor (north of New Haven, and south of Washington).■■■■The Northeast Regional and San Joaquin have branches served by different trips, while the Empire Builder and Lake Shore Limited split into two sections to serve branches. On the Capitol Corridor, Cascades, Empire Service, Keystone Service, Northeast Regional, and Pacific Surfliner, some or all trips do not run the full length of the route.■■■■Empire Corridor■■■■The Empire Corridor is a 461-mile (742 km) passenger rail corridor in New York State running between Penn Station in New York City and Niagara Falls, New York. Major cities on the route include Poughkeepsie, Albany, Schenectady, Amsterdam, Utica, Syracuse, Rochester, and Buffalo. Much of the corridor was once part of the New York Central Railroad's main line.■■■■Trains operating over the Empire Corridor (the former New York Central Railroad Water Level Route) are now collectively known as the Empire Service. 
The name was used by the New York Central beginning in 1967, but dropped by Amtrak in 1971. Amtrak restored the Empire Service brand with the June 11, 1972, timetable, and added individual train names on the May 19, 1974, timetable.",,
Amtrak Routes Tables.pdf,1,0.820117,"Major cities on the route include Poughkeepsie, Albany, Schenectady, Amsterdam, Utica, Syracuse, Rochester, and Buffalo. Much of the corridor was once part of the New York Central Railroad's main line.■■■■Trains operating over the Empire Corridor (the former New York Central Railroad Water Level Route) are now collectively known as the Empire Service. The name was used by the New York Central beginning in 1967, but dropped by Amtrak in 1971. Amtrak restored the Empire Service brand with the June 11, 1972, timetable, and added individual train names on the May 19, 1974, timetable. As was done on the Northeast Corridor with NortheastDirect, individual train names for New York-Albany and New York-Niagara Falls service were dropped on October 28, 1995, and replaced with Empire. The individual names were re-added in November 1996, but dropped in favor of Empire Service in May 1999.",,
Amtrak Routes Tables.pdf,1,0.809435,Amtrak National Route Map,,


-------------------------------------------------------------------------------------------------------------------

In [6]:
# your query
response = hellorag_pack.run("Tell me about the route Empire Service ")
print(response)
show_hellorag_references(response.source_nodes)

The Empire Service route initially ran from New York City to Buffalo, starting service on May 1, 1971, and ending on May 18, 1974. This service was inherited from PC Empire Service and remained unnamed until June 11, 1972. Individual names were applied on May 19, 1974. Later, the Empire Service route was extended from New York City to Niagara Falls, beginning on October 28, 1995, and is still in operation. This extension merged various individual train names, which were restored under the Empire Service brand from November 1996 to May 1999.
--------------------------------------------------------------------------------
References:


File Name,Page No.,Score,Text,Table,Image
Amtrak Routes Tables.pdf,1,0.858153,"Major cities on the route include Poughkeepsie, Albany, Schenectady, Amsterdam, Utica, Syracuse, Rochester, and Buffalo. Much of the corridor was once part of the New York Central Railroad's main line.■■■■Trains operating over the Empire Corridor (the former New York Central Railroad Water Level Route) are now collectively known as the Empire Service. The name was used by the New York Central beginning in 1967, but dropped by Amtrak in 1971. Amtrak restored the Empire Service brand with the June 11, 1972, timetable, and added individual train names on the May 19, 1974, timetable. As was done on the Northeast Corridor with NortheastDirect, individual train names for New York-Albany and New York-Niagara Falls service were dropped on October 28, 1995, and replaced with Empire. The individual names were re-added in November 1996, but dropped in favor of Empire Service in May 1999.",,
Amtrak Routes Tables.pdf,1,0.839979,"Amtrak Routes■■■■Amtrak operates inter-city rail service in 46 of the 48 contiguous U.S. states and three Canadian provinces. Amtrak is a portmanteau of the words America and trak, the latter itself a sensational spelling of track. Amtrak service is divided into three categories of routes: Northeast Corridor routes, state-supported routes, and long distance routes. These types indicate how the service is funded. Northeast Corridor service is directly subsidized by federal appropriations. Federally-supported long distance services are subsidized by appropriations under a separate line item from the NEC in federal budgets. Additionally, Amtrak partners with 17 states to provide additional short- and medium-distance services desired by those states. They are subsidized by periodic payments to Amtrak from the state partners. Three routes – the Carolinian, Northeast Regional, and Vermonter – are state-subsidized only on the sections of their routes off the Northeast Corridor (north of New Haven, and south of Washington).■■■■The Northeast Regional and San Joaquin have branches served by different trips, while the Empire Builder and Lake Shore Limited split into two sections to serve branches. On the Capitol Corridor, Cascades, Empire Service, Keystone Service, Northeast Regional, and Pacific Surfliner, some or all trips do not run the full length of the route.■■■■Empire Corridor■■■■The Empire Corridor is a 461-mile (742 km) passenger rail corridor in New York State running between Penn Station in New York City and Niagara Falls, New York. Major cities on the route include Poughkeepsie, Albany, Schenectady, Amsterdam, Utica, Syracuse, Rochester, and Buffalo. Much of the corridor was once part of the New York Central Railroad's main line.■■■■Trains operating over the Empire Corridor (the former New York Central Railroad Water Level Route) are now collectively known as the Empire Service. 
The name was used by the New York Central beginning in 1967, but dropped by Amtrak in 1971. Amtrak restored the Empire Service brand with the June 11, 1972, timetable, and added individual train names on the May 19, 1974, timetable.",,
Amtrak Routes Tables.pdf,3,0.839887,,"NameRouteService beganService endedNotesCayugaNew York City –SchenectadyOctober 28,1984April 4, 1987Central ParkNew York City – AlbanyApril 2, 1995October 27, 1995Merged into EmpireServiceDeWittClintonNew York City – AlbanyMay 19, 1974April 25, 1981Previously unnamed;replaced by Rip VanWinkleElectric CityExpressNew York City –SchenectadyApril 26, 1981Replaced Salt CityExpressEmpireServiceNew York City – BuffaloMay 1, 1971May 18, 1974Inherited from PC EmpireService; unnamed untilJune 11, 1972. Individualnames applied on May19, 1974.New York City – NiagaraFallsOctober 28,1995presentMerged from variousindividual train names.Individual names restoredunder the EmpireService brand fromNovember 1996 to May1999.EmpireStateExpressNew York City – BuffaloMay 19, 1974October 30, 1974Previously unnamedNew York City – DetroitOctober 31,1974April 24, 1976Renamed NiagaraRainbowNew York City – BuffaloJanuary 8, 1978October 28, 1978Renamed from WaterLevel ExpressNew York City – NiagaraFallsOctober 29,1978",

Name,Route,Service began,Service ended,Notes
Cayuga,New York City –Schenectady,"October 28,1984","April 4, 1987",
Central Park,New York City – Albany,"April 2, 1995","October 27, 1995",Merged into EmpireService
DeWittClinton,New York City – Albany,"May 19, 1974","April 25, 1981",Previously unnamed;replaced by Rip VanWinkle
Electric CityExpress,New York City –Schenectady,"April 26, 1981",,Replaced Salt CityExpress
EmpireService,New York City – Buffalo,"May 1, 1971","May 18, 1974","Inherited from PC EmpireService; unnamed untilJune 11, 1972. Individualnames applied on May19, 1974."
EmpireService,New York City – NiagaraFalls,"October 28,1995",present,Merged from variousindividual train names.Individual names restoredunder the EmpireService brand fromNovember 1996 to May1999.
EmpireStateExpress,New York City – Buffalo,"May 19, 1974","October 30, 1974",Previously unnamed
EmpireStateExpress,New York City – Detroit,"October 31,1974","April 24, 1976",Renamed NiagaraRainbow
EmpireStateExpress,New York City – Buffalo,"January 8, 1978","October 28, 1978",Renamed from WaterLevel Express


-------------------------------------------------------------------------------------------------------------------

# 3. Comparison (w/ vs. w/o HelloRAG)

## Install requirements

In [None]:
!pip install IPython

## Battle Start

In [7]:
from hellorag_llama_pack.hellorag_llama_index_pack.base import HelloragLlamaindexPack
from llama_index.core import load_index_from_storage, StorageContext
import pandas as pd
from IPython.display import display

# Load the HelloRAG index built in section 2.1
hellorag_pack = HelloragLlamaindexPack(
        index_path="./hellorag-files",
        top_k=3
    )
# Change the question here
question = "Which other routes are between nyc and niagara falls?"
response_1 = hellorag_pack.run(question)

# Load the plain vector index built in section 1.1
vector_index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./nohellorag-index"))
engine = vector_index.as_query_engine(similarity_top_k=3)
response_2 = engine.query(question)

data = {'w/ HelloRAG': [response_1],
        ' ': ['vs.'],
        'w/o HelloRAG': [response_2]}
df = pd.DataFrame(data)
pd.set_option('display.max_colwidth', None)

styled_df = df.style.set_table_styles([
     {'selector': 'th, td', 'props': [('text-align', 'left'),('white-space','pre-wrap'),('font-size','15px')]}
]).hide(axis='index')

# display styled DataFrame
display(styled_df)

w/ HelloRAG,Unnamed: 1,w/o HelloRAG
"- Mohawk: New York City – Niagara Falls, Service began April 26, 1981, Service ended April 28, 1984. - Mohawk: New York City – Niagara Falls, Service began October 28,1984. - Niagara Rainbow: New York City – Niagara Falls, Service began January 31,1979. - Empire Service: New York City – Niagara Falls, Service began October 28,1995, Service is present. - Empire State Express: New York City – Niagara Falls, Service began October 29,1978.",vs.,"The routes between New York City and Niagara Falls include the Mohawk, Niagara Rainbow, and Empire Service."
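
A single question gives only one data point. To compare the two pipelines over several questions, the same pattern can be wrapped in a small helper. This is a minimal sketch: `compare_answers` is a hypothetical name, and it only assumes that each pipeline is exposed as a callable taking a question string (such as `hellorag_pack.run` and `engine.query` above).

```python
def compare_answers(questions, ask_with, ask_without):
    """Build one side-by-side row per question from two answer callables."""
    rows = []
    for question in questions:
        rows.append({
            "Question": question,
            # str() turns a response object into its printable answer text
            "w/ HelloRAG": str(ask_with(question)),
            "w/o HelloRAG": str(ask_without(question)),
        })
    return rows


# Possible use in this notebook:
# rows = compare_answers([question], hellorag_pack.run, engine.query)
# display(pd.DataFrame(rows))
```

Because the helper returns plain dictionaries, the rows can be passed straight to `pd.DataFrame` and styled the same way as the table above.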
