Reference: [LlamaIndex overview & use cases | LangChain integration](https://www.youtube.com/watch?v=cNMYeW2mpBs) ([Colab notebook](https://colab.research.google.com/drive/19xBNmejiJUhWIy71bWFnlL1H-O-hjTbW?usp=sharing#scrollTo=z7U9dbyLTFOD)) ([GitHub](https://github.com/jerryjliu/llama_index/tree/main))

In [1]:
import openai
import environ
from llama_index import SimpleDirectoryReader
from llama_index import VectorStoreIndex
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine
from llama_hub.wikipedia.base import WikipediaReader
import wikipedia

import nest_asyncio
nest_asyncio.apply()

In [2]:
env = environ.Env()
environ.Env.read_env()
API_KEY = env('OPENAI_API_KEY')
openai.api_key = API_KEY



# Data Connectors (Llama Hub)


See [Llama Hub](https://llamahub.ai/) for the available data connectors.

In [3]:
list_wiki_pages = wikipedia.search("Migration to Germany")
list_wiki_pages

['Immigration to Germany',
 'British migration to Germany',
 'Migration from Ghana to Germany',
 'Migration Period',
 'Foreign-born population of the United Kingdom',
 'Germany',
 'Romani people in Germany',
 'Islam in Germany',
 'Germans in the United Kingdom',
 'Human migration']

In [4]:
def select_pages(list_wiki_pages: list) -> list:
    """
    TODO: select the subset of pages that contain information relevant to the topic.
    For now this is a pass-through that returns every page unchanged.

    Args:
        list_wiki_pages (list): Wikipedia page titles, e.g. as returned by wikipedia.search().

    Returns:
        list: The subset of Wikipedia page titles that are relevant to the topic.
    """
    subset_list_wiki_pages = list_wiki_pages
    return subset_list_wiki_pages
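
One possible way to fill in the TODO above is a simple case-insensitive keyword filter on the page titles. This is only a sketch: `select_pages_by_keyword` and its `keywords` parameter are hypothetical names that do not appear in the original notebook, and the sample titles are taken from the search output above.

```python
def select_pages_by_keyword(titles, keywords):
    """Keep the titles that contain at least one keyword (case-insensitive).

    Hypothetical helper, not part of the original notebook.
    """
    lowered = [k.lower() for k in keywords]
    return [t for t in titles if any(k in t.lower() for k in lowered)]


# A few titles from the wikipedia.search() output above.
titles = [
    "Immigration to Germany",
    "British migration to Germany",
    "Migration Period",
    "Germany",
    "Human migration",
]

selected = select_pages_by_keyword(titles, ["migration to germany"])
print(selected)
```

A substring match is crude (it cannot tell that 'Islam in Germany' might still be relevant), so an embedding-based or LLM-based relevance check would be a natural next step.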

In [5]:
subset_list_wiki_pages = select_pages(list_wiki_pages)

In [6]:
loader = WikipediaReader()
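# Load only the selected pages (currently all of them, since select_pages is a pass-through).
documents = loader.load_data(pages=subset_list_wiki_pages)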

# Basic Query

In [7]:
# Build an index for the Document objects.
index = VectorStoreIndex.from_documents(documents)

In [8]:
# query an index
query_engine = index.as_query_engine()
response = query_engine.query("How many move to Germany each year?")

In [9]:
print(response)

Over 1 million people move to Germany each year since 2013.
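
Conceptually, the query above embeds the question, finds the most similar document chunks in the index, and passes them to the LLM as context. The retrieval step can be sketched in plain Python as below. This is a toy illustration, not LlamaIndex's implementation: the three-dimensional vectors and chunk names are made up, and cosine similarity is assumed as the scoring function.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query vector."""
    scored = sorted(
        ((cosine(query_vec, vec), name) for name, vec in chunk_vecs.items()),
        reverse=True,
    )
    return [name for _, name in scored[:k]]


# Made-up embeddings; a real index would get these from an embedding model.
chunks = {
    "immigration_stats": [0.9, 0.1, 0.0],
    "migration_period": [0.2, 0.9, 0.1],
    "islam_in_germany": [0.1, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

best = top_k(query, chunks, k=2)
print(best)
```

The `similarity_top_k` argument used later in this notebook plays the role of `k` here: a larger value gives the LLM more context at the cost of a longer prompt.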


# Query Documents

In [10]:
doc_1 = SimpleDirectoryReader(input_files=["docs/KS-09-23-223-EN-N.pdf"]).load_data()
# doc_1

In [11]:
doc_1_index = VectorStoreIndex.from_documents(doc_1)

In [12]:
doc_1_engine = doc_1_index.as_query_engine(similarity_top_k=3)

In [13]:
query_engine_tools = [
    QueryEngineTool(
        query_engine=doc_1_engine,
        metadata=ToolMetadata(name='report_2022', description='Annual Report on Migration and Asylum 2022')
    ),
]

In [14]:
# Given a query, SubQuestionQueryEngine generates a "query plan" of sub-questions,
# runs each one against the matching tool (sub-document), and then synthesizes the final answer.
s_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=query_engine_tools)
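
The decompose, route, synthesize flow described in the comment above can be sketched without any LLM. Everything below is a hypothetical stand-in for illustration: `answer_with_sub_questions`, `toy_decompose`, `toy_synthesize` and `toy_tools` are not LlamaIndex APIs, and in the real engine the decomposition and synthesis are done by an LLM.

```python
def answer_with_sub_questions(question, tools, decompose, synthesize):
    """Split a question into per-tool sub-questions, answer each, then combine."""
    sub_questions = decompose(question, list(tools))
    sub_answers = [
        (name, sub_q, tools[name](sub_q)) for name, sub_q in sub_questions
    ]
    return synthesize(question, sub_answers)


def toy_decompose(question, tool_names):
    # Stand-in for the LLM-generated query plan: one sub-question per tool.
    return [(name, f"{question} (according to {name})") for name in tool_names]


def toy_synthesize(question, sub_answers):
    # Stand-in for the LLM synthesis step: concatenate the sub-answers.
    return " ".join(answer for _, _, answer in sub_answers)


# Each "tool" maps a sub-question to an answer, like a query engine would.
toy_tools = {"report_2022": lambda q: f"[report_2022 answer to: {q}]"}

result = answer_with_sub_questions(
    "How many people received residence permits?",
    toy_tools,
    toy_decompose,
    toy_synthesize,
)
print(result)
```

With a single tool, as in this notebook, the plan contains only one sub-question; the approach becomes more useful once several documents are registered as separate tools.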

In [15]:
response = s_engine.query("How many people received residence permits?")

Generated 1 sub questions.
[report_2022] Q: How many people received residence permits according to the Annual Report on Migration and Asylum 2022?
[report_2022] A: Based on the Annual Report on Migration and Asylum 2022, the number of people who received residence permits is not explicitly mentioned in the provided context information.

# Hypothetical document embeddings (HyDE)

In [24]:
from llama_index.indices.query.query_transform import HyDEQueryTransform
from llama_index.query_engine.transform_query_engine import TransformQueryEngine
from IPython.display import Markdown, display

In [16]:
doc_2 = SimpleDirectoryReader(input_dir='docs').load_data()

In [17]:
doc_2_index = VectorStoreIndex.from_documents(doc_2)

In [18]:
query_str = "What is the source of the data?"

In [33]:
# Use HyDEQueryTransform to generate a hypothetical document and use it for the embedding lookup.
hyde = HyDEQueryTransform(include_original=True)
query_engine = doc_2_index.as_query_engine()
hyde_query_engine = TransformQueryEngine(query_engine, hyde)
response = hyde_query_engine.query(query_str)

The source of the data is Eurostat.


In [38]:
print(f"Prompt:\n{query_str}\n\nResponse:\n{response}")

Prompt:
What is the source of the data?

Response:
The source of the data is Eurostat.


In [39]:
# In this example, HyDE improves output quality by hallucinating a plausible answer:
# its embedding is closer to the relevant passages than the bare query's, which improves the final output.
query_bundle = hyde(query_str)
hyde_doc = query_bundle.embedding_strs[0]

In [58]:
print(f"Prompt:\n{query_str}\n\nResponse:")
display(Markdown(f"{hyde_doc}"))

Prompt:
What is the source of the data?

Response:


The source of the data refers to the origin or the specific place from where the information or statistics are obtained. It is crucial to identify the source of the data as it helps in determining the reliability, credibility, and accuracy of the information presented. The source can vary depending on the context and the type of data being referred to.

In the case of scientific research or academic studies, the source of the data is often primary research conducted by the researchers themselves. This involves collecting data through experiments, surveys, observations, or interviews. The researchers ensure that the data is collected in a systematic and controlled manner to minimize biases and errors.

In other instances, the source of the data can be secondary research, which involves using existing data collected by others. This can include data from government agencies, research institutions, or publicly available databases. Secondary data is often used when primary data collection is not feasible or when there is a need to analyze large-scale trends or patterns.

Furthermore, the source of the data can also be derived from various sources such as official records, historical documents, financial reports, market research, social media platforms, or even personal experiences. It is essential to critically evaluate the source of the data to ensure its reliability and relevance to the topic or research question at hand.

In summary, the source of the data is the origin or specific place from where the information or statistics are obtained. It can be primary research conducted by the researchers themselves, secondary research using existing data, or derived from various sources such as official records or personal experiences. Identifying the source of the data is crucial in assessing its credibility and accuracy.
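
To see why embedding a hypothetical answer can help retrieval, here is a toy sketch of the HyDE idea. It is not how `HyDEQueryTransform` is implemented: `fake_llm` is a hard-coded stand-in for the LLM call, the bag-of-words `embed` replaces a real embedding model, and the two passages are invented for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding': word counts of the lowercased text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(text, passages):
    """Return the passage whose embedding is closest to the text's embedding."""
    vec = embed(text)
    return max(passages, key=lambda p: cosine(vec, embed(p)))


def fake_llm(query):
    # Hard-coded stand-in for the LLM that writes a hypothetical answer.
    return "the statistics were obtained from eurostat databases"


passages = [
    "figures are obtained from eurostat databases on residence permits",
    "what is the weather in berlin",
]
query = "what is the source of the data"

plain_hit = retrieve(query, passages)           # embeds the query itself
hyde_hit = retrieve(fake_llm(query), passages)  # embeds the hypothetical answer
print(plain_hit)
print(hyde_hit)
```

In this toy setup the bare query shares only filler words with the wrong passage, while the hypothetical answer shares content words with the right one. Whether HyDE helps on real data depends on how close the generated text is to the actual documents.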