# Introduction to LlamaIndex

**LlamaIndex** is a framework for connecting data to LLMs. The data is loaded into some type of structure that is later made available to the LLM.

## Overview of RAG and its components with LlamaIndex

The main objective of retrieval augmentation is to add relevant context to the prompt.

RAG works in two steps:

1.  A document is loaded and divided into chunks. These chunks are processed by an embedding model. Finally, their vector representations are stored in a vector database. **This first step is the data ingestion**.
2.  **The second step is data querying (retrieval + synthesis)**. At this step, chunks of data are retrieved from the vector database based on their similarity to the user's prompt and given as context to the LLM. You can retrieve the k most similar chunks from the vector database and pass them to the synthesis module.
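The two steps above can be sketched in plain Python. This is a toy illustration and not LlamaIndex code: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, a Python list plays the role of the vector database, cosine similarity is an assumed choice of similarity measure, and `k` is the number of chunks to retrieve.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(chunks):
    """Step 1 (ingestion): store each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(store, query, k=2):
    """Step 2a (retrieval): return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(store, key=lambda item: cosine(item[1], query_vec), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(context_chunks, query):
    """Step 2b (input to synthesis): place the retrieved context before the question."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Made-up example chunks, already split
chunks = [
    "llamas are domesticated animals from south america",
    "vector databases store embeddings for similarity search",
    "python is a popular programming language",
]
store = ingest(chunks)
query = "where do llamas come from"
top = retrieve(store, query, k=1)
prompt = build_prompt(top, query)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM in the synthesis step. A real setup would replace `embed` with an embedding model and the list with a vector database.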

So, the main components of this framework are:

-   *LlamaHub (Data ingestion)*: Connect to your existing data, such as PDFs, docs, databases...
-   *Data Structures*: Store and index your data for different use cases. They can be integrated with different databases, such as vector databases.
-   *Queries*: Retrieve and query over the data stored in the data structures. This includes agents, QA, summarization...

## Vector Stores

Vector stores make it possible to store high-dimensional data and provide the essential tools for semantically retrieving relevant documents. These systems analyze the embedding vectors that encapsulate each document's meaning.

A primary function is similarity search. Semantic search goes beyond traditional keyword matching: it captures meaning in vectorized representations, and the technique can be applied to any data format. Once the data is embedded, we can compute similarities between the indexed vectors and the embedded query. This ensures that the results are relevant and in line with the contextual and conceptual nuances of the user's input.
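The text does not name a specific similarity measure. One common choice, assumed here purely for illustration, is the cosine similarity between a query embedding $q$ and a document embedding $d$:

$$
\operatorname{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
$$

For example, with $q = (1, 2, 2)$ and $d = (2, 1, 2)$, the dot product is $1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2 = 8$ and both norms are $\sqrt{9} = 3$, so $\operatorname{sim}(q, d) = 8/9 \approx 0.89$. Values close to 1 mean the two vectors point in nearly the same direction.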

## Data Connectors

Managing data in diverse formats, such as PDFs, docs, databases, or .csv files, can be challenging. To solve this problem we use data connectors, also called `Readers`. Readers are responsible for parsing the data and converting it into a simplified `Document` representation, **consisting of text and basic metadata**.

In summary, data connectors are designed to streamline the data ingestion process by automating the fetching of data from different sources and formatting it.

In [1]:
from llama_index.core import download_loader

WikipediaReader = download_loader("WikipediaReader") # Download the Wikipedia reader to fetch documents from that website
loader = WikipediaReader() # Create a WikipediaReader object
documents = loader.load_data(pages=['Natural Language Processing', 'Artificial Intelligence']) # Get documents about NLP and AI
print(len(documents))


2
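To make the "text and basic metadata" idea concrete, here is a minimal sketch of what a reader does conceptually. Everything in it is hypothetical: `ToyDocument` and `read_csv_rows` are illustrative stand-ins, not LlamaIndex's `Document` class or any of its readers, and the CSV content is made up.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a Document: text plus basic metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def read_csv_rows(csv_text, source_name):
    """Toy reader: turn every CSV row into one ToyDocument."""
    documents = []
    rows = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(rows, start=1):
        # Flatten the row into "column: value" text
        text = ", ".join(f"{column}: {value}" for column, value in row.items())
        metadata = {"source": source_name, "row": row_number}
        documents.append(ToyDocument(text=text, metadata=metadata))
    return documents


# Made-up CSV content, used only for illustration
raw = "name,topic\nintro,NLP\nsurvey,AI\n"
docs = read_csv_rows(raw, source_name="notes.csv")
print(len(docs))
print(docs[0])
```

Whatever the source format, the output has the same shape, so later stages can treat all documents uniformly.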


## Nodes

Once the data is ingested as documents, it passes through a processing step that transforms these documents into `Node` objects. Nodes are data units created from the original documents, and they also contain metadata and contextual information. LlamaIndex provides the `NodeParser` class, designed to convert the content of documents into structured nodes automatically. The `SimpleNodeParser` converts a list of document objects into nodes.

In [3]:
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core import download_loader

# Download the document loader
WikipediaReader = download_loader("WikipediaReader")
# Create an object to get documents from Wikipedia
loader = WikipediaReader()
# Load documents and keep the result so it can be parsed below
documents = loader.load_data(pages=['Natural Language Processing', 'Artificial Intelligence'])

# Initialize the parser
parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=20) # Number of tokens per chunk, and overlap between consecutive chunks
# Parse the documents into nodes
nodes = parser.get_nodes_from_documents(documents)
print(len(nodes))


58


We can see that 58 nodes were generated from the 2 documents fetched from Wikipedia.
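How `chunk_size` and `chunk_overlap` interact can be illustrated with a simplified sliding window. This sketch is an assumption-laden approximation: it treats whitespace-separated words as tokens, whereas LlamaIndex's parser uses its own tokenizer and splitting rules. `split_with_overlap` is a hypothetical helper, not part of the library.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Whitespace "tokens" are a simplification used only for this sketch
tokens = "the quick brown fox jumps over the lazy dog today".split()
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk starts with the last token of the previous one, so context that straddles a chunk boundary is not lost entirely. A larger overlap preserves more context at the cost of producing more chunks.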

## Indexes