In [None]:
%%capture
!pip install llama-index
!pip install datasets

In [None]:
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

In [None]:
import os
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

## Download Data

In [None]:
from datasets import load_dataset
wiki_dataset = load_dataset("wikipedia", "20220301.en")  # Specific English Wikipedia version

Selects the first 1,000 articles from the `train` split of the Wikipedia dataset (or the whole split, if it holds fewer).

In [None]:
wiki_data = wiki_dataset['train'].select(range(min(1000, len(wiki_dataset['train']))))

In [None]:
wiki_data

Dataset({
    features: ['id', 'url', 'title', 'text'],
    num_rows: 1000
})

## Step 1: Data Ingestion

### Data Loaders

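Creates a `corpus` directory and saves each Wikipedia article as a UTF-8 text file. Each file is named after the article title, with non-alphanumeric characters replaced by `_` and the name truncated to 100 characters.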

In [None]:

from tqdm import tqdm
import os

output_dir = 'corpus'

# create the corpus directory if it does not already exist
os.makedirs(output_dir, exist_ok=True)

for article in tqdm(wiki_data):
    text = article['text']
    title = article['title']

    # build a filesystem-safe file name; keep the original title for the file body
    ref_title = "".join([c if c.isalnum() else "_" for c in title]).strip("_")
    ref_title = ref_title[:100]

    with open(os.path.join(output_dir, f"{ref_title}.txt"), "w", encoding="utf-8") as f:
        # put the title on its own line so it does not run into the article text
        f.write(f"Title: {title}\n\n")
        f.write(text)

100%|██████████| 1000/1000 [00:01<00:00, 774.95it/s]
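
Sanitizing and truncating titles can map two different articles to the same file name, in which case the later article silently overwrites the earlier one. The sketch below shows one possible way to guard against that. The helper names `safe_filename` and `unique_filenames` are hypothetical and not part of the notebook or of LlamaIndex, and the numeric suffix scheme is only an assumption about how collisions might be resolved.

```python
def safe_filename(title: str, max_len: int = 100) -> str:
    """Replace non-alphanumeric characters with '_' and cap the length."""
    cleaned = "".join(c if c.isalnum() else "_" for c in title).strip("_")
    return cleaned[:max_len]


def unique_filenames(titles):
    """Map each title to a file stem, suffixing repeated stems with a counter."""
    seen = {}    # stem -> number of times it has been produced so far
    result = []
    for title in titles:
        stem = safe_filename(title) or "untitled"
        count = seen.get(stem, 0)
        seen[stem] = count + 1
        # the first occurrence keeps the plain stem; later ones get _1, _2, ...
        result.append(stem if count == 0 else f"{stem}_{count}")
    return result


print(unique_filenames(["A", "AC/DC", "AC DC", "Algebraic extension"]))
# ['A', 'AC_DC', 'AC_DC_1', 'Algebraic_extension']
```

A suffixed name could in principle still clash with a title that sanitizes to the same string, so a stricter version would also check the suffixed stem against `seen`.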


In [None]:
from llama_index.core import SimpleDirectoryReader

Loads documents from the corpus directory using `SimpleDirectoryReader`.

In [None]:
documents = SimpleDirectoryReader(output_dir).load_data()

In [None]:
type(documents)

list

In [None]:
len(documents)

1000

In [None]:
documents[0]

Document(id_='0d8038c0-aebb-4a2c-8d83-3be7b70b174d', embedding=None, metadata={'file_path': '/content/corpus/A.txt', 'file_name': 'A.txt', 'file_type': 'text/plain', 'file_size': 10427, 'creation_date': '2025-03-19', 'last_modified_date': '2025-03-19'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='Title: AA, or a, is the first letter and the first vowel of the modern English alphabet and the ISO basic Latin alphabet. Its name in English is a (pronounced ), plural aes. It is similar in shape to the Ancient Greek letter alpha, from which it derives. The uppercase version consists of the two slanting sides of a triangle, crossed in the mid

In [None]:
# get id of first document
documents[0].id_

'0d8038c0-aebb-4a2c-8d83-3be7b70b174d'

In [None]:
documents[0].doc_id

'0d8038c0-aebb-4a2c-8d83-3be7b70b174d'

In [None]:
documents[0].metadata

{'file_path': '/content/corpus/A.txt',
 'file_name': 'A.txt',
 'file_type': 'text/plain',
 'file_size': 10427,
 'creation_date': '2025-03-19',
 'last_modified_date': '2025-03-19'}

In [None]:
# get the text content from the first document
print(documents[0].text)

Title: AA, or a, is the first letter and the first vowel of the modern English alphabet and the ISO basic Latin alphabet. Its name in English is a (pronounced ), plural aes. It is similar in shape to the Ancient Greek letter alpha, from which it derives. The uppercase version consists of the two slanting sides of a triangle, crossed in the middle by a horizontal bar. The lowercase version can be written in two forms: the double-storey a and single-storey ɑ. The latter is commonly used in handwriting and fonts based on it, especially fonts intended to be read by children, and is also found in italic type.

In the English grammar, "a", and its variant "an", are indefinite articles.

History

The earliest certain ancestor of "A" is aleph (also written 'aleph), the first letter of the Phoenician alphabet, which consisted entirely of consonants (for that reason, it is also called an abjad to distinguish it from a true alphabet). In turn, the ancestor of aleph may have been a pictogram of an

### Embedding Model

In [None]:
# embedding model
from llama_index.embeddings.openai import OpenAIEmbedding

In [None]:
# initialize embedding model
embed_model = OpenAIEmbedding(model="text-embedding-3-large")

### LLM

In [None]:
from llama_index.llms.openai import OpenAI

In [None]:
# initialize large language model
llm = OpenAI(model='gpt-4')

## Step 2: Indexing

In [None]:
from llama_index.core import VectorStoreIndex

In [None]:
# create an index from the documents using the embedding model

index = VectorStoreIndex.from_documents(
    documents,
    embed_model=embed_model,
)
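
Before embedding, the index splits each document into smaller nodes, which is why the retrieval results later in the notebook are `TextNode` objects rather than whole articles. The sketch below illustrates the general idea with a simple overlapping word window. It is not LlamaIndex's actual splitter: the function name `chunk_words` and the default sizes are assumptions chosen for illustration, and the real splitter works on tokens and sentence boundaries.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10):
    """Split text into chunks of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# toy document of 120 numbered words
toy_text = " ".join(f"w{i}" for i in range(120))
toy_chunks = chunk_words(toy_text)
print(len(toy_chunks))             # number of chunks produced
print(toy_chunks[1].split()[:3])   # first words of the second chunk
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of embedding some text twice.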

## Step 3: Retrieval

In [None]:
retriever = index.as_retriever()

In [None]:
retrieved_nodes = retriever.retrieve("Can an infinite extension be algebraic? Give an example.")
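
Conceptually, the retriever embeds the query and returns the stored nodes whose embeddings are most similar to it. The sketch below shows that ranking step in plain Python, assuming cosine similarity as the metric. The three-dimensional vectors are made up for illustration (real embeddings have far more dimensions), and `cosine_similarity` and `top_k` are hypothetical helpers rather than LlamaIndex functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k=2):
    """Return the k (node_id, score) pairs most similar to the query, best first."""
    scored = [(node_id, cosine_similarity(query_vec, vec))
              for node_id, vec in node_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# made-up embeddings keyed by file name, for illustration only
toy_index = {
    "Algebraic_extension": [0.9, 0.1, 0.0],
    "Algebraic_number": [0.7, 0.3, 0.1],
    "A": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.1, 0.0]

for node_id, score in top_k(toy_query, toy_index, k=2):
    print(node_id, round(score, 3))
```

Here `k=2` mirrors the two nodes this notebook gets back for its query. A production vector store performs the same ranking with optimized, often approximate, nearest-neighbor search.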

In [None]:
retrieved_nodes[0].metadata

{'file_path': '/content/corpus/Algebraic_extension.txt',
 'file_name': 'Algebraic_extension.txt',
 'file_type': 'text/plain',
 'file_size': 3022,
 'creation_date': '2025-03-19',
 'last_modified_date': '2025-03-19'}

In [None]:
retrieved_nodes[0].id_

'66c10025-54bf-46e0-8b5d-c859329c786c'

In [None]:
retrieved_nodes[0].node_id

'66c10025-54bf-46e0-8b5d-c859329c786c'

In [None]:
retrieved_nodes[0].node

TextNode(id_='66c10025-54bf-46e0-8b5d-c859329c786c', embedding=None, metadata={'file_path': '/content/corpus/Algebraic_extension.txt', 'file_name': 'Algebraic_extension.txt', 'file_type': 'text/plain', 'file_size': 3022, 'creation_date': '2025-03-19', 'last_modified_date': '2025-03-19'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='f5a04cb3-5d9a-476b-bd59-154a8d1ba2a8', node_type='4', metadata={'file_path': '/content/corpus/Algebraic_extension.txt', 'file_name': 'Algebraic_extension.txt', 'file_type': 'text/plain', 'file_size': 3022, 'creation_date': '2025-03-19', 'last_modified_date': '2025-03-19'}, hash='72754c56c3203c7f3638fe1cada41709626e94c77847f93f01f5559c183393e6')}, metadata_template='{key}: {val

In [None]:
print(retrieved_nodes[0].text)

Title: Algebraic_extensionIn abstract algebra, a field extension L/K is called algebraic if every element of L is algebraic over K, i.e. if every element of L is a root of some non-zero polynomial with coefficients in K. Field extensions that are not algebraic, i.e. which contain transcendental elements, are called transcendental.

For example, the field extension R/Q, that is the field of real numbers as an extension of the field of rational numbers, is transcendental, while the field extensions C/R and Q()/Q are algebraic, where C is the field of complex numbers.

All transcendental extensions are of infinite degree. This in turn implies that all finite extensions are algebraic. The converse is not true however: there are infinite extensions which are algebraic.  For instance, the field of all algebraic numbers is an infinite algebraic extension of the rational numbers.

Let E be an extension field of K, and a ∈ E. If a is algebraic over K, then K(a), the set of all polynomials in a 

In [None]:
retrieved_nodes[1].metadata

{'file_path': '/content/corpus/Algebraic_number.txt',
 'file_name': 'Algebraic_number.txt',
 'file_type': 'text/plain',
 'file_size': 7725,
 'creation_date': '2025-03-19',
 'last_modified_date': '2025-03-19'}

In [None]:
print(retrieved_nodes[1].text)

Title: Algebraic_numberAn algebraic number is a number that is a root of a non-zero polynomial in one variable with integer (or, equivalently, rational) coefficients.  For example, the golden ratio, , is an algebraic number, because it is a root of the polynomial . That is, it is a value for x for which the polynomial evaluates to zero.  As another example, the complex number  is algebraic because it is a root of .

All integers and rational numbers are algebraic, as are all roots of integers. Real and complex numbers that are not algebraic, such as  and , are called transcendental numbers.

The set  of algebraic numbers is countably infinite and has measure zero in the Lebesgue measure as a subset of the uncountable complex numbers. In that sense, almost all complex numbers are transcendental.

Examples
 All rational numbers are algebraic. Any rational number, expressed as the quotient of an integer  and a (non-zero) natural number , satisfies the above definition, because  is the roo

## Step 4: Response Synthesis

In [None]:
from llama_index.core import get_response_synthesizer

In [None]:
# initialize the response synthesizer

response_synthesizer = get_response_synthesizer(llm = llm)
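
A response synthesizer takes the retrieved text and the user's question and turns them into one or more LLM calls. The simplest strategy places all retrieved chunks in a single prompt ahead of the question. The sketch below shows that idea only. The template wording and the `build_prompt` helper are assumptions made for illustration and are not LlamaIndex's actual prompt.

```python
def build_prompt(question: str, context_chunks) -> str:
    """Assemble one prompt containing numbered context chunks followed by the question."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Using only the context above, answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )


toy_chunks = [
    "All transcendental extensions are of infinite degree.",
    "The field of all algebraic numbers is an infinite algebraic extension of the rational numbers.",
]
print(build_prompt("Can an infinite extension be algebraic? Give an example.", toy_chunks))
```

When the retrieved text does not fit in one context window, a synthesizer has to split the work across several calls, for example by refining an answer chunk by chunk.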

## Step 5: Query Engine

In [None]:
# create a query engine from the index, LLM, and response synthesizer

query_engine = index.as_query_engine(
    llm=llm,
    response_synthesizer=response_synthesizer,
)


In [None]:
# query llm using query engine

response = query_engine.query('Can an infinite extension be algebraic? Give an example.')


In [None]:
response.response

'Yes, an infinite extension can be algebraic. An example of this is the field of all algebraic numbers, which is an infinite algebraic extension of the rational numbers.'

In [None]:
len(response.response)

168

In [None]:
len(response.source_nodes)

2

In [None]:
response.source_nodes[0].id_

'66c10025-54bf-46e0-8b5d-c859329c786c'

In [None]:
response.source_nodes[0].metadata

{'file_path': '/content/corpus/Algebraic_extension.txt',
 'file_name': 'Algebraic_extension.txt',
 'file_type': 'text/plain',
 'file_size': 3022,
 'creation_date': '2025-03-19',
 'last_modified_date': '2025-03-19'}

In [None]:
response.source_nodes[1].id_

'c24da429-416b-461b-a42a-88397263efa2'

In [None]:
response.source_nodes[1].metadata

{'file_path': '/content/corpus/Algebraic_number.txt',
 'file_name': 'Algebraic_number.txt',
 'file_type': 'text/plain',
 'file_size': 7725,
 'creation_date': '2025-03-19',
 'last_modified_date': '2025-03-19'}
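
Inspecting `source_nodes` one index at a time gets tedious when several nodes come from the same file. The sketch below collapses a list of source nodes into one line per file, keeping the best score for each. It runs on stand-in objects that expose only `.metadata` and `.score`, and the scores are invented for illustration. `summarize_sources` is a hypothetical helper, not a LlamaIndex function.

```python
from types import SimpleNamespace


def summarize_sources(source_nodes):
    """Return (file_name, best_score) pairs, one per file, in first-seen order."""
    best = {}
    for node in source_nodes:
        name = node.metadata.get("file_name", "<unknown>")
        score = node.score
        if name not in best:
            best[name] = score
        elif score is not None and (best[name] is None or score > best[name]):
            best[name] = score
    return list(best.items())


# stand-ins for retrieved nodes; the scores are made up
toy_nodes = [
    SimpleNamespace(metadata={"file_name": "Algebraic_extension.txt"}, score=0.61),
    SimpleNamespace(metadata={"file_name": "Algebraic_extension.txt"}, score=0.58),
    SimpleNamespace(metadata={"file_name": "Algebraic_number.txt"}, score=0.55),
]

for file_name, score in summarize_sources(toy_nodes):
    print(file_name, score)
```

Under the assumption that real source nodes expose the same two attributes, the helper could be called as `summarize_sources(response.source_nodes)`.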

### A Few More Examples

In [None]:
query1 = "What is considered Berg's most widely known and beloved composition?"
print(query_engine.query(query1).response)

Berg's most widely known and beloved composition is his Violin Concerto from 1935.


In [None]:
query2 = "What distinguishes atomic physics from nuclear physics?"
print(query_engine.query(query2).response)

Atomic physics deals with the atom as a system consisting of a nucleus and electrons, focusing on the study of atomic structure and the interaction between atoms. On the other hand, nuclear physics studies nuclear reactions and special properties of atomic nuclei.


In [None]:
query3 = "How do Roland Barthes and Michel Foucault challenge traditional views of authorship?"
print(query_engine.query(query3).response)

Roland Barthes and Michel Foucault challenge traditional views of authorship by suggesting that the author's identity and personal characteristics should not influence the interpretation of a written work. Barthes, in his essay "Death of the Author," argues that it is the language of the text itself that speaks and determines meaning, not the author. He believes that every line of written text is a reflection of references from various traditions and that the text is never original. This perspective removes the author from the text and destroys the limits imposed by the idea of one authorial voice or one ultimate and universal meaning.

Foucault, on the other hand, argues in his essay "What is an author?" that all authors are writers, but not all writers are authors. He introduces the concept of the "author function," suggesting that an author exists only as a function of a written work, a part of its structure, but not necessarily part of the interpretive process. He warns of the risk

This notebook demonstrates a complete RAG pipeline built with LlamaIndex: saving 1,000 Wikipedia articles as text files, loading them with `SimpleDirectoryReader`, embedding them with OpenAI's `text-embedding-3-large`, building a `VectorStoreIndex`, retrieving the nodes most relevant to a query, and generating an answer from them with GPT-4. Each stage takes only a few lines of code.