# Building Data Ingestion from Scratch

In this notebook, we build a data ingestion pipeline into a vector database.

In [1]:
import pinecone
import os

pinecone.init(api_key=os.getenv("PINECONE_API_KEY"), environment="gcp-starter")



In [5]:
# create index for text-embedding-ada-002 (1536-dimensional embeddings)
pinecone.create_index("quickstart", dimension=1536, metric="euclidean")

pinecone_index = pinecone.Index("quickstart")

In [6]:
# [Optional] drop contents in index
# pinecone_index.delete(deleteAll=True)

- Create a PineconeVectorStore
    - Simple wrapper for LlamaIndex

In [7]:
from llama_index.vector_stores import PineconeVectorStore

In [8]:
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)



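- Build Ingestion Pipeline from Scratch
  1. Load Data
  2. Use a Text Splitter to Split Documents
  3. Manually construct Nodes from Text Chunks
  4. [Optional] Extract Metadata from each Node
  5. Generate Embeddings for each Node
  6. Load Nodes into Vector Store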

1. Load Data

In [9]:
!mkdir -p data
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"


--2023-09-19 19:07:00--  https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 128.84.21.199
Connecting to arxiv.org (arxiv.org)|128.84.21.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: ‘data/llama2.pdf’


2023-09-19 19:07:13 (1.02 MB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]



In [10]:
from pathlib import Path
from llama_index import download_loader

In [11]:
PyMuPDFReader = download_loader("PyMuPDFReader")

In [12]:
loader = PyMuPDFReader()
documents = loader.load(file_path="./data/llama2.pdf")

2. Split Text into Smaller Chunks

In [13]:
from llama_index.text_splitter import SentenceSplitter

In [14]:
text_splitter = SentenceSplitter(chunk_size=1024)

In [15]:
text_chunks = []

# maintain relationship with source doc index to help include metadata in (3)
doc_idxs = []

for doc_idx, doc in enumerate(documents):
    cur_text_chunks = text_splitter.split_text(doc.text)
    
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))
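The bookkeeping in the loop above can be tried without any PDF or library. The following is a minimal sketch, assuming a hypothetical `toy_split` helper that cuts fixed-size character windows in place of `SentenceSplitter.split_text`; only the parallel-list pattern (`text_chunks` alongside `doc_idxs`) mirrors the notebook.

```python
from typing import List, Tuple


def toy_split(text: str, chunk_size: int) -> List[str]:
    """Hypothetical stand-in for a text splitter: fixed-size character windows."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


def split_documents(texts: List[str], chunk_size: int) -> Tuple[List[str], List[int]]:
    """Return all chunks plus, for each chunk, the index of its source document."""
    chunks: List[str] = []
    doc_idxs: List[int] = []
    for doc_idx, text in enumerate(texts):
        cur_chunks = toy_split(text, chunk_size)
        chunks.extend(cur_chunks)
        # one entry per chunk, so chunks[i] came from texts[doc_idxs[i]]
        doc_idxs.extend([doc_idx] * len(cur_chunks))
    return chunks, doc_idxs


pages = ["aaaaaaaaaa", "bbbb", "cccccccc"]  # made-up page texts
chunks, doc_idxs = split_documents(pages, chunk_size=4)
```

Because both lists grow in lockstep, `doc_idxs[i]` always points back to the document that produced `chunks[i]`, which is what step 3 relies on to attach metadata.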

3. Manually construct nodes from chunks

- Convert each chunk into a `TextNode`, the LlamaIndex abstraction that stores data and defines metadata and relationships to other nodes.
- Inject metadata from the source document into each node.
- This is a manual version of what `SimpleNodeParser` does.


In [16]:
from llama_index.schema import TextNode

In [17]:
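nodes = []

for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(text=text_chunk)
    # look up the document this chunk came from (see doc_idxs in step 2)
    src_doc = documents[doc_idxs[idx]]

    # copy the metadata so nodes from the same document don't share one dict
    node.metadata = dict(src_doc.metadata)
    nodes.append(node)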
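One thing to watch when attaching a document's metadata to many nodes: if every node holds the same dict object, a later per-node change shows up on all of them. The sketch below illustrates this with a hypothetical `ToyNode` dataclass standing in for `TextNode`; whether the real class stores the dict as given is an assumption here, so treat this as an illustration of the general Python aliasing behavior rather than a statement about LlamaIndex internals.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for TextNode: some text plus a metadata dict."""

    text: str
    metadata: dict = field(default_factory=dict)


doc_metadata = {"file_path": "./data/llama2.pdf", "source": "1"}

# Passing the dict directly makes both nodes share one object ...
shared_a = ToyNode("chunk a", doc_metadata)
shared_b = ToyNode("chunk b", doc_metadata)
shared_a.metadata["document_title"] = "added later"
leaked = "document_title" in shared_b.metadata  # b sees a's change

# ... while a shallow copy gives each node its own dict.
doc_metadata.pop("document_title")
own_a = ToyNode("chunk a", dict(doc_metadata))
own_b = ToyNode("chunk b", dict(doc_metadata))
own_a.metadata["document_title"] = "added later"
isolated = "document_title" not in own_b.metadata  # b is unaffected
```

A shallow copy is enough here because the metadata values are plain strings and numbers.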

In [18]:
src_doc.metadata

{'total_pages': 77, 'file_path': './data/llama2.pdf', 'source': '77'}

In [19]:
print(nodes[3].get_content(metadata_mode="all"))

total_pages: 77
file_path: ./data/llama2.pdf
source: 2

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
72
A.6 Dataset Contamination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
75
A.7 Model Card
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
77
2


[Optional] 4. Extract Metadata from each Node
We extract metadata from each Node using our Metadata extractors.

This will add more metadata to each Node.


`TitleExtractor` - goes through the first `nodes` nodes, uses the LLM to create a title for each one, then uses the LLM to combine those titles into a single title for the document.
  - adds `document_title` to each node

`QuestionsAnsweredExtractor` - uses the LLM to generate `questions` questions that each node's text can answer.
  - adds `questions_this_excerpt_can_answer` to each node

In [20]:
from llama_index.node_parser.extractors import (
    MetadataExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

metadata_extractor = MetadataExtractor(
    extractors=[
        TitleExtractor(nodes=5, llm=llm),
        QuestionsAnsweredExtractor(questions=3, llm=llm),
    ],
    in_place=False
)

In [21]:
nodes = metadata_extractor.process_nodes(nodes)

Extracting questions:   0%|          | 0/110 [00:00<?, ?it/s]

5. Generate Embeddings for each Node

In [22]:
from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()

In [27]:
print(nodes[1].get_content(metadata_mode="all"))

[Excerpt from document]
total_pages: 77
file_path: ./data/llama2.pdf
source: 2
document_title: Developing and Evaluating Llama 2-Chat: Pretraining, Fine-tuning, Safety Evaluation, Red Teaming, Dataset Contamination, and Model Card
questions_this_excerpt_can_answer: 1. What are the safety measures implemented in the pretraining and fine-tuning processes of Llama 2-Chat?
2. How is reinforcement learning with human feedback used in the fine-tuning of Llama 2-Chat?
3. What is the process of red teaming and how is it applied to evaluate the safety of Llama 2-Chat?
Excerpt:
-----
Contents
1
Introduction
3
2
Pretraining
5
2.1
Pretraining Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5
2.2
Training Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5
2.3
Llama 2 Pretrained Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7
3
Fine-tuning
8
3.1
Supervised Fine-Tuning (S

In [28]:
for node in nodes:
    node.embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
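Note that the loop embeds each node's text together with its metadata, so the metadata added in the earlier steps influences retrieval. A rough sketch of that composition, modeled on the printed output in step 3 (metadata as `key: value` lines, a blank line, then the text). The function name `format_for_embedding` is hypothetical, and the library's actual templates can differ, as the `[Excerpt from document]` wrapper in the output above shows.

```python
def format_for_embedding(text: str, metadata: dict) -> str:
    """Join 'key: value' metadata lines and the chunk text, separated by a blank line."""
    metadata_str = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    # fall back to the bare text when there is no metadata to prepend
    return f"{metadata_str}\n\n{text}" if metadata_str else text


example = format_for_embedding(
    "A.7 Model Card",
    {"total_pages": 77, "file_path": "./data/llama2.pdf", "source": "2"},
)
print(example)
```

Printing a node with `metadata_mode="all"` before embedding, as done in the cell above, is a cheap way to check exactly what text the embedding model will see.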

6. Load Nodes into Vector Store


In [29]:
vector_store.add(nodes)

Upserted vectors:   0%|          | 0/110 [00:00<?, ?it/s]

['5b8a563e-b010-4034-8bb7-dca8f0f97d3c',
 'bfad99bb-bcfb-46c3-b1d0-32b32fa34b0b',
 'f65a8dd2-a280-4256-98e2-a74b65642dfd',
 'bfaa3b35-4ef4-465b-8a71-15d0fb24e370',
 'd11ca168-6ccc-47bd-9945-1cf2b7aa1397',
 '2270d5a9-1d4d-41f7-ad95-6125c4ffeff5',
 'd93f09bc-82c5-4c6c-b754-8e3c7e9f6bb7',
 '8f6ebfa3-1833-430e-bb77-1217ee5668fc',
 '2937ab84-c966-42b1-a98e-9ec86e7d724c',
 'a7fd58a9-4dac-440e-82d9-7c7cd542a56e',
 '152ab395-cfe2-49b5-a556-3e3333f72e2b',
 'ea7bccf1-aff2-4cea-9737-c1c8d79de397',
 'ed916a2c-5706-4a57-860b-81309de4c0af',
 'f4743a32-326c-4b6b-ad1b-4acdfa294c06',
 '8ce59037-d034-4271-a2fd-08ae5569f6d3',
 'a76070f6-dbcd-4e15-a50c-0cf853ce1c62',
 'b5d737da-95e2-4f6a-921f-45c66b4e7fc0',
 '9e1add67-3bb9-4550-bcc6-d8743cd55832',
 'c772c457-d93d-4411-89cc-4c95dacea958',
 '8dcfaacd-fb09-4174-93aa-c577c102317f',
 'd45c452c-00f6-479e-a62f-aedfe52817c6',
 'cdbdad2e-d73a-4620-b9bd-c77c4c63c999',
 'a4250d06-96fb-48ad-8a8c-e71753c56c23',
 '95303a77-d6d1-4159-b817-d68beabad8c2',
 '69e5a158-883e-

Retrieve & Query from Vector Store

- Here we wrap the vector store in a `VectorStoreIndex` so that we can query it through a query engine

In [23]:
from llama_index import VectorStoreIndex
from llama_index.storage import StorageContext

In [31]:
index = VectorStoreIndex.from_vector_store(vector_store)

In [32]:
query_engine = index.as_query_engine()
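Conceptually, the query engine embeds the question, asks the vector store for the nearest stored vectors, and passes the matching chunks to the LLM. The nearest-neighbor step can be sketched in plain Python. This is a brute-force illustration with made-up 2-D vectors and a hypothetical `top_k_euclidean` helper, not how Pinecone implements search; it uses Euclidean distance to match the `metric="euclidean"` setting chosen when the index was created.

```python
import math
from typing import Dict, List, Tuple


def top_k_euclidean(
    query: List[float], vectors: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k (node_id, distance) pairs closest to the query vector."""
    scored = [(node_id, math.dist(query, vec)) for node_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:k]


# Made-up 2-D "embeddings"; real ones in this notebook have 1536 dimensions.
toy_vectors = {
    "node-a": [0.0, 0.0],
    "node-b": [1.0, 1.0],
    "node-c": [3.0, 4.0],
}
results = top_k_euclidean([0.0, 0.0], toy_vectors, k=2)
```

A managed vector store replaces the linear scan with an approximate index, but the contract is the same: a query vector goes in, and the ids of the closest stored vectors come out.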

In [34]:
query = "Can you tell me about the key concepts for safety finetuning?"
response = query_engine.query(query)

print(str(response))

The key concepts for safety fine-tuning include supervised safety fine-tuning, safety RLHF (Reinforcement Learning from Human Feedback), and safety context distillation. 

Supervised safety fine-tuning involves gathering adversarial prompts and safe demonstrations to train the model to align with safety guidelines. This is done before RLHF and helps lay the foundation for high-quality human preference data annotation.

Safety RLHF integrates safety into the general RLHF pipeline. It includes training a safety-specific reward model and gathering more challenging adversarial prompts for rejection sampling style fine-tuning and PPO (Proximal Policy Optimization) optimization.

Safety context distillation involves generating safer model responses by prefixing a prompt with a safety preprompt and then fine-tuning the model on the safer responses without the preprompt. This distills the safety preprompt (context) into the model and allows the safety reward model to choose whether to use cont