The **ingestion process** refers to the workflow in which data is acquired, processed, and stored in a system, such as a database or knowledge base, to be subsequently used in applications like machine learning or search.

In the context of the code:
1. **Data Scraping**: The content of specified web pages is acquired (in this case, with the `scrape` function) and transformed into a machine-readable format.
2. **Embedding Generation**: The content extracted from each document is processed with a machine learning model (`generateEmbedding`) to generate vector representations (embeddings). These vectors semantically represent the data, making them useful for activities like search, clustering, or classification.
3. **Collection Creation**: A dedicated area is prepared in the database (`createCollection`) to preserve the processed data.
4. **Data Loading**: The data, now enriched with embeddings and other structured information, is loaded into the database (`uploadData`), where it can be used for activities like **Retrieval-Augmented Generation (RAG)**.

In [None]:
import { createCollection, uploadData} from "./lib/db"; //database utility functions for creating collections and uploading data
import { generateEmbedding } from "./lib/openai";  //utility function for generating embeddings
import { scrape } from "./lib/scrape";  //import web scraping utility function

const urls = [
    "https://en.wikipedia.org/wiki/Formula_One",
    "https://en.wikipedia.org/wiki/George_Russell_(racing_driver)",  //add more urls here to expand the RAG knowledge base
];

async function ingest() {
    let chunks: { text: string, $vector: number [], url: string } [] = []; // Initialize an empty array to store processed data chunks
    await (Promise.all(urls.map(async url => {      // Process all URLs concurrently using Promise.all
        let data = await scrape(url);           //scrape webpages at the given url

        const embeddings = await Promise.all(data.map(async (doc, index) => {    //generate embeddings for each scraped document
            const embedding = await generateEmbedding(doc.pageContent);   //Use OpenAI to generate an embedding for the document content 
            return embedding;
        }));
        
        //Combine the scraped data and corresponding embeddings into chunks
        chunks = chunks.concat( data.map (( doc, index) => {
            return {
                text: doc.pageContent,  //main content of the webpage
                $vector: embeddings[index].data[0].embedding, //content generated embedding vector
                url: url //source url
            }
        }));
    })));

    await createCollection();   //create collection in the database to store processed data

    // Upload the processed chunks to the knowledge base
    await uploadData(chunks.map((doc, index)=>{
        return {
            $vector: doc.$vector,  //embedding vector
            text: doc.text,        //content of the webpage
            source: doc.url        //source url
        }
    }));   //upload data to the knowledge base
}

ingest().catch(console.error);  //run the ingestion process and report any failure
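
The ingestion loop depends on `Promise.all` returning results in input order, which is what lets `embeddings[index]` line up with `data[index]` even when requests finish out of order. The sketch below illustrates this with a hypothetical `fakeEmbed` stand-in (not part of the original code) that makes earlier inputs finish later; no real API is called.

```typescript
// Records the order in which the fake requests actually complete
const completionOrder: string[] = [];

// Hypothetical stand-in for generateEmbedding: waits a number of
// microtask ticks before resolving, so delays can be controlled.
async function fakeEmbed(text: string, ticks: number): Promise<number[]> {
    for (let i = 0; i < ticks; i++) {
        await Promise.resolve(); // yield to the other pending calls
    }
    completionOrder.push(text);
    return [text.length]; // toy one-dimensional "embedding"
}

async function embedAll(texts: string[]): Promise<{ text: string; $vector: number[] }[]> {
    // The first text waits the longest, the last text the shortest
    const vectors = await Promise.all(
        texts.map((text, index) => fakeEmbed(text, (texts.length - index) * 3))
    );
    // Promise.all keeps input order, so index i still matches texts[i]
    return texts.map((text, index) => ({ text, $vector: vectors[index] }));
}

embedAll(["a", "bb", "ccc"]).then((rows) => {
    console.log(rows.map((row) => row.text)); // input order is preserved
    console.log(completionOrder);             // completion order is reversed
});
```

Because the pairing is positional, swapping `Promise.all` for a construct that yields results as they complete would require carrying the source text along with each result.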

Scraping web pages

In [None]:
import playwright from "playwright";  //import playwright for web scraping   
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

export async function scrape(url: string) {

    const browser = await playwright.chromium.launch();  //launch a chromium browser instance

    const context = await browser.newContext();  //create a new browser context

    const page = await context.newPage(); 

    await page.goto(url);

    const rawText = await page.innerText("body");  //extract text content from the webpage

    const text = rawText.replace(/\n/g, " ");  //replace newline characters with spaces (replace returns a new string, so the result must be kept)

    await browser.close();

    //split the text into smaller chunks using the RecursiveCharacterTextSplitter

    const splitter = new RecursiveCharacterTextSplitter({
        chunkSize: 512,     //maximum number of characters per chunk
        chunkOverlap: 100,  //number of characters shared between consecutive chunks
    });

    const output = await splitter.createDocuments([text]);  //split the text into smaller chunks

    return output;
}
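
To make the chunk size and chunk overlap settings concrete, here is a simplified fixed-width splitter. It is not LangChain's algorithm (which prefers to break on separators such as paragraphs and sentences); `splitWithOverlap` is a hypothetical helper that only shows how consecutive chunks share their boundary text.

```typescript
// Simplified illustration only: fixed-width windows that advance by
// (chunkSize - chunkOverlap) characters at each step.
function splitWithOverlap(text: string, chunkSize: number, chunkOverlap: number): string[] {
    if (chunkOverlap >= chunkSize) {
        throw new Error("chunkOverlap must be smaller than chunkSize");
    }
    const step = chunkSize - chunkOverlap;
    const chunks: string[] = [];
    for (let start = 0; start < text.length; start += step) {
        chunks.push(text.slice(start, start + chunkSize));
        if (start + chunkSize >= text.length) break; // this window reached the end
    }
    return chunks;
}

const overlapDemo = splitWithOverlap("abcdefghij", 4, 2);
console.log(overlapDemo); // [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

Each chunk repeats the last two characters of the previous one, so a sentence cut at a chunk boundary still appears intact in at least one chunk when the overlap is large enough.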

Generate embeddings

In [None]:
import OpenAI from 'openai';

const client = new OpenAI({
    apiKey: "API_KEY",
});

export async function generateEmbedding(text: string) {
    const embedding = await client.embeddings.create({
        model: "text-embedding-ada-002",  //returns 1536-dimensional vectors
        input: text,
    });

    return embedding;
}

Creating a vector database collection called f1gpt to store the retrieved info

In [None]:
import { DataAPIClient } from "@datastax/astra-db-ts"

const client = new DataAPIClient('TOKEN');
const db = client.db('DB_URL');
const collection = db.collection('f1gpt');

export async function createCollection() {
    const res = await db.createCollection('f1gpt', {
        vector: {
            dimension: 1536, //(corresponds to the embedding vector dimension of the model)
            metric: "dot_product"
        }
    });
    return res;
}
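
The `dot_product` metric is a reasonable choice when the stored vectors have unit length, because the dot product of unit vectors equals their cosine similarity. The sketch below checks this on toy two-dimensional vectors. The helper names are illustrative, and whether a given embedding model returns normalized vectors should be confirmed in its documentation.

```typescript
function dot(a: number[], b: number[]): number {
    return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

function norm(a: number[]): number {
    return Math.sqrt(dot(a, a));
}

function cosine(a: number[], b: number[]): number {
    return dot(a, b) / (norm(a) * norm(b));
}

// Scale a vector to unit length
function normalize(a: number[]): number[] {
    const length = norm(a);
    return a.map((value) => value / length);
}

const unitA = normalize([3, 4]); // approximately [0.6, 0.8]
const unitB = normalize([4, 3]); // approximately [0.8, 0.6]

// For unit vectors both metrics agree (about 0.96)
console.log(dot(unitA, unitB), cosine(unitA, unitB));

// For unnormalized vectors they differ: 24 versus about 0.96
console.log(dot([3, 4], [4, 3]), cosine([3, 4], [4, 3]));
```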

Uploading data to the vector database

In [None]:
export async function uploadData(data: {
    $vector: number[],  //embedding vector
    text: string,       //chunk content
    source: string      //url the chunk was scraped from
}[]) {
    return await collection.insertMany(data);
}

Answering a question is a three-step process:
1) Generate an embedding for the question with the existing `generateEmbedding` function
2) Query the database for the most similar records
3) Generate a response to return to the user

In [None]:
import { queryDatabase } from "./lib/db";
import { generateEmbedding, generateResponse } from "./lib/openai";

async function askQuestion(question: string) {
    const embedding = await generateEmbedding(question);  //generate an embedding for the question
    const queryRes = await queryDatabase(embedding.data[0].embedding);  //query the database with the generated embedding
    const response = await generateResponse(question, queryRes.map((doc) => doc.text));  //generate a response based on the query result
    
    return response;
}   

askQuestion("Why are George Russell and Max Verstappen arguing after Qatar 2024?").then((res) => {
    console.log(res);
});

Once the embedding for the user's question has been generated, this function queries the database for the most similar records, which are then used as context to answer the question.

In [None]:
export async function queryDatabase(query: number[]){
    const res = await collection.find({}, {  //empty filter: consider every document in the collection
        sort: {
            $vector: query
        },
        limit: 10
    }).toArray();
    return res;
}
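
Conceptually, sorting by `$vector` asks the database for the records whose vectors are most similar to the query vector. A brute-force, in-memory sketch of the same idea is shown below. `topK`, `StoredDoc`, and the sample records are hypothetical, and a real vector database typically relies on an index rather than scoring every record.

```typescript
type StoredDoc = { text: string; $vector: number[] };

function dotProduct(a: number[], b: number[]): number {
    return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// Score every record against the query and keep the `limit` best matches
function topK(docs: StoredDoc[], query: number[], limit: number): StoredDoc[] {
    return [...docs] // copy so the caller's array is not reordered
        .sort((x, y) => dotProduct(y.$vector, query) - dotProduct(x.$vector, query))
        .slice(0, limit);
}

const sampleDocs: StoredDoc[] = [
    { text: "tyre strategy", $vector: [1, 0] },
    { text: "driver dispute", $vector: [0, 1] },
    { text: "race penalties", $vector: [0.6, 0.8] },
];

console.log(topK(sampleDocs, [0, 1], 2).map((doc) => doc.text));
// [ 'driver dispute', 'race penalties' ]
```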

Using an LLM (GPT-4o in this case), a response can be generated based on the up-to-date knowledge base.

In [None]:
export async function generateResponse(question: string, context: string[]){
    const response = await client.chat.completions.create({
        model:"gpt-4o",
        messages: [{
            role: "user",               //role prompting the question
            content: `You are an expert in Formula 1 racing. You need to answer this question using the context provided. Do not mention that you have been provided with the context. QUESTION: ${ question }. CONTEXT: ${ context.join(" ") }`  //backticks are required for ${...} interpolation
        }]
    })
    return response.choices[0].message.content; 
}
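
Interpolation with `${...}` only happens inside backtick-delimited template literals. The hypothetical `buildPrompt` helper below isolates the prompt assembly so the final string can be inspected without calling the API; its shortened wording is an assumption made for illustration.

```typescript
// Hypothetical helper: assembles the prompt from the question and the
// retrieved chunks using a template literal (backticks).
function buildPrompt(question: string, context: string[]): string {
    return `You are an expert in Formula 1 racing. Answer the question using the context provided. QUESTION: ${question}. CONTEXT: ${context.join(" ")}`;
}

const examplePrompt = buildPrompt("Who won?", ["Chunk one.", "Chunk two."]);
console.log(examplePrompt.includes("QUESTION: Who won?")); // true

// With ordinary quotes the placeholder is kept as literal text
const notInterpolated = 'QUESTION: ${question}';
console.log(notInterpolated.includes("${question}")); // true
```

Logging the assembled prompt before sending it is a quick way to confirm that the question and the retrieved context actually reach the model.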

Testing the RAG application

In [None]:
askQuestion("Why are George Russell and Max Verstappen arguing after Qatar 2024?").then((res)=>{
    console.log(res);
});