### Before running the notebook

Please complete [setting up the Python dev environment](./setup-python-dev-env.md) first.

### Overview

This notebook processes PDF documents as part of a RAG pipeline.

![](media/rag-overview-2.png)

This notebook performs steps 1, 2 and 3 of the RAG pipeline.

Here are the processing steps:

- **pdf2parquet**: Extract text from the PDFs and convert it into parquet files
- **Chunk documents**: Split the documents into 'meaningful sections' (paragraphs, sentences, etc.)
- **Doc_ID generation**: Each chunk is assigned a unique ID, based on its content and a hash
- **Exact Dedup**: Chunks with exactly the same content are filtered out
- **Text encoder**: Convert chunks into vectors using embedding models
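
The Doc_ID generation and Exact Dedup steps can be illustrated with a minimal sketch. It assumes a SHA-256 hash of the chunk text serves as the ID; the hash function, the helper names `make_doc_id` and `exact_dedup`, and the `doc_id`/`contents` keys are illustrative assumptions, not the actual transforms used later in the pipeline.

```python
import hashlib


def make_doc_id(text: str) -> str:
    """Return a deterministic ID derived from the chunk content (SHA-256 hex digest)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def exact_dedup(chunks: list[str]) -> list[dict]:
    """Keep only the first occurrence of each chunk with identical content."""
    seen = set()
    unique = []
    for text in chunks:
        doc_id = make_doc_id(text)
        if doc_id in seen:
            continue  # exact duplicate: same content, same hash
        seen.add(doc_id)
        unique.append({"doc_id": doc_id, "contents": text})
    return unique


# Hypothetical chunks; the first and third are byte-for-byte identical
chunks = [
    "Attention is all you need.",
    "A second chunk.",
    "Attention is all you need.",
]
unique_chunks = exact_dedup(chunks)
print(len(chunks), "->", len(unique_chunks))  # 3 -> 2
```

Because the ID depends only on the content, identical chunks always receive the same ID, which is what makes exact deduplication a simple set lookup.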

### Step-1: Configuration

In [1]:
from my_config import MY_CONFIG
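
The `my_config` module itself is not shown in this notebook. The crawl step later on only relies on `MY_CONFIG.INPUT_DATA_DIR`, so a minimal stand-in could look like the sketch below. The class name `MyConfigSketch`, the variable `MY_CONFIG_SKETCH`, and the directory name `input` are assumptions made for illustration; the real module may define more settings.

```python
import os
from dataclasses import dataclass


@dataclass
class MyConfigSketch:
    # Assumed default; the real my_config may use a different location
    INPUT_DATA_DIR: str = "input"


MY_CONFIG_SKETCH = MyConfigSketch()

# Files can only be written into a directory that exists, so create it up front
os.makedirs(MY_CONFIG_SKETCH.INPUT_DATA_DIR, exist_ok=True)
print(f"Input directory: {MY_CONFIG_SKETCH.INPUT_DATA_DIR}")
```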

### Step-2: Data Acquisition using the data-prep-connector

Data-Prep-Connector

Let's say we want to run RAG on the content published in the conference proceedings of Advances in Neural Information Processing Systems (NeurIPS) for the year 2017. The Data-Prep-Connector is a scalable and compliant web crawler that can be used to acquire targeted content for use cases such as RAG or LLM development. In this notebook example, we will run it to selectively crawl pages under a specific path (https://proceedings.neurips.cc/paper_files/paper/2017) and only save the PDFs. The crawler will automatically follow robots.txt and auto-throttle based on the server response time.
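
To make the ideas of path-focused crawling and robots.txt compliance concrete, here is a small standalone sketch using only the Python standard library. It is not how the Data-Prep-Connector is implemented: the helper names `is_under_seed_path` and `may_fetch` are hypothetical, and the robots.txt rules are invented for the example rather than fetched from the NeurIPS site.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

SEED_URL = "https://proceedings.neurips.cc/paper_files/paper/2017"


def is_under_seed_path(url: str, seed: str = SEED_URL) -> bool:
    """Return True if url is on the same host and at or below the seed path."""
    seed_parts, url_parts = urlparse(seed), urlparse(url)
    seed_path = seed_parts.path.rstrip("/")
    return url_parts.netloc == seed_parts.netloc and (
        url_parts.path == seed_path or url_parts.path.startswith(seed_path + "/")
    )


# Hypothetical robots.txt content, parsed from memory (no network access)
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])


def may_fetch(url: str, user_agent: str = "dpk-connector") -> bool:
    """Combine the path filter with the robots.txt check."""
    return is_under_seed_path(url) and robots.can_fetch(user_agent, url)


print(may_fetch(SEED_URL + "/file/example-Paper.pdf"))  # True
print(may_fetch("https://proceedings.neurips.cc/paper_files/paper/2018"))  # False
```

A real crawler would download the site's actual robots.txt and also pace its requests; the sketch only shows the two filtering decisions.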

You can, of course, substitute your own data below.

In [2]:
from dpk_connector import crawl, shutdown
import nest_asyncio
import os
from utils import get_mime_type, get_filename_from_url
from dpk_connector.core.utils import validate_url

# Use nest_asyncio to enable a nested event loop run for the crawler inside the Jupyter notebook
nest_asyncio.apply()

# Initialize counters
retrieved_pages = 0
saved_pages = 0

# Define a callback function to be executed at the retrieval of each page during a crawl
def on_downloaded(url: str, body: bytes, headers: dict) -> None:
    """
    Callback function called when a page has been downloaded.
    You have access to the request URL, response body and headers.
    """
    global retrieved_pages, saved_pages
    retrieved_pages += 1

    # Keep the log short: stop printing once 20 PDFs have been saved
    if saved_pages < 20:
        print(f"Visited url: {url}")

    # Get mime_type of retrieved page
    mime_type = get_mime_type(body)
    
    # Save the page if it is a PDF to only download research papers
    if 'pdf' in mime_type.lower():
        filename = get_filename_from_url(url)
        local_file_path = os.path.join(MY_CONFIG.INPUT_DATA_DIR, filename)
        
        with open(local_file_path, 'wb') as f:
            f.write(body)
            
        if saved_pages<20:
            print(f"Saved contents of url: {url}")
        saved_pages+=1
        
# Define a user agent to provide information about the client making the request
user_agent = "dpk-connector"

async def run_my_crawl():
    crawl(
        ["https://proceedings.neurips.cc/paper_files/paper/2017"],
        on_downloaded,
        user_agent=user_agent,
        depth_limit=2,
        path_focus=True,
        download_limit=50,
    )
    return "Crawl is done"

# Now run the configured crawl
await run_my_crawl()



Visited url: https://proceedings.neurips.cc/paper_files/paper/2017
Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/0060ef47b12160b9198302ebdb144dcf-Abstract.html
Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/12a1d073d5ed3fa12169c67c4e2ce415-Abstract.html
Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1264a061d82a2edae1574b07249800d6-Abstract.html
Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/136f951362dab62e64eb8e841183c2a9-Abstract.html
Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1113d7a76ffceca1bb350bfe145467c6-Abstract.html
Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/1271a7029c9df08643b631b02cf9e116-Abstract.html
Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/10ce03a1ed01077e3e289f3e53c72813-Abstract.html
Visited url: https://proceedings.neurips.cc/paper_files/paper/2017/hash/10c272d06794d3e5785d5e7c5356

'Crawl is done'

In [3]:
# Note: download_limit is a soft limit, so the number of retrieved pages can differ slightly from it
print(f'Pages retrieved during the crawl: {retrieved_pages}')
print(f'Pages downloaded locally during the crawl: {saved_pages}')

Pages retrieved during the crawl: 62
Pages downloaded locally during the crawl: 20
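
The callback above depends on two helpers imported from `utils`, whose source is not shown in this notebook. The sketch below gives hypothetical stand-ins, deliberately named differently (`sniff_mime_type`, `filename_from_url`), to show one way such helpers could work: detecting a PDF by its leading `%PDF-` bytes and taking the last path segment of the URL as the file name. The real helpers may use a different approach, and the example URL is made up.

```python
import posixpath
from urllib.parse import unquote, urlparse


def sniff_mime_type(body: bytes) -> str:
    """Guess a MIME type from the first bytes of the response body."""
    head = body[:1024].lstrip().lower()
    if body.startswith(b"%PDF-"):
        return "application/pdf"
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    return "application/octet-stream"


def filename_from_url(url: str) -> str:
    """Use the last path segment of the URL as a local file name."""
    path = unquote(urlparse(url).path).rstrip("/")
    return posixpath.basename(path) or "index.html"


# Hypothetical inputs for illustration
print(sniff_mime_type(b"%PDF-1.7\n..."))  # application/pdf
print(filename_from_url("https://example.com/papers/sample-Paper.pdf"))  # sample-Paper.pdf
```

Checking the body rather than the URL suffix or headers means a PDF is still recognized when the server labels it incorrectly.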


### Step-2.2: Set input/output path variables for the pipeline

After acquiring content with the data-prep-connector, the remaining steps are the same as those described in the [rag_1A_dpk_process_python notebook](https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/rag/rag_1A_dpk_process_python.ipynb).