# RAG From Scratch: Overview

These notebooks walk through the process of building RAG app(s) from scratch.

They build toward a broader understanding of the RAG landscape, as shown here:

The topic for this RAG app is:

## Research Papers in Deep Learning and Chemical Structures (Image Data)

## Environment

`(1) Packages`

In [4]:
!pip install fake_useragent

Collecting fake_useragent
  Downloading fake_useragent-2.2.0-py3-none-any.whl.metadata (17 kB)
Downloading fake_useragent-2.2.0-py3-none-any.whl (161 kB)
Installing collected packages: fake_useragent
Successfully installed fake_useragent-2.2.0


In [1]:
! pip install langchain_community tiktoken langchain-google-genai langchainhub chromadb langchain PyMuPDF

Collecting langchain_community
  Downloading langchain_community-0.3.26-py3-none-any.whl.metadata (2.9 kB)
Collecting langchain-google-genai
  Downloading langchain_google_genai-2.1.5-py3-none-any.whl.metadata (5.2 kB)
Collecting langchainhub
  Downloading langchainhub-0.1.21-py3-none-any.whl.metadata (659 bytes)
Collecting chromadb
  Downloading chromadb-1.0.13-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB)
Collecting PyMuPDF
  Downloading pymupdf-1.26.1-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain_community)
  Downloading pydantic_settings-2.10.1-py3-none-any.whl.metadata (3.4 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain_community)
  Downloading httpx_sse-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting filetype<2.0.0,>=1.2.0 (from langchain

`(2) LangSmith`

https://docs.smith.langchain.com/

In [2]:
import os
from google.colab import userdata

os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
LANGCHAIN_API_KEY = userdata.get('LANGCHAIN_API_KEY')
os.environ['LANGCHAIN_API_KEY'] = LANGCHAIN_API_KEY


`(3) API Keys`

In [107]:
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
os.environ['GOOGLE_API_KEY'] = GOOGLE_API_KEY
PINECONE_API_KEY = userdata.get('PINECONE_API_KEY')
os.environ['PINECONE_API_KEY'] = PINECONE_API_KEY
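
The `userdata.get` calls above only work inside Google Colab. Outside Colab, a fallback along the following lines could read the same keys from environment variables. This is a minimal sketch: `get_secret` is a hypothetical helper, not part of this notebook or of any library, and it assumes the keys have already been exported in the shell.

```python
import os

def get_secret(name, default=None):
    """Hypothetical helper: read a secret from Colab if available, else from the environment."""
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata
        return userdata.get(name)
    except ImportError:
        # Assumption: outside Colab the key was exported as an environment variable.
        return os.environ.get(name, default)

# Example usage (the variable name matches the one used in the cell above)
google_key = get_secret("GOOGLE_API_KEY")
print("GOOGLE_API_KEY found:", google_key is not None)
```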

# Basic RAG using Chroma DB

## Part 1: Overview

[RAG quickstart](https://python.langchain.com/docs/use_cases/question_answering/quickstart)

In [83]:
vectorstore.delete_collection()

In [86]:
import bs4
import fitz  # PyMuPDF
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader, PyMuPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import requests
import tempfile
import os
import re
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from langchain.schema import Document
from urllib.parse import urlparse
import time

class DocumentProcessor:
    def __init__(self):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=5000,
            chunk_overlap=250
        )
        self.embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
        self.ua = UserAgent()
        self.headers = {
            'User-Agent': self.ua.random,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
        }
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def load_html(self, url):
        """Enhanced HTML loader with better error handling"""
        try:
            response = self.session.get(url, timeout=30)
            response.raise_for_status()

            # Check if content-type is PDF
            content_type = response.headers.get('Content-Type', '')
            if 'application/pdf' in content_type:
                return self.load_pdf_from_url(url)

            soup = BeautifulSoup(response.content, 'html.parser')

            # Remove unwanted elements
            for element in soup(['script', 'style', 'nav', 'footer', 'iframe', 'noscript']):
                element.decompose()

            # Try to find main content areas
            article = (soup.find('article') or
                      soup.find('main') or
                      soup.find(class_=re.compile('content|main|body|post')) or
                      soup.find('div', role='main') or
                      soup)

            # Extract all text with structure
            content = self._extract_structured_content(article)
            if not content:
                raise ValueError("No content extracted from HTML")

            return [Document(page_content=content, metadata={'source': url, 'type': 'html'})]

        except Exception as e:
            print(f"Error loading {url}: {str(e)}")
            return []

    def _extract_structured_content(self, element):
        """Extract content while preserving document structure"""
        content = []

        def process_element(elem):
            if isinstance(elem, bs4.NavigableString):
                text = elem.strip()
                if text and len(text) > 10:
                    content.append(text)
                return

            tag = elem.name
            if not tag:
                return

            text = elem.get_text(' ', strip=True)
            if not text or len(text) <= 10:
                return

            # Handle headings
            if tag.startswith('h') and tag[1:].isdigit():
                level = int(tag[1:])
                content.append(f"\n{'#'*level} {text}\n")
            # Handle list items
            elif tag == 'li':
                content.append(f"- {text}")
            # Handle table cells
            elif tag in ['td', 'th']:
                content.append(f"[TABLE CELL] {text}")
            # Handle regular paragraphs
            elif tag == 'p':
                content.append(text)
            # Recursively process containers
            else:
                for child in elem.children:
                    process_element(child)

        process_element(element)
        full_text = '\n'.join(content)
        full_text = re.sub(r'\n{3,}', '\n\n', full_text)
        full_text = re.sub(r'[ \t]{2,}', ' ', full_text)
        return full_text.strip()

    def load_pdf_from_url(self, url):
        """Improved PDF loader with retries and better cleaning"""
        max_retries = 3
        retry_delay = 2

        for attempt in range(max_retries):
            try:
                response = self.session.get(url, timeout=30)
                response.raise_for_status()

                with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_file:
                    tmp_file.write(response.content)
                    tmp_path = tmp_file.name

                loader = PyMuPDFLoader(tmp_path)
                docs = loader.load()

                # Clean up the extracted text
                for doc in docs:
                    doc.page_content = self._clean_pdf_text(doc.page_content)
                    doc.metadata.update({
                        'source': url,
                        'type': 'pdf',
                        'pages': doc.metadata.get('page', '')
                    })

                os.unlink(tmp_path)
                return docs

            except Exception as e:
                print(f"Attempt {attempt + 1} failed for {url}: {str(e)}")
                if attempt < max_retries - 1:
                    time.sleep(retry_delay)
                else:
                    if 'tmp_path' in locals() and os.path.exists(tmp_path):
                        os.unlink(tmp_path)
                    return []

    def _clean_pdf_text(self, text):
        """Clean and normalize PDF text"""
        # Remove page numbers and footers
        text = re.sub(r'Page \d+ of \d+', '', text)
        # Normalize whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        # Remove isolated single characters
        # (note: this also drops real one-letter words such as "a" and "I", and single digits)
        text = re.sub(r'(?<!\w)\w(?!\w)', '', text)
        # Fix hyphenated words
        text = re.sub(r'(\w+)-\s+(\w+)', r'\1\2', text)
        return text

    def process_documents(self, urls):
        """Process documents with better error handling"""
        all_docs = []
        failed_urls = []

        for url in urls:
            print(f"\nProcessing: {url}")
            try:
                if url.lower().endswith('.pdf'):
                    docs = self.load_pdf_from_url(url)
                else:
                    docs = self.load_html(url)

                if docs:
                    all_docs.extend(docs)
                    print(f"Successfully loaded {len(docs)} documents")
                else:
                    failed_urls.append(url)
                    print("Failed to load document")

            except Exception as e:
                failed_urls.append(url)
                print(f"Error processing {url}: {str(e)}")

        if not all_docs:
            raise ValueError("No documents were successfully loaded")

        print(f"\nSummary:")
        print(f"- Successfully loaded: {len(all_docs)} documents")
        print(f"- Failed URLs: {len(failed_urls)}")
        if failed_urls:
            print("Failed URLs:", failed_urls)

        splits = self.text_splitter.split_documents(all_docs)
        print(f"- Total chunks after splitting: {len(splits)}")
        return splits, failed_urls # Return splits and failed_urls

    def create_vector_store(self, splits, persist_dir="chroma_db"):
        """Create and persist Chroma vector store"""
        vectorstore = Chroma.from_documents(
            documents=splits,
            embedding=self.embeddings,
            persist_directory=persist_dir
        )
        print(f"\nVector store created with {vectorstore._collection.count()} chunks")
        return vectorstore

if __name__ == "__main__":
    # Your list of documents
    documents = [
        "https://portlandpress.com/biochemj/article/477/23/4559/227194/Deep-learning-and-generative-methods-in",
        "https://pubs.acs.org/doi/10.1021/acs.jcim.9b00236",
        "https://www.osti.gov/servlets/purl/1427646",
        "https://depth-first.com/articles/2019/02/04/chemception-deep-learning-from-2d-chemical-structure-images/",
        "https://arxiv.org/ftp/arxiv/papers/1802/1802.04903.pdf",
        "http://cucis.eecs.northwestern.edu/publications/pdf/PJA18.pdf",
        "https://www.nature.com/articles/s41467-022-28494-3",
        "https://link.springer.com/article/10.1007/s00521-021-05961-4",
        "https://www.mdpi.com/journal/molecules/special_issues/deep_learning_structure",
        "https://www.sciencedirect.com/science/article/abs/pii/B9780443186387000050",
        "https://www.mdpi.com/1420-3049/25/12/2764",
        "https://jcheminf.biomedcentral.com/articles/10.1186/s13321-020-00435-6",
        "https://www.nature.com/articles/s41598-025-95720-5",
        "https://pmc.ncbi.nlm.nih.gov/articles/PMC11571686/",
        "https://link.springer.com/article/10.1557/s43578-022-00628-9"
    ]

    # Initialize and process
    processor = DocumentProcessor()
    splits, failed_urls = processor.process_documents(documents) # Unpack the tuple here
    vectorstore = processor.create_vector_store(splits) # Pass only splits

    # Get retriever
    retriever = vectorstore.as_retriever()
    print("Vector store and retriever created successfully!")


Processing: https://portlandpress.com/biochemj/article/477/23/4559/227194/Deep-learning-and-generative-methods-in
Successfully loaded 1 documents

Processing: https://pubs.acs.org/doi/10.1021/acs.jcim.9b00236
Error loading https://pubs.acs.org/doi/10.1021/acs.jcim.9b00236: 403 Client Error: Forbidden for url: https://pubs.acs.org/doi/10.1021/acs.jcim.9b00236
Failed to load document

Processing: https://www.osti.gov/servlets/purl/1427646
Error loading https://www.osti.gov/servlets/purl/1427646: 502 Server Error: Proxy Error for url: https://www.osti.gov/servlets/purl/1427646
Failed to load document

Processing: https://depth-first.com/articles/2019/02/04/chemception-deep-learning-from-2d-chemical-structure-images/
Successfully loaded 1 documents

Processing: https://arxiv.org/ftp/arxiv/papers/1802/1802.04903.pdf
Successfully loaded 16 documents

Processing: http://cucis.eecs.northwestern.edu/publications/pdf/PJA18.pdf
Successfully loaded 13 documents

Processing: https://www.nature.com
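
To see what the PDF cleaning step does to extracted text, the regular expressions from `_clean_pdf_text` can be run on a short string outside the class. The sketch below copies those four substitutions into a standalone function; the `drop_single_chars` flag and the sample sentence are additions for illustration only and are not part of the notebook's class.

```python
import re

def clean_pdf_text(text, drop_single_chars=True):
    """Standalone copy of the cleaning steps; the single-character step is made optional."""
    text = re.sub(r'Page \d+ of \d+', '', text)        # strip "Page X of Y" footers
    text = re.sub(r'\s+', ' ', text).strip()           # collapse all whitespace runs
    if drop_single_chars:
        text = re.sub(r'(?<!\w)\w(?!\w)', '', text)    # remove isolated characters
    text = re.sub(r'(\w+)-\s+(\w+)', r'\1\2', text)    # re-join words hyphenated at line breaks
    return text

sample = "Deep learn-\ning is a key method. Page 3 of 10"
print(repr(clean_pdf_text(sample)))
print(repr(clean_pdf_text(sample, drop_single_chars=False)))
```

With the single-character step enabled, the article "a" disappears from the sentence, which is worth keeping in mind when reading retrieved chunks that came from PDFs.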

In [None]:
collection_data = vectorstore._collection.get(include=["embeddings","documents",'metadatas'])
print(collection_data.get('ids', []))
print(collection_data.get('documents', []))
print(collection_data.get('metadatas', []))
print(collection_data.get('embeddings', []))

In [88]:
import bs4
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate

# Prompt
# Create a LANGSMITH_API_KEY in Settings > API Keys
from langsmith import Client
client = Client(api_key=LANGCHAIN_API_KEY)
# prompt_object = client.pull_prompt("chatbot", include_model=True)

# Define the prompt template using input variables
prompt = ChatPromptTemplate.from_template("""You are an assistant for question-answering tasks. Use the following pieces of retrieved context from research papers to answer the question in detail minimum 500 words. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}""")

print(prompt)
# LLM
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0)

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Question
print(rag_chain.invoke("What is SMILES tell everything about it in detail"))

input_variables=['context', 'question'] input_types={} partial_variables={} messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context from research papers to answer the question in detail minimum 500 words. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question}\nContext: {context}"), additional_kwargs={})]
SMILES, which stands for Simplified Molecular-Input Line-Entry System, is a line notation used to represent chemical structures. It encodes the connection table and stereochemistry of a molecule as a line of text using short ASCII strings. SMILES utilizes a grammar structure where alphabets denote atoms, special characters indicate bond types, encapsulated numbers represent rings, and parentheses represent side chains.
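
The chain above sends the question down two paths: one through the retriever and `format_docs` to build the context, and one passed through unchanged. The plain-Python sketch below mirrors that data flow without LangChain. `fake_retriever`, `fake_llm`, `run_chain`, and the shortened template are stand-ins invented for illustration; they are not the real components.

```python
TEMPLATE = "Question: {question}\nContext: {context}"

def format_docs(docs):
    # Same idea as the notebook's format_docs, applied to plain dicts.
    return "\n\n".join(doc["page_content"] for doc in docs)

def fake_retriever(question):
    # Stand-in for the vector store retriever: returns fixed chunks.
    return [
        {"page_content": "SMILES is a line notation for molecules."},
        {"page_content": "Atoms are written as letters."},
    ]

def fake_llm(prompt_text):
    # Stand-in for the chat model: only reports how much text it received.
    return f"(model received {len(prompt_text)} characters)"

def run_chain(question):
    # Step 1: fan the question out into the two prompt inputs.
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    # Step 2: fill the prompt template.
    prompt_text = TEMPLATE.format(**inputs)
    # Step 3: call the model and return its text output.
    return fake_llm(prompt_text), prompt_text

answer, prompt_text = run_chain("What is SMILES?")
print(prompt_text)
print(answer)
```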

## Part 2: Indexing

In [56]:
# Documents
question = "What kinds of pets do I like?"
document = "My favorite pet is a cat."

[Count tokens](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb) considering [~4 char / token](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them)

In [57]:
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

num_tokens_from_string(question, "cl100k_base")

8
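
The "~4 characters per token" rule of thumb linked above can be checked against the exact count. The sketch below applies the heuristic to the same question; `estimate_tokens` is a hypothetical helper, and the result is only a rough estimate next to the 8 tokens that `tiktoken` reported.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    return round(len(text) / chars_per_token)

question = "What kinds of pets do I like?"
print(len(question))              # 29 characters
print(estimate_tokens(question))  # 7, close to the exact count of 8
```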

[Text embedding models](https://python.langchain.com/docs/integrations/text_embedding/openai)

In [73]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embd = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
query_result = embd.embed_query(question)
document_result = embd.embed_query(document)
len(query_result)

768

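[Cosine similarity](https://platform.openai.com/docs/guides/embeddings/frequently-asked-questions) is recommended for comparing embeddings; a value of 1 means the two vectors point in the same direction. The linked guidance is written for OpenAI embeddings, while this notebook applies the same measure to Google embeddings.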

In [59]:
import numpy as np

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

similarity = cosine_similarity(query_result, document_result)
print("Cosine Similarity:", similarity)

Cosine Similarity: 0.8535652119095083
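
Cosine similarity depends only on the direction of the vectors, not on their length. The toy example below, written in plain Python with made-up vectors, shows the three reference cases: same direction, orthogonal, and opposite.

```python
import math

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction, different length
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal
print(cosine_similarity([1, 0], [-1, 0]))       # opposite direction
```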


[Document Loaders](https://python.langchain.com/docs/integrations/document_loaders/)

In [None]:
#### INDEXING ####

# Load blog
import bs4
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
blog_docs = loader.load()

[Splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter)

> This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.
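
The strategy described in the quote can be sketched in a few lines of plain Python. This is a simplified illustration only, not LangChain's implementation: it measures length in characters rather than tokens, ignores `chunk_overlap`, and drops the separators between chunks. The function name `recursive_split` is made up for this sketch.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters,
    preferring coarse separators and falling back to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate           # keep merging small pieces together
            continue
        if current:
            chunks.append(current)        # flush what fits so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

text = "aaaa bbbb\n\ncccc dddd eeee"
print(recursive_split(text, chunk_size=10))
```

The first paragraph fits in one chunk, so it is kept whole; the second is too long and is split on spaces instead.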

In [None]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300,
    chunk_overlap=50)

# Make splits
splits = text_splitter.split_documents(blog_docs)

[Vectorstores](https://python.langchain.com/docs/integrations/vectorstores/)

In [None]:
# Index
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=splits,
                                    embedding=GoogleGenerativeAIEmbeddings(model="models/embedding-001"))

retriever = vectorstore.as_retriever()

## Part 3: Retrieval

In [113]:
# # Index
# from langchain_google_genai import google_vector_store
# from langchain_community.vectorstores import Chroma
# vectorstore = Chroma.from_documents(documents=splits,
#                                     embedding=GoogleGenerativeAIEmbeddings(model="models/embedding-001"))


retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

In [114]:
# `get_relevant_documents` is deprecated in recent LangChain releases; `invoke` is the replacement
docs = retriever.invoke("what is NER?")

In [101]:
len(docs)

4

In [100]:
print(docs)

[Document(metadata={'type': 'html', 'source': 'https://link.springer.com/article/10.1007/s00521-021-05961-4'}, page_content='### 5.1 SMILES reconstruction\n\nForty-six tests (9x5+1 for more details see 4.1 ) were conducted to assess the accuracy of SMILES reconstruction on different portion of training data. The results obtained are detailed in Table 6 .\nTable 6 SMILES reconstruction on different portion of training data\nFull size table\nChanges in accuracy and editing distance for different size of training data sets are presented in Figs. 13 and 14 . As it can be seen from Fig. 13 , the reconstruction accuracy increases from 0.247 \\(\\pm \\) 0.027 for 10% of randomly selected samples to 0.877 \\(\\pm \\) 0.009 for 50% of samples accordingly. From 60% onwards accuracy stays around 0.8 on average with slight fluctuations. This is an expected result. However slight variations in accuracy starting from 60% of samples needs to be addressed. It is a difficult task to identify the exact 
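
Conceptually, a retriever with `k` set returns the `k` stored chunks whose embeddings are closest to the query embedding. The sketch below ranks three made-up chunks by cosine similarity in plain Python. The vectors, chunk texts, and the `top_k` helper are invented for illustration, and a real vector store may use a different distance metric and an approximate index.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, indexed, k):
    """Return the texts of the k entries most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy "index": (chunk text, embedding) pairs with made-up 3-dimensional vectors.
indexed = [
    ("chunk about SMILES", [0.9, 0.1, 0.0]),
    ("chunk about NER", [0.1, 0.9, 0.0]),
    ("chunk about CNNs", [0.0, 0.2, 0.9]),
]
query = [0.2, 0.8, 0.1]
print(top_k(query, indexed, k=2))
```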

## Part 4: Generation



In [115]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import ChatPromptTemplate

# Prompt
template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context from research papers to answer the question in detail minimum 500 words. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt

ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context from research papers to answer the question in detail minimum 500 words. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question}\nContext: {context}\n"), additional_kwargs={})])

In [116]:
# LLM
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0)

In [117]:
# Chain
chain = prompt | llm

In [118]:
# Run
chain.invoke({"context":docs,"question":"What is NER"})

AIMessage(content='Based on the context provided, NER refers to Named Entity Recognition, specifically in the context of the BC5CDR task. The experiment results for ChemProt relation extraction and BC5CDR NER indicate that pre-trained language models are generally the best solutions for these natural language processing tasks. Models like BioBERT (+PubMed) and RoBERTa achieve comparable results with Sci-BERT in these tasks.', additional_kwargs={}, response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'model_name': 'gemini-2.0-flash', 'safety_ratings': []}, id='run--a5769e56-f382-4730-aec3-023040ddb2d5-0', usage_metadata={'input_tokens': 3546, 'output_tokens': 79, 'total_tokens': 3625, 'input_token_details': {'cache_read': 0}})
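
In the call above, `docs` is passed into the prompt as a list of document objects, so the template most likely receives their string representation, metadata included. The sketch below shows the difference between that and joining only the page text. The `Doc` dataclass is a stand-in for LangChain's `Document`, and the two sample texts are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a retrieved document (not LangChain's class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

docs = [
    Doc("NER stands for Named Entity Recognition.", {"source": "a", "type": "html"}),
    Doc("BC5CDR is a benchmark dataset.", {"source": "b", "type": "pdf"}),
]

template = "Question: {question}\nContext: {context}"

# Passing the list directly: str.format falls back to the list's repr.
raw = template.format(question="What is NER", context=docs)

# Joining only the text keeps the prompt shorter and free of metadata.
joined = template.format(
    question="What is NER",
    context="\n\n".join(d.page_content for d in docs),
)

print(raw)
print(joined)
```

Joining the page text first, as the `format_docs` helper does elsewhere in this notebook, usually gives the model a cleaner context and uses fewer input tokens.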

In [None]:
from langchain import hub
prompt_hub_rag = hub.pull("rlm/rag-prompt")

In [None]:
prompt_hub_rag

ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"), additional_kwargs={})])

[RAG chains](https://python.langchain.com/docs/expression_language/get_started#rag-search-example)

In [106]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is Task Decomposition and Self Reflection?")

'I am sorry, but the provided context does not contain information about Task Decomposition and Self Reflection. Therefore, I cannot answer your question.'

# Basic RAG using Pinecone

In [None]:
!pip install langchain_pinecone pinecone

Collecting langchain_pinecone
  Downloading langchain_pinecone-0.2.8-py3-none-any.whl.metadata (5.3 kB)
Collecting pinecone
  Downloading pinecone-7.2.0-py3-none-any.whl.metadata (9.5 kB)
Collecting langchain-tests<1.0.0,>=0.3.7 (from langchain_pinecone)
  Downloading langchain_tests-0.3.20-py3-none-any.whl.metadata (3.3 kB)
Collecting langchain-openai>=0.3.11 (from langchain_pinecone)
  Downloading langchain_openai-0.3.25-py3-none-any.whl.metadata (2.3 kB)
Collecting pinecone-plugin-assistant<2.0.0,>=1.6.0 (from pinecone)
  Downloading pinecone_plugin_assistant-1.7.0-py3-none-any.whl.metadata (28 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Collecting pytest-asyncio<1,>=0.20 (from langchain-tests<1.0.0,>=0.3.7->langchain_pinecone)
  Downloading pytest_asyncio-0.26.0-py3-none-any.whl.metadata (4.0 kB)
Collecting syrupy<5,>=4 (from langchain-tests<1.0.0,>=0.3.7->langchain_pinecone)

In [None]:
#### INDEXING ####

# Load blog
import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_pinecone import PineconeVectorStore
import pinecone
import os

# Initialize Pinecone
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

# Create Pinecone index (if it doesn't exist)
index_name = "langchain-demo"  # Choose a unique index name
dimension = 768  # Dimension of Google's embedding-001 model

# Check if index exists, if not create it
if index_name not in [index.name for index in pc.list_indexes()]:
    pc.create_index(
        name=index_name,
        dimension=dimension,
        metric="cosine",
        spec=ServerlessSpec(cloud='aws', region='us-east-1') # Specify cloud and region
    )

# Load blog (assuming this part is still needed for Pinecone indexing)
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
blog_docs = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300,
    chunk_overlap=50)

# Make splits
splits = text_splitter.split_documents(blog_docs)

# Create vector store
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vectorstore = PineconeVectorStore.from_documents(
    documents=splits,
    embedding=embeddings,
    index_name=index_name
)

# Get retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

print(f"Pinecone vector store '{index_name}' created and retriever initialized successfully!")

Pinecone vector store 'langchain-demo' created and retriever initialized successfully!
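
The index dimension set above (768) has to match the length of the vectors the embedding model produces, which is the length reported for `query_result` earlier in this notebook. A small guard like the one below could catch a mismatch before any upload. `check_dimension` is a hypothetical helper written for illustration, not a Pinecone or LangChain function.

```python
def check_dimension(vector, expected_dim):
    """Raise if an embedding's length does not match the index dimension."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"Embedding has {len(vector)} dimensions, index expects {expected_dim}"
        )
    return True

# Example with a placeholder vector of the expected length
print(check_dimension([0.0] * 768, 768))
```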


In [None]:
import bs4
import fitz  # PyMuPDF
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader, PyMuPDFLoader
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import requests
import tempfile
import os
import re
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from langchain.schema import Document
from urllib.parse import urlparse
import time
from langchain_pinecone import PineconeVectorStore
import pinecone

# Initialize Pinecone
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

# Create Pinecone index (if it doesn't exist)
index_name = "langchain-demo"  # Choose a unique index name
dimension = 768  # Dimension of Google's embedding-001 model
if index_name not in [index.name for index in pc.list_indexes()]:
    pc.create_index(
        name=index_name,
        dimension=dimension,
        metric="cosine",
        spec=ServerlessSpec(cloud='aws', region='us-east-1') # Specify cloud and region
    )

class DocumentProcessor:
    def __init__(self):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=5000,
            chunk_overlap=250
        )
        self.embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
        self.ua = UserAgent()
        self.headers = {
            'User-Agent': self.ua.random,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
        }
        self.session = requests.Session()
        self.session.headers.update(self.headers)

    def load_html(self, url):
        """Enhanced HTML loader with better error handling"""
        try:
            response = self.session.get(url, timeout=30)
            response.raise_for_status()

            # Check if content-type is PDF
            content_type = response.headers.get('Content-Type', '')
            if 'application/pdf' in content_type:
                return self.load_pdf_from_url(url)

            soup = BeautifulSoup(response.content, 'html.parser')

            # Remove unwanted elements
            for element in soup(['script', 'style', 'nav', 'footer', 'iframe', 'noscript']):
                element.decompose()

            # Try to find main content areas
            article = (soup.find('article') or
                      soup.find('main') or
                      soup.find(class_=re.compile('content|main|body|post')) or
                      soup.find('div', role='main') or
                      soup)

            # Extract all text with structure
            content = self._extract_structured_content(article)
            if not content:
                raise ValueError("No content extracted from HTML")

            return [Document(page_content=content, metadata={'source': url, 'type': 'html'})]

        except Exception as e:
            print(f"Error loading {url}: {str(e)}")
            return []

    def _extract_structured_content(self, element):
        """Extract content while preserving document structure"""
        content = []

        def process_element(elem):
            if isinstance(elem, bs4.NavigableString):
                text = elem.strip()
                if text and len(text) > 10:
                    content.append(text)
                return

            tag = elem.name
            if not tag:
                return

            text = elem.get_text(' ', strip=True)
            if not text or len(text) <= 10:
                return

            # Handle headings
            if tag.startswith('h') and tag[1:].isdigit():
                level = int(tag[1:])
                content.append(f"\n{'#'*level} {text}\n")
            # Handle list items
            elif tag == 'li':
                content.append(f"- {text}")
            # Handle table cells
            elif tag in ['td', 'th']:
                content.append(f"[TABLE CELL] {text}")
            # Handle regular paragraphs
            elif tag == 'p':
                content.append(text)
            # Recursively process containers
            else:
                for child in elem.children:
                    process_element(child)

        process_element(element)
        full_text = '\n'.join(content)
        full_text = re.sub(r'\n{3,}', '\n\n', full_text)
        full_text = re.sub(r'[ \t]{2,}', ' ', full_text)
        return full_text.strip()

    def load_pdf_from_url(self, url):
        """Download a PDF and load it page by page, retrying on failure"""
        max_retries = 3
        retry_delay = 2  # seconds to wait between attempts

        for attempt in range(max_retries):
            tmp_path = None
            try:
                response = self.session.get(url, timeout=30)
                response.raise_for_status()

                # PyMuPDFLoader reads from disk, so write the bytes to a temp file
                with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_file:
                    tmp_path = tmp_file.name
                    tmp_file.write(response.content)

                loader = PyMuPDFLoader(tmp_path)
                docs = loader.load()

                # Clean up the extracted text
                for doc in docs:
                    doc.page_content = self._clean_pdf_text(doc.page_content)
                    doc.metadata.update({
                        'source': url,
                        'type': 'pdf',
                        'pages': doc.metadata.get('page', '')
                    })

                return docs

            except Exception as e:
                print(f"Attempt {attempt + 1} failed for {url}: {e}")
                if attempt < max_retries - 1:
                    time.sleep(retry_delay)

            finally:
                # Remove the temp file after every attempt, successful or not
                if tmp_path and os.path.exists(tmp_path):
                    os.unlink(tmp_path)

        # All attempts failed
        return []

    def _clean_pdf_text(self, text):
        """Clean and normalize PDF text"""
        # Remove page numbers and footers
        text = re.sub(r'Page \d+ of \d+', '', text)
        # Normalize whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        # Remove isolated single characters (note: this also drops
        # one-letter words such as "a" and standalone digits)
        text = re.sub(r'(?<!\w)\w(?!\w)', '', text)
        # Fix hyphenated words
        text = re.sub(r'(\w+)-\s+(\w+)', r'\1\2', text)
        return text

    def process_documents(self, urls):
        """Process documents with better error handling"""
        all_docs = []
        failed_urls = []

        for url in urls:
            print(f"\nProcessing: {url}")
            try:
                if url.lower().endswith('.pdf'):
                    docs = self.load_pdf_from_url(url)
                else:
                    docs = self.load_html(url)

                if docs:
                    all_docs.extend(docs)
                    print(f"Successfully loaded {len(docs)} documents")
                else:
                    failed_urls.append(url)
                    print("Failed to load document")

            except Exception as e:
                failed_urls.append(url)
                print(f"Error processing {url}: {str(e)}")

        if not all_docs:
            raise ValueError("No documents were successfully loaded")

        print("\nSummary:")
        print(f"- Successfully loaded: {len(all_docs)} documents")
        print(f"- Failed URLs: {len(failed_urls)}")
        if failed_urls:
            print("Failed URLs:", failed_urls)

        splits = self.text_splitter.split_documents(all_docs)
        print(f"- Total chunks after splitting: {len(splits)}")
        return splits, failed_urls # Return splits and failed_urls

    def create_vector_store(self, splits):
        """Embed the splits and store them in a Pinecone index"""
        # `index_name` is not defined in this method; it is read from a
        # module-level variable set elsewhere in the notebook
        vectorstore = PineconeVectorStore.from_documents(
            documents=splits,
            embedding=self.embeddings,
            index_name=index_name
        )
        print(f"Pinecone vector store '{index_name}' created successfully!")
        return vectorstore

if __name__ == "__main__":
    # Your list of documents
    documents = [
        "https://portlandpress.com/biochemj/article/477/23/4559/227194/Deep-learning-and-generative-methods-in",
        "https://pubs.acs.org/doi/10.1021/acs.jcim.9b00236",
        "https://www.osti.gov/servlets/purl/1427646",
        "https://depth-first.com/articles/2019/02/04/chemception-deep-learning-from-2d-chemical-structure-images/",
        "https://arxiv.org/ftp/arxiv/papers/1802/1802.04903.pdf",
        "http://cucis.eecs.northwestern.edu/publications/pdf/PJA18.pdf",
        "https://www.nature.com/articles/s41467-022-28494-3",
        "https://link.springer.com/article/10.1007/s00521-021-05961-4",
        "https://www.mdpi.com/journal/molecules/special_issues/deep_learning_structure",
        "https://www.sciencedirect.com/science/article/abs/pii/B9780443186387000050",
        "https://www.mdpi.com/1420-3049/25/12/2764",
        "https://jcheminf.biomedcentral.com/articles/10.1186/s13321-020-00435-6",
        "https://www.nature.com/articles/s41598-025-95720-5",
        "https://pmc.ncbi.nlm.nih.gov/articles/PMC11571686/",
        "https://link.springer.com/article/10.1557/s43578-022-00628-9"
    ]

    # Initialize and process
    processor = DocumentProcessor()
    splits, failed_urls = processor.process_documents(documents) # Unpack the tuple here
    vectorstore = processor.create_vector_store(splits) # Pass only splits

    # Get retriever
    retriever = vectorstore.as_retriever()
    print("Vector store and retriever created successfully!")


Processing: https://portlandpress.com/biochemj/article/477/23/4559/227194/Deep-learning-and-generative-methods-in
Successfully loaded 1 documents

Processing: https://pubs.acs.org/doi/10.1021/acs.jcim.9b00236
Error loading https://pubs.acs.org/doi/10.1021/acs.jcim.9b00236: 403 Client Error: Forbidden for url: https://pubs.acs.org/doi/10.1021/acs.jcim.9b00236
Failed to load document

Processing: https://www.osti.gov/servlets/purl/1427646
Successfully loaded 22 documents

Processing: https://depth-first.com/articles/2019/02/04/chemception-deep-learning-from-2d-chemical-structure-images/
Successfully loaded 1 documents

Processing: https://arxiv.org/ftp/arxiv/papers/1802/1802.04903.pdf
Successfully loaded 16 documents

Processing: http://cucis.eecs.northwestern.edu/publications/pdf/PJA18.pdf
Successfully loaded 13 documents

Processing: https://www.nature.com/articles/s41467-022-28494-3
Successfully loaded 1 documents

Processing: https://link.springer.com/article/10.1007/s00521-021-0596

AttributeError: 'PineconeVectorStore' object has no attribute '_collection'
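
The text-cleaning step in `_clean_pdf_text` can be tried on its own, without downloading anything. The snippet below is a minimal sketch that copies the same four regex steps into a standalone function; the function name `clean_pdf_text` and the sample strings are made up for illustration.

```python
import re

def clean_pdf_text(text: str) -> str:
    """Standalone copy of the four cleaning steps in `_clean_pdf_text`."""
    # 1. Drop "Page X of Y" footers
    text = re.sub(r'Page \d+ of \d+', '', text)
    # 2. Collapse every whitespace run (including newlines) to one space
    text = re.sub(r'\s+', ' ', text).strip()
    # 3. Remove single characters that stand alone
    text = re.sub(r'(?<!\w)\w(?!\w)', '', text)
    # 4. Re-join words that were hyphenated across a line break
    text = re.sub(r'(\w+)-\s+(\w+)', r'\1\2', text)
    return text

# Hypothetical sample resembling raw PDF extraction output
sample = "Deep learn-\ning for chem-\nical images  Page 3 of 12"
cleaned = clean_pdf_text(sample)
print(cleaned)  # Deep learning for chemical images

# The single-character rule also removes ordinary one-letter words
print(repr(clean_pdf_text("a CNN")))  # ' CNN'
```

The second call shows a side effect worth keeping in mind: the article "a" is removed, and because whitespace was already normalized in step 2, a leading space is left behind. If that matters for your documents, one option is to run the whitespace step last or to relax the single-character rule.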

In [None]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# `invoke` replaces the deprecated `get_relevant_documents`
docs = retriever.invoke("What is SMILES?")
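
Conceptually, `search_kwargs={"k": 5}` asks the vector store for the five chunks whose embeddings are closest to the query embedding. The toy sketch below illustrates that idea in plain Python, assuming cosine similarity. It is not how Pinecone implements search, and the actual metric depends on how the index was configured. The helper names and the 3-dimensional vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=5):
    """Return the indices of the k most similar chunks, best match first."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions
chunks = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]

nearest = top_k(query, chunks, k=2)
print(nearest)  # [0, 1]
```

A larger `k` gives the LLM more context to draw on but also more loosely related text, so it is usually tuned together with the chunk size.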

In [None]:
chain.invoke({"context": docs, "question": "What is SMILES? Tell me about it in detail."})

AIMessage(content="SMILES, or Simplified Molecular Input Line Entry System, is a prevalent method for representing molecules in deep learning. It uses ASCII character strings to represent a molecule's chemical structure, encoding the connection table and stereochemistry as a line of text. Each element in the periodic table is assigned a corresponding token using its atomic symbol, with bond types inferred or explicitly indicated using non-alphanumeric tokens and brackets for branches or cycles.\n\nSMILES can be considered a chemical language with chemical tokens as words and molecules as sentences, but it can have syntactic and grammar errors, especially with branches and cycles. While SMILES is a non-unique molecular representation, it can be transformed into a unique one through canonicalization algorithms. DeepSMILES and SELFIES are SMILES-like notations developed to address some of the limitations of SMILES, such as grammatical errors and valency constraints.\n\nSMILES notations ar

768