### RAG Pipeline Creation

- In this step, we build a vector database of research papers relevant to a given user query. To do this, we first retrieve a list of relevant papers from the arXiv API.

- We use the ArxivLoader document loader from LangChain, which abstracts the API interactions and returns the papers as documents ready for further processing.

- We then split these papers into smaller chunks so that retrieval can return focused passages rather than entire papers. For this we use the RecursiveCharacterTextSplitter from LangChain, which splits on paragraph and line breaks before falling back to words and characters, so related text tends to stay in the same chunk.

- Next, we create embeddings for these chunks using a sentence-transformers model from HuggingFace. Finally, we ingest the embedded chunks into a Chroma vector store for querying.

In [1]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import ArxivLoader
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.llms import Replicate

In [2]:
query = "lightweight transformers for textual data"
arxiv_docs = ArxivLoader(query=query, load_max_docs=3).load()
pdf_data = []

In [3]:
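# Create the splitter once and reuse it for every paper.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
)
for doc in arxiv_docs:
    # Each entry of pdf_data is the list of chunks for one paper.
    texts = text_splitter.create_documents([doc.page_content])
    pdf_data.append(texts)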

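To make the splitting step less opaque, here is a toy sketch of the recursive idea: try a coarse separator first and fall back to finer ones only for pieces that are still too long. The function name `recursive_split` is hypothetical, the sketch ignores chunk overlap, and it is not LangChain's implementation.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Toy recursive splitter (illustrative only, no overlap handling)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits: keep merging into the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "para one is here.\n\npara two is a bit longer than the first.\n\nshort"
chunks = recursive_split(sample, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

The short paragraphs survive intact, and only the long one is broken at word boundaries, which is the behavior the recursive strategy aims for.
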
In [4]:
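embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-l6-v2")
# pdf_data holds one list of chunks per paper; flatten it so every paper is indexed, not only the first.
all_chunks = [chunk for chunks in pdf_data for chunk in chunks]
db = Chroma.from_documents(all_chunks, embedding=embeddings)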

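Conceptually, querying the vector store means embedding the question and returning the chunks whose vectors are closest to it. The sketch below illustrates that with hand-written 3-dimensional vectors and cosine similarity. The helper names and the toy data are assumptions for illustration; Chroma's actual index and distance metric are configurable and may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical (text, embedding) pairs standing in for embedded chunks.
toy_store = [
    ("chunk about transformers", [0.9, 0.1, 0.0]),
    ("chunk about weather", [0.0, 0.2, 0.9]),
    ("chunk about attention", [0.8, 0.3, 0.1]),
]
nearest = top_k([1.0, 0.2, 0.0], toy_store, k=2)
print(nearest)
```
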

### Retrieval and Generation

- Once the database for a particular topic has been created, we can use it as a retriever to answer user questions from the retrieved context. LangChain offers several chains for retrieval; the simplest is the RetrievalQA chain, which we use here. We set it up with the from_chain_type() method, passing the model and the retriever. To pass the retrieved documents to the LLM we use the stuff chain type, which inserts all retrieved documents into a single prompt.

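As a rough picture of what "stuffing" means, the sketch below joins every retrieved chunk into one context string and places it in a single prompt together with the question. The function `build_stuff_prompt` and the prompt wording are hypothetical, not the template LangChain uses internally.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks into a single prompt (illustrative)."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for what a retriever would return.
retrieved = [
    "Chunk A: language models are pre-trained with self-supervised learning.",
    "Chunk B: fine-tuning adapts a pre-trained model to a downstream task.",
]
prompt = build_stuff_prompt("How are language models trained?", retrieved)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit within the model's context window.
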
In [5]:
llm = Replicate(
    model="meta/meta-llama-3-70b-instruct",
    model_kwargs={"temperature": 0.0, "top_p": 1, "max_new_tokens": 1000},
)



qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=db.as_retriever()
)

In [7]:
question = "Overview of language models, their applications, and their capabilities?"
answer = qa.invoke({"query": question})
answer

{'query': 'Overview of language models, their applications, and their capabilities?',
 'result': 'Based on the provided context, here is an overview of language models, their applications, and their capabilities:\n\n**Overview of Language Models:**\nLanguage models, such as ChatGPT and GPT-4, are pre-trained using self-supervised learning, which leverages the data itself as supervision. They can capture different levels of language representation, such as words, sentences, or documents. The process of training a language model involves data preparation, pre-training, and fine-tuning.\n\n**Applications:**\nLanguage models have various applications, including:\n\n* Natural Language Processing (NLP) tasks, such as text summarization, question answering, and language translation\n* Multi-player games, such as Poker\n* Visual reasoning and coordination\n* Medium-range global weather forecasting\n* Multi-modal modeling with in-context instruction tuning\n\n**Capabilities:**\nLanguage models 

In [19]:
# Create a Markdown report for the answer['result']



ModuleNotFoundError: No module named 'langchain_community.reporters'
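
The error above indicates that `langchain_community` has no `reporters` module. One possible alternative, sketched here with the standard library only, is to assemble the report as a Markdown string from the `query` and `result` keys of the answer. The function name `build_markdown_report`, the report layout, and the sample dictionary are assumptions for illustration.

```python
def build_markdown_report(answer):
    """Format a RetrievalQA-style result dict as a small Markdown report."""
    return (
        "# Research Report\n\n"
        f"## Question\n\n{answer['query']}\n\n"
        f"## Answer\n\n{answer['result']}\n"
    )


# Placeholder standing in for the `answer` dict returned by the chain.
sample_answer = {
    "query": "Example question?",
    "result": "**Example** answer text.",
}
report = build_markdown_report(sample_answer)
print(report)

# To save the report to disk (file name is arbitrary):
# from pathlib import Path
# Path("report.md").write_text(report, encoding="utf-8")
```

In a notebook, the same string could also be rendered inline with `IPython.display.Markdown`.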