In [1]:
!pip install langchain rank_bm25 pypdf chromadb
!pip install "unstructured[pdf]"
!pip install openai tiktoken langchain_groq

In [2]:
import os

os.environ["OPENAI_API_KEY"] = "enter your OPENAI_API_KEY"

### Data

In [3]:
from langchain_community.document_loaders import UnstructuredPDFLoader

file_path = "/kaggle/input/ragdata/Orca_paper.pdf"
loader = UnstructuredPDFLoader(file_path)

# load() parses the PDF and returns a list of Document objects
docs = loader.load()

In [4]:
print(docs[0].page_content)

3 2 0 2

n u J

5

] L C . s c [

1 v 7 0 7 2 0 . 6 0 3 2 : v i X r a

Orca: Progressive Learning from Complex

Explanation Traces of GPT-4

Subhabrata Mukherjee∗†, Arindam Mitra∗

Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, Ahmed Awadallah

Microsoft Research

Abstract

Recent research has focused on enhancing the capability of smaller models through imitation learning, drawing on the outputs generated by large foundation models (LFMs). A number of issues impact the quality of these models, ranging from limited imitation signals from shallow LFM outputs; small scale homogeneous training data; and most notably a lack of rigorous evaluation resulting in overestimating the small model’s capability as they tend to learn to imitate the style, but not the reasoning process of LFMs. To address these challenges, we develop Orca, a 13-billion parameter model that learns to imitate the reasoning process of LFMs. Orca learns from rich signals from GPT-4 including explanation traces; step-by-st

### Split data

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 100
)

chunks = splitter.split_documents(docs)
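
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width character windows. It is a simplification and not the actual `RecursiveCharacterTextSplitter` algorithm, which prefers to break on separators such as blank lines and can therefore produce chunks shorter than `chunk_size`. The helper name `sliding_window_chunks` and the demo string are made up for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters.

    Simplified illustration only: separators are ignored entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


demo_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(demo_text, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
```

Each window repeats the last three characters of the previous one, which is the purpose of the overlap: a sentence cut at a chunk boundary still appears intact in at least one chunk.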

In [7]:
chunks[0].page_content

'3 2 0 2\n\nn u J\n\n5\n\n] L C . s c [\n\n1 v 7 0 7 2 0 . 6 0 3 2 : v i X r a\n\nOrca: Progressive Learning from Complex\n\nExplanation Traces of GPT-4\n\nSubhabrata Mukherjee∗†, Arindam Mitra∗\n\nGanesh Jawahar, Sahaj Agarwal, Hamid Palangi, Ahmed Awadallah\n\nMicrosoft Research\n\nAbstract'

### Embeddings and Indexing

In [13]:
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_groq import ChatGroq

embeddings = OpenAIEmbeddings()

llm = ChatGroq(
    groq_api_key = "enter your GROQ_API_KEY",
    model_name = 'llama3-70b-8192'
)

vectordb = Chroma.from_documents(
    chunks,
    embeddings
)

In [14]:
vectordb_retriever = vectordb.as_retriever(
    search_kwargs = {"k": 3}
)

In [15]:
# BM25 Retriever
from langchain.retrievers import BM25Retriever, EnsembleRetriever

keyword_retriever = BM25Retriever.from_documents(
    chunks
)
keyword_retriever.k = 3

### Ensemble Retriever

In [16]:
ensemble_retriever = EnsembleRetriever(
    retrievers = [vectordb_retriever, keyword_retriever],
    weights = [0.7, 0.3] # relative weight of each retriever (dense, keyword); chosen here to sum to 1
)
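
The ensemble combines the two result lists by rank rather than by raw score, using weighted Reciprocal Rank Fusion. The sketch below shows the idea on plain lists of ids. It is an illustration, not LangChain's code: the function name `weighted_rrf`, the chunk ids, and the constant `c=60` (the value commonly used for this technique) are assumptions made for the example.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion.

    Each list contributes weight / (rank + c) to every id it contains,
    where rank starts at 1 for the best hit.
    """
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical top-3 results from the dense and the keyword retriever
dense_hits = ["chunk_12", "chunk_4", "chunk_31"]
keyword_hits = ["chunk_4", "chunk_57", "chunk_12"]

fused = weighted_rrf([dense_hits, keyword_hits], weights=[0.7, 0.3])
print(fused)
```

Ids found by both retrievers accumulate score from both lists, so they tend to rise to the top. Because the fused list is the union of the inputs, two retrievers with `k = 3` each can return up to six distinct chunks.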

### Prompt template

In [17]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template(
    """
    Answer the following question based on the retrieved context.
    Think step by step.
    You will get a reward if your answer is helpful.
    <context>
    {context}
    </context>
    
    Question: {query}
    """
)
output_parser = StrOutputParser()

### Chain 

In [18]:
chain = (
    {"context": ensemble_retriever, "query": RunnablePassthrough()}
    | prompt
    | llm
    | output_parser
)
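
In this chain the retriever's output, a list of `Document` objects, is placed into `{context}` directly, so the prompt receives the string form of that list, metadata included. A common refinement is to join only the text of each chunk before it reaches the prompt. The sketch below shows such a helper on a stand-in class so that it runs without LangChain. `FakeDocument`, `format_docs`, and the sample strings are hypothetical and not part of the notebook.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the two attributes used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content.strip() for doc in docs)


# Toy retrieval result; real chunks would come from the ensemble retriever
retrieved = [
    FakeDocument("Orca is a 13-billion parameter model.", {"source": "Orca_paper.pdf"}),
    FakeDocument("  It learns from explanation traces of GPT-4.  "),
]

context = format_docs(retrieved)
print(context)
```

Assuming the usual LCEL behaviour of coercing a plain function when it is piped after a runnable, the helper could be wired in as `{"context": ensemble_retriever | format_docs, "query": RunnablePassthrough()}`.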

In [19]:
print(chain.invoke("What is instruction tuning?"))

According to the context, instruction tuning is a technique that allows pre-trained language models to learn from input (natural language descriptions of the task) and response pairs. It has been applied to both language-only and multimodal tasks, and has been shown to improve the zero-shot and few-shot performance of models such as FLAN and InstructGPT on various benchmarks.


In [20]:
print(chain.invoke("How does Orca compare to ChatGPT?"))

Based on the provided context, here's a step-by-step analysis of how Orca compares to ChatGPT:

1. **Table Understanding**: ChatGPT has better table understanding and reasoning capabilities than Orca, with Orca lagging behind by 7.4% in the penguins in a table task.

2. **Multilingual Understanding**: Orca performs on par with GPT-4, exceeding ChatGPT by 4.7%. Both Orca and ChatGPT achieve parity on the salient translation error detection task.

3. **Logical and Geometric Reasoning**: ChatGPT shows superior logical reasoning capabilities compared to Orca, outperforming Orca by at least 9% in the Boolean expressions and Web of Lies tasks. However, Orca performs better than ChatGPT in the logical deduction task for five objects, but ChatGPT excels in the three and seven objects tasks, outperforming Orca by at least 4.9%. ChatGPT also has better geometric reasoning capabilities, outperforming Orca by 23% in the geometric shape task.

In summary, Orca has strengths in multilingual understa