In [1]:
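from langchain_community.document_loaders import PyPDFLoader

# Load the PDF; PyPDFLoader returns one Document per page, with the
# zero-based page number stored in each Document's metadata.
loader = PyPDFLoader("language_understanding_paper.pdf")
docs = loader.load()
docs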

[Document(metadata={'source': 'language_understanding_paper.pdf', 'page': 0}, page_content='Improving Language Understanding\nby Generative Pre-Training\nAlec Radford\nOpenAI\nalec@openai.comKarthik Narasimhan\nOpenAI\nkarthikn@openai.comTim Salimans\nOpenAI\ntim@openai.comIlya Sutskever\nOpenAI\nilyasu@openai.com\nAbstract\nNatural language understanding comprises a wide range of diverse tasks such\nas textual entailment, question answering, semantic similarity assessment, and\ndocument classiﬁcation. Although large unlabeled text corpora are abundant,\nlabeled data for learning these speciﬁc tasks is scarce, making it challenging for\ndiscriminatively trained models to perform adequately. We demonstrate that large\ngains on these tasks can be realized by generative pre-training of a language model\non a diverse corpus of unlabeled text, followed by discriminative ﬁne-tuning on each\nspeciﬁc task. In contrast to previous approaches, we make use of task-aware input\ntransformations dur

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

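# Split each page into chunks of at most 1000 characters, with up to
# 200 characters of overlap between neighbouring chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
document = text_splitter.split_documents(docs)  # a list of chunk Documents
document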

[Document(metadata={'source': 'language_understanding_paper.pdf', 'page': 0}, page_content='Improving Language Understanding\nby Generative Pre-Training\nAlec Radford\nOpenAI\nalec@openai.comKarthik Narasimhan\nOpenAI\nkarthikn@openai.comTim Salimans\nOpenAI\ntim@openai.comIlya Sutskever\nOpenAI\nilyasu@openai.com\nAbstract\nNatural language understanding comprises a wide range of diverse tasks such\nas textual entailment, question answering, semantic similarity assessment, and\ndocument classiﬁcation. Although large unlabeled text corpora are abundant,\nlabeled data for learning these speciﬁc tasks is scarce, making it challenging for\ndiscriminatively trained models to perform adequately. We demonstrate that large\ngains on these tasks can be realized by generative pre-training of a language model\non a diverse corpus of unlabeled text, followed by discriminative ﬁne-tuning on each\nspeciﬁc task. In contrast to previous approaches, we make use of task-aware input\ntransformations dur
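
The role of chunk_size and chunk_overlap can be illustrated with a deliberately simplified sketch. It slides a fixed-size character window over the text and is not the recursive, separator-aware algorithm that RecursiveCharacterTextSplitter uses; the function name chunk_with_overlap and the sample string are hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters.

    Simplified illustration only: the real splitter prefers to break on
    separators such as paragraphs and lines, which this sketch ignores.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
```

Each piece repeats the last four characters of the previous one, which is why a sentence cut at a chunk boundary can still appear whole in one of the two neighbouring chunks.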

In [4]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

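# Index only the first 30 chunks; text beyond them cannot be retrieved.
# OllamaEmbeddings() needs a running Ollama server and uses its default model.
db = FAISS.from_documents(document[:30], OllamaEmbeddings())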

In [5]:
db

<langchain_community.vectorstores.faiss.FAISS at 0x7ce70810d280>

In [6]:
query = "What are the model specifications of the paper Improving Language Understanding by Generative Pre-Training"
# similarity_search returns the closest chunks (k=4 by default), best match first.
result = db.similarity_search(query)
result[0].page_content

'gain weights. For the activation function, we used the Gaussian Error Linear Unit (GELU) [ 18]. We\nused learned position embeddings instead of the sinusoidal version proposed in the original work.\nWe use the ftfylibrary2to clean the raw text in BooksCorpus, standardize some punctuation and\nwhitespace, and use the spaCy tokenizer.3\nFine-tuning details Unless speciﬁed, we reuse the hyperparameter settings from unsupervised\npre-training. We add dropout to the classiﬁer with a rate of 0.1. For most tasks, we use a learning rate\nof 6.25e-5 and a batchsize of 32. Our model ﬁnetunes quickly and 3 epochs of training was sufﬁcient\nfor most cases. We use a linear learning rate decay schedule with warmup over 0.2% of training. λ\nwas set to 0.5.\n4.2 Supervised ﬁne-tuning\nWe perform experiments on a variety of supervised tasks including natural language inference,\nquestion answering, semantic similarity, and text classiﬁcation. Some of these tasks are available'
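
What the similarity search does can be sketched in plain Python: embed the query, measure its distance to every stored vector, and return the closest chunks. This is a toy under stated assumptions. The three-dimensional vectors and chunk labels are made up, the helper names l2_distance and top_k are hypothetical, and ranking by Euclidean (L2) distance is assumed here because it appears to be the default for this FAISS wrapper.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=4):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[0])[:k]


# Made-up embeddings standing in for real chunk vectors.
toy_index = [
    ("fine-tuning details", [0.9, 0.1, 0.0]),
    ("related work", [0.0, 1.0, 0.0]),
    ("model specifications", [1.0, 0.0, 0.1]),
]
toy_query = [1.0, 0.0, 0.0]

nearest = top_k(toy_query, toy_index, k=2)
for distance, text in nearest:
    print(f"{distance:.3f}  {text}")
```

A real index avoids this linear scan where it can, but the ranking idea is the same: smaller distance means a closer match, which is also why the top hit above is only the nearest chunk and not necessarily the one that best answers the question.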

In [7]:
from langchain_community.llms import Ollama

# Requires the llama2 model to be available locally (e.g. via `ollama pull llama2`).
llm = Ollama(model="llama2")
llm

Ollama()

In [8]:
from langchain_core.prompts import ChatPromptTemplate

# {context} is filled with the retrieved chunks and {input} with the user's question.
prompt = ChatPromptTemplate.from_template(
    """Answer the following question based only on the context provided.
Think step by step and provide a detailed answer.
<context>
{context}
</context>
Question: {input}"""
)

In [10]:
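# Chains: build a "stuff" documents chain, which places all supplied
# documents into the prompt's {context} variable before calling the LLM.

from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(llm, prompt)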
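
The "stuff" strategy can be sketched without LangChain: join the text of every document and substitute the result into the prompt. This is a minimal illustration, not the library's implementation. The helper name stuff_documents and the sample chunks are hypothetical, and the blank-line separator is assumed to match the library default.

```python
def stuff_documents(page_contents, question, separator="\n\n"):
    """Build one prompt string by concatenating all chunks into the context."""
    context = separator.join(page_contents)
    return (
        "Answer the following questions only based on the context provided.\n"
        f"<context>\n{context}\n</context>\n"
        f"Question: {question}"
    )


toy_prompt = stuff_documents(
    ["chunk one", "chunk two"],
    "What is in the chunks?",
)
print(toy_prompt)
```

Because every chunk lands in a single prompt, the combined length of the retrieved chunks must fit in the model's context window; that is the main constraint to keep in mind when choosing chunk_size and the number of retrieved chunks.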

In [9]:
# Retriever

retriever = db.as_retriever()
retriever

VectorStoreRetriever(tags=['FAISS', 'OllamaEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x7ce70810d280>)

In [11]:
# Retrieval chain: fetch the relevant chunks with the retriever,
# then pass them to the document chain as {context}.
from langchain.chains import create_retrieval_chain
retrieval_chain = create_retrieval_chain(retriever, document_chain)

In [13]:
response = retrieval_chain.invoke({"input": "What are the model specifications of the paper Improving Language Understanding by Generative Pre-Training"})

In [14]:
response['answer']

'\nBased on the context provided, here are the model specifications for the fine-tuned Transformer LM in the paper "Improving Language Understanding by Generative Pre-Training":\n\n1. Model architecture: The authors use a Transformer [62] architecture with 8 attention heads and 3072 dimensional inner states.\n2. Hyperparameters: The learning rate is set to 6.25e-5, and the batch size is set to 32. The warmup period over 0.2% of training data is also used. The value of λ is set to 0.5.\n3. Optimization: Adam optimization scheme [27] with a max learning rate of 2.5e-4 is used. The learning rate is increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule.\n4. Training data: The model is trained on a large corpus of unlabeled text and several datasets with manually annotated training examples (target tasks).\n5. Tokenization: The authors use the spaCy tokenizer to tokenize the text data.\n6. Attention dropout: Residual, embedding, and attention dro
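
Besides 'answer', the dictionary returned by the retrieval chain also carries the original 'input' and the retrieved chunks under 'context', so the pages an answer drew on can be checked. The sketch below uses a stand-in response to stay self-contained: the FakeDocument class, the cited_pages helper, the page numbers, and the placeholder texts are all hypothetical, not values from the run above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def cited_pages(response):
    """Return the sorted, de-duplicated page numbers of the retrieved chunks."""
    return sorted(
        {doc.metadata["page"] for doc in response["context"] if "page" in doc.metadata}
    )


# Made-up response with the same top-level keys as the retrieval chain output.
fake_response = {
    "input": "placeholder question",
    "context": [
        FakeDocument("placeholder chunk", {"source": "language_understanding_paper.pdf", "page": 4}),
        FakeDocument("placeholder chunk", {"source": "language_understanding_paper.pdf", "page": 3}),
        FakeDocument("placeholder chunk", {"source": "language_understanding_paper.pdf", "page": 4}),
    ],
    "answer": "placeholder answer",
}

pages = cited_pages(fake_response)
print(pages)
```

With the real chain, cited_pages(response) should work the same way, since PyPDFLoader stores a zero-based 'page' entry in each chunk's metadata.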