### Information Retrieval - LangChain
- Without OpenAI embeddings
- Without an OpenAI LLM

Two Applications:
- Text Documents
- Multiple PDF Files

In [None]:
!pip install langchain
!pip install huggingface_hub
!pip install sentence_transformers
!pip install pypdf



### Set HUGGINGFACEHUB_API_TOKEN

In [None]:
import os
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "hf_XXXXXXXXXXXXXXXXXXXXXXXXX"
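Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative follows, assuming the token is either already exported in the environment or can be typed interactively. The helper name `get_hf_token` is hypothetical and is not part of LangChain or `huggingface_hub`.

```python
import os
from getpass import getpass


def get_hf_token(env_var="HUGGINGFACEHUB_API_TOKEN", prompt=getpass):
    """Return the token from the environment, prompting once if it is unset.

    `get_hf_token` is an illustrative helper, not a library function.
    """
    token = os.environ.get(env_var)
    if not token:
        # Ask without echoing the value, then store it for later cells.
        token = prompt(f"Enter {env_var}: ")
        os.environ[env_var] = token
    return token
```

Calling `get_hf_token()` once at the top of the notebook keeps the secret out of the saved file while leaving the environment variable set for later cells.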

### Download Text File

In [None]:
import requests

url = "https://raw.githubusercontent.com/hwchase17/langchain/master/docs/modules/state_of_the_union.txt"
res = requests.get(url)
res.raise_for_status()  # fail early if the download did not succeed
with open("state_of_the_union.txt", "w", encoding="utf-8") as f:
    f.write(res.text)

In [None]:
# Load the PDF; load_and_split() returns a list of Document objects
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/Hallucinations.pdf")
documents = loader.load_and_split()

In [None]:
# Alternative document loader: use the text file downloaded above instead of the PDF
#from langchain.document_loaders import TextLoader
#loader = TextLoader('./state_of_the_union.txt')
#documents = loader.load()

In [None]:
documents

[Document(page_content='Challenges in Domain-Speciﬁc Abstractive Summarization and How to\nOvercome Them\nAnum Afzal1, Juraj Vladika1, Daniel Braun2and Florian Matthes1\n1Department of Computer Science, Technical University of Munich, Boltzmannstrasse 3, 85748\nGarching bei Muenchen, Germany\n2Department of High-tech Business and Entrepreneurship, University of Twente, Hallenweg 17,\n \nKeywords: Text Summarization, Natural Language Processing, Efﬁcient Transformers, Model Hallucination, Natural\nLanguage Generation Evaluation, Domain-Adaptation of Language Models.\nAbstract: Large Language Models work quite well with general-purpose data and many tasks in Natural Language\nProcessing. However, they show several limitations when used for a task such as domain-speciﬁc abstractive\ntext summarization. This paper identiﬁes three of those limitations as research problems in the context of\nabstractive text summarization: 1) Quadratic complexity of transformer-based models with respect to t

In [None]:
import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

In [None]:
print(wrap_text_preserve_newlines(str(documents[0])))

page_content='Challenges in Domain-Speciﬁc Abstractive Summarization and How to\nOvercome Them\nAnum Afzal1,
Juraj Vladika1, Daniel Braun2and Florian Matthes1\n1Department of Computer Science, Technical University of
Munich, Boltzmannstrasse 3, 85748\nGarching bei Muenchen, Germany\n2Department of High-tech Business and
Entrepreneurship, University of Twente, Hallenweg 17,\n \nKeywords: Text Summarization, Natural Language
Processing, Efﬁcient Transformers, Model Hallucination, Natural\nLanguage Generation Evaluation, Domain-
Adaptation of Language Models.\nAbstract: Large Language Models work quite well with general-purpose data and
many tasks in Natural Language\nProcessing. However, they show several limitations when used for a task such
as domain-speciﬁc abstractive\ntext summarization. This paper identiﬁes three of those limitations as research
problems in the context of\nabstractive text summarization: 1) Quadratic complexity of transformer-based
models with respect to the\ninput

In [None]:
# Text Splitter
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)
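To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch in plain Python. It is an illustration only: LangChain's `CharacterTextSplitter` first splits on a separator and then merges the pieces, so its chunk boundaries will differ from this. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Simplified illustration, not LangChain's algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is the purpose of the overlap: a sentence cut at a boundary still appears whole in at least one chunk.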

In [None]:
len(docs)

16

In [None]:
docs[0]

Document(page_content='Challenges in Domain-Speciﬁc Abstractive Summarization and How to\nOvercome Them\nAnum Afzal1, Juraj Vladika1, Daniel Braun2and Florian Matthes1\n1Department of Computer Science, Technical University of Munich, Boltzmannstrasse 3, 85748\nGarching bei Muenchen, Germany\n2Department of High-tech Business and Entrepreneurship, University of Twente, Hallenweg 17,\n \nKeywords: Text Summarization, Natural Language Processing, Efﬁcient Transformers, Model Hallucination, Natural\nLanguage Generation Evaluation, Domain-Adaptation of Language Models.\nAbstract: Large Language Models work quite well with general-purpose data and many tasks in Natural Language\nProcessing. However, they show several limitations when used for a task such as domain-speciﬁc abstractive\ntext summarization. This paper identiﬁes three of those limitations as research problems in the context of\nabstractive text summarization: 1) Quadratic complexity of transformer-based models with respect to th

In [None]:
docs[1]

Document(page_content='are long, it creates a need for language models ca-\npable of handling them efﬁciently without over-682\nAfzal, A., Vladika, J., Braun, D. and Matthes, F .\nChallenges in Domain-Speciﬁc Abstractive Summarization and How to Overcome Them.\nDOI: 10.5220/0011744500003393\nInProceedings of the 15th International Conference on Agents and Artiﬁcial Intelligence (ICAART 2023) - Volume 3 , pages 682-689\nISBN: 978-989-758-623-1; ISSN: 2184-433X\nCopyright c⃝2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY -NC-ND 4.0)', metadata={'source': '/content/Hallucinations.pdf', 'page': 0})

### Embeddings

In [None]:
# Embeddings: HuggingFaceEmbeddings runs a sentence-transformers model locally,
# so no OpenAI key is needed
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()

In [None]:
!pip install faiss-cpu



In [None]:
# Vectorstore: https://python.langchain.com/en/latest/modules/indexes/vectorstores.html
from langchain.vectorstores import FAISS

db = FAISS.from_documents(docs, embeddings)

In [None]:
query = "What is hallucination?"
docs = db.similarity_search(query)
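Conceptually, `similarity_search` embeds the query and returns the chunks whose vectors lie closest to it. The sketch below shows that idea with hand-written toy vectors and Euclidean (L2) distance. The vectors, texts, and the helper `top_k` are made up for illustration, the choice of L2 distance is an assumption about the index type, and the `k=4` default is chosen to mirror what `similarity_search` is generally documented to use. A real index such as FAISS uses far more efficient data structures.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=4):
    """Return the k texts whose vectors are nearest to query_vec.

    indexed is a list of (text, vector) pairs. Hypothetical helper.
    """
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_index = [
    ("chunk about hallucination", [0.9, 0.1, 0.0]),
    ("chunk about BigBird", [0.1, 0.9, 0.0]),
    ("chunk about evaluation metrics", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]
print(top_k(toy_query, toy_index, k=2))
# ['chunk about hallucination', 'chunk about BigBird']
```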

In [None]:
print(wrap_text_preserve_newlines(str(docs[0].page_content)))

one of them being abstractive summarization. The
output of models for NLG tasks is notoriously hard
to evaluate because there is usually a trade-off be-
tween the expressiveness of the model and its fac-
tual accuracy (Sai et al., 2022). Metrics to evaluate
generated text can be word-based, character-based, or
embedding-based. Word-based metrics are the most
popular evaluation metrics, owing to their ease of use.
They look at the exact overlap of n-grams (n consec-
utive words) between generated and reference text.
Their main drawback is that they do not take into ac-
count the meaning of the text. Two sentences such
as “Berlin is the capital of Germany ” and “ Berlin is
not the capital of Germany ” have an almost complete
n-gram overlap despite having opposite meanings.
2.2.2 Model Hallucinations
Even though modern transformer models can gener-
ate text that is coherent and grammatically correct,
they are prone to generating content not backed by the
source document. Borrowing the ter

### Create QA Chain

In [None]:
from langchain.chains.question_answering import load_qa_chain
from langchain import HuggingFaceHub


In [None]:
llm = HuggingFaceHub(repo_id="mistralai/Mistral-7B-v0.1", model_kwargs={"temperature": 0.7})

In [None]:
chain = load_qa_chain(llm, chain_type="stuff")
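The "stuff" chain type places all retrieved chunks into a single prompt and sends it to the LLM in one call. A rough sketch of that idea is shown below. The prompt wording and the function `build_stuff_prompt` are illustrative assumptions, not LangChain's actual template.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all chunks into one context block followed by the question.

    Illustrative only; the wording is not LangChain's real prompt.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What is hallucination?",
)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits in the model's context window.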

In [None]:
query = "What is hallucination?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

' Hallucination is a phenomenon\ncharacterized by sensing things that are not present\nin reality'

In [None]:
query = "what is BigBird?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)

' BigBird is an efﬁcient Transformer\nmodel for long sequences\nHall'

### Working with PDF Files

In [None]:
!pip install unstructured
!pip install chromadb
!pip install Cython
!pip install tiktoken
!pip install "unstructured[local-inference]"


In [None]:
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.indexes import VectorstoreIndexCreator

In [None]:
# Mount Google Drive (Colab only) to read PDFs stored there
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

pdf_folder_path = '/content/gdrive/My Drive/data_2/'
os.listdir(pdf_folder_path)

Mounted at /content/gdrive


['2023_GPT4All_Technical_Report.pdf', '2008.10010.pdf']

In [None]:
# Alternatively, point to a local folder of PDFs (this overrides the Drive path above)
pdf_folder_path = '/content/RE'

In [None]:
loaders = [UnstructuredPDFLoader(os.path.join(pdf_folder_path, fn)) for fn in os.listdir(pdf_folder_path)]
loaders

[<langchain.document_loaders.pdf.UnstructuredPDFLoader at 0x7d07509f87f0>,
 <langchain.document_loaders.pdf.UnstructuredPDFLoader at 0x7d07509fa920>]
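`os.listdir` returns every entry in the folder, including non-PDF files and subdirectories, each of which would get its own PDF loader. A small sketch of a stricter listing is shown below. The helper name `list_pdf_paths` is hypothetical.

```python
from pathlib import Path


def list_pdf_paths(folder):
    """Return sorted paths of the .pdf files directly inside folder.

    Illustrative helper; subdirectories and other file types are skipped.
    """
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

The resulting list can be passed to the loader constructor one path at a time, in place of the `os.path.join(pdf_folder_path, fn)` expression.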

In [None]:
index = VectorstoreIndexCreator(
    embedding=HuggingFaceEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=200),
).from_loaders(loaders)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
llm = HuggingFaceHub(repo_id="mistralai/Mistral-7B-v0.1", model_kwargs={"temperature": 0.5})

In [None]:
from langchain.chains import RetrievalQA
chain = RetrievalQA.from_chain_type(llm=llm,
                                    chain_type="stuff",
                                    retriever=index.vectorstore.as_retriever(),
                                    input_key="question")

In [None]:
chain.run('Provide a summary of the paper')

' The paper summarizes the effectiveness of summarization models on a dataset of 100k research'

In [None]:
chain.run('What is Bigbird?')

' BigBird is a 175 billion parameter language model trained on a corpus of '