# Natural Language Query Agent (Dataset Preparation)

This project builds a Natural Language Query Agent that uses Large Language Models (LLMs) together with open-source vector indexing and storage frameworks to give concise answers to straightforward questions over a large dataset of lecture notes and a table of LLM architectures. This notebook walks through preparing that dataset for the final pipeline, which answers conversational questions.

> The project draws on the following data sources:

- [Stanford LLMs Lecture Notes](https://stanford-cs324.github.io/winter2022/lectures/)

- [Awesome LLM Milestone Papers](https://github.com/Hannibal046/Awesome-LLM#milestone-papers)

Let's begin by installing the libraries and frameworks this project depends on:

- [transformers](https://pypi.org/project/transformers/)
- [accelerate](https://pypi.org/project/accelerate/)
- [einops](https://pypi.org/project/einops/)
- [langchain](https://pypi.org/project/langchain/)
- [xformers](https://pypi.org/project/xformers/)
- [bitsandbytes](https://pypi.org/project/bitsandbytes/)
- [faiss-gpu](https://pypi.org/project/faiss-gpu/)
- [sentence_transformers](https://pypi.org/project/sentence-transformers/)
- [pypdf](https://pypi.org/project/pypdf/)


In [None]:
!pip install -qU transformers accelerate einops langchain xformers bitsandbytes faiss-gpu sentence_transformers


In [None]:
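import locale
# Override the preferred encoding to UTF-8 (a workaround commonly needed in Colab so that later `!` shell commands do not fail with a locale error).
locale.getpreferredencoding = lambda: "UTF-8"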

In [None]:
!pip install pypdf



In [None]:
from torch import cuda, bfloat16
import transformers
import torch
import pickle
from transformers import StoppingCriteria, StoppingCriteriaList
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import WebBaseLoader
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import ConversationalRetrievalChain

#### Next, we'll build the dataset used in this project, using two of the document loaders provided by LangChain:

- [WebBaseLoader](https://python.langchain.com/docs/integrations/document_loaders/web_base): fetches HTML web pages and extracts their text content into documents that downstream steps can consume.
- [PyPDFLoader](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf): uses the `pypdf` library to extract text from PDF files, returning one document per page.
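
To make "extracting text content from HTML" concrete, here is a minimal standard-library sketch of the idea. It is not how `WebBaseLoader` is implemented; `TextExtractor`, `html_to_document`, the sample HTML, and the `example.com` URL are all hypothetical, and a plain dict stands in for a LangChain document with its `page_content` and `metadata` fields.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.parts.append(text)


def html_to_document(html, source):
    """Return a dict shaped like a document: extracted text plus its source."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "page_content": " ".join(parser.parts),
        "metadata": {"source": source},
    }


# Made-up HTML snippet, used only for illustration.
sample_html = (
    "<html><head><title>Lecture</title><style>p {color: red;}</style></head>"
    "<body><h1>Introduction</h1><p>Language models are <b>useful</b></p>"
    "<script>var x = 1;</script></body></html>"
)

doc = html_to_document(sample_html, "https://example.com/")
print(doc["page_content"])
print(doc["metadata"])
```

The point of the sketch is the output shape: each loaded page becomes a piece of text plus metadata recording where it came from, which is what the later splitting step operates on.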


We will retrieve links to webpages and PDF files from the following .txt documents:

	ema_dataset_web_links.txt
	ema_dataset_pdf_links.txt

In [None]:
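def read_file(file_path):
    """Return the non-empty lines of a text file (one link per line)."""
    lines = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            # Skip blank lines so they are not passed to a loader as links.
            if line:
                lines.append(line)
    return lines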
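
Because the lines of a link file go straight to the loaders, it can help to sanity-check them first. The sketch below assumes each file holds one http(s) URL per line; `clean_links` is a hypothetical helper that is not part of the original notebook, and it would drop local file paths, so it only fits if every entry is a URL.

```python
from urllib.parse import urlparse


def clean_links(lines):
    """Return the unique http(s) URLs found in `lines`, preserving order."""
    seen = set()
    links = []
    for raw in lines:
        link = raw.strip()
        if not link:
            continue  # blank line
        parsed = urlparse(link)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue  # not an absolute http(s) URL
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links


raw_lines = [
    "https://stanford-cs324.github.io/winter2022/lectures/",
    "",
    "  https://stanford-cs324.github.io/winter2022/lectures/  ",
    "not a url",
]
print(clean_links(raw_lines))
```

Applied to the output of `read_file`, a filter like this keeps stray blank lines, duplicates, and malformed entries from reaching `WebBaseLoader` or `PyPDFLoader`.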

In [None]:
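def read_pdf(file_path):
    """Load every PDF listed in the link file and return the combined documents."""
    documents_pdf = []
    pdf_links = read_file(file_path)
    for link in pdf_links:
        # PyPDFLoader returns a list of documents for each PDF.
        loader = PyPDFLoader(link)
        documents_pdf.extend(loader.load())
    return documents_pdf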

In [None]:
file_path = '/content/ema_dataset_web_links.txt'
web_links = read_file(file_path)
loader = WebBaseLoader(web_links)
documents_web_links = loader.load()

In [None]:
file_path = '/content/ema_dataset_pdf_links.txt'
documents_pdf = read_pdf(file_path)

#### Finally, we'll split the loaded documents into chunks and save the resulting custom dataset as a .pkl file, which the rest of our pipeline loads.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits_p0 = text_splitter.split_documents(documents_web_links)
all_splits_p1 = text_splitter.split_documents(documents_pdf)

all_splits = all_splits_p0 + all_splits_p1

with open('/content/ema_dataset.pkl', 'wb') as file:
    pickle.dump(all_splits, file)
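
To illustrate what `chunk_size` and `chunk_overlap` control, and how the saved file can be read back, here is a simplified standard-library sketch. It uses a fixed-width sliding window, which is not what `RecursiveCharacterTextSplitter` does (that splitter prefers to break on separators such as paragraph and line breaks), so real chunk boundaries will differ. `split_with_overlap`, the sample text, and the temporary file name are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-width sliding window over `text`.

    Each chunk starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring chunks share `chunk_overlap` characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=2)
print(chunks)

# Round trip through pickle, mirroring how the dataset is saved and later loaded.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "chunks.pkl"
    with open(path, "wb") as file:
        pickle.dump(chunks, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == chunks)
```

The overlap keeps text that straddles a chunk boundary available in both neighbouring chunks. The same `pickle.load` call, pointed at `/content/ema_dataset.pkl`, would restore `all_splits` in a later notebook, provided `langchain` is importable there.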