In [1]:
# Data ingestion steps

from langchain_community.document_loaders import TextLoader

loader = TextLoader("speech.txt")
# Load the file into a list of Document objects
text_document = loader.load()
text_document

[Document(metadata={'source': 'speech.txt'}, page_content='Members of the faculty and members of the student body of this great institution of learning; ladies and gentlemen.\nNow there are several things that one could talk about before such a large, concerned, and enlightened audience. There are so many problems facing our nation and our world, that one could just take off anywhere. But today I would like to talk mainly about the race problems since I\'ll have to rush right out and go to New York to talk about Vietnam tomorrow. and I\'ve been talking about it a great deal this week and weeks before that.\nBut I\'d like to use a subject from which to speak this afternoon, the Other America.\nAnd I use this subject because there are literally two Americas. One America is beautiful for situation. And, in a sense, this America is overflowing with the milk of prosperity and the honey of opportunity. This America is the habitat of millions of people who have food and material necessities f

In [2]:
import os
from dotenv import load_dotenv

load_dotenv()

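# Re-export the key only if it was found: assigning None to os.environ raises a TypeError
api_key = os.getenv("OPENAI_API_KEY")
if api_key:
    os.environ["OPENAI_API_KEY"] = api_key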

Python-dotenv could not parse statement starting at line 1
Python-dotenv could not parse statement starting at line 5


In [3]:
# Web based loader

from langchain_community.document_loaders import WebBaseLoader
import bs4

# Load the web page, keeping only the post title, header, and content elements
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2024-04-12-diffusion-video/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-title", "post-content", "post-header")
        )
    ),
)

text_documents = loader.load()


USER_AGENT environment variable not set, consider setting it to identify your requests.
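
The warning above asks for a `USER_AGENT` environment variable. A minimal sketch of one way to set it, assuming the loader reads the variable when `langchain_community.document_loaders` is imported, so this would need to run before that import. The value `rag-notebook/0.1` is a hypothetical identifier, not something from the original notebook.

```python
import os

# Hypothetical identifier; any descriptive string that identifies your requests works.
# setdefault keeps an existing value if USER_AGENT is already set in the environment.
os.environ.setdefault("USER_AGENT", "rag-notebook/0.1")

print(os.environ["USER_AGENT"])
```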


In [4]:
text_documents

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2024-04-12-diffusion-video/'}, page_content='\n\n      Diffusion Models for Video Generation\n    \nDate: April 12, 2024  |  Estimated Reading Time: 20 min  |  Author: Lilian Weng\n\n\nDiffusion models have demonstrated strong results on image synthesis in past years. Now the research community has started working on a harder task—using it for video generation. The task itself is a superset of the image case, since an image is a video of 1 frame, and it is much more challenging because:\n\nIt has extra requirements on temporal consistency across frames in time, which naturally demands more world knowledge to be encoded into the model.\nIn comparison to text or images, it is more difficult to collect large amounts of high-quality, high-dimensional video data, let along text-video pairs.\n\n\n\n🥑 Required Pre-read: Please make sure you have read the previous blog on “What are Diffusion Models?” for image generation before 

In [5]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("language_understanding_paper.pdf")
pdf_docs = loader.load()

In [6]:
pdf_docs

[Document(metadata={'source': 'language_understanding_paper.pdf', 'page': 0}, page_content='Improving Language Understanding\nby Generative Pre-Training\nAlec Radford\nOpenAI\nalec@openai.comKarthik Narasimhan\nOpenAI\nkarthikn@openai.comTim Salimans\nOpenAI\ntim@openai.comIlya Sutskever\nOpenAI\nilyasu@openai.com\nAbstract\nNatural language understanding comprises a wide range of diverse tasks such\nas textual entailment, question answering, semantic similarity assessment, and\ndocument classiﬁcation. Although large unlabeled text corpora are abundant,\nlabeled data for learning these speciﬁc tasks is scarce, making it challenging for\ndiscriminatively trained models to perform adequately. We demonstrate that large\ngains on these tasks can be realized by generative pre-training of a language model\non a diverse corpus of unlabeled text, followed by discriminative ﬁne-tuning on each\nspeciﬁc task. In contrast to previous approaches, we make use of task-aware input\ntransformations dur

In [7]:

# Split the documents into chunks of up to 1000 characters, with a 200-character overlap
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(pdf_docs)
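
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses: that splitter prefers to break on separators such as paragraphs and line breaks, so its chunks are usually shorter than `chunk_size`. The helper `chunk_text` and the sample string are hypothetical names introduced only for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# Synthetic 2500-character sample so the window arithmetic is easy to follow
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The last 200 characters of each chunk reappear at the start of the next one, which keeps sentences that straddle a boundary retrievable from at least one chunk.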

In [17]:
# Vector embedding and vector store

from langchain_community.embeddings import OllamaEmbeddings, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

db = Chroma.from_documents(documents, OllamaEmbeddings())

In [15]:
# Query the Chroma vector database

query = "What are the model specifications?"
result = db.similarity_search(query)
result[0].page_content

'Recent approaches have investigated learning and utilizing more than word-level semantics from\nunlabeled data. Phrase-level or sentence-level embeddings, which can be trained using an unlabeled\ncorpus, have been used to encode text into suitable vector representations for various target tasks [ 28,\n32, 1, 36, 22, 12, 56, 31].\nUnsupervised pre-training Unsupervised pre-training is a special case of semi-supervised learning\nwhere the goal is to ﬁnd a good initialization point instead of modifying the supervised learning\nobjective. Early works explored the use of the technique in image classiﬁcation [ 20,49,63] and\nregression tasks [ 3]. Subsequent research [ 15] demonstrated that pre-training acts as a regularization\nscheme, enabling better generalization in deep neural networks. In recent work, the method has\nbeen used to help train deep neural networks on various tasks like image classiﬁcation [ 69], speech\nrecognition [68], entity disambiguation [17] and machine translation
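
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below illustrates that idea with cosine similarity over hand-written toy vectors. It is an assumption-laden illustration only: the real stores may rank by a different metric (for example Euclidean distance) and use optimized indexes, and `cosine_similarity`, `top_k`, and the vectors are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy 3-dimensional "embeddings"; real embeddings have hundreds or thousands of dimensions
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, doc_vecs, k=2))  # [0, 1]
```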

In [18]:
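# FAISS vector database

from langchain_community.vectorstores import FAISS

# Note: this indexes the first 20 whole pages (pdf_docs), not the chunks in `documents`,
# so each search result is a full page
db = FAISS.from_documents(pdf_docs[:20], OllamaEmbeddings())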

In [19]:
query = "What are the model specifications?"
result = db.similarity_search(query)
result[0].page_content

'Improving Language Understanding\nby Generative Pre-Training\nAlec Radford\nOpenAI\nalec@openai.comKarthik Narasimhan\nOpenAI\nkarthikn@openai.comTim Salimans\nOpenAI\ntim@openai.comIlya Sutskever\nOpenAI\nilyasu@openai.com\nAbstract\nNatural language understanding comprises a wide range of diverse tasks such\nas textual entailment, question answering, semantic similarity assessment, and\ndocument classiﬁcation. Although large unlabeled text corpora are abundant,\nlabeled data for learning these speciﬁc tasks is scarce, making it challenging for\ndiscriminatively trained models to perform adequately. We demonstrate that large\ngains on these tasks can be realized by generative pre-training of a language model\non a diverse corpus of unlabeled text, followed by discriminative ﬁne-tuning on each\nspeciﬁc task. In contrast to previous approaches, we make use of task-aware input\ntransformations during ﬁne-tuning to achieve effective transfer while requiring\nminimal changes to the model 