## RAG pipeline with vector database


In [6]:
## Data Ingestion
from langchain_community.document_loaders import TextLoader
loader = TextLoader("speech.txt")
loaded_doc = loader.load()  


In [7]:
loaded_doc

[Document(metadata={'source': 'speech.txt'}, page_content='I have three visions for India. In 3000 years of our history people from all over the world have come and invaded us, captured our lands, conquered our minds. From Alexander onwards the Greeks, the Turks, the Moguls, the Portuguese, the British, the French, the Dutch, all of them came and looted us, took over what was ours. Yet we have not done this to any other nation. We have not conquered anyone. We have not grabbed their land, their culture and their history and tried to enforce our way of life on them. Why? Because we respect the freedom of others. That is why my FIRST VISION is that of FREEDOM. I believe that India got its first vision of this in 1857, when we started the war of Independence. It is this freedom that we must protect and nurture and build on. If we are not free, no one will respect us.\n\nWe have 10 percent growth rate in most areas. Our poverty levels are falling. Our achievements are being globally recogn

In [None]:
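import os
from dotenv import load_dotenv

load_dotenv()  # read variables from a local .env file
api_key = os.getenv("GOOGLE_API_KEY")
if api_key is None:  # os.environ only accepts strings, so fail early
    raise RuntimeError("GOOGLE_API_KEY is not set")
os.environ["GOOGLE_API_KEY"] = api_key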

In [47]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Parse only the post title, header, and body; skip the rest of the page markup
loader2 = WebBaseLoader(
    web_path="https://lilianweng.github.io/posts/2023-06-23-agent/",
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(class_=("post-title", "post-content", "post-header"))
    ),
)
loaded_doc2 = loader2.load()


In [20]:
loaded_doc2

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistake

In [26]:
from langchain_community.document_loaders import PyPDFLoader
loader3 = PyPDFLoader('Monitoring_Desertification_Using_Machine-Learning_.pdf')
loaded_doc3 = loader3.load()

In [27]:
loaded_doc3

[Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2022-06-02T17:14:19+08:00', 'author': 'Kun Feng, Tao Wang, Shulin Liu, Wenping Kang, Xiang Chen, Zichen Guo and Ying Zhi', 'keywords': 'desertification; CART-DT; RF; CNN; image classification; remote sensing index', 'moddate': '2022-06-02T11:24:56+02:00', 'subject': 'Mu Us Sandy Land is a typical semi-arid vulnerable ecological zone, characterized by vegetation degradation and severe desertification. Effectively identifying desertification changes has been a topical environmental issue in China. However, most previous studies have used a single method or remote sensing index to monitor desertification, and lacked an efficient and high-precision monitoring system. In this study, an optimal monitoring scheme that considers multiple indicators combination and different machine learning methods (Classification and Regression Tree-Decision Tree, CART-DT; Random Forest, RF; Convolutional Neur

In [35]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunk_documents = text_splitter.split_documents(loaded_doc3)
chunk_documents[:5]

[Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2022-06-02T17:14:19+08:00', 'author': 'Kun Feng, Tao Wang, Shulin Liu, Wenping Kang, Xiang Chen, Zichen Guo and Ying Zhi', 'keywords': 'desertification; CART-DT; RF; CNN; image classification; remote sensing index', 'moddate': '2022-06-02T11:24:56+02:00', 'subject': 'Mu Us Sandy Land is a typical semi-arid vulnerable ecological zone, characterized by vegetation degradation and severe desertification. Effectively identifying desertification changes has been a topical environmental issue in China. However, most previous studies have used a single method or remote sensing index to monitor desertification, and lacked an efficient and high-precision monitoring system. In this study, an optimal monitoring scheme that considers multiple indicators combination and different machine learning methods (Classification and Regression Tree-Decision Tree, CART-DT; Random Forest, RF; Convolutional Neur
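
To make the effect of chunk_size and chunk_overlap concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification written for illustration: RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraph and line boundaries, so its chunks will not match this output exactly. The helper name sliding_window_chunks and the toy input are assumptions, not part of LangChain.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # the last window already reached the end
            break
    return chunks


# Toy input: 25 digits, so the overlap is easy to see by eye
sample = "".join(str(i % 10) for i in range(25))
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk)
```

Each printed chunk starts with the last four characters of the previous one. With chunk_size=1000 and chunk_overlap=200, as used above, the same idea means a sentence cut at a chunk boundary is likely to appear whole in at least one of the neighbouring chunks.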

In [None]:
## Vector Embeddings and Vector Store

from langchain_google_genai import GoogleGenerativeAIEmbeddings
# from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import FAISS


embeddings = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001",  
    google_api_key=os.getenv("GOOGLE_API_KEY")  
)

db = FAISS.from_documents(chunk_documents, embeddings)
db


<langchain_community.vectorstores.faiss.FAISS at 0x27c67e40bd0>
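
Conceptually, the store holds one embedding vector per chunk, and a query is answered by finding the stored vectors closest to the query's embedding. The sketch below shows that idea as a brute-force search in plain Python. It assumes Euclidean (L2) distance, which to my knowledge is the default for LangChain's FAISS wrapper; the three-dimensional toy vectors and the function names are made up for illustration, and real embeddings have hundreds of dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, doc_vecs, k=4):
    """Return (index, distance) pairs for the k stored vectors nearest to query_vec."""
    scored = sorted((l2_distance(query_vec, v), i) for i, v in enumerate(doc_vecs))
    return [(i, d) for d, i in scored[:k]]


# Hypothetical embeddings for three chunks
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
toy_query = [1.0, 0.0, 0.0]

nearest = top_k(toy_query, toy_vectors, k=2)
print(nearest)  # chunk 0 first (distance 0.0), then chunk 2
```

FAISS performs the same kind of lookup with optimized index structures instead of a Python loop, which is what makes it practical for large collections.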

In [46]:
query = "As one of the most important environment–economy–society problems of the world, desertification threatens regional ecological security and limits economic development at national levels"

retrieved_result = db.similarity_search(query)
print(retrieved_result[0].page_content)

ecological security and limits economic development at national levels, accompanying de-
clining soil fertility and vegetation degradation [3,4]. The UNCCD 2017 report showed that
some 10–20% of drylands are already degraded, the total area affected by desertiﬁcation is
between 6 and 12 million km2, about 1–6% of the inhabitants of drylands live in desertiﬁed
areas, and one billion people are under threat from further desertiﬁcation [5]. China is also
Remote Sens. 2022, 14, 2663. https://doi.org/10.3390/rs14112663 https://www.mdpi.com/journal/remotesensing
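
The retrieved text carries typical PDF extraction artifacts: a word hyphenated across a line break ("de-" / "clining") and the single-character "ﬁ" ligature in "desertiﬁcation". These can hurt matching against queries typed in plain text. One possible cleanup step, applied to page_content before splitting, is sketched below. The function name clean_pdf_text is an assumption, and the approach is a heuristic: NFKC normalization also rewrites other compatibility characters (for example superscript digits), and the regex will join genuinely hyphenated words that happen to break at a line end.

```python
import re
import unicodedata


def clean_pdf_text(text: str) -> str:
    """Normalize ligatures and rejoin words hyphenated across line breaks."""
    # NFKC maps compatibility characters such as the "ﬁ" ligature to plain "fi"
    text = unicodedata.normalize("NFKC", text)
    # Join "de-\nclining" into "declining" (heuristic, see caveat above)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return text


raw = "accompanying de-\nclining soil fertility; affected by desertiﬁcation"
cleaned = clean_pdf_text(raw)
print(cleaned)
```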
