<a href="https://colab.research.google.com/github/Pratik-Behera/PDF_Summarizer-GPT-Langchain-/blob/main/Economin_Overview_2022_2023_GPT%2BLangchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install langchain "unstructured[local-inference]" openai pinecone-client

In [1]:
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

### Load your data

In [12]:
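# Option 1: parse a local copy of the PDF
# loader = UnstructuredPDFLoader("../content/dataset/economic_survey_2022_2023.pdf")
# Option 2: download the Economic Survey PDF and parse it
loader = OnlinePDFLoader("https://www.indiabudget.gov.in/economicsurvey/doc/echapter.pdf")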

In [13]:
data = loader.load()  # returns a list of Document objects

In [14]:
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your document')

You have 1 document(s) in your data
There are 1137104 characters in your document


### Chunk your data up into smaller documents

In [15]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(data)

In [16]:
print(f'Now you have {len(texts)} documents')

Now you have 1181 documents
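
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width chunking. `split_with_overlap` is a hypothetical helper written for this note, not part of LangChain. `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraph breaks and spaces, so its chunks are usually shorter than `chunk_size`. That would be consistent with getting 1181 chunks above rather than the 1138 a strict 1000-character cut of 1,137,104 characters would give.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Hypothetical helper: cut text into fixed-size character windows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts (chunk_size - chunk_overlap) characters after the last one.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
no_overlap = split_with_overlap(sample, chunk_size=4)
with_overlap = split_with_overlap(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With `chunk_overlap=0`, as in this notebook, a sentence that straddles a boundary is cut between two chunks. A small overlap repeats the boundary text in both chunks at the cost of more embeddings.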


### Create embeddings of your documents to get ready for semantic search

In [17]:
from langchain.vectorstores import Chroma, Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone


In [18]:
OPENAI_API_KEY = 'ENTER API KEY HERE'
PINECONE_API_KEY = 'ENTER API KEY HERE'
PINECONE_API_ENV = 'ENTER ENVIRONMENT NAME HERE'  # the environment name, not a key
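
Pasting keys into a cell makes it easy to commit them by accident. One possible alternative, sketched below, is to read them from environment variables. `read_secret` is a hypothetical helper, and the variable names are simply assumed to match the constants above. The demo uses a stand-in dictionary so it runs without real credentials.

```python
import os


def read_secret(name: str, env=None) -> str:
    """Hypothetical helper: fetch a required setting from the environment."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise KeyError(f"Set the {name} environment variable before running this notebook")
    return value


# Stand-in mapping for illustration only; real use would omit the second argument.
demo_env = {"OPENAI_API_KEY": "sk-demo", "PINECONE_API_ENV": " demo-env "}
demo_key = read_secret("OPENAI_API_KEY", demo_env)
demo_pinecone_env = read_secret("PINECONE_API_ENV", demo_env)
print(demo_key, demo_pinecone_env)
```

In the notebook this would read, for example, `OPENAI_API_KEY = read_secret("OPENAI_API_KEY")`.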

In [19]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [20]:
# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_API_ENV  # next to api key in console
)
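# Name of an existing Pinecone index. Its dimension has to match the embedding
# size (1536 for text-embedding-ada-002, assumed here to be the model in use).
index_name = "default"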

In [21]:
docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)
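
Conceptually, the vector store keeps one embedding per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. The sketch below illustrates that idea with three-dimensional toy vectors and cosine similarity. The vectors, texts, and the `top_k` helper are made up for illustration, and cosine is an assumption, since the metric actually used depends on how the Pinecone index was created.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Hypothetical helper: store is a list of (text, vector) pairs."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy embeddings; real ones come from the embedding model and are much longer.
toy_store = [
    ("health spending", [0.9, 0.1, 0.0]),
    ("rail freight", [0.0, 0.2, 0.9]),
    ("hospital capacity", [0.8, 0.3, 0.1]),
]
toy_query = [1.0, 0.2, 0.0]
nearest = top_k(toy_query, toy_store, k=2)
print(nearest)
```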

### Query those docs to get your answer back

In [22]:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

In [46]:
llm = OpenAI(temperature=0.5, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")
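
`chain_type="stuff"` means every retrieved chunk is placed ("stuffed") into a single prompt together with the question, and the model is called once. A rough sketch of that idea follows. `build_stuff_prompt` and its wording are hypothetical and are not LangChain's actual prompt template.

```python
def build_stuff_prompt(chunks, question):
    """Hypothetical helper: place every retrieved chunk into one prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


toy_chunks = ["Chunk one text.", "Chunk two text."]
prompt = build_stuff_prompt(toy_chunks, "What do the chunks say?")
print(prompt)
```

This approach is simple, but the combined prompt has to fit in the model's context window, which is one reason to keep chunks small and retrieve only a few of them.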

In [47]:
query = "What are Social Infrastructure and Employment and how can we improve quality and affordable Health for all and write an essay on it"
docs = docsearch.similarity_search(query, include_metadata=True)

In [48]:
chain.run(input_documents=docs, question=query)

' Social Infrastructure and Employment refer to the infrastructure and employment opportunities needed to improve the quality of life of individuals. This includes access to clean drinking water, sanitation, health care, social security, connectivity, employment prospects, etc. Improving quality and affordable health for all requires increased investment in health infrastructure, increased access to health services, improved quality of health care, and improved working conditions for health workers. An essay on this topic should discuss the importance of investing in social infrastructure and employment opportunities, the need to improve access to health services, the importance of improving the quality of health care, and the need to improve working conditions for health workers.'