<a href="https://colab.research.google.com/github/Rishika70/LLM/blob/main/Finance_db.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quickstart: Querying PDF With Astra and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Pre-requisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), you should get a DB Token with role _Database Administrator_ and copy your Database ID: you will need these connection parameters shortly.

You also need an [OpenAI API Key](https://cassio.org/start_here/#llm-access) for this demo to work.

#### What you will do:

- Setup: import dependencies, provide secrets, create the LangChain vector store;
- Run a question-answering loop that retrieves the most relevant PDF passages and has an LLM construct the answer.
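The second step follows a retrieve-then-generate pattern: the passages returned by the vector search are placed in the prompt together with the question. How `VectorStoreIndexWrapper` assembles its prompt internally is not shown in this notebook, so the following is only a minimal sketch of the idea. The `build_prompt` helper, its template wording, and the sample passages are all hypothetical.

```python
def build_prompt(question, passages):
    """Combine retrieved passages and the user's question into one prompt.

    Hypothetical helper: the template wording is an assumption, not the
    prompt that LangChain uses internally.
    """
    # Number each passage so the individual sources stay distinguishable.
    context = "\n\n".join(
        f"[{i}] {text.strip()}" for i, text in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


# Made-up passages standing in for what the vector store would return.
passages = [
    "ASSETS = LIABILITIES + EQUITY",
    "A balance sheet lists assets first.",
]
prompt = build_prompt("what is balance sheet", passages)
print(prompt)
```

The resulting string is what would be sent to the LLM; restricting the model to the supplied context is a common way to keep answers tied to the document.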

Install the required dependencies:

In [1]:
!pip install -q cassio datasets langchain openai tiktoken


Import the packages you'll need:

In [2]:
# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

In [3]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [4]:
from PyPDF2 import PdfReader

### Setup

#### Provide your secrets:

Replace the following with your Astra DB connection details and your OpenAI API key:

In [15]:
ASTRA_DB_APPLICATION_TOKEN = "AstraCS:..." # enter the "AstraCS:..." string found in your Token JSON file
ASTRA_DB_ID = "..." # enter your Database ID

OPENAI_API_KEY = "sk-..." # enter your OpenAI key
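Hardcoding credentials in a notebook makes them easy to leak when the notebook is shared. One alternative, sketched here under the assumption that the values have been exported as environment variables with the same names, is to read them at runtime. The `read_secret` helper is hypothetical and uses only the standard library.

```python
import os


def read_secret(name, env=None):
    """Return a non-empty setting from the environment, or raise.

    Hypothetical helper; `env` defaults to os.environ and is only
    exposed so the function can be tried with a plain dict.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required setting: {name}")
    return value


# Demo with an explicit mapping; in the notebook you would omit `env`,
# e.g. ASTRA_DB_ID = read_secret("ASTRA_DB_ID").
demo_env = {"ASTRA_DB_ID": " example-id "}
demo_value = read_secret("ASTRA_DB_ID", demo_env)
print(demo_value)
```

Failing early with a clear message is usually easier to debug than a connection error later in the notebook.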

In [16]:
# Provide the path to your PDF file.
pdfreader = PdfReader('/pfi-briefings.pdf')

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [18]:
# Read the text of every page in the PDF, skipping pages with no extractable text
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content
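The loop above appends each page's text directly to the previous one, so the last word of one page can end up fused with the first word of the next. A minimal sketch of one way around this is to join the pages with a separator. Here `join_pages` is a hypothetical helper and `page_texts` stands in for the values returned by `page.extract_text()`.

```python
def join_pages(page_texts, separator="\n"):
    """Join per-page text with a separator, skipping empty pages."""
    # Pages without extractable text may yield None or an empty string.
    return separator.join(text for text in page_texts if text)


# Stand-in values for what each page.extract_text() call might return.
page_texts = ["ends with a word", None, "", "Next page starts"]
joined = join_pages(page_texts)
print(repr(joined))
```

A newline separator also matches the `separator="\n"` setting used when the text is split into chunks later on.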

In [19]:
raw_text



Initialize the connection to your database:

_(do not worry if you see a few warnings, it's just that the drivers are chatty about negotiating protocol versions with the DB.)_
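If the driver messages are distracting, one option is to raise the log threshold before initializing the connection. This sketch uses only Python's standard `logging` module; the logger name `cassandra` is an assumption inferred from the `cassandra.connection` prefix on the driver's log lines. Note that it also hides genuine driver errors, so it is best kept to demos.

```python
import logging

# Logger name inferred from the "cassandra.connection" prefix of the driver's
# log lines (an assumption). Child loggers inherit this level unless they
# set their own.
driver_logger = logging.getLogger("cassandra")

# The protocol negotiation message is logged at ERROR level, so the
# threshold has to be above ERROR to silence it.
driver_logger.setLevel(logging.CRITICAL)

connection_logger = logging.getLogger("cassandra.connection")
print(connection_logger.isEnabledFor(logging.ERROR))
```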

In [20]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

ERROR:cassandra.connection:Closing connection <LibevConnection(140520847770528) <ASTRA_DB_ID>-us-east1.db.astra.datastax.com:29042:05c9a152-b1cf-4ea1-82d7-bda86f884677> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


Create the LangChain embedding and LLM objects for later use:

In [21]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

Create your LangChain vector store ... backed by Astra DB!

In [22]:
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,
    keyspace=None,
)

In [23]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into overlapping chunks so that each chunk stays small
# enough to embed and to fit into the LLM prompt.
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
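To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified, standard-library sketch of greedy chunking with overlap. It is not LangChain's implementation: `chunk_text` is a hypothetical function that only illustrates the idea of packing separator-delimited pieces into chunks and carrying the tail of one chunk into the next.

```python
def chunk_text(text, chunk_size=800, chunk_overlap=200, separator="\n"):
    """Greedily pack separator-delimited pieces into overlapping chunks.

    Simplified illustration only (an assumption, not LangChain's code).
    A single piece longer than chunk_size is kept as one oversized chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    def length(parts):
        return len(separator.join(parts))

    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep trailing pieces as overlap, dropping from the front until
            # the carry-over fits the overlap budget and the new piece fits.
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample_chunks = chunk_text("aaaa\nbbbb\ncccc\ndddd", chunk_size=9, chunk_overlap=4)
print(sample_chunks)
```

In this toy input each chunk shares one line with its neighbor. The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk.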



In [24]:
texts[:50]

['Budapest, 2018.\u2029\t\t\nCORVINUS UNIVERSITY OF BUDAPEST DEPARTMENT OF FINANCE Basics of Finance Authors Gábor Kürthy (Chapter 1, Chapter 2)\nJózsef Varga (Chapter 3)\nTamás Pesuth (Chapter 4)\nÁgnes Vidovics-Dancs  (Chapter 5.1 - 5.3)\nIldikó Gelányi  (Chapter 5.4)\nGéza Sebestyén (Chapter 5.5)\nEszter Boros (Chapter 6)\nGábor Sztanó  (Chapter 7)\nErzsébet Varga (Chapter 8)\nEditor Gábor Kürthy\nReviewers Ágnes Vidovics-Dancs (Chapter 1)\nGyörgy Surányi (Chapter 2)\nGábor Kürthy (Chapters 3, 4, 5, 6, 7, 8)\nBudapest, 2018. ISBN 978-963-503-743-8\u2029\t\t\x002TABLE OF CONTENTS Chapter 1 Technical introduction 4 ...............................................................................Chapter 2 Money and Banking from a Historical and Theoretical Perspective 7 ....2.1 Money in history and theory\t 7',
 '.................................................................................2.2 Production of money and creation of money\t 9\n.............................................

### Load the text chunks into the vector store



In [25]:

astra_vector_store.add_texts(texts[:50])

print("Inserted %i text chunks." % len(texts[:50]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 50 text chunks.


### Run the QA cycle

Run the cell and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:
- _What are the basics of finance?_
- _How many types of finance are there?_


In [26]:

first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


Enter your question (or type 'quit' to exit): what is finance

QUESTION: "what is finance"




ANSWER: "Finance is the study of how individuals, businesses, and organizations manage their money and assets. It involves making decisions about how to allocate resources, raise capital, and invest funds in order to achieve financial goals and maximize profits. Finance also involves analyzing market trends, managing risks, and understanding the impact of economic factors on financial decisions."

FIRST DOCUMENTS BY RELEVANCE:




    [0.9034] "Budapest, 2018. 		
CORVINUS UNIVERSITY OF BUDAPEST DEPARTMENT OF FINANCE Basics of F ..."
    [0.9005] " 3CHAPTER 1 TECHNICAL INTRODUCTION To understand ﬁnance properly, one needs to have  ..."
    [0.8948] "➡Company “Z” pays 2,000 EUR wage to Mrs. M.
➡Mr. Q. repays 100 EUR of debt plus 5 EU ..."
    [0.8921] "The most important economic problem is the relative shortage of money. This occurs w ..."

What's your next question (or type 'quit' to exit): define basics of finance

QUESTION: "define basics of finance"




ANSWER: "Basics of finance refer to the fundamental concepts, principles, and techniques used in the field of finance. This includes understanding accounting and financial statements, money and banking systems, financial markets and instruments, and the balance of payments. A solid understanding of these basics is essential for effectively managing and making decisions related to personal or business finances."

FIRST DOCUMENTS BY RELEVANCE:




    [0.9207] "Budapest, 2018. 		
CORVINUS UNIVERSITY OF BUDAPEST DEPARTMENT OF FINANCE Basics of F ..."
    [0.9143] " 3CHAPTER 1 TECHNICAL INTRODUCTION To understand ﬁnance properly, one needs to have  ..."
    [0.9010] ".................................................................................... ..."
    [0.8972] "Example: the balance sheet of a company Company “ABC” has 57,000 EUR worth of assets ..."

What's your next question (or type 'quit' to exit): what is balance sheet

QUESTION: "what is balance sheet"




ANSWER: "A balance sheet is a financial statement that shows the assets, liabilities, and equity of an economic agent, such as a household, company, or bank. It is always in balance, with assets equaling liabilities plus equity. Assets are listed first, followed by liabilities, and equity is the residual amount. Economic events can change the balance sheet in four ways, and these events are recorded through double entry bookkeeping. On a systemic level, these changes are recorded through quadruple entry bookkeeping."

FIRST DOCUMENTS BY RELEVANCE:




    [0.9340] " 3CHAPTER 1 TECHNICAL INTRODUCTION To understand ﬁnance properly, one needs to have  ..."
    [0.9125] "ASSETS = LIABILITIES + EQUITY
When constructing the balance sheet, assets are listed ..."
    [0.9116] "Example: the balance sheet of a company Company “ABC” has 57,000 EUR worth of assets ..."
    [0.9007] "Assets: 	 Machinery, +5,000 EUR	 Liabilities: Long-term liabilities, +5,000 EUR 	 	  ..."

What's your next question (or type 'quit' to exit): quit
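The "FIRST DOCUMENTS BY RELEVANCE" lists above rank stored chunks by how close their embedding is to the embedding of the question. The exact scoring used by the vector store is not shown in this notebook; the sketch below assumes plain cosine similarity and uses tiny made-up two-dimensional vectors in place of real embeddings. The `cosine_similarity` and `top_k` helpers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as unrelated to everything.
    return dot / (norm_a * norm_b)


def top_k(query_vec, docs, k=4):
    """Return the k (score, text) pairs with the highest similarity."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up vectors standing in for real embeddings.
docs = [
    ("balance sheet", [1.0, 0.0]),
    ("money", [0.0, 1.0]),
    ("assets", [1.0, 1.0]),
]
results = top_k([1.0, 0.0], docs, k=2)
for score, text in results:
    print("[%0.4f] %s" % (score, text))
```

A real store avoids comparing the query against every row by using an approximate nearest-neighbor index, but the ranking idea is the same.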
