# CRA chatbot
This web app is a friendly chatbot that helps personal income tax filers navigate the complex CRA guide and answers their questions about personal tax.
- Tools Stack:-
 - Streamlit:- Front-end UI
 - Pinecone:- Vector database that stores the relevant information in chunks
 - OpenAI embedding model (ADA):- Embedding model for Text2Vec
    - Used through a Retrieval QA chain
 - Hugging Face/OpenAI Davinci:- Model for QA (the QA chain test in this notebook uses `gpt-3.5-turbo`)
 - Tiktoken:- Counts the tokens used

# Workflow
- Load the document through the relevant loader (PyPDFLoader or TextLoader)
- Split the file into chunks using the recursive character text splitter (chunk overlap = 10-20%)
- Initialise the Pinecone DB and create an index for the CRA doc file.
- Store the text chunks as vectors through the Text2Vec ADA embedding model
- Create a prompt template that accepts user input
- Create a ConversationBufferMemory (or a summary/window memory?) to store past conversations
- Create a Retrieval QA chain using the above
- Test the functionality and then create a Streamlit implementation
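
The conversation-memory step in the list above is not built in the cells that follow. As a rough illustration of what a windowed buffer does (keep only the last few exchanges and render them as text for the prompt), here is a minimal pure-Python sketch. `WindowMemory` and its methods are hypothetical names made up for this example; they are not LangChain classes.

```python
from collections import deque


class WindowMemory:
    """Hypothetical helper: remembers only the last `k` question/answer turns."""

    def __init__(self, k=3):
        # A deque with maxlen drops the oldest turn once k is exceeded
        self.turns = deque(maxlen=k)

    def save(self, question, answer):
        self.turns.append((question, answer))

    def as_text(self):
        # Render the remembered turns as plain text for inclusion in a prompt
        return "\n".join(f"Human: {q}\nAI: {a}" for q, a in self.turns)


memory = WindowMemory(k=2)
memory.save("first question", "first answer")
memory.save("second question", "second answer")
memory.save("third question", "third answer")
print(memory.as_text())  # only the second and third turns remain
```

A summary-style memory would replace the dropped turns with a condensed description instead of discarding them, at the cost of an extra model call.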

In [2]:
import os 
import pinecone
import openai
from tqdm.autonotebook import tqdm
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']
pinecone.api_key=os.environ['PINECONE_API_KEY']
#HF_API_KEY=os.environ['HF_API_KEY']
# get openai api key from platform.openai.com
#OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or 'OPENAI_API_KEY'
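
Indexing `os.environ` directly raises a bare `KeyError` when a key is missing from the `.env` file. One possible way to get a clearer message is a small wrapper like the sketch below; `require_env` and `DEMO_API_KEY` are hypothetical names used only for illustration.

```python
import os


def require_env(name):
    """Hypothetical helper: return the variable's value or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file"
        )
    return value


# Demonstration with a dummy variable so the sketch runs without real keys
os.environ["DEMO_API_KEY"] = "dummy-value"
print(require_env("DEMO_API_KEY"))
```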




## Tiktoken to measure tokens

In [3]:
import tiktoken
# encoding used by the text-davinci-003 model
tiktoken.encoding_for_model('text-davinci-003')

<Encoding 'p50k_base'>

In [4]:
tokenizer = tiktoken.get_encoding('p50k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

tiktoken_len("hello I am a chunk of text and using the tiktoken_len function "
             "we can find the length of this chunk of text in tokens")

28

## Load the text file from folder

In [5]:
# Can also use directory loader for multiple text files. For this 1 doc I will use basic TextLoader
# Txt file path = "/Users/akashjoshi/Desktop/Python_Learning/streamlit_folder/CRAdoc.txt"
from langchain.document_loaders import TextLoader
loader = TextLoader("/Users/akashjoshi/Desktop/Python_Learning/streamlit_folder/CRAdoc.txt")
CRAdoc=loader.load()
len(CRAdoc)

1

In [6]:
tiktoken_len(CRAdoc[0].page_content) #40k tokens

40619



## Text splitter using recursive character text splitter

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # measured in tokens via length_function; the right size depends on how coherent the sentences in a paragraph are
    chunk_overlap=100,  # default 200; usually 10-20% of chunk_size, to maintain sentence structure across chunks
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)
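
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one first splits on the separators listed above and then merges pieces); it only shows how neighbouring chunks share their boundary tokens. `split_with_overlap` is a hypothetical helper, and plain words stand in for tiktoken tokens.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Hypothetical helper: fixed-size windows where neighbours share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


words = "a b c d e f g h i j".split()
demo_chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, which is what keeps a sentence that straddles a boundary readable in at least one chunk.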

In [8]:
# split data here
#Signature: text_splitter.split_text(text: 'str') -> 'List[str]'
#Docstring: Split incoming text and return chunks.
chunks = text_splitter.split_text(CRAdoc[0].page_content)
len(chunks)  # 102 chunks of at most 500 tokens each

102
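
As a back-of-the-envelope check on that count, one can estimate how many chunks a 40,619-token document needs if every chunk were completely full and overlapped its neighbour by the full 100 tokens. This is only a rough sketch: the real splitter breaks on separators, so agreement with the actual count is not guaranteed. `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(total_tokens, chunk_size, chunk_overlap):
    """Hypothetical helper: chunk count if every chunk were full and overlapped fully."""
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new tokens contributed by each extra chunk
    return 1 + math.ceil((total_tokens - chunk_size) / step)


print(estimate_chunk_count(total_tokens=40619, chunk_size=500, chunk_overlap=100))
```

With these inputs the estimate is 1 + ceil(40119 / 400) = 102, which happens to agree with the number of chunks produced above.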

In [32]:
#check format of chunks
chunks[5]

"New items are flagged with NEW! throughout this guide. \n \n+++ The CRA's services \n \nSubmit your service feedback online! \n \nYou can submit a complaint, compliment, or suggestion to the CRA using the \nnew Service Feedback RC193 online form. This online form can be used by \nindividuals, businesses, and representatives. To submit your feedback, go to \ncanada.ca/cra-service-feedback. \n \nCOVID-19 benefits and your taxes \n \nAmounts received related to COVID-19 \n \nIf you received federal, provincial, or territorial government COVID-19 \nbenefit payments, such as the Canada Recovery Benefit (CRB), Canada Recovery \nCaregiving Benefit (CRCB), Canada Recovery Sickness Benefit (CRSB), or Canada \nWorker Lockdown Benefit (CWLB), you will receive a T4A slip with instructions \non how to report these amounts on your return. These slips are also available \nin My Account at canada.ca/my-cra-account. \n \nIf your income was tax exempt \n \nIf your CRB, CRCB, CRSB, or CWLB income is eli

## Creating Embeddings using OpenAI embedding model 

In [10]:
from langchain.embeddings.openai import OpenAIEmbeddings
OPENAI_API_KEY=os.getenv("OPENAI_API_KEY")
model_name = 'text-embedding-ada-002'  # embedding model with 1536 dimensions,
# i.e. each vector is represented in a 1536-dimensional space

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

In [33]:
# Test: embed_documents expects a list of texts, so wrap the single chunk in a list
res = embed.embed_documents([chunks[-1]])
len(res), len(res[0])

(1, 1536)

In [34]:
print("The total tokens required for the doc is : ",f"{tiktoken_len(CRAdoc[0].page_content):,d}")

The total tokens required for the doc is :  40,619


## Vector Database

In [13]:
#pip install "pinecone-client[grpc]"
index_name = 'langchain-retrieval-augmentation'

In [14]:
import pinecone
PINECONE_API_KEY=os.getenv("PINECONE_API_KEY")

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment='us-west1-gcp-free'
)

if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=len(res[0])  # 1536 dim of text-embedding-ada-002
    )
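
The `metric='cosine'` setting means vectors are compared by the angle between them rather than by their length. A minimal pure-Python sketch of that measure follows; the two-dimensional vectors are toy values chosen for illustration, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```

Vectors pointing the same way score 1 regardless of magnitude, and orthogonal vectors score 0.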

To check, go to the Indexes section of the Pinecone website; it should show the details of the newly created index.

In [35]:
#verify in python
pinecone.list_indexes()
index1 = pinecone.Index("langchain-retrieval-augmentation")
index1.describe_index_stats()
# vector count is 0 before the upsert; the output below shows 102 because the upsert had already been run once

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 102}},
 'total_vector_count': 102}

In [36]:
# store embeddings in pinecone 
#using Langchain Pinecone Upsert wrapper (Pinecone.from_texts or from_documents)
from langchain.vectorstores import Pinecone
# The statement below has already been run once; it stays commented out to avoid upserting duplicates
#crasearch = Pinecone.from_texts(texts=chunks, embedding=embed, index_name=index_name)

In [37]:
#verify in python
pinecone.list_indexes()
index1 = pinecone.Index("langchain-retrieval-augmentation")
index1.describe_index_stats()
#vector count=102 after upsert

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 102}},
 'total_vector_count': 102}

# Use Pinecone to search for a query



In [38]:
text_field = "text"

index1 = pinecone.Index(index_name)

vectorstore = Pinecone(
    index1, embed.embed_query, text_field
)
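
`text_field` names the metadata key under which each chunk's raw text sits next to its vector, so that a search hit can be turned back into a readable passage. The sketch below shows the general shape of such records (id, vector, metadata), on the assumption that the upsert wrapper stores the text under that key. `toy_embed` is a stand-in for the real embedding model and `build_records` is a hypothetical helper; neither is part of Pinecone or LangChain.

```python
import uuid


def toy_embed(text, dim=8):
    """Stand-in for a real embedding model: counts characters into `dim` buckets."""
    vector = [0.0] * dim
    for ch in text.lower():
        vector[ord(ch) % dim] += 1.0
    return vector


def build_records(texts, text_field="text"):
    """Hypothetical helper: pair each chunk with an id, a vector and its raw text."""
    return [
        (str(uuid.uuid4()), toy_embed(t), {text_field: t})
        for t in texts
    ]


records = build_records(["first chunk", "second chunk"])
for record_id, vector, metadata in records:
    print(record_id, len(vector), metadata["text"])
```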

In [57]:
# Vector search example
query = "What digital channels of communication can i use to communicate with CRA"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content="After you file your return, page 30 \nNotice of assessment, page 30 \n- Express NOA, page 30 \nProcessing time, page 30 \nHow to change a return, page 30 \n- Formal disputes (objections and appeals), page 30 \nCRA Service Feedback Program, page 30 \n- Service complaints, page 30 \n- Reprisal complaints, page 31 \n \nDigital services for individuals, page 31 \nMy Account, page 31 \nMyCRA mobile web app, page 31 \nMyBenefits CRA mobile app, page 31 \n \nRetirement income summary table, page 32 \n \nThe CRA's publications and personalized correspondence are available in \nbraille, large print, e-text, or MP3 for those who have a visual impairment. \nFor more information, go to canada.ca/cra-multiple-formats or call 1-800-959-\n8281. If you are outside Canada and the United States, call 613-940-8495. The \nCRA only accepts collect calls made through telephone operators. After your \ncall is accepted by an automated response, you may hear a beep and notice a \nnormal
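
Conceptually, a similarity search with `k=3` scores every stored vector against the query vector and keeps the best three. A small in-memory sketch of that idea follows. The three-dimensional vectors and chunk labels are made-up toy data, and `top_k` is a hypothetical helper; a real vector database uses approximate indexes rather than scanning every vector.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def top_k(query_vector, store, k=3):
    """Hypothetical helper: `store` maps chunk text to its vector; returns the k best (score, text) pairs."""
    scored = ((cosine(query_vector, vec), text) for text, vec in store.items())
    return heapq.nlargest(k, scored)


store = {
    "chunk about digital services": [0.9, 0.1, 0.0],
    "chunk about tuition": [0.1, 0.9, 0.0],
    "chunk about deadlines": [0.0, 0.2, 0.9],
}
for score, text in top_k([1.0, 0.0, 0.0], store, k=2):
    print(round(score, 3), text)
```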

# Create a prompt template to describe the system and generate output in a specific format
### It will be passed as kwargs to the QA chain

In [63]:
from langchain.prompts import PromptTemplate
prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Respond in a pointwise manner, with each point starting on a new line. Explain so that a 10-year-old can understand.
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
chain_type_kwargs = {"prompt": PROMPT}
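
To show what filling this template amounts to, here is a plain-Python sketch of the "stuff" approach: the retrieved chunks are concatenated into one `{context}` string and the user's question replaces `{question}`. The shortened `TEMPLATE`, the `stuff_prompt` helper and the blank-line separator are assumptions made for this illustration, not the chain's actual internals.

```python
TEMPLATE = """Use the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer:"""


def stuff_prompt(template, retrieved_chunks, question, separator="\n\n"):
    """Hypothetical helper: join the retrieved chunks and substitute both placeholders."""
    context = separator.join(retrieved_chunks)
    return template.format(context=context, question=question)


final_prompt = stuff_prompt(
    TEMPLATE,
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What incentives do I get as a student?",
)
print(final_prompt)
```

Because every retrieved chunk ends up in a single prompt, the number and size of chunks directly drive the prompt token count.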

# Create and test a basic QA chain

In [73]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# chat LLM used to generate the answers
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0,
    verbose=True
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    chain_type_kwargs=chain_type_kwargs,
    retriever=vectorstore.as_retriever()
)

In [74]:
from langchain.callbacks import get_openai_callback
#get query and run
query = input("Enter your query here:")

with get_openai_callback() as cb:
    # Run query
    response=qa.run(query)
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")

Enter your query here:What incentives do i get as a student?
Total Tokens: 2606
Prompt Tokens: 2281
Completion Tokens: 325
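
The figures reported by the callback can be sanity-checked with a little arithmetic: prompt and completion tokens should sum to the total, and the total shows how much of the model's context window a single question consumed. The sketch below assumes a 4,096-token window; that value is an assumption to verify against the documentation for the model in use. `remaining_context` is a hypothetical helper.

```python
def remaining_context(prompt_tokens, completion_tokens, context_window):
    """Hypothetical helper: tokens used by one call and tokens left in the window."""
    used = prompt_tokens + completion_tokens
    return used, context_window - used


# Figures from the run above; context_window=4096 is an assumed value
used, left = remaining_context(prompt_tokens=2281, completion_tokens=325, context_window=4096)
print(used, left)
```

Most of the usage comes from the prompt, that is, from the retrieved chunks stuffed into the context, so lowering the number of retrieved chunks or the chunk size is the main lever for reducing it.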


In [75]:
response

"As a student, you may be eligible for certain incentives or benefits. Here are some of them:\n\n1. Canada Training Credit (CTC): If you meet certain conditions, such as being a resident in Canada and having a Canada training credit limit, you can claim a credit for courses you took in 2022. This credit can help reduce your taxes in the future.\n\n2. Eligible Educator School Supply Tax Credit: If you are a teacher or early childhood educator and you bought teaching supplies for your classroom, you can claim up to $1,000 of eligible supplies expenses. These expenses should be directly related to teaching and not reimbursed by your school.\n\n3. Scholarships, Fellowships, and Bursaries: Certain scholarships, fellowships, and bursaries are not taxable. This means that if you receive these types of financial assistance for your education, you don't have to pay taxes on them.\n\n4. Transfer of Unused Tuition Amount: If you have unused tuition amounts, you can transfer them to a parent or gr

In [72]:
qa.verbose

False