# CRA chatbot
This web app is a friendly chatbot for personal income tax filers. It works through the complex CRA document and helps answer questions related to personal tax.
- Tools stack:-
 - Streamlit:- Front-end UI
 - Pinecone:- Vector database to store relevant info in chunks
 - OpenAI embedding model (ADA):- Embedding model for Text2Vec
    - Used with the Retrieval QA chain
 - Hugging Face/OpenAI Davinci:- Model for QA (the QA test in this notebook uses gpt-3.5-turbo)
 - Tiktoken:- Counts tokens used

# Workflow
- Load the document through the relevant loader (PyPDFLoader or TextLoader)
- Split the file into chunks using the recursive character text splitter (chunk overlap of 10-20%)
- Initialise the Pinecone DB and create an index for the CRA doc file.
- Store the text chunks as vectors using the Text2Vec ADA embedding model
- Create a prompt template that accepts user input
- Create a ConversationBufferMemory (or a summary or window variant?) to store past conversations
- Create a Retrieval QA chain using the above
- Test the functionality and then create a Streamlit implementation
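
The prompt-template and conversation-memory steps can be pictured with a minimal plain-Python sketch. This is not LangChain's API: `PROMPT_TEMPLATE`, `WindowMemory`, and `build_prompt` are hypothetical names, and the template wording is an assumption. A windowed memory keeps only the last few exchanges and injects them into the prompt together with the retrieved context.

```python
from collections import deque

# Hypothetical template; the wording is an assumption, not taken from the notebook.
PROMPT_TEMPLATE = (
    "You are a helpful assistant for Canadian personal income tax questions.\n"
    "Answer using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Conversation so far:\n{history}\n\n"
    "Question: {question}\nAnswer:"
)


class WindowMemory:
    """Keep only the last `k` (question, answer) exchanges."""

    def __init__(self, k=3):
        self.turns = deque(maxlen=k)  # older turns are dropped automatically

    def add(self, question, answer):
        self.turns.append((question, answer))

    def as_text(self):
        return "\n".join(f"User: {q}\nBot: {a}" for q, a in self.turns)


def build_prompt(context, question, memory):
    """Fill the template with retrieved context, chat history, and the new question."""
    return PROMPT_TEMPLATE.format(
        context=context, history=memory.as_text(), question=question
    )


memory = WindowMemory(k=2)
memory.add("q1", "a1")
memory.add("q2", "a2")
memory.add("q3", "a3")  # pushes the first exchange out of the window
prompt = build_prompt("some retrieved chunk", "q4", memory)
print(prompt)
```

A window keeps the prompt length bounded; a summarising memory would instead compress older turns, at the cost of an extra model call.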

In [10]:
import os 
import pinecone
import openai
from tqdm.autonotebook import tqdm
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']
pinecone.api_key=os.environ['PINECONE_API_KEY']
#HF_API_KEY=os.environ['HF_API_KEY']
# get openai api key from platform.openai.com
#OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or 'OPENAI_API_KEY'


## Tiktoken to measure tokens

In [11]:
import tiktoken
# look up the encoding used by the text-davinci-003 model
tiktoken.encoding_for_model('text-davinci-003')

<Encoding 'p50k_base'>

In [12]:
tokenizer = tiktoken.get_encoding('p50k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

tiktoken_len("hello I am a chunk of text and using the tiktoken_len function "
             "we can find the length of this chunk of text in tokens")

28

## Load the text file from folder

In [16]:
# A directory loader can load multiple text files; for this single doc the basic TextLoader is enough
# Txt file path = "/Users/akashjoshi/Desktop/Python_Learning/streamlit_folder/CRAdoc.txt"
from langchain.document_loaders import TextLoader
loader = TextLoader("/Users/akashjoshi/Desktop/Python_Learning/streamlit_folder/CRAdoc.txt")
CRAdoc=loader.load()
len(CRAdoc)

1

In [17]:
tiktoken_len(CRAdoc[0].page_content) #40k tokens

40619



## Text splitter using recursive character text splitter

In [51]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # default 1000; the right chunk size depends on how coherent or random the sentences in a paragraph are
    chunk_overlap=100,  # default 200; usually 10-20% of chunk_size to maintain the sentence structure
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)
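
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is only a sketch: it works on a list of words rather than tiktoken tokens and ignores the separator hierarchy that `RecursiveCharacterTextSplitter` uses. `split_with_overlap` is a hypothetical helper, not part of LangChain.

```python
def split_with_overlap(words, chunk_size, chunk_overlap):
    """Split a list of words into windows of `chunk_size` that share
    `chunk_overlap` words with the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


words = [f"w{i}" for i in range(12)]
demo_chunks = split_with_overlap(words, chunk_size=5, chunk_overlap=2)
for chunk in demo_chunks:
    print(" ".join(chunk))
```

Each window repeats the last two words of the previous one, which is what keeps a sentence that straddles a boundary readable in at least one chunk.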

In [76]:
# split data here
#Signature: text_splitter.split_text(text: 'str') -> 'List[str]'
#Docstring: Split incoming text and return chunks.
chunks = text_splitter.split_text(CRAdoc[0].page_content)
len(chunks)  # 102 chunks, each at most 500 tokens

102
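
As a rough sanity check on this count, assume every chunk is filled to exactly `chunk_size` tokens and shares exactly `chunk_overlap` tokens with its neighbour. The real splitter breaks on separators, so its chunks are usually shorter and this is only an estimate. `estimate_chunks` is a hypothetical helper.

```python
import math


def estimate_chunks(total_tokens, chunk_size, chunk_overlap):
    """Estimate the number of chunks for an idealised sliding window."""
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap
    # The first chunk covers chunk_size tokens; each later chunk adds `stride` new ones.
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


estimate = estimate_chunks(total_tokens=40619, chunk_size=500, chunk_overlap=100)
print(estimate)  # 102
```

With the token count and splitter settings from this notebook, the idealised estimate lands on the same number of chunks as the actual split.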

In [60]:
#check format of chunks
chunks[5]

"New items are flagged with NEW! throughout this guide. \n \n+++ The CRA's services \n \nSubmit your service feedback online! \n \nYou can submit a complaint, compliment, or suggestion to the CRA using the \nnew Service Feedback RC193 online form. This online form can be used by \nindividuals, businesses, and representatives. To submit your feedback, go to \ncanada.ca/cra-service-feedback. \n \nCOVID-19 benefits and your taxes \n \nAmounts received related to COVID-19 \n \nIf you received federal, provincial, or territorial government COVID-19 \nbenefit payments, such as the Canada Recovery Benefit (CRB), Canada Recovery \nCaregiving Benefit (CRCB), Canada Recovery Sickness Benefit (CRSB), or Canada \nWorker Lockdown Benefit (CWLB), you will receive a T4A slip with instructions \non how to report these amounts on your return. These slips are also available \nin My Account at canada.ca/my-cra-account. \n \nIf your income was tax exempt \n \nIf your CRB, CRCB, CRSB, or CWLB income is eli

## Creating Embeddings using OpenAI embedding model 

In [54]:
from langchain.embeddings.openai import OpenAIEmbeddings
OPENAI_API_KEY=os.getenv("OPENAI_API_KEY")
model_name = 'text-embedding-ada-002'  # embedding model with 1536 dimensions,
# i.e. each vector is represented in a 1536-dimensional space

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

In [55]:
# Test: embed_documents expects a list of texts, so wrap the single chunk in a list
res = embed.embed_documents([chunks[-1]])
len(res), len(res[0])

(1, 1536)

In [56]:
print("The total tokens required for the doc is : ",f"{tiktoken_len(CRAdoc[0].page_content):,d}")

The total tokens required for the doc is :  40,619


In [57]:
#how does an embedding look?
res

[[-0.015442107804119587,
  -0.028933634981513023,
  0.001672895043157041,
  -0.0323200598359108,
  -0.02195759490132332,
  0.003775866236537695,
  -0.019776735454797745,
  -0.03890327736735344,
  -0.029638011008501053,
  0.0036539549473673105,
  0.03995984047651291,
  0.016932135447859764,
  -0.022959977388381958,
  0.008357702754437923,
  0.005760312546044588,
  0.021131305024027824,
  0.005093186628073454,
  -0.010687564499676228,
  0.025425296276807785,
  0.006606919690966606,
  -0.0033915068488568068,
  -0.00111244129948318,
  -0.017731333151459694,
  -0.001688980613835156,
  -0.005746766924858093,
  0.02000701241195202,
  0.0034880200400948524,
  -0.016566401347517967,
  -0.0030105337500572205,
  -0.008174835704267025,
  -0.007409502752125263,
  -0.012231775559484959,
  -0.010342149063944817,
  -0.028771085664629936,
  -0.02915036492049694,
  0.01322738453745842,
  0.007741372566670179,
  -0.006247958168387413,
  0.014534546062350273,
  0.0036675005685538054,
  0.02497828751802444

## Vector Database

In [25]:
#pip install "pinecone-client[grpc]"
index_name = 'langchain-retrieval-augmentation'

In [70]:
import pinecone
PINECONE_API_KEY=os.getenv("PINECONE_API_KEY")

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment='us-west1-gcp-free'
)

if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=len(res[0])  # 1536 dim of text-embedding-ada-002
    )
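
The index is created with `metric='cosine'`, so stored vectors are compared by the cosine of the angle between them rather than by their length. A small standalone illustration of that score follows; the toy 3-dimensional vectors are made up, whereas real ADA embeddings have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different length: similarity close to 1
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors: similarity 0
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 6), orthogonal)
```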

To check, go to the Indexes section of the Pinecone website. It should show the details of the newly created index.

In [86]:
#verify in python
pinecone.list_indexes()
index1 = pinecone.Index("langchain-retrieval-augmentation")
index1.describe_index_stats()
#vector count should be 0 before the upsert; the 102 shown below is most likely from re-running this cell after the upsert

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 102}},
 'total_vector_count': 102}

In [85]:
# store embeddings in Pinecone
# using the LangChain Pinecone upsert wrapper (Pinecone.from_texts or from_documents)
from langchain.vectorstores import Pinecone
crasearch = Pinecone.from_texts(texts=chunks, embedding=embed, index_name=index_name)
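
`Pinecone.from_texts` embeds each chunk and upserts it to the index. As an illustrative sketch of what such an upsert involves, each record pairs an ID and a vector with metadata holding the original text, and records are sent in batches. The `fake_embed` function, the batch size, and the `build_batches` helper are assumptions for illustration; the real wrapper calls the OpenAI embedding model and manages its own IDs and batching.

```python
import uuid


def fake_embed(text):
    """Stand-in for a real embedding model: returns a tiny deterministic vector."""
    return [float(len(text)), float(text.count(" "))]


def build_batches(texts, batch_size=32):
    """Yield lists of (id, vector, metadata) records, `batch_size` at a time."""
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        yield [
            (str(uuid.uuid4()), fake_embed(text), {"text": text})
            for text in batch
        ]


sample_texts = [f"chunk number {i}" for i in range(5)]
batches = list(build_batches(sample_texts, batch_size=2))
print([len(b) for b in batches])  # [2, 2, 1]
```

Keeping the chunk text in the metadata is what allows a matched vector to be turned back into readable content at query time.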

In [87]:
#verify in python
pinecone.list_indexes()
index1 = pinecone.Index("langchain-retrieval-augmentation")
index1.describe_index_stats()
#vector count=102 after upsert

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 102}},
 'total_vector_count': 102}

# Use Pinecone to search with a query



In [89]:
text_field = "text"  # metadata field that holds the original chunk text

index1 = pinecone.Index(index_name)

vectorstore = Pinecone(
    index1, embed.embed_query, text_field
)

In [92]:
# Vector search example
query = "what is the maximum RRSP allowance this year?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content="- certain amounts from a registered retirement income fund (RRIF) from box 22 \nof your T4RIF slips, or the pooled registered pension plan (PRPP) amount from \nbox 194 of your T4A slips, or the specified pension plan (SPP) amount from \nbox 018 of your T4A slips \n \nNote \nIf you rolled over an amount to a registered disability savings plan (RDSP), \nsee line 23200 on page 20 for information about the corresponding deduction. \nFor more information about RDSPs, go to canada.ca/taxes-rdsp or see Guide \nT4040, RRSPs and Other Registered Plans for Retirement, and Guide RC4460, \nRegistered Disability Savings Plan. \n \n- grant amounts paid to you as a result of taking time away from work to cope \nwith the death or disappearance of your child because of an offence or \nprobable offence under the Criminal Code (from box 136 of your T4A slip) \n \n- PRPP income from box 194 of your T4A slips if you were under 65 years of \nage and you did not receive this income up
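
Conceptually, `similarity_search` embeds the query, scores stored vectors against it, and returns the `k` best matches. A brute-force sketch of that idea with made-up 2-dimensional vectors follows; a vector database is designed to avoid this kind of full scan, and `top_k` is a hypothetical helper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vec, store, k=3):
    """Return the k (score, text) pairs most similar to query_vec.

    `store` is a list of (vector, text) pairs.
    """
    scored = [(cosine(query_vec, vec), text) for vec, text in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors and labels, made up for illustration
store = [
    ([1.0, 0.0], "RRSP deduction limit"),
    ([0.9, 0.1], "RRSP contributions"),
    ([0.0, 1.0], "service feedback form"),
    ([0.7, 0.7], "registered plans overview"),
]
results = top_k([1.0, 0.05], store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The nearest vectors are not guaranteed to contain the answer, as the RRSP query above shows, so it is worth inspecting the retrieved chunks before trusting the chain's output.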

# Create a basic QA chain test

In [94]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# chat LLM used to generate the answers
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
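
With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into a single prompt that is sent to the LLM in one call. The sketch below shows the idea in plain Python; the template wording and the `stuff_prompt` helper are assumptions for illustration, not LangChain's actual prompt.

```python
def stuff_prompt(question, docs, separator="\n\n"):
    """Join all retrieved chunks into one context block and build one prompt."""
    context = separator.join(docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for retriever output
retrieved = ["Chunk about My Account.", "Chunk about the MyCRA app."]
stuffed = stuff_prompt("What digital services does the CRA offer?", retrieved)
print(stuffed)
```

Because every chunk goes into one prompt, the combined length of the retrieved chunks plus the question must fit in the model's context window, which is one reason to keep the chunk size and the number of retrieved documents modest.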

In [102]:
#get query and run
query = input("Enter your query here:")
# Run query
qa.run(query)

Enter your query here:what digital services are available for individuals to interact with CRA


'The CRA offers several digital services for individuals to interact with them. These include:\n\n1. My Account: My Account allows you to view and manage your personal income tax and benefit information online. You can register for My Account at canada.ca/my-cra-account.\n\n2. MyCRA mobile web app: The MyCRA mobile web app allows you to access key portions of your tax information. You can access the app at canada.ca/cra-mobile-apps.\n\nUsing My Account or MyCRA, you can:\n\n- View your benefit and credit information\n- View your notice of assessment\n- Change your address, direct deposit information, marital status, and information about children in your care\n- Manage notification preferences and receive email notifications when important changes are made on your account\n- Check your TFSA contribution room and RRSP deduction limit\n- Check the status of your tax return\n- Make a payment to the CRA online with My Payment or a pre-authorized debit\n\nThese digital services are fast, ea