<a href="https://colab.research.google.com/github/sungkim11/TaxGPT/blob/main/TaxGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install Python Packages

In [226]:
%%writefile requirements.txt
openai
chromadb
tiktoken

Overwriting requirements.txt


In [227]:
%pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Import Python Packages

In [187]:
import os
import platform

import openai
import tiktoken

import chromadb
chroma_client = chromadb.Client()

print('Python: ', platform.python_version())

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
Python:  3.9.16


Mount Storage - Google Drive

In [188]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Set OpenAI API Key

In [189]:
os.environ["OPENAI_API_KEY"] = "openai api key"

Load ChromaDB

In [190]:
from chromadb.config import Settings
client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="/content/drive/My Drive/Colab Notebooks/chromadb/tax/" # Optional, defaults to .chromadb/ in the current directory
))

Running Chroma using direct local API.


FloatProgress(value=0.0, layout=Layout(width='100%'), style=ProgressStyle(bar_color='black'))

loaded in 59866 embeddings
loaded in 2 collections


In [191]:
embeddings = openai.Embedding()
irc_collection = client.get_collection(name="irc", embedding_function=embeddings)
irb_collection = client.get_collection(name="irb", embedding_function=embeddings)

Define Functions

In [192]:
def get_embedding(text, model="text-embedding-ada-002"):
   text = text.replace("\n", " ")
   return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding']

In [197]:
def break_up_text_to_chunks(text, chunk_size=2000, overlap_size=100):
    encoding = tiktoken.get_encoding("gpt2")

    tokens = encoding.encode(text)
    num_tokens = len(tokens)

    chunks = []
    for i in range(0, num_tokens, chunk_size - overlap_size):
        chunk = tokens[i:i + chunk_size]
        chunks.append(chunk)
    
    return chunks

In [228]:
def askTaxGPT(question):

    #Change question to embeddings.
    irc_question_ids = get_embedding(question)

    #Query IRC collections.
    irc_query_results = irc_collection.query(
        query_embeddings=irc_question_ids,
        n_results=10,
        include=["documents"]
    )

    #Join all items in a list
    irc_documents = irc_query_results["documents"][0]
    irc_query_results_doc = "".join(irc_documents)

    #For a given question, only return a list relevant Internal Revenue Codes that covers this topic.
    prompt_response = []
    encoding = tiktoken.get_encoding("gpt2")
    chunks = break_up_text_to_chunks(irc_query_results_doc)

    for i, chunk in enumerate(chunks):
        prompt_request = question + " Only return a list relevant Internal Revenue Codes that covers this topic.: " + encoding.decode(chunks[i])
        response = openai.Completion.create(
                model="text-davinci-003",
                prompt=prompt_request,
                temperature=0,
                max_tokens=1000,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0
        )        
        prompt_response.append(response["choices"][0]["text"].strip())

    #Consolidate a list relevant Internal Revenue Codes that covers this topic.
    prompt_request = "Consoloidate these a list of Internal Revenue Codes: " + str(prompt_response)

    response = openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt_request,
            temperature=0,
            max_tokens=1000,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
    
    irc_codes = response["choices"][0]["text"].strip()

    #Change question to embeddings.
    irb_question_ids = get_embedding(question + irc_codes)

    #Query IRB collections.
    irb_query_results = irb_collection.query(
        query_embeddings=irb_question_ids,
        n_results=20,
        include=["documents"]
    )

    #Join all items in a list
    irb_documents = query_results["documents"][0]
    irb_query_results_doc = "".join(irb_documents)

    #For a given question, provides answers, referencing I.R.C.
    prompt_response = []
    encoding = tiktoken.get_encoding("gpt2")
    chunks = break_up_text_to_chunks(irb_query_results_doc)

    for i, chunk in enumerate(chunks):
        prompt_request = question + " Cite I.R.C. as references." + encoding.decode(chunks[i])
        response = openai.Completion.create(
                model="text-davinci-003",
                prompt=prompt_request,
                temperature=0,
                max_tokens=1000,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0
        )        
        prompt_response.append(response["choices"][0]["text"].strip())

    prompt_request = question + " Cite I.R.C. as references." + str(prompt_response)

    response = openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt_request,
            temperature=0,
            max_tokens=1000,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0            
        )    
    
    return response["choices"][0]["text"].strip()

Questions and Answers

In [230]:
answer = askTaxGPT("I am USA citizen who is working for a USA-based company, but lived outside of USA for the calendar year. Do I still need to pay income tax?")

DEBUG:Chroma:time to pre process our knn query: 4.291534423828125e-06
DEBUG:Chroma:time to run knn query: 0.0003650188446044922
DEBUG:Chroma:time to pre process our knn query: 3.5762786865234375e-06
DEBUG:Chroma:time to run knn query: 0.0005962848663330078


In [231]:
answer

'In summary, a USA citizen who is working for a USA-based company and living outside of the USA for the calendar year is still required to pay income tax. The Internal Revenue Code (I.R.C.) does not exclude income derived in the U.S. from taxable income. The I.R.C. defines a taxpayer as any person subject to any internal revenue tax and further defines a person as an individual, trust, estate, partnership, association, company or corporation. Taxpayers are required to disclose foreign financial accounts to the Treasury Department and to report the income earned thereon. Taxpayers must report income earned as an individual on a Form 1040 and may not attribute the income to a trust created solely for the purpose of tax-avoidance, or claim deductions related to any expenses purportedly incurred by such a trust. Business expenses, including expenses related to a home-based business, are not deductible unless the expenses relate to a legitimate profit-seeking trade or business. Taxpayers ar

In [232]:
answer = askTaxGPT("I am USA citizen who Lives in Colombia and is working for a USA-based company for the calendar year. I am required to pay income tax to Colombian government as part of my visa. Do I still need to pay income tax to USA?")

DEBUG:Chroma:time to pre process our knn query: 4.0531158447265625e-06
DEBUG:Chroma:time to run knn query: 0.001069784164428711
DEBUG:Chroma:time to pre process our knn query: 4.291534423828125e-06
DEBUG:Chroma:time to run knn query: 0.0006427764892578125


In [233]:
answer

'In conclusion, a USA citizen who lives in Colombia and is working for a USA-based company for the calendar year is required to pay income tax to the Colombian government as part of their visa. However, they are still required to pay income tax to the USA, as there is no exclusion from taxable income under the Internal Revenue Code (IRC). The IRC provides that a dual resident taxpayer who is treated as a nonresident alien for purposes of computing their US tax liability is not considered a US person for the taxable year. Furthermore, the IRC defines withholding as the deduction and withholding of tax at the applicable rate from a payment. Finally, the IRC provides that a civilian spouse who claims tax residence in a US territory under the Military Spouses Residency Relief Act (MSRRA) and is a bona fide resident of the US territory is not required to file a US federal income tax return or pay US federal income tax on income derived from sources within the US territory. For more informat