<a href="https://colab.research.google.com/github/sungkim11/TaxGPT/blob/main/TaxGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install Python Packages

In [1]:
%%writefile requirements.txt
openai
chromadb
tiktoken

Writing requirements.txt


In [2]:
%pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.27.2-py3-none-any.whl (70 kB)
Collecting chromadb
  Downloading chromadb-0.3.11-py3-none-any.whl (41 kB)
Collecting tiktoken
  Downloading tiktoken-0.3.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting duckdb>=0.5.1
  Downloadin

Import Python Packages

In [3]:
import os
import platform

import openai
import tiktoken

import chromadb
chroma_client = chromadb.Client()

print('Python: ', platform.python_version())

DEBUG:Chroma:Logger created


Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
Python:  3.9.16


Mount Storage - Google Drive

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Set OpenAI API Key

In [49]:
os.environ["OPENAI_API_KEY"] = "open ai api key"

In [16]:
openai.api_key = os.getenv("OPENAI_API_KEY")
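
A key typed into a notebook cell is easy to leak when the notebook is shared. The following is a minimal sketch of a safer check, assuming the key was placed in the environment before the kernel started. `load_api_key` and `PLACEHOLDER` are hypothetical names introduced here for illustration; they are not part of the `openai` package.

```python
import os

PLACEHOLDER = "open ai api key"  # the dummy value used in the cell above


def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, failing early if it is missing
    or still set to the placeholder string."""
    key = env.get(name, "").strip()
    if not key or key == PLACEHOLDER:
        raise ValueError(f"{name} is not set to a real key")
    return key


# Possible usage in the notebook (assumption: the variable is already exported):
# openai.api_key = load_api_key()
```

Failing at this point gives a clearer error than an authentication failure from the first API call later in the notebook.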

Load ChromaDB

In [7]:
from chromadb.config import Settings
client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="/content/drive/My Drive/Colab Notebooks/chromadb/tax/" # Optional, defaults to .chromadb/ in the current directory
))

Running Chroma using direct local API.


FloatProgress(value=0.0, layout=Layout(width='100%'), style=ProgressStyle(bar_color='black'))

loaded in 59866 embeddings
loaded in 2 collections


In [8]:
embeddings = openai.Embedding()
irc_collection = client.get_collection(name="irc", embedding_function=embeddings)
irb_collection = client.get_collection(name="irb", embedding_function=embeddings)

Define Functions

In [9]:
def get_embedding(text, model="text-embedding-ada-002"):
   text = text.replace("\n", " ")
   return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding']

In [10]:
def break_up_text_to_chunks(text, chunk_size=2000, overlap_size=100):
    encoding = tiktoken.get_encoding("gpt2")

    tokens = encoding.encode(text)
    num_tokens = len(tokens)

    chunks = []
    for i in range(0, num_tokens, chunk_size - overlap_size):
        chunk = tokens[i:i + chunk_size]
        chunks.append(chunk)
    
    return chunks
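
To see how `chunk_size` and `overlap_size` interact, the same windowing can be run on a short list of integers standing in for token ids. This is an illustrative sketch only; `sliding_windows` is a hypothetical helper that mirrors the loop above without calling `tiktoken`.

```python
def sliding_windows(tokens, chunk_size=2000, overlap_size=100):
    """Apply the same windowing as break_up_text_to_chunks to a token list."""
    # Each window starts `step` tokens after the previous one,
    # so consecutive windows share `overlap_size` tokens.
    step = chunk_size - overlap_size
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]


# Integers stand in for token ids so the example runs without tiktoken.
demo = sliding_windows(list(range(10)), chunk_size=4, overlap_size=1)
print(demo)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

Note that the final window can be a short tail whose tokens already appear in the previous window; with this scheme that costs one extra, mostly redundant, completion call.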

In [32]:
def convert_to_prompt_text(tokenized_text):
    prompt_text = " ".join(tokenized_text)
    prompt_text = prompt_text.replace(" 's", "'s")
    return prompt_text

In [39]:
def askTaxGPT(question, debug = False):

    #Change question to embeddings.
    irc_question_ids = get_embedding(question)

    #Query IRC collections.
    irc_query_results = irc_collection.query(
        query_embeddings=irc_question_ids,
        n_results=10,
        include=["documents"]
    )

    #Join all items in a list
    irc_documents = irc_query_results["documents"][0]
    irc_query_results_doc = "".join(irc_documents)

    if debug == True:
        print(irc_query_results_doc)

    #For a given question, return only a list of the relevant Internal Revenue Code sections that cover the topic.
    prompt_response = []
    encoding = tiktoken.get_encoding("gpt2")
    chunks = break_up_text_to_chunks(irc_query_results_doc)

    for i, chunk in enumerate(chunks):
        prompt_request = question + " Only return a list of the relevant Internal Revenue Code sections that cover this topic: " + encoding.decode(chunk)
        #prompt_request = question + " Only return a list relevant Internal Revenue Codes that covers this topic.: " + convert_to_prompt_text(chunks[i])
        response = openai.Completion.create(
                model="text-davinci-003",
                prompt=prompt_request,
                temperature=0,
                max_tokens=1000,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0
        )        
        prompt_response.append(response["choices"][0]["text"].strip())

    #Consolidate the per-chunk lists of relevant Internal Revenue Code sections into one list.
    prompt_request = "Consolidate this list of Internal Revenue Codes: " + str(prompt_response)

    if debug == True:
        print(prompt_request)

    response = openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt_request,
            temperature=0,
            max_tokens=1000,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0
        )
    
    irc_codes = response["choices"][0]["text"].strip()

    if debug == True:
        print(irc_codes)

    #Embed the question together with the consolidated I.R.C. list.
    irb_question_ids = get_embedding(question + irc_codes)

    #Query IRB collections.
    irb_query_results = irb_collection.query(
        query_embeddings=irb_question_ids,
        n_results=20,
        include=["documents"]
    )

    #Join all items in a list
    irb_documents = irb_query_results["documents"][0]
    irb_query_results_doc = "".join(irb_documents)

    if debug == True:
        print(irb_query_results_doc)

    #For a given question, provides answers, referencing I.R.C.
    prompt_response = []
    encoding = tiktoken.get_encoding("gpt2")
    chunks = break_up_text_to_chunks(irb_query_results_doc)

    for i, chunk in enumerate(chunks):
        prompt_request = question + " Cite I.R.C. as references." + encoding.decode(chunks[i])
        #prompt_request = question + " Cite I.R.C. as references." + convert_to_prompt_text(chunks[i])
        response = openai.Completion.create(
                model="text-davinci-003",
                prompt=prompt_request,
                temperature=0,
                max_tokens=1000,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0
        )        
        prompt_response.append(response["choices"][0]["text"].strip())

    

    if debug == True:
        print(prompt_request)

    #Consolidate the per-chunk answers in prompt_response into a final answer, referencing I.R.C.
    encoding = tiktoken.get_encoding("gpt2")
    chunks = break_up_text_to_chunks(str(prompt_response))

    final_response = []
    for chunk in chunks:
        prompt_request = question + " Cite I.R.C. as references." + encoding.decode(chunk)
        response = openai.Completion.create(
                model="text-davinci-003",
                prompt=prompt_request,
                temperature=0,
                max_tokens=1000,
                top_p=1,
                frequency_penalty=0,
                presence_penalty=0
            )
        final_response.append(response["choices"][0]["text"].strip())

    #Join the answers for all chunks into a single string.
    return "\n\n".join(final_response)
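
The function above follows a map-then-consolidate pattern: ask the model about each chunk of retrieved text, then ask it again over the collected partial answers. The sketch below isolates that pattern so it can be run offline. It is a simplification, not the notebook's implementation: `map_reduce_answer` and `fake_complete` are hypothetical names, the completion call is passed in as a plain function, and the partial answers are sent in a single final prompt rather than being re-chunked.

```python
def map_reduce_answer(question, chunks, complete,
                      instruction=" Cite I.R.C. as references."):
    """Query `complete` once per chunk, then once more over the partial answers."""
    # Map step: one prompt per chunk of retrieved text.
    partial_answers = [
        complete(question + instruction + chunk).strip() for chunk in chunks
    ]
    # Reduce step: a final prompt built from the partial answers collected above.
    return complete(question + instruction + str(partial_answers)).strip()


# Stub standing in for the OpenAI completion call so the sketch runs offline.
def fake_complete(prompt):
    return f" answer based on {len(prompt)} prompt characters "


print(map_reduce_answer("Q?", ["chunk one", "chunk two"], fake_complete))
```

Passing the completion call as an argument also makes it possible to check that the partial answers actually reach the final prompt before spending API calls.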

Questions and Answers

In [None]:
answer = askTaxGPT("I am USA citizen who is working for a USA-based company, but lived outside of USA for the calendar year. Do I still need to pay income tax?")

DEBUG:Chroma:time to pre process our knn query: 4.291534423828125e-06
DEBUG:Chroma:time to run knn query: 0.0003650188446044922
DEBUG:Chroma:time to pre process our knn query: 3.5762786865234375e-06
DEBUG:Chroma:time to run knn query: 0.0005962848663330078


In [None]:
answer

'In summary, a USA citizen who is working for a USA-based company and living outside of the USA for the calendar year is still required to pay income tax. The Internal Revenue Code (I.R.C.) does not exclude income derived in the U.S. from taxable income. The I.R.C. defines a taxpayer as any person subject to any internal revenue tax and further defines a person as an individual, trust, estate, partnership, association, company or corporation. Taxpayers are required to disclose foreign financial accounts to the Treasury Department and to report the income earned thereon. Taxpayers must report income earned as an individual on a Form 1040 and may not attribute the income to a trust created solely for the purpose of tax-avoidance, or claim deductions related to any expenses purportedly incurred by such a trust. Business expenses, including expenses related to a home-based business, are not deductible unless the expenses relate to a legitimate profit-seeking trade or business. Taxpayers ar

In [40]:
answer = askTaxGPT("I am USA citizen who Lives in Colombia and is working for a USA-based company for the calendar year. I am required to pay income tax to Colombian government as part of my visa. Do I still need to pay income tax to USA?", debug = False)

DEBUG:Chroma:time to pre process our knn query: 3.5762786865234375e-06
DEBUG:Chroma:time to run knn query: 0.0004699230194091797
DEBUG:Chroma:time to pre process our knn query: 3.0994415283203125e-06
DEBUG:Chroma:time to run knn query: 0.0005600452423095703


In [41]:
answer

'Yes, you are still required to pay income tax to the United States. According to the Internal Revenue Code (IRC) section 871(a), all income earned by a US citizen, regardless of where it is earned, is subject to US income tax. Additionally, IRC section 911(a) provides an exclusion from US income tax for certain foreign earned income, but this exclusion does not apply to income earned by a US citizen.'

In [42]:
answer = askTaxGPT("What are the tax implications of exercising stock options or selling restricted stock units (RSUs)?")

DEBUG:Chroma:time to pre process our knn query: 4.291534423828125e-06
DEBUG:Chroma:time to run knn query: 0.000978708267211914
DEBUG:Chroma:time to pre process our knn query: 2.6226043701171875e-06
DEBUG:Chroma:time to run knn query: 0.0006501674652099609


In [46]:
answer.strip()

'Exercising stock options and selling restricted stock units (RSUs) both have tax implications. \n\nWhen exercising stock options, the difference between the exercise price and the fair market value of the stock at the time of exercise is considered ordinary income and is subject to income tax. This is known as the “bargain element” and is reported on Form 1099-MISC. The bargain element is also subject to employment taxes such as Social Security and Medicare taxes.\n\nWhen selling RSUs, the fair market value of the stock at the time of sale is considered ordinary income and is subject to income tax. This is reported on Form 1099-MISC. The sale of RSUs is also subject to employment taxes such as Social Security and Medicare taxes.\n\nThe Internal Revenue Code (IRC) sections that apply to the taxation of stock options and RSUs are sections 83 and 451. Section 83 of the IRC states that the difference between the exercise price and the fair market value of the stock at the time of exercise

In [47]:
answer = askTaxGPT("How do I apply passive activity loss rules to my investments or business activities?")

DEBUG:Chroma:time to pre process our knn query: 4.76837158203125e-06
DEBUG:Chroma:time to run knn query: 0.002870798110961914
DEBUG:Chroma:time to pre process our knn query: 3.337860107421875e-06
DEBUG:Chroma:time to run knn query: 0.003595113754272461


In [48]:
answer.strip()

'The passive activity loss rules are found in Internal Revenue Code (IRC) Section 469. Generally, these rules limit the amount of losses from passive activities that can be used to offset income from other sources. \n\nIn order to apply the passive activity loss rules, you must first determine whether the activity is a passive activity. A passive activity is any activity in which you do not materially participate. Generally, material participation is defined as any activity in which you are involved on a regular, continuous, and substantial basis. \n\nIf the activity is determined to be a passive activity, then you must determine whether the activity is a rental activity or a trade or business activity. If the activity is a rental activity, then the passive activity loss rules do not apply. However, if the activity is a trade or business activity, then the passive activity loss rules do apply. \n\nIf the activity is a trade or business activity, then you must determine whether the acti