## Import information sources: BDO Insights UK (Scraped)
- Currently, ```pdf``` and ```txt``` files are supported.
- Files are parsed into LangChain ```Document``` objects. This is required for the upcoming LangChain document-splitting step and allows optional metadata tagging.
- Combining multiple documents is not yet supported.

```
Document objects: {
    page_content: Index, Title, Date, Description, Link
    metadata: File directory
}
```
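As a rough illustration of this layout, the sketch below uses a small dataclass as a stand-in for LangChain's ```Document``` class, so that it runs without LangChain installed. The ```SimpleDocument``` name, the field values and the file path are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the two fields of a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


example_doc = SimpleDocument(
    page_content=(
        "Index: 0 \n"
        "Title: Example title \n"
        "Date: 1 January 2023 \n"
        "Description: Example description \n"
        "Link: https://example.com/article"
    ),
    metadata={"source": "path/to/scrape.txt"},  # assumed path, for illustration only
)

# The article fields live in page_content; the file location lives in metadata
print(example_doc.page_content.splitlines()[1])
print(example_doc.metadata["source"])
```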


In [29]:
#!pip install --upgrade langchain

In [30]:
import langchain

In [31]:
# Specify directories
import_directory_pdf = "C:/Users/RLee/Downloads/scrape bdo uk.pdf"
import_directory_txt = "C:/Users/RLee/Desktop/TAX BASE/bdo_uk_scrape.txt"

Reading **pdf** files:

In [32]:
# pip install pypdf 

In [33]:
# from langchain.document_loaders import PyPDFLoader

# # Read pdf file
# loader = PyPDFLoader(import_directory_pdf)
# doc = loader.load()

# # First pdf page
# page = doc[0]

Reading **txt** files:

In [34]:
# Import the text loader
from langchain.document_loaders import TextLoader

# Read the txt file; load() returns a list of Document objects
loader = TextLoader(import_directory_txt)
doc = loader.load()

## Document splitting
- The scraping method separates articles with ```\n-\n```, which we can use as the splitting criterion. Ideally, each chunk then holds a single article.
- The number of chunks should ideally equal the total number of articles.
- The information for each article is fairly concise, so articles do not need to be split across multiple chunks.
- At this stage, URL links may be garbled by automated formatting. If a link is needed at this stage, copy the URL manually.

Returns: ```doc_split``` (list of chunks)
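The splitting behavior can be approximated in plain Python. This is a simplified sketch, not LangChain's implementation: it assumes the splitter first cuts on the separator and then greedily merges neighbouring pieces while they fit within ```chunk_size```. The ```split_and_merge``` name is hypothetical. Under that assumption, several short articles can end up in one chunk, while an article longer than ```chunk_size``` stays in a single oversized chunk.

```python
def split_and_merge(text, separator="\n-\n", chunk_size=500):
    """Cut text on the separator, then greedily merge pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # The next piece does not fit: close the current chunk
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


articles = "Index: 0 \nTitle: A\n-\nIndex: 1 \nTitle: B\n-\nIndex: 2 \nTitle: C"
print(len(split_and_merge(articles, chunk_size=20)))   # 3: one article per chunk
print(len(split_and_merge(articles, chunk_size=500)))  # 1: all three merged
```

If this assumption holds for the installed LangChain version, a chunk count below the number of articles would point to merged short articles rather than lost ones.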

In [35]:
from langchain.text_splitter import CharacterTextSplitter

# Split into chunks on "\n-\n" (the scraping method ensures this split works as intended)
text_splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0, 
    separator = '\n-\n' # Splitting criteria
)

doc_split = text_splitter.split_documents(doc)

Created a chunk of size 514, which is longer than the specified 500
Created a chunk of size 512, which is longer than the specified 500


In [36]:
# Print total number of chunks
len(doc_split)

252

In [37]:
# Show example of chunk
doc_split[-1]

Document(page_content='Index: 255 \nTitle: Which Employee Share Plan Tool  \nDate: 27 January 2021 \nDescription: None \nLink: https://www.bdo.co.uk/en-gb/insights/tax/global-employer-services/which-employee-share-plan-tool', metadata={'source': 'C:/Users/RLee/Desktop/TAX BASE/bdo_uk_scrape.txt'})
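To check whether each chunk really holds a single article, one option is to count the ```Index:``` lines per chunk. The sketch below is a minimal version of that check on made-up sample strings; the helper names ```count_articles``` and ```find_merged_chunks``` are hypothetical.

```python
def count_articles(chunk_text):
    """Count article records in a chunk via lines that start with 'Index:'."""
    return sum(1 for line in chunk_text.splitlines() if line.startswith("Index:"))


def find_merged_chunks(chunk_texts):
    """Return positions of chunks that do not contain exactly one article."""
    return [i for i, text in enumerate(chunk_texts) if count_articles(text) != 1]


sample_chunks = [
    "Index: 0 \nTitle: A \nDate: 1 January 2023 \nDescription: None \nLink: https://example.com/a",
    "Index: 1 \nTitle: B \n-\nIndex: 2 \nTitle: C ",  # two articles in one chunk
]
print(find_merged_chunks(sample_chunks))  # [1]
```

On the notebook's data this could be called as ```find_merged_chunks([d.page_content for d in doc_split])```.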

Optional: create **JSON** and **Excel** files with all the chunks (documents)

In [38]:
import json

def parse_to_json(data_list):
    parsed_data = []

    for document_item in data_list:
        item = document_item.page_content
        # Splitting the string by new lines and then by ': ' to get key-value pairs
        split_data = item.split('\n')
        item_dict = {}

        for element in split_data:
            key_value = element.split(': ', 1)  # Splitting only on the first occurrence

            if len(key_value) == 2:
                # Assigning the value to the respective key in the dictionary
                item_dict[key_value[0]] = key_value[1]
            else:
                # Handling cases with missing description
                item_dict[key_value[0]] = None

        parsed_data.append(item_dict)

    return parsed_data

parsed_data = parse_to_json(doc_split)
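The scraped lines carry trailing spaces, so values parsed this way keep them (for instance ```'0 '``` for an index). A possible variant, sketched below, strips whitespace and converts the index and date to typed values. The ```parse_record``` name, the treatment of the literal text ```None``` as a missing value, and the ```%d %B %Y``` date format are assumptions based on the example chunk shown earlier.

```python
from datetime import date, datetime


def parse_record(text):
    """Parse one 'Key: value' block into a dict with stripped, typed values."""
    record = {}
    for line in text.split("\n"):
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            continue  # skip lines without a key
        value = value.strip()
        # Assumption: an empty value or the literal text "None" means "missing"
        record[key.strip()] = None if value in ("", "None") else value
    if record.get("Index") is not None:
        record["Index"] = int(record["Index"])
    if record.get("Date") is not None:
        record["Date"] = datetime.strptime(record["Date"], "%d %B %Y").date()
    return record


sample = (
    "Index: 255 \nTitle: Which Employee Share Plan Tool  \nDate: 27 January 2021 \n"
    "Description: None \nLink: https://www.bdo.co.uk/en-gb/insights/tax/"
    "global-employer-services/which-employee-share-plan-tool"
)
print(parse_record(sample))
```

Note that ```date``` values are not JSON-serializable by default; passing ```default=str``` to ```json.dump``` is one way to handle that.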

In [72]:
# Print the first five JSON objects (adjust the slice for a different subset)
for x in parsed_data[:5]:
    print(x, "\n")

{'Index': '0 ', 'Title': 'VAT and other indirect taxes changes in 2023 ', 'Date': '17 November 2023 ', 'Description': "Stay ready for 2023's VAT and indirect tax changes. Expert insights for businesses. ", 'Link': 'https://www.bdo.co.uk/en-gb/insights/tax/vat-and-indirect-taxes/are-you-ready-for-2023-upcoming-changes-in-vat-and-other-indirect-taxes'} 

{'Index': '1 ', 'Title': 'Christmas Parties: Tax Issues ', 'Date': '13 November 2023 ', 'Description': 'Planning a good event is a real talent but there is also opportunity to make them even better through ensuring that the tax angles are not forgotten. ', 'Link': 'https://www.bdo.co.uk/en-gb/insights/tax/global-employer-services/christmas-parties-tax-issues'} 

{'Index': '2 ', 'Title': 'VAT exemption for Loan Administration services ', 'Date': '08 November 2023 ', 'Description': 'The Supreme Court’s ruling in the recent case of Target V HMRC may finally have brought clarity to the VAT treatment of financial intermediary services. ', 'Li

In [40]:
# Export directory JSON:
export_directory_json = "C:/Users/RLee/Desktop/TAX BASE/output.json"

# Write to a JSON file
with open(export_directory_json, 'w') as file:
    json.dump(parsed_data, file, indent=4)
    
print(f"All document summaries (in JSON) exported to {export_directory_json}")

All document summaries (in JSON) exported to C:/Users/RLee/Desktop/TAX BASE/output.json


In [41]:
import pandas as pd

# Convert the parsed data to a DataFrame
df_parsed_docs = pd.DataFrame(parsed_data)

# Export directory for Excel
export_directory_excel = "C:/Users/RLee/Desktop/TAX BASE/output.xlsx"

# Exporting the DataFrame to an Excel file
df_parsed_docs.to_excel(export_directory_excel, index=False)

print(f"All document summaries (in Excel) exported to {export_directory_excel}")

All document summaries (in Excel) exported to C:/Users/RLee/Desktop/TAX BASE/output.xlsx


## Embedding

In [42]:
# ! pip install openai
# ! pip install python-dotenv

In [44]:
import openai
import os
from dotenv import load_dotenv

# Load OPENAI_API_KEY from the .env file at the given path
_ = load_dotenv("C:/Users/RLee/Desktop/TAX BASE/openai_api_key.env") # .env filepath
openai.api_key = os.environ["OPENAI_API_KEY"]

In [45]:
from langchain.embeddings.openai import OpenAIEmbeddings

# Create an Embedding function
embedding = OpenAIEmbeddings()

In [46]:
# Example of embedded chunk
embedding_example = embedding.embed_query(doc_split[1].page_content)
print(doc_split[1].page_content)
print(embedding_example[:30])

Index: 1 
Title: Christmas Parties: Tax Issues 
Date: 13 November 2023 
Description: Planning a good event is a real talent but there is also opportunity to make them even better through ensuring that the tax angles are not forgotten. 
Link: https://www.bdo.co.uk/en-gb/insights/tax/global-employer-services/christmas-parties-tax-issues
[-0.0011373558241921404, -0.029427741137483598, -0.0029262789006034096, -0.039694462530751685, -0.02100850485060493, 0.0030466949530767147, -0.013163306004484112, -0.0040314687052603135, -0.020322297054734952, -0.013427231649977526, 0.006433194593821364, 0.00605050203532688, -0.03172389835109568, 0.028926283528633218, 0.004341581431847334, 0.01394848545175283, 0.040618200427333453, -0.031644721029976695, 0.011903059836533699, -0.024228401823443962, -0.0315919336657038, -0.013414035740231891, -0.00977185754833888, 0.03819008225361984, 0.005723893723065874, -0.0035827945025618283, 0.02436036464619067, -0.01813171270702348, 0.008300470305200182, 0.0147534601

## Storing embedded chunks in Vector Database 
- Creates a repository for all embedded documents. This makes the LLM design more scalable, because the full document set no longer has to fit within the model's input-token limit.
- ```Chroma```: open-source and lightweight (alternatives may become necessary at larger scale).
- The VectorDB (in the specified directory) should be emptied before each run so that we always start from scratch and avoid duplicate entries. This reset does not always succeed; for example, after a kernel restart ```vectordb``` is not yet defined, so the delete step is skipped.

Documentation Chroma/Langchain: https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.chroma.Chroma.html
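Another way to guard against duplicate entries is to give every chunk a deterministic ID derived from its content, so that re-adding the same chunk reuses the same ID. The sketch below only shows the ID generation. The ```stable_ids``` name is hypothetical, and it is an assumption that the installed Chroma/LangChain version accepts an ```ids``` argument and overwrites entries with matching IDs; check the documentation linked above before relying on it.

```python
import hashlib


def stable_ids(texts):
    """Derive a deterministic ID for each text from its SHA-256 hash."""
    return [hashlib.sha256(text.encode("utf-8")).hexdigest() for text in texts]


chunk_texts = ["Index: 0 \nTitle: A", "Index: 1 \nTitle: B"]
ids = stable_ids(chunk_texts)

# The same content always maps to the same ID; different content maps to different IDs
print(ids[0] == stable_ids(["Index: 0 \nTitle: A"])[0])
print(ids[0] != ids[1])
```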

In [47]:
# !pip install chromadb 
# !pip install --upgrade langchain (requires recent version langchain)

In [48]:
# https://python.langchain.com/docs/integrations/vectorstores/chroma
from langchain.vectorstores import Chroma
import chromadb

In [49]:
# Creates/updates VectorDB file at specified directory 
vectordb_directory = "C:/Users/RLee/Downloads/vectordb"

In [80]:
# Empty the current collection/VectorDB before adding chunks, to prevent duplicates
try:
    vectordb.delete_collection()
except Exception:
    # Nothing to delete (e.g. vectordb is not defined yet on a first run)
    pass

# Add chunks to the VectorDB in the directory specified above
# (runs whether delete_collection() succeeded or failed)
vectordb = Chroma.from_documents(
    documents=doc_split,
    embedding=embedding,
    persist_directory=vectordb_directory  # Chroma-specific keyword
)

# Save to use later
vectordb.persist()

In [81]:
# Show number of items stored in VectorDB (should be same as number of chunks earlier)
vectordb._collection.count()

252

## Retrieval 
- Collect a number of documents (```n_subset```) that, based on their embeddings, are most likely to be relevant. This pre-selection step lets the approach scale to a significantly larger volume of articles.
- Relevance is determined by embedding similarity: the numerical representation of the query is compared with that of each document.
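The idea behind similarity matching can be sketched with toy vectors. The example below ranks documents by cosine similarity to a query vector and keeps the top ```k```. It is an illustration only: the vectors are invented, the helper names are hypothetical, and the vector store may use a different distance metric internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        enumerate(doc_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [index for index, _ in ranked[:k]]


toy_docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]  # made-up 2-d "embeddings"
print(top_k([1.0, 0.1], toy_docs, k=2))  # [0, 1]
```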

In [83]:
# User input
question = """
As a BDO UK tax professional, I'm interested in the developments following the OECD's publication of the Model Globe Rules. 
Can you tell me which jurisdictions have adopted final legislation to implement Pillar Two and which jurisdictions have published draft legislation for the same?
"""

In [85]:
# Number of documents that are selected through embedding similarity
n_subset = 20

In [86]:
# Collect n_subset documents, by maximising embedding similarity scores
vector_db_matches = vectordb.similarity_search(question, k=n_subset)
for x in vector_db_matches:
    print(x.page_content, "\n")

Index: 159 
Title: UK legislates for OECD Pillar Two rules 
Date: 11 August 2022 
Description: Understand the UK legislation on OECD Pillar Two rules. 
Link: https://www.bdo.co.uk/en-gb/insights/tax/corporate-international-tax/uk-legislates-for-oecd-pillar-two-rules 

Index: 179 
Title: The impact of the OECD Pillar Two model rules on natural resource companies 
Date: 27 April 2022 
Description: All natural resources companies will be affected by the Pillar 2 proposals and this article considers some areas of immediate concern. 
Link: https://www.bdo.co.uk/en-gb/insights/tax/corporate-international-tax/the-impact-of-the-oecd-pillar-two-model-rules-on-natural-resource-companies 

Index: 178 
Title: Pillar One and Pillar Two – implications for professional service partnerships 
Date: 27 April 2022 
Description: Assess the implications of Pillar One and Pillar Two for professional service partnerships, staying compliant and making informed tax planning decisions. 
Link: https://www.bdo.co

Parse the list of chunks to a single string to conveniently feed to the LLM in the following steps.

In [87]:
# Parse the list of Document objects into a single string (for convenient prompt ingestion)
vector_db_matches_str = "".join("\n-\n" + x.page_content for x in vector_db_matches)

In [88]:
# Character count of the combined context; a large n_subset may push the prompt past the model's input limit
print(len(vector_db_matches_str))

7077
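If the combined string grows too long, one simple safeguard is to stop adding documents once a character budget is reached. The sketch below shows this on toy strings. The ```join_within_budget``` name and the budget value are hypothetical, and a character count is only a rough proxy for the model's token limit.

```python
def join_within_budget(texts, max_chars, separator="\n-\n"):
    """Concatenate texts in order, stopping before max_chars would be exceeded."""
    combined = ""
    for text in texts:
        candidate = combined + separator + text
        if len(candidate) > max_chars:
            break  # keep the most similar documents, drop the rest
        combined = candidate
    return combined


toy_texts = ["aaaa", "bbbb", "cccc"]
limited = join_within_budget(toy_texts, max_chars=15)
print(len(limited))  # 14: the third text would exceed the budget
```

Because the matches are ordered by similarity, truncating from the end drops the least similar documents first.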


## LLM

In [89]:
# Select the model name: the dated snapshot before 2 September 2023, the general alias afterwards
import datetime
current_date = datetime.datetime.now().date()
llm_name = "gpt-3.5-turbo-0301" if current_date < datetime.date(2023, 9, 2) else "gpt-3.5-turbo"

In [90]:
# Standard helper function
def get_completion(prompt, model = llm_name):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0, # this is the degree of randomness of the model's output
    )
    return response.choices[0].message["content"]

In [110]:
# Prompt engineering
prompt = f""" 
Given a list of document summaries, your task is to assess each document strictly for relevance to the provided query. Exclude all documents that are 'likely irrelevant'—those that are only marginally related to the query.

Input:

Query: <{question}>
Document Summaries: <{vector_db_matches_str}>

Procedure:

1. Essence Extraction: Discern the core essence and essential points of the query.
2. Relevance Assessment: Review the titles and descriptions of each document summary. Determine whether the document is likely to be relevant or maybe relevant to the query's key points. Disregard any document that does not appear to closely align with the query's essence.

Output:

First print the question, and your associated interpretation of the question:

- Question: [User input question]
- Key Objective of the Query: [Concise summary of the query's key points]

Insert a line here, to show a clear break using dashes 

Don't include the remaining irrelevant documents in the output.
Then, for each document that is determined to be relevant or maybe relevant, present the following details:

- Conclusion: [Relevant/Maybe Relevant]
- Reasoning: [Justification for the relevance assessment, connecting the document's title and description to the query]
- Document Details:
  - Index: [Document index]
  - Title: [Title of the document]
  - Date: [Publication date of the document]
  - Description: [Overview of the document's main themes and points]
  - Link: [Direct URL]
"""

In [111]:
# Track total runtime (use the module import so the name `datetime` is not shadowed for earlier cells)
import datetime
_start_time = datetime.datetime.now()

# Get LLM response
response = get_completion(prompt)
print(response)

# Print execution time
print(f"\n\n\n===============================\nTotal runtime:  {datetime.datetime.now() - _start_time}")

Question: As a BDO UK tax professional, I'm interested in the developments following the OECD's publication of the Model Globe Rules. Can you tell me which jurisdictions have adopted final legislation to implement Pillar Two and which jurisdictions have published draft legislation for the same?

Key Objective of the Query: Identify jurisdictions that have adopted final legislation or published draft legislation to implement Pillar Two.

--------------------------------------------------

Conclusion: Relevant
Reasoning: The document titled "UK legislates for OECD Pillar Two rules" is likely to be relevant as it specifically discusses the UK legislation on OECD Pillar Two rules, which aligns with the query's objective of identifying jurisdictions that have implemented Pillar Two.
Document Details:
- Index: 159
- Title: UK legislates for OECD Pillar Two rules
- Date: 11 August 2022
- Description: Understand the UK legislation on OECD Pillar Two rules.
- Link: https://www.bdo.co.uk/en-gb/i