# Find and summarise relevant pages within documents

## How this works

For **docx (Word)** and **PDF** files, this code will first find pages that
match the keywords that you define in the `keywords` array.

It will then send these pages to the LLM to determine if the page is relevant to
the task you have set, and if it is, it will summarise the page.

All relevant pages are recorded in a spreadsheet in the results folder.
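
The flow described above can be sketched as follows. This is a minimal illustration only, not the implementation in `processor.handler`: the function names `find_matching_pages` and `summarise_relevant`, the `ask_llm` callable and the `fake_llm` stand-in are all hypothetical, and the sketch assumes the page text has already been extracted into a list of strings.

```python
from typing import Callable


def find_matching_pages(pages: list[str], keywords: list[str]) -> list[int]:
    """Return 1-based numbers of pages containing any keyword (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        number
        for number, text in enumerate(pages, start=1)
        if any(keyword in text.lower() for keyword in lowered)
    ]


def summarise_relevant(
    pages: list[str], keywords: list[str], ask_llm: Callable[[str], str]
) -> list[dict]:
    """Send keyword-matched pages to the LLM and keep only the relevant ones."""
    rows = []
    for number in find_matching_pages(pages, keywords):
        answer = ask_llm(pages[number - 1]).strip()
        # The task prompt asks the model to reply 'N/A' for irrelevant pages.
        if answer.upper() != "N/A":
            rows.append({"page": number, "summary": answer})
    return rows


# Hypothetical stand-in for a real LLM call, so the sketch runs offline.
def fake_llm(page_text: str) -> str:
    return "N/A" if "table" in page_text.lower() else "Summary: " + page_text


example_pages = [
    "Our regulatory team reports to ACMA.",
    "Financial highlights.",
    "Regulation overview table",
]
example_rows = summarise_relevant(example_pages, ["regulation", "regulatory"], fake_llm)
```

Here pages 1 and 3 match a keyword, but only page 1 survives the relevance check, so `example_rows` holds a single entry.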

## Instructions

There are three inputs required to make this work. If this is a fresh install,
the example below gives a good outline of what to include (it can always be
found again at the [GitHub repo](https://github.com/Academic-ID/academicAI)).

1. You need to provide a list of **keywords** in the cell below to search for in
   the documents. This can be any number of keywords and should look like the
   following: `keywords = ['regulation', 'regulatory']`. This stage is akin to
   doing a Ctrl+F over a document. You should first ensure PDF documents have
   readable text. This code does not currently provide OCR functionality, so
   documents without searchable text will be missed entirely (to check, open
   the file and see if you can select text as you would to copy and paste it).
2. You need to detail the **task** that the AI assistant is carrying out. It is
   vital that you are extremely explicit in your instructions. The more detailed
   you are, the better the summaries you will receive and the lower the
   likelihood that something relevant is missed.
3. To speed things up, pages are processed in parallel. The number of pages
   processed at one time is set by the **max_workers** variable. There are two
   things you need to consider when setting this: first, your machine's
   resources (e.g. available threads), and second (likely more of a concern),
   the rate limiting of requests to the LLM. It is worth experimenting with
   this value: start with a higher number and reduce it until errors stop
   appearing. Between 5 and 10 seems a fairly safe bet when using the API with
   no additional rate limit increases.
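
To illustrate what a `max_workers` cap does, here is a minimal sketch using Python's standard `concurrent.futures.ThreadPoolExecutor`. It assumes a thread pool is used, which may not match how this project actually schedules its work. The names `process_pages`, `handle_page` and `fake_llm_call` are hypothetical and exist only for this example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def process_pages(pages, handle_page, max_workers=5):
    """Run handle_page over pages with at most max_workers calls in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input pages.
        return list(pool.map(handle_page, pages))


# Track how many calls run at once, to show the cap being respected.
stats = {"active": 0, "peak": 0}
lock = threading.Lock()


def fake_llm_call(page):
    """Hypothetical stand-in for a rate-limited LLM request."""
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)  # simulate network latency
    with lock:
        stats["active"] -= 1
    return page.upper()


results = process_pages(["a", "b", "c", "d", "e", "f"], fake_llm_call, max_workers=2)
```

A lower `max_workers` means fewer simultaneous requests, which is why reducing it tends to make rate-limit errors disappear at the cost of a longer run.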


In [1]:
from processor.handler import process

keywords = [
    'regulation', 'regulatory', 'compliance', 'comply', 'compliant', 'ombudsman',
    'ombud', 'industry code', 'Comms Alliance', 'Communications Alliance',
    'Communication Alliance', 'ACIF', 'Australian Communications Industry Forum',
    'ACMA', 'Australian Communications and Media Authority', 'Communications and Media Authority'
]

task = """You are a highly paid, and highly respected academic at a prestigious university. 

Your expertise is in the field of Australian Telecommunications Regulation. 

You have been asked to provide a summary of the compliance approach of TPG (and its predecessors Vodafone and Hutchison) in Australia. 

You have been provided with a page from a public document from these companies. 
    
Your task is to read the text very carefully and thoroughly. 

Your task is to determine if it contains any information relevant to TPG's compliance approaches in Australia (more details below). If so, you must summarise the information.

What I mean by compliance approaches:
- I am interested in how the company complies with its regulatory obligations as a telecommunications provider in Australia. This includes any information on how TPG ensures it complies with the law, industry codes, and any other telecommunications-focused regulatory obligations. 
- This is only in relation to regulations specific to the telecommunications industry. Do not include anything relating to companies regulations (e.g. Australian consumer law, tax law, stock exchange or other non-telecommunications related regulation).
- Some keywords for regulation within the telecommunications industry include ACMA (Australian Communications and Media Authority), TCP (Telecommunications Consumer Protections Code), MPS (Mobile Premium Services Code), mobile base station deployment, Comms Alliance (Communications Alliance), ACIF (Australian Communications Industry Forum), and other industry codes or consumer specific provisions/regulations.
- I want to know what information they have provided relating to how they approach compliance, and how they interact with the regulator (ACMA) and other industry participants.

What I mean by summarise:
- Please write a paragraph or two summarising the information you have found.
- Keep it highly factual, relying purely on what is provided in the text for your summary.
- Please do not copy and paste the information you have found. I want you to read the information carefully, understand it, and then summarise it in your own words. The only caveat is that you must retain language with a specific regulatory or legal meaning. For example, words such as 'comply', 'presumption' or 'regulation' should be retained.
- Please use extremely clear and precise language. You do not need to use overly descriptive or emotive language. I want to know the facts, not your opinion or evaluation on the approach. Do not speculate or try to summarise what you think the approach is. Only summarise what is provided in the text. For example, do not write something along the lines of: 'this demonstrates a robust approach to compliance'. Instead, write something like: 'TPG's compliance approach includes...'.
- Do not include marketing speak (adjectives or adverbs TPG has used to describe their approaches in more positive terms) - just provide a plain English recount of their compliance approach as documented in the text.
- Do not make reference to the text by saying 'the provided text' or anything like that. Do not use phrases such as 'This text' or 'This document' to start your summary and/or introduce the text. Just recount what is in the provided text.

If there is no information relevant to compliance approaches:
- Return with a message stating: N/A (you do not need to summarise anything, you do not need to explain what is on the page or why it is not relevant, just return with N/A)

Other notes:
- Please do not include any information that is not relevant to telecommunications compliance approaches (refer to above notes on 'What I mean by compliance approaches').
- Please do not summarise pages that are just tables, contents pages, glossary pages, index pages or similar.
- Read the provided text very carefully and thoroughly. If you do not read the text carefully and thoroughly, you will not be able to complete the task. If this task is not completed to the highest quality and exactly as per these instructions, you and I will lose our jobs, so it is vital you follow these instructions very carefully."""

source_folder_path = 'content'
destination_folder_path = 'result'
max_workers = 10

process(keywords, task, source_folder_path, destination_folder_path)

  0%|          | 0/29 [00:00<?, ?it/s]

'Overivew saved at: result/00. Summaries.xlsx'