![iCor logo light](../logo.png)

# iCOR

iCOR is a company that builds legal registers, which other companies use to evaluate their environmental performance. They do this by manually searching environmental legislation and summarising it, together with a note of what has changed.

At the moment iCOR do this process by hand. They would like to be able to automate this. Can we develop a business idea around this?

# Using the GPT-4 model to extract information

In this notebook we will use the [Azure OpenAI GPT-4](https://azure.microsoft.com/en-us/blog/introducing-gpt4-in-azure-openai-service/) model to extract information from the scraped web pages by simply asking free-text questions.

In [1]:
import configparser
from openai import AzureOpenAI

from pandas import read_csv

In [2]:
# read in data from web scraping process
legislation_df = read_csv('data/raw/legislation.csv')
# take the first row of extracted text
leg_text = legislation_df['text'].tolist()[0]

### Azure OpenAI GPT-4 model

The model is used to extract relevant information from the scraped documents.

First, specify whether you are in group 3 or group 4. This determines which configuration file the API credentials are loaded from.

In [9]:
# specify which group you are in: either 'group3' or 'group4'

group = 'group4'

In [10]:
# load OpenAI API keys
config = configparser.ConfigParser()
config.read(f'config/{group}.ini')
openai_endpoint = config['API']['endpoint']
openai_key = config['API']['key']
deployment_name = config['API']['deployment-name']

# init OpenAI client
client = AzureOpenAI(
    azure_endpoint=openai_endpoint,
    api_key=openai_key,
    api_version="2024-02-01",
)
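
The cell above expects a file such as `config/group4.ini` with an `[API]` section holding three keys. A minimal sketch of that layout follows, parsed from a string so it runs without a real file. The endpoint, key and deployment values are placeholders (assumptions for illustration), not real credentials.

```python
import configparser

# Placeholder values only: the real endpoint, key and deployment name
# come from your group's Azure OpenAI resource.
EXAMPLE_INI = """
[API]
endpoint = https://example-resource.openai.azure.com/
key = 0123456789abcdef
deployment-name = example-gpt4-deployment
"""

example_config = configparser.ConfigParser()
example_config.read_string(EXAMPLE_INI)

# Fail early with a clear message if one of the expected keys is missing
required_keys = ("endpoint", "key", "deployment-name")
missing = [k for k in required_keys if k not in example_config["API"]]
if missing:
    raise KeyError(f"Missing keys in [API] section: {missing}")

print(example_config["API"]["deployment-name"])
```

Note that `ConfigParser.read()` silently skips files it cannot find, so a wrong `group` value typically shows up as a `KeyError` on `config['API']` rather than a file error.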

You can experiment with the GPT-4 model by updating the query in the code cell below. Remember that generative models can (and will) [hallucinate](https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)).

In [11]:
query = """
        Question1: Summarise the given text into no more than 1 paragraph.\n
        Question2: What are the key changes in the updated legislation?\n
        Question3: List all of the legislation mentioned, in date order.
        """

# request response from OpenAI GPT-4 model
response = client.chat.completions.create(
    model=deployment_name, 
    messages=[
        {"role": "system", "content": """
                                            You are an expert in environmental legislation within the UK and building legal registers. 
                                            Your task is to provide responses to the questions/instructions.
                                            Return your responses in a JSON format, where the keys are the questions and the values are the answers.
                                            If the information is not present in the provided text, please return 'unknown'.
                                        """},
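        # send the questions and the legislation text together in the user message;
        # the "assistant" role is meant for the model's own earlier replies
        {"role": "user", "content": f"{query}\nText:\n{leg_text}"}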
    ]
)

# return response content
print(response.choices[0].message.content)

```json
{
  "Question1": "The Hazardous Waste (England and Wales) Regulations 2005 established comprehensive guidelines and procedures for the handling, movement, and disposal of hazardous waste in England and Wales. These regulations aim to prevent pollution and safeguard human health by ensuring waste management practices that do not endanger the environment. They mandate the notification of premises generating hazardous waste, codify the mixing prohibitions, and detail the documentation and record-keeping requirements for waste producers, carriers, and consignees. The regulations also encompass emergency measures, enforcement protocols, and penalties for non-compliance. Furthermore, they harmonize with European directives and provide transitional provisions and amendments to existing legislation.",
  
  "Question2": "Key changes in the updated legislation include the replacement of the term 'special waste' with 'hazardous waste,' the implementation of stricter controls and tracking 
```

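The reply is a plain string and, as the output above shows, the model may wrap it in a markdown `json` code fence. A minimal sketch of turning such a reply into a Python dictionary is given below. `parse_model_json` is a hypothetical helper (not part of the `openai` library), and the sample reply is made up for illustration.

```python
import json

FENCE = "`" * 3  # three backticks, the markdown code-fence marker


def parse_model_json(reply):
    """Parse a JSON reply, tolerating an optional markdown code fence."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the first line (the fence plus an optional language tag)
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence if there is one
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[: -len(FENCE)]
    return json.loads(text)


# Made-up reply in the same shape as the notebook output
sample_reply = FENCE + 'json\n{"Question1": "A short summary.", "Question2": "unknown"}\n' + FENCE
answers = parse_model_json(sample_reply)
print(answers["Question2"])
```

If the reply is cut off part-way, `json.loads` raises `json.JSONDecodeError`, which is a useful signal that the response did not complete.
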
Can you create a better system prompt and questions? Have a play around with changing the questions and the prompt.
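
One direction to try, sketched under assumptions: name the expected JSON keys explicitly in the system prompt, and send the legislation text in the user message alongside the questions. `build_messages` and the prompt wording are illustrative only, not a tested recommendation.

```python
def build_messages(questions, document_text):
    """Build a chat message list: instructions in 'system', questions and text in 'user'."""
    # Key each question so the system prompt can name the expected JSON keys
    keyed = {f"Question{i}": q for i, q in enumerate(questions, start=1)}
    system_prompt = (
        "You are an expert in UK environmental legislation and in building legal registers. "
        "Answer each question using only the text supplied by the user. "
        f"Return a single JSON object with exactly these keys: {', '.join(keyed)}. "
        "If the text does not contain the answer, use the value 'unknown'."
    )
    question_block = "\n".join(f"{key}: {q}" for key, q in keyed.items())
    user_prompt = f"{question_block}\n\nText:\n{document_text}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


example_messages = build_messages(
    ["Summarise the text in one paragraph.", "Which legislation does the text amend?"],
    "Example legislation text goes here.",
)
print(example_messages[1]["content"])
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create`.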

Would it be better to apply this to the text we extracted using the `find_all()` method in `01_web_scraping.ipynb`? Azure OpenAI charges per token (a token is roughly a word or part of a word), so sending fewer tokens means a lower cost for the business!
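
To get a feel for the saving before calling the API, you can compare rough token estimates for the two inputs. The sketch below uses the common rule of thumb of about four characters per token for English text. It is an approximation, not the model's real tokenizer, and the price is a made-up placeholder rather than an actual Azure rate. The two strings are stand-ins for the full page text and the shorter `find_all()` extract.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate: about four characters per token for English."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def estimate_cost(text, price_per_1k_tokens):
    """Approximate input cost for one request, given a price per 1,000 tokens."""
    return estimate_tokens(text) / 1000 * price_per_1k_tokens


# Stand-ins for the full page text and the shorter extract
full_text = "word " * 4000     # 20,000 characters
extract_text = "word " * 500   # 2,500 characters

PLACEHOLDER_PRICE = 0.03  # hypothetical price per 1,000 input tokens

for label, text in [("full page", full_text), ("extract", extract_text)]:
    tokens = estimate_tokens(text)
    cost = round(estimate_cost(text, PLACEHOLDER_PRICE), 4)
    print(label, tokens, cost)
```

For exact counts, a tokenizer library such as `tiktoken` could be used in place of the character heuristic.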