In [None]:
!pip install tenacity

In [None]:
!pip install langchain

In [None]:
!pip install langchain_openai

In [None]:
!pip install langchain_experimental

In [5]:
import logging
import os
import re
import json
from datetime import date
from typing import List, Optional

import pandas as pd
import requests
from bs4 import BeautifulSoup
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
from tenacity import retry, stop_after_attempt, wait_exponential

# Web Scraping

Let's start by scraping a Wikipedia page about Australian bushfires! First, we need to inspect the webpage to identify which elements to extract. The key pieces of information we’re interested in are:

- The name of the wildfire.
- The time period during which the wildfire occurred.
- A brief description of what happened.

We use Python’s requests library to fetch the webpage, and BeautifulSoup to help us navigate its HTML structure and extract the required information. Once we've collected everything, we’ll package the data into a pandas DataFrame and display it to get a clear view of our results.

In [6]:
# URL of the Wikipedia page containing data on the 2018–19 Australian bushfire season
url = 'https://en.wikipedia.org/wiki/2018%E2%80%9319_Australian_bushfire_season'

# Initialize an empty list to store the extracted wildfire data.
# It is defined before the request so the DataFrame step below still works if the request fails.
wildfires_data = []

# Send a GET request to fetch the content of the webpage (with a timeout so the cell cannot hang)
response = requests.get(url, timeout=30)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content of the webpage using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all sections on the page that represent distinct wildfire instances
    wildfire_sections = soup.find_all('div', class_='mw-heading mw-heading3')

    # Iterate over each section to extract the relevant information (name, period, description)
    for section in wildfire_sections:
        # Extract the wildfire name from the <h3> tag
        h3_tag = section.find('h3')
        if h3_tag:
            name = h3_tag.text.strip()
        else:
            # If no name is found, label the wildfire as "Unnamed" and log a warning
            name = "Unnamed"
            logging.warning("Unnamed wildfire found")

        # Find the next sibling element after the heading to locate the period and description
        next_element = section.find_next_sibling()

        # Skip over irrelevant elements until we find period and description
        while next_element and next_element.name != 'dl':
            next_element = next_element.find_next_sibling()

        # Once we find the <dl> element, extract each period and its description
        while next_element and next_element.name == 'dl':
            # The <dt> inside the <dl> holds the period (for example "July 2018")
            dt_tag = next_element.find('dt')
            period = dt_tag.text.strip() if dt_tag else None
            next_element = next_element.find_next_sibling()

            description = ""
            # Continue to find all <p> tags that provide the description and concatenate them
            while next_element and next_element.name == 'p':
                description += next_element.get_text(separator=" ", strip=True) + " "
                next_element = next_element.find_next_sibling()

            # Append the extracted data to our list
            wildfires_data.append({
                "Name": name,
                "Period": period,
                "Description": description
            })

# Convert the extracted data into a pandas DataFrame for easy manipulation and display
df = pd.DataFrame(wildfires_data)

# Display the DataFrame with the extracted wildfire data
display(df)

,Name,Period,Description
0,New South Wales,July 2018,New South Wales experienced 525 fires in the l...
1,New South Wales,August 2018,"On 15 August, a bushfire near Bega burned out ..."
2,Victoria,December 2018,On 7th December a grassfire threatened the tin...
3,Victoria,March 2019,On 1 March several fires were lit in the east ...
4,Queensland,November 2018,"By 30 November, 144 bushfires were burning thr..."
5,Tasmania,January 2019,A series of bushfires started in Tasmania from...
6,Western Australia,October 2018,"A bushfire, that started on 11 October, approx..."
7,Western Australia,January 2019,"In mid-January, south western Australia experi..."
8,Western Australia,February 2019,"On 7 February, a large bushfire started in the..."


# Cleaning Up the Data

Before we dive deeper, we need to clean up the descriptions a bit. Specifically, we want to remove citation references like [4] or [9] that appear throughout the text. These citations are useful on the Wikipedia page but not necessary for our analysis.

To do this, we define a simple function that uses a regular expression (regex) to find and remove numeric citation markers in square brackets. We then apply this function to the 'Description' column of our DataFrame, so the text is cleaner and easier to work with.

In [7]:
# Define a function to remove citation references
def remove_citations(text):
    # Use regex to match brackets and numbers inside them, such as [4], [ 9 ]
    return re.sub(r'\[\s*\d+\s*\]', '', text)

# Apply the function to the 'Description' column in the DataFrame
df['Description'] = df['Description'].apply(remove_citations)

In [8]:
# Print an example of cleaned text
print(df['Description'][2])

On 7th December a grassfire threatened the tiny town of Little River which is south west from Melbourne. The fire started at 11:42am and burnt at least 1,240 hectares.   
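
A quick check of what the pattern does and does not remove can be useful here. The sketch below repeats `remove_citations` as defined above and runs it on a few made-up strings (the sample strings are illustrative assumptions, not text scraped from the page). Numeric markers such as `[4]` or `[ 9 ]` are dropped, while non-numeric brackets such as `[citation needed]` are left untouched.

```python
import re

def remove_citations(text):
    # Same pattern as above: square brackets containing only digits and optional spaces
    return re.sub(r'\[\s*\d+\s*\]', '', text)

# Illustrative strings (assumptions, not scraped data)
samples = [
    "The fire burnt 1,240 hectares.[4]",
    "Two houses were lost.[ 9 ][12]",
    "The cause is unknown.[citation needed]",
]

for sample in samples:
    print(repr(remove_citations(sample)))
```

If non-numeric markers also need to go, the pattern would have to be widened, at the risk of removing bracketed text that belongs to the description.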


# Setting Up the Language Model

Here, we prepare our environment and set up the language model we’ll use for the extraction tasks. First, we store the OpenAI API key in an environment variable to authenticate our requests. Then, we choose the model, in this case gpt-4o-mini, a smaller and cheaper variant of GPT-4o that is well suited to this task. Finally, we initialize the language model with a temperature of 0. This minimizes the randomness in the model’s responses, making the outputs more consistent and reproducible, which is especially important for data extraction and other structured tasks. Note that a temperature of 0 does not guarantee identical outputs across runs, and it does not by itself prevent hallucinations.



In [10]:
# Set the OpenAI API key as an environment variable to authenticate API requests.
# Replace the placeholder with your own key, and never commit a real key to a shared notebook.
os.environ["OPENAI_API_KEY"] = 'sk...'

# Choose the model version to be used
model = "gpt-4o-mini"

# Initialize the LLM with the specified model and temperature set to 0 (less randomness, more consistent output).
llm = ChatOpenAI(model=model, temperature=0)

# Structuring Wildfire Data with Pydantic

Here, we define a data model using Pydantic to structure the information we extract from wildfire descriptions. The WildfireInfo class outlines the key pieces of data we're interested in:

- **total_burnt_land**: This captures the total area burned by the wildfire, measured in hectares. If multiple areas are mentioned, we sum them up.
- **number_of_deaths**: This field records the number of confirmed fatalities caused by the wildfire. We focus only on fatalities, excluding injuries, evacuations, or hospitalizations.
- **list_of_areas_affected**: This holds a list of geographical locations impacted by the wildfire, including cities, towns, regions, and national parks. We avoid adding vague location descriptors like 'north of' or 'nearby.'
- **number_of_fires**: This field tracks the total number of fires mentioned, including all types (e.g., bushfires, grassfires). If fires occurred simultaneously, we sum them. If they occurred over time, we select the final or most significant number.

This model helps us organize and standardize the extracted data, making it easier to work with and analyze.

In [13]:
class WildfireInfo(BaseModel):
    """Information extracted from the wildfire descriptions."""

    total_burnt_land: Optional[int] = Field(
        description=(
            "The total area of land burnt by the wildfire, measured in hectares. "
            "If more than one area is mentioned, sum the values."
        ),
        default=None
    )

    number_of_deaths: Optional[int] = Field(
        description=(
            "The total number of deaths caused directly by the wildfire. "
            "Only confirmed fatalities should be counted. "
            "Do not include hospitalizations, injuries, or evacuations."
        ),
        default=None
    )

    list_of_areas_affected: List[str] = Field(
        description=(
            "A list of specific geographical areas affected by the wildfire. "
            "Include cities, towns, regions, and national parks, but avoid descriptors like "
            "'north of', 'close to', or 'around'."
        ),
        default=[]
    )

    number_of_fires: Optional[int] = Field(
        description=(
            "The total number of fires (including bushfires, grassfires, or any other kind of fire) "
            "mentioned in the description. If multiple numbers are mentioned, sum the numbers if the fires "
            "occurred at the same time. "
            "If the fires occurred over different times or days, select the final number reported."
        ),
        default=None
    )

# Binding the Model with Tools

Here, we create a list of tools that our language model can use. In this case, we include WildfireInfo, the Pydantic model we've defined previously.

Next, we bind the WildfireInfo tool to the language model, which allows the model to call this tool when processing descriptions. This means the model can now directly output structured data according to the format specified by WildfireInfo.

In [14]:
# Define a list of tools (models) the LLM will use for structured data extraction
tools_descr = [WildfireInfo]

# Bind the WildfireInfo tool to the LLM, allowing the model to call this tool.
llm_with_tools_descr = llm.bind_tools(tools_descr)

# Defining a Prompt for Wildfire Data Extraction

In this part, we create a structured prompt that will guide the language model in extracting relevant information from wildfire descriptions. The prompt sets clear instructions for the model, ensuring it adheres to the predefined JSON schema and only extracts the necessary details. The system message emphasizes that the model should not infer or add any extra data beyond what is explicitly mentioned in the text.

Additionally, we provide three examples to help the model understand how to handle different cases. The first demonstrates extracting the names of affected areas and the total burnt land in hectares from a text. The second shows how to handle multiple fire counts, so that the model picks the final or most significant number of fires. The third shows how to sum several burnt areas into a single total.

In [23]:
wildfires_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a specialized information extraction algorithm focused on analyzing the impact of wildfires in Australia and summarizing geographic data. "
            "Your task is to extract only relevant information from the input text according to the provided JSON schema. "
            "Do not infer or add any information that is not explicitly mentioned in the text, and strictly adhere to the JSON structure. "
        ),
        (
            "human",
            "For example, given the text: 'On 7th December a grassfire threatened the tiny town of Little River which is southwest of Melbourne. "
            "The fire started at 11:42am and burnt at least 1,240 hectares.', "
            "the list of affected areas should be: ['Little River', 'Melbourne'], and the total burnt land in hectares: 1240."
        ),
        (
            "human",
            "Another example, given the text: 'By 30 November, 144 bushfires were burning throughout Queensland at the end of a week where 200 fires had been battled by firefighters.', "
            "the total number of fires is 200."
        ),
        (
            "human",
            "Another example, given the text: 'A fire near Terang reached over 6,700 hectares (17,000 acres), one in Gnotuk-Camperdown approximately 200 hectares (490 acres), "
            "and in Garvoc approximately 4,000 hectares (9,900 acres).', "
            "the total area of burnt land is 10,900 hectares."
        ),

        ("human", "{user_input}"),
    ]
)

# Running an Example to Validate Results

To check how well our model extracts wildfire data, we run a simple example using one description from our dataset (row 4, the Queensland entry). First, we select this description from the DataFrame to serve as our input. Next, we format the query by embedding the description into the prompt we've designed for the task. Then, we invoke the language model, which has the extraction tool bound to it, to process the input and return the relevant information.

Once the model has completed the task, we print the input example so we can easily compare it with the output. Finally, we display the results extracted by the model, showing the key pieces of information it has pulled from the description, such as the number of fires, areas affected, and more.

In [16]:
# Select one wildfire description from the DataFrame (row 4, Queensland) to demonstrate the example
wildfire_ex = df['Description'][4]

# Format the query by embedding the user input (wildfire description) into the prompt
query = wildfires_prompt.format(user_input=wildfire_ex)

# Invoke the language model (LLM) with the tools bound to the description data to extract relevant information
ai_message = llm_with_tools_descr.invoke(query)

# Print the selected wildfire description as the input example
print(f"Example input:\n{wildfire_ex}\n")

# Output the extracted results generated by the model in response to the example input
print("Extracted output:")
print(ai_message.additional_kwargs['tool_calls'][0]['function']['arguments'])

Example input:
By 30 November, 144 bushfires were burning throughout Queensland at the end of a week where 200 fires had been battled by firefighters. High temperatures and strong winds made for difficult conditions and two houses, two cabins and 15 sheds were lost to fire with a further 14 houses damaged.  

Extracted output:
{"total_burnt_land":null,"number_of_deaths":null,"list_of_areas_affected":["Queensland"],"number_of_fires":200}


# Extracting Features from Descriptions

Here, we define a function that processes each wildfire description and extracts key features using our language model. First, the function formats the input description into a query, embedding it into the structured prompt we created earlier. Then, we call the language model, which is equipped with the bound tools for data extraction. The response from the model contains the extracted information, which we parse from a JSON string into a Python dictionary.

From the parsed dictionary, we retrieve the key pieces of data: the total burnt land, number of fires, number of deaths, and the list of affected areas. These values are returned for each description.

Next, we apply this function to every wildfire description in the DataFrame, extracting the structured data for each entry. The results are then stored in four new columns: total_burnt_land, number_of_fires, number_of_deaths, and list_of_areas_affected. Finally, we display the updated DataFrame to examine the extracted features for each wildfire instance.

In [24]:
def extract_features_from_description(description):
    # Format the query for the LLM
    query = wildfires_prompt.format(user_input=description)

    # Call the LLM with the tool bound
    ai_message = llm_with_tools_descr.invoke(query)

    # Extract tool_call_arguments from the LLM response
    tool_call_arguments = ai_message.additional_kwargs['tool_calls'][0]['function']['arguments']

    # Parse the arguments from JSON string into a Python dictionary
    parsed_arguments = json.loads(tool_call_arguments)

    # Extract the structured output
    total_burnt_land = parsed_arguments.get("total_burnt_land", None)
    number_of_fires = parsed_arguments.get("number_of_fires", None)
    number_of_deaths = parsed_arguments.get("number_of_deaths", None)
    list_of_areas_affected = parsed_arguments.get("list_of_areas_affected", [])

    return total_burnt_land, number_of_fires, number_of_deaths, list_of_areas_affected

# Apply the function to each row in the 'Description' column
df[['total_burnt_land', 'number_of_fires', 'number_of_deaths', 'list_of_areas_affected']] = df['Description'].apply(lambda desc: pd.Series(extract_features_from_description(desc)))

# Display the DataFrame with the results
display(df)

,Name,Period,Description,total_burnt_land,number_of_fires,number_of_deaths,list_of_areas_affected
0,New South Wales,July 2018,New South Wales experienced 525 fires in the l...,,525.0,,[]
1,New South Wales,August 2018,"On 15 August, a bushfire near Bega burned out ...",2300.0,,,[]
2,Victoria,December 2018,On 7th December a grassfire threatened the tin...,1240.0,,,"[Little River, Melbourne]"
3,Victoria,March 2019,On 1 March several fires were lit in the east ...,,,,[Bunyip State Park]
4,Queensland,November 2018,"By 30 November, 144 bushfires were burning thr...",,200.0,,[]
5,Tasmania,January 2019,A series of bushfires started in Tasmania from...,278000.0,3.0,,[Southwest National Park]
6,Western Australia,October 2018,"A bushfire, that started on 11 October, approx...",880000.0,1.0,,[Broome]
7,Western Australia,January 2019,"In mid-January, south western Australia experi...",160.0,1.0,,[Collie]
8,Western Australia,February 2019,"On 7 February, a large bushfire started in the...",144.0,,,"[Forrestdale, Harrisdale, Piara Waters]"
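
The extraction function above assumes that every response contains a tool call with valid JSON arguments. The sketch below shows one possible way to make that parsing step more defensive. The helper name `parse_tool_arguments` is hypothetical, and the plain dictionary stands in for `ai_message.additional_kwargs`, so the example runs without calling the API. The sample payload is the JSON string printed in the earlier example.

```python
import json

def parse_tool_arguments(additional_kwargs):
    """Return the extracted fields as a dict, falling back to defaults when parsing fails."""
    defaults = {
        "total_burnt_land": None,
        "number_of_fires": None,
        "number_of_deaths": None,
        "list_of_areas_affected": [],
    }

    # The model may answer without calling the tool at all
    tool_calls = additional_kwargs.get("tool_calls") or []
    if not tool_calls:
        return defaults

    try:
        parsed = json.loads(tool_calls[0]["function"]["arguments"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return defaults

    if not isinstance(parsed, dict):
        return defaults

    # Keep only the expected keys and normalise a null list to an empty list
    result = {key: parsed.get(key, default) for key, default in defaults.items()}
    if result["list_of_areas_affected"] is None:
        result["list_of_areas_affected"] = []
    return result

# Stand-in for ai_message.additional_kwargs, using the output shown earlier
sample_kwargs = {
    "tool_calls": [
        {
            "function": {
                "arguments": '{"total_burnt_land":null,"number_of_deaths":null,'
                             '"list_of_areas_affected":["Queensland"],"number_of_fires":200}'
            }
        }
    ]
}

print(parse_tool_arguments(sample_kwargs))
print(parse_tool_arguments({}))  # no tool call: all defaults
```

Returning defaults keeps the `apply` step from failing on a single bad row. Whether silently falling back is acceptable, or whether such rows should be logged or retried, is a design choice that depends on the analysis.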


In [19]:
# Display the data types of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Name                    9 non-null      object 
 1   Period                  9 non-null      object 
 2   Description             9 non-null      object 
 3   total_burnt_land        7 non-null      float64
 4   number_of_fires         8 non-null      float64
 5   number_of_deaths        1 non-null      float64
 6   list_of_areas_affected  9 non-null      object 
dtypes: float64(3), object(4)
memory usage: 632.0+ bytes


It's great to see that the output is well-structured according to the schema we defined, and that the model didn't generate any extraneous fields or produce values of the wrong data type. (The integer fields appear as float64 only because pandas stores missing values as NaN, which forces a float column.) In the majority of cases, the extracted responses are correct and align with our expectations. However, further analysis is needed for cases where the information is more complex, such as when additional calculations are required, or when the model needs to determine the final value of something mentioned multiple times in the description.
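
One possible starting point for that analysis is a rough cross-check of the summed burnt area against the description text. The sketch below is a heuristic, not part of the original pipeline: the helper names `sum_hectares` and `needs_review` are hypothetical, and the regex only counts numbers written directly before the word "hectares". It would miss other units or abbreviations, and it would double count an area that is reported twice. A mismatch therefore only flags a row for manual review. The two sample texts are the cleaned Little River description and the Terang example from the prompt.

```python
import re

def sum_hectares(text):
    """Sum every integer (with optional thousands separators) directly followed by 'hectares'."""
    matches = re.findall(r'(\d[\d,]*)\s+hectares', text)
    return sum(int(match.replace(',', '')) for match in matches)

def needs_review(description, extracted_total):
    """Flag a row when the regex-based sum disagrees with the value extracted by the model."""
    regex_total = sum_hectares(description)
    if extracted_total is None:
        # The model found no area, but the text mentions hectares
        return regex_total > 0
    return regex_total != extracted_total

little_river = (
    "On 7th December a grassfire threatened the tiny town of Little River which is "
    "south west from Melbourne. The fire started at 11:42am and burnt at least 1,240 hectares."
)
terang = (
    "A fire near Terang reached over 6,700 hectares (17,000 acres), one in Gnotuk-Camperdown "
    "approximately 200 hectares (490 acres), and in Garvoc approximately 4,000 hectares (9,900 acres)."
)

print(needs_review(little_river, 1240))   # extracted value matches the text
print(needs_review(terang, 10900))        # 6,700 + 200 + 4,000 matches the summed total
print(needs_review(terang, None))         # a missing value would be flagged
```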

In [22]:
# Save the final DataFrame to a CSV file named 'Australian_bushfires_2018_2019.csv' without including the index column
df.to_csv('Australian_bushfires_2018_2019.csv', index=False)