# Extracting specific values from freeform text

Sometimes the information that we need to build a predictive model is buried in freeform text. An LLM can be a useful tool for extracting that information.

A Real Estate Automated Valuation Model (AVM) predicts the price that a property is likely to sell for. The land size of a property is an important input into the AVM. When listing a property, real estate agents sometimes leave the "land size" field empty and mention the land size only in the freeform "description" text.

We can build a better AVM if we can extract land size from the description text.

A scaled-up version of the use case presented here would extract this information from a large dataset of real estate listings. We could then use the text descriptions from property listings to improve the inputs to our Real Estate AVM.
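The scaled-up version could look roughly like the sketch below. It only fills in the land size where the structured field is empty. The function name `fill_missing_land_sizes`, the listing fields, and the `dummy_extractor` stand-in are all assumptions made for illustration; in practice the extractor would be a function that queries the LLM.

```python
from typing import Callable, Dict, List, Optional


def fill_missing_land_sizes(
    listings: List[Dict],
    extract_land_size: Callable[[str], Optional[int]],
) -> List[Dict]:
    """Return copies of the listings with missing land sizes filled in where possible."""
    filled = []
    for listing in listings:
        updated = dict(listing)  # Copy so that the input is not mutated.
        if updated.get("land_size") is None:
            updated["land_size"] = extract_land_size(updated.get("description", ""))
        filled.append(updated)
    return filled


# A stand-in extractor for demonstration only. A real one would query the LLM.
def dummy_extractor(description: str) -> Optional[int]:
    return 520 if "520m2" in description else None


listings = [
    {"id": 1, "land_size": None, "description": "520m2 of land (approx.)"},
    {"id": 2, "land_size": 650, "description": "Large family home."},
]
filled_listings = fill_missing_land_sizes(listings, dummy_extractor)
print(filled_listings)
```

Passing the extractor in as a function keeps the loop independent of any particular model or prompt.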


In [1]:
#Step 0: Import the openai package and set the API key. 
#         I have my API key stored in an environment variable for this demo.
#         In prod, you might prefer to use a secret store.
import openai
import os
import backoff
import unittest
import json

openai.api_key = os.getenv('MY_API_KEY')

### A real property description

This is a description of a real property that was listed by a real estate agent.

The text below comes from a real listing on realestate.com.au. The URL is https://www.realestate.com.au/sold/property-house-vic-black+rock-142012720?sourcePage=rea:p4ep:property-details&sourceElement=avm-currently-advertised-view-listing (accessed on 2023-06-15)

In [2]:
a_particular_property_description = """
Premium Beachside Position & Endless Opportunity
22A LOVE ST, BLACK ROCK
The beach at one end of the street, and shops & cafes at the other, this unique clinker brick home, with timeless warmth and classic comfort, is ready to welcome a new generation, with potential to go big on a contemporary renovation and make the most of one of Bayside's best addresses.

Showcasing house-like scale throughout and a radiant range of indoor/outdoor entertaining areas, the layout is filled with natural light from top to bottom. Under high ceilings, a wide-reaching open-plan living and dining room is instantly relaxed in nature, underpinned by a stone-topped kitchen with stainless steel appliances and servery. Timber doors slide open to a north facing deck, perfectly placed for alfresco indulgence. Peaceful, private, and protected, this sanctuary of space is thoughtfully landscaped with lush grass and mature plants. A second living zone with garden access leaves you spoiled for choice and makes for a family friendly design. The ground floor master enjoys walk through robes and a cleverly crafted ensuite/two-way central bathroom with separate WC. The upper level continues to impress with a fantastic rumpus retreat, two large, robed bedrooms (one with a built-in desk), a shared bathroom & separate WC.

Additional features include a spacious laundry with external access, storeroom/shed, stunning leadlight windows, custom timber cabinetry, ample storage, ducted heating, split-system heating/cooling, double undercover garage, and off-street parking for a further two cars. Walk along the Bay Trail, have Fish & Chips at Half Moon Bay, and embrace the Bayside way of life. All this, and with Black Rock Village, schools, and transport nearby. Lovingly maintained for immediate live in or lease out options, or alternatively add your modern vision and make your mark in this coveted location!

At a glance…

- Solid clinker brick home with huge proportions & plenty of potential

- 520m2 of land (approx.)

- Multiple north facing living zones

- Sliding doors to the private alfresco deck and lush garden – great for entertaining

- Stone topped kitchen with premium appliances & servery

- Ground floor master with walk through robes and two-way ensuite/central bathroom

- Upstairs rumpus/retreat plus 2 large bedrooms – perfect for the kids

- Undercover parking for 2 cars plus off-street parking for another 2

- Beach at the end of the street and Bluff Road shops & cafes at the other

- Close to schools, parks, public transport

Property Code: 2588
"""

### Our prompt template

In [3]:
#This is a "template". The property description is inserted into the template with Python's .format() method.
##When we specify the JSON template, we double up the curly braces so that they don't conflict with the .format() method.

prompt_template_v1 = """
What is the land size in square meters (m2) in the property description in the triple backticks?

```
{0}
```

What is the land size in square meters (m2) in the property description in the triple backticks?

Give your answer in JSON format. The land size value should be an integer. 
The answer must be in square meters, which is abbreviated as "m2". 
Use the following JSON template.

{{
    "unit_of_measurement": The unit of measurement as a string,
    "land_size": The land size as an integer
}}

"""

In [50]:
#Make our prompt
our_prompt = prompt_template_v1.format(a_particular_property_description)
#Print the prompt to verify it.
print(our_prompt)


What is the land size in square meters (m2) in the property description in the triple backticks?

```

Premium Beachside Position & Endless Opportunity
22A LOVE ST, BLACK ROCK
The beach at one end of the street, and shops & cafes at the other, this unique clinker brick home, with timeless warmth and classic comfort, is ready to welcome a new generation, with potential to go big on a contemporary renovation and make the most of one of Bayside's best addresses.

Showcasing house-like scale throughout and a radiant range of indoor/outdoor entertaining areas, the layout is filled with natural light from top to bottom. Under high ceilings, a wide-reaching open-plan living and dining room is instantly relaxed in nature, underpinned by a stone-topped kitchen with stainless steel appliances and servery. Timber doors slide open to a north facing deck, perfectly placed for alfresco indulgence. Peaceful, private, and protected, this sanctuary of space is thoughtfully landscaped with lush grass a
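The doubled curly braces in the template can be checked in isolation. Here is a minimal sketch that uses a made-up `demo_template`, much shorter than `prompt_template_v1`, to show that `.format()` fills in `{0}` and turns each doubled brace into a single literal brace.

```python
# A tiny template in the same style as prompt_template_v1 (hypothetical, for illustration).
demo_template = """Description: {0}
{{
    "land_size": The land size as an integer
}}"""

# .format() replaces {0} with the argument and collapses {{ and }} into { and }.
demo_prompt = demo_template.format("520m2 of land (approx.)")
print(demo_prompt)
```

If the braces were not doubled, `.format()` would try to treat the JSON template as a replacement field and raise an error.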

In [5]:
#We use the backoff package to handle rate limit errors.
## We wrap openai.chat.completions.create() in our own function
### and use the @backoff.on_exception() decorator.
@backoff.on_exception(backoff.expo, openai.RateLimitError)
def query_llm_single_turn(prompt, model="gpt-3.5-turbo", temperature=0, **kwargs):
    """
    This function queries the OpenAI Chat Completions API, with exponential backoff.
    
    Args:
        prompt(str): The prompt
        model(str): The model to use. The default is "gpt-3.5-turbo".
        temperature(float): The temperature to use. The default is 0.
        **kwargs: Additional keyword arguments to be passed to openai.chat.completions.create()
    Returns:
        An OpenAI ChatCompletion object.
    """
    ##Set up the messages list
    messages = [{"role": "user", "content": prompt}]
    return openai.chat.completions.create(
        model=model,    
        messages=messages,
        temperature=temperature,
        **kwargs)

In [6]:
response = query_llm_single_turn(our_prompt)
the_reply = response.choices[0].message.content
print(the_reply)

{
    "unit_of_measurement": "m2",
    "land_size": 520
}


In [7]:
class SingleReviewPromptUnitTest(unittest.TestCase):
    def test_valid_json(self):
        #Tests that the LLM returns valid JSON
        #1. Parse the JSON into a dictionary object.
        #   Did that work?
        response_as_dict = json.loads(the_reply)
        self.assertIsInstance(response_as_dict, dict)
        #2. Is there a "unit_of_measurement" key? 
        self.assertIn("unit_of_measurement", response_as_dict)
        #3. Is there a "land_size" key?
        self.assertIn("land_size", response_as_dict)
        #4. Is the land size correct?
        self.assertEqual(520, response_as_dict["land_size"])
        #5. Is the unit of measurement m2?
        self.assertEqual("m2", response_as_dict["unit_of_measurement"])

        
unittest.main(module=__name__, argv=[''], exit=False, verbosity=2)

test_valid_json (__main__.SingleReviewPromptUnitTest) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.001s

OK


<unittest.main.TestProgram at 0x7f1aa84aa040>
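The unit test above checks one reply. When processing many listings, we would probably want the same checks in a function that does not raise on a bad reply. The sketch below is one way to do that; the helper name `parse_land_size_reply` and the choice to return `None` for any unusable reply are assumptions, not part of the original notebook.

```python
import json


def parse_land_size_reply(reply_text):
    """Return the land size in m2 as an int, or None if the reply is unusable."""
    try:
        reply = json.loads(reply_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(reply, dict):
        return None
    if reply.get("unit_of_measurement") != "m2":
        return None
    land_size = reply.get("land_size")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(land_size, bool) or not isinstance(land_size, int):
        return None
    return land_size


good_result = parse_land_size_reply('{"unit_of_measurement": "m2", "land_size": 520}')
bad_result = parse_land_size_reply("Sorry, I could not find the land size.")
print(good_result, bad_result)
```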

# Using JSON Mode

JSON mode guarantees that the response will be valid JSON, as long as the response is not cut off. The above prompt has returned valid JSON on its own, but LLMs sometimes don't return valid JSON even when instructed to. OpenAI introduced JSON mode to guarantee valid JSON.

As of December 2023, the JSON mode feature only works with `gpt-3.5-turbo-1106` and `gpt-4-1106-preview`. You can check the JSON mode documentation to see which models it's available for today. Here's the link: [https://platform.openai.com/docs/guides/text-generation/json-mode](https://platform.openai.com/docs/guides/text-generation/json-mode)

And here's the link to the [Chat Completion API documentation](https://platform.openai.com/docs/api-reference/chat/create)

Now let's see how JSON mode works.

In [51]:
#We can use our query_llm_single_turn() function, but we need to switch models and include the response_format parameter
response = query_llm_single_turn(
    our_prompt, 
    model="gpt-3.5-turbo-1106", 
    response_format={ "type": "json_object" }
)
the_reply = response.choices[0].message.content
print(the_reply)

{
    "unit_of_measurement": "m2",
    "land_size": 520
}


In [7]:
#It's important to check the finish_reason attribute, as we will see shortly.
response.choices[0].finish_reason

'stop'

In [8]:
#How the call to openai.chat.completions.create() looks without our wrapper function
messages = [{"role": "user", "content": our_prompt}]
response = openai.chat.completions.create(
                model="gpt-3.5-turbo-1106",    
                messages=messages,
                temperature=0,
                response_format={ "type": "json_object" })

In [9]:
the_reply = response.choices[0].message.content
print(the_reply)

{
    "unit_of_measurement": "m2",
    "land_size": 520
}


In [10]:
response.choices[0].finish_reason

'stop'

### Note: When using JSON mode, we MUST instruct the LLM to output JSON. 
We cannot just turn JSON mode "on" and hope for the best.
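One way to avoid forgetting this is to check the messages before sending them. The helper below is a hypothetical sketch, not part of the OpenAI package. It assumes that looking for the exact string "JSON" in the message contents is a good enough test; check the JSON mode documentation for the precise requirement.

```python
def prompt_mentions_json(messages):
    """Return True if at least one message content contains the string "JSON"."""
    return any("JSON" in str(message.get("content", "")) for message in messages)


with_instruction = [{"role": "user", "content": "Give your answer in JSON format."}]
without_instruction = [{"role": "user", "content": "What is the land size?"}]
print(prompt_mentions_json(with_instruction))
print(prompt_mentions_json(without_instruction))
```

We could call this check inside our wrapper function whenever `response_format` is set, and raise an error before spending any tokens.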

# Improving our Real Estate Listing Summary Prompt

We have a prompt template that extracts the land size from real estate listings. Now let's improve our prompt to extract more information.

In [11]:
prompt_template_v2 = """
The property_description is in the XML tags. The property_description describes the real estate property.
Summarise the property_description as specified by the JSON template.

<property_description>
{0}
</property_description>

The property_description is in the XML tags. The property_description describes the real estate property.
Summarise the property_description as specified by the JSON template.


Use the following JSON template.
{{
    "newly_built_renovated": Is the property newly built, renovated, or neither? Choose one of the following options: ["newly built", "renovated", "neither"],
    "outdoor_deck": Does the property have an outdoor deck? Choose one of the following options: ["yes","no","unknown"],
    "construction_materials": What materials is the property built from? Or "NULL" if you don't know.,
    "parking_spaces": How many cars can I park on the property?,
    "land_size_unit_of_measurement": The unit of measurement as a string or "NULL" if you don't know.,
    "land_size": The land size as an integer or 0 if you don't know
}}

"""

In [13]:
#Make our prompt
our_prompt = prompt_template_v2.format(a_particular_property_description)
#Print the prompt to verify it.
print(our_prompt)


The property_description is in the XML tags. The property_description describes the real estate property.
Summarise the property_description as specified by the JSON template.

<property_description>

Premium Beachside Position & Endless Opportunity
22A LOVE ST, BLACK ROCK
The beach at one end of the street, and shops & cafes at the other, this unique clinker brick home, with timeless warmth and classic comfort, is ready to welcome a new generation, with potential to go big on a contemporary renovation and make the most of one of Bayside's best addresses.

Showcasing house-like scale throughout and a radiant range of indoor/outdoor entertaining areas, the layout is filled with natural light from top to bottom. Under high ceilings, a wide-reaching open-plan living and dining room is instantly relaxed in nature, underpinned by a stone-topped kitchen with stainless steel appliances and servery. Timber doors slide open to a north facing deck, perfectly placed for alfresco indulgence. Peac

In [14]:
#We can use our query_llm_single_turn() function, but we need to switch models and include the response_format parameter
response = query_llm_single_turn(our_prompt, model="gpt-3.5-turbo-1106", response_format={ "type": "json_object" })
the_reply = response.choices[0].message.content
print(the_reply)

{
    "newly_built_renovated": "renovated",
    "outdoor_deck": "yes",
    "construction_materials": "clinker brick",
    "parking_spaces": 4,
    "land_size_unit_of_measurement": "m2",
    "land_size": 520
}


## GPT 3.5 makes a mistake: it answers "renovated", but the listing only mentions potential for a renovation. Let's try GPT 4.

In [15]:
#We can use our query_llm_single_turn() function, but we need to switch models and include the response_format parameter
response = query_llm_single_turn(our_prompt, model="gpt-4-1106-preview", response_format={ "type": "json_object" })
the_reply = response.choices[0].message.content
print(the_reply)

{
    "newly_built_renovated": "neither",
    "outdoor_deck": "yes",
    "construction_materials": "clinker brick",
    "parking_spaces": 4,
    "land_size_unit_of_measurement": "m2",
    "land_size": 520
}


## GPT 4 succeeds, but GPT 3.5 fails. But can we write a prompt that will work with GPT 3.5?

In [32]:
prompt_template_v3 = """
The property_description is in the XML tags. The property_description describes the real estate property.
Summarise the property_description as specified by the JSON template.

<property_description>
{0}
</property_description>

The property_description is in the XML tags. The property_description describes the real estate property.
Summarise the property_description as specified by the JSON template.


Use the following JSON template.
{{
    "quotes_about_newly_built_renovated": What does the property_description say about any recent renovations, if the property was recently constructed, or if there is potential to renovate? Quote exact passages from the text.,
    "newly_built_renovated": Is the property newly built, renovated, or neither? Choose one of the following options: ["newly built", "renovated", "neither"],
    "outdoor_deck": Does the property have an outdoor deck? Choose one of the following options: ["yes","no","unknown"],
    "construction_materials": What materials is the property built from? Or "NULL" if you don't know.,
    "parking_spaces": How many cars can I park on the property?,
    "land_size_unit_of_measurement": The unit of measurement as a string or "NULL" if you don't know.,
    "land_size": The land size as an integer or 0 if you don't know
}}

"""

In [33]:
#Make our prompt
our_prompt = prompt_template_v3.format(a_particular_property_description)
#Print the prompt to verify it.
print(our_prompt)


The property_description is in the XML tags. The property_description describes the real estate property.
Summarise the property_description as specified by the JSON template.

<property_description>

Premium Beachside Position & Endless Opportunity
22A LOVE ST, BLACK ROCK
The beach at one end of the street, and shops & cafes at the other, this unique clinker brick home, with timeless warmth and classic comfort, is ready to welcome a new generation, with potential to go big on a contemporary renovation and make the most of one of Bayside's best addresses.

Showcasing house-like scale throughout and a radiant range of indoor/outdoor entertaining areas, the layout is filled with natural light from top to bottom. Under high ceilings, a wide-reaching open-plan living and dining room is instantly relaxed in nature, underpinned by a stone-topped kitchen with stainless steel appliances and servery. Timber doors slide open to a north facing deck, perfectly placed for alfresco indulgence. Peac

In [34]:
#We can use our query_llm_single_turn() function, but we need to switch models and include the response_format parameter
response = query_llm_single_turn(our_prompt, model="gpt-3.5-turbo-1106", response_format={ "type": "json_object" })
the_reply = response.choices[0].message.content
print(the_reply)

{
    "quotes_about_newly_built_renovated": "this unique clinker brick home, with timeless warmth and classic comfort, is ready to welcome a new generation, with potential to go big on a contemporary renovation",
    "newly_built_renovated": "neither",
    "outdoor_deck": "yes",
    "construction_materials": "clinker brick",
    "parking_spaces": 4,
    "land_size_unit_of_measurement": "m2",
    "land_size": 520
}
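Prompt v3 asks the model to quote the relevant passage before it classifies the property. A natural follow-up check is whether the quote really appears in the description. The sketch below is one possible check; the helper name `quote_appears_in_text` and the decision to ignore case and whitespace differences are assumptions.

```python
def quote_appears_in_text(quote, source_text):
    """Return True if the quote appears in the source text, ignoring case and whitespace differences."""

    def normalise(text):
        # Collapse runs of whitespace and lowercase the text.
        return " ".join(text.split()).lower()

    normalised_quote = normalise(quote)
    if not normalised_quote:
        return False  # An empty quote tells us nothing.
    return normalised_quote in normalise(source_text)


description_excerpt = (
    "this unique clinker brick home, with timeless warmth and classic comfort, "
    "is ready to welcome a new generation, with potential to go big on a "
    "contemporary renovation"
)
print(quote_appears_in_text("potential to go big on a contemporary renovation", description_excerpt))
print(quote_appears_in_text("recently renovated throughout", description_excerpt))
```

A quote that fails this check is a hint that the model has paraphrased or invented text, so the classification that follows it deserves a second look.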


# Prompt Engineering is about finding the prompt that will work with the LLM that you have. 
Your employer may only have budget for GPT 3.5, because GPT 4 is 10 times more expensive. The costs add up when you are sending thousands of queries per day.
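To see how the costs add up, we can do the arithmetic. Every number in the sketch below is a placeholder chosen for illustration, not a real price. Only the 10 times ratio comes from the text above. Check the current pricing page before budgeting.

```python
# All numbers below are placeholders for illustration, not real prices.
queries_per_day = 5000
tokens_per_query = 1500  # Prompt plus completion tokens (assumed).
cheap_price_per_1k_tokens = 0.001  # Hypothetical price in dollars.
expensive_price_per_1k_tokens = 10 * cheap_price_per_1k_tokens  # The 10x ratio quoted above.


def daily_cost(queries, tokens_per_query, price_per_1k_tokens):
    """Estimate the daily cost in dollars for a given query volume."""
    return queries * tokens_per_query / 1000 * price_per_1k_tokens


cheap_daily_cost = daily_cost(queries_per_day, tokens_per_query, cheap_price_per_1k_tokens)
expensive_daily_cost = daily_cost(queries_per_day, tokens_per_query, expensive_price_per_1k_tokens)
print(f"Cheaper model:  ${cheap_daily_cost:.2f} per day")
print(f"Pricier model:  ${expensive_daily_cost:.2f} per day")
```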

# Demonstration of finish_reason and max_tokens

In [41]:
prompt_template_v4 = """
The property_description is in the XML tags. The property_description describes the real estate property.
Summarise the property_description as specified by the JSON template.

<property_description>
{0}
</property_description>

The property_description is in the XML tags. The property_description describes the real estate property.
Summarise the property_description as specified by the JSON template.


Use the following JSON template.
{{
    "quotes_about_newly_built_renovated": What does the property_description say about any recent renovations, if the property was recently constructed, or if there is potential to renovate? Quote exact passages from the text.,
    "newly_built_renovated": Is the property newly built, renovated, or neither? Choose one of the following options: ["newly built", "renovated", "neither"],
    "outdoor_deck": Does the property have an outdoor deck? Choose one of the following options: ["yes","no","unknown"],
    "construction_materials": What materials is the property built from? Or "NULL" if you don't know.,
    "parking_spaces": How many cars can I park on the property?,
    "land_size_unit_of_measurement": The unit of measurement as a string or "NULL" if you don't know.,
    "land_size": The land size as an integer or 0 if you don't know.,
    "property_description_summary": Summarise the property_description. Focus on newly_built_renovated, outdoor_deck, construction_materials, parking_spaces, and land_size.,
    "property_description_summary_translation": Translate the property_description_summary into Chinese Simplified.,    
}}

"""

In [42]:
#Make our prompt
our_prompt = prompt_template_v4.format(a_particular_property_description)
#Print the prompt to verify it.
print(our_prompt)


The property_description is in the XML tags. The property_description describes the real estate property.
Summarise the property_description as specified by the JSON template.

<property_description>

Premium Beachside Position & Endless Opportunity
22A LOVE ST, BLACK ROCK
The beach at one end of the street, and shops & cafes at the other, this unique clinker brick home, with timeless warmth and classic comfort, is ready to welcome a new generation, with potential to go big on a contemporary renovation and make the most of one of Bayside's best addresses.

Showcasing house-like scale throughout and a radiant range of indoor/outdoor entertaining areas, the layout is filled with natural light from top to bottom. Under high ceilings, a wide-reaching open-plan living and dining room is instantly relaxed in nature, underpinned by a stone-topped kitchen with stainless steel appliances and servery. Timber doors slide open to a north facing deck, perfectly placed for alfresco indulgence. Peac

In [43]:
#We can use our query_llm_single_turn() function, but we need to switch models and include the response_format parameter
response = query_llm_single_turn(our_prompt, model="gpt-3.5-turbo-1106", response_format={ "type": "json_object" })
the_reply = response.choices[0].message.content
print(the_reply)

{
    "quotes_about_newly_built_renovated": "this unique clinker brick home, with timeless warmth and classic comfort, is ready to welcome a new generation, with potential to go big on a contemporary renovation",
    "newly_built_renovated": "neither",
    "outdoor_deck": "yes",
    "construction_materials": "clinker brick",
    "parking_spaces": 4,
    "land_size_unit_of_measurement": "m2",
    "land_size": 520,
    "property_description_summary": "The property is not newly built or renovated, but it has a potential for contemporary renovation. It has a clinker brick construction, a spacious outdoor deck, parking for 4 cars, and a land size of 520m2.",
    "property_description_summary_translation": "该物业既不是新建的也没有翻新，但有潜力进行现代化装修。它采用石砖建造，拥有宽敞的户外阳台，可停放4辆汽车，土地面积为520平方米。"
}


In [44]:
response.choices[0].finish_reason

'stop'

In [47]:
#Now we set a limit on the maximum number of tokens that the query can return.
## We use the max_tokens parameter
response = query_llm_single_turn(
    our_prompt, 
    model="gpt-3.5-turbo-1106", 
    response_format={ "type": "json_object" },
    max_tokens=170 #The max_tokens parameter limits the maximum number of tokens that the LLM can return.
    )
the_reply = response.choices[0].message.content
print(the_reply)

{
    "quotes_about_newly_built_renovated": "this unique clinker brick home, with timeless warmth and classic comfort, is ready to welcome a new generation, with potential to go big on a contemporary renovation",
    "newly_built_renovated": "neither",
    "outdoor_deck": "yes",
    "construction_materials": "clinker brick",
    "parking_spaces": 4,
    "land_size_unit_of_measurement": "m2",
    "land_size": 520,
    "property_description_summary": "The property is not newly built or renovated, but it has a potential for contemporary renovation. It has a outdoor deck, is built from clinker brick, has parking spaces for 4 cars, and a land size of 520m2.",
    "property_description_summary


In [48]:
response.choices[0].finish_reason

'length'

## We get invalid JSON when we run out of tokens. In that case, finish_reason is 'length'.
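In practice, this means we should look at `finish_reason` before we try to parse the reply. The sketch below shows one way to do it. The helper names `parse_complete_json_reply` and `fake_response` are made up for this example, and the fake response objects only mimic the attributes used in this notebook so that the code runs without calling the API.

```python
import json
from types import SimpleNamespace


def parse_complete_json_reply(response):
    """Parse the first choice as JSON, but only if the model finished normally."""
    choice = response.choices[0]
    if choice.finish_reason != "stop":
        raise ValueError(f"Incomplete reply (finish_reason={choice.finish_reason!r})")
    return json.loads(choice.message.content)


# Stand-in response objects so that the sketch runs without calling the API.
def fake_response(content, finish_reason):
    message = SimpleNamespace(content=content)
    choice = SimpleNamespace(message=message, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


complete = fake_response('{"land_size": 520}', "stop")
truncated = fake_response('{"land_size": 520, "property_description_summary', "length")

print(parse_complete_json_reply(complete))
try:
    parse_complete_json_reply(truncated)
except ValueError as error:
    print(error)
```

Raising an error keeps truncated replies out of the dataset. A caller could catch it and retry with a larger `max_tokens` value.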

### JSON mode caveat: The JSON is not guaranteed to match the template that we specify.
JSON mode guarantees valid JSON. But it does not guarantee that the JSON will match our template.
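Because of this caveat, it is worth checking the parsed reply against the template ourselves. The sketch below checks the field names and types from prompt v2. The names `EXPECTED_FIELDS` and `find_template_problems` are assumptions made for this example, and the checks are deliberately simple.

```python
# Field names and expected Python types, taken from the v2 JSON template.
EXPECTED_FIELDS = {
    "newly_built_renovated": str,
    "outdoor_deck": str,
    "construction_materials": str,
    "parking_spaces": int,
    "land_size_unit_of_measurement": str,
    "land_size": int,
}
NEWLY_BUILT_OPTIONS = {"newly built", "renovated", "neither"}


def find_template_problems(reply):
    """Return a list of human-readable problems. An empty list means the reply matches."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in reply:
            problems.append(f"missing field: {field}")
        elif type(reply[field]) is not expected_type:
            problems.append(f"wrong type for {field}: {type(reply[field]).__name__}")
    if reply.get("newly_built_renovated") not in NEWLY_BUILT_OPTIONS:
        problems.append("newly_built_renovated is not one of the allowed options")
    return problems


good_reply = {
    "newly_built_renovated": "neither",
    "outdoor_deck": "yes",
    "construction_materials": "clinker brick",
    "parking_spaces": 4,
    "land_size_unit_of_measurement": "m2",
    "land_size": 520,
}
bad_reply = {"newly_built_renovated": "maybe", "land_size": "520"}

print(find_template_problems(good_reply))
print(find_template_problems(bad_reply))
```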

Copyright &copy; Slava Razbash and AI Upskill (aiupskill.io)