## **Objective**

Your goal is to scrape the 1975 Pacific hurricane season Wikipedia page and structure the scraped data in a DataFrame with the following columns:

* hurricane_storm_name

* date_start

* date_end

* number_of_deaths

* list_of_areas_affected

## **Methodology** ##

**Step 1**: Define the URL to scrape (i.e. https://en.wikipedia.org/wiki/1975_Pacific_hurricane_season), fetch its HTML content with the **requests** library, and parse it with the **bs4 (BeautifulSoup)** library. The text of the parsed page can then be used directly.

**Step 2**: Extract the relevant data directly with an LLM, using a data schema for structured output built with the **typing** and **pydantic** libraries. The schema defines a specific data structure for the LLM's output, which can then be converted to a DataFrame with minimal code. Two classes are created. The first is a **Hurricane** entity with the fields **hurricane_storm_name**, **date_start**, **date_end**, **number_of_deaths**, and **list_of_areas_affected**. The second is the **Data** class, which contains a list of **Hurricane** entities. This way, all of the data can be extracted at once into the **Data** class, instead of extracting the data for each hurricane separately.

**Step 3**: Pick an LLM, set up the model, and define a prompt template using the **langchain** library. In this example **"gpt-4o-mini"** was used because it is cheap and fast. Using **langchain** makes it easier to switch between different LLMs and keeps the prompt structure straightforward. Since we want structured data as output, we can simply pass the schema from the previously created **Data** class directly to the model.

**Step 4**: This is an iterative step in which different system prompts are tested with the LLM and adjusted based on the results. Typically, I start with a more generic system prompt to see what the LLM can do and, depending on its performance, consider making it more detailed, adding specific instructions, providing multi-shot examples, framing it as a chain-of-thought process, etc. At this point I might also consider whether a different LLM (better or more specific to the problem) could be used, or whether a different approach that breaks the initial problem into smaller ones brings any improvement. In this case, I briefly tested **"gpt-4o"**, which did provide better results, but chose to stay with **"gpt-4o-mini"** due to cost. In terms of different approaches, I also tried the following:
1. Adding an extra LLM before outputting the final results, which cleans up the original text and keeps only relevant information about hurricanes and tropical storms.
2. Adding an extra LLM before outputting the final results, which cleans up the original text and breaks it into segments where each hurricane in the text is grouped along with its relevant information (this was done by creating an additional **Data** schema).
3. Adding an extra LLM after outputting the final results, which runs a validation check on the previous response for missing or incorrect entries.
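
The alternatives listed above differ only in which optional stages are wrapped around the extraction call. The sketch below illustrates that structure under stated assumptions: `run_pipeline` and the three `toy_*` functions are hypothetical names, and the toy functions are plain-Python stand-ins, not real LLM calls. It is meant to show how the stages compose, not how the original experiments were implemented.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]


def run_pipeline(
    text: str,
    extract: Callable[[str], List[Record]],
    pre_clean: Optional[Callable[[str], str]] = None,
    validate: Optional[Callable[[str, List[Record]], List[Record]]] = None,
) -> List[Record]:
    """Run optional pre-clean -> extract -> optional validate, in that order."""
    if pre_clean is not None:
        text = pre_clean(text)  # approaches 1 and 2: an extra call before extraction
    records = extract(text)  # the single extraction call used in this notebook
    if validate is not None:
        records = validate(text, records)  # approach 3: an extra call after extraction
    return records


# Toy stand-ins (assumptions for illustration only, NOT real LLM calls)
def toy_pre_clean(text: str) -> str:
    # Keep only lines that mention a storm or a hurricane
    keep = [
        line for line in text.splitlines()
        if "storm" in line.lower() or "hurricane" in line.lower()
    ]
    return "\n".join(keep)


def toy_extract(text: str) -> List[Record]:
    # Pretend that every line is the name of one storm
    return [{"hurricane_storm_name": line.strip() or None} for line in text.splitlines()]


def toy_validate(text: str, records: List[Record]) -> List[Record]:
    # Drop records that have no name
    return [r for r in records if r["hurricane_storm_name"]]


sample = "Hurricane Luna\nSee also\nTropical Storm Lee"

# Simplest approach: one call
simple = run_pipeline(sample, toy_extract)

# Staged approach: three calls
staged = run_pipeline(sample, toy_extract, pre_clean=toy_pre_clean, validate=toy_validate)

print(len(simple), len(staged))
```

Every optional stage that is switched on adds one more model call, which is the cost trade-off weighed in Step 5.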

**Step 5**: Choose the best option out of the ones tried, based on cost and performance. In this case, I chose the simplest approach, the one demonstrated in this notebook, which can essentially be described as **Fetch -> Parse -> LLM -> Data**. This method performed as well as, if not better than, the other ones I tried, and it was also the most cost-efficient, since the number of tokens used is kept to a minimum and the final data are extracted with a single call. If the LLM were upgraded to **"gpt-4o"**, the results could be even better after slightly adjusting the system prompt, but I prioritized cost in this case.
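
To make the cost argument concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the rule of thumb of about 4 characters per token for English text, the 80,000-character page length, and the price of 1.0 per million input tokens are placeholders, not measured values or real pricing. Output tokens are ignored, and the multi-stage estimate assumes each stage reads the full page text.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate, assuming ~4 characters per token (a heuristic)."""
    return math.ceil(len(text) / chars_per_token)


def estimate_input_cost(
    n_calls: int,
    input_tokens_per_call: int,
    price_per_million_input_tokens: float,
) -> float:
    """Input-side cost only; output tokens are deliberately left out."""
    return n_calls * input_tokens_per_call * price_per_million_input_tokens / 1_000_000


# Placeholder page text (hypothetical length, not the real page)
page_text = "x" * 80_000
tokens = estimate_tokens(page_text)

# Hypothetical price, chosen only to make the comparison easy to read
PRICE = 1.0

single_call = estimate_input_cost(1, tokens, PRICE)  # Fetch -> Parse -> LLM -> Data
three_calls = estimate_input_cost(3, tokens, PRICE)  # pre-clean + extract + validate

print(tokens, single_call, three_calls)
```

For an exact count, a tokenizer matching the chosen model would be needed; the heuristic is only good enough to compare approaches against each other.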

### **1. Install langchain** ###

In [None]:
# Install necessary missing libraries
!pip install langchain
!pip install langchain_openai


### **2. Set up imports and API key, and define the URL to scrape** ###

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from typing import List, Optional
from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Set up your OpenAI API key
openai_api_key = ''

# Define url
url = "https://en.wikipedia.org/wiki/1975_Pacific_hurricane_season"

### **3. Define scraping function** ###

In [None]:
# This function fetches the URL and parses the HTML so that it can be used by the LLM later on
def get_url_soup(url):
    # Send a GET request to fetch the webpage content
    response = requests.get(url)

    # Raise requests.HTTPError on a 4xx/5xx status code, so that a failed
    # request never reaches the return statement without a parsed page
    response.raise_for_status()

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    return soup
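
Since cost depends on the number of tokens sent to the LLM, one optional refinement is to keep only the article body before calling `get_text()`. The sketch below is not part of the approach used in this notebook. It assumes that the article content lives in an element with the id `mw-content-text`, and it falls back to the whole page when that element is missing. The helper name `extract_main_text` and the inline sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup

# Tiny inline page used instead of a live request (hypothetical content)
SAMPLE_HTML = """
<html><body>
<div id="nav">Main menu Donate Log in</div>
<div id="mw-content-text"><p>Hurricane Luna formed somewhere around X.</p></div>
</body></html>
"""


def extract_main_text(html: str, container_id: str = "mw-content-text") -> str:
    """Return the text of the main content container, or of the whole page as a fallback."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: the article body sits in an element with this id
    container = soup.find(id=container_id)
    target = container if container is not None else soup

    # Join the text fragments with single spaces and strip surrounding whitespace
    return target.get_text(separator=" ", strip=True)


main_text = extract_main_text(SAMPLE_HTML)
print(main_text)
```

Whether this helps in practice depends on how much navigation and footer text the page carries, so it is worth comparing the extraction results with and without it.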

### **4. Define data schema** ###

In [None]:
class Hurricane(BaseModel):
    """Information about a hurricane."""

    # Description of the schema Hurricane
    # This doc-string is sent to the LLM as the description of the schema Hurricane,
    # and it can help to improve extraction results.

    hurricane_storm_name: Optional[str] = Field(
        default=None, description="The name of the hurricane or storm."
    )
    date_start: Optional[str] = Field(
        default=None, description="The start date of the hurricane."
    )
    date_end: Optional[str] = Field(
        default=None, description="The end date of the hurricane."
    )
    number_of_deaths: Optional[int] = Field(
        default=None, description="The number of deaths caused by the hurricane."
    )
    list_of_areas_affected: Optional[List[str]] = Field(
        default=None, description="A list of areas affected by the hurricane. Any location the hurricane passed from."
    )


class Data(BaseModel):
    """Extracted data about hurricanes."""

    # Creates a model so that we can extract multiple hurricane entities.
    hurricanes: List[Hurricane]

# This function converts the LLM's response to a DataFrame for easier viewing
def convert_response_to_df(response):
    # model_dump() returns a dict; its 'hurricanes' entry is a list with one dict per storm
    df = pd.DataFrame(response.model_dump()['hurricanes'])
    return df
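
The schema and the conversion step can be checked without calling an LLM. The sketch below uses a trimmed copy of the two classes (field descriptions omitted) and a hand-written payload that reuses the placeholder values from the examples in the system prompt. It is illustrative input, not real extraction output.

```python
from typing import List, Optional

from pydantic import BaseModel


class Hurricane(BaseModel):
    """Trimmed copy of the Hurricane schema, without field descriptions."""

    hurricane_storm_name: Optional[str] = None
    date_start: Optional[str] = None
    date_end: Optional[str] = None
    number_of_deaths: Optional[int] = None
    list_of_areas_affected: Optional[List[str]] = None


class Data(BaseModel):
    hurricanes: List[Hurricane]


# Hand-written payload shaped like a structured LLM response (placeholder values)
raw = {
    "hurricanes": [
        {
            "hurricane_storm_name": "Hurricane Luna",
            "date_start": "2002-03-13",
            "date_end": "2002-03-17",
            "number_of_deaths": 0,
            "list_of_areas_affected": ["somewhere around X"],
        },
        # Fields that are missing fall back to their default of None
        {"hurricane_storm_name": "Tropical Storm Lee"},
    ]
}

# Validate the payload against the schema, then flatten it to one dict per storm
parsed = Data.model_validate(raw)
rows = parsed.model_dump()["hurricanes"]

for row in rows:
    print(row)
```

Each dict in `rows` becomes one DataFrame row when the list is passed to `pd.DataFrame`, and fields the model left empty show up as `None`.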

### **5. Set up the LLM, define system prompt, prompt template, and get output** ###

In [None]:
# Setup LLM
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    api_key=openai_api_key
)

# Define system prompt
system_prompt = """
You are an AI assistant specialized in extracting structured information about hurricane storms from text. Your task is to analyze the given text and extract relevant details for all hurricane storms mentioned.

General Instructions:
1. Do not stop your response for any reason until you have extracted the relevant data for all hurricanes or tropical storms mentioned in the text. If you cannot find data for certain hurricane storms, just leave
the relevant fields blank. However, if it makes sense to include certain information then do it, it's better to add than to omit information.
2. If a piece of information is not available in the text and you cannot provide an answer, set the corresponding field as None. Do not attempt to fill it, unless you can infer its value indirectly.
3. When searching for the affected areas, know that a hurricane storm forms at an initial location and then passes through different locations until it dissipates at some point, so make sure to include any relevant
geographical information about the path the hurricane storm followed from beginning to end.  This could be information relating to specific landmarks, relative locations, areas, oceans, cities, countries, etc expressed
either directly (e.g. "X") or indirectly (e.g., "northwest of X"), or with approximate distances (e.g., "100mi away from X"). The affected areas could be more than 1. List any of them that you find for each hurricane storm.
Typically there should be at least 1 area mentioned.
4. Ask yourself the questions:
- Is there any location mentioned in the text where the hurricane storm started or formed at?
- Are there any locations mentioned in the text where the hurricane storm passed from after forming?
- Did the hurricane storm affect any locations indirectly?
- Is the location where the hurricane storm dissipated mentioned?
5. Make sure to include the location information as it is written in the original text. Do not omit any information.

Example 1:
"hurricane_storm_name": "Tropical Storm Lee",
"date_start": "2011-09-02",
"date_end": "2011-09-05",
"number_of_deaths": 18,
"list_of_areas_affected": ["X", "100mi inland from X", "Northeast X", "250mi south from X"]

Example 2:
"hurricane_storm_name": "Hurricane Tim",
"date_start": "2012-05-11",
"date_end": "2012-05-14",
"number_of_deaths": 2,
"list_of_areas_affected": ["close to X", "east north-east of X", "at X"]

Example 3:
"hurricane_storm_name": "Hurricane Luna",
"date_start": "2002-03-13",
"date_end": "2002-03-17",
"number_of_deaths": 0,
"list_of_areas_affected": ["somewhere around X"]
"""

# Define prompt template
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            system_prompt
        ),
        ("human", "{text}"),
    ]
)

# Scrape URL and get the text
text = get_url_soup(url).get_text()

# Create langchain runnable
runnable = prompt | llm.with_structured_output(schema=Data)

# Get response
response = runnable.invoke({"text": text})

# Convert response to DataFrame and print
df = convert_response_to_df(response)
print(df.to_markdown())

|    | hurricane_storm_name     | date_start   | date_end   |   number_of_deaths | list_of_areas_affected                                                                                                                                        |
|---:|:-------------------------|:-------------|:-----------|-------------------:|:--------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  0 | Hurricane Agatha         | 1975-06-02   | 1975-06-05 |                  0 | ['about 290 mi southwest of Acapulco', 'about 170 mi southwest of Zihuatanejo', 'about 140 mi south of the Tres Marias Islands']                              |
|  1 | Tropical Storm Bridget   | 1975-06-28   | 1975-07-03 |                  0 | ['about 575 mi south of the tip of the Baja California Peninsula']                                                                                            |
|  2 | Hurricane Carlotta   

### **6. All together** ###

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from typing import List, Optional
from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Set up your OpenAI API key
openai_api_key = ''

# Define url
url = "https://en.wikipedia.org/wiki/1975_Pacific_hurricane_season"

# This function fetches the URL and parses the HTML so that it can be used by the LLM later on
def get_url_soup(url):
    # Send a GET request to fetch the webpage content
    response = requests.get(url)

    # Raise requests.HTTPError on a 4xx/5xx status code, so that a failed
    # request never reaches the return statement without a parsed page
    response.raise_for_status()

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    return soup

# This function converts the LLM's response to a DataFrame for easier viewing
def convert_response_to_df(response):
    # model_dump() returns a dict; its 'hurricanes' entry is a list with one dict per storm
    df = pd.DataFrame(response.model_dump()['hurricanes'])
    return df

class Hurricane(BaseModel):
    """Information about a hurricane."""

    # Description of the schema Hurricane
    # This doc-string is sent to the LLM as the description of the schema Hurricane,
    # and it can help to improve extraction results.

    hurricane_storm_name: Optional[str] = Field(
        default=None, description="The name of the hurricane or storm."
    )
    date_start: Optional[str] = Field(
        default=None, description="The start date of the hurricane."
    )
    date_end: Optional[str] = Field(
        default=None, description="The end date of the hurricane."
    )
    number_of_deaths: Optional[int] = Field(
        default=None, description="The number of deaths caused by the hurricane."
    )
    list_of_areas_affected: Optional[List[str]] = Field(
        default=None, description="A list of areas affected by the hurricane. Any location the hurricane passed from."
    )


class Data(BaseModel):
    """Extracted data about hurricanes."""

    # Creates a model so that we can extract multiple hurricane entities.
    hurricanes: List[Hurricane]

# Setup LLM
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    api_key=openai_api_key
)

# Define system prompt
system_prompt = """
You are an AI assistant specialized in extracting structured information about hurricane storms from text. Your task is to analyze the given text and extract relevant details for all hurricane storms mentioned.

General Instructions:
1. Do not stop your response for any reason until you have extracted the relevant data for all hurricanes or tropical storms mentioned in the text. If you cannot find data for certain hurricane storms, just leave
the relevant fields blank. However, if it makes sense to include certain information then do it, it's better to add than to omit information.
2. If a piece of information is not available in the text and you cannot provide an answer, set the corresponding field as None. Do not attempt to fill it, unless you can infer its value indirectly.
3. When searching for the affected areas, know that a hurricane storm forms at an initial location and then passes through different locations until it dissipates at some point, so make sure to include any relevant
geographical information about the path the hurricane storm followed from beginning to end.  This could be information relating to specific landmarks, relative locations, areas, oceans, cities, countries, etc expressed
either directly (e.g. "X") or indirectly (e.g., "northwest of X"), or with approximate distances (e.g., "100mi away from X"). The affected areas could be more than 1. List any of them that you find for each hurricane storm.
Typically there should be at least 1 area mentioned.
4. Ask yourself the questions:
- Is there any location mentioned in the text where the hurricane storm started or formed at?
- Are there any locations mentioned in the text where the hurricane storm passed from after forming?
- Did the hurricane storm affect any locations indirectly?
- Is the location where the hurricane storm dissipated mentioned?
5. Make sure to include the location information as it is written in the original text. Do not omit any information.

Example 1:
"hurricane_storm_name": "Tropical Storm Lee",
"date_start": "2011-09-02",
"date_end": "2011-09-05",
"number_of_deaths": 18,
"list_of_areas_affected": ["X", "100mi inland from X", "Northeast X", "250mi south from X"]

Example 2:
"hurricane_storm_name": "Hurricane Tim",
"date_start": "2012-05-11",
"date_end": "2012-05-14",
"number_of_deaths": 2,
"list_of_areas_affected": ["close to X", "east north-east of X", "at X"]

Example 3:
"hurricane_storm_name": "Hurricane Luna",
"date_start": "2002-03-13",
"date_end": "2002-03-17",
"number_of_deaths": 0,
"list_of_areas_affected": ["somewhere around X"]
"""

# Define prompt template
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            system_prompt
        ),
        ("human", "{text}"),
    ]
)

# Scrape URL and get the text
text = get_url_soup(url).get_text()

# Create langchain runnable
runnable = prompt | llm.with_structured_output(schema=Data)

# Get response
response = runnable.invoke({"text": text})

# Convert response to DataFrame and print
df = convert_response_to_df(response)
print(df.to_markdown())
