<a href="https://colab.research.google.com/github/Liztomania/NLP_Kurs_2024/blob/main/06_QA_based_Entity_Extraction_EP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# QA-based Entity Extraction

Created by Sarah Oberbichler [ORCID](https://orcid.org/0000-0002-1031-2759)


QA-based Entity Extraction transforms entity recognition into a question-answering task. Instead of directly labeling words in text as entities, it asks specific questions like "What companies are mentioned?" or "Who are the people in this text?" The model then responds by extracting the relevant entities from the text as answers to these questions. This approach makes entity extraction more flexible and intuitive, as new entity types can be added simply by asking new questions, though it may be more computationally intensive than traditional sequence labeling methods.

### NuExtract Language Model

For named-entity (NE) extraction, we use the NuExtract model v1.5. NuExtract is trained on a private, high-quality dataset for structured information extraction. It supports long documents and several languages (English, French, Spanish, German, Portuguese, and Italian).
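
To make the question-answering framing concrete, the sketch below shows how a template and a text could be combined into a single prompt. The `<|input|>` / `### Template:` / `### Text:` / `<|output|>` layout mirrors the prompt string used in the model cell of this notebook; the helper name `build_prompt`, the small template, and the sample sentence are illustrative assumptions. Each template key acts as a question, so adding a new entity type only requires adding a key.

```python
import json

def build_prompt(template: dict, text: str) -> str:
    """Combine a template and a text into one extraction prompt (hypothetical helper)."""
    template_json = json.dumps(template, indent=4)
    return f"<|input|>\n### Template:\n{template_json}\n### Text:\n{text}\n\n<|output|>"

# Each key plays the role of a question such as "Which locations are mentioned?"
template = {"Locations": [], "Persons": []}

# Adding a new entity type only requires a new key
template["Organizations"] = []

# Made-up sample sentence, for illustration only
prompt = build_prompt(template, "Neapel, 7. Juni. Ein Erdstoß wurde verspürt.")
print(prompt)
```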

## Importing the Dataset

We import a dataset that contains individual articles.

In [3]:
!git clone https://github.com/ieg-dhr/NLP-Course4Humanities_2024.git

fatal: destination path 'NLP-Course4Humanities_2024' already exists and is not an empty directory.


In [4]:
import pandas as pd

articles_df = pd.read_excel('/content/NLP-Course4Humanities_2024/datasets/earthquake_articles.xlsx')

articles_df = articles_df[:20]
articles_df.head()

Unnamed: 0,page_id,pagenumber,paper_title,provider_ddb_id,provider,zdb_id,publication_date,place_of_distribution,language,thumbnail,pagefulltext,pagename,preview_reference,plainpagefulltext,extracted_article,article_part,total_parts,Unnamed: 17,Unnamed: 18,extracted_article_clean
0,3ML37O5BXQD3EYOR5S777GQZKGDOCDIM-ALTO8633337_D...,2,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1911-12-23 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],dac9b430-b364-4fa9-9367-0e9c795c1103,['/data/altos/3M/L3/3ML37O5BXQD3EYOR5S777GQZKG...,ALTO8633337_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Samstag , 23 . Dezember Offiziere , und über 3...",**Extracted Article**\n\n* **Headline:** Erdst...,1,1,,,"\n\nNew York, 22. Dez. (Telegr.) In der Stadt ..."
1,42H5V33ALNFVOIG4SBM4YW3WQVKKHDMO-ALTO8384861_D...,10,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1910-06-07 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],0b5c3ed6-af50-4ea1-a206-78bf2e260dc1,['/data/altos/42/H5/42H5V33ALNFVOIG4SBM4YW3WQV...,ALTO8384861_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Dienstag , 7 . Juni Kölnische Zeitung s Mittag...",**Extracted Article 1:**\n\n* **Headline:** Er...,1,1,,,"\n\nNeapel, 7. Juni. (Telegr.) Ein wellenförmi..."
2,42H5V33ALNFVOIG4SBM4YW3WQVKKHDMO-ALTO8384865_D...,14,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1910-06-07 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],0b5c3ed6-af50-4ea1-a206-78bf2e260dc1,['/data/altos/42/H5/42H5V33ALNFVOIG4SBM4YW3WQV...,ALTO8384865_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Dienstag , 7 . Juni Kölnische Zeitung 8 Abend ...",**Extracted Article**\n\n* **Headline:** Erdst...,1,1,,,"\n\nFoggia, 7. Juni. (Telegr.) Ein heftiger Er..."
3,477TOWWZGBGVO2T47FBUNYAPVERZTJQC-ALTO8170232_D...,2,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1909-04-18 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],457082cd-ea69-4273-a359-82e5ec7191d9,['/data/altos/47/7T/477TOWWZGBGVO2T47FBUNYAPVE...,ALTO8170232_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Sonntag , 18 . April — für das Königreich Sach...",**Extracted Article 1:**\n\n* **Headline:** Er...,1,2,,,"\n\nBrancaleone (Kalabrien), 17. April. (Teleg..."
4,477TOWWZGBGVO2T47FBUNYAPVERZTJQC-ALTO8170232_D...,2,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1909-04-18 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],457082cd-ea69-4273-a359-82e5ec7191d9,['/data/altos/47/7T/477TOWWZGBGVO2T47FBUNYAPVE...,ALTO8170232_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Sonntag , 18 . April — für das Königreich Sach...",**Extracted Article 2:**\n\n* **Headline:** Ki...,2,2,,,\n\nDer Kircheneinsturz in Hohensalza ist dort...


## Defining a Template for Information Extraction

As an example, we extract information from earthquake reports. We want the model to distinguish between earthquake locations and dateline locations, and to extract the date of the earthquake, its magnitude, the persons involved, as well as the casualties, damage, and rescue efforts.

In [5]:
# Define a template for earthquake information extraction
earthquake_template = """{
    "Earthquake": {
        "Earthquake Locations": "",
        "Dateline Locations": "",
        "Date": "",
        "Magnitude": "",
        "Persons_Involved": [],
        "Casualties": {
            "Fatalities": "",
            "Injured": ""
        },
        "Damage": {
            "Infrastructure Damage": "",
            "Economic Impact": ""
        },
        "Rescue Efforts": ""
    }
}"""
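
Because the template is passed to the model as a JSON string, checking that it parses and listing its leaf fields can catch typos early. The following is a minimal sketch; `list_fields` is a hypothetical helper, and the shortened template is only for illustration.

```python
import json

def list_fields(node, prefix=""):
    """Return dotted paths of all leaf fields in a nested template (hypothetical helper)."""
    if isinstance(node, dict) and node:
        paths = []
        for key, value in node.items():
            paths += list_fields(value, f"{prefix}.{key}" if prefix else key)
        return paths
    return [prefix]

# Shortened version of the earthquake template, for illustration only
template_str = """{
    "Earthquake": {
        "Date": "",
        "Persons_Involved": [],
        "Casualties": {"Fatalities": "", "Injured": ""}
    }
}"""

template = json.loads(template_str)  # raises json.JSONDecodeError if malformed
fields = list_fields(template)
print(fields)
```

The dotted paths make it easy to see which fields are nested and therefore need a two-step lookup when the model output is parsed.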

## Running the Model

The code below extracts the named entities using the extraction template. By default, the model returns its output as JSON. We add the extracted entities to our DataFrame, which is then saved as an Excel file.

In [6]:
import json
import torch
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer

def predict_NuExtract(model, tokenizer, texts, template, batch_size=1, max_length=10_000, max_new_tokens=4_000):
    """
    Extract structured information from texts using NuExtract model

    :param model: Loaded NuExtract model
    :param tokenizer: Corresponding tokenizer
    :param texts: List of input texts to extract from
    :param template: JSON template for structured extraction
    :param batch_size: Number of texts to process in parallel
    :param max_length: Maximum input length
    :param max_new_tokens: Maximum tokens to generate in output
    :return: List of extracted information
    """
    template = json.dumps(json.loads(template), indent=4)
    prompts = [f"""<|input|>\n### Template:\n{template}\n### Text:\n{text}\n\n<|output|>""" for text in texts]

    outputs = []
    with torch.no_grad():
        for i in range(0, len(prompts), batch_size):
            batch_prompts = prompts[i:i+batch_size]
            batch_encodings = tokenizer(batch_prompts, return_tensors="pt", truncation=True, padding=True, max_length=max_length).to(model.device)

            pred_ids = model.generate(**batch_encodings, max_new_tokens=max_new_tokens)
            outputs += tokenizer.batch_decode(pred_ids, skip_special_tokens=True)

    # Keep only the generated part that follows the output marker
    return [output.split("<|output|>")[1] for output in outputs]

# Load the NuExtract model
model_name = "numind/NuExtract-v1.5"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)


# Replace missing values with empty strings, then keep only non-empty texts
articles_df['extracted_article_clean'] = articles_df['extracted_article_clean'].fillna('').astype(str)
valid_texts = articles_df['extracted_article_clean'][articles_df['extracted_article_clean'].str.strip() != '']

# Extract information in batches to manage memory
batch_size = 10  # Adjust based on your available memory
all_predictions = []

for i in range(0, len(valid_texts), batch_size):
    batch_texts = valid_texts[i:i+batch_size].tolist()
    batch_predictions = predict_NuExtract(model, tokenizer, batch_texts, earthquake_template)
    all_predictions.extend(batch_predictions)

# Parse predictions and add to DataFrame
articles_df['earthquake_extraction'] = pd.Series([None] * len(articles_df))
articles_df.loc[valid_texts.index, 'earthquake_extraction'] = all_predictions

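# Optional: Flatten the JSON extraction for easier analysis
def parse_earthquake_info(extraction):
    """Return the 'Earthquake' object of a model output, or {} if it cannot be parsed."""
    try:
        parsed = json.loads(extraction)
        return parsed.get('Earthquake', {})
    except (TypeError, ValueError, AttributeError):
        # None values, malformed JSON, or a non-object top level
        return {}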

articles_df['earthquake_locations'] = articles_df['earthquake_extraction'].apply(lambda x: parse_earthquake_info(x).get('Earthquake Locations', ''))
articles_df['dateline_locations'] = articles_df['earthquake_extraction'].apply(lambda x: parse_earthquake_info(x).get('Dateline Locations', ''))
articles_df['date'] = articles_df['earthquake_extraction'].apply(lambda x: parse_earthquake_info(x).get('Date', ''))
articles_df['magnitude'] = articles_df['earthquake_extraction'].apply(lambda x: parse_earthquake_info(x).get('Magnitude', ''))
articles_df['persons_involved'] = articles_df['earthquake_extraction'].apply(lambda x: parse_earthquake_info(x).get('Persons_Involved', []))
# Casualties and damage are nested objects in the template, so they need a two-step lookup
articles_df['causalaties'] = articles_df['earthquake_extraction'].apply(lambda x: parse_earthquake_info(x).get('Casualties', {}).get('Fatalities', ''))
articles_df['infrastructure_damage'] = articles_df['earthquake_extraction'].apply(lambda x: parse_earthquake_info(x).get('Damage', {}).get('Infrastructure Damage', ''))
articles_df['economic_impact'] = articles_df['earthquake_extraction'].apply(lambda x: parse_earthquake_info(x).get('Damage', {}).get('Economic Impact', ''))
articles_df['rescue_effort'] = articles_df['earthquake_extraction'].apply(lambda x: parse_earthquake_info(x).get('Rescue Efforts', ''))


# Optional: Save results
articles_df.to_excel('earthquake_extractions.xlsx', index=False)


articles_df


Unnamed: 0,page_id,pagenumber,paper_title,provider_ddb_id,provider,zdb_id,publication_date,place_of_distribution,language,thumbnail,...,earthquake_extraction,earthquake_locations,dateline_locations,date,magnitude,persons_involved,causalaties,infrastructure_damage,economic_impact,rescue_effort
0,3ML37O5BXQD3EYOR5S777GQZKGDOCDIM-ALTO8633337_D...,2,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1911-12-23 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],dac9b430-b364-4fa9-9367-0e9c795c1103,...,"{""Earthquake"": {""Earthquake Locations"": ""Mexik...",Mexiko,New York,22. Dez.,,[],,,,
1,42H5V33ALNFVOIG4SBM4YW3WQVKKHDMO-ALTO8384861_D...,10,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1910-06-07 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],0b5c3ed6-af50-4ea1-a206-78bf2e260dc1,...,"{""Earthquake"": {""Earthquake Locations"": ""Neape...","Neapel, Benevento, Cosenza, Castellamare di St...",Neapel,7. Juni,,[],,,,Der Präfekt hat militärische Hilfe abgesandt
2,42H5V33ALNFVOIG4SBM4YW3WQVKKHDMO-ALTO8384865_D...,14,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1910-06-07 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],0b5c3ed6-af50-4ea1-a206-78bf2e260dc1,...,"{""Earthquake"": {""Earthquake Locations"": ""Foggi...",Foggia,Provinz,7. Juni,,[],,,,
3,477TOWWZGBGVO2T47FBUNYAPVERZTJQC-ALTO8170232_D...,2,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1909-04-18 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],457082cd-ea69-4273-a359-82e5ec7191d9,...,"{""Earthquake"": {""Earthquake Locations"": ""Branc...",Brancaleone (Kalabrien),,17. April,,[],,,,
4,477TOWWZGBGVO2T47FBUNYAPVERZTJQC-ALTO8170232_D...,2,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1909-04-18 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],457082cd-ea69-4273-a359-82e5ec7191d9,...,"{""Earthquake"": {""Earthquake Locations"": """", ""D...",,,,,"[Dr. Leoy, Erster Bürgermeister Treinies, Stad...",,,,
5,4R2P7K6IV6RKWVC2FKYMDFB3HPL7WMOH-ALTO8596509_D...,10,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1911-11-17 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],ad02f11f-3813-4992-8128-5146c67c4642,...,"\n{\n ""Earthquake"": {\n ""Earthquake ...","Mannheim, Heidelberg, Rottweil, München, Ungsb...",,16. Nov.,,[],,,,
6,4R2P7K6IV6RKWVC2FKYMDFB3HPL7WMOH-ALTO8596517_D...,18,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1911-11-17 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],ad02f11f-3813-4992-8128-5146c67c4642,...,"\n{\n ""Earthquake"": {\n ""Earthquake ...",Alpen oder Oberitalien,"Aachen, Trier, Berlin, Karlsruhe, Konstanz, St...",17. Nov.,,[],,,,
7,5MPASOAC4XSDPYSS43PHBU63F4I4KVWT-ALTO8473728_D...,6,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1910-12-21 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],ab188456-c755-4122-8de7-320ef713ec5a,...,"{""Earthquake"": {""Earthquake Locations"": ""G\u00...",Güstüb,,20. Dez.,,[],,,,
8,5O2IGDPPV7G73A4BBZANSZ3UBTU2J5YP-ALTO8453383_D...,2,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1910-11-09 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],3b4ef142-ff33-4276-ac13-24ea899ea108,...,"{""Earthquake"": {""Earthquake Locations"": ""Stolb...",Stolberg,Eschweiler und Aachen,Sonntag auf Montag,,[],,,,
9,5QFXNBPKD6RZIITU75POKAYECEAUWLYC-ALTO8203997_D...,2,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1909-05-29 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],ab754765-eccd-4c56-acf6-e76867d8102c,...,"{""Earthquake"": {""Earthquake Locations"": """", ""D...",,,,,[],,,,
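
The nested parts of the extraction (`Casualties`, `Damage`) can also be flattened generically instead of field by field. Below is a minimal sketch using only the standard library; the `flatten` helper and the sample string are assumptions made for illustration and are not actual model output.

```python
import json

def flatten(obj: dict, parent: str = "", sep: str = ".") -> dict:
    """Flatten nested dictionaries into one level with joined keys (hypothetical helper)."""
    flat = {}
    for key, value in obj.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

# Made-up output string in the shape of the template
raw = '{"Earthquake": {"Date": "7. Juni", "Casualties": {"Fatalities": "", "Injured": "3"}}}'

try:
    record = flatten(json.loads(raw).get("Earthquake", {}))
except json.JSONDecodeError:
    record = {}  # fall back to an empty record if the output is not valid JSON

print(record)
```

A list of such records can be passed to `pd.DataFrame` to obtain one column per field.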


## Visualization Example - Creating a Map with Earthquake Locations

We first use the geopy library to look up geographic locations and add their coordinates (latitude and longitude) to a pandas DataFrame. The code defines a GeocodingService class that wraps the Nominatim geocoding API and implements rate limiting, retries with exponential backoff, and error handling to make geocoding more robust.

We then use the folium library to create an interactive map with a marker for each location in the DataFrame. Finally, the map is displayed, providing a visual representation of the geographic data.

In [7]:
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut, GeocoderServiceError
import pandas as pd
import time
from typing import List, Tuple, Optional
import random

class GeocodingService:
    def __init__(self, user_agent: str = None, timeout: int = 10, rate_limit: float = 1.1):
        """
        Initialize the geocoding service with proper configuration.

        Args:
            user_agent: Custom user agent string (default: generated)
            timeout: Timeout for requests in seconds
            rate_limit: Time to wait between requests in seconds
        """
        if user_agent is None:
            user_agent = f"python_geocoding_script_{random.randint(1000, 9999)}"

        self.geolocator = Nominatim(
            user_agent=user_agent,
            timeout=timeout
        )
        self.rate_limit = rate_limit
        self.last_request = 0

    def _rate_limit_wait(self):
        """Implement rate limiting between requests"""
        current_time = time.time()
        time_since_last = current_time - self.last_request
        if time_since_last < self.rate_limit:
            time.sleep(self.rate_limit - time_since_last)
        self.last_request = time.time()

    def geocode_location(self, location: str, max_retries: int = 3) -> Optional[Tuple[float, float]]:
        """
        Geocode a single location with retries.

        Args:
            location: Location string to geocode
            max_retries: Maximum number of retry attempts

        Returns:
            Tuple of (latitude, longitude) or None if geocoding fails
        """
        for attempt in range(max_retries):
            try:
                self._rate_limit_wait()
                location_data = self.geolocator.geocode(location)
                if location_data:
                    return (location_data.latitude, location_data.longitude)
                return None
            except (GeocoderTimedOut, GeocoderServiceError) as e:
                if attempt == max_retries - 1:
                    print(f"Failed to geocode '{location}' after {max_retries} attempts: {e}")
                    return None
                time.sleep(2 ** attempt)  # Exponential backoff
            except Exception as e:
                print(f"Error geocoding '{location}': {e}")
                return None
        return None

    def process_locations(self, locations: str) -> List[Optional[Tuple[float, float]]]:
        """
        Process a comma-separated string of locations.

        Args:
            locations: Comma-separated string of location names

        Returns:
            List of coordinate tuples or None for failed geocoding
        """
        if pd.isna(locations) or not locations:
            return []

        location_list = [loc.strip() for loc in locations.split(',')]
        return [self.geocode_location(loc) for loc in location_list]

def geolocate_places(df: pd.DataFrame,
                    places_column: str = 'places',
                    user_agent: str = None) -> pd.DataFrame:
    """
    Add coordinates to a DataFrame based on location names.

    Args:
        df: Input DataFrame
        places_column: Name of the column containing comma-separated location strings
        user_agent: Custom user agent string

    Returns:
        DataFrame with added 'coordinates' column
    """
    geocoder = GeocodingService(user_agent=user_agent)

    # Create a copy to avoid modifying the original DataFrame
    result_df = df.copy()

    # Process locations
    result_df['coordinates'] = result_df[places_column].apply(geocoder.process_locations)

    return result_df

# Main execution
if __name__ == "__main__":
    # Assuming articles_df is your DataFrame with a 'places' column
    # Apply geocoding to the articles DataFrame
    articles_df_with_coords = geolocate_places(
        articles_df,
        places_column='earthquake_locations',
        user_agent='article_geocoding_service_v1.0'
    )

    # Update the original DataFrame with the new coordinates
    articles_df['coordinates'] = articles_df_with_coords['coordinates']

    # Display the results
    print("\nSample of geocoded locations:")
    print(articles_df[['earthquake_locations', 'coordinates']].head())

    # Optional: Display some statistics
    # Each row is one article, which may mention several locations
    total_articles = len(articles_df)
    successful_geocodes = articles_df['coordinates'].apply(lambda x: len([c for c in x if c is not None])).sum()
    failed_geocodes = articles_df['coordinates'].apply(lambda x: len([c for c in x if c is None])).sum()

    print("\nGeocoding Statistics:")
    print(f"Total articles processed: {total_articles}")
    print(f"Successfully geocoded: {successful_geocodes}")
    print(f"Failed to geocode: {failed_geocodes}")


Sample of geocoded locations:
                                earthquake_locations  \
0                                             Mexiko   
1  Neapel, Benevento, Cosenza, Castellamare di St...   
2                                             Foggia   
3                            Brancaleone (Kalabrien)   
4                                                      

                                         coordinates  
0                        [(19.4326296, -99.1331785)]  
1  [(40.8358846, 14.2487679), (41.2476307, 14.705...  
2                [(41.50281055, 15.452893911491383)]  
3                           [(37.962968, 16.100405)]  
4                                                 []  

Geocoding Statistics:
Total articles processed: 20
Successfully geocoded: 21
Failed to geocode: 6
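
The retry behaviour can be examined in isolation. The sketch below is a generic retry helper with exponential backoff (waits of 1, 2, 4, ... seconds) and is not part of geopy; `retry_with_backoff`, `flaky_lookup`, and the injectable `sleep` parameter are hypothetical names introduced so the logic can be tried without network access.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def retry_with_backoff(func: Callable[[], T], max_retries: int = 3,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call func, retrying on TimeoutError with waits of 1, 2, 4, ... seconds."""
    for attempt in range(max_retries):
        try:
            return func()
        except TimeoutError:
            if attempt == max_retries - 1:
                return None  # give up after the last attempt
            sleep(2 ** attempt)  # exponential backoff
    return None

# Simulated service that fails twice before succeeding
calls = {"count": 0}

def flaky_lookup():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return (40.8358846, 14.2487679)

waits = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky_lookup, sleep=waits.append)
print(result, waits)
```

Passing the sleep function as a parameter keeps the helper testable; in real use the default `time.sleep` would apply.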


In [8]:
import folium
from folium import plugins
import pandas as pd
from typing import List, Tuple, Optional
from IPython.display import display

def create_location_map(df: pd.DataFrame,
                       coordinates_col: str = 'coordinates',
                       places_col: str = 'earthquake_locations',
                       title_col: Optional[str] = None) -> folium.Map:
    """
    Create an interactive map with individual markers for all earthquake locations.

    Args:
        df: DataFrame containing coordinates and earthquake locations
        coordinates_col: Name of column containing coordinates
        places_col: Name of column containing earthquake location names
        title_col: Optional column name for additional marker information

    Returns:
        folium.Map object with all locations marked individually
    """
    # Initialize the map
    m = folium.Map(location=[0, 0], zoom_start=2)

    # Keep track of all valid coordinates for setting bounds
    all_coords = []

    # Process each row in the DataFrame
    for idx, row in df.iterrows():
        coordinates = row[coordinates_col]
        places = row[places_col].split(',') if pd.notna(row[places_col]) else []
        title = row[title_col] if title_col and pd.notna(row[title_col]) else None

        # Skip if no coordinates
        if not coordinates:
            continue

        # Add individual markers for each location
        for i, (coord, place) in enumerate(zip(coordinates, places)):
            if coord is not None:  # Skip None coordinates
                lat, lon = coord
                place_name = place.strip()

                # Create popup content
                popup_content = f"<b>{place_name}</b>"
                if title:
                    popup_content += f"<br>{title}"

                # Add marker directly to the map (not in a cluster)
                folium.Marker(
                    location=[lat, lon],
                    popup=folium.Popup(popup_content, max_width=300),
                    tooltip=place_name,
                    #icon=folium.Icon(color='red', icon='info-sign')
                ).add_to(m)

                all_coords.append([lat, lon])

    # If we have coordinates, fit the map bounds to include all points
    if all_coords:
        m.fit_bounds(all_coords)

    return m

# Create and display the map
map_obj = create_location_map(articles_df)
display(map_obj)
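
The marker loop relies on `zip` to align place names with coordinates, and `zip` silently stops at the shorter of the two sequences. The sketch below shows that pairing step on its own, with an explicit length check added as a safeguard; `pair_places` is a hypothetical helper, not part of folium.

```python
from typing import List, Optional, Tuple

Coord = Tuple[float, float]

def pair_places(places: str, coords: List[Optional[Coord]]) -> List[Tuple[str, Coord]]:
    """Pair comma-separated place names with coordinates, skipping failed lookups."""
    names = [name.strip() for name in places.split(",")]
    if len(names) != len(coords):
        raise ValueError("place names and coordinates are misaligned")
    return [(name, coord) for name, coord in zip(names, coords) if coord is not None]

# "Unbekannter Ort" is a made-up entry standing in for a failed geocoding result
pairs = pair_places("Mexiko, Unbekannter Ort", [(19.4326296, -99.1331785), None])
print(pairs)
```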