<a href="https://colab.research.google.com/github/Liztomania/NLP_Kurs_2024/blob/main/06_Named_Entity_Recognition_ImpressoAPI_EP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Visualizing Named Entities with Impresso API

The models and the corresponding API implementation used in this notebook were developed by Emanuela Boros ([ORCID](https://orcid.org/0000-0001-6299-9452)) of the Impresso project (https://impresso-project.ch/). The notebook was created by Sarah Oberbichler ([ORCID](https://orcid.org/0000-0002-1031-2759)) for teaching purposes.

## About this Notebook

This notebook extracts named entities from a dataset using the Impresso NER and Named Entity Linking (NEL) API, which recognizes fine-grained named entities in historical texts in German, French, and English. It also shows how to visualize location entities by geocoding the extracted place names and plotting them on an interactive map.

**Acknowledgements:**

For further experimentation, you can directly access the experimental API at [Impresso Annotation](https://impresso-annotation.epfl.ch/).


## Call the Impresso NER and NEL API
We only use the NER part for this notebook.

In [1]:
import requests

def get_linked_entities(text, coarse_only=False):
    """
    Calls the external API to get named entity recognition (NER) results.
    """
    url = "https://impresso-annotation.epfl.ch/api/ner/"
    payload = {"data": text}
    try:
        response = requests.post(url, json=payload)
        if response.status_code == 200:
            data = response.json()
            data["text"] = text
            # Keep only coarse-grained entities: fine-grained types and
            # components carry a "." in their type label
            if coarse_only:
                data["nes"] = [ne for ne in data["nes"] if "." not in ne["type"]]
            return data
        else:
            print(f"Request failed with status code {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        return None
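
To see what the `coarse_only` filter does without calling the API, the following minimal sketch applies the same rule to a hand-written response. The sample entities and their type labels are hypothetical; only the `nes`, `type`, and `surface` keys are taken from this notebook, and a real response may contain further fields.

```python
# Hypothetical response, shaped like the dictionary returned by get_linked_entities.
sample_response = {
    "nes": [
        {"surface": "Neapel", "type": "loc"},
        {"surface": "Neapel", "type": "loc.adm.town"},  # assumed fine-grained label
        {"surface": "Kölnische Zeitung", "type": "org"},  # assumed coarse label
    ]
}

def keep_coarse(nes):
    """Keep only entities whose type label contains no '.'."""
    return [ne for ne in nes if "." not in ne["type"]]

coarse = keep_coarse(sample_response["nes"])
print([(ne["surface"], ne["type"]) for ne in coarse])
```

Entities with a dotted type label are dropped, so each mention is kept only once at the coarse level.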


## Importing a Dataset

In [2]:
!git clone https://github.com/ieg-dhr/NLP-Course4Humanities_2024.git

Cloning into 'NLP-Course4Humanities_2024'...
remote: Enumerating objects: 1419, done.
remote: Counting objects: 100% (155/155), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 1419 (delta 139), reused 119 (delta 119), pack-reused 1264 (from 2)
Receiving objects: 100% (1419/1419), 61.47 MiB | 33.91 MiB/s, done.
Resolving deltas: 100% (816/816), done.


In [3]:
import pandas as pd

articles_df = pd.read_excel('/content/NLP-Course4Humanities_2024/datasets/earthquake_articles.xlsx')

articles_df = articles_df[:20]
articles_df.head()

Unnamed: 0,page_id,pagenumber,paper_title,provider_ddb_id,provider,zdb_id,publication_date,place_of_distribution,language,thumbnail,pagefulltext,pagename,preview_reference,plainpagefulltext,extracted_article,article_part,total_parts,Unnamed: 17,Unnamed: 18,extracted_article_clean
0,3ML37O5BXQD3EYOR5S777GQZKGDOCDIM-ALTO8633337_D...,2,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1911-12-23 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],dac9b430-b364-4fa9-9367-0e9c795c1103,['/data/altos/3M/L3/3ML37O5BXQD3EYOR5S777GQZKG...,ALTO8633337_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Samstag , 23 . Dezember Offiziere , und über 3...",**Extracted Article**\n\n* **Headline:** Erdst...,1,1,,,"\n\nNew York, 22. Dez. (Telegr.) In der Stadt ..."
1,42H5V33ALNFVOIG4SBM4YW3WQVKKHDMO-ALTO8384861_D...,10,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1910-06-07 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],0b5c3ed6-af50-4ea1-a206-78bf2e260dc1,['/data/altos/42/H5/42H5V33ALNFVOIG4SBM4YW3WQV...,ALTO8384861_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Dienstag , 7 . Juni Kölnische Zeitung s Mittag...",**Extracted Article 1:**\n\n* **Headline:** Er...,1,1,,,"\n\nNeapel, 7. Juni. (Telegr.) Ein wellenförmi..."
2,42H5V33ALNFVOIG4SBM4YW3WQVKKHDMO-ALTO8384865_D...,14,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1910-06-07 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],0b5c3ed6-af50-4ea1-a206-78bf2e260dc1,['/data/altos/42/H5/42H5V33ALNFVOIG4SBM4YW3WQV...,ALTO8384865_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Dienstag , 7 . Juni Kölnische Zeitung 8 Abend ...",**Extracted Article**\n\n* **Headline:** Erdst...,1,1,,,"\n\nFoggia, 7. Juni. (Telegr.) Ein heftiger Er..."
3,477TOWWZGBGVO2T47FBUNYAPVERZTJQC-ALTO8170232_D...,2,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1909-04-18 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],457082cd-ea69-4273-a359-82e5ec7191d9,['/data/altos/47/7T/477TOWWZGBGVO2T47FBUNYAPVE...,ALTO8170232_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Sonntag , 18 . April — für das Königreich Sach...",**Extracted Article 1:**\n\n* **Headline:** Er...,1,2,,,"\n\nBrancaleone (Kalabrien), 17. April. (Teleg..."
4,477TOWWZGBGVO2T47FBUNYAPVERZTJQC-ALTO8170232_D...,2,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1909-04-18 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],457082cd-ea69-4273-a359-82e5ec7191d9,['/data/altos/47/7T/477TOWWZGBGVO2T47FBUNYAPVE...,ALTO8170232_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Sonntag , 18 . April — für das Königreich Sach...",**Extracted Article 2:**\n\n* **Headline:** Ki...,2,2,,,\n\nDer Kircheneinsturz in Hohensalza ist dort...


In [4]:
import requests
import pandas as pd
import json

def get_linked_entities(text, coarse_only=False):
    """
    Calls the external API to get named entity recognition (NER) results.
    """
    url = "https://impresso-annotation.epfl.ch/api/ner/"
    payload = {"data": text}
    try:
        response = requests.post(url, json=payload)
        if response.status_code == 200:
            data = response.json()
            data["text"] = text
            # remove fine-grained and components
            if coarse_only:
                data["nes"] = [ne for ne in data["nes"] if not "." in ne["type"]]
            return data
        else:
            print(f"Request failed with status code {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        return None

# Assuming 'extracted_article_clean' is the column name in your DataFrame
def process_dataframe(df):
    results = []
    for index, row in df.iterrows():
        text = row['extracted_article_clean']
        api_result = get_linked_entities(text, coarse_only=True)
        results.append(api_result)
    df['NER_results'] = results
    return df

articles_df = process_dataframe(articles_df)
articles_df.head()

An error occurred: Out of range float values are not JSON compliant
An error occurred: Out of range float values are not JSON compliant
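
The two messages above most likely come from rows whose `extracted_article_clean` cell is empty. pandas reads empty cells as the float `NaN`, and `requests` refuses to serialize `NaN` as JSON. A minimal sketch of one way to skip such rows before calling the API follows. The helper names `is_usable_text` and `annotate_all` are hypothetical, and `fake_annotate` is only a stand-in for `get_linked_entities` so that the example runs offline.

```python
def is_usable_text(value):
    """Return True only for non-empty strings; NaN floats and None are rejected."""
    return isinstance(value, str) and value.strip() != ""

def annotate_all(texts, annotate):
    """Call `annotate` on usable texts and store None for everything else."""
    return [annotate(t) if is_usable_text(t) else None for t in texts]

# Stand-in for get_linked_entities (no network access needed).
def fake_annotate(text):
    return {"text": text, "nes": []}

results = annotate_all(["Neapel, 7. Juni.", float("nan"), None, "  "], fake_annotate)
print(results)
```

In the notebook, the same check could be placed inside the loop of `process_dataframe` before `get_linked_entities` is called.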


Unnamed: 0,page_id,pagenumber,paper_title,provider_ddb_id,provider,zdb_id,publication_date,place_of_distribution,language,thumbnail,...,pagename,preview_reference,plainpagefulltext,extracted_article,article_part,total_parts,Unnamed: 17,Unnamed: 18,extracted_article_clean,NER_results
0,3ML37O5BXQD3EYOR5S777GQZKGDOCDIM-ALTO8633337_D...,2,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1911-12-23 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],dac9b430-b364-4fa9-9367-0e9c795c1103,...,ALTO8633337_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Samstag , 23 . Dezember Offiziere , und über 3...",**Extracted Article**\n\n* **Headline:** Erdst...,1,1,,,"\n\nNew York, 22. Dez. (Telegr.) In der Stadt ...","{'ts': '2025-01-10T09:37:26Z', 'sys_id': 'ner-..."
1,42H5V33ALNFVOIG4SBM4YW3WQVKKHDMO-ALTO8384861_D...,10,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1910-06-07 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],0b5c3ed6-af50-4ea1-a206-78bf2e260dc1,...,ALTO8384861_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Dienstag , 7 . Juni Kölnische Zeitung s Mittag...",**Extracted Article 1:**\n\n* **Headline:** Er...,1,1,,,"\n\nNeapel, 7. Juni. (Telegr.) Ein wellenförmi...","{'ts': '2025-01-10T09:37:27Z', 'sys_id': 'ner-..."
2,42H5V33ALNFVOIG4SBM4YW3WQVKKHDMO-ALTO8384865_D...,14,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1910-06-07 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],0b5c3ed6-af50-4ea1-a206-78bf2e260dc1,...,ALTO8384865_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Dienstag , 7 . Juni Kölnische Zeitung 8 Abend ...",**Extracted Article**\n\n* **Headline:** Erdst...,1,1,,,"\n\nFoggia, 7. Juni. (Telegr.) Ein heftiger Er...","{'ts': '2025-01-10T09:37:28Z', 'sys_id': 'ner-..."
3,477TOWWZGBGVO2T47FBUNYAPVERZTJQC-ALTO8170232_D...,2,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1909-04-18 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],457082cd-ea69-4273-a359-82e5ec7191d9,...,ALTO8170232_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Sonntag , 18 . April — für das Königreich Sach...",**Extracted Article 1:**\n\n* **Headline:** Er...,1,2,,,"\n\nBrancaleone (Kalabrien), 17. April. (Teleg...","{'ts': '2025-01-10T09:37:29Z', 'sys_id': 'ner-..."
4,477TOWWZGBGVO2T47FBUNYAPVERZTJQC-ALTO8170232_D...,2,Kölnische Zeitung. 1803-1945,VKNQFFAKOR4XZWJJKUX3NGYSZ3QZAXCW,Universitäts- und Landesbibliothek der Rheinis...,2719361-5,1909-04-18 12:00:00,"['Köln', 'Kleve (Kreis Kleve)', 'Jülich']",['ger'],457082cd-ea69-4273-a359-82e5ec7191d9,...,ALTO8170232_DDB_FULLTEXT,https://api.deutsche-digitale-bibliothek.de/bi...,"Sonntag , 18 . April — für das Königreich Sach...",**Extracted Article 2:**\n\n* **Headline:** Ki...,2,2,,,\n\nDer Kircheneinsturz in Hohensalza ist dort...,"{'ts': '2025-01-10T09:37:29Z', 'sys_id': 'ner-..."


In [6]:
# export as excel file
articles_df.to_excel('earthquake_articles_with_ner.xlsx', index=False)

## Extracting Locations from the NER Output - An Example

We can extract specific entity types for further analysis or visualization. Please note that extracted named entities must be checked and verified by a human reader before they are used for further analysis, since the model will most likely have made some mistakes.

In [7]:
#extract locations

import pandas as pd

def extract_locations(df):
    places = []
    for index, row in df.iterrows():
        ner_results = row['NER_results']
        if ner_results and 'nes' in ner_results:
            # Use 'surface' instead of 'text' to get the entity text
            location_names = [ne['surface'] for ne in ner_results['nes'] if ne.get('type') == 'loc']
            places.append(', '.join(location_names))  # Join multiple locations
        else:
            places.append('')  # Handle cases where NER_results is None or nes key is missing
    df['places'] = places
    return df

articles_df = extract_locations(articles_df)
print(articles_df[['extracted_article_clean', 'places']].head())

                             extracted_article_clean  \
0  \n\nNew York, 22. Dez. (Telegr.) In der Stadt ...   
1  \n\nNeapel, 7. Juni. (Telegr.) Ein wellenförmi...   
2  \n\nFoggia, 7. Juni. (Telegr.) Ein heftiger Er...   
3  \n\nBrancaleone (Kalabrien), 17. April. (Teleg...   
4  \n\nDer Kircheneinsturz in Hohensalza ist dort...   

                                              places  
0                             New York, Stadt Mexiko  
1  Neapel, Benevento, di Stabia, Potenza, Catanza...  
2                                             Bovino  
3                                          Kalabrien  
4                          Hohensalza, Salzbergwerke  
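
One simple aid for the manual check is a frequency list of the extracted names, which makes unusual or implausible entries easier to spot. The sketch below counts names in comma-separated strings like those in the `places` column. The function name `count_places` is hypothetical, and the sample list reuses values from the output above, with one entry repeated for illustration.

```python
from collections import Counter

def count_places(place_strings):
    """Count how often each place name occurs across comma-separated strings."""
    counts = Counter()
    for places in place_strings:
        for name in places.split(","):
            name = name.strip()
            if name:  # ignore empty strings from rows without locations
                counts[name] += 1
    return counts

# "Bovino" is repeated here on purpose to show the counting.
sample = ["New York, Stadt Mexiko", "Bovino", "", "Hohensalza, Salzbergwerke", "Bovino"]
place_counts = count_places(sample)
for name, n in place_counts.most_common():
    print(f"{name}: {n}")
```

With the DataFrame from this notebook, `count_places(articles_df['places'])` would produce the list for all rows. An entry such as "Salzbergwerke" (salt mines) may be worth a second look during verification.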


## Visualization Example - Creating a Map with Place Names

We first use the geopy library to geocode the extracted place names and add their coordinates (latitude and longitude) to the pandas DataFrame. The code defines a GeocodingService class that wraps the Nominatim geocoder and adds rate limiting, retries with exponential backoff, and error handling.

We then use the folium library to create an interactive map with a marker for each geocoded location in the DataFrame. Finally, the map is displayed in the notebook.
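
The retry-with-exponential-backoff idea can be looked at in isolation before reading the full class. The following is a minimal sketch, not the geocoding code itself: `retry_with_backoff` and `flaky` are hypothetical names, the coordinates are made up, and the `sleep` parameter is injected only so that the waiting times can be recorded instead of actually waited.

```python
import time

def retry_with_backoff(func, max_retries=3, sleep=time.sleep):
    """Call func(); after a failure wait 2**attempt seconds and try again."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:
            if attempt == max_retries - 1:
                print(f"Giving up after {max_retries} attempts: {exc}")
                return None
            sleep(2 ** attempt)  # waits of 1 s, 2 s, 4 s, ...
    return None

# A function that fails twice with a simulated timeout, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return (40.85, 14.25)  # made-up coordinates

waits = []
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

Each failed attempt doubles the waiting time, which gives a temporarily overloaded service more time to recover before the next request.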

In [8]:
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut, GeocoderServiceError
import pandas as pd
import time
from typing import List, Tuple, Optional
import random

class GeocodingService:
    def __init__(self, user_agent: Optional[str] = None, timeout: int = 10, rate_limit: float = 1.1):
        """
        Initialize the geocoding service with proper configuration.

        Args:
            user_agent: Custom user agent string (default: generated)
            timeout: Timeout for requests in seconds
            rate_limit: Time to wait between requests in seconds
        """
        if user_agent is None:
            user_agent = f"python_geocoding_script_{random.randint(1000, 9999)}"

        self.geolocator = Nominatim(
            user_agent=user_agent,
            timeout=timeout
        )
        self.rate_limit = rate_limit
        self.last_request = 0

    def _rate_limit_wait(self):
        """Implement rate limiting between requests"""
        current_time = time.time()
        time_since_last = current_time - self.last_request
        if time_since_last < self.rate_limit:
            time.sleep(self.rate_limit - time_since_last)
        self.last_request = time.time()

    def geocode_location(self, location: str, max_retries: int = 3) -> Optional[Tuple[float, float]]:
        """
        Geocode a single location with retries.

        Args:
            location: Location string to geocode
            max_retries: Maximum number of retry attempts

        Returns:
            Tuple of (latitude, longitude) or None if geocoding fails
        """
        for attempt in range(max_retries):
            try:
                self._rate_limit_wait()
                location_data = self.geolocator.geocode(location)
                if location_data:
                    return (location_data.latitude, location_data.longitude)
                return None
            except (GeocoderTimedOut, GeocoderServiceError) as e:
                if attempt == max_retries - 1:
                    print(f"Failed to geocode '{location}' after {max_retries} attempts: {e}")
                    return None
                time.sleep(2 ** attempt)  # Exponential backoff
            except Exception as e:
                print(f"Error geocoding '{location}': {e}")
                return None
        return None

    def process_locations(self, locations: str) -> List[Optional[Tuple[float, float]]]:
        """
        Process a comma-separated string of locations.

        Args:
            locations: Comma-separated string of location names

        Returns:
            List of coordinate tuples or None for failed geocoding
        """
        if pd.isna(locations) or not locations:
            return []

        location_list = [loc.strip() for loc in locations.split(',')]
        return [self.geocode_location(loc) for loc in location_list]

def geolocate_places(df: pd.DataFrame,
                    places_column: str = 'places',
                    user_agent: Optional[str] = None) -> pd.DataFrame:
    """
    Add coordinates to a DataFrame based on location names.

    Args:
        df: Input DataFrame
        places_column: Name of the column containing comma-separated location strings
        user_agent: Custom user agent string

    Returns:
        DataFrame with added 'coordinates' column
    """
    geocoder = GeocodingService(user_agent=user_agent)

    # Create a copy to avoid modifying the original DataFrame
    result_df = df.copy()

    # Process locations
    result_df['coordinates'] = result_df[places_column].apply(geocoder.process_locations)

    return result_df

# Main execution
if __name__ == "__main__":
    # Assuming articles_df is your DataFrame with a 'places' column
    # Apply geocoding to the articles DataFrame
    articles_df_with_coords = geolocate_places(
        articles_df,
        places_column='places',
        user_agent='article_geocoding_service_v1.0'
    )

    # Update the original DataFrame with the new coordinates
    articles_df['coordinates'] = articles_df_with_coords['coordinates']

    # Display the results
    print("\nSample of geocoded locations:")
    print(articles_df[['places', 'coordinates']].head())

    # Optional: Display some statistics
    # Note: len(articles_df) counts rows (articles), not individual locations
    total_articles = len(articles_df)
    successful_geocodes = articles_df['coordinates'].apply(lambda x: len([c for c in x if c is not None])).sum()
    failed_geocodes = articles_df['coordinates'].apply(lambda x: len([c for c in x if c is None])).sum()

    print("\nGeocoding Statistics:")
    print(f"Total articles processed: {total_articles}")
    print(f"Successfully geocoded: {successful_geocodes}")
    print(f"Failed to geocode: {failed_geocodes}")


Sample of geocoded locations:
                                              places  \
0                             New York, Stadt Mexiko   
1  Neapel, Benevento, di Stabia, Potenza, Catanza...   
2                                             Bovino   
3                                          Kalabrien   
4                          Hohensalza, Salzbergwerke   

                                         coordinates  
0  [(40.7127281, -74.0060152), (19.4326296, -99.1...  
1  [(40.8358846, 14.2487679), (41.2476307, 14.705...  
2                           [(41.249926, 15.341249)]  
3                         [(39.0565974, 16.5249864)]  
4  [(52.7952408, 18.2595624), (49.969634600000006...  

Geocoding Statistics:
Total articles processed: 20
Successfully geocoded: 99
Failed to geocode: 10


In [9]:
import folium
from folium import plugins
import pandas as pd
from typing import List, Tuple, Optional
from IPython.display import display

def create_location_map(df: pd.DataFrame,
                       coordinates_col: str = 'coordinates',
                       places_col: str = 'places',
                       title_col: Optional[str] = None) -> folium.Map:
    """
    Create an interactive map with markers for all locations in the DataFrame.

    Args:
        df: DataFrame containing coordinates and place names
        coordinates_col: Name of column containing coordinates
        places_col: Name of column containing place names
        title_col: Optional column name for additional marker information

    Returns:
        folium.Map object with all locations marked
    """
    # Initialize the map
    m = folium.Map(location=[0, 0], zoom_start=2)

    # Create a MarkerCluster
    marker_cluster = plugins.MarkerCluster().add_to(m)

    # Keep track of all valid coordinates for setting bounds
    all_coords = []

    # Process each row in the DataFrame
    for idx, row in df.iterrows():
        coordinates = row[coordinates_col]
        places = row[places_col].split(',') if pd.notna(row[places_col]) else []
        title = row[title_col] if title_col and pd.notna(row[title_col]) else None

        # Skip if no coordinates
        if not coordinates:
            continue

        # Add markers for each location
        for i, (coord, place) in enumerate(zip(coordinates, places)):
            if coord is not None:  # Skip None coordinates
                lat, lon = coord
                place_name = place.strip()

                # Create popup content
                popup_content = f"<b>{place_name}</b>"
                if title:
                    popup_content += f"<br>{title}"

                # Add marker
                folium.Marker(
                    location=[lat, lon],
                    popup=popup_content,
                    tooltip=place_name
                ).add_to(marker_cluster)

                all_coords.append([lat, lon])

    # If we have coordinates, fit the map bounds to include all points
    if all_coords:
        m.fit_bounds(all_coords)

    return m

# Create and display the map
map_obj = create_location_map(articles_df)
display(map_obj)
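
The way `create_location_map` matches place names to coordinates can be checked separately from the map. The sketch below repeats only that pairing step: names and coordinates are zipped position by position, and entries whose geocoding failed (`None`) are skipped. The function name `pair_places_with_coords` is hypothetical, and the coordinates are rounded, illustrative values rather than actual geocoder output.

```python
def pair_places_with_coords(places, coordinates):
    """Pair each place name with its coordinate, skipping failed lookups (None)."""
    names = [p.strip() for p in places.split(",")] if places else []
    return [
        (name, coord)
        for name, coord in zip(names, coordinates)
        if coord is not None
    ]

pairs = pair_places_with_coords(
    "New York, Stadt Mexiko, Salzbergwerke",
    [(40.71, -74.01), (19.43, -99.13), None],  # illustrative values
)
print(pairs)
```

The pairing relies on both lists having the same order and length, which holds here because the coordinates were produced by splitting the same comma-separated string.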