# Named entity recognition

In this notebook, we will perform named entity recognition on the OCR text.

First, we load our preprocessed data into a new dataframe.

In [None]:
import pickle
import pandas as pd

# Deserialize
with open('data/preprocessed_docs.pkl', 'rb') as f:
    processed_docs = pickle.load(f)

In [None]:
# We drop the rows without content
processed_docs = processed_docs.dropna(subset=['content'])

# Show the first rows of the DataFrame
pd.set_option('display.max_colwidth', None)
processed_docs[['identifier','krantnaam','GPEs']].head()

## Inspect the entity data in the dataframe

With the `displacy` module, we can visualise all kinds of entities in our text.

Remember that entities can be extracted from the SpaCy `Doc` objects that are stored in column `doc`.

If you want to know the meaning of an entity tag, use the `spacy.explain` function, like this:

```python
spacy.explain('GPE')
```


In [None]:
import spacy

# Specify the relative path to the model directory
model_path = "model/nl_core_news_sm"

# Load the model from the relative path
nlp = spacy.load(model_path)

In [None]:
from spacy import displacy

# Get the SpaCy Doc object from the second record (position 1) in the dataframe
doc = processed_docs['doc'].iloc[1]

# displacy visualises the named entities
displacy.render(doc, style='ent', jupyter=True)

We can see that our model has made quite a few mistakes.

Let's continue with GPE data in this notebook.

## Plot GPE data on a map

The steps:

* Load a CSV file with Dutch city coordinates.
* Create a mapping of city names to their coordinates.
* Look up coordinates for each GPE entity found in our data.
* Add a new column with these coordinates to our data.
* Plot the coordinates on a map.

### Load a CSV file with Dutch city coordinates

We will store the data in a new dataframe.

In [None]:
import pandas as pd

# Load coordinates data from CSV
coordinates_df = pd.read_csv('data/nl.csv')
coordinates_df.head(2)

### Create a mapping of city names to their coordinates

We will create a Python dictionary data structure for easy lookup.

In [None]:
# Create a Python dictionary mapping city names to coordinates
city_to_coords = {row['city']: (row['lat'], row['lng']) for idx, row in coordinates_df.iterrows()}

### Look up coordinates for each GPE entity found in our data and add a new column with these coordinates to our data

In [None]:
# Function to get coordinates for GPEs
def get_gpe_coords(gpe_list):
    return [city_to_coords[gpe] for gpe in gpe_list if gpe in city_to_coords]

# Apply the function to create a new column with coordinates
processed_docs['GPE_coords'] = processed_docs['GPEs'].apply(get_gpe_coords)

# Display the DataFrame
pd.set_option('display.max_colwidth', None)
display(processed_docs[['identifier','GPEs','GPE_coords']].head())


We can see that no coordinates were found for the places of `http://resolver.kb.nl/resolve?urn=ddd:011065450:mpeg21:a0002:ocr`.
This is because the list only contains Dutch place names.

The place `Fkankhjjk` does not occur in our lookup list. Neither does `Lemmer`.
But `Groningen` and `Amsterdam` do occur in the list.

```python
city_to_coords['Fkankhjjk']
city_to_coords['Groningen']
city_to_coords['Lemmer']
```

In [None]:
city_to_coords['Groningen']
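
Indexing the dictionary with square brackets raises a `KeyError` for a name that is missing, such as `Fkankhjjk`. A minimal sketch of a safer lookup with `dict.get`, which also tallies the names that could not be matched, might look like this. The `sample_coords` dictionary and the `sample_gpes` list are made-up stand-ins for `city_to_coords` and the `GPEs` column, and the coordinates are rounded approximations.

```python
from collections import Counter

# Hypothetical stand-in for city_to_coords (approximate coordinates)
sample_coords = {
    'Groningen': (53.22, 6.57),
    'Amsterdam': (52.37, 4.90),
}

# Hypothetical stand-in for the GPEs found in a few articles
sample_gpes = ['Groningen', 'Lemmer', 'Fkankhjjk', 'Amsterdam', 'Lemmer']

# dict.get returns None instead of raising a KeyError for unknown names
found = {gpe: sample_coords.get(gpe) for gpe in sample_gpes}

# Count the names without coordinates, most frequent first
missing = Counter(gpe for gpe in sample_gpes if gpe not in sample_coords)

print(found['Lemmer'])
print(missing.most_common())
```

A tally like `missing` gives a quick overview of which place names are worth adding to the lookup and which are OCR errors.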

### Plot the coordinates on a map

We will use the `folium` library for this.

In [None]:
import folium

# Plot the coordinates on a map
# (the variable is named city_map so that it does not shadow the built-in map function)
city_map = folium.Map(location=[52.0, 5.0], zoom_start=7)

for coords_list in processed_docs['GPE_coords']:
    for coords in coords_list:
        folium.Marker(location=coords).add_to(city_map)

# Save the map to an HTML file
city_map.save('map.html')

# Display the map
city_map
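
Places that are mentioned in several articles get several markers at exactly the same position, which hides how often a place occurs. One possible refinement, sketched here under the assumption that `GPE_coords` holds lists of `(lat, lon)` tuples as above, is to count the occurrences per coordinate first. The `sample_coords_lists` data is made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for processed_docs['GPE_coords']
sample_coords_lists = [
    [(52.37, 4.90), (53.22, 6.57)],
    [(52.37, 4.90)],
    [],
]

# Flatten the lists and count how often each coordinate pair occurs
coord_counts = Counter(
    coords for coords_list in sample_coords_lists for coords in coords_list
)

for coords, count in coord_counts.items():
    # In the notebook, one folium.Marker per place could be added here,
    # with the count as its popup text
    print(coords, count)
```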


#### Add different colors for different papers

In [None]:
processed_docs.krantnaam.unique()


In [None]:
import pandas as pd
import folium

# Define a color mapping for each unique krantnaam
color_mapping = {
    'De standaard': 'red',
    'Het vaderland : staat- en letterkundig nieuwsblad': 'blue',
    'De Tĳd : godsdienstig-staatkundig dagblad': 'green',
}

# Plot the coordinates on a map
# (the variable is named city_map so that it does not shadow the built-in map function)
city_map = folium.Map(location=[52.0, 5.0], zoom_start=7)

for index, row in processed_docs.iterrows():
    krantnaam = row['krantnaam']
    coords_list = row['GPE_coords']
    color = color_mapping.get(krantnaam, 'black')  # Default to black if krantnaam not in mapping

    for coords in coords_list:
        folium.Marker(location=coords, icon=folium.Icon(color=color)).add_to(city_map)

# Save the map to an HTML file
city_map.save('map.html')

# Display the map
city_map


## Add missing cities to the coordinates dictionary

To retrieve the coordinates for a city (e.g., Lemmer) from Wikidata, you can use the following SPARQL query:

```sparql
SELECT ?city ?cityLabel ?coordinates
WHERE {
  ?city rdfs:label "Lemmer"@nl;  # The city name "Lemmer"
        wdt:P625 ?coordinates.  # The coordinates
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],nl". }
}
LIMIT 1
```

You can run this [SPARQL query](https://w.wiki/AUVP) on the Wikidata Query Service to retrieve the coordinates for Lemmer.

### SPARQL queries with Python

We can use a special Python library for SPARQL queries.

The code below runs a SPARQL query and returns the data in JSON format.

In [None]:
from SPARQLWrapper import SPARQLWrapper, JSON
import json

# Wikidata endpoint
endpoint_url = "https://query.wikidata.org/sparql"

# Define the SPARQL query
query = """
SELECT ?city ?cityLabel ?coordinates
WHERE {
  ?city rdfs:label "Lemmer"@nl; 
        wdt:P625 ?coordinates.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],nl". }
}
LIMIT 1
"""

def get_results(endpoint_url, query):
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

result = get_results(endpoint_url, query)

print(json.dumps(result, indent=4))
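
The query above has the name `Lemmer` written into it. To look up other places, the query text could be generated from a place name. The following is a minimal sketch; `build_city_query` is a hypothetical helper that is not part of `SPARQLWrapper`, and it only builds the query string without contacting Wikidata.

```python
def build_city_query(city_name, language='nl'):
    """Build a Wikidata coordinate query for one place name (hypothetical helper)."""
    # Escape backslashes and double quotes so the name stays a valid SPARQL string
    safe_name = city_name.replace('\\', '\\\\').replace('"', '\\"')
    return f"""
SELECT ?city ?cityLabel ?coordinates
WHERE {{
  ?city rdfs:label "{safe_name}"@{language};
        wdt:P625 ?coordinates.
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],{language}". }}
}}
LIMIT 1
"""

print(build_city_query('Sneek'))
```

The returned string can be passed to `get_results(endpoint_url, ...)` in place of the fixed `query`.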

We want to retrieve the coordinates from this data object.

In [None]:
bindings = result['results']['bindings'][0]
new_city = bindings['cityLabel']['value']
new_coordinates = bindings['coordinates']['value']

# The coordinates are in the format "Point(lon lat)"
new_lon, new_lat = new_coordinates.replace('Point(', '').replace(')', '').split()
new_coords = (float(new_lat), float(new_lon))

# Print the extracted values
print(f"City: {new_city}")
print(f"Longitude: {new_lon}")
print(f"Latitude: {new_lat}")
print(f"new_coords: {new_coords}")
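
The cell above assumes that the query returned at least one result and that the coordinates always look like `Point(lon lat)`. A more defensive version could check both assumptions. The sketch below uses a hypothetical helper, `extract_coords`, and a made-up `sample_result` that has the same shape as the JSON printed earlier; the numbers in it are illustrative, not the real coordinates of Lemmer.

```python
import re

# Matches a point literal such as "Point(5.7 52.8)": longitude first, then latitude
POINT_PATTERN = re.compile(r'Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)')

def extract_coords(result):
    """Return (label, (lat, lon)) from a query result, or None (hypothetical helper)."""
    bindings = result.get('results', {}).get('bindings', [])
    if not bindings:
        return None  # the query matched nothing
    first = bindings[0]
    match = POINT_PATTERN.fullmatch(first['coordinates']['value'])
    if match is None:
        return None  # unexpected coordinate format
    lon, lat = (float(value) for value in match.groups())
    return first['cityLabel']['value'], (lat, lon)

# Made-up result with the same shape as the JSON above; the numbers are illustrative
sample_result = {
    'results': {'bindings': [{
        'cityLabel': {'value': 'Lemmer'},
        'coordinates': {'value': 'Point(5.7 52.8)'},
    }]}
}

print(extract_coords(sample_result))
print(extract_coords({'results': {'bindings': []}}))
```

Returning `None` instead of raising an error makes it easy to skip place names that Wikidata does not know.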

### Add the new coordinates to the dictionary for lookup

In [None]:
city_to_coords[new_city] = new_coords

In [None]:
city_to_coords['Lemmer']

Which cells do you have to run again to see Lemmer in the plot?

## Fix incorrect place names

This [article](https://www.delpher.nl/nl/kranten/view?coll=ddd&identifier=ddd:011065450:mpeg21:a0005&objectsearch=Sxeek) shows 'Sxeek' instead of 'Sneek' in the OCR text.

Let's correct the place name in the dataframe for article `http://resolver.kb.nl/resolve?urn=ddd:011065450:mpeg21:a0005:ocr`.

Run the code below to replace the place name and update the coordinates. Make sure that Sneek exists in the coordinates dictionary.

In [None]:
import pandas as pd
import spacy

article_identifier = 'http://resolver.kb.nl/resolve?urn=ddd:011065450:mpeg21:a0005:ocr'
old_place = 'Sxeek'
new_place = 'Sneek'

# Function to extract GPEs from the document
def get_gpe(doc):
    return [ent.text for ent in doc.ents if ent.label_ == 'GPE']

def get_gpe_coords(gpe_list):
    return [city_to_coords[gpe] for gpe in gpe_list if gpe in city_to_coords]

# Extract the content of the specific row
article_text = processed_docs.loc[processed_docs['identifier'] == article_identifier, 'content'].values[0]

# Only update the dataframe if the misspelled place name occurs in the text
if old_place in article_text:
    # Replace the text within that content
    article_text = article_text.replace(old_place, new_place)
    
    # Update the content column
    processed_docs.loc[processed_docs['identifier'] == article_identifier, 'content'] = article_text
    
    # Update the SpaCy doc column
    processed_docs.loc[processed_docs['identifier'] == article_identifier, 'doc'] = processed_docs.loc[processed_docs['identifier'] == article_identifier, 'content'].apply(nlp)

    # Update the GPEs column
    processed_docs['GPEs'] = processed_docs['doc'].apply(get_gpe)   

    # Update the GPE_coords column
    processed_docs['GPE_coords'] = processed_docs['GPEs'].apply(get_gpe_coords)

In [None]:
# Display the DataFrame
pd.set_option('display.max_colwidth', None)
display(processed_docs[['identifier', 'GPEs', 'GPE_coords']].head())
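
The replacement above uses `str.replace`, which also changes `old_place` when it occurs inside a longer word. If only whole-word matches should be corrected, a regular expression with word boundaries is one option. This is a sketch with a hypothetical helper, `replace_place`, and a made-up example sentence that is not taken from the article.

```python
import re

def replace_place(text, old_place, new_place):
    """Replace old_place only where it occurs as a whole word (hypothetical helper)."""
    pattern = re.compile(r'\b' + re.escape(old_place) + r'\b')
    # A function as replacement keeps new_place literal, even if it contains backslashes
    return pattern.sub(lambda _: new_place, text)

# Made-up sentence, not taken from the article
sample_text = 'Te Sxeek is gisteren markt gehouden; de Sxeeker courant meldt het.'
print(replace_place(sample_text, 'Sxeek', 'Sneek'))
```

Whether derived forms such as `Sxeeker` should be left alone or corrected as well depends on the data, so this is a trade-off rather than a strict improvement.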
