# Named entity recognition

In this notebook, we will perform named entity recognition on the OCR text.

First, we load our preprocessed data into a new dataframe.

In [1]:
import pickle
import pandas as pd

# Deserialize
with open('data/preprocessed_docs.pkl', 'rb') as f:
    processed_docs = pickle.load(f)

In [3]:
# We drop the rows without content
processed_docs = processed_docs.dropna(subset=['content'])

# Show the first rows of the DataFrame
processed_docs.head(2)

Unnamed: 0,identifier,oaiIdentifier,type,title,date,content,krantnaam,verspreidingsgebied,month,day,doc,tokens,lemmas,GPEs
31,http://resolver.kb.nl/resolve?urn=ddd:01106545...,DDD:ddd:011065450:mpeg21,artikel,Buitenland.,1873-01-01,Fkankhjjk. Volgens 't officieele blad heeft de...,De standaard,Landelijk,January,Wednesday,"(Fkankhjjk, ., Volgens, 't, officieele, blad, ...","[Fkankhjjk, ., Volgens, 't, officieele, blad, ...","[Fkankhjjk, ., volgens, het, officieeel, blad,...","[Fkankhjjk, Mailand, Franscho, Duitschland, Ko..."
37,http://resolver.kb.nl/resolve?urn=ddd:01106545...,DDD:ddd:011065450:mpeg21,artikel,"* Groningen, 30 Dec. In eene",1873-01-01,"* Groningen, 30 Dec. In eene ring van kerkvoog...",De standaard,Landelijk,January,Wednesday,"(*, Groningen, ,, 30, Dec., In, eene, ring, va...","[*, Groningen, ,, 30, Dec., In, eene, ring, va...","[*, Groningen, ,, 30, Dec., in, een, ring, van...","[Groningen, derAmsterdam, Amsterdam, Lemmer, E..."


## Inspect the entity data in the dataframe

With the `displacy` module, we can visualise all kinds of entities in our text.

Remember that entities can be extracted from the spaCy `Doc` objects that are stored in the `doc` column.

To look up the meaning of an entity tag, use the `spacy.explain` function, like this:

```python
spacy.explain('GPE')
```


In [7]:
import spacy

# Specify the relative path to the model directory
model_path = "model/nl_core_news_sm"

# Load the model from the relative path
nlp = spacy.load(model_path)

In [6]:
from spacy import displacy

# Get the spaCy Doc object from the second record in the dataframe (position 1)
doc = processed_docs['doc'].iloc[1]

# displacy visualises the named entities
displacy.render(doc, style='ent', jupyter=True)

We can see that our model has made quite a few mistakes.

Let's continue with GPE data in this notebook.

## Plot GPE-data to a map

The steps:

* Load a CSV file with Dutch city coordinates.
* Create a mapping of city names to their coordinates.
* Look up coordinates for each GPE entity found in our data.
* Add a new column with these coordinates to our data.
* Plot the coordinates on a map.

### Load a CSV file with Dutch city coordinates

We will store the data in a new dataframe.

In [40]:
import pandas as pd

# Load coordinates data from CSV
coordinates_df = pd.read_csv('data/nl.csv')
coordinates_df.head(2)
coordinates_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   city               205 non-null    object 
 1   lat                205 non-null    float64
 2   lng                205 non-null    float64
 3   country            205 non-null    object 
 4   iso2               205 non-null    object 
 5   admin_name         205 non-null    object 
 6   capital            136 non-null    object 
 7   population         205 non-null    int64  
 8   population_proper  205 non-null    int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 14.5+ KB


### Create a mapping of city names to their coordinates

We will create a Python dictionary data structure for easy lookup.

In [14]:
# Create a Python dictionary mapping city names to coordinates
city_to_coords = {row['city']: (row['lat'], row['lng']) for idx, row in coordinates_df.iterrows()}

### Look up coordinates for each GPE entity found in our data and add a new column with these coordinates to our data

In [12]:
# Function to get coordinates for GPEs
def get_gpe_coords(gpe_list):
    return [city_to_coords[gpe] for gpe in gpe_list if gpe in city_to_coords]

# Apply the function to create a new column with coordinates
processed_docs['GPE_coords'] = processed_docs['GPEs'].apply(get_gpe_coords)

# Display the DataFrame
pd.set_option('display.max_colwidth', None)
display(processed_docs[['identifier','GPEs','GPE_coords']].head(2))


Unnamed: 0,identifier,GPEs,GPE_coords
31,http://resolver.kb.nl/resolve?urn=ddd:011065450:mpeg21:a0002:ocr,"[Fkankhjjk, Mailand, Franscho, Duitschland, Koningsbergen, Duitschland, Tirol, Italië, Spanje, Madrid, Cuba, Spanje]",[]
37,http://resolver.kb.nl/resolve?urn=ddd:011065450:mpeg21:a0008:ocr,"[Groningen, derAmsterdam, Amsterdam, Lemmer, Eetenen, I'uttershock]","[(53.2189, 6.5675), (52.3728, 4.8936)]"


We can see that no coordinates were found for the places of `http://resolver.kb.nl/resolve?urn=ddd:011065450:mpeg21:a0002:ocr`.
This is because we only have Dutch places in our lookup list.
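
To see which place names miss the lookup most often, we could count them. The sketch below is only an illustration: `gpe_lists` is a small hypothetical stand-in for the `GPEs` column, and the dictionary holds just two entries copied from the output above.

```python
from collections import Counter

# Stand-ins for the notebook's data (values copied from the output above)
city_to_coords = {"Groningen": (53.2189, 6.5675), "Amsterdam": (52.3728, 4.8936)}
gpe_lists = [
    ["Fkankhjjk", "Mailand", "Duitschland", "Duitschland", "Madrid"],
    ["Groningen", "derAmsterdam", "Amsterdam", "Lemmer"],
]

# Count every GPE string that has no entry in the lookup dictionary
unmatched = Counter(
    gpe for gpe_list in gpe_lists for gpe in gpe_list if gpe not in city_to_coords
)

# Most frequent misses first
print(unmatched.most_common(3))
```

In the notebook itself, `processed_docs['GPEs']` would take the place of `gpe_lists`. The most frequent misses are good candidates for adding to the lookup list.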

The place `Fkankhjjk` does not occur in our lookup list.
Neither does `Lemmer`.
But `Groningen` and `Amsterdam` do occur in the list.
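
Some misses are OCR variants of names that are in the list, such as `derAmsterdam`. One possible way to catch these, sketched here with the standard library's `difflib`, is to fall back to a close match when the exact lookup fails. The helper `lookup_fuzzy` and the cutoff of 0.8 are assumptions for illustration, not part of the notebook, and a loose cutoff can match the wrong place.

```python
from difflib import get_close_matches

# Two entries copied from the lookup dictionary above
city_to_coords = {"Groningen": (53.2189, 6.5675), "Amsterdam": (52.3728, 4.8936)}

def lookup_fuzzy(gpe, mapping, cutoff=0.8):
    """Return coordinates for an exact or close match, else None."""
    if gpe in mapping:
        return mapping[gpe]
    # Fall back to the single most similar key above the cutoff
    close = get_close_matches(gpe, list(mapping), n=1, cutoff=cutoff)
    return mapping[close[0]] if close else None

for name in ["Groningen", "derAmsterdam", "Fkankhjjk"]:
    print(name, lookup_fuzzy(name, city_to_coords))
```

Here `derAmsterdam` resolves to the coordinates of Amsterdam, while `Fkankhjjk` still returns `None`.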

### Plot the coordinates on a map

We will use the `folium` library for this.

In [None]:
import folium

# Plot the coordinates on a map
m = folium.Map(location=[52.0, 5.0], zoom_start=7)

for coords_list in processed_docs['GPE_coords']:
    for coords in coords_list:
        folium.Marker(location=coords).add_to(m)

# Save the map to an HTML file
m.save('map.html')

# Display the map
m
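
The loop above adds one marker per mention, so a frequently mentioned city gets many markers stacked on the same spot. A possible refinement, sketched below, is to count the mentions per coordinate first. `gpe_coords` is a hypothetical stand-in for `processed_docs['GPE_coords']`.

```python
from collections import Counter

# Stand-in for processed_docs['GPE_coords']: one list of (lat, lng) tuples per document
gpe_coords = [
    [],
    [(53.2189, 6.5675), (52.3728, 4.8936)],
    [(52.3728, 4.8936)],
]

# Count how often each coordinate pair occurs across all documents
coord_counts = Counter(coords for coords_list in gpe_coords for coords in coords_list)

for (lat, lng), n in coord_counts.most_common():
    print(f"{lat:.4f}, {lng:.4f}: {n} mention(s)")
```

Each count could then drive a single marker per place, for example through the `radius` of a `folium.CircleMarker`.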


## Add missing cities to the coordinates dictionary

To retrieve the coordinates for a place (e.g., Lemmer) from Wikidata, you can use the following SPARQL query:

```sparql
SELECT ?city ?cityLabel ?coordinates
WHERE {
  ?city rdfs:label "Lemmer"@nl;  # The city name "Lemmer"
        wdt:P625 ?coordinates.  # The coordinates
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],nl". }
}
LIMIT 1
```

You can run this [SPARQL query](https://w.wiki/AUVP) on the Wikidata Query Service to retrieve the coordinates for Lemmer.

In [15]:
### We can also run the query from Python with the SPARQLWrapper library.

In [29]:
from SPARQLWrapper import SPARQLWrapper, JSON

# Wikidata endpoint
endpoint_url = "https://query.wikidata.org/sparql"

# Define the SPARQL query
query = """
SELECT ?city ?cityLabel ?coordinates
WHERE {
  ?city rdfs:label "Lemmer"@nl;  # The city name "Lemmer"
        wdt:P625 ?coordinates.  # The coordinates
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],nl". }
}
LIMIT 1
"""

def get_results(endpoint_url, query):
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

result = get_results(endpoint_url, query)

print(result)

{'head': {'vars': ['city', 'cityLabel', 'coordinates']}, 'results': {'bindings': [{'city': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q144091'}, 'coordinates': {'datatype': 'http://www.opengis.net/ont/geosparql#wktLiteral', 'type': 'literal', 'value': 'Point(5.7093 52.8437)'}, 'cityLabel': {'xml:lang': 'nl', 'type': 'literal', 'value': 'Lemmer'}}]}}
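
The query above has the place name hard-coded. To look up other missing places, we could build the query from a name. The helper below is a hypothetical sketch (`build_city_query` is not part of the notebook or of SPARQLWrapper). Its result can be passed to `get_results` in place of `query`.

```python
def build_city_query(city_name, lang="nl"):
    """Build the coordinates query for a given place label."""
    # Escape backslashes and double quotes so the label stays a valid string literal
    safe = city_name.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
SELECT ?city ?cityLabel ?coordinates
WHERE {{
  ?city rdfs:label "{safe}"@{lang};
        wdt:P625 ?coordinates.
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],{lang}". }}
}}
LIMIT 1
"""

print(build_city_query("Lemmer"))
```

Note that the query matches any Wikidata item with this label and coordinates, not only cities, and `LIMIT 1` simply keeps the first one. It is worth checking the result by hand.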


In [34]:
bindings = result['results']['bindings'][0]
new_city = bindings['cityLabel']['value']
new_coordinates = bindings['coordinates']['value']

# The coordinates are in the format "Point(lon lat)"
new_lon, new_lat = new_coordinates.replace('Point(', '').replace(')', '').split()
new_coords = (float(new_lat), float(new_lon))

# Print the extracted values
print(f"City: {new_city}")
print(f"Longitude: {new_lon}")
print(f"Latitude: {new_lat}")

City: Lemmer
Longitude: 5.7093
Latitude: 52.8437
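
The cell above assumes that the query returned at least one result and that the value has the expected format. A slightly more defensive variant might look like the sketch below. The function name `first_coords` and the regular expression are assumptions for illustration.

```python
import re

# Matches a WKT literal such as "Point(5.7093 52.8437)": longitude first, then latitude
WKT_POINT = re.compile(r"Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)")

def first_coords(result):
    """Return (lat, lon) as floats from the first binding, or None if unavailable."""
    bindings = result["results"]["bindings"]
    if not bindings:
        return None  # the query found nothing
    match = WKT_POINT.fullmatch(bindings[0]["coordinates"]["value"])
    if match is None:
        return None  # unexpected coordinate format
    lon, lat = (float(g) for g in match.groups())
    return (lat, lon)

# Reduced copy of the result printed above
sample = {"results": {"bindings": [{"coordinates": {"value": "Point(5.7093 52.8437)"}}]}}
print(first_coords(sample))
print(first_coords({"results": {"bindings": []}}))
```

Returning `None` for an empty result lets us skip places that Wikidata does not know, instead of stopping with an `IndexError`.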


In [16]:
### Add the new coordinates to the dictionary for lookup

In [41]:
import pandas as pd

# New place and coordinates to add
new_place = {
    'city': new_city,
    'lat': float(new_lat),  # Latitude from Wikidata, converted from string to float
    'lng': float(new_lon),  # Longitude from Wikidata, converted from string to float
    'country': 'Netherlands',
    'iso2': 'NL',
    'admin_name': '-',
    'capital': '-',
    'population': 0,  # Replace with the actual population if available
    'population_proper': 0  # Replace with the actual population if available
}

new_place

{'city': 'Lemmer',
 'lat': 52.8437,
 'lng': 5.7093,
 'country': 'Netherlands',
 'iso2': 'NL',
 'admin_name': '-',
 'capital': '-',
 'population': 0,
 'population_proper': 0}

In [48]:
# Append the new place to the DataFrame

coordinates_df = pd.concat([coordinates_df, pd.DataFrame([new_place])], ignore_index=True)
print(coordinates_df)

                city      lat     lng      country iso2     admin_name  \
0            Tilburg    51.55  5.0833  Netherlands   NL  Noord-Brabant   
1          Amsterdam  52.3728  4.8936  Netherlands   NL  Noord-Holland   
2          Rotterdam    51.92    4.48  Netherlands   NL   Zuid-Holland   
3          The Hague    52.08    4.31  Netherlands   NL   Zuid-Holland   
4            Utrecht  52.0908  5.1217  Netherlands   NL        Utrecht   
..               ...      ...     ...          ...  ...            ...   
202  Zuid-Scharwoude  52.6833  4.8167  Netherlands   NL  Noord-Holland   
203       Poortugaal  51.8667     4.4  Netherlands   NL   Zuid-Holland   
204            Odijk  52.0503  5.2333  Netherlands   NL        Utrecht   
205           Lemmer  52.8437  5.7093  Netherlands   NL          minor   
206           Lemmer  52.8437  5.7093  Netherlands   NL          minor   

     capital  population  population_proper  
0      minor     1944588            1944588  
1    primary     14
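
The output above lists Lemmer twice (rows 205 and 206), which is what happens when the append cell is run more than once. A possible way to make the step safe to re-run is sketched below. `add_place` is a hypothetical helper, and the small dataframe is a stand-in for `coordinates_df`.

```python
import pandas as pd

# Small stand-in for coordinates_df (values copied from the notebook output)
coordinates_df = pd.DataFrame([
    {"city": "Amsterdam", "lat": 52.3728, "lng": 4.8936},
    {"city": "Groningen", "lat": 53.2189, "lng": 6.5675},
])
new_place = {"city": "Lemmer", "lat": 52.8437, "lng": 5.7093}

def add_place(df, place):
    """Append `place` unless a row with the same city name already exists."""
    if (df["city"] == place["city"]).any():
        return df
    return pd.concat([df, pd.DataFrame([place])], ignore_index=True)

# Running the step twice still leaves a single Lemmer row
coordinates_df = add_place(coordinates_df, new_place)
coordinates_df = add_place(coordinates_df, new_place)
print(coordinates_df)
```

For a dataframe that already contains duplicates, `coordinates_df.drop_duplicates(subset='city')` would remove them before saving.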

In [50]:
# Save the updated DataFrame back to the CSV file
coordinates_df.to_csv('data/nl.csv', index=False)

print(coordinates_df)


# Create a dictionary mapping city names to coordinates
city_to_coords = {row['city']: (row['lat'], row['lng']) for idx, row in coordinates_df.iterrows()}

city_to_coords

                city      lat     lng      country iso2     admin_name  \
0            Tilburg    51.55  5.0833  Netherlands   NL  Noord-Brabant   
1          Amsterdam  52.3728  4.8936  Netherlands   NL  Noord-Holland   
2          Rotterdam    51.92    4.48  Netherlands   NL   Zuid-Holland   
3          The Hague    52.08    4.31  Netherlands   NL   Zuid-Holland   
4            Utrecht  52.0908  5.1217  Netherlands   NL        Utrecht   
..               ...      ...     ...          ...  ...            ...   
202  Zuid-Scharwoude  52.6833  4.8167  Netherlands   NL  Noord-Holland   
203       Poortugaal  51.8667     4.4  Netherlands   NL   Zuid-Holland   
204            Odijk  52.0503  5.2333  Netherlands   NL        Utrecht   
205           Lemmer  52.8437  5.7093  Netherlands   NL          minor   
206           Lemmer  52.8437  5.7093  Netherlands   NL          minor   

     capital  population  population_proper  
0      minor     1944588            1944588  
1    primary     14

{'Tilburg': (51.55, 5.0833),
 'Amsterdam': (52.3728, 4.8936),
 'Rotterdam': (51.92, 4.48),
 'The Hague': (52.08, 4.31),
 'Utrecht': (52.0908, 5.1217),
 'Maastricht': (50.85, 5.6833),
 'Eindhoven': (51.4333, 5.4833),
 'Groningen': (53.2189, 6.5675),
 'Almere': (52.3667, 5.2167),
 'Breda': (51.5889, 4.7758),
 'Nijmegen': (51.8475, 5.8625),
 'Arnhem': (51.9833, 5.9167),
 'Haarlem': (52.3833, 4.6333),
 'Enschede': (52.2225, 6.8925),
 '’s-Hertogenbosch': (51.6833, 5.3),
 'Amersfoort': (52.15, 5.3833),
 'Zaanstad': (52.4333, 4.8167),
 'Apeldoorn': (52.2167, 5.9667),
 'Zwolle': (52.5167, 6.1),
 'Zoetermeer': (52.0667, 4.5),
 'Leeuwarden': (53.2, 5.7833),
 'Leiden': (52.16, 4.49),
 'Dordrecht': (51.7958, 4.6783),
 'Alphen aan den Rijn': (52.1333, 4.6667),
 'Alkmaar': (52.6333, 4.75),
 'Delft': (52.0117, 4.3592),
 'Emmen': (52.7833, 6.9),
 'Deventer': (52.25, 6.15),
 'Helmond': (51.4833, 5.65),
 'Hilversum': (52.2333, 5.1667),
 'Heerlen': (50.8833, 5.9833),
 'Lelystad': (52.5, 5.4833),
 'Purmer