# Named entity recognition

In this notebook, we perform named entity recognition (NER) on one of the articles and explain why this can be useful.

First, we load our preprocessed data into a new dataframe.

In [40]:
import pickle
import pandas as pd

# Deserialize
with open('data/preprocessed_docs.pkl', 'rb') as f:
    processed_docs = pickle.load(f)

In [41]:
processed_docs = processed_docs.dropna(subset=['content'])
# Show the first rows of the DataFrame
processed_docs.head(2)

Unnamed: 0,identifier,oaiIdentifier,type,title,date,content,krantnaam,verspreidingsgebied,month,day,doc,tokens,lemmas,GPEs
31,http://resolver.kb.nl/resolve?urn=ddd:011065450:mpeg21:a0002:ocr,DDD:ddd:011065450:mpeg21,artikel,Buitenland.,1873-01-01,Fkankhjjk. Volgens 't officieele blad heeft de minister van financiën van do Duitscho regecring ...,De standaard,Landelijk,January,Wednesday,"(Fkankhjjk, ., Volgens, 't, officieele, blad, heeft, de, minister, van, financiën, van, do, Duit...","[Fkankhjjk, ., Volgens, 't, officieele, blad, heeft, de, minister, van, financiën, van, do, Duit...","[Fkankhjjk, ., volgens, het, officieeel, blad, hebben, de, minister, van, financiën, van, do, Du...","[Fkankhjjk, Mailand, Franscho, Duitschland, Koningsbergen, Duitschland, Tirol, Italië, Spanje, M..."
37,http://resolver.kb.nl/resolve?urn=ddd:011065450:mpeg21:a0008:ocr,DDD:ddd:011065450:mpeg21,artikel,"* Groningen, 30 Dec. In eene",1873-01-01,"* Groningen, 30 Dec. In eene ring van kerkvoogdan en notabelen derAmsterdam; 31 Dec. Volgen» een...",De standaard,Landelijk,January,Wednesday,"(*, Groningen, ,, 30, Dec., In, eene, ring, van, kerkvoogdan, en, notabelen, derAmsterdam, ;, 31...","[*, Groningen, ,, 30, Dec., In, eene, ring, van, kerkvoogdan, en, notabelen, derAmsterdam, ;, 31...","[*, Groningen, ,, 30, Dec., in, een, ring, van, kerkvoogdan, en, notabelen, derAmsterdam, ;, 31,...","[Groningen, derAmsterdam, Amsterdam, Lemmer, Eetenen, I'uttershock]"


## Load the model and visualise the entities

In [15]:
import spacy

# Specify the relative path to the model directory
model_path = "model/nl_core_news_sm"

# Load the model from the relative path
nlp = spacy.load(model_path)

In [13]:
from spacy import displacy

# Get the spaCy Doc object from the second record in the DataFrame (iloc[1])
doc = processed_docs['doc'].iloc[1]

# displacy visualises the named entities
displacy.render(doc, style='ent', jupyter=True)

## Plot GPE data on a map

To plot the geopolitical entities (GPEs, such as countries and cities) on a map, we take the following steps:

* Load the coordinates data from the CSV file.
* Create a mapping of city names to their coordinates.
* Extract the coordinates for each GPE found in the processed documents.
* Add a new column with these coordinates to the DataFrame.
* Plot the coordinates on a map using a library such as folium.

### Create a city-name dictionary for lookup

We create a mapping of city names to coordinates from `data/nl.csv`, using its `city`, `lat` and `lng` columns.

In [25]:
import pandas as pd

# Load coordinates data from CSV
coordinates_df = pd.read_csv('data/nl.csv')

# Create a dictionary mapping city names to coordinates
city_to_coords = {
    city: (lat, lng)
    for city, lat, lng in zip(coordinates_df['city'], coordinates_df['lat'], coordinates_df['lng'])
}

In [43]:
# Function to get coordinates for GPEs
def get_gpe_coords(gpe_list):
    return [city_to_coords[gpe] for gpe in gpe_list if gpe in city_to_coords]

# Apply the function to create a new column with coordinates
processed_docs['GPE_coords'] = processed_docs['GPEs'].apply(get_gpe_coords)

# Display the DataFrame
pd.set_option('display.max_colwidth', None)
display(processed_docs[['identifier','GPEs','GPE_coords']].head(2))


Unnamed: 0,identifier,GPEs,GPE_coords
31,http://resolver.kb.nl/resolve?urn=ddd:011065450:mpeg21:a0002:ocr,"[Fkankhjjk, Mailand, Franscho, Duitschland, Koningsbergen, Duitschland, Tirol, Italië, Spanje, Madrid, Cuba, Spanje]",[]
37,http://resolver.kb.nl/resolve?urn=ddd:011065450:mpeg21:a0008:ocr,"[Groningen, derAmsterdam, Amsterdam, Lemmer, Eetenen, I'uttershock]","[(53.2189, 6.5675), (52.3728, 4.8936)]"
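
The output above shows that only exact matches receive coordinates: names such as "derAmsterdam" or "Duitschland" are silently dropped. The following is a minimal sketch, not part of the original pipeline, of how matched and unmatched names could be separated so that the misses can be inspected. The function name `split_gpes` and the two-entry `demo_coords` dictionary are assumptions made for illustration; the coordinates are copied from the output above.

```python
# Hypothetical two-entry lookup; the coordinates are copied from the output above
demo_coords = {
    "Groningen": (53.2189, 6.5675),
    "Amsterdam": (52.3728, 4.8936),
}

def split_gpes(gpe_list, lookup):
    """Return (matched, unmatched): coordinates found and names without a match."""
    matched, unmatched = [], []
    for gpe in gpe_list:
        name = gpe.strip()  # tolerate stray whitespace around the entity text
        if name in lookup:
            matched.append(lookup[name])
        else:
            unmatched.append(name)
    return matched, unmatched

gpes = ["Groningen", "derAmsterdam", "Amsterdam", "Lemmer", "Eetenen", "I'uttershock"]
matched, unmatched = split_gpes(gpes, demo_coords)
print(matched)    # coordinates for Groningen and Amsterdam
print(unmatched)  # names that had no exact match in the lookup
```

Inspecting the unmatched list helps to tell OCR errors (such as "derAmsterdam") apart from places that are simply missing from the coordinates file.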


In [44]:
import folium

# Plot the coordinates on a map
m = folium.Map(location=[52.0, 5.0], zoom_start=7)

for coords_list in processed_docs['GPE_coords']:
    for coords in coords_list:
        folium.Marker(location=coords).add_to(m)

# Save the map to an HTML file
m.save('map.html')

# Display the map
m
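
The loop above adds one marker per mention, so a city that is named in many articles ends up with many markers stacked on the same spot. As a sketch under assumptions, the mentions could be counted per location first. The `gpe_coords` list below is hypothetical sample data shaped like the `GPE_coords` column.

```python
from collections import Counter

# Hypothetical per-document coordinate lists, shaped like the GPE_coords column
gpe_coords = [
    [(53.2189, 6.5675), (52.3728, 4.8936)],
    [(52.3728, 4.8936)],
    [],
]

# Flatten the lists and count how often each location occurs
location_counts = Counter(
    coords for coords_list in gpe_coords for coords in coords_list
)

# Most frequently mentioned locations first
for coords, count in location_counts.most_common():
    print(coords, count)
```

Each distinct location could then get a single marker, for example with the count as popup text via `folium.Marker(location=coords, popup=str(count))`.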


In [10]:
import spacy
import pandas as pd
import folium

# Load the English spaCy model (the sample sentences below are in English).
# A separate name keeps the Dutch pipeline in nlp intact
nlp_en = spacy.load('en_core_web_sm')

# Small sample of raw texts, used to test the steps end to end
data = {
    'text': [
        "I went to Paris and then flew to New York.",
        "Berlin is a beautiful city.",
        "My friend lives in Tokyo.",
    ]
}

# Use a separate DataFrame so the newspaper data in processed_docs is not overwritten
sample_docs = pd.DataFrame(data)

# Extract the GPE entities from each text; without this step there is no 'GPEs' column
sample_docs['GPEs'] = sample_docs['text'].apply(
    lambda text: [ent.text for ent in nlp_en(text).ents if ent.label_ == 'GPE']
)

# Look up coordinates for the GPEs. city_to_coords (defined above) is built from
# data/nl.csv, so places missing from that file get no coordinates
def get_gpe_coords(gpe_list):
    return [city_to_coords[gpe] for gpe in gpe_list if gpe in city_to_coords]

sample_docs['GPE_coords'] = sample_docs['GPEs'].apply(get_gpe_coords)

# Display the DataFrame
print(sample_docs)

# Plot the coordinates on a map
m = folium.Map(location=[52.0, 5.0], zoom_start=7)

for coords_list in sample_docs['GPE_coords']:
    for coords in coords_list:
        folium.Marker(location=coords).add_to(m)

# Save to a separate file so the map of the newspaper data is not overwritten
m.save('sample_map.html')

# Display the map
m

In [25]:
def get_ents(doc):
    return [ent.text for ent in doc.ents]

In [26]:
ner['ents'] = ner['doc'].apply(get_ents)

AttributeError: 'str' object has no attribute 'ents'
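
The `AttributeError` indicates that the `doc` column of `ner` holds plain strings rather than spaCy `Doc` objects. The `ner` DataFrame is not shown in this notebook, so the cause is an assumption, but this typically happens when a DataFrame has been saved to and reloaded from a text format such as CSV. A minimal sketch of a more defensive helper follows. The names `get_ents_safe` and `fake_parse` are hypothetical, and `fake_parse` is only a stand-in so that the example runs without a model installed.

```python
from types import SimpleNamespace

def get_ents_safe(value, parse=None):
    """Return entity texts from a Doc-like object, parsing plain strings first."""
    if isinstance(value, str):
        if parse is None:
            raise TypeError("got a string; pass parse=nlp to turn it into a Doc first")
        value = parse(value)
    return [ent.text for ent in value.ents]

# Stand-in for a spaCy pipeline so the sketch runs without a model installed;
# in the notebook, parse would be the loaded nlp object
def fake_parse(text):
    ents = [
        SimpleNamespace(text=tok.strip(",."))
        for tok in text.split()
        if tok[0].isupper()  # crude rule: treat capitalised tokens as entities
    ]
    return SimpleNamespace(ents=ents)

result = get_ents_safe("de trein van Groningen naar Amsterdam", parse=fake_parse)
print(result)
```

In the notebook, the call would probably look like `ner['doc'].apply(get_ents_safe, parse=nlp)`, which re-parses each string with the loaded pipeline before the entities are read.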