<a href="https://colab.research.google.com/github/NbtKmy/gc_workshops/blob/main/NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da


- Introduction to text processing: from basic text manipulation to machine analysis.
- Exploring BeautifulSoup: extracting information from HTML and XML-like documents.
- A look at NLTK and spaCy: applying NLP techniques such as tokenization, lemmatization, and named entity recognition.

Session 2: Handling TEI-structured texts and advanced analyses

- Understanding TEI: working with structured texts in TEI format.
- Practical exercises: manipulating and analyzing TEI texts with Python.


- Advanced analysis techniques: sentiment analysis, text classification, and more.


# Named Entity Recognition (NER)

NER is an automatic procedure that identifies and classifies the named entities (proper names) in a text.

A named entity can consist of a single word or of several words.

## NER with NLTK


In [1]:
!pip install -q nltk

In [3]:
import nltk

nltk.download("punkt")  # tokenizer models
nltk.download("averaged_perceptron_tagger")  # part-of-speech tagger
nltk.download("words")  # word list
nltk.download("maxent_ne_chunker")  # chunker

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


True

In [4]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag


def preprocess(sent):
    """Tokenize a text and attach a part-of-speech tag to every token."""
    tokens = word_tokenize(sent)
    return pos_tag(tokens)

ex = """
     Our journey here lost the interest arising from beautiful scenery; but
     we arrived in a few days at Rotterdam, whence we proceeded by sea to
     England.
     """

sent = preprocess(ex)
sent

[('Our', 'PRP$'),
 ('journey', 'NN'),
 ('here', 'RB'),
 ('lost', 'VBD'),
 ('the', 'DT'),
 ('interest', 'NN'),
 ('arising', 'VBG'),
 ('from', 'IN'),
 ('beautiful', 'JJ'),
 ('scenery', 'NN'),
 (';', ':'),
 ('but', 'CC'),
 ('we', 'PRP'),
 ('arrived', 'VBD'),
 ('in', 'IN'),
 ('a', 'DT'),
 ('few', 'JJ'),
 ('days', 'NNS'),
 ('at', 'IN'),
 ('Rotterdam', 'NNP'),
 (',', ','),
 ('whence', 'NN'),
 ('we', 'PRP'),
 ('proceeded', 'VBD'),
 ('by', 'IN'),
 ('sea', 'NN'),
 ('to', 'TO'),
 ('England', 'NNP'),
 ('.', '.')]

There are several POS tagsets, for example:
- [Universal POS Tags](https://universaldependencies.org/u/pos/)
- [STTS (Stuttgart-Tübingen-Tagset) tags](https://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/mitarbeiter-innen/hagen/STTS_Tagset_Tiger)


In NLTK, the POS tags look like this (Penn Treebank tagset):


| Tag | Meaning |
|-----|---------|
| CC | coordinating conjunction |
| CD | cardinal number |
| DT | determiner |
| EX | existential there |
| FW | foreign word |
| IN | preposition/subordinating conjunction |
| JJ | adjective (large) |
| JJR | adjective, comparative (larger) |
| JJS | adjective, superlative (largest) |
| LS | list item marker |
| MD | modal (could, will) |
| NN | noun, singular (cat, tree) |
| NNS | noun, plural (desks) |
| NNP | proper noun, singular (Sarah) |
| NNPS | proper noun, plural (Indians or Americans) |
| PDT | predeterminer (all, both, half) |
| POS | possessive ending (parent's) |
| PRP | personal pronoun (hers, herself, him, himself) |
| PRP$ | possessive pronoun (her, his, mine, my, our) |
| RB | adverb (occasionally, swiftly) |
| RBR | adverb, comparative (greater) |
| RBS | adverb, superlative (biggest) |
| RP | particle (about) |
| TO | infinitive marker (to) |
| UH | interjection (goodbye) |
| VB | verb (ask) |
| VBG | verb, gerund (judging) |
| VBD | verb, past tense (pleaded) |
| VBN | verb, past participle (reunified) |
| VBP | verb, present tense, not 3rd person singular (wrap) |
| VBZ | verb, present tense, 3rd person singular (bases) |
| WDT | wh-determiner (that, what) |
| WP | wh-pronoun (who) |
| WRB | wh-adverb (how) |



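The tokens and POS (part-of-speech) tags we obtained are restructured with `RegexpParser`. The restructured sentence now has a tree structure. The pattern used here groups an optional determiner, any number of adjectives, and one singular noun into a noun-phrase (NP) chunk.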

In [7]:
pattern = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(pattern)
cs = cp.parse(sent)
print(cs)

(S
  Our/PRP$
  (NP journey/NN)
  here/RB
  lost/VBD
  (NP the/DT interest/NN)
  arising/VBG
  from/IN
  (NP beautiful/JJ scenery/NN)
  ;/:
  but/CC
  we/PRP
  arrived/VBD
  in/IN
  a/DT
  few/JJ
  days/NNS
  at/IN
  Rotterdam/NNP
  ,/,
  (NP whence/NN)
  we/PRP
  proceeded/VBD
  by/IN
  (NP sea/NN)
  to/TO
  England/NNP
  ./.)
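
In the tree above, `Rotterdam/NNP` and `England/NNP` stay outside every chunk, because the pattern only accepts the tag `NN`. The following is a minimal sketch of how the grammar could be widened so that plural and proper nouns are chunked too. The pattern `<NN.*>+`, the variable names, and the shortened, hand-copied tag list are assumptions made for illustration; they are not part of the original notebook.

```python
import nltk

# Tagged tokens copied by hand from the pos_tag output above (shortened).
tagged = [
    ("we", "PRP"), ("arrived", "VBD"), ("in", "IN"), ("a", "DT"),
    ("few", "JJ"), ("days", "NNS"), ("at", "IN"), ("Rotterdam", "NNP"),
    (",", ","), ("whence", "NN"), ("we", "PRP"), ("proceeded", "VBD"),
    ("by", "IN"), ("sea", "NN"), ("to", "TO"), ("England", "NNP"),
    (".", "."),
]

# <NN.*> matches NN, NNS, NNP and NNPS; "+" allows several nouns in a row.
wide_pattern = "NP: {<DT>?<JJ>*<NN.*>+}"
wide_parser = nltk.RegexpParser(wide_pattern)
wide_tree = wide_parser.parse(tagged)

# Collect the words of every NP chunk.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in wide_tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)
```

Note that this is still noun-phrase chunking based on POS tags, not a classification into entity types such as person or place.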


The function `tree2conlltags` converts the tree into tags in IOB format.

## IOB format

The IOB format is a standard for marking the extent of a named entity (or, as here, of a chunk).
Because many named entities consist of more than one word, this kind of marking is necessary.

- __I__: the token is inside an entity.
- __O__: the token is outside of an entity.
- __B__: the token is at the beginning of an entity.

This makes it possible to delimit the entities.


In [6]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

[('Our', 'PRP$', 'O'),
 ('journey', 'NN', 'B-NP'),
 ('here', 'RB', 'O'),
 ('lost', 'VBD', 'O'),
 ('the', 'DT', 'B-NP'),
 ('interest', 'NN', 'I-NP'),
 ('arising', 'VBG', 'O'),
 ('from', 'IN', 'O'),
 ('beautiful', 'JJ', 'B-NP'),
 ('scenery', 'NN', 'I-NP'),
 (';', ':', 'O'),
 ('but', 'CC', 'O'),
 ('we', 'PRP', 'O'),
 ('arrived', 'VBD', 'O'),
 ('in', 'IN', 'O'),
 ('a', 'DT', 'O'),
 ('few', 'JJ', 'O'),
 ('days', 'NNS', 'O'),
 ('at', 'IN', 'O'),
 ('Rotterdam', 'NNP', 'O'),
 (',', ',', 'O'),
 ('whence', 'NN', 'B-NP'),
 ('we', 'PRP', 'O'),
 ('proceeded', 'VBD', 'O'),
 ('by', 'IN', 'O'),
 ('sea', 'NN', 'B-NP'),
 ('to', 'TO', 'O'),
 ('England', 'NNP', 'O'),
 ('.', '.', 'O')]
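
IOB tags like the ones printed above can be grouped back into multi-word units. The sketch below uses plain Python only; the helper name `iob_to_spans` is hypothetical, and the sample triples are copied from the output above.

```python
def iob_to_spans(tagged_tokens):
    """Group (word, pos, iob) triples into (label, text) spans."""
    spans = []
    current_label = None
    current_words = []
    for word, _pos, iob in tagged_tokens:
        if iob == "O":
            prefix, label = "O", None
        else:
            prefix, _, label = iob.partition("-")
        # A span ends on O, on B, or when an I tag carries a different label.
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        if prefix == "O" or starts_new:
            if current_words:
                spans.append((current_label, " ".join(current_words)))
            current_label, current_words = None, []
        if prefix != "O":
            current_label = label
            current_words.append(word)
    # Flush the last open span, if any.
    if current_words:
        spans.append((current_label, " ".join(current_words)))
    return spans


sample = [
    ("the", "DT", "B-NP"),
    ("interest", "NN", "I-NP"),
    ("arising", "VBG", "O"),
    ("from", "IN", "O"),
    ("beautiful", "JJ", "B-NP"),
    ("scenery", "NN", "I-NP"),
]
print(iob_to_spans(sample))
```

Two adjacent chunks stay separate because the second one starts with its own `B` tag, which is the reason the format distinguishes `B` from `I`.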


## NER with spaCy

Entity types in spaCy include, for example:

- PERSON - People, including fictional.

- NORP - Nationalities or religious or political groups.

- FAC - Buildings, airports, highways, bridges, etc.

- ORG - Companies, agencies, institutions, etc.

- GPE - Countries, cities, states.

- LOC - Non-GPE locations, mountain ranges, bodies of water.

Further types are listed [here](https://www.kaggle.com/code/curiousprogrammer/entity-extraction-and-classification-using-spacy?scriptVersionId=11364473&cellId=9).



In [None]:
!pip install -q spacy

spaCy provides several trained pipelines/models for [English](https://spacy.io/models/en#):

- en_core_web_sm
- en_core_web_md
- en_core_web_lg
- en_core_web_trf


These models were trained in different ways (mostly on written web text, news, and similar material) and therefore perform differently.
More detailed descriptions can be found on the website linked above.

Here we use all of these models so that the differences in their NER performance become visible.


### en_core_web_sm

In [None]:
import spacy
from spacy import displacy
from pprint import pprint


nlp = spacy.load("en_core_web_sm")
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
pprint([(X.text, X.label_) for X in doc.ents])

[('European', 'NORP'),
 ('Google', 'ORG'),
 ('$5.1 billion', 'MONEY'),
 ('Wednesday', 'DATE')]


In [None]:
ex_sent = """
          Beyond Cologne we descended to the plains of Holland; and we resolved to
          post the remainder of our way; for the wind was contrary, and the stream
          of the river was too gentle to aid us.
          Our journey here lost the interest arising from beautiful scenery; but
          we arrived in a few days at Rotterdam, whence we proceeded by sea to
          England. It was on a clear morning, in the latter days of December, that
          I first saw the white cliffs of Britain. The banks of the Thames
          presented a new scene; they were flat, but fertile, and almost every
          town was marked by the remembrance of some story. We saw Tilbury Fort,
          and remembered the Spanish armada; Gravesend, Woolwich, and Greenwich,
          places which I had heard of even in my country.
          """

displacy.render(nlp(ex_sent), jupyter=True, style="ent")

### en_core_web_md

In [None]:
import spacy
from spacy import displacy
from spacy.cli import download

download("en_core_web_md")



nlp = spacy.load("en_core_web_md")
ex_sent = """
          Beyond Cologne we descended to the plains of Holland; and we resolved to
          post the remainder of our way; for the wind was contrary, and the stream
          of the river was too gentle to aid us.
          Our journey here lost the interest arising from beautiful scenery; but
          we arrived in a few days at Rotterdam, whence we proceeded by sea to
          England. It was on a clear morning, in the latter days of December, that
          I first saw the white cliffs of Britain. The banks of the Thames
          presented a new scene; they were flat, but fertile, and almost every
          town was marked by the remembrance of some story. We saw Tilbury Fort,
          and remembered the Spanish armada; Gravesend, Woolwich, and Greenwich,
          places which I had heard of even in my country.
          """

displacy.render(nlp(ex_sent), jupyter=True, style="ent")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


### en_core_web_lg

In [None]:
import spacy
from spacy import displacy
from spacy.cli import download

download("en_core_web_lg")



nlp = spacy.load("en_core_web_lg")
ex_sent = """
          Beyond Cologne we descended to the plains of Holland; and we resolved to
          post the remainder of our way; for the wind was contrary, and the stream
          of the river was too gentle to aid us.
          Our journey here lost the interest arising from beautiful scenery; but
          we arrived in a few days at Rotterdam, whence we proceeded by sea to
          England. It was on a clear morning, in the latter days of December, that
          I first saw the white cliffs of Britain. The banks of the Thames
          presented a new scene; they were flat, but fertile, and almost every
          town was marked by the remembrance of some story. We saw Tilbury Fort,
          and remembered the Spanish armada; Gravesend, Woolwich, and Greenwich,
          places which I had heard of even in my country.
          """

displacy.render(nlp(ex_sent), jupyter=True, style="ent")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


### spaCy & Transformers

In [None]:
!pip install -q spacy[transformers]


In [None]:
from spacy.cli import download
download("en_core_web_trf")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')


In [None]:
import spacy
import spacy_transformers
from spacy import displacy



nlp = spacy.load("en_core_web_trf")
ex_sent = """
          Beyond Cologne we descended to the plains of Holland; and we resolved to
          post the remainder of our way; for the wind was contrary, and the stream
          of the river was too gentle to aid us.
          Our journey here lost the interest arising from beautiful scenery; but
          we arrived in a few days at Rotterdam, whence we proceeded by sea to
          England. It was on a clear morning, in the latter days of December, that
          I first saw the white cliffs of Britain. The banks of the Thames
          presented a new scene; they were flat, but fertile, and almost every
          town was marked by the remembrance of some story. We saw Tilbury Fort,
          and remembered the Spanish armada; Gravesend, Woolwich, and Greenwich,
          places which I had heard of even in my country.
          """
ent_analyse = nlp(ex_sent)
displacy.render(ent_analyse, jupyter=True, style="ent")



In [None]:
# The entities can also be retrieved in another form.
# In the example below we pick the place names out of the text:
for ent in ent_analyse.ents:
    if ent.label_ in {"LOC", "GPE", "FAC"}:
        tup = (ent.text, ent.start_char, ent.end_char, ent.label_)
        print(tup)

('Cologne', 18, 25, 'GPE')
('Holland', 56, 63, 'GPE')
('Rotterdam', 335, 344, 'GPE')
('England', 386, 393, 'GPE')
('Britain', 501, 508, 'GPE')
('Thames', 527, 533, 'LOC')
('Tilbury Fort', 680, 692, 'FAC')
('Gravesend', 739, 748, 'GPE')
('Woolwich', 750, 758, 'GPE')
('Greenwich', 764, 773, 'GPE')


Judging from the examples we have looked at, the model "en_core_web_trf" seems to deliver the best result. Because we inspected the entities at the document level, a place name made of two words ("Tilbury Fort") is treated as a single entity.

Of course, the same can be inspected at the token level:

In [None]:
for tok in ent_analyse:
    if tok.ent_type_ == "FAC":
        tup = (tok.text, tok.ent_iob_, tok.ent_type_)
        print(tup)

('Tilbury', 'B', 'FAC')
('Fort', 'I', 'FAC')



spaCy also marks the tokens in IOB format.

In this example we did not split the text into sentences. This is unfavorable if you want to analyze a very long passage or even an entire novel.
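
One way to keep long inputs manageable is to cut the text into smaller pieces before running the pipeline. The sketch below is plain Python and splits at blank lines; the helper name `split_into_batches` and the `max_chars` limit are assumptions chosen for illustration, not something the notebook prescribes.

```python
def split_into_batches(text, max_chars=5000):
    """Split text at blank lines into batches of roughly max_chars characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    batches, current, size = [], [], 0
    for para in paragraphs:
        # Close the current batch if this paragraph would push it over the limit.
        if current and size + len(para) > max_chars:
            batches.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        batches.append("\n\n".join(current))
    return batches


demo = "First paragraph about Cologne.\n\nSecond paragraph about Rotterdam.\n\nThird one."
batches = split_into_batches(demo, max_chars=40)
print(batches)
```

A single paragraph longer than `max_chars` is kept whole, so the limit is only approximate. The resulting batches could then be passed to `nlp.pipe(batches)`, which yields one `Doc` per text, and sentences within a `Doc` are available through `doc.sents` when the pipeline sets sentence boundaries.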


## Quiz

Take a different text excerpt from anywhere and run the NER analysis on it.

...Since we have used English texts so far, you can of course use English texts again. If you would like to use German texts, feel free to try that. The pretrained models for German are listed [here](https://spacy.io/models/de).



## What is NER for?

Here we think about one possible use case...

In [None]:
!pip install -q geopy

In [None]:
from spacy.cli import download
download("en_core_web_trf")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')


In [None]:
import spacy
import spacy_transformers

nlp = spacy.load("en_core_web_trf")
ex_sent = """
          Beyond Cologne we descended to the plains of Holland; and we resolved to
          post the remainder of our way; for the wind was contrary, and the stream
          of the river was too gentle to aid us.
          Our journey here lost the interest arising from beautiful scenery; but
          we arrived in a few days at Rotterdam, whence we proceeded by sea to
          England. It was on a clear morning, in the latter days of December, that
          I first saw the white cliffs of Britain. The banks of the Thames
          presented a new scene; they were flat, but fertile, and almost every
          town was marked by the remembrance of some story. We saw Tilbury Fort,
          and remembered the Spanish armada; Gravesend, Woolwich, and Greenwich,
          places which I had heard of even in my country.
          """

ent_analyse = nlp(ex_sent)
locations = []
for ent in ent_analyse.ents:
    if ent.label_ == "LOC" or ent.label_ == "GPE" or ent.label_ == "FAC":
        locations.append(ent.text)


In [None]:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="ner_use_case")

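geo_codes = []
found = []
for loc in locations:
    location = geolocator.geocode(loc)
    if location is not None:  # geocode() returns None if nothing matches
        found.append(loc)
        geo_codes.append([location.latitude, location.longitude])
locations = found  # keep names and coordinates aligned for the map
print(geo_codes)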

[[50.938361, 6.959974], [52.2434979, 5.6343227], [51.9244424, 4.47775], [52.5310214, -1.2649062], [54.7023545, -3.2765753], [51.5575525, -0.781404], [51.453058299999995, 0.3747360852853912], [51.4424747, 0.3694468], [51.4826696, 0.0623335], [51.46862565, 0.04883057313755089]]
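
The same place name often occurs many times in a text, and the public Nominatim service asks clients to send at most one request per second. The sketch below wraps any geocoding function with a cache and a pause between real lookups. The names `make_cached_geocoder` and `fake_geocode` are hypothetical, and `fake_geocode` only stands in for `geolocator.geocode` so that the example runs without network access; the Rotterdam coordinates are taken from the output above.

```python
import time


def make_cached_geocoder(geocode, min_delay_seconds=1.0):
    """Wrap a geocode callable with a cache and a pause between real lookups."""
    cache = {}

    def lookup(name):
        if name not in cache:
            # Only names not seen before trigger a real request.
            cache[name] = geocode(name)
            time.sleep(min_delay_seconds)
        return cache[name]

    return lookup


# Hypothetical stand-in for geolocator.geocode, so the sketch runs offline.
calls = []


def fake_geocode(name):
    calls.append(name)
    known = {"Rotterdam": (51.9244424, 4.47775)}
    return known.get(name)


lookup = make_cached_geocoder(fake_geocode, min_delay_seconds=0)
results = [lookup(n) for n in ["Rotterdam", "Rotterdam", "Nowhere"]]
print(results)
print(calls)
```

With the real service, one would pass `geolocator.geocode` instead of `fake_geocode` and keep the default delay. geopy also ships a ready-made `RateLimiter` in `geopy.extra.rate_limiter` that could be used for the throttling part.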


In [None]:
import folium

tile_name = "cartodbdark_matter"

# Named loc_map so that Python's built-in map() is not shadowed.
loc_map = folium.Map(
    location=[52.2434979, 5.6343227],
    zoom_start=4,
    tiles=tile_name
)


markers = folium.FeatureGroup(name="test")
for ortsname, loc in zip(locations, geo_codes):
    # Add each marker to the FeatureGroup
    folium.Marker(location=loc, popup=ortsname).add_to(markers)

markers.add_to(loc_map)
folium.LayerControl().add_to(loc_map)

loc_map

In [None]:
import spacy
import spacy_transformers
import collections

nlp = spacy.load("en_core_web_trf")

with open("frankenstein_ch0.txt", encoding="UTF-8") as f:
    text_ch0 = f.read()


res = nlp(text_ch0)
locations = []
for ent in res.ents:
    if ent.label_ in {"LOC", "GPE", "FAC"}:
        locations.append(ent.text)

counts = collections.Counter(locations)
max_value = max(counts.values())
# Scale every count by the highest count, so the values lie between 0 and 1.
locations = list(set(locations))
locs_dict = [{"locname": name, "value": counts[name] / max_value} for name in locations]
print(locs_dict)

[{'locname': 'St. Petersburgh', 'value': 0.16666666666666666}, {'locname': 'the North Sea', 'value': 0.16666666666666666}, {'locname': 'Africa', 'value': 0.16666666666666666}, {'locname': 'Petersburgh', 'value': 0.16666666666666666}, {'locname': 'England', 'value': 1.0}, {'locname': 'Archangel', 'value': 0.5}, {'locname': 'earth', 'value': 0.16666666666666666}, {'locname': 'America', 'value': 0.16666666666666666}, {'locname': 'Greenland', 'value': 0.16666666666666666}, {'locname': 'Russia', 'value': 0.16666666666666666}, {'locname': 'London', 'value': 0.16666666666666666}, {'locname': 'the North Pacific Ocean', 'value': 0.16666666666666666}]


In [None]:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="ner")

heatmap_data = []
for loc in locs_dict:
    location = geolocator.geocode(loc["locname"])
    if location is not None:
        heatmap_data.append([location.latitude, location.longitude, loc["value"]])


In [None]:
import folium
from folium import plugins

tile_name = "cartodbdark_matter"

map1 = folium.Map(
    location=[52.2434979, 5.6343227],
    zoom_start=4,
    tiles=tile_name
)

folium.plugins.HeatMap(data=heatmap_data,
                       name="Locations in Frankenstein Ch0",
                       control=True).add_to(map1)

folium.LayerControl().add_to(map1)
map1

In [None]:
import spacy
import spacy_transformers
import collections
import glob
from geopy.geocoders import Nominatim

nlp = spacy.load("en_core_web_trf")


def create_heatmap_data(text):
    res = nlp(text)
    locations = []
    for ent in res.ents:
        if ent.label_ in {"LOC", "GPE", "FAC"}:
            locations.append(ent.text)

    counts = collections.Counter(locations)
    if not counts:
        # No place names found, so there is nothing to put on the map.
        return []
    max_value = max(counts.values())
    # Scale every count by the highest count, so the values lie between 0 and 1.
    locs_dict = [
        {"locname": name, "value": count / max_value}
        for name, count in counts.items()
    ]

    # Get Geocode
    geolocator = Nominatim(user_agent="ner")

    heatmap_data = []
    for loc in locs_dict:
        location = geolocator.geocode(loc["locname"])
        if location is not None:
            heatmap_data.append([location.latitude, location.longitude, loc["value"]])

    return heatmap_data

index = []
heatmap_2d_array = []
for x, file in enumerate(sorted(glob.glob("frankenstein_*.txt"))):
    with open(file, encoding="UTF-8") as f:
        text = f.read()
    heatmap_2d_array.append(create_heatmap_data(text))
    index.append("ch" + str(x))
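
One caveat with `sorted()` on file names: the names are compared as strings, so a chapter 10 would be placed before chapter 2, and the `ch` labels in `index` would no longer match the files. The sketch below sorts by the number inside the name instead. The helper `chapter_number` is hypothetical, and the file names follow the pattern of `frankenstein_ch0.txt` as an assumption about how the other chapter files are named.

```python
import re


def chapter_number(filename):
    """Return the first integer in the file name, or -1 if there is none."""
    match = re.search(r"(\d+)", filename)
    return int(match.group(1)) if match else -1


files = ["frankenstein_ch10.txt", "frankenstein_ch2.txt", "frankenstein_ch0.txt"]
print(sorted(files))                      # string order
print(sorted(files, key=chapter_number))  # numeric chapter order
```

In the loop above, this would mean calling `sorted(glob.glob("frankenstein_*.txt"), key=chapter_number)`.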






In [None]:
import folium
from folium import plugins

tile_name = "cartodbdark_matter"

map1 = folium.Map(
    location=[52.2434979, 5.6343227],
    zoom_start=4,
    tiles=tile_name
)

folium.plugins.HeatMapWithTime(data=heatmap_2d_array,
                       index=index,
                       name="Locations in Frankenstein",
                       control=True).add_to(map1)

folium.LayerControl().add_to(map1)
map1