https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

In [1]:
#!pip install spacy
#!python -m spacy download en_core_web_sm

In [2]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm     # the model package installed by the download command above
nlp = en_core_web_sm.load()   # loads the pretrained English pipeline (nothing is downloaded or trained here)

In [3]:
article  = '''

Air pollution is a very big deal. Its adverse effects on numerous health outcomes and general mortality are widely documented. However, our understanding of its cognitive costs is more recent and those costs are almost certainly still significantly under-emphasized. For example, cognitive effects are not mentioned in most EPA materials.

World Bank data indicate that 3.7 billion people, about half the world's population, are exposed to more than 50 µg/m³ of PM2.5 on an annual basis, 5x the unit of measure for most of the findings below.

    Chess players make more mistakes on polluted days: "We find that an increase of 10 µg/m³ raises the probability of making an error by 1.5 percentage points, and increases the magnitude of the errors by 9.4%. The impact of pollution is exacerbated by time pressure. When players approach the time control of games, an increase of 10 µg/m³, corresponding to about one standard deviation, increases the probability of making a meaningful error by 3.2 percentage points, and errors being 17.3% larger." – Künn et al 2019.
    A 3.26x (albeit with very wide CI) increase in Alzheimer's incidence for each 10 µg/m³ increase in long-term PM2.5 exposure? "Short- and long-term PM2.5 exposure was associated with increased risks of stroke (short-term odds ratio 1.01 [per µg/m³ increase in PM2.5 concentrations], 95% CI 1.01-1.02; long-term 1.14, 95% CI 1.08-1.21) and mortality (short-term 1.02, 95% CI 1.01-1.04; long-term 1.15, 95% CI 1.07-1.24) of stroke. Long-term PM2.5 exposure was associated with increased risks of dementia (1.16, 95% CI 1.07-1.26), Alzheimer's disease (3.26, 95% 0.84-12.74), ASD (1.68, 95% CI 1.20-2.34), and Parkinson's disease (1.34, 95% CI 1.04-1.73)." – Fu et al 2019. Similar effects are seen in Bishop et al 2018: "We find that a 1 µg/m³ increase in decadal PM2.5 increases the probability of a dementia diagnosis by 1.68 percentage points."
    A study of 20,000 elderly women concluded that "the effect of a 10 µg/m³ increment in long-term [PM2.5 and PM10] exposure is cognitively equivalent to aging by approximately 2 years". – Weuve et al 2013.
    "Utilizing variations in transitory and cumulative air pollution exposures for the same individuals over time in China, we provide evidence that polluted air may impede cognitive ability as people become older, especially for less educated men. Cutting annual mean concentration of particulate matter smaller than 10 µm (PM10) in China to the Environmental Protection Agency’s standard (50 µg/m³) would move people from the median to the 63rd percentile (verbal test scores) and the 58th percentile (math test scores), respectively." – Zhang et al 2018.
    Stock market returns are lower on polluted days. "This estimate indicates that a one unit increase in PM2.5 decreases the daily percentage returns by 1.7%. Put differently, a one standard deviation increase in PM2.5 decreases the daily percentage returns by 11.9%, a substantial effect on daily NYSE returns." Hayes et al 2016.
    Baseball umpires make worse decisions on polluted days. "Unique characteristics of this setting combined with high-frequency data disentangle effects of multiple pollutants and identify previously under-explored acute effects. We find a 1 ppm increase in 3 hour CO causes an 11.5% increase in the propensity of umpires to make incorrect calls and a 10 µg/m³ increase in 12-hour PM2.5 causes a 2.6% increase." Archsmith et al 2018.
    Politicians use less complex speech on polluted days. "We apply textual analysis to convert over 100,000 verbal statements made by Canadian MPs from 2006 through 2011 into—among other metrics—speech-specific Flesch-Kincaid grade-level indices. This index measures the complexity of an MP’s speech by the number of years of education needed to accurately understand it. Conditioning on individual fixed effects and other controls, we show that elevated levels of airborne fine particulate matter reduce the complexity of MPs’s speeches. A high-pollution day, defined as daily average PM2.5 concentrations greater than 15 µg/m³, causes a 2.3% reduction in same-day speech quality. To put this into perspective, this is equivalent to the removal of 2.6 months of education." Heyes et al 2019.
    "Exposure to CO2 and VOCs at levels found in conventional office buildings was associated with lower cognitive scores than those associated with levels of these compounds found in a Green building." – Allen et al 2016. The effect seems to kick in at around 1,000 ppm of CO2.

'''


## Processing with spaCy NLP

In [4]:
article = nlp(article)
len(article.ents)  # number of entities

91

In [5]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'ORG': 5,
         'CARDINAL': 25,
         'DATE': 24,
         'QUANTITY': 4,
         'PERCENT': 15,
         'MONEY': 2,
         'PERSON': 5,
         'GPE': 6,
         'ORDINAL': 1,
         'TIME': 1,
         'NORP': 1,
         'WORK_OF_ART': 2})

In [6]:
items = [x.text for x in article.ents]
Counter(items).most_common(3)

[('95%', 8), ('polluted days', 3), ('daily', 3)]

In [7]:
sentences = [x for x in article.sents]
#print(sentences[2])
sentences[3]

For example, cognitive effects are not mentioned in most EPA materials.


In [8]:
displacy.render(nlp(str(sentences[1])), jupyter=True, style='ent')


In [9]:
displacy.render(nlp(str(sentences[0])), style='dep', jupyter = True, options = {'distance': 120})

In [10]:
[(tok.orth_, tok.pos_, tok.lemma_)
 for tok in nlp(str(sentences[0]))
 if not tok.is_stop and tok.pos_ != 'PUNCT']

[('\n\n', 'SPACE', '\n\n'),
 ('Air', 'NOUN', 'air'),
 ('pollution', 'NOUN', 'pollution'),
 ('big', 'ADJ', 'big'),
 ('deal', 'NOUN', 'deal')]

In [11]:
{str(x): x.label_ for x in nlp(str(sentences[0])).ents}

{}

In [12]:
displacy.render(article, jupyter=True, style='ent')

### My attempt with a German text (Gemeinderat minutes)

In [13]:
!python -m spacy download de_core_news_sm  # download the small German model

✔ Download and installation successful
You can now load the model via spacy.load('de_core_news_sm')


In [14]:
import textract
text = textract.process("DocumentLoader.pdf") 

In [15]:
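text = str(text)  # careful: str() on a bytes object returns its repr ("b'...'"), so the \x.. escapes stay in the text
text = text.encode().decode("utf-8")  # intended as a fix for the wrongly decoded str, but this round trip changes nothing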
text = text.replace("\\xe2\\x80\\x93", "–")
text = text.replace("\\xc2\\xbb", "»")
text = text.replace("\\xc2\\xab", "«")
text = text.replace("\\xc3\\xb6", "ö")
text

'b\'1827–1836\\n\\nSubstanzielles Protokoll 70. Sitzung\\ndes Gemeinderats von Z\\xc3\\xbcrich\\nMittwoch, 30. Oktober 2019, 21.00 Uhr bis 23.27 Uhr, im Rathaus\\n\\nVorsitz: Pr\\xc3\\xa4sident Heinz Schatt (SVP)\\nBeschlussprotokoll: Sekret\\xc3\\xa4rin Elena Marti (Gr\\xc3\\xbcne)\\nSubstanzielles Protokoll: Paulina Kerber\\nAnwesend: 119 Mitglieder\\nAbwesend: Ezgi Akyol (AL), Duri Beer (SP), Dr. David Garcia Nu\\xc3\\xb1ez (AL), Joe A. Manser (SP),\\nThomas Schwendener (SVP), Dominique Zygmont (FDP)\\n\\nDer Rat behandelt aus der vom Pr\\xc3\\xa4sidenten erlassenen, separat gedruckten Tagliste folgende\\nGesch\\xc3\\xa4fte:\\nMitteilungen\\n\\n1.\\n7.\\n\\n2019/209\\n\\nWeisung vom 22.05.2019:\\nKultur, Verein Theaterhaus Gessnerallee, Verein zur Förderung\\ndes Theaters an der Winkelwiese, Theater am Neumarkt AG,\\nNeufestsetzung Beitr\\xc3\\xa4ge ab 2019 (Erhöhung Einnahmeverzichte)\\n\\nSTP\\n\\n8.\\n\\n2019/265\\n\\nWeisung vom 19.06.2019:\\nKultur, Verein Spontankonzerte/Hombi

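The decode call above was supposed to do the same as entering the replacements below manually. It does not, because `str(text)` has already turned the bytes into their escaped repr; calling `.decode("utf-8")` directly on the bytes returned by textract (as in the later cell that uses `de_core_news_md`) is what actually works: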

- text = text.replace("\\n\\n", " ")
- text = text.replace("\\n", " ")
- text = text.replace("\\xc3\\xbc", "ü")
- text = text.replace("\\xc3\\xa4", "ä")
- text = text.replace("\\xc3\\xab", "ë")
- etc.
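
The difference between `str()` on a bytes object and `.decode()` can be seen in a minimal sketch. The byte string here is a short hypothetical sample, standing in for what `textract.process()` returns:

```python
# Hypothetical sample bytes, standing in for the output of textract.process()
raw = "Gemeinderat von Zürich".encode("utf-8")

as_repr = str(raw)             # repr of the bytes object: starts with b' and keeps the \x.. escapes
as_text = raw.decode("utf-8")  # real text, with the umlaut restored

print(as_repr)
print(as_text)
```

With the decoded variant, none of the manual `replace` calls for umlauts should be needed.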

In [16]:
import spacy
from spacy import displacy
from collections import Counter
import de_core_news_sm    # the German model package downloaded above
nlp = de_core_news_sm.load()   # loads the pretrained German pipeline (nothing is downloaded or trained here)

In [17]:
text = nlp(text)
len(text.ents)  # number of entities

1431

In [18]:
labels = [x.label_ for x in text.ents]
Counter(labels)

Counter({'ORG': 366, 'MISC': 615, 'PER': 223, 'LOC': 227})

In [19]:
displacy.render(text, jupyter=True, style='ent')

In [20]:
for entity in text.ents:
    if entity.label_ == "ORG":
        print(entity, entity.label_)

b'1827–1836\n\nSubstanzielles Protokoll ORG
Pr\xc3\xa4sident Heinz Schatt ORG
AL ORG
Duri Beer ORG
SP ORG
AL ORG
SVP ORG
Spontankonzerte/Hombis Salon ORG
SP ORG
GLP ORG
GLP ORG
SP ORG
SP ORG
SP ORG
Landesstreik-Jubil\xc3\xa4ums\n\nSTP\n\n13.\n\n2018/477 ORG
AL-Fraktion ORG
Theaterhaus Gessnerallee ORG
Winkelwiese ORG
neu\nFr ORG
Theaterhaus Gessnerallee ORG
Winkelwiese ORG
um\nFr ORG
Theater Neumarkt AG ORG
Dispositiv-Ziffer ORG
j\xc3\xa4hrlich ORG
neu\nFr ORG
Theater Neumarkt AG ORG
erm\xc3\xa4chtigt ORG
wird.\n4.\n\nDie ORG
Winkelwiese ORG
selbstst\xc3\xa4ndig ORG
SP ORG
Schlussabstimmungen:\nStefan Urech ORG
SVP ORG
IMMO ORG
Dispositivziffer ORG
Theaterhaus Gessnerallee ORG
neu\nFr ORG
Theaterhaus Gessnerallee ORG
SP ORG
SVP ORG
Roger\nBartholdi ORG
SVP ORG
Yasmine Bourgeois ORG
FDP ORG
GLP ORG
FDP ORG
SP ORG
SP ORG
SP ORG
SK PRD/SSD ORG
Theater Neumarkt AG ORG
Dispositiv-Ziffer ORG
neu\nFr ORG
Theater Neumarkt AG ORG
erm\xc3\xa4chtigt ORG
SP ORG
SVP ORG
Roger\nBartholdi ORG
SVP ORG

In [21]:
# list the organisations mentioned (each only once, which is what set() does)

organisationen = []

for entity in text.ents:
    if entity.label_ == "ORG":
        organisationen.append(str(entity))

set(organisationen)

{'AHV',
 'AL',
 'AL-Fraktion',
 'Aktualit\\xc3\\xa4t',
 'Arbeiter\\n\\n20\\n\\n\\x0c70',
 'Armee',
 'B\\xc3\\xbcrgerkriegstreiber,\\ndie',
 'Bund',
 'Bund beauftragen,\\nein Denkmal',
 'Civic Tech',
 'Civic\\nTech»\\nGem\\xc3\\xa4ss schriftlicher Mitteilung',
 'Daf\\xc3\\xbcr',
 'Daf\\xc3\\xbcr w\\xc3\\xa4re',
 'Die\\nSchweizer Armee',
 'Dies\\nist',
 'Dispositiv-Ziffer',
 'Dispositivziffer',
 'Diversit\\xc3\\xa4t',
 'Duri Beer',
 'EU',
 'EVP',
 'Essens',
 'FDP',
 'FDPDominanz',
 'Filmtechnikern',
 'GLP',
 'GLP),\\nChristian Huser',
 'GLP),\\nSimone Hofer Frei',
 'Gesellschaft wach',
 'Gewerkschaft',
 'Gewerkschaften',
 'Gewerkschafter',
 'Gratistickets',
 'IMMO',
 'IT-Bereich',
 'IT-Strategie',
 'Identit\\xc3\\xa4',
 'Impuls',
 'KWO',
 'Komitees zu Gef\\xc3\\xa4ngnisstrafen',
 'Kommission',
 'Kraftwerke Oberhasli AG',
 'Kraftwerken Oberhasli AG',
 'Kulturh\\xc3\\xa4usern',
 'Landesstreik-Jubil\\xc3\\xa4ums\\nGem\\xc3\\xa4ss',
 'Landesstreik-Jubil\\xc3\\xa4ums\\n\\nSTP\\n\\n13.\\n\\n20

In [22]:
import pandas as pd
from collections import Counter

# collect every entity labelled ORG
organisationen = []

for entity in text.ents:
    if entity.label_ == "ORG":
        organisationen.append(str(entity))

# count the mentions and show the six most frequent as a table
anzahl = Counter(organisationen)

df = pd.DataFrame(dict(anzahl), index=["Nennungen"])
df_transposed = df.T
df_transposed.sort_values("Nennungen", ascending = False).head(6)

                          Nennungen
SP                               60
FDP                              44
SVP                              43
AL                               20
GLP                              20
Theaterhaus Gessnerallee          6


In [23]:
import pandas as pd
from collections import Counter

# collect every entity labelled PER (persons)
personen = []

for entity in text.ents:
    if entity.label_ == "PER":
        personen.append(str(entity))

# count the mentions and show the ten most frequent as a table
anzahl = Counter(personen)

df = pd.DataFrame(dict(anzahl), index=["Nennungen"])
df_transposed = df.T
df_transposed.sort_values("Nennungen", ascending = False).head(10)

                     Nennungen
Jean-Daniel Strub           19
Ursula N\xc3\xa4f           11
Christoph Homberger         11
Roger Bartholdi              9
Isabel Garcia                8
Patrik Maillard              8
Mark Richli                  8
Hombis Salon                 6
\xc3\xbcber                  6
Christian Huser\n            5
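
Entries such as `Christian Huser\n` are counted separately from `Christian Huser`. In this run the `\n` is a literal backslash followed by `n`, left over from `str()` on bytes. The sketch below therefore assumes properly decoded text, where line breaks are real whitespace. The helper name `normalise` and the sample list are hypothetical:

```python
from collections import Counter

def normalise(entity_text):
    """Collapse runs of whitespace (including newlines) into single spaces and trim the ends."""
    return " ".join(entity_text.split())

# Hypothetical entity strings with real line breaks, as they might look after a correct decode
personen = ["Christian Huser", "Christian Huser\n", "Roger\nBartholdi", "Roger Bartholdi"]

anzahl = Counter(normalise(p) for p in personen)
print(anzahl)
```

Normalising before counting should merge variants that differ only in whitespace. It does not remove false positives such as a preposition tagged as a person.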


### The same with a larger pretrained model

In [25]:
#!python -m spacy download de_core_news_md

Collecting de_core_news_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_md-2.2.5/de_core_news_md-2.2.5.tar.gz (224.6MB)
Building wheels

In [36]:
import spacy
from spacy import displacy

import pandas as pd

from collections import Counter

import de_core_news_md    # md instead of sm here; loading takes correspondingly longer because the model is large
nlp = de_core_news_md.load()

In [48]:
text = textract.process("DocumentLoader.pdf").decode("utf-8")
text = text.replace("\n", " ")
text

'1827–1836  Substanzielles Protokoll 70. Sitzung des Gemeinderats von Zürich Mittwoch, 30. Oktober 2019, 21.00 Uhr bis 23.27 Uhr, im Rathaus  Vorsitz: Präsident Heinz Schatt (SVP) Beschlussprotokoll: Sekretärin Elena Marti (Grüne) Substanzielles Protokoll: Paulina Kerber Anwesend: 119 Mitglieder Abwesend: Ezgi Akyol (AL), Duri Beer (SP), Dr. David Garcia Nuñez (AL), Joe A. Manser (SP), Thomas Schwendener (SVP), Dominique Zygmont (FDP)  Der Rat behandelt aus der vom Präsidenten erlassenen, separat gedruckten Tagliste folgende Geschäfte: Mitteilungen  1. 7.  2019/209  Weisung vom 22.05.2019: Kultur, Verein Theaterhaus Gessnerallee, Verein zur Förderung des Theaters an der Winkelwiese, Theater am Neumarkt AG, Neufestsetzung Beiträge ab 2019 (Erhöhung Einnahmeverzichte)  STP  8.  2019/265  Weisung vom 19.06.2019: Kultur, Verein Spontankonzerte/Hombis Salon, Beiträge 2020–2023  STP  10.  2018/425 E/A  Postulat von Urs Helfenstein (SP) und Matthias Wiesmann (GLP) vom 07.11.2018: Anreicherung

In [49]:
text = nlp(text)
len(text.ents) 

1219

In [55]:
# collect every entity labelled LOC (locations)
orte = []

for entity in text.ents:
    if entity.label_ == "LOC":
        orte.append(str(entity))

# count the mentions and show the ten most frequent as a table
anzahl = Counter(orte)

df = pd.DataFrame(dict(anzahl), index=["Nennungen"])
df_transposed = df.T
df_transposed.sort_values("Nennungen", ascending = False).head(10)

                          Nennungen
Zürich                           38
Stadt                            25
Schweiz                          19
Stadt Zürich                     16
Oerlikon                         11
der Schweiz                      10
Wipkingen                         8
Dispositiv-Ziffer                 8
Theaterhaus Gessnerallee          6
Landesmuseum                      6