<a href="https://colab.research.google.com/github/RobinBargen/General/blob/main/NLP_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **spaCy NLP - English Language**

**Imports**

In this example we use only the spaCy library. It is tailored to Natural Language Processing (NLP) and provides tokenization, lemmatization, Named Entity Recognition (NER) and more. Its `displacy` module provides built-in visualisation, which lets us display the results of the analysis graphically.



In [None]:
import spacy
from spacy import displacy

**Sample Text**

Define some sample text to analyse. A few short sentences are used here, although whole paragraphs may be submitted to the model. The text contains a name, an e-mail address and a website address to showcase the functionality of the spaCy library.

In [None]:
sample_text = "Hello! My name is Gudvang Gerdaas. E-mail to: gudvang.gerdaas-gunde@aol.com or visit http://www.google.com/"

**Processing a document**

We use a trained standard pipeline for the language in question. Here we load `en_core_web_sm`, which must already be installed in the environment. `spacy.load` returns a `Language` object for analysing English text, and calling that object on the sample text produces the processed document.



In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(sample_text)
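
Not every attribute used later depends on the trained components. As a minimal sketch, assuming spaCy v3 is installed, a blank English pipeline created with `spacy.blank("en")` contains only the tokenizer, yet lexical flags such as `like_email` and `like_url` are still available. The names `blank_nlp` and `blank_doc` are illustrative and not part of the original notebook.

```python
import spacy

# A blank pipeline has a tokenizer but no trained components (no tagger, parser or NER).
blank_nlp = spacy.blank("en")
blank_doc = blank_nlp(
    "Hello! My name is Gudvang Gerdaas. "
    "E-mail to: gudvang.gerdaas-gunde@aol.com or visit http://www.google.com/"
)

# Lexical attributes are computed from the token text alone.
emails = [token.text for token in blank_doc if token.like_email]
urls = [token.text for token in blank_doc if token.like_url]

print("Pipeline components:", blank_nlp.pipe_names)
print("E-mails:", emails)
print("URLs:", urls)
```

Part-of-speech tags, lemmas from the trained lemmatizer and named entities still require a trained pipeline such as `en_core_web_sm`.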

**Analysing the content of the document**

The processed document object can now be queried. The following cells show some of the most common operations: tokenization, lemmatization, e-mail and URL detection, part-of-speech tagging and named entity recognition.

In [None]:
# Tokenization and text extraction
cleaned_doc = [token for token in doc if not token.is_stop and not token.is_punct]

tokenization_summary = "\n".join([
    "-"*50, "Tokenization summary:",
    "Pre-processing num. tokens: " + str(len(doc)),
    "Post-processing num. tokens: " + str(len(cleaned_doc)),
    "-"*50
])
print(tokenization_summary)

# Lemmatization
cleaned_lemmas = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
lemmatization_summary = "\n".join([
    "-"*50, "Lemmatization summary:",
    "Number of lemmas: " + str(len(cleaned_lemmas)),
    "\nExtracted lemmas:",
    "\t-" + "\n\t- ".join(cleaned_lemmas),
    "-"*50
])
print(lemmatization_summary)

# Email detection
email_addresses = [token.text for token in doc if token.like_email]
email_summary = "\n".join([
    "-"*50, "Email summary:",
    "\nExtracted emails:",
    "\t-" + "\n\t- ".join(email_addresses),
    "-"*50
])
print(email_summary)

# Website link detection
url_links = [token.text for token in doc if token.like_url]
url_summary = "\n".join([
    "-"*50, "URL summary:",
    "\nExtracted URLs:",
    "\t-" + "\n\t- ".join(url_links),
    "-"*50
])
print(url_summary)

--------------------------------------------------
Tokenization summary:
Pre-processing num. tokens: 17
Post-processing num. tokens: 8
--------------------------------------------------
--------------------------------------------------
Lemmatization summary:
Number of lemmas: 8

Extracted lemmas:
	-hello
	- Gudvang
	- Gerdaas
	- e
	- mail
	- gudvang.gerdaas-gunde@aol.com
	- visit
	- http://www.google.com/
--------------------------------------------------
--------------------------------------------------
Email summary:

Extracted emails:
	-gudvang.gerdaas-gunde@aol.com
--------------------------------------------------
--------------------------------------------------
URL summary:

Extracted URLs:
	-http://www.google.com/
--------------------------------------------------
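
The four summaries above are built with the same join pattern, and in each the first bullet is printed as `-item` while the rest are printed as `- item`. One possible way to tidy this is a small helper. The function `format_summary` below is a hypothetical name, not part of the original notebook, and the `(none)` placeholder for empty lists is a design choice made for this sketch.

```python
def format_summary(title, items, width=50):
    """Return a framed summary with one consistently formatted bullet per item."""
    rule = "-" * width
    # Fall back to a placeholder so an empty list does not print a lone dash.
    bullets = [f"\t- {item}" for item in items] or ["\t(none)"]
    return "\n".join([rule, f"{title}:", *bullets, rule])


email_report = format_summary("Email summary", ["gudvang.gerdaas-gunde@aol.com"])
empty_report = format_summary("Shortenings found", [])
print(email_report)
print(empty_report)
```

With such a helper each summary in the cell reduces to a single call, for example `print(format_summary("URL summary", url_links))`.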


In [None]:
"""
--------------------------------------------------
PART OF SPEECH ANALYSIS
--------------------------------------------------
To understand our text we analyse each of the
words and their part of speech (POS) tags.
"""
# General POS tags
pos_tags = [token.text + "\t\t----> " + token.pos_ + "\t " + spacy.explain(token.pos_) for token in doc]
# 'X' is the Universal POS tag for "other"; it is used here as a rough proxy for shortenings
shortenings = [token.text for token in doc if token.pos_ == 'X']

pos_summary = "\n".join([
    "-"*50, "POS summary:",
    "\nExtracted POS tags:",
    " -" + "\n - ".join(pos_tags),
    "\nShortenings found:",
    "\n\t-" + "\n\t-".join(shortenings),
    "-"*50
])
print(pos_summary)
displacy.render(doc, style="dep", jupyter=True)

--------------------------------------------------
POS summary:

Extracted POS tags:
 -Hello		----> INTJ	 interjection
 - !		----> PUNCT	 punctuation
 - My		----> PRON	 pronoun
 - name		----> NOUN	 noun
 - is		----> AUX	 auxiliary
 - Gudvang		----> PROPN	 proper noun
 - Gerdaas		----> PROPN	 proper noun
 - .		----> PUNCT	 punctuation
 - E		----> NOUN	 noun
 - -		----> NOUN	 noun
 - mail		----> NOUN	 noun
 - to		----> PART	 particle
 - :		----> PUNCT	 punctuation
 - gudvang.gerdaas-gunde@aol.com		----> VERB	 verb
 - or		----> CCONJ	 coordinating conjunction
 - visit		----> VERB	 verb
 - http://www.google.com/		----> NOUN	 noun

Shortenings found:

	-
--------------------------------------------------
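
A frequency count per tag often gives a quicker overview than the full listing. The sketch below copies the (token, tag) pairs from the printed output above into a plain list so that it runs without a trained pipeline. With a live `doc` the same count could be built from `token.pos_` directly. The variable names are illustrative.

```python
from collections import Counter

# (token, POS) pairs copied from the POS summary printed above.
tagged = [
    ("Hello", "INTJ"), ("!", "PUNCT"), ("My", "PRON"), ("name", "NOUN"),
    ("is", "AUX"), ("Gudvang", "PROPN"), ("Gerdaas", "PROPN"), (".", "PUNCT"),
    ("E", "NOUN"), ("-", "NOUN"), ("mail", "NOUN"), ("to", "PART"),
    (":", "PUNCT"), ("gudvang.gerdaas-gunde@aol.com", "VERB"), ("or", "CCONJ"),
    ("visit", "VERB"), ("http://www.google.com/", "NOUN"),
]

# Count how often each tag occurs.
pos_counts = Counter(pos for _, pos in tagged)
for pos, count in pos_counts.most_common():
    print(f"{pos:<6} {count}")

# No token carries the 'X' tag, which is why the "Shortenings found" list is empty.
other_tokens = [text for text, pos in tagged if pos == "X"]
print("Tokens tagged X:", other_tokens)
```

The listing also shows that a statistical tagger can be wrong on unusual tokens: the e-mail address is tagged `VERB` and the hyphen in "E-mail" is tagged `NOUN`.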


In [None]:
"""
--------------------------------------------------
NAMED ENTITY RECOGNITION (NER) ANALYSIS
--------------------------------------------------
Using NER analysis we find the named entities in
the text.

For entities the trained model does not recognise,
an 'EntityRuler' can be used: patterns are specified
and the ruler is added to the NLP pipeline.
"""
named_entities = [
    ent.text + "\n  Start pos.: " + str(ent.start_char) +
     ", Stop pos.: " + str(ent.end_char) +
     "\n  Label: " + ent.label_ +
     "\n  Label descr.: " + spacy.explain(ent.label_)
     for ent in doc.ents
]
ner_summary = "\n".join([
    "-"*50, "NER summary:",
    "\nExtracted NER tags:",
    " -" + "\n - ".join(named_entities),
    "-"*50
])
print(ner_summary)
displacy.render(doc, style="ent", jupyter=True)

--------------------------------------------------
NER summary:

Extracted NER tags:
 -Gudvang Gerdaas
  Start pos.: 18, Stop pos.: 33
  Label: PERSON
  Label descr.: People, including fictional
--------------------------------------------------
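
The trained model finds only the person name, and the e-mail address and URL are not reported as entities. The docstring above mentions `EntityRuler` for such cases. The sketch below is one possible setup, assuming spaCy v3, where `nlp.add_pipe("entity_ruler")` creates the component. It uses a blank pipeline so that it runs without a downloaded model, and the labels `EMAIL` and `URL` are made up for this illustration.

```python
import spacy

# Blank pipeline: tokenizer only, so every entity found below comes from the rules.
rule_nlp = spacy.blank("en")
ruler = rule_nlp.add_pipe("entity_ruler")

# Token patterns based on lexical flags; the label names are illustrative.
ruler.add_patterns([
    {"label": "EMAIL", "pattern": [{"LIKE_EMAIL": True}]},
    {"label": "URL", "pattern": [{"LIKE_URL": True}]},
])

rule_doc = rule_nlp(
    "E-mail to: gudvang.gerdaas-gunde@aol.com or visit http://www.google.com/"
)
rule_entities = [(ent.text, ent.label_) for ent in rule_doc.ents]
print(rule_entities)
```

In a trained pipeline the ruler would typically be added with `nlp.add_pipe("entity_ruler", before="ner")` so that rule-based matches take precedence over the statistical predictions.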


# **spaCy NLP - Norwegian Language**

This section covers the same topics and methods shown above, but for the Norwegian language.

In [None]:
import spacy
from spacy import displacy
import spacy.cli


In [None]:
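# Note: "".join adds no separator, so the last word of one sentence runs into
# the first word of the next (e.g. "http://vg.no/banan" + "Meld").
# Use " ".join instead if the sentences should be kept apart.
sample_text = "".join([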
      "Det meldes om stor fare for å skli på bananskall i Buskerud.",
      "Dette er svært beklagelig meddeler Kasper Espensen.",
      "For mer informasjon om dette gå inn på http://vg.no/banan",
      "Meld inn din bekymring til nei-til-bananskall-fall@banan.no"
    ])
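
Because the sentences are joined with an empty string, the URL at the end of the third sentence is glued to the first word of the fourth. This explains the `http://vg.no/bananMeld` token in the output further down. The short comparison below uses plain Python strings, with variable names chosen for this sketch, to show the difference a separator makes.

```python
sentences = [
    "For mer informasjon om dette gå inn på http://vg.no/banan",
    "Meld inn din bekymring til nei-til-bananskall-fall@banan.no",
]

# No separator: the URL and the following word become one run of characters.
joined_without_separator = "".join(sentences)
# A single space keeps the sentence boundary visible to the tokenizer.
joined_with_space = " ".join(sentences)

print(joined_without_separator)
print(joined_with_space)
```

Switching the notebook cell to `" ".join` would change the token counts and character offsets reported in the outputs below, so those cells would need to be re-run.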

In [None]:
spacy.cli.download("nb_core_news_sm")
nlp = spacy.load("nb_core_news_sm")
doc = nlp(sample_text)

✔ Download and installation successful
You can now load the package via spacy.load('nb_core_news_sm')
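
The cell above downloads the pipeline on every run. A possible alternative, sketched here under the assumption that `spacy.load` raises `OSError` when the package is not installed, is to download only when loading fails. The helper name `load_pipeline` is hypothetical.

```python
import spacy
import spacy.cli


def load_pipeline(name):
    """Load a spaCy pipeline, downloading it first if it is not installed."""
    try:
        return spacy.load(name)
    except OSError:
        # The package was not found: fetch it once, then load it.
        spacy.cli.download(name)
        return spacy.load(name)


# Example use (requires network access on the first run):
# nlp = load_pipeline("nb_core_news_sm")
```

In some environments a freshly installed package only becomes importable after the runtime is restarted, so the second `spacy.load` call is not guaranteed to succeed everywhere.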


In [None]:
# Tokenization and text extraction
cleaned_doc = [token for token in doc if not token.is_stop and not token.is_punct]

tokenization_summary = "\n".join([
    "-"*50, "Tokenization summary:",
    "Pre-processing num. tokens: " + str(len(doc)),
    "Post-processing num. tokens: " + str(len(cleaned_doc)),
    "-"*50
])
print(tokenization_summary)

# Lemmatization
cleaned_lemmas = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
lemmatization_summary = "\n".join([
    "-"*50, "Lemmatization summary:",
    "Number of lemmas: " + str(len(cleaned_lemmas)),
    "\nExtracted lemmas:",
    "\t-" + "\n\t- ".join(cleaned_lemmas),
    "-"*50
])
print(lemmatization_summary)

# Email detection
email_addresses = [token.text for token in doc if token.like_email]
email_summary = "\n".join([
    "-"*50, "Email summary:",
    "\nExtracted emails:",
    "\t-" + "\n\t- ".join(email_addresses),
    "-"*50
])
print(email_summary)

# Website link detection
url_links = [token.text for token in doc if token.like_url]
url_summary = "\n".join([
    "-"*50, "URL summary:",
    "\nExtracted URLs:",
    "\t-" + "\n\t- ".join(url_links),
    "-"*50
])
print(url_summary)

--------------------------------------------------
Tokenization summary:
Pre-processing num. tokens: 35
Post-processing num. tokens: 14
--------------------------------------------------
--------------------------------------------------
Lemmatization summary:
Number of lemmas: 14

Extracted lemmas:
	-melde
	- fare
	- skli
	- bananskall
	- Buskerud
	- beklagelig
	- meddel
	- Kasper
	- Espensen
	- informasjon
	- http://vg.no/bananMeld
	- din
	- bekymring
	- nei-til-bananskall-fall@banan.no
--------------------------------------------------
--------------------------------------------------
Email summary:

Extracted emails:
	-nei-til-bananskall-fall@banan.no
--------------------------------------------------
--------------------------------------------------
URL summary:

Extracted URLs:
	-http://vg.no/bananMeld
--------------------------------------------------


In [None]:
"""
--------------------------------------------------
PART OF SPEECH ANALYSIS
--------------------------------------------------
To understand our text we analyse each of the
words and their part of speech (POS) tags.
"""
# General POS tags
pos_tags = [token.text + "\t\t----> " + token.pos_ + "\t " + spacy.explain(token.pos_) for token in doc]
# 'X' is the Universal POS tag for "other"; it is used here as a rough proxy for shortenings
shortenings = [token.text for token in doc if token.pos_ == 'X']

pos_summary = "\n".join([
    "-"*50, "POS summary:",
    "\nExtracted POS tags:",
    " -" + "\n - ".join(pos_tags),
    "\nShortenings found:",
    "\n\t-" + "\n\t-".join(shortenings),
    "-"*50
])
print(pos_summary)
displacy.render(doc, style="dep", jupyter=True)

--------------------------------------------------
POS summary:

Extracted POS tags:
 -Det		----> PRON	 pronoun
 - meldes		----> VERB	 verb
 - om		----> ADP	 adposition
 - stor		----> ADJ	 adjective
 - fare		----> NOUN	 noun
 - for		----> SCONJ	 subordinating conjunction
 - å		----> PART	 particle
 - skli		----> VERB	 verb
 - på		----> ADP	 adposition
 - bananskall		----> NOUN	 noun
 - i		----> ADP	 adposition
 - Buskerud		----> PROPN	 proper noun
 - .		----> PUNCT	 punctuation
 - Dette		----> PRON	 pronoun
 - er		----> AUX	 auxiliary
 - svært		----> ADJ	 adjective
 - beklagelig		----> ADJ	 adjective
 - meddeler		----> VERB	 verb
 - Kasper		----> PROPN	 proper noun
 - Espensen		----> PROPN	 proper noun
 - .		----> PUNCT	 punctuation
 - For		----> ADP	 adposition
 - mer		----> ADJ	 adjective
 - informasjon		----> NOUN	 noun
 - om		----> ADP	 adposition
 - dette		----> PRON	 pronoun
 - gå		----> VERB	 verb
 - inn		----> ADP	 adposition
 - på		----> ADP	 adposition
 - http://vg.no/bananMe

In [None]:
"""
--------------------------------------------------
NAMED ENTITY RECOGNITION (NER) ANALYSIS
--------------------------------------------------
Using NER analysis we find the named entities in
the text.

For entities the trained model does not recognise,
an 'EntityRuler' can be used: patterns are specified
and the ruler is added to the NLP pipeline.
"""
named_entities = [
    ent.text + "\n  Start pos.: " + str(ent.start_char) +
     ", Stop pos.: " + str(ent.end_char) +
     "\n  Label: " + ent.label_ +
     "\n  Label descr.: " + spacy.explain(ent.label_)
     for ent in doc.ents
]
ner_summary = "\n".join([
    "-"*50, "NER summary:",
    "\nExtracted NER tags:",
    " -" + "\n - ".join(named_entities),
    "-"*50
])
print(ner_summary)
displacy.render(doc, style="ent", jupyter=True)

--------------------------------------------------
NER summary:

Extracted NER tags:
 -Buskerud
  Start pos.: 51, Stop pos.: 59
  Label: GPE_LOC
  Label descr.: Geo-political entity, with a locative sense, e.g. 'John lives in Spain'
 - Kasper Espensen
  Start pos.: 95, Stop pos.: 110
  Label: PER
  Label descr.: Named person or family.
 - nei-til-bananskall-fall@banan.no
  Start pos.: 195, Stop pos.: 227
  Label: ORG
  Label descr.: Companies, agencies, institutions, etc.
--------------------------------------------------
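
The start and stop positions in the summary are character offsets into the original string, so each entity can be recovered by slicing. The sketch below rebuilds the sample text exactly as the notebook does and applies the offsets copied from the output above. It needs no spaCy installation, and the variable names are illustrative.

```python
# The sample text as built in the notebook (sentences joined without a separator).
norwegian_text = "".join([
    "Det meldes om stor fare for å skli på bananskall i Buskerud.",
    "Dette er svært beklagelig meddeler Kasper Espensen.",
    "For mer informasjon om dette gå inn på http://vg.no/banan",
    "Meld inn din bekymring til nei-til-bananskall-fall@banan.no",
])

# (start_char, end_char, label) triples copied from the NER summary above.
reported_spans = [(51, 59, "GPE_LOC"), (95, 110, "PER"), (195, 227, "ORG")]

# Slicing with the reported offsets returns the entity text.
recovered = [
    (norwegian_text[start:end], label) for start, end, label in reported_spans
]
for text, label in recovered:
    print(f"{label:<8} {text}")
```

The last entry shows a limitation of the statistical model: the e-mail address is labelled `ORG`. Rule-based patterns, such as those an `EntityRuler` supports, are one way to handle tokens like this deterministically.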
