<a href="https://colab.research.google.com/github/PeerChristensen/NLP-Demos/blob/main/SpaCy_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A quick introduction to spaCy


### Getting started



spaCy must be upgraded for use with Colab. 
Here, we download a small Danish model, `da_core_news_sm`. 

Other options include `da_core_news_lg` and a transformer model, `da_core_news_trf` (built on Maltehb/danish-bert-botxo, which is also available through Hugging Face).

More information about the Danish models is available [here](https://spacy.io/models/da).

In [2]:
!pip install --upgrade spacy
!python -m spacy download da_core_news_sm

import spacy
nlp = spacy.load("da_core_news_sm")

Collecting spacy
  Downloading spacy-3.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Collecting thinc<8.1.0,>=8.0.12
  Downloading thinc-8.0.13-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (628 kB)
Collecting spacy-legacy<3.1.0,>=3.0.8
  Downloading spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.1-py3-none-any.whl (7.0 kB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)
Collecting typer<0.5.0,>=0.3.0
  Downloading typer-0.4.0-py3-none-any.whl (27 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.6-py3-
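
`spacy.load` raises an error when the model package is not installed in the running environment. The following is a minimal sketch of a guarded load, assuming you would rather fall back to a blank Danish pipeline than stop. `load_danish` and `nlp_checked` are hypothetical names used only for this illustration.

```python
import spacy
from spacy.util import is_package


def load_danish(model_name="da_core_news_sm"):
    """Load an installed Danish pipeline, or fall back to a blank one."""
    if is_package(model_name):
        return spacy.load(model_name)
    # Fallback: tokenizer and language defaults only (no tagger, parser or NER)
    return spacy.blank("da")


nlp_checked = load_danish()
print(nlp_checked.lang, nlp_checked.pipe_names)
```

An empty `pipe_names` list means the fallback was used, so part-of-speech tags, lemmas and entities will not be available.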

In [3]:
text = "Mens pandemien er i vækst over store dele af Europa, er en ny og bekymrende coronavariant ved navn Omikron opdaget i Sydafrika."

doc = nlp(text)

### What's in the `doc` container?

The `doc` variable contains a sequence of tokens, each with a range of linguistic information attached.
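
As a rough illustration of that container behaviour, the sketch below indexes, slices and iterates over a `Doc`. It assumes a blank Danish pipeline (tokenizer only) as a stand-in for the trained model, so only lexical attributes are shown. The names `blank_nlp` and `sample_doc` are made up for this example.

```python
import spacy

# Assumption: a blank Danish pipeline stands in for the trained model
blank_nlp = spacy.blank("da")
sample_doc = blank_nlp("Omikron er opdaget i Sydafrika.")

print(len(sample_doc))   # number of tokens
first = sample_doc[0]    # indexing returns a Token
span = sample_doc[3:5]   # slicing returns a Span
print(first.text, first.i, first.idx)  # text, token index, character offset
print(span.text)

# Iterating yields Token objects with their attributes
for token in sample_doc:
    print(token.text, token.is_alpha, token.is_punct)
```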

Here, we create and use a function to return a dataframe with some key info.

In [6]:
import pandas as pd

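def df_from_text(text):
  """Return one dataframe row per token in `text`."""

  cols = ("token", "lemma", "POS", "POSexplained", "stopword")
  rows = []

  # Process the argument itself rather than relying on the global `doc`
  doc = nlp(text)
  for word in doc:
    row = [word.text, word.lemma_, word.pos_, spacy.explain(word.pos_), word.is_stop]
    rows.append(row)

  return pd.DataFrame(rows, columns=cols)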

In [7]:
df_from_text(text)

|   | token | lemma | POS | POSexplained | stopword |
|---|---|---|---|---|---|
| 0 | Mens | Mens | SCONJ | subordinating conjunction | True |
| 1 | pandemien | pandemien | NOUN | noun | False |
| 2 | er | være | VERB | verb | True |
| 3 | i | i | ADP | adposition | True |
| 4 | vækst | vækst | NOUN | noun | False |
| 5 | over | over | ADP | adposition | True |
| 6 | store | stor | ADJ | adjective | False |
| 7 | dele | del | NOUN | noun | False |
| 8 | af | af | ADP | adposition | True |
| 9 | Europa | Europa | PROPN | proper noun | False |


### Extracting named entities

Most models can extract named **persons**, **locations** and **organizations**.

In [None]:
text2 = "Du forbinder nok jul med pebernødder, risalamande og brunkager, men hvordan smager julen i Bulgarien, Italien og Sverige?"

def get_entities(text):

  cols = ("entity", "label", "start", "end")
  rows = []

  doc = nlp(text)
  for ent in doc.ents:
    row = [ent.text, ent.label_, ent.start_char, ent.end_char]
    rows.append(row)

  return pd.DataFrame(rows, columns=cols)
    
get_entities(text2)
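
The `start` and `end` columns above are character offsets into the original string. The sketch below checks that relationship. It assumes a rule-based `entity_ruler` on a blank Danish pipeline as a stand-in for the trained `ner` component, and the `LOC` label and the three patterns are chosen purely for illustration.

```python
import spacy

# Assumption: a rule-based EntityRuler stands in for the trained "ner" component
rule_nlp = spacy.blank("da")
ruler = rule_nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "LOC", "pattern": "Bulgarien"},
    {"label": "LOC", "pattern": "Italien"},
    {"label": "LOC", "pattern": "Sverige"},
])

sample = "Hvordan smager julen i Bulgarien, Italien og Sverige?"
sample_doc = rule_nlp(sample)

for ent in sample_doc.ents:
    # start_char/end_char index into the string; start/end index into the tokens
    assert sample[ent.start_char:ent.end_char] == ent.text
    print(ent.text, ent.label_, ent.start_char, ent.end_char, ent.start, ent.end)
```

Character offsets are convenient for highlighting text, while the token offsets `ent.start` and `ent.end` are what you would use to slice the `Doc` itself.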

With displaCy, spaCy's built-in visualizer, highlighting named entities within a text is easy.

In [None]:
from spacy import displacy

text3 = "Det kommer næppe som en stor overraskelse, at Donald Trump endnu engang ikke har noget pænt at sige om hertuginde Meghan, der er gift med Storbritanniens prins Harry.  Den forhenværende amerikanske præsident har tidligere udtalt, at han »ikke er fan af Meghan«, og nu forklarer Donald Trump sig i et interview til den tidligere UKIP-formand, Nigel Farage.  I interviewet, der sendes i aften på britisk tv, beskylder Donald Trump hertuginde Meghan for at »mangle respekt« og for at »skade« dronning Elizabeth, skriver The Daily Mail. Donald Trump kommenterer også beskyldninger om, at hertuginden skulle være manipulerende over for sin mand. Hun skulle ifølge anklagerne være den direkte årsag til,  at hertugparret i begyndelsen af 2020 valgte at træde ud af den kongelige familie for i stedet at flytte USA."

doc = nlp(text3)

displacy.render(doc, style="ent", jupyter=True)
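
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. This is a minimal sketch, again assuming a rule-based `entity_ruler` on a blank pipeline in place of the trained model. The `PER` label, the single pattern and the names `html_nlp` and `html_doc` are illustrative.

```python
import spacy
from spacy import displacy

# Assumption: one hand-written pattern stands in for the trained NER component
html_nlp = spacy.blank("da")
ruler = html_nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PER", "pattern": "Donald Trump"}])
html_doc = html_nlp("Donald Trump kommenterer også beskyldningerne.")

# jupyter=False returns the markup; page=True wraps it in a full HTML page
html = displacy.render(html_doc, style="ent", jupyter=False, page=True)
print(html[:200])
```

The returned string can then be written to an `.html` file or embedded in a web page.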

### Tokenization


For downstream tasks such as text classification or topic modelling, we might try different ways of preprocessing texts.

In this case, we create a function that only outputs lemmatized nouns transformed to lowercase.

In [None]:
def noun_lemmatizer(sentence):
    """Using SpaCy to lemmatize and extract nouns"""
    tokens = nlp(sentence)
    tokens = [word.lemma_.lower() for word in tokens if word.pos_ == "NOUN"]
    return tokens

In [None]:
text4 = "Du forbinder nok jul med pebernødder, risalamande og brunkager, men hvordan smager julen i Bulgarien, Italien og Sverige?"

noun_lemmatizer(text4)
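
As one alternative way of preprocessing, the sketch below keeps lowercased alphabetic tokens that are not stop words. It needs no tagger or lemmatizer, so it assumes only a blank Danish pipeline. `content_words` and `prep_nlp` are hypothetical names for this example.

```python
import spacy

# Assumption: a blank pipeline is enough, since only lexical attributes are used
prep_nlp = spacy.blank("da")


def content_words(sentence):
    """Return lowercased alphabetic tokens that are not Danish stop words."""
    return [t.lower_ for t in prep_nlp(sentence) if t.is_alpha and not t.is_stop]


result = content_words("Du forbinder nok jul med pebernødder, risalamande og brunkager.")
print(result)
```

Unlike `noun_lemmatizer`, this variant keeps inflected forms and all parts of speech, which may or may not suit a given downstream task.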

### Getting noun phrases

In [None]:
text5 = "Jeg er den stolte ejer af to store røde biler."

doc = nlp(text5)

for chunk in doc.noun_chunks:
    print(f"> {chunk.text}")

### Leaner pipelines

The default processing pipeline includes a tagger, a lemmatizer, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.

We may choose to keep only some of the processing components in the pipeline.
This can speed up processing.

In this case, we only need NER, so we disable several steps that are not required for it.



In [None]:
texts = [text, text2, text3, text4, text5]

for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):

    print([(ent.text, ent.label_) for ent in doc.ents])
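
Component names differ between models, so it can help to inspect `nlp.pipe_names` before deciding what to disable. The sketch below also shows `select_pipes`, which temporarily keeps only the listed components. It assumes a small hand-built pipeline (`demo_nlp`, a hypothetical name) rather than `da_core_news_sm`.

```python
import spacy

# Assumption: a small hand-built pipeline stands in for da_core_news_sm
demo_nlp = spacy.blank("da")
demo_nlp.add_pipe("sentencizer")
ruler = demo_nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "LOC", "pattern": "Sydafrika"}])

print(demo_nlp.pipe_names)  # active components in processing order

# Temporarily keep only the listed components; the rest are restored on exit
with demo_nlp.select_pipes(enable=["entity_ruler"]):
    active_inside = list(demo_nlp.pipe_names)
    demo_doc = demo_nlp("Omikron er opdaget i Sydafrika.")
    ents = [(ent.text, ent.label_) for ent in demo_doc.ents]

print(active_inside, ents)
print(demo_nlp.pipe_names)  # full pipeline again
```

Listing what to enable, rather than what to disable, avoids having to know every component name in advance.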