<a href="https://colab.research.google.com/github/PeerChristensen/NLP-Demos/blob/main/SpaCy_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A quick introduction to SpaCy


### Getting started



Upgrading SpaCy is necessary for use with Colab. 
Here, we download a small Danish model. 

Other options include `da_core_news_lg` and a transformer model, `da_core_news_trf` (i.e. Maltehb/danish-bert-botxo, also available through Hugging Face).

More information about the Danish models is available [here](https://spacy.io/models/da).

In [None]:
!pip install --upgrade spacy
!python -m spacy download da_core_news_sm

import spacy
nlp = spacy.load("da_core_news_sm")

Collecting da-core-news-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/da_core_news_sm-3.2.0/da_core_news_sm-3.2.0-py3-none-any.whl (19.1 MB)
Installing collected packages: da-core-news-sm
Successfully installed da-core-news-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('da_core_news_sm')


In [None]:
text = "Mens pandemien er i vækst over store dele af Europa, er en ny og bekymrende coronavariant ved navn Omikron opdaget i Sydafrika."

doc = nlp(text)

### What's in the `doc` container?

The `doc` variable contains a sequence of tokens, each with linguistic annotations attached.

Here, we create and use a function to return a dataframe with some key info.

In [None]:
import pandas as pd

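def df_from_text(text):
  """Return one row per token with its lemma, POS tag and stop-word flag."""

  cols = ("text", "lemma", "POS", "POSexplained", "stopword")
  rows = []

  # Parse the text passed in instead of relying on the global `doc`
  doc = nlp(text)
  for word in doc:
    row = [word.text, word.lemma_, word.pos_, spacy.explain(word.pos_), word.is_stop]
    rows.append(row)

  return pd.DataFrame(rows, columns=cols)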

In [None]:
df = df_from_text(text)

df

### Extracting named entities

Most models can extract named **persons**, **locations** and **organizations**.

In [None]:
text2 = "Du forbinder nok jul med pebernødder, risalamande og brunkager, men hvordan smager julen i Bulgarien, Italien og Sverige?"

def get_entities(text):

  cols = ("entity", "label", "start", "end")
  rows = []

  doc = nlp(text)
  for ent in doc.ents:
    row = [ent.text, ent.label_, ent.start_char, ent.end_char]
    rows.append(row)

  return pd.DataFrame(rows, columns=cols)
    
get_entities(text2)

|   | entity | label | start | end |
|---|---|---|---|---|
| 0 | Bulgarien | LOC | 91 | 100 |
| 1 | Italien | LOC | 102 | 109 |
| 2 | Sverige | LOC | 113 | 120 |
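
The `start` and `end` columns come from `ent.start_char` and `ent.end_char`, which are character offsets into the original string (with `end` exclusive). As a minimal sketch, using only the values from the output above and no SpaCy calls, slicing `text2` with each pair should give back the entity text:

```python
# text2 and the (start, end) pairs are copied from the cell and output above
text2 = "Du forbinder nok jul med pebernødder, risalamande og brunkager, men hvordan smager julen i Bulgarien, Italien og Sverige?"

# Character offsets as reported by ent.start_char / ent.end_char
offsets = [(91, 100), (102, 109), (113, 120)]

# Slicing the original string with each pair recovers the entity text
entities = [text2[start:end] for start, end in offsets]
print(entities)  # ['Bulgarien', 'Italien', 'Sverige']
```

This is handy when you want to highlight or replace entities in the raw text rather than work with tokens.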


With displaCy, SpaCy's built-in visualizer, visualizing named entities within text is easy.

In [None]:
from spacy import displacy

text3 = "Det kommer næppe som en stor overraskelse, at Donald Trump endnu engang ikke har noget pænt at sige om hertuginde Meghan, der er gift med Storbritanniens prins Harry.  Den forhenværende amerikanske præsident har tidligere udtalt, at han »ikke er fan af Meghan«, og nu forklarer Donald Trump sig i et interview til den tidligere UKIP-formand, Nigel Farage.  I interviewet, der sendes i aften på britisk tv, beskylder Donald Trump hertuginde Meghan for at »mangle respekt« og for at »skade« dronning Elizabeth, skriver The Daily Mail. Donald Trump kommenterer også beskyldninger om, at hertuginden skulle være manipulerende over for sin mand. Hun skulle ifølge anklagerne være den direkte årsag til,  at hertugparret i begyndelsen af 2020 valgte at træde ud af den kongelige familie for i stedet at flytte USA."

doc = nlp(text3)

displacy.render(doc, style="ent", jupyter=True)

### Tokenization


For downstream tasks such as text classification or topic modelling, we might try different ways of preprocessing texts.

In this case, we create a function that only outputs lemmatized nouns transformed to lowercase.

In [None]:
def noun_lemmatizer(sentence):
    """Using SpaCy to lemmatize and extract nouns"""
    tokens = nlp(sentence)
    tokens = [word.lemma_.lower() for word in tokens if word.pos_ == "NOUN"]
    return tokens

In [None]:
text4 = "Du forbinder nok jul med pebernødder, risalamande og brunkager, men hvordan smager julen i Bulgarien, Italien og Sverige?"

noun_lemmatizer(text4)

['jul', 'pebernød', 'risalamande', 'jul']
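
A list of lemmas like this is easy to turn into simple features. The sketch below is one possible next step, not part of the original notebook: it counts lemma frequencies with `collections.Counter`, using the output above as hard-coded input so it runs without a model.

```python
from collections import Counter

# Lemmas copied from the noun_lemmatizer(text4) output above
lemmas = ['jul', 'pebernød', 'risalamande', 'jul']

# Count how often each noun lemma occurs
lemma_counts = Counter(lemmas)
print(lemma_counts.most_common(1))  # [('jul', 2)]
```

Counts like these could serve as a starting point for a bag-of-words representation in text classification or topic modelling.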

### Getting noun phrases

In [None]:
text5 = "Jeg er den stolte ejer af to store røde biler."

doc = nlp(text5)

for chunk in doc.noun_chunks:
    print(f"> {chunk.text}")

> Jeg
> den stolte ejer
> to store røde biler


### Leaner pipelines


We may choose to keep only some of the processing components in the pipeline.
This can help you process text faster.

In this case, we only want named entities, so we disable the components that NER does not need.



In [None]:
texts = [text, text2, text3, text4, text5]

for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):

    print([(ent.text, ent.label_) for ent in doc.ents])

[('Donald Trump', 'ORG'), ('Meghan', 'PER'), ('Storbritanniens', 'LOC'), ('Harry', 'PER'), ('amerikanske', 'MISC'), ('Meghan', 'LOC'), ('Donald Trump', 'ORG'), ('UKIP-formand', 'ORG'), ('Nigel Farage', 'PER'), ('britisk tv, beskylder', 'MISC'), ('Donald Trump', 'ORG'), ('Meghan', 'PER'), ('dronning Elizabeth', 'PER'), ('The Daily Mail', 'ORG'), ('USA', 'LOC')]
[('Bulgarien', 'LOC'), ('Italien', 'LOC'), ('Sverige', 'LOC')]
[('Donald Trump', 'ORG'), ('Meghan', 'PER'), ('Storbritanniens', 'LOC'), ('Harry', 'PER'), ('amerikanske', 'MISC'), ('Meghan', 'LOC'), ('Donald Trump', 'ORG'), ('UKIP-formand', 'ORG'), ('Nigel Farage', 'PER'), ('britisk tv, beskylder', 'MISC'), ('Donald Trump', 'ORG'), ('Meghan', 'PER'), ('dronning Elizabeth', 'PER'), ('The Daily Mail', 'ORG'), ('USA', 'LOC')]
[('Bulgarien', 'LOC'), ('Italien', 'LOC'), ('Sverige', 'LOC')]
[]
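
The printed lists are plain `(text, label)` tuples, so they are easy to post-process. As a rough sketch, assuming we only care about spotting names that received more than one label, the snippet below groups a hand-copied subset of the pairs printed above by entity text. The variable names are illustrative and not part of SpaCy.

```python
from collections import defaultdict

# A subset of the (text, label) pairs printed above for the Donald Trump article
ents = [
    ("Donald Trump", "ORG"), ("Meghan", "PER"), ("Harry", "PER"),
    ("Meghan", "LOC"), ("Donald Trump", "ORG"), ("Nigel Farage", "PER"),
    ("Meghan", "PER"), ("USA", "LOC"),
]

# Collect the distinct labels assigned to each entity text
labels_by_text = defaultdict(set)
for ent_text, label in ents:
    labels_by_text[ent_text].add(label)

# Keep only the names that received more than one distinct label
inconsistent = {t: sorted(labels) for t, labels in labels_by_text.items() if len(labels) > 1}
print(inconsistent)  # {'Meghan': ['LOC', 'PER']}
```

A check like this gives a quick impression of how consistent the small model is. The larger models listed under Getting started may behave differently.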
