# Using the SpaCy pipeline

This task demonstrates the tokenization capabilities of [SpaCy](https://spacy.io/) and serves as an introduction to the pipeline's capabilities in combination with [rule-based matching](https://spacy.io/usage/rule-based-matching).

Our goal is to process the demonstration text and to correct for some of its peculiarities: special pronunciation marks, widespread abbreviations, and foreign-language insertions.

It is mandatory to stick to SpaCy-based pipeline operations, so that our analysis can be reproduced by running the same pipeline on other texts, presumably coming from the same corpus.

## Our demonstration text

Original from the [Deutsche Sprache](https://de.wikipedia.org/wiki/Deutsche_Sprache) Wikipedia entry, with some modifications.

In [0]:
text= '''Die deutsche Sprache bzw. Deutsch ([dɔʏ̯t͡ʃ]; abgekürzt dt. oder dtsch.) ist eine westgermanische Sprache.

And this is an English sentence inbetween.

Ihr Sprachraum umfasst Deutschland, Österreich, die Deutschschweiz, Liechtenstein, Luxemburg, Ostbelgien, Südtirol, das Elsass und Lothringen sowie Nordschleswig. Außerdem ist sie eine Minderheitensprache in einigen europäischen und außereuropäischen Ländern, z. B. in Rumänien und Südafrika, sowie Nationalsprache im afrikanischen Namibia.'''

## Basic usage

After installing SpaCy, let us demonstrate its basic usage by analysing our text.

In [0]:
%%capture
!pip install tabulate
!pip install spacy

In [3]:
# Ok, we installed SpaCy, but do we have a model for German?
# Something has to be done here to get it!

!python -m spacy download de

✔ Download and installation successful
You can now load the model via spacy.load('de_core_news_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/de_core_news_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/de
You can now load the model via spacy.load('de')


In [0]:
# Please do the appropriate imports for SpaCy and its rule-based Matcher class!

import spacy
from spacy.matcher import Matcher
from spacy.symbols import ORTH, POS, NOUN
# Please don't forget to instantiate the language model that we will use later on for analysis

nlp = spacy.load("de")

In [0]:
# And please use the model to analyse the text from above!

doc=nlp(text)

### Helper functions for nice printout

We just define some helper functions for nice printout. Nothing to do here, except to observe how one can iterate over a document or a sentence, as well as the nice output of [Tabulate](https://bitbucket.org/astanin/python-tabulate/src/master/).

In [0]:
from tabulate import tabulate

def print_sentences(doc):
    for sentence in doc.sents:
        print(sentence,"\n")

def print_tokens_for_sentence(doc,sentence_num, stopwords=False):
    attribs=[]
    for token in list(doc.sents)[sentence_num]:
        if token.has_extension("is_lemma_stop"):
            if stopwords and token._.is_lemma_stop:
                pass
            else:
                attribs.append([token.text, token.lemma_, token.pos_])
        else:
            attribs.append([token.text, token.lemma_, token.pos_])
    print(tabulate(attribs))


In [7]:
print_sentences(doc)

Die deutsche Sprache bzw. Deutsch ([dɔʏ̯t͡ʃ]; abgekürzt dt. oder dtsch.) ist eine westgermanische Sprache.

 

And this is an 

English sentence inbetween.

 

Ihr Sprachraum umfasst Deutschland, Österreich, die Deutschschweiz, Liechtenstein, Luxemburg, Ostbelgien, Südtirol, das Elsass und Lothringen sowie Nordschleswig. 

Außerdem ist sie eine Minderheitensprache in einigen europäischen und außereuropäischen Ländern, z. B. in Rumänien und Südafrika, sowie Nationalsprache im afrikanischen Namibia. 



In [8]:
print_tokens_for_sentence(doc,-1)

-------------------  -------------------  -----
Außerdem             Außerdem             ADV
ist                  sein                 AUX
sie                  ich                  PRON
eine                 einen                DET
Minderheitensprache  Minderheitensprache  NOUN
in                   in                   ADP
einigen              einig                DET
europäischen         europäisch           ADJ
und                  und                  CONJ
außereuropäischen    außereuropäisch      ADJ
Ländern              Land                 NOUN
,                    ,                    PUNCT
z.                   z.                   ADP
B.                   B.                   NOUN
in                   in                   ADP
Rumänien             Rumänien             PROPN
und                  und                  CONJ
Südafrika            Südafrika            PROPN
,                    ,                    PUNCT
sowie                sowie                CONJ
Nationalsprache  

## Matching "zum Beispiel"

We are a bit frustrated that the standard analysis pipeline does not know that in German "z. B." is the abbreviation of "zum Beispiel" (just as "e.g." stands for "for example" in English), so we would like to correct this.

Our approach is to extend the pipeline with a matching step that replaces the `lemma` of "z. B." with the appropriate long form.

An **IMPORTANT** design principle of SpaCy is that one **always keeps the possibility to restore the original text**, so we are **NOT to modify `token.text`**. In the analysed form, we can do whatever we want.

It is typical to add layers to the pipeline which modify the analysis.

For our purposes, we will use rule based matching to achieve our goals.

A detailed description on rule based matching in SpaCy can be found [here](https://spacy.io/usage/rule-based-matching), or [here](https://medium.com/@ashiqgiga07/rule-based-matching-with-spacy-295b76ca2b68)
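
The non-destructive principle can be illustrated on a hand-built `Doc`. This is only a minimal sketch: the word list and the `spaces` flags are made up for the illustration, and no trained model is involved, so nothing here depends on the German model loaded above.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-built example tokens (an assumption for this sketch, not pipeline output).
words = ["Ländern", ",", "z.", "B.", "in", "Rumänien"]
# spaces[i] says whether token i is followed by a space in the original text.
spaces = [False, True, True, True, True, False]
doc = Doc(Vocab(), words=words, spaces=spaces)

original_text = doc.text

# Change only the analysis layer (the lemmas) ...
doc[2].lemma_ = "zum"
doc[3].lemma_ = "Beispiel"

# ... while the surface text stays fully recoverable.
print(doc.text)
print([(token.text, token.lemma_) for token in doc[2:4]])
```

Because only `lemma_` is written, `doc.text` and every `token.text` remain exactly what was read in.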

### Build the matcher

With the help of rule-based matching, we create a matcher that reacts to the exact presence of "z. B.". We then use this matcher to define a pipeline step that, after matching, replaces the lemmas of the tokens "z." and "B." with their fully written equivalents.

In [0]:
zb_matcher = Matcher(nlp.vocab) # Please instantiate a matcher with the appropriate parameters - think about all the words of the corpus...
pattern = [{"TEXT": "z."}, {"TEXT": "B."}]
zb_matcher.add("z.B.", None, pattern) # Please add an appropriate pattern to the matcher to match "z. B."

def zb_replacer(doc):
    matched_spans = []
    # Please use the matcher to get matches!
    matches = zb_matcher(doc)
    # Please iterate over the matches!
    for match_id, start, end in matches:
        span = doc[start:end] # get the span of text based on the match's coordinates!
        matched_spans.append(span)
        print("ZB MATCH!!!")

    # Please iterate over matched spans
    # And replace their lemmas with the appropriate ones!
    # Please observe that you don't have the IDs of the desired lemmas, just their string form.
    for span in matched_spans:
        span[0].lemma_ = "zum"
        span[1].lemma_ = "Beispiel"
    # Return the doc outside the loop, so that every match is processed
    # and the pipeline also gets the doc back when nothing matched.
    return doc

### Register it to the pipeline

After creating this processing step, we register it to be part of the pipeline and then run our analysis again.

In [0]:
# Please register the new zb_replacer to the pipeline!
# Think about, where to place it!

nlp.add_pipe(zb_replacer, first = True)
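
Why a component has to hand the document back can be seen in a deliberately simplified model of a pipeline. The sketch below uses plain Python only and is not SpaCy's implementation: tokens are dictionaries, and `run_pipeline`, `expand_zb`, `fill_missing_lemmas` and `forgetful` are hypothetical names invented for the illustration.

```python
def expand_zb(doc):
    # Toy counterpart of zb_replacer: only touches "z." followed by "B.".
    for first, second in zip(doc, doc[1:]):
        if first["text"] == "z." and second["text"] == "B.":
            first["lemma"], second["lemma"] = "zum", "Beispiel"
    return doc  # returned even when nothing matched


def fill_missing_lemmas(doc):
    # Toy later stage: supplies a lemma only where none has been set yet.
    for token in doc:
        token.setdefault("lemma", token["text"].lower())
    return doc


def forgetful(doc):
    # Deliberate bug: no return statement, so the caller receives None.
    pass


def run_pipeline(doc, components):
    # Each component receives the output of the previous one.
    for component in components:
        doc = component(doc)
        if doc is None:
            raise ValueError(f"{component.__name__} did not return the doc")
    return doc


tokens = [{"text": t} for t in ["Ländern", ",", "z.", "B.", "in", "Rumänien"]]
result = run_pipeline(tokens, [expand_zb, fill_missing_lemmas])
print([token["lemma"] for token in result])
# → ['ländern', ',', 'zum', 'Beispiel', 'in', 'rumänien']
```

Passing `[forgetful]` as the component list raises the `ValueError`, which is the toy equivalent of a pipeline step that forgets its `return doc`.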

### Re-do the analysis and observe results

In [11]:
doc=nlp(text)

ZB MATCH!!!


In [12]:
print_tokens_for_sentence(doc,-1)

-------------------  -------------------  -----
Außerdem             Außerdem             ADV
ist                  sein                 AUX
sie                  ich                  PRON
eine                 einen                DET
Minderheitensprache  Minderheitensprache  NOUN
in                   in                   ADP
einigen              einig                DET
europäischen         europäisch           ADJ
und                  und                  CONJ
außereuropäischen    außereuropäisch      ADJ
Ländern              Land                 NOUN
,                    ,                    PUNCT
z.                   zum                  ADP
B.                   Beispiel             NOUN
in                   in                   ADP
Rumänien             Rumänien             PROPN
und                  und                  CONJ
Südafrika            Südafrika            PROPN
,                    ,                    PUNCT
sowie                sowie                CONJ
Nationalsprache  

## What are those ugly pronunciation signs doing there?

OK, so far so good. Let's see what the problem is with the first sentence!

In [13]:
doc=nlp(text)

ZB MATCH!!!


In [14]:
print_tokens_for_sentence(doc,0)


---------------  ---------------  -----
Die              der              DET
deutsche         deutsch          ADJ
Sprache          Sprache          NOUN
bzw.             beziehungsweise  CONJ
Deutsch          Deutsch          NOUN
(                (                PUNCT
[                [                PROPN
dɔʏ̯t͡ʃ            dɔʏ̯t͡ʃ            PROPN
]                ]                NOUN
;                ;                PUNCT
abgekürzt        abkürzen         VERB
dt               dt               PRON
.                .                PUNCT
oder             oder             CONJ
dtsch            dtsch            ADJ
.                .                PUNCT
)                )                PUNCT
ist              sein             AUX
eine             einen            DET
westgermanische  westgermanische  ADJ
Sprache          Sprache          NOUN
.                .                PUNCT
                                  SPACE
---------------  ---------------  -----


As we can see, the poor pipeline cannot really cope with the pronunciation markings of the phonetic alphabet, and thus thinks that the signs represent a foreign proper noun.

We would like to remedy this, and since we expect further texts from the corpus to contain such inserted phonetics, we would like to match, merge and replace.

## Building up matcher for PRONUNCIATION

To be more specific, we again first build a matcher, this time one that targets the square-bracket markings around the pronunciation. The task is to match everything between square brackets, or more precisely: **everything that starts with an opening square bracket and finishes with ";"**.

This matcher can then be used to:

1. Merge the resulting matching `span` into one token
2. Replace the token's lemma with "PRONUNCIATION"

For this to be achievable, we first have to register "PRONUNCIATION" as part of the vocabulary, and moreover mark it as a ["stopword"](https://en.wikipedia.org/wiki/Stop_words). (More on SpaCy's stopword handling [here](https://medium.com/@makcedward/nlp-pipeline-stop-words-part-5-d6770df8a936).) See below.

In [0]:
# Please instantiate and build the matcher as before with the appropriate pattern!
# Make it so that the pattern will match ALL future pronunciations, not just the present one!
matcher = Matcher(nlp.vocab)
# "[" + one or more tokens that are not square brackets + "]" + ";"
pattern = [{"TEXT": "["}, {"TEXT": {"NOT_IN": ["[", "]"]}, "OP": "+"}, {"TEXT": "]"}, {"TEXT": ";"}]
matcher.add("PRONUNCIATION", None, pattern)


# We set the properties for the new word "PRONUNCIATION"
lex = nlp.vocab['PRONUNCIATION']
lex.is_oov = False
lex.is_stop = True

def pronunciation_replacer(doc):

    # Using the template above, please build a pronunciation replacer, that
    # 1. gets the matches
    # 2. merges them into one
    # 3. replaces their lemma with "PRONUNCIATION"
    # 4. sets its POS to "NOUN"
    matched_pro = []
    # Please use the matcher to get matches!
    matches = matcher(doc)
    # Please iterate over the matches!
    for match_id, start, end in matches:
        matched_pro.append(doc[start:end])
        print("MATCH!!!")

    # Please iterate over matched spans and merge each of them into one token!
    # Merging inside doc.retokenize() keeps the indices of the remaining spans valid.
    with doc.retokenize() as retokenizer:
        for span in matched_pro:
            retokenizer.merge(span, attrs={"LEMMA": "PRONUNCIATION", "POS": "NOUN"})
    return doc


nlp.add_pipe(pronunciation_replacer, after="zb_replacer") 
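
To check that a bracket pattern is independent of the concrete phonetic string, it can be tried on a hand-built `Doc` whose pronunciation spans several tokens. This is a sketch under assumptions: the word list is invented, no trained model is used, and the call `matcher.add(key, [pattern])` follows the newer `Matcher.add` signature (on older 2.x releases the call would be `add(key, None, pattern)` as in the cell above).

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Invented example with a two-token pronunciation between the brackets.
words = ["Deutsch", "(", "[", "ˈɛŋ", "lɪʃ", "]", ";", "abgekürzt", ")"]
spaces = [True, False, False, True, False, False, True, False, False]
doc = Doc(vocab, words=words, spaces=spaces)
text_before = doc.text

# "[" + one or more non-bracket tokens + "]" + ";"
pattern = [
    {"TEXT": "["},
    {"TEXT": {"NOT_IN": ["[", "]"]}, "OP": "+"},
    {"TEXT": "]"},
    {"TEXT": ";"},
]
matcher = Matcher(vocab)
matcher.add("PRONUNCIATION", [pattern])

# Collect the spans first, then merge them in one retokenization pass.
spans = [doc[start:end] for _, start, end in matcher(doc)]
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span, attrs={"LEMMA": "PRONUNCIATION"})

print([(token.text, token.lemma_) for token in doc])
```

Excluding both bracket characters from the middle element means that a match can only start at the innermost opening bracket, so the collected spans do not overlap.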

### Observing result

In [16]:
doc=nlp(text)
print_tokens_for_sentence(doc,0)

ZB MATCH!!!
MATCH!!!
---------------  ---------------  -----
Die              der              DET
deutsche         deutsch          ADJ
Sprache          Sprache          NOUN
bzw.             beziehungsweise  CONJ
Deutsch          Deutsch          NOUN
(                (                PUNCT
[dɔʏ̯t͡ʃ];         PRONUNCIATION    NOUN
abgekürzt        abkürzen         VERB
dt               dt               PRON
.                .                PUNCT
oder             oder             CONJ
dtsch            dtsch            ADJ
.                .                PUNCT
)                )                PUNCT
ist              sein             AUX
eine             einen            DET
westgermanische  westgermanische  ADJ
Sprache          Sprache          NOUN
.                .                PUNCT
                                  SPACE
---------------  ---------------  -----


In the future, we decide, we would not want to include the pronunciation tokens in our view. So we have to mark them as stopwords.

### Registering PRONUNCIATION as a stopword

Stopwords are typically words that do not contribute to the meaning of the sentence and are there only for syntactic reasons. For each language there is a loosely defined, evolving list of them. We will use and extend the German one in SpaCy.


In [0]:
# import stop words from GERMAN language data
from spacy.lang.de.stop_words import STOP_WORDS
# Add PRONUNCIATION to stopwords
STOP_WORDS.add("PRONUNCIATION")

However, since we can only manipulate the lemmas of the pronunciation markings, we have to let SpaCy know that, in contrast to the default behavior where stopwords are filtered on the `text` level, we would like to have a new token property that is based on `lemma`-level stopword filtering.

For this we will use extensions!

For more info please see [here](https://spacy.io/api/token#set_extension)!

In [0]:
from spacy.tokens import Token

# Please define a function (or lambda expression!) that checks if a Token, or its lower case form,
# OR its lemma string is contained in the stopword list above.
stop_words_getter = lambda token: token.is_stop or token.lower_ in STOP_WORDS or token.lemma_ in STOP_WORDS

# Set the above defined function as an extension for Token under the name "is_lemma_stop" as a getter!
Token.set_extension('is_lemma_stop', getter=stop_words_getter)
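
The difference between a text-level and a lemma-level check can be seen in isolation on a hand-built `Doc`. The sketch below is only an illustration under assumptions: `DEMO_STOP_LEMMAS` and the extension name `demo_is_lemma_stop` are hypothetical and chosen so that they do not clash with the `is_lemma_stop` extension registered above, and `force=True` is passed so that the cell can be re-run.

```python
from spacy.tokens import Doc, Token
from spacy.vocab import Vocab

# Hypothetical stop list that is checked against lemmas only.
DEMO_STOP_LEMMAS = {"PRONUNCIATION"}

# A getter extension is computed on access, so it always sees the current lemma.
Token.set_extension(
    "demo_is_lemma_stop",
    getter=lambda token: token.lemma_ in DEMO_STOP_LEMMAS,
    force=True,
)

# Hand-built tokens; the middle one stands for an already merged pronunciation.
doc = Doc(Vocab(), words=["Deutsch", "[dɔʏ̯t͡ʃ];", "ist"])
doc[1].lemma_ = "PRONUNCIATION"

flags = [token._.demo_is_lemma_stop for token in doc]
kept = [token.text for token in doc if not token._.demo_is_lemma_stop]
print(flags)
print(kept)
```

The token text `[dɔʏ̯t͡ʃ];` appears in no stop list, yet the token is filtered because the getter looks at its lemma.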

In [19]:
doc=nlp(text)

ZB MATCH!!!
MATCH!!!


In [20]:
print_tokens_for_sentence(doc,0, stopwords=True)

assert len(list(doc.sents)[0]) == 20

---------------  ---------------  -----
deutsche         deutsch          ADJ
Sprache          Sprache          NOUN
bzw.             beziehungsweise  CONJ
Deutsch          Deutsch          NOUN
(                (                PUNCT
abgekürzt        abkürzen         VERB
dt               dt               PRON
.                .                PUNCT
dtsch            dtsch            ADJ
.                .                PUNCT
)                )                PUNCT
westgermanische  westgermanische  ADJ
Sprache          Sprache          NOUN
.                .                PUNCT
                                  SPACE
---------------  ---------------  -----


## Language detection

We could also observe that there is some English text in between our nice German sentences. We would like to detect foreign sentences and, in later processing, ignore or skip them.

For this to be achievable, we need some language detection capabilities.

Luckily enough, we can make it part of our pipeline via [this extension](https://spacy.io/universe/project/spacy-langdetect).

### Standard installation

In [0]:
%%capture
!pip install spacy-langdetect

In [0]:
#Please import the language detector!
from spacy_langdetect import LanguageDetector

### Adding language detection to our pipeline

In [0]:
# Please register it to the pipeline as the final step of processing!
nlp.add_pipe(LanguageDetector(), last = True) 

### Observing results

In [24]:
doc = nlp(text)

ZB MATCH!!!
MATCH!!!


In [25]:
attribs = []
for sentence in doc.sents:
    attribs.append([list(sentence)[:5],"...", sentence._.language])
print(tabulate(attribs))

# Please observe how one accesses an extension!

-----------------------------------------------  ---  -----------------------------------------------
[Die, deutsche, Sprache, bzw., Deutsch]          ...  {'language': 'de', 'score': 0.9999969868093777}
[And, this, is, an]                              ...  {'language': 'en', 'score': 0.9999974943159319}
[English, sentence, inbetween, .,                ...  {'language': 'en', 'score': 0.7142846100859669}

]
[Ihr, Sprachraum, umfasst, Deutschland, ,]       ...  {'language': 'de', 'score': 0.999997090026703}
[Außerdem, ist, sie, eine, Minderheitensprache]  ...  {'language': 'de', 'score': 0.9999978812017776}
-----------------------------------------------  ---  -----------------------------------------------


## Creating final generator for cleaned text

Typically, for a later stage of NLP, we would like to have a generator-like function that allows us to access the corpus iteratively, albeit in its cleaned and encoded form. Integer encoding (as well as one-hot encoding) is a quite typical representation of text.

In this spirit, we would like to implement a generator that gives back an **array of lemmas OR lemma IDs for each sentence in the corpus, filtering out non-German sentences and punctuation / space marks**.

In [0]:
# Please implement a generator function that yields the text of the corpus sentence by sentence,
# depending on the parameters either as a list of strings or as a list of IDs.
# It should filter out non-German sentences
# as well as stopwords based on lemmas
# and punctuation and "space like" characters!

def sentence_generator(doc, ids=False):
  for sentence in doc.sents:
    out_sentence=[]
    if sentence._.language['language'] == 'de':
      for token in sentence:
        if not token._.is_lemma_stop and token.pos_ not in {"PUNCT","SPACE"}:
          if ids:
            out_sentence.append(token.lemma)
          else:
            out_sentence.append(token.lemma_)
      yield out_sentence
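
The integer IDs that the generator yields with `ids=True` are the hash values under which SpaCy stores the lemma strings, so they can be mapped back to strings through a string store. A minimal sketch of this round trip, using a standalone `StringStore` instead of `nlp.vocab.strings` and an invented list of lemmas:

```python
from spacy.strings import StringStore

strings = StringStore()

# Invented lemma list; "Sprache" occurs twice on purpose.
lemmas = ["deutsch", "Sprache", "Sprache"]

# Encoding: adding a string returns its integer hash.
ids = [strings.add(lemma) for lemma in lemmas]

# Decoding: looking up a hash returns the original string.
decoded = [strings[i] for i in ids]

print(ids)
print(decoded)
```

Identical lemmas always receive the same ID, which is what makes the encoded sentences usable as input for later, count-based or embedding-based stages.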


In [66]:
for i in sentence_generator(doc):
    print(i,"\n")
    
for i in sentence_generator(doc, ids=True):
    print(i,"\n")

['deutsch', 'Sprache', 'beziehungsweise', 'Deutsch', 'abkürzen', 'dt', 'dtsch', 'westgermanische', 'Sprache'] 

['Sprachraum', 'umfasst', 'Deutschland', 'Österreich', 'Deutschschweiz', 'Liechtenstein', 'Luxemburg', 'Ostbelgien', 'Südtirol', 'Elsass', 'Lothringen', 'Nordschleswig'] 

['Minderheitensprache', 'einig', 'europäisch', 'außereuropäisch', 'Land', 'Beispiel', 'Rumänien', 'Südafrika', 'Nationalsprache', 'afrikanisch', 'Namibia'] 

[5968319817064592459, 8431935777423264011, 16143637279988465102, 13347145995516113707, 12068858602874567954, 5135506797272647618, 2552743035069842888, 7654685629011980891, 8431935777423264011] 

[11854469037278879099, 7289263729939212449, 3491614202785599281, 16047064563126251420, 3469156011154928224, 10833980334450146958, 15216956676957942053, 14493420987399493547, 14425170055224073740, 14854674721094831692, 5682654018506929560, 10694615845175474381] 

[13853446524293058697, 2130075938147343825, 512110525822973470, 15751849195492229329, 73123320805871