![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/german/pretrained-german-models.ipynb)

## 0. Colab Setup

In [1]:
# Set up PySpark and Spark NLP (only needed when running on Colab)
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash


### German model specs

| Feature   | Description|
|:----------|:----------|
| **Lemma** | Trained by the **Lemmatizer** annotator on **lemmatization-lists** by `Michal Měchura` |
| **POS**   | Trained by the **PerceptronApproach** annotator on the [Universal Dependencies HDT treebank](https://universaldependencies.org/treebanks/de_hdt/index.html) |
| **NER**   | Trained by the **NerDLApproach** annotator with **Char CNNs - BiLSTM - CRF** and **GloVe Embeddings** on the **WikiNER** corpus; identifies `PER`, `LOC`, `ORG` and `MISC` entities |

In [2]:
import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline

from pyspark.ml import Pipeline
from pyspark.sql.types import StringType

In [3]:
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  4.2.6
Apache Spark version:  3.2.3


In [4]:
dfTest = spark.createDataFrame([
    "Die Anfänge der EU gehen auf die 1950er-Jahre zurück, als zunächst sechs Staaten die Europäische Wirtschaftsgemeinschaft (EWG) gründeten.",
    "Angela[1] Dorothea Merkel (* 17. Juli 1954 in Hamburg als Angela Dorothea Kasner) ist eine deutsche Politikerin (CDU)."
], StringType()).toDF("text")

### Pretrained Pipelines in German
#### explain_document_md (glove_6B_300)

In [5]:
pipeline_exdo_md = PretrainedPipeline('explain_document_md', 'de')

explain_document_md download started this may take some time.
Approx size to download 452.4 MB
[OK!]


In [6]:
pipeline_exdo_md.transform(dfTest).show(2, truncate=10)

+----------+----------+----------+----------+----------+----------+----------+----------+----------+
|      text|  document|  sentence|     token|     lemma|       pos|embeddings|       ner|  entities|
+----------+----------+----------+----------+----------+----------+----------+----------+----------+
|Die Anf...|[{docum...|[{docum...|[{token...|[{token...|[{pos, ...|[{word_...|[{named...|[{chunk...|
|Angela[...|[{docum...|[{docum...|[{token...|[{token...|[{pos, ...|[{word_...|[{named...|[{chunk...|
+----------+----------+----------+----------+----------+----------+----------+----------+----------+
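
Each annotation column in the table above is an array of structs, which is why the truncated cells start with `[{docum...`, `[{token...`, and so on: the first field of each struct is the annotator type. As a rough mental model, assuming the usual Spark NLP annotation fields (`annotatorType`, `begin`, `end`, `result`, `metadata`), the plain-Python sketch below builds token-like annotations by hand and projects out `result`, which is in spirit what a selection such as `token.result` does on the real DataFrame. The helper `whitespace_annotations` is hypothetical and only splits on whitespace; it is not how the Spark NLP `Tokenizer` works.

```python
import re


def whitespace_annotations(text, annotator_type="token"):
    """Hypothetical helper: one annotation-like dict per whitespace-separated piece."""
    annotations = []
    for match in re.finditer(r"\S+", text):
        annotations.append({
            "annotatorType": annotator_type,
            "begin": match.start(),
            "end": match.end() - 1,  # assumption: the end offset is inclusive
            "result": match.group(),
            "metadata": {},
        })
    return annotations


text = "Die Anfänge der EU"
token_column = whitespace_annotations(text)

# Keep only the "result" field of every annotation in the array
results = [annotation["result"] for annotation in token_column]
print(results)
print(token_column[1])
```

The other fields stay available on the real columns in the same way, for example `token.begin` or `token.metadata`.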



In [7]:
# Build the transformed DataFrame once and reuse it for each selection
result_exdo_md = pipeline_exdo_md.transform(dfTest)
result_exdo_md.select("lemma.result").show(2, truncate=70)
result_exdo_md.select("pos.result").show(2, truncate=70)
result_exdo_md.select("entities.result").show(2, truncate=70)

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[Die, Anfang, der, EU, gehen, auf, der, 1950er-Jahre, zurück,, als,...|
|[Angela[1], Dorothea, Merkel, (*, 17, ., Juli, 1954, in, Hamburg, a...|
+----------------------------------------------------------------------+

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[DET, NOUN, DET, PROPN, VERB, ADP, DET, NOUN, VERB, ADP, ADV, NUM, ...|
|[PROPN, PROPN, PROPN, X, NUM, PUNCT, NOUN, NUM, ADP, PROPN, ADP, PR...|
+----------------------------------------------------------------------+

+----------------------------------------------------------------------+
|                                                

#### entity_recognizer_md (glove_6B_300)

In [9]:
pipeline_entre_md = PretrainedPipeline('entity_recognizer_md', 'de')

entity_recognizer_md download started this may take some time.
Approx size to download 443.7 MB
[OK!]


In [10]:
pipeline_entre_md.transform(dfTest).show(2, truncate=10)

+----------+----------+----------+----------+----------+----------+----------+
|      text|  document|  sentence|     token|embeddings|       ner|  entities|
+----------+----------+----------+----------+----------+----------+----------+
|Die Anf...|[{docum...|[{docum...|[{token...|[{word_...|[{named...|[{chunk...|
|Angela[...|[{docum...|[{docum...|[{token...|[{word_...|[{named...|[{chunk...|
+----------+----------+----------+----------+----------+----------+----------+



In [11]:
# Build the transformed DataFrame once and reuse it for each selection
result_entre_md = pipeline_entre_md.transform(dfTest)
result_entre_md.select("token.result").show(2, truncate=70)
result_entre_md.select("ner.result").show(2, truncate=70)
result_entre_md.select("entities.result").show(2, truncate=70)

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[Die, Anfänge, der, EU, gehen, auf, die, 1950er-Jahre, zurück,, als...|
|[Angela[1], Dorothea, Merkel, (*, 17, ., Juli, 1954, in, Hamburg, a...|
+----------------------------------------------------------------------+

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[O, O, O, I-ORG, O, O, O, O, O, O, O, O, I-LOC, O, I-MISC, O, I-LOC...|
|[I-LOC, I-PER, I-PER, O, O, O, O, O, O, I-LOC, O, I-PER, I-PER, I-P...|
+----------------------------------------------------------------------+

+----------------------------------------------------------------------+
|                                                
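
The `ner` column holds one tag per token, while `entities` holds merged chunks. To make the relationship concrete, here is a minimal plain-Python sketch that groups consecutive tokens sharing an entity type. It uses the first ten tokens and tags of the second sentence as printed above; the function `group_entities` is a hypothetical illustration and should not be read as the logic Spark NLP itself uses to build `entities`.

```python
def group_entities(tokens, tags):
    """Merge consecutive tokens that share an entity type into (text, type) chunks."""
    chunks = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        # A token continues the open chunk if it has the same type and no B- prefix
        continues = tag != "O" and entity_type == current_type and prefix != "B"
        if continues:
            current_tokens.append(token)
            continue
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_type))
        if tag == "O":
            current_tokens, current_type = [], None
        else:
            current_tokens, current_type = [token], entity_type
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_type))
    return chunks


# First ten tokens and tags of the second sentence (entity_recognizer_md output)
tokens = ["Angela[1]", "Dorothea", "Merkel", "(*", "17", ".", "Juli", "1954", "in", "Hamburg"]
tags = ["I-LOC", "I-PER", "I-PER", "O", "O", "O", "O", "O", "O", "I-LOC"]

chunks = group_entities(tokens, tags)
print(chunks)
```

Because the footnote marker is glued to the first name (`Angela[1]`), the model tags that token separately from `Dorothea Merkel`, so the person name comes out as a two-token chunk in this sketch.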

#### entity_recognizer_lg (glove_840B_300)

In [12]:
pipeline_entre_lg = PretrainedPipeline('entity_recognizer_lg', 'de')

entity_recognizer_lg download started this may take some time.
Approx size to download 2.3 GB
[OK!]


In [13]:
pipeline_entre_lg.transform(dfTest).show(2, truncate=10)

+----------+----------+----------+----------+----------+----------+----------+
|      text|  document|  sentence|     token|embeddings|       ner|  entities|
+----------+----------+----------+----------+----------+----------+----------+
|Die Anf...|[{docum...|[{docum...|[{token...|[{word_...|[{named...|[{chunk...|
|Angela[...|[{docum...|[{docum...|[{token...|[{word_...|[{named...|[{chunk...|
+----------+----------+----------+----------+----------+----------+----------+



In [14]:
# Build the transformed DataFrame once and reuse it for each selection
result_entre_lg = pipeline_entre_lg.transform(dfTest)
result_entre_lg.select("token.result").show(2, truncate=70)
result_entre_lg.select("ner.result").show(2, truncate=70)
result_entre_lg.select("entities.result").show(2, truncate=70)

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[Die, Anfänge, der, EU, gehen, auf, die, 1950er-Jahre, zurück,, als...|
|[Angela[1], Dorothea, Merkel, (*, 17, ., Juli, 1954, in, Hamburg, a...|
+----------------------------------------------------------------------+

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[O, O, O, I-ORG, O, O, O, O, O, O, O, O, I-ORG, O, I-LOC, I-LOC, I-...|
|[O, I-PER, I-PER, O, O, O, O, O, O, I-LOC, O, I-PER, I-PER, I-PER, ...|
+----------------------------------------------------------------------+

+---------------------------------------------------------------------+
|                                                 
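
The `md` and `lg` pipelines tokenize the input identically but do not always agree on tags. A small plain-Python comparison, using only the first ten tokens and tags of the second sentence as printed in the two `ner.result` outputs, shows one way to list the disagreements. The helper `tag_differences` is hypothetical and not part of Spark NLP.

```python
def tag_differences(tokens, tags_a, tags_b):
    """Return (index, token, tag_a, tag_b) for every position where the tags differ."""
    return [
        (index, token, tag_a, tag_b)
        for index, (token, tag_a, tag_b) in enumerate(zip(tokens, tags_a, tags_b))
        if tag_a != tag_b
    ]


tokens = ["Angela[1]", "Dorothea", "Merkel", "(*", "17", ".", "Juli", "1954", "in", "Hamburg"]
tags_md = ["I-LOC", "I-PER", "I-PER", "O", "O", "O", "O", "O", "O", "I-LOC"]
tags_lg = ["O", "I-PER", "I-PER", "O", "O", "O", "O", "O", "O", "I-LOC"]

differences = tag_differences(tokens, tags_md, tags_lg)
agreement = 1 - len(differences) / len(tokens)

print(differences)
print(f"agreement on these tokens: {agreement:.0%}")
```

On this excerpt the only disagreement is the first token, which `entity_recognizer_md` tags as `I-LOC` and `entity_recognizer_lg` leaves as `O`.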

### Pretrained Models in German

In [15]:
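# Turn the raw "text" column into document annotations
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split each document into sentences
sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

# Split each sentence into tokens
token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

# Pretrained German lemmatizer
lemma = LemmatizerModel.pretrained('lemma', 'de')\
    .setInputCols(['token'])\
    .setOutputCol('lemma')

# Pretrained German part-of-speech tagger (UD HDT treebank)
pos = PerceptronModel.pretrained('pos_ud_hdt', 'de')\
    .setInputCols(['sentence', 'token'])\
    .setOutputCol('pos')

# Multilingual ('xx') GloVe word embeddings, consumed by the NER model below
embeddings = WordEmbeddingsModel.pretrained('glove_6B_300', 'xx')\
    .setInputCols(['sentence', 'token'])\
    .setOutputCol('embeddings')

# Pretrained German NER model trained on WikiNER
ner_model = NerDLModel.pretrained('wikiner_6B_300', 'de')\
    .setInputCols(['sentence', 'token', 'embeddings'])\
    .setOutputCol('ner')

# Assemble the stages so that every stage runs after the columns it reads exist
prediction_pipeline = Pipeline(stages=[
    document,
    sentence,
    token,
    lemma,
    pos,
    embeddings,
    ner_model
])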

lemma download started this may take some time.
Approximate size to download 4 MB
[OK!]
pos_ud_hdt download started this may take some time.
Approximate size to download 4.7 MB
[OK!]
glove_6B_300 download started this may take some time.
Approximate size to download 426.2 MB
[OK!]
wikiner_6B_300 download started this may take some time.
Approximate size to download 14.1 MB
[OK!]
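
Stage order matters in the pipeline above because each annotator reads columns that an earlier stage writes. The following plain-Python sketch mirrors the input and output column names from the cell above and checks that wiring. The function `missing_inputs` is a hypothetical helper for illustration, not a Spark NLP or Spark ML API.

```python
# (stage name, input columns, output column), copied from the pipeline definition
stages = [
    ("document", ["text"], "document"),
    ("sentence", ["document"], "sentence"),
    ("token", ["sentence"], "token"),
    ("lemma", ["token"], "lemma"),
    ("pos", ["sentence", "token"], "pos"),
    ("embeddings", ["sentence", "token"], "embeddings"),
    ("ner_model", ["sentence", "token", "embeddings"], "ner"),
]


def missing_inputs(stages, initial_columns):
    """List (stage, missing columns) for stages that read a column not yet produced."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        missing = [column for column in inputs if column not in available]
        if missing:
            problems.append((name, missing))
        available.add(output)
    return problems


# The order used in the notebook has no unmet dependencies
in_order = missing_inputs(stages, ["text"])
print(in_order)

# Moving the NER model ahead of the tokenizer and embeddings breaks the wiring
reordered = [stages[0], stages[1], stages[6], stages[2], stages[5]]
out_of_order = missing_inputs(reordered, ["text"])
print(out_of_order)
```

The same reasoning explains why `embeddings` has to precede `ner_model`: the NER stage lists `embeddings` among its input columns.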


In [16]:
prediction = prediction_pipeline.fit(dfTest).transform(dfTest)

In [17]:
prediction.select("lemma.result").show(2, truncate=70)
prediction.select("pos.result").show(2, truncate=70)
prediction.select("ner.result").show(2, truncate=70)

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[Die, Anfang, der, EU, gehen, auf, der, 1950er-Jahre, zurück, ,, al...|
|[Angela[1], Dorothea, Merkel, (*, 17, ., Juli, 1954, in, Hamburg, a...|
+----------------------------------------------------------------------+

+----------------------------------------------------------------------+
|                                                                result|
+----------------------------------------------------------------------+
|[DET, NOUN, DET, PROPN, VERB, ADP, DET, NOUN, ADP, PUNCT, ADP, ADV,...|
|[PROPN, PROPN, PROPN, X, NUM, PUNCT, NOUN, NUM, ADP, PROPN, ADP, PR...|
+----------------------------------------------------------------------+

+----------------------------------------------------------------------+
|                                                
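
To see what the lemmatizer actually changed, it can help to pair tokens with lemmas and keep only the positions that differ. The plain-Python sketch below does this for the first eight tokens of the first sentence, taking the tokens from the `token.result` output printed earlier and the lemmas from the `lemma.result` output above; it assumes the custom pipeline tokenizes these eight tokens the same way.

```python
tokens = ["Die", "Anfänge", "der", "EU", "gehen", "auf", "die", "1950er-Jahre"]
lemmas = ["Die", "Anfang", "der", "EU", "gehen", "auf", "der", "1950er-Jahre"]

# Keep only the (token, lemma) pairs where lemmatization changed the surface form
changed = [(token, lemma) for token, lemma in zip(tokens, lemmas) if token != lemma]
print(changed)
```

On this excerpt the plural noun `Anfänge` is reduced to `Anfang` and the article `die` is mapped to `der`, while the remaining tokens are left unchanged.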