![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Train a French POS Tagger with Spark NLP
### Based on the Universal Dependencies treebank `UD_French-GSD`


In [None]:
! pip install -q pyspark==3.1.2 spark-nlp



In [3]:
import sys
import time

# Spark ML and SQL
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.functions import array_contains
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Spark NLP
import sparknlp
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import DocumentAssembler, Finisher


### Let's create a Spark Session for our app

In [4]:
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  3.3.4
Apache Spark version:  3.1.2


Let's prepare our training dataset. Each token in the file is joined to its part-of-speech tag with an underscore, in the form `token_posTag` (for example `de_DET`). You can download the dataset from Amazon S3:

```
wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/fr/pos/UD_French/UD_French-GSD_2.3.txt -P /tmp
```
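To make the `token_posTag` format concrete, here is a minimal pure-Python sketch that splits one line into `(token, tag)` pairs. The helper name `parse_tagged_line` and the sample line are hypothetical and only illustrate the format; this is not how Spark NLP's reader is implemented. Splitting on the last underscore is an assumption made here so that a token containing an underscore keeps its full text.

```python
def parse_tagged_line(line, delimiter="_"):
    """Split a line of `token_TAG` items into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST delimiter, so a token that itself
        # contains the delimiter keeps everything before the final one.
        token, _, tag = item.rpartition(delimiter)
        pairs.append((token, tag))
    return pairs


# Illustrative line in the same format, not copied from the corpus file.
sample = "Le_DET château_NOUN est_AUX"
pairs = parse_tagged_line(sample)
print(pairs)
```

The delimiter argument plays the same role as the `'_'` passed to the dataset reader in the next cells.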

In [5]:
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/fr/pos/UD_French/UD_French-GSD_2.3.txt -P /tmp

--2021-12-02 06:01:21--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/fr/pos/UD_French/UD_French-GSD_2.3.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.250.94
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.250.94|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3565213 (3.4M) [text/plain]
Saving to: ‘/tmp/UD_French-GSD_2.3.txt’


2021-12-02 06:01:22 (21.1 MB/s) - ‘/tmp/UD_French-GSD_2.3.txt’ saved [3565213/3565213]



In [6]:
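from sparknlp.training import POS

# Read the token_posTag file: '_' separates each token from its tag,
# and the gold tags are stored in the 'tags' column.
training_data = POS().readDataset(spark, '/tmp/UD_French-GSD_2.3.txt', delimiter='_', outputPosCol='tags')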

In [7]:
training_data.show()

+--------------------+--------------------+--------------------+
|                text|            document|                tags|
+--------------------+--------------------+--------------------+
|Les commotions cé...|[{document, 0, 11...|[{pos, 0, 2, DET,...|
|L' œuvre est situ...|[{document, 0, 82...|[{pos, 0, 1, DET,...|
|Le comportement d...|[{document, 0, 18...|[{pos, 0, 1, DET,...|
|Toutefois , les f...|[{document, 0, 44...|[{pos, 0, 8, ADV,...|
|Ismene entre et a...|[{document, 0, 80...|[{pos, 0, 5, PROP...|
|je reviendrais av...|[{document, 0, 28...|[{pos, 0, 1, PRON...|
|Les forfaits comp...|[{document, 0, 30...|[{pos, 0, 2, DET,...|
|Il prévient que d...|[{document, 0, 99...|[{pos, 0, 1, PRON...|
|Ils tiraient à ba...|[{document, 0, 43...|[{pos, 0, 2, PRON...|
|Le château est en...|[{document, 0, 44...|[{pos, 0, 1, DET,...|
|En effet , la bir...|[{document, 0, 10...|[{pos, 0, 1, ADP,...|
|Le point final de...|[{document, 0, 15...|[{pos, 0, 1, DET,...|
|L' information gé...|[{d

In [8]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setExceptions(["jusqu'", "aujourd'hui", "États-Unis", "lui-même", "celui-ci", "c'est-à-dire", "celle-ci", "au-dessus", "etc.", "sud-est", "Royaume-Uni", "ceux-ci", "au-delà", "elle-même", "peut-être", "sud-ouest", "nord-ouest", "nord-est", "Etats-Unis", "Grande-Bretagne", "Pays-Bas", "eux-mêmes", "porte-parole", "Notre-Dame", "puisqu'", "week-end", "quelqu'un", "celles-ci", "chef-lieu"])\
    .setPrefixPattern("\\A([^\\s\\p{L}\\d\\$\\.#]*)")\
    .setSuffixPattern("([^\\s\\p{L}\\d]?)([^\\s\\p{L}\\d]*)\\z")\
    .setInfixPatterns([
      "([\\p{L}\\w]+'{1})",
      "([\\$#]?\\d+(?:[^\\s\\d]{1}\\d+)*)",
      "((?:\\p{L}\\.)+)",
      "((?:\\p{L}+[^\\s\\p{L}]{1})+\\p{L}+)",
      "([\\p{L}\\w]+)"
    ])

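# A single iteration keeps this demo fast; more iterations may improve accuracy.
# setPosCol points at the gold-tag column produced by POS().readDataset.
posTagger = PerceptronApproach() \
    .setNIterations(1) \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos") \
    .setPosCol("tags")
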
pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector, 
    tokenizer,
    posTagger
])

In [9]:
# Let's train our Pipeline by using our training dataset
model = pipeline.fit(training_data)

This is our test DataFrame, which contains a few French sentences. We will use the trained pipeline to transform these sentences and predict the part of speech of each token.

In [10]:
dfTest = spark.createDataFrame([
    "Je sens qu'entre ça et les films de médecins et scientifiques fous que nous avons déjà vus, nous pourrions emprunter un autre chemin pour l'origine.",
    "On pourra toujours parler à propos d'Averroès de décentrement du Sujet."
], StringType()).toDF("text")

In [11]:
predict = model.transform(dfTest)

In [12]:
predict.select("token.result", "pos.result").show(truncate=50)

+--------------------------------------------------+--------------------------------------------------+
|                                            result|                                            result|
+--------------------------------------------------+--------------------------------------------------+
|[Je, sens, qu'entre, ça, et, les, films, de, mé...|[PRON, NOUN, ADP, PRON, CCONJ, DET, NOUN, ADP, ...|
|[On, pourra, toujours, parler, à, propos, d'Ave...|[PRON, VERB, ADV, VERB, ADP, NOUN, NOUN, ADP, N...|
+--------------------------------------------------+--------------------------------------------------+
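
The two `result` columns are easier to read when each token sits next to its predicted tag. The sketch below does this in plain Python using only the tokens and tags visible in the truncated first row above. It assumes the two arrays are position-aligned, with one tag per token; the helper name `align` is hypothetical.

```python
# Tokens and tags visible in the first row of the output above
# (the row is truncated, so only the first eight items are used).
tokens = ["Je", "sens", "qu'entre", "ça", "et", "les", "films", "de"]
tags = ["PRON", "NOUN", "ADP", "PRON", "CCONJ", "DET", "NOUN", "ADP"]


def align(tokens, tags):
    """Pair each token with the tag at the same position."""
    if len(tokens) != len(tags):
        # Assumption: the tagger emits exactly one tag per token.
        raise ValueError("tokens and tags must have the same length")
    return list(zip(tokens, tags))


pairs = align(tokens, tags)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

On a real prediction, the same pairing could be applied to the collected `token.result` and `pos.result` arrays of each row.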

