![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Train a French POS Tagger with Spark NLP
### Based on Universal Dependencies `UD_French-GSD` version 2.3


In [1]:
import sys
import time

# Spark ML and SQL
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.functions import array_contains
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Spark NLP
import sparknlp
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import DocumentAssembler, Finisher


### Let's create a Spark Session for our app

In [2]:
spark = sparknlp.start()

print("Spark NLP version")
sparknlp.version()
print("Apache Spark version")
spark.version

Spark NLP version
2.2.1
Apache Spark version


'2.4.3'

Let's prepare our training dataset. Each token in it is paired with its part-of-speech tag in the form `token_posTag`, for example `de_DET`. You can download the dataset from Amazon S3:

```
wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/fr/pos/UD_French/UD_French-GSD_2.3.txt -P /tmp
```
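
To make the `token_posTag` format concrete, here is a minimal sketch in plain Python of how one such line can be split into tokens and tags. The helper `parse_tagged_line` and the sample line are hypothetical and only illustrate the format; they are not part of Spark NLP, and splitting on the last delimiter is an assumption made here so that tokens containing `_` stay intact.

```python
def parse_tagged_line(line, delimiter="_"):
    """Split a whitespace-separated line of token<delimiter>TAG pairs.

    Hypothetical helper for illustration only; Spark NLP's own reader
    is POS().readDataset, used later in this notebook.
    """
    tokens, tags = [], []
    for pair in line.split():
        # Assumption: the tag follows the LAST delimiter in each pair.
        token, tag = pair.rsplit(delimiter, 1)
        tokens.append(token)
        tags.append(tag)
    return tokens, tags


# Illustrative line in the same format as the training file (not copied from it).
sample = "Le_DET château_NOUN est_AUX"
tokens, tags = parse_tagged_line(sample)
print(" ".join(tokens))
print(tags)
```

The `delimiter` argument plays the same role as the `delimiter="_"` parameter passed to `POS().readDataset` further down.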

In [3]:
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/fr/pos/UD_French/UD_French-GSD_2.3.txt -P /tmp

--2019-07-15 11:42:45--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/fr/pos/UD_French/UD_French-GSD_2.3.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.38.22
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.38.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3565213 (3.4M) [text/plain]
Saving to: '/tmp/UD_French-GSD_2.3.txt'


2019-07-15 11:42:50 (875 KB/s) - '/tmp/UD_French-GSD_2.3.txt' saved [3565213/3565213]



In [9]:
from sparknlp.training import POS
training_data = POS().readDataset(
    spark=spark,
    path="/tmp/UD_French-GSD_2.3.txt",
    delimiter="_",
    outputPosCol="tags",
    outputDocumentCol="document",
    outputTextCol="text"
)

In [10]:
training_data.show()

+--------------------+--------------------+--------------------+
|                text|            document|                tags|
+--------------------+--------------------+--------------------+
|Les commotions cé...|[[document, 0, 11...|[[pos, 0, 2, DET,...|
|L' œuvre est situ...|[[document, 0, 82...|[[pos, 0, 1, DET,...|
|Le comportement d...|[[document, 0, 18...|[[pos, 0, 1, DET,...|
|Toutefois , les f...|[[document, 0, 44...|[[pos, 0, 8, ADV,...|
|Ismene entre et a...|[[document, 0, 80...|[[pos, 0, 5, PROP...|
|je reviendrais av...|[[document, 0, 28...|[[pos, 0, 1, PRON...|
|Les forfaits comp...|[[document, 0, 30...|[[pos, 0, 2, DET,...|
|Il prévient que d...|[[document, 0, 99...|[[pos, 0, 1, PRON...|
|Ils tiraient à ba...|[[document, 0, 43...|[[pos, 0, 2, PRON...|
|Le château est en...|[[document, 0, 44...|[[pos, 0, 1, DET,...|
|En effet , la bir...|[[document, 0, 10...|[[pos, 0, 1, ADP,...|
|Le point final de...|[[document, 0, 15...|[[pos, 0, 1, DET,...|
|L' information gé...|[[d

In [12]:
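# "document" is the default output column; it is set explicitly here
# so the next stage's input column is easy to trace.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")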

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setPrefixPattern("\\A([^\\s\\p{L}\\d\\$\\.#]*)")\
    .setSuffixPattern("([^\\s\\p{L}\\d]?)([^\\s\\p{L}\\d]*)\\z")\
    .setInfixPatterns([
        "([\\p{L}\\w]+'{1})",
        "([\\$#]?\\d+(?:[^\\s\\d]{1}\\d+)*)",
        "((?:\\p{L}\\.)+)",
        "((?:\\p{L}+[^\\s\\p{L}]{1})+\\p{L}+)",
        "([\\p{L}\\w]+)"
    ])

posTagger = PerceptronApproach() \
    .setNIterations(6) \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos") \
    .setPosCol("tags")
    
pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector, 
    tokenizer,
    posTagger
])

In [13]:
%%time

# Let's train our Pipeline by using our training dataset
model = pipeline.fit(training_data)

CPU times: user 24.1 ms, sys: 9.9 ms, total: 34 ms
Wall time: 1min 9s


This is our test DataFrame, which contains a few sentences in French. We will use the trained Pipeline to transform these sentences and predict the part of speech of each token.

In [14]:
dfTest = spark.createDataFrame([
    "Je sens qu'entre ça et les films de médecins et scientifiques fous que nous avons déjà vus, nous pourrions emprunter un autre chemin pour l'origine.",
    "On pourra toujours parler à propos d'Averroès de décentrement du Sujet."
], StringType()).toDF("text")

In [15]:
predict = model.transform(dfTest)

In [16]:
predict.select("token.result", "pos.result").show()

+--------------------+--------------------+
|              result|              result|
+--------------------+--------------------+
|[Je, sens, qu'ent...|[PRON, NOUN, ADJ,...|
|[On, pourra, touj...|[PRON, VERB, ADV,...|
+--------------------+--------------------+
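
Both columns above are named `result`, so the output is easier to read once each token is paired with its predicted tag. The sketch below does this in plain Python on lists that have already been collected to the driver. `pair_tokens_with_tags` is a hypothetical helper, and it assumes the tagger emits exactly one tag per token. The two token/tag pairs are the first two visible in the output above.

```python
def pair_tokens_with_tags(tokens, tags):
    """Zip tokens with their predicted tags (hypothetical helper).

    Assumes one tag per token; raises if the lengths differ.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return list(zip(tokens, tags))


# First two tokens and tags of the first test sentence, as shown above.
pairs = pair_tokens_with_tags(["Je", "sens"], ["PRON", "NOUN"])
for token, tag in pairs:
    print(f"{token}\t{tag}")

# With the DataFrame from this notebook, one way to feed the helper would be:
#   rows = predict.selectExpr("token.result AS tokens",
#                             "pos.result AS tags").collect()
#   for row in rows:
#       print(pair_tokens_with_tags(row.tokens, row.tags))
```

Aliasing the two columns (here as `tokens` and `tags`, names chosen for this example) also removes the duplicate `result` header.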

