![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Train a French POS Tagger with Spark NLP
### Based on Universal Dependencies `UD_French-GSD` version 2.3


### Spark `2.4` and Spark NLP `2.0.0`

In [17]:
import sys
sys.path.append('../../')
import time

# Spark ML and SQL
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.functions import array_contains
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Spark NLP
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import DocumentAssembler, Finisher


### Let's create a Spark Session for our app

In [3]:
spark = SparkSession.builder \
    .appName("Training_Perceptron")\
    .master("local[*]")\
    .config("spark.driver.memory","6G")\
    .config("spark.driver.maxResultSize", "2G")\
    .config("spark.jars", "/tmp/sparknlp.jar")\
    .config("spark.driver.extraClassPath", "/tmp/sparknlp.jar")\
    .config("spark.executor.extraClassPath", "/tmp/sparknlp.jar")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()

In [4]:
spark.version

'2.4.0'

Let's prepare our training dataset, in which each token is joined to its POS tag by an underscore (`token_posTag`, for example `de_DET`). You can download this dataset from Amazon S3:

```
wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/fr/pos/UD_French/UD_French-GSD_2.3.txt -P /tmp
```
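
To make the `token_posTag` layout concrete, here is a minimal sketch in plain Python that splits one line into `(token, tag)` pairs. The helper name `split_tagged_line` and the sample line are made up for illustration (only `de_DET` comes from the text above), and this is not necessarily how Spark NLP reads the file internally.

```python
def split_tagged_line(line, delimiter="_"):
    """Split a whitespace-separated line of token<delimiter>tag items into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST delimiter, so a token that itself
        # contains the delimiter keeps everything before the final one.
        token, _, tag = item.rpartition(delimiter)
        pairs.append((token, tag))
    return pairs


# Hypothetical sample line; only "de_DET" is taken from the text above.
sample = "Le_DET château_NOUN de_DET"
print(split_tagged_line(sample))
```

Splitting on the last delimiter is a design choice for this sketch: it keeps a token such as `a_b` intact when the line contains `a_b_X`.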

In [51]:
# Download the UD French-GSD data (CoNLL-U), already converted to token_tag format
import os
from pathlib import Path
import urllib.request

url = 'https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/fr/pos/UD_French/'
file_train='UD_French-GSD_2.3.txt'
full_path='/tmp/'+file_train

if not Path(full_path).is_file():   
    print('Downloading '+file_train)
    urllib.request.urlretrieve(url+file_train, full_path)

Downloading UD_French-GSD_2.3.txt


In [52]:
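from sparknlp.training import POS

# Read the token_tag file: '_' is the delimiter between token and tag,
# and 'tags' is the name of the output column holding the POS annotations
training_data = POS().readDataset(spark, '/tmp/UD_French-GSD_2.3.txt', '_', 'tags')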
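
Judging by the `training_data.show()` output in this notebook, each entry in `tags` carries an annotation type, a begin offset, an inclusive end offset, and the tag (for example `[pos, 0, 2, DET, ...` for `Les`). The sketch below reproduces that offset arithmetic in plain Python. `to_annotations` is a hypothetical helper, the `NOUN` tag in the sample is an assumption, and the real reader may differ in its details.

```python
def to_annotations(pairs):
    """Turn (token, tag) pairs into ("pos", begin, end, tag) tuples plus the joined text."""
    annotations = []
    cursor = 0
    for token, tag in pairs:
        begin = cursor
        end = begin + len(token) - 1  # inclusive end offset, so "Les" spans 0..2
        annotations.append(("pos", begin, end, tag))
        cursor = end + 2  # step over the single space that separates tokens
    text = " ".join(token for token, _ in pairs)
    return annotations, text


# The NOUN tag for "commotions" is assumed for illustration.
annotations, text = to_annotations([("Les", "DET"), ("commotions", "NOUN")])
print(text)
print(annotations)
```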

In [53]:
training_data.show()

+--------------------+--------------------+
|                tags|                text|
+--------------------+--------------------+
|[[pos, 0, 2, DET,...|Les commotions cé...|
|[[pos, 0, 1, DET,...|L' œuvre est situ...|
|[[pos, 0, 1, DET,...|Le comportement d...|
|[[pos, 0, 8, ADV,...|Toutefois , les f...|
|[[pos, 0, 5, PROP...|Ismene entre et a...|
|[[pos, 0, 1, PRON...|je reviendrais av...|
|[[pos, 0, 2, DET,...|Les forfaits comp...|
|[[pos, 0, 1, PRON...|Il prévient que d...|
|[[pos, 0, 2, PRON...|Ils tiraient à ba...|
|[[pos, 0, 1, DET,...|Le château est en...|
|[[pos, 0, 1, ADP,...|En effet , la bir...|
|[[pos, 0, 1, DET,...|Le point final de...|
|[[pos, 0, 1, DET,...|L' information gé...|
|[[pos, 0, 5, VERB...|Motivé par la cha...|
|[[pos, 0, 1, PRON...|Il exploitait un ...|
|[[pos, 0, 3, ADV,...|Plus tard dans la...|
|[[pos, 0, 2, PRON...|Ils deviennent al...|
|[[pos, 0, 1, DET,...|Le chevalier lui ...|
|[[pos, 0, 4, VERB...|Créée au cours du...|
|[[pos, 0, 1, PRON...|On ne peut

In [54]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .addInfixPattern("(\\w+)([^\\s\\p{L}]{1})+(\\w+)")\
    .addInfixPattern("(\\w+'{1})(\\w+)")\
    .addInfixPattern("(\\p{L}+)(n't\\b)")\
    .addInfixPattern("((?:\\p{L}\\.)+)")\
    .addInfixPattern("([\\$#]?\\d+(?:[^\\s\\d]{1}\\d+)*)")\
    .addInfixPattern("([\\p{L}\\w]+)")

posTagger = PerceptronApproach() \
    .setNIterations(6) \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos") \
    .setPosCol("tags")
    
pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector, 
    tokenizer,
    posTagger
])
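
The second infix pattern, `(\w+'{1})(\w+)`, appears intended to separate an elided word from the word that follows it, which is consistent with `qu'` showing up as its own token in this notebook's prediction output. As a rough illustration only, the snippet below applies the same pattern with Python's `re` module. Spark NLP evaluates these patterns as Java regular expressions, so treat this as an approximation; `split_elision` is a hypothetical helper.

```python
import re

# Same pattern as the second addInfixPattern call, written as a Python raw string.
ELISION = re.compile(r"(\w+'{1})(\w+)")


def split_elision(word):
    """Return the two captured groups if the whole word matches, else the word unchanged."""
    match = ELISION.fullmatch(word)
    return list(match.groups()) if match else [word]


for word in ["qu'entre", "d'Averroès", "films"]:
    print(word, "->", split_elision(word))
```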

In [55]:
%%time

# Let's train our Pipeline by using our training dataset
model = pipeline.fit(training_data)

CPU times: user 34.2 ms, sys: 12.3 ms, total: 46.5 ms
Wall time: 1min 57s


This is our test DataFrame, which contains a few sentences in French. We are going to use our trained pipeline to transform these sentences and predict each token's part of speech.

In [56]:
dfTest = spark.createDataFrame([
    "Je sens qu'entre ça et les films de médecins et scientifiques fous que nous avons déjà vus, nous pourrions emprunter un autre chemin pour l'origine.",
    "On pourra toujours parler à propos d'Averroès de décentrement du Sujet."
], StringType()).toDF("text")

In [57]:
predict = model.transform(dfTest)

In [58]:
predict.select("token.result", "pos.result").show()

+--------------------+--------------------+
|              result|              result|
+--------------------+--------------------+
|[Je, sens, qu', e...|[PRON, NOUN, PRON...|
|[On, pourra, touj...|[PRON, VERB, ADV,...|
+--------------------+--------------------+
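
The two `result` columns should be parallel arrays, so the tag at position `i` belongs to the token at position `i`. Here is a small sketch of pairing them in plain Python. The lists hold only the first three values visible in the truncated output above; in practice you would first bring the rows to the driver (for example with `collect()`).

```python
# First three values visible in the truncated output above.
tokens = ["Je", "sens", "qu'"]
tags = ["PRON", "NOUN", "PRON"]

# Guard against misaligned arrays before pairing.
if len(tokens) != len(tags):
    raise ValueError("tokens and tags must have the same length")

tagged = list(zip(tokens, tags))
for token, tag in tagged:
    print(f"{token}\t{tag}")
```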

