# **RNNTagger (tagger)**

### **1.   Connect to Drive to access the RNNTagger folder.**

1. Upload the unzipped **RNNTagger** (tagger) folder to your GDrive.
  (https://www.cis.uni-muenchen.de/~schmid/tools/RNNTagger/data/RNNTagger-1.2.1.zip)
2. Upload the files to be tagged into a folder inside the **RNNTagger/** folder.
3. Set the path to the GDrive **RNNTagger/** folder.
4. Mount your GDrive in Google Colab.

In [None]:
from google.colab import drive
import os

rnntagger_path = 'gdrive/My Drive/RNNTagger(tagger)'
drive.mount('/content/gdrive')
os.chdir(rnntagger_path)
!pwd

Mounted at /content/gdrive
/content/gdrive/My Drive/RNNTagger(tagger)
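
Once the drive is mounted, it can help to confirm that the uploaded folder has the layout the later cells rely on. The sketch below is not part of the original notebook: `missing_entries` is a hypothetical helper, and the two expected entries are simply the paths used further down (`cmd/rnn-tagger-old-french.sh` and `scripts/lemma-lookup.pl`).

```python
import os

# Paths used by later cells of this notebook, relative to the RNNTagger folder.
EXPECTED = ["cmd/rnn-tagger-old-french.sh", "scripts/lemma-lookup.pl"]

def missing_entries(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    return [rel for rel in expected
            if not os.path.exists(os.path.join(root, rel))]

# Report problems for the current working directory, if any.
missing = missing_entries(os.getcwd())
if missing:
    print("Missing from the RNNTagger folder:", missing)
```

An empty result suggests the folder was uploaded unzipped and that `rnntagger_path` points at the right place.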


* Check PyTorch and CUDA (required)

In [None]:
import torch
print(f"Torch version : {torch.__version__}")
print(f"Device name : {torch.cuda.get_device_name(0)}")
print(f"Current path : {os.getcwd()}")

Torch version : 1.7.0+cu101
Device name : Tesla T4
Current path : /content/gdrive/My Drive/RNNTagger(tagger)
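
The check above fails with an error when no GPU is attached to the runtime, because `torch.cuda.get_device_name(0)` needs a CUDA device. A more forgiving variant, sketched here with a hypothetical `describe_device` helper, tests `torch.cuda.is_available()` first and reports the situation instead of raising.

```python
def describe_device():
    """Return a short description of the compute device PyTorch would use."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # Safe to query the device name only when CUDA is available.
        return f"GPU: {torch.cuda.get_device_name(0)}"
    return "CPU only (no CUDA device visible)"

print(describe_device())
```

On Colab, a GPU can be requested via Runtime > Change runtime type.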


### **2. Annotation with RNNTagger**

* Set permissions on *lemma-lookup.pl*

In [None]:
!chmod 755 -R ./scripts/lemma-lookup.pl

In [None]:
from IPython.display import display, HTML
import pandas as pd
import subprocess
import glob
import csv
import os, sys

def getTokens():
  """Save tokens column for each input corpus file
  """
  for filename in all_files:
      df = pd.read_csv(filename, sep="\t", encoding="utf8", names=["token","nan_lemme","nan_tag","src"], quoting=csv.QUOTE_NONE)
      display(HTML(df.head().to_html())) # preview of the file
      tokenscol = df["token"]
      name = filename.replace(files_extension,"_tokens.csv")
      tokenscol.to_csv(name, encoding="utf8", sep="\t", index=False, header=None)
      print(filename,"\n")

def tagFiles():
  """Tag tokens files
  """
  all_files = glob.glob(path_tsv + "/*_tokens.csv")
  print(f"Nombre de fichiers à étiquetter : {len(all_files)}\n")
  if not os.path.exists('tagged_files'):
    os.makedirs('tagged_files')
  nbFile = 0
  for file in all_files:
    file = file.replace("/","//")
    output_name = file.split("//")[-1].replace("tokens.csv","_tagged(rnntagger).csv")
    command = "bash cmd/rnn-tagger-old-french.sh '"+file+"' > '"+"tagged_files//"+output_name+"'"
    nbFile += 1
    print(nbFile)
    print(command)
    !{command} 
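
`tagFiles` builds its shell command by concatenating strings and wrapping paths in single quotes, which would break on a file name that itself contains a single quote. As a hedged alternative, the sketch below uses the standard library's `shlex.quote` and `os.path` to build the same kind of command. `build_tag_command` is a hypothetical helper, not part of the notebook. It produces single slashes and a single underscore before `tagged`, so its output names differ slightly from those of `tagFiles`.

```python
import os
import shlex

def build_tag_command(token_file, out_dir="tagged_files",
                      script="cmd/rnn-tagger-old-french.sh"):
    """Build a shell command that tags token_file and redirects to out_dir."""
    base = os.path.basename(token_file)
    suffix = "_tokens.csv"
    # Strip the "_tokens.csv" suffix if present, else just the extension.
    stem = base[:-len(suffix)] if base.endswith(suffix) else os.path.splitext(base)[0]
    out_path = os.path.join(out_dir, stem + "_tagged(rnntagger).csv")
    # shlex.quote escapes spaces, parentheses and quotes for the shell.
    return f"bash {shlex.quote(script)} {shlex.quote(token_file)} > {shlex.quote(out_path)}"

print(build_tag_command("tsv_files/thebes2_tokens.csv"))
```

The resulting string can be passed to `!{command}` exactly as in `tagFiles`.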

* Set the path to the corpus folder
* Set the file extension

In [None]:
path_tsv = 'tsv_files' # folder containing the .tsv files
files_extension = ".tsv" # file extension

all_files = glob.glob(path_tsv + "/*"+ files_extension)
# all_files = all_files[0:2] # test with the first 2 files
print("Nombre de fichiers : ", len(all_files))

Nombre de fichiers :  3


In [None]:
getTokens()
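
For readers who want to see what this call does to each file without pandas, the sketch below reproduces the core step using plain string handling only: it keeps the first tab-separated field of every non-empty line. `first_column` is a hypothetical helper, and the two sample rows are made up for illustration (their tokens, lemmas and tags are borrowed from the training log later in this notebook, and `demo` stands in for the source column).

```python
import io

def first_column(lines, sep="\t"):
    """Yield the first sep-separated field of each non-empty line."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            # No quote handling, mirroring quoting=csv.QUOTE_NONE in getTokens.
            yield line.split(sep)[0]

# Illustrative rows in the token / lemma / tag / source layout.
sample = io.StringIO("conme\tcomme\tCONsub\tdemo\nsaisi\tsaisir\tVERppe\tdemo\n")
print(list(first_column(sample)))  # ['conme', 'saisi']
```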

In [None]:
tagFiles()

Nombre de fichiers à étiquetter : 12*n
1
bash cmd/rnn-tagger-old-french.sh 'tsv_files//thebes2_tokens.csv' > 'tagged_files//thebes2__tagged(rnntagger).csv'
2
bash cmd/rnn-tagger-old-french.sh 'tsv_files//CommPsia1a_tokens.csv' > 'tagged_files//CommPsia1a__tagged(rnntagger).csv'
3
bash cmd/rnn-tagger-old-french.sh 'tsv_files//strasbBfm_tokens.csv' > 'tagged_files//strasbBfm__tagged(rnntagger).csv'
4
bash cmd/rnn-tagger-old-french.sh 'tsv_files//saintre_tokens.csv' > 'tagged_files//saintre__tagged(rnntagger).csv'


### To tag a single file, run:

In [None]:
%%shell
bash cmd/rnn-tagger-old-french.sh 'tsv_files//tokens//Berin1_tokens.csv' > 'Berin1__tagged(rnntagger).csv'







---



# **RNNTagger (trainer)**

0. Google Colab menu: Runtime > Reset the runtime (*Exécution > Réinitialiser l'environnement d'exécution* in the French interface).

1.   Upload the RNNTagger (trainer) folder into your GDrive.
2.   Modify the following files:
 *   ***Tagger-Data/prepare-data.sh***: replace the BFMGOLD and BFMGOLDLEM corpus paths (Sharedocs: right-click the .zip > Share > Copy link).
 *   ***train-tagger.sh***: change *rnn-train.py* on line 46 to *PyRNN/rnn-train.py*. Otherwise, the bash script won't find the training Python script.
 *   ***train-tagger.sh***: check that the first line matches the Google Colab Python path (*#env/python*). Otherwise, use *#!/usr/bin/python3*.
3. Set the RNNTagger folder path, then mount your Drive:



In [None]:
from google.colab import drive
import os
rnntrainer_path = 'gdrive/My Drive/RNNTagger(trainer)' # RNNTagger training folder in GDrive
drive.mount('/content/gdrive', force_remount=True)
os.chdir(rnntrainer_path)
!pwd

Mounted at /content/gdrive
/content/gdrive/My Drive/RNNTagger(trainer)


4. Install the utf8::all module for Perl:

In [None]:
!cpan utf8::all

5. Go to the **/Tagger-Data** folder and run *prepare-data.sh* (POS tags):

In [None]:
os.chdir('Tagger-Data')
!chmod 755 -R ./split-corpus.pl
!chmod 755 -R ./extract-data.pl
!bash prepare-data.sh
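
The `chmod` calls in this step can also be expressed in Python, which is convenient when the list of scripts grows. This is a sketch under assumptions: `make_executable` is a hypothetical helper, and unlike `chmod 755` it only adds the execute bits and leaves the other permission bits as they are.

```python
import os
import stat

def make_executable(path):
    """Add execute permission for user, group and others to path."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

# Scripts named in this step; skipped silently if not in the current folder.
for script in ["split-corpus.pl", "extract-data.pl"]:
    if os.path.exists(script):
        make_executable(script)
```

The `-R` flag in the cell above matters only for directories, so it has no extra effect on single script files.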

6. Go to **/Lemmatizer-Data** and run *prepare-data.sh* (lemmas):

In [None]:
%cd ..
!pwd

/content/gdrive/My Drive/RNNTagger(trainer)
/content/gdrive/My Drive/RNNTagger(trainer)


In [None]:
os.chdir('Lemmatizer-Data')
!chmod 755 -R ./prepare-data.pl
!chmod 755 -R ./split.pl
!chmod 755 -R ./filter.pl
!chmod 755 -R ./make-lex.pl
!bash prepare-data.sh

900000

In [None]:
%cd ..
!pwd

/content/gdrive/My Drive/RNNTagger(trainer)
/content/gdrive/My Drive/RNNTagger(trainer)


## Train tagger

1.   In ***train-tagger.sh***, edit line 46 (***rnn-train.py*** > ***PyRNN/rnn-train.py***). Without the path, the .py script will not be found.
2.   In ***rnn-train.py***: check that ***#!/usr/bin/python3*** is specified.
3.   Check that the Python execution path is correct.
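
The edit described in step 1 can be applied to the script text programmatically rather than by hand. The sketch below is one possible way to do it. `add_script_prefix` is a hypothetical helper, and the example line is made up, since the actual content of line 46 is not shown in this notebook. The pattern only touches a bare `rnn-train.py`, so running it twice does not add the prefix twice.

```python
import re

# A bare "rnn-train.py": not preceded by a word character, "/", "." or "-".
BARE_SCRIPT = re.compile(r"(?<![\w/.-])rnn-train\.py")

def add_script_prefix(text, prefix="PyRNN/"):
    """Prefix every bare rnn-train.py reference in text with prefix."""
    return BARE_SCRIPT.sub(prefix + "rnn-train.py", text)

# Made-up example line, for illustration only.
print(add_script_prefix("rnn-train.py data params"))
```

To apply it, read `train-tagger.sh`, pass its text through the function, and write the result back; keeping a backup copy first seems prudent.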



In [None]:
# ! echo $PYTHONPATH
# !chmod 755 -R ./train-tagger.sh
# # import os
# os.environ['PYTHONPATH'] = 'usr/bin/python3'
!echo $PYTHONPATH

/env/python
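
Note that `PYTHONPATH` lists extra directories that Python searches for modules; it is not the location of the interpreter. The value printed above therefore does not show which `python3` a shebang line would run. The sketch below, with a hypothetical `interpreter_report` helper, prints the interpreter actually running the notebook and the `python3` found on `PATH` alongside it.

```python
import os
import shutil
import sys

def interpreter_report():
    """Collect where Python runs from and where it searches for modules."""
    return {
        "running interpreter": sys.executable,
        "python3 on PATH": shutil.which("python3"),
        "PYTHONPATH": os.environ.get("PYTHONPATH", ""),
    }

for key, value in interpreter_report().items():
    print(f"{key}: {value}")
```

If `python3 on PATH` matches the shebang in the training script (for example `/usr/bin/python3`), the script should start with the expected interpreter.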


In [None]:
### 4h of training (tags)
import os
print(os.getcwd())
!chmod u+x PyRNN/rnn-train.py
!bash train-tagger.sh Tagger-Training

import subprocess
# subprocess.call(["train-tagger.sh", "Tagger-Training"], env={"PATH": ""})
# subprocess.call(["bash", "train-lemmatizer.sh", "Lemmatizer-Training"])

## Train lemmatizer

1.   Same procedure: specify the path ***PyNMT/nmt-train.py*** in ***train-lemmatizer.sh***.
2.   Check that the python3 path is correct.



In [None]:
import os
print(os.getcwd())
!chmod u+x PyNMT/nmt-train.py
# import os
# os.environ['PYTHONPATH'] = 'usr/bin/python3'

/content/gdrive/My Drive/RNNTagger(trainer)


In [None]:
# !CUDA_VISIBLE_DEVICES=1 python3 PyNMT/nmt-train.py
# import torch
# torch.cuda.device_count()
# torch.cuda.set_device(0)

!bash ./train-lemmatizer.sh Lemmatizer-Training


translation examples
src: p a l u d ## N O M c o m
ref: p a l u d
tgt: p a l u d

src: a l u e t ## N O M c o m
ref: a l l e u
tgt: a l l e u

src: G r i t e ## N O M p r o
ref: G r i t e
tgt: G r i t e

Training Loss: 0.13670659136082977
Evaluation on dev data
storing parameters
translation examples
src: s a i s i ## V E R p p e
ref: s a i s i r
tgt: s a i s i r

src: c o n m e ## C O N s u b
ref: c o m m e
tgt: c o m m e

src: j a u d e ## N O M c o m
ref: g u i l d e
tgt: g u i l d e

Training Loss: 0.02435993800293654
Evaluation on dev data
storing parameters
translation examples
src: f i k i é ## V E R p p e
ref: f i c h e r
tgt: f i c h e r

src: D I E N T ## V E R c j g
ref: d i r e
tgt: d i r e

src: r e o n t ## A D J q u a
ref: r o n d
tgt: r o n d

Training Loss: 0.013831932018103543
Evaluation on dev data
translation examples
src: d e v r a ## V E R c j g
ref: d e v o i r
tgt: d e v o i r

src: q u e r t ## V E R c j g
ref: q u é r i r
tgt: q u é r i r

src: j u g e r ## V 