<a href="https://colab.research.google.com/github/CristinaGHolgado/old-french-with-pie-rnntagger/blob/master/old_french_pie.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Install the library**

#### **Install Pie (v0.3.6)**


In [None]:
pip install nlp-pie==0.3.6

#### **Set the path to** /pie

In [2]:
# In Google Colab with Python 3.7 the path should be '/usr/local/lib/python3.7/dist-packages/pie'.
# If your environment differs, find the installed pie package folder and set that path instead.
import os

def set_path_pie(path):
  """ Change the working directory to the Pie folder (needed for training and tagging)
  """
  # Do nothing if the working directory is already the pie folder
  if not os.getcwd().endswith('/pie'):
    os.chdir(path)
    print(f"Path set to : {os.getcwd()}")

In [5]:
pie_path = '/usr/local/lib/python3.7/dist-packages/pie'
set_path_pie(pie_path)

Path set to : /usr/local/lib/python3.7/dist-packages/pie
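If the Python version or environment differs, the folder can be looked up instead of typed by hand. The following is a minimal sketch, assuming the package installed by `nlp-pie` is importable under the name `pie` (as the path above suggests); the helper `find_package_dir` is hypothetical and not part of Pie.

```python
import importlib.util
import os

def find_package_dir(name):
    """Return the installation folder of an importable package, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    # For a regular package, spec.origin points to <folder>/__init__.py
    return os.path.dirname(spec.origin)

# "pie" is the assumed import name; prints None if the package is not installed
print(find_package_dir("pie"))
```

The returned folder could then be passed to `set_path_pie` in place of the hard-coded path.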


## **Training**

#### **Set custom training parameters**

In [None]:
## Modify parameters & options if necessary (*)

defaultSettingsFile = "default_settings.json" # // name or path of the .json file with the training parameters *

data_to_train = 'pos' # // task to train: name of the training data column *
modelname = "oldfrench_model" # // output model name *
inputpath = "token_tags_corpus.csv" # // path or filename of the input training data *
colnames = f'["token", "{data_to_train}"]' # // column names in the training data

# Settings file written by setParams(); defined after data_to_train because it uses it
out_json = defaultSettingsFile.split(".")[0] + "_" + data_to_train + ".json"

customized_train_params = True # // if True, the following parameters will be used * :
nb_epoch = 50 # *
checks_per_epoch = 0  # *
batch_size = 2  # *
lower_opt = 'false' # // lowercase target tokens ('true' or 'false', as written in the JSON file) *
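The name of the generated settings file is derived from the default settings file and the task name. As a small sketch of that naming rule (the helper `settings_name` is hypothetical; it uses `os.path.splitext` instead of `split(".")`, so a path such as `./cfg/default_settings.json` is not cut at its first dot):

```python
import os

def settings_name(settings_file, task):
    """Insert '_<task>' before the extension of the settings file name."""
    root, ext = os.path.splitext(settings_file)
    return f"{root}_{task}{ext}"

print(settings_name("default_settings.json", "pos"))  # default_settings_pos.json
```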

In [None]:
import re

def setParams():
  """ Write a copy of the default settings file (out_json) with the custom training parameters
  """
  # Read the default settings line by line and write the edited lines to out_json
  with open(defaultSettingsFile, "r") as source_params, open(out_json, "w") as out:
    for line in source_params:
      if '"modelname"' in line:
        line = line.replace('"model"', f'"{modelname}"')
        print(line)
      if '"input_path"' in line:
        line = line.replace('""', f'"{inputpath}"') # path (unix-like expression) to files with training data [default_settings.json]
        print(line)
      if '"tasks_order":' in line:
        line = line.replace('["lemma", "pos"]', colnames) # expected order of tasks for tabreader if no header [default_settings.json]
        print(line)
      if '"name": "lemma"' in line:
        line = line.replace('"lemma"', f'"{data_to_train}"')
        print(line)

      if customized_train_params:
        # Replace the number before the comma; raw strings avoid invalid-escape warnings
        if '"epochs"' in line:
          line = re.sub(r'[0-9]+,', f"{nb_epoch},", line)
          print(line)
        if '"batch_size"' in line:
          line = re.sub(r'[0-9]+,', f"{batch_size},", line)
          print(line)
        if '"checks_per_epoch"' in line:
          line = re.sub(r'[0-9]+,', f"{checks_per_epoch},", line)
          print(line)
        if '"lower"' in line:
          line = line.replace('true', lower_opt)
          print(line)

      out.write(line)

In [None]:
print(f"Parameters : \n{out_json}") # Display parameters
print()
setParams()
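The numeric settings are rewritten with a regular expression substitution on the matching line. The sketch below shows that substitution on a single made-up line; the sample text is an assumption for illustration and is not copied from Pie's `default_settings.json`.

```python
import re

# Hypothetical settings line: a number followed by a comma, then a comment
sample = '  "epochs": 100, // number of epochs\n'

# Replace the digits that precede a comma with the new value
updated = re.sub(r'[0-9]+,', '50,', sample)
print(updated, end="")
```

Requiring at least one digit (`+`) means a bare comma elsewhere on the line, for example inside a comment, is left untouched; with `*` such a comma would also be rewritten.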

####  **Run training & download model after training**



In [None]:
# Run after writing the training parameters (setParams() above) and uploading the training data.
# out_json is the generated settings file (default_settings_pos.json with the values above).
!pie train {out_json}

In [None]:
from google.colab import files
import glob
lastmodel = modelname + "*.tar"
get_model = "".join(glob.glob(lastmodel)) # assumes exactly one model file matches the pattern
files.download(get_model) # automatically download the model after training
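Joining the glob results only yields a valid path when exactly one model file matches. If several training runs have left more than one `.tar` file, one option is to pick the most recently modified match. A minimal sketch, with a hypothetical helper `latest_match` that is not part of Pie or Colab:

```python
import glob
import os

def latest_match(pattern):
    """Return the most recently modified path matching pattern, or None if nothing matches."""
    matches = glob.glob(pattern)
    if not matches:
        return None
    return max(matches, key=os.path.getmtime)

# Example with the naming used above; prints None if no model has been saved yet
print(latest_match("oldfrench_model*.tar"))
```

The result could be passed to `files.download` after checking that it is not `None`.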

## **Tagging**

#### **Log into your Drive to choose the files to be tagged**

In [None]:
from google.colab import drive
import os
drive.mount('/content/gdrive')

Mounted at /content/gdrive


#### **Modify *scripts/tag_pipe.py* to skip tokenization**

In [None]:
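# Line added to tag_pipe.py so that each input line is kept as a single token
PATCH_LINE = "line = [' '.join(item for item in line)]"

def noTokenizing():
  """ Write scripts/tag_pipe2.py, a patched copy of scripts/tag_pipe.py
  """
  with open("scripts/tag_pipe.py","r") as py, open("scripts/tag_pipe2.py","w") as out_py:
    for l in py:
      if "line = line.split()" in l:
        # Re-join the split items right after the split
        l = l.replace("line = line.split()", "line = line.split()\n            " + PATCH_LINE)
      out_py.write(l)

def checkPy():
  """ Create the patched copy only if tag_pipe.py does not already contain the patch
  """
  with open("scripts/tag_pipe.py","r") as py:
    already_patched = any(PATCH_LINE in l for l in py)
  if not already_patched:
    noTokenizing()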

*   **Run the cells below to modify the code in `scripts/tag_pipe.py` automatically**



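In [None]:
# First write the patched copy scripts/tag_pipe2.py
noTokenizing()
checkPy()

In [None]:
# Then keep the original as tag_pipe_src.py and put the patched copy in its place
!mv "scripts/tag_pipe.py" "scripts/tag_pipe_src.py"
!mv "scripts/tag_pipe2.py" "scripts/tag_pipe.py"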

* **Or add a line after line 24**
```
23        else:
24            line = line.split()
```
to
```
23        else:
24            line = line.split()
25            line = [' '.join(item for item in line)]
```
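To see what the added line does, it can be tried on a sample string outside Pie. This is only an illustration in plain Python; the input line is made up.

```python
line = "de  la\tcité\n"   # hypothetical input line

tokens = line.split()                          # ['de', 'la', 'cité']
single = [' '.join(item for item in tokens)]   # one-element list
print(single)                                  # ['de la cité']

# Joining a raw string instead iterates over its characters
print([' '.join(item for item in "roi")])      # ['r o i']
```

The result is a list with a single item, so the whole line is passed on as one token. The join has to be applied to the list produced by `split()`: applied to the raw string, it would put a space between every character.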



#### **Tagging corpus files**

In [None]:
# path (glob pattern) to the unannotated corpus files
tokens_path = '/content/gdrive/My Drive/RNNTagger(tagger)/tsv_files/tokens/*tokens.csv'
# model to be used for tagging
model = '10ep2bat.tar'

* **Tag multiple files**



In [None]:
import subprocess
import glob
import os, sys

In [None]:
def run_tagging(path):
  """ Load the model and tag every corpus file matching `path`
  path : str
    glob pattern (absolute path) matching the unannotated corpus files
  """
  all_files = glob.glob(path)
  print(f"{len(all_files)} files found")
  nbFile = 0

  for file in all_files:
    print(file)
    # Output goes to the tagged_corpus_nlppie folder on Drive, which must already exist
    outputname = "/content/gdrive/My Drive/tagged_corpus_nlppie/" + file.split("/")[-1].replace('tokens.csv','_tagged_pie.csv')
    # Paths are single-quoted because they contain spaces
    command = "cat '" + file + "' | " + "pie tag-pipe " + f"{model} > '" + outputname + "'"
    nbFile += 1
    print(f"File no. : {nbFile}")
    !{command}

run_tagging(tokens_path)
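The command string above quotes the paths by hand, which works as long as no file name contains a single quote. A possible variant builds the same command with `shlex.quote` from the standard library. This is a sketch only: `build_tag_command` is a hypothetical helper, and the paths in the example are made up.

```python
import os
import shlex

def build_tag_command(file, model, out_dir):
    """Build the shell pipeline that tags one file, quoting every path safely."""
    out_name = os.path.basename(file).replace("tokens.csv", "_tagged_pie.csv")
    out_path = os.path.join(out_dir, out_name)
    return (
        f"cat {shlex.quote(file)} | pie tag-pipe {shlex.quote(model)} "
        f"> {shlex.quote(out_path)}"
    )

print(build_tag_command("/data/My Drive/a_tokens.csv", "m.tar", "/out dir"))
```

The returned string can be run from a notebook cell in the same way as `command` in `run_tagging`.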

* **Tag a single file**

In [None]:
inputFile = 'inputfile.csv' # // input file to tag
output_tagged = 'tagged_tokens.txt' # // output tagged file name

In [None]:
# With the ! prefix, IPython substitutes the Python variables written in braces
!cat "{inputFile}" | pie tag-pipe "{model}" > "{output_tagged}"