<a href="https://colab.research.google.com/github/CristinaGHolgado/old-french-with-pie-rnntagger/blob/master/old_french_pie.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Install the library**

#### **Install Pie (v0.3.6)**


In [5]:
%%shell
pip install nlp-pie==0.3.6





#### **Set the path to** /pie

In [6]:
# In Google Colab with Python 3.7 the path should be '/usr/local/lib/python3.7/dist-packages/pie'.
# If your environment differs, find the Pie folder and set the path accordingly.
import os

def set_path_pie(path):
  """ Change the working directory to the Pie folder for training or tagging
  """
  if not os.getcwd().endswith('/pie'):
    os.chdir(path)
    print(f"Path set to : {os.getcwd()}")

In [7]:
pie_path = '/usr/local/lib/python3.7/dist-packages/pie'
set_path_pie(pie_path)

Path set to : /usr/local/lib/python3.7/dist-packages/pie
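
If the folder is not at the location given in the comment above, one option is to ask Python's import system where the package lives. The sketch below uses only the standard library; `find_package_dir` is a hypothetical helper name (not part of Pie), and it assumes the package is importable under the name `pie` once `nlp-pie` is installed.

```python
import importlib.util

def find_package_dir(name):
    """Return the installation folder of an importable package, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None  # not installed, or a plain module rather than a package
    return list(spec.submodule_search_locations)[0]

# Demonstrated on the standard-library package `json`;
# pass "pie" instead once nlp-pie is installed.
print(find_package_dir("json"))
```

The returned string could then be passed to `set_path_pie` in place of the hard-coded path.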


## **Training**

#### **Set custom training parameters**

Upload the training and dev data. The paths below assume Google Drive is mounted at `/content/gdrive`.

In [50]:
## Modify parameters & options if necessary (*)

defaultSettingsFile = "default_settings.json" # // name or path to the .json file training parameters *

data_to_train = 'lemma' # // task to train, e.g. 'lemma' or 'pos' (must match the training data column name) *

out_json = defaultSettingsFile.split(".")[0] + "_" + data_to_train + ".json"

modelname = "oldfrench_model" # // output model name *
inputpath = "/content/gdrive/MyDrive/traindata.tsv" # // input training data path or filename *
dev_set_path = "/content/gdrive/MyDrive/devdata.tsv" # //path to dev set (same format as input)
colnames = f'["token", "{data_to_train}"]' # // column names in training data

customized_train_params = True # // If True the following parameters will be used * :
nb_epoch = 50 # *
checks_per_epoch = 1  # *
batch_size = 2  # *
lower_opt = 'false' # lowercase target tokens *
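
Before training, it can help to confirm that the data files really contain the columns named in `colnames`. The following is a minimal sketch, assuming the files are tab-separated with a header row (the loaded configuration shown later in this notebook reports `header: true`). `missing_columns` is a hypothetical helper, and the demo file is made up for illustration.

```python
import csv
import os
import tempfile

def missing_columns(tsv_path, expected):
    """Return the expected column names that are absent from the header row."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in expected if name not in header]

# Illustration on a tiny, made-up file (not real training data)
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, encoding="utf-8") as tmp:
    tmp.write("token\tlemma\nrois\troi\n")

complete = missing_columns(tmp.name, ["token", "lemma"])   # all columns present
incomplete = missing_columns(tmp.name, ["token", "pos"])   # 'pos' is not in the header
os.remove(tmp.name)

print(complete, incomplete)
```

In the notebook, the same check would be run on `inputpath` and `dev_set_path` with `["token", data_to_train]`.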


In [51]:
import json
import pandas as pd
import re

def setParams():
  """ Set training parameters for 'default_settings.json'
  """
  with open(defaultSettingsFile, "r") as source_params, open(out_json, "w") as out: # open default training settings json file
      for line in source_params:
        if '"modelname"' in line:
          line = line.replace('"model"', f'"{modelname}"')
          print(line)
        if '"input_path"' in line:
          line = line.replace('""', f'"{inputpath}"') # path (unix-like expression) to files with training data [default_settings.json]
          print(line)
        # if '"tasks_order":' in line:
        #   line = line.replace('["lemma", "pos"]',f'{colnames}') # Expected order of tasks for tabreader if no header [default_settings.json]
        #   print(line)
        if '"name": "lemma"' in line:
          line = line.replace('"lemma"',f'"{data_to_train}"')
          print(line)
        if "dev_path" in line:
          line = line.replace('""', f'"{dev_set_path}"')
          print(line)
        
        if customized_train_params:
          # Replace only the first "<digits>," on the line; a raw string avoids the invalid '\,' escape
          if '"epochs"' in line:
            line = re.sub(r'[0-9]+,', f"{nb_epoch},", line, count=1)
            print(line)
          if '"batch_size"' in line:
            line = re.sub(r'[0-9]+,', f"{batch_size},", line, count=1)
            print(line)
          if '"checks_per_epoch"' in line:
            line = re.sub(r'[0-9]+,', f"{checks_per_epoch},", line, count=1)
            print(line)
          if '"lower"' in line:
            line = line.replace('true', lower_opt)
            print(line)

        out.write(line)

In [52]:
set_path_pie(pie_path)
setParams()
print()
print(f"Parameters : \n{out_json}") # Display parameters

Path set to : /usr/local/lib/python3.7/dist-packages/pie
  "modelname": "oldfrench_model", // model name to be used for saving

  "input_path": "/content/gdrive/MyDrive/traindata.tsv", // path (unix-like expression) to files with training data

  "dev_path": "/content/gdrive/MyDrive/devdata.tsv", // path to dev set (same format as input_path)

      "name": "lemma", // name (by default should match the target task)

	"lower": false // lowercase target tokens

  "epochs": 50, // number of epochs

  "batch_size": 2, // batch size

  "checks_per_epoch": 1, // check model on dev-set so many times during epoch


Parameters : 
default_settings_lemma.json
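
The line-by-line substitution above depends on the exact wording of each line in the settings file. An alternative is to strip the `//` comments and edit the parsed JSON. The sketch below is only an illustration under stated assumptions: the file uses `//` line comments and is otherwise standard JSON, and only top-level keys are overridden (nested values such as a task's `lower` option would need their own handling). The function names are hypothetical, and the sample text is abbreviated from the output shown above.

```python
import json
import re

# Matches either a quoted JSON string (kept) or a // comment up to the end of the line (dropped)
_STRING_OR_COMMENT = re.compile(r'("(?:\\.|[^"\\])*")|//[^\n]*')

def strip_line_comments(text):
    """Remove // comments without touching '//' that occurs inside strings."""
    return _STRING_OR_COMMENT.sub(lambda m: m.group(1) or "", text)

def override_settings(text, overrides):
    """Parse commented JSON and replace the given top-level keys."""
    settings = json.loads(strip_line_comments(text))
    settings.update(overrides)
    return settings

sample = """{
  "modelname": "oldfrench_model", // model name to be used for saving
  "input_path": "/content/gdrive/MyDrive/traindata.tsv", // path (unix-like expression) to files with training data
  "epochs": 50, // number of epochs
  "batch_size": 2 // batch size
}"""

# Illustrative override values, not recommendations
updated = override_settings(sample, {"epochs": 10, "batch_size": 4})
print(json.dumps(updated, indent=2))
```

Writing `updated` back with `json.dump` produces a settings file without comments, which is the main trade-off compared with the textual replacement.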


####  **Run training & download model after training**



In [None]:
# Run after setting the training parameters in the settings json file and uploading the training data
%cd
!pie train /usr/local/lib/python3.7/dist-packages/pie/default_settings_lemma.json

fatal: not a git repository (or any of the parent directories): .git

::: Loaded Config :::

batch_size: 2
breakline_data: ''
breakline_ref: ''
buffer_size: 10000
cache_dataset: false
cell: LSTM
cemb_dim: 150
cemb_layers: 1
cemb_type: rnn
char_bos: true
char_eos: true
char_lower: false
char_max_size: 500
char_min_freq: 1
checks_per_epoch: 1
clip_norm: 5.0
config_path: /usr/local/lib/python3.7/dist-packages/pie/default_settings_lemma.json
custom_cemb_cell: false
dev_path: /content/gdrive/MyDrive/devdata.tsv
device: cpu
drop_diacritics: false
dropout: 0.0
epochs: 50
factor: 1
freeze_embeddings: false
header: true
hidden_size: 300
include_lm: false
init_rnn: default
input_path: /content/gdrive/MyDrive/traindata.tsv
linear_layers: 1
lm_schedule:
  factor: 0.5
  mode: min
  patience: 2
  weight: 0.2
lm_shared_softmax: true
load_pretrained_embeddings: ''
load_pretrained_encoder: ''
lr: 0.001
lr_factor: 0.75
lr_patience: 2
max_sent_len: 35
max_sents: 1000000
merge_type: concat
min_lr: 1.0e-06

In [None]:
from google.colab import files
import glob
import os
# If several model files match, pick the most recently written one
matching_models = glob.glob(modelname + "*.tar")
get_model = max(matching_models, key=os.path.getmtime)
files.download(get_model) # automatically download model after training

## **Tagging**

#### **Log into your Drive to choose the files to be tagged**

In [13]:
from google.colab import drive
import os
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


#### **Modify *scripts/tag_pipe.py* to skip tokenization**

In [None]:
PATCHED_LINE = "line = [' '.join(item for item in line)]"

def noTokenizing():
  """ Write scripts/tag_pipe2.py, a copy of tag_pipe.py that keeps each input line as a single token
  """
  with open("scripts/tag_pipe.py","r") as py, open("scripts/tag_pipe2.py","w") as out_py:
    for l in py:
      if "line = line.split()" in l:
        l = l.replace("line = line.split()", "line = line.split()\n            " + PATCHED_LINE)
      out_py.write(l)

def checkPy():
  """ Create the patched copy unless scripts/tag_pipe.py is already patched
  """
  with open("scripts/tag_pipe.py","r") as py:
    already_patched = PATCHED_LINE in py.read()
  if not already_patched:
    noTokenizing()

*   **Run the cells below to modify the code in `tag_pipe.py` automatically**



In [None]:
set_path_pie(pie_path) # the scripts/ paths are relative to the Pie folder
checkPy()

In [None]:
# Keep the original as tag_pipe_src.py and swap in the patched copy (only if it was created)
!test -f "scripts/tag_pipe2.py" && mv "scripts/tag_pipe.py" "scripts/tag_pipe_src.py" && mv "scripts/tag_pipe2.py" "scripts/tag_pipe.py"

* **Or replace line 24**
```
23        else:
24            line = line.split()
```
with
```
23        else:
24            line = [' '.join(line.split())]
```
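
What the automatic patch changes can be seen on a single input line. This sketch mimics the two statements in isolation; the function names are hypothetical, the example line is made up, and nothing here calls Pie itself.

```python
def whitespace_tokens(line):
    """Unpatched behaviour: split the input line on whitespace."""
    return line.split()

def single_token(line):
    """Patched behaviour: split, then re-join, so the whole line is one token."""
    parts = line.split()
    return [' '.join(item for item in parts)]

example = "por ce que\n"  # made-up input line

print(whitespace_tokens(example))  # three separate tokens
print(single_token(example))       # one token containing the spaces
```

With the patch, a file that holds one token per line is tagged line by line, and tokens containing spaces are not broken apart.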



#### **Tagging corpus files**

In [None]:
# glob pattern matching the unannotated corpus files
tokens_path = '/content/gdrive/My Drive/RNNTagger(tagger)/tsv_files/tokens/*tokens.csv' 
# model to be used for tagging
model = '10ep2bat.tar'

* **Tag multiple files**



In [None]:
import subprocess
import glob
import os, sys

In [None]:
def run_tagging(path):
  """ Load model and tag corpus
  path : str
    glob pattern matching the unannotated corpus files
  """
  all_files = glob.glob(path)
  print(f"{len(all_files)} files found")
  nbFile = 0
  output_dir = "/content/gdrive/My Drive/tagged_corpus_nlppie/"
  os.makedirs(output_dir, exist_ok=True) # the shell redirection below fails if the folder is missing

  for file in all_files:
    print(file)
    outputname = output_dir + file.split("/")[-1].replace('tokens.csv','_tagged_pie.csv')
    command = "cat '" + file + "' | " + "pie tag-pipe " + f"{model} > '" + outputname + "'"
    nbFile += 1
    print(f"File no. : {nbFile}")
    !{command}

run_tagging(tokens_path)
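
The command string above wraps paths in single quotes by hand, which breaks if a file name itself contains a quote. As a hedged alternative, the sketch below builds the same `cat ... | pie tag-pipe ... > ...` pipeline with `shlex.quote` from the standard library. `output_name_for` and `build_command` are hypothetical helpers, the example paths are made up, and the naming rule simply mirrors the `replace('tokens.csv', '_tagged_pie.csv')` call used in `run_tagging`.

```python
import shlex
from pathlib import PurePosixPath

def output_name_for(input_path, output_dir):
    """Derive the tagged file's path from the input file name."""
    name = PurePosixPath(input_path).name.replace('tokens.csv', '_tagged_pie.csv')
    return str(PurePosixPath(output_dir) / name)

def build_command(input_path, model, output_path):
    """Build the shell pipeline with every argument safely quoted."""
    return (f"cat {shlex.quote(input_path)} | "
            f"pie tag-pipe {shlex.quote(model)} > {shlex.quote(output_path)}")

# Made-up paths for illustration
src = "/content/gdrive/My Drive/corpus/text1_tokens.csv"
dst = output_name_for(src, "/content/gdrive/My Drive/tagged_corpus_nlppie")
command = build_command(src, "10ep2bat.tar", dst)
print(command)
```

The resulting string can be run from a notebook cell with `!{command}`, exactly as in `run_tagging`.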

* **Tag a single file**

In [None]:
inputFile = 'inputfile.csv' # // input file to tag
output_tagged = 'tagged_tokens.txt' # // output tagged file name

In [None]:
# Use `!` so that the Python variables in braces are expanded before the command runs
!cat "{inputFile}" | pie tag-pipe {model} > "{output_tagged}"