<a href="https://colab.research.google.com/github/RobMcH/gector/blob/nils/Gector_Test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Check which GPU is attached. The Colab runtime must include a GPU; if none is listed, enable one manually in the runtime settings:

In [None]:
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-080c496e-e1c6-17a6-faa6-3d2239fe3c6d)
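The same check can be done from Python, which is handy if later cells should fail early when no GPU is attached. The following is a minimal sketch that assumes the `nvidia-smi -L` output has the shape shown above; the helper names `parse_gpu_list` and `list_gpus` are hypothetical and not part of GECToR.

```python
import re
import subprocess

# Matches lines such as "GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-...)".
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (.+)\)$")


def parse_gpu_list(text):
    """Return (index, name) tuples parsed from `nvidia-smi -L` output."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2)))
    return gpus


def list_gpus():
    """Run `nvidia-smi -L`; return [] if the tool is missing or fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    gpus = list_gpus()
    if gpus:
        for index, name in gpus:
            print(f"GPU {index}: {name}")
    else:
        print("No GPU found. Enable one in the Colab runtime settings.")
```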


# Setup Notebook

## Change working directory

Change the working directory to the copy of gector-master stored in Google Drive.

*First connect to Google Drive via: Files (click on the file symbol in the left sidebar) -> Connect to Google Drive (button at the top of the newly opened sidebar).*

In [None]:
%cd /content/drive/My\ Drive/gector-master

/content/drive/My Drive/gector-master
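As an alternative to the sidebar button, Drive can also be mounted from code with `google.colab.drive.mount`. The sketch below is an assumption about how one might wrap this: `mount_drive` and `enter_project_dir` are hypothetical helper names, and the mount only works inside a Colab runtime.

```python
import os


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


def enter_project_dir(path):
    """Change into `path`, failing early with a clear message if it is missing."""
    if not os.path.isdir(path):
        raise FileNotFoundError(
            f"{path!r} not found. Is Google Drive mounted and the folder name correct?"
        )
    os.chdir(path)
    return os.getcwd()


# Usage inside Colab (path taken from the cell above):
# mount_drive()
# enter_project_dir("/content/drive/My Drive/gector-master")
```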


## Install requirements

Install all requirements listed in GECToR's requirements.txt:

In [None]:
pip install -r requirements.txt

# Imports

Import the libraries that are needed:

In [None]:
import nltk
import os

We have to download the NLTK punkt tokenizer models, since they are used later:

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Pre-process data

## Convert FCE from XML to parallel sentence format

The following datasets can be used:

* All the public GEC datasets used in the paper can be downloaded from [here](https://www.cl.cam.ac.uk/research/nl/bea2019st/#data).
* Synthetically created datasets can be generated/downloaded [here](https://github.com/awasthiabhijeet/PIE/tree/master/errorify).

To test if everything works, we use the FCE v2.1 dataset.

The GECToR repository already contains a conversion script, which takes the parent directory of the FCE dataset as its argument (the dataset can be downloaded [here](https://ilexir.co.uk/datasets/index.html)).
This conversion from XML to the "parallel sentences format" that GECToR uses only has to be done once.
We write the processed results to a new folder, 'fce_output_folder'.

In [None]:
#!python utils/prepare_clc_fce_data.py 'fce-released-dataset' --output 'fce_output_folder'

After this operation, 'fce_output_folder' contains two txt files:
- 'fce-original.txt': Each line contains an original sentence, which may contain grammatical errors
- 'fce-applied.txt': Each line contains the corrected sentence (same order as in the original file)
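Because the two files are aligned line by line, it is worth checking that they have the same number of lines before preprocessing. The following minimal sketch reads both files into sentence pairs and counts how many pairs actually differ; the function names are hypothetical and not part of the GECToR repository.

```python
from itertools import zip_longest


def load_parallel_pairs(source_path, target_path):
    """Read two line-aligned files and return (source, target) sentence pairs."""
    pairs = []
    with open(source_path, encoding="utf-8") as src, \
            open(target_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                # One file ended before the other, so the alignment is broken.
                raise ValueError(f"Files differ in length at line {line_no}")
            pairs.append((s.rstrip("\n"), t.rstrip("\n")))
    return pairs


def count_changed(pairs):
    """Count pairs whose corrected sentence differs from the original."""
    return sum(1 for source, target in pairs if source != target)


# Usage with the files described above:
# pairs = load_parallel_pairs("fce_output_folder/fce-original.txt",
#                             "fce_output_folder/fce-applied.txt")
# print(len(pairs), "pairs,", count_changed(pairs), "with corrections")
```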

## Convert parallel sentence format to GECToR specific format

To train the model, the data needs to be preprocessed and converted to a special format with the following command, where:
- `-s`: Path to the source file (original sentences w/ mistakes)
- `-t`: Path to the target file (corrected sentences w/o mistakes)
- `-o`: Path to the output file (the training data will be stored in this file)

In [None]:
#!python utils/preprocess_data.py -s 'fce_output_folder/fce-original.txt' -t 'fce_output_folder/fce-applied.txt' -o 'fce_output_folder/training_data.txt'

The size of raw dataset is 34490
34490it [00:05, 5795.45it/s]
Overall extracted 34490. Original TP 21525. Original TN 12965
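The training script takes separate `--train_set` and `--dev_set` files, so a held-out validation file can be split off from the preprocessed data. The sketch below assumes that the output file stores one example per line; `split_lines`, `split_file` and the output file names are hypothetical.

```python
import random


def split_lines(lines, dev_fraction=0.1, seed=0):
    """Shuffle lines reproducibly and return (train_lines, dev_lines)."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    if len(lines) < 2:
        raise ValueError("need at least two lines to split")
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one line in the dev set.
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


def split_file(path, train_path, dev_path, dev_fraction=0.1, seed=0):
    """Split the file at `path` into a train file and a dev file."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    train, dev = split_lines(lines, dev_fraction, seed)
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(train)
    with open(dev_path, "w", encoding="utf-8") as f:
        f.writelines(dev)
    return len(train), len(dev)


# Usage (output names are placeholders):
# split_file("fce_output_folder/training_data.txt",
#            "fce_output_folder/train.txt",
#            "fce_output_folder/dev.txt")
```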


# Train model

To train the model, we first have to download a pretrained model from [here](https://github.com/grammarly/gector) and place the file in the 'pre-trained-models' folder. Then we can run the following command, which accepts these arguments:

- `--train_set`: training data (txt) as generated before
- `--dev_set`: validation data (txt) in the same format (the quick test below simply reuses the training file)
- `--model_dir`: directory of the pre-trained model
- `--batch_size`: batch size (default = 32)
- `--n_epoch`: number of epochs for the training
- `--patience`: early stopping rounds (default = 3)
- `--lr`: learning rate
- `--predictor_dropout`: dropout rate for the predictor (default = 0.0)
- `--transformer_model`: name of the transformer (choices=['bert', 'distilbert', 'gpt2', 'roberta', 'transformerxl', 'xlnet', 'albert'])

In [None]:
!python train.py --train_set 'fce_output_folder/training_data.txt' --dev_set 'fce_output_folder/training_data.txt' --model_dir 'pre-trained-models' --n_epoch 1 --transformer_model 'bert'

2021-04-23 02:23:13.726731: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Downloading: 100% 213k/213k [00:00<00:00, 1.07MB/s]
21524it [00:02, 7445.27it/s]
Data is loaded
Downloading: 100% 433/433 [00:00<00:00, 399kB/s]
Downloading: 100% 436M/436M [00:10<00:00, 41.6MB/s]
Model is set
Start training
accuracy: 0.8604, loss: 1.9602 ||: : 673it [00:37, 18.17it/s]
accuracy: 0.8671, loss: 1.3849 ||: : 673it [00:33, 20.06it/s]
Model is dumped
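When trying several settings, it can be easier to assemble the command in Python than to edit a long shell line. The following is a minimal sketch that only uses the flags listed above; `build_train_command` is a hypothetical helper, not part of train.py.

```python
import shlex


def build_train_command(train_set, dev_set, model_dir,
                        n_epoch=1, transformer_model="bert", extra=None):
    """Return the train.py invocation as an argument list."""
    cmd = [
        "python", "train.py",
        "--train_set", train_set,
        "--dev_set", dev_set,
        "--model_dir", model_dir,
        "--n_epoch", str(n_epoch),
        "--transformer_model", transformer_model,
    ]
    # Optional flags such as batch_size, patience, lr or predictor_dropout.
    for flag, value in (extra or {}).items():
        cmd += [f"--{flag}", str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_train_command(
        "fce_output_folder/training_data.txt",
        "fce_output_folder/training_data.txt",
        "pre-trained-models",
        extra={"batch_size": 32},
    )
    # Print a copy-pasteable shell line; nothing is executed here.
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to start the training from Python.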
