# **Data download**

> Download the data from where it is stored into the Google Colab working directory



## *Data stored in a GCS Bucket*

### Log into the Google account and initialize the gcloud SDK

> Make sure you have a Google account for authentication



In [None]:
import os, subprocess

### authenticate into google account (must have a google account)
from google.colab import auth
auth.authenticate_user()
### install GCloud SDK
!curl https://sdk.cloud.google.com | bash
### initialize SDK
!gcloud init

### Set the name of the Google Cloud Storage Bucket

> This bucket must already have been created through the Google Cloud Platform



In [None]:
### define the Google Cloud Storage Bucket
BUCKET = 'gectorptbrstorage'

### Download the file to be preprocessed

> This file will be copied from the GCS Bucket into the Google Colab working directory



In [None]:
### download sentences file
os.system(f'gsutil cp gs://{BUCKET}/files/wiki-sentences.txt .')
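
`os.system` only returns an exit status, so a failed copy can go unnoticed. The following is a minimal sketch, not part of the original notebook, that builds the same `gsutil cp` command as an argument list and runs it with `subprocess.run(..., check=True)` so that a failure raises an exception. The helper names `gcs_copy_command` and `download_from_bucket` are hypothetical.

```python
import subprocess


def gcs_copy_command(bucket, remote_path, destination='.'):
    """Build the gsutil command that copies one object out of a bucket."""
    return ['gsutil', 'cp', f'gs://{bucket}/{remote_path}', destination]


def download_from_bucket(bucket, remote_path, destination='.'):
    """Run the copy; check=True raises CalledProcessError if gsutil fails."""
    subprocess.run(gcs_copy_command(bucket, remote_path, destination), check=True)


# Only build and show the command here; call download_from_bucket(...) to run it.
command = gcs_copy_command('gectorptbrstorage', 'files/wiki-sentences.txt')
print(' '.join(command))
```

Passing the command as a list also avoids quoting problems if a bucket or file name ever contains spaces.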

## *Data stored in Google Drive*

### Download the file to be preprocessed

> This file will be copied from Google Drive (which must be mounted at /content/drive) into the Google Colab working directory

In [None]:
### download sentences file
import os  # needed here when the GCS cells above were skipped
os.system('cp /content/drive/MyDrive/gectorptbrFolder/preprocessable_files/wiki-sentences.txt .')
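
A shell `cp` fails silently from `os.system` when Drive is not mounted or the path is wrong. As a hedged alternative, the sketch below copies the file with `shutil` and raises a clear error if the source is missing. The helper name `copy_into_workdir` is hypothetical. In Colab, Drive would first have to be mounted, typically with `drive.mount('/content/drive')` from `google.colab`.

```python
import shutil
from pathlib import Path


def copy_into_workdir(source, workdir='.'):
    """Copy source into workdir and return the path of the new copy."""
    source = Path(source)
    if not source.is_file():
        # A missing file usually means Drive is not mounted or the path is wrong.
        raise FileNotFoundError(f'{source} not found; is Google Drive mounted?')
    return Path(shutil.copy(source, workdir))


# Usage in the notebook (assumes Drive is mounted at /content/drive):
# copy_into_workdir('/content/drive/MyDrive/gectorptbrFolder/preprocessable_files/wiki-sentences.txt')
```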

# **Errorify the data**

> Synthetically introduce errors into the dataset and save two files in the folder dual_files/: one with the correct sentences and one with the errorified sentences
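
To make the idea concrete, here is a toy sketch of synthetic error injection. It is not the logic of `error.py`, whose rules are not shown in this notebook. The `CONFUSIONS` table and the `errorify` function are hypothetical, chosen only to illustrate turning a correct sentence into an incorrect one.

```python
import random

# Hypothetical confusion pairs (correct token -> erroneous token), for illustration only.
CONFUSIONS = {'à': 'a', 'eram': 'era', 'mas': 'mais'}


def errorify(sentence, rng, probability=0.5):
    """Replace each confusable token with its erroneous form with the given probability."""
    noisy = []
    for token in sentence.split():
        if token in CONFUSIONS and rng.random() < probability:
            noisy.append(CONFUSIONS[token])
        else:
            noisy.append(token)
    return ' '.join(noisy)


rng = random.Random(0)  # fixed seed so the corruption is reproducible
correct = 'Eu vou à praia'
incorrect = errorify(correct, rng, probability=1.0)
print(correct, '->', incorrect)
```

Keeping the correct and the corrupted sentence side by side is what makes the later comparison step possible.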



In [None]:
### errorify the wiki-sentences.txt file
!cd /content/drive/MyDrive/gectorptbrFolder/PIE/errorify/ptbr && python3 error.py /content/wiki-sentences.txt /content/drive/MyDrive/gectorptbrFolder/files/dual_files

# **Install pip3 dependencies**

> The following two sections depend on these packages



In [None]:
### install the package requirements
!pip3 install -q -r /content/drive/MyDrive/gectorptbrFolder/requirements.txt
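
The prediction script imports `allennlp`, so running it before the requirements are installed fails with a `ModuleNotFoundError`. A small check like the one below, a sketch that is not part of the original notebook, can confirm that the needed modules are importable before the later cells run. The function name `missing_modules` is hypothetical, and only `allennlp` is known to be required from the traceback in this notebook.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# 'allennlp' is the module named in the notebook's traceback.
missing = missing_modules(['allennlp'])
if missing:
    print('Install the requirements first; missing modules:', ', '.join(missing))
```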

# **Compare the dual files**
> In this step, we compare the correct sentences against the errorified sentences to create a dataset that pairs each input token with its label, for example: **Eu vou a{replace_à} praia...** . In this example, **a** is the input and its label is **replace_à**. This process creates the training file for Deep Learning.
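
The comparison can be sketched with Python's `difflib`, which aligns the tokens of the two sentences. This is an illustrative simplification, not the algorithm of `utils/preprocess_data.py`. The `label_tokens` function and the `keep`/`delete` label names are assumptions, and insertions and uneven replacements are deliberately left out.

```python
import difflib


def label_tokens(incorrect, correct):
    """Pair every token of the incorrect sentence with an edit label."""
    src, tgt = incorrect.split(), correct.split()
    labels = ['keep'] * len(src)
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'replace' and (i2 - i1) == (j2 - j1):
            # One-to-one substitutions: label each token with its replacement.
            for offset in range(i2 - i1):
                labels[i1 + offset] = 'replace_' + tgt[j1 + offset]
        elif tag == 'delete':
            for i in range(i1, i2):
                labels[i] = 'delete'
        # Insertions and uneven replacements are ignored in this sketch.
    return list(zip(src, labels))


pairs = label_tokens('Eu vou a praia', 'Eu vou à praia')
print(' '.join(tok if lab == 'keep' else f'{tok}{{{lab}}}' for tok, lab in pairs))
```

For the sentence pair above, the token **a** receives the label **replace_à** and every other token is kept, which matches the example in the description.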





In [None]:
### generate source file for training
### (-s is the source, i.e. errorified, file and -t is the target, i.e. correct, file; paths are relative to the project folder)
!cd /content/drive/MyDrive/gectorptbrFolder && python3 utils/preprocess_data.py -s files/dual_files/incorr_sentences.txt -t files/dual_files/corr_sentences.txt -o files/neural_files/inputs_labels.txt



# **Fine-tune the BERTimbau model**
> Here we fine-tune the BERTimbau model on our Wikipedia dataset





## Separate the inputs_labels.txt file into training and testing datasets

> We write our own split function because the inputs and labels are stored together in one file



In [None]:
import math
import random

# paths to the relevant files
inputs_labels_path = '/content/drive/MyDrive/gectorptbrFolder/files/neural_files/inputs_labels.txt'
train_path = '/content/drive/MyDrive/gectorptbrFolder/files/neural_files/train.txt'
test_path = '/content/drive/MyDrive/gectorptbrFolder/files/neural_files/test.txt'

# if the file is huge, we have to think of something else,
# such as "linecache" instead of reading everything into memory
def shuffle_split(inpath, outpath_1, outpath_2, proportion_1=0.8):
    """Shuffle the lines of inpath; write proportion_1 of them to outpath_1 and the rest to outpath_2."""
    assert 0.0 < proportion_1 < 1.0, 'proportion_1 must be strictly between 0.0 and 1.0'

    with open(inpath, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    if not lines:
        raise ValueError(f'{inpath} is empty; nothing to split')
    # make sure the last line ends with a newline so that, after shuffling,
    # it is not glued to the line that follows it
    lines[-1] = lines[-1].rstrip('\n') + '\n'

    random.shuffle(lines)

    # the first proportion_1 of the shuffled lines goes to outpath_1
    cutoff = math.floor(proportion_1 * len(lines))
    with open(outpath_1, 'w', encoding='utf-8') as f:
        f.writelines(lines[:cutoff])
    with open(outpath_2, 'w', encoding='utf-8') as f:
        f.writelines(lines[cutoff:])

shuffle_split(inputs_labels_path, train_path, test_path)
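
To check the split proportions without touching any files, the same idea can be tried on an in-memory list. This is a minimal sketch under stated assumptions: the `split_lines` function and its `seed` parameter are hypothetical additions that make the shuffle reproducible, which the notebook's function does not do.

```python
import math
import random


def split_lines(lines, train_fraction=0.8, seed=0):
    """Shuffle a copy of lines and split it into (train, test) parts."""
    rng = random.Random(seed)  # private generator, so the split is reproducible
    shuffled = list(lines)
    rng.shuffle(shuffled)
    cutoff = math.floor(train_fraction * len(shuffled))
    return shuffled[:cutoff], shuffled[cutoff:]


sentences = [f'sentence {i}\n' for i in range(10)]
train, test = split_lines(sentences)
print(len(train), len(test))  # prints "8 2"
```

With ten lines and a training fraction of 0.8, eight lines land in the training part and two in the test part, and no line appears in both.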

## Run the script for fine-tuning

The script takes the following inputs:
*   The training dataset;
*   The test dataset, passed as the development set (`--dev_set`);
*   The model directory, here `MODEL_DIR`;
*   The transformer BERT model to be fine-tuned;
*   Whether to lowercase the tokens (0 => no, 1 => yes). Note: the BERTimbau model was trained on cased sentences, so training on uncased (i.e. lowercased) sentences would throw information away;
*   The number of epochs.
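
Since the command line gets long, it can help to keep the options in one place and assemble the command from them. The sketch below only builds and prints the command; it does not run the training. All flags and values are taken from the training cell in this notebook, while the `train_options` dictionary itself is a hypothetical convenience.

```python
import shlex

# Values copied from the training cell; edit them here rather than in the command string.
train_options = {
    '--train_set': 'files/neural_files/train.txt',
    '--dev_set': 'files/neural_files/test.txt',
    '--model_dir': 'MODEL_DIR',
    '--transformer_model': 'bertimbau',
    '--lowercase_tokens': 0,  # 0 keeps the original casing, which BERTimbau expects
    '--n_epoch': 5,
}

command = ['python3', 'train.py']
for flag, value in train_options.items():
    command += [flag, str(value)]

# shlex.join quotes each argument safely for pasting into a shell cell.
print(shlex.join(command))
```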

In [None]:
### train the model
!cd /content/drive/MyDrive/gectorptbrFolder && python3 train.py --train_set files/neural_files/train.txt --dev_set files/neural_files/test.txt --model_dir MODEL_DIR --transformer_model bertimbau --lowercase_tokens 0 --n_epoch 5


## Make inference on a file

The script takes the following inputs:
*   The trained model (one of the checkpoints saved in MODEL_DIR);
*   The vocabulary folder, which records what the model encountered in the training dataset, e.g. the labels \$KEEP or \$TRANSFORM_VERB_VB_VBD;
*   The file to run inference on, e.g. a file containing a sentence such as 'Eles era feios.', for which we want the neural network to output 'Eles eram feios.';
*   The file to which the predictions are written;
*   The transformer model that was fine-tuned during training.
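
Choosing the checkpoint by hand is error-prone once several epochs have been saved. The following sketch picks the checkpoint with the highest epoch number. It assumes that every checkpoint follows the `model_state_epoch_N.th` naming seen in the prediction command, which the notebook does not confirm, and the helper name `latest_checkpoint` is hypothetical.

```python
import re
from pathlib import Path

# Assumed naming scheme, inferred from 'model_state_epoch_4.th' in the prediction command.
CHECKPOINT_PATTERN = re.compile(r'^model_state_epoch_(\d+)\.th$')


def latest_checkpoint(model_dir):
    """Return the checkpoint path with the highest epoch number, or None if there is none."""
    candidates = []
    for path in Path(model_dir).iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare by epoch number, so epoch 10 ranks above epoch 4.
    return max(candidates, key=lambda item: item[0])[1]


# Usage in the notebook: latest_checkpoint('/content/drive/MyDrive/gectorptbrFolder/MODEL_DIR')
```

Comparing the epoch as an integer matters: a plain string sort would place `model_state_epoch_10.th` before `model_state_epoch_4.th`.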









In [None]:
!cd /content/drive/MyDrive/gectorptbrFolder && python3 predict.py --model_path MODEL_DIR/model_state_epoch_4.th --vocab_path MODEL_DIR/vocabulary/ --input_file eval_after_train.txt --output_file OUTPUT_FILE.txt --transformer_model bertimbau


Traceback (most recent call last):
  File "predict.py", line 4, in <module>
    from gector.gec_model import GecBERTModel
  File "/content/drive/MyDrive/gectorptbrFolder/gector/gec_model.py", line 8, in <module>
    from allennlp.data.dataset import Batch
ModuleNotFoundError: No module named 'allennlp'
