# Intro

**This is my very first public notebook on Kaggle! If you find it helpful or interesting, I'd be extremely happy if you upvoted it. Thank you!**

# Data Augmentation

In this notebook we try several augmentation methods to artificially increase the amount of training data. As we shall see, the back-translation method works best. It translates the original text into another language (e.g. German) and then translates it back into the original language (English), yielding similar but slightly different texts.

Data augmentation can be used, for instance, to increase the number of texts containing counter-claims and rebuttals, since they are not well represented in the original dataset.

This notebook further creates a new csv training file, and raw text files for the augmented data, so that they can be directly used in the following training process.

# Nlpaug

We shall install the *nlpaug* library, which comes with handy NLP augmentation methods. Several augmentation schemes are available, from simple synonym replacement to more complex transformer-based augmentation. We shall try some of them to see which augmenter works best in our specific case.

In [1]:
# Installing nlpaug (requires an internet connection)
!pip install nlpaug

Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
Collecting gdown>=4.0.0
  Downloading gdown-4.6.0-py3-none-any.whl (14 kB)
Installing collected packages: gdown, nlpaug
Successfully installed gdown-4.6.0 nlpaug-1.1.11

In [3]:
import os
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import transformers  # used under the hood by nlpaug's transformer-based augmenters

import nlpaug.augmenter.word as naw

In [5]:
# We save the created data in the following folder
# (exist_ok=True avoids a FileExistsError when the cell is re-run)
os.makedirs("data_augmented", exist_ok=True)

# Load Train

In [6]:
train = pd.read_csv('../input/feedback-prize-2021/train.csv')
train.head()

| | id | discourse_id | discourse_start | discourse_end | discourse_text | discourse_type | discourse_type_num | predictionstring |
|---|---|---|---|---|---|---|---|---|
| 0 | 423A1CA112E2 | 1622628000000.0 | 8.0 | 229.0 | Modern humans today are always on their phone.... | Lead | Lead 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1... |
| 1 | 423A1CA112E2 | 1622628000000.0 | 230.0 | 312.0 | They are some really bad consequences when stu... | Position | Position 1 | 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 |
| 2 | 423A1CA112E2 | 1622628000000.0 | 313.0 | 401.0 | Some certain areas in the United States ban ph... | Evidence | Evidence 1 | 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 |
| 3 | 423A1CA112E2 | 1622628000000.0 | 402.0 | 758.0 | When people have phones, they know about certa... | Evidence | Evidence 2 | 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 9... |
| 4 | 423A1CA112E2 | 1622628000000.0 | 759.0 | 886.0 | Driving is one of the way how to get around. P... | Claim | Claim 1 | 139 140 141 142 143 144 145 146 147 148 149 15... |


# Augmentation methods Study

For testing purposes, we take a random discourse and apply some of the available augmentation methods to see how the text is transformed.

## Preliminary

In [9]:
IDS = train["id"].unique()

# We shall pick one random text to see how the augmentation performs
random.seed(1)
id_text = random.choice(IDS)
os_test_pdf = train[train["id"] == id_text]

# The associated text file
file_path = f'../input/feedback-prize-2021/train/{id_text}.txt'
with open(file_path, 'r') as f:
    study_text = f.read()

In [10]:
print(study_text)

Summer break is the time were students get to unwind relax and just enjoy being youthful with friends and family even though it can lead to them struggling to learn for the next school year. When school is out and students are on summer break students tend to forget everything that they had previously learned from the past year. Which leads them to their first week back to school being confusing also basic concepts that they were already taught become difficult to remember. This is why teachers give out teacher designed summer projects to keep students mentally stimulated, help them get an understanding at the new subject they will be learning, and remind them of old concepts that were taught to them.

If a student is given all of summer break to themselves they won't hesitate to not look at a book or do some practice math problems or watch a video on things they have learned to refresh themselves. Students use the school breaks as a way to hang out with friends and family, get a summe

## Augmentation using Nlpaug

### Synonym Augmentation

This method simply replaces some of the words in the original text with synonyms. You can change the minimum or maximum number of replacements through the parameters, as explained [here](https://nlpaug.readthedocs.io/en/latest/augmenter/word/synonym.html). As you can see here and below, with nlpaug data augmentation takes only two lines!

In [11]:
# We take a text chunk from the train dataframe to apply augmentation
text_chunk = os_test_pdf.iloc[10]["discourse_text"]

In [12]:
syn_aug = naw.SynonymAug(aug_src='wordnet')
text_chunk_aug_syn = syn_aug.augment(text_chunk)
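
To make the idea concrete, here is a minimal toy sketch of synonym replacement in plain Python. It is not how nlpaug implements `SynonymAug`: the `SYNONYMS` table, the `synonym_augment` function and its `aug_p` and `seed` parameters are all hypothetical names introduced for this illustration, and a real augmenter would look synonyms up in WordNet instead of a hand-written dictionary.

```python
import random

# Hand-written synonym table (an assumption for this sketch; nlpaug uses WordNet)
SYNONYMS = {
    "know": ["recognize", "realize"],
    "help": ["assist", "aid"],
    "remember": ["recall"],
}

def synonym_augment(text, synonyms, aug_p=0.3, seed=None):
    """Replace each word that has a known synonym with probability aug_p."""
    rng = random.Random(seed)
    augmented = []
    for word in text.split():
        candidates = synonyms.get(word.lower())
        if candidates and rng.random() < aug_p:
            augmented.append(rng.choice(candidates))
        else:
            augmented.append(word)
    return " ".join(augmented)

# With aug_p=1.0 every word that has a synonym is replaced
print(synonym_augment("I remember this", SYNONYMS, aug_p=1.0))
```

Words without an entry in the table are always left untouched, which is why the overall sentence structure survives the augmentation.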

### Word2Vec Augmentation

Word2Vec augmentation is similar to synonym augmentation, but it replaces a word not with a synonym but with a word that has a similar vector representation. To use this augmentation, we need to specify the backbone model. Many are available as Kaggle datasets; here we use the most common one, trained on GoogleNews, which you can find [here](https://www.kaggle.com/umbertogriffo/googles-trained-word2vec-model-in-python).

In [13]:
word2vec_path = "../input/googles-trained-word2vec-model-in-python/GoogleNews-vectors-negative300.bin"
w2v_aug = naw.WordEmbsAug(
    model_type='word2vec', model_path=word2vec_path,
    action="substitute")
text_chunk_aug_w2v = w2v_aug.augment(text_chunk)
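
The notion of "similar vector representation" can be illustrated with a small sketch based on cosine similarity. The three-dimensional vectors below are made up for the example (real Word2Vec vectors from the GoogleNews model have 300 dimensions), and `cosine` and `nearest_word` are hypothetical helper names, not part of nlpaug.

```python
import math

# Made-up toy embeddings, for illustration only
VECTORS = {
    "teacher": [0.9, 0.1, 0.0],
    "instructor": [0.85, 0.15, 0.05],
    "student": [0.6, 0.5, 0.1],
    "banana": [0.0, 0.1, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_word(word, vectors):
    """Return the other word whose vector is closest to that of `word`."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))

print(nearest_word("teacher", VECTORS))  # instructor
```

Note that closeness in vector space does not guarantee a true synonym: words that merely appear in similar contexts can also be chosen, which may explain the odd replacements seen in the comparison further down.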

### Contextual Embedding

Contextual embedding augmentation uses NLP models (here transformers) to understand the context of the input text and to replace or add words while keeping that context. As a result, the new text may contain additional words or have a slightly different meaning.

For the embedding model we use a pretrained RoBERTa model. Please note that an internet connection is necessary to download the model. It may also be possible to plug in the model used for the Feedback Prize prediction, so as to create text in the spirit of the original dataset.

In [14]:
transf_aug = naw.ContextualWordEmbsAug(
    model_path="roberta-base", action="substitute")
text_chunk_aug_transf = transf_aug.augment(text_chunk)


### Back Translation

The back-translation method first translates the original text into another language (for instance French or German), and then translates it back into the original one. This produces new texts with the same meaning but different words and length. Here too you need an internet connection, since under the hood nlpaug uses Hugging Face translation models (by default English -> German -> English).

It is also worth noting that one can set the `device` parameter to run the models on a GPU.

In [15]:
# 4) Back translation augmentation
# back_trans_aug = naw.BackTranslationAug(device="cuda")  # If using GPUs
back_trans_aug = naw.BackTranslationAug()
text_chunk_aug_btrans = back_trans_aug.augment(text_chunk)
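
The round trip itself can be sketched independently of any translation model. In the toy example below, `back_translate` and `toy_translator` are hypothetical helpers, and the two word-by-word dictionaries are made up; they only stand in for the real English -> German and German -> English models that nlpaug downloads.

```python
from typing import Callable, Dict

def back_translate(text: str,
                   to_pivot: Callable[[str], str],
                   from_pivot: Callable[[str], str]) -> str:
    """Translate into a pivot language, then back into the source language."""
    return from_pivot(to_pivot(text))

def toy_translator(table: Dict[str, str]) -> Callable[[str], str]:
    """Build a naive word-by-word translator from a lookup table."""
    def translate(text: str) -> str:
        return " ".join(table.get(word, word) for word in text.lower().split())
    return translate

# Made-up dictionaries: several English words map to the same German word
EN_DE = {"the": "das", "automobile": "auto", "car": "auto",
         "is": "ist", "quick": "schnell", "fast": "schnell"}
DE_EN = {"das": "the", "auto": "car", "ist": "is", "schnell": "fast"}

print(back_translate("The automobile is quick",
                     toy_translator(EN_DE), toy_translator(DE_EN)))
# the car is fast
```

The paraphrase arises because the mapping between languages is not one-to-one: several source words collapse onto one pivot word, and the way back picks only one of them.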

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


### Compare results

In [16]:
# Comparing the different augmentation methods
print("Original:")
print(text_chunk)
print("")
print("Synonym Augmented Text:")
print(text_chunk_aug_syn)
print("")
print("Word2Vec Augmented Text:")
print(text_chunk_aug_w2v)
print("")
print("Roberta Augmented Text:")
print(text_chunk_aug_transf)
print("")
print("Back Translation Augmented Text:")
print(text_chunk_aug_btrans)

Original:
From personal experience I know I wasn't the only other student who struggled from a similar situation like this, reasons like this is why most teachers give out there own assignments for students to do to help them remember old previous concepts and methods from previous years to understand their class.


Synonym Augmented Text:
["From personal experience I cognize I wasn ' t the only other student who struggled from a similar situation similar this, reasons like this is why most teachers give verboten there own assignments for students to suffice to help them remember one time previous concepts and methods from previous years to understand their class."]

Word2Vec Augmented Text:
["Starting Patti_Seger experience I know I didn_t ' t the only other student who struggled prior a similar situation weird this, reasons liked this is why most teachers give out hesays own assignments for students to do to help them remember daughter_Janessa_Greig prior concepts and methods from pr

# Creating Augmented dataset

The previous study showed that, while there are many ways to augment text data, the *transformers*-based approach and the *back-translation* approach give the most promising results. We shall therefore stick with the **back-translation** method in what follows. Let us briefly note that one can also use the transformers method to obtain an additional dataset.

We shall now create a new dataset, in three steps:
1. First, back-translate all the discourses belonging to a given text in the original training CSV file.
2. Next, create a new text file by taking the original text and replacing the discourses inside it with the new ones created in step 1.
3. Finally, create a new CSV file, taking into account the new discourse positions and prediction strings.
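
Steps 2 and 3 above can be sketched as follows. This is only a minimal illustration under stated assumptions, not necessarily the implementation used in this notebook: `rebuild_text` is a hypothetical helper, discourses are assumed to be non-overlapping `(discourse_start, discourse_end, augmented_text)` tuples with character offsets into the original text, and the prediction string is assumed to list the whitespace-separated word indices covered by each discourse.

```python
def rebuild_text(original_text, discourses):
    """Replace each discourse span by its augmented version.

    discourses: non-overlapping (start, end, augmented_text) tuples,
    where start/end are character offsets in original_text.
    Returns the new text and one record per discourse.
    """
    pieces, records = [], []
    cursor = 0    # current position in the original text
    new_len = 0   # length of the new text built so far
    for start, end, new_discourse in sorted(discourses):
        # Copy the untouched text between the previous discourse and this one
        gap = original_text[cursor:start]
        pieces.append(gap)
        new_len += len(gap)
        new_start = new_len
        pieces.append(new_discourse)
        new_len += len(new_discourse)
        records.append({"discourse_start": new_start,
                        "discourse_end": new_len,
                        "discourse_text": new_discourse})
        cursor = end
    pieces.append(original_text[cursor:])  # trailing text after the last discourse
    new_text = "".join(pieces)

    # Word indices of each discourse in the new text
    for rec in records:
        first_word = len(new_text[:rec["discourse_start"]].split())
        n_words = len(rec["discourse_text"].split())
        rec["predictionstring"] = " ".join(
            str(i) for i in range(first_word, first_word + n_words))
    return new_text, records

text = "Cars are bad. I think so. The end."
new_text, records = rebuild_text(
    text, [(0, 13, "Cars are harmful."), (14, 25, "I believe so.")])
print(new_text)
for rec in records:
    print(rec)
```

Because the augmented discourses rarely have the same length as the originals, the offsets of every later discourse shift, which is why both the positions and the prediction strings are recomputed from the new text.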

## Applying back-translation to dataset

In [17]:
# For testing purposes, we shall select only 5 texts
n_augment = 5
selected_id = np.random.choice(IDS, size=n_augment, replace=False)

train_selected = train[train["id"].isin(selected_id)]
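
Instead of sampling at random, one could also pick the texts that contain under-represented discourse types, in line with the motivation given in the introduction. The sketch below uses plain Python on made-up rows; `ids_with_rare_types` is a hypothetical helper, and the label spellings `"Counterclaim"` and `"Rebuttal"` are assumed to match the `discourse_type` column.

```python
from collections import defaultdict

# Toy rows mimicking the `id` and `discourse_type` columns of train.csv (made-up values)
rows = [
    {"id": "A", "discourse_type": "Lead"},
    {"id": "A", "discourse_type": "Claim"},
    {"id": "B", "discourse_type": "Claim"},
    {"id": "B", "discourse_type": "Counterclaim"},
    {"id": "B", "discourse_type": "Rebuttal"},
    {"id": "C", "discourse_type": "Evidence"},
]

RARE_TYPES = {"Counterclaim", "Rebuttal"}

def ids_with_rare_types(rows, rare_types):
    """Return the ids of texts containing rare discourse types, richest first."""
    counts = defaultdict(int)
    for row in rows:
        if row["discourse_type"] in rare_types:
            counts[row["id"]] += 1
    return sorted(counts, key=lambda text_id: (-counts[text_id], text_id))

print(ids_with_rare_types(rows, RARE_TYPES))  # ['B']
```

With the real dataframe, a comparable selection could be written as `train[train["discourse_type"].isin(["Counterclaim", "Rebuttal"])]["id"].unique()`.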

In [18]:
# Selecting the back translation augmentation method
augmenter = back_trans_aug

# Set the following to avoid warning message
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Applying augmentation to all the selected texts
# Since this may take a while, we shall show progressbar using tqdm
from tqdm.auto import tqdm
tqdm.pandas()

# Work on an explicit copy, so that adding a column does not raise pandas' SettingWithCopyWarning
train_selected = train_selected.copy()

# train_selected["augmented_text"] = train_selected.apply(lambda row : augmenter.augment(row["discourse_text"]), axis=1)  # If you don't want to use tqdm
train_selected["augmented_text"] = train_selected.progress_apply(lambda row : augmenter.augment(row["discourse_text"]), axis=1)

  0%|          | 0/43 [00:00<?, ?it/s]


In [19]:
train_selected.head()

| | id | discourse_id | discourse_start | discourse_end | discourse_text | discourse_type | discourse_type_num | predictionstring | augmented_text |
|---|---|---|---|---|---|---|---|---|---|
| 6892 | E2BCA7B63A7E | 1622064000000.0 | 0.0 | 57.0 | Why should cars be limited? Theres really no r... | Position | Position 1 | 0 1 2 3 4 5 6 7 8 9 | [Why should cars be restricted? There really i... |
| 6893 | E2BCA7B63A7E | 1622064000000.0 | 104.0 | 785.0 | I see that all cars are is a deadly machine th... | Evidence | Evidence 1 | 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 3... | [I see that all cars are a deadly machine that... |
| 6894 | E2BCA7B63A7E | 1622064000000.0 | 786.0 | 858.0 | I belive cars should be limited to the point w... | Position | Position 2 | 148 149 150 151 152 153 154 155 156 157 158 15... | [I think cars should be limited to the point w... |
| 6895 | E2BCA7B63A7E | 1622064000000.0 | 858.0 | 888.0 | To cut down the obesity rate. | Evidence | Evidence 2 | 162 163 164 165 166 167 | [To reduce the obesity rate.] |
| 27177 | 2B37596BD9A5 | 1619642000000.0 | 0.0 | 59.0 | I think that boys should join the Seagoing Cow... | Position | Position 1 | 0 1 2 3 4 5 6 7 8 9 | [I think boys should join the Seagoing Cowboys... |
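
Note the brackets in the `augmented_text` column: with the nlpaug version installed here, `augment` returns a list rather than a plain string. Before the new text files are written, each entry therefore needs to be unwrapped. A minimal sketch, where `first_augmentation` is a hypothetical helper name:

```python
def first_augmentation(result):
    """Return the augmented text as a plain string.

    Accepts either a string or a list of strings, so it works whether
    the augmenter returns the text directly or wrapped in a list.
    """
    if isinstance(result, str):
        return result
    return result[0] if result else ""

print(first_augmentation(["To reduce the obesity rate."]))
# To reduce the obesity rate.
```

On the dataframe this could be applied with `train_selected["augmented_text"].apply(first_augmentation)`.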
