# Replication of CrossAug: A Contrastive Data Augmentation Method for Debiasing Fact Verification Models

## Disclaimer

This is a work in progress. I will be adding more techniques as I learn them. If you have any suggestions, please feel free to reach out to me.

## Credits

This work is inspired by the following prior work:
https://github.com/minwhoo/CrossAug

Author: Minwoo Lee

Citation: Minwoo Lee, Seungpil Won, Juae Kim, Hwanhee Lee, Cheoneum Park, and Kyomin Jung. 2021. CrossAug: A Contrastive Data Augmentation Method for Debiasing Fact Verification Models. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM '21).

# Downloading the files and setting up configurations

In [None]:

!curl -L "https://drive.google.com/uc?export=download&id=16zumyDSuYV415dBDZiMrPIUA3yUrwJqb" -o fever+crossaug.train.tar.gz
!tar xzvf fever+crossaug.train.tar.gz
!mkdir -p fever_data
!mv "fever+crossaug.train.jsonl" fever_data/


fever+crossaug.train.jsonl


Output from an augmentation run:

time took: 632.8004121780396
Modify evidence using lexical search-based substitution
100%|██████████| 16664/16664 [00:06<00:00, 2678.98it/s]
time took: 6.220763206481934
Augment data
Saving to path: fever_data/fever.train.jsonl
Data saved! Data size: 24,647
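
As a quick sanity check on a JSON Lines file such as the downloaded `fever_data/fever+crossaug.train.jsonl`, the label distribution can be counted with the standard library. This is a minimal sketch: the `count_labels` helper is hypothetical, and it assumes every line is a JSON object with a `gold_label` field, as in the records written by the augmentation code in this notebook.

```python
import json
import os
from collections import Counter


def count_labels(path):
    """Count records per gold_label in a JSON Lines file (hypothetical helper)."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Records without a label are grouped under a placeholder key.
            counts[record.get("gold_label", "<missing>")] += 1
    return counts


if __name__ == "__main__":
    path = "fever_data/fever+crossaug.train.jsonl"
    if os.path.exists(path):
        print(count_labels(path))
```

A strongly skewed distribution, or a large `<missing>` count, would suggest that the file is not the one expected.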

### Downloading the repository files to the local server
- [ ] Baseline data
- [ ] Installing the dependencies
- [ ] Training and evaluation scripts



In [None]:

!mkdir -p fever_data
!cd fever_data && curl -L https://raw.githubusercontent.com/minwhoo/CrossAug/master/download_data.sh | sh
!curl -O https://raw.githubusercontent.com/minwhoo/CrossAug/master/utils_fever.py
!curl -O https://raw.githubusercontent.com/minwhoo/CrossAug/master/run_fever.py
!curl -O https://raw.githubusercontent.com/minwhoo/CrossAug/master/modeling_bert.py
!curl -O https://raw.githubusercontent.com/minwhoo/CrossAug/master/run_crossaug.py
!pip install jsonlines==2.0.0 nltk==3.6.2 numpy==1.20.2 pandas==1.1.5 scikit-learn==0.24.2 scipy==1.6.3 sentencepiece==0.1.95 tensorboardX==2.2 torch==1.8.1 transformers==4.11.2 pytorch-transformers==1.2.0 tqdm==4.60.0
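
Because the scripts depend on old pinned releases, it can help to confirm what actually got installed. The sketch below is an assumption-laden convenience, not part of the original repository: `PINS` copies a few versions from the install command above, and `check_pins` is a hypothetical helper built on the standard `importlib.metadata` module.

```python
from importlib import metadata

# A few of the pins from the install command above; extend as needed.
PINS = {
    "jsonlines": "2.0.0",
    "nltk": "3.6.2",
    "torch": "1.8.1",
    "transformers": "4.11.2",
    "pytorch-transformers": "1.2.0",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {installed}")
```

The comparison is an exact string match, so a build whose version string carries a local suffix would be reported as a mismatch even if it is compatible.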

# Data Augmentation

In [None]:

# Symmetric data
!python run_crossaug.py --in_file fever_data/symmetric.dev.jsonl --out_file fever_data/symmetric.train.jsonl
# FEVER data
!python run_crossaug.py --in_file fever_data/fever.dev.jsonl --out_file fever_data/fever.train.jsonl
# FM2 data
!python run_crossaug.py --in_file fever_data/fm2.dev.jsonl --out_file fever_data/fm2.train.jsonl
# Adversarial data
!python run_crossaug.py --in_file fever_data/adversarial.dev.jsonl --out_file fever_data/adversarial.train.jsonl


# Use the fine-tuned negative claim generation model (Example)

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = 'minwhoo/bart-base-negative-claim-generation'
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.to('cuda' if torch.cuda.is_available() else 'cpu')

examples = [
    "Little Miss Sunshine was filmed over 30 days.",
    "Magic Johnson did not play for the Lakers.",
    "Claire Danes is wedded to an actor from England."
]

batch = tokenizer(examples, max_length=1024, padding=True, truncation=True, return_tensors="pt")
out = model.generate(batch['input_ids'].to(model.device), num_beams=5)
negative_examples = tokenizer.batch_decode(out, skip_special_tokens=True)
print(negative_examples)

# Run the CrossAug augmentation pipeline

In [None]:

import time

import torch
import jsonlines
from tqdm import trange, tqdm
from nltk import word_tokenize  # requires the NLTK "punkt" data: nltk.download("punkt")
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


def find_substitution_map(sent1, sent2):
    """Find overlapping words in the given two sentences"""
    words1 = word_tokenize(sent1)
    words2 = word_tokenize(sent2)
    start_idx = 0
    while words1[start_idx] == words2[start_idx]:
        start_idx += 1
        if start_idx == len(words1) or start_idx == len(words2):
            return None

    end_idx = -1
    while words1[end_idx] == words2[end_idx]:
        end_idx -= 1

    if end_idx == -1:
        words_overlap1 = words1[start_idx:]
        words_overlap2 = words2[start_idx:]
    else:
        words_overlap1 = words1[start_idx:end_idx+1]
        words_overlap2 = words2[start_idx:end_idx+1]

    if 0 < len(words_overlap1) <= 3 and 0 < len(words_overlap2) <= 3:
        return words_overlap1, words_overlap2
    else:
        return None


def substitute_sent(sent, orig_words, replacing_words):
    """Find and substitute word phrases from given sentence"""
    sent_words = word_tokenize(sent)
    j = 0
    match_start_idx = None
    match_end_idx = None
    matches = []
    for i in range(len(sent_words)):
        if sent_words[i] == orig_words[j]:
            if j == 0:
                match_start_idx = i
            j += 1
        else:
            j = 0
            match_start_idx = None
            match_end_idx = None
        if j == len(orig_words):
            match_end_idx = i
            matches.append((match_start_idx, match_end_idx))
            j = 0
            match_start_idx = None
            match_end_idx = None
    if len(matches) == 1:
        i, j = matches[0]
        return ' '.join(sent_words[:i] + replacing_words + sent_words[j+1:])
    else:
        return None


def generate_negative_claims(data, batch_size):
    """Generate negative (refuted) claims using fine-tuned negative claim generation model"""
    model_name = 'minwhoo/bart-base-negative-claim-generation'
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.to('cuda' if torch.cuda.is_available() else 'cpu')

    for i in trange(0, len(data), batch_size):
        sents = [d['claim'] for d in data[i:i+batch_size]]
        batch = tokenizer(sents, padding=True, truncation=True, return_tensors="pt")
        out = model.generate(batch['input_ids'].to(model.device), num_beams=5)
        refuted_sents = tokenizer.batch_decode(out, skip_special_tokens=True)
        for j, refuted in enumerate(refuted_sents):
            data[i + j]['claim_refuted'] = refuted
    return data


def augment_start(in_file, out_file, batch_size=64):
    """Read a jsonl dataset from in_file, augment it, and write it to out_file."""

    print(f"Reading from path: {in_file}")
    with jsonlines.open(in_file, mode='r') as reader:
        data = [obj for obj in reader]
    print(f"Data loaded! Data size: {len(data):,}")

    print('Generate negative claims')
    start_time = time.time()
    data = generate_negative_claims(data, batch_size)
    print(f"time took: {time.time() - start_time}")

    print('Modify evidence using lexical search-based substitution')
    failed_cnt = 0
    start_time = time.time()
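    for d in tqdm(data):
        try:
            span_pair = find_substitution_map(d['claim'], d['claim_refuted'])
        except Exception:
            # find_substitution_map can index past the start of a sentence
            # (IndexError), e.g. when one claim is a suffix of the other.
            # Count the failure and leave the evidence unmodified.
            failed_cnt += 1
        else:
            if span_pair is not None:
                orig_span, replace_span = span_pair
                evid_refuted = substitute_sent(d['evidence'], orig_span, replace_span)
                if evid_refuted is not None:
                    d['evidence_refuted'] = evid_refuted
    print(f"time took: {time.time() - start_time}")
    print(f"Failed span searches: {failed_cnt:,}")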

    print('Augment data')
    augmented_data = []
    for d in data:
        augmented_data.append({
            'gold_label': d['gold_label'],
            'evidence': d['evidence'],
            'claim': d['claim'],
            'id': len(augmented_data),
            'weight': 0.0,
        })
        if d['gold_label'] == 'SUPPORTS':
            augmented_data.append({
                    'gold_label': 'REFUTES',
                    'evidence': d['evidence'],
                    'claim': d['claim_refuted'],
                    'id': len(augmented_data),
                    'weight': 0.0,
                })
            if 'evidence_refuted' in d:
                augmented_data.append({
                        'gold_label': 'REFUTES',
                        'evidence': d['evidence_refuted'],
                        'claim': d['claim'],
                        'id': len(augmented_data),
                        'weight': 0.0,
                    })
                augmented_data.append({
                        'gold_label': 'SUPPORTS',
                        'evidence': d['evidence_refuted'],
                        'claim': d['claim_refuted'],
                        'id': len(augmented_data),
                        'weight': 0.0,
                    })

    print(f"Saving to path: {out_file}")
    with jsonlines.open(out_file, mode='w') as writer:
        writer.write_all(augmented_data)
    print(f"Data saved! Data size: {len(augmented_data):,}")

augment_start("fever_data/fever.train.jsonl", "fever_data/fever+crossaug.train.jsonl", batch_size=64)
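
The lexical substitution step is easier to follow on a toy example. The sketch below is a simplified re-implementation, not the code above: it splits on whitespace instead of calling `word_tokenize`, counts overlapping matches, and uses the hypothetical helper names `diff_span` and `replace_once`. The sentences are invented for illustration only.

```python
def diff_span(words1, words2, max_len=3):
    """Return the differing middle spans of two token lists, or None."""
    shortest = min(len(words1), len(words2))
    # Length of the common prefix.
    start = 0
    while start < shortest and words1[start] == words2[start]:
        start += 1
    # Length of the common suffix, never overlapping the prefix.
    end = 0
    while end < shortest - start and words1[-1 - end] == words2[-1 - end]:
        end += 1
    span1 = words1[start:len(words1) - end]
    span2 = words2[start:len(words2) - end]
    if 0 < len(span1) <= max_len and 0 < len(span2) <= max_len:
        return span1, span2
    return None


def replace_once(words, old, new):
    """Replace `old` with `new` only if `old` occurs exactly once in `words`."""
    hits = [i for i in range(len(words) - len(old) + 1)
            if words[i:i + len(old)] == old]
    if len(hits) != 1:
        return None  # ambiguous or missing: skip this example
    i = hits[0]
    return words[:i] + new + words[i + len(old):]


# Invented toy sentences, for illustration only.
claim = "The bridge opened in 1990 .".split()
claim_refuted = "The bridge opened in 2005 .".split()
evidence = "Construction finished and the bridge opened in 1990 to traffic .".split()

orig_span, new_span = diff_span(claim, claim_refuted)
evidence_refuted = replace_once(evidence, orig_span, new_span)

# The four claim/evidence combinations that augment_start writes
# for a SUPPORTS example whose evidence could be modified.
pairs = [
    (claim, evidence, "SUPPORTS"),
    (claim_refuted, evidence, "REFUTES"),
    (claim, evidence_refuted, "REFUTES"),
    (claim_refuted, evidence_refuted, "SUPPORTS"),
]
for c, e, label in pairs:
    print(label, "|", " ".join(c), "|", " ".join(e))
```

Requiring exactly one match in the evidence keeps the substitution unambiguous; when the span is missing or repeated, only the refuted-claim pair is added for that example.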

# Using Hugging Face Models via tweetnlp: Sentiment and NER (Example)

In [None]:
!pip install autocuda gradio tweetnlp
!pip install -U "pyabsa[dev]"
!pip install -U huggingface-hub


from statistics import mean
from typing import Dict, Tuple

import gradio as gr
from tweetnlp import Sentiment, NER

def clean_tweet(tweet: str, remove_chars: str = "@#") -> str:
    """Remove any unwanted characters
    Args:
        tweet (str): The raw tweet
        remove_chars (str, optional): The characters to remove. Defaults to "@#".
    Returns:
        str: The tweet with these characters removed
    """
    for char in remove_chars:
        tweet = tweet.replace(char, "")
    return tweet


def format_sentiment(model_output: Dict) -> Dict:
    """Format the output of the sentiment model
    Args:
        model_output (Dict): The model output
    Returns:
        Dict: The format for gradio
    """
    formatted_output = dict()
    print(model_output)

    label = model_output.get("label")
    probability = model_output.get("probability")
    if label is None or probability is None:
        # No probability in the model output: nothing to format.
        return formatted_output
    if label == "positive":
        formatted_output["positive"] = probability
        formatted_output["negative"] = 1 - probability
    else:
        formatted_output["negative"] = probability
        formatted_output["positive"] = 1 - probability
    return formatted_output


def format_entities(model_output: Dict) -> Dict:
    """Format the output of the NER model
    Args:
        model_output (Dict): The model output
    Returns:
        Dict: The format for gradio
    """
    formatted_output = dict()
    for entity in model_output["entity_prediction"]:
        name = " ".join(entity["entity"])
        entity_type = entity["type"]
        new_key = f"{name}:{entity_type}"
        new_value = mean(entity["probability"])
        formatted_output[new_key] = new_value
    return formatted_output


def classify(tweet: str) -> Tuple[Dict, Dict]:
    """Runs models
    Args:
        tweet (str): The raw tweet
    Returns:
        Tuple[Dict, Dict]: The formatted_sentiment and formatted_entities of the tweet
    """
    tweet = clean_tweet(tweet)
    # Get sentiment
    model_sentiment = se_model.sentiment(tweet)
    model_pred = se_model.predict(tweet)
    print(model_sentiment)
    print(model_pred)
    formatted_sentiment = format_sentiment(model_sentiment)
    # Get entities
    entities = ner_model.ner(tweet)
    formatted_entities = format_entities(entities)
    return formatted_sentiment, formatted_entities


# Models from https://github.com/cardiffnlp/tweetnlp
# classify() looks these up as module-level names, so create them before calling it.
se_model = Sentiment()
ner_model = NER()

# A few examples from: https://twitter.com/NFLFantasy
examples = [
    "Dameon Pierce is clearly the #Texans starter and he once again looks good",
    "Deebo Samuel had 150+ receiving yards in 4 games last year - the most by any receiver in the league.",
]

for tweet in examples:
    classify(tweet)



```text
Dameon Pierce is clearly the Texans starter and he once again looks good
{'label': 'positive'}
Dameon Pierce is clearly the Texans starter and he once again looks good
```

# Training and Evaluating Models

In [None]:
/anaconda/envs/azureml_py310_sdkv2/bin/python run_fever.py --task_name fever --do_train --train_task_name fever+crossaug  --eval_task_names fever  --data_dir ./fever_data/ --do_lower_case --model_type bert --model_name_or_path bert-base-uncased --max_seq_length 128 --per_gpu_train_batch_size 32  --num_train_epochs 1.0 --save_steps 1000 --output_dir ./out --output_preds --seed 177697310


In [None]:
/anaconda/envs/azureml_py310_sdkv2/bin/python run_fever.py --task_name fever --do_train --train_task_name fever+crossaug --do_eval --eval_task_names fever symmetric adversarial fm2 --data_dir ./fever_data/ --do_lower_case --model_type bert --model_name_or_path bert-base-uncased --max_seq_length 128 --per_gpu_train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 3.0 --save_steps 100000 --output_dir ./out --output_preds --seed 177697310

Output of the negative claim generation example (the three example claims passed through `minwhoo/bart-base-negative-claim-generation`):

['Little Miss Sunshine was filmed less than 3 days.', 'Magic Johnson played for the Lakers.', 'Claire Danes is married to an actor from France.']


In [None]:
python  run_fever.py \
    --task_name fever \
    --do_train \
    --train_task_name fever+crossaug \
    --do_eval \
    --eval_task_names fever symmetric adversarial fm2 \
    --data_dir ./fever_data/ \
    --do_lower_case \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --max_seq_length 128 \
    --per_gpu_train_batch_size 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --save_steps 100000 \
    --output_dir ./out \
    --output_preds \
    --seed 177697310
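
When the same run has to be repeated with a different training set or seed, it can be less error-prone to assemble the argument list in one place. The following is only a sketch: `build_run_fever_command` is a hypothetical helper, and every flag and value is copied from the command above rather than taken from the script's documentation.

```python
import shlex
import sys


def build_run_fever_command(train_task, eval_tasks, output_dir, seed,
                            epochs=3.0, python=sys.executable):
    """Assemble the run_fever.py argument list used in the command above."""
    return [
        python, "run_fever.py",
        "--task_name", "fever",
        "--do_train",
        "--train_task_name", train_task,
        "--do_eval",
        "--eval_task_names", *eval_tasks,
        "--data_dir", "./fever_data/",
        "--do_lower_case",
        "--model_type", "bert",
        "--model_name_or_path", "bert-base-uncased",
        "--max_seq_length", "128",
        "--per_gpu_train_batch_size", "32",
        "--learning_rate", "2e-5",
        "--num_train_epochs", str(epochs),
        "--save_steps", "100000",
        "--output_dir", output_dir,
        "--output_preds",
        "--seed", str(seed),
    ]


if __name__ == "__main__":
    cmd = build_run_fever_command(
        train_task="fever+crossaug",
        eval_tasks=["fever", "symmetric", "adversarial", "fm2"],
        output_dir="./out",
        seed=177697310,
    )
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Keeping the arguments as a list avoids shell-quoting problems with names such as `fever+crossaug`.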
 

Log output from a baseline run (the logged parameters show `train_task_name='fever'` and `output_dir='./baseline_trained_models_seed=177697310/'`, not the `fever+crossaug` command above):


04/22/2023 03:06:22 - WARNING - __main__ -   Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, 16-bits training: False
04/22/2023 03:06:23 - INFO - pytorch_transformers.modeling_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at /root/.cache/torch/pytorch_transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
04/22/2023 03:06:23 - INFO - pytorch_transformers.modeling_utils -   Model config {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": "fever",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 3,
  "output_attentions": false,
  "output_hidden_states": false,
  "pad_token_id": 0,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

04/22/2023 03:06:24 - INFO - pytorch_transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.cache/torch/pytorch_transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
04/22/2023 03:06:25 - INFO - pytorch_transformers.modeling_utils -   loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin from cache at /root/.cache/torch/pytorch_transformers/aa1ef1aede4482d0dbcd4d52baad8ae300e60902e88fcb0bebdec09afd232066.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157
04/22/2023 03:06:30 - INFO - pytorch_transformers.modeling_utils -   Weights of BertForSequenceClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
04/22/2023 03:06:30 - INFO - pytorch_transformers.modeling_utils -   Weights from pretrained model not used in BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
04/22/2023 03:06:33 - INFO - __main__ -   Training/evaluation parameters Namespace(data_dir='./fever_data/', train_task_name='fever', eval_task_names=['fever', 'symmetric', 'adversarial', 'fm2'], model_type='bert', model_name_or_path='bert-base-uncased', task_name='fever', output_dir='./baseline_trained_models_seed=177697310/', output_preds=True, config_name='', tokenizer_name='', cache_dir='', max_seq_length=128, do_train=True, do_eval=True, evaluate_during_training=False, do_lower_case=True, weighted_loss=False, per_gpu_train_batch_size=32, per_gpu_eval_batch_size=8, gradient_accumulation_steps=1, learning_rate=2e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_steps=50, save_steps=100000, eval_all_checkpoints=False, no_cuda=False, overwrite_output_dir=False, overwrite_cache=False, seed=177697310, fp16=False, fp16_opt_level='O1', local_rank=-1, server_ip='', server_port='', n_gpu=1, device=device(type='cuda'), output_mode='classification')
04/22/2023 03:06:33 - INFO - __main__ -   Creating features from dataset file at ./fever_data/
04/22/2023 03:06:35 - INFO - utils_fever -   Writing example 0 of 242911
04/22/2023 03:06:35 - INFO - utils_fever -   *** Example ***
04/22/2023 03:06:35 - INFO - utils_fever -   guid: 150448
04/22/2023 03:06:35 - INFO - utils_fever -   tokens: [CLS] roman at ##wood is a content creator . [SEP] he is best known for his v ##log ##s , where he posts updates about his life on a daily basis . [SEP]
04/22/2023 03:06:35 - INFO - utils_fever -   input_ids: 101 3142 2012 3702 2003 1037 4180 8543 1012 102 2002 2003 2190 2124 2005 2010 1058 21197 2015 1010 2073 2002 8466 14409 2055 2010 2166 2006 1037 3679 3978 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
04/22/2023 03:06:35 - INFO - utils_fever -   input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
04/22/2023 03:06:35 - INFO - utils_fever -   segment_ids: 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
04/22/2023 03:06:35 - INFO - utils_fever -   label: SUPPORTS (id = 0)
04/22/2023 03:06:35 - INFO - utils_fever -   *** Example ***
04/22/2023 03:06:35 - INFO - utils_fever -   guid: 150448
04/22/2023 03:06:35 - INFO - utils_fever -   tokens: [CLS] roman at ##wood is a content creator . [SEP] he also has another youtube channel called ` ` roman ##at ##wood ' ' , where he posts prank ##s . [SEP]
04/22/2023 03:06:35 - INFO - utils_fever -   input_ids: 101 3142 2012 3702 2003 1037 4180 8543 1012 102 2002 2036 2038 2178 7858 3149 2170 1036 1036 3142 4017 3702 1005 1005 1010 2073 2002 8466 26418 2015 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
04/22/2023 03:06:35 - INFO - utils_fever -   input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
04/22/2023 03:06:35 - INFO - utils_fever -   segment_ids: 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
04/22/2023 03:06:35 - INFO - utils_fever -   label: SUPPORTS (id = 0)
04/22/2023 03:06:35 - INFO - utils_fever -   *** Example ***
04/22/2023 03:06:35 - INFO - utils_fever -   guid: 214861
04/22/2023 03:06:35 - INFO - utils_fever -   tokens: [CLS] history of art includes architecture , dance , sculpture , music , painting , poetry literature , theatre , narrative , film , photography and graphic arts . [SEP] the subsequent expansion of the list of principal arts in the 20th century reached to nine : architecture , dance , sculpture , music , painting , poetry - l ##rb - described broadly as a form of literature with aesthetic purpose or function , which also includes the distinct genres of theatre and narrative - rr ##b - , film , photography and graphic arts . [SEP]
04/22/2023 03:06:35 - INFO - utils_fever -   input_ids: 101 2381 1997 2396 2950 4294 1010 3153 1010 6743 1010 2189 1010 4169 1010 4623 3906 1010 3004 1010 7984 1010 2143 1010 5855 1998 8425 2840 1012 102 1996 4745 4935 1997 1996 2862 1997 4054 2840 1999 1996 3983 2301 2584 2000 3157 1024 4294 1010 3153 1010 6743 1010 2189 1010 4169 1010 4623 1011 1048 15185 1011 2649 13644 2004 1037 2433 1997 3906 2007 12465 3800 2030 3853 1010 2029 2036 2950 1996 5664 11541 1997 3004 1998 7984 1011 25269 2497 1011 1010 2143 1010 5855 1998 8425 2840 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
04/22/2023 03:06:35 - INFO - utils_fever -   input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
04/22/2023 03:06:35 - INFO - utils_fever -   segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
04/22/2023 03:06:35 - INFO - utils_fever -   label: SUPPORTS (id = 0)
04/22/2023 03:06:35 - INFO - utils_fever -   *** Example ***
04/22/2023 03:06:35 - INFO - utils_fever -   guid: 156709
04/22/2023 03:06:35 - INFO - utils_fever -   tokens: [CLS] ad ##rien ##ne bail ##on is an accountant . [SEP] ad ##rien ##ne eliza houghton - l ##rb - nee bail ##on ; born october 24 , 1983 - rr ##b - is an american singer - songwriter , recording artist , actress , dancer and television personality . [SEP]
04/22/2023 03:06:35 - INFO - utils_fever -   input_ids: 101 4748 23144 2638 15358 2239 2003 2019 17907 1012 102 4748 23144 2638 13234 21234 1011 1048 15185 1011 7663 15358 2239 1025 2141 2255 2484 1010 3172 1011 25269 2497 1011 2003 2019 2137 3220 1011 6009 1010 3405 3063 1010 3883 1010 8033 1998 2547 6180 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
04/22/2023 03:06:35 - INFO - utils_fever -   input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
04/22/2023 03:06:35 - INFO - utils_fever -   segment_ids: 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
04/22/2023 03:06:35 - INFO - utils_fever -   label: REFUTES (id = 1)
04/22/2023 03:06:35 - INFO - utils_fever -   *** Example ***
04/22/2023 03:06:35 - INFO - utils_fever -   guid: 33078
04/22/2023 03:06:35 - INFO - utils_fever -   tokens: [CLS] the boston celtics play their home games at td garden . [SEP] the celtics play their home games at the td garden , which they share with the national hockey league - l ##rb - nhl - rr ##b - ' s boston bruins . [SEP]
04/22/2023 03:06:35 - INFO - utils_fever -   input_ids: 101 1996 3731 23279 2377 2037 2188 2399 2012 14595 3871 1012 102 1996 23279 2377 2037 2188 2399 2012 1996 14595 3871 1010 2029 2027 3745 2007 1996 2120 3873 2223 1011 1048 15185 1011 7097 1011 25269 2497 1011 1005 1055 3731 18159 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
04/22/2023 03:06:35 - INFO - utils_fever -   input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
04/22/2023 03:06:35 - INFO - utils_fever -   segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
04/22/2023 03:06:35 - INFO - utils_fever -   label: SUPPORTS (id = 0)
04/22/2023 03:06:43 - INFO - utils_fever -   Writing example 10000 of 242911
04/22/2023 03:06:50 - INFO - utils_fever -   Writing example 20000 of 242911
04/22/2023 03:06:56 - INFO - utils_fever -   Writing example 30000 of 242911
04/22/2023 03:07:05 - INFO - utils_fever -   Writing example 40000 of 242911
04/22/2023 03:07:11 - INFO - utils_fever -   Writing example 50000 of 242911
04/22/2023 03:07:19 - INFO - utils_fever -   Writing example 60000 of 242911
04/22/2023 03:07:27 - INFO - utils_fever -   Writing example 70000 of 242911
04/22/2023 03:07:32 - INFO - utils_fever -   Writing example 80000 of 242911
04/22/2023 03:07:41 - INFO - utils_fever -   Writing example 90000 of 242911
04/22/2023 03:07:47 - INFO - utils_fever -   Writing example 100000 of 242911
04/22/2023 03:07:56 - INFO - utils_fever -   Writing example 110000 of 242911
04/22/2023 03:08:02 - INFO - utils_fever -   Writing example 120000 of 242911
04/22/2023 03:08:08 - INFO - utils_fever -   Writing example 130000 of 242911
04/22/2023 03:08:17 - INFO - utils_fever -   Writing example 140000 of 242911
04/22/2023 03:08:22 - INFO - utils_fever -   Writing example 150000 of 242911
04/22/2023 03:08:32 - INFO - utils_fever -   Writing example 160000 of 242911
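
The `input_ids`, `input_mask`, and `segment_ids` lines in the log follow the usual BERT sentence-pair layout: `[CLS] claim [SEP] evidence [SEP]`, padded with zeros to `max_seq_length`. The sketch below reproduces the mask and segment layout for the first logged example. It is an illustration inferred from the log, not the code in `utils_fever.py`; `build_features` is a hypothetical helper and it does not truncate over-long inputs.

```python
def build_features(claim_tokens, evidence_tokens, max_seq_length=128):
    """Build padded tokens, input_mask and segment_ids for a claim/evidence pair."""
    tokens = ["[CLS]"] + claim_tokens + ["[SEP]"] + evidence_tokens + ["[SEP]"]
    if len(tokens) > max_seq_length:
        raise ValueError("input is longer than max_seq_length; truncation is not handled here")
    # Segment 0 covers [CLS] + claim + [SEP]; segment 1 covers evidence + [SEP].
    segment_ids = [0] * (len(claim_tokens) + 2) + [1] * (len(evidence_tokens) + 1)
    # The mask is 1 for real tokens and 0 for padding.
    input_mask = [1] * len(tokens)
    padding = max_seq_length - len(tokens)
    tokens = tokens + ["[PAD]"] * padding
    input_mask = input_mask + [0] * padding
    segment_ids = segment_ids + [0] * padding
    return tokens, input_mask, segment_ids


# WordPiece tokens copied from the first example in the log above.
claim = "roman at ##wood is a content creator .".split()
evidence = ("he is best known for his v ##log ##s , where he posts "
            "updates about his life on a daily basis .").split()

tokens, input_mask, segment_ids = build_features(claim, evidence)
print(sum(input_mask), "real tokens,", len(input_mask) - sum(input_mask), "padding positions")
```

Mapping the tokens to `input_ids` additionally needs the BERT vocabulary, which is why the sketch stops at the mask and segment ids.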


---
# Results

Training and evaluation with the above commands should result in the following accuracies (in %); bold marks the best score in each column.

|          | FEVER dev | Symmetric | Adversarial | FM2 dev   |
|----------|-----------|-----------|-------------|-----------|
| No aug   | **86.43** | 59.14     | 50.00       | 41.15     |
| PoE      | 86.14     | 63.88     | 51.31       | **47.39** |
| CrossAug | 85.05     | **68.20** | **52.48**   | 45.17     |
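
To read the table as gains over the unaugmented baseline, the differences can be computed directly. The numbers below are copied from the table; the `delta_vs_baseline` helper is a hypothetical convenience.

```python
# Accuracies copied from the results table above.
RESULTS = {
    "No aug":   {"FEVER dev": 86.43, "Symmetric": 59.14, "Adversarial": 50.00, "FM2 dev": 41.15},
    "PoE":      {"FEVER dev": 86.14, "Symmetric": 63.88, "Adversarial": 51.31, "FM2 dev": 47.39},
    "CrossAug": {"FEVER dev": 85.05, "Symmetric": 68.20, "Adversarial": 52.48, "FM2 dev": 45.17},
}


def delta_vs_baseline(results, method, baseline="No aug"):
    """Accuracy difference (in points) between `method` and `baseline` per dataset."""
    return {
        dataset: round(results[method][dataset] - results[baseline][dataset], 2)
        for dataset in results[baseline]
    }


for method in ("PoE", "CrossAug"):
    print(method, delta_vs_baseline(RESULTS, method))
```

For CrossAug this gives -1.38 points on FEVER dev against +9.06 on Symmetric, +2.48 on Adversarial, and +4.02 on FM2 dev.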


### Azure Machine Learning Setup (Optional)

Because of the size of the data, I used Azure Machine Learning to run the experiments in time for the deadline.

### Setup & Requirements

The following are required to run the experiments:

- [ ] Azure Subscription
- [ ] Azure Machine Learning Workspace
- [ ] Azure Machine Learning Compute Cluster or Instance
- [ ] Azure Machine Learning Experiment


### Azure Machine Learning Workspace

Note: You can create a workspace using the Azure Portal, Azure CLI, or Azure Machine Learning SDK. For more information, see "Create an Azure Machine Learning workspace" in the Azure documentation.




### Azure Machine Learning Compute Cluster or Instance

![image.png](https://raw.githubusercontent.com/MurtadhaM/ML-5156/main/AZURE.png)


### Visual Studio Code (VSCode) Integration

The Azure Machine Learning extension is recommended for VSCode. In the Azure ML portal, click "Open in VSCode" to open the project in VSCode; this automatically installs the extension.


![VSCODE](https://raw.githubusercontent.com/MurtadhaM/ML-5156/main/VSCODE.png)