## Training the generation model

In order to teach the generation model to incorporate strategies, we need to show examples of how each strategy can be integrated into various utterance contexts. 

To this end, we sample a set of utterances as _groundtruth_ data, then separate the _strategy_ used in each utterance from the remaining utterance _context_ to create (_strategy_, _context_, _groundtruth_) tuples as training data points.

In [2]:
import pandas as pd
import os

from convokit import Corpus, Utterance, Speaker
from convokit import PolitenessStrategies

### 1. Training data

Training data for the generation model is sampled from [WikiConv](http://www.cs.cornell.edu/~cristian/index_files/wikiconv-conversation-corpus.pdf). For
each politeness strategy, we sample 1,500 disjoint instances. 

The training data is saved as a [ConvoKit](https://convokit.cornell.edu/) corpus, which includes the following utterance-level metadata:
        
* `strategy`: the strategy split the utterance was sampled for
* `parsed`: dependency parse information

In [3]:
train_corpus = Corpus(filename=("data/train/training-corpus/"))

### 2. Separating strategy markers and utterance context

We first identify the strategies (and their corresponding markers) used in the utterances. For data read directly from the TSV file, the same step can be performed by calling `transform_utterance` on the raw text.

In [4]:
ps = PolitenessStrategies(strategy_attribute_name="strategies",
                          marker_attribute_name="markers",
                          strategy_collection="politeness_local")

In [5]:
# markers=True is required so that marker annotations are stored alongside the strategies
train_corpus = ps.transform(train_corpus, markers=True)

We can verify that each utterance does contain markers for the strategy it represents:

In [6]:
for utt in train_corpus.iter_utterances():
    
    strategy_split = utt.meta['strategy']
    assert utt.meta['strategies'][strategy_split] == 1

Next, for each (_strategy_, _utterance_) pair, i.e., (_utt.meta['strategy']_, _utt.text_) for each utt in our corpus, we obtain a version of the utterance with the specified strategy removed.

In [7]:
# helper functions further detailed in Marker_Edits.ipynb 
from strategy_manipulation import remove_strategies_from_utt

In [8]:
for utt in train_corpus.iter_utterances():
    
    remove_strategies_from_utt(utt, [utt.meta['strategy']])

Each utterance in the corpus now has the strategy-removed content saved under _post_del_content_:

In [9]:
utt = train_corpus.get_utterance('100087711.41.31')
print("BEFORE:", utt.text)
print("AFTER:", utt.meta['post_del_content'])

BEFORE: Is this page really still a stub?  Seems like enough information to remove the stub marker.
AFTER: is this page still a stub ?   seems like enough information to remove the stub marker .
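
To make the deletion step concrete, here is a minimal sketch that reproduces the first sentence of the example above. It is not the implementation of `remove_strategies_from_utt` (that helper is detailed in Marker_Edits.ipynb and works from the markers stored by `PolitenessStrategies`). The function `naive_remove_markers` and its regex tokenizer are hypothetical, and the marker set `{"really"}` is simply read off the BEFORE/AFTER pair.

```python
import re


def naive_remove_markers(text, markers):
    """Lowercase and tokenize `text`, then drop every token found in `markers`.

    Hypothetical helper for illustration only: matching is by surface form,
    whereas the notebook's helper relies on the detected marker annotations.
    """
    # Split into word tokens and single punctuation tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    kept = [tok for tok in tokens if tok not in markers]
    return " ".join(kept)


before = "Is this page really still a stub?"
after = naive_remove_markers(before, {"really"})
print("BEFORE:", before)
print("AFTER:", after)  # AFTER: is this page still a stub ?
```

Matching on surface forms alone would also delete occurrences of a word that do not act as a strategy marker, which is one reason a marker-based approach is preferable in practice.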


### 3. Prepare generation data

In [10]:
import random
from strategy_manipulation import convert_to_training_format

The format for training data is as follows (note that special tokens are introduced as separators):

    <STR> strategy_name <CONTEXT> content <START> groundtruth <END>
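
As an illustration of this layout, the sketch below builds one such line from a (_strategy_, _context_, _groundtruth_) tuple. The function `format_example` is a hypothetical stand-in, not the actual `convert_to_training_format` from `strategy_manipulation`, whose handling of whitespace and casing may differ. The strategy name `some_strategy` is a placeholder.

```python
def format_example(strategy, context, groundtruth):
    """Join the three fields with the special separator tokens.

    Hypothetical stand-in for convert_to_training_format, shown only to make
    the layout of a training line explicit.
    """
    return "<STR> {} <CONTEXT> {} <START> {} <END>".format(
        strategy, context, groundtruth
    )


line = format_example(
    "some_strategy",                       # placeholder strategy name
    "is this page still a stub ?",         # context: strategy-removed content
    "is this page really still a stub ?",  # groundtruth: original utterance
)
print(line)
```

At generation time, a model trained on such lines would presumably be prompted with everything up to `<START>` and asked to produce the text that follows.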

In [11]:
for utt in train_corpus.iter_utterances():
    
    strategy = utt.meta['strategy']
    post_del_content = utt.meta['post_del_content']
    text = utt.text.lower()
    
    utt.meta['training_format'] = convert_to_training_format(strategy,
                                                             post_del_content, text)

In [12]:
random.seed(123)

train_data = [utt.meta['training_format'] for utt in train_corpus.iter_utterances()
              if utt.meta['split'] == "train"]

eval_data = [utt.meta['training_format'] for utt in train_corpus.iter_utterances()
             if utt.meta['split'] == "eval"]

random.shuffle(train_data)
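
The split-and-shuffle logic can be checked in isolation on toy records. The sketch below assumes plain dictionaries in place of ConvoKit utterance metadata, and the helper `split_records` is hypothetical. It uses a local `random.Random(seed)` instance rather than the global `random.seed`, which gives the same reproducibility without touching global state.

```python
import random


def split_records(records, seed=123):
    """Split records by their 'split' field and shuffle only the training part.

    `records` is a list of dicts with 'split' and 'training_format' keys,
    standing in for utterance metadata (an assumption for this sketch).
    """
    train = [r["training_format"] for r in records if r["split"] == "train"]
    evaluation = [r["training_format"] for r in records if r["split"] == "eval"]
    # A dedicated generator keeps the shuffle reproducible and self-contained.
    random.Random(seed).shuffle(train)
    return train, evaluation


records = [
    {"split": "train", "training_format": "t1"},
    {"split": "eval", "training_format": "e1"},
    {"split": "train", "training_format": "t2"},
    {"split": "train", "training_format": "t3"},
    {"split": "eval", "training_format": "e2"},
]

train_data, eval_data = split_records(records)
print(len(train_data), "training lines,", len(eval_data), "evaluation lines")
```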

In [13]:
# set out_dir to wherever the formatted outputs should be written;
# the resulting files are also available in data/train/training-files
out_dir = "generation_data/data/"

# create the output directory if it does not exist yet
os.makedirs(out_dir, exist_ok=True)

formatted_data = {"train": train_data, "eval": eval_data}

for filetype in ["train", "eval"]:
    
    with open(os.path.join(out_dir, "{}.txt".format(filetype)), "w") as f:
        for data in formatted_data[filetype]:
            f.write("{}\n".format(data))
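
Before training, it can be worth confirming that every written line still follows the expected layout, since an utterance containing a newline or a literal special token would break the one-example-per-line assumption. The sketch below is an optional check and not part of the original pipeline. The helpers `is_well_formed` and `count_malformed` are hypothetical, and the demo writes to a temporary file rather than to `out_dir`.

```python
import os
import tempfile

SPECIAL_TOKENS = ["<STR>", "<CONTEXT>", "<START>", "<END>"]


def is_well_formed(line):
    """Return True if each special token appears exactly once, in order,
    with <STR> at the start of the line and <END> at the end."""
    if any(line.count(tok) != 1 for tok in SPECIAL_TOKENS):
        return False
    positions = [line.find(tok) for tok in SPECIAL_TOKENS]
    in_order = positions == sorted(positions)
    return in_order and line.startswith("<STR>") and line.rstrip().endswith("<END>")


def count_malformed(path):
    """Count the lines in `path` that do not match the training layout."""
    with open(path) as f:
        return sum(not is_well_formed(line.rstrip("\n")) for line in f)


# Demo on a temporary file: one valid line and one missing <START>.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "train.txt")
    with open(demo_path, "w") as f:
        f.write("<STR> s <CONTEXT> c <START> g <END>\n")
        f.write("<STR> s <CONTEXT> c g <END>\n")
    n_bad = count_malformed(demo_path)

print("malformed lines:", n_bad)  # malformed lines: 1
```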

### 4. Training 

The training script ([openai_gpt_drg_strategized.py](openai_gpt_drg_strategized.py)) is adapted from the training script [openai_gpt_delete_retrive_and_generate.py](https://github.com/agaralabs/transformer-drg-style-transfer/blob/master/openai_gpt_delete_retrive_and_generate.py) from [transformer-drg-style-transfer](https://github.com/agaralabs/transformer-drg-style-transfer), with the key change being an update to the special tokens used. 

We use the following settings to train the generation model:

In [None]:
!python openai_gpt_drg_strategized.py --do_train --do_eval \
      --train_dataset "data/train/training-files/train.txt" \
      --eval_dataset "data/train/training-files/eval.txt" \
      --train_batch_size 8 \
      --eval_batch_size 8 \
      --max_seq_length 128 \
      --output_dir "politeness_paraphrase/models" \
      --num_train_epochs 3