<a href="https://colab.research.google.com/github/JDBumgardner/Trumpifyve/Trumpifyve_Training_Data_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Copyright 2020 Jacob Bumgardner and Jeremy Salwen

In [0]:
# Copyright 2020 Jacob Bumgardner and Jeremy Salwen
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

#Description

Github: https://github.com/JDBumgardner/Trumpifyve

In this notebook we normalize sample phrases from Donald Trump's 2016 presidential campaign by passing them through a pair of pretrained neural machine translation (NMT) models. In our case we translate from English to German and back using two models from the fairseq repository. These could be replaced with any sequence of translation models that starts and ends in English. The intent is to preserve the meaning of the text samples while removing the characteristic style, in order to generate training data for the T5 model. The translation models will at times strip leading and trailing sentences from the base phrases. This is unintended behavior, but it has the consequence that the model will add characteristic phrases of the style target.
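
The round trip described above amounts to composing translators so that the chain starts and ends in English. Here is a minimal sketch, assuming each translator is a callable that maps a list of strings to a list of strings. `round_trip`, `make_toy_translator`, and the toy vocabularies are hypothetical stand-ins for illustration, not the fairseq models loaded later in this notebook.

```python
from typing import Callable, Dict, List, Sequence

# Assumption: a translator takes a batch of strings and returns a batch of strings.
Translator = Callable[[List[str]], List[str]]


def round_trip(texts: List[str], translators: Sequence[Translator]) -> List[str]:
    """Pass texts through each translator in order and return the final output."""
    for translate in translators:
        texts = translate(texts)
    return texts


def make_toy_translator(vocab: Dict[str, str]) -> Translator:
    """Build a word-by-word translator from a lookup table (illustration only)."""
    def translate(texts: List[str]) -> List[str]:
        return [" ".join(vocab.get(w, w) for w in t.split()) for t in texts]
    return translate


# Toy English -> German -> English chain: two stylistic variants collapse to one word.
toy_en2de = make_toy_translator({"huge": "gross", "tremendous": "gross"})
toy_de2en = make_toy_translator({"gross": "big"})

pairs = round_trip(["a huge win", "a tremendous win"], [toy_en2de, toy_de2en])
print(pairs)
```

In the toy chain both inputs come back as the same plain sentence, which is the kind of style flattening the real models are used for here.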

The output files will be written to your Google Drive at /trump_pairs and will be read by the Trumpifyve Train colab.

Next Colab: [Trumpifyve Train](https://colab.research.google.com/github/JDBumgardner/Trumpifyve/Trumpifyve_Train.ipynb)

#Project start

We install and import the dependencies, then mount the Google Drive where the training pairs will be saved.

In [0]:
!pip install sacremoses subword_nmt fastbpe
import torch, numpy
import matplotlib.pyplot as plt
import json
import os.path
from google.colab import drive

In [0]:
drive.mount('/content/drive')

In [0]:
!git clone https://github.com/unendin/Trump_Campaign_Corpus.git "/content/drive/My Drive/campaign_corpus"

#Load the NMT models

Here we load the translation models, one from English to German (transformer.wmt16.en-de) and one from German back to English (transformer.wmt19.de-en.single_model). 

In [0]:
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt16.en-de', tokenizer='moses', bpe='subword_nmt')

In [0]:
de2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.de-en.single_model', tokenizer='moses', bpe='fastbpe')

In [0]:
def normalize(texts):
  """Round-trip English text through German and back to English."""
  return de2en.translate(en2de.translate(texts, beam=1), beam=1)

In [0]:
en2de.eval()
de2en.eval()
en2de.cuda()
de2en.cuda()

#Load the corpus

Load the campaign corpus, keep only the documents flagged `is_as_spoken`, and split Donald Trump's turns into individual samples.

In [0]:
with open('/content/drive/My Drive/campaign_corpus/trump_campaign_corpus.json') as f:
  campaign_corpus=json.load(f)

In [0]:
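trump_lines = []

for item in campaign_corpus:
    # Keep only documents flagged as spoken.
    if item['is_as_spoken'] is True:
        turns = item["doc"]
        # "doc" may hold a single turn rather than a list of turns.
        if not isinstance(turns, list):
            turns = [turns]
        for turn in turns:
            if turn["person"] == 'Donald Trump':
                samples = turn["p"]
                # "p" may likewise hold a single sample rather than a list.
                if not isinstance(samples, list):
                    samples = [samples]
                for sample in samples:
                    # Skip entries that are not plain strings.
                    if not isinstance(sample, str):
                        continue
                    trump_lines.append(sample)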
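
The loop above treats `doc` and `p` as holding either a single value or a list, and skips non-string entries in `p`. The sketch below runs the same filtering on a small made-up corpus. The field values are invented for illustration, and `as_list` and `extract_lines` are hypothetical helper names that are not part of the notebook.

```python
def as_list(value):
    """Wrap a single value in a list; return lists unchanged."""
    return value if isinstance(value, list) else [value]


def extract_lines(corpus, speaker="Donald Trump"):
    """Collect string samples spoken by `speaker` in documents flagged as spoken."""
    lines = []
    for item in corpus:
        if item["is_as_spoken"] is not True:
            continue
        for turn in as_list(item["doc"]):
            if turn["person"] != speaker:
                continue
            lines.extend(s for s in as_list(turn["p"]) if isinstance(s, str))
    return lines


# Invented records covering the shapes the loop has to handle.
toy_corpus = [
    {"is_as_spoken": True, "doc": [
        {"person": "Donald Trump",
         "p": ["First line.", {"note": "applause"}, "Second line."]},
        {"person": "Moderator", "p": "A question."},
    ]},
    {"is_as_spoken": False,
     "doc": {"person": "Donald Trump", "p": "Prepared remarks."}},
    {"is_as_spoken": True,
     "doc": {"person": "Donald Trump", "p": "Single paragraph."}},
]

toy_lines = extract_lines(toy_corpus)
print(toy_lines)
```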

#Write to file

Write out the trump pairs to .json files, 128 pairs per file. Batches whose output file already exists are skipped, so an interrupted run can be resumed.

In [0]:
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
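
As a quick check of the helper, here it is applied to a five-item list with a chunk size of 2. The final chunk is shorter when the list length is not a multiple of `n`. The function is repeated so the snippet runs on its own.

```python
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


batches = list(chunks(["a", "b", "c", "d", "e"], 2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```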

In [0]:
!mkdir -p '/content/drive/My Drive/trump_pairs'

In [0]:
batch_size = 128
trump_pairs = []
for i, chunk in enumerate(chunks(trump_lines, batch_size)):
  path = '/content/drive/My Drive/trump_pairs/{}.json'.format(i)
  # Skip batches already written by a previous run.
  if os.path.exists(path):
    continue
  # Pair each original line with its round-trip translation.
  current_batch_pair = list(zip(chunk, normalize(chunk)))
  trump_pairs += current_batch_pair
  with open(path, "w") as f:
    json.dump(current_batch_pair, f)
  print("\r", i, end="")
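
Because batches that already exist on disk are skipped, `trump_pairs` only holds the pairs produced in the current run. The files themselves are the complete record. Below is a minimal sketch of reading them back, assuming every file in the directory is named `<number>.json` and contains a list of two-element lists, as written above. `load_pairs` is a hypothetical helper, and the demo uses a temporary directory with invented contents rather than Google Drive.

```python
import glob
import json
import os
import tempfile


def load_pairs(directory):
    """Read all <number>.json batch files in numeric order into one list of pairs."""
    paths = glob.glob(os.path.join(directory, "*.json"))
    # Sort by batch index so that 10.json comes after 2.json.
    paths.sort(key=lambda p: int(os.path.splitext(os.path.basename(p))[0]))
    pairs = []
    for path in paths:
        with open(path) as f:
            # JSON stores each (original, normalized) tuple as a two-element list.
            pairs.extend((original, normalized) for original, normalized in json.load(f))
    return pairs


# Demo with invented data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name, batch in [("10", [["c", "C"]]), ("0", [["a", "A"]]), ("2", [["b", "B"]])]:
        with open(os.path.join(tmp, name + ".json"), "w") as f:
            json.dump(batch, f)
    loaded = load_pairs(tmp)

print(loaded)
```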