<h1><center>Cognitive Computing Final Project</center></h1>
<h2><center>Automated Word-based Product Review/Testimonial Generation using Google BERT and OpenAI GPT-2</center></h2>
<h4><center>Jinming Li, Tianxin Huang, Qin Shan, Guanhua Zhang</center></h4>
<h4><center>Master of Science in Business Analytics, The University of Texas at Austin</center></h4>

---

<h3>Project Overview:</h3>

Automatically generated fake reviews are a serious threat to e-commerce businesses. As machine learning algorithms become more powerful, it is increasingly hard for users to detect machine-generated fake reviews. To detect fake reviews, it is important to understand how they are generated. In this project, we explore the mechanisms behind existing generation methods.


**Problem Statement:**

In this project, we explore the state-of-the-art Google BERT and OpenAI GPT-2 models and build a testimonial generator that automatically produces on-topic product reviews.


**Data Description:**

We obtained our customer review data from http://cseweb.ucsd.edu/~jmcauley/datasets.html#google_local. After data cleaning and integration, it includes 185147 customer reviews on clothing fit from ModCloth and RentTheRunway. The data contains variables such as clothing metadata, clothing size, and customer reviews. Here, we extract only the customer review text to build the NLP model.

**Business Value:**

Customer reviews are very important. Genuine opinions help companies discover customer preferences, learn the strengths and weaknesses of their products, and study market conditions. Reviews also guide customers' purchasing decisions, and great reviews can bring significant financial gains to a company. As a result, some companies hire spammers to write fake positive reviews for themselves and fake negative reviews for their competitors. Because hiring real people to write fake reviews is expensive, they turn to automatic fake review generators instead. Research shows that it is getting harder for people to tell fake reviews from real ones.

Because of the flood of machine-generated fake reviews, the Internet is no longer the source of reliable information it once was. Bringing integrity to e-commerce by detecting fake reviews and surfacing true product opinions will minimize the loss caused by false information. Therefore, the purpose of our model is not only to generate fake product reviews that look real but, more importantly, to understand how these fake reviews are generated, so that we can analyze the mechanism behind the review generation process of some customer-oriented companies. In addition, we trained our model both to generate fake reviews and to detect them, which may provide a large amount of insight for businesses.

**Tools:**

   - PyTorch
    
   - GPU (8 x NVIDIA Tesla K80)
    
   - Google Cloud Platform

**Models (pretrained):**

   - GANs: <b>G</b>enerative <b>A</b>dversarial <b>N</b>etworks
    
   - Google BERT: <b>B</b>idirectional <b>E</b>ncoder <b>R</b>epresentations from <b>T</b>ransformers
    
   - OpenAI GPT-2: <b>G</b>enerative <b>P</b>re-<b>T</b>raining version <b>2</b>

**Tokenizer (pretrained):**

   - Pretrained Google BERT tokenizer (WordPiece encoding)
    
   - Pretrained OpenAI GPT-2 tokenizer (Byte-Pair-Encoding)
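
To give a feel for what sub-word tokenization does, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers apply to a single word. The function name `wordpiece_tokenize` and the toy vocabulary are assumptions made for illustration; the real BERT tokenizer uses a vocabulary of roughly 30k entries and extra text normalization, and the GPT-2 tokenizer instead applies learned Byte-Pair merges.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, far smaller than a real pretrained vocabulary.
toy_vocab = {"fit", "##ting", "##s", "dress", "##es", "un", "##comfort", "##able"}

print(wordpiece_tokenize("fitting", toy_vocab))        # ['fit', '##ting']
print(wordpiece_tokenize("uncomfortable", toy_vocab))  # ['un', '##comfort', '##able']
print(wordpiece_tokenize("zipper", toy_vocab))         # ['[UNK]']
```

Splitting rare words into known pieces is what lets a fixed vocabulary cover review text full of brand names and sizing jargon.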

In [None]:
import os
import copy
import random
import torch
from torch import nn
from torch.nn import functional as F
from torch.nn import CrossEntropyLoss
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertAdam, BertForSequenceClassification, GPT2Tokenizer, OpenAIAdam, GPT2LMHeadModel
from datetime import datetime
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import warnings
from matplotlib import pyplot as plt
%matplotlib inline
warnings.filterwarnings('ignore')

# bert_base_uncased_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels = 2)
# bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2')

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.


---

<h3>Part 1: Generative Adversarial Networks (GANs) with Google BERT & OpenAI GPT-2</h3>

<h4>1) Model Introduction:</h4>

- **GANs** (<b>G</b>enerative <b>A</b>dversarial <b>N</b>etworks)

The main model we use is a GAN (generative adversarial network). It comprises two networks: a generator and a discriminator. A GAN is generative because the generator creates fake inputs for the discriminator. It is adversarial because the generator and the discriminator compete against each other. You can think of the generator as a spammer: it takes random noise as input and outputs a fake review. The discriminator acts like the police: it takes both real reviews from the training set and fake reviews from the generator, and learns to classify each review as fake or real.

<img src = 'GAN.png' height = 500 width = 500>
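
The competition between the two networks is usually written as the minimax objective of the standard GAN formulation, where $D(x)$ is the discriminator's estimated probability that $x$ is real and $G(z)$ is the generator's output for noise $z$. It is given here as background notation only; the code in this notebook uses a two-class cross-entropy loss and a short text prompt in place of $z$.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to make both terms large, while the generator tries to make the second term small by producing samples that the discriminator scores as real.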

- **BERT** (<b>B</b>idirectional <b>E</b>ncoder <b>R</b>epresentations from <b>T</b>ransformers: <b>BERT</b>)

BERT takes an input of up to 512 tokens. The input starts with the special token [CLS] and ends with [SEP]. BERT maps the tokens to embeddings and passes them to a Transformer encoder. For classification, a linear layer is added on top of the pooled output of the BERT base model. This layer takes the output at the first position, which corresponds to the [CLS] token, and decides whether the review is fake or real. The pretrained model we use is BERT-base-uncased, which has 12 layers, a hidden size of 768, 12 attention heads, and 110M parameters. For transfer learning, we only need to train 201 weights in this case.

You might wonder why we chose BERT-base rather than BERT-large. As the name suggests, BERT-large is a larger and more powerful pretrained model than BERT-base. We chose BERT-base for fine-tuning reasons: training BERT-large calls for a TPU, whereas we only have a GPU with 16 GB of memory. BERT-base is therefore the more feasible choice for this project.

<img src = 'BERT.png' height = 500 width = 500>
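
The input format described above can be sketched in plain Python. This is an illustrative helper, not the code used in this notebook: the name `build_bert_batch` is hypothetical, and the attention mask is an addition (the `TextGAN` class pads with zeros but does not pass a mask). The ids 101, 102 and 0 are the [CLS], [SEP] and [PAD] ids of the bert-base-uncased vocabulary.

```python
MAX_LEN = 512  # BERT's maximum sequence length, including [CLS] and [SEP]


def build_bert_batch(token_id_lists, cls_id, sep_id, pad_id=0, max_len=MAX_LEN):
    """Wrap each sequence in [CLS] ... [SEP], truncate, then pad to a common length."""
    wrapped = []
    for ids in token_id_lists:
        body = ids[: max_len - 2]  # leave room for the two special tokens
        wrapped.append([cls_id] + body + [sep_id])
    longest = max(len(seq) for seq in wrapped)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in wrapped]
    # 1 marks a real token, 0 marks padding that attention should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in wrapped]
    return input_ids, attention_mask


# Two hypothetical reviews that were already converted to WordPiece ids.
input_ids, attention_mask = build_bert_batch([[7, 8, 9], [5]], cls_id=101, sep_id=102)
print(input_ids)       # [[101, 7, 8, 9, 102], [101, 5, 102, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Truncating to 512 positions keeps long reviews within the model's limit, and the mask, if passed to the classifier as `attention_mask`, keeps padded positions from influencing the [CLS] representation.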

- **OpenAI GPT-2** (<b>G</b>enerative <b>P</b>re-<b>T</b>raining version <b>2</b>)

GPT-2 is a state-of-the-art language model designed to improve the realism and coherence of generated text. At its release it was the largest language model to date, with 1.542 billion parameters. However, when announcing the model, OpenAI made the controversial decision not to release its language model’s code and training dataset due to concerns it might be used for malicious purposes such as generating fake news. The decision challenged the AI community’s open-source philosophy and has divided researchers. Facebook Chief AI Scientist Yann LeCun, who never shies from controversy, mocked the decision in a tweet: “Every new human can potentially be used to generate fake news, disseminate conspiracy theories, and influence people. Should we stop making babies then?”

GitHub developer Hugging Face has updated its repository with a PyTorch reimplementation of the small version of the GPT-2 language model that OpenAI open-sourced, along with pretrained models and fine-tuning examples. Here, we mainly use the source code from Hugging Face's pytorch_pretrained_bert library and build our generator as a language modeling head on top of the base GPT-2 model. For transfer learning, we only need to train 148 weights in this case.

<img src = 'GPT2.png' height = 800 width = 500>

<h4>2) Model Architecture</h4>

**GAN**:

- **Discriminator: Google BERT** base uncased pretrained model (<b>B</b>idirectional <b>E</b>ncoder <b>R</b>epresentations from <b>T</b>ransformers: <b>BERT</b>)

 *PyTorch Reimplementation:* The discriminator is the Google BERT-base-uncased pretrained model, using the PyTorch reimplementation by Hugging Face
 
 *Sentence Classification:* A linear classification layer is added on top of the pooled output of the pretrained BERT-base-uncased model
 
 *Architecture:* 12 layers, hidden size 768, 12 attention heads, 110M parameters; transfer learning with 201 trainable weights

- **Generator: OpenAI GPT-2** pretrained model (<b>G</b>enerative <b>P</b>re-<b>T</b>raining: <b>GPT-2</b>)

  *PyTorch Reimplementation:* The generator is the OpenAI pretrained GPT-2 model, using the PyTorch reimplementation by Hugging Face
  
  *Architecture:* the small pretrained GPT-2 checkpoint (`gpt2`) rather than the full 1.5B-parameter model, with a language modeling head added on top of the base model; transfer learning with 148 trainable weights
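
The figures of 201 and 148 trainable weights can be reproduced with a short tally, under the assumption that "weights" means parameter tensors (each weight matrix and each bias vector counted once, as `model.parameters()` would list them). The per-layer breakdown below is our reading of the standard 12-layer architectures, not something stated elsewhere in this notebook.

```python
NUM_LAYERS = 12  # both bert-base-uncased and the small gpt2 checkpoint have 12 layers

# BERT-base with a classification head (weight and bias tensors counted separately).
bert_embeddings = 3 + 2  # word, position, token-type embeddings + LayerNorm weight/bias
bert_per_layer = (
    6    # query, key, value projections (weight + bias each)
    + 4  # attention output projection + LayerNorm
    + 2  # intermediate feed-forward projection
    + 4  # output feed-forward projection + LayerNorm
)
bert_pooler = 2      # dense layer applied to the [CLS] position
bert_classifier = 2  # final linear layer producing the real/fake logits
bert_total = bert_embeddings + NUM_LAYERS * bert_per_layer + bert_pooler + bert_classifier

# GPT-2 small with a language modeling head whose weight is tied to the token embedding.
gpt2_embeddings = 2             # token and position embeddings
gpt2_per_layer = 2 + 4 + 2 + 4  # ln_1, attention (c_attn, c_proj), ln_2, MLP (c_fc, c_proj)
gpt2_final_norm = 2
gpt2_total = gpt2_embeddings + NUM_LAYERS * gpt2_per_layer + gpt2_final_norm

print(bert_total, gpt2_total)  # 201 148
```

Read this way, the two numbers count tensors rather than individual scalar parameters, which run into the millions for both models.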


In [None]:
class TextGAN(object):
    def __init__(self, dataframe, bert_pretrained_model = 'bert-base-uncased', gpt_pretrained_model = 'gpt2', num_labels = 2):
        # Read in dataframe
        self.dataframe = dataframe
        
        # Set the default GPU device; both the discriminator and the generator are placed on cuda:0
        # (nn.DataParallel additionally replicates the discriminator across all visible GPUs)
        self.device_default = torch.device('cuda:0')
        
        # Build discriminator and tokenizer from BertForSequenceClassification
        self.bert_tokenizer = BertTokenizer.from_pretrained(bert_pretrained_model)
        self.discriminator = nn.DataParallel(BertForSequenceClassification.from_pretrained(bert_pretrained_model, num_labels = num_labels)).to(self.device_default)
        self.bert_optimizer = BertAdam(self.discriminator.parameters(), lr = 0.00005, warmup = 0.1, t_total = 1000)
        
        # Build the generator, tokenizer, optimizer from OpenAIGPT2
        self.gpt2_tokenizer = GPT2Tokenizer.from_pretrained(gpt_pretrained_model)
        self.generator = GPT2LMHeadModel.from_pretrained(gpt_pretrained_model).to(self.device_default)
        self.gpt2_optimizer = OpenAIAdam(self.generator.parameters(), lr = 0.0001, warmup = 0.1, t_total = 1000)
        
        # Free all GPU memory
        torch.cuda.empty_cache()

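    def textGeneration(self, generator_input):
        # Copy the prompt ids so the caller's list is not modified in place
        text_id = list(generator_input)
        input_ids, past = torch.tensor([text_id]).to(self.device_default), None
        # Sample between 30 and 100 new tokens, one at a time
        for _ in range(random.randint(30, 100)):
            # 'past' caches the attention keys/values so only the newest token is fed in
            logits, past = self.generator(input_ids, past = past)
            # Draw the next token from the softmax distribution over the vocabulary
            input_ids = torch.multinomial(F.softmax(logits[:, -1], dim = -1), 1)
            text_id.append(input_ids.item())
        return self.gpt2_tokenizer.decode(text_id)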
    
    def dataGenerator(self, batch_size = 16):
        # Randomly fetch a batch of training reviews
        sample_text_ss = self.dataframe['Review'].iloc[random.sample(range(len(self.dataframe)), batch_size)]
        
        # Tokenize the batch with the GPT-2 tokenizer and keep the first 10 tokens of each review as a prompt
        sample_text_encode_top10 = sample_text_ss.map(lambda x : self.gpt2_tokenizer.encode(x)[:10])
        
        # Generate text using GPT2 generator
        sample_text_generate_ss = sample_text_encode_top10.map(self.textGeneration)
        return sample_text_generate_ss, sample_text_ss
    
    def discriminatorInput(self, text):
        input_token = ['[CLS]'] + self.bert_tokenizer.tokenize(text) + ['[SEP]']
        input_id = self.bert_tokenizer.convert_tokens_to_ids(input_token)
        return [input_id]
    
    def saveGeneratedReview(self):
        # randrange excludes the upper bound, so the index is always valid
        content = self.dataframe['Review'].values[random.randrange(len(self.dataframe))]
        content_id = self.gpt2_tokenizer.encode(content)[:10]
        return self.textGeneration(content_id), content
        
    def train(self, num_epochs = 1000, save_interval = 100):
        start = datetime.now()
        generated_review_list = []
        real_review_list = []
        d_loss_list = []
        g_loss_list = []

        for epoch in range(num_epochs):
            try:
                print('Epoch {}/{}'.format(epoch + 1, num_epochs))
                print('-' * 10)

                # Load in data
                sample_text_generate_ss, sample_text_ss = self.dataGenerator(batch_size = 16)

                # Convert the generated and real text batches to WordPiece ids as discriminator input
                discriminator_input_ss = pd.concat([sample_text_generate_ss, sample_text_ss], axis = 0, ignore_index = True).map(self.discriminatorInput)
                discriminator_input = torch.LongTensor(np.array(DataFrame(discriminator_input_ss.sum()).fillna(0).astype('int32'))).to(self.device_default)
                discriminator_input_generate = discriminator_input[:len(sample_text_generate_ss)].to(self.device_default)

                # Create labels for training discriminator and generator
                labels = torch.LongTensor([0] * len(sample_text_generate_ss) + [1] * len(sample_text_ss)).to(self.device_default)
                valid = torch.LongTensor([1] * len(sample_text_ss)).to(self.device_default)

                # Each epoch has a train_discriminator and train_generator phase
                for phase in ['train_discriminator', 'train_generator']:
                    if phase == 'train_discriminator':
                        # Set discriminator to training mode
                        self.discriminator.train()

                        # Make sure all discriminator parameters are trainable
                        for param in self.discriminator.parameters():
                            param.requires_grad = True

                        # Zero the discriminator parameter gradients
                        self.bert_optimizer.zero_grad()

                        # Forward propagation
                        d_loss = self.discriminator(input_ids = discriminator_input, labels = labels).mean()

                        # Backward propagation
                        d_loss.backward()
                        self.bert_optimizer.step()

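                    else:
                        # Set discriminator to evaluate mode
                        self.discriminator.eval()

                        # Zero the generator parameter gradients
                        self.gpt2_optimizer.zero_grad()

                        # Forward propagation: score the generated reviews against the 'real' label
                        g_loss = self.discriminator(input_ids = discriminator_input_generate, labels = valid).mean()

                        # Backward propagation
                        # NOTE: the generated reviews reach the discriminator as sampled token ids,
                        # which are not differentiable, so no gradient flows back into the generator here
                        g_loss.backward()
                        self.gpt2_optimizer.step()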

                # Print the progress
                print('Discriminator Loss:', d_loss)
                print('Generator Loss:', g_loss)
                print()
                # Store plain floats so the computation graphs can be freed
                d_loss_list.append(d_loss.item())
                g_loss_list.append(g_loss.item())

                # If at save interval, then save generated review samples
                if epoch % save_interval == 0:
                    generated_review, real_review = self.saveGeneratedReview()
                    generated_review_list.append(generated_review)
                    real_review_list.append(real_review)
            except RuntimeError:
                # Skip batches that fail at run time (such as CUDA out-of-memory errors) and move on
                pass

        # Counting time elapsed
        time_delta = datetime.now() - start
        print('Training completed time:', time_delta)

        return self.generator, self.discriminator, d_loss_list, g_loss_list, generated_review_list, real_review_list
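
The sampling loop in `textGeneration` can be illustrated without a GPU or a pretrained model. The sketch below uses only the standard library; `toy_logits` is a hypothetical stand-in for GPT-2 over a four-token vocabulary, and the helper names are assumptions made for this example.

```python
import math
import random


def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sample_next_token(logits, rng):
    """Draw one token id in proportion to its softmax probability."""
    probabilities = softmax(logits)
    return rng.choices(range(len(logits)), weights=probabilities, k=1)[0]


def generate(prompt_ids, next_logits_fn, num_new_tokens, rng):
    """Extend prompt_ids one sampled token at a time."""
    token_ids = list(prompt_ids)  # copy so the caller's prompt is left untouched
    for _ in range(num_new_tokens):
        token_ids.append(sample_next_token(next_logits_fn(token_ids), rng))
    return token_ids


def toy_logits(token_ids):
    """Hypothetical model: strongly prefers the token after the most recent one."""
    preferred = (token_ids[-1] + 1) % 4
    return [5.0 if token == preferred else 0.0 for token in range(4)]


rng = random.Random(0)
sample = generate([0, 1], toy_logits, num_new_tokens=6, rng=rng)
print(sample)
```

Because every token is drawn at random rather than taken as the most likely one, two runs from the same prompt can diverge quickly, which is consistent with the off-topic continuations seen in some generated reviews.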

---

<h3>Part 2: Training Process</h3>

- **Training pipeline: Google BERT**
  
  1. Raw input text

  2. Add [CLS] token to the front and [SEP] token to the end

  3. Tokenize the text with the pretrained BERT tokenizer

  4. Convert tokens to WordPiece ids

  5. Train the pretrained BERT sequence classifier (cross-entropy loss)

  6. Output Real (1) or Fake (0) review
  
- **Training pipeline: OpenAI GPT-2**
  
  1. Raw input text

  2. Tokenize the text with the pretrained GPT-2 tokenizer

  3. GPT-2 tokenizer encodes the text with Byte-Pair Encoding (BPE)

  4. Generate text from the first 10 encoded tokens

  5. Feed the generated text to the discriminator and compute the loss (cross-entropy)
  
  
- **Training Loss: Google BERT & OpenAI GPT-2**

    During training, the loss of Google BERT drops to nearly zero over the 1000 epochs, while the loss of OpenAI GPT-2 increases.
    
    **Reasons:** 
    1. The discriminator (BERT) learns very fast while the generator (GPT-2) learns too slowly
    
    2. The warm-up phase of both optimizers starts late; it should have started earlier
    
    3. We chose a mini-batch size of only 16, the maximum a single GPU (NVIDIA Tesla K80) can support. A batch size of 16 might be too small to loop over the entire dataset
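
The two losses in the pipelines above follow the label convention of the `train` method: generated reviews are labelled 0 and real reviews 1 for the discriminator, while the generator step scores the same generated reviews against the label 1. The probabilities below are made-up numbers chosen to mimic a confident discriminator, and the helper names are assumptions made for this sketch.

```python
import math


def cross_entropy(prob_real, label):
    """Cross-entropy for one review; label 1 = real, 0 = fake."""
    prob_of_label = prob_real if label == 1 else 1.0 - prob_real
    return -math.log(prob_of_label)


def mean_loss(probs_real, labels):
    """Average the per-review cross-entropy over a batch."""
    return sum(cross_entropy(p, y) for p, y in zip(probs_real, labels)) / len(labels)


# Hypothetical discriminator outputs: P(real) for two generated and two real reviews.
p_generated = [0.10, 0.20]
p_real = [0.90, 0.80]

# Discriminator step: generated reviews are labelled 0, real reviews 1.
d_loss = mean_loss(p_generated + p_real, [0, 0, 1, 1])

# Generator step: the same generated reviews are scored against the label 1,
# so the loss is small only when the discriminator is fooled.
g_loss = mean_loss(p_generated, [1, 1])

print(round(d_loss, 4), round(g_loss, 4))
```

With these numbers the discriminator loss is low and the generator loss is high, the same pattern as in the training log: the better the discriminator separates the two classes, the larger the generator loss becomes.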
    
    

In [None]:
if __name__ == '__main__':
    amazon_clothingfit_df = pd.read_csv('Amazon_Clothing_Fit_Data_Review.csv', index_col = 0)
    model = TextGAN(bert_pretrained_model = 'bert-base-uncased', gpt_pretrained_model = 'gpt2', num_labels = 2, dataframe = amazon_clothingfit_df)
    OpenAIGPT2_generator, BERT_discriminator, d_loss_list, g_loss_list, generated_review_list, real_review_list = model.train()

Epoch 1/1000
----------
Discriminator Loss: tensor(0.6871, device='cuda:0', grad_fn=<MeanBackward1>)
Generator Loss: tensor(0.6005, device='cuda:0', grad_fn=<MeanBackward1>)

Epoch 2/1000
----------
Discriminator Loss: tensor(0.7371, device='cuda:0', grad_fn=<MeanBackward1>)
Generator Loss: tensor(0.6042, device='cuda:0', grad_fn=<MeanBackward1>)

Epoch 3/1000
----------
Discriminator Loss: tensor(0.6707, device='cuda:0', grad_fn=<MeanBackward1>)
Generator Loss: tensor(0.7075, device='cuda:0', grad_fn=<MeanBackward1>)

Epoch 4/1000
----------
Discriminator Loss: tensor(0.7073, device='cuda:0', grad_fn=<MeanBackward1>)
Generator Loss: tensor(0.9370, device='cuda:0', grad_fn=<MeanBackward1>)

Epoch 5/1000
----------
Discriminator Loss: tensor(0.6685, device='cuda:0', grad_fn=<MeanBackward1>)
Generator Loss: tensor(1.0688, device='cuda:0', grad_fn=<MeanBackward1>)

Epoch 6/1000
----------
Discriminator Loss: tensor(0.6976, device='cuda:0', grad_fn=<MeanBackward1>)
Generator Loss: tensor(0

---

<h3>Part 3: Generated & Real Review Comparison</h3>

- **Epoch 200:**
<center><img src = 'epoch200.png' height = '500' width = '500'></center>

- **Epoch 400:**
<center><img src = 'epoch400.png' height = '500' width = '500'></center>

- **Epoch 800:**
<center><img src = 'epoch800.png' height = '500' width = '500'></center>

In [None]:
generated_review_list

['First of all, I\'m so happy RTR has picked up RICK REY. Andrew Jackson told everyone about his friend and longtime companion Hank Aaron, and now Chris Rock, and the legacy he won we hope he can continue to live his life even if his actions start leaving a scar that keeps scratching in his neck. I hope it\'s as harsh as it gets.\n\nI also think Axl Rose is "wow"/would be hot if',
 'She is 5\'10" 125 lbs. \xa0Now without reasoning... INALLY SHE DOES ME RIGHT SO IT IS HEAVILY MORE TRUE than WHERE SHE STARTED. ANTITUDE CITY COSTS \u2ef4 . Yet she DOES NOT STILL LIKE TO see the actual snapshot of her torso with her ass in the bathtub. \xa0It is until you MD DOING it to see? \xa0The commission sees a FRONTIENTAL "',
 'This dress is really beautiful but runs pretty big. I did like how it fits my skin tone. I would recommend it to anyone with sensitive skin, just try to avoid being super fitted in this dress.\n\n',
 'Very nice dress, high quality fabric. A little thicker than what all the bo

In [None]:
real_review_list

["First of all, I'm so happy RTR has plus size now. I've been wanting to rent forever, but nothing would ever fit me. I ordered this dress in 20W and 22W, thinking I'd definitely be wearing the 20W. Ehhh wrong, I ended up in the 22W. I normally wear a 1X in shirts and a 22 in pants, so when I wear dresses, they are always a size 20 - this was my first 22, but I was just glad one of my dresses fit. The arms are kind of tight, but I wouldn't be concerned because I have pretty big arms and they were fine. Also, the dress is a little on the short side, so if you're a bigger girl and your legs aren't your strong point, you may want to consider a different dress or wear black tights (I wore the tights). Overall, this is a beautiful dress. The color is much more gorgeous in person and I got a lot of compliments. It's heavy and well-made and it was so nice to wear a dress I really loved rather than settling for some Lane Bryant dress that I only bought because it fit. ",
 'She is 5\'10" 125 lb

---

<h3>Part 4: Conclusions & Takeaways</h3>

- **Conclusions:**
    - The discriminator is good at detecting fake reviews, BUT the generator is bad at creating convincing fake reviews
    
    - The discriminator appears to learn faster than the generator
    
    - The generator needs more training and an earlier warm-up phase for its optimizer


- **Takeaways:**

    - Transfer learning with OpenAI GPT-2 and Google BERT is computationally expensive. Training our model for 1000 epochs took 12 hours.

    - Using a TPU for training might help, since a typical GPU has only 12 GB to 16 GB of memory, which is not enough to train our model with a large mini-batch size

    - GPT-2 and BERT show great NLP power; both are well-implemented models that can be used for a variety of high-level language tasks.
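
To make the warm-up remark concrete, the sketch below shows a generic schedule with a linear ramp-up followed by a linear decay. It is an assumption-laden illustration and not necessarily the exact formula implemented by `BertAdam` or `OpenAIAdam`; the function name `warmup_linear_multiplier` is hypothetical. With `warmup = 0.1` and `t_total = 1000`, as in the `TextGAN` class, the ramp-up covers the first 100 steps.

```python
def warmup_linear_multiplier(step, total_steps, warmup=0.1):
    """Learning-rate multiplier: linear ramp-up, then linear decay to zero."""
    progress = step / total_steps
    if progress < warmup:
        return progress / warmup                        # ramp from 0 to 1
    return max(0.0, (1.0 - progress) / (1.0 - warmup))  # decay from 1 to 0


BASE_LR = 0.00005   # discriminator learning rate used in the TextGAN class
TOTAL_STEPS = 1000  # matches t_total in both optimizers

for step in (0, 50, 100, 550, 1000):
    lr = BASE_LR * warmup_linear_multiplier(step, TOTAL_STEPS)
    print(step, lr)
```

Under a schedule like this, the learning rate peaks at step 100 and is already halved by step 550, so changing `warmup` or `t_total` shifts how much of the run is spent at a useful learning rate.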


---

<h3>Part 5: References</h3>

- https://skymind.ai/wiki/generative-adversarial-network-gan
    
- http://jalammar.github.io/illustrated-bert/
    
- https://medium.com/syncedreview/hugging-face-releases-pytorch-bert-pretrained-models-and-more-b8a7839e7730
    
- http://cseweb.ucsd.edu/~jmcauley/datasets.html#google_local
