In [1]:
!pip install transformers

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)
[K     |▍                               | 10kB 24.5MB/s eta 0:00:01[K     |▉                               | 20kB 2.9MB/s eta 0:00:01[K     |█▎                              | 30kB 3.8MB/s eta 0:00:01[K     |█▊                              | 40kB 4.2MB/s eta 0:00:01[K     |██▏                             | 51kB 3.5MB/s eta 0:00:01[K     |██▋                             | 61kB 3.8MB/s eta 0:00:01[K     |███                             | 71kB 4.2MB/s eta 0:00:01[K     |███▍                            | 81kB 4.4MB/s eta 0:00:01[K     |███▉                            | 92kB 4.8MB/s eta 0:00:01[K     |████▎                           | 102kB 4.7MB/s eta 0:00:01[K     |████▊                           | 112kB 4.7MB/s eta 0:00:01[K     |█████▏                          | 122kB 4.7M

In [30]:
import torch
import json
import csv
import os
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import AdamW, get_linear_schedule_with_warmup



device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using device: {device}')

Using device: cuda


# Load GPT-2 Medium 

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', model_max_length=1024)
model = GPT2LMHeadModel.from_pretrained('gpt2')
model = model.to(device)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1042301.0, style=ProgressStyle(descript…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=456318.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=665.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=548118077.0, style=ProgressStyle(descri…




Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Helper functions

In [83]:
class ModelWrapper:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'

        self.set_generation_settings()

    def set_generation_settings(self, params={}):
        if params:
            for key in params:
                self.gen_params[key] = params[key]
        else:
            self.gen_params = {
                'max_length': 512,
                'top_p': 0.92,
                'top_k': 0,
                'temperature': 0.7 
            }

    def generate(self, prompt='', keywords=[], category=''):
        input = '<|startoftext|>'
        if category:
            assert category in ['positive', 'negative']
            input += f'~`{category}'
        if keywords:
            keywords = [ k.replace(' ', '-') for k in keywords ]
            input += f"~^{' '.join(keywords)}"
        
        input += f"~@{prompt if prompt else ''}"
        input_encoded = self.tokenizer.encode(input, return_tensors='pt').to(self.device)

        outputs = self.model.generate(
            input_encoded,
            do_sample=True, 
            **self.gen_params
        )

        # TODO: force outputs with keywords (?)
        outputs_decoded = [ self.tokenizer.decode(out, skip_special_tokens=True).split('~@')[-1] for out in outputs ]
        return outputs_decoded


# Reviews dataset

In [24]:
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [25]:
DATA_PATH = '/content/drive/My Drive/Data Science/Datasets'
!ls '$DATA_PATH'

reviews_nlp_encoded.txt


In [26]:
from torch.utils.data import Dataset, DataLoader
MAX_SEQ_LEN = 1000

class ReviewsDataset(Dataset):
    def __init__(self, filename):
        super().__init__()

        with open(filename) as data_file:
          self.reviews = data_file.readlines()
        
        self.join_sequences()

    def join_sequences(self):
        joined = []
        temp_reviews_tens = None
        for review in self.reviews:
            # fit as many review sequences into MAX_SEQ_LEN sequence as possible
            review_tens = torch.tensor(tokenizer.encode(review, max_length=MAX_SEQ_LEN, truncation=True)).unsqueeze(0)

            # the first review sequence in the sequence
            if not torch.is_tensor(temp_reviews_tens):
                temp_reviews_tens = review_tens
                continue
            else:
                # the next review does not fit in so we process the sequence and leave the last review 
                # as the start for next sequence 
                if temp_reviews_tens.size()[1] + review_tens.size()[1] < MAX_SEQ_LEN:
                    # add the review to sequence, continue and try to add more
                    temp_reviews_tens = torch.cat([temp_reviews_tens, review_tens[:, 1:]], dim=1)
                    continue
                else:
                    work_reviews_tens, temp_reviews_tens = temp_reviews_tens, review_tens
            joined.append(work_reviews_tens)
          
        self.encoded_joined_seq = joined

    def __len__(self):
        return len(self.encoded_joined_seq)

    def __getitem__(self, idx):
        return self.encoded_joined_seq[idx]

review_dataset = ReviewsDataset(f'{DATA_PATH}/reviews_nlp_encoded.txt')
review_loader = DataLoader(review_dataset, batch_size=1, shuffle=True)

# Hyperparameters

In [28]:
BATCH_SIZE = 5
EPOCHS = 8
LEARNING_RATE = 1e-4
WARMUP_FRAC = 0.25
TRAINING_STEPS = len(review_loader)*EPOCHS//BATCH_SIZE

In [29]:
print(f'There are {TRAINING_STEPS} training steps ({TRAINING_STEPS/EPOCHS} per epoch).')

There are 6848 training steps (856.0 per epoch).


# Training

In [31]:
model = model.to(device)
model.train()

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=int(WARMUP_FRAC*TRAINING_STEPS), num_training_steps=TRAINING_STEPS)

loss_history = []
batch_count = 0
reviews_count = 0
print_loss_every = int(TRAINING_STEPS/EPOCHS/3)

temp_reviews_tens = None
models_dir = "models"
if not os.path.exists(models_dir):
    os.mkdir(models_dir)

for epoch in range(EPOCHS):
    print(f"EPOCH {epoch+1}")
    start = time.perf_counter()
    epoch_loss_history = []

  
    for review_tens in review_loader:
        review_tens = review_tens.to(device)
                
        # sequence ready, process it trough the model
        optimizer.zero_grad()
        outputs = model(review_tens, labels=review_tens)
        loss, logits = outputs[:2]                        
        loss.backward()
        epoch_loss_history.append(loss.detach().item())
                       
        reviews_count += 1
        if reviews_count == BATCH_SIZE:
            reviews_count = 0
            batch_count += 1
            optimizer.step()
            scheduler.step() 

        if batch_count == print_loss_every:
            batch_count = 0
            avg_loss = np.array(epoch_loss_history)[-print_loss_every*BATCH_SIZE+1:].mean()
            print(f'Avg. loss: {avg_loss:.4f}')
    
    loss_history.append(torch.tensor(epoch_loss_history))
    end = time.perf_counter()
    print(f'The epoch has been trained in {end-start:.2f} seconds.\n')
    # store the model after each epoch to compare the performance of them
    torch.save(model.state_dict(), os.path.join(models_dir, f"review_model_e{epoch+1}.pt"))

EPOCH 1
Avg. loss: 3.0681
Avg. loss: 2.6747
Avg. loss: 2.6207
Training the epoch has been finished in 1396.96 seconds.
EPOCH 2
Avg. loss: 2.6118
Avg. loss: 2.5860
Avg. loss: 2.5390
Training the epoch has been finished in 1408.27 seconds.
EPOCH 3
Avg. loss: 2.5256
Avg. loss: 2.4698
Avg. loss: 2.4672
Training the epoch has been finished in 1409.33 seconds.
EPOCH 4
Avg. loss: 2.4089
Avg. loss: 2.3963
Avg. loss: 2.3858
Training the epoch has been finished in 1410.07 seconds.
EPOCH 5
Avg. loss: 2.3334
Avg. loss: 2.3009
Avg. loss: 2.3173
Training the epoch has been finished in 1410.17 seconds.
EPOCH 6
Avg. loss: 2.2501
Avg. loss: 2.2550
Avg. loss: 2.2553
Training the epoch has been finished in 1409.09 seconds.
EPOCH 7
Avg. loss: 2.1990
Avg. loss: 2.1866
Avg. loss: 2.2172
Training the epoch has been finished in 1409.27 seconds.
EPOCH 8
Avg. loss: 2.1662
Avg. loss: 2.1816
Avg. loss: 2.1591
Training the epoch has been finished in 1410.19 seconds.


# Evaluate model

In [33]:
def load_model(filename):
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    model.to(device)
    model.load_state_dict(torch.load(filename))
    model.eval()

    return model

In [87]:
def create_evaluation_report(model, tokenizer, gen_params_list, meta_info):
    wrapped = ModelWrapper(model, tokenizer)
    report = 'Book Review Generation Model\n'

    def add_reviews(report, reviews):
        for idx, review in enumerate(reviews):
            report += f'\tReview {idx+1}:\n\t{review}\n'
        return report+'\n'
    
    for key in meta_info:
        if meta_info[key]:
            report += f'{key}: {meta_info[key]}\n'
    report += '\n'
    
    for idx, params in enumerate(gen_params_list):
        wrapped.set_generation_settings(params)
        report += '-'*20+'\n'
        report += f'GENERATION PARAMS V{idx+1}\n'
        report += ' | '.join([ f'{key}: {params[key]}' for key in params ])
        report += '\n\n'
        
        # TEST 1
        report += 'TEST 1: Positive vs Negative Reviews\n\n'
        report += 'Category: POSITIVE\n'
        report = add_reviews(report, wrapped.generate(category='positive'))
        report += 'Category: NEGATIVE\n'
        report = add_reviews(report, wrapped.generate(category='negative'))

        # TEST 2
        report += 'TEST 2: Keywords\n\n'
        report += 'Keywords: BORING\n'
        report = add_reviews(report, wrapped.generate(keywords=['boring']))
        report += 'Keywords: TWILIGHT\n'
        report = add_reviews(report, wrapped.generate(keywords=['twilight']))
        report += 'Keywords: SEX\n'
        report = add_reviews(report, wrapped.generate(keywords=['sex']))
        report += 'Keywords: TREE, DOG\n'
        report = add_reviews(report, wrapped.generate(keywords=['tree', 'dog']))

    return report


In [41]:
meta_info = {
    'epoch': 8,
    'learning_rate': LEARNING_RATE,
    'batch_size': BATCH_SIZE,
    'scheduler': 'linear warmup 0.25'
}

In [34]:
model = load_model(f'models/review_model_e{meta_info["epoch"]}.pt')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', model_max_length=1024)

Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [88]:
report = create_evaluation_report(model, tokenizer, [
                                            { 'top_p': 0.92, 'top_k': 0, 'temperature': 0.7, 'num_return_sequences': 2 }
                                            ], meta_info)

print(report)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


Book Review Generation Model
epoch: 8
learning_rate: 0.0001
batch_size: 5
scheduler: linear warmup 0.25

--------------------
GENERATION PARAMS V1
top_p: 0.92 | top_k: 0 | temperature: 0.7 | num_return_sequences: 2

TEST 1: Positive vs Negative Reviews

Category: POSITIVE
	Review 1:
	I loved the historical setting of this book. I found the characters to be well developed. I highly recommend this book.
	Review 2:
	I read this book on my way to a weekend retreat at a local resort. I love the setting and the characters. I felt like the author was trying to make me feel good about myself and not the others. It's definitely not a book for the timid, it's for the brave and the strong. It's also a book to keep the kids interested in.

Category: NEGATIVE
	Review 1:
	The book is well written, but the characters were just plain annoying. The story is told in alternating chapters. It's not that they are all bad, but that the story is repetitive. It's not that the story is terrible. It's that the 

In [92]:
wrapped = ModelWrapper(model, tokenizer)
wrapped.set_generation_settings({ 'top_p': 0.92, 'top_k': 0, 'temperature': 0.7, 'num_return_sequences': 5 })
reviews = wrapped.generate(keywords=['harry'])
for review in reviews:
    print(review+'\n')

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


This book was a total let-down for me. The writing was horrendous. I kept hoping for something good to come out of it. Instead, the main character, Charlotte, was so predictable and annoying and nothing happened. The characters were all so flat and dull and I couldn't get past the last sentence. I had to keep reading hoping that something would come of it and that it would change things for the better. Nothing really did. I was a bit disappointed in the ending. I did love the story and I was hoping for something more to come out of it but it just didn't happen. I thought the ending was way too realistic. I had a hard time getting past the first chapter and it was pretty painful. It was like the book was trying to make you believe that the author was trying to tell a good story. It was just so unrealistic.I really didn't like this book. I feel that the author is trying to make Charlotte the most unlikable character in the book and I found that unrealistic. I feel like she would have a v