## Easily generate titles using fine-tuned BART, DistilBART and PEGASUS models

This notebook lets you use the fine-tuned models to generate a title for a listing description of your choice.


Note that this requires no GPU and can thus be run on consumer hardware! However, running on the CPU makes title generation slower.

You only need to define a few variables (see below):

* read_in_from_drive: a boolean, set to True if you need to read in the models from Google Drive
* model_path_bart: the path to the tuned BART model
* model_path_distilbart: the path to the tuned DistilBART model
* model_path_pegasus: the path to the tuned PEGASUS model
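
Before loading anything, it can help to check that the configured paths actually point to files. The following is a minimal sketch using only the standard library; the helper name `find_missing_paths` and the example file names are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_missing_paths(paths):
    """Return the names of all entries in `paths` whose file does not exist.

    `paths` maps a model name (e.g. 'bart') to the configured path string.
    """
    return [name for name, path in paths.items() if not Path(path).is_file()]


# Hypothetical placeholder paths; replace them with your own model locations.
example_paths = {
    "bart": "models/bart_finetuned.pth",
    "distilbart": "models/distilbart_finetuned.pth",
    "pegasus": "models/pegasus_finetuned.pth",
}

missing = find_missing_paths(example_paths)
if missing:
    print("No file found for:", ", ".join(missing))
```

Running such a check first surfaces a mistyped path immediately, rather than as an error in the middle of model loading.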





## Reading in Libraries and defining functions

In [1]:
!pip install transformers==4.28.1
!pip install huggingface==0.0.1
!pip install accelerate==0.18.0



In [10]:
## import all needed packages
## (google.colab is imported further below, only when Drive is mounted)
import transformers
import huggingface
import pandas as pd
import numpy as np
import torch
import nltk
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline
from transformers import DataCollatorForSeq2Seq

In [3]:
#### functions ####

## Code for all functions to be called in later

def preprocess_function(row):

    """
    This function applies the tokenizer to both
    the description and the name (title) of a single row.
    tokenizer: Tokenizer to apply - to be defined globally
    max_input_length: Maximum number of tokens in the input (description) - to be defined globally
    max_target_length: Maximum number of tokens in the output (title) - to be defined globally
    """

    model_inputs = tokenizer(
        row["description"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        row["name"], max_length=max_target_length, truncation=True   ## truncate to maximum length
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs




## function to process data so that it can be used in the finetuning


def prepare_data(airbnb_london_filtered_advanced, frac_train_size, batch_size):

  """
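  This function prepares the data for the subsequent fine-tuning.
  airbnb_london_filtered_advanced: Dataset (as loaded from Drive), a DataFrame with an in_top_third column
  frac_train_size: Fraction of the filtered data to be used for training; the rest is used for evaluation
  batch_size: Batch size to be used in preparing the DataLoaders
  data_collator: Collator used by the DataLoaders - to be defined globally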
  """

  train_eval_airbnb_london_filtered_advanced = airbnb_london_filtered_advanced[airbnb_london_filtered_advanced.in_top_third == 1]

  train_airbnb_london_filtered_advanced = train_eval_airbnb_london_filtered_advanced.sample(n = int(np.ceil(frac_train_size*train_eval_airbnb_london_filtered_advanced.shape[0])), random_state = 100)
  eval_airbnb_london_filtered_advanced = train_eval_airbnb_london_filtered_advanced.drop(train_airbnb_london_filtered_advanced.index, axis = 0)


  ## resetting the index, else looping through the dataloaders results in errors: https://discuss.pytorch.org/t/keyerror-when-enumerating-over-dataloader/54210/8
  train_airbnb_london_filtered_advanced.index = list(range(train_airbnb_london_filtered_advanced.shape[0]))
  eval_airbnb_london_filtered_advanced.index = list(range(eval_airbnb_london_filtered_advanced.shape[0]))


  # tokenized_datasets
  tokenized_datasets  = train_airbnb_london_filtered_advanced.apply(preprocess_function, axis = 1)

  # eval_tokenized_datasets
  eval_tokenized_datasets  = eval_airbnb_london_filtered_advanced.apply(preprocess_function, axis = 1)

  ## calling data loader
  #batch_size = 8

  train_dataloader = DataLoader(
      tokenized_datasets,
      shuffle=True,
      collate_fn=data_collator,
      batch_size=batch_size,
  )

  eval_dataloader = DataLoader(
      eval_tokenized_datasets,
  #   shuffle=True,
      collate_fn=data_collator,
      batch_size=batch_size
  )

  return train_dataloader, eval_dataloader


def postprocess_text(preds, labels):

    """
    Post-processing to prepare inputs to the ROUGE functions
    """
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # ROUGE expects a newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels

device = 'cpu'  # title generation in this notebook runs on the CPU; no GPU is required


def generate_title(model_type, description, min_length=5, max_length=25):

  """ This function can be called to easily let a model of your choice generate a title.
  - model_type: Either 'bart', 'distilbart' or 'pegasus'
  - description: The listing description (a string) to generate a title for
  - min_length and max_length: Minimum and maximum token length of the title to be generated
  """

  if model_type == "bart":
    output_pipeline = summarizer_bart(description, min_length=min_length, max_length=max_length)
  elif model_type == "distilbart":
    output_pipeline = summarizer_distilbart(description, min_length=min_length, max_length=max_length)
  elif model_type == "pegasus":
    output_pipeline = summarizer_pegasus(description, min_length=min_length, max_length=max_length)
  else:
    raise ValueError("model_type must be either 'bart', 'distilbart' or 'pegasus'.")

  return output_pipeline[0]['summary_text']


## also reading in the associated tokenizers

tokenizer_bart = AutoTokenizer.from_pretrained("philschmid/bart-large-cnn-samsum")
tokenizer_distilbart = AutoTokenizer.from_pretrained("sshleifer/distilbart-xsum-12-3")
tokenizer_pegasus = AutoTokenizer.from_pretrained("google/pegasus-xsum")


It is assumed that you will load the models from Google Drive, so connecting to Drive is essential.

If you store the models elsewhere, skip this cell or set read_in_from_drive to False.
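
If the notebook might also be run outside Google Colab, the Drive mount can be guarded by an environment check. The following is a minimal sketch; the helper `running_in_colab` and the variable `should_mount_drive` are hypothetical names, not part of the original notebook.

```python
def running_in_colab():
    """Return True if the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (normally only available inside Colab)
    except ImportError:
        return False
    return True


# Only request a Drive mount when the flag is set AND we are inside Colab.
read_in_from_drive = True
should_mount_drive = read_in_from_drive and running_in_colab()
print("Mount Google Drive:", should_mount_drive)
```

Outside Colab, `should_mount_drive` is False even if `read_in_from_drive` is left at True, so the mount step can be skipped without an import error.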

In [7]:

read_in_from_drive = True

if read_in_from_drive:

  # connecting to drive
  from google.colab import drive
  drive.mount('/content/gdrive')


Mounted at /content/gdrive


### Next, read in the fine-tuned models.



In [8]:
# adjust these path-variables correspondingly

model_path_bart = "..."
# model_path_bart = "/content/gdrive/My Drive/Thesis/Models/bart_finetuned_1_alt.pth"

model_path_distilbart = "..."
# model_path_distilbart = "/content/gdrive/My Drive/Thesis/Models/distilbart_finetuned_1_alt.pth"

model_path_pegasus =  "..."
# model_path_pegasus = "/content/gdrive/My Drive/Thesis/Models/pegasus_fine_tuned_1.pth"

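# models: map_location='cpu' lets models that were saved on a GPU load on a CPU-only machine
model_bart = torch.load(model_path_bart, map_location='cpu')
model_distilbart = torch.load(model_path_distilbart, map_location='cpu')
model_pegasus = torch.load(model_path_pegasus, map_location='cpu')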

# pipelines
summarizer_bart = pipeline("summarization", model=model_bart, tokenizer = tokenizer_bart)
summarizer_distilbart = pipeline("summarization", model=model_distilbart, tokenizer = tokenizer_distilbart)
summarizer_pegasus = pipeline("summarization", model=model_pegasus, tokenizer = tokenizer_pegasus)



## Generate a title :)

Finally, simply define a description and let the model generate a title!

To use BART, pass 'bart' as the first argument; for DistilBART, pass 'distilbart'; and for PEGASUS, pass 'pegasus'.

See the examples provided below:

In [13]:
description = 'The space Bright double bedroom, own living room and own bathroom all on your own floor in our Victorian house in leafy West Hampstead. You will have a mini fridge, toaster, and tea & coffee making facilities in your living room. We also provide you with tea, coffee, cereal, bread & milk & therefore won’t need to share any spaces with us during this time however, we are always available to advise on places to visit, restaurant, bars etc. As always the space is incredibly clean and we take extra precautions to keep the space safe, strictly following the Airbnb COVID cleansing guidelines. The bedroom has floor to ceiling wardrobes, a chest of drawers, real wood flooring, decorative fireplace, mirror and wireless internet connection. While your own private bathroom is not en-suite it is just a couple of steps away. It is a recently refurbished modern bathroom with power shower and full sized bath. The living room is large bright with bay windows &'


In [21]:
generate_title('bart', description)

'Bright double bedroom & private bathroom in Hampstead'

In [22]:
generate_title('distilbart', description)

'Bright double bedroom, living room & own bathroom'

In [23]:
generate_title('pegasus', description)

'Bright double bedroom in leafy West Hampstead'
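
To compare the three models side by side, and to see how long generation takes on the CPU, the calls above can be wrapped in a small loop. This is only a sketch: `compare_models` is a hypothetical helper, and `fake_generate_title` is a stand-in so that the example runs without the models loaded. In the notebook you would pass `generate_title` instead.

```python
import time


def compare_models(generate_fn, description,
                   model_types=("bart", "distilbart", "pegasus")):
    """Call generate_fn(model_type, description) once per model type.

    Returns a dict mapping each model type to its title and the
    wall-clock time (in seconds) that the call took.
    """
    results = {}
    for model_type in model_types:
        start = time.perf_counter()
        title = generate_fn(model_type, description)
        elapsed = time.perf_counter() - start
        results[model_type] = {"title": title, "seconds": elapsed}
    return results


def fake_generate_title(model_type, description):
    """Stand-in for generate_title; it only echoes part of the input."""
    return f"[{model_type}] {description[:20]}"


example_description = "Bright double bedroom in a Victorian house"
for model_type, result in compare_models(fake_generate_title, example_description).items():
    print(f"{model_type}: {result['title']} ({result['seconds']:.4f}s)")
```

With the real models, the timings give a rough idea of the CPU slowdown mentioned at the top of the notebook.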