# Third test: Introduction

In the second test with BART we'll use a more complex pretrained model, **bart-large-cnn**, which is specialized for summarization.

As before we use the following alternative inputs:
*   Preprocessed reviews
*   Full text

Here too we rely on the **transformers** library from **huggingface**.

In [1]:
!pip install -q transformers


In [2]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, TrainingArguments, Trainer, create_optimizer, AdamWeightDecay, DataCollatorForSeq2Seq
from transformers.keras_callbacks import PushToHubCallback
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
from transformers import pipeline

import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

As anticipated above, we need two types of dataset; the following function is used to load them.

In [3]:
# Read a text file and return its lines as a list
def read_list_from_file(filename):
  # Open the file in read mode; the context manager closes it on exit
  with open(filename, "r") as my_file:
    data = my_file.read()

  # Split the text at every newline ('\n')
  data_into_list = data.split("\n")

  return data_into_list

Mount Drive

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Names of the files that contain the datasets.

In [5]:
#Filename of the review list's file
filename_review_list = "drive/Shareddrives/BPM PROJECT/Dataset/review_list.txt"

#Filename of the dataset
filename_dataset = "drive/Shareddrives/BPM PROJECT/Dataset/amazon_reviews_reduced.csv"

# Defining the model

For this test we'll use **bart-large-cnn**

In [6]:
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


# Defining the functions used for the task

In [9]:
# This function receives a list of batches (concatenated reviews) as input and
# returns one summarizer result per batch
def get_summaries(batches):
  summaries = []
  for batch in batches:
    summary = summarizer(batch, max_length=1024, min_length=10, do_sample=False)
    summaries.append(summary)
  return summaries

# Creating the inputs

In the following cell we can see how the inputs are created; in particular we build batches of reviews, each one formed by the concatenation of different reviews.


In [10]:
def get_batches(reviews, sample):

  input = " "
  count = 0
  checkpoint = 0


  # Batch creation:
  # build an array of concatenated texts, each with at most 1024 tokens
  # (the check below uses a stricter budget of 512 space-separated words).
  # The same procedure will be applied to the generated summaries!
  # Start from checkpoint and go on until the loop ends.

  batches = []
  i = 0

  # When sampling, use at most the first 100 reviews
  if sample:
    num_samples = min(100, len(reviews))
  else:
    num_samples = len(reviews)

  for i in range(0,num_samples):
  # for i in range(0,43):
    if count + len(reviews[i].split(" ")) < (512):
      #print(cleaned_reviews[i])
      input = input + reviews[i] + " "
      #print(input)
      count = count + len(reviews[i].split(" "))
    else:
      batches.append(input)
      print('Input {} created with a length of {} tokens.'.format(len(batches), count))
      print("Number of reviews processed: {}".format(i))
      input = reviews[i] + " "
      count = len(reviews[i].split(" "))

  if(count != 0):
    batches.append(input)
    print('Input {} created with a length of {} tokens.'.format(len(batches), count))
    print("Number of reviews processed: {}".format(i))

  return batches

Now we create the batches to summarize.

The batches of reviews built by this function are the input to the first stage of summarization: the result consists of **N** summaries, where **N** is the number of inputs to the model (the size of **batches**). The same function is later reused to batch those summaries for the second stage.
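The two-stage scheme described above can be sketched without loading the model. This is a minimal illustration, not the notebook's implementation: `pack`, `two_stage_summary` and the stand-in summarizer `first_words` are hypothetical names, and the 16-word budget is chosen only to keep the toy example small.

```python
from typing import Callable, List, Tuple

def pack(texts: List[str], budget: int) -> List[str]:
    """Greedily concatenate texts into batches of fewer than `budget` words."""
    batches, current, count = [], [], 0
    for text in texts:
        n_words = len(text.split())
        # Close the current batch when adding this text would reach the budget
        if current and count + n_words >= budget:
            batches.append(" ".join(current))
            current, count = [], 0
        current.append(text)
        count += n_words
    if current:
        batches.append(" ".join(current))
    return batches

def two_stage_summary(
    reviews: List[str], summarize: Callable[[str], str], budget: int
) -> Tuple[List[str], List[str]]:
    """Summarize batches of reviews, then summarize batches of those summaries."""
    first_stage = [summarize(batch) for batch in pack(reviews, budget)]
    second_stage = [summarize(batch) for batch in pack(first_stage, budget)]
    return first_stage, second_stage

def first_words(text: str, n: int = 5) -> str:
    """Stand-in for the real summarizer (assumption): keep the first n words."""
    return " ".join(text.split()[:n])

reviews = ["great cable works fine with my tv"] * 6  # toy data, 7 words each
first, second = two_stage_summary(reviews, first_words, budget=16)
print(len(first), len(second))
```

In the notebook the `summarize` argument would correspond to a call to the `summarizer` pipeline followed by extraction of `summary_text`.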

Loading the preprocessed reviews

In [11]:
# Load the preprocessed reviews.
# Caveat: the network was already trained on these same reviews, which is a mistake;
# we could either drop the preprocessing or skip the first 1000 reviews.

reviews_list_preprocessed = read_list_from_file(filename_review_list)
print(reviews_list_preprocessed[0])

work perfect plug laptop tv lines sound interference definitely buy


In [12]:
batches = get_batches(reviews_list_preprocessed, True)

Input 1 created with a length of 493 tokens.
Number of reviews processed: 22
Input 2 created with a length of 485 tokens.
Number of reviews processed: 43
Input 3 created with a length of 499 tokens.
Number of reviews processed: 59
Input 4 created with a length of 493 tokens.
Number of reviews processed: 80
Input 5 created with a length of 485 tokens.
Number of reviews processed: 96
Input 6 created with a length of 120 tokens.
Number of reviews processed: 99


Loading the full-text reviews

In [13]:
amazon_reviews = pd.read_csv("drive/Shareddrives/BPM PROJECT/Dataset/amazon_reviews_reduced_preprocessed.csv", on_bad_lines="skip")

In [14]:
review_list = amazon_reviews["review_body"].tail(1000).tolist()

In [15]:
batches_full_text = get_batches(review_list, True)

Input 1 created with a length of 493 tokens.
Number of reviews processed: 10
Input 2 created with a length of 464 tokens.
Number of reviews processed: 22
Input 3 created with a length of 508 tokens.
Number of reviews processed: 32
Input 4 created with a length of 430 tokens.
Number of reviews processed: 43
Input 5 created with a length of 496 tokens.
Number of reviews processed: 52
Input 6 created with a length of 459 tokens.
Number of reviews processed: 59
Input 7 created with a length of 493 tokens.
Number of reviews processed: 72
Input 8 created with a length of 449 tokens.
Number of reviews processed: 80
Input 9 created with a length of 499 tokens.
Number of reviews processed: 88
Input 10 created with a length of 505 tokens.
Number of reviews processed: 97
Input 11 created with a length of 179 tokens.
Number of reviews processed: 99


# Using the preprocessed reviews

The batches of preprocessed reviews are passed to the first stage of summarization, which produces **N** summaries, where **N** is the number of inputs to the model (the size of **batches**).

In [16]:
summaries_first_stage = get_summaries(batches=batches)

Your max_length is set to 1024, but your input_length is only 566. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=283)
Your max_length is set to 1024, but your input_length is only 582. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=291)
Your max_length is set to 1024, but your input_length is only 607. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=303)
Your max_length is set to 1024, but your input_length is only 610. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_leng
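The warnings above appear because `max_length=1024` is larger than every input. In each visible warning the suggested value is half of the reported `input_length`, rounded down. A minimal sketch of that rule follows; `suggested_max_length` is a hypothetical helper, and obtaining the token count of a batch (for example from the pipeline's tokenizer) is assumed to be done elsewhere.

```python
def suggested_max_length(input_length: int, ratio: float = 0.5, min_length: int = 10) -> int:
    """Return a max_length proportional to the input length.

    The result is kept strictly above min_length so that the two generation
    bounds never contradict each other.
    """
    return max(min_length + 1, int(input_length * ratio))

# Input lengths reported in the warnings of this notebook
for input_length in (566, 582, 607):
    print(input_length, suggested_max_length(input_length))
```

The returned value could then be passed as `max_length` in the `summarizer(...)` call inside `get_summaries`, instead of the fixed 1024.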

In [17]:
len(batches[0].split(" "))

495
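The counts used by `get_batches` come from splitting on spaces, while the warnings printed by the pipeline report a larger `input_length` for the same batches, presumably measured in model tokens. Under that assumption, the sketch below derives a word budget that would stay within a 1024-token limit for every observed pair. The pairs are copied from the four visible outputs of the preprocessed run, matching them by order is an assumption, and `safe_word_budget` is a hypothetical helper.

```python
# (words printed by get_batches, input_length reported in the warning)
observed = [(493, 566), (485, 582), (499, 607), (493, 610)]

MODEL_TOKEN_LIMIT = 1024  # limit quoted in the comments of get_batches

def safe_word_budget(pairs, token_limit):
    """Largest word count that stays within token_limit for every observed words/tokens pair."""
    # For each pair, words * (tokens / words_observed) <= token_limit
    # gives words <= token_limit * words_observed / tokens.
    return min(token_limit * words // tokens for words, tokens in pairs)

word_budget = safe_word_budget(observed, MODEL_TOKEN_LIMIT)
print(word_budget)
```

The budget of 512 words used in `get_batches` is well below this estimate, so the batches built above are expected to fit; the estimate only holds for texts similar to the observed ones.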

Extract the summary text from each result, dropping the surrounding list and dictionary.

In [18]:
summaries_first_stage = [x[0]["summary_text"] for x in summaries_first_stage]
summaries_first_stage


['Hdmi male hdmi female 270 degree adapter adapter thin enough plugs fully hides cable behind tv cleaner look upconverter dvd player combined cable adapter great picture sound quality great product mistakenly purchased switched cable service dvr box length cable cable company provided didnt quite stretch switched happy fit nicely kudos cables work great provide high quality video blue ray player tv.',
 "Amazon basics everything works fine love cables used buy monoprice cables excellent trying amazon cables 've converted cables thick monoprice ones making easier set mean easier bundle cables pass wall zip-tie slack etc cables given zero issues would recommend anyone looking good quality inexpensively priced cables n't sucker pay 3 times much monster cable.",
 'High-speed hdmi cable purchased along roku helps provide good picture great audio worked great television glad spent money buy ordering roku money felt good accessory accompany roku. Buy expensive premium cables smart buy one grea

In [19]:
print(summaries_first_stage)
print(batches)

['Hdmi male hdmi female 270 degree adapter adapter thin enough plugs fully hides cable behind tv cleaner look upconverter dvd player combined cable adapter great picture sound quality great product mistakenly purchased switched cable service dvr box length cable cable company provided didnt quite stretch switched happy fit nicely kudos cables work great provide high quality video blue ray player tv.', "Amazon basics everything works fine love cables used buy monoprice cables excellent trying amazon cables 've converted cables thick monoprice ones making easier set mean easier bundle cables pass wall zip-tie slack etc cables given zero issues would recommend anyone looking good quality inexpensively priced cables n't sucker pay 3 times much monster cable.", 'High-speed hdmi cable purchased along roku helps provide good picture great audio worked great television glad spent money buy ordering roku money felt good accessory accompany roku. Buy expensive premium cables smart buy one great 

In [20]:
batches_second_stage = get_batches(summaries_first_stage, False)

Input 1 created with a length of 262 tokens.
Number of reviews processed: 5


In [21]:
summaries_second_stage = get_summaries(batches_second_stage)

Your max_length is set to 1024, but your input_length is only 328. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=164)


In [22]:
print(summaries_second_stage[0])
print(summaries_first_stage[0])

[{'summary_text': 'High-speed hdmi cable purchased along roku helps provide good picture great audio worked great television glad spent money buy ordering roku money felt good accessory accompany roku. Buy expensive premium cables smart buy one great price great product.'}]
Hdmi male hdmi female 270 degree adapter adapter thin enough plugs fully hides cable behind tv cleaner look upconverter dvd player combined cable adapter great picture sound quality great product mistakenly purchased switched cable service dvr box length cable cable company provided didnt quite stretch switched happy fit nicely kudos cables work great provide high quality video blue ray player tv.


# Using the full text without preprocessing
The second part of the test on **bart-large-cnn** uses the full text (without preprocessing) as input.

In [23]:
summaries_first_stage_full_text = get_summaries(batches=batches_full_text)

Your max_length is set to 1024, but your input_length is only 617. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=308)
Your max_length is set to 1024, but your input_length is only 556. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=278)
Your max_length is set to 1024, but your input_length is only 627. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=313)
Your max_length is set to 1024, but your input_length is only 522. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_leng

In [24]:
summaries_first_stage_full_text = [x[0]["summary_text"] for x in summaries_first_stage_full_text]
summaries_first_stage_full_text


['The clarity was amazing when we connected the cable to our computer and tv. The plastic plug surrounding the hdmi connector is so thick it would not plug into my TV as the plastic contacted the TVs back panel. If not for this issue I would give this cable five stars.',
 'The quality of the picture and sound is excellent Durable, functional, and inexpensive. Had an interesting experience with this company as I placed the order just after Sandy had shipped. Would do business with again, without any hesitation.',
 'Good value for a HDMI cable. Not worth returning due to the price and it does have one use for HDMI splitter box to TV connectivity. Performs as advertised passing 3D signals. If you are ordering the Apple TV - you HAVE to order this too.',
 'Great/sturdy item for the price. No complaints works great with the 55inch tv and roku recently purchased. I should have purchassed more than one.',
 'This HDMI cable works where you need a dependable cable. Great length, sturdy construc

In [25]:
batches_second_stage_full_text = get_batches(summaries_first_stage_full_text, False)

Input 1 created with a length of 445 tokens.
Number of reviews processed: 10


In [26]:
summaries_second_stage_full_text = get_summaries(batches_second_stage_full_text)

Your max_length is set to 1024, but your input_length is only 534. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=267)


In [27]:
print(summaries_first_stage_full_text[0])
print(batches_full_text[0])

The clarity was amazing when we connected the cable to our computer and tv. The plastic plug surrounding the hdmi connector is so thick it would not plug into my TV as the plastic contacted the TVs back panel. If not for this issue I would give this cable five stars.
 Work perfect to plug my laptop to my TV.<br />No more lines, or sound interference!<br /><br />Definitely I will buy it again.... The clarity was amazing when we connected the cable to our computer and tv. Even if the print ont he comp was not on good it was way better on tv. What can I say? It works,<br /><br />Ordered one for my PS3. I had some problems with an expensive cable I bought years ago, the screen was flickering all the time.<br /><br />Don't spend more money, get this cable. This was the best thing I could've ever purchased! I save so much money each month because I see no need for DirecTV now. I love my apple tv! J plan on buying one more! This is a must for the apple tv and a great price! Two complaints. Fi

The model behaves like an extractive summarizer: the summary is largely made of sentences copied from the input, so we have to combine more summaries together.
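One rough way to check how extractive a summary is: measure the share of summary sentences that occur verbatim in the source. The sketch below is only an illustration, with hypothetical helper names and a simple regex sentence splitter (a tokenizer such as the one from `nltk` could be swapped in). The example strings are toy data, partly taken from the output above.

```python
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def extractive_fraction(summary, source):
    """Fraction of summary sentences that appear verbatim in the source."""
    sentences = split_sentences(summary)
    if not sentences:
        return 0.0
    copied = sum(1 for sentence in sentences if sentence in source)
    return copied / len(sentences)

source = (
    "Work perfect to plug my laptop to my TV. "
    "The clarity was amazing when we connected the cable to our computer and tv. "
    "Don't spend more money, get this cable."
)
summary = (
    "The clarity was amazing when we connected the cable to our computer and tv. "
    "This cable is cheap."
)
fraction = extractive_fraction(summary, source)
print(fraction)
```

A value close to 1 indicates that the summary mostly copies sentences from its input, which is the behaviour observed here.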