# Second test: Introduction

In the second test with BART we'll use a more complex model, **bart-large-xsum**, which is specialized for summarization.

As before we use the following alternative inputs:
*   Preprocessed reviews
*   Full text

As in the first test, we'll use **huggingface**'s **transformers** library.

In [1]:
!pip install -q transformers


In [2]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, TrainingArguments, Trainer, create_optimizer, AdamWeightDecay, DataCollatorForSeq2Seq
from transformers.keras_callbacks import PushToHubCallback
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
from transformers import pipeline

import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

As anticipated, we need two types of dataset; the following function is used to load them.

In [3]:
# Read a newline-separated list of strings from a text file
def read_list_from_file(filename):
  # The context manager closes the file even if reading fails
  with open(filename, "r") as my_file:
    data = my_file.read()

  # Split the text into one entry per line
  data_into_list = data.split("\n")

  return data_into_list

Mount Drive

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Names of the files that contain the datasets.

In [5]:
#Filename of the review list's file
filename_review_list = "drive/Shareddrives/BPM PROJECT/Dataset/review_list.txt"

#Filename of the dataset
filename_dataset = "drive/Shareddrives/BPM PROJECT/Dataset/amazon_reviews_reduced.csv"

# Defining the model

For this test we'll use **bart-large-xsum**

In [6]:
summarizer = pipeline("summarization", model="facebook/bart-large-xsum", tokenizer="facebook/bart-large-xsum")


# Defining the functions used for the task

In [7]:
# This function receives a list of batches (concatenated reviews) and returns
# one summary per batch, so the number of summaries equals the number of batches
def get_summaries(batches):
  summaries = []
  for batch in batches:
    summary = summarizer(batch, max_length=1024, min_length=10, do_sample=False)
    summaries.append(summary)
  return summaries
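
`get_summaries` passes a fixed `max_length=1024`, so the pipeline warns whenever an input is shorter than that budget. The following is a minimal sketch, not the notebook's method, of deriving `max_length` from each input instead. The names `choose_max_length` and `get_summaries_scaled` and the 0.5 ratio are illustrative assumptions; the sketch also assumes the pipeline exposes its tokenizer as `summarizer.tokenizer`.

```python
def choose_max_length(input_length: int, ratio: float = 0.5,
                      floor: int = 10, cap: int = 1024) -> int:
    """Pick a summary length budget proportional to the input length."""
    return max(floor, min(cap, int(input_length * ratio)))


def get_summaries_scaled(summarizer, batches, min_length: int = 10):
    """Like get_summaries, but with max_length derived from each input."""
    summaries = []
    for batch in batches:
        # Assumption: the pipeline exposes its tokenizer as `summarizer.tokenizer`.
        n_tokens = len(summarizer.tokenizer(batch)["input_ids"])
        # Keep max_length strictly above min_length.
        max_length = choose_max_length(n_tokens, floor=min_length + 1)
        summaries.append(summarizer(batch, max_length=max_length,
                                    min_length=min_length, do_sample=False))
    return summaries


# Half of a 566-token input
print(choose_max_length(566))  # 283
```

With a ratio of 0.5 the helper reproduces the value the pipeline itself suggests in its warning for a 566-token input.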

# Creating the inputs

The following cell shows how the inputs are created: we build batches of reviews, where each batch is the concatenation of several reviews and stays under 512 whitespace-separated words.


In [12]:
def get_batches(reviews, sample):

  input = " "
  count = 0
  checkpoint = 0


  # Batch creation
  # Build a list of concatenated texts, each under 512 whitespace-separated words
  # (well below the model's 1024-token input limit).
  # The same procedure is later applied to the generated summaries.
  # The loop runs from the first review to the last (`checkpoint` is currently unused).

  batches = []
  i = 0

  if sample:
    num_samples = min(100, len(reviews))  # sample at most 100 reviews
  else:
    num_samples = len(reviews)

  for i in range(0,num_samples):
  # for i in range(0,43):
    if count + len(reviews[i].split(" ")) < (512):
      #print(cleaned_reviews[i])
      input = input + reviews[i] + " "
      #print(input)
      count = count + len(reviews[i].split(" "))
    else:
      batches.append(input)
      print('Input {} created with a length of {} tokens.'.format(len(batches), count))
      print("Number of reviews processed: {}".format(i))
      input = reviews[i] + " "
      count = len(reviews[i].split(" "))

  if(count != 0):
    batches.append(input)
    print('Input {} created with a length of {} tokens.'.format(len(batches), count))
    print("Number of reviews processed: {}".format(i))

  return batches
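
Note that `get_batches` counts whitespace-separated words, not model tokens, which is why the pipeline reports input lengths above 512 for batches logged at under 512. The greedy batching can be sketched with a pluggable length function, as below. This is an illustrative variant under stated assumptions, not the notebook's code: `batch_by_length` and `count_words` are hypothetical names, and unlike `get_batches` it adds no leading or trailing space to a batch.

```python
from typing import Callable, List


def count_words(text: str) -> int:
    """Approximate length: number of whitespace-separated words."""
    return len(text.split())


def batch_by_length(texts: List[str], limit: int = 512,
                    length_fn: Callable[[str], int] = count_words) -> List[str]:
    """Greedily concatenate texts so each batch stays under `limit` units."""
    batches, current, count = [], [], 0
    for text in texts:
        n = length_fn(text)
        # Close the current batch when adding this text would reach the limit.
        if current and count + n >= limit:
            batches.append(" ".join(current))
            current, count = [], 0
        current.append(text)
        count += n
    if current:
        batches.append(" ".join(current))
    return batches


demo = batch_by_length(["a b c", "d e", "f g h i"], limit=5)
print(demo)  # ['a b c', 'd e', 'f g h i']
```

To count real tokens instead, one could pass something like `length_fn=lambda t: len(tokenizer(t)["input_ids"])`, assuming `tokenizer` is the `AutoTokenizer` loaded for the same model.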

Now we create the batches for the summaries

The batches of reviews built above form the input to the first stage of summarization: the result consists of **N** summaries, where **N** is the number of inputs to the model (the size of **batches**).

Loading the preprocessed reviews

In [None]:
# Load the preprocessed reviews. Caveat: the network was previously trained on these same reviews, which is an error.
# Possible fixes: drop the preprocessing, or skip the first 1000 reviews.

reviews_list_preprocessed = read_list_from_file(filename_review_list)
print(reviews_list_preprocessed[0])

work perfect plug laptop tv lines sound interference definitely buy


In [None]:
batches = get_batches(reviews_list_preprocessed, True)

Input 1 created with a length of 493 tokens.
Number of reviews processed: 22
Input 2 created with a length of 485 tokens.
Number of reviews processed: 43
Input 3 created with a length of 499 tokens.
Number of reviews processed: 59
Input 4 created with a length of 493 tokens.
Number of reviews processed: 80
Input 5 created with a length of 485 tokens.
Number of reviews processed: 96
Input 6 created with a length of 120 tokens.
Number of reviews processed: 99


Creating the full text reviews

In [9]:
amazon_reviews = pd.read_csv("drive/Shareddrives/BPM PROJECT/Dataset/amazon_reviews_reduced_preprocessed.csv", on_bad_lines="skip")

In [10]:
review_list = amazon_reviews["review_body"].tail(1000).tolist()

In [13]:
batches_full_text = get_batches(review_list, True)

Input 1 created with a length of 493 tokens.
Number of reviews processed: 10
Input 2 created with a length of 464 tokens.
Number of reviews processed: 22
Input 3 created with a length of 508 tokens.
Number of reviews processed: 32
Input 4 created with a length of 430 tokens.
Number of reviews processed: 43
Input 5 created with a length of 496 tokens.
Number of reviews processed: 52
Input 6 created with a length of 459 tokens.
Number of reviews processed: 59
Input 7 created with a length of 493 tokens.
Number of reviews processed: 72
Input 8 created with a length of 449 tokens.
Number of reviews processed: 80
Input 9 created with a length of 499 tokens.
Number of reviews processed: 88
Input 10 created with a length of 505 tokens.
Number of reviews processed: 97
Input 11 created with a length of 179 tokens.
Number of reviews processed: 99


# Using the preprocessed reviews

The batches of preprocessed reviews form the input to the first stage of summarization: the result consists of **N** summaries, where **N** is the number of inputs to the model (the size of **batches**).

In [None]:
summaries_first_stage = get_summaries(batches=batches)

Your max_length is set to 1024, but your input_length is only 566. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=283)
Your max_length is set to 1024, but your input_length is only 582. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=291)
Your max_length is set to 1024, but your input_length is only 607. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=303)
Your max_length is set to 1024, but your input_length is only 610. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_leng

In [None]:
len(batches[0].split(" "))

495

In [None]:
summaries_first_stage = [x[0]["summary_text"] for x in summaries_first_stage]
summaries_first_stage


['Hdmi cables work great provide high quality video blue ray player tv definitely fit bill work good ordered several pairs great product price normally order cables monoprice great quality great price happy cable purchased go new roku box would recommend work higher priced cables.',
 'Hdmi cable connects to a wide variety of devices.',
 'Hdmi cable is one of the best buys ever made works really well much cheaper cables stores cheapest one found store 14.00 brand highly recommend cable amazon high-speed hdmi cables.',
 "High-speed hdmi cable connect laptop computer directly hdtv easy use good picture sound prefer brands say monoprice flexible/less stiff complain wish 'd shorter length 2m shortest last time ordered looks like 1m lengths available good great quality great price usual amazon delivered good product great price since sold amazon trust quality n't feel cheap though cost 6",
 "A few of the things we've tried and found to be good or bad:.",
 'Buying a cable to connect a televis

In [None]:
print(summaries_first_stage)
print(batches)

['Hdmi cables work great provide high quality video blue ray player tv definitely fit bill work good ordered several pairs great product price normally order cables monoprice great quality great price happy cable purchased go new roku box would recommend work higher priced cables.', 'Hdmi cable connects to a wide variety of devices.', 'Hdmi cable is one of the best buys ever made works really well much cheaper cables stores cheapest one found store 14.00 brand highly recommend cable amazon high-speed hdmi cables.', "High-speed hdmi cable connect laptop computer directly hdtv easy use good picture sound prefer brands say monoprice flexible/less stiff complain wish 'd shorter length 2m shortest last time ordered looks like 1m lengths available good great quality great price usual amazon delivered good product great price since sold amazon trust quality n't feel cheap though cost 6", "A few of the things we've tried and found to be good or bad:.", 'Buying a cable to connect a television t

In [None]:
batches_second_stage = get_batches(summaries_first_stage, False)

Input 1 created with a length of 176 tokens.
Number of reviews processed: 5


In [None]:
summaries_second_stage = get_summaries(batches_second_stage)

Your max_length is set to 1024, but your input_length is only 219. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=109)


In [None]:
print(summaries_second_stage[0])
print(summaries_first_stage[0])

[{'summary_text': 'Hdmi cable is one of the best buys ever made.'}]
Hdmi cables work great provide high quality video blue ray player tv definitely fit bill work good ordered several pairs great product price normally order cables monoprice great quality great price happy cable purchased go new roku box would recommend work higher priced cables.
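
The two stages above (batch, summarize, batch the summaries, summarize again) can be written as a single loop that repeats until one summary remains. This is a minimal sketch under assumptions: `summarize_in_stages` is a hypothetical helper, and the batching and summarizing steps are passed in as callables so the loop can be tried with simple stand-ins before plugging in the real model.

```python
from typing import Callable, List


def summarize_in_stages(texts: List[str],
                        make_batches: Callable[[List[str]], List[str]],
                        summarize: Callable[[str], str],
                        max_stages: int = 5) -> str:
    """Repeat batch -> summarize until a single summary remains."""
    current = list(texts)
    for _ in range(max_stages):
        batches = make_batches(current)
        current = [summarize(batch) for batch in batches]
        if len(current) <= 1:
            break
    # If max_stages ran out first, fall back to joining what is left.
    return " ".join(current)


# Stand-ins for illustration only: pair texts up, keep the first word.
pair_up = lambda xs: [" ".join(xs[i:i + 2]) for i in range(0, len(xs), 2)]
first_word = lambda text: text.split()[0]

result = summarize_in_stages(["a x", "b y", "c z", "d w"], pair_up, first_word)
print(result)  # a
```

In this notebook the callables would roughly correspond to `lambda xs: get_batches(xs, False)` and `lambda t: summarizer(t, max_length=1024, min_length=10, do_sample=False)[0]["summary_text"]`.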


In [None]:
#Sentiment analysis
#Bert-base-multilingual-uncased-sentiment is a model fine-tuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish and Italian.
pipe = pipeline(model="nlptown/bert-base-multilingual-uncased-sentiment")



In [None]:
# TODO: replace with the specific product

review_stars = pd.read_csv("drive/Shareddrives/BPM PROJECT/Dataset/amazon_reviews_reduced.csv", on_bad_lines="skip")
mask = review_stars[review_stars["product_id"] == "B003L1ZYYM"]
mask = mask["star_rating"]

In [None]:
val = pipe(summaries_first_stage[0])
print("First Stage: " + str(val[0].get("score")))

val = pipe(summaries_second_stage[0][0].get("summary_text"))
print("Second Stage: " + str(val[0].get("score")))

print("Expected Rating: " + str((mask.sum()/len(mask))/5))

First Stage: 0.8131595849990845
Second Stage: 0.9593029022216797
Expected Rating: 0.9420373027259684
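
One caveat when reading these numbers: the `score` field returned by a text-classification pipeline is the classifier's confidence in its predicted label, while the predicted star rating is carried in the `label` field. The sketch below assumes this model's labels have the form `"1 star"` to `"5 stars"` and converts a prediction to the same 0 to 1 scale as the expected rating. `normalized_rating` is a hypothetical helper and the example prediction is made up for illustration.

```python
def normalized_rating(prediction: dict, max_stars: int = 5) -> float:
    """Map a prediction such as {'label': '4 stars', ...} to stars / max_stars."""
    # Assumption: the label starts with the number of stars.
    stars = int(prediction["label"].split()[0])
    return stars / max_stars


# Illustrative prediction, not an actual model output.
example = {"label": "4 stars", "score": 0.81}
print(normalized_rating(example))  # 0.8
```

Applied to the pipeline output, this would be `normalized_rating(val[0])`, which is directly comparable with the mean star rating divided by 5.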


TODO



*   The fewer reviews there are, the better it works!
*   The multiple stages are fine; in fact, it works better with them!!
*   Do the same with the full texts



# Using the full text without preprocessing
The second part of the test on **bart-large-xsum** uses the full text (without preprocessing) as input.

In [14]:
summaries_first_stage_full_text = get_summaries(batches=batches_full_text)

Your max_length is set to 1024, but your input_length is only 617. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=308)
Your max_length is set to 1024, but your input_length is only 556. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=278)
Your max_length is set to 1024, but your input_length is only 627. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=313)
Your max_length is set to 1024, but your input_length is only 522. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_leng

In [15]:
summaries_first_stage_full_text = [x[0]["summary_text"] for x in summaries_first_stage_full_text]
summaries_first_stage_full_text


['This is a great product for the price.',
 'AmazonBasics HDMI cable is a good buy for the price.',
 "Amazon's HDMI cable for the Apple TV is a good buy for the price.",
 'This HDMI cable is just as high quality as any other high quality cable, the only difference is the price.',
 'This HDMI cable is a great deal for the price, and the quality is top notch.',
 'The Amazon High-Speed HDMI cable that I purchased along with my Roku has been a huge success!',
 "AmazonBasics HDMI cables have been a huge success for me, and I've been using them for two years now.",
 'AmazonBasics High-Speed HDMI Cable is a great value for the money.',
 'An excellent cable for connecting a Blu-Ray player to a receiver at a fraction of the cost of bricks and mortar retailers.',
 'AmazonBasics HDMI cables are a great value for the money.',
 'AmazonBasics Digital Optical Audio Toslink Cable, 6 Feet [[ASIN:B001TH7GSW AmazonBasics]].']

In [16]:
batches_second_stage_full_text = get_batches(summaries_first_stage_full_text, False)

Input 1 created with a length of 157 tokens.
Number of reviews processed: 10


In [17]:
summaries_second_stage_full_text = get_summaries(batches_second_stage_full_text)

Your max_length is set to 1024, but your input_length is only 207. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=103)


In [18]:
summaries_second_stage_full_text = [x[0]["summary_text"] for x in summaries_second_stage_full_text]
summaries_second_stage_full_text

['An excellent cable for connecting a Blu-Ray player to a receiver at a fraction of the cost of bricks and mortar retailers.']

In [19]:
print(summaries_second_stage_full_text[0])
print(summaries_first_stage_full_text[0])
print(batches_full_text[0])

An excellent cable for connecting a Blu-Ray player to a receiver at a fraction of the cost of bricks and mortar retailers.
This is a great product for the price.
 Work perfect to plug my laptop to my TV.<br />No more lines, or sound interference!<br /><br />Definitely I will buy it again.... The clarity was amazing when we connected the cable to our computer and tv. Even if the print ont he comp was not on good it was way better on tv. What can I say? It works,<br /><br />Ordered one for my PS3. I had some problems with an expensive cable I bought years ago, the screen was flickering all the time.<br /><br />Don't spend more money, get this cable. This was the best thing I could've ever purchased! I save so much money each month because I see no need for DirecTV now. I love my apple tv! J plan on buying one more! This is a must for the apple tv and a great price! Two complaints. First off the cable is quite thick and difficult to straighten the coils out of. Secondly, the plastic plug 

In [20]:
#Sentiment analysis
#Bert-base-multilingual-uncased-sentiment is a model fine-tuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish and Italian.
pipe = pipeline(model="nlptown/bert-base-multilingual-uncased-sentiment")



In [21]:
# TODO: replace with the specific product

review_stars = pd.read_csv("drive/Shareddrives/BPM PROJECT/Dataset/amazon_reviews_reduced.csv", on_bad_lines="skip")
mask = review_stars[review_stars["product_id"] == "B003L1ZYYM"]
mask = mask["star_rating"]

In [23]:
val = pipe(summaries_first_stage_full_text[0])
print("First Stage: " + str(val[0].get("score")))

val = pipe(summaries_second_stage_full_text)
print("Second Stage: " + str(val[0].get("score")))

print("Expected Rating: " + str((mask.sum()/len(mask))/5))

First Stage: 0.6775170564651489
Second Stage: 0.8442233800888062
Expected Rating: 0.9420373027259684


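With the full text the model works a lot better: the first-stage summaries are fluent sentences rather than strings of keywords. We expect the next model to do better still.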

Next: use **bart-large-cnn**