## NLP: REVIEW GENERATOR WITH TRANSFORMERS

-----

The goal of this challenge is to generate review text (specifically, review titles) for a women's e-commerce clothing dataset.

* **Data source** : [kaggle link](https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews)

  * This dataset contains reviews written by anonymized customers of a real e-commerce retailer.
  * There are **ten columns**:
    - IDs:
      - **Clothing ID** _(int)_: categorical variable that refers to a specific piece of clothing.
    - CUSTOMER INFO:
      - **Age** _(+ int)_: reviewer's age.
    - REVIEW CORPUS:
      - **Title** _(str)_: review's title. <u>This is the text that we have to generate, the target.</u>
      - **Review Text** _(str)_: text of the review. <u>This is the main variable to take into account for NLP.</u>
    - RATINGS:
      - **Rating** _(+ int)_: ordinal variable for the product score.
      - **Recommended IND** _(bool)_: whether the customer recommends the product or not.
      - **Positive Feedback Count** _(+ int)_: the number of other customers who found the review positive.
    - CATEGORICAL VARIABLES:
      - **Division Name**: categorical name of the product's high-level division.
      - **Department Name**: categorical name of the product's department.
      - **Class Name**: categorical name of the product's class.

* **Stack**: 
  - The exercise has to be solved with [Hugging Face Transformers](https://github.com/huggingface/transformers)
--- 



## Imports and globals

We install the Hugging Face libraries (`transformers`, `datasets`) and the text and keyword-extraction helpers used below with pip.

In [3]:
!pip install transformers
!pip install datasets
!pip install textblob
!pip install multi_rake
!pip install keybert



In [4]:
import numpy as np
import random
import re
import os
import pandas as pd

import tensorflow as tf
import transformers 

from functools import reduce
from keybert import KeyBERT
from multi_rake import Rake
from textblob import TextBlob


## SOLUTION 1: TEXT GENERATION
----
----

---

### Inspecting the raw data

In [5]:
dataset_path = '/content/drive/MyDrive/satAI/week_06/challenge/data/Womens Clothing E-Commerce Reviews.csv'  
# NOTE: This path may change in different Drives

In [6]:
df = pd.read_csv(dataset_path, index_col=[0])
df.head(2)

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23486 entries, 0 to 23485
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              23486 non-null  int64 
 1   Age                      23486 non-null  int64 
 2   Title                    19676 non-null  object
 3   Review Text              22641 non-null  object
 4   Rating                   23486 non-null  int64 
 5   Recommended IND          23486 non-null  int64 
 6   Positive Feedback Count  23486 non-null  int64 
 7   Division Name            23472 non-null  object
 8   Department Name          23472 non-null  object
 9   Class Name               23472 non-null  object
dtypes: int64(5), object(5)
memory usage: 2.0+ MB


In [6]:
for col in df:
  print("{:<25} has \t{:<10} num unique values".format(col, len(df[col].value_counts().index)))

Clothing ID               has 	1206       num unique values
Age                       has 	77         num unique values
Title                     has 	13993      num unique values
Review Text               has 	22634      num unique values
Rating                    has 	5          num unique values
Recommended IND           has 	2          num unique values
Positive Feedback Count   has 	82         num unique values
Division Name             has 	3          num unique values
Department Name           has 	6          num unique values
Class Name                has 	20         num unique values



* The categorical columns [`division_name`, `department_name`, `class_name`] form a hierarchy, from the most general level to the most specific.
* `rating`, `recommended_ind` and `positive_feedback_count` can be read as weights of how important the review is.

#### CATEGORICAL DATA

In [7]:
pd.pivot_table(
    df, 
    index = 'Division Name', 
    columns = 'Department Name', 
    values= 'Clothing ID', 
    aggfunc='count',
    margins=True
)

Division Name \ Department Name,Bottoms,Dresses,Intimate,Jackets,Tops,Trend,All
General,2542.0,3730.0,,645.0,6837.0,96.0,13850
General Petite,1257.0,2589.0,233.0,387.0,3631.0,23.0,8120
Initmates,,,1502.0,,,,1502
All,3799.0,6319.0,1735.0,1032.0,10468.0,119.0,23472


In [8]:
for dept in df['Department Name'].value_counts().index:
    # list the class names that appear inside each department
    classes = list(df[df['Department Name']==dept]['Class Name'].value_counts().index)
    print(f"The classes in {dept} are: \t"
          f"\t {classes}"
    )

The classes in Tops are: 		 ['Knits', 'Blouses', 'Sweaters', 'Fine gauge']
The classes in Dresses are: 		 ['Dresses']
The classes in Bottoms are: 		 ['Pants', 'Jeans', 'Skirts', 'Shorts', 'Casual bottoms']
The classes in Intimate are: 		 ['Lounge', 'Swim', 'Sleep', 'Legwear', 'Intimates', 'Layering', 'Chemises']
The classes in Jackets are: 		 ['Jackets', 'Outerwear']
The classes in Trend are: 		 ['Trend']


For the categorical data:
* `division_name` doesn't really add new information.
* `class_name` might need to be reclassified, since some labels seem too specific while others are very general.
* The label that best summarizes the categorization of the clothing is `department_name`.
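
One way to check these observations is to count how many parents each level has. The snippet below is a minimal sketch on a made-up toy frame; it assumes the snake_case column names used in the notes above, and the values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy frame with the same column names as the dataset (values are made up).
toy = pd.DataFrame({
    "division_name":   ["General", "General Petite", "General", "Initmates", "General"],
    "department_name": ["Tops", "Tops", "Bottoms", "Intimate", "Bottoms"],
    "class_name":      ["Knits", "Knits", "Jeans", "Lounge", "Pants"],
})

# A class nests cleanly inside departments if it appears in exactly one department.
departments_per_class = toy.groupby("class_name")["department_name"].nunique()
class_nests_in_department = bool((departments_per_class == 1).all())

# A department that appears in several divisions is not nested inside a single division.
divisions_per_department = toy.groupby("department_name")["division_name"].nunique()

print(class_nests_in_department)
print(divisions_per_department.to_dict())
```

If a department shows up under more than one division, as `Tops` does in the toy data, the division is closer to a sizing line than to a true parent level.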

#### RATINGS AND ORDINAL DATA

In [9]:
pd.pivot_table(
    df, 
    index=["Department Name"], 
    values=["Rating", "Recommended IND"],
    aggfunc=['count', 'mean']
)

Department Name,count Rating,count Recommended IND,mean Rating,mean Recommended IND
Bottoms,3799,3799,4.28876,0.851277
Dresses,6319,6319,4.150815,0.808197
Intimate,1735,1735,4.280115,0.850144
Jackets,1032,1032,4.264535,0.83624
Tops,10468,10468,4.172239,0.815151
Trend,119,119,3.815126,0.739496


---

### Data Cleaning

* Let's make pythonic column names:

In [7]:
df.columns = df.columns.str.lower().str.replace(" ", "_")
df.columns.tolist()

['clothing_id',
 'age',
 'title',
 'review_text',
 'rating',
 'recommended_ind',
 'positive_feedback_count',
 'division_name',
 'department_name',
 'class_name']

* Drop all rows whose `review_text` is NaN, then split the remaining rows into those that have a `title` and those whose title is NaN:

In [8]:
df = df[~df['review_text'].isna()].copy(deep=True)


na_df = df[df['title'].isna()].copy(deep=True)
df = df[~df['title'].isna()].copy(deep=True)

df_text = df[['title', 'review_text']].copy(deep=False)

In [9]:
print(na_df.shape)
print(df.shape)

(2966, 10)
(19675, 10)


In [56]:
def normalize_text(txt:str):
  """ Normalizes all text from the df; this function is applied to every cell
  through applymap() 

  Args:
    txt (str): cell to transform
  
  Return:
    txt (str): transformed cell
  """
  pattern = r"\w+|\d+"
  errors_catched = [
    (' aded ', ' added '), ('hte', 'the'), (' it\'s ', ' it is '), ('mintue', 'minute'), 
    ('cagrcoal', 'charcoal'), ('reveiws', 'reviews'),
    # change size to their full names
    (' xxs ', ' extra extra small '), (' xs ', ' extra small '), (' s ', ' small '), (' m ', ' medium '),
    (' l ', ' large '), (' xl ', ' extra large '), (' xxl ', ' extra extra large '),
    # information that we interpret
    ('<3', 'i am in love with this'), ('10++++++', 'this is perfect'), ('a+++', 'this is perfect')
  ] 

  txt = str(txt).lower()   

  for tuples_text in errors_catched: 
    txt = txt.replace(*tuples_text)

  txt = " ".join(re.findall(pattern, txt))
  
  # there are Python libraries that do spelling autocorrection,
  # such as TextBlob or autocorrect, but they are not very precise for this corpus
  # ---------------------------------------------------------------------------
  # it takes a very long time and it is not very accurate
  # txtBlb = TextBlob(txt)
  # txtCorr = txtBlb.correct()

  return txt
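
The size replacements above rely on surrounding spaces, so a size at the end of a sentence or next to punctuation (`"xs,"`) is not expanded. A possible alternative, sketched here under assumptions, is a single regular expression with lookarounds. `SIZE_NAMES`, `SIZE_PATTERN` and `expand_sizes` are hypothetical names, not part of the notebook, and a lone `m` or `l` used in another sense would still be expanded.

```python
import re

# Hypothetical size table mirroring the size replacements in normalize_text.
SIZE_NAMES = {
    "xxs": "extra extra small", "xs": "extra small", "s": "small", "m": "medium",
    "l": "large", "xl": "extra large", "xxl": "extra extra large",
}

# Match a size only when it is not glued to a word character or an apostrophe,
# so "fits", "it's" and "i'm" are left untouched.
SIZE_PATTERN = re.compile(
    r"(?<![\w'])(" + "|".join(sorted(SIZE_NAMES, key=len, reverse=True)) + r")(?![\w'])"
)

def expand_sizes(text: str) -> str:
    """Lowercase the text and replace size abbreviations with their full names."""
    return SIZE_PATTERN.sub(lambda match: SIZE_NAMES[match.group(1)], text.lower())

print(expand_sizes("I ordered the XS, but the m fits"))
print(expand_sizes("it's a size s"))
```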

In [11]:
df_text = df_text.applymap(normalize_text)

In [12]:
df_text.head(2)

Unnamed: 0,title,review_text
2,some major design flaws,i had such high hopes for this dress and reall...
3,my favorite buy,i love love love this jumpsuit it is fun flirt...


* Let's see the length in characters of the titles, which are the text to be generated:

In [13]:
df_text['title'].apply(lambda s: len(s)).describe().to_dict()

{'25%': 12.0,
 '50%': 17.0,
 '75%': 24.0,
 'count': 19675.0,
 'max': 52.0,
 'mean': 18.529047013977127,
 'min': 2.0,
 'std': 9.39489280718898}

* There are titles with fewer than 3 characters after normalization, which is suspicious:

In [14]:
df.loc[df_text[df_text['title'].apply(len) < 3].index, :]

Unnamed: 0,clothing_id,age,title,review_text,rating,recommended_ind,positive_feedback_count,division_name,department_name,class_name
2781,868,25,No!!,I thought i would love this shirt because it g...,2,0,0,General,Tops,Knits
2834,1004,43,Eh...,This skirt looks/lays exactly like the photo. ...,2,0,6,General,Bottoms,Skirts
2876,573,34,Ok,I was so excited to receive this in the mail a...,3,0,4,General Petite,Trend,Trend
6462,854,29,Ok,I was not a fan. cute design but the sweater w...,3,0,1,General,Tops,Knits
6993,663,37,Ok,This dress is more of a lounge dress for at ho...,4,1,1,Initmates,Intimate,Lounge
8240,867,42,Eh...,Too floppy; light and comfortable material but...,3,0,0,General Petite,Tops,Knits
9669,828,30,Eh...,I didn't really understand the other reviews u...,2,1,0,General Petite,Tops,Blouses
11778,1022,38,Ag,Awesome jeans. cute and comfortable! size down...,5,1,0,General,Bottoms,Jeans
12144,862,54,No,This top did not look at all like it did on th...,3,0,6,General,Tops,Knits
12829,1100,39,Ok,"I like the dress on line: color flowers, cut, ...",4,1,0,General Petite,Dresses,Dresses


OK, it seems there are a few titles that may be a bit cryptic, but that is acceptable.

----

### Data Mining

Let's try to extract **keywords** from the rating system and the categorization. For this we are going to try:
 * Making __subdfs__ grouped by rating and recommendation.
 * Using KeyBERT to find the words that the algorithm highlights across all the texts of each subdf, joined into a single text.
 * Then, applying a function that adds the clothing category and whether it is a petite sizing or not.


 > *This is something that we haven't developed further. The idea was that using the keywords together with the review we could generate better titles, but we were not satisfied with the results; they seem a bit unreliable.*

In [18]:
df.columns

Index(['clothing_id', 'age', 'title', 'review_text', 'rating',
       'recommended_ind', 'positive_feedback_count', 'division_name',
       'department_name', 'class_name'],
      dtype='object')

In [19]:
indexes_by_rating = df[['rating', 'recommended_ind']].reset_index().groupby(['rating', 'recommended_ind']).agg(list).to_dict(orient='index')
indexes_by_rating = {k: v['index'] for k, v in indexes_by_rating.items()}

print({k: len(v) for k, v in indexes_by_rating.items()})

{(1, 0): 684, (1, 1): 7, (2, 0): 1280, (2, 1): 80, (3, 0): 1444, (3, 1): 1020, (4, 0): 146, (4, 1): 4143, (5, 0): 21, (5, 1): 10850}
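
The same mapping from `(rating, recommended_ind)` to row indexes can also be read from the `groups` attribute of a pandas `groupby`. This is a minimal sketch on a made-up toy frame, not the notebook's data; `indexes_by_group` is a hypothetical name.

```python
import pandas as pd

# Toy frame with made-up values and a non-trivial index.
toy = pd.DataFrame(
    {"rating": [5, 5, 1, 3, 5], "recommended_ind": [1, 1, 0, 1, 0]},
    index=[10, 11, 12, 13, 14],
)

# .groups maps each (rating, recommended_ind) pair to the index labels of its rows.
groups = toy.groupby(["rating", "recommended_ind"]).groups
indexes_by_group = {key: list(labels) for key, labels in groups.items()}

print({key: len(labels) for key, labels in indexes_by_group.items()})
```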


In [20]:
subdfs_by_rating = dict()

for k, l in indexes_by_rating.items():
  subdfs_by_rating[k] = df_text[df.index.isin(l)]

In [21]:
kw_model = KeyBERT(model='all-mpnet-base-v2')

In [22]:
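def generate_review_keywords(cell:str, num_kwds:int=5, rng_of_words:int=3):
  """ For every cell in the review text column,
  we extract the most significant keywords to use 
  for the training as weights

  Args:
    cell (str): the full review text to evaluate
    num_kwds (int): number of keyphrases to extract
    rng_of_words (int): maximum number of words per keyphrase

  Return:
    set with the individual words of the extracted keyphrases
  """

  kwargs = {
      'keyphrase_ngram_range': (1, rng_of_words),
      "stop_words":"english",
      "highlight":False ,
      "top_n": num_kwds
  }

  # extract_keywords returns a list of (keyphrase, score) tuples
  keywords = kw_model.extract_keywords(
    cell, **kwargs
  )

  # join the keyphrases with a space so that words from different
  # keyphrases are not glued together when num_kwds > 1
  kwds_list = " ".join(keyphrase for keyphrase, _ in keywords)

  return set(kwds_list.split())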



def extract_keywords(row):
  """ Extract keywords from the "division_name", "department_name", "class_name"
  columns to add to the keyword list
  """
  keywords = []
  if row['division_name'] == 'General Petite':
    keywords.append('petite')

  keywords.append(row['department_name'])
  keywords.append(row['class_name'])

  if len(keywords) > 0:
    
    keywords = ", ".join(set(map(lambda s: str(s).lower(), keywords)))
  else:
    keywords = ''

  return keywords
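
Because `extract_keywords` deduplicates with `set()`, the order of the resulting keywords can change between runs. A small sketch of an order-preserving variant follows; `category_keywords` is a hypothetical name, and unlike the function above it skips missing values instead of turning them into the string `"nan"`, which is an assumption about the desired behaviour.

```python
import pandas as pd

def category_keywords(row) -> str:
    """Build a comma-separated keyword string in a stable order."""
    candidates = []
    if row.get("division_name") == "General Petite":
        candidates.append("petite")
    candidates += [row.get("department_name"), row.get("class_name")]

    # Lowercase and skip missing values (None or NaN).
    lowered = [str(value).lower() for value in candidates if pd.notna(value)]

    # dict.fromkeys removes duplicates while keeping the first-seen order.
    return ", ".join(dict.fromkeys(lowered))

print(category_keywords(
    {"division_name": "General Petite", "department_name": "Dresses", "class_name": "Dresses"}
))
```

The function accepts a plain dictionary or a pandas row, since both provide `.get`.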

In [23]:
for k, subdf in subdfs_by_rating.items():

  # join every review of the group into a single text and extract its keywords once
  joined_text = " ".join(subdf['review_text'].values.tolist())

  # the keyword set is stored as a string and repeated on every row of the group
  subdf['kwds_1'] = str(generate_review_keywords(cell = joined_text, num_kwds=1))  

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  

* Mining keywords from the rating system (_kwds_1_) and from categorization (_kwds_2_) :

In [24]:
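import ast

df_kwds_1 = pd.concat(list(subdfs_by_rating.values())).sort_index()

# kwds_1 holds the string form of a set; literal_eval parses it without running arbitrary code
df_kwds_1['kwds_1'] = df_kwds_1['kwds_1'].apply(lambda s: ", ".join(ast.literal_eval(s)))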
df_kwds_1['kwds_2'] = df.sort_index()[['department_name', 'class_name', 'division_name']].apply(lambda r: extract_keywords(r), axis = 1)

df_kwds_1['kwds'] = df_kwds_1[['kwds_1', 'kwds_2']].apply(lambda row: row['kwds_1'] + ", "+ row['kwds_2'], axis = 1)

df_kwds_1.drop(columns=['kwds_1', 'kwds_2'], inplace=True)
df_kwds_1.head()

Unnamed: 0,title,review_text,kwds
2,some major design flaws,i had such high hopes for this dress and reall...,"problem, dress, fit, dresses"
3,my favorite buy,i love love love this jumpsuit it is fun flirt...,"dress, fits, review, petite, bottoms, pants"
4,flattering shirt,this shirt is very flattering to all due to th...,"dress, fits, review, tops, blouses"
5,not for the very petite,i love tracy reese dresses but this one is not...,"fit, dress, nicely, dresses"
6,charcoal shimmer fun,i added this in my basket at the last minute t...,"dress, fits, review, tops, petite, knits"


In [25]:
# minimum number of keywords found by both functions:
df_kwds_1['kwds'].str.split(', ').apply(len).min()

3

We finally decided to stop this here.

----
Text Summarizer: 

---

### Text2Text Model with GPT2: Prepare data

We use GPT-2 to generate titles from the reviews, fine-tuning it on our corpus with the `Trainer` class from the `transformers` library.

* Globals and Imports:

In [32]:
from transformers import pipeline, GPT2Tokenizer, GPT2Model, GPT2LMHeadModel, GPT2Config
from transformers import AutoTokenizer, AutoConfig, AutoModelForPreTraining

MODEL = "gpt2"  # {gpt2, gpt2-medium, gpt2-large, gpt2-xl}

SPECIAL_TOKENS  = { 
    "bos_token": "<|BOS|>",  # special token representing the beginning of a sentence
    "eos_token": "<|EOS|>",  # special token representing the end of a sentence
    "unk_token": "<|UNK|>",                    
    "pad_token": "<|PAD|>",
    "sep_token": "<|SEP|>"   # special token separating two different sentences in the same input
}

SEED = 42
TRAIN_SIZE = 0.8
MAXLEN     = 75

# set seed
random.seed(SEED)
os.environ['PYTHONHASHSEED'] = str(SEED)
np.random.seed(SEED)

* Tokenizer, Config and Model:

In [16]:
# get tokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.add_special_tokens(SPECIAL_TOKENS)

5

In [44]:
# get config and model
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

config = AutoConfig.from_pretrained(
    MODEL, 
    pad_token_id=tokenizer.eos_token_id,
    output_hidden_states=False
)

model = AutoModelForPreTraining.from_pretrained(MODEL, config=config).to(device)
model.resize_token_embeddings(len(tokenizer))

Embedding(50262, 768)

* Transform the train and test dictionaries into Dataset() objects

In [18]:
# prepare the data for training:
# we have a dataset with titles and reviews,
# so we divide it into train and test by row position (no shuffling)

split_index = int(df_text.shape[0]*TRAIN_SIZE)
df_train, df_test = df_text.iloc[:split_index, :], df_text.iloc[split_index: , :]

def dataset_to_dict(frame):
  """ Transforms the dataframe into a dictionary such as.
     | indx | title | review_text |  - > {ind: [title, review_text]}
    
  Args:
    frame (pd.core.Dataframe)
  
  Returns:
    Dictionary
  """
  cols = frame.columns.tolist()
  data = frame.to_dict(orient='index')

  for i in data.keys():
    data[i] = [data[i][col] for col in cols]
  
  return data

In [19]:
train = dataset_to_dict(frame=df_train)
test = dataset_to_dict(frame=df_test)

In [20]:
print(len(train))
print(len(test))

15740
3935


In [21]:
import torch
from torch.utils.data import Dataset

class myDataset(Dataset):

    def __init__(self, data, tokenizer, randomize=True):

        ids, title, review = self.get_args(data)

        self.randomize = randomize
        self.tokenizer = tokenizer 
        self.title     = title
        self.review    = review

    #---------------------------------------------#

    @staticmethod
    def get_args(data):
      
      transpose_listofiters = lambda iterbl: map(list, zip(*iterbl))

      ids, data = list(transpose_listofiters(data.items()))
      title, review = list(transpose_listofiters(data))

      return ids, title, review    

    #---------------------------------------------#

    def __len__(self):
        return len(self.review)

    #---------------------------------------------#
    
    def __getitem__(self, i):        

        # first we put the input (review) and then the text to generate (title)
        text = SPECIAL_TOKENS['bos_token'] + self.review[i] + \
               SPECIAL_TOKENS['sep_token'] + self.title[i] + \
               SPECIAL_TOKENS['eos_token']            

        # use the tokenizer stored in the instance instead of the global one
        encodings_dict = self.tokenizer(
            text,
            truncation=True,                    
            max_length=MAXLEN, 
            padding="max_length"
        )   
        
        input_ids = encodings_dict['input_ids']
        attention_mask = encodings_dict['attention_mask']
        
        # for causal language modelling the labels are the input ids themselves
        return {'label': torch.tensor(input_ids),
                'input_ids': torch.tensor(input_ids), 
                'attention_mask': torch.tensor(attention_mask)}
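
With `truncation=True` and `MAXLEN = 75`, a long review fills the whole window, so the separator, the title and the end token can be cut off and the model never sees the target for that sample. One possible remedy, sketched below on plain lists of token ids, is to truncate only the review and reserve room for the rest. `build_example` is a hypothetical helper and the ids are illustrative; with the real tokenizer the review and title ids could come from `tokenizer(text, add_special_tokens=False)["input_ids"]`.

```python
def build_example(review_ids, title_ids, bos_id, sep_id, eos_id, pad_id, max_len):
    """Build <BOS> review <SEP> title <EOS>, truncating only the review."""
    reserved = len(title_ids) + 3              # BOS, SEP and EOS
    if reserved > max_len:
        raise ValueError("the title alone does not fit in max_len")

    room_for_review = max_len - reserved
    ids = [bos_id] + list(review_ids[:room_for_review]) + [sep_id] + list(title_ids) + [eos_id]

    # Pad on the right and mark the padding in the attention mask.
    padding = max_len - len(ids)
    attention_mask = [1] * len(ids) + [0] * padding
    return ids + [pad_id] * padding, attention_mask

# Illustrative ids only: a 100-token "review" and a 3-token "title".
ids, mask = build_example(
    review_ids=list(range(100, 200)), title_ids=[7, 8, 9],
    bos_id=50257, sep_id=50261, eos_id=50258, pad_id=50260, max_len=20,
)
print(ids[-5:])
```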

In [34]:
train_dataset = myDataset(train, tokenizer)
test_dataset  = myDataset(test,  tokenizer)

In [36]:
train_dataset.__getitem__(5)

{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1]),
 'input_ids': tensor([50257,    72,  6149,   428,   287,  6588,   329,  3650,  2298,   510,
           290,   550,   257,  5680,   286,  3404,   355,  1464,   284,  1949,
           319,   290,   973,   428,  1353,   284,  5166, 42370,   290, 12581,
          2279,  1816,   351,   340,   262,  3124,   318,  1107,  3621, 33512,
           351, 34493,   290,  1816,   880,   351, 21613, 42370, 30239, 12581,
          3503,   616,   691,   552,  2913,   318,   340,   318,   257,  1643,
          1263, 27409,   389,   890,   290,   340,  1595,   256,   467,   287,
          4273,   578,   635,   257,  1643]),
 'label': tensor([50257,    72,  6149,   428,   287,  6588,   329,  3650,  2298,   510,
           290,   5

----

### Text2Text Model with GPT2: Fine-tune using Trainer

Documentation: [huggingface/main_classes](https://huggingface.co/docs/transformers/main_classes/trainer)


* Imports and globals:

In [50]:
from transformers import Trainer, TrainingArguments

In [51]:
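EPOCHS = 4
TRAIN_BATCHSIZE = 2     # samples per device in each forward pass
BATCH_UPDATE = 32       # gradient accumulation steps: 2 * 32 = 64 samples per optimizer update
EPS  = 1e-8             # Adam epsilon
LR   = 5e-4             # learning rate
WARMUP_STEPS = 100      # number of warmup steps, as an integer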

* Training time!

In [52]:
%%time

training_args = TrainingArguments(
    seed = SEED,
    output_dir="/content/",
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=TRAIN_BATCHSIZE,
    per_device_eval_batch_size=TRAIN_BATCHSIZE,
    gradient_accumulation_steps=BATCH_UPDATE,
    evaluation_strategy="steps",
    warmup_steps=WARMUP_STEPS,    
    learning_rate=LR,
    adam_epsilon=EPS,
    weight_decay=0.01, 
    save_total_limit=1,
    load_best_model_at_end=True,
)

#---------------------------------------------------#

trainer = Trainer(
    model=model,
    args=training_args,    
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer
)

#---------------------------------------------------#
trainer.train()
trainer.save_model()

***** Running training *****
  Num examples = 15740
  Num Epochs = 4
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 32
  Total optimization steps = 980


Step,Training Loss,Validation Loss
500,4.0433,2.535551


***** Running Evaluation *****
  Num examples = 3935
  Batch size = 2
Saving model checkpoint to /content/checkpoint-500
Configuration saved in /content/checkpoint-500/config.json
Model weights saved in /content/checkpoint-500/pytorch_model.bin
tokenizer config file saved in /content/checkpoint-500/tokenizer_config.json
Special tokens file saved in /content/checkpoint-500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from /content/checkpoint-500 (score: 2.5355513095855713).
Saving model checkpoint to /content/
Configuration saved in /content/config.json
Model weights saved in /content/pytorch_model.bin
tokenizer config file saved in /content/tokenizer_config.json
Special tokens file saved in /content/special_tokens_map.json


CPU times: user 32min 30s, sys: 21.4 s, total: 32min 51s
Wall time: 34min 21s
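
The figures in the training log can be reproduced by hand. The sketch below assumes a single device and that incomplete accumulation groups at the end of an epoch are not counted, which matches the 980 steps reported above.

```python
import math

num_examples = 15740      # "Num examples" in the log
per_device_batch = 2      # TRAIN_BATCHSIZE
grad_accum = 32           # BATCH_UPDATE
epochs = 4                # EPOCHS

# Batches seen in one epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(num_examples / per_device_batch)
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)

effective_batch = per_device_batch * grad_accum
total_steps = updates_per_epoch * epochs

print(effective_batch, updates_per_epoch, total_steps)  # 64 245 980
```

With 980 steps in total and an evaluation every 500 steps, only step 500 is evaluated, which is why the results table has a single row.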


In [53]:
# Save to G-Drive ----------------------------------#
!cp -r 'pytorch_model.bin' '/content/drive/MyDrive/satAI/week_06/challenge/pytorch_model_v03.bin'

----

### Text2Text Model with GPT2: Generation

* Import model if connection lost

In [37]:
!cp -r '/content/drive/MyDrive/satAI/week_06/challenge/pytorch_model_v03.bin' 'pytorch_model.bin' 

In [46]:
model.load_state_dict(torch.load('pytorch_model.bin'))

<All keys matched successfully>

* Let's pick a review without title and see what the model does

In [39]:
review = "I got this in the petite length, size o, and it fit just right. i like that i didn't have to have it altered in the length; can wear with flats with plenty of clearance to the floor from the bottom hem. my only beef with the design is the height of the waist. i personally think that the elastic waistband looks cheap, and really needs to be concealed with a belt, yet because it sits so high, literally right under the bustline, it's a tricky one to pull off"

In [40]:
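def preprare_input(review:str, device):
  """ Prepares a string to be fed into the trained model to get
  possible titles.

  Args:
    review (str): input string
    device (torch.device): device where the model lives
  
  Return:
    generated (torch.Tensor): token ids of the prompt, with shape (1, sequence_length)
  """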

  prompt = SPECIAL_TOKENS['bos_token'] + review + SPECIAL_TOKENS['sep_token']
  
  generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)
  generated = generated.to(device)

  return generated

In [47]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
generated = preprare_input(review=review, device=device)  


In [48]:
torch.cuda.is_available()

True

In [49]:
generation_args = {
    "do_sample": True,            # Sample the next token instead of decoding greedily
    "num_return_sequences":5,     # Generate five candidate titles
    "min_length":len(review)+20,  # Min total length (prompt + title) in tokens; note len(review) counts characters
    "max_length":len(review)+120, # Max total length (prompt + title) in tokens
    "top_k": 35,                  # Sample only among the 35 most probable next tokens
    "top_p": 0.75,                # Nucleus sampling: keep the smallest token set with cumulative probability >= 0.75
    "temperature": 0.9,           # Values below 1.0 sharpen the distribution (the default is 1.0)
    "repetition_penalty":5.       # Penalty factor for tokens that already appear in the sequence
}

outputs = model.generate(
    generated, **generation_args
)
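
To make `top_k` and `top_p` more concrete, here is a minimal NumPy sketch of the two filters applied to a made-up next-token distribution. It illustrates the idea only and is not the library's implementation; `top_k_top_p_filter` and `toy_probs` are hypothetical names.

```python
import numpy as np

def top_k_top_p_filter(probs, top_k, top_p):
    """Return the renormalized distribution left after top-k and then top-p filtering."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]            # token ids, most probable first

    # top-k: keep only the k most probable tokens and renormalize
    kept = order[:top_k]
    kept_probs = probs[kept] / probs[kept].sum()

    # top-p: keep the shortest prefix whose cumulative probability reaches top_p
    cumulative = np.cumsum(kept_probs)
    nucleus_size = int(np.searchsorted(cumulative, top_p)) + 1
    kept = kept[:nucleus_size]

    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept] / probs[kept].sum()
    return filtered

toy_probs = [0.5, 0.3, 0.1, 0.05, 0.05]        # made-up next-token distribution
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.75)
print(filtered)
```

In this toy case only the two most probable tokens survive, and sampling then draws from those two with probabilities 0.625 and 0.375.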

In [50]:
print(f"for review description: {review}\n")

for i, sample_output in enumerate(outputs):
    text = tokenizer.decode(sample_output, skip_special_tokens=True)

    # the decoded text starts with the review, so we drop its first len(review) characters
    a = len(review)       
    print("\t{}: {}".format(i+1,  text[a:]))

for review description: I got this in the petite length, size o, and it fit just right. i like that i didn't have to have it altered in the length; can wear with flats with plenty of clearance to the floor from the bottom hem. my only beef with the design is the height of the waist. i personally think that the elastic waistband looks cheap, and really needs to be concealed with a belt, yet because it sits so high, literally right under the bustline, it's a tricky one to pull off

	1: pretty dress but not great on me
	2: great dress for summer
	3: lovely
	4: perfect for fall
	5: just okay


In [58]:
na_df_text = na_df[['review_text']].applymap(normalize_text)

In [59]:
res = dict()

for i, review in enumerate(na_df_text['review_text'].iloc[0:10].values):
  text_review = []

  gen_review = preprare_input(
      review=review, 
      device=device
  )  
  output_review = model.generate(
    gen_review, **generation_args
  )

  for j, sample_output in enumerate(output_review):
    text_rev = tokenizer.decode(
        sample_output, 
        skip_special_tokens=True
      )
    
    text_review.append(text_rev[len(review):])
  
  res[i] = {'review': review, 'titles': text_review}

In [61]:
res[4]

{'review': 'this is a comfortable skirt that can span seasons easily while not the most exciting design it is a good work skirt that can be paired with many tops',
 'titles': ['comfortable and unique',
  'a nice basic for spring',
  'comfy and sexy',
  'comfortable and versatile',
  'comfortable and stylish']}

In [62]:
import json

with open('text_gen.txt', 'w+') as file:
     file.write(json.dumps(res))

## SOLUTION 2: TEXT SUMMARIZING
----
----

In [66]:
dataset_path = '/content/drive/MyDrive/satAI/week_06/challenge/data/Womens Clothing E-Commerce Reviews.csv'  
df = pd.read_csv(dataset_path, index_col=[0])
# NOTE: This path may change in different Drives

---

### **Data preprocessing**

Includes some steps to prepare the data for the task. So far, the preprocessing includes:

- Filtering the DataFrame to remove records whose title has n words or fewer, in order to avoid uninformative titles.
- Cleaning the text by removing special characters (anything that is not a letter, digit or underscore) and lowercasing it. This step is defined below but is currently commented out in the pipeline.

In [63]:
def filter_by_title_words(dataframe: pd.DataFrame, min_words: int) -> pd.DataFrame:
    """Keep only the rows whose title has more than `min_words` words."""

    # count the words without adding a helper column to the caller's DataFrame
    title_words = dataframe["title"].apply(lambda x: len(x.split()))

    return dataframe[title_words > min_words].copy()
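
Note that the comparison is strict, so `min_words=3` keeps only titles with four or more words. A quick check on a made-up toy frame, using the same word-count rule:

```python
import pandas as pd

# Toy titles with 1, 3 and 4 words (values are made up).
toy = pd.DataFrame({"title": ["Ok", "My favorite buy", "Some major design flaws"]})

# Same rule as filter_by_title_words: keep titles with MORE than min_words words.
min_words = 3
word_counts = toy["title"].str.split().str.len()
kept = toy[word_counts > min_words]

print(kept["title"].tolist())  # ['Some major design flaws']
```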

In [64]:
def remove_special_characters(dataframe: pd.DataFrame) -> pd.DataFrame:

    def filter_by_regex(string: str) -> str:
        string = re.sub(r'[^\w]', ' ', string)
        string = re.sub(r'\s+', ' ', string)
        string = string.lower()
        return string
    
    dataframe["title"] = dataframe["title"].apply(filter_by_regex)
    dataframe["review_text"] = dataframe["review_text"].apply(filter_by_regex)
    
    return dataframe

In [65]:
def normalize_text(text: str) -> str:
  """Normalize the text to correct spelling mistakes.
  
  This function is applied to each cell of a column through apply().

  Args:
    text (str): text to transform

  Return:
    text (str): transformed text
  """

  # Spelling corrections.
  errors_catched = [
    (' aded ', ' added '), 
    ('hte', 'the'), 
    (" it's ", ' it is '), 
    ('mintue', 'minute'), 
    ('cagrcoal', 'charcoal'), 
    ('reveiws', 'reviews'),
    
    # Change size to their full names.
    (' xxs ', ' extra extra small '), 
    (' xs ', ' extra small '), 
    (' s ', ' small '), 
    (' m ', ' medium '),
    (' l ', ' large '), 
    (' xl ', ' extra large '),
    (' xxl ', ' extra extra large '),
    
    # Information that we interpret
    ('<3', 'i am in love with this'), 
    ('10++++++', 'this is perfect'), 
    ('a+++', 'this is perfect'),
  ]

  for tuples_text in errors_catched: 
      text = text.replace(*tuples_text)

  # There are some Python libraries that do spelling autocorrection,
  # such as TextBlob or Autocorrect, but they are not very precise for this corpus.
  # ---------------------------------------------------------------------------
  # It takes a very long time and it is not very accurate.
  #   txtBlb = TextBlob(text)
  #   txtCorr = txtBlb.correct()

  return text

In [67]:
# Rename DataFrame columns and drop empty records for title or review.
df.columns = [column.lower().replace(" ", "_") for column in df.columns]
df.dropna(subset=["title", "review_text"], inplace=True)
df.reset_index(inplace=True, drop=True)

In [68]:
# Keep only the title and review columns and preprocess them.
data = df.loc[:, ["title", "review_text"]].copy()
data = filter_by_title_words(data, 3)
# data = remove_special_characters(data)
data["title"] = data["title"].apply(normalize_text)
data["review_text"] = data["review_text"].apply(normalize_text)
data.reset_index(inplace=True, drop=True)

In [69]:
data.head(n=10)

Unnamed: 0,title,review_text
0,Some major design flaws,I had such high hopes for this dress and reall...
1,Not for the very petite,"I love tracy reese dresses, but this one is no..."
2,"Shimmer, surprisingly goes with lots","I ordered this in carbon for store pick up, an..."
3,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the small peti..."
4,Dress looks like it is made of cheap material,Dress runs small esp where the zipper area run...
5,Pretty party dress with some issues,This is a nice choice for holiday gatherings. ...
6,"Nice, but not for my body",I took these out of the package and wanted the...
7,"You need to be at least average height, or taller",Material and color is nice. the leg opening i...
8,Looks great with white pants,Took a chance on this blouse and so glad i did...
9,Super cute and cozy,"A flattering, super cozy coat. will work well..."


### **An example of what we are trying to achieve: Summarization**

We are going to use a pretrained summarization model to generate an abstractive summary of each review, and we will treat that summary as the review's title.

Reference [here](https://huggingface.co/docs/transformers/tasks/summarization).

In [70]:
review = data["review_text"][0]
review

'I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c'

In [72]:
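from transformers import pipeline

model_name = "google/pegasus-xsum"

# Build a summarization pipeline around the pretrained checkpoint.
summarizer = pipeline("summarization", model=model_name)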

In [73]:
summary = summarizer(review)

In [74]:
print(summary[0]["summary_text"])

This dress is so small that i could not zip it up!


### **Training our own summarizer model**

In [79]:
import datasets

from transformers import (
    AdamWeightDecay,
    AutoTokenizer, 
    create_optimizer, 
    DataCollatorForSeq2Seq, 
    pipeline, 
    TFAutoModelForSeq2SeqLM
)

In [80]:
# Convert the DataFrame to a HuggingFace Dataset.
dataset = datasets.Dataset.from_pandas(data)
dataset

Dataset({
    features: ['title', 'review_text'],
    num_rows: 7185
})

In [81]:
# Import and instantiate the tokenizer.
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [82]:
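def preprocess_function(examples):
    # Build the inputs from the batch that `map` passes in (not from the global
    # DataFrame) and prepend the T5 task prefix.
    inputs = ["summarize: " + doc for doc in examples["review_text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenize the titles as targets; they become the labels.
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["title"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]

    return model_inputs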

In [83]:
# Tokenize the original dataset.
tokenized_dataset = dataset.map(preprocess_function, batch_size=None, batched=True)

  0%|          | 0/1 [00:00<?, ?ba/s]

In [84]:
# Split into train and test.
tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.2)

In [85]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model_name, return_tensors="tf")
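
To make the collator's role concrete, here is a minimal pure-Python sketch of dynamic padding. It assumes the usual convention that labels are padded with `-100` so that padded positions are ignored by the loss. `pad_batch` and the toy token ids are hypothetical and only illustrate the idea; they are not the library's implementation.

```python
from typing import Dict, List


def pad_batch(
    batch: List[Dict[str, List[int]]],
    pad_token_id: int = 0,           # pad token id of t5-small
    label_pad_token_id: int = -100,  # assumed "ignore" value for the loss
) -> Dict[str, List[List[int]]]:
    """Pad every example to the longest sequence in the batch (illustrative only)."""
    max_input = max(len(ex["input_ids"]) for ex in batch)
    max_label = max(len(ex["labels"]) for ex in batch)

    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        n_pad = max_input - len(ex["input_ids"])
        padded["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        padded["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * n_pad)
        n_label_pad = max_label - len(ex["labels"])
        padded["labels"].append(ex["labels"] + [label_pad_token_id] * n_label_pad)
    return padded


# Toy token ids, made up for the example.
toy_batch = [
    {"input_ids": [21, 7, 1], "labels": [5, 1]},
    {"input_ids": [21, 1], "labels": [9, 4, 1]},
]
padded = pad_batch(toy_batch)
print(padded)
```

Padding per batch rather than to a global maximum keeps short reviews from being padded out to the length of the longest review in the dataset.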

In [86]:
tf_train_set = tokenized_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_test_set = tokenized_dataset["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

In [87]:
# Prepare the model.
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")

In [88]:
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour, please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [89]:
# Train the model.
model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f2cbf39de50>

### **Let's try the model!**

In [90]:
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)

In [91]:
review

'I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c'

In [92]:
summarizer(review)[0]["summary_text"]

Your max_length is set to 200, but you input_length is only 132. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=66)


"So small that i couldn't zip up - but a big flaw - the net over layer sewn directly into the zipper"
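
The warning above suggests a `max_length` of roughly half the input length, and the generated text is still long for a title. One possible way to pick a tighter budget is sketched below. `title_max_length`, the cap of 32 tokens and the floor of 4 are assumptions made for illustration, not values taken from the notebook.

```python
def title_max_length(input_length: int, cap: int = 32, floor: int = 4) -> int:
    """Pick a generation budget: half the input length, clamped to [floor, cap]."""
    return max(floor, min(cap, input_length // 2))


# 132 is the input length reported in the warning above.
print(title_max_length(132))
print(title_max_length(6))
```

The result could then be passed as `summarizer(review, max_length=..., min_length=...)`. Passing a small `min_length` as well is probably needed, since the checkpoint's summarization defaults may set a minimum that is longer than a typical title.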

In [93]:
review = data["review_text"][5]
review

'This is a nice choice for holiday gatherings. i like that the length grazes the knee so it is conservative enough for office related gatherings. the size small fit me well - i am usually a size 2/4 with a small bust. in my opinion it runs small and those with larger busts will definitely have to size up (but then perhaps the waist will be too big). the problem with this dress is the quality. the fabrics are terrible. the delicate netting type fabric on the top layer of skirt got stuck in the zip'

In [94]:
summarizer(review)[0]["summary_text"]

Your max_length is set to 200, but you input_length is only 123. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


'Great choice for office related gatherings - but not for big busts . the fabric on the top layer of skirt got stuck in zip'
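
Eyeballing two examples only goes so far. A rough way to compare a generated title with the customer's original title is a unigram-overlap F1, in the spirit of ROUGE-1. The sketch below is a simplified stand-in (no stemming, naive tokenization) and `rouge1_f1` is a hypothetical helper, not a library function.

```python
import re
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference title and a generated one."""
    ref_tokens = re.findall(r"[a-z0-9']+", reference.lower())
    cand_tokens = re.findall(r"[a-z0-9']+", candidate.lower())
    if not ref_tokens or not cand_tokens:
        return 0.0

    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Original title of review 5 versus the title generated above.
reference = "Pretty party dress with some issues"
generated = (
    "Great choice for office related gatherings - but not for big busts . "
    "the fabric on the top layer of skirt got stuck in zip"
)
print(round(rouge1_f1(reference, generated), 3))
```

A low score does not necessarily mean a bad title, because a good title can use entirely different words from the customer's own. The number is more useful for comparing models or settings on the same test split than as an absolute measure.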

In [95]:
# Don't take this cell so seriously... :0

def write_review_titles(filename: str, dataframe: pd.DataFrame, n: int, sep="*****"):
    """Write the first n reviews and their generated titles, one pair per line."""
    with open(filename, "w") as file:
        for review in dataframe["review_text"].head(n):
            title = summarizer(review)[0]["summary_text"]
            file.write(review + sep + title + "\n")

In [97]:
write_review_titles("summarized_titles.txt", data, n=10)

Your max_length is set to 200, but you input_length is only 132. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=66)
Your max_length is set to 200, but you input_length is only 129. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=64)
Your max_length is set to 200, but you input_length is only 138. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=69)
Your max_length is set to 200, but you input_length is only 114. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=57)
Your max_length is set to 200, but you input_length is only 79. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=39)
Your max_length is set to 200, but you input_length is only 123. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)
Your max_length is set to 200, but you input_length is only 128. You might co
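
The `*****` separator works as long as no review contains it. A slightly more robust variant writes proper CSV, so commas, quotes and newlines inside reviews are escaped. This is a sketch under assumptions: `write_titles_csv` and `fake_title` are hypothetical names, and `fake_title` merely stands in for the summarizer so the example runs without a model.

```python
import csv
import io
from typing import Callable, Iterable, TextIO


def write_titles_csv(
    out: TextIO,
    reviews: Iterable[str],
    make_title: Callable[[str], str],
    n: int,
) -> int:
    """Write up to n (review, generated title) rows as CSV; return the row count."""
    writer = csv.writer(out)
    writer.writerow(["review_text", "generated_title"])
    written = 0
    for review in reviews:
        if written >= n:
            break
        writer.writerow([review, make_title(review)])
        written += 1
    return written


def fake_title(text: str) -> str:
    """Stand-in for the summarizer: first sentence, cut to 40 characters."""
    return text.split(".")[0][:40]


buffer = io.StringIO()
count = write_titles_csv(
    buffer,
    ["Nice dress. Runs small, though.", "Soft, cozy coat"],
    fake_title,
    n=10,
)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(count, rows)
```

With the real model, `make_title` could be something like `lambda r: summarizer(r)[0]["summary_text"]`, and `out` a file opened with `newline=""`.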

In [98]:
# Save the model locally.
model.save("summary_model")



INFO:tensorflow:Assets written to: summary_model/assets
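
`model.save` writes a Keras SavedModel, which may not be loadable with `from_pretrained`. If the goal is to reload the fine-tuned summarizer later through the transformers API, a possible alternative is `save_pretrained` on both the model and the tokenizer. The helper names and the output directory below are hypothetical, so treat this as a sketch rather than the notebook's method.

```python
from pathlib import Path


def save_summarizer(model, tokenizer, out_dir: str) -> Path:
    """Write model weights/config and tokenizer files into one directory."""
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(str(path))
    tokenizer.save_pretrained(str(path))
    return path


def load_summarizer(out_dir: str):
    """Rebuild a summarization pipeline from a directory written by save_summarizer."""
    # Imported lazily so that defining these helpers does not require transformers.
    from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM, pipeline

    tokenizer = AutoTokenizer.from_pretrained(out_dir)
    model = TFAutoModelForSeq2SeqLM.from_pretrained(out_dir)
    return pipeline("summarization", model=model, tokenizer=tokenizer)
```

Usage would look like `save_summarizer(model, tokenizer, "summary_model_hf")` after training, and `summarizer = load_summarizer("summary_model_hf")` in a fresh session.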


