In [105]:
# loading the required libraries
pckgs <- c("textclean","keras","stringr","tm","qdap")
lapply(pckgs, library,character.only = TRUE ,quietly = T)

In [106]:
# loading required columns from input data
reviews <- read.csv("data/Reviews.csv",nrows = 10000)[,c('Text', 'Summary')]
head(reviews) 

Text,Summary
I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.,Good Quality Dog Food
"Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as ""Jumbo"".",Not as Advertised
"This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' ""The Lion, The Witch, and The Wardrobe"" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.","""Delight"" says it all"
If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal.,Cough Medicine
"Great taffy at a great price. There was a wide assortment of yummy taffy. Delivery was very quick. If your a taffy lover, this is a deal.",Great taffy
"I got a wild hair for taffy and ordered this five pound bag. The taffy was all very enjoyable with many flavors: watermelon, root beer, melon, peppermint, grape, etc. My only complaint is there was a bit too much red/black licorice-flavored pieces (just not my particular favorites). Between me, my kids, and my husband, this lasted only two weeks! I would recommend this brand of taffy -- it was a delightful treat.",Nice Taffy


In [107]:
# keeping only those rows which have both text and summary information in the data
reviews <- reviews[complete.cases(reviews),]
rownames(reviews) <- 1:nrow(reviews)

In [108]:
# converting the Text and Summary columns to character datatypes
reviews$Text <- as.character(reviews$Text)
reviews$Summary <- as.character(reviews$Summary)

In [109]:
# cleaning data
clean_data <- function(data,remove_stopwords = TRUE){
 data <- tolower(data)
 data = replace_contraction(data)
 data = gsub('<br />', '', data)
 data = gsub('[[:punct:] ]+',' ',data)
 data = gsub("[^[:alnum:]\\-\\.\\s]", " ", data)
 data = gsub('&amp;', '', data)
 data = if(remove_stopwords == "TRUE"){paste0(unlist(rm_stopwords(data,tm::stopwords("english"))),collapse = " ")}else{data}
 data = gsub('\\.', "", data)
 data = gsub('\\s+', " ", data)
 return(data)

}

cleaned_text <- unlist(lapply(reviews$Text,clean_data,remove_stopwords = TRUE))
cleaned_summary <- unlist(lapply(reviews$Summary,clean_data,remove_stopwords = FALSE))

# Adding cleaned reviews and their summaries in a dataframe
cleaned_reviews <- data.frame("Cleaned_Text"= cleaned_text,"Cleaned_Summary"= cleaned_summary)

# Converting the Text and Summary columns to character datatypes
cleaned_reviews$Cleaned_Text <- as.character(cleaned_reviews$Cleaned_Text)
cleaned_reviews$Cleaned_Summary <- as.character(cleaned_reviews$Cleaned_Summary)
head(cleaned_reviews)

Cleaned_Text,Cleaned_Summary
bought several vitality canned dog food products found good quality product looks like stew processed meat smells better labrador finicky appreciates product better,Good quality dog food
product arrived labeled jumbo salted peanuts peanuts actually small sized unsalted sure error vendor intended represent product jumbo,Not as advertised
confection around centuries light pillowy citrus gelatin nuts case filberts cut tiny squares liberally coated powdered sugar tiny mouthful heaven chewy flavorful highly recommend yummy treat familiar story c s lewis lion witch wardrobe treat seduces edmund selling brother sisters witch,Delight says it all
looking secret ingredient robitussin believe found got addition root beer extract ordered good made cherry soda flavor medicinal,Cough medicine
great taffy great price wide assortment yummy taffy delivery quick taffy lover deal,Great taffy
got wild hair taffy ordered five pound bag taffy enjoyable many flavors watermelon root beer melon peppermint grape etc complaint bit much red black licorice flavored pieces just particular favorites kids husband lasted two weeks recommend brand taffy delightful treat,Nice taffy


In [110]:
# putting start and end tokens to signal the start and end of the sequences respectively in the summary
cleaned_reviews[,"Cleaned_Summary"] <- sapply(X = cleaned_reviews[,2],FUN = function(X){paste0("<start> ",X," <end>")})

In [111]:
# fixing the maximum length of the reviews and the summary sequences
max_length_text = 110

max_length_summary = 10

In [112]:
# function for tokenization
tokenization <- function(lines){
    tokenizer = text_tokenizer()
    tokenizer =  fit_text_tokenizer(tokenizer,lines)
    return(tokenizer)
}

In [113]:
# preparing a tokenizer on text data and calculating the vocabulary size of the text data

x_tokenizer <- tokenization(cleaned_reviews$Cleaned_Text)
x_tokenizer$word_index[1:5]

x_voc_size   =  length(x_tokenizer$word_index) +1
print(paste0('Xtrain vocabulary size:',x_voc_size))

[1] "Xtrain vocabulary size:19347"


In [114]:
# preparing a tokenizer on summary data and calculating the vocabulary size of the summary data
y_tokenizer <- tokenization(cleaned_reviews$Cleaned_Summary)
y_tokenizer$word_index[1:5]

y_voc_size   =  length(y_tokenizer$word_index) +1
print(paste0('Ytrain data vocabulary size:',y_voc_size))

[1] "Ytrain data vocabulary size:4565"


In [115]:
# function for encoding and padding the sequences

encode_pad_sequences <- function(tokenizer, length, lines){
    # Encoding text to integers
    seq = texts_to_sequences(tokenizer,lines)
    # Padding text to maximum length sentence
    seq = pad_sequences(seq, maxlen=length, padding='post')
    return(seq)
}

In [116]:
# splitting the data into training and testing datasets
sample_size <- floor(0.80 * nrow(cleaned_reviews))

## set the seed to make the partition reproducible
set.seed(0)
train_indices <- sample(seq_len(nrow(cleaned_reviews)), size = sample_size)

x_train <- cleaned_reviews[train_indices,"Cleaned_Text"]
y_train <- cleaned_reviews[train_indices,"Cleaned_Summary"]

x_val <- cleaned_reviews[-train_indices,"Cleaned_Text"]
y_val <- cleaned_reviews[-train_indices,"Cleaned_Summary"]

In [117]:
# encoding the training and validation datasets into integer sequences and padding them to their respective maximum lengths
num_train_examples = length(x_train)
num_val_examples = length(x_val)

x <- encode_pad_sequences(x_tokenizer,max_length_text,x_train)
x_val <- encode_pad_sequences(x_tokenizer,max_length_text,x_val)

y_encoded <- encode_pad_sequences(y_tokenizer,max_length_summary,y_train)
y1 <- encode_pad_sequences(y_tokenizer,max_length_summary,y_train)[,-max_length_summary]
y2 <- encode_pad_sequences(y_tokenizer,max_length_summary,y_train)[,-1]
y2 <- array_reshape(x = y2,c(num_train_examples,(max_length_summary-1),1))

y_val_encoded <- encode_pad_sequences(y_tokenizer,max_length_summary,y_val)
y_val1 <- encode_pad_sequences(y_tokenizer,max_length_summary,y_val)[,-max_length_summary]
y_val2 <- encode_pad_sequences(y_tokenizer,max_length_summary,y_val)[,-1]
y_val2 <- array_reshape(x = y_val2,c(num_val_examples,(max_length_summary-1),1))

In [120]:
# initializing parameters that will be fed in model configuration
latent_dim = 500
batch_size = 200
epochs = 100

In [121]:
# Encoder configuration


# Defining and processing the input sequence.
encoder_inputs  <- layer_input(shape=c(max_length_text),name = "encoder_inputs")
embedding_encoder <- encoder_inputs %>% layer_embedding(input_dim = x_voc_size,output_dim = latent_dim,trainable = TRUE,name = "encoder_embedding")

# Encoder LSTM 1

encoder_lstm1 <- layer_lstm(units=latent_dim,return_sequences = TRUE, return_state=TRUE,name = "encoder_lstm1")
encoder_results1 <- encoder_lstm1(embedding_encoder)
encoder_output1 <- encoder_results1[1]
state_h1 <- encoder_results1[2]
state_c1 <- encoder_results1[3]

# Encoder LSTM 2

encoder_lstm2 <- layer_lstm(units=latent_dim,return_sequences = TRUE, return_state=TRUE,name = "encoder_lstm2")
encoder_results2 <- encoder_lstm2(encoder_output1)
encoder_output2 <- encoder_results2[1]
state_h2 <- encoder_results2[2]
state_c2 <- encoder_results2[3]

# Encode LSTM 3

encoder_lstm3 <- layer_lstm(units=latent_dim,return_sequences = TRUE, return_state=TRUE,name = "encoder_lstm3")
encoder_results3 <- encoder_lstm3(encoder_output2)
encoder_outputs <- encoder_results3[1]
state_h <- encoder_results3[2]
state_c <- encoder_results3[3]
encoder_states <- encoder_results3[2:3]

In [122]:
# Decoder configuration

# Setting up the decoder, using encoder_states as the initial state
decoder_inputs  <- layer_input(shape=list(NULL),name = "decoder_inputs")

embedding_layer_decoder <- layer_embedding(input_dim = y_voc_size,output_dim = latent_dim,trainable = TRUE,name = "decoder_embedding")
embedding_decoder <- embedding_layer_decoder(decoder_inputs)

decoder_lstm    <- layer_lstm(units=latent_dim, return_sequences=TRUE,return_state=TRUE,name="decoder_lstm")
decoder_results <- decoder_lstm(embedding_decoder, initial_state=encoder_states)
decoder_outputs <- decoder_results[1]
decoder_fwd_state <- decoder_results[2]
decoder_back_state <- decoder_results[3]

decoder_dense <- time_distributed(layer = layer_dense(units = y_voc_size, activation='softmax'))
decoder_outputs <- decoder_dense(decoder_outputs[[1]])

In [123]:
# combining the encoder and decoder into a single model
model <- keras_model(inputs = c(encoder_inputs, decoder_inputs),outputs = decoder_outputs)

summary(model)

In [125]:
# compiling the model
model %>% compile(optimizer = "rmsprop",loss = 'sparse_categorical_crossentropy')

In [126]:
# defining the callbacks and checkpoints
model_name <- "model_TextSummarization"

# Checkpoints
checkpoint_dir <- "checkpoints_text_summarization"
dir.create(checkpoint_dir)
filepath <- file.path(checkpoint_dir, paste0(model_name,"weights.{epoch:02d}-{val_loss:.2f}.hdf5",sep=""))

# Callback
ts_callback <- list(callback_model_checkpoint(mode = "min",
 filepath = filepath,
 save_best_only = TRUE,
 verbose = 1,
 callback_early_stopping(patience = 100)))

"'checkpoints_text_summarization' already exists"

In [99]:
# training the model
model %>% fit(x = list(x,y1),y = y2,epochs = epochs,batch_size = batch_size,validation_data = list(list(x_val,y_val1),y_val2),callbacks = ts_callback,verbose = 2)

In [100]:
# Generating predictions for test data


# creating a function to generate a reversed list of key-value pair of the word index
reverse_word_index <- function(tokenizer){
    reverse_word_index <- names(tokenizer$word_index)
    names(reverse_word_index) <- tokenizer$word_index
    return(reverse_word_index)
}

x_reverse_word_index <- reverse_word_index(x_tokenizer)
y_reverse_word_index <- reverse_word_index(y_tokenizer)

# Reverse-lookup token index to decode sequences back to meaningful sentences or phrases
reverse_target_word_index=y_reverse_word_index
reverse_source_word_index=x_reverse_word_index
target_word_index= y_tokenizer$word_index

In [101]:
# Inference model to decode unknown input sequences

encoder_model <-  keras_model(inputs = encoder_inputs, outputs = encoder_results3)

decoder_state_input_h <- layer_input(shape=latent_dim)
decoder_state_input_c <- layer_input(shape=latent_dim)
decoder_hidden_state_input <- layer_input(shape = c(max_length_text,latent_dim))
decoder_embedding2 <- embedding_layer_decoder(decoder_inputs)
decoder_results2 <- decoder_lstm(decoder_embedding2,initial_state = c(decoder_state_input_h,decoder_state_input_c))
decoder_outputs2 <- decoder_results2[1]
state_h2 <- decoder_results2[2]
state_c2 <- decoder_results2[3]

decoder_outputs2 <- decoder_dense(decoder_outputs2[[1]])
inp = c(decoder_hidden_state_input,decoder_state_input_h,decoder_state_input_c)
dec_states = c(state_h2,state_c2)
decoder_model <-  keras_model(inputs = c(decoder_inputs,inp),outputs = c(decoder_outputs2,dec_states))

________________________________________________________________________________
Layer (type)              Output Shape      Param #  Connected to               
decoder_inputs (InputLaye (None, None)      0                                   
________________________________________________________________________________
decoder_embedding (Embedd (None, None, 500) 105500   decoder_inputs[0][0]       
________________________________________________________________________________
input_4 (InputLayer)      (None, 500)       0                                   
________________________________________________________________________________
input_5 (InputLayer)      (None, 500)       0                                   
________________________________________________________________________________
decoder_lstm (LSTM)       [(None, None, 500 2002000  decoder_embedding[1][0]    
                                                     input_4[0][0]              
                            

In [102]:
# defining a function decode_sequence(), which is the implementation of the inference process

decode_sequence <- function(input_seq) {
    ## Encoding the input as state vectors
    encoder_predict <- predict(encoder_model, input_seq)
    e_out = encoder_predict[[1]]
    e_h = encoder_predict[[2]]
    e_c = encoder_predict[[3]]

    # Generating empty target sequence of length 1
    target_seq <- array(0,dim = c(1,1))

    ## Populating the first character of target sequence with the start character.
    target_seq[1,1] <- target_word_index[['start']]

    stop_condition = FALSE
    decoded_sentence = ''
    niter = 1
    while (stop_condition==FALSE) {

        decoder_predict <- predict(decoder_model, list(target_seq, e_out,e_h,e_c))
        output_tokens <- decoder_predict[[1]]
        h <-  decoder_predict[[2]]
        c <-  decoder_predict[[3]]


        ## Sampling a token
        sampled_token_index <- which.max(output_tokens[1, 1, ])
        sampled_token <- reverse_target_word_index[sampled_token_index]

         if (sampled_token != 'end'){
             decoded_sentence =  paste0(decoded_sentence, sampled_token," ")
             if(sapply(strsplit(decoded_sentence, " "), length) >= max_length_summary){
                 stop_condition = TRUE
             }
         }

        target_seq <- array(0,dim = c(1,1))
        target_seq[ 1,1] <- sampled_token_index

        e_h = h
        e_c = c

  }
    return(decoded_sentence)
    }

In [103]:
# defining functions to convert an integer sequence to a word sequence for both reviews and the summaries

seq2summary<- function(input_seq){
    newString=''
    for(i in input_seq){
        if((i!=0 & i!=target_word_index[['start']]) & i!=target_word_index[['end']]){
        newString=paste0(newString,reverse_target_word_index[[i]],' ')
        }
        }
     return(newString)
}



seq2text <- function(input_seq){
    newString=''
    for(i in input_seq){
      if(i!=0){
        newString=paste0(newString,reverse_source_word_index[[i]],' ')
          }
        }
    return(newString)
    }

In [104]:
# decoding sample reviews
for(i in 1:dim(x_val)[1]){
    print(paste0("Review:",seq2text(x_val[i,])))
    print(paste0("Original summary:",seq2summary(y_val_encoded[i,])))
    print(paste0("Predicted summary:",decode_sequence(array_reshape(x_val[i,],dim= c(1,max_length_text)))))
    print("\n")
    }

[1] "Review:confection around centuries light pillowy citrus gelatin nuts case filberts cut tiny squares liberally coated powdered sugar tiny mouthful heaven chewy flavorful highly recommend yummy treat familiar story c s lewis lion witch wardrobe treat seduces edmund selling brother sisters witch "
[1] "Original summary:delight says it all "
[1] "Predicted summary:food great great great start start start start start start "
[1] "\n"
[1] "Review:looking secret ingredient robitussin believe found got addition root beer extract ordered good made cherry soda flavor medicinal "
[1] "Original summary:cough medicine "
[1] "Predicted summary:food great great great start start start start start start "
[1] "\n"
[1] "Review:great taffy great price wide assortment yummy taffy delivery quick taffy lover deal "
[1] "Original summary:great taffy "
[1] "Predicted summary:food great great great start start start start start start "
[1] "\n"
[1] "Review:taffy good soft chewy flavors amazing definitely