<a href="https://colab.research.google.com/github/KaustavRaj/Text-Summarization/blob/master/Text_Summarizer_Manual.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarizer Manual
##### *by Kaustav Bhattacharjee, IIIT Guwahati*
---

Here, I'm going to show how to use the encoder-decoder model that was trained in the **Text Summarization** Jupyter notebook. So let's first import the libraries and set up the other necessities for Google Colab.

In [0]:
from google.colab import drive

drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
!pip install contractions

Collecting contractions
  Downloading https://files.pythonhosted.org/packages/85/41/c3dfd5feb91a8d587ed1a59f553f07c05f95ad4e5d00ab78702fbf8fe48a/contractions-0.0.24-py2.py3-none-any.whl
Collecting textsearch
  Downloading https://files.pythonhosted.org/packages/42/a8/03407021f9555043de5492a2bd7a35c56cc03c2510092b5ec018cae1bbf1/textsearch-0.0.17-py2.py3-none-any.whl
Collecting Unidecode
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
     |████████████████████████████████| 245kB 4.3MB/s
Collecting pyahocorasick
  Downloading https://files.pythonhosted.org/packages/f4/9f/f0d8e8850e12829eea2e778f1c90e3c53a9a799b7f412082a5d21cd19ae1/pyahocorasick-1.4.0.tar.gz (312kB)
     |████████████████████████████████| 317kB 19.9MB/s
Building wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... done
  

Now, to summarize a sentence, we only need the code below out of the entire 'Summarization' notebook, because our trained models are already saved in Google Drive.
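Since everything below depends on files saved earlier, it may help to confirm they are in place before loading. The following is a minimal sketch: `missing_artifacts` and `EXPECTED_FILES` are hypothetical names, not part of the original notebook, and the helper only checks that the five relative paths read by the next cell exist. It says nothing about what the files contain.

```python
import os

# Relative paths that the loading code in this notebook reads from dir_path.
EXPECTED_FILES = [
    'models/encoder_model.h5',
    'models/decoder_model.h5',
    'models/model_3.h5',
    'data/word_indices_mapping.pickle',
    'data/tok_x.pickle',
]


def missing_artifacts(dir_path):
    """Return the expected files that do not exist under dir_path (hypothetical helper)."""
    return [rel for rel in EXPECTED_FILES
            if not os.path.isfile(os.path.join(dir_path, rel))]
```

Calling `missing_artifacts(dir_path)` with the Drive folder used below should return an empty list when all five files are present.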

In [0]:
import re
import os
import pickle
import logging
import numpy as np
import contractions
from nltk.corpus import stopwords
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
logging.getLogger("tensorflow").setLevel(logging.CRITICAL)

_MAX_TEXT_LEN    =   60
_MAX_SUMMARY_LEN =   10
_TEXT_PADDING    =   'post'

dir_path = '/content/gdrive/My Drive/Colab Notebooks/Summarization/summarization v2'

encoder_model = load_model(dir_path + '/models/encoder_model.h5')
decoder_model = load_model(dir_path + '/models/decoder_model.h5')
model         = load_model(dir_path + '/models/model_3.h5')


with open(dir_path + '/data/word_indices_mapping.pickle', 'rb') as f:
  index_to_word_text, index_to_word_summary, word_to_index_summary = pickle.load(f)


with open(dir_path + '/data/tok_x.pickle', 'rb') as f:
  tok_x = pickle.load(f)


def cleaner(text, remove_stopwords=True):
  """removes lines starting with a URL, NLTK stopwords and every non-letter character"""
  
  stop_words = set(stopwords.words('english'))
  text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text.lower(), flags=re.MULTILINE)
  text = re.sub(r'[^a-zA-Z]', ' ', text)
  text = contractions.fix(text, slang=False)
  if remove_stopwords:
    text = ' '.join([word for word in text.split() if word not in stop_words]).strip()
  return text


def summarizer(input_seq):
    encoder_out, encoder_h, encoder_c = encoder_model.predict(input_seq)
    target_seq = np.zeros((1,1))
    target_seq[0, 0] = word_to_index_summary['stok']
    stop_condition = False
    decoded_sentence = ''

    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + [encoder_out, encoder_h, encoder_c])
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token = index_to_word_summary[sampled_token_index]
        
        if sampled_token != 'etok':
            decoded_sentence += sampled_token + ' '

        if sampled_token == 'etok' or len(decoded_sentence.split()) >= (_MAX_SUMMARY_LEN-1):
            stop_condition = True

        target_seq = np.zeros((1,1))
        target_seq[0, 0] = sampled_token_index

        encoder_h, encoder_c = h, c

    return decoded_sentence


def tryit(sent):
  """wrapper function to test the model"""

  sent = cleaner(sent, remove_stopwords=True)
  if len(sent.split()) > _MAX_TEXT_LEN:
    return "make your sentence length less than {} words".format(_MAX_TEXT_LEN)

  seq = tok_x.texts_to_sequences(sent.split())
  seq = [[item for sublist in seq for item in sublist]]
  seq = pad_sequences(seq, maxlen=_MAX_TEXT_LEN, padding=_TEXT_PADDING)
  return summarizer(seq.reshape(1,_MAX_TEXT_LEN))


print(tryit('my dog loves the food'))

dog loves it 
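To make the input preparation in `tryit` more concrete, here is a minimal standard-library sketch of two of its steps: the regex cleaning and the 'post' padding. It leaves out stop-word removal, contraction fixing and the fitted tokenizer `tok_x`. The vocabulary `toy_index` and the helper names are invented for illustration, and `pad_post` only imitates what `pad_sequences(..., padding='post')` does for inputs that are no longer than `maxlen`.

```python
import re


def strip_non_letters(text):
    """Apply the two regex steps of cleaner(): drop URL lines, keep letters only."""
    text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text.lower(), flags=re.MULTILINE)
    return re.sub(r'[^a-zA-Z]', ' ', text)


def pad_post(seq, maxlen, value=0):
    """Imitate 'post' padding: append zeros until the sequence has maxlen items."""
    if len(seq) > maxlen:
        raise ValueError('sequence longer than maxlen')
    return seq + [value] * (maxlen - len(seq))


# Hypothetical vocabulary; the real indices come from the pickled tok_x.
toy_index = {'dog': 1, 'loves': 2, 'food': 3}

words = strip_non_letters('Dog LOVES food!!').split()
# Words missing from the toy vocabulary are simply skipped in this sketch.
ids = [toy_index[w] for w in words if w in toy_index]
padded = pad_post(ids, maxlen=6)

print(words)   # ['dog', 'loves', 'food']
print(padded)  # [1, 2, 3, 0, 0, 0]
```

In the notebook itself the padded length is `_MAX_TEXT_LEN` (60), so the encoder always receives an input of shape `(1, 60)`.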


Let's try a few more sentences.

In [0]:
print(tryit('actually try cups see work advertised however order arrived time excellent condition gives star rating cups work advertised rating jumps stars'))

great product but not a great price 


In [0]:
tryit('pumpkin seeds received bob redmill company stale contained lot small seed fragment')

'pumpkin seeds '

In [0]:
tryit('guess item would find cooking however found bit bitter eating snack')

'not a good product '
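The two-word output for the pumpkin seeds review follows from the decoding loop: generation stops as soon as the model emits `etok`, or once the summary reaches `_MAX_SUMMARY_LEN - 1` words. The sketch below replays that loop with a stand-in for the decoder. `toy_next_token` and its lookup table are hypothetical and merely mimic a model that produces a fixed reply, so only the control flow mirrors `summarizer`.

```python
MAX_SUMMARY_LEN = 10  # same cap as _MAX_SUMMARY_LEN in the notebook

# Hypothetical stand-in for decoder_model.predict followed by argmax:
# it maps the previous token to the next one.
TOY_TRANSITIONS = {'stok': 'pumpkin', 'pumpkin': 'seeds', 'seeds': 'etok'}


def toy_next_token(previous_token):
    return TOY_TRANSITIONS.get(previous_token, 'etok')


def greedy_decode(next_token=toy_next_token, max_len=MAX_SUMMARY_LEN):
    """Greedy loop: feed the last token back in until 'etok' or the length cap."""
    token = 'stok'  # start-of-summary marker
    words = []
    while True:
        token = next_token(token)
        if token != 'etok':
            words.append(token)
        if token == 'etok' or len(words) >= max_len - 1:
            break
    return ' '.join(words)


print(greedy_decode())                      # pumpkin seeds
print(greedy_decode(lambda _prev: 'good'))  # 'good' repeated nine times
```

Unlike `summarizer`, this sketch joins the words without a trailing space. The second call shows the length cap: a decoder that never emits `etok` is cut off after nine words.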

## Conclusion

The code above has been wrapped into a class, which makes it easier to work with and may later be turned into a module once it has been improved.

In [0]:
import re
import os
import pickle
import logging
import numpy as np
import contractions
from nltk.corpus import stopwords
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
logging.getLogger("tensorflow").setLevel(logging.CRITICAL)


class Summarizer:

  def __init__(self):
    # change dir_path to the folder that holds the 'models' and 'data' subfolders:
    # your Drive folder when used in Google Colab, or a local folder on your own computer

    self.dir_path = '/content/gdrive/My Drive/Colab Notebooks/Summarization/summarization v2'
    self._MAX_TEXT_LEN    =  60
    self._MAX_SUMMARY_LEN =  10
    self._TEXT_PADDING    =  'post'
    self.encoder_model = load_model(self.dir_path + '/models/encoder_model.h5')
    self.decoder_model = load_model(self.dir_path + '/models/decoder_model.h5')
    with open(self.dir_path + '/data/word_indices_mapping.pickle', 'rb') as f:
      self.index_to_word_text, self.index_to_word_summary, self.word_to_index_summary = pickle.load(f)
    with open(self.dir_path + '/data/tok_x.pickle', 'rb') as f:
      self.tok_x = pickle.load(f)


  def summarize(self, sent):
    """wrapper function to test the model"""

    sent = self.cleaner(sent, remove_stopwords=True)
    if len(sent.split()) > self._MAX_TEXT_LEN:
      return "make your sentence length less than {} words".format(self._MAX_TEXT_LEN)

    seq = self.tok_x.texts_to_sequences(sent.split())
    seq = [[item for sublist in seq for item in sublist]]
    seq = pad_sequences(seq, maxlen=self._MAX_TEXT_LEN, padding=self._TEXT_PADDING)
    return self.seq2seq_model(seq.reshape(1, self._MAX_TEXT_LEN))


  def cleaner(self, text, remove_stopwords=True):
    """removes lines starting with a URL, NLTK stopwords and every non-letter character"""

    stop_words = set(stopwords.words('english'))
    text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text.lower(), flags=re.MULTILINE)
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = contractions.fix(text, slang=False)
    if remove_stopwords:
      text = ' '.join([word for word in text.split() if word not in stop_words]).strip()
      
    return text
  

  def seq2seq_model(self, input_seq):
    """summarizes the input text given and returns the summarized string"""

    encoder_out, encoder_h, encoder_c = self.encoder_model.predict(input_seq)
    target_seq = np.zeros((1,1))
    target_seq[0, 0] = self.word_to_index_summary['stok']
    stop_condition = False
    decoded_sentence = ''

    while not stop_condition:
      output_tokens, h, c = self.decoder_model.predict([target_seq] + [encoder_out, encoder_h, encoder_c])
      sampled_token_index = np.argmax(output_tokens[0, -1, :])
      sampled_token = self.index_to_word_summary[sampled_token_index]
      
      if sampled_token != 'etok':
        decoded_sentence += sampled_token + ' '
        
      if sampled_token == 'etok' or len(decoded_sentence.split()) >= (self._MAX_SUMMARY_LEN-1):
        stop_condition = True

      target_seq = np.zeros((1,1))
      target_seq[0, 0] = sampled_token_index
      encoder_h, encoder_c = h, c

    return decoded_sentence

Its usage:

In [0]:
our_model = Summarizer()
our_model.summarize('after all this hardwork we should have a python party tonight')

'great for a little lifestyle '
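Every call to `Summarizer()` reloads two Keras models and two pickle files, so if the class is moved into a module it may be worth building it only once. The sketch below shows one possible way to do that with `functools.lru_cache`. `get_summarizer`, `summarize_many` and `FakeSummarizer` are hypothetical names, and `FakeSummarizer` is only a stand-in so that the example runs without the saved models; in practice the factory would be the `Summarizer` class above.

```python
from functools import lru_cache


class FakeSummarizer:
    """Stand-in for the Summarizer class so this sketch runs without the saved models."""
    instances = 0

    def __init__(self):
        FakeSummarizer.instances += 1  # counts how often the expensive setup would run

    def summarize(self, sent):
        return sent.split()[0] + ' '   # placeholder, not a real summary


@lru_cache(maxsize=1)
def get_summarizer(factory=FakeSummarizer):
    """Build the summarizer on first use and reuse the same object afterwards."""
    return factory()


def summarize_many(texts):
    """Summarize several sentences with the single cached instance."""
    model = get_summarizer()
    return [model.summarize(t) for t in texts]


print(summarize_many(['dog loves food', 'pumpkin seeds stale']))  # ['dog ', 'pumpkin ']
print(summarize_many(['tea tastes bitter']))                      # ['tea ']
print(FakeSummarizer.instances)                                   # 1
```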