<a href="https://colab.research.google.com/github/AravindR7/T5-Question-Generator/blob/master/que_gen_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!gdown -O  t5_que_gen.zip --id 1vhsDOW9wUUO83IQasTPlkxb82yxmMH-V
!unzip t5_que_gen.zip

In [None]:
!pip install transformers

In [3]:
import argparse
import glob
import os
import json
import time 
import logging
import random
from itertools import chain
from string import punctuation

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

import numpy as np
import torch
from torch.utils.data import dataset, DataLoader

from transformers import (
    AdamW,
    T5ForConditionalGeneration,
    T5Tokenizer,
    get_linear_schedule_with_warmup
                         )

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [4]:
class QueGenerator():
  def __init__(self):
    self.que_model = T5ForConditionalGeneration.from_pretrained('./t5_que_gen_model/t5_base_que_gen/')
    self.ans_model = T5ForConditionalGeneration.from_pretrained('./t5_ans_gen_model/t5_base_ans_gen/')

    self.que_tokenizer = T5Tokenizer.from_pretrained('./t5_que_gen_model/t5_base_tok_que_gen/')
    self.ans_tokenizer = T5Tokenizer.from_pretrained('./t5_ans_gen_model/t5_base_tok_ans_gen/')
    
    self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    self.que_model = self.que_model.to(self.device)
    self.ans_model = self.ans_model.to(self.device)
  
  def generate(self, text):
    answers = self._get_answers(text)
    questions = self._get_questions(text, answers)
    output = [{'answer': ans, 'question': que} for ans, que in zip(answers, questions)]
    return output
  
  def _get_answers(self, text):
    # split into sentences
    sents = sent_tokenize(text)

    examples = []
    for i in range(len(sents)):
      input_ = ""
      for j, sent in enumerate(sents):
        if i == j:
            sent = "[HL] %s [HL]" % sent
        input_ = "%s %s" % (input_, sent)
        input_ = input_.strip()
      input_ = input_ + " </s>"
      examples.append(input_)
    
    batch = self.ans_tokenizer.batch_encode_plus(examples, max_length=512, pad_to_max_length=True, return_tensors="pt")
    with torch.no_grad():
      outs = self.ans_model.generate(input_ids=batch['input_ids'].to(self.device), 
                                attention_mask=batch['attention_mask'].to(self.device), 
                                max_length=32,
                                # do_sample=False,
                                # num_beams = 4,
                                )
    dec = [self.ans_tokenizer.decode(ids, skip_special_tokens=False) for ids in outs]
    answers = [item.split('[SEP]') for item in dec]
    answers = chain(*answers)
    answers = [ans.strip() for ans in answers if ans != ' ']
    return answers
  
  def _get_questions(self, text, answers):
    examples = []
    for ans in answers:
      input_text = "%s [SEP] %s </s>" % (ans, text)
      examples.append(input_text)
    
    batch = self.que_tokenizer.batch_encode_plus(examples, max_length=512, pad_to_max_length=True, return_tensors="pt")
    with torch.no_grad():
      outs = self.que_model.generate(input_ids=batch['input_ids'].to(self.device), 
                                attention_mask=batch['attention_mask'].to(self.device), 
                                max_length=32,
                                num_beams = 4)
    dec = [self.que_tokenizer.decode(ids, skip_special_tokens=False) for ids in outs]
    return dec

In [7]:
que_generator = QueGenerator()

In [None]:
text = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum \
and first released in 1991, Python's design philosophy emphasizes code \
readability with its notable use of significant whitespace."

text2 = "Gravity (from Latin gravitas, meaning 'weight'), or gravitation, is a natural phenomenon by which all \
things with mass or energy—including planets, stars, galaxies, and even light—are brought toward (or gravitate toward) \
one another. On Earth, gravity gives weight to physical objects, and the Moon's gravity causes the ocean tides. \
The gravitational attraction of the original gaseous matter present in the Universe caused it to begin coalescing \
and forming stars and caused the stars to group together into galaxies, so gravity is responsible for many of \
the large-scale structures in the Universe. Gravity has an infinite range, although its effects become increasingly \
weaker as objects get further away"

In [None]:
que_generator.generate(text)

In [None]:
que_generator.generate(text2)


In [None]:
tetx = "A dentist, also known as a dental surgeon, is a surgeon who specializes in dentistry, the diagnosis, prevention, and treatment of diseases and conditions of the oral cavity. The dentist's supporting team aids in providing oral health services. The dental team includes dental assistants, dental hygienists, dental technicians, and sometimes dental therapists."

In [None]:
que_generator.generate(tetx)

In [14]:
algom = '''We define a targeted promotion as an incentive that can be delivered to selected consumers through marketing channels. 
Promotions can be offered with some condition (e. g. buy one, get one)
or without any condition, can provide a monetary value such as
a discount, or can just advertise a product or brand. Promotions
may or may not be redeemable, in the sense that a consumer might
need to submit evidence of the promotion (scan a bar code on a
printed coupon or enter a promotion code) to redeem its monetary value with a purchase. We also use the word treatment as a
generic term that refers to promotions and other marketing communications.
• A retailer owns marketing channels such physical stores or eCommerce websites that can be used to communicate promotions to
the consumers. The marketing channels of multiple retailers can
be combined into a promotion distribution network that can be operated by retailers or a third-party agency. For example, an agency
can install its coupon printers in stores that belong to multiple
retail chains.
It is critically important that the retailer or agency, as a marketing
channel owner, can track consumers at the individual level and
link together transactions made by the same consumer or household. This tracking is often based on loyalty IDs that are assigned
to the customers by using loyalty cards or online accounts, credit
card IDs, or other pieces of information that are available to a retailer. This process, however, is often imperfect, and a significant
number of transactions can remain anonymous.
• Promotions can be distributed through the marketing channels
on behalf of both manufacturers and retailers. Distribution can
be done either in batch mode, when emails or printed catalogs
are sent to a large number of customers, or in real-time mode,
when promotions are generated in the scope of an individual
transaction, such as an in-store purchase or website visit.
• The main decisions that a targeting system needs to make with
respect to promotions are who are the right recipients for a promotion, what are the right promotional properties, what is the
optimal time to offer it, and what is the right delivery channel.
• We assume that a retailer can identify consumers who have received a promotion, consumers who have purchased a promoted
product, and, optionally, promotion redemption events. Note
that purchases and redemptions are completely different events
that should not be confused: consumers who have a promotion
3.1 environment 79
are not obligated to redeem it, and a product can typically be
purchased by any consumer although the purchase may be on
different conditions according to the granted promotions. Beyond these events, a targeting system can also access additional
or external consumer data, such as demographic records or
survey answers.'''

In [15]:
que_generator.generate(algom)

Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


[{'answer': 'marketing channels',
  'question': 'How can a targeted promotion be delivered to selected consumers?'},
 {'answer': 'promotion',
  'question': 'What can provide a monetary value such as a discount, or can just advertise a product or brand?'},
 {'answer': 'redeemable', 'question': 'Promotions may or may not be what?'},
 {'answer': 'treatment',
  'question': 'What word refers to promotions and other marketing communications?'},
 {'answer': 'physical stores or eCommerce websites',
  'question': 'Where does a retailer own marketing channels?'},
 {'answer': 'retailers or a third-party agency',
  'question': 'Who can operate a promotion distribution network?'},
 {'answer': 'multiple retail chains',
  'question': 'An agency can install coupon printers in stores that belong to what?'},
 {'answer': 'track consumers at the individual level and link together transactions made by the same consumer or household',
  'question': 'What is critically important that the retailer or agency, 