In [1]:
import spacy
from spacy import displacy
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Large English pipeline; install it first with: python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')

In [2]:
text='''Cricket is a bat-and-ball game played between two teams of eleven players on a field at the centre of which is a 22-yard (20-metre) pitch with a wicket at each end, each comprising two bails balanced on three stumps. The batting side scores runs by striking the ball bowled at one of the wickets with the bat and then running between the wickets, while the bowling and fielding side tries to prevent this (by preventing the ball from leaving the field, and getting the ball to either wicket) and dismiss each batter (so they are "out"). Means of dismissal include being bowled, when the ball hits the stumps and dislodges the bails, and by the fielding side either catching the ball after it is hit by the bat, but before it hits the ground, or hitting a wicket with the ball before a batter can cross the crease in front of the wicket. When ten batters have been dismissed, the innings ends and the teams swap roles. The game is adjudicated by two umpires, aided by a third umpire and match referee in international matches. They communicate with two off-field scorers who record the match's statistical information.

Forms of cricket range from Twenty20 (also known as T20), with each team batting for a single innings of 20 overs (each "over" being a set of 6 fair opportunities for the batting team to score) and the game generally lasting three to four hours, to Test matches played over five days. Traditionally cricketers play in all-white kit, but in limited overs cricket they wear club or team colours. In addition to the basic kit, some players wear protective gear to prevent injury caused by the ball, which is a hard, solid spheroid made of compressed leather with a slightly raised sewn seam enclosing a cork core layered with tightly wound string.

The earliest known definite reference to cricket is to it being played in South East England in the mid-16th century. It spread globally with the expansion of the British Empire, with the first international matches in the second half of the 19th century. The game's governing body is the International Cricket Council (ICC), which has over 100 members, twelve of which are full members who play Test matches. The game's rules, the Laws of Cricket, are maintained by Marylebone Cricket Club (MCC) in London. The sport is followed primarily in South Asia, Australia, New Zealand, the United Kingdom, Southern Africa and the West Indies.[1]

Women's cricket, which is organised and played separately, has also achieved international standard.

The most successful side playing international cricket is Australia, which has won eight One Day International trophies, including six World Cups, more than any other country and has been the top-rated Test side more than any other country.[citation needed]'''

In [3]:
# Run the spaCy pipeline once, then print the sentences it segmented.
doc = nlp(text)
for sent in doc.sents:
    print(sent)

Cricket is a bat-and-ball game played between two teams of eleven players on a field at the centre of which is a 22-yard (20-metre) pitch with a wicket at each end, each comprising two bails balanced on three stumps.
The batting side scores runs by striking the ball bowled at one of the wickets with the bat and then running between the wickets, while the bowling and fielding side tries to prevent this (by preventing the ball from leaving the field, and getting the ball to either wicket) and dismiss each batter (so they are "out").
Means of dismissal include being bowled, when the ball hits the stumps and dislodges the bails, and by the fielding side either catching the ball after it is hit by the bat, but before it hits the ground, or hitting a wicket with the ball before a batter can cross the crease in front of the wicket.
When ten batters have been dismissed, the innings ends and the teams swap roles.
The game is adjudicated by two umpires, aided by a third umpire and match referee 

In [4]:
# Collect every named entity with its label and spaCy's description of that label.
data = {
    'Entity': [ent.text for ent in doc.ents],
    'Label': [ent.label_ for ent in doc.ents],
    'Explain': [spacy.explain(ent.label_) for ent in doc.ents],
}
NER_ = pd.DataFrame(data)
NER_

   Entity    Label     Explain
0  two       CARDINAL  Numerals that do not fall under another type
1  eleven    CARDINAL  Numerals that do not fall under another type
2  22-yard   QUANTITY  Measurements, as of weight or distance
3  20-metre  QUANTITY  Measurements, as of weight or distance
4  two       CARDINAL  Numerals that do not fall under another type
5  three     CARDINAL  Numerals that do not fall under another type
6  ten       CARDINAL  Numerals that do not fall under another type
7  two       CARDINAL  Numerals that do not fall under another type
8  third     ORDINAL   "first", "second", etc.
9  two       CARDINAL  Numerals that do not fall under another type


In [5]:
displacy.render(doc,style='ent')

In [6]:
def preprocess(sent, return_tokens=False):
    """Lemmatise a sentence, dropping stop words and punctuation.

    Returns a list of lemmas if return_tokens is true, else a space-joined string.
    """
    doc = nlp(sent)
    lemmas = [token.lemma_ for token in doc
              if not (token.is_stop or token.is_punct)]
    return lemmas if return_tokens else ' '.join(lemmas)


def preprocess_with_ner(sent):
    """Return the set of distinct named-entity strings found in the text."""
    doc = nlp(sent)
    return {ent.text for ent in doc.ents}

In [7]:
b=preprocess_with_ner(text)
b=list(b)
b=np.array(b)
b.shape

(34,)

In [8]:
preprocessed_sent=[preprocess(sent.text) for sent in doc.sents]
preprocessed_sent=np.array(preprocessed_sent)
print(preprocessed_sent.shape)

(15,)


In [9]:
# Fit TF-IDF over the preprocessed sentences: one row per sentence, one column per term.
v=TfidfVectorizer()
tdidf_vector=v.fit_transform(preprocessed_sent)
tdidf_vector.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.26423975,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.22653805],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [10]:
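# Score each sentence by summing its TF-IDF weights. Summing a SciPy sparse
# matrix along axis=1 gives a column of shape (n_sentences, 1), which is then
# converted to a plain ndarray.
sent_score=tdidf_vector.sum(axis=1)
sent_score=np.array(sent_score)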
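
To make the row-sum score more concrete, here is a minimal standard-library sketch of the weighting that `TfidfVectorizer` applies with its default settings (raw term counts, smoothed IDF, L2-normalised rows). The helper `tfidf_rows`, the whitespace tokenizer and the toy sentences are assumptions made for illustration; they are not the vectorizer's own tokenizer or part of the notebook. Because every row has unit L2 norm, its sum lies between 1 and the square root of the number of distinct terms, so the score mostly rewards sentences with many distinct terms.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    # Assumption: plain whitespace tokens, lower-cased.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Made-up, already preprocessed sentences.
toy_docs = [
    "ball hit stump",
    "ball hit bat ball hit ground",
    "umpire",
]
vocab, rows = tfidf_rows(toy_docs)
row_sums = [sum(r) for r in rows]
for doc, total in zip(toy_docs, row_sums):
    print(f"{total:.3f}  {doc}")
```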


In [11]:
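# Put scores and sentences side by side; hstack converts both columns to strings.
sent_score_map=np.hstack((sent_score,preprocessed_sent.reshape(-1,1)))
# Caution: np.sort(axis=0) sorts each column independently and compares the
# scores as strings, so after this line a score no longer sits next to the
# sentence it was computed for.
sent_score_map=np.sort(sent_score_map,axis=0)[::-1]
print(sent_score_map)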


[['5.182617383483045'
  'traditionally cricketer play white kit limited over cricket wear club team colour']
 ['4.8988555579483775'
  'successful play international cricket Australia win day international trophy include World Cups country rate test country.[citation need']
 ['4.630733091947959'
  'spread globally expansion British Empire international match second half 19th century']
 ['4.394572964179317'
  'sport follow primarily South Asia Australia New Zealand United Kingdom Southern Africa West Indies.[1 \n\n Women cricket organise play separately achieve international standard \n\n']
 ['3.805433108828658'
  'mean dismissal include bowl ball hit stump dislodge bail field catch ball hit bat hit ground hit wicket ball batter cross crease wicket']
 ['3.693383291813456'
  'game rule law Cricket maintain Marylebone Cricket Club MCC London']
 ['3.4605544418510146'
  'game govern body International Cricket Council ICC 100 member member play test match']
 ['3.4027409075765225'
  'game adju
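
In the output above the sentences appear in reverse alphabetical order, because `np.sort(..., axis=0)` in In [11] sorts the score column and the sentence column separately. A minimal sketch of one way to keep each score attached to its sentence is to rank the rows with `np.argsort` on the numeric scores. The arrays below are made-up stand-ins for `sent_score` and `preprocessed_sent`, and the names `unpaired`, `order` and `ranked` are hypothetical.

```python
import numpy as np

# Made-up stand-ins: a (3, 1) column of scores and three short "sentences".
scores = np.array([[1.5], [10.2], [3.0]])
sentences = np.array(["alpha", "zulu", "mike"])

# What the notebook does: stack (everything becomes a string), then sort
# every column on its own and reverse the row order.
stacked = np.hstack((scores, sentences.reshape(-1, 1)))
unpaired = np.sort(stacked, axis=0)[::-1]
print(unpaired)

# Paired alternative: sort row indices by numeric score, highest first,
# and use them to pick both the score and the sentence.
order = np.argsort(scores.ravel())[::-1]
ranked = [(float(scores[i, 0]), str(sentences[i])) for i in order]
for score, sentence in ranked:
    print(score, sentence)
```

With the paired version, the highest score stays next to the sentence it was computed for, and the comparison is numeric rather than lexicographic.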

In [12]:
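# Repeat the TF-IDF scoring with each named entity treated as its own document.
v=TfidfVectorizer()
tdidf_matrix=v.fit_transform(b)
# Sum the weights of each entity's row to get one score per entity.
score=tdidf_matrix.sum(axis=1)
score.shape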


(34, 1)
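
One detail worth keeping in mind when scoring entities this way: with default settings, `TfidfVectorizer` lower-cases the text and keeps only tokens of two or more word characters, so an entity such as `6` contributes no terms and its row sums to zero. The sketch below applies that default token pattern with the standard `re` module; `default_tokens` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
import re

# Default token pattern of scikit-learn's text vectorizers: runs of two or
# more word characters, so single-character tokens are dropped.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_tokens(entity):
    """Lower-case the entity and split it the way the default pattern would."""
    return TOKEN_PATTERN.findall(entity.lower())


for entity in ["6", "ICC", "22-yard", "three to four hours"]:
    print(repr(entity), "->", default_tokens(entity))
```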

In [13]:
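# Same stacking and sorting as for the sentences. Because np.sort(axis=0) sorts
# each column on its own, the entity column ends up in reverse alphabetical
# (code point) order, independent of the TF-IDF scores.
word_score=np.hstack((score,b.reshape(-1,1)))
word_score=np.sort(word_score,axis=0)[::-1]
word_score=np.array(word_score)

# Keep only the entity strings; they are the candidate answers.
word_score_map=word_score[:,1]
for x in word_score_map:
    print(x)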

two
twelve
three to four hours
three
third
the second half of the 19th century
the mid-16th century
the United Kingdom
the Laws of Cricket
the International Cricket Council
the British Empire
ten
six
over 100
five days
first
eleven
eight
World Cups
Twenty20
Southern Africa
South East England
South Asia
One Day International
New Zealand
Marylebone Cricket Club
MCC
London
ICC
Australia
6
22-yard
20-metre
20


In [14]:

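from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# T5 checkpoint fine-tuned to generate a question from an answer span and its context.
tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-question-generation-ap")
# AutoModelForSeq2SeqLM is the auto class for encoder-decoder models such as T5
# (AutoModelWithLMHead is deprecated).
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-question-generation-ap")

def get_question(answer, context, max_length=64):
    """Generate one question whose answer is `answer`, given `context`."""
    input_text = "answer: %s  context: %s </s>" % (answer, context)
    features = tokenizer([input_text], return_tensors='pt')

    output = model.generate(input_ids=features['input_ids'],
                            attention_mask=features['attention_mask'],
                            max_length=max_length)

    # Decoding keeps special tokens such as <pad> and </s> in the returned string.
    return tokenizer.decode(output[0])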



The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565

In [15]:
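# Generate up to 20 question/answer pairs: for each sentence, use the first
# remaining entity that occurs in it as the answer, then retire that entity.
x=20
while x>0:
    flag=True  # stays True if a full pass over the sentences finds no match
    for sent in sent_score_map[:,1]:
        for word in word_score_map:
            if word in sent:  # substring match against the preprocessed sentence
                print(get_question(word,sent),"|",word)
                # Remove the entity so it is not used as an answer again.
                word_score_map = np.delete(word_score_map, np.where(word_score_map == word))
                x-=1
                flag=False
                break
        if x==0 or word_score_map.size==0:
            break
    if flag:
        break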

<pad> question: What type of trophy does Australia win?</s> | World Cups
<pad> question: What region of the world does cricket follow?</s> | Southern Africa
<pad> question: What is the name of the cricket club in London?</s> | Marylebone Cricket Club
<pad> question: What is the name of the international cricket council?</s> | ICC
<pad> question: What team knows T20 team bat single innings 20 over set 6 fair opportunity batting team score game?</s> | Twenty20
<pad> question: What region of England is known for a cricket play?</s> | South East England
<pad> question: How many meters long is the pitch?</s> | 20
<pad> question: Which country has won the World Cups?</s> | Australia
<pad> question: What region of the world does cricket follow?</s> | South Asia
<pad> question: What is the name of the Marylebone Cricket Club?</s> | MCC
<pad> question: How many times is a fair opportunity batting team score game?</s> | 6
<pad> question: What country is the most popular for women cricket?</s> | 
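
The generated strings above still carry the `<pad>` and `</s>` markers and the `question:` prefix. A minimal sketch of a clean-up step is shown below; `clean_question` is a hypothetical helper, not part of the notebook or of Transformers. Passing `skip_special_tokens=True` to `tokenizer.decode` should remove the two markers at the source, in which case only the prefix would need stripping.

```python
def clean_question(decoded):
    """Strip special-token markers and the 'question:' prefix from model output."""
    text = decoded
    for marker in ("<pad>", "</s>"):
        text = text.replace(marker, "")
    text = text.strip()
    prefix = "question:"
    if text.startswith(prefix):
        text = text[len(prefix):].strip()
    return text


# One of the raw strings printed by the loop above.
raw = "<pad> question: How many meters long is the pitch?</s>"
print(clean_question(raw))
```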