# Reader-aware question generation tests
We've taken a first stab at reader-aware question generation by prepending a reader-type token (e.g. `<US_AUTHOR>` for a US reader) to the input text.

Let's see how well the model predicts questions, and how well it adapts to different reader groups.
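
A minimal sketch of what adding the reader-type token could look like, assuming the token is simply prepended to the article text before tokenization. The helper name `add_reader_token` and the `<{GROUP}_AUTHOR>` naming pattern are assumptions for illustration (the pattern is inferred from the `<US_AUTHOR>` and `<COMMENT_LEN_1_AUTHOR>` tokens that show up in the cell outputs); the actual preprocessing code is not part of this notebook.

```python
def add_reader_token(source_text: str, reader_group: str) -> str:
    """Prepend a reader-type token such as <US_AUTHOR> to the source text."""
    # hypothetical naming pattern: <GROUP_AUTHOR>
    token = f"<{reader_group.upper()}_AUTHOR>"
    return f"{token} {source_text}"


example = add_reader_token("WASHINGTON -- President Trump rejected ...", "US")
print(example)
```

For the token to map to a single ID rather than being split into sub-words, it would also need to be registered with `tokenizer.add_tokens` before encoding.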

In [10]:
len(tokenizer)

50271

In [14]:
## load model
import torch
from transformers import AutoModelForSeq2SeqLM, BartTokenizer
cache_dir = '../../data/nyt_comments/author_data_model/model_cache/'
generation_model = AutoModelForSeq2SeqLM.from_pretrained('facebook/bart-base', cache_dir=cache_dir)
tokenizer = torch.load('../../data/nyt_comments/author_data_model/BART_tokenizer.pt')
# resize the embedding matrix to match the extended tokenizer vocabulary
generation_model.resize_token_embeddings(len(tokenizer))
# load the fine-tuned weights from the latest checkpoint
trained_model_state_dict_file = '../../data/nyt_comments/author_data_model/question_generation_model/checkpoint-72000/pytorch_model.bin'
trained_model_state_dict = torch.load(trained_model_state_dict_file)
generation_model.load_state_dict(trained_model_state_dict)

<All keys matched successfully>

In [16]:
## load validation data
val_data = torch.load('../../data/nyt_comments/author_data_model/author_type_NYT_question_data_val_data.pt')
val_data = val_data['train']
print(val_data)

Dataset(features: {'article_id': Value(dtype='string', id=None), 'location_region': Value(dtype='string', id=None), 'prior_comment_count_bin': Value(dtype='int64', id=None), 'prior_comment_len_bin': Value(dtype='int64', id=None), 'source_text': Value(dtype='string', id=None), 'target_text': Value(dtype='string', id=None), 'source_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'target_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}, num_rows: 7209)


In [65]:
help(tokenizer.add_tokens)

Help on method add_tokens in module transformers.tokenization_utils_base:

add_tokens(new_tokens: Union[str, tokenizers.AddedToken, List[Union[str, tokenizers.AddedToken]]], special_tokens: bool = False) -> int method of transformers.tokenization_bart.BartTokenizer instance
    Add a list of new tokens to the tokenizer class. If the new tokens are not in the vocabulary, they are added to
    it with indices starting from length of the current vocabulary.
    
    Args:
        new_tokens (:obj:`str`, :obj:`tokenizers.AddedToken` or a list of `str` or :obj:`tokenizers.AddedToken`):
            Tokens are only added if they are not already in the vocabulary. :obj:`tokenizers.AddedToken` wraps a
            string token to let you personalize its behavior: whether this token should only match against a single
            word, whether this token should strip all potential whitespaces on the left side, whether this token
            should strip all potential whitespaces on the right side,

In [62]:
print(val_data['source_ids'][20])
print(val_data['source_text'][20])

[50270  5762   480 ...    14     5     2]
WASHINGTON -- President Trump rejected, for now at least, a fresh round of sanctions set to be imposed against Russia on Monday, a course change that underscored the schism between the president and his national security team. The president's ambassador to the United Nations, Nikki R. Haley, had announced on Sunday that the administration would place sanctions on Russian companies found to be assisting Syria's chemical weapons program. The sanctions were listed on a menu of further government options after an American-led airstrike on Syria, retaliating against a suspected gas attack that killed dozens a week earlier. But the White House contradicted her on Monday, saying that Mr. Trump had not approved additional measures." We are considering additional sanctions on Russia and a decision will be made in the near future," Sarah Huckabee Sanders, the White House press secretary, said in a statement. Speaking later with reporters aboard Air Force

In [60]:
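## TODO: why did the special author token get added to the start of all source/target IDs?
# print the first sequence that does NOT start with ID 50270 (no output means every sequence starts with it)
for x in val_data['source_ids']:
    if x[0] != 50270:
        print(x)
        break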
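
One way to dig into the TODO above is to tally the first ID of every sequence instead of stopping at the first mismatch. The sketch below uses a small made-up list in place of `val_data['source_ids']`, and `first_id_counts` is a hypothetical helper name, not something from `data_helpers`.

```python
from collections import Counter


def first_id_counts(sequences):
    """Count how often each token ID appears in the first position."""
    return Counter(int(seq[0]) for seq in sequences if len(seq) > 0)


# toy stand-in for val_data['source_ids'] (values are illustrative only)
toy_source_ids = [
    [50270, 5762, 480, 2],
    [50270, 9226, 354, 2],
    [50265, 102, 21959, 2],
]
counts = first_id_counts(toy_source_ids)
print(counts.most_common())
```

If the real data yields a single leading ID for every row, that would suggest the reader token is not varying by reader group, which is worth checking before interpreting the predictions.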

In [55]:
tokenizer.encode(['this', 'is', 'a', 'test'], add_special_tokens=True)

[50270, 9226, 354, 102, 21959, 2]

In [52]:
txt = val_data['target_text'][0]
print()
print(val_data['target_ids'][0])
print(tokenizer.decode(val_data['target_ids'][0], skip_special_tokens=False))
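# NOTE: this line raises the ValueError shown below: decode() expects integer token IDs,
# but tokenize() returns token strings
print(tokenizer.decode(tokenizer.tokenize(txt)))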


[50270  6785 24224    53    16    24   678    14  1774   782     5    86
     7  3886     7   422    25    10  1984    13     5  2760  1939   729
   116     2     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1     1     1     1     1     1     1     1     1
     1     1     1     1]
<COMMENT_LEN_1_AUTHOR> Just guessing but is it possible that Ryan needs the time to prepare to run as a candidate for the 2020 presidential election?</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>


ValueError: invalid literal for int() with base 10: 'Just'
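
The decoded target above ends in a long run of `<pad>` tokens. A minimal sketch of trimming that padding before decoding, assuming pad ID 1 (which is what the ID array and decoded string above suggest). `strip_padding` is a hypothetical helper and the ID list is a shortened copy of the one printed above.

```python
def strip_padding(token_ids, pad_id=1):
    """Drop trailing pad IDs so decoded text is not followed by <pad> runs."""
    ids = [int(i) for i in token_ids]
    while ids and ids[-1] == pad_id:
        ids.pop()
    return ids


# shortened copy of val_data['target_ids'][0]
padded = [50270, 6785, 24224, 53, 16, 116, 2, 1, 1, 1, 1]
trimmed = strip_padding(padded)
print(trimmed)
```

The trimmed list can then be passed to `tokenizer.decode`; passing `skip_special_tokens=True` instead is another option, though that may also hide other special tokens.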

In [17]:
from data_helpers import generate_predictions
device_name = 'cuda:0'
pred_text = generate_predictions(generation_model, val_data, tokenizer, device_name)




In [32]:
tokenizer.decode([len(tokenizer)-6])

'<US_AUTHOR>'
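
The `len(tokenizer)-6` offset can be sanity-checked with some quick arithmetic. This sketch assumes the stock `facebook/bart-base` vocabulary has 50265 entries; that figure is not shown anywhere in this notebook, so treat it as an assumption.

```python
BASE_VOCAB_SIZE = 50265      # assumption: stock facebook/bart-base vocabulary size
EXTENDED_VOCAB_SIZE = 50271  # len(tokenizer) reported at the top of the notebook

# IDs appended after the base vocabulary
num_added = EXTENDED_VOCAB_SIZE - BASE_VOCAB_SIZE
added_ids = list(range(BASE_VOCAB_SIZE, EXTENDED_VOCAB_SIZE))
print(num_added, added_ids)
```

Under that assumption, `len(tokenizer)-6` is the first added ID (decoded above as `<US_AUTHOR>`) and 50270 is the last one, which matches the `<COMMENT_LEN_1_AUTHOR>` token seen at the start of the decoded target.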

In [20]:
## evaluate sample of text
sample_size = 100
mini_val_data = val_data.select(list(range(sample_size)))
mini_pred_text = pred_text[:sample_size]
from data_helpers import compare_pred_text_with_target
compare_pred_text_with_target(mini_val_data, mini_pred_text, tokenizer)



*~*~*~*~*~*
source text = <COMMENT_LEN_1_AUTHOR> WASHINGTON  --  Paul  D .  Ryan  took  the  helm  of  the  House  two  and  a  half  years  ago ,  not  because  he  wanted  it ,  but  because  he  was  seen  as  the  only  lawmaker  who  could  keep  Republicans  from  dev ouring  themselves .  They  had  shut  down  the  g...
target text = <COMMENT_LEN_1_AUTHOR> Just  guessing  but  is  it  possible  that  Ryan  needs  the  time  to  prepare  to  run  as  a  candidate  for  the  2020  presidential  election ?
pred text = Why would someone like Ryan run for public office in the first place?
*~*~*~*~*~*
source text = <COMMENT_LEN_1_AUTHOR> WASHINGTON  --  Defense  Department  officials  said  on  Saturday  that  American - led  strikes  against  Syria  had  taken  out  the "  heart "  of  President  Bashar  al - Assad 's  chemical  weapons  program ,  but  acknowledged  that  the  Syrian  government  most  likel...
target text = <COMMENT_LEN_1_AUTHOR> Why  waste  trillions  of  dollar

*~*~*~*~*~*
source text = <COMMENT_LEN_1_AUTHOR> H EL ENA ,  Mont .  --  It  has  been  an  iron  rule  for  candidates  in  rural  areas  and  red  states  for  decades  :  Do  not  antagon ize  the  National  Rifle  Association .  But  that  was  before  the  massacre  at  a  high  school  in  Park land ,  Fla .,  galvan i...
target text = <COMMENT_LEN_1_AUTHOR> Now ,  just  why  would "  Russia "  give  money  to  an  organization  that  wants  Americans  armed ?
pred text = Alan-Could you please educate us all and explain why banning assault weapons and requiring thorough background checks and gun safety training would result in" total confiscation"?
*~*~*~*~*~*
source text = <COMMENT_LEN_1_AUTHOR> We  are  now  in  the  midst  of  an  epic  clash  between  Donald  Trump  and  fired  F .  B .  I .  Director  James  Comey ,  neither  of  whom  I  hold  in  high  esteem ,  both  men  with  raging  eg os  and  questionable  motives .  The  depth  of  my  contempt  differs  ...
target 

*~*~*~*~*~*
source text = <COMMENT_LEN_1_AUTHOR> WASHINGTON  --  The  White  House  said  Tuesday  night  that  President  Trump  planned  to  deploy  the  National  Guard  to  the  southern  border  to  confront  what  it  called  a  growing  threat  of  illegal  immigrants ,  drugs  and  crime  from  Central  America  afte...
target text = <COMMENT_LEN_1_AUTHOR> What  do  you  do  when  you  have  an  Im bec ile  in  the  White house  making  crazy  proclamation  without  any  awareness  or  knowledge ,  just  stupid  off  the  cuff  pronounce ments ?
pred text = What will happen to American troops who encounter drug traffickers at the border?
*~*~*~*~*~*
source text = <COMMENT_LEN_1_AUTHOR> The  Trump  administration  is  expected  to  launch  an  effort  in  coming  days  to  weaken  greenhouse  gas  emissions  and  fuel  economy  standards  for  automobiles ,  handing  a  victory  to  car  manufacturers  and  giving  them  ammunition  to  potentially  roll  bac...
target text = <C

*~*~*~*~*~*
source text = <COMMENT_LEN_1_AUTHOR> From  his  Fox  News  pul pit ,  Sean  Hannity  has  been  one  of  the  most  ardent  supporters  of  President  Trump ,  cheering  his  agenda  and  exc ori ating  his  enemies .  He  has  gone  from  giving  advice  on  messaging  and  strategy  to  Mr .  Trump  and  his  a...
target text = <COMMENT_LEN_1_AUTHOR> Another  right - wing  David  D enn ison ?
pred text = Why would Sean Hannity need a fixer like Cohen?
*~*~*~*~*~*
source text = <COMMENT_LEN_1_AUTHOR> MI AMI  --  Gov .  Rick  Scott  made  official  on  Monday  what  Flor id ians  have  suspected  for  months  :  He  is  running  for  the  United  States  Senate  against  Bill  Nelson ,  the  incumbent  Democrat ,  in  a  premier  race  that  will  return  the  nation 's  la...
target text = <COMMENT_LEN_1_AUTHOR> Why  shouldn 't  Florida  do  all  in  its  political  power  to  wipe  out  Miami ,  Palm  Beach ,  Naples ,  and  the  rest ?
pred text = Have his Democratic opp

*~*~*~*~*~*
source text = <COMMENT_LEN_1_AUTHOR> P AL M  BE ACH ,  Fla .  --  President  Trump  declared  on  Wednesday  that  he  would  scrap  a  planned  summit  meeting  with  North  Korea 's  leader ,  Kim  Jong - un ,  or  even  walk  out  of  the  session  while  it  was  underway ,  if  his  diplomatic  overt ure  wa...
target text = <COMMENT_LEN_1_AUTHOR> After  calling  Kim "  Little  Rocket  Man ,"  what  in  the  world  Trump  talk  about ?
pred text = Why is he taking such a political risk?
*~*~*~*~*~*
source text = <COMMENT_LEN_1_AUTHOR> The  tabloid  news  company  American  Media  Inc .  agreed  to  let  a  former  Playboy  model  out  of  a  contract  that  had  kept  her  from  talking  freely  about  an  alleged  affair  with  Donald  J .  Trump .  The  settlement  agreement ,  reached  on  Wednesday ,  e...
target text = <COMMENT_LEN_1_AUTHOR> Has  it  been  explained  yet  why  at  least  one  of  these "  paid  for  silence  re  dj t "  contracts  mentions "  pate

*~*~*~*~*~*
source text = <COMMENT_LEN_1_AUTHOR> I  used  to  want  to  leave  you .  I  loved  you .  But  I  couldn  ’  t  stay .  I  wanted  to  live  in  a  city ,  with  access  to  hiking  trails ,  and  coffee  shops  and  book stores  that  I  could  walk  to .  Not  our  Florida  suburb  full  of  palm  trees  and...
target text = <COMMENT_LEN_1_AUTHOR> How  did  she  give  up  her  dream  and  why ?
pred text = Do you see the tenfold disparity there?
*~*~*~*~*~*
source text = <COMMENT_LEN_1_AUTHOR> WASHINGTON  --  Paul  D .  Ryan  took  the  helm  of  the  House  two  and  a  half  years  ago ,  not  because  he  wanted  it ,  but  because  he  was  seen  as  the  only  lawmaker  who  could  keep  Republicans  from  dev ouring  themselves .  They  had  shut  down  the  g...
target text = <COMMENT_LEN_1_AUTHOR> es ign ,  rather  than  stand  up ?
pred text = Why would someone like Ryan run for public office in the first place?
*~*~*~*~*~*
source text = <COMMENT_LEN_1_AUTHOR>