# Collect generated text for evaluation data
We need to collect text generated by the model for the following two evaluation tasks:

1. Is reader-aware text (1) more relevant and (2) more likely to elicit interesting information from the author than non-aware text?
2. Is it as easy to distinguish reader groups in the generated text as it is in the real text?

In [1]:
import torch
val_data = torch.load('../../data/reddit_data/combined_data_val_data.pt', map_location='cpu')
# convert to a data frame, which is easier to filter and group
import pandas as pd
val_data = pd.DataFrame(list(val_data))
data_cols = ['article_id', 'author_has_subreddit_embed', 'author_has_text_embed', 'reader_token_str', 'source_text', 'target_text', 'subreddit_embed', 'text_embed']
val_data = val_data.loc[:, data_cols]
print(val_data.shape)
# load post metadata (article ID and subreddit); pandas is already imported above
post_metadata = pd.read_csv('../../data/reddit_data/combined_data_question_data.gz', sep='\t', compression='gzip', usecols=['article_id', 'subreddit'])
# add subreddit info
val_data = pd.merge(val_data, post_metadata, on=['article_id'], how='left')



(51302, 8)


## Collect data for no-reader and reader-aware generated questions
After generating text with `test_question_generation.py`, we can collect the questions generated by the no-reader and reader-aware models, keeping only questions that have some reader information attached.

In [None]:
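import gzip
# open in text mode ('rt') so each line is a str rather than bytes, and close the files when done
with gzip.open('../../data/reddit_data/text_only_model/test_data_output_text.gz', 'rt') as no_reader_file:
    no_reader_pred_text = list(no_reader_file)
with gzip.open('../../data/reddit_data/author_text_data/test_data_output_text.gz', 'rt') as reader_aware_file:
    reader_aware_pred_text = list(reader_aware_file)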
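
The filtering step described above is not shown in this cell. A minimal sketch of what it might look like follows, on toy data. It assumes that the lines of each output file are aligned row-for-row with the test data (not confirmed here), that the lines are `str` (decode them first if the files were opened in binary mode), and that `UNK` in `reader_token_str` marks a question without reader information. The function name `filter_reader_aware` and the column names `no_reader_pred` and `reader_aware_pred` are hypothetical.

```python
import pandas as pd

def filter_reader_aware(test_data, no_reader_text, reader_aware_text, unk_token='UNK'):
    """Attach both models' predictions and keep rows that have reader information."""
    # assumption: predictions are aligned row-for-row with test_data
    assert len(test_data) == len(no_reader_text) == len(reader_aware_text)
    pred_data = test_data.assign(
        no_reader_pred=[x.strip() for x in no_reader_text],
        reader_aware_pred=[x.strip() for x in reader_aware_text],
    )
    # assumption: unk_token marks questions with no reader information
    has_reader_info = pred_data.loc[:, 'reader_token_str'] != unk_token
    return pred_data[has_reader_info]

# toy stand-ins for the real test data and output files
toy_test_data = pd.DataFrame({
    'article_id': ['a1', 'a2', 'a3'],
    'reader_token_str': ['UNK', '<US_AUTHOR>', '<EXPERT_PCT_1_AUTHOR>'],
})
toy_no_reader = ['Why?\n', 'Where do you live?\n', 'What have you tried?\n']
toy_reader_aware = ['Why?\n', 'Which state are you in?\n', 'Which GPU is it?\n']
filtered = filter_reader_aware(toy_test_data, toy_no_reader, toy_reader_aware)
print(filtered)
```

Keeping both predictions in one frame makes it straightforward to compare the two models on exactly the same posts.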

## Collect data for real and generated reader-aware questions
Let's get real and generated text for reader-aware questions.

In [11]:
## for each reader group pair: get 2 questions from same article
reader_group_data = [
    ('EXPERT', '<EXPERT_PCT_0_AUTHOR>', '<EXPERT_PCT_1_AUTHOR>'),
    ('TIME', '<RESPONSE_TIME_0_AUTHOR>', '<RESPONSE_TIME_1_AUTHOR>'),
    ('LOC', '<US_AUTHOR>', '<NONUS_AUTHOR>'),
]
# val_data_article_reader_groups = val_data.groupby('article_id').apply(lambda x: set(x.loc[:, 'reader_token_str'].unique()))
# article ID | reader group class | question | reader group type | subreddit
sample_size = 20
for subreddit_i, data_i in val_data.groupby('subreddit'):
    print(subreddit_i)
    article_reader_groups_i = data_i.groupby('article_id').apply(lambda x: set(x.loc[:, 'reader_token_str'].unique()))
    for reader_group_type_j, reader_group_1, reader_group_2 in reader_group_data:
        articles_ids_j = article_reader_groups_i[article_reader_groups_i.apply(lambda x: reader_group_1 in x and reader_group_2 in x)]
        print(len(articles_ids_j))

Advice
15
38
1
AmItheAsshole
86
246
42
legaladvice
6
56
1
pcmasterrace
2
13
1
personalfinance
20
100
1


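OK! It looks like `pcmasterrace` and `personalfinance`, which we were planning to use in evaluation, don't have great coverage of reader groups: each has only 1 article with questions from both LOC groups, and `pcmasterrace` has just 2 (EXPERT) and 13 (TIME) paired articles.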
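
The coverage check above can also be wrapped in a helper that returns a labeled table instead of bare counts. This is a sketch on toy data, not the code that produced the output above; the function name `count_paired_articles` is hypothetical, and the column names follow the data frames used in this notebook.

```python
import pandas as pd

def count_paired_articles(data, reader_group_data):
    """Count articles per subreddit that have questions from both reader groups of each type."""
    # set of reader tokens observed for each (subreddit, article) pair
    article_tokens = data.groupby(['subreddit', 'article_id'])['reader_token_str'].apply(lambda x: set(x))
    counts = {}
    for group_type, group_1, group_2 in reader_group_data:
        has_both = article_tokens.apply(lambda x: group_1 in x and group_2 in x)
        # summing booleans counts the articles that contain both tokens
        counts[group_type] = has_both.groupby(level='subreddit').sum()
    return pd.DataFrame(counts)

toy_reader_group_data = [
    ('EXPERT', '<EXPERT_PCT_0_AUTHOR>', '<EXPERT_PCT_1_AUTHOR>'),
    ('LOC', '<US_AUTHOR>', '<NONUS_AUTHOR>'),
]
toy_data = pd.DataFrame([
    ['a1', 'Advice', '<EXPERT_PCT_0_AUTHOR>'],
    ['a1', 'Advice', '<EXPERT_PCT_1_AUTHOR>'],
    ['a2', 'Advice', '<US_AUTHOR>'],
    ['a3', 'legaladvice', '<US_AUTHOR>'],
    ['a3', 'legaladvice', '<NONUS_AUTHOR>'],
], columns=['article_id', 'subreddit', 'reader_token_str'])
coverage = count_paired_articles(toy_data, toy_reader_group_data)
print(coverage)
```

The same helper could be applied to any split with these three columns, which avoids repeating the nested loop for each split.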

Let's pivot to training data to improve coverage.

In [60]:
train_data = torch.load('../../data/reddit_data/combined_data_train_data.pt')
train_data = pd.DataFrame(list(train_data))
data_cols = ['article_id', 'author_has_subreddit_embed', 'author_has_text_embed', 'reader_token_str', 'source_text', 'source_ids_reader_token', 'target_text', 'subreddit_embed', 'text_embed', 'attention_mask']
train_data = train_data.loc[:, data_cols]
# add subreddit info
train_data = pd.merge(train_data, post_metadata, on=['article_id'], how='left')

In [18]:
print(train_data.loc[:, 'reader_token_str'].value_counts())

UNK                         1134850
<EXPERT_PCT_0_AUTHOR>        290931
<RESPONSE_TIME_0_AUTHOR>     223571
<RESPONSE_TIME_1_AUTHOR>      79179
<US_AUTHOR>                   29237
<NONUS_AUTHOR>                21367
<EXPERT_PCT_1_AUTHOR>         11819
Name: reader_token_str, dtype: int64


In [29]:
for subreddit_i, data_i in train_data.groupby('subreddit'):
    print(subreddit_i)
    article_reader_groups_i = data_i.groupby('article_id').apply(lambda x: set(x.loc[:, 'reader_token_str'].unique()))
    for reader_group_type_j, reader_group_1, reader_group_2 in reader_group_data:
        articles_ids_j = article_reader_groups_i[article_reader_groups_i.apply(lambda x: reader_group_1 in x and reader_group_2 in x)]
        print(f'group {reader_group_type_j} has {len(articles_ids_j)} articles')

Advice
group EXPERT has 66 articles
group TIME has 201 articles
group LOC has 8 articles
AmItheAsshole
group EXPERT has 270 articles
group TIME has 946 articles
group LOC has 122 articles
legaladvice
group EXPERT has 40 articles
group TIME has 199 articles
group LOC has 6 articles
pcmasterrace
group EXPERT has 19 articles
group TIME has 59 articles
group LOC has 0 articles
personalfinance
group EXPERT has 75 articles
group TIME has 413 articles
group LOC has 9 articles


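This looks better! Now we can sample up to 100 articles per subreddit and reader group type, and take one question from each of the two reader groups for every sampled article.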

In [117]:
import numpy as np
np.random.seed(123)
sample_size = 100
reader_group_sample_question_data = []
for subreddit_i, data_i in train_data.groupby('subreddit'):
    article_reader_groups_i = data_i.groupby('article_id').apply(lambda x: set(x.loc[:, 'reader_token_str'].unique()))
    for reader_group_type_j, reader_group_1, reader_group_2 in reader_group_data:
        article_ids_j = article_reader_groups_i[article_reader_groups_i.apply(lambda x: reader_group_1 in x and reader_group_2 in x)].index.tolist()
        # downsample to at most sample_size articles
        if len(article_ids_j) > sample_size:
            article_ids_j = np.random.choice(article_ids_j, sample_size, replace=False)
        # get paired reader group data for each article
        for article_id_k in article_ids_j:
            data_k = data_i[(data_i.loc[:, 'article_id']==article_id_k)]
            sample_data_k_1 = data_k[data_k.loc[:, 'reader_token_str']==reader_group_1].iloc[0, :]
            sample_data_k_2 = data_k[data_k.loc[:, 'reader_token_str']==reader_group_2].iloc[0, :]
            post_text_k = data_k.loc[:, 'source_text'].iloc[0]
            reader_group_sample_question_data.append([article_id_k, subreddit_i, post_text_k, sample_data_k_1.loc['target_text'], reader_group_1, sample_data_k_2.loc['target_text'], reader_group_2, reader_group_type_j])
reader_group_sample_question_data = pd.DataFrame(reader_group_sample_question_data, 
                                                 columns=['article_id', 'subreddit', 'post', 'question_1', 'group_1', 'question_2', 'group_2', 'group_type'])
display(reader_group_sample_question_data.head())

Unnamed: 0,article_id,subreddit,post,question_1,group_1,question_2,group_2,group_type
0,8kc90i,Advice,"First off, I have no problem with weed because...",Is she trying to use the marijuana as a shortc...,<EXPERT_PCT_0_AUTHOR>,but is it worth being caught?,<EXPERT_PCT_1_AUTHOR>,EXPERT
1,8l91i1,Advice,I work online and last week my boss called and...,"Now, how's that cold?",<EXPERT_PCT_0_AUTHOR>,Did you thank him for been a good boss and why...,<EXPERT_PCT_1_AUTHOR>,EXPERT
2,8p4uwz,Advice,Was in the midst of a panic attack over some t...,How would you feel if you or your family got a...,<EXPERT_PCT_0_AUTHOR>,Did any one see that you hit the car?,<EXPERT_PCT_1_AUTHOR>,EXPERT
3,8v8mga,Advice,"In other words, I don't know who I am. I suck ...",Have you ever tried keeping a diary or journal?,<EXPERT_PCT_0_AUTHOR>,"Do you doubt yourself, your choices, etc.?",<EXPERT_PCT_1_AUTHOR>,EXPERT
4,8vm0a1,Advice,"Hey, I’m 19 and a half and I’ve been out of hi...",What problem do you wish was solved in the world?,<EXPERT_PCT_0_AUTHOR>,"It was very different from School or College, ...",<EXPERT_PCT_1_AUTHOR>,EXPERT


Now let's generate questions for the same posts using the reader-aware model, and organize the output to match the layout of the real question pairs above.

In [34]:
## load model
from test_question_generation import load_model
model_cache_dir = '../../data/model_cache/'
model_weight_file = '../../data/reddit_data/author_text_data/question_generation_model/checkpoint-114500/pytorch_model.bin'
data_dir = '../../data/reddit_data/author_text_data/'
model_type = 'bart_author'
model, model_tokenizer = load_model(model_cache_dir, model_weight_file, model_type, data_dir)

In [118]:
## subset data to articles and reader groups mentioned in sample data
generation_data = train_data[train_data.loc[:, 'article_id'].isin(reader_group_sample_question_data.loc[:, 'article_id'].unique())].drop_duplicates(['article_id', 'reader_token_str'])
article_id_valid_tokens = reader_group_sample_question_data.groupby('article_id').apply(lambda x: x.iloc[0, :].loc[['group_1', 'group_2']].values.tolist())
generation_data = generation_data[generation_data.apply(lambda x: x.loc['reader_token_str'] in article_id_valid_tokens.loc[x.loc['article_id']], axis=1)]
generation_data = generation_data.sort_values('article_id')
generation_data = generation_data.loc[:, ['article_id', 'reader_token_str', 'source_ids_reader_token', 'attention_mask']]
generation_data = generation_data.rename(columns={'source_ids_reader_token' : 'source_ids'})
# duplicates on (article_id, reader_token_str) were already dropped above
print(generation_data.shape[0])
# fix tensor vars
generation_data = generation_data.assign(**{
    'source_ids' : generation_data.loc[:, 'source_ids'].apply(lambda x: torch.LongTensor(x)),
    'attention_mask' : generation_data.loc[:, 'attention_mask'].apply(lambda x: torch.LongTensor(x)),
})
# convert to list of dicts
generation_data_iter = generation_data.apply(lambda x: x.to_dict(), axis=1).values.tolist()
display(generation_data.head())

1634


Unnamed: 0,article_id,reader_token_str,source_ids,attention_mask
1636010,7py853,<NONUS_AUTHOR>,"[tensor(0), tensor(154), tensor(7), tensor(5),...","[tensor(1), tensor(1), tensor(1), tensor(1), t..."
1220202,7py853,<US_AUTHOR>,"[tensor(0), tensor(154), tensor(7), tensor(5),...","[tensor(1), tensor(1), tensor(1), tensor(1), t..."
1654293,7rkvv3,<US_AUTHOR>,"[tensor(0), tensor(17), tensor(27), tensor(119...","[tensor(1), tensor(1), tensor(1), tensor(1), t..."
1199596,7rkvv3,<NONUS_AUTHOR>,"[tensor(0), tensor(17), tensor(27), tensor(119...","[tensor(1), tensor(1), tensor(1), tensor(1), t..."
1507009,814t9v,<NONUS_AUTHOR>,"[tensor(0), tensor(1141), tensor(6), tensor(81...","[tensor(1), tensor(1), tensor(1), tensor(1), t..."


In [119]:
## generate text for all source examples
from model_helpers import generate_predictions
generation_method = 'beam_search'
num_beams = 8
pred_text = generate_predictions(model, generation_data_iter, model_tokenizer, generation_method=generation_method, num_beams=num_beams)

100%|██████████| 1634/1634 [07:34<00:00,  3.60it/s]


In [120]:
## re-add to generation data
generation_pred_data = generation_data.assign(**{
    'pred_text' : pred_text
})
## reorganize to match original sampled data
reader_token_groups = {
    'LOC' : ['<US_AUTHOR>', '<NONUS_AUTHOR>'],
    'EXPERT' : ['<EXPERT_PCT_0_AUTHOR>', '<EXPERT_PCT_1_AUTHOR>'],
    'TIME' : ['<RESPONSE_TIME_0_AUTHOR>', '<RESPONSE_TIME_1_AUTHOR>'],
}
reader_token_group_lookup = {v1 : k for k,v in reader_token_groups.items() for v1 in v}
def flatten_pred_data(data, reader_token_group_lookup):
    reader_group_type = reader_token_group_lookup[data.loc[:, 'reader_token_str'].iloc[0]]
    data_1 = data.iloc[0, :]
    data_2 = data.iloc[1, :]
    flat_data = [data_1.loc['pred_text'], data_1.loc['reader_token_str'], 
                 data_2.loc['pred_text'], data_2.loc['reader_token_str'],
                 reader_group_type]
    flat_data_cols = ['question_1', 'group_1', 'question_2', 'group_2', 'group_type']
    flat_data = pd.Series(flat_data, index=flat_data_cols)
    return flat_data
per_article_generation_pred_data = generation_pred_data.groupby('article_id').apply(lambda x: flatten_pred_data(x, reader_token_group_lookup)).reset_index()
# drop pairs where the model generated the same question for both reader groups
per_article_generation_pred_data = per_article_generation_pred_data[per_article_generation_pred_data.loc[:, 'question_1']!=per_article_generation_pred_data.loc[:, 'question_2']]
## join with metadata
per_article_generation_pred_data = pd.merge(per_article_generation_pred_data, reader_group_sample_question_data.loc[:, ['article_id', 'subreddit', 'post']].drop_duplicates('article_id'), on='article_id', how='left')
print(f'{per_article_generation_pred_data.shape[0]} generated pairs total')

260 generated pairs total
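
`flatten_pred_data` reads rows 0 and 1 of each article's group, so it implicitly assumes exactly two generated questions per article. A small guard along the following lines could be run before flattening. This is a sketch on toy data; the function name `find_unpaired_articles` is hypothetical.

```python
import pandas as pd

def find_unpaired_articles(pred_data, expected_rows=2):
    """Return article IDs whose number of generated questions differs from expected_rows."""
    rows_per_article = pred_data.groupby('article_id').size()
    return rows_per_article[rows_per_article != expected_rows].index.tolist()

toy_pred_data = pd.DataFrame({
    'article_id': ['a1', 'a1', 'a2', 'a3', 'a3', 'a3'],
    'reader_token_str': ['<US_AUTHOR>', '<NONUS_AUTHOR>', '<US_AUTHOR>',
                         '<US_AUTHOR>', '<NONUS_AUTHOR>', '<EXPERT_PCT_0_AUTHOR>'],
    'pred_text': ['q1', 'q2', 'q3', 'q4', 'q5', 'q6'],
})
# a2 has one row and a3 has three, so both are flagged
unpaired = find_unpaired_articles(toy_pred_data)
print(unpaired)
```

An empty list means every article can be flattened into a pair safely.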


Now that we have the real and generated data, let's shuffle the order of the two questions within each pair, so that a question's position does not reveal its reader group during annotation.

In [121]:
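def shuffle_questions_by_group(data, num_groups, group_vars=('question', 'group')):
    # column names per group in the original order, e.g. [['question_1', 'group_1'], ['question_2', 'group_2']]
    ordered_group_cols = [[f'{var}_{i}' for var in group_vars] for i in range(1, num_groups+1)]
    group_cols = list(ordered_group_cols)
    np.random.shuffle(group_cols)
    flat_group_cols = [y for x in group_cols for y in x]
    flat_ordered_group_cols = [y for x in ordered_group_cols for y in x]
    # read the values in shuffled order, then relabel them with the original slot names
    group_data = data.loc[flat_group_cols]
    group_data.index = flat_ordered_group_cols
    # Series.append was removed in pandas 2.0; pd.concat does the same job without mutating the input row
    data = pd.concat([data.drop(flat_ordered_group_cols), group_data])
    return data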
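
For the two-group case, the same idea can be expressed without a row-wise `apply`. The sketch below is an alternative, not the function used in this notebook: it swaps the (question, group) columns for a random subset of rows, and it will not reproduce the same random draws. The function name `swap_pairs_at_random` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def swap_pairs_at_random(data, rng):
    """Swap (question_1, group_1) with (question_2, group_2) for a random subset of rows."""
    swap = rng.random(len(data)) < 0.5
    shuffled = data.copy()
    # take values from the untouched input so the two assignments do not interfere
    shuffled.loc[swap, ['question_1', 'group_1']] = data.loc[swap, ['question_2', 'group_2']].values
    shuffled.loc[swap, ['question_2', 'group_2']] = data.loc[swap, ['question_1', 'group_1']].values
    return shuffled

toy_pairs = pd.DataFrame({
    'article_id': ['a1', 'a2', 'a3', 'a4'],
    'question_1': ['q1_first', 'q2_first', 'q3_first', 'q4_first'],
    'group_1': ['<US_AUTHOR>'] * 4,
    'question_2': ['q1_second', 'q2_second', 'q3_second', 'q4_second'],
    'group_2': ['<NONUS_AUTHOR>'] * 4,
})
rng = np.random.default_rng(123)
shuffled_pairs = swap_pairs_at_random(toy_pairs, rng)
print(shuffled_pairs)
```

Whichever version is used, the key property is that each question moves together with its group label, so the ground truth stays recoverable after shuffling.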

In [122]:
num_groups = 2
shuffled_generation_pred_data = per_article_generation_pred_data.apply(lambda x: shuffle_questions_by_group(x, num_groups), axis=1)
shuffled_reader_group_sample_question_data = reader_group_sample_question_data.apply(lambda x: shuffle_questions_by_group(x, num_groups), axis=1)
# align column order with the real question data
shuffled_generation_pred_data = shuffled_generation_pred_data.loc[:, shuffled_reader_group_sample_question_data.columns]
## prepare for annotation
reader_group_type_choices = pd.DataFrame([
    ['EXPERT', 'expert vs. novice'],
    ['TIME', 'fast response vs. slow response'],
    ['LOC', 'US vs. non-US'],
], columns=['group_type', 'group_type_choices'])
def prepare_for_annotation(data, reader_group_type_choices):
    # drop labels
    annotation_data = data.drop(['group_1', 'group_2'], axis=1)
    # add choices
    annotation_data = pd.merge(annotation_data, reader_group_type_choices, on=['group_type'], how='left')
    annotation_data = annotation_data.assign(**{'label' : -1})
    return annotation_data
annotation_generation_pred_data = prepare_for_annotation(shuffled_generation_pred_data, reader_group_type_choices)
annotation_reader_group_sample_question_data = prepare_for_annotation(shuffled_reader_group_sample_question_data, reader_group_type_choices)
display(annotation_generation_pred_data.head())
display(annotation_reader_group_sample_question_data.head())

Unnamed: 0,article_id,subreddit,post,group_type,question_1,question_2,group_type_choices,label
0,7py853,legaladvice,Turning to the experts at reddit! I'm in a ba...,LOC,Are you sure it's not a scam and not a identit...,Are you sure it's not a scam?,US vs. non-US,-1
1,814t9v,personalfinance,"My wife, over a year ago, thought she submitte...",LOC,Did she file for deferment?,Did you send her a certified letter or just a ...,US vs. non-US,-1
2,89p14v,Advice,Every Tuesday my neighbours garbage goes flyin...,LOC,Are you sure it wasn’t your mom’s personal junk?,How did you get your garbage there?,US vs. non-US,-1
3,8c71q3,Advice,My ex girlfriend is very suicidal. We broke up...,LOC,How long have you been with her?,How long have you been with this person?,US vs. non-US,-1
4,8kc90i,Advice,"First off, I have no problem with weed because...",EXPERT,Is she trying to use the marijuana for feeling...,What is her motivation behind it?,expert vs. novice,-1


Unnamed: 0,article_id,subreddit,post,group_type,question_1,question_2,group_type_choices,label
0,8kc90i,Advice,"First off, I have no problem with weed because...",EXPERT,Is she trying to use the marijuana as a shortc...,but is it worth being caught?,expert vs. novice,-1
1,8l91i1,Advice,I work online and last week my boss called and...,EXPERT,Did you thank him for been a good boss and why...,"Now, how's that cold?",expert vs. novice,-1
2,8p4uwz,Advice,Was in the midst of a panic attack over some t...,EXPERT,How would you feel if you or your family got a...,Did any one see that you hit the car?,expert vs. novice,-1
3,8v8mga,Advice,"In other words, I don't know who I am. I suck ...",EXPERT,"Do you doubt yourself, your choices, etc.?",Have you ever tried keeping a diary or journal?,expert vs. novice,-1
4,8vm0a1,Advice,"Hey, I’m 19 and a half and I’ve been out of hi...",EXPERT,What problem do you wish was solved in the world?,"It was very different from School or College, ...",expert vs. novice,-1


In [123]:
# check subreddit distribution
display(annotation_reader_group_sample_question_data.loc[:, 'subreddit'].value_counts())
display(annotation_generation_pred_data.loc[:, 'subreddit'].value_counts())

AmItheAsshole      300
personalfinance    184
Advice             174
legaladvice        146
pcmasterrace        78
Name: subreddit, dtype: int64

AmItheAsshole      87
personalfinance    59
Advice             55
legaladvice        41
pcmasterrace       18
Name: subreddit, dtype: int64

In [124]:
## restrict to same posts (copy so that later operations do not act on a slice)
## note: this matches on article_id only, so the real data can keep more than one pair per article
annotation_reader_group_sample_question_data = annotation_reader_group_sample_question_data[annotation_reader_group_sample_question_data.loc[:, 'article_id'].isin(annotation_generation_pred_data.loc[:, 'article_id'].unique())].copy()
## sort
annotation_reader_group_sample_question_data = annotation_reader_group_sample_question_data.sort_values(['subreddit', 'article_id'])
annotation_generation_pred_data = annotation_generation_pred_data.sort_values(['subreddit', 'article_id'])
print(annotation_reader_group_sample_question_data.shape[0])
print(annotation_generation_pred_data.shape[0])

280
260


In [131]:
## add real/generated labels
shuffled_reader_group_sample_question_data = shuffled_reader_group_sample_question_data.assign(**{
    'question_type' : 'real',
})
shuffled_generation_pred_data = shuffled_generation_pred_data.assign(**{
    'question_type' : 'author_token_model',
})
## combine
combined_ground_truth_data = pd.concat([
    shuffled_reader_group_sample_question_data,
    shuffled_generation_pred_data
], axis=0)
combined_annotation_data = pd.concat([
    annotation_reader_group_sample_question_data,
    annotation_generation_pred_data,
])
## sort
combined_annotation_data.sort_values(['article_id', 'subreddit', 'group_type_choices'], inplace=True)

In [132]:
# write ground truth data
combined_ground_truth_data.to_csv('../../data/reddit_data/annotation_data/generated_text_evaluation/reader_group_ground_truth_data.tsv', sep='\t', index=False)
# write annotation data
combined_annotation_data.to_csv('../../data/reddit_data/annotation_data/generated_text_evaluation/reader_group_annotation_data.tsv', sep='\t', index=False)
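
The annotation file drops the group labels and is sorted differently from the ground-truth file, so the two cannot be matched by row order. A sketch of how they might be re-joined after annotation follows, on toy data. It assumes that `article_id`, `group_type`, `question_1` and `question_2` together identify a row uniquely, which is not verified in this notebook; the function name `attach_ground_truth` is hypothetical.

```python
import pandas as pd

join_cols = ['article_id', 'group_type', 'question_1', 'question_2']

def attach_ground_truth(annotation_data, ground_truth_data, join_cols=join_cols):
    """Recover group labels and question type for each annotated pair."""
    truth_cols = join_cols + ['group_1', 'group_2', 'question_type']
    # validate raises a MergeError if the join keys do not identify rows uniquely
    return pd.merge(annotation_data, ground_truth_data.loc[:, truth_cols],
                    on=join_cols, how='left', validate='one_to_one')

toy_ground_truth = pd.DataFrame([
    ['a1', 'LOC', 'real q A', 'real q B', '<US_AUTHOR>', '<NONUS_AUTHOR>', 'real'],
    ['a1', 'LOC', 'gen q A', 'gen q B', '<NONUS_AUTHOR>', '<US_AUTHOR>', 'author_token_model'],
], columns=join_cols + ['group_1', 'group_2', 'question_type'])
# annotation rows arrive in a different order and without the group labels
toy_annotation = pd.DataFrame([
    ['a1', 'LOC', 'gen q A', 'gen q B', 'US vs. non-US', -1],
    ['a1', 'LOC', 'real q A', 'real q B', 'US vs. non-US', -1],
], columns=join_cols + ['group_type_choices', 'label'])
merged = attach_ground_truth(toy_annotation, toy_ground_truth)
print(merged.loc[:, ['question_1', 'group_1', 'question_type']])
```

With the real files, both frames would first be read back with `pd.read_csv(..., sep='\t')`.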