# Test question generation on Reddit comments
We've collected valid questions from comments in several advice subreddits, along with the posts they respond to.

Now let's try to generate the questions!

In [1]:
## load question data
import pandas as pd
question_data = pd.read_csv('../../data/reddit_data/subreddit_combined_valid_question_data.gz', sep='\t', compression='gzip', index_col=False)
print(question_data.shape[0])
display(question_data.head())

818557


Unnamed: 0,author,author_flair_text,author_fullname,body,created_utc,edited,id,parent_id,score,subreddit,question,valid_question_prob,info_question,post_question_overlap,post_question_overlap_score,post_question_overlap_sent
0,HindsightGraduate,,t2_2xbcyguc,"Yes, YWBTA. People can be very black-and-white...",1569354000.0,False,f1cajdw,d8pbuz,3.0,AmItheAsshole,Is there a solid impression she does that make...,0.810461,False,"(0.1, (['I', 'know', 'for', 'a', 'fact', 'that...",0.1,"['I', 'know', 'for', 'a', 'fact', 'that', 'she..."
1,jeliaser,,t2_1eb2ir3n,Here's my opinion as a California Real Estate ...,1566335000.0,False,exitsc7,csztdi,2.0,legaladvice,Likelihood of this being just a bluff?,0.526323,False,"(0.1, (['I', '’', 'm', 'contact', 'a', 'lawyer...",0.1,"['I', '’', 'm', 'contact', 'a', 'lawyer', 'abo..."
2,DamonTheron,,t2_yyh3u,Water is 90 a month? Hot damn USA is expensive...,1531750000.0,1531752032,e2hcc2w,8zarr7,1.0,personalfinance,Water is 90 a month?,0.599035,False,"(0.1, (['credit', 'score', 'is', '534', 'I', '...",0.1,"['credit', 'score', 'is', '534', 'I', 'think',..."
3,0000udeis000,Asshole Aficionado [17],t2_10j4wv,INFO: is your boss legally allowed to fire you...,1574778000.0,False,f8sdvtb,e1yep5,1.0,AmItheAsshole,INFO: is your boss legally allowed to fire you...,0.599035,True,"(0.1, (['My', 'wife', 'is', 'realli', 'mad', '...",0.1,"['My', 'wife', 'is', 'realli', 'mad', 'and', '..."
4,tonytroz,,t2_4apcg,The reason this can be VERY bad is because you...,1531747000.0,False,e2h9sw2,8zatsc,2.0,personalfinance,Instead of being miserable for 3 months why no...,0.557299,False,"(0.1, (['If', 'someth', 'doesn', ""'"", 't', 'ch...",0.1,"['If', 'someth', 'doesn', ""'"", 't', 'chang', '..."


In [2]:
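## add submission data
import json
import gzip
## one JSON object per line; close the file once everything is loaded
with gzip.open('../../data/reddit_data/subreddit_submissions_2018-01_2019-12.gz', 'rt') as submission_file:
    submission_data = pd.DataFrame([json.loads(x.strip()) for x in submission_file])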
submission_data.rename(columns={'id' : 'parent_id', 'selftext' : 'parent_text', 'title' : 'parent_title', 'author' : 'parent_author', 'edited' : 'parent_edited'}, inplace=True)
display(submission_data.head())

Unnamed: 0,parent_author,author_flair_text,created_utc,parent_edited,parent_id,num_comments,score,parent_text,subreddit,parent_title,category,author_fullname
0,deepsouthsloth,,1514764840,False,7nby0l,7,1,26M/married/2 kids\n\nEmployer match is 50% up...,personalfinance,Should I continue with 401k despite terrible e...,,
1,CapableCounteroffer,,1514764890,False,7nby5t,5,0,"On November 24th, I called AT&amp;T to inquire...",legaladvice,[FL] Issue getting AT&amp;T to pay early termi...,,
2,pinkcrayon69,,1514764948,False,7nbybf,9,3,I live in south OC but I need to move out of m...,personalfinance,I need to move out in a month. What should I p...,,
3,bobshellby,Needs 64bit Windows...,1514765040,False,7nbykz,6,0,Are there keycaps for the Microsoft wireless k...,pcmasterrace,Keyboard keycap help,,
4,j0sh135742,,1514765064,1.51477e+09,7nbyno,4,0,"So in MGL Part 1, Title 15, Chapter 94G, Secti...",legaladvice,Quick question about Medical Marijuana.,,


In [11]:
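## inspect edited submissions: `edited` holds a timestamp instead of False
## note: this check only catches int timestamps; float timestamps are skipped
edited_submission_data = submission_data[submission_data.loc[:, 'parent_edited'].apply(lambda x: type(x) is int)]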
display(edited_submission_data.loc[:, 'parent_text'].head(20).values)

array(['Me and my girlfriend live together in a duplex where the rent is around $450 (plus electric and water this is about $600). My girlfriend makes around $800 a month at her job. And I make around $400. We have a car payment that is $341.40 as well as the insurance which is $121.00 . My girlfriend is depressed and we don’t have enough to get her any help. \n\nI need help figuring out how to make our quality of life any better at all. Idk if we are allowed to apply for welfare or even how to start that process. Any advice would be greatly appreciated !\n\nUpdate: thank all of y’all for the advice and I appreciate the time. I’m sorry I couldn’t directly respond to all but I’ve been inspired. God bless.',
       'I\'ve been trying to think of how to condense this question so it\'s more palatable for a quicker response, but I\'m at a loss, so I hope somebody is willing to read this novel-length post.\n\nMy fiancee lives in New York state. Due to severe PTSD and mental illness, she has 

It looks like most of the edits will be too hard to identify automatically, so we will remove all edited submissions for now.

In [12]:
non_edited_submission_data = submission_data[submission_data.loc[:, 'parent_edited'].apply(lambda x: type(x) is bool and not x)]
print(f'{non_edited_submission_data.shape[0]}/{submission_data.shape[0]} non-edited posts')

796557/974252 non-edited posts
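
The filter above relies on how the `edited` field is encoded: `False` for untouched posts and a Unix timestamp otherwise, as in the `parent_edited` column shown earlier. A minimal sketch of that check, assuming timestamps may arrive as either `int` or `float` after JSON parsing; `is_edited` is a hypothetical helper, not part of the original notebook:

```python
def is_edited(edited_value):
    """Return True when a Reddit `edited` value marks an edited post.

    Assumption: the field is either a bool (False = never edited)
    or a numeric Unix timestamp of the last edit.
    """
    # bool is a subclass of int, so it has to be checked first
    if isinstance(edited_value, bool):
        return edited_value
    return isinstance(edited_value, (int, float))


example_values = [False, 1531752032, 1.51477e+09]
print([is_edited(v) for v in example_values])
```

Checking `bool` before the numeric types matters because `isinstance(False, int)` is true in Python.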


In [13]:
pd.set_option('display.max_colwidth', 100)
question_submission_data = pd.merge(
    question_data.loc[:, ['author', 'edited', 'id', 'subreddit', 'question', 'parent_id']],
    non_edited_submission_data.loc[:, ['parent_id', 'parent_text', 'parent_title', 'parent_edited']],
    on='parent_id',
)
print(question_submission_data.shape[0])
display(question_submission_data.head(10))

519557


Unnamed: 0,author,edited,id,subreddit,question,parent_id,parent_text,parent_title,parent_edited
0,HindsightGraduate,False,f1cajdw,AmItheAsshole,Is there a solid impression she does that makes you crack up every single time?,d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
1,TaKiDaLo,False,f1cmnoh,AmItheAsshole,But why do you keep asking this?,d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
2,rachelsnipples,False,f1ch41g,AmItheAsshole,"""Why are you asking me this?",d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
3,bigsisthrowaway19,False,f1cfiy5,AmItheAsshole,Why do you think she's asking?,d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
4,givebusterahand,False,f1dahgg,AmItheAsshole,What is telling her the truth going to do besides further destroy her self esteem?,d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
5,TripleV420,False,f1d2ms7,AmItheAsshole,Why don't you try to find ways to bring out her beauty?,d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
6,DK_Son,False,f1d5ej9,AmItheAsshole,"Or ""Do you think I'm a bitch?",d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
7,starjumper_,False,f1df88g,AmItheAsshole,It's not difficult to see the beauty in your friends so why not try it and tell her about it?,d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
8,velveteen279,False,f1csy7k,AmItheAsshole,Maybe ask her why she's feeling so shit about herself?,d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
9,Pastelroots,False,f1d0adj,AmItheAsshole,Why don't you suggest to your friend ways to look better instead of telling them their ugly?,d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False


In [14]:
## clean up columns
question_submission_data.rename(columns={'parent_text' : 'article_text', 'parent_id' : 'article_id', 'parent_title' : 'article_title'}, inplace=True)
## clean up text
import re
info_question_matcher = re.compile('^INFO:? ')
submission_text_matcher = re.compile('^(AITA|WIBTA)|[\n\r]')
question_submission_data = question_submission_data.assign(**{
    'question' : question_submission_data.loc[:, 'question'].apply(lambda x: info_question_matcher.sub('', x)),
    'article_text' : question_submission_data.loc[:, 'article_text'].apply(lambda x: submission_text_matcher.sub('', x)),
    'article_title' : question_submission_data.loc[:, 'article_title'].apply(lambda x: submission_text_matcher.sub('', x)),
})
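
To make the cleanup concrete, here is a small check of what the two patterns do to a question, a title, and a multi-paragraph body. The example strings are illustrative, not taken from the data:

```python
import re

# same patterns as in the cell above
info_question_matcher = re.compile('^INFO:? ')
submission_text_matcher = re.compile('^(AITA|WIBTA)|[\n\r]')

question = 'INFO: is your boss legally allowed to fire you?'
title = 'AITA for breaking off a friendship over rent?'
body = 'First paragraph.\n\nSecond paragraph.'

clean_question = info_question_matcher.sub('', question)
clean_title = submission_text_matcher.sub('', title)
clean_body = submission_text_matcher.sub('', body)

print(repr(clean_question))
print(repr(clean_title))
print(repr(clean_body))
```

Two side effects are worth knowing: the title keeps its leading space after `AITA` is stripped, and newlines are deleted rather than replaced, so adjacent paragraphs are joined without a separator. Substituting a single space for `[\n\r]` would be one way to avoid the latter, if that matters for the model input.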

In [19]:
## get sample!! otherwise training takes weeks lol
import numpy as np
np.random.seed(123)
sample_pct = 0.10
N_sample = int(sample_pct*question_submission_data.shape[0])
print(f'sampling {N_sample} posts')
sample_question_data = question_submission_data.loc[np.random.choice(question_submission_data.index, N_sample, replace=False), :]

sampling 51955 posts


Let's convert all the data to tensor format so that we can train/test in Torch.

In [20]:
from importlib import reload
import data_helpers
reload(data_helpers)
from data_helpers import prepare_question_data
# from transformers import AutoTokenizer
from transformers import BartTokenizer
data_dir = '../../data/reddit_data/'
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base', cache_dir=data_dir)
data_name = 'advice_subreddit'
train_pct = 0.9
max_source_length = 512
max_target_length = 64
data_vars = ['article_text', 'question', 'article_id', 'article_title']
prepare_question_data(sample_question_data, data_dir, data_name, tokenizer, 
                      train_pct=train_pct, 
                      data_vars=data_vars,
                      max_source_length=max_source_length,
                      max_target_length=max_target_length)



Downloading and preparing dataset csv/default-15c3f3e37a707338 (download: Unknown size, generated: Unknown size, post-processed: Unknown sizetotal: Unknown size) to /home/ianbstew/.cache/huggingface/datasets/csv/default-15c3f3e37a707338/0.0.0/ede98314803c971fef04bcee45d660c62f3332e8a74491e0b876106f3d99bd9b...


Dataset csv downloaded and prepared to /home/ianbstew/.cache/huggingface/datasets/csv/default-15c3f3e37a707338/0.0.0/ede98314803c971fef04bcee45d660c62f3332e8a74491e0b876106f3d99bd9b. Subsequent calls will reuse this data.




Downloading and preparing dataset csv/default-38f758c32da96e5e (download: Unknown size, generated: Unknown size, post-processed: Unknown sizetotal: Unknown size) to /home/ianbstew/.cache/huggingface/datasets/csv/default-38f758c32da96e5e/0.0.0/ede98314803c971fef04bcee45d660c62f3332e8a74491e0b876106f3d99bd9b...


Dataset csv downloaded and prepared to /home/ianbstew/.cache/huggingface/datasets/csv/default-38f758c32da96e5e/0.0.0/ede98314803c971fef04bcee45d660c62f3332e8a74491e0b876106f3d99bd9b. Subsequent calls will reuse this data.
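
`prepare_question_data` lives in the local `data_helpers` module and is not shown here. As a rough sketch of the kind of encoding step it presumably performs (an assumption based on the `source_ids`, `target_ids`, and `attention_mask` features of the saved dataset), the hypothetical `encode_pair` below truncates and pads the post text and the question to the configured lengths using the standard Hugging Face tokenizer call:

```python
def encode_pair(tokenizer, source_text, target_text,
                max_source_length=512, max_target_length=64):
    """Hypothetical helper: encode one (post, question) pair to fixed-length ids."""
    # pad or truncate the post text to exactly max_source_length tokens
    source = tokenizer(source_text, max_length=max_source_length,
                       padding='max_length', truncation=True)
    # pad or truncate the question to exactly max_target_length tokens
    target = tokenizer(target_text, max_length=max_target_length,
                       padding='max_length', truncation=True)
    return {
        'source_ids': source['input_ids'],
        'attention_mask': source['attention_mask'],
        'target_ids': target['input_ids'],
    }

# Usage (needs the tokenizer files to be available locally or downloadable):
# from transformers import BartTokenizer
# tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
# example = encode_pair(tokenizer, 'My boss fired me yesterday.', 'Why were you fired?')
```

Fixed lengths keep every example the same shape, which makes batching straightforward at the cost of some padding.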


[Progress bars omitted; totals: 46660, 47, 5287, 6]


### Test model output
After training (10% of data, ~15 hours), let's see how well-formed the generated questions are on the held-out validation data.

In [None]:
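## set CUDA device
## (`!export` only affects a throwaway subshell, so set the variable in the kernel's own environment)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'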

In [2]:
import torch
val_data = torch.load('../../data/reddit_data/advice_subreddit_val_data.pt')['train']
print(len(val_data))
print(val_data)

5287
Dataset(features: {'article_id': Value(dtype='string', id=None), 'article_title': Value(dtype='string', id=None), 'source_text': Value(dtype='string', id=None), 'target_text': Value(dtype='string', id=None), 'source_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'target_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}, num_rows: 5287)


In [4]:
from transformers import AutoModelForSeq2SeqLM, BartTokenizer
model_file = '../../data/reddit_data/text_only_model/question_generation_model/checkpoint-116500/pytorch_model.bin'
model_weights = torch.load(model_file)
generation_model = AutoModelForSeq2SeqLM.from_pretrained('facebook/bart-base', cache_dir='../../data/model_cache/')
generation_model.load_state_dict(model_weights)
model_tokenizer = BartTokenizer.from_pretrained('facebook/bart-base', cache_dir='../../data/model_cache/')

In [5]:
from data_helpers import generate_predictions
device_name = 'cuda:0'
generation_method = 'beam_search'
num_beams = 8
val_data_pred = generate_predictions(
    generation_model, val_data, model_tokenizer, 
    device_name=device_name, generation_method=generation_method,
    num_beams=num_beams,
)

100%|██████████| 5287/5287 [16:19<00:00,  5.40it/s]


In [6]:
## write to file!!
import gzip
pred_out_file = '../../data/reddit_data/advice_subreddit_val_data_pred.gz'
with gzip.open(pred_out_file, 'wt') as pred_out:
    pred_out.write('\n'.join(val_data_pred))
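
Joining predictions with newlines assumes that no generated question itself contains a newline; otherwise the file would hold more lines than predictions. A minimal sketch of a round trip that guards against this, using the hypothetical helpers `write_predictions` and `read_predictions` and a temporary file standing in for the real output path:

```python
import gzip
import os
import tempfile


def write_predictions(predictions, out_file):
    """Write one prediction per line, flattening any embedded newlines."""
    with gzip.open(out_file, 'wt', encoding='utf-8') as pred_out:
        for pred in predictions:
            pred_out.write(' '.join(pred.splitlines()) + '\n')


def read_predictions(in_file):
    """Read predictions back, one per line."""
    with gzip.open(in_file, 'rt', encoding='utf-8') as pred_in:
        return [line.rstrip('\n') for line in pred_in]


example_preds = ['What sort of degree is she getting?', 'Is it\nworth it?']
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_file = os.path.join(tmp_dir, 'pred.gz')
    write_predictions(example_preds, tmp_file)
    restored = read_predictions(tmp_file)
print(restored)
```

Because each prediction ends up on exactly one line, the restored list stays aligned with the rows of `val_data`.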

In [5]:
from rouge_score import rouge_scorer
# print(help(rouge_scorer))
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
score = scorer.score('this is a test sentence', 'this is a test sentences to try out')
print(score)

{'rougeL': Score(precision=0.625, recall=1.0, fmeasure=0.7692307692307693)}
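
ROUGE-L scores a prediction by the longest common subsequence (LCS) it shares with the target: precision is the LCS length over the prediction length, and recall is the LCS length over the target length. The sketch below is a simplified re-implementation for illustration (whitespace tokenization, no stemming), not the `rouge_score` code itself:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(target, prediction):
    """Simplified ROUGE-L: lowercase whitespace tokens, no stemming."""
    target_tokens = target.lower().split()
    pred_tokens = prediction.lower().split()
    if not target_tokens or not pred_tokens:
        return {'precision': 0.0, 'recall': 0.0, 'fmeasure': 0.0}
    lcs = lcs_length(target_tokens, pred_tokens)
    precision = lcs / len(pred_tokens)
    recall = lcs / len(target_tokens)
    fmeasure = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {'precision': precision, 'recall': recall, 'fmeasure': fmeasure}


simple_score = rouge_l('this is a test sentence', 'this is a test sentences to try out')
print(simple_score)
```

Without stemming, `sentence` and `sentences` no longer match, so the LCS drops from 5 to 4 tokens and the same pair gets precision 0.5 and recall 0.8 instead of the 0.625 and 1.0 reported above. That difference is what `use_stemmer=True` buys.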


In [7]:
## compare predicted data vs. true data
from importlib import reload
import data_helpers
reload(data_helpers)
from data_helpers import compare_pred_text_with_target
cutoff_idx = 100
max_txt_len = 400
extra_data_vars = ['article_id', 'article_title']
compare_pred_text_with_target(val_data, val_data_pred, model_tokenizer, 
                              max_txt_len=max_txt_len, cutoff_idx=cutoff_idx,
                              extra_data_vars=extra_data_vars)

*~*~*~*~*~*
article_id = cz2ri6
article_title =  for refusing to give my girlfriend money because I earn more than her
source text = I've (M24) been with my girlfriend (F25) for almost 8 years (practically married, I know) and we have lived together for 5 years. After my graduation I landed a full time job in IT at an Oil &amp; Gas firm.My girlfriend is working at a supermarket part time and due to non-guaranteed hours her pay fluctuates quite dramatically from time to time.My girlfriend graduated in this year in 2019 with a Ma...
target text = What's the point of money if you can't even spend it on what matters the most to you?
pred text = What sort of degree is she getting?
*~*~*~*~*~*
article_id = d42wf4
article_title =  for telling kids at my old High School that most teachers there are useless and that they need to depend on themselves?
source text = I (32M) left my High School 14 years ago. In the UK, we refer to High School as Secondary School. Same thing basically. I stayed at 

*~*~*~*~*~*
article_id = as7mkw
article_title =  for wearing a sports bra when I run outside?
source text = I like to wear a sports bra top when I do a running workout outside (weather permitting). My husband would prefer I didnt, and wore a shirt over. AITA?To support his opinion: To be fair, I am from the US originally, but we have emigrated elsewhere. There is not so much a workout/fitness culture here, and in all honesty, I dont think Ive ever seen anyone working out in just a sports bra...maybe onc...
target text = Is it worth it?
pred text = Also, are you wearing a sports bra?
*~*~*~*~*~*
article_id = b2za6i
article_title =  for sending my kid to school with "adult" snacks and lunches?
source text = Living a healthy lifestyle is important to me, and it's important to pass it onto my kids as well. I have a second grader who I have brought up (so far) to be very involved in the kitchen, what we eat, diet and exercise. She came home with a note from her teacher the other day asking 

*~*~*~*~*~*
article_id = dv9ogm
article_title =  for asking my husbands sister to consider being a surrogate for us?
source text = My husband and I have been trying for pregnancy for years now, and to cut a long story short it seems as though it will never be a possibility. It took a long time to come to terms with but we've gradually got there. Our entire family is aware of the journey we've been on and how much it meant to us. With that in mind, my husband and I came to his sister (Sarah) with a proposal.Sarah is in her ear...
target text = Because for something to spiral that far out of control needs more than "will you surrogate?
pred text = Didn't you say she should have made it clear that she wasn't sure?
*~*~*~*~*~*
article_id = 8ime01
article_title = I have finally become the "techguy" in the family... FML
source text = I get the blame for literally everything that goes wrong. I mean *EVERYTHING*.My aunt bought a laptop 10 years ago, and somehow ***I*** got the blame for the in

*~*~*~*~*~*
article_id = chlcbo
article_title =  for refusing to allow my daughter to participate in High School cheerleading?
source text = I have a son thats entering 8th grade and a daughter thats entering 9th grade.  Both of them have always been bookworms.  Very studious and cerebral, just like their Dad.  I also played sports all through school and I value the experience that brings.  I've always tried to push them to join sports, but neither one of my kids wanted to.  I tried to insist...maybe they would have fun?  We dont know ...
target text = Does the school have a cross-country team?
pred text = YTA you wanna be a cheerleader in high school?
*~*~*~*~*~*
article_id = 8ohr1y
article_title =  for breaking off a friendship over rent?
source text = I just found out yesterday that a good friend/roommate of mine has been dicking me over by charging me extra rent for a year and a half. He didn't just dick me over, he dicked three other people who lived in the house too. When confron

*~*~*~*~*~*
article_id = bhmv9b
article_title =  for forcing my girlfriend to eat healthier?
source text = I know this sounds like a SHP or too fishing, but hear me out.Since she was a kid, my girlfriend has been a very picky eater. She was admittedly a spoiled princess who ate nothing but meat and sweets whenever she liked. She told me that her also never liked fruits and veggies, so their entire household really wasn't into the whole "healthy eating" thing.When we first dated, I already knew this.I a...
target text = How is her diet affecting you?
pred text = NTA- If you can't find a healthy lifestyle, then what are you going to do?
*~*~*~*~*~*
article_id = 8zck5h
article_title = New hire NOOB looking for investment options. Should I enroll in Roth 401K or Traditional 401K?
source text = I am a new hire starting work this week. My income will be around $100K before taxes in California, probably around $70K after taxes.I have the option to enroll in my company's 401K with fidelity, bu

*~*~*~*~*~*
article_id = ccl4an
article_title =  for going on a trip instead of focusing my time and money on the family to-be?
source text =  The family doesn't exist yet. It was pretty obvious that he was ready for a family soon after we started dating but we hadn't started *really* talking about it until earlier this year. I've never wanted a family, I've never aspired to be a mother, and I didn't see this coming up so soon. We've been dating for a year, we were only seeing each other for a couple months before our status became rath...
target text = You're barely warming up to the idea of a family one day and he's trying to control how you spend your time and money now because you may or may not have kids some day?
pred text = You have been together for how long, moved in together, planned and paid for 6 months, and you have never wanted to even start a family?
*~*~*~*~*~*
article_id = cw34du
article_title =  for showing no sympathy to my daughter through a difficult time?
source t

*~*~*~*~*~*
article_id = djdmik
article_title =  for being angry with my in-laws?
source text = My in-laws are planning a family vacation to go down south (we live in Canada), where they pay for their kids to go, but significant others have to pay for themselves. That's fine, except we have a 6 month old and aren't comfortable traveling with him, so we talked to them about how we wouldn't be going and that's fine. Recently one of my sister in law's let us know when the vacation would be happ...
target text = NTA - What kind of shitty family purposefully schedule a vacation over their grandchild/nephew's first birthday?
pred text = Also, if they've been in the country for 6 months, how were they supposed to know the time off?
*~*~*~*~*~*
article_id = cz5vk2
article_title =  for refusing to use the money I inherited from my mother to pay for medical treatment for my half-sister, who has a life threatening sickness? She is the child of my father and the woman he cheated on my mother with.

- No cases of data bleeding? Seems good to me.
- Reasonable:
    - Submission title: `20 years old. ~45-50k a year. Great credit, 6% contribution to 401k with 6% match. What am I saving for?`
        - Target: `What would you like to do?`
        - Predicted: `And what do you do for fun when you don't have a job?`
    - Submission title: `Online Services, Chargeback attempt, customer threatens me with police`
        - Target: `So let me get this straight: Someone hired you to plagiarize his school work for him, presumably you know this person's real name and the name of the school he goes to, and now this person's decided to threaten *you* with police if you don't let him keep the work you did for`
        - Predicted: `By the way, do you have proof of his chargeback?`
    - Submission title: `I'm about to spend 37,000 on new computers, and I'm freaking worried.`
        - Target: `Also, do you really need 2tb?`
        - Predicted: `What kind of components are you looking for?`
    - Submission title: `Confusion regarding language of vacation time - Employer wants me to pay back "non-vested" vacation. Michigan, USA.`
        - Target: `Are you certain that theyre looking for pay back of vacation days you already used?`
        - Predicted: `How many employees does the company have?`
    - Submission title: `Review my situation?`
        - Target: `Why a 100k car?`
        - Predicted: `What is your annual income?`
- Ungrammatical:
    - Submission title: `Won a brand new SUV (Volvo XC60) from a video contest. Want to sell it! What should I do?`
        - Target: `Did you ever receive the car?`
        - Predicted: `prizes for a prize?`
    - Submission title: `for completely ignoring my father’s girlfriend`
        - Target: `He YELLED my name and I looked around confused like "there's nobody in this class by that name?`
        - Predicted: `mixes you up all the time?`
- Logically incorrect:
    - Submission title: `Grandpa PC Troubles (Please Help)`
        - Target: `Ddr2 or whatever the Mobo is able to use?`
        - Predicted: `What kind of budget did he have for the upgrade?`
    - Submission title: `I have finally become the "techguy" in the family... FML`
        - Target: `Why don't you tell them to fuck off?`
        - Predicted: `Why do you need a computer in the first place?`
- Repetitive:
    - Submission title: `for telling my sister who is getting a boob job that fake boobs are gross?`
        - Target: `You think fake boobs change the personality of the owner?`
        - Predicted: `She's getting a boob job in December?`
    - Submission title: `Should I call CPS?`
        - Target: `The bigger question is how would you feel some day if you see a newspaper with a picture of a building in your complex and EMTs talking about how a child died of neglect?`
        - Predicted: `Or do you yell at the top of your lungs when the baby is crying?`
- Inappropriate:
    - Submission title: `for refusing to pay back my cousin after my baby ruined her blouse?`
        - Target: `What is with these stories of there being a 3rd party conversation after the fact where they decide you owe somebody something?`
        - Predicted: `How the fuck is that even a question?`
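
The categories above were assigned by reading the output. As a rough aid for surfacing candidates for the "Repetitive" category, one could measure how many of a predicted question's words already appear in the submission title. This is a hypothetical heuristic (the function name and tokenization are assumptions), not something used in the analysis above:

```python
import re


def title_overlap(predicted, title):
    """Fraction of distinct predicted-question words that also occur in the title."""
    def tokenize(text):
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    pred_words = tokenize(predicted)
    if not pred_words:
        return 0.0
    return len(pred_words & tokenize(title)) / len(pred_words)


repetitive = title_overlap(
    "She's getting a boob job in December?",
    'for telling my sister who is getting a boob job that fake boobs are gross?',
)
reasonable = title_overlap('What is your annual income?', 'Review my situation?')
print(repetitive, reasonable)
```

A high overlap only suggests that the question restates the post; it would still need a manual check, and comparing against the post body rather than the title may work better for short titles.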