# Test question generation on Reddit comments
We've collected valid questions from comments in several advice subreddits, along with the posts that they respond to.

Now let's try to generate the questions!

In [1]:
## load question data
import pandas as pd
question_data = pd.read_csv('../../data/reddit_data/subreddit_combined_valid_question_data.gz', sep='\t', compression='gzip', index_col=False)
print(question_data.shape[0])
display(question_data.head())

818557


Unnamed: 0,author,author_flair_text,author_fullname,body,created_utc,edited,id,parent_id,score,subreddit,question,valid_question_prob,info_question,post_question_overlap,post_question_overlap_score,post_question_overlap_sent
0,HindsightGraduate,,t2_2xbcyguc,"Yes, YWBTA. People can be very black-and-white...",1569354000.0,False,f1cajdw,d8pbuz,3.0,AmItheAsshole,Is there a solid impression she does that make...,0.810461,False,"(0.1, (['I', 'know', 'for', 'a', 'fact', 'that...",0.1,"['I', 'know', 'for', 'a', 'fact', 'that', 'she..."
1,jeliaser,,t2_1eb2ir3n,Here's my opinion as a California Real Estate ...,1566335000.0,False,exitsc7,csztdi,2.0,legaladvice,Likelihood of this being just a bluff?,0.526323,False,"(0.1, (['I', '’', 'm', 'contact', 'a', 'lawyer...",0.1,"['I', '’', 'm', 'contact', 'a', 'lawyer', 'abo..."
2,DamonTheron,,t2_yyh3u,Water is 90 a month? Hot damn USA is expensive...,1531750000.0,1531752032,e2hcc2w,8zarr7,1.0,personalfinance,Water is 90 a month?,0.599035,False,"(0.1, (['credit', 'score', 'is', '534', 'I', '...",0.1,"['credit', 'score', 'is', '534', 'I', 'think',..."
3,0000udeis000,Asshole Aficionado [17],t2_10j4wv,INFO: is your boss legally allowed to fire you...,1574778000.0,False,f8sdvtb,e1yep5,1.0,AmItheAsshole,INFO: is your boss legally allowed to fire you...,0.599035,True,"(0.1, (['My', 'wife', 'is', 'realli', 'mad', '...",0.1,"['My', 'wife', 'is', 'realli', 'mad', 'and', '..."
4,tonytroz,,t2_4apcg,The reason this can be VERY bad is because you...,1531747000.0,False,e2h9sw2,8zatsc,2.0,personalfinance,Instead of being miserable for 3 months why no...,0.557299,False,"(0.1, (['If', 'someth', 'doesn', ""'"", 't', 'ch...",0.1,"['If', 'someth', 'doesn', ""'"", 't', 'chang', '..."


In [2]:
## add submission data
import json
import gzip
submission_data = pd.DataFrame([json.loads(x.strip()) for x in gzip.open('../../data/reddit_data/subreddit_submissions_2018-01_2019-12.gz', 'rt')])
submission_data.rename(columns={'id' : 'parent_id', 'selftext' : 'parent_text', 'title' : 'parent_title', 'author' : 'parent_author', 'edited' : 'parent_edited'}, inplace=True)
display(submission_data.head())

Unnamed: 0,parent_author,author_flair_text,created_utc,parent_edited,parent_id,num_comments,score,parent_text,subreddit,parent_title,category,author_fullname
0,deepsouthsloth,,1514764840,False,7nby0l,7,1,26M/married/2 kids\n\nEmployer match is 50% up...,personalfinance,Should I continue with 401k despite terrible e...,,
1,CapableCounteroffer,,1514764890,False,7nby5t,5,0,"On November 24th, I called AT&amp;T to inquire...",legaladvice,[FL] Issue getting AT&amp;T to pay early termi...,,
2,pinkcrayon69,,1514764948,False,7nbybf,9,3,I live in south OC but I need to move out of m...,personalfinance,I need to move out in a month. What should I p...,,
3,bobshellby,Needs 64bit Windows...,1514765040,False,7nbykz,6,0,Are there keycaps for the Microsoft wireless k...,pcmasterrace,Keyboard keycap help,,
4,j0sh135742,,1514765064,1.51477e+09,7nbyno,4,0,"So in MGL Part 1, Title 15, Chapter 94G, Secti...",legaladvice,Quick question about Medical Marijuana.,,


In [11]:
edited_submission_data = submission_data[submission_data.loc[:, 'parent_edited'].apply(lambda x: type(x) is int)]
display(edited_submission_data.loc[:, 'parent_text'].head(20).values)

array(['Me and my girlfriend live together in a duplex where the rent is around $450 (plus electric and water this is about $600). My girlfriend makes around $800 a month at her job. And I make around $400. We have a car payment that is $341.40 as well as the insurance which is $121.00 . My girlfriend is depressed and we don’t have enough to get her any help. \n\nI need help figuring out how to make our quality of life any better at all. Idk if we are allowed to apply for welfare or even how to start that process. Any advice would be greatly appreciated !\n\nUpdate: thank all of y’all for the advice and I appreciate the time. I’m sorry I couldn’t directly respond to all but I’ve been inspired. God bless.',
       'I\'ve been trying to think of how to condense this question so it\'s more palatable for a quicker response, but I\'m at a loss, so I hope somebody is willing to read this novel-length post.\n\nMy fiancee lives in New York state. Due to severe PTSD and mental illness, she has 

It looks like most of the edits will be too hard to identify automatically, so we will remove all edited submissions for now.

In [12]:
non_edited_submission_data = submission_data[submission_data.loc[:, 'parent_edited'].apply(lambda x: type(x) is bool and not x)]
print(f'{non_edited_submission_data.shape[0]}/{submission_data.shape[0]} non-edited posts')

796557/974252 non-edited posts
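The `parent_edited` field mixes types: `False` for untouched posts and an edit timestamp otherwise, and the timestamp can show up as an int or as a float (see the `1.51477e+09` value in the table above). As a minimal sketch, assuming that reading of the field is right, one predicate can cover all cases. The helper name `is_edited` and the toy values are hypothetical and not part of the original pipeline.

```python
def is_edited(value):
    """Hypothetical helper: anything other than a plain boolean False counts as edited."""
    is_plain_false = isinstance(value, bool) and not value
    return not is_plain_false

# Toy values mimicking the mixed types seen in `parent_edited`.
toy_edited_values = [False, 1531752032, 1.51477e+09, True, False]
edited_flags = [is_edited(value) for value in toy_edited_values]
print(edited_flags)
```

The same function could be passed to `submission_data.loc[:, 'parent_edited'].apply(...)`. It relies on the column holding Python booleans, which is what the `type(x) is bool` check above already assumes.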


In [13]:
pd.set_option('display.max_colwidth', 100)
question_submission_data = pd.merge(
    question_data.loc[:, ['author', 'edited', 'id', 'subreddit', 'question', 'parent_id']],
    non_edited_submission_data.loc[:, ['parent_id', 'parent_text', 'parent_title', 'parent_edited']],
    on='parent_id',
)
print(question_submission_data.shape[0])
display(question_submission_data.head(10))

519557


Unnamed: 0,author,edited,id,subreddit,question,parent_id,parent_text,parent_title,parent_edited
0,HindsightGraduate,False,f1cajdw,AmItheAsshole,Is there a solid impression she does that makes you crack up every single time?,d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
1,TaKiDaLo,False,f1cmnoh,AmItheAsshole,But why do you keep asking this?,d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
2,rachelsnipples,False,f1ch41g,AmItheAsshole,"""Why are you asking me this?",d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
3,bigsisthrowaway19,False,f1cfiy5,AmItheAsshole,Why do you think she's asking?,d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
4,givebusterahand,False,f1dahgg,AmItheAsshole,What is telling her the truth going to do besides further destroy her self esteem?,d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
5,TripleV420,False,f1d2ms7,AmItheAsshole,Why don't you try to find ways to bring out her beauty?,d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
6,DK_Son,False,f1d5ej9,AmItheAsshole,"Or ""Do you think I'm a bitch?",d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
7,starjumper_,False,f1df88g,AmItheAsshole,It's not difficult to see the beauty in your friends so why not try it and tell her about it?,d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
8,velveteen279,False,f1csy7k,AmItheAsshole,Maybe ask her why she's feeling so shit about herself?,d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
9,Pastelroots,False,f1d0adj,AmItheAsshole,Why don't you suggest to your friend ways to look better instead of telling them their ugly?,d8pbuz,"So my friend keeps asking people if they find her ugly, she’s done this to me once and also many...","WIBTA if my friend (16F) asked me (16F) whether I think she’s ugly, and I were to be honest and ...",False
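`pd.merge` defaults to an inner join, so any question whose parent post is missing from `non_edited_submission_data` silently disappears, which is consistent with the row count falling from 818557 to 519557. Here is a small sketch of how one might audit such a join. The toy frames are hypothetical; `how`, `indicator` and `validate` are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for the question and submission tables.
toy_questions = pd.DataFrame({
    'id': ['q1', 'q2', 'q3'],
    'parent_id': ['p1', 'p1', 'p2'],
    'question': ['Why?', 'How much?', 'Since when?'],
})
toy_posts = pd.DataFrame({
    'parent_id': ['p1'],
    'parent_text': ['Some unedited post.'],
})

# A left join with an indicator column keeps unmatched questions visible;
# validate raises an error if a parent_id is duplicated on the post side.
merged = pd.merge(toy_questions, toy_posts, on='parent_id',
                  how='left', indicator=True, validate='many_to_one')
unmatched = merged[merged['_merge'] == 'left_only']
matched = merged[merged['_merge'] == 'both'].drop(columns='_merge')
print(f'{matched.shape[0]} matched, {unmatched.shape[0]} unmatched')
```

Dropping the `left_only` rows afterwards gives the same result as the inner join, but the unmatched count is available for a sanity check first.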


In [14]:
## clean up columns
question_submission_data.rename(columns={'parent_text' : 'article_text', 'parent_id' : 'article_id', 'parent_title' : 'article_title'}, inplace=True)
## clean up text
import re
info_question_matcher = re.compile('^INFO:? ')
submission_text_matcher = re.compile('^(AITA|WIBTA)|[\n\r]')
question_submission_data = question_submission_data.assign(**{
    'question' : question_submission_data.loc[:, 'question'].apply(lambda x: info_question_matcher.sub('', x)),
    'article_text' : question_submission_data.loc[:, 'article_text'].apply(lambda x: submission_text_matcher.sub('', x)),
    'article_title' : question_submission_data.loc[:, 'article_title'].apply(lambda x: submission_text_matcher.sub('', x)),
})
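To see what the two matchers actually do, let's run them on a few short strings. Note two side effects: stripping `AITA`/`WIBTA` leaves a leading space on the title, and deleting newlines glues paragraphs together without a space. The function `clean_submission_text` is a hypothetical alternative, shown only as one option; it would change the text fed to the model.

```python
import re

# Same patterns as in the cell above.
info_question_matcher = re.compile('^INFO:? ')
submission_text_matcher = re.compile('^(AITA|WIBTA)|[\n\r]')

cleaned_question = info_question_matcher.sub('', 'INFO: is your boss legally allowed to fire you?')
cleaned_title = submission_text_matcher.sub('', 'AITA for not going to work sooner?')
cleaned_text = submission_text_matcher.sub('', 'First paragraph.\n\nSecond paragraph.')
print(repr(cleaned_question))
print(repr(cleaned_title))
print(repr(cleaned_text))

def clean_submission_text(text):
    """Hypothetical variant: drop the prefix, turn newline runs into one space, trim."""
    text = re.sub(r'^(AITA|WIBTA)\b', '', text)
    text = re.sub(r'[\n\r]+', ' ', text)
    return text.strip()

print(repr(clean_submission_text('AITA for not going to work sooner?')))
print(repr(clean_submission_text('First paragraph.\n\nSecond paragraph.')))
```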

In [16]:
## get sample!! otherwise training takes weeks lol
import numpy as np
np.random.seed(123)
sample_pct = 0.25
N_sample = int(sample_pct*question_submission_data.shape[0])
print(f'sampling {N_sample} posts')
sample_question_data = question_submission_data.loc[np.random.choice(question_submission_data.index, N_sample, replace=False), :]

sampling 129889 posts


Let's convert all the data to tensor format so that we can train and test in PyTorch.

In [18]:
from importlib import reload
import data_helpers
reload(data_helpers)
from data_helpers import prepare_question_data
# from transformers import AutoTokenizer
from transformers import BartTokenizer
data_dir = '../../data/reddit_data/'
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base', cache_dir=data_dir)
data_name = 'advice_subreddit'
train_pct = 0.9
max_source_length = 512
max_target_length = 64
data_vars = ['article_text', 'question', 'article_id', 'article_title']
prepare_question_data(sample_question_data, data_dir, data_name, tokenizer, 
                      train_pct=train_pct, 
                      data_vars=data_vars,
                      max_source_length=max_source_length,
                      max_target_length=max_target_length)



Downloading and preparing dataset csv/default-86e6a22ff6bed936 (download: Unknown size, generated: Unknown size, post-processed: Unknown sizetotal: Unknown size) to /home/ianbstew/.cache/huggingface/datasets/csv/default-86e6a22ff6bed936/0.0.0/ede98314803c971fef04bcee45d660c62f3332e8a74491e0b876106f3d99bd9b...


Dataset csv downloaded and prepared to /home/ianbstew/.cache/huggingface/datasets/csv/default-86e6a22ff6bed936/0.0.0/ede98314803c971fef04bcee45d660c62f3332e8a74491e0b876106f3d99bd9b. Subsequent calls will reuse this data.




Downloading and preparing dataset csv/default-849d4eae804f9e12 (download: Unknown size, generated: Unknown size, post-processed: Unknown sizetotal: Unknown size) to /home/ianbstew/.cache/huggingface/datasets/csv/default-849d4eae804f9e12/0.0.0/ede98314803c971fef04bcee45d660c62f3332e8a74491e0b876106f3d99bd9b...


Dataset csv downloaded and prepared to /home/ianbstew/.cache/huggingface/datasets/csv/default-849d4eae804f9e12/0.0.0/ede98314803c971fef04bcee45d660c62f3332e8a74491e0b876106f3d99bd9b. Subsequent calls will reuse this data.
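The internals of `prepare_question_data` are not shown here, so the following is not a description of how it splits the data. Since one post usually has several questions, one option worth considering is to split by `article_id` so that a post never appears in both train and validation. This is a minimal sketch under that assumption; `split_article_ids` and the toy IDs are hypothetical.

```python
import random

def split_article_ids(article_ids, train_pct=0.9, seed=123):
    """Hypothetical helper: assign each unique article ID to exactly one split."""
    unique_ids = sorted(set(article_ids))
    random.Random(seed).shuffle(unique_ids)
    n_train = int(train_pct * len(unique_ids))
    return set(unique_ids[:n_train]), set(unique_ids[n_train:])

# Toy IDs: ten posts with two questions each.
toy_article_ids = [f'post{i}' for i in range(10) for _ in range(2)]
train_ids, val_ids = split_article_ids(toy_article_ids)
print(len(train_ids), len(val_ids))
```

With the real data, the row-level split would then be something like `sample_question_data[sample_question_data.loc[:, 'article_id'].isin(train_ids)]`.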






### Test model output
After training on 10% of the data (~15 hours), let's see how well-formed the generated questions are on the held-out validation data.

In [1]:
import torch
val_data = torch.load('../../data/reddit_data/advice_subreddit_val_data.pt')['train']
print(len(val_data))
print(val_data)

8024
Dataset(features: {'article_id': Value(dtype='string', id=None), 'article_title': Value(dtype='string', id=None), 'source_text': Value(dtype='string', id=None), 'target_text': Value(dtype='string', id=None), 'source_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'target_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}, num_rows: 8024)


In [2]:
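## set CUDA device
## (`!export` only affects a throwaway subshell, so set the variable in this process instead)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'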

In [3]:
from transformers import AutoModelForSeq2SeqLM, BartTokenizer
model_file = '../../data/reddit_data/text_only_model/question_generation_model/checkpoint-184000/pytorch_model.bin'
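## load the fine-tuned weights onto the CPU, whatever device they were saved from
model_weights = torch.load(model_file, map_location='cpu')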
generation_model = AutoModelForSeq2SeqLM.from_pretrained('facebook/bart-base', cache_dir='../../data/model_cache/')
generation_model.load_state_dict(model_weights)
model_tokenizer = BartTokenizer.from_pretrained('facebook/bart-base', cache_dir='../../data/model_cache/')

In [4]:
from data_helpers import generate_predictions
device_name = 'cuda:0'
generation_method = 'beam_search'
num_beams = 8
val_data_pred = generate_predictions(
    generation_model, val_data, model_tokenizer, 
    device_name=device_name, generation_method=generation_method,
    num_beams=num_beams,
)

100%|██████████| 8024/8024 [28:14<00:00,  4.74it/s]
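`generate_predictions` lives in `data_helpers` and is not shown, so the following is not its implementation. To make the `num_beams` setting concrete, here is a toy beam search in plain Python over a hand-written next-token table. The table, the token names and the `beam_search` function are all hypothetical; the point is only that keeping more than one hypothesis per step can recover a sequence that greedy decoding (`num_beams=1`) misses.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
toy_table = {
    '<s>': {'what': 0.6, 'how': 0.4},
    'what': {'now': 0.5, 'then': 0.5},
    'how': {'much': 0.9, 'many': 0.1},
    'now': {'</s>': 1.0}, 'then': {'</s>': 1.0},
    'much': {'</s>': 1.0}, 'many': {'</s>': 1.0},
}

def next_log_probs(tokens):
    return {tok: math.log(p) for tok, p in toy_table[tokens[-1]].items()}

def beam_search(start_token, end_token, num_beams=2, max_len=5):
    """Toy beam search: keep the num_beams best partial sequences at each step."""
    beams = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, log_prob in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + log_prob))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda item: item[1])

greedy_tokens, _ = beam_search('<s>', '</s>', num_beams=1)
beam_tokens, beam_score = beam_search('<s>', '</s>', num_beams=2)
print(greedy_tokens)
print(beam_tokens)
```

Greedy decoding commits to `what` (probability 0.6) and ends with a sequence of probability 0.3, while two beams keep `how` alive long enough to find `how much` at 0.36.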


In [9]:
## write to file!!
import gzip
pred_out_file = '../../data/reddit_data/advice_subreddit_val_data_pred.gz'
with gzip.open(pred_out_file, 'wt') as pred_out:
    pred_out.write('\n'.join(val_data_pred))
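One caveat with the newline-joined format: reading the file back only lines up with `val_data` if no prediction contains a newline of its own. A quick round-trip sketch, with toy predictions and a temporary file standing in for the real path (both hypothetical):

```python
import gzip
import os
import tempfile

toy_predictions = [
    'What processor do you have?',
    'Did you talk to your professor about this?',
]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_pred_file = os.path.join(tmp_dir, 'toy_pred.gz')
    # Write one prediction per line, as in the cell above.
    with gzip.open(toy_pred_file, 'wt', encoding='utf-8') as pred_out:
        pred_out.write('\n'.join(toy_predictions))
    # Read the predictions back and split on the same delimiter.
    with gzip.open(toy_pred_file, 'rt', encoding='utf-8') as pred_in:
        loaded_predictions = pred_in.read().split('\n')

print(loaded_predictions == toy_predictions)
```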

In [6]:
## compare predicted data vs. true data
from importlib import reload
import data_helpers
reload(data_helpers)
from data_helpers import compare_pred_text_with_target
cutoff_idx = 100
max_txt_len = 400
extra_data_vars = ['article_id', 'article_title']
compare_pred_text_with_target(val_data, val_data_pred, model_tokenizer, 
                              max_txt_len=max_txt_len, cutoff_idx=cutoff_idx,
                              extra_data_vars=extra_data_vars)

*~*~*~*~*~*
article_id = csxfq2
article_title = Things I wish I'd done in my 20's
source text = I was thinking this morning about habits I developed a bit later than I should have, even when I knew I should have been doing them. These are a few things I thought I'd share and interested if others who are out of their 20s now have anything additional to add.Edit 1: This is not a everyone must follow this list, but rather one philosophy and how I look back on things.Edit 2: I had NO idea this m...
target text = And how do you get the 401k matched?
pred text = Is there a salary limit for opening a Roth 401k?
*~*~*~*~*~*
article_id = brayou
article_title =  for being disappointed with my boyfriend buying me a car?
source text = Yesterday was my 20th birthday, and my boyfriend bought me a car. My boyfriend is 22. He has been pestering me for the entire relationship (18 months) to pass my driving test. Three weeks ago, I started my lessons, pretty much just to shut him up about it. I don't ha

*~*~*~*~*~*
article_id = dr7l2l
article_title = I need assitance in what to do next after messing up my personal financials over the passed 6 months.
source text = Hey there internet. I have never been one to keep any system on my finances, as every system I have tried, has failed within a month. Mostly due to myself. I finally started getting back on track after I figured out what all of my debts were, and made a plan for my future of economics. Not long after this, I got a new job which required me to travel constantly, and work an insane amount. Not only ...
target text = How much is your car worth, and can you downsize the car?
pred text = $28k job that has you travel and use your own money?
*~*~*~*~*~*
article_id = ai2o9o
article_title = S/O Sold a car to a “friend” that was slowly paying it off and is now MIA
source text = [WA] My significant other sold her car to a coworker/friend a while back and they wrote out a contract and we still have the title until she pays it off. Well 

*~*~*~*~*~*
article_id = cd9r4d
article_title = My Parents Want me to join the military and are expecting me to, even tho I have told them before that I don't want to, and I don't know how to tell them that I'm not going to enlist.
source text = I'm 17, turning 18 in November, and since i was little i was told that the military was my only option after highschool. 6 of my 7 older siblings went in, 3 to the air force, 3 to the navy, and i believed that I was going to enlist like them, since that was what my parents always told me. Up until a few months ago I believed them. After a field trip to a university through AFJROTC, taking the SAT,...
target text = I think the advice you're looking for is how to break it to your parents gently, is that right?
pred text = I think the advice you're looking for is how to break it to your parents gently, is that right?
*~*~*~*~*~*
article_id = bkpheo
article_title = Does my employer have to follow my doctor recommendations to help with my medical co

article_title =  for creating a fake gmail pretending to be my wife's ex to give her some closure from a bad relationship?
source text = I hope the need for a tosser account is obvious, but anyways here goes:Wife and I are both on our second marriages and carry a lot of baggage into this relationship. I love her very much and we've had long talks that it was our fucked up pasts that led us to each other. Her relationship with her ex husband is actually really good and I the three of us actually have a really productive relationship...
target text = Wants to meet up for old times sake??
pred text = What if she messages him back and asks to meet up?
*~*~*~*~*~*
article_id = abo1yj
article_title =  for not wanting to share my potential lottery winnings with my boyfriend?
source text = My boyfriend drove us to a gas station tonight to purchase Mega Millions tickets. I was joking around saying that I was going to win. He said that if I did win, he would expect me to give him a couple millio

*~*~*~*~*~*
article_id = 937hbs
article_title = How to deal with a complicated squatting situation?
source text = If you don't want to help then don't post. You don't know the full story so don't come at me judging me like you know me or my family or our story. We're in California.My parents are illegally squatting in a horse barn which my father and I made into a livable space with a bathroom a kitchen so on and so forth nothing is up to code or permitted. Not the sewage tank not the framing not anything not...
target text = I am confused, is this barn owned by your parents, on land that they own as well?
pred text = I am confused, is this barn owned by your parents, on land that they own as well?
*~*~*~*~*~*
article_id = cuo7et
article_title =  if I start going to Church?
source text = I am not Christian. Even though I do not follow the religion (let us just say X) that I was brought up with, I still believe in a God, or a higher power. I basically believe what my mother told me when

*~*~*~*~*~*
article_id = blibvq
article_title =  for not helping my daughter pay for college and not co-signing a loan?
source text = My daughter is turning 18 in two weeks. She is starting college this fall. She has been accepted into 8 universities but has gotten little to no aid from any of them. The government didn't give her any aid either since our income is apparently too high. My daughter got into her dream school after being waitlisted. She was really happy to start college there this fall, but she has no money. I don't...
target text = What are your reason to say no?
pred text = Do you hold a special grudge against her?
*~*~*~*~*~*
article_id = 8yk90j
article_title = The Wholesomegaming giveaway
source text = Over 60 tech companies from all over the world have joined forces in an insane Twitter thread that went viral. Each company joining the wholesome gaming giveaway did so because PC culture and gaming goes beyond brands and companies. It's about being connected through a l

*~*~*~*~*~*
article_id = duzjox
article_title = How do I get out of my husbands shadow?
source text = I'm 30 years old and I've been with my husband since I was 15. So I never got a chance to learn about myself. And I feel insecure and have no confidence. He is a social butterfly while I am socially awkward and have no friends. At my age it's hard to build what should've been built in my late teens and early 20s....
target text = Do I sometimes jump to the worst possible conclusion?
pred text = Do I sometimes jump to the worst possible conclusion?
*~*~*~*~*~*
article_id = bmqrae
article_title = RYZEN 2400 or Ryzen 1600?
source text = Which would be better to upgrade to? they are both around the same price range and i need a new processor lmao...
target text = If your flair is accurate and that you currently have the FX-4300, I hope you realize that to get any Ryzen processor you'll need a new motherboard, and also some new RAM?
pred text = What processor do you have?
*~*~*~*~*~*
articl

*~*~*~*~*~*
article_id = chcqkp
article_title =  for not supporting my sister's "coming out" party?
source text = I am 27f, my sister is 16 (almost 17). She lives with our parents and has always been very.... Lively....She announced a month ago that for her birthday in July she was going to have a "special" celebration. She and I have always been close, so I pumped her for more details. She eventually admit it was a "coming out" party. Normal I would have been happy and supportive, but she had been dating the...
target text = NTA what kind of closet had she been hiding in?
pred text = You're 27 and she's just coming out?
*~*~*~*~*~*
article_id = b82fqg
article_title =  for not going to work sooner?
source text = Hello! Im not a native speaker, so sorry for eventual mistakes. Also Im writing this on my phone.A little background - me and my boyfriend have been together for a little over a year now. When we started dating, he was working very irrlegularly, worked for one company for a mon

- There are quite a few cases of data bleeding, i.e. the predicted text repeats the target text verbatim, apparently due to memorization of responses.
    - Submission title: `What are some ways to repay your student loans early?`
        - Target: `Do you have any unsubsidized loans?`
        - Predicted: `Do you have any unsubsidized loans?`
- Some of the questions are ungrammatical but still on topic:
    - Submission title: `for expressing disappointment over our child’s gender?`
        - Target: `Eggs always have an X inside, sperm will either have an X or a Y. Sometimes theres a mutation with XXY or XYY, but it is rare, and even still the unfertilized egg will only carry an X. Maybe you only shoot Ys?`
        - Predicted: `You’re disappointed because your son doesn’t have a girl?`
    - Submission title: `for reclaiming my old WoW/Blizzard Account after a 7+ year hiatus?`
        - Target: `But on the other hand, what if you gave him that account?`
        - Predicted: `if he had not continued to pay the subscription would the account have disappeared and progress lost?`
- Even the questions that miss the mark are still reasonable:
    - Submission title: `Community College Wants My Blood for Points?`
        - Target: `Does the assignment give any actual reason behind wanting you to do this or does it just say to get blood drawn?`
        - Predicted: `Did you talk to your professor about this?`
    - Submission title: `PC dead after power cut and CMOS reset`
        - Target: `So, how does a computer work?`
        - Predicted: `What is a power controller doing and seeing?`
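To put a number on the verbatim repeats, one simple check is the share of predictions that equal their target after trimming and lowercasing. This is a rough sketch; `exact_match_rate` is a hypothetical helper, and the four pairs are copied from the comparison output above rather than computed over the full validation set.

```python
def exact_match_rate(targets, predictions):
    """Hypothetical helper: fraction of predictions identical to their target."""
    if len(targets) != len(predictions):
        raise ValueError('targets and predictions must have the same length')
    if not targets:
        return 0.0
    matches = sum(
        target.strip().lower() == pred.strip().lower()
        for target, pred in zip(targets, predictions)
    )
    return matches / len(targets)

toy_targets = [
    'And how do you get the 401k matched?',
    "I think the advice you're looking for is how to break it to your parents gently, is that right?",
    'I am confused, is this barn owned by your parents, on land that they own as well?',
    'What are your reason to say no?',
]
toy_preds = [
    'Is there a salary limit for opening a Roth 401k?',
    "I think the advice you're looking for is how to break it to your parents gently, is that right?",
    'I am confused, is this barn owned by your parents, on land that they own as well?',
    'Do you hold a special grudge against her?',
]
toy_rate = exact_match_rate(toy_targets, toy_preds)
print(toy_rate)
```

A high rate on the full validation set would suggest that the same post/question pairs are reachable from the training data, which is worth ruling out before reading too much into the generated questions.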