# Reddit AITA Hugging Face Dataset Creation


Two input files from datafile_filtering:
1. AITA submissions with a score of at least 50
2. Top-level comments with a score of at least 10, posted on those submissions

One output file:
1. A CSV/ZST file in which each row is an AITA submission (score of at least 50) with columns for up to 10 of its highest-scoring top-level comments (each with a score of at least 10)

In [None]:
%pip install zstandard
%pip install pandas

In [None]:
import pandas as pd
import zstandard as zstd

## Creation of AITA submissions dataframe

In [1]:
# load submissions csv

submissions_df = pd.read_csv('new_datasets/submissions_2019_to_2022_at_least_50_score.csv')

In [None]:
# filter submissions df to include only relevant link_flair_text (decision) values
# relevant AITA classes - a**hole, not the a-hole, no a-holes here, everyone sucks, not enough info

submissions_df = submissions_df[submissions_df['link_flair_text'].isin(['Asshole', 'Not the A-hole', 'No A-holes here', 'Everyone Sucks', 'Not enough info'])]
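
As a minimal sketch of what the flair filter does, the example below applies the same `isin` check to a hypothetical toy dataframe (the `toy_submissions` rows are invented for illustration and do not come from the real CSV). It also counts the remaining rows per decision class, which can help confirm that the expected labels survived.

```python
import pandas as pd

# Hypothetical toy data standing in for the real submissions CSV
toy_submissions = pd.DataFrame({
    'id': ['a1', 'a2', 'a3', 'a4'],
    'link_flair_text': ['Asshole', 'META', 'Not the A-hole', None],
})

relevant_flairs = ['Asshole', 'Not the A-hole', 'No A-holes here',
                   'Everyone Sucks', 'Not enough info']

# isin() evaluates to False for missing flairs, so unflaired rows are dropped too
filtered = toy_submissions[toy_submissions['link_flair_text'].isin(relevant_flairs)]

# Count how many submissions remain in each decision class
flair_counts = filtered['link_flair_text'].value_counts()
print(flair_counts)
```

Note that the match is exact and case-sensitive, so a flair spelled differently from the five listed values would be filtered out.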

In [10]:
# rename columns so that they better reflect their data

submissions_df = submissions_df.rename(columns={'id': 'submission_id',
                                      'link_flair_text': 'decision',
                                      'score': 'submission_score',
                                      'title': 'submission_title',
                                      'selftext': 'submission_text',
                                      'url': 'submission_url'})

In [12]:
submissions_df

Unnamed: 0,submission_id,decision,submission_score,submission_title,submission_text,url,created_utc
2,abee17,Asshole,111,AITA for being upset that my fiancee is gettin...,Yeah yeah it's New Year's Eve and that's what ...,https://www.reddit.com/r/AmItheAsshole/comment...,1546312332
3,abei8b,Not the A-hole,446,Aita for not being attracted to certain parts,So this is something that happened a while ago...,https://www.reddit.com/r/AmItheAsshole/comment...,1546313322
6,abg6ud,Not the A-hole,54,WIBTA for eating Taco Bell...,So serious dilemma. Tonight for New Years my b...,https://www.reddit.com/r/AmItheAsshole/comment...,1546327285
7,abgjkp,Not the A-hole,77,AITA for asking friends to look after my pets ...,So I’m going away and needed my stick insects ...,https://www.reddit.com/r/AmItheAsshole/comment...,1546330846
10,abiar4,Not the A-hole,18338,AITA for asking boyfriend why he packed condom...,Sadly this is a serious question.\n\nIn an exc...,https://www.reddit.com/r/AmItheAsshole/comment...,1546350476
...,...,...,...,...,...,...,...
161829,1005595,Not the A-hole,339,AITA for asking to meet my parents at a neutra...,I (37F) am an only child and I'm married (hubb...,https://www.reddit.com/r/AmItheAsshole/comment...,1672525101
161831,1005mzq,Asshole,102,AITA friend refuses to let me sit in her car d...,[deleted],,1672526585
161832,1005sqb,Not the A-hole,54,AITA for insinuating that my wife's singing is...,"That's an insane title, I know. Let me explain...",https://www.reddit.com/r/AmItheAsshole/comment...,1672527066
161834,1005zsn,Not the A-hole,85,AITA for not giving my dad money after my mom’...,"My parents have always had a rough, codependen...",https://www.reddit.com/r/AmItheAsshole/comment...,1672527666


## Creation of AITA comments dataframe

In [15]:
# load comments csv

comments_df = pd.read_csv('new_datasets/top_level_comments_2019_to_2022_at_least_10_comment_score_at_least_50_submission_score.csv')

In [16]:
# strip the t3_ from the link_id column

comments_df['link_id'] = comments_df['link_id'].str.slice(3)
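
`str.slice(3)` removes the first three characters of every value, whether or not they are `t3_`. That is fine when every `link_id` carries the prefix. As a hedged alternative, the sketch below compares it with an anchored regular expression that strips `t3_` only when it is present; the `toy_link_ids` values are hypothetical.

```python
import pandas as pd

# Hypothetical ids: the last one is already missing its 't3_' prefix
toy_link_ids = pd.Series(['t3_abee17', 't3_abei8b', 'abg6ud'])

# Positional slicing, as in the cell above: always drops three characters
sliced = toy_link_ids.str.slice(3)

# Anchored regex: removes 't3_' only when it appears at the start
stripped = toy_link_ids.str.replace(r'^t3_', '', regex=True)

print(sliced.tolist())
print(stripped.tolist())
```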

In [20]:
# rename columns so that they better reflect their data

comments_df = comments_df.rename(columns={'id': 'comment_id',
                                      'score': 'comment_score',
                                      'body': 'comment_text'})

## Merging of AITA submission and comments dataframes

In [24]:
# Create a dataframe of the top 10 comments for each submission

merged_df = submissions_df.merge(comments_df, left_on='submission_id', right_on='link_id') # merge submission and top comments dataframes
merged_df = merged_df.drop('link_id', axis=1) # remove link_id column
top_10_comments = merged_df.groupby('submission_id').apply(lambda x: x.nlargest(10, 'comment_score')['comment_text'].tolist()) # group by submission_id and get the top 10 comments for each submission
top_10_comments_df = pd.DataFrame(top_10_comments.tolist(), index=top_10_comments.index).add_prefix('comment_')
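
The long-to-wide reshaping above can be hard to follow on the full data. The sketch below shows one alternative way to get the same shape (sort, keep the first N rows per group, then pivot) on a hypothetical toy dataframe. It is not the method used in this notebook, and it may break ties between equal scores differently from `nlargest`.

```python
import pandas as pd

# Hypothetical merged data: three comments on s1, one comment on s2
toy_merged = pd.DataFrame({
    'submission_id': ['s1', 's1', 's1', 's2'],
    'comment_score': [15, 40, 22, 11],
    'comment_text': ['c-low', 'c-high', 'c-mid', 'c-only'],
})

top_n = 2  # the notebook uses 10

# Sort by score, then keep the first top_n rows of each submission
ranked = (toy_merged
          .sort_values('comment_score', ascending=False)
          .groupby('submission_id')
          .head(top_n)
          .copy())

# Position within each submission: 0 is the highest-scoring comment
ranked['rank'] = ranked.groupby('submission_id').cumcount()

# One row per submission, one column per rank (comment_0, comment_1, ...)
wide = (ranked
        .pivot(index='submission_id', columns='rank', values='comment_text')
        .add_prefix('comment_'))

print(wide)
```

Submissions with fewer than `top_n` comments get missing values in the trailing columns, which matches the empty cells visible in the final dataframe.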

In [71]:
# Merge submissions_df and top_10_comments_df on submission_id
# Result is a dataframe with both submissions and their top 10 comments

submissions_with_top_10_comments = submissions_df.merge(top_10_comments_df, on='submission_id')
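
Because `merge` defaults to an inner join, submissions without any qualifying comment are dropped silently. The sketch below, on hypothetical toy data, uses a left join with `indicator=True` to list which submissions would be lost. It is an optional audit step, not part of the original pipeline.

```python
import pandas as pd

# Hypothetical submissions; s3 has no qualifying comments
toy_submissions = pd.DataFrame({'submission_id': ['s1', 's2', 's3'],
                                'submission_score': [111, 446, 54]})

# Wide comment table indexed by submission_id, like top_10_comments_df
toy_top_comments = pd.DataFrame(
    {'comment_0': ['YTA', 'NTA']},
    index=pd.Index(['s1', 's2'], name='submission_id'))

# Default inner join: only submissions present on both sides survive
inner = toy_submissions.merge(toy_top_comments, on='submission_id')

# Left join with an indicator column to expose the unmatched submissions
audit = toy_submissions.merge(toy_top_comments, on='submission_id',
                              how='left', indicator=True)
unmatched = audit.loc[audit['_merge'] == 'left_only', 'submission_id'].tolist()

print(len(inner), unmatched)
```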

In [72]:
# Filter out rows whose submission text or highest-scoring comment (comment_0) is deleted, removed, or null

submissions_with_top_10_comments = submissions_with_top_10_comments[(submissions_with_top_10_comments['submission_text'] != '[deleted]') & 
                                                                    (submissions_with_top_10_comments['comment_0'] != '[deleted]') &
                                                                    (submissions_with_top_10_comments['submission_text'] != '[removed]') &
                                                                    (submissions_with_top_10_comments['comment_0'] != '[removed]') &
                                                                    (submissions_with_top_10_comments['submission_text'].notnull()) & 
                                                                    (submissions_with_top_10_comments['comment_0'].notnull())]
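
The same filter can be written more compactly by looping over the checked columns. The sketch below is an equivalent formulation on hypothetical toy data; `placeholders` and `checked_columns` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical rows: only the first one holds real text in both columns
toy = pd.DataFrame({
    'submission_text': ['story', '[deleted]', 'story', None, 'story'],
    'comment_0': ['NTA', 'YTA', '[removed]', 'NAH', None],
})

placeholders = ['[deleted]', '[removed]']
checked_columns = ['submission_text', 'comment_0']

# Start with every row kept, then narrow the mask one column at a time
keep = pd.Series(True, index=toy.index)
for column in checked_columns:
    keep &= toy[column].notna() & ~toy[column].isin(placeholders)

# .copy() returns an independent dataframe rather than a view of toy
cleaned = toy[keep].copy()
print(cleaned)
```

Adding more column names to `checked_columns` would extend the check to lower-ranked comments, at the cost of dropping more submissions.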

In [73]:
# Convert UTC timestamps to datetime

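# assign() returns a new dataframe, which avoids a SettingWithCopyWarning on the filtered frame
submissions_with_top_10_comments = submissions_with_top_10_comments.assign(created_utc=pd.to_datetime(submissions_with_top_10_comments['created_utc'], unit='s'))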
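
As a small worked example, the sketch below converts two `created_utc` values that appear in the submissions table above. With `unit='s'` the result is a timezone-naive datetime that should be read as UTC; passing `utc=True` is an optional variation, not used in this notebook, that makes the timezone explicit.

```python
import pandas as pd

# Two epoch timestamps (seconds) taken from the submissions shown earlier
toy_created_utc = pd.Series([1546312332, 1672527666])

# Timezone-naive result, as produced by the notebook cell
naive = pd.to_datetime(toy_created_utc, unit='s')

# Optional variation: timezone-aware result labeled as UTC
aware = pd.to_datetime(toy_created_utc, unit='s', utc=True)

print(naive.tolist())
print(aware.tolist())
```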


In [74]:
# Rename timestamp and top comment columns for improved clarity

submissions_with_top_10_comments = submissions_with_top_10_comments.rename(columns={'created_utc': 'submission_date',
                                                                                    'comment_0': 'top_comment_1',
                                                                                    'comment_1': 'top_comment_2',
                                                                                    'comment_2': 'top_comment_3',
                                                                                    'comment_3': 'top_comment_4',
                                                                                    'comment_4': 'top_comment_5',
                                                                                    'comment_5': 'top_comment_6',
                                                                                    'comment_6': 'top_comment_7',
                                                                                    'comment_7': 'top_comment_8',
                                                                                    'comment_8': 'top_comment_9',
                                                                                    'comment_9': 'top_comment_10'})

In [75]:
# Remove submission_id column since it isn't important to the dataset

submissions_with_top_10_comments = submissions_with_top_10_comments.drop('submission_id', axis=1)

In [76]:
# Swap the positions of the decision and submission_title columns: exchange their values, then exchange their names

submissions_with_top_10_comments[['decision', 'submission_title']] = submissions_with_top_10_comments[['submission_title', 'decision']]
submissions_with_top_10_comments = submissions_with_top_10_comments.rename(columns={'decision': 'submission_title', 'submission_title': 'decision'})


In [77]:
# Swap the positions of the submission_score and submission_text columns: exchange their values, then exchange their names

submissions_with_top_10_comments[['submission_score', 'submission_text']] = submissions_with_top_10_comments[['submission_text', 'submission_score']]
submissions_with_top_10_comments = submissions_with_top_10_comments.rename(columns={'submission_score': 'submission_text', 'submission_text': 'submission_score'})

In [78]:
submissions_with_top_10_comments

Unnamed: 0,submission_title,submission_text,decision,submission_score,submission_url,submission_date,top_comment_1,top_comment_2,top_comment_3,top_comment_4,top_comment_5,top_comment_6,top_comment_7,top_comment_8,top_comment_9,top_comment_10
0,AITA for being upset that my fiancee is gettin...,Yeah yeah it's New Year's Eve and that's what ...,Asshole,111,https://www.reddit.com/r/AmItheAsshole/comment...,2019-01-01 03:12:12,"YTA, she's drinking in the safest environment ...",YTA and sound controlling. You say 'she knows ...,YTA. She's drinking with her mom and you feel ...,YTA. sounds like you are being pretty control...,YTA- You sound like a lunatic and getting enga...,YTA. Stop being a little Bitch,"YTA. Let her do what she wants, she isn’t hurt...",YTA. You are trying to force her not to do som...,[deleted],I think you are losing the certainty that you ...
1,Aita for not being attracted to certain parts,So this is something that happened a while ago...,Not the A-hole,446,https://www.reddit.com/r/AmItheAsshole/comment...,2019-01-01 03:28:42,NTA. You can’t change who you are and honestly...,NTA it's not your responsibility to start eati...,NTA at all. We each have our own interests and...,"NTA. If the SO wants to turn male, and you're ...",[deleted],NTA\n\nyou're not transphobic. you're straight...,NTA and definitely not transphobic.,"SHP and you know it. No, you can't be blamed f...","Nta, it goes without saying that you can’t hel...",NTA. They’re being sexist for not accepting yo...
2,WIBTA for eating Taco Bell...,So serious dilemma. Tonight for New Years my b...,Not the A-hole,54,https://www.reddit.com/r/AmItheAsshole/comment...,2019-01-01 07:21:25,Eat the taco bell,"NTA eat that shit, bitch!",I’m impressed the two of you managed to buy $2...,"If he’s passed out it’s fine, besides if he go...",NTA. Eat it. It’s gonna be gross if you save...,"Nta , eat the food. You dont need to stay with...",NTA. Taco Bell has a very short shelf life. Ea...,NTA but I wouldn't eat Taco Bell in the same b...,NTA - you earned it lol,&gt; I need to stay in the bedroom with him ju...
3,AITA for asking friends to look after my pets ...,So I’m going away and needed my stick insects ...,Not the A-hole,77,https://www.reddit.com/r/AmItheAsshole/comment...,2019-01-01 08:20:46,NTA. Put it in terms of you wanting to be cons...,"NTA, if I was her I'd be glad you didn't even ...","I read the title differently, that you asked h...",NAH. I can see why she might be upset since sh...,,,,,,
4,AITA for asking boyfriend why he packed condom...,Sadly this is a serious question.\n\nIn an exc...,Not the A-hole,18338,https://www.reddit.com/r/AmItheAsshole/comment...,2019-01-01 13:47:56,NTA and you're completely correct to be concer...,"NTA. He packed the condoms, and if it was a ge...",NTA he is preparing for a “maybe”. I’ve been m...,If he got mad and overly defensive then he was...,"OP, it sounds like he’s gaslighting you. Inten...",NTA. I'm 99% sure he intended to cheat on you ...,NTA. What the fuckity fuck fuck? He's throwi...,Ask him if you can borrow some of his condoms ...,NTA. He should understand why you'd want to kn...,NTA Are these toiletries like prepackaged? So ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
122514,AITA for uninviting my sister's husband from m...,"For context, my family hosts a new years party...",Asshole,290,https://www.reddit.com/r/AmItheAsshole/comment...,2022-12-31 22:11:01,YTA. I agree with your sister. If you weren’t ...,NTA. No idea why people are saying you're TA ...,Gentle YTA. \n\nWhile it is your house and no ...,NTA\n\nIt's your house and you decide who to i...,NTA\n\nHe is a drunk. \n\nHe shit himself last...,I was gonna Y T A this until I read about the ...,NTA. You have to drink quite a bit to be at th...,INFO: Has what happened with Brad been a regul...,Are you wrong for not inviting a possibly drun...,
122515,AITA for asking to meet my parents at a neutra...,I (37F) am an only child and I'm married (hubb...,Not the A-hole,339,https://www.reddit.com/r/AmItheAsshole/comment...,2022-12-31 22:18:21,Your dad's abusive and unrepentant. You absolu...,&gt;My mom has said that they're both worried ...,"lol, I love the way he took you aside to loom ...",NTA. They should be worried about you keeping...,I would forego allowing them to participate pe...,NTA. Protect your kid from him. What he did wa...,"NTA, people who abused their children will hap...",NTA. And feel free to tell your mother that th...,NTA. You don't need counselling: you handled t...,NTA \n\nAnd any adult who makes animal noises ...
122517,AITA for insinuating that my wife's singing is...,"That's an insane title, I know. Let me explain...",Not the A-hole,54,https://www.reddit.com/r/AmItheAsshole/comment...,2022-12-31 22:51:06,Try explaining it to your wife that she’s doin...,NTA. I think your wife needs a therapist or a ...,NTA\n\nCould you have handled the situation a ...,NTA. Her spouse has calmly raised concerns abo...,NTA. You don't welcome someone with something ...,"NTA, I appreciate your wife only has good inte...",,,,
122518,AITA for not giving my dad money after my mom’...,"My parents have always had a rough, codependen...",Not the A-hole,85,https://www.reddit.com/r/AmItheAsshole/comment...,2022-12-31 23:01:06,NTA- The trust is in your name and meant for y...,"NTA, hot tip, I bet your Dad contributed nothi...",NTA. This is so laughable. “Do the right thing...,NTA.\n\nYour dad doesn’t understand how a trus...,NTA. The trust was established for your benefi...,"No, no, no. Your mother set this up the way sh...","NTA. If you mum wanted him to have it, she wou...",,,


### Saving to output CSV and ZST
- This output is treated as the "raw" version of the dataset

In [79]:
# save the dataframe as a csv
output_file = '2019_to_2022_submissions_at_least_50_score_top_10_comments.csv'
submissions_with_top_10_comments.to_csv(output_file, index=False)

In [80]:
# compress CSV file to ZST format and save it

input_file = '2019_to_2022_submissions_at_least_50_score_top_10_comments.csv'
output_file = '2019_to_2022_submissions_at_least_50_score_top_10_comments.zst'

with open(input_file, 'rb') as f_in, open(output_file, 'wb') as f_out:
    cctx = zstd.ZstdCompressor() # Create a zstd compressor
    cctx.copy_stream(f_in, f_out) # Stream-compress the CSV and write the result to the output file

print('File compressed successfully.')

File compressed successfully.
