# Data Collection

[View this notebook in nbviewer](https://nbviewer.org/github/Data-Science-for-Linguists-2023/AITA-Blame-Analysis/blob/main/code/data_collection.ipynb)

## Table of Contents

1. [Scraping from Pushshift API](#Scraping-from-Pushshift-API)
2. [Narrowing Down Data Set](#Narrowing-Down-Data-Set)

## Scraping from Pushshift API

This project uses both [PRAW](https://praw.readthedocs.io/en/latest/index.html) and [PMAW](https://github.com/mattpodolak/pmaw) to scrape submission data from [r/AmITheAsshole](https://www.reddit.com/r/AmItheAsshole/). PRAW is a wrapper for the Reddit API. Through Reddit's built-in API, however, you cannot query data past a certain time limit and can only query 1,000 posts at a time. The third-party Pushshift API allows you to query older data in larger quantities, and PMAW is a wrapper for it.

In [1]:
import praw
from pmaw import PushshiftAPI
import pandas as pd
import numpy as np
import datetime as dt

In [2]:
# user_info.txt holds the client ID, client secret, and user agent, one per line
with open("../user_info.txt", "r") as text:
    client_info = [line.strip("\n") for line in text]
reddit = praw.Reddit(client_id=client_info[0],
                     client_secret=client_info[1],
                     user_agent=client_info[2])

api = PushshiftAPI(praw=reddit, file_checkpoint=10)

Version 7.6.1 of praw is outdated. Version 7.7.0 was released Saturday February 25, 2023.


From each post, I will be saving the poster's username, the post title, the text, the number of comments, the number of upvotes, the upvote ratio (the share of all votes that are upvotes; the Reddit API has removed access to the exact number of downvotes), and any post flairs. On this subreddit, flairs record the final verdict: "Asshole", "Not the A-hole", "No A-holes here", or "Everyone Sucks". Further information on how this subreddit categorizes posts can be found in their [FAQ](https://www.reddit.com/r/AmItheAsshole/wiki/faq/#wiki_acronyms). Depending on how far into the voting the post was scraped and saved to the Pushshift API, a post may have no flair even if, in reality, a verdict has since been reached.

Furthermore, any attempt to scrape more than 1,000 posts at a time will scrape all available posts. Pushshift is also undergoing a migration and does not yet have any data from before November 2022 ready. Lastly, this data set needs only posts that contain flair info, but PMAW does not provide an option to search only for posts with flairs.

Because of all this, I will run one query of up to 1,000 posts for each day from November 15, 2022 through March 15, 2023. Then I will remove any posts that do not serve this project: posts that don't have flairs, deleted posts, etc.

In [3]:
before = dt.datetime(2023, 3, 15, 0, 0)
after = dt.datetime(2022, 11, 15, 0, 0)
delta = dt.timedelta(days=1)
submissions_list = []
# Each iteration queries the range from `after` to `before`, then moves `after` forward one day
while before >= after:
    submissions = api.search_submissions(subreddit="AmItheAsshole", until=int(before.timestamp()), since=int(after.timestamp()), limit=1000, mem_safe=True)
    submissions_list.extend([sub for sub in submissions])
    after += delta
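
Because `before` stays fixed in the loop above, each query's range contains the next one, so the same post can be returned more than once. If the goal is one separate window per day, a minimal sketch could look like the following. `daily_windows` is a hypothetical helper, not part of PMAW, and the use of UTC is an assumption (the notebook uses naive local datetimes).

```python
import datetime as dt


def daily_windows(start, end):
    """Yield (since, until) epoch-second pairs, one per day in [start, end)."""
    day = dt.timedelta(days=1)
    current = start
    while current < end:
        nxt = min(current + day, end)
        yield int(current.timestamp()), int(nxt.timestamp())
        current = nxt


# Same date range as the notebook, but pinned to UTC (an assumption)
start = dt.datetime(2022, 11, 15, tzinfo=dt.timezone.utc)
end = dt.datetime(2023, 3, 15, tzinfo=dt.timezone.utc)
windows = list(daily_windows(start, end))
print(len(windows))  # 120 non-overlapping one-day windows
```

Each pair could then be passed as `since` and `until` to `api.search_submissions`, in place of the fixed `before` value.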


In [4]:
print(len(submissions_list))

119872


We're starting off with a corpus of almost 120,000 posts! Enjoy it while it lasts...

In [5]:
aita_df = pd.DataFrame(submissions_list)
aita_df.head()

Unnamed: 0,comment_limit,comment_sort,_reddit,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,...,is_video,_fetched,_comments_by_id,post_hint,preview,author_cakeday,crosspost_parent_list,url_overridden_by_dest,crosspost_parent,live_audio
0,2048,confidence,<praw.reddit.Reddit object at 0x0000017DE539F640>,,AmItheAsshole,"I honestly thought what I was doing was fine, ...",t2_13wsc0hc,False,,0,...,False,False,{},,,,,,,
1,2048,confidence,<praw.reddit.Reddit object at 0x0000017DE539F640>,,AmItheAsshole,[removed],t2_uktpkuf3,False,,0,...,False,False,{},,,,,,,
2,2048,confidence,<praw.reddit.Reddit object at 0x0000017DE539F640>,,AmItheAsshole,My boyfriend was FaceTiming his cousin and his...,,False,,0,...,False,False,{},,,,,,,
3,2048,confidence,<praw.reddit.Reddit object at 0x0000017DE539F640>,,AmItheAsshole,[removed],t2_7aji4,False,,0,...,False,False,{},,,,,,,
4,2048,confidence,<praw.reddit.Reddit object at 0x0000017DE539F640>,,AmItheAsshole,[removed],,False,,0,...,False,False,{},,,,,,,


## Narrowing Down Data Set

First, I'm gonna narrow the DataFrame down to just the information I want. This includes the author, the post title, the post's text, the post's flairs, its number of comments, its number of upvotes, and its upvote ratio.

In [6]:
# (I recognize now that creating a whole new variable is absolutely destroying my memory but I'm not gonna rerun this to fix it :P Those are precious hours of my life)
cleaned_df = aita_df[["author", "title", "selftext", "link_flair_text", "num_comments", "score", "upvote_ratio"]]
cleaned_df.head(10)

Unnamed: 0,author,title,selftext,link_flair_text,num_comments,score,upvote_ratio
0,PomegranateJellyfish,AITA for sleeping during the day?,"I honestly thought what I was doing was fine, ...",Not the A-hole,34,7,0.82
1,Anth00013,AITA for screaming at my dad?,[removed],,5,2,1.0
2,,AITAH for refusing to change out of my semi-se...,My boyfriend was FaceTiming his cousin and his...,Not the A-hole,23,8,0.78
3,Yamochao,AITA for not wanting to live with a jew?,[removed],,1,1,1.0
4,,[deleted by user],[removed],,1,1,1.0
5,HoneyCornflaakes,AITA for hiding my diarrhoea in my husbands cl...,[removed],,1,1,1.0
6,JamesPildis,AITA for not helping my neighbor?,I (M26) live in a large apartment complex with...,Not the A-hole,253,1016,0.97
7,,[deleted by user],[removed],,1,1,1.0
8,,[deleted by user],[removed],,42,119,0.93
9,pennyforyouraccount,AITA for threatening to lock my housemate's wi...,I'll keep it short; my housemate moved in arou...,Everyone Sucks,34,4,0.76


Posts with "[removed]" or "[deleted]" as their text are not useful in this project because we can't access the original text. Let's drop all of those posts.

In [7]:
idx = cleaned_df[cleaned_df["selftext"].isin(["[removed]", "[deleted]"])].index
cleaned_df = cleaned_df.drop(idx)
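
The same filter can also be written as a boolean mask, which skips the intermediate index. This is a small sketch on made-up rows (the titles and texts below are illustrative, not taken from the corpus):

```python
import pandas as pd

# Toy data standing in for the scraped posts
posts = pd.DataFrame({
    "title": ["AITA for A?", "AITA for B?", "[deleted by user]", "AITA for C?"],
    "selftext": ["Some story...", "[removed]", "[deleted]", "Another story..."],
})

# Keep only rows whose text is not one of the placeholder strings
placeholders = ["[removed]", "[deleted]"]
kept = posts[~posts["selftext"].isin(placeholders)]
print(kept.index.tolist())  # [0, 3]
```

As with `drop`, the original index labels are preserved, which is why the index in the output below has gaps.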

In [8]:
cleaned_df.head(10)

Unnamed: 0,author,title,selftext,link_flair_text,num_comments,score,upvote_ratio
0,PomegranateJellyfish,AITA for sleeping during the day?,"I honestly thought what I was doing was fine, ...",Not the A-hole,34,7,0.82
2,,AITAH for refusing to change out of my semi-se...,My boyfriend was FaceTiming his cousin and his...,Not the A-hole,23,8,0.78
6,JamesPildis,AITA for not helping my neighbor?,I (M26) live in a large apartment complex with...,Not the A-hole,253,1016,0.97
9,pennyforyouraccount,AITA for threatening to lock my housemate's wi...,I'll keep it short; my housemate moved in arou...,Everyone Sucks,34,4,0.76
25,Strange-andunusua_l,AITA for allowing my bio dad and his wife to b...,I found out that I am pregnant about 5 1/2 mon...,Not the A-hole,62,46,0.88
29,gingerfinland,WIBTA for confronting my dad about skipping Ch...,My (31F) parents (55F) and (63M) have been sep...,Not the A-hole,20,3,0.81
39,mapotofu66,WIBTA if on a day trip with friends I opt out ...,I (26F) live in the US and my friend from anot...,Not the A-hole,10,8,1.0
41,HomocusPocus,WIBTA for not celebrating Christmas with my bo...,**edit #1**: Nancy is hosting Christmas dinner...,Not the A-hole,12,3,0.8
42,Tasty_Garden_6766,AITA for being annoyed that my friend is visit...,I (38f) have two friends from university: Sue ...,Asshole,27,3,0.58
46,babybarbz,AITA for getting upset when my roommate’s part...,Keeping this anon cause I’m not trying to put ...,Not the A-hole,277,2914,0.97


In [9]:
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21003 entries, 0 to 119870
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   author           20650 non-null  object 
 1   title            21003 non-null  object 
 2   selftext         21003 non-null  object 
 3   link_flair_text  20728 non-null  object 
 4   num_comments     21003 non-null  int64  
 5   score            21003 non-null  int64  
 6   upvote_ratio     21003 non-null  float64
dtypes: float64(1), int64(2), object(4)
memory usage: 1.3+ MB


Immediately that removes almost 100,000 posts!

What are null authors?

In [10]:
cleaned_df[cleaned_df['author'].isna()].info()
cleaned_df[cleaned_df['author'].isna()]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 353 entries, 2 to 119399
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   author           0 non-null      object 
 1   title            353 non-null    object 
 2   selftext         353 non-null    object 
 3   link_flair_text  349 non-null    object 
 4   num_comments     353 non-null    int64  
 5   score            353 non-null    int64  
 6   upvote_ratio     353 non-null    float64
dtypes: float64(1), int64(2), object(4)
memory usage: 22.1+ KB


Unnamed: 0,author,title,selftext,link_flair_text,num_comments,score,upvote_ratio
2,,AITAH for refusing to change out of my semi-se...,My boyfriend was FaceTiming his cousin and his...,Not the A-hole,23,8,0.78
78,,AITA for acting annoyed that my partner left h...,My long term live in bf and I have an agreemen...,Not enough info,31,0,0.50
190,,AITA for asking to use my wife's car?,So I live in an area that very rarely gets any...,Not the A-hole,245,323,0.93
217,,AITA for accusing my friend of stealing who ha...,I have a box with 2k cash saved up that I hid ...,Asshole,33,15,0.80
230,,AITA for not allowing my best friend to wear m...,I female ) have a child hood best friend of ov...,Not the A-hole,968,6364,0.97
...,...,...,...,...,...,...,...
115842,,WIBTA for forcing my daughter to withdraw from...,My daughter is 18 and graduated from high scho...,Not the A-hole,51,44,0.90
116880,,AITA for wanting him to contribute money to th...,"I've (f19) known the guy in question, I'll cal...",Not the A-hole,42,7,0.82
118211,,WIBTA if I told my new friend I met her BF on ...,A few years ago I was on Tinder and had my soc...,Asshole,8,5,0.67
119389,,AITA For Talking About Bruno,I’ll admit the title is a little click-baity (...,Not the A-hole,122,152,0.76


I presume these are posts that still exist but whose users have deactivated their accounts. That's alright though, because my next step is to replace all usernames with a stand-in number. The nature of this subreddit is that people share stories in which some fault has occurred, and many people opt to make a new account for the sole purpose of posting here because of the shame that might surround the story. So, out of respect for the nature of the subreddit, I will remove all the usernames from the data set. If someone has several posts in the corpus, they will be identified by the same number. Posts with no author all share the ID 0.

In [11]:
cleaned_df["author"] = pd.factorize(cleaned_df.author)[0] + 1
cleaned_df.head()

Unnamed: 0,author,title,selftext,link_flair_text,num_comments,score,upvote_ratio
0,1,AITA for sleeping during the day?,"I honestly thought what I was doing was fine, ...",Not the A-hole,34,7,0.82
2,0,AITAH for refusing to change out of my semi-se...,My boyfriend was FaceTiming his cousin and his...,Not the A-hole,23,8,0.78
6,2,AITA for not helping my neighbor?,I (M26) live in a large apartment complex with...,Not the A-hole,253,1016,0.97
9,3,AITA for threatening to lock my housemate's wi...,I'll keep it short; my housemate moved in arou...,Everyone Sucks,34,4,0.76
25,4,AITA for allowing my bio dad and his wife to b...,I found out that I am pregnant about 5 1/2 mon...,Not the A-hole,62,46,0.88
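
The 0 in the second row comes from how `pd.factorize` treats missing values: by default it assigns them the code -1, so adding 1 maps every missing author to 0. A small sketch with made-up usernames shows the behavior:

```python
import pandas as pd

# Made-up usernames; None stands in for a missing author
authors = pd.Series(["alice", None, "bob", "alice", None])

codes, uniques = pd.factorize(authors)
author_ids = codes + 1  # shift so that real authors start at 1

print(author_ids.tolist())  # [1, 0, 2, 1, 0]
print(list(uniques))        # ['alice', 'bob']
```

This means ID 0 does not identify a single person; it groups every post whose author is missing.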


Let's see how many duplicate posts there are. I presume most of these are cases where a verdict has been added or the original poster has added an edit to the post, so we'll keep whatever the last version is.

In [12]:
duplicates_removed = cleaned_df.drop_duplicates(subset=["title", "selftext", "link_flair_text"], keep="last", inplace=False)
duplicates_removed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9901 entries, 0 to 119870
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   author           9901 non-null   int64  
 1   title            9901 non-null   object 
 2   selftext         9901 non-null   object 
 3   link_flair_text  9776 non-null   object 
 4   num_comments     9901 non-null   int64  
 5   score            9901 non-null   int64  
 6   upvote_ratio     9901 non-null   float64
dtypes: float64(1), int64(3), object(3)
memory usage: 618.8+ KB
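
One thing to keep in mind is that `drop_duplicates` only merges rows that match on every column in `subset`. The toy example below (made-up rows) compares the subset used above with a narrower one. The narrower subset is shown only as a possible alternative, not as what this notebook does.

```python
import pandas as pd

# Made-up rows: the same post scraped three times, once before a verdict was added
posts = pd.DataFrame({
    "title": ["AITA for X?", "AITA for X?", "AITA for X?", "AITA for Y?"],
    "selftext": ["story x", "story x", "story x", "story y"],
    "link_flair_text": [None, "Asshole", "Asshole", "Not the A-hole"],
})

# Subset used in the notebook: the unflaired copy counts as a different row
by_all = posts.drop_duplicates(subset=["title", "selftext", "link_flair_text"], keep="last")
# Alternative: ignore the flair when deciding what is a duplicate
by_text = posts.drop_duplicates(subset=["title", "selftext"], keep="last")

print(by_all.index.tolist())   # [0, 2, 3]
print(by_text.index.tolist())  # [2, 3]
```

With the flair in the subset, an unflaired copy of a post survives this step and is only removed later, when posts without a verdict flair are dropped.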


Over half the remaining corpus is removed! Next, let's make sure that the only posts remaining are posts with flairs that actually contribute to the analysis.

In [13]:
cleaned_df.link_flair_text.unique()

array(['Not the A-hole', 'Everyone Sucks', 'Asshole', 'Not enough info',
       'No A-holes here', None, 'TL;DR', 'Best of 2022', 'UPDATE', '',
       'Upcoming Talk!!!', 'META', 'Open Forum'], dtype=object)

In [25]:
cleaned_df = duplicates_removed
idx = cleaned_df[cleaned_df["link_flair_text"].isin(["None", "", "TL;DR", "Not enough info", "UPDATE", "Best of 2022", "Open Forum", "Upcoming Talk!!!", "META"])].index
cleaned_df = cleaned_df.drop(idx)
cleaned_df.dropna(subset=["link_flair_text"], inplace=True)
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9414 entries, 0 to 119870
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   author           9414 non-null   int64  
 1   title            9414 non-null   object 
 2   selftext         9414 non-null   object 
 3   link_flair_text  9414 non-null   object 
 4   num_comments     9414 non-null   int64  
 5   score            9414 non-null   int64  
 6   upvote_ratio     9414 non-null   float64
dtypes: float64(1), int64(3), object(3)
memory usage: 588.4+ KB
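
An alternative to listing every unwanted flair is to keep only the verdict flairs. The sketch below uses the four verdict labels from the `unique()` output above on made-up rows. It should give the same result as the cell above as long as no other flair values appear in the data.

```python
import pandas as pd

# The four verdict flairs seen in the unique() output
VERDICTS = ["Asshole", "Not the A-hole", "No A-holes here", "Everyone Sucks"]

# Made-up rows covering verdicts, non-verdict flairs, empty strings, and missing values
posts = pd.DataFrame({
    "link_flair_text": ["Asshole", None, "META", "", "Everyone Sucks", "Not enough info"],
})

# isin() is False for missing values, so no separate dropna() is needed
verdict_only = posts[posts["link_flair_text"].isin(VERDICTS)]
print(verdict_only.index.tolist())  # [0, 4]
```

An allow-list like this also keeps out any new flair the subreddit introduces later, which a block-list would let through.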


We started with almost 120,000 posts and ended with 9,414...

Lastly, I'm going to rename each column just to tidy up and better reflect the contents of the column.

In [26]:
cleaned_df = cleaned_df.rename({"author": "AuthorID", "title": "Title", "selftext": "Text", "link_flair_text": "Ruling",
                               "num_comments": "CommentCount", "score": "Score", "upvote_ratio": "UpvoteRatio"}, axis='columns')

In [27]:
cleaned_df.head()

Unnamed: 0,AuthorID,Title,Text,Ruling,CommentCount,Score,UpvoteRatio
0,1,AITA for sleeping during the day?,"I honestly thought what I was doing was fine, ...",Not the A-hole,34,7,0.82
2,0,AITAH for refusing to change out of my semi-se...,My boyfriend was FaceTiming his cousin and his...,Not the A-hole,23,8,0.78
6,2,AITA for not helping my neighbor?,I (M26) live in a large apartment complex with...,Not the A-hole,253,1016,0.97
9,3,AITA for threatening to lock my housemate's wi...,I'll keep it short; my housemate moved in arou...,Everyone Sucks,34,4,0.76
25,4,AITA for allowing my bio dad and his wife to b...,I found out that I am pregnant about 5 1/2 mon...,Not the A-hole,62,46,0.88


Done! Let's save this data to a CSV to use later.

In [28]:
cleaned_df.to_csv("../data/aita_data.csv")
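
Since `to_csv` writes the index as an unnamed first column by default, reading the file back without `index_col=0` produces an extra `Unnamed: 0` column. A small round-trip sketch, using an in-memory buffer and made-up rows instead of the real file:

```python
import io

import pandas as pd

# Made-up rows with a gappy index, like the cleaned corpus
df = pd.DataFrame(
    {"AuthorID": [1, 0], "Ruling": ["Not the A-hole", "Asshole"]},
    index=[0, 2],
)

csv_text = df.to_csv()  # with no path, to_csv returns the CSV as a string

naive = pd.read_csv(io.StringIO(csv_text))
reloaded = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(naive.columns.tolist())     # ['Unnamed: 0', 'AuthorID', 'Ruling']
print(reloaded.columns.tolist())  # ['AuthorID', 'Ruling']
```

When loading `../data/aita_data.csv` later, passing `index_col=0` to `pd.read_csv` restores the original index in the same way.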