# Initial Data Collection

[View this notebook in nbviewer](https://nbviewer.org/github/Data-Science-for-Linguists-2023/AITA-Blame-Analysis/blob/main/code/data_collection_testing.ipynb)

This project uses both [PRAW](https://praw.readthedocs.io/en/latest/index.html) and [PMAW](https://github.com/mattpodolak/pmaw) to scrape submission data from [r/AmITheAsshole](https://www.reddit.com/r/AmItheAsshole/). PRAW is a wrapper for Reddit's built-in API, which cannot return posts older than a certain cutoff and returns at most 1000 posts per query. The third-party Pushshift API allows queries over older and larger sets of data, and PMAW is a wrapper for it.

In [1]:
import praw
from pmaw import PushshiftAPI
import pandas as pd
import numpy as np
import pickle
import datetime as dt

In [2]:
# user_info.txt holds the client ID, client secret, and user agent, one per line
with open("../user_info.txt", "r") as text:
    client_info = [line.strip("\n") for line in text]
reddit = praw.Reddit(client_id=client_info[0],
                     client_secret=client_info[1],
                     user_agent=client_info[2])

api = PushshiftAPI(praw=reddit, file_checkpoint=10)
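
The cell above relies on `user_info.txt` holding the client ID, client secret, and user agent on its first three lines, in that order. As a minimal sketch of making that assumption explicit, the hypothetical helper `load_client_info` below (not part of the original notebook) reads the file and fails early if a line is missing. Unlike the cell above, it also strips surrounding whitespace and skips blank lines.

```python
from pathlib import Path

# Hypothetical helper, not part of the original notebook.
# Assumes the file layout used above: client ID, client secret, user agent,
# one per line, in that order.
FIELDS = ("client_id", "client_secret", "user_agent")

def load_client_info(path):
    """Return the first three non-empty lines of `path` as praw.Reddit keyword arguments."""
    lines = [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]
    if len(lines) < len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} lines in {path}, found {len(lines)}")
    return dict(zip(FIELDS, lines))
```

The result could then be passed along as `praw.Reddit(**load_client_info("../user_info.txt"))`.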

From each post, I will be saving the poster's username, the post title, the text, the number of comments, the number of upvotes (the score), the upvote ratio (the share of all votes that are upvotes, since the Reddit API no longer exposes the exact number of downvotes), and any post flairs. On this subreddit, flairs record the final verdict on a post: "Asshole", "Not the A-hole", "No a-holes here", or "Everyone Sucks." Further information on how this subreddit categorizes posts can be found in their [FAQ](https://www.reddit.com/r/AmItheAsshole/wiki/faq/#wiki_acronyms).
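
For later analysis, the verdict flairs described above can be mapped to the subreddit's acronyms. The sketch below is an assumption about how that normalization might look, not code from this notebook: `FLAIR_TO_VERDICT` and `normalize_flair` are hypothetical names, and the flair spellings are the four listed above, matched case-insensitively.

```python
# Hypothetical mapping from flair text to the subreddit's verdict acronyms.
FLAIR_TO_VERDICT = {
    "asshole": "YTA",
    "not the a-hole": "NTA",
    "no a-holes here": "NAH",
    "everyone sucks": "ESH",
}

def normalize_flair(flair):
    """Return the verdict acronym for a flair, or None if it is missing or unrecognized."""
    if not isinstance(flair, str):
        return None  # covers None and NaN (a float) coming out of pandas
    return FLAIR_TO_VERDICT.get(flair.strip().lower())

print(normalize_flair("Not the A-hole"))  # NTA
print(normalize_flair(None))              # None
```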

Because scraping posts will take a considerable amount of time and I intend to end up with a corpus of over 10,000 posts, I will be using this notebook to test the organization and cleaning process before running it on a larger number of posts. For now, the notebook that processes the entire corpus and its resulting .csv are listed in .gitignore, and so kept out of the repository, until I'm sure there's nothing present that needs to be omitted.

Pushshift is also undergoing a migration and does not yet have any data from before November 2022 ready. Because of this, I'll only be using data from 2023.

In [3]:
before = int(dt.datetime(2023, 1, 7, 0, 0).timestamp())
after = int(dt.datetime(2023, 1, 1, 0, 0).timestamp())

submissions = api.search_submissions(subreddit="AmItheAsshole", until=before, since=after, limit=1000, mem_safe=True)
print(len(submissions))

1000
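
Note that `datetime.timestamp()` on a naive `datetime` interprets it in the machine's local time zone, so the window above shifts depending on where the notebook runs. Below is a minimal sketch of pinning the same window to UTC, assuming UTC boundaries are what is wanted (the notebook does not say):

```python
import datetime as dt

# Assumption: the collection window should be bounded in UTC.
after = int(dt.datetime(2023, 1, 1, tzinfo=dt.timezone.utc).timestamp())
before = int(dt.datetime(2023, 1, 7, tzinfo=dt.timezone.utc).timestamp())

print(after)           # 1672531200
print(before - after)  # 518400 seconds, i.e. six days
```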


In [4]:
submissions_list = list(submissions)
aita_df = pd.DataFrame(submissions_list)
aita_df.head()

|  | subreddit | selftext | author_fullname | gilded | title | link_flair_richtext | subreddit_name_prefixed | hidden | pwls | link_flair_css_class | ... | num_crossposts | media | is_video | retrieved_utc | updated_utc | utc_datetime_str | link_flair_template_id | author_cakeday | post_hint | preview |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | AmItheAsshole | [removed] | t2_a8ezj20c | 0 | AITA for ghosting my boyfriend after he forgot... | [] | r/AmItheAsshole | False | 6 |  | ... | 0 |  | False | 1672704565 | 1672704565 | 2023-01-03 00:09:08 |  |  |  |  |
| 1 | AmItheAsshole | [removed] | t2_3mmw1drt | 0 | WIBTA for removing her dog from her: My 70 yo ... | [] | r/AmItheAsshole | False | 6 |  | ... | 0 |  | False | 1672704565 | 1672704565 | 2023-01-03 00:09:06 |  |  |  |  |
| 2 | AmItheAsshole | [removed] | t2_vdf7eb5l | 0 | AITA for being angry cause my gf wouldn't trav... | [] | r/AmItheAsshole | False | 6 |  | ... | 0 |  | False | 1672704541 | 1672704542 | 2023-01-03 00:08:43 |  |  |  |  |
| 3 | AmItheAsshole | My wife and I have two beautiful small childre... | t2_vdff0kz7 | 0 | AITA: Childcare duty split | [] | r/AmItheAsshole | False | 6 |  | ... | 0 |  | False | 1672704533 | 1672704533 | 2023-01-03 00:08:42 |  |  |  |  |
| 4 | AmItheAsshole | To make this short and sweet I used to run an ... | t2_9n9ku4uf | 0 | WIBTA if I decided to sue my friend? | [] | r/AmItheAsshole | False | 6 |  | ... | 0 |  | False | 1672704468 | 1672704468 | 2023-01-03 00:07:36 |  |  |  |  |


In [5]:
aita_df.columns

Index(['subreddit', 'selftext', 'author_fullname', 'gilded', 'title',
       'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls',
       'link_flair_css_class', 'thumbnail_height', 'top_awarded_type',
       'hide_score', 'quarantine', 'link_flair_text_color', 'upvote_ratio',
       'author_flair_background_color', 'subreddit_type',
       'total_awards_received', 'media_embed', 'thumbnail_width',
       'author_flair_template_id', 'is_original_content', 'secure_media',
       'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed',
       'link_flair_text', 'score', 'is_created_from_ads_ui', 'author_premium',
       'thumbnail', 'edited', 'author_flair_css_class',
       'author_flair_richtext', 'gildings', 'content_categories', 'is_self',
       'link_flair_type', 'wls', 'removed_by_category', 'author_flair_type',
       'domain', 'allow_live_comments', 'suggested_sort', 'view_count',
       'archived', 'no_follow', 'is_crosspostable', 'pinned', 'over_18'

In [6]:
cleaned_df = aita_df[["author", "title", "selftext", "link_flair_text", "num_comments", "score", "upvote_ratio"]]
cleaned_df.head(10)

|  | author | title | selftext | link_flair_text | num_comments | score | upvote_ratio |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | xexxyn | AITA for ghosting my boyfriend after he forgot... | [removed] |  | 1 | 1 | 1.0 |
| 1 | Weshnon | WIBTA for removing her dog from her: My 70 yo ... | [removed] |  | 1 | 1 | 1.0 |
| 2 | ThrowRAtindon | AITA for being angry cause my gf wouldn't trav... | [removed] |  | 1 | 1 | 1.0 |
| 3 | Ahi209 | AITA: Childcare duty split | My wife and I have two beautiful small childre... |  | 1 | 1 | 1.0 |
| 4 | No_Accident_1469 | WIBTA if I decided to sue my friend? | To make this short and sweet I used to run an ... |  | 1 | 1 | 1.0 |
| 5 | BalloonBoy14 | AITA for standing up against someone who enter... | [removed] |  | 1 | 1 | 1.0 |
| 6 | Lawpoc | AITA If my gf wants to take a break with hopes... | [removed] |  | 1 | 1 | 1.0 |
| 7 | xexxyn | AITA for being mad at my boyfriend after he fo... | [removed] |  | 1 | 1 | 1.0 |
| 8 | Cricket_Spickets_72 | AITA for quitting my job over a lack of water ... | [removed] |  | 1 | 1 | 1.0 |
| 9 | fenris52223 | AITA for yelling at her? Wife is freezing me out. | [removed] |  | 1 | 1 | 1.0 |


Posts whose text is "[removed]" are not useful for this project because the original text is no longer accessible. Let's drop all of those posts.

In [7]:
# Keep only the posts whose text was not removed
cleaned_df = cleaned_df[cleaned_df["selftext"] != "[removed]"]
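
The same kind of filter can be extended to other placeholder values with a single boolean mask. The sketch below uses a small made-up DataFrame rather than the scraped data. Treating "[deleted]" and missing text as unusable is an added assumption, since this notebook only drops "[removed]".

```python
import pandas as pd

# Toy data for illustration only, not taken from the scraped posts.
posts = pd.DataFrame({
    "title": ["post a", "post b", "post c", "post d"],
    "selftext": ["[removed]", "Some actual text", "[deleted]", None],
})

# Assumption: "[deleted]" and missing text are as unusable as "[removed]".
PLACEHOLDERS = ["[removed]", "[deleted]"]
usable = posts["selftext"].notna() & ~posts["selftext"].isin(PLACEHOLDERS)
kept = posts[usable]

print(kept["title"].tolist())  # ['post b']
```

Boolean indexing keeps the original row labels, so the surviving rows can still be traced back to their positions in the unfiltered data.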

In [8]:
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 515 entries, 3 to 999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   author           515 non-null    object 
 1   title            515 non-null    object 
 2   selftext         515 non-null    object 
 3   link_flair_text  0 non-null      object 
 4   num_comments     515 non-null    int64  
 5   score            515 non-null    int64  
 6   upvote_ratio     515 non-null    float64
dtypes: float64(1), int64(2), object(4)
memory usage: 32.2+ KB


In [9]:
cleaned_df.to_csv("aita_data_sample.csv")
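
By default, `to_csv` also writes the row index as an unnamed first column, which `read_csv` labels `Unnamed: 0` on reload unless it is told otherwise. Here is a small sketch of that round trip on made-up data, using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Toy frame with a non-contiguous index, like the filtered data above.
df = pd.DataFrame({"author": ["a", "b"], "score": [1, 2]}, index=[3, 7])

buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as the first, unnamed column

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as a regular column
print(naive.columns.tolist())  # ['Unnamed: 0', 'author', 'score']

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # the index is restored from the first column
print(restored.index.tolist())  # [3, 7]
```

Passing `index_col=0` when reading, or `index=False` when writing if the original row labels are not needed, avoids the extra column.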