# Reddit Analysis using Pushshift API

## Part 1: Data Scraping
- [Background](#Background)
- [Problem Statement](#Problem-Statement)
- [Data Scraping](#Data-Scraping)

# Background

Reddit is a website that dubs itself the "Front Page of the Internet" and functions as a hybrid between a news service and a social media platform. The website is divided into subreddits, miniature communities that cover topics ranging from overarching (news, science, music, etc.) to incredibly specific (AskNYC, dogecoin, siberianhusky). In these subreddits, users can make posts and up-vote or down-vote them. With enough popularity, a post can make it to the popular section of Reddit, where it will be seen by anyone who browses the website. This creates a self-functioning advertisement cycle that helps users discover and join communities they find interesting. The foundation of each post is the title, the body text (which can also contain links and images), and the comments. In the background, users can also send and receive private messages.

In this project, we will be analyzing data from two subreddits:

1. [r/TalesFromRetail](https://www.reddit.com/r/TalesFromRetail/) (640k members): This subreddit collects stories from employees working in retail. The posts are mainly negative and largely refer to "Karens", people known for being entitled and demanding toward others in public.

2. [r/raisedbynarcissists](https://www.reddit.com/r/raisedbynarcissists/) (714k members): This subreddit is a support group for people who had abusive and toxic parental figures. Posts range from personal history to questions and discussions regarding a user's personal life.

# Problem Statement
These subreddits were chosen because both focus on negative experiences with specific people in the posters' lives. The two subreddits also have a similar amount of activity. However, there is a large difference between them: TalesFromRetail deals with one-time interactions with strangers, while raisedbynarcissists deals with long and intimate histories between family members.

While Karens are known for being openly obnoxious, narcissists can be soft-spoken and manipulative. This juxtaposition between public and private abusers raised the question of whether the two types of people share any personality traits. This project has two main goals. The first is to use NLP methods to see what vocabulary appears in both subreddits. The second is to use classification models to see whether the subreddit a post came from can be predicted. Do Karens and narcissistic parents have any overlap?

# Data Scraping

In [1]:
import pandas as pd
import numpy as np
import requests
import time

In [2]:
# Posts for /r/raisedbynarcissists
base_url = 'https://api.pushshift.io/reddit/search/'
params = {
    'subreddit': 'raisedbynarcissists',
    'size': 50
}
pages = []

# 20 requests of 50 posts each; every request after the first only asks for
# posts created before the oldest post collected so far
for x in range(0, 1000, 50):
    res = requests.get(base_url + 'submission/', params = params)
    res.raise_for_status() # stop on an HTTP error instead of parsing a bad response
    page = pd.DataFrame(res.json()['data'])
    pages.append(page)
    params['before'] = page['created_utc'].min()

    time.sleep(2) # pause between requests

df1 = pd.concat(pages, ignore_index = True)
df1

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,author_flair_template_id,author_flair_text_color,removed_by_category,post_hint,preview,edited,author_flair_background_color,gilded,author_cakeday,banned_by
0,[],False,nothing_to_be,,[],,text,t2_ldwlf,False,False,...,,,,,,,,,,
1,[],False,Spitefullyginger,,[],,text,t2_bou3i03y,False,False,...,,,,,,,,,,
2,[],False,Magnus826,,[],,text,t2_418e0blm,False,False,...,,,,,,,,,,
3,[],False,sweggin_official,,[],,text,t2_4onw1jkl,False,False,...,,,,,,,,,,
4,[],False,MrBlue404,,[],,text,t2_702zpp0v,False,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,[],False,NoAd3629,,[],,text,t2_83p0u09n,False,False,...,,,,,,,,,,
996,[],False,Ok-Win4485,,[],,text,t2_bxkc59wb,False,False,...,,,,,,,,,,
997,[],False,pakallakikochino,,[],,text,t2_yfzl4,False,False,...,,,,,,,,,,
998,[],False,ReAwakenedGhost,,[],,text,t2_al09fzut,False,False,...,,,,,,,,,,
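
The scraping cells in this section repeat the same pattern: request a page, record the oldest `created_utc`, and pass it as `before` on the next request. A minimal sketch of that loop as a reusable function follows. The names `collect_posts` and `fetch_page` are hypothetical and not part of the original notebook; `fetch_page` stands in for the `requests.get(...).json()['data']` call so the paging logic can be tried without touching the network. The sketch also stops early when a page comes back empty, a case the notebook cells do not handle.

```python
import time

def collect_posts(fetch_page, subreddit, n_pages, size=50, pause=2.0):
    """Page backwards through a subreddit using the `before` cursor.

    fetch_page(params) is a caller-supplied (hypothetical) function that
    returns a list of post dicts, each containing a 'created_utc' key.
    """
    params = {'subreddit': subreddit, 'size': size}
    posts = []
    for page_index in range(n_pages):
        # Pass a copy so the callee cannot modify our cursor
        page = fetch_page(dict(params))
        if not page:
            break  # nothing older is available
        posts.extend(page)
        # The next request asks only for posts older than the oldest one seen
        params['before'] = min(post['created_utc'] for post in page)
        if page_index < n_pages - 1:
            time.sleep(pause)  # pause between requests
    return posts
```

With the notebook's setup, `fetch_page` could be something like `lambda p: requests.get(base_url + 'submission/', params = p).json()['data']`, and `pd.DataFrame(collect_posts(fetch_page, 'raisedbynarcissists', 20))` would then play the role of `df1`.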


In [3]:
# Posts for /r/TalesFromRetail
base_url = 'https://api.pushshift.io/reddit/search/'
params = {
    'subreddit': 'TalesFromRetail',
    'size': 50
}
pages = []

# 4500 posts gathered due to abundance of [removed] posts (90 requests of 50);
# every request after the first only asks for posts created before the oldest
# post collected so far
for x in range(0, 4500, 50):
    res = requests.get(base_url + 'submission/', params = params)
    res.raise_for_status() # stop on an HTTP error instead of parsing a bad response
    page = pd.DataFrame(res.json()['data'])
    pages.append(page)
    params['before'] = page['created_utc'].min()

    time.sleep(2) # pause between requests

df2 = pd.concat(pages, ignore_index = True)
df2

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,url_overridden_by_dest,author_flair_template_id,post_hint,preview,suggested_sort,link_flair_template_id,author_cakeday,edited,gilded,distinguished
0,[],False,Stewwhoo22,,[],,text,t2_dzptvyv,False,False,...,,,,,,,,,,
1,[],False,RevXeXnge,,[],,text,t2_2w5kq5pa,False,False,...,,,,,,,,,,
2,[],False,Blueartbird,,[],,text,t2_4ingkij7,False,False,...,,,,,,,,,,
3,[],False,ELfit4life,,[],,text,t2_hi005v4,False,False,...,,,,,,,,,,
4,[],False,ELfit4life,,[],,text,t2_hi005v4,False,False,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4495,[],False,LaLunacy,,[],,text,t2_66sxf76m,,False,...,,,,,,,,,,
4496,[],False,ozzycheet,,[],,text,t2_72c53ns0,,False,...,,,,,,,,,,
4497,[],False,Darkrek,,[],,text,t2_746yz4uo,,False,...,,,,,,,,,,
4498,[],False,Brandykat,,[],,text,t2_2a7ziq70,,False,...,,,,,,,,,,


In [4]:
# Export scraped data into csvs
df1.to_csv('../data/narcissists.csv')

df2.to_csv('../data/retail.csv')

In [5]:
# Preliminary inspection on possibly useful columns:
# report every column of df1 with more than 100 distinct values
for col in df1.columns:
    try:
        n_unique = df1.drop_duplicates(subset = col).shape[0]
        if n_unique > 100:
            print(f'Unique in {col} is {n_unique}')
    except TypeError: # unhashable type: 'list'
        print(f'{col} is unhashable')

all_awardings is unhashable
Unique in author is 808
author_flair_richtext is unhashable
Unique in author_fullname is 808
awarders is unhashable
Unique in created_utc is 997
Unique in full_link is 1000
gildings is unhashable
Unique in id is 1000
link_flair_richtext is unhashable
Unique in permalink is 1000
Unique in retrieved_on is 996
Unique in selftext is 902
Unique in subreddit_subscribers is 430
Unique in title is 986
treatment_tags is unhashable
Unique in url is 1000
preview is unhashable
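
The `unhashable` messages above come from columns whose cells hold lists, which `drop_duplicates` cannot hash. One possible workaround, sketched here under the assumption that such cells are JSON-serializable (lists, dicts, strings, numbers), is to serialize each value before counting. The helper name `count_unique` is hypothetical, and the sample values are made up for illustration.

```python
import json

def count_unique(values):
    """Count distinct values, including unhashable ones such as lists.

    Assumes every value is JSON-serializable; each value is turned into a
    canonical JSON string so that lists and dicts can be placed in a set.
    """
    seen = set()
    for value in values:
        seen.add(json.dumps(value, sort_keys=True))
    return len(seen)

# Small illustrative sample (made-up values, not scraped data)
sample = [[], [], [{'e': 'text'}], [{'e': 'text'}], ['a', 'b']]
print(count_unique(sample))  # prints 3
```

Applied to one of the list-valued columns, for example `count_unique(df1['all_awardings'])`, this should give a count where the loop above can only report that the column is unhashable.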


In [6]:
# Preliminary inspection on possibly useful columns:
# report every column of df2 with more than 100 distinct values
for col in df2.columns:
    try:
        n_unique = df2.drop_duplicates(subset = col).shape[0]
        if n_unique > 100:
            print(f'Unique in {col} is {n_unique}')
    except TypeError: # unhashable type: 'list'
        print(f'{col} is unhashable')

all_awardings is unhashable
Unique in author is 2935
author_flair_richtext is unhashable
Unique in author_fullname is 2935
awarders is unhashable
Unique in created_utc is 4500
Unique in full_link is 4500
gildings is unhashable
Unique in id is 4500
link_flair_richtext is unhashable
Unique in num_comments is 203
Unique in permalink is 4500
Unique in retrieved_on is 4500
Unique in selftext is 956
Unique in subreddit_subscribers is 3437
Unique in title is 4331
treatment_tags is unhashable
Unique in url is 4498
crosspost_parent_list is unhashable
preview is unhashable
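
Only 956 of the 4500 r/TalesFromRetail posts have a distinct `selftext`, which is consistent with the earlier note about `[removed]` posts. A minimal sketch of how the placeholder bodies could be tallied and dropped before analysis is below. It works on plain dicts with made-up sample rows; the function name `split_removed` is hypothetical, and treating `[deleted]` and empty bodies as placeholders alongside `[removed]` is an assumption rather than something the notebook states.

```python
from collections import Counter

# '[removed]' comes from the notebook's comment; the other two are assumptions
PLACEHOLDERS = {'[removed]', '[deleted]', ''}

def split_removed(posts):
    """Return (kept_posts, placeholder_counts) for a list of post dicts."""
    kept = []
    counts = Counter()
    for post in posts:
        text = post.get('selftext')
        if not isinstance(text, str):
            text = ''  # treat missing values (None, NaN) as an empty body
        text = text.strip()
        if text in PLACEHOLDERS:
            counts[text] += 1
        else:
            kept.append(post)
    return kept, counts

# Made-up rows for illustration only
sample = [
    {'id': 'a1', 'selftext': 'A customer asked for a refund on...'},
    {'id': 'a2', 'selftext': '[removed]'},
    {'id': 'a3', 'selftext': '[removed]'},
    {'id': 'a4', 'selftext': None},
]
kept, counts = split_removed(sample)
print(len(kept), dict(counts))
```

On the DataFrame itself, `df2.to_dict('records')` would produce a list of dicts in this shape.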
