## Notebook 1 - Intro, Imports, Cleaning
### Project 3 | DSI 523 | Adriana J. Machado 
-----

## Table of Contents
### [1.0 Introduction](#1.0-Introduction)
### [2.0 Imports & Gathering Data](#2.0-Imports-&-Gathering-Data)
### [3.0 Data Cleaning](#3.0-Data-Cleaning)
-----

# 1.0 Introduction
-----
# Shower Thoughts v Intrusive Thoughts
## Using natural language processing and the Reddit pushshift API to distinguish between random casual thoughts and neurotic thoughts. 

https://www.reddit.com/r/Showerthoughts/

>A subreddit for sharing those miniature epiphanies you have that highlight the oddities within the familiar.

https://www.reddit.com/r/intrusivethoughts/

>A subreddit for you to share all those intrusive, obsessive and recurring thoughts or ideas that race through your head throughout the day.

# 2.0 Imports & Gathering Data
-----
Code that is commented out is used for fresh API pulls and marked to indicate it's optional usage. 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import requests
import time
from bs4 import BeautifulSoup
# https://www.geeksforgeeks.org/how-to-convert-datetime-to-unix-timestamp-in-python/
import calendar
import datetime

# Lesson 5.04
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
import re

# warnings
import warnings
warnings.simplefilter("ignore")
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=DeprecationWarning)

# formatting
pd.set_option('display.max_columns', None)
pd.options.display.max_colwidth = 400

## Function for API call to work around 100 post max on Push Shift

In [2]:
def pull_sub_df(subreddit, num_posts):
    '''
    Adapted from: https://youtu.be/AcrjEWsMi_E
    Connects to the Reddit pushshift API and loops the 100 max limit 
    for 1000 submissions concatenated into a single dataframe. 
    
    Input: subreddit name as a string with no spaces (found in the url) & integer for # of posts desired
    Output: a pandas dataframe
    '''
    
    url = 'https://api.pushshift.io/reddit/search/submission'
    n = 0
    before = datetime.datetime.utcnow()
    before_utc = calendar.timegm(before.utctimetuple())
    df = pd.DataFrame()
    
    while n < (num_posts/100):
        
        # increment pulls
        time.sleep(3)
        
        # set pushshift parameters with subreddit, size, and before date
        params = {
        'subreddit': subreddit,
        'size': 100,
        'before': before_utc,
        }
        
        # call api and pull subreddit posts
        res = requests.get(url, params)
        data = res.json()
        posts = data['data']
        
        # sanity checks
        print(f'=====\nPull #{n+1}')
        print(f'Status Code: {res.status_code}')
        print(f'Posts Length: {len(posts)}')
        
        # create temorary df and concat with returned df
        temp_df = pd.DataFrame(posts)
        df = pd.concat([df, temp_df], ignore_index = True) 
        # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
        
        # reset before utc time to the last entries created_utc date
        before_utc = temp_df.loc[99:, 'created_utc']
        
        # keep looping 10 times for 1000 posts
        n += 1
    
    return df

## Create Data Frames - Uncomment code in cells to get up to date posts data

In [3]:
## create shower thoughts df
## uncomment to get up to date data

shower_df = pull_sub_df('showerthoughts', 1000)
shower_df.head()

=====
Pull #1
Status Code: 200
Posts Length: 98
=====
Pull #2
Status Code: 200
Posts Length: 98
=====
Pull #3
Status Code: 200
Posts Length: 98
=====
Pull #4
Status Code: 200
Posts Length: 98
=====
Pull #5
Status Code: 200
Posts Length: 98
=====
Pull #6
Status Code: 200
Posts Length: 98
=====
Pull #7
Status Code: 200
Posts Length: 98
=====
Pull #8
Status Code: 200
Posts Length: 98
=====
Pull #9
Status Code: 200
Posts Length: 98
=====
Pull #10
Status Code: 200
Posts Length: 98


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,removed_by_category,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,author_flair_background_color,author_flair_text_color,banned_by
0,[],False,skud14,,[],,text,t2_121uqa,False,False,False,[],False,False,1656637254,self.Showerthoughts,https://www.reddit.com/r/Showerthoughts/comments/vonxn0/whether_we_invoke_god_or_a_natural_universe_we/,{},vonxn0,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/Showerthoughts/comments/vonxn0/whether_we_invoke_god_or_a_natural_universe_we/,False,6,moderator,1656637265,1,[removed],True,False,False,Showerthoughts,t5_2szyo,25056120,public,self,"Whether we invoke God or a Natural Universe, we live in a universe which reason for its existence is no more complex than ""it just is""",0,[],1.0,https://www.reddit.com/r/Showerthoughts/comments/vonxn0/whether_we_invoke_god_or_a_natural_universe_we/,all_ads,6,,,
1,[],False,TheTalentedAmateur,,[],,text,t2_5j9wb,False,False,False,[],False,False,1656637162,self.Showerthoughts,https://www.reddit.com/r/Showerthoughts/comments/vonwcq/many_men_have_issues_with_fake_artificialor/,{},vonwcq,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/Showerthoughts/comments/vonwcq/many_men_have_issues_with_fake_artificialor/,False,6,moderator,1656637173,1,[removed],True,False,False,Showerthoughts,t5_2szyo,25056102,public,self,"Many men have issues with ""Fake"" ""Artificial""or ""Plastic"" Boobs. If men could get plastic, artificial penis enlargement surgery, the lines of those same men would stretch for miles.",0,[],1.0,https://www.reddit.com/r/Showerthoughts/comments/vonwcq/many_men_have_issues_with_fake_artificialor/,all_ads,6,,,
2,[],False,annie_bean,,[],,text,t2_3ucrgia7,False,False,False,[],False,False,1656637160,self.Showerthoughts,https://www.reddit.com/r/Showerthoughts/comments/vonwbt/intelligent_people_use_the_phrase_i_dont_know/,{},vonwbt,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/Showerthoughts/comments/vonwbt/intelligent_people_use_the_phrase_i_dont_know/,False,6,moderator,1656637170,1,[removed],True,False,False,Showerthoughts,t5_2szyo,25056102,public,self,"Intelligent people use the phrase ""I don't know"" more often than stupid people do",0,[],1.0,https://www.reddit.com/r/Showerthoughts/comments/vonwbt/intelligent_people_use_the_phrase_i_dont_know/,all_ads,6,,,
3,[],False,[deleted],,,,,,False,,,[],False,False,1656637148,self.Showerthoughts,https://www.reddit.com/r/Showerthoughts/comments/vonw7g/whether_we_invoke_god_or_a_natural_universe_we/,{},vonw7g,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/Showerthoughts/comments/vonw7g/whether_we_invoke_god_or_a_natural_universe_we/,False,6,deleted,1656637159,1,,True,False,False,Showerthoughts,t5_2szyo,25056101,public,default,"Whether we invoke God or a Natural Universe, we live in a unique which reason for its existence is no more complex than ""it just is""",0,[],1.0,https://www.reddit.com/r/Showerthoughts/comments/vonw7g/whether_we_invoke_god_or_a_natural_universe_we/,all_ads,6,,dark,moderators
4,[],False,skud14,,[],,text,t2_121uqa,False,False,False,[],False,False,1656637065,self.Showerthoughts,https://www.reddit.com/r/Showerthoughts/comments/vonv9r/we_either_live_in_a_universe_which_has_no_more/,{},vonv9r,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/Showerthoughts/comments/vonv9r/we_either_live_in_a_universe_which_has_no_more/,False,6,moderator,1656637075,1,[removed],True,False,False,Showerthoughts,t5_2szyo,25056095,public,self,"We either live in a universe which has no more complex a reason for its existence other than ""It just is""",0,[],1.0,https://www.reddit.com/r/Showerthoughts/comments/vonv9r/we_either_live_in_a_universe_which_has_no_more/,all_ads,6,,,


In [4]:
## create intrusive thoughts df
## uncomment to get up to date data

intrusive_df = pull_sub_df('intrusivethoughts', 1000)
intrusive_df.head()

=====
Pull #1
Status Code: 200
Posts Length: 100
=====
Pull #2
Status Code: 200
Posts Length: 100
=====
Pull #3
Status Code: 200
Posts Length: 100
=====
Pull #4
Status Code: 200
Posts Length: 100
=====
Pull #5
Status Code: 200
Posts Length: 100
=====
Pull #6
Status Code: 200
Posts Length: 100
=====
Pull #7
Status Code: 200
Posts Length: 100
=====
Pull #8
Status Code: 200
Posts Length: 100
=====
Pull #9
Status Code: 200
Posts Length: 100
=====
Pull #10
Status Code: 200
Posts Length: 100


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,removed_by_category,author_cakeday,post_hint,preview,call_to_action,category,crosspost_parent,crosspost_parent_list,url_overridden_by_dest
0,[],False,tigerking599,,[],,text,t2_gvbkjr23,False,False,False,[],False,False,1656635652,self.intrusivethoughts,https://www.reddit.com/r/intrusivethoughts/comments/vonerj/can_you_relate/,{},vonerj,False,True,False,False,False,True,True,False,,[],dark,text,False,False,False,0,0,False,no_ads,/r/intrusivethoughts/comments/vonerj/can_you_relate/,False,0.0,1656635663,1,"I have been suffering with POCD, Harm OCD, and sexual Intrusive thoughts before I got these intrusive thoughts I was very vocal about supporting death penalty for rapists and murders and whenever I watched, seen, or read stories about rape or murders I would always say ""that perpetrator/suspect deserve death"" and now, everytime I watched, seen, or read about these crimes I always get triggered...",True,False,False,intrusivethoughts,t5_2tqd6,90849,public,self,Can you relate?,0,[],1.0,https://www.reddit.com/r/intrusivethoughts/comments/vonerj/can_you_relate/,no_ads,0.0,,,,,,,,,
1,[],False,Lost_And_Found66,,[],,text,t2_9dvgg5yr,False,False,False,[],False,False,1656634577,self.intrusivethoughts,https://www.reddit.com/r/intrusivethoughts/comments/von1sy/you_actually_dont_love_anyone_youre_only_nice_and/,{},von1sy,False,True,False,False,False,True,True,False,,[],dark,text,False,False,False,0,0,False,no_ads,/r/intrusivethoughts/comments/von1sy/you_actually_dont_love_anyone_youre_only_nice_and/,False,0.0,1656634588,1,,True,False,False,intrusivethoughts,t5_2tqd6,90848,public,self,You actually don't love anyone. you're only nice and caring towards your friends and family because it makes you feel good. If it makes you feel good it's inherently selfish you fucking narcissistic sociopath,0,[],1.0,https://www.reddit.com/r/intrusivethoughts/comments/von1sy/you_actually_dont_love_anyone_youre_only_nice_and/,no_ads,0.0,,,,,,,,,
2,[],False,I_CANT-DO_IT,,[],,text,t2_pf6qnowa,False,False,False,[],False,False,1656633940,self.intrusivethoughts,https://www.reddit.com/r/intrusivethoughts/comments/vomu7f/_/,{},vomu7f,False,True,False,False,False,True,True,False,,[],dark,text,False,False,False,0,0,False,no_ads,/r/intrusivethoughts/comments/vomu7f/_/,False,0.0,1656633950,1,"When ever i see a knife I think about stabbing someone like in my family which is an intrusive thought I know, I'm diagnosed with severe OCD but it really sucks because I love all my family and I don't understand why this happens :( Sorry if you think I'm a psycho never hurt anyone :&lt;",True,False,False,intrusivethoughts,t5_2tqd6,90847,public,self,:(,0,[],1.0,https://www.reddit.com/r/intrusivethoughts/comments/vomu7f/_/,no_ads,0.0,,,,,,,,,
3,[],False,Depressed_Noodle_,,[],,text,t2_i5nbajbh,False,False,False,[],False,False,1656630817,self.intrusivethoughts,https://www.reddit.com/r/intrusivethoughts/comments/volrpt/racing_thoughts_that_run_themselves_into/,{},volrpt,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,no_ads,/r/intrusivethoughts/comments/volrpt/racing_thoughts_that_run_themselves_into/,False,0.0,1656630828,1,Trigger warning: su**de\n\n\n\n\nI've got manic bi polar I'm on a mood stabilizer and it helps. Just started it a few weeks ago. \n\nThe one thing I haven't been able to excape no mater what coping skill I use is the manic racing voices I hear. I'm also schizophrenic leaning more towards schizoaffective disorder so voices are just a thing for me and most of the time I manage them well.\n\nBut ...,True,False,False,intrusivethoughts,t5_2tqd6,90846,public,self,racing thoughts that run themselves into depression episodes,0,[],1.0,https://www.reddit.com/r/intrusivethoughts/comments/volrpt/racing_thoughts_that_run_themselves_into/,no_ads,0.0,,,,,,,,,
4,[],False,DiddlyDipshit,,[],,text,t2_n0z22gl,False,False,False,[],False,False,1656630040,self.intrusivethoughts,https://www.reddit.com/r/intrusivethoughts/comments/volhnc/coworker_just_farted_cant_allow_others_to_think_i/,{},volhnc,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,no_ads,/r/intrusivethoughts/comments/volhnc/coworker_just_farted_cant_allow_others_to_think_i/,False,0.0,1656630050,1,,True,False,False,intrusivethoughts,t5_2tqd6,90846,public,self,"Coworker just farted. Can't allow others to think I did it. Must yell ""JOE FARTED NOT ME"" at the top of my lungs.",0,[],1.0,https://www.reddit.com/r/intrusivethoughts/comments/volhnc/coworker_just_farted_cant_allow_others_to_think_i/,no_ads,0.0,,,,,,,,,


In [5]:
## Save raw shower and intrusive dfs to csvs
## uncomment to get up to date data

shower_df['subreddit'].replace('Showerthoughts', 'showerthoughts', inplace = True)

# shower_df.to_csv('./data/shower_df_raw2.csv', index = False)
# intrusive_df.to_csv('./data/intrusive_df_raw2.csv', index = False)

In [6]:
## load raw separate subreddit dataframes
shower_df = pd.read_csv('./data/shower_df_raw.csv')
intrusive_df = pd.read_csv('./data/intrusive_df_raw.csv')

In [7]:
## combine shower thoughts and intrusive thoughts into one df
## create is_shower column for modeling
## save raw concatenated df
## uncomment to get up to date data

intrusive_shower = pd.concat([intrusive_df, shower_df], ignore_index = True)

intrusive_shower['is_shower'] = intrusive_shower['subreddit'].replace({'showerthoughts':1, 'intrusivethoughts': 0})

# intrusive_shower.to_csv('./data/intrusive_shower_raw2.csv', index = False)

# intrusive_shower.head()

In [8]:
# print(intrusive_shower.shape)
# print(shower_df.shape)
# print(intrusive_df.shape)

In [9]:
## uncomment to get up to date data

# intrusive_shower.info()

# 3.0 Data Cleaning
-----

## Clean up columns in intrusive_shower

In [10]:
## uncomment to get up to date data

intrusive_shower = intrusive_shower[['is_shower', 'subreddit', 'created_utc', 'title', 'selftext', 'upvote_ratio']]

## Duplicates are removed by the subreddit moderators

## Clean Nulls

In [11]:
# intrusive_shower.info()

In [12]:
## fill selftext na's with 'redacted' to remove null amount
## uncomment to get up to date data

intrusive_shower['selftext'] = intrusive_shower['selftext'].fillna('redacted')

In [13]:
## uncomment to get up to date data

# intrusive_shower.info()

## Clean HTML

In [14]:
# Adapted from Lesson 5.06
def remove_html(text):
    '''function to remove html and lowercase all text'''
    # lowercase
    low = text.lower()
    
    # remove html
    low_nobreak = text.replace('\n', ' ')
    no_html = BeautifulSoup(low_nobreak).text
    
    return no_html

In [15]:
## Apply remove_html fuction to title
## Apply remove_html function to selftext
## uncomment to get up to date data

intrusive_shower['title'] = intrusive_shower['title'].apply(remove_html)

intrusive_shower['selftext'] = intrusive_shower['selftext'].apply(remove_html)

In [16]:
# intrusive_shower.head()

## Tokenize Title and Self Text

In [17]:
## Lesson 5.04
## uncomment to get up to date data

tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')

In [18]:
## lower all text in title and selftext
## uncomment to get up to date data

intrusive_shower["title"] = intrusive_shower["title"].map(lambda x: x.lower())
intrusive_shower["selftext"] = intrusive_shower["selftext"].map(lambda x: x.lower())

In [19]:
## uncomment to get up to date data

intrusive_shower['title_token'] = [tokenizer.tokenize(title) for title in intrusive_shower['title']]

intrusive_shower['selftext_token'] = [tokenizer.tokenize(selftext) for selftext in intrusive_shower['selftext']]

In [20]:
# intrusive_shower.head()

## Clean Stop Words

In [21]:
# stopwords.words('english')

In [22]:
## Lesson 5.04
## uncomment to get up to date data

intrusive_shower['title_no_stop'] = [text if text not in stopwords.words('english') else text for text in intrusive_shower['title_token']]

intrusive_shower['selftext_no_stop'] = [text if text not in stopwords.words('english') else text for text in intrusive_shower['selftext_token']]

In [23]:
intrusive_shower.head()

Unnamed: 0,is_shower,subreddit,created_utc,title,selftext,upvote_ratio,title_token,selftext_token,title_no_stop,selftext_no_stop
0,0,intrusivethoughts,1656469973,i am extremely scared my left eye is moving slowly downwards on my face,redacted,1.0,"[i, am, extremely, scared, my, left, eye, is, moving, slowly, downwards, on, my, face]",[redacted],"[i, am, extremely, scared, my, left, eye, is, moving, slowly, downwards, on, my, face]",[redacted]
1,0,intrusivethoughts,1656468826,here's a list of my intrusive thoughts,jump in front of a car run people over go knife happy throw water over electrics shout racial abuse shoot a school up why is my brain doing this to me!!!!! 😢😢😢😢,1.0,"[here, 's, a, list, of, my, intrusive, thoughts]","[jump, in, front, of, a, car, run, people, over, go, knife, happy, throw, water, over, electrics, shout, racial, abuse, shoot, a, school, up, why, is, my, brain, doing, this, to, me, !!!!!, 😢😢😢😢]","[here, 's, a, list, of, my, intrusive, thoughts]","[jump, in, front, of, a, car, run, people, over, go, knife, happy, throw, water, over, electrics, shout, racial, abuse, shoot, a, school, up, why, is, my, brain, doing, this, to, me, !!!!!, 😢😢😢😢]"
2,0,intrusivethoughts,1656464538,step on your cat,"i love her so much but....she's just lying there, in the way ... could i crush her scull?",1.0,"[step, on, your, cat]","[i, love, her, so, much, but, ....she's, just, lying, there, ,, in, the, way, ..., could, i, crush, her, scull, ?]","[step, on, your, cat]","[i, love, her, so, much, but, ....she's, just, lying, there, ,, in, the, way, ..., could, i, crush, her, scull, ?]"
3,0,intrusivethoughts,1656459837,intrusive thoughts,are paranoid intrusive thoughts a thing??,1.0,"[intrusive, thoughts]","[are, paranoid, intrusive, thoughts, a, thing, ??]","[intrusive, thoughts]","[are, paranoid, intrusive, thoughts, a, thing, ??]"
4,0,intrusivethoughts,1656458075,bash your head against the painting and cut your eyes out with the glass,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaahhhhh do it now! do it right fucking now aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaahhhhh,1.0,"[bash, your, head, against, the, painting, and, cut, your, eyes, out, with, the, glass]","[aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaahhhhh, do, it, now, !, do, it, right, fucking, now, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaahhhhh]","[bash, your, head, against, the, painting, and, cut, your, eyes, out, with, the, glass]","[aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaahhhhh, do, it, now, !, do, it, right, fucking, now, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaahhhhh]"


# Save Cleaned CSV / Import Cleaned intrusive_shower CSV (if not running for new data)

In [24]:
## create static csv for model training purposes - not dynamic and updated by date
## uncomment to get up to date data

# intrusive_shower.to_csv('./data/intrusive_shower_clean2.csv', index = False)