***Resources for scraping Reddit:***
    
PRAW: https://towardsdatascience.com/scraping-reddit-data-1c0af3040768

PUSHSHIFT: https://medium.com/@RareLoot/using-pushshifts-api-to-extract-reddit-submissions-fb517b286563

In [1]:
import pickle
import pandas as pd

import praw
import pprint
from reddit_credentials import reddit_info
from praw.models import MoreComments

from collections import Counter

In [2]:
# logging in to use Python Reddit API

reddit = praw.Reddit(client_id=reddit_info['client_id'],
                     client_secret=reddit_info['client_secret'],
                     user_agent=reddit_info['user_agent'],
                     username=reddit_info['username'],
                     password=reddit_info['password']
                    )

# Convo with Aaron and Beatriz

- how has it changed over time in terms of positive vs negative sentiment? sentiment analysis
- what are people saying? how do they feel about it? 
- can I group user groups together? 
- can I group certain topics together?
    - are people talking predominantly about carbon emissions? or that it's a hoax? etc.
    - there are a number of questions about "what is even happening" that need to be answered, and unsupervised learning will help?

- can do multiple methods at once
    - classify topics
    - then do some sentiment analysis - whether a certain post leans more towards a negative vs positive outlook on this
- do everything they talked about in class
    - use word embedding every which way
    - use topic modeling every which way
    - use everything every way they talk about it

- stop words
- get rid of punctuation
- lower case all
- functions to tokenize
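
The cleaning steps listed above could be combined into one small helper. This is only a minimal sketch: the function name `tokenize` and the tiny `STOP_WORDS` set are assumptions made for illustration, and a real run would likely swap in a fuller stop-word list (for example the one shipped with NLTK or scikit-learn).

```python
import string

# Tiny illustrative stop-word list (an assumption, not a complete list).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(_PUNCT_TABLE)
    return [tok for tok in cleaned.split() if tok not in stop_words]


tokens = tokenize("The climate IS changing, and it's a big deal!")
print(tokens)
```

Note that stripping punctuation before splitting turns "it's" into "its", so contractions may need their own handling depending on the stop-word list used.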

# NYT API
# GDELT

# Ideas from convo with Kelly

- sentiment & topics of climate change discussions
    - when people talk about climate change, what is the tone of the conversation? (sentiment analysis)
    - topic modeling
- look at different patterns in way people talk about climate change
- how conversation has changed over time, if I can get longitudinal data out of this
- depending on what I get out of that, find interesting patterns
    - may stimulate further ideas down the road
 
- I can engineer my own features
    - look for specific words (word tokens) like "not sure" to see how people are equivocating about climate change
        - create functions that go through each of my reddit comments and look for certain words or things I think might be relevant to some of the data
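
One way the hand-engineered features above might look: a function that counts equivocation phrases in each comment. The phrase list is hypothetical ("not sure" comes from the notes, the others are illustrative guesses), as are the names `count_hedges` and `hedge_features`.

```python
import re

# Hypothetical equivocation markers; only "not sure" comes from the notes.
HEDGE_PHRASES = ["not sure", "maybe", "i guess", "who knows"]


def count_hedges(comment, phrases=HEDGE_PHRASES):
    """Return {phrase: count} of whole-phrase, case-insensitive matches."""
    text = comment.lower()
    counts = {}
    for phrase in phrases:
        # \b keeps "maybe" from matching inside a longer word.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        counts[phrase] = len(re.findall(pattern, text))
    return counts


def hedge_features(comments):
    """Build one feature row (a dict) per comment."""
    return [count_hedges(c) for c in comments]


rows = hedge_features(["I'm not sure. Maybe it's real, maybe not.", "It is a fact."])
print(rows)
```

Each row is a plain dict, so the list can be passed straight to `pd.DataFrame(rows)` and joined with the other features.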


- both of the below topics involve how people frame arguments; so what are the essential ways people frame arguments for reproductive rights or global warming


***abortion***
- could turn into a predictive model -- given the way people are talking about the issue here, are they prolife or prochoice? 
    - because it's kind of like I have labeled data here based on subreddit names
- George Lakoff -- gets into psychology and the way people with different political views frame arguments
    - could look at things like sentiment, word frequency (whether a word is high vs. low frequency word)
        - can you assume some level of education from the text based on how people are using language?
            - by looking at vocabulary; TF-IDF is important here; looking at weight of each word
                - if people are using very low-frequency words (10 cent words) in their speech, can assume they are more highly educated
                - start to make some hypotheses around how people with different political views might use language, so when doing NLP can engineer some features to let me know whether or not someone is for or against reproductive rights
                
- reddit aggregator that gives individual info about each reddit commenter (Kelly can figure out name and send it to me)
    - from there can get info about what each of the reddit user likes and what other subreddits they belong to
    - might have some geolocation data (so I know where in the country the person is located who makes certain comments)
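
To make the TF-IDF idea above concrete, here is a minimal pure-Python sketch that uses the plain `log(N / df)` form of IDF. The function names and toy documents are made up for illustration; in practice a library implementation such as scikit-learn's `TfidfVectorizer` (which uses a smoothed variant) would be the more likely choice. The `mean_idf` helper is one assumed way to turn "uses rare words" into a single number per comment.

```python
import math
from collections import Counter


def inverse_document_frequency(tokenized_docs):
    """IDF per word: log(N / number of documents containing the word)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tfidf(tokens, idf):
    """TF-IDF weights for one document (term frequency times IDF)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: (count / total) * idf[word] for word, count in counts.items()}


def mean_idf(tokens, idf):
    """Average IDF of a document's tokens: higher means rarer vocabulary."""
    return sum(idf[t] for t in tokens) / len(tokens) if tokens else 0.0


# Toy, already-tokenized comments (illustrative only).
docs = [
    ["life", "begins", "at", "conception"],
    ["bodily", "autonomy", "is", "a", "right"],
    ["life", "is", "a", "right"],
]
idf = inverse_document_frequency(docs)
weights = tfidf(docs[0], idf)
print(weights)
print([round(mean_idf(d, idf), 3) for d in docs])
```

Words that appear in every document get an IDF of zero under this formula, which is why the rarer vocabulary of the first toy comment gives it a higher mean IDF than the third.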

### Dan's question: SVD, PCA, LDA -- when to use each in project flow?

- 10,000 comments from different subreddits
- reduce dataset
    - get rid of stopwords etc.
- once have countvectorizer or ?, do some dimensionality reduction on it
- we do truncated SVD (e.g., chop down to 10-20 principal components)
    - LDA would do something similar, but SVD is default
    - reduce to 15 themes in latent space; some sort of cocktail mixture of those 15 themes
    - let's also track the subreddits they came from and see the purity of the subreddits
        - in a perfect world, one subreddit would go to each theme, but there prob will be some crossover
            - see what percentage of each subreddit goes to each theme?
- could also try various forms of clustering on raw dataset or the reduced dataset, and see what the clustering says
    - what does DBSCAN say? or what does spectral clustering say? spectral clustering is combo of ?? and kmeans - reduces dimensions and then does kmeans to cluster documents together
- clustering as preliminary baseline to get an idea; then do something like LDA
    - LDA will give you things like top 10 words in first topic, top 10 words in 2nd topic, etc.
    - then do things like LDA in wider buckets, smaller buckets, etc. (?)
    - note - you have the subreddit topics they come from; don't lose that order so you know where they come from
        - just keep it away from any modeling you do so we can see how a blind unsupervised model would do
            - maybe we find out that the human labeling isn't that great and the model figures out a better distribution?
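
The "purity" check described above (what percentage of each subreddit goes to each theme) can be sketched without any modeling library. This assumes each comment has already been assigned a single dominant theme by whatever model is used (SVD, LDA, or clustering); the function names and the toy labels below are made up for illustration.

```python
from collections import Counter, defaultdict


def subreddit_theme_shares(subreddits, themes):
    """For each subreddit, the fraction of its comments assigned to each theme."""
    counts = defaultdict(Counter)
    for sub, theme in zip(subreddits, themes):
        counts[sub][theme] += 1
    shares = {}
    for sub, theme_counts in counts.items():
        total = sum(theme_counts.values())
        shares[sub] = {t: n / total for t, n in theme_counts.items()}
    return shares


def purity(shares):
    """Share of each subreddit's comments that land in its most common theme."""
    return {sub: max(dist.values()) for sub, dist in shares.items()}


# Toy labels kept in the same order as the comments (illustrative only).
sub_labels = ["prochoice", "prochoice", "prochoice",
              "prolife", "prolife", "prolife", "prolife"]
theme_ids = [0, 0, 1, 1, 1, 1, 0]

shares = subreddit_theme_shares(sub_labels, theme_ids)
print(shares)
print(purity(shares))
```

A purity near 1.0 would mean a subreddit maps almost entirely onto one theme; lower values indicate the crossover the notes expect.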

In [None]:
# how has sentiment / topics changed over time with abortion topic?
# if date is convenient, could be interesting, if not, don't worry about it. 
# so can use praw?

In [None]:
# Look at general information about the prochoice and prolife subreddits
# to find other similar subreddits to add to NLP analysis

pc_subreddit = reddit.subreddit('prochoice')
pl_subreddit = reddit.subreddit('prolife')


print(pc_subreddit.description)
print(pl_subreddit.description)

In [None]:
# object name for subreddit forum of interest
# climate = reddit.subreddit('climate change')

# subreddit_list = ['GlobalWarming','climate','climatechange','climateskeptics']

prolife_subreddits = ['trueprolife', 'abortiondebate', 'gendercide', 'Adoption', 'Fosterit', 'ProLifeLibertarians']

prochoice_subreddits = ['abortion', 'abortions', 'abortiondebate', 'childfree', 'feminism101', 'Feminism4Everyone', 
                        'feministFAQ', 'GenderStudies', 'InternationalWomen', 'LiberalFeminism', 
                        'LibertarianFeminism', 'libs', 'onlywomen', 'riotgrrrl', 'SecondWaveFeminism', 
                        'thetruefeminism', 'ThirdWaveFeminism', 'WomenPositive', 'zines']

other_subreddits = ['TwoXChromosomes', 'Feminism', 'antifeminists']

subreddit_list = ['prochoice', 'prolife', 'trueprolife', 'abortion', 'abortions', 'abortiondebate'] 


# d = {"subreddit":[],
#      "title":[],
#      "score":[],
#      "id":[],
#      "url":[],
#      "comms_num": [],
#      "created": [],
#      "body":[],
#      "comments":[]
#     }


d = {"subreddit":[],
     "title":[],
     "comments":[]
    }

In [None]:
for subreddit_name in subreddit_list:
    subreddit = reddit.subreddit(subreddit_name)
    for submission in subreddit.top(limit=3): # Dan is getting top 50-200 at most
        d["subreddit"].append(subreddit_name) # store the name string rather than the PRAW Subreddit object
        d["title"].append(submission.title)
        #d["score"].append(submission.score)
        #d["id"].append(submission.id)
        #d["url"].append(submission.url)
        #d["comms_num"].append(submission.num_comments)
        #d["created"].append(submission.created)
        #d["body"].append(submission.selftext)
        submission.comments.replace_more(limit=2) # limit=None would expand every "load more comments" stub in the comment tree
        d["comments"].append(submission.comments)
#         for comment in (submission.comments.list()):
#             print(comment.body)
        
#df = pd.DataFrame(d)

In [None]:
df = pd.DataFrame(d)
df

In [None]:
# use a context manager so the file is flushed and closed after writing
with open('reproductive_rights_comments.pkl', 'wb') as f:
    pickle.dump(d, f)

In [None]:
# use a context manager so the file handle is closed after reading
with open('reproductive_rights_comments.pkl', 'rb') as f:
    d = pickle.load(f)

In [None]:
print(Counter(d['subreddit']).keys())
(Counter(d['subreddit']).values())

In [None]:
d

In [None]:
comments = []
for i,forest in enumerate(d['comments']):
    for comment in forest.list():
        comments.append([comment.body, str(d['subreddit'][i])])

# keep just the comment text, dropping the subreddit label
comments_only = [comment[0] for comment in comments]

In [None]:
subs = []
coms = [] # in data cleaning file, make all lowercase so it's consistent

for i,forest in enumerate(d['comments']):
    for comment in forest.list():
        coms.append([comment.body])
        subs.append(str(d['subreddit'][i]))

In [None]:
comments

In [None]:
comments_only