## Subreddit Data Acquisition, Utilizing Python Reddit API Wrapper (PRAW)
---

The framework comes from a local repo from Heather's class involving PRAW. Some notable changes:

- Data collected from new, top and controversial filters. 

- A search of gilded users from each subreddit at the time was also added. It does not confirm that any given **post** is gilded, but does confirm that the id of a post belongs to a gilded user (There are very few of these). 

- Data was drawn multiple times, then combined into one larger dataframe with duplicate rows deleted. This posed a challenge: with a plain `drop_duplicates`, many actual duplicates were not caught, because the number of comments or the "karma" score changes for the same post over time, so the dataframe kept what was essentially the same row of information more than once. Therefore, the decision was made to drop duplicates that share the same title and id. I decided to keep the last entry, so that updated comments/scores are reflected. This may lead to a small number of errors in indicating who is a "gilded" member (as that can actually expire after a given time), but I think the potential error in gild status will be outweighed by a sizeable reduction in error in comment/karma score status.

- Data collection was ended on Sunday, October 18. This is intentional, as allowing several days to pass will cause "new" data collection to be entirely unseen data. In this way, I will be able to collect clean test data repeatedly to evaluate my model on completely unseen data multiple times.
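
The cutoff idea in the last bullet can be sketched as a simple timestamp split. This is only an illustration: the cutoff value, the toy rows, and the names `CUTOFF`, `train_pool`, and `unseen` are assumptions, and the timezone of the `created` field should be checked before relying on an exact boundary.

```python
import pandas as pd

# Assumed cutoff: midnight UTC at the end of October 18, 2020 (hypothetical choice).
CUTOFF = pd.Timestamp("2020-10-19", tz="UTC")

# Invented rows standing in for collected posts; `created` is an epoch timestamp in seconds.
posts = pd.DataFrame({
    "id": ["old111", "new222"],
    "created": [1602000000.0, 1604000000.0],
})

# Convert epoch seconds to timezone-aware datetimes so they compare cleanly to CUTOFF.
posts["created_dt"] = pd.to_datetime(posts["created"], unit="s", utc=True)

# Posts before the cutoff belong to the training pool; later posts are unseen test data.
train_pool = posts[posts["created_dt"] < CUTOFF]
unseen = posts[posts["created_dt"] >= CUTOFF]

print(train_pool["id"].tolist(), unseen["id"].tolist())
```

Splitting on the timestamp, rather than on which pull a row came from, keeps the boundary stable even if later pulls overlap with earlier ones.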

**Make a bunch of these, then concat them all together and use pandas' [drop duplicates method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) to add more data while not repeating data needlessly.**
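
To make the duplicate problem concrete, here is a minimal sketch with invented rows (the titles, ids, and scores are made up for illustration). It shows why a whole-row `drop_duplicates` misses a post whose score changed between pulls, while restricting the comparison to `title` and `id` with `keep='last'` retains only the most recent version.

```python
import pandas as pd

# Two hypothetical pulls taken at different times (all values invented).
first_pull = pd.DataFrame({
    "title": ["Post A", "Post B"],
    "id": ["aaa111", "bbb222"],
    "score": [10, 5],
    "comms_num": [2, 0],
})
second_pull = pd.DataFrame({
    "title": ["Post A", "Post C"],
    "id": ["aaa111", "ccc333"],
    "score": [42, 1],
    "comms_num": [9, 0],
})

# Older data first, newer data last, so keep="last" prefers the newer pull.
stacked = pd.concat([first_pull, second_pull], ignore_index=True)

# Whole-row comparison: "Post A" survives twice because score and comms_num differ.
naive = stacked.drop_duplicates(keep="last")

# Compare only the identifying columns and keep the most recent row.
deduped = stacked.drop_duplicates(subset=["title", "id"], keep="last")

print(len(naive), len(deduped))
```

Because `keep='last'` depends on row order, the order in which the frames are passed to `pd.concat` decides which version of a post is treated as the most recent.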

In [30]:
import praw
import numpy as np    #needed by the np.mean / np.median aggregation cells further down
import pandas as pd
import requests
import time
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

#### Provide credentials so you can use the wrapper

In [3]:
#The info below is provided to the wrapper to pull from the API.

reddit = praw.Reddit(client_id = 'REDACTED',     #this is that personal use script key
                     client_secret = 'REDACTED', #this is that secret key
                     user_agent = 'reddit_api',    #Whatever name you gave your application
                     username = 'REDACTED',      #your Reddit user name
                     password = 'REDACTED')      #your Reddit password

#### Create variable names for the subreddits you want to pull from

In [4]:
subreddit = reddit.subreddit('Conservative')    #set to variable name, pick the name of a subreddit

In [5]:
#The liberal subreddit turns out to be far less active, so I am utilizing a new subreddit for analysis.
#altsubreddit = reddit.subreddit('Liberal')
altsubreddit = reddit.subreddit('Progressive')

In [85]:
sortsalot = combined.sort_values(by = 'comms_num', ascending = False)
sortsalot.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,subreddit,gilded_user
4144,Presidential Debate Thread - Day 1,481,j2bxs0,https://www.reddit.com/r/Conservative/comments...,21909,1601455000.0,The first presidential debate between Presiden...,Conservative,0
2257,"Justice Ruth Bader Ginsburg, Champion Of Gende...",18126,ivh84e,https://www.npr.org/2020/09/18/100306972/justi...,11061,1600501000.0,,Conservative,0
2265,Donald Trump and wife Melania test positive fo...,14185,j3oh32,https://news.sky.com/story/donald-trump-and-wi...,6748,1601644000.0,,Conservative,0
2254,Trump calls for delay to 2020 US election,21393,i0ltok,https://www.bbc.com/news/world-us-canada-53597...,6121,1596144000.0,,Conservative,0
2274,The_donald - as well as 2000 other subs - have...,10582,hi426n,https://www.reddit.com/r/Conservative/comments...,4380,1593480000.0,We're seeing a few submissions about this. As ...,Conservative,0


#### Specify what type of posts from those subreddits you want to pull

Each subreddit has five different ways of organizing the topics created by redditors: `.hot`, `.new`, `.controversial`, `.top`, and `.gilded`. You can also use `.search("SEARCH_KEYWORDS")` to get only results matching an engine search.

The following lines of code were used to gather new data from Reddit:

In [8]:
#You can only pull 1000 at a time
#subreddit_new = subreddit.new(limit = 1000)
subreddit_new = subreddit.top(limit = 1000)
#subreddit_new = subreddit.controversial(limit = 1000)


In [9]:
subreddit_new_2 = altsubreddit.new(limit = 1000)
#subreddit_new_2 = altsubreddit.top(limit = 1000)
#subreddit_new_2 = altsubreddit.controversial(limit = 1000)
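
Instead of commenting lines in and out for each filter, the three listings could be pulled in one pass. The sketch below is an assumption about how that might look, not the workflow actually used here: `pull_listings` is a hypothetical helper, and `FakeSubreddit` is a stand-in so the example runs without Reddit credentials.

```python
def pull_listings(subreddit, sorts=("new", "top", "controversial"), limit=1000):
    """Return {sort_name: list_of_submissions} for each requested listing."""
    # getattr looks up subreddit.new, subreddit.top, ... by name and calls it.
    return {sort: list(getattr(subreddit, sort)(limit=limit)) for sort in sorts}


class FakeSubreddit:
    """Offline stand-in exposing the same three listing methods used above."""

    def new(self, limit=100):
        return iter(["n1", "n2"][:limit])

    def top(self, limit=100):
        return iter(["t1"][:limit])

    def controversial(self, limit=100):
        return iter([][:limit])


listings = pull_listings(FakeSubreddit(), limit=1000)
print({name: len(items) for name, items in listings.items()})
```

With a real PRAW `Subreddit` object the call would be `pull_listings(subreddit)`, keeping in mind that each listing is still capped at roughly 1000 items.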

In [10]:
cons_gilds = []

In [11]:
for item in subreddit.gilded():
    cons_gilds.append(item.id)

In [12]:
len(cons_gilds)

100

In [13]:
prog_gilds = []

In [14]:
for item in altsubreddit.gilded():
    prog_gilds.append(item.id)

In [15]:
len(prog_gilds)

48

#### Create a topics dictionary:

In [16]:
topics_dict = { "title":[],
                "score":[],
                "id":[],
                "url":[], 
                "comms_num": [],
                "created": [],
                "body":[]}

#Use a for loop to take the posts gathered by the wrapper and place them into a dictionary.

for item in subreddit_new:
    topics_dict["title"].append(item.title)
    topics_dict["score"].append(item.score)
    topics_dict["id"].append(item.id)
    topics_dict["url"].append(item.url)
    topics_dict["comms_num"].append(item.num_comments)
    topics_dict["created"].append(item.created)
    topics_dict["body"].append(item.selftext)

In [17]:
topics_dict2 = { "title":[],
                "score":[],
                "id":[],
                "url":[], 
                "comms_num": [],
                "created": [],
                "body":[]}

#Use a for loop to take the posts gathered by the wrapper and place them into a dictionary.

for item in subreddit_new_2:
    topics_dict2["title"].append(item.title)
    topics_dict2["score"].append(item.score)
    topics_dict2["id"].append(item.id)
    topics_dict2["url"].append(item.url)
    topics_dict2["comms_num"].append(item.num_comments)
    topics_dict2["created"].append(item.created)
    topics_dict2["body"].append(item.selftext)
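
The two loops above are identical except for their input, so they could be folded into one function. The following is a sketch under assumptions: `submissions_to_frame` and `FIELDS` are hypothetical names, and the `SimpleNamespace` object only imitates the submission attributes read by the loops.

```python
from types import SimpleNamespace

import pandas as pd

# Hypothetical mapping: dataframe column -> submission attribute read in the loops above.
FIELDS = {
    "title": "title",
    "score": "score",
    "id": "id",
    "url": "url",
    "comms_num": "num_comments",
    "created": "created",
    "body": "selftext",
}


def submissions_to_frame(submissions, subreddit_name):
    """Build one row per submission and label every row with its subreddit."""
    rows = [
        {column: getattr(item, attr) for column, attr in FIELDS.items()}
        for item in submissions
    ]
    frame = pd.DataFrame(rows, columns=list(FIELDS))
    frame["subreddit"] = subreddit_name
    return frame


# Stand-in object so the sketch runs offline (values invented).
fake_listing = [
    SimpleNamespace(
        title="Example post", score=3, id="abc123", url="https://example.com",
        num_comments=1, created=1603081000.0, selftext="",
    ),
]
demo = submissions_to_frame(fake_listing, "Conservative")
print(demo.columns.tolist())
```

A call such as `submissions_to_frame(subreddit_new, "Conservative")` would then replace both the dictionary loop and the later step that adds the `subreddit` column.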

#### Convert the dictionaries into DataFrames. 

In [20]:
df = pd.DataFrame(topics_dict)

In [21]:
df2 = pd.DataFrame(topics_dict2)

In [22]:
df2.head()

Unnamed: 0,title,score,id,url,comms_num,created,body
0,Opinion: Patriotism and Voting in the Workplac...,1,jdmcs7,https://www.sasentinel.com/opinion-patriotism-...,0,1603078000.0,
1,Ex-prisoner-turned-rapper fights for justice f...,34,jdjpxm,https://www.aljazeera.com/features/2020/10/16/...,1,1603069000.0,
2,Are their any studies and/or statistics that g...,22,jdc9h6,https://www.reddit.com/r/progressive/comments/...,7,1603036000.0,I’m curious how many people believe it. It’s h...
3,"On election coverage, CBS is falling for the s...",135,jd55t9,https://www.mediamatters.org/cbs/election-cove...,9,1603005000.0,
4,INSIGHT: U.K. Justice System Reform - Digital ...,2,jd4xky,https://news.bloomberglaw.com/us-law-week/insi...,0,1603004000.0,


#### Add a column to each dataframe to keep track of which subreddit the data is from

In [23]:
df['subreddit'] = "Conservative"

In [24]:
df2['subreddit'] = 'Progressive'

#### Add that extra column

#### Concatenate the 2 dataframes together

In [25]:
df_final = pd.concat([df, df2])
#df_final = df2

In [26]:
df_final.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,subreddit
0,Thousands of Latinos Gather in Miami For Anti-...,3,jdnapu,https://www.thegatewaypundit.com/2020/10/thous...,1,1603081000.0,,Conservative
1,Camera Catches the Incredible Moment Dem Rep W...,3,jdn9ta,https://www.conservativenewsdaily.com/camera-c...,0,1603080000.0,,Conservative
2,New Data Analysis Finds 353 Counties With 1.8 ...,3,jdn6mm,https://www.theepochtimes.com/new-data-analysi...,0,1603080000.0,,Conservative
3,Meanwhile in New Zealand where their leadershi...,0,jdn5qc,https://www.reddit.com/r/sports/comments/jdgrk...,3,1603080000.0,,Conservative
4,"Women's March sign: ""Trump is an unstable penis!""",5,jdn59s,https://hotair.com/archives/karen-townsend/202...,3,1603080000.0,,Conservative


In [27]:
df_final.subreddit.value_counts()

Conservative    980
Progressive     978
Name: subreddit, dtype: int64

The following lines of code were used sequentially to merge newly gathered data with data previously acquired.

In [29]:
#Read in old csv, concat old df with new, then delete duplicates
#df_old = pd.read_csv('redditproject.csv')
df_old = pd.read_csv('Data/redditprojectnew_fixed.csv')
#df_old = pd.read_csv('Data/redditprojecttop_fixed.csv')
#df_old = pd.read_csv('redditproject_cont.csv')

In [30]:
df_final.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,subreddit
0,Thousands of Latinos Gather in Miami For Anti-...,3,jdnapu,https://www.thegatewaypundit.com/2020/10/thous...,1,1603081000.0,,Conservative
1,Camera Catches the Incredible Moment Dem Rep W...,3,jdn9ta,https://www.conservativenewsdaily.com/camera-c...,0,1603080000.0,,Conservative
2,New Data Analysis Finds 353 Counties With 1.8 ...,3,jdn6mm,https://www.theepochtimes.com/new-data-analysi...,0,1603080000.0,,Conservative
3,Meanwhile in New Zealand where their leadershi...,0,jdn5qc,https://www.reddit.com/r/sports/comments/jdgrk...,3,1603080000.0,,Conservative
4,"Women's March sign: ""Trump is an unstable penis!""",5,jdn59s,https://hotair.com/archives/karen-townsend/202...,3,1603080000.0,,Conservative


In [31]:
df_old.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,subreddit,gilded_user
0,"BLM, Antifa violently crash San Fran free spee...",7,jdf7ef,https://www.bizpacreview.com/2020/10/18/blm-an...,1,1603052000.0,,Conservative,0
1,Biden will 'make clear' his position on court ...,4,jdf5ue,https://www.foxnews.com/politics/biden-court-p...,2,1603051000.0,,Conservative,0
2,Pennsylvania Union Calls Out Joe Biden Lie: We...,18,jdf4rn,https://www.breitbart.com/politics/2020/10/17/...,1,1603051000.0,,Conservative,0
3,Giuliani: Computer shop owner who found allege...,8,jdf412,https://www.foxnews.com/politics/rudy-giuliani...,0,1603051000.0,,Conservative,0
4,"Yes, The Hunter Biden Emails are Authentic",11,jdf09v,https://sonar21.com/yes-the-hunter-biden-email...,5,1603051000.0,,Conservative,0


In [32]:
df_final.shape, df_old.shape

((1958, 8), (8709, 9))

In [33]:
#Flag posts whose id appears in either gilded list (1 = gilded, 0 = not)
df_final['gilded_user'] = df_final.id.isin(cons_gilds + prog_gilds).astype(int)

In [34]:
submission_df = pd.concat([df_final, df_old])
#submission_df = df_final

In [35]:
submission_df.shape

(10667, 9)

In [36]:
submission_df.drop_duplicates(keep = 'last', inplace = True)

In [37]:
submission_df.shape

(9612, 9)

In [38]:
submission_df.subreddit.value_counts()

Conservative    5343
Progressive     4269
Name: subreddit, dtype: int64

In [39]:
df_old.shape

(8709, 9)

In [40]:
submission_df.isnull().sum()

title             0
score             0
id                0
url               0
comms_num         0
created           0
body           7397
subreddit         0
gilded_user       0
dtype: int64

In [None]:
count = 0
for x in submission_df.id:
    if x in cons_gilds or x in prog_gilds:
        count += 1
count

In [41]:
submission_df.shape

(9612, 9)

In [44]:
submission_df.subreddit.value_counts()

Conservative    5343
Progressive     4269
Name: subreddit, dtype: int64

In [45]:
submission_df.to_csv('Data/redditprojectnew_fixed.csv', index = False) 
#submission_df.to_csv('Data/redditprojecttop_fixed.csv', index = False) 
#submission_df.to_csv('Data/redditproject_cont.csv', index = False)

In [46]:
new = pd.read_csv('Data/redditprojectnew_fixed.csv')
top = pd.read_csv('Data/redditprojecttop_fixed.csv')
cont = pd.read_csv('Data/redditproject_cont.csv')

In [47]:
print(new.shape)
print(top.shape)
print(cont.shape)

(9612, 9)
(7520, 9)
(3932, 9)


In [60]:
combined = pd.concat([new, top, cont])

In [61]:
combined.subreddit.value_counts()

Conservative    11095
Progressive      9969
Name: subreddit, dtype: int64

In [62]:
combined.shape

(21064, 9)

In [63]:
combined.drop_duplicates(subset = ['title', 'id'], keep = 'last', inplace = True)

In [64]:
combined.shape

(5977, 9)

In [65]:
combined.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,subreddit,gilded_user
0,Thousands of Latinos Gather in Miami For Anti-...,3,jdnapu,https://www.thegatewaypundit.com/2020/10/thous...,1,1603081000.0,,Conservative,0
1,Camera Catches the Incredible Moment Dem Rep W...,3,jdn9ta,https://www.conservativenewsdaily.com/camera-c...,0,1603080000.0,,Conservative,0
2,New Data Analysis Finds 353 Counties With 1.8 ...,3,jdn6mm,https://www.theepochtimes.com/new-data-analysi...,0,1603080000.0,,Conservative,0
3,Meanwhile in New Zealand where their leadershi...,0,jdn5qc,https://www.reddit.com/r/sports/comments/jdgrk...,3,1603080000.0,,Conservative,0
4,"Women's March sign: ""Trump is an unstable penis!""",5,jdn59s,https://hotair.com/archives/karen-townsend/202...,3,1603080000.0,,Conservative,0


In [66]:
combined.gilded_user.value_counts()

0    5952
1      25
Name: gilded_user, dtype: int64

In [67]:
combined.reset_index(inplace = True)

In [68]:
combined.drop(columns = 'index', inplace = True)

In [69]:
combined.subreddit.value_counts(normalize = True)
#This dataset is very close to balanced. However, over time, the conservative subreddit will likely begin to
#grow at a faster rate. If this gets out of hand, run only samples from the progressive subreddit.

Conservative    0.511628
Progressive     0.488372
Name: subreddit, dtype: float64

Since I am looking at *ideology* and not a political party, I looked for a different left-leaning subreddit for a fairer analysis of two political ideologies. The Progressive subreddit appears to be far more active than the Liberal subreddit, so I will take data from there from now on.

In [70]:
combined.to_csv('Data/redditproject.csv', index = False)

In [159]:
test = pd.read_csv('Data/redditproject.csv')
test.head(3)

Unnamed: 0,title,score,id,url,comms_num,created,body,subreddit,gilded_user
0,Thousands of Latinos Gather in Miami For Anti-...,3,jdnapu,https://www.thegatewaypundit.com/2020/10/thous...,1,1603081000.0,,Conservative,0
1,Camera Catches the Incredible Moment Dem Rep W...,3,jdn9ta,https://www.conservativenewsdaily.com/camera-c...,0,1603080000.0,,Conservative,0
2,New Data Analysis Finds 353 Counties With 1.8 ...,3,jdn6mm,https://www.theepochtimes.com/new-data-analysi...,0,1603080000.0,,Conservative,0


In [72]:
test.score.describe()

count     5977.000000
mean      1018.108583
std       2097.166387
min          0.000000
25%          5.000000
50%        100.000000
75%        479.000000
max      47729.000000
Name: score, dtype: float64

In [78]:
test.comms_num.describe()

count     5977.000000
mean       145.481847
std        446.988720
min          0.000000
25%          3.000000
50%         15.000000
75%        102.000000
max      21909.000000
Name: comms_num, dtype: float64

In [73]:
len(test[test.score == 0]) #Hmmmm

781

In [74]:
test.shape

(5977, 9)

In [75]:
test.isnull().sum()

title             0
score             0
id                0
url               0
comms_num         0
created           0
body           5708
subreddit         0
gilded_user       0
dtype: int64

"EDA" discovered during data acquisition:
The conservative subreddit posts more often than the progressive one, as collecting new data tends to pull in more new conservative posts than progressive posts.

On the other hand, the progressive subreddit tends to downvote more, as the "controversial" data grows faster for progressive than for conservative.

In [108]:
test.shape

(5977, 9)

In [100]:
commentalot = test.sort_values(by = 'comms_num', ascending = False)

In [114]:
half_comments = commentalot[:len(commentalot) // 2]

In [118]:
half_comments.shape

(2988, 9)

In [117]:
half_comments.subreddit.value_counts()

Conservative    2056
Progressive      932
Name: subreddit, dtype: int64

In [96]:
# submission = reddit.submission(id = 'j2bxs0')

# for comment in submission.comments.list():
#     print(comment.body)

Below from [PRAW documentation on comment extraction](https://praw.readthedocs.io/en/stable/tutorials/comments.html):

In [126]:
#Commented out for now, this cell takes HOURS to run. Still, cool when it's acquired!
# comments_for_df = []

# for ids in half_comments.id:
#     submission = reddit.submission(id = ids)
#     entry = {}
    
#     entry['id'] = ids
#     comms = []
#     submission.comments.replace_more(limit=1)
#     for comment in submission.comments.list():
#         comms.append(comment.body)
#     entry['comments'] = comms
#     comments_for_df.append(entry)
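
The commented-out loop can be restated as two small functions, which makes it easier to try on a handful of ids before committing to a run that takes hours. This is a sketch, not the code that produced the saved data: `collect_comment_bodies`, `comments_for_ids`, and `FakeForest` are hypothetical names, and the fake objects only imitate the `replace_more` and `list` calls used above.

```python
from types import SimpleNamespace


def collect_comment_bodies(submission, more_limit=1):
    """Expand up to `more_limit` 'load more comments' stubs, then return all comment bodies."""
    submission.comments.replace_more(limit=more_limit)
    return [comment.body for comment in submission.comments.list()]


def comments_for_ids(fetch_submission, post_ids, more_limit=1):
    """Return one {'id', 'comments'} entry per post id."""
    entries = []
    for post_id in post_ids:
        submission = fetch_submission(post_id)
        entries.append({"id": post_id,
                        "comments": collect_comment_bodies(submission, more_limit)})
    return entries


class FakeForest:
    """Offline stand-in for a submission's comment forest."""

    def __init__(self, bodies):
        self._bodies = bodies
        self.limits_seen = []

    def replace_more(self, limit=32):
        self.limits_seen.append(limit)

    def list(self):
        return [SimpleNamespace(body=text) for text in self._bodies]


fake_posts = {
    "abc123": SimpleNamespace(comments=FakeForest(["first", "second"])),
    "def456": SimpleNamespace(comments=FakeForest([])),
}
demo_rows = comments_for_ids(fake_posts.__getitem__, ["abc123", "def456"])
print(demo_rows)
```

With live credentials the equivalent call would be `comments_for_ids(lambda post_id: reddit.submission(id = post_id), half_comments.id)`, and the result can be passed to `pd.DataFrame`.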

In [129]:
comments.comments

Unnamed: 0,id,comments
0,j2bxs0,[Tired of reporting this thread? Debate us on ...
1,ivh84e,[We are hosting a polite discussion of RBG and...
2,j3oh32,[Tired of reporting this thread? Debate us on ...
3,i0ltok,[Actual Title - **Donald Trump suggests delay ...
4,hi426n,"[what the fuck is cumtown lmao, Glad to see Re..."
...,...,...
2983,gdq38r,[We need some kind of law that ensures America...
2984,42d8yy,[> Clinton has demonstrated that she is a thou...
2985,fuhnms,[And I'm OK with this except for the fact that...
2986,jd8y59,[Let Twitter censors. It only backfires as fas...


In [137]:
coments.head()

Unnamed: 0,id,comments
0,j2bxs0,[Tired of reporting this thread? Debate us on ...
1,ivh84e,[We are hosting a polite discussion of RBG and...
2,j3oh32,[Tired of reporting this thread? Debate us on ...
3,i0ltok,[Actual Title - **Donald Trump suggests delay ...
4,hi426n,"[what the fuck is cumtown lmao, Glad to see Re..."


In [155]:
half_comments.head(2)

Unnamed: 0,title,score,id,url,comms_num,created,body,subreddit,gilded_user
4144,Presidential Debate Thread - Day 1,481,j2bxs0,https://www.reddit.com/r/Conservative/comments...,21909,1601455000.0,The first presidential debate between Presiden...,Conservative,0
2257,"Justice Ruth Bader Ginsburg, Champion Of Gende...",18126,ivh84e,https://www.npr.org/2020/09/18/100306972/justi...,11061,1600501000.0,,Conservative,0


In [143]:
left = test_df
right = half_comments[['id', 'subreddit']]

comments_subreddit = pd.merge(left, right, on = 'id')

In [157]:
half_comments.subreddit.value_counts()

Conservative    2056
Progressive      932
Name: subreddit, dtype: int64

In [158]:
comments_subreddit.subreddit.value_counts()

Conservative    2056
Progressive      932
Name: subreddit, dtype: int64

In [152]:
comments_subreddit.to_csv('Data/comments.csv', index = False)

In [153]:
last_test = pd.read_csv('Data/comments.csv')

In [154]:
last_test.head()

Unnamed: 0,id,comments,subreddit
0,j2bxs0,"[""Tired of reporting this thread? Debate us on...",Conservative
1,ivh84e,['We are hosting a polite discussion of RBG an...,Conservative
2,j3oh32,"[""Tired of reporting this thread? Debate us on...",Conservative
3,i0ltok,['Actual Title - **Donald Trump suggests delay...,Conservative
4,hi426n,"['what the fuck is cumtown lmao', 'Glad to see...",Conservative


## Adding features to comments
---
Analysis of these scores will occur in the data analysis notebook. Here, they are merely being scored and added to the dataframe.

In [3]:
sentiment = SentimentIntensityAnalyzer()

def get_sentiment(comment, ret_val = 'compounds'):
    #Map ret_val to the matching key in VADER's polarity_scores output.
    #Anything other than 'compounds' or 'negs' returns the positive scores, as before.
    keys = {'compounds': 'compound', 'negs': 'neg'}
    key = keys.get(ret_val, 'pos')
    #The comments column holds a stringified list, so split on backslashes to get rough chunks.
    #Only one score type is returned at a time so the function can be applied to the df.
    return [sentiment.polarity_scores(comms)[key] for comms in comment.split('\\')]

In [None]:
test = comments.comments[0]
test = test.split('\\')
test[:2]
how_to_list = []
#thing = 
how_to_list.append(sentiment.polarity_scores(test[2])['compound'])
thing = (sentiment.polarity_scores(test[5]))['compound']
#thing
how_to_list.append(thing)
how_to_list

In [None]:
# #Only run this once, given time constraints, and then export the file so that the values remain in future.
# comments['pos_scores'] = comments.comments.apply(lambda x: get_sentiment(x, 'pos'))
# comments['neg_scores'] = comments.comments.apply(lambda x: get_sentiment(x, 'negs'))
# comments['comp_scores'] = comments.comments.apply(lambda x: get_sentiment(x, 'compounds'))

In [None]:
# comments['pos_score_avg'] = comments.pos_scores.apply(lambda x: np.mean(x))
# comments['pos_score_median'] = comments.pos_scores.apply(lambda x: np.median(x))

# comments['neg_score_avg'] = comments.neg_scores.apply(lambda x: np.mean(x))
# comments['neg_score_median'] = comments.neg_scores.apply(lambda x: np.median(x))

# comments['comp_score_avg'] = comments.comp_scores.apply(lambda x: np.mean(x))
# comments['comp_score_median'] = comments.comp_scores.apply(lambda x: np.median(x))

In [None]:
# comments.to_csv('Data/comments_clean.csv', index = False)

# test = pd.read_csv('Data/comments_clean.csv')

Leave it to Heather to show me a MUCH faster and better way to do this! Thanks for the local, I was stuck here until then.

In [26]:
comments = pd.read_csv('Data/comments_clean.csv')
comments.drop(columns = 'Unnamed: 0', inplace = True)

In [27]:
comments.head(1)

Unnamed: 0,id,comments,subreddit,neg,neu,pos,compound
0,j2bxs0,"[""Tired of reporting this thread? Debate us on...",Conservative,0.127,0.74,0.133,0.9907


In [15]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [28]:
comments = comments[['id', 'comments', 'subreddit']]

In [29]:
comments.head()

Unnamed: 0,id,comments,subreddit
0,j2bxs0,"[""Tired of reporting this thread? Debate us on...",Conservative
1,ivh84e,['We are hosting a polite discussion of RBG an...,Conservative
2,j3oh32,"[""Tired of reporting this thread? Debate us on...",Conservative
3,i0ltok,['Actual Title - **Donald Trump suggests delay...,Conservative
4,hi426n,"['what the fuck is cumtown lmao', 'Glad to see...",Conservative


In [35]:
tokened = word_tokenize(comments.comments[0])
#Build the stop word set once so each membership check is fast
stop_words = set(stopwords.words('english'))
tokens_without_sw = [word for word in tokened if word not in stop_words]

In [42]:
sia = SentimentIntensityAnalyzer()
#Build the stop word set once; calling stopwords.words('english') for every token is very slow
stop_words = set(stopwords.words('english'))
dicts = []

for comms in comments.comments:
    tokenized = word_tokenize(comms)
    cleaned = [word for word in tokenized if word not in stop_words]
    scores = sia.polarity_scores(' '.join(cleaned))
    scores['comments'] = comms
    dicts.append(scores)

df = pd.DataFrame(dicts)
df.head()

Unnamed: 0,neg,neu,pos,compound,comments
0,0.162,0.652,0.186,0.9992,"[""Tired of reporting this thread? Debate us on..."
1,0.166,0.556,0.277,1.0,['We are hosting a polite discussion of RBG an...
2,0.161,0.607,0.232,1.0,"[""Tired of reporting this thread? Debate us on..."
3,0.173,0.686,0.141,-0.9998,['Actual Title - **Donald Trump suggests delay...
4,0.175,0.663,0.162,-0.9988,"['what the fuck is cumtown lmao', 'Glad to see..."


In [43]:
df['id'] = comments.id

In [44]:
comments = comments[['id', 'comments', 'subreddit']]
comments.head(2)

Unnamed: 0,id,comments,subreddit
0,j2bxs0,"[""Tired of reporting this thread? Debate us on...",Conservative
1,ivh84e,['We are hosting a polite discussion of RBG an...,Conservative


In [45]:
df.head()

Unnamed: 0,neg,neu,pos,compound,comments,id
0,0.162,0.652,0.186,0.9992,"[""Tired of reporting this thread? Debate us on...",j2bxs0
1,0.166,0.556,0.277,1.0,['We are hosting a polite discussion of RBG an...,ivh84e
2,0.161,0.607,0.232,1.0,"[""Tired of reporting this thread? Debate us on...",j3oh32
3,0.173,0.686,0.141,-0.9998,['Actual Title - **Donald Trump suggests delay...,i0ltok
4,0.175,0.663,0.162,-0.9988,"['what the fuck is cumtown lmao', 'Glad to see...",hi426n


In [46]:
final_df = pd.merge(comments, df.drop(columns = 'comments'), on = 'id')

In [47]:
final_df.head()

Unnamed: 0,id,comments,subreddit,neg,neu,pos,compound
0,j2bxs0,"[""Tired of reporting this thread? Debate us on...",Conservative,0.162,0.652,0.186,0.9992
1,ivh84e,['We are hosting a polite discussion of RBG an...,Conservative,0.166,0.556,0.277,1.0
2,j3oh32,"[""Tired of reporting this thread? Debate us on...",Conservative,0.161,0.607,0.232,1.0
3,i0ltok,['Actual Title - **Donald Trump suggests delay...,Conservative,0.173,0.686,0.141,-0.9998
4,hi426n,"['what the fuck is cumtown lmao', 'Glad to see...",Conservative,0.175,0.663,0.162,-0.9988
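
Because the scores are joined back on `id`, a repeated id on either side would silently multiply rows. One way to guard against that, shown here as a sketch with invented frames (`comments_demo`, `scores_demo`, and `dup_scores` are hypothetical), is the `validate` argument of `pd.merge`, which raises an error if the join is not one-to-one.

```python
import pandas as pd

# Invented frames standing in for the comments and score tables.
comments_demo = pd.DataFrame({
    "id": ["a1", "b2"],
    "subreddit": ["Conservative", "Progressive"],
})
scores_demo = pd.DataFrame({"id": ["a1", "b2"], "compound": [0.5, -0.2]})

# Unique ids on both sides: the merge passes validation and keeps two rows.
merged = pd.merge(comments_demo, scores_demo, on="id", validate="one_to_one")

# Repeat one id on the right side to show the check catching the problem.
dup_scores = pd.concat([scores_demo, scores_demo.iloc[[0]]], ignore_index=True)
try:
    pd.merge(comments_demo, dup_scores, on="id", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged.shape, duplicate_detected)
```

The check costs little and turns a silent change in row count into an explicit error.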


In [48]:
final_df.to_csv('Data/comments_clean.csv')