To run this notebook, install the following libraries:

1) **Vader** - https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f
   
   *pip install vaderSentiment*
   
   This is used for the Sentiment Analysis
   
2) **Boto3** - https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GettingStarted.Python.03.html#GettingStarted.Python.03.01

   *pip install boto3*
   
   This is used to read and write data in the DynamoDB tables. Note that you have to set up AWSCLI (below) first, because Boto3 uses it for
   the connection/user details.
   
3) **AWSCLI** - https://sysadmins.co.za/interfacing-amazon-dynamodb-with-python-using-boto3/

   *pip install awscli*
   
   This is used to set up the connection/configuration parameters needed to access the DynamoDB tables. Sanjeev provided
   the details in his email, and the linked instructions show what to do under "Let's get started".
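
Before running the rest of the notebook, it can help to confirm that the libraries are importable. The following is a minimal sketch using only the standard library. The helper name `missing_modules` is hypothetical, and the module list is simply taken from the imports used further down.

```python
import importlib.util

# Top-level module names imported by this notebook
REQUIRED_MODULES = ["vaderSentiment", "boto3", "praw", "pandas", "numpy", "nltk", "gensim"]

def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# An empty list means every library above is installed
print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

Anything reported as missing can be installed with the matching `pip install` command from the list above.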

In [88]:
import pandas as pd
import datetime as dt
import praw
import boto3
import json
import decimal
import re
import string
import random

from decimal import Decimal
from collections import Counter
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
from nltk.tokenize import TweetTokenizer
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from nltk import FreqDist
from nltk import classify
from nltk import NaiveBayesClassifier
import numpy as np
import nltk

np.random.seed(2018)

To use the PRAW API, follow the directions at https://www.storybench.org/how-to-scrape-reddit-with-python/
Note that this involves creating a Reddit account and a Reddit app ID; the instructions guide you through both.

In [4]:
PERSONAL_USE_SCRIPT_14_CHARS = ''
SECRET_KEY_27_CHARS = ''
YOUR_APP_NAME = ''
YOUR_REDDIT_USER_NAME = ''
YOUR_REDDIT_LOGIN_PASSWORD = ''
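
Rather than pasting credentials into the cell above, they could be read from environment variables so that they never end up in a saved notebook. This is only a sketch: the function name and the `REDDIT_*` variable names are assumptions made for illustration, not something PRAW requires.

```python
import os

def load_reddit_credentials(env=os.environ):
    """Read the five PRAW settings from environment variables (names are hypothetical)."""
    keys = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_APP_NAME",
        "username": "REDDIT_USERNAME",
        "password": "REDDIT_PASSWORD",
    }
    # Fail early with a clear message if anything is unset or empty
    missing = [var for var in keys.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[var] for name, var in keys.items()}
```

The returned dictionary uses the same keyword names that `NoSleepRecommender` passes to `praw.Reddit`, so it could be unpacked as `praw.Reddit(**load_reddit_credentials())`.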

In [33]:
# Helper class to convert a DynamoDB item to JSON.
class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            if abs(o) % 1 > 0:
                return float(o)
            else:
                return int(o)
        return super(DecimalEncoder, self).default(o)

# This class is used to handle all of the interactions (uploading/downloading tables etc) from
# the DynamoDB tables setup for the project
class DynamoDBEngine:
    def __init__(self):
        #This creates the dynamoDB object that points to the location of the database
        #Note this requires the AWSCLI connection details to be setup
        self.dynamodb = boto3.resource('dynamodb', region_name='us-east-1', endpoint_url="https://dynamodb.us-east-1.amazonaws.com")
    
    #This takes the comments pandas DataFrame and uploads the comments to the
    #UserComments table in DynamoDB one row at a time
    def uploadComments(self, comment_df):        
        table = self.dynamodb.Table('UserComments')
        
        for row in comment_df.itertuples():
            response = table.put_item(
                Item={
                    'comment_id': row.comment_id,
                    'story_id': row.story_id,
                    'comment_author': row.comment_author,
                    'comment_body': row.comment_body,
                    'negative_sa_score': Decimal(str(row.negative_sa_score)),
                    'neutral_sa_score': Decimal(str(row.neutral_sa_score)),
                    'positive_sa_score': Decimal(str(row.positive_sa_score)),
                    'compound_sa_score': Decimal(str(row.compound_sa_score))
                })
            
            if response['ResponseMetadata']['HTTPStatusCode'] != 200:
                return False
        
        return True
    
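    #This downloads every item in the CommentsNoSleep table and returns
    #the result as a pandas DataFrame
    def downloadComments(self):        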
        comments_dict = {"link_id": [],
                         "sortKey": [],
                         "score": [],
                         "permalink": [],
                         "author_fullname": [],
                         "id": [],
                         "storyId": [],
                         "author": [],
                         "parent_id": [],
                         "body": []
                        }
            
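        table = self.dynamodb.Table('CommentsNoSleep')
        
        #A single scan returns at most 1 MB of data, so keep scanning
        #for as long as the response contains a LastEvaluatedKey
        response = table.scan()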
        
        for i in response['Items']:
            for key in i.keys():                
                comments_dict[key].append(i[key])
            
        while 'LastEvaluatedKey' in response:
            response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
            
            for i in response['Items']:
                for key in i.keys():                
                    comments_dict[key].append(i[key])
                
        df = pd.DataFrame(comments_dict)
        return df        
    
    #This takes the stories pandas DataFrame and uploads the stories to the
    #UserStories table in DynamoDB one row at a time
    def uploadStories(self, story_df):
        table = self.dynamodb.Table('UserStories')
        
        for row in story_df.itertuples():
            response = table.put_item(
                Item={
                    'story_id': row.story_id,
                    'title': row.title,
                    'author': row.author,
                    'body': row.body
                })
            
            if response['ResponseMetadata']['HTTPStatusCode'] != 200:
                return False
        
        return True       

# This class is used to handle all of the interactions (downloading stories/comments) from
# the NoSleep Subreddit
class NoSleepRecommender:
    def __init__(self):
        self.reddit = praw.Reddit(client_id=PERSONAL_USE_SCRIPT_14_CHARS,
                                  client_secret=SECRET_KEY_27_CHARS,
                                  password=YOUR_REDDIT_LOGIN_PASSWORD,
                                  user_agent=YOUR_APP_NAME,
                                  username=YOUR_REDDIT_USER_NAME)

        #print(self.reddit.user.me())

        self.subreddit = self.reddit.subreddit('nosleep')

        self.stories_dict = {"story_id": [],
                        "title": [],
                        "author": [],
                        "body": []}
        self.comments_dict = {"comment_id": [],
                         "story_id": [],
                         "comment_author": [],
                         "comment_body": [],
                         "negative_sa_score": [],
                         "neutral_sa_score": [],
                         "positive_sa_score": [],
                         "compound_sa_score": []}
    
    def loadComments(self):
        self.comment_df = pd.read_csv('user_comments.csv')
        self.comment_df['sentiment'] = self.comment_df['sentiment'].fillna(-1)
        
    def loadStories(self):
        self.story_df = pd.read_csv('user_stories.csv')
        self.story_df = self.story_df.fillna(' ')
        
    def loadAll(self):
        self.loadComments()
        self.loadStories()
        
    def returnComments(self):
        return self.comment_df
    
    def returnStories(self):
        return self.story_df    
    
    def saveComments(self):
        self.comment_df.to_csv('user_comments.csv', index=False, header=True)
        
    def saveStories(self):
        self.story_df.to_csv('user_stories.csv', index=False, header=True)
        
    def saveAll(self):
        self.saveComments()
        self.saveStories()
        
    def getStories(self):
        analyser = SentimentIntensityAnalyzer()

        my_subreddit = self.subreddit.hot(limit=10)
        for submission in my_subreddit:
            self.stories_dict["title"].append(submission.title)
            self.stories_dict["body"].append(submission.selftext)
            self.stories_dict["author"].append(submission.author)
            self.stories_dict["story_id"].append(submission.id)
            
            submission.comments.replace_more(limit=None)
            all_comments = submission.comments.list()            
            
            for comment in all_comments:
                #This does the sentiment analysis and returns the
                #scores obtained for the comment. The compound score is a
                #single normalized value between -1 and +1. If it is >= 0.05,
                #the comment is treated as 'positive'. The individual scores
                #show what proportion of the comment is negative, neutral,
                #and positive
                score = analyser.polarity_scores(comment.body)
                
                self.comments_dict["comment_id"].append(comment.id)
                self.comments_dict["story_id"].append(submission.id)
                self.comments_dict["comment_body"].append(comment.body)
                self.comments_dict["comment_author"].append(comment.author)
                self.comments_dict["negative_sa_score"].append(score["neg"])
                self.comments_dict["neutral_sa_score"].append(score["neu"])
                self.comments_dict["positive_sa_score"].append(score["pos"])
                self.comments_dict["compound_sa_score"].append(score["compound"])

        self.story_df = pd.DataFrame(self.stories_dict)
        self.story_df = self.story_df.dropna()        
        
        self.comment_df = pd.DataFrame(self.comments_dict)
        self.comment_df = self.comment_df.dropna()
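
The paging logic in `downloadComments` can be looked at in isolation. Below is a minimal sketch of the same "scan until there is no `LastEvaluatedKey`" loop written as a generator. `scan_all_items` and `FakeTable` are hypothetical names: the fake table only imitates the shape of a scan response so the loop can run without AWS access, and it uses an integer where a real `LastEvaluatedKey` would be a dictionary of key attributes.

```python
def scan_all_items(table):
    """Yield every item from a DynamoDB-style table, following LastEvaluatedKey."""
    response = table.scan()
    yield from response["Items"]
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        yield from response["Items"]

class FakeTable:
    """Stand-in for a boto3 Table that returns a few items per page (illustration only)."""
    def __init__(self, items, page_size=2):
        self.items = items
        self.page_size = page_size

    def scan(self, ExclusiveStartKey=None):
        start = ExclusiveStartKey or 0
        end = start + self.page_size
        response = {"Items": self.items[start:end]}
        # Only report a continuation key while there are more items left
        if end < len(self.items):
            response["LastEvaluatedKey"] = end
        return response

items = [{"id": str(n)} for n in range(5)]
print(list(scan_all_items(FakeTable(items))) == items)
```

Writing the loop once as a generator avoids repeating the item-handling code for the first page and for the follow-up pages.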

# Run this section to read stories/comments from the NoSleep Reddit and save them to CSV

In [None]:
nsapp = NoSleepRecommender()
nsapp.getStories()
nsapp.saveAll()
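
The compound score stored for each comment can be turned into a label. The sketch below assumes the commonly used VADER convention: the `>= 0.05` cut-off for positive comes from the comment in `getStories`, while the `<= -0.05` cut-off for negative and the neutral band in between are assumptions based on that convention. `label_compound` is a hypothetical helper and is not part of the notebook.

```python
def label_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

for value in (0.6, 0.0, -0.4):
    print(value, label_compound(value))
```

Applied to the saved data, this could be used as `comment_df["compound_sa_score"].apply(label_compound)`.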

# Run this section to load the stories/comments from the saved CSV files and save them to the DynamoDB tables

In [None]:
nsapp = NoSleepRecommender()
nsapp.loadAll()

comment_df = nsapp.returnComments()
story_df = nsapp.returnStories()

dbeng = DynamoDBEngine()

dbeng.uploadStories(story_df)
dbeng.uploadComments(comment_df)
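
`uploadComments` wraps each score in `Decimal(str(...))` because, as far as I know, Boto3 does not accept Python floats for DynamoDB number attributes and expects `Decimal` values instead. Going through `str` first keeps the short decimal text of the score. The sketch below shows the difference; `to_dynamo_number` is a hypothetical helper name.

```python
from decimal import Decimal

def to_dynamo_number(value):
    """Convert a float to Decimal via its string form, as uploadComments does."""
    return Decimal(str(value))

score = 0.1
# Built from the text "0.1", so the value is exactly one tenth
print(to_dynamo_number(score))
# Built from the binary float, which carries a long expansion of digits
print(Decimal(score))
```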

In [None]:
story_df

In [114]:
#dbeng = DynamoDBEngine()
#comment_df = dbeng.downloadComments()

nsapp = NoSleepRecommender()
nsapp.loadComments()

comment_df = nsapp.returnComments()

In [36]:
comment_df
#comment_df.to_csv (r'user_comments.csv', index = False, header=True)

Unnamed: 0,link_id,sortKey,score,permalink,author_fullname,id,storyId,author,parent_id,body,sentiment,prediction
0,t3_bs22s7,1558760400,10,/r/nosleep/comments/bs22s7/i_work_on_a_boat_ou...,t2_wcuxx,eok7qb9,bs22s7,Wolf_of_WV,t3_bs22s7,You are a dead man walking. The people who ar...,0.0,-1
1,t3_dukiqw,1573621200,1,/r/nosleep/comments/dukiqw/a_childs_method_for...,t2_86jlh,f77papf,dukiqw,Sporkazm,t1_f77p7a1,&amp;#x200B;\n\nSomehow I came to return the e...,0.0,-1
2,t3_e4lcwk,1575349200,4,/r/nosleep/comments/e4lcwk/everyone_knows_the_...,t2_j5mx0,f9erm02,e4lcwk,LiKenun,t3_e4lcwk,"&gt;My neck was sticky, and stank, stank like...",1.0,-1
3,t3_au1bdu,1551157200,1,/r/nosleep/comments/au1bdu/conditions_of_entry...,t2_nguj2,eh6cdqn,au1bdu,Reddit__Herring,t3_au1bdu,That was really awesome. Very interesting conc...,1.0,-1
4,t3_d64uh2,1568955600,1,/r/nosleep/comments/d64uh2/the_188minute_man/f...,t2_2vpmh2gy,f0rojyz,d64uh2,jcammarato,t1_f0ridzm,"Yes, but she also has no choice and will event...",,-1
...,...,...,...,...,...,...,...,...,...,...,...,...
169583,t3_b8zif8,1554440400,7,/r/nosleep/comments/b8zif8/the_lynch_house/ek2...,t2_20igrc4i,ek27hib,b8zif8,BlondeRR1717,t3_b8zif8,Are you aware that you wrote natural causes fo...,,-1
169584,t3_bgj62b,1556168400,73,/r/nosleep/comments/bgj62b/my_first_breath_too...,t2_lg9ark7,ellzl0a,bgj62b,Bismuthie,t1_ellx4dp,Omg smart,,-1
169585,t3_btonzl,1559106000,2,/r/nosleep/comments/btonzl/her_eye_was_a_spira...,t2_gkau4,ep15bo3,btonzl,thejollyden,t1_ep12y0d,How do you delete comments?,,-1
169586,t3_cn3nbl,1565326800,1,/r/nosleep/comments/cn3nbl/straight_to_vhs_sun...,t2_17gt7w,ew7ctpd,cn3nbl,LadyGrey1174,t3_cn3nbl,"Holy hannah, time for a Disney movie...",,-1


In [66]:
#The punkt module is a pre-trained model that helps you tokenize words and sentences.
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Astayanax\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Astayanax\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Astayanax\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Astayanax\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [61]:
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)

positive_comments = comment_df[comment_df['sentiment'] == 1]
negative_comments = comment_df[comment_df['sentiment'] == 0]

#Tokenize the body of every labeled comment
positive_tokens = [tknzr.tokenize(body) for body in positive_comments["body"]]
negative_tokens = [tknzr.tokenize(body) for body in negative_comments["body"]]

In [72]:
def remove_noise(comment_tokens, stop_words = ()):
    cleaned_tokens = []
    #Create the lemmatizer once rather than once per token
    lemmatizer = WordNetLemmatizer()

    for token, tag in pos_tag(comment_tokens):
        #Strip URLs and @mentions (raw strings avoid invalid-escape warnings)
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        token = re.sub(r"(@[A-Za-z0-9_]+)", "", token)

        #Map the Penn Treebank tag to the WordNet part of speech
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        token = lemmatizer.lemmatize(token, pos)

        #Keep non-empty tokens that are neither punctuation nor stop words
        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

stop_words = stopwords.words('english')
#Clean every tokenized comment, dropping English stop words
negative_tokens_c = [remove_noise(tokens, stop_words) for tokens in negative_tokens]
positive_tokens_c = [remove_noise(tokens, stop_words) for tokens in positive_tokens]
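
The two substitutions inside `remove_noise` can be tried on their own, without NLTK. This is a small sketch that reuses the same patterns; `strip_links_and_mentions` is a hypothetical helper name.

```python
import re

# Same patterns as in remove_noise, compiled once and written as raw strings
URL_RE = re.compile(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|"
                    r"(?:%[0-9a-fA-F][0-9a-fA-F]))+")
MENTION_RE = re.compile(r"(@[A-Za-z0-9_]+)")

def strip_links_and_mentions(token):
    """Remove URLs and @mentions from a single token."""
    return MENTION_RE.sub("", URL_RE.sub("", token))

for token in ("https://www.reddit.com/r/nosleep", "@someone", "story"):
    print(repr(token), "->", repr(strip_links_and_mentions(token)))
```

Tokens that consist only of a link or a mention become empty strings, which is why `remove_noise` checks `len(token) > 0` before keeping a token.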

In [101]:
def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token
            
def get_tweets_for_model(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tokens)

pos_tokens_mod = get_tweets_for_model(positive_tokens_c)
neg_tokens_mod = get_tweets_for_model(negative_tokens_c)

all_pos_words = get_all_words(positive_tokens_c)
freq_dist_pos = FreqDist(all_pos_words)

all_neg_words = get_all_words(negative_tokens_c)
freq_dist_neg = FreqDist(all_neg_words)

print(freq_dist_pos.most_common(10))
print(freq_dist_neg.most_common(10))

[('’', 27), ('get', 17), ('good', 14), ('like', 12), ('read', 12), ('kill', 11), ('think', 11), ('go', 10), ('op', 10), ("i'm", 10)]
[('like', 15), ('r', 15), ('nosleep', 15), ('#x200b', 14), ('get', 14), ('question', 14), ('must', 14), ('message', 14), ('moderator', 14), ('submission', 13)]
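
The most common "positive" token above is the typographic apostrophe `’`. It survives `remove_noise` because `string.punctuation` only contains ASCII punctuation. One possible way to catch it, shown here as a sketch and not as what the notebook does, is to check the Unicode category of each character; `is_punctuation` is a hypothetical helper.

```python
import string
import unicodedata

def is_punctuation(token):
    """True when every character is in a Unicode punctuation category (P*)."""
    return len(token) > 0 and all(unicodedata.category(ch).startswith("P") for ch in token)

# The curly apostrophe is not part of the ASCII punctuation set...
print("’" in string.punctuation)
# ...but its Unicode category marks it as punctuation
print(is_punctuation("’"))
```

Symbols such as `$` or `+` belong to the Unicode symbol categories (S*) and not to the punctuation ones, so they would need separate handling.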


In [102]:
#NB - You CAN'T run this cell multiple times in a row. pos_tokens_mod and neg_tokens_mod
# are generators, so iterating over them here exhausts them. If you need to rerun this
# cell, rerun the cell above first to recreate them
positive_dataset = [(comment_dict, "Positive")
                     for comment_dict in pos_tokens_mod]

negative_dataset = [(comment_dict, "Negative")
                     for comment_dict in neg_tokens_mod]

dataset = positive_dataset + negative_dataset
train_size = int(len(dataset)*0.7)

random.shuffle(dataset)

train_data = dataset[:train_size]
test_data = dataset[train_size:]

print(len(positive_dataset), len(negative_dataset), len(dataset))
print(len(train_data), len(test_data))

107 89 196
137 59
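
The rerun warning in the cell above comes from how Python generators work: they can only be iterated once. The sketch below reproduces the effect with a tiny hypothetical token list, and shows that wrapping the generator in `list(...)` gives a result that can be reused. `features_for_model` mirrors `get_tweets_for_model` but is written here only for illustration.

```python
def features_for_model(cleaned_tokens_list):
    """Yield one {token: True} feature dictionary per comment."""
    for tokens in cleaned_tokens_list:
        yield {token: True for token in tokens}

tokens = [["good", "story"], ["scary"]]

gen = features_for_model(tokens)
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # nothing is left to iterate over

# Materializing the features once makes them safe to reuse
materialized = list(features_for_model(tokens))

print(len(first_pass), len(second_pass), len(materialized))
```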


In [104]:
classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data))

# show_most_informative_features prints its own output and returns None,
# so it should not be wrapped in print()
classifier.show_most_informative_features(10)

Accuracy is: 0.576271186440678
Most Informative Features
                 comment = True           Negati : Positi =      7.2 : 1.0
                    post = True           Negati : Positi =      7.2 : 1.0
               subreddit = True           Negati : Positi =      7.2 : 1.0
                       r = True           Negati : Positi =      6.4 : 1.0
                   check = True           Negati : Positi =      6.4 : 1.0
                    must = True           Negati : Positi =      6.4 : 1.0
                   issue = True           Negati : Positi =      5.5 : 1.0
                     may = True           Negati : Positi =      3.3 : 1.0
                      op = True           Positi : Negati =      3.3 : 1.0
                    good = True           Positi : Negati =      3.3 : 1.0
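
To put the accuracy in context, it can be compared with simply always predicting the larger class. The numbers below are taken from the outputs above (59 test items, 107 positive and 89 negative labeled comments). The class balance of the shuffled test split itself is not shown, so the baseline is computed over the full labeled set and should be read as a rough reference only.

```python
test_size = 59
accuracy = 0.576271186440678

# Number of correctly classified test comments implied by the accuracy
correct = round(accuracy * test_size)

# Share of the larger class in the full labeled set
positive_count, negative_count = 107, 89
majority_share = max(positive_count, negative_count) / (positive_count + negative_count)

print(correct, "of", test_size, "test comments classified correctly")
print("Majority-class share of the labeled data:", round(majority_share, 3))
```

With fewer than 200 labeled comments, the gap between the two figures is small enough that it may well change from one random shuffle to the next.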


In [119]:
#Rows with sentiment == -1 are the unlabeled comments (missing values are set to -1 on load)
random_comments = comment_df[comment_df['sentiment'] == -1]

tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)

for count in range(10):
    #Clean with the same stop words that were used for the training data
    custom_tokens = remove_noise(tknzr.tokenize(random_comments.iloc[count]["body"]), stop_words)
    predicted_sentiment = classifier.classify(dict([token, True] for token in custom_tokens))
    
    print(predicted_sentiment, random_comments.iloc[count]["body"])

Negative Yes, but she also has no choice and will eventually crave a companion as well.
Positive I'm glad your husband decided to confront her. You would have never known otherwise.
Negative Thank you. I don’t know what I believe anymore, but I appreciate your words.
Positive Beautiful!
Negative I did in a comment,but they've allbeen removed by nosleep.

I copy/pasted here for you:

&amp;#x200B;


[This](https://www.reddit.com/r/nosleep/comments/c1wxxu/i_work_at_nasa_we_made_alien_contact_yesterday/) was posted two days ago...not saying it's related, not saying it isn't.

But I'm scared.

&amp;#x200B;

EDIT. Oh, holy shit. All alert posts have been removed. WTF.
Positive Happens with far more stories on this sub than it should. I don’t know if people get bored telling their true stories and rush the ending, or if the endings are just really hard to convey. Either way it’s disappointing every time it happens
Negative r/hydrohomies must be in on it
Negative I did not realize I was readin