# Sentiment Analysis on titles of Reddit Posts using Google Cloud NLP

The code below uses Reddit RSS feeds to extract basic information (such as title and author) about posts from a user-provided list of subreddits. After the information is extracted, a function is called to run sentiment analysis on each post's title. A function is also provided to convert the JSON data into a CSV to make it easier to load and use for BI purposes.

This was my final project for a cloud computing course, taken as part of my master's at the University of Missouri. The intent of the project was to showcase, at a small scale, the ability to extract data via any method (such as web scraping), use what we learned about cloud computing to run machine learning of some kind on it, and then make the insights from the machine learning available to a business intelligence tool. The picture below sums up the intended data flow.

![Specific_Project_1.png MISSING](../images/Specific_Project_1.png)

I chose to do my project on extracting data from Reddit using RSS feeds, analyzing the sentiment of titles, and exposing that data through an R Shiny dashboard. The code below covers only extracting the data, running Google's NLP API on it, and then flattening it into a CSV. The Shiny app can be found at [insert app link here]() and code for it can be found in the shinyApp folder.

## Table of Contents

- [Dependencies](#Dependancies)
- [Stackoverflow Functions](#stackoverflow)
- [Reddit to JSON](#json)
- [Sentiment Analysis](#sentiment)
- [Flattening the Output](#flatten)
- [Extracting Data](#extraction)

### Dependencies <a name="Dependancies"></a>

The functions below are built on a few common Python libraries, so those need to be installed and imported for everything to work.

In [None]:
#load dependencies
import json
import feedparser
import os
import time
import pandas as pd
from bs4 import BeautifulSoup
from bs4.element import Comment
from google.oauth2 import service_account
from google.cloud import language_v1
from google.cloud.language_v1 import enums
from google.cloud.language_v1 import types

### Stackoverflow Functions <a name="stackoverflow"></a>

I make use of Beautiful Soup to parse the RSS feeds and get the data I need. Stack Overflow was a big help here, as indicated by the comment.

In [None]:
# First two functions from: https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text
def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)
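For comparison, here is a rough sketch of the same visible-text idea using only the standard library's `html.parser`, in case Beautiful Soup is not available. This is not the Stack Overflow code above: the class name `VisibleTextParser`, the helper `visible_text`, and the sample HTML are made up for illustration, and unlike the version above it drops empty text chunks before joining.

```python
from html.parser import HTMLParser

# Tags whose text content is never shown to a reader.
# ("meta" is left out because it has no closing tag and no text.)
HIDDEN_TAGS = {"style", "script", "head", "title"}


class VisibleTextParser(HTMLParser):
    """Collects text that is not inside a hidden tag or an HTML comment."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # how many hidden tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; HTMLParser routes them to
        # handle_comment, which does nothing by default.
        text = data.strip()
        if self.hidden_depth == 0 and text:
            self.chunks.append(text)


def visible_text(html):
    """Return the visible text of an HTML fragment as one string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical fragment, not taken from a real feed.
sample = '<div><!-- hidden --><p>Hello <b>world</b></p><script>var x = 1;</script></div>'
print(visible_text(sample))
```

The comment and the script body are skipped, and only the paragraph text is kept.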

### Reddit to JSON <a name='json'></a>

These functions are put together in the same cell because find_author() is necessary for reddit_to_json() to work. I noticed that when a post had multiple authors, or the author was not coming through in the RSS feed, reddit_to_json() would break. Rather than drop the author field, I thought it would be best to set it to null so users could still see which authors post a lot. It turns out this wasn't particularly helpful, since a majority of authors ended up null (something I wasn't expecting). I'm not sure if I'll ever return to this project, but if I do, that will be one of the things I try to improve.
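As a small illustration of the null-author behaviour, the sketch below uses plain dictionaries in place of feed entries (an assumption; the real entries come from feedparser) and measures how many authors end up missing. The helper name `find_author_or_none` and the sample posts are hypothetical.

```python
from collections import Counter


def find_author_or_none(post):
    """Return the post's author, or None when the key is missing."""
    return post.get("author")


# Made-up posts: the second one has no author field at all.
posts = [
    {"title": "a", "author": "/u/alice"},
    {"title": "b"},
    {"title": "c", "author": "/u/alice"},
]

authors = [find_author_or_none(p) for p in posts]

# Share of posts whose author came through as None.
missing_share = authors.count(None) / len(authors)

# Most frequent non-null author and their post count.
top_author = Counter(a for a in authors if a is not None).most_common(1)

print(authors, missing_share, top_author)
```

Checking `missing_share` right after extraction would have flagged early on that most authors were coming through as null.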

In [None]:
#a function to find the author of a reddit post and, if there isn't one, return None
def find_author(reddit_post):
    try:
        author = reddit_post["author"]
    except KeyError:
        #the entry has no author field
        author = None

    return author

#a function to parse reddit RSS feeds and turn them into json files
def reddit_to_json(subreddits,sort,posts_from="all",amt=25):
    """ A function to convert a reddit rss feed into a json file.
    
    This function takes four parameters:
        - subreddits is a list of subreddits that you want to parse.
        - sort is the type of sorting you apply to the subreddit in the rss feed URL.
          It can take any value an rss feed url can take. Some options include "top", "hot", or "new".
        - posts_from specifies the timeframe for which posts are extracted when sorting by "top". By default
          this is set to "all" to extract the top posts from all time. Options are "hour","day","week", "month",
          "year", and "all".
        - amt is the number of posts you want to fetch, up to the number of posts a subreddit feed can show.
          By default, this is set to 25 but can be changed to any integer between 1 and 100.
    
    It will write one JSON file per subreddit to the json_outputs folder, containing up to amt posts based on
    your sort and posts_from parameters (for example, when sort is "top" and posts_from is "all" it will grab
    the top posts of all time). Each post records the subreddit it is in, the title, the author, the date
    posted, a link to the post, and the summary text.
    """
    
    #create for loop that will parse through rss feeds for all subreddits in our subreddit list
    for subreddit in subreddits:
        #create empty dict with empty list of posts to append to for printing purposes
        reddit = {}
        reddit["posts"] = []
        
        #create rss url based off sort and posts_from parameters
        if (sort == "top"):
            if (posts_from is None):
                rss_url = 'http://www.reddit.com/r/%s/top/.rss?limit=%s&sort=%s&t=all' % (subreddit,amt,sort)
                filename = '%s_%s_all.json' % (subreddit, sort)
            else:
                rss_url = 'http://www.reddit.com/r/%s/top/.rss?limit=%s&sort=%s&t=%s' % (subreddit,amt,sort,posts_from)
                filename = '%s_%s_%s.json' % (subreddit,sort,posts_from)
        else: 
            rss_url = 'http://www.reddit.com/r/%s/.rss?limit=%s&sort=%s' % (subreddit,amt,sort)
            filename = '%s_%s.json' % (subreddit,sort)
            
        feed = feedparser.parse(rss_url)

        #create conditional to show errors
        if (feed['bozo'] == 1):
            print("Error Reading/Parsing Feed XML Data for /r/{}".format(subreddit))    
        else:
            print("Making json of %s posts from /r/%s..." % (sort,subreddit))
            for item in feed["items"]:
                reddit["posts"].append({
                    "subreddit":subreddit,
                    "title":item["title"],
                    "author":find_author(item),
                    "link":item["link"],
                    "datetime":item["date"],
                    "summary_text": text_from_html(item["summary"])
                })
                
        #if directory does not exist, make it
        os.makedirs('json_outputs', exist_ok=True)

        #output the data as a json file directly in the output folder
        with open(os.path.join('json_outputs', filename), 'w') as outfile:
            json.dump(reddit, outfile)
    
    print("Done!")
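The URL and filename branching inside `reddit_to_json` can be pulled out and checked on its own without hitting the network. The sketch below mirrors that logic; the helper name `build_feed_url` is hypothetical and is not part of the code above.

```python
def build_feed_url(subreddit, sort, posts_from="all", amt=25):
    """Return (rss_url, filename) for one subreddit.

    Only the "top" sort uses a timeframe; every other sort ignores posts_from.
    """
    if sort == "top":
        timeframe = posts_from or "all"  # treat None as "all"
        url = "http://www.reddit.com/r/%s/top/.rss?limit=%s&sort=%s&t=%s" % (
            subreddit, amt, sort, timeframe)
        filename = "%s_%s_%s.json" % (subreddit, sort, timeframe)
    else:
        url = "http://www.reddit.com/r/%s/.rss?limit=%s&sort=%s" % (
            subreddit, amt, sort)
        filename = "%s_%s.json" % (subreddit, sort)
    return url, filename


print(build_feed_url("aww", "top", "week", 10))
print(build_feed_url("aww", "new"))
```

Keeping this piece separate makes it easy to confirm which file a given call will produce before fetching anything.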

### Sentiment Analysis <a name='sentiment'></a>

The below function will call the Google Cloud APIs (using a user's Google Cloud credentials) and run sentiment analysis. It will then update the JSONs and merge them together into one JSON file that could then be uploaded into a NoSQL database.

In [None]:
#a function to add sentiment analysis data and compile all reddit jsons from 'reddit_to_json' function
def reddit_sentiment_analysis(credentials_path,data_path='json_outputs/',results_path='results/'):
    """ A function that takes reddit json data (created by the reddit_to_json function) 
    and adds sentiment and entity analysis data for the title using the Google Cloud NLP API.
    
    This function takes three parameters:
    - The credentials_path specifies the path to a json file used for Google Cloud credentials. Without a credentials
      file provided, an exception will be raised.
    - The data_path specifies the path where the json inputs can be found. The 'reddit_to_json' function will 
      create a directory called 'json_outputs' and so the default value for this parameter is set to 'json_outputs/'.
      If the output directory of 'reddit_to_json' was changed or the files moved, you will need to set this parameter.
      The path should end with a trailing slash.
    - The results_path specifies where the final json will land. By default a folder called 'results' will be
      created and the file will be placed there. The path should end with a trailing slash.
    
    It will write a single json containing the posts from every json file in data_path, with sentiment analysis
    added for each post, to results/reddit_sentiment_analysis.json
    """
        
    #check for credentials file and if it doesn't exist, inform the user
    if credentials_path is None:
        raise Exception("credentials_path: No Google Cloud credentials found - please provide a JSON file with credentials.")

    #establish empty list of files
    print('Reading JSON from %s' % (data_path))
    files = []

    #read in files from path so we can loop through it later
    for file in os.listdir(data_path):
        if file.endswith(".json"):
            files.append(file)

    #let the user know how many files were found so they can validate the path is correct
    #and let them know if no files were found
    if len(files) == 0:
        print("Error: No Files Found!")
    print('Found %s files in provided path' % (len(files)))

    #set up credentials for Google Cloud and set up NLP client API
    print("Fetching Google Cloud Credentials from {} and establishing connection to API".format(credentials_path))
    credentials = service_account.Credentials.from_service_account_file(credentials_path)
    client = language_v1.LanguageServiceClient(credentials=credentials)
    type_ = enums.Document.Type.PLAIN_TEXT
    encoding_type = enums.EncodingType.UTF8

    #create empty outputlist that will ultimately be export to JSON
    outputlist = []

    #loop through files to do sentiment analysis on
    for file in files:
        print("Reading {}...Analyzing Sentiment and Entities".format(data_path+file))
        with open(data_path+file) as openfile:
            data = json.load(openfile)

        #analyze title sentiment and entities
        for post in data['posts']:
            #build a document with the post title as content
            document = types.Document(content=post['title'],type=type_)

            #call each API once per post and reuse the sentiment response for both values
            sentiment = client.analyze_sentiment(document, encoding_type=encoding_type).document_sentiment
            entities = client.analyze_entities(document, encoding_type=encoding_type).entities

            #update post with title sentiment score & magnitude and number of entities
            post.update({
                'sentiment_score':sentiment.score,
                'sentiment_magnitude':sentiment.magnitude,
                'num_entities':len(entities)
            })

        #append updated data to empty list
        outputlist.append(data['posts'])

        #Google's API has a limit of 60 calls per minute per user
        #so we sleep for 60 seconds here to avoid timeouts
        print("Done with {}...Waiting for 60 Seconds to avoid API timeout".format(file))
        time.sleep(60)

    #print results to json
    print("Done! Printing file to results folder.")
    
    #assign file name
    file_name = 'reddit_sentiment_analysis.json'
    
    #make the results directory if it doesn't exist
    if not os.path.exists(results_path):
        os.makedirs(results_path)
    
    #dump outputlist as json file into working directory
    with open(file_name,"w") as outfile:
        json.dump(outputlist,outfile)
        
    #move file
    os.rename(file_name,results_path+file_name)
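The score and magnitude stored on each post can be hard to read side by side. The score runs from -1.0 (negative) to 1.0 (positive), while the magnitude is non-negative and grows with the amount of emotional content. The sketch below shows one possible way to turn the pair into a label. The function name `label_sentiment` and both cutoff values are my own assumptions for illustration, not values recommended by Google.

```python
def label_sentiment(score, magnitude, score_cutoff=0.25, magnitude_cutoff=0.5):
    """Map a (score, magnitude) pair to a coarse label.

    The cutoffs are arbitrary illustration values and should be tuned
    against real titles before being relied on.
    """
    if score >= score_cutoff:
        return "positive"
    if score <= -score_cutoff:
        return "negative"
    # Score near zero: strong magnitude suggests positive and negative
    # content cancelling out, weak magnitude suggests little emotion at all.
    return "mixed" if magnitude >= magnitude_cutoff else "neutral"


print(label_sentiment(0.8, 0.8))
print(label_sentiment(0.1, 2.0))
```

A label like this could be added as an extra column for the dashboard, though short titles will often land in the neutral bucket.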

### Flattening the Output <a name='flatten'></a>

The function below takes the JSON generated by the function above and turns it into a CSV. This works well for low volumes of data (i.e. not extracting many subreddits or posts) and makes analysis in a BI tool quicker. In my experience, a lot more people are comfortable with the structure of a CSV than a JSON, so it lets more users take advantage of the insights generated by the sentiment analysis.

In [None]:
#function to flatten the output from 'reddit_sentiment_analysis' into a CSV
def flatten_sentiment_analysis(input_file='results/reddit_sentiment_analysis.json',output_path='results/flattened_sentiment_analysis.csv'):
    """ A function that takes the json output from the 'reddit_sentiment_analysis' function and flattens it to CSV
    
    This function can take two parameters:
    - The input_file is the filepath for the json file. It is by default results/reddit_sentiment_analysis.json
      which is the default output of the 'reddit_sentiment_analysis' function
    - The output_path is the desired landing place for the CSV. By default, this is set to 'results/flattened_sentiment_analysis.csv'
    
    Currently this function CANNOT be run without first running the 'reddit_sentiment_analysis' function in order
    to create one unified file. A way to flatten reddit json output from 'reddit_to_json' will be coming in the future
    """
    
    #open file
    with open(input_file) as openfile:
        data = json.load(openfile)

    #collect one row per post and build the dataframe once at the end
    #(DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
    column_names = ['post_id','subreddit','author','datetime','title','sentiment_score','sentiment_magnitude','num_entities']
    rows = []

    #loop through the json building out the rows
    print("Flattening json to csv...")
    for i, posts in enumerate(data):
        #each subreddit gets its own block of ids: 1001, 1002, ... then 2001, 2002, ...
        subreddit_index = (i + 1) * 1000
        for post_index, post in enumerate(posts, start=1):
            rows.append([
                subreddit_index + post_index,
                post['subreddit'],
                post['author'],
                post['datetime'],
                post['title'],
                post['sentiment_score'],
                post['sentiment_magnitude'],
                post['num_entities']
            ])

    df = pd.DataFrame(rows, columns=column_names)

    #print results to CSV
    print("Printing result to {}...".format(output_path))
    df.to_csv(output_path,header=True,index=False,encoding='utf-8')
    print("Done!")
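To make the `post_id` scheme concrete, here is a rough standard-library sketch of the same flattening on a tiny made-up input. The helper names `flatten_posts` and `rows_to_csv` and the sample data are hypothetical, and it writes to an in-memory string rather than a file.

```python
import csv
import io


def flatten_posts(data, fieldnames):
    """Turn a list of per-subreddit post lists into flat row dictionaries.

    Ids follow the scheme above: the first subreddit gets 1001, 1002, ...,
    the second 2001, 2002, ... (ids collide past 999 posts per subreddit).
    """
    rows = []
    for i, posts in enumerate(data, start=1):
        for j, post in enumerate(posts, start=1):
            row = {"post_id": i * 1000 + j}
            row.update({name: post.get(name) for name in fieldnames})
            rows.append(row)
    return rows


def rows_to_csv(rows, fieldnames):
    """Serialise the rows to CSV text with post_id as the first column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["post_id"] + fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up data in the same shape as the merged JSON (a list of post lists).
data = [
    [{"subreddit": "aww", "title": "cute", "sentiment_score": 0.9}],
    [{"subreddit": "news", "title": "x", "sentiment_score": -0.2},
     {"subreddit": "news", "title": "y", "sentiment_score": 0.0}],
]
fields = ["subreddit", "title", "sentiment_score"]

rows = flatten_posts(data, fields)
csv_text = rows_to_csv(rows, fields)
print(csv_text)
```

With at most 100 posts per subreddit the id blocks never overlap, which is why the scheme is safe for the data pulled here.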

### Extracting Data <a name='extraction'></a>

Finally getting to the part that matters! I ran my functions with the following list of subreddits, which at the time (April 25th, 2020) were the top 25 subreddits by subscriber count (minus /r/Announcements). I wanted to see if we could get a sense of the sentiment of titles in these subreddits, and there were definitely some interesting things going on.

The cell will extract the data into JSONs for each of the subreddits, do the sentiment analysis, and then merge the data into one JSON.

In [None]:
# unfortunately there is no way to easily extract a list of top subreddits from reddit itself
# so I pulled them from http://redditlist.com/ as shown in the screenshot above
# I removed announcements as it does not have user generated content
subreddits = ["funny","AskReddit","gaming","pics","aww","science","worldnews","Music","movies","videos",
              "todayilearned","news","IAmA","gifs","Showerthoughts","EarthPorn","askscience","Jokes","food",
              "explainlikeimfive","books","LifeProTips","Art","mildlyinteresting","DIY"]

reddit_to_json(subreddits,"top","all",100)
reddit_sentiment_analysis('mu-dsa-course-8635-sp20-ffc2322a17d7.json')

Running this will then flatten it and create a CSV. Never hurts to flatten things if you can, in my opinion.

In [None]:
flatten_sentiment_analysis()
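Once the CSV exists, a quick sanity check before loading it into a BI tool is to average the sentiment score per subreddit. The sketch below does this with the standard library on made-up rows. The function name `mean_sentiment_by_subreddit` is hypothetical and the sample values are not real results. In practice the rows would come from `csv.DictReader` over the flattened file, which yields every value as a string.

```python
from collections import defaultdict
from statistics import mean


def mean_sentiment_by_subreddit(rows):
    """Average the sentiment_score column for each subreddit."""
    scores = defaultdict(list)
    for row in rows:
        # CSV readers return strings, so convert before averaging.
        scores[row["subreddit"]].append(float(row["sentiment_score"]))
    return {name: mean(values) for name, values in scores.items()}


# Made-up rows in the shape csv.DictReader would produce.
sample_rows = [
    {"subreddit": "aww", "sentiment_score": "0.5"},
    {"subreddit": "aww", "sentiment_score": "0.0"},
    {"subreddit": "news", "sentiment_score": "-0.5"},
]

averages = mean_sentiment_by_subreddit(sample_rows)
print(averages)
```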