# Business Problem
The COVID-19 pandemic took the world by storm, and COVID-related topics have become some of the most discussed on social media. Stay-at-home orders have led more people to express their opinions on platforms such as Facebook, Twitter, and Reddit.

For this business problem, I will use the Reddit API to collect the top 100 posts in the COVID19 subreddit, along with their comments. Sentiment analysis and prediction will then be used to build a machine learning pipeline that investigates public opinion across this text content, with the aim of helping decision makers make better policies.

### Objective
*   Build a machine learning pipeline that collects data, then analyzes and predicts sentiment.
*   Help decision makers understand the importance of using public opinion.
*   Help decision makers find possible issues surrounding COVID-19.


 

In [None]:
pip install praw

Collecting praw
  Downloading https://files.pythonhosted.org/packages/2c/15/4bcc44271afce0316c73cd2ed35f951f1363a07d4d5d5440ae5eb2baad78/praw-7.1.0-py3-none-any.whl (152kB)

In [None]:
import datetime

import praw
import pandas as pd
from keys import client_id, client_secret
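
The `keys` module imported above is not shown in this notebook. As a minimal sketch of one way such credentials could be supplied, assuming they are stored in environment variables (the variable names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` and the helper `load_credentials` are hypothetical, not part of the original code):

```python
import os


def load_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables.

    The variable names below are hypothetical; use whatever names you export.
    """
    try:
        return env["REDDIT_CLIENT_ID"], env["REDDIT_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError
        raise RuntimeError(f"Missing Reddit credential: {missing}") from None
```

Calling `client_id, client_secret = load_credentials()` would then keep the secrets out of the notebook itself.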

## Collecting the posts for our topic
Initializing a Reddit Instance

In [None]:
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent='android:my_app:v1 (by /u/HardPlayer23)',
)

In [None]:
covid = reddit.subreddit('COVID19')

# Gather the top 100 posts with their title, URL, body, upvotes, and timestamp, plus an index that
# serves as a key between the posts and the comments we collect later
posts = []
for index, post in enumerate(covid.top(limit=100)):
    posts.append([post.title, "https://www.reddit.com" + post.permalink, post.selftext, post.score, post.created_utc, index])

#Converting into DataFrame
posts = pd.DataFrame(posts, columns=['Title', 'URL', 'Body', 'Upvotes', 'Time', 'Key'])
# Convert the Unix timestamp (seconds since the epoch, UTC) to a pandas datetime
posts.Time = pd.to_datetime(posts.Time, unit='s')

#The first post is a sticky, so we can drop it
posts = posts.iloc[1:]
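
`created_utc` is a Unix timestamp in seconds. Note that `datetime.datetime.fromtimestamp` called without a `tz` argument interprets the value in the machine's local time zone, so the result can differ depending on where the notebook runs. A small sketch of a conversion pinned to UTC, using only the standard library (the helper name `utc_from_epoch` is an illustration, not part of the notebook):

```python
from datetime import datetime, timezone


def utc_from_epoch(seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# The Unix epoch itself maps to midnight on 1970-01-01 UTC
epoch_start = utc_from_epoch(0)
```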

In [None]:
posts.head()

|   | Title | URL | Body | Upvotes | Time | Key |
|---|---|---|---|---|---|---|
| 1 | Number of people with coronavirus infections m... | https://www.reddit.com/r/COVID19/comments/g2cz... |  | 9387 | 2020-04-16 11:11:30 | 1 |
| 2 | At least 11% of tested blood donors in Stockho... | https://www.reddit.com/r/COVID19/comments/g4zn... |  | 8906 | 2020-04-20 19:43:27 | 2 |
| 3 | Ending coronavirus lockdowns will be a dangero... | https://www.reddit.com/r/COVID19/comments/g1hp... |  | 6920 | 2020-04-15 00:46:34 | 3 |
| 4 | NYC Health: Only 1.8% of deaths in New York Ci... | https://www.reddit.com/r/COVID19/comments/ftlq... |  | 6738 | 2020-04-02 12:48:15 | 4 |
| 5 | Not wearing masks to protect against coronavir... | https://www.reddit.com/r/COVID19/comments/fqdq... |  | 6305 | 2020-03-28 04:59:10 | 5 |


In [None]:
posts.shape

(99, 6)


**Collecting the comments for each of our posts**

We now collect every comment, including nested replies, for each of the posts gathered above.

In [None]:
def collect_replies(key, url):
    '''
    Collect every comment on one post, including nested replies.

    params key: the Key value of the post in the posts DataFrame
    params url: the full URL of the post
    Returns a pandas DataFrame, where each row represents an individual comment
    '''
    submission = reddit.submission(url=url)
    submission.comments.replace_more(limit=None)
    comment_queue = submission.comments[:] 

    table = {'Reply':[], 'Upvote':[], 'Time':[], 'Key':[]}

    while comment_queue:
        comment = comment_queue.pop(0)
        table['Reply'].append(comment.body)
        table['Time'].append(comment.created_utc)
        table['Upvote'].append(comment.score)
        table['Key'].append(key)
        comment_queue.extend(comment.replies)
    
    return pd.DataFrame.from_dict(table)
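
The loop in `collect_replies` is a breadth-first traversal: top-level comments are visited first, and each comment's replies are appended to the end of the queue. The sketch below shows the same pattern on a hypothetical stand-in class (`FakeComment` is not a PRAW type), using `collections.deque` so that removing from the front of the queue is O(1) rather than the O(n) cost of `list.pop(0)`:

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FakeComment:
    """Hypothetical stand-in for a comment: a body plus a list of nested replies."""
    body: str
    replies: list = field(default_factory=list)


def flatten_breadth_first(top_level):
    """Return comment bodies in breadth-first order."""
    queue = deque(top_level)
    bodies = []
    while queue:
        comment = queue.popleft()      # take the oldest queued comment
        bodies.append(comment.body)
        queue.extend(comment.replies)  # replies wait behind everything already queued
    return bodies


thread = [
    FakeComment("a", [FakeComment("a1", [FakeComment("a1x")]), FakeComment("a2")]),
    FakeComment("b", [FakeComment("b1")]),
]
order = flatten_breadth_first(thread)
```

Here both top-level comments come out before any reply, and the deepest reply (`a1x`) comes out last.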

In [None]:
# Build a list of (key, url) tuples, one per post
keys = posts.Key.tolist()
urls = posts.URL.tolist()
tuples = list(zip(keys, urls))

# Collect the comments for each post and concatenate them into a single DataFrame
comments = pd.concat([collect_replies(key, url) for key, url in tuples])

In [None]:
# Again, convert the Unix timestamp (seconds since the epoch, UTC) to a pandas datetime
comments.Time = pd.to_datetime(comments.Time, unit='s')

In [None]:
comments.head()

|   | Reply | Upvote | Time | Key |
|---|---|---|---|---|
| 0 | OP, you may want to flair this as Press Releas... | 449 | 2020-04-16 11:19:00 | 1 |
| 1 | The sampling taken during week 13, included 1... | 72 | 2020-04-16 11:15:11 | 1 |
| 2 | Very curious to see the random sampling result... | 57 | 2020-04-16 11:58:13 | 1 |
| 3 | [removed] | 156 | 2020-04-16 12:27:51 | 1 |
| 4 | If this is true, wouldn't that bring the death... | 23 | 2020-04-16 16:53:09 | 1 |


In [None]:
comments.shape

(39623, 4)
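
Because every comment row carries the `Key` of the post it belongs to, the two tables can be joined later. A minimal sketch on made-up rows (the values are illustrative, not collected data), assuming the column names used in this notebook:

```python
import pandas as pd

# Tiny illustrative frames that mirror the notebook's column names
posts_demo = pd.DataFrame({"Title": ["Post A", "Post B"], "Key": [1, 2]})
comments_demo = pd.DataFrame(
    {"Reply": ["first", "second", "third"], "Upvote": [5, 2, 7], "Key": [1, 1, 2]}
)

# Attach each comment to its parent post through the shared Key column
merged = comments_demo.merge(posts_demo, on="Key", how="left")

# Count how many comments each post received
comments_per_post = merged.groupby("Title")["Reply"].count()
```

A left join keeps every comment row even if its post is missing from the posts table, which makes gaps easy to spot.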

In [None]:
comments.to_csv('Comments_.csv', index=False)

In [None]:
posts.to_csv('Posts.csv', index=False)

In [None]:
!cp Comments_.csv "drive/My Drive/"