# Business Problem
The COVID-19 pandemic took the world by storm, and COVID-related topics have become some of the most discussed on social media. Stay-at-home orders have led more people to express their opinions on platforms such as Facebook, Twitter, and Reddit.

In this project, I use the Reddit API to collect the top 500 posts and their comments from a COVID-19-related subreddit. Topic modeling, sentiment analysis, and prediction are combined into a machine learning pipeline that investigates public opinion across the text content on Reddit, with the aim of helping decision makers make better policies.

### Objective
*   Build a machine learning pipeline that collects data, analyzes it, and predicts sentiment.
*   Help decision makers understand the importance of public opinion.
*   Help decision makers find possible issues surrounding COVID-19.


 

In [None]:
pip install praw


In [None]:
import datetime
import praw
import pandas as pd
#from keys import client_id, client_secret

In [None]:
client_id = '**'
client_secret = '**'
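
Since the credentials above are masked, one possible way to supply them without hard-coding is to read them from environment variables. This is a minimal sketch: the variable names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` and the helper `load_reddit_credentials` are assumptions made for illustration, not part of the original notebook.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are hypothetical names
    chosen for this sketch; the notebook itself does not define them.
    """
    try:
        return env["REDDIT_CLIENT_ID"], env["REDDIT_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError.
        raise RuntimeError(f"Missing environment variable: {missing}") from None
```

With the two variables set in the session, `client_id, client_secret = load_reddit_credentials()` would replace the hard-coded assignments.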

## Collecting the posts for our topic
Initializing a Reddit Instance

In [None]:
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent='android:my_app:v1 (by /u/HardPlayer23)',
)

In [None]:
covid = reddit.subreddit('CoronavirusCanada')

# Gather the top 20 posts with their title, URL, body, upvotes, timestamp, and an index
# that serves as a key between the posts and the comments we collect later
posts = []
for index, post in enumerate(covid.top(limit=20)):
    posts.append([post.title, "https://www.reddit.com" + post.permalink, post.selftext, post.score, post.created_utc, index])

#Converting into DataFrame
posts = pd.DataFrame(posts, columns=['Title', 'URL', 'Body', 'Upvotes', 'Time', 'Key'])
# Convert created_utc (seconds since the Unix epoch) into a UTC timestamp;
# datetime.fromtimestamp would silently use the machine's local timezone instead
posts.Time = pd.to_datetime(posts.Time, unit='s')

#The first post is a sticky, so we can drop it
posts = posts.iloc[1:]
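
The row-building and timestamp steps in the cell above can be tried without network access. The sketch below assumes stand-in objects that expose the same attributes the cell reads (`title`, `permalink`, `selftext`, `score`, `created_utc`); the helper `posts_to_frame` and all values in `fake_posts` are made up for illustration.

```python
from types import SimpleNamespace

import pandas as pd


def posts_to_frame(submissions):
    """Build the posts table from objects exposing the attributes used above."""
    rows = [
        {
            "Title": post.title,
            "URL": "https://www.reddit.com" + post.permalink,
            "Body": post.selftext,
            "Upvotes": post.score,
            "Time": post.created_utc,
            "Key": index,
        }
        for index, post in enumerate(submissions)
    ]
    frame = pd.DataFrame(rows, columns=["Title", "URL", "Body", "Upvotes", "Time", "Key"])
    # created_utc is seconds since the Unix epoch; utc=True keeps the timezone explicit.
    frame["Time"] = pd.to_datetime(frame["Time"], unit="s", utc=True)
    return frame


# Stand-in object with made-up values, so the sketch runs offline.
fake_posts = [
    SimpleNamespace(
        title="Example post",
        permalink="/r/example/comments/abc/",
        selftext="",
        score=10,
        created_utc=1587205445.0,
    ),
]
demo_posts = posts_to_frame(fake_posts)
```

Passing `utc=True` is a design choice of this sketch: it yields timezone-aware values, which avoids ambiguity when the notebook runs on machines in different timezones.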

In [None]:
posts.head()

| | Title | URL | Body | Upvotes | Time | Key |
|---|---|---|---|---|---|---|
| 1 | British public: “We love the NHS!” *elects Con... | https://www.reddit.com/r/CoronavirusUK/comment... | | 3817 | 2020-04-18 10:24:05 | 1 |
| 2 | Richard Branson is worth $4 billion dollars an... | https://www.reddit.com/r/CoronavirusUK/comment... | | 1999 | 2020-04-01 10:26:15 | 2 |
| 3 | Just about the kindest thing to happen in the ... | https://www.reddit.com/r/CoronavirusUK/comment... | | 1639 | 2020-03-29 13:54:37 | 3 |
| 4 | Over Bristol tonight | https://www.reddit.com/r/CoronavirusUK/comment... | | 1496 | 2020-06-02 17:33:16 | 4 |
| 5 | Wetherspoons: Just Say No | https://www.reddit.com/r/CoronavirusUK/comment... | | 1463 | 2020-03-25 09:32:06 | 5 |


In [None]:
posts.shape

(19, 6)


**Collecting the comments for each of our posts**

Next, we collect every comment, including nested replies, for each post gathered above.

In [None]:
def collect_replies(key, url):
    '''
    Collect every comment (including nested replies) for one post.
    params key: the post's Key value, copied onto each comment row
    params url: the full URL of the post
    Returns a pandas DataFrame, where each row represents an individual comment
    '''
    submission = reddit.submission(url=url)
    submission.comments.replace_more(limit=None)
    comment_queue = submission.comments[:] 

    table = {'Reply':[], 'Upvote':[], 'Time':[], 'Key':[]}

    while comment_queue:
        comment = comment_queue.pop(0)
        table['Reply'].append(comment.body)
        table['Time'].append(comment.created_utc)
        table['Upvote'].append(comment.score)
        table['Key'].append(key)
        comment_queue.extend(comment.replies)
    
    return pd.DataFrame.from_dict(table)
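
The loop in `collect_replies` is a breadth-first traversal of the comment tree. The sketch below shows the same idea on made-up stand-in objects (anything with `body` and `replies` attributes), using `collections.deque` so that removing the front element is constant time. The function name `flatten_comments` is hypothetical and the code does not call Reddit.

```python
from collections import deque
from types import SimpleNamespace


def flatten_comments(top_level):
    """Visit every comment breadth-first and return the bodies in visit order."""
    queue = deque(top_level)
    bodies = []
    while queue:
        comment = queue.popleft()  # O(1), whereas list.pop(0) shifts every element
        bodies.append(comment.body)
        queue.extend(comment.replies)  # nested replies are visited after the current level
    return bodies


# Made-up comment tree: two top-level comments, one of which has a reply.
reply = SimpleNamespace(body="nested reply", replies=[])
first = SimpleNamespace(body="first", replies=[reply])
second = SimpleNamespace(body="second", replies=[])

flat = flatten_comments([first, second])
```

For threads of the size collected here the difference between a list and a deque is unlikely to matter, but the deque version scales better on very long threads.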

In [None]:
# First, generate a list of tuples holding the key and URL for each row: the first value of each tuple is the key,
# and the second value is the URL
keys = posts.Key.tolist()
urls = posts.URL.tolist()
tuples = list(zip(keys, urls))

# Now we generate our comments DataFrame using a list comprehension
comments = pd.concat([collect_replies(key, url) for key, url in tuples])
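
Because every comment row carries the `Key` of its post, the two tables can be joined later on that column. The following is a small sketch with made-up data; the frames `posts_demo` and `comments_demo` are placeholders for illustration, not the data collected above.

```python
import pandas as pd

# Made-up stand-ins for the posts and comments tables.
posts_demo = pd.DataFrame({"Key": [1, 2], "Title": ["Post A", "Post B"]})
comments_demo = pd.DataFrame(
    {"Key": [1, 1, 2], "Reply": ["a1", "a2", "b1"], "Upvote": [3, 1, 5]}
)

# Attach each comment to its post; validate raises if Key is not unique in posts_demo.
joined = comments_demo.merge(posts_demo, on="Key", how="left", validate="many_to_one")

# Count how many comments each post received.
replies_per_post = joined.groupby("Title")["Reply"].count()
```

A left join keeps every comment even if its post row is missing, which makes gaps in the posts table easy to spot as missing titles.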

In [None]:
# Again, convert created_utc (seconds since the Unix epoch) into a UTC timestamp
comments.Time = pd.to_datetime(comments.Time, unit='s')

In [None]:
comments.head()

| | Reply | Upvote | Time | Key |
|---|---|---|---|---|
| 0 | This should be WAY upvoted. I’ve been waiting ... | 18 | 2020-05-13 04:20:47 | 1 |
| 1 | Good to know, once this comes out to the masse... | 14 | 2020-05-13 05:29:30 | 1 |
| 2 | If you're in BC, the BCCDC has a survey where ... | 9 | 2020-05-13 05:40:04 | 1 |
| 3 | Excellent. This can be great for determining i... | 6 | 2020-05-13 12:18:57 | 1 |
| 4 | [deleted] | 6 | 2020-05-13 19:23:21 | 1 |


In [None]:
comments.shape

(588, 4)

In [None]:
comments.to_csv('Comments_canadatop20.csv', index=False)

In [None]:
posts.to_csv('Posts_canadatop20.csv', index=False)
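
When the saved files are read back, the `Time` column comes in as plain text unless pandas is told to parse it. The sketch below demonstrates the round trip on a one-row, made-up frame, using an in-memory buffer in place of a file on disk.

```python
import io

import pandas as pd

# One made-up row with the same columns as the comments table.
original = pd.DataFrame(
    {
        "Reply": ["hello"],
        "Upvote": [18],
        "Time": pd.to_datetime(["2020-05-13 04:20:47"]),
        "Key": [1],
    }
)

# Write to an in-memory buffer instead of disk, mirroring to_csv(..., index=False).
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the Time column as datetimes rather than strings.
restored = pd.read_csv(buffer, parse_dates=["Time"])
```

The same `parse_dates=["Time"]` argument should apply when loading the CSV files written above in a later stage of the pipeline.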

In [None]:
!cp Comments_canadatop20.csv "drive/My Drive/"