<p align="center" draggable="false"><img src="https://user-images.githubusercontent.com/37101144/161836199-fdb0219d-0361-4988-bf26-48b0fad160a3.png" 
     width="200px"
     height="auto"/>
</p>

# <h1 align="center" id="heading">Sentiment Analysis of Reddit Data using Reddit API</h1>

In this live coding session, we use the Python Reddit API Wrapper (`PRAW`) to retrieve data from subreddits on [Reddit](https://www.reddit.com) and perform sentiment analysis using [`pipelines`](https://huggingface.co/docs/transformers/main_classes/pipelines) from [HuggingFace ( 🤗 the GitHub of Machine Learning )](https://techcrunch.com/2022/05/09/hugging-face-reaches-2-billion-valuation-to-build-the-github-of-machine-learning/), which are powered by the [transformer](https://arxiv.org/pdf/1706.03762.pdf) architecture.

## Objectives

At the end of the session, you will 

- know how to work with APIs
- feel more comfortable navigating through documentation, and even inspecting source code
- understand what a `pipeline` object is in HuggingFace
- perform sentiment analysis using `pipeline`
- run a Python script from the command line and get the results

## How to Submit

- At the end of each task, commit* the work to the repository you created before the assignment
- After completing all three tasks, make sure to push the notebook containing all code blocks and output cells to the repository you created before the assignment
- Submit the link to the notebook in Canvas

\***NEVER** commit a notebook displaying errors unless instructed otherwise. However, commit often; recall git ABC = **A**lways **B**e **C**ommitting.

## Tasks

### Task I: Instantiate a Reddit API Object

The first task is to instantiate a Reddit API object using [PRAW](https://praw.readthedocs.io/en/stable/), through which you will retrieve data. PRAW is a wrapper for the [Reddit API](https://www.reddit.com/dev/api) that makes interacting with it easier, unless you are already an expert with [`requests`](https://docs.python-requests.org/en/latest/).

#### 1. Install packages

Please ensure you've run all the cells in `imports.ipynb`, located [here](https://github.com/FourthBrain/MLE-8/blob/main/assignments/week-3-analyze-sentiment-subreddit/imports.ipynb), so that you have all the required packages for today's assignment.

####  2. Create a new app on Reddit 

Create a new app on Reddit and save the secret tokens; refer to [this Medium post](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c) for more details.

- Create a Reddit account if you don't have one, then log into your account.
- To access the API, we need to create an app. Note a slight update: on the website, navigate to `preference` > `app`, or click [this link](https://www.reddit.com/prefs/apps) and scroll all the way down. 
- Click to create a new app, fill in the **name**, choose `script`, and fill in the **description** and **redirect uri**. The redirect URI is where the user is sent after they've granted OAuth access to your application (more info [here](https://github.com/reddit-archive/reddit/wiki/OAuth2)). For our purpose, you can enter any URL, e.g., www.google.com, as shown below.


    <img src="https://miro.medium.com/max/700/1*lRBvxpIe8J2nZYJ6ucMgHA.png" width="500"/>
- Jot down `client_id` (upper-left corner) and `client_secret` 

    NOTE: CLIENT_ID refers to "personal use script" and CLIENT_SECRET to "secret".
    
    <div>
    <img src="https://miro.medium.com/max/700/1*7cGAKth1PMrEf2sHcQWPoA.png" width="300"/>
    </div>

- Create `secrets_reddit.py` in the same directory as this notebook and fill in the `client_id` and `client_secret` values obtained in the last step. We will need to import those constants in the next step.
    ```
    REDDIT_API_CLIENT_ID = "client_id"
    REDDIT_API_CLIENT_SECRET = "client_secret"
    REDDIT_API_USER_AGENT = "any string except 'bot', e.g. My User Agent"
    ```
- Add `secrets_reddit.py` to your `.gitignore` file if not already done. NEVER push credentials to a repo, private or public. 
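
As an alternative to a credentials file, the same three values can be read from environment variables, which keeps them out of the working directory entirely. The sketch below is only an illustration: the function `load_reddit_credentials` is hypothetical, and the variable names are an assumption that simply mirrors the constants in `secrets_reddit.py`.

```python
import os

# Assumed environment variable names; they mirror the constants in secrets_reddit.py.
REQUIRED_VARS = (
    "REDDIT_API_CLIENT_ID",
    "REDDIT_API_CLIENT_SECRET",
    "REDDIT_API_USER_AGENT",
)


def load_reddit_credentials(env=None):
    """Return the three credentials as a dict, or raise KeyError if any is missing."""
    env = os.environ if env is None else env
    # Treat unset and empty variables the same way: both count as missing.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values could then be passed to `praw.Reddit` in place of the `secrets_reddit` constants.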

#### 3. Instantiate a `Reddit` object

Now you are ready to create a read-only `Reddit` instance. Refer to [documentation](https://praw.readthedocs.io/en/stable/code_overview/reddit_instance.html) when necessary.

In [1]:
import praw
import secrets_reddit

# Create a Reddit object which allows us to interact with the Reddit API
reddit = praw.Reddit(
    client_id=secrets_reddit.REDDIT_API_CLIENT_ID,
    client_secret=secrets_reddit.REDDIT_API_CLIENT_SECRET,
    user_agent=secrets_reddit.REDDIT_API_USER_AGENT,
)

In [2]:
print(reddit) 

<praw.reddit.Reddit object at 0x7fd4aa718190>


<details>
<summary>Expected output:</summary>   

```<praw.reddit.Reddit object at 0x10f8a0ac0>```
</details>

#### 4. Instantiate a `subreddit` object

Lastly, create a `subreddit` object for your favorite subreddit and inspect the object. The expected outputs shown are from `r/machinelearning` unless otherwise specified.

In [10]:
subreddit = reddit.subreddit("doge")

What is the display name of the subreddit?

In [11]:
print(subreddit.display_name)

doge


<details>
<summary>Expected output:</summary>   

    machinelearning
</details>

How about its title, is it different from the display name? 

Answer: Yes, although here the two differ only in capitalization (`doge` versus `DOGE`). Titles and display names can also be identical: the subreddit funny has a `display_name` of 'funny' and a `title` of 'funny'. In general, when you look at a subreddit page, the r/{name} is the display name and the name directly above it is the title.

In [12]:
print(subreddit.title)

DOGE


<details>
<summary>Expected output:</summary>   

    Machine Learning
</details>

Print out the description of the subreddit:

In [13]:
print(subreddit.description)

[](http://www.reddit.com/message/compose/?to=CelestialWalrus&subject=Sidebar+ad+on+rdoge&message=Put+your+300x250+ad+link+in+here&?sidebarad)

[](http://www.reddit.com/r/doge?spyingdoge)

[Free sidebar ad in /r/doge](http://www.reddit.com/message/compose/?to=CelestialWalrus&subject=Sidebar+ad+on+rdoge&message=Put+your+300x250+ad+link+in+here)

**Rules:**

[Additional explainations for these rules can be found on our Rules wiki page.](https://old.reddit.com/r/doge/wiki/rules)

*hover for details*

| | |
|-|-|
|1. No posts that are not related to Doge|This is subreddit for mainly Kabosu-related (the "original" doge) content, but other animals are allowed (and photoshops with Doge).|
|1a. No Forced / Ironic / Surreal Doge Posts|This rule has been expanded to cover 'forced' Doge posts that feature the original 'Doge' image, but have been modified in such a way that does not relate to the Doge meme. For clarification, please see the [Ironic Doge Meme KnowYourMeme](https://knowyourmeme.com/m

<details>
<summary>Expected output:</summary>

    **[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
    --------
    +[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
    --------
    +[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
    --------
    +[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
    --------
    +[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict
</details>

### Task II: Parse comments

#### 1. Top Posts of All Time

Find the titles of the top 10 posts of **all time** from your favorite subreddit. Refer to the [Obtain Submission Instances from a Subreddit section](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html) if necessary. Verify that the titles match what you read on Reddit.

In [14]:
# try running this line, what do you see? press q once you are done
# Answer: The ? before a method brings up the help page for that method.
#         The help page shows the documentation for whatever follows the '?'; in this case
#         it brings up the documentation for the method subreddit.top
?subreddit.top 

In [15]:
for submission in subreddit.top(limit=10, time_filter="all"):
    print(submission.title)
    # Output: the submission's title
    print(submission.score)
    # Output: the submission's score
    print(submission.id)
    # Output: the submission's ID
    print(submission.url)
    # Output: the URL the submission points to or the submission's URL if it's a self post

DOGE PHONES HOME with Elon
3160
mrpkjg
https://i.redd.it/66cp6qmtwet61.jpg
Wow. So wolf. Much howl. Very moon.
1930
1s5rf8
http://i.imgur.com/klb812s.jpg
Lost Doge
1758
1no2y0
http://i.imgur.com/4tz2eNt.jpg
Such angle many neck wow
1310
1r7e49
http://i.imgur.com/e4CiCs7.gif
The Firefox icon has never looked better
1281
1t32gx
https://people.mozilla.org/~smartell/meme/such-logo.gif
Much knife. Such Stab. Wow.
1236
1r3shx
http://i.imgur.com/z4MPTZb.png
Wow such PlayStation
1225
1pzbnb
http://imgur.com/CAU3pJa
I will never get over how perfect this is.
1228
25e2fv
http://i.imgur.com/uMwpKP0.gif
Such wink! Such sleep!
1131
mkm6sg
https://i.imgur.com/wn3HXHG.png
doge on a calculator
1038
1n7od6
http://i.imgur.com/QaqHdPg.jpg


<details> <summary>Expected output:</summary>

    [Project] From books to presentations in 10s with AR + ML
    [D] A Demo from 1993 of 32-year-old Yann LeCun showing off the World's first Convolutional Network for Text Recognition
    [R] First Order Motion Model applied to animate paintings
    [N] AI can turn old photos into moving Images / Link is given in the comments - You can also turn your old photo like this
    [D] This AI reveals how much time politicians stare at their phone at work
    [D] Types of Machine Learning Papers
    [D] The machine learning community has a toxicity problem
    [Project] NEW PYTHON PACKAGE: Sync GAN Art to Music with "Lucid Sonic Dreams"! (Link in Comments)
    [P] Using oil portraits and First Order Model to bring the paintings back to life
    [D] Convolution Neural Network Visualization - Made with Unity 3D and lots of Code / source - stefsietz (IG)    
</details>
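
The loop above prints each field on its own line, which is hard to reuse later. A minimal sketch of collecting the same four attributes into plain dictionaries follows; `submission_records` is a hypothetical helper, and the stand-in object uses made-up values so the block runs without network access.

```python
from types import SimpleNamespace


def submission_records(submissions):
    """Turn an iterable of submission-like objects into a list of plain dicts."""
    return [
        {"id": s.id, "title": s.title, "score": s.score, "url": s.url}
        for s in submissions
    ]


# Stand-in object with made-up values (an assumption for illustration only).
# With PRAW, pass subreddit.top(limit=10, time_filter="all") instead.
fake_submissions = [
    SimpleNamespace(id="abc123", title="Example post", score=42, url="https://example.com")
]
records = submission_records(fake_submissions)
print(records[0]["title"])
```

A list of dictionaries like this can be sorted, filtered, or loaded into a table without calling the API again.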

#### 2. Top 10 Posts of This Week

What are the titles of the top 10 posts of **this week** from your favorite subreddit?

In [16]:
for submission in subreddit.top(limit=10, time_filter="week"):
    print(submission.title)
    # Output: the submission's title
    print(submission.score)
    # Output: the submission's score
    print(submission.id)
    # Output: the submission's ID
    print(submission.url)
    # Output: the URL the submission points to or the submission's URL if it's a self post

Will Elon make Doge The Currency of Twitter?
8
yjmflb
https://i.redd.it/7lbngmn0uex91.jpg


<details><summary>Expected output:</summary>

    [N] Ian Goodfellow, Apple’s director of machine learning, is leaving the company due to its return to work policy. In a note to staff, he said “I believe strongly that more flexibility would have been the best policy for my team.” He was likely the company’s most cited ML expert.
    [R][P] Thin-Plate Spline Motion Model for Image Animation + Gradio Web Demo
    [P] I’ve been trying to understand the limits of some of the available machine learning models out there. Built an app that lets you try a mix of CLIP from Open AI + Apple’s version of MobileNet, and more directly on your phone's camera roll.
    [R] Meta is releasing a 175B parameter language model
    [N] Hugging Face raised $100M at $2B to double down on community, open-source & ethics
    [P] T-SNE to view and order your Spotify tracks
    [D] : HELP Finding a Book - A book written for Google Engineers about foundational Math to support ML
    [R] Scaled up CLIP-like model (~2B) shows 86% Zero-shot on Imagenet
    [D] Do you use NLTK or Spacy for text preprocessing?
    [D] Democratizing Diffusion Models - LDMs: High-Resolution Image Synthesis with Latent Diffusion Models, a 5-minute paper summary by Casual GAN Papers
</details>

💽❓ Data Question:

Check out what other attributes the `praw.models.Submission` class has in the [docs](https://praw.readthedocs.io/en/stable/code_overview/models/submission.html). 

1. After having a chance to look through the docs, is there any other information that you might want to extract? How might this additional data help you?

*Answer: Yes. The number of comments on a submission (`num_comments`), the upvote ratio (`upvote_ratio`, the percentage of upvotes out of all votes on the submission), and the submission's score (`score`) are helpful as a start. A higher upvote ratio, a higher score, and a larger number of comments could all inform a sentiment analysis model. In particular, `upvote_ratio` is a fairly direct measure of how much users liked the submission. These three fields are also convenient because they are plain numbers, which are easier to work with than, say, pictures or text. 

*Additionally, you can pull the comments attached to a submission and analyze each in turn. These could be either text or image comments. An initial manual read of the comments gives us an idea of the sentiment of a submission and how well it matches the three metrics mentioned above (`num_comments`, `upvote_ratio`, `score`). Of note, the comments can be of various types, but they are all returned together in a `CommentForest` object. See the code below for what that looks like for the all-time top 10.

Write a sample piece of code below extracting three additional pieces of information from the submission below.

In [17]:
for submission in subreddit.top(limit=10, time_filter="all"):
    print(submission.num_comments)
    # Output: the number of comments on a submission
    print(submission.upvote_ratio)
    # Output: the percentage of upvotes from all votes on the submission
    print(submission.score)
    # Output: the submission's number of upvotes
    for sub in submission.comments:
        print(sub.body)
    # Output: the body of each top-level comment on the submission
    


362
0.99
3158
Doge to all the way to 1 dollar
I’ve made over 8,000 in 35 days! I’m loving it!
Robinhood shut that shit down just like they did GME!!! Can't buy or sell right now!
i'll allow it
What’s a good amount to invest to doge right now ? I don’t want to buy too much or too little
[deleted]
Is this ever going to stop?
We broke the doge! RH says outage
I’d wait right now I think .33 was the peak and we’re gonna see the correction now. Some scary things are one man holds 20 percent of the supply, 50 percent is owned by 5 people. And because it’s not capped. To get doge to 500 roughly 1/100th of but coin not even. It would take more then the worlds entire gdp
Yo buy the dips bois
im predicting Dogie to Dump... a turd on the ground 🤣🐶🔥💯
Listen to I'm A Alien by LiricoTheKiDD on #SoundCloud
https://soundcloud.app.goo.gl/5CLoM
Lots of blessings let’s get it 🙏 god bless 🙏
Risking it all for $$$ memes 🚀
What is everyones doge strategy if you bought late  at like .15 ?
`Im holding my doge 

omg im in tears right now.
At first I was like..what is this? then I saw the face 10/10 thank you for making me laugh
       wow 

                so subtle

      adoge photoshop

       
Fucking Moon Moon. 
      wow
                                              many wild

                      so original doge

                                                 much photogenic
              very bff

                                                              wow
Such shop

Much Blend

Wow

Very Impress
          wow 

                                                    such front page

                       very successful 

       
such woofe
wow
can not contain
very wild
Wow
It makes me uncomfortable. Not sure why...
Goddamnit moon moon!
              wow
                    much side pains
     very lol            
                                    WOW
          such stares
Shibes are genetically closest to wolves of all doge breeds. now you know. Wow.
Well done!
Actually mak

"Wrong" is the best part. 
      wow
               such repost
    many karma 
              wow
         deja shibe 
                      
I laughed way too hard at this. 
agility, not 2 worry Haha
http://www.youtube.com/watch?v=gvdf5n-zI14
Cutest doge gif ever
18
0.91
1274
[deleted]
If you did this, I fucking love you. If you didn't, I still fucking love you for showing it to me. 
     wow             very surprise

             much unexpected

                                              so browser
           such firefox
made a .ico out of it

download it here
http://www.megafileupload.com/en/file/480586/IconDogeFoxe-ico.html

sample
http://nsa33.casimages.com/img/2013/12/20/131220015938346075.png
You forgot, "many addon"
This is so excellently done.
Excellent work. Improved my day. 
The wapapapapow really took it to the next level
much scary

edit: wow
This almost makes me want to switch back to Firefox.
Not sure if Mozilla marketing or fanmade
I really want the gif to just lo

💽❓ Data Question:

2. Is there any information available that might be a concern when it comes to Ethical Data? 

*Answer: Looking at the submission object and the attributes the API offers, some information might be a concern when it comes to Ethical Data. First, some data affects the privacy of Reddit users. Examples include the attributes 'author', 'comments', and possibly 'poll_data', since it can contain voting information. The 'comments' field could reveal information about the user or others through links, private discussions held in a public forum, and tagged users, to name a few examples. Of note, the 'name' attribute appears to already be anonymized, so it is not a privacy concern. In summary, the 'author', 'poll_data', and 'comments' attributes should either be excluded or anonymized and aggregated as much as possible.

*In addition to privacy concerns, there is potential bias in data fields such as 'created_utc' and 'comments'. For the former, the 'created_utc' timestamp tells us the exact time distribution of submissions. For example, the all-time top 10 submissions of the DOGE subreddit are dominated by the years 2013 and 2014. This could skew the sentiment analysis toward what was positive and negative back in 2013 and 2014 rather than the present day. Some effort should be made to ensure that all years are represented as equally as possible. For the latter, the 'comments' are posted by real Reddit users with real biases, especially in communities of interest, which users tend to join because they already like the topic. In the case of r/doge, you can see two very different communities within the comments. One is centered on crypto and the other on general doge memes. This even led the r/doge moderators to ban dogecoin/cryptocurrency posts. Nevertheless, these comments persist in the history of this subreddit and should probably be removed. 

*Examples of the attributes 'author' and 'created_utc', as mentioned above, are shown in the code chunk below.

In [45]:
import datetime  # converts Reddit's Unix timestamps into readable dates

for submission in subreddit.top(limit=10, time_filter="all"):
    print(submission.author)
    # Output: the author of the submission (None if the account was deleted)
    print(datetime.datetime.fromtimestamp(submission.created_utc))
    # Output: the creation time of the submission, shown in the local time zone

Thesaafdaaf
2021-04-15 18:19:42
Emperor_NOPEolean
2013-12-05 08:53:21
aguilar_s24
2013-10-03 14:07:00
SkyyLord
2013-11-22 02:48:55
tobiasahlin
2013-12-17 07:32:02
GoNavy_09
2013-11-20 20:25:51
None
2013-11-05 17:11:39
RankedQueue
2014-05-12 16:49:03
VerGuy
2021-04-05 10:38:22
None
2013-09-26 19:50:32
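
Note that `datetime.datetime.fromtimestamp` without a `tz` argument returns local time, so the printed times above depend on the machine that ran the cell. The sketch below shows one way to get timezone-aware UTC values and to count posts per year, which is the distribution the answer above worries about. The helper names and the three example timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def to_utc(created_utc):
    """Convert a Unix timestamp in seconds into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


def posts_per_year(timestamps):
    """Count how many timestamps fall into each calendar year (UTC)."""
    return Counter(to_utc(ts).year for ts in timestamps)


# Made-up timestamps for illustration; with PRAW these would come from
# submission.created_utc for each submission in the listing.
example_timestamps = [1380000000.0, 1385000000.0, 1618500000.0]
print(posts_per_year(example_timestamps))
```

A heavily lopsided count is a quick signal that the sample over-represents a particular period.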


#### 3. Comment Code

Add comments to the code block below to describe what each line of the code does (Refer to [Obtain Comment Instances Section](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html) when necessary). The code is adapted from [this tutorial](https://praw.readthedocs.io/en/stable/tutorials/comments.html)

The purpose is 
1. to understand what the code is doing 
2. to start commenting your code whenever it is not self-explanatory, if you do not already (others will thank you, YOU will thank you later 😊) 

In [46]:
%%time
# %%time prints the wall time for the entire cell, i.e. how much time has passed on a clock
from praw.models import MoreComments

# Store comments in list 'top_comments'
top_comments = []

# Loop over the top 10 submissions for the chosen subreddit and grab all top-level comments
for submission in subreddit.top(limit=10):
    # Loop over each top-level comment of the submission (submission.comments is a 'CommentForest' object)
    for top_level_comment in submission.comments:
        # Skip MoreComments placeholders (the "load more comments" links); they have no body attribute
        if isinstance(top_level_comment, MoreComments):
            continue
        # Otherwise it is a regular comment, so append its text to our stored comments for the subreddit
        top_comments.append(top_level_comment.body)

CPU times: user 69.2 ms, sys: 9.63 ms, total: 78.8 ms
Wall time: 1min 46s


#### 4. Inspect Comments

How many comments did you extract from the last step? Examine a few comments. 

In [59]:
# Show the number of top level comments extracted from the subreddit top 10
print(len(top_comments), " comments extracted")

# Analyze some top level comments
print("Example comments below:")
print(top_comments[1]) # comment from when crypto took over the r/doge subreddit
print(top_comments[2]) # comment from when crypto took over the r/doge subreddit - notice the temperament change
print(top_comments[3]) # comment from when crypto took over the r/doge subreddit
print(top_comments[133]) # comment from when crypto took over the r/doge subreddit, last one before it shifts to older doge memes
print(top_comments[200]) # unclear comment, shows that sometimes format characters are in the strings
print(top_comments[240]) # Standard doge meme comment
print(top_comments[254]) # Standard doge meme comment

255  comments extracted
Example comments below:
I’ve made over 8,000 in 35 days! I’m loving it!
Robinhood shut that shit down just like they did GME!!! Can't buy or sell right now!
i'll allow it
Blow up BTT it’s still low
*ow
Much wow
For anyone wondering, yes this is the original doge calculator meme. Nothing predates it. I am the lore


In [65]:
import random

[random.choice(top_comments) for i in range(3)]

['Wow',
 'This is so excellently done.',
 'Gemini and crypto.com sells Doge, and s few more now']

<details> <summary>Some of the comments from `r/machinelearning` subreddit are:</summary>

    ['Awesome visualisation',
    'Similar to a stack or connected neurons.',
    'Will this Turing pass the Turing Test?']
</details>

💽❓ Data Question:

3. After having a chance to review a few samples of 5 comments from the subreddit, what can you say about the data? 

HINT: Think about the "cleanliness" of the data, the content of the data, think about what you're trying to do - how does this data line up with your goal?

*Answer: If we want to run sentiment analysis on the r/doge data, there are a few things to note. First, a large share of the top-level comments, specifically 133 of 255, come from the dogecoin/cryptocurrency community briefly taking over the r/doge subreddit in 2021. It might not be prudent to include these in the sentiment analysis; they could be treated as noise that requires removal. Second, the comments are a mix of text, images, images made from text characters, tagged users, emoji, and links. Some of these raise privacy concerns and will need to be anonymized. Some may also not be useful with a text-only sentiment analyzer because they are images. Third, some comments use special characters such as '*' and will require cleaning before use. Fourth, many of the original r/doge comments are very old, so some effort should be made to see whether data outside 2013 and 2014 is available to keep the data as unbiased as possible.   

*In summary, the r/doge comments will require cleaning; some top-level comments might need to be removed based on data type (image, text, emoji, etc.), depending on the type of sentiment analysis used; data considered noise should be removed; some fields in the text comments will require anonymization; and finally, the data might be biased toward 2013-2014, so capturing more data from other years is recommended.
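
The cleaning steps mentioned above could start with something like the following sketch, which removes links, collapses the whitespace used for "text art" comments, and drops placeholder comments. The helper names are hypothetical, and treating `[removed]` as a placeholder alongside `[deleted]` is an assumption; only `[deleted]` appears in the output above.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
# "[deleted]" appears in the scraped output; "[removed]" is assumed to be a similar placeholder.
PLACEHOLDERS = {"[deleted]", "[removed]"}


def clean_comment(text):
    """Strip URLs and collapse whitespace; return None if nothing useful is left."""
    text = URL_PATTERN.sub("", text)
    text = " ".join(text.split())
    if not text or text in PLACEHOLDERS:
        return None
    return text


def clean_comments(comments):
    """Clean every comment and keep only those that still contain text."""
    cleaned = (clean_comment(c) for c in comments)
    return [c for c in cleaned if c is not None]
```

Whether to drop links and emoji at all depends on the sentiment model; this sketch only covers the text-only case.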

#### 5. Extract Top Level Comment from Subreddit `TSLA`.

Write your code to extract top level comments from the top 10 topics of a time period, e.g., year, from subreddit `TSLA` and store them in a list `top_comments_tsla`.  

In [67]:
# Instantiate subreddit object for 'TSLA'
subreddit_tsla = reddit.subreddit("TSLA")

# Grab top-level comments for the top 10 topics from the last year
# Store comments in list 'top_comments_tsla'
top_comments_tsla = []

# Loop over the top 10 submissions for the chosen subreddit and grab all top-level comments
for submission in subreddit_tsla.top(limit=10, time_filter="year"):
    # Loop over each top-level comment of the submission (submission.comments is a 'CommentForest' object)
    for top_level_comment in submission.comments:
        # Skip MoreComments placeholders (the "load more comments" links); they have no body attribute
        if isinstance(top_level_comment, MoreComments):
            continue
        # Otherwise it is a regular comment, so append its text to our stored comments for the subreddit
        top_comments_tsla.append(top_level_comment.body)

In [68]:
len(top_comments_tsla) 

109

In [85]:
[random.choice(top_comments_tsla) for i in range(3)]

["Holding 415 shares since 450 (presplit)\nI'm gonna hold for another 5-10 years.\n\n\nIf Tsla eventually overpasses Apple as the most valuable company ill become a Teslionaire!!",
 'I’m still just going to wait for a stock split',
 '75 Shares baby \n\n90% of my portfolio']

<details>
<summary>Some of the comments from `r/TSLA` subreddit:</summary>

    ['I bought puts',
    '100%',
    'Yes. And I’m bag holding 1200 calls for Friday and am close to throwing myself out the window']
</details>

💽❓ Data Question:

4. Now that you've had a chance to review another subreddit's comments, do you see any differences in the kinds of comments each subreddit has, and how might this relate to bias?

*Answer: The r/TSLA subreddit's top-level comments for the top 10 submissions over the last year are heavily biased toward stock-related conversations. Additionally, the comments about stocks show a bullish trend. These comments are not balanced in terms of sentiment and might well show very little negative sentiment when passed through a sentiment analysis. One can also expect to see many neutral comments, since a lot of numbers are stated in conjunction with stock prices. 

*Compared with the r/doge community, the kinds of comments are very different. In stark contrast to r/TSLA, the r/doge community outright bans stock/crypto/dogecoin chatter. The r/doge community is also driven largely by graphics and text-based images, whereas r/TSLA is more text driven. 
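
One way to check the expected imbalance is to tally the predicted labels per subreddit once the comments have been scored. The sketch below assumes results shaped like the `label`/`score` dictionaries produced in Task III; `label_share` is a hypothetical helper and the example results are made up.

```python
from collections import Counter


def label_share(results):
    """Return the fraction of each label in a list of {'label': ..., 'score': ...} dicts."""
    counts = Counter(r["label"] for r in results)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Made-up results for illustration only; real values would come from the sentiment model.
example_results = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "POSITIVE", "score": 0.91},
    {"label": "NEGATIVE", "score": 0.75},
    {"label": "POSITIVE", "score": 0.66},
]
print(label_share(example_results))
```

Comparing the shares for two subreddits side by side gives a rough picture of how differently each community skews.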

### Task III: Sentiment Analysis

Let us analyze the sentiment of comments scraped from `r/TSLA` using a pre-trained HuggingFace model to make the inference. Take a [Quick tour](https://huggingface.co/docs/transformers/quicktour). 

#### 1. Import `pipeline`

In [86]:
from transformers import pipeline

#### 2. Create a Pipeline to Perform Task "sentiment-analysis"

In [87]:
sentiment_model = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
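
The message above recommends naming the model and revision explicitly. A minimal sketch of doing so is shown below, using the model name and revision reported in that message; whether that short revision string still resolves on the Hub is an assumption. The import sits inside the hypothetical `build_sentiment_model` function so that the block can be defined without downloading anything.

```python
# Model name and revision copied from the message printed by pipeline("sentiment-analysis").
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"


def build_sentiment_model():
    """Create a sentiment-analysis pipeline pinned to a specific model and revision."""
    # Imported lazily so that defining this function does not require transformers.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_NAME, revision=MODEL_REVISION)
```

Calling `build_sentiment_model()` downloads the model on first use and can replace the bare `pipeline("sentiment-analysis")` call.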


#### 3. Get one comment from list `top_comments_tsla` from Task II - 5.

In [96]:
random.seed(1) # set seed for reproducibility of 'comment' below
comment = random.choice(top_comments_tsla)

In [97]:
comment

'10 from 500 days. Bought one more during 800 low last week. Wishing my tax return would come in before giga Austin opens up!'

The example comment is: `'Bury Burry!!!!!'`. Print out what you get. For reproducibility, use the same comment in the next step; consider setting a seed.

#### 4. Make Inference!

In [100]:
sentiment = sentiment_model(comment) # predict sentiment of comment using the Hugging Face sentiment-analysis pipeline
print(type(sentiment)) # get type of output for variable 'sentiment'

<class 'list'>


What is the type of the output `sentiment`?

```
The type of output for sentiment is 'list'.
```

In [101]:
print(f'The comment: {comment}')
print(f'Predicted Label is {sentiment[0]["label"]} and the score is {sentiment[0]["score"]:.3f}')

The comment: 10 from 500 days. Bought one more during 800 low last week. Wishing my tax return would come in before giga Austin opens up!
Predicted Label is NEGATIVE and the score is 0.999


For the example comment, the output is:

    The comment: Bury Burry!!!!!
    Predicted Label is NEGATIVE and the score is 0.989

🖥️❓ Model Question:

1. What does the score represent?

*Answer: The score is the probability the model assigns to the predicted label. According to the [paper on this model](https://arxiv.org/pdf/1910.01108v4.pdf), a standard softmax function is applied at inference (i.e., when making predictions). This means that the scores of the two labels, POSITIVE and NEGATIVE, must sum to one. The model classifies the comment with whichever label has the higher score, and that score is the one reported.
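
To make the softmax step concrete, here is a small sketch in plain Python. The logits are made up, and the label order `[NEGATIVE, POSITIVE]` is an assumption about how the model indexes its outputs.

```python
import math


def softmax(logits):
    """Convert raw model outputs (logits) into probabilities that sum to one."""
    # Subtracting the maximum keeps exp() numerically stable without changing the result.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits, assumed to be ordered [NEGATIVE, POSITIVE].
labels = ["NEGATIVE", "POSITIVE"]
probs = softmax([2.0, -1.0])
best = max(range(len(labels)), key=lambda i: probs[i])
print(labels[best], round(probs[best], 3))
```

The reported score is the larger of the two probabilities, so for a two-label model it is never below 0.5.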

### Task IV: Put All Together

Let's put all the pieces together and create a simple script that will:

- get the subreddit
- get comments from the top posts for given subreddit
- run sentiment analysis 

#### Complete the Script

Once you complete the code, running the following block writes the code into a new Python script and saves it as `top_tlsa_comment_sentiment.py` in the same directory as the notebook. 

In [None]:
%%writefile top_tlsa_comment_sentiment.py

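import secrets_reddit as secrets  # credentials file from Task I; a bare "import secrets" may load Python's standard-library module instead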
import random

from typing import Dict, List

from praw import Reddit
from praw.models.reddit.subreddit import Subreddit
from praw.models import MoreComments

from transformers import pipeline


def get_subreddit(display_name:str) -> Subreddit:
    """Get subreddit object from display name

    Args:
        display_name (str): Name of the subreddit without the "r/" prefix, e.g. "TSLA"

    Returns:
        Subreddit: PRAW subreddit object for the given display name
    """
    reddit = Reddit(
        client_id=secrets.REDDIT_API_CLIENT_ID,        
        client_secret=secrets.REDDIT_API_CLIENT_SECRET,
        user_agent=secrets.REDDIT_API_USER_AGENT
        )
    
    subreddit = # YOUR CODE HERE
    return subreddit

def get_comments(subreddit:Subreddit, limit:int=3) -> List[str]:
    """ Get comments from subreddit

    Args:
        subreddit (Subreddit): PRAW subreddit object to read from
        limit (int, optional): Number of top submissions to scan. Defaults to 3.

    Returns:
        List[str]: Body text of each top-level comment
    """
    top_comments = []
    for submission in subreddit.top(limit=limit):
        for top_level_comment in submission.comments:
            if isinstance(top_level_comment, MoreComments):
                continue
            top_comments.append(top_level_comment.body)
    return top_comments

def run_sentiment_analysis(comment:str) -> Dict:
    """Run sentiment analysis on comment using default distilbert model
    
    Args:
        comment (str): Text of a single comment
        
    Returns:
        Dict: Sentiment analysis result with "label" and "score" keys
    """
    sentiment_model = # YOUR CODE HERE
    sentiment = sentiment_model(comment)
    return sentiment[0]


if __name__ == '__main__':
    subreddit = # YOUR CODE HERE
    comments = get_comments(subreddit)
    comment = # YOUR CODE HERE
    sentiment = run_sentiment_analysis(comment)
    
    print(f'The comment: {comment}')
    print(f'Predicted Label is {sentiment["label"]} and the score is {sentiment["score"]:.3f}')

Run the following block to see the output.

In [None]:
!python top_tlsa_comment_sentiment.py

<details><summary> Expected output:</summary>

    No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
    The comment: When is DOGE flying
    Predicted Label is POSITIVE and the score is 0.689
</details>

💽❓ Data Question:

5. Is the subreddit active? About how many posts or threads per day? How could you find this information?
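
One possible way to estimate activity is to take the creation times of the most recent posts and divide the number of posts by the time span they cover. The sketch below works on plain Unix timestamps; `posts_per_day` is a hypothetical helper and the result is only a rough estimate.

```python
SECONDS_PER_DAY = 86400


def posts_per_day(created_utc_values):
    """Rough average of posts per day over the span covered by the given timestamps."""
    timestamps = sorted(created_utc_values)
    if len(timestamps) < 2:
        return None  # not enough data to measure a time span
    span_days = (timestamps[-1] - timestamps[0]) / SECONDS_PER_DAY
    if span_days == 0:
        return None  # all posts share one timestamp, so no rate can be computed
    return len(timestamps) / span_days
```

With PRAW, the input could be built as `[s.created_utc for s in subreddit.new(limit=100)]`, which looks at the 100 newest submissions.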

💽❓ Data Question:

6. Does there seem to be a large distribution of posters or a smaller concentration of posters who are very active? What kind of impact might this have on the data?
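
A simple way to look at concentration is to count posts per author and measure what share comes from the most active few. The sketch below assumes a list of author names in which deleted accounts appear as `None`, as in the output earlier in this notebook; `top_author_share` is a hypothetical helper.

```python
from collections import Counter


def top_author_share(authors, top_n=3):
    """Fraction of posts written by the top_n most active authors."""
    # Deleted accounts show up as None and cannot be attributed, so they are skipped.
    names = [str(a) for a in authors if a is not None]
    if not names:
        return 0.0
    counts = Counter(names)
    top_total = sum(n for _, n in counts.most_common(top_n))
    return top_total / len(names)
```

A share close to 1.0 for a small `top_n` would suggest that a handful of posters dominate the data, so their opinions would weigh heavily in any aggregate sentiment.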