<p align="center" draggable="false"><img src="https://user-images.githubusercontent.com/37101144/161836199-fdb0219d-0361-4988-bf26-48b0fad160a3.png"
     width="200px"
     height="auto"/>
</p>

# <h1 align="center" id="heading">Sentiment Analysis of Reddit Data using Reddit API</h1>

In this live coding session, we leverage the Python Reddit API Wrapper (`PRAW`) to retrieve data from subreddits on [Reddit](https://www.reddit.com), and perform sentiment analysis using [`pipelines`](https://huggingface.co/docs/transformers/main_classes/pipelines) from [HuggingFace ( 🤗 the GitHub of Machine Learning )](https://techcrunch.com/2022/05/09/hugging-face-reaches-2-billion-valuation-to-build-the-github-of-machine-learning/), powered by [transformer](https://arxiv.org/pdf/1706.03762.pdf).

## Objectives

At the end of the session, you will 

- know how to work with APIs
- feel more comfortable navigating through documentation, even inspecting the source code
- understand what a `pipeline` object is in HuggingFace
- perform sentiment analysis using `pipeline`
- run a Python script from the command line and get the results

## How to Submit

- At the end of each task, commit* the work into the repository you created before the assignment
- After completing all tasks, make sure to push the notebook containing all code blocks and output cells to the repository you created before the assignment
- Submit the link to the notebook in Canvas

\***NEVER** commit a notebook displaying errors unless instructed otherwise. However, commit often; recall git ABC = **A**lways **B**e **C**ommitting.

## Tasks

### Task I: Instantiate a Reddit API Object

The first task is to instantiate a Reddit API object using [PRAW](https://praw.readthedocs.io/en/stable/), through which you will retrieve data. PRAW is a wrapper for the [Reddit API](https://www.reddit.com/dev/api) that makes interacting with it easier, unless you are already an expert in [`requests`](https://docs.python-requests.org/en/latest/).

#### 1. Install packages

Please ensure you've run all the cells in `imports.ipynb`, located [here](https://github.com/FourthBrain/MLE-8/blob/main/assignments/week-3-analyze-sentiment-subreddit/imports.ipynb), to make sure you have all the required packages for today's assignment.

####  2. Create a new app on Reddit 

Create a new app on Reddit and save the secret tokens; refer to this [Medium post](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c) for more details.

- Create a Reddit account if you don't have one, then log into your account.
- To access the API, we need to create an app. The website has changed slightly since the post was written: navigate to `preferences` > `apps`, or click [this link](https://www.reddit.com/prefs/apps) and scroll all the way down.
- Click to create a new app, fill in the **name**, choose `script`, and fill in the **description** and **redirect uri**. The redirect URI is where the user is sent after they've granted OAuth access to your application (more info [here](https://github.com/reddit-archive/reddit/wiki/OAuth2)). For our purpose, you can enter any URL, e.g., www.google.com, as shown below.


    <img src="https://miro.medium.com/max/700/1*lRBvxpIe8J2nZYJ6ucMgHA.png" width="500"/>
- Jot down `client_id` (left upper corner) and `client_secret` 

    NOTE: CLIENT_ID refers to "personal use script" and CLIENT_SECRET to "secret".
    
    <div>
    <img src="https://miro.medium.com/max/700/1*7cGAKth1PMrEf2sHcQWPoA.png" width="300"/>
    </div>

- Create `secrets_reddit.py` in the same directory as this notebook and fill in the `client_id` and `secret_id` obtained in the last step. We will need to import those constants in the next step.
    ```
    REDDIT_API_CLIENT_ID = "client_id"
    REDDIT_API_CLIENT_SECRET = "secret_id"
    REDDIT_API_USER_AGENT = "any string except bot; ex. My User Agent"
    ```
- Add `secrets_reddit.py` to your `.gitignore` file if not already done. NEVER push credentials to a repo, private or public. 
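As an alternative to a `secrets_reddit.py` file, the same three values can be read from environment variables, which keeps them out of the working tree entirely. The sketch below is only an illustration: the helper name `load_reddit_credentials` is hypothetical, and it assumes the environment variables are named like the constants above.

```python
import os
from typing import Dict, Mapping, Optional

# Names mirror the constants in secrets_reddit.py (an assumption of this sketch).
REQUIRED_KEYS = (
    "REDDIT_API_CLIENT_ID",
    "REDDIT_API_CLIENT_SECRET",
    "REDDIT_API_USER_AGENT",
)


def load_reddit_credentials(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return keyword arguments for praw.Reddit, read from environment variables."""
    env = os.environ if env is None else env
    # Fail early with a clear message if any value is missing or empty.
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing Reddit credentials: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_API_CLIENT_ID"],
        "client_secret": env["REDDIT_API_CLIENT_SECRET"],
        "user_agent": env["REDDIT_API_USER_AGENT"],
    }
```

With the variables exported in your shell, the instance could then be created with `praw.Reddit(**load_reddit_credentials())`.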

#### 3. Instantiate a `Reddit` object

Now you are ready to create a read-only `Reddit` instance. Refer to [documentation](https://praw.readthedocs.io/en/stable/code_overview/reddit_instance.html) when necessary.

In [10]:
import praw
import secrets_reddit

# Create a Reddit object which allows us to interact with the Reddit API
reddit = praw.Reddit(
    client_id=secrets_reddit.REDDIT_API_CLIENT_ID,
    client_secret=secrets_reddit.REDDIT_API_CLIENT_SECRET,
    user_agent=secrets_reddit.REDDIT_API_USER_AGENT,
)

In [11]:
# get MachineLearning subreddit data
ml_subreddit = reddit.subreddit('MachineLearning')

#### 4. Instantiate a `subreddit` object

Lastly, create a `subreddit` object for your favorite subreddit and inspect the object. The expected outputs you will see are from `r/machinelearning` unless otherwise specified.

What is the display name of the subreddit?

In [8]:
# get hottest posts from all subreddits
hot_posts = reddit.subreddit('all').hot(limit=5)
for post in hot_posts:
    print(post.title)

2 years ago today Rudy Giuliani held a press conference for Trump at Four Seasons Total Landscaping
meirl
“Groomer”-obsessed Gov. Ron DeSantis partied with students as a 23-year-old teacher | Former students say the then 23-year-old attended parties with students where alcohol was served.
He tried negotiating lol
I joined the Air Force just before this was announced. My 4 year contract has already expired.


In [7]:
pl_subreddit = reddit.subreddit('police')
#print(pl_subreddit.description)

hot_posts = reddit.subreddit('police').hot(limit=5)
for post in hot_posts:
    print(post.title)
   

A San Antonio police officer shoots this 17 year old boy for what the officer believes to be a stolen car. The boy was eating a burger and causing no disturbance.
Merry Christmas Amigazo!
Are police allowed to make fun of you in an interrogation?
Fun fact: Local police departments in Czech Republic has one or two "police ambulances", that are used to transport drunk people, or when the person needs to seek medical attention, but has to be supervised by police. Is this a thing in other countries like USA as well? NYPD has something similar.
Lord protect our brave men and women in blue🙏✝️ Officer Gadson POV:


<details>
<summary>Expected output:</summary>   

    machinelearning
</details>

How about its title, is it different from the display name?

<details>
<summary>Expected output:</summary>   

    Machine Learning
</details>
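Both answers are plain attributes of the `Subreddit` object: `display_name` and `title`. A minimal sketch follows, assuming a hypothetical helper named `describe_subreddit` and a stand-in object so that it runs without network access; with PRAW you would pass `ml_subreddit` instead.

```python
from types import SimpleNamespace


def describe_subreddit(subreddit, max_chars: int = 80) -> dict:
    """Collect the display name, title, and a short description preview."""
    return {
        "display_name": subreddit.display_name,
        "title": subreddit.title,
        "description_preview": subreddit.description[:max_chars],
    }


# Stand-in for a praw Subreddit (illustrative values only).
fake_subreddit = SimpleNamespace(
    display_name="machinelearning",
    title="Machine Learning",
    description="**[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**",
)
info = describe_subreddit(fake_subreddit, max_chars=20)
print(info)
```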

Print out the description of the subreddit:

In [9]:
# get MachineLearning subreddit data
ml_subreddit = reddit.subreddit('MachineLearning')

print(ml_subreddit.description)

**[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
--------
+[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
--------
+[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
--------
+[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
--------
+[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ANews)
--------
***[@slashML on Twitter](https://twitter.com/slashML)***
--------
***[Chat with us on Slack](https://join.slack.com/t/rml-talk/shared_invite/enQtNjkyMzI3NjA2NTY2LWY0ZmRjZjNhYjI5NzYwM2Y0YzZhZWNiODQ3ZGFjYmI2NTU3YjE1ZDU5MzM2ZTQ4ZGJmOTFmNWVkMzFiMzVhYjg)***
--------
**Beginners:**
--------
Please have a look at [our FAQ and Link-Collection](http://www.reddit.com/r/MachineLearning/wiki/index)

[Metacademy](http://www.metacademy.org) is a great resource which compiles le

<details>
<summary>Expected output:</summary>

    **[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
    --------
    +[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
    --------
    +[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
    --------
    +[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
    --------
    +[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict
</details>

### Task II: Parse comments

#### 1. Top Posts of All Time

Find the titles of the top 10 posts of **all time** from your favorite subreddit. Refer to the [Obtain Submission Instances from a Subreddit](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html) section if necessary. Verify that the titles match what you read on Reddit.
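For reference, PRAW's `Subreddit.top` takes a `time_filter` argument (`"all"`, `"day"`, `"hour"`, `"month"`, `"week"`, or `"year"`), whereas `hot` returns what is currently trending. The sketch below wraps that call in a hypothetical helper, `top_titles`; the `FakeSubreddit` class is a stand-in invented here so the example runs offline.

```python
from types import SimpleNamespace
from typing import List


def top_titles(subreddit, time_filter: str = "all", limit: int = 10) -> List[str]:
    """Return the titles of the top posts for the given time filter."""
    return [post.title for post in subreddit.top(time_filter=time_filter, limit=limit)]


class FakeSubreddit:
    """Offline stand-in that mimics the one PRAW method used above."""

    def top(self, time_filter="all", limit=10):
        titles = [f"{time_filter} post {i}" for i in range(1, 21)]
        return (SimpleNamespace(title=t) for t in titles[:limit])


titles = top_titles(FakeSubreddit(), time_filter="all", limit=10)
print(len(titles), titles[0])
```

With a real instance, `top_titles(ml_subreddit)` would list the all-time top posts and `top_titles(ml_subreddit, time_filter="week")` those of the past week.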

In [12]:
# get 10 hot (currently trending) posts from the MachineLearning subreddit
# for the all-time top posts, use .top(time_filter="all", limit=10) instead
mlhot_posts = reddit.subreddit('MachineLearning').hot(limit=10)
for post in mlhot_posts:
    print(post.title)

[D] Simple Questions Thread
[D] Machine Learning - WAYR (What Are You Reading) - Week 140
[P] COCO captions translation to Nepali using Meta AI's NLLB model
[D] Do you think there is a competitive future for smaller, locally trained/served models?
[P] Transcribe any podcast episode in just 1 minute with optimized OpenAI/whisper
[D] At what tasks are models better than humans given the same amount of data?
[D] Medium Article: How to code Temporal Distribution Characterization (TDC) for time series?
[P] Stable-diffusion's implementation of Paint-with-words : method from NVIDIA that generates images from text-labeled segmentation map.
[D] Git Re-Basin Paper Accused of Misinformation
[D] What's the best speech to speech deep fake voice project?


In [10]:
# try running this line; what do you see? Press q once you are done
?ml_subreddit.top 


In [57]:
# YOUR CODE HERE
# get hottest posts from all subreddits
allhot_posts = reddit.subreddit('all').hot(limit=10)
for post in allhot_posts:
    print(post.title)

[Postgame Thread] LSU Defeats Alabama 32-31 (OT)
Investors hard at work.
African painted dogs at the Oregon Zoo notice a visitor's service animal
The longest elbow plank by a female (4 hours and 20 minutes!)
Literally having her name repeatedly called over PA and holding up an entire flight for a cheeseburger
He really does have tiny hands (I'm a 5 ft. tall woman for reference)
[Charania] Sources: Nets have delivered Kyrie Irving six items he must complete to return to team: - Apologize/condemn movie - $500K donation to anti-hate causes - Sensitivity training - Antisemitic training - Meet with ADL, Jewish leaders - Meet with Joe Tsai to demonstrate understanding
1665 london deaths
meirl
he's having a good day!


<details> <summary>Expected output:</summary>

    [Project] From books to presentations in 10s with AR + ML
    [D] A Demo from 1993 of 32-year-old Yann LeCun showing off the World's first Convolutional Network for Text Recognition
    [R] First Order Motion Model applied to animate paintings
    [N] AI can turn old photos into moving Images / Link is given in the comments - You can also turn your old photo like this
    [D] This AI reveals how much time politicians stare at their phone at work
    [D] Types of Machine Learning Papers
    [D] The machine learning community has a toxicity problem
    [Project] NEW PYTHON PACKAGE: Sync GAN Art to Music with "Lucid Sonic Dreams"! (Link in Comments)
    [P] Using oil portraits and First Order Model to bring the paintings back to life
    [D] Convolution Neural Network Visualization - Made with Unity 3D and lots of Code / source - stefsietz (IG)    
</details>

#### 2. Top 10 Posts of This Week

What are the titles of the top 10 posts of **this week** from your favorite subreddit?

In [65]:
   
AIhot_posts = reddit.subreddit('ArtificalIntelligence').hot(limit=10)
for post in AIhot_posts:
    print(post.title)

AI Chrome Extension to Fight Fake News
What is artificial intelligence (AI)?
AIA - ARTIFICIAL INTELLIGENCE ACT
TPU killer?
Guys!!! HELP What do you make of this???
Why we are already invaded by robots and brainwashed..
Apache Spark Consulting Company | Apache Spark Developers
AI and language Intelligence: A Learning Companion For Employees
10 crazy AI Tools Which Gonna Blow Up
Artificial Intelligence And Blockchain: The Ideal Partners!


<details><summary>Expected output:</summary>

    [N] Ian Goodfellow, Apple’s director of machine learning, is leaving the company due to its return to work policy. In a note to staff, he said “I believe strongly that more flexibility would have been the best policy for my team.” He was likely the company’s most cited ML expert.
    [R][P] Thin-Plate Spline Motion Model for Image Animation + Gradio Web Demo
    [P] I’ve been trying to understand the limits of some of the available machine learning models out there. Built an app that lets you try a mix of CLIP from Open AI + Apple’s version of MobileNet, and more directly on your phone's camera roll.
    [R] Meta is releasing a 175B parameter language model
    [N] Hugging Face raised $100M at $2B to double down on community, open-source & ethics
    [P] T-SNE to view and order your Spotify tracks
    [D] : HELP Finding a Book - A book written for Google Engineers about foundational Math to support ML
    [R] Scaled up CLIP-like model (~2B) shows 86% Zero-shot on Imagenet
    [D] Do you use NLTK or Spacy for text preprocessing?
    [D] Democratizing Diffusion Models - LDMs: High-Resolution Image Synthesis with Latent Diffusion Models, a 5-minute paper summary by Casual GAN Papers
</details>

💽❓ Data Question:

Check out what other attributes the `praw.models.Submission` class has in the [docs](https://praw.readthedocs.io/en/stable/code_overview/models/submission.html). 

1. After having a chance to look through the docs, is there any other information that you might want to extract? How might this additional data help you?

Write a sample piece of code below extracting three additional pieces of information from the submission below.

In [86]:
for comment in reddit.subreddit("MachineLearning").comments(limit=5):
    print(comment.author)

give_me_the_truth
seiqooq
starstruckmon
TiredOldCrow
GroundbreakingArm944
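The `Submission` attributes `score`, `num_comments`, and `created_utc` are one possible choice of three additional fields. A minimal sketch, assuming a hypothetical helper `summarize_submission` and a stand-in submission with made-up values so that it runs offline:

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def summarize_submission(submission) -> dict:
    """Pull three extra fields: score, comment count, and creation time (UTC)."""
    created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
    return {
        "title": submission.title,
        "score": submission.score,
        "num_comments": submission.num_comments,
        "created": created.isoformat(),
    }


# Stand-in submission with made-up values, not real Reddit data.
fake_submission = SimpleNamespace(
    title="[D] Example post", score=123, num_comments=45, created_utc=0
)
summary = summarize_submission(fake_submission)
print(summary)
```

With PRAW, the same helper could be applied to each item of `ml_subreddit.top(limit=10)`.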


💽❓ Data Question:

2. Is there any information available that might be a concern when it comes to Ethical Data? 
Answer: I don't see a pseudonymous username being an issue, but perhaps other details could be.

#### 3. Comment Code

Add comments to the code block below to describe what each line of the code does (refer to the [Obtain Comment Instances](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html) section when necessary). The code is adapted from [this tutorial](https://praw.readthedocs.io/en/stable/tutorials/comments.html).

The purpose is

1. to understand what the code is doing
2. to start commenting your code whenever it is not self-explanatory, if you have not already (others will thank you, YOU will thank you later 😊)

In [12]:
%%time
from praw.models import MoreComments

# Initialize an empty list to hold the comment texts
top_comments = []

# Loop over the top 10 submissions (posts) of the subreddit
for submission in ml_subreddit.top(limit=10):
    # Loop over the top-level comments of each submission
    for top_level_comment in submission.comments:
        # Skip MoreComments placeholders, which stand for collapsed comments and have no body
        if isinstance(top_level_comment, MoreComments):
            continue
        # Store the text of the comment
        top_comments.append(top_level_comment.body)
#print(top_comments)

CPU times: user 453 ms, sys: 34 ms, total: 487 ms
Wall time: 2min 3s
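The PRAW comments tutorial linked above also describes `replace_more`, which removes or expands the `MoreComments` placeholders up front so that no `isinstance` check is needed inside the loop. A sketch of that variant follows; the helper name and the `FakeCommentForest` stand-in are hypothetical and exist only so the example runs offline.

```python
from types import SimpleNamespace
from typing import Iterable, List


def collect_top_level_bodies(submissions: Iterable) -> List[str]:
    """Gather the text of every top-level comment across the submissions."""
    bodies = []
    for submission in submissions:
        # limit=0 removes all MoreComments placeholders instead of expanding them.
        submission.comments.replace_more(limit=0)
        bodies.extend(comment.body for comment in submission.comments)
    return bodies


class FakeCommentForest(list):
    """Stand-in for PRAW's comment forest: a list with a replace_more method."""

    def replace_more(self, limit=32):
        # Keep only items that look like real comments (they have a body).
        self[:] = [item for item in self if hasattr(item, "body")]
        return []


fake_submissions = [
    SimpleNamespace(comments=FakeCommentForest(
        [SimpleNamespace(body="fantastic!"), SimpleNamespace(count=12)]
    )),
    SimpleNamespace(comments=FakeCommentForest([SimpleNamespace(body="The future")])),
]
bodies = collect_top_level_bodies(fake_submissions)
print(bodies)
```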


#### 4. Inspect Comments

How many comments did you extract from the last step? Examine a few comments. 

In [42]:
# YOUR CODE HERE  # the answer may vary; 693 for r/machinelearning
print(len(top_comments))
print(top_comments[1:15])

746
['Simple yet very useful. Thank you for sharing the code.', 'The future 🤯', 'Ohh the nightmare of making this into a stable product... Enough to drive you mad just thinking about it', 'Almost guaranteed, Apple will copy your idea in 3, 2, 1....', 'Wtffff. Well that was incredible.', 'Apple can’t wait to steal this and not credit the creators', 'fantastic!', 'Why did the boxes in the diagram turn gray?', 'How does the Algorithm decide what it cuts out from the input pictures? \n\nFor example it only cut out the two people in the picture and not the surroundings.\n\nAmazing project though!', '#WITCH!  BURN THEM!', 'This will be amazing if released, even as a beta. Definitely can see this being very useful', 'Any sufficiently advanced technology is indistinguishable from magic.', 'Really good work, thanks for sharing!', "I'm extremely impressed with it cutting dark hair from a brown background. Is that the pixel's camera doing the hard work or is it U^2_Net ? Have you tried it with ot

In [28]:
import random

[random.choice(top_comments) for i in range(3)]

['Why tf is this visually satisfying',
 'What about tablets and computers?',
 'This feels scary but I would love to give all my old pictures a spin.']

<details> <summary>Some of the comments from `r/machinelearning` subreddit are:</summary>

    ['Awesome visualisation',
    'Similar to a stack or connected neurons.',
    'Will this Turing pass the Turing Test?']
</details>

💽❓ Data Question:

3. After having a chance to review a few samples of 5 comments from the subreddit, what can you say about the data? 

HINT: Think about the "cleanliness" of the data, the content of the data, think about what you're trying to do - how does this data line up with your goal?
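One way to make "cleanliness" concrete is a small normalization pass before any modeling. Reddit shows placeholder text such as `[deleted]` for deleted comments, and comment bodies often contain runs of newlines. The sketch below is an assumption about what a reasonable first cleaning step could look like, not a prescribed method; `clean_comments` and its `min_chars` threshold are hypothetical.

```python
import re
from typing import Iterable, List

# Placeholder bodies that carry no sentiment information.
PLACEHOLDERS = {"[deleted]", "[removed]"}


def clean_comments(comments: Iterable[str], min_chars: int = 3) -> List[str]:
    """Collapse whitespace and drop placeholders and very short comments."""
    cleaned = []
    for text in comments:
        text = re.sub(r"\s+", " ", text).strip()
        if text in PLACEHOLDERS or len(text) < min_chars:
            continue
        cleaned.append(text)
    return cleaned


sample = ["fantastic!", "[deleted]", "How does it work? \n\nAmazing project though!", "  "]
print(clean_comments(sample))
```

Whether emojis, URLs, or quoted text should also be removed depends on the model; a pre-trained tokenizer may handle some of these already.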

#### 5. Extract Top Level Comment from Subreddit `TSLA`.

Write your code to extract top level comments from the top 10 topics of a time period, e.g., year, from subreddit `TSLA` and store them in a list `top_comments_tsla`.  

In [15]:
from praw.models import MoreComments
tsla_reddit = reddit.subreddit('TSLA')

top_comments_tsla = []

for submission in tsla_reddit.top(limit=10):
    for top_level_comment in submission.comments:
        if isinstance(top_level_comment, MoreComments):
            continue
        top_comments_tsla.append(top_level_comment.body)
print(top_comments_tsla)

['ho lee fuk \n\nyou got anymore insider information? 👀👀', "What will happen if you post that GME it's the new buy target from them? 🤣", 'When are you all buying $DOGE, and how much will you all buy?', 'Papa Musk?? 😘😘😘', 'I really don’t understand what Musk is trying to do. It seems he is trying to legitimize BTC and create a sustainable ecosystem for it. But I question whether Tesla shareholders are going to be happy with such an unplanned use of invested capital. Musk is not the majority of Tesla, and big shareholders are very very picky about where their portion of $1.5bm goes to!', "lmk when they start loading up on Doge and I'm in", '[deleted]', 'When is DOGE flying', 'Are they gonna fire you lol', "You're a fucking legend", 'Give this man a raise! (In BTC)', 'Do you have twitter or instagram?', "Could you point me in the right direction on to how to code one of this bots myself. I'm a developer and have an extensive trade background. I've been interested in trading algos for a wh

In [17]:
import random
[random.choice(top_comments_tsla) for i in range(3)]

["What will happen if you post that GME it's the new buy target from them? 🤣",
 'trust the process',
 'Karma would eventually catch up with TSLA making a shitty product which is overpriced and all the false claims about autopilot and putting people’s lives in danger']

<details>
<summary>Some of the comments from `r/TSLA` subreddit:</summary>

    ['I bought puts',
    '100%',
    'Yes. And I’m bag holding 1200 calls for Friday and am close to throwing myself out the window']
</details>

💽❓ Data Question:

4. Now that you've had a chance to review another subreddit's comments, do you see any differences in the kinds of comments either subreddit has, and how might this relate to bias?

Answer: From the comments above, it seems that people who comment on Reddit have strong opinions, which may or may not be factual, honest, or true. It is a platform for venting extreme emotions, cursing and swearing included, as well as for sharing genuine thoughts, and most of the opinions do not seem to be moderated. So yes, the comments on Reddit are possibly biased towards the extreme views of people who follow Tesla.

### Task III: Sentiment Analysis

Let us analyze the sentiment of comments scraped from `r/TSLA` using a pre-trained HuggingFace model to make the inference. Take a [Quick tour](https://huggingface.co/docs/transformers/quicktour). 

#### 1. Import `pipeline`

In [19]:
from transformers import pipeline

#### 2. Create a Pipeline to Perform Task "sentiment-analysis"

In [20]:
sentiment_model = pipeline("sentiment-analysis")
# Note the quotes: this scores the literal string "top_comments_tsla", not the list of comments
sentiment_model("top_comments_tsla")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.8886483907699585}]

#### 3. Get one comment from list `top_comments_tsla` from Task II - 5.

In [21]:
comment = random.choice(top_comments_tsla)

In [22]:
comment

'I’m holding 34 shares at $418 average. Will I be able to average down once the stock splits?'

The example comment is: `'Bury Burry!!!!!'`. Print out what you get. For reproducibility, use the same comment in the next step; consider setting a seed.
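Setting a seed makes `random.choice` return the same comment on every run. A minimal sketch, using a short made-up list in place of `top_comments_tsla` and an arbitrary seed value of 42:

```python
import random

# Small illustrative list standing in for top_comments_tsla.
comments = ["I bought puts", "100%", "trust the process", "When is DOGE flying"]

random.seed(42)  # fix the generator state
first_pick = random.choice(comments)

random.seed(42)  # same seed, same state, same choice
second_pick = random.choice(comments)

print(first_pick == second_pick)
```

Note that the seed only fixes the random draw; if the list of comments returned by Reddit changes between runs, the selected comment can still differ.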

#### 4. Make Inference!

In [26]:
sentiment = sentiment_model(comment)

What is the type of the output `sentiment`?

```
YOUR ANSWER HERE

A list containing one dict with the keys 'label' and 'score'.
```

In [27]:
print(f'The comment: {comment}')
print(f'Predicted Label is {sentiment[0]["label"]} and the score is {sentiment[0]["score"]:.3f}')

The comment: I’m holding 34 shares at $418 average. Will I be able to average down once the stock splits?
Predicted Label is NEGATIVE and the score is 0.998


For the example comment, the output is:

    The comment: Bury Burry!!!!!
    Predicted Label is NEGATIVE and the score is 0.989

🖥️❓ Model Question:

1. What does the score represent? 

Answer: The score is the model's confidence in the predicted label, i.e. the probability (between 0 and 1) that it assigns to that class; here the model is 99.8% confident that the comment is NEGATIVE. The comment itself would make more sense to someone with a finance background. To me it is hard to tell what it means; perhaps the person is concerned about losing money? My own reading is neutral, and I am not sure I can judge it without asking a finance person. Perhaps the sentiment came out as negative because of the 'average down' part? Not sure, but it is very interesting.
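To see where such a score comes from, recall that a two-class classifier produces one raw value (logit) per label, and a softmax turns those into probabilities; the reported score is the probability of the winning label. The logits below are made up for illustration and are not taken from the actual model.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw model outputs (logits) into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]
logits = [3.0, -3.0]  # made-up logits for illustration
probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
print({"label": labels[best], "score": round(probs[best], 3)})
```

A high score therefore says how strongly the model prefers one label over the other, not how negative or positive the text is.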

### Task IV: Put All Together

Let's pull all the pieces together and create a simple script that does the following:

- get the subreddit
- get comments from the top posts for given subreddit
- run sentiment analysis 

#### Complete the Script

Once you complete the code, running the following block writes the code into a new Python script and saves it as `top_tlsa_comment_sentiment.py` in the same directory as the notebook.

In [72]:
%%writefile top_tlsa_comment_sentiment.py

import secrets_reddit
import random

from typing import Dict, List

from praw import Reddit
from praw.models.reddit.subreddit import Subreddit
from praw.models import MoreComments

from transformers import pipeline


def get_subreddit(display_name:str) -> Subreddit:
    """Get a subreddit object from its display name.

    Args:
        display_name (str): Name of the subreddit without the "r/" prefix, e.g. "TSLA".

    Returns:
        Subreddit: PRAW object for the requested subreddit.
    """
    reddit = Reddit(
        client_id=secrets_reddit.REDDIT_API_CLIENT_ID,        
        client_secret=secrets_reddit.REDDIT_API_CLIENT_SECRET,
        user_agent=secrets_reddit.REDDIT_API_USER_AGENT
        )
    
    subreddit = reddit.subreddit(display_name)
    return subreddit

def get_comments(subreddit:Subreddit, limit:int=3) -> List[str]:
    """Get top-level comments from the top posts of a subreddit.

    Args:
        subreddit (Subreddit): Subreddit to read from.
        limit (int, optional): Number of top posts to scan. Defaults to 3.

    Returns:
        List[str]: Text of each top-level comment.
    """
    top_comments = []
    for submission in subreddit.top(limit=limit):
        for top_level_comment in submission.comments:
            if isinstance(top_level_comment, MoreComments):
                continue
            top_comments.append(top_level_comment.body)
    return top_comments

def run_sentiment_analysis(comment:str) -> Dict:
    """Run sentiment analysis on a comment using the default DistilBERT model.

    Args:
        comment (str): Text to classify.

    Returns:
        Dict: Prediction with the keys "label" and "score".
    """
    sentiment_model = pipeline("sentiment-analysis")
    sentiment = sentiment_model(comment)
    return sentiment[0]


if __name__ == '__main__':
    subreddit = get_subreddit('TSLA')
    comments = get_comments(subreddit)
    comment = random.choice(comments)
    sentiment = run_sentiment_analysis(comment)
    
    print(f'The comment: {comment}')
    print(f'Predicted Label is {sentiment["label"]} and the score is {sentiment["score"]:.3f}')

Overwriting top_tlsa_comment_sentiment.py


Run the following block to see the output.

In [73]:
!python top_tlsa_comment_sentiment.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The comment: Would be great but Elon said he wasn't in a hurry to do that again
Predicted Label is NEGATIVE and the score is 0.982


<details><summary> Expected output:</summary>

    No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
    The comment: When is DOGE flying
    Predicted Label is POSITIVE and the score is 0.689
</details>
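The script scores a single random comment. The pipeline also accepts a list of strings and returns one `{'label': ..., 'score': ...}` dict per input, so a natural next step is to summarize the predictions for all comments. The sketch below works on made-up predictions in that format; `summarize_predictions` is a hypothetical helper.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def summarize_predictions(predictions: List[Dict]) -> Dict[str, Dict[str, float]]:
    """Count predictions per label and average the confidence score."""
    counts = Counter(p["label"] for p in predictions)
    totals = defaultdict(float)
    for p in predictions:
        totals[p["label"]] += p["score"]
    return {
        label: {"count": counts[label], "mean_score": totals[label] / counts[label]}
        for label in counts
    }


# Made-up predictions in the pipeline's output format.
predictions = [
    {"label": "NEGATIVE", "score": 0.98},
    {"label": "POSITIVE", "score": 0.70},
    {"label": "NEGATIVE", "score": 0.90},
]
summary = summarize_predictions(predictions)
print(summary)
```

In the script, `predictions` could come from calling the pipeline once on the full `comments` list rather than on a single comment.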

💽❓ Data Question:

5. Is the subreddit active? About how many posts or threads per day? How could you find this information? 

Answer: The subreddit appears to have about 2.5 million members. Ran out of time to find out how many posts there are per day.
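One way to estimate the posting rate is to take the `created_utc` timestamps of the most recent submissions and divide the number of posts by the time span they cover. The sketch below assumes a hypothetical helper `posts_per_day` and uses made-up timestamps.

```python
from typing import Iterable


def posts_per_day(created_utc_values: Iterable[float]) -> float:
    """Estimate posts per day from the time span covered by the timestamps."""
    timestamps = sorted(created_utc_values)
    if len(timestamps) < 2:
        raise ValueError("Need at least two timestamps to estimate a rate.")
    span_days = (timestamps[-1] - timestamps[0]) / 86_400  # seconds per day
    if span_days == 0:
        raise ValueError("All timestamps are identical.")
    # n timestamps delimit n - 1 intervals.
    return (len(timestamps) - 1) / span_days


# Made-up timestamps: one post every 6 hours over two days.
sample = [i * 6 * 3600 for i in range(9)]
rate = posts_per_day(sample)
print(rate)
```

With PRAW, the input could be `(post.created_utc for post in subreddit.new(limit=100))`.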

💽❓ Data Question:

6. Does there seem to be a large distribution of posters or a smaller concentration of posters who are very active? What kind of impact might this have on the data? 

Answer: A smaller concentration of posters is very active: only about 400 out of 2.5 million. When we read from Reddit, we oversample this small crowd rather than the real population, and the smaller group is not a good representation of the larger one. The data is therefore strongly biased towards those who self-select to engage in the conversation, and what we see reflects the views of the few.
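The concentration of posters can be checked directly by counting comments per author and measuring what share comes from the most active few. A minimal sketch, with a hypothetical helper `top_author_share` and made-up author names:

```python
from collections import Counter
from typing import Iterable


def top_author_share(authors: Iterable[str], k: int = 3) -> float:
    """Fraction of all comments written by the k most active authors."""
    counts = Counter(authors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top_k = sum(count for _, count in counts.most_common(k))
    return top_k / total


# Made-up author names for illustration.
authors = ["ann", "ann", "ann", "bob", "bob", "cat", "dan", "eve", "ann", "bob"]
share = top_author_share(authors, k=2)
print(share)
```

With PRAW, the author names could be collected as `str(comment.author)` for each comment returned by `subreddit.comments(limit=...)`; a share close to 1 for a small `k` would indicate that a few voices dominate the sample.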