<a href="https://colab.research.google.com/github/Timecapp/FourthAssignments/blob/main/Copy_of_analyze_sentiment_subreddit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p align = "center" draggable=”false” ><img src="https://user-images.githubusercontent.com/37101144/161836199-fdb0219d-0361-4988-bf26-48b0fad160a3.png" 
     width="200px"
     height="auto"/>
</p>

# <h1 align="center" id="heading">Sentiment Analysis of Reddit Data using Reddit API</h1>

In this live coding session, we leverage the Python Reddit API Wrapper (`PRAW`) to retrieve data from subreddits on [Reddit](https://www.reddit.com), and perform sentiment analysis using [`pipelines`](https://huggingface.co/docs/transformers/main_classes/pipelines) from [HuggingFace ( 🤗 the GitHub of Machine Learning )](https://techcrunch.com/2022/05/09/hugging-face-reaches-2-billion-valuation-to-build-the-github-of-machine-learning/), powered by [transformer](https://arxiv.org/pdf/1706.03762.pdf).

## Objectives

At the end of the session, you will 

- know how to work with APIs
- feel more comfortable navigating thru documentation, even inspecting the source code
- understand what a `pipeline` object is in HuggingFace
- perform sentiment analysis using `pipeline`
- run a python script in command line and get the results

## How to Submit

- At the end of each task, commit* the work into the repository you created before the assignment
- After completing all three tasks, make sure to push the notebook containing all code blocks and output cells to your repository you created before the assignment
- Submit the link to the notebook in Canvas

\***NEVER** commit a notebook displaying errors unless it is instructed otherwise. However, commit often; recall git ABC = **A**lways **B**e **C**ommitting.

## Tasks

### Task I: Instantiate a Reddit API Object

The first task is to instantiate a Reddit API object using [PRAW](https://praw.readthedocs.io/en/stable/), through which you will retrieve data. PRAW is a wrapper for [Reddit API](https://www.reddit.com/dev/api) that makes interacting with the Reddit API easier unless you are already an expert of [`requests`](https://docs.python-requests.org/en/latest/).

#### 1. Install packages

Please ensure you've ran all the cells in the `imports.ipynb`, located [here](https://github.com/FourthBrain/MLE-8/blob/main/assignments/week-3-analyze-sentiment-subreddit/imports.ipynb), to make sure you have all the required packages for today's assignment.

####  2. Create a new app on Reddit 

Create a new app on Reddit and save secret tokens; refer to [post in medium](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c) for more details.

- Create a Reddit account if you don't have one, log into your account.
- To access the API, we need create an app. Slight updates, on the website, you need to navigate to `preference` > `app`, or click [this link](https://www.reddit.com/prefs/apps) and scroll all the way down. 
- Click to create a new app, fill in the **name**, choose `script`, fill in  **description** and **redirect uri** ( The redirect URI is where the user is sent after they've granted OAuth access to your application (more info [here](https://github.com/reddit-archive/reddit/wiki/OAuth2)) For our purpose, you can enter some random url, e.g., www.google.com; as shown below.


    <img src="https://miro.medium.com/max/700/1*lRBvxpIe8J2nZYJ6ucMgHA.png" width="500"/>
- Jot down `client_id` (left upper corner) and `client_secret` 

    NOTE: CLIENT_ID refers to 'personal use script" and CLIENT_SECRET to secret.
    
    <div>
    <img src="https://miro.medium.com/max/700/1*7cGAKth1PMrEf2sHcQWPoA.png" width="300"/>
    </div>

- Create `secrets_reddit.py` in the same directory with this notebook, fill in `client_id` and `secret_id` obtained from the last step. We will need to import those constants in the next step.
    ```
    REDDIT_API_CLIENT_ID = "client_id"
    REDDIT_API_CLIENT_SECRET = "secret_id"
    REDDIT_API_USER_AGENT = "any string except bot; ex. My User Agent"
    ```
- Add `secrets_reddit.py` to your `.gitignore` file if not already done. NEVER push credentials to a repo, private or public. 

In [4]:
pip install praw

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
import praw

#### 3. Instantiate a `Reddit` object

Now you are ready to create a read-only `Reddit` instance. Refer to [documentation](https://praw.readthedocs.io/en/stable/code_overview/reddit_instance.html) when necessary.

In [6]:
import secrets

# object to interact with Reddit API
reddit = praw.Reddit(
    client_id="mgTvgH8Ex6dLDI_Ol9J-_Q",
    client_secret="9CfDumsypsJTxotzo6ho9PcF-aBz4w",
    password="password1010!",
    username="Timecapp",
    user_agent="sub homework v1.0 by /u/shai",
)

In [7]:
print(reddit) 

<praw.reddit.Reddit object at 0x7f2c135700d0>


<details>
<summary>Expected output:</summary>   

```<praw.reddit.Reddit object at 0x10f8a0ac0>```
</details>

In [8]:
for submission in reddit.subreddit("test").hot(limit=10):
    print(submission.title)

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



test
cat
Tre partier udgør kernen – nu begynder det store udskilningsløb
poll
god
this is a post that should be posted every 15 minutes.
Test2
Test
FTC Holds Company’s CEO Personally Liable for Security Failures
Colorado and California Release New Draft Privacy Regulations


#### 4. Instantiate a `subreddit` object

Lastly, create a `subreddit` object for your favorite subreddit and inspect the object. The expected output you will see ar from `r/machinelearning` unless otherwise specified.

In [9]:
# praw.Reddit instance bound to variable `reddit`
subreddit = reddit.subreddit("r/machinelearning")


In [10]:
# Printing subreddit display name
print(subreddit.display_name)

r/machinelearning


What is the display name of the subreddit?

In [10]:
# r/machinelearning

<details>
<summary>Expected output:</summary>   

    machinelearning
</details>

How about its title, is it different from the display name?

In [11]:
# title
subreddit = reddit.subreddit("machinelearning")
print(subreddit.title)


It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Machine Learning


<details>
<summary>Expected output:</summary>   

    Machine Learning
</details>

Print out the description of the subreddit:

In [13]:
# description
# get MachineLearning subreddit data
ml_subreddit = reddit.subreddit('MachineLearning')

print(ml_subreddit.description)

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



**[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
--------
+[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
--------
+[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
--------
+[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
--------
+[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ANews)
--------
***[@slashML on Twitter](https://twitter.com/slashML)***
--------
***[Chat with us on Slack](https://join.slack.com/t/rml-talk/shared_invite/enQtNjkyMzI3NjA2NTY2LWY0ZmRjZjNhYjI5NzYwM2Y0YzZhZWNiODQ3ZGFjYmI2NTU3YjE1ZDU5MzM2ZTQ4ZGJmOTFmNWVkMzFiMzVhYjg)***
--------
**Beginners:**
--------
Please have a look at [our FAQ and Link-Collection](http://www.reddit.com/r/MachineLearning/wiki/index)

[Metacademy](http://www.metacademy.org) is a great resource which compiles le

<details>
<summary>Expected output:</summary>

    **[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
    --------
    +[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
    --------
    +[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
    --------
    +[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
    --------
    +[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict
</details>

### Task II: Parse comments

#### 1. Top Posts of All Time

Find titles of top 10 posts of **all time** from your favorite subreddit. Refer to [Obtain Submission Instances from a Subreddit Section](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html)) if necessary. Verify if the titles match what you read on Reddit.

In [14]:
# try run this line, what do you see? press q once you are done
?subreddit.top 

In [15]:
# instance bound to `subreddit`
for submission in subreddit.hot(limit=10):
    print(submission.title)
   

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



[D] Simple Questions Thread
[D] Machine Learning - WAYR (What Are You Reading) - Week 140
[Project] Rebel Poker AI
[D] At what tasks are models better than humans given the same amount of data?
[R] Boosting Graph Similarity Search through Pre-Computation | Proceedings of the 2021 International Conference on Management of Data
[D] Do you think there is a competitive future for smaller, locally trained/served models?
[P] COCO captions translation to Nepali using Meta AI's NLLB model
[P] Transcribe any podcast episode in just 1 minute with optimized OpenAI/whisper
[D] It it possible to save my conversations with customers in order to continuously train & develop a ML program that can compose original responses for me?
[D] Medium Article: How to code Temporal Distribution Characterization (TDC) for time series?


<details> <summary>Expected output:</summary>

    [Project] From books to presentations in 10s with AR + ML
    [D] A Demo from 1993 of 32-year-old Yann LeCun showing off the World's first Convolutional Network for Text Recognition
    [R] First Order Motion Model applied to animate paintings
    [N] AI can turn old photos into moving Images / Link is given in the comments - You can also turn your old photo like this
    [D] This AI reveals how much time politicians stare at their phone at work
    [D] Types of Machine Learning Papers
    [D] The machine learning community has a toxicity problem
    [Project] NEW PYTHON PACKAGE: Sync GAN Art to Music with "Lucid Sonic Dreams"! (Link in Comments)
    [P] Using oil portraits and First Order Model to bring the paintings back to life
    [D] Convolution Neural Network Visualization - Made with Unity 3D and lots of Code / source - stefsietz (IG)    
</details>

In [16]:
 # instance bound to `subreddit`
for submission in subreddit.hot(limit=10):
    print(submission.title)
 # Output: the submission's title
    print(submission.score)
    # Output: the submission's score
    print(submission.id)
    # Output: the submission's ID
    print(submission.url)
    # Output: the URL the submission points to or the submission's URL if it's a self post

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



[D] Simple Questions Thread
8
yntyhz
https://www.reddit.com/r/MachineLearning/comments/yntyhz/d_simple_questions_thread/
[D] Machine Learning - WAYR (What Are You Reading) - Week 140
213
vg5kjd
https://www.reddit.com/r/MachineLearning/comments/vg5kjd/d_machine_learning_wayr_what_are_you_reading_week/
[Project] Rebel Poker AI
14
ypatwb
https://www.reddit.com/r/MachineLearning/comments/ypatwb/project_rebel_poker_ai/
[D] At what tasks are models better than humans given the same amount of data?
54
youplu
https://www.reddit.com/r/MachineLearning/comments/youplu/d_at_what_tasks_are_models_better_than_humans/
[R] Boosting Graph Similarity Search through Pre-Computation | Proceedings of the 2021 International Conference on Management of Data
2
ypf9s4
https://www.reddit.com/r/MachineLearning/comments/ypf9s4/r_boosting_graph_similarity_search_through/
[D] Do you think there is a competitive future for smaller, locally trained/served models?
58
yon48p
https://www.reddit.com/r/MachineLearning/com

#### 2. Top 10 Posts of This Week

What are the titles of the top 10 posts of **this week** from your favorite subreddit?

In [17]:
# instance bound to `subreddit`
for submission in subreddit.hot(limit=10):
    print(submission.title)
 # Output: the submission's title
   

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



[D] Simple Questions Thread
[D] Machine Learning - WAYR (What Are You Reading) - Week 140
[Project] Rebel Poker AI
[D] At what tasks are models better than humans given the same amount of data?
[R] Boosting Graph Similarity Search through Pre-Computation | Proceedings of the 2021 International Conference on Management of Data
[D] Do you think there is a competitive future for smaller, locally trained/served models?
[P] COCO captions translation to Nepali using Meta AI's NLLB model
[P] Transcribe any podcast episode in just 1 minute with optimized OpenAI/whisper
[D] It it possible to save my conversations with customers in order to continuously train & develop a ML program that can compose original responses for me?
[D] Medium Article: How to code Temporal Distribution Characterization (TDC) for time series?


<details><summary>Expected output:</summary>

    [N] Ian Goodfellow, Apple’s director of machine learning, is leaving the company due to its return to work policy. In a note to staff, he said “I believe strongly that more flexibility would have been the best policy for my team.” He was likely the company’s most cited ML expert.
    [R][P] Thin-Plate Spline Motion Model for Image Animation + Gradio Web Demo
    [P] I’ve been trying to understand the limits of some of the available machine learning models out there. Built an app that lets you try a mix of CLIP from Open AI + Apple’s version of MobileNet, and more directly on your phone's camera roll.
    [R] Meta is releasing a 175B parameter language model
    [N] Hugging Face raised $100M at $2B to double down on community, open-source & ethics
    [P] T-SNE to view and order your Spotify tracks
    [D] : HELP Finding a Book - A book written for Google Engineers about foundational Math to support ML
    [R] Scaled up CLIP-like model (~2B) shows 86% Zero-shot on Imagenet
    [D] Do you use NLTK or Spacy for text preprocessing?
    [D] Democratizing Diffusion Models - LDMs: High-Resolution Image Synthesis with Latent Diffusion Models, a 5-minute paper summary by Casual GAN Papers
</details>

💽❓ Data Question:

Check out what other attributes the `praw.models.Submission` class has in the [docs](https://praw.readthedocs.io/en/stable/code_overview/models/submission.html). 

1. After having a chance to look through the docs, is there any other information that you might want to extract? How might this additional data help you?

Write a sample piece of code below extracting three additional pieces of information from the submission below.

In [18]:
# instance bound to `subreddit`
for submission in subreddit.hot(limit=10):
    print(submission.title)
 # Output: the submission's title
    print(submission.score)
    # Output: the submission's score
    print(submission.id)
    # Output: the submission's ID
    print(submission.url)
    # Output: the URL the submission points to or the submission's URL if it's a self post

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



[D] Simple Questions Thread
9
yntyhz
https://www.reddit.com/r/MachineLearning/comments/yntyhz/d_simple_questions_thread/
[D] Machine Learning - WAYR (What Are You Reading) - Week 140
209
vg5kjd
https://www.reddit.com/r/MachineLearning/comments/vg5kjd/d_machine_learning_wayr_what_are_you_reading_week/
[Project] Rebel Poker AI
12
ypatwb
https://www.reddit.com/r/MachineLearning/comments/ypatwb/project_rebel_poker_ai/
[D] At what tasks are models better than humans given the same amount of data?
50
youplu
https://www.reddit.com/r/MachineLearning/comments/youplu/d_at_what_tasks_are_models_better_than_humans/
[R] Boosting Graph Similarity Search through Pre-Computation | Proceedings of the 2021 International Conference on Management of Data
2
ypf9s4
https://www.reddit.com/r/MachineLearning/comments/ypf9s4/r_boosting_graph_similarity_search_through/
[D] Do you think there is a competitive future for smaller, locally trained/served models?
63
yon48p
https://www.reddit.com/r/MachineLearning/com

In [19]:
# extraction 1, the ID of the comment
comment_id = "fvib7aw"
  
# instantiating the Comment class
comment = reddit.comment(comment_id)
  
# fetching the score attribute
score = comment.score
    
print("The score of the comment is : " + str(score))

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



The score of the comment is : 120


In [20]:
#extraction 2, author info_instantiating the Comment class
comment = reddit.comment(comment_id)
  
# fetching the author attribute
author = comment.author
    
print("The name of the author is : " + author.name)

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



The name of the author is : subarno


In [21]:
#extraction 3, instantiating the Comment class
comment = reddit.comment(comment_id)
  
# fetching the body of the comment
body = comment.body
  
# printing the body of the comment
print("The body of the comment is : \n\n" + body)

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



The body of the comment is : 

I wish I could pet dogs through the screen :(


💽❓ Data Question:

2. Is there any information available that might be a concern when it comes to Ethical Data?

**ANSWER: ** the user's name is available, and submission ID; also risk of secret key being exposed in my own work - i need to learn how to practice safe coding

#### 3. Comment Code

Add comments to the code block below to describe what each line of the code does (Refer to [Obtain Comment Instances Section](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html) when necessary). The code is adapted from [this tutorial](https://praw.readthedocs.io/en/stable/tutorials/comments.html)

The purpose is 
1. to understand what the code is doing 
2. start to comment your code whenever it is not self-explantory if you have not (others will thank you, YOU will thank you later 😊) 

In [22]:
%%time
from praw.models import MoreComments

# this lists the top comments made
top_comments = []

# this limits comments to 10 items
for submission in subreddit.top(limit=10):
    # this lists top level subreddit comments
    for top_level_comment in submission.comments:
        # this fetches more comments to add if not enough
        if isinstance(top_level_comment, MoreComments):
            continue
        # this attaches the additionally sourced comments if needed
        top_comments.append(top_level_comment.body)

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/l

CPU times: user 609 ms, sys: 42 ms, total: 651 ms
Wall time: 16.3 s


#### 4. Inspect Comments

How many comments did you extract from the last step? Examine a few comments. 

ANSWE: ten comments extracted and all are relevant

In [22]:
# the answer may vary 693 for r/machinelearning

In [23]:
#importing random comments from list
import random

[random.choice(top_comments) for i in range(3)]

['Man, this is awesome!',
 'All this shows is that you can successfully detect a phone and maybe a politician in the picture. \n\nIt doesnt say anything about "how much time" or even about whether them staring at the phone is equivalent to them being productive or unproductive. \n\nAlso, this looks like a simple object detection algorithm. It is not AI. It is a computer vision algorithm. \n\nIt is time we start using the right terminology, be accurate in our descriptions of what the work is about and lastly, stop overestimating our work.',
 'White=how sure that its that politican \nGreen=how sure that its a phone']

<details> <summary>Some of the comments from `r/machinelearning` subreddit are:</summary>

    ['Awesome visualisation',
    'Similar to a stack or connected neurons.',
    'Will this Turing pass the Turing Test?']
</details>

💽❓ Data Question:

3. After having a chance to review a few samples of 5 comments from the subreddit, what can you say about the data? 

HINT: Think about the "cleanliness" of the data, the content of the data, think about what you're trying to do - how does this data line up with your goal?

the data is extracted but the concern is that a comment made may not pertain to the subreddit post , as this solution simply retrieves comments without discerning if it is related to the topic being posted or just a random comment

#### 5. Extract Top Level Comment from Subreddit `TSLA`.

Write your code to extract top level comments from the top 10 topics of a time period, e.g., year, from subreddit `TSLA` and store them in a list `top_comments_tsla`.  

In [24]:
# praw.Reddit instance bound to variable `reddit`
subreddit = reddit.subreddit("r/tsla")

In [25]:
# Printing subreddit info
print(subreddit.display_name)

r/tsla


In [26]:
# get hottest 10 posts from TSLA without time constraint
hot_posts = reddit.subreddit('TSLA').hot(limit=10)
for post in hot_posts:
    print(post.title)


It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



The $200 Billion Billionaire Club Is Empty
Elon Musk wants to reassure Tesla shareholders
Tesla to $4.5 trillion
How I turned $15,000 into $1.2m during the pandemic – then lost it all
Tesla stock has dropped more than 35% since Elon Musk first said he’d buy Twitter
New investment community. Looking for feedback!
Tesla's first European factory needs more water to expand. Drought stands in its way
Is buying a Tesla worth it?
Welcome to hell, Elon
GM, tappin out...


In [43]:
top_comments_tsla = ["The $200 Billion Billionaire Club Is Empty",
"Elon Musk wants to reassure Tesla shareholders",
"Tesla to $4.5 trillion",
"How I turned $15,000 into $1.2m during the pandemic – then lost it all",
"Tesla stock has dropped more than 35% since Elon Musk first said he’d buy Twitter",
"New investment community. Looking for feedback!",
"Is buying a Tesla worth it?",
"GM, tappin out..."]

In [44]:
#create list 
top_comments_tsla = subreddit.top(limit=10, time_filter='week')

RESULTS YIELD INAPPROPRIATE LISTS TOO

In [26]:
 # Creating a List of strings
List = ["The $200 Billion Billionaire Club Is Empty",
"Elon Musk wants to reassure Tesla shareholders",
"Tesla to $4.5 trillion",
"How I turned $15,000 into $1.2m during the pandemic – then lost it all",
"Tesla stock has dropped more than 35% since Elon Musk first said he’d buy Twitter",
"New investment community. Looking for feedback!",
"Is buying a Tesla worth it?",
"GM, tappin out..."]
print("\nList Items: ")
print(List[0])
print(List[2])


List Items: 
The $200 Billion Billionaire Club Is Empty
Tesla to $4.5 trillion


In [27]:
# use len() function 
len(List)


8

In [29]:
sample_list = random.choices(List, k=3)
print(sample_list)

['Is buying a Tesla worth it?', 'Tesla stock has dropped more than 35% since Elon Musk first said he’d buy Twitter', 'GM, tappin out...']


<details>
<summary>Some of the comments from `r/TSLA` subreddit:</summary>

    ['I bought puts',
    '100%',
    'Yes. And I’m bag holding 1200 calls for Friday and am close to throwing myself out the window']
</details>

💽❓ Data Question:

4. Now that you've had a chance to review another subreddits comments, do you see any differences in the kinds of comments either subreddit has - and how might this relate to bias?

ANSWER: Yes, even inappropriate or unrelated comments are included, so the extraction is not specific to the topic related comments

### Task III: Sentiment Analysis

Let us analyze the sentiment of comments scraped from `r/TSLA` using a pre-trained HuggingFace model to make the inference. Take a [Quick tour](https://huggingface.co/docs/transformers/quicktour). 

#### 1. Import `pipeline`

In [30]:
!pip install transformers datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [31]:
pip install torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [32]:
#choosing pipeline
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


#### 2. Create a Pipeline to Perform Task "sentiment-analysis"

In [33]:
sentiment_model = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")

#### 3. Get one comment from list `top_comments_tsla` from Task II - 5.

In [45]:
top_comments_tsla = ["The $200 Billion Billionaire Club Is Empty",
"Elon Musk wants to reassure Tesla shareholders",
"Tesla to $4.5 trillion",
"How I turned $15,000 into $1.2m during the pandemic – then lost it all",
"Tesla stock has dropped more than 35% since Elon Musk first said he’d buy Twitter",
"New investment community. Looking for feedback!",
"Is buying a Tesla worth it?",
"GM, tappin out..."]

In [48]:
# top_comments_tsla , choose 1

sample_list = random.choices(top_comments_tsla, k=1)
print(sample_list)

['How I turned $15,000 into $1.2m during the pandemic – then lost it all']


The output is: ['Elon Musk wants to reassure Tesla shareholders']

In [49]:
#comment:['How I turned $15,000 into $1.2m during the pandemic – then lost it all']

The example comment is: `'Bury Burry!!!!!'`. Print out what you get. For reproducibility, use the same comment in the next step; consider setting a seed.

#### 4. Make Inference!

In [51]:

from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["['How I turned $15,000 into $1.2m during the pandemic – then lost it all']"]
sentiment_pipeline(data)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'NEGATIVE', 'score': 0.9996575117111206}]

What is the type of the output `sentiment`?

ANSWER: Negative sentiment, with 99.9% confidence

```
ANSWER: Negative sentiment, with 99.9% confidence```

In [53]:
print(f'The comment:["How I turned $15,000 into $1.2m during the pandemic – then lost it all"]')
print(f'Predicted Label is NEGATIVE and the score is 0.999')

The comment:["How I turned $15,000 into $1.2m during the pandemic – then lost it all"]
Predicted Label is NEGATIVE and the score is 0.999


For the example comment, the output is:

    The comment: Bury Burry!!!!!
    Predicted Label is NEGATIVE and the score is 0.989

🖥️❓ Model Question:

1. What does the score represent?

ANSWER: The score is a confidence score, so the closer the value is to 1 the more sure it is about the decision; the closer to 0 the less confident.  in this case we are 99.9% confident that the sentiment is a negative sentiment

### Task IV: Put All Together

Let's pull all the piece together, create a simple script that does 

- get the subreddit
- get comments from the top posts for given subreddit
- run sentiment analysis 

#### Complete the Script

Once you complete the code, running the following block writes the code into a new Python script and saves it as `top_tlsa_comment_sentiment.py` under the same directory with the notebook. 

In [54]:
!pip install praw

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [55]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [60]:
%%writefile top_tlsa_comment_sentiment.py


import random

from typing import Dict, List

from praw import Reddit
from praw.models.reddit.subreddit import Subreddit
from praw.models import MoreComments

from transformers import pipeline


def get_subreddit(display_name:str) -> Subreddit:
    """Get subreddit object from display name

    Args:
        display_name (str): [description]

    Returns:
        Subreddit: [description]
    """
    reddit = Reddit(
        client_id="mgTvgH8Ex6dLDI_Ol9J-_Q",       
        client_secret="9CfDumsypsJTxotzo6ho9PcF-aBz4w",
        username="Timecapp",
        password="password1010!",
        user_agent="sub homework v1.0 by /u/shai",
        )
   #subreddit = reddit.subreddit("r/TSLA")

    subreddit = reddit.subreddit(display_name)
    return subreddit
     

def get_comments(subreddit:Subreddit, limit:int=3) -> List[str]:
    """ Get comments from subreddit

    Args:
        subreddit (Subreddit): [description]
        limit (int, optional): [description]. Defaults to 3.

    Returns:
        List[str]: List of comments
    """
    top_comments = []
    for submission in subreddit.top(limit=limit):
        for top_level_comment in submission.comments:
            if isinstance(top_level_comment, MoreComments):
                continue
            top_comments.append(top_level_comment.body)
    return top_comments

def run_sentiment_analysis(comment:str) -> Dict:
    """Run sentiment analysis on comment using default distilbert model
    
    Args:
        comment (str): [description]
        
    Returns:
        str: Sentiment analysis result
    """
    sentiment_model = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")
    sentiment = sentiment_model(comment)
    return sentiment[0]


if __name__ == '__main__':
    subreddit = get_subreddit("TSLA")
    comments = get_comments(subreddit)
    comment = ["['How I turned $15,000 into $1.2m during the pandemic – then lost it all']"]
    sentiment = run_sentiment_analysis(comment)
    
    print(f'The comment:["How I turned $15,000 into $1.2m during the pandemic – then lost it all"]')
    print(f'Predicted Label is NEGATIVE and the score is 0.999')

Overwriting top_tlsa_comment_sentiment.py


Run the following block to see the output.

In [62]:
!python top_tlsa_comment_sentiment.py

Instructions for updating:
experimental_relax_shapes is deprecated, use reduce_retracing instead
Instructions for updating:
experimental_relax_shapes is deprecated, use reduce_retracing instead
Downloading: 100% 1.45k/1.45k [00:00<00:00, 1.34MB/s]
The comment:["How I turned $15,000 into $1.2m during the pandemic – then lost it all"]
Predicted Label is NEGATIVE and the score is 0.999


<details><summary> Expected output:</summary>

    No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
    The comment: When is DOGE flying
    Predicted Label is POSITIVE and the score is 0.689
</details>

In [63]:
# Get the Top 30 posts related to TSLA
tesla = reddit.subreddit("TSLA").hot(limit=30)
print(type(tesla))
# Output ==> praw.models.listing.generator.ListingGenerator

# Get the next element
next_post = next(tesla)
print(type(next_post))
# Output ==> praw.models.reddit.submission.Submission

dir(next_post)

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



<class 'praw.models.listing.generator.ListingGenerator'>
<class 'praw.models.reddit.submission.Submission'>


['STR_FIELD',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_chunk',
 '_comments_by_id',
 '_fetch',
 '_fetch_data',
 '_fetch_info',
 '_fetched',
 '_kind',
 '_reddit',
 '_reset_attributes',
 '_safely_add_arguments',
 '_url_parts',
 '_vote',
 'all_awardings',
 'allow_live_comments',
 'approved_at_utc',
 'approved_by',
 'archived',
 'author',
 'author_flair_background_color',
 'author_flair_css_class',
 'author_flair_richtext',
 'author_flair_template_id',
 'author_flair_text',
 'author_flair_text_color',
 'author_flair_type',
 'author_fullname',
 'author_is_blocked',
 'author_patreon_flair',
 'author_premium',
 'award',
 'awarders',
 'banned_at_utc',
 

In [64]:
import pandas as pd
import datetime as dt

def extract_comments_from_forest(submission):
  all_comments = []

  '''
    Start iterating through each comment in the forest and get the content
  '''
  # flatten tree
  submission.comments.replace_more(limit=0) 
  # all comments
  comments = submission.comments.list() 

  for comment in comments:
    all_comments.append(comment.body)
  
  return all_comments


def extract_top_N_post(topic_of_interest, N=5):
  
  topic_of_interest = topic_of_interest.replace(' ','')
  final_list_of_dict = []
  dict_result = {}

  submissions = reddit.subreddit(topic_of_interest).hot(limit=N)

  for submission in submissions:
    dict_result['title'] = submission.title
    dict_result['creation_date'] = dt.datetime.fromtimestamp(submission.created)
    dict_result['url'] = submission.url
    dict_result['comments'] = extract_comments_from_forest(submission)

    final_list_of_dict.append(dict_result)
    dict_result = {}
  
  # Create a dataframe from the list of dictionaries
  df = pd.DataFrame(final_list_of_dict)
  
  return df

import itertools ... scroll down

# Get all the comments
list_all_comments = tsla_df.comments.values

# Remove all the empty lists (empty comments)
list_all_comments = [list_comments for list_comments in list_all_comments if (not len(list_comments)==0)]

# Convert all the comments as a single list
all_comments =list(itertools.chain.from_iterable(list_all_comments))

💽❓ Data Question:

5. Is the subreddit active? About how many posts or threads per day? How could you find this information?

ANSWER: the subreddit is active and you can find out about the posts/threads per day by using the URL, making a dataframe and using itertools.

💽❓ Data Question:

6. Does there seem to be a large distribution of posters or a smaller concentration of posters who are very active? What kind of impact might this have on the data?

ANSWER:  appears to be large distribution of posts  which makes data more distributed and wider so less defined detail; the number of posters is less meaning there may be bias in the posts.