This is a programme for FM412 Group Assignment 1 Part 2 r/WSB GME related sentiment analysis

In [None]:
# In case in the new environment there is no local packages
'''
pip install psaw
pip install praw
pip install nltk
pip install timestamp
pip install emoji
pip install re
pip install transformers
pip install seaborn
pip install matplotlib
'''

In [1]:
# Import the packages
from psaw import PushshiftAPI
from transformers import AutoTokenizer as AT
from transformers import AutoModelForSequenceClassification as AM # The above two to use the pretrained model
import numpy as np
import pandas as pd
import datetime as dt
import torch

I will scrapy the submissions and comments about GME from the begining of 2021.01.01

In [2]:
start_time = int(dt.datetime(2021, 1, 1).timestamp())

Construct API Query for searching submissions containing GME as keyword in the subreddit r/wsb

In [3]:
api = PushshiftAPI()
GME_submissions = api.search_submissions(after = start_time, q = 'GME', subreddit = 'wallstreetbets', limit = 10)
# search for submissions that containing GME
GME_comments = api.search_comments(after = start_time, q = 'GME', subreddit = 'wallstreetbets', limit = 10)
# search for comments that containing GME

In [4]:
text_submissions = []
text_comments = []
for rows in GME_submissions:
    if rows.title.find('GME') != -1:
        text_submissions.append(rows.title)
    if rows.selftext.find('GME') != -1:
        text_submissions.append(rows.selftext)
for items in GME_comments:
    text_comments.append(items.body)
# Combining two texts
text = text_submissions + text_comments

By now, we have got all the text information from Reddit, we will now proceed to sentiment analysis

Preprosessing the text list

In [5]:
# Cleaning text, get unique items
def get_unique_text(text):
    list_of_unique_text = []
    unique_text = set(text)
    for text in unique_text:
        list_of_unique_text.append(text)
    return list_of_unique_text
text = get_unique_text(text) # Get the distinct texts

In [6]:
len(text)

19

In [7]:
df = pd.DataFrame(np.array(text), columns = ['text'])
df.head(5)

Unnamed: 0,text
0,$ISPO Could it squeeze similar to $GME? Since ...
1,!banbet GME 100 1d
2,Wow a pin for gme on wsb? Color me surprised! ...
3,15k open interest GME 950 C for 1/20/23
4,I called you a dumbass in a GME thread because...


One must install pytorch and then use the pretrained package. Go to http://pytorch.org 

!pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio===0.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

In [8]:
# Activate the pretrained sentiment analysis package
tokenizer = AT.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = AM.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

In [9]:
# Define a function that will map the sentiment number with each title
def sentiment_classifier(text, model, tokenizer):
    inputs = tokenizer.encode_plus(text, return_tensors = 'pt', add_special_tokens = True)
    token_type_ids = inputs['token_type_ids']
    input_ids = inputs['input_ids']
    output = model(input_ids, token_type_ids, return_dict = True, output_hidden_states = True)
    logits = np.array(output.logits.tolist()[0])
    prob = np.exp(logits)/np.sum(np.exp(logits))
    return np.sum([(x+1)*prob[x] for x in range(len(prob))])

In [10]:
df['sentiment'] = df['text'].apply(lambda x: sentiment_classifier(x[:512], model, tokenizer))

In [11]:
# Mapping the attitude
df['attitude'] = 0
df.loc[df['sentiment'] > 2, 'attitude'] = 'Positive'
df.loc[df['sentiment'] < 2, 'attitude'] = 'Negative'
df

Unnamed: 0,text,sentiment,attitude
0,$ISPO Could it squeeze similar to $GME? Since ...,1.720743,Negative
1,!banbet GME 100 1d,2.776648,Positive
2,Wow a pin for gme on wsb? Color me surprised! ...,4.552956,Positive
3,15k open interest GME 950 C for 1/20/23,3.519945,Positive
4,I called you a dumbass in a GME thread because...,1.147833,Negative
5,$50k+ GME YOLO 🚀🚀🚀 Hedgies R Fuk !!! DRS is th...,2.787308,Positive
6,"Just ask, I'll tell you. fuck the GME cult, ...",1.626596,Negative
7,LOL. I opened after commenting. Yep. GME Meltd...,1.253565,Negative
8,Maximum TFSA YOLO! Let’s go GME 🚀,3.963553,Positive
9,**Ban Bet Created:** **/u/CantStop_GameStop** ...,1.419017,Negative


In [12]:
df['text'].iloc[15] # Sentiment: 1.72

'Could the masses turn MULN into the next GME? This company is 45% shorted at the moment...'

It can be seen that the sentence marked as 1.7, which should be relatively negative sentence has appeared positive to GME itself. The reason it is marked as 1.7 because it complained about other things.
It can be shown in after cells that people are using a cynical and complaining tones. When their sentiments are marked below 2.5, which should be recognised as negative, but actually are positive to GME. It tells us that the market is irrational,unlike in the normal time when they phrase the good stocks.
Therefore, to adjust the change, it is reasonable to set the bar of negative/positive as 2.

In [13]:
df['text'].iloc[17] # Sentiment: 1.89

'Nah, it obviously means you are bought by Melvin Capital and Citadel and you had lunch with Vlad Tenev today to orchestrate a planned shutdown of WSB and RH at the same time to stop purchases from accelerating the price of GME to $69,420,420 and also your dad is Jerome Powell and the SEC and they made love to create you.'

In [14]:
df['text'].iloc[18] # Sentiment: 1.88

'The Final Order: GME Earnings YOLO update. Cohen you little tardling. Put up or shut up fool'

In [15]:
df['text'].iloc[4] # This sentiment is marked as 1.14 which is near to the lower limit of sentiment analysis

"I called you a dumbass in a GME thread because you're not on a GME forum, this is open discussion for anyone with any opinion about it. And my opinion is that you're delusional, sad, stupid, and throwing good money after bad in a quixotic attempt to catch a wave that already made it to shore 14 months ago"

As one can see, even though I put some of truly positive comments to negative, it also shows that the positive views are more than the negative ones.

In [16]:
counts = df.attitude.value_counts().to_frame()
counts

Unnamed: 0,attitude
Positive,10
Negative,9


In [17]:
def main(name, start_time, limit, subreddit, benchmark):
    api = PushshiftAPI()
    GME_submissions = api.search_submissions(after = start_time, q = name, subreddit = subreddit, limit = limit)
    # search for submissions that containing GME
    GME_comments = api.search_comments(after = start_time, q = name, subreddit = subreddit, limit = limit)
    # search for comments that containing GME
    
    # Delete texts without keyword and append them in the same list
    text_submissions = []
    text_comments = []
    for rows in GME_submissions:
        if rows.title.find(name) != -1:
            text_submissions.append(rows.title)
        # When submission is high, it returns no selftext
        #if rows.selftext.find(name) != -1:
            #text_submissions.append(rows.selftext)
    for items in GME_comments:
        text_comments.append(items.body)
    # Combining two texts
    text = text_submissions + text_comments
    
    # Cleaning text, get unique items
    def get_unique_text(text):
        list_of_unique_text = []
        unique_text = set(text)
        for text in unique_text:
            list_of_unique_text.append(text)
        return list_of_unique_text
    text = get_unique_text(text) # Get the distinct texts
    
    # Making a pandas dataframe
    df = pd.DataFrame(np.array(text), columns = ['text'])
    
    # Natural Language Processing with pre_trained package 'Bert'
    tokenizer = AT.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
    model = AM.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
    
    def sentiment_classifier(text, model, tokenizer):
        inputs = tokenizer.encode_plus(text, return_tensors = 'pt', add_special_tokens = True)
        token_type_ids = inputs['token_type_ids']
        input_ids = inputs['input_ids']
        output = model(input_ids, token_type_ids, return_dict = True, output_hidden_states = True)
        logits = np.array(output.logits.tolist()[0])
        prob = np.exp(logits)/np.sum(np.exp(logits))
        return np.sum([(x+1)*prob[x] for x in range(len(prob))])
    
    # Mapping sentiment and attitudes with text
    df['sentiment'] = df['text'].apply(lambda x: sentiment_classifier(x[:512], model, tokenizer))
    df['attitude'] = 0
    df.loc[df['sentiment'] > benchmark, 'attitude'] = 'Positive'
    df.loc[df['sentiment'] < benchmark, 'attitude'] = 'Negative'
    
    # Reducing the counts
    counts = df.attitude.value_counts().to_frame()
    
    return counts

In [18]:
# Data to be used in report, with 20k sample space 

df = main(name = 'GME', start_time = start_time, subreddit = 'wallstreetbets', limit = 10000, benchmark = 2)
df

Unnamed: 0,attitude
Positive,10932
Negative,4643


In [32]:
df['percentage%'] = round((df['attitude']/df['attitude'].sum())*100, 2)

Unnamed: 0,attitude,percentage%
Positive,10932,70.19
Negative,4643,29.81
