# Importing Libraries

**Note: PyTorch is required to run this notebook because we use the FinBERT transformer to obtain sentiment scores. We also tried the TensorFlow version of the model but could not get it to work.**

**Additionally, we rely on fastText to check whether each post is actually written in English. For this purpose, we downloaded the fastText language-identification model (`lid.176.bin`), which is available at https://fasttext.cc/docs/en/language-identification.html. Note that Windows users might run into installation problems, because installing fasttext requires Visual C++ 14.0.**

In [99]:
import os
import regex as re
import json
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import torch
from tqdm import tqdm

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import tokenize
import statistics
import emoji
from textblob import TextBlob
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [100]:
#Change the path below to point to the respective filename
tsmc_raw = pd.read_csv("./Reddit CSV/METAReddit.csv")

In [101]:
tsmc_raw.head()

Unnamed: 0,company_name,date,time,title,self_text,score,total_comments,comments
0,META,2022-10-01,14:02:35,Question on what to do with both TQQQ and SQQQ?,I might be the first person to ever hold both ...,1,0,['“Last week when the market was just beginnin...
1,META,2022-10-01,03:05:16,Facebook scrambles to escape stock’s death spi...,"A year ago, before Facebook had turned [Meta](...",1,0,['How come articles like this talk about the i...
2,META,2022-09-30,18:07:29,Some folks on here can not see the forest for ...,Late last year the members of the Fed sold all...,1,0,[' >and are dollar cost averaging into etfs gr...
3,META,2022-09-30,00:07:37,FCF Yield of Stocks I am Watching,I recently compiled a list of stocks with thei...,1,0,[]
4,META,2022-09-29,18:44:34,"Meta Announces Hiring Freeze, Warns Employees ...","Meta Platforms Inc., the owner of Facebook and...",1,0,['3/95 (3%) reduction in expense is not enough...


In [102]:
tsmc_raw.shape

(21371, 8)

In [103]:
tsmc_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21371 entries, 0 to 21370
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   company_name    21371 non-null  object
 1   date            21371 non-null  object
 2   time            21371 non-null  object
 3   title           21371 non-null  object
 4   self_text       19159 non-null  object
 5   score           21371 non-null  int64 
 6   total_comments  21371 non-null  int64 
 7   comments        21371 non-null  object
dtypes: int64(2), object(6)
memory usage: 1.3+ MB


In [104]:
tsmc_raw["title"] 

0          Question on what to do with both TQQQ and SQQQ?
1        Facebook scrambles to escape stock’s death spi...
2        Some folks on here can not see the forest for ...
3                        FCF Yield of Stocks I am Watching
4        Meta Announces Hiring Freeze, Warns Employees ...
                               ...                        
21366    This sub hates Wealthfront, hates Robinhood, h...
21367    Facebook's Oculus will release its virtual rea...
21368                              Facebook way overvalued
21369         My wife's 401k has a major discrepancy in it
21370    FANG (Facebook, Amazon, Netflix and Google) an...
Name: title, Length: 21371, dtype: object

In [105]:
tsmc_raw["self_text"]

0        I might be the first person to ever hold both ...
1        A year ago, before Facebook had turned [Meta](...
2        Late last year the members of the Fed sold all...
3        I recently compiled a list of stocks with thei...
4        Meta Platforms Inc., the owner of Facebook and...
                               ...                        
21366                                            [deleted]
21367    In a blog post on Monday, Oculus FB revealed v...
21368    Facebook's Price-to-Sales ratio: 18.57 (as of ...
21369    She has her money in T Rowe Price Retire 2050 ...
21370    http://www.ft.com/intl/cms/s/0/b73d74c6-938c-1...
Name: self_text, Length: 21371, dtype: object

In [106]:
tsmc_raw["title"].isnull().sum()

0

In [107]:
tsmc_raw["self_text"].isnull().sum()

2212

In [108]:
#Change NaN to ""
tsmc_raw["self_text"] = tsmc_raw["self_text"].fillna("")
tsmc_raw["self_text"].isnull().sum()

0

In [109]:
tsmc_raw.loc[0,"self_text"]

"I might be the first person to ever hold both TQQQ and SQQQ at the same time. I bought both last week trying to make a gain on TQQQ since the market was just beginning to turn upward - until it collapsed again. \n\nSo I then bought up SQQQ to make a profit on losses. \n\nShould I just let my TQQQ sit there until 2024, which is when the recession will be over hopefully? I don't see a need to pull it all out. Why not recoup the gains by leaving it in? I plan to sell my SQQQ for a profit next week. \n\nCurrent holdings: \n\nMETV, META, GOOG, TSLA, SOFI, SQQQ and TQQQ."

In [110]:
#Remove emojis, URLs, HTML tags, newlines, and tabs from title
tsmc_raw["title"] = tsmc_raw["title"].apply(lambda x: emoji.replace_emoji(str(x), replace=""))
tsmc_raw["title"] = tsmc_raw["title"].apply(lambda x: re.sub(r'https?://\S+', '', str(x)))
tsmc_raw['title'] = tsmc_raw['title'].apply(lambda x: re.sub('<[^<]+?>', '', str(x)))
tsmc_raw['title'] = tsmc_raw['title'].apply(lambda x: str(x).replace('\n',''))
tsmc_raw['title'] = tsmc_raw['title'].apply(lambda x: str(x).replace('\t',''))
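
The same cleaning steps are repeated for `self_text` and `comments` further down. A minimal sketch of how they could be folded into one helper, assuming the hypothetical function name `clean_text`. Emoji removal is left out so that the sketch only needs the standard library; in the notebook, `emoji.replace_emoji` would be applied first.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
TAG_PATTERN = re.compile(r"<[^<]+?>")

def clean_text(text):
    """Hypothetical helper mirroring the notebook's URL, HTML-tag, newline, and tab removal."""
    text = str(text)
    text = URL_PATTERN.sub("", text)   # drop URLs
    text = TAG_PATTERN.sub("", text)   # drop HTML tags
    return text.replace("\n", "").replace("\t", "")

print(clean_text("Hi <b>there</b> https://example.com/a\n\tok"))  # Hi there ok
```

With such a helper, each column could be cleaned with a single `apply(clean_text)` call.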

In [111]:
tsmc_raw[pd.isnull(tsmc_raw["title"])]

Unnamed: 0,company_name,date,time,title,self_text,score,total_comments,comments


In [112]:
#Remove emojis, URLs, HTML tags, newlines, and tabs from self_text
tsmc_raw["self_text"] = tsmc_raw["self_text"].apply(lambda x: emoji.replace_emoji(str(x), replace=""))
tsmc_raw["self_text"] = tsmc_raw["self_text"].apply(lambda x: re.sub(r'https?://\S+', '', str(x)))
tsmc_raw['self_text'] = tsmc_raw['self_text'].apply(lambda x: re.sub('<[^<]+?>', '', str(x)))
tsmc_raw['self_text'] = tsmc_raw['self_text'].apply(lambda x: str(x).replace('\n',''))
tsmc_raw['self_text'] = tsmc_raw['self_text'].apply(lambda x: str(x).replace('\t',''))

In [113]:
tsmc_raw.loc[0,"self_text"]

"I might be the first person to ever hold both TQQQ and SQQQ at the same time. I bought both last week trying to make a gain on TQQQ since the market was just beginning to turn upward - until it collapsed again. So I then bought up SQQQ to make a profit on losses. Should I just let my TQQQ sit there until 2024, which is when the recession will be over hopefully? I don't see a need to pull it all out. Why not recoup the gains by leaving it in? I plan to sell my SQQQ for a profit next week. Current holdings: METV, META, GOOG, TSLA, SOFI, SQQQ and TQQQ."

In [114]:
tsmc_raw[pd.isnull(tsmc_raw["self_text"])]

Unnamed: 0,company_name,date,time,title,self_text,score,total_comments,comments


In [115]:
#Remove emojis, URLs, HTML tags, newlines, tabs, and backslashes from comments
tsmc_raw["comments"] = tsmc_raw["comments"].apply(lambda x: emoji.replace_emoji(str(x), replace=""))
tsmc_raw["comments"] = tsmc_raw["comments"].apply(lambda x: re.sub(r'https?://\S+', '', str(x)))
tsmc_raw['comments'] = tsmc_raw['comments'].apply(lambda x: re.sub('<[^<]+?>', '', str(x)))
tsmc_raw['comments'] = tsmc_raw['comments'].apply(lambda x: str(x).replace('\n',''))
tsmc_raw['comments'] = tsmc_raw['comments'].apply(lambda x: str(x).replace('\t',''))
tsmc_raw['comments'] = tsmc_raw['comments'].apply(lambda x: str(x).replace('\\',''))

In [116]:
tsmc_raw.loc[0,"comments"]

'[\'“Last week when the market was just beginning to turn upward…”nnLol\', "With both tqqq and sqqq, you are betting that the market goes one direction. Like in the bull market, tqqq 10x\'d, while sqqq basically went to 0", "I only swing trade the 3x and 2x leveraged.  I\'m betting on a downward trend with SQQQ and QID currently, but I buy and sell small portions of my positions almost daily depending on the price movement.  I\'m constantly taking profit, averaging down on my cost basis, and expanding or reducing my exposure.nnIf I was betting on an upward trend, I\'d be in UPRO.  I think the components of the Qs are getting a big valuation adjustment currently and will have less potential for growth going forward.  With UPRO, you could capture gains in energy and value stocks.nnI guess it depends on your thesis of the market going forward.", "Swing one or the other, there\'s no point in holding both at the same time. As others have mentioned you just get some decay.", \'What to do: No
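
The `comments` column holds the string representation of a Python list, so the cell above cleans the brackets and quotes along with the text, and removing backslashes appears to turn an escaped line break (`\n\n`) into the stray `nn` visible in the output. One possible alternative, sketched under the assumption that every raw value is a valid Python list literal, is to parse the string with `ast.literal_eval` before cleaning. The helper name `parse_comments` is hypothetical.

```python
import ast

def parse_comments(raw):
    """Return the list of comment strings encoded in `raw`, or [] if it cannot be parsed."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    return [str(item) for item in parsed] if isinstance(parsed, list) else []

# Made-up raw value in the same shape as the CSV column
raw = "['first line\\n\\nsecond line', \"it's fine\"]"
comments = parse_comments(raw)
print(len(comments))  # 2
print(comments[1])    # it's fine
```

Each parsed comment could then be cleaned and scored on its own instead of as part of one long string.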

In [117]:
tsmc_raw[pd.isnull(tsmc_raw["comments"])]

Unnamed: 0,company_name,date,time,title,self_text,score,total_comments,comments


### Sentiment

In [118]:
df_senti = tsmc_raw.copy()

In [119]:
df_senti["title"][216]

'Please do not buy META. I know for a damn FACT that their services are dying. Litteraly thousands leaving their services.'

In [120]:
#Combine title and self_text into one single column
df_senti['title_self_text'] = df_senti['title'] + " " + df_senti['self_text']

In [121]:
df_senti["title_self_text"] = df_senti["title_self_text"].apply(lambda x: "" if pd.isnull(x) else x)
df_senti["comments"] = df_senti["comments"].apply(lambda x: "" if pd.isnull(x) else x)

In [122]:
df_senti.drop(columns=["title", "self_text"], inplace=True)

In [123]:
df_senti.loc[0,"title_self_text"]

"Question on what to do with both TQQQ and SQQQ? I might be the first person to ever hold both TQQQ and SQQQ at the same time. I bought both last week trying to make a gain on TQQQ since the market was just beginning to turn upward - until it collapsed again. So I then bought up SQQQ to make a profit on losses. Should I just let my TQQQ sit there until 2024, which is when the recession will be over hopefully? I don't see a need to pull it all out. Why not recoup the gains by leaving it in? I plan to sell my SQQQ for a profit next week. Current holdings: METV, META, GOOG, TSLA, SOFI, SQQQ and TQQQ."

In [124]:
df_senti.index

RangeIndex(start=0, stop=21371, step=1)

In [125]:
#Identification of English language in the title_self_text column
import fasttext

#Change the following path to point to your downloaded lid.176.bin file before running the script
path_to_pretrained_model = '/Users/andreaslukita7/Downloads/lid.176.bin'
fmodel = fasttext.load_model(path_to_pretrained_model)
fmodel.predict('Life is like a box of chocolates. You never know what you are gonna get.')[0][0][-2:]



'en'

In [126]:
df_senti["lang"] = df_senti["title_self_text"].apply(lambda x: fmodel.predict(x)[0][0][-2:])

In [127]:
df_senti["lang"].value_counts()

en    21245
om       20
de       18
es       12
zh       10
ja        9
tr        8
pt        8
ru        5
it        5
fr        4
hi        4
ko        4
fa        2
no        2
lt        2
sv        2
pl        2
vi        2
eb        1
fi        1
la        1
cs        1
uz        1
da        1
mk        1
Name: lang, dtype: int64
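
fastText returns labels of the form `__label__en`, and some language codes have three letters, so keeping only the last two characters can shorten them; the `eb` row above is presumably Cebuano (`ceb`). A minimal sketch of stripping the prefix instead, with the hypothetical helper name `label_to_lang`:

```python
LABEL_PREFIX = "__label__"

def label_to_lang(label):
    """Strip fastText's '__label__' prefix and return the full language code."""
    return label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label

print(label_to_lang("__label__en"))   # en
print(label_to_lang("__label__ceb"))  # ceb
```

In the notebook this would be applied to `fmodel.predict(x)[0][0]`. The English filter is unaffected either way, since `en` is a two-letter code.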

In [128]:
df_senti.shape

(21371, 8)

In [129]:
#Keep only the rows whose title_self_text is identified as English
df_senti = df_senti[df_senti["lang"]  == "en"]

In [130]:
df_senti.shape

(21245, 8)

In [131]:
sia = SentimentIntensityAnalyzer()

In [132]:
tokenize.sent_tokenize(df_senti["title_self_text"][0])

['Question on what to do with both TQQQ and SQQQ?',
 'I might be the first person to ever hold both TQQQ and SQQQ at the same time.',
 'I bought both last week trying to make a gain on TQQQ since the market was just beginning to turn upward - until it collapsed again.',
 'So I then bought up SQQQ to make a profit on losses.',
 'Should I just let my TQQQ sit there until 2024, which is when the recession will be over hopefully?',
 "I don't see a need to pull it all out.",
 'Why not recoup the gains by leaving it in?',
 'I plan to sell my SQQQ for a profit next week.',
 'Current holdings: METV, META, GOOG, TSLA, SOFI, SQQQ and TQQQ.']

In [133]:
len(tokenize.sent_tokenize(df_senti["title_self_text"][0]))

9

In [134]:
holder = df_senti['title_self_text'].copy()

all_senti = []
for title in holder.tolist():
    all_vader = []

    for sentence in tokenize.sent_tokenize(title):
        senti = sia.polarity_scores(sentence)
        all_vader.append(senti)

    all_senti.append(all_vader)

# df_senti["sentiment"] = holder.apply(lambda title : [sia.polarity_scores(sentence) for sentence in tokenize.sent_tokenize(title)])

In [135]:
all_senti_col = pd.DataFrame({"sentiment" : all_senti})
all_senti_col.shape

(21245, 1)

In [136]:
df_senti = pd.concat([df_senti, all_senti_col], axis=1)

In [137]:
df_senti.shape

(21371, 9)

In [138]:
len(df_senti[pd.isnull(df_senti["sentiment"])])

126
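
`pd.concat(..., axis=1)` aligns rows by index label rather than by position. After the English filter, `df_senti` keeps its original labels with gaps, while `all_senti_col` is numbered from 0 to 21244, so the scores may attach to different rows than the ones they were computed from. That would explain the 126 missing values on each side. A toy sketch of the effect, and of one possible way to keep rows aligned by assigning the list directly (the data is made up for illustration):

```python
import pandas as pd

posts = pd.DataFrame({"text": ["a", "b", "c", "d"], "lang": ["en", "de", "en", "en"]})
english = posts[posts["lang"] == "en"].copy()   # index labels: 0, 2, 3
scores = [0.1, 0.2, 0.3]                        # one score per English row, in order

# concat aligns on index labels: four rows come back, with NaNs on both sides
misaligned = pd.concat([english, pd.DataFrame({"score": scores})], axis=1)
print(misaligned.shape)  # (4, 3)

# Assigning the list is positional, so each score stays with its own row
english["score"] = scores
print(english)
```

Calling `reset_index(drop=True)` on the filtered frame before the concat would be another way to get the same alignment.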

In [139]:
df_senti.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21371 entries, 0 to 20866
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   company_name     21245 non-null  object 
 1   date             21245 non-null  object 
 2   time             21245 non-null  object 
 3   score            21245 non-null  float64
 4   total_comments   21245 non-null  float64
 5   comments         21245 non-null  object 
 6   title_self_text  21245 non-null  object 
 7   lang             21245 non-null  object 
 8   sentiment        21245 non-null  object 
dtypes: float64(2), object(7)
memory usage: 1.6+ MB


In [140]:
df_senti.dropna(axis=0, inplace=True)

In [141]:
df_senti.reset_index(drop=True, inplace=True)

In [142]:
df_senti.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21119 entries, 0 to 21118
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   company_name     21119 non-null  object 
 1   date             21119 non-null  object 
 2   time             21119 non-null  object 
 3   score            21119 non-null  float64
 4   total_comments   21119 non-null  float64
 5   comments         21119 non-null  object 
 6   title_self_text  21119 non-null  object 
 7   lang             21119 non-null  object 
 8   sentiment        21119 non-null  object 
dtypes: float64(2), object(7)
memory usage: 1.5+ MB


In [143]:
print(df_senti['sentiment'][0], '\n')
print(len(df_senti['sentiment'][0]))

[{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}, {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}, {'neg': 0.079, 'neu': 0.792, 'pos': 0.128, 'compound': 0.3182}, {'neg': 0.199, 'neu': 0.588, 'pos': 0.213, 'compound': 0.0516}, {'neg': 0.13, 'neu': 0.744, 'pos': 0.126, 'compound': -0.0258}, {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}, {'neg': 0.203, 'neu': 0.797, 'pos': 0.0, 'compound': -0.2584}, {'neg': 0.0, 'neu': 0.734, 'pos': 0.266, 'compound': 0.4404}, {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}] 

9


In [144]:
sentiment_score = df_senti.loc[0, "sentiment"]
sentiment_score

[{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
 {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
 {'neg': 0.079, 'neu': 0.792, 'pos': 0.128, 'compound': 0.3182},
 {'neg': 0.199, 'neu': 0.588, 'pos': 0.213, 'compound': 0.0516},
 {'neg': 0.13, 'neu': 0.744, 'pos': 0.126, 'compound': -0.0258},
 {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
 {'neg': 0.203, 'neu': 0.797, 'pos': 0.0, 'compound': -0.2584},
 {'neg': 0.0, 'neu': 0.734, 'pos': 0.266, 'compound': 0.4404},
 {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}]

In [145]:
df_senti.shape

(21119, 9)

In [146]:
df_senti = df_senti[df_senti['sentiment'].notna()]

In [147]:
df_senti.shape

(21119, 9)

**Note: Split the VADER sentiment scores into 4 separate columns, averaging over the sentences of each post**

In [148]:
list_neg_score = []

for i in range(len(df_senti)):
    each_neg_score = []
    sentiment_score = df_senti.loc[i, "sentiment"]
    for item in sentiment_score:
        each_neg_score.append(item["neg"])
    try:
        list_neg_score.append(statistics.mean(each_neg_score))
    except statistics.StatisticsError:  # empty list: the post has no sentences
        list_neg_score.append(0)

list_neg_score = pd.DataFrame(list_neg_score, columns=['title_neg_sentiment'])
    
df_senti = pd.concat([df_senti, list_neg_score], axis=1)

In [149]:
list_neu_score = []

for i in range(len(df_senti)):
    each_neu_score = []
    sentiment_score = df_senti.loc[i, "sentiment"]
    for item in sentiment_score:
        each_neu_score.append(item["neu"])
    try:
        list_neu_score.append(statistics.mean(each_neu_score))
    except statistics.StatisticsError:  # empty list: the post has no sentences
        list_neu_score.append(0)

list_neu_score = pd.DataFrame(list_neu_score, columns=['title_neu_sentiment'])
    
df_senti = pd.concat([df_senti, list_neu_score], axis=1)

In [150]:
list_pos_score = []

for i in range(len(df_senti)):
    each_pos_score = []
    sentiment_score = df_senti.loc[i, "sentiment"]
    for item in sentiment_score:
        each_pos_score.append(item["pos"])
    try:
        list_pos_score.append(statistics.mean(each_pos_score))
    except statistics.StatisticsError:  # empty list: the post has no sentences
        list_pos_score.append(0)

list_pos_score = pd.DataFrame(list_pos_score, columns=['title_pos_sentiment'])
    
df_senti = pd.concat([df_senti, list_pos_score], axis=1)

In [151]:
list_compound_score = []

for i in range(len(df_senti)):
    each_compound_score = []
    sentiment_score = df_senti.loc[i, "sentiment"]
    for item in sentiment_score:
        each_compound_score.append(item["compound"])
    try:
        list_compound_score.append(statistics.mean(each_compound_score))
    except statistics.StatisticsError:  # empty list: the post has no sentences
        list_compound_score.append(0)

list_compound_score = pd.DataFrame(list_compound_score, columns=['title_compound_sentiment'])
    
df_senti = pd.concat([df_senti, list_compound_score], axis=1)
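
The four cells above differ only in the dictionary key they average. A compact sketch of the same per-post averaging, using the hypothetical helper name `mean_vader_scores` and the sentence scores printed earlier for post 0:

```python
import statistics

VADER_KEYS = ("neg", "neu", "pos", "compound")

def mean_vader_scores(sentence_scores):
    """Average each VADER key over a post's sentences; 0 for a post with no sentences."""
    return {
        key: statistics.mean(s[key] for s in sentence_scores) if sentence_scores else 0
        for key in VADER_KEYS
    }

post_0 = [
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
    {'neg': 0.079, 'neu': 0.792, 'pos': 0.128, 'compound': 0.3182},
    {'neg': 0.199, 'neu': 0.588, 'pos': 0.213, 'compound': 0.0516},
    {'neg': 0.13, 'neu': 0.744, 'pos': 0.126, 'compound': -0.0258},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
    {'neg': 0.203, 'neu': 0.797, 'pos': 0.0, 'compound': -0.2584},
    {'neg': 0.0, 'neu': 0.734, 'pos': 0.266, 'compound': 0.4404},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
]
means = mean_vader_scores(post_0)
print({key: round(value, 6) for key, value in means.items()})
```

The rounded values should match the first row of the four `title_*_sentiment` columns.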

In [152]:
df_senti.head()

Unnamed: 0,company_name,date,time,score,total_comments,comments,title_self_text,lang,sentiment,title_neg_sentiment,title_neu_sentiment,title_pos_sentiment,title_compound_sentiment
0,META,2022-10-01,14:02:35,1.0,0.0,['“Last week when the market was just beginnin...,Question on what to do with both TQQQ and SQQQ...,en,"[{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compoun...",0.067889,0.850556,0.081444,0.058444
1,META,2022-10-01,03:05:16,1.0,0.0,['How come articles like this talk about the i...,Facebook scrambles to escape stock’s death spi...,en,"[{'neg': 0.118, 'neu': 0.848, 'pos': 0.034, 'c...",0.066553,0.829868,0.077263,0.066716
2,META,2022-09-30,18:07:29,1.0,0.0,[' >and are dollar cost averaging into etfs gr...,Some folks on here can not see the forest for ...,en,"[{'neg': 0.053, 'neu': 0.947, 'pos': 0.0, 'com...",0.0878,0.8527,0.0596,0.01969
3,META,2022-09-30,00:07:37,1.0,0.0,[],FCF Yield of Stocks I am Watching I recently c...,en,"[{'neg': 0.075, 'neu': 0.753, 'pos': 0.172, 'c...",0.020833,0.909667,0.069417,0.155742
4,META,2022-09-29,18:44:34,1.0,0.0,['3/95 (3%) reduction in expense is not enough...,"Meta Announces Hiring Freeze, Warns Employees ...",en,"[{'neg': 0.095, 'neu': 0.84, 'pos': 0.065, 'co...",0.0359,0.9126,0.0515,0.02824


**Note: Use the FinBERT transformer to obtain sentiment scores**

**Note: Run the model on the GPU through Apple's Metal Performance Shaders (MPS) backend**

In [153]:
device = torch.device('mps')
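
Hard-coding `'mps'` only works on a Mac whose PyTorch build supports the MPS backend. A minimal sketch of a fallback order, assuming the hypothetical helper name `pick_device_name`; the availability flags are passed in as arguments so the logic can be read and run without PyTorch installed.

```python
def pick_device_name(has_mps, has_cuda):
    """Prefer Apple MPS, then CUDA, then fall back to the CPU."""
    if has_mps:
        return "mps"
    if has_cuda:
        return "cuda"
    return "cpu"

# In the notebook the flags would come from PyTorch, for example:
# device = torch.device(pick_device_name(torch.backends.mps.is_available(),
#                                        torch.cuda.is_available()))
print(pick_device_name(has_mps=False, has_cuda=False))  # cpu
```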

In [154]:
# create a tokenizer object
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")

# fetch the pretrained model 
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert").to(device)

In [155]:
def sentim_analyzer(df, tokenizer, model):
    ''' Given a df that contains a column 'title_self_text' with the combined post title and body, it runs inference on each text
       with the 'model' (FinBERT) and inserts the output sentiment probabilities into the dataframe in the columns 'Positive', 'Negative', and 'Neutral'.
       
        Parameters :
          df : A dataframe that contains the texts in a column called 'title_self_text'. 
          tokenizer (AutoTokenizer object) : A pre-processing tokenizer object from the Hugging Face library. 
          model (AutoModelForSequenceClassification object) : A Hugging Face transformer model.     
          
          returns df : The initial dataframe with the 3 sentiment features as columns for each text'''
    
    for i in tqdm(df.index) :
        try:
            headline = df.loc[i, 'title_self_text']
        except KeyError:
            return print(' \'title_self_text\' column might be missing from dataframe')
        # Pre-process input text (texts longer than the model's maximum length are truncated)
        inputs = tokenizer(headline, padding = True, truncation = True, return_tensors='pt').to(device)
        # Estimate output
        output = model(**inputs)
        # Pass model output logits through a softmax layer.
        predictions = torch.nn.functional.softmax(output.logits, dim=-1)
        df.loc[i, 'Positive'] = predictions[0][0].tolist()
        df.loc[i, 'Negative'] = predictions[0][1].tolist()
        df.loc[i, 'Neutral']  = predictions[0][2].tolist()
    # rearrange column order (skipped when these columns are absent, as in this dataframe)
    try:
        df = df[['date', 'stock', 'Open', 'Close', 'Volume',  'headline', 'Positive', 'Negative', 'Neutral','Price_change']]
    except KeyError:
        pass
    return df
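
The softmax step turns the model's three raw logits into probabilities that sum to 1. A worked sketch in plain Python, with made-up logits and the label order that the function above assumes (index 0 positive, 1 negative, 2 neutral):

```python
import math

LABELS = ("positive", "negative", "neutral")

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.2, 2.3, 0.4]           # illustrative values, not real model output
probs = softmax(logits)
scores = dict(zip(LABELS, probs))
print(scores)
print(max(scores, key=scores.get))  # negative
```

Subtracting the largest logit before exponentiating does not change the result; it only guards against overflow.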

In [156]:
finbert_df = sentim_analyzer(df_senti, tokenizer, model)

100%|█████████████████████████████████████| 21119/21119 [14:51<00:00, 23.68it/s]


In [157]:
finbert_df = finbert_df.rename(columns={'Positive': 'finbert_positive', 'Negative': 'finbert_negative', 'Neutral': 'finbert_neutral'})

In [158]:
finbert_df.head()

Unnamed: 0,company_name,date,time,score,total_comments,comments,title_self_text,lang,sentiment,title_neg_sentiment,title_neu_sentiment,title_pos_sentiment,title_compound_sentiment,finbert_positive,finbert_negative,finbert_neutral
0,META,2022-10-01,14:02:35,1.0,0.0,['“Last week when the market was just beginnin...,Question on what to do with both TQQQ and SQQQ...,en,"[{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compoun...",0.067889,0.850556,0.081444,0.058444,0.072471,0.03766,0.889869
1,META,2022-10-01,03:05:16,1.0,0.0,['How come articles like this talk about the i...,Facebook scrambles to escape stock’s death spi...,en,"[{'neg': 0.118, 'neu': 0.848, 'pos': 0.034, 'c...",0.066553,0.829868,0.077263,0.066716,0.010709,0.950504,0.038787
2,META,2022-09-30,18:07:29,1.0,0.0,[' >and are dollar cost averaging into etfs gr...,Some folks on here can not see the forest for ...,en,"[{'neg': 0.053, 'neu': 0.947, 'pos': 0.0, 'com...",0.0878,0.8527,0.0596,0.01969,0.033386,0.326413,0.6402
3,META,2022-09-30,00:07:37,1.0,0.0,[],FCF Yield of Stocks I am Watching I recently c...,en,"[{'neg': 0.075, 'neu': 0.753, 'pos': 0.172, 'c...",0.020833,0.909667,0.069417,0.155742,0.037562,0.147082,0.815357
4,META,2022-09-29,18:44:34,1.0,0.0,['3/95 (3%) reduction in expense is not enough...,"Meta Announces Hiring Freeze, Warns Employees ...",en,"[{'neg': 0.095, 'neu': 0.84, 'pos': 0.065, 'co...",0.0359,0.9126,0.0515,0.02824,0.012997,0.936364,0.050639


In [159]:
df_comments = finbert_df.copy()

### Comments on each post

In [160]:
df_comments.shape

(21119, 16)

In [161]:
df_comments = df_comments[df_comments['comments'].notna()]

In [162]:
holder = df_comments['comments'].copy()

all_senti = []
for comments in holder.tolist():
    all_vader = []
    for comment in tokenize.sent_tokenize(comments):
        senti = sia.polarity_scores(comment)
        all_vader.append(senti)
    all_senti.append(all_vader)

In [163]:
all_senti_col = pd.DataFrame({"comment_sentiment" : all_senti})
df_comments = pd.concat([df_comments, all_senti_col], axis=1)

In [164]:
df_comments = df_comments[df_comments['comment_sentiment'].notna()]

**Note: Split the VADER sentiment scores into 4 separate columns, averaging over the sentences of each comment string**

In [165]:
list_neg_score = []

for i in range(len(df_comments)):
    each_neg_score = []
    sentiment_score = df_comments.loc[i, "comment_sentiment"]
    for item in sentiment_score:
        each_neg_score.append(item["neg"])
    try:
        list_neg_score.append(statistics.mean(each_neg_score))
    except statistics.StatisticsError:  # empty list: no comment sentences
        list_neg_score.append(0)

list_neg_score = pd.DataFrame(list_neg_score, columns=['comment_neg_sentiment'])
    
df_comments = pd.concat([df_comments, list_neg_score], axis=1)

In [166]:
list_neu_score = []

for i in range(len(df_comments)):
    each_neu_score = []
    sentiment_score = df_comments.loc[i, "comment_sentiment"]
    for item in sentiment_score:
        each_neu_score.append(item["neu"])
    try:
        list_neu_score.append(statistics.mean(each_neu_score))
    except statistics.StatisticsError:  # empty list: no comment sentences
        list_neu_score.append(0)

list_neu_score = pd.DataFrame(list_neu_score, columns=['comment_neu_sentiment'])
    
df_comments = pd.concat([df_comments, list_neu_score], axis=1)

In [167]:
list_pos_score = []

for i in range(len(df_comments)):
    each_pos_score = []
    sentiment_score = df_comments.loc[i, "comment_sentiment"]
    for item in sentiment_score:
        each_pos_score.append(item["pos"])
    try:
        list_pos_score.append(statistics.mean(each_pos_score))
    except statistics.StatisticsError:  # empty list: no comment sentences
        list_pos_score.append(0)

list_pos_score = pd.DataFrame(list_pos_score, columns=['comment_pos_sentiment'])
    
df_comments = pd.concat([df_comments, list_pos_score], axis=1)

In [168]:
list_compound_score = []

for i in range(len(df_comments)):
    each_compound_score = []
    sentiment_score = df_comments.loc[i, "comment_sentiment"]
    for item in sentiment_score:
        each_compound_score.append(item["compound"])
    try:
        list_compound_score.append(statistics.mean(each_compound_score))
    except statistics.StatisticsError:  # empty list: no comment sentences
        list_compound_score.append(0)

list_compound_score = pd.DataFrame(list_compound_score, columns=['comment_compound_sentiment'])
    
df_comments = pd.concat([df_comments, list_compound_score], axis=1)

**Use TextBlob to obtain polarity and subjectivity scores**

In [169]:
# define a function that accepts text and returns the polarity
def detect_sentiment_polarity(text):
    
    # use this line for Python 2 (avoids UnicodeDecodeError for some reviews)
    # blob = TextBlob(text.decode(encoding='utf-8'))
    
    # use this line instead for Python 3
    blob = TextBlob(text)
    
    # return the polarity
    return blob.sentiment.polarity

In [170]:
# define a function that accepts text and returns the subjectivity
def detect_sentiment_subjectivity(text):
    
    # use this line for Python 2 (avoids UnicodeDecodeError for some reviews)
    # blob = TextBlob(text.decode(encoding='utf-8'))
    
    # use this line instead for Python 3
    blob = TextBlob(text)
    
    # return the subjectivity
    return blob.sentiment.subjectivity

In [171]:
df_comments = df_comments[df_comments['title_self_text'].notna()]

In [172]:
df_comments["title_textblob_polarity"] = df_comments["title_self_text"].apply(detect_sentiment_polarity)
df_comments.head()

Unnamed: 0,company_name,date,time,score,total_comments,comments,title_self_text,lang,sentiment,title_neg_sentiment,...,title_compound_sentiment,finbert_positive,finbert_negative,finbert_neutral,comment_sentiment,comment_neg_sentiment,comment_neu_sentiment,comment_pos_sentiment,comment_compound_sentiment,title_textblob_polarity
0,META,2022-10-01,14:02:35,1.0,0.0,['“Last week when the market was just beginnin...,Question on what to do with both TQQQ and SQQQ...,en,"[{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compoun...",0.067889,...,0.058444,0.072471,0.03766,0.889869,"[{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compoun...",0.067773,0.885273,0.047045,-0.012709,0.05
1,META,2022-10-01,03:05:16,1.0,0.0,['How come articles like this talk about the i...,Facebook scrambles to escape stock’s death spi...,en,"[{'neg': 0.118, 'neu': 0.848, 'pos': 0.034, 'c...",0.066553,...,0.066716,0.010709,0.950504,0.038787,"[{'neg': 0.267, 'neu': 0.668, 'pos': 0.065, 'c...",0.107855,0.786172,0.105977,0.009081,0.054842
2,META,2022-09-30,18:07:29,1.0,0.0,[' >and are dollar cost averaging into etfs gr...,Some folks on here can not see the forest for ...,en,"[{'neg': 0.053, 'neu': 0.947, 'pos': 0.0, 'com...",0.0878,...,0.01969,0.033386,0.326413,0.6402,"[{'neg': 0.0, 'neu': 0.728, 'pos': 0.272, 'com...",0.055464,0.84955,0.094981,0.063796,0.107778
3,META,2022-09-30,00:07:37,1.0,0.0,[],FCF Yield of Stocks I am Watching I recently c...,en,"[{'neg': 0.075, 'neu': 0.753, 'pos': 0.172, 'c...",0.020833,...,0.155742,0.037562,0.147082,0.815357,"[{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compoun...",0.0,1.0,0.0,0.0,0.048
4,META,2022-09-29,18:44:34,1.0,0.0,['3/95 (3%) reduction in expense is not enough...,"Meta Announces Hiring Freeze, Warns Employees ...",en,"[{'neg': 0.095, 'neu': 0.84, 'pos': 0.065, 'co...",0.0359,...,0.02824,0.012997,0.936364,0.050639,"[{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compoun...",0.091446,0.820754,0.087785,0.037768,0.089036


In [173]:
df_comments["comment_textblob_polarity"] = df_comments["comments"].apply(detect_sentiment_polarity)
df_comments.head()

Unnamed: 0,company_name,date,time,score,total_comments,comments,title_self_text,lang,sentiment,title_neg_sentiment,...,finbert_positive,finbert_negative,finbert_neutral,comment_sentiment,comment_neg_sentiment,comment_neu_sentiment,comment_pos_sentiment,comment_compound_sentiment,title_textblob_polarity,comment_textblob_polarity
0,META,2022-10-01,14:02:35,1.0,0.0,['“Last week when the market was just beginnin...,Question on what to do with both TQQQ and SQQQ...,en,"[{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compoun...",0.067889,...,0.072471,0.03766,0.889869,"[{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compoun...",0.067773,0.885273,0.047045,-0.012709,0.05,-0.003924
1,META,2022-10-01,03:05:16,1.0,0.0,['How come articles like this talk about the i...,Facebook scrambles to escape stock’s death spi...,en,"[{'neg': 0.118, 'neu': 0.848, 'pos': 0.034, 'c...",0.066553,...,0.010709,0.950504,0.038787,"[{'neg': 0.267, 'neu': 0.668, 'pos': 0.065, 'c...",0.107855,0.786172,0.105977,0.009081,0.054842,0.026422
2,META,2022-09-30,18:07:29,1.0,0.0,[' >and are dollar cost averaging into etfs gr...,Some folks on here can not see the forest for ...,en,"[{'neg': 0.053, 'neu': 0.947, 'pos': 0.0, 'com...",0.0878,...,0.033386,0.326413,0.6402,"[{'neg': 0.0, 'neu': 0.728, 'pos': 0.272, 'com...",0.055464,0.84955,0.094981,0.063796,0.107778,0.097039
3,META,2022-09-30,00:07:37,1.0,0.0,[],FCF Yield of Stocks I am Watching I recently c...,en,"[{'neg': 0.075, 'neu': 0.753, 'pos': 0.172, 'c...",0.020833,...,0.037562,0.147082,0.815357,"[{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compoun...",0.0,1.0,0.0,0.0,0.048,0.0
4,META,2022-09-29,18:44:34,1.0,0.0,['3/95 (3%) reduction in expense is not enough...,"Meta Announces Hiring Freeze, Warns Employees ...",en,"[{'neg': 0.095, 'neu': 0.84, 'pos': 0.065, 'co...",0.0359,...,0.012997,0.936364,0.050639,"[{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compoun...",0.091446,0.820754,0.087785,0.037768,0.089036,0.144646


In [174]:
df_comments["title_textblob_subjectivity"] = df_comments["title_self_text"].apply(detect_sentiment_subjectivity)

In [175]:
df_comments["comment_textblob_subjectivity"] = df_comments["comments"].apply(detect_sentiment_subjectivity)

In [176]:
#Convert date from str to datetime object
df_comments["date"] = pd.to_datetime(df_comments["date"])

**Note: Use the FinBERT transformer to obtain sentiment scores for the comments**

In [177]:
def sentim_analyzer(df, tokenizer, model):
    ''' Given a df that contains a column 'comments' with comment texts, it runs inference on each comment with the 'model' (FinBERT)
       and inserts the output sentiment features into the dataframe in the respective columns (Positive, Negative, Neutral)
       
        Parameters :
          df : A dataframe that contains the comment text in a column called 'comments' . 
          tokenizer(AutoTokenizer object) : A pre-processing tokenizer object from the Hugging Face library. 
          model (AutoModelForSequenceClassification object) : A Hugging Face transformer model.     
          
          returns df : The initial dataframe with the 3 sentiment features as columns for each comment'''
    
    # Fail early with a clear error instead of printing and returning None
    if 'comments' not in df.columns:
        raise KeyError("'comments' column is missing from the dataframe")

    for i in tqdm(df.index):
        text = df.loc[i, 'comments']
        # Pre-process the input text ('device' is the global torch device set earlier in the notebook)
        inputs = tokenizer(text, padding=True, truncation=True, return_tensors='pt').to(device)
        # Estimate output without tracking gradients, since this is inference only
        with torch.no_grad():
            output = model(**inputs)
        # Pass model output logits through a softmax layer.
        predictions = torch.nn.functional.softmax(output.logits, dim=-1)
        # Index order 0=positive, 1=negative, 2=neutral is assumed here;
        # verify it against model.config.id2label for the checkpoint in use.
        df.loc[i, 'Positive'] = predictions[0][0].item()
        df.loc[i, 'Negative'] = predictions[0][1].item()
        df.loc[i, 'Neutral']  = predictions[0][2].item()
    return df
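
The softmax step in the function above turns the three raw model scores (logits) into probabilities that sum to 1. The following is a minimal sketch of that step in plain Python, independent of FinBERT. The logits are made-up values, and the positive/negative/neutral order is an assumption taken from how the function indexes its output.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one comment, assumed order: positive, negative, neutral
example_logits = [-1.2, 1.5, 0.9]
probs = softmax(example_logits)
positive, negative, neutral = probs
print(positive, negative, neutral)
```

The largest logit always gets the largest probability, so the ranking of the three classes is unchanged by softmax.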

In [178]:
finbert_comment_df = sentim_analyzer(df_comments, tokenizer, model)

100%|█████████████████████████████████████| 21119/21119 [13:40<00:00, 25.74it/s]


In [179]:
df_comments = finbert_comment_df.copy()

In [180]:
df_comments.head()

Unnamed: 0,company_name,date,time,score,total_comments,comments,title_self_text,lang,sentiment,title_neg_sentiment,...,comment_neu_sentiment,comment_pos_sentiment,comment_compound_sentiment,title_textblob_polarity,comment_textblob_polarity,title_textblob_subjectivity,comment_textblob_subjectivity,Positive,Negative,Neutral
0,META,2022-10-01,14:02:35,1.0,0.0,['“Last week when the market was just beginnin...,Question on what to do with both TQQQ and SQQQ...,en,"[{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compoun...",0.067889,...,0.885273,0.047045,-0.012709,0.05,-0.003924,0.185,0.45269,0.038201,0.554231,0.407568
1,META,2022-10-01,03:05:16,1.0,0.0,['How come articles like this talk about the i...,Facebook scrambles to escape stock’s death spi...,en,"[{'neg': 0.118, 'neu': 0.848, 'pos': 0.034, 'c...",0.066553,...,0.786172,0.105977,0.009081,0.054842,0.026422,0.413646,0.494938,0.031542,0.131909,0.836549
2,META,2022-09-30,18:07:29,1.0,0.0,[' >and are dollar cost averaging into etfs gr...,Some folks on here can not see the forest for ...,en,"[{'neg': 0.053, 'neu': 0.947, 'pos': 0.0, 'com...",0.0878,...,0.84955,0.094981,0.063796,0.107778,0.097039,0.382937,0.503998,0.054207,0.163752,0.78204
3,META,2022-09-30,00:07:37,1.0,0.0,[],FCF Yield of Stocks I am Watching I recently c...,en,"[{'neg': 0.075, 'neu': 0.753, 'pos': 0.172, 'c...",0.020833,...,1.0,0.0,0.0,0.048,0.0,0.436167,0.0,0.07117,0.187309,0.741521
4,META,2022-09-29,18:44:34,1.0,0.0,['3/95 (3%) reduction in expense is not enough...,"Meta Announces Hiring Freeze, Warns Employees ...",en,"[{'neg': 0.095, 'neu': 0.84, 'pos': 0.065, 'co...",0.0359,...,0.820754,0.087785,0.037768,0.089036,0.144646,0.394187,0.548611,0.024031,0.791251,0.184718


In [181]:
df_comments = df_comments.rename(columns={'Positive': 'finbert_comment_positive', 'Negative': 'finbert_comment_negative', 'Neutral': 'finbert_comment_neutral'})

In [182]:
df_comments.head()

Unnamed: 0,company_name,date,time,score,total_comments,comments,title_self_text,lang,sentiment,title_neg_sentiment,...,comment_neu_sentiment,comment_pos_sentiment,comment_compound_sentiment,title_textblob_polarity,comment_textblob_polarity,title_textblob_subjectivity,comment_textblob_subjectivity,finbert_comment_positive,finbert_comment_negative,finbert_comment_neutral
0,META,2022-10-01,14:02:35,1.0,0.0,['“Last week when the market was just beginnin...,Question on what to do with both TQQQ and SQQQ...,en,"[{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compoun...",0.067889,...,0.885273,0.047045,-0.012709,0.05,-0.003924,0.185,0.45269,0.038201,0.554231,0.407568
1,META,2022-10-01,03:05:16,1.0,0.0,['How come articles like this talk about the i...,Facebook scrambles to escape stock’s death spi...,en,"[{'neg': 0.118, 'neu': 0.848, 'pos': 0.034, 'c...",0.066553,...,0.786172,0.105977,0.009081,0.054842,0.026422,0.413646,0.494938,0.031542,0.131909,0.836549
2,META,2022-09-30,18:07:29,1.0,0.0,[' >and are dollar cost averaging into etfs gr...,Some folks on here can not see the forest for ...,en,"[{'neg': 0.053, 'neu': 0.947, 'pos': 0.0, 'com...",0.0878,...,0.84955,0.094981,0.063796,0.107778,0.097039,0.382937,0.503998,0.054207,0.163752,0.78204
3,META,2022-09-30,00:07:37,1.0,0.0,[],FCF Yield of Stocks I am Watching I recently c...,en,"[{'neg': 0.075, 'neu': 0.753, 'pos': 0.172, 'c...",0.020833,...,1.0,0.0,0.0,0.048,0.0,0.436167,0.0,0.07117,0.187309,0.741521
4,META,2022-09-29,18:44:34,1.0,0.0,['3/95 (3%) reduction in expense is not enough...,"Meta Announces Hiring Freeze, Warns Employees ...",en,"[{'neg': 0.095, 'neu': 0.84, 'pos': 0.065, 'co...",0.0359,...,0.820754,0.087785,0.037768,0.089036,0.144646,0.394187,0.548611,0.024031,0.791251,0.184718


In [183]:
df_comments.drop(columns=["time", "comments", "title_self_text", "sentiment", "comment_sentiment", "lang"], inplace=True)

In [184]:
# Feature engineering: Reputation
df_comments['reputation'] = df_comments["score"] * df_comments["total_comments"]

In [185]:
# Feature engineering: Groupthink
df_comments["groupthink"] = 0.40 * df_comments['title_compound_sentiment'] + \
                            0.30 * df_comments['comment_compound_sentiment'] + \
                            0.05 * df_comments['title_textblob_subjectivity'] + \
                            0.05 * df_comments['comment_textblob_subjectivity'] + \
                            0.10 * df_comments['score'] + \
                            0.10 * df_comments['total_comments']
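
As a sanity check, the weighted sum can be reproduced by hand for the first row shown in the outputs of this notebook. This sketch uses the rounded values displayed for row 0, so the result only matches to about five decimal places. The dictionary names are illustrative and not part of the notebook.

```python
# Weights from the groupthink formula above
weights = {
    'title_compound_sentiment': 0.40,
    'comment_compound_sentiment': 0.30,
    'title_textblob_subjectivity': 0.05,
    'comment_textblob_subjectivity': 0.05,
    'score': 0.10,
    'total_comments': 0.10,
}

# Rounded values for row 0 as displayed in the notebook outputs
row0 = {
    'title_compound_sentiment': 0.058444,
    'comment_compound_sentiment': -0.012709,
    'title_textblob_subjectivity': 0.185,
    'comment_textblob_subjectivity': 0.45269,
    'score': 1.0,
    'total_comments': 0.0,
}

groupthink_row0 = sum(weights[k] * row0[k] for k in weights)
print(round(groupthink_row0, 5))
```

The weights sum to 1, but `score` and `total_comments` are unbounded while the sentiment terms are not, so on busy days those two terms are likely to dominate the feature.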

In [186]:
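# Feature engineering: highest FinBERT class probability for title_self_text
# (the column is named "argmax", but .max(axis=1) returns the probability, not the winning label)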
df_comments["finbert_argmax"] = df_comments[['finbert_positive','finbert_negative','finbert_neutral']].max(axis=1)

In [187]:
# Feature engineering: highest FinBERT class probability for comment
# (again the maximum probability, not the winning label)
df_comments["finbert_comment_argmax"] = df_comments[['finbert_comment_positive','finbert_comment_negative','finbert_comment_neutral']].max(axis=1)
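
To make the difference between the maximum probability and the actual argmax concrete, here is a small sketch on a two-row frame built from the comment probabilities displayed earlier for rows 0 and 1. The `demo` frame and the `max_probability` and `predicted_label` column names are illustrative only.

```python
import pandas as pd

cols = ['finbert_comment_positive', 'finbert_comment_negative', 'finbert_comment_neutral']
demo = pd.DataFrame(
    [[0.038201, 0.554231, 0.407568],
     [0.031542, 0.131909, 0.836549]],
    columns=cols,
)

# What the notebook stores: the largest of the three probabilities
demo['max_probability'] = demo[cols].max(axis=1)
# The argmax proper: the name of the column holding that largest value
demo['predicted_label'] = demo[cols].idxmax(axis=1)
print(demo[['max_probability', 'predicted_label']])
```

The maximum probability can be read as the model's confidence in its prediction, whatever the predicted class is, while `idxmax` says which class was predicted.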

In [188]:
df_comments.head()

Unnamed: 0,company_name,date,score,total_comments,title_neg_sentiment,title_neu_sentiment,title_pos_sentiment,title_compound_sentiment,finbert_positive,finbert_negative,...,comment_textblob_polarity,title_textblob_subjectivity,comment_textblob_subjectivity,finbert_comment_positive,finbert_comment_negative,finbert_comment_neutral,reputation,groupthink,finbert_argmax,finbert_comment_argmax
0,META,2022-10-01,1.0,0.0,0.067889,0.850556,0.081444,0.058444,0.072471,0.03766,...,-0.003924,0.185,0.45269,0.038201,0.554231,0.407568,0.0,0.15145,0.889869,0.554231
1,META,2022-10-01,1.0,0.0,0.066553,0.829868,0.077263,0.066716,0.010709,0.950504,...,0.026422,0.413646,0.494938,0.031542,0.131909,0.836549,0.0,0.17484,0.950504,0.836549
2,META,2022-09-30,1.0,0.0,0.0878,0.8527,0.0596,0.01969,0.033386,0.326413,...,0.097039,0.382937,0.503998,0.054207,0.163752,0.78204,0.0,0.171361,0.6402,0.78204
3,META,2022-09-30,1.0,0.0,0.020833,0.909667,0.069417,0.155742,0.037562,0.147082,...,0.0,0.436167,0.0,0.07117,0.187309,0.741521,0.0,0.184105,0.815357,0.741521
4,META,2022-09-29,1.0,0.0,0.0359,0.9126,0.0515,0.02824,0.012997,0.936364,...,0.144646,0.394187,0.548611,0.024031,0.791251,0.184718,0.0,0.169766,0.936364,0.791251


In [189]:
# Group by date and perform aggregation
daily_raw = df_comments.groupby(["date"]).agg({"score": "mean",
                                               "total_comments": "sum",
                                               "title_neg_sentiment": "mean",
                                               "title_neu_sentiment": "mean",
                                               "title_pos_sentiment": "mean",
                                               "title_compound_sentiment": "mean",
                                               "comment_neg_sentiment": "mean",
                                               "comment_neu_sentiment": "mean",
                                               "comment_pos_sentiment": "mean",
                                               "comment_compound_sentiment": "mean",
                                               "title_textblob_polarity": "mean",
                                               "comment_textblob_polarity": "mean",
                                               "title_textblob_subjectivity": "mean",
                                               "comment_textblob_subjectivity": "mean",
                                               "reputation": "mean",
                                               "groupthink": "mean",
                                               "finbert_positive": "mean",
                                               "finbert_negative": "mean",
                                               "finbert_neutral": "mean",
                                               "finbert_comment_positive": "mean",
                                               "finbert_comment_negative": "mean",
                                               "finbert_comment_neutral": "mean",
                                               "finbert_argmax": "mean",
                                               "finbert_comment_argmax": "mean"})
daily_raw

date,score,total_comments,title_neg_sentiment,title_neu_sentiment,title_pos_sentiment,title_compound_sentiment,comment_neg_sentiment,comment_neu_sentiment,comment_pos_sentiment,comment_compound_sentiment,...,reputation,groupthink,finbert_positive,finbert_negative,finbert_neutral,finbert_comment_positive,finbert_comment_negative,finbert_comment_neutral,finbert_argmax,finbert_comment_argmax
2015-12-31,1.0,2.0,0.000000,1.000000,0.000000,0.000000,0.139000,0.809250,0.051750,-0.220575,...,2.000000,0.280053,0.070723,0.014945,0.914332,0.045451,0.180246,0.774303,0.914332,0.774303
2016-01-01,1.0,3.0,0.000000,1.000000,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,...,3.000000,0.437381,0.113130,0.049625,0.837245,0.812084,0.032854,0.155062,0.837245,0.812084
2016-01-02,1.0,0.0,0.014000,0.786000,0.200000,0.972800,0.000000,1.000000,0.000000,0.000000,...,0.000000,0.489120,0.024031,0.031584,0.944386,0.071170,0.187309,0.741521,0.944386,0.741521
2016-01-04,1.0,0.0,0.145667,0.854333,0.000000,-0.158900,0.000000,1.000000,0.000000,0.000000,...,0.000000,0.055607,0.043178,0.066855,0.889967,0.071170,0.187309,0.741521,0.889967,0.741521
2016-01-05,0.5,11.0,0.087591,0.594182,0.068227,0.034930,0.006071,0.963571,0.030357,0.051557,...,5.000000,0.646262,0.151048,0.018564,0.830388,0.050246,0.130463,0.819291,0.830388,0.819291
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-09-27,1.0,6.0,0.033976,0.847341,0.118688,0.189947,0.055574,0.839411,0.104997,0.102278,...,0.285714,0.273131,0.301345,0.128033,0.570622,0.067892,0.140756,0.791352,0.801563,0.791352
2022-09-28,1.0,7.0,0.037721,0.847408,0.114895,0.182350,0.039135,0.907632,0.053224,0.021364,...,0.700000,0.283905,0.163387,0.349807,0.486806,0.053114,0.268386,0.678500,0.773973,0.774137
2022-09-29,1.0,9.0,0.082919,0.791675,0.094151,-0.035333,0.027054,0.908362,0.064595,0.092364,...,0.562500,0.196290,0.115126,0.327657,0.557217,0.061205,0.186413,0.752382,0.865946,0.796102
2022-09-30,1.0,7.0,0.038535,0.858872,0.102574,0.150119,0.038218,0.916454,0.045330,0.007492,...,0.538462,0.244689,0.156086,0.396583,0.447331,0.083272,0.163445,0.753282,0.766348,0.754630
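
One consequence of aggregating with `"mean"` is that the daily `finbert_argmax` is the mean of the per-post maxima, which is generally not the same as the largest of the daily mean probabilities. This explains rows such as 2022-09-27 above, where `finbert_argmax` (0.801563) is larger than any of the three daily mean probabilities. A minimal sketch with two made-up posts on one day:

```python
import pandas as pd

prob_cols = ['finbert_positive', 'finbert_negative', 'finbert_neutral']

# Two hypothetical posts on the same day with different winning classes
posts = pd.DataFrame({
    'date': pd.to_datetime(['2022-09-27', '2022-09-27']),
    'finbert_positive': [0.90, 0.10],
    'finbert_negative': [0.05, 0.20],
    'finbert_neutral': [0.05, 0.70],
})
posts['finbert_argmax'] = posts[prob_cols].max(axis=1)

daily = posts.groupby('date').agg({c: 'mean' for c in prob_cols + ['finbert_argmax']})

mean_of_max = daily['finbert_argmax'].iloc[0]        # (0.90 + 0.70) / 2
max_of_mean = daily[prob_cols].max(axis=1).iloc[0]   # max(0.50, 0.125, 0.375)
print(mean_of_max, max_of_mean)
```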


In [190]:
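# Create a dataframe with one row per calendar day in the study window
# (ISO dates avoid the day-first parsing ambiguity of strings like '30/9/2016')
complete_date = pd.DataFrame(pd.date_range(start='2016-09-30', end='2022-09-30'), columns=['date'])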
complete_date.shape


(2192, 1)

In [191]:
daily_missing = complete_date.merge(right=daily_raw, how="left", left_on="date", right_on=daily_raw.index)
daily_missing

Unnamed: 0,date,score,total_comments,title_neg_sentiment,title_neu_sentiment,title_pos_sentiment,title_compound_sentiment,comment_neg_sentiment,comment_neu_sentiment,comment_pos_sentiment,...,reputation,groupthink,finbert_positive,finbert_negative,finbert_neutral,finbert_comment_positive,finbert_comment_negative,finbert_comment_neutral,finbert_argmax,finbert_comment_argmax
0,2016-09-30,9.5,37.0,0.080165,0.863329,0.056506,-0.046520,0.095843,0.833952,0.070240,...,145.250000,1.908280,0.098404,0.031920,0.869676,0.178378,0.180734,0.640889,0.869676,0.640889
1,2016-10-01,1.0,0.0,0.063000,0.847750,0.089250,0.195475,0.000000,1.000000,0.000000,...,0.000000,0.178190,0.034611,0.295551,0.669839,0.071170,0.187309,0.741521,0.669839,0.741521
2,2016-10-02,22.0,20.0,0.097667,0.890833,0.011500,-0.398842,0.028111,0.884500,0.087389,...,430.000000,3.112529,0.051544,0.243121,0.705336,0.067948,0.102282,0.829770,0.705336,0.829770
3,2016-10-03,0.5,6.0,0.021000,0.777850,0.201150,0.373685,0.000000,0.776625,0.223375,...,0.000000,0.553340,0.043442,0.095442,0.861116,0.066328,0.110353,0.823319,0.861116,0.823319
4,2016-10-04,70.5,71.0,0.037690,0.873034,0.089283,0.172521,0.008360,0.931919,0.059728,...,4525.500000,8.945029,0.286733,0.336762,0.376505,0.107465,0.149286,0.743249,0.750727,0.743249
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2187,2022-09-26,1.0,4.0,0.080005,0.772613,0.114048,0.238236,0.080373,0.867171,0.052461,...,0.666667,0.280131,0.157242,0.285663,0.557095,0.066088,0.381197,0.552715,0.696704,0.737326
2188,2022-09-27,1.0,6.0,0.033976,0.847341,0.118688,0.189947,0.055574,0.839411,0.104997,...,0.285714,0.273131,0.301345,0.128033,0.570622,0.067892,0.140756,0.791352,0.801563,0.791352
2189,2022-09-28,1.0,7.0,0.037721,0.847408,0.114895,0.182350,0.039135,0.907632,0.053224,...,0.700000,0.283905,0.163387,0.349807,0.486806,0.053114,0.268386,0.678500,0.773973,0.774137
2190,2022-09-29,1.0,9.0,0.082919,0.791675,0.094151,-0.035333,0.027054,0.908362,0.064595,...,0.562500,0.196290,0.115126,0.327657,0.557217,0.061205,0.186413,0.752382,0.865946,0.796102


In [192]:
# Fill days with no posts using the values from the most recent day that has data (forward fill)
daily_filled = daily_missing.ffill()
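
The calendar merge followed by a forward fill can be seen on a tiny example. This sketch uses a four-day calendar and two observed days. The frame names are illustrative, and the merge is done on a plain `date` column rather than on the index for simplicity.

```python
import pandas as pd

# Days on which posts were actually observed
observed = pd.DataFrame({
    'date': pd.to_datetime(['2016-09-30', '2016-10-02']),
    'score': [9.5, 22.0],
})

# Full calendar covering every day in the window
calendar = pd.DataFrame({'date': pd.date_range('2016-09-30', '2016-10-03')})

# Left merge keeps every calendar day; days without posts get NaN
merged = calendar.merge(observed, how='left', on='date')

# Forward fill copies the last available value into the gaps
filled = merged.ffill()
print(filled)
```

Note that a forward fill also repeats summed columns such as `total_comments` on days without posts, and it leaves leading gaps as NaN if the first calendar day has no data.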

In [193]:
daily_filled["company"] = "META"

In [194]:
daily_filled = daily_filled[['date',
                             'company',
                             'score',
                             'total_comments',
                             'title_neg_sentiment',
                             'title_neu_sentiment',
                             'title_pos_sentiment',
                             'title_compound_sentiment',
                             'comment_neg_sentiment',
                             'comment_neu_sentiment',
                             'comment_pos_sentiment',
                             'comment_compound_sentiment',
                             'title_textblob_polarity',
                             'comment_textblob_polarity',
                             'title_textblob_subjectivity',
                             'comment_textblob_subjectivity',
                             'reputation',
                             'groupthink',
                             'finbert_positive',
                             'finbert_negative',
                             'finbert_neutral',
                             'finbert_argmax',
                             'finbert_comment_positive',
                             'finbert_comment_negative',
                             'finbert_comment_neutral',
                             'finbert_comment_argmax'
 ]]

daily_filled

Unnamed: 0,date,company,score,total_comments,title_neg_sentiment,title_neu_sentiment,title_pos_sentiment,title_compound_sentiment,comment_neg_sentiment,comment_neu_sentiment,...,reputation,groupthink,finbert_positive,finbert_negative,finbert_neutral,finbert_argmax,finbert_comment_positive,finbert_comment_negative,finbert_comment_neutral,finbert_comment_argmax
0,2016-09-30,META,9.5,37.0,0.080165,0.863329,0.056506,-0.046520,0.095843,0.833952,...,145.250000,1.908280,0.098404,0.031920,0.869676,0.869676,0.178378,0.180734,0.640889,0.640889
1,2016-10-01,META,1.0,0.0,0.063000,0.847750,0.089250,0.195475,0.000000,1.000000,...,0.000000,0.178190,0.034611,0.295551,0.669839,0.669839,0.071170,0.187309,0.741521,0.741521
2,2016-10-02,META,22.0,20.0,0.097667,0.890833,0.011500,-0.398842,0.028111,0.884500,...,430.000000,3.112529,0.051544,0.243121,0.705336,0.705336,0.067948,0.102282,0.829770,0.829770
3,2016-10-03,META,0.5,6.0,0.021000,0.777850,0.201150,0.373685,0.000000,0.776625,...,0.000000,0.553340,0.043442,0.095442,0.861116,0.861116,0.066328,0.110353,0.823319,0.823319
4,2016-10-04,META,70.5,71.0,0.037690,0.873034,0.089283,0.172521,0.008360,0.931919,...,4525.500000,8.945029,0.286733,0.336762,0.376505,0.750727,0.107465,0.149286,0.743249,0.743249
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2187,2022-09-26,META,1.0,4.0,0.080005,0.772613,0.114048,0.238236,0.080373,0.867171,...,0.666667,0.280131,0.157242,0.285663,0.557095,0.696704,0.066088,0.381197,0.552715,0.737326
2188,2022-09-27,META,1.0,6.0,0.033976,0.847341,0.118688,0.189947,0.055574,0.839411,...,0.285714,0.273131,0.301345,0.128033,0.570622,0.801563,0.067892,0.140756,0.791352,0.791352
2189,2022-09-28,META,1.0,7.0,0.037721,0.847408,0.114895,0.182350,0.039135,0.907632,...,0.700000,0.283905,0.163387,0.349807,0.486806,0.773973,0.053114,0.268386,0.678500,0.774137
2190,2022-09-29,META,1.0,9.0,0.082919,0.791675,0.094151,-0.035333,0.027054,0.908362,...,0.562500,0.196290,0.115126,0.327657,0.557217,0.865946,0.061205,0.186413,0.752382,0.796102


In [195]:
new_col_name = list(daily_filled)[0:2] + ['reddit_' + item for item in list(daily_filled)[2:]]
new_col_name

['date',
 'company',
 'reddit_score',
 'reddit_total_comments',
 'reddit_title_neg_sentiment',
 'reddit_title_neu_sentiment',
 'reddit_title_pos_sentiment',
 'reddit_title_compound_sentiment',
 'reddit_comment_neg_sentiment',
 'reddit_comment_neu_sentiment',
 'reddit_comment_pos_sentiment',
 'reddit_comment_compound_sentiment',
 'reddit_title_textblob_polarity',
 'reddit_comment_textblob_polarity',
 'reddit_title_textblob_subjectivity',
 'reddit_comment_textblob_subjectivity',
 'reddit_reputation',
 'reddit_groupthink',
 'reddit_finbert_positive',
 'reddit_finbert_negative',
 'reddit_finbert_neutral',
 'reddit_finbert_argmax',
 'reddit_finbert_comment_positive',
 'reddit_finbert_comment_negative',
 'reddit_finbert_comment_neutral',
 'reddit_finbert_comment_argmax']
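
The cell above only builds the list of prefixed names. It does not assign them to the dataframe, so a file written from `daily_filled` as-is keeps the unprefixed names. Whether the prefix is meant to reach the saved file is not stated in the notebook. If it is, a minimal sketch of applying such a list on a made-up one-row frame looks like this:

```python
import pandas as pd

# Hypothetical miniature of daily_filled
demo = pd.DataFrame({
    'date': ['2016-09-30'],
    'company': ['META'],
    'score': [9.5],
    'groupthink': [1.90828],
})

# Same construction as in the notebook: keep the first two names, prefix the rest
new_names = list(demo)[0:2] + ['reddit_' + item for item in list(demo)[2:]]

# Assign the new names on a copy so the original frame is left untouched
renamed = demo.copy()
renamed.columns = new_names
print(list(renamed))
```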

In [196]:
daily_filled.to_csv("METAFinbertRedditSentiment.csv", index=False)