Sentiment Analysis using NLTK, AFINN, and BERT

This project analyzes the sentiment of Reddit comments using three methods: NLTK, AFINN, and BERT. Each section below explains one method and its role in the sentiment analysis pipeline.

1. NLTK (Natural Language Toolkit)

NLTK is a Python library for natural language processing that provides tools for text processing and analysis, including sentiment analysis. In this project, NLTK is used for tokenization, stopword removal, and sentiment scoring. The tokenization and stopword removal were applied earlier to produce the cleaned_comments column that is loaded below.

Sentiment Scoring:
NLTK offers several sentiment analysis approaches, including pre-built sentiment lexicons. In this project, we use the VADER (Valence Aware Dictionary and sEntiment Reasoner) analyzer provided by NLTK. VADER is a lexicon- and rule-based tool: its polarity_scores method returns negative, neutral, and positive proportions plus a normalized compound score between -1 (most negative) and +1 (most positive). The cells below apply it to each word of a comment and keep the compound score.
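To make the compound score more concrete, the sketch below reproduces the normalization VADER applies to a summed valence, x / sqrt(x^2 + alpha) with alpha = 15, together with the commonly used cutoffs of +0.05 and -0.05 for labeling a text. The helper names `normalize_valence` and `label_compound` are hypothetical and are not part of NLTK.

```python
import math

def normalize_valence(raw_score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

def label_compound(compound, threshold=0.05):
    """Map a compound score to a coarse label using the usual cutoffs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical raw valence sums, chosen only for illustration
for raw in (-4.0, 0.0, 4.0):
    compound = normalize_valence(raw)
    print(raw, round(compound, 4), label_compound(compound))
```

A raw sum of 4 maps to roughly 0.72, which shows why compound scores never reach exactly -1 or +1.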

In [1]:
# Importing Required Libraries

import nltk
import pandas as pd

In [2]:
# Using the VADER sentiment analyzer

nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/wailunchan/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [3]:
from nltk.sentiment import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

In [4]:
def get_sentiment_score(token):
    sentiment = sid.polarity_scores(token)
    return sentiment['compound']

In [5]:
# Load the data; the "cleaned_comments" column was created by NLTK text preprocessing

df = pd.read_csv("nltk_Reddit_comment_Jan-Jun23.csv")

In [6]:
# Calculate sentiment score for each word in the cleaned_comments and store in a single column

df['sentiment_scores'] = df['cleaned_comments'].apply(lambda x: [get_sentiment_score(token) for token in x.split()])

In [7]:
# Keep only the nonzero scores (non-list values are left unchanged)

df['sentiment_scores'] = df['sentiment_scores'].apply(lambda x: [score for score in x if score != 0.0] if isinstance(x, list) else x)

In [8]:
# Join the remaining scores of each comment into a single space-separated string

df['sentiment_scores'] = df['sentiment_scores'].apply(lambda x: ' '.join(str(score) for score in x))

In [9]:
# Drop the unneeded columns

df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], inplace=True)

In [10]:
df.head()

Unnamed: 0,index,date,comments,Open_x,High_x,Low_x,Close_x,Adj Close_x,Volume_x,percent_chnage_x,...,Close_y,Adj Close_y,Volume_y,percent_chnage_y,Jump_y,Big_Jump_y,Drop_y,Big_Drop_y,cleaned_comments,sentiment_scores
0,0,2022-07-13,This week's [Earnings Thread](https://www.redd...,3779.669922,3829.439941,3759.070068,3801.780029,3801.780029,4109390000,0.584975,...,11247.58008,11247.58008,4433060000,1.727757,0,1,0,0,week earnings thread http wwwredditcomrwallstr...,-0.5423 -0.5423 -0.5423 0.4588 0.4215 0.25 0.5...
1,1,2022-07-14,This week's [Earnings Thread](https://www.redd...,3763.98999,3796.409912,3721.560059,3790.379883,3790.379883,4199690000,0.701115,...,11251.19043,11251.19043,4481070000,0.896589,1,0,0,0,week earnings thread http wwwredditcomrwallstr...,-0.3182 -0.3182 -0.4404 -0.4019 0.4404 0.4215 ...
2,2,2022-07-15,Cashed out up 56k today. Now cuddled up watchi...,3818.0,3863.620117,3817.179932,3863.159912,3863.159912,4143800000,1.182816,...,11452.41992,11452.41992,4369060000,0.642036,1,0,0,0,cashed k today cuddled watching netflix eating...,0.4215 -0.4215 -0.5859 -0.4215 -0.4939 -0.1531...
3,3,2022-07-19,This week's [Earnings Thread](https://www.redd...,3860.72998,3939.810059,3860.72998,3936.689941,3936.689941,4041070000,1.967503,...,11713.15039,11713.15039,5302740000,1.720802,0,1,0,0,week earnings thread http wwwredditcomrwallstr...,0.5719 -0.3182 -0.6908 -0.5106 0.3612 0.3612 -...
4,4,2022-07-20,If 2008 was the Great Recession\n\nThen 2022 i...,3935.320068,3974.129883,3922.030029,3959.899902,3959.899902,4185300000,0.624596,...,11897.65039,11897.65039,5467080000,1.463067,0,1,0,0,great recession fake recession nflx gon na war...,0.6249 -0.4215 -0.4767 -0.4215 -0.1027 -0.3818...


In [11]:
df.to_csv("nltk_sentiment_test.csv")
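Because the per-word scores are saved as one space-separated string per comment, a later step has to parse them back into numbers. The sketch below shows one way to do that and to reduce them to a single value per comment. Averaging is an assumption made here for illustration, not a step from this notebook, and the helper names are hypothetical.

```python
from statistics import mean

def parse_scores(score_string):
    """Turn '0.4215 -0.4215' back into [0.4215, -0.4215]."""
    # pandas reads an empty CSV field back as NaN (a float), so guard against non-strings
    if not isinstance(score_string, str):
        return []
    return [float(part) for part in score_string.split()]

def comment_level_score(score_string):
    """Average the word scores; return None when no word carried sentiment."""
    scores = parse_scores(score_string)
    return mean(scores) if scores else None

example = "0.4215 -0.4215 -0.5859"
print(parse_scores(example))
print(comment_level_score(example))
```

With a DataFrame, the same helper could be applied with `df['sentiment_scores'].apply(comment_level_score)`.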

2. AFINN

AFINN is a sentiment lexicon, compiled by Finn Årup Nielsen, that assigns a pre-computed sentiment score to individual words. Each word has an integer score from -5 (most negative) to +5 (most positive), representing its sentiment intensity. An overall score for a comment can be obtained by summing the scores of its words; the cells below record the nonzero score of every word in each comment that appears in the lexicon.
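The summation idea can be sketched with a tiny in-memory lexicon. The entries in `toy_lexicon` and the function `afinn_style_score` are hypothetical and chosen only for illustration; the real scores come from the AFINN files.

```python
# Hypothetical mini-lexicon in the AFINN style (word -> integer score)
toy_lexicon = {"good": 3, "bad": -3, "outstanding": 5, "loss": -3}

def afinn_style_score(text, lexicon):
    """Sum the lexicon scores of the words in text; unknown words count as 0."""
    return sum(lexicon.get(word, 0) for word in text.lower().split())

print(afinn_style_score("Outstanding quarter", toy_lexicon))
print(afinn_style_score("Good earnings but bad guidance", toy_lexicon))
```

In the second call the positive and negative words cancel out, which is one reason to keep the individual word scores as well as the sum.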

Download the three lexicon files from the AFINN repository (for example, https://github.com/fnielsen/afinn/blob/master/afinn/data/AFINN-111.txt):

AFINN-111.txt
AFINN-96.txt
AFINN-en-165.txt

In [12]:
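# Read an AFINN lexicon file (one "word<TAB>score" entry per line) into a Python dictionary

def load_sentiment_lexicon(filename):
    sentiment_scores = {}
    # Read as UTF-8 explicitly so the result does not depend on the platform default
    with open(filename, 'r', encoding='utf-8') as file:
        for line in file:
            word, score = line.strip().split('\t')
            sentiment_scores[word] = int(score)
    return sentiment_scores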

afinn_111_lexicon = load_sentiment_lexicon('AFINN-111.txt')
afinn_96_lexicon = load_sentiment_lexicon('AFINN-96.txt')
afinn_en_lexicon = load_sentiment_lexicon('AFINN-en-165.txt')

# Merge the three lexicons; for duplicate words, entries from later files override earlier ones
combined_lexicon = {**afinn_111_lexicon, **afinn_96_lexicon, **afinn_en_lexicon}
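The loading and merging steps can be tried without the files by parsing in-memory lines. This is a minimal sketch: `parse_lexicon_lines` is a hypothetical variant of the loader that also skips blank lines, and the entries are made up.

```python
def parse_lexicon_lines(lines):
    """Parse 'word<TAB>score' lines into a dict, skipping blank lines."""
    scores = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last tab so the score is always the final field
        word, score = line.rsplit('\t', 1)
        scores[word] = int(score)
    return scores

# Made-up entries standing in for an older and a newer lexicon file
older = parse_lexicon_lines(["bullish\t1", "crash\t-2"])
newer = parse_lexicon_lines(["crash\t-3", "moon\t2", ""])

# With dict unpacking, the right-most dict wins for duplicate keys
merged = {**older, **newer}
print(merged)
```

Here `crash` ends up with the score from `newer`, so the order of the dictionaries in the merge decides which lexicon version takes precedence.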

In [13]:
# Look up a token's AFINN score; return None for unknown words and for zero scores
# (this redefines the VADER-based get_sentiment_score from section 1)

def get_sentiment_score(token):
    score = combined_lexicon.get(token)
    if score is not None and score != 0:
        return score
    return None

In [14]:
# Load the data; the "cleaned_comments" column was created by spaCy text preprocessing

df = pd.read_csv("spacy_Reddit_comment_Jan-Jun23.csv")

In [15]:
# Calculate the AFINN score of each word in the comment, skipping words without a score
# (the := operator requires Python 3.8+ and avoids looking up each token twice)

df['sentiment_scores'] = df['cleaned_comments'].apply(lambda x: [score for token in x.split() if (score := get_sentiment_score(token)) is not None])

In [16]:
# Convert the list of scores into a single comma-separated string, one entry per word score

df['sentiment_scores'] = df['sentiment_scores'].apply(lambda x: ', '.join(str(score) for score in x))

In [17]:
# Drop the unneeded columns

df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], inplace=True)

In [18]:
df.head()

Unnamed: 0,index,date,comments,Open_x,High_x,Low_x,Close_x,Adj Close_x,Volume_x,percent_chnage_x,...,Close_y,Adj Close_y,Volume_y,percent_chnage_y,Jump_y,Big_Jump_y,Drop_y,Big_Drop_y,cleaned_comments,sentiment_scores
0,0,2022-07-13,This week's [Earnings Thread](https://www.redd...,3779.669922,3829.439941,3759.070068,3801.780029,3801.780029,4109390000,0.584975,...,11247.58008,11247.58008,4433060000,1.727757,0,1,0,0,week earnings threadhttpswwwredditcom r wallst...,"-3, -4, -3, 3, 1, 4, -2, 2, 4, 2, 1, -2, 2, 1,..."
1,1,2022-07-14,This week's [Earnings Thread](https://www.redd...,3763.98999,3796.409912,3721.560059,3790.379883,3790.379883,4199690000,0.701115,...,11251.19043,11251.19043,4481070000,0.896589,1,0,0,0,week earnings threadhttpswwwredditcom r wallst...,"-3, -3, -3, 2, -1, 2, -2, -2, -3, -4, 3, 3, 3,..."
2,2,2022-07-15,Cashed out up 56k today. Now cuddled up watchi...,3818.0,3863.620117,3817.179932,3863.159912,3863.159912,4143800000,1.182816,...,11452.41992,11452.41992,4369060000,0.642036,1,0,0,0,cashed 56k today cuddle watch netflix eat chur...,"3, -2, -1, -3, -2, -2, -1, -1, 1, -3, -2, 1, 3..."
3,3,2022-07-19,This week's [Earnings Thread](https://www.redd...,3860.72998,3939.810059,3860.72998,3936.689941,3936.689941,4041070000,1.967503,...,11713.15039,11713.15039,5302740000,1.720802,0,1,0,0,week earnings threadhttpswwwredditcom r wallst...,"-3, -2, -4, 2, 2, -4, -2, 2, 1, 3, -3, -2, -4,..."
4,4,2022-07-20,If 2008 was the Great Recession\n\nThen 2022 i...,3935.320068,3974.129883,3922.030029,3959.899902,3959.899902,4185300000,0.624596,...,11897.65039,11897.65039,5467080000,1.463067,0,1,0,0,2008 great recession 2022 fake recession nflx ...,"3, -2, -3, -2, -2, -3, -1, -4, -4, -3, 1, 2, 2..."


In [19]:
df.to_csv("afinn_sentiment_scores.csv")

3. BERT (Bidirectional Encoder Representations from Transformers)

BERT is a pretrained transformer language model developed by Google. Unlike the lexicon-based VADER and AFINN methods, BERT is a deep learning model that can capture complex linguistic patterns and contextual information. In this project, we use a BERT model already fine-tuned for sentiment analysis (nlptown/bert-base-multilingual-uncased-sentiment, which predicts a rating from 1 to 5 stars) to predict the sentiment of each comment.

In [None]:
!pip install --upgrade transformers bert-for-tf2

In [21]:
import pandas as pd
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

In [20]:
# Loading the Original Data

df = pd.read_csv("consol_Reddit_comment_Jan-Jun23.csv")

In [22]:
# Instantiate the Model and Tokenizer

tokenizer = BertTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = TFBertForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

Metal device set to: Apple M1 Pro

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB



2023-07-24 14:37:18.830164: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-07-24 14:37:18.830312: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
All model checkpoint layers were used when initializing TFBertForSequenceClassification.

All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at nlptown/bert-base-multilingual-uncased-sentiment.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [23]:
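# Define the Sentiment Analysis Function

def sentiment_score(review):
    # Encode the text, truncating to BERT's 512-token input limit
    tokens = tokenizer.encode(review, return_tensors='tf', max_length=512, truncation=True)
    result = model(tokens)
    # Return the index of the highest-scoring class (0-4, corresponding to 1-5 stars for this model)
    return int(tf.argmax(result.logits, axis=1)[0])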
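The function above returns only the index of the largest logit. As a rough sketch of what that index means, the example below turns a vector of five logits into probabilities with a softmax and reads off both the most likely star rating (index + 1) and a probability-weighted average rating. The logit values and helper names are hypothetical; real logits come from the model.

```python
import math

def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def stars_from_logits(logits):
    """Return (most likely star rating, expected star rating) for 5 class logits."""
    probs = softmax(logits)
    best_index = max(range(len(probs)), key=probs.__getitem__)
    expected = sum((i + 1) * p for i, p in enumerate(probs))
    return best_index + 1, expected

# Hypothetical logits for the classes "1 star" .. "5 stars"
hypothetical_logits = [-1.2, -0.3, 0.4, 2.1, 1.0]
top_star, expected_star = stars_from_logits(hypothetical_logits)
print(top_star, round(expected_star, 2))
```

The expected rating keeps some of the information that the argmax discards, for example when two neighboring classes are almost equally likely.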

In [24]:
# Split comments into sentences on '.' and apply 'sentiment_score' to each non-empty piece

df['sentiment_scores'] = df['comments'].str.split('.').apply(lambda sentences: [sentiment_score(sentence) for sentence in sentences if sentence.strip()])
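Splitting on every period also cuts decimals and URLs apart and leaves empty pieces behind. The sketch below contrasts that with a simple regular-expression splitter that only breaks after sentence-ending punctuation followed by whitespace. It is an illustrative alternative under that assumption, not the method used in this notebook, and `split_sentences` is a hypothetical helper.

```python
import re

def naive_split(text):
    """Split on every period, as str.split('.') does."""
    return text.split('.')

def split_sentences(text):
    """Split after '.', '!' or '?' only when followed by whitespace; drop empty pieces."""
    pieces = re.split(r'(?<=[.!?])\s+', text.strip())
    return [piece for piece in pieces if piece]

sample = "Up 3.5% today. Nice."
print(naive_split(sample))      # the decimal is cut in two and an empty piece remains
print(split_sentences(sample))  # two sentences, decimal intact
```

A dedicated sentence tokenizer would handle abbreviations and other edge cases better than this pattern.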

In [None]:
# Save the DataFrame with sentiment scores to a new CSV file

df.to_csv('bert_sentiment_analysis.csv', index=False)