### Comparing different Natural Language Processing techniques

In [2]:
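import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Show which Python interpreter the kernel is using
print(sys.executable)

# Load the AAPL financial news dataset used in the cells below
news_df = pd.read_csv("../data/financial_news/stock_news_api/financial_news_data_stocknewsapi_AAPL.csv")
news_df.head()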

ModuleNotFoundError: No module named 'numpy'

In [None]:
plt.figure(figsize=(8, 4))
plt.hist(news_df["sentiment"])
plt.show()

### Text preprocessing

First, we preprocess the text using the following steps:

1. Lowercasing: convert the text to lowercase so that differently cased forms of a word are not counted separately.
2. Punctuation removal: strip punctuation, as it does not contribute to this analysis.
3. Tokenization: convert the article text into an array of words/tokens.
4. Stopword removal: drop words that carry no sentiment, e.g. is, be, was.
5. Lemmatization or stemming: reduce each word to its root form, e.g. running->run, better->good.
6. Add new columns to the dataset containing the processed text.

In [None]:
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Uncomment on first run to download the required NLTK resources
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')

# Build these once instead of on every call (or for every word)
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_text(text):
    # 1. Lowercase
    text = text.lower()

    # 2. Remove punctuation
    text = text.translate(PUNCTUATION_TABLE)

    # 3. Tokenize
    tokens = word_tokenize(text)

    # 4. Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # 5. Lemmatize. Note: lemmatize() treats every word as a noun unless a
    #    part-of-speech tag is passed, so verbs such as "running" are left as-is.
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]

    return " ".join(tokens)


# 6. Store the processed text in new columns
news_df["processed_text"] = news_df["text"].apply(preprocess_text)
news_df["processed_title"] = news_df["title"].apply(preprocess_text)

print(news_df[['text', 'processed_text']].head())

The next step is to convert the processed text into a numerical representation so that machine learning models can be applied to it. A popular technique for this is the bag-of-words model, implemented in scikit-learn by CountVectorizer, which represents each document (here, each news article) as a vector of word counts.
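
To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words matrix. The three toy documents and the helper name `to_count_vector` are made up for illustration; this is not how CountVectorizer is implemented, and CountVectorizer's default tokenizer additionally ignores single-character tokens.

```python
from collections import Counter

# Hypothetical toy documents standing in for preprocessed articles.
docs = [
    "apple stock rise strong earnings",
    "apple stock fall weak earnings",
    "investor buy apple stock stock",
]

# Vocabulary: every unique token across all documents, sorted alphabetically.
vocabulary = sorted({token for doc in docs for token in doc.split()})


def to_count_vector(doc, vocabulary):
    """Return the word counts of `doc`, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocabulary]


# One row per document, one column per vocabulary word.
bow_matrix = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Word order is discarded: only how often each vocabulary word occurs in a document survives, which is why the third toy document gets a 2 in the "stock" column.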

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(news_df["processed_text"])

print(vectorizer.get_feature_names_out())

df_bow = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df_bow.head()


Next, we split the dataset into training and test sets so that the model can be trained on one subset and evaluated on the other.
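
As a rough illustration of what the `test_size` and `random_state` arguments in the cell below control, the following sketch shuffles row indices with a fixed seed and holds out a fraction of them. The helper `split_indices` is hypothetical and is not scikit-learn's implementation, so it will not reproduce the exact rows that `train_test_split` selects.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices reproducibly and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed gives the same shuffle
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, so differences between models can be attributed to the models rather than to a different partition of the data.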

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_bow, news_df["sentiment"], test_size=0.2, random_state=42)
X_train.head()

The bag-of-words matrix has 8715 columns, one per unique word. Each row is a vector representing one article and gives the frequency of each unique word in that article.

https://medium.com/@b.terryjack/nlp-pre-trained-sentiment-analysis-1eb52a9d742c

https://medium.com/@sharma.tanish096/sentiment-analysis-using-pre-trained-models-and-transformer-28e9b9486641

Now that the data is processed and vectorized (bag-of-words representation), we can start implementing sentiment analysis techniques/models and compare their performance. The proposed techniques for this research are:

Machine Learning Models
- Logistic Regression
- Support Vector Machines (SVM)
- Naive Bayes
- RNN-LSTM

Pre-trained State-of-the-art Models
- **BERT** (Bidirectional Encoder Representations from Transformers): developed by Google.
- **FinBERT**: a pre-trained sentiment analysis model tailored for the financial domain.
- **VADER** (Valence Aware Dictionary and sEntiment Reasoner): uses a bag-of-words approach with a table of positive and negative words; focused on social media sentiment.
    - Advantage: heuristics adjust the intensity of the sentiment for words like "really", "so", "a bit", and "not". It returns polarity scores for positive, negative, and neutral sentiment.
    - Disadvantage: out-of-vocabulary (OOV) words that are not in its table cannot be interpreted.
- **TextBlob**: a bag-of-words classifier.
    - Advantage: subjectivity analysis (how factual or opinionated a piece of text is); it returns a tuple of polarity and subjectivity.
    - Disadvantage: no heuristics, so it won't evaluate the intensity of sentiment or negate a sentence.
- **Flair**: a character-level LSTM neural network based on other state-of-the-art models.
    - Advantage: takes sequences of letters and words into account when predicting, and handles negations as well as intensifiers. Moreover, it can predict a sentiment for OOV words it hasn't seen before, such as typos.
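
The lexicon-plus-heuristics idea behind tools such as VADER can be sketched with a toy scorer. Everything below (the mini-lexicon, the modifier weights, the two-token look-back window, and the function name `toy_lexicon_score`) is invented for illustration and is far simpler than VADER's actual rules and lexicon.

```python
# Hypothetical mini-lexicon and modifier tables, invented for this illustration.
LEXICON = {"good": 1.0, "gain": 1.0, "strong": 1.0,
           "bad": -1.0, "loss": -1.0, "weak": -1.0}
INTENSIFIERS = {"really": 1.5, "very": 1.5, "slightly": 0.5}
NEGATIONS = {"not", "no", "never"}


def toy_lexicon_score(text):
    """Sum word valences, scaled by nearby intensifiers and flipped by negations."""
    tokens = text.lower().split()
    score = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue  # out-of-vocabulary words contribute nothing
        valence = LEXICON[token]
        previous = tokens[max(0, i - 2):i]  # look back up to two tokens
        for word in previous:
            valence *= INTENSIFIERS.get(word, 1.0)
        if any(word in NEGATIONS for word in previous):
            valence = -valence
        score += valence
    return score


for sentence in ["good", "really good", "not good", "stonks"]:
    print(sentence, toy_lexicon_score(sentence))
```

The last sentence shows the OOV limitation: a word missing from the lexicon scores zero no matter how strongly a human reader would interpret it. It also hints at why stripping stopwords such as "not" before a lexicon-based scorer can remove exactly the cues these heuristics rely on.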


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


model = LogisticRegression(max_iter=1000, random_state=42)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
classification_report_result = classification_report(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(classification_report_result)
print("Confusion Matrix:")
print(confusion_mat)
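
To clarify what the printed metrics mean, here is a small pure-Python sketch that computes accuracy, a confusion matrix, and per-class precision and recall by hand. The label strings and the six example predictions are hypothetical and do not come from the dataset; the row/column layout follows the scikit-learn convention of true labels as rows and predicted labels as columns.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, in `labels` order."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, treating `label` as the positive class."""
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical labels and predictions, for illustration only.
labels = ["Negative", "Neutral", "Positive"]
y_true_demo = ["Positive", "Positive", "Negative", "Neutral", "Negative", "Positive"]
y_pred_demo = ["Positive", "Neutral", "Negative", "Neutral", "Positive", "Positive"]

print(accuracy(y_true_demo, y_pred_demo))
print(confusion_counts(y_true_demo, y_pred_demo, labels))
print(precision_recall(y_true_demo, y_pred_demo, "Positive"))
```

Accuracy alone can hide class imbalance, which is why the per-class precision and recall in the classification report are worth reading alongside it.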


In [None]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
sid.polarity_scores("happy sad good bad down loss")

In [None]:
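def vader_sentiment_analyzer(text):
    scores = sid.polarity_scores(text)
    # Compare only the class proportions; "compound" is an aggregate score on a
    # different scale (-1 to 1) and must not compete with neg/neu/pos.
    return max(("neg", "neu", "pos"), key=scores.get)

news_df["VADER"] = news_df["processed_text"].apply(vader_sentiment_analyzer)
news_df["sentiment"].value_counts()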


In [None]:
news_df["VADER"].value_counts()
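
An alternative to picking the largest score is to threshold the `compound` value that `polarity_scores` returns alongside `neg`, `neu`, and `pos`. The sketch below uses the commonly cited cut-offs of +0.05 and -0.05; the function name `label_from_compound`, the label strings, and the example score dictionaries are assumptions and may need adapting to the values in the `sentiment` column.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER-style score dict to a label using the compound score."""
    compound = scores["compound"]
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Hypothetical score dicts in the shape returned by polarity_scores().
examples = [
    {"neg": 0.0, "neu": 0.6, "pos": 0.4, "compound": 0.7},
    {"neg": 0.3, "neu": 0.7, "pos": 0.0, "compound": -0.5},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]

for scores in examples:
    print(label_from_compound(scores))
```

In the first two examples `neu` is the largest proportion even though the text leans clearly one way, so a rule based on the largest proportion would label them neutral while the compound threshold does not.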