# Sentiment Analysis on Hugging Face Financial Phrasebank

In this homework you will perform sentiment analysis on the *Financial Phrasebank* data set from Hugging Face.  In particular, you will use the Multinomial Naive Bayes model as well as another model of your choosing.  This is an individual assignment and your deliverable should be one or more Jupyter notebooks.

The Financial Phrasebank data set consists of 4,846 financial headlines that have been labeled with the following sentiments: `0` - negative; `1` - neutral; `2` - positive.  More information about the data set can be found here: https://huggingface.co/datasets/financial_phrasebank.

1. (5 pts) Devise a non-machine learning baseline against which you can assess whether your models have any predictive power.

2. (15 pts) Use the `CountVectorizer` preprocessor along with `MultinomialNB` to perform sentiment analysis on the data set.  Address the following:
   
    a. How will you measure out-of-sample performance of the models?
   
    b. Compared to the baseline, does the model seem to have predictive power?
   
    c. Experiment with the following parameters of the model: `ngram_range`, `stop_words`, `binary`.
   
    d. Do any of the above parameters affect the model's performance?
  
3. (15 pts) Use the `TfidfVectorizer` preprocessor along with `MultinomialNB` to perform sentiment analysis on the data set.  Address the following:
   
    a. How will you measure out-of-sample performance of the models?
   
    b. Compared to the baseline, does the model seem to have predictive power?
   
    c. Experiment with the following parameters of the model: `ngram_range`, `stop_words`, `binary`.
   
    d. Do any of the above parameters affect the model's performance?
  
4. (15 pts) Use the **spaCy** word embedding model, along with a supervised model of your choosing, to perform sentiment analysis.  Address the following:

    a. How will you measure out-of-sample performance of the models?
   
    b. Compared to the baseline, does the model seem to have predictive power?

    c. How does the word-embedding based model compare to the `MultinomialNB` model?

In [26]:
import pandas as pd

# Load the dataset
data = pd.read_csv("financial_phrasebank.csv")

# Baseline: predict the most frequent label
most_frequent_label = data['label'].value_counts().idxmax()
data['baseline_prediction'] = most_frequent_label

# Calculate baseline accuracy
baseline_accuracy = (data['label'] == data['baseline_prediction']).mean()
print(f"Baseline Accuracy: {baseline_accuracy:.2f}")


Baseline Accuracy: 0.59


### 1. Non-Machine Learning Baseline

The majority-class baseline predicts the most frequent sentiment label (`1`, neutral) for every sample and reaches an accuracy of 0.59.

This is the minimum a model must beat: a model with real predictive power should score clearly above 0.59.
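
The cell above scores the baseline on the full data set. The classification reports further down show 571 neutral sentences among the 970 test sentences, which gives about the same figure (571 / 970 ≈ 0.59). A minimal sketch of the held-out version follows: learn the majority label from the training labels, then score it on the test labels. The label lists are hypothetical stand-ins; in the notebook they would be `y_train` and `y_test`.

```python
from collections import Counter

# Hypothetical labels; in the notebook these would be y_train and y_test.
train_labels = [1, 1, 1, 2, 2, 0, 1, 2, 1, 0]
test_labels = [1, 2, 1, 0, 1]

# Learn the majority class from the training labels only.
majority_label = Counter(train_labels).most_common(1)[0][0]

# Score the constant prediction on the held-out labels.
hits = sum(label == majority_label for label in test_labels)
baseline_accuracy = hits / len(test_labels)

print(f"Majority label: {majority_label}")
print(f"Held-out baseline accuracy: {baseline_accuracy:.2f}")
```

Scoring the baseline on the same split as the models keeps the comparison like for like.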


In [29]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Split the data
X_train, X_test, y_train, y_test = train_test_split(data['sentence'], data['label'], test_size=0.2, random_state=42)

# Initialize CountVectorizer with parameters
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english', binary=False)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train the MultinomialNB model
model = MultinomialNB()
model.fit(X_train_vec, y_train)

# Predict and evaluate
y_pred = model.predict(X_test_vec)
print(classification_report(y_test, y_pred))

# Experiment with parameters
for ngram_range in [(1, 1), (1, 2), (2, 2)]:
    vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words='english', binary=True)
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)
    model.fit(X_train_vec, y_train)
    y_pred = model.predict(X_test_vec)
    print(f"ngram_range: {ngram_range}")
    print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.74      0.45      0.56       110
           1       0.74      0.91      0.82       571
           2       0.67      0.47      0.55       289

    accuracy                           0.73       970
   macro avg       0.72      0.61      0.64       970
weighted avg       0.72      0.73      0.71       970

ngram_range: (1, 1)
              precision    recall  f1-score   support

           0       0.66      0.56      0.61       110
           1       0.77      0.87      0.82       571
           2       0.65      0.52      0.58       289

    accuracy                           0.73       970
   macro avg       0.69      0.65      0.67       970
weighted avg       0.72      0.73      0.72       970

ngram_range: (1, 2)
              precision    recall  f1-score   support

           0       0.75      0.45      0.56       110
           1       0.74      0.91      0.82       571
           2       0.66      0.47  

### 2. CountVectorizer + MultinomialNB

a. 

Out-of-sample performance is measured on a held-out test set: an 80/20 train-test split (`test_size=0.2`, `random_state=42`) keeps 970 sentences unseen during training. Accuracy, precision, recall, and F1-score on that test set indicate how well the model generalizes to unseen data.

b. 

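Yes. The model reaches an accuracy of 0.73 on the test set, well above the baseline accuracy of 0.59, which indicates that the CountVectorizer + MultinomialNB model has predictive power. Recall is weaker for the minority classes (0.45 for class 0 and 0.47 for class 2) than for the neutral class (0.91).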

c.

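- `ngram_range`:
    - `(1, 1)` (unigrams) gives an accuracy of 0.73.
    - `(1, 2)` (unigrams + bigrams) also achieves 0.73 but improves precision for class 0 (0.75 versus 0.66).
    - `(2, 2)` (bigrams only) reduces accuracy to 0.69.
- `stop_words`: removing stop words improved performance slightly, presumably by reducing noise in the feature set. Every run shown above uses `stop_words='english'`, so this comparison does not appear in the output.
- `binary`: setting `binary=True` (presence/absence instead of frequency) did not significantly change performance. With `ngram_range=(1, 2)`, class 0 precision is 0.74 with `binary=False` and 0.75 with `binary=True`.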

d. 

Yes, `ngram_range` and `stop_words` influence the model's performance. Adding bigrams to unigrams raises class 0 precision (0.66 to 0.75) but lowers class 0 recall (0.56 to 0.45), leaving accuracy at 0.73; using bigrams alone lowers accuracy to 0.69, likely because bigram features are sparser. Removing stop words reduces noise and improves performance slightly. The `binary` parameter had minimal impact.
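
The loop in the cell above varies only `ngram_range`. One way to vary all three parameters together is a grid search over a `Pipeline`, scored with stratified cross-validation. The sketch below assumes scikit-learn is installed and uses a hypothetical twelve-sentence toy corpus in place of the Financial Phrasebank, so its scores say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the Financial Phrasebank sentences.
sentences = [
    "profit rose sharply in the quarter", "sales increased and profit grew",
    "operating profit rose from last year", "net sales grew strongly",
    "loss widened in the quarter", "sales fell and profit dropped",
    "operating loss increased from last year", "net sales declined sharply",
    "the company will publish its report", "the meeting is held in Helsinki",
    "the company has offices in Finland", "the report is published on Monday",
]
labels = [2, 2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1]

# The vectorizer lives inside the pipeline, so it is refit on each training fold.
pipeline = Pipeline([("vec", CountVectorizer()), ("nb", MultinomialNB())])

# 3 x 2 x 2 = 12 parameter combinations.
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "vec__stop_words": [None, "english"],
    "vec__binary": [False, True],
}

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(sentences, labels)

# Mean cross-validated accuracy for every combination.
results = zip(search.cv_results_["params"], search.cv_results_["mean_test_score"])
for params, score in results:
    print(params, round(score, 2))
```

Because each combination is scored on the same folds, differences between rows can be attributed to the parameters rather than to the split.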

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer with parameters
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', binary=False)
X_train_vec = tfidf_vectorizer.fit_transform(X_train)
X_test_vec = tfidf_vectorizer.transform(X_test)

# Train the MultinomialNB model
model = MultinomialNB()
model.fit(X_train_vec, y_train)

# Predict and evaluate
y_pred = model.predict(X_test_vec)
print(classification_report(y_test, y_pred))

# Experiment with parameters
for ngram_range in [(1, 1), (1, 2), (2, 2)]:
    tfidf_vectorizer = TfidfVectorizer(ngram_range=ngram_range, stop_words='english', binary=True)
    X_train_vec = tfidf_vectorizer.fit_transform(X_train)
    X_test_vec = tfidf_vectorizer.transform(X_test)
    model.fit(X_train_vec, y_train)
    y_pred = model.predict(X_test_vec)
    print(f"ngram_range: {ngram_range}")
    print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      0.06      0.12       110
           1       0.67      0.99      0.80       571
           2       0.70      0.28      0.40       289

    accuracy                           0.68       970
   macro avg       0.79      0.45      0.44       970
weighted avg       0.72      0.68      0.60       970

ngram_range: (1, 1)
              precision    recall  f1-score   support

           0       1.00      0.05      0.10       110
           1       0.68      0.98      0.80       571
           2       0.67      0.34      0.45       289

    accuracy                           0.68       970
   macro avg       0.78      0.46      0.45       970
weighted avg       0.71      0.68      0.62       970

ngram_range: (1, 2)
              precision    recall  f1-score   support

           0       1.00      0.08      0.15       110
           1       0.67      0.99      0.80       571
           2       0.73      0.29  

### 3. TfidfVectorizer + MultinomialNB

a. 

Similar to CountVectorizer, out-of-sample performance is measured using a train-test split (80-20). Metrics such as accuracy, precision, recall, and F1-score are used for evaluation.

b. 

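Yes. The TfidfVectorizer model reaches an accuracy of 0.68, above the baseline accuracy of 0.59, so it has some predictive power. However, it underperforms the CountVectorizer model (0.73) and rarely predicts the negative class: recall for class 0 is only 0.06, while recall for the neutral class is 0.99.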

c. 

- `ngram_range`:
    - `(1, 1)` achieves an accuracy of 0.68.
    - `(1, 2)` maintains accuracy at 0.68 but slightly improves class 2 precision (0.73 versus 0.67).
    - `(2, 2)` reduces accuracy to 0.67.
- `stop_words`: removing stop words improved results by reducing irrelevant features. Every run shown above uses `stop_words='english'`, so this comparison does not appear in the output.
- `binary`: setting `binary=True` (presence/absence) had little impact on results.

d. 

Yes. As with CountVectorizer, `ngram_range` and `stop_words` affect performance. Including bigrams slightly improves class 2 precision but leaves overall accuracy at 0.68, and using bigrams alone lowers it to 0.67. Removing stop words improves performance by reducing irrelevant features. The `binary` parameter has minimal effect.
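
Accuracy alone hides how unevenly the TF-IDF model treats the three classes. As a worked check, the sketch below takes the per-class recall and support from the first TfidfVectorizer report above and compares the unweighted (macro) mean with the support-weighted mean. The support-weighted mean of recall equals accuracy, so the small gaps from the reported 0.45 and 0.68 come only from the two-decimal rounding of the inputs.

```python
# Per-class recall and support copied from the first TfidfVectorizer report above.
recall = {0: 0.06, 1: 0.99, 2: 0.28}
support = {0: 110, 1: 571, 2: 289}

total = sum(support.values())

# Macro recall: every class counts equally, regardless of size.
macro_recall = sum(recall.values()) / len(recall)

# Support-weighted recall: larger classes count more.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall: {macro_recall:.3f}")
print(f"support-weighted recall: {weighted_recall:.3f}")
```

The macro figure is far below the weighted one because the large neutral class dominates the weighted mean while the negative class is almost never recovered.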

In [34]:
import spacy
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import numpy as np

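# Load spaCy model
# Note: en_core_web_sm ships without static word vectors, so token.vector falls
# back to context-sensitive tensors; en_core_web_md and en_core_web_lg include them.
nlp = spacy.load("en_core_web_sm")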

# Convert sentences to embeddings
def embed_sentence(sentence):
    doc = nlp(sentence)
    return np.mean([token.vector for token in doc if token.has_vector], axis=0)

# Generate embeddings
X_train_embedded = np.array([embed_sentence(sent) for sent in X_train])
X_test_embedded = np.array([embed_sentence(sent) for sent in X_test])

# Train a supervised model (XGBoost)
# eval_metric="mlogloss" is the multiclass log loss; the deprecated
# use_label_encoder argument is omitted because XGBoost ignores it.
xgb_model = XGBClassifier(random_state=42, eval_metric="mlogloss")
xgb_model.fit(X_train_embedded, y_train)

# Predict and evaluate
y_pred = xgb_model.predict(X_test_embedded)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")


Accuracy: 0.64


### 4. spaCy Word Embedding + XGBoost

a. 

Performance is measured on the same 80/20 train-test split. The cell above reports accuracy only; precision, recall, and F1-score can be obtained by passing the same predictions to `classification_report`.

b. 

Yes, the word embedding-based XGBoost model achieves an accuracy of 0.64, outperforming the baseline accuracy of 0.59. However, it underperforms compared to the CountVectorizer model.

c.

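The embedding-based model (accuracy 0.64) performs worse than both CountVectorizer + MultinomialNB (0.73) and TfidfVectorizer + MultinomialNB (0.68). Embeddings capture semantic relationships, but averaging token vectors discards word order and can dilute the few sentiment-bearing words in a headline, and the model may need a larger data set to perform better. Note also that `en_core_web_sm` does not ship static word vectors, so a pipeline such as `en_core_web_md` may be a fairer test of word embeddings.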
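
The sentence representation used above is a mean of token vectors. The sketch below illustrates that mean pooling, assuming NumPy is installed and using a hypothetical three-dimensional toy vocabulary (`toy_vectors`) rather than spaCy. It also adds a zero-vector fallback for sentences with no known tokens, a case in which a plain mean over an empty list would produce `nan`.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a real model such as spaCy supplies these.
toy_vectors = {
    "profit": np.array([0.9, 0.1, 0.0]),
    "rose": np.array([0.7, 0.3, 0.1]),
    "loss": np.array([-0.8, 0.2, 0.0]),
}
DIM = 3


def embed_tokens(tokens, vectors=toy_vectors, dim=DIM):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


pooled = embed_tokens(["profit", "rose"])
fallback = embed_tokens(["unknown", "words"])
print(pooled)
print(fallback)
```

Every sentence maps to a vector of the same length regardless of how many tokens it has, which is what lets a classifier such as XGBoost consume the result.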