
# Financial News Sentiment Classification — End‑to‑End Project

**Author:** _Fill your name_  
**Last updated:** 2025-11-12 10:16 UTC

This notebook walks through an end-to-end workflow for classifying the sentiment (positive / neutral / negative) of financial news headlines and sentences.

**What you'll find:**
1. **Data Collection**: Load Financial PhraseBank and optionally **scrape recent headlines** (Yahoo Finance / EastMoney).
2. **Cleaning & Preprocessing**: lowercase, punctuation removal, stopword removal, lemmatization.
3. **Exploratory Data Analysis (EDA)**: class balance, word frequency, word cloud, key terms.
4. **Modeling** (3+ models):
   - **Multinomial Naive Bayes** (TF‑IDF features)
   - **BiLSTM** (randomly initialized embeddings by default; pretrained embeddings are optional)
   - **Transformer fine‑tuning** (FinBERT or DistilBERT baseline)
5. **Evaluation**: accuracy, precision/recall/F1, confusion matrix; compare models.
6. **Application**: apply the best model to **unseen** scraped/manual headlines and interpret predictions.

> **Scoring focus**: clarity and depth of data prep, EDA, and model selection rationale. Model performance supports your narrative.



## 0. Setup
Install packages (one-time). If you're in a restricted environment, consider running only classical models (NB+TF‑IDF) first.


In [None]:

# If running locally, uncomment as needed (may take time for Transformers).
# %pip install -q numpy pandas scikit-learn matplotlib wordcloud nltk beautifulsoup4 requests lxml
# %pip install -q tensorflow==2.* gensim
# %pip install -q transformers datasets torch --extra-index-url https://download.pytorch.org/whl/cpu


## 1. Imports & Utilities

In [None]:

import os, re, json, math, random, string, time, warnings, itertools, collections
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, precision_recall_fscore_support
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.utils.class_weight import compute_class_weight

from wordcloud import WordCloud

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

warnings.filterwarnings("ignore")

# NLTK data (first run may download; if blocked, set these manually)
try:
    nltk.data.find("corpora/stopwords")
except LookupError:
    nltk.download("stopwords", quiet=True)
try:
    nltk.data.find("corpora/wordnet")
except LookupError:
    nltk.download("wordnet", quiet=True)
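# word_tokenize needs "punkt"; newer NLTK releases look for "punkt_tab" instead,
# so check for both and download whichever is missing.
for _res, _pkg in [("tokenizers/punkt", "punkt"), ("tokenizers/punkt_tab", "punkt_tab")]:
    try:
        nltk.data.find(_res)
    except LookupError:
        nltk.download(_pkg, quiet=True)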

EN_STOP = set(stopwords.words("english"))
LEMM = WordNetLemmatizer()

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

DATA_DIR = Path("./data")
DATA_DIR.mkdir(exist_ok=True, parents=True)
OUT_DIR = Path("./outputs")
OUT_DIR.mkdir(exist_ok=True, parents=True)



## 2. Data Collection
### 2.1 Option A — Load Financial PhraseBank (recommended baseline)
Use one of:
- **Local CSV**: put `financial_phrasebank.csv` into `./data/`
- **HuggingFace**: `datasets.load_dataset("financial_phrasebank", "sentences_allagree")`  
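
If you only have the raw text release rather than a CSV, the sketch below converts it into the `text` / `label` columns that the loader cell expects. It assumes each line has the form `sentence@label` and that the file is Latin-1 encoded; verify both against your copy. The helper name `parse_phrasebank_lines`, the file name in the usage comment, and the sample sentences are illustrative, not part of the dataset.

```python
import pandas as pd

VALID_LABELS = {"positive", "neutral", "negative"}

def parse_phrasebank_lines(lines):
    """Turn 'sentence@label' lines into a DataFrame with text/label columns."""
    rows = []
    for line in lines:
        line = line.strip()
        if "@" not in line:
            continue  # skip blank or malformed lines
        # Split on the LAST '@' so sentences that contain '@' stay intact.
        text, label = line.rsplit("@", 1)
        label = label.strip().lower()
        if text.strip() and label in VALID_LABELS:
            rows.append({"text": text.strip(), "label": label})
    return pd.DataFrame(rows, columns=["text", "label"])

# Made-up sample lines in the assumed format.
sample = [
    "Company X reports record profits in Q3 .@positive",
    "Company Y is headquartered in the capital .@neutral",
    "not a valid line",
]
df_demo = parse_phrasebank_lines(sample)
print(df_demo)

# Assumed usage with a real file (the path is hypothetical):
# with open("./data/Sentences_AllAgree.txt", encoding="latin-1") as fh:
#     parse_phrasebank_lines(fh).to_csv("./data/financial_phrasebank.csv", index=False)
```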


In [None]:

USE_HF = False  # Set to True to try Hugging Face loading

df_base = None

if USE_HF:
    try:
        from datasets import load_dataset
        ds = load_dataset("financial_phrasebank", "sentences_allagree")
        df_base = ds["train"].to_pandas().rename(columns={"sentence": "text", "label": "label_id"})
        id2label = {0: "negative", 1: "neutral", 2: "positive"}
        df_base["label"] = df_base["label_id"].map(id2label)
    except Exception as e:
        print("HF loading failed, falling back to local CSV:", e)

if df_base is None:
    local_csv = DATA_DIR / "financial_phrasebank.csv"
    if local_csv.exists():
        df_base = pd.read_csv(local_csv)
        assert {"text","label"}.issubset(df_base.columns)
    else:
        df_base = pd.DataFrame({
            "text":[
                "Company X reports record profits in Q3.",
                "Regulators fine Company Y for accounting violations.",
                "Market remains unchanged amid mixed economic data."
            ],
            "label":["positive","negative","neutral"]
        })
        print("WARNING: No dataset found. Using a tiny placeholder. Please add your dataset to ./data.")

df_base = df_base.dropna(subset=["text","label"]).reset_index(drop=True)
df_base.head()



### 2.2 Option B — Build Your Own Dataset (Scrape recent headlines)
Best‑effort scrapers for Yahoo Finance ticker pages and a minimal EastMoney keyword demo. These may break if sites change.


In [None]:

import requests
from bs4 import BeautifulSoup

def scrape_yahoo_ticker_news(ticker="AAPL", max_items=50):
    url = f"https://finance.yahoo.com/quote/{ticker}/news"
    headers = {"User-Agent": "Mozilla/5.0"}
    r = requests.get(url, headers=headers, timeout=15)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
    items = []
    for a in soup.select('a[href*="/news/"]'):
        title = a.get_text(strip=True)
        href = a.get("href", "")
        if title and len(title.split()) > 2 and "/news/" in href:
            if not href.startswith("http"):
                href = "https://finance.yahoo.com" + href
            items.append({"source":"yahoo", "ticker":ticker, "title":title, "url":href})
        if len(items) >= max_items:
            break
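    # Explicit columns keep drop_duplicates from raising KeyError when no headline matched.
    return pd.DataFrame(items, columns=["source", "ticker", "title", "url"]).drop_duplicates(subset=["title"])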

def scrape_eastmoney_keywords(keyword="新能源", max_pages=1):
    # The default keyword "新能源" means "new energy"; EastMoney is a Chinese-language site.
    rows = []
    headers = {"User-Agent": "Mozilla/5.0"}
    for p in range(1, max_pages+1):
        url = f"https://so.eastmoney.com/news/s?keyword={keyword}&pageindex={p}"
        try:
            r = requests.get(url, headers=headers, timeout=15)
            r.raise_for_status()
            soup = BeautifulSoup(r.text, "lxml")
            for a in soup.select("a"):
                title = a.get_text(strip=True)
                href = a.get("href","")
                if title and href and ("eastmoney" in href or "finance" in href):
                    rows.append({"source":"eastmoney","keyword":keyword,"title":title,"url":href})
        except Exception as e:
            print("EastMoney scrape error:", e)
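    # Explicit columns avoid a KeyError in drop_duplicates when no links were collected.
    df = pd.DataFrame(rows, columns=["source", "keyword", "title", "url"]).drop_duplicates(subset=["title"])
    return df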

DF_SCRAPE = pd.DataFrame()
try:
    df_y_aapl = scrape_yahoo_ticker_news("AAPL", max_items=30)
    DF_SCRAPE = pd.concat([DF_SCRAPE, df_y_aapl], ignore_index=True)
except Exception as e:
    print("Yahoo scrape skipped:", e)

if len(DF_SCRAPE):
    DF_SCRAPE.to_csv(OUT_DIR / "scraped_headlines.csv", index=False, encoding="utf-8")
DF_SCRAPE.head()



## 3. Cleaning & Preprocessing
We normalize the text (lowercasing, filtering out unusual characters), remove stopwords, and lemmatize. Transformer models skip this heavy normalization and rely on their own tokenizer, so we keep both the raw `text` column and the cleaned `text_clean` column.
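
To make the raw versus clean distinction concrete, here is a minimal standard-library sketch that processes one made-up headline and keeps both versions side by side. It is a simplification, not the notebook's pipeline: `DEMO_STOP` is a tiny illustrative stopword list (the notebook uses NLTK's full English list), the regex is stricter than the one in `basic_clean`, and lemmatization is left out.

```python
import re

# Tiny illustrative stopword list; the notebook uses NLTK's full English list.
DEMO_STOP = {"the", "a", "an", "of", "in", "as", "to", "by"}

def demo_clean(text: str) -> str:
    """Lowercase, keep finance symbols ($ % -), drop stopwords; no lemmatization."""
    t = text.lower()
    # Replace every character outside the allowed set with a space.
    t = re.sub(r"[^a-z0-9\s\-\$%]", " ", t)
    tokens = [w for w in t.split() if w not in DEMO_STOP]
    return " ".join(tokens)

raw = "Shares of Company X fell 5% as the firm cut its full-year outlook."
record = {"text": raw, "text_clean": demo_clean(raw)}

print(record["text"])        # raw version, the input for Transformer tokenizers
print(record["text_clean"])  # normalized version, the input for TF-IDF and BiLSTM
# shares company x fell 5% firm cut its full-year outlook
```

Note that `5%` and `full-year` survive cleaning, which is the point of keeping `$`, `%`, and `-` in the allowed character set.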


In [None]:

def basic_clean(text: str) -> str:
    if not isinstance(text, str):
        return ""
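    t = text.lower()
    # Replace anything outside the allowed set (letters, digits, $ % - . , ! ? ') with a space.
    t = re.sub(r"[^a-z0-9\s\-\$%.,!?']", " ", t)
    # Collapse whitespace last so the substitution above does not leave double spaces.
    t = re.sub(r"\s+", " ", t)
    return t.strip()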

def tokenize_lemmatize(text: str):
    from nltk import word_tokenize
    words = word_tokenize(text)
    words = [LEMM.lemmatize(w) for w in words if w not in EN_STOP and w not in string.punctuation]
    return " ".join(words)

df = df_base.copy()
df["text_clean"] = df["text"].apply(basic_clean).apply(tokenize_lemmatize)

label_set = sorted(df["label"].unique().tolist())
label2id = {lab:i for i,lab in enumerate(label_set)}
id2label = {i:lab for lab,i in label2id.items()}
df["label_id"] = df["label"].map(label2id)

print("Labels:", label2id)
df.head()


## 4. Exploratory Data Analysis (EDA)

In [None]:

# Class balance
counts = df["label"].value_counts().sort_index()
ax = counts.plot(kind="bar", rot=0, title="Class distribution")
plt.xlabel("label")
plt.ylabel("count")
plt.show()

# Lengths
df["n_tokens"] = df["text_clean"].str.split().apply(len)
df["n_chars"] = df["text"].str.len()

plt.figure()
df["n_tokens"].hist(bins=30)
plt.title("Token count distribution")
plt.xlabel("tokens")
plt.ylabel("freq")
plt.show()

# Top words per class
from collections import Counter
def top_words(subset, k=20):
    c = Counter()
    for s in subset["text_clean"]:
        c.update(s.split())
    return pd.DataFrame(c.most_common(k), columns=["word","count"])

tops = {}
for lab in label_set:
    tops[lab] = top_words(df[df["label"]==lab], k=20)

for lab in label_set:
    print(f"Top words — {lab}")
    display(tops[lab])

# WordClouds
for lab in label_set:
    try:
        wc = WordCloud(width=800, height=400, background_color="white").generate(" ".join(df[df["label"]==lab]["text_clean"]))
        plt.figure(figsize=(10,5))
        plt.imshow(wc, interpolation="bilinear")
        plt.axis("off")
        plt.title(f"WordCloud — {lab}")
        plt.show()
    except Exception as e:
        print("WordCloud error:", e)


## 5. Train / Validation / Test Split

In [None]:

# Roughly 64% train / 16% validation / 20% test, stratified by label.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=RANDOM_SEED, stratify=df["label_id"])
train_df, val_df  = train_test_split(train_df, test_size=0.2, random_state=RANDOM_SEED, stratify=train_df["label_id"])

print(train_df.shape, val_df.shape, test_df.shape)


## 6. Model 1 — Multinomial Naive Bayes (TF‑IDF)

In [None]:

tfidf_nb = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.9)),
    ("clf", MultinomialNB(alpha=0.5))
])

tfidf_nb.fit(train_df["text_clean"], train_df["label_id"])
pred_val = tfidf_nb.predict(val_df["text_clean"])
print("Validation — accuracy:", accuracy_score(val_df["label_id"], pred_val))
print(classification_report(val_df["label_id"], pred_val, target_names=label_set))
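
The pipeline above fixes `alpha=0.5` and bigrams by hand. A small cross-validated grid search is one way to choose these values instead; the sketch below scores candidates by macro F1. The function name `tune_nb`, the grid values, and the toy corpus are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def tune_nb(texts, labels, n_splits=3, seed=42):
    """Grid-search alpha and ngram_range for a TF-IDF + NB pipeline (macro F1)."""
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
    grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__alpha": [0.1, 0.5, 1.0],
    }
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    search = GridSearchCV(pipe, grid, scoring="f1_macro", cv=cv)
    search.fit(list(texts), list(labels))
    return search

# Toy corpus (made up): three sentences per class so 3-fold stratification works.
toy_texts = [
    "profit rose sharply", "profit beat forecast", "sales rose and profit grew",
    "loss widened sharply", "loss missed forecast", "sales fell and loss grew",
    "meeting held monday", "meeting scheduled friday", "board meeting held today",
]
toy_labels = [2, 2, 2, 0, 0, 0, 1, 1, 1]

search = tune_nb(toy_texts, toy_labels)
print(search.best_params_, round(search.best_score_, 3))

# Assumed usage on the notebook's data:
# search = tune_nb(train_df["text_clean"], train_df["label_id"], n_splits=5)
```

Tuning on cross-validation folds of the training set keeps the validation and test splits untouched for the model comparison.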


## 7. Model 2 — BiLSTM (Keras/TensorFlow)

In [None]:

USE_BILSTM = True

if USE_BILSTM:
    try:
        import tensorflow as tf
        from tensorflow.keras.preprocessing.text import Tokenizer
        from tensorflow.keras.preprocessing.sequence import pad_sequences
        from tensorflow.keras.models import Sequential
        from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

        MAX_VOCAB = 20000
        MAX_LEN = 40
        tk = Tokenizer(num_words=MAX_VOCAB, oov_token="<unk>")
        tk.fit_on_texts(train_df["text_clean"].tolist())

        def to_seqs(series):
            return pad_sequences(tk.texts_to_sequences(series.tolist()), maxlen=MAX_LEN, padding="post", truncating="post")

        X_tr = to_seqs(train_df["text_clean"])
        X_va = to_seqs(val_df["text_clean"])
        y_tr = train_df["label_id"].values
        y_va = val_df["label_id"].values

        model = Sequential([
            # input_length is deprecated in recent Keras releases and is not needed here.
            Embedding(input_dim=MAX_VOCAB, output_dim=128),
            Bidirectional(LSTM(64, return_sequences=False)),
            Dropout(0.3),
            Dense(len(label_set), activation="softmax")
        ])

        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
        hist = model.fit(X_tr, y_tr, validation_data=(X_va, y_va), epochs=5, batch_size=64, verbose=1)

        bilstm_val_pred = np.argmax(model.predict(X_va), axis=1)
        print("BiLSTM Validation — accuracy:", accuracy_score(y_va, bilstm_val_pred))
        print(classification_report(y_va, bilstm_val_pred, target_names=label_set))
    except Exception as e:
        print("BiLSTM section skipped due to error:", e)
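
The `Embedding` layer above starts from random weights. If you want to try the pretrained-embedding variant, the usual pattern is to build a weight matrix whose rows follow the tokenizer's word indices. The sketch below does this with NumPy only; `build_embedding_matrix` is a hypothetical helper, and `vectors` stands in for any word-to-vector mapping (for example one loaded with gensim), which is not part of this notebook.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, dim, max_vocab, seed=42):
    """Rows follow tokenizer indices; words without a vector keep small random values."""
    rng = np.random.default_rng(seed)
    matrix = rng.normal(0.0, 0.05, size=(max_vocab, dim)).astype("float32")
    matrix[0] = 0.0  # index 0 is reserved for padding
    hits = 0
    for word, idx in word_index.items():
        if idx < max_vocab and word in vectors:
            matrix[idx] = vectors[word]
            hits += 1
    return matrix, hits

# Toy inputs (made up): a 3-word index and 4-dimensional vectors.
toy_index = {"<unk>": 1, "profit": 2, "loss": 3}
toy_vectors = {"profit": np.ones(4), "loss": -np.ones(4)}

matrix, hits = build_embedding_matrix(toy_index, toy_vectors, dim=4, max_vocab=10)
print(matrix.shape, hits)

# Assumed usage in the BiLSTM cell (tk is the fitted Keras Tokenizer, DIM the vector size):
# matrix, _ = build_embedding_matrix(tk.word_index, vectors, dim=DIM, max_vocab=MAX_VOCAB)
# Embedding(input_dim=MAX_VOCAB, output_dim=DIM,
#           embeddings_initializer=tf.keras.initializers.Constant(matrix), trainable=False)
```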


## 8. Model 3 — Transformer Fine‑Tuning (FinBERT or DistilBERT)

In [None]:

USE_TRANSFORMER = True
TRANSFORMER_MODEL = "ProsusAI/finbert"  # or "distilbert-base-uncased"

if USE_TRANSFORMER:
    try:
        from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
        import torch
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL)
        num_labels = len(label_set)

        # Use a distinct name so the Keras BiLSTM stored in `model` is not overwritten;
        # the test-set cell later calls `model.predict` for the BiLSTM.
        # The pretrained checkpoint may use a different label order than `label_set`,
        # in which case fine-tuning has to relearn the output mapping.
        tr_model = AutoModelForSequenceClassification.from_pretrained(
            TRANSFORMER_MODEL, num_labels=num_labels,
            id2label={i:lab for i,lab in enumerate(label_set)},
            label2id={lab:i for i,lab in enumerate(label_set)}
        ).to(device)

        class DS(torch.utils.data.Dataset):
            def __init__(self, df):
                self.df = df.reset_index(drop=True)
            def __len__(self): return len(self.df)
            def __getitem__(self, idx):
                row = self.df.iloc[idx]
                # Raw text goes in; the model's own tokenizer handles normalization.
                enc = tokenizer(row["text"], truncation=True, padding="max_length", max_length=64, return_tensors="pt")
                item = {k:v.squeeze(0) for k,v in enc.items()}
                item["labels"] = torch.tensor(int(row["label_id"]), dtype=torch.long)
                return item

        train_ds = DS(train_df)
        val_ds   = DS(val_df)

        args = TrainingArguments(
            output_dir="./transformer_out",
            evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
            save_strategy="epoch",
            learning_rate=2e-5,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=32,
            num_train_epochs=2,
            weight_decay=0.01,
            logging_steps=50,
            load_best_model_at_end=True,
            report_to="none"
        )

        def compute_metrics(eval_pred):
            logits, labels = eval_pred
            preds = np.argmax(logits, axis=1)
            pr, rc, f1, _ = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)
            acc = accuracy_score(labels, preds)
            return {"accuracy": acc, "precision": pr, "recall": rc, "f1": f1}

        trainer = Trainer(model=tr_model, args=args, train_dataset=train_ds, eval_dataset=val_ds, compute_metrics=compute_metrics)
        trainer.train()

        eval_res = trainer.evaluate()
        print("Transformer Validation metrics:", eval_res)

        TRANSFORMER_TRAINER = trainer
    except Exception as e:
        print("Transformer section skipped due to error:", e)


## 9. Final Evaluation on Test Set

In [None]:

results = []

# NB
try:
    nb_test_pred = tfidf_nb.predict(test_df["text_clean"])
    nb_metrics = classification_report(test_df["label_id"], nb_test_pred, target_names=label_set, output_dict=True)
    results.append(("TFIDF+NB", nb_metrics["accuracy"], nb_metrics["macro avg"]["f1-score"]))
except Exception as e:
    print("NB test eval skipped:", e)

# BiLSTM
try:
    if USE_BILSTM and 'model' in globals():
        from tensorflow.keras.preprocessing.sequence import pad_sequences
        # Use MAX_LEN from the BiLSTM cell so train and test padding cannot drift apart.
        X_te = pad_sequences(tk.texts_to_sequences(test_df["text_clean"].tolist()), maxlen=MAX_LEN, padding="post", truncating="post")
        bilstm_test_pred = np.argmax(model.predict(X_te), axis=1)
        bilstm_metrics = classification_report(test_df["label_id"], bilstm_test_pred, target_names=label_set, output_dict=True)
        results.append(("BiLSTM", bilstm_metrics["accuracy"], bilstm_metrics["macro avg"]["f1-score"]))
except Exception as e:
    print("BiLSTM test eval skipped:", e)

# Transformer
try:
    if USE_TRANSFORMER and 'TRANSFORMER_TRAINER' in globals():
        # Reuse the DS dataset class from the fine-tuning cell (same tokenization settings).
        preds = TRANSFORMER_TRAINER.predict(DS(test_df)).predictions
        tr_test_pred = np.argmax(preds, axis=1)
        tr_metrics = classification_report(test_df["label_id"], tr_test_pred, target_names=label_set, output_dict=True)
        results.append(("Transformer", tr_metrics["accuracy"], tr_metrics["macro avg"]["f1-score"]))
except Exception as e:
    print("Transformer test eval skipped:", e)

if results:
    cmp_df = pd.DataFrame(results, columns=["model","accuracy","macro_f1"]).sort_values("macro_f1", ascending=False)
    display(cmp_df)
else:
    print("No results to display. Make sure at least one model trained successfully.")


### Confusion Matrices

In [None]:

def plot_cm(y_true, y_pred, labels, title):
    cm = confusion_matrix(y_true, y_pred, labels=list(range(len(labels))))
    plt.figure()
    plt.imshow(cm, interpolation="nearest")
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(labels))
    plt.xticks(tick_marks, labels, rotation=45)
    plt.yticks(tick_marks, labels)
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, format(cm[i, j], 'd'),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.show()

# NB CM (validation)
try:
    plot_cm(val_df["label_id"], pred_val, label_set, "NB (val) Confusion Matrix")
except Exception as e:
    print("NB CM skipped:", e)

# BiLSTM CM (validation)
try:
    if 'bilstm_val_pred' in globals():
        plot_cm(val_df["label_id"], bilstm_val_pred, label_set, "BiLSTM (val) Confusion Matrix")
except Exception as e:
    print("BiLSTM CM skipped:", e)
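
Raw counts are hard to compare across classes when the class sizes differ. A row-normalized confusion matrix, sketched below with made-up labels, puts per-class recall on the diagonal. The helper name `recall_matrix` is an assumption; `normalize="true"` is the scikit-learn option that divides each row by its true-label count.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def recall_matrix(y_true, y_pred, n_classes):
    """Confusion matrix with each row divided by the number of true samples in it."""
    return confusion_matrix(
        y_true, y_pred, labels=list(range(n_classes)), normalize="true"
    )

# Made-up labels for three classes.
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]

cm_norm = recall_matrix(y_true_demo, y_pred_demo, n_classes=3)
print(np.round(cm_norm, 2))
print("per-class recall:", np.diag(cm_norm))

# Assumed usage on the notebook's validation predictions:
# recall_matrix(val_df["label_id"], pred_val, n_classes=len(label_set))
```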


## 10. Apply Best Model to New Headlines

In [None]:

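def predict_with_nb(texts):
    # NB is used here as a lightweight default; swap in whichever model ranked best.
    # The pipeline was fit on text_clean, so new texts need the same cleaning steps.
    cleaned = [tokenize_lemmatize(basic_clean(t)) for t in texts]
    probs = tfidf_nb.predict_proba(cleaned)
    preds = tfidf_nb.predict(cleaned)
    return preds, probs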

def demo_application():
    path = OUT_DIR / "scraped_headlines.csv"
    if path.exists():
        df_new = pd.read_csv(path)
        texts = df_new["title"].astype(str).tolist()
    else:
        texts = [
            "Tesla shares jump as deliveries beat expectations",
            "Federal Reserve signals rates may stay higher for longer",
            "Company Z misses revenue estimates; outlook trimmed"
        ]
        df_new = pd.DataFrame({"title": texts})
    preds, probs = predict_with_nb(texts)
    df_new["pred_id"] = preds
    df_new["pred_label"] = [id2label[i] for i in preds]
    # map probabilities to labels
    label_order = tfidf_nb.named_steps["clf"].classes_
    df_prob = pd.DataFrame(probs, columns=[id2label[i] for i in label_order])
    df_out = pd.concat([df_new, df_prob], axis=1)
    return df_out

df_app = demo_application()
df_app.head()
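
The cell above uses Naive Bayes, which returns probabilities directly. If the Transformer ranks best instead, its predictions arrive as raw logits (as with `Trainer.predict(...).predictions` in the test-set cell), so they need a softmax before they can be read as probabilities. The sketch below shows that conversion with NumPy and pandas only; `logits_to_table` and the example logits are made up, and how you obtain logits for unlabeled headlines is left open.

```python
import numpy as np
import pandas as pd

def logits_to_table(texts, logits, id2label):
    """Convert an (n, num_labels) logit array into labels plus per-class probabilities."""
    logits = np.asarray(logits, dtype=float)
    # Subtract the row maximum first for a numerically stable softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    pred_ids = probs.argmax(axis=1)
    table = pd.DataFrame(probs, columns=[id2label[i] for i in range(probs.shape[1])])
    table.insert(0, "pred_label", [id2label[int(i)] for i in pred_ids])
    table.insert(0, "title", list(texts))
    return table

# Made-up logits for two headlines; the mapping mirrors the notebook's sorted label order.
demo_id2label = {0: "negative", 1: "neutral", 2: "positive"}
demo_titles = ["Company Z misses revenue estimates", "Tesla shares jump as deliveries beat expectations"]
demo_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]

demo_table = logits_to_table(demo_titles, demo_logits, demo_id2label)
print(demo_table)
```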


## 11. Save Artifacts

In [None]:

train_df.to_csv(OUT_DIR / "train.csv", index=False)
val_df.to_csv(OUT_DIR / "val.csv", index=False)
test_df.to_csv(OUT_DIR / "test.csv", index=False)

try:
    import joblib
    joblib.dump(tfidf_nb, OUT_DIR / "tfidf_nb.joblib")
except Exception as e:
    print("Model save skipped:", e)

print("Artifacts saved to ./outputs")
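
The saved artifacts are easier to reproduce later if the package versions are recorded next to them. A minimal sketch using only the standard library follows; the helper name `collect_versions`, the package list, and the `versions.json` file name are assumptions.

```python
import json
import platform
from importlib import metadata

def collect_versions(packages):
    """Map each package name to its installed version, or 'not installed'."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions

PACKAGES = ["numpy", "pandas", "scikit-learn", "nltk", "tensorflow", "torch", "transformers"]
versions = collect_versions(PACKAGES)
print(json.dumps(versions, indent=2))

# Assumed usage alongside the other artifacts:
# (OUT_DIR / "versions.json").write_text(json.dumps(versions, indent=2))
```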



## 12. Notes on Customization & Justification

- **Preprocessing**: Light normalization preserves finance symbols ($, %, -) that carry meaning.
- **Imbalance**: Use class weights (deep models) or resampling. Track **macro F1**.
- **Model choices**:
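  - **TF‑IDF + NB**: fast to train and a strong baseline for short texts.
  - **BiLSTM**: captures some word order and negation that bag-of-words features miss.
  - **FinBERT**: pretrained on financial text; often the strongest option if compute allows. The `ProsusAI/finbert` checkpoint was itself fine-tuned on Financial PhraseBank, so its test scores on that dataset may be optimistic.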
- **Optimization**:
  - NB: tune `alpha`, `ngram_range`.
  - BiLSTM: embedding dim, LSTM units, dropout, LR, epochs.
  - Transformers: LR (1e-5 to 5e-5), batch size (8–32), epochs (2–5), max length.
- **Evaluation**:
  - Prefer **macro F1** and per-class F1; add confusion matrices.
  - Consider **time‑based split** when mixing old vs. recent news.
- **Error analysis**:
  - Inspect false positives/negatives; build keyword sanity lists (“beat”, “miss”, “downgrade”, “upgrade”).
- **Reproducibility**:
  - Fix seeds; save artifacts; export versions.
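
For the imbalance point, `compute_class_weight` is imported at the top of the notebook but never called. The sketch below shows how balanced weights could be derived and what they look like on a made-up, neutral-heavy label vector; passing the resulting dictionary to Keras `fit` via `class_weight` is one assumed way to use it.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 2 negative (0), 6 neutral (1), 2 positive (2).
y_demo = np.array([0] * 2 + [1] * 6 + [2] * 2)

classes = np.unique(y_demo)
# "balanced" computes n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
class_weight = {int(c): round(float(w), 3) for c, w in zip(classes, weights)}
print(class_weight)
# {0: 1.667, 1: 0.556, 2: 1.667}

# Assumed usage in the BiLSTM cell:
# model.fit(X_tr, y_tr, validation_data=(X_va, y_va), epochs=5, class_weight=class_weight)
```

Rare classes receive weights above 1, so their errors count more in the loss, while the majority class is down-weighted.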
