# Financial News Sentiment Analysis

This notebook performs sentiment classification (bullish, neutral, bearish) on financial news using:
1. DeepSeek-V3.2 (called through the Hugging Face `InferenceClient`) as the baseline
2. Two pre-trained transformer models for comparison
3. Fine-tuning the best model
4. Re-evaluation after fine-tuning


In [None]:
import os

In [12]:
# Install required packages
%pip install datasets transformers torch requests pandas scikit-learn tqdm -q

In [None]:
HF_TOKEN = os.getenv("HF_TOKEN")

DATASET_NAME = "ArthurMrv/EDGAR-CORPUS-Financial-Summarization-Labeled"

In [None]:
TRAIN_SPLIT = 0.8

In [22]:
from tqdm import tqdm

## 1. Load Dataset


In [15]:
# Authenticate with the Hugging Face Hub using the token read from the HF_TOKEN environment variable
from huggingface_hub import login

login(token=HF_TOKEN)

print("\nSuccessfully logged in to Hugging Face Hub!")



Successfully logged in to Hugging Face Hub!


In [None]:
from datasets import load_dataset
import pandas as pd

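# Load the dataset (regular, non-streaming mode). load_dataset returns a DatasetDict
# keyed by split, so select one split before converting; "train" is assumed here.
ds = load_dataset(DATASET_NAME, split="train")

df = ds.to_pandas()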
df.head()


Unnamed: 0,input,summary,model
0,FINANCIAL STATEMENTS AND SUPPLEMENTARY DATA IN...,Here's a summary of the financial statement:\n...,Claude
1,FINANCIAL STATEMENTS AND SUPPLEMENTARY DATA. ...,"Based on the provided excerpt, here's a summar...",Claude
2,. Report of Independent Registered Public Acco...,This appears to be a partial financial stateme...,Claude
3,Index to Consolidated Financial Statements All...,This appears to be a partial financial stateme...,Claude
4,. ACCOUNTING FIRM To the Board of Directors a...,Here's a summary of the financial statement:\n...,Claude


## 2. Setup API for Classification


In [37]:
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    api_key=HF_TOKEN,
)

In [38]:
def request_sentiment(text, prompt_template):
    prompt = prompt_template.format(input_text=text)
    completion = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3.2",
        messages=[
            {"role": "user", "content": prompt}
        ],
    )
    response = completion.choices[0].message
    # Return the raw reply, lowercased; per the prompt it holds a "reasoning:" line and a "score:" line
    return response.content.strip().lower()

def get_llm_sentiment(df, input_column, output_column, prompt_template):
    """
    Given a DataFrame, input column, and output column name,
    calls the DeepSeek LLM via Hugging Face Inference API to get the sentiment for each row.
    The result will be stored in a new column (output_column).
    """

    tqdm.pandas(desc="Classifying sentiment")
    df[output_column] = df[input_column].progress_apply(lambda x: request_sentiment(x, prompt_template))
    return df
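
Remote inference calls can fail transiently (rate limits, timeouts), and `progress_apply` would abort the whole run on the first exception. A minimal sketch of a retry wrapper with exponential backoff that could wrap `request_sentiment`; `with_retries`, `max_attempts`, `base_delay`, and the `flaky_request` stand-in are hypothetical names for illustration, not part of the notebook or of `huggingface_hub`.

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Stand-in for the API call (hypothetical): fails twice, then succeeds
calls = {"n": 0}

def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "score: 0"

# Record the delays instead of sleeping so the sketch runs instantly
delays = []
result = with_retries(flaky_request, sleep=delays.append)
print(result, delays)
```

In the notebook this might be used as `with_retries(lambda: request_sentiment(text, prompt_template))`.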

In [46]:
prompt_template = """
You are a Financial News Sentiment Classifier.
Your task is to classify the sentiment EXPLICITLY expressed in the text regarding stocks.

### Scoring Rubric:
+2 (Highly Bullish): Text explicitly states prices are soaring, skyrocketing, or reports massive success/breakthroughs.
+1 (Bullish): Text describes price increases, positive outlooks, or favorable conditions.
 0 (Neutral): Text is purely factual with no emotional charge, or describes flat price movement.
-1 (Bearish): Text describes price drops, fears, negative outlooks, or unfavorable conditions.
-2 (Highly Bearish): Text explicitly states prices are crashing, plummeting, or reports crisis/panic.

### Critical Rules:
1. **React to the text, not the market.** If the text says "stocks are down," the score MUST be negative, even if the reason is generic (like the FED).
2. Do not assume news is "priced in." Analyze the immediate emotional and factual content of the snippet.
3. Ignore external context. Only use the provided text.

### Output Format:
Reasoning: [1 sentence identifying the specific keywords or claims in the text that justify the score]
Score: [Integer between -2 and 2]

---
Article: {input_text}
"""
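# Column to classify and column that receives the raw LLM reply.
# "text" must exist in df; the EDGAR dataset loaded above exposes "input" and "summary" instead.
input_column = "text"
output_column = "llm_response"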

In [52]:
input_txt = """
Here's a summary of the financial statement: Financial Health Overview: - The company (Digerati) is operating at a loss with a working capital deficit - There are substantial concerns about the company's ability to continue operations - Management has addressed these concerns in Note 2 Revenue Streams: 1. Global VoIP Services - Provides VoIP services to U.S. and foreign telecommunications companies - Focuses on markets in Mexico, Asia, the Middle East, and Latin America 2. Cloud Communication Services - Offers hosted IP/PBX services to resellers and enterprise customers - Includes various features like call center applications, prepaid services, and customized IP/PBX features Key Expenses: - Transmission and termination charges from suppliers - Infrastructure and network costs - Internet bandwidth charges - Licensing and co-location charges - Installation costs Financial Risk Factors: - Credit risk from trade receivables - Potential exposure from bank deposits exceeding federally insured limits - Customer concentration risk (four customers comprise 20% of revenue) Revenue Recognition: - Based on evidence of arrangement, service delivery, fixed pricing, and collectability - Company acts as primary obligor with pricing authority and credit risk responsibility This statement indicates a company with established revenue streams but facing significant financial challenges and operational risks."""
request_sentiment(input_txt, prompt_template)

'reasoning: the text explicitly describes a company "operating at a loss with a working capital deficit," notes "substantial concerns about the company\'s ability to continue operations," and lists multiple "financial challenges and operational risks," which are all explicitly unfavorable conditions.  \nscore: -1'
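
The reply follows the "Reasoning / Score" format requested in the prompt, so it still has to be reduced to one of the three target classes. A minimal sketch, assuming the score always appears as `score: N` and that the five-point rubric collapses by sign; `parse_score` and `score_to_label` are hypothetical helper names, and treating an unparseable reply as neutral is an assumption.

```python
import re

def parse_score(reply):
    """Return the integer after 'score:' in the reply, or None if absent."""
    match = re.search(r"score:\s*([+-]?[0-2])\b", reply, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None

def score_to_label(score):
    """Collapse the -2..2 rubric to bullish / neutral / bearish by sign."""
    if score is None or score == 0:
        return "neutral"  # assumption: unparseable replies count as neutral
    return "bullish" if score > 0 else "bearish"

# Abbreviated version of the reply shown above
reply = 'reasoning: the text explicitly describes a company "operating at a loss" ...  \nscore: -1'
print(parse_score(reply), score_to_label(parse_score(reply)))
```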

In [49]:
df_labeled = get_llm_sentiment(df, input_column, output_column, prompt_template)
df_labeled.head()

Classifying sentiment: 100%|██████████| 10/10 [00:31<00:00,  3.18s/it]


Unnamed: 0,date,text,extra_fields,llm_response
0,2016-01-01T00:00:00Z,New Ted Cruz Super PAC with $4M ad buy\n\n(CNN...,"{""publication"":""CNN"",""author"":""Theodore Schlei...",reasoning: the text is about political campaig...
1,2016-01-01T00:00:00Z,"Write an essay, win a 100-year-old movie theat...","{""publication"":""CNN"",""author"":""Kevin Conlon"",""...",reasoning: the article discusses an essay cont...
2,2016-01-01T00:00:00Z,Putin points to NATO threat in new security st...,"{""publication"":""CNN"",""author"":""Euan McKirdy"",""...",reasoning: the text is a factual report on geo...
3,2016-01-01T00:00:00Z,Better results offset costs of prostate surger...,"{""publication"":""Reuters"",""author"":""Lisa Rapapo...",reasoning: the article discusses potential cos...
4,2016-01-01T00:00:00Z,Cruz super-PACs unveil Iowa TV ad buy | TheHil...,"{""publication"":""The Hill"",""author"":""Jesse Byrn...","reasoning: the text describes an ""uptick in sp..."


In [50]:
df_labeled['llm_response'].value_counts()

llm_response,count
"reasoning: the text is about political campaign funding and advertisements, with no mention of stock prices, market conditions, or any financial sentiment relevant to stocks. \nscore: 0",1
"reasoning: the article discusses an essay contest to win a historic movie theater, focusing on the contest details and the business's potential, with no mention of stocks, stock prices, or market sentiment. \nscore: 0",1
"reasoning: the text is a factual report on geopolitical developments and security strategy without any explicit mention of stock prices, market movements, or financial sentiment. \nscore: 0",1
"reasoning: the article discusses potential cost savings and better outcomes from prostate surgery at specialized centers, which is a positive outlook on medical outcomes but does not mention stock prices, market conditions, or investment sentiment. \nscore: 0",1
"reasoning: the text describes an ""uptick in spending,"" ""nearly $20 million"" raised, and the candidate being ""in a strong position,"" which indicates favorable conditions and a positive outlook. \nscore: 1",1
"reasoning: the text is a purely factual report on military air strikes with no mention of any stocks, financial markets, prices, or economic conditions.\nscore: 0",1
"reasoning: the text reports a factual military procurement decision without any explicit claims about stock price movement, outlook, or financial performance for the involved companies. \nscore: 0",1
"reasoning: the text reports on a prison fight in guatemala with no mention of stocks, markets, prices, or any financial sentiment whatsoever. \nscore: 0",1
"reasoning: the text discusses donald trump's political aspirations and a new year's call, with no mention of stocks, market prices, outlooks, or financial conditions.\nscore: 0",1
"reasoning: the article discusses geopolitical conflict and violence with no mention of financial markets, stock prices, or economic conditions related to stocks. \nscore: 0",1


## 3. Load Pre-trained Transformer Models


In [9]:
from transformers import pipeline

# Use pipelines as high-level helpers
pipes = {
    "nickmuchi": pipeline("text-classification", model="nickmuchi/deberta-v3-base-finetuned-finance-text-classification"),
    "mrm8488": pipeline("text-classification", model="mrm8488/deberta-v3-ft-financial-news-sentiment-analysis")
}

print("Models loaded successfully!")
print(f"nickmuchi model labels: {pipes['nickmuchi'].model.config.id2label}")
print(f"mrm8488 model labels: {pipes['mrm8488'].model.config.id2label}")




Device set to use cpu


Device set to use cpu


Models loaded successfully!
nickmuchi model labels: {0: 'bearish', 1: 'neutral', 2: 'bullish'}
mrm8488 model labels: {0: 'negative', 1: 'neutral', 2: 'positive'}


## 4. Compare Models on Sample Data


In [None]:
import re
import time

import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def normalize_prediction(pred, model_name):
    """Normalize a pipeline prediction (dict or plain label) to bullish/neutral/bearish"""
    if isinstance(pred, dict):
        label = pred.get('label', '').lower()
    else:
        label = str(pred).lower()

    # Map various label formats to our three classes
    if 'bull' in label or 'positive' in label:
        return 'bullish'
    elif 'bear' in label or 'negative' in label:
        return 'bearish'
    else:
        return 'neutral'

def classify_with_api(text):
    """Query the LLM baseline and collapse its -2..2 score to bullish/neutral/bearish"""
    # request_sentiment returns the lowercased reply: "reasoning: ... score: N"
    match = re.search(r'score:\s*([+-]?\d)', request_sentiment(text, prompt_template))
    score = int(match.group(1)) if match else 0  # unparseable replies count as neutral
    return 'bullish' if score > 0 else 'bearish' if score < 0 else 'neutral'

# Test on a sample of data (e.g., first 50 rows for comparison)
sample_size = min(50, len(df))
test_df = df.head(sample_size).copy()

if 'text' not in test_df.columns:
    print("Error: 'text' column not found!")
    print(f"Available columns: {test_df.columns.tolist()}")
else:
    print(f"Testing on {sample_size} samples...")

    # Get predictions from transformer models (inputs truncated to 512 tokens)
    print("\nGetting predictions from transformer models...")
    test_df['nickmuchi_pred'] = test_df['text'].apply(
        lambda x: normalize_prediction(pipes['nickmuchi'](x, truncation=True, max_length=512)[0], 'nickmuchi'))
    test_df['mrm8488_pred'] = test_df['text'].apply(
        lambda x: normalize_prediction(pipes['mrm8488'](x, truncation=True, max_length=512)[0], 'mrm8488'))

    # Get predictions from API (with rate limiting)
    print("\nGetting predictions from API (this may take a while)...")
    api_predictions = []
    for text in tqdm(test_df['text'], desc="API predictions"):
        api_predictions.append(classify_with_api(text))
        time.sleep(0.1)  # Rate limiting

    test_df['api_pred'] = api_predictions

    print("\nPredictions completed!")
    print("\nSample predictions:")
    print(test_df[['text', 'nickmuchi_pred', 'mrm8488_pred', 'api_pred']].head(10))


In [None]:
# Compare model predictions (API predictions serve as the reference, not as ground truth)
print("=" * 60)
print("MODEL COMPARISON (using API predictions as reference)")
print("=" * 60)

# Fix the label order so report rows and matrix axes stay stable
# even when a class is absent from the sample
class_names = ['bearish', 'neutral', 'bullish']

# Compare each model against API
for model_name in ['nickmuchi', 'mrm8488']:
    pred_col = f'{model_name}_pred'
    accuracy = accuracy_score(test_df['api_pred'], test_df[pred_col])
    print(f"\n{model_name.upper()} vs API:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"\nClassification Report:")
    print(classification_report(test_df['api_pred'], test_df[pred_col],
                              labels=class_names, target_names=class_names, zero_division=0))
    print(f"\nConfusion Matrix:")
    print(confusion_matrix(test_df['api_pred'], test_df[pred_col], labels=class_names))

# Distribution of predictions
print("\n" + "=" * 60)
print("PREDICTION DISTRIBUTIONS")
print("=" * 60)
print("\nAPI predictions:")
print(test_df['api_pred'].value_counts())
print("\nNickmuchi predictions:")
print(test_df['nickmuchi_pred'].value_counts())
print("\nMRM8488 predictions:")
print(test_df['mrm8488_pred'].value_counts())
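
Reading a confusion matrix requires knowing which class each row and column stands for. A pure-Python sketch of the same bookkeeping with an explicit label order (rows are the reference labels, columns the model's predictions); the `confusion` helper is hypothetical and the five label pairs are made-up toy values, not results from this notebook.

```python
def confusion(y_true, y_pred, labels):
    """Count (reference, predicted) pairs; rows = reference, columns = predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

labels = ["bearish", "neutral", "bullish"]
reference = ["neutral", "neutral", "bullish", "bearish", "neutral"]  # toy values
predicted = ["neutral", "bullish", "bullish", "neutral", "neutral"]  # toy values

matrix = confusion(reference, predicted, labels)
# The diagonal holds the agreements, so accuracy is its sum over the sample size
agreement = sum(matrix[i][i] for i in range(len(labels))) / len(reference)

for label, row in zip(labels, matrix):
    print(f"{label:>8}: {row}")
print(f"agreement: {agreement:.2f}")
```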


## 5. Fine-tune the Best Model

Based on the comparison above, we'll fine-tune the best performing model on our dataset.
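
Picking the "best" model can be done programmatically rather than by eye. A minimal sketch, assuming agreement with the API reference is the selection criterion; the `agreement` helper is hypothetical and the prediction lists are made-up toy values standing in for the `*_pred` columns.

```python
def agreement(reference, predicted):
    """Fraction of samples where the model matches the reference label."""
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)

# Toy values for illustration only
reference = ["neutral", "bearish", "bullish", "neutral"]
candidates = {
    "nickmuchi": ["neutral", "bearish", "neutral", "neutral"],
    "mrm8488": ["bullish", "bearish", "neutral", "neutral"],
}

scores = {name: agreement(reference, preds) for name, preds in candidates.items()}
best = max(scores, key=scores.get)  # model with the highest agreement
print(scores, best)
```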


In [None]:
import re
import time

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
from datasets import Dataset
from sklearn.model_selection import train_test_split

# Select the model to fine-tune (set manually from the comparison results above)
best_model_name = "nickmuchi"  # Change to "mrm8488" if that performs better
best_model_id = "nickmuchi/deberta-v3-base-finetuned-finance-text-classification" if best_model_name == "nickmuchi" else "mrm8488/deberta-v3-ft-financial-news-sentiment-analysis"

print(f"Fine-tuning model: {best_model_name} ({best_model_id})")

# Prepare data for fine-tuning
# The LLM's predictions serve as (weak) labels for fine-tuning
if 'text' not in df.columns:
    print("Error: 'text' column not found!")
else:
    # Label a subset of the data to keep labeling and training fast
    train_size = min(500, len(df))  # Use up to 500 samples
    train_df = df.head(train_size).copy()

    print(f"\nGetting API labels for {train_size} training samples...")
    train_labels = []
    for text in tqdm(train_df['text'], desc="Getting labels"):
        # request_sentiment returns the lowercased reply: "reasoning: ... score: N"
        match = re.search(r'score:\s*([+-]?\d)', request_sentiment(text, prompt_template))
        score = int(match.group(1)) if match else 0  # unparseable replies count as neutral
        train_labels.append('bullish' if score > 0 else 'bearish' if score < 0 else 'neutral')
        time.sleep(0.1)  # Rate limiting

    train_df['label'] = train_labels
    
    # Map labels to integers
    label_map = {'bearish': 0, 'neutral': 1, 'bullish': 2}
    train_df['label_id'] = train_df['label'].map(label_map)
    
    # Split into train and validation
    train_texts, val_texts, train_labels, val_labels = train_test_split(
        train_df['text'].tolist(),
        train_df['label_id'].tolist(),
        test_size=0.2,
        random_state=42
    )
    
    print(f"\nTrain samples: {len(train_texts)}")
    print(f"Validation samples: {len(val_texts)}")
    print(f"Label distribution - Train: {pd.Series(train_labels).value_counts().to_dict()}")
    print(f"Label distribution - Val: {pd.Series(val_labels).value_counts().to_dict()}")
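
In the 10-row labeling run shown earlier, nine of the ten LLM scores were 0, so the label distribution printed here may be heavily skewed toward neutral. A sketch of inverse-frequency class weights, one common way to quantify that skew; the `class_weights` helper is hypothetical, the label ids are toy values, and the notebook's `Trainer` setup does not use such weights.

```python
from collections import Counter

def class_weights(label_ids, num_classes=3):
    """Inverse-frequency weights: n_samples / (num_classes * class_count)."""
    counts = Counter(label_ids)
    total = len(label_ids)
    # Classes with no samples are skipped to avoid dividing by zero
    return {c: total / (num_classes * counts[c]) for c in range(num_classes) if counts[c]}

# Toy label ids using the notebook's mapping: 0 = bearish, 1 = neutral, 2 = bullish
toy_labels = [1] * 8 + [0] + [2]
weights = class_weights(toy_labels)
print(weights)
```

`train_test_split` also accepts a `stratify` argument, which keeps the class proportions equal across the train and validation splits.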


In [None]:
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(best_model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    best_model_id,
    num_labels=3,
    id2label={0: 'bearish', 1: 'neutral', 2: 'bullish'},
    label2id={'bearish': 0, 'neutral': 1, 'bullish': 2}
)

# Tokenize the data (truncate only; DataCollatorWithPadding pads each batch to its longest sequence)
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=512)

# Create datasets
train_dataset = Dataset.from_dict({'text': train_texts, 'label': train_labels})
val_dataset = Dataset.from_dict({'text': val_texts, 'label': val_labels})

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)

# Data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

print("Datasets prepared!")
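
`DataCollatorWithPadding` pads each batch to the length of its longest sequence, whereas `padding='max_length'` pads every sequence to 512 tokens up front. A pure-Python sketch of the difference in padded size; `pad_batch` is a hypothetical helper and the token ids are made up.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in this batch (dynamic padding)."""
    longest = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (longest - len(ids)) for ids in batch]

# Made-up token ids for three short inputs
batch = [[101, 7, 8, 102], [101, 7, 102], [101, 102]]

dynamic = pad_batch(batch)
dynamic_cells = sum(len(row) for row in dynamic)  # tokens processed with dynamic padding
fixed_cells = len(batch) * 512                    # tokens processed when padded to max_length

print(dynamic)
print(dynamic_cells, fixed_cells)
```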


In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')
    acc = accuracy_score(labels, predictions)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Training arguments
training_args = TrainingArguments(
    output_dir=f'./results/{best_model_name}_finetuned',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir=f'./logs/{best_model_name}_finetuned',
    logging_steps=10,
    eval_strategy="epoch",  # named evaluation_strategy in transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("Starting fine-tuning...")
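trainer.train()

# Save the best model (and tokenizer) into output_dir itself; epoch checkpoints
# otherwise live only in checkpoint-* subfolders, so loading output_dir would fail
trainer.save_model(training_args.output_dir)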

print("\nFine-tuning completed!")
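
The training arguments imply a fairly short schedule. A worked calculation, assuming all 500 labeled rows are available, a single device, and no gradient accumulation (none is configured above):

```python
import math

labeled = 500                       # min(500, len(df)) when df has at least 500 rows
val_samples = labeled * 20 // 100   # test_size=0.2
train_samples = labeled - val_samples

steps_per_epoch = math.ceil(train_samples / 8)  # per_device_train_batch_size=8
total_steps = steps_per_epoch * 3               # num_train_epochs=3
warmup_fraction = 100 / total_steps             # warmup_steps=100

print(train_samples, steps_per_epoch, total_steps, round(warmup_fraction, 2))
```

Under these assumptions the learning rate is still warming up for about two thirds of training, so a smaller `warmup_steps` may be worth trying.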


## 6. Re-test Fine-tuned Model


In [None]:
# Load the fine-tuned model
finetuned_model = AutoModelForSequenceClassification.from_pretrained(
    f'./results/{best_model_name}_finetuned'
)
finetuned_tokenizer = AutoTokenizer.from_pretrained(best_model_id)

# Create pipeline with fine-tuned model
finetuned_pipe = pipeline(
    "text-classification",
    model=finetuned_model,
    tokenizer=finetuned_tokenizer
)

print("Fine-tuned model loaded!")

# Test on validation set
print("\nEvaluating on validation set...")
val_predictions = []
for text in tqdm(val_texts, desc="Predicting"):
    pred = finetuned_pipe(text)[0]
    pred_label = normalize_prediction(pred, 'finetuned')
    val_predictions.append(pred_label)

# Map validation labels back to strings
id_to_label = {0: 'bearish', 1: 'neutral', 2: 'bullish'}
val_labels_str = [id_to_label[label] for label in val_labels]

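# Calculate metrics; pass `labels` so report rows and matrix axes follow this order
# (scikit-learn would otherwise sort the string labels alphabetically)
class_names = ['bearish', 'neutral', 'bullish']
accuracy = accuracy_score(val_labels_str, val_predictions)
print(f"\nFine-tuned Model Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"\nClassification Report:")
print(classification_report(val_labels_str, val_predictions,
                          labels=class_names, target_names=class_names, zero_division=0))
print(f"\nConfusion Matrix:")
print(confusion_matrix(val_labels_str, val_predictions, labels=class_names))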


In [None]:
# Compare original vs fine-tuned model
print("=" * 60)
print("COMPARISON: Original vs Fine-tuned Model")
print("=" * 60)

# Get predictions from original model
original_predictions = []
for text in tqdm(val_texts, desc="Original model"):
    pred = pipes[best_model_name](text)[0]
    pred_label = normalize_prediction(pred, best_model_name)
    original_predictions.append(pred_label)

print(f"\nOriginal {best_model_name} model accuracy: {accuracy_score(val_labels_str, original_predictions):.4f}")
print(f"Fine-tuned model accuracy: {accuracy:.4f}")

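# Pass `labels` so the rows follow this order (string labels are otherwise sorted alphabetically)
print(f"\nOriginal model classification report:")
print(classification_report(val_labels_str, original_predictions,
                          labels=['bearish', 'neutral', 'bullish'],
                          target_names=['bearish', 'neutral', 'bullish'], zero_division=0))

print(f"\nFine-tuned model classification report:")
print(classification_report(val_labels_str, val_predictions,
                          labels=['bearish', 'neutral', 'bullish'],
                          target_names=['bearish', 'neutral', 'bullish'], zero_division=0))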
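
Two accuracy figures do not show which individual predictions changed. A minimal sketch of a paired comparison that counts examples the fine-tuned model fixed and examples it broke; `paired_changes` is a hypothetical helper and the label lists are made-up toy values, not outputs of this notebook.

```python
def paired_changes(truth, before, after):
    """Count examples fixed (wrong -> right) and broken (right -> wrong)."""
    fixed = sum(b != t and a == t for t, b, a in zip(truth, before, after))
    broken = sum(b == t and a != t for t, b, a in zip(truth, before, after))
    return fixed, broken

# Toy values for illustration only
truth = ["neutral", "bearish", "bullish", "neutral", "bearish"]
before = ["neutral", "neutral", "bullish", "bullish", "neutral"]  # original model
after = ["neutral", "bearish", "neutral", "neutral", "neutral"]   # fine-tuned model

fixed, broken = paired_changes(truth, before, after)
print(f"fixed: {fixed}, broken: {broken}")
```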


In [None]:
# Test on a few examples from the full dataset
print("\n" + "=" * 60)
print("SAMPLE PREDICTIONS ON NEW DATA")
print("=" * 60)

test_samples = df.tail(10) if len(df) > 10 else df
if 'text' in test_samples.columns:
    for idx, row in test_samples.iterrows():
        text = row['text']
        original_pred = normalize_prediction(pipes[best_model_name](text)[0], best_model_name)
        finetuned_pred = normalize_prediction(finetuned_pipe(text)[0], 'finetuned')
        
        print(f"\nText: {text[:150]}...")
        print(f"Original {best_model_name}: {original_pred}")
        print(f"Fine-tuned: {finetuned_pred}")
        print("-" * 60)
else:
    print("'text' column not found in dataframe")
