<a href="https://colab.research.google.com/github/AKASH4671/Sentiment-Analysis-on-Financial-News-and-Its-Impact-on-Stock-Prices/blob/main/03_sentiment_scoring.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**mount drive**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**libraries**

In [None]:
!pip install transformers
!pip install torch

import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

**load data**

In [None]:
# Load the cleaned data from previous step
file_path = "/content/drive/MyDrive/Colab Notebooks/Finance Projects/Sentiment-Analysis-on-Financial-News-and-Its-Impact-on-Stock-Prices/data/apple_news_cleaned.csv"
df = pd.read_csv(file_path)

print("Loaded data:", df.shape)
df.head()

Loaded data: (2934, 5)


,date,headline,cleaned_text,word_count,weekday
0,2025-06-24 19:59:58,Node v22.17.0 (LTS),node lts nodejs free opensource crossplatform ...,38,Tuesday
1,2025-06-24 19:57:16,"Marcelo vê Moçambique a ""olhar para o futuro""",marcelo see mozambique look futurepresident re...,36,Tuesday
2,2025-06-24 19:56:34,Obediências maçónicas emitem declaração em def...,masonic obedience issue statement defense peac...,38,Tuesday
3,2025-06-24 19:54:57,8点1氪｜顺丰等多家快递公司拒收罗马仕充电宝；字节通报大模型团队负责人出轨HRBP处理结果；...,express express delivery company refused accep...,124,Tuesday
4,2025-06-24 19:54:56,Almada. Detido suspeito de tentativa de homicí...,almadadetained suspected attempted murder braz...,36,Tuesday


In [None]:
# Count rows whose cleaned_text is NaN or only whitespace
print("Missing cleaned_text:", df['cleaned_text'].isna().sum())
print("Empty cleaned_text rows:", df['cleaned_text'].str.strip().eq('').sum())

# Count rows that repeat the same headline and cleaned_text
print("Duplicate articles:", df.duplicated(subset=['headline', 'cleaned_text']).sum())

Missing cleaned_text: 0
Empty cleaned_text rows: 0
Duplicate articles: 0


**load FinBERT model from Hugging Face**

In [None]:
model_name = "yiyanghkust/finbert-tone"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create sentiment pipeline
nlp = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cpu


**apply sentiment scoring to news**

In [None]:
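# Apply FinBERT to cleaned_text
def get_sentiment(text):
    if isinstance(text, str) and text.strip() != "":
        # Keep only the first 512 characters (a character cut, not a token cut)
        return nlp(text[:512])[0]
    else:
        # Fallback for missing or empty text; label casing matches the model's output
        return {'label': 'Neutral', 'score': 0.0}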

# Run sentiment analysis
sentiment_output = df['cleaned_text'].apply(get_sentiment)

# Separate into columns
df['sentiment'] = sentiment_output.apply(lambda x: x['label'])
df['sentiment_score'] = sentiment_output.apply(lambda x: x['score'])

print("Sentiment scoring complete.")
df[['headline', 'sentiment', 'sentiment_score']].head()

Sentiment scoring complete.


,headline,sentiment,sentiment_score
0,Node v22.17.0 (LTS),Neutral,0.999738
1,"Marcelo vê Moçambique a ""olhar para o futuro""",Neutral,0.999268
2,Obediências maçónicas emitem declaração em def...,Neutral,0.999991
3,8点1氪｜顺丰等多家快递公司拒收罗马仕充电宝；字节通报大模型团队负责人出轨HRBP处理结果；...,Neutral,0.999938
4,Almada. Detido suspeito de tentativa de homicí...,Neutral,0.999965
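
The cell above calls the pipeline once per row, which can be slow on CPU. A minimal sketch of a batched alternative follows. The helper name `score_in_batches`, its parameters, and the stand-in classifier `fake_classify` are hypothetical and not part of this notebook. The sketch assumes every input is a non-empty string (as the checks earlier reported) and that the classifier takes a list of strings and returns one result dict per string, in order.

```python
from typing import Callable, Dict, List, Sequence


def score_in_batches(
    texts: Sequence[str],
    classify: Callable[[List[str]], List[Dict]],
    batch_size: int = 32,
    max_chars: int = 512,
) -> List[Dict]:
    """Score texts in fixed-size batches, keeping the input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results: List[Dict] = []
    for start in range(0, len(texts), batch_size):
        # Same character-level cut as the per-row version above
        batch = [t[:max_chars] for t in texts[start:start + batch_size]]
        results.extend(classify(batch))
    return results


# Stand-in classifier so the sketch runs without downloading a model
calls: List[int] = []


def fake_classify(batch: List[str]) -> List[Dict]:
    calls.append(len(batch))  # record how many texts arrived in each call
    return [{"label": "Neutral", "score": 1.0} for _ in batch]


demo = score_in_batches(["good news", "bad news", "no news"], fake_classify, batch_size=2)
print(len(demo), calls)
```

In this notebook the call would presumably look like `score_in_batches(df['cleaned_text'].tolist(), nlp)`, since a Hugging Face pipeline can be called with a list of strings. If token-level truncation is wanted instead of the character cut, the text-classification pipeline is generally understood to accept `truncation=True` and `max_length=512` at call time; verify this against the installed `transformers` version before relying on it.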


In [None]:
print("Sentiment Distribution:")
print(df['sentiment'].value_counts())

Sentiment Distribution:
sentiment
Neutral     2435
Positive     393
Negative     106
Name: count, dtype: int64
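
For later comparison with price movements, the label and confidence can be folded into one signed number. The sketch below uses one possible convention, which is an assumption and not something this notebook defines: Positive maps to `+score`, Negative to `-score`, and Neutral to `0`. The names `SIGN` and `signed_sentiment` are hypothetical.

```python
# Assumed convention: direction of each label (not defined by the notebook)
SIGN = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}


def signed_sentiment(label: str, score: float) -> float:
    """Combine a label and its confidence into one signed value in [-1, 1]."""
    # Lowercasing tolerates both 'Neutral' and 'neutral'; unknown labels raise KeyError
    return SIGN[label.strip().lower()] * score


# Illustrative rows, not taken from the dataset
rows = [("Positive", 0.9), ("Negative", 0.8), ("Neutral", 0.99)]
signed = [signed_sentiment(label, score) for label, score in rows]
print(signed)
```

Applied to the DataFrame, this could be written as `df['signed_score'] = [signed_sentiment(l, s) for l, s in zip(df['sentiment'], df['sentiment_score'])]`, where `signed_score` is a hypothetical column name.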


**save**

In [None]:
# Save final sentiment-scored news data
output_path = "/content/drive/MyDrive/Colab Notebooks/Finance Projects/Sentiment-Analysis-on-Financial-News-and-Its-Impact-on-Stock-Prices/data/apple_sentiment_dataset.csv"
df.to_csv(output_path, index=False)

print(f"Saved final sentiment dataset to: {output_path}")

Saved final sentiment dataset to: /content/drive/MyDrive/Colab Notebooks/Finance Projects/Sentiment-Analysis-on-Financial-News-and-Its-Impact-on-Stock-Prices/data/apple_sentiment_dataset.csv
