# 3. Sentiment Analysis Stage  
## 3.a) News Headline Sentiment Scoring using FinBERT  
### In this step, we apply FinBERT to calculate sentiment scores for Reliance Industries news headlines.

---

<details>
  <summary><strong>1. Purpose & Overview</strong></summary>

<br>

This cell runs **sentiment analysis** on cleaned Google News headlines for Reliance Industries from 2020–2024 using the **FinBERT** model.  

**Key Functions Performed**:
- Loads preprocessed news dataset  
- Loads `ProsusAI/finbert` model and tokenizer via Hugging Face    
- Applies batch-wise inference with FinBERT  
- Calculates sentiment score:  
  $$
  \text{Score} = P(\text{positive}) - P(\text{negative})
  $$
- Stores results with their original dates in a CSV

</details>

---

<details>
  <summary><strong>2. Why This Matters</strong></summary>
<br>
- FinBERT is a financial domain-tuned BERT model specifically suited for **market sentiment**.
- Output scores allow **daily-level sentiment tracking** over 4 years, which can then be merged with stock price data for correlation studies or trading signals.

</details>

---

<details>
  <summary><strong>3. Output & Results</strong></summary>

<br>

**Saved CSV**:  
`sentiment_score_by_date.csv`

**Columns**:
- `Published Date` — News headline publish date  
- `Sentiment Score` — Calculated using FinBERT  

**Batch Size Handling**:
- Tested multiple batch sizes (16 to 48)  
- Cached the optimal size in `batch_config.json`  

**Runtime**:
- Total inference time is printed in seconds  
- Status updates provided with progress bar (`tqdm`)  

**Sample Output**:
Displayed using `.head()` — shows the first few rows of the sentiment-scored DataFrame.

</details>

---

In [3]:
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm
import time

In [4]:
df=pd.read_csv("../Data/Reliance_GoogleNews_Monthly_2020_2024_preprocessed.csv") # Loaded the saved pre-processed dataset

In [5]:
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

# Prepare data
texts = df["Headline"].tolist()
dates = df["Published Date"].tolist()

batch_size = 16
all_scores = []

start_time = time.time()

for i in tqdm(range(0, len(texts), batch_size), desc="Running Inference"):
    batch_texts = texts[i:i + batch_size]

    # Tokenize
    inputs = tokenizer(
        batch_texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128
    )

    # Inference
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=1)

    # Calculate sentiment score = P(positive) - P(negative)
    sentiment_scores = (probs[:, 2] - probs[:, 0]).tolist()
    all_scores.extend(sentiment_scores)

end_time = time.time()
print(f"Total inference time: {end_time - start_time:.2f} seconds")

# Create minimal result DataFrame
result_df = pd.DataFrame({
    "Published Date": dates,
    "Sentiment Score": all_scores
})

# Save to CSV
result_df.to_csv("../Data/sentiment_score_by_date.csv", index=False)
print("Saved to sentiment_score_by_date.csv")

# Show first few rows
result_df.head()
result_df.describe()

Running Inference: 100%|███████████████████████████████████████████████████████████████| 89/89 [00:49<00:00,  1.79it/s]

Total inference time: 49.80 seconds
Saved to sentiment_score_by_date.csv





Unnamed: 0,Sentiment Score
count,1417.0
mean,0.345303
std,0.59298
min,-0.931364
25%,0.003837
50%,0.643689
75%,0.832885
max,0.933444
