# 📊 Real-Time Financial Sentiment Classification using GenAI + RAG + Agent

This capstone demonstrates an end-to-end **Generative AI pipeline** using **Gemini 2.0 Flash**, **retrieval-augmented generation (RAG)**, and a **LangChain Python agent** for financial sentiment analysis.

The steps are:
- Fetch up to **100 real-time finance news articles** (the run shown here classifies the first 25)
- Use **Gemini embeddings** to retrieve similar examples from a labeled dataset via **FAISS**
- Build **few-shot prompts** dynamically and classify sentiment using Gemini Flash
- Return a structured **JSON output** per article, showing the prediction and the supporting example

**💡 Impact**: This solution supports analysts and decision-makers by transforming unstructured market news into structured, explainable sentiment summaries in real time.


### 📂 Inspect Input Files
List the files under `/kaggle/input` to confirm that the labeled Financial PhraseBank dataset is attached.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/sentiment-analysis-for-financial-news/all-data.csv
/kaggle/input/sentiment-analysis-for-financial-news/FinancialPhraseBank/Sentences_66Agree.txt
/kaggle/input/sentiment-analysis-for-financial-news/FinancialPhraseBank/Sentences_AllAgree.txt
/kaggle/input/sentiment-analysis-for-financial-news/FinancialPhraseBank/README.txt
/kaggle/input/sentiment-analysis-for-financial-news/FinancialPhraseBank/License.txt
/kaggle/input/sentiment-analysis-for-financial-news/FinancialPhraseBank/Sentences_75Agree.txt
/kaggle/input/sentiment-analysis-for-financial-news/FinancialPhraseBank/Sentences_50Agree.txt


## Step 1: Install dependencies in Kaggle

### 📦 Install Core Libraries
Install LangChain, the CPU build of FAISS, and the Gemini SDK (`google-generativeai`).

In [2]:
!pip install -q langchain faiss-cpu google-generativeai

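### 📦 Install the LangChain Gemini Integration
Install `langchain-google-genai`. pip reports a version conflict with `google-generativeai` over `google-ai-generativelanguage`; the notebook proceeds despite the warning.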

In [3]:
!pip install -q langchain-google-genai

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-generativeai 0.8.4 requires google-ai-generativelanguage==0.6.15, but you have google-ai-generativelanguage 0.6.17 which is incompatible.

## Step 2: LangChain + Gemini Setup

### 📥 Import Packages
Import the Gemini SDK, the LangChain tool and agent utilities, FAISS, and supporting libraries.

In [4]:
import google.generativeai as genai
from langchain_core.tools import tool
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI

import os
import time
import numpy as np
import pandas as pd
import faiss
import random
from tqdm import tqdm
from typing import List
from typing import Dict,Any
import requests


## Step 3: Configure Gemini

### 🔑 Load the Gemini API Key
Read `GOOGLE_API_KEY` from Kaggle secrets so that it is not hard-coded in the notebook.

In [5]:
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")

### 🔑 Create the Gemini Client
Create a Gemini client with the API key and define a helper that sends a prompt to Gemini 2.0 Flash.

In [6]:
from google import genai

# Client from the google-genai SDK, reused below for embeddings and generation
client = genai.Client(api_key=GOOGLE_API_KEY)

def classify_with_gemini(prompt: str) -> str:
    """Send a prompt to Gemini 2.0 Flash and return the stripped text response."""
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt
    )
    return response.text.strip()




## Step 4: Load Labeled Data + Build FAISS Index

### 📁 Load the Labeled Dataset
Load the labeled financial sentiment dataset, normalize the labels, and keep the first 1,000 rows for embedding.

In [7]:
df = pd.read_csv("/kaggle/input/sentiment-analysis-for-financial-news/all-data.csv", encoding="ISO-8859-1", header=None)
df.columns = ['label', 'text']
df['label'] = df['label'].str.lower().str.strip()
df = df.iloc[:1000,:]

In [8]:
# Embed a text with Gemini (text-embedding-004)
def get_gemini_embedding(text):
    try:
        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=text,
            config={"task_type": "RETRIEVAL_DOCUMENT"}
        )
        time.sleep(0.7)  # brief pause between calls (presumably to respect API rate limits)
        return response.embeddings
    except Exception as e:
        print("Embedding failed:", e)
        return None

# Apply to the labeled dataset
labeled_data = []
for i, row in tqdm(df.iterrows(), total=len(df)):
    embedding = get_gemini_embedding(row['text'])
    if embedding:
        labeled_data.append({
            'text': row['text'],
            'label': row['label'],
            'embedding': embedding
        })

100%|██████████| 1000/1000 [21:54<00:00,  1.31s/it]
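
The loop above drops any row whose embedding call fails, and the fixed 0.7 s pause is the only protection against rate limits. A minimal sketch of one way to retry transient failures with exponential backoff follows. `with_retries` and its parameters are hypothetical names that do not appear in the notebook, and the helper wraps any callable rather than assuming a particular Gemini SDK behavior.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], max_attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call fn() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**attempt seconds between attempts and returns
    None if every attempt raises (hypothetical helper, not in the notebook).
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
            # Back off before the next attempt, but not after the last one
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None
```

In this notebook, such a helper could wrap the raw `client.models.embed_content(...)` call inside `get_gemini_embedding`, in place of the single `try`/`except`; wrapping `get_gemini_embedding` itself would not retry, because that function already swallows exceptions and returns `None`.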


### 📦 Index the Labeled Embeddings with FAISS
Stack the labeled embeddings, L2-normalize them, and add them to an inner-product FAISS index for fast nearest-neighbor retrieval.

In [9]:
# Stack the embeddings from labeled_data and L2-normalize them so that inner product equals cosine similarity
label_embeddings = np.array([item['embedding'][0].values for item in labeled_data]).astype('float32')
faiss.normalize_L2(label_embeddings)
faiss_index = faiss.IndexFlatIP(label_embeddings.shape[1])
faiss_index.add(label_embeddings)
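
The cell above L2-normalizes the vectors before adding them to an inner-product index. The reason is that the inner product of two unit-length vectors equals their cosine similarity, so the index ranks neighbors by cosine similarity. The plain-Python sketch below illustrates that equivalence with a brute-force top-1 search; it does not use FAISS, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed directly from the unnormalized vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

def top1(query: Sequence[float], corpus: Sequence[Sequence[float]]) -> Tuple[int, float]:
    """Brute-force top-1 search by inner product over normalized vectors."""
    q = l2_normalize(query)
    scores = [inner_product(q, l2_normalize(v)) for v in corpus]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

The score returned by `top1` matches `cosine_similarity(query, corpus[best])` up to floating-point error, which is the quantity the normalized inner-product index ranks by.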


In [10]:
from kaggle_secrets import UserSecretsClient
secrets = UserSecretsClient()
NEWS_API_KEY = secrets.get_secret("NEWSAPI_KEY")  # Or whatever name you used


## Step 5: Fetch News Articles (Tool 1)

### 🌐 Fetch Recent Finance News
Use NewsAPI to retrieve up to 100 recent finance-related news articles.

In [11]:
# Fetch article titles and descriptions from NewsAPI
def fetch_news(query: str, max_results: int = 100) -> List[str]:
    url = "https://newsapi.org/v2/everything"
    params = {
        "q": query,
        "language": "en",
        "pageSize": max_results,
        "sortBy": "relevancy",  # NewsAPI accepts relevancy, popularity, or publishedAt
        "apiKey": NEWS_API_KEY
    }
    response = requests.get(url, params=params, timeout=30)
    if response.status_code != 200:
        print("Failed to fetch news:", response.text)
        return []
    # Keep only articles that have a description
    return [f"{a['title']}. {a['description']}" for a in response.json().get("articles", []) if a['description']]
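
`fetch_news` joins each title and description into one string and skips articles without a description. The sketch below shows one possible way to also tolerate a missing title and drop exact duplicate strings while preserving order. `format_articles` is a hypothetical helper that is not part of the notebook; it operates on the list of article dictionaries found under the `articles` key of the NewsAPI response.

```python
from typing import Dict, List, Optional

def format_articles(articles: List[Dict[str, Optional[str]]]) -> List[str]:
    """Build 'title. description' strings, skipping empty and duplicate entries."""
    seen = set()
    texts = []
    for article in articles:
        title = (article.get("title") or "").strip()
        description = (article.get("description") or "").strip()
        if not description:
            continue  # mirrors fetch_news, which skips articles without a description
        text = f"{title}. {description}" if title else description
        if text not in seen:  # keep only the first occurrence of each string
            seen.add(text)
            texts.append(text)
    return texts
```

Deduplicating before classification avoids spending embedding and generation calls on repeated articles.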


## Step 6: Define Tools for LangChain Agent

### 🧰 Define LangChain Tool: RAG Retriever (Top-1 from FAISS)
Wrap the retrieval function as a LangChain-compatible tool for the agent.

In [12]:
@tool
def retrieve_similar_labeled_example(news_text: str) -> Dict[str, str]:
    """Retrieve the most similar labeled finance example to the input news article using FAISS."""
    query_vec = get_gemini_embedding(news_text)
    if query_vec is None:
        return {"error": "Failed to embed"}

    # Shape (1, dim) float32 array, normalized the same way as the indexed vectors
    query_array = np.array(query_vec[0].values, dtype='float32').reshape(1, -1)
    faiss.normalize_L2(query_array)

    # Top-1 nearest neighbor by inner product (cosine similarity after normalization)
    _, indices = faiss_index.search(query_array, 1)
    matched = labeled_data[indices[0][0]]

    return {"example_text": matched['text'], "example_label": matched['label']}


### 🧰 Define LangChain Tool: Sentiment Classifier (Few-Shot Prompt)
Wrap the few-shot Gemini classifier as another LangChain tool.

In [13]:
from langchain.tools import tool

@tool
def classify_sentiment_with_few_shot(news: str, example_text: str, example_label: str) -> Dict[str, str]:
    """
    Uses a few-shot prompt with Gemini to classify the sentiment (positive, negative, neutral)
    of a financial news article based on a matched labeled example.
    """
    prompt = f"""You are a financial sentiment classifier.
Here is an example:
Text: {example_text}
Sentiment: {example_label.capitalize()}

Now classify the following:
Text: {news}
Sentiment:"""

    try:
        response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=prompt
        )
        sentiment = response.text.strip().lower()

        if sentiment.startswith("positive"):
            sentiment = "positive"
        elif sentiment.startswith("negative"):
            sentiment = "negative"
        elif sentiment.startswith("neutral"):
            sentiment = "neutral"
        else:
            sentiment = "unknown"

        return {
            "sentiment": sentiment,
            "news": news,
            "example_text": example_text,
            "example_label": example_label
        }

    except Exception as e:
        return {"error": str(e)}
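
The classifier above maps the model's reply to a label with `startswith`, so a reply such as `Sentiment: Negative` or `**Neutral**` would fall through to `unknown`. A minimal sketch of a more tolerant normalization step follows. `normalize_sentiment` is a hypothetical helper, not part of the notebook, and it assumes the first label word found in the reply is the intended answer.

```python
import re

# Matches the first standalone label word in a lowercased reply
_LABEL_PATTERN = re.compile(r"\b(positive|negative|neutral)\b")

def normalize_sentiment(raw: str) -> str:
    """Return 'positive', 'negative', 'neutral', or 'unknown' for a model reply."""
    match = _LABEL_PATTERN.search(raw.lower())
    return match.group(1) if match else "unknown"
```

The trade-off is that a reply like "not positive" would be read as `positive`, so this approach suits prompts that ask for a single-word answer.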


## Step 7: Create Agent and Run Over News Batch

In [14]:

from langchain_core.language_models import BaseChatModel
from langchain_core.messages import HumanMessage
from langchain_core.outputs import ChatResult, ChatGeneration
from langchain.agents import initialize_agent, AgentType

In [15]:
from typing import List, Optional, Union, Any

### 🔁 Run Agent in Batch Mode
Fetch up to 100 articles and, as a first test, pass the first 25 through the retriever and classifier tools directly, storing each result.

In [18]:
real_time_news = fetch_news("S&P 500 trend today", max_results=100)

results = []

for article in tqdm(real_time_news[:25]):  # Test with 25 first
    retrieved = retrieve_similar_labeled_example.run(article)
    
    if 'example_text' not in retrieved:
        continue
    
    # Call classification with retrieved example
    classification = classify_sentiment_with_few_shot.run({
        "news": article,
        "example_text": retrieved['example_text'],
        "example_label": retrieved['example_label'],
    })
    
    results.append(classification)


100%|██████████| 25/25 [00:39<00:00,  1.58s/it]


### 💾 Review Structured Results
Display the final structured output (news, sentiment, example used) for each article.
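
The cell below only displays `results`. A minimal sketch of writing them to a JSON file is given here; `save_results` and the default file name `sentiment_results.json` are hypothetical and do not appear in the notebook. On Kaggle, files written to the current working directory are kept as notebook output, as noted in the first cell's comments.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

def save_results(results: List[Dict[str, Any]],
                 path: str = "sentiment_results.json") -> Path:
    """Write the list of result dictionaries to a UTF-8 JSON file and return its path."""
    out_path = Path(path)
    out_path.write_text(
        json.dumps(results, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return out_path
```

Calling `save_results(results)` after the batch loop would produce one JSON array with an object per article.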

In [19]:
results

[{'sentiment': 'positive',
  'news': 'Stock market today: Dow surges 600 points, S&P 500 has best week since 2023 to cap wild week of tariff-fueled chaos. Wall Street is set to wrap up another week of tariff-fueled turmoil.',
  'example_text': "The broad-based WIG index ended Thursday 's session 0.1 pct up at 65,003.34 pts , while the blue-chip WIG20 was 1.13 down at 3,687.15 pts .",
  'example_label': 'neutral'},
 {'sentiment': 'positive',
  'news': 'Stock market today: Dow, S&P 500, Nasdaq rise as bond yields surge, China-US trade war in focus. Wall Street is set to wrap up another week of tariff-fueled turmoil.',
  'example_text': "The broad-based WIG index ended Thursday 's session 0.1 pct up at 65,003.34 pts , while the blue-chip WIG20 was 1.13 down at 3,687.15 pts .",
  'example_label': 'neutral'},
 {'sentiment': 'negative',
  'news': "Stock market today: S&P 500, Nasdaq plunge, Dow drops 1,400 points as Trump's tariffs shock markets. US stocks plunged after President Trump annou