# Real-Time Financial Sentiment Classification using GenAI + RAG + Agent

This capstone demonstrates an end-to-end **Generative AI pipeline** using **Gemini 2.0 Flash**, **retrieval-augmented generation (RAG)**, and a **LangChain Python agent** for financial sentiment analysis.

The steps are:
- Fetch up to **100 recent finance news articles** from NewsAPI
- Use **Gemini embeddings** to retrieve similar examples from a labeled dataset via **FAISS**
- Build **few-shot prompts** dynamically and classify sentiment using Gemini Flash
- Return a structured **JSON output** per article, showing the prediction and the supporting example
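
The steps above can be sketched offline as a single flow. This is only an illustrative sketch: `toy_embed` and `toy_classify` are hypothetical stand-ins for the Gemini embedding and generation calls, and the two labeled examples are made up for the demo.

```python
import json
import math
from collections import Counter

# Made-up labeled examples (assumption: the real notebook uses all-data.csv).
LABELED = [
    {"text": "Company profit rose sharply this quarter", "label": "positive"},
    {"text": "Company shares fell after weak sales", "label": "negative"},
]

def toy_embed(text):
    """Hypothetical embedding: word counts instead of a Gemini vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(news):
    """Return the labeled example most similar to the news text."""
    query = toy_embed(news)
    return max(LABELED, key=lambda ex: cosine(query, toy_embed(ex["text"])))

def build_prompt(news, example):
    """Build a one-example few-shot prompt from the retrieved example."""
    return (
        "You are a financial sentiment classifier.\n"
        f"Text: {example['text']}\nSentiment: {example['label'].capitalize()}\n\n"
        f"Text: {news}\nSentiment:"
    )

def toy_classify(prompt, example):
    """Hypothetical classifier: echoes the retrieved label instead of calling Gemini."""
    return example["label"]

def run_pipeline(news):
    """Retrieve an example, build the prompt, classify, and return one JSON-ready record."""
    example = retrieve(news)
    prompt = build_prompt(news, example)
    return {
        "news": news,
        "sentiment": toy_classify(prompt, example),
        "example_text": example["text"],
        "example_label": example["label"],
    }

record = run_pipeline("Shares fell sharply after the company cut its sales forecast")
print(json.dumps(record, indent=2))
```

The record has the same four keys that the notebook's classifier tool returns, so the stand-ins can be swapped for the real calls without changing the output shape.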

**Impact**: This solution supports analysts and decision-makers by transforming unstructured market news into structured, explainable sentiment summaries in real time.


In [1]:
import numpy as np
import pandas as pd

import os

### Install Dependencies
Install the required libraries: LangChain, the Gemini SDKs, FAISS, and python-dotenv.

In [3]:
!pip install -q langchain faiss-cpu google-generativeai google-genai python-dotenv

In [2]:
!pip install -U langchain-google-genai

### Set Up Gemini & Import Packages
Configure the Gemini SDK and import all required Python packages.

## LangChain + Gemini Setup

In [19]:
import os
import random
import time
from typing import Any, Dict, List

import faiss
import numpy as np
import pandas as pd
import requests
from tqdm import tqdm

from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
from langchain_google_genai import ChatGoogleGenerativeAI


In [26]:
from dotenv import load_dotenv
load_dotenv()

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

In [6]:
from google import genai

# One client from the google-genai SDK, reused below for embeddings and generation.
client = genai.Client(api_key=GOOGLE_API_KEY)

def classify_with_gemini(prompt: str) -> str:
    """Send a prompt to Gemini 2.0 Flash and return the stripped text response."""
    response = client.models.generate_content(model="gemini-2.0-flash", contents=prompt)
    return response.text.strip()


### Load & Clean Labeled Data
Load the labeled financial sentiment dataset and prepare it for embedding.

In [10]:
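# The CSV has no header row: column 0 is the sentiment label, column 1 is the text.
df = pd.read_csv("all-data.csv", encoding="ISO-8859-1", header=None)
df.columns = ['label', 'text']
df['label'] = df['label'].str.lower().str.strip()
# Keep only the first 100 rows, which keeps the embedding step short.
df = df.iloc[:100, :]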

### Embed Labeled Data with Gemini
Use Gemini's embedding API to encode each labeled example for similarity search.

In [11]:
def get_gemini_embedding(text):
    """Embed text with Gemini; return the list of embeddings, or None on failure."""
    try:
        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=text,
            config={"task_type": "RETRIEVAL_DOCUMENT"},
        )
        time.sleep(0.7)  # Brief pause between calls to throttle the request rate
        return response.embeddings
    except Exception as e:
        print("Embedding failed:", e)
        return None

# Embed every labeled example, skipping rows whose embedding failed
labeled_data = []
for i, row in tqdm(df.iterrows(), total=len(df)):
    embedding = get_gemini_embedding(row['text'])
    if embedding:
        labeled_data.append({
            'text': row['text'],
            'label': row['label'],
            'embedding': embedding
        })

100%|█████████████████████████████████████████| 100/100 [01:32<00:00,  1.08it/s]
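
Because each embedding is a network call, a transient failure silently drops that row from `labeled_data`. One possible mitigation is a retry wrapper with exponential backoff. The sketch below is an assumption, not part of the original notebook: `with_retries` and `flaky_embed` are hypothetical names, and the wrapper would need to wrap the raw `client.models.embed_content` call, since `get_gemini_embedding` already catches exceptions itself.

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    Waits base_delay, then 2x, then 4x, ... between attempts.
    Returns None if every attempt fails, mirroring get_gemini_embedding.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return None

# Demo with a fake embedding call that fails twice, then succeeds.
calls = {"count": 0}

def flaky_embed():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return [0.1, 0.2, 0.3]

delays = []
# Record the delays instead of actually sleeping, so the demo runs instantly.
vector = with_retries(flaky_embed, sleep=delays.append)
print(vector, delays)
```

Passing `sleep` as a parameter is a design choice that keeps the helper easy to test; in the notebook the default `time.sleep` would be used.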


### Build FAISS Index on Labeled Embeddings
Use FAISS to index all labeled data vectors for fast nearest-neighbor retrieval.

In [12]:
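# Stack the embedding vectors into a float32 matrix (one row per labeled example).
label_embeddings = np.array([item['embedding'][0].values for item in labeled_data]).astype('float32')
# After L2 normalization, the inner product equals cosine similarity.
faiss.normalize_L2(label_embeddings)
faiss_index = faiss.IndexFlatIP(label_embeddings.shape[1])
faiss_index.add(label_embeddings)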
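
The index above relies on the fact that, for unit-length vectors, the inner product equals cosine similarity. The following pure-Python sketch illustrates that idea on toy 2-D vectors without FAISS; the helper names (`l2_normalize`, `inner_product`, `nearest`) are illustrative and do not come from the notebook.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest(query, vectors):
    """Return (index, score) of the stored vector with the highest cosine similarity."""
    q = l2_normalize(query)
    scores = [inner_product(q, l2_normalize(v)) for v in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy "index" of three stored vectors.
stored = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
index, score = nearest([2.0, 2.1], stored)
print(index, round(score, 4))
```

The query points almost exactly along the direction of the third vector, so that vector wins even though its length differs from the query's; only direction matters after normalization.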


### Fetch Real-Time Finance News
Use NewsAPI to retrieve up to 100 recent finance-related news articles, sorted by relevance. Articles without a description are dropped.

In [28]:
NEWS_API_KEY = os.getenv("NEWS_API_KEY")

In [48]:
@tool(parse_docstring=True)
def fetch_news(query: str, max_results: int = 100) -> List[str]:
    """
    Fetches recent English-language news articles matching the provided search query using the NewsAPI.

    Args:
        query (str): The search term or keywords to look for in news articles.
        max_results (int, optional): The maximum number of articles to retrieve. Defaults to 100.

    Returns:
        List[str]: A list of strings, where each string contains the title and description of a news article.
                   Returns an empty list if the request fails or no articles are found.

    Example:
        articles = fetch_news("artificial intelligence", max_results=10)
        # Returns a list of up to 10 news articles about artificial intelligence.
    """
    url = "https://newsapi.org/v2/everything"
    params = {
        "q": query,
        "language": "en",
        "pageSize": max_results,
        "sortBy": "relevance",
        "apiKey": NEWS_API_KEY
    }
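    # A timeout prevents the cell from hanging if NewsAPI does not respond.
    response = requests.get(url, params=params, timeout=30)
    if response.status_code != 200:
        print("Failed to fetch news:", response.text)
        return []
    # Keep only articles that have a description, joined with their title.
    return [
        f"{a['title']}. {a['description']}"
        for a in response.json().get("articles", [])
        if a.get("description")
    ]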
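
The response parsing can be tried offline by separating it from the HTTP call. The sketch below assumes the response shape that `fetch_news` relies on (an `articles` list whose items carry `title` and `description`); `articles_to_texts` is a hypothetical helper, and the payload is hand-written sample data, not real API output. Unlike the tool above, it also skips articles with a missing title.

```python
def articles_to_texts(payload):
    """Turn a NewsAPI-style response dict into 'title. description' strings."""
    texts = []
    for article in payload.get("articles", []):
        title = (article.get("title") or "").strip()
        description = (article.get("description") or "").strip()
        # Skip articles that lack either field.
        if title and description:
            texts.append(f"{title}. {description}")
    return texts

# Hand-written sample payload for illustration only.
sample_payload = {
    "status": "ok",
    "articles": [
        {"title": "Stocks rally", "description": "Indexes closed higher."},
        {"title": "No summary", "description": None},
        {"description": "Missing title."},
    ],
}

texts = articles_to_texts(sample_payload)
print(texts)
```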


In [49]:
fetch_news.args_schema.model_json_schema()

{'description': 'Fetches recent English-language news articles matching the provided search query using the NewsAPI.',
 'properties': {'query': {'description': 'The search term or keywords to look for in news articles.',
   'title': 'Query',
   'type': 'string'},
  'max_results': {'default': 100,
   'description': 'The maximum number of articles to retrieve. Defaults to 100.',
   'title': 'Max Results',
   'type': 'integer'}},
 'required': ['query'],
 'title': 'fetch_news',
 'type': 'object'}

### Define LangChain Tool: RAG Retriever
Wrap the retrieval function as a LangChain-compatible Tool for the agent.

In [51]:
@tool(parse_docstring=True)
def retrieve_similar_labeled_example(news_text: str) -> Dict[str, str]:
    """
    Finds and returns the most similar labeled financial example to the given news article using FAISS-based semantic search.

    Args:
        news_text (str): The news article or headline text to search for similar labeled examples.

    Returns:
        Dict[str, str]: A dictionary containing the following keys:
            - 'example_text': The text of the most similar labeled finance example from the database.
            - 'example_label': The label or category associated with the matched example.
            - If embedding fails, returns {'error': 'Failed to embed'}.

    Example:
        result = retrieve_similar_labeled_example("Apple stock surges after earnings report.")
        # Returns: {'example_text': 'Apple posts record quarterly revenue...', 'example_label': 'positive'}
    """
    query_vec = get_gemini_embedding(news_text)
    if query_vec is None:
        return {"error": "Failed to embed"}

    # Normalize the query the same way as the indexed vectors, then take the top match.
    query_array = np.array(query_vec[0].values, dtype='float32').reshape(1, -1)
    faiss.normalize_L2(query_array)
    _, indices = faiss_index.search(query_array, 1)
    matched = labeled_data[indices[0][0]]

    return {"example_text": matched['text'], "example_label": matched['label']}


### Define LangChain Tool: Sentiment Classifier
Wrap the few-shot Gemini classifier as another LangChain Tool.

### Tool: Classify with Gemini (Few-shot prompt)

In [56]:
@tool(parse_docstring=True)
def classify_sentiment_with_few_shot(news: str, example_text: str, example_label: str) -> Dict[str, str]:
    """
    Classifies the sentiment of a financial news article (positive, negative, or neutral) using a few-shot prompt with Gemini.
    The function leverages a matched labeled example as in-context reference to improve classification accuracy.

    Args:
        news (str): The text of the financial news article to be classified.
        example_text (str): A labeled example news text that is semantically similar to the input.
        example_label (str): The sentiment label ('positive', 'negative', or 'neutral') of the example_text.

    Returns:
        Dict[str, str]: A dictionary containing:
            - 'sentiment': The predicted sentiment for the input news article ('positive', 'negative', 'neutral', or 'unknown').
            - 'news': The input news article text.
            - 'example_text': The matched example news text used for few-shot prompting.
            - 'example_label': The sentiment label of the matched example.
            - If an error occurs, returns {'error': <error_message>}.

    Example:
        result = classify_sentiment_with_few_shot(
            news="Tesla shares drop after recall announcement.",
            example_text="Tesla faces scrutiny after software glitch, stock falls.",
            example_label="negative"
        )
        # Returns: {
        #   'sentiment': 'negative',
        #   'news': "...",
        #   'example_text': "...",
        #   'example_label': "negative"
        # }
    """
    prompt = f"""You are a financial sentiment classifier.
    Here is an example:
    Text: {example_text}
    Sentiment: {example_label.capitalize()}
    
    Now classify the following:
    Text: {news}
    Sentiment:
    """

    try:
        response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=prompt
        )
        sentiment = response.text.strip().lower()

        if sentiment.startswith("positive"):
            sentiment = "positive"
        elif sentiment.startswith("negative"):
            sentiment = "negative"
        elif sentiment.startswith("neutral"):
            sentiment = "neutral"
        else:
            sentiment = "unknown"

        return {
            "sentiment": sentiment,
            "news": news,
            "example_text": example_text,
            "example_label": example_label
        }

    except Exception as e:
        return {"error": str(e)}
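
The `startswith` checks in the tool assume the model replies with the bare label. If the model echoes a prefix such as "Sentiment:" or wraps the label in markdown, the result falls through to "unknown". A slightly more tolerant parser might look like the sketch below; `normalize_sentiment` and `VALID_LABELS` are hypothetical names, not part of the notebook, and the first-word rule is one possible design among several.

```python
import re

VALID_LABELS = ("positive", "negative", "neutral")

def normalize_sentiment(raw):
    """Map free-form model output to one of VALID_LABELS, or 'unknown'."""
    text = raw.strip().lower()
    # Drop an optional leading "sentiment:" prefix the model may echo back.
    text = re.sub(r"^sentiment\s*[:\-]\s*", "", text)
    # Take the first alphabetic word, ignoring markdown and punctuation.
    match = re.search(r"[a-z]+", text)
    word = match.group(0) if match else ""
    return word if word in VALID_LABELS else "unknown"

for raw in ["Positive", "Sentiment: Negative.", "**Neutral**", "I think it is positive", ""]:
    print(repr(raw), "->", normalize_sentiment(raw))
```

Only the first word is inspected, so a hedged sentence such as "I think it is positive" is still reported as "unknown" rather than guessed.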


## Create Agent and Run Over News Batch

In [22]:
from langchain_core.language_models import BaseChatModel
from langchain_core.messages import HumanMessage
from langchain_core.outputs import ChatResult, ChatGeneration
from langchain.agents import initialize_agent, AgentType

In [23]:
from typing import List, Optional, Union, Any

### 🔁 Run Agent in Batch Mode
Iterate through the fetched articles (the first 25 in this test run), retrieve a similar labeled example for each, classify it, and store the result.

In [29]:
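# fetch_news is a LangChain tool, so it is called through .invoke() with a dict of arguments.
real_time_news = fetch_news.invoke({"query": "S&P 500 trend today", "max_results": 100})

results = []

for article in tqdm(real_time_news[:25]):  # Test with the first 25 articles
    retrieved = retrieve_similar_labeled_example.invoke({"news_text": article})
    time.sleep(5)  # Pause between articles to throttle API requests
    if 'example_text' not in retrieved:
        continue  # Skip articles whose embedding failed

    # Classify the article, using the retrieved example as the few-shot reference
    classification = classify_sentiment_with_few_shot.invoke({
        "news": article,
        "example_text": retrieved['example_text'],
        "example_label": retrieved['example_label'],
    })

    results.append(classification)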


100%|███████████████████████████████████████████| 25/25 [02:40<00:00,  6.43s/it]
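
Once the loop finishes, `results` is a list of dictionaries, some of which may be `{'error': ...}` records. A quick summary can help spot problems before saving. The sketch below is an illustrative assumption: `summarize` is a hypothetical helper and `sample_results` is hand-written data in the same shape as the classifier tool's output, not real model output.

```python
from collections import Counter

def summarize(results):
    """Count predicted sentiments and how often they match the retrieved example's label."""
    valid = [r for r in results if "sentiment" in r]  # skip {'error': ...} records
    counts = Counter(r["sentiment"] for r in valid)
    agree = sum(r["sentiment"] == r["example_label"] for r in valid)
    return {
        "total": len(results),
        "errors": len(results) - len(valid),
        "by_sentiment": dict(counts),
        "matches_example_label": agree,
    }

# Hand-written records for illustration only.
sample_results = [
    {"sentiment": "positive", "news": "a", "example_text": "x", "example_label": "negative"},
    {"sentiment": "neutral", "news": "b", "example_text": "y", "example_label": "neutral"},
    {"sentiment": "neutral", "news": "c", "example_text": "z", "example_label": "positive"},
    {"error": "request failed"},
]

summary = summarize(sample_results)
print(summary)
```

A low `matches_example_label` count is not necessarily a problem, since the retrieved example is only a reference, but a large share of "unknown" or error records would suggest revisiting the prompt or the rate limiting.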


In [None]:
real_time_news

### Save Structured Results

In [28]:
import json
from pprint import pprint

In [26]:
with open("financial_sentiment_results.json", "w") as f:
    json.dump(results, f, indent=2)

In [30]:
with open("financial_sentiment_results.json", "r") as f:
    data = json.load(f)

pprint(data[:5])  

[{'example_label': 'negative',
  'example_text': 'A tinyurl link takes users to a scamming site promising '
                  'that users can earn thousands of dollars by becoming a '
                  'Google ( NASDAQ : GOOG ) Cash advertiser .',
  'news': "Tech companies want humans to help level up AI models. What's your "
          'price for training them?. The humanization of AI is turning into a '
          "nice side hustle, and it's an interesting dilemma for the humans "
          'paid to train them.',
  'sentiment': 'positive'},
 {'example_label': 'neutral',
  'example_text': "The broad-based WIG index ended Thursday 's session 0.1 pct "
                  'up at 65,003.34 pts , while the blue-chip WIG20 was 1.13 '
                  'down at 3,687.15 pts .',
  'news': 'Meta, Microsoft, Starbucks, Visa: Stocks to watch today. Markets '
          'were slipping lower ahead of Wednesday’s open, with S&P 500 futures '
          'down 0.7%, the Nasdaq down 1%, and the Dow Jones I