In [3]:
!pip install -U sentence-transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
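
The warning above names its own remedy. A minimal sketch, assuming it is run in the first cell of the notebook, before `sentence_transformers` or `transformers` is imported and before any `!pip install` subprocess is spawned:

```python
import os

# Setting the variable explicitly tells the tokenizers library not to use its
# thread pool, which silences the fork warning. "false" favors safety over
# speed; "true" is the other accepted value.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Placing this at the top of the notebook is a convention, not a requirement; the point is only that the value is in place before the process forks.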




In [4]:
!pip install faiss-cpu





In [5]:
!pip install tiktoken





In [1]:
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import faiss

# The DeBERTa-v3 tokenizer is SentencePiece-based, so install it first: pip install sentencepiece

# Load your company data
try:
    df = pd.read_csv("/home/chebolu_srikanth/.keras/company_description.csv")  # Ensure it has 'Ticker' and 'Description'
except FileNotFoundError:
    print("Error: company_description.csv not found. Please check the path.")
    print("Creating a dummy DataFrame for demonstration purposes.")
    data = {
        'Ticker': ['AAPL', 'MSFT', 'GOOG'],
        'Description': [
            'Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories worldwide.',
            'Microsoft Corporation develops, licenses, and supports software, services, devices, and solutions worldwide.',
            'Alphabet Inc. provides online advertising services in the United States, Europe, the Middle East, Africa, the Asia-Pacific, Canada, and Latin America.'
        ]
    }
    df = pd.DataFrame(data)

tickers = df['Ticker'].tolist()
descriptions = df['Description'].tolist()

# --- Device Configuration ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load the BGE encoder for dense retrieval
print("Loading BGE model...")
bge_model = SentenceTransformer("BAAI/bge-large-en-v1.5", device=device)
bge_model.encode("warmup")  # warmup
print("BGE model loaded.")

# Encode all company descriptions
print("Encoding company descriptions...")
company_embeddings = bge_model.encode(descriptions, convert_to_tensor=True, normalize_embeddings=True, device=device)
print(f"Company embeddings shape: {company_embeddings.shape}")

# Build FAISS index for retrieval
print("Building FAISS index...")
faiss_index = faiss.IndexFlatIP(company_embeddings.shape[1])
# Ensure embeddings are on CPU and numpy for FAISS
faiss_index.add(company_embeddings.cpu().numpy())
print("FAISS index built.")

# --- Load Cross-Encoders and their specific Tokenizers ---

# Relevance Model (DeBERTa)
print("Loading DeBERTa tokenizer and model for relevance...")
relevance_tokenizer_name = "microsoft/deberta-v3-large"
# IMPORTANT: If you are using a base DeBERTa model for relevance, its output
# logits are NOT inherently relevance scores. It needs to be fine-tuned on a
# relevance task (e.g., NLI, or query-document relevance).
# For demonstration, we'll assume it's a binary classifier where output[1] is relevance.
# If you have a custom fine-tuned model, replace `relevance_model_name`
relevance_model_name = "microsoft/deberta-v3-large" # Or your fine-tuned relevance model
try:
    relevance_tokenizer = AutoTokenizer.from_pretrained(relevance_tokenizer_name)
    # If the base model is used, num_labels will be its default (often for MLM or pre-training objectives)
    # If you fine-tuned it as a binary classifier for relevance, it would have num_labels=2
    relevance_model = AutoModelForSequenceClassification.from_pretrained(relevance_model_name) #, num_labels=2 if fine-tuned
    relevance_model.to(device)
    relevance_model.eval() # Set to evaluation mode
    print("DeBERTa model loaded.")
except Exception as e:
    print(f"Error loading DeBERTa model: {e}")
    print("Skipping relevance scoring with DeBERTa.")
    relevance_model = None  # Fallback
    relevance_tokenizer = None  # Keep both names defined for the checks below

# Sentiment Model (FinBERT)
print("Loading FinBERT tokenizer and model for sentiment...")
sentiment_tokenizer_name = "ProsusAI/finbert"
sentiment_model_name = "ProsusAI/finbert"
try:
    sentiment_tokenizer = AutoTokenizer.from_pretrained(sentiment_tokenizer_name)
    sentiment_model = AutoModelForSequenceClassification.from_pretrained(sentiment_model_name)
    sentiment_model.to(device)
    sentiment_model.eval() # Set to evaluation mode
    print("FinBERT model loaded.")

    # Read the label order from the model config instead of assuming it.
    # sentiment_labels_ordered[i] is the label name for logit index i.
    id2label = getattr(sentiment_model.config, "id2label", None)
    if id2label:
        print(f"FinBERT config labels: {id2label}")
        sentiment_labels_ordered = [id2label[i] for i in range(len(id2label))]
        print(f"Using FinBERT sentiment labels (ordered by index): {sentiment_labels_ordered}")
    else:
        # Fallback to the original assumption if the config has no label mapping
        sentiment_labels_ordered = ["negative", "neutral", "positive"]
        print(f"Warning: Could not determine FinBERT label order from config. Assuming: {sentiment_labels_ordered}")

except Exception as e:
    print(f"Error loading FinBERT model: {e}")
    print("Skipping sentiment scoring.")
    sentiment_model = None  # Fallback
    sentiment_tokenizer = None  # Keep both names defined for the checks below


# Inference function
def find_relevant_companies(news_text, top_k=10):
    news_embedding = bge_model.encode(news_text, convert_to_tensor=True, normalize_embeddings=True, device=device)
    # FAISS search needs CPU numpy array
    D, indices = faiss_index.search(news_embedding.cpu().unsqueeze(0).numpy(), top_k) # unsqueeze for batch dim

    results = []
    for i, idx in enumerate(indices[0]):
        if idx == -1: # FAISS can return -1 if fewer than k items are available or for empty results
            continue
        company_ticker = tickers[idx]
        company_desc = descriptions[idx]
        retrieval_score = D[0][i] # This is the dot product (IP) score from FAISS

        item_result = {
            "Ticker": company_ticker,
            "RetrievalScore": round(float(retrieval_score), 3), # Store BGE retrieval score
            # "Description": company_desc # Optional: for debugging
        }

        # Relevance scoring with cross-encoder (DeBERTa)
        # --- IMPORTANT CAVEAT ---
        # The base "microsoft/deberta-v3-large" is NOT fine-tuned for relevance.
        # Its logits will NOT represent relevance scores without fine-tuning.
        # If you have fine-tuned it, it might output 2 classes (relevant, not_relevant).
        # The code below assumes the second class (index 1) means "relevant".
        # If you haven't fine-tuned it, this relevance_score will be meaningless.
        if relevance_model and relevance_tokenizer:
            inputs = relevance_tokenizer(news_text, company_desc, return_tensors="pt", truncation=True, padding=True, max_length=512)
            inputs = {k: v.to(device) for k, v in inputs.items()} # Move inputs to device
            with torch.no_grad():
                logits = relevance_model(**inputs).logits
                # Assuming a binary classifier where output[1] is 'relevant'
                # This needs to match how your model was fine-tuned.
                # If not fine-tuned, num_labels might be different, and softmax over arbitrary logits is not relevance.
                if logits.shape[1] >= 2: # Check if there are at least two output logits
                    relevance_prob = torch.softmax(logits, dim=1)[0]
                    # Example: if label 1 is "relevant"
                    # Check model.config.id2label if available for your fine-tuned model
                    # For now, let's assume class 1 is 'relevant' if num_labels is 2
                    # If num_labels > 2, this is more complex, as it's not a simple binary relevance.
                    if relevance_model.config.num_labels == 2:
                         # Assuming index 1 is 'relevant', index 0 is 'not relevant'
                        relevance_score = relevance_prob[1].item()
                    else:
                        # Cannot reliably get a binary relevance score from a model not trained for it,
                        # or from one with more than 2 classes not clearly mapped to relevance.
                        # For demonstration, we take the max softmax probability, but this is NOT a true relevance score.
                        relevance_score = relevance_prob.max().item()  # NOT a proper relevance score
                        print(f"Warning: DeBERTa model does not seem to be a binary relevance classifier (num_labels={relevance_model.config.num_labels}). Relevance score is based on max probability and might not be meaningful.")

                else:
                    relevance_score = 0.0 # Cannot determine relevance
                    print(f"Warning: DeBERTa model output logits shape {logits.shape} not suitable for binary relevance.")

            item_result["RelevanceScore_CrossEncoder"] = round(relevance_score, 3)
            relevance_threshold = 0.6 # Your threshold
        else:
            relevance_score = 0.0 # Fallback if model not loaded
            relevance_threshold = 0.0 # Effectively disable filtering if no model
            item_result["RelevanceScore_CrossEncoder"] = "N/A"


        # Proceed if deemed relevant by cross-encoder (or if no cross-encoder used)
        if not relevance_model or relevance_score > relevance_threshold:
            # Sentiment scoring with FinBERT
            if sentiment_model and sentiment_tokenizer:
                sent_inputs = sentiment_tokenizer(news_text, return_tensors="pt", truncation=True, padding=True, max_length=512)
                sent_inputs = {k: v.to(device) for k, v in sent_inputs.items()} # Move inputs to device
                with torch.no_grad():
                    sent_logits = sentiment_model(**sent_inputs).logits
                    sentiment_probabilities = torch.softmax(sent_logits, dim=1)[0]
                    predicted_sentiment_index = torch.argmax(sentiment_probabilities).item()
                    sentiment_label = sentiment_labels_ordered[predicted_sentiment_index]
                    sentiment_confidence = sentiment_probabilities[predicted_sentiment_index].item()

                item_result["Sentiment"] = sentiment_label
                item_result["SentimentConfidence"] = round(sentiment_confidence, 3)
            else:
                item_result["Sentiment"] = "N/A"
                item_result["SentimentConfidence"] = "N/A"

            results.append(item_result)

    # Sort by cross-encoder relevance score if available, otherwise by retrieval score
    if relevance_model:
        return sorted(results, key=lambda x: -x.get("RelevanceScore_CrossEncoder", 0))
    else:
        return sorted(results, key=lambda x: -x.get("RetrievalScore", 0))


# 🔍 Test
if __name__ == "__main__":
    news_example = "OM Infra Ltd share price surges 8 percent on 129 crore water project win ."
    print(f"\n--- Finding relevant companies for news: '{news_example}' ---")
    relevant_companies = find_relevant_companies(news_example, top_k=5)
    if relevant_companies:
        for r in relevant_companies:
            print(r)
    else:
        print("No relevant companies found.")

    print("\n--- Another example ---")
    news_example_2 = "A new brand of organic coffee beans is gaining popularity among consumers for its rich flavor and sustainable sourcing."
    print(f"\n--- Finding relevant companies for news: '{news_example_2}' ---")
    relevant_companies_2 = find_relevant_companies(news_example_2, top_k=3)
    if relevant_companies_2:
        for r in relevant_companies_2:
            print(r)
    else:
        print("No relevant companies found.")

Error: company_description.csv not found. Please check the path.
Creating a dummy DataFrame for demonstration purposes.
Using device: cuda
Loading BGE model...
BGE model loaded.
Encoding company descriptions...
Company embeddings shape: torch.Size([3, 1024])
Building FAISS index...
FAISS index built.
Loading DeBERTa tokenizer and model for relevance...
Error loading DeBERTa model: Converting from Tiktoken failed, if a converter for SentencePiece is available, provide a model path with a SentencePiece tokenizer.model file.Currently available slow->fast convertors: ['AlbertTokenizer', 'BartTokenizer', 'BarthezTokenizer', 'BertTokenizer', 'BigBirdTokenizer', 'BlenderbotTokenizer', 'CamembertTokenizer', 'CLIPTokenizer', 'CodeGenTokenizer', 'ConvBertTokenizer', 'DebertaTokenizer', 'DebertaV2Tokenizer', 'DistilBertTokenizer', 'DPRReaderTokenizer', 'DPRQuestionEncoderTokenizer', 'DPRContextEncoderTokenizer', 'ElectraTokenizer', 'FNetTokenizer', 'FunnelTokenizer', 'GPT2Tokenizer', 'HerbertToke
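
The (truncated) error above complains that no SentencePiece tokenizer could be converted, and the cell's own comment asks for `sentencepiece` to be installed. A minimal sketch for checking which packages are importable before loading the models; the helper name `missing_packages` and the package list are assumptions made for illustration:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of top-level package names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# `sentencepiece` is the package the notebook comment asks for; `tiktoken`
# mirrors the earlier install cell. Adjust the list to your environment.
needed = ["sentencepiece", "tiktoken"]
missing = missing_packages(needed)
if missing:
    print("Install before loading the tokenizers:", ", ".join(missing))
else:
    print("All listed tokenizer dependencies are importable.")
```

After installing a missing package in a running notebook, a kernel restart is usually needed before the import is picked up.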

In [4]:
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import faiss

# === Load Company Data ===
try:
    df = pd.read_csv("/home/chebolu_srikanth/.keras/company_description.csv")
except FileNotFoundError:
    df = pd.DataFrame({
        'Ticker': ['AAPL', 'MSFT', 'GOOG'],
        'Description': [
            'Apple Inc. designs smartphones, computers, and accessories.',
            'Microsoft Corp. creates software, services, and devices worldwide.',
            'Alphabet Inc. offers online ads and related services globally.'
        ]
    })

tickers = df['Ticker'].tolist()
descriptions = df['Description'].tolist()

# === Setup Device and Models ===
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load BGE encoder
bge_model = SentenceTransformer("BAAI/bge-large-en-v1.5", device=device)
bge_model.encode("warmup")

# Encode descriptions
company_embeddings = bge_model.encode(descriptions, convert_to_tensor=True, normalize_embeddings=True, device=device)

# Build FAISS index
faiss_index = faiss.IndexFlatIP(company_embeddings.shape[1])
faiss_index.add(company_embeddings.cpu().numpy())

# Load DeBERTa for relevance scoring
try:
    relevance_tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
    relevance_model = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-v3-large").to(device).eval()
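except Exception as e:
    print(f"Error loading DeBERTa model: {e}")  # Report the failure instead of hiding it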
    relevance_model = None
    relevance_tokenizer = None

# Load FinBERT for sentiment analysis
try:
    sentiment_tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
    sentiment_model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert").to(device).eval()
    sentiment_labels = sentiment_model.config.id2label
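except Exception as e:
    print(f"Error loading FinBERT model: {e}")  # Report the failure instead of hiding it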
    sentiment_model = None
    sentiment_tokenizer = None
    sentiment_labels = None

# === Function to Process Articles ===
def find_relevant_companies_multiple(articles, top_k=5, relevance_threshold=0.6):
    results = []

    article_embeddings = bge_model.encode(articles, convert_to_tensor=True, normalize_embeddings=True, device=device)
    D, I = faiss_index.search(article_embeddings.cpu().numpy(), top_k)

    for i, article in enumerate(articles):
        article_result = {
            "Article": article,
            "Matches": []
        }

        for j, idx in enumerate(I[i]):
            if idx == -1:
                continue

            ticker = tickers[idx]
            desc = descriptions[idx]
            sim_score = float(D[i][j])

            result = {
                "Ticker": ticker,
                "RetrievalScore": round(sim_score, 3)
            }

            # Relevance
            if relevance_model and relevance_tokenizer:
                inputs = relevance_tokenizer(article, desc, return_tensors="pt", truncation=True, padding=True, max_length=512)
                inputs = {k: v.to(device) for k, v in inputs.items()}
                with torch.no_grad():
                    logits = relevance_model(**inputs).logits
                    prob = torch.softmax(logits, dim=1)[0]
                    relevance_score = prob[1].item() if relevance_model.config.num_labels == 2 else prob.max().item()
                    result["RelevanceScore"] = round(relevance_score, 3)
                    if relevance_score < relevance_threshold:
                        continue

            # Sentiment
            if sentiment_model and sentiment_tokenizer:
                inputs = sentiment_tokenizer(article, return_tensors="pt", truncation=True, padding=True, max_length=512)
                inputs = {k: v.to(device) for k, v in inputs.items()}
                with torch.no_grad():
                    logits = sentiment_model(**inputs).logits
                    label_id = torch.argmax(logits).item()
                    sentiment = sentiment_labels[label_id]
                    result["Sentiment"] = sentiment

            article_result["Matches"].append(result)

        results.append(article_result)

    return results

# === Example Usage ===
if __name__ == "__main__":
    articles = [
        "Vodafone Idea went bankrupt.",
        "Adani Power to supply 1,500 MW to Uttar Pradesh",
        "Alphabet’s YouTube expands into podcast streaming."
    ]

    matches = find_relevant_companies_multiple(articles, top_k=3)
    for i, res in enumerate(matches):
        print(f"\nArticle {i+1}: {res['Article']}")
        for match in res['Matches']:
            print(match)

Using device: cuda

Article 1: Vodafone Idea went bankrupt.
{'Ticker': 'IDEA.NS', 'RetrievalScore': 0.716, 'Sentiment': 'negative'}
{'Ticker': 'BHARTIARTL.NS', 'RetrievalScore': 0.553, 'Sentiment': 'negative'}
{'Ticker': 'OPTIEMUS.NS', 'RetrievalScore': 0.518, 'Sentiment': 'negative'}

Article 2: Adani Power to supply 1,500 MW to Uttar Pradesh
{'Ticker': 'ADANIPOWER.NS', 'RetrievalScore': 0.721, 'Sentiment': 'neutral'}
{'Ticker': 'ADANIGREEN.NS', 'RetrievalScore': 0.67, 'Sentiment': 'neutral'}
{'Ticker': 'RTNPOWER.NS', 'RetrievalScore': 0.625, 'Sentiment': 'neutral'}

Article 3: Alphabet’s YouTube expands into podcast streaming.
{'Ticker': 'NETWORK18.NS', 'RetrievalScore': 0.46, 'Sentiment': 'neutral'}
{'Ticker': 'JUSTDIAL.NS', 'RetrievalScore': 0.443, 'Sentiment': 'neutral'}
{'Ticker': 'IMAGICAA.NS', 'RetrievalScore': 0.441, 'Sentiment': 'neutral'}
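
The `RetrievalScore` values above are inner products between L2-normalized embeddings, which makes them cosine similarities. A pure-Python sketch of what an exhaustive inner-product search computes; the 2-D vectors are made up and the helper names are hypothetical, not part of FAISS:

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k_inner_product(query, corpus, k):
    """Score every corpus vector against the query; return the k best (score, index) pairs."""
    scores = [
        (sum(q * c for q, c in zip(query, vec)), idx)
        for idx, vec in enumerate(corpus)
    ]
    return sorted(scores, key=lambda pair: -pair[0])[:k]

# Toy stand-ins for company embeddings and one article embedding.
corpus = [l2_normalize(v) for v in [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]]
query = l2_normalize([2.0, 0.0])

for score, idx in top_k_inner_product(query, corpus, k=2):
    print(idx, round(score, 3))
```

Because every vector has unit length, the scores are bounded by 1, which is why the scores in the output can be compared across articles.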


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import faiss

# === Fetch News from MoneyControl ===
url = 'https://www.moneycontrol.com/news/business/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}

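response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()  # Fail early on an HTTP error instead of parsing an error page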
soup = BeautifulSoup(response.text, 'html.parser')

# Look for all <li> elements with class "clearfix"
articles = soup.find_all('li', class_='clearfix')

news_headlines = []
for article in articles:
    headline_tag = article.find('h2')
    if headline_tag:
        title = headline_tag.get_text(strip=True)
        link_tag = headline_tag.find('a')
        link = link_tag['href'] if link_tag else None
        news_headlines.append(title)

# === Load Company Data ===
try:
    df = pd.read_csv("/home/chebolu_srikanth/.keras/company_description.csv")
except FileNotFoundError:
    df = pd.DataFrame({
        'Ticker': ['AAPL', 'MSFT', 'GOOG'],
        'Description': [
            'Apple Inc. designs smartphones, computers, and accessories.',
            'Microsoft Corp. creates software, services, and devices worldwide.',
            'Alphabet Inc. offers online ads and related services globally.'
        ]
    })

tickers = df['Ticker'].tolist()
descriptions = df['Description'].tolist()

# === Setup Device and Models ===
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load BGE encoder
bge_model = SentenceTransformer("BAAI/bge-large-en-v1.5", device=device)
bge_model.encode("warmup")

# Encode descriptions
company_embeddings = bge_model.encode(descriptions, convert_to_tensor=True, normalize_embeddings=True, device=device)

# Build FAISS index
faiss_index = faiss.IndexFlatIP(company_embeddings.shape[1])
faiss_index.add(company_embeddings.cpu().numpy())

# Load DeBERTa for relevance scoring
try:
    relevance_tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
    relevance_model = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-v3-large").to(device).eval()
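except Exception as e:
    print(f"Error loading DeBERTa model: {e}")  # Report the failure instead of hiding it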
    relevance_model = None
    relevance_tokenizer = None

# Load FinBERT for sentiment analysis
try:
    sentiment_tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
    sentiment_model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert").to(device).eval()
    sentiment_labels = sentiment_model.config.id2label
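except Exception as e:
    print(f"Error loading FinBERT model: {e}")  # Report the failure instead of hiding it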
    sentiment_model = None
    sentiment_tokenizer = None
    sentiment_labels = None

# === Function to Process Articles ===
def find_relevant_companies_multiple(articles, top_k=5, relevance_threshold=0.6):
    results = []

    article_embeddings = bge_model.encode(articles, convert_to_tensor=True, normalize_embeddings=True, device=device)
    D, I = faiss_index.search(article_embeddings.cpu().numpy(), top_k)

    for i, article in enumerate(articles):
        article_result = {
            "Article": article,
            "Matches": []
        }

        for j, idx in enumerate(I[i]):
            if idx == -1:
                continue

            ticker = tickers[idx]
            desc = descriptions[idx]
            sim_score = float(D[i][j])

            result = {
                "Ticker": ticker,
                "RetrievalScore": round(sim_score, 3)
            }

            # Relevance
            if relevance_model and relevance_tokenizer:
                inputs = relevance_tokenizer(article, desc, return_tensors="pt", truncation=True, padding=True, max_length=512)
                inputs = {k: v.to(device) for k, v in inputs.items()}
                with torch.no_grad():
                    logits = relevance_model(**inputs).logits
                    prob = torch.softmax(logits, dim=1)[0]
                    relevance_score = prob[1].item() if relevance_model.config.num_labels == 2 else prob.max().item()
                    result["RelevanceScore"] = round(relevance_score, 3)
                    if relevance_score < relevance_threshold:
                        continue

            # Sentiment
            if sentiment_model and sentiment_tokenizer:
                inputs = sentiment_tokenizer(article, return_tensors="pt", truncation=True, padding=True, max_length=512)
                inputs = {k: v.to(device) for k, v in inputs.items()}
                with torch.no_grad():
                    logits = sentiment_model(**inputs).logits
                    label_id = torch.argmax(logits).item()
                    sentiment = sentiment_labels[label_id]
                    result["Sentiment"] = sentiment

            article_result["Matches"].append(result)

        results.append(article_result)

    return results

# === Example Usage ===
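if __name__ == "__main__":
    if not news_headlines:
        # With no headlines, encode() returns a 1-D array and faiss_index.search
        # cannot unpack its shape into (n, d); likely the cause of the ValueError.
        print("No headlines were scraped; check the request and the page structure.")
    else:
        # Pass the fetched news headlines for analysis
        matches = find_relevant_companies_multiple(news_headlines, top_k=3)
        for i, res in enumerate(matches):
            print(f"\nArticle {i+1}: {res['Article']}")
            for match in res['Matches']:
                print(match)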


Using device: cuda


ValueError: not enough values to unpack (expected 2, got 1)
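
This `ValueError` is consistent with the scraper returning no headlines, so that the search receives an array without a second dimension. A minimal sketch of a guard, assuming the scraped items are plain strings; the helper name `clean_headlines` is hypothetical:

```python
def clean_headlines(raw):
    """Strip whitespace, drop empty or non-string entries, and remove duplicates in order."""
    seen = set()
    cleaned = []
    for item in raw:
        if not isinstance(item, str):
            continue
        text = item.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

headlines = clean_headlines([
    "  Adani Power to supply 1,500 MW to Uttar Pradesh ",
    "",
    None,
    "Adani Power to supply 1,500 MW to Uttar Pradesh",
])
if headlines:
    print(f"{len(headlines)} headline(s) ready for encoding")
else:
    print("Nothing to encode; skip the FAISS search")
```

Checking the list before encoding keeps the failure close to its cause (the scrape) rather than surfacing later inside the index search.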

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import faiss

# === Scrape headlines from Moneycontrol ===
def fetch_moneycontrol_headlines():
    url = 'https://www.moneycontrol.com/news/business/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
    }

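    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # Fail early on an HTTP error instead of parsing an error page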
    soup = BeautifulSoup(response.text, 'html.parser')
    articles = soup.find_all('li', class_='clearfix')

    headlines = []
    for article in articles:
        headline_tag = article.find('h2')
        if headline_tag:
            title = headline_tag.get_text(strip=True)
            link_tag = headline_tag.find('a')
            link = link_tag['href'] if link_tag else None
            headlines.append((title, link))

    return headlines

# === Load company data ===
try:
    df = pd.read_csv("company_description.csv")
except FileNotFoundError:
    df = pd.DataFrame({
        'Ticker': ['AAPL', 'MSFT', 'GOOG'],
        'Description': [
            'Apple Inc. designs smartphones, computers, and accessories.',
            'Microsoft Corp. creates software, services, and devices worldwide.',
            'Alphabet Inc. offers online ads and related services globally.'
        ]
    })

tickers = df['Ticker'].tolist()
descriptions = df['Description'].tolist()

# === Device and models ===
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

bge_model = SentenceTransformer("BAAI/bge-large-en-v1.5", device=device)
bge_model.encode("warmup")

company_embeddings = bge_model.encode(descriptions, convert_to_tensor=True, normalize_embeddings=True, device=device)
faiss_index = faiss.IndexFlatIP(company_embeddings.shape[1])
faiss_index.add(company_embeddings.cpu().numpy())

# Load relevance model
try:
    relevance_tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
    relevance_model = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-v3-large").to(device).eval()
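except Exception as e:
    print(f"Error loading DeBERTa model: {e}")  # Report the failure instead of hiding it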
    relevance_model = None
    relevance_tokenizer = None

# Load sentiment model
try:
    sentiment_tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
    sentiment_model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert").to(device).eval()
    sentiment_labels = sentiment_model.config.id2label
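except Exception as e:
    print(f"Error loading FinBERT model: {e}")  # Report the failure instead of hiding it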
    sentiment_model = None
    sentiment_tokenizer = None
    sentiment_labels = None

# === Analyze articles ===
def analyze_articles(articles, top_k=5, relevance_threshold=0.6):
    headlines = [title for title, _ in articles]
    links = [link for _, link in articles]

    results = []

    article_embeddings = bge_model.encode(headlines, convert_to_tensor=True, normalize_embeddings=True, device=device)
    D, I = faiss_index.search(article_embeddings.cpu().numpy(), top_k)

    for i, headline in enumerate(headlines):
        link = links[i]
        for j, idx in enumerate(I[i]):
            if idx == -1:
                continue

            ticker = tickers[idx]
            desc = descriptions[idx]
            sim_score = float(D[i][j])

            # Relevance
            relevance_score = None
            if relevance_model and relevance_tokenizer:
                inputs = relevance_tokenizer(headline, desc, return_tensors="pt", truncation=True, padding=True, max_length=512)
                inputs = {k: v.to(device) for k, v in inputs.items()}
                with torch.no_grad():
                    logits = relevance_model(**inputs).logits
                    prob = torch.softmax(logits, dim=1)[0]
                    relevance_score = prob[1].item() if relevance_model.config.num_labels == 2 else prob.max().item()
                    if relevance_score < relevance_threshold:
                        continue

            # Sentiment
            sentiment = None
            if sentiment_model and sentiment_tokenizer:
                inputs = sentiment_tokenizer(headline, return_tensors="pt", truncation=True, padding=True, max_length=512)
                inputs = {k: v.to(device) for k, v in inputs.items()}
                with torch.no_grad():
                    logits = sentiment_model(**inputs).logits
                    label_id = torch.argmax(logits).item()
                    sentiment = sentiment_labels[label_id]

            results.append({
                "Headline": headline,
                "Link": link,
                "Ticker": ticker,
                "RetrievalScore": round(sim_score, 3),
                "RelevanceScore": round(relevance_score, 3) if relevance_score is not None else None,
                "Sentiment": sentiment
            })

    return results

# === Main Execution ===
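if __name__ == "__main__":
    news_articles = fetch_moneycontrol_headlines()
    if not news_articles:
        # With no headlines, encode() returns a 1-D array and faiss_index.search
        # cannot unpack its shape into (n, d); likely the cause of the ValueError.
        print("No headlines were scraped; nothing to analyze.")
    else:
        analyzed_results = analyze_articles(news_articles, top_k=5)

        df_results = pd.DataFrame(analyzed_results)
        df_results.to_csv("analyzed_news.csv", index=False)
        print("Results saved to analyzed_news.csv")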


Using device: cuda


ValueError: not enough values to unpack (expected 2, got 1)
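
Separately from the error, the relevance branch in `analyze_articles` turns logits into a score with softmax and compares it to `relevance_threshold`. As the first cell's caveat notes, the base DeBERTa checkpoint is not fine-tuned for relevance, so that score carries no meaning on its own. A pure-Python sketch of the softmax step, with made-up logits, showing that the largest probability over n classes can never fall below 1/n:

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits standing in for an untrained two-label classification head.
probs = softmax([0.3, -0.1])
print([round(p, 3) for p in probs])
print(max(probs) >= 1 / len(probs))
```

A fixed threshold such as 0.6 is therefore only informative once the classifier has been trained so that one of its labels actually means "relevant".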

In [2]:
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import faiss
import requests
from bs4 import BeautifulSoup

# === Load Company Data ===
try:
    df = pd.read_csv("/home/chebolu_srikanth/.keras/company_description.csv")
except FileNotFoundError:
    df = pd.DataFrame({
        'Ticker': ['AAPL', 'MSFT', 'GOOG'],
        'Description': [
            'Apple Inc. designs smartphones, computers, and accessories.',
            'Microsoft Corp. creates software, services, and devices worldwide.',
            'Alphabet Inc. offers online ads and related services globally.'
        ]
    })

tickers = df['Ticker'].tolist()
descriptions = df['Description'].tolist()

# === Setup Device and Models ===
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

bge_model = SentenceTransformer("BAAI/bge-large-en-v1.5", device=device)
bge_model.encode("warmup")

company_embeddings = bge_model.encode(descriptions, convert_to_tensor=True, normalize_embeddings=True, device=device)
faiss_index = faiss.IndexFlatIP(company_embeddings.shape[1])
faiss_index.add(company_embeddings.cpu().numpy())

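# NOTE: "microsoft/deberta-v3-large" is a base checkpoint without a fine-tuned
# classification head, so the head loaded here is randomly initialized and the
# resulting RelevanceScore is not meaningful until the model is fine-tuned or
# replaced with a checkpoint trained for relevance classification.
try:
    relevance_tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
    relevance_model = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-v3-large").to(device).eval()
except Exception as exc:  # a bare except would also swallow KeyboardInterrupt
    print(f"Relevance model unavailable, skipping relevance scoring: {exc}")
    relevance_model = None
    relevance_tokenizer = None

try:
    sentiment_tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
    sentiment_model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert").to(device).eval()
    sentiment_labels = sentiment_model.config.id2label
except Exception as exc:
    print(f"Sentiment model unavailable, skipping sentiment analysis: {exc}")
    sentiment_model = None
    sentiment_tokenizer = None
    sentiment_labels = None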

# === Scrape News from Moneycontrol ===
def scrape_moneycontrol_articles():
    url = "https://www.moneycontrol.com/news/business/"
    response = requests.get(url, timeout=30)  # fail instead of hanging if the site does not respond
    response.raise_for_status()

    soup = BeautifulSoup(response.content, "html.parser")
    articles = soup.find_all("li", class_="clearfix")

    scraped = []
    for article in articles:
        headline_tag = article.find("h2")
        headline = headline_tag.get_text(strip=True) if headline_tag else None

        link_tag = article.find("a", href=True)
        link = link_tag['href'] if link_tag else None

        summary_tag = article.find("p")
        summary = summary_tag.get_text(strip=True) if summary_tag else None

        if headline:
            combined_text = f"{headline}. {summary}" if summary else headline
            scraped.append({"headline": headline, "summary": summary, "link": link, "text": combined_text})

    return scraped

# === Match Companies ===
def find_relevant_companies_multiple(articles, top_k=5, relevance_threshold=0.6):
    results = []
    texts = [a["text"] for a in articles]
    article_embeddings = bge_model.encode(texts, convert_to_tensor=True, normalize_embeddings=True, device=device)
    D, I = faiss_index.search(article_embeddings.cpu().numpy(), top_k)

    for i, article in enumerate(articles):
        article_result = {
            "Headline": article["headline"],
            "Summary": article["summary"],
            "Link": article["link"],
            "Matches": []
        }

        # Sentiment depends only on the article text, so compute it once per
        # article instead of once per candidate company.
        sentiment = None
        if sentiment_model is not None and sentiment_tokenizer is not None:
            inputs = sentiment_tokenizer(article["text"], return_tensors="pt", truncation=True, padding=True, max_length=512)
            inputs = {k: v.to(device) for k, v in inputs.items()}
            with torch.no_grad():
                logits = sentiment_model(**inputs).logits
            sentiment = sentiment_labels[torch.argmax(logits, dim=1).item()]

        for j, idx in enumerate(I[i]):
            # FAISS pads with -1 when fewer than top_k neighbours exist
            if idx == -1:
                continue

            ticker = tickers[idx]
            desc = descriptions[idx]
            sim_score = float(D[i][j])

            result = {
                "Ticker": ticker,
                "RetrievalScore": round(sim_score, 3)
            }

            if relevance_model is not None and relevance_tokenizer is not None:
                # Score the (article, company description) pair with the cross-encoder
                inputs = relevance_tokenizer(article["text"], desc, return_tensors="pt", truncation=True, padding=True, max_length=512)
                inputs = {k: v.to(device) for k, v in inputs.items()}
                with torch.no_grad():
                    logits = relevance_model(**inputs).logits
                prob = torch.softmax(logits, dim=1)[0]
                relevance_score = prob[1].item() if relevance_model.config.num_labels == 2 else prob.max().item()
                result["RelevanceScore"] = round(relevance_score, 3)
                # Drop candidates the relevance model scores below the threshold
                if relevance_score < relevance_threshold:
                    continue

            if sentiment is not None:
                result["Sentiment"] = sentiment

            article_result["Matches"].append(result)
        results.append(article_result)

    return results

# === Run and Save to CSV ===
if __name__ == "__main__":
    scraped_articles = scrape_moneycontrol_articles()
    matches = find_relevant_companies_multiple(scraped_articles, top_k=3)

    # Flatten and convert to DataFrame
    output_data = []
    for article in matches:
        if not article["Matches"]:
            output_data.append({
                "Headline": article["Headline"],
                "Summary": article["Summary"],
                "Link": article["Link"],
                "Ticker": None,
                "RetrievalScore": None,
                "RelevanceScore": None,
                "Sentiment": None
            })
        else:
            for match in article["Matches"]:
                output_data.append({
                    "Headline": article["Headline"],
                    "Summary": article["Summary"],
                    "Link": article["Link"],
                    "Ticker": match.get("Ticker"),
                    "RetrievalScore": match.get("RetrievalScore"),
                    "RelevanceScore": match.get("RelevanceScore"),
                    "Sentiment": match.get("Sentiment")
                })

    df_output = pd.DataFrame(output_data)
    df_output.to_csv("moneycontrol_article_matches.csv", index=False)
    print("Results saved to moneycontrol_article_matches.csv")

Using device: cuda
Results saved to moneycontrol_article_matches.csv
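
The cell above encodes both company descriptions and articles with `normalize_embeddings=True` and searches them with `faiss.IndexFlatIP`, so the inner-product values stored as `RetrievalScore` are cosine similarities. The dependency-free sketch below illustrates that idea for intuition only; `normalize` and `top_k_inner_product` are hypothetical helper names, and this is not how FAISS is implemented internally.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_inner_product(query, corpus, k):
    """Return the k best (score, index) pairs by inner product.

    Both sides are normalized first, so each score is a cosine similarity.
    """
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(doc))), i)
        for i, doc in enumerate(corpus)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy "company" vectors
    for score, idx in top_k_inner_product([2.0, 0.0], corpus, k=2):
        print(idx, round(score, 3))
```

Because the scores are cosine similarities, they are bounded by 1 and comparable across articles, which is what makes a fixed cut-off on `RetrievalScore` workable.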


In [2]:
import requests
from bs4 import BeautifulSoup

# URL of Moneycontrol's Business News section
url = "https://www.moneycontrol.com/news/business/"

# Send a GET request to the URL
response = requests.get(url, timeout=30)  # timeout prevents the cell from hanging indefinitely
response.raise_for_status()  # Raise an exception for HTTP errors

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Find all news articles
articles = soup.find_all("li", class_="clearfix")

# Iterate through the articles and extract information
for article in articles:
    # Extract the headline
    headline_tag = article.find("h2")
    headline = headline_tag.get_text(strip=True) if headline_tag else "No headline"

    # Extract the link
    link_tag = article.find("a", href=True)
    link = link_tag['href'] if link_tag else "No link"

    # Extract the summary
    summary_tag = article.find("p")
    summary = summary_tag.get_text(strip=True) if summary_tag else "No summary"

    print(f"Headline: {headline}")
    print(f"Link: {link}")
    print(f"Summary: {summary}")
    print("-" * 80)


Headline: Fact check: Government flags fake X accounts impersonating Vyomika Singh, Sofia Qureshi
Link: https://www.moneycontrol.com/news/business/fact-check-government-flags-fake-x-accounts-impersonating-vyomika-singh-sofia-qureshi-13019951.html
Summary: The Press Information Bureau’s (PIB) fact-checking unit has clarified that neither Wing Commander Vyomika Singh nor Colonel Sofia Qureshi maintains an official presence on X.
--------------------------------------------------------------------------------
Headline: SP Group launches internal inquiry on arrest of executive in bribery case
Link: https://www.moneycontrol.com/news/business/sp-group-launches-internal-inquiry-on-arrest-of-executive-in-bribery-case-13019943.html
Summary: The CBI arrested Jeevan Lal Lavidiya, Commissioner of Income Tax (Exemption), Hyderabad, for allegedly accepting a bribe of ₹70 lakh to favour the Shapoorji Pallonji Group.
--------------------------------------------------------------------------------
Head