In [63]:
# Quote the extras and the version specifier so zsh does not interpret [torch] or >= as shell syntax
!pip install adaptkeybert datasets 'transformers[torch]' sentence-transformers pandas keybert scikit-learn 'accelerate>=0.26.0'

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [42]:
import pandas as pd
df_sentences = pd.read_csv("preprocessed_sentences.csv")

In [43]:
df_sentences

Unnamed: 0.1,Unnamed: 0,document_id,paragraph_id,sentence_id,sentence
0,0,1,1,1,BUY INVESTMENT HIGHLIGHTS: $150.00 We raise ou...
1,1,1,1,2,We now expect 3Q16 revenue of $6.855B up 52% y...
2,2,1,1,3,"FB will report 3Q16 earnings on Wednesday, Nov..."
3,3,1,1,4,"The call in number is 866 554-3009, ID ."
4,4,1,1,6,Advertising We maintain our Buy rating and $15...
...,...,...,...,...,...
40099,40099,927,4,1,A looming threat for all web publishers relate...
40100,40100,927,4,2,established by governments which limit how pub...
40101,40101,927,4,3,Rules could be established in the future which...
40102,40102,927,5,1,Google has become so OTHER GOVERNMENT REGULATI...


In [64]:
# Version 3.0 (based on the unsupervised list from LLaMA)

# To revise, sorted by priority:
# - Product
# - Technology
# - Leadership
# - Customer
# - Research & Development (more updates)

keywords_dict = {
    "Financial Performance": {
        "EPS": ["performance indicators","Earnings per Share expectations","price-to-earnings ratio"],
        "Cash Flow":["free cash flow","liquidity","working capital","cash reserves","FCF"],
        "Revenue": ["operational revenue", "business income", "operating profit","increased income","sales growth rate","earnings trajectory"],
        "Return On Equity": ["capital efficiency", "investment return", "capital ROI","ROE"],
        "Margins": ["EBITDA margins", "operating margin", "profit margin", "cost-to-revenue", "gross profit","profitability changes"],
        "Cost Management": ["expense control", "cost reduction", "efficiency savings","expense trimming", "cost containment","cost optimization","cost-effectiveness"],
        "Dividend Policy": ["dividend payout", "shareholder returns", "yield", "dividend sustainability", "payout ratio"],
        "Investments": ["strategic investment", "capital deployment", "fundraising", "fund allocation"],
        "Balance Sheet": ["assets", "liabilities", "equity", "debt management", "financial health","book value","debt-to-equity ratio","equity issuance","capital structure"],
    },
    "Company": {
        "Long-term Growth": ["sustainable growth", "long-term trajectory", "future growth","scalability", "large-scale expansion","unit expansion","unit volume"],
        "Mergers & Acquisition": ["business acquisition", "M&A activity", "buyout approach","merger"],
        "Refranchising": ["franchise model", "franchise transitions", "refranchising plans"],
        "Sustainability": ["green initiatives", "environmental impact", "sustainable practices","ESG","SRI","sustainability","green energy adoption","social responsibility"],
        "Employees": ["workforce optimization","talent management","upskilling","remote work","employee management","employee benefits","diversity and inclusion","salary levels"],
        "Research & Development":["R&D spending","patent activity","technological breakthrough","future of work","continuous development","research projects"],
        "Marketing":["brand awareness","CI","corporate identity","performance marketing","brand loyalty","word-of-mouth","brand value","consumer perception"],
        "Shares Repurchase": ["buyback", "repurchase programs", "shareholder value", "equity reduction", "stock repurchases","stock rebuy"],
        "Processes":["process improvements","streamlined processes","productivity improvements","operational efficiency"],
        "Leadership":["effective leadership", "executive strength", "executive resilience","management trust","crisis management","risk assessment","contingency planning"],
    },
    "Product": {
        "Innovation": ["new features", "innovative products", "product advancements","new product","product launch","disruptive technology","portfolio diversification"],
        "Product Characteristics":["USP","unique selling point","product quality", "product differentiation","product portfolio"],
        "Pricing Strategy":["price segmentation","price optimization", "dynamic pricing", "competitive pricing","pricing models", "price elasticity", "discount strategies"],
        "Production": ["production capacities","manufacturing delays", "supply issues","production stops","factory problems","logistics bottlenecks","material shortage"],
        "Technology Trends":["autonomous systems","IoT (Internet of Things)", "machine learning","deep learning","natural language processing","AI","robotics","digital transformation","cloud computing","blockchain"]
    },
    "Market": {
        "Market Share": ["market share", "industry share", "market proportion","market penetration"],
        "Market Expansion": ["new markets", "geographical reach", "market entry","worldwide expansion"],
        "Competitors":["market rivalry", "competitive threats", "industry competition", "competitive advantage"],
        "Global Presence": ["international footprint", "global operations", "worldwide coverage"],
        "Industry Outlook":["sector growth","market trends","market evolution","industry trends"],
		"Regulations":["tax regulations","regulatory risks","governmental influence","government incentives", "state funding", "subsidies","political influence","legal disputes"],
        "Partnerships and Collaborations":["strategic alliances","partner relationships","joint venture"],
        "Supply Chain":["logistics optimization", "supply logistics", "supply chain strategies","supply constraints","inventory challenges","distribution channels", "supplier relationships","procurement"],
        "Economic Conditions": ["economic environment", "market economy", "macroeconomic factors","recession","expansion","inflationary impact","interest rate environment","foreign exchange impact"],
        "Demand":["increasing demand","decreasing demand","demand forecasting","consumer visits","store traffic"],
        "Customer": ["user interaction", "customer retention","customer loyalty","frequent buyer","user satisfaction","customer lifetime value (CLV)","per-visit spending","churn rate"]
    }
}
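
A nested keyword dictionary of this size is easy to get wrong by hand, for example by listing the same keyword under two arguments. The following is a minimal sketch of a sanity check, run on a hypothetical toy excerpt (`toy_keywords`, including the deliberately duplicated "liquidity") rather than on the real `keywords_dict`; the helper name `summarize` is an assumption for illustration.

```python
from collections import defaultdict

# Toy excerpt with the same nesting as keywords_dict: category -> argument -> keywords.
toy_keywords = {
    "Financial Performance": {
        "Cash Flow": ["free cash flow", "liquidity", "FCF"],
        "Revenue": ["operational revenue", "sales growth rate"],
    },
    "Market": {
        "Market Share": ["market share", "market penetration"],
        "Supply Chain": ["procurement", "liquidity"],  # "liquidity" duplicated on purpose
    },
}


def summarize(nested):
    """Count keywords per argument and list keywords reused across arguments."""
    counts = {}
    owners = defaultdict(list)
    for category, arguments in nested.items():
        for argument, keywords in arguments.items():
            counts[argument] = len(keywords)
            for keyword in keywords:
                # Compare case-insensitively so "FCF" and "fcf" count as the same keyword
                owners[keyword.lower()].append(argument)
    shared = {kw: args for kw, args in owners.items() if len(args) > 1}
    return counts, shared


counts, shared = summarize(toy_keywords)
print(counts)
print(shared)
```

Calling `summarize(keywords_dict)` on the full dictionary should work the same way, since it only relies on the category/argument/keyword nesting.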

In [None]:
import torch
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.util import batch_to_device
from tqdm import tqdm

# Detect device (MPS for macOS, fallback to CPU if unavailable)
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
print(f"Using device: {device}")

# Prepare training data
train_examples = [
    InputExample(texts=["The company's revenue increased by 10%.", "Revenue growth is evident."], label=1.0),
    InputExample(texts=["The company announced a cost-cutting strategy.", "The company is hiring more employees."], label=0.0),
    InputExample(texts=["The company announced a cost-cutting strategy.", "The company is hiring more employees."], label=0.0),
    InputExample(texts=["The company announced a cost-cutting strategy.", "The company is hiring more employees."], label=0.0),

]

# Custom collate function to handle InputExample objects
def collate_fn(batch):
    texts1 = [example.texts[0] for example in batch]
    texts2 = [example.texts[1] for example in batch]
    labels = torch.tensor([example.label for example in batch], dtype=torch.float)
    return texts1, texts2, labels

# DataLoader for batching
train_dataloader = DataLoader(
    train_examples,
    shuffle=True,
    batch_size=8,
    collate_fn=collate_fn
)

# Load SentenceTransformer model
model = SentenceTransformer('all-MiniLM-L6-v2')
model = model.to(device)  # Move model to the detected device

# Define optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Training loop
num_epochs = 3
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}")
    model.train()
    epoch_loss = 0

    for step, (texts1, texts2, labels) in enumerate(tqdm(train_dataloader)):
        # Tokenize and move inputs to the correct device
        inputs1 = model.tokenize(texts1)
        inputs2 = model.tokenize(texts2)
        inputs1 = batch_to_device(inputs1, device)  # keep the returned batch explicitly
        inputs2 = batch_to_device(inputs2, device)
        labels = labels.to(device)

        # Forward pass to generate embeddings
        embeddings1 = model(inputs1)["sentence_embedding"]
        embeddings2 = model(inputs2)["sentence_embedding"]

        # Compute cosine similarity and loss
        cosine_scores = torch.nn.functional.cosine_similarity(embeddings1, embeddings2)
        loss = torch.nn.MSELoss()(cosine_scores, labels)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    print(f"Epoch {epoch + 1} Loss: {epoch_loss / len(train_dataloader)}")

# Save the fine-tuned model
model.save("fine_tuned_model")
print("Model fine-tuned and saved!")


Using device: mps
Epoch 1


100%|██████████| 1/1 [00:08<00:00,  8.30s/it]


Epoch 1 Loss: 0.13690048456192017
Epoch 2


100%|██████████| 1/1 [00:00<00:00,  3.66it/s]


Epoch 2 Loss: 0.11229391396045685
Epoch 3


100%|██████████| 1/1 [00:00<00:00,  8.14it/s]


Epoch 3 Loss: 0.07724197953939438
Model fine-tuned and saved!
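
The loop above minimizes the mean squared error between the cosine similarity of each sentence pair and its label. As a rough illustration of that objective only, here is a dependency-free sketch using hypothetical 2-D vectors in place of real sentence embeddings; `cosine` and `cosine_mse` are names introduced for this example, not part of any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse(pairs, labels):
    """Mean squared error between pairwise cosine similarity and target labels."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Hypothetical "embeddings": one identical pair (label 1.0), one orthogonal pair (label 0.0).
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [1.0, 0.0]
print(cosine_mse(pairs, labels))  # similarities already match the labels, so the loss is 0.0
```

With only four training pairs, three of them identical, a falling loss mostly shows that the loop runs; it says little about how well the model generalizes.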


In [77]:
from adaptkeybert import KeyBERT  # import before first use so this cell runs on a fresh kernel

kw_model = KeyBERT("fine_tuned_model")

In [78]:
from adaptkeybert import KeyBERT
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from transformers import pipeline


# Training

In [79]:
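# Take an independent copy of the first 30 rows so later column assignments do not target a slice
df_sentences = df_sentences.iloc[:30].copy()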

In [80]:
df_sentences

Unnamed: 0.1,Unnamed: 0,document_id,paragraph_id,sentence_id,sentence,arguments,sentiment
0,0,1,1,1,BUY INVESTMENT HIGHLIGHTS: $150.00 We raise ou...,"[Return On Equity, Investments]",positive
1,1,1,1,2,We now expect 3Q16 revenue of $6.855B up 52% y...,"[Revenue, Margins]",positive
2,2,1,1,3,"FB will report 3Q16 earnings on Wednesday, Nov...","[EPS, Revenue]",neutral
3,3,1,1,4,"The call in number is 866 554-3009, ID .",[],neutral
4,4,1,1,6,Advertising We maintain our Buy rating and $15...,[],positive
5,5,1,2,1,revenue should reach approximately $6.663B up ...,"[Revenue, Margins]",positive
6,6,1,3,1,We expect mobile ad revenue to represent appro...,"[Revenue, Margins]",neutral
7,7,1,3,2,yy and no change from previous estimates.,[],neutral
8,8,1,4,1,We raise our estimates for FY16 as we expect a...,[Long-term Growth],positive
9,9,1,4,2,"As a result, we now expect revenue of $27.04B ...","[Revenue, Margins]",positive


# Testing

In [84]:
from transformers import pipeline, BertForSequenceClassification, BertTokenizer
from sentence_transformers import SentenceTransformer
from keybert import KeyBERT
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Initialize KeyBERT and SentenceTransformer
kw_model = KeyBERT()
sentence_transformer = SentenceTransformer('all-MiniLM-L6-v2')

# Initialize FinancialBERT for sentiment analysis
sentiment_model = BertForSequenceClassification.from_pretrained("ahmedrachid/FinancialBERT-Sentiment-Analysis", num_labels=3)
tokenizer = BertTokenizer.from_pretrained("ahmedrachid/FinancialBERT-Sentiment-Analysis")
financial_bert_sentiment = pipeline("sentiment-analysis", model=sentiment_model, tokenizer=tokenizer)

# Flatten the keywords into a single list for zero-shot KeyBERT
seed_words = [
    keyword
    for category, arguments in keywords_dict.items()
    for argument, keywords in arguments.items()
    for keyword in keywords
]

# Precompute embeddings for each argument's keywords
# (the loop variable is named argument_keywords to avoid confusion with the flat seed_words list)
precomputed_seed_embeddings = {
    argument: sentence_transformer.encode(argument_keywords)
    for category, arguments in keywords_dict.items()
    for argument, argument_keywords in arguments.items()
}

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
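
The cell above derives two views from the same nested dictionary: a flat list of every keyword and a per-argument lookup. The sketch below reproduces both shapes on a hypothetical toy dictionary (`toy_keywords`), storing keyword counts instead of embeddings so it runs without any model. It also reuses the name `seed_words` inside the comprehension to show that, in Python 3, comprehension loop variables are local and leave the outer list untouched.

```python
# Toy stand-in for keywords_dict: category -> argument -> keywords.
toy_keywords = {
    "Company": {
        "Marketing": ["brand awareness", "brand loyalty"],
        "Processes": ["operational efficiency"],
    },
    "Market": {
        "Demand": ["store traffic", "demand forecasting"],
    },
}

# Flat list of every keyword, in dictionary order.
seed_words = [
    keyword
    for arguments in toy_keywords.values()
    for keywords in arguments.values()
    for keyword in keywords
]

# Per-argument lookup; the loop variable reuses the name `seed_words` on purpose.
per_argument = {
    argument: len(seed_words)
    for arguments in toy_keywords.values()
    for argument, seed_words in arguments.items()
}

print(seed_words)    # the outer flat list is unchanged by the comprehension
print(per_argument)
```

Even though the shadowing is harmless, distinct names make the two structures easier to tell apart when reading the code.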


In [85]:
# Define the classification and analysis function
def classify_and_analyze(sentence, kw_model, keywords_dict, precomputed_seed_embeddings, seed_words, threshold=0.7):
    # Note: relies on the module-level sentence_transformer and financial_bert_sentiment
    # Extract keywords with KeyBERT, guided by the flattened seed list
    keywords = kw_model.extract_keywords(
        sentence, top_n=5, seed_keywords=seed_words
    )
    extracted_keywords = [kw[0] for kw in keywords]

    # Collect every argument whose seed keywords are close to an extracted keyword
    categories = []
    if extracted_keywords:
        # Encode the extracted keywords once rather than once per argument
        keyword_embeddings = sentence_transformer.encode(extracted_keywords)

        for category, arguments in keywords_dict.items():
            for argument in arguments:
                # Use precomputed embeddings for the argument
                seed_embeddings = precomputed_seed_embeddings[argument]
                if seed_embeddings.shape[0] == 0:
                    continue  # Skip arguments without seed keywords

                if keyword_embeddings.shape[1] != seed_embeddings.shape[1]:
                    print(f"Dimension mismatch: {keyword_embeddings.shape} vs {seed_embeddings.shape}")
                    continue  # Skip incompatible embeddings

                # Keep the argument if any keyword/seed pair reaches the threshold
                similarities = cosine_similarity(keyword_embeddings, seed_embeddings)
                if similarities.max() >= threshold:
                    categories.append(argument)

    # Use FinancialBERT to classify sentiment
    sentiment_result = financial_bert_sentiment(sentence)[0]
    sentiment = sentiment_result['label']

    return categories, sentiment

# Prepare the DataFrame for processing; assign() returns a new frame, which avoids
# pandas' SettingWithCopyWarning when df_sentences is a slice of a larger frame
df_sentences = df_sentences.assign(arguments=None, sentiment=None)

# Apply the function to classify sentences and analyze sentiment
for idx, row in df_sentences.iterrows():
    categories, sentiment = classify_and_analyze(
        row['sentence'], kw_model, keywords_dict, precomputed_seed_embeddings, seed_words
    )
    df_sentences.at[idx, 'arguments'] = categories
    df_sentences.at[idx, 'sentiment'] = sentiment

# Optional: test the function on individual sentences
# for sentence in df_sentences['sentence']:
#     categories, sentiment = classify_and_analyze(sentence, kw_model, keywords_dict, precomputed_seed_embeddings, seed_words)
#     print(f"Sentence: {sentence}\nCategories: {categories}, Sentiment: {sentiment}\n")

In [86]:
df_sentences

Unnamed: 0.1,Unnamed: 0,document_id,paragraph_id,sentence_id,sentence,arguments,sentiment
0,0,1,1,1,BUY INVESTMENT HIGHLIGHTS: $150.00 We raise ou...,"[Return On Equity, Investments]",positive
1,1,1,1,2,We now expect 3Q16 revenue of $6.855B up 52% y...,"[Revenue, Margins]",positive
2,2,1,1,3,"FB will report 3Q16 earnings on Wednesday, Nov...","[EPS, Revenue]",neutral
3,3,1,1,4,"The call in number is 866 554-3009, ID .",[],neutral
4,4,1,1,6,Advertising We maintain our Buy rating and $15...,[],positive
5,5,1,2,1,revenue should reach approximately $6.663B up ...,"[Revenue, Margins]",positive
6,6,1,3,1,We expect mobile ad revenue to represent appro...,"[Revenue, Margins]",neutral
7,7,1,3,2,yy and no change from previous estimates.,[],neutral
8,8,1,4,1,We raise our estimates for FY16 as we expect a...,[Long-term Growth],positive
9,9,1,4,2,"As a result, we now expect revenue of $27.04B ...","[Revenue, Margins]",positive
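
The `arguments` column comes from a max-similarity rule: an argument is assigned when any extracted keyword is close enough to any of its seed keywords. Below is a minimal, model-free sketch of that rule, assuming hypothetical 2-D vectors in place of sentence-transformer embeddings; `cosine` and `match_arguments` are illustrative names, not the notebook's functions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def match_arguments(keyword_vectors, seed_vectors_by_argument, threshold=0.7):
    """Return arguments whose best keyword-to-seed similarity reaches the threshold."""
    matched = []
    for argument, seed_vectors in seed_vectors_by_argument.items():
        best = max(
            (cosine(k, s) for k in keyword_vectors for s in seed_vectors),
            default=0.0,  # no keywords or no seeds: nothing can match
        )
        if best >= threshold:
            matched.append(argument)
    return matched


# Hypothetical vectors: the single keyword points almost the same way as one "Revenue" seed.
keyword_vectors = [[1.0, 0.1]]
seed_vectors_by_argument = {
    "Revenue": [[0.9, 0.2], [0.0, 1.0]],
    "Employees": [[-1.0, 0.3]],
}
print(match_arguments(keyword_vectors, seed_vectors_by_argument))
```

Because the rule takes a maximum over all pairs, a single close keyword is enough to assign an argument. That may explain why several arguments can be attached to one sentence, and the threshold is the main knob for trading precision against coverage.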


In [55]:
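# index=False keeps the row index out of the file, so re-reading it does not add another "Unnamed: 0" column
df_sentences.to_csv("argument_sentiment_testing.csv", index=False)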
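
Once arguments and sentiment are stored per sentence, a quick tally shows which argument/sentiment combinations dominate and how many sentences received no argument. The sketch below is one way to do this with the standard library; the `rows` list is copied by hand from the ten rows displayed above and is not read from the CSV.

```python
from collections import Counter

# (arguments, sentiment) pairs copied from the ten displayed rows.
rows = [
    (["Return On Equity", "Investments"], "positive"),
    (["Revenue", "Margins"], "positive"),
    (["EPS", "Revenue"], "neutral"),
    ([], "neutral"),
    ([], "positive"),
    (["Revenue", "Margins"], "positive"),
    (["Revenue", "Margins"], "neutral"),
    ([], "neutral"),
    (["Long-term Growth"], "positive"),
    (["Revenue", "Margins"], "positive"),
]

# Count each (argument, sentiment) combination across all sentences.
pair_counts = Counter(
    (argument, sentiment) for arguments, sentiment in rows for argument in arguments
)

# Sentences for which no argument passed the similarity threshold.
unlabeled = sum(1 for arguments, _ in rows if not arguments)

for (argument, sentiment), count in pair_counts.most_common():
    print(f"{argument:20s} {sentiment:10s} {count}")
print(f"Sentences without any argument: {unlabeled}")
```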

In [None]:
"""# Initialize AdaptKeyBERT and SentenceTransformer
kw_model = KeyBERT()
sentence_transformer = SentenceTransformer('all-MiniLM-L6-v2')

precomputed_seed_embeddings = {
    argument: sentence_transformer.encode(seed_words)
    for category, arguments in keywords_dict.items()
    for argument, seed_words in arguments.items()
}

# Flattening the keywords into a single list
seed_words = [
    keyword
    for category, arguments in keywords_dict.items()
    for argument, keywords in arguments.items()
    for keyword in keywords
]

def classify_sentence(sentence, kw_model, keywords_dict, precomputed_seed_embeddings, seed_words, threshold=0.65):
    # Extract keywords from the sentence using KeyBERT (zero-shot training)
    keywords = kw_model.extract_keywords(
        sentence, top_n=5, seed_keywords=seed_words, nr_candidates=50
    )
    extracted_keywords = [kw[0] for kw in keywords]

    results = {}
    for category, arguments in keywords_dict.items():
        argument_scores = {}
        for argument, seed_words in arguments.items():
            # Use precomputed embeddings for the argument
            seed_embeddings = precomputed_seed_embeddings[argument]
            # Encode the extracted keywords
            keyword_embeddings = sentence_transformer.encode(extracted_keywords)
            # Compute cosine similarities
            similarities = cosine_similarity(keyword_embeddings, seed_embeddings)
            max_similarity = similarities.max() if similarities.size > 0 else 0
            if max_similarity >= threshold:
                argument_scores[argument] = max_similarity
        
        if argument_scores:
            results[category] = argument_scores
    
    return results if results else {"Uncategorized": "No relevant category found"}

# Global list of seed_words for KeyBERT zero-shot training
seed_words = [
    keyword
    for category, arguments in keywords_dict.items()
    for argument, keywords in arguments.items()
    for keyword in keywords
]

# Test the function on sentences
for sentence in df_sentences['sentence']:
    categories = classify_sentence(sentence, kw_model, keywords_dict, precomputed_seed_embeddings, seed_words)
    print(f"Sentence: {sentence}\nCategories: {categories}\n")
"""

Sentence: Q4 was solid, highlighted by 32 billings growth yy, 80.00 which suggests the health of fundamentals and secular trends.
Categories: {'Company': {'Long-term Growth': 0.7785666}, 'Market': {'Industry Outlook': 0.7731363}}

Sentence: One of the highlights on the call was the early success of recently introduced Wave Analytics, including wins at marquee logos such as Time Warner Cable, Merck, and others, which suggests the credibility of the product cyclegrowth engine against what looks like a massive multi billion dollar market opportunity.
Categories: {'Uncategorized': 'No relevant category found'}

Sentence: Another standout on the call was the number of SaaSApplication Software 8figure deals, which likely reflects the companys industry vertical positioning, the strategic nature of relationships and comprehensive platform strategy.
Categories: {'Financial Performance': {'Investments': 0.7786581}, 'Market': {'Market Share': 0.66499156, 'Competitors': 0.764099, 'Industry Outlook