# Financial Tweet Sentiment Labeling with Gemini (Stock-Focused)

This notebook labels the sentiment of financial tweets that mention verified stock symbols, using Google's Gemini API:
1. Load the preprocessed CSV file with NER and stock symbol information
2. Label tweets that have verified stock symbols with Gemini
3. Save the labeled data, a filtered training set, and a per-symbol breakdown

Sentiment Labels:
- STRONGLY_POSITIVE
- POSITIVE
- NEUTRAL
- NEGATIVE
- STRONGLY_NEGATIVE
- NOT_RELATED
- UNCERTAIN
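
Because the model replies in free text, a reply such as `positive.` or `Strongly Positive` would not match the labels above exactly. A minimal sketch of a normalizer follows; `normalize_label` and `VALID_LABELS` are hypothetical names introduced here for illustration and are not defined elsewhere in this notebook.

```python
import re

# The seven labels listed above.
VALID_LABELS = (
    "STRONGLY_POSITIVE", "POSITIVE", "NEUTRAL", "NEGATIVE",
    "STRONGLY_NEGATIVE", "NOT_RELATED", "UNCERTAIN",
)

def normalize_label(raw):
    """Map a raw model reply to one of VALID_LABELS, or return None.

    Hypothetical helper: trims whitespace and surrounding punctuation,
    upper-cases, and turns runs of spaces or hyphens into underscores.
    """
    if not isinstance(raw, str):
        return None
    trimmed = raw.strip().strip(".,:;!\"'`*").strip().upper()
    cleaned = re.sub(r"[\s\-]+", "_", trimmed)
    return cleaned if cleaned in VALID_LABELS else None
```

A caller could treat a `None` result as a reason to retry, while accepting replies that differ from a label only in case, spacing, or trailing punctuation.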

In [None]:
import os
import pandas as pd
import ast
from glob import glob
import google.generativeai as genai
from tqdm import tqdm
import time

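# Configure Gemini API (the key is read from the GOOGLE_API_KEY environment variable)
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')
if not GOOGLE_API_KEY:
    raise RuntimeError("GOOGLE_API_KEY is not set; export it before running this notebook")
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel('gemini-pro')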

## 1. Load Preprocessed Data

Load the data with NER results and verified stock symbols.
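
When a DataFrame with list-valued columns is written to CSV, each list is stored as its string representation (for example `"['AAPL', 'MSFT']"`), so it has to be parsed back after loading. The sketch below shows that parsing step on plain strings; `parse_list_cell` is an illustrative name, not a function from this project.

```python
import ast

def parse_list_cell(cell):
    """Parse a CSV cell holding a Python list literal; return None otherwise."""
    if not isinstance(cell, str):
        return None  # missing values arrive as NaN (a float), not as a string
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return None  # malformed or non-literal text
    # Reject valid literals that are not lists, such as a bare quoted string.
    return value if isinstance(value, list) else None
```

`ast.literal_eval` only evaluates Python literals, so unlike `eval` it will not execute arbitrary code found in a cell.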

In [None]:
# Load the data with verified stock symbols
df = pd.read_csv('../data/tweets_with_verified_stocks.csv')
print(f"Loaded {len(df)} tweets with verified stock symbols")

# Convert string representations of lists to actual lists
def convert_str_to_list(str_list):
    if pd.isna(str_list):
        return None
    try:
        return ast.literal_eval(str_list)
    except (ValueError, SyntaxError):
        # Malformed cell: treat it as missing instead of aborting the load
        return None

# Apply conversion to list columns
list_columns = ['entity_types', 'entity_values', 'potential_symbols', 'verified_stock_symbols']
for col in list_columns:
    if col in df.columns:
        df[col] = df[col].apply(convert_str_to_list)

# Display a sample of the data
df[['cleaned_text', 'entity_types', 'verified_stock_symbols']].head()

## 2. Configure Sentiment Labeling

In [None]:
def setup_prompt():
    """Configure the system prompt for Gemini with stock symbol context"""
    return """
    You are a financial sentiment analyzer. Classify the given tweet's sentiment into one of these categories, specifically in the context of the mentioned stock symbol(s):

    STRONGLY_POSITIVE - Very bullish, highly confident optimistic outlook for the stock(s)
    POSITIVE - Generally optimistic, bullish view of the stock(s)
    NEUTRAL - Factual, balanced, or no clear sentiment about the stock(s)
    NEGATIVE - Generally pessimistic, bearish view of the stock(s)
    STRONGLY_NEGATIVE - Very bearish, highly confident pessimistic outlook for the stock(s)
    NOT_RELATED - Mentions the stock symbol but not related to its financial performance
    UNCERTAIN - Ambiguous or unclear sentiment about the stock(s)

    Examples:
    "Breaking: $AAPL doubles profit forecast!" -> STRONGLY_POSITIVE
    "Expecting modest gains for $MSFT next quarter" -> POSITIVE
    "$AMZN closed at 3500" -> NEUTRAL
    "Concerned about $TSLA's rising costs" -> NEGATIVE
    "$NFLX crash incoming, sell everything!" -> STRONGLY_NEGATIVE
    "I'm watching $DIS movie on my new TV" -> NOT_RELATED
    "Something might happen with $GME" -> UNCERTAIN

    Tweet to analyze: {text}
    Stock symbols mentioned: {symbols}

    Format: Return only one word from: STRONGLY_POSITIVE, POSITIVE, NEUTRAL, NEGATIVE, STRONGLY_NEGATIVE, NOT_RELATED, UNCERTAIN
    """

def get_sentiment(text, symbols, retries=3):
    """Get sentiment from Gemini with retry logic, focusing on stock symbols"""
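    # Render the symbols as "AAPL, MSFT" rather than as a Python list literal
    if isinstance(symbols, (list, tuple)):
        symbols = ', '.join(str(s) for s in symbols)
    prompt = setup_prompt().format(text=text, symbols=symbols)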
    
    for attempt in range(retries):
        try:
            response = model.generate_content(prompt)
            sentiment = response.text.strip().upper()
            
            # Validate the response
            valid_labels = [
                'STRONGLY_POSITIVE', 'POSITIVE', 'NEUTRAL', 'NEGATIVE',
                'STRONGLY_NEGATIVE', 'NOT_RELATED', 'UNCERTAIN'
            ]
            
            if sentiment in valid_labels:
                return sentiment
            else:
                raise ValueError(f"Invalid sentiment: {sentiment}")
                
        except Exception as e:
            if attempt == retries - 1:
                print(f"Error processing text: {text}\nError: {str(e)}")
                return 'UNCERTAIN'
            time.sleep(1)  # Wait before retry
    
    return 'UNCERTAIN'
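
The retry loop above waits a fixed second between attempts. If failures come from rate limiting, a growing delay may work better. The following is a minimal sketch of exponential backoff with jitter; `call_with_backoff` and its parameters are hypothetical, not part of this notebook or of the Gemini client library.

```python
import random
import time

def call_with_backoff(fn, retries=3, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn() up to `retries` times, sleeping longer after each failure.

    Hypothetical helper. The delay doubles per attempt (base_delay, 2x, 4x, ...),
    is capped at max_delay, and gets up to 10% random jitter added.
    Re-raises the last exception if every attempt fails.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller decide what to do
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))
```

Inside `get_sentiment`, the API call could then be wrapped as `call_with_backoff(lambda: model.generate_content(prompt))`. The `sleep` parameter exists only so the helper can be tested without real waiting.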

## 3. Test Sentiment Labeling on a Small Sample

In [None]:
# Test on a small sample
sample_df = df.head(5).copy()
sample_sentiments = []

for _, row in sample_df.iterrows():
    text = row['cleaned_text']
    symbols = row['verified_stock_symbols']
    sentiment = get_sentiment(text, symbols)
    sample_sentiments.append(sentiment)
    print(f"Text: {text[:100]}...\nSymbols: {symbols}\nSentiment: {sentiment}\n---")
    time.sleep(0.5)  # Rate limiting

sample_df['sentiment'] = sample_sentiments
sample_df[['cleaned_text', 'verified_stock_symbols', 'sentiment']]

## 4. Process All Tweets with Verified Stock Symbols

In [None]:
def process_dataframe(input_df, batch_size=50):
    """Label every row, pausing between API calls and saving a checkpoint after each batch"""
    result_df = input_df.copy()
    sentiments = []
    
    # Skip if already processed
    if 'sentiment' in result_df.columns and not result_df['sentiment'].isnull().all():
        print("Data already processed")
        return result_df
    
    total_rows = len(result_df)
    
    for i in tqdm(range(0, total_rows, batch_size), desc="Processing batches"):
        end_idx = min(i + batch_size, total_rows)
        batch = result_df.iloc[i:end_idx]
        
        batch_sentiments = []
        for _, row in batch.iterrows():
            text = row['cleaned_text']
            symbols = row['verified_stock_symbols']
            sentiment = get_sentiment(text, symbols)
            batch_sentiments.append(sentiment)
            time.sleep(0.2)  # Rate limiting
        
        sentiments.extend(batch_sentiments)
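        # Save intermediate results after each batch; rows not yet labeled stay empty.
        # Assigning a full-length list avoids relying on the index labels being 0..n-1.
        temp_df = result_df.copy()
        temp_df['sentiment'] = sentiments + [None] * (total_rows - end_idx)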
        temp_df.to_csv('../data/stock_tweets_labeled_in_progress.csv', index=False)
    
    result_df['sentiment'] = sentiments
    return result_df

# Process all data
labeled_df = process_dataframe(df)

# Save final results
labeled_df.to_csv('../data/stock_tweets_labeled.csv', index=False)
print("Saved labeled data to '../data/stock_tweets_labeled.csv'")

# Print statistics
print("\nSentiment Distribution:")
print(labeled_df['sentiment'].value_counts())
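
The batch loop above writes an in-progress file after every batch, but a restarted run still begins from the first row. One way to make the run resumable is to keep the labels collected so far in a checkpoint and skip that many rows on the next start. The sketch below uses only the standard library; `label_resumable`, its `label_fn` parameter, and the JSON checkpoint format are assumptions made for illustration, not part of this notebook.

```python
import json
import os

def label_resumable(texts, label_fn, checkpoint_path, batch_size=50):
    """Label texts in order, saving progress to a JSON checkpoint after each batch.

    Hypothetical helper: on restart, labels already stored in the checkpoint
    are reused and only the remaining texts are passed to label_fn.
    """
    labels = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            labels = json.load(f)[: len(texts)]

    for start in range(len(labels), len(texts), batch_size):
        for text in texts[start:start + batch_size]:
            labels.append(label_fn(text))
        # Write to a temporary file first so an interrupted write
        # cannot leave a truncated checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(labels, f)
        os.replace(tmp_path, checkpoint_path)
    return labels
```

The checkpoint is only valid if the input order is the same between runs, and labels from a batch that was interrupted midway are recomputed on restart.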

## 5. Filter for Training Dataset

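Create a final dataset for model training that excludes NOT_RELATED tweets. UNCERTAIN tweets are kept; because `get_sentiment` also returns UNCERTAIN when every API attempt fails, that class mixes genuinely ambiguous tweets with failed calls.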

In [None]:
# Create a filtered dataset excluding NOT_RELATED tweets
filtered_df = labeled_df[labeled_df['sentiment'] != 'NOT_RELATED'].copy()
filtered_df.to_csv('../data/stock_tweets_for_training.csv', index=False)
print(f"Saved {len(filtered_df)} tweets for training to '../data/stock_tweets_for_training.csv'")

# Print statistics for the filtered dataset
print("\nSentiment Distribution (Training Dataset):")
print(filtered_df['sentiment'].value_counts())

## 6. Analyze Results by Stock Symbol

In [None]:
# Explode the dataframe to analyze by individual stock symbol
exploded_df = labeled_df.explode('verified_stock_symbols').dropna(subset=['verified_stock_symbols'])
exploded_df = exploded_df.rename(columns={'verified_stock_symbols': 'stock_symbol'})

# Count tweets by stock symbol and sentiment
symbol_sentiment_counts = exploded_df.groupby(['stock_symbol', 'sentiment']).size().unstack(fill_value=0)

# Show top stocks by tweet count
top_stocks = exploded_df['stock_symbol'].value_counts().head(20)
print("Top stocks by tweet count:")
print(top_stocks)

# Sentiment distribution for top 5 stocks
top_5_stocks = top_stocks.index[:5]
print("\nSentiment distribution for top 5 stocks:")
for stock in top_5_stocks:
    stock_data = exploded_df[exploded_df['stock_symbol'] == stock]
    print(f"\n{stock} sentiment distribution:")
    print(stock_data['sentiment'].value_counts())

# Save the exploded dataframe for further analysis
exploded_df.to_csv('../data/stock_tweets_by_symbol.csv', index=False)
print("\nSaved expanded data by stock symbol to '../data/stock_tweets_by_symbol.csv'")
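
`explode` turns a tweet that mentions several symbols into one row per symbol, so a tweet tagged with two tickers contributes its sentiment to both. The plain-Python sketch below performs the same per-symbol counting; the function name `sentiment_counts_by_symbol` and the sample rows in the usage note are made up for illustration.

```python
from collections import Counter, defaultdict

def sentiment_counts_by_symbol(rows):
    """Count sentiments per symbol; rows are (symbols, sentiment) pairs.

    Rows whose symbols value is None or empty are skipped, which mirrors
    explode(...) followed by dropna(...) on the symbol column.
    """
    counts = defaultdict(Counter)
    for symbols, sentiment in rows:
        for symbol in symbols or []:
            counts[symbol][sentiment] += 1
    return {symbol: dict(counter) for symbol, counter in counts.items()}
```

For example, given the made-up rows `[(["AAPL", "MSFT"], "POSITIVE"), (["AAPL"], "NEGATIVE")]`, the POSITIVE tweet is counted once under AAPL and once under MSFT. Per-symbol totals can therefore add up to more than the number of tweets.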

## 7. Summary and Next Steps

This notebook has processed financial tweets with verified stock symbols, labeled their sentiment using Gemini, and prepared datasets for further analysis and model training.

Files created:
1. `stock_tweets_labeled.csv` - All tweets with verified stock symbols and their sentiment
2. `stock_tweets_for_training.csv` - Filtered dataset excluding NOT_RELATED tweets, ready for model training
3. `stock_tweets_by_symbol.csv` - Expanded dataset for analysis by individual stock symbol

Next steps:
1. Use `stock_tweets_for_training.csv` for model training (Gemma 3 or FinBERT)
2. Analyze sentiment by stock symbol for insights
3. Develop predictive models based on stock-specific sentiment