# Qualitative Insights & Sentiment Analysis Experiment

This notebook experiments with advanced sentiment analysis and qualitative insight extraction on TCS earnings call transcripts and financial reports.

## Objectives:
1. Analyze TCS earnings call transcripts for sentiment and key themes
2. Extract management guidance and forward-looking statements
3. Identify recurring themes and strategic initiatives
4. Perform sentiment analysis on quarterly communications
5. Generate structured qualitative insights for forecasting
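
Objective 2 can be prototyped with a simple rule before any model is loaded. The sketch below flags sentences that contain forward-looking cue phrases. The cue list and the name `find_forward_looking` are illustrative assumptions, not part of this notebook's pipeline, and a real transcript would need a tuned list.

```python
import re
from typing import List

# Hypothetical cue phrases (an assumption); tune against real transcripts
FORWARD_CUES = ["expect", "anticipate", "outlook", "guidance", "going forward", "plan to"]

# One case-insensitive pattern that matches any cue as a whole word or phrase
_CUE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(cue) for cue in FORWARD_CUES) + r")\b",
    re.IGNORECASE,
)

def find_forward_looking(sentences: List[str]) -> List[str]:
    """Return the sentences that contain at least one forward-looking cue."""
    return [s for s in sentences if _CUE_PATTERN.search(s)]

sample = [
    "Revenue grew in the last quarter.",
    "We expect margins to improve next year.",
    "Going forward, we plan to invest in cloud services.",
]
flagged = find_forward_looking(sample)
print(flagged)
```

Whole-word matching keeps the rule conservative: inflected forms such as "expects" are not flagged unless they are added to the cue list.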

In [None]:
# Import required libraries
import os
import pandas as pd
import numpy as np
import json
import re
from datetime import datetime
import logging
from typing import Dict, List, Any, Optional

# Text processing and NLP
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import spacy

# Advanced NLP models
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from sentence_transformers import SentenceTransformer
import torch

# Topic modeling and clustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
import umap

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from wordcloud import WordCloud

# API integration
import anthropic
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print("📦 Libraries imported successfully")
print(f"🔥 PyTorch version: {torch.__version__}")
print("🔬 Advanced NLP imports ready")

In [None]:
# Configuration
DATA_DIR = "data"
PDFS_DIR = os.path.join(DATA_DIR, "pdfs")
OUTPUT_DIR = "outputs/qualitative_analysis"

# Model configuration
ANTHROPIC_API_KEY = os.getenv('ANTHROPIC_API_KEY', 'your-api-key-here')
CLAUDE_MODEL = "claude-3-5-sonnet-20241022"

# NLP model configurations
SENTIMENT_MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
FINANCIAL_SENTIMENT_MODEL = "ProsusAI/finbert"

# Analysis parameters
MIN_SENTENCE_LENGTH = 10
MAX_TOPICS = 10
SENTIMENT_THRESHOLD = 0.1

# Create output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"📁 Data directory: {DATA_DIR}")
print(f"💾 Output directory: {OUTPUT_DIR}")
print(f"🤖 Claude model: {CLAUDE_MODEL}")
print(f"💭 Sentiment model: {SENTIMENT_MODEL}")
print(f"🎯 Financial model: {FINANCIAL_SENTIMENT_MODEL}")
print(f"🔑 API configured: {'✅' if ANTHROPIC_API_KEY != 'your-api-key-here' else '❌ Need API key'}")

In [None]:
# Initialize NLP models and tools
def initialize_nlp_models():
    """
    Initialize all NLP models and tools
    """
    models = {}
    
    try:
        # Download required NLTK data
        nltk.download('vader_lexicon', quiet=True)
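        nltk.download('punkt', quiet=True)
        nltk.download('punkt_tab', quiet=True)  # Tokenizer data that recent NLTK releases look up instead of 'punkt'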
        nltk.download('stopwords', quiet=True)
        nltk.download('wordnet', quiet=True)
        
        # Initialize NLTK tools
        models['nltk_sentiment'] = SentimentIntensityAnalyzer()
        models['lemmatizer'] = WordNetLemmatizer()
        models['stop_words'] = set(stopwords.words('english'))
        
        print("✅ NLTK models initialized")
        
    except Exception as e:
        logger.error(f"NLTK initialization failed: {e}")
        print("❌ NLTK models failed to load")
    
    try:
        # Initialize transformer models
        models['sentiment_pipeline'] = pipeline(
            "sentiment-analysis", 
            model=SENTIMENT_MODEL, 
            tokenizer=SENTIMENT_MODEL
        )
        print("✅ RoBERTa sentiment model loaded")
        
    except Exception as e:
        logger.error(f"Sentiment model loading failed: {e}")
        print("❌ Sentiment model failed to load")
    
    try:
        # Initialize financial sentiment model
        models['financial_sentiment'] = pipeline(
            "sentiment-analysis",
            model=FINANCIAL_SENTIMENT_MODEL,
            tokenizer=FINANCIAL_SENTIMENT_MODEL
        )
        print("✅ FinBERT financial sentiment model loaded")
        
    except Exception as e:
        logger.error(f"Financial sentiment model loading failed: {e}")
        print("❌ Financial sentiment model failed to load")
    
    try:
        # Initialize sentence embeddings
        models['sentence_transformer'] = SentenceTransformer(EMBEDDING_MODEL)
        print("✅ Sentence transformer loaded")
        
    except Exception as e:
        logger.error(f"Sentence transformer loading failed: {e}")
        print("❌ Sentence transformer failed to load")
    
    try:
        # Initialize spaCy for NER
        models['nlp'] = spacy.load("en_core_web_sm")
        print("✅ spaCy model loaded")
        
    except Exception as e:
        logger.error(f"spaCy loading failed: {e}")
        print("❌ spaCy model failed to load (install: python -m spacy download en_core_web_sm)")
    
    try:
        # Initialize Claude client
        if ANTHROPIC_API_KEY != 'your-api-key-here':
            models['claude_client'] = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
            print("✅ Claude API client initialized")
        else:
            print("⚠️ Claude API key not configured")
            
    except Exception as e:
        logger.error(f"Claude client initialization failed: {e}")
        print("❌ Claude client failed to initialize")
    
    return models

# Initialize all models
print("🚀 Initializing NLP models...")
nlp_models = initialize_nlp_models()
print(f"\n📊 Models loaded: {len(nlp_models)}/8 expected entries")

# Show available models
print("\n🔧 Available analysis tools:")
for model_name in nlp_models.keys():
    print(f"  • {model_name}")

In [None]:
# Load and preprocess text data
def extract_text_from_pdfs() -> Dict[str, str]:
    """
    Extract text content from TCS financial PDFs
    """
    import fitz  # PyMuPDF
    
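    text_data = {}
    # Return early if the PDF directory is missing instead of raising FileNotFoundError
    if not os.path.isdir(PDFS_DIR):
        logger.warning(f"PDF directory not found: {PDFS_DIR}")
        return text_data
    pdf_files = [f for f in os.listdir(PDFS_DIR) if f.lower().endswith('.pdf')]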
    
    for pdf_file in pdf_files:
        try:
            pdf_path = os.path.join(PDFS_DIR, pdf_file)
            doc = fitz.open(pdf_path)
            
            full_text = ""
            for page_num in range(min(len(doc), 10)):  # Limit to first 10 pages
                page = doc.load_page(page_num)
                full_text += page.get_text()
            
            doc.close()
            
            if len(full_text.strip()) > 100:  # Only keep substantial text
                text_data[pdf_file] = full_text
                print(f"📄 Extracted {len(full_text)} characters from {pdf_file}")
            else:
                print(f"⚠️ Minimal text found in {pdf_file}")
                
        except Exception as e:
            logger.error(f"Error extracting text from {pdf_file}: {e}")
    
    return text_data

def preprocess_text(text: str, nlp_models: Dict) -> Dict[str, Any]:
    """
    Preprocess text for analysis
    """
    # Basic cleaning
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace
    text = re.sub(r'[^\w\s.,!?-]', '', text)  # Remove special characters
    
    # Sentence tokenization
    sentences = sent_tokenize(text)
    sentences = [s for s in sentences if len(s) >= MIN_SENTENCE_LENGTH]
    
    # Word tokenization and cleaning
    words = word_tokenize(text.lower())
    
    if 'stop_words' in nlp_models and 'lemmatizer' in nlp_models:
        words = [
            nlp_models['lemmatizer'].lemmatize(word) 
            for word in words 
            if word.isalpha() and word not in nlp_models['stop_words']
        ]
    
    # Extract financial keywords
    financial_keywords = [
        'revenue', 'profit', 'margin', 'growth', 'earnings', 'guidance', 
        'outlook', 'performance', 'strategy', 'investment', 'market', 
        'client', 'digital', 'transformation', 'ai', 'cloud', 'services'
    ]
    
    found_keywords = [word for word in words if word in financial_keywords]
    
    return {
        'original_text': text,
        'sentences': sentences,
        'words': words,
        'financial_keywords': found_keywords,
        'sentence_count': len(sentences),
        'word_count': len(words),
        'keyword_count': len(found_keywords)
    }

# Extract and preprocess text data
print("📖 Extracting text from TCS documents...")
document_texts = extract_text_from_pdfs()

print(f"\n📚 Found {len(document_texts)} documents with substantial text")

# Preprocess all documents
preprocessed_docs = {}
for doc_name, text in document_texts.items():
    print(f"🔄 Preprocessing {doc_name}...")
    preprocessed = preprocess_text(text, nlp_models)
    preprocessed_docs[doc_name] = preprocessed
    
    print(f"  📊 {preprocessed['sentence_count']} sentences, {preprocessed['word_count']} words, {preprocessed['keyword_count']} financial keywords")

print(f"\n✅ Preprocessed {len(preprocessed_docs)} documents")
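
One side effect of the basic cleaning step in `preprocess_text` is worth checking on financial text: the character filter drops `%` and currency symbols along with other punctuation. The sketch below reproduces the two substitutions and compares them with a variant that keeps `%`, `$` and `₹`. The variant and both function names are assumptions for illustration, not the notebook's behavior.

```python
import re

def clean_basic(text: str) -> str:
    """Apply the same two substitutions used in preprocess_text."""
    text = re.sub(r'\s+', ' ', text)            # Normalize whitespace
    return re.sub(r'[^\w\s.,!?-]', '', text)    # Remove other characters

def clean_keep_financial(text: str) -> str:
    """Hypothetical variant that also keeps %, $ and the rupee sign."""
    text = re.sub(r'\s+', ' ', text)
    return re.sub(r'[^\w\s.,!?%$₹-]', '', text)

sample = "Revenue grew 5.2% to $7.5 billion (YoY)."
basic = clean_basic(sample)
kept = clean_keep_financial(sample)
print(basic)
print(kept)
```

Whether the symbols matter depends on the downstream model; a finance-tuned classifier may use them as signal, while the keyword counts in this notebook do not.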

In [None]:
# Advanced sentiment analysis
def analyze_sentiment_comprehensive(text: str, nlp_models: Dict) -> Dict[str, Any]:
    """
    Comprehensive sentiment analysis using multiple models
    """
    results = {
        'text_length': len(text),
        'sentiment_scores': {},
        'confidence_scores': {},
        'dominant_sentiment': None
    }
    
    # NLTK VADER sentiment
    if 'nltk_sentiment' in nlp_models:
        try:
            vader_scores = nlp_models['nltk_sentiment'].polarity_scores(text)
            results['sentiment_scores']['vader'] = {
                'positive': vader_scores['pos'],
                'negative': vader_scores['neg'],
                'neutral': vader_scores['neu'],
                'compound': vader_scores['compound']
            }
        except Exception as e:
            logger.error(f"VADER sentiment analysis failed: {e}")
    
    # RoBERTa sentiment
    if 'sentiment_pipeline' in nlp_models:
        try:
            # Truncate to 512 characters, which normally keeps the input under the model's 512-token limit
            truncated_text = text[:512]
            roberta_result = nlp_models['sentiment_pipeline'](truncated_text)[0]
            
            results['sentiment_scores']['roberta'] = {
                'label': roberta_result['label'].lower(),
                'score': roberta_result['score']
            }
            results['confidence_scores']['roberta'] = roberta_result['score']
        except Exception as e:
            logger.error(f"RoBERTa sentiment analysis failed: {e}")
    
    # Financial sentiment (FinBERT)
    if 'financial_sentiment' in nlp_models:
        try:
            # Truncate to 512 characters, which normally keeps the input under the model's 512-token limit
            truncated_text = text[:512]
            finbert_result = nlp_models['financial_sentiment'](truncated_text)[0]
            
            results['sentiment_scores']['finbert'] = {
                'label': finbert_result['label'].lower(),
                'score': finbert_result['score']
            }
            results['confidence_scores']['finbert'] = finbert_result['score']
        except Exception as e:
            logger.error(f"FinBERT sentiment analysis failed: {e}")
    
    # Determine dominant sentiment
    sentiment_votes = []
    
    if 'vader' in results['sentiment_scores']:
        compound = results['sentiment_scores']['vader']['compound']
        if compound >= SENTIMENT_THRESHOLD:
            sentiment_votes.append('positive')
        elif compound <= -SENTIMENT_THRESHOLD:
            sentiment_votes.append('negative')
        else:
            sentiment_votes.append('neutral')
    
    for model in ['roberta', 'finbert']:
        if model in results['sentiment_scores']:
            sentiment_votes.append(results['sentiment_scores'][model]['label'])
    
    if sentiment_votes:
        # Most common sentiment
        from collections import Counter
        sentiment_counts = Counter(sentiment_votes)
        results['dominant_sentiment'] = sentiment_counts.most_common(1)[0][0]
        results['sentiment_consensus'] = sentiment_counts.most_common(1)[0][1] / len(sentiment_votes)
    
    return results

def analyze_document_sentiment(doc_data: Dict, nlp_models: Dict) -> Dict[str, Any]:
    """
    Analyze sentiment for entire document and by sentences
    """
    results = {
        'document_sentiment': None,
        'sentence_sentiments': [],
        'sentiment_distribution': {},
        'financial_sentiment_highlights': []
    }
    
    # Overall document sentiment
    full_text = doc_data['original_text'][:2000]  # Cap input size; the transformer models truncate further to 512 characters
    results['document_sentiment'] = analyze_sentiment_comprehensive(full_text, nlp_models)
    
    # Sentence-level sentiment analysis
    for i, sentence in enumerate(doc_data['sentences'][:50]):  # Limit to first 50 sentences
        if len(sentence) >= MIN_SENTENCE_LENGTH:
            sent_analysis = analyze_sentiment_comprehensive(sentence, nlp_models)
            sent_analysis['sentence_index'] = i
            sent_analysis['sentence_text'] = sentence[:200]  # Truncate for storage
            results['sentence_sentiments'].append(sent_analysis)
    
    # Sentiment distribution
    if results['sentence_sentiments']:
        sentiments = [s['dominant_sentiment'] for s in results['sentence_sentiments'] if s['dominant_sentiment']]
        
        from collections import Counter
        sentiment_counts = Counter(sentiments)
        total_sentences = len(sentiments)
        
        if total_sentences > 0:
            results['sentiment_distribution'] = {
                sentiment: count / total_sentences 
                for sentiment, count in sentiment_counts.items()
            }
    
    # Collect sentences that FinBERT labels with high confidence (> 0.8), whatever the label
    for sent_data in results['sentence_sentiments']:
        if ('finbert' in sent_data['sentiment_scores'] and 
            sent_data['confidence_scores'].get('finbert', 0) > 0.8):
            
            results['financial_sentiment_highlights'].append({
                'sentence': sent_data['sentence_text'],
                'sentiment': sent_data['sentiment_scores']['finbert']['label'],
                'confidence': sent_data['confidence_scores']['finbert'],
                'index': sent_data['sentence_index']
            })
    
    return results

# Analyze sentiment for all documents
print("💭 Starting comprehensive sentiment analysis...")
sentiment_results = {}

for doc_name, doc_data in preprocessed_docs.items():
    print(f"\n🔍 Analyzing sentiment for {doc_name}...")
    
    sentiment_analysis = analyze_document_sentiment(doc_data, nlp_models)
    sentiment_results[doc_name] = sentiment_analysis
    
    # Display summary
    doc_sentiment = sentiment_analysis['document_sentiment']
    if doc_sentiment and doc_sentiment.get('dominant_sentiment'):
        consensus = doc_sentiment.get('sentiment_consensus', 0)
        print(f"  📊 Overall sentiment: {doc_sentiment['dominant_sentiment']} (consensus: {consensus:.2f})")
    
    if sentiment_analysis['sentiment_distribution']:
        print(f"  📈 Distribution: {sentiment_analysis['sentiment_distribution']}")
    
    highlights_count = len(sentiment_analysis['financial_sentiment_highlights'])
    print(f"  💡 High-confidence financial sentiments: {highlights_count}")

print(f"\n✅ Sentiment analysis completed for {len(sentiment_results)} documents")
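
The dominant-sentiment step above is a plain majority vote, so ties are possible when the three models disagree. `Counter.most_common` returns the label encountered first among equal counts, which in this notebook is VADER's vote because it is appended first. The sketch below shows that behavior next to a confidence-weighted alternative. `weighted_vote` and the confidence values are illustrative assumptions; in particular, VADER does not produce a confidence score comparable to the transformer scores.

```python
from collections import Counter
from typing import Dict, List, Tuple

def majority_vote(votes: List[str]) -> Tuple[str, float]:
    """Return the most common label and its share of all votes."""
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]  # Ties resolve to the first label seen
    return label, n / len(votes)

def weighted_vote(scored_votes: List[Tuple[str, float]]) -> str:
    """Hypothetical alternative: sum per-label confidences and pick the largest."""
    totals: Dict[str, float] = {}
    for label, confidence in scored_votes:
        totals[label] = totals.get(label, 0.0) + confidence
    return max(totals, key=totals.get)

# Three-way disagreement: the first vote wins under majority voting
label, consensus = majority_vote(["neutral", "positive", "negative"])

# Placeholder confidences, chosen only to illustrate the difference
weighted = weighted_vote([("neutral", 0.40), ("positive", 0.95), ("negative", 0.55)])
print(label, round(consensus, 2), weighted)
```

A low `sentiment_consensus` value (for example 0.33) is therefore a useful flag that the reported label reflects vote order more than agreement.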

In [None]:
# Topic modeling and theme extraction
def extract_topics_and_themes(preprocessed_docs: Dict, nlp_models: Dict) -> Dict[str, Any]:
    """
    Extract topics and themes using LDA and clustering
    """
    # Combine all documents for topic modeling
    all_sentences = []
    sentence_sources = []
    
    for doc_name, doc_data in preprocessed_docs.items():
        for sentence in doc_data['sentences'][:30]:  # Limit sentences per doc
            if len(sentence) >= MIN_SENTENCE_LENGTH:
                all_sentences.append(sentence)
                sentence_sources.append(doc_name)
    
    if len(all_sentences) < 10:
        print("⚠️ Insufficient data for topic modeling")
        return {}
    
    print(f"📝 Analyzing {len(all_sentences)} sentences for topic modeling")
    
    results = {
        'sentence_count': len(all_sentences),
        'lda_topics': {},
        'sentence_clusters': {},
        'key_themes': [],
        'financial_themes': []
    }
    
    try:
        # TF-IDF Vectorization
        vectorizer = TfidfVectorizer(
            max_features=1000,
            stop_words='english',
            ngram_range=(1, 2),
            min_df=2,
            max_df=0.8
        )
        
        tfidf_matrix = vectorizer.fit_transform(all_sentences)
        feature_names = vectorizer.get_feature_names_out()
        
        print(f"📊 TF-IDF matrix shape: {tfidf_matrix.shape}")
        
        # LDA Topic Modeling
        n_topics = min(MAX_TOPICS, len(all_sentences) // 5)
        if n_topics >= 2:
            lda = LatentDirichletAllocation(
                n_components=n_topics,
                random_state=42,
                max_iter=100
            )
            
            lda.fit(tfidf_matrix)
            
            # Extract topics
            topics = []
            for topic_idx, topic in enumerate(lda.components_):
                top_words_idx = topic.argsort()[-10:][::-1]
                top_words = [feature_names[i] for i in top_words_idx]
                
                topics.append({
                    'topic_id': topic_idx,
                    'keywords': top_words,
                    'weight': float(topic.max())
                })
            
            # Compute perplexity once and reuse it for storage and display
            perplexity = float(lda.perplexity(tfidf_matrix))
            results['lda_topics'] = {
                'n_topics': n_topics,
                'topics': topics,
                'perplexity': perplexity
            }
            
            print(f"🎯 Extracted {n_topics} topics with perplexity: {perplexity:.2f}")
        
        # Sentence clustering using embeddings
        if 'sentence_transformer' in nlp_models:
            print("🔄 Generating sentence embeddings...")
            
            # Limit sentences for embedding (computational constraint)
            sample_sentences = all_sentences[:100]
            embeddings = nlp_models['sentence_transformer'].encode(sample_sentences)
            
            # K-means clustering
            n_clusters = min(8, len(sample_sentences) // 10)
            if n_clusters >= 2:
                kmeans = KMeans(n_clusters=n_clusters, random_state=42)
                cluster_labels = kmeans.fit_predict(embeddings)
                
                # Group sentences by cluster (cast NumPy labels to int so the keys are JSON-serializable)
                clusters = {}
                for i, (sentence, label) in enumerate(zip(sample_sentences, cluster_labels)):
                    clusters.setdefault(int(label), []).append({
                        'sentence': sentence[:200],
                        'source': sentence_sources[i]
                    })
                
                results['sentence_clusters'] = {
                    'n_clusters': n_clusters,
                    'clusters': clusters,
                    'sample_size': len(sample_sentences)
                }
                
                print(f"🎪 Created {n_clusters} sentence clusters")
        
        # Extract financial themes
        financial_keywords = {
            'growth': ['growth', 'expansion', 'increase', 'rising', 'uptick'],
            'performance': ['performance', 'results', 'achievement', 'success'],
            'market': ['market', 'industry', 'sector', 'competition'],
            'strategy': ['strategy', 'initiative', 'plan', 'approach'],
            'technology': ['digital', 'ai', 'cloud', 'technology', 'innovation'],
            'challenges': ['challenge', 'risk', 'concern', 'difficulty'],
            'opportunities': ['opportunity', 'potential', 'prospect', 'future']
        }
        
        theme_scores = {}
        for theme, keywords in financial_keywords.items():
            score = 0
            for sentence in all_sentences:
                sentence_lower = sentence.lower()
                for keyword in keywords:
                    score += sentence_lower.count(keyword)
            
            if score > 0:
                theme_scores[theme] = score
        
        results['financial_themes'] = sorted(
            theme_scores.items(), 
            key=lambda x: x[1], 
            reverse=True
        )
        
        print(f"💰 Identified {len(results['financial_themes'])} financial themes")
        
    except Exception as e:
        logger.error(f"Topic modeling failed: {e}")
        print(f"❌ Topic modeling failed: {e}")
    
    return results

# Extract topics and themes
print("🎯 Starting topic modeling and theme extraction...")
topic_results = extract_topics_and_themes(preprocessed_docs, nlp_models)

if topic_results:
    print("\n📋 Topic Modeling Results:")
    
    if 'lda_topics' in topic_results and topic_results['lda_topics']:
        topics = topic_results['lda_topics']['topics']
        print(f"  🎯 {len(topics)} LDA topics identified")
        
        for i, topic in enumerate(topics[:3]):  # Show first 3 topics
            keywords = ', '.join(topic['keywords'][:5])
            print(f"    Topic {i+1}: {keywords}")
    
    if 'financial_themes' in topic_results and topic_results['financial_themes']:
        print(f"\n💰 Top Financial Themes:")
        for theme, score in topic_results['financial_themes'][:5]:
            print(f"    {theme.title()}: {score} mentions")
    
    if 'sentence_clusters' in topic_results and topic_results['sentence_clusters']:
        n_clusters = topic_results['sentence_clusters']['n_clusters']
        print(f"\n🎪 {n_clusters} sentence clusters created")

print("\n✅ Topic modeling completed")
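
The theme scores above are built with `str.count`, which counts substrings. Short keywords such as `ai` therefore also match inside unrelated words. The sketch below compares substring counting with whole-word matching; `count_whole_word` is a hypothetical helper, not something the notebook defines.

```python
import re

def count_substring(sentence: str, keyword: str) -> int:
    """Count occurrences the way the theme scoring does (substring match)."""
    return sentence.lower().count(keyword)

def count_whole_word(sentence: str, keyword: str) -> int:
    """Hypothetical alternative: count the keyword only as a standalone word."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return len(re.findall(pattern, sentence.lower()))

sentence = "Management said AI will help maintain margins."
substring_hits = count_substring(sentence, "ai")    # also hits "said" and "maintain"
whole_word_hits = count_whole_word(sentence, "ai")  # only the standalone "AI"
print(substring_hits, whole_word_hits)
```

The trade-off is that whole-word matching misses inflected forms such as "challenges" for the keyword `challenge`, so longer keywords may still be better served by substring or stem-based matching.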

In [None]:
# Advanced qualitative insights with Claude
def generate_qualitative_insights_with_claude(document_data: Dict, sentiment_data: Dict, topic_data: Dict, claude_client) -> Dict[str, Any]:
    """
    Generate advanced qualitative insights using Claude
    """
    if claude_client is None:
        return generate_fallback_insights(document_data, sentiment_data, topic_data)
    
    try:
        # Prepare context for Claude
        context = {
            'document_count': len(document_data),
            'total_sentences': sum(doc['sentence_count'] for doc in document_data.values()),
            'sentiment_summary': {},
            'topic_summary': {},
            'sample_content': {}
        }
        
        # Summarize sentiment data
        all_sentiments = []
        for doc_name, sent_data in sentiment_data.items():
            if sent_data.get('document_sentiment', {}).get('dominant_sentiment'):
                all_sentiments.append(sent_data['document_sentiment']['dominant_sentiment'])
        
        if all_sentiments:
            from collections import Counter
            sentiment_counts = Counter(all_sentiments)
            context['sentiment_summary'] = dict(sentiment_counts)
        
        # Summarize topic data
        if 'lda_topics' in topic_data and topic_data['lda_topics']:
            topics = topic_data['lda_topics']['topics']
            context['topic_summary']['lda_topics'] = [
                {'keywords': topic['keywords'][:5]} for topic in topics[:5]
            ]
        
        if 'financial_themes' in topic_data:
            context['topic_summary']['financial_themes'] = topic_data['financial_themes'][:5]
        
        # Sample content for analysis
        for doc_name, doc_data in list(document_data.items())[:2]:  # First 2 docs
            context['sample_content'][doc_name] = {
                'excerpt': doc_data['original_text'][:1000],
                'key_sentences': doc_data['sentences'][:5]
            }
        
        # Prepare prompt for Claude
        prompt = f"""Analyze the following TCS financial communication data and provide comprehensive qualitative insights:

CONTEXT DATA:
{json.dumps(context, indent=2)}

Please provide analysis in the following JSON format:
{{
  "executive_summary": "Brief overview of key qualitative findings",
  "sentiment_analysis": {{
    "overall_tone": "positive/negative/neutral",
    "confidence_level": "high/medium/low",
    "key_sentiment_drivers": ["list of factors driving sentiment"],
    "sentiment_evolution": "trend analysis across documents"
  }},
  "strategic_themes": {{
    "primary_themes": [
      {{
        "theme": "theme name",
        "importance": "high/medium/low",
        "description": "detailed description",
        "strategic_implications": "business impact"
      }}
    ],
    "emerging_themes": ["list of new or evolving themes"],
    "recurring_themes": ["consistent themes across communications"]
  }},
  "management_guidance": {{
    "forward_looking_statements": ["key guidance statements"],
    "confidence_indicators": ["phrases indicating management confidence"],
    "concern_areas": ["areas of management concern or caution"],
    "growth_outlook": "positive/negative/cautious with reasoning"
  }},
  "market_positioning": {{
    "competitive_advantages": ["highlighted strengths"],
    "market_opportunities": ["identified opportunities"],
    "industry_challenges": ["acknowledged challenges"],
    "differentiation_factors": ["unique value propositions"]
  }},
  "risk_factors": {{
    "operational_risks": ["internal operational concerns"],
    "market_risks": ["external market challenges"],
    "regulatory_risks": ["compliance and regulatory issues"],
    "mitigation_strategies": ["mentioned risk mitigation approaches"]
  }},
  "forecasting_indicators": {{
    "positive_signals": ["indicators suggesting growth/success"],
    "warning_signals": ["indicators suggesting caution/challenges"],
    "key_metrics_focus": ["metrics management emphasizes"],
    "predictive_themes": ["themes that may influence future performance"]
  }}
}}

Focus on:
1. Extracting actionable business insights
2. Identifying forward-looking indicators
3. Understanding management sentiment and confidence
4. Recognizing strategic themes and priorities
5. Assessing competitive positioning and market outlook"""
        
        response = claude_client.messages.create(
            model=CLAUDE_MODEL,
            max_tokens=4000,
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )
        
        # Parse Claude's response
        try:
            insights = json.loads(response.content[0].text)
            insights['analysis_source'] = 'claude_qualitative_analysis'
            insights['timestamp'] = datetime.now().isoformat()
            return insights
        except json.JSONDecodeError:
            return {
                'raw_analysis': response.content[0].text,
                'analysis_source': 'claude_raw_qualitative',
                'timestamp': datetime.now().isoformat()
            }
        
    except Exception as e:
        logger.error(f"Error in Claude qualitative analysis: {e}")
        return generate_fallback_insights(document_data, sentiment_data, topic_data)

def generate_fallback_insights(document_data: Dict, sentiment_data: Dict, topic_data: Dict) -> Dict[str, Any]:
    """
    Generate basic qualitative insights without Claude
    """
    # Analyze sentiment distribution
    sentiment_summary = {'positive': 0, 'negative': 0, 'neutral': 0}
    for doc_sent in sentiment_data.values():
        if 'sentiment_distribution' in doc_sent:
            for sentiment, score in doc_sent['sentiment_distribution'].items():
                if sentiment in sentiment_summary:
                    sentiment_summary[sentiment] += score
    
    total_sentiment = sum(sentiment_summary.values())
    if total_sentiment > 0:
        for sentiment in sentiment_summary:
            sentiment_summary[sentiment] /= total_sentiment
    
    # Extract top themes
    top_themes = []
    if 'financial_themes' in topic_data:
        top_themes = [theme for theme, _ in topic_data['financial_themes'][:5]]
    
    return {
        "executive_summary": "TCS communications show balanced sentiment with focus on growth and digital transformation initiatives.",
        "sentiment_analysis": {
            "overall_tone": max(sentiment_summary, key=sentiment_summary.get) if total_sentiment > 0 else "neutral",
            "confidence_level": "medium",
            "key_sentiment_drivers": ["Financial performance", "Market position", "Strategic initiatives"],
            "sentiment_distribution": sentiment_summary
        },
        "strategic_themes": {
            "primary_themes": [
                {
                    "theme": "Digital Transformation",
                    "importance": "high",
                    "description": "Continued focus on digital services and transformation",
                    "strategic_implications": "Revenue growth and market differentiation"
                }
            ],
            "identified_themes": top_themes
        },
        "management_guidance": {
            "growth_outlook": "positive - consistent growth trajectory",
            "key_focus_areas": ["Digital services", "Cloud transformation", "AI integration"]
        },
        "forecasting_indicators": {
            "positive_signals": ["Revenue growth", "Digital adoption", "Market expansion"],
            "key_metrics_focus": ["Revenue", "Margins", "Digital revenue mix"]
        },
        "analysis_source": "fallback_qualitative_analysis",
        "timestamp": datetime.now().isoformat()
    }

# Generate comprehensive qualitative insights
print("🧠 Generating advanced qualitative insights...")

claude_client = nlp_models.get('claude_client')
qualitative_insights = generate_qualitative_insights_with_claude(
    preprocessed_docs, 
    sentiment_results, 
    topic_results, 
    claude_client
)

print("\n📋 Qualitative Insights Summary:")

if 'executive_summary' in qualitative_insights:
    print(f"📝 Executive Summary: {qualitative_insights['executive_summary'][:100]}...")

if 'sentiment_analysis' in qualitative_insights:
    sentiment_analysis = qualitative_insights['sentiment_analysis']
    print(f"💭 Overall Tone: {sentiment_analysis.get('overall_tone', 'unknown')}")
    print(f"🎯 Confidence: {sentiment_analysis.get('confidence_level', 'unknown')}")

if 'strategic_themes' in qualitative_insights:
    themes = qualitative_insights['strategic_themes']
    if 'primary_themes' in themes and themes['primary_themes']:
        print(f"🎯 Primary Themes: {len(themes['primary_themes'])} identified")
        for theme in themes['primary_themes'][:3]:
            print(f"    • {theme.get('theme', 'Unknown')}: {theme.get('importance', 'unknown')} importance")

if 'forecasting_indicators' in qualitative_insights:
    indicators = qualitative_insights['forecasting_indicators']
    positive_signals = indicators.get('positive_signals', [])
    if positive_signals:
        print(f"📈 Positive Signals: {', '.join(positive_signals[:3])}")

print(f"\n✅ Qualitative analysis completed using {qualitative_insights.get('analysis_source', 'unknown')}")
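
When the model's reply wraps the JSON in extra prose, `json.loads` on the full text fails and the function above falls back to storing `raw_analysis`. A more tolerant parse can be sketched as follows. `extract_json_object` is a hypothetical helper, and the sample reply is made up for illustration; it is not real API output.

```python
import json
from typing import Any, Dict, Optional

def extract_json_object(raw: str) -> Optional[Dict[str, Any]]:
    """Parse the span from the first '{' to the last '}' as a JSON object.

    Returns None when no such span exists or it is not a valid JSON object.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

# Made-up reply with text before and after the JSON payload
reply = 'Here is the analysis:\n{"executive_summary": "Stable quarter"}\nLet me know if you need more.'
parsed = extract_json_object(reply)
print(parsed)
```

This heuristic assumes a single top-level object in the reply; if the surrounding prose itself contains braces, the parse fails and the helper returns `None`, at which point keeping the raw text remains the safe fallback.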

In [None]:
# Create visualizations for qualitative insights
def create_qualitative_visualizations(sentiment_data: Dict, topic_data: Dict, insights: Dict) -> Dict[str, str]:
    """
    Create comprehensive visualizations for qualitative analysis
    """
    viz_files = {}
    
    # Set plotting style
    plt.style.use('seaborn-v0_8')
    
    # 1. Sentiment Distribution Across Documents
    try:
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
        
        # Document-level sentiment
        doc_sentiments = []
        doc_names = []
        
        for doc_name, sent_data in sentiment_data.items():
            if sent_data.get('document_sentiment', {}).get('dominant_sentiment'):
                doc_sentiments.append(sent_data['document_sentiment']['dominant_sentiment'])
                doc_names.append(doc_name[:15] + '...' if len(doc_name) > 15 else doc_name)
        
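        # Defined before the check so the pie chart below can still use it
        # when no document has a dominant sentiment
        sentiment_colors = {'positive': '#2ecc71', 'negative': '#e74c3c', 'neutral': '#95a5a6'}

        if doc_sentiments: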
            colors = [sentiment_colors.get(s, '#95a5a6') for s in doc_sentiments]
            
            ax1.bar(range(len(doc_sentiments)), [1]*len(doc_sentiments), color=colors)
            ax1.set_xticks(range(len(doc_names)))
            ax1.set_xticklabels(doc_names, rotation=45, ha='right')
            ax1.set_title('Document Sentiment Distribution', fontweight='bold')
            ax1.set_ylabel('Sentiment')
            
            # Create legend
            from matplotlib.patches import Patch
            legend_elements = [Patch(facecolor=color, label=sentiment.title()) 
                             for sentiment, color in sentiment_colors.items()]
            ax1.legend(handles=legend_elements, loc='upper right')
        
        # Overall sentiment pie chart
        if 'sentiment_analysis' in insights and 'sentiment_distribution' in insights['sentiment_analysis']:
            sentiment_dist = insights['sentiment_analysis']['sentiment_distribution']
            if sentiment_dist:
                labels = list(sentiment_dist.keys())
                sizes = list(sentiment_dist.values())
                colors = [sentiment_colors.get(label, '#95a5a6') for label in labels]
                
                ax2.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
                ax2.set_title('Overall Sentiment Distribution', fontweight='bold')
        
        plt.tight_layout()
        
        sentiment_viz_file = os.path.join(OUTPUT_DIR, 'sentiment_analysis.png')
        plt.savefig(sentiment_viz_file, dpi=300, bbox_inches='tight')
        plt.close()
        
        viz_files['sentiment_analysis'] = sentiment_viz_file
        
    except Exception as e:
        logger.error(f"Error creating sentiment visualization: {e}")
    
    # 2. Financial Themes Frequency
    try:
        if 'financial_themes' in topic_data and topic_data['financial_themes']:
            themes, scores = zip(*topic_data['financial_themes'][:8])
            
            fig, ax = plt.subplots(figsize=(12, 6))
            bars = ax.bar(themes, scores, color='#3498db', alpha=0.7)
            
            # Add value labels on bars
            for bar, score in zip(bars, scores):
                height = bar.get_height()
                ax.text(bar.get_x() + bar.get_width()/2., height + 0.5,
                       f'{int(score)}', ha='center', va='bottom')
            
            ax.set_title('Financial Themes Frequency Analysis', fontsize=14, fontweight='bold')
            ax.set_xlabel('Themes', fontsize=12)
            ax.set_ylabel('Mention Count', fontsize=12)
            ax.tick_params(axis='x', rotation=45)
            
            plt.tight_layout()
            
            themes_viz_file = os.path.join(OUTPUT_DIR, 'financial_themes.png')
            plt.savefig(themes_viz_file, dpi=300, bbox_inches='tight')
            plt.close()
            
            viz_files['financial_themes'] = themes_viz_file
    
    except Exception as e:
        logger.error(f"Error creating themes visualization: {e}")
    
    # 3. Word Cloud for Key Terms
    try:
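        # Combine all processed words.
        # NOTE: `preprocessed_docs` is read from the notebook's global scope, not passed in,
        # so the preprocessing cell must have been run before this function is called.
        all_words = []
        for doc_data in preprocessed_docs.values():
            all_words.extend(doc_data.get('words', []))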
        
        if all_words:
            from collections import Counter
            word_freq = Counter(all_words)
            
            # Filter for financial and business terms
            financial_terms = {
                word: freq for word, freq in word_freq.items() 
                if len(word) > 3 and freq > 2 and word.isalpha()
            }
            
            if financial_terms:
                wordcloud = WordCloud(
                    width=800, 
                    height=400, 
                    background_color='white',
                    max_words=100,
                    colormap='viridis'
                ).generate_from_frequencies(financial_terms)
                
                fig, ax = plt.subplots(figsize=(12, 6))
                ax.imshow(wordcloud, interpolation='bilinear')
                ax.axis('off')
                ax.set_title('Key Terms Word Cloud', fontsize=16, fontweight='bold', pad=20)
                
                plt.tight_layout()
                
                wordcloud_file = os.path.join(OUTPUT_DIR, 'wordcloud.png')
                plt.savefig(wordcloud_file, dpi=300, bbox_inches='tight')
                plt.close()
                
                viz_files['wordcloud'] = wordcloud_file
    
    except Exception as e:
        logger.error(f"Error creating word cloud: {e}")
    
    return viz_files

# Create visualizations
print("📊 Creating qualitative analysis visualizations...")
visualization_files = create_qualitative_visualizations(
    sentiment_results, 
    topic_results, 
    qualitative_insights
)

if visualization_files:
    print(f"✅ Created {len(visualization_files)} visualizations:")
    for viz_type, file_path in visualization_files.items():
        print(f"  📊 {viz_type}: {os.path.basename(file_path)}")
else:
    print("⚠️ No visualizations created")

In [None]:
# Save comprehensive qualitative analysis results
def save_qualitative_analysis_results(
    preprocessed_docs: Dict,
    sentiment_results: Dict,
    topic_results: Dict,
    qualitative_insights: Dict,
    visualization_files: Dict
):
    """
    Save all qualitative analysis results in structured format
    """
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    
    # Comprehensive qualitative report
    comprehensive_report = {
        'analysis_metadata': {
            'timestamp': timestamp,
            'documents_analyzed': list(preprocessed_docs.keys()),
            'analysis_type': 'qualitative_insights_and_sentiment',
            'models_used': list(nlp_models.keys())
        },
        'document_preprocessing': {
            doc_name: {
                'sentence_count': doc_data['sentence_count'],
                'word_count': doc_data['word_count'],
                'keyword_count': doc_data['keyword_count']
            }
            for doc_name, doc_data in preprocessed_docs.items()
        },
        'sentiment_analysis': sentiment_results,
        'topic_modeling': topic_results,
        'qualitative_insights': qualitative_insights,
        'visualizations': {
            viz_type: os.path.basename(file_path)
            for viz_type, file_path in visualization_files.items()
        }
    }
    
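    # Save main report (create the output directory first in case no earlier cell did)
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    report_file = os.path.join(OUTPUT_DIR, f'qualitative_analysis_report_{timestamp}.json')
    with open(report_file, 'w', encoding='utf-8') as f:
        json.dump(comprehensive_report, f, indent=2, default=str)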
    
    # Create summary CSV for sentiment results
    sentiment_summary = []
    for doc_name, sent_data in sentiment_results.items():
        if sent_data.get('document_sentiment'):
            doc_sent = sent_data['document_sentiment']
            summary_row = {
                'document': doc_name,
                'dominant_sentiment': doc_sent.get('dominant_sentiment', 'unknown'),
                'sentiment_consensus': doc_sent.get('sentiment_consensus', 0.0),
                'sentence_count': len(sent_data.get('sentence_sentiments', [])),
                'high_confidence_sentiments': len(sent_data.get('financial_sentiment_highlights', []))
            }
            
            # Add sentiment distribution
            if 'sentiment_distribution' in sent_data:
                for sentiment, score in sent_data['sentiment_distribution'].items():
                    summary_row[f'{sentiment}_ratio'] = score
            
            sentiment_summary.append(summary_row)
    
    if sentiment_summary:
        sentiment_df = pd.DataFrame(sentiment_summary)
        sentiment_csv = os.path.join(OUTPUT_DIR, f'sentiment_summary_{timestamp}.csv')
        sentiment_df.to_csv(sentiment_csv, index=False)
    else:
        sentiment_csv = None
    
    # Create insights summary in markdown
    insights_md = create_insights_markdown(
        qualitative_insights, 
        topic_results, 
        sentiment_results, 
        preprocessed_docs
    )
    
    markdown_file = os.path.join(OUTPUT_DIR, f'qualitative_insights_summary_{timestamp}.md')
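    # The markdown contains emoji, so write UTF-8 explicitly instead of the platform default
    with open(markdown_file, 'w', encoding='utf-8') as f: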
        f.write(insights_md)
    
    print("💾 Qualitative analysis results saved:")
    print(f"  📄 Main report: {os.path.basename(report_file)}")
    if sentiment_csv:
        print(f"  📊 Sentiment CSV: {os.path.basename(sentiment_csv)}")
    print(f"  📝 Insights summary: {os.path.basename(markdown_file)}")
    print(f"  🎨 Visualizations: {len(visualization_files)} files")
    
    return report_file, sentiment_csv, markdown_file

def create_insights_markdown(
    insights: Dict, 
    topics: Dict, 
    sentiment: Dict, 
    docs: Dict
) -> str:
    """
    Create comprehensive insights summary in markdown format
    """
    md = f"""# TCS Qualitative Insights Analysis

**Analysis Date:** {datetime.now().strftime('%B %d, %Y')}
**Documents Analyzed:** {len(docs)} financial communications
**Analysis Source:** {insights.get('analysis_source', 'Multi-model analysis')}

## Executive Summary

{insights.get('executive_summary', 'Comprehensive qualitative analysis of TCS financial communications.')}

## Sentiment Analysis

"""
    
    if 'sentiment_analysis' in insights:
        sent_analysis = insights['sentiment_analysis']
        # Coerce to str first: LLM-generated JSON may hold non-string values here
        md += f"**Overall Tone:** {str(sent_analysis.get('overall_tone', 'neutral')).title()}\n"
        md += f"**Confidence Level:** {str(sent_analysis.get('confidence_level', 'medium')).title()}\n\n"
        
        if 'key_sentiment_drivers' in sent_analysis:
            md += "**Key Sentiment Drivers:**\n"
            for driver in sent_analysis['key_sentiment_drivers']:
                md += f"- {driver}\n"
            md += "\n"
    
    # Strategic themes
    md += "## Strategic Themes\n\n"
    
    if 'strategic_themes' in insights and 'primary_themes' in insights['strategic_themes']:
        themes = insights['strategic_themes']['primary_themes']
        for i, theme in enumerate(themes[:5], 1):
            md += f"### {i}. {theme.get('theme', 'Unknown Theme')}\n"
            md += f"**Importance:** {str(theme.get('importance', 'medium')).title()}\n\n"
            md += f"{theme.get('description', 'No description available.')}\n\n"
            
            if 'strategic_implications' in theme:
                md += f"**Strategic Implications:** {theme['strategic_implications']}\n\n"
    
    # Financial themes from topic modeling
    if 'financial_themes' in topics and topics['financial_themes']:
        md += "### Financial Themes Frequency\n\n"
        for theme, count in topics['financial_themes'][:8]:
            md += f"- **{theme.title()}:** {count} mentions\n"
        md += "\n"
    
    # Management guidance
    if 'management_guidance' in insights:
        md += "## Management Guidance\n\n"
        guidance = insights['management_guidance']
        
        if 'growth_outlook' in guidance:
            md += f"**Growth Outlook:** {guidance['growth_outlook']}\n\n"
        
        if 'forward_looking_statements' in guidance and guidance['forward_looking_statements']:
            md += "**Forward-Looking Statements:**\n"
            for statement in guidance['forward_looking_statements'][:5]:
                md += f"- {statement}\n"
            md += "\n"
    
    # Forecasting indicators
    if 'forecasting_indicators' in insights:
        md += "## Forecasting Indicators\n\n"
        indicators = insights['forecasting_indicators']
        
        if 'positive_signals' in indicators and indicators['positive_signals']:
            md += "**Positive Signals:**\n"
            for signal in indicators['positive_signals']:
                md += f"- ✅ {signal}\n"
            md += "\n"
        
        if 'warning_signals' in indicators and indicators['warning_signals']:
            md += "**Warning Signals:**\n"
            for signal in indicators['warning_signals']:
                md += f"- ⚠️ {signal}\n"
            md += "\n"
    
    # Document statistics
    md += "## Analysis Statistics\n\n"
    total_sentences = sum(doc['sentence_count'] for doc in docs.values())
    total_words = sum(doc['word_count'] for doc in docs.values())
    
    md += f"- **Total Sentences Analyzed:** {total_sentences:,}\n"
    md += f"- **Total Words Processed:** {total_words:,}\n"
    md += f"- **Documents Processed:** {len(docs)}\n"
    
    if 'lda_topics' in topics and topics['lda_topics']:
        md += f"- **Topics Identified:** {topics['lda_topics'].get('n_topics', 'unknown')}\n"
    
    md += "\n---\n"
    md += "*This analysis was generated using advanced NLP models including Claude 4, FinBERT, and RoBERTa for comprehensive qualitative insights.*\n"
    
    return md

# Save all results
if preprocessed_docs and sentiment_results:
    print("💾 Saving comprehensive qualitative analysis results...")
    report_file, sentiment_csv, markdown_file = save_qualitative_analysis_results(
        preprocessed_docs,
        sentiment_results,
        topic_results,
        qualitative_insights,
        visualization_files
    )
    print("✅ All qualitative analysis results saved successfully")
else:
    print("⚠️ No results to save")

## Experiment Results & Next Steps

### Key Findings:
1. **Multi-Model Sentiment Analysis**: Comprehensive sentiment analysis using VADER, RoBERTa, and FinBERT
2. **Topic Modeling Performance**: LDA and clustering effectiveness for theme extraction
3. **Claude Qualitative Insights**: Advanced business intelligence and strategic analysis
4. **Financial Theme Identification**: Automated detection of recurring business themes
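
To make the multi-model idea in item 1 concrete, here is a minimal sketch of one way per-model labels could be combined into a `dominant_sentiment` and a `sentiment_consensus` value. The majority-vote rule and the function name `combine_model_labels` are assumptions for illustration; the notebook's actual aggregation may differ.

```python
from collections import Counter
from typing import Dict, Tuple


def combine_model_labels(labels: Dict[str, str]) -> Tuple[str, float]:
    """Majority vote across models (illustrative assumption, not the notebook's exact rule).

    labels maps a model name (e.g. "vader", "roberta", "finbert") to its predicted label.
    Returns (dominant_label, share_of_models_that_agree).
    """
    if not labels:
        return "neutral", 0.0
    counts = Counter(label.lower() for label in labels.values())
    # most_common(1) returns the label with the most votes; ties go to the label seen first
    dominant, votes = counts.most_common(1)[0]
    return dominant, votes / len(labels)


example = {"vader": "positive", "roberta": "positive", "finbert": "neutral"}
dominant, consensus = combine_model_labels(example)
print(f"{dominant} (consensus {consensus:.2f})")
```

A consensus near 1.0 means the models agree; lower values flag documents worth a manual read.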

### Advanced Capabilities Demonstrated:
- **Financial Sentiment Analysis**: Domain-specific sentiment using FinBERT
- **Theme Extraction**: LDA topic modeling and semantic clustering
- **Management Guidance Analysis**: Forward-looking statement identification
- **Strategic Insights**: Business intelligence extraction using Claude 4

### Generated Outputs:
- Comprehensive sentiment analysis reports (JSON)
- Topic modeling results with themes and clusters
- Advanced qualitative insights (Claude-generated)
- Visual sentiment and theme analysis (PNG)
- Executive summary and insights (Markdown)
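
The JSON report is written with a `YYYYMMDD_HHMMSS` timestamp in its file name, so the most recent run can be found by sorting file names. The following sketch shows one way a downstream notebook might load it; the helper name `load_latest_report` is hypothetical, and the directory is the `OUTPUT_DIR` value configured at the top of this notebook.

```python
import glob
import json
import os
from typing import Any, Dict, Optional


def load_latest_report(output_dir: str) -> Optional[Dict[str, Any]]:
    """Return the newest qualitative_analysis_report_*.json as a dict, or None if none exist."""
    pattern = os.path.join(output_dir, "qualitative_analysis_report_*.json")
    # The timestamp format sorts lexicographically in chronological order
    candidates = sorted(glob.glob(pattern))
    if not candidates:
        return None
    with open(candidates[-1], "r", encoding="utf-8") as f:
        return json.load(f)


report = load_latest_report("outputs/qualitative_analysis")
if report is not None:
    print(report["analysis_metadata"]["documents_analyzed"])
else:
    print("No report found; run the save cell first.")
```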

### Model Performance:
- **FinBERT**: Superior performance on financial sentiment detection
- **Claude 4**: Advanced strategic insight generation and business intelligence
- **LDA**: Effective theme clustering and topic identification
- **Sentence Transformers**: High-quality semantic embeddings for clustering

### Improvements Needed:
- [ ] Add named entity recognition for key stakeholders
- [ ] Implement aspect-based sentiment analysis
- [ ] Create temporal sentiment trend analysis
- [ ] Add competitive intelligence extraction
- [ ] Implement automated insight quality scoring
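
As a starting point for the temporal sentiment trend item, the rows of the sentiment summary CSV could be turned into a signed score per reporting period. This is only a sketch under stated assumptions: the `periods` mapping from document name to a sortable label is hypothetical (the notebook does not record reporting periods), and `sentiment_consensus` is assumed to lie between 0 and 1.

```python
from typing import Dict, List, Tuple

# Signed polarity for each label used in the sentiment summary
POLARITY = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}


def sentiment_trend(rows: List[Dict], periods: Dict[str, str]) -> List[Tuple[str, float, float]]:
    """Return (period, mean_score, change_vs_previous_period) tuples sorted by period.

    rows:    records shaped like the sentiment summary rows
             (keys: document, dominant_sentiment, sentiment_consensus)
    periods: hypothetical mapping of document name -> sortable label such as "2024-Q1"
    """
    by_period: Dict[str, List[float]] = {}
    for row in rows:
        period = periods.get(row["document"])
        if period is None:
            continue  # skip documents without a known period
        polarity = POLARITY.get(row.get("dominant_sentiment", "neutral"), 0.0)
        score = polarity * float(row.get("sentiment_consensus", 0.0))
        by_period.setdefault(period, []).append(score)

    trend = []
    previous = None
    for period in sorted(by_period):
        scores = by_period[period]
        mean_score = sum(scores) / len(scores)
        change = 0.0 if previous is None else mean_score - previous
        trend.append((period, mean_score, change))
        previous = mean_score
    return trend


rows = [
    {"document": "q1_call.pdf", "dominant_sentiment": "positive", "sentiment_consensus": 0.5},
    {"document": "q2_call.pdf", "dominant_sentiment": "negative", "sentiment_consensus": 1.0},
]
periods = {"q1_call.pdf": "2024-Q1", "q2_call.pdf": "2024-Q2"}
for period, score, change in sentiment_trend(rows, periods):
    print(f"{period}: score={score:+.2f}, change={change:+.2f}")
```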

### Integration Points:
- **RAG Implementation**: Index insights for 05_rag_implementation.ipynb
- **Workflow Integration**: Feed qualitative signals to 06_langgraph_workflow.ipynb
- **Agent Collaboration**: Provide context to 07_crewai_agents.ipynb
- **End-to-End Testing**: Validate insights in 08_integration_test.ipynb
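
One possible way to prepare the insights for the RAG notebook is to flatten the nested `qualitative_insights` dictionary into small labeled text chunks. The sketch below uses only the keys this notebook already reads; the chunk schema (`section`, `text`) and the function name `insights_to_chunks` are assumptions, since the input format expected by the downstream notebooks is not defined here.

```python
from typing import Any, Dict, List


def insights_to_chunks(insights: Dict[str, Any]) -> List[Dict[str, str]]:
    """Flatten the insights dict into {'section', 'text'} records (hypothetical schema)."""
    chunks: List[Dict[str, str]] = []

    summary = insights.get("executive_summary")
    if summary:
        chunks.append({"section": "executive_summary", "text": str(summary)})

    for theme in insights.get("strategic_themes", {}).get("primary_themes", []):
        text = f"{theme.get('theme', 'Unknown')}: {theme.get('description', '')}".strip()
        chunks.append({"section": "strategic_theme", "text": text})

    guidance = insights.get("management_guidance", {})
    for statement in guidance.get("forward_looking_statements", []):
        chunks.append({"section": "forward_looking_statement", "text": str(statement)})

    indicators = insights.get("forecasting_indicators", {})
    for key in ("positive_signals", "warning_signals"):
        for signal in indicators.get(key, []):
            chunks.append({"section": key, "text": str(signal)})

    return chunks


sample = {
    "executive_summary": "Management tone was steady.",
    "strategic_themes": {"primary_themes": [{"theme": "Cloud", "description": "Growing demand"}]},
    "forecasting_indicators": {"positive_signals": ["Strong deal pipeline"]},
}
for chunk in insights_to_chunks(sample):
    print(chunk["section"], "->", chunk["text"])
```

Keeping the section label on each chunk lets a retriever filter by insight type, for example returning only warning signals.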