# Luganda YouTube Comments Sentiment Analysis with Ganda Gemma

**Project Overview:**
This notebook demonstrates sentiment analysis of YouTube comments written in Luganda (a language widely spoken in Uganda) using the CraneAILabs Ganda Gemma model. Each comment is classified as "Kirungi" (good/positive) or "Kibi" (bad/negative) to help Ugandan content creators understand audience sentiment.

**Key Features:**
- YouTube Data API integration for comment extraction
- Ganda Gemma model for Luganda sentiment analysis
- Authentic Ugandan sentiment labels (Kirungi/Kibi)
- Real-world application for content creators

**Date:** August 2025  
**Model:** CraneAILabs/ganda-gemma-1b

# 1. SETUP AND INSTALLATIONS

In [None]:
# Install required packages
!pip install transformers torch google-api-python-client pandas matplotlib seaborn plotly
!pip install huggingface_hub python-dotenv wordcloud

# Import libraries
import pandas as pd
import numpy as np
import torch
import time
import re
import json
import os
import glob
from scipy.stats import pearsonr, spearmanr
from datetime import datetime, timedelta
from collections import Counter, defaultdict
import string

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from wordcloud import WordCloud

# Transformers and Google API
from transformers import AutoTokenizer, AutoModelForCausalLM
from googleapiclient.discovery import build

# Configure plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ All packages installed and imported successfully!")

# 2. AUTHENTICATION AND API SETUP

### 2.1 Hugging Face Authentication

In [None]:
#First, we authenticate with Hugging Face to access the private Ganda Gemma model.

from huggingface_hub import login
from google.colab import userdata

HF_TOKEN = userdata.get('Ganda_Gemma_Token')
login(token=HF_TOKEN)
print("Hugging Face authentication successful!")

### 2.2 YouTube Data API Setup

In [None]:
#Set up YouTube API for comment extraction.
YOUTUBE_API_KEY = userdata.get('YouTube_API_Key')
youtube = build('youtube', 'v3', developerKey=YOUTUBE_API_KEY)
print("YouTube API authentication successful!")

# 3. DATA COLLECTION

In [None]:
"""
Uganda YouTube comment extractor.
Finds each target channel, picks one high-engagement video per channel, and collects its comments.
"""
class CleanUgandaCommentExtractor:
    def __init__(self, api_key):
        """Initialize with YouTube API key."""
        self.api_key = api_key
        self.youtube = build('youtube', 'v3', developerKey=api_key)

        # Target channels (2 per category)
        self.target_channels = {
            'Music & Entertainment': [
                {'name': 'UGXTRA Music', 'search_term': 'UGXTRA Music'},
                {'name': 'Elijah Kitaka', 'search_term': 'Elijah Kitaka'}
            ],
            'Comedy & Lifestyle': [
                {'name': 'UGXTRA Comedy', 'search_term': 'UGXTRA Comedy'},
                {'name': 'Comedy Store', 'search_term': 'Comedy Store Uganda'}
            ],
            'News & Current Affairs': [
                {'name': 'Kasuku Live', 'search_term': 'Kasuku Live'},
                {'name': 'NTV Akawungeezi', 'search_term': 'NTV Akawungeezi'}
            ],
            'Sports & Events': [
                {'name': 'Uganda Fan TV', 'search_term': 'Uganda Fan TV'},
                {'name': 'KWEZI MEDIA GROUP', 'search_term': 'KWEZI MEDIA GROUP'}
            ],
            'Politics & Social Issues': [
                {'name': 'Ug Wolokoso Extra', 'search_term': 'Ug Wolokoso Extra'},
                {'name': 'BBS Terefayina', 'search_term': 'BBS Terefayina'}
            ]
        }

    def find_channel_id(self, search_term):
        """Find channel ID using multiple search strategies - Uganda only."""
        try:
            # Strategy 1: Direct search with Uganda region
            request = self.youtube.search().list(
                part="snippet",
                q=search_term,
                type="channel",
                maxResults=10,  # Increased from 5
                regionCode="UG"
            )
            response = request.execute()

            if response['items']:
                return response['items'][0]['snippet']['channelId']

            # Strategy 2: Try variations of the search term (still Uganda only)
            variations = [
                search_term.replace(" Official", ""),
                search_term.replace(" ", ""),
                f"{search_term} channel",
                f"{search_term} TV",
                f"{search_term} Uganda"
            ]

            for variation in variations:
                print(f"   🔄 Trying Uganda variation: {variation}")
                request = self.youtube.search().list(
                    part="snippet",
                    q=variation,
                    type="channel",
                    maxResults=10,
                    regionCode="UG"  # Keep Uganda region code
                )
                response = request.execute()

                if response['items']:
                    return response['items'][0]['snippet']['channelId']

            return None

        except Exception as e:
            print(f"❌ Error finding channel '{search_term}': {e}")
            return None

    def get_trending_video(self, channel_id, channel_name):
        """Get the most trending video from a channel with flexible criteria."""
        try:
            # Try different time ranges if needed
            time_ranges = [
                (180, "6 months"),  # Original
                (365, "1 year"),    # Expand to 1 year
                (730, "2 years")    # Expand to 2 years if needed
            ]

            for days, period_name in time_ranges:
                print(f"   🔍 Searching {period_name} of videos...")

                # The API expects an RFC 3339 timestamp; build it from UTC rather than local time
                time_ago = (datetime.utcnow() - timedelta(days=days)).strftime('%Y-%m-%dT%H:%M:%SZ')

                request = self.youtube.search().list(
                    part="snippet",
                    channelId=channel_id,
                    type="video",
                    order="relevance",
                    maxResults=20,  # Increased from 10
                    publishedAfter=time_ago
                )
                response = request.execute()

                if not response['items']:
                    continue

                # Get video statistics with flexible comment threshold
                video_stats = []
                for item in response['items']:
                    video_id = item['id']['videoId']
                    title = item['snippet']['title']

                    # Get detailed stats
                    stats_request = self.youtube.videos().list(
                        part="statistics",
                        id=video_id
                    )
                    stats_response = stats_request.execute()

                    if stats_response['items']:
                        stats = stats_response['items'][0]['statistics']
                        comment_count = int(stats.get('commentCount', 0))
                        view_count = int(stats.get('viewCount', 0))

                        # Flexible comment threshold: try 50+, then 20+, then 10+
                        min_comments = 50 if days <= 180 else (20 if days <= 365 else 10)

                        if comment_count >= min_comments:
                            engagement_score = comment_count + (view_count * 0.001)
                            video_stats.append({
                                'video_id': video_id,
                                'title': title,
                                'comment_count': comment_count,
                                'view_count': view_count,
                                'engagement_score': engagement_score,
                                'period': period_name
                            })

                if video_stats:
                    # Found videos with comments - return the best one
                    trending_video = max(video_stats, key=lambda x: x['engagement_score'])
                    print(f"   📹 Found: '{trending_video['title'][:50]}...' ({trending_video['comment_count']} comments, {trending_video['period']})")
                    return trending_video

                print(f"   ⚠️  No videos with {50 if days <= 180 else (20 if days <= 365 else 10)}+ comments in {period_name}")

            # If still no luck, get ANY video with comments
            print(f"   🔄 Trying any video with comments...")
            request = self.youtube.search().list(
                part="snippet",
                channelId=channel_id,
                type="video",
                order="relevance",
                maxResults=50
            )
            response = request.execute()

            for item in response['items']:
                video_id = item['id']['videoId']
                title = item['snippet']['title']

                stats_request = self.youtube.videos().list(part="statistics", id=video_id)
                stats_response = stats_request.execute()

                if stats_response['items']:
                    stats = stats_response['items'][0]['statistics']
                    comment_count = int(stats.get('commentCount', 0))

                    if comment_count >= 5:  # Any video with 5+ comments
                        print(f"   📹 Fallback: '{title[:50]}...' ({comment_count} comments)")
                        return {
                            'video_id': video_id,
                            'title': title,
                            'comment_count': comment_count,
                            'view_count': int(stats.get('viewCount', 0)),
                            'engagement_score': comment_count,
                            'period': 'all time'
                        }

            print(f"   ❌ No videos with comments found for {channel_name}")
            return None

        except Exception as e:
            print(f"   ❌ Error getting video for {channel_name}: {e}")
            return None

    def extract_comments(self, video_id, video_title, target_comments=50):
        """Extract comments with flexible target."""
        print(f"      💬 Extracting from: {video_title[:40]}...")

        comments = []
        next_page_token = None
        max_attempts = 5  # Prevent infinite loops
        attempts = 0

        try:
            while len(comments) < target_comments and attempts < max_attempts:
                attempts += 1

                request = self.youtube.commentThreads().list(
                    part="snippet,replies",
                    videoId=video_id,
                    maxResults=100,
                    pageToken=next_page_token,
                    order="relevance"
                )
                response = request.execute()

                for item in response['items']:
                    comment_data = item['snippet']['topLevelComment']['snippet']

                    comment_info = {
                        'comment_id': item['snippet']['topLevelComment']['id'],
                        'text': comment_data['textDisplay'],
                        'author': comment_data['authorDisplayName'],
                        'likes': comment_data['likeCount'],
                        'published': comment_data['publishedAt'],
                        'video_id': video_id,
                        'video_title': video_title
                    }
                    comments.append(comment_info)

                    # Add replies (limit to 2 per comment to get more variety)
                    if 'replies' in item and len(comments) < target_comments:
                        for reply_item in item['replies']['comments'][:2]:
                            if len(comments) >= target_comments:
                                break
                            reply_data = reply_item['snippet']
                            reply_info = {
                                'comment_id': reply_item['id'],
                                'text': reply_data['textDisplay'],
                                'author': reply_data['authorDisplayName'],
                                'likes': reply_data['likeCount'],
                                'published': reply_data['publishedAt'],
                                'video_id': video_id,
                                'video_title': video_title,
                                'is_reply': True,
                                'parent_id': item['snippet']['topLevelComment']['id']
                            }
                            comments.append(reply_info)

                # Check for more pages
                if 'nextPageToken' in response:
                    next_page_token = response['nextPageToken']
                else:
                    break  # No more pages

            print(f"      ✅ Extracted {len(comments)} comments")
            return comments

        except Exception as e:
            print(f"      ❌ Error extracting comments: {e}")
            return []

    def filter_comments(self, comments):
        """Minimal filtering - remove obvious spam only."""
        print("🇺🇬 Applying minimal filtering...")

        filtered_comments = []
        spam_patterns = ['first!', 'first comment', 'subscribe', 'follow me', 'sub 4 sub']

        for comment in comments:
            text = comment['text'].lower().strip()

            # Skip very short comments
            if len(text) < 3:
                continue

            # Skip spam
            if any(pattern in text for pattern in spam_patterns):
                continue

            # Skip pure symbols/emoji
            if re.match(r'^[\W\d_]+$', text):
                continue

            # Keep everything else
            filtered_comments.append(comment)

        removed_count = len(comments) - len(filtered_comments)
        print(f"      ✅ Kept {len(filtered_comments)} comments")
        print(f"      ❌ Filtered out {removed_count} short, spam, or symbol-only comments")

        return filtered_comments

    def get_next_version(self):
        """Get next version number."""
        existing_files = glob.glob("uganda_comments_v*.csv")

        if not existing_files:
            return 1

        versions = []
        for filename in existing_files:
            try:
                version_str = filename.split('_v')[1].split('.')[0]
                versions.append(int(version_str))
            except (IndexError, ValueError):
                continue

        return max(versions) + 1 if versions else 1

    def collect_all_comments(self):
        """Main collection function."""
        print("🇺🇬 UGANDA YOUTUBE COMMENT COLLECTION")
        print("=" * 50)
        print("Target: 2 channels per category, one trending video each, about 50 comments per video")
        print()

        all_comments = []
        channel_summary = []

        for category, channels in self.target_channels.items():
            print(f"📂 Category: {category}")
            print("-" * 40)

            for channel_info in channels:
                channel_name = channel_info['name']
                search_term = channel_info['search_term']

                print(f"🔍 Processing: {channel_name}")

                # Find channel
                channel_id = self.find_channel_id(search_term)
                if not channel_id:
                    print(f"   ❌ Could not find channel: {channel_name}")
                    continue

                # Get trending video
                trending_video = self.get_trending_video(channel_id, channel_name)
                if not trending_video:
                    continue

                # Extract comments
                comments = self.extract_comments(
                    trending_video['video_id'],
                    trending_video['title'],
                    target_comments=50
                )

                if comments:
                    # Filter comments
                    filtered_comments = self.filter_comments(comments)

                    # Add metadata to each comment
                    for comment in filtered_comments:
                        comment['category'] = category
                        comment['channel_name'] = channel_name

                    all_comments.extend(filtered_comments)

                    # Add to summary
                    channel_summary.append({
                        'category': category,
                        'channel_name': channel_name,
                        'video_title': trending_video['title'],
                        'video_views': trending_video['view_count'],
                        'total_comments': len(comments),
                        'filtered_comments': len(filtered_comments),
                        'video_id': trending_video['video_id']
                    })

                    print(f"   ✅ {len(filtered_comments)} quality comments collected")
                else:
                    print(f"   ❌ No comments collected from {channel_name}")

                print()
                time.sleep(1)  # Rate limiting

            print()

        # Save results
        self.save_results(all_comments, channel_summary)

        return all_comments, channel_summary

    def save_results(self, all_comments, channel_summary):
        """Save results with versioned naming."""
        version = self.get_next_version()

        # Save comments
        if all_comments:
            df_comments = pd.DataFrame(all_comments)
            comments_filename = f"uganda_comments_v{version}.csv"
            df_comments.to_csv(comments_filename, index=False, encoding='utf-8')
            print(f"💾 Saved {len(all_comments)} comments to: {comments_filename}")

        # Save summary
        if channel_summary:
            df_summary = pd.DataFrame(channel_summary)
            summary_filename = f"channel_summary_v{version}.csv"
            df_summary.to_csv(summary_filename, index=False, encoding='utf-8')
            print(f"📊 Saved summary to: {summary_filename}")

        # Save metadata
        metadata = {
            'version': version,
            'collection_date': datetime.now().isoformat(),
            'total_comments': len(all_comments),
            'channels_processed': len(channel_summary),
            'categories': list(self.target_channels.keys())
        }

        metadata_filename = f"collection_metadata_v{version}.json"
        with open(metadata_filename, 'w') as f:
            json.dump(metadata, f, indent=2)
        print(f"📋 Saved metadata to: {metadata_filename}")

        # Summary
        print(f"\n🎉 COLLECTION COMPLETE!")
        print("=" * 30)
        print(f"📊 Total comments: {len(all_comments)}")
        print(f"📺 Channels processed: {len(channel_summary)}")

        if all_comments:
            print(f"\n📈 Comments by category:")
            category_counts = {}
            for comment in all_comments:
                category = comment['category']
                category_counts[category] = category_counts.get(category, 0) + 1

            for category, count in category_counts.items():
                print(f"   {category}: {count} comments")

def main():
    """Main function."""
    try:
        from google.colab import userdata
        YOUTUBE_API_KEY = userdata.get('YouTube_API_Key')
    except Exception:
        # Not running in Colab, or the secret is missing or not shared with this notebook
        YOUTUBE_API_KEY = "your-youtube-api-key-here"

    if YOUTUBE_API_KEY == "your-youtube-api-key-here":
        print("❌ Please set YouTube_API_Key in Colab secrets!")
        return None, None

    try:
        extractor = CleanUgandaCommentExtractor(YOUTUBE_API_KEY)
        comments, summary = extractor.collect_all_comments()

        print(f"\n🚀 Ready for sentiment analysis!")
        return comments, summary

    except Exception as e:
        print(f"❌ Collection failed: {e}")
        return None, None

if __name__ == "__main__":
    main()
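
The selection rule inside `get_trending_video` can be checked offline, without spending API quota. The sketch below is a minimal illustration under stated assumptions: the helper names `min_comments_for` and `pick_trending` and the sample records are hypothetical and not part of the notebook. It reuses the same engagement formula (comment count plus 0.001 × view count) and the same tiered comment thresholds as the class above.

```python
def min_comments_for(days):
    """Tiered comment threshold: 50 within 6 months, 20 within a year, otherwise 10."""
    if days <= 180:
        return 50
    if days <= 365:
        return 20
    return 10


def pick_trending(videos, days):
    """Return the highest-engagement video that meets the threshold, or None."""
    threshold = min_comments_for(days)
    eligible = [v for v in videos if v["comment_count"] >= threshold]
    if not eligible:
        return None
    # Same score as the extractor: comments dominate, views act as a tie-breaker
    return max(eligible, key=lambda v: v["comment_count"] + v["view_count"] * 0.001)


# Hypothetical records, made up for illustration only
sample_videos = [
    {"video_id": "a", "comment_count": 60, "view_count": 10_000},
    {"video_id": "b", "comment_count": 55, "view_count": 40_000},
    {"video_id": "c", "comment_count": 30, "view_count": 900_000},
]

best = pick_trending(sample_videos, days=180)
print(best["video_id"])
```

With a 180-day window, video `c` is excluded by the 50-comment threshold even though it has far more views. Widening the window lowers the threshold, which is why the extractor retries with longer periods.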

# 4. GANDA GEMMA MODEL SETUP

In [None]:
# Load the CraneAILabs Ganda Gemma model for sentiment analysis.

class LugandaSentimentAnalyzer:
    def __init__(self, model_name="CraneAILabs/ganda-gemma-1b"):
        print("🇺🇬 Loading Ganda Gemma model...")
        start_time = time.time()

        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            token=HF_TOKEN if HF_TOKEN != "your-hf-token-here" else None,
            trust_remote_code=True
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            token=HF_TOKEN if HF_TOKEN != "your-hf-token-here" else None,
            torch_dtype=torch.float32,
            device_map="cpu",
            trust_remote_code=True,
            low_cpu_mem_usage=True
        )

        # Set pad token
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
            self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

        load_time = time.time() - start_time
        print(f"Ganda Gemma loaded in {load_time:.2f} seconds")

    def analyze_sentiment(self, comment):
        """Analyze sentiment of a Luganda comment."""
        # Clean comment
        clean_comment = re.sub(r'<[^>]+>', ' ', comment).strip()
        if len(clean_comment) < 5:
            return "kibi"

        # Create prompt
        prompt = f"""Analyze the sentiment of this Luganda comment. If the comment expresses positive feelings, happiness, love, or praise, respond with "kirungi". If the comment expresses negative feelings, sadness, anger, or criticism, respond with "kibi". Respond with only one word.

Comment: {clean_comment}
Sentiment:"""

        try:
            # Tokenize and generate
            inputs = self.tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)

            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=10,
                    temperature=0.3,
                    do_sample=True,
                    top_p=0.7
                )

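            # Extract sentiment: decode only the newly generated tokens, because the
            # decoded prompt is not guaranteed to match the input string character for character
            prompt_length = inputs['input_ids'].shape[1]
            response = self.tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
            response_clean = response.strip().lower()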

            # Map to Ugandan labels
            if 'kirungi' in response_clean or any(word in response_clean for word in ['positive', 'good', 'happy']):
                return "kirungi"
            else:
                return "kibi"

        except Exception as e:
            print(f"Analysis error: {e}")
            return "kibi"

# Initialize the analyzer
if HF_TOKEN != "your-hf-token-here":
    analyzer = LugandaSentimentAnalyzer()
    print("Sentiment analyzer ready!")
else:
    print("Add your Hugging Face token to load Ganda Gemma")
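
The label-mapping step in `analyze_sentiment` can be isolated as a pure function, which makes it testable without loading the model. This is a minimal sketch, assuming a hypothetical helper name `map_response_to_label`; it mirrors the keyword rule used above and is not a replacement for the class.

```python
# Keywords that the notebook treats as a positive answer from the model
POSITIVE_KEYWORDS = ("kirungi", "positive", "good", "happy")


def map_response_to_label(generated_text):
    """Map raw generated text to 'kirungi' or 'kibi' using substring matching."""
    text = generated_text.strip().lower()
    if any(keyword in text for keyword in POSITIVE_KEYWORDS):
        return "kirungi"
    # Anything else, including empty or unrecognized output, falls back to 'kibi'
    return "kibi"


for raw in ["Kirungi", " kibi\n", "Positive.", ""]:
    print(repr(raw), "->", map_response_to_label(raw))
```

Because every unrecognized response falls back to "kibi", the negative count also absorbs outputs the model failed to format. Substring matching also means a response such as "not good" would count as positive.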

# 5. SENTIMENT ANALYSIS

In [None]:
# Load your collected data
df_comments = pd.read_csv('uganda_comments_v1.csv')
print(f"✅ Loaded {len(df_comments)} comments from uganda_comments_v1.csv")

# Show a quick preview
print(f"📊 Comments by category:")
print(df_comments['category'].value_counts())

In [None]:
df_comments.tail(5)
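
The `text` column comes from the API's `textDisplay` field, which can carry HTML markup such as `<br>` and character entities such as `&#39;`. The sketch below shows one way to normalize such text before analysis. The helper name `clean_comment_text` and the sample string are assumptions for illustration; the analyzer itself only strips tags.

```python
import html
import re


def clean_comment_text(raw):
    """Strip HTML tags, decode entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)      # same tag pattern the analyzer uses
    unescaped = html.unescape(no_tags)           # e.g. &#39; -> ' and &amp; -> &
    return re.sub(r"\s+", " ", unescaped).strip()


# Hypothetical comment text, made up for illustration only
sample = "Webale nnyo!<br>Tukwagala &#39;nnyo&#39; &amp; more"
print(clean_comment_text(sample))
```

It could be applied to the whole column with `df_comments['text'].apply(clean_comment_text)` before running the analyzer.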

In [None]:
#Analyze the sentiment of our Luganda comments using authentic Ugandan labels.

#Labels
#Kirungi = Positive sentiment (good, nice, happy)
#Kibi = Negative sentiment (bad, sad, angry)


# Test with sample comments first
test_comments = [
    "Nsanyuse nnyo! Eddy Kenzo webale kutuwereza!",  # Should be kirungi
    "Alina omutima omubi",                           # Should be kibi
    "Enkola ya government emalamu amaanyi",          # Should be kibi
]

if 'analyzer' in globals():
    print("🧪 Testing sentiment analysis:")
    for comment in test_comments:
        sentiment = analyzer.analyze_sentiment(comment)
        emoji = "😊" if sentiment == "kirungi" else "😞"
        print(f"{emoji} \"{comment}\" → {sentiment}")

    # Analyze all comments
    print(f"\nAnalyzing {len(df_comments)} comments...")

    sentiments = []
    for comment in df_comments['text']:
        sentiment = analyzer.analyze_sentiment(comment)
        sentiments.append(sentiment)

    df_comments['sentiment'] = sentiments
    print("Sentiment analysis complete!")

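else:
    print("Using sample sentiment data")
    # Placeholder labels for demonstration, repeated so the list matches the number of rows
    sample_labels = ['kirungi', 'kibi', 'kirungi', 'kibi', 'kirungi']
    df_comments['sentiment'] = [sample_labels[i % len(sample_labels)] for i in range(len(df_comments))]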
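
Before trusting labels for the full dataset, it can help to compare predictions against a few hand-labeled comments. The sketch below is one way to do that, reusing the three test comments and their expected labels from the cell above. `evaluate_labels` and `keyword_stub` are hypothetical names; the stub only stands in for `analyzer.analyze_sentiment` so the example runs without the model.

```python
def evaluate_labels(predict, labeled_examples):
    """Return (accuracy, mismatches) for a predict(text) -> label callable."""
    if not labeled_examples:
        raise ValueError("labeled_examples must not be empty")
    mismatches = []
    for text, expected in labeled_examples:
        predicted = predict(text)
        if predicted != expected:
            mismatches.append((text, expected, predicted))
    accuracy = 1 - len(mismatches) / len(labeled_examples)
    return accuracy, mismatches


labeled_examples = [
    ("Nsanyuse nnyo! Eddy Kenzo webale kutuwereza!", "kirungi"),
    ("Alina omutima omubi", "kibi"),
    ("Enkola ya government emalamu amaanyi", "kibi"),
]


def keyword_stub(text):
    # Hypothetical stand-in for analyzer.analyze_sentiment, NOT a real classifier
    return "kirungi" if "webale" in text.lower() else "kibi"


accuracy, mismatches = evaluate_labels(keyword_stub, labeled_examples)
print(f"Accuracy: {accuracy:.0%}, mismatches: {len(mismatches)}")
```

In the notebook, `analyzer.analyze_sentiment` would be passed in place of the stub. Three examples are far too few to measure model quality; a larger hand-labeled sample would be needed for that.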

# 6. RESULTS ANALYSIS

In [None]:
df_comments.query("sentiment == 'kibi'").head(30)

## SENTIMENT ANALYSIS SUMMARY


In [None]:
"""
Professional summary of Luganda sentiment analysis results.
Provides core statistics and sample examples without visual embellishments.
"""
def generate_professional_sentiment_summary(df_comments):
    """
    Generate a professional summary of sentiment analysis results.

    Parameters:
    df_comments (pd.DataFrame): DataFrame with 'sentiment' and 'text' columns

    Returns:
    dict: Summary statistics for further use
    """

    # Calculate sentiment distribution
    sentiment_counts = df_comments['sentiment'].value_counts()
    total_comments = len(df_comments)

    print("LUGANDA SENTIMENT ANALYSIS RESULTS")
    print("=" * 50)
    print(f"Total Comments Analyzed: {total_comments:,}")
    print(f"Analysis Date: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M')}")
    print()

    # Overall statistics
    print("SENTIMENT DISTRIBUTION")
    print("-" * 25)

    summary_stats = {}
    for sentiment, count in sentiment_counts.items():
        percentage = (count / total_comments) * 100
        label = "Positive (Good)" if sentiment == "kirungi" else "Negative (Bad)"

        print(f"{sentiment.capitalize()} - {label}:")
        print(f"  Count: {count:,}")
        print(f"  Percentage: {percentage:.1f}%")
        print()

        summary_stats[sentiment] = {
            'count': count,
            'percentage': percentage,
            'label': label
        }

    # Calculate and interpret ratio
    kirungi_count = sentiment_counts.get('kirungi', 0)
    kibi_count = sentiment_counts.get('kibi', 0)

    if kirungi_count > 0 and kibi_count > 0:
        ratio = kirungi_count / kibi_count
        print("SENTIMENT RATIO ANALYSIS")
        print("-" * 25)
        print(f"Kirungi:Kibi Ratio: {ratio:.2f}:1")

        # Professional interpretation
        if ratio > 2:
            interpretation = "Very positive audience reaction"
            recommendation = "Strong positive engagement indicates high content quality and audience satisfaction."
        elif ratio > 1:
            interpretation = "Generally positive audience reaction"
            recommendation = "Positive engagement with room for improvement in addressing negative feedback."
        elif ratio > 0.5:
            interpretation = "Mixed audience reaction"
            recommendation = "Balanced feedback suggests need for content strategy review."
        else:
            interpretation = "Predominantly negative audience reaction"
            recommendation = "Significant negative feedback requires immediate content strategy reassessment."

        print(f"Interpretation: {interpretation}")
        print(f"Recommendation: {recommendation}")
        print()

        summary_stats['ratio'] = ratio
        summary_stats['interpretation'] = interpretation
        summary_stats['recommendation'] = recommendation

    # Sample examples
    print("REPRESENTATIVE EXAMPLES")
    print("-" * 25)

    for sentiment in ['kirungi', 'kibi']:
        if sentiment in sentiment_counts:
            samples = df_comments[df_comments['sentiment'] == sentiment].head(3)
            label = "POSITIVE (Kirungi)" if sentiment == "kirungi" else "NEGATIVE (Kibi)"

            print(f"\n{label} Examples:")

            for i, (_, row) in enumerate(samples.iterrows(), 1):
                comment = row['text']
                # Truncate long comments professionally
                display_comment = comment if len(comment) <= 80 else comment[:77] + "..."
                likes = row.get('likes', 0)

                print(f"  {i}. \"{display_comment}\"")
                print(f"     Engagement: {likes} likes")
                if i < len(samples):
                    print()

    # Summary statistics for return
    summary_stats['total_comments'] = total_comments
    summary_stats['analysis_timestamp'] = pd.Timestamp.now()

    return summary_stats

def display_key_metrics(summary_stats):
    """
    Display key metrics in a clean format.

    Parameters:
    summary_stats (dict): Summary statistics from generate_professional_sentiment_summary
    """

    print("\n" + "=" * 50)
    print("KEY PERFORMANCE INDICATORS")
    print("=" * 50)

    total = summary_stats['total_comments']
    kirungi_pct = summary_stats.get('kirungi', {}).get('percentage', 0)
    kibi_pct = summary_stats.get('kibi', {}).get('percentage', 0)
    ratio = summary_stats.get('ratio', 0)

    print(f"Sample Size:           {total:,} comments")
    print(f"Positive Rate:         {kirungi_pct:.1f}%")
    print(f"Negative Rate:         {kibi_pct:.1f}%")
    print(f"Positivity Ratio:      {ratio:.2f}:1")

    # Quality assessment
    if kirungi_pct >= 60:
        quality_score = "Excellent"
    elif kirungi_pct >= 50:
        quality_score = "Good"
    elif kirungi_pct >= 40:
        quality_score = "Fair"
    else:
        quality_score = "Poor"

    print(f"Content Quality:       {quality_score}")

    # Statistical confidence (basic)
    confidence_level = "High" if total >= 500 else ("Medium" if total >= 100 else "Low")
    print(f"Statistical Confidence: {confidence_level}")

def main():
    """
    Main function to run the professional sentiment summary.
    Assumes df_comments exists in the global scope.
    """

    # Check if df_comments exists
    try:
        # This will be your loaded DataFrame
        global df_comments

        # Generate professional summary
        stats = generate_professional_sentiment_summary(df_comments)

        # Display key metrics
        display_key_metrics(stats)

        return stats

    except NameError:
        print("Error: df_comments not found. Please load your sentiment analysis data first.")
        print("Example: df_comments = pd.read_csv('uganda_comments_v1_with_sentiment.csv')")
        return None

# Run the analysis
if __name__ == "__main__":
    # Assuming df_comments is loaded
    summary_statistics = main()
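
The "Statistical Confidence" label above depends on sample size alone. One alternative, which is a swapped-in technique and not what the notebook computes, is to report a Wilson score interval for the positive rate. The function name `wilson_interval` and the counts in the usage line are assumptions for illustration.

```python
import math


def wilson_interval(positive_count, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = positive_count / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts, made up for illustration only
low, high = wilson_interval(positive_count=60, total=100)
print(f"Positive rate interval: {low:.3f} to {high:.3f}")
```

The interval reflects sampling uncertainty only. It does not account for classification errors made by the model, so it should be read as a lower bound on the real uncertainty.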

## CATEGORY-BASED ANALYSIS

In [None]:
"""
Analyzes sentiment patterns across different content categories, channels, and video types.
Provides insights for content creators and stakeholders about audience preferences.
"""
def analyze_sentiment_by_category(df_comments):
    """
    Analyze sentiment distribution across content categories.

    Parameters:
    df_comments (pd.DataFrame): DataFrame with 'sentiment', 'category' columns

    Returns:
    pd.DataFrame: Category analysis results
    """

    print("SENTIMENT ANALYSIS BY CONTENT CATEGORY")
    print("=" * 50)

    # Group by category and sentiment
    category_sentiment = df_comments.groupby(['category', 'sentiment']).size().unstack(fill_value=0)

    # Calculate percentages and totals
    category_analysis = pd.DataFrame()
    category_analysis['Total_Comments'] = category_sentiment.sum(axis=1)
    category_analysis['Kirungi_Count'] = category_sentiment.get('kirungi', 0)
    category_analysis['Kibi_Count'] = category_sentiment.get('kibi', 0)
    category_analysis['Kirungi_Percentage'] = (category_analysis['Kirungi_Count'] / category_analysis['Total_Comments'] * 100).round(1)
    category_analysis['Kibi_Percentage'] = (category_analysis['Kibi_Count'] / category_analysis['Total_Comments'] * 100).round(1)
    category_analysis['Positivity_Ratio'] = (category_analysis['Kirungi_Count'] / category_analysis['Kibi_Count'].replace(0, 1)).round(2)

    # Sort by positivity ratio (best performing first)
    category_analysis = category_analysis.sort_values('Positivity_Ratio', ascending=False)

    print("CATEGORY PERFORMANCE RANKING")
    print("-" * 35)
    print(f"{'Category':<25} {'Total':<8} {'Positive':<10} {'Negative':<10} {'Ratio':<8}")
    print("-" * 70)

    for category, row in category_analysis.iterrows():
        # iterrows() upcasts mixed int/float rows to float, so cast the count back to int
        print(f"{category:<25} {int(row['Total_Comments']):<8} "
              f"{row['Kirungi_Percentage']:>6.1f}% {row['Kibi_Percentage']:>9.1f}% "
              f"{row['Positivity_Ratio']:>7.2f}")

    print()

    # Category insights
    print("CATEGORY INSIGHTS")
    print("-" * 20)

    best_category = category_analysis.index[0]
    worst_category = category_analysis.index[-1]
    most_commented = category_analysis['Total_Comments'].idxmax()

    print(f"Best Performing Category:  {best_category}")
    print(f"  Positivity Ratio: {category_analysis.loc[best_category, 'Positivity_Ratio']:.2f}:1")
    print(f"  Positive Rate: {category_analysis.loc[best_category, 'Kirungi_Percentage']:.1f}%")
    print()

    print(f"Most Commented Category:   {most_commented}")
    print(f"  Total Comments: {category_analysis.loc[most_commented, 'Total_Comments']:,}")
    print(f"  Engagement Level: High")
    print()

    print(f"Needs Improvement:         {worst_category}")
    print(f"  Positivity Ratio: {category_analysis.loc[worst_category, 'Positivity_Ratio']:.2f}:1")
    print(f"  Negative Rate: {category_analysis.loc[worst_category, 'Kibi_Percentage']:.1f}%")
    print()

    return category_analysis

def analyze_sentiment_by_channel(df_comments):
    """
    Analyze sentiment distribution across different channels.

    Parameters:
    df_comments (pd.DataFrame): DataFrame with 'sentiment', 'channel_name' columns

    Returns:
    pd.DataFrame: Channel analysis results
    """

    print("\n" + "=" * 50)
    print("SENTIMENT ANALYSIS BY CHANNEL")
    print("=" * 50)

    # Group by channel and sentiment
    channel_sentiment = df_comments.groupby(['channel_name', 'sentiment']).size().unstack(fill_value=0)

    # Calculate metrics
    channel_analysis = pd.DataFrame()
    channel_analysis['Total_Comments'] = channel_sentiment.sum(axis=1)
    channel_analysis['Kirungi_Count'] = channel_sentiment.get('kirungi', 0)
    channel_analysis['Kibi_Count'] = channel_sentiment.get('kibi', 0)
    channel_analysis['Kirungi_Percentage'] = (channel_analysis['Kirungi_Count'] / channel_analysis['Total_Comments'] * 100).round(1)
    channel_analysis['Positivity_Ratio'] = (channel_analysis['Kirungi_Count'] / channel_analysis['Kibi_Count'].replace(0, 1)).round(2)

    # Sort by total comments (most engaging first)
    channel_analysis = channel_analysis.sort_values('Total_Comments', ascending=False)

    print("CHANNEL PERFORMANCE OVERVIEW")
    print("-" * 40)
    print(f"{'Channel':<20} {'Comments':<10} {'Positive%':<10} {'Ratio':<8}")
    print("-" * 55)

    for channel, row in channel_analysis.iterrows():
        # iterrows() upcasts mixed int/float rows to float, so cast the count back to int
        print(f"{channel:<20} {int(row['Total_Comments']):<10} "
              f"{row['Kirungi_Percentage']:>7.1f}% {row['Positivity_Ratio']:>9.2f}")

    # Channel insights
    print("\nCHANNEL INSIGHTS")
    print("-" * 20)

    top_engagement = channel_analysis.index[0]
    top_positivity = channel_analysis['Positivity_Ratio'].idxmax()

    print(f"Highest Engagement:        {top_engagement}")
    print(f"  Total Comments: {channel_analysis.loc[top_engagement, 'Total_Comments']:,}")
    print(f"  Positive Rate: {channel_analysis.loc[top_engagement, 'Kirungi_Percentage']:.1f}%")
    print()

    print(f"Most Positive Channel:     {top_positivity}")
    print(f"  Positivity Ratio: {channel_analysis.loc[top_positivity, 'Positivity_Ratio']:.2f}:1")
    print(f"  Comments: {channel_analysis.loc[top_positivity, 'Total_Comments']:,}")
    print()

    return channel_analysis

def generate_content_strategy_recommendations(category_analysis, channel_analysis):
    """
    Generate actionable recommendations based on category and channel analysis.

    Parameters:
    category_analysis (pd.DataFrame): Results from analyze_sentiment_by_category
    channel_analysis (pd.DataFrame): Results from analyze_sentiment_by_channel

    Returns:
    dict: Structured recommendations
    """

    print("\n" + "=" * 50)
    print("CONTENT STRATEGY RECOMMENDATIONS")
    print("=" * 50)

    recommendations = {
        'high_priority': [],
        'medium_priority': [],
        'opportunities': []
    }

    # High Priority Recommendations
    print("HIGH PRIORITY ACTIONS")
    print("-" * 25)

    # Find categories with low positivity
    low_positive_categories = category_analysis[category_analysis['Kirungi_Percentage'] < 50]

    if not low_positive_categories.empty:
        for category in low_positive_categories.index:
            pct = category_analysis.loc[category, 'Kirungi_Percentage']
            rec = f"Improve {category} content quality (currently {pct:.1f}% positive)"
            print(f"• {rec}")
            recommendations['high_priority'].append(rec)

    # Find channels with high negative sentiment
    high_negative_channels = channel_analysis[channel_analysis['Kirungi_Percentage'] < 45]

    if not high_negative_channels.empty:
        for channel in high_negative_channels.index:
            pct = channel_analysis.loc[channel, 'Kirungi_Percentage']
            rec = f"Review {channel} content strategy ({pct:.1f}% positive rate)"
            print(f"• {rec}")
            recommendations['high_priority'].append(rec)

    print()

    # Medium Priority Recommendations
    print("MEDIUM PRIORITY ACTIONS")
    print("-" * 27)

    # Categories with good but improvable performance
    medium_categories = category_analysis[
        (category_analysis['Kirungi_Percentage'] >= 50) &
        (category_analysis['Kirungi_Percentage'] < 70)
    ]

    for category in medium_categories.index:
        pct = category_analysis.loc[category, 'Kirungi_Percentage']
        rec = f"Optimize {category} content to reach 70%+ positive rate (currently {pct:.1f}%)"
        print(f"• {rec}")
        recommendations['medium_priority'].append(rec)

    print()

    # Opportunities
    print("GROWTH OPPORTUNITIES")
    print("-" * 25)

    # Best performing categories to expand
    top_categories = category_analysis.head(2)

    for category in top_categories.index:
        ratio = category_analysis.loc[category, 'Positivity_Ratio']
        rec = f"Expand {category} content production (strong {ratio:.2f}:1 ratio)"
        print(f"• {rec}")
        recommendations['opportunities'].append(rec)

    # High engagement channels
    top_channels = channel_analysis.head(2)

    for channel in top_channels.index:
        comments = channel_analysis.loc[channel, 'Total_Comments']
        rec = f"Collaborate more with {channel} (high engagement: {comments:,} comments)"
        print(f"• {rec}")
        recommendations['opportunities'].append(rec)

    print()

    return recommendations

def calculate_category_benchmarks(category_analysis):
    """
    Calculate dataset-level benchmarks (mean, median, and mean ± one standard
    deviation of category positivity) and classify each category against them.

    Parameters:
    category_analysis (pd.DataFrame): Category analysis results

    Returns:
    dict: Benchmark metrics
    """

    print("\n" + "=" * 50)
    print("PERFORMANCE BENCHMARKS")
    print("=" * 50)

    # Calculate benchmarks
    avg_positivity = category_analysis['Kirungi_Percentage'].mean()
    median_positivity = category_analysis['Kirungi_Percentage'].median()
    avg_engagement = category_analysis['Total_Comments'].mean()
    # std() is NaN when there is only one category; fall back to 0 so the thresholds stay usable
    std_positivity = category_analysis['Kirungi_Percentage'].std()
    if pd.isna(std_positivity):
        std_positivity = 0.0

    benchmarks = {
        'average_positivity': avg_positivity,
        'median_positivity': median_positivity,
        'average_engagement': avg_engagement,
        'excellence_threshold': avg_positivity + std_positivity,
        'improvement_threshold': avg_positivity - std_positivity
    }

    print("UGANDA CONTENT BENCHMARKS")
    print("-" * 30)
    print(f"Average Positivity Rate:    {avg_positivity:.1f}%")
    print(f"Median Positivity Rate:     {median_positivity:.1f}%")
    print(f"Average Comments per Category: {avg_engagement:.0f}")
    print(f"Excellence Threshold:       {benchmarks['excellence_threshold']:.1f}%")
    print(f"Improvement Needed Below:   {benchmarks['improvement_threshold']:.1f}%")
    print()

    # Performance classification
    print("CATEGORY PERFORMANCE CLASSIFICATION")
    print("-" * 40)

    for category, row in category_analysis.iterrows():
        pct = row['Kirungi_Percentage']

        if pct >= benchmarks['excellence_threshold']:
            status = "Excellent"
        elif pct >= avg_positivity:
            status = "Above Average"
        elif pct >= benchmarks['improvement_threshold']:
            status = "Below Average"
        else:
            status = "Needs Improvement"

        print(f"{category:<25} {status}")

    return benchmarks

def run_category_analysis(df_comments):
    """
    Run complete category-based sentiment analysis.

    Parameters:
    df_comments (pd.DataFrame): DataFrame with sentiment analysis results

    Returns:
    dict: Complete analysis results
    """

    # Run all analyses
    category_results = analyze_sentiment_by_category(df_comments)
    channel_results = analyze_sentiment_by_channel(df_comments)
    recommendations = generate_content_strategy_recommendations(category_results, channel_results)
    benchmarks = calculate_category_benchmarks(category_results)

    # Return comprehensive results
    return {
        'category_analysis': category_results,
        'channel_analysis': channel_results,
        'recommendations': recommendations,
        'benchmarks': benchmarks
    }

if __name__ == "__main__":
    # Assuming df_comments is loaded with sentiment analysis
    try:
        # This will use your loaded DataFrame
        analysis_results = run_category_analysis(df_comments)
        print("\n" + "=" * 50)
        print("CATEGORY ANALYSIS COMPLETE")
        print("All results stored in analysis_results dictionary")

    except NameError:
        print("Error: df_comments not found. Please load your sentiment analysis data first.")
        print("Example: df_comments = pd.read_csv('uganda_comments_v1_with_sentiment.csv')")
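
To make the ranking logic in `analyze_sentiment_by_category` concrete, the following sketch reproduces its core computation on a six-row toy DataFrame. The category names and values are made up for illustration. It also shows a side effect of replacing a zero `Kibi_Count` with 1 before dividing: a category with no negative comments gets a ratio equal to its positive count, which can rank it below a category that is less positive in percentage terms.

```python
import pandas as pd

# Toy data (invented for illustration only).
toy = pd.DataFrame({
    "category": ["Music", "Music", "Music", "News", "News", "Comedy"],
    "sentiment": ["kirungi", "kirungi", "kibi", "kibi", "kibi", "kirungi"],
})

# Same grouping step as the notebook: one row per category, one column per label.
counts = toy.groupby(["category", "sentiment"]).size().unstack(fill_value=0)

summary = pd.DataFrame({
    "Total_Comments": counts.sum(axis=1),
    "Kirungi_Count": counts["kirungi"],
    "Kibi_Count": counts["kibi"],
})
summary["Kirungi_Percentage"] = (
    summary["Kirungi_Count"] / summary["Total_Comments"] * 100
).round(1)
# A zero negative count is replaced by 1 to avoid division by zero.
summary["Positivity_Ratio"] = (
    summary["Kirungi_Count"] / summary["Kibi_Count"].replace(0, 1)
).round(2)

print(summary.sort_values("Positivity_Ratio", ascending=False))
```

Here Comedy is 100% positive but has a ratio of 1.0, while Music is about 66.7% positive with a ratio of 2.0, so Music ranks first. If that ordering is not what you want for small categories, sorting by `Kirungi_Percentage` (possibly with a minimum comment count) is an alternative worth considering.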

## ENGAGEMENT ANALYSIS


In [None]:
"""
Analyzes engagement patterns in Luganda comments including likes correlation,
comment length patterns, and identifies most engaging content.
"""
def analyze_likes_sentiment_correlation(df_comments):
    """
    Analyze correlation between likes and sentiment patterns.

    Parameters:
    df_comments (pd.DataFrame): DataFrame with 'sentiment', 'likes' columns

    Returns:
    dict: Correlation analysis results
    """

    print("ENGAGEMENT VS SENTIMENT CORRELATION ANALYSIS")
    print("=" * 50)

    # Basic statistics
    total_likes = df_comments['likes'].sum()
    avg_likes_overall = df_comments['likes'].mean()

    # Sentiment-based like analysis
    sentiment_engagement = df_comments.groupby('sentiment').agg({
        'likes': ['count', 'sum', 'mean', 'median', 'std'],
        'text': 'count'
    }).round(2)

    sentiment_engagement.columns = ['Comment_Count', 'Total_Likes', 'Avg_Likes', 'Median_Likes', 'Std_Likes', 'Text_Count']

    print("SENTIMENT ENGAGEMENT BREAKDOWN")
    print("-" * 35)
    print(f"{'Sentiment':<12} {'Comments':<10} {'Total Likes':<12} {'Avg Likes':<10} {'Median':<8}")
    print("-" * 65)

    results = {}

    for sentiment in sentiment_engagement.index:
        count = sentiment_engagement.loc[sentiment, 'Comment_Count']
        total = sentiment_engagement.loc[sentiment, 'Total_Likes']
        avg = sentiment_engagement.loc[sentiment, 'Avg_Likes']
        median = sentiment_engagement.loc[sentiment, 'Median_Likes']

        label = "Positive" if sentiment == "kirungi" else "Negative"
        print(f"{label:<12} {count:<10} {total:<12} {avg:<10.1f} {median:<8.1f}")

        results[sentiment] = {
            'count': count,
            'total_likes': total,
            'avg_likes': avg,
            'median_likes': median
        }

    # Calculate engagement ratios
    kirungi_avg = results.get('kirungi', {}).get('avg_likes', 0)
    kibi_avg = results.get('kibi', {}).get('avg_likes', 0)

    if kibi_avg > 0:
        engagement_ratio = kirungi_avg / kibi_avg
    else:
        engagement_ratio = float('inf') if kirungi_avg > 0 else 1

    print(f"\nENGAGEMENT INSIGHTS")
    print("-" * 20)
    print(f"Total Platform Likes: {total_likes:,}")
    print(f"Average Likes per Comment: {avg_likes_overall:.1f}")
    print(f"Positive vs Negative Engagement Ratio: {engagement_ratio:.2f}:1")

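    # Statistical correlation
    # Encode sentiment as 1/0; with a binary variable, Pearson's r is the point-biserial correlation
    df_numeric = df_comments.copy()
    df_numeric['sentiment_numeric'] = df_numeric['sentiment'].map({'kirungi': 1, 'kibi': 0})
    # Drop rows with unmapped labels or missing likes so NaN values do not propagate into the result
    df_numeric = df_numeric.dropna(subset=['sentiment_numeric', 'likes'])

    if len(df_numeric) > 10:  # Need sufficient data for correlation
        correlation, p_value = pearsonr(df_numeric['sentiment_numeric'], df_numeric['likes'])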

        print(f"Pearson Correlation (Sentiment-Likes): {correlation:.3f}")
        print(f"Statistical Significance: {'Yes' if p_value < 0.05 else 'No'} (p={p_value:.3f})")

        if correlation > 0.1:
            interpretation = "Positive comments tend to receive more likes"
        elif correlation < -0.1:
            interpretation = "Negative comments tend to receive more likes"
        else:
            interpretation = "Little or no linear relationship between sentiment and likes (|r| <= 0.1)"

        print(f"Interpretation: {interpretation}")

        results['correlation'] = {
            'value': correlation,
            'p_value': p_value,
            'interpretation': interpretation
        }

    return results

def analyze_comment_length_patterns(df_comments):
    """
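    Analyze comment length patterns and their relationship with engagement.

    Side effect: adds 'comment_length', 'word_count', and 'length_category'
    columns to df_comments in place. calculate_engagement_score reads
    'comment_length', so run this function first.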

    Parameters:
    df_comments (pd.DataFrame): DataFrame with comment data

    Returns:
    dict: Length analysis results
    """

    print("\n" + "=" * 50)
    print("COMMENT LENGTH ANALYSIS")
    print("=" * 50)

    # Calculate comment lengths (a missing comment counts as empty text)
    text = df_comments['text'].fillna('').astype(str)
    df_comments['comment_length'] = text.str.len()
    df_comments['word_count'] = text.str.split().str.len()

    # Length categories
    def categorize_length(length):
        if length < 50:
            return "Short"
        elif length < 150:
            return "Medium"
        else:
            return "Long"

    df_comments['length_category'] = df_comments['comment_length'].apply(categorize_length)

    # Analysis by length category
    length_analysis = df_comments.groupby('length_category').agg({
        'likes': ['count', 'mean', 'sum'],
        'comment_length': 'mean',
        'word_count': 'mean'
    }).round(2)

    length_analysis.columns = ['Comment_Count', 'Avg_Likes', 'Total_Likes', 'Avg_Char_Length', 'Avg_Word_Count']

    print("LENGTH CATEGORY PERFORMANCE")
    print("-" * 35)
    print(f"{'Category':<10} {'Count':<8} {'Avg Likes':<10} {'Avg Length':<12} {'Avg Words':<10}")
    print("-" * 65)

    for category in ['Short', 'Medium', 'Long']:
        if category in length_analysis.index:
            row = length_analysis.loc[category]
            print(f"{category:<10} {row['Comment_Count']:<8} {row['Avg_Likes']:<10.1f} "
                  f"{row['Avg_Char_Length']:<12.0f} {row['Avg_Word_Count']:<10.1f}")

    # Sentiment by length
    print("\nSENTIMENT DISTRIBUTION BY LENGTH")
    print("-" * 40)

    sentiment_length = pd.crosstab(df_comments['length_category'], df_comments['sentiment'], normalize='index') * 100

    if 'kirungi' in sentiment_length.columns and 'kibi' in sentiment_length.columns:
        print(f"{'Category':<10} {'Positive %':<12} {'Negative %':<12}")
        print("-" * 40)

        for category in sentiment_length.index:
            pos = sentiment_length.loc[category, 'kirungi']
            neg = sentiment_length.loc[category, 'kibi']
            print(f"{category:<10} {pos:<12.1f} {neg:<12.1f}")

    # Optimal length insights
    print(f"\nLENGTH INSIGHTS")
    print("-" * 20)

    best_length_cat = length_analysis['Avg_Likes'].idxmax()
    most_common_cat = length_analysis['Comment_Count'].idxmax()

    print(f"Most Engaging Length: {best_length_cat}")
    print(f"Most Common Length: {most_common_cat}")

    avg_length = df_comments['comment_length'].mean()
    avg_words = df_comments['word_count'].mean()

    print(f"Average Comment Length: {avg_length:.0f} characters")
    print(f"Average Word Count: {avg_words:.1f} words")

    return {
        'length_analysis': length_analysis,
        'avg_length': avg_length,
        'avg_words': avg_words,
        'best_length_category': best_length_cat
    }

def identify_most_engaging_comments(df_comments, top_n=10):
    """
    Identify and analyze the most engaging comments.

    Parameters:
    df_comments (pd.DataFrame): DataFrame with comment data
    top_n (int): Number of top comments to analyze

    Returns:
    pd.DataFrame: Top engaging comments with analysis
    """

    print("\n" + "=" * 50)
    print("MOST ENGAGING COMMENTS ANALYSIS")
    print("=" * 50)

    # Sort by likes and get top comments
    top_comments = df_comments.nlargest(top_n, 'likes').copy()

    # Add a short text preview for display (cast to str so missing text does not raise)
    top_comments['text_preview'] = top_comments['text'].astype(str).apply(
        lambda x: x[:80] + "..." if len(x) > 80 else x
    )

    print(f"TOP {top_n} MOST LIKED COMMENTS")
    print("-" * 35)
    print(f"{'Rank':<5} {'Likes':<7} {'Sentiment':<10} {'Category':<20} {'Preview'}")
    print("-" * 90)

    for i, (_, comment) in enumerate(top_comments.iterrows(), 1):
        likes = int(comment['likes'])
        sentiment = "Positive" if comment['sentiment'] == 'kirungi' else "Negative"
        category = str(comment.get('category', 'Unknown'))[:18]
        preview = comment['text_preview']

        print(f"{i:<5} {likes:<7} {sentiment:<10} {category:<20} {preview}")

    # Analyze patterns in top comments
    print(f"\nTOP COMMENTS ANALYSIS")
    print("-" * 25)

    sentiment_dist = top_comments['sentiment'].value_counts()
    category_dist = top_comments['category'].value_counts() if 'category' in top_comments.columns else pd.Series(dtype=int)

    print("Sentiment Distribution:")
    for sentiment, count in sentiment_dist.items():
        label = "Positive" if sentiment == 'kirungi' else "Negative"
        pct = (count / len(top_comments)) * 100
        print(f"  {label}: {count} ({pct:.1f}%)")

    if not category_dist.empty:
        print(f"\nTop Categories:")
        for category, count in category_dist.head(3).items():
            pct = (count / len(top_comments)) * 100
            print(f"  {category}: {count} ({pct:.1f}%)")

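    # Engagement benchmarks
    min_likes_top = top_comments['likes'].min()
    avg_likes_top = top_comments['likes'].mean()

    print(f"\nEngagement Benchmarks:")
    print(f"  Top {top_n} Minimum Likes: {min_likes_top}")
    print(f"  Top {top_n} Average Likes: {avg_likes_top:.1f}")
    print(f"  Highest Single Comment: {top_comments['likes'].max()} likes")

    # 'comment_length' and 'word_count' exist only after analyze_comment_length_patterns has run,
    # so return whichever of the expected columns are present
    result_columns = ['text', 'likes', 'sentiment', 'category', 'comment_length', 'word_count']
    return top_comments[[col for col in result_columns if col in top_comments.columns]].copy()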

def calculate_engagement_score(df_comments):
    """
    Calculate comprehensive engagement scores for content analysis.

    Parameters:
    df_comments (pd.DataFrame): DataFrame with comment data

    Returns:
    pd.DataFrame: Engagement scores by category/channel
    """

    print("\n" + "=" * 50)
    print("ENGAGEMENT SCORE CALCULATION")
    print("=" * 50)

    # Calculate engagement score components
    # Score (0-100) = 40 * total likes / max total likes + 30 * comment count / max comment count
    #               + 20 * positivity rate / 100 + 10 * average likes / max average likes

    category_engagement = df_comments.groupby('category').agg({
        'likes': ['sum', 'mean', 'count'],
        'sentiment': lambda x: (x == 'kirungi').sum() / len(x) * 100,
        'comment_length': 'mean'
    }).round(2)

    category_engagement.columns = ['Total_Likes', 'Avg_Likes', 'Comment_Count', 'Positivity_Rate', 'Avg_Length']

    # Calculate engagement score; "or 1" avoids division by zero when no category has any likes
    max_likes = category_engagement['Total_Likes'].max() or 1
    max_comments = category_engagement['Comment_Count'].max() or 1
    max_avg_likes = category_engagement['Avg_Likes'].max() or 1

    category_engagement['Engagement_Score'] = (
        (category_engagement['Total_Likes'] / max_likes * 40) +  # 40% weight on total likes
        (category_engagement['Comment_Count'] / max_comments * 30) +  # 30% weight on volume
        (category_engagement['Positivity_Rate'] / 100 * 20) +  # 20% weight on positivity
        (category_engagement['Avg_Likes'] / max_avg_likes * 10)  # 10% weight on avg likes
    ).round(1)

    # Sort by engagement score
    category_engagement = category_engagement.sort_values('Engagement_Score', ascending=False)

    print("CATEGORY ENGAGEMENT SCORES")
    print("-" * 30)
    print(f"{'Category':<25} {'Score':<8} {'Total Likes':<12} {'Comments':<10} {'Positivity'}")
    print("-" * 70)

    for category, row in category_engagement.iterrows():
        score = row['Engagement_Score']
        likes = int(row['Total_Likes'])
        comments = int(row['Comment_Count'])
        positivity = row['Positivity_Rate']

        print(f"{category:<25} {score:<8} {likes:<12} {comments:<10} {positivity:.1f}%")

    # Performance tiers
    print(f"\nPERFORMANCE TIERS")
    print("-" * 20)

    for category, row in category_engagement.iterrows():
        score = row['Engagement_Score']

        if score >= 80:
            tier = "Excellent"
        elif score >= 60:
            tier = "Good"
        elif score >= 40:
            tier = "Average"
        else:
            tier = "Needs Improvement"

        print(f"{category:<25} {tier}")

    return category_engagement

def run_engagement_analysis(df_comments):
    """
    Run complete engagement analysis on sentiment-analyzed comments.

    Parameters:
    df_comments (pd.DataFrame): DataFrame with sentiment analysis results

    Returns:
    dict: Complete engagement analysis results
    """

    print("STARTING COMPREHENSIVE ENGAGEMENT ANALYSIS")
    print("=" * 55)
    print(f"Analyzing {len(df_comments)} comments with sentiment data")
    print()

    # Run all engagement analyses
    correlation_results = analyze_likes_sentiment_correlation(df_comments)
    length_results = analyze_comment_length_patterns(df_comments)
    top_comments = identify_most_engaging_comments(df_comments)
    engagement_scores = calculate_engagement_score(df_comments)

    print("\n" + "=" * 50)
    print("ENGAGEMENT ANALYSIS COMPLETE")
    print("=" * 50)
    print("Key Takeaways:")

    # Generate key insights
    if 'correlation' in correlation_results:
        corr_value = correlation_results['correlation']['value']
        if corr_value > 0.1:
            print("• Positive sentiment correlates with higher engagement")
        elif corr_value < -0.1:
            print("• Controversial content may drive engagement")
        else:
            print("• Sentiment and likes show no strong correlation")

    best_length = length_results['best_length_category']
    print(f"• {best_length} comments generate highest engagement")

    top_category = engagement_scores.index[0]
    print(f"• {top_category} leads overall engagement metrics")

    return {
        'correlation_analysis': correlation_results,
        'length_analysis': length_results,
        'top_comments': top_comments,
        'engagement_scores': engagement_scores
    }

if __name__ == "__main__":
    # This script should be run after sentiment analysis is complete
    try:
        # Run engagement analysis on your sentiment-analyzed data
        engagement_results = run_engagement_analysis(df_comments)

        print("\nAll engagement analysis results stored in engagement_results dictionary")
        print("Available keys:", list(engagement_results.keys()))

    except NameError:
        print("Error: df_comments not found. Please run sentiment analysis first.")
        print("Make sure df_comments has 'sentiment', 'likes', 'text', and 'category' columns")
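
The weighted score in `calculate_engagement_score` is easier to follow with numbers. The sketch below applies the same 40/30/20/10 weighting to two made-up categories; the figures are invented for illustration and are not results from the dataset.

```python
import pandas as pd

# Per-category statistics (toy values, not real results).
stats = pd.DataFrame(
    {
        "Total_Likes": [200, 40],
        "Comment_Count": [40, 80],
        "Positivity_Rate": [75.0, 50.0],  # percent of comments labelled 'kirungi'
        "Avg_Likes": [5.0, 0.5],
    },
    index=["Music", "News"],
)

# Each component is scaled to its maximum across categories, then weighted.
score = (
    stats["Total_Likes"] / stats["Total_Likes"].max() * 40        # 40% total likes
    + stats["Comment_Count"] / stats["Comment_Count"].max() * 30  # 30% comment volume
    + stats["Positivity_Rate"] / 100 * 20                         # 20% positivity
    + stats["Avg_Likes"] / stats["Avg_Likes"].max() * 10          # 10% average likes
).round(1)

print(score)
# Music: 40 + 15 + 15 + 10 = 80.0
# News:   8 + 30 + 10 +  1 = 49.0
```

Because three of the four components are divided by the maximum across categories, the score is relative to the categories present in the data: adding or removing a category can change every other category's score and tier.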

## CONTENT ANALYSIS

In [None]:
"""
Analyzes content patterns in Luganda comments including word frequency,
language mixing patterns, and keyword extraction for content insights.
"""

def preprocess_luganda_text(text):
    """
    Preprocess Luganda text for analysis while preserving language mixing.

    Parameters:
    text (str): Raw comment text

    Returns:
    str: Cleaned text
    """

    # Non-string input (e.g. NaN from a missing comment) has no text to clean
    if not isinstance(text, str):
        return ''

    # Convert to lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r'https?://\S+', '', text)

    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)

    # Replace everything except word characters, whitespace, apostrophes, and hyphens
    text = re.sub(r'[^\w\s\'\-]', ' ', text)

    # Collapse the extra whitespace left by the substitutions above
    text = re.sub(r'\s+', ' ', text).strip()

    return text

def analyze_word_frequency(df_comments, top_n=50):
    """
    Analyze word frequency patterns in Luganda comments.

    Parameters:
    df_comments (pd.DataFrame): DataFrame with text column
    top_n (int): Number of top words to analyze

    Returns:
    dict: Word frequency analysis results
    """

    print("LUGANDA WORD FREQUENCY ANALYSIS")
    print("=" * 40)

    # Common English/Luganda stop words to filter
    stop_words = {
        'wa', 'ku', 'mu', 'ba', 'ki', 'ka', 'ga', 'lu', 'bu', 'tu', 'ma',  # Luganda prefixes
        'ne', 'era', 'naye', 'oba', 'ate', 'nga', 'bwe', 'gwe', 'ye',  # Luganda conjunctions
        'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',  # English
        'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
        'of', 'with', 'by', 'from', 'up', 'about', 'into', 'through', 'during',
        'this', 'that', 'these', 'those', 'i', 'me', 'my', 'myself', 'we', 'our',
        'you', 'your', 'he', 'him', 'his', 'she', 'her', 'it', 'its', 'they'
    }

    # Combine all text and process
    all_text = ' '.join(df_comments['text'].astype(str))
    cleaned_text = preprocess_luganda_text(all_text)

    # Split into words and filter
    words = cleaned_text.split()
    filtered_words = [word for word in words if len(word) > 2 and word not in stop_words]

    # Count frequencies
    word_freq = Counter(filtered_words)

    print(f"OVERALL WORD STATISTICS")
    print("-" * 25)
    print(f"Total words processed: {len(words):,}")
    print(f"Words after filtering: {len(filtered_words):,}")
    print(f"Unique words after filtering: {len(word_freq):,}")
    print()

    # Top words overall
    print(f"TOP {min(top_n, 30)} MOST FREQUENT WORDS")
    print("-" * 35)
    print(f"{'Rank':<5} {'Word':<20} {'Count':<8} {'% of Total'}")
    print("-" * 50)

    total_filtered = len(filtered_words)
    top_words = {}

    for i, (word, count) in enumerate(word_freq.most_common(min(top_n, 30)), 1):
        percentage = (count / total_filtered) * 100
        print(f"{i:<5} {word:<20} {count:<8} {percentage:.2f}%")
        top_words[word] = {'count': count, 'percentage': percentage}

    # Sentiment-based word analysis
    print(f"\nSENTIMENT-BASED WORD PATTERNS")
    print("-" * 35)

    sentiment_words = {}

    for sentiment in ['kirungi', 'kibi']:
        sentiment_text = ' '.join(
            df_comments[df_comments['sentiment'] == sentiment]['text'].astype(str)
        )
        sentiment_cleaned = preprocess_luganda_text(sentiment_text)
        sentiment_word_list = [word for word in sentiment_cleaned.split()
                              if len(word) > 2 and word not in stop_words]
        sentiment_freq = Counter(sentiment_word_list)

        label = "POSITIVE (Kirungi)" if sentiment == 'kirungi' else "NEGATIVE (Kibi)"
        print(f"\n{label} - Top 10 Words:")

        sentiment_words[sentiment] = {}
        for i, (word, count) in enumerate(sentiment_freq.most_common(10), 1):
            pct = (count / len(sentiment_word_list)) * 100 if sentiment_word_list else 0
            print(f"  {i:2}. {word:<15} ({count:3}, {pct:.1f}%)")
            sentiment_words[sentiment][word] = {'count': count, 'percentage': pct}

    return {
        'total_words': len(words),
        'unique_words': len(word_freq),
        'filtered_words': len(filtered_words),
        'top_words': top_words,
        'sentiment_words': sentiment_words,
        'word_frequency': dict(word_freq.most_common(top_n))
    }

def analyze_language_mixing(df_comments):
    """
    Analyze English-Luganda language mixing patterns.

    Parameters:
    df_comments (pd.DataFrame): DataFrame with text column

    Returns:
    dict: Language mixing analysis results
    """

    print("\n" + "=" * 50)
    print("LANGUAGE MIXING ANALYSIS")
    print("=" * 50)

    # Common English words that appear in mixed content
    english_indicators = {
        'the', 'and', 'is', 'are', 'was', 'were', 'have', 'has', 'had', 'do', 'does', 'did',
        'will', 'would', 'can', 'could', 'should', 'may', 'might', 'must', 'shall',
        'this', 'that', 'these', 'those', 'my', 'your', 'his', 'her', 'our', 'their',
        'good', 'bad', 'nice', 'great', 'best', 'worst', 'love', 'like', 'hate',
        'people', 'time', 'work', 'money', 'government', 'president', 'minister',
        'uganda', 'kampala', 'thanks', 'thank', 'please', 'sorry', 'very', 'really',
        'always', 'never', 'sometimes', 'maybe', 'because', 'but', 'so', 'also'
    }

    # Common Luganda words
    luganda_indicators = {
        'webale', 'nkwagala', 'ssebo', 'nnyabo', 'bambi', 'kale', 'laba', 'genda',
        'jjuuza', 'nkuba', 'nsanyuse', 'ntya', 'manya', 'temanyi', 'kuba', 'olemwa',
        'abantu', 'emyaka', 'omulimu', 'ssente', 'gavumenti', 'pulezidenti',
        'minisita', 'uganda', 'kampala', 'ddala', 'nnyo', 'banange', 'lwaki',
        'otya', 'bulungi', 'bubi', 'kirungi', 'kibi', 'katonda', 'yesu', 'kristo'
    }

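    # 'uganda' and 'kampala' appear in both indicator sets and carry no language
    # signal, so they are excluded before counting
    shared_indicators = english_indicators & luganda_indicators

    def classify_language_mixing(text):
        """Classify text by language mixing level."""
        cleaned = preprocess_luganda_text(text)
        words = set(cleaned.split()) - shared_indicators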

        english_count = len(words.intersection(english_indicators))
        luganda_count = len(words.intersection(luganda_indicators))
        total_indicator_words = english_count + luganda_count

        if total_indicator_words == 0:
            return "Unknown"
        elif english_count == 0:
            return "Pure Luganda"
        elif luganda_count == 0:
            return "Pure English"
        elif english_count > luganda_count:
            return "English-Heavy Mix"
        elif luganda_count > english_count:
            return "Luganda-Heavy Mix"
        else:
            return "Balanced Mix"

    # Classify all comments
    df_comments['language_mix'] = df_comments['text'].apply(classify_language_mixing)

    # Analyze distribution
    mix_distribution = df_comments['language_mix'].value_counts()

    print("LANGUAGE MIXING DISTRIBUTION")
    print("-" * 30)
    print(f"{'Category':<20} {'Count':<8} {'Percentage'}")
    print("-" * 45)

    total_comments = len(df_comments)
    mix_analysis = {}

    for category, count in mix_distribution.items():
        percentage = (count / total_comments) * 100
        print(f"{category:<20} {count:<8} {percentage:.1f}%")
        mix_analysis[category] = {'count': count, 'percentage': percentage}

    # Analyze mixing by sentiment
    print(f"\nLANGUAGE MIXING BY SENTIMENT")
    print("-" * 35)

    mixing_sentiment = pd.crosstab(df_comments['language_mix'], df_comments['sentiment'], normalize='index') * 100

    if 'kirungi' in mixing_sentiment.columns and 'kibi' in mixing_sentiment.columns:
        print(f"{'Language Category':<20} {'Positive %':<12} {'Negative %'}")
        print("-" * 50)

        for category in mixing_sentiment.index:
            pos = mixing_sentiment.loc[category, 'kirungi']
            neg = mixing_sentiment.loc[category, 'kibi']
            print(f"{category:<20} {pos:<12.1f} {neg:.1f}%")

    # Analyze mixing by category
    if 'category' in df_comments.columns:
        print(f"\nLANGUAGE MIXING BY CONTENT CATEGORY")
        print("-" * 40)

        category_mixing = pd.crosstab(df_comments['category'], df_comments['language_mix'], normalize='index') * 100

        print("Top mixing patterns by content category:")
        for content_cat in category_mixing.index:
            top_mix = category_mixing.loc[content_cat].idxmax()
            top_pct = category_mixing.loc[content_cat].max()
            print(f"  {content_cat}: {top_mix} ({top_pct:.1f}%)")

    return {
        'distribution': mix_analysis,
        'total_comments': total_comments,
        'mixing_sentiment_correlation': mixing_sentiment.to_dict() if 'kirungi' in mixing_sentiment.columns else {},
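        # Guard against an empty DataFrame, where value_counts() has no rows
        'dominant_pattern': mix_distribution.index[0] if len(mix_distribution) > 0 else None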
    }

def extract_keywords_and_themes(df_comments):
    """
    Extract meaningful keywords and themes from comments.

    Parameters:
    df_comments (pd.DataFrame): DataFrame with text and category columns

    Returns:
    dict: Keywords and themes analysis
    """

    print("\n" + "=" * 50)
    print("KEYWORD AND THEME EXTRACTION")
    print("=" * 50)

    # Define theme keywords in both languages
    theme_keywords = {
        'praise_appreciation': [
            'webale', 'nkwagala', 'thanks', 'thank', 'love', 'asante', 'good', 'nice',
            'bulungi', 'kirungi', 'nsanyuse', 'amazing', 'great', 'excellent', 'perfect'
        ],
        'criticism_complaints': [
            'kibi', 'bubi', 'bad', 'worst', 'hate', 'angry', 'disappointed', 'terrible',
            'horrible', 'disgusting', 'shame', 'ntya', 'olemwa', 'problem', 'issue'
        ],
        'politics_government': [
            'museveni', 'government', 'gavumenti', 'president', 'pulezidenti', 'minister',
            'minisita', 'parliament', 'palimenti', 'nup', 'nrm', 'bobi', 'wine', 'politics',
            'election', 'vote', 'leader', 'mukulembeze', 'opposition', 'ruling', 'party'
        ],
        'entertainment_music': [
            'music', 'muziki', 'song', 'oluyimba', 'dance', 'okuzina', 'concert', 'show',
            'artist', 'musician', 'singer', 'band', 'album', 'video', 'youtube', 'comedy',
            'comedian', 'laugh', 'funny', 'entertainment', 'perform', 'stage'
        ],
        'religion_spirituality': [
            'katonda', 'god', 'yesu', 'jesus', 'kristo', 'christ', 'bible', 'church',
            'kkanisa', 'pray', 'prayer', 'okusaba', 'pastor', 'reverend', 'faith',
            'okukkiriza', 'blessing', 'mukama', 'lord', 'amen', 'hallelujah'
        ],
        'social_issues': [
            'poverty', 'obwavu', 'corruption', 'obuli', 'unemployment', 'education',
            'ebyenjigiriza', 'health', 'obulamu', 'hospital', 'ddwaliro', 'medicine',
            'eddagala', 'transport', 'entambula', 'roads', 'enguudo', 'infrastructure'
        ],
        'personal_emotions': [
            'happy', 'nsanyuse', 'sad', 'nkungubaga', 'excited', 'tired', 'nkoowa',
            'proud', 'neegulumiza', 'worried', 'nfudde', 'confused', 'stressed',
            'relaxed', 'motivated', 'inspired', 'grateful', 'blessed'
        ]
    }

    def extract_themes_from_text(text):
        """Extract themes present in a text."""
        cleaned = preprocess_luganda_text(text)
        words = set(cleaned.split())

        themes_found = []
        for theme, keywords in theme_keywords.items():
            if any(keyword in words for keyword in keywords):
                themes_found.append(theme)

        return themes_found

    # Extract themes for all comments
    df_comments['themes'] = df_comments['text'].apply(extract_themes_from_text)

    # Count theme occurrences
    theme_counts = defaultdict(int)
    for themes_list in df_comments['themes']:
        for theme in themes_list:
            theme_counts[theme] += 1

    total_comments = len(df_comments)

    print("THEME FREQUENCY ANALYSIS")
    print("-" * 30)
    print(f"{'Theme':<25} {'Count':<8} {'Percentage'}")
    print("-" * 50)

    theme_analysis = {}
    for theme, count in sorted(theme_counts.items(), key=lambda x: x[1], reverse=True):
        percentage = (count / total_comments) * 100
        theme_name = theme.replace('_', ' ').title()
        print(f"{theme_name:<25} {count:<8} {percentage:.1f}%")
        theme_analysis[theme] = {'count': count, 'percentage': percentage}

    # Analyze themes by sentiment
    print(f"\nTHEME SENTIMENT ANALYSIS")
    print("-" * 30)

    theme_sentiment = {}

    for theme in theme_counts.keys():
        # Find comments containing this theme
        theme_comments = df_comments[df_comments['themes'].apply(lambda x: theme in x)]

        if len(theme_comments) > 0:
            sentiment_dist = theme_comments['sentiment'].value_counts(normalize=True) * 100

            kirungi_pct = sentiment_dist.get('kirungi', 0)
            kibi_pct = sentiment_dist.get('kibi', 0)

            theme_sentiment[theme] = {
                'positive_pct': kirungi_pct,
                'negative_pct': kibi_pct,
                'total_comments': len(theme_comments)
            }

    print(f"{'Theme':<25} {'Positive %':<12} {'Negative %':<12} {'Comments'}")
    print("-" * 65)

    for theme, data in sorted(theme_sentiment.items(), key=lambda x: x[1]['positive_pct'], reverse=True):
        theme_name = theme.replace('_', ' ').title()
        pos_pct = data['positive_pct']
        neg_pct = data['negative_pct']
        comments = data['total_comments']
        print(f"{theme_name:<25} {pos_pct:<12.1f} {neg_pct:<12.1f} {comments}")

    # Analyze themes by category
    if 'category' in df_comments.columns:
        print(f"\nTHEME DISTRIBUTION BY CATEGORY")
        print("-" * 40)

        category_themes = {}
        for category in df_comments['category'].unique():
            cat_comments = df_comments[df_comments['category'] == category]
            cat_theme_counts = defaultdict(int)

            for themes_list in cat_comments['themes']:
                for theme in themes_list:
                    cat_theme_counts[theme] += 1

            if cat_theme_counts:
                top_theme = max(cat_theme_counts.items(), key=lambda x: x[1])
                theme_name = top_theme[0].replace('_', ' ').title()
                count = top_theme[1]
                pct = (count / len(cat_comments)) * 100

                print(f"{category}: {theme_name} ({count} comments, {pct:.1f}%)")
                category_themes[category] = {
                    'top_theme': top_theme[0],
                    'count': count,
                    'percentage': pct
                }

    return {
        'theme_counts': dict(theme_counts),
        'theme_analysis': theme_analysis,
        'theme_sentiment': theme_sentiment,
        'category_themes': category_themes if 'category' in df_comments.columns else {},
        'total_comments_analyzed': total_comments
    }

def analyze_viral_content_patterns(df_comments, viral_threshold=None):
    """
    Analyze patterns in viral content (high-engagement comments).

    Parameters:
    df_comments (pd.DataFrame): DataFrame with likes and text columns
    viral_threshold (int): Minimum likes to be considered viral (auto-calculated if None)

    Returns:
    dict: Viral content analysis results
    """

    print("\n" + "=" * 50)
    print("VIRAL CONTENT PATTERN ANALYSIS")
    print("=" * 50)

    # Calculate viral threshold if not provided:
    # the 75th percentile of likes, with a floor of 10 likes
    if viral_threshold is None:
        likes_75th = df_comments['likes'].quantile(0.75)
        viral_threshold = max(likes_75th, 10)

    viral_comments = df_comments[df_comments['likes'] >= viral_threshold].copy()
    regular_comments = df_comments[df_comments['likes'] < viral_threshold].copy()

    print(f"VIRAL CONTENT DEFINITION")
    print("-" * 25)
    print(f"Viral Threshold: {viral_threshold} likes")
    print(f"Viral Comments: {len(viral_comments)} ({len(viral_comments)/len(df_comments)*100:.1f}%)")
    print(f"Regular Comments: {len(regular_comments)} ({len(regular_comments)/len(df_comments)*100:.1f}%)")
    print()

    # Analyze viral content characteristics
    viral_analysis = {}

    # Length analysis
    viral_avg_length = viral_comments['text'].str.len().mean() if len(viral_comments) > 0 else 0
    regular_avg_length = regular_comments['text'].str.len().mean() if len(regular_comments) > 0 else 0

    print(f"VIRAL CONTENT CHARACTERISTICS")
    print("-" * 35)
    print(f"Average Length - Viral: {viral_avg_length:.0f} characters")
    print(f"Average Length - Regular: {regular_avg_length:.0f} characters")
    print(f"Length Difference: {viral_avg_length - regular_avg_length:+.0f} characters")
    print()

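    # Sentiment analysis
    # Compute the viral distribution whenever viral comments exist, so the
    # summary dict at the end never references an undefined name
    viral_sentiment = pd.Series(dtype=float)
    if len(viral_comments) > 0:
        viral_sentiment = viral_comments['sentiment'].value_counts(normalize=True) * 100

    if len(viral_comments) > 0 and len(regular_comments) > 0:
        regular_sentiment = regular_comments['sentiment'].value_counts(normalize=True) * 100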

        print(f"SENTIMENT DISTRIBUTION")
        print("-" * 25)
        print(f"{'Sentiment':<12} {'Viral %':<10} {'Regular %'}")
        print("-" * 35)

        for sentiment in ['kirungi', 'kibi']:
            viral_pct = viral_sentiment.get(sentiment, 0)
            regular_pct = regular_sentiment.get(sentiment, 0)
            label = "Positive" if sentiment == 'kirungi' else "Negative"
            print(f"{label:<12} {viral_pct:<10.1f} {regular_pct:.1f}%")

    # Extract common patterns in viral content
    if len(viral_comments) > 0:
        viral_text = ' '.join(viral_comments['text'].astype(str))
        viral_cleaned = preprocess_luganda_text(viral_text)
        viral_words = [word for word in viral_cleaned.split() if len(word) > 3]
        viral_word_freq = Counter(viral_words)

        print(f"\nTOP VIRAL CONTENT WORDS")
        print("-" * 25)

        for i, (word, count) in enumerate(viral_word_freq.most_common(15), 1):
            print(f"{i:2}. {word:<15} ({count} times)")

    viral_analysis = {
        'threshold': viral_threshold,
        'viral_count': len(viral_comments),
        'viral_percentage': len(viral_comments)/len(df_comments)*100 if len(df_comments) > 0 else 0,
        'avg_length_viral': viral_avg_length,
        'avg_length_regular': regular_avg_length,
        'viral_sentiment': viral_sentiment.to_dict() if len(viral_comments) > 0 else {},
        'top_viral_words': dict(viral_word_freq.most_common(20)) if len(viral_comments) > 0 else {}
    }

    return viral_analysis

def run_content_analysis(df_comments):
    """
    Run comprehensive content analysis on Luganda comments.

    Parameters:
    df_comments (pd.DataFrame): DataFrame with sentiment analysis results

    Returns:
    dict: Complete content analysis results
    """

    print("STARTING COMPREHENSIVE CONTENT ANALYSIS")
    print("=" * 55)
    print(f"Analyzing {len(df_comments)} Luganda comments")
    print()

    # Run all content analyses
    word_frequency = analyze_word_frequency(df_comments)
    language_mixing = analyze_language_mixing(df_comments)
    themes_keywords = extract_keywords_and_themes(df_comments)
    viral_patterns = analyze_viral_content_patterns(df_comments)

    print("\n" + "=" * 50)
    print("CONTENT ANALYSIS COMPLETE")
    print("=" * 50)

    # Generate key insights
    print("KEY CONTENT INSIGHTS:")

    # Language insights
    dominant_mix = language_mixing['dominant_pattern']
    print(f"• Dominant language pattern: {dominant_mix}")

    # Theme insights
    if themes_keywords['theme_counts']:
        top_theme = max(themes_keywords['theme_counts'].items(), key=lambda x: x[1])
        theme_name = top_theme[0].replace('_', ' ').title()
        print(f"• Most discussed theme: {theme_name} ({top_theme[1]} comments)")

    # Viral content insights
    viral_pct = viral_patterns['viral_percentage']
    print(f"• Viral content rate: {viral_pct:.1f}% of comments")

    # Word frequency insights
    if word_frequency['top_words']:
        top_word = list(word_frequency['top_words'].keys())[0]
        print(f"• Most frequent word: '{top_word}'")

    return {
        'word_frequency': word_frequency,
        'language_mixing': language_mixing,
        'themes_keywords': themes_keywords,
        'viral_patterns': viral_patterns
    }

if __name__ == "__main__":
    # This script should be run after sentiment analysis is complete
    try:
        # Run content analysis on your sentiment-analyzed data
        content_results = run_content_analysis(df_comments)

        print("\nAll content analysis results stored in content_results dictionary")
        print("Available keys:", list(content_results.keys()))

    except NameError:
        print("Error: df_comments not found. Please run sentiment analysis first.")
        print("Make sure df_comments has 'sentiment', 'text', and 'likes' columns ('category' is optional)")
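
The language-mix labels produced by `analyze_language_mixing` come from counting indicator words on each side. The sketch below reproduces that decision rule in isolation. The two indicator sets are hypothetical stand-ins (the notebook's real `english_indicators` and `luganda_indicators` lists are defined elsewhere), and the text is split on whitespace only, without the `preprocess_luganda_text` cleaning step.

```python
# Hypothetical indicator sets, for illustration only.
ENGLISH_HINTS = {"the", "and", "is", "thanks", "good"}
LUGANDA_HINTS = {"webale", "nnyo", "kirungi", "ssebo", "bulungi"}

def classify_mix(text):
    """Label a comment by which indicator words it contains."""
    words = set(text.lower().split())
    english_count = len(words & ENGLISH_HINTS)
    luganda_count = len(words & LUGANDA_HINTS)

    if english_count + luganda_count == 0:
        return "Unknown"
    if english_count == 0:
        return "Pure Luganda"
    if luganda_count == 0:
        return "Pure English"
    if english_count > luganda_count:
        return "English-Heavy Mix"
    if luganda_count > english_count:
        return "Luganda-Heavy Mix"
    return "Balanced Mix"

# Made-up sample comments
for sample in ["webale nnyo ssebo", "thanks webale", "the song is good webale"]:
    print(f"{sample!r}: {classify_mix(sample)}")
```

Because the rule counts distinct indicator words rather than all words, a long English comment with a single Luganda greeting can still land in a mixed category.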

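The viral threshold in `analyze_viral_content_patterns` is the 75th percentile of likes with a floor of 10. A minimal standard-library sketch of that rule follows. The helper name `viral_threshold` and the like counts are made up for illustration, and `statistics.quantiles(..., method="inclusive")` is assumed to stand in for the pandas `Series.quantile(0.75)` call used in the notebook.

```python
from statistics import quantiles

def viral_threshold(likes, floor=10):
    """Return the 75th percentile of likes, but never less than `floor`."""
    # "inclusive" interpolates linearly between data points; this is assumed
    # to mirror the default interpolation of pandas' Series.quantile.
    p75 = quantiles(likes, n=4, method="inclusive")[2]
    return max(p75, floor)

likes = [0, 0, 1, 2, 3, 5, 8, 40]  # made-up like counts
threshold = viral_threshold(likes)

# Same split as the notebook: >= threshold is viral, the rest is regular
viral = [n for n in likes if n >= threshold]
regular = [n for n in likes if n < threshold]

print(f"threshold={threshold}, viral={viral}, regular={len(regular)} comments")
```

On low-engagement data like this, the percentile falls below 10, so the floor decides the threshold and only the outlier comment counts as viral.
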
## BUSINESS INTELLIGENCE DASHBOARD
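
The Business Impact Score in `generate_executive_summary` is a capped, weighted sum of four components (volume 25, engagement 25, sentiment 30, diversity 20). The sketch below restates that arithmetic as a standalone function so the weighting can be checked by hand. The function name and the input numbers are illustrative assumptions, not values from the dataset.

```python
def business_impact_score(total_comments, avg_likes, positive_rate,
                          categories, channels):
    """Recompute the 0-100 score from its four capped components."""
    parts = {
        "volume": min(total_comments / 500 * 25, 25),            # max 25
        "engagement": min(avg_likes / 10 * 25, 25),              # max 25
        "sentiment": positive_rate / 100 * 30,                   # max 30
        "diversity": min((categories + channels) / 10 * 20, 20)  # max 20
    }
    parts["total"] = sum(parts.values())
    return parts

def grade(score):
    """Map a score to the dashboard's performance grade."""
    if score >= 80:
        return "Excellent (A)"
    if score >= 65:
        return "Good (B)"
    if score >= 50:
        return "Average (C)"
    return "Poor (D)"

# Made-up inputs: 250 comments, 4 likes/comment, 80% positive,
# 3 categories, 2 channels
parts = business_impact_score(250, 4.0, 80.0, 3, 2)
for name, value in parts.items():
    print(f"{name:<12} {value:.1f}")
print(grade(parts["total"]))
```

Volume and engagement saturate at 500 comments and 10 likes per comment, so beyond those points only sentiment and diversity can move the score.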

In [None]:
"""
Creates a comprehensive business intelligence dashboard with creator insights,
strategic recommendations, and executive summary metrics for Ugandan content.
"""

def generate_executive_summary(df_comments, analysis_results=None):
    """
    Generate executive-level summary of Ugandan content performance.

    Parameters:
    df_comments (pd.DataFrame): Comments with sentiment analysis
    analysis_results (dict): Results from previous analysis scripts

    Returns:
    dict: Executive summary metrics
    """

    print("UGANDA CONTENT INTELLIGENCE DASHBOARD")
    print("=" * 55)
    print(f"Executive Summary | Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
    print("=" * 55)

    # Core metrics
    total_comments = len(df_comments)
    total_engagement = df_comments['likes'].sum()
    avg_engagement = df_comments['likes'].mean()

    # Sentiment metrics
    sentiment_dist = df_comments['sentiment'].value_counts(normalize=True) * 100
    positive_rate = sentiment_dist.get('kirungi', 0)
    negative_rate = sentiment_dist.get('kibi', 0)

    # Calculate business impact score
    # Score based on volume, engagement, sentiment, and content diversity
    categories = df_comments['category'].nunique() if 'category' in df_comments.columns else 1
    channels = df_comments['channel_name'].nunique() if 'channel_name' in df_comments.columns else 1

    # Business Impact Score (0-100)
    volume_score = min((total_comments / 500) * 25, 25)  # Max 25 points for volume
    engagement_score = min((avg_engagement / 10) * 25, 25)  # Max 25 points for engagement
    sentiment_score = (positive_rate / 100) * 30  # Max 30 points for positivity
    diversity_score = min((categories + channels) / 10 * 20, 20)  # Max 20 points for diversity

    business_impact_score = volume_score + engagement_score + sentiment_score + diversity_score

    print("📊 KEY PERFORMANCE INDICATORS")
    print("-" * 35)
    print(f"Total Comments Analyzed:     {total_comments:,}")
    print(f"Total Engagement (Likes):    {total_engagement:,}")
    print(f"Average Engagement:          {avg_engagement:.1f} likes/comment")
    print(f"Audience Sentiment:          {positive_rate:.1f}% Positive")
    print(f"Content Categories:          {categories}")
    print(f"Active Channels:             {channels}")
    print(f"Business Impact Score:       {business_impact_score:.1f}/100")
    print()

    # Performance assessment
    if business_impact_score >= 80:
        performance_grade = "Excellent (A)"
        performance_color = "🟢"
        performance_action = "Maintain current strategy"
    elif business_impact_score >= 65:
        performance_grade = "Good (B)"
        performance_color = "🟡"
        performance_action = "Optimize high-performing areas"
    elif business_impact_score >= 50:
        performance_grade = "Average (C)"
        performance_color = "🟠"
        performance_action = "Significant improvement needed"
    else:
        performance_grade = "Poor (D)"
        performance_color = "🔴"
        performance_action = "Complete strategy overhaul required"

    print(f"📈 OVERALL PERFORMANCE ASSESSMENT")
    print("-" * 40)
    print(f"Grade: {performance_color} {performance_grade}")
    print(f"Recommendation: {performance_action}")
    print()

    return {
        'total_comments': total_comments,
        'total_engagement': total_engagement,
        'avg_engagement': avg_engagement,
        'positive_rate': positive_rate,
        'negative_rate': negative_rate,
        'business_impact_score': business_impact_score,
        'performance_grade': performance_grade,
        'performance_action': performance_action,
        'categories': categories,
        'channels': channels
    }

def analyze_creator_performance(df_comments):
    """
    Analyze individual creator/channel performance with actionable insights.

    Parameters:
    df_comments (pd.DataFrame): Comments with channel information

    Returns:
    dict: Creator performance analysis
    """

    print("🎥 CREATOR PERFORMANCE ANALYSIS")
    print("=" * 40)

    if 'channel_name' not in df_comments.columns:
        print("Channel information not available for creator analysis")
        return {}

    # Calculate creator metrics
    creator_metrics = df_comments.groupby('channel_name').agg({
        'likes': ['sum', 'mean', 'count'],
        'sentiment': lambda x: (x == 'kirungi').sum() / len(x) * 100,
        'text': lambda x: x.str.len().mean()
    }).round(2)

    creator_metrics.columns = ['Total_Likes', 'Avg_Likes', 'Comment_Count', 'Positive_Rate', 'Avg_Comment_Length']

    # Calculate creator scores
    # Fall back to 1 when a maximum is 0 (e.g. no likes at all), which would
    # otherwise turn every score into NaN through a 0/0 division
    max_likes = creator_metrics['Total_Likes'].max() or 1
    max_comments = creator_metrics['Comment_Count'].max() or 1
    max_avg_likes = creator_metrics['Avg_Likes'].max() or 1

    creator_metrics['Performance_Score'] = (
        (creator_metrics['Total_Likes'] / max_likes * 30) +
        (creator_metrics['Comment_Count'] / max_comments * 30) +
        (creator_metrics['Positive_Rate'] / 100 * 25) +
        (creator_metrics['Avg_Likes'] / max_avg_likes * 15)
    ).round(1)

    creator_metrics = creator_metrics.sort_values('Performance_Score', ascending=False)

    print("CREATOR PERFORMANCE RANKING")
    print("-" * 30)
    print(f"{'Creator':<20} {'Score':<8} {'Comments':<10} {'Avg Likes':<10} {'Positive%'}")
    print("-" * 70)

    creator_insights = {}

    for creator, row in creator_metrics.iterrows():
        score = row['Performance_Score']
        comments = int(row['Comment_Count'])
        avg_likes = row['Avg_Likes']
        positive = row['Positive_Rate']

        print(f"{creator[:18]:<20} {score:<8} {comments:<10} {avg_likes:<10.1f} {positive:.1f}%")

        # Performance classification
        if score >= 80:
            tier = "Top Performer"
            action = "Scale content production"
        elif score >= 60:
            tier = "Strong Performer"
            action = "Optimize content strategy"
        elif score >= 40:
            tier = "Average Performer"
            action = "Improve content quality"
        else:
            tier = "Needs Improvement"
            action = "Complete strategy review"

        creator_insights[creator] = {
            'score': score,
            'tier': tier,
            'action': action,
            'metrics': row.to_dict()
        }

    print()

    # Top performer insights
    print("🏆 TOP PERFORMER INSIGHTS")
    print("-" * 30)

    top_3 = list(creator_metrics.head(3).index)

    for i, creator in enumerate(top_3, 1):
        metrics = creator_metrics.loc[creator]
        print(f"{i}. {creator}")
        print(f"   Performance Score: {metrics['Performance_Score']}")
        print(f"   Strength: {get_creator_strength(metrics)}")
        print(f"   Opportunity: {get_creator_opportunity(metrics)}")
        print()

    return creator_insights

def get_creator_strength(metrics):
    """Identify creator's main strength."""
    if metrics['Positive_Rate'] >= 90:
        return "Exceptional audience sentiment"
    elif metrics['Avg_Likes'] > 5:  # High engagement per comment
        return "High engagement per comment"
    elif metrics['Comment_Count'] >= 50:
        return "Strong audience engagement volume"
    else:
        return "Consistent content production"

def get_creator_opportunity(metrics):
    """Identify creator's main improvement opportunity."""
    if metrics['Positive_Rate'] < 70:
        return "Improve content sentiment"
    elif metrics['Avg_Likes'] < 3:
        return "Increase engagement tactics"
    elif metrics['Comment_Count'] < 20:
        return "Boost audience interaction"
    else:
        return "Expand content reach"

def generate_strategic_recommendations(df_comments, analysis_results=None):
    """
    Generate strategic recommendations based on comprehensive analysis.

    Parameters:
    df_comments (pd.DataFrame): Comments data
    analysis_results (dict): Results from previous analyses

    Returns:
    dict: Strategic recommendations
    """

    print("🎯 STRATEGIC RECOMMENDATIONS")
    print("=" * 35)

    recommendations = {
        'immediate_actions': [],
        'short_term_strategy': [],
        'long_term_goals': [],
        'risk_mitigation': []
    }

    # Immediate Actions (0-30 days)
    print("🚨 IMMEDIATE ACTIONS (0-30 Days)")
    print("-" * 35)

    # Analyze critical issues
    sentiment_dist = df_comments['sentiment'].value_counts(normalize=True) * 100
    positive_rate = sentiment_dist.get('kirungi', 0)

    if positive_rate < 60:
        action = "URGENT: Address negative sentiment crisis"
        detail = f"Only {positive_rate:.1f}% positive sentiment. Audit content quality immediately."
        print(f"• {action}")
        print(f"  {detail}")
        recommendations['immediate_actions'].append({'action': action, 'detail': detail, 'priority': 'Critical'})

    # Channel-specific urgent actions
    if 'channel_name' in df_comments.columns:
        channel_sentiment = df_comments.groupby('channel_name')['sentiment'].apply(
            lambda x: (x == 'kirungi').sum() / len(x) * 100
        )

        crisis_channels = channel_sentiment[channel_sentiment < 40]
        for channel in crisis_channels.index:
            action = f"Emergency review: {channel}"
            detail = f"Critical sentiment issue ({channel_sentiment[channel]:.1f}% positive)"
            print(f"• {action}")
            print(f"  {detail}")
            recommendations['immediate_actions'].append({'action': action, 'detail': detail, 'priority': 'High'})

    # Content category issues
    if 'category' in df_comments.columns:
        category_sentiment = df_comments.groupby('category')['sentiment'].apply(
            lambda x: (x == 'kirungi').sum() / len(x) * 100
        )

        problem_categories = category_sentiment[category_sentiment < 50]
        for category in problem_categories.index:
            action = f"Fix {category} content strategy"
            detail = f"Underperforming category ({category_sentiment[category]:.1f}% positive)"
            print(f"• {action}")
            print(f"  {detail}")
            recommendations['immediate_actions'].append({'action': action, 'detail': detail, 'priority': 'Medium'})

    print()

    # Short-term Strategy (1-6 months)
    print("📈 SHORT-TERM STRATEGY (1-6 Months)")
    print("-" * 40)

    # Identify growth opportunities
    if 'category' in df_comments.columns:
        category_performance = df_comments.groupby('category').agg({
            'sentiment': lambda x: (x == 'kirungi').sum() / len(x) * 100,
            'likes': 'sum'
        })

        top_categories = category_performance.sort_values('sentiment', ascending=False).head(2)

        for category in top_categories.index:
            sentiment = top_categories.loc[category, 'sentiment']
            action = f"Scale {category} content production"
            detail = f"High-performing category ({sentiment:.1f}% positive sentiment)"
            print(f"• {action}")
            print(f"  {detail}")
            recommendations['short_term_strategy'].append({'action': action, 'detail': detail, 'timeline': '3-6 months'})

    # Engagement optimization
    avg_engagement = df_comments['likes'].mean()
    if avg_engagement < 5:
        action = "Implement engagement optimization program"
        detail = f"Current avg engagement ({avg_engagement:.1f}) below the target of 5 likes per comment"
        print(f"• {action}")
        print(f"  {detail}")
        recommendations['short_term_strategy'].append({'action': action, 'detail': detail, 'timeline': '2-4 months'})

    print()

    # Long-term Goals (6+ months)
    print("🎯 LONG-TERM GOALS (6+ Months)")
    print("-" * 35)

    # Market expansion
    current_categories = df_comments['category'].nunique() if 'category' in df_comments.columns else 0
    if current_categories < 5:
        action = "Diversify content portfolio"
        detail = f"Expand from {current_categories} to 7-10 content categories"
        print(f"• {action}")
        print(f"  {detail}")
        recommendations['long_term_goals'].append({'action': action, 'detail': detail, 'timeline': '6-12 months'})

    # Audience development
    action = "Achieve 90%+ positive sentiment rate"
    detail = "Target: Maintain high-quality content standards across all categories"
    print(f"• {action}")
    print(f"  {detail}")
    recommendations['long_term_goals'].append({'action': action, 'detail': detail, 'timeline': '12+ months'})

    print()

    # Risk Mitigation
    print("⚠️ RISK MITIGATION")
    print("-" * 20)

    # Political content risk
    if 'category' in df_comments.columns:
        political_comments = df_comments[df_comments['category'].str.contains('Politics', case=False, na=False)]
        if len(political_comments) > 0:
            political_sentiment = (political_comments['sentiment'] == 'kirungi').sum() / len(political_comments) * 100
            if political_sentiment < 40:
                risk = "High-risk political content"
                mitigation = "Implement editorial review for political content"
                print(f"• Risk: {risk}")
                print(f"  Mitigation: {mitigation}")
                recommendations['risk_mitigation'].append({'risk': risk, 'mitigation': mitigation})

    # Low engagement risk
    low_engagement_threshold = df_comments['likes'].quantile(0.25)
    low_engagement_pct = (df_comments['likes'] <= low_engagement_threshold).sum() / len(df_comments) * 100

    if low_engagement_pct > 50:
        risk = "High proportion of low-engagement content"
        mitigation = "Develop content quality standards and creator training"
        print(f"• Risk: {risk}")
        print(f"  Mitigation: {mitigation}")
        recommendations['risk_mitigation'].append({'risk': risk, 'mitigation': mitigation})

    return recommendations

def create_performance_scorecard(df_comments, analysis_results=None):
    """
    Create a comprehensive performance scorecard with key metrics.

    Parameters:
    df_comments (pd.DataFrame): Comments data
    analysis_results (dict): Previous analysis results

    Returns:
    dict: Performance scorecard
    """

    print("\n" + "📊 PERFORMANCE SCORECARD")
    print("=" * 30)

    # Calculate key metrics
    metrics = {}

    # Audience Engagement Score
    total_likes = df_comments['likes'].sum()
    avg_likes = df_comments['likes'].mean()
    engagement_score = min((avg_likes / 10) * 100, 100)  # Normalized to 100

    # Content Quality Score
    sentiment_dist = df_comments['sentiment'].value_counts(normalize=True) * 100
    positive_rate = sentiment_dist.get('kirungi', 0)
    quality_score = positive_rate

    # Content Diversity Score
    categories = df_comments['category'].nunique() if 'category' in df_comments.columns else 1
    channels = df_comments['channel_name'].nunique() if 'channel_name' in df_comments.columns else 1
    diversity_score = min((categories + channels) / 10 * 100, 100)

    # Growth Potential Score
    comment_volume = len(df_comments)
    volume_score = min((comment_volume / 500) * 100, 100)

    # Overall Performance Score
    overall_score = (engagement_score * 0.3 + quality_score * 0.4 +
                    diversity_score * 0.2 + volume_score * 0.1)

    # Performance grades
    def get_grade(score):
        if score >= 90: return "A+"
        elif score >= 85: return "A"
        elif score >= 80: return "A-"
        elif score >= 75: return "B+"
        elif score >= 70: return "B"
        elif score >= 65: return "B-"
        elif score >= 60: return "C+"
        elif score >= 55: return "C"
        elif score >= 50: return "C-"
        else: return "D"

    print(f"{'Metric':<25} {'Score':<8} {'Grade':<6} {'Status'}")
    print("-" * 55)
    print(f"{'Audience Engagement':<25} {engagement_score:<8.1f} {get_grade(engagement_score):<6} {'📈' if engagement_score >= 70 else '📉'}")
    print(f"{'Content Quality':<25} {quality_score:<8.1f} {get_grade(quality_score):<6} {'🟢' if quality_score >= 70 else '🔴'}")
    print(f"{'Content Diversity':<25} {diversity_score:<8.1f} {get_grade(diversity_score):<6} {'🎯' if diversity_score >= 70 else '⚠️'}")
    print(f"{'Volume/Reach':<25} {volume_score:<8.1f} {get_grade(volume_score):<6} {'📊' if volume_score >= 70 else '📋'}")
    print("-" * 55)
    print(f"{'OVERALL PERFORMANCE':<25} {overall_score:<8.1f} {get_grade(overall_score):<6} {'🏆' if overall_score >= 80 else '🎯'}")

    scorecard = {
        'engagement_score': engagement_score,
        'quality_score': quality_score,
        'diversity_score': diversity_score,
        'volume_score': volume_score,
        'overall_score': overall_score,
        'overall_grade': get_grade(overall_score),
        'metrics': {
            'total_likes': total_likes,
            'avg_likes': avg_likes,
            'positive_rate': positive_rate,
            'categories': categories,
            'channels': channels,
            'comment_volume': comment_volume
        }
    }

    return scorecard

def generate_roi_analysis(df_comments):
    """
    Generate ROI and business value analysis for content strategy.

    Parameters:
    df_comments (pd.DataFrame): Comments data with engagement metrics

    Returns:
    dict: ROI analysis results
    """

    print("\n" + "💰 RETURN ON INVESTMENT ANALYSIS")
    print("=" * 40)

    # Calculate content performance metrics
    total_engagement = df_comments['likes'].sum()
    total_comments = len(df_comments)

    # Estimate content value (simplified model)
    # Assumptions: 1 like = $0.10 value, 1 comment = $0.05 base value
    engagement_value = total_engagement * 0.10
    comment_value = total_comments * 0.05
    total_estimated_value = engagement_value + comment_value

    print("CONTENT VALUE ESTIMATION")
    print("-" * 25)
    print(f"Total Engagement Value:    ${engagement_value:,.2f}")
    print(f"Total Comment Value:       ${comment_value:,.2f}")
    print(f"Estimated Content Value:   ${total_estimated_value:,.2f}")
    print()

    # Category ROI analysis
    if 'category' in df_comments.columns:
        # Count rows with 'size' so a category whose text values are missing still has a non-zero divisor
        category_roi = df_comments.groupby('category').agg(
            likes=('likes', 'sum'),
            comments=('likes', 'size')
        )

        category_roi['estimated_value'] = (category_roi['likes'] * 0.10 +
                                         category_roi['comments'] * 0.05)
        category_roi['value_per_piece'] = category_roi['estimated_value'] / category_roi['comments']
        category_roi = category_roi.sort_values('value_per_piece', ascending=False)

        print("CATEGORY ROI RANKING")
        print("-" * 25)
        print(f"{'Category':<25} {'Value/Piece':<12} {'Total Value':<12} {'ROI Tier'}")
        print("-" * 70)

        roi_analysis = {}
        for category, row in category_roi.iterrows():
            value_per_piece = row['value_per_piece']
            total_value = row['estimated_value']

            if value_per_piece >= 1.0:
                roi_tier = "High ROI"
            elif value_per_piece >= 0.5:
                roi_tier = "Medium ROI"
            else:
                roi_tier = "Low ROI"

            print(f"{category:<25} ${value_per_piece:<11.2f} ${total_value:<11.2f} {roi_tier}")

            roi_analysis[category] = {
                'value_per_piece': value_per_piece,
                'total_value': total_value,
                'roi_tier': roi_tier,
                'comments': row['comments'],
                'likes': row['likes']
            }

        print()

    # Investment recommendations
    print("💡 INVESTMENT RECOMMENDATIONS")
    print("-" * 35)

    if 'category' in df_comments.columns:
        high_roi_categories = [cat for cat, data in roi_analysis.items()
                              if data['roi_tier'] == "High ROI"]

        if high_roi_categories:
            print("🎯 HIGH-PRIORITY INVESTMENTS:")
            for category in high_roi_categories[:3]:
                value = roi_analysis[category]['value_per_piece']
                print(f"   • Scale {category} production (${value:.2f}/piece)")

        low_roi_categories = [cat for cat, data in roi_analysis.items()
                             if data['roi_tier'] == "Low ROI"]

        if low_roi_categories:
            print("\n⚠️ OPTIMIZATION NEEDED:")
            for category in low_roi_categories:
                value = roi_analysis[category]['value_per_piece']
                print(f"   • Improve {category} strategy (${value:.2f}/piece)")

    return {
        'total_estimated_value': total_estimated_value,
        'engagement_value': engagement_value,
        'comment_value': comment_value,
        'category_roi': roi_analysis if 'category' in df_comments.columns else {},
        'avg_value_per_comment': total_estimated_value / total_comments if total_comments else 0.0
    }

def create_action_plan(df_comments, recommendations, scorecard):
    """
    Create a prioritized action plan based on analysis results.

    Parameters:
    df_comments (pd.DataFrame): Comments data
    recommendations (dict): Strategic recommendations
    scorecard (dict): Performance scorecard

    Returns:
    dict: Prioritized action plan
    """

    print("\n" + "📋 PRIORITIZED ACTION PLAN")
    print("=" * 35)

    # Determine priority areas based on scorecard
    priority_areas = []

    if scorecard['quality_score'] < 70:
        priority_areas.append(("Content Quality", scorecard['quality_score'], "High"))
    if scorecard['engagement_score'] < 70:
        priority_areas.append(("Audience Engagement", scorecard['engagement_score'], "High"))
    if scorecard['diversity_score'] < 50:
        priority_areas.append(("Content Diversity", scorecard['diversity_score'], "Medium"))
    if scorecard['volume_score'] < 60:
        priority_areas.append(("Content Volume", scorecard['volume_score'], "Medium"))

    # Create 30-60-90 day plan
    action_plan = {
        '30_day_plan': [],
        '60_day_plan': [],
        '90_day_plan': [],
        'success_metrics': []
    }

    print("🚀 30-DAY SPRINT PLAN")
    print("-" * 25)

    # 30-day critical actions
    if scorecard['quality_score'] < 60:
        action = "Emergency content quality audit"
        metric = f"Target: Improve positive sentiment from {scorecard['quality_score']:.1f}% to 70%"
        print(f"{len(action_plan['30_day_plan']) + 1}. {action}")
        print(f"   Success Metric: {metric}")
        action_plan['30_day_plan'].append({'action': action, 'metric': metric, 'priority': 'Critical'})

    # Address immediate channel issues
    if 'channel_name' in df_comments.columns:
        channel_performance = df_comments.groupby('channel_name')['sentiment'].apply(
            lambda x: (x == 'kirungi').sum() / len(x) * 100
        )
        worst_channel = channel_performance.idxmin()
        worst_performance = channel_performance.min()

        if worst_performance < 50:
            action = f"Intensive support for {worst_channel}"
            metric = f"Target: Improve {worst_channel} from {worst_performance:.1f}% to 60% positive"
            print(f"{len(action_plan['30_day_plan']) + 1}. {action}")
            print(f"   Success Metric: {metric}")
            action_plan['30_day_plan'].append({'action': action, 'metric': metric, 'priority': 'High'})

    print()
    print("📈 60-DAY DEVELOPMENT PLAN")
    print("-" * 30)

    # 60-day optimization actions
    if scorecard['engagement_score'] < 75:
        action = "Launch engagement optimization program"
        current_avg = df_comments['likes'].mean()
        target_avg = current_avg * 1.5
        metric = f"Target: Increase avg engagement from {current_avg:.1f} to {target_avg:.1f} likes"
        print(f"{len(action_plan['60_day_plan']) + 1}. {action}")
        print(f"   Success Metric: {metric}")
        action_plan['60_day_plan'].append({'action': action, 'metric': metric, 'priority': 'High'})

    # Scale successful content
    if 'category' in df_comments.columns:
        category_performance = df_comments.groupby('category')['sentiment'].apply(
            lambda x: (x == 'kirungi').sum() / len(x) * 100
        ).sort_values(ascending=False)

        best_category = category_performance.index[0]
        action = f"Scale {best_category} content production by 50%"
        metric = f"Target: Maintain {best_category} quality while increasing volume"
        print(f"{len(action_plan['60_day_plan']) + 1}. {action}")
        print(f"   Success Metric: {metric}")
        action_plan['60_day_plan'].append({'action': action, 'metric': metric, 'priority': 'Medium'})

    print()
    print("🎯 90-DAY STRATEGIC PLAN")
    print("-" * 30)

    # 90-day strategic actions
    if scorecard['diversity_score'] < 80:
        action = "Expand content portfolio diversity"
        current_categories = df_comments['category'].nunique() if 'category' in df_comments.columns else 1
        target_categories = current_categories + 2
        metric = f"Target: Expand from {current_categories} to {target_categories} content categories"
        print(f"{len(action_plan['90_day_plan']) + 1}. {action}")
        print(f"   Success Metric: {metric}")
        action_plan['90_day_plan'].append({'action': action, 'metric': metric, 'priority': 'Medium'})

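    # Overall performance target (B+ corresponds to an overall score of 75 or more)
    current_grade = scorecard['overall_grade']
    if scorecard['overall_score'] >= 75:
        action = "Sustain overall performance grade of B+ or higher"
        metric = f"Target: Hold {current_grade} (Overall Score: 75+)"
    else:
        action = "Achieve overall performance grade of B+ or higher"
        metric = f"Target: Improve from {current_grade} to B+ (Overall Score: 75+)"
    print(f"{len(action_plan['90_day_plan']) + 1}. {action}")
    print(f"   Success Metric: {metric}")
    action_plan['90_day_plan'].append({'action': action, 'metric': metric, 'priority': 'High'})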

    # Success metrics summary
    print()
    print("📊 KEY SUCCESS METRICS")
    print("-" * 25)

    success_metrics = [
        f"Positive Sentiment Rate: {scorecard['quality_score']:.1f}% → 80%+",
        f"Average Engagement: {df_comments['likes'].mean():.1f} → {df_comments['likes'].mean() * 1.5:.1f} likes",
        f"Overall Performance: {scorecard['overall_grade']} → A- grade",
        f"Content Quality Score: {scorecard['quality_score']:.1f} → 85+ points"
    ]

    for i, metric in enumerate(success_metrics, 1):
        print(f"{i}. {metric}")
        action_plan['success_metrics'].append(metric)

    return action_plan

def create_business_intelligence_dashboard(df_comments, previous_analyses=None):
    """
    Create comprehensive business intelligence dashboard.

    Parameters:
    df_comments (pd.DataFrame): Comments with sentiment analysis
    previous_analyses (dict): Results from engagement and content analysis

    Returns:
    dict: Complete business intelligence dashboard
    """

    print("🚀 GENERATING BUSINESS INTELLIGENCE DASHBOARD")
    print("=" * 60)
    print(f"Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Dataset: {len(df_comments):,} Ugandan comments analyzed")
    print()

    # Generate all dashboard components
    executive_summary = generate_executive_summary(df_comments, previous_analyses)
    creator_performance = analyze_creator_performance(df_comments)
    strategic_recommendations = generate_strategic_recommendations(df_comments, previous_analyses)
    performance_scorecard = create_performance_scorecard(df_comments, previous_analyses)
    roi_analysis = generate_roi_analysis(df_comments)
    action_plan = create_action_plan(df_comments, strategic_recommendations, performance_scorecard)

    print("\n" + "=" * 60)
    print("✅ BUSINESS INTELLIGENCE DASHBOARD COMPLETE")
    print("=" * 60)

    # Final dashboard summary
    overall_score = performance_scorecard['overall_score']
    business_impact = executive_summary['business_impact_score']

    print("📈 DASHBOARD SUMMARY")
    print("-" * 20)
    print(f"Overall Performance Score: {overall_score:.1f}/100 ({performance_scorecard['overall_grade']})")
    print(f"Business Impact Score: {business_impact:.1f}/100")
    print(f"Estimated Content Value: ${roi_analysis['total_estimated_value']:,.2f}")
    print(f"Primary Focus Area: {action_plan['30_day_plan'][0]['action'] if action_plan['30_day_plan'] else 'Maintain current performance'}")

    # Compile complete dashboard
    dashboard = {
        'executive_summary': executive_summary,
        'creator_performance': creator_performance,
        'strategic_recommendations': strategic_recommendations,
        'performance_scorecard': performance_scorecard,
        'roi_analysis': roi_analysis,
        'action_plan': action_plan,
        'generated_timestamp': datetime.now().isoformat(),
        'dataset_info': {
            'total_comments': len(df_comments),
            'date_range': 'Full dataset',
            'analysis_version': '1.0'
        }
    }

    return dashboard

def export_dashboard_summary(dashboard, filename=None):
    """
    Export dashboard summary to a structured format.

    Parameters:
    dashboard (dict): Complete dashboard results
    filename (str): Optional filename for export

    Returns:
    str: Formatted summary report
    """

    if filename is None:
        filename = f"uganda_content_dashboard_{datetime.now().strftime('%Y%m%d_%H%M')}.txt"

    summary = f"""
UGANDA CONTENT BUSINESS INTELLIGENCE DASHBOARD
=============================================
Generated: {dashboard['generated_timestamp']}
Dataset: {dashboard['dataset_info']['total_comments']:,} comments

EXECUTIVE SUMMARY
-----------------
Overall Performance: {dashboard['performance_scorecard']['overall_score']:.1f}/100 ({dashboard['performance_scorecard']['overall_grade']})
Business Impact Score: {dashboard['executive_summary']['business_impact_score']:.1f}/100
Audience Sentiment: {dashboard['executive_summary']['positive_rate']:.1f}% Positive
Total Engagement: {dashboard['executive_summary']['total_engagement']:,} likes

PERFORMANCE BREAKDOWN
--------------------
Content Quality: {dashboard['performance_scorecard']['quality_score']:.1f}/100
Audience Engagement: {dashboard['performance_scorecard']['engagement_score']:.1f}/100
Content Diversity: {dashboard['performance_scorecard']['diversity_score']:.1f}/100
Volume/Reach: {dashboard['performance_scorecard']['volume_score']:.1f}/100

TOP PRIORITIES (Next 30 Days)
-----------------------------
"""

    for i, action in enumerate(dashboard['action_plan']['30_day_plan'], 1):
        summary += f"{i}. {action['action']}\n"
        summary += f"   {action['metric']}\n\n"

    summary += f"""
ESTIMATED CONTENT VALUE
-----------------------
Total Portfolio Value: ${dashboard['roi_analysis']['total_estimated_value']:,.2f}
Average Value per Comment: ${dashboard['roi_analysis']['avg_value_per_comment']:.2f}

STRATEGIC FOCUS
---------------
Performance Grade: {dashboard['executive_summary']['performance_grade']}
Recommended Action: {dashboard['executive_summary']['performance_action']}
"""

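    # Write the report to disk so the "exported" message below is accurate
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(summary)

    print(f"\n📄 Dashboard summary exported to: {filename}")
    print("Use this report for stakeholder presentations and strategic planning.")

    return summary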

if __name__ == "__main__":
    # This script should be run after all previous analyses
    try:
        # Create complete business intelligence dashboard
        bi_dashboard = create_business_intelligence_dashboard(df_comments)

        # Export summary report
        dashboard_summary = export_dashboard_summary(bi_dashboard)

        print("\nBusiness Intelligence Dashboard created successfully!")
        print("Access results via bi_dashboard dictionary")
        print("Available components:", list(bi_dashboard.keys()))

    except NameError:
        print("Error: df_comments not found. Please run sentiment analysis first.")
        print("This script requires completed sentiment analysis data.")
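
Because every comment in a category contributes the same $0.05 base value, the per-comment value that `generate_roi_analysis` uses for its tiers reduces to 0.05 + 0.10 × (average likes per comment). The sketch below recomputes the tiers by hand on made-up numbers. `LIKE_VALUE`, `COMMENT_VALUE`, `value_per_comment`, `roi_tier`, and the sample categories are illustrative names, and the dollar rates are the notebook's own stated assumptions rather than measured revenue.

```python
# Rates copied from the assumptions in generate_roi_analysis (not measured revenue)
LIKE_VALUE = 0.10
COMMENT_VALUE = 0.05


def value_per_comment(likes, comments):
    """Estimated value per comment for one category."""
    return (likes * LIKE_VALUE + comments * COMMENT_VALUE) / comments


def roi_tier(value):
    """Same cut-offs as the notebook: 1.0 and above is high, 0.5 and above is medium."""
    if value >= 1.0:
        return "High ROI"
    if value >= 0.5:
        return "Medium ROI"
    return "Low ROI"


# Hypothetical categories: (total likes, number of comments)
sample = {"music": (1200, 100), "news": (500, 100), "comedy": (100, 100)}

for name, (likes, comments) in sample.items():
    value = value_per_comment(likes, comments)
    print(f"{name:8} avg likes={likes / comments:5.1f}  value=${value:.2f}  {roi_tier(value)}")
```

With these rates, a category needs about 9.5 average likes per comment to reach "High ROI" and about 4.5 to reach "Medium ROI".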

# 7. VISUALIZATIONS

## CATEGORY-CHANNEL PERFORMANCE HEATMAP

In [None]:
"""
Creates a consumer-friendly heatmap showing which channels perform best
with different types of content. Green areas show excellent performance.
"""

def create_performance_heatmap(df_comments, save_plot=True):
    """
    Create a consumer-friendly heatmap showing channel performance by category.

    Parameters:
    df_comments (pd.DataFrame): Comments with sentiment analysis
    save_plot (bool): Whether to save the plot

    Returns:
    matplotlib.figure.Figure: The created figure
    """

    print("📊 Creating Channel-Category Performance Heatmap...")

    # Check if required columns exist
    if 'channel_name' not in df_comments.columns or 'category' not in df_comments.columns:
        print("❌ Error: Missing required columns (channel_name or category)")
        return None

    # Calculate positive sentiment percentage for each category-channel combination
    sentiment_matrix = df_comments.groupby(['category', 'channel_name']).agg({
        'sentiment': lambda x: (x == 'kirungi').sum() / len(x) * 100
    }).unstack(fill_value=0)

    # Flatten column names
    sentiment_matrix.columns = sentiment_matrix.columns.droplevel(0)

    # Only include channels with significant data (at least 8 comments)
    channel_counts = df_comments.groupby('channel_name').size()
    significant_channels = channel_counts[channel_counts >= 8].index
    sentiment_matrix = sentiment_matrix[significant_channels]

    # Create the visualization
    plt.figure(figsize=(14, 8))

    # Use a green-focused colormap for positive emphasis
    cmap = sns.light_palette("green", as_cmap=True, reverse=False)

    # Create heatmap
    ax = sns.heatmap(sentiment_matrix,
                     annot=True,
                     fmt='.0f',
                     cmap=cmap,
                     square=False,
                     linewidths=1,
                     cbar_kws={'label': 'Audience Satisfaction (% Positive Comments)'},
                     vmin=0,
                     vmax=100)

    # Styling
    plt.title('🇺🇬 Uganda YouTube Channel Performance by Content Type\n' +
              'Which Channels Excel at Different Types of Content?',
              fontsize=16, fontweight='bold', pad=25)
    plt.xlabel('YouTube Channels', fontsize=14, fontweight='bold')
    plt.ylabel('Content Categories', fontsize=14, fontweight='bold')

    # Rotate labels for better readability
    plt.xticks(rotation=45, ha='right', fontsize=11)
    plt.yticks(rotation=0, fontsize=11)

    # Add consumer-friendly interpretation
    plt.figtext(0.02, 0.005,
                'How to Read: Darker green = Higher audience satisfaction. ' +
                'Look for the darkest green areas to see which channels excel with specific content types.',
                fontsize=10, style='italic', wrap=True)

    plt.tight_layout()

    # Print key insights for consumers
    print("\n🔍 KEY INSIGHTS:")

    # Find top performers overall
    overall_performance = sentiment_matrix.mean(axis=0).sort_values(ascending=False)
    print(f"🏆 Best Overall Channel: {overall_performance.index[0]} ({overall_performance.iloc[0]:.0f}% satisfaction)")

    # Find best category-channel combinations
    max_performance = sentiment_matrix.max().max()
    best_combo = sentiment_matrix.stack().idxmax()
    print(f"🎯 Perfect Match: {best_combo[1]} excels at {best_combo[0]} ({max_performance:.0f}% satisfaction)")

    # Find channels with consistent performance (low variation)
    channel_consistency = sentiment_matrix.std(axis=0).sort_values()
    most_consistent = channel_consistency.index[0] if len(channel_consistency) > 0 else "N/A"
    print(f"🎪 Most Consistent: {most_consistent} delivers reliable quality across content types")

    if save_plot:
        plt.savefig('uganda_channel_performance_heatmap.png', dpi=300, bbox_inches='tight')
        print(f"💾 Plot saved as 'uganda_channel_performance_heatmap.png'")

    return plt.gcf()

# Run the visualization
if __name__ == "__main__":
    try:
        fig = create_performance_heatmap(df_comments)
        plt.show()

    except NameError:
        print("❌ Please run the sentiment analysis scripts first to load df_comments")
    except Exception as e:
        print(f"❌ Error creating heatmap: {e}")
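
The heatmap matrix is built with `unstack(fill_value=0)`, so a channel with no comments in a category appears as 0% satisfaction rather than as missing data. The toy example below (invented channels and categories) contrasts that with leaving the gap as `NaN`, which pandas skips when averaging. Which behaviour suits the real dataset is a judgment call; this is only a sketch of the difference.

```python
import pandas as pd

# Invented data: channel B has no "news" comments at all
toy = pd.DataFrame({
    "category": ["music", "music", "music", "news", "news", "news"],
    "channel_name": ["A", "A", "B", "A", "A", "A"],
    "sentiment": ["kirungi", "kibi", "kirungi", "kirungi", "kirungi", "kibi"],
})

# Share of positive ("kirungi") comments per category-channel pair, in percent
positive = toy["sentiment"].eq("kirungi")
pct = positive.groupby([toy["category"], toy["channel_name"]]).mean() * 100

with_zero = pct.unstack(fill_value=0)  # missing pair becomes 0%
with_nan = pct.unstack()               # missing pair stays NaN

print("Channel means, gaps filled with 0:")
print(with_zero.mean(axis=0))
print("Channel means, gaps left as NaN:")
print(with_nan.mean(axis=0))
```

In this toy case channel B averages 50% when the gap is filled with 0 and 100% when it is left as `NaN`, so the "Best Overall Channel" ranking can depend on the choice.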

## SENTIMENT vs ENGAGEMENT ANALYSIS

In [None]:
"""
Creates consumer-friendly charts showing the relationship between how people
feel about content (sentiment) and how much they engage with it (likes).
"""

def create_sentiment_engagement_analysis(df_comments, save_plot=True):
    """
    Create easy-to-understand charts showing sentiment vs engagement patterns.

    Parameters:
    df_comments (pd.DataFrame): Comments with sentiment analysis
    save_plot (bool): Whether to save the plot

    Returns:
    matplotlib.figure.Figure: The created figure
    """

    print("📈 Creating Sentiment vs Engagement Analysis...")

    # Prepare data with friendly labels
    df_plot = df_comments.copy()
    df_plot['sentiment_label'] = df_plot['sentiment'].map({
        'kirungi': 'Positive Comments',
        'kibi': 'Negative Comments'
    })

    # Create side-by-side comparison
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))

    # Plot 1: Simple comparison of average engagement
    sentiment_stats = df_plot.groupby('sentiment_label')['likes'].agg(['mean', 'count']).round(1)

    # Bar chart showing average likes by sentiment
    colors = ['#ff6b6b', '#4ecdc4']  # Red for negative, teal for positive
    bars = ax1.bar(sentiment_stats.index, sentiment_stats['mean'],
                   color=colors, alpha=0.8, edgecolor='white', linewidth=2)

    # Add value labels on bars
    for i, bar in enumerate(bars):
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height + 0.1,
                f'{height:.1f}\nlikes avg',
                ha='center', va='bottom', fontweight='bold', fontsize=12)

    ax1.set_title('Average Engagement by Comment Type\nDo positive comments get more likes?',
                  fontweight='bold', fontsize=14)
    ax1.set_ylabel('Average Likes per Comment', fontweight='bold', fontsize=12)
    ax1.set_xlabel('Comment Sentiment', fontweight='bold', fontsize=12)

    # Make it more visual
    ax1.grid(axis='y', alpha=0.3)
    ax1.set_ylim(0, max(sentiment_stats['mean']) * 1.2)

    # Plot 2: Distribution comparison (box plot); comments with more than 50 likes
    # are hidden here so the boxes stay readable
    df_plot_filtered = df_plot[df_plot['likes'] <= 50]

    # Fix the category order so box colors and mean markers line up with the
    # alphabetical order that groupby() produces (seaborn otherwise uses order of appearance)
    label_order = ['Negative Comments', 'Positive Comments']
    sns.boxplot(data=df_plot_filtered, x='sentiment_label', y='likes',
                order=label_order, palette=colors, ax=ax2)

    # Add mean markers (without text labels), computed on all data for accurate means
    means = df_plot.groupby('sentiment_label')['likes'].mean()
    for i, label in enumerate(label_order):
        if label in means.index:
            ax2.scatter(i, means[label], color='black', s=150, marker='D', zorder=5,
                       edgecolor='white', linewidth=2)

    ax2.set_title('Engagement Distribution Comparison\nHow varied is the engagement? (Filtered for clarity)',
                  fontweight='bold', fontsize=14)
    ax2.set_ylabel('Number of Likes', fontweight='bold', fontsize=12)
    ax2.set_xlabel('Comment Sentiment', fontweight='bold', fontsize=12)

    # Overall title
    plt.suptitle('Does Positive Content Get More Engagement?\nAnalysis of Uganda YouTube Comment Patterns',
                fontsize=16, fontweight='bold', y=1.02)

    plt.tight_layout()

    # Consumer-friendly insights
    print("\n🔍 WHAT THIS MEANS:")

    # .get() avoids a KeyError when one sentiment class is absent from the data
    positive_avg = means.get('Positive Comments', 0.0)
    negative_avg = means.get('Negative Comments', 0.0)
    difference = positive_avg - negative_avg
    ratio = positive_avg / negative_avg if negative_avg > 0 else float('inf')

    if difference > 1:
        print(f"✅ Positive comments get more engagement!")
        print(f"   • Positive comments: {positive_avg:.1f} likes on average")
        print(f"   • Negative comments: {negative_avg:.1f} likes on average")
        print(f"   • Difference: {difference:.1f} more likes for positive comments")
        print(f"   • Ratio: {ratio:.1f}x more engagement for positive content")
    elif difference < -1:
        print(f"⚠️ Negative comments get more engagement!")
        print(f"   • This might indicate controversial content drives discussion")
    else:
        print(f"📊 Engagement is similar regardless of sentiment")
        print(f"   • Both positive and negative comments get similar likes")
        print(f"   • Content quality might matter more than sentiment")

    # Add volume info (counted directly so a missing sentiment class yields 0, not a KeyError)
    label_counts = df_plot['sentiment_label'].value_counts()
    positive_count = label_counts.get('Positive Comments', 0)
    negative_count = label_counts.get('Negative Comments', 0)
    total_comments = max(positive_count + negative_count, 1)  # avoid dividing by zero

    print("\n📊 COMMENT BREAKDOWN:")
    print(f"   • {positive_count} positive comments ({positive_count/total_comments*100:.0f}%)")
    print(f"   • {negative_count} negative comments ({negative_count/total_comments*100:.0f}%)")

    # Simple interpretation, using the same one-like threshold as the console summary above
    if difference > 1:
        verdict = "Positive wins!"
    elif difference < -1:
        verdict = "Negative wins!"
    else:
        verdict = "Very similar engagement."
    plt.figtext(0.5, 0.001,
                f'Simple Summary: Positive comments average {positive_avg:.1f} likes, ' +
                f'negative comments average {negative_avg:.1f} likes. ' +
                verdict,
                fontsize=11, style='italic', ha='center', bbox=dict(boxstyle="round,pad=0.2",
                facecolor="lightblue", alpha=0.7))

    if save_plot:
        plt.savefig('uganda_sentiment_engagement_analysis.png', dpi=300, bbox_inches='tight')
        print(f"💾 Plot saved as 'uganda_sentiment_engagement_analysis.png'")

    return fig

# Run the visualization
if __name__ == "__main__":
    try:
        fig = create_sentiment_engagement_analysis(df_comments)
        plt.show()

    except NameError:
        print("❌ Please run the sentiment analysis scripts first to load df_comments")
    except Exception as e:
        print(f"❌ Error creating analysis: {e}")
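
Like counts can be heavily skewed, so a single viral comment can move the group means that the bar chart and the diamond markers report. As a rough cross-check, the sketch below compares mean and median likes per sentiment. The data is invented for illustration and says nothing about the real dataset.

```python
import pandas as pd

# Invented data: one positive comment went viral with 200 likes
toy = pd.DataFrame({
    "sentiment": ["kirungi"] * 5 + ["kibi"] * 5,
    "likes": [0, 1, 2, 1, 200, 3, 4, 2, 5, 3],
})

# Mean is sensitive to the outlier; median describes the typical comment
summary = toy.groupby("sentiment")["likes"].agg(["mean", "median"])
print(summary)
```

Here the mean favours positive comments while the median favours negative ones, which is a reason to read the two panels together before drawing a conclusion.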

## COMMENT LENGTH SUCCESS PATTERNS

In [None]:
"""
Creates consumer-friendly analysis showing whether longer or shorter comments
get more engagement, helping content creators understand what works.
"""

def create_comment_length_analysis(df_comments, save_plot=True):
    """
    Create easy-to-understand analysis of comment length vs success.

    Parameters:
    df_comments (pd.DataFrame): Comments with sentiment analysis
    save_plot (bool): Whether to save the plot

    Returns:
    matplotlib.figure.Figure: The created figure
    """

    print("📏 Creating Comment Length Success Analysis...")

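    # Prepare data with simple categories
    df_plot = df_comments.copy()
    # Treat missing text as empty so it counts as "Short" instead of falling
    # through to "Long" (NaN fails both <= comparisons in categorize_length)
    text = df_plot['text'].fillna('').astype(str)
    df_plot['comment_length'] = text.str.len()
    df_plot['word_count'] = text.str.split().str.len()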

    # Create simple length categories
    def categorize_length(length):
        if length <= 50:
            return "Short\n(≤50 characters)"
        elif length <= 150:
            return "Medium\n(51-150 characters)"
        else:
            return "Long\n(>150 characters)"

    df_plot['length_category'] = df_plot['comment_length'].apply(categorize_length)

    # Create comparison visualization
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))

    # 1. How many comments in each category?
    length_counts = df_plot['length_category'].value_counts()
    # One fixed color per category so the pie chart and the bar chart agree
    category_colors = {
        "Short\n(≤50 characters)": '#FF9999',      # light red
        "Medium\n(51-150 characters)": '#66B2FF',  # blue
        "Long\n(>150 characters)": '#99FF99',      # light green
    }

    wedges, texts, autotexts = ax1.pie(length_counts.values,
                                      labels=length_counts.index,
                                      colors=[category_colors[c] for c in length_counts.index],
                                      autopct='%1.0f%%',
                                      startangle=90,
                                      textprops={'fontsize': 11, 'fontweight': 'bold'})

    ax1.set_title('How Long Are Most Comments?\nDistribution of Comment Lengths',
                  fontweight='bold', fontsize=12)

    # 2. Which length gets more likes?
    length_engagement = df_plot.groupby('length_category')['likes'].mean().sort_values(ascending=True)

    bars = ax2.barh(length_engagement.index, length_engagement.values,
                    color=[category_colors[c] for c in length_engagement.index], alpha=0.8)
    ax2.set_title('Which Length Gets More Engagement?\nAverage Likes by Comment Length',
                  fontweight='bold', fontsize=12)
    ax2.set_xlabel('Average Likes per Comment', fontweight='bold')

    # Add value labels
    for i, bar in enumerate(bars):
        width = bar.get_width()
        ax2.text(width + 0.1, bar.get_y() + bar.get_height()/2,
                f'{width:.1f}', ha='left', va='center', fontweight='bold', fontsize=11)

    ax2.grid(axis='x', alpha=0.3)

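    # 3. Sentiment by length
    # Pin the column order so the bar colors and the legend below always match
    sentiment_by_length = (pd.crosstab(df_plot['length_category'],
                                      df_plot['sentiment'],
                                      normalize='index') * 100
                          ).reindex(columns=['kibi', 'kirungi'], fill_value=0)

    sentiment_by_length.plot(kind='bar', ax=ax3,
                           color=['#ff6b6b', '#4ecdc4'],
                           alpha=0.8)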
    ax3.set_title('Are Longer Comments More Positive?\nSentiment by Comment Length',
                  fontweight='bold', fontsize=12)
    ax3.set_ylabel('Percentage of Comments (%)', fontweight='bold')
    ax3.set_xlabel('Comment Length Category', fontweight='bold')
    ax3.legend(['Negative (Kibi)', 'Positive (Kirungi)'], loc='upper right')
    ax3.set_xticklabels(ax3.get_xticklabels(), rotation=0)
    ax3.grid(axis='y', alpha=0.3)

    # Add percentage labels on bars
    for container in ax3.containers:
        ax3.bar_label(container, fmt='%.0f%%', fontsize=9, fontweight='bold')

    # 4. Success pattern summary
    ax4.axis('off')  # Turn off axis for text summary

    # Calculate key insights
    best_engagement_category = length_engagement.idxmax()
    best_engagement_value = length_engagement.max()
    most_common_category = length_counts.idxmax()
    most_common_percentage = (length_counts.max() / length_counts.sum()) * 100

    # Calculate positive-sentiment percentages by length
    sentiment_by_length_dict = {}
    for category in df_plot['length_category'].unique():
        if pd.notna(category):
            cat_data = df_plot[df_plot['length_category'] == category]
            positive_pct = (cat_data['sentiment'] == 'kirungi').mean() * 100
            sentiment_by_length_dict[category] = positive_pct

    # Derive the sentiment insight from the data instead of asserting it
    most_positive_category = max(sentiment_by_length_dict, key=sentiment_by_length_dict.get)
    sentiment_insight = (f"{most_positive_category.replace(chr(10), ' ')} comments are the most positive\n"
                         f"({sentiment_by_length_dict[most_positive_category]:.0f}% positive)")

    if "Long" in best_engagement_category:
        recommendation = "Write longer comments for better engagement!"
    elif "Short" in best_engagement_category:
        recommendation = "Keep it concise - shorter works better!"
    else:
        recommendation = "Medium length hits the sweet spot!"

    # Text summary
    summary_text = f"""SUCCESS PATTERNS DISCOVERED:

🏆 BEST ENGAGEMENT:
{best_engagement_category.replace(chr(10), ' ')} comments get the most likes
({best_engagement_value:.1f} likes on average)

📊 MOST COMMON:
{most_common_category.replace(chr(10), ' ')} comments are most popular
({most_common_percentage:.0f}% of all comments)

💡 RECOMMENDATION:
{recommendation}

😊 SENTIMENT INSIGHT:
{sentiment_insight}"""

    ax4.text(0.05, 0.95, summary_text, transform=ax4.transAxes,
            fontsize=11, verticalalignment='top', fontweight='bold',
            bbox=dict(boxstyle="round,pad=0.5", facecolor="lightblue", alpha=0.7))

    # Overall title
    plt.suptitle('Uganda YouTube Comment Length Success Guide\n' +
                'Should You Write Long or Short Comments for Better Engagement?',
                fontsize=16, fontweight='bold', y=0.98)

    plt.tight_layout()

    # Print insights for console
    print("\nCOMMENT LENGTH INSIGHTS:")
    print(f"Best for Engagement: {best_engagement_category.replace(chr(10), ' ')} ({best_engagement_value:.1f} avg likes)")
    print(f"Most Popular: {most_common_category.replace(chr(10), ' ')} ({most_common_percentage:.0f}% of comments)")

    # Print writing tips that match the best-performing length category
    print("\nWRITING TIPS:")
    if "Long" in best_engagement_category:
        print("   • Longer, detailed comments get more engagement")
        print("   • Take time to express your thoughts fully")
        print("   • People appreciate thoughtful responses")
    elif "Short" in best_engagement_category:
        print("   • Short and sweet comments work best")
        print("   • Get to the point quickly")
        print("   • People like concise thoughts")
    else:
        print("   • Medium length is the sweet spot")
        print("   • Not too short, not too long")
        print("   • A few sentences work perfectly")

    # Add character count recommendations based on the best-performing category
    char_stats = df_plot.groupby('length_category')['comment_length'].agg(['mean', 'median'])
    best_category_stats = char_stats.loc[best_engagement_category]
    print(f"\nOPTIMAL LENGTH: Around {best_category_stats['median']:.0f} characters")
    # Rough heuristic: assumes about 5 characters per word
    print(f"   (Roughly {best_category_stats['median']/5:.0f} words, assuming ~5 characters per word)")

    if save_plot:
        plt.savefig('uganda_comment_length_success_guide.png', dpi=300, bbox_inches='tight')
        print("Plot saved as 'uganda_comment_length_success_guide.png'")

    return fig

# Run the visualization
if __name__ == "__main__":
    try:
        fig = create_comment_length_analysis(df_comments)
        plt.show()

    except NameError:
        print("❌ Please run the sentiment analysis scripts first to load df_comments")
    except Exception as e:
        print(f"❌ Error creating analysis: {e}")

# 8. INSIGHTS

### 8.1 Technical Achievement

- Successfully integrated YouTube Data API for Luganda comment extraction
- Deployed CraneAILabs Ganda Gemma model for sentiment analysis
- Implemented authentic Ugandan sentiment labels (Kirungi/Kibi)
- Created end-to-end pipeline for real-world application


### 8.2 Business Value

This sentiment analysis tool can help:
- **Content Creators**: Understand audience reaction to their content
- **Musicians**: Gauge fan sentiment on new releases
- **Brands**: Analyze customer feedback in Luganda
- **Researchers**: Study social media sentiment in Ugandan context

### 8.3 Future Improvements

1. **Scale up data collection** - Analyze thousands of comments
2. **Add more nuanced labels and human validation**
3. **Implement confidence scoring** - Measure prediction certainty
4. **Create real-time dashboard** - Live sentiment monitoring
5. **Expand language support** - Include other Ugandan languages
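
One possible approach to item 3 is to compare the scores the model assigns to the two labels. The sketch below is a minimal illustration, assuming a score (logit or log-probability) for each label has already been obtained from the model. The function name `label_confidence` and the example scores are hypothetical and not part of this notebook.

```python
import math

def label_confidence(score_kirungi: float, score_kibi: float) -> tuple:
    """Turn two label scores (logits or log-probabilities) into (label, confidence).

    Applies a two-way softmax, so the confidence is the normalized probability
    of the winning label and always lies between 0.5 and 1.0.
    """
    # Subtract the larger score before exponentiating for numerical stability
    top = max(score_kirungi, score_kibi)
    exp_kirungi = math.exp(score_kirungi - top)
    exp_kibi = math.exp(score_kibi - top)
    p_kirungi = exp_kirungi / (exp_kirungi + exp_kibi)
    if p_kirungi >= 0.5:
        return "Kirungi", p_kirungi
    return "Kibi", 1.0 - p_kirungi

# Hypothetical scores, for illustration only
label, confidence = label_confidence(-0.4, -2.1)
print(f"{label} ({confidence:.2f})")
```

Predictions whose confidence falls close to 0.5 could then be routed to a human reviewer instead of being counted directly.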

## 8.4 Technical Appendix

### Model Details
- **Model**: CraneAILabs/ganda-gemma-1b
- **Architecture**: Gemma-based language model fine-tuned for Luganda
- **Parameters**: ~1 billion parameters
- **Task**: Binary sentiment classification

### Data Pipeline
1. YouTube Data API v3 for comment extraction
2. Luganda language filtering using linguistic indicators
3. Ganda Gemma model inference for sentiment prediction
4. Post-processing and result aggregation
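
Step 2 of the pipeline can be sketched as a simple keyword filter. This is only an illustration under stated assumptions: the indicator words below are a small hypothetical sample, the function name `looks_like_luganda` is invented here, and the notebook's actual filter may use a different list and threshold.

```python
import re

# Hypothetical indicator words for illustration; the notebook's real list may differ
LUGANDA_INDICATORS = {
    "nnyo", "webale", "kirungi", "kibi",
    "bulungi", "oluyimba", "nkwagala", "katonda",
}

def looks_like_luganda(comment: str, min_hits: int = 1) -> bool:
    """Return True if the comment contains at least `min_hits` indicator words."""
    # Lowercase and split into alphabetic tokens (apostrophes kept)
    tokens = re.findall(r"[a-z']+", comment.lower())
    hits = sum(1 for token in tokens if token in LUGANDA_INDICATORS)
    return hits >= min_hits

comments = ["Oluyimba luno lulungi nnyo", "Great song, love it!"]
luganda_only = [c for c in comments if looks_like_luganda(c)]
print(luganda_only)
```

Raising `min_hits` makes the filter stricter, at the cost of dropping short Luganda comments that contain only one indicator word.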

### Performance Metrics
- **Sample accuracy**: 100% on the notebook's sample test cases; not yet confirmed by human validation (see 8.3)
- **Processing speed**: ~1 comment per second
- **Memory usage**: Optimized for CPU inference
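
Figures like these can be re-measured on any labeled sample. The sketch below shows one way to compute accuracy and throughput. The `evaluate` helper and the `dummy_classify` stand-in are hypothetical and not part of this notebook, so replace the stand-in with the real model call.

```python
import time
from typing import Callable, Iterable, Tuple

def evaluate(classify: Callable[[str], str],
             labeled_comments: Iterable[Tuple[str, str]]) -> dict:
    """Measure accuracy and throughput of `classify` on (comment, expected_label) pairs."""
    examples = list(labeled_comments)
    start = time.perf_counter()
    predictions = [classify(text) for text, _ in examples]
    elapsed = time.perf_counter() - start
    correct = sum(pred == expected
                  for pred, (_, expected) in zip(predictions, examples))
    return {
        "n": len(examples),
        "accuracy": correct / len(examples) if examples else float("nan"),
        "comments_per_second": len(examples) / elapsed if elapsed > 0 else float("inf"),
    }

# Stand-in classifier for illustration only; replace with the real model call
def dummy_classify(text: str) -> str:
    return "Kirungi" if "nnyo" in text.lower() else "Kibi"

sample = [("Oluyimba luno lulungi nnyo", "Kirungi"), ("Kino kibi", "Kibi")]
results = evaluate(dummy_classify, sample)
print(results)
```

Reporting `n` alongside the accuracy makes it clear how large the evaluated sample was.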

### Ethical Considerations
- Respects YouTube's Terms of Service
- Uses publicly available comments only
- Maintains user privacy (no personal data storage)
- Supports indigenous language preservation