# Twitter Sentiment Analysis with the RoBERTa Model

## Reproducing an NLP sentiment analysis project with a simplified architecture

**This project reproduces and extends a study of Twitter sentiment analysis using the RoBERTa-Twitter model, based on the original project by Thịnh Lâm Tấn.**

### 🎯 **Goals**
- Reproduce the AI sentiment analysis model on new data
- Use the RoBERTa-Twitter model from Hugging Face
- Simplify the architecture: Python + Pandas + Transformers (instead of Kafka + Spark + MongoDB)
- Produce visualizations similar to those in the original project

### 🔄 **Workflow**
```
Crawl Data → Text Preprocessing → RoBERTa Analysis → Visualization → Report
```

### 📊 **Architecture comparison**
| Original project | This project |
|------------------|--------------|
| Producer → Kafka → Spark → MongoDB | Python Script → CSV → RoBERTa → Charts |
| Big Data Architecture | Simplified Pipeline |
| Distributed Processing | Single Machine |

---

*This notebook walks through each step of building a complete sentiment analysis system.*

# 1. Environment Setup and Library Installation

First, we install and import the libraries the project needs.
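
Before running the imports, it can help to check which of the required packages are actually importable. The sketch below is one way to do that with the standard library only. The helper name `find_missing_packages` and the package list are illustrative assumptions, not part of the original notebook.

```python
import importlib.util
from typing import List

# Import names used by this notebook (assumed to match the pip package names).
REQUIRED_PACKAGES = [
    "pandas", "numpy", "requests", "transformers",
    "torch", "matplotlib", "seaborn", "plotly", "tqdm",
]

def find_missing_packages(packages: List[str]) -> List[str]:
    """Return the packages that cannot be found on the current Python path."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in packages if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = find_missing_packages(REQUIRED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
        print("Install them with: pip install " + " ".join(missing))
    else:
        print("All required packages are available.")
```

Running this first gives one clear list of what to install, instead of stopping at the first failed import.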

In [None]:
# Install the required libraries (first run only)
# !pip install pandas requests transformers torch matplotlib seaborn plotly tqdm python-dotenv

# Import the main libraries
import pandas as pd
import numpy as np
import requests
import json
import time
import re
import warnings
from datetime import datetime
from typing import List, Dict, Optional, Any

# Machine learning and NLP libraries
try:
    from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
    import torch
    print("✅ Transformers and PyTorch are installed")
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"GPU device: {torch.cuda.get_device_name(0)}")
except ImportError as e:
    print("❌ Error importing transformers/torch:", e)
    print("Please run: !pip install transformers torch")

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Set the plot style
plt.style.use('default')
sns.set_palette("husl")
warnings.filterwarnings('ignore')

print("=" * 50)
print("🚀 SETUP COMPLETE")
print("=" * 50)

# 2. Collecting Data from the Twitter API

We use twitterapi.io to collect tweets matching popular AI keywords, as in the original project.
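
The collection loop boils down to three things: query each keyword, pause between requests to respect the rate limit, and drop duplicate tweet IDs. The sketch below shows that logic with a stand-in `fetch` function instead of a real HTTP call. The names `collect_tweets`, `fetch`, and `fake_fetch` are hypothetical, and the tweet dictionaries only assume an `id` field.

```python
import time
from typing import Callable, Dict, List

def collect_tweets(
    keywords: List[str],
    fetch: Callable[[str], List[Dict]],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict]:
    """Fetch tweets for each keyword, pausing between requests and de-duplicating by id."""
    seen_ids = set()
    collected: List[Dict] = []
    for position, keyword in enumerate(keywords):
        if position > 0:
            sleep(delay_seconds)  # Wait between requests, but not before the first one
        for tweet in fetch(keyword):
            if tweet["id"] in seen_ids:
                continue  # The same tweet can match several keywords
            seen_ids.add(tweet["id"])
            collected.append({**tweet, "keyword": keyword})
    return collected

# Hypothetical stand-in for the real API call: returns canned tweets per keyword.
def fake_fetch(keyword: str) -> List[Dict]:
    canned = {
        "GPT": [{"id": "1", "text": "GPT is great"}, {"id": "2", "text": "GPT vs Gemini"}],
        "Gemini": [{"id": "2", "text": "GPT vs Gemini"}, {"id": "3", "text": "Gemini is ok"}],
    }
    return canned.get(keyword, [])

tweets = collect_tweets(["GPT", "Gemini"], fake_fetch, delay_seconds=0.0)
print([t["id"] for t in tweets])  # ['1', '2', '3']
```

Passing `fetch` and `sleep` in as arguments keeps the loop testable without network access or real waiting. A duplicate tweet keeps the keyword under which it was first seen.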

In [None]:
# API configuration and keywords (same as the original project, plus new keywords)
# NOTE: verify the base URL, auth header, and response fields against the twitterapi.io docs.
TWITTER_API_KEY = "your_api_key_here"  # Replace with a real API key from twitterapi.io
TWITTER_API_BASE_URL = "https://api.twitterapi.io/v1"

# Keywords from the original project plus new keywords
KEYWORDS = [
    "GPT", "Copilot", "Gemini",      # From Thịnh's original project
    "GPT-4o", "Sora", "Llama 3",     # Newer AI keywords
    "Claude", "ChatGPT"              # Additional keywords
]

MAX_TWEETS_PER_KEYWORD = 100  # Kept small so the demo runs quickly
RATE_LIMIT_DELAY = 2  # Seconds to wait between keyword requests

print("📋 Data collection configuration:")
print(f"   Keywords: {KEYWORDS}")
print(f"   Max tweets per keyword: {MAX_TWEETS_PER_KEYWORD}")
print(f"   Total expected tweets: {len(KEYWORDS) * MAX_TWEETS_PER_KEYWORD}")

class TwitterCrawler:
    """
    Twitter data crawler using twitterapi.io
    Similar to the original project but simplified
    """
    
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = TWITTER_API_BASE_URL
        self.session = requests.Session()
        self.session.headers.update({
            'Authorization': f'Bearer {api_key}',
            'Content-Type': 'application/json'
        })
    
    def search_tweets(self, query: str, max_results: int = 100) -> List[Dict]:
        """Collect tweets for a keyword"""
        endpoint = f"{self.base_url}/search"
        
        params = {
            'query': query,
            'max_results': min(max_results, 100),
            'tweet.fields': 'created_at,author_id,public_metrics,lang',
            'user.fields': 'name,username,verified',
            'expansions': 'author_id'
        }
        
        try:
            response = self.session.get(endpoint, params=params, timeout=30)  # Avoid hanging on a stalled connection
            response.raise_for_status()
            
            data = response.json()
            tweets = []
            
            if 'data' in data:
                # Build a lookup from user id to user info
                users = {}
                if 'includes' in data and 'users' in data['includes']:
                    users = {user['id']: user for user in data['includes']['users']}
                
                for tweet in data['data']:
                    user_info = users.get(tweet.get('author_id', ''), {})
                    
                    tweet_data = {
                        'id': tweet.get('id', ''),
                        'text': tweet.get('text', ''),
                        'created_at': tweet.get('created_at', ''),
                        'username': user_info.get('username', ''),
                        'user_name': user_info.get('name', ''),
                        'retweet_count': tweet.get('public_metrics', {}).get('retweet_count', 0),
                        'like_count': tweet.get('public_metrics', {}).get('like_count', 0),
                        'lang': tweet.get('lang', ''),
                        'keyword': query
                    }
                    tweets.append(tweet_data)
            
            return tweets
            
        except requests.exceptions.RequestException as e:
            print(f"❌ API error for keyword '{query}': {e}")
            return []
        except json.JSONDecodeError as e:
            print(f"❌ JSON parse error for keyword '{query}': {e}")
            return []
    
    def crawl_multiple_keywords(self, keywords: List[str], max_tweets: int) -> pd.DataFrame:
        """Collect tweets for multiple keywords"""
        all_tweets = []
        
        print(f"🔄 Starting tweet collection for {len(keywords)} keywords...")
        
        for i, keyword in enumerate(keywords, 1):
            print(f"   [{i}/{len(keywords)}] Crawling: '{keyword}'")
            
            tweets = self.search_tweets(keyword, max_tweets)
            
            if tweets:
                all_tweets.extend(tweets)
                print(f"      ✅ Collected {len(tweets)} tweets")
            else:
                print("      ⚠️ No tweets collected")
            
            # Rate limiting
            if i < len(keywords):  # No delay after the last keyword
                time.sleep(RATE_LIMIT_DELAY)
        
        # Convert to a DataFrame
        df = pd.DataFrame(all_tweets)
        
        if not df.empty:
            # Clean up the data
            df['created_at'] = pd.to_datetime(df['created_at'])
            df = df.drop_duplicates(subset=['id'])  # Remove duplicates
            df = df[df['lang'] == 'en']  # Keep English tweets only
            
            print(f"✅ Done! {len(df)} unique English tweets in total")
        else:
            print("❌ No data collected")
        
        return df

# Create the crawler instance
print("🤖 Initializing Twitter Crawler...")
if TWITTER_API_KEY == "your_api_key_here":
    print("⚠️ WARNING: You need to replace TWITTER_API_KEY with a real API key")
    print("   For the demo, we will generate sample data instead...")
    use_real_api = False
else:
    crawler = TwitterCrawler(TWITTER_API_KEY)
    use_real_api = True

In [None]:
# Collect real data, or generate sample data for the demo
if use_real_api:
    # Use the real API
    print("🔄 Collecting data from the Twitter API...")
    raw_data = crawler.crawl_multiple_keywords(KEYWORDS, MAX_TWEETS_PER_KEYWORD)
else:
    # Generate sample data for the demo
    print("🔧 Generating sample data for the demo...")
    
    sample_tweets = [
        "I love using GPT-4! It's amazing for coding assistance and problem solving.",
        "ChatGPT is helpful but sometimes gives wrong information. Need to be careful.",
        "GitHub Copilot saves me so much time when programming. Highly recommend!",
        "Gemini is decent but I still prefer GPT for most tasks.",
        "The new GPT-4o model is incredibly fast and accurate. Impressed!",
        "Sora AI video generation is mind-blowing. The future is here!",
        "Llama 3 open source model is surprisingly good for a free alternative.",
        "Claude is great for writing and analysis tasks. Very thoughtful responses.",
        "AI copilots in coding are game changers. Can't imagine coding without them now.",
        "These AI tools are making everyone more productive. Exciting times!",
        "GPT sometimes hallucinates facts. Always double-check important information.",
        "Copilot suggestions are usually good but sometimes completely off-topic.",
        "I'm worried about AI replacing human creativity and jobs.",
        "The quality of AI responses keeps getting better every month.",
        "Using multiple AI tools together gives the best results for complex tasks."
    ] * 10  # Repeat to get enough data
    
    # Build the sample DataFrame
    import random
    random.seed(42)  # Fixed seed so the synthetic metadata is reproducible
    
    raw_data = []
    for i, text in enumerate(sample_tweets):
        tweet_data = {
            'id': f'tweet_{i}',
            'text': text,
            'created_at': pd.Timestamp.now() - pd.Timedelta(days=random.randint(0, 30)),
            'username': f'user_{i % 20}',
            'user_name': f'User {i % 20}',
            'retweet_count': random.randint(0, 100),
            'like_count': random.randint(0, 500),
            'lang': 'en',
            'keyword': random.choice(KEYWORDS)  # Assigned at random, so it may not match the tweet text
        }
        raw_data.append(tweet_data)
    
    raw_data = pd.DataFrame(raw_data)
    print(f"✅ Created {len(raw_data)} sample tweets")

# Show a summary of the collected data
print("\n📊 COLLECTED DATA SUMMARY:")
print(f"   Total tweets: {len(raw_data)}")
print(f"   Time range: {raw_data['created_at'].min()} to {raw_data['created_at'].max()}")

# Distribution by keyword
keyword_distribution = raw_data['keyword'].value_counts()
print("\n📈 Distribution by keyword:")
for keyword, count in keyword_distribution.items():
    print(f"   {keyword}: {count} tweets")

# Show a data sample
print("\n🔍 Data sample:")
print(raw_data[['text', 'username', 'keyword', 'created_at']].head())

# 3. Text Preprocessing and Cleaning

We apply the same preprocessing steps as the original project: lowercasing, then removing URLs, mentions, hashtags, and special characters.
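
The steps above can be sketched as a single standalone function. This is a minimal illustration of the cleaning order, not the class used in the next cell. The name `clean_tweet`, the simplified URL pattern, and the example tweet are assumptions made for the sketch.

```python
import re

URL_RE = re.compile(r"https?://\S+")        # Simplified URL pattern (assumption)
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")   # Applied after lowercasing
WHITESPACE_RE = re.compile(r"\s+")

def clean_tweet(text: str) -> str:
    """Apply the cleaning steps in order and return the normalized text."""
    text = text.lower()
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = HASHTAG_RE.sub("", text)
    text = NON_ALNUM_RE.sub("", text)
    return WHITESPACE_RE.sub(" ", text).strip()

example = "Loving the new GPT-4o from @OpenAI!! Details: https://example.com/post #AI #LLM"
print(clean_tweet(example))  # loving the new gpt4o from details
```

Note two side effects of this cleaning: punctuation inside a word is dropped (`GPT-4o` becomes `gpt4o`), and hashtags are removed together with their text, so any sentiment carried by a hashtag is lost.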

In [None]:
class TextPreprocessor:
    """
    Text preprocessing class following the original project
    Applies cleaning steps similar to the SQLTransformer in Spark
    """
    
    def __init__(self):
        self.cleaning_stats = {
            'urls_removed': 0,
            'mentions_removed': 0,
            'hashtags_removed': 0,
            'texts_too_short': 0
        }
    
    def remove_urls(self, text: str) -> str:
        """Remove URLs"""
        url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
        if url_pattern.search(text):
            self.cleaning_stats['urls_removed'] += 1  # Counts texts that contained a URL
        return url_pattern.sub('', text)
    
    def remove_mentions(self, text: str) -> str:
        """Remove @mentions"""
        mention_pattern = re.compile(r'@\w+')  # \w already includes the underscore
        if mention_pattern.search(text):
            self.cleaning_stats['mentions_removed'] += 1
        return mention_pattern.sub('', text)
    
    def remove_hashtags(self, text: str) -> str:
        """Remove hashtags"""
        hashtag_pattern = re.compile(r'#\w+')
        if hashtag_pattern.search(text):
            self.cleaning_stats['hashtags_removed'] += 1
        return hashtag_pattern.sub('', text)
    
    def remove_special_chars(self, text: str) -> str:
        """Remove special characters, keeping only letters, digits, and whitespace"""
        return re.sub(r'[^a-zA-Z0-9\s]', '', text)
    
    def remove_extra_whitespace(self, text: str) -> str:
        """Collapse repeated whitespace into single spaces"""
        return re.sub(r'\s+', ' ', text).strip()
    
    def clean_text(self, text: str) -> str:
        """
        Apply all cleaning steps
        in the same order as the original project
        """
        if not isinstance(text, str) or not text.strip():
            return ""
        
        # 1. Lowercase (as in the original project)
        text = text.lower()
        
        # 2. Remove URLs
        text = self.remove_urls(text)
        
        # 3. Remove mentions and hashtags
        text = self.remove_mentions(text)
        text = self.remove_hashtags(text)
        
        # 4. Remove special characters
        text = self.remove_special_chars(text)
        
        # 5. Collapse extra whitespace
        text = self.remove_extra_whitespace(text)
        
        # 6. Enforce a minimum length
        if len(text) < 10:
            self.cleaning_stats['texts_too_short'] += 1
            return ""
        
        return text
    
    def preprocess_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
        """Process an entire DataFrame"""
        print("🧹 Starting text preprocessing...")
        
        # Reset stats
        self.cleaning_stats = {key: 0 for key in self.cleaning_stats}
        
        # Work on a copy so the original data is not modified
        processed_df = df.copy()
        
        # Apply the cleaning
        processed_df['cleaned_text'] = processed_df['text'].apply(self.clean_text)
        
        # Drop tweets that are empty after cleaning
        original_count = len(processed_df)
        processed_df = processed_df[processed_df['cleaned_text'].str.len() > 0].copy()
        removed_count = original_count - len(processed_df)
        
        # Add metadata
        processed_df['original_length'] = processed_df['text'].str.len()
        processed_df['cleaned_length'] = processed_df['cleaned_text'].str.len()
        processed_df['preprocessing_timestamp'] = pd.Timestamp.now()
        
        # Print statistics
        print("   ✅ Processing complete:")
        print(f"      - URLs removed: {self.cleaning_stats['urls_removed']}")
        print(f"      - Mentions removed: {self.cleaning_stats['mentions_removed']}")
        print(f"      - Hashtags removed: {self.cleaning_stats['hashtags_removed']}")
        print(f"      - Texts too short: {self.cleaning_stats['texts_too_short']}")
        print(f"      - Tweets removed: {removed_count}")
        print(f"      - Tweets remaining: {len(processed_df)}")
        
        return processed_df

# Initialize the preprocessor and process the data
preprocessor = TextPreprocessor()
processed_data = preprocessor.preprocess_dataframe(raw_data)

# Show examples before and after cleaning
print("\n🔍 EXAMPLES BEFORE AND AFTER CLEANING:")
sample_indices = [0, 5, 10]
for i in sample_indices:
    if i < len(processed_data):
        # Read both columns from the same row so the pair stays aligned
        # even when preprocessing dropped some tweets
        row = processed_data.iloc[i]
        print(f"\n[{i+1}] Original: {row['text']}")
        print(f"    Cleaned:  {row['cleaned_text']}")

# Text length statistics
print("\n📏 TEXT LENGTH STATISTICS:")
print(f"   Average length (original): {processed_data['original_length'].mean():.1f} characters")
print(f"   Average length (cleaned): {processed_data['cleaned_length'].mean():.1f} characters")
print(f"   Reduction: {((processed_data['original_length'].mean() - processed_data['cleaned_length'].mean()) / processed_data['original_length'].mean() * 100):.1f}%")

# 4. Loading and Configuring the RoBERTa Model

We use the `cardiffnlp/twitter-roberta-base-sentiment-latest` model from Hugging Face, the same model as the original project.
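
A sequence-classification model such as this one produces one raw score (logit) per class. A softmax turns those scores into probabilities, and the highest probability gives the predicted label and its confidence. The sketch below illustrates that step in plain Python. The logit values are made up, and the index order (0 = Negative, 1 = Neutral, 2 = Positive) is taken from the label mapping used in this notebook.

```python
import math
from typing import Dict, List, Tuple

ID_TO_LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}

def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]  # Subtract the max for numerical stability
    total = sum(exps)
    return [value / total for value in exps]

def classify(logits: List[float]) -> Tuple[str, float, Dict[str, float]]:
    """Return the top label, its probability, and the full label-to-probability map."""
    probabilities = softmax(logits)
    scores = {ID_TO_LABEL[index]: prob for index, prob in enumerate(probabilities)}
    best_label = max(scores, key=scores.get)
    return best_label, scores[best_label], scores

# Hypothetical logits for one tweet (not real model output).
label, confidence, all_scores = classify([-1.2, 0.3, 2.1])
print(label, round(confidence, 3))
```

The `sentiment_score` column produced later in this notebook is this top-class probability. It measures how confident the model is, not whether the prediction is correct.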

In [None]:
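# RoBERTa model configuration
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"
BATCH_SIZE = 16  # Kept small to avoid memory issues

# Label mapping for RoBERTa (following the original project).
# Some checkpoints report generic LABEL_n names while others report
# lowercase names directly, so both forms are mapped.
SENTIMENT_LABEL_MAPPING = {
    'LABEL_0': 'Negative',
    'LABEL_1': 'Neutral',
    'LABEL_2': 'Positive',
    'negative': 'Negative',
    'neutral': 'Neutral',
    'positive': 'Positive'
}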

class SentimentAnalyzer:
    """
    Sentiment analysis using RoBERTa-Twitter
    Similar to the Pandas UDF approach in the original project, but simplified
    """
    
    def __init__(self, model_name: str = MODEL_NAME, batch_size: int = BATCH_SIZE):
        self.model_name = model_name
        self.batch_size = batch_size
        self.pipeline = None
        self.device = None
        
        self._setup_model()
    
    def _setup_model(self):
        """Initialize the sentiment analysis pipeline"""
        # Check for a GPU
        self.device = 0 if torch.cuda.is_available() else -1
        device_name = "GPU" if self.device == 0 else "CPU"
        
        print(f"🤖 Loading RoBERTa model: {self.model_name}")
        print(f"   Device: {device_name}")
        
        try:
            # Load the pipeline (as in the original project)
            self.pipeline = pipeline(
                "sentiment-analysis",
                model=self.model_name,
                tokenizer=self.model_name,
                device=self.device,
                top_k=None  # Return scores for all labels (replaces the deprecated return_all_scores=True)
            )
            
            print("   ✅ Model loaded successfully!")
            
        except Exception as e:
            print(f"   ❌ Error loading model: {e}")
            print("   🔄 Retrying on CPU...")
            
            # Fall back to CPU
            self.device = -1
            self.pipeline = pipeline(
                "sentiment-analysis",
                model=self.model_name,
                tokenizer=self.model_name,
                device=self.device,
                top_k=None
            )
            print("   ✅ Model loaded on CPU!")
    
    def _process_predictions(self, predictions: List[List[Dict]]) -> List[Dict]:
        """Convert raw predictions into a readable format"""
        processed_results = []
        
        for pred_list in predictions:
            # Find the prediction with the highest score
            best_pred = max(pred_list, key=lambda x: x['score'])
            
            # Map the label to a readable form
            raw_label = best_pred['label']
            readable_label = SENTIMENT_LABEL_MAPPING.get(raw_label, raw_label)
            
            # Build the result dictionary
            result = {
                'sentiment_label': readable_label,
                'sentiment_score': best_pred['score'],
                'raw_label': raw_label,
                'all_scores': {
                    SENTIMENT_LABEL_MAPPING.get(item['label'], item['label']): item['score'] 
                    for item in pred_list
                }
            }
            
            processed_results.append(result)
        
        return processed_results
    
    def predict_sentiment(self, texts: List[str]) -> List[Dict]:
        """Predict sentiment for a list of texts using batch processing"""
        if not texts:
            return []
        
        # Keep only valid (non-blank) texts
        valid_texts = [text for text in texts if text and text.strip()]
        
        if not valid_texts:
            print("⚠️ No valid texts to analyze")
            return []
        
        print(f"🔄 Analyzing sentiment for {len(valid_texts)} texts...")
        
        results = []
        
        # Process in batches (as in the original project)
        from tqdm import tqdm
        
        for i in tqdm(range(0, len(valid_texts), self.batch_size), desc="Processing batches"):
            batch_texts = valid_texts[i:i + self.batch_size]
            
            try:
                # Call the model
                batch_predictions = self.pipeline(batch_texts)
                
                # Process the results
                batch_results = self._process_predictions(batch_predictions)
                results.extend(batch_results)
                
            except Exception as e:
                print(f"❌ Error processing batch {i//self.batch_size + 1}: {e}")
                # Add placeholder results for the failed batch (a separate dict per text)
                empty_results = [{
                    'sentiment_label': 'Unknown',
                    'sentiment_score': 0.0,
                    'raw_label': 'ERROR',
                    'all_scores': {}
                } for _ in batch_texts]
                results.extend(empty_results)
        
        return results
    
    def analyze_dataframe(self, df: pd.DataFrame, text_column: str = 'cleaned_text') -> pd.DataFrame:
        """Analyze sentiment for a DataFrame (analogous to Spark DataFrame processing)"""
        if df.empty:
            print("⚠️ DataFrame is empty")
            return df
        
        if text_column not in df.columns:
            print(f"❌ Column '{text_column}' not found")
            return df
        
        print(f"🚀 Starting sentiment analysis for {len(df)} tweets...")
        
        # Keep only rows with non-blank text: predict_sentiment skips blank texts,
        # so this keeps predictions aligned one-to-one with the rows
        has_text = df[text_column].fillna('').astype(str).str.strip().str.len() > 0
        result_df = df[has_text].copy()
        
        # Get the texts to analyze
        texts = result_df[text_column].astype(str).tolist()
        
        # Run the analysis
        predictions = self.predict_sentiment(texts)
        
        if predictions:
            # Add the results to the DataFrame
            result_df['sentiment_label'] = [pred['sentiment_label'] for pred in predictions]
            result_df['sentiment_score'] = [pred['sentiment_score'] for pred in predictions]
            result_df['raw_sentiment_label'] = [pred['raw_label'] for pred in predictions]
            
            # Add the individual scores
            result_df['positive_score'] = [pred['all_scores'].get('Positive', 0.0) for pred in predictions]
            result_df['negative_score'] = [pred['all_scores'].get('Negative', 0.0) for pred in predictions]
            result_df['neutral_score'] = [pred['all_scores'].get('Neutral', 0.0) for pred in predictions]
            
            # Metadata
            result_df['analysis_timestamp'] = pd.Timestamp.now()
            result_df['model_used'] = self.model_name
            
            print("✅ Analysis complete!")
            
            # Result statistics
            sentiment_counts = result_df['sentiment_label'].value_counts()
            print("\n📊 Sentiment distribution:")
            for sentiment, count in sentiment_counts.items():
                percentage = (count / len(result_df)) * 100
                print(f"   {sentiment}: {count} ({percentage:.1f}%)")
        
        else:
            print("❌ No prediction results")
        
        return result_df

# Initialize the analyzer
print("🤖 Initializing Sentiment Analyzer...")
analyzer = SentimentAnalyzer()

# Test on a few sample sentences first
test_texts = [
    "i love using gpt its amazing for coding",
    "chatgpt is helpful but sometimes wrong",
    "github copilot saves me time programming"
]

print("\n🧪 Test on sample sentences:")
test_results = analyzer.predict_sentiment(test_texts)
for text, result in zip(test_texts, test_results):
    print(f"   Text: {text}")
    print(f"   → {result['sentiment_label']} ({result['sentiment_score']:.3f})")
    print()

# 5. Running the Sentiment Analysis

We apply the RoBERTa model to predict the sentiment of every tweet in the preprocessed dataset.
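
Running the model over the whole dataset is done in fixed-size batches, and a batch that fails should not shift the remaining predictions out of line with their tweets. The sketch below shows one way to keep that alignment. The predictor `toy_predict` is a hypothetical stand-in for the model, and the result dictionaries are simplified.

```python
from typing import Callable, Dict, Iterator, List

def iter_batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def predict_in_batches(
    texts: List[str],
    predict: Callable[[List[str]], List[Dict]],
    batch_size: int = 16,
) -> List[Dict]:
    """Run predict on each batch; a failed batch yields 'Unknown' placeholders."""
    results: List[Dict] = []
    for batch in iter_batches(texts, batch_size):
        try:
            results.extend(predict(batch))
        except Exception:
            # Emit one placeholder per input so the output stays aligned with texts
            results.extend({"label": "Unknown", "score": 0.0} for _ in batch)
    return results

# Hypothetical stand-in predictor: fails on any batch containing the word "boom".
def toy_predict(batch: List[str]) -> List[Dict]:
    if any("boom" in text for text in batch):
        raise RuntimeError("simulated model failure")
    return [{"label": "Positive", "score": 0.9} for _ in batch]

outputs = predict_in_batches(["a", "b", "boom", "d", "e"], toy_predict, batch_size=2)
print([o["label"] for o in outputs])
# ['Positive', 'Positive', 'Unknown', 'Unknown', 'Positive']
```

A failure costs the whole batch it occurs in, so a smaller batch size limits how many tweets end up labeled `Unknown`.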

In [None]:
# Run sentiment analysis on the full dataset
print("🚀 STARTING SENTIMENT ANALYSIS FOR THE FULL DATASET")
print("=" * 60)

analyzed_data = analyzer.analyze_dataframe(processed_data, text_column='cleaned_text')

print("\n✅ SENTIMENT ANALYSIS COMPLETE!")
print(f"📊 Total tweets analyzed: {len(analyzed_data)}")

# Show detailed statistics
print("\n📈 DETAILED STATISTICS:")

# 1. Overall sentiment distribution
overall_sentiment = analyzed_data['sentiment_label'].value_counts()
print("\n1️⃣ Overall sentiment distribution:")
for sentiment, count in overall_sentiment.items():
    percentage = (count / len(analyzed_data)) * 100
    print(f"   {sentiment}: {count} tweets ({percentage:.1f}%)")

# 2. Sentiment distribution by keyword
print("\n2️⃣ Sentiment distribution by keyword:")
sentiment_by_keyword = analyzed_data.groupby(['keyword', 'sentiment_label']).size().unstack(fill_value=0)
print(sentiment_by_keyword)

# 3. Average confidence score
print("\n3️⃣ Average confidence score:")
avg_confidence = analyzed_data.groupby('sentiment_label')['sentiment_score'].mean()
for sentiment, score in avg_confidence.items():
    print(f"   {sentiment}: {score:.3f}")

# 4. Example tweet for each sentiment
print("\n4️⃣ Example tweet for each sentiment:")
for sentiment in ['Positive', 'Negative', 'Neutral']:
    if sentiment in analyzed_data['sentiment_label'].values:
        sample = analyzed_data[analyzed_data['sentiment_label'] == sentiment].iloc[0]
        print(f"\n   {sentiment.upper()}:")
        print(f"   Original: {sample['text'][:100]}...")
        print(f"   Cleaned:  {sample['cleaned_text'][:100]}...")
        print(f"   Score:    {sample['sentiment_score']:.3f}")
        print(f"   Keyword:  {sample['keyword']}")

# 5. Save the results
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = f"tweets_analyzed_{timestamp}.csv"

# Select the important columns to save
columns_to_save = [
    'id', 'text', 'cleaned_text', 'keyword', 'created_at', 'username',
    'sentiment_label', 'sentiment_score', 'positive_score', 'negative_score', 'neutral_score',
    'retweet_count', 'like_count'
]

analyzed_data[columns_to_save].to_csv(output_file, index=False, encoding='utf-8')
print(f"\n💾 Results saved to file: {output_file}")

print("\n" + "=" * 60)
print("🎉 SENTIMENT ANALYSIS COMPLETE!")
print("=" * 60)

# 6. Result Analysis and Visualization

We create charts similar to those in the original project (Figures 6.4 and 6.5) to compare and analyze the results.
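
The stacked bar chart is built from per-keyword percentages: for each keyword, the count of each sentiment label divided by that keyword's total. The sketch below shows the calculation on a handful of made-up (keyword, sentiment) pairs using only the standard library. The function name `keyword_sentiment_shares` is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

def keyword_sentiment_shares(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Map each keyword to the percentage share of each sentiment label."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for keyword, label in pairs:
        counts[keyword][label] += 1
    return {
        keyword: {label: 100.0 * n / sum(counter.values()) for label, n in counter.items()}
        for keyword, counter in counts.items()
    }

# Made-up (keyword, sentiment) pairs for illustration.
rows = [
    ("GPT", "Positive"), ("GPT", "Positive"), ("GPT", "Negative"), ("GPT", "Neutral"),
    ("Claude", "Positive"), ("Claude", "Neutral"),
]
shares = keyword_sentiment_shares(rows)
print(shares["GPT"])     # {'Positive': 50.0, 'Negative': 25.0, 'Neutral': 25.0}
print(shares["Claude"])  # {'Positive': 50.0, 'Neutral': 50.0}
```

Normalizing per keyword makes keywords with very different tweet volumes comparable, which raw counts do not.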

In [None]:
# Visualization settings
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

# Color mapping for sentiments
sentiment_colors = {
    'Positive': '#2ca02c',    # Green
    'Negative': '#d62728',    # Red  
    'Neutral': '#ff7f0e'      # Orange
}

print("🎨 RESULT VISUALIZATION - SIMILAR TO THE ORIGINAL PROJECT")
print("=" * 60)

# 1. Stacked bar chart - Sentiment Distribution by Keyword (like Figure 6.4)
print("\n1️⃣ Creating the sentiment-by-keyword distribution chart...")

# Compute percentages
sentiment_by_keyword = analyzed_data.groupby(['keyword', 'sentiment_label']).size().unstack(fill_value=0)
sentiment_percentages = sentiment_by_keyword.div(sentiment_by_keyword.sum(axis=1), axis=0) * 100

# Create the chart
fig, ax = plt.subplots(figsize=(14, 8))
colors = [sentiment_colors.get(col, '#808080') for col in sentiment_percentages.columns]
sentiment_percentages.plot(kind='bar', stacked=True, ax=ax, color=colors, width=0.7)

ax.set_title('Sentiment Distribution by Keyword\n(Similar to Figure 6.4 - Original Project)', 
             fontsize=16, fontweight='bold', pad=20)
ax.set_xlabel('Keywords', fontsize=12, fontweight='bold')
ax.set_ylabel('Percentage (%)', fontsize=12, fontweight='bold')
ax.legend(title='Sentiment', bbox_to_anchor=(1.05, 1), loc='upper left')
ax.set_xticklabels(sentiment_percentages.index, rotation=45, ha='right')

# Add labels on the bars
for container in ax.containers:
    ax.bar_label(container, fmt='%.1f%%', label_type='center', fontsize=9)

plt.tight_layout()
plt.show()

# 2. Overall sentiment charts (like Figure 6.5)
print("\n2️⃣ Creating the overall sentiment charts...")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Pie chart
sentiment_counts = analyzed_data['sentiment_label'].value_counts()
colors = [sentiment_colors.get(sentiment, '#808080') for sentiment in sentiment_counts.index]
wedges, texts, autotexts = ax1.pie(sentiment_counts.values, labels=sentiment_counts.index, 
                                  autopct='%1.1f%%', colors=colors, startangle=90)
ax1.set_title('Overall Sentiment Distribution\n(Similar to Figure 6.5)', fontweight='bold')

# Bar chart (reuses the per-sentiment colors computed for the pie chart)
bars = ax2.bar(sentiment_counts.index, sentiment_counts.values, 
               color=colors,
               alpha=0.8)
ax2.set_title('Sentiment Counts', fontweight='bold')
ax2.set_xlabel('Sentiment')
ax2.set_ylabel('Number of Tweets')

# Add labels above the bars
for bar in bars:
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height + 0.01*max(sentiment_counts),
             f'{int(height)}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

# 3. Heatmap of tweet counts by keyword and sentiment
print("\n3️⃣ Creating the keyword-by-sentiment heatmap...")

plt.figure(figsize=(10, 6))
# Use raw counts for readability
heatmap_data = analyzed_data.groupby(['keyword', 'sentiment_label']).size().unstack(fill_value=0)
sns.heatmap(heatmap_data, annot=True, fmt='d', cmap='RdYlGn', center=heatmap_data.mean().mean())
plt.title('Sentiment Counts by Keyword (Heatmap)', fontweight='bold', pad=20)
plt.xlabel('Sentiment')
plt.ylabel('Keywords')
plt.tight_layout()
plt.show()

# 4. Box plot of sentiment scores
print("\n4️⃣ Creating the box plot of sentiment scores...")

# Pass an explicit Axes so pandas draws on this figure instead of creating a second one
fig, ax = plt.subplots(figsize=(12, 6))
analyzed_data.boxplot(column='sentiment_score', by='sentiment_label',
                      ax=ax, patch_artist=True)
ax.set_title('Distribution of Confidence Scores by Sentiment')
ax.set_xlabel('Sentiment Label')
ax.set_ylabel('Confidence Score')
fig.suptitle('')  # Remove the automatic title
plt.tight_layout()
plt.show()

print("\n✅ All charts complete!")

# 5. T·∫°o b·∫£ng so s√°nh v·ªõi ƒë·ªì √°n g·ªëc (n·∫øu c√≥ d·ªØ li·ªáu)
print("\\n5Ô∏è‚É£ So s√°nh v·ªõi k·∫øt qu·∫£ ƒë·ªì √°n g·ªëc:")
print("\\nüìä K·∫æT QU·∫¢ HI·ªÜN T·∫†I:")
current_results = analyzed_data['sentiment_label'].value_counts(normalize=True) * 100
for sentiment, percentage in current_results.items():
    print(f"   {sentiment}: {percentage:.1f}%")

print("\\nüìö SO S√ÅNH V·ªöI ƒê·ªí √ÅN G·ªêC:")
print("   - ƒê·ªì √°n g·ªëc ƒë·∫°t ~82% accuracy tr√™n RoBERTa")
print("   - S·ª≠ d·ª•ng c√πng model: cardiffnlp/twitter-roberta-base-sentiment-latest")
print("   - Pipeline t∆∞∆°ng t·ª±: Text cleaning ‚Üí RoBERTa ‚Üí Classification")
print("   - Kh√°c bi·ªát: ƒê∆°n gi·∫£n h√≥a ki·∫øn tr√∫c (Python vs Spark)")

# 6. Advanced statistics
print("\n6️⃣ Advanced statistics:")
print(f"   Average confidence score: {analyzed_data['sentiment_score'].mean():.3f}")
print(f"   Tweets with high confidence (>0.8): {(analyzed_data['sentiment_score'] > 0.8).sum()} ({(analyzed_data['sentiment_score'] > 0.8).mean()*100:.1f}%)")
print(f"   Keyword with the most positive sentiment: {sentiment_by_keyword.loc[:, 'Positive'].idxmax()}")
print(f"   Keyword with the most negative sentiment: {sentiment_by_keyword.loc[:, 'Negative'].idxmax()}")

print("\n" + "=" * 60)
print("🎯 VISUALIZATION COMPLETE!")
print("=" * 60)
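
The keyword-by-sentiment table behind the heatmap and the "most positive keyword" lookup can be checked on a tiny hand-made frame. This is a minimal sketch, assuming the same column names (`keyword`, `sentiment_label`) and capitalized labels as the cell above; the toy rows are invented purely for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only (not real tweets)
toy = pd.DataFrame({
    "keyword": ["ai", "ai", "ai", "crypto", "crypto"],
    "sentiment_label": ["Positive", "Positive", "Negative", "Negative", "Neutral"],
})

# Same reshaping as the heatmap cell: rows = keywords, columns = labels
counts = toy.groupby(["keyword", "sentiment_label"]).size().unstack(fill_value=0)

# Row-normalize so keywords with different tweet volumes are comparable
shares = counts.div(counts.sum(axis=1), axis=0)

# Guard against a label that never occurs before calling idxmax
most_positive = shares["Positive"].idxmax() if "Positive" in shares.columns else None

print(counts)
print(f"Most positive keyword: {most_positive}")
```

Ranking keywords by share rather than by raw count avoids favoring whichever keyword happened to return the most tweets.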

# 7. Performance Evaluation and Comparison

Analyze the model's performance and compare it with the results of the original project.
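
Confidence scores describe how certain the model is, not whether it is right. If a hand-labeled sample were available, accuracy and per-class recall could be computed roughly as sketched here. The `true_label` column and the toy rows are hypothetical; this notebook has no ground truth.

```python
import pandas as pd

# Hypothetical hand-labeled sample; `true_label` does not exist in the notebook
labeled = pd.DataFrame({
    "sentiment_label": ["Positive", "Negative", "Neutral", "Positive", "Negative"],
    "true_label":      ["Positive", "Negative", "Positive", "Positive", "Neutral"],
})

# Accuracy: share of rows where the prediction matches the manual label
accuracy = (labeled["sentiment_label"] == labeled["true_label"]).mean()

# Confusion matrix: rows = true label, columns = predicted label
confusion = pd.crosstab(labeled["true_label"], labeled["sentiment_label"])

# Per-class recall: correct predictions divided by the true count of that class
correct = labeled[labeled["sentiment_label"] == labeled["true_label"]]
recall = (
    correct["true_label"].value_counts().reindex(confusion.index, fill_value=0)
    / confusion.sum(axis=1)
)

print(f"accuracy = {accuracy:.2f}")
print(confusion)
print(recall.round(2))
```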

In [None]:
print("📈 PERFORMANCE EVALUATION AND COMPARISON WITH THE ORIGINAL PROJECT")
print("=" * 70)

# 1. Performance metrics
print("\n1️⃣ PERFORMANCE METRICS:")

# Confidence score distribution
confidence_stats = analyzed_data['sentiment_score'].describe()
print("\n📊 Confidence score distribution:")
for stat, value in confidence_stats.items():
    print(f"   {stat}: {value:.4f}")

# High confidence predictions
high_conf_threshold = 0.8
high_conf_count = (analyzed_data['sentiment_score'] > high_conf_threshold).sum()
high_conf_percentage = (high_conf_count / len(analyzed_data)) * 100

print(f"\n🎯 Predictions with high confidence (>{high_conf_threshold}):")
print(f"   Count: {high_conf_count}/{len(analyzed_data)} ({high_conf_percentage:.1f}%)")

# Confidence by sentiment
print("\n📈 Average confidence by sentiment:")
conf_by_sentiment = analyzed_data.groupby('sentiment_label')['sentiment_score'].agg(['mean', 'std', 'count'])
print(conf_by_sentiment.round(4))

# 2. Comparison with the original project
print("\n\n2️⃣ COMPARISON WITH THE ORIGINAL PROJECT:")
print("\n" + "="*50)
print("| ASPECT | ORIGINAL PROJECT | CURRENT PROJECT |")
print("="*50)
print("| Architecture | Kafka+Spark+MongoDB | Python Pipeline |")
print("| Model | RoBERTa-Twitter | RoBERTa-Twitter |")
print("| Accuracy | ~82% | Not measured (no ground truth) |")
print("| Processing | Distributed | Single machine |")
print("| Real-time | Yes | Batch processing |")
print("| Complexity | High | Low |")
print("| Scalability | High | Medium |")
print("="*50)

# 3. Insights from the results
print("\n\n3️⃣ INSIGHTS AND ANALYSIS:")

# Keyword analysis
keyword_sentiment = analyzed_data.groupby('keyword')['sentiment_label'].value_counts(normalize=True).unstack(fill_value=0)
print("\n📊 Sentiment ratios by keyword:")
print(keyword_sentiment.round(3))

# Most positive/negative keywords
if 'Positive' in keyword_sentiment.columns:
    most_positive = keyword_sentiment['Positive'].idxmax()
    print(f"\n😊 Most positive keyword: {most_positive} ({keyword_sentiment.loc[most_positive, 'Positive']:.1%} positive)")

if 'Negative' in keyword_sentiment.columns:
    most_negative = keyword_sentiment['Negative'].idxmax()
    print(f"😞 Most negative keyword: {most_negative} ({keyword_sentiment.loc[most_negative, 'Negative']:.1%} negative)")

# 4. Summary report
print("\n\n4️⃣ SUMMARY REPORT:")
print("\n" + "="*60)
print("🎯 TWITTER SENTIMENT ANALYSIS - FINAL RESULTS")
print("="*60)

# Dataset info
print("\n📊 DATASET INFO:")
print(f"   Total tweets analyzed: {len(analyzed_data)}")
print(f"   Keywords: {', '.join(KEYWORDS)}")
print(f"   Date range: {analyzed_data['created_at'].min().date()} to {analyzed_data['created_at'].max().date()}")

# Model info
print("\n🤖 MODEL INFO:")
print(f"   Model: {MODEL_NAME}")
print(f"   Device: {'GPU' if analyzer.device == 0 else 'CPU'}")
print(f"   Batch size: {BATCH_SIZE}")

# Results summary
print("\n📈 RESULTS SUMMARY:")
overall_sentiment = analyzed_data['sentiment_label'].value_counts(normalize=True) * 100
for sentiment, percentage in overall_sentiment.items():
    print(f"   {sentiment}: {percentage:.1f}%")

print("\n⭐ QUALITY METRICS:")
print(f"   Average confidence: {analyzed_data['sentiment_score'].mean():.3f}")
print(f"   High confidence predictions: {high_conf_percentage:.1f}%")
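# NOTE: this is not a measured value. len / 60 equals tweets per minute only if the
# whole run took one hour; time the inference call to report real throughput.
print(f"   Nominal throughput (assumes a 60-minute run): ~{len(analyzed_data) / 60:.1f} tweets/minute")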

# 5. Recommendations for improvement
print("\n\n5️⃣ RECOMMENDATIONS FOR IMPROVEMENT:")
print("\n🔧 Technical improvements:")
print("   • Add data validation and error handling")
print("   • Implement caching for model predictions")
print("   • Use a GPU to speed up processing")
print("   • Add real-time streaming capabilities")

print("\n📊 Analysis improvements:")
print("   • Collect ground truth data to evaluate accuracy")
print("   • Add temporal analysis (trends over time)")
print("   • Analyze deeper signals (hashtags, mentions)")
print("   • A/B test against other models")

print("\n🚀 Scaling improvements:")
print("   • Containerize with Docker")
print("   • Deploy to the cloud (AWS/Azure)")
print("   • Implement proper logging and monitoring")
print("   • Build a web dashboard with Streamlit")

# 6. Export final results
final_results = {
    'total_tweets': len(analyzed_data),
    'sentiment_distribution': analyzed_data['sentiment_label'].value_counts().to_dict(),
    'sentiment_percentages': (analyzed_data['sentiment_label'].value_counts(normalize=True) * 100).round(2).to_dict(),
    'average_confidence': float(analyzed_data['sentiment_score'].mean()),
    'high_confidence_rate': float(high_conf_percentage),
    'keywords_analyzed': KEYWORDS,
    'model_used': MODEL_NAME,
    'processing_timestamp': datetime.now().isoformat()
}

# Save summary
import json
summary_file = f"analysis_summary_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
with open(summary_file, 'w', encoding='utf-8') as f:
    json.dump(final_results, f, indent=2, ensure_ascii=False)

print(f"\n💾 Final summary saved to: {summary_file}")

print("\n" + "="*70)
print("🎉 ANALYSIS COMPLETE - PROJECT SUCCESSFULLY REPRODUCED!")
print("="*70)

# Display success message
print("\n🏆 ACHIEVEMENTS:")
print("   ✅ Collected data from Twitter (or generated sample data)")
print("   ✅ Preprocessed text following the original project's pipeline")
print("   ✅ Applied the RoBERTa-Twitter model from Hugging Face")
print("   ✅ Created visualizations similar to Figures 6.4 and 6.5")
print("   ✅ Analyzed and compared results with the original project")
print("   ✅ Simplified the architecture while maintaining quality")

print("\n🎯 The project is ready to present and report!")
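
The throughput figure in the summary report is derived from the row count alone. A measured value needs a timer around the inference call. This is a minimal sketch, where `fake_analyze` is a hypothetical stand-in for the notebook's RoBERTa batch inference and `timed_throughput` is an illustrative helper, not part of the original code.

```python
import time


def fake_analyze(texts):
    """Hypothetical stand-in for the real model call; labels everything Neutral."""
    return [{"label": "Neutral", "score": 0.5} for _ in texts]


def timed_throughput(analyze, texts):
    """Run `analyze` on `texts` and return (results, tweets per minute)."""
    start = time.perf_counter()
    results = analyze(texts)
    elapsed = time.perf_counter() - start
    # Avoid division by zero when the call returns almost instantly
    per_minute = len(texts) / elapsed * 60 if elapsed > 0 else float("inf")
    return results, per_minute


results, tweets_per_minute = timed_throughput(fake_analyze, ["sample tweet"] * 100)
print(f"{len(results)} tweets, ~{tweets_per_minute:.0f} tweets/minute")
```

Passing the analysis function in as an argument keeps the timing code separate from the model, so the same wrapper could be used to compare CPU and GPU runs.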