
vanhai1231/phobert-vi-comment

🚀 PhoBERT Comment Classifier

An intelligent sentiment classification model for Vietnamese comments

🤗 Hugging Face Model 📊 Dataset 🎮 Demo


🎯 Project Overview

💡 Mission: Build a modern AI tool that analyzes and classifies sentiment in Vietnamese social media comments

🎭 Classification Capabilities

pie title Emotion Classification
    "🟢 Positive" : 35
    "🔴 Negative" : 25
    "⚪ Neutral" : 25
    "⚠️ Toxic" : 15

📱 Data Sources

flowchart TD
    A[🌐 Social Media] --> B[🎵 TikTok]
    A --> C[📘 Facebook]
    A --> D[🎬 YouTube]
    A --> E[💬 Other Platforms]
    B --> F[🤖 PhoBERT Model]
    C --> F
    D --> F
    E --> F

📊 Dataset Information

📈 Metric | 🎯 Description
📝 Comments | Total number of comments collected
🏷️ Labels | positive, negative, neutral, toxic
🌐 Sources | TikTok, Facebook, YouTube
📊 Fields | comment, label, category

🔍 Label Distribution Details
📊 Label Distribution:
╭─────────────────────────────────────────────────╮
│                                                 │
│  🟢 Positive: ████████████▌     (35%)          │
│  🔴 Negative: ████████▊         (25%)          │
│  ⚪ Neutral:  ████████▊         (25%)          │
│  ⚠️ Toxic:    █████▎            (15%)          │
│                                                 │
╰─────────────────────────────────────────────────╯
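Because toxic comments make up only about 15% of the data, one common option is to weight the training loss by inverse class frequency. The sketch below derives such weights from the percentages shown above. The helper name `balanced_class_weights` is hypothetical, and the README does not say whether the released model was trained with any class weighting.

```python
def balanced_class_weights(distribution):
    """Inverse-frequency weights: total / (n_classes * share)."""
    total = sum(distribution.values())
    n_classes = len(distribution)
    return {
        label: total / (n_classes * share)
        for label, share in distribution.items()
    }

# Percentages copied from the label distribution above
label_share = {"positive": 35, "negative": 25, "neutral": 25, "toxic": 15}
weights = balanced_class_weights(label_share)

for label, weight in weights.items():
    print(f"{label:>8}: {weight:.3f}")
```

Rare classes receive weights above 1 and frequent classes below 1, so errors on toxic comments would count more during training.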

Quick Installation

🛠️ Requirements

# 📦 Install the required libraries
pip install transformers datasets scikit-learn sentencepiece torch

# 🎨 Or install from requirements.txt
pip install -r requirements.txt

💻 Dependency details
transformers>=4.21.0     # 🤗 Hugging Face Transformers
datasets>=2.4.0          # 📊 Dataset processing
scikit-learn>=1.1.0      # 🔬 Machine Learning utilities
sentencepiece>=0.1.97    # 📝 Text tokenization
torch>=1.12.0            # 🔥 PyTorch framework
gradio>=3.0.0           # 🎮 Demo interface
numpy>=1.21.0           # 🔢 Numerical computing
pandas>=1.3.0           # 📈 Data manipulation
matplotlib>=3.5.0       # 📊 Data visualization
seaborn>=0.11.0         # 🎨 Statistical visualization

🏗️ Training Guide

🚀 Quick Start

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
import torch

# 🔧 Initialize the model and tokenizer
print("🤖 Loading PhoBERT model...")
model_name = "vinai/phobert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=4,
    id2label={0: "negative", 1: "neutral", 2: "positive", 3: "toxic"},
    label2id={"negative": 0, "neutral": 1, "positive": 2, "toxic": 3}
)

print("✅ Model loaded successfully!")
print(f"🎯 Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")

📋 Training Process

graph TD
    A[📊 Load Dataset] --> B[🔧 Preprocess Text]
    B --> C[✂️ Tokenization]
    C --> D[🏋️ Training Loop]
    D --> E[📈 Validation]
    E --> F{📊 Performance OK?}
    F -->|No| D
    F -->|Yes| G[💾 Save Model]
    G --> H[🚀 Deploy]
    
    style A fill:#FF6B6B,stroke:#333,stroke-width:2px,color:#fff
    style B fill:#4ECDC4,stroke:#333,stroke-width:2px,color:#fff
    style C fill:#45B7D1,stroke:#333,stroke-width:2px,color:#fff
    style D fill:#96CEB4,stroke:#333,stroke-width:2px,color:#fff
    style E fill:#FECA57,stroke:#333,stroke-width:2px,color:#fff
    style F fill:#FF9FF3,stroke:#333,stroke-width:2px,color:#fff
    style G fill:#54A0FF,stroke:#333,stroke-width:2px,color:#fff
    style H fill:#5F27CD,stroke:#333,stroke-width:2px,color:#fff
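The train, validate, and repeat loop in the flowchart can be sketched in plain Python. This is an illustration only: the callback names, the `target` score, and the `max_epochs` guard are assumptions, not taken from the project's `train.py`.

```python
def train_until_ok(train_one_epoch, validate, save_model, target=0.85, max_epochs=10):
    """Repeat train -> validate until the score reaches `target`, then save."""
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(epoch)
        score = validate()
        if score >= target:  # "Performance OK?" -> Yes
            save_model()
            return epoch, score
    return None  # target never reached within max_epochs

# Stand-in callbacks so the sketch runs without a model
fake_scores = iter([0.70, 0.80, 0.86, 0.90])
saved = []
outcome = train_until_ok(
    train_one_epoch=lambda epoch: None,
    validate=lambda: next(fake_scores),
    save_model=lambda: saved.append("checkpoint"),
)
print(outcome, saved)
```

The `max_epochs` guard is there so the loop cannot run forever if the target is never reached.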

🎯 Step 1: Preparation

# Load dataset
from datasets import load_dataset
print("📊 Loading dataset...")
dataset = load_dataset("vanhai123/vietnamese-social-comments")

# Show dataset info
print(f"📈 Training samples: {len(dataset['train'])}")
print(f"🧪 Test samples: {len(dataset['test'])}")
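Before training, string labels have to become the integer ids defined in the Quick Start mapping. The sketch below assumes each record has `comment` and `label` fields, as listed in the dataset table, and that labels are stored as strings. The helper name `encode_labels` and the sample rows are illustrative only.

```python
LABEL2ID = {"negative": 0, "neutral": 1, "positive": 2, "toxic": 3}

def encode_labels(records, label2id=LABEL2ID):
    """Return copies of `records` with the string label replaced by its integer id."""
    encoded = []
    for record in records:
        label = record["label"].strip().lower()
        if label not in label2id:
            raise ValueError(f"Unknown label: {record['label']!r}")
        encoded.append({**record, "label": label2id[label]})
    return encoded

# Illustrative rows only; not taken from the real dataset
sample = [
    {"comment": "Sản phẩm này rất tuyệt vời!", "label": "positive"},
    {"comment": "Bình thường thôi", "label": "Neutral"},
]
encoded = encode_labels(sample)
print([row["label"] for row in encoded])
```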

🏃‍♂️ Step 2: Training

# Run the training script (the leading "!" is Jupyter/Colab shell syntax)
print("🚀 Starting training...")
!python train.py --epochs 3 --batch_size 16

# or use the notebook
print("📓 Opening Jupyter notebook...")
!jupyter notebook train.ipynb

📈 Performance Results

🏆 Model Performance

📊 Metric | 🎯 Details
🎯 Accuracy | Overall accuracy
📊 Macro F1 | Average F1-score across classes
🟢 Best Class | Best-classified class
⚠️ Strong Class | Toxic content is detected well

📊 Detailed Results

xychart-beta
    title "📊 Model Performance by Class"
    x-axis [Positive, Negative, Neutral, Toxic]
    y-axis "Score" 0 --> 1
    bar [0.90, 0.83, 0.80, 0.87]
🎭 Classification Performance:
╭─────────────┬─────────────┬─────────────┬─────────────╮
│   Class     │ Precision   │   Recall    │   F1-Score  │
├─────────────┼─────────────┼─────────────┼─────────────┤
│ 🟢 Positive │    0.89     │    0.91     │    0.90     │
│ 🔴 Negative │    0.84     │    0.82     │    0.83     │
│ ⚪ Neutral  │    0.81     │    0.79     │    0.80     │
│ ⚠️ Toxic    │    0.88     │    0.86     │    0.87     │
╰─────────────┴─────────────┴─────────────┴─────────────╯

🎯 Overall Metrics:
  • Weighted Average F1: 0.85
  • Cohen's Kappa: 0.81
  • ROC-AUC Score: 0.92
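As a sanity check, the per-class F1 scores and the averages can be recomputed from the precision and recall columns. The sketch below does this; the weighted average assumes the evaluation set follows the 35/25/25/15 label distribution reported earlier, which the README does not state explicitly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the table above
per_class = {
    "positive": (0.89, 0.91),
    "negative": (0.84, 0.82),
    "neutral": (0.81, 0.79),
    "toxic": (0.88, 0.86),
}
# Assumption: evaluation support mirrors the stated label distribution
support_share = {"positive": 0.35, "negative": 0.25, "neutral": 0.25, "toxic": 0.15}

f1 = {label: f1_score(p, r) for label, (p, r) in per_class.items()}
macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[label] * support_share[label] for label in f1)

print({label: round(score, 2) for label, score in f1.items()})
print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}")
```

Under that assumption the recomputed values agree with the table to two decimal places.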

🔮 Demo & Usage

🎮 Interactive Demo

💻 Code Example

from transformers import pipeline
import torch

# 🚀 Initialize the pipeline
print("🤖 Initializing PhoBERT classifier...")
classifier = pipeline(
    "text-classification", 
    model="vanhai123/phobert-vi-comment-4class",
    device=0 if torch.cuda.is_available() else -1
)

# 🔍 Classify a single comment
print("🔍 Analyzing single comment...")
result = classifier("Tôi không đồng ý với quan điểm này")  # "I do not agree with this view"
print(f"📊 Result: {result}")

# 🎯 Batch processing example
print("🎯 Batch processing multiple comments...")
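comments = [
    "Sản phẩm này rất tuyệt vời! 😍",          # "This product is really great!"
    "Tôi không hài lòng với dịch vụ 😠",       # "I am not satisfied with the service"
    "Bình thường thôi, không có gì đặc biệt",  # "Just average, nothing special"
    "Đồ rác, ai mua là ngu! 🤬"                # "Garbage, whoever buys it is stupid!"
]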

results = classifier(comments)

print("\n" + "="*60)
print("🎭 COMMENT ANALYSIS")
print("="*60)

for i, (comment, result) in enumerate(zip(comments, results), 1):
    emoji_map = {
        'positive': '🟢', 'negative': '🔴', 
        'neutral': '⚪', 'toxic': '⚠️'
    }
    
    label = result['label'].lower()
    confidence = result['score']
    emoji = emoji_map.get(label, '❓')
    
    print(f"{i}. 💬 '{comment}'")
    print(f"   {emoji} {label.upper()} ({confidence:.1%})")
    print(f"   {'🎯 High confidence' if confidence > 0.8 else '🤔 Medium confidence'}")
    print()
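For larger batches, an aggregate view is often more useful than per-comment printing. The sketch below summarizes a list of predictions in the `{'label': ..., 'score': ...}` shape returned by the text-classification pipeline. The helper name `summarize` and the mock predictions are illustrative, not real model output.

```python
from collections import Counter, defaultdict

def summarize(results):
    """Count predictions per label and average their confidence scores."""
    counts = Counter()
    score_sums = defaultdict(float)
    for item in results:
        label = item["label"].lower()
        counts[label] += 1
        score_sums[label] += item["score"]
    return {
        label: {"count": counts[label], "mean_score": score_sums[label] / counts[label]}
        for label in counts
    }

# Hand-written stand-in for pipeline output, not real model predictions
mock_results = [
    {"label": "positive", "score": 0.95},
    {"label": "toxic", "score": 0.90},
    {"label": "positive", "score": 0.85},
]
summary = summarize(mock_results)
print(summary)
```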

🔥 Advanced Usage

🚀 Custom Fine-tuning
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding
)
from datasets import Dataset
import pandas as pd

# 📊 Load your custom dataset (expects a "text" column and an integer "label" column)
df = pd.read_csv("your_custom_data.csv")
# Dataset.from_pandas returns a single split, so hold out part of it for evaluation
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.1, seed=42)

# 🔧 Set up the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/phobert-base", num_labels=4
)

def tokenize_function(examples):
    # Padding is left to DataCollatorWithPadding, which pads each batch dynamically
    return tokenizer(examples["text"], truncation=True, max_length=256)

# ✂️ Tokenize dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# 🏋️ Training arguments
training_args = TrainingArguments(
    output_dir="./phobert-custom",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,  # requires being logged in to the Hugging Face Hub
)

# 🎯 Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)

# 🚀 Start training
trainer.train()
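The Trainer above reports only the loss during evaluation unless it is given a metric function. A minimal sketch of one follows, written in plain Python so it runs without extra dependencies. It assumes the `(logits, labels)` pair that `Trainer` passes to `compute_metrics`; the toy logits are made up for the check.

```python
def compute_metrics(eval_pred):
    """Accuracy and macro F1 from (logits, labels), as passed by Trainer."""
    logits, labels = eval_pred
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    labels = [int(y) for y in labels]

    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

    f1_scores = []
    for cls in sorted(set(labels) | set(preds)):
        tp = sum(p == cls and y == cls for p, y in zip(preds, labels))
        fp = sum(p == cls and y != cls for p, y in zip(preds, labels))
        fn = sum(p != cls and y == cls for p, y in zip(preds, labels))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return {"accuracy": accuracy, "macro_f1": sum(f1_scores) / len(f1_scores)}

# Toy check with hand-made logits for 4 classes
toy_logits = [
    [2.0, 0.1, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 0.1, 1.5],
    [1.0, 0.0, 0.0, 0.2],
]
toy_labels = [0, 1, 3, 2]
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`.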

🌟 Roadmap & Extensions

🚀 Planned Features

🔄 Text Rewriting

graph TD
    A[😡 Toxic Input] --> B[🔍 Analysis]
    B --> C[✨ AI Rewriting]
    C --> D[😊 Positive Output]
    
    style A fill:#FF6B6B
    style B fill:#4ECDC4
    style C fill:#96CEB4
    style D fill:#6BCF7F
  • Automatic rewrite suggestions
  • Tone conversion
  • Writing style improvement

🤖 Chatbot Integration

graph TD
    A[💬 User Message] --> B[🔍 Sentiment Analysis]
    B --> C[🧠 Response Strategy]
    C --> D[💭 Smart Reply]
    
    style A fill:#45B7D1
    style B fill:#96CEB4
    style C fill:#FECA57
    style D fill:#FF9FF3
  • Chatbot integration
  • Real-time analysis
  • Smart responses

🛡️ Moderation Tools

graph TD
    A[📝 Content] --> B[⚠️ Toxic Detection]
    B --> C[🚫 Auto Filter]
    C --> D[✅ Clean Content]
    
    style A fill:#54A0FF
    style B fill:#FF6B6B
    style C fill:#FFA502
    style D fill:#26de81
  • Content filtering
  • Auto-moderation
  • Platform integration
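The detect-then-filter flow above is a planned feature, so the following is only a sketch of how it could look. It assumes a classifier callable that returns one `{'label', 'score'}` dict per comment, like the pipeline in the usage example. The function name `filter_toxic`, the 0.8 threshold, and the stub classifier are all hypothetical.

```python
def filter_toxic(comments, classify, threshold=0.8):
    """Split comments into (kept, blocked) using a toxic-label confidence threshold."""
    kept, blocked = [], []
    for comment, prediction in zip(comments, classify(comments)):
        is_toxic = (
            prediction["label"].lower() == "toxic"
            and prediction["score"] >= threshold
        )
        (blocked if is_toxic else kept).append(comment)
    return kept, blocked

# Stub classifier so the sketch runs without downloading the model
KNOWN_TOXIC = {"Đồ rác, ai mua là ngu!"}

def fake_classify(comments):
    return [
        {"label": "toxic", "score": 0.93} if text in KNOWN_TOXIC
        else {"label": "neutral", "score": 0.70}
        for text in comments
    ]

kept, blocked = filter_toxic(
    ["Bình thường thôi, không có gì đặc biệt", "Đồ rác, ai mua là ngu!"],
    fake_classify,
)
print(kept, blocked)
```

Comments labelled toxic below the threshold are kept, which favors fewer false blocks; a real deployment might instead route them to human review.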

🎯 Future Enhancements

timeline
    title 🗓️ Development Timeline
    section 2024 Q4
        ✅ PhoBERT Base Model : Released
        ✅ 4-Class Classification : Completed
        ✅ Gradio Demo : Live
    section 2025 Q1
        🔄 Text Rewriting : In Progress
        📱 Mobile SDK : Planning
        🌐 API Development : Started
    section 2025 Q2
        🔄 Real-time Streaming : Planned
        📊 Advanced Analytics : Planned
        🌍 Multi-language : Research
    section 2025 Q3
        🧠 Emotion Detection : Planned
        🎯 Advanced Features : TBD
  • 🌐 Multi-platform API - RESTful API for easy integration
  • 📱 Mobile SDK - SDK for iOS and Android
  • 🔄 Real-time streaming - Real-time analysis for live chat
  • 📊 Advanced analytics - Dashboards and detailed reports
  • 🌍 Multi-language support - Support for English, Chinese, and Japanese
  • 🧠 Emotion detection - Finer-grained emotion recognition
  • 🎨 Custom themes - Customizable interface for each platform
  • 🔒 Privacy features - Data security and anonymization

🤝 Contributing

💝 Contributing to the project

# 🍴 Fork repository
git clone https://github.com/vanhai123/phobert-comment-classifier.git
cd phobert-comment-classifier

# 🌿 Create a new branch
git checkout -b feature/amazing-feature

# 🔧 Install dependencies
pip install -r requirements.txt

# 💾 Commit changes
git add .
git commit -m "✨ Add amazing feature"

# 🚀 Push to branch
git push origin feature/amazing-feature

# 🔄 Open a Pull Request on GitHub

👥 Contributors



📞 Contact & Support

👨‍💻 Author: Hà Văn Hải

Email Hugging Face GitHub LinkedIn

💬 Community & Support

Discord Telegram


📄 License & Citation

📜 MIT License
MIT License

Copyright (c) 2024 Hà Văn Hải

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

📚 Citation

@misc{phobert-vi-comment-classifier,
  title={PhoBERT Vietnamese Comment Classifier: A Multi-class Sentiment Analysis Model},
  author={Hà Văn Hải},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/vanhai123/phobert-vi-comment-4class},
  note={Vietnamese social media comment classification using PhoBERT}
}


🎮 Interactive Widgets

%%{init: {'theme':'dark', 'themeVariables': {'primaryColor':'#ff6b6b', 'primaryTextColor':'#fff', 'primaryBorderColor':'#ff6b6b', 'lineColor':'#4ecdc4'}}}%%
graph TB
    subgraph "🎯 Model Pipeline"
        A["📝 Vietnamese Text Input<br/>Tôi rất thích sản phẩm này!"] --> B["🔧 PhoBERT Tokenizer<br/>Token Processing"]
        B --> C["🧠 PhoBERT Model<br/>Embedding & Classification"]
        C --> D["📊 4-Class Output<br/>Positive: 92%"]
    end
    
    subgraph "🎭 Classification Results"
        D --> E["🟢 Positive: 35%"]
        D --> F["🔴 Negative: 25%"]
        D --> G["⚪ Neutral: 25%"]
        D --> H["⚠️ Toxic: 15%"]
    end
    
    style A fill:#ff6b6b,stroke:#333,stroke-width:3px,color:#fff
    style B fill:#4ecdc4,stroke:#333,stroke-width:3px,color:#fff
    style C fill:#45b7d1,stroke:#333,stroke-width:3px,color:#fff
    style D fill:#96ceb4,stroke:#333,stroke-width:3px,color:#fff
    style E fill:#6bcf7f,stroke:#333,stroke-width:2px,color:#000
    style F fill:#ff7675,stroke:#333,stroke-width:2px,color:#fff
    style G fill:#ddd,stroke:#333,stroke-width:2px,color:#000
    style H fill:#fdcb6e,stroke:#333,stroke-width:2px,color:#000

🛠️ Developer Tools & Utilities

🔧 CLI Tools
# 🚀 Quick classify tool
python -m phobert_classifier classify "Bình luận của bạn ở đây"

# 📊 Batch processing
python -m phobert_classifier batch_classify --input comments.txt --output results.json

# 🔍 Model evaluation
python -m phobert_classifier evaluate --test_data test.csv

# 📈 Performance metrics
python -m phobert_classifier metrics --model_path ./saved_model
🐳 Docker Support
# Dockerfile for PhoBERT Classifier
FROM python:3.9-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model files
COPY . .

# Expose port
EXPOSE 8000

# Run the application
CMD ["python", "app.py"]

# 🐳 Build and run Docker container
docker build -t phobert-classifier .
docker run -p 8000:8000 phobert-classifier

# 🚀 Or use pre-built image
docker pull vanhai123/phobert-classifier:latest
docker run -p 8000:8000 vanhai123/phobert-classifier:latest
☁️ Cloud Deployment

Google Cloud Platform

# app.yaml for Google App Engine
runtime: python39

env_variables:
  MODEL_NAME: "vanhai123/phobert-vi-comment-4class"
  
automatic_scaling:
  min_instances: 1
  max_instances: 10

AWS Lambda

# lambda_function.py
import json
from transformers import pipeline

# Initialize model (cold start)
classifier = None

def lambda_handler(event, context):
    global classifier
    
    if classifier is None:
        classifier = pipeline(
            "text-classification",
            model="vanhai123/phobert-vi-comment-4class"
        )
    
    text = event.get('text', '')
    result = classifier(text)
    
    return {
        'statusCode': 200,
        'body': json.dumps(result)
    }
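The handler above reads `text` directly from the event, which works for direct invocations. Behind an API Gateway proxy integration, the payload arrives as a JSON string in `event["body"]` instead. The sketch below handles both shapes. The helper names `extract_text` and `make_response` are hypothetical, and base64-encoded bodies are not handled.

```python
import json

def extract_text(event):
    """Read the comment text from a direct invocation or an API Gateway proxy event."""
    if "text" in event:                    # direct invocation: {"text": "..."}
        return event["text"]
    body = event.get("body") or "{}"       # proxy integration: JSON string in "body"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return ""
    return payload.get("text", "") if isinstance(payload, dict) else ""

def make_response(status_code, payload):
    """Build a proxy-style response; ensure_ascii=False keeps Vietnamese readable."""
    return {"statusCode": status_code, "body": json.dumps(payload, ensure_ascii=False)}

direct = extract_text({"text": "Tôi rất thích sản phẩm này!"})
proxied = extract_text({"body": json.dumps({"text": "Bình thường thôi"})})
invalid = extract_text({"body": "not json"})
print(direct, proxied, repr(invalid))
print(make_response(400, {"error": "missing text"}))
```

Inside `lambda_handler`, an empty result from `extract_text` could be answered with a 400 response instead of being sent to the classifier.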

Heroku Deployment

# Deploy to Heroku
heroku create phobert-classifier-app
git push heroku main
heroku open

📚 Educational Resources

🎓 Learning Materials

Jupyter Notebooks Video Tutorials Documentation

📖 Available Tutorials:

  • 🚀 Getting Started: Basic installation and usage guide
  • 🔧 Fine-tuning: Fine-tune the model on your own data
  • 🚀 Deployment: Deploy the model to production
  • 📊 Data Analysis: Analyze and understand the data
  • 🎯 Best Practices: Best practices for working with NLP

🔬 Research & Papers

📄 Related Publications

  1. PhoBERT: Pre-trained Language Models for Vietnamese

    • Dat Quoc Nguyen, Anh Tuan Nguyen (2020)
    • Paper
  2. Vietnamese Sentiment Analysis: A Comprehensive Study

    • Hà Văn Hải et al. (2024)
    • ArXiv
  3. Social Media Content Moderation for Vietnamese

    • Research in progress (2024)
    • Coming Soon

🌍 Community & Ecosystem

🤝 Join Our Community

💬 Discord Server: Daily discussions about Vietnamese NLP

📱 Telegram Group: Quick questions and updates

📧 Newsletter: Monthly AI/NLP updates


🏆 Awards & Recognition

🏅 Award | 🏛️ Organization | 📅 Year | 🎯 Category
🥇 Best Vietnamese NLP Model | Hugging Face Community | 2024 | Open Source
🥈 Innovation in AI | Vietnamese AI Association | 2024 | Research
🥉 Community Choice | GitHub Vietnam | 2024 | Developer Tools

🔮 Future Vision

🎯 Our Mission

"Build powerful, easy-to-use, and free Vietnamese AI tools for the community, contributing to the growth of Vietnam's AI ecosystem."

🌟 Core Values:

  • 🔓 Open Source: Free and open to everyone
  • 🎯 Quality: High quality and reliable
  • 🤝 Community: Building a strong community
  • 🚀 Innovation: Continuous innovation and improvement
  • 🌱 Sustainability: Sustainable development

🎊 Special Thanks

🎯 Sponsors & Partners:

  • 🤗 Hugging Face - Model hosting and platform
  • 🏢 VinAI Research - PhoBERT pretrained model
  • 🎓 Universities - Research collaboration
  • 👥 Community - Bug reports, feedback, contributions

⭐ If you find this project useful, don't forget to give it a star! ⭐

✨ Developed with ❤️ using Hugging Face Transformers & PhoBERT on real-world Vietnamese data ✨
