# 📊 Public Sentiment Analysis of the "Indonesia Fails to Qualify for the World Cup" Issue

**Data Mining Project - Sentiment Analysis Using a Support Vector Machine (SVM)**

---

## 🎯 Research Objectives
- Analyze public perception of Indonesia's failure to qualify for the World Cup through YouTube comments
- Implement an SVM model for automatic sentiment classification
- Provide insight into sentiment patterns and dominant keywords

## 📝 Methodology
- **Data Source**: YouTube comments on videos about the Indonesian national team (timnas)
- **Preprocessing**: Text cleaning, Indonesian-language normalization, TF-IDF vectorization
- **Model**: Support Vector Machine (SVM) with an RBF kernel
- **Evaluation**: Accuracy, Precision, Recall, F1-Score, Confusion Matrix
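
The preprocessing and vectorization steps in the methodology can be sketched roughly as follows. This is a minimal illustration, not the project's `IndonesianTextPreprocessor`: the `clean_comment` helper, the tiny `SLANG_MAP` dictionary, and the sample comments are hypothetical. The `max_features=5000` and `ngram_range=(1, 2)` settings are the values this notebook states in its final summary.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical slang dictionary; a real project would need a much larger list.
SLANG_MAP = {"gk": "tidak", "ga": "tidak", "yg": "yang", "bgt": "banget"}


def clean_comment(text: str) -> str:
    """Lowercase, strip URLs, mentions and non-letters, then normalize slang."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)                # remove mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)               # keep ASCII letters only
    tokens = [SLANG_MAP.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


# Made-up sample comments, for illustration only
comments = [
    "Timnas gk lolos lagi, kecewa bgt!!! https://example.com",
    "Tetap semangat Garuda, kalian hebat @timnas",
]
cleaned = [clean_comment(c) for c in comments]

# Unigrams and bigrams, capped at 5000 features
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(cleaned)

print(cleaned)
print(X.shape)
```

Keeping the cleaning logic in one function means the same transformation can be applied to both training comments and new comments at prediction time.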

---

## 📚 Import Libraries and Setup

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("✅ Libraries imported successfully!")

## 🔍 1. Data Collection from YouTube

In [None]:
# Import custom modules
import sys
import os
sys.path.append('../src')
sys.path.append('..')  # Add parent directory for config

from data_collector import YouTubeDataCollector
from preprocessor import IndonesianTextPreprocessor
from model import SentimentSVMModel
from config import Config

print("✅ Custom modules imported successfully!")

## 🔍 2. Data Preprocessing

In [None]:
# Initialize YouTube data collector
collector = YouTubeDataCollector()

print("🎯 Target Keywords:")
for i, keyword in enumerate(Config.SEARCH_KEYWORDS, 1):
    print(f"{i}. {keyword}")

print("\n📊 Collection Settings:")
print(f"- Videos per query: {Config.MAX_VIDEOS_PER_QUERY}")
print(f"- Comments per video: {Config.MAX_COMMENTS_PER_VIDEO}")
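
The `Config` object used in these cells comes from the project's own `config` module, which is not shown in this notebook. The following is only a hypothetical stand-in to make the referenced attributes concrete. The search keywords and collection limits are invented for illustration. The SVM and TF-IDF values follow what this notebook reports in its final summary (RBF kernel, C=1.0, gamma=scale, 5000 features, n-grams (1,2)).

```python
class Config:
    """Hypothetical stand-in for the project's config module."""

    # Data collection settings (illustrative assumptions, not the real values)
    SEARCH_KEYWORDS = [
        "timnas indonesia gagal lolos piala dunia",
        "indonesia piala dunia",
    ]
    MAX_VIDEOS_PER_QUERY = 10
    MAX_COMMENTS_PER_VIDEO = 100

    # Model settings, as stated in the notebook's final report
    SVM_KERNEL = "rbf"
    SVM_C = 1.0
    SVM_GAMMA = "scale"

    # TF-IDF settings, as stated in the notebook's final report
    MAX_FEATURES = 5000
    NGRAM_RANGE = (1, 2)


for i, keyword in enumerate(Config.SEARCH_KEYWORDS, 1):
    print(f"{i}. {keyword}")
print(f"Kernel: {Config.SVM_KERNEL}, C: {Config.SVM_C}, gamma: {Config.SVM_GAMMA}")
```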

## 🤖 4. Model Training with SVM

In [None]:
# Initialize SVM model
svm_model = SentimentSVMModel(
    kernel=Config.SVM_KERNEL,
    C=Config.SVM_C,
    gamma=Config.SVM_GAMMA
)

print("🤖 SVM Model Configuration:")
print(f"- Kernel: {Config.SVM_KERNEL}")
print(f"- C (Regularization): {Config.SVM_C}")
print(f"- Gamma: {Config.SVM_GAMMA}")
print(f"- Max Features: {Config.MAX_FEATURES}")
print(f"- N-gram Range: {Config.NGRAM_RANGE}")
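
The internals of `SentimentSVMModel` are not shown here, so the following is only a sketch of how a TF-IDF plus RBF-kernel SVM classifier with 5-fold cross-validation could be assembled directly in scikit-learn. The toy comments and labels are made up for illustration, and the pipeline layout is an assumption rather than the project's actual implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Made-up toy comments: five per class so stratified 5-fold CV is possible
texts = [
    "timnas gagal lagi sangat kecewa",
    "permainan buruk pelatih harus mundur",
    "malu melihat timnas kalah terus",
    "pssi payah tidak ada perubahan",
    "kecewa berat dengan hasil pertandingan",
    "pertandingan berikutnya kapan ya",
    "siapa pelatih timnas sekarang",
    "jadwal kualifikasi sudah keluar",
    "nonton di channel mana",
    "skor akhir berapa tadi",
    "tetap semangat garuda kalian hebat",
    "bangga dengan perjuangan timnas",
    "terima kasih sudah berjuang keras",
    "pemain muda bermain bagus sekali",
    "optimis timnas lolos tahun depan",
]
labels = ["negatif"] * 5 + ["netral"] * 5 + ["positif"] * 5

# Encode string labels as integers (classes are sorted alphabetically)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

# Vectorizer and classifier in one pipeline, so TF-IDF is refit on each CV fold
pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2)),
    SVC(kernel="rbf", C=1.0, gamma="scale"),
)

cv_scores = cross_val_score(pipeline, texts, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.1%} (±{cv_scores.std():.1%})")

# Fit on all data and classify a new comment
pipeline.fit(texts, y)
pred = pipeline.predict(["kecewa timnas gagal lagi"])
print(label_encoder.inverse_transform(pred))
```

Putting the vectorizer inside the pipeline keeps vocabulary and IDF statistics from the held-out fold out of training, which would otherwise inflate the cross-validation score.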

## 📈 5. Model Evaluation

In [None]:
# Display detailed classification report
print("📋 Detailed Classification Report:")
print("=" * 60)

class_report = training_results['classification_report']
classes = svm_model.label_encoder.classes_

# Create classification report DataFrame
report_df = pd.DataFrame({
    'Class': classes,
    'Precision': [class_report[cls]['precision'] for cls in classes],
    'Recall': [class_report[cls]['recall'] for cls in classes],
    'F1-Score': [class_report[cls]['f1-score'] for cls in classes],
    'Support': [class_report[cls]['support'] for cls in classes]
})

display(report_df)
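
The cell above indexes `training_results['classification_report']` as a nested dictionary keyed by class name. One plausible way to obtain such a dictionary is scikit-learn's `classification_report` with `output_dict=True`. Whether `SentimentSVMModel` does exactly this is an assumption. The labels below are made up to show the structure.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Made-up true and predicted labels, for illustration only
y_true = ["negatif", "negatif", "netral", "positif", "positif", "positif"]
y_pred = ["negatif", "netral", "netral", "positif", "positif", "negatif"]

# output_dict=True returns {class_name: {"precision", "recall", "f1-score", "support"}}
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

classes = ["negatif", "netral", "positif"]
report_df = pd.DataFrame(
    {
        "Class": classes,
        "Precision": [report[c]["precision"] for c in classes],
        "Recall": [report[c]["recall"] for c in classes],
        "F1-Score": [report[c]["f1-score"] for c in classes],
        "Support": [report[c]["support"] for c in classes],
    }
)
print(report_df.round(2))
```

The returned dictionary also carries aggregate entries such as `accuracy`, `macro avg`, and `weighted avg`, which is why the per-class rows are selected by name.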

## 💾 9. Save Model and Export Results

In [None]:
# Save trained model and export results
import joblib
import os

# Create models directory if it doesn't exist
os.makedirs('../models', exist_ok=True)

# Save the trained SVM model
model_filename = '../models/sentiment_svm_model_real_data.pkl'
joblib.dump(svm_classifier, model_filename)
print(f"✅ SVM Model saved to: {model_filename}")

# Save the vectorizer
vectorizer_filename = '../models/tfidf_vectorizer_real_data.pkl'
joblib.dump(vectorizer, vectorizer_filename)
print(f"✅ TF-IDF Vectorizer saved to: {vectorizer_filename}")

# Save the label encoder
encoder_filename = '../models/label_encoder_real_data.pkl'
joblib.dump(label_encoder, encoder_filename)
print(f"✅ Label Encoder saved to: {encoder_filename}")

# Save processed dataset
processed_filename = '../data/processed/final_processed_dataset_real.csv'
os.makedirs('../data/processed', exist_ok=True)
processed_df.to_csv(processed_filename, index=False)
print(f"✅ Processed dataset saved to: {processed_filename}")

# Save predictions on sample data
sample_predictions = pd.DataFrame({
    'text': test_comments,
    'predicted_sentiment': [label_encoder.classes_[pred] for pred in predictions],
    'confidence_negatif': probabilities[:, 0],
    'confidence_netral': probabilities[:, 1],
    'confidence_positif': probabilities[:, 2]
})

predictions_filename = '../data/processed/sample_predictions_real.csv'
sample_predictions.to_csv(predictions_filename, index=False)
print(f"✅ Sample predictions saved to: {predictions_filename}")

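# Save model performance metrics
# Compute the sentiment distribution (in percent) here so that this cell
# does not rely on a variable defined in a later cell
sentiment_dist = processed_df['sentiment_auto'].value_counts(normalize=True) * 100

performance_metrics = {
    'accuracy': accuracy,
    'precision': precision,
    'recall': recall,
    'f1_score': f1,
    'cv_mean': cv_scores.mean(),
    'cv_std': cv_scores.std(),
    'cv_scores': cv_scores.tolist(),
    'dataset_size': len(processed_df),
    'n_features': X.shape[1],
    'sentiment_distribution': sentiment_dist.to_dict()
}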

metrics_filename = '../models/model_performance_metrics.json'
import json
with open(metrics_filename, 'w') as f:
    json.dump(performance_metrics, f, indent=2)
print(f"✅ Performance metrics saved to: {metrics_filename}")

print("\n📁 ALL FILES SAVED SUCCESSFULLY!")
print("📊 Model artifacts ready for deployment and future use.")
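
To reuse the saved artifacts later, the classifier, vectorizer, and label encoder need to be loaded together and applied in the same order as during training. The sketch below demonstrates the round trip on a made-up toy model written to a temporary directory. The `toy_*` names and sample comments are hypothetical and separate from the notebook's own variables and file paths.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Made-up toy training data, for illustration only
toy_texts = [
    "timnas gagal lagi sangat kecewa",
    "permainan buruk pelatih harus mundur",
    "jadwal kualifikasi sudah keluar",
    "siapa pelatih timnas sekarang",
    "tetap semangat garuda kalian hebat",
    "bangga dengan perjuangan timnas",
]
toy_labels = ["negatif", "negatif", "netral", "netral", "positif", "positif"]

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_encoder = LabelEncoder()
X_toy = toy_vectorizer.fit_transform(toy_texts)
y_toy = toy_encoder.fit_transform(toy_labels)
toy_model = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_toy, y_toy)

new_comments = ["timnas gagal lagi kecewa"]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Persist all three artifacts; the model alone is not enough to predict
    artifacts = {"model": toy_model, "vectorizer": toy_vectorizer, "encoder": toy_encoder}
    for name, obj in artifacts.items():
        joblib.dump(obj, os.path.join(tmp_dir, f"{name}.pkl"))

    # Load them back, as a separate script or service would
    loaded_model = joblib.load(os.path.join(tmp_dir, "model.pkl"))
    loaded_vectorizer = joblib.load(os.path.join(tmp_dir, "vectorizer.pkl"))
    loaded_encoder = joblib.load(os.path.join(tmp_dir, "encoder.pkl"))

# Same order as training: vectorize, predict, decode labels
original = toy_model.predict(toy_vectorizer.transform(new_comments))
restored = loaded_model.predict(loaded_vectorizer.transform(new_comments))
print(loaded_encoder.inverse_transform(restored))
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, so recording the version alongside the artifacts is worth considering.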

## 📝 10. Conclusions and Recommendations

In [None]:
# Generate final summary report for real data analysis
print("📋 FINAL SENTIMENT ANALYSIS REPORT")
print("=" * 80)

# Dataset summary
print("\n📊 REAL DATASET SUMMARY:")
print(f"- Total comments analyzed: {len(processed_df):,}")
print(f"- Data period: {processed_df['published_date'].min().date()} to {processed_df['published_date'].max().date()}")
print(f"- Total source videos: {processed_df['video_id'].nunique()}")
print(f"- Total channels: {processed_df['channel_title'].nunique()}")

# Sentiment distribution
sentiment_dist = processed_df['sentiment_auto'].value_counts(normalize=True) * 100
print("\n🎯 SENTIMENT DISTRIBUTION (REAL DATA):")
for sentiment, percentage in sentiment_dist.items():
    print(f"- {sentiment.upper()}: {percentage:.1f}%")

# Model performance with real variables
print("\n🤖 SVM MODEL PERFORMANCE ON REAL DATA:")
print(f"- Accuracy: {accuracy:.1%}")
print(f"- Precision: {precision:.1%}")
print(f"- Recall: {recall:.1%}")
print(f"- F1-Score: {f1:.1%}")
print(f"- Cross-validation: {cv_scores.mean():.1%} (±{cv_scores.std():.1%})")

# Key insights
print("\n💡 KEY INSIGHTS FROM REAL DATA:")
dominant_sentiment = sentiment_dist.idxmax()
print(f"- Dominant sentiment: {dominant_sentiment.upper()} ({sentiment_dist[dominant_sentiment]:.1f}%)")

# Compare sentiments by the mean number of likes per comment
avg_likes_by_sentiment = processed_df.groupby('sentiment_auto')['like_count'].mean()
most_liked_sentiment = avg_likes_by_sentiment.idxmax()
print(f"- Sentiment with the highest average likes: {most_liked_sentiment.upper()} ({avg_likes_by_sentiment[most_liked_sentiment]:.1f} likes per comment)")

# Channel insights
top_channel = processed_df['channel_title'].value_counts().index[0]
top_channel_comments = processed_df['channel_title'].value_counts().iloc[0]
print(f"- Channel with the most comments: {top_channel} ({top_channel_comments} comments)")

# Text analysis insights
avg_text_length = processed_df.groupby('sentiment_auto')['text_length'].mean()
longest_sentiment = avg_text_length.idxmax()
print(f"- Sentiment with the longest comments on average: {longest_sentiment.upper()} ({avg_text_length[longest_sentiment]:.0f} characters)")

print("\n📈 TEMPORAL ANALYSIS:")
daily_comments = processed_df.groupby('date').size()
peak_date = daily_comments.idxmax()
peak_count = daily_comments.max()
print(f"- Date with the highest activity: {peak_date} ({peak_count} comments)")

print("\n🎯 RECOMMENDATIONS BASED ON THE REAL-DATA ANALYSIS:")
# Use .get() so a class missing from the data reports 0.0% instead of raising KeyError
print(f"1. PSSI should improve its public communication to address negative sentiment ({sentiment_dist.get('negatif', 0.0):.1f}%)")
print(f"2. Leverage positive sentiment ({sentiment_dist.get('positif', 0.0):.1f}%) to build public support")
print(f"3. Focus on engagement on popular channels such as {top_channel}")
print("4. Watch the temporal pattern of comments to time communications well")
print(f"5. The SVM model, with {accuracy:.1%} accuracy, can be used for real-time sentiment monitoring")

print("\n🔬 METHODOLOGY AND VALIDATION:")
print(f"- Dataset: {len(processed_df):,} real YouTube comments from {processed_df['video_id'].nunique()} videos")
print("- Preprocessing: text cleaning, Indonesian-language normalization, TF-IDF vectorization")
print("- Model: SVM with an RBF kernel (C=1.0, gamma=scale)")
print(f"- Validation: 5-fold cross-validation with a mean accuracy of {cv_scores.mean():.1%}")
print("- Feature engineering: TF-IDF with 5000 features, n-grams (1,2)")

print("\n" + "=" * 80)
print("✅ REAL-DATA SENTIMENT ANALYSIS COMPLETE!")
print("📊 The model has been validated on real data and is ready for deployment.")
print("🎯 The insights obtained can inform PSSI's communication strategy.")