# 📊 Sentiment Analysis of Indonesian-Language Instagram Comments
*Random Forest Hyperparameter Optimization with Grid Search & Random Search*

![WordCloud](images/Wordcloud.png)  
*(Example visualization; replace it with your own word cloud)*

## 1. Initial Setup

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from scipy.stats import randint
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

print("✅ Libraries are ready!")

## 2. Load & Explore the Data

In [None]:
# Load the dataset (adjust the path as needed)
data = pd.read_csv('../data/instagram_comments.csv')
print(f"📊 Number of rows: {len(data)} comments")

# Preview the data
data.head()

In [None]:
# Visualize the label distribution
plt.figure(figsize=(6,4))
sns.countplot(x='sentiment', data=data)
plt.title('Sentiment Distribution (Positive vs Negative)')
plt.show()
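
The count plot shows class balance visually; the same check can be done numerically before choosing an evaluation metric. The sketch below is a minimal illustration, assuming the labels are `positive` and `negative` as used later in the notebook. The `labels` list is made-up sample data standing in for `data['sentiment']`, and `class_shares` is a hypothetical helper name.

```python
from collections import Counter

def class_shares(labels):
    """Return each label's share of the total, as a dict of floats."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical sample standing in for data['sentiment']
labels = ['positive', 'negative', 'positive', 'positive']
shares = class_shares(labels)
for label, share in sorted(shares.items()):
    print(f"{label}: {share:.0%}")
```

If one class clearly dominates, accuracy alone can be misleading, and passing `stratify=` to `train_test_split` is one way to keep the class ratio the same in both splits.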

## 3. Text Preprocessing

In [None]:
# Slang normalization
def normalize_slang(text):
    slang_dict = {
        'bgt': 'banget', 
        'yg': 'yang',
        'pdhl': 'padahal',
        'dgn': 'dengan'
    }
    words = text.split()
    normalized = [slang_dict.get(word, word) for word in words]
    return ' '.join(normalized)

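# Build these once; re-creating the stemmer on every call is slow
stemmer = StemmerFactory().create_stemmer()
stop_words = set(stopwords.words('indonesian'))

# Text cleaning
def clean_text(text):
    # Remove mentions/tags
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)
    # Remove punctuation and emoji (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    # Case folding
    text = text.lower()
    # Slang normalization
    text = normalize_slang(text)
    # Stopword removal (uses the NLTK list imported in the setup cell)
    text = ' '.join(word for word in text.split() if word not in stop_words)
    # Stemming
    text = stemmer.stem(text)
    return text

# Preprocessing example
sample_text = "Produknya keren bgt sih @user! 😍"
print("Before:", sample_text)
print("After:", clean_text(sample_text))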
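
The slang dictionary above is hard-coded inside `normalize_slang()`. One way to extend it without editing the function is to pass the dictionary in as an argument. This is only a sketch: `normalize_slang_with`, `BASE_SLANG`, and the extra entries are hypothetical names and values, not part of the notebook.

```python
# Base dictionary copied from the notebook
BASE_SLANG = {'bgt': 'banget', 'yg': 'yang', 'pdhl': 'padahal', 'dgn': 'dengan'}

def normalize_slang_with(text, slang_dict):
    """Replace each whitespace-separated token that appears in slang_dict."""
    return ' '.join(slang_dict.get(word, word) for word in text.split())

# Hypothetical extra entries, merged without touching the base dictionary
extra = {'gk': 'tidak', 'tp': 'tapi'}
slang = {**BASE_SLANG, **extra}

result = normalize_slang_with('bagus bgt tp mahal', slang)
print(result)
```

The lookup is case-sensitive and matches whole tokens only, so it should run after case folding and punctuation removal, which is the order `clean_text()` already uses.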

In [None]:
# Apply to the whole dataset
data['cleaned_text'] = data['text'].apply(clean_text)

# Word clouds: positive vs negative
def generate_wordcloud(texts, title):
    wordcloud = WordCloud(width=800, height=400, 
                         background_color='white').generate(' '.join(texts))
    plt.figure(figsize=(10,5))
    plt.imshow(wordcloud)
    plt.title(title, size=15)
    plt.axis('off')
    plt.show()

# Positive word cloud
generate_wordcloud(data[data['sentiment']=='positive']['cleaned_text'], 
                  'Keywords in Positive Sentiment')

# Negative word cloud
generate_wordcloud(data[data['sentiment']=='negative']['cleaned_text'], 
                  'Keywords in Negative Sentiment')

## 4. Feature Engineering (TF-IDF)

In [None]:
# Split into train and test sets first, so the vectorizer never sees test data
train_text, test_text, y_train, y_test = train_test_split(
    data['cleaned_text'], data['sentiment'], test_size=0.3, random_state=42
)

# Convert text to TF-IDF vectors: fit on the training set only, then transform both
tfidf = TfidfVectorizer(max_features=500, ngram_range=(1,2))
X_train = tfidf.fit_transform(train_text)
X_test = tfidf.transform(test_text)

print(f"🛈 Training set shape: {X_train.shape}")
print(f"🛈 Test set shape: {X_test.shape}")
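
To make the TF-IDF step less of a black box, here is a small pure-Python sketch of the weighting. It assumes raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1` that scikit-learn applies when `smooth_idf=True`, and L2 normalization. It covers unigrams only and ignores `max_features` and `ngram_range`, so it will not reproduce the notebook's matrix exactly. `tfidf_vectors` and the three sample documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute L2-normalized TF-IDF weights for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # An empty document has no terms, so it maps to an empty vector
        vectors.append({term: w / norm for term, w in weights.items()} if norm else {})
    return vectors

docs = ['produk keren banget', 'produk jelek', 'produk mahal banget']
vectors = tfidf_vectors(docs)
print(vectors[1])
```

In the second document, `jelek` appears in only one document while `produk` appears in all three, so `jelek` receives the larger weight. That is the property the classifier relies on: words that separate comments count for more than words every comment shares.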

## 5. Modeling & Hyperparameter Tuning

In [None]:
# Baseline Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print("📝 Baseline Performance:")
print(classification_report(y_test, y_pred))

In [None]:
# Grid Search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'max_features': ['sqrt', 'log2']
}

grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    verbose=1
)

print("🔍 Starting Grid Search...")
grid_search.fit(X_train, y_train)
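
Grid Search tries every combination in `param_grid`, so its cost grows with the product of the list lengths. The sketch below enumerates the candidates for the grid defined above using only the standard library; it mirrors the idea and is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'max_features': ['sqrt', 'log2'],
}
cv = 5

# Every combination of values, one dict per candidate
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(candidates) * cv

print(f"{len(candidates)} candidates x {cv} folds = {n_fits} fits")
```

With the default `refit=True`, one more fit on the full training set follows once the best candidate is known.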

In [None]:
# Random Search
param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(5, 30),
    'max_features': ['sqrt', 'log2']
}

random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    n_jobs=-1,
    random_state=42,
    verbose=1
)

print("🎲 Starting Random Search...")
random_search.fit(X_train, y_train)
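
Random Search draws a fixed number of candidates from `param_dist` instead of enumerating a grid. The sketch below imitates that sampling with the standard library, assuming the same ranges as above. `sample_params` is a hypothetical helper, and the values it draws will differ from the ones `RandomizedSearchCV` picks, because the random number generators are different.

```python
import random

def sample_params(rng):
    """Draw one candidate; upper bounds are exclusive, like scipy.stats.randint."""
    return {
        'n_estimators': rng.randrange(50, 300),
        'max_depth': rng.randrange(5, 30),
        'max_features': rng.choice(['sqrt', 'log2']),
    }

rng = random.Random(42)  # fixed seed so the draws are repeatable
candidates = [sample_params(rng) for _ in range(10)]
for params in candidates:
    print(params)
```

Note that this search space never proposes `max_depth=None`, which the grid does include, so the two searches do not cover exactly the same region.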

## 6. Model Evaluation

In [None]:
# Helper that plots test accuracy for each model
def plot_evaluation(models):
    results = []
    for name, model in models.items():
        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        results.append({'Model': name, 'Accuracy': acc})
    
    df = pd.DataFrame(results)
    
    plt.figure(figsize=(10,5))
    sns.barplot(x='Model', y='Accuracy', data=df, palette='Blues_d')
    plt.ylim(0.7, 1.0)
    plt.title('Model Accuracy Comparison')
    for i, v in enumerate(df['Accuracy']):
        plt.text(i, v+0.01, f"{v:.2%}", ha='center')
    plt.show()

# Compare the models
models = {
    'Baseline': rf,
    'Grid Search': grid_search.best_estimator_,
    'Random Search': random_search.best_estimator_
}
plot_evaluation(models)

In [None]:
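# Confusion matrix for the Random Search model
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)

plt.figure(figsize=(6,4))
# Fix the label order explicitly so rows and columns match the tick labels
sns.heatmap(confusion_matrix(y_test, y_pred, labels=['negative', 'positive']), 
            annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Random Search)')
plt.show()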
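
The four cells of the heatmap are enough to recompute the headline metrics by hand, which is a useful sanity check on `classification_report`. The sketch below treats `positive` as the positive class. The counts are made up for illustration, and `metrics_from_confusion` is a hypothetical helper.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Made-up counts, laid out like the heatmap (rows = actual, columns = predicted):
# [[TN, FP],
#  [FN, TP]]
scores = metrics_from_confusion(tn=40, fp=10, fn=5, tp=45)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

For a binary problem, `confusion_matrix(y_test, y_pred).ravel()` returns the counts in the order `tn, fp, fn, tp`, so they can be passed to the helper directly.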

## 7. Conclusion

In [None]:
# Render the final summary as Markdown
from IPython.display import Markdown, display

display(Markdown("""
## 🎯 Key Findings

1. **Random Search is more efficient**:
   - Reaches **{:.2f}%** accuracy (vs Grid Search {:.2f}%)
   - Cross-validates 10 candidates (50 fits) versus 18 for Grid Search (90 fits)

2. **Preprocessing is crucial**:
   - Slang normalization ("bgt" → "banget") can improve F1-score by merging spelling variants into one token
   - Stopword removal reduces noise

3. **Best model**:
   ```python
   RandomForestClassifier(
       max_depth={},
       max_features='{}',
       n_estimators={},
       random_state=42
   )
   ```
""".format(
    accuracy_score(y_test, random_search.predict(X_test))*100,
    accuracy_score(y_test, grid_search.predict(X_test))*100,
    random_search.best_params_['max_depth'],
    random_search.best_params_['max_features'],
    random_search.best_params_['n_estimators']
)))

## 💡 Usage Guide

1. **Save the notebook** as `sentiment_analysis.ipynb` in the `notebooks/` folder
2. **Adjust the dataset path** in the `pd.read_csv()` call
3. **Add slang words** to `normalize_slang()` as needed
4. **Tune the visuals**:
   - Change the color palette (`palette='Blues_d'`)
   - Adjust the figure size (`figsize=(width,height)`)

## 🎨 Example Visualizations

1. **WordCloud**  
   ![WordCloud](images/Wordcloud.png)  
   *(Use contrasting colors for positive/negative)*

2. **Confusion Matrix**  
   ![Confusion Matrix](images/Confusion_Grid.png)  
   *(Use annotations and a clear colormap)*

3. **Model Comparison**  
   ![Accuracy Comparison](images/Grafik_Evaluation.png)  
   *(Add percentage labels above the bars)*

## ⚡ Additional Features (Optional)

```python
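# Log execution time
import time
start_time = time.perf_counter()
# Model code goes here
print(f"Execution time: {time.perf_counter()-start_time:.2f} seconds")

# Save the best model (create the target folder if it does not exist)
import os
import joblib
os.makedirs('../models', exist_ok=True)
joblib.dump(best_model, '../models/best_rf_model.pkl')

# Interactive plot (using Plotly)
import plotly.express as px
# Rebuild the results table; the df inside plot_evaluation() is local to that function
df = pd.DataFrame({
    'Model': list(models),
    'Accuracy': [accuracy_score(y_test, m.predict(X_test)) for m in models.values()]
})
fig = px.bar(df, x='Model', y='Accuracy', title='Accuracy Comparison')
fig.show()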
```
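
To compare how long the two searches take, the timing snippet can be wrapped in a small context manager that stores each duration under a label. This is a sketch under assumptions: `timed` and `timings` are hypothetical names, and the `sum(...)` call is a stand-in for a real call such as `grid_search.fit(X_train, y_train)`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the wrapped block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
with timed('demo', timings):
    sum(range(100_000))  # stand-in for grid_search.fit(X_train, y_train)

print(f"demo: {timings['demo']:.4f} seconds")
```

Wrapping each `fit()` call in its own `with timed(...)` block leaves both durations in `timings`, which makes any speed comparison between the searches a measured number.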