## Import Library

In this section, I import the libraries needed for scraping, data processing, and training the deep learning model.

In [11]:
# Import the required libraries
import re  # regular expressions, used when cleaning text
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords  # English stopword list
from google_play_scraper import Sort, app, reviews  # Sort is needed for Sort.NEWEST
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import matplotlib.pyplot as plt

## Scraping Data

In this section, I scrape user reviews of the Duolingo app with google-play-scraper.

In [12]:
# Scrape 4000 reviews of the Duolingo app
result, _ = reviews(
    'com.duolingo',  # Duolingo app ID on the Play Store
    lang='en',        # English
    country='us',     # United States
    sort=Sort.NEWEST, # Sort by newest first
    count=4000        # Fetch 4000 reviews
)

# Convert to a DataFrame
df = pd.DataFrame(result)

# Keep only the relevant columns
df = df[['userName', 'score', 'content']]

# Save as CSV
df.to_csv('duolingo_reviews.csv', index=False)

print("Scraping finished! Data saved to duolingo_reviews.csv")

Scraping finished! Data saved to duolingo_reviews.csv
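
The call above discards the second return value of `reviews`. As far as I know, that value is a continuation token that can be passed back through the `continuation_token` argument to request the next batch. The sketch below shows that paging pattern only in outline: `collect_reviews` and `fake_fetch_page` are hypothetical names of mine, and the fake fetcher stands in for the real network call so the example runs offline.

```python
def collect_reviews(fetch_page, target):
    """Request pages until `target` reviews are collected or no pages remain.

    `fetch_page(token)` must return a (batch, next_token) tuple.
    """
    collected, token = [], None
    while len(collected) < target:
        batch, token = fetch_page(token)
        if not batch:
            break  # nothing more to read
        collected.extend(batch)
        if token is None:
            break  # the source signalled the last page
    return collected[:target]


def fake_fetch_page(token):
    """Hypothetical offline fetcher: serves 5 demo reviews, 2 per page."""
    start = int(token or 0)
    if start >= 5:
        return [], None
    stop = min(start + 2, 5)
    batch = [{"reviewId": f"r{i}", "score": 5, "content": "demo"}
             for i in range(start, stop)]
    return batch, str(stop)


print(len(collect_reviews(fake_fetch_page, target=4)))   # 4
print(len(collect_reviews(fake_fetch_page, target=10)))  # 5, the source ran out
```

With the real library, `fetch_page` would presumably wrap a call such as `reviews('com.duolingo', count=..., continuation_token=token)`; I have not run that variant here.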


## Preprocessing Data

Here, I clean the text (lowercase it, replace non-word characters such as punctuation with spaces, and remove English stopwords) and label each review's sentiment from its rating. Tokenization happens later, in the training step.

In [1]:
# Load the dataset
df = pd.read_csv('duolingo_reviews.csv')

# Drop rows with missing values
df.dropna(inplace=True)

# Label sentiment from the star rating
def label_sentiment(score):
    if score >= 4:
        return "Positive"
    elif score == 3:
        return "Neutral"
    else:
        return "Negative"

df['sentiment'] = df['score'].apply(label_sentiment)

# Clean the text
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\W', ' ', text)  # Replace non-word characters with spaces
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

df['cleaned_text'] = df['content'].apply(clean_text)

# Save the preprocessed data
df.to_csv('duolingo_reviews_cleaned.csv', index=False)

print("Preprocessing finished! Data saved to duolingo_reviews_cleaned.csv")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

Preprocessing finished! Data saved to duolingo_reviews_cleaned.csv
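
To make the cleaning rules concrete, here is a small standalone sketch that applies the same two functions to a few made-up strings. It assumes a tiny hand-picked stopword set (`DEMO_STOP_WORDS`, my own placeholder) instead of the NLTK list so it runs without a download. Note that `\W` leaves digits and underscores in place, so numbers survive this cleaning step.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stopword list
DEMO_STOP_WORDS = {"i", "it", "the", "is", "a", "this", "but"}


def label_sentiment(score):
    """Map a 1-5 star rating to a sentiment label (same rule as above)."""
    if score >= 4:
        return "Positive"
    elif score == 3:
        return "Neutral"
    return "Negative"


def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, replace non-word characters with spaces, drop stopwords."""
    text = text.lower()
    text = re.sub(r"\W", " ", text)
    return " ".join(word for word in text.split() if word not in stop_words)


print(clean_text("I LOVE it!!! 10/10"))             # love 10 10
print(clean_text("The app_name is great, but..."))  # app_name great
print([label_sentiment(s) for s in (1, 3, 5)])      # ['Negative', 'Neutral', 'Positive']
```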


## Model Training (LSTM)

In this section, I train an LSTM model for sentiment analysis.

In [3]:
# Load the data
df = pd.read_csv('duolingo_reviews_cleaned.csv')

# Make sure every entry is a string and drop rows that are empty after cleaning
df['cleaned_text'] = df['cleaned_text'].astype(str)
df = df[df['cleaned_text'].str.strip() != '']

# Encode sentiment labels as integers
label_mapping = {"Positive": 2, "Neutral": 1, "Negative": 0}
df['sentiment_encoded'] = df['sentiment'].map(label_mapping)

# Tokenize the text (keep the 5000 most frequent words)
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df['cleaned_text'])
sequences = tokenizer.texts_to_sequences(df['cleaned_text'])

# Pad sequences to a fixed length of 100 tokens
X = pad_sequences(sequences, maxlen=100)
y = df['sentiment_encoded'].values

# Split the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# LSTM model
model = Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=100),
    LSTM(64, return_sequences=True),
    LSTM(64),
    Dense(64, activation='relu'),
    Dense(3, activation='softmax')  # 3 output classes (Negative, Neutral, Positive)
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=5, batch_size=32, validation_data=(X_test, y_test))

# Save the model
model.save('sentiment_model.h5')
print("Model trained and saved as sentiment_model.h5")

Epoch 1/5
100/100 ━━━━━━━━━━━━━━━━━━━━ 17s 118ms/step - accuracy: 0.8281 - loss: 0.5538 - val_accuracy: 0.8737 - val_loss: 0.4245
Epoch 2/5
100/100 ━━━━━━━━━━━━━━━━━━━━ 20s 110ms/step - accuracy: 0.8838 - loss: 0.3660 - val_accuracy: 0.8788 - val_loss: 0.3913
Epoch 3/5
100/100 ━━━━━━━━━━━━━━━━━━━━ 21s 115ms/step - accuracy: 0.9220 - loss: 0.2385 - val_accuracy: 0.8687 - val_loss: 0.4269
Epoch 4/5
100/100 ━━━━━━━━━━━━━━━━━━━━ 20s 115ms/step - accuracy: 0.9364 - loss: 0.1869 - val_accuracy: 0.8775 - val_loss: 0.4934
Epoch 5/5
100/100 ━━━━━━━━━━━━━━━━━━━━ 20s 108ms/step - accuracy: 0.9429 - loss: 0.1740 - val_accuracy: 0.8637 - val_loss: 0.5313

Model trained and saved as sentiment_model.h5
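
In the log above, training accuracy keeps rising while validation loss bottoms out at epoch 2 and then climbs, which usually points to overfitting. The sketch below uses only the five `val_loss` values from that log to show which epoch was best and where a patience-based early-stopping rule would have halted. `best_epoch` and `stop_epoch` are hypothetical helpers of mine and mimic the rule only roughly; in Keras the usual tool for this is the `EarlyStopping` callback.

```python
# Validation losses copied from the training log (epochs 1 to 5)
val_loss = [0.4245, 0.3913, 0.4269, 0.4934, 0.5313]


def best_epoch(losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1


def stop_epoch(losses, patience):
    """Return the 1-based epoch where training would stop after `patience`
    consecutive epochs without a new best loss (or the last epoch)."""
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(losses)


print(best_epoch(val_loss))              # 2
print(stop_epoch(val_loss, patience=2))  # 4
```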


## Evaluation & Inference

After training, I evaluate the model on the test data and report its sentiment classification results.

In [5]:
# Predict on the test set
y_pred = model.predict(X_test)
y_pred_classes = y_pred.argmax(axis=1)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred_classes)
print(f"Model accuracy: {accuracy * 100:.2f}%")

# Classification report
print(classification_report(y_test, y_pred_classes, target_names=["Negative", "Neutral", "Positive"]))

25/25 ━━━━━━━━━━━━━━━━━━━━ 2s 83ms/step
Model accuracy: 86.38%
              precision    recall  f1-score   support

    Negative       0.41      0.38      0.39        58
     Neutral       0.19      0.09      0.12        44
    Positive       0.92      0.95      0.93       698

    accuracy                           0.86       800
   macro avg       0.51      0.47      0.48       800
weighted avg       0.84      0.86      0.85       800
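
The gap between the macro and weighted averages in this report comes from class imbalance: 698 of the 800 test reviews are Positive. The sketch below recomputes both precision averages from the per-class figures printed above, plus the accuracy of a trivial baseline that always predicts Positive. All numbers are copied from the report; nothing here is rerun on the model.

```python
# Per-class precision and support copied from the classification report
precision = {"Negative": 0.41, "Neutral": 0.19, "Positive": 0.92}
support = {"Negative": 58, "Neutral": 44, "Positive": 698}

total = sum(support.values())

# Macro average: every class counts equally
macro_avg = sum(precision.values()) / len(precision)

# Weighted average: every class counts in proportion to its support
weighted_avg = sum(precision[c] * support[c] for c in precision) / total

# Accuracy of a baseline that labels every review as the largest class
majority_baseline = max(support.values()) / total

print(f"macro avg precision:      {macro_avg:.2f}")          # 0.51
print(f"weighted avg precision:   {weighted_avg:.2f}")       # 0.84
print(f"always-Positive accuracy: {majority_baseline:.4f}")  # 0.8725
```

Since the always-Positive baseline scores 0.8725, the reported 0.8638 accuracy is better read alongside the macro averages and the per-class recall for Negative and Neutral.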



## Model Testing

Here I run the model on the test set once more and add a confusion matrix to see where the errors fall.

In [6]:
# Predict on the test set
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)  # Pick the class with the highest probability

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred_classes)
print(f'Model accuracy: {accuracy:.4f}')

# Confusion matrix (rows: true class, columns: predicted class)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_classes))

# Classification report (precision, recall, F1-score)
print("Classification Report:")
print(classification_report(y_test, y_pred_classes, target_names=["Negative", "Neutral", "Positive"]))

25/25 ━━━━━━━━━━━━━━━━━━━━ 1s 29ms/step
Model accuracy: 0.8638
Confusion Matrix:
[[ 22   6  30]
 [ 10   4  30]
 [ 22  11 665]]
Classification Report:
              precision    recall  f1-score   support

    Negative       0.41      0.38      0.39        58
     Neutral       0.19      0.09      0.12        44
    Positive       0.92      0.95      0.93       698

    accuracy                           0.86       800
   macro avg       0.51      0.47      0.48       800
weighted avg       0.84      0.86      0.85       800
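
The per-class numbers in the report can be derived directly from the confusion matrix, where rows are true classes and columns are predicted classes. The sketch below does that arithmetic by hand on the matrix printed above; `per_class_metrics` is a hypothetical helper name of mine.

```python
# Confusion matrix copied from the output (rows: true, columns: predicted)
labels = ["Negative", "Neutral", "Positive"]
cm = [
    [22, 6, 30],
    [10, 4, 30],
    [22, 11, 665],
]


def per_class_metrics(matrix):
    """Return a list of (precision, recall) tuples, one per class."""
    n = len(matrix)
    metrics = []
    for k in range(n):
        true_positive = matrix[k][k]
        predicted_as_k = sum(matrix[i][k] for i in range(n))  # column sum
        actually_k = sum(matrix[k])                           # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        metrics.append((precision, recall))
    return metrics


metrics = per_class_metrics(cm)
for name, (precision, recall) in zip(labels, metrics):
    print(f"{name:>8}: precision={precision:.2f} recall={recall:.2f}")

# Accuracy is the diagonal (correct predictions) over all samples
accuracy_from_cm = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
print(f"accuracy={accuracy_from_cm:.5f}")  # 0.86375
```

Reading the rows, only 4 of the 44 Neutral reviews are classified correctly, and 30 of them are predicted as Positive.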



## Conclusion

This section summarizes the training and evaluation results.

In [8]:
# Print a summary of the training and evaluation results
print("="*50)
print("📌 Sentiment Analysis Project Conclusion 📌")
print("="*50)

# Print the test-set accuracy
print(f"🎯 Model accuracy on the test set: {accuracy:.4f}")

# Print a summary of the classification report (support-weighted averages)
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred_classes, average='weighted')
recall = recall_score(y_test, y_pred_classes, average='weighted')
f1 = f1_score(y_test, y_pred_classes, average='weighted')

print(f"📊 Precision: {precision:.4f}")
print(f"📊 Recall: {recall:.4f}")
print(f"📊 F1-Score: {f1:.4f}")

# Conclusion
if accuracy >= 0.85:
    print("\n✅ The model meets the minimum accuracy threshold of 85%. It can be used for sentiment analysis with good performance.")
else:
    print("\n⚠️ The model has not reached the minimum accuracy of 85%. Further optimization is needed, such as enlarging the dataset, tuning hyperparameters, or trying other algorithms.")

📌 Sentiment Analysis Project Conclusion 📌
🎯 Model accuracy on the test set: 0.8638
📊 Precision: 0.8403
📊 Recall: 0.8638
📊 F1-Score: 0.8507

✅ The model meets the minimum accuracy threshold of 85%. It can be used for sentiment analysis with good performance.
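
One possible follow-up for the weak Negative and Neutral scores is to weight classes inversely to their frequency during training. The sketch below computes "balanced" weights with the common `n_samples / (n_classes * class_count)` heuristic. It assumes the test-set support counts (58, 44, 698) are a fair proxy for the training distribution, which I have not checked, and `balanced_class_weights` is a hypothetical helper name.

```python
# Assumption: test-set support per encoded label stands in for the
# training distribution (0 = Negative, 1 = Neutral, 2 = Positive)
class_counts = {0: 58, 1: 44, 2: 698}


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}


class_weight = balanced_class_weights(class_counts)
for cls, weight in class_weight.items():
    print(f"class {cls}: weight {weight:.3f}")
```

A dictionary of this shape can be passed to Keras through the `class_weight` argument of `model.fit`. Whether it actually improves the minority-class scores on this dataset is untested here.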
