
**Muhammad Faris Akbar**
<br></br>
**Fundamental Deep Learning - Shopee Sentiment Analysis**

In [100]:
!pip install google-play-scraper transformers



In [101]:
import sys
import os
import pandas as pd
import re

In [102]:
try:
    # When run as a .py file
    base_path = os.path.dirname(os.path.abspath(__file__))
except NameError:
    # When run in Jupyter or another interactive session
    base_path = os.getcwd()

parent_dir = os.path.abspath(os.path.join(base_path, '..'))
sys.path.append(parent_dir)

print("Base Path:", base_path)
print("Parent Directory:", parent_dir)

Base Path: /content
Parent Directory: /


In [103]:
from google_play_scraper import reviews, Sort
# from Helper import *
from transformers import pipeline
from tqdm import tqdm

**Function: Scraping Google Play Reviews**

In [104]:
def scrape_google_play(app_id, total_count, country='id', lang='id'):
    """
    Scrape reviews from the Google Play Store.

    app_id (str): App ID on the Play Store (e.g. 'com.shopee.id')
    total_count (int): Target number of reviews to fetch
    country (str): Country code (default 'id', Indonesia)
    lang (str): Review language (default 'id', Indonesian)
    """
    print(f"Scraping reviews for app: {app_id}...")

    result, continuation_token = reviews(
        app_id,
        lang=lang,
        country=country,
        sort=Sort.NEWEST, # Sort.NEWEST returns the most recent reviews first
        count=total_count,
        filter_score_with=None # Fetch all ratings (1-5)
    )

    print(f"Fetched {len(result)} raw reviews.")

    # Convert the scraped results into a DataFrame (table)
    df = pd.DataFrame(result)

    return df
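
As a minimal sketch of what happens to the scraped table downstream: each review comes back as a dictionary, and only the `content` and `score` fields are used later in this notebook. The sample records below are made up for illustration, and `to_review_frame` is a hypothetical helper name, not part of `google_play_scraper`.

```python
import pandas as pd

def to_review_frame(raw_reviews):
    """Keep only the review text and star rating from raw review dicts."""
    df = pd.DataFrame(raw_reviews)
    # .copy() makes the result an independent DataFrame that is safe to modify
    return df[["content", "score"]].copy()

# Made-up records that mimic the shape of scraped reviews (assumption)
sample = [
    {"reviewId": "r1", "content": "top markotop", "score": 5},
    {"reviewId": "r2", "content": "GANGGU DOANG", "score": 1},
]
df_sample = to_review_frame(sample)
print(df_sample)
```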

**Function: Labeling Data**

In [105]:
def label_with_indobert(df):
  # Note: despite the function name, this checkpoint is an Indonesian RoBERTa model
  pretrained_name = "w11wo/indonesian-roberta-base-sentiment-classifier"

  nlp = pipeline(
      "sentiment-analysis",
      model=pretrained_name,
      tokenizer=pretrained_name,
      truncation=True, # Truncate texts longer than 512 tokens
      max_length=512
    )

  # Map the model's labels to the Indonesian labels used in the dataset
  label_map = {
    'positive': 'Positif',
    'neutral': 'Netral',
    'negative': 'Negatif'
  }

  labels = []
  scores = []

  for text in tqdm(df['soft_clean_content']):
    try:
      result = nlp(text)[0] # Predict the sentiment

      label = result['label'] # output: 'positive', 'neutral', 'negative'
      labels.append(label_map.get(label, label))
      scores.append(result['score'])

    except Exception as e:
      # Fall back to a neutral label with zero confidence when prediction fails
      print(f"Error on text: {text} ({e})")
      labels.append("Netral")
      scores.append(0.0)

  df['sentiment_label'] = labels
  df['confidence_score'] = scores

  return df
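
The label-mapping step can be looked at in isolation, without downloading the model. The sketch below assumes a prediction is a dictionary with `label` and `score` keys, as used in the function above; `map_prediction` and `fake_prediction` are hypothetical names introduced only for this illustration.

```python
LABEL_MAP = {"positive": "Positif", "neutral": "Netral", "negative": "Negatif"}

def map_prediction(prediction):
    """Translate one pipeline-style prediction dict into (label, score)."""
    label = prediction["label"]
    # Unknown labels pass through unchanged instead of raising a KeyError
    return LABEL_MAP.get(label, label), prediction["score"]

# Hand-written prediction, not real model output
fake_prediction = {"label": "negative", "score": 0.97}
print(map_prediction(fake_prediction))
```

Passing unknown labels through unchanged makes a mismatch between the model's label names and the map visible in the output rather than hiding it.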

In [106]:
# def labeling_data(df):
#     """
#     Assign labels automatically from the star rating.
#     Rules:
#     1-2 stars = Negatif
#     3 stars   = Netral
#     4-5 stars = Positif
#     """
#     def get_sentiment(score):
#         if score <= 2:
#             return 'Negatif'
#         elif score == 3:
#             return 'Netral'
#         else:
#             return 'Positif'

#     # Apply get_sentiment to the 'score' column
#     df['label'] = df['score'].apply(get_sentiment)

#     return df
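
For reference, the same star-rating rule can be written in vectorized form. This is a sketch under the assumption that scores are integers from 1 to 5; it uses `pd.cut` instead of the row-wise `apply` in the commented-out function, and `label_from_score` is a hypothetical name.

```python
import pandas as pd

def label_from_score(scores):
    """Map 1-5 star ratings to Negatif / Netral / Positif."""
    # Bins are right-inclusive: (0, 2] -> Negatif, (2, 3] -> Netral, (3, 5] -> Positif
    binned = pd.cut(scores, bins=[0, 2, 3, 5], labels=["Negatif", "Netral", "Positif"])
    return binned.astype(str)

demo_scores = pd.Series([1, 2, 3, 4, 5])
print(label_from_score(demo_scores).tolist())
```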

**Soft Cleaning Data (Repeated Characters, URLs, HTML Tags)**

In [107]:
def cleaning_untuk_labeling(text):
    text = str(text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text) # Remove URLs
    text = re.sub(r'<.*?>', '', text) # Remove HTML tags
    text = re.sub(r'@[A-Za-z0-9_]+', '', text) # Remove mentions (@user)
    text = re.sub(r'#[A-Za-z0-9]+', '', text) # Remove hashtags
    text = re.sub(r'(.)\1{2,}', r'\1\1', text) # Collapse any character repeated 3+ times down to 2
    text = re.sub(r'[a-zA-Z]+\d+\w*|\w*\d+[a-zA-Z]+', '', text) # Remove mixed letter-digit tokens such as m4ndi, 4yam, etc.
    text = re.sub(r'\b\d{7,}\b', '', text) # Remove numbers with 7 or more digits
    text = ' '.join(text.split()) # Normalize whitespace

    return text
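
To see what these rules do in practice, here is a small self-contained check. The function body repeats the regular expressions above so the snippet runs on its own; the two input strings are invented examples, not rows from the dataset.

```python
import re

def cleaning_untuk_labeling(text):
    text = str(text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # URLs
    text = re.sub(r'<.*?>', '', text)                   # HTML tags
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)          # mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)           # hashtags
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)          # 3+ repeats -> 2
    text = re.sub(r'[a-zA-Z]+\d+\w*|\w*\d+[a-zA-Z]+', '', text)  # mixed letter-digit tokens
    text = re.sub(r'\b\d{7,}\b', '', text)              # numbers with 7+ digits
    return ' '.join(text.split())                       # normalize whitespace

noisy = "Bagusss bangettt!!! cek https://shopee.co.id <b>promo</b> @admin #diskon"
print(cleaning_untuk_labeling(noisy))        # Baguss bangett!! cek promo
print(cleaning_untuk_labeling("m4ndi 4yam 081234567890 ok"))  # ok
```

Case and punctuation are deliberately left alone, which is why this is only a "soft" cleaning pass before labeling.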

**Main Section**

In [108]:
# TARGET_APP = 'com.shopee.id'
# JUMLAH_DATA = 50000

# df_ulasan = scrape_google_play(TARGET_APP, JUMLAH_DATA)

In [109]:
df_ulasan = pd.read_csv('https://raw.githubusercontent.com/ArmFriiz/Dicoding-Submission-FDL/refs/heads/main/Analisis%20Sentimen/dataset_ulasan_playstore.csv')

In [110]:
df_ulasan.head(5)

,content,score,label
0,CEPAT AMAN TERPERCAYA,5,Positif
1,top markotop,5,Positif
2,belanja online nomor 1 satu di Indonesia.,5,Positif
3,GANGGU DOANG LU ORG LAGI MAIN JUGA,1,Negatif
4,jadi langganan belanja di shopee..,5,Positif


In [111]:
df_ulasan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73478 entries, 0 to 73477
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   content  73478 non-null  object
 1   score    73478 non-null  int64 
 2   label    73478 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB


In [112]:
df_ulasan.describe(include='all')

,content,score,label
count,73478,73478.0,73478
unique,72705,,3
top,Baik,,Positif
freq,5,,51173
mean,,3.862802,
std,,1.668418,
min,,1.0,
25%,,2.0,
50%,,5.0,
75%,,5.0,


**Check Data Validity and Cleanliness**

In [113]:
# cek_nan(df_ulasan)

In [114]:
# visualize_row_with_duplicated(df_ulasan)

**Keep Only the content and score Columns**

In [115]:
df = df_ulasan[['content', 'score']].copy() # .copy() avoids pandas' SettingWithCopyWarning on later edits

In [116]:
# visualize_row_with_duplicated(df)

In [117]:
df = df.drop_duplicates()


In [118]:
# visualize_row_with_duplicated(df)

In [119]:
print(f"Data shape after NaN and duplicate cleanup: {df.shape}")

Data shape after NaN and duplicate cleanup: (73478, 2)


**Soft Cleaning Data**

In [120]:
df['soft_clean_content'] = df['content'].apply(cleaning_untuk_labeling)

In [121]:
df.head(5)

,content,score,soft_clean_content
0,CEPAT AMAN TERPERCAYA,5,CEPAT AMAN TERPERCAYA
1,top markotop,5,top markotop
2,belanja online nomor 1 satu di Indonesia.,5,belanja online nomor 1 satu di Indonesia.
3,GANGGU DOANG LU ORG LAGI MAIN JUGA,1,GANGGU DOANG LU ORG LAGI MAIN JUGA
4,jadi langganan belanja di shopee..,5,jadi langganan belanja di shopee..


**Labeling Data**

In [122]:
# print("Labeling data based on star rating")
# df_labeled = labeling_data(df)

In [123]:
df_labeled = label_with_indobert(df)

Device set to use cpu
  0%|          | 30/73478 [00:06<4:42:20,  4.34it/s]


KeyboardInterrupt: 
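
The run above was interrupted after 30 rows, with the progress bar estimating more than four hours on CPU. One way to make such a long job resumable is to label the data in chunks and append each finished chunk to a CSV file. The sketch below is an assumption-laden illustration, not the notebook's method: `label_in_chunks`, `out_path`, and `chunk_size` are hypothetical names, `label_fn` stands for any function shaped like `label_with_indobert`, and the row order of `df` is assumed to be identical between runs.

```python
import os
import pandas as pd

def label_in_chunks(df, label_fn, out_path, chunk_size=1000):
    """Label df chunk by chunk, appending each finished chunk to out_path."""
    # Rows already written by an earlier, interrupted run are skipped
    done = len(pd.read_csv(out_path)) if os.path.exists(out_path) else 0

    for start in range(done, len(df), chunk_size):
        chunk = label_fn(df.iloc[start:start + chunk_size].copy())
        # Write the header only when the file is being created
        write_header = not os.path.exists(out_path)
        chunk.to_csv(out_path, mode="a", index=False, header=write_header)

    return pd.read_csv(out_path)
```

A call such as `label_in_chunks(df, label_with_indobert, "labeled_partial.csv")` would then pick up where a previous run stopped; the file name here is only an example.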

In [None]:
df_labeled.head(5)

**Check the Class Distribution for Imbalance**

In [None]:
print("Class distribution:")
print(df_labeled['sentiment_label'].value_counts()) # label_with_indobert writes 'sentiment_label', not 'label'
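
Raw counts are easier to judge as shares. The sketch below computes per-class counts, their proportions, and the ratio between the largest and smallest class. `class_balance` is a hypothetical helper, and the demo labels are invented rather than taken from the dataset.

```python
import pandas as pd

def class_balance(labels):
    """Return per-class counts and shares, plus the majority/minority ratio."""
    counts = labels.value_counts()
    summary = pd.DataFrame({"count": counts, "share": counts / counts.sum()})
    imbalance_ratio = counts.max() / counts.min()
    return summary, imbalance_ratio

# Invented labels for illustration only
demo_labels = pd.Series(["Positif"] * 7 + ["Negatif"] * 2 + ["Netral"])
summary, ratio = class_balance(demo_labels)
print(summary)
print(f"Majority/minority ratio: {ratio:.1f}")
```

A ratio far above 1 signals that the minority classes may need resampling or class weights during training.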

**Export to CSV for Further Data Cleaning**

In [None]:
# nama_file = 'dataset_ulasan_playstore.csv'
# df_labeled.to_csv(nama_file, index=False)

# print(f"Done! Data saved to '{nama_file}'")
# print(f"Total rows: {len(df_labeled)}")