# **1. Dataset Introduction**


As a first step, find and use a dataset that meets the following requirements:

1. **Dataset Source**:  
   The dataset may come from various sources, such as public repositories (*Kaggle*, *UCI ML Repository*, *Open Data*) or primary data that you collect yourself.


# **2. Import Libraries**

At this stage, import the Python libraries needed for data analysis and for building the machine learning or deep learning model.

In [18]:
# Standard library
import datetime as dt
import re
import string

# Data handling and visualization
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from wordcloud import WordCloud

# Text processing
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the NLTK resources used for tokenizing, stopword removal, and lemmatizing
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\WORKPLUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\WORKPLUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\WORKPLUS\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\WORKPLUS\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

# **3. Loading the Dataset**

At this stage, load the dataset into the notebook. If the dataset is in CSV format, you can read it with the pandas library. Check the first few rows to understand the structure and to confirm that the data was loaded correctly.

If the dataset is stored on Google Drive, connect Google Drive to Colab first. Once the dataset is loaded, the next step is to verify that the data is as expected and ready for further analysis.

If the dataset is unstructured, follow the format used in the Machine Learning Pengembangan or Machine Learning Terapan classes.
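
The loading check described above could be sketched as follows. This is a minimal illustration, not the notebook's own code: it reads made-up rows from an in-memory string so that it runs without the real file, and the names `SAMPLE_CSV`, `EXPECTED_COLUMNS`, and `load_reviews` are assumptions introduced for the example. Parsing `review_date` is an optional extra step, with the date format inferred from the values shown in the output table.

```python
import io

import pandas as pd

# Hypothetical rows that mimic the layout of whatsapp_reviews.csv (not real data).
SAMPLE_CSV = """review_id,rating,review_text,review_date,helpful
id-1,5,works fine,11/26/2025 22:08,0
id-2,1,cannot log in,11/26/2025 22:06,2
id-3,4,good app,11/25/2025 09:30,1
"""

# Column names taken from the df.info() output in this notebook.
EXPECTED_COLUMNS = ["review_id", "rating", "review_text", "review_date", "helpful"]


def load_reviews(source):
    """Read the CSV and fail early if an expected column is missing."""
    frame = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # review_date arrives as text; parse it so it can be sorted and filtered.
    frame["review_date"] = pd.to_datetime(frame["review_date"], format="%m/%d/%Y %H:%M")
    return frame


sample_df = load_reviews(io.StringIO(SAMPLE_CSV))
print(sample_df.shape)
print(sample_df.dtypes)
```

With the real file, the same helper would be called with the path instead of the `io.StringIO` object.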

In [19]:
df = pd.read_csv("../data-raw/whatsapp_reviews.csv") 
df.head()

Unnamed: 0,review_id,rating,review_text,review_date,helpful
0,56887e3c-1684-4ced-834e-befc7a66fc7d,5,Great 👍,11/26/2025 22:08,0
1,0da4488e-7158-4ea6-bcb9-17b9b3867642,1,plz whats up unban,11/26/2025 22:08,0
2,5a20e8e3-9e00-4360-a539-16953e309a3a,1,my contact didn't show on WhatsApp .. for priv...,11/26/2025 22:06,0
3,0cf26263-1c10-473e-ae15-d6390884fef7,1,Can you guys let archived group chats stay arc...,11/26/2025 22:05,1
4,54eebd96-041e-4baf-a8f1-adf603658c28,5,it is the g.o.a.t🇿🇼,11/26/2025 22:04,0


# **4. Exploratory Data Analysis (EDA)**

At this stage, you will perform **Exploratory Data Analysis (EDA)** to understand the characteristics of the dataset.

The goal of EDA is to gain solid initial insight into the data and to decide on the next steps in analysis or modeling.
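
A few checks that typically belong to this stage can be sketched as below. The data frame is a small made-up sample, not the review dataset, and the choice of checks (missing values, duplicate rows, rating distribution, review length) is an assumption about what is useful here rather than a prescribed list.

```python
import pandas as pd

# Hypothetical reviews used only to illustrate the checks (not real data).
reviews = pd.DataFrame(
    {
        "rating": [5, 1, 1, 5, 3, 1],
        "review_text": ["great", "cannot log in", "cannot log in", "love it", None, "too many ads"],
    }
)

missing_per_column = reviews.isna().sum()          # empty cells per column
duplicate_rows = int(reviews.duplicated().sum())   # exact repeats of an earlier row
rating_counts = reviews["rating"].value_counts().sort_index()
text_lengths = reviews["review_text"].dropna().str.split().str.len()  # words per review

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(rating_counts)
print("median words per review:", text_lengths.median())
```

Counting missing values and duplicates before calling `dropna()` and `drop_duplicates()` makes it clear how many rows those calls actually remove.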

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5400 entries, 0 to 5399
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   review_id    5400 non-null   object
 1   rating       5400 non-null   int64 
 2   review_text  5400 non-null   object
 3   review_date  5400 non-null   object
 4   helpful      5400 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 211.1+ KB


In [21]:
df = df.dropna()

In [22]:
df = df.drop_duplicates()

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5400 entries, 0 to 5399
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   review_id    5400 non-null   object
 1   rating       5400 non-null   int64 
 2   review_text  5400 non-null   object
 3   review_date  5400 non-null   object
 4   helpful      5400 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 211.1+ KB


# **5. Data Preprocessing**

At this stage, data preprocessing is an important step for ensuring data quality before the data is used in a machine learning model.

Raw data, text data included, often contains missing values, duplicates, or inconsistent value ranges that can hurt model performance. Preprocessing therefore aims to clean and prepare the data so that the analysis runs well.

The following steps can be applied; the list is **not exhaustive**:
1. Removing or Handling Missing Values
2. Removing Duplicate Data
3. Feature Normalization or Standardization
4. Outlier Detection and Handling
5. Encoding Categorical Data
6. Binning (Grouping Data)

Adapt these steps to the characteristics of the data you are using, especially when you are working with unstructured data.
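
As an illustration of the binning step for review data, the sketch below groups star ratings into three sentiment labels with `pd.cut`. The thresholds and label names are assumptions made for this example; the notebook itself does not define them.

```python
import pandas as pd

ratings = pd.Series([1, 2, 3, 4, 5], name="rating")

# Assumed thresholds: 1-2 negative, 3 neutral, 4-5 positive.
# pd.cut uses right-closed intervals here: (0, 2], (2, 3], (3, 5].
sentiment = pd.cut(ratings, bins=[0, 2, 3, 5], labels=["negative", "neutral", "positive"])

print(sentiment.tolist())
```

Different cut points, for example treating a rating of 3 as negative, are equally defensible and depend on how the labels will be used.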

In [24]:
slang_path = "../data-raw/acrynom.csv" 
slang_df = pd.read_csv(slang_path, header=0, names=['slang', 'formal'])
slang_dict = dict(zip(slang_df['slang'], slang_df['formal']))

In [25]:
def cleaningText(text):
    """Remove non-ASCII characters, mentions, hashtags, links, digits, and punctuation."""
    text = text.encode('ascii', 'ignore').decode('ascii')  # drop emoji and other non-ASCII characters
    text = re.sub(r'@[A-Za-z0-9]+', '', text)  # remove mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)  # remove hashtags
    text = re.sub(r'RT[\s]', '', text)  # remove the retweet marker
    text = re.sub(r"http\S+", '', text)  # remove links
    text = re.sub(r'[0-9]+', '', text)  # remove numbers
    text = text.replace('\n', ' ')  # replace newlines with spaces
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove all punctuation
    text = text.strip(' ')  # trim spaces at both ends
    return text

def caseFolding(text):
    """Convert the text to lowercase."""
    return text.lower()

def replace_slang(text):
    """Replace each word found in slang_dict with its formal form."""
    normalized_words = [slang_dict.get(word, word) for word in text.split()]
    return " ".join(normalized_words)

def tokenizingText(text):
    """Tokenize the text and drop English stopwords."""
    listStopwords = set(stopwords.words('english'))
    # Extra tokens to drop. cleaningText already strips punctuation, so the
    # contraction pieces ("n't", "'s", ...) only occur if that step is skipped.
    listStopwords.update({"n't", "'s", "'m", "'re", "...", "u", "ur"})
    return [token for token in word_tokenize(text) if token not in listStopwords]

def stemmingText(list_words):
    """Stem every token in the list with the Porter stemmer."""
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in list_words]

def lemmatizingText(list_words):
    """Lemmatize every token in the list with the WordNet lemmatizer."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in list_words]

def toSentence(list_words):
    """Join the tokens back into a single space-separated string."""
    return ' '.join(list_words)
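
To see what the cleaning, case-folding, and slang steps do to a single review, the sketch below chains a subset of them in the same order on two inputs. It deliberately avoids NLTK so that it runs without downloaded resources; `demo_slang` is a hypothetical two-entry dictionary standing in for the one loaded from the CSV file, and `clean_and_normalize` is an illustrative helper, not a function from the notebook.

```python
import re
import string

# Hypothetical slang dictionary; the notebook loads its real one from a CSV file.
demo_slang = {"plz": "please", "thx": "thanks"}


def clean_and_normalize(text):
    """Apply a subset of the cleaning steps, then lowercase and replace slang."""
    text = text.encode("ascii", "ignore").decode("ascii")  # drop emoji and other non-ASCII characters
    text = re.sub(r"http\S+", "", text)                    # remove links
    text = re.sub(r"[0-9]+", "", text)                     # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = text.strip(" ").lower()
    # Look up each word in the slang dictionary, keeping unknown words unchanged.
    return " ".join(demo_slang.get(word, word) for word in text.split())


print(clean_and_normalize("plz whats up unban"))      # please whats up unban
print(clean_and_normalize("my contact didn't show"))  # my contact didnt show
```

The second call shows a side effect of removing punctuation before tokenizing: `didn't` becomes `didnt`, which is why tokens such as `didnt` and `cant` still appear after stopword removal in the notebook's output.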


In [26]:
clean_df = df.copy()
clean_df.head()

Unnamed: 0,review_id,rating,review_text,review_date,helpful
0,56887e3c-1684-4ced-834e-befc7a66fc7d,5,Great 👍,11/26/2025 22:08,0
1,0da4488e-7158-4ea6-bcb9-17b9b3867642,1,plz whats up unban,11/26/2025 22:08,0
2,5a20e8e3-9e00-4360-a539-16953e309a3a,1,my contact didn't show on WhatsApp .. for priv...,11/26/2025 22:06,0
3,0cf26263-1c10-473e-ae15-d6390884fef7,1,Can you guys let archived group chats stay arc...,11/26/2025 22:05,1
4,54eebd96-041e-4baf-a8f1-adf603658c28,5,it is the g.o.a.t🇿🇼,11/26/2025 22:04,0


In [27]:
clean_df['text_clean'] = clean_df['review_text'].apply(cleaningText)
clean_df['text_casefolding'] = clean_df['text_clean'].apply(caseFolding) 
clean_df['text_normalized'] = clean_df['text_casefolding'].apply(replace_slang)
clean_df['text_tokenandstopwords'] = clean_df['text_normalized'].apply(tokenizingText)
clean_df['text_lematizing'] = clean_df['text_tokenandstopwords'].apply(lemmatizingText)
clean_df['text_akhir'] = clean_df['text_lematizing'].apply(toSentence)

In [28]:
clean_df.head()

Unnamed: 0,review_id,rating,review_text,review_date,helpful,text_clean,text_casefolding,text_normalized,text_tokenandstopwords,text_lematizing,text_akhir
0,56887e3c-1684-4ced-834e-befc7a66fc7d,5,Great 👍,11/26/2025 22:08,0,Great,great,great,[great],[great],great
1,0da4488e-7158-4ea6-bcb9-17b9b3867642,1,plz whats up unban,11/26/2025 22:08,0,plz whats up unban,plz whats up unban,please whats up unban,"[please, whats, unban]","[please, whats, unban]",please whats unban
2,5a20e8e3-9e00-4360-a539-16953e309a3a,1,my contact didn't show on WhatsApp .. for priv...,11/26/2025 22:06,0,my contact didnt show on WhatsApp for privacy...,my contact didnt show on whatsapp for privacy...,my contact didnt show on whatsapp for privacy ...,"[contact, didnt, show, whatsapp, privacy, cant...","[contact, didnt, show, whatsapp, privacy, cant...",contact didnt show whatsapp privacy cant share...
3,0cf26263-1c10-473e-ae15-d6390884fef7,1,Can you guys let archived group chats stay arc...,11/26/2025 22:05,1,Can you guys let archived group chats stay arc...,can you guys let archived group chats stay arc...,can you guys let archived group chats stay arc...,"[guys, let, archived, group, chats, stay, arch...","[guy, let, archived, group, chat, stay, archiv...",guy let archived group chat stay archived arch...
4,54eebd96-041e-4baf-a8f1-adf603658c28,5,it is the g.o.a.t🇿🇼,11/26/2025 22:04,0,it is the goat,it is the goat,it is the goat,[goat],[goat],goat
