<a href="https://colab.research.google.com/github/LatiefDataVisionary/nlp-emotikon-slang-id/blob/main/notebooks/preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Import Libraries and Install Sastrawi**

In [4]:
!pip install Sastrawi

Collecting Sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl.metadata (909 bytes)
Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
Installing collected packages: Sastrawi
Successfully installed Sastrawi-1.0.1


In [5]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

## **Download NLTK resources**

In [6]:
nltk.download('punkt') # Tokenizer models
nltk.download('stopwords') # Stopwords for Indonesian and English
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

## **Load the dataset**

In [7]:
url = 'https://raw.githubusercontent.com/LatiefDataVisionary/nlp-emotikon-slang-id/refs/heads/main/data/raw_merged/merged_bucin_tweet.csv'
df = pd.read_csv(url)
df

Unnamed: 0,conversation_id_str,created_at,favorite_count,full_text,id_str,image_url,in_reply_to_screen_name,lang,location,quote_count,reply_count,retweet_count,tweet_url,user_id_str,username
0,1921236320413687871,Sat May 10 16:10:08 +0000 2025,9190,kdm ️ resident playbook ️ pict atas minta peg...,1921236320413687871,https://pbs.twimg.com/media/GqmaQGPWcAAMaS9.jpg,,in,Rules 👇,452,324,1044,https://x.com/kdrama_menfess/status/1921236320...,1012588105591713793,kdrama_menfess
1,1920994897315733648,Sat May 10 00:10:48 +0000 2025,19038,Jualan tapi julidin customer tu maksudnya gima...,1920994897315733648,https://pbs.twimg.com/media/Gqi-q2RXgAAIW_5.jpg,,in,Indonesia,684,593,1319,https://x.com/Maeliani07/status/19209948973157...,1659026628071288835,Maeliani07
2,1921516689079820765,Sun May 11 10:44:13 +0000 2025,4,Ini pas awal vidio Yeye kaya nyandar di bahu K...,1921516689079820765,https://pbs.twimg.com/media/GqqZOtGaUAAgDdP.jpg,,in,dusun majasri,0,0,1,https://x.com/BeUrCLOUDS/status/19215166890798...,2995021591,BeUrCLOUDS
3,1921414816133996832,Sun May 11 03:59:24 +0000 2025,2128,kdm resident playbook Ini tuh dowon yiyoung p...,1921414816133996832,https://pbs.twimg.com/media/Gqo8l6vW0AAbCWI.jpg,,in,Rules 👇,55,77,198,https://x.com/kdrama_menfess/status/1921414816...,1012588105591713793,kdrama_menfess
4,1921515322135232681,Sun May 11 10:38:47 +0000 2025,438,Bu lena liat kelakuan bucin adiknya AQEELA TER...,1921515322135232681,https://pbs.twimg.com/ext_tw_video_thumb/19215...,,in,,5,1,69,https://x.com/staraquars/status/19215153221352...,1692804014994599937,staraquars
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
610,1921519612543279514,Sun May 11 10:55:50 +0000 2025,12,gue bisa kok bales argumentasi kalian pake kat...,1921519612543279514,,,in,bikini bottom 🫧🪸🫙🐬🧜‍♀️,0,0,0,https://x.com/redhorse44/status/19215196125432...,1642331059634442241,redhorse44
611,1920141611280847139,Wed May 07 15:40:09 +0000 2025,1410,SKSKSKSKSKSKSKSK ternyata gini ya rasanya buc...,1920141611280847139,,,in,Indonesia,27,108,33,https://x.com/tanyarlfes/status/19201416112808...,1371650588,tanyarlfes
612,1921137993051525158,Sat May 10 09:39:25 +0000 2025,274,Mav kalau sensitip tapi kalian anak UT! yang k...,1921137993051525158,,,in,,54,145,13,https://x.com/utfess/status/1921137993051525158,1451424492719251460,utfess
613,1921453647201710170,Sun May 11 06:33:43 +0000 2025,428,mae ning sama gem sama aja ya kalo udah ditany...,1921453647201710170,,,in,𝕜𝕙𝕦𝕟𝕟𝕠𝕠 ☆ 𝕝𝕪𝕜𝕪𝕠𝕦,0,0,80,https://x.com/dycuri/status/1921453647201710170,1794083416058609664,dycuri


## **1. Scenario 1: No Emoticons/Slang**

### **a. Separate the dataset for Scenario 1**

In [8]:
df1_no_emoticons = df.copy()  # Independent copy, so Scenario 1 columns and drops do not alter df

### **b. Apply preprocessing steps**

**1. Clean Text**

In [11]:
def clean_text_scenario1(text):
  """
  Clean text by removing URLs, mentions, hashtags, non-alphabetic characters
  (including emoticons), and a short list of Indonesian slang words.

  Args:
      text (str): Raw input text.

  Returns:
      str: Cleaned text containing only ASCII letters and single spaces.
  """
  if pd.isna(text):  # Handle missing values
    return ''

  # Remove URLs, mentions (@), and hashtags (#)
  text = re.sub(r'http\S+|@\w+|#\w+', '', text)

  # Remove ALL non-alphabetic characters (including emoticons and punctuation)
  text = re.sub(r'[^a-zA-Z\s]', '', text)

  # Remove common Indonesian slang words
  slang_words = ['wkwk', 'bangeet', 'gemess', 'mantul', 'sukab']
  for slang in slang_words:
    # \b ensures whole-word matching; IGNORECASE is needed because case folding happens later
    text = re.sub(r'\b' + slang + r'\b', '', text, flags=re.IGNORECASE)

  # Normalize whitespace
  text = re.sub(r'\s+', ' ', text).strip()

  return text

In [13]:
df1_no_emoticons['cleaned_text'] = df1_no_emoticons['full_text'].apply(clean_text_scenario1)

**2. Case Folding (convert to lowercase)**

In [14]:
df1_no_emoticons['case_folded'] = df1_no_emoticons['cleaned_text'].str.lower()

**3. Tokenization**

In [15]:
df1_no_emoticons['tokens'] = df1_no_emoticons['case_folded'].apply(nltk.word_tokenize)

### **c. Stopword Removal (Indonesian + English + custom)**

In [17]:
stop_words = set(
    stopwords.words('indonesian') +
    stopwords.words('english') +
    ['dong', 'sih', 'nya', 'lah', 'deh', 'rt']
)

In [18]:
df1_no_emoticons['filtered_tokens'] = df1_no_emoticons['tokens'].apply(
    lambda x: [word for word in x if word not in stop_words]
)
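
The filtering step is a plain set-membership test per token. A minimal sketch, assuming a tiny hand-written stopword set (`demo_stop_words` is hypothetical; the notebook itself uses NLTK's Indonesian and English lists plus the custom words above):

```python
# Hypothetical mini stopword set, standing in for the full NLTK-based stop_words.
demo_stop_words = {'dan', 'ketika', 'kok', 'dong', 'sih'}

# Tokens taken from a lowercased, already-tokenized tweet.
tokens = ['ketika', 'juna', 'bucin', 'dan', 'marka', 'pusing', 'kok', 'bawel']

# Keep only tokens that are not in the stopword set; order is preserved.
filtered = [word for word in tokens if word not in demo_stop_words]

print(filtered)
```

Using a `set` rather than a list keeps each lookup cheap, which matters once the combined stopword list grows to several hundred entries.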

### **d. Drop unused columns (Twitter metadata)**

In [19]:
columns_to_drop = [
    'conversation_id_str', 'id_str', 'image_url', 'in_reply_to_screen_name',
    'location', 'quote_count', 'reply_count', 'tweet_url', 'user_id_str'
]
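# Reassign instead of dropping in place, so any other reference to the same frame keeps its columns
df1_no_emoticons = df1_no_emoticons.drop(columns=columns_to_drop)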

### **e. Save to CSV**

In [21]:
df1_no_emoticons.to_csv('preprocessed_scenario1_no_emoticons.csv', index=False)
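
One thing to keep in mind when saving: the `tokens` and `filtered_tokens` columns hold Python lists, and a CSV file can only store them as text. The sketch below uses the standard-library `csv` module on a made-up one-row dataset to show the effect and one possible way to restore the lists; it assumes the saved file is later read back for modeling, which the notebook does not show.

```python
import ast
import csv
import io

# Hypothetical one-row dataset; the real file has one row per tweet.
rows = [{'full_text': 'untung sayang', 'filtered_tokens': ['untung', 'sayang']}]

# Write to an in-memory CSV; the list is stored as its string form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['full_text', 'filtered_tokens'])
writer.writeheader()
writer.writerows(rows)

# Read it back: the token column is now a plain string, not a list.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_value = loaded[0]['filtered_tokens']

# ast.literal_eval safely parses the string back into a Python list.
restored = ast.literal_eval(raw_value)

print(type(raw_value).__name__, type(restored).__name__)
```

With pandas, the same conversion could be applied while loading, for example by passing `converters={'filtered_tokens': ast.literal_eval}` to `pd.read_csv`.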

**1. Cleaning**

In [None]:
def clean_text(text):
  if pd.isna(text):  # Handle missing values
    return ''
  # Remove URLs
  text = re.sub(r'http\S+', '', text)
  # Remove mentions (@) and hashtags (#)
  text = re.sub(r'@\w+|#\w+', '', text)
  # Remove special characters, numbers, and emojis
  text = re.sub(r'[^a-zA-Z\s]', '', text)
  # Remove extra whitespace
  text = re.sub(r'\s+', ' ', text).strip()
  return text

# Apply cleaning to the text column ('full_text')
df['cleaned_text'] = df['full_text'].apply(clean_text)
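
To see what each regex in the cleaning step removes, here is a small standalone sketch. It repeats the same substitutions on a made-up tweet (`sample` is invented for illustration and is not from the dataset) and leaves out the missing-value check so that it runs without pandas.

```python
import re

def clean_demo(text):
    # Step 1: drop URLs, @mentions, and #hashtags (the hashtag word goes too).
    text = re.sub(r'http\S+|@\w+|#\w+', '', text)
    # Step 2: keep only ASCII letters and whitespace; digits, punctuation, and emoji are removed.
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Step 3: collapse runs of whitespace left behind by the removals.
    return re.sub(r'\s+', ' ', text).strip()

# Invented example tweet.
sample = "@teman bucin level 100 😍 #bucin https://t.co/abc123 untung sayang!!"
cleaned = clean_demo(sample)

print(cleaned)
```

Note that the hashtag pattern deletes the tag text itself, not only the `#` symbol, and that the letter filter also discards digits and any non-ASCII letters.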

**2. Case Folding (convert to lowercase)**

In [None]:
df['case_folded'] = df['cleaned_text'].str.lower()

**3. Tokenizing (split text into words)**

In [None]:
def tokenize_text(text):
  return nltk.word_tokenize(text)

df['tokens'] = df['case_folded'].apply(tokenize_text)

**4. Filtering (Stopword Removal)**

In [None]:
# Add custom stopwords for 'bucin' context
custom_stopwords = ['dong', 'sih', 'nya', 'lah', 'deh', 'rt']
stop_words = set(
    stopwords.words('indonesian') +
    stopwords.words('english') +
    custom_stopwords
)

def remove_stopwords(tokens):
  return [word for word in tokens if word not in stop_words]

df['filtered_tokens'] = df['tokens'].apply(remove_stopwords)

**5. Stemming (reduce words to root form)**

In [None]:
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def stem_text(tokens):
  return [stemmer.stem(word) for word in tokens]

df['stemmed_tokens'] = df['filtered_tokens'].apply(stem_text)

**6. Drop unused columns**

In [None]:
columns_to_drop = [
    'conversation_id_str', 'id_str', 'image_url', 'in_reply_to_screen_name',
    'location', 'quote_count', 'reply_count', 'tweet_url', 'user_id_str'
]
df = df.drop(columns=columns_to_drop)

**Save preprocessed data**

In [None]:
df.to_csv('preprocessed_bucin.csv', index=False)
df

Unnamed: 0,created_at,favorite_count,full_text,lang,retweet_count,username,cleaned_text,case_folded,tokens,filtered_tokens,stemmed_tokens
0,Sat May 10 16:10:08 +0000 2025,9190,kdm ️ resident playbook ️ pict atas minta peg...,in,1044,kdrama_menfess,kdm resident playbook pict atas minta pegangan...,kdm resident playbook pict atas minta pegangan...,"[kdm, resident, playbook, pict, atas, minta, p...","[kdm, resident, playbook, pict, pegangan, tang...","[kdm, resident, playbook, pict, pegang, tangan..."
1,Sat May 10 00:10:48 +0000 2025,19038,Jualan tapi julidin customer tu maksudnya gima...,in,1319,Maeliani07,Jualan tapi julidin customer tu maksudnya gima...,jualan tapi julidin customer tu maksudnya gima...,"[jualan, tapi, julidin, customer, tu, maksudny...","[jualan, julidin, customer, tu, maksudnya, gim...","[jual, julidin, customer, tu, maksud, gimana, ..."
2,Sun May 11 10:44:13 +0000 2025,4,Ini pas awal vidio Yeye kaya nyandar di bahu K...,in,1,BeUrCLOUDS,Ini pas awal vidio Yeye kaya nyandar di bahu K...,ini pas awal vidio yeye kaya nyandar di bahu k...,"[ini, pas, awal, vidio, yeye, kaya, nyandar, d...","[pas, vidio, yeye, kaya, nyandar, bahu, kyukyu...","[pas, vidio, yeye, kaya, nyandar, bahu, kyukyu..."
3,Sun May 11 03:59:24 +0000 2025,2128,kdm resident playbook Ini tuh dowon yiyoung p...,in,198,kdrama_menfess,kdm resident playbook Ini tuh dowon yiyoung pe...,kdm resident playbook ini tuh dowon yiyoung pe...,"[kdm, resident, playbook, ini, tuh, dowon, yiy...","[kdm, resident, playbook, tuh, dowon, yiyoung,...","[kdm, resident, playbook, tuh, dowon, yiyoung,..."
4,Sun May 11 10:38:47 +0000 2025,438,Bu lena liat kelakuan bucin adiknya AQEELA TER...,in,69,staraquars,Bu lena liat kelakuan bucin adiknya AQEELA TER...,bu lena liat kelakuan bucin adiknya aqeela ter...,"[bu, lena, liat, kelakuan, bucin, adiknya, aqe...","[bu, lena, liat, kelakuan, bucin, adiknya, aqe...","[bu, lena, liat, laku, bucin, adik, aqeela, te..."
...,...,...,...,...,...,...,...,...,...,...,...
610,Sun May 11 10:55:50 +0000 2025,12,gue bisa kok bales argumentasi kalian pake kat...,in,0,redhorse44,gue bisa kok bales argumentasi kalian pake kat...,gue bisa kok bales argumentasi kalian pake kat...,"[gue, bisa, kok, bales, argumentasi, kalian, p...","[gue, bales, argumentasi, pake, sopan, gabisa,...","[gue, bales, argumentasi, pake, sopan, gabisa,..."
611,Wed May 07 15:40:09 +0000 2025,1410,SKSKSKSKSKSKSKSK ternyata gini ya rasanya buc...,in,33,tanyarlfes,SKSKSKSKSKSKSKSK ternyata gini ya rasanya buci...,sksksksksksksksk ternyata gini ya rasanya buci...,"[sksksksksksksksk, ternyata, gini, ya, rasanya...","[sksksksksksksksk, gini, ya, bucin, kaya, mela...","[sksksksksksksksk, gin, ya, bucin, kaya, layan..."
612,Sat May 10 09:39:25 +0000 2025,274,Mav kalau sensitip tapi kalian anak UT! yang k...,in,13,utfess,Mav kalau sensitip tapi kalian anak UT yang ku...,mav kalau sensitip tapi kalian anak ut yang ku...,"[mav, kalau, sensitip, tapi, kalian, anak, ut,...","[mav, sensitip, anak, ut, kuliah, kerja, tuh, ...","[mav, sensitip, anak, ut, kuliah, kerja, tuh, ..."
613,Sun May 11 06:33:43 +0000 2025,428,mae ning sama gem sama aja ya kalo udah ditany...,in,80,dycuri,mae ning sama gem sama aja ya kalo udah ditany...,mae ning sama gem sama aja ya kalo udah ditany...,"[mae, ning, sama, gem, sama, aja, ya, kalo, ud...","[mae, ning, gem, aja, ya, kalo, udah, ditanyai...","[mae, ning, gem, aja, ya, kalo, udah, ditanyai..."


**Example output**

In [None]:
# Example output
print('\nPreprocessing Example:\n')
index = int(input('Enter index of the tweet: '))
print('Original Text\t:\n\t-', df['full_text'][index])  # Adjust column name
print('Cleaned Text\t:\n\t-', df['cleaned_text'][index])
print('Case Folded\t:\n\t-', df['case_folded'][index])
print('Tokens\t:\n\t-', df['tokens'][index])
print('Filtered Tokens\t:\n\t-', df['filtered_tokens'][index])
print('Stemmed Tokens\t:\n\t-', df['stemmed_tokens'][index])


Preprocessing Example:

Enter index of the tweet: 11
Original Text	:
	- #JuniorMark : onetweet AU. ketika Juna bucin level max dan Marka pusing pacarnya kok bawel banget untung sayang. https://t.co/Zpp3JnDq1q
Cleaned Text	:
	- onetweet AU ketika Juna bucin level max dan Marka pusing pacarnya kok bawel banget untung sayang
Case Folded	:
	- onetweet au ketika juna bucin level max dan marka pusing pacarnya kok bawel banget untung sayang
Tokens	:
	- ['onetweet', 'au', 'ketika', 'juna', 'bucin', 'level', 'max', 'dan', 'marka', 'pusing', 'pacarnya', 'kok', 'bawel', 'banget', 'untung', 'sayang']
Filtered Tokens	:
	- ['onetweet', 'au', 'juna', 'bucin', 'level', 'max', 'marka', 'pusing', 'pacarnya', 'bawel', 'banget', 'untung', 'sayang']
Stemmed Tokens	:
	- ['onetweet', 'au', 'juna', 'bucin', 'level', 'max', 'marka', 'pusing', 'pacar', 'bawel', 'banget', 'untung', 'sayang']
