# Preprocessing the alicorn_twilight.csv Dataset
Before building the classification model, the dataset must be preprocessed: removing special characters, converting text to lowercase, collapsing repeated spaces, and so on.
The dataset contains roughly 500 tweets with people's reactions to Twilight Sparkle becoming an Alicorn at the end of season 3 of My Little Pony: Friendship is Magic. It was collected by crawling Twitter with the tools provided by Helmi Satria (big thanks to him).

## 1. Reading the Dataset

In [5]:
import pandas as pd

# Load the dataset
file_path = 'alicorn_twilight.csv'
dataset = pd.read_csv(file_path)

# Display the first few rows to understand the structure
dataset.head(), dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 507 entries, 0 to 506
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   conversation_id_str      507 non-null    int64 
 1   created_at               507 non-null    object
 2   favorite_count           507 non-null    int64 
 3   full_text                507 non-null    object
 4   id_str                   507 non-null    int64 
 5   image_url                44 non-null     object
 6   in_reply_to_screen_name  154 non-null    object
 7   lang                     507 non-null    object
 8   location                 335 non-null    object
 9   quote_count              507 non-null    int64 
 10  reply_count              507 non-null    int64 
 11  retweet_count            507 non-null    int64 
 12  tweet_url                507 non-null    object
 13  user_id_str              507 non-null    int64 
 14  username                 507 non-null    o

(   conversation_id_str                      created_at  favorite_count  \
 0   349251986438356994  Mon Jun 24 19:45:37 +0000 2013               0   
 1   349044033647157248  Mon Jun 24 06:06:31 +0000 2013               0   
 2   348985526667329537  Mon Jun 24 02:08:49 +0000 2013               0   
 3   348932366217121793  Sun Jun 23 22:35:33 +0000 2013               0   
 4   348913737111048192  Sun Jun 23 21:21:32 +0000 2013               0   
 
                                            full_text              id_str  \
 0  @MLP_Alicorn_Twi oh hi twilight how r u and th...  349251986438356994   
 1  @NerdyPinkie that s great what have you been u...  349045855346638849   
 2  @NEligahn @BlameLoomy @TailsTheBard @PhotoPwne...  348986036539506688   
 3  @MLP_Alicorn_Twi u ok twilight? *looks conserned*  348932366217121793   
 4           @MLP_Alicorn_Twi oh hi Twilight how r u?  348913737111048192   
 
   image_url in_reply_to_screen_name lang                   location  \
 0       NaN

## 2. Cleaning Text

In [None]:
import re

# Function that cleans a single tweet
def clean_text(text):
    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)
    # Remove mentions and hashtags (the hashtag text is dropped as well)
    text = re.sub(r"@\w+|#\w+", "", text)
    # Remove punctuation (anything that is neither a word character nor whitespace)
    text = re.sub(r"[^\w\s]", "", text)
    # Remove digits
    text = re.sub(r"\d+", "", text)
    # Collapse repeated whitespace and trim both ends
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Apply clean_text to dataset['full_text']
dataset['cleaned_text'] = dataset['full_text'].apply(clean_text)

# Show the full_text and cleaned_text columns
dataset[['full_text', 'cleaned_text']].head()

|   | full_text | cleaned_text |
|---|---|---|
| 0 | @MLP_Alicorn_Twi oh hi twilight how r u and th... | oh hi twilight how r u and the other ponys in ... |
| 1 | @NerdyPinkie that s great what have you been u... | that s great what have you been up to Did you ... |
| 2 | @NEligahn @BlameLoomy @TailsTheBard @PhotoPwne... | As in Cadance is dead Twilights not an alicorn... |
| 3 | @MLP_Alicorn_Twi u ok twilight? \*looks conserned\* | u ok twilight looks conserned |
| 4 | @MLP_Alicorn_Twi oh hi Twilight how r u? | oh hi Twilight how r u |
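
To see what each substitution in `clean_text` contributes, the sketch below applies the same five patterns one at a time and records the intermediate strings. The sample tweet, the `CLEANING_STEPS` list, and the `trace_cleaning` helper are illustrative assumptions and are not part of the notebook.

```python
import re

# Same patterns as clean_text above, in the same order.
CLEANING_STEPS = [
    ("urls", r"http\S+|www\S+|https\S+", ""),
    ("mentions_hashtags", r"@\w+|#\w+", ""),
    ("punctuation", r"[^\w\s]", ""),
    ("digits", r"\d+", ""),
    ("whitespace", r"\s+", " "),
]

def trace_cleaning(text):
    """Return the cleaned text plus the string produced after each step."""
    history = []
    for name, pattern, replacement in CLEANING_STEPS:
        text = re.sub(pattern, replacement, text)
        history.append((name, text))
    return text.strip(), history

# Hypothetical tweet, made up for illustration.
sample = "@MLP_Alicorn_Twi Twilight's wings?! See http://example.com #MLP 2013"
cleaned, history = trace_cleaning(sample)

for name, intermediate in history:
    print(f"{name:>18}: {intermediate!r}")
print(cleaned)  # Twilights wings See
```

Two side effects are worth noting. Removing the apostrophe fuses `Twilight's` into `Twilights`, which may explain the `Twilights` in row 2 of the output above. Hashtags are also dropped together with their text rather than being kept as words.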


## 3. Case Normalization

In [None]:
# Convert the text to lowercase
dataset['normalized_text'] = dataset['cleaned_text'].str.lower()

# Show the cleaned_text and normalized_text columns
dataset[['cleaned_text', 'normalized_text']].head()

|   | cleaned_text | normalized_text |
|---|---|---|
| 0 | oh hi twilight how r u and the other ponys in ... | oh hi twilight how r u and the other ponys in ... |
| 1 | that s great what have you been up to Did you ... | that s great what have you been up to did you ... |
| 2 | As in Cadance is dead Twilights not an alicorn... | as in cadance is dead twilights not an alicorn... |
| 3 | u ok twilight looks conserned | u ok twilight looks conserned |
| 4 | oh hi Twilight how r u | oh hi twilight how r u |


## 4. Stopword Removal

In [None]:
from nltk.corpus import stopwords
import nltk

# Download the stopword lists if they are not already available
nltk.download('stopwords')

# Select the language of the stopword list
stop_words = set(stopwords.words('english'))

# Function that removes stopwords from a whitespace-separated string
def remove_stopwords(text):
    return " ".join([word for word in text.split() if word not in stop_words])

# Apply remove_stopwords to dataset['normalized_text']
dataset['no_stopwords_text'] = dataset['normalized_text'].apply(remove_stopwords)

# Show the normalized_text and no_stopwords_text columns
dataset[['normalized_text', 'no_stopwords_text']].head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


|   | normalized_text | no_stopwords_text |
|---|---|---|
| 0 | oh hi twilight how r u and the other ponys in ... | oh hi twilight r u ponys ponyville |
| 1 | that s great what have you been up to did you ... | great hear twilight filly interesting alicorn |
| 2 | as in cadance is dead twilights not an alicorn... | cadance dead twilights alicorn show characters... |
| 3 | u ok twilight looks conserned | u ok twilight looks conserned |
| 4 | oh hi twilight how r u | oh hi twilight r u |
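
Row 2 above shows a side effect that can matter for classification: `not` is treated as a stopword, so "twilights not an alicorn" becomes "twilights alicorn". A minimal sketch of one way to keep negations follows. The tiny `stop_words` set is a hand-picked stand-in for `stopwords.words('english')`, and the `keep` parameter is a hypothetical addition to `remove_stopwords`.

```python
# Hand-picked stand-in for the NLTK English stopword list (assumption).
stop_words = {"as", "in", "is", "not", "an", "the", "and", "how"}

# Negation words to protect from removal (assumption: adjust to the task).
negations = {"not", "no", "nor"}

def remove_stopwords(text, stop_words, keep=frozenset()):
    """Drop stopwords, except those explicitly listed in `keep`."""
    return " ".join(
        word for word in text.split()
        if word not in stop_words or word in keep
    )

# Visible prefix of row 2 in the output above.
text = "as in cadance is dead twilights not an alicorn"

print(remove_stopwords(text, stop_words))                  # cadance dead twilights alicorn
print(remove_stopwords(text, stop_words, keep=negations))  # cadance dead twilights not alicorn
```

Whether keeping negations helps depends on the classifier and features used, so this is worth testing rather than assuming.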


## 5. Tokenize

In [9]:
from nltk.tokenize import word_tokenize
import nltk

# Download the tokenizer data if it is not already available
nltk.download('punkt')
nltk.download('punkt_tab')

# Function that splits a string into a list of tokens
def tokenize_text(text):
    return word_tokenize(text)

# Convert a list of tokens back into a string
def list_to_string(token_list):
    return " ".join(token_list)

# Apply tokenize_text to dataset['no_stopwords_text']
dataset['tokenized_text'] = dataset['no_stopwords_text'].apply(tokenize_text)

# Show the no_stopwords_text and tokenized_text columns
dataset[['no_stopwords_text', 'tokenized_text']].head()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


|   | no_stopwords_text | tokenized_text |
|---|---|---|
| 0 | oh hi twilight r u ponys ponyville | [oh, hi, twilight, r, u, ponys, ponyville] |
| 1 | great hear twilight filly interesting alicorn | [great, hear, twilight, filly, interesting, al... |
| 2 | cadance dead twilights alicorn show characters... | [cadance, dead, twilights, alicorn, show, char... |
| 3 | u ok twilight looks conserned | [u, ok, twilight, looks, conserned] |
| 4 | oh hi twilight r u | [oh, hi, twilight, r, u] |


## 6. Lemmatization

In [10]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Function that lemmatizes each token in a list
def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

# Apply lemmatize_tokens to dataset['tokenized_text']
dataset['lemmatized_text'] = dataset['tokenized_text'].apply(lemmatize_tokens)

# Show the tokenized_text and lemmatized_text columns
dataset[['tokenized_text', 'lemmatized_text']].head()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...


|   | tokenized_text | lemmatized_text |
|---|---|---|
| 0 | [oh, hi, twilight, r, u, ponys, ponyville] | [oh, hi, twilight, r, u, pony, ponyville] |
| 1 | [great, hear, twilight, filly, interesting, al... | [great, hear, twilight, filly, interesting, al... |
| 2 | [cadance, dead, twilights, alicorn, show, char... | [cadance, dead, twilight, alicorn, show, chara... |
| 3 | [u, ok, twilight, looks, conserned] | [u, ok, twilight, look, conserned] |
| 4 | [oh, hi, twilight, r, u] | [oh, hi, twilight, r, u] |


## 7. Handle Emoji

In [11]:
import re

# Convert a list of tokens back into a string
def list_to_string(token_list):
    return " ".join(token_list)

# Function that removes emoji
def remove_emojis(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F700-\U0001F77F"  # alchemical symbols
                           u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                           u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                           u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                           u"\U0001FA00-\U0001FA6F"  # Chess Symbols
                           u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                           u"\U00002702-\U000027B0"  # Dingbats
                           u"\U000024C2-\U0001F251"  # Enclosed characters (broad range: also matches non-emoji such as CJK text)
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# Apply list_to_string to dataset['lemmatized_text']
dataset['text_as_string'] = dataset['lemmatized_text'].apply(list_to_string)

# Apply remove_emojis to dataset['text_as_string']
dataset['no_emojis_text'] = dataset['text_as_string'].apply(remove_emojis)

# Display a few rows to verify
dataset[['text_as_string', 'no_emojis_text']].head()

|   | text_as_string | no_emojis_text |
|---|---|---|
| 0 | oh hi twilight r u pony ponyville | oh hi twilight r u pony ponyville |
| 1 | great hear twilight filly interesting alicorn | great hear twilight filly interesting alicorn |
| 2 | cadance dead twilight alicorn show character d... | cadance dead twilight alicorn show character d... |
| 3 | u ok twilight look conserned | u ok twilight look conserned |
| 4 | oh hi twilight r u | oh hi twilight r u |
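
In the output above, `text_as_string` and `no_emojis_text` are identical. One likely reason is that the punctuation filter `[^\w\s]` from step 2 already removes emoji, since they are neither word characters nor whitespace. The sketch below checks this on a made-up string and shows one possible alternative: mapping emoji to words before cleaning so that their signal is kept. `EMOJI_TO_WORD`, `replace_emojis`, and `strip_punctuation` are hypothetical names, and the two-entry mapping is only an illustration.

```python
import re

# Hypothetical mapping; a real project would need a much larger table.
EMOJI_TO_WORD = {"\U0001F600": "happy", "\U0001F622": "sad"}

def replace_emojis(text, mapping):
    """Replace each mapped emoji with a space-padded word."""
    for emoji, word in mapping.items():
        text = text.replace(emoji, f" {word} ")
    return text

def strip_punctuation(text):
    """Same punctuation and whitespace rules as clean_text in step 2."""
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "twilight is an alicorn now \U0001F600"  # made-up example

# The emoji is already gone after the punctuation filter alone.
print(strip_punctuation(sample))
# -> twilight is an alicorn now

# Mapping first keeps the emoji's meaning as a plain word.
print(strip_punctuation(replace_emojis(sample, EMOJI_TO_WORD)))
# -> twilight is an alicorn now happy
```

If emoji are to be handled separately at all, that step would need to run before `clean_text` rather than after lemmatization.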


## 8. Handle Slang Words

In [12]:
# Define the slang dictionary
slang_dict = {
    "u": "you",
    "r": "are",
    "gonna": "going to",
    "lol": "laughing out loud",
    "idk": "I don't know",
    "brb": "be right back",
    "ttyl": "talk to you later",
    "btw": "by the way",
    "omg": "oh my god",
    "nvm": "never mind",
    "smh": "shaking my head",
    "tbh": "to be honest",
    "ikr": "I know right",
    "bff": "best friends forever",
    "afk": "away from keyboard",
    "lmao": "laughing my ass off",
    "rofl": "rolling on the floor laughing",
    "imo": "in my opinion",
    "imho": "in my humble opinion",
    "thx": "thanks",
    "pls": "please",
    "k": "okay",
    "ye": "yes",
    "nah": "no",
    "cya": "see you",
    "luv": "love",
    "b4": "before",
    "h8": "hate",
    "gr8": "great",
    "w8": "wait",
    "yolo": "you only live once",
    "wtf": "what the fuck",
    "wyd": "what are you doing",
    "wya": "where are you at",
    "asap": "as soon as possible",
    "fyi": "for your information",
    "jk": "just kidding",
    "ppl": "people",
    "msg": "message",
    "bday": "birthday",
    "hmu": "hit me up",
    "bae": "before anyone else",
    "fam": "family or close friends",
    "lit": "amazing or fun",
    "dope": "cool",
    "noob": "beginner or newbie",
    "gg": "good game",
    "ty": "thank you",
    "dm": "direct message",
    "rn": "right now",
    "tho": "though",
    "idc": "I don't care",
    "ily": "I love you",
    # Add more slang entries as needed
}

# Function that replaces slang words using the dictionary defined above
def replace_slang(text, slang_dict):
    words = text.split()
    # Fall back to the original word when it is not in the dictionary
    replaced_words = [slang_dict.get(word, word) for word in words]
    return " ".join(replaced_words)

# Apply replace_slang to the 'no_emojis_text' column
dataset['no_slang_text'] = dataset['no_emojis_text'].apply(lambda x: replace_slang(x, slang_dict))

# Show the 'no_emojis_text' and 'no_slang_text' columns
dataset[['no_emojis_text', 'no_slang_text']].head()

|   | no_emojis_text | no_slang_text |
|---|---|---|
| 0 | oh hi twilight r u pony ponyville | oh hi twilight are you pony ponyville |
| 1 | great hear twilight filly interesting alicorn | great hear twilight filly interesting alicorn |
| 2 | cadance dead twilight alicorn show character d... | cadance dead twilight alicorn show character d... |
| 3 | u ok twilight look conserned | you ok twilight look conserned |
| 4 | oh hi twilight r u | oh hi twilight are you |
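
Because slang replacement runs after stopword removal, the expansions put stopwords such as `are` and `you` back into the text, as rows 0, 3, and 4 above show. A minimal sketch comparing the two orderings follows. The small `stop_words` set is a stand-in for the NLTK list, the dictionary is a two-entry subset of `slang_dict`, and `remove_stopwords` takes the stopword set as an explicit argument here only to keep the example self-contained.

```python
# Two-entry subset of slang_dict, for illustration only.
slang_subset = {"u": "you", "r": "are"}

# Tiny stand-in for the NLTK English stopword list (assumption).
stop_words = {"how", "are", "you"}

def replace_slang(text, slang_dict):
    """Replace each whitespace-separated word found in slang_dict."""
    return " ".join(slang_dict.get(word, word) for word in text.split())

def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated word found in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

text = "oh hi twilight how r u"  # row 4 of normalized_text

# Order used in the notebook: stopwords first, slang second.
notebook_order = replace_slang(remove_stopwords(text, stop_words), slang_subset)
print(notebook_order)  # oh hi twilight are you

# Alternative order: expand slang first, then remove stopwords.
slang_first = remove_stopwords(replace_slang(text, slang_subset), stop_words)
print(slang_first)     # oh hi twilight
```

The same ordering concern applies to entries such as `"idk": "I don't know"`, whose expansion contains an uppercase letter and an apostrophe even though case normalization and punctuation removal have already run.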


## 9. Exporting the Dataset

In [13]:
# Overwrite 'full_text' with the fully processed text from 'no_slang_text'
dataset['full_text'] = dataset['no_slang_text']

# Keep only the original columns; the intermediate columns are dropped
dataset = dataset[['conversation_id_str', 'created_at', 'favorite_count', 'full_text', 'id_str', 'image_url', 'in_reply_to_screen_name', 'lang', 'location', 'quote_count', 'reply_count', 'retweet_count', 'tweet_url', 'user_id_str', 'username']]

# Export the processed dataset to a CSV file
output_file_path = 'processed_tweets.csv'
dataset.to_csv(output_file_path, index=False)

print(f"Dataset exported to: {output_file_path}")


Dataset exported to: processed_tweets.csv
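
Tweets that consisted only of mentions, links, or stopwords end up as empty strings after processing, and `pd.read_csv` reads an empty CSV field back as a missing value by default. A minimal sketch of filtering such rows before export follows. It uses only the standard library `csv` module on made-up rows, and `drop_empty_text` is a hypothetical helper.

```python
import csv
import io

# Made-up rows standing in for the processed dataset.
rows = [
    {"id_str": "1", "full_text": "oh hi twilight are you"},
    {"id_str": "2", "full_text": ""},  # e.g. a tweet that contained only mentions
]

def drop_empty_text(rows, column="full_text"):
    """Keep only rows whose text column is non-empty after stripping."""
    return [row for row in rows if row[column].strip()]

kept = drop_empty_text(rows)

# Write to an in-memory buffer instead of a file to keep the sketch side-effect free.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id_str", "full_text"])
writer.writeheader()
writer.writerows(kept)

print(f"kept {len(kept)} of {len(rows)} rows")
print(buffer.getvalue())
```

In the notebook itself, an equivalent filter would be a boolean mask such as `dataset[dataset['full_text'].str.strip() != '']` applied before `to_csv`.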
