<a href="https://colab.research.google.com/github/Abyan12/Hybrid-Learning-Personality/blob/main/cleaning_data_ML_Personality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [None]:
try:
    nltk.download('punkt')
    nltk.download('stopwords')
    nltk.download('wordnet')
    print("NLTK data downloaded successfully.")
except Exception as e:
    print(f"Failed to download NLTK data: {e}")


NLTK data downloaded successfully.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
file_path = 'reddit_personality_10k.csv'
try:
    df = pd.read_csv(file_path)
    print("\nDataset loaded successfully. Here are the first 5 rows:")
    print(df.head())
except FileNotFoundError:
    print(f"\nERROR: File '{file_path}' not found. Make sure you have uploaded it to Google Colab.")
    # Re-raise so that later cells do not run without the data
    raise


Dataset loaded successfully. Here are the first 5 rows:
   subreddit                                              title  \
0  introvert                   Anyone else hybernate in summer?   
1  introvert  Why I started treating social energy like a fi...   
2  introvert           Do People Actually Like It When we Talk?   
3  introvert          Anyone else find they don’t need friends?   
4  introvert  Are there people who love going alone to conce...   

                                             content      label  \
0  I honestly do not see the appeal of summer. I ...  introvert   
1  After months of crashing from back-to-back mee...  introvert   
2  It doesn't always feel like praise whenever an...  introvert   
3  Sorry, mistake in the title. Had to repost. \n...  introvert   
4  I see many people being scared of going alone ...  introvert   

           created_utc  
0  2025-07-18 13:36:27  
1  2025-07-18 14:01:34  
2  2025-07-18 17:45:56  
3  2025-07-18 16:33:28  
4  2025-07-18 10:5

In [None]:
# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated
df['content'] = df['content'].fillna('')
df['full_text'] = df['title'] + ' ' + df['content']
print("\nColumns 'title' and 'content' were merged into 'full_text'.")


Columns 'title' and 'content' were merged into 'full_text'.
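The merge step relies on missing values being replaced before the string concatenation, because `NaN + ' ' + text` yields `NaN`. A minimal sketch of the assignment-based pattern, on a hypothetical toy frame rather than the real dataset, is given below. It also fills missing titles, which is an assumption: the notebook only fills `content`.

```python
import pandas as pd

# Toy frame standing in for the Reddit data (hypothetical values).
toy = pd.DataFrame({
    'title': ['First post', None],
    'content': [None, 'Body text'],
})

# Assign the result back instead of calling fillna(..., inplace=True) on a column.
toy[['title', 'content']] = toy[['title', 'content']].fillna('')

# Strip the stray space left behind when one side is empty.
toy['full_text'] = (toy['title'] + ' ' + toy['content']).str.strip()
print(toy['full_text'].tolist())
```

Whether titles can actually be missing in `reddit_personality_10k.csv` is not shown in the output above, so the extra fill may be unnecessary for this dataset.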


In [None]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """
    Clean a text with the following steps:
    1. Remove URLs
    2. Remove user/subreddit mentions
    3. Remove punctuation and digits
    4. Convert to lowercase
    5. Tokenize (split the text into words)
    6. Remove stopwords
    7. Lemmatize (reduce words to their base form)
    """
    if not isinstance(text, str):
        return ""

    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove user and subreddit mentions
    text = re.sub(r'\@\w+|\/r\/\w+|\/u\/\w+', '', text)
    # Remove every character except letters and whitespace.
    # Note: apostrophes are stripped too, so "don't" becomes "dont",
    # which no longer matches the stopword list.
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords and single-character tokens, then lemmatize
    cleaned_tokens = [
        lemmatizer.lemmatize(word) for word in tokens if word not in stop_words and len(word) > 1
    ]
    # Join the tokens back into a single string
    return ' '.join(cleaned_tokens)

print("\nPreprocessing function is ready.")


Preprocessing function is ready.
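The regex steps of `preprocess_text` can be tried in isolation. The sketch below copies the three patterns into a hypothetical helper, `strip_noise`, which is not part of the notebook and skips the NLTK steps (tokenization, stopword removal, lemmatization) so that it runs without any downloads. The sample strings are made up for illustration.

```python
import re

def strip_noise(text):
    """Apply only the regex and lowercase steps of preprocess_text."""
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove @user, /r/subreddit and /u/user mentions
    text = re.sub(r'\@\w+|\/r\/\w+|\/u\/\w+', '', text)
    # Keep letters and whitespace only
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Lowercase and collapse repeated whitespace
    return ' '.join(text.lower().split())

sample = "Check https://example.com and /r/introvert, I don't know @someone 123!"
print(strip_noise(sample))             # check and i dont know
print(strip_noise("see r/introvert"))  # see rintrovert
```

The second call shows a limitation of the mention pattern: a subreddit written without the leading slash is not removed, and once the slash is stripped the two parts fuse into one token. If that matters for the data, the pattern could be loosened, at the cost of changing the cleaned output.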


In [None]:
try:
    nltk.download('punkt_tab')
    print("NLTK punkt_tab data downloaded successfully.")
except Exception as e:
    print(f"Failed to download NLTK punkt_tab data: {e}")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


NLTK punkt_tab data downloaded successfully.


In [None]:
print("Starting text cleaning...")
df['cleaned_text'] = df['full_text'].apply(preprocess_text)
print("Text cleaning finished!")

Starting text cleaning...
Text cleaning finished!


In [None]:
print("\nHere is a comparison of the original text and the cleaned text:")
print(df[['full_text', 'cleaned_text']].head())


Here is a comparison of the original text and the cleaned text:
                                           full_text  \
0  Anyone else hybernate in summer? I honestly do...   
1  Why I started treating social energy like a fi...   
2  Do People Actually Like It When we Talk? It do...   
3  Anyone else find they don’t need friends? Sorr...   
4  Are there people who love going alone to conce...   

                                        cleaned_text  
0  anyone else hybernate summer honestly see appe...  
1  started treating social energy like finite res...  
2  people actually like talk doesnt always feel l...  
3  anyone else find dont need friend sorry mistak...  
4  people love going alone concert see many peopl...  


In [None]:
cleaned_file_path = 'cleaned_reddit_personality.csv'
# Select the relevant columns to save
df_to_save = df[['subreddit', 'label', 'created_utc', 'full_text', 'cleaned_text']]
df_to_save.to_csv(cleaned_file_path, index=False)

print(f"\nCleaned data saved to file '{cleaned_file_path}'.")


Cleaned data saved to file 'cleaned_reddit_personality.csv'.
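One thing to watch when the saved file is loaded again: a post that consists only of stopwords, digits, or punctuation may end up with an empty `cleaned_text`, and `pd.read_csv` reads empty fields back as `NaN` by default. The sketch below illustrates this round trip on a hypothetical two-row frame written to an in-memory buffer instead of the real CSV.

```python
import io
import pandas as pd

# Toy frame (hypothetical values); the second row is empty after cleaning.
toy = pd.DataFrame({
    'label': ['introvert', 'extrovert'],
    'cleaned_text': ['love going alone', ''],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing turns the empty field into NaN.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded['cleaned_text'].isna().tolist())

# keep_default_na=False keeps it as an empty string instead.
buffer.seek(0)
reloaded_str = pd.read_csv(buffer, keep_default_na=False)
print(reloaded_str['cleaned_text'].tolist())
```

Depending on the downstream model, either dropping such rows before saving or reloading with `keep_default_na=False` would avoid passing `NaN` to a vectorizer. Note that `keep_default_na=False` applies to every column, so it is only a reasonable choice when no column relies on the default missing-value markers.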


In [None]:
try:
    from google.colab import files
    print("You can download the cleaned file by running the command below in a new cell:")
    print(f"files.download('{cleaned_file_path}')")
except ImportError:
    print("\nDone.")

You can download the cleaned file by running the command below in a new cell:
files.download('cleaned_reddit_personality.csv')
